Estimating the number of available states for normal and tumor tissues in gene expression space
EEstimating the number of available statesfor normal and tumor tissuesin gene expression space
Frank Quintela (1,2) and Augusto Gonzalez (3,2)(1) University of Modena, Italy(2) Institute of Cybernetics, Mathematics and Physics,Havana, Cuba(3) University of Electronic Science and Technology,Chengdu, People Republic of ChinaMay 6, 2020
Abstract
Gene expression data for a set of 12 localizations from The CancerGenome Atlas are processed in order to evaluate an entropy-like mag-nitude allowing the characterization of tumors and comparison withthe corresponding normal tissues. The comparison indicates that thenumber of available states in gene expression space is much greaterfor tumors than for normal tissues and points out to a scaling relationbetween the fraction of available states and the overlapping betweenthe tumor and normal sample clouds.
The extreme difficulties in treating cancer [1] reveal that the survival capa-bilities of cancer cells are much stronger than those of the somatic cells inour body. restricted by the conditions of homeostasis. The reason for such“advantages” is explained in the atavistic theory of cancer [2, 3, 4, 5, 6] as1 a r X i v : . [ q - b i o . T O ] M a y he result of a core genetic programme, which helped primitive multicellularorganisms to overcome the extreme conditions posed by the ancient earth.One aspect of these enhanced capabilities is related to tissue fitness. Can-cer cells are known to turn off the mechanism of fitness control in homeostasisand exhibit replication rates much higher than the stem cells in healthy tis-sues [7].In vivo measurements of fitness in normal and tumor tissues is a difficulttask. However, there is a way of looking at fitness which is related to thenumber of available states for a system in configuration space and may be thesubject of numerical computations. Indeed, for a tissue (or a small portion ofit) there should be a fitness landscape in gene expression space. Regions ofhigh fitness are characterized by their volumes which, in some sense, measurethe number of available states for the system.In the present paper, we aim at estimating the number of available statesfor normal and tumor tissues or, more precisely, the ratio of numbers forthe tumor and the corresponding normal tissue. To this end, we processgene expression data for 12 cancer localizations, coming from The CancerGenome Atlas (TCGA) portal [8], and introduce an entropy-like magnitudemeasuring the volume or the number of available states in gene expressionspace.The main result is that, as expected, the number of available states fortumors are much higher than for normal tissues. An homeostatic tissue hasmuch less possibilities of realization or much higher order than the primitivemulticellular tumor. Additional facts are discussed below. As mentioned, our starting point is the TCGA expression data for 12 tumorsand the corresponding normal tissues. The selected localizations are charac-terized by more than 20 normal and more than 300 tumor samples, as shownin Table 1.We perform a Principal Component Analysis [9, 10, 11] of the expressiondata. Methodological aspects are briefly explained in the Methods sectionbelow. In paper [12], we study the topology of gene expression space fornormal and tumor tissues. We sketch the main results of that paper that2able 1: The set of studied cancer localizations and the main results of thepaper. TCGA abbreviations are used.
Tissues Normal samples Tumor samples ∆ S S tumor − ln I BRCA 112 1096 17.92 89.16 62.10COAD 41 473 29.93 90.87 87.50HNSC 44 502 12.67 89.48 58.02KIRC 72 539 22.22 89.31 68.16KIRP 32 289 29.86 90.40 82.34LIHC 50 374 29.00 90.16 63.94LUAD 59 535 26.24 92.18 66.08LUSC 49 502 28.59 90.58 75.25PRAD 52 499 9.79 85.96 49.85STAD 32 375 19.82 93.10 64.17THCA 58 510 15.00 85.62 54.51UCEC 23 552 26.06 93.26 77.78shall be used in our computations:1. Although there are around 60000 genes, normal tissues and tumorsspan a region with reduced effective dimension.Then, we will use the first 20 principal components in order to describethe state of a sample in gene expression space (GES). These 20 componentscapture no less than 85 % of the total variance in the dispersion of experi-mental samples in GES.2. For a given tissue, normal samples are well separated from tumorsamples in GES. Both regions seem to be the basins of atraction of twosingular points: the normal homeostatic and the cancer attractors.Fig. 1 upper panel shows as example the (PC1, PC2) plane for LungSquamous Cell cancer (LUSC in TCGA notations). Points in the figurerepresent samples from different patients. The clouds of points are grouppedin well defined regions defining the attractors. We shall estimate the volumeof each region, which gives an indication of the number of accesible states.More precisely, for both normal tissues and tumors we shall introduce theentropy-like magnitude: S = − (cid:90) d D x ρ ( x ) ln ρ ( x ) , (1)3 C2 PC1Normal Tumor PC1-Fitness
Figure 1: Upper panel: PCA of gene expression data for Squamous Cell Lungcancer (LUSC). The first axis (PC1) discriminates between a normal sampleand a tumor. Lower panel: Schematics of the fitness landscape.4here D = 20 is the number of principal components to be used in thedescription of the system in GES, and ρ is the probability density, comingfrom a fit to the observed sample data.The relation between the S magnitude and the volume of the basin ofattraction is roughly S ≈ ln( V ol ), thus S measures the logarithm of thenumber of available states in the region.We fit the observed distribution of sample points to a multivariate gaus-sian density, ρ . This procedure guarantees a maximal entropy. Details arefound in the Methods chapter below.We show in Table 1 the magnitudes S tumor and ∆ S = S tumor − S normal for the set of tissues under study. The number of states in GES seems to bemuch larger for tumors than for normal tissues, leading to ∆ S >>
In the analysis of complex systems, it is usual to add a second magnitude, be-sides entropy, and construct a complexity map [14]. In our problem, we havealready estimated the volumes of the basins of attraction. We may intro-duce an additional magnitude characterizing the transition region betweenthe two attractors. In this way, we may compare the different tumors or,more precisely, tumor-normal tissue pairs according to their configurationsin GES.Let us define the density overlap: I = (cid:90) d D x (cid:112) ρ tumor ( x ) ρ normal ( x ) . (2)The magnitude I measures the overlapping between the clouds of normal andtumor samples. The square root is introduced for normalization purposes.Analytic expressions for I when ρ are gaussian distributions are provided inthe Methods chapter below.Table 1 and Fig. 2 show that the number of accesible states is verysimilar for all tumors, but shows higher variation in normal tissues. If we5ake this number as an indication of “structure”, one may think that thecolon (COAD) is more structured than the prostate (PRAD).With regard to the observed overlap between tumor and normal cloudsof points, a first aspect to be stressed is the distance between the centers ofthe clouds. In PRAD, THCA and HNSC the centers are much closer thanin COAD, KIRP and UCEC.From the overlap one may also infer properties of the fitness landscape.Indeed, one may assume that the observed density of samples at a givenpoint of GES is somehow proportional to the fitness. In Fig. 1 lower panelwe schematically represent the fitness distribution in LUSC. The curve isobtained simply from the histogram of samples. Notice that in the interme-diate region there are practically no samples. The curve is highly peaked atthe attractor points, and very shallow in the intermediate region. In PRAD,THCA and HNSC, on the other hand, there are many more samples in theintermediate region, meaning that the fitness is higher in this region. Sam-ples in the intermediate region are related in paper [12] to tumors in theinitial stages of progression.The numbers in Table 1 and Fig. 1 indicate also the apparent correlationbetween − ln I and ∆ S , i.e. − ln I = 1 .
36 ∆ S + 37 .
07 or I ∼ ( V ol n /V ol t ) . .The nature of this dependence is intriguing. The fact is that the larger theentropy difference (the ratio of basin volumes) the smaller the overlappingbetween the tumor and normal sample clouds. We do not have an interpre-tation for this fact. The normal homeostatic state shall be protected against transitions to thecancer state by a barrier. Otherwise the transitions are unavoidable becauseboth the fitness and the number of available states in the cancer region aremuch higher than in the normal region.It is natural to assume that the intermediate region holds a low-fitnessbarrier, as schematically represented in Fig. 1 lower panel for Lung SquamousCell Cancer (LUSC). Indeed, the normal homeostatic state is a state withregulated fitness [15]. In cancer, on the other hand, these constraints areremoved and tumor growth is only limited by the availability of space andnutrients. The intermediate region is a space for senescence or different kindsof illness, where fitness is reduced and the compensation mechanisms are notcapable of keeping homeostasis. In paper [12] it was shown that tumors in6igure 2: The entropy-overlapping map. Notice that tumors exhibit a nearlyconstant entropy, and that there is an exponential relationship between theoverlap I and the entropy variation ∆ S .the initial stages of progression transit through this region.In Fig. 1 lower panel the x axis, as in the upper panel, is PC1, whichwas identified as the cancer axis [12] separating the normal from the cancerstates. The y axis is the fitness (with a minus sign). The absolute maximumof fitness is cancer. The normal homeostatic state is a local metastablemaximum, which should be characterized by a mean decay time, τ H .Notice that with a rough estimation of the fitness landscape along thePC1 axis we could get, in principle, an estimation for τ H , and thus the riskof cancer in a tissue.The time for the reverse process to occur, τ C , that is from the tumor tothe normal state, is expected to be much larger than τ H . We could get arough value for it by using a detailed balance equation [16]:7 c = τ H N states ( C ) N states ( H ) = τ H exp(∆ S ) . (3)Taking τ H ≈
60 years we get for prostate tumors, for example, τ c ≈ years. For thyroid cancer, on the other hand, τ c ≈ × years.These are ficticious numbers, not related to any biological processes. Wecompute them with the only purpose of confirming that the progression tocancer is an almost irreversible process.However, it is a curious fact that the required times for early multicellularorganisms to evolve to modern metazoans are precisely hundreds of My [17].At the level of conglomerates of cells one can imagine evolution as jumpsagainst entropy, that is from states like C to states like H. These are highlyimprobable processes which, however, may be the source of further advan-tages. When one says that it may take 200 My to occur, it means that fromthe many conglomerates living in this time period a few of them could makethe transition and start a new line of evolution. We estimated the volumes of the basins of attraction in gene expressionspace for the normal and cancer regions in each of the 12 cancer localizationsdescribed in Table 1. These volumes, which are proportional to the number ofaccesible micro-states, are measured by means of a “configurational” entropy-like magnitude, constructed from the probability density of samples in thespace. The latter is obtained from a multivariate gaussian fit to the observeddistribution of samples.The results are mainly three: 1. The number of accesible states is muchhigher for tumors than for normal samples, 2. All studied tumor localizationshave roughly the same number of accesible states, and 3. The overlap betweenthe tumor and normal samples clouds of points is roughly proportional toexp( − ∆ S ).The reduced number of accesible states for the normal tissue can be inter-preted as a higher level of organization than the tumor. The higher variabilityof entropy in normal tisues, on the other hand, can be taken as a manifesta-tion of tissue differentiation and structure. However, the biological relevanceof the scaling between cloud overlapping and the entropy difference shouldbe further clarified. 8 Methods
The TCGA data of Table 1 is analyzed by means of the PCA technique. Thedetails of the PC analysis are described in paper [12]. Me briefly sketch themin the present section.Gene expression are given in FPKM format. The number of genes is60483. This is the dimension of matrices in the Principal Component analy-sis. Usually, in order to compute the average expression of a gene the medianor the geometric mean are used. We prefer geometric averages, but then thedata should be slightly distorted to avoid zeroes. To this end, we added aconstant 0.1 to the data. By applying this regularization procedure, genesidentified as relevant could be under question if the differential expression isrelatively low and their expression in normal tissues is near zero. As we aremainly interested in the strongly over- or under-expressed genes, they areout of the question.We take the mean geometric average over normal samples in order todefine the reference expression for each gene, e ref . Then the normalized ordifferential expression is defined as: e diff = e/e ref . The fold variation is de-fined in terms of the logarithm ˆ e = log ( e diff ). Besides reducing the variance,the logarithm allows treating over- and sub-expression in a symmetrical way.Deviations and variances are measured with respect to ˆ e = 0. Thatis, with respect to the average over normal samples. This election is quitenatural, because normal samples are the majority in a population.With these assumptions, the covariance matrix is written: σ ij = (cid:88) ˆ e i ( s )ˆ e j ( s ) / ( N samples − , (4)where the sum runs over the samples, s , and N samples is the total number ofsamples (normal plus tumor). ˆ e i ( s ) is the fold variation of gene i in sample s . As mentioned, the dimension of matrix is 60483. By diagonalizing it, weget the axes of maximal variance: the Principal Components (PCs). Theyare sorted in descending order of their contribution to the variance.In LUSC, for example, PC1 accounts for 67% of the variance. This largenumber is partly due to our choice of the reference, ˆ e = 0, and the fact thatmost of the samples are tumors. The reward is that PC1 may be definedas the cancer axis. The projection over PC1 defines whether a sample isclassified as normal or tumor. 9he next PCs account for a smaller fraction of the variance. PC2 isresponsible of 4%, PC3 of 3%, etc. Around 20 PCs are enough for an ap-proximate description of the region of the gene expression space occupied bythe set of samples.We want to compute only a small number of the eigenvalues and eigen-vectors of σ . To this end, we use a Lanczos routine in Python language, andrun it in a node with 2 processors, 12 cores and 64 GB of RAM memory. As aresult, we get the first 100 eigenvalues and their corresponding eigenvectors.For a sample, the projections over the PC vectors define the new coordi-nates. These are the starting data for the computation of the configurationalentropy. We organize it as 24 matrices M , each one corresponding to a tissuein a stage, i.e. M ( LIHC, tumor ). The number of columns in any case is 20(number of Principal Components) and the number of rows is the number ofsamples, as reported in Table 1.From M the sample covariance matrix, Σ, is defined asΣ jk = 1 N − N (cid:88) i =1 ( M ij − µ j )( M ik − µ k ) , (5)where µ j = N (cid:80) i =1 M ij is the mean value of coordinate j in the set ofsamples.In order to find probability distributions for the sets of normal and tu-mor samples we maximize the entropy taking the covariance matrices as con-straints. These are quadratic constraints, thus the result is a multivariategaussian [18]: ρ ( x ) = 1(2 π ) D (cid:112) | Σ | exp {−
12 ( x − µ ) T Σ − ( x − µ ) } . (6)Notice our convention for vectors, x . There are advantages in using thisprocedure. First, with normal distributions we may analytically computethe quantities of interest, second this distribution is, in accordance with theCentral Limit Theorem, an estimation of the actual distribution for muchlarger data sets, and third this distribution is, from the point of view ofinformation theory, the most unbiased one with regard to data covariance,that is no heuristic criteria have been used for chosing it.For our target quantities, the entropy and the overlap integral, we get: S = 12 ln | Σ | + D π ) , (7)10 = 2 D | Λ n | / | Λ t | / | Λ c | / exp { ( η Tc Λ − c η c − µ Tn Λ n µ n − µ Tt Λ t µ t ) / } , (8)where Λ j = Σ − j for j = n, t ; η c = Λ n µ n + Λ t µ t , and Λ c = Λ n + Λ t . Acknowledgements
A.G. acknowledges the Cuban Program for BasicSciences, the Office of External Activities of the Abdus Salam Centre forTheoretical Physics, and the University of Electronic Science and Technologyof China for support. The research is carried on under a project of thePlatform for Bio-informatics of BioCubaFarma, Cuba. The data for thepresent analysis come from the TCGA Research Network [8].