Identifying statistically significant patterns in gene expression data
aa r X i v : . [ q - b i o . Q M ] J un Identifying statistically significant patterns in gene expression data
Patrick E. McSharry and Edmund J. Crampin , ∗ Oxford Centre for Industrial and Applied Mathematics, Mathematical Institute, Oxford, OX1 3LB, UKand Department of Engineering Science, University of Oxford, Oxford, OX1 3PJ, UK, and Centre forMathematical Biology, Mathematical Institute, Oxford, OX1 3LB, UK, University Laboratory ofPhysiology, University of Oxford, Oxford, OX1 3PT. ∗ To whom correspondence should be addressed.Both authors contributed equally to this work.Manuscript Date: 9th August 2002
Running head:
Statistical significance in gene expression data
Abstract
Motivation
Clustering techniques are routinely applied to iden-tify patterns of co-expression in gene expression data.Co-regulation, and involvement of genes in similarcellular function, is subsequently inferred from theclusters which are obtained. Increasingly sophisti-cated algorithms have been applied to microarraydata, however, less attention has been given to thestatistical significance of the results of clustering stud-ies. We present a technique for the analysis of com-monly used hierarchical linkage-based clustering calledSignificance Analysis of Linkage Trees (SALT).
Results
The statistical significance of pairwise similarity lev-els between gene expression profiles, a measure ofco-expression, is established using a surrogate dataanalysis method. We find that a modified versionof the standard linkage technique, complete-linkage ,must be used to generate hierarchical linkage treeswith the appropriate properties. The approach is il-lustrated using synthetic data generated from a novelmodel of gene expression profiles and is then appliedto previously analysed microarray data on the tran-scriptional response of human fibroblasts to serumstimulation.
Availability
A set of MATLAB functions are available on request.
Contact [email protected]
The ability to measure expression levels of multi-ple genes simultaneously promises insights into theregulation of gene expression under both normal andpathological conditions. Gene expression data is nowroutinely collected using oligonucleotide and cDNAmicroarray technologies (Fodor et al., 1993; Pease et al.,1994; Schena et al., 1996). Microarray experimentsinvolve many separate steps in the preparation ofsamples, of the arrays themselves and in the subse-quent image acquisition and analysis (Hauser et al.,1998). However, as the reliability of the data im-proves, there is an increasing need for tools for dataanalysis and interpretation. While increasingly so-phisticated clustering techniques are being applied tomicroarray data, the statistical significance of clus-tering results has yet to be fully explored.A wide variety of statistical techniques have beenused to investigate these gene expression profiles, in-cluding principal component analysis (Holter et al.,2000; Alter et al., 2000; Holter et al., 2001), corre-spondence analysis (Fellenberg et al., 2001), neural1etworks (Herrero et al., 2001) and the constructionof statistical models (Zhao et al., 2001; Ramoni et al.,2002). A common starting point for the analysis ofmicroarray data is to use a clustering technique togroup together genes with similar expression profiles(Wen et al., 1998; Eisen et al., 1998; Yeung and Ruzzo,2001). Genes exhibiting similar patterns suggest co-regulation of gene expression, and co-expressed genesmay be involved in similar functions within the cell.The ultimate goal of such studies is to be able topredict the underlying gene networks giving rise tothe gene expression data (Yeung et al., 2002).In this paper we aim to determine the signifi-cance of the similarity between gene expression pro-files. We deal with data collected over several timeintervals (time series data), although the techniqueapplies equally for gene expression data recordedover multiple separate experiments. In particular,we show how the significance of the number of clus-ters emerging from linkage analysis techniques canbe assessed. Linkage analysis, a simple and widelyused clustering technique, performs clustering by sort-ing the gene expression profiles according to pairwisesimilarity. A distance metric is used to quantify thesimilarity between genes, where the closer two genesare in distance the more similar are their expres-sion patterns, i.e. the more likely they are to be co-expressed. The ordered gene expression profiles canbe represented graphically using a tree (called a den-drogram) where the position of the branch connect-ing two genes reflects the similarity of their expres-sion profiles. Sorting the genes in this way results insimilar expression profiles being grouped together.Currently many investigators identify clusters of in-terest from the tree by both visual inspection and a priori knowledge. The tree identifies many differ-ent possible numbers of clusters, ranging from oneextreme where there are as many clusters as genes(one per cluster) to the other extreme, where onecluster contains all genes in the data set. Cuttingthe tree at a particular distance value establishes anumber of clusters into which the expression profilesare grouped.The major difficulties in the analysis of microar-ray data, as for most data sets, arise in discriminat-ing between signal and noise. Given limited data,there will always be a nonzero probability of in- correctly identifying noise as genuine co-expression.For microarray data the situation is exacerbated bythe generally low number of observations made onlarge numbers of variables (genes). Any statisticaltest applied to the data must indicate an accept-able level of such ‘false positives’ arising. Statisticaltests rely on assumptions made about what consti-tutes noise in the data set, against which a null hy-pothesis is tested. The nominal level of the statisti-cal test will only be meaningful if the assumptionsabout the noise are correct, and the null hypoth-esis relevant to the data set. We use the methodof surrogate data (Theiler et al., 1992; Smith, 1992)to determine a threshold on the tree at which wecan reject, with a given confidence level, the null hy-pothesis that observed similarity values could haveoccurred by chance. This threshold can be used todetermine which clusters are significant.We demonstrate the approach (Significance Anal-ysis of Linkage Trees, SALT) using synthetic gene ex-pression data generated using a simple model, andsubsequently apply the surrogate analysis techniqueto a publicly available gene expression data set whichanalyses the response of human fibroblasts to serum(Iyer et al., 1999).
Gene expression data is commonly expressed as thelogarithm of the ratio of an observed signal to the ini-tial or other reference expression level. Hence valuesare initially zero, and a positive value indicates up-regulation, whereas negative values represent down-regulation of gene expression, and up- and down-regulation are given equal numerical importance.
Synthetic Gene Expression Data
To illustrate how linkage clustering performs on adata set containing predetermined patterns of co-expression of genes, where gene clusters are known,we generated synthetic gene expression data. Sixtime-dependent response functions (labelled I to VI)were used to simulate early and late response, andup- and down-regulation expression patterns (Fig.1). We used a Gaussian function f g ( t, τ, c ) =exp[ − ( t − τ r − τc ) ] to simulate genes switching on and2 i (t) IIIIIIIVVVI Figure 1: Examples of the six response functionsused for generating synthetic expression data (withno observational noise added).off transiently during the experiment (response func-tions I through IV in Fig. 1) and a sigmoid function f s ( t ) = [1 + tanh( t − τ r − τc )] to replicate genes re-sponding slowly during the experiment, to reach anew threshold expression level (response functionsV and VI in Fig. 1). Here τ r is a time delay for eachof the six responses, and τ and c are gene-specificdelays and timescales.A time series of T = 13 points was used to mimicthe change in gene expression data over a 24hr pe-riod. Our synthetic data set contained N = 120genes: 20 genes for each of the six responses. Co-expressed genes may respond after different delaysand over different timescales, and will show differ-ent amplitudes of expression. We incorporated suchvariation into the amplitudes of the gene expres-sion profiles and the response times by represent-ing the expression level for gene i through x i ( t ) = A i f ( t, τ i , c i ) + η i ( t ), where f stands for either up- ordown-regulation with response function f g or f s , asdescribed in Table 1. For each gene i , the variationin the amplitude A i , time delay τ i and timescale c i are sampled from uniform distributions with A i ∈ [0 . , . τ i ∈ [ − ,
1] and c i ∈ [2 . , . η i ( t ) with zero mean and standard de-viation 0.05. Table 1: Parameters for generating the syntheticdata set.Response r I II III IV V VI f + f g − f g + f g − f g + f s − f s τ r Fibroblast Data Set
We used the published data set of Iyer et al. (1999)for the response of human fibroblasts to serum fol-lowing serum starvation. Data was collected at 12times over a 24hr period for around 8,600 distincthuman transcripts. A further data point is includedfor exponentially growing cells (“unsynch”) to givea series of T Similarity Measure
In order to cluster a data set containing gene ex-pression profiles using a linkage algorithm, a math-ematical definition of the similarity between expres-sions is required. Two genes which are co-expressedare likely to be similar in shape, but not necessar-ily in magnitude and for this reason the correlationcoefficient is a suitable similarity metric. Follow-ing Eisen et al. (1998), denoting the logarithm ofthe expression ratio for gene i at time k by x ik ,the similarity between genes i and j is quantified by ρ ij = T P Tk =1 ( x ik σ i )( x jk σ j ), where σ i = q T P Tk =1 x ik and T is the number of observations. Note that ifthe means were subtracted from each of the expres-sion profiles then σ i would be the standard deviationand ρ ij the correlation coefficient (Chatfield, 1989).By explicitly setting the mean to zero (Eisen et al.,1998), corresponding to an expression ratio of one,we are selecting a reference state against which sub-sequent changes are contrasted. Values of ρ ij canvary from 1 (completely correlated) to − ρ ij = 0 implies that3he two genes are uncorrelated.The similarity measure ρ ij takes negative val-ues for anti-correlated data and so cannot be usedas a distance between two observations. A distancemeasure d ij can be calculated from ρ ij using d ij = p − ρ ij ) which fulfils the conditions required fora distance metric: namely (i) d ij = 0 if and only if i = j , i.e. the genes have the same expression pro-files, (ii) d ij = d ji and (iii) d ij ≤ d ik + d kj (Mantegna,1999). The N ( N − / d ij ( i =1 , . . . , N , j = i + 1 , . . . , N ) can be used to determinea tree connecting N genes using a graph consistingof N − Clustering by Complete-Linkage
Linkage algorithms iteratively combine the N genesinto clusters. A measure of affinity between clus-ters is used to decide the order in which clusters arecombined at any given step. Starting from N clus-ters, each consisting of a single gene, the two clus-ters with the highest affinity are combined into anew cluster. This linkage is marked on the tree by aconnection between the clusters at a linkage distanceequal to the affinity value. This process is repeateduntil there is only one large cluster containing all N genes. If the affinity between clusters is chosento be the distance between the closest pair of genesthen the method is known as single-linkage. The treeconstructed with single-linkage is called a minimumspanning tree. This has the disadvantage that thelinkage distance does not place a bound on how dis-similar genes within the same cluster may be. This isbecause the distances d ij for each of the other pair-ings of genes between the clusters will be larger thanthe linkage distance, δ ij , by construction. Choosingthe distance between the averages of the clusters asthe measure of affinity is known as average-linkage,and is a choice commonly used in microarray dataanalysis (Eisen et al., 1998). Similarly, for average-linkage δ ij does not place an upper bound on thepairwise distances within clusters. We suggest that the appropriate linkage tech-nique to choose is complete-linkage, where the max-imal pairwise distance between clusters δ ij is usedto determine the tree. At each step the two clus-ters with the smallest maximal pairwise distance arecombined. Thus the tree contains maximal distancesbetween clusters, and hence is appropriate for ananalysis of statistical significance. In particular, forthe complete-linkage algorithm all genes clusteredbelow a threshold distance d ij = d thrs must be sep-arated by distances satisfying d ij ≤ d thrs . If we cutthe tree at d thrs and ignore all linkages above d thrs then we are certain not to neglect any gene pairingswhich are separated by a distance d ij ≤ d thrs . Thisproperty of the complete-linkage algorithm will beemployed to obtain significant clusters.It still remains to be decided at which distanceto cut the tree, to determine the number of clusterswhich with high probability represent co-expressionof their constituent genes. To obtain the distributionof distances that one would expect to find for geneswhich are not being co-expressed, we have generatedsurrogate data (Theiler et al., 1992; Smith, 1992). Surrogate Gene Profiles
The SALT technique determines how small the dis-tance between two genes must be before they are in-ferred to be co-expressed. This is achieved by testingagainst a particular null hypothesis: that a particu-lar value of the distance arises by chance from twogenes which are not co-expressed. We test this nullhypothesis using the distribution of distances whichresults from genes that are not co-expressed. Sincewe do not have an analytical description for this dis-tribution, we estimate it using surrogates gene ex-pression profiles, generated by sampling from theoriginal data set. A significance level must be speci-fied at which the null hypothesis is tested. For exam-ple, if we allow a 5% chance that the null hypothesisis falsely rejected, then the test is valid at the 95%level. Testing at the 95% significance level meansthat we reject the null hypothesis for any distancein the tree less than a threshold distance d thrs cor-responding to the 5th percentile of the distributionobtained from the surrogates. Each pair of surrogategene expression profiles provides one sample of the4 G ene e x p r e ss i on 0 h r m i n m i n h r h r h r h r h r h r h r h r h r un sy n c h Time−3−2−10123
Figure 2: Probability density function (PDF) of thefibroblast gene expression data set.distribution. A large number of surrogates (10000)was used to resolve the tail of the distribution, givinga robust estimate of the 5th percentile.We generate surrogate expression profiles whichpreserve some aspects of the original data but whichare also consistent with the null hypothesis. Appro-priate surrogates should reflect obvious properties ofthe data, in particular that the gene expression pro-files vary smoothly with time. An analysis of theprobability density function (PDF) of the gene ex-pression profiles for the fibroblast data set (the prob-ability that, at a given time point, a gene selectedat random from the data set will have a given ex-pression level) shows that there is a time-dependenttrend running through the data (Fig. 2). This sug-gests that surrogate data sets based on permutationswhich shuffle the temporal information are likely todestroy important correlations which are due to thetime series nature of the experiment. To preservethe temporal continuity we have constructed surro-gate gene expression profiles by sampling without re-placement from the original data at each time pointindependently. In this way the PDF of the gene ex-pression at different time points (Fig. 2) is preservedin the surrogates.We can then reject the null hypothesis for dis-tances calculated from the original data for which d ij < d thrs with confidence 95%. The constructionof the tree using complete-linkage algorithm ensuresthat for clusters below d thrs , the distance betweenevery gene pair within a cluster is statistically sig-nificant. Note that this is not the case for the otherlinkage algorithms described above. We applied complete-linkage clustering to both thesynthetic and fibroblast data sets. Surrogates weregenerated to find d thrs and this threshold was usedto determine statistically significant clusters. Wefound that 10000 surrogates was sufficient to producerobust results when testing at the 95% significancelevel. Synthetic data
The distances corresponding to the synthetic dataset (Fig. 3) fall into distinct groups because of theclearly defined response profiles underlying the data(Fig. 1). The tree (left) shows the hierarchical or-ganisation of the linkage distances, indicating thatthe correct number of clusters can be identified bycutting the tree at a threshold distance in the range0 . < d ij < . d thrs = 0 .
963 corresponding tothe 95% significance level. This threshold correctlyidentifies the clusters, assigning the 20 genes for eachresponse function to the corresponding cluster. Thecorrespondence between the original responses (Fig.1) and the clusters (Fig. 3) is given by: I (a),II (b), III (d), IV (c), V (e) and VI (f). Note that the expression profiles are clusteredtogether, even though their magnitudes are different,because they have the same shape, as quantified bythe choice of distance metric based on the correlationcoefficient.We expect that significantly higher noise levelswill obscure the original pattern of gene expression.Increasing the observational noise in the syntheticdata set was found to increase the linkage distancesused to construct the tree, but not to have a stronginfluence on the threshold distance (data not shown).Many more of the distances in the data set are foundto be consistent with the null hypothesis, and there-5 δ ij −1012x i (a)−2−101x i (b)0 6 12 18 24−2−101 timex i (c) −1012 (d)−1012 (e)0 6 12 18 24−2−101 time(f) Figure 3: Analysis of the synthetic data set showing the hierarchical tree (left) constructed using complete-linkage, with distances δ ij from 2 (no co-expression) to 0 (co-expression). The clusters recovered by cuttingthe tree at significance threshold d thrs = 0 .
963 (indicated on the tree by the dashed line) are shown on theright. The darker trace indicates the cluster average expression profile.fore not considered significant for clustering. In thiscase, a larger number of clusters than the six origi-nal responses is obtained by cutting the tree at thethreshold distance, however, the members of eachcluster were still found to correspond to only one ofthe six response functions.
Fibroblast Data Results
We applied the same surrogate data analysis tech-nique to the 517 gene expression profiles in the pub-lished data set corresponding to fibroblast transcrip-tional response to serum stimulation. The publishedclustering for this data set found clusters correspond-ing to different aspects of the physiological responseof fibroblasts to wound healing (serum stimulation).Iyer et al. used hierarchical clustering (using average-linkage, Eisen et al., 1998) from which they obtainedclusters by visual inspection of the ordered tree, re-porting cluster sizes of 142, 100, 60, 40, 32 and 31 genes for the six largest clusters.We used surrogates to obtain a threshold dis-tance of d thrs = 0 .
95 corresponding to the 95% sig-nificance level, which was used to cut the tree. Thefifteen largest clusters obtained, those containing 10or more genes, are shown in Fig. 4, along with thered-green display (Eisen et al., 1998) of the entireclustered data set, ordered by increasing mean ex-pression value.In total 47 statistically significant clusters werefound, including 30 containing more than two genes.The maximum cluster size found to be of statisti-cal significance by this technique is 74, and the sixlargest clusters are of size 74 (b), 40 (d), 39 (a), 30(j), 29 (l) and 23 (c,f,g), suggesting that some of thecorrelations within the larger clusters reported pre-viously may not be statistically significant.6 h r m i n m i n h r h r h r h r h r h r h r h r h r un sy n c h −3 −2 −1 0 1 2 3abcdefghijklmno −3−2−101 a (39)−3−2−101 b (74)−3−2−101 c (23)−3−2−101 d (40)−4−20 e (20)−202 f (23)−4−202 g (23)−2024 h (12) −202 i (12)024 j (30)−2024 k (18)−2024 l (29)−20246 m (21)024 n (20)024 o (17) Figure 4: Analysis of the fibroblast gene expression data. The red-green display shows the clustered dataset sorted by increasing mean expression level. The scale is log-expression ratio − d thrs = 0 .
95 correspond to the locations indicated by coloured bars left of the red-green display(cluster sizes are shown in brackets). 7
Discussion
The measurement of gene expression data using DNAmicroarray techniques may provide a deeper under-standing of many of the complicated processes un-derlying biological systems. To investigate the largedata sets resulting from microarray experiments wehave presented a new technique (SALT). This tech-nique is straightforward and can be easily imple-mented in existing computer analysis packages.SALT uses surrogate data analysis to identifyclusters on the hierarchical tree generated using thecomplete-linkage algorithm, at a prescribed signif-icance level. We have used surrogate data whichpreserves the temporal variation of the gene expres-sion profiles. This is also the appropriate methodfor data collected for different experimental condi-tions or from different tissue preparations, in orderto preserve characteristics of the data within the dif-ferent experiments. A better understanding of thedynamical processes underlying the gene expressiondata and, in particular, the sources of measurementerror, would allow surrogates to be constructed us-ing better models of the properties of independentgene expression data.The application of the technique to the publishedfibroblast data set found significant clusters contain-ing fewer genes than previous analyses (Ramoni et al.,2002; Iyer et al., 1999). One explanation is that pre-vious analyses employ similarity values between datapoints which are not statistically significant to obtaintheir clusters. Studies using alternative methods ofclustering in which the data is divided into a pre-determined number of clusters (K-means clustering,for example) also risk grouping together genes forwhich the expression profiles have distances abovethe significance threshold.Statistical significance testing using SALT couldalso be used for clustering other examples of largedata sets where one wishes to identify which ele-ments of a given set of profiles are (i) interactingwith each other or (ii) interacting in a similar wayin response to some external perturbation.
Acknowledgements
This research was supported by an Engineering andPhysical Sciences Research Council Grant GR/N02641to PEM and The Wellcome Trust (EJC).
References
O. Alter, P. O. Brown, and D. Botstein. Singular valuedecomposition for genome-wide expression data pro-cessing and modeling.
Proc. Natl. Acad. Sci. USA , 97:10101–10106, 2000.C. Chatfield.
The Analysis of Time Series . Chapman andHall, London, 4th edition, 1989.M. B. Eisen, P. T. Spellman, P. O. Brown, and D. Bot-stein. Cluster analysis and display of genome-wideexpression patterns.
Proc. Natl. Acad. Sci. USA , 95:14863–14868, 1998.K. Fellenberg, N. C. Hauser, B. Brors, A. Neutzner, J. D.Hoheisel, and M. Vingron. Correspondence analysis ap-plied to microarray data.
Proc. Natl. Acad. Sci. USA ,98:10781–10786, 2001.S. P. Fodor, R. P. Rava, X. C. Huang, A. C. Pease, C. P.Holmes, and C. L. Adams. Multiplexed biochemicalassays with biological chips.
Nature , 364:555–556, 1993.N. C. Hauser, M. Vingron, M. Scheideler, B. Krems,K. Hellmuth, K. D. Entian, and J. D. Hoheisel. Tran-scriptional profiling on all open reading frames of sac-charomyces cerevisiae.
Yeast , 14:1209–1221, 1998.J. Herrero, A. Valencia, and J. Dopazo. A hierarchical un-supervised growing neural network for clustering geneexpression patterns.
Bioinformatics , 17:126–136, 2001.N. S. Holter, M. Madhusmita, A. Maritan, M. Cieplak,and N. V. Banavar, J. R.and Fedoroff. Fundamentalpatterns underlying gene expression profiles: simplic-ity from complexity.
Proc. Natl. Acad. Sci. USA , 97:18409–8414, 2000.N. S. Holter, A. Maritan, M. Cieplak, N. V. Fedoroff, andJ. R. Banavar. Dynamic modeling of gene expressiondata.
Proc. Natl. Acad. Sci. USA , 98:1693–1698, 2001.V. R. Iyer, M. B. Eisen, D. T. Ross, G. Schuler, T. Moore,J. C. F. Lee, J. M. Trent, L. M. Staudt, J. Hudston Jr.,M. S. Boguski, D. Lashkari, D. Shalon, D. Botstein,and P. O. Brown. The transcriptional program in theresponse of human fibroblasts to serum.
Science , 283:83–87, 1999. . N. Mantegna. Hierarchical structure in financial mar-kets. Eur. Phys. J. B , 11:193–197, 1999.A. C. Pease, D. Solas, E. J. Sullivan, M. T. Cronin, C. P.Holmes, and S. P. Fodor. Light-generated nucleotidearrays for rapid DNA sequence analysis.
Proc. Natl.Acad. Sci. USA , 91:5022–5026, 1994.M. F. Ramoni, P. Sebastiani, and I. S. Kohane. Clusteranalysis of gene expression dynamics.
Proc. Natl. Acad.Sci. USA , 99:9121–9126, 2002.M. Schena, D. Shalon, R. Heller, A. Chai, P. O. Brown,and R. W. Davis. Parallel human genome analysis:microarray-based expression monitoring of 1000 genes.
Proc. Natl. Acad. Sci. USA , 93:10614–10619, 1996.L. A. Smith. Identification and prediction of low-dimensional dynamics.
Physica D , 58:50–76, 1992.J. Theiler, S. Eubank, A. Longtin, B. Galdrikan, and J. D.Farmer. Testing for nonlinearity in time series: themethod of surrogate data.
Physica D , 58:77, 1992.X. Wen, S. Fuhrman, G. S. Michaels, D. B. Carr,S. Smith, J. L. Barker, and R. Somogyi. Large-scaletemporal gene expression mapping of central nervoussystem development.
Proc. Natl. Acad. Sci. USA , 95:334–339, 1998.K. Y. Yeung and W. L. Ruzzo. Principle component anal-ysis for clustering gene expression data.
Bioinformat-ics , 17:763–774, 2001.M. K. S. Yeung, J. Tegn´er, and J. J. Collins. Reverseengineering gene networks using singular value decom-position and robust regression.
Proc. Natl. Acad. Sci.USA , 99:6163–6168, 2002.L. P. Zhao, R. Prentice, and L. Breeden. Statistical mod-eling of large microarray data sets to identify stimulus-response profiles.
Proc. Natl. Acad. Sci. USA , 98:5631–5636, 2001., 98:5631–5636, 2001.