[PDF] Identifying statistically significant patterns in gene expression data

Abstract

Motivation: Clustering techniques are routinely applied to identify patterns of co-expression in gene expression data. Co-regulation, and involvement of genes in similar cellular function, is subsequently inferred from the clusters which are obtained. Increasingly sophisticated algorithms have been applied to microarray data, however, less attention has been given to the statistical significance of the results of clustering studies. We present a technique for the analysis of commonly used hierarchical linkage-based clustering called Significance Analysis of Linkage Trees (SALT). Results: The statistical significance of pairwise similarity levels between gene expression profiles, a measure of co-expression, is established using a surrogate data analysis method. We find that a modified version of the standard linkage technique, complete-linkage, must be used to generate hierarchical linkage trees with the appropriate properties. The approach is illustrated using synthetic data generated from a novel model of gene expression profiles and is then applied to previously analysed microarray data on the transcriptional response of human fibroblasts to serum stimulation.

Full PDF

aa r X i v : . [ q - b i o . Q M ] J un Identifying statistically signiﬁcant patterns in gene expression data

Patrick E. McSharry and Edmund J. Crampin , ∗ Oxford Centre for Industrial and Applied Mathematics, Mathematical Institute, Oxford, OX1 3LB, UKand Department of Engineering Science, University of Oxford, Oxford, OX1 3PJ, UK, and Centre forMathematical Biology, Mathematical Institute, Oxford, OX1 3LB, UK, University Laboratory ofPhysiology, University of Oxford, Oxford, OX1 3PT. ∗ To whom correspondence should be addressed.Both authors contributed equally to this work.Manuscript Date: 9th August 2002

Running head:

Statistical signiﬁcance in gene expression data

Abstract

Motivation

Clustering techniques are routinely applied to iden-tify patterns of co-expression in gene expression data.Co-regulation, and involvement of genes in similarcellular function, is subsequently inferred from theclusters which are obtained. Increasingly sophisti-cated algorithms have been applied to microarraydata, however, less attention has been given to thestatistical signiﬁcance of the results of clustering stud-ies. We present a technique for the analysis of com-monly used hierarchical linkage-based clustering calledSigniﬁcance Analysis of Linkage Trees (SALT).

Results

The statistical signiﬁcance of pairwise similarity lev-els between gene expression proﬁles, a measure ofco-expression, is established using a surrogate dataanalysis method. We ﬁnd that a modiﬁed versionof the standard linkage technique, complete-linkage ,must be used to generate hierarchical linkage treeswith the appropriate properties. The approach is il-lustrated using synthetic data generated from a novelmodel of gene expression proﬁles and is then appliedto previously analysed microarray data on the tran-scriptional response of human ﬁbroblasts to serumstimulation.

Availability

A set of MATLAB functions are available on request.

Contact [email protected]

The ability to measure expression levels of multi-ple genes simultaneously promises insights into theregulation of gene expression under both normal andpathological conditions. Gene expression data is nowroutinely collected using oligonucleotide and cDNAmicroarray technologies (Fodor et al., 1993; Pease et al.,1994; Schena et al., 1996). Microarray experimentsinvolve many separate steps in the preparation ofsamples, of the arrays themselves and in the subse-quent image acquisition and analysis (Hauser et al.,1998). However, as the reliability of the data im-proves, there is an increasing need for tools for dataanalysis and interpretation. While increasingly so-phisticated clustering techniques are being applied tomicroarray data, the statistical signiﬁcance of clus-tering results has yet to be fully explored.A wide variety of statistical techniques have beenused to investigate these gene expression proﬁles, in-cluding principal component analysis (Holter et al.,2000; Alter et al., 2000; Holter et al., 2001), corre-spondence analysis (Fellenberg et al., 2001), neural1etworks (Herrero et al., 2001) and the constructionof statistical models (Zhao et al., 2001; Ramoni et al.,2002). A common starting point for the analysis ofmicroarray data is to use a clustering technique togroup together genes with similar expression proﬁles(Wen et al., 1998; Eisen et al., 1998; Yeung and Ruzzo,2001). Genes exhibiting similar patterns suggest co-regulation of gene expression, and co-expressed genesmay be involved in similar functions within the cell.The ultimate goal of such studies is to be able topredict the underlying gene networks giving rise tothe gene expression data (Yeung et al., 2002).In this paper we aim to determine the signiﬁ-cance of the similarity between gene expression pro-ﬁles. We deal with data collected over several timeintervals (time series data), although the techniqueapplies equally for gene expression data recordedover multiple separate experiments. In particular,we show how the signiﬁcance of the number of clus-ters emerging from linkage analysis techniques canbe assessed. Linkage analysis, a simple and widelyused clustering technique, performs clustering by sort-ing the gene expression proﬁles according to pairwisesimilarity. A distance metric is used to quantify thesimilarity between genes, where the closer two genesare in distance the more similar are their expres-sion patterns, i.e. the more likely they are to be co-expressed. The ordered gene expression proﬁles canbe represented graphically using a tree (called a den-drogram) where the position of the branch connect-ing two genes reﬂects the similarity of their expres-sion proﬁles. Sorting the genes in this way results insimilar expression proﬁles being grouped together.Currently many investigators identify clusters of in-terest from the tree by both visual inspection and a priori knowledge. The tree identiﬁes many diﬀer-ent possible numbers of clusters, ranging from oneextreme where there are as many clusters as genes(one per cluster) to the other extreme, where onecluster contains all genes in the data set. Cuttingthe tree at a particular distance value establishes anumber of clusters into which the expression proﬁlesare grouped.The major diﬃculties in the analysis of microar-ray data, as for most data sets, arise in discriminat-ing between signal and noise. Given limited data,there will always be a nonzero probability of in- correctly identifying noise as genuine co-expression.For microarray data the situation is exacerbated bythe generally low number of observations made onlarge numbers of variables (genes). Any statisticaltest applied to the data must indicate an accept-able level of such ‘false positives’ arising. Statisticaltests rely on assumptions made about what consti-tutes noise in the data set, against which a null hy-pothesis is tested. The nominal level of the statisti-cal test will only be meaningful if the assumptionsabout the noise are correct, and the null hypoth-esis relevant to the data set. We use the methodof surrogate data (Theiler et al., 1992; Smith, 1992)to determine a threshold on the tree at which wecan reject, with a given conﬁdence level, the null hy-pothesis that observed similarity values could haveoccurred by chance. This threshold can be used todetermine which clusters are signiﬁcant.We demonstrate the approach (Signiﬁcance Anal-ysis of Linkage Trees, SALT) using synthetic gene ex-pression data generated using a simple model, andsubsequently apply the surrogate analysis techniqueto a publicly available gene expression data set whichanalyses the response of human ﬁbroblasts to serum(Iyer et al., 1999).

Gene expression data is commonly expressed as thelogarithm of the ratio of an observed signal to the ini-tial or other reference expression level. Hence valuesare initially zero, and a positive value indicates up-regulation, whereas negative values represent down-regulation of gene expression, and up- and down-regulation are given equal numerical importance.

Synthetic Gene Expression Data

To illustrate how linkage clustering performs on adata set containing predetermined patterns of co-expression of genes, where gene clusters are known,we generated synthetic gene expression data. Sixtime-dependent response functions (labelled I to VI)were used to simulate early and late response, andup- and down-regulation expression patterns (Fig.1). We used a Gaussian function f g ( t, τ, c ) =exp[ − ( t − τ r − τc ) ] to simulate genes switching on and2 i (t) IIIIIIIVVVI Figure 1: Examples of the six response functionsused for generating synthetic expression data (withno observational noise added).oﬀ transiently during the experiment (response func-tions I through IV in Fig. 1) and a sigmoid function f s ( t ) = [1 + tanh( t − τ r − τc )] to replicate genes re-sponding slowly during the experiment, to reach anew threshold expression level (response functionsV and VI in Fig. 1). Here τ r is a time delay for eachof the six responses, and τ and c are gene-speciﬁcdelays and timescales.A time series of T = 13 points was used to mimicthe change in gene expression data over a 24hr pe-riod. Our synthetic data set contained N = 120genes: 20 genes for each of the six responses. Co-expressed genes may respond after diﬀerent delaysand over diﬀerent timescales, and will show diﬀer-ent amplitudes of expression. We incorporated suchvariation into the amplitudes of the gene expres-sion proﬁles and the response times by represent-ing the expression level for gene i through x i ( t ) = A i f ( t, τ i , c i ) + η i ( t ), where f stands for either up- ordown-regulation with response function f g or f s , asdescribed in Table 1. For each gene i , the variationin the amplitude A i , time delay τ i and timescale c i are sampled from uniform distributions with A i ∈ [0 . , . τ i ∈ [ − ,

1] and c i ∈ [2 . , . η i ( t ) with zero mean and standard de-viation 0.05. Table 1: Parameters for generating the syntheticdata set.Response r I II III IV V VI f + f g − f g + f g − f g + f s − f s τ r Fibroblast Data Set

We used the published data set of Iyer et al. (1999)for the response of human ﬁbroblasts to serum fol-lowing serum starvation. Data was collected at 12times over a 24hr period for around 8,600 distincthuman transcripts. A further data point is includedfor exponentially growing cells (“unsynch”) to givea series of T Similarity Measure

In order to cluster a data set containing gene ex-pression proﬁles using a linkage algorithm, a math-ematical deﬁnition of the similarity between expres-sions is required. Two genes which are co-expressedare likely to be similar in shape, but not necessar-ily in magnitude and for this reason the correlationcoeﬃcient is a suitable similarity metric. Follow-ing Eisen et al. (1998), denoting the logarithm ofthe expression ratio for gene i at time k by x ik ,the similarity between genes i and j is quantiﬁed by ρ ij = T P Tk =1 ( x ik σ i )( x jk σ j ), where σ i = q T P Tk =1 x ik and T is the number of observations. Note that ifthe means were subtracted from each of the expres-sion proﬁles then σ i would be the standard deviationand ρ ij the correlation coeﬃcient (Chatﬁeld, 1989).By explicitly setting the mean to zero (Eisen et al.,1998), corresponding to an expression ratio of one,we are selecting a reference state against which sub-sequent changes are contrasted. Values of ρ ij canvary from 1 (completely correlated) to − ρ ij = 0 implies that3he two genes are uncorrelated.The similarity measure ρ ij takes negative val-ues for anti-correlated data and so cannot be usedas a distance between two observations. A distancemeasure d ij can be calculated from ρ ij using d ij = p − ρ ij ) which fulﬁls the conditions required fora distance metric: namely (i) d ij = 0 if and only if i = j , i.e. the genes have the same expression pro-ﬁles, (ii) d ij = d ji and (iii) d ij ≤ d ik + d kj (Mantegna,1999). The N ( N − / d ij ( i =1 , . . . , N , j = i + 1 , . . . , N ) can be used to determinea tree connecting N genes using a graph consistingof N − Clustering by Complete-Linkage

Linkage algorithms iteratively combine the N genesinto clusters. A measure of aﬃnity between clus-ters is used to decide the order in which clusters arecombined at any given step. Starting from N clus-ters, each consisting of a single gene, the two clus-ters with the highest aﬃnity are combined into anew cluster. This linkage is marked on the tree by aconnection between the clusters at a linkage distanceequal to the aﬃnity value. This process is repeateduntil there is only one large cluster containing all N genes. If the aﬃnity between clusters is chosento be the distance between the closest pair of genesthen the method is known as single-linkage. The treeconstructed with single-linkage is called a minimumspanning tree. This has the disadvantage that thelinkage distance does not place a bound on how dis-similar genes within the same cluster may be. This isbecause the distances d ij for each of the other pair-ings of genes between the clusters will be larger thanthe linkage distance, δ ij , by construction. Choosingthe distance between the averages of the clusters asthe measure of aﬃnity is known as average-linkage,and is a choice commonly used in microarray dataanalysis (Eisen et al., 1998). Similarly, for average-linkage δ ij does not place an upper bound on thepairwise distances within clusters. We suggest that the appropriate linkage tech-nique to choose is complete-linkage, where the max-imal pairwise distance between clusters δ ij is usedto determine the tree. At each step the two clus-ters with the smallest maximal pairwise distance arecombined. Thus the tree contains maximal distancesbetween clusters, and hence is appropriate for ananalysis of statistical signiﬁcance. In particular, forthe complete-linkage algorithm all genes clusteredbelow a threshold distance d ij = d thrs must be sep-arated by distances satisfying d ij ≤ d thrs . If we cutthe tree at d thrs and ignore all linkages above d thrs then we are certain not to neglect any gene pairingswhich are separated by a distance d ij ≤ d thrs . Thisproperty of the complete-linkage algorithm will beemployed to obtain signiﬁcant clusters.It still remains to be decided at which distanceto cut the tree, to determine the number of clusterswhich with high probability represent co-expressionof their constituent genes. To obtain the distributionof distances that one would expect to ﬁnd for geneswhich are not being co-expressed, we have generatedsurrogate data (Theiler et al., 1992; Smith, 1992). Surrogate Gene Proﬁles

The SALT technique determines how small the dis-tance between two genes must be before they are in-ferred to be co-expressed. This is achieved by testingagainst a particular null hypothesis: that a particu-lar value of the distance arises by chance from twogenes which are not co-expressed. We test this nullhypothesis using the distribution of distances whichresults from genes that are not co-expressed. Sincewe do not have an analytical description for this dis-tribution, we estimate it using surrogates gene ex-pression proﬁles, generated by sampling from theoriginal data set. A signiﬁcance level must be speci-ﬁed at which the null hypothesis is tested. For exam-ple, if we allow a 5% chance that the null hypothesisis falsely rejected, then the test is valid at the 95%level. Testing at the 95% signiﬁcance level meansthat we reject the null hypothesis for any distancein the tree less than a threshold distance d thrs cor-responding to the 5th percentile of the distributionobtained from the surrogates. Each pair of surrogategene expression proﬁles provides one sample of the4 G ene e x p r e ss i on 0 h r m i n m i n h r h r h r h r h r h r h r h r h r un sy n c h Time−3−2−10123

Figure 2: Probability density function (PDF) of theﬁbroblast gene expression data set.distribution. A large number of surrogates (10000)was used to resolve the tail of the distribution, givinga robust estimate of the 5th percentile.We generate surrogate expression proﬁles whichpreserve some aspects of the original data but whichare also consistent with the null hypothesis. Appro-priate surrogates should reﬂect obvious properties ofthe data, in particular that the gene expression pro-ﬁles vary smoothly with time. An analysis of theprobability density function (PDF) of the gene ex-pression proﬁles for the ﬁbroblast data set (the prob-ability that, at a given time point, a gene selectedat random from the data set will have a given ex-pression level) shows that there is a time-dependenttrend running through the data (Fig. 2). This sug-gests that surrogate data sets based on permutationswhich shuﬄe the temporal information are likely todestroy important correlations which are due to thetime series nature of the experiment. To preservethe temporal continuity we have constructed surro-gate gene expression proﬁles by sampling without re-placement from the original data at each time pointindependently. In this way the PDF of the gene ex-pression at diﬀerent time points (Fig. 2) is preservedin the surrogates.We can then reject the null hypothesis for dis-tances calculated from the original data for which d ij < d thrs with conﬁdence 95%. The constructionof the tree using complete-linkage algorithm ensuresthat for clusters below d thrs , the distance betweenevery gene pair within a cluster is statistically sig-niﬁcant. Note that this is not the case for the otherlinkage algorithms described above. We applied complete-linkage clustering to both thesynthetic and ﬁbroblast data sets. Surrogates weregenerated to ﬁnd d thrs and this threshold was usedto determine statistically signiﬁcant clusters. Wefound that 10000 surrogates was suﬃcient to producerobust results when testing at the 95% signiﬁcancelevel. Synthetic data

The distances corresponding to the synthetic dataset (Fig. 3) fall into distinct groups because of theclearly deﬁned response proﬁles underlying the data(Fig. 1). The tree (left) shows the hierarchical or-ganisation of the linkage distances, indicating thatthe correct number of clusters can be identiﬁed bycutting the tree at a threshold distance in the range0 . < d ij < . d thrs = 0 .

963 corresponding tothe 95% signiﬁcance level. This threshold correctlyidentiﬁes the clusters, assigning the 20 genes for eachresponse function to the corresponding cluster. Thecorrespondence between the original responses (Fig.1) and the clusters (Fig. 3) is given by: I (a),II (b), III (d), IV (c), V (e) and VI (f). Note that the expression proﬁles are clusteredtogether, even though their magnitudes are diﬀerent,because they have the same shape, as quantiﬁed bythe choice of distance metric based on the correlationcoeﬃcient.We expect that signiﬁcantly higher noise levelswill obscure the original pattern of gene expression.Increasing the observational noise in the syntheticdata set was found to increase the linkage distancesused to construct the tree, but not to have a stronginﬂuence on the threshold distance (data not shown).Many more of the distances in the data set are foundto be consistent with the null hypothesis, and there-5 δ ij −1012x i (a)−2−101x i (b)0 6 12 18 24−2−101 timex i (c) −1012 (d)−1012 (e)0 6 12 18 24−2−101 time(f) Figure 3: Analysis of the synthetic data set showing the hierarchical tree (left) constructed using complete-linkage, with distances δ ij from 2 (no co-expression) to 0 (co-expression). The clusters recovered by cuttingthe tree at signiﬁcance threshold d thrs = 0 .

963 (indicated on the tree by the dashed line) are shown on theright. The darker trace indicates the cluster average expression proﬁle.fore not considered signiﬁcant for clustering. In thiscase, a larger number of clusters than the six origi-nal responses is obtained by cutting the tree at thethreshold distance, however, the members of eachcluster were still found to correspond to only one ofthe six response functions.

Fibroblast Data Results

We applied the same surrogate data analysis tech-nique to the 517 gene expression proﬁles in the pub-lished data set corresponding to ﬁbroblast transcrip-tional response to serum stimulation. The publishedclustering for this data set found clusters correspond-ing to diﬀerent aspects of the physiological responseof ﬁbroblasts to wound healing (serum stimulation).Iyer et al. used hierarchical clustering (using average-linkage, Eisen et al., 1998) from which they obtainedclusters by visual inspection of the ordered tree, re-porting cluster sizes of 142, 100, 60, 40, 32 and 31 genes for the six largest clusters.We used surrogates to obtain a threshold dis-tance of d thrs = 0 .

95 corresponding to the 95% sig-niﬁcance level, which was used to cut the tree. Theﬁfteen largest clusters obtained, those containing 10or more genes, are shown in Fig. 4, along with thered-green display (Eisen et al., 1998) of the entireclustered data set, ordered by increasing mean ex-pression value.In total 47 statistically signiﬁcant clusters werefound, including 30 containing more than two genes.The maximum cluster size found to be of statisti-cal signiﬁcance by this technique is 74, and the sixlargest clusters are of size 74 (b), 40 (d), 39 (a), 30(j), 29 (l) and 23 (c,f,g), suggesting that some of thecorrelations within the larger clusters reported pre-viously may not be statistically signiﬁcant.6 h r m i n m i n h r h r h r h r h r h r h r h r h r un sy n c h −3 −2 −1 0 1 2 3abcdefghijklmno −3−2−101 a (39)−3−2−101 b (74)−3−2−101 c (23)−3−2−101 d (40)−4−20 e (20)−202 f (23)−4−202 g (23)−2024 h (12) −202 i (12)024 j (30)−2024 k (18)−2024 l (29)−20246 m (21)024 n (20)024 o (17) Figure 4: Analysis of the ﬁbroblast gene expression data. The red-green display shows the clustered dataset sorted by increasing mean expression level. The scale is log-expression ratio − d thrs = 0 .

95 correspond to the locations indicated by coloured bars left of the red-green display(cluster sizes are shown in brackets). 7

Discussion

The measurement of gene expression data using DNAmicroarray techniques may provide a deeper under-standing of many of the complicated processes un-derlying biological systems. To investigate the largedata sets resulting from microarray experiments wehave presented a new technique (SALT). This tech-nique is straightforward and can be easily imple-mented in existing computer analysis packages.SALT uses surrogate data analysis to identifyclusters on the hierarchical tree generated using thecomplete-linkage algorithm, at a prescribed signif-icance level. We have used surrogate data whichpreserves the temporal variation of the gene expres-sion proﬁles. This is also the appropriate methodfor data collected for diﬀerent experimental condi-tions or from diﬀerent tissue preparations, in orderto preserve characteristics of the data within the dif-ferent experiments. A better understanding of thedynamical processes underlying the gene expressiondata and, in particular, the sources of measurementerror, would allow surrogates to be constructed us-ing better models of the properties of independentgene expression data.The application of the technique to the publishedﬁbroblast data set found signiﬁcant clusters contain-ing fewer genes than previous analyses (Ramoni et al.,2002; Iyer et al., 1999). One explanation is that pre-vious analyses employ similarity values between datapoints which are not statistically signiﬁcant to obtaintheir clusters. Studies using alternative methods ofclustering in which the data is divided into a pre-determined number of clusters (K-means clustering,for example) also risk grouping together genes forwhich the expression proﬁles have distances abovethe signiﬁcance threshold.Statistical signiﬁcance testing using SALT couldalso be used for clustering other examples of largedata sets where one wishes to identify which ele-ments of a given set of proﬁles are (i) interactingwith each other or (ii) interacting in a similar wayin response to some external perturbation.

Acknowledgements

This research was supported by an Engineering andPhysical Sciences Research Council Grant GR/N02641to PEM and The Wellcome Trust (EJC).

References

O. Alter, P. O. Brown, and D. Botstein. Singular valuedecomposition for genome-wide expression data pro-cessing and modeling.

Proc. Natl. Acad. Sci. USA , 97:10101–10106, 2000.C. Chatﬁeld.

The Analysis of Time Series . Chapman andHall, London, 4th edition, 1989.M. B. Eisen, P. T. Spellman, P. O. Brown, and D. Bot-stein. Cluster analysis and display of genome-wideexpression patterns.

Proc. Natl. Acad. Sci. USA , 95:14863–14868, 1998.K. Fellenberg, N. C. Hauser, B. Brors, A. Neutzner, J. D.Hoheisel, and M. Vingron. Correspondence analysis ap-plied to microarray data.

Proc. Natl. Acad. Sci. USA ,98:10781–10786, 2001.S. P. Fodor, R. P. Rava, X. C. Huang, A. C. Pease, C. P.Holmes, and C. L. Adams. Multiplexed biochemicalassays with biological chips.

Nature , 364:555–556, 1993.N. C. Hauser, M. Vingron, M. Scheideler, B. Krems,K. Hellmuth, K. D. Entian, and J. D. Hoheisel. Tran-scriptional proﬁling on all open reading frames of sac-charomyces cerevisiae.

Yeast , 14:1209–1221, 1998.J. Herrero, A. Valencia, and J. Dopazo. A hierarchical un-supervised growing neural network for clustering geneexpression patterns.

Bioinformatics , 17:126–136, 2001.N. S. Holter, M. Madhusmita, A. Maritan, M. Cieplak,and N. V. Banavar, J. R.and Fedoroﬀ. Fundamentalpatterns underlying gene expression proﬁles: simplic-ity from complexity.

Proc. Natl. Acad. Sci. USA , 97:18409–8414, 2000.N. S. Holter, A. Maritan, M. Cieplak, N. V. Fedoroﬀ, andJ. R. Banavar. Dynamic modeling of gene expressiondata.

Proc. Natl. Acad. Sci. USA , 98:1693–1698, 2001.V. R. Iyer, M. B. Eisen, D. T. Ross, G. Schuler, T. Moore,J. C. F. Lee, J. M. Trent, L. M. Staudt, J. Hudston Jr.,M. S. Boguski, D. Lashkari, D. Shalon, D. Botstein,and P. O. Brown. The transcriptional program in theresponse of human ﬁbroblasts to serum.

Science , 283:83–87, 1999. . N. Mantegna. Hierarchical structure in ﬁnancial mar-kets. Eur. Phys. J. B , 11:193–197, 1999.A. C. Pease, D. Solas, E. J. Sullivan, M. T. Cronin, C. P.Holmes, and S. P. Fodor. Light-generated nucleotidearrays for rapid DNA sequence analysis.

Proc. Natl.Acad. Sci. USA , 91:5022–5026, 1994.M. F. Ramoni, P. Sebastiani, and I. S. Kohane. Clusteranalysis of gene expression dynamics.

Proc. Natl. Acad.Sci. USA , 99:9121–9126, 2002.M. Schena, D. Shalon, R. Heller, A. Chai, P. O. Brown,and R. W. Davis. Parallel human genome analysis:microarray-based expression monitoring of 1000 genes.

Proc. Natl. Acad. Sci. USA , 93:10614–10619, 1996.L. A. Smith. Identiﬁcation and prediction of low-dimensional dynamics.

Physica D , 58:50–76, 1992.J. Theiler, S. Eubank, A. Longtin, B. Galdrikan, and J. D.Farmer. Testing for nonlinearity in time series: themethod of surrogate data.

Physica D , 58:77, 1992.X. Wen, S. Fuhrman, G. S. Michaels, D. B. Carr,S. Smith, J. L. Barker, and R. Somogyi. Large-scaletemporal gene expression mapping of central nervoussystem development.

Proc. Natl. Acad. Sci. USA , 95:334–339, 1998.K. Y. Yeung and W. L. Ruzzo. Principle component anal-ysis for clustering gene expression data.

Bioinformat-ics , 17:763–774, 2001.M. K. S. Yeung, J. Tegn´er, and J. J. Collins. Reverseengineering gene networks using singular value decom-position and robust regression.

Proc. Natl. Acad. Sci.USA , 99:6163–6168, 2002.L. P. Zhao, R. Prentice, and L. Breeden. Statistical mod-eling of large microarray data sets to identify stimulus-response proﬁles.

Proc. Natl. Acad. Sci. USA , 98:5631–5636, 2001., 98:5631–5636, 2001.