[PDF] A simple and robust method for connecting small-molecule drugs using gene-expression signatures

Abstract

Interaction of a drug or chemical with a biological system can result in a gene-expression profile or signature characteristic of the event. Using a suitably robust algorithm these signatures can potentially be used to connect molecules with similar pharmacological or toxicological properties. The Connectivity Map was a novel concept and innovative tool first introduced by Lamb et al to connect small molecules, genes, and diseases using genomic signatures [Lamb et al (2006), Science 313, 1929-1935]. However, the Connectivity Map had some limitations, particularly there was no effective safeguard against false connections if the observed connections were considered on an individual-by-individual basis. Further when several connections to the same small-molecule compound were viewed as a set, the implicit null hypothesis tested was not the most relevant one for the discovery of real connections. Here we propose a simple and robust method for constructing the reference gene-expression profiles and a new connection scoring scheme, which importantly allows the valuation of statistical significance of all the connections observed. We tested the new method with the two example gene-signatures (HDAC inhibitors and Estrogens) used by Lamb et al and also a new gene signature of immunosuppressive drugs. Our testing with this new method shows that it achieves a higher level of specificity and sensitivity than the original method. For example, our method successfully identified raloxifene and tamoxifen as having significant anti-estrogen effects, while Lamb et al's Connectivity Map failed to identify these. With these properties our new method has potential use in drug development for the recognition of pharmacological and toxicological properties in new drug candidates.

Full PDF

aa r X i v : . [ q - b i o . Q M ] J a n A simple and robust method for connecting small-molecule drugs usinggene-expression signatures

Shu-Dong Zhang ∗ and Timothy W. Gant MRC Toxicology Unit, Hodgkin Building, Lancaster Road, University of Leicester, Leicester, UK

Interaction of a drug or chemical with a biological system can result in a gene-expression proﬁleor signature characteristic of the event. Using a suitably robust algorithm these signatures canpotentially be used to connect molecules with similar pharmacological or toxicological properties.The Connectivity Map was a novel concept and innovative tool ﬁrst introduced by Lamb et al toconnect small molecules, genes, and diseases using genomic signatures [Lamb et al (2006), Science313, 1929-1935]. However, the Connectivity Map had some limitations, particularly there wasno eﬀective safeguard against false connections if the observed connections were considered onan individual-by-individual basis. Further when several connections to the same small-moleculecompound were viewed as a set, the implicit null hypothesis tested was not the most relevant onefor the discovery of real connections. Here we propose a simple and robust method for constructingthe reference gene-expression proﬁles and a new connection scoring scheme, which importantlyallows the valuation of statistical signiﬁcance of all the connections observed. We tested the newmethod with the two example gene-signatures (HDAC inhibitors and Estrogens) used by Lamb etal and also a new gene signature of immunosuppressive drugs. Our testing with this new methodshows that it achieves a higher level of speciﬁcity and sensitivity than the original method. Forexample, our method successfully identiﬁed raloxifene and tamoxifen as having signiﬁcant anti-estrogen eﬀects, while Lamb et al’s Connectivity Map failed to identify these. With these propertiesour new method has potential use in drug development for the recognition of pharmacological andtoxicological properties in new drug candidates.

I. INTRODUCTION

One of the most fundamental challenges in all formsof ’omic technologies is the connection of biological eventsignatures with others previously derived to allow therecognition of new molecule properties or biological al-teration. Simple, robust, and eﬃcient matching methodsare required to connect a new gene signature with thosein a database. This problem was ﬁrst tackled by Lambet al [1] who introduced the Connectivity Map as a re-source and tool to connect small-molecule drugs, genes,and diseases. Lamb et al’s Connectivity Map achieveda good degree of success, but also suﬀered from severaldeﬁciencies, particularly an inability to apply a measureof statistical validity at the individual reference signaturelevel to allow rational ﬁltering of the results to excludefalse connections. We took the method of Lamb as a basisfor development and have derived a simple, robust andstatistically testable method for making connections be-tween biological event signatures. The method was testedwith genomic signatures resulting from small molecule in-teractions in cells, but also could be applied to any formof signature such as those from proteomic or metabolomicscience.The main assumption behind the concept of Connec-tivity Map is that a biological state, whether physio-logical, pathological, or induced with chemical or ge-nomic perturbations, can be described in terms of a ge-nomic signature, eg., the genome-wide mRNA levels as ∗ Electronic address: [email protected] measured by DNA microarray technologies. The work-ing of the Connectivity Map involves several key com-ponents. First, a large collection of pre-built referencegene-expression proﬁles serve as the core database, whereeach reference proﬁle characterizes a well-deﬁned biolog-ical state. Secondly, a query gene signature from somespeciﬁc studies. A query gene signature is basically ashort (as compared to the list of genes in a typical refer-ence proﬁle) list of genes most relevant and important tocharacterize the biological state of the researchers’ inter-est. Finally, a pattern matching algorithm or similaritymetric deﬁned between a query gene signature and a ref-erence gene-expression proﬁle to quantify the closeness orconnection between the two biological states. The Con-nectivity Map could be used by biomedical researchers toﬁnd connections between the reference biological statesand those of their own interest. This may lead to newﬁndings of unexpected connections and testable new bi-ological hypotheses. In this paper, we present a newframework for the construction of reference proﬁles andnew connection scoring scheme and testing proceduresfor the observed connections. We compare our methodwith that of Lamb et al, and show that more robust re-sults are achieved using our method. In particular ourmethod not only oﬀers more principled statistical pro-cedures for testing connections, but more importantlyit provides eﬀective safeguard against false connectionswhile at the same time achieves much increased sensi-tivity. As a consequence it can beneﬁt the end users bysaving them enormous amount of time and resources inpursuing new biological hypotheses based on the ﬁndingsof connectivity maps.

II. METHODSA. Construction of reference proﬁles

For the ﬁrst-generation connectivity map, Lamb etal carried out a series of gene-expression proﬁling ex-periments [1], using 164 distinct small-molecule com-pounds in a few selected human cell lines. Each treat-ment instance consisted of one treatment sample and one(or more) vehicle control samples, whose genome widemRNA levels were measured using Aﬀymetrix GeneChipmicroarrays. In total 564 samples were microarrayed,which represented 453 diﬀerent treatment instances. Forexample, treatment instance ID988 consisted of 1 treat-ment sample and 6 vehicle control samples. The treat-ment sample was obtained by treating human MCF7 cellswith 100 nM estradiol for 6 hours. The control sampleswere obtained by treating MCF7 cells with vehicle controlfor 6 hours. A gene-expression proﬁle was constructed foreach treatment instance, in which the relative expression(treatment relative to the control) of all measured geneswere speciﬁed, and sorted in descending order. A querygene signature, obtained and ordered in the same man-ner, can be compared to each reference proﬁle in the Con-nectivity Map to calculate a connectivity score. For “ pos-itive connectivity ”, the up-regulated genes of the querysignature ﬁnd matches near the top of the reference pro-ﬁle, and the down-regulated genes ﬁnd matches near thebottom of the reference proﬁle. For “ negative connectiv-ity ”, the matches are opposite.We obtained Lamb et al’s data set (Accession NumberGSE5258) from the GEO (Gene Expression Omnibus)database, and rebuilt 453 reference gene-expression pro-ﬁles using a new ranking scheme based on the follow-ing guiding principles: (1) A treatment instance was de-ﬁned relative to a control, thus the eﬀect of the treat-ment could be characterized by the relative diﬀerentialexpression status of all the genes together, (2) diﬀerentgenes were aﬀected to diﬀerent extents by the treatment,so genes which showed a greater diﬀerential expressionshould have more weight in characterizing this treatment,and (3) up- and down-regulated genes should be treatedequally in a uniﬁed manner. This meant that a 2-folddown-regulated gene was considered as equally impor-tant as a 2-fold up-regulated gene in deﬁning a referenceproﬁle. There are several reasons for the choice of treat-ing up- and down-regulated genes equally. Theoretically,unless we have a lot of further information about so manygenes on the microarray it is diﬃcult to decide whetherthis 2-fold up-regulated gene is more important than that2-fold down-regulated gene or the opposite. So a logicalchoice is to assign them equal weights. Another reasonis the consideration of symmetry: if a gene is 2-fold up-regulated in sample 1 versus sample 2, it can also beviewed as 2-fold down-regulated in sample 2 versus sam-ple 1. We should emphasize that assigning two genesequal weights does not imply in any sense they sharethe same molecular mechanism. Even two up-regulated genes with the same fold changes could have very diﬀer-ent molecular mechanisms as to why they increased theirexpression. To adhere to the above guidelines, an obviouschoice for organizing the genes is the logarithm of the ex-pression ratio (treatment over control). Thus instead oftreating the down- and up-regulated genes separately asin the methods of Lamb et al, we ordered genes in a refer-ence proﬁle by the absolute value of their expression log-ratios. Therefore the most diﬀerentially expressed genes(either up or down) appear ﬁrst in the list, and thosenon-diﬀerentially expressed genes appear at the bottomof the list. In this way, the genes are ordered by theirimportance in deﬁning the reference proﬁle. It is thenstraightforward to assign ranks to them. Suppose thereare in total N genes, the ﬁrst gene in the list will be as-signed a rank N if it is up-regulated, or a rank − N if it isdown-regulated. In general the i th gene in the list will beranked with ( N − i + 1) for up-regulation or − ( N − i + 1)for down-regulation. With this new ranking method, theimportance of a gene is reﬂected by the absolute value ofits rank, while the sign of its rank indicates its regulationstatus. The consequence and advantage of this methodfor creating reference proﬁles is that attaching statisti-cal signiﬁcance to the connection observed is a relativelystraightforward step. B. The new scoring scheme

A query gene signature can be an ordered gene list,or just a collection of genes without speciﬁc orderingamong them. We will refer to these two types of querygene signatures as ordered and unordered gene signaturerespectively. For an ordered gene signature, we rankthe genes in the list in the same way as for a referenceproﬁle. Namely, the most important (diﬀerentially ex-pressed) gene in the signature will be assigned a rank m or − m depending on its regulation status, where m isthe number of genes in the signature. While the leastimportant gene in the signature be ranked 1 or − R denote a reference gene-expression proﬁle, and s a query gene signature. We deﬁne the connection strength between R and s as C ( R , s ) = m X i =1 R ( g i ) s ( g i ) , (1)where g i represents the i th gene in the signature, s ( g i ) isits signed rank in the signature, and R ( g i ) is this gene’ssigned rank in the reference proﬁle. It is worth not-ing some properties of the connection strength deﬁnedabove: (1) if a gene has the same regulation status (ei-ther up- or down-regulation) in both the reference andthe query, it will make a positive contribution to theconnection strength, otherwise its contribution will benegative; (2) the magnitude of a gene’s contribution tothe connection strength is determined by its position inboth lists; and (3) a gene signature with some of its genescontributing positively and others negatively will have anoverall low connection strength, because the positive andnegative contributions cancel each other to some extent.Therefore calculating the connection strength between agene signature and a reference proﬁle, the maximum con-nection strength achievable is the situation where the m genes in the signature match the ﬁrst m genes in the ref-erence proﬁle in the correct order, and their regulationstatus also match. In such a case, the maximum positiveconnection strength is, for an ordered gene signature, C omax ( N, m ) = m X i =1 ( N − i + 1)( m − i + 1) . (2)In another equally interesting situation, where the m genes in the signature match the ﬁrst m genes in thereference proﬁle in the right order, but the sign of eachgene in the query is diﬀerent from its sign in the reference,the connection strength is − C omax ( N, m ), the opposite ofEq.(2).For an unordered query signature, all the genes inthe list have equal weight because there is no particu-lar ordering among them. The calculation of connectionstrength is the same as Eq.(1), the only diﬀerence beingthat s ( g i ) = 1 if gene g i is up-regulated, or s ( g i ) = − unordered signatureis C umax ( N, m ) = m X i =1 ( N − i + 1) . (3)Given a query signature gene and a reference gene-expression proﬁle, we can calculate their connection scoreby c = C ( R , s ) /C max ( N, m ) . (4)So a connection score c = 1 means that the gene signaturehas the maximum positive connection strength with thereference proﬁle, which indicates that the experimentalcondition that gave rise to this gene signature had thestrongest possible correlation with the treatment instancethat generated the reference proﬁle. A connection score c = − c will be within the range of ( − , C. Connection Testing

As for most biomedical experiments with unavoidablebiological and technical variation, statistical signiﬁcanceis a crucial aid to the interpretation and subsequent val-idation of the result. Here we propose calculating thep-value associated with a connection score by testing thefollowing null hypothesis.

Null hypothesis H : For a reference gene-expressionproﬁle R and a query gene signature s , the null hypoth-esis H states that there is no underlying biological con-nection between the two, and that the query signature s is merely a random m -gene signature, as generated byProcedure 1 described below. Procedure 1:

Generation of a random m -gene signa-ture. Let R be a given reference gene-expression proﬁleof N genes. Select m genes sequentially and randomlyfrom the N genes (without replacement), and assign +1(up-regulation) or − m selected genes. If thisgene signature is to be used as an ordered list, its orderis just the order in which the m genes are selected; or ifthis gene signature is to be used as merely a collection ofgenes, then order is irrelevant.Given a reference proﬁle R and a gene signature s , wecalculate their connection score c by Eq.(4), and the two-tailed p-value associated with this observed connectionscore is p = Prob {| ˜ c | ≥ | c | | H } , (5)where ˜ c is the connection score between a random genesignature and the reference proﬁle. To estimate the p-value, a large number (eg., 10 ) of random connectionscores can be generated using Procedure 1 and Eq.(4),the proportion of random scores that are no less thanthe observed scores c in absolute value is an estimate ofthe two-tailed p-value.The 453 individual treatment instances of the dataset GSE5258 were created using only 164 distinct small-molecule compounds. Some treatment instances werereplication experiments using the same compound at thesame or diﬀerent doses. It is thus interesting to considerall the treatment instances of the same compound as aset, and to assess the overall connection of the set witha query gene signature. We deﬁne the connection scorefor a treatment set as follows t = 1 n n X i =1 c i , (6)where n is the number of individual treatment instancesbelonging to the treatment set, c i is the connection scoreof the i th instance. To test the signiﬁcance of a treatmentset as a whole. We used the following null hypothesis, Null hypothesis H set : Where T denotes a set oftreatment instances, R i the reference proﬁle based ontreatment instance i , and s a query gene signature, thenull hypothesis H set states that there is no underlyingbiological connection between the gene signature s andany of the reference proﬁles in T. The query signature ismerely a random gene list generated by Procedure 1.Thus the null hypothesis H set is an extension of H to a higher level. Alternatively, H can be viewed as aspecial case of H set , in which there is only one treat-ment instance in the treatment set. Once the connectionscore for a set is observed, its associated p value can TABLE I: The connectivity scores of the random gene signa-ture rds01 to the reference proﬁles in the Lamb ConnectivityMap. rank ID compound score up down1 1080 sirolimus 1 0.232 -0.5782 913 colforsin 0.953 0.245 -0.5273 1138 phentolamine 0.912 0.316 -0.4234 1048 alpha-estradiol 0.886 0.324 -0.3945 1115 phenanthridinone 0.869 0.379 -0.325 · · · · · · · · · · · · · · · · · ·

112 885 5186223 0.363 0.137 -0.157113 3 metformin 0.342 0.158 -0.119114 663 U0125 0 0.405 0.194115 124 mesalazine 0 0.371 0.256 · · · · · · · · · · · · · · · · · ·

369 1008 geldanamycin 0 -0.43 -0.339370 1064 17AAG 0 -0.436 -0.395371 605 monastrol -0.38 -0.114 0.177372 494 ﬂuphenazine -0.392 -0.113 0.187 · · · · · · · · · · · · · · · · · ·

449 601 MK-886 -0.834 -0.303 0.336450 604 arachidonic acid -0.855 -0.301 0.354451 387 estradiol -0.901 -0.162 0.528452 379 cobalt chloride -0.916 -0.238 0.464453 378 tacrolimus -1 -0.23 0.536 be estimated in a similar way: a large number of randomgene signatures are generated using Procedure 1, and theconnection score of the set to each of the random genesignatures is calculated using Eq.(6); the proportion ofrandom connection scores that are greater than the ob-served score in absolute value is an estimation of the pvalue.

III. RESULTSA. Querying the connectivity maps with randomgene signatures

To compare the speciﬁcities of Lamb et al’s Connectiv-ity Map and the connectivity map presented here, we gen-erated random gene signatures and tested these randomgene signatures in both. The ﬁrst example was a randomgene signature, rds01, which contained 25 Aﬀymetrixprobe-set IDs randomly selected from the 22283 IDs onthe Aﬀymetrix HG-U133A microarray platform. Query-ing Lamb et al’s Connectivity Map with this signature,we obtained the connectivity scores of rds01 to all the 453individual treatment instances. The results are shown inTable I, which suggests that this signature has positiveconnections to 113 individual reference proﬁles, with con-nectivity scores ranging from 1 to 0 . − − . /N , where N is the number of nullhypotheses being tested simultaneously. In this exam-ple, N = 453 (the total number of individual treatmentinstances in the database and hence the number of nullhypothesis being tested at the treatment instance level).So a connection with a p value p < /

453 = 0 . /N was intended to control theexpected number of false connections at 1, we thus ex-pected to have one false connection on average among allthe connections declared as signiﬁcant. Table II showsthat our connectivity map gave the correct result, ie.,no signiﬁcant connection between this random signatureand any of the treatment instances. This compares withthe 196 connections (113 positive and 83 negative) foundwith the method of Lamb et al. Note that to control theexpected number of false connections more precisely, thethreshold hold p value should be set at 1 /N , where N is the number of true null hypothesis. Of course in a gen-eral situation N is unknown, so it has to be estimated,for example, using the methods developed in [2, 3]. Inthis paper, we set threshold p value at 1 /N for simplicity.Since N ≤ N , our criteria tend to be slightly conserva-tive, meaning that the actual number of false connectionson average will be ≤ p < /

453 = 0 . TABLE II: The connection scores for the random gene signa-ture rds01 using the new methods. To control the expectednumber of false connections at 1, the threshold p value was setat 1 /

453 = 0 . · · · · · · · · · · · ·

451 felodipine 848 -0.002 0.988452 estradiol 988 -0.001 0.991453 bucladesine 959 0.001 0.993 values were combined to give a single connectivity score.This way of deﬁning the connection score was unneces-sarily complex, which made the calculation of a possiblep-value on an individual level diﬃcult, if not impossible.On the treatment set level, Lamb et al’s ConnectivityMap provides a permutation p value when a set of treat-ment instances associated with the same compound wereviewed as a whole. The permutation p value was calcu-lated by comparing a statistic (again based on K-S statis-tic) of the treatment set with the distribution of manyrandom-set statistics. Those random sets were formedby randomly selecting treatment instances from all in-stances. From the way the permutation p value was cal-culated, it was clear that the following null hypothesiswas being tested:

The set of instances in question havethe same pattern (distribution) of connections with thequery gene signature compared with a randomly formedset of the same size.

A null hypothesis such as this is notthe most appropriate one to test, because it is not di-rectly relevant to the question of whether the query genesignature had any real biological connection to the treat-ment set. Lamb et al themselves also recognized thatmore rigorous methods for the estimation of statisticalsigniﬁcance were probably required[1]. Putting its appro-priateness aside, if we do use the permutation p valuesfrom Lamb et al’s Connectivity Map to control false con-nections, the same criteria for setting threshold p valuecan be used. The results of signiﬁcance analysis on theconnections between the two random gene signatures andthe 164 treatments sets can be found in the supplemen-tary data. In this instance both connectivity maps gavethe right answers for these two random gene signatures.However, for Lamb et al’s Connectivity Map permutationp values were not available for all the 164 treatment sets.For those treatment sets which only contain 1 treatmentinstance or those sets with mean connectivity score 0, nopermutation p values could be calculated, and hence nostatistical signiﬁcance. This problem aﬀects the coverageand consequently the sensitivity of Lamb et al’s Connec-tivity Map, because real biological connections between

FIG. 1: HDAC inhibitor gene signature: The Venn diagramsummarizing the ﬁndings of signiﬁcant connections as identi-ﬁed by the connectivity maps. A represents the instance-levelanalysis using the new method; B, the set-level analysis usingthe new method; C, the set-level analysis using Lamb et al’sConnectivity Map. The label “AB 13” means that 13 com-pounds are identiﬁed as signiﬁcant solely by A and B (notC), “B 4” means that 4 compounds are identiﬁed as signif-icant solely by B (not A, not C), and so on. The areas areapproximately proportional to the numbers they represent. a query gene signature and any of those treatment setsmay not be recognized.

B. Querying the connectivity maps withexperimentally derived real gene signatures

1. HDAC inhibitors

To test the ability of the new method for identifyingreal biological connections we utilized some of the sameexamples used in [1] to compare with Lamb et al’s Con-nectivity Map. The ﬁrst example was a gene signatureof histone deacetylase (HDAC) inhibitors (Lamb’s genesignature sigs01), which was compiled from [4] on the re-sponses of T24, MDA435 and MDA468 cells respectivelyto three histone deacetylase (HDAC) inhibitors: vorino-stat, MS-27-275, and trichostatin A. This gene signatureconsisted of 8 up- and 5 down-regulated genes, repre-sented by 25 Aﬀymetrix probe-set IDs on the AﬀymetrixHG-U133A microarrays.As Lamb et al’s Connectivity Map does not provideany safeguard on individual treatment instance levelagainst possible false connections, we can only use theresults of Lamb et al’s Connectivity Map on the treat-ment set level. In total 6 compounds (vorinostat, tricho-statin A, resveratrol, geldanamycin, valproic acid, and17AAG) were identiﬁed to have signiﬁcant positive con-nectivity with the HDAC inhibitor gene signature; and2 compounds (5182598 and ﬂudrocortisone) had signif-icant negative connectivity. Vorinostat, trichostatin A,and valproic acid are known HDAC inhibitors thus theidentiﬁcation of these can be regarded as a success. How-ever another known HDAC inhibitor HC-toxin, a refer-ence proﬁle of which was contained in the database, wasnot identiﬁed. This happened because there was onlyone treatment instance of the compound HC-toxin in thedatabase and so no permutation p value could be ob-tained using Lamb et al’s method. Based on their in-stance level results, Lamb et al highlighted HC-Toxin in[1] as it had the 7th highest connectivity score (0.914)of all instances in the dataset. However, as the two ex-amples of random gene signatures showed Lamb et al’sinstance level analysis gave false connections even for thehighest connectivity scores 1.0. So the rational choice isto disregard the instance level results from Lamb et al’sConnectivity Map.We used this same HDAC inhibitor gene signature toquery the new connectivity map presented here, with pvalues calculated at both the individual instance leveland the treatment set level. 56 treatment instances, rep-resenting 22 distinct compounds, were found to have sig-niﬁcant connections to sigs01. On the treatment set level24 compounds were found to have signiﬁcant connec-tions with the signature. Near the top of the signiﬁcantconnection list were those known HDAC inhibitors high-lighted in [1]. Importantly though also included in theoutput was HC-Toxin, which was not identiﬁed by theset-level analysis Lamb et al’s Connectivity Map. Thefull results of querying both connectivity maps using theHDAC inhibitor gene signature are included in the sup-plementary data. In Fig. 1, we summarize the numberof signiﬁcant connections as identiﬁed by: (A) Instancelevel analysis using the new method presented here; (B)Set level analysis using the new method; (C) Set levelanalysis using Lamb et al’s Connectivity Map. In total,our method (A) and (B) combined identiﬁed 27 com-pounds and Lamb et al’s Connectivity Map identiﬁed8 compounds, as having signiﬁcant connections to theHDAC inhibitors. Our method missed only 1 of the 8compounds found signiﬁcant by Lamb et al’s Connectiv-ity Map, while the latter missed 20 of the 27 compoundsidentiﬁed by our method, with HC-toxin among the 20compounds that were missed. The HDAC inhibitors ex-ample thus shows that our new method is more sensitiveat detecting real connections. With the increased sen-sitivity and false connections being properly controlled,the potential beneﬁt of our method is obvious. In thisexample, the set-level analysis of Lamb et al’s Connec-tivity Map identiﬁed 8 compounds with a false discoveryrate of 12 .

5% (1/8), while the set-level analysis using ourmethod identiﬁed 24 compounds with a false discoveryrate of 4 .

2% (1/24). Based on the ﬁndings of the connec-tivity maps, researchers can prioritize a small sub set ofthose compounds for further investigations and/or devel-oping new biological hypotheses. For this example, usingLamb et al’s Connectivity Map, the chance of pursuinga false connection is 12 . .

2. Estrogens

The second example was a gene signature (Lamb’s genesignature sigs02) taken from a study [5] of MCF7 cellstreated with estradiol. This gene signature consisted of40 up- and 89 down-regulated genes represented by 189Aﬀymetrix probe-set IDs on the Aﬀymetrix HG-U133Amicroarrays. We queried both Lamb et al’s ConnectivityMap and the connectivity map presented in this paperwith this estrogen signature.For the same reason given above, we only used thetreatment set level results of Lamb et al’s ConnectivityMap, which identiﬁed 4 compounds (genistein, estradiol,tretinoin, and alpha-estradiol) as having signiﬁcant posi-tive connectivity with the estrogen signature; and 5 com-pounds (trichostatin A, fulvestrant, LY-294002, vorinos-tat, and geldanamycin) that had signiﬁcant negative con-nectivity.With our new connectivity map set-level analysis 16compounds were found to have signiﬁcant positive con-nection, and 25 compounds had signiﬁcant negativeconnection to the estrogen gene signature. The 16compounds with positive connection included genistein,estradiol, and alpha-estradiol, all known to be estrogenreceptor agonists. The 25 compounds with negative con-nection included fulvestrant, raloxifene and tamoxifen,all known to be estrogen receptor antagonists. In com-parison, the Lamb method identiﬁed all the estrogen re-ceptor agonists above, but missed all the estrogen recep-tor antagonists except fulvestrant. These results there-fore indicate the sensitivity of our new method is substan-tially increased. The Lamb method was able to detect thepure estrogen receptor antagonist fulvestrant, but missedthe two compounds tamoxifen and raloxifene which havemixed antagonist and agonist estrogen receptor activi-ties.The results from our set level analysis also suggestthat nordihydroguaiaretic acid (NDGA) has signiﬁcantpositive connection with estradiol. This connection issupported by recent studies [6, 7], where NDGA hasbeen shown to have estrogenic activity and able to elicitan estrogen-like response. Another compound monor-den (radicicol), suggested by our method as having neg-ative connection to the estrogen gene signature, has beenshown to repress the transcriptional function of the es-trogen receptor [8] which suggests that it may have someestrogen receptor antagonist-like properties. The tabu-lated results of querying both connectivity maps usingthe estrogen gene signature are included in the supple-mentary data. Fig.2 summarizes the numbers of signif-icant connections identiﬁed by both connectivity maps.All 9 compounds found signiﬁcant by Lamb et al’s Con-nectivity Map were also identiﬁed by the new method(either on the instance level or on the set level or both).However many compounds identiﬁed as signiﬁcant witheither positive or negative connection to estradiol usingour method were not identiﬁed by Lamb et al’s Con-nectivity Map, included amongst these were raloxifene,

FIG. 2: Estrogen gene signature: The Venn diagram sum-marizing the ﬁndings of signiﬁcant connections as identiﬁedby Lamb et al’s Connectivity Map and the new method here.The labelling follows the same conventions as in the previousﬁgure. tamoxifen, monorden, and NDGA.

3. Immunosuppressive therapy

For further testing we compiled a new gene signaturefrom a study on cardiac allograft rejection and responseto immunosuppressive therapy [9], where patients weretreated with standard immunosuppression with corticos-teroids, antimetabolites, calcineurin inhibitors, and/orsirolimus. This gene signature consisted of 40 Aﬀymetrixprobe set IDs (see Table 2 of [9]). Our set-level analy-sis identiﬁed 29 compounds as having signiﬁcant connec-tions with this gene signature. The three top compoundswere azathioprine, thalidomide, and rosiglitazone. Aza-thioprine is a commonly used immunosuppressive drug[10, 11], so its signiﬁcant positive connection with thegene signature is a good indication that our new connec-tivity map works very well here. The second compoundthalidomide, which had a positive connection score, alsohas known immunosuppressive activities [12], inhibits re-lease of TNF α from monocytes, and modulates othercytokine actions. The recognized properties of thesemolecules therefore accord with the outcome of the con-nectivity matching. The third compound rosiglitazonehad a negative connection with the signature suggestingit may have properties to reduce or mitigate the eﬀects ofimmunosuppressive activity. Recently, rosiglitazone wasreported to suppress cyclosporin-induced chronic trans-plant dysfunction and prolong survival of rat cardiac al-lografts (Chen et al, 2007, Transplantation) , where cy-closporin is also a commonly used immunosuppressivedrug [14].At the instance level, our method identiﬁed 89 ref-erence proﬁles as having signiﬁcant connections to theimmunosuppressive drug gene signature, representing 63distinct compounds. The top 3 compounds were aza-thioprine, staurosporine, and trichostatin A, which all achieved positive connection scores with this gene sig-nature. The second compound, staurosporine, a pro-tein kinase C inhibitor, is classiﬁed as an antineoplasticand immunosuppressive antibiotic drug [15]. The thirdcompound, trichostatin A, was recently shown to havesome immunosuppressive eﬀects in leukemia T cells [16].Therefore the method of instance testing could be partic-ularly valuable for the identiﬁcation of pharmacologicaland toxicological properties in novel molecules. IV. DISCUSSION

We have presented in this paper a new framework fora connectivity map, with the advantages that statisticalsigniﬁcance measures are calculated at both treatmentinstance level and treatment set level, thus providing ef-fective control over false connections. This importantsafeguard was not available in Lamb et al’s Connectiv-ity Map at the instance level, as revealed by the twoexamples of random gene signatures. As the connectiv-ity maps are most useful for high throughput screeningand for generating new biological hypothesis, it is crucialthat false connections are tightly controlled, such thatthe users do not end up chasing false connections andwasting time and money. We compared the performanceof Lamb et al’s Connectivity Map and the new methodpresented in this paper, using two gene signatures com-piled by Lamb et al and also a new gene signature forimmunosuppressive drugs. All these examples demon-strated that our new method is more sensitive and givesmore robust results than Lamb et al’s method.The set-level analysis of Lamb et al’s Connectivity Maptests an implicit null hypothesis that is not most appro-priate. This can be seen more clearly from recent studieson the signiﬁcance analysis of gene sets [17, 18]. Tian etal were among the ﬁrst groups of authors who made ex-plicit distinctions between two null hypotheses [17] con-cerning a set of entities ( a set of genes in context of Tianet al’s paper). In the present context of connectivity map,the two null hypotheses are:1. Hypothesis Q : The treatment instances in a setshow the same pattern of connections with the querygene signature compared with the rest of the treatmentinstances.2. Hypothesis Q : The treatment set does not containany treatment instances which have real connections withthe query gene signature.Tian et al’s discussions over the relationship between Q and Q [17] apply to a treatment set as follows: Givenall the 453 treatment instances, even if none of them havereal connection to the query gene signature, the observedconnection scores of a treatment set could still be verydiﬀerent from those treatment instances outside of theset because of the special correlation structure among thetreatment instances within the set. Chen et al also raisedsome concerns with testing the hypothesis Q (Chen et al,2007, Bioinformatics), that this hypothesis does not testwhether the treatment set has above-random connectionsto the query gene signature, but rather it tests whetherthe observed connections in the set are more or less thana randomly formed set of same size. We agree with Chenet al’s assessment on the inappropriateness of Q . It wasclear that Lamb et al’s Connectivity Map was testingthe null hypothesis Q . Here we tested Q which is moreappropriate.Our results indicate that the method presented herecan identify many signiﬁcant connections to a query genesignature. Then what criteria should we use and whichcompounds should we choose if new biological hypothe-ses are to be developed? Our suggestion is to concentratemore on those compounds which have many replicate in-stances in the connectivity map. Because the results ob-tained for those big treatment sets do not depend heavilyon the quality of a small number of treatment instances,as in the case of small treatment sets or singleton sets(treatment sets with only 1 instance each). Lamb et alalso recognized the importance of having replicate in-stances, and noted that the power to detect connectionsmight be greater for compounds with many replicates.In deﬁning a treatment set, ideally only the treatmentinstances of the same compound with the same dose andthe same cell type should be considered as a set. Forexample, the biological state of HL60 cells perturbed byraloxifene should be considered as a diﬀerent biologicalstate from that of MCF7 cells perturbed with the same raloxifene, thus these two instances of raloxifene shouldnot be put into the same set. In this paper for compar-ison purpose we adopted Lamb et al’s choice in deﬁninga treatment set, i.e., all the instances of a compoundswere grouped together as a set regardless of their possi-ble diﬀerences in dose or cell type. Mixing the instancesof a compound with diﬀerent doses and/or diﬀerent celltypes increases the heterogeneity of an otherwise morehomogenous treatment set. This tends to average outthe distinct characteristics attributable to the cell type ordosage diﬀerence, making some set-level connections in-signiﬁcant or their interpretation diﬃcult. In such casesthe instance level connections supported by statisticalsigniﬁcance can be of great help in interpreting the re-sults. For future connectivity maps, eﬀorts should bemade to provide as many more replicate treatment in-stances (replicates with the same compound, the samedose, and the same cell type, etc) so that the undesirablereliance on individual instances can be minimized. V. ACKNOWLEDGMENTS

We wish to acknowledge the support of the microarrayteam of the MRC Toxicology Unit. SDZ thanks QingWen for helpful discussions on a searching algorithm inthe implementation of the new connectivity map. [1] J. Lamb, E. D. Crawford, D. Peck, J. W. Modell, I. C.Blat, M. J. Wrobel, J. Lerner, J.-P. Brunet, A. Subrama-nian, K. N. Ross, et al., Science , 1929 (2006).[2] J. D. Storey and R. Tibshirani, Proc. Natl. Acad. Sci.USA , 9440 (2003).[3] S.-D. Zhang and T. W. Gant, Bioinformatics , 2821(2004).[4] K. B. Glaser, M. J. Staver, J. F. Waring, J. Stender,R. G. Ulrich, and S. K. Davidsen, Mol Cancer Ther ,151 (2003).[5] J. Frasor, F. Stossi, J. M. Danes, B. Komm, C. R. Lyttle,and B. S. Katzenellenbogen, Cancer Res , 1522 (2004).[6] N. Fujimoto, R. Kohta, S. Kitamura, and H. Honda, LifeSciences , 1417 (2004).[7] N. Sathyamoorthy, T. Wang, and J. Phang, Cancer Res. , 957 (1994).[8] M. Lee, E. Kim, H. Kwon, Y. Kim, H. Kang, H. Kang,and J. Lee, Mol Cell Endocrinol. , 47 (2002).[9] P. A. Horwitz, E. J. Tsai, M. E. Putt, J. M. Gilmore,J. J. Lepore, M. S. Parmacek, A. C. Kao, S. S. Desai,L. R. Goldberg, S. C. Brozena, et al., Circulation ,3815 (2004).[10] V. W. Armstrong and M. Oellerich, Clinical Biochem- istry , 9 (2001).[11] S. T. Matalon, A. Ornoy, and M. Lishner, ReproductiveToxicology , 219 (2004).[12] S. McHugh, I. Rifkin, J. Deighton, A. Wilson, P. Lach-mann, C. Lockwood, and P. Ewan, Clin Exp Immunol. , 160 (1995).[13] Y. Chen, Y. Liu, Z. Yuan, L. Tian, M. J. Dallman, P. W.Thompson, P. K. H. Tam, and J. R. Lamb, Transplanta-tion , 1602 (2007).[14] J. S. Warrington and L. M. Shaw, Expert Opinion onDrug Metabolism & Toxicology , 487 (2005).[15] C.-C. Ting, J. Wang, and M. E. Hargrove, Immunophar-macology , 119 (1995).[16] R. Januchowski and P. P. Jagodzinski, International Im-munopharmacology , 198 (2007).[17] L. Tian, S. A. Greenberg, S. W. Kong, J. Altschuler, I. S.Kohane, and P. J. Park, PNAS , 13544 (2005).[18] B. Efron and R. Tibshirani, Ann. Appl. Statist. , 107(2007).[19] J. J. Chen, T. Lee, R. R. Delongchamp, T. Chen, andC.-A. Tsai, Bioinformatics23