A simple and robust method for connecting small-molecule drugs using gene-expression signatures
aa r X i v : . [ q - b i o . Q M ] J a n A simple and robust method for connecting small-molecule drugs usinggene-expression signatures
Shu-Dong Zhang ∗ and Timothy W. Gant MRC Toxicology Unit, Hodgkin Building, Lancaster Road, University of Leicester, Leicester, UK
Interaction of a drug or chemical with a biological system can result in a gene-expression profileor signature characteristic of the event. Using a suitably robust algorithm these signatures canpotentially be used to connect molecules with similar pharmacological or toxicological properties.The Connectivity Map was a novel concept and innovative tool first introduced by Lamb et al toconnect small molecules, genes, and diseases using genomic signatures [Lamb et al (2006), Science313, 1929-1935]. However, the Connectivity Map had some limitations, particularly there wasno effective safeguard against false connections if the observed connections were considered onan individual-by-individual basis. Further when several connections to the same small-moleculecompound were viewed as a set, the implicit null hypothesis tested was not the most relevant onefor the discovery of real connections. Here we propose a simple and robust method for constructingthe reference gene-expression profiles and a new connection scoring scheme, which importantlyallows the valuation of statistical significance of all the connections observed. We tested the newmethod with the two example gene-signatures (HDAC inhibitors and Estrogens) used by Lamb etal and also a new gene signature of immunosuppressive drugs. Our testing with this new methodshows that it achieves a higher level of specificity and sensitivity than the original method. Forexample, our method successfully identified raloxifene and tamoxifen as having significant anti-estrogen effects, while Lamb et al’s Connectivity Map failed to identify these. With these propertiesour new method has potential use in drug development for the recognition of pharmacological andtoxicological properties in new drug candidates.
I. INTRODUCTION
One of the most fundamental challenges in all formsof ’omic technologies is the connection of biological eventsignatures with others previously derived to allow therecognition of new molecule properties or biological al-teration. Simple, robust, and efficient matching methodsare required to connect a new gene signature with thosein a database. This problem was first tackled by Lambet al [1] who introduced the Connectivity Map as a re-source and tool to connect small-molecule drugs, genes,and diseases. Lamb et al’s Connectivity Map achieveda good degree of success, but also suffered from severaldeficiencies, particularly an inability to apply a measureof statistical validity at the individual reference signaturelevel to allow rational filtering of the results to excludefalse connections. We took the method of Lamb as a basisfor development and have derived a simple, robust andstatistically testable method for making connections be-tween biological event signatures. The method was testedwith genomic signatures resulting from small molecule in-teractions in cells, but also could be applied to any formof signature such as those from proteomic or metabolomicscience.The main assumption behind the concept of Connec-tivity Map is that a biological state, whether physio-logical, pathological, or induced with chemical or ge-nomic perturbations, can be described in terms of a ge-nomic signature, eg., the genome-wide mRNA levels as ∗ Electronic address: [email protected] measured by DNA microarray technologies. The work-ing of the Connectivity Map involves several key com-ponents. First, a large collection of pre-built referencegene-expression profiles serve as the core database, whereeach reference profile characterizes a well-defined biolog-ical state. Secondly, a query gene signature from somespecific studies. A query gene signature is basically ashort (as compared to the list of genes in a typical refer-ence profile) list of genes most relevant and important tocharacterize the biological state of the researchers’ inter-est. Finally, a pattern matching algorithm or similaritymetric defined between a query gene signature and a ref-erence gene-expression profile to quantify the closeness orconnection between the two biological states. The Con-nectivity Map could be used by biomedical researchers tofind connections between the reference biological statesand those of their own interest. This may lead to newfindings of unexpected connections and testable new bi-ological hypotheses. In this paper, we present a newframework for the construction of reference profiles andnew connection scoring scheme and testing proceduresfor the observed connections. We compare our methodwith that of Lamb et al, and show that more robust re-sults are achieved using our method. In particular ourmethod not only offers more principled statistical pro-cedures for testing connections, but more importantlyit provides effective safeguard against false connectionswhile at the same time achieves much increased sensi-tivity. As a consequence it can benefit the end users bysaving them enormous amount of time and resources inpursuing new biological hypotheses based on the findingsof connectivity maps.
II. METHODSA. Construction of reference profiles
For the first-generation connectivity map, Lamb etal carried out a series of gene-expression profiling ex-periments [1], using 164 distinct small-molecule com-pounds in a few selected human cell lines. Each treat-ment instance consisted of one treatment sample and one(or more) vehicle control samples, whose genome widemRNA levels were measured using Affymetrix GeneChipmicroarrays. In total 564 samples were microarrayed,which represented 453 different treatment instances. Forexample, treatment instance ID988 consisted of 1 treat-ment sample and 6 vehicle control samples. The treat-ment sample was obtained by treating human MCF7 cellswith 100 nM estradiol for 6 hours. The control sampleswere obtained by treating MCF7 cells with vehicle controlfor 6 hours. A gene-expression profile was constructed foreach treatment instance, in which the relative expression(treatment relative to the control) of all measured geneswere specified, and sorted in descending order. A querygene signature, obtained and ordered in the same man-ner, can be compared to each reference profile in the Con-nectivity Map to calculate a connectivity score. For “ pos-itive connectivity ”, the up-regulated genes of the querysignature find matches near the top of the reference pro-file, and the down-regulated genes find matches near thebottom of the reference profile. For “ negative connectiv-ity ”, the matches are opposite.We obtained Lamb et al’s data set (Accession NumberGSE5258) from the GEO (Gene Expression Omnibus)database, and rebuilt 453 reference gene-expression pro-files using a new ranking scheme based on the follow-ing guiding principles: (1) A treatment instance was de-fined relative to a control, thus the effect of the treat-ment could be characterized by the relative differentialexpression status of all the genes together, (2) differentgenes were affected to different extents by the treatment,so genes which showed a greater differential expressionshould have more weight in characterizing this treatment,and (3) up- and down-regulated genes should be treatedequally in a unified manner. This meant that a 2-folddown-regulated gene was considered as equally impor-tant as a 2-fold up-regulated gene in defining a referenceprofile. There are several reasons for the choice of treat-ing up- and down-regulated genes equally. Theoretically,unless we have a lot of further information about so manygenes on the microarray it is difficult to decide whetherthis 2-fold up-regulated gene is more important than that2-fold down-regulated gene or the opposite. So a logicalchoice is to assign them equal weights. Another reasonis the consideration of symmetry: if a gene is 2-fold up-regulated in sample 1 versus sample 2, it can also beviewed as 2-fold down-regulated in sample 2 versus sam-ple 1. We should emphasize that assigning two genesequal weights does not imply in any sense they sharethe same molecular mechanism. Even two up-regulated genes with the same fold changes could have very differ-ent molecular mechanisms as to why they increased theirexpression. To adhere to the above guidelines, an obviouschoice for organizing the genes is the logarithm of the ex-pression ratio (treatment over control). Thus instead oftreating the down- and up-regulated genes separately asin the methods of Lamb et al, we ordered genes in a refer-ence profile by the absolute value of their expression log-ratios. Therefore the most differentially expressed genes(either up or down) appear first in the list, and thosenon-differentially expressed genes appear at the bottomof the list. In this way, the genes are ordered by theirimportance in defining the reference profile. It is thenstraightforward to assign ranks to them. Suppose thereare in total N genes, the first gene in the list will be as-signed a rank N if it is up-regulated, or a rank − N if it isdown-regulated. In general the i th gene in the list will beranked with ( N − i + 1) for up-regulation or − ( N − i + 1)for down-regulation. With this new ranking method, theimportance of a gene is reflected by the absolute value ofits rank, while the sign of its rank indicates its regulationstatus. The consequence and advantage of this methodfor creating reference profiles is that attaching statisti-cal significance to the connection observed is a relativelystraightforward step. B. The new scoring scheme
A query gene signature can be an ordered gene list,or just a collection of genes without specific orderingamong them. We will refer to these two types of querygene signatures as ordered and unordered gene signaturerespectively. For an ordered gene signature, we rankthe genes in the list in the same way as for a referenceprofile. Namely, the most important (differentially ex-pressed) gene in the signature will be assigned a rank m or − m depending on its regulation status, where m isthe number of genes in the signature. While the leastimportant gene in the signature be ranked 1 or − R denote a reference gene-expression profile, and s a query gene signature. We define the connection strength between R and s as C ( R , s ) = m X i =1 R ( g i ) s ( g i ) , (1)where g i represents the i th gene in the signature, s ( g i ) isits signed rank in the signature, and R ( g i ) is this gene’ssigned rank in the reference profile. It is worth not-ing some properties of the connection strength definedabove: (1) if a gene has the same regulation status (ei-ther up- or down-regulation) in both the reference andthe query, it will make a positive contribution to theconnection strength, otherwise its contribution will benegative; (2) the magnitude of a gene’s contribution tothe connection strength is determined by its position inboth lists; and (3) a gene signature with some of its genescontributing positively and others negatively will have anoverall low connection strength, because the positive andnegative contributions cancel each other to some extent.Therefore calculating the connection strength between agene signature and a reference profile, the maximum con-nection strength achievable is the situation where the m genes in the signature match the first m genes in the ref-erence profile in the correct order, and their regulationstatus also match. In such a case, the maximum positiveconnection strength is, for an ordered gene signature, C omax ( N, m ) = m X i =1 ( N − i + 1)( m − i + 1) . (2)In another equally interesting situation, where the m genes in the signature match the first m genes in thereference profile in the right order, but the sign of eachgene in the query is different from its sign in the reference,the connection strength is − C omax ( N, m ), the opposite ofEq.(2).For an unordered query signature, all the genes inthe list have equal weight because there is no particu-lar ordering among them. The calculation of connectionstrength is the same as Eq.(1), the only difference beingthat s ( g i ) = 1 if gene g i is up-regulated, or s ( g i ) = − unordered signatureis C umax ( N, m ) = m X i =1 ( N − i + 1) . (3)Given a query signature gene and a reference gene-expression profile, we can calculate their connection scoreby c = C ( R , s ) /C max ( N, m ) . (4)So a connection score c = 1 means that the gene signaturehas the maximum positive connection strength with thereference profile, which indicates that the experimentalcondition that gave rise to this gene signature had thestrongest possible correlation with the treatment instancethat generated the reference profile. A connection score c = − c will be within the range of ( − , C. Connection Testing
As for most biomedical experiments with unavoidablebiological and technical variation, statistical significanceis a crucial aid to the interpretation and subsequent val-idation of the result. Here we propose calculating thep-value associated with a connection score by testing thefollowing null hypothesis.
Null hypothesis H : For a reference gene-expressionprofile R and a query gene signature s , the null hypoth-esis H states that there is no underlying biological con-nection between the two, and that the query signature s is merely a random m -gene signature, as generated byProcedure 1 described below. Procedure 1:
Generation of a random m -gene signa-ture. Let R be a given reference gene-expression profileof N genes. Select m genes sequentially and randomlyfrom the N genes (without replacement), and assign +1(up-regulation) or − m selected genes. If thisgene signature is to be used as an ordered list, its orderis just the order in which the m genes are selected; or ifthis gene signature is to be used as merely a collection ofgenes, then order is irrelevant.Given a reference profile R and a gene signature s , wecalculate their connection score c by Eq.(4), and the two-tailed p-value associated with this observed connectionscore is p = Prob {| ˜ c | ≥ | c | | H } , (5)where ˜ c is the connection score between a random genesignature and the reference profile. To estimate the p-value, a large number (eg., 10 ) of random connectionscores can be generated using Procedure 1 and Eq.(4),the proportion of random scores that are no less thanthe observed scores c in absolute value is an estimate ofthe two-tailed p-value.The 453 individual treatment instances of the dataset GSE5258 were created using only 164 distinct small-molecule compounds. Some treatment instances werereplication experiments using the same compound at thesame or different doses. It is thus interesting to considerall the treatment instances of the same compound as aset, and to assess the overall connection of the set witha query gene signature. We define the connection scorefor a treatment set as follows t = 1 n n X i =1 c i , (6)where n is the number of individual treatment instancesbelonging to the treatment set, c i is the connection scoreof the i th instance. To test the significance of a treatmentset as a whole. We used the following null hypothesis, Null hypothesis H set : Where T denotes a set oftreatment instances, R i the reference profile based ontreatment instance i , and s a query gene signature, thenull hypothesis H set states that there is no underlyingbiological connection between the gene signature s andany of the reference profiles in T. The query signature ismerely a random gene list generated by Procedure 1.Thus the null hypothesis H set is an extension of H to a higher level. Alternatively, H can be viewed as aspecial case of H set , in which there is only one treat-ment instance in the treatment set. Once the connectionscore for a set is observed, its associated p value can TABLE I: The connectivity scores of the random gene signa-ture rds01 to the reference profiles in the Lamb ConnectivityMap. rank ID compound score up down1 1080 sirolimus 1 0.232 -0.5782 913 colforsin 0.953 0.245 -0.5273 1138 phentolamine 0.912 0.316 -0.4234 1048 alpha-estradiol 0.886 0.324 -0.3945 1115 phenanthridinone 0.869 0.379 -0.325 · · · · · · · · · · · · · · · · · ·
112 885 5186223 0.363 0.137 -0.157113 3 metformin 0.342 0.158 -0.119114 663 U0125 0 0.405 0.194115 124 mesalazine 0 0.371 0.256 · · · · · · · · · · · · · · · · · ·
369 1008 geldanamycin 0 -0.43 -0.339370 1064 17AAG 0 -0.436 -0.395371 605 monastrol -0.38 -0.114 0.177372 494 fluphenazine -0.392 -0.113 0.187 · · · · · · · · · · · · · · · · · ·
449 601 MK-886 -0.834 -0.303 0.336450 604 arachidonic acid -0.855 -0.301 0.354451 387 estradiol -0.901 -0.162 0.528452 379 cobalt chloride -0.916 -0.238 0.464453 378 tacrolimus -1 -0.23 0.536 be estimated in a similar way: a large number of randomgene signatures are generated using Procedure 1, and theconnection score of the set to each of the random genesignatures is calculated using Eq.(6); the proportion ofrandom connection scores that are greater than the ob-served score in absolute value is an estimation of the pvalue.
III. RESULTSA. Querying the connectivity maps with randomgene signatures
To compare the specificities of Lamb et al’s Connectiv-ity Map and the connectivity map presented here, we gen-erated random gene signatures and tested these randomgene signatures in both. The first example was a randomgene signature, rds01, which contained 25 Affymetrixprobe-set IDs randomly selected from the 22283 IDs onthe Affymetrix HG-U133A microarray platform. Query-ing Lamb et al’s Connectivity Map with this signature,we obtained the connectivity scores of rds01 to all the 453individual treatment instances. The results are shown inTable I, which suggests that this signature has positiveconnections to 113 individual reference profiles, with con-nectivity scores ranging from 1 to 0 . − − . /N , where N is the number of nullhypotheses being tested simultaneously. In this exam-ple, N = 453 (the total number of individual treatmentinstances in the database and hence the number of nullhypothesis being tested at the treatment instance level).So a connection with a p value p < /
453 = 0 . /N was intended to control theexpected number of false connections at 1, we thus ex-pected to have one false connection on average among allthe connections declared as significant. Table II showsthat our connectivity map gave the correct result, ie.,no significant connection between this random signatureand any of the treatment instances. This compares withthe 196 connections (113 positive and 83 negative) foundwith the method of Lamb et al. Note that to control theexpected number of false connections more precisely, thethreshold hold p value should be set at 1 /N , where N is the number of true null hypothesis. Of course in a gen-eral situation N is unknown, so it has to be estimated,for example, using the methods developed in [2, 3]. Inthis paper, we set threshold p value at 1 /N for simplicity.Since N ≤ N , our criteria tend to be slightly conserva-tive, meaning that the actual number of false connectionson average will be ≤ p < /
453 = 0 . TABLE II: The connection scores for the random gene signa-ture rds01 using the new methods. To control the expectednumber of false connections at 1, the threshold p value was setat 1 /
453 = 0 . · · · · · · · · · · · ·
451 felodipine 848 -0.002 0.988452 estradiol 988 -0.001 0.991453 bucladesine 959 0.001 0.993 values were combined to give a single connectivity score.This way of defining the connection score was unneces-sarily complex, which made the calculation of a possiblep-value on an individual level difficult, if not impossible.On the treatment set level, Lamb et al’s ConnectivityMap provides a permutation p value when a set of treat-ment instances associated with the same compound wereviewed as a whole. The permutation p value was calcu-lated by comparing a statistic (again based on K-S statis-tic) of the treatment set with the distribution of manyrandom-set statistics. Those random sets were formedby randomly selecting treatment instances from all in-stances. From the way the permutation p value was cal-culated, it was clear that the following null hypothesiswas being tested:
The set of instances in question havethe same pattern (distribution) of connections with thequery gene signature compared with a randomly formedset of the same size.
A null hypothesis such as this is notthe most appropriate one to test, because it is not di-rectly relevant to the question of whether the query genesignature had any real biological connection to the treat-ment set. Lamb et al themselves also recognized thatmore rigorous methods for the estimation of statisticalsignificance were probably required[1]. Putting its appro-priateness aside, if we do use the permutation p valuesfrom Lamb et al’s Connectivity Map to control false con-nections, the same criteria for setting threshold p valuecan be used. The results of significance analysis on theconnections between the two random gene signatures andthe 164 treatments sets can be found in the supplemen-tary data. In this instance both connectivity maps gavethe right answers for these two random gene signatures.However, for Lamb et al’s Connectivity Map permutationp values were not available for all the 164 treatment sets.For those treatment sets which only contain 1 treatmentinstance or those sets with mean connectivity score 0, nopermutation p values could be calculated, and hence nostatistical significance. This problem affects the coverageand consequently the sensitivity of Lamb et al’s Connec-tivity Map, because real biological connections between
FIG. 1: HDAC inhibitor gene signature: The Venn diagramsummarizing the findings of significant connections as identi-fied by the connectivity maps. A represents the instance-levelanalysis using the new method; B, the set-level analysis usingthe new method; C, the set-level analysis using Lamb et al’sConnectivity Map. The label “AB 13” means that 13 com-pounds are identified as significant solely by A and B (notC), “B 4” means that 4 compounds are identified as signif-icant solely by B (not A, not C), and so on. The areas areapproximately proportional to the numbers they represent. a query gene signature and any of those treatment setsmay not be recognized.
B. Querying the connectivity maps withexperimentally derived real gene signatures
1. HDAC inhibitors
To test the ability of the new method for identifyingreal biological connections we utilized some of the sameexamples used in [1] to compare with Lamb et al’s Con-nectivity Map. The first example was a gene signatureof histone deacetylase (HDAC) inhibitors (Lamb’s genesignature sigs01), which was compiled from [4] on the re-sponses of T24, MDA435 and MDA468 cells respectivelyto three histone deacetylase (HDAC) inhibitors: vorino-stat, MS-27-275, and trichostatin A. This gene signatureconsisted of 8 up- and 5 down-regulated genes, repre-sented by 25 Affymetrix probe-set IDs on the AffymetrixHG-U133A microarrays.As Lamb et al’s Connectivity Map does not provideany safeguard on individual treatment instance levelagainst possible false connections, we can only use theresults of Lamb et al’s Connectivity Map on the treat-ment set level. In total 6 compounds (vorinostat, tricho-statin A, resveratrol, geldanamycin, valproic acid, and17AAG) were identified to have significant positive con-nectivity with the HDAC inhibitor gene signature; and2 compounds (5182598 and fludrocortisone) had signif-icant negative connectivity. Vorinostat, trichostatin A,and valproic acid are known HDAC inhibitors thus theidentification of these can be regarded as a success. How-ever another known HDAC inhibitor HC-toxin, a refer-ence profile of which was contained in the database, wasnot identified. This happened because there was onlyone treatment instance of the compound HC-toxin in thedatabase and so no permutation p value could be ob-tained using Lamb et al’s method. Based on their in-stance level results, Lamb et al highlighted HC-Toxin in[1] as it had the 7th highest connectivity score (0.914)of all instances in the dataset. However, as the two ex-amples of random gene signatures showed Lamb et al’sinstance level analysis gave false connections even for thehighest connectivity scores 1.0. So the rational choice isto disregard the instance level results from Lamb et al’sConnectivity Map.We used this same HDAC inhibitor gene signature toquery the new connectivity map presented here, with pvalues calculated at both the individual instance leveland the treatment set level. 56 treatment instances, rep-resenting 22 distinct compounds, were found to have sig-nificant connections to sigs01. On the treatment set level24 compounds were found to have significant connec-tions with the signature. Near the top of the significantconnection list were those known HDAC inhibitors high-lighted in [1]. Importantly though also included in theoutput was HC-Toxin, which was not identified by theset-level analysis Lamb et al’s Connectivity Map. Thefull results of querying both connectivity maps using theHDAC inhibitor gene signature are included in the sup-plementary data. In Fig. 1, we summarize the numberof significant connections as identified by: (A) Instancelevel analysis using the new method presented here; (B)Set level analysis using the new method; (C) Set levelanalysis using Lamb et al’s Connectivity Map. In total,our method (A) and (B) combined identified 27 com-pounds and Lamb et al’s Connectivity Map identified8 compounds, as having significant connections to theHDAC inhibitors. Our method missed only 1 of the 8compounds found significant by Lamb et al’s Connectiv-ity Map, while the latter missed 20 of the 27 compoundsidentified by our method, with HC-toxin among the 20compounds that were missed. The HDAC inhibitors ex-ample thus shows that our new method is more sensitiveat detecting real connections. With the increased sen-sitivity and false connections being properly controlled,the potential benefit of our method is obvious. In thisexample, the set-level analysis of Lamb et al’s Connec-tivity Map identified 8 compounds with a false discoveryrate of 12 .
5% (1/8), while the set-level analysis using ourmethod identified 24 compounds with a false discoveryrate of 4 .
2% (1/24). Based on the findings of the connec-tivity maps, researchers can prioritize a small sub set ofthose compounds for further investigations and/or devel-oping new biological hypotheses. For this example, usingLamb et al’s Connectivity Map, the chance of pursuinga false connection is 12 . .
2. Estrogens
The second example was a gene signature (Lamb’s genesignature sigs02) taken from a study [5] of MCF7 cellstreated with estradiol. This gene signature consisted of40 up- and 89 down-regulated genes represented by 189Affymetrix probe-set IDs on the Affymetrix HG-U133Amicroarrays. We queried both Lamb et al’s ConnectivityMap and the connectivity map presented in this paperwith this estrogen signature.For the same reason given above, we only used thetreatment set level results of Lamb et al’s ConnectivityMap, which identified 4 compounds (genistein, estradiol,tretinoin, and alpha-estradiol) as having significant posi-tive connectivity with the estrogen signature; and 5 com-pounds (trichostatin A, fulvestrant, LY-294002, vorinos-tat, and geldanamycin) that had significant negative con-nectivity.With our new connectivity map set-level analysis 16compounds were found to have significant positive con-nection, and 25 compounds had significant negativeconnection to the estrogen gene signature. The 16compounds with positive connection included genistein,estradiol, and alpha-estradiol, all known to be estrogenreceptor agonists. The 25 compounds with negative con-nection included fulvestrant, raloxifene and tamoxifen,all known to be estrogen receptor antagonists. In com-parison, the Lamb method identified all the estrogen re-ceptor agonists above, but missed all the estrogen recep-tor antagonists except fulvestrant. These results there-fore indicate the sensitivity of our new method is substan-tially increased. The Lamb method was able to detect thepure estrogen receptor antagonist fulvestrant, but missedthe two compounds tamoxifen and raloxifene which havemixed antagonist and agonist estrogen receptor activi-ties.The results from our set level analysis also suggestthat nordihydroguaiaretic acid (NDGA) has significantpositive connection with estradiol. This connection issupported by recent studies [6, 7], where NDGA hasbeen shown to have estrogenic activity and able to elicitan estrogen-like response. Another compound monor-den (radicicol), suggested by our method as having neg-ative connection to the estrogen gene signature, has beenshown to repress the transcriptional function of the es-trogen receptor [8] which suggests that it may have someestrogen receptor antagonist-like properties. The tabu-lated results of querying both connectivity maps usingthe estrogen gene signature are included in the supple-mentary data. Fig.2 summarizes the numbers of signif-icant connections identified by both connectivity maps.All 9 compounds found significant by Lamb et al’s Con-nectivity Map were also identified by the new method(either on the instance level or on the set level or both).However many compounds identified as significant witheither positive or negative connection to estradiol usingour method were not identified by Lamb et al’s Con-nectivity Map, included amongst these were raloxifene,
FIG. 2: Estrogen gene signature: The Venn diagram sum-marizing the findings of significant connections as identifiedby Lamb et al’s Connectivity Map and the new method here.The labelling follows the same conventions as in the previousfigure. tamoxifen, monorden, and NDGA.
3. Immunosuppressive therapy
For further testing we compiled a new gene signaturefrom a study on cardiac allograft rejection and responseto immunosuppressive therapy [9], where patients weretreated with standard immunosuppression with corticos-teroids, antimetabolites, calcineurin inhibitors, and/orsirolimus. This gene signature consisted of 40 Affymetrixprobe set IDs (see Table 2 of [9]). Our set-level analy-sis identified 29 compounds as having significant connec-tions with this gene signature. The three top compoundswere azathioprine, thalidomide, and rosiglitazone. Aza-thioprine is a commonly used immunosuppressive drug[10, 11], so its significant positive connection with thegene signature is a good indication that our new connec-tivity map works very well here. The second compoundthalidomide, which had a positive connection score, alsohas known immunosuppressive activities [12], inhibits re-lease of TNF α from monocytes, and modulates othercytokine actions. The recognized properties of thesemolecules therefore accord with the outcome of the con-nectivity matching. The third compound rosiglitazonehad a negative connection with the signature suggestingit may have properties to reduce or mitigate the effects ofimmunosuppressive activity. Recently, rosiglitazone wasreported to suppress cyclosporin-induced chronic trans-plant dysfunction and prolong survival of rat cardiac al-lografts (Chen et al, 2007, Transplantation) , where cy-closporin is also a commonly used immunosuppressivedrug [14].At the instance level, our method identified 89 ref-erence profiles as having significant connections to theimmunosuppressive drug gene signature, representing 63distinct compounds. The top 3 compounds were aza-thioprine, staurosporine, and trichostatin A, which all achieved positive connection scores with this gene sig-nature. The second compound, staurosporine, a pro-tein kinase C inhibitor, is classified as an antineoplasticand immunosuppressive antibiotic drug [15]. The thirdcompound, trichostatin A, was recently shown to havesome immunosuppressive effects in leukemia T cells [16].Therefore the method of instance testing could be partic-ularly valuable for the identification of pharmacologicaland toxicological properties in novel molecules. IV. DISCUSSION
We have presented in this paper a new framework fora connectivity map, with the advantages that statisticalsignificance measures are calculated at both treatmentinstance level and treatment set level, thus providing ef-fective control over false connections. This importantsafeguard was not available in Lamb et al’s Connectiv-ity Map at the instance level, as revealed by the twoexamples of random gene signatures. As the connectiv-ity maps are most useful for high throughput screeningand for generating new biological hypothesis, it is crucialthat false connections are tightly controlled, such thatthe users do not end up chasing false connections andwasting time and money. We compared the performanceof Lamb et al’s Connectivity Map and the new methodpresented in this paper, using two gene signatures com-piled by Lamb et al and also a new gene signature forimmunosuppressive drugs. All these examples demon-strated that our new method is more sensitive and givesmore robust results than Lamb et al’s method.The set-level analysis of Lamb et al’s Connectivity Maptests an implicit null hypothesis that is not most appro-priate. This can be seen more clearly from recent studieson the significance analysis of gene sets [17, 18]. Tian etal were among the first groups of authors who made ex-plicit distinctions between two null hypotheses [17] con-cerning a set of entities ( a set of genes in context of Tianet al’s paper). In the present context of connectivity map,the two null hypotheses are:1. Hypothesis Q : The treatment instances in a setshow the same pattern of connections with the querygene signature compared with the rest of the treatmentinstances.2. Hypothesis Q : The treatment set does not containany treatment instances which have real connections withthe query gene signature.Tian et al’s discussions over the relationship between Q and Q [17] apply to a treatment set as follows: Givenall the 453 treatment instances, even if none of them havereal connection to the query gene signature, the observedconnection scores of a treatment set could still be verydifferent from those treatment instances outside of theset because of the special correlation structure among thetreatment instances within the set. Chen et al also raisedsome concerns with testing the hypothesis Q (Chen et al,2007, Bioinformatics), that this hypothesis does not testwhether the treatment set has above-random connectionsto the query gene signature, but rather it tests whetherthe observed connections in the set are more or less thana randomly formed set of same size. We agree with Chenet al’s assessment on the inappropriateness of Q . It wasclear that Lamb et al’s Connectivity Map was testingthe null hypothesis Q . Here we tested Q which is moreappropriate.Our results indicate that the method presented herecan identify many significant connections to a query genesignature. Then what criteria should we use and whichcompounds should we choose if new biological hypothe-ses are to be developed? Our suggestion is to concentratemore on those compounds which have many replicate in-stances in the connectivity map. Because the results ob-tained for those big treatment sets do not depend heavilyon the quality of a small number of treatment instances,as in the case of small treatment sets or singleton sets(treatment sets with only 1 instance each). Lamb et alalso recognized the importance of having replicate in-stances, and noted that the power to detect connectionsmight be greater for compounds with many replicates.In defining a treatment set, ideally only the treatmentinstances of the same compound with the same dose andthe same cell type should be considered as a set. Forexample, the biological state of HL60 cells perturbed byraloxifene should be considered as a different biologicalstate from that of MCF7 cells perturbed with the same raloxifene, thus these two instances of raloxifene shouldnot be put into the same set. In this paper for compar-ison purpose we adopted Lamb et al’s choice in defininga treatment set, i.e., all the instances of a compoundswere grouped together as a set regardless of their possi-ble differences in dose or cell type. Mixing the instancesof a compound with different doses and/or different celltypes increases the heterogeneity of an otherwise morehomogenous treatment set. This tends to average outthe distinct characteristics attributable to the cell type ordosage difference, making some set-level connections in-significant or their interpretation difficult. In such casesthe instance level connections supported by statisticalsignificance can be of great help in interpreting the re-sults. For future connectivity maps, efforts should bemade to provide as many more replicate treatment in-stances (replicates with the same compound, the samedose, and the same cell type, etc) so that the undesirablereliance on individual instances can be minimized. V. ACKNOWLEDGMENTS
We wish to acknowledge the support of the microarrayteam of the MRC Toxicology Unit. SDZ thanks QingWen for helpful discussions on a searching algorithm inthe implementation of the new connectivity map. [1] J. Lamb, E. D. Crawford, D. Peck, J. W. Modell, I. C.Blat, M. J. Wrobel, J. Lerner, J.-P. Brunet, A. Subrama-nian, K. N. Ross, et al., Science , 1929 (2006).[2] J. D. Storey and R. Tibshirani, Proc. Natl. Acad. Sci.USA , 9440 (2003).[3] S.-D. Zhang and T. W. Gant, Bioinformatics , 2821(2004).[4] K. B. Glaser, M. J. Staver, J. F. Waring, J. Stender,R. G. Ulrich, and S. K. Davidsen, Mol Cancer Ther ,151 (2003).[5] J. Frasor, F. Stossi, J. M. Danes, B. Komm, C. R. Lyttle,and B. S. Katzenellenbogen, Cancer Res , 1522 (2004).[6] N. Fujimoto, R. Kohta, S. Kitamura, and H. Honda, LifeSciences , 1417 (2004).[7] N. Sathyamoorthy, T. Wang, and J. Phang, Cancer Res. , 957 (1994).[8] M. Lee, E. Kim, H. Kwon, Y. Kim, H. Kang, H. Kang,and J. Lee, Mol Cell Endocrinol. , 47 (2002).[9] P. A. Horwitz, E. J. Tsai, M. E. Putt, J. M. Gilmore,J. J. Lepore, M. S. Parmacek, A. C. Kao, S. S. Desai,L. R. Goldberg, S. C. Brozena, et al., Circulation ,3815 (2004).[10] V. W. Armstrong and M. Oellerich, Clinical Biochem- istry , 9 (2001).[11] S. T. Matalon, A. Ornoy, and M. Lishner, ReproductiveToxicology , 219 (2004).[12] S. McHugh, I. Rifkin, J. Deighton, A. Wilson, P. Lach-mann, C. Lockwood, and P. Ewan, Clin Exp Immunol. , 160 (1995).[13] Y. Chen, Y. Liu, Z. Yuan, L. Tian, M. J. Dallman, P. W.Thompson, P. K. H. Tam, and J. R. Lamb, Transplanta-tion , 1602 (2007).[14] J. S. Warrington and L. M. Shaw, Expert Opinion onDrug Metabolism & Toxicology , 487 (2005).[15] C.-C. Ting, J. Wang, and M. E. Hargrove, Immunophar-macology , 119 (1995).[16] R. Januchowski and P. P. Jagodzinski, International Im-munopharmacology , 198 (2007).[17] L. Tian, S. A. Greenberg, S. W. Kong, J. Altschuler, I. S.Kohane, and P. J. Park, PNAS , 13544 (2005).[18] B. Efron and R. Tibshirani, Ann. Appl. Statist. , 107(2007).[19] J. J. Chen, T. Lee, R. R. Delongchamp, T. Chen, andC.-A. Tsai, Bioinformatics23