Reconciling Multiple Connectivity Scores for Drug Repurposing

Abstract

The basis of several recent methods for drug repurposing is the key principle that an efficacious drug will reverse the disease molecular 'signature' with minimal side-effects. This principle was defined and popularized by the influential 'connectivity map' study in 2006 regarding reversal relationships between disease- and drug-induced gene expression profiles, quantified by a disease-drug 'connectivity score.' Over the past 15 years, several studies have proposed variations in calculating connectivity scores towards improving accuracy and robustness in light of massive growth in reference drug profiles. However, these variations have been formulated inconsistently using various notations and terminologies even though they are based on a common set of conceptual and statistical ideas. Therefore, we present a systematic reconciliation of multiple disease-drug similarity metrics (ES, css, Sum, Cosine, XSum, XCor, XSpe, XCos, EWCos) and connectivity scores (CS, RGES, NCS, WCS, Tau, CSS, EMUDRA) by defining them using consistent notation and terminology. In addition to providing clarity and deeper insights, this coherent definition of connectivity scores and their relationships provides a unified scheme that newer methods can adopt, enabling the computational drug-development community to compare and investigate different approaches easily. To facilitate the continuous and transparent integration of newer methods, this article will be available as a live document (https://jravilab.github.io/connectivity_score_review) coupled with a GitHub repository (https://github.com/jravilab/connectivity_score_review) that any researcher can build on and push changes to.


Reconciling Multiple Connectivity Scores for Drug Repurposing

Kewalin Samart*, Phoebe Tuyishime*, Arjun Krishnan, Janani Ravi

Pathobiology and Diagnostic Investigation, Michigan State University; Mathematics, Michigan State University; Food Science and Nutrition, Michigan State University; Computational Mathematics, Science and Engineering; Biochemistry and Molecular Biology, Michigan State University.

*These authors contributed equally to this work.

Corresponding authors: [email protected]; [email protected].

Abstract

The basis of several recent methods for drug repurposing is the key principle that an efficacious drug will reverse the disease molecular ‘signature’ with minimal side-effects. This principle was defined and popularized by the influential ‘connectivity map’ study in 2006 regarding reversal relationships between disease- and drug-induced gene expression profiles, quantified by a disease-drug ‘connectivity score.’ Over the past 14 years, several studies have proposed variations in calculating connectivity scores towards improving accuracy and robustness in light of massive growth in reference drug profiles. However, these variations have been formulated inconsistently using varied notations and terminologies even though they are based on a common set of conceptual and statistical ideas. Therefore, we present a systematic reconciliation of multiple disease-drug connectivity scores by defining them using consistent notation and terminology. In addition to providing clarity and deeper insights, this coherent definition of connectivity scores and their relationships provides a unified scheme that newer methods can adopt, enabling the computational drug-development community to compare and investigate different approaches easily. To facilitate the continuous and transparent integration of newer methods, this review will be available as a live document (https://jravilab.github.io/connectivity_score_review) coupled with a GitHub repository (https://github.com/jravilab/connectivity_score_review) that any researcher can build on and push changes to.

Keywords drug repurposing | disease gene signature | drug profile | connectivity mapping | transcriptomics

Key points

• Connectivity mapping is a powerful approach for drug repurposing based on finding drugs that reverse the transcriptional signature of a disease, quantified by a connectivity score.
• Although a number of connectivity scores have been proposed, they have been described using inconsistent notations and terminologies to refer to a common set of concepts and ideas.
• Here, we present a coherent definition of multiple connectivity scores using a unified notation and terminology, and delineate the relationships between these scores.
• Our unified scheme can be adopted easily by newer methods and used for systematic comparisons.
• The live document and GitHub repository will allow continuous incorporation of newer methods.

Introduction

The manifestation of a disease or perturbation by a small molecule in a tissue leaves a characteristic imprint (a “signature”) in its gene expression profile [1]. These signatures, recorded for thousands of diseases and drugs, form the basis of a powerful and widely-adopted method for drug repurposing called “drug-disease connectivity analysis” [2]. In this analysis, novel drug indications for a specific disease of interest are identified based on the extent to which the ranked drug-gene profile is a “reversal” of the disease gene signature [3,4] (Fig. 1). Connectivity-based drug repurposing has been used to discover drug candidates for various cancers and non-cancer diseases [5].

Figure 1. Drug-disease connectivity. A. Gene expression signatures. $L$ is the drug gene-expression profile containing the rank-ordered list of genes going from the most significantly up-regulated gene to the most significantly down-regulated gene. $S$ is the gene set for the disease of interest, with $S_{up}$ containing the set of up-regulated genes and $S_{down}$ containing the set of down-regulated genes. B. Connectivity. Positions of $S_{up}$ and $S_{down}$ disease genes in the ranked drug list, $L$, determine the signs and magnitudes of enrichment scores ($ES$; $ES_{up}$, $ES_{down}$). Positive connectivity is defined as the case when the disease signature and drug profile show similar perturbations, i.e. when $ES_{up}$ is positive and/or when $ES_{down}$ is negative. This happens when $S_{up}$ predominantly appears towards the top of the drug profile or when $S_{down}$ appears predominantly towards the bottom of the drug profile (scenarios 1 and 4). Conversely, negative connectivity is defined as the case when the disease signature and drug profile show dissimilar perturbations, i.e. when $ES_{up}$ is negative and/or when $ES_{down}$ is positive. This happens when $S_{up}$ predominantly appears towards the bottom of the drug profile or when $S_{down}$ appears towards the top of the drug profile (scenarios 2 and 3). Negative connectivity indicates drug reversal of the disease signature.

Since its inception in 2006, the exact method for connectivity analysis has evolved, with a series of proposed modifications over the past decade and a half (Fig. 2A). The first method for connectivity analysis [6] builds on the seminal paper by Subramanian et al., 2005 [7] that proposed the Gene Set Enrichment Analysis (GSEA) method. GSEA uses a modified Kolmogorov-Smirnov statistic [8], referred to as the “enrichment score” ($ES$), to evaluate if genes in a certain pathway appear towards the top or bottom of a gene (differential) expression profile. Lamb et al., 2006 [6] built a reference database (CMap, which we refer to as CMap 1.0 in this review) with gene expression profiles for 1000s of small molecules and proposed the first method for connectivity analysis based on GSEA. This method compares a query signature (disease) to each of the ranked drug-gene expression profiles in their reference database and ranks all the drugs based on their connectivity scores. A connectivity score ranges between –1 (indicating a complete ‘drug-disease’ reversal) and +1 (indicating perfect ‘drug-disease’ similarity). Another study adapted this connectivity score calculation and used it to find compounds in the L1000 LINCS collection [9] that could be repurposed for three cancer types [10]. This study quantified the reversal relationship between the drug and disease by computing the proposed Reverse Gene Expression Score ($RGES$). Finally, CMap 1.0 itself was updated by expanding the LINCS L1000 collection to more than 1.3 million profiles [11] (referred to as CMap 2.0 in this review). Along with the expansion of data, the CMap 2.0 study also proposed another variation of the connectivity score, called the weighted connectivity score, that uses GSEA’s weighted Kolmogorov-Smirnov enrichment statistic, along with ways to normalize the resulting scores and correct them further to account for background associations.

Connectivity scores and methodologies have been evaluated in the past to assess their performance in predicting drug-drug relationships or drug-disease relationships.
The performance of CMap 1.0 was evaluated in predicting drug-drug relationships using the Anatomical Therapeutic Chemical classification [12,13], and in predicting drug-disease relationships [14]. Furthermore, a recent review [15] assessed the advances made in CMap 1.0 and the computational tools that have been applied in the drug repurposing and discovery fields. Lin et al., 2019 [16] further evaluated connectivity approaches that use L1000 data, including six different scores that are used to predict drug-drug relationships.

All these proposed variations of the connectivity score share a common set of conceptual and statistical ideas. Yet, they have been formulated inconsistently using varied notations and terminologies in the original papers. The lack of consistency in the precise notation makes it difficult to understand the subtle differences between the scores and the intuition underlying each of them. For example, the connectivity score referred to as “$RGES$” [10] directly builds on “$CS$” [6]. Another example is the “$WTCS$” in [11], which is a bi-directional weighted version of “$KS(ES)$” used in GSEA [7]; in this case, they are named and notated quite differently though they share themes and statistical ideas. The aforementioned reviews and evaluation studies consider different scores only at a conceptual level, i.e. without delving into terms and definitions. In this review, we develop a systematic scheme that defines the aforementioned methodologies using consistent notations and terms. Additionally, we provide summary tables throughout the article to relate our consistent scheme with the previously published ones.

A taxonomy of connectivity scores

Figure 2. A taxonomy of connectivity scores. A. Relationship between connectivity scores. The main formulations discussed here are GSEA enrichment score ($ES$) [7], CMap 1.0 connectivity score ($CS$) [6], $RGES$ and $sRGES$ [10], CMap 2.0 weighted connectivity score ($WCS$), normalized connectivity score ($NCS$), and Tau score ($\tau$) [11]. B. Detailed definitions of connectivity scores in A.

We begin by creating a standardized set of notations and terms to denote the various concepts and quantities required to define the different connectivity scores. A connectivity score between a disease and a drug is computed by comparing the genes up- ($S_{up}$) and down-regulated ($S_{down}$) by the disease (compared to a healthy control) to a ranked list of genes ($L$) ordered based on their differential expression in response to a drug. A good connectivity score is usually a strongly negative value since it is designed to indicate a reversal relationship between the disease and the drug. Such a score is usually achieved when genes in $S_{up}$ appear at the bottom of $L$ and/or when genes in $S_{down}$ appear at the top of $L$. When there is no relationship, or when $S_{up}$ appears at the top and/or $S_{down}$ appears at the bottom of $L$ (i.e. similarity between the disease and drug signatures), the drug is unlikely to be efficacious in treating that disease. These scenarios are depicted in Figure 1, and the general notations, which we use throughout this work, are presented in Table 1 and Figure 2. Building on these general notations and terms, in the rest of this review, we develop and present a systematic scheme that defines four formulations of the drug-disease connectivity scores using consistent notations and terms, detailed formulation, and a summary table that will enable researchers to relate our consistent scheme back to the notations and terminology used in the original publications.

Table 1. General Notations

| Notation | Description |
| --- | --- |
| $S$ | disease gene set (i.e. query) (Fig. 1). Without any loss in generality, only the subset of disease genes that are also part of $L$ are considered throughout (i.e. $S \subseteq L$). |
| $S_{up}$ | disease up-regulated gene set; $S_{up} \subseteq S$ |
| $S_{down}$ | disease down-regulated gene set; $S_{down} \subseteq S$; $S = S_{up} \cup S_{down}$ |
| $L$ | rank-ordered list (drug) (Fig. 1) |
| $N_S$ | number of genes in $S$ |
| $N_L$ | number of genes in $L$ |
| $gl_i$, $gs_i$ | $i^{th}$ gene in list $L$ or set $S$ |
| $idx(L, gs_i)$ | index of gene $gs_i$ in list $L$ |
| $t$ | each treatment instance (i.e. a treated-and-vehicle-control pair) that results in a single drug profile $L$ |
| $N_t$ | total number of drug profiles ($L$) in the reference database |
| $N_d$ | number of drug profiles ($L$) in the reference database that correspond to a specific drug $d$ |
| KS | Kolmogorov-Smirnov |
| $ES$ | enrichment score |
| $ES_{up}$ | $ES$ for up-regulated gene set ($S_{up}$) |
| $ES_{down}$ | $ES$ for down-regulated gene set ($S_{down}$) |

Gene Set Enrichment Analysis (GSEA)

All connectivity scores described here begin with the calculation of some form of an Enrichment Score ($ES$) that captures the relationship between a drug and a disease. The basis of all these $ES$ formulations is the Gene Set Enrichment Analysis (GSEA) [7], which was originally developed to assess the enrichment (over-representation) of predefined biological gene sets (e.g., pathways, targets of a regulator, etc.) at the top or bottom of a list of genes ranked by their extent of differential expression in response to an experimental factor of interest. Enriched gene sets are then hypothesized to be biologically relevant to that experimental factor. When adapted to the question of drug repurposing, a method like GSEA can be used to assess the enrichment of sets of genes associated with a disease at the top or bottom of a list of genes ranked by their extent of differential expression in response to a drug (Fig. 1).

Enrichment Score (ES)
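
As a minimal sketch of the enrichment score defined in this subsection, the Python function below walks down a ranked drug gene list, steps up at disease genes ("hits", weighted by the absolute drug-response score raised to the power `w`) and steps down at all other genes ("misses"), and returns the running-sum value with the largest deviation from zero. The function name `gsea_es`, its arguments, and the toy gene names and scores are illustrative assumptions; none of them come from the GSEA software.

```python
from typing import Dict, List, Set


def gsea_es(ranked_genes: List[str], scores: Dict[str, float],
            gene_set: Set[str], w: float = 1.0) -> float:
    """Signed enrichment score of `gene_set` in a ranked drug gene list.

    ranked_genes: genes ordered from most up- to most down-regulated (L).
    scores: drug-response score d(gl_j) for every gene in ranked_genes.
    gene_set: disease gene set S (assumed to be a subset of ranked_genes).
    w: weight exponent; w = 0 gives the unweighted, KS-like statistic.
    """
    in_set = [gene in gene_set for gene in ranked_genes]
    n_l = len(ranked_genes)
    n_s = sum(in_set)
    if n_s == 0 or n_s == n_l:
        raise ValueError("gene_set must contain some, but not all, genes of L")

    # Normalizer for hits: sum of |d|^w over the genes of S.
    hit_norm = sum(abs(scores[g]) ** w
                   for g, hit in zip(ranked_genes, in_set) if hit)
    if hit_norm == 0:
        raise ValueError("all hit weights are zero; try w = 0")
    miss_step = 1.0 / (n_l - n_s)

    running = 0.0  # positional score: P_hit(S, i) - P_miss(S, i)
    best = 0.0     # running value with the largest deviation from zero
    for gene, hit in zip(ranked_genes, in_set):
        if hit:
            running += abs(scores[gene]) ** w / hit_norm
        else:
            running -= miss_step
        if abs(running) > abs(best):
            best = running
    return best


# Toy data (hypothetical): six genes ranked by drug-response score.
ranked = ["g1", "g2", "g3", "g4", "g5", "g6"]
d = {"g1": 3.0, "g2": 2.0, "g3": 1.0, "g4": -1.0, "g5": -2.0, "g6": -3.0}
es_top = gsea_es(ranked, d, {"g1", "g2"})     # set at the top: positive ES
es_bottom = gsea_es(ranked, d, {"g5", "g6"})  # set at the bottom: negative ES
```

In this toy case the set at the top of the list yields a positive score and the set at the bottom a negative one, matching the sign convention used throughout this review.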

GSEA is a weighted, signed version of the classical Kolmogorov-Smirnov test. It takes two inputs: i) a disease gene set composed of a set of genes significantly perturbed in response to a disease (denoted $S$), and ii) a rank-ordered list ($L$) of drug genes (in decreasing order of a drug-response score $d(gl_j)$ for each gene $gl_j$). Using these two inputs, GSEA quantifies the level of association between the disease and the drug by calculating an enrichment score ($ES$) based on the following steps:

1. For each position $i$ in the rank-ordered list ($L$) from top to bottom,

1.1. if the gene is in $S$, calculate:

$$P_{hit}(S, i) = \sum_{gl_j \in S,\; j \le i} \frac{|d(gl_j)|^{w}}{N_{|d|}}, \quad \text{where } N_{|d|} = \sum_{gl_j \in S} |d(gl_j)|^{w}$$

1.2. if the gene is not in $S$, calculate:

$$P_{miss}(S, i) = \sum_{gl_j \notin S,\; j \le i} \frac{1}{N_L - N_S}$$

1.3. calculate the positional enrichment score $es_i$:

$$es_i = P_{hit}(S, i) - P_{miss}(S, i)$$

2. Finally, calculate the final enrichment score ($ES$): $ES = \max_i (es_i)$, the maximum positional enrichment score, where the maximum is taken as the largest deviation from zero so that $ES$ keeps the sign of that $es_i$.

When $w = 0$, $N_{|d|} = \sum_{gl_j \in S} |d(gl_j)|^{0} = N_S$, which results in

$$P_{hit}(S, i) = \sum_{gl_j \in S,\; j \le i} \frac{|d(gl_j)|^{0}}{N_{|d|}} = \sum_{gl_j \in S,\; j \le i} \frac{1}{N_S}$$

Thus, $P_{hit}(S, i)$ and $P_{miss}(S, i)$ are the empirical distribution functions of the positions of the disease genes (i.e. $S$) and the positions of the non-disease genes (i.e. $L - S$), respectively, in the drug gene list $L$. Therefore, when $w = 0$, $ES$ (the signed maximum distance between the two functions) reduces to a signed two-sample Kolmogorov-Smirnov (KS) statistic:

$$ES = \max_i \left(P_{hit}(S, i) - P_{miss}(S, i)\right) = \mathrm{sign}\left(P_{hit}(S, i^{*}) - P_{miss}(S, i^{*})\right) \times KS$$

where $i^{*}$ is the position of the maximum deviation and $KS = \max_i |F_S(i) - F_{L-S}(i)|$ is the classical two-sample KS statistic, with $F_S$ and $F_{L-S}$ being the empirical distribution functions of $S$ and $L - S$, respectively, defined as follows:

$$F_S(i) = \frac{1}{N_S} \sum_{j=1}^{i} \mathbb{1}_{gl_j \in S}, \qquad F_{L-S}(i) = \frac{1}{N_L - N_S} \sum_{j=1}^{i} \mathbb{1}_{gl_j \notin S}$$

When $w = 1$, $ES$ becomes a weighted signed two-sample KS statistic with each position $j$ in the drug gene list $L$ weighted by the drug-response score $d(gl_j)$. Setting $w$ to one is recommended for GSEA. We point the reader to the original GSEA publication for a discussion of the statistic when $w$ is set to less than or greater than one.

Summary
• Enrichment score, $ES$, ranges from –1 to +1 (Fig. 3).
• $ES$ is the maximum deviation from zero encountered between the empirical distributions of the disease and non-disease genes in drug gene list $L$.
– A positive $ES$ indicates disease gene set enrichment towards the top of drug gene list $L$.
– A negative $ES$ indicates disease gene set enrichment at the bottom of $L$.
• When $S$ is randomly distributed in $L$, the magnitude of $ES$ is small, but if a large proportion of genes in $S$ is concentrated at the top or bottom of $L$, the magnitude of $ES$ is large (Fig. 4).
• When calculated separately for genes up- ($S_{up}$) and down-regulated ($S_{down}$) by the disease, good drug candidates that show a reversal relationship with the disease profile have a negative $ES_{up}$ and a positive $ES_{down}$ (Fig. 4, Table 2).
• Revised notations used in this GSEA section are summarized in Table 2.

Figure 3. Connectivity scores vs drug reversal phenotype. The figure shows estimated signs of the different connectivity scores for all eight scenarios corresponding to combinations of up- and down-regulated disease genes ($S$) and their relative position on the drug list ($L$). The top three scenarios (coded in blue) correspond to favorable outcomes of the drug fully or partially reversing the disease gene signature. The bottom three scenarios (coded in red) correspond to unfavorable outcomes of the drug not reversing the disease gene signature. The middle two scenarios (coded in grey) indicate neutral outcomes.

Figure 4. ES distribution for up- and down-regulated genes. Shown are plots of the running sum of up- and down-regulated gene sets (green curves) of an example liver cancer dataset (NCBI GEO Accession: GSE84073 [17]), including the location of the maximum and minimum enrichment scores (top and bottom dashed red lines) and the leading-edge subset (vertical black lines; $S$ genes in $L$ that show up at, or before, the point where the running sum reaches the final enrichment score, i.e. the maximum deviation from zero).

Table 2. GSEA Notations

| Current Notation | Previous Notation | Description |
| --- | --- | --- |
| $w$ | $p$ | the weight of the step in enrichment score calculation |
| $gl_j$ | $g_j$ | a gene of $L$ at index $j$; $i, j$ are indices of genes |
| $d(gl_j)$ | $r_j$ | the drug-response score of gene $gl_j$ in drug gene list $L$; this score is used to rank the genes in $L$ |
| $N_{|d|}$ | $N_R$ | the sum of the absolute drug gene scores ($d(gl_j)$) of every $L$ gene in $S$, weighted by $w$ |
| $P_{hit}(S, i)$ | − | the fraction of genes in $S$ (“hits”) weighted by their drug gene score ($d(gl_j)$) |
| $P_{miss}(S, i)$ | − | the fraction of genes not in $S$ (“misses”) |
| $N_L$ | $N$ | number of genes in $L$ |
| $N_S$ | $N_H$ | number of genes in $S$ |

Connectivity Map 1.0: Disease-Drug Connectivity Scores (CMap 1.0)

The connectivity map 1.0 (CMap 1.0) project pioneered the identification of drug candidates based on their ability to reverse disease gene expression profiles [6]. Key to this project was the creation of a large collection of reference gene expression profiles of multiple human cell lines treated with 164 small molecules, including approved drugs. The expression profiles were generated using Affymetrix microarrays. The original CMap 1.0 study and several others focused on cancer [18], inflammatory bowel disease [3] and spinal muscular atrophy [19] have used this reference library of drug profiles for drug repurposing. In all these cases, the starting point is a disease “signature” defined by the sets of genes up- and down-regulated in the disease. This signature is compared to each drug profile in the reference library using a GSEA-like analysis that results in an enrichment score ($ES$) for each of the up- and down-regulated disease gene sets separately. The $ES$ captures the level and direction of association of the disease gene set with that drug. Then, the ‘up’ and ‘down’ $ES$ are combined into a single connectivity score ($CS$) for the disease with respect to that drug. Finally, for the given disease, drug candidates are identified as those that have a strongly negative $CS$.

ES Calculation
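
The one-sample enrichment score defined in this subsection can be sketched in a few lines of Python. This is an illustration only: the function name `cmap1_es` and the toy gene list are hypothetical, positions are taken as 1-based indices into the ranked list, disease genes are visited in the order in which they occur in that list, and returning 0 when the two candidate values tie is an assumption, since the definition only covers the cases where one is strictly larger than the other.

```python
from typing import List, Set


def cmap1_es(ranked_genes: List[str], gene_set: Set[str]) -> float:
    """Signed one-sample KS-like enrichment score in the style of CMap 1.0."""
    n_l = len(ranked_genes)
    # 1-based positions idx(L, gs_i) of the set genes, in ranked order.
    positions = [pos for pos, gene in enumerate(ranked_genes, start=1)
                 if gene in gene_set]
    n_s = len(positions)
    if n_s == 0:
        raise ValueError("gene_set shares no genes with the ranked list")

    # a: how far the set genes run ahead of a uniform spread over L.
    a = max(i / n_s - pos / n_l for i, pos in enumerate(positions, start=1))
    # b: how far the set genes lag behind a uniform spread over L.
    b = max(pos / n_l - (i - 1) / n_s
            for i, pos in enumerate(positions, start=1))

    if a > b:
        return a
    if b > a:
        return -b
    return 0.0  # tie: not covered by the definition (assumption)


# Toy data (hypothetical): ten genes, two-gene disease sets.
ranked = [f"g{k}" for k in range(1, 11)]
es_up_example = cmap1_es(ranked, {"g1", "g2"})     # genes at the top
es_down_example = cmap1_es(ranked, {"g9", "g10"})  # genes at the bottom
```

Genes concentrated at the top of the list give a positive score and genes concentrated at the bottom a negative one, which is the sign convention used by the connectivity score that combines the two values.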

The drug-disease enrichment score ($ES$) in CMap 1.0 is adapted from GSEA. Instead of using GSEA’s signed two-sample KS test formulation that compares the positions of $S$ genes to those of $L - S$ genes, CMap 1.0 uses a signed one-sample KS test that compares the empirical distribution of the positions of $S$ genes in $L$ to a reference uniform distribution (of disease genes in the drug gene list):

$$ES = \begin{cases} a, & \text{if } a > b \\ -b, & \text{if } b > a \end{cases}$$

where

$$a = \max_{i=1}^{N_S} \left[ \frac{i}{N_S} - \frac{idx(L, gs_i)}{N_L} \right], \qquad b = \max_{i=1}^{N_S} \left[ \frac{idx(L, gs_i)}{N_L} - \frac{i - 1}{N_S} \right]$$

This formulation is used to calculate an $ES_{up}$ and an $ES_{down}$ value for the genes up- ($S_{up}$) and down-regulated ($S_{down}$) by the disease, respectively.

Connectivity Score (CS) Calculation – Normalization across treatment instances
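
The connectivity score calculation described in this subsection takes the two enrichment scores of each treatment instance, combines them into a raw score, and rescales the raw scores across all treatment instances. The sketch below illustrates this under stated assumptions: the function names and the example enrichment values are hypothetical, `sign` returns 0 for a zero input, and a raw score of exactly 0 is left at 0 after normalization (a case the piecewise definition does not spell out).

```python
from typing import List, Tuple


def sign(x: float) -> int:
    """Return -1, 0 or +1 according to the sign of x."""
    return (x > 0) - (x < 0)


def raw_connectivity(es_up: float, es_down: float) -> float:
    """Raw score: ES_up - ES_down if the two signs differ, else 0."""
    return es_up - es_down if sign(es_up) != sign(es_down) else 0.0


def normalize_connectivity(raw_scores: List[float]) -> List[float]:
    """Scale raw scores across treatment instances into [-1, +1]."""
    highest = max(raw_scores)
    lowest = min(raw_scores)
    normalized = []
    for cs in raw_scores:
        if cs > 0:
            normalized.append(cs / highest)   # highest > 0 whenever cs > 0
        elif cs < 0:
            normalized.append(-cs / lowest)   # lowest < 0 whenever cs < 0
        else:
            normalized.append(0.0)            # assumption for cs == 0
    return normalized


# Hypothetical (ES_up, ES_down) pairs for four treatment instances.
pairs: List[Tuple[float, float]] = [(-0.8, 0.6), (0.5, -0.5),
                                    (0.3, 0.4), (-0.4, 0.3)]
raw = [raw_connectivity(up, down) for up, down in pairs]
final = normalize_connectivity(raw)
```

In this toy example the first instance (up-regulated disease genes at the bottom of the drug profile, down-regulated ones at the top) receives the most negative normalized score, i.e. it is the strongest reversal candidate, while the third instance is zeroed out because both enrichment scores share a sign.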

The two enrichment scores, $ES_{up}$ and $ES_{down}$, are then used to calculate a raw connectivity score $cs$:

$$cs = \begin{cases} ES_{up} - ES_{down}, & \text{if } \mathrm{sign}(ES_{up}) \ne \mathrm{sign}(ES_{down}) \\ 0, & \text{otherwise} \end{cases}$$

The final connectivity score is calculated by normalizing the raw score: it is divided by the maximum or minimum of the raw scores across treatment instances $t$, depending on the sign of $cs$, which brings it back to the range between –1 and +1:

$$CS = \begin{cases} \dfrac{cs}{\max_t (cs)}, & \text{if } cs > 0 \\ -\dfrac{cs}{\min_t (cs)}, & \text{if } cs < 0 \end{cases}$$

Summary
• $ES_{up}$ and $ES_{down}$ represent the association between the up- ($S_{up}$) and down-regulated ($S_{down}$) disease genes in the disease of interest ($S$) with the ranked drug gene list ($L$).
• $CS$ is the connectivity score that combines $ES_{up}$ and $ES_{down}$ per drug treatment and normalizes them across treatments. Similar to $ES$, $CS$ ranges from –1 to +1 (Fig. 3).
• Lower $CS$ indicates a better reversal relationship between the disease and the drug.
• Revised notations used in this CMap 1.0 section are summarized in Table 3.

Table 3. CMap 1.0 Notations

| Current Notation | Previous Notation | Description |
| --- | --- | --- |
| $CS$ | $S_i$ | connectivity score; normalized connectivity score across all treatment instances |
| $t$ | $i$ | treatment instances |
| $cs$ | $s_i$ | connectivity score for each treatment instance |
| $ES$ | $ks$ | enrichment score |
| $idx(L, gs_i)$ | $V(j)$ | position of $gs_i$ in $L$ |
| $N_S$ | $t$ | number of genes in $S$ |
| $N_L$ | $n$ | number of genes in $L$ |

Reverse Gene Expression Scores (RGES)

The Connectivity Map project was subsequently expanded into the NIH Library of Integrated Network-based Cellular Signatures (LINCS) program by using a cost-effective gene-expression assay called L1000 [11]. The L1000 platform measures only about 1000 carefully-chosen genes, with the rest of the transcriptome estimated by an imputation model trained using publicly available genome-scale expression data [20]. The pilot phase of the LINCS program included data for about 20,000 compounds assayed on about 50 human cell lines across a range of doses, resulting in over one million L1000 profiles.

The focus of the study by Chen et al., 2017 [10] was to use this LINCS data to not only capture expression-based drug-disease reversal relationships but also evaluate if these reversals correlate with independently-measured drug efficacies. Towards this goal, the authors selected compounds with both efficacy data in ChEMBL [21] and gene expression data in LINCS. Using these two datasets, this study showed that the distribution of connectivity scores ($CS$) from CMap 1.0 [6] is concentrated at 0 and that these scores do not correlate well with $IC_{50}$ values. To address this issue, the authors proposed a new connectivity score called the Reverse Gene Expression Score ($RGES$). In CMap 1.0, the connectivity score for a drug is set to zero if $ES_{up}$ and $ES_{down}$, the enrichment scores for the up- and down-regulated disease gene sets, have the same sign. $RGES$, on the other hand, is computed as the difference between the absolute values of the two $ES$ values:

$$RGES = |ES_{up}| - |ES_{down}|$$

Summary
• The $RGES$ connectivity score is based on the difference between the absolute values of the scores of the up- and down-regulated disease genes regardless of whether they are enriched at the top or the bottom of the drug gene list $L$.
• Similar to $ES$ and $CS$, $RGES$ ranges from –1 to +1 (Fig. 3).
• $RGES$ is inversely correlated with drug efficacy.
• Revised notations used in this $RGES$ section are summarized in Table 4.

Summarization of Reverse Gene Expression Score

Since the LINCS dataset contains multiple profiles corresponding to the same drug assayed on multiple cell lines, concentrations, and time points, the study also proposed summarizing a drug’s $RGES$ values across these various conditions into a single score called the Summarization of Reverse Gene Expression Score ($sRGES$). $sRGES$ is estimated by first setting the condition that corresponds to 10 $\mu M$ and 24 hours (the most common in the LINCS database) as the ‘reference’ condition and setting all other conditions as ‘target’ conditions. Then, for a specific cell line, a drug’s $RGES$ in a target condition is assumed to depend on the target condition’s dose and time relative to the reference condition, quantified using a heuristic “awarding function” ($f$):

$$f(dose(t), time(t)) = \begin{cases} \alpha, & dose(t) < 10\,\mu M \text{ and } time(t) < 24 \text{ hours} \\ \beta, & dose(t) < 10\,\mu M \text{ and } time(t) \ge 24 \text{ hours} \\ \gamma, & dose(t) \ge 10\,\mu M \text{ and } time(t) < 24 \text{ hours} \\ 0, & dose(t) \ge 10\,\mu M \text{ and } time(t) \ge 24 \text{ hours} \end{cases}$$

Target conditions are first divided into four groups (as in the equation above), and the value of the function for each target group (e.g., $dose(t) < 10\,\mu M$ and $time(t) < 24$ hours) is estimated by averaging the difference in $RGES$ between the target group and reference group across all the drugs in the reference database that were profiled in the same cell line in that target condition and the reference condition.

Then, to combine $RGES$ values across cell lines, a weight $w(t)$ is calculated for each treatment that reflects how similar that treatment’s corresponding cell line, $cell(t)$, is to the disease under study:

$$w(t) = \frac{cor(cell(t), disease)}{\max_k \, cor(cell(k), disease)}$$

Here, $cor(cell(t), disease)$ is the average of the Spearman correlations between the expression profiles of the cell line $cell(t)$ and the disease of interest, and the weight normalizes it by the maximum correlation between all cell lines and the disease. Finally, $sRGES$ is defined as follows:

$$sRGES = \sum_{t=1}^{N_d} \frac{\left(RGES(t) + f(dose(t), time(t))\right) \times w(t)}{N_d}$$
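
To make the summarization concrete, the following Python sketch applies the awarding function, the cell-line weight, and the final average to a handful of treatment instances of one drug. It is an illustration under assumptions: the `Treatment` record, the function names, and all numeric values (including the award parameters for the three non-reference groups, which in the original study are estimated from data) are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Treatment:
    rges: float      # RGES of this treatment instance
    dose_um: float   # dose in micromolar
    time_h: float    # treatment duration in hours
    cell_cor: float  # cor(cell(t), disease), average Spearman correlation


def award(dose_um: float, time_h: float,
          alpha: float, beta: float, gamma: float) -> float:
    """Heuristic awarding function f(dose, time) relative to 10 uM / 24 h."""
    low_dose = dose_um < 10
    short_time = time_h < 24
    if low_dose and short_time:
        return alpha
    if low_dose:
        return beta
    if short_time:
        return gamma
    return 0.0  # reference-like condition: no adjustment


def srges(treatments: List[Treatment],
          alpha: float, beta: float, gamma: float) -> float:
    """Weighted average of award-adjusted RGES values for one drug."""
    max_cor = max(t.cell_cor for t in treatments)
    total = 0.0
    for t in treatments:
        weight = t.cell_cor / max_cor  # w(t)
        adjusted = t.rges + award(t.dose_um, t.time_h, alpha, beta, gamma)
        total += adjusted * weight
    return total / len(treatments)


# Hypothetical treatment instances of a single drug.
profiles = [
    Treatment(rges=-0.4, dose_um=10.0, time_h=24.0, cell_cor=0.8),
    Treatment(rges=-0.2, dose_um=5.0, time_h=6.0, cell_cor=0.4),
]
summary_score = srges(profiles, alpha=-0.1, beta=-0.05, gamma=-0.05)
```

Here the profile measured in the cell line that best matches the disease contributes with full weight, while the low-dose, short-duration profile is first shifted by its award value and then down-weighted by half.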

This study shows that these new formulations of the connectivity scores, $RGES$ and $sRGES$, correlate with drug $IC_{50}$ values: drugs with strongly negative $RGES$ or $sRGES$ tend to have low $IC_{50}$ values.

Summary
• The $sRGES$ connectivity score is designed to combine the $RGES$ values of a drug, each based on the difference between the absolute values of the scores of the up- and down-regulated disease genes, across cell lines, doses, and time points into a single score.
• Similar to $ES$ and $CS$, $sRGES$ ranges from –1 to +1 (Fig. 3).
• $sRGES$ is inversely correlated with drug efficacy.
• Revised notations used in this $sRGES$ sub-section are summarized in Table 4.

Table 4. $RGES$ and $sRGES$ Notations

| Current Notation | Previous Notation | Description |
| --- | --- | --- |
| $RGES$ | − | reverse gene expression score |
| $sRGES$ | − | summarized reverse gene expression score |
| $f(dose(t), time(t))$ | $f(dose(i), time(i))$ | the difference in $RGES$ between a target condition and reference condition, modeled as a function of dose and time |
| $cor(cell(t), disease)$ | $cor(cell(i), disease)$ | the average Spearman correlation between the expression profiles of a cell line $cell(t)$ and the disease of interest |
| $ES$ | $KS$ | enrichment score |
| $N_d$ | $N$ | number of treatments for a given drug ($d$) |
| $t$ | $i$ | treatment instances |

CMap 2.0 Connectivity Scores

CMap 2.0 is a massive expansion of the L1000 dataset to ~1.4 million profiles, which represent 42K genetic and small-molecule perturbagens profiled across multiple cell lines [11]. As part of the release of this data, the study also proposed new connectivity score calculations (Weighted Connectivity Score, Normalized Connectivity Score, and Tau Score). Similar to the methods outlined above, the CMap 2.0 methodology works by comparing the disease gene set ($S$) (containing the up- ($S_{up}$) and down-regulated ($S_{down}$) genes) to reference drug profiles in the L1000 database to get a rank-ordered list of all drugs based on a slightly new formulation of the connectivity score, along with new proposals for normalizing the scores across cell lines and drug types and for correcting the resulting normalized score against the background of the entire reference library.

Weighted Connectivity Score ($WCS$)

The disease-drug enrichment score ($ES$) in CMap 2.0 is based directly on GSEA’s weighted signed two-sample KS statistic that compares the positions of $S$ genes to those of $L - S$ genes with the weight $w$ set to 1. $ES$ is then used to calculate a Weighted Connectivity Score ($WCS$) that represents a non-parametric disease-drug similarity measure. $WCS$ is defined as follows:

$$WCS = \begin{cases} (ES_{up} - ES_{down})/2, & \text{if } \mathrm{sign}(ES_{up}) \ne \mathrm{sign}(ES_{down}) \\ 0, & \text{otherwise} \end{cases}$$

Summary
• The disease-drug similarities ($ES_{up}$ & $ES_{down}$) are computed using the two-sided weighted KS statistic.
• $WCS$ ranges from –1 to +1 (Fig. 3).
• A positive $WCS$ indicates that $S$ and $L$ are positively related (similar).
• A negative $WCS$ indicates that $S$ and $L$ are negatively related (dissimilar).
• A zero $WCS$ indicates that $S$ and $L$ are unrelated.
• Revised notations used in this $WCS$ sub-section are summarized in Table 5.

Normalized Connectivity Score ($NCS$)

The Normalized Connectivity Score ($NCS$) was developed to enable the comparison of $WCS$ across cell lines and drug types. Given the $WCS$ for a disease in relation to a specific drug of a type $dt$, tested in cell line $c$, the corresponding $NCS$ is computed by mean-scaling $WCS$:

$$NCS = \begin{cases} WCS / \mu^{+}_{c,dt}, & \text{if } \mathrm{sign}(WCS) > 0 \\ WCS / \mu^{-}_{c,dt}, & \text{otherwise} \end{cases}$$

Here, $\mu^{+}_{c,dt}$ and $\mu^{-}_{c,dt}$ are the absolute values of the means of the positive and negative $WCS$ values, respectively, across all the drugs of the same type $dt$ tested in the same cell line $c$. This procedure is identical to that used in the original GSEA for normalizing $ES$ scores to make them comparable across gene sets of different sizes.

Tau scores
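
The CMap 2.0 scoring chain, from the weighted connectivity score through the normalized connectivity score to the tau score defined in this subsection, can be sketched as three small Python functions. This is a simplified illustration: the function names and reference values are hypothetical, the reference lists are assumed to already be restricted to one cell line and one drug type, and they are assumed to contain at least one score with the same sign as the score being normalized.

```python
from typing import List


def sign(x: float) -> int:
    """Return -1, 0 or +1 according to the sign of x."""
    return (x > 0) - (x < 0)


def weighted_cs(es_up: float, es_down: float) -> float:
    """WCS: half the difference of the two ES values if their signs differ."""
    if sign(es_up) != sign(es_down):
        return (es_up - es_down) / 2.0
    return 0.0


def normalized_cs(wcs: float, reference_wcs: List[float]) -> float:
    """NCS: WCS divided by the absolute mean of same-signed reference WCS."""
    if wcs == 0:
        return 0.0
    same_sign = [x for x in reference_wcs if sign(x) == sign(wcs)]
    if not same_sign:
        raise ValueError("reference has no scores with the same sign")
    return wcs / abs(sum(same_sign) / len(same_sign))


def tau_score(ncs: float, reference_ncs: List[float]) -> float:
    """Tau: signed percentage of reference NCS values weaker than `ncs`."""
    weaker = sum(1 for x in reference_ncs if abs(x) < abs(ncs))
    return sign(ncs) * 100.0 * weaker / len(reference_ncs)


# Hypothetical WCS values of five reference drugs (same cell line and type).
ref_wcs = [-0.5, -0.25, 0.5, 0.25, 0.0]
ref_ncs = [normalized_cs(x, ref_wcs) for x in ref_wcs]

query_wcs = weighted_cs(es_up=-0.6, es_down=0.4)  # reversal-like pattern
query_ncs = normalized_cs(query_wcs, ref_wcs)
query_tau = tau_score(query_ncs, ref_ncs)
```

In this toy reference set, three of the five drugs have a weaker normalized score than the query, so the query's tau is negative with a magnitude of 60; a drug would need to outrank nearly all reference drugs to approach –100.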

Finally, the Normalized Connectivity Score

𝑁𝐶𝑆 for a disease to a specific drug ( i.e. the

𝑁𝐶𝑆 for a given disease-drug pair) is converted to a tau (𝜏) score by comparing it to the 𝑁𝐶𝑆 values of that disease against all drugs in the reference database (referred to as “touchstone” in CMap 2.0) of the same type 𝑑𝑡 tested in the same cell line 𝑐, expressed as a signed percentage between –100 and +100:

$$\tau = \operatorname{sign}(\mathit{NCS})\,\frac{100}{N_L}\sum_{k=1}^{N_L}\Big[\,|\mathit{NCS}_k| < |\mathit{NCS}|\,\Big]$$

Here the bracket equals 1 when the condition holds and 0 otherwise. Thus, a 𝜏 of 95 indicates that only 5% of drugs in the reference database of the same type and tested in the same cell line (containing $N_L$ drugs) showed stronger connectivity to the disease than the drug of interest. Since any disease is queried against the same fixed drug reference database, 𝜏 values are comparable across diseases.

Another way to calculate a 𝜏 score corresponding to the 𝑁𝐶𝑆 value for a disease-drug pair is to compare it to the 𝑁𝐶𝑆 values of that specific drug against all the perturbation signatures in a reference database. This comparison yields a 𝜏 that represents the signed percentage of reference signatures that are less connected to the drug than the disease of interest. In other words, based on this comparison, a 𝜏 of 95 indicates that only 5% of signatures in the reference database showed stronger connectivity to the drug than the disease of interest. Similarly, 𝜏 values in this setting are comparable across drugs in the reference database.

Summary

• The normalized connectivity score 𝑁𝐶𝑆 was developed to enable the comparison of 𝑊𝐶𝑆 across cell lines and drug types.
• The tau score (𝜏) further corrects for non-specific associations by expressing the 𝑁𝐶𝑆 of a given disease-drug pair as the signed percentage of signatures/profiles in a reference database whose absolute 𝑁𝐶𝑆 is smaller than that of the pair.
• Tau (𝜏) ranges from –100 to +100 (Fig. 3), and a more negative score indicates a stronger disease-drug reversal relationship.
• Good tau scores (𝜏) should range between –95 and –100. A 𝜏 of –95 indicates a reversal relationship for which only 5% of the signatures/profiles in the reference database showed stronger connectivity.
• Revised notations used in the 𝑁𝐶𝑆 and 𝜏 sub-sections are summarized in Table 5 (CMap 2.0 Notations).
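The conversion from 𝑁𝐶𝑆 to 𝜏 described in this sub-section can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the CMap 2.0 implementation: the function name `tau_score`, its parameters, and the example 𝑁𝐶𝑆 values are hypothetical, and the reference list is assumed to already be restricted to drugs of the same type tested in the same cell line.

```python
from typing import Sequence


def tau_score(ncs: float, reference_ncs: Sequence[float]) -> float:
    """Convert one NCS value into a signed percentage (tau) in [-100, 100].

    `reference_ncs` is assumed to hold the NCS values of the same disease
    against every reference drug of the same type and cell line.
    """
    n_ref = len(reference_ncs)
    if n_ref == 0:
        raise ValueError("reference_ncs must contain at least one value")
    # Count reference drugs whose absolute connectivity is strictly weaker.
    weaker = sum(1 for ref in reference_ncs if abs(ref) < abs(ncs))
    sign = (ncs > 0) - (ncs < 0)  # -1, 0, or +1
    return sign * 100.0 * weaker / n_ref


# Made-up NCS values for 20 reference drugs (illustration only).
reference = [0.2, -0.4, 0.5, -0.7, 0.9, 1.0, -1.1, 1.2, -1.3, 1.4,
             -1.5, 1.6, 0.3, -0.6, 0.8, -1.7, 1.8, -1.9, 0.05, -2.5]

# A drug with NCS = -2.0 is more strongly connected than 19 of the 20
# reference drugs, and its NCS is negative (reversal).
tau = tau_score(-2.0, reference)
print(tau)  # -95.0
```

In this toy example, only one reference drug (𝑁𝐶𝑆 = –2.5) has a larger absolute score than the query, so 95% of the reference is weaker and the sign of the query 𝑁𝐶𝑆 makes 𝜏 negative. The second variant of 𝜏 described above would use the same function, with `reference_ncs` holding the scores of one drug against all reference signatures instead.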

| Current Notation | Previous Notation | Description |
|---|---|---|
| 𝑊𝐶𝑆 | 𝑊𝑇𝐶𝑆; $w_{c,t}$ | weighted connectivity score; also used to refer to a specific instance of the weighted connectivity score of a given cell line 𝑐 and perturbagen type 𝑑𝑡 |
| 𝑐 | − | cell line |
| 𝑑𝑡 | 𝑡 | drug type |
| 𝑘 | 𝑖 | index of each drug in the reference database; 𝑘 = 1, 2, 3, …, $N_L$ |
| $\mu^{+}_{c,dt}$, $\mu^{-}_{c,dt}$ | $\mu^{+}_{c,t}$, $\mu^{-}_{c,t}$ | absolute values of means of positive and negative raw weighted connectivity scores, respectively |
| $N_L$ | 𝑁 | total number of drug profiles (𝐿) in the reference database |
| 𝑆 | 𝑞 | disease gene set (i.e., query) |
| 𝐿 | 𝑟 | rank-ordered gene list (drug) |

Conclusion

In this review, we have reconciled four key formulations of drug-disease connectivity scores by defining them using consistent notation and terminology. This unified scheme will foster long-term adoption and potential collaboration within the growing computational drug-repurposing community. This review provides significant insights on different methods that have been proposed in the drug repurposing field. Our coherent definition of connectivity scores and their relationships will allow researchers to better understand the current state-of-the-art, including expressing all other existing methods using the same notation and terminology. The drug-repurposing community can adopt this consolidated framework to develop and compare new computational drug-repurposing quantification metrics in the context of existing methods. To facilitate the continuous and transparent integration of newer methods, this review is hosted in a GitHub repository (https://github.com/jravilab/connectivity_score_review) that can be edited by the research community to include new methods for connectivity score calculation. The review document has been written using RMarkdown [22,23] and distill [24], and rendered as a living document at https://jravilab.github.io/connectivity_score_review.

Funding

This work was primarily supported by US National Institutes of Health (NIH) grant R35 GM128765 to A.K., MSU Diversity Research Network Launch Awards Program to J.R., MSU College of Natural Science Scholarships to K.S., and, in part, by MSU start-up funds to A.K. and J.R.

Acknowledgments

We are grateful to members of the Ravi and Krishnan labs for feedback on the manuscript.

References

1. Huang C-T, Hsieh C-H, Chung Y-H, et al. Perturbational Gene-Expression Signatures for Combinatorial Drug Discovery. iScience 2019; 15:291–306
2. Keenan AB, Wojciechowicz ML, Wang Z, et al. Connectivity Mapping: Methods and Applications. Annual Review of Biomedical Data Science 2019; 2:69–92
3. Dudley JT, Sirota M, Shenoy M, et al. Computational repositioning of the anticonvulsant topiramate for inflammatory bowel disease. Sci Transl Med 2011; 3:96ra76
4. Sirota M, Dudley JT, Kim J, et al. Discovery and preclinical validation of drug indications using compendia of public gene expression data. Sci Transl Med 2011; 3:96ra77
5. Parvathaneni V, Kulkarni NS, Muth A, et al. Drug repurposing: a promising tool to accelerate the drug discovery process. Drug Discov. Today 2019; 24:2076–2085
6. Lamb J, Crawford ED, Peck D, et al. The Connectivity Map: using gene-expression signatures to connect small molecules, genes, and disease. Science 2006; 313:1929–1935
7. Subramanian A, Tamayo P, Mootha VK, et al. Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc. Natl. Acad. Sci. U.S.A. 2005; 102:15545–15550
8. Hollander M, Chicken E, Wolfe DA. Nonparametric Statistical Methods. 1999;
9. Duan Q, Flynn C, Niepel M, et al. LINCS Canvas Browser: interactive web app to query, browse and interrogate LINCS L1000 gene expression signatures. Nucleic Acids Res. 2014; 42:W449-460
10. Chen B, Ma L, Paik H, et al. Reversal of cancer gene expression correlates with drug efficacy and reveals therapeutic targets. Nat Commun 2017; 8:16022
11. Subramanian A, Narayan R, Corsello SM, et al. A Next Generation Connectivity Map: L1000 Platform and the First 1,000,000 Profiles. Cell 2017; 171:1437-1452.e17
12. Iskar M, Campillos M, Kuhn M, et al. Drug-induced regulation of target expression. PLoS Comput. Biol. 2010; 6:
13. Cheng J, Xie Q, Kumar V, et al. Evaluation of analytical methods for connectivity map data. Pac Symp Biocomput 2013; 5–16
14. Cheng J, Yang L, Kumar V, et al. Systematic evaluation of connectivity map for disease indications. Genome Med 2014; 6:540
15. Musa A, Ghoraie LS, Zhang S-D, et al. A review of connectivity map and computational approaches in pharmacogenomics. Brief. Bioinformatics 2018; 19:506–523
16. Lin K, Li L, Dai Y, et al. A comprehensive evaluation of connectivity methods for L1000 data. Brief. Bioinformatics 2019;
17. Broutier L, Mastrogiovanni G, Verstegen MM, et al. Human primary liver cancer-derived organoid cultures for disease modeling and drug screening. Nat. Med. 2017; 23:1424–1435
18. Singh AR, Joshi S, Zulcic M, et al. PI-3K Inhibitors Preferentially Target CD15+ Cancer Stem Cell Population in SHH Driven Medulloblastoma. PLoS ONE 2016; 11:e0150836