Clinical connectivity map for drug repurposing: using laboratory tests to bridge drugs and diseases

Abstract

Drug repurposing has attracted increasing attention from both the pharmaceutical industry and the research community. Many existing computational drug repurposing methods rely on preclinical data (e.g., chemical structures, drug targets), resulting in translational problems for clinical trials. In this study, we propose a clinical connectivity map framework for drug repurposing that leverages laboratory tests to analyze complementarity between drugs and diseases. We establish clinical drug effect vectors (i.e., drug-laboratory test associations) by applying a continuous self-controlled case series model to longitudinal electronic health record data. We establish clinical disease sign vectors (i.e., disease-laboratory test associations) by applying a Wilcoxon rank sum test to large-scale national survey data. Finally, we compute a repurposing possibility score for each drug-disease pair by applying a dot product-based scoring function to the clinical disease sign vectors and clinical drug effect vectors. We comprehensively evaluate 392 drugs for 6 important chronic diseases (e.g., asthma, coronary heart disease, type 2 diabetes). We discover not only known associations between diseases and drugs but also many hidden drug-disease associations. Moreover, we are able to explain the predicted drug-disease associations via the corresponding complementarity between laboratory tests of drug effect vectors and disease sign vectors. The proposed clinical connectivity map framework uses laboratory tests from electronic clinical information to bridge drugs and diseases, which is explainable and has better translational power than existing computational methods. Experimental results demonstrate the effectiveness of the proposed framework and suggest that our method could help identify drug repurposing opportunities, which will benefit patients by offering more effective and safer treatments.


Wen et al.

RESEARCH

Clinical connectivity map for drug repurposing: using laboratory tests to bridge drugs and diseases

Qianlong Wen†, Ruoqi Liu† and Ping Zhang

Abstract

Background:

Drug repurposing, the process of identifying additional therapeutic uses for existing drugs, has attracted increasing attention from both the pharmaceutical industry and the research community. Many existing computational drug repurposing methods rely on preclinical data (e.g., chemical structures, drug targets), resulting in translational problems for clinical trials.

Methods:

In this study, we propose a clinical connectivity map framework for drug repurposing that leverages laboratory tests to analyze complementarity between drugs and diseases. We establish clinical drug effect vectors (i.e., drug-laboratory test associations) by applying a continuous self-controlled case series model to longitudinal electronic health record data. We establish clinical disease sign vectors (i.e., disease-laboratory test associations) by applying a Wilcoxon rank sum test to large-scale national survey data. Finally, we compute a repurposing possibility score for each drug-disease pair by applying a dot product-based scoring function to the clinical disease sign vectors and clinical drug effect vectors.

Results:

We comprehensively evaluate 392 drugs for 6 important chronic diseases (asthma, coronary heart disease, congestive heart failure, heart attack, type 2 diabetes, and stroke). We discover not only known associations between diseases and drugs, but also many hidden drug-disease associations. For example, clopidogrel and alendronate may be repurposed as candidate drugs for diabetes and cardiovascular diseases, respectively. Moreover, we are able to explain the predicted drug-disease associations via the corresponding complementarity between laboratory tests of drug effect vectors and disease sign vectors.

Conclusion:

The proposed clinical connectivity map framework uses laboratory tests from electronic clinical information to bridge drugs and diseases, which is explainable and has better translational power than existing computational methods. Experimental results demonstrate the effectiveness of the proposed framework and suggest that our method could help identify drug repurposing opportunities, which will benefit patients by offering more effective and safer treatments.

Availability:

The code for this paper is available at: https://github.com/HoytWen/CCMDR

Keywords:

Drug Repurposing; Connectivity Map; Electronic Health Record; National Health and Nutrition Examination Survey

Introduction

Traditional de novo drug discovery is a long and complicated process [1, 2]: developing a new drug usually takes more than 15 years [3] and costs 800 million to 1 billion US dollars [4]. Drug repurposing, the investigation of potential additional uses for existing drugs, is becoming an appealing research field given its potential to lower overall costs and shorten drug development timelines [5].

* Correspondence: [email protected]. Department of Computer Science and Engineering, The Ohio State University, 2015 Neil Ave, 43210 Columbus, Ohio, USA. Department of Biomedical Informatics, The Ohio State University, 1800 Cannon Drive, 43210 Columbus, Ohio, USA. Full list of author information is available at the end of the article. † These authors contributed equally to this work. arXiv [q-bio.QM].

There has been a surge of computational methods proposed for drug repurposing in recent years, which can be roughly classified into two categories based on their data sources: preclinical data-based and clinical data-based. Preclinical data-based methods often build machine learning models on preclinical data, such as drug chemical structures, protein targets and gene expression information, to identify potential drug-disease associations. For example, Keiser et al. [6] use drug structural similarity as the measure to find drugs with similar effects. Lamb et al. [7, 8] propose the connectivity map (CMap) approach for drug repurposing, which uses gene expression data and is based on molecular activity. Luo et al. [9] develop a server named DPDR-CPI, which predicts new indications of existing drugs by analyzing the chemical-protein interactome (CPI) profile. Some researchers have also constructed computational frameworks that integrate several kinds of data sources, and even disease similarity measurement profiles, to make better predictions. The PreDR model proposed by Wang et al.
[10] integrated drug structure, drug target, side-effect and disease phenotype data to find novel drug indications. Zhang et al. [11] proposed a similarity-constrained matrix factorization method to predict drug-disease associations based on known drug-disease associations, drug features and disease semantic information. However, all of these methods rely heavily on preclinical information to make predictions, which causes a large translational gap when the drugs are applied to humans. It is estimated that, of all compounds effective in cell assays, only 30% work in animals and only 5% work in humans [12].

Compared with preclinical data, clinical data provide a more applicable and reliable source for drug repurposing, as clinical information (e.g., laboratory test results) may be seen as valuable read-outs of drug effects directly on human bodies. Because it is observed directly from patients, there is no need to consider the translational problem. Many computational frameworks based on clinical information have been proposed thanks to the large amount of available electronic clinical data. Jung et al. [13] find connections between drugs and diseases in clinical diagnosis notes by literature mining, but their method does not include any structured data, such as laboratory test results. Jang et al. [14] propose a framework that uses laboratory test results to reflect the influence of drugs and diseases on human physiological activities; they establish drug effects by counting co-occurrences between drugs and laboratory tests. However, this approach is not efficient enough to uncover hidden relations between drugs and laboratory tests, especially for a large dataset that includes many laboratory tests and existing drugs. Kuang et al. [15] and Ghalwash et al.
[16] proposed more advanced methods to compute the influence of drugs on laboratory tests; however, they reflect the effect of drugs on a single laboratory test (e.g., blood sugar level), which is not enough to represent the state of the complex human system.

It would be more efficient and accurate to build an electronic clinical information-based drug repurposing framework and implement it with efficient statistical analysis methods designed for large datasets. In doing so, we include as many laboratory tests as we can in our experiment to represent the state of the human biological system as completely as possible. The idea of CMap proposed by Lamb et al. [7, 8], which uses gene expression values to bridge drugs and diseases, directly inspires us to leverage all the laboratory tests involved in our experiment to build associations between drugs and diseases from a clinical perspective.

In this paper, we propose a clinical connectivity map framework for drug repurposing (CCMDR) that leverages laboratory tests to analyze the influence of drugs and diseases on the human biological system. Specifically, we first establish clinical disease sign vectors (i.e., disease-laboratory test associations) by applying a Wilcoxon rank sum test to large-scale national survey data. We then establish clinical drug effect vectors (i.e., drug-laboratory test associations) by applying a continuous self-controlled case series model to longitudinal electronic health record data. Finally, we compute a repurposing possibility score for each drug-disease pair by applying a dot product-based scoring function to the clinical disease sign vectors and clinical drug effect vectors. Experimental results show that our method can not only retrieve known drug-disease associations with high accuracy but also find potential indications that can be verified from the medical literature. For example, clopidogrel and alendronate may be repurposed as candidate drugs for diabetes and cardiovascular diseases, respectively.
Moreover, we can explain the predicted drug-disease associations via the corresponding complementarity between laboratory tests of drug effect vectors and disease sign vectors. This suggests that our method can potentially be used in drug repurposing tasks.

In brief, the contributions of the paper can be summarized as follows:

• We propose a clinical connectivity mapping framework for drug repurposing. The new framework is based solely on clinical patient data and thus has fewer translational problems.
• We evaluate our framework for 392 drugs on 6 important chronic diseases (asthma, coronary heart disease, congestive heart failure, heart attack, type 2 diabetes, and stroke). Experimental results show that our method achieves high accuracy in retrieving the known indications of drugs.
• We study the predicted drug repurposing candidates via the corresponding complementarity between laboratory tests of drug effect vectors and disease sign vectors. Case studies with literature support show the potential of our method to discover previously unknown indications of existing drugs.

Methodology

Dataset and Data Preprocess

We use the questionnaire and laboratory test results from the National Health and Nutrition Examination Survey (NHANES) [17] to establish the clinical disease sign vectors. According to the questionnaire survey (e.g., "Has been diagnosed with type 2 diabetes?"), individual samples are divided into a disease group (who answered "yes") and a healthy group (who answered "no"). Next, we perform a statistical analysis to identify disease-related clinical variables among the laboratory test results collected in the NHANES data. We extract 87,464 individual samples, 986 numerical clinical variables and more than 30 disease conditions from NHANES data ranging from 1999 to 2016. Here, we only consider the disease conditions with more than 1000 individual samples, which results in 6 unique diseases (i.e., asthma, coronary heart disease, congestive heart failure, heart attack, type 2 diabetes and stroke).

We use the prescription and laboratory test result histories of patients in Electronic Health Records (EHR) to establish the clinical drug effect vectors. We transform the prescription records of patients into matrices based on medication use. To study the associations between prescribed drugs and laboratory test results, we apply a continuous self-controlled case series model [15] to analyze the effects of a drug on the laboratory test results. Our proprietary EHRs contain comprehensive health records of more than 300 thousand patients over four years. We only consider patients with complete records (i.e., having both a prescription and its corresponding laboratory test results), which results in 91,934 patients, 1,344 kinds of treatments and 65 kinds of laboratory tests. After excluding prescriptions with fewer than 1000 patients, we obtain 392 unique prescribed drugs.

We bridge drugs and diseases using the laboratory test results obtained from each side.
Since the laboratory test results come from different data resources (i.e., national survey data and electronic health records), we need to standardize those laboratory tests for further analysis. The laboratory tests that appear in both datasets are included and mapped to a standard list with consistent names. Non-numerical laboratory tests are excluded. Finally, we obtain 35 laboratory tests, which we treat as clinical variables. The full list of the 35 clinical variables can be found in Table S1. Our inference for a drug-disease pair is based on the complementary and adverse effects that each drug candidate and disease condition has on the 35 clinical variables.
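As a rough illustration of the disease-group versus healthy-group comparison described in this section, the sketch below ranks the pooled values of one clinical variable, computes the two U statistics of equation (1), and turns the result into an "Up", "Down" or "No" entry. It is a minimal sketch rather than the authors' implementation: the function names, the toy values, and the use of a normal approximation (without tie correction) in place of the Mann-Whitney table are all assumptions made for illustration.

```python
import math


def average_ranks(values):
    """Return 1-based ranks of `values`; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # sorted positions i..j share this rank
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def mann_whitney_u(disease, control):
    """Return (U1, U2) computed from the rank sums of the pooled sample."""
    n1, n2 = len(disease), len(control)
    ranks = average_ranks(list(disease) + list(control))
    r1, r2 = sum(ranks[:n1]), sum(ranks[n1:])
    return r1 - n1 * (n1 + 1) / 2, r2 - n2 * (n2 + 1) / 2


def sign_entry(disease, control, alpha=0.05):
    """Return 1 ("Up"), -1 ("Down") or 0 ("No") for one clinical variable."""
    n1, n2 = len(disease), len(control)
    u1, u2 = mann_whitney_u(disease, control)
    # Assumption: normal approximation to the null distribution of U.
    mu = n1 * n2 / 2
    sigma = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (min(u1, u2) - mu) / sigma
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided p-value
    if p_value >= alpha:
        return 0
    return 1 if sum(disease) / n1 > sum(control) / n2 else -1


# Hypothetical values of one laboratory test in the two groups.
disease_group = [6.1, 6.8, 7.2, 7.9, 8.4, 9.0, 9.3, 10.1]
healthy_group = [4.4, 4.7, 4.9, 5.0, 5.2, 5.3, 5.5, 5.8]
print(mann_whitney_u(disease_group, healthy_group))
print(sign_entry(disease_group, healthy_group))
```

Repeating `sign_entry` over all 35 clinical variables would give one disease sign vector; with real survey data, ties and very small groups would call for an exact or tie-corrected test instead of this approximation.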

Clinical Disease Sign Vector

We extract 6 disease conditions and 35 clinical variables from NHANES after preprocessing to establish the clinical disease sign vectors. The dimension of each disease sign vector is $1 \times 35$. There are three types of relations between a disease and a clinical variable ("Up", "Down" and "No"), which represent an increasing, decreasing and not significantly changing laboratory test level, respectively. As mentioned above, the combined data is divided into a disease group and a control group according to the questionnaire data. We apply the Wilcoxon rank sum test (a.k.a. Mann-Whitney U test) to the two groups to calculate the p-value for each clinical variable. To get the p-value, we combine the values of the two groups and rank them. Then we calculate two statistics $U_1$ and $U_2$, which are defined as follows:

$$U_1 = R_1 - \frac{n_1(n_1+1)}{2}, \qquad U_2 = R_2 - \frac{n_2(n_2+1)}{2} \tag{1}$$

where $n_1$ and $n_2$ are the sample sizes of the two groups, and $R_1$ and $R_2$ are the sums of the ranks in the two groups, respectively. The smaller of $U_1$ and $U_2$ is used to consult the Mann-Whitney significance table. A p-value cut-off is used to examine whether the value change is significant or not [18]. In our work, the p-value threshold is set to 0.05: only clinical variables with p-values less than 0.05 are regarded as significant clinical variables for the disease. We consult the Mann-Whitney table at $\alpha = 0.05$. If the smaller of $U_1$ and $U_2$ is larger than the value given in the table, the null hypothesis is retained; otherwise it is rejected. We then assign a relation direction to the clinical variable by comparing its average value in the disease group and the control group. An up relation ("↑") indicates a significant increase in the disease group compared with the control group, a down relation ("↓") means the laboratory test value of the disease group is significantly lower than that of the control group, and no relation ("-") indicates that the laboratory test level is not significantly influenced by the disease.

Clinical Drug Effect Vector

To establish clinical drug effect vectors, we extract 392 drugs and 35 clinical variables from the EHR data. The clinical variables used here are the same as those used to establish the disease sign vectors, so the dimension of each drug effect vector is $1 \times 35$. We need to consider the prescription records of patients and their corresponding laboratory test result records simultaneously, and the EHR dataset we use is a large dataset that includes millions of records. We therefore need a way to analyze high-dimensional longitudinal data. In our work, we adopt the continuous self-controlled case series (CSCCS) model proposed by Kuang et al. [15], a lasso regression model designed for analyzing EHR datasets.

Figure 1

This figure presents the pipeline of our framework. The framework contains three main components: (1) establishing clinical drug effect vectors by applying a continuous self-controlled case series model to longitudinal electronic health record (EHR) data, (2) establishing clinical disease sign vectors by applying a Wilcoxon rank sum test to large-scale national survey data (NHANES), (3) computing a repurposing possibility score for each drug-disease pair by applying a dot product-based scoring function to the clinical disease sign vectors and clinical drug effect vectors. We perform a terminology mapping before establishing the clinical drug effect vectors and clinical disease sign vectors to make sure each clinical vector includes the same laboratory tests. There are three relation types in the clinical vectors ("Up", "Down", "No"), which represent an increasing, decreasing and not significantly changing laboratory test level, respectively.

Assume there are $N$ patients with a specific kind of clinical variable measurement and $M$ kinds of drugs in the EHR dataset. The continuous variable $y_{ij}$, where $i \in \{1, 2, \cdots, N\}$ and $j \in \{1, 2, \cdots, J_i\}$, denotes the value of the $j$th measurement of the clinical variable among a total of $J_i$ measurements for the $i$th patient. The binary variable $x_{ijm}$, where $i \in \{1, 2, \cdots, N\}$, $j \in \{1, 2, \cdots, J_i\}$ and $m \in \{1, 2, \cdots, M\}$, indicates whether the $i$th patient is exposed to the $m$th drug when the $j$th measurement is taken: 0 represents no and 1 represents yes.
$y_{ij}$ is regarded as the output variable when we fit the structured data to the linear regression model, so we have:

$$y_{ij} \mid \mathbf{x}_{ij} = \alpha_i + \boldsymbol{\beta}^\top \mathbf{x}_{ij} + \epsilon_{ij}, \qquad \epsilon_{ij} \overset{\text{iid}}{\sim} N(0, \sigma^2) \tag{2}$$

$$\boldsymbol{\beta} = [\beta_1 \ \beta_2 \ \cdots \ \beta_M]^\top, \qquad \mathbf{x}_{ij} = [x_{ij1} \ x_{ij2} \ \cdots \ x_{ijM}]^\top$$

$\alpha_i$ in equation (2) represents the average baseline level of $y_{ij}$ for the $i$th patient. It is therefore independent of the date on which the measurement was taken and of the drugs the patient was using at that time. Each patient has an individual baseline value. $\epsilon_{ij}$ is independent and identically distributed Gaussian noise with zero mean and fixed but unknown variance $\sigma^2$. The linear model can then be converted into a least squares problem as follows:

$$\arg\min_{\boldsymbol{\alpha}, \boldsymbol{\beta}} L(\boldsymbol{\alpha}, \boldsymbol{\beta}) = \arg\min_{\boldsymbol{\alpha}, \boldsymbol{\beta}} \left\| \mathbf{y} - [\mathbf{Z} \ \mathbf{X}] \begin{bmatrix} \boldsymbol{\alpha} \\ \boldsymbol{\beta} \end{bmatrix} \right\|_2^2 \tag{3}$$

where

$$\boldsymbol{\alpha} = [\alpha_1 \ \alpha_2 \ \cdots \ \alpha_N]^\top, \qquad \mathbf{Z} = \mathrm{diag}(\mathbf{1}_1, \cdots, \mathbf{1}_N),$$

$$\mathbf{y} = [y_{11} \ \cdots \ y_{1J_1} \ \cdots \ y_{N1} \ \cdots \ y_{NJ_N}]^\top, \qquad \mathbf{X} = [\mathbf{x}_{11} \ \cdots \ \mathbf{x}_{1J_1} \ \cdots \ \mathbf{x}_{N1} \ \cdots \ \mathbf{x}_{NJ_N}]^\top.$$

Here $\mathbf{Z}$ is a block diagonal matrix and $\mathbf{1}_i$ is a $J_i \times 1$ vector of ones. By solving this problem, we obtain the optimized parameter $\boldsymbol{\beta}$, which is the quantity of interest in our task. $\boldsymbol{\beta}$ is an $M$-dimensional parameter vector, and its component $\beta_m$ indicates the effect of the $m$th drug on the output variable $y$. The optimized parameters we get with the CSCCS model are numerical. Positive and negative components indicate that the corresponding drugs may increase and decrease the level of the output variable, respectively, while 0 indicates that the corresponding drug does not influence it. In the CSCCS model, $\boldsymbol{\alpha}$ is regarded as a nuisance parameter: our interest is in $\boldsymbol{\beta}$, so we do not need to care about the value of $\boldsymbol{\alpha}$.
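To make equations (2) and (3) concrete, the following sketch builds the stacked design matrix $[\mathbf{Z} \ \mathbf{X}]$ for a few patients and solves the least squares problem through its normal equations. It is an illustration under stated assumptions, not the CSCCS implementation of [15]: the data are noise-free toy values, there is no penalty term, and `solve_linear_system` and `fit_baseline_model` are hypothetical helper names.

```python
def solve_linear_system(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x


def fit_baseline_model(exposures, measurements):
    """Least squares fit of y_ij = alpha_i + beta^T x_ij; returns (alpha, beta)."""
    n_patients = len(exposures)
    n_drugs = len(exposures[0][0])
    rows, targets = [], []
    for i, (x_i, y_i) in enumerate(zip(exposures, measurements)):
        for x_ij, y_ij in zip(x_i, y_i):
            # One row of [Z X]: patient indicator followed by drug exposures.
            z_row = [1.0 if k == i else 0.0 for k in range(n_patients)]
            rows.append(z_row + [float(v) for v in x_ij])
            targets.append(y_ij)
    size = n_patients + n_drugs
    gram = [[sum(r[p] * r[q] for r in rows) for q in range(size)]
            for p in range(size)]
    rhs = [sum(r[p] * t for r, t in zip(rows, targets)) for p in range(size)]
    theta = solve_linear_system(gram, rhs)
    return theta[:n_patients], theta[n_patients:]


# Toy data (assumed): 3 patients, 2 drugs, exposure flags per measurement.
exposures = [
    [[0, 0], [1, 0], [1, 1], [0, 1]],
    [[0, 0], [1, 0], [0, 1]],
    [[0, 0], [1, 1], [1, 0]],
]
true_alpha = [5.0, 7.0, 6.0]   # per-patient baselines
true_beta = [2.0, -1.5]        # drug 1 raises the lab value, drug 2 lowers it
measurements = [
    [a + sum(b * x for b, x in zip(true_beta, x_ij)) for x_ij in x_i]
    for a, x_i in zip(true_alpha, exposures)
]
alpha_hat, beta_hat = fit_baseline_model(exposures, measurements)
print([round(b, 6) for b in beta_hat])
```

On this noise-free example the fit recovers the baselines and drug effects up to rounding error, and the signs of the estimated coefficients give the direction of each drug's effect. The explicit baseline columns make the system grow with the number of patients, which is impractical at EHR scale.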
To eliminate the effect of $\boldsymbol{\alpha}$, Kuang et al. [15] consider:

$$\frac{\partial L(\boldsymbol{\alpha}, \boldsymbol{\beta})}{\partial \boldsymbol{\alpha}} = 0 \ \Rightarrow \ \boldsymbol{\alpha} = (\mathbf{Z}^\top \mathbf{Z})^{-1} \mathbf{Z}^\top (\mathbf{y} - \mathbf{X}\boldsymbol{\beta}) = \bar{\mathbf{y}} - \bar{\mathbf{X}}\boldsymbol{\beta} \tag{4}$$

where $\bar{\mathbf{y}}$ is an $N \times 1$ vector of per-patient averages over the $N$ patients, with $\bar{y}_i = \frac{1}{J_i} \sum_{j=1}^{J_i} y_{ij}$, and $\bar{\mathbf{X}}$ is an $N \times M$ matrix with rows $\bar{\mathbf{X}}_i = \frac{1}{J_i} \sum_{j=1}^{J_i} \mathbf{x}_{ij}^\top$. The expression of the CSCCS model below, which is free of $\boldsymbol{\alpha}$, is derived by substituting equation (4) into equation (3):

$$\arg\min_{\boldsymbol{\beta}} \left\| \mathbf{y} - \mathbf{Z}\bar{\mathbf{y}} - (\mathbf{X} - \mathbf{Z}\bar{\mathbf{X}})\boldsymbol{\beta} \right\|_2^2 \tag{5}$$

When we apply the CSCCS model to the high-dimensional longitudinal EHR data, we add an $L_1$ penalty term, because we assume that the level of a clinical variable is significantly influenced by only a small portion of the drugs. The $L_1$ penalization drives most components of $\boldsymbol{\beta}$ to zero or close to zero [19]. In other words, we simply want to know the drugs that are most correlated with the level change of the clinical variable. The final expression of the CSCCS model we apply to this problem is therefore:

$$\arg\min_{\boldsymbol{\beta}} \left\| \mathbf{y} - \mathbf{Z}\bar{\mathbf{y}} - (\mathbf{X} - \mathbf{Z}\bar{\mathbf{X}})\boldsymbol{\beta} \right\|_2^2 + \lambda \left\| \boldsymbol{\beta} \right\|_1 \tag{6}$$

where $\lambda > 0$ is the regularization parameter. $\lambda$ determines the sparsity of the optimized result, so we need to tune it to obtain a final result with a proper sparsity level.

To further filter out the drugs that do not have a significant effect on the clinical variables, our implementation also returns the p-value of each component of $\boldsymbol{\beta}$. We apply the same p-value cut-off strategy to the optimized result. Components of $\boldsymbol{\beta}$ with a p-value greater than 0.05 are regarded as insignificant effects, and we assume the corresponding drugs are uncorrelated with the level change of the clinical variable. The significant effects are divided into increasing and decreasing effects according to whether the coefficient is positive or negative. We then assign each drug-clinical variable pair an up ("↑"), down ("↓") or no ("-") relation type, just as for the clinical disease sign vectors.

Scoring Function

After we establish the clinical vectors for each drug-disease pair, we need to define a scoring function to calculate the repurposing possibility score for each drug-disease pair. The inference for each drug-disease pair is based on complementary and adverse effects. Specifically, a complementary effect refers to opposite relation types in the clinical disease sign vector and the clinical drug effect vector on the same clinical variable, while an adverse effect refers to the same relation type in both vectors on the same clinical variable. A complementary relation direction between the two vectors increases the final repurposing possibility score of a drug-disease pair, while an adverse relation direction decreases it. Here, we use a dot product-based scoring function to consider both the complementary and the adverse effects of a drug candidate on a disease. The scoring function can be written as follows:

$$TS_{disease-drug} = -CV_{Drug} \cdot CV_{Disease} \tag{7}$$

where $CV_{Drug}$ is the clinical drug effect vector and $CV_{Disease}$ is the clinical disease sign vector. We transform the 3 relation types in the clinical vectors ("↑", "↓" and "-") into numerical values (1.0, -1.0, 0.0) for the convenience of calculation. To rank the drugs in descending order and emphasize the most powerful drug candidates predicted by our model, we add a minus sign before the product of the two vectors. A positive result of this scoring function therefore means there are more complementary relation directions than adverse relation directions between a drug candidate and a disease, while a negative result indicates more adverse relation directions for the pair.

Results and Discussion

Evaluation Metrics

After we calculate the repurposing possibility score of each drug-disease pair, we need to show that the score is a suitable metric for judging whether a drug candidate is likely to be a potential treatment. The final drug candidate list is sorted by the repurposing possibility score in descending order for the convenience of validation. The validation data we use come from the Side Effect Resource (SIDER) [20], which contains drugs with indications or side effects for many kinds of disease conditions. We take it as ground truth to test whether our method can retrieve known indications of drugs. The hypothesis is that drug candidates with a higher repurposing possibility score are more likely to be treatments for the disease, which means that most of the top-ranking drugs can be found among the drugs with an indication and most of the bottom-ranking drugs can be found among the drugs with side effects provided by SIDER. In this case, drugs that cannot be found in the validation data but are still predicted with a high repurposing possibility score by our model could serve as potential treatments for the disease. We therefore need evaluation metrics to test whether the known drug-disease pairs are enriched at the top of our prediction list. We use two kinds of evaluation metrics to validate our prediction.
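The scoring and ranking step described above can be sketched in a few lines. The example below encodes the three relation types as 1, -1 and 0, applies the scoring function of equation (7), and sorts the candidates in descending order of score. The drug names and vectors are made-up placeholders (four clinical variables instead of 35), and `repurposing_score` and `rank_drugs` are hypothetical names, not functions from the authors' code.

```python
# Numerical encoding of the relation types "Up", "Down" and "No".
SIGN = {"up": 1, "down": -1, "no": 0}


def repurposing_score(drug_vector, disease_vector):
    """Negative dot product of a drug effect vector and a disease sign vector."""
    dot = sum(SIGN[d] * SIGN[s] for d, s in zip(drug_vector, disease_vector))
    return -dot


def rank_drugs(drug_vectors, disease_vector):
    """Return (drug, score) pairs sorted by score in descending order."""
    scores = {
        name: repurposing_score(vector, disease_vector)
        for name, vector in drug_vectors.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Placeholder vectors over four clinical variables (illustrative only).
disease_vector = ["up", "up", "down", "no"]
drug_vectors = {
    "drug_a": ["down", "down", "up", "no"],   # fully complementary
    "drug_b": ["up", "no", "no", "down"],     # one adverse relation
    "drug_c": ["no", "down", "down", "up"],   # one complementary, one adverse
}
ranking = rank_drugs(drug_vectors, disease_vector)
print(ranking)
```

In this toy case `drug_a` opposes the disease on three variables and ranks first, `drug_c` has one complementary and one adverse relation that cancel, and `drug_b` moves one variable in the same direction as the disease and ranks last.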

Precision at K

The first evaluation metric is precision at K. The top-K precision is the proportion of known treatments for a disease among the top K drug candidates predicted for that disease by our framework. For each disease, we rank the drugs by the calculated repurposing possibility score. We then compute the precision at K, for four cutoff values of K, for each disease using the K top-ranked drugs (e.g., precision at 10 corresponds to the proportion of correctly retrieved drugs among the top 10 ranked drugs).

Fold-Enrichment Test

Another evaluation metric to assess whether our repurposing possibility score is correlated with the likelihood that a disease-drug pair occurs is the fold-enrichment (FE) test. The FE score is defined by the following formula:

$$\text{FE Score} = \frac{n/m}{N/M} \tag{8}$$

where $M$ is the number of all the mapped drugs and $N$ is the number of drugs in the gold-standard dataset for the given disease condition. We divide all the mapped drugs evenly into several groups according to their repurposing possibility score, so $m$ is the total number of drugs in one group and $n$ is the number of drugs in that group that appear in the gold-standard dataset. The FE test shows the enrichment of known disease-drug pairs (we take the drug-disease pairs in SIDER as ground truth) within different score ranges. Our prediction is reasonable if the FE score is positively correlated with the repurposing possibility score. There are 392 drug-disease pairs for each disease condition in our experiment; all of them are ranked by repurposing possibility score and binned into groups of 80 pairs (the last group contains 72 drug-disease pairs). The scoring function is reasonable if the FE score decreases across the 5 groups in ascending group order, because the average repurposing possibility score of the groups decreases in that order.

Established Disease and Drug Vectors

In our experiment, we first establish all of the clinical disease and drug vectors. All the clinical disease sign vectors are presented in Table S2, and all the clinical drug effect vectors are presented in Table S3.

Then, we calculate the repurposing possibility score of the 392 drugs on the six disease conditions (asthma, coronary heart disease, congestive heart failure, heart attack, type 2 diabetes and stroke). The repurposing possibility score of each drug-disease pair is listed in Table S4. We also transform the table into a heatmap (Figure 4) to present the repurposing possibility scores visually. Due to the page limit, we only present the drugs that have an influence on any of the 6 diseases in our experiment (153 drugs). The complete heatmap can be found in Figure S3. We then validate our prediction for each of the six disease conditions. Each of the six disease conditions has a sufficient sample size, which makes our validation results more reliable. We extract a list of drugs for each of the six disease conditions from the drug indication information provided by SIDER. All of the drugs in the six lists are known to treat the respective disease conditions, so we take them as the ground truth and compare them with our prediction.
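The comparison between a ranked prediction list and a list of known treatments, using the precision at K defined earlier, can be sketched as follows. This is a minimal illustration: the drug identifiers, the set of known treatments and the function name `precision_at_k` are placeholders, not SIDER data or the authors' code.

```python
def precision_at_k(ranked_drugs, known_treatments, k):
    """Proportion of the top-k ranked drugs that are known treatments."""
    if k <= 0:
        raise ValueError("k must be positive")
    top_k = ranked_drugs[:k]
    hits = sum(1 for drug in top_k if drug in known_treatments)
    return hits / k


# Placeholder ranking (best score first) and placeholder ground truth.
ranked_drugs = ["d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9", "d10"]
known_treatments = {"d1", "d2", "d4", "d9"}

for k in (5, 10):
    print(k, precision_at_k(ranked_drugs, known_treatments, k))
```

Here three of the top 5 drugs and four of the top 10 drugs are in the placeholder ground truth, so the precision drops as K grows, which is the same qualitative pattern one would expect when known treatments are concentrated at the top of the list.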

Evaluation of Known Drug-disease Associations

The results of the precision at K for the 6 disease conditions are shown in Figure 2, which reports the precision of our prediction at four values of K. For type 2 diabetes, stroke, heart attack and congestive heart failure, most of the top-ranked drugs can be mapped to the ground truth (SIDER drug list) when K is small: the precision for all four disease conditions is greater than or equal to 0.8 when K = 5, and it decreases as K increases. However, the results for asthma and coronary heart disease were not as expected. For coronary heart disease, there are not many known drug-disease pairs in SIDER, which could be a reason for the low precision for this disease condition. Some of the clinical variables, such as cholesterol, LDL (low-density lipoprotein), HDL (high-density lipoprotein) and triglycerides, are more salient features than other clinical variables. Our analysis of a disease condition that does not have a strong correlation with these clinical variables could therefore have low precision. Apart from the known treatments of each target disease that can be found in the ground truth, there could be some unknown drug candidates that are likely to be treatments for the target disease.

Figure 2

Top K precision of each disease condition, i.e., the proportion of known drug-disease pairs among the top K ranking drugs in our prediction list. The prediction list is ranked by repurposing possibility score.

Figure 3

Fold-enrichment result of each disease condition. We divide the drug-disease pairs into 5 groups in descending order of average repurposing possibility score, so the negative linear relation between the group order and the FE score indicates a positive relationship between the FE score and the average repurposing possibility score. It shows that our scoring function is useful for finding drugs that have a therapeutic effect on target diseases.

The results of the FE test are shown in Figure 3. There is a negative linear relationship between the FE score and the group order. Since the average repurposing possibility score also decreases with ascending group order, there is a positive linear relationship between the FE score and the average repurposing possibility score. Figure 3 shows that all 6 disease conditions exhibit a negative linear relationship between their FE score and group order. Therefore, our scoring function is shown to be reasonable.
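The binning behind the FE test can be illustrated with a short sketch of equation (8). The function below splits a ranked list into consecutive groups and computes the FE score of each group. It is a sketch under assumptions: `fold_enrichment` is a hypothetical name, the drug identifiers and ground truth are placeholders, and $N$ is taken here as the number of known treatments that appear among the ranked drugs.

```python
def fold_enrichment(ranked_drugs, known_treatments, group_size):
    """Return the FE score (n/m) / (N/M) of each consecutive group."""
    total = len(ranked_drugs)                                   # M
    total_known = sum(d in known_treatments for d in ranked_drugs)  # N
    if total_known == 0:
        raise ValueError("no known treatments among the ranked drugs")
    scores = []
    for start in range(0, total, group_size):
        group = ranked_drugs[start:start + group_size]
        n = sum(d in known_treatments for d in group)  # known drugs in group
        m = len(group)                                 # drugs in group
        scores.append((n / m) / (total_known / total))
    return scores


# Placeholder ranking of 12 drugs (best score first), groups of 5, 5 and 2.
ranked_drugs = [f"d{i}" for i in range(1, 13)]
known_treatments = {"d1", "d2", "d3", "d7"}
fe_scores = fold_enrichment(ranked_drugs, known_treatments, group_size=5)
print([round(s, 3) for s in fe_scores])
```

With 392 ranked drugs and `group_size=80`, the same function yields five groups, the last containing 72 drugs, as in the experiment described above. An FE score above 1 means a group holds more known treatments than expected by chance, so scores that fall from the first group to the last are what the evaluation looks for.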

Case Study and Explainability

Having shown that our model successfully identifies known associations between drugs and diseases, we further demonstrate the explainability of our model via the corresponding complementarity between laboratory tests of drug effect vectors and disease sign vectors. To exemplify this, we select 5 drug-disease association pairs (i.e., Type 2 diabetes-Clopidogrel hydrogen sulfate, Type 2 diabetes-Doxycycline hyclate, Coronary heart disease-Alendronate sodium, Congestive heart failure-Alendronate sodium and Heart attack-Alendronate sodium). For a given disease, the selected drugs are in its top-20 predicted list but have not been indicated as treatments. To compare the clinical vectors of the drug candidates and the corresponding diseases, we present the clinical variables that contribute to their repurposing possibility scores in Table 1. All the detailed clinical vectors can be found in Table S2 and Table S3 of the Supplementary Materials. Combining the clinical disease sign vectors with the clinical drug effect vectors, we can analyze why the selected drug candidates are potential treatments for the corresponding disease conditions from the standpoint of the clinical variables included in our experiment.

In the case of type 2 diabetes, we found that clopidogrel hydrogen sulfate and doxycycline hyclate could have a therapeutic effect on type 2 diabetes. Clopidogrel hydrogen sulfate is an antiplatelet medication that can be used to reduce the risk of myocardial infarction and stroke [21]. A study reported that clopidogrel alleviates insulin resistance and improves glycemic control in type 2 diabetic patients [22], which is an important cause of insulin resistance. From the clinical drug effect vector of clopidogrel and the clinical disease sign vector of type 2 diabetes, we can see that clopidogrel and type 2 diabetes have opposite effects on the cholesterol and LDL levels. Lower cholesterol and LDL levels are biological markers of good glycaemic control [23], which is consistent with the literature. Doxycycline hyclate is an antibiotic that is primarily used to treat a wide range of bacterial infections. From the clinical vectors of doxycycline and type 2 diabetes, we can see that they have opposite effects on the serum glucose level. A high fasting blood glucose level is a common biological marker among type 2 diabetes patients. This finding is supported by a medical study showing that doxycycline can improve insulin resistance and fasting blood glucose level [24]. The analysis based on the opposite effects of type 2 diabetes and these two drugs supports our prediction: clopidogrel and doxycycline may be used as treatments for type 2 diabetes.

Alendronate sodium is usually used to treat osteoporosis [25]. We found that it can potentially have a therapeutic effect on cardiovascular disease, including congestive heart failure, heart attack and coronary heart disease.
Experiments show that alendronate can in-duce signiﬁcantly lower cardiovascular mortality andreduce the risk of cardiovascular incidents [26]. A pos-sible explanation given by this study is that boneand cardiovascular remodeling share some biologicalmarkers. From the clinical drug eﬀect vectors of al-endronate, we can see alendronate can lower alkalinephosphatase (ALP) and elevate the HDL level. Re-searches show that ALP can catalyze the inhibitor ofvascular calciﬁcation, thus high-level ALP may lead tovascular hardening and promotes the atheroscleroticprocess [27]. On the other hand, HDL will promote re-verse cholesterol transport, which could reduce the riskof cardiovascular events [28]. Thus, it seems possiblethat alendronate could be repurposed as a treatmentfor cardiovascular disease.
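The complementarity argument of this case study can be illustrated with the signs listed in Table 1. The sketch below is an illustration under stated assumptions, not the paper's scoring function: it encodes ↑ as +1, ↓ as -1 and no listed effect as 0, restricts the vectors to the six laboratory tests of Table 1 (the full vectors cover 35 tests), and uses the negated dot product as a stand-in score. The names `complementary_tests` and `toy_score` are hypothetical.

```python
TESTS = ["Alkaline Phosphatase", "Cholesterol", "Glucose", "HDL", "LDL", "Triglycerides"]

# Signs transcribed from Table 1: +1 = up, -1 = down, 0 = no listed effect.
DISEASE_SIGNS = {
    "Type 2 Diabetes": [-1, 1, 1, -1, 1, 1],
    "Coronary Heart Disease": [1, 1, 1, -1, 1, 1],
}
DRUG_EFFECTS = {
    "Clopidogrel Hydrogen Sulfate": [0, -1, 0, 0, -1, 0],
    "Doxycycline Hyclate": [0, 0, -1, 0, 0, 0],
    "Alendronate Sodium": [-1, 0, 0, 1, 0, 0],
}


def complementary_tests(disease, drug):
    """Return the laboratory tests on which the drug acts against the disease sign."""
    signs = DISEASE_SIGNS[disease]
    effects = DRUG_EFFECTS[drug]
    return [name for name, s, e in zip(TESTS, signs, effects) if s * e < 0]


def toy_score(disease, drug):
    """Negated dot product: an assumed stand-in, not the paper's exact function."""
    signs = DISEASE_SIGNS[disease]
    effects = DRUG_EFFECTS[drug]
    return -sum(s * e for s, e in zip(signs, effects))


t2d_clopidogrel = complementary_tests("Type 2 Diabetes", "Clopidogrel Hydrogen Sulfate")
chd_alendronate = complementary_tests("Coronary Heart Disease", "Alendronate Sodium")
```

With this encoding, the tests returned for each pair are the ones discussed in the text: cholesterol and LDL for clopidogrel, glucose for doxycycline, and ALP and HDL for alendronate.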

Highly related drugs and diseases

In Fig. 4, we show part of the repurposing possibility scores in the form of a heat map. To further explore the relations among different drugs and diseases, we apply a bi-clustering algorithm to the drugs and diseases in Fig. 4. Bi-clustering is a data mining technique that simultaneously clusters the row and column sets of a data matrix [29]. Given an m × n matrix, a bi-clustering algorithm generates a new m × n matrix in which a subset of rows exhibits similar behavior across a subset of columns, or vice versa. In our work, we use bi-clustering to find different drugs with a similar effect on a disease and different diseases that can be treated with the same kind of drug. The clustering result is plotted in Fig. 5. As shown in Fig. 5, type 2 diabetes has a strong correlation with heart diseases and stroke. We also find that many drugs that decrease blood lipid or blood sugar levels have a therapeutic effect on those diseases. In the future, these findings can help to find potential drug-disease pairs.
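The text does not name the bi-clustering algorithm that was used, so the following is only a sketch of the general idea with a swapped-in technique: spectral bipartitioning, which splits rows and columns into two biclusters using the second singular vectors of the degree-normalized matrix. It assumes NumPy is available and that the matrix is non-negative with no all-zero row or column; the function name, drug and disease labels, and toy scores are hypothetical.

```python
import numpy as np


def spectral_bipartition(scores):
    """Split rows and columns of a non-negative matrix into two biclusters.

    Uses the second left/right singular vectors of the matrix after scaling
    it by the inverse square roots of its row and column sums.
    """
    a = np.asarray(scores, dtype=float)
    if (a < 0).any():
        raise ValueError("this sketch assumes non-negative scores")
    row_scale = 1.0 / np.sqrt(a.sum(axis=1))
    col_scale = 1.0 / np.sqrt(a.sum(axis=0))
    normalized = row_scale[:, None] * a * col_scale[None, :]
    u, _, vt = np.linalg.svd(normalized, full_matrices=False)
    # The first singular pair is trivial; the sign of the second gives the split.
    row_labels = (row_scale * u[:, 1] > 0).astype(int)
    col_labels = (col_scale * vt[1, :] > 0).astype(int)
    return row_labels, col_labels


# Toy scores (hypothetical values): rows are drugs, columns are diseases.
drugs = ["drug_a", "drug_b", "drug_c", "drug_d", "drug_e"]
diseases = ["disease_1", "disease_2", "disease_3", "disease_4"]
toy = np.array([
    [4.0, 5.0, 0.2, 0.1],
    [5.0, 4.0, 0.1, 0.2],
    [0.1, 0.2, 5.0, 4.0],
    [0.2, 0.1, 4.0, 5.0],
    [0.1, 0.1, 5.0, 5.0],
])
row_labels, col_labels = spectral_bipartition(toy)

# Reorder the heat map so that members of the same bicluster are adjacent.
row_order = np.argsort(row_labels, kind="stable")
col_order = np.argsort(col_labels, kind="stable")
reordered = toy[row_order][:, col_order]
```

Rows and columns that share a label form one bicluster, that is, a group of drugs with similar scores across a group of diseases. Real repurposing scores may be negative and may need more than two biclusters, in which case a different bi-clustering method would be required.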

Table 1

Selected previously unknown drug-disease pairs predicted by our method. Only the clinical variables that contribute to the final repurposing possibility score of each drug-disease pair are shown.

| Type | Name | Alkaline Phosphatase | Cholesterol | Glucose | HDL | LDL | Triglycerides |
|---|---|---|---|---|---|---|---|
| Disease | Type 2 Diabetes | ↓ | ↑ | ↑ | ↓ | ↑ | ↑ |
| Drug | Clopidogrel Hydrogen Sulfate | - | ↓ | - | - | ↓ | - |
| Disease | Type 2 Diabetes | ↓ | ↑ | ↑ | ↓ | ↑ | ↑ |
| Drug | Doxycycline Hyclate | - | - | ↓ | - | - | - |
| Disease | Coronary Heart Disease | ↑ | ↑ | ↑ | ↓ | ↑ | ↑ |
| Drug | Alendronate Sodium | ↓ | - | - | ↑ | - | - |
| Disease | Congestive Heart Failure | ↑ | ↑ | ↑ | ↓ | ↑ | ↑ |
| Drug | Alendronate Sodium | ↓ | - | - | ↑ | - | - |
| Disease | Heart Attack | ↑ | ↑ | ↑ | ↓ | ↑ | ↑ |
| Drug | Alendronate Sodium | ↓ | - | - | ↑ | - | - |

Limitations and further work

The verification results above show that our framework may identify some potential drug indications and thus help researchers find novel uses of existing drugs. However, our framework still has some limitations and room for improvement.

First, we only include 6 diseases and 392 drugs in our work. Other disease conditions and drugs can be found in the NHANES and EHR datasets, but many of them have sample sizes too small to give a reliable result. To keep the results reliable, every drug and disease included in this work has a sample size larger than 1000. Because of this threshold, the experiments are conducted on 6 disease conditions and 392 drugs, for which the results are reliable and robust. In the future, we can include more drug-disease pairs by using a larger-scale dataset.

The second limitation is the set of clinical variables involved in the experiment. Hundreds of clinical variables (laboratory tests) can be found in the NHANES dataset, but they must be matched with the clinical variables in the EHR dataset, which leaves the 35 kinds of clinical variables used here. These 35 clinical variables cannot completely reflect human physiological activity; this limitation could be addressed with a larger EHR dataset that contains more clinical variables.

Conclusion

In this paper, we establish a computational drug repurposing framework using electronic clinical information from the National Health and Nutrition Examination Survey (NHANES) and Electronic Health Records (EHR). We consider both the opposite and the same expressions between the clinical disease sign vector and the clinical drug effect vector of each drug-disease pair to calculate the repurposing possibility score. Our inferences of novel uses for different drugs are based on their repurposing possibility scores with different disease conditions. We verify our predictions by a fold-enrichment test and top-K precision, and we further support the feasibility of our model with a literature analysis of the prediction results. The results show that our framework can not only retrieve the known indications of existing drugs but also find previously unknown indications. Our framework can therefore potentially be used in drug repurposing tasks.

Acknowledgements

Not applicable.

Authors' contributions

PZ conceived the project. QW, RL, and PZ developed the method. QW conducted the experiments. QW, RL, and PZ analyzed experimental results. QW, RL, and PZ wrote the manuscript. All authors read and approved the final manuscript.

Funding

This work was funded in part by the National Center for Advancing Translational Research of the National Institutes of Health under award number CTSA Grant UL1TR002733. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.

Availability of data and materials

NHANES data analysed in the study is available from the National Center for Health Statistics. The source code for reproducing the results is available at https://github.com/HoytWen/CCMDR.

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Author details

Department of Electrical and Computer Engineering, The Ohio State University, 2015 Neil Ave, 43210 Columbus, Ohio, USA. Department of Computer Science and Engineering, The Ohio State University, 2015 Neil Ave, 43210 Columbus, Ohio, USA. Department of Biomedical Informatics, The Ohio State University, 1800 Cannon Drive, 43210 Columbus, Ohio, USA.

References

1. O’Connor, K.A., Roth, B.L.: Finding new tricks for old drugs: an efficient route for public-sector drug discovery. Nature reviews Drug discovery (12), 1005 (2005)
2. Chong, C.R., Sullivan Jr, D.J.: New uses for old drugs. Nature (7154), 645 (2007)
3. DiMasi, J.A.: New drug development in the United States from 1963 to 1999. Clinical Pharmacology and Therapeutics (5), 286–296 (2001)
4. Adams, C.P., Brantner, V.V.: Estimating the cost of new drug development: is it really 802 million? Health affairs (2), 420–428 (2006)
5. Pushpakom, S., Iorio, F., Eyers, P.A., Escott, K.J., Hopper, S., Wells, A., Doig, A., Guilliams, T., Latimer, J., McNamee, C., et al.: Drug repurposing: progress, challenges and recommendations. Nature reviews Drug discovery (1), 41–58 (2019)
6. Keiser, M.J., Setola, V., Irwin, J.J., Laggner, C., Abbas, A.I., Hufeisen, S.J., Jensen, N.H., Kuijer, M.B., Matos, R.C., Tran, T.B., et al.: Predicting new molecular targets for known drugs. Nature (7270), 175 (2009)
7. Lamb, J., Crawford, E.D., Peck, D., Modell, J.W., Blat, I.C., Wrobel, M.J., Lerner, J., Brunet, J.-P., Subramanian, A., Ross, K.N., et al.: The connectivity map: using gene-expression signatures to connect small molecules, genes, and disease. Science (5795), 1929–1935 (2006)
8. Lamb, J.: The connectivity map: a new tool for biomedical research. Nature reviews cancer (1), 54–60 (2007)
9. Luo, H., Zhang, P., Cao, X.H., Du, D., Ye, H., Huang, H., Li, C., Qin, S., Wan, C., Shi, L., et al.: DPDR-CPI, a server that predicts drug positioning and drug repositioning via chemical-protein interactome. Scientific reports (1), 1–9 (2016)
10. Wang, Y., Chen, S., Deng, N., Wang, Y.: Drug repositioning by kernel-based integration of molecular structure, molecular activity, and phenotype data. PloS one (11) (2013)
11. Zhang, W., Yue, X., Lin, W., Wu, W., Liu, R., Huang, F., Liu, F.: Predicting drug-disease associations by using similarity constrained matrix factorization. BMC bioinformatics (1), 1–12 (2018)
12. Pammolli, F., Magazzini, L., Riccaboni, M.: The productivity crisis in pharmaceutical R&D. Nature reviews Drug discovery (6), 428–438 (2011)
13. Jung, J., Lee, D.: Inferring disease association using clinical factors in a combinatorial manner and their use in drug repositioning. Bioinformatics (16), 2017–2023 (2013)
14. Jang, D., Lee, S., Lee, J., Kim, K., Lee, D.: Inferring new drug indications using the complementarity between clinical disease signatures and drug effects. Journal of biomedical informatics, 248–257 (2016)
15. Kuang, Z., Thomson, J., Caldwell, M., Peissig, P., Stewart, R., Page, D.: Computational drug repositioning using continuous self-controlled case series. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 491–500 (2016). ACM
16. Ghalwash, M., Li, Y., Zhang, P., Hu, J.: Exploiting electronic health records to mine drug effects on laboratory test results. In: Proceedings of the 2017 ACM on Conference on Information and Knowledge Management, pp. 1837–1846 (2017)
17. Cdc, C.: National health and nutrition examination survey. ncfhs (nchs). US Department of Health and Human Services, Centers for Disease Control and Prevention (2005)
18. Storey, J.D., Tibshirani, R.: Statistical significance for genomewide studies. Proceedings of the National Academy of Sciences (16), 9440–9445 (2003)
19. Tibshirani, R.: Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological) (1), 267–288 (1996)
20. Kuhn, M., Letunic, I., Jensen, L.J., Bork, P.: The SIDER database of drugs and side effects. Nucleic acids research (D1), 1075–1079 (2015)
21. Jarvis, B., Simpson, K.: Clopidogrel. Drugs (2), 347–377 (2000)
22. Taher, M.A., Nassir, E.S.: Beneficial effects of clopidogrel on glycemic indices and oxidative stress in patients with type 2 diabetes. Saudi Pharmaceutical Journal (2), 107–113 (2011)
23. Khan, H., Sobki, S., Khan, S.: Association between glycaemic control and serum lipids profile in type 2 diabetic patients: HbA1c predicts dyslipidaemia. Clinical and experimental medicine (1), 24–29 (2007)
24. Wang, N., Tian, X., Chen, Y., Tan, H.-q., Xie, P.-j., Chen, S.-j., Fu, Y.-c., Chen, Y.-x., Xu, W.-c., Wei, C.-j.: Low dose doxycycline decreases systemic inflammation and improves glycemic control, lipid profiles, and islet morphology and function in db/db mice. Scientific reports (1), 1–15 (2017)
25. Porras, A.G., Holland, S.D., Gertz, B.J.: Pharmacokinetics of alendronate. Clinical pharmacokinetics (5), 315–328 (1999)
26. Sing, C.-W., Wong, A.Y., Kiel, D.P., Cheung, E.Y., Lam, J.K., Cheung, T.T., Chan, E.W., Kung, A.W., Wong, I.C., Cheung, C.-L.: Association of alendronate and risk of cardiovascular events in patients with hip fracture. Journal of Bone and Mineral Research (8), 1422–1434 (2018)
27. Panh, L., Ruidavets, J.B., Rousseau, H., Petermann, A., Bongard, V., Bérard, E., Taraszkiewicz, D., Lairez, O., Galinier, M., Carrié, D., et al.: Association between serum alkaline phosphatase and coronary artery calcification in a sample of primary cardiovascular prevention patients. Atherosclerosis, 81–86 (2017)
28. Rader, D.J., Hovingh, G.K.: HDL and cardiovascular disease. The Lancet (9943), 618–625 (2014)
29. Mirkin, B.: Mathematical Classification and Clustering vol. 11. Springer, ??? (2013)

Additional Files

Table S1: Laboratory Test List
This table includes the names of the 35 kinds of laboratory tests involved in our experiments and their corresponding NHANES codes. (Supplementary Data.pdf)

Table S2: Clinical Disease Sign Vector
This table presents the 6 kinds of disease conditions involved in our experiment and their influences on the 35 kinds of laboratory tests. (DiseaseVector.csv)

Table S3: Clinical Drug Effect Vector
This table presents the 392 kinds of existing drugs or drug combinations involved in our experiment and their influences on the 35 kinds of laboratory tests. (DrugVector.csv)

Table S4: Repurposing Possibility Score
This table includes the repurposing possibility score of each drug-disease pair in our experiment. (Repurposing Possibility Score.csv)

Figure S1: Disease Clinical Variable Statistics
The figure presents the number of diseases that increase (Up) or decrease (Down) the level of each clinical variable. The X-axis is the name of each clinical variable and the Y-axis is the number of diseases. Blue bars stand for the "Up" relation and red bars for the "Down" relation. (Supplementary Data.pdf)

Figure S2: Drug Clinical Variable Statistics
The figure presents the number of drugs that increase (Up) or decrease (Down) the level of each clinical variable. The X-axis is the name of each clinical variable and the Y-axis is the number of drugs. Blue bars stand for the "Up" relation and red bars for the "Down" relation. (Supplementary Data.pdf)

Figure S3: Detailed Drug-Disease Heat Map
We transform the repurposing possibility score table into a heat map and present it in this figure. This version includes the repurposing possibility scores of all the drug-disease pairs. (HeatMap(full).png)

Figure 4

Heat map of drug-disease repurposing possibility scores. The X-axis shows the 6 disease conditions and the Y-axis lists the 153 drugs that have an influence on any of the diseases involved in our experiment. The color bar above the heat map indicates the scores that the different colors in the heat map stand for.