Accelerating drug repurposing for COVID-19 via modeling drug mechanism of action with large scale gene-expression profiles
Lu Han +, G.C. Shan *, H.Y. Wang, S.Q. Gao, W. Zhou *
Beijing Institute of Pharmacology and Toxicology, State Key Laboratory of Toxicology and Medical Countermeasures, Beijing 100850, China; E-mail: [email protected]
Prof. Dr. G. C. Shan, School of Instrumentation Science and Opto-electronics Engineering & Beijing Advanced Innovation Center for Big Data-based Precision Medicine, Beihang University, Beijing 100083, China
* Corresponding authors. Prof. Dr. G.C. Shan, E-mail: [email protected]; Address: Beihang University, No.37 Xueyuan Road, Haidian District, Beijing 100191, China. Prof. WX Zhou, E-mail: [email protected]; Tel: 86-10-66931422; Address: State Key Laboratory of Toxicology and Medical Countermeasures, No.27 Taiping Road, Haidian District, Beijing, 100850, China.
+ These authors contributed equally to this work.
Abstract
The novel coronavirus disease, named COVID-19, emerged in China in December 2019 and has rapidly spread around the world. Fighting COVID-19 at a global scale is urgent. Methods that identify drug uses from phenotypic data can improve the efficiency of drug development; however, identifying drug applications from cell image data remains difficult. This work reports a state-of-the-art machine learning method that identifies drug uses from the cell image features of 1024 drugs generated in the LINCS program. Because the multi-dimensional image features are affected by non-experimental factors, the features of similar drugs vary greatly, and the current number of samples is too small for deep learning and related methods to be optimized. This study therefore uses the supervised ITML algorithm to transform the drug features. The results show that ITML-transformed features are more conducive to recognizing drug functions, and the analysis of the feature transformation shows that different features play important roles in identifying different drug functions. For COVID-19, chloroquine and hydroxychloroquine, which achieve antiviral effects by inhibiting endocytosis among other mechanisms, were classified into the same community. Clomiphene, in the same community, inhibits the entry of Ebola virus, indicating a similar MoA that can be reflected in cell images.
Keywords: coronavirus, drug repurposing, machine learning, cell image feature, LINCS
Introduction
Coronavirus disease 2019 (COVID-19) is an emerging disease caused by the SARS-CoV-2 virus. Since the outbreak of COVID-19 in December 2019 in Wuhan, more than 80 thousand patients have been affected in China and nearly one million worldwide. According to the National Health Commission of the People's Republic of China and the WHO, the disease had resulted in more than 100 thousand deaths by April 15, 2020. Unfortunately, the numbers of diagnosed patients and deaths are still rising, and no proven effective medicine or treatment is available thus far for patients who have contracted this emerging infection. Experience in managing the coronavirus epidemics of severe acute respiratory syndrome (SARS) and Middle East respiratory syndrome (MERS) has limited applicability for treating COVID-19. Medical teams from China and all over the world have been fully engaged at the frontline to deal with the COVID-19 epidemic. [1]
They are also actively conducting scientific studies on the pathogenesis of the disease caused by this virus, its mode of transmission, clinical profiles, management, and prevention.
Drug repurposing can quickly identify potential therapeutic drugs among drugs with known safety profiles; such drugs can be moved rapidly into the clinic to address the treatment needs of COVID-19. [1-8]
At present, a large number of candidate drugs have already entered clinical trials for the treatment of COVID-19.
In addition to drugs that interact directly with the virus, a large number of drugs may exert antiviral effects through host targets. For example, chloroquine can inhibit endocytosis of the virus by changing the intracellular environment, for instance by increasing the intracellular pH, and thereby achieve an antiviral effect; some drugs targeting cellular pathways may also produce anti-coronavirus effects (see "Antiviral potential of ERK/MAPK and PI3K/AKT/mTOR signaling modulation for Middle East respiratory syndrome coronavirus infection as identified by temporal kinome analysis"). Because virus invasion, replication and release are highly host-dependent, analyzing the effect of drugs on cells is very important for finding effective antiviral drugs. Transcriptome and proteome data, among others, capture the direct influence of drugs on intracellular molecules and are very helpful for discovering new uses of drugs. Image data, by contrast, are relatively cheap to obtain, but because they are complex and do not directly reflect molecular characteristics, they remain difficult to use for discovering drug functions. In this study, the mode of action of drugs was identified by analyzing image data of 1024 drugs acting on cells. These 1024 drugs represent more than 300 different mechanisms of action. With so few samples and so many classes, it is difficult for machine learning methods to use these data effectively to identify modes of action. Because the multi-dimensional image features are affected by non-experimental factors, the features of similar drugs vary greatly, and the current number of samples is too small for deep learning and related methods to be optimized. [2-9]
This study therefore uses the supervised ITML algorithm to transform the drug features. The results show that ITML-transformed features are more conducive to recognizing drug functions. In this paper, 1105 drugs are analyzed with a machine learning clustering algorithm, and several data preprocessing methods are used to improve the clustering. For preprocessing, we compare the Principal Component Analysis (PCA) algorithm with a metric learning algorithm; for clustering, we use the Affinity Propagation (AP) algorithm. [10-14]
We adopted the ITML algorithm to identify drugs with similar mechanisms by optimizing the distance metric on drug image features. Compared with the original data and with PCA-processed data, the ITML-based method is better at identifying drugs with similar mechanisms. In the enrichment analysis of drug types with a sample size exceeding 5, 39 clusters were found to be enriched with 604 types of drugs, suggesting that a variety of cell features may be used to discover the mechanism of action (MoA) of a drug.
Material and Methods
Data Collection
The cell imaging data set containing 372 drugs was sorted and screened, and a total of 1105 image profiles (each with 812-dimensional image information) encompassing cell responses to the 372 drugs were collected (Supplementary Table S1). These 372 drugs represent a broad range of clinical uses. The image data represent the most intuitive phenotypic effects of these drugs on cells. Each profile contains 812 features, including Cells_AreaShape_Area, Cells_AreaShape_Compactness, Cells_AreaShape_Eccentricity, etc. The raw values range from -384 to +677 across dimensions. We applied mean-variance (z-score) normalization, after which each dimension has mean 0 and variance 1. We collected the MoA information of the drugs from the LINCS database, which covers a total of 372 kinds of MoA; 49 kinds of MoA have 5 or more drugs, and the largest MoA is shared by as many as 43 drugs.
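The mean-variance normalization described above can be illustrated with a minimal NumPy sketch. The function name `zscore_normalize` and the synthetic matrix are assumptions made for illustration; they are not the code or the data used in this study.

```python
import numpy as np


def zscore_normalize(features):
    """Standardize each column (feature) to mean 0 and variance 1.

    `features` is an (n_samples, n_features) array. Columns with zero
    variance are mapped to 0 instead of dividing by zero.
    """
    features = np.asarray(features, dtype=float)
    mean = features.mean(axis=0)
    std = features.std(axis=0)
    safe_std = np.where(std == 0, 1.0, std)  # guard constant columns
    return (features - mean) / safe_std


if __name__ == "__main__":
    # Synthetic stand-in with the same shape as the 1105 x 812 matrix;
    # the values are random and purely illustrative.
    rng = np.random.default_rng(0)
    raw = rng.normal(loc=5.0, scale=30.0, size=(1105, 812))
    normalized = zscore_normalize(raw)
    print(normalized.mean(axis=0)[:3], normalized.var(axis=0)[:3])
```

Normalizing per feature keeps dimensions with a large raw range from dominating the distance computations used later.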
PCA algorithm
PCA identifies the most important components of the input data and is often used for dimensionality reduction in machine learning. With PCA, an n-dimensional vector can be reduced to an m-dimensional vector as follows:
$$X^{(m)} = \mathrm{PCA}\left(X^{(n)}\right), \quad m < n$$
where $X^{(n)}$ is the original data and $X^{(m)}$ is the output obtained by mapping the original data from the n-dimensional space to the m-dimensional space. Note that PCA not only reduces the dimension of the input data but also extracts the more informative features of the original data.
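As an illustration of the mapping from $X^{(n)}$ to $X^{(m)}$, the following sketch implements PCA through a singular value decomposition in NumPy. The function name `pca_reduce` and the choice m = 50 in the demo are hypothetical; the text does not state the number of components that was kept.

```python
import numpy as np


def pca_reduce(x_n, m):
    """Project (samples, n) data onto its top-m principal components.

    Returns the reduced data and the fraction of variance explained by
    each kept component.
    """
    x_n = np.asarray(x_n, dtype=float)
    if not 0 < m <= min(x_n.shape):
        raise ValueError("m must be between 1 and min(samples, n)")
    centered = x_n - x_n.mean(axis=0)
    # Rows of vt are principal directions, ordered by decreasing variance.
    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
    x_m = centered @ vt[:m].T
    explained = singular_values[:m] ** 2 / np.sum(singular_values ** 2)
    return x_m, explained


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x_n = rng.normal(size=(1105, 812))       # synthetic stand-in data
    x_m, explained = pca_reduce(x_n, m=50)   # m = 50 is an arbitrary choice
    print(x_m.shape, float(explained.sum()))
```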
Metric learning
In addition to PCA, this paper also uses metric learning to preprocess the input data before clustering. Specifically, we use the Information-theoretic metric learning (ITML) algorithm to learn the similarity between input data. Metric learning is a family of machine learning algorithms for learning the similarity between data and is widely used in face recognition and other fields. It learns a task-specific distance function that quantifies the similarity of input data. Compared with deep learning, metric learning is more practical here. A deep learning model trained for a specific task adapts only to data similar to its training samples, [10-14] and tends to produce large errors on inputs that differ greatly from them. Moreover, when new data categories are added, the model has to be retrained, which costs considerable resources and time. Deep learning is therefore often limited in practical applications. Metric learning handles this problem well: it increases the similarity between data of the same type and decreases the similarity between data of different types, so clustering after metric learning is expected to be more accurate. In this paper, we use the ITML algorithm to preprocess the input data as follows:
$$S_{n \times n} = \mathrm{ITML}\left(\mathrm{Input}_{m \times n}\right)$$
$$\mathrm{Output}_{m \times n} = \mathrm{Input}_{m \times n} \times S_{n \times n}$$
where $\mathrm{Input}_{m \times n}$ is the input data, m is the number of input samples and n is the dimension of each sample; $S_{n \times n}$ is the similarity matrix learned by metric learning; and $\mathrm{Output}_{m \times n}$ is the output data after metric learning. In $\mathrm{Output}_{m \times n}$, drugs of the same class are closer together than in the original data, and drugs of different classes are farther apart.
AP clustering
After data preprocessing, we cluster the data. Clustering algorithms can be supervised or unsupervised. Supervised clustering algorithms often need prior information, such as the categories to be clustered. Unsupervised clustering algorithms usually need no prior information and instead rely on properties of the input data, such as its density or mean. They work well on data that are difficult to label or assign to a category. In this paper, we use the unsupervised Affinity Propagation (AP) algorithm for clustering.
Figure 0. The distance plots for data distribution.
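To make the AP clustering step concrete, the sketch below implements the message-passing updates of Affinity Propagation [11] in NumPy. It is a simplified illustration under stated assumptions, not the implementation used in this study: the similarity is taken as the negative squared Euclidean distance, the preference defaults to the median off-diagonal similarity, and the damping factor and iteration count are arbitrary choices.

```python
import numpy as np


def affinity_propagation(points, preference=None, damping=0.9, max_iter=500):
    """Cluster the rows of `points`; returns (exemplar_indices, labels)."""
    points = np.asarray(points, dtype=float)
    n = points.shape[0]
    sq = np.sum(points ** 2, axis=1)
    # Similarity: negative squared Euclidean distance (assumed here).
    s = -np.maximum(sq[:, None] + sq[None, :] - 2.0 * points @ points.T, 0.0)
    if preference is None:
        preference = np.median(s[~np.eye(n, dtype=bool)])
    np.fill_diagonal(s, preference)

    r = np.zeros((n, n))  # responsibilities
    a = np.zeros((n, n))  # availabilities
    rows = np.arange(n)
    for _ in range(max_iter):
        # Responsibility: r(i,k) = s(i,k) - max_{k' != k} (a(i,k') + s(i,k'))
        scores = a + s
        best = np.argmax(scores, axis=1)
        first = scores[rows, best]
        scores[rows, best] = -np.inf
        second = scores.max(axis=1)
        max_other = np.repeat(first[:, None], n, axis=1)
        max_other[rows, best] = second
        r = damping * r + (1.0 - damping) * (s - max_other)
        # Availability: sum positive responsibilities sent to each candidate.
        rp = np.maximum(r, 0.0)
        np.fill_diagonal(rp, np.diag(r))
        a_new = rp.sum(axis=0)[None, :] - rp
        self_avail = np.diag(a_new).copy()
        a_new = np.minimum(a_new, 0.0)
        np.fill_diagonal(a_new, self_avail)
        a = damping * a + (1.0 - damping) * a_new

    exemplars = np.flatnonzero(np.diag(a + r) > 0)
    if exemplars.size == 0:
        return exemplars, np.full(n, -1)  # no exemplar emerged
    labels = np.argmax(s[:, exemplars], axis=1)
    labels[exemplars] = np.arange(exemplars.size)
    return exemplars, labels
```

The number of clusters is not given in advance; it emerges from the preference value, which is why AP suits data whose categories are hard to fix beforehand.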
Results and Discussions
Chloroquine and hydroxychloroquine, the drugs currently considered effective for COVID-19, have similar drug characteristics even though their MoA labels differ. On the one hand, they can inhibit T cell proliferation and reduce the release of proinflammatory cytokines; on the other hand, they can increase the pH of endosomes and block endocytosis. Both appear in cluster 21. This result indicates that their common mechanism is related to the drug image phenotype (DIP) observed after they act on cells. The other drugs in cluster 21 have known mechanisms of action that differ from those of these two drugs, suggesting that they may nonetheless have similar effects; these DIP-like drugs may have similar antiviral effects. Clomifene in cluster 21, whose MoA is annotated as estrogen receptor antagonist, has been found to act against Ebola virus, suggesting that it may have a similar mechanism of action. This work lays a foundation for discovering new drug mechanisms from image data and provides a new method for COVID-19 drug repositioning. As a next step, we will conduct antiviral screening experiments on some of the newly predicted drugs.
Cell imaging dataset containing 1105 drugs
The cell imaging data set containing 372 drugs was sorted and screened (see Methods), and a total of 1105 image profiles (each with 812-dimensional image information) encompassing cell responses to the 372 drugs were collected (Supplementary Table S1). These 372 drugs represent a broad range of clinical uses. The image data represent the most intuitive phenotypic effects of these drugs on cells. Each profile contains 812 features, including Cells_AreaShape_Area, Cells_AreaShape_Compactness, Cells_AreaShape_Eccentricity, etc. The raw values range from -384 to +677 across dimensions, and more than 98.8% of the data lie between -20 and 20. We applied mean-variance (z-score) normalization. After normalization, the data in each dimension follow a normal distribution with mean 0 and variance 1, with a maximum value of 13.934 and a minimum value of -7.930. The distribution of the original data is shown in Figure 1a, and the distribution of the normalized data is shown in Figure 1b.
Figure 1. Data distribution plots: (a) distribution of the original data; (b) distribution of the normalized data; (c) distribution of MoA types, shown as a pie chart (five types are not represented in the pie chart), with the top 10 MoA types presented as a bar chart.
We collected the MoA information of the drugs from the LINCS database, which covers a total of 372 kinds of MoA; 49 kinds of MoA have 5 or more drugs, and the largest MoA is shared by as many as 43 drugs. The types that contain the most drugs include adrenergic receptor antagonist, dopamine receptor antagonist, cyclooxygenase inhibitor and serotonin receptor antagonist. The distribution is shown in Figure 1c (the pie chart represents MoA, five types are not represented in the pie chart, and the top 10 types are shown as a bar chart). The overall flowchart of this work is illustrated in Figure 2.
Figure 2. Overall flowchart of the present study.
ITML for transforming the 812-dimensional image data
In computer vision research, the value of a feature for classification depends on the purpose of the classification. The same group of pictures can be classified according to different attributes such as shape, color and size. Supervised learning methods, such as deep learning, can automatically discover efficient classification criteria from a training set; however, their performance drops sharply on small data sets. In general image recognition tasks, such as recognition of human photos, one can manually select suitable features and construct a distance function. If the goal is to recognize a human face, the distance function should emphasize appropriate features such as hair color and face shape; if the goal is to recognize posture, the distance function should capture the similarity of posture. Such hand-crafted features often consume a lot of human effort and may not be robust to changes in the data. ITML, a supervised global metric learning method, is an alternative [10-14]: it learns a distance function tailored to the specific learning task.
Figure 3. t-SNE plots of (a) the original data and (b) the ITML-processed data.
We use the ITML method to learn the distance metric for the drug MoA classification task, with num_constraints (number of constraints to generate) = 20, max_iter = 1000 and convergence_threshold = 0.001. The t-SNE plots of the top ten drugs before and after learning are given in Figure 3a and Figure 3b. By training on the MoA of all drugs, we obtain the T matrix; each 812-dimensional vector is transformed into a new vector by passing it through this T matrix obtained by supervised learning.
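The following NumPy sketch illustrates how a metric of this kind can be learned and then applied as a linear transformation. It is a simplified version of the cyclic Bregman-projection updates described for ITML [12], not the code used in this study. The function names, the way constraint pairs are chosen from class labels, and the percentile-based distance bounds are all assumptions made for illustration; only max_iter = 1000, convergence_threshold = 0.001 and the count of 20 constraints are taken from the text.

```python
from itertools import combinations

import numpy as np


def itml_fit(x, similar, dissimilar, upper, lower, gamma=1.0,
             max_iter=1000, convergence_threshold=1e-3):
    """Learn a Mahalanobis matrix A from pairwise constraints.

    similar / dissimilar are lists of (i, j) row-index pairs; upper and
    lower bound the squared learned distance of those pairs.
    """
    x = np.asarray(x, dtype=float)
    pairs = [(i, j, 1.0) for i, j in similar]
    pairs += [(i, j, -1.0) for i, j in dissimilar]
    a = np.eye(x.shape[1])                  # prior: plain Euclidean metric
    lam = np.zeros(len(pairs))              # dual variables
    xi = np.array([upper] * len(similar) + [lower] * len(dissimilar),
                  dtype=float)              # slack-adjusted targets
    gamma_proj = gamma / (gamma + 1.0)
    for _ in range(max_iter):
        lam_old = lam.copy()
        for c, (i, j, delta) in enumerate(pairs):
            v = x[i] - x[j]
            p = v @ a @ v                   # current squared distance
            if p <= 0.0:
                continue                    # identical points: skip
            alpha = min(lam[c], delta * gamma_proj * (1.0 / p - 1.0 / xi[c]))
            beta = delta * alpha / (1.0 - delta * alpha * p)
            xi[c] = gamma * xi[c] / (gamma + delta * alpha * xi[c])
            lam[c] -= alpha
            av = a @ v
            a += beta * np.outer(av, av)    # rank-one update of A
        if np.abs(lam - lam_old).sum() < convergence_threshold:
            break
    return (a + a.T) / 2.0


def itml_transform(x, a):
    """Return x @ S with S @ S.T = A, so that Euclidean distances in the
    output equal the learned distances in the input."""
    eigvals, eigvecs = np.linalg.eigh(a)
    s = eigvecs * np.sqrt(np.clip(eigvals, 0.0, None))
    return np.asarray(x, dtype=float) @ s


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    labels = np.repeat([0, 1], 10)          # two hypothetical MoA classes
    x = rng.normal(size=(20, 5))
    x[:, 0] += 2.0 * labels
    pairs = list(combinations(range(20), 2))
    similar = [p for p in pairs if labels[p[0]] == labels[p[1]]][:20]
    dissimilar = [p for p in pairs if labels[p[0]] != labels[p[1]]][:20]
    d2 = np.array([np.sum((x[i] - x[j]) ** 2) for i, j in pairs])
    a = itml_fit(x, similar, dissimilar,
                 upper=np.percentile(d2, 5), lower=np.percentile(d2, 95))
    print(itml_transform(x, a).shape)
```

In this sketch the matrix returned by `itml_transform` plays the role of the T matrix: multiplying the feature matrix by it yields the transformed features on which ordinary Euclidean distances can be computed.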
Drugs divided into 39 categories based on ITML-transformed features
We use the T-matrix-transformed features to establish drug image phenotype (DIP) connections. Each DIP connection is represented as an “association score” computed as the Euclidean distance. A total of 609960 pairwise DIP connections (the distance matrix) among the 1105 drugs are shown as a heat map in Figure 4. Application of an automated, parameter-free clustering algorithm yielded 39 drug groups with prominent consensus internal DIP similarities. [13,14]
Figure 4. Heat map of the distance matrix.
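The pair count quoted above follows from n(n-1)/2 with n = 1105. A minimal NumPy sketch of the pairwise Euclidean distance matrix and of this count is given below; the function names are illustrative and not taken from the original analysis.

```python
import numpy as np


def euclidean_distance_matrix(features):
    """Pairwise Euclidean distances between the rows of `features`."""
    features = np.asarray(features, dtype=float)
    sq = np.sum(features ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * features @ features.T
    np.fill_diagonal(d2, 0.0)
    return np.sqrt(np.maximum(d2, 0.0))  # clip tiny negative round-off


def count_pairs(n):
    """Number of unordered pairs among n items."""
    return n * (n - 1) // 2


if __name__ == "__main__":
    print(count_pairs(1105))  # 609960
```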
We designate each of these 39 groups as a DIP community (Figure 5b, clustering figure). MoAs comprising more than 5 drugs were used as a test set to assess whether the DIP communities can be used for drug MoA discovery. Our enrichment analysis identified significantly (P<0.01) enriched community-specific drug MoAs for each DIP community (Figure 3b and Supplementary Table S*, which lists the MoAs enriched in each cluster). Notably, Communities 18, 36, 29 and 34 were enriched with protein synthesis inhibitors, cytochrome P450 inhibitors, calmodulin antagonists and phosphodiesterase inhibitors, respectively. To examine whether ITML helps MoA recognition, we compared it with MoA recognition using the raw data and the PCA-processed data. Clustering the original data and the PCA-processed data yielded 57 and 48 clusters, with 26 and 24 enriched MoAs, respectively. The corresponding enrichment ratios (enriched MoAs per cluster), 0.4561 and 0.5000, are both lower than that obtained with ITML. Therefore, clustering the ITML-processed data makes it easier to find drugs with consistent MoA.
Figure 5. Data clustering figures for (a) original data, (b) ITML data, and (c) PCA data.
drugs on the phenotype of tumor cells are not significant, which may make the image data non-specific.
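The enrichment ratios reported above are the number of enriched MoAs divided by the number of clusters (26/57 and 24/48). The text does not state which statistical test produced the P<0.01 threshold; the sketch below assumes a one-sided hypergeometric test, a common choice for this kind of enrichment analysis, and uses only the Python standard library. The function names and the example cluster size of 30 are hypothetical.

```python
from math import comb


def hypergeom_enrichment_p(total, moa_total, cluster_size, moa_in_cluster):
    """P(X >= moa_in_cluster) when `cluster_size` drugs are drawn without
    replacement from `total` drugs, `moa_total` of which share the MoA."""
    upper = min(moa_total, cluster_size)
    tail = sum(
        comb(moa_total, k) * comb(total - moa_total, cluster_size - k)
        for k in range(moa_in_cluster, upper + 1)
    )
    return tail / comb(total, cluster_size)


def enrichment_ratio(enriched, clusters):
    """Enriched MoAs per cluster."""
    return enriched / clusters


if __name__ == "__main__":
    print(round(enrichment_ratio(26, 57), 4))  # raw data
    print(round(enrichment_ratio(24, 48), 4))  # PCA-processed data
    # Hypothetical example: all 4 drugs of one MoA fall into a cluster of
    # 30 drugs out of 1105 (the cluster size is made up for illustration).
    print(hypergeom_enrichment_p(1105, 4, 30, 4))
```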
DIP facilitates identification of drug use
To identify which image features are more conducive to identifying drug use, we calculated the intra-class distance ratio of each feature dimension in each cluster (Table S2 in the annex; see the Methods section for details) and defined community-specific image features (CSIFs) as those with an intra-class ratio <0.01. The CSIFs of different clusters rarely overlap: only 26 features act as CSIFs in 2 clusters, and no feature is a CSIF of 3 or more communities. For example, Nuclei_Intensity_MeanIntensity_Ph_golgi is a CSIF of cluster 16, which is enriched with dopamine uptake inhibitors, and of cluster 22, which is not enriched with any kind of drug. CSIFs suggest that drugs within a cluster may produce specific response patterns in these features. Although there are only 4 tubulin inhibitors in the data set, they are all concentrated in Community 20. Community 20 is also enriched with CDK inhibitors, and the only 2 microtubule inhibitors are both in this cluster. The CSIFs of this community are Cells_Texture_InfoMeas1_Hoechst_5, Cells_Texture_InfoMeas2_Ph_golgi_5, Cells_Texture_Variance_Hoechst_3 and Cytoplasm_AreaShape_Zernike_8_8. This may be related to the effect of these drugs on the cell cycle: they inhibit cell division and cause changes in cell texture. These results suggest that it is feasible to discover the function of drugs or new compounds based on DIP (see Table S1 in the attachment for more details).
The COVID-19 prevention and treatment drugs are found in cluster 21. The results above show that the effect of drugs on cells is reflected in multi-dimensional image features, and that using ITML for feature transformation helps to find similar drugs. This makes it possible to find more drugs with similar mechanisms through the similarity of drug phenotypes. At present, COVID-19 still lacks effective drugs. Chloroquine, remdesivir and other drugs may be candidates with some effect. Drugs such as remdesivir target viral proteins, and DIPs derived from uninfected cell lines may not reflect their function. Chloroquine, in contrast, exerts its anti-infective effect through the host. However, serious side effects of drugs such as chloroquine may limit their use, so discovering new alternative drugs through DIP is also of great significance.
We found that chloroquine and hydroxychloroquine, the drugs currently considered effective for COVID-19, have similar drug characteristics even though their MoA labels differ. On the one hand, they can inhibit T cell proliferation and reduce the release of proinflammatory cytokines; on the other hand, they can increase the pH of endosomes and block endocytosis. Both appear in cluster 21. The CSIFs of the drugs in this cluster include Nuclei_Intensity_IntegratedIntensity_Mito, Nuclei_Texture_InfoMeas1_Mito_3, and so on. This result indicates that their common mechanism is related to the DIP observed after they act on cells. The other drugs in cluster 21 have known mechanisms of action that differ from those of these two drugs, suggesting that they may nonetheless have similar effects; these DIP-like drugs may have similar antiviral effects. Clomifene in cluster 21, whose MoA is annotated as estrogen receptor antagonist, has been found to act against Ebola virus, suggesting that it may have a similar mechanism of action.
Image data of cells after drug treatment are among the most easily obtained screening data, and judging the potential effects of drugs from images has great application value. Here we use cell feature data produced by professional cell image processing software (CellProfiler) to predict MoA. Because the MoAs of drugs are complex, we use third-party MoA annotation data and optimize the metric with ITML. The optimized clusters are more closely related to the known MoAs. We further found that the clustering results based on image features are closely related to the MoA of the drugs themselves. For example, tubulin inhibitors, CDK inhibitors and microtubule inhibitors can all inhibit spindle formation, and they are classified together into cluster 20. The COVID-19 drug candidates chloroquine and hydroxychloroquine and the anti-Ebola drug clomiphene all appear in cluster 21. Although their MoA annotations differ, they can all inhibit virus entry and are similar in multiple image features. These results support the feasibility and accuracy of drug discovery based on cell image data. It should be noted that the cell image data we used came from a cell line without SARS-CoV-2 infection. Drugs that target viral proteins may therefore not produce a consistent effect on the images of these uninfected cells, so these data may not be applicable to the discovery of virus-targeted drugs. Image data on the effect of drugs on SARS-CoV-2-infected cells may be used to screen virus-targeted drugs.
Conclusions
In summary, the functional and association analysis of DIPs provides hypotheses about MoA for drug repurposing. The present study indicates that the common mechanism of chloroquine and hydroxychloroquine is related to the DIP observed after they act on cells. The other drugs in cluster 21 have known mechanisms of action that differ from those of these drugs, suggesting that they may nonetheless have similar effects. The results show that ITML-transformed features are more conducive to recognizing drug functions, and the analysis of the feature transformation shows that different features play important roles in identifying different drug functions. For COVID-19, chloroquine and hydroxychloroquine, which achieve antiviral effects by inhibiting endocytosis among other mechanisms, were classified into the same community. Clomiphene, in the same community, inhibits the entry of Ebola virus, indicating MoAs similar to those of chloroquine and hydroxychloroquine that can be reflected in cell images. The MoA of clomifene in cluster 21 is annotated as estrogen receptor antagonist, and the drug has been found to act against Ebola virus, suggesting that it may have a similar mechanism of action. In future, antiviral screening experiments will be conducted on some of the newly predicted drugs. The present work lays a foundation for discovering new drug MoAs through machine learning on image data and provides a new approach to COVID-19 drug repositioning.
Acknowledgments
This work was supported by grants from the National Natural Science Foundation of China (No. 81273488, No. 81230089) and also supported by Beijing Advanced Innovation Center for Big Data-based Precision Medicine. We thank Prof. Bijie Hu from the Zhongshan Hospital, affiliated to Fudan University, an infectious disease professor and a senior expert at the National Scientific Panel on Infectious diseases of COVID-19 in China, for helpful discussion.
Competing financial interests
The authors declare no competing financial interests.
Received: April 29, 2020. Published online: May 2020.
References
[1] B. Cao, Y. Wang, D. Wen, W. Liu, J. Wang, G. Fan, L. Ruan, B. Song, Y. Cai, M. Wei, X. Li, J. Xia, N. Chen, J. Xiang, T. Yu, T. Bai, X. Xie, L. Zhang, C. Li, Y. Yuan, H. Chen, H. Li, H. Huang, S. Tu, F. Gong, Y. Liu, Y. Wei, C. Dong, F. Zhou, X. Gu, J. Xu, Z. Liu, Y. Zhang, H. Li, L. Shang, K. Wang, K. Li, X. Zhou, X. Dong, Z. Qu, S. Lu, X. Hu, S. Ruan, S. Luo, J. Wu, L. Peng, F. Cheng, L. Pan, J. Zou, C. Jia, J. Wang, X. Liu, S. Wang, X. Wu, Q. Ge, J. He, H. Zhan, F. Qiu, L. Guo, C. Huang, T. Jaki, F. G. Hayden, P. W. Horby, D. Zhang, C. Wang, N. Engl. J. Med., in press. DOI 10.1056/NEJMoa2001282.
[2] G.C. Shan, W. Zhou, L. Han, et al., A Deep Learning-based Method and Equipment for Predicting Pharmacological Properties of Drugs Using Transcriptomic Data, CN-Patent No. 202010130277.1.
[3] J. Lamb, The Connectivity Map: Using Gene-Expression Signatures to Connect Small Molecules, Genes, and Disease. Science, 313, 1929.
[4] Y. Donner, S. Kazmierczak, K. Fortney, Drug Repurposing Using Deep Embeddings of Gene Expression Profiles. Molecular Pharmaceutics, 15, 4314.
[5] A. Subramanian, et al., A Next Generation Connectivity Map: L1000 Platform and the First 1,000,000 Profiles. Cell, 171, 1437.
[6] S. M. Corsello, et al., The Drug Repurposing Hub: a next-generation drug library and information resource. Nature Medicine, 23, 405.
[7] M. Abadi, et al., TensorFlow: A system for large-scale machine learning. arXiv:1605.08695.
[8] Y. Chen, Y. Li, R. Narayan, A. Subramanian, X. Xie, Gene expression inference with deep learning. Bioinformatics, 32, 1832.
[9] L. van der Maaten, G. Hinton, Visualizing Data using t-SNE. Journal of Machine Learning Research, 9, 2579.
[10] F. Pedregosa, et al., Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research, 12, 2825.
[11] B. J. Frey, D. Dueck, Clustering by passing messages between data points. Science, 2007, 315(5814), 972-976.
[12] J. Davis, B. Kulis, P. Jain, S. Sra, I. Dhillon, Information-theoretic metric learning. ACM International Conference Proceeding Series, 227, 209-216. DOI 10.1145/1273496.1273523.
[13] E.-a. D. Amir, Visualizing high-dimensional data. Nature Methods, (7):608.
[14]