Classification of chemical compounds based on the correlation between \textit{in vitro} gene expression profiles
Jun-ichi Takeshita, Akinobu Toyoda, Hidenori Tani, Yasunori Endo, Sadaaki Miyamoto
CClassification of chemical compounds based on the correlation between in vitro gene expression profiles ∗ Jun-ichi Takeshita † Akinobu Toyoda ‡ Hidenori Tani § Yasunori Endo ¶ Sadaaki Miyamoto ‖ January 7, 2021
Abstract
Toxicity evaluation of chemical compounds has traditionally relied on animal experiments; how-ever, the demand for non-animal-based prediction methods for toxicology of compounds is increasingworldwide. Our aim was to provide a classification method for compounds based on in vitro geneexpression profiles. The in vitro gene expression data analyzed in the present study was obtainedfrom our previous study. The data concerned nine compounds typically employed in chemical man-agement. We used agglomerative hierarchical clustering to classify the compounds; however, therewas a statistical difficulty to be overcome. We needed to properly extract RNAs for clusteringfrom more than 30,000 RNAs. In order to overcome this difficulty, we introduced a combinatorialoptimization problem with respect to both gene expression levels and the correlation between geneexpression profiles. Then, the simulated annealing algorithm was used to obtain a good solution forthe problem. As a result, the nine compounds were divided into two groups using 1,000 extractedRNAs. Our proposed methodology enables read-across, one of the frameworks for predicting toxi-cology, based on in vitro gene expression profiles. keywords : Statistical classification, Multiobjective combinatorial optimization, Chemical toxicity,Alternatives to animal experiments,
In vitro gene expression, Mathematical formulation
Traditionally, toxicity evaluation of chemical compounds has relied on animal experiments [[2]]. How-ever, in terms of time, cost efficiency, and animal welfare concerns, there is an increasing demandfor the development of non-animal-based methodologies for predicting chemical toxicity. Recently,several elements have been proposed for use in toxicity prediction methods, such as (quantitative) ∗ This study is supported by JSPS KAKENSHI Grant Numbers JP20K12203. † Research Institute of Science for Safety and Sustainability, National Institute of Advanced Industrial Science andTechnology (AIST), Tsukuba, Japan. ( [email protected] ) ‡ School of Science and Engineering, University of Tsukuba, Tsukuba, Japan. § Environmental Management Research Institute, National Institute of Advanced Industrial Science and Technology(AIST), Tsukuba, Japan. ¶ Faculty of Systems and Information Engineering, University of Tsukuba, Tsukuba, Japan. ‖ Faculty of Systems and Information Engineering, University of Tsukuba, Tsukuba, Japan. a r X i v : . [ s t a t . A P ] J a n tructure-activity relationships ((Q)SAR), quantitative activity-activity relationships (QAAR), andread-across [[9]].Read-across is a method whereby the toxicity of a given compound is predicted without the use ofanimal test data, but instead based on the animal toxicity data for similar compounds. One study,for example, conducted read-across based on similarity in chemical structure and toxicology expertjudgment [[7]]. For read-across, it is important to properly group compounds using data that does notrely on animal experiments. In order to facilitate the grouping of compounds, the Organisation forEconomic Co-operation and Development (OECD) published the OECD QSAR Toolbox [[6, Section4.3]], [[10]]. In addition, the Japanese government and participating academic institutes developed theHazard Evaluation Support System Integrated Platform (HESS) [[11]], [[6, Section 4.4]], [[8]]. Theseplatforms group compounds based on in silico parameters, that is, chemical structures and essentialphysicochemical parameters; and by employing them, we can successfully group compounds based onthese parameters.In order to predict toxicity, it is useful to use not only in silico parameters but also in vitro param-eters, because the latter can reflect certain biological characteristics of compounds. In fact, in the caseof predicting hepatotoxicity, one of the most prevalent forms of toxicity, two studies have reported thatusing in vitro parameters increased the accuracy of discriminative models for predicting the presenceor absence of hepatotoxicity, compared to using in silico parameters [[4]], [[5]]. These studies suggestthe hypothesis that grouping compounds based on in vitro parameters would increase the accuracyof read-across approaches for predicting chemical toxicity, compared to grouping based on in silico parameters.The present study proposes a methodology for grouping compounds using in vitro gene expressiondata. In order to group compounds, agglomerative hierarchical clustering was applied; however, astatistical difficulty appeared. The gene expression data, from which it was necessary to properlyextract a limited number of RNAs for clustering compounds, included more than 30,000 RNAs. Inorder to overcome this difficulty, we introduced a multiobjective combinatorial optimization problemwith respect to both gene expression levels and the correlation between gene expression profiles. Then,we applied the simulated annealing algorithm, a metaheuristic algorithm, to obtain a good solutionfor the multiobjective combinatorial optimization problem. The present study used the data reported in [12]. In that study, cell-based assays were conducted, induplicate, using mouse embryonic stem cells, for nine compounds typically employed in chemical man-agement: bis-phthalate, p -dicholorobenzene, phenol, trichloroethylene, benzene, chloroform, p -cresol,and tri- n -butyl-phosphate. In other words, each compound had two samples, and each group of ninesamples had the same control condition. Then, the gene expression levels were quantified using thefragments per kilobase of exon per million mapped fragments (FPKMs), and reported for a total of32,586 RNAs. We used these RNAs and their FPKMs in the present study’s analysis.2 .2 Gene expression ratio For any compound-treated group, the following gene expression ratio was used for each gene: f ( x, y ) = log (cid:16) yx (cid:17) , where x and y denote the gene expression levels of the control and compound-treated groups, re-spectively. Note that if x or y was zero, then the next smallest value in the respective group wasused. In order to group compounds, agglomerative hierarchical clustering (the average linkage betweenthe merged groups) was used, because this method can be used for any dissimilarity measures. Thefollowing dissimilarity measure was used for the hierarchical clustering: for any two compounds, x and y , the dissimilarity measure, or distance, between x and y , say, d ( x, y ) was defined by d ( x, y ) = 1 − corr( x, y )2 , where corr( x, y ) is the correlation coefficient between the respective FPKM vectors of Compound x and Compound y . The dissimilarity measure takes a value between 0 and 1. In order to extract RNAs that clearly revealed the difference between compounds, the present studyintroduced the following combinatorial optimization problem to extract n RNAs for a given naturalnumber n : objective function U = (1 − α ) U + αU (0 ≤ α ≤ ,U = 1Count( w ) × (cid:88) x,y ∈ E,x (cid:54) = y w x,y | (1 /n ) (cid:80) ni =1 ( x i − x )( y i − y ) |{ (1 /n ) (cid:80) ni =1 ( x i − x ) } / { (1 /n ) (cid:80) ni =1 ( y i − y ) } / ,U = 1 n n (cid:88) i =1 (cid:18) (cid:107) r i (cid:107) max j (cid:107) r j (cid:107) (cid:19) , subject to { x i } , { y i } ⊂ Γ and { x i } = { y i } = n, where E and Γ denote the set of compound-treated groups and the set of all the RNAs (32,586 RNAs),respectively; x i and y i denote the expression ratios of RNA i for x and y ∈ E , respectively; x and y denote the means of x i and y i , respectively; w x,y denotes weights and takes a value in {− , , } ; andCount( w ) denotes the number of w x,y taking a number of 1. Note that, for some x and y , if w x,y = 1,we can extract RNAs that increase the correlation between Compound x and Compound y ; and if w x,y = −
1, we can extract RNAs that decrease the correlation between Compound x and Compound y ; otherwise, we are not interested in the correlation between the two compounds. In addition, r i designates the FPKM vector for RNA i , and (cid:107) · (cid:107) represents the Euclidean norm.The function U takes a value in [0 , w ). Thefunction U also takes a value in [0 , U takes a value in [0 , α is3 parameter taking a value in [0 , U describes the strength of thecorrelation between two compound-treated groups, x and y , based on extracted RNAs. The function U describes the gene expression level of extracted RNAs. Then, the function U is a linear combinationof U and U .In the present study, the simulated annealing algorithm, which was originally introduced by [3]and [1], was used to obtain a good solution for the combinatorial optimization problem. Let n bethe number of RNAs we want to extract, T be the initial temperature, T t (0 < T t < T ) be the finaltemperature, and γ (0 < γ <
1) be the cooling rate. Then, the following is the algorithm to extractRNAs.
Step 1
Choose n RNAs randomly as the initial state. Let R be the set of the chosen n RNAs, andthen calculate the objective function U using the set R . In addition, set the initial temperature T . Step 2
As a neighbor solution, generate a set R (cid:48) , in which random elements of R and R (the comple-ment set of R ) are exchanged. Then, calculate the objective function U (cid:48) using the set R (cid:48) . Step 3 If U < U (cid:48) , then R := R (cid:48) ; otherwise R := R (cid:48) with the following probability P : P = exp (cid:18) − | U (cid:48) − U | T (cid:19) . Step 4 T := γT. If T ≥ T t , output the set R as the final state; otherwise, go to Step 2. Figure 1 shows a dendrogram obtained by applying agglomerative hierarchical clustering (the averagelinkage between the merged groups) to the data set of gene expression ratios, and Figure 2 a dendogramobtained by applying similar clustering to the set of gene expression levels, for the nine compoundsin duplicate and all the RNAs (32,586 RNAs). In both figures, the y -axes show the dissimilaritymeasures where two clusters were merged, while the x -axes show the distribution of compounds. Notethat subscripts 1 and 2 refer to the sample numbers; that is, if any two compounds have the samesubscript number, the control conditions for the two compounds are identical.In Figure 1, we can see that there are two robust clusters, and each cluster consists of the ninecompounds with the same subscript number. This implies that the two compound groups stronglydepend on the gene expression levels in the control conditions. In Figure 2, the two control conditionsare closely proximate, but there is no significant difference between the 18 samples (the nine com-pounds in duplicate). These results indicate that there is no significant difference between the geneexpression patterns of the two control conditions. Therefore, we may infer that the significant differ-ence between the two control conditions in Figure 1 was the result of the incremental accumulation ofdifference among the RNAs, as more than 30,000 RNAs were in the data set. These results indicatethat it is unreliable to use the information of all the RNAs for classifying compounds, but instead wemust extract a limited number of RNAs in order to properly classify compounds using in vitro geneexpression data. 4 .2 Grouping compounds using extracted RNAs We used the parameters n = 3 , , , , α = 0 . , . , . , .
3, for the combinatorial opti-mization problem to extract RNAs. Figures 3, 4, and 5 show dendrograms obtained by agglomerativehierarchical clustering (the average linkage between the merged groups) applied to the data set withthe nine compounds in duplicate, and 3 , , α = 0 . , . , .
2, and 0 .
3, respectively. In each panel in these figures, the y -axis showsthe dissimilarity measures where two clusters were merged, while the x -axis shows the distribution ofcompounds. The meaning and function of the subscripts are as in Figures 1 and 2.In Figures 3, 4, and 5, it is clear that the two samples are initially merged for all the compounds,and then different compounds are merged. However, when 3 ,
000 RNAs were extracted (Figure 3),there are roughly 0 .
05 dissimilarity measures between the two samples of each compound, althoughthe maximum ranges are roughly 0 .
20. This result suggests that 3 ,
000 RNAs are not suitable forclassifying compounds because the dissimilarity measures between the two samples of each compoundare not small enough, compared to the dissimilarity measures between the compounds. On the otherhand, when 1 ,
000 or 100 RNAs were extracted (Figures 4 and 5, respectively), and α = 0 . . ,
000 and 100 RNAs were extracted (Figures 4 and 5, respectively).When α = 0 .
1, neither case shows clear cluster structures; however, when α = 0 .
2, with 1 ,
000 extractedRNAs, the nine compounds are divided into two groups, one consisting of three compounds (bis-phthalate, trichloroethylene, and tri-n-butyl-phosphate) and the other of the remaining six compounds.These results indicate that not only limiting the number of RNAs but also reflecting the sizes of RNAsis effective for revealing the difference between the compounds. Figures 6 ( α = 0) and 7 ( α = 0 . ,
586 RNAs between Samples 1 and 2, in the case of bis-phthalate. Thered plots indicate the 1 ,
000 extracted RNAs, and the blue plots the remaining RNAs. When α = 0 . α = 0 .
2, there is an increased number of RNAs of significant size. Thus, using n = 1 ,
000 and α = 0 . The present study has been applied agglomerative hierarchical clustering methods and a multiobjectivecombinatorial optimization problem to classify chemical compounds, based on in vitro gene expressiondata. This approach enables read-across based on in vitro parameters, and the prediction of chemicaltoxicity. However, even if we can properly classify compounds using in vitro parameters, the resultsmay not show the same similarity between compounds as in in vivo experiments. There is a gap inthe respective similarities based on in vitro parameters and in vivo experiments. Therefore, we mustdevelop a means of extracting RNAs that reflects the similarity determined by in vivo experiments,and this requires collaboration with toxicology experts. This is a consideration for the future, but ourapproach would aid in such analysis. 5 cknowledgement
The authors would like to express appreciation to Mr. Ryosuke Abe, Dr. Hiroshi Aoki, Dr. HiroakiSato, Dr. Masaki Torimura, and Dr. Masashi Gamo for thier fruitful discussions for the first versionof manuscript.
References [1]
Cerny, V. , Thermodynamical approach to the traveling salesman problem : An efficient simu-lation algorithm,
J. Optim. Theory Appl. , 45 (1985), 41-51.[2]
Eaton, D.L. and Gilbert, S.G. , Principles of toxicology, In: Klaassen, C.D. and Watkins, IIIJ.B. (eds.)
Casarett Doull’s Essentials Toxicology , Third Edition, 5-20, McGraw-Hill Education,NY, USA. 2015[3]
Kirkpatrick, S., Gelatt, C.D., and Vecchi, M.P. , Optimization by simulated annealing,
Science , 220 (1983), 671-680.[4]
Liu, J., Mansouri, K., Judson, R.S., Martin, M.T., Hong, H., Chen, M., Xu, X.,Thomas, R.S., and Shah, I. , Predicting hepatotoxicity using ToxCast in vitro bioactivity andchemical structure,
Chem. Res. Toxicol. , 28 (2015), 738-751.[5]
Low, Y., Uehara, T., Minowa, Y., Yamada, H., Ohno, Y., Urushidani, T., Sedykh,A., Muratov, E., Kuzmin, V., Fourches, D., Zhu, H., Rusyn, I., and Tropsha, A. ,Predicting drug-induced hepatotoxicity using QSAR and toxicogenomics approaches,
Chem. Res.Toxicol. , 24 (2011), 1251-1262.[6]
Madden, J.C. , Tools for grouping chemicals and forming categories, In: Cronin, M. et al. (eds)
Chemical Toxicity Prediction: Category Formation and Read-Across , 72-97, Royal Society ofChemical Publishing, London. (2013)[7]
Mellor, C.L., Schultz, T.W., Przybylak, K.R., Richarz, A.N., Bradbury, S.P., andCronin, M.T.D.
Read-across for rat oral gavage repeated-dose toxicity for short-chain mono-alkylphenols: A case study,
Comput. Toxicol. , 2 (2017), 1-11.[8]
NITE: Hazard Evaluation Support System Integrated Platform (HESS) , , Accessed 5 January 2021.[9] OECD , Guidance on grouping of chemicals, Second edition, Number 194,ENV/JM/MONO(2014)4, Paris, France. 2014[10]
OECD , The OECD QSAR Toolbox. ,Accessed 5 January 2021[11]
Sakuratani, Y., Zhang, H.Q., Nishikawa, S., Yamazaki, K., Yamada, T., Yamada, J.,Gerova, K., Chankov, G., Mekenyan, O., and Hayashi, M. , Hazard Evaluation SupportSystem (HESS) for predicting repeated dose toxicity using toxicological categories,
SAR QSAREnviron. Res. , 24 (2013), 351-63. 612]
Tani, H., Takeshita, J., Aoki, H., Nakmura, K., Abe, R., Toyoda, A., Endo, Y.,Miyamoto, S., Gamo, M., Sato, H., and Torimura, M. , Identification of RNA biomarkersfor chemical safety screening in mouse embryonic stem cells using RNA-seq,
Biochem. Biophys.Res. Commun. , 512 (2019), 641-646. 7 igures
The nine chemicals in duplicate D i ss i m il a li t y m ea s u r e s Figure 1: A dendrogram obtained by applying aggregative hierarchical clustering (the average linkagebetween the merged groups) to the data of the gene expression ratios for the nine compoundsand 32,586 RNAs. The y -axis marks the dissimilarity measures at which the clusters merge,and the x -axis the distribution of the nine compounds in duplicate.8 he nine chemicals in duplicate D i ss i m il a li t y m ea s u r e s Figure 2: A dendrogram obtained by applying aggregative hierarchical clustering (the average linkagebetween the merged groups) to the data of the gene expression levels for the nine compoundsand 32,586 RNAs. The y -axis marks the dissimilarity measures at which the clusters merge,and the x -axis the distribution of the nine compounds in duplicate.9 he nine compounds in duplicate The nine compounds in duplicate The nine compounds in duplicate The nine compounds in duplicate D i ss i m il a r i t y m ea s u r e s D i ss i m il a r i t y m ea s u r e s D i ss i m il a r i t y m ea s u r e s D i ss i m il a r i t y m ea s u r e s (a) α = 0.0 (b) α = 0.1 (c) α = 0.2 (d) α = 0.3 Figure 3: Four dendrograms obtained by applying aggregative hierarchical clustering (the averagelinkage between the merged groups) to the data of the gene expression ratios for the ninecompounds and 3 ,
000 extracted RNAs. The upper-left (a), upper-right (b), lower-left (c) andlower-right (d) panels are the cases of α = 0 . , . , .
2, and 0 .
3, respectively. In each panel,the y -axis marks the dissimilarity measures at which the clusters merge, and the x -axis thedistribution of the nine compounds in duplicate.10 he nine compounds in duplicate The nine compounds in duplicate The nine compounds in duplicate The nine compounds in duplicate D i ss i m il a r i t y m ea s u r e s D i ss i m il a r i t y m ea s u r e s D i ss i m il a r i t y m ea s u r e s D i ss i m il a r i t y m ea s u r e s (a) α = 0.0 (b) α = 0.1 (c) α = 0.2 (d) α = 0.3 Figure 4: Four dendrograms obtained by applying aggregative hierarchical clustering (the averagelinkage between the merged groups) to the data of the gene expression ratios for the ninecompounds and 1 ,
000 extracted RNAs. The upper-left (a), upper-right (b), lower-left (c) andlower-right (d) panels are the cases of α = 0 . , . , .
2, and 0 .
3, respectively. In each panel,the y -axis marks the dissimilarity measures at which the clusters merge, and the x -axis thedistribution of the nine compounds in duplicate.11 he nine compounds in duplicate The nine compounds in duplicate The nine compounds in duplicate The nine compounds in duplicate D i ss i m il a r i t y m ea s u r e s D i ss i m il a r i t y m ea s u r e s D i ss i m il a r i t y m ea s u r e s D i ss i m il a r i t y m ea s u r e s (a) α = 0.0 (b) α = 0.1 (c) α = 0.2 (d) α = 0.3 Figure 5: Four dendrograms obtained by applying aggregative hierarchical clustering (the averagelinkage between the merged groups) to the data of the gene expression ratios for the ninecompounds and 100 extracted RNAs. The upper-left (a), upper-right (b), lower-left (c) andlower-right (d) panels are the cases of α = 0 . , . , .
2, and 0 .
3, respectively. In each panel,the y -axis marks the dissimilarity measures at which the clusters merge, and the x -axis thedistribution of the nine compounds in duplicate.12igure 6: Scatter plot of all the RNAs (32 ,
586 RNAs) between the sample 1 and 2 in case of bis-phthalate. The red plots indicate the 1 ,
000 extracted RNAs in case of α = 0 .
0, and the blueplots are the rest RNAs. The sizes of the extracted RNAs are almost zeros.Figure 7: Scatter plot of all the RNAs (32 ,
586 RNAs) between the sample 1 and 2 in case of bis-phthalate. The red plots indicate the 1 ,
000 extracted RNAs in case of α = 0 .
2, and the blueplots are the rest RNAs. The number of RNAs whose sizes are not zeros increase, comparedto the case of α = 0 ..