[PDF] Classification of chemical compounds based on the correlation between \textit{in vitro} gene expression profiles

Abstract

Toxicity evaluation of chemical compounds has traditionally relied on animal experiments;however, the demand for non-animal-based prediction methods for toxicology of compounds is increasing worldwide. Our aim was to provide a classification method for compounds based on \textit{in vitro} gene expression profiles. The \textit{in vitro} gene expression data analyzed in the present study was obtained from our previous study. The data concerned nine compounds typically employed in chemical management.We used agglomerative hierarchical clustering to classify the compounds;however, there was a statistical difficulty to be overcome.We needed to properly extract RNAs for clustering from more than 30,000 RNAs. In order to overcome this difficulty, we introduced a combinatorial optimization problem with respect to both gene expression levels and the correlation between gene expression profiles. Then, the simulated annealing algorithm was used to obtain a good solution for the problem. As a result, the nine compounds were divided into two groups using 1,000 extracted RNAs. Our proposed methodology enables read-across, one of the frameworks for predicting toxicology, based on \textit{in vitro} gene expression profiles.

Full PDF

CClassiﬁcation of chemical compounds based on the correlation between in vitro gene expression proﬁles ∗ Jun-ichi Takeshita † Akinobu Toyoda ‡ Hidenori Tani § Yasunori Endo ¶ Sadaaki Miyamoto ‖ January 7, 2021

Abstract

Toxicity evaluation of chemical compounds has traditionally relied on animal experiments; how-ever, the demand for non-animal-based prediction methods for toxicology of compounds is increasingworldwide. Our aim was to provide a classiﬁcation method for compounds based on in vitro geneexpression proﬁles. The in vitro gene expression data analyzed in the present study was obtainedfrom our previous study. The data concerned nine compounds typically employed in chemical man-agement. We used agglomerative hierarchical clustering to classify the compounds; however, therewas a statistical diﬃculty to be overcome. We needed to properly extract RNAs for clusteringfrom more than 30,000 RNAs. In order to overcome this diﬃculty, we introduced a combinatorialoptimization problem with respect to both gene expression levels and the correlation between geneexpression proﬁles. Then, the simulated annealing algorithm was used to obtain a good solution forthe problem. As a result, the nine compounds were divided into two groups using 1,000 extractedRNAs. Our proposed methodology enables read-across, one of the frameworks for predicting toxi-cology, based on in vitro gene expression proﬁles. keywords : Statistical classiﬁcation, Multiobjective combinatorial optimization, Chemical toxicity,Alternatives to animal experiments,

In vitro gene expression, Mathematical formulation

Traditionally, toxicity evaluation of chemical compounds has relied on animal experiments [[2]]. How-ever, in terms of time, cost eﬃciency, and animal welfare concerns, there is an increasing demandfor the development of non-animal-based methodologies for predicting chemical toxicity. Recently,several elements have been proposed for use in toxicity prediction methods, such as (quantitative) ∗ This study is supported by JSPS KAKENSHI Grant Numbers JP20K12203. † Research Institute of Science for Safety and Sustainability, National Institute of Advanced Industrial Science andTechnology (AIST), Tsukuba, Japan. ( [email protected] ) ‡ School of Science and Engineering, University of Tsukuba, Tsukuba, Japan. § Environmental Management Research Institute, National Institute of Advanced Industrial Science and Technology(AIST), Tsukuba, Japan. ¶ Faculty of Systems and Information Engineering, University of Tsukuba, Tsukuba, Japan. ‖ Faculty of Systems and Information Engineering, University of Tsukuba, Tsukuba, Japan. a r X i v : . [ s t a t . A P ] J a n tructure-activity relationships ((Q)SAR), quantitative activity-activity relationships (QAAR), andread-across [[9]].Read-across is a method whereby the toxicity of a given compound is predicted without the use ofanimal test data, but instead based on the animal toxicity data for similar compounds. One study,for example, conducted read-across based on similarity in chemical structure and toxicology expertjudgment [[7]]. For read-across, it is important to properly group compounds using data that does notrely on animal experiments. In order to facilitate the grouping of compounds, the Organisation forEconomic Co-operation and Development (OECD) published the OECD QSAR Toolbox [[6, Section4.3]], [[10]]. In addition, the Japanese government and participating academic institutes developed theHazard Evaluation Support System Integrated Platform (HESS) [[11]], [[6, Section 4.4]], [[8]]. Theseplatforms group compounds based on in silico parameters, that is, chemical structures and essentialphysicochemical parameters; and by employing them, we can successfully group compounds based onthese parameters.In order to predict toxicity, it is useful to use not only in silico parameters but also in vitro param-eters, because the latter can reﬂect certain biological characteristics of compounds. In fact, in the caseof predicting hepatotoxicity, one of the most prevalent forms of toxicity, two studies have reported thatusing in vitro parameters increased the accuracy of discriminative models for predicting the presenceor absence of hepatotoxicity, compared to using in silico parameters [[4]], [[5]]. These studies suggestthe hypothesis that grouping compounds based on in vitro parameters would increase the accuracyof read-across approaches for predicting chemical toxicity, compared to grouping based on in silico parameters.The present study proposes a methodology for grouping compounds using in vitro gene expressiondata. In order to group compounds, agglomerative hierarchical clustering was applied; however, astatistical diﬃculty appeared. The gene expression data, from which it was necessary to properlyextract a limited number of RNAs for clustering compounds, included more than 30,000 RNAs. Inorder to overcome this diﬃculty, we introduced a multiobjective combinatorial optimization problemwith respect to both gene expression levels and the correlation between gene expression proﬁles. Then,we applied the simulated annealing algorithm, a metaheuristic algorithm, to obtain a good solutionfor the multiobjective combinatorial optimization problem. The present study used the data reported in [12]. In that study, cell-based assays were conducted, induplicate, using mouse embryonic stem cells, for nine compounds typically employed in chemical man-agement: bis-phthalate, p -dicholorobenzene, phenol, trichloroethylene, benzene, chloroform, p -cresol,and tri- n -butyl-phosphate. In other words, each compound had two samples, and each group of ninesamples had the same control condition. Then, the gene expression levels were quantiﬁed using thefragments per kilobase of exon per million mapped fragments (FPKMs), and reported for a total of32,586 RNAs. We used these RNAs and their FPKMs in the present study’s analysis.2 .2 Gene expression ratio For any compound-treated group, the following gene expression ratio was used for each gene: f ( x, y ) = log (cid:16) yx (cid:17) , where x and y denote the gene expression levels of the control and compound-treated groups, re-spectively. Note that if x or y was zero, then the next smallest value in the respective group wasused. In order to group compounds, agglomerative hierarchical clustering (the average linkage betweenthe merged groups) was used, because this method can be used for any dissimilarity measures. Thefollowing dissimilarity measure was used for the hierarchical clustering: for any two compounds, x and y , the dissimilarity measure, or distance, between x and y , say, d ( x, y ) was deﬁned by d ( x, y ) = 1 − corr( x, y )2 , where corr( x, y ) is the correlation coeﬃcient between the respective FPKM vectors of Compound x and Compound y . The dissimilarity measure takes a value between 0 and 1. In order to extract RNAs that clearly revealed the diﬀerence between compounds, the present studyintroduced the following combinatorial optimization problem to extract n RNAs for a given naturalnumber n : objective function U = (1 − α ) U + αU (0 ≤ α ≤ ,U = 1Count( w ) × (cid:88) x,y ∈ E,x (cid:54) = y w x,y | (1 /n ) (cid:80) ni =1 ( x i − x )( y i − y ) |{ (1 /n ) (cid:80) ni =1 ( x i − x ) } / { (1 /n ) (cid:80) ni =1 ( y i − y ) } / ,U = 1 n n (cid:88) i =1 (cid:18) (cid:107) r i (cid:107) max j (cid:107) r j (cid:107) (cid:19) , subject to { x i } , { y i } ⊂ Γ and { x i } = { y i } = n, where E and Γ denote the set of compound-treated groups and the set of all the RNAs (32,586 RNAs),respectively; x i and y i denote the expression ratios of RNA i for x and y ∈ E , respectively; x and y denote the means of x i and y i , respectively; w x,y denotes weights and takes a value in {− , , } ; andCount( w ) denotes the number of w x,y taking a number of 1. Note that, for some x and y , if w x,y = 1,we can extract RNAs that increase the correlation between Compound x and Compound y ; and if w x,y = −

1, we can extract RNAs that decrease the correlation between Compound x and Compound y ; otherwise, we are not interested in the correlation between the two compounds. In addition, r i designates the FPKM vector for RNA i , and (cid:107) · (cid:107) represents the Euclidean norm.The function U takes a value in [0 , w ). Thefunction U also takes a value in [0 , U takes a value in [0 , α is3 parameter taking a value in [0 , U describes the strength of thecorrelation between two compound-treated groups, x and y , based on extracted RNAs. The function U describes the gene expression level of extracted RNAs. Then, the function U is a linear combinationof U and U .In the present study, the simulated annealing algorithm, which was originally introduced by [3]and [1], was used to obtain a good solution for the combinatorial optimization problem. Let n bethe number of RNAs we want to extract, T be the initial temperature, T t (0 < T t < T ) be the ﬁnaltemperature, and γ (0 < γ <

1) be the cooling rate. Then, the following is the algorithm to extractRNAs.

Step 1

Choose n RNAs randomly as the initial state. Let R be the set of the chosen n RNAs, andthen calculate the objective function U using the set R . In addition, set the initial temperature T . Step 2

As a neighbor solution, generate a set R (cid:48) , in which random elements of R and R (the comple-ment set of R ) are exchanged. Then, calculate the objective function U (cid:48) using the set R (cid:48) . Step 3 If U < U (cid:48) , then R := R (cid:48) ; otherwise R := R (cid:48) with the following probability P : P = exp (cid:18) − | U (cid:48) − U | T (cid:19) . Step 4 T := γT. If T ≥ T t , output the set R as the ﬁnal state; otherwise, go to Step 2. Figure 1 shows a dendrogram obtained by applying agglomerative hierarchical clustering (the averagelinkage between the merged groups) to the data set of gene expression ratios, and Figure 2 a dendogramobtained by applying similar clustering to the set of gene expression levels, for the nine compoundsin duplicate and all the RNAs (32,586 RNAs). In both ﬁgures, the y -axes show the dissimilaritymeasures where two clusters were merged, while the x -axes show the distribution of compounds. Notethat subscripts 1 and 2 refer to the sample numbers; that is, if any two compounds have the samesubscript number, the control conditions for the two compounds are identical.In Figure 1, we can see that there are two robust clusters, and each cluster consists of the ninecompounds with the same subscript number. This implies that the two compound groups stronglydepend on the gene expression levels in the control conditions. In Figure 2, the two control conditionsare closely proximate, but there is no signiﬁcant diﬀerence between the 18 samples (the nine com-pounds in duplicate). These results indicate that there is no signiﬁcant diﬀerence between the geneexpression patterns of the two control conditions. Therefore, we may infer that the signiﬁcant diﬀer-ence between the two control conditions in Figure 1 was the result of the incremental accumulation ofdiﬀerence among the RNAs, as more than 30,000 RNAs were in the data set. These results indicatethat it is unreliable to use the information of all the RNAs for classifying compounds, but instead wemust extract a limited number of RNAs in order to properly classify compounds using in vitro geneexpression data. 4 .2 Grouping compounds using extracted RNAs We used the parameters n = 3 , , , , α = 0 . , . , . , .

3, for the combinatorial opti-mization problem to extract RNAs. Figures 3, 4, and 5 show dendrograms obtained by agglomerativehierarchical clustering (the average linkage between the merged groups) applied to the data set withthe nine compounds in duplicate, and 3 , , α = 0 . , . , .

2, and 0 .

3, respectively. In each panel in these ﬁgures, the y -axis showsthe dissimilarity measures where two clusters were merged, while the x -axis shows the distribution ofcompounds. The meaning and function of the subscripts are as in Figures 1 and 2.In Figures 3, 4, and 5, it is clear that the two samples are initially merged for all the compounds,and then diﬀerent compounds are merged. However, when 3 ,

000 RNAs were extracted (Figure 3),there are roughly 0 .

05 dissimilarity measures between the two samples of each compound, althoughthe maximum ranges are roughly 0 .

20. This result suggests that 3 ,

000 RNAs are not suitable forclassifying compounds because the dissimilarity measures between the two samples of each compoundare not small enough, compared to the dissimilarity measures between the compounds. On the otherhand, when 1 ,

000 or 100 RNAs were extracted (Figures 4 and 5, respectively), and α = 0 . . ,

000 and 100 RNAs were extracted (Figures 4 and 5, respectively).When α = 0 .

1, neither case shows clear cluster structures; however, when α = 0 .

2, with 1 ,

000 extractedRNAs, the nine compounds are divided into two groups, one consisting of three compounds (bis-phthalate, trichloroethylene, and tri-n-butyl-phosphate) and the other of the remaining six compounds.These results indicate that not only limiting the number of RNAs but also reﬂecting the sizes of RNAsis eﬀective for revealing the diﬀerence between the compounds. Figures 6 ( α = 0) and 7 ( α = 0 . ,

586 RNAs between Samples 1 and 2, in the case of bis-phthalate. Thered plots indicate the 1 ,

000 extracted RNAs, and the blue plots the remaining RNAs. When α = 0 . α = 0 .

2, there is an increased number of RNAs of signiﬁcant size. Thus, using n = 1 ,

000 and α = 0 . The present study has been applied agglomerative hierarchical clustering methods and a multiobjectivecombinatorial optimization problem to classify chemical compounds, based on in vitro gene expressiondata. This approach enables read-across based on in vitro parameters, and the prediction of chemicaltoxicity. However, even if we can properly classify compounds using in vitro parameters, the resultsmay not show the same similarity between compounds as in in vivo experiments. There is a gap inthe respective similarities based on in vitro parameters and in vivo experiments. Therefore, we mustdevelop a means of extracting RNAs that reﬂects the similarity determined by in vivo experiments,and this requires collaboration with toxicology experts. This is a consideration for the future, but ourapproach would aid in such analysis. 5 cknowledgement

The authors would like to express appreciation to Mr. Ryosuke Abe, Dr. Hiroshi Aoki, Dr. HiroakiSato, Dr. Masaki Torimura, and Dr. Masashi Gamo for thier fruitful discussions for the ﬁrst versionof manuscript.

References [1]

Cerny, V. , Thermodynamical approach to the traveling salesman problem : An eﬃcient simu-lation algorithm,

J. Optim. Theory Appl. , 45 (1985), 41-51.[2]

Eaton, D.L. and Gilbert, S.G. , Principles of toxicology, In: Klaassen, C.D. and Watkins, IIIJ.B. (eds.)

Casarett Doull’s Essentials Toxicology , Third Edition, 5-20, McGraw-Hill Education,NY, USA. 2015[3]

Kirkpatrick, S., Gelatt, C.D., and Vecchi, M.P. , Optimization by simulated annealing,

Science , 220 (1983), 671-680.[4]

Liu, J., Mansouri, K., Judson, R.S., Martin, M.T., Hong, H., Chen, M., Xu, X.,Thomas, R.S., and Shah, I. , Predicting hepatotoxicity using ToxCast in vitro bioactivity andchemical structure,

Chem. Res. Toxicol. , 28 (2015), 738-751.[5]

Low, Y., Uehara, T., Minowa, Y., Yamada, H., Ohno, Y., Urushidani, T., Sedykh,A., Muratov, E., Kuzmin, V., Fourches, D., Zhu, H., Rusyn, I., and Tropsha, A. ,Predicting drug-induced hepatotoxicity using QSAR and toxicogenomics approaches,

Chem. Res.Toxicol. , 24 (2011), 1251-1262.[6]

Madden, J.C. , Tools for grouping chemicals and forming categories, In: Cronin, M. et al. (eds)

Chemical Toxicity Prediction: Category Formation and Read-Across , 72-97, Royal Society ofChemical Publishing, London. (2013)[7]

Mellor, C.L., Schultz, T.W., Przybylak, K.R., Richarz, A.N., Bradbury, S.P., andCronin, M.T.D.

Read-across for rat oral gavage repeated-dose toxicity for short-chain mono-alkylphenols: A case study,

Comput. Toxicol. , 2 (2017), 1-11.[8]

NITE: Hazard Evaluation Support System Integrated Platform (HESS) , , Accessed 5 January 2021.[9] OECD , Guidance on grouping of chemicals, Second edition, Number 194,ENV/JM/MONO(2014)4, Paris, France. 2014[10]

OECD , The OECD QSAR Toolbox. ,Accessed 5 January 2021[11]

Sakuratani, Y., Zhang, H.Q., Nishikawa, S., Yamazaki, K., Yamada, T., Yamada, J.,Gerova, K., Chankov, G., Mekenyan, O., and Hayashi, M. , Hazard Evaluation SupportSystem (HESS) for predicting repeated dose toxicity using toxicological categories,

SAR QSAREnviron. Res. , 24 (2013), 351-63. 612]

Tani, H., Takeshita, J., Aoki, H., Nakmura, K., Abe, R., Toyoda, A., Endo, Y.,Miyamoto, S., Gamo, M., Sato, H., and Torimura, M. , Identiﬁcation of RNA biomarkersfor chemical safety screening in mouse embryonic stem cells using RNA-seq,

Biochem. Biophys.Res. Commun. , 512 (2019), 641-646. 7 igures

The nine chemicals in duplicate D i ss i m il a li t y m ea s u r e s Figure 1: A dendrogram obtained by applying aggregative hierarchical clustering (the average linkagebetween the merged groups) to the data of the gene expression ratios for the nine compoundsand 32,586 RNAs. The y -axis marks the dissimilarity measures at which the clusters merge,and the x -axis the distribution of the nine compounds in duplicate.8 he nine chemicals in duplicate D i ss i m il a li t y m ea s u r e s Figure 2: A dendrogram obtained by applying aggregative hierarchical clustering (the average linkagebetween the merged groups) to the data of the gene expression levels for the nine compoundsand 32,586 RNAs. The y -axis marks the dissimilarity measures at which the clusters merge,and the x -axis the distribution of the nine compounds in duplicate.9 he nine compounds in duplicate The nine compounds in duplicate The nine compounds in duplicate The nine compounds in duplicate D i ss i m il a r i t y m ea s u r e s D i ss i m il a r i t y m ea s u r e s D i ss i m il a r i t y m ea s u r e s D i ss i m il a r i t y m ea s u r e s (a) α = 0.0 (b) α = 0.1 (c) α = 0.2 (d) α = 0.3 Figure 3: Four dendrograms obtained by applying aggregative hierarchical clustering (the averagelinkage between the merged groups) to the data of the gene expression ratios for the ninecompounds and 3 ,

000 extracted RNAs. The upper-left (a), upper-right (b), lower-left (c) andlower-right (d) panels are the cases of α = 0 . , . , .

2, and 0 .

3, respectively. In each panel,the y -axis marks the dissimilarity measures at which the clusters merge, and the x -axis thedistribution of the nine compounds in duplicate.10 he nine compounds in duplicate The nine compounds in duplicate The nine compounds in duplicate The nine compounds in duplicate D i ss i m il a r i t y m ea s u r e s D i ss i m il a r i t y m ea s u r e s D i ss i m il a r i t y m ea s u r e s D i ss i m il a r i t y m ea s u r e s (a) α = 0.0 (b) α = 0.1 (c) α = 0.2 (d) α = 0.3 Figure 4: Four dendrograms obtained by applying aggregative hierarchical clustering (the averagelinkage between the merged groups) to the data of the gene expression ratios for the ninecompounds and 1 ,

000 extracted RNAs. The upper-left (a), upper-right (b), lower-left (c) andlower-right (d) panels are the cases of α = 0 . , . , .

2, and 0 .

3, respectively. In each panel,the y -axis marks the dissimilarity measures at which the clusters merge, and the x -axis thedistribution of the nine compounds in duplicate.11 he nine compounds in duplicate The nine compounds in duplicate The nine compounds in duplicate The nine compounds in duplicate D i ss i m il a r i t y m ea s u r e s D i ss i m il a r i t y m ea s u r e s D i ss i m il a r i t y m ea s u r e s D i ss i m il a r i t y m ea s u r e s (a) α = 0.0 (b) α = 0.1 (c) α = 0.2 (d) α = 0.3 Figure 5: Four dendrograms obtained by applying aggregative hierarchical clustering (the averagelinkage between the merged groups) to the data of the gene expression ratios for the ninecompounds and 100 extracted RNAs. The upper-left (a), upper-right (b), lower-left (c) andlower-right (d) panels are the cases of α = 0 . , . , .

2, and 0 .

3, respectively. In each panel,the y -axis marks the dissimilarity measures at which the clusters merge, and the x -axis thedistribution of the nine compounds in duplicate.12igure 6: Scatter plot of all the RNAs (32 ,

586 RNAs) between the sample 1 and 2 in case of bis-phthalate. The red plots indicate the 1 ,

000 extracted RNAs in case of α = 0 .

0, and the blueplots are the rest RNAs. The sizes of the extracted RNAs are almost zeros.Figure 7: Scatter plot of all the RNAs (32 ,

586 RNAs) between the sample 1 and 2 in case of bis-phthalate. The red plots indicate the 1 ,

000 extracted RNAs in case of α = 0 .

2, and the blueplots are the rest RNAs. The number of RNAs whose sizes are not zeros increase, comparedto the case of α = 0 ..