Neural networks for Anatomical Therapeutic Chemical (ATC) classification
Loris Nanni, Alessandra Lumini and Sheryl Brahnam
University of Padova, Via Gradenigo 6, Padova, Italy; DISI, University of Bologna, via dell'Università 50, Cesena, Italy; Missouri State University, USA
Abstract
Motivation:
Automatic Anatomical Therapeutic Chemical (ATC) classification is a critical and highly competitive area of research in bioinformatics because of its potential for expediting drug development and research. Predicting an unknown compound's therapeutic and chemical characteristics according to how these characteristics affect multiple organs/systems makes automatic ATC classification a challenging multi-label problem.
Results:
In this work, we propose combining multiple multi-label classifiers trained on distinct sets of features, including sets extracted from a Bidirectional Long Short-Term Memory Network (BiLSTM). Experiments demonstrate the power of this approach, which is shown to outperform the best methods reported in the literature, including the state-of-the-art developed by the fast.ai research group.
Availability:
All source code developed for this study is available at https://github.com/LorisNanni.
Contact: [email protected]

Introduction
The average cost of developing a new drug from start to market, a process that can take decades before final approval, is now estimated at 2.8 billion US dollars (Wouters, McKee, & Luyten, 2020). Of all drugs currently under development, approximately 86% (Pitts, 2014) will fail to be better than placebo or will prove to cause more harm than good (Wong, Siah, & Lo, 2019). The need to weed out new drugs with a low probability of being efficacious and safe has led researchers to search for automatic methods for classifying compounds according to the organs they are likely to affect, based on these compounds' Anatomical Therapeutic Chemical (ATC) classes. An automatic classification system with good ATC prediction would not only accelerate drug development but also significantly reduce its costs.

The ATC coding system (MacDonald & Potvin, 2004) classifies compounds into one or more classes at five levels according to the therapeutic, pharmacological, and chemical properties of the drugs and the organs or systems the drugs affect. Relevant to the automatic ATC classification problem is the first ATC level, which determines the general anatomical groups, coded with fourteen mnemonic letters, that a particular compound targets: code A for alimentary tract and metabolism, B for blood and blood-forming organs, C for the cardiovascular system, D for dermatologicals, G for the genitourinary system and sex hormones, H for systemic hormonal preparations (excluding sex hormones and insulins), J for anti-infectives for systemic use, L for antineoplastic and immunomodulating agents, M for the musculoskeletal system, N for the nervous system, P for antiparasitic products, insecticides, and repellents, R for the respiratory system, S for sensory organs, and V for various. Levels 2 and 3 represent pharmacological subgroups, and levels 4 and 5 chemical subgroups and substances. A compound is assigned one or more ATC codes contained within each of these five levels.
Despite the serviceability of the ATC classification system for assessing the clinical value of a compound, most pharmaceuticals have yet to be assigned ATC codes; accurate coding involves expensive, labor-intensive experimental procedures. Hence the pressing need for machine learning (ML) to be applied to this problem (Dunkel, Günther, Ahmed, Wittig, & Preissner, 2008; Wu, Ai, Liu, & Fan, 2013). Early ML systems tended to simplify the complexity of this problem by reducing the level 1 multi-class problem to a single-class problem. Dunkel et al. (2008), for example, took advantage of a compound's structural fingerprint information to identify a class, while Wu et al. (2013) developed an approach based on extracting relationships among level 1 classes. Chen et al. (2012) built one of the first multi-label systems by examining a compound's chemical-chemical interactions; the authors also established the de facto benchmark dataset for ATC classification. More recently, Cheng et al. designed two systems, iATC-mHyb (Cheng, Zhao, Xiao, & Chou, 2017a) and iATC-mISF (Cheng et al., 2017b), that handle class overlap by fusing different sets of 1D descriptors: structural similarity, chemical-chemical interaction, and fingerprint similarity. Nanni and Brahnam (2017) transformed these same 1D descriptors into several 2D matrices from which Histogram of Gradients (HoG) (Dalal & Triggs, 2005) texture descriptors were extracted and used to train ensembles of multi-label classifiers. Convolutional Neural Networks (CNNs) (Schmidhuber, 2015) were trained on 2D descriptors in Lumini and Nanni (2018), where the CNNs, along with a Long Short-Term Memory Network (LSTM) (Hochreiter & Schmidhuber, 1997), were also used as feature extractors for training two multi-label classifiers.
This approach was further expanded in Nanni, Brahnam, and Lumini (2017) by constructing ensembles of CNNs (with adjusted batch sizes and learning rates) and by introducing techniques for processing multi-label inputs with CNNs. In this work, an ensemble of different feature descriptors and classifiers is proposed that strongly outperforms the best methods published in the literature for ATC classification on the benchmark dataset of Chen et al. (2012). The system proposed here was experimentally developed by comparing and evaluating multi-label classifiers trained on different feature sets. Our best results were obtained by combining a Bidirectional Long Short-Term Memory Network (BiLSTM) with a multi-label classifier based on Multiple Linear Regression.

Methods
Fig. 1. Schematic of the proposed approach.

As illustrated in Figure 1, the approach taken in this study is to experimentally produce ensembles that combine multi-label classifiers (hML), based on Multiple Linear Regression, with an LSTM. These classifiers are trained on three different sets of features (DDI, FRAKEL, and NRAKEL, detailed in section 3.1). Sets of features are also extracted from the LSTM and fed into hML classifiers. The results are combined and evaluated. The remainder of this section describes each of the elements involved in this approach.

LSTM as Multi-label Classifier and Feature Extractor
LSTM (Hochreiter & Schmidhuber, 1997) is a Recurrent Neural Network that evaluates what to forget and what to remember at every step. As illustrated in Figure 2, an LSTM block is built with the following components: 1) a forget gate $F$, a single-layer network with sigmoid activation $\sigma$; 2) a candidate layer $\tilde{C}$, a single-layer network with $\tanh$ activation; 3) an input gate $I$, a single-layer network with $\sigma$; 4) an output gate $O$, a single-layer network with $\sigma$; 5) a hidden state vector $H$; and 6) a memory state vector $C$. The inputs to the LSTM at time step $t$ are the current input $X_t$, the previous hidden state $H_{t-1}$, and the previous memory cell state $C_{t-1}$.

Fig. 2. Long Short-Term Memory (LSTM) classifier.

The algorithm for updating the LSTM at time $t$ is as follows. Given the current input $X_t$ and the previous hidden state $H_{t-1}$, and letting $W$, $U$, and $b$ denote the learnable weights of the network (independent of the time step $t$), the candidate layer $\tilde{C}_t$ is calculated as

$\tilde{C}_t = \tanh(W_c X_t + U_c H_{t-1} + b_c)$. (1)

The new memory cell is a linear combination of the previous memory cell and of the candidate layer:

$C_t = F_t \circ C_{t-1} + I_t \circ \tilde{C}_t$, (2)

where $\circ$ is element-wise multiplication. The forget gate $F_t$ is defined as

$F_t = \sigma(W_f X_t + U_f H_{t-1} + b_f)$; (3)

the input gate $I_t$ as

$I_t = \sigma(W_i X_t + U_i H_{t-1} + b_i)$; (4)

and the output gate as

$O_t = \sigma(W_o X_t + U_o H_{t-1} + b_o)$. (5)

The output of the block is $H_t = O_t \circ \sigma(C_t)$, the element-wise product of the output gate $O_t$ and the sigmoid of the memory cell $C_t$. Regarding input, all sequences for this task have the same length, so sorting inputs by length is not required. The output of the LSTM can be the entire sequence of $H_t$ or only its last term; the former makes it possible to stack several LSTM layers within the same network. A Bidirectional LSTM (BiLSTM) is a stack of two LSTMs trained on the same data, the second starting at the end of the sequence and running backward. A BiLSTM identifies causal correlations that run in the opposite direction and is usually applied to data that are not time related. The MATLAB LSTM implementation, which has one BiLSTM layer, was used in this work, with parameters set to NumHiddenUnits = 100, NumClasses = 14, and MiniBatchSize = 27.
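As a sanity check on Eqs. (1)-(5), the update can be sketched in NumPy. This is an illustrative toy sketch (the dimensions, initialization, and helper names are ours, not the MATLAB implementation used in the experiments); following the text above, the memory cell is squashed with a sigmoid rather than the more common tanh:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM update following Eqs. (1)-(5); W, U, b hold the weights
    for the candidate ('c'), forget ('f'), input ('i'), and output ('o') gates."""
    c_tilde = np.tanh(W['c'] @ x + U['c'] @ h_prev + b['c'])  # Eq. (1)
    f = sigmoid(W['f'] @ x + U['f'] @ h_prev + b['f'])        # Eq. (3)
    i = sigmoid(W['i'] @ x + U['i'] @ h_prev + b['i'])        # Eq. (4)
    o = sigmoid(W['o'] @ x + U['o'] @ h_prev + b['o'])        # Eq. (5)
    c = f * c_prev + i * c_tilde                              # Eq. (2)
    h = o * sigmoid(c)        # the paper squashes C_t with a sigmoid
    return h, c

rng = np.random.default_rng(0)
d, n = 4, 3                                  # toy input / hidden sizes
W = {k: rng.standard_normal((n, d)) * 0.1 for k in 'cfio'}
U = {k: rng.standard_normal((n, n)) * 0.1 for k in 'cfio'}
b = {k: np.zeros(n) for k in 'cfio'}
h, c = np.zeros(n), np.zeros(n)
for _ in range(5):                           # unroll over a length-5 sequence
    h, c = lstm_step(rng.standard_normal(d), h, c, W, U, b)
```

In the full network, stacking layers corresponds to feeding the whole sequence of hidden states $H_t$ into the next LSTM layer rather than keeping only the final one.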
An LSTM is not ordinarily used as a multi-label classifier, but it can perform multi-label classification if the training strategy in (Nanni et al., 2017) is implemented: if a training pattern belongs to m classes, it is replicated in the training set m times, once for each of its labels. To assign a test pattern to more than one class, a rule is applied to the final softmax layer that assigns a given pattern to all classes whose score is larger than a given threshold. The LSTM can function not only as a classifier but also as a feature extractor; as noted in Figure 1, in this study it serves in both capacities. Feature extraction with the LSTM is accomplished by representing each pattern with the activations of the last layer, which produces a feature vector with dimension equal to the number of classes. Feature perturbation and extraction are performed several times by randomly reordering the original set of features used to train the LSTM.
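A minimal sketch of this replicate-and-threshold strategy follows; the threshold value 0.2, the helper names, and the fallback to the top-scoring class (which guarantees every pattern receives at least one label) are illustrative assumptions, not values from the paper:

```python
import numpy as np

def replicate_multilabel(X, Y):
    """Replicate each pattern once per label, as in Nanni et al. (2017):
    a pattern with m labels appears m times in the training set."""
    Xr, yr = [], []
    for x, labels in zip(X, Y):
        for lab in labels:
            Xr.append(x)
            yr.append(lab)
    return np.array(Xr), np.array(yr)

def assign_labels(softmax_scores, threshold=0.2):
    """Assign every class whose softmax score exceeds the threshold,
    falling back to the argmax so each pattern gets at least one label."""
    return [np.flatnonzero(s >= threshold).tolist() or [int(s.argmax())]
            for s in softmax_scores]

X = np.array([[0.1, 0.2], [0.3, 0.4]])
Y = [[0], [1, 2]]                       # the second pattern has two labels
Xr, yr = replicate_multilabel(X, Y)     # -> 3 training patterns in total
preds = assign_labels(np.array([[0.70, 0.25, 0.05]]))  # classes 0 and 1 pass
```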
Classification by hML

The algorithm hML, proposed in (P. Wang, Ge, Xiao, Zhou, & Zhou, 2017), is a multi-label classifier based on the integration of a feature score and a neighbor score. The feature score evaluates whether a sample belongs to a class based on its features, according to global information in the training data, while the neighbor score determines this assignment based on the class labels of the sample's neighbors. The feature score $F(x, c_j)$ for a given pattern $x$ with respect to an anatomical group $c_j$ is calculated with a regression model to evaluate whether the pattern belongs to the group. The neighbor score $N(x, c_j)$ evaluates how strongly the $K$ neighbors of a pattern belong to a given group $c_j$: it increases as more neighbors of $x$ carry the label $c_j$, reaching 1 when all the neighbors of $x$ belong to $c_j$ and 0 when none do. The final score of $x$ is a weighted sum of the two factors:

$S(x, c_j) = \alpha F(x, c_j) + (1 - \alpha) N(x, c_j)$. (6)

In our experiments, we use the default values: the weight factor $\alpha$ is set to 0.5, and the number of neighbors is $K = 15$.
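Equation (6) can be sketched as follows. The fractional neighbor score (the share of the K nearest neighbors carrying the label, which is 1 when all neighbors carry it and 0 when none do) is our reading of the description above, and all names and toy data are illustrative:

```python
import numpy as np

def neighbor_score(x, X_train, Y_train, class_j, K=15):
    """Share of the K nearest training neighbors that carry label class_j."""
    d = np.linalg.norm(X_train - x, axis=1)
    nn = np.argsort(d)[:K]
    return float(np.mean([class_j in Y_train[i] for i in nn]))

def hml_score(feature_score, nbr_score, alpha=0.5):
    """Eq. (6): weighted sum of the regression-based feature score and the
    neighbor score; alpha = 0.5 is the default used in the experiments."""
    return alpha * feature_score + (1 - alpha) * nbr_score

X_train = np.array([[0.0], [0.1], [5.0]])
Y_train = [{1}, {1}, {0}]               # label sets of the training patterns
nscore = neighbor_score(np.array([0.05]), X_train, Y_train, class_j=1, K=2)
s = hml_score(feature_score=0.6, nbr_score=nscore)  # 0.5*0.6 + 0.5*1.0
```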
Classification by FastAI Tabular model

In addition to hML and LSTM, we explore the FastAI Tabular model (Howard & Gugger, 2020), a deep learning technique for tabular/structured data based on embedding layers for categorical variables. This deep learner represents each categorical variable with a numerical vector whose values are learned during training. Embeddings allow relationships between categories to be captured, and they can also serve as inputs to other models. A graphic representation of a FastAI Tabular model is presented in Figure 3: categorical variables are transformed into N-dimensional features by categorical embeddings, followed by a dropout layer to prevent overfitting, while numerical variables are simply normalized. All the variables are then concatenated and passed as input to the following layers, which, in our experiments, are two hidden layers and one output layer, as illustrated in Figure 3. In our experiments, we used a binary encoding to represent binary variables and considered the resulting variable as categorical.
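The embedding-plus-normalization front end of Figure 3 can be sketched without the fastai library. The sizes, names, and random initialization below are illustrative; in the real model the embedding tables are learned by backpropagation along with the hidden layers:

```python
import numpy as np

rng = np.random.default_rng(0)

class CategoricalEmbedding:
    """Minimal stand-in for an embedding layer: each category index is
    mapped to a dense vector, looked up by row."""
    def __init__(self, n_categories, dim):
        self.table = rng.standard_normal((n_categories, dim)) * 0.01
    def __call__(self, idx):
        return self.table[idx]

# two categorical variables (3 and 5 levels) plus 4 numeric features
emb_a = CategoricalEmbedding(3, dim=2)
emb_b = CategoricalEmbedding(5, dim=3)
numeric = rng.standard_normal(4)
numeric = (numeric - numeric.mean()) / (numeric.std() + 1e-8)  # normalize

# concatenate embedded categoricals with normalized numerics (Fig. 3),
# ready to feed the two hidden layers and the output layer
row = np.concatenate([emb_a(1), emb_b(4), numeric])
```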
Fig. 3. FastAI Tabular model.

Results
Benchmark dataset
The ensembles generated by the proposed approach are compared and evaluated on the aforementioned benchmark dataset in (Chen, 2012) (Supporting Information S1). This dataset is a collection of 3883 ATC-coded pharmaceuticals taken from KEGG (Ogata et al., 1999), a publicly available drug databank. Some samples belong to more than one of the 14 level 1 ATC classes: 3295 drugs belong to one class, 370 to two classes, 110 to three classes, 37 to four classes, 27 to five classes, and 44 to six classes. As explained in (Chen, 2012), N(Vir) = 4912 is the sum of labels associated with the drugs in the dataset; in other words, N(Vir) is the number of available virtual patterns, including all samples replicated in the training set with more than one label. The average number of labels per sample is thus 4912/3883 = 1.27. A total of 540 samples belong to class A, 133 to B, 591 to C, 421 to D, 248 to G, 126 to H, 521 to J, 232 to L, 208 to M, 737 to N, 127 to P, 427 to R, 390 to S, and 211 to V.

The following descriptors represent each drug in this dataset:

• DDI represents each drug with three mathematical expressions (the maximum interaction score with the drugs, the maximum structural similarity score, and the molecular fingerprint similarity score), each expression reflecting its intrinsic correlation with each of the 14 level 1 classes. The resulting descriptor is thus of size 14×3 = 42. These descriptors are available in the supplementary material of Nanni and Brahnam (2017).

• FRAKEL represents each drug by its 1024-bit molecular fingerprint (Zhou et al., 2020).

• NRAKEL represents each drug by a 700-D descriptor obtained from the Mashup algorithm (Cho, Berger, & Peng, 2016), which generates output from seven drug networks (five based on chemical-chemical interactions and two on drug similarities).
Testing protocol
The jackknife testing protocol is used here to generate both the training and testing sets. At each iteration of this protocol, one sample is placed in the testing set and all the others are included in the training set. Iteration continues until every pattern has been left out from the training set exactly once. Results are averaged as in K-fold cross-validation. The jackknife protocol was selected because it is considered to be the least arbitrary cross-validation method commonly used in statistical prediction (Chou, 2013).
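The protocol is equivalent to leave-one-out cross-validation and can be sketched as follows (the 1-nearest-neighbour "model" and the toy data are purely illustrative):

```python
import numpy as np

def jackknife(X, Y, fit, predict):
    """Jackknife (leave-one-out) protocol: each sample is tested once on a
    model trained on all remaining samples; accuracies are averaged."""
    n = len(X)
    correct = 0
    for i in range(n):
        mask = np.arange(n) != i            # leave sample i out
        model = fit(X[mask], Y[mask])
        correct += int(predict(model, X[i]) == Y[i])
    return correct / n

# toy demo with a 1-nearest-neighbour "model"
X = np.array([[0.0], [0.1], [1.0], [1.1]])
Y = np.array([0, 0, 1, 1])
fit = lambda Xtr, Ytr: (Xtr, Ytr)
predict = lambda m, x: m[1][np.argmin(np.linalg.norm(m[0] - x, axis=1))]
acc = jackknife(X, Y, fit, predict)         # every left-out sample recovered
```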
Performance indicators
ATC classification is evaluated using the classic performance indicators defined in (Chou, 2013):
$\mathrm{Aiming} = \frac{1}{N}\sum_{k=1}^{N} \frac{|L_k \cap L_k^*|}{|L_k^*|}$, (7)

$\mathrm{Coverage} = \frac{1}{N}\sum_{k=1}^{N} \frac{|L_k \cap L_k^*|}{|L_k|}$, (8)

$\mathrm{Accuracy} = \frac{1}{N}\sum_{k=1}^{N} \frac{|L_k \cap L_k^*|}{|L_k \cup L_k^*|}$, (9)

$\mathrm{Absolute\ True} = \frac{1}{N}\sum_{k=1}^{N} \Delta(L_k, L_k^*)$, (10)

$\mathrm{Absolute\ False} = \frac{1}{N}\sum_{k=1}^{N} \frac{|L_k \cup L_k^*| - |L_k \cap L_k^*|}{M}$, (11)

where $L_k$ is the true label set and $L_k^*$ the predicted label set for a given sample $k$, $N$ is the number of samples, $M$ the number of classes, and $\Delta(\cdot,\cdot)$ is an operator that returns 1 if the two sets have the same elements and 0 otherwise.
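Eqs. (7)-(11) translate directly into code; the sketch below (function and key names are ours) represents each sample's true and predicted labels as Python sets:

```python
def atc_metrics(true_labels, pred_labels, M=14):
    """Chou (2013) multi-label indicators, Eqs. (7)-(11).
    true_labels / pred_labels: one set of class labels per sample."""
    N = len(true_labels)
    aiming = coverage = accuracy = abs_true = abs_false = 0.0
    for L, Lp in zip(true_labels, pred_labels):
        inter, union = L & Lp, L | Lp
        aiming += len(inter) / len(Lp)              # Eq. (7)
        coverage += len(inter) / len(L)             # Eq. (8)
        accuracy += len(inter) / len(union)         # Eq. (9)
        abs_true += 1.0 if L == Lp else 0.0         # Eq. (10)
        abs_false += (len(union) - len(inter)) / M  # Eq. (11)
    return {'aiming': aiming / N, 'coverage': coverage / N,
            'accuracy': accuracy / N, 'absolute_true': abs_true / N,
            'absolute_false': abs_false / N}

# one partially correct and one exact prediction over M = 14 classes
m = atc_metrics([{1, 2}, {3}], [{1}, {3}])
```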
Experiments
The first experiment compares the performance of the multi-label classifiers. Reported in Table 1 are the results of the three multi-label classifiers detailed in section 3, along with three other standard classifiers, each trained on the three sets of features (DDI, FRAKEL, and NRAKEL). As already mentioned, LSTM is not a native multi-label classifier; thresholding was used for this purpose, as described in section 2.1. The six classifiers reported in Table 1 are the following:

1) RR, an ensemble of Ridge Regression classifiers implemented in the MATLAB/OCTAVE library for multi-label classification in the MLC Toolbox (Kimura, Sun, & Kudo, 2017);
2) LIFT, multi-label learning with Label specIfic FeaTures (Zhang & Wu, 2015);
3) GR, Group Preserving Label Embedding (Kumar, Pujari, Padmanabhan, & Kagita, 2019);
4) LSTM;
5) Tab, the label for the FastAI Tabular model (Howard & Gugger, 2020);
6) hML (P. Wang et al., 2017).

To avoid overfitting, default parameters were used for all classifiers.
Table 1. Absolute true rates achieved by different classifiers using different sets of descriptors
Absolute True    DDI      NRAKEL   FRAKEL
RR               0.5127   0.6062   0.5006
LIFT             0.6111   0.5282   0.3579
GR               0.4991   0.6093   0.4963
LSTM
Tab                       0.7422
hML              0.5710   0.6791   0.5977

In the cell Tab-FRAKEL, the reported value was obtained by transforming the original 1024-bit feature vector into 64 int16 features, since the original descriptor obtained very low performance (0.3165). Examining the results in Table 1, Tab is the best standalone approach, producing an outstanding 0.7422 absolute true rate using the NRAKEL descriptors. Of note as well is LSTM, which produced good results on all three descriptors.
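The paper does not spell out the 1024-bit to 64×int16 conversion; one natural reading is to pack each run of 16 fingerprint bits into a single 16-bit integer, sketched below (the bit order within each group is our assumption):

```python
import numpy as np

def pack_bits_to_int16(bits):
    """Pack a 1024-bit fingerprint into 64 int16 values, 16 bits apiece.
    Values with bit 15 set wrap into negative int16, which is harmless
    when the result is used as a learned feature."""
    groups = np.asarray(bits, dtype=np.int64).reshape(64, 16)
    weights = 1 << np.arange(16)            # 1, 2, 4, ..., 32768
    return (groups * weights).sum(axis=1).astype(np.int16)

fp = np.zeros(1024, dtype=int)
fp[0] = 1                # lowest bit of the first 16-bit group
fp[16] = 1               # lowest bit of the second 16-bit group
packed = pack_bits_to_int16(fp)
```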
Table 2. Absolute true rates achieved by different ensembles using different sets of descriptors
Absolute True        DDI      NRAKEL   FRAKEL
LS                   0.6902   0.7092   0.6709
eLS                  0.6995   0.7177   0.6853
LSTM+hML             0.6647   0.7371   0.6716
eLS+LSTM+hML         0.6915   0.7358   0.6894
eLS+LSTM+hML+Tab     0.6928   0.7538   0.7072
Table 3. Absolute true rates achieved by different ensembles using combinations of features.
Absolute True          DDI      NRAKEL   FRAKEL
Tab                    0.7667   ---      ---
Tab                         0.7734
eLS+LSTM+hML           0.7577   ---      ---
eLS+LSTM+hML                0.7762
eLS+LSTM+hML+2×Tab          0.7919
eLS+LSTM+hML+3×Tab
LS+LSTM+Tab                 0.7812

The second experiment, reported in Table 2, considers the following ensembles:

• LS, a stacking method based on the approach described in section 2, where the LSTM is used as a feature extractor and the resulting descriptors are given as input to an hML classifier;
• X+Y, fusion by the average rule between methods X and Y;
• eLS, an ensemble based on feature perturbation, obtained as the fusion of ten LS methods trained using random rearrangements of the input features.

The results reported in Table 2 show a strong performance improvement from LS, a single hML classifier trained with LSTM features, to eLS, an ensemble of 10 LS classifiers. The best performance is obtained by the ensemble eLS+LSTM+hML+Tab, the fusion of the methods with the highest diversity with respect to each other. This ensemble produces the highest performance in this classification problem, outperforming all the standalone approaches (for each of the three descriptors).

In the third experiment, fusion at the feature level is tested (see Table 3). For the Tab approach, the starting descriptor is the concatenation of two or three sets of features, while for the other classifiers the combination is the average rule applied to each of them (e.g., LSTM trained on DDI is combined by the average rule with LSTM trained on NRAKEL). When a cell in Table 3 spans more columns, the related classifier is trained using more feature sets and, for each feature set, a different classifier is trained, with results combined by the average rule. The results reported in Table 3 show the usefulness of the ensemble: all the approaches that contain Tab outperform the state of the art produced by the fast.ai research group.

Table 4. Comparison with the state-of-the-art methods in the literature

Method                            Aiming   Coverage   Accuracy   Absolute True   Absolute False
eLS+LSTM+hML+3×Tab
NRAKEL (Zhou, Chen, & Guo, 2019)  0.7888   0.7936     0.7786     0.7593          0.0363
FRAKEL (Zhou et al., 2020)        0.7851   0.7840     0.7721     0.7511          0.0370
NLSP (X. Wang et al., 2019)       0.8135   0.7950     0.7828     0.7497          0.0343
FUS3 (Nanni et al., 2017)         0.8755   0.6973     0.7346     0.6871          0.0238
Finally, in Table 4, we report a comparison of our proposed method with the best systems reported in the literature. Clearly, our ensemble strongly outperforms the other approaches. Notice the performance difference between the original papers of NRAKEL (Zhou et al., 2019) and FRAKEL (Zhou et al., 2020) and the classifiers tested here. The main reason for this difference is that the classifiers were not optimized here, since we are using a single dataset; our concern in this regard is to avoid any risk of overfitting by running the approaches with default values.
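The average-rule fusion used throughout these ensembles simply averages the per-class score matrices of the fused methods before thresholding (the toy scores below are illustrative):

```python
import numpy as np

def average_rule(score_matrices):
    """Fuse classifiers by the average rule: element-wise mean of their
    (samples x classes) score matrices."""
    return np.mean(score_matrices, axis=0)

# three hypothetical classifiers scoring 2 samples over 3 classes
s1 = np.array([[0.9, 0.1, 0.0], [0.2, 0.7, 0.1]])
s2 = np.array([[0.8, 0.3, 0.1], [0.1, 0.8, 0.1]])
s3 = np.array([[0.7, 0.2, 0.1], [0.3, 0.6, 0.1]])
fused = average_rule([s1, s2, s3])      # fused[0, 0] is about 0.8
```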
Conclusion
Since automatic level 1 ATC classification is a complex multi-label problem, the goal of this study was to improve performance by generating ensembles trained on three different feature vectors. The original 1D input vectors were also used to train an LSTM (specifically a BiLSTM), which functioned (with modification) not only as a multi-label classifier but also as a feature extractor, with features taken from the output layer. Two other classifiers aside from the LSTM were evaluated: one based on Multiple Linear Regression, and another a deep learning technique for tabular/structured data based on embedding layers for categorical variables. To boost the performance of these classifiers, they were trained on three sets of features, with results combined by the average rule. Comparisons of the best ensembles were made with the standalone classifiers and other state-of-the-art systems. Experimental results show that the best ensemble constructed by the method proposed here obtained superior results across the five performance indicators for ATC classification, outperforming the systems introduced in Lumini and Nanni (2018), in Nanni et al. (2017), and by the fast.ai research group.

Future work will explore the performance of different LSTM and CNN topologies combined with new activation functions replacing the standard ReLU. The fusion of other deep learning topologies (using both the standard ReLU and variants) for extracting features will also be a focus of investigation. The code of the proposed ensemble is available at https://github.com/LorisNanni.
Acknowledgement
Through their GPU Grant Program, NVIDIA donated the TitanX GPU that was used to train the CNNs presented in this work.
References
Chen, L. (2012). Predicting anatomical therapeutic chemical (ATC) classification of drugs by integrating chemical-chemical interactions and similarities.
PLoS ONE, 7, e35254.
Cheng, X., Zhao, S.-G., Xiao, X., & Chou, K.-C. (2017a). iATC-mHyb: a hybrid multi-label classifier for predicting the classification of anatomical therapeutic chemicals. Oncotarget, 8, 58494-58503. doi:10.18632/oncotarget.17028
Cheng, X., Zhao, S.-G., Xiao, X., & Chou, K.-C. (2017b). iATC-mISF: a multi-label classifier for predicting the classes of anatomical therapeutic chemicals.
BioInformatics, 33 (3), 341-346. doi:10.1093/bioinformatics/btw644 Cho, H., Berger, B., & Peng, J. (2016). Compact Integration of Multi-Network Topology for Functional Analysis of Genes.
Cell Systems, 3 (6), 540-548.e545. doi:https://doi.org/10.1016/j.cels.2016.10.017 Chou, K. C. (2013). Some remarks on predicting multi-label attributes in molecular biosystems.
Molecular BioSystems, 9, 1092-1100.
Dalal, N., & Triggs, B. (2005). Histograms of oriented gradients for human detection. Paper presented at the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), San Diego, CA.
Dunkel, M., Günther, S., Ahmed, J., Wittig, B., & Preissner, R. (2008). SuperPred: update on drug classification and target prediction.
Nucleic Acids Research, 36 (May), W55-W59. Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory.
Neural Computation, 9 (8), 1735-1780. Howard, J., & Gugger, S. (2020). Fastai: A Layered API for Deep Learning.
Information, 11(2), 108.
Kimura, K., Sun, L., & Kudo, M. (2017). MLC Toolbox: A MATLAB/OCTAVE library for multi-label classification. ArXiv, arXiv:1704.02592.
Kumar, V., Pujari, A. K., Padmanabhan, V., & Kagita, V. R. (2019). Group preserving label embedding for multi-label classification.
Pattern Recognition, 90 , 23-34. doi:https://doi.org/10.1016/j.patcog.2019.01.009 Lu, Z., & Chou, K.-C. (2020). iATC_Deep-mISF: A Multi-Label Classifier for Predicting the Classes of Anatomical Therapeutic Chemicals by Deep Learning.
Advances in Bioscience and Biotechnology, 11 , 153-159. Lumini, A., & Nanni, L. (2018). Convolutional neural networks for ATC classification.
Current Pharmaceutical Design, 24.
MacDonald, K., & Potvin, K. (2004). Interprovincial variation in access to publicly funded pharmaceuticals: a review based on the WHO Anatomical Therapeutic Chemical classification system. Canadian Pharmacists Journal / Revue des Pharmaciens du Canada, 137(7), 29-34. doi:10.1177/171516350413700703
Nanni, L., & Brahnam, S. (2017). Multi-label classifier based on histogram of gradients for predicting the anatomical therapeutic chemical class/classes of a given compound.
BioInformatics, 33 , 2837-2841. doi:10.1093/bioinformatics/btx278 Nanni, L., Brahnam, S., & Lumini, A. (2017).
Ensemble of Deep Learning Approaches for ATC Classification . Paper presented at the Smart Intelligent Computing and Applications - Proceedings of the Third International Conference on Smart Computing and Informatics, Bhubaneswar, India. Ogata, H., Goto, S., Sato, K., Fujibuchi, W., Bono, H., & Kanehisa, M. (1999). KEGG: Kyoto Encyclopedia of Genes and Genomes.
Nucleic Acids Research, 27 (1), 29-34. doi:10.1093/nar/27.1.29 Pitts, R. C. (2014). Reconsidering the concept of behavioral mechanisms of drug action.
Journal of the Experimental Analysis of Behavior, 101 , 422β441. doi:doi:10.1002/jeab.80 Schmidhuber, J. (2015). Deep learning in neural networks: An overview.
Neural Networks, 61 , 85-117. Wang, P., Ge, R., Xiao, X., Zhou, M., & Zhou, F. (2017). hMuLab: A Biomedical Hybrid MUlti-LABel Classifier Based on Multiple Linear Regression.
IEEE/ACM Trans. Comput. Biol. Bioinformatics, 14 (5), 1173β1180. doi:10.1109/tcbb.2016.2603507 Wang, X., Wang, Y.-J., Xu, Z., Xiong, Y., & Wei, D.-Q. (2019). ATC-NLSP: Prediction of the Classes of Anatomical Therapeutic Chemicals Using a Network-Based Label Space Partition Method.
Frontiers in Pharmacology, 10 . Wong, C. H., Siah, K. W., & Lo, A. W. (2019). Estimation of clinical trial success rates and related parameters.
Biostatistics (Oxford, England), 20 (2), 273-286. doi:10.1093/biostatistics/kxx069 Wouters, O. J., McKee, M., & Luyten, J. (2020). Estimated Research and Development Investment Needed to Bring a New Medicine to Market, 2009-2018.
JAMA, 323 (9), 844-853. doi:10.1001/jama.2020.1166 Wu, L., Ai, N., Liu, Y., & Fan, X. (2013). Relating anatomical therapeutic indications by the ensemble similarity of drug sets.
Journal of Chemical Information and Modeling, 53 (8), 2154-2160. Zhang, M.-L., & Wu, L. (2015). Lift: multi-label learning with label-specific features.
IEEE Transactions on Pattern Analysis and Machine Intelligence, 37 (1), 107-120. Zhou, J.-P., Chen, L., & Guo, Z.-H. (2019). iATC-NRAKEL: an efficient multi-label classifier for recognizing anatomical therapeutic chemical classes of drugs.
BioInformatics, 36 (5), 1391-1396. doi:10.1093/bioinformatics/btz757 Zhou, J.-P., Chen, L., Wang, T., & Liu, M. (2020). iATC-FRAKEL: a simple multi-label web server for recognizing anatomical therapeutic chemical classes of drugs with their fingerprints only.
BioInformatics, 36(11), 3568-3569. doi:10.1093/bioinformatics/btaa166