Learning graph representations of biochemical networks and its application to enzymatic link prediction
Julie Jiang, Li-Ping Liu, and Soha Hassoun
Department of Computer Science, Tufts University, Medford, 02155, USA
Department of Chemical and Biological Engineering, Tufts University, Medford, 02155, USA
February 11, 2020
Abstract — The characterization of enzymatic activities between molecules remains incomplete, hindering biological engineering and limiting biological discovery. We develop in this work a technique, Enzymatic Link Prediction (ELP), for predicting the likelihood of an enzymatic transformation between two molecules. ELP models enzymatic reactions catalogued in the KEGG database as a graph. ELP is innovative over prior works in using graph embedding to learn molecular representations that capture not only molecular and enzymatic attributes but also graph connectivity. We explore both transductive (test nodes included in the training graph) and inductive (test nodes not part of the training graph) learning models. We show that ELP achieves high AUC when learning node embeddings using both graph connectivity and node attributes. Further, we show that graph embedding for predicting enzymatic links improves link prediction by 24% over fingerprint-similarity-based approaches. To emphasize the importance of graph embedding in the context of biochemical networks, we illustrate how graph embedding can also guide visualization. The code and datasets are available through https://github.com/HassounLab/ELP.
Characterizing enzymes through sequencing, annotation, and homology has enabled the creation of complex system models that have played a critical role in advancing many biomedical and bioengineering applications. Insufficient characterization of enzymes, however, fundamentally limits our understanding of metabolism and creates knowledge gaps across many applications. For example, while nearly 300 β-glucuronidases (gut-bacterial enzymes that hydrolyze glucuronate-containing polysaccharides such as heparin and hyaluronate as well as small-molecule drug glucuronides) have been cataloged, functional information is available for only a small fraction (<10%) (Pellock et al., 2019), thus limiting our ability to analyze host-microbiota interactions. Importantly, most enzymes, if not all, are promiscuous, acting on substrates other than their natural substrates (D'Ari and Casadesús, 1998; Khersonsky and Tawfik, 2010; Hult and Berglund, 2007). At least one-third of protein superfamilies are functionally diverse, each superfamily catalyzing multiple reactions (Almonacid and Babbitt, 2011). Despite progress in assigning functional properties to enzyme sequences (Brown and Babbitt, 2014; Cuesta et al., 2015; Merkl and Sterner, 2016; Baier, Copp, and Tokuriki, 2016; Finn et al., 2016), the complete characterization or curation of enzyme function and the reactions they catalyze remains elusive.

Computational prediction of enzymatic transformations promises to complement existing databases and provide new opportunities for biological discovery. The most common predictor of enzyme-compound interaction is compound and/or enzyme similarity to those within known enzymatic reactions. In biological engineering, molecular similarity between a query molecule and native substrates that are known to be catalyzed by the enzyme informs putative enzymatic transformation steps along a synthesis pathway (Pertusi et al., 2014). A high similarity score indicates a likely transformation.
Molecular fingerprints, one-dimensional vectors with entries representing the presence or absence of molecular features such as combinations of atom and bond properties, are used to compute similarity. Enzyme sequence information such as binding site covalence and thermodynamic favorability was also used to inform the prediction (Cho et al., 2010). Similarity-based predictions are also utilized in predicting drug-protein interactions. A recent survey (Kurgan and Wang, 2018) summarizes available drug-protein databases and 35 recent similarity-based prediction techniques. Several such techniques apply SIMCOMP, a heuristic algorithm for computing structural molecular similarity based on subgraph isomorphism (Hattori et al., 2010). Other techniques use machine learning on molecular feature vectors or enzyme features to compute the likelihood of interactions (e.g., (Cobanoglu et al., 2013; Tsubaki, Tomii, and Sese, 2018; Öztürk, Özgür, and Ozkirimli, 2018)). Numerous computational methods, including quantum mechanics, molecular docking, and machine learning, have been used to predict atoms and bonds that undergo biochemical transformations, referred to as sites of metabolism, due to Cytochrome P450 enzymes (Tyzack and Kirchmair, 2019). Prediction of promiscuous products has utilized either hand-curated rules (e.g., (Morreel et al., 2014; Li et al., 2004)) or rules culled from enzymatic reactions (e.g., (Adams, 2010; Yousofshahi, Lee, and Hassoun, 2011)).

We present in this paper a novel technique, Enzymatic Link Prediction (ELP), for predicting enzymatic transformations between two molecules. ELP advances over the state-of-the-art in several ways.
First, ELP maps known enzymatic reactions already catalogued in databases (the KEGG database for this work) to a graph structure, where compounds are represented as graph nodes while reactions are represented as graph edges. While such graph representations have been exploited during pathway analysis and construction, e.g., (Yousofshahi, Lee, and Hassoun, 2011), they have not been exploited when studying enzyme promiscuity. Second, ELP uses graph embeddings (Goyal and Ferrara, 2018; Cai, Zheng, and Chang, 2018) to learn molecular representations that reflect not only molecular structural properties but also relationships with other molecules in the network graph. Such embeddings have proven effective in predicting missing information, identifying spurious interactions, predicting links appearing in future evolving networks, and analyzing biomedical networks (Goyal and Ferrara, 2018; Cai, Zheng, and Chang, 2018; Yue et al., 2019). Third, ELP first uses embedding propagation to compute embeddings for graph nodes, and then uses these embeddings to predict links between two nodes. We analyze both transductive (test nodes included in the training graph) and inductive (test nodes not part of the training graph) models. We evaluate ELP when learning node embeddings using both graph connectivity and node attributes, and compare to similarity-based approaches.

Figure 1: Overview of ELP. (A) Embeddings are learned for molecules using graph embedding. EP computes real-value embeddings, with d = 128. (B) Learned embeddings are used to predict enzymatic links.

The KEGG database is used to construct a data graph. Molecules in the KEGG database are represented as nodes. For each biochemical reaction, each substrate-product pair within the reaction is modeled as an edge in the graph. As most reactions within the KEGG database are reversible, we assume that each biochemical reaction is reversible and construct a non-directional graph. Biochemical networks have cofactor molecules (e.g., NADP, H2O) that participate in many reactions, forming high-connectivity hub nodes within the graph (Ravasz et al., 2002). As we aim to predict connectivity between non-cofactor metabolites, we exclude such high-connectivity nodes and their edges from the graph.

Nodes are assigned molecular fingerprints as attributes. The fingerprints are encoded as binary vectors of fixed length K. We select two fingerprints that reflect the presence or absence of pre-defined structural molecular fragments: the MACCS fingerprint with K = 166 structural keys (Durant et al., 2002), and the PubChem fingerprint with K = 881 structural keys (Kim et al., 2015).

Enzymatic reaction data is assigned as edge attributes. Each edge is assigned the enzyme commission (EC) number that catalyzes the associated chemical reaction. EC numbers are represented as four numbers separated by periods (Tipton and McDonald, 2018). For example, L-lactate dehydrogenase is assigned EC number 1.1.1.27. Each edge is also assigned a KEGG reaction class (RC) label. Each such label is associated with a group of reactions that share the same localized structural change between a substrate and a product (e.g., the addition or removal of a hydroxyl group) (Kanehisa et al., 2015). Each RC label is a 5-digit label.
Although a reaction may be associated with one or more RC labels, each substrate-product pair is associated with only one RC label. If a reaction has no label, we assign it a null label. Thus, each edge in the graph is associated with an EC label and an RC label. A graph G = (V, E) therefore consists of a set of vertices V and a set of edges E. Every node i ∈ V represents a molecule, and every edge (i, j) ∈ E, with i, j ∈ V, represents an enzymatic reaction connecting two molecules i and j.
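The data-graph construction described above can be sketched in a few lines. The reaction tuples, the RC label string, and the degree cutoff below are illustrative assumptions, not the paper's exact KEGG parsing or cofactor list:

```python
# Sketch of ELP's data-graph construction using plain dictionaries.
# Each reaction: (substrates, products, EC number, RC label or None).
reactions = [
    (["C00022"], ["C00186"], "1.1.1.27", "RC00001"),  # pyruvate <-> L-lactate (RC label illustrative)
    (["C00031"], ["C00022"], "2.7.1.2", None),        # hypothetical substrate-product pair
]

adj = {}  # node -> {neighbor: {"ec": ..., "rc": ...}}
for substrates, products, ec, rc in reactions:
    for s in substrates:
        for p in products:
            attrs = {"ec": ec, "rc": rc if rc is not None else "null"}
            # One edge per substrate-product pair; reactions assumed reversible,
            # so the graph is undirected.
            adj.setdefault(s, {})[p] = attrs
            adj.setdefault(p, {})[s] = attrs

# Exclude high-connectivity cofactor hubs above a chosen degree cutoff.
DEGREE_CUTOFF = 100  # assumption: the paper names cofactors, not a numeric cutoff
hubs = {n for n, nbrs in adj.items() if len(nbrs) > DEGREE_CUTOFF}
adj = {n: {m: a for m, a in nbrs.items() if m not in hubs}
       for n, nbrs in adj.items() if n not in hubs}
```

In a full pipeline, each remaining node would additionally carry its MACCS or PubChem fingerprint as a node attribute.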
ELP has two steps (Figure 1): (A) learning embedding vectors of graph nodes, and (B) predicting interaction between a pair of nodes from their embedding vectors. For the first step, we use the Embedding Propagation (EP) algorithm (Duran and Niepert, 2017). EP was selected because it almost consistently outperformed several other methods in the presence of node attributes on several datasets. Further, EP has the advantage of fewer parameters and hyperparameters when compared to other link prediction methods (e.g., (Perozzi, Al-Rfou, and Skiena, 2014; Tang et al., 2015; Grover and Leskovec, 2016)). For the second step, we train a neural network that takes pairs of learned embedding vectors as input and predicts the connectivity of two molecules via a reaction.
The simplest form of EP is to learn a set of node embedding vectors U = {u_i ∈ R^d : i ∈ V}, where d is the embedding size. Embeddings are randomly initialized prior to training. Node embeddings are learned via an iterative process, by propagating forward (representations of nodes) and backward (gradients) messages between neighboring nodes. The iterative process repeats until a convergence threshold is reached.

Suppose N(i) = {j ∈ V : (i, j) ∈ E} is the set of neighboring nodes of node i. The model aims to reconstruct the embedding u_i from the embeddings of i's neighbors. The reconstructed node embedding for node i is:

\tilde{u}_i = \frac{1}{|\mathcal{N}(i)|} \sum_{j \in \mathcal{N}(i)} u_j    (1)

The learning objective of EP is to maximize the similarity between \tilde{u}_i and u_i. Instead of maximizing the absolute values of inner products for all such nodes, EP maximizes their values in a relative sense: the reconstruction should be more similar to the corresponding embedding vector than to any other embedding vector. The reconstruction error is therefore minimized through a margin-based ranking loss (Duran and Niepert, 2017):

L = \sum_{i \in V} \sum_{j \in V, j \neq i} \max\{\gamma - \tilde{u}_i^\top u_i + \tilde{u}_i^\top u_j,\; 0\},    (2)

where γ > 0 is a chosen margin hyperparameter. The objective is optimized by stochastic gradient descent. However, summing over all nodes as indicated by the inner sum is very expensive. For performance, we randomly select one node as the negative example for each real node in every iteration to compute an estimation of L and its gradient, as was done in (Duran and Niepert, 2017).

To incorporate information from edge attributes, EP learns embedding vectors Z = {z_c ∈ R^d : c = 1, ..., C} for the C reaction labels. The reconstructed node embedding for node i is modified as follows:

\tilde{u}_i = \frac{1}{|\mathcal{N}(i)|} \sum_{j \in \mathcal{N}(i)} \left( u_j + \alpha\, z_{r(i,j)} \right),    (3)

where r(i, j) is the edge label of the edge (i, j) and z_{r(i,j)} is the corresponding edge embedding.
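Equations (1) and (2), with the single sampled negative per node used in training, can be sketched numerically. The toy graph, dimensions, and margin below are illustrative, not trained values, and the edge-attribute extension of Eq. (3) is omitted:

```python
# Sketch of EP's neighbor-mean reconstruction (Eq. 1) and margin-based
# ranking loss with one sampled negative per node (cf. Eq. 2).
import numpy as np

rng = np.random.default_rng(0)
d, n = 8, 5                        # embedding size, number of nodes
U = rng.normal(size=(n, d))        # randomly initialized node embeddings
neighbors = {0: [1, 2], 1: [0], 2: [0, 3], 3: [2, 4], 4: [3]}

def reconstruct(i):
    # Eq. (1): mean of the neighbors' embeddings
    return U[neighbors[i]].mean(axis=0)

def margin_loss(gamma=1.0):
    # Eq. (2) estimated with one sampled negative j != i per node
    loss = 0.0
    for i in range(n):
        u_tilde = reconstruct(i)
        j = rng.choice([k for k in range(n) if k != i])  # negative sample
        loss += max(gamma - u_tilde @ U[i] + u_tilde @ U[j], 0.0)
    return loss

loss = margin_loss()  # non-negative scalar; SGD would take gradients w.r.t. U
```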
The hyperparameter α ∈ {0, 1} weights the importance of edge features. The vector z corresponding to null edge attributes is fixed to zero to avoid affecting the reconstruction. Embeddings based on edge attributes can be learned simultaneously while learning connectivity-based embeddings. While the edge embeddings are used during training, they are not used to compute the final embeddings of nodes after EP training.

EP can also learn K fingerprint embedding vectors V = {v_k ∈ R^d : k = 1, ..., K}. Specifically, the node-attribute-based embedding of a node i is the mean of the fingerprint embeddings v_k corresponding to positive fingerprint entries in the fingerprint vector f_i ∈ {0, 1}^K:

u_i^{fp} = \frac{1}{\sum_{k=1}^{K} f_{ik}} \sum_{k=1}^{K} f_{ik}\, v_k    (4)

When computing embeddings based on node attributes, we optimize the fingerprint embeddings V, instead of U, through the learning objective in Eq. (2). An advantage of the EP algorithm is its ability to learn only one of the node embedding types or all of them. If both node-attribute and connectivity embeddings are trained, we simply concatenate u_i and u_i^{fp} to form the final node embedding vector of node i before applying the link prediction model. L2 regularization is applied to all variables U, V, and Z.

The trained node embeddings are used as inputs to a logistic link prediction model. Pairs of embeddings of nodes involved in a known reaction are positive examples; pairs of embeddings of nodes that have no or unknown interaction are treated as negative examples. To make link predictions, the neural network outputs the likelihood of an edge for every pair of input node embeddings. The final result of the model is evaluated based on the Area Under Curve (AUC) metric, wherein the false positive rate and true positive rate are evaluated at every threshold to compute the area under the Receiver Operating Characteristic (ROC) curve.
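Equation (4) amounts to averaging the learnable bit embeddings over the set bits of a fingerprint. A minimal sketch, with illustrative dimensions and untrained random vectors (MACCS uses K = 166):

```python
# Sketch of Eq. (4): a node's attribute-based embedding is the mean of the
# fingerprint-bit embeddings v_k for which the fingerprint bit f_k is 1.
import numpy as np

rng = np.random.default_rng(1)
K, d = 166, 8
V = rng.normal(size=(K, d))       # one learnable vector per fingerprint bit

def attribute_embedding(f):
    """f: binary fingerprint of length K with at least one set bit."""
    f = np.asarray(f, dtype=float)
    return (f @ V) / f.sum()      # mean of v_k over the bits with f_k = 1

f_i = np.zeros(K)
f_i[[3, 17, 42]] = 1.0            # hypothetical set bits
u_fp = attribute_embedding(f_i)   # average of V[3], V[17], V[42]
```

During EP training, the gradient of the loss in Eq. (2) would flow into the rows of V selected by each node's fingerprint.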
We explore two learning scenarios: transductive and inductive. In the transductive setting, the model is trained on all available nodes and evaluated on edge recovery for a set of test edges that were withheld from training. Hence, the graph is split into training and testing sets by partitioning on the edges. To ensure that the training graph is connected, the largest connected component is considered as the training graph. Therefore, the embeddings of all nodes incident to test edges are trained through the EP algorithm. ELP predicts the likelihood of enzymatic interactions between two nodes from their learned embeddings. In the inductive scenario, the model must predict interactions for one or more out-of-sample nodes excluded from the training set. ELP therefore computes embeddings for out-of-sample nodes from their attributes and predicts possible enzymatic reactions for them. Due to the lack of prior connectivity information for out-of-sample nodes, only embeddings based on node attributes are learned during training. To generate the training and testing sets, we reserve a certain portion of nodes and their incident edges for the test graph. All other nodes and edges are included in the training graph.

Table 1: Training and testing graph statistics for the KEGG dataset under the transductive and inductive learning scenarios.

                                Training            Testing
                              Nodes    Edges     Nodes    Edges
  Transductive Learning
    Test Ratio 0.1            6,833    12,350    6,833    1,311
    Test Ratio 0.3            5,766     9,255    5,766    2,819
    Test Ratio 0.5            4,352     6,136    4,352    3,552
  Inductive Learning
    Out-of-Sample Ratio 0.05  7,377    12,867      387    1,446
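The transductive split described above (hold out a fraction of edges, then keep the largest connected component of the training graph) can be sketched as follows. The toy edge list, seed, and test ratio are illustrative:

```python
# Sketch of the transductive train/test split: withhold edges for testing,
# then restrict training to the largest connected component.
import random
from collections import defaultdict

edges = [(0, 1), (1, 2), (2, 3), (3, 0), (3, 4), (5, 6)]  # toy graph
random.seed(0)
test_ratio = 0.3
shuffled = edges[:]
random.shuffle(shuffled)
n_test = int(len(shuffled) * test_ratio)
test_edges, train_edges = shuffled[:n_test], shuffled[n_test:]

# Largest connected component of the training graph (BFS over adjacency)
adj = defaultdict(set)
for a, b in train_edges:
    adj[a].add(b)
    adj[b].add(a)

seen, components = set(), []
for start in adj:
    if start in seen:
        continue
    comp, queue = set(), [start]
    while queue:
        node = queue.pop()
        if node in comp:
            continue
        comp.add(node)
        queue.extend(adj[node] - comp)
    seen |= comp
    components.append(comp)

train_nodes = max(components, key=len)
# Only test edges whose endpoints were embedded during training are usable
usable_test = [(a, b) for a, b in test_edges
               if a in train_nodes and b in train_nodes]
```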
Once cofactors were excluded, our dataset representing the biochemical network underlying the KEGG database consisted of 7,850 nodes and 14,313 edges (Table 1). For transductive learning experiments, we evaluate link prediction on test ratios of 0.1, 0.3, and 0.5. In each case, the training graph represents the largest connected component after edge partitioning based on the test ratio. Nodes and edges not present in the largest connected component are discarded. For inductive learning experiments, the ratio of out-of-sample test nodes is fixed to 0.05. For all experiments, the embedding dimension is set to 128. The learning rate for the EP framework is set to 0.5 and the regularization to 0.0001. The embedding vectors are trained for 100 epochs with a batch size of 512. The NN that predicts the connectivity of two molecules based on their embeddings consists of two hidden layers of sizes 32 and 16. This NN is trained for 40 epochs with a batch size of 2048 and learning rate 0.2. The margin hyperparameter γ is set to 10. In experiments using edge features, α is set to 1.

Table 2: Link prediction AUCs for transductive learning.
  Method    Connectivity   Node        Edge        Model AUC (Test Ratios)
            Embedding      Attribute   Attribute   0.1      0.3      0.5

  A. Connectivity-based embeddings only
  ELP       Yes            –           –           0.801
  DeepWalk  Yes            –           –

  B. Connectivity and one additional attribute
  ELP       Yes            MACCS       –           0.953
  ELP       Yes            PubChem     –           0.891    0.882    0.864
  ELP       Yes            –           EC          0.795    0.808    0.810
  ELP       Yes            –           RC          0.810    0.798    0.810

  C. Connectivity with one node and one edge attribute
  ELP       Yes            MACCS       EC          0.941
  ELP       Yes            MACCS       RC

  D. Embeddings based on MACCS fingerprints
  ELP       No             MACCS       –           0.931    0.916    0.898
  ELP       No             MACCS       EC
  ELP       No             MACCS       RC          0.939    0.904    0.896

  E. Embeddings based on PubChem fingerprints
  ELP       No             PubChem     –           0.665
  ELP       No             PubChem     RC          0.728    0.706    0.720

  F. Jaccard index similarity scoring; no embeddings
  Jaccard   No             MACCS       –
  Jaccard   No             PubChem     –           0.542    0.526    0.535
Results are partitioned to facilitate comparisons. (A) Using node embeddings representing connectivity. (B) Using connectivity embedding and one additional attribute in the form of a node or edge attribute. (C) Using node embeddings representing connectivity and one node and one edge attribute. (D) Using MACCS fingerprints and zero or more additional edge attributes with no connectivity-based node embeddings. (E) Same as (D) but using PubChem fingerprints. (F) Using Jaccard similarity scoring on molecular fingerprints. Bold values in each partition indicate the best result in that train-test split. Bold values with ∗ indicate the best overall result. The Connectivity Embedding column refers to the use of connectivity-based node embeddings.

Results for several transductive scenarios are reported (Table 2, partitions (A)-(E)). For almost all cases, a smaller test ratio results in a higher AUC. This result is expected, as a larger training graph better informs prediction. When performing connectivity-based prediction (partition A), ELP, node2vec (Grover and Leskovec, 2016), and DeepWalk (Perozzi, Al-Rfou, and Skiena, 2014) performed comparably, with less than 0.05 AUC difference for each test ratio. Per partition (B), concatenating the MACCS fingerprint embeddings to those based on connectivity improves the AUC to 0.953 from 0.801 for test ratio 0.1. Gains of more than 0.1 are recorded for the other test ratios. Neither the PubChem fingerprint nor any of the edge attributes enhances the AUC as much as the MACCS fingerprint. Further, there is little difference between using the two edge embeddings, and both seem to add little enhancement to the AUC results over the connectivity-only ELP. Per partition (C), adding an edge and a node attribute simultaneously to graph connectivity results in a slight decrease in performance when compared to using connectivity and MACCS fingerprints as node attributes for test ratio 0.1, but adding EC edge attributes improves performance for test ratio 0.5. Per partition (D), using embeddings for node attributes only provides results comparable to those of using graph connectivity and the MACCS fingerprint for all test ratios (e.g., 0.931 vs 0.953 for the 0.1 test ratio). Comparing partitions (D) and (E), link prediction using the MACCS fingerprint outperforms the scenario using PubChem fingerprints. When using embeddings for edge attributes, more pronounced AUC improvements are observed for embeddings based on the PubChem fingerprint vs the MACCS fingerprint. When using the Jaccard index to compute substrate-product similarity for each link, as in partition (F), the AUC is 0.808 for the MACCS fingerprints, while significantly lower when using the PubChem fingerprint. Figure 4 presents AUC plots for two scenarios using ELP: (A) connectivity only and (B) connectivity with MACCS fingerprints as node attributes. The former plot reveals that the lower AUC performance is mostly attributed to having a higher FPR when there is a higher TPR. In other words, we can achieve an almost 0.50 TPR at little cost (little sacrifice in FPR), but as the need to observe improvement in TPR increases, the FPR rises dramatically.

Table 3: Link prediction AUCs for inductive learning.

  Method    Connectivity Embedding   Node Attribute   AUC

  A. Embeddings based on node attributes
  ELP       Yes                      MACCS            0.921
  ELP       Yes                      PubChem          0.605

  B. Jaccard index similarity scoring
  Jaccard   No                       MACCS            0.744
  Jaccard   No                       PubChem          0.553

The best AUC is denoted in bold.
Several inductive scenarios were investigated (Table 3). In these scenarios, 5% of all nodes were removed from the graph during training. ELP based on MACCS node attributes achieves an AUC of 0.921. When compared to the similar transductive scenario, ELP based on MACCS node attributes, the AUC for the inductive scenario is lower than the AUC for the 0.1 and 0.3 test ratios (AUCs of 0.953 and 0.935, respectively), but higher than the AUC for the 0.5 test ratio (AUC of 0.900). Using the PubChem fingerprint results in a lower AUC. Similarity analysis based on the Jaccard index achieves an AUC of 0.744 when using MACCS fingerprints and results in a much lower AUC when using the PubChem fingerprints. This analysis indicates that informative embeddings, even for out-of-sample nodes, are best obtained through ELP.
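The Jaccard-index baseline compared against above scores a candidate substrate-product pair by fingerprint overlap rather than learned embeddings. A minimal sketch, with illustrative bit positions rather than real MACCS keys:

```python
# Sketch of the Jaccard-index similarity baseline on binary fingerprints,
# represented here as sets of set-bit indices.
def jaccard(fp_a, fp_b):
    a, b = set(fp_a), set(fp_b)
    return len(a & b) / len(a | b) if a | b else 0.0

substrate_bits = [3, 17, 42, 90]   # hypothetical set bits of one fingerprint
product_bits = [3, 17, 55, 90]     # hypothetical set bits of the other
score = jaccard(substrate_bits, product_bits)  # 3 shared / 5 total = 0.6
```

Ranking all candidate pairs by this score and sweeping a threshold yields the ROC curve from which the baseline AUCs are computed.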
Figure 4: AUC plots for transductive learning using a test ratio of 0.1. (A) Graph connectivity only, and (B) graph connectivity with MACCS fingerprints as node attributes.
To further illustrate the importance of graph embedding in the context of biochemical networks, we show how graph embedding can be used for visualization. Figure 7 presents a visualization of the embeddings for two reference pathways, the citrate (TCA) cycle and Glycolysis/Gluconeogenesis, as documented in the KEGG database. The resulting subgraph for the TCA cycle consists of 21 nodes and 44 edges, while the subgraph for the Glycolysis/Gluconeogenesis pathway consists of 37 nodes and 96 edges. Nine compounds are common to both pathways, including phosphate, diphosphate, pyruvate, thiamine diphosphate, lysine, oxaloacetate, and phosphoenolpyruvate. These compounds contribute to 14 edges that overlap in both pathways. To visualize embeddings of these metabolites, we reduce the dimensionality of the embeddings to 2 via Principal Component Analysis (PCA). For the
connectivity-only plot (Figure 7A), we observe tight clustering of metabolites within each pathway, while we observe looser clustering when using MACCS fingerprints as node attributes (Figure 7B). Nodes that are embedded far away from the clusters, phosphate, diphosphate, and carbon dioxide, exhibit high connectivity within the KEGG graph, with node degrees of 408, 320, and 494, respectively. On the contrary, other nodes within the KEGG graph have an average degree of 27, with most nodes having degrees under 30.
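The projection step behind such a visualization can be sketched with PCA via an SVD, rather than a library call. The random embedding matrix stands in for trained ELP vectors:

```python
# Sketch of projecting learned node embeddings to 2-D with PCA for plotting.
import numpy as np

rng = np.random.default_rng(2)
E = rng.normal(size=(21, 128))   # placeholder for, e.g., 21 TCA-cycle embeddings

centered = E - E.mean(axis=0)
# Right singular vectors of the centered data are the principal axes;
# singular values come back in descending order, so keep the top two rows.
_, _, vt = np.linalg.svd(centered, full_matrices=False)
coords = centered @ vt[:2].T     # (21, 2) points ready for a scatter plot
```

Plotting `coords` for each pathway's nodes, colored by pathway membership, reproduces the kind of clustering view described above.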
This work uses embedding propagation to learn molecular representations that capture graph connectivity, enzymatic properties, and structural molecular properties. We show that link prediction using only graph connectivity is on par with using molecular similarity. Additionally, we show high accuracy in link prediction when using both graph connectivity and molecular attributes. This work has broader and practical impact. ELP can be used to guide many biological discoveries and engineering applications, such as identifying catalyzing enzymes when constructing novel synthesis pathways or predicting interactions between microbes and human hosts. Graph embedding can be used for other applications such as biochemical network visualization, as demonstrated herein, and identifying synthesis routes for synthetic biology. Further, while our approach is applied to biochemical enzymatic networks, it can enhance link prediction in chemical networks, where rule-based and path-based link prediction respectively yielded 52.7% and 67.5% prediction accuracy (Segler and Waller, 2017).
Funding
This research is supported by NSF, Award CCF-1909536, and also by NIGMS of the National Institutes of Health, Award R01GM132391. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.
Bibliography
Adams, Samuel E (2010). "Molecular similarity and xenobiotic metabolism". Thesis. University of Cambridge.
Almonacid, Daniel E and Patricia C Babbitt (2011). "Toward mechanistic classification of enzyme functions". In: Current Opinion in Chemical Biology.
Baier, Copp, and Tokuriki (2016). In: Biochemistry.
Brown and Babbitt (2014). In: Journal of Biological Chemistry.
Cai, Hongyun, Vincent W Zheng, and Kevin Chen-Chuan Chang (2018). "A comprehensive survey of graph embedding: problems, techniques, and applications". In: IEEE Transactions on Knowledge and Data Engineering.
Cho et al. (2010). In: BMC Systems Biology.
Cobanoglu et al. (2013). In: Journal of Chemical Information and Modeling.
Cuesta et al. (2015). In: Biophysical Journal.
D'Ari, Richard and Josep Casadesús (1998). "Underground metabolism". In: BioEssays.
Duran, Alberto García and Mathias Niepert (2017). "Learning graph representations with embedding propagation". In: Advances in Neural Information Processing Systems, pp. 5119–5130.
Durant, Joseph L et al. (2002). "Reoptimization of MDL keys for use in drug discovery". In: Journal of Chemical Information and Computer Sciences.
Finn et al. (2016). In: Nucleic Acids Research.
Goyal, Palash and Emilio Ferrara (2018). "Graph embedding techniques, applications, and performance: a survey". In: Knowledge-Based Systems.
Grover, Aditya and Jure Leskovec (2016). "node2vec: scalable feature learning for networks". In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, pp. 855–864.
Hattori, Masahiro et al. (2010). "SIMCOMP/SUBCOMP: chemical structure search servers for network analyses". In: Nucleic Acids Research.
Hult and Berglund (2007). In: Trends in Biotechnology.
Kanehisa, Minoru et al. (2015). In: Nucleic Acids Research.
Khersonsky, Olga and Dan S Tawfik (2010). "Enzyme promiscuity: a mechanistic and evolutionary perspective". In: Annual Review of Biochemistry 79, pp. 471–505.
Kim, Sunghwan et al. (2015). In: Nucleic Acids Research.
Kurgan and Wang (2018). In: Current Medicinal Chemistry. doi: 10.2174/0929867325666181101115314.
Li, Chunhui et al. (2004). "Computational discovery of biochemical routes to specialty chemicals". In: Chemical Engineering Science.
Merkl and Sterner (2016). In: Biological Chemistry.
Morreel et al. (2014). In: The Plant Cell.
Öztürk, Özgür, and Ozkirimli (2018). In: Bioinformatics. doi: 10.1093/bioinformatics/bty593.
Pellock, Samuel J et al. (2019). "Discovery and characterization of FMN-binding β-glucuronidases in the human gut microbiome". In: Journal of Molecular Biology.
Perozzi, Bryan, Rami Al-Rfou, and Steven Skiena (2014). "DeepWalk: online learning of social representations". In: Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, pp. 701–710.
Pertusi, Dante A et al. (2014). "Efficient searching and annotation of metabolic networks using chemical similarity". In: Bioinformatics.
Ravasz, Erzsébet et al. (2002). "Hierarchical organization of modularity in metabolic networks". In: Science.
Segler and Waller (2017). In: Chemistry - A European Journal.
Tang, Jian et al. (2015). "LINE: large-scale information network embedding". In: Proceedings of the 24th International Conference on World Wide Web, pp. 1067–1077.
Tipton, Keith and Andrew McDonald (2018). "A brief guide to enzyme nomenclature and classification".
Tsubaki, Masashi, Kentaro Tomii, and Jun Sese (2018). "Compound-protein interaction prediction with end-to-end learning of neural networks for graphs and sequences". In: Bioinformatics.
Tyzack, Jonathan D and Johannes Kirchmair (2019). "Computational methods and tools to predict cytochrome P450 metabolism for drug discovery". In: Chemical Biology & Drug Design.
Yousofshahi, Lee, and Hassoun (2011). In: Metabolic Engineering.
Yue, Xiang et al. (2019). "Graph embedding on biomedical networks: methods, applications and evaluations". In: Bioinformatics. btz718. doi: 10.1093/bioinformatics/btz718.