Exploring Chemical Space using Natural Language Processing Methodologies for Drug Discovery
Hakime Öztürk a, Arzucan Özgür a, Philippe Schwaller b, Teodoro Laino b,∗, Elif Ozkirimli c,d,∗

a Department of Computer Engineering, Bogazici University, Istanbul, Turkey
b IBM Research, Zurich, Switzerland
c Department of Chemical Engineering, Bogazici University, Istanbul, Turkey
d Department of Biochemistry, University of Zurich, Winterthurerstrasse 190, CH-8057 Zurich, Switzerland
Abstract
Text-based representations of chemicals and proteins can be thought of as unstructured languages codified by humans to describe domain-specific knowledge. Advances in natural language processing (NLP) methodologies in the processing of spoken languages have accelerated the application of NLP to elucidate hidden knowledge in textual representations of these biochemical entities, and then to use it to construct models to predict molecular properties or to design novel molecules. This review outlines the impact made by these advances on drug discovery and aims to further the dialogue between medicinal chemists and computer scientists.
Teaser.
The application of natural language processing methodologies to analyze text-based representations of molecular structures opens new doors in deciphering the information-rich domain of biochemistry toward the discovery and design of novel drugs.
Keywords:
Natural Language Processing, Machine Translation, Molecule Generation, Drug Discovery, Cheminformatics, Bioinformatics, Biochemical Languages, SMILES

∗ Corresponding author
Email addresses: [email protected] (Teodoro Laino), [email protected] (Elif Ozkirimli), +41 76 349 7471 (Elif Ozkirimli)
1. Introduction

The design and discovery of novel drugs for protein targets is powered by an understanding of the underlying principles of protein-compound interaction. Biochemical methods that measure affinity and biophysical methods that describe the interaction in atomistic level detail have provided valuable information toward a mechanistic explanation for bimolecular recognition [1]. However, more often than not, compounds with drug potential are discovered serendipitously or by phenotypic drug discovery [2], since this highly specific interaction is still difficult to predict [3]. Protein structure based computational strategies such as docking [4], ultra-large library docking for discovering new chemotypes [5], and molecular dynamics simulations [4], or ligand based strategies such as quantitative structure-activity relationship (QSAR) [6, 7] and molecular similarity [8], have been powerful at narrowing down the list of compounds to be tested experimentally. With the increase in available data, machine learning and deep learning architectures are also starting to play a significant role in cheminformatics and drug discovery [9]. These approaches often require extensive computational resources, or they are limited by the availability of 3D information. On the other hand, text based representations of biochemical entities are more readily available, as evidenced by the 19,588 biomolecular complexes (3D structures) in PDB-Bind [10] (accessed on Nov 13, 2019) compared with 561,356 (manually annotated and reviewed) protein sequences in UniProt [11] (accessed on Nov 13, 2019) or 97 million compounds in PubChem [12] (accessed on Nov 13, 2019). The advances in natural language processing (NLP) methodologies make the processing of text based representations of biomolecules an area of intense research interest.

The discipline of natural language processing (NLP) comprises a variety of methods that explore a large amount of textual data in order to bring unstructured, latent (or hidden) knowledge to the fore [13]. Advances in this field are beneficial for tasks that use language (textual data) to build insight. The languages in the domains of bioinformatics and cheminformatics can be investigated under three categories: (i) natural language (mostly English) that is used in documents such as scientific publications, patents, and web pages; (ii) domain specific language, codified by a systematic set of rules extracted from empirical data and describing the human understanding of that domain (e.g. proteins, chemicals, etc.); and (iii) structured forms such as tables, ontologies, knowledge graphs or databases [14]. Processing and extracting information from textual data written in natural languages is one of the major application areas of NLP methodologies in the biomedical domain (also known as
BioNLP). Information extracted with BioNLP methods is most often shared in structured databases or knowledge graphs [15]. We refer the reader to the comprehensive review on
BioNLP by Krallinger et al. [16]. Here, we will focus on the application of NLP to domain-specific, unstructured biochemical textual representations, toward the exploration of chemical space in drug discovery efforts.

We can view the textual representation of biomedical/biochemical entities as a domain-specific language. For instance, a genome sequence is an extensive script of four characters (A, T, G, C) constituting a genomic language. In proteins, the composition of 20 different natural amino acids in varying lengths builds the protein sequences. Post-translational modifications expand this 20-letter alphabet and confer different properties to proteins [17]. For chemicals, there are several text-based alternatives, such as the chemical formula, the IUPAC International Chemical Identifier (InChI) [18] and the Simplified Molecular Input Line Entry Specification (SMILES) [19].

Today, the era of “big data” boosts the “learning” aspect of computational approaches substantially, with the ever-growing amounts of information provided by publicly available databases such as PubChem [12], ChEMBL [20], and UniProt [11]. These databases are rich in biochemical domain knowledge that is in textual form, thus building an efficient environment in which NLP-based techniques can thrive. Furthermore, advances in computational power allow the design of more complex methodologies, which in turn drive the fields of machine learning (ML) and NLP. However, biological and chemical interpretability and explainability remain among the major challenges of AI-based approaches. Data management in terms of access, interoperability and reusability is also critical for the development of NLP models that can be shared across disciplines.

With this review, we aim to provide an outline of how the field of NLP has influenced studies in bioinformatics and cheminformatics and the impact it has had over the last decade. Not only are NLP methodologies facilitating the processing and exploitation of biochemical text, they also promise an “understanding” of biochemical language to elucidate the underlying principles of bimolecular recognition. NLP technologies are enhancing biological and chemical knowledge with the final goal of accelerating drug discovery for improving human health. We highlight the significance of an interdisciplinary approach that integrates computer science and natural sciences.
Chowdhury [21] describes NLP on three levels: (i) the word level, in which the smallest meaningful unit is extracted to define the morphological structure; (ii) the sentence level, where grammar and syntactic validity are determined; and (iii) the domain or context level, in which the sentences have global meaning. Similarly, our review is organized in three parts, in which biochemical data is investigated at: (i) the word level, (ii) the sentence (text) level, and (iii) the level of understanding text and generating meaningful sequences. Table 1 summarizes important NLP concepts related to the processing of biochemical data. We refer to these concepts and explain their applications in the following sections.

All NLP technology relates to specific AI architectures. In Table 2 we summarize the main ML and deep learning (DL) architectures that will be mentioned throughout the review.
2. Biochemical Language Processing
The language-like properties of text-based representations of chemicals were recognized more than 50 years ago by Garfield [22]. He proposed a “chemico-linguistic” approach to representing chemical nomenclature with the aim of instructing the computer to draw chemical diagrams. Protein sequence has been an important source of information about protein structure and function since Anfinsen's experiment [23]. Alignment algorithms, such as Needleman-Wunsch [24] and Smith-Waterman [25], rely on sequence information to identify functionally or structurally critical elements of proteins (or genes).

To make predictions about the structure and function of compounds or proteins, the understanding of these sequences is critical for bioinformatics tasks, with the final goal of accelerating drug discovery. Much like a linguist who uses the tools of language to bring out hidden knowledge, biochemical sequences can be processed to propose novel solutions, such as predicting interactions between chemicals and proteins or generating new compounds based on the level of understanding. In this section, we will review the applications of some of the NLP concepts to biochemical data in order to solve bio/cheminformatics problems.
Information about chemicals can be found in repositories such as PubChem [12], which includes information on around 100 million compounds, or DrugBank [26], which includes information on around 10,000 drugs. The main textual sources used in drug discovery are textual representations of chemicals and proteins. Table 3 lists some sources that store different types of biochemical information.

Chemical structures can be represented in different forms: one-dimensional (1D), 2D, and 3D. Table 4 depicts different identifiers/representations of the drug ampicillin. While the 2D and 3D representations are also used in ML-based approaches [9], here we focus on the 1D form, which is the representation commonly used in NLP.
IUPAC name.
The International Union of Pure and Applied Chemistry (IUPAC) scheme (i.e. nomenclature) is used to name compounds following predefined rules such that the names of the compounds are unique and consistent with each other (iupac.org/).

Chemical Formula.
The chemical formula is one of the simplest and most widely known ways of describing chemicals using letters (i.e. element symbols), numbers, parentheses, and (-/+) signs. This representation gives information about which elements, and how many of them, are present in the compound.
SMILES.
The Simplified Molecular Input Line Entry Specification (SMILES) is a text-based form of describing molecular structures and reactions [19]. SMILES strings can be obtained by traversing the 2D graph representation of the compound, and therefore SMILES provides more complex information than the chemical formula. Moreover, due to its textual form, SMILES takes 50% to 70% less space than other representation methods such as an identical connection table (daylight.com/dayhtml/doc/theory/theory.smiles.html).

The SMILES notation is similar to a language with its own set of rules. Just as it is possible to express the same concept with different words in natural languages, the SMILES notation allows molecules to be represented with more than one unique SMILES. Although this may sound like a significant ambiguity, the possibility of using different SMILES to represent the same molecule was successfully adopted as a data augmentation strategy by various groups (Bjerrum [27], Kimber et al. [28], Schwaller et al. [29]).

Canonical SMILES can provide a unique SMILES representation. However, different databases such as PubChem and ChEMBL might use different canonicalization algorithms and thus generate different unique SMILES. OpenSMILES (opensmiles.org/opensmiles.html) is a new platform that aims to universalize the SMILES notation. In isomeric SMILES, the isotopism and stereochemistry information of a molecule is encoded using a variety of symbols (“/”, “\”, “@”, “@@”).
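As a concrete illustration of canonicalization and SMILES-based data augmentation, the snippet below uses RDKit (assumed installed; the doRandom flag requires a reasonably recent release) to emit a canonical SMILES and several randomized variants of the same molecule; it is a minimal sketch, not the procedure of any cited study.

```python
# A minimal sketch using RDKit; assumes a release where MolToSmiles
# supports the doRandom argument.
from rdkit import Chem

mol = Chem.MolFromSmiles("CC(=O)Oc1ccccc1C(=O)O")  # aspirin

# Canonical SMILES: one unique string per canonicalization algorithm.
print(Chem.MolToSmiles(mol))

# Randomized SMILES of the same molecule, usable for data augmentation.
for _ in range(3):
    print(Chem.MolToSmiles(mol, canonical=False, doRandom=True))
```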
DeepSMILES.
DeepSMILES is a novel SMILES-like notation that was proposed to address two challenges of the SMILES syntax: (i) unbalanced parentheses and (ii) ring closure pairs [30]. It was initially designed to enhance machine/deep-learning based approaches that utilize SMILES data as input (github.com/nextmovesoftware/deepsmiles). DeepSMILES was adopted in a drug-target binding affinity prediction task in which the findings highlighted the efficacy of DeepSMILES over SMILES in terms of identifying undetectable patterns [31]. DeepSMILES was also utilized in a molecule generation task in which it was compared to canonical and randomized SMILES text [32]. Here, the results suggested that DeepSMILES might limit the learning ability of SMILES-based molecule generation models because its syntax is more grammar sensitive, with the ring closure alteration and the use of a single symbol for branching (i.e. “)”) introducing longer sequences.
SELFIES.
SELF-referencIng Embedded Strings (SELFIES) is an alternative sequence-based representation that is built upon “semantically constrained graphs” [33]. Each symbol in a SELFIES sequence indicates a recursive Chomsky type-2 grammar, and can thus be used to convert the sequence representation to a unique graph. SELFIES utilizes SMILES syntax to extract words that correspond to semantically valid graphs (github.com/aspuru-guzik-group/selfies). Krenn et al. [33] compared the SELFIES, DeepSMILES and SMILES representations in terms of validity in cases where random character mutations are introduced. The evaluations on the QM9 dataset yielded results in favor of SELFIES.
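A minimal sketch of round-tripping between SMILES and SELFIES with the open-source selfies package released by Krenn et al. [33] (assumed installed):

```python
import selfies as sf  # pip install selfies

smiles = "CC(=O)Oc1ccccc1C(=O)O"      # aspirin
encoded = sf.encoder(smiles)           # SMILES -> SELFIES symbol string
decoded = sf.decoder(encoded)          # SELFIES -> SMILES

print(encoded)   # sequence of bracketed symbols, each a grammar rule
print(decoded)   # a SMILES string of the same molecule
```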
InChI.
InChI is the IUPAC International Chemical Identifier, which is a non-proprietary and open-source structural representation (inchi-trust.org) [34]. The InChIKey is a character-based representation that is generated by hashing the InChI strings in order to shorten them. The InChI representation has several layers, each separated by the “/” symbol. The software that generates InChI is publicly available, and InChI does not suffer from ambiguity problems. However, the SMILES representation, with its less complex structure, is easier to use, as shown in a molecular generation study [35] and in building meaningful chemical representations with a translation-based system [36]. Interestingly, the translation model was able to translate from InChI to canonical SMILES, whereas it failed to translate from canonical SMILES to InChI. Winter et al. [36] suggested that the complex syntax of InChI made it difficult for the model to generate a correct sequence.
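A minimal sketch of deriving the layered InChI string and the hashed InChIKey from a SMILES input with RDKit (assumed installed and built with InChI support):

```python
from rdkit import Chem

mol = Chem.MolFromSmiles("CC(=O)Oc1ccccc1C(=O)O")   # aspirin
print(Chem.MolToInchi(mol))      # layers separated by the "/" symbol
print(Chem.MolToInchiKey(mol))   # fixed-length hashed identifier
```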
SMARTS.
SMiles ARbitrary Target Specification (SMARTS) is a language that contains specialized symbols and logic operators that enable substructure (pattern) search on SMILES strings [37]. SMARTS can be used in any task that requires pattern matching on a SMILES string, such as querying databases or creating rule dictionaries such as RECAP [38] and BRICS [39] to extract fragments from SMILES (daylight.com/dayhtml/doc/theory/theory.smarts.html).
SMIRKS.
SMIRKS notation can be used to describe generic reactions (also known as transforms) that comprise one or more changes in atoms and bonds (https://daylight.com/daycgi_tutorials/smirks_examples.html). These transforms are based on “reactant to product” notation, and thus make use of the SMILES and SMARTS languages. SMIRKS is utilized in tasks such as constructing an online transform database [40] and predicting metabolic transformations [41]. A recent study achieved a performance similar to rule-based systems in classifying chemical reactions by learning directly from SMILES text with transforms via neural networks [42].
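To make the pattern-matching and transform ideas concrete, below is a minimal RDKit sketch (RDKit assumed installed) that runs a SMARTS substructure query and a SMIRKS-style “reactant >> product” transform; the methyl esterification transform is illustrative only, not drawn from any cited work.

```python
from rdkit import Chem
from rdkit.Chem import AllChem

ampicillin = Chem.MolFromSmiles(
    "CC1(C(N2C(S1)C(C2=O)NC(=O)C(C3=CC=CC=C3)N)C(=O)O)C")

# SMARTS query: does the molecule contain a carboxylic acid group?
acid = Chem.MolFromSmarts("C(=O)[OH]")
print(ampicillin.HasSubstructMatch(acid))           # True

# SMIRKS-style transform: convert the acid into its methyl ester.
rxn = AllChem.ReactionFromSmarts("[C:1](=[O:2])[OH]>>[C:1](=[O:2])OC")
product = rxn.RunReactants((ampicillin,))[0][0]
Chem.SanitizeMol(product)
print(Chem.MolToSmiles(product))
```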
Similar to words in natural languages, we can assume that the “words” of biochemical sequences convey significant information (e.g. folding, function, etc.) about the entities. In this regard, each compound/protein is analogous to a sentence, and each compound/protein unit is analogous to a word. Therefore, if we can decipher the grammar of biochemical languages, it would be easier to model bio/cheminformatics problems. However, protein and chemical words are not explicitly known, and different approaches are needed to extract syntactically and semantically meaningful biochemical word units from these textual information sources (i.e. sequences). Here, we review some of the most common tokenization approaches used to determine the words of biochemical languages.

k-mers (n-grams).
One of the simplest approaches in NLP to extract a small language unit is to use k-mers, also known as n-grams. k-mers indicate k consecutive overlapping characters that are extracted from the sequence with a sliding window approach. “LINGO”, which is one of the earliest applications of k-mers in cheminformatics, is the name of the overlapping 4-mers that are extracted from SMILES strings [43]. The 4-mers of the SMILES of ampicillin, “CC1(C(N2C(S1)C(C2=O)NC(=O)C(C3=CC=CC=C3)N)C(=O)O)C”, can be listed as { ‘CC1(’, ‘C1(C’, ‘1(C(’, ..., ‘O)O)’, ‘)O)C’ }. From a sequence of length l, a total of (l − k) + 1 k-mers can be extracted. Extracting LINGOs from SMILES is a simple yet powerful idea that has been successfully used to compute molecular similarities, to differentiate between bioisosteric and random molecular pairs [43], and in a drug-target interaction prediction task [44], without requiring 2D or 3D information. The results suggested that a SMILES-based approach to compute the similarity of chemicals is not only as good as a 2D-based similarity measurement, but also faster [44].

k-mers were successfully utilized as protein [45] and chemical words [46] in protein family classification tasks. 3-mers to 5-mers were often considered as the words of the protein sequence. Motomura et al. [47] reported that some 5-mers could be matched to motifs and that protein words are most likely a mixture of different k-mers. For the protein function prediction task, Cao et al. [48] chose among the 1000 most frequent words to build the protein vocabulary, whereas Ranjan et al. [49] utilized each k-mer type separately and showed that 4-mers provided the best performance. In the latter work, instead of using the whole protein sequence, the words were extracted from different length protein segments, which are also long k-mers (i.e. 100-mer, 120-mer) with 30 amino-acid gaps. The use of segmented protein sequences yielded better results than using the whole protein sequence, and important and conserved subsequences were highlighted. k-mers were also used as features, along with position-specific scoring matrix features, in the protein fold prediction problem [50].

Longest Common Subsequences.
The identification of the longest common subsequence (LCS) of two sequences is critical for detecting their similarity. When there are multiple sequences, LCSs can point to informative patterns. LCSs extracted from SMILES sequences performed similarly well to 4-mers in chemical similarity calculation [44].
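A minimal sketch of LINGO-style k-mer extraction in plain Python (the helper name is ours); it reproduces the ampicillin 4-mers listed above:

```python
def kmers(sequence: str, k: int = 4) -> list:
    """Return the (l - k) + 1 overlapping k-mers of a length-l sequence."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]

ampicillin = "CC1(C(N2C(S1)C(C2=O)NC(=O)C(C3=CC=CC=C3)N)C(=O)O)C"
print(kmers(ampicillin)[:3])    # ['CC1(', 'C1(C', '1(C(']
print(kmers(ampicillin)[-2:])   # ['O)O)', ')O)C']
```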
Maximum Common Substructure.
Cadeddu et al. [51] investigated organic chemistry as a language in an interesting study that extracts maximum common substructures (MCS) from the 2D structures of pairs of compounds to build a vocabulary of the molecule corpus. Contrary to the common idea of functional groups (e.g. methyl, ethyl, etc.) being the “words” of the chemical language, the authors argued that MCSs (i.e. fragments) can be described as the words of the chemical language [51]. A recent work investigated the distribution of these words in different molecule subsets [52]. The “words” followed
Zipf's Law, which indicates the relationship between the frequency of a word and its rank (based on that frequency) [53], similar to most natural languages. Their results also showed that drug “words” are shorter compared to natural product “words”.
Minimum Description Length.
Minimum Description Length (MDL) is an unsupervised compression-based word segmentation technique in which the words of an unknown language are detected by compressing the text corpus. In a protein classification task, each protein was assigned to the family in which its sequence is compressed the most, according to the MDL-based representation [54]. Ganesan et al. [54] investigated whether the MDL-based words of the proteins show similarities to PROSITE patterns [55] and showed that less conserved residues were compressed less by the algorithm. Ganesan et al. [54] also emphasized that the integration of domain knowledge, such as the consideration of the hydrophilic and hydrophobic amino acids in the words (i.e. grammar building), might prove effective.
Byte-Pair Encoding.
Byte-Pair Encoding (BPE) generates words based on high-frequency subsequences, starting from frequent characters [56]. A recent study adopted a linguistics-inspired approach to predict protein-protein interactions (PPIs) [57]. Their model was built upon the “words” (i.e. bio-words) of the protein language, in which BPE was utilized to build the bio-word vocabulary. Wang et al. [57] suggested that BPE-segmented words indicate a language-like behavior for protein sequences and reported improved accuracy results compared to using 3-mers as words.
Pattern-based words.
Subsequences that are conserved throughout evolution are usually associated with protein structure and function. These conserved sequences can be detected as patterns via multiple sequence alignment (MSA) techniques and Hidden Markov Models (HMM). PROSITE [55], a public database that provides information on the domains and motifs of proteins, uses regular expressions (i.e. RE or regex) to match these subsequences.

Protein domains have been investigated for their potential to be the words of the protein language. One earlier study suggested that folded domains could be considered as “phrases/clauses” rather than “words” because of the higher semantic complexity between them [58]. Later, domains were described as the words, and domain architectures as the sentences, of the language [59, 60]. Protein domains were treated as the words of multi-domain proteins in order to evaluate the semantic meaning behind the domains [61]. The study supported prior work by Yu et al. [60] suggesting that domains display syntactic and semantic features, but there are only a few multi-domain proteins with more than six domains, limiting the use of domains as words to build sentences. Protein domains and motifs have also been utilized as words in different drug discovery tasks, such as the prediction of drug-target interaction affinity [62, 63]. These studies showed that motifs and domains together contribute to the prediction as much as the use of the full protein sequence.

SMARTS is a well-known regex-based querying language that is used to identify patterns in a SMILES string. SMARTS has been utilized to build specific rules for small-molecule protonation [64], to design novel ligands based on the fragments connected to the active site of a target [65], and to help generate products in reaction prediction [66]. MolBlocks, a molecular fragmentation tool, also adopted SMARTS dictionaries to partition a SMILES string into overlapping fragments [37]. Furthermore, MACCS [67] and PubChem [12] fingerprints (FP) are molecular descriptors that are described as binary vectors based on the absence/presence of substructures that are predefined with the SMARTS language. A recent study on protein family clustering used a ligand-centric representation to describe proteins, in which ligands were represented with a SMILES-based (i.e. 8-mer) representation, MACCS, and Extended Connectivity Fingerprint (ECFP6) [46]. The results indicate that the three ligand representation approaches provide similar performances for protein family clustering.

To the best of our knowledge, there is no comprehensive evaluation of the different word extraction techniques except a comparison by Wang et al. [57] of the performance of BPE-based words against k-mers in a PPI prediction task. Such a comparison would provide important insights to the bio/cheminformatics community.

The representation of a text (e.g. molecule or protein sequence) aims to capture syntactic, semantic or relational meaning. In the widely used Vector Space Model (VSM), a text is represented by a feature vector of either weighted or un-weighted terms [68]. The terms of this vector may correspond to words, phrases, k-grams, characters, or dimensions in a semantic space, such as in the distributed word embedding representation models.
The similarity between two texts represented in the vector space model is usually computed using the cosine similarity metric [69], which corresponds to the cosine of the angle between the two vectors.

Similarly to the one-hot encoding scheme [70], in the traditional bag-of-words [71] and term frequency-inverse document frequency (TF-IDF) [72] text representation models, each word corresponds to a different dimension in the vector space. Therefore, the similarity between two words in the vector space is zero, even if they are synonymous or related to each other. In the distributed
Bag-of-words representation.
In this representation model, a text is represented as a vector of bag-of-words, where the multiplicity of the words is taken into account, but the order of the words in the text is lost [71]. For instance, the SMILES of ampicillin, “CC1(C(N2C(S1)C(C2=O)NC(=O)C(C3=CC=CC=C3)N)C(=O)O)C”, can be represented as a bag of 8-mers as follows: { “CC1(C(N2”, “C1(C(N2C”, “1(C(N2C(”, “(C(N2C(S”, ..., “N)C(=O)O”, “)C(=O)O)”, “C(=O)O)C” }. We can vectorize it as S = [1, 1, 1, 1, ..., 1, 1, 1], in which each element indicates the frequency of the corresponding 8-mer. In the LINGO-based similarity approach, SMILES strings and LINGOs were treated as the sentences and words, respectively [43]. The unique LINGOs were considered for each pair, and a Tanimoto coefficient was used to measure the similarity [43]. Another approach called SMILES Fingerprint (SMIfp) also adopted bag-of-words to create representations of molecules for a ligand-based virtual screening task [74]. SMIfp considered 34 unique symbols in SMILES strings to create a frequency-based vector representation, which was utilized to compute molecular similarity. SMIfp provided comparable results to a chemical representation technique that also incorporated polar group and topological information, as well as atom and bond information, in recovering active compounds amongst decoys [74].
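A minimal sketch of such a bag of 8-mers in plain Python (the helper name is ours):

```python
from collections import Counter

def bag_of_words(smiles: str, k: int = 8) -> Counter:
    """Count overlapping k-mers; multiplicity is kept, word order is lost."""
    return Counter(smiles[i:i + k] for i in range(len(smiles) - k + 1))

bow = bag_of_words("CC1(C(N2C(S1)C(C2=O)NC(=O)C(C3=CC=CC=C3)N)C(=O)O)C")
print(bow["C(=O)O)C"])    # frequency of one 8-mer "word"
print(len(bow))           # vocabulary size of this single "sentence"
```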
TF-IDF.
The bag-of-words model, which is based on counting the terms of the sentence/document, might prioritize insignificant but frequent words. To overcome this issue, a weighting scheme can be integrated into the vector representation in order to give more importance to rare terms that might play a key role in detecting the similarity between two documents. One popular weighting approach is to use term frequency-inverse document frequency (TF-IDF) [72]. TF refers to the frequency of a term in the document, and IDF denotes the logarithm of the total number of documents over the number of documents in which the term appears. IDF is therefore an indicator of uniqueness. For instance, the IDF of “C3=CC=CC” is lower than that of “(C(N2C(S”, which appears in fewer compounds. Therefore, the existence of “(C(N2C(S” in a compound may be more informative.

TF-IDF weighting was utilized to assign weights to LINGOs that were extracted from SMILES in order to compute molecule similarity using cosine similarity [44]. Molecular similarities were then used as input for drug-target interaction prediction. A similar performance between TF-IDF weighted LINGO and a graph-based chemical similarity measurement was obtained. Cadeddu et al. [51] used TF-IDF weighting on chemical bonds to show that bonds with higher TF-IDF scores have a higher probability of breaking.
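As a sketch of how such a weighting can be computed in practice, the snippet below builds TF-IDF vectors over character 8-mers with scikit-learn (assumed installed), treating each SMILES string as a document; the three-molecule corpus is a toy stand-in.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "CC1(C(N2C(S1)C(C2=O)NC(=O)C(C3=CC=CC=C3)N)C(=O)O)C",  # ampicillin
    "CC(=O)Oc1ccccc1C(=O)O",                                # aspirin
    "CN1C=NC2=C1C(=O)N(C(=O)N2C)C",                         # caffeine
]
# analyzer="char" with ngram_range=(8, 8) yields overlapping 8-mers.
vectorizer = TfidfVectorizer(analyzer="char", ngram_range=(8, 8),
                             lowercase=False)
X = vectorizer.fit_transform(corpus)   # sparse (3 x vocabulary) matrix
print(X.shape)
```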
One-hot representation.
In one-hot representation, for a given vocabulary of a text, each unique word/character is represented with a binary vector that has a 1 in the corresponding position, while the vector positions for the remaining words/characters are filled with 0s [70]. One-hot encoding is fast to build, but might lead to sparse vectors with large dimensions based on the size of the vocabulary (e.g. one million unique words in the vocabulary means one-million-dimensional binary vectors filled with zeros except one). It is a popular choice, especially in machine learning-based bio/cheminformatics studies, to encode different types of information such as SMILES characters [75, 76], atom/bond types [77, 78] and molecular properties [79].
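A minimal NumPy sketch of one-hot encoding the characters of a toy SMILES string:

```python
import numpy as np

smiles = "CC(=O)O"                       # acetic acid
vocab = sorted(set(smiles))              # character vocabulary
index = {ch: i for i, ch in enumerate(vocab)}

one_hot = np.zeros((len(smiles), len(vocab)), dtype=int)
for pos, ch in enumerate(smiles):
    one_hot[pos, index[ch]] = 1          # exactly one 1 per row

print(one_hot)    # (sequence length x vocabulary size) binary matrix
```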
Distributed representations.
The one-hot encoding builds discrete representations, and thus does not consider the relationships between words. For instance, the cosine similarity of two different words is 0 even if they are semantically similar. However, if the word (i.e. 8-mer) “(C(N2C(S” frequently appears together with the word “C(C2=O)N” in SMILES strings, this might suggest that they have related “meanings”. Furthermore, two words might have similar semantic meanings even though they are syntactically apart. This is where distributed vector representations come into play.

Distributed word embedding models gained popularity with the introduction of Word2Vec [73] and GloVe [80]. The main motivation behind the Word2Vec model is to build real-valued high-dimensional vectors for each word in the vocabulary based on the context in which they appear. There are two main approaches in Word2Vec: (i) Skip-Gram and (ii) Continuous Bag of Words (CBOW). The aim of the Skip-Gram model is to predict context words given the center word, whereas in CBOW the objective is to predict the target word given the context words. Figure 1 depicts the Skip-Gram architecture in Word2Vec [73]. For a vocabulary of size V, given the target word “2C(S”, the model learns to predict two context words. Both the target word and context words are represented as one-hot encoded binary vectors of size V. The number of neurons in the hidden layer determines the size of the embedding vectors. The weight matrix between the input layer and the hidden layer stores the embeddings of the vocabulary words. The i-th row of the embedding matrix corresponds to the embedding of the i-th word.

The Word2Vec architecture has inspired a great deal of research in the bio/cheminformatics domains. The Word2Vec algorithm has been successfully applied for determining protein classes [45] and protein-protein interactions (PPIs) [57]. Asgari and Mofrad [45] treated 3-mers as the words of the protein sequence and observed that 3-mers with similar biophysical and biochemical properties clustered together when their embeddings were mapped onto the 2D space. Wang et al. [57], on the other hand, utilized BPE-based word segmentation (i.e. bio-words) to determine the words. The authors argued that the improved performance for bio-words in the PPI prediction task might be due to the segmentation-based model providing more distinct words than k-mers, which include repetitive segments. Another recent study treated multi-domain proteins as sentences in which each domain was recognized as a word [61]. The Word2Vec algorithm was trained on the domains (i.e. PFAM domain identifiers) of eukaryotic protein sequences to learn semantically interpretable representations of them. The domain representations were then investigated in terms of the Gene Ontology (GO) annotations that they inherit. The results indicated that semantically similar domains share similar GO terms.
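As an illustration of this idea applied to chemical words, below is a minimal sketch that trains a skip-gram Word2Vec model on "sentences" of overlapping 8-mers extracted from SMILES strings; it assumes gensim >= 4 (where the embedding size argument is named vector_size), and the two-molecule corpus is a toy stand-in, not the training data of any cited study.

```python
from gensim.models import Word2Vec

def words(smiles, k=8):
    """Overlapping 8-mers as the 'words' of a SMILES 'sentence'."""
    return [smiles[i:i + k] for i in range(len(smiles) - k + 1)]

corpus = [words(s) for s in [
    "CC1(C(N2C(S1)C(C2=O)NC(=O)C(C3=CC=CC=C3)N)C(=O)O)C",
    "CC(=O)Oc1ccccc1C(=O)O",
]]
# sg=1 selects the Skip-Gram objective; CBOW would be sg=0.
model = Word2Vec(corpus, vector_size=100, window=5, sg=1, min_count=1)
print(model.wv["C(=O)O)C"].shape)   # (100,) dense "chemical word" vector
```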
The Word2Vec algorithm was also utilized for the representation of chemicals. SMILESVec, a text-based ligand representation technique, utilized Word2Vec to learn embeddings for 8-mers (i.e. chemical words) that are extracted from SMILES strings [46]. SMILESVec was utilized in protein representation such that proteins were represented as the average of the SMILESVec vectors of their interacting ligands. The results indicated comparable performances for ligand-based and sequence-based protein representations in protein family/superfamily clustering. Mol2Vec [81], on the other hand, was based on the identifiers of the substructures (i.e. words of the chemical) that were extracted via Extended Connectivity Fingerprint (ECFP) [82]. The results showed a better performance with Mol2Vec than with the simple Morgan fingerprint in a solubility prediction task, and a comparable performance to a graph-based chemical representation [83]. Chakravarti [84] also employed the Word2Vec model, trained on fragments extracted from SMILES strings using a graph traversing algorithm. The results favored the distributed fragment-based ligand representation over a fragment-based binary vector representation in a ring system clustering task, and showed a comparable performance in the prediction of toxicity against Tetrahymena [84]. Figure 2 illustrates the pipeline of a text-based molecule representation based on k-mers.

FP2Vec is another method that utilizes an embedding representation for molecules; however, instead of the Word2Vec algorithm, it depends on a Convolutional Neural Network (CNN) to build molecule representations to be used in toxicity prediction tasks [85]. CNN architectures have also been utilized for drug-target binding affinity prediction [86] and drug-drug interaction prediction [76] to build representations for chemicals from raw SMILES strings, as well as for protein fold prediction [87] to learn representations for proteins from amino-acid sequences. SMILES2Vec adopted different DL architectures (GRU, LSTM, CNN+GRU, and CNN+LSTM) to learn molecule embeddings, which were then used to predict toxicity, affinity and solubility [88]. A CNN+GRU combination was better at the prediction of chemical properties. A recent study compared several DL approaches to investigate the effect of different chemical representations, learned through these architectures, on a chemical property prediction problem [89]. The authors also combined DL architectures that were trained on SMILES strings with the MACCS fingerprint, proposing a combined representation for molecules (i.e. CheMixNet). The CheMixNet representation outperformed the other representations that were trained on a single data type, such as SMILES2Vec (i.e. SMILES) and Chemception (i.e. 2D graph) [90].

Text generation is a primary NLP task, where the aim is to generate grammatically and semantically correct text, with many applications ranging from question answering to machine translation [91]. It is generally formulated as a language modeling task, where a statistical model is trained using a large corpus to predict the distribution of the next word in a given context. In machine translation, the generated text is the translation of an input text in another language.

Medicinal chemistry campaigns use methods such as scaffold hopping [92] or fragment-based drug design [4] to build and test novel molecules, but the chemotype diversity and novelty may be limited. It is possible to explore uncharted chemical space with text generation models, which learn a distribution from the available data (i.e. the SMILES language) and generate novel molecules that share similar physicochemical properties with the existing molecules [75]. Molecule generation can then be followed by assessing the physicochemical properties of the generated compound or its binding potential to a target protein [75]. For a comprehensive review of molecule generation methodologies, including graph-based models, we refer the reader to the review of Elton et al. [93].
Machine translation models have also recently been adapted to text-based molecule generation, starting with one “language”, such as that of reactants, and generating a novel text in another “language”, such as that of products [29]. Below, we present recent studies on text-based molecule generation.

RNN models, which learn a probability distribution from a training set of molecules, are commonly used in molecule generation to propose novel molecules similar to the ones in the training data set. For instance, given the SMILES sequence “C(=O”, the model would predict the next character to be “)” with a higher probability than “(”. The production of valid SMILES strings, however, is a challenge because of the complicated SMILES syntax that utilizes parentheses to indicate branches and ring numbers. The sequential nature of RNNs, which may miss long-range dependencies, is a disadvantage of these models [75]. The RNN descendants LSTM and GRU, which model long-term dependencies, are better suited for remembering matching rings and branch closures. Motivated by such a hypothesis, Segler et al. [75] and Ertl et al. [94] successfully pioneered de novo molecule generation using the LSTM architecture to generate valid novel SMILES. Segler et al. [75] further modified their model to generate target-specific molecules by integrating a target bioactivity prediction step to filter out inactive molecules and then retraining the LSTM network. In another study, transfer learning was adopted to fine-tune an LSTM-based SMILES generation model so that structurally similar leads were generated for targets with few known ligands [95]. Olivecrona et al. [96] and Popova et al. [97] used reinforcement learning (RL) to bias their models toward compounds with desired properties. Merk et al. [98, 99] fine-tuned their LSTM model on a target-focused library of active molecules and synthesized some novel compounds. Arús-Pous et al. [100] explored how much of the GDB-13 database [101] they could rediscover by using an RNN-based generative model.

The variational auto-encoder (VAE) is another widely adopted text generation architecture [102]. Gómez-Bombarelli et al. [35] adopted this architecture for molecule generation. A traditional auto-encoder encodes the input into the latent space, which is then decoded to reconstruct the input. VAE differs from AE by explicitly defining a probability distribution on the latent space from which to generate new samples. Gómez-Bombarelli et al. [35] hypothesized that the variational part of the system integrates noise into the encoder, so that the decoder can be more robust to the large diversity of molecules. However, the authors also reported that the non-context-free property of SMILES, caused by matching ring numbers and parentheses, might often lead the decoder to generate invalid SMILES strings. A grammar variational auto-encoder (GVAE), where the grammar for SMILES is explicitly defined instead of the auto-encoder learning the grammar itself, was proposed to address this issue [103]. This way, the generation is based on pre-defined grammar rules, and the decoding process generates grammar production rules that should also be grammatically valid. Although syntactic validity would be ensured, the molecules may not have semantic (chemical) validity. Dai et al. [104] built upon the VAE [35] and GVAE [103] architectures and introduced a syntax-directed variational auto-encoder (SD-VAE) model for the molecular generation task.
The syntax-directed generative mechanism in the decoder contributed to creating both syntactically and semantically valid SMILES sequences. Dai et al. [104] compared the latent representations of molecules generated by VAE, GVAE, and SD-VAE, and showed that SD-VAE provided better discriminative features for druglikeness. Blaschke et al. [105] proposed an adversarial AE for the same task. Conditional VAEs [106, 107] were trained to generate molecules conditioned on a desired property. The challenges that the SMILES syntax presents inspired the introduction of new notations such as DeepSMILES [30] and SELFIES [33] (details in Section 2.1).

Generative Adversarial Network (GAN) models generate novel molecules by using two components: the generator network generates novel molecules, and the discriminator network aims to distinguish between the generated molecules and real molecules [108]. In text generation models, the novel molecules are drawn from a distribution, which is then fine-tuned to obtain specific features, whereas adversarial learning utilizes generator and discriminator networks to produce novel molecules [108, 109]. ORGAN [109], a molecular generation methodology, was built upon a sequence generative adversarial network (SeqGAN) from NLP [110]. ORGAN integrated RL in order to generate molecules with desirable properties such as solubility, druglikeness, and synthesizability through using domain-specific rewards [109].
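To make the character-level language modeling behind these RNN-based generators concrete, here is a minimal, untrained PyTorch sketch (PyTorch assumed installed); the toy vocabulary, tokenization and layer sizes are illustrative and are not taken from any cited model.

```python
import torch
import torch.nn as nn

class SmilesLM(nn.Module):
    """Predicts a distribution over the next SMILES character."""
    def __init__(self, vocab_size, embed_dim=64, hidden_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, vocab_size)

    def forward(self, tokens, state=None):
        out, state = self.lstm(self.embed(tokens), state)
        return self.head(out), state      # logits over the vocabulary

vocab = ["^", "$", "C", "(", ")", "=", "O", "N", "1"]  # toy vocabulary
model = SmilesLM(len(vocab))
tokens = torch.tensor([[0, 2, 2, 3, 5, 6, 4, 6]])      # "^CC(=O)O"
logits, _ = model(tokens)
# Probability of each possible next character given the prefix;
# training would minimize cross-entropy against the actual next one.
print(torch.softmax(logits[0, -1], dim=-1))
```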
Machine Translation.
Machine translation finds use in cheminformatics in “translation” from one language (e.g. reactants) to another (e.g. products). Machine translation is a challenging task because the syntactic and semantic dependencies of each language differ from one another, and this may give rise to ambiguities. Neural Machine Translation (NMT) models benefit from the potential of deep learning architectures to build a statistical model that aims to find the most probable target sequence for an input sequence by learning from a corpus of examples [111, 112]. The main advantage of NMT models is that they provide an end-to-end system that utilizes a single neural network to convert the source sequence into the target sequence. Sutskever et al. [111] refer to their model as a sequence-to-sequence (seq2seq) system that addresses a major limitation of DNNs, which can only work with fixed-dimensionality information as input and output. In the machine translation task, however, the length of the input sequences is not fixed, and the length of the output sequences is not known in advance.

The NMT models are based on an encoder-decoder architecture that aims to maximize the probability of generating the target sequence (i.e. the most likely correct translation) for the given source sequence. The first encoder-decoder architectures in NMT performed poorly as the sequence length increased, mainly because the encoder mapped the source sequence into a single fixed-length vector. However, a fixed-size representation may be too small to encode all the information required to translate long sequences [113]. To overcome the issue of the fixed context vector (Figure 3a), a new method was developed, in which every source token was encoded into a memory bank independently (Figure 3b). The decoder could then selectively focus on parts of this memory bank during translation [113, 114]. This technique is known as the “attention mechanism” [115].

Inspired by the successes in NMT, the first application of seq2seq models in cheminformatics was for reaction prediction by Nam and Kim [116], who proposed to translate the SMILES strings of reactants and separated reagents to the corresponding product SMILES. The authors hypothesized that the reaction prediction problem can be re-modelled as a translation system in which both the inputs and the output are sequences. Their model used GRUs for the encoder-decoder and a Bahdanau [113] attention layer in between. Liu et al. [117], in contrast, performed the opposite task, single-step retrosynthesis prediction, using a similar encoder-decoder model. When given a product and a reaction class, their model predicted the reactants that would react together to form that product. One major challenge in the retrosynthesis prediction task is the possibility of multiple correct targets, because more than one reactant combination could lead to the same product. Similarly to Nam and Kim [116], Schwaller et al. [118] also adopted a seq2seq model to translate precursors into products, utilizing the SMILES representation for the reaction prediction problem. Their model used a different attention mechanism, by Luong et al. [114], and LSTMs in the encoder and decoder. By visualizing the attention weights, an atom-wise mapping between the product and the reactants could be obtained and used to understand the predictions better. Schwaller et al.
[118] showed that seq2seq models could compete with graph neural network-based models in the reaction prediction task [119].

A translation model was also employed to learn a data-driven representation of molecules [36]. Winter et al. [36] translated between two textual representations of a chemical, InChI and SMILES, to extract latent representations that can integrate the semantic “meaning” of the molecule. The results indicated a statistically significant improvement with the latent representations in a ligand-based virtual screening task against fingerprint methods such as ECFP (i.e. the Morgan algorithm). NMT architectures were also adopted in a protein function prediction task for the first time, in which “words” extracted from protein sequences were translated into GO identifiers using RNNs as the encoder and decoder [48]. Although exhibiting a performance comparable to the state-of-the-art protein function prediction methods, the authors argued that the performance of the model could be improved by determining more meaningful “words”, such as biologically interpretable fragments.

The Transformer is an attention-based encoder-decoder architecture that was introduced to NMT by Vaswani et al. [120]. Although similar to previous studies [111, 112, 113] in terms of adopting an encoder-decoder architecture, the Transformer differs from the others because it consists only of attention and feed-forward layers in the encoder and decoder. As Transformers do not contain an RNN, positional embeddings are needed to capture order relationships in the sequences. Schwaller et al. [29] were the first to adopt the Transformer architecture in cheminformatics and designed a
Molecular Transformer for the chemical reaction prediction task. The Molecular Transformer, which was atom-mapping independent, outperformed the other algorithms (e.g. one based on a two-step convolutional graph neural network [121]) on commonly used benchmark data sets. The Transformer architecture was also adopted to learn representations for chemicals in the prediction of drug-target interactions [122] and molecular properties [123], in which the proposed systems either outperformed the state-of-the-art systems or obtained comparable results.
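To make the attention computation itself concrete, below is a minimal NumPy sketch of the scaled dot-product attention at the core of the Transformer [120]; the token count and dimensions are illustrative.

```python
import numpy as np

def attention(Q, K, V):
    """Each output row is a weighted average of the value rows V,
    weighted by softmaxed, scaled query-key dot products."""
    scores = Q @ K.T / np.sqrt(K.shape[-1])          # (n_q, n_k)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over keys
    return weights @ V                               # (n_q, d_v)

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(5, 16)) for _ in range(3))  # 5 tokens, dim 16
print(attention(Q, K, V).shape)                          # (5, 16)
```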
3. Future Perspectives
The increase in the biochemical data available in public databases, combined with the advances in computational power and NLP methodologies, has given rise to a rapid growth in the publication rate in bio/cheminformatics, especially through pre-print servers. As this interdisciplinary field grows, novel opportunities come hand in hand with novel challenges.
The major challenges that can be observed from investigating these studies can be summarized as follows: (i) the need for universalized benchmarks and metrics, (ii) reproducibility of the published methodologies, (iii) bias in available data, and (iv) biological and chemical interpretability/explainability of the solutions.

Benchmarking.
There are several steps in the drug discovery pipeline, from affinity prediction to the prediction of other chemical properties such as toxicity and solubility. The use of different datasets and different evaluation metrics makes the assessment of model performance challenging. Comprehensive benchmarking platforms that can assess the success of different tools are still lacking. A benchmarking environment rigorously brings together the suitable data sets and evaluation methodologies in order to provide a fair comparison between the available tools. Such environments are available for the molecule generation task from MOSES [124] and GuacaMol [125].
MoleculeNet is a similar attempt to build a benchmarking platform for tasks such as the prediction of binding affinity and toxicity [83].
Reproducibility.
Despite the focus on sharing datasets and source code on popular software development platforms such as GitHub (github.com) or Zenodo (zenodo.org), it is still a challenge to use data or code from other groups. The use of the FAIR (Findable, Accessible, Interoperable and Reusable) (meta)data principles can guide the management of scientific data [126]. Automated workflows that are easy to use and do not require programming knowledge encourage the flow of information from one discipline to the other. Platform-free solutions such as Docker (docker.com), in which an image of the source code is saved and can be opened without requiring further installation, could accelerate the reproduction process. A recent initiative to provide a unified framework for predictive models in genomics could quickly be adopted by the medicinal chemistry community [127].
Bias in data.
The available data has two significant sources of bias, one related to the limited sampling of chemical space and the other related to the quality and reproducibility of the data. The lack of information about some regions of the protein/chemical landscape limits the current methodologies to the exploitation of data rather than full exploration. The data on protein-compound interactions is biased toward some privileged molecules or proteins, because the protein targets are related to common diseases or the molecules are similar to known actives. Hence, not all of chemical space is sampled, and chemical space is expanded based on the similarity of an active compound to others, which is also referred to as inductive bias [128]. Data about proteins or molecules related to rare diseases is limited, and inactive molecules are frequently not reported. Moreover, some experimental measurements that are not reproducible across different labs or conditions limit their reliability [129]. Sieg et al. [130] and Zhang and Lee [131] have recently discussed the bias factors in dataset composition. Zhang and Lee have also addressed the sources of bias in the data and proposed to use Bayesian deep learning to quantify uncertainty.
Interpretability.
The black-box nature of ML/DL methodologies makes assigning meaning to the results difficult. Explainability of an ML model is especially critical in drug discovery to facilitate the use of these findings by medicinal chemists, who can contribute to the knowledge loop. Explainable AI (XAI) is a current challenge that calls for increased interpretability of AI solutions for a given context and includes several factors such as trust, safety, privacy, security, fairness and confidence [132]. Explainability is also critical for domain experts to assess the reliability of new methodologies. Interpretability is usually classified into two categories: post-hoc (i.e. after) and ante-hoc (i.e. before). Post-hoc approaches explain the predictions of the model, whereas ante-hoc approaches integrate explainability into the model. Recent studies have already aimed to map the semantic meaning behind the models onto the biochemical description. An attentive pooling network, a two-way attention system that extends the attention mechanism by allowing input nodes to be aware of one another, is one approach that has been employed in drug-target interaction prediction [133]. Preuer et al. [77] showed that mapping activations of hidden neurons in feed-forward neural networks to pharmacophores, or linking atom representations computed by convolutional filters to substructures in a graph-convolution model, are possible ways of integrating explainability into AI-based drug discovery systems. Bradshaw et al. [134] also demonstrated a novel approach that combines molecule generation and retrosynthesis prediction to generate synthesizable molecules. The integration of such solutions into drug discovery problems will be useful not only for computational researchers but also for the medicinal chemistry community.
The NLP field has seen tremendous advances in the past five years, starting with the introduction of distributed word embedding algorithms such as Word2Vec [73] and GloVe [80]. The concept of contextualized word embeddings (i.e. ELMo) was introduced soon after [135]. Here, the embedding of a word is not fixed, but changes according to the context (i.e. sentence) in which it appears. These advances continued with more complicated architectures such as the Transformer (i.e. Generative Pre-Training or GPT) [136] and the BERT [137], RoBERTa [138], GPT-2 [139], Transformer-XL [140], and XLNet [141] models. Such models with a focus on context might have a significant impact not only on drug discovery, but also on the protein folding problem, which is critical for predicting structural properties of the protein partner. Secondary structure [142, 143, 144], domain boundary [145] and fold [50] prediction studies often use sequence information in combination with similarity to available structures. The recent success of AlphaFold [146] in the Critical Assessment of Protein Structure Prediction (CASP) competitions (http://predictioncenter.org/) showed that the enhanced definitions of context, brought about by the advances in machine/deep learning systems, might be useful for capturing the global dependencies in protein sequences to detect interactions between residues separated in sequence space but close together in 3D space [142].

Unsupervised learning can be used on “big” textual data through language models with attention [120] and through pre-trained checkpoints from language models [147]. Encoder-decoder architectures have also had a significant impact on solving text generation and machine translation problems and were successfully applied to the molecule generation problem. As NLP moves forward, the most recent approaches, such as the Topic-Guided VAE [91] and knowledge graphs with graph transformers [148], will easily find application in bio/cheminformatics.

Recent NLP models are not domain-specific, and they can help with the generalization of models [139]. Current studies emphasize multi-task learning, which requires the use of DNNs that share parameters to learn more information from related but individual tasks [149, 139]. Combined with the transferability of contextual word representation models, multi-task learning can also provide solutions to drug discovery, which has many interwoven tasks, such as chemical property prediction and molecule generation.

Language has an important power, not only for daily communication but also for the communication of codified domain knowledge. Deciphering the meaning behind text is the primary purpose of NLP, which inevitably has found its way to bio/cheminformatics. The complicated nature of biochemical text makes understanding the semantic construction of the hidden words all the more challenging and interesting. The applications we discussed in this review provide a broad perspective of how NLP is already integrated with the processing of biochemical text. A common theme in all of these applications is the use of AI-based methodologies that drive and benefit from the NLP field. Novel advances in NLP and ML are providing auspicious results toward solving long-standing bio/cheminformatics problems.

With this review, we have summarized the impact of NLP on bio/cheminformatics to encourage this already interdisciplinary field to take advantage of recent advances.
The communication between researchers from different backgrounds and domains can be enhanced through establishing a common vocabulary toward common goals. This review has been an attempt to facilitate this conversation.
Acknowledgement
This work is partially supported by TUBITAK (The Scientific and Techno-logical Research Council of Turkey) under grant number 119E133. HO acknowl-edges TUBITAK-BIDEB 2211 scholarship program and thanks G¨ok¸ce Uludo˘ganfor her comments on figures. EO thanks Prof. Amedeo Caflisch for hosting herat the University of Zurich during her sabbatical.26 eferences [1] G. Schneider, Automating drug discovery, Nature Reviews Drug Discovery17 (2018) 97–113.[2] J. G. Moffat, F. Vincent, J. A. Lee, J. Eder, M. Prunotto, Opportunitiesand challenges in phenotypic drug discovery: an industry perspective,Nature reviews Drug discovery 16 (2017) 531.[3] Y. Duarte, V. M´arquez-Miranda, M. J. Miossec, F. Gonz´alez-Nilo, Inte-gration of target discovery, drug discovery and drug delivery: A review oncomputational strategies, Wiley Interdisciplinary Reviews: Nanomedicineand Nanobiotechnology (2019) e1554.[4] P. ´Sled´z, A. Caflisch, Protein structure-based drug design: from dockingto molecular dynamics, Current opinion in structural biology 48 (2018)93–102.[5] J. Lyu, S. Wang, T. E. Balius, I. Singh, A. Levit, Y. S. Moroz, M. J.OMeara, T. Che, E. Algaa, K. Tolmachova, et al., Ultra-large librarydocking for discovering new chemotypes, Nature 566 (2019) 224.[6] P. Schneider, G. Schneider, De novo design at the edge of chaos: Miniper-spective, Journal of medicinal chemistry 59 (2016) 4077–4086.[7] N. Bosc, F. Atkinson, E. Felix, A. Gaulton, A. Hersey, A. R. Leach, Largescale comparison of qsar and conformal prediction methods and their ap-plications in drug discovery, Journal of cheminformatics 11 (2019) 4.[8] H. Eckert, J. Bajorath, Molecular similarity analysis in virtual screening:foundations, limitations and novel approaches, Drug discovery today 12(2007) 225–233.[9] Y.-C. Lo, S. E. Rensi, W. Torng, R. B. Altman, Machine learning inchemoinformatics and drug discovery, Drug discovery today 23 (2018)1538–1546. 2710] R. Wang, X. Fang, Y. Lu, C.-Y. Yang, S. Wang, The pdbbind database:methodologies and updates, Journal of medicinal chemistry 48 (2005)4111–4119.[11] R. Apweiler, A. Bairoch, C. H. Wu, W. C. Barker, B. Boeckmann, S. Ferro,E. Gasteiger, H. Huang, R. Lopez, M. Magrane, et al., Uniprot: theuniversal protein knowledgebase, Nucleic acids research 32 (2004) D115–D119.[12] E. E. Bolton, Y. Wang, P. A. Thiessen, S. H. Bryant, Pubchem: integratedplatform of small molecules and biological activities, in: Annual reportsin computational chemistry, volume 4, Elsevier, 2008, pp. 217–241.[13] C. D. Manning, C. D. Manning, H. Sch¨utze, Foundations of statisticalnatural language processing, MIT press, 1999.[14] D. Oliveira, R. Sahay, M. d’Aquin, Leveraging ontologies for knowledgegraph schemas (2019).[15] P. Ernst, A. Siu, G. Weikum, Knowlife: a versatile approach for construct-ing a large knowledge graph for biomedical sciences, BMC bioinformatics16 (2015) 157.[16] M. Krallinger, O. Rabal, A. Lourenco, J. Oyarzabal, A. Valencia, Infor-mation retrieval and text mining technologies for chemistry, Chemicalreviews 117 (2017) 7673–7761.[17] T. M. Karve, A. K. Cheema, Small changes huge impact: the role ofprotein posttranslational modifications in cellular homeostasis and disease,Journal of amino acids 2011 (2011).[18] S. Heller, A. McNaught, S. Stein, D. Tchekhovskoi, I. Pletnev, Inchi-theworldwide chemical structure identifier standard, Journal of cheminfor-matics 5 (2013) 7. 2819] D. Weininger, Smiles, a chemical language and information system. 
[20] A. Gaulton, L. J. Bellis, A. P. Bento, J. Chambers, M. Davies, A. Hersey, Y. Light, S. McGlinchey, D. Michalovich, B. Al-Lazikani, et al., ChEMBL: a large-scale bioactivity database for drug discovery, Nucleic Acids Research 40 (2011) D1100–D1107.
[21] G. G. Chowdhury, Natural language processing, Annual Review of Information Science and Technology 37 (2003) 51–89.
[22] E. Garfield, Chemico-linguistics: computer translation of chemical nomenclature, Nature 192 (1961) 192.
[23] C. B. Anfinsen, Principles that govern the folding of protein chains, Science 181 (1973) 223–230.
[24] S. B. Needleman, C. D. Wunsch, A general method applicable to the search for similarities in the amino acid sequence of two proteins, Journal of Molecular Biology 48 (1970) 443–453.
[25] T. F. Smith, M. S. Waterman, et al., Identification of common molecular subsequences, Journal of Molecular Biology 147 (1981) 195–197.
[26] D. S. Wishart, C. Knox, A. C. Guo, S. Shrivastava, M. Hassanali, P. Stothard, Z. Chang, J. Woolsey, DrugBank: a comprehensive resource for in silico drug discovery and exploration, Nucleic Acids Research 34 (2006) D668–D672.
[27] E. J. Bjerrum, SMILES enumeration as data augmentation for neural network modeling of molecules, arXiv preprint arXiv:1703.07076 (2017).
[28] T. B. Kimber, S. Engelke, I. V. Tetko, E. Bruno, G. Godin, Synergy effect between convolutional neural networks and the multiplicity of SMILES for improvement of molecular prediction, arXiv preprint arXiv:1812.04439 (2018).
[29] P. Schwaller, T. Laino, T. Gaudin, P. Bolgar, C. A. Hunter, C. Bekas, A. A. Lee, Molecular Transformer: A model for uncertainty-calibrated chemical reaction prediction, ACS Central Science (2019).
[30] N. O'Boyle, A. Dalke, DeepSMILES: An adaptation of SMILES for use in machine-learning of chemical structures (2018).
[31] H. Öztürk, A. Özgür, E. Ozkirimli, A chemical language based approach for protein-ligand interaction prediction, arXiv preprint arXiv:1811.00761 (2018).
[32] J. Arús-Pous, S. Johansson, O. Ptykhodko, E. J. Bjerrum, C. Tyrchan, J. Reymond, H. Chen, O. Engkvist, Randomized SMILES strings improve the quality of molecular generative models (2019).
[33] M. Krenn, F. Häse, A. Nigam, P. Friederich, A. Aspuru-Guzik, SELFIES: a robust representation of semantically constrained graphs with an example application in chemistry, arXiv preprint arXiv:1905.13741 (2019).
[34] S. R. Heller, A. McNaught, I. Pletnev, S. Stein, D. Tchekhovskoi, InChI, the IUPAC international chemical identifier, Journal of Cheminformatics 7 (2015) 23.
[35] R. Gómez-Bombarelli, J. N. Wei, D. Duvenaud, J. M. Hernández-Lobato, B. Sánchez-Lengeling, D. Sheberla, J. Aguilera-Iparraguirre, T. D. Hirzel, R. P. Adams, A. Aspuru-Guzik, Automatic chemical design using a data-driven continuous representation of molecules, ACS Central Science 4 (2018) 268–276.
[36] R. Winter, F. Montanari, F. Noé, D.-A. Clevert, Learning continuous and data-driven molecular descriptors by translating equivalent chemical representations, Chemical Science 10 (2019) 1692–1701.
[37] D. Ghersi, M. Singh, molBLOCKS: decomposing small molecule sets and uncovering enriched fragments, Bioinformatics 30 (2014) 2081–2083.
[38] X. Q. Lewell, D. B. Judd, S. P. Watson, M. M. Hann, RECAP - retrosynthetic combinatorial analysis procedure: a powerful new technique for identifying privileged molecular fragments with useful applications in combinatorial chemistry, Journal of Chemical Information and Computer Sciences 38 (1998) 511–522.
[39] J. Degen, C. Wegscheid-Gerlach, A. Zaliani, M. Rarey, On the art of compiling and using 'drug-like' chemical fragment spaces, ChemMedChem: Chemistry Enabling Drug Discovery 3 (2008) 1503–1507.
[40] S. Avramova, N. Kochev, P. Angelov, RetroTransformDB: A dataset of generic transforms for retrosynthetic analysis, Data 3 (2018) 14.
[41] S. Arvidsson, O. Spjuth, L. Carlsson, P. Toccaceli, Prediction of metabolic transformations using cross Venn-ABERS predictors, in: Conformal and Probabilistic Prediction and Applications, 2017, pp. 118–131.
[42] P. Schwaller, A. C. Vaucher, V. H. Nair, T. Laino, Data-driven chemical reaction classification with attention-based neural networks (2019).
[43] D. Vidal, M. Thormann, M. Pons, LINGO, an efficient holographic text based method to calculate biophysical properties and intermolecular similarities, Journal of Chemical Information and Modeling 45 (2005) 386–393.
[44] H. Öztürk, E. Ozkirimli, A. Özgür, A comparative study of SMILES-based compound similarity functions for drug-target interaction prediction, BMC Bioinformatics 17 (2016) 128.
[45] E. Asgari, M. R. Mofrad, Continuous distributed representation of biological sequences for deep proteomics and genomics, PLoS ONE 10 (2015) e0141287.
[46] H. Öztürk, E. Ozkirimli, A. Özgür, A novel methodology on distributed representations of proteins using their interacting ligands, Bioinformatics 34 (2018) i295–i303.
[47] K. Motomura, T. Fujita, M. Tsutsumi, S. Kikuzato, M. Nakamura, J. M. Otaki, Word decoding of protein amino acid sequences with availability analysis: a linguistic approach, PLoS ONE 7 (2012) e50039.
[48] R. Cao, C. Freitas, L. Chan, M. Sun, H. Jiang, Z. Chen, ProLanGO: protein function prediction using neural machine translation based on a recurrent neural network, Molecules 22 (2017) 1732.
[49] A. Ranjan, M. S. Fahad, D. Fernandez-Baca, A. Deepak, S. Tripathi, Deep robust framework for protein function prediction using variable-length protein sequences, IEEE/ACM Transactions on Computational Biology and Bioinformatics (2019).
[50] L. Wei, M. Liao, X. Gao, Q. Zou, Enhanced protein fold prediction method through a novel feature extraction technique, IEEE Transactions on Nanobioscience 14 (2015) 649–659.
[51] A. Cadeddu, E. K. Wylie, J. Jurczak, M. Wampler-Doty, B. A. Grzybowski, Organic chemistry as a language and the implications of chemical linguistics for structural and retrosynthetic analyses, Angewandte Chemie International Edition 53 (2014) 8108–8112.
[52] M. Woźniak, A. Wołos, U. Modrzyk, R. L. Górski, J. Winkowski, M. Bajczyk, S. Szymkuć, B. A. Grzybowski, M. Eder, Linguistic measures of chemical diversity and the keywords of molecular collections, Scientific Reports 8 (2018).
[53] G. K. Zipf, Human Behavior and the Principle of Least Effort (1949).
[54] D. Ganesan, A. V. Tendulkar, S. Chakraborti, Protein word detection using text segmentation techniques, in: BioNLP 2017, 2017, pp. 238–246.
[55] N. Hulo, A. Bairoch, V. Bulliard, L. Cerutti, E. De Castro, P. S. Langendijk-Genevaux, M. Pagni, C. J. Sigrist, The PROSITE database, Nucleic Acids Research 34 (2006) D227–D230.
[56] R. Sennrich, B. Haddow, A. Birch, Neural machine translation of rare words with subword units, arXiv preprint arXiv:1508.07909 (2015).
[57] Y. Wang, Z.-H. You, S. Yang, X. Li, T.-H. Jiang, X. Zhou, A high efficient biological language model for predicting protein-protein interactions, Cells 8 (2019) 122.
[58] M. Gimona, Protein linguistics - a grammar for modular protein assembly?, Nature Reviews Molecular Cell Biology 7 (2006) 68.
[59] A. Scaiewicz, M. Levitt, The language of the protein universe, Current Opinion in Genetics & Development 35 (2015) 50–56.
[60] L. Yu, D. K. Tanwar, E. D. S. Penha, Y. I. Wolf, E. V. Koonin, M. K. Basu, Grammar of protein domain architectures, Proceedings of the National Academy of Sciences 116 (2019) 3636–3645.
[61] D. Buchan, D. Jones, Inferring protein domain semantic roles using word2vec, bioRxiv (2019) 617647.
[62] P. Greenside, M. Hillenmeyer, A. Kundaje, Prediction of protein-ligand interactions from paired protein sequence motifs and ligand substructures, in: Pacific Symposium on Biocomputing, volume 23, World Scientific, 2017.
[63] H. Öztürk, E. Ozkirimli, A. Özgür, WideDTA: prediction of drug-target binding affinity, arXiv preprint arXiv:1902.04166 (2019).
[64] P. J. Ropp, J. C. Kaminsky, S. Yablonski, J. D. Durrant, Dimorphite-DL: an open-source program for enumerating the ionization states of drug-like small molecules, Journal of Cheminformatics 11 (2019) 14.
[65] N. Cheron, N. Jasty, E. I. Shakhnovich, OpenGrowth: an automated and rational algorithm for finding new protein ligands, Journal of Medicinal Chemistry 59 (2015) 4171–4188.
[66] J. N. Wei, D. Duvenaud, A. Aspuru-Guzik, Neural networks for the prediction of organic chemistry reactions, ACS Central Science 2 (2016) 725–732.
[67] J. L. Durant, B. A. Leland, D. R. Henry, J. G. Nourse, Reoptimization of MDL keys for use in drug discovery, Journal of Chemical Information and Computer Sciences 42 (2002) 1273–1280.
[68] G. Salton, A. Wong, C.-S. Yang, A vector space model for automatic indexing, Communications of the ACM 18 (1975) 613–620.
[69] M. Bilenko, R. J. Mooney, Adaptive duplicate detection using learnable string similarity measures, in: Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, 2003, pp. 39–48.
[70] C. M. Bishop, Pattern Recognition and Machine Learning, Springer Science+Business Media, 2006.
[71] P. D. Turney, P. Pantel, From frequency to meaning: Vector space models of semantics, Journal of Artificial Intelligence Research 37 (2010) 141–188.
[72] K. S. Jones, A statistical interpretation of term specificity and its application in retrieval, Journal of Documentation (2004).
[73] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, J. Dean, Distributed representations of words and phrases and their compositionality, in: Advances in Neural Information Processing Systems, 2013, pp. 3111–3119.
[74] J. Schwartz, M. Awale, J.-L. Reymond, SMIfp (SMILES fingerprint) chemical space for virtual screening and visualization of large databases of organic molecules, Journal of Chemical Information and Modeling 53 (2013) 1979–1989.
[75] M. H. Segler, T. Kogej, C. Tyrchan, M. P. Waller, Generating focused molecule libraries for drug discovery with recurrent neural networks, ACS Central Science 4 (2018) 120–131.
[76] S. Kwon, S. Yoon, DeepCCI: End-to-end deep learning for chemical-chemical interaction prediction, in: Proceedings of the 8th ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics, ACM, 2017, pp. 203–212.
[77] K. Preuer, G. Klambauer, F. Rippmann, S. Hochreiter, T. Unterthiner, Interpretable deep learning in drug discovery, arXiv preprint arXiv:1903.02788 (2019).
[78] N. De Cao, T. Kipf, MolGAN: An implicit generative model for small molecular graphs, arXiv preprint arXiv:1805.11973 (2018).
[79] A. Mayr, G. Klambauer, T. Unterthiner, S. Hochreiter, DeepTox: toxicity prediction using deep learning, Frontiers in Environmental Science 3 (2016) 80.
[80] J. Pennington, R. Socher, C. Manning, GloVe: Global vectors for word representation, in: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2014, pp. 1532–1543.
[81] S. Jaeger, S. Fulle, S. Turk, Mol2vec: Unsupervised machine learning approach with chemical intuition, Journal of Chemical Information and Modeling 58 (2018) 27–35.
[82] D. Rogers, M. Hahn, Extended-connectivity fingerprints, Journal of Chemical Information and Modeling 50 (2010) 742–754.
[83] Z. Wu, B. Ramsundar, E. N. Feinberg, J. Gomes, C. Geniesse, A. S. Pappu, K. Leswing, V. Pande, MoleculeNet: a benchmark for molecular machine learning, Chemical Science 9 (2018) 513–530.
[84] S. K. Chakravarti, Distributed representation of chemical fragments, ACS Omega 3 (2018) 2825–2836.
[85] W. Jeon, D. Kim, FP2VEC: a new molecular featurizer for learning molecular properties, Bioinformatics (2019).
[86] H. Öztürk, A. Özgür, E. Ozkirimli, DeepDTA: deep drug-target binding affinity prediction, Bioinformatics 34 (2018) i821–i829.
[87] J. Hou, B. Adhikari, J. Cheng, DeepSF: deep convolutional neural network for mapping protein sequences to folds, Bioinformatics 34 (2017) 1295–1303.
[88] G. B. Goh, N. O. Hodas, C. Siegel, A. Vishnu, SMILES2Vec: An interpretable general-purpose deep neural network for predicting chemical properties, arXiv preprint arXiv:1712.02034 (2017).
[89] A. Paul, D. Jha, R. Al-Bahrani, W.-k. Liao, A. Choudhary, A. Agrawal, CheMixNet: Mixed DNN architectures for predicting chemical properties using multiple molecular representations, arXiv preprint arXiv:1811.08283 (2018).
[90] G. B. Goh, C. Siegel, A. Vishnu, N. O. Hodas, N. Baker, Chemception: a deep neural network with minimal chemistry knowledge matches the performance of expert-developed QSAR/QSPR models, arXiv preprint arXiv:1706.06689 (2017).
[91] W. Wang, Z. Gan, H. Xu, R. Zhang, G. Wang, D. Shen, C. Chen, L. Carin, Topic-guided variational autoencoders for text generation, arXiv preprint arXiv:1903.07137 (2019).
[92] F. Grisoni, D. Merk, V. Consonni, J. A. Hiss, S. G. Tagliabue, R. Todeschini, G. Schneider, Scaffold hopping from natural products to synthetic mimetics by holistic molecular similarity, Communications Chemistry 1 (2018) 44.
[93] D. C. Elton, Z. Boukouvalas, M. D. Fuge, P. W. Chung, Deep learning for molecular design - a review of the state of the art, Molecular Systems Design & Engineering (2019).
[94] P. Ertl, R. Lewis, E. Martin, V. Polyakov, In silico generation of novel, drug-like chemical matter using the LSTM neural network, arXiv preprint arXiv:1712.07449 (2017).
[95] A. Gupta, A. T. Müller, B. J. Huisman, J. A. Fuchs, P. Schneider, G. Schneider, Generative recurrent networks for de novo drug design, Molecular Informatics 37 (2018) 1700111.
[96] M. Olivecrona, T. Blaschke, O. Engkvist, H. Chen, Molecular de-novo design through deep reinforcement learning, Journal of Cheminformatics 9 (2017) 48.
[97] M. Popova, O. Isayev, A. Tropsha, Deep reinforcement learning for de novo drug design, Science Advances 4 (2018) eaap7885.
[98] D. Merk, L. Friedrich, F. Grisoni, G. Schneider, De novo design of bioactive small molecules by artificial intelligence, Molecular Informatics 37 (2018) 1700153.
[99] D. Merk, F. Grisoni, L. Friedrich, G. Schneider, Tuning artificial intelligence on the de novo design of natural-product-inspired retinoid X receptor modulators, Communications Chemistry 1 (2018) 68.
[100] J. Arús-Pous, T. Blaschke, S. Ulander, J.-L. Reymond, H. Chen, O. Engkvist, Exploring the GDB-13 chemical space using deep generative models, Journal of Cheminformatics 11 (2019) 20.
[101] L. C. Blum, J.-L. Reymond, 970 million druglike small molecules for virtual screening in the chemical universe database GDB-13, Journal of the American Chemical Society 131 (2009) 8732–8733.
[102] S. R. Bowman, L. Vilnis, O. Vinyals, A. M. Dai, R. Jozefowicz, S. Bengio, Generating sentences from a continuous space, arXiv preprint arXiv:1511.06349 (2015).
[103] M. J. Kusner, B. Paige, J. M. Hernández-Lobato, Grammar variational autoencoder, in: Proceedings of the 34th International Conference on Machine Learning - Volume 70, JMLR.org, 2017, pp. 1945–1954.
[104] H. Dai, Y. Tian, B. Dai, S. Skiena, L. Song, Syntax-directed variational autoencoder for molecule generation, in: Proceedings of the International Conference on Learning Representations, 2018.
[105] T. Blaschke, M. Olivecrona, O. Engkvist, J. Bajorath, H. Chen, Application of generative autoencoder in de novo molecular design, Molecular Informatics 37 (2018) 1700123.
[106] J. Lim, S. Ryu, J. W. Kim, W. Y. Kim, Molecular generative model based on conditional variational autoencoder for de novo molecular design, Journal of Cheminformatics 10 (2018) 31.
[107] S. Kang, K. Cho, Conditional molecular design with deep generative models, Journal of Chemical Information and Modeling 59 (2018) 43–52.
[108] Y. Hong, U. Hwang, J. Yoo, S. Yoon, How generative adversarial networks and their variants work: An overview, ACM Computing Surveys (CSUR) 52 (2019) 10.
[109] G. L. Guimaraes, B. Sanchez-Lengeling, C. Outeiral, P. L. C. Farias, A. Aspuru-Guzik, Objective-reinforced generative adversarial networks (ORGAN) for sequence generation models, arXiv preprint arXiv:1705.10843 (2017).
[110] L. Yu, W. Zhang, J. Wang, Y. Yu, SeqGAN: Sequence generative adversarial nets with policy gradient, in: Thirty-First AAAI Conference on Artificial Intelligence, 2017.
[111] I. Sutskever, O. Vinyals, Q. V. Le, Sequence to sequence learning with neural networks, in: Advances in Neural Information Processing Systems, 2014, pp. 3104–3112.
[112] K. Cho, B. Van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, Y. Bengio, Learning phrase representations using RNN encoder-decoder for statistical machine translation, arXiv preprint arXiv:1406.1078 (2014).
[113] D. Bahdanau, K. Cho, Y. Bengio, Neural machine translation by jointly learning to align and translate, arXiv preprint arXiv:1409.0473 (2014).
[114] M.-T. Luong, H. Pham, C. D. Manning, Effective approaches to attention-based neural machine translation, arXiv preprint arXiv:1508.04025 (2015).
[115] A. Graves, Generating sequences with recurrent neural networks, arXiv preprint arXiv:1308.0850 (2013).
[116] J. Nam, J. Kim, Linking the neural machine translation and the prediction of organic chemistry reactions, arXiv preprint arXiv:1612.09529 (2016).
[117] B. Liu, B. Ramsundar, P. Kawthekar, J. Shi, J. Gomes, Q. Luu Nguyen, S. Ho, J. Sloane, P. Wender, V. Pande, Retrosynthetic reaction prediction using neural sequence-to-sequence models, ACS Central Science 3 (2017) 1103–1113.
[118] P. Schwaller, T. Gaudin, D. Lanyi, C. Bekas, T. Laino, "Found in Translation": predicting outcomes of complex organic chemistry reactions using neural sequence-to-sequence models, Chemical Science 9 (2018) 6091–6098.
[119] W. Jin, C. Coley, R. Barzilay, T. Jaakkola, Predicting organic reaction outcomes with Weisfeiler-Lehman network, in: Advances in Neural Information Processing Systems, 2017, pp. 2607–2616.
[120] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, I. Polosukhin, Attention is all you need (2017) 5998–6008.
[121] C. W. Coley, W. Jin, L. Rogers, T. F. Jamison, T. S. Jaakkola, W. H. Green, R. Barzilay, K. F. Jensen, A graph-convolutional neural network model for the prediction of chemical reactivity, Chemical Science 10 (2019) 370–377.
[122] B. Shin, S. Park, K. Kang, J. C. Ho, Self-attention based molecule representation for predicting drug-target interaction, arXiv preprint arXiv:1908.06760 (2019).
[123] S. Wang, Y. Guo, Y. Wang, H. Sun, J. Huang, SMILES-BERT: Large scale unsupervised pre-training for molecular property prediction, in: Proceedings of the 10th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics, ACM, 2019, pp. 429–436.
[124] D. Polykovskiy, A. Zhebrak, B. Sanchez-Lengeling, S. Golovanov, O. Tatanov, S. Belyaev, R. Kurbanov, A. Artamonov, V. Aladinskiy, M. Veselov, et al., Molecular Sets (MOSES): a benchmarking platform for molecular generation models, arXiv preprint arXiv:1811.12823 (2018).
[125] N. Brown, M. Fiscato, M. H. Segler, A. C. Vaucher, GuacaMol: benchmarking models for de novo molecular design, Journal of Chemical Information and Modeling 59 (2019) 1096–1108.
[126] M. D. Wilkinson, M. Dumontier, I. J. Aalbersberg, G. Appleton, M. Axton, A. Baak, N. Blomberg, J.-W. Boiten, L. B. da Silva Santos, P. E. Bourne, et al., The FAIR guiding principles for scientific data management and stewardship, Scientific Data 3 (2016).
[127] Ž. Avsec, R. Kreuzhuber, J. Israeli, N. Xu, J. Cheng, A. Shrikumar, A. Banerjee, D. S. Kim, T. Beier, L. Urban, et al., The Kipoi repository accelerates community exchange and reuse of predictive models for genomics, Nature Biotechnology (2019) 1.
[128] A. E. Cleves, A. N. Jain, Effects of inductive bias on computational evaluations of ligand-based modeling and on drug discovery, Journal of Computer-Aided Molecular Design 22 (2008) 147–159.
[129] R. E. Pogue, D. P. Cavalcanti, S. Shanker, R. V. Andrade, L. R. Aguiar, J. L. de Carvalho, F. F. Costa, Rare genetic diseases: update on diagnosis, treatment and online resources, Drug Discovery Today 23 (2018) 187–195.
[130] J. Sieg, F. Flachsenberg, M. Rarey, In need of bias control: Evaluating chemical data for machine learning in structure-based virtual screening, Journal of Chemical Information and Modeling 59 (2019) 947–961.
[131] Y. Zhang, A. A. Lee, Bayesian semi-supervised learning for uncertainty-calibrated prediction of molecular properties and active learning, arXiv preprint arXiv:1902.00925 (2019).
[132] A. Holzinger, C. Biemann, C. S. Pattichis, D. B. Kell, What do we need to build explainable AI systems for the medical domain?, arXiv preprint arXiv:1712.09923 (2017).
[133] K. Y. Gao, A. Fokoue, H. Luo, A. Iyengar, S. Dey, P. Zhang, Interpretable drug target prediction using deep neural representation, in: IJCAI, 2018, pp. 3371–3377.
[134] J. Bradshaw, B. Paige, M. J. Kusner, M. H. S. Segler, J. M. Hernández-Lobato, A model to search for synthesizable molecules, CoRR abs/1906.05221 (2019).
[135] M. E. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, L. Zettlemoyer, Deep contextualized word representations, arXiv preprint arXiv:1802.05365 (2018).
[136] A. Radford, K. Narasimhan, T. Salimans, I. Sutskever, Improving language understanding by generative pre-training, URL: https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf (2018).
[137] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, arXiv preprint arXiv:1810.04805 (2018).
[138] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, V. Stoyanov, RoBERTa: A robustly optimized BERT pretraining approach, arXiv preprint arXiv:1907.11692 (2019).
[139] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever, Language models are unsupervised multitask learners, OpenAI Blog 1 (2019).
[140] Z. Dai, Z. Yang, Y. Yang, W. W. Cohen, J. Carbonell, Q. V. Le, R. Salakhutdinov, Transformer-XL: Attentive language models beyond a fixed-length context, arXiv preprint arXiv:1901.02860 (2019).
[141] Z. Yang, Z. Dai, Y. Yang, J. Carbonell, R. Salakhutdinov, Q. V. Le, XLNet: Generalized autoregressive pretraining for language understanding, arXiv preprint arXiv:1906.08237 (2019).
[142] J. Hanson, K. K. Paliwal, T. Litfin, Y. Yang, Y. Zhou, Getting to know your neighbor: protein structure prediction comes of age with contextual machine learning, Journal of Computational Biology (2019).
[143] X.-J. Zhu, C.-Q. Feng, H.-Y. Lai, W. Chen, L. Hao, Predicting protein structural classes for low-similarity sequences by evaluating different features, Knowledge-Based Systems 163 (2019) 787–793.
[144] S. Wang, J. Peng, J. Ma, J. Xu, Protein secondary structure prediction using deep convolutional neural fields, Scientific Reports 6 (2016) 18962.
[145] Q. Shi, W. Chen, S. Huang, F. Jin, Y. Dong, Y. Wang, Z. Xue, DNN-Dom: predicting protein domain boundary from sequence alone by deep neural network, Bioinformatics (2019).
[146] R. Evans, J. Jumper, J. Kirkpatrick, L. Sifre, T. Green, C. Qin, A. Zidek, A. Nelson, A. Bridgland, H. Penedones, et al., De novo structure prediction with deep-learning based scoring, Annu Rev Biochem 77 (2018) 6.
[147] S. Rothe, S. Narayan, A. Severyn, Leveraging pre-trained checkpoints for sequence generation tasks, arXiv.org (2019).
[148] R. Koncel-Kedziorski, D. Bekal, Y. Luan, M. Lapata, H. Hajishirzi, Text generation from knowledge graphs with graph transformers, arXiv preprint arXiv:1904.02342 (2019).
[149] S. Ruder, Neural Transfer Learning for Natural Language Processing, Ph.D. thesis, National University of Ireland, Galway, 2019.
[150] X. Yang, J. Zhang, K. Yoshizoe, K. Terayama, K. Tsuda, ChemTS: an efficient python library for de novo molecular generation, Science and Technology of Advanced Materials 18 (2017) 972–976.
[151] O. Prykhodko, S. Johansson, P.-C. Kotsias, E. J. Bjerrum, O. Engkvist, H. Chen, A de novo molecular generation method using latent vector based generative adversarial network (2019).
[152] Y. Bengio, et al., Learning deep architectures for AI, Foundations and Trends in Machine Learning 2 (2009) 1–127.
[153] Y. LeCun, L. Bottou, Y. Bengio, P. Haffner, et al., Gradient-based learning applied to document recognition, Proceedings of the IEEE 86 (1998) 2278–2324.
[154] I. Goodfellow, Y. Bengio, A. Courville, Deep Learning, MIT Press, 2016.
[155] S. Hochreiter, J. Schmidhuber, Long short-term memory, Neural Computation 9 (1997) 1735–1780.
[156] D. P. Kingma, M. Welling, Auto-encoding variational Bayes, arXiv preprint arXiv:1312.6114 (2013).
[157] R. S. Sutton, A. G. Barto, Reinforcement Learning: An Introduction, MIT Press, 2018.
[158] S. J. Pan, Q. Yang, A survey on transfer learning, IEEE Transactions on Knowledge and Data Engineering 22 (2009) 1345–1359.
[159] R. J. Williams, D. Zipser, A learning algorithm for continually running fully recurrent neural networks, Neural Computation 1 (1989) 270–280.
[160] H. M. Berman, J. Westbrook, Z. Feng, G. Gilliland, T. N. Bhat, H. Weissig, I. N. Shindyalov, P. E. Bourne, The Protein Data Bank, Nucleic Acids Research 28 (2000) 235–242.
[161] A. Bateman, L. Coin, R. Durbin, R. D. Finn, V. Hollich, S. Griffiths-Jones, A. Khanna, M. Marshall, S. Moxon, E. L. Sonnhammer, et al., The Pfam protein families database, Nucleic Acids Research 32 (2004) D138–D141.
[162] T. Liu, Y. Lin, X. Wen, R. N. Jorissen, M. K. Gilson, BindingDB: a web-accessible database of experimentally determined protein-ligand binding affinities, Nucleic Acids Research 35 (2006) D198–D201.
[163] J. J. Irwin, B. K. Shoichet, ZINC - a free database of commercially available compounds for virtual screening, Journal of Chemical Information and Modeling 45 (2005) 177–182.

Figure 1: Illustration of the Skip-Gram architecture of the Word2Vec algorithm. For a vocabulary of size V, each word in the vocabulary is described as a one-hot encoded vector (a binary vector in which only the corresponding word position is set to 1). The Skip-Gram architecture is a simple one hidden-layer neural network that aims to predict the context (neighbor) words of a given target word. The extent of the context is determined by the window size parameter. In this example, the window size is equal to 1, indicating that the system will predict two context words (the word on the left and the word on the right of the target word) based on their probability scores. The number of nodes in the hidden layer (N) controls the size of the embedding vector. The V×N weight matrix stores the trained embedding vectors.

Figure 2: The workflow for building a SMILES-based molecule representation. In the first box, the SMILES text of ampicillin is used to extract words; in this case, the words are overlapping 4-mers and there are 42 unique words in total. To represent multiple compounds, words are extracted from each compound, thus building a vocabulary of size V. In the second box, two popular word representations are illustrated: (left) the one-hot encoded representation and (right) the distributed representation. With the one-hot encoding, we build a binary vector of size V, in which the position of the corresponding word is set to 1 while the rest remains 0. In the distributed representation, the dimension of the word representation (embedding) is D, which is usually smaller than V (50 < D < ...).
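The pipeline sketched in Figures 1 and 2 can be condensed into a few lines of code. The following is a minimal sketch, assuming gensim (version 4 or later) is available; the smiles_to_words helper, the two-molecule corpus, and all hyperparameter values are our own illustrative choices rather than a prescribed setup.

```python
# Minimal sketch of the Figure 1/Figure 2 pipeline: split SMILES strings into
# overlapping k-mer "words", then train Skip-Gram Word2Vec embeddings on them.
# Assumes gensim >= 4.0; the corpus below is an arbitrary two-molecule example.
from gensim.models import Word2Vec

def smiles_to_words(smiles, k=4):
    """Return the overlapping k-mers of a SMILES string (its 'words')."""
    return [smiles[i:i + k] for i in range(len(smiles) - k + 1)]

corpus = [
    "CC1(C(N2C(S1)C(C2=O)NC(=O)C(C3=CC=CC=C3)N)C(=O)O)C",  # ampicillin
    "CC(=O)OC1=CC=CC=C1C(=O)O",                             # aspirin
]
sentences = [smiles_to_words(s, k=4) for s in corpus]

# sg=1 selects the Skip-Gram architecture of Figure 1; window corresponds to
# the context size and vector_size to the embedding dimension D.
model = Word2Vec(sentences, vector_size=64, window=1, min_count=1, sg=1)
print(model.wv["C(=O"])  # 64-dimensional embedding of one 4-mer "word"
```

Setting sg=0 instead would train the CBOW variant, which predicts the target word from its context rather than the other way around.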
Figure 3: Sequence-2-sequence models take a sequence of tokens as input and generate a sequence of tokens as output. The example in this figure is chemical reaction prediction: given a set of precursors, the most likely products are predicted. The input tokens correspond to the tokenized SMILES of the precursors and the generated tokens to the SMILES of the product. In the original sequence-2-sequence models, the encoder encoded the input sequence into a fixed size context vector, as shown in (a). The decoder had access only to this fixed size vector, which limited its application for long input sequences. To overcome this drawback, the attention mechanism was introduced, as shown in (b). In a sequence-2-sequence model with attention, the encoder encodes every token independently into a memory bank; the longer the input sequence, the larger the memory bank. The decoder then queries the memory bank at every decoding step and selectively attends to the most relevant value vectors to predict the next token.
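Before a SMILES string can enter such a model, it must be tokenized. A common choice is the regular expression published with the Molecular Transformer [29], reproduced below to the best of our knowledge; treat the exact pattern as an assumption, with the round-trip assertion acting as a safeguard.

```python
# Sketch of SMILES tokenization for sequence models, using the regular
# expression published with the Molecular Transformer [29]: bracketed atoms,
# two-character elements (Cl, Br) and ring-closure digits become single tokens.
import re

SMILES_REGEX = re.compile(
    r"(\[[^\]]+]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\(|\)|\."
    r"|=|#|-|\+|\\|\/|:|~|@|\?|>|\*|\$|%[0-9]{2}|[0-9])"
)

def tokenize(smiles):
    tokens = SMILES_REGEX.findall(smiles)
    # Safeguard: the tokens must reassemble into the original string.
    assert "".join(tokens) == smiles, "tokenization lost characters"
    return tokens

print(tokenize("CC1(C(N2C(S1)C(C2=O)NC(=O)C(C3=CC=CC=C3)N)C(=O)O)C"))
# ['C', 'C', '1', '(', 'C', '(', 'N', '2', ...]
```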
Table 1: NLP concepts and their applications in drug discovery

Concept: Token/word
Definition: A series of characters (i.e. a word, number, or symbol) that constitutes the smallest unit of a language. The identification of tokens (i.e. tokenization) is an important pre-processing step in many NLP tasks, e.g. identifying the substructures of a molecule.
Methodologies and applications:
  k-mers: protein family classification [45, 46]; protein function prediction [48, 49]; protein language analysis [47]; molecular similarity [43, 44]
  patterns: drug-target interaction prediction [62, 63]; protein language analysis [58, 59, 60, 61]; molecule fragmentation [37]; reaction prediction [66]; ligand design [65]
  MCS: chemical language analysis [51, 52]
  BPE: protein-protein interaction prediction [57]
  MDL: protein family classification [54]

Concept: Sentence
Definition: A text containing one or more tokens/words, e.g. the textual representations of chemicals and proteins.
Methodologies and applications:
  SMILES [19]: molecular property prediction [89]; binding affinity prediction [86, 88]; reaction prediction [116, 117, 118, 29]; data augmentation [27]; and more
  DeepSMILES [30]: binding affinity prediction [31]
  SELFIES [33]: -
  protein sequence: toxicity prediction [81]; protein family classification [45, 54]; protein function prediction [48, 49]; protein language analysis [47]; and more

Concept: Word/sentence representation
Definition: The aim to describe a text in a way that reflects its syntactic and semantic features, e.g. a vector representation of a SMILES string based on the occurrences of each symbol.
Methodologies and applications:
  bag-of-words: molecular similarity [43, 44] (a minimal sketch follows this table)
  distributed representations: binding affinity prediction [86]; chemical property prediction [81, 84]; toxicity prediction [88, 81, 84, 85]; drug-drug interaction prediction [76]; protein family classification [45, 46]; protein-protein interaction prediction [57]

Concept: Machine translation
Definition: The task of converting a sequence of meaningful symbols in one language into a meaningful sequence in another language, e.g. translating the SMILES of a molecule to its InChI.
Methodologies and applications:
  RNN-based seq2seq: protein function prediction [48]; chemical representation [36]; reaction prediction [116, 118]; retrosynthesis [117]
  Transformer: reaction prediction [29]; drug-target interaction prediction [122]

Concept: Language generation
Definition: The aim to generate a sequence of meaningful symbols in the given language that is close to real, e.g. generating the SMILES of a novel lead compound.
Methodologies and applications:
  RNN-types: molecule generation [75, 94, 95, 96, 150]
  VAE-types: molecule generation [35, 103, 104]
  GAN: molecule generation [109, 151]
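As a concrete illustration of the bag-of-words entry in Table 1, the sketch below counts overlapping 4-mer "words" in SMILES strings with scikit-learn; the two molecules and the choice of k = 4 are arbitrary examples, not a recommended configuration.

```python
# Minimal bag-of-words sketch: each SMILES string becomes a vector of 4-mer
# occurrence counts. Assumes scikit-learn is installed.
from sklearn.feature_extraction.text import CountVectorizer

smiles = [
    "CC(=O)OC1=CC=CC=C1C(=O)O",  # aspirin
    "CC(=O)NC1=CC=C(C=C1)O",     # paracetamol
]

# analyzer="char" with ngram_range=(4, 4) extracts overlapping 4-mer "words";
# lowercase=False preserves the case-sensitive SMILES alphabet.
vectorizer = CountVectorizer(analyzer="char", ngram_range=(4, 4), lowercase=False)
X = vectorizer.fit_transform(smiles)   # sparse matrix of shape (2, V)
print(len(vectorizer.vocabulary_))     # vocabulary size V
print(X.toarray()[0])                  # bag-of-words vector of the first SMILES
```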
The following glossary summarizes the models and concepts referred to throughout the review.

Model / Description

Deep Neural Network (DNN) [152]: An artificial neural network (ANN) with a large number of hidden layers and neurons.
Word2Vec [73]: An ANN-based word embedding architecture that captures the semantic information of words based on the context in which they appear.
Convolutional Neural Network (CNN) [153]: A type of ANN that utilizes convolutions in its layers.
Recurrent Neural Network (RNN) [154]: A type of ANN that has a feedback loop connected to previous time samples.
Long Short-Term Memory (LSTM) [155]: A type of RNN that captures long distance dependencies and comprises update, forget, and output gates.
Gated Recurrent Unit (GRU): A type of RNN that captures long distance dependencies and comprises an update gate.
Auto-encoder (AE) [154]: A neural network based architecture comprising an encoder that maps the input into a narrow space and a decoder that reconstructs the compressed representation.
Variational Auto-encoder (VAE) [156]: A type of AE that generates outputs based on a specific distribution.
Generative Adversarial Network (GAN) [109]: A generative model with generator and discriminator networks.
Sequence-to-sequence (seq2seq): An encoder-decoder based architecture that maps an input sequence into an output sequence.
Attention mechanism [113]: Enables the model to choose among the important parts of a sequence that are relevant to the output.
Transformer [120]: An encoder-decoder architecture that employs self-attention and ANNs in the encoder and decoder parts.
Neural Machine Translation (NMT) [113]: A seq2seq translation architecture.
Reinforcement Learning (RL) [157]: A machine learning paradigm in which an agent performs a series of decisions in order to maximize its rewards.
Transfer Learning [158]: A methodology to learn a model on one task (or on a large dataset) and then adjust (i.e. fine-tune) the learned model on a different task (or on a smaller dataset), with the final goal of generalization.
Teacher Forcing [159]: A technique used in training RNNs such that the ground-truth word is given to the decoder as input instead of the word predicted in the previous step (see the sketch after this table).
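The teacher forcing entry is easiest to see in code. The following schematic PyTorch loop is a minimal sketch rather than a production decoder; the vocabulary size, dimensions, and target sequence are illustrative placeholders.

```python
# Schematic sketch of teacher forcing for an RNN decoder: at each step the
# ground-truth token (not the model's previous prediction) is fed back in.
import torch
import torch.nn as nn

vocab_size, emb_dim, hidden_dim = 40, 32, 64   # placeholder sizes
embed = nn.Embedding(vocab_size, emb_dim)
gru = nn.GRUCell(emb_dim, hidden_dim)
out = nn.Linear(hidden_dim, vocab_size)

target = torch.tensor([5, 17, 3, 8])           # ground-truth token ids (example)
h = torch.zeros(1, hidden_dim)                 # initial decoder state
token = torch.tensor([0])                      # start-of-sequence token id
loss_fn = nn.CrossEntropyLoss()
loss = 0.0
for t in range(len(target)):
    h = gru(embed(token), h)                   # one decoding step
    logits = out(h)                            # scores over the vocabulary
    loss = loss + loss_fn(logits, target[t:t + 1])
    token = target[t:t + 1]                    # teacher forcing: feed the truth
print(loss / len(target))
```

At inference time the last line of the loop would instead feed back the model's own prediction, e.g. logits.argmax(dim=-1).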
Identifiers and text based representations of ampicillin:

IUPAC name: (2S,5R,6R)-6-[[(2R)-2-amino-2-phenylacetyl]amino]-3,3-dimethyl-7-oxo-4-thia-1-azabicyclo[3.2.0]heptane-2-carboxylic acid
Chemical formula: C16H19N3O4S
Canonical SMILES: CC1(C(N2C(S1)C(C2=O)NC(=O)C(C3=CC=CC=C3)N)C(=O)O)C
Isomeric SMILES: CC1([C@@H](N2[C@H](S1)[C@@H](C2=O)NC(=O)[C@@H](C3=CC=CC=C3)N)C(=O)O)C
DeepSMILES (canonical): CCCNCS5)CC4=O))NC=O)CC=CC=CC=C6))))))N)))))))C=O)O)))C
SELFIES (canonical): [C][C][Branch2 3][Ring1][epsilon][C][Branch2 3][epsilon][=O][N][C][Branch1 3][Ring2][S][Ring1][Ring2][C][Branch1 3][Branch1 1][C][Ring1][Ring2][=O][N][C][Branch1 3][epsilon][=O][C][Branch1 3][Branch2 2][C][=C][C][=C][C][=C][Ring1][Branch1 1][N][C][Branch1 3][epsilon][=O][O][C]
InChI: InChI=1S/C16H19N3O4S/c1-16(2)11(15(22)23)19-13(21)10(14(19)24-16)18-12(20)9(17)8-6-4-3-5-7-8/h3-7,9-11,14H,17H2,1-2H3,(H,18,20)(H,22,23)/t9-,10-,11+,14-/m1/s1
InChIKey: AVKUERGKIZMTKX-NJBDSQKTSA-N
2D and 3D structures: generated using MolView (molview.org)
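Most of these identifiers can be regenerated programmatically. The sketch below is a minimal example assuming the rdkit, deepsmiles and selfies packages are installed; note that current versions of the selfies library print tokens in a newer notation than the 2019 style shown above, and RDKit's canonical atom ordering may differ from the PubChem strings in the table.

```python
# Sketch regenerating ampicillin's text representations with open-source tools.
# Assumes: pip install rdkit deepsmiles selfies
from rdkit import Chem
import deepsmiles
import selfies as sf

isomeric = ("CC1([C@@H](N2[C@H](S1)[C@@H](C2=O)NC(=O)"
            "[C@@H](C3=CC=CC=C3)N)C(=O)O)C")
mol = Chem.MolFromSmiles(isomeric)

print(Chem.MolToSmiles(mol, isomericSmiles=False))  # canonical SMILES, no stereo
print(Chem.MolToSmiles(mol))                        # isomeric SMILES (default)
print(Chem.MolToInchi(mol))                         # InChI
print(Chem.MolToInchiKey(mol))                      # InChIKey

converter = deepsmiles.Converter(rings=True, branches=True)
print(converter.encode(Chem.MolToSmiles(mol, isomericSmiles=False)))  # DeepSMILES

print(sf.encoder(isomeric))                         # SELFIES string
```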