Autoencoding Undirected Molecular Graphs With Neural Networks
Jeppe Johan Waarkjær Olsen, Peter Ebert Christensen, Martin Hangaard Hansen, Alexander Rosenberg Johansen
Department of Computing, Technical University of Denmark
E-mail: [email protected]
Abstract
Discrete structure rules for validating molecular structures are usually limited to fulfilment of the octet rule or similar simple deterministic heuristics. We propose a model, inspired by language modeling from natural language processing, with the ability to learn from a collection of undirected molecular graphs, enabling fitting of any underlying structure rule present in the collection. We introduce an adaption of the popular Transformer model, which can learn relationships between atoms and bonds. To our knowledge, the Transformer adaption is the first model that is trained to solve the unsupervised task of recovering partially observed molecules. In this work, we assess how different degrees of information impact performance with respect to fitting the QM9 dataset, which conforms to the octet rule, and fitting the ZINC dataset, which contains hypervalent molecules and ions requiring the model to learn a more complex structure rule. More specifically, we test a full discrete graph with bond order information, a full discrete graph with only connectivity, a bag-of-neighbors, a bag-of-atoms, and count-based unigram statistics. These results provide encouraging evidence that neural networks, even when only connectivity is available, can learn arbitrary molecular structure rules specific to a dataset, as the Transformer adaption surpasses a strong octet rule baseline on the ZINC dataset.

Introduction

In drug discovery, catalysis, and combustion, the number of possible relevant molecules grows exponentially with the size of the molecule or reaction network. Modeling and exploring large datasets of molecules benefits from fast coarse-grained methods to generate, filter, consistency check, validate, and correct molecules. Databases of molecular properties calculated by ab initio methods rely on consistency between 3D structures and molecular graphs. This consistency should be highly reliable and avoid any erroneous identifications after structure relaxation. The reliability requirement benefits from several redundant methods, which can flag any possible inconsistencies. The tasks of validating, correcting, completing, and generating molecules in discrete representations usually rely on a simple heuristic such as the octet rule as the fundamental structure rule to determine the validity of molecules.
The octet rule is, however, not satisfactory for validating all synthesizable molecules due to the occurrence of hypervalent molecules, ions, and non-integer bond orders such as in aromatic bonds. Machine learning methods working on discrete molecular graphs can act as structure rules learned from molecular datasets. At the same time, they can fit a great complexity of underlying trends while being low cost compared to 3D representations and quantum chemical calculations. Machine learning for predicting properties based on discrete representations of molecules has only recently begun to use undirected graphs as opposed to directed linear graphs or sequences such as SMILES.
Moreover, predictions should be invariant under permutation, translation, and rotation of the molecular representation, which calls for undirected graphs. We introduce an unsupervised task, known as masked language modeling or denoising autoencoding, over an undirected discrete graph representation of a given molecule. We define the unsupervised task as corrupting a molecule and learning how to revert such corruption to recover the valid molecule. This objective allows us to learn the underlying structure rule without any hard-coded heuristic by merely observing valid molecular graphs. This can correct molecular graphs directly or cross-validate molecular graphs generated from 3D structures to check for consistency. In addition, the binary-transformer encodings developed for this model are generally of interest for other tasks, including generation in the context of drug discovery. This paper presents several models trained on two datasets: QM9, which we use as a benchmark to verify that the models can learn a simple known heuristic defining the dataset – namely the octet rule – and ZINC, as a more challenging dataset due to ions and hypervalent molecules.
The models are evaluated on several metrics including perplexity, sample F1, and a new octet F1, which measures whether the predictions satisfy the octet rule. In Natural Language Processing (NLP) a goal of statistical and probabilistic language modeling is to learn the joint probability mass function of sequences. Historically, this has been accomplished by calculating the probability of observing a word given the sentence that precedes it. Methods exploiting the sequential relationship between words in text have ranged from probabilistic finite automata, to distributed word embeddings, and recurrent neural networks (RNNs). To cover the most recent development in language modeling, adapted to fit undirected graphs with a degree above two, we test the following methods of increasing complexity:
• unigram — unconditional probabilities of the atoms
• bag-of-atoms/neighbors — neural network that aggregates either all atoms or only neighboring atoms in the molecule
• binary/bond-transformer — neural network architecture with attention using either binary representations of connectivity or full bond type information
The binary/bond-transformer are inspired by a recent trend in NLP, known as masked language modeling, where the sequential requirement can be relaxed. Most noticeably, we modify masked language modeling to work with molecules by masking atoms and using graph adjacency matrices to model intramolecular relationships.
Methods
In this section we present several methods to restore partially observed molecules. We formally define this as an unsupervised learning task over discrete molecular graphs. To train the models we apply a simple corruption function that masks atoms and challenge the models to recover the corruption. The formal task definition and the five unsupervised models of increasing complexity are defined below; as a baseline we apply the deterministic octet rule.
Unsupervised learning of discrete molecular graphs
Autoencoders are a type of neural network trained with unsupervised learning. They create an efficient representation of data by extracting important features. The denoising autoencoder is an autoencoder variant that challenges the neural network to revert corruptions of the input. By reverting corruptions, the neural network has to understand the underlying structure of the data distribution. In our case the input is a molecule, which we represent as an undirected graph with discrete edges G = (V, E). Here V is a set of vertices (atoms), such that (a, i) ∈ V where a ∈ A is the element and i ∈ N is the index. E is the set of undirected bonds between atoms in the molecule, such that E ⊆ {(x, y, b) | x, y ∈ V, (x, y) = (y, x), x ≠ y}, where b ∈ {1, 2, 3} is the bond type: single, double, or triple. We denote the corruption function of the denoising autoencoder as κ : V → Ṽ. For the experiments we use a corruption function that masks atoms in a molecule with bond types intact. This method of corruption is inspired by the masked language model presented in BERT. To apply the corruption function we replace a set of vertices with a mask token.
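To make the corruption step concrete, the following is a minimal sketch of masking atoms in a molecular graph; the list-based molecule container, the MASK label, and the random sampling of indices are illustrative assumptions rather than the authors' implementation.

```python
import random

MASK = "<mask>"  # placeholder label for a corrupted atom (illustrative choice)

def corrupt(atoms, n_corrupt=1):
    """Mask n_corrupt atom labels while leaving all bonds intact.

    atoms: list of element symbols, e.g. ["C", "H", "H", "H", "O", "H"].
    Returns the corrupted atom list and the masked indices, which serve
    as the prediction targets V_subset.
    """
    masked_idx = set(random.sample(range(len(atoms)), n_corrupt))
    corrupted = [MASK if i in masked_idx else a for i, a in enumerate(atoms)]
    return corrupted, masked_idx

# Example: methanol with one masked atom
corrupted, targets = corrupt(["C", "H", "H", "H", "O", "H"], n_corrupt=1)
```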
Unigram

A counting-based model obtains the distribution of atom types by calculating their frequencies over a dataset. Counting-based models will by intuition have high accuracy when the dataset is biased, which we find the QM9 and ZINC datasets to be (see Table 1).
The count-based model is motivated by the probability chain rule, with which we can model the joint probability of the atoms v_i = (a, i) ∈ V in a molecule:

P(v_1, v_2, ..., v_n) = P(v_1) P(v_2 | v_1) ... P(v_n | v_1, ..., v_{n-1})   (4)

While equation 4 allows us to exactly estimate the conditional atom distribution, the condition grows exponentially with the number of vertices and becomes infeasible due to the exponential requirement of data and compute. In NLP the directionality of the sentence allows for clipped, n-gram, versions of equation 4, where the prediction of the word distribution is only conditioned on the last k tokens: P(v_n | v_1, ..., v_{n-1}) = P(v_n | v_{n-k}, ..., v_{n-1}). Using n-grams significantly reduces the required computation and data while exploiting the locality of language. In molecules, the degree of vertices and lack of directionality make such n-gram models cumbersome, as each atom can have a tree of recursive n-grams. For this reason, we limit ourselves to unigram models (1-grams) for the counting case. A unigram model splits the probability of different terms in a context into a product of individual terms, disregarding the condition of equation 4:

P_unigram(v_1, v_2, ..., v_n) = P(v_1) P(v_2) ... P(v_n)   (5)

P(a_j) = count(a_j) / Σ_a count(a)   (6)

The unigram model has the benefit of being relatively simple to implement and interpret, as it merely counts the occurrence of elements in the training set. The unigram distributions of the QM9 and ZINC training sets are shown in Table 1.
Table 1: Unigram probabilities for the QM9 and ZINC training sets. The unigram probabilities correspond to the distribution of elements in the datasets.

Element   P (QM9)   P (ZINC)
H
C
O
N
F
P
S
Cl
Br
I

Both training sets are heavily biased towards H and C. Note that the unigram model will always predict with the same probability distribution of elements for any atom, as it does not use context. Using our objective from equation 3 we calculate the unigram probability of the corrupted molecule as

max P(V_subset | G̃) = max ∏_{ṽ ∈ V_subset} P(ṽ)   (7)
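As a concrete illustration of equations 5–7, a count-based unigram model reduces to frequency counting; the function names below are illustrative, not taken from the paper's code.

```python
from collections import Counter

def fit_unigram(molecules):
    """Estimate P(a) from element frequencies over a training set, where each
    molecule is a list of element symbols (eq. 6)."""
    counts = Counter(a for atoms in molecules for a in atoms)
    total = sum(counts.values())
    return {a: c / total for a, c in counts.items()}

def unigram_probability(masked_elements, p):
    """Probability assigned to the masked atoms (eq. 7): a product of
    unigram probabilities, since the model ignores all context."""
    prob = 1.0
    for a in masked_elements:
        prob *= p[a]
    return prob

p = fit_unigram([["C", "H", "H", "H", "O", "H"], ["O", "H", "H"]])
print(unigram_probability(["H"], p))
```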
Bag of vectors: neighbors and atoms

In a bag-of-vectors model a molecule is represented as a multiset of its tokens (elements and/or bonds), disregarding structure but keeping multiplicity (i.e. multiple occurrences of the same token). Each token, x, is embedded as a trainable vector of real numbers x ∈ R^d. By summing the n tokens of a molecule over the d features we obtain the bag-of-vectors representation R^{n×d} → R^d (sum is used instead of mean to keep multiplicity). The bag-of-vectors representation is used as input to a neural network that learns to predict the masked tokens V_subset. The token vectors, also known as embeddings, and the neural network are jointly optimised with stochastic gradient descent. Using eq. 3 we define two bag-of-vector models for our study: a bag of neighboring atoms (eq. 8) and a bag of all atoms in the corrupted molecule (eq. 10).

max_θ P_θ^{bag-of-neighbors}(V_subset | G̃) = max_θ ∏_{v ∈ V_subset} P_θ(v | V_neighbors)   (8)

V_neighbors = {v_j | (v_j, v) ∈ E}   (9)

max_θ P_θ^{bag-of-atoms}(V_subset | G̃) = max_θ ∏_{v ∈ V_subset} P_θ(v | Ṽ)   (10)

P_θ(x_j | X̃) = softmax(W h_θ(X̃))_j = exp((W h_θ(X̃))_j) / Σ_{i=0}^{|Σ|-1} exp((W h_θ(X̃))_i)   (11)

h_θ(X̃) = NN(z_θ(X̃))   (12)

z_θ(X̃) = Σ_{x̃ ∈ X̃} embedding(x̃)   (13)

To represent our corrupted tokens in equation 13, X̃ being the elements of either Ṽ or V_neighbors, we use an embedding function. Embedding functions, embedding(x) ∈ R^{d_emb}, are a popular way to represent input tokens in NLP. The embedding function uses a dense vector representation for each token class, which allows the embedding function to learn relations between token classes.
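The following PyTorch sketch illustrates the bag-of-atoms variant of equations 10–13; the layer sizes, the two-layer network, and the integer vocabulary are assumptions for illustration, not the exact configuration used in the paper. The bag-of-neighbors model is obtained by restricting the summed tokens to the masked atom's neighbors (eq. 9).

```python
import torch
import torch.nn as nn

class BagOfAtoms(nn.Module):
    """Sum token embeddings (eq. 13), pass the bag through a feed-forward
    network (eq. 12), and score the possible elements with a softmax (eq. 11)."""

    def __init__(self, vocab_size, d_emb=64, d_hidden=64):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, d_emb)
        self.net = nn.Sequential(
            nn.Linear(d_emb, d_hidden),
            nn.ReLU(),
            nn.Linear(d_hidden, vocab_size),
        )

    def forward(self, token_ids):
        # token_ids: (batch, n_atoms) integer-encoded, partially masked atoms
        z = self.embedding(token_ids).sum(dim=1)  # bag-of-vectors, keeps multiplicity
        return self.net(z)                        # logits over element classes

model = BagOfAtoms(vocab_size=11)  # e.g. 10 elements plus a mask token (illustrative)
logits = model(torch.tensor([[0, 1, 1, 1, 2, 10]]))  # one corrupted molecule
probs = torch.softmax(logits, dim=-1)
```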
Transformer

As with the original transformer, we use the key-value lookup attention function. This layer can adaptively align information between atoms conditioned on the context of other atoms.
Our implementation takes a layer of hidden representations, h^l, and an adjacency matrix of edges, E, as input. Notice that we have separate trainable bond-embedding functions for the key, e^K, and value, e^V, edge representations.

Attention(h, E)_i = Σ_{j=1}^{n} α_ij (h_j W^V + e^V_ij)   (20)

α_ij = exp(φ_ij) / Σ_{k=1}^{n} exp(φ_ik)   (21)

φ_ij = (h_i W^Q)(h_j W^K + e^K_ij)^T / √(d_transform)   (22)

e = bond-embedding(E)   (23)

where α_ij ∈ [0, 1] are the attention weights; n is the number of vertices; and W^Q, W^V, W^K ∈ R^{d_transform × d_transform} are trainable weights. The bond-embedding: E^{|V|×|V|} → R^{|V|×|V|×d_transform} takes an adjacency matrix and returns a three-dimensional tensor with a distributed representation for each edge. Notice that compared to most graph-based models we use information from all the nodes and edges in the graph to calculate the attention weights at each layer. In our experiments, this improved the performance (see Figure S.3). To have a more expressive attention function we use the multi-head attention mechanism, concatenating k attention layers. The k attention layers are projected to the hidden size of the network, R^{d_transform · k} → R^{d_transform}, such that

Multi-Head-Attention(h, E)_i = [C_1, C_2, ..., C_k] W^multi   (24)

where C_i corresponds to an instance of Attention (eq. 20) and W^multi ∈ R^{(d_transform · k) × d_transform} is a trainable weight. This is further illustrated in Figure 1.

Figure 1: (left) Scaled Dot-Product Attention. (right) Multi-Head Attention, consisting of several attention layers running in parallel. Figure modified from Vaswani et al.
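As a minimal single-head sketch of the edge-conditioned attention in equations 20–23, the code below shifts keys and values by learned bond embeddings; the dimensions, the plain embedding lookup for bond types, and the absence of batching or masking are simplifying assumptions.

```python
import math
import torch
import torch.nn as nn

class EdgeAttention(nn.Module):
    """One attention head with key and value bond embeddings (eqs. 20-23)."""

    def __init__(self, d=64, n_bond_types=4):  # 0 = no bond, 1/2/3 = bond order
        super().__init__()
        self.wq = nn.Linear(d, d, bias=False)
        self.wk = nn.Linear(d, d, bias=False)
        self.wv = nn.Linear(d, d, bias=False)
        self.e_k = nn.Embedding(n_bond_types, d)  # e^K: key bond embedding
        self.e_v = nn.Embedding(n_bond_types, d)  # e^V: value bond embedding
        self.scale = math.sqrt(d)

    def forward(self, h, adj):
        # h: (n, d) atom representations, adj: (n, n) integer bond-order matrix
        q, k, v = self.wq(h), self.wk(h), self.wv(h)
        ek, ev = self.e_k(adj), self.e_v(adj)  # (n, n, d) edge representations
        # phi_ij = q_i . (k_j + e^K_ij) / sqrt(d)        (eq. 22)
        phi = torch.einsum("id,ijd->ij", q, k.unsqueeze(0) + ek) / self.scale
        alpha = torch.softmax(phi, dim=-1)              # eq. 21
        # out_i = sum_j alpha_ij (v_j + e^V_ij)          (eq. 20)
        return torch.einsum("ij,ijd->id", alpha, v.unsqueeze(0) + ev)

attention = EdgeAttention()
out = attention(torch.randn(5, 64), torch.zeros(5, 5, dtype=torch.long))
```

Multi-head attention (eq. 24) would concatenate k such heads and project the result back to d_transform.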
Experimental setup

In our experiments we test the described models – unigram, bag-of-neighbors, bag-of-atoms, binary-transformer, and bond-transformer – as denoising autoencoders on the QM9 and ZINC datasets.
Pre-processing
The QM9 dataset has 134 000 organic molecules with five types of atoms: A = {H, C, N, O, F}. Similarly, the ZINC dataset has 250 000 drug-like molecules with 10 types of atoms, A = {H, C, N, O, F, P, S, Cl, Br, I}. The molecules are represented as SMILES strings corresponding to their discrete graph representations. We kekulize the molecules – thus resulting in the dataset only containing single, double and triple bond types – and obtain an adjacency matrix for each molecule from the SMILES string using RDKit.
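A sketch of this pre-processing step with RDKit is shown below; whether hydrogens are made explicit at this point, and the exact kekulization options, are assumptions rather than a reproduction of the authors' pipeline.

```python
from rdkit import Chem

def smiles_to_graph(smiles):
    """Convert a SMILES string into element labels and a bond-order adjacency matrix."""
    mol = Chem.MolFromSmiles(smiles)
    mol = Chem.AddHs(mol)                        # represent hydrogens as explicit atoms
    Chem.Kekulize(mol, clearAromaticFlags=True)  # only single/double/triple bonds remain
    atoms = [atom.GetSymbol() for atom in mol.GetAtoms()]
    adjacency = Chem.GetAdjacencyMatrix(mol, useBO=True)  # entries in {0, 1, 2, 3}
    return atoms, adjacency

atoms, adjacency = smiles_to_graph("CCO")  # ethanol
```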
Since we use the QM9 dataset to benchmark our ability to approximate the octet rule, we discard any molecules that contain atoms with net charges (1808 molecules). In the ZINC dataset, we keep all molecules, including molecules with charges and hypervalent molecules. The resulting set of adjacency matrices is split using scaffolding to partition the molecules by homology. We make a 15% test, 15% validation, and 70% training split. In Figure 2 we show the distribution of elements for different sizes of molecules. Here we see that in both the QM9 and ZINC datasets, the sizes of the molecules are not uniformly distributed, with few small and large molecules. Furthermore, the molecules in ZINC are generally larger than the ones in QM9: up to 80 atoms in ZINC compared to a maximum of around 30 atoms for QM9. The distribution of different elements depends somewhat on the size of the molecule, especially for smaller molecules. To stress test the models we generate several validation and test sets with increasing complexity. For ZINC, the datasets have either 1, 10, 20, 30, 40, 50, 60, 70, or 80 atoms randomly masked in the molecule, denoted by n_corrupt. For n_corrupt = 1, we oversample the molecules by generating five unique maskings per molecule. This is done to reduce the variance of our estimated performance, especially on molecules with few atoms, since only few of these exist in the dataset.

Figure 2: Count (top) and distribution of elements per molecule size (number of atoms) for (a) the QM9 training set, (b) the QM9 test set, (c) the ZINC training set and (d) the ZINC test set.
Training details
To train the models we optimize the objective for each of the methods (equations 7, 8, 10, and 15) by corrupting the atoms, with masking, and reversing the corruption. When increasing the masking we have an exponentially growing number of possible corruptions, for which reason we sample the atom corruptions in an online manner during training. To make the models robust towards different levels of corruption we employ an ε-greedy corruption scheme:

Pr(no. of corruptions = k) = 1 − ε + ε/|V|   if k = n_corrupt
Pr(no. of corruptions = k) = ε/|V|           if k ≠ n_corrupt   (25)

In the first case, with probability 1 − ε, we corrupt n_corrupt atoms, and in the second case, with probability ε − ε/|V|, we uniformly corrupt between 1 and |V| atoms, where |V| is the number of atoms in the molecule. We use n_corrupt = 1 for training and found ε = 0.2 to work well (see Figure S.1). The models are trained for 100 epochs on an Nvidia Tesla V100 GPU, using Adam optimization, with a learning rate of 0.001 and a batch size of 248 for all models. Not much hyperparameter optimization was done, as these default values performed well. We found that an embedding dimension of 64 and 4 layers worked well for the bag-of-atoms and bag-of-neighbors, as more layers caused more overfitting, while 8 layers with 6 attention heads were chosen for the transformers (see Table S.4). The experiments are implemented in PyTorch.
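A minimal sketch of the ε-greedy sampling in equation 25 (variable names are illustrative):

```python
import random

def sample_num_corruptions(num_atoms, n_corrupt=1, epsilon=0.2):
    """With probability 1 - epsilon mask exactly n_corrupt atoms; otherwise draw
    the number of masked atoms uniformly from 1..num_atoms, matching eq. 25."""
    if random.random() < 1.0 - epsilon:
        return n_corrupt
    return random.randint(1, num_atoms)
```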
Evaluation

When predicting the true value of a masked atom in a molecule, several solutions might be equally correct. In NLP this is often handled by considering sample exact match. However, for molecular structures we know that multiple elements could exist in the same position. This is formalized by the octet rule, which allows the prediction of elements with the same number of unpaired valence electrons. We define the union of an exact match and elements that are correct with respect to the octet rule as octet accuracy. Given that the QM9 dataset is generated by the octet rule, correctly understanding the octet rule would result in 100% octet accuracy, which is why we use it as our first dataset – to see how difficult it is to learn the octet rule. The ZINC dataset, on the other hand, does not conform to the octet rule as it contains hypervalent molecules. This tests our models' ability to go beyond the octet rule on ZINC (code available at https://github.com/jeppe742/language_of_molecules). Since the distribution of atoms in the data is heavily biased, we use the F1-micro and F1-macro scores, which combine precision and recall. While the octet rule becomes increasingly ambiguous when more elements are allowed, exact match remains of interest for understanding which underlying structures are more common. This is important to evaluate whether we can fit the specific distribution of a dataset. We define exact match as sample accuracy and F1. Moreover, we supply sample perplexity measures, which is a more fine-grained way of assessing certainty in model predictions.

Perplexity = exp( −(1/|V_subset|) Σ_{v ∈ V_subset} log P(v | G̃) )   (26)

We benchmark our proposed models against an octet rule model. The octet rule model counts the number of covalent bonds of the masked atom and predicts the unigram probabilities of the elements of the corresponding group in the periodic table. We denote this model as the octet-rule-unigram. When predicting elements with ambiguity (e.g., hydrogen and fluorine in the QM9 dataset) the octet-rule-unigram will therefore not obtain perfect perplexity. As no predictions exist for hypervalent elements (five and six covalent bonds), the octet-rule-unigram predicts uniform probability. Notice that, as opposed to using a unigram model, this actually gives better perplexity, as S is underrepresented in the dataset (see Table 1).
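A sketch of this baseline is given below; the grouping of elements by allowed bond count and the renormalisation are our reading of the description above and should be treated as assumptions.

```python
# Elements grouped by the number of covalent bonds the octet rule assigns them
# (illustrative grouping for the ZINC element set).
OCTET_GROUPS = {1: ["H", "F", "Cl", "Br", "I"], 2: ["O", "S"], 3: ["N", "P"], 4: ["C"]}

def octet_rule_unigram(num_bonds, unigram):
    """Predict element probabilities for a masked atom with num_bonds covalent bonds:
    renormalised unigram probabilities within the matching octet-rule group, and a
    uniform distribution for hypervalent cases (five or six bonds)."""
    candidates = OCTET_GROUPS.get(num_bonds)
    if candidates is None:  # hypervalent: the octet rule makes no prediction
        return {a: 1.0 / len(unigram) for a in unigram}
    total = sum(unigram[a] for a in candidates)
    return {a: unigram[a] / total for a in candidates}
```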
Results
We test all proposed models on octet and sample accuracy, F1, and perplexity. First, we evaluate the models on the QM9 dataset, where the purpose is to learn an approximation to the octet rule. Next, we measure the models on the ZINC dataset and attempt to extend the octet approximation with hypervalent molecules and ions. Finally, we provide a qualitative insight into model predictions by analyzing six different samples (three correct, three incorrect) from the binary-transformer.

QM9 - approximating the octet rule
In Table 2, we evaluate our models on octet rule accuracy, octet rule F1 (micro/macro) and sample perplexity. As expected, the bond-transformer achieves almost perfect octet accuracy, since the task becomes a matter of counting covalent bonds once the order of the bonds is included. The binary-transformer also achieves excellent octet accuracy, even though it is not given any information about bond types. With 1 masked atom, the problem of recovering the corrupted atom without any bond types can be seen as a combinatorial problem. This suggests that the binary-transformer is able to approximately solve this problem by inferring the bond orders from the remaining molecule. By only using neighborhood information, the Bag-of-neighbors model gets 90%, which serves as a very strong baseline, but without the full structural context the model cannot approximate the octet rule. Similarly, by only providing compositional information, the Bag-of-Atoms model performs significantly worse, showing that structural and neighboring information is important. Finally, the Unigram relies purely on the frequency of occurrence of elements in the dataset, thus always guessing that the masked atom is hydrogen, and performs poorly. We provide extended results on masking multiple atoms, transformer model sizes, and accuracy by length in the Supporting Information.

Table 2: Performance of our models for 1 masked atom per molecule. The uncertainty corresponds to the standard deviation of ten models trained with different start seeds.

Model                Octet Accuracy   Octet F1 (micro/macro)   Perplexity
bond-transformer
binary-transformer
bag-of-neighbors
bag-of-atoms
Unigram
octet-rule-unigram   100              100 / 100                1.002
ZINC - going beyond the octet rule
We consider the ZINC dataset as it cannot be fully explained by the octet rule and has a larger quantity of ambiguous elements than QM9. For example, with n_corrupt = 1, our ZINC test set contains more fluorine atoms to be predicted than the QM9 test set. Given that some elements, namely ions and hypervalent molecules, cannot be predicted by the octet rule, we add k-smoothing to the octet-rule-unigram model (a minimal sketch is given after Table 3). This avoids the case of 0 probability, which would result in infinite perplexity loss. We optimize k on the validation set and found the optimum at k = 1842 (see Figure S.2). From Table 3 we see that the octet-rule-unigram model no longer has 100% octet F1, which emphasizes to what extent the dataset cannot be fully explained by the octet rule, due to molecules with charges and hypervalency. Both our transformer models perform similar to or better than the octet-rule-unigram when evaluated on octet F1, sample F1 and sample perplexity. This is especially the case with F1 macro, which puts more emphasis on the underrepresented cases, which in our case are the most interesting. This indicates that the transformer models have also learned to discriminate between elements that should be equally likely from the perspective of the octet rule, but might have higher likelihood under a given structure.

Table 3: Performance of our models for 1 masked atom per molecule. The uncertainty corresponds to the standard deviation of ten models trained with different start seeds.

Model                Octet F1 (micro/macro)   Sample F1 (micro/macro)   Perplexity
bond-transformer
binary-transformer
octet-rule-unigram
bag-of-neighbors
bag-of-atoms
Unigram
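The add-k smoothing used for the baseline can be read as a standard Laplace-style correction of the counts; the sketch below is an assumption about the exact form, not a quotation of the paper's implementation.

```python
def smooth_unigram(counts, k=1842):
    """Add-k smoothing of element counts so that otherwise unpredicted elements
    receive non-zero probability and the perplexity stays finite."""
    total = sum(counts.values()) + k * len(counts)
    return {a: (c + k) / total for a, c in counts.items()}
```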
In Figure 3 we evaluate sample accuracy for an increasing number of masked atoms. The Bond-Transformer is barely affected, even when all the atoms in the molecule are masked. This suggests that the model primarily uses the structural information (bond type and connections). The Binary-Transformer, however, drops slightly in accuracy as the molecule is corrupted. This makes sense, as without bond type information the model can use the labels of the remaining atoms to infer the bond types, but as we corrupt more, we limit the available information in the molecule. The same is the case for the Bag-of-neighbors. In the case of Bag-of-atoms, the model seems to converge to the Unigram.
Figure 3: Sample accuracy of the models, evaluated for different numbers of masked atoms on the ZINC dataset. Error bars correspond to the standard deviation of 10 models trained with different start seeds.

To investigate whether our models can understand ions, we have visualized the confusion matrix for atoms with four covalent bonds in Figure 4 (other bond order confusion matrices can be found in the Supporting Information). For a masked atom with four covalent bonds the possible classes in the dataset are C, an N+ ion, or a hypervalent S. The confusion matrix shows that while our Octet-rule-unigram model only predicts C, both the Binary-transformer and Bond-transformer have learned that both S and N can have four covalent bonds and how to discriminate between them. Thus the models seem to have successfully learned a more complex structure rule than the octet rule. To better understand the models' success in predicting hypervalent elements we visualize the confusion matrices for five and six covalent bonds in Figures S.8 and S.9 (see Supporting Information). With five covalent bonds we only have one occurrence of P, which is correctly predicted by the bond-transformer. For six covalent bonds, both transformers correctly predict all elements with S. To assess the models' ability to predict ambiguous elements we visualize the confusion matrix for one covalent bond in Figure S.5. In particular, we find that both transformer models (binary-transformer/bond-transformer) can successfully predict a large number of F atoms (279/270) while only misclassifying a small number of H (23/21) as F. For future investigations, we find that the QM9 and
ZINC datasets are heavily biased towards H and C. This might make training difficult due to dataset imbalances and could be improved by oversampling rare elements.
Figure 4: Confusion matrix for the test set, with n_corrupt = 1, where the masked atom has four covalent bonds. We provide this matrix for the octet-rule-unigram, binary-transformer, and bond-transformer.

Qualitative results
To investigate the binary-transformer corrections of atoms in a molecule, we inspect a few interesting predictions on the ZINC dataset. We show the molecules with the predicted conditional probabilities of the possible element labels on the masked atoms. Figure 5a illustrates an example where the model correctly predicts N, even though N− ions are very rare in the dataset. It also puts a reasonable amount of probability on the target being O, which could be a valid guess assuming the octet rule applies. In Figure 5b we see an example of a hypervalent S, which our model correctly predicts with very high certainty. The hypervalent S often appears in the dataset with two double-bonded O, which might be a giveaway for the model. The example in Figure 5c does not have an immediate explanation, but the model is very certain of its prediction, which is also correct. The context of elements with one covalent bond is expected to be identical under the octet rule, since each has a single neighbor among the other elements in the data, but since the data is heavily biased towards hydrogen it is worth checking if the predicted probabilities are also biased. From Figure 5d, we see that even though the model incorrectly predicts H, the second most likely guess of Cl is correct, even though F appears twice as often in the dataset. A similar case can be seen in Figure 5e, where the model is in doubt between two elements that both could be considered correct under the octet rule. Finally, in Figure 5f we have an example where the model is very certain but makes a completely wrong prediction.
Figure 5: Predicted atom probabilities. The molecule corresponds to the true molecule, where the colored atom is the target we want to predict. Green corresponds to correct and red to wrong predictions.

Conclusion
In this work we have introduced the binary-transformer and bond-transformer models, and evaluated their ability to recover masked atoms in an undirected molecular graph with discrete representations of bonds. The models achieve near-perfect octet F1-micro on the QM9 dataset while masking 1 atom per molecule, suggesting that they are capable of learning the octet rule, which is the underlying selection criterion for the QM9 dataset. When evaluated on the ZINC dataset, which contains more complex structure rules, our transformer models outperform the octet-rule-unigram model in all metrics, including octet F1-micro, when masking 1 atom per molecule. When paired with the analysis of the confusion matrices, this indicates that the models have learned rules that exceed the octet rule, covering ions and hypervalent molecules. Deep learning models are extremely flexible, and we have shown that the transformer architecture, which makes no assumption about the number of atoms or bonds in a molecule, could in theory model a wide variety of molecular rules. With the high accuracy on the QM9 and ZINC datasets we hypothesize that the transformer models, both the bond and binary based versions, could be well suited for learning other molecular rules, such as structure rules related to properties. As inference with the transformer is cheap, correcting billions of molecules is therefore possible. The transformer model and embeddings made from undirected molecular graphs may furthermore be useful in chemical discovery tasks such as automatically generating and enumerating new molecules. Moreover, years of progress in language modeling for NLP have given rise to strong contextual representations that are now the de facto standard for state-of-the-art models on close to every popular dataset for benchmarking neural network performance. In particular, these pretrained language models work surprisingly well in areas of limited labeled data, something that is fairly prevalent in many molecular chemistry tasks as data might be expensive to gather.
Acknowledgments
This research is funded by the Innovation Foundation Denmark through the DABAI project.
References

(1) Ertl, P.; Rohde, B.; Selzer, P. Fast calculation of molecular polar surface area as a sum of fragment-based contributions and its application to the prediction of drug transport properties. Journal of Medicinal Chemistry, 3714–3717.
(2) Lo, Y.-C.; Rensi, S. E.; Torng, W.; Altman, R. B. Machine learning in chemoinformatics and drug discovery. Drug Discovery Today, 1538–1546.
(3) Ulissi, Z. W.; Medford, A. J.; Bligaard, T.; Nørskov, J. K. To address surface reaction network complexity using scaling relations machine learning and DFT calculations. Nature Communications, 14621.
(4) Boes, J. R.; Mamun, O.; Winther, K.; Bligaard, T. Graph Theory Approach to High-Throughput Surface Adsorption Structure Generation. The Journal of Physical Chemistry A.
(5) Van Geem, K. M.; Pyl, S. P.; Marin, G. B.; Harper, M. R.; Green, W. H. Accurate high-temperature reaction networks for alternative fuels: butanol isomers. Industrial & Engineering Chemistry Research, 10399–10420.
(6) Broadbelt, L. J.; Pfaendtner, J. Lexicography of kinetic modeling of complex reaction networks. AIChE Journal, 2112–2121.
(7) Fink, T.; Bruggesser, H.; Reymond, J.-L. Virtual exploration of the small-molecule chemical universe below 160 daltons. Angewandte Chemie International Edition, 1504–1508.
(8) Blum, L. C.; Reymond, J.-L. 970 million druglike small molecules for virtual screening in the chemical universe database GDB-13. Journal of the American Chemical Society, 8732–8733.
(9) Ruddigkeit, L.; Van Deursen, R.; Blum, L. C.; Reymond, J.-L. Enumeration of 166 billion organic small molecules in the chemical universe database GDB-17. Journal of Chemical Information and Modeling, 2864–2875.
(10) Ramakrishnan, R.; Dral, P. O.; Rupp, M.; Von Lilienfeld, O. A. Quantum chemistry structures and properties of 134 kilo molecules. Scientific Data, 140022.
(11) Elton, D. C.; Boukouvalas, Z.; Fuge, M. D.; Chung, P. W. Deep learning for molecular generation and optimization - a review of the state of the art. arXiv preprint arXiv:1903.04388.
(12) Li, Y.; Zhang, L.; Liu, Z. Multi-objective de novo drug design with conditional graph generative model. Journal of Cheminformatics, 33.
(13) Salakhutdinov, R. Learning deep generative models. Annual Review of Statistics and Its Application, 361–385.
(14) De Cao, N.; Kipf, T. MolGAN: An implicit generative model for small molecular graphs. arXiv preprint arXiv:1805.11973.
(15) You, J.; Liu, B.; Ying, Z.; Pande, V.; Leskovec, J. Graph convolutional policy network for goal-directed molecular graph generation. Advances in Neural Information Processing Systems. 2018; pp 6410–6421.
(16) Gómez-Bombarelli, R.; Wei, J. N.; Duvenaud, D.; Hernández-Lobato, J. M.; Sánchez-Lengeling, B.; Sheberla, D.; Aguilera-Iparraguirre, J.; Hirzel, T. D.; Adams, R. P.; Aspuru-Guzik, A. Automatic chemical design using a data-driven continuous representation of molecules. ACS Central Science, 268–276.
(17) Jaeger, S.; Fulle, S.; Turk, S. Mol2vec: Unsupervised Machine Learning Approach with Chemical Intuition. Journal of Chemical Information and Modeling, 27–35, PMID: 29268609.
(18) Zheng, S.; Yan, X.; Yang, Y.; Xu, J. Identifying Structure–Property Relationships through SMILES Syntax Analysis with Self-Attention Mechanism. Journal of Chemical Information and Modeling, 914–923.
(19) Mater, A. C.; Coote, M. L. Deep Learning in Chemistry. Journal of Chemical Information and Modeling.
(20) Bengio, Y.; Ducharme, R.; Vincent, P.; Janvin, C. A Neural Probabilistic Language Model. J. Mach. Learn. Res., 1137–1155.
(21) Devlin, J.; Chang, M.-W.; Lee, K.; Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
(22) Vincent, P.; Larochelle, H.; Bengio, Y.; Manzagol, P.-A. Extracting and Composing Robust Features with Denoising Autoencoders. Proceedings of the 25th International Conference on Machine Learning. New York, NY, USA, 2008; pp 1096–1103.
(23) Gerratt, J.; Cooper, D.; Karadakov, P. B.; Raimondi, M. Modern valence bond theory. Chemical Society Reviews, 87–100.
(24) Gómez-Bombarelli, R.; Wei, J. N.; Duvenaud, D.; Hernández-Lobato, J. M.; Sánchez-Lengeling, B.; Sheberla, D.; Aguilera-Iparraguirre, J.; Hirzel, T. D.; Adams, R. P.; Aspuru-Guzik, A. Automatic chemical design using a data-driven continuous representation of molecules. ACS Central Science, 268–276.
(25) Irwin, J. J.; Sterling, T.; Mysinger, M. M.; Bolstad, E. S.; Coleman, R. G. ZINC: a free tool to discover chemistry for biology. Journal of Chemical Information and Modeling, 1757–1768.
(26) Bengio, Y.; Ducharme, R.; Vincent, P.; Janvin, C. A Neural Probabilistic Language Model. J. Mach. Learn. Res., 1137–1155.
(27) Jurafsky, D.; Martin, J. H. Speech and Language Processing (2nd Edition); Prentice-Hall, Inc.: Upper Saddle River, NJ, USA, 2009.
(28) Mikolov, T.; Sutskever, I.; Chen, K.; Corrado, G.; Dean, J. Distributed Representations of Words and Phrases and Their Compositionality. Proceedings of the 26th International Conference on Neural Information Processing Systems - Volume 2. USA, 2013; pp 3111–3119.
(29) Mikolov, T.; Karafiát, M.; Burget, L.; Cernocký, J.; Khudanpur, S. Recurrent neural network based language model. INTERSPEECH. 2010; pp 1045–1048.
(30) Zaremba, W.; Sutskever, I.; Vinyals, O. Recurrent Neural Network Regularization. CoRR, abs/1409.2329.
(31) Merity, S.; Keskar, N. S.; Socher, R. Regularizing and Optimizing LSTM Language Models. 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings. 2018.
(32) Hinton, G. E.; Salakhutdinov, R. R. Reducing the dimensionality of data with neural networks. Science, 504–507.
(33) Hansen, K.; Biegler, F.; Ramakrishnan, R.; Pronobis, W.; Von Lilienfeld, O. A.; Müller, K.-R.; Tkatchenko, A. Machine learning predictions of molecular properties: accurate many-body potentials and nonlocality in chemical space. The Journal of Physical Chemistry Letters, 2326–2331.
(34) Hansen, M. H.; Torres, J. A. G.; Jennings, P. C.; Wang, Z.; Boes, J. R.; Mamun, O. G.; Bligaard, T. An Atomistic Machine Learning Package for Surface Science and Catalysis. arXiv preprint arXiv:1904.00904.
(35) Mikolov, T.; Chen, K.; Corrado, G.; Dean, J. Efficient Estimation of Word Representations in Vector Space. 1st International Conference on Learning Representations, ICLR 2013, Scottsdale, Arizona, USA, May 2-4, 2013, Workshop Track Proceedings. 2013.
(36) Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Advances in Neural Information Processing Systems. 2017; pp 5998–6008.
(37) Bahdanau, D.; Cho, K.; Bengio, Y. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.
(38) Shaw, P.; Uszkoreit, J.; Vaswani, A. Self-Attention with Relative Position Representations. CoRR, abs/1803.02155.
(39) Ba, L. J.; Kiros, R.; Hinton, G. E. Layer Normalization. CoRR, abs/1607.06450.
(40) He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. CoRR, abs/1512.03385.
(41) Srivastava, R. K.; Greff, K.; Schmidhuber, J. Highway Networks. CoRR, abs/1505.00387.
(42) Glorot, X.; Bordes, A.; Bengio, Y. Deep Sparse Rectifier Neural Networks. Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics. Fort Lauderdale, FL, USA, 2011; pp 315–323.
(43) Luong, M.-T.; Pham, H.; Manning, C. D. Effective approaches to attention-based neural machine translation. arXiv preprint arXiv:1508.04025.
(44) Weininger, D. SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules. Journal of Chemical Information and Computer Sciences, 31–36.
(45) Landrum, G. RDKit: Open-source cheminformatics.
(46) Sutton, R. S.; Barto, A. G. Reinforcement Learning: An Introduction; 2018.
(47) Kingma, D. P.; Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
(48) Paszke, A.; Gross, S.; Chintala, S.; Chanan, G.; Yang, E.; DeVito, Z.; Lin, Z.; Desmaison, A.; Antiga, L.; Lerer, A. Automatic Differentiation in PyTorch. NIPS Autodiff Workshop. 2017.
(49) Yutaka, S. The truth of the F-measure. Teach Tutor Mater, 1–5.
(50) Buda, M.; Maki, A.; Mazurowski, M. A. A systematic study of the class imbalance problem in convolutional neural networks. Neural Networks, 249–259.
(51) Peters, M. E.; Neumann, M.; Iyyer, M.; Gardner, M.; Clark, C.; Lee, K.; Zettlemoyer, L. Deep contextualized word representations. arXiv preprint arXiv:1802.05365.
(52) Liu, X.; He, P.; Chen, W.; Gao, J. Multi-Task Deep Neural Networks for Natural Language Understanding. CoRR, abs/1901.11504.

Appendix

Table S.1: Description of variables used.

Variable                 Description
G                        Graph, defined as a set of nodes and edges (V, E)
V                        Set of nodes (atoms) in the graph
V_subset                 Set of masked atoms; |V_subset| = n_corrupt
n_corrupt                Number of atoms corrupted per molecule
E                        Adjacency matrix (E_ij ∈ {0, 1, 2, 3})
G̃                        Corrupted graph, with V_subset replaced by mask tokens
C_i ∈ R^{d_transform}    i'th head of the attention function for an atom

Table S.2: Training time of our different models on the QM9 and ZINC datasets.

Model                Training time (min)   Dataset
binary-transformer   110                   QM9
binary-transformer   482                   ZINC
bond-transformer     112                   QM9
bond-transformer     484                   ZINC
bag-of-atoms         71                    QM9
bag-of-atoms         158                   ZINC
bag-of-neighbors     72                    QM9
bag-of-neighbors     144                   ZINC
Figure S.1: Validation perplexity of the binary and bond transformers – with and without the ε-greedy masking strategy – for different numbers of masked atoms. (a) is on the QM9 dataset and (b) is on the ZINC dataset.

Figure S.2: Cross entropy as a function of the smoothing constant k, evaluated on the ZINC validation dataset. The optimum is at k = 1842.
Figure S.3: Perplexity on the validation dataset – with one atom masked per molecule – for each epoch of training. (a) is a bond transformer trained on QM9, (b) is a bond transformer trained on ZINC, (c) is a binary transformer trained on QM9 and (d) is a binary transformer trained on ZINC. Each panel compares attending over all nodes and edges, neighbor nodes only, and neighbor nodes and edges.
QM9 extended results
From Table S.3 and Figures S.4a and S.4b we see that as we mask more atoms per molecule, the bond-transformer maintains a perfect score, since it can solve the task by only looking at the bonds. The Binary-transformer drops slightly in performance as we mask more atoms. The Bag-of-neighbors does not seem to depend on the number of masked atoms. This indicates that the model most likely bases its predictions on the number of neighbors, which can also be an indication of the number of covalent bonds. As we remove all information except composition, the bag-of-atoms model drops significantly as we mask more atoms, reaching performance similar to the Unigram as we approach fully masked molecules. This is no surprise, as a fully masked molecule only gives the model information about the number of atoms, which should not be enough to infer anything.

Table S.3: Performance of our models for 1, 5 and 30 masked atoms per molecule. acc is octet rule accuracy, F1 is the octet rule F1-micro score and PP is the sample perplexity, each averaged over the test set. The uncertainty corresponds to the standard deviation of ten models trained with different start seeds.

Model                Metric   n_mask = 1   n_mask = 5   all masked
octet-rule-unigram   acc      100          100          100
                     F1       100          100          100
                     PP       1.002        1.002        1.002
bond-transformer     acc
                     F1
                     PP
binary-transformer   acc      99.73
                     F1
                     PP
bag-of-neighbors     acc      90.7
                     F1
                     PP
bag-of-atoms         acc      65.8
                     F1
                     PP
Unigram              acc      47.3         47.2         48.3
                     F1       47.3         47.2         48.3
                     PP       3.104        3.113        3.038
Figure S.4: Octet F1 micro (a) and octet F1 macro (b) evaluated for different numbers of masked atoms. Octet F1 micro (c) and octet F1 macro (d) evaluated on molecules of varying size, with 1 atom masked. Error bars correspond to the standard deviation of 10 models trained with different start seeds.

The transformer model is very flexible in terms of modeling capability, like any other deep learning model, so to gauge the complexity of the task we evaluate five binary-transformer models of various sizes, which can be seen in Table S.4. Here we see that even very small transformer models perform well. As the models increase in number of parameters the performance increases, which however comes at a cost of computation and memory consumption.

Table S.4: Performance of binary-transformer models with different numbers of trainable parameters, for 1 and 5 masked atoms per molecule. acc is octet accuracy, F1 is the octet F1-score and PP is perplexity, each averaged over the test set. t_train is the training time.

Model                          Metric   n_mask = 1   n_mask = 5   t_train (min)   Parameters
layers=1, heads=1, d_emb=4     acc      86.0         85.8         60              199
                               F1       86.0         85.8
                               PP       1.426        1.441
layers=2, heads=1, d_emb=4     acc      89.9         89.8         63              265
                               F1       89.9         89.8
                               PP       1.261        1.272
layers=2, heads=3, d_emb=64    acc      96.3         94.4         77              118149
                               F1       96.3         94.4
                               PP       1.089        1.130
layers=4, heads=3, d_emb=64    acc      98.4         97.4         82              234885
                               F1       98.4         97.3
                               PP       1.031        1.056
layers=8, heads=6, d_emb=64    acc      99.8         97.9         110             866181
                               F1       99.8         97.9
                               PP       1.008        1.045

ZINC extended results
From Figure S.5 we see that both our transformer models have learned to discriminate between certain elements that should be indistinguishable under the octet rule, like F, but also to allow for ions in the form of O−. A similar story can be seen in Figure S.6, where we have ambiguity between O and S but also N− ions. Figures S.7, S.8 and S.9 do not provide any further insights, as the dataset is too biased and almost only contains one type of element for each of these numbers of covalent bonds.
Figure S.5: Confusion matrix for cases where the masked atom has one covalent bond.
Figure S.6: Confusion matrix for cases where the masked atom has two covalent bonds.
Figure S.7: Confusion matrix for cases where the masked atom has three covalent bonds.
Figure S.8: Confusion matrix for cases where the masked atom has five covalent bonds.
Figure S.9: Confusion matrix for cases where the masked atom has six covalent bonds.

In Figures S.10 and S.11 we see the same story as outlined in the main text, namely that the Bond-transformer outperforms the octet-rule-unigram model, and the Binary-transformer also performs similar to or better than it, depending on the number of masks and the metric used to evaluate.
Figure S.10: Sample perplexity evaluated for different numbers of masked atoms. Error bars correspond to the standard deviation of 10 models trained with different start seeds.
Figure S.11: Octet F1 micro (a), octet F1 macro (b), sample F1 micro (c) and sample F1 macro (d) evaluated for different numbers of masked atoms. Error bars correspond to the standard deviation of 10 models trained with different start seeds.
[Final supplementary figure, caption not recovered: octet F1 micro (a), octet F1 macro (b), sample F1 micro (c) and sample F1 macro (d) for all models, evaluated on molecules of varying length.]