Autoencoding Undirected Molecular Graphs With Neural Networks
Jeppe Johan Waarkjær Olsen, Peter Ebert Christensen, Martin Hangaard Hansen, Alexander Rosenberg Johansen
Department of Computing, Technical University of Denmark
E-mail: [email protected]
Abstract
Discrete structure rules for validating molecular structures are usually limited to fulfilment of the octet rule or similar simple deterministic heuristics. We propose a model, inspired by language modeling from natural language processing, with the ability to learn from a collection of undirected molecular graphs, enabling fitting of any underlying structure rule present in the collection. We introduce an adaption of the popular Transformer model, which can learn relationships between atoms and bonds. To our knowledge, the Transformer adaption is the first model that is trained to solve the unsupervised task of recovering partially observed molecules. In this work, we assess how different degrees of information impact performance with respect to fitting the QM9 dataset, which conforms to the octet rule, and fitting the ZINC dataset, which contains hypervalent molecules and ions requiring the model to learn a more complex structure rule. More specifically, we test a full discrete graph with bond order information, a full discrete graph with only connectivity, a bag-of-neighbors, a bag-of-atoms, and count-based unigram statistics. These results provide encouraging evidence that neural networks, even when only connectivity is available, can learn arbitrary molecular structure rules specific to a dataset, as the Transformer adaption surpasses a strong octet rule baseline on the ZINC dataset.

Introduction

In drug discovery, catalysis, and combustion, the number of possible relevant molecules grows exponentially with the size of the molecule or reaction network. Modeling and exploring large datasets of molecules benefits from fast coarse-grained methods to generate, filter, consistency check, validate, and correct molecules. Databases of molecular properties calculated by ab initio methods rely on consistency between 3D structures and molecular graphs. This consistency should be highly reliable and avoid any erroneous identifications after structure relaxation. The reliability requirement benefits from several redundant methods, which can flag any possible inconsistencies. The tasks of validating, correcting, completing, and generating molecules in discrete representations usually rely on a simple heuristic such as the octet rule as the fundamental structure rule to determine the validity of molecules.
The octet rule is, however, not satisfactory for validating all synthesizable molecules due to the occurrence of hypervalent molecules, ions, and non-integer bond orders such as in aromatic bonds. Machine learning methods working on discrete molecular graphs can act as structure rules learned from molecular datasets. At the same time, they can fit a great complexity of underlying trends while being low cost compared to 3D representations and quantum chemical calculations. Machine learning for predicting properties based on discrete representations of molecules has only recently begun to use undirected graphs as opposed to directed linear graphs or sequences such as SMILES.
Moreover, predictions should be invariant under permutation, translation, and rotation of the molecular representation, which calls for undirected graphs. We introduce an unsupervised task, known as masked language modeling or denoising autoencoding, over an undirected discrete graph representation of a given molecule. We define the unsupervised task as corrupting a molecule and learning how to revert such corruption to recover the valid molecule. This objective allows us to learn the underlying structure rule without any hard-coded heuristic by merely observing valid molecular graphs. This can correct molecular graphs directly or cross-validate molecular graphs generated from 3D structures to check for consistency. In addition, the binary-transformer encodings developed for this model are generally of interest for other tasks, including generation in the context of drug discovery. This paper presents several models trained on two datasets: QM9, which we use as a benchmark to verify that the models can learn a simple known heuristic defining the dataset – namely the octet rule – and ZINC, as a more challenging dataset due to ions and hypervalent molecules.
The models are evaluated on several metrics including perplexity, sample F1, and a new octet F1, which measures whether the predictions satisfy the octet rule. In Natural Language Processing (NLP) a goal of statistical and probabilistic language modeling is to learn the joint probability mass function of sequences. Historically, this has been accomplished by calculating the probability of observing a word given the sentence that precedes it. Methods exploiting the sequential relationship between words in text have ranged from probabilistic finite automata, to distributed word embeddings, and recurrent neural networks (RNNs). To cover the most recent development in language modeling, adapted to fit undirected graphs with a degree above two, we test the following methods of increasing complexity:
• unigram — unconditional probabilities of the atoms
• bag-of-atoms/neighbors — neural network that aggregates either all atoms or only neighboring atoms in the molecule
• binary/bond-transformer — neural network architecture with attention using either binary representations of connectivity or full bond type information
The binary/bond-transformer are inspired by a recent trend in NLP, known as masked language modeling, where the sequential requirement can be relaxed. Most noticeably, we modify masked language modeling to work with molecules by masking atoms and using graph adjacency matrices to model intramolecular relationships.
Methods
In this section we present several methods to restore partially observed molecules. We formally define this as an unsupervised learning task over discrete molecular graphs. To train the models we apply a simple corruption function that masks atoms and challenge the models to recover the corruption. The formal task definition and the five unsupervised models of increasing complexity are defined below; as a baseline we apply the deterministic octet rule.
Unsupervised learning of discrete molecular graphs
Autoencoders are a type of neural network trained with unsupervised learning. They create an efficient representation of data by extracting important features. The denoising autoencoder is an autoencoder variant that challenges the neural network to revert corruptions of the input. By reverting corruptions, the neural network has to understand the underlying structure of the data distribution. In our case the input is a molecule, which we represent as an undirected graph with discrete edges G = (V, E). Here V is a set of vertices (atoms), such that (a, i) ∈ V where a ∈ A is the element and i ∈ N is the index. E is the set of undirected bonds between atoms in the molecule, such that E ⊆ {(x, y, b) | x, y ∈ V, (x, y) = (y, x), x ≠ y}, where b ∈ {1, 2, 3} is the bond type: single, double, or triple. We denote the corruption function of the denoising autoencoder as κ : V → Ṽ. For the experiments we use a corruption function that masks atoms in a molecule with bond types intact. This method of corruption is inspired by the masked language model presented in BERT. To apply the corruption function we replace a set of vertices with a mask token.
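To make the corruption step concrete, the following is a minimal sketch of masking atoms in a molecular graph; the list-based molecule container, the MASK label, and the random sampling of indices are illustrative assumptions rather than the authors' implementation.

```python
import random

MASK = "<mask>"  # placeholder label for a corrupted atom (illustrative choice)

def corrupt(atoms, n_corrupt=1):
    """Mask n_corrupt atom labels while leaving all bonds intact.

    atoms: list of element symbols, e.g. ["C", "H", "H", "H", "O", "H"].
    Returns the corrupted atom list and the masked indices, which serve
    as the prediction targets V_subset.
    """
    masked_idx = set(random.sample(range(len(atoms)), n_corrupt))
    corrupted = [MASK if i in masked_idx else a for i, a in enumerate(atoms)]
    return corrupted, masked_idx

# Example: methanol with one masked atom
corrupted, targets = corrupt(["C", "H", "H", "H", "O", "H"], n_corrupt=1)
```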
Unigram

A counting-based model obtains the distribution of atom types by calculating their frequencies over a dataset. Counting-based models will by intuition have high accuracy when the dataset is biased, which we find the QM9 and ZINC datasets to be (see Table 1).
The count-based model is motivated by the probability chain rule, with which we can model the joint probability of the atoms v_i = (a, i) ∈ V in a molecule:

P(v_1, v_2, ..., v_n) = P(v_1) P(v_2 | v_1) ... P(v_n | v_1, ..., v_{n-1})   (4)

While equation 4 allows us to exactly estimate the conditional atom distribution, the condition grows exponentially with the number of vertices and becomes infeasible due to the exponential requirement of data and compute. In NLP the directionality of the sentence allows for clipped, n-gram, versions of equation 4, where the prediction of the word distribution is only conditioned on the last k tokens: P(v_n | v_1, ..., v_{n-1}) = P(v_n | v_{n-k}, ..., v_{n-1}). Using n-grams significantly reduces the required computation and data while exploiting the locality of language. In molecules, the degree of vertices and lack of directionality make such n-gram models cumbersome, as each atom can have a tree of recursive n-grams. For this reason, we limit ourselves to unigram models (1-grams) for the counting case. A unigram model splits the probability of different terms in a context into a product of individual terms, disregarding the condition of equation 4:

P_unigram(v_1, v_2, ..., v_n) = P(v_1) P(v_2) ... P(v_n)   (5)

P(a_j) = count(a_j) / Σ_a count(a)   (6)

The unigram model has the benefit of being relatively simple to implement and interpret, as it merely counts the occurrence of elements in the training set. The unigram distributions of the QM9 and ZINC training sets are shown in Table 1.
Table 1: Unigram probabilities for the QM9 and ZINC training sets. The unigram probabilities correspond to the distribution of elements in the datasets.

Element   P (QM9)   P (ZINC)
H
C
O
N
F
P
S
Cl
Br
I

Both training sets are heavily biased towards H and C. Note that the unigram model will always predict with the same probability distribution of elements for any atom, as it does not use context. Using our objective from equation 3 we calculate the unigram probability of the corrupted molecule as

max P(V_subset | G̃) = max ∏_{ṽ ∈ V_subset} P(ṽ)   (7)
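As a concrete illustration of equations 5–7, a count-based unigram model reduces to frequency counting; the function names below are illustrative, not taken from the paper's code.

```python
from collections import Counter

def fit_unigram(molecules):
    """Estimate P(a) from element frequencies over a training set, where each
    molecule is a list of element symbols (eq. 6)."""
    counts = Counter(a for atoms in molecules for a in atoms)
    total = sum(counts.values())
    return {a: c / total for a, c in counts.items()}

def unigram_probability(masked_elements, p):
    """Probability assigned to the masked atoms (eq. 7): a product of
    unigram probabilities, since the model ignores all context."""
    prob = 1.0
    for a in masked_elements:
        prob *= p[a]
    return prob

p = fit_unigram([["C", "H", "H", "H", "O", "H"], ["O", "H", "H"]])
print(unigram_probability(["H"], p))
```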
Bag of vectors: neighbors and atoms

In a bag-of-vectors model a molecule is represented as a multiset of its tokens (elements and/or bonds), disregarding structure but keeping multiplicity (i.e. multiple occurrences of the same token). Each token, x, is embedded as a trainable vector of real numbers x ∈ R^d. By summing the n tokens of a molecule over the d features we obtain the bag-of-vectors representation R^{n×d} → R^d (sum is used instead of mean to keep multiplicity). The bag-of-vectors representation is used as input to a neural network that learns to predict the masked tokens V_subset. The token vectors, also known as embeddings, and the neural network are jointly optimised with stochastic gradient descent. Using eq. 3 we define two bag-of-vector models for our study: a bag of neighboring atoms (eq. 8) and a bag of all atoms in the corrupted molecule (eq. 10).

max_θ P_θ^{bag-of-neighbors}(V_subset | G̃) = max_θ ∏_{v ∈ V_subset} P_θ(v | V_neighbors)   (8)

V_neighbors = {v_j | (v_j, v) ∈ E}   (9)

max_θ P_θ^{bag-of-atoms}(V_subset | G̃) = max_θ ∏_{v ∈ V_subset} P_θ(v | Ṽ)   (10)

P_θ(x_j | X̃) = softmax(W h_θ(X̃))_j = exp((W h_θ(X̃))_j) / Σ_{i=0}^{|Σ|-1} exp((W h_θ(X̃))_i)   (11)

h_θ(X̃) = NN(z_θ(X̃))   (12)

z_θ(X̃) = Σ_{x̃ ∈ X̃} embedding(x̃)   (13)

To represent our corrupted tokens in equation 13, X̃ being the elements of either Ṽ or V_neighbors, we use an embedding function. Embedding functions, embedding(x) ∈ R^{d_emb}, are a popular way to represent input tokens in NLP. The embedding function uses a dense vector representation for each token class, which allows the embedding function to learn relations between token classes.
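The following PyTorch sketch illustrates the bag-of-atoms variant of equations 10–13; the layer sizes, the two-layer network, and the integer vocabulary are assumptions for illustration, not the exact configuration used in the paper. The bag-of-neighbors model is obtained by restricting the summed tokens to the masked atom's neighbors (eq. 9).

```python
import torch
import torch.nn as nn

class BagOfAtoms(nn.Module):
    """Sum token embeddings (eq. 13), pass the bag through a feed-forward
    network (eq. 12), and score the possible elements with a softmax (eq. 11)."""

    def __init__(self, vocab_size, d_emb=64, d_hidden=64):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, d_emb)
        self.net = nn.Sequential(
            nn.Linear(d_emb, d_hidden),
            nn.ReLU(),
            nn.Linear(d_hidden, vocab_size),
        )

    def forward(self, token_ids):
        # token_ids: (batch, n_atoms) integer-encoded, partially masked atoms
        z = self.embedding(token_ids).sum(dim=1)  # bag-of-vectors, keeps multiplicity
        return self.net(z)                        # logits over element classes

model = BagOfAtoms(vocab_size=11)  # e.g. 10 elements plus a mask token (illustrative)
logits = model(torch.tensor([[0, 1, 1, 1, 2, 10]]))  # one corrupted molecule
probs = torch.softmax(logits, dim=-1)
```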
Transformer

As with the original transformer, we use the key-value lookup attention function. This layer can adaptively align information between atoms conditioned on the context of other atoms.
Our implementation takes a layer of hidden representations, h^l, and an adjacency matrix of edges, E, as input. Notice that we have separate trainable bond-embedding functions for the key, e^K, and value, e^V, edge representations.

Attention(h, E)_i = Σ_{j=1}^{n} α_ij (h_j W^V + e^V_ij)   (20)

α_ij = exp(φ_ij) / Σ_{k=1}^{n} exp(φ_ik)   (21)

φ_ij = (h_i W^Q)(h_j W^K + e^K_ij)^T / √(d_transform)   (22)

e = bond-embedding(E)   (23)

where α_ij ∈ [0, 1] are the attention weights; n is the number of vertices; and W^Q, W^V, W^K ∈ R^{d_transform × d_transform} are trainable weights. The bond-embedding: E^{|V|×|V|} → R^{|V|×|V|×d_transform} takes an adjacency matrix and returns a three-dimensional tensor with a distributed representation for each edge. Notice that compared to most graph-based models we use information from all the nodes and edges in the graph to calculate the attention weights at each layer. In our experiments, this improved the performance (see Figure S.3). To have a more expressive attention function we use the multi-head attention mechanism, concatenating k attention layers. The k attention layers are projected to the hidden size of the network, R^{d_transform · k} → R^{d_transform}, such that

Multi-Head-Attention(h, E)_i = [C_1, C_2, ..., C_k] W^multi   (24)

where C_i corresponds to an instance of Attention (eq. 20) and W^multi ∈ R^{(d_transform · k) × d_transform} is a trainable weight. This is further illustrated in Figure 1.

Figure 1: (left) Scaled Dot-Product Attention. (right) Multi-Head Attention, consisting of several attention layers running in parallel. Figure modified from Vaswani et al.
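As a minimal single-head sketch of the edge-conditioned attention in equations 20–23, the code below shifts keys and values by learned bond embeddings; the dimensions, the plain embedding lookup for bond types, and the absence of batching or masking are simplifying assumptions.

```python
import math
import torch
import torch.nn as nn

class EdgeAttention(nn.Module):
    """One attention head with key and value bond embeddings (eqs. 20-23)."""

    def __init__(self, d=64, n_bond_types=4):  # 0 = no bond, 1/2/3 = bond order
        super().__init__()
        self.wq = nn.Linear(d, d, bias=False)
        self.wk = nn.Linear(d, d, bias=False)
        self.wv = nn.Linear(d, d, bias=False)
        self.e_k = nn.Embedding(n_bond_types, d)  # e^K: key bond embedding
        self.e_v = nn.Embedding(n_bond_types, d)  # e^V: value bond embedding
        self.scale = math.sqrt(d)

    def forward(self, h, adj):
        # h: (n, d) atom representations, adj: (n, n) integer bond-order matrix
        q, k, v = self.wq(h), self.wk(h), self.wv(h)
        ek, ev = self.e_k(adj), self.e_v(adj)  # (n, n, d) edge representations
        # phi_ij = q_i . (k_j + e^K_ij) / sqrt(d)        (eq. 22)
        phi = torch.einsum("id,ijd->ij", q, k.unsqueeze(0) + ek) / self.scale
        alpha = torch.softmax(phi, dim=-1)              # eq. 21
        # out_i = sum_j alpha_ij (v_j + e^V_ij)          (eq. 20)
        return torch.einsum("ij,ijd->id", alpha, v.unsqueeze(0) + ev)

attention = EdgeAttention()
out = attention(torch.randn(5, 64), torch.zeros(5, 5, dtype=torch.long))
```

Multi-head attention (eq. 24) would concatenate k such heads and project the result back to d_transform.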
Experimental setup

In our experiments we test the described models – unigram, bag-of-neighbors, bag-of-atoms, binary-transformer, and bond-transformer – as denoising autoencoders on the QM9 and ZINC datasets.
Pre-processing
The QM9 dataset has 134 000 organic molecules with five types of atoms: A = {H, C, N, O, F}. Similarly, the ZINC dataset has 250 000 drug-like molecules with 10 types of atoms, A = {H, C, N, O, F, P, S, Cl, Br, I}. The molecules are represented as SMILES strings corresponding to their discrete graph representations. We kekulize the molecules – thus resulting in the dataset only containing single, double and triple bond types – and obtain an adjacency matrix for each molecule from the SMILES string using RDKit.
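A sketch of this pre-processing step with RDKit is shown below; whether hydrogens are made explicit at this point, and the exact kekulization options, are assumptions rather than a reproduction of the authors' pipeline.

```python
from rdkit import Chem

def smiles_to_graph(smiles):
    """Convert a SMILES string into element labels and a bond-order adjacency matrix."""
    mol = Chem.MolFromSmiles(smiles)
    mol = Chem.AddHs(mol)                        # represent hydrogens as explicit atoms
    Chem.Kekulize(mol, clearAromaticFlags=True)  # only single/double/triple bonds remain
    atoms = [atom.GetSymbol() for atom in mol.GetAtoms()]
    adjacency = Chem.GetAdjacencyMatrix(mol, useBO=True)  # entries in {0, 1, 2, 3}
    return atoms, adjacency

atoms, adjacency = smiles_to_graph("CCO")  # ethanol
```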
Since we use the QM9 dataset to benchmark our ability to approximate the octet rule, we discard any molecules that contain atoms with net charges (1808 molecules). In the ZINC dataset, we keep all molecules, including molecules with charges and hypervalent molecules. The resulting set of adjacency matrices is split using scaffolding to partition the molecules by homology. We make a 15% test, 15% validation, and 70% training split. In Figure 2 we show the distribution of elements for different sizes of molecules. Here we see that in both the QM9 and ZINC datasets, the sizes of the molecules are not uniformly distributed, with few small and large molecules. Furthermore, the molecules in ZINC are generally larger than the ones in QM9: up to 80 atoms in ZINC compared to a maximum of around 30 atoms for QM9. The distribution of different elements depends somewhat on the size of the molecule, especially for smaller molecules. To stress test the models we generate several validation and test sets with increasing complexity. For ZINC, the datasets have either 1, 10, 20, 30, 40, 50, 60, 70, or 80 atoms randomly masked in the molecule, denoted by n_corrupt. For n_corrupt = 1, we oversample the molecules by generating five unique maskings per molecule. This is done to reduce the variance of our estimated performance, especially on molecules with few atoms, since only few of these exist in the dataset.

Figure 2: Count (top) and distribution of elements per molecule size (number of atoms) for (a) the QM9 training set, (b) the QM9 test set, (c) the ZINC training set and (d) the ZINC test set.
Training details
To train the models we optimize the objective for each of the methods (equations 7, 8, 10, and 15) by corrupting the atoms, with masking, and reversing the corruption. When increasing the masking we have an exponentially growing number of possible corruptions, for which reason we sample the atom corruptions in an online manner during training. To make the models robust towards different levels of corruption we employ an ε-greedy corruption scheme:

Pr(no. of corruptions = k) = 1 − ε + ε/|V|   if k = n_corrupt
Pr(no. of corruptions = k) = ε/|V|           if k ≠ n_corrupt   (25)

In the first case, with probability 1 − ε, we corrupt n_corrupt atoms, and in the second case, with probability ε − ε/|V|, we uniformly corrupt between 1 and |V| atoms, where |V| is the number of atoms in the molecule. We use n_corrupt = 1 for training and found ε = 0.2 to work well (see Figure S.1). The models are trained for 100 epochs on an Nvidia Tesla V100 GPU, using Adam optimization, with a learning rate of 0.001 and a batch size of 248 for all models. Not much hyperparameter optimization was done, as these default values performed well. We found that an embedding dimension of 64 and 4 layers worked well for the bag-of-atoms and bag-of-neighbors, as more layers caused more overfitting, while 8 layers with 6 attention heads were chosen for the transformers (see Table S.4). The experiments are implemented in PyTorch.
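A minimal sketch of the ε-greedy sampling in equation 25 (variable names are illustrative):

```python
import random

def sample_num_corruptions(num_atoms, n_corrupt=1, epsilon=0.2):
    """With probability 1 - epsilon mask exactly n_corrupt atoms; otherwise draw
    the number of masked atoms uniformly from 1..num_atoms, matching eq. 25."""
    if random.random() < 1.0 - epsilon:
        return n_corrupt
    return random.randint(1, num_atoms)
```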
Evaluation

When predicting the true value of a masked atom in a molecule, several solutions might be equally correct. In NLP this is often handled by considering sample exact match. However, for molecular structures we know that multiple elements could exist in the same position. This is formalized by the octet rule, which allows the prediction of elements with the same number of unpaired valence electrons. We define the union of an exact match and elements that are correct with respect to the octet rule as octet accuracy. Given that the QM9 dataset is generated by the octet rule, correctly understanding the octet rule would result in 100% octet accuracy, which is why we use it as our first dataset – to see how difficult it is to learn the octet rule. The ZINC dataset, on the other hand, does not conform to the octet rule as it contains hypervalent molecules. This tests our models' ability to go beyond the octet rule on ZINC (code available at https://github.com/jeppe742/language_of_molecules). Since the distribution of atoms in the data is heavily biased, we use the F1-micro and F1-macro scores, which combine precision and recall. While the octet rule becomes increasingly ambiguous when more elements are allowed, exact match remains of interest for understanding which underlying structures are more common. This is important to evaluate whether we can fit the specific distribution of a dataset. We define exact match as sample accuracy and F1. Moreover, we supply sample perplexity measures, which is a more fine-grained way of assessing certainty in model predictions.

Perplexity = exp( −(1/|V_subset|) Σ_{v ∈ V_subset} log P(v | G̃) )   (26)

We benchmark our proposed models against an octet rule model. The octet rule model counts the number of covalent bonds of the masked atom and predicts the unigram probabilities of the elements of the corresponding group in the periodic table. We denote this model as the octet-rule-unigram. When predicting elements with ambiguity (e.g., hydrogen and fluorine in the QM9 dataset) the octet-rule-unigram will therefore not obtain perfect perplexity. As no predictions exist for hypervalent elements (five and six covalent bonds), the octet-rule-unigram predicts uniform probability. Notice that, as opposed to using a unigram model, this actually gives better perplexity, as S is underrepresented in the dataset (see Table 1).
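A sketch of this baseline is given below; the grouping of elements by allowed bond count and the renormalisation are our reading of the description above and should be treated as assumptions.

```python
# Elements grouped by the number of covalent bonds the octet rule assigns them
# (illustrative grouping for the ZINC element set).
OCTET_GROUPS = {1: ["H", "F", "Cl", "Br", "I"], 2: ["O", "S"], 3: ["N", "P"], 4: ["C"]}

def octet_rule_unigram(num_bonds, unigram):
    """Predict element probabilities for a masked atom with num_bonds covalent bonds:
    renormalised unigram probabilities within the matching octet-rule group, and a
    uniform distribution for hypervalent cases (five or six bonds)."""
    candidates = OCTET_GROUPS.get(num_bonds)
    if candidates is None:  # hypervalent: the octet rule makes no prediction
        return {a: 1.0 / len(unigram) for a in unigram}
    total = sum(unigram[a] for a in candidates)
    return {a: unigram[a] / total for a in candidates}
```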
Results
We test all proposed models on octet and sample accuracy, F1, and perplexity. First, we evaluate the models on the QM9 dataset, where the purpose is to learn an approximation to the octet rule. Next, we measure the models on the ZINC dataset and attempt to extend the octet approximation with hypervalent molecules and ions. Finally, we provide a qualitative insight into model predictions by analyzing six different samples (three correct, three incorrect) from the binary-transformer.

QM9 - approximating the octet rule
In Table 2, we evaluate our models on octet rule accuracy, octet rule F1 (micro/macro) and sample perplexity. As expected, the bond-transformer achieves almost perfect octet accuracy, since the task becomes a matter of counting covalent bonds once the order of the bonds is included. The binary-transformer also achieves excellent octet accuracy, even though it is not given any information about bond types. With 1 masked atom, the problem of recovering the corrupted atom without any bond types can be seen as a combinatorial problem. This suggests that the binary-transformer is able to approximately solve this problem by inferring the bond orders from the remaining molecule. By only using neighborhood information, the Bag-of-neighbors model gets 90%, which serves as a very strong baseline, but without the full structural context the model cannot approximate the octet rule. Similarly, by only providing compositional information, the Bag-of-Atoms model performs significantly worse, showing that structural and neighboring information is important. Finally, the Unigram relies purely on the frequency of occurrence of elements in the dataset, thus always guessing that the masked atom is hydrogen, and performs poorly. We provide extended results on masking multiple atoms, transformer model sizes, and accuracy by length in the Supporting Information.

Table 2: Performance of our models for 1 masked atom per molecule. The uncertainty corresponds to the standard deviation of ten models trained with different start seeds.

Model                Octet Accuracy   Octet F1 (micro/macro)   Perplexity
bond-transformer
binary-transformer
bag-of-neighbors
bag-of-atoms
Unigram
octet-rule-unigram   100              100 / 100                1.002
ZINC - going beyond the octet rule
We consider the ZINC dataset as it cannot be fully explained by the octet rule and has a larger quantity of ambiguous elements than QM9. For example, with n_corrupt = 1, our ZINC test set contains more fluorine atoms to be predicted than the QM9 test set. Given that some elements, namely ions and hypervalent molecules, cannot be predicted by the octet rule, we add k-smoothing to the octet-rule-unigram model (a minimal sketch is given after Table 3). This avoids the case of 0 probability, which would result in infinite perplexity loss. We optimize k on the validation set and found the optimum at k = 1842 (see Figure S.2). From Table 3 we see that the octet-rule-unigram model no longer has 100% octet F1, which emphasizes to what extent the dataset cannot be fully explained by the octet rule, due to molecules with charges and hypervalency. Both our transformer models perform similar to or better than the octet-rule-unigram when evaluated on octet F1, sample F1 and sample perplexity. This is especially the case with F1 macro, which puts more emphasis on the underrepresented cases, which in our case are the most interesting. This indicates that the transformer models have also learned to discriminate between elements that should be equally likely from the perspective of the octet rule, but might have higher likelihood under a given structure.

Table 3: Performance of our models for 1 masked atom per molecule. The uncertainty corresponds to the standard deviation of ten models trained with different start seeds.

Model                Octet F1 (micro/macro)   Sample F1 (micro/macro)   Perplexity
bond-transformer
binary-transformer
octet-rule-unigram
bag-of-neighbors
bag-of-atoms
Unigram
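The add-k smoothing used for the baseline can be read as a standard Laplace-style correction of the counts; the sketch below is an assumption about the exact form, not a quotation of the paper's implementation.

```python
def smooth_unigram(counts, k=1842):
    """Add-k smoothing of element counts so that otherwise unpredicted elements
    receive non-zero probability and the perplexity stays finite."""
    total = sum(counts.values()) + k * len(counts)
    return {a: (c + k) / total for a, c in counts.items()}
```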
In Figure 3 we evaluate sample accuracy for an increasing number of masked atoms. The Bond-Transformer is barely affected, even when all the atoms in the molecule are masked. This suggests that the model primarily uses the structural information (bond type and connections). The Binary-Transformer, however, drops slightly in accuracy as the molecule is corrupted. This makes sense, as without bond type information the model can use the labels of the remaining atoms to infer the bond types, but as we corrupt more, we limit the available information in the molecule. The same is the case for the Bag-of-neighbors. In the case of Bag-of-atoms, the model seems to converge to the Unigram.
Figure 3: Sample accuracy of the models, evaluated for different numbers of masked atoms on the ZINC dataset. Error bars correspond to the standard deviation of 10 models trained with different start seeds.

To investigate whether our models can understand ions, we have visualized the confusion matrix for atoms with four covalent bonds in Figure 4 (other bond order confusion matrices can be found in the Supporting Information). For a masked atom with four covalent bonds the possible classes in the dataset are C, an N+ ion, or a hypervalent S. The confusion matrix shows that while our Octet-rule-unigram model only predicts C, both the Binary-transformer and Bond-transformer have learned that both S and N can have four covalent bonds and how to discriminate between them. Thus the models seem to have successfully learned a more complex structure rule than the octet rule. To better understand the models' success in predicting hypervalent elements we visualize the confusion matrices for five and six covalent bonds in Figures S.8 and S.9 (see Supporting Information). With five covalent bonds we only have one occurrence of P, which is correctly predicted by the bond-transformer. For six covalent bonds, both transformers correctly predict all elements with S. To assess the models' ability to predict ambiguous elements we visualize the confusion matrix for one covalent bond in Figure S.5. In particular, we find that both transformer models (binary-transformer/bond-transformer) can successfully predict a large number of F atoms (279/270) while only misclassifying a small number of H (23/21) as F. For future investigations, we find that the QM9 and
ZINC datasets are heavily biased towards H and C. This might make training difficult due to dataset imbalances and could be improved by oversampling rare elements.
Figure 4: Confusion matrix for the test set, with n_corrupt = 1, where the masked atom has four covalent bonds. We provide this matrix for the octet-rule-unigram, binary-transformer, and bond-transformer.

Qualitative results
To investigate the binary-transformer corrections of atoms in a molecule, we inspect a few interesting predictions on the ZINC dataset. We show the molecules with the predicted conditional probabilities of the possible element labels on the masked atoms. Figure 5a illustrates an example where the model correctly predicts N, even though N− ions are very rare in the dataset. It also puts a reasonable amount of probability on the target being O, which could be a valid guess assuming the octet rule applies. In Figure 5b we see an example of a hypervalent S, which our model correctly predicts with very high certainty. The hypervalent S often appears in the dataset with two double-bonded O, which might be a giveaway for the model. The example in Figure 5c does not have an immediate explanation, but the model is very certain of its prediction, which is also correct. The context of elements with one covalent bond is expected to be identical under the octet rule, since each has a single neighbor among the other elements in the data, but since the data is heavily biased towards hydrogen it is worth checking if the predicted probabilities are also biased. From Figure 5d, we see that even though the model incorrectly predicts H, the second most likely guess of Cl is correct, even though F appears twice as often in the dataset. A similar case can be seen in Figure 5e, where the model is in doubt between two elements that both could be considered correct under the octet rule. Finally, in Figure 5f we have an example where the model is very certain but makes a completely wrong prediction.
Figure 5: Predicted atom probabilities. The molecule corresponds to the true molecule, where the colored atom is the target we want to predict. Green corresponds to correct and red to wrong predictions.

Conclusion
In this work we have introduced the binary-transformer and bond-transformer models, and evaluated their ability to recover masked atoms in an undirected molecular graph with discrete representations of bonds. The models achieve near-perfect octet F1-micro on the QM9 dataset while masking 1 atom per molecule, suggesting that they are capable of learning the octet rule, which is the underlying selection criterion for the QM9 dataset. When evaluated on the ZINC dataset, which contains more complex structure rules, our transformer models outperform the octet-rule-unigram model in all metrics, including octet F1-micro, when masking 1 atom per molecule. When paired with the analysis of the confusion matrices, this indicates that the models have learned rules that exceed the octet rule, covering ions and hypervalent molecules. Deep learning models are extremely flexible, and we have shown that the transformer architecture, which makes no assumption about the number of atoms or bonds in a molecule, could in theory model a wide variety of molecular rules. With the high accuracy on the QM9 and ZINC datasets we hypothesize that the transformer models, both the bond and binary based versions, could be well suited for learning other molecular rules, such as structure rules related to properties. As inference with the transformer is cheap, correcting billions of molecules is therefore possible. The transformer model and embeddings made from undirected molecular graphs may furthermore be useful in chemical discovery tasks such as automatically generating and enumerating new molecules. Moreover, years of progress in language modeling for NLP have given rise to strong contextual representations that are now the de facto standard for state-of-the-art models on close to every popular dataset for benchmarking neural network performance. In particular, these pretrained language models work surprisingly well in areas of limited labeled data, something that is fairly prevalent in many molecular chemistry tasks as data might be expensive to gather.
Acknowledgments
This research is funded by the Innovation Foundation Denmark through the DABAI project.
References

(1) Ertl, P.; Rohde, B.; Selzer, P. Fast calculation of molecular polar surface area as a sum of fragment-based contributions and its application to the prediction of drug transport properties. Journal of Medicinal Chemistry, 3714–3717.
(2) Lo, Y.-C.; Rensi, S. E.; Torng, W.; Altman, R. B. Machine learning in chemoinformatics and drug discovery. Drug Discovery Today, 1538–1546.
(3) Ulissi, Z. W.; Medford, A. J.; Bligaard, T.; Nørskov, J. K. To address surface reaction network complexity using scaling relations machine learning and DFT calculations. Nature Communications, 14621.
(4) Boes, J. R.; Mamun, O.; Winther, K.; Bligaard, T. Graph Theory Approach to High-Throughput Surface Adsorption Structure Generation. The Journal of Physical Chemistry A.
(5) Van Geem, K. M.; Pyl, S. P.; Marin, G. B.; Harper, M. R.; Green, W. H. Accurate high-temperature reaction networks for alternative fuels: butanol isomers. Industrial & Engineering Chemistry Research, 10399–10420.
(6) Broadbelt, L. J.; Pfaendtner, J. Lexicography of kinetic modeling of complex reaction networks. AIChE Journal, 2112–2121.
(7) Fink, T.; Bruggesser, H.; Reymond, J.-L. Virtual exploration of the small-molecule chemical universe below 160 daltons. Angewandte Chemie International Edition, 1504–1508.
(8) Blum, L. C.; Reymond, J.-L. 970 million druglike small molecules for virtual screening in the chemical universe database GDB-13. Journal of the American Chemical Society, 8732–8733.
(9) Ruddigkeit, L.; Van Deursen, R.; Blum, L. C.; Reymond, J.-L. Enumeration of 166 billion organic small molecules in the chemical universe database GDB-17. Journal of Chemical Information and Modeling, 2864–2875.
(10) Ramakrishnan, R.; Dral, P. O.; Rupp, M.; Von Lilienfeld, O. A. Quantum chemistry structures and properties of 134 kilo molecules. Scientific Data, 140022.
(11) Elton, D. C.; Boukouvalas, Z.; Fuge, M. D.; Chung, P. W. Deep learning for molecular generation and optimization - a review of the state of the art. arXiv preprint arXiv:1903.04388.
(12) Li, Y.; Zhang, L.; Liu, Z. Multi-objective de novo drug design with conditional graph generative model. Journal of Cheminformatics, 33.
(13) Salakhutdinov, R. Learning deep generative models. Annual Review of Statistics and Its Application, 361–385.
(14) De Cao, N.; Kipf, T. MolGAN: An implicit generative model for small molecular graphs. arXiv preprint arXiv:1805.11973.
(15) You, J.; Liu, B.; Ying, Z.; Pande, V.; Leskovec, J. Graph convolutional policy network for goal-directed molecular graph generation. Advances in Neural Information Processing Systems. 2018; pp 6410–6421.
(16) Gómez-Bombarelli, R.; Wei, J. N.; Duvenaud, D.; Hernández-Lobato, J. M.; Sánchez-Lengeling, B.; Sheberla, D.; Aguilera-Iparraguirre, J.; Hirzel, T. D.; Adams, R. P.; Aspuru-Guzik, A. Automatic chemical design using a data-driven continuous representation of molecules. ACS Central Science, 268–276.
(17) Jaeger, S.; Fulle, S.; Turk, S. Mol2vec: Unsupervised Machine Learning Approach with Chemical Intuition. Journal of Chemical Information and Modeling, 27–35, PMID: 29268609.
(18) Zheng, S.; Yan, X.; Yang, Y.; Xu, J. Identifying Structure–Property Relationships through SMILES Syntax Analysis with Self-Attention Mechanism. Journal of Chemical Information and Modeling, 914–923.
(19) Mater, A. C.; Coote, M. L. Deep Learning in Chemistry. Journal of Chemical Information and Modeling.
(20) Bengio, Y.; Ducharme, R.; Vincent, P.; Janvin, C. A Neural Probabilistic Language Model. J. Mach. Learn. Res., 1137–1155.
(21) Devlin, J.; Chang, M.-W.; Lee, K.; Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
(22) Vincent, P.; Larochelle, H.; Bengio, Y.; Manzagol, P.-A. Extracting and Composing Robust Features with Denoising Autoencoders. Proceedings of the 25th International Conference on Machine Learning. New York, NY, USA, 2008; pp 1096–1103.
(23) Gerratt, J.; Cooper, D.; Karadakov, P. B.; Raimondi, M. Modern valence bond theory. Chemical Society Reviews, 87–100.
(24) Gómez-Bombarelli, R.; Wei, J. N.; Duvenaud, D.; Hernández-Lobato, J. M.; Sánchez-Lengeling, B.; Sheberla, D.; Aguilera-Iparraguirre, J.; Hirzel, T. D.; Adams, R. P.; Aspuru-Guzik, A. Automatic chemical design using a data-driven continuous representation of molecules. ACS Central Science, 268–276.
(25) Irwin, J. J.; Sterling, T.; Mysinger, M. M.; Bolstad, E. S.; Coleman, R. G. ZINC: a free tool to discover chemistry for biology. Journal of Chemical Information and Modeling, 1757–1768.
(26) Bengio, Y.; Ducharme, R.; Vincent, P.; Janvin, C. A Neural Probabilistic Language Model. J. Mach. Learn. Res., 1137–1155.
(27) Jurafsky, D.; Martin, J. H. Speech and Language Processing (2nd Edition); Prentice-Hall, Inc.: Upper Saddle River, NJ, USA, 2009.
(28) Mikolov, T.; Sutskever, I.; Chen, K.; Corrado, G.; Dean, J. Distributed Representations of Words and Phrases and Their Compositionality. Proceedings of the 26th International Conference on Neural Information Processing Systems - Volume 2. USA, 2013; pp 3111–3119.
(29) Mikolov, T.; Karafiát, M.; Burget, L.; Cernocký, J.; Khudanpur, S. Recurrent neural network based language model. INTERSPEECH. 2010; pp 1045–1048.
(30) Zaremba, W.; Sutskever, I.; Vinyals, O. Recurrent Neural Network Regularization. CoRR, abs/1409.2329.
(31) Merity, S.; Keskar, N. S.; Socher, R. Regularizing and Optimizing LSTM Language Models. 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings. 2018.
(32) Hinton, G. E.; Salakhutdinov, R. R. Reducing the dimensionality of data with neural networks. Science, 504–507.
(33) Hansen, K.; Biegler, F.; Ramakrishnan, R.; Pronobis, W.; Von Lilienfeld, O. A.; Müller, K.-R.; Tkatchenko, A. Machine learning predictions of molecular properties: accurate many-body potentials and nonlocality in chemical space. The Journal of Physical Chemistry Letters, 2326–2331.
(34) Hansen, M. H.; Torres, J. A. G.; Jennings, P. C.; Wang, Z.; Boes, J. R.; Mamun, O. G.; Bligaard, T. An Atomistic Machine Learning Package for Surface Science and Catalysis. arXiv preprint arXiv:1904.00904.
(35) Mikolov, T.; Chen, K.; Corrado, G.; Dean, J. Efficient Estimation of Word Representations in Vector Space. 1st International Conference on Learning Representations, ICLR 2013, Scottsdale, Arizona, USA, May 2-4, 2013, Workshop Track Proceedings. 2013.
(36) Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Advances in Neural Information Processing Systems. 2017; pp 5998–6008.
(37) Bahdanau, D.; Cho, K.; Bengio, Y. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.
(38) Shaw, P.; Uszkoreit, J.; Vaswani, A. Self-Attention with Relative Position Representations. CoRR, abs/1803.02155.
(39) Ba, L. J.; Kiros, R.; Hinton, G. E. Layer Normalization. CoRR, abs/1607.06450.
(40) He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. CoRR, abs/1512.03385.
(41) Srivastava, R. K.; Greff, K.; Schmidhuber, J. Highway Networks. CoRR, abs/1505.00387.
(42) Glorot, X.; Bordes, A.; Bengio, Y. Deep Sparse Rectifier Neural Networks. Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics. Fort Lauderdale, FL, USA, 2011; pp 315–323.
(43) Luong, M.-T.; Pham, H.; Manning, C. D. Effective approaches to attention-based neural machine translation. arXiv preprint arXiv:1508.04025.
(44) Weininger, D. SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules. Journal of Chemical Information and Computer Sciences, 31–36.
(45) Landrum, G. RDKit: Open-source cheminformatics.
(46) Sutton, R. S.; Barto, A. G. Reinforcement Learning: An Introduction; 2018.
(47) Kingma, D. P.; Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
(48) Paszke, A.; Gross, S.; Chintala, S.; Chanan, G.; Yang, E.; DeVito, Z.; Lin, Z.; Desmaison, A.; Antiga, L.; Lerer, A. Automatic Differentiation in PyTorch. NIPS Autodiff Workshop. 2017.
(49) Yutaka, S. The truth of the F-measure. Teach Tutor Mater, 1–5.
(50) Buda, M.; Maki, A.; Mazurowski, M. A. A systematic study of the class imbalance problem in convolutional neural networks. Neural Networks, 249–259.
(51) Peters, M. E.; Neumann, M.; Iyyer, M.; Gardner, M.; Clark, C.; Lee, K.; Zettlemoyer, L. Deep contextualized word representations. arXiv preprint arXiv:1802.05365.
(52) Liu, X.; He, P.; Chen, W.; Gao, J. Multi-Task Deep Neural Networks for Natural Language Understanding. CoRR, abs/1901.11504.

Appendix

Table S.1: Description of variables used.

Variable                 Description
G                        Graph, defined as a set of nodes and edges (V, E)
V                        Set of nodes (atoms) in the graph
V_subset                 Set of masked atoms; |V_subset| = n_corrupt
n_corrupt                Number of atoms corrupted per molecule
E                        Adjacency matrix (E_ij ∈ {0, 1, 2, 3})
G̃                        Corrupted graph, with V_subset replaced by mask tokens
C_i ∈ R^{d_transform}    i'th head of the attention function for an atom

Table S.2: Training time of our different models on the QM9 and ZINC datasets.

Model                Training time (min)   Dataset
binary-transformer   110                   QM9
binary-transformer   482                   ZINC
bond-transformer     112                   QM9
bond-transformer     484                   ZINC
bag-of-atoms         71                    QM9
bag-of-atoms         158                   ZINC
bag-of-neighbors     72                    QM9
bag-of-neighbors     144                   ZINC
Figure S.1: Validation perplexity of the binary and bond transformers – with and without the ε-greedy masking strategy – for different numbers of masked atoms. (a) is on the QM9 dataset and (b) is on the ZINC dataset.

Figure S.2: Cross entropy as a function of the smoothing constant k, evaluated on the ZINC validation dataset. The optimum is at k = 1842.
Figure S.3: Perplexity on the validation dataset – with one atom masked per molecule – for each epoch of training. (a) is a bond transformer trained on QM9, (b) is a bond transformer trained on ZINC, (c) is a binary transformer trained on QM9 and (d) is a binary transformer trained on ZINC. Each panel compares attending over all nodes and edges, neighbor nodes only, and neighbor nodes and edges.
QM9 extended results
From Table S.3 and Figures S.4a and S.4b we see that as we mask more atoms per molecule, the bond-transformer maintains a perfect score, since it can solve the task by only looking at the bonds. The Binary-transformer drops slightly in performance as we mask more atoms. The Bag-of-neighbors does not seem to depend on the number of masked atoms. This indicates that the model most likely bases its predictions on the number of neighbors, which can also be an indication of the number of covalent bonds. As we remove all information except composition, the bag-of-atoms model drops significantly as we mask more atoms, reaching performance similar to the Unigram as we approach fully masked molecules. This is no surprise, as a fully masked molecule only gives the model information about the number of atoms, which should not be enough to infer anything.

Table S.3: Performance of our models for 1, 5 and 30 masked atoms per molecule. acc is octet rule accuracy, F1 is the octet rule F1-micro score and PP is the sample perplexity, each averaged over the test set. The uncertainty corresponds to the standard deviation of ten models trained with different start seeds.

Model                Metric   n_mask = 1   n_mask = 5   all masked
octet-rule-unigram   acc      100          100          100
                     F1       100          100          100
                     PP       1.002        1.002        1.002
bond-transformer     acc
                     F1
                     PP
binary-transformer   acc      99.73
                     F1
                     PP
bag-of-neighbors     acc      90.7
                     F1
                     PP
bag-of-atoms         acc      65.8
                     F1
                     PP
Unigram              acc      47.3         47.2         48.3
                     F1       47.3         47.2         48.3
                     PP       3.104        3.113        3.038
Figure S.4: Octet F1 micro (a) and octet F1 macro (b) evaluated for different numbers of masked atoms. Octet F1 micro (c) and octet F1 macro (d) evaluated on molecules of varying size, with 1 atom masked. Error bars correspond to the standard deviation of 10 models trained with different start seeds.

The transformer model is very flexible in terms of modeling capability, like any other deep learning model, so to gauge the complexity of the task we evaluate five binary-transformer models of various sizes, which can be seen in Table S.4. Here we see that even very small transformer models perform well. As the models increase in number of parameters the performance increases, which however comes at a cost of computation and memory consumption.

Table S.4: Performance of binary-transformer models with different numbers of trainable parameters, for 1 and 5 masked atoms per molecule. acc is octet accuracy, F1 is the octet F1-score and PP is perplexity, each averaged over the test set. t_train is the training time.

Model                          Metric   n_mask = 1   n_mask = 5   t_train (min)   Parameters
layers=1, heads=1, d_emb=4     acc      86.0         85.8         60              199
                               F1       86.0         85.8
                               PP       1.426        1.441
layers=2, heads=1, d_emb=4     acc      89.9         89.8         63              265
                               F1       89.9         89.8
                               PP       1.261        1.272
layers=2, heads=3, d_emb=64    acc      96.3         94.4         77              118149
                               F1       96.3         94.4
                               PP       1.089        1.130
layers=4, heads=3, d_emb=64    acc      98.4         97.4         82              234885
                               F1       98.4         97.3
                               PP       1.031        1.056
layers=8, heads=6, d_emb=64    acc      99.8         97.9         110             866181
                               F1       99.8         97.9
                               PP       1.008        1.045

ZINC extended results
From Figure S.5 we see that both our transformer models have learned to discriminate between certain elements that should be indistinguishable under the octet rule, like F, but also to allow for ions in the form of O−. A similar story can be seen in Figure S.6, where we have ambiguity between O and S but also N− ions. Figures S.7, S.8 and S.9 do not provide any further insights, as the dataset is too biased and almost only contains one type of element for each of these numbers of covalent bonds.
Figure S.5: Confusion matrix for cases where the masked atom has one covalent bond.
Figure S.6: Confusion matrix for cases where the masked atom has two covalent bonds.
Figure S.7: Confusion matrix for cases where the masked atom has three covalent bonds.
Figure S.8: Confusion matrix for cases where the masked atom has five covalent bonds.
Figure S.9: Confusion matrix for cases where the masked atom has six covalent bonds.

In Figures S.10 and S.11 we see the same story as outlined in the main text, namely that the Bond-transformer outperforms the octet-rule-unigram model, and the Binary-transformer also performs similar to or better than it, depending on the number of masks and the metric used to evaluate.
Figure S.10: Sample perplexity evaluated for different numbers of masked atoms. Error bars correspond to the standard deviation of 10 models trained with different start seeds.
Figure S.11: Octet F1 micro (a), octet F1 macro (b), sample F1 micro (c) and sample F1 macro (d) evaluated for different numbers of masked atoms. Error bars correspond to the standard deviation of 10 models trained with different start seeds.
[Final supplementary figure, caption not recovered: octet F1 micro (a), octet F1 macro (b), sample F1 micro (c) and sample F1 macro (d) for all models, evaluated on molecules of varying length.]