Adaptive Invariance for Molecule Property Prediction

Abstract

Effective property prediction methods can help accelerate the search for COVID-19 antivirals either through accurate in-silico screens or by effectively guiding ongoing at-scale experimental efforts. However, existing prediction tools have limited ability to accommodate the scarce or fragmented training data currently available. In this paper, we introduce a novel approach to learn predictors that can generalize or extrapolate beyond the heterogeneous data. Our method builds on and extends recently proposed invariant risk minimization, adaptively forcing the predictor to avoid nuisance variation. We achieve this by continually exercising and manipulating latent representations of molecules to highlight undesirable variation to the predictor. To test the method we use a combination of three data sources: SARS-CoV-2 antiviral screening data, molecular fragments that bind to SARS-CoV-2 main protease and large screening data for SARS-CoV-1. Our predictor outperforms state-of-the-art transfer learning methods by a significant margin. We also report the top 20 predictions of our model on the Broad drug repurposing hub.


Wengong Jin Regina Barzilay Tommi Jaakkola

CSAIL, Massachusetts Institute of Technology {wengong,regina,tommi}@csail.mit.edu


The race to identify promising repurposing drug candidates against COVID-19 calls for improvements in the underlying property prediction methodology. The accuracy of many existing techniques depends heavily on access to reasonably large, uniform training data. Such high-throughput, on-target screening data is not yet publicly available for COVID-19. Indeed, we have only 48 drugs with measured in-vitro SARS-CoV-2 activity shared with the research community [6]. This limited data scenario is not unique to the current pandemic but likely to recur with each evolving or new viral challenge. The ability to make accurate predictions based on all the available data, however limited, is also helpful in guiding later high-throughput targeted experimental effort.

We can supplement scarce on-target data with other related data sources, either related screens pertaining to COVID-19 or screens involving related viruses. For instance, we can use additional data pertaining to molecular fragment screens that measure binding to SARS-CoV-2 main protease, obtained via crystallography screening [9]. On average, these fragments consist of only 14 atoms, comprising roughly 37% of full drug size molecules. Another source of data is SARS-CoV-1 screens. Since SARS-CoV-1 and SARS-CoV-2 proteases are similar (more than 79% sequence identity) [12], drugs screened against SARS-CoV-1 can be expected to be relevant for SARS-CoV-2 predictions. These two examples highlight the challenges for property prediction tools: much of the available training data comes from either different chemical space (molecular fragments) or different viral species (SARS-CoV-1).

The key technological challenge is to be able to estimate models that can extrapolate beyond their training data, e.g., to different chemical spaces. The ability to extrapolate implies a notion of invariance (being impervious) to the differences between the available training data and where predictions are sought.
A recently proposed approach known as invariant risk minimization (IRM) [1] seeks to find predictors that are simultaneously optimal across different such scenarios (called environments). Indeed, the differences in chemical spaces can be thought of as "nuisance variation" that the predictor should be explicitly forced to ignore. One possible way to automatically define this type of environment variability for molecules is scaffolds [2]. But the setting is challenging since scaffolds are combinatorial descriptors (substructures) and can potentially uniquely identify each compound in the training data. Useful environments for estimation should enjoy some statistical support.

In this paper we propose a novel variant of invariant risk minimization specifically tailored to rich, combinatorially defined environments typical in molecular contexts. Indeed, unlike in standard IRM, we introduce two dynamic (in contrast to many static) environments. These are defined over the same set of training examples, but differ in terms of their associated latent representations. The difference between them arises from continually adjusted perturbations that manipulate the latent representations of compounds towards more "generic" versions with the help of a scaffold classifier. The idea is to explicitly highlight to the property predictor that operates on these latent representations what the nuisance variability is that it should not rely on.

Our method is evaluated on existing SARS-CoV-2 screening data [9, 6]. The training utilizes three sources of data: SARS-CoV-2 screened molecules, SARS-CoV-2 fragments and SARS-CoV-1 screening data described above. We compare against multiple transfer learning techniques such as domain adversarial training [5] and conditional domain adversarial network [7]. On two SARS-CoV-2 datasets, the proposed approach outperforms the best performing baseline with 8-16% relative AUROC improvement.
Finally, we apply our model on the Broad drug repurposing hub [4] and report the top 20 predictions for further investigation.

Training data in many emerging applications is necessarily limited, fragmented, or otherwise heterogeneous. It is therefore important to ensure that model predictions derived from such data generalize substantially beyond where the training samples lie. In other words, the trained model should have the ability to extrapolate. For instance, in computational chemistry, it is desirable for property prediction models to perform well in time-split scenarios where the evaluation concerns compounds that were created after those in the training set. Another way to simulate evaluation on future compounds is through a scaffold split [2]. A scaffold split between training and test introduces some structural separation between the chemical spaces of the two sets of compounds, hence evaluating the model's ability to extrapolate to a new domain.

One way to ensure domain extrapolation is to enforce an appropriate invariance criterion during training. We envision here that the compounds X can be divided into potentially a large number of domains or "environments" E, for example, based on their scaffold. The goal is then to learn a parametric mapping Z = φ(X) of compounds X to their latent representations Z in a manner that satisfies the chosen invariance criterion. A number of such strategies relevant to extrapolation have been proposed. They can be roughly divided into the following three categories:

• Domain adversarial training [5] enforces the latent representation Z = φ(X) to have the same distribution across different domains E. If we denote by P(X | E) the conditional distribution of compounds in environment E, then we want P(φ(X) | E) = P(φ(X)) for all E. With some abuse of notation, we can write this condition as Z ⊥ E. A single predictor is learned based on Z = φ(X), i.e., all the domains share the same predictor. As a result, the predicted label distribution P(Y) will also be the same across the domains. This can be problematic when the training and test domains have very different label distributions [11]. The independence condition itself can be challenging to satisfy when the chemical spaces overlap across the environments.

• Conditional domain adaptation [7] relaxes the requirement that the label distributions must agree across the environments. The key idea is to condition the invariance criterion on the label. In other words, we require that P(φ(X) | E, Y) = P(φ(X) | Y) for all E and Y, i.e., we aim to satisfy the independence statement Z ⊥ E | Y. The formulation allows the label distribution to vary between domains since E and Y can depend on each other. The constraint remains, however, too restrictive about the latent representation. To illustrate this, consider a simple case where the environments share the same chemical space and differ only in terms of proportions of different types of compounds in them. These type proportions play roles analogous to label proportions in domain adversarial training. Hence, the only way to achieve Z ⊥ E | Y would be if the proportions were the same across environments. To state the example differently, a functional mapping Z = φ(X) cannot fractionally assign probability mass placed on X to different latent space locations Z; it all has to be mapped to a single location. To reduce the impact of the strict condition, we would have to introduce P(Z | φ(X), E) in place of the simpler functional mapping Z = φ(X), further complicating the approach.

• Invariant risk minimization (IRM) [1] seeks a different notion of invariance, focusing less on aligning distributions of latent representations, and instead shifting the emphasis to how those representations can be consistently used for predictions. The IRM principle requires that the predictor f operating on Z = φ(X) is simultaneously optimal across different environments or domains. For example, this holds if our representation explicates only features that are (causally) necessary for the correct prediction. How φ(X) is distributed across the environments is then immaterial. The associated conditional independence criterion is Y ⊥ E | Z. In other words, knowing the environment shouldn't provide any additional information about Y beyond the features Z = φ(X). The distribution of labels can differ across the environments.

Figure 1: Left: Illustration of base GCN model for molecule property prediction. Right: IRM with adaptive environments. The model perturbs the representation φ(x) into φ(x) + δ(x) using the gradient from scaffold classifier g. The predictor f is trained to be simultaneously optimal across two environments. The scaffold s is a subgraph of a molecule x with its side chains removed [2].

While the IRM principle provides a natural framework for domain extrapolation, it needs to be extended in several ways for our setting. The main limitation of the original framework is that the environments E themselves are fixed and pre-defined. Their role in the principle is to illustrate "nuisance" variation, i.e., variability that the predictor should learn not to rely on. In order to enforce the associated independence criterion, we need a fair number of examples within each such environment.
The approach therefore becomes unsuitable when the natural environments such as scaffolds are combinatorially defined or otherwise have high cardinality. Indeed, we might have only a single molecule per environment in our training set, making the independence criterion vacuous (E would uniquely specify X, thus also Z and Y). A straightforward remedy for the high cardinality environments would be to introduce a coarser definition, and enforce the principle at this coarse level instead. Since environments represent constraints on the predictor, their role in estimation is adversarial. What is then the appropriate trade-off between such a coarser definition (relaxation of constraints) and our ability to predict? We side-step having to answer this question, and instead propose to dynamically map the large number of environments to just two. These two environments are designed to nevertheless highlight the nuisance variation the predictor should avoid but do so in a tractable manner.

Our goal is to adaptively highlight to the predictor the type of variability that it ought not to rely on. We do this by replacing high cardinality environments such as those based on scaffold with just two new environments. These two new environments are unusual in the sense that they share the exact same set of examples. Indeed, they only differ in terms of the representation that the predictor operates on. The first environment simply corresponds to the representation we are trying to learn, i.e., z = φ(x), where the lowercase letters refer to specific instances rather than random variables. The second environment is defined in terms of a modified representation φ(x) + δ(x) that is a perturbed version of φ(x) and constructed with the help of the environment or scaffold classifier. The goal of δ(x) is to explicate directions in the latent representation that the predictor should avoid paying attention to.
While traditional IRM environments divide examples x into environments, often exclusively, we instead exercise different latent representations over the same set of examples.

More formally, our two environments correspond to a choice of perturbation h ∈ {0, δ} used to derive the latent representation z from x, i.e., φ(x) + h(x). The associated target labels are clearly the same regardless of which perturbation (none or δ(x)) was chosen. The key part of our approach pertains to how δ(x) is defined. To this end, let g(φ(x)) be a parametric environment classifier that we will instantiate in detail later. The associated classification loss is ℓ(s(x), g(φ(x))) where s(x) is the correct original environment label (here a scaffold) for x. The scaffold classifier is evolved together with the feature mapping φ and the associated predictor f. We define the non-zero perturbation δ(x) in terms of the gradient:

$$ \delta(x) = \alpha \nabla_z \ell(s(x), g(z)) \big|_{z = \phi(x)} \qquad (1) $$

where α is a step size parameter. The goal of this perturbation is to turn φ(x) into its "generic" version φ(x) + δ(x), which contains less information about the environment (e.g., scaffold). Note that if we were to perform adversarial domain alignment, δ(x) would represent a reverse gradient update to modify φ(x). We do not do that; instead, we use the perturbation to highlight directions of variability that the predictor f should avoid within an overall IRM formulation. The degree to which φ(x) is adjusted in response to δ(x) arises from the IRM principle, not from a direct alignment objective.

We begin by building the overall training objective, which is then optimized in batches as described in Algorithm 1. Let (x_i, y_i) be a training example paired with the associated label to predict. Each x_i also has an environment label/features given by s_i = s(x_i) (the original mapping of examples to environments is assumed given and fixed, defined by s(x)).
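The perturbation in Eq. (1) can be illustrated with a small sketch. It is not the paper's implementation: purely for illustration it assumes a linear softmax environment classifier g(z) = softmax(Wz) with a cross-entropy loss, and the names `W`, `scaffold_loss` and `perturbation` as well as the toy numbers are hypothetical. Under these assumptions the gradient with respect to z has the closed form Wᵀ(p − onehot(s)).

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def classifier_logits(W, z):
    """Logits of a hypothetical linear environment classifier g(z) = softmax(W z)."""
    return [sum(w * v for w, v in zip(row, z)) for row in W]

def scaffold_loss(z, W, s):
    """Cross-entropy loss l(s, g(z)) for the true environment index s."""
    return -math.log(softmax(classifier_logits(W, z))[s])

def perturbation(z, W, s, alpha):
    """delta = alpha * grad_z l(s, g(z)), evaluated at z = phi(x), as in Eq. (1)."""
    p = softmax(classifier_logits(W, z))
    p[s] -= 1.0  # gradient of the loss w.r.t. the logits is p - onehot(s)
    grad = [sum(W[k][j] * p[k] for k in range(len(W))) for j in range(len(z))]
    return [alpha * g for g in grad]

# Toy values (assumptions): 3 environments, 2-dimensional latent space.
W = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
z = [0.5, -0.2]   # stands in for phi(x)
s = 0             # true environment (scaffold) index
delta = perturbation(z, W, s, alpha=0.1)
z_generic = [a + b for a, b in zip(z, delta)]

# The perturbed representation is harder to classify: the loss goes up.
print(scaffold_loss(z, W, s), scaffold_loss(z_generic, W, s))
```

Because the step follows the gradient of the classification loss, the perturbed vector carries less information about the environment, which matches the "generic" version described above. In a neural implementation the same gradient would come from automatic differentiation rather than a closed form.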
The environment classifier is trained to minimize

$$ L_e(g \circ \phi) = \sum_i \ell(s_i, g(\phi(x_i))) \qquad (2) $$

As we will explain later on, the environment classifier remains "unaware" of how the perturbation is derived on the basis of its predictions. The loss of the predictor f, now operating on φ(x) + h(x), where h ∈ {0, δ}, is defined as

$$ L(f \circ (\phi + h)) = \sum_i \ell\big(y_i, f(\phi(x_i) + h(x_i))\big), \quad h \in \{0, \delta\} \qquad (3) $$

The specific form of the loss depends on the prediction task. In accordance with the IRM principle, we enforce that the predictor operating on z = φ(x) + h(x) remains optimal whether its input is φ(x) or the perturbed version φ(x) + δ(x). In other words, we require that

$$ L(f \circ (\phi + h)) \le \min_{h \in \{0, \delta\}} L(f_h \circ (\phi + h)) \qquad (4) $$

where f_h is a predictor in the same parametric family as f but trained separately with the knowledge of h (perturbed or not). By relaxing the constraints via Lagrange multipliers, we express the overall training objective as

$$ \lambda_e L_e(g \circ \phi) + \sum_{h \in \{0, \delta\}} \Big[ (1 + \lambda_h) L(f \circ (\phi + h)) - \lambda_h L(f_h \circ (\phi + h)) \Big] \qquad (5) $$

This minimax objective is minimized with respect to φ, g, and f, and maximized with respect to f_h, h ∈ {0, δ}. A few remarks are necessary concerning this objective:

• Even though δ is defined on the basis of φ and the environment classifier g, we view it as a functionally independent player. The goal of δ is to enforce optimality of f and therefore it plays an adversarial role relative to f. Similarly to GAN objectives where the discriminator has a separate objective function, different from the generator, we separate out δ as another player in an overall game theoretic objective. Specifically, δ takes input from φ and g but does not inform them in return in back-propagation.

• φ in our objective is adjusted to also help the auxiliary environment classifier. This is contrary to domain alignment where the goal would be to take out any dependence on the environment. The benefit in our formulation is two-fold. First, the term grounds φ also based on the auxiliary objective, helping it to retain useful information about each example x. Second, the term grounds and stabilizes the definition of δ as the gradient of the environment predictor since g no longer approaches a random predictor. The gradient would be weak if φ contained no information about the environment, as in domain alignment. Thus δ remains well-defined as a direction throughout the optimization.

Footnote: Incorporating this higher order dependence would not improve the empirical results.

The training procedure is shown in Algorithm 1.

Algorithm 1: Training IRM with adaptive environments
  for each training step do
    Sample a batch of molecules {x_1, ..., x_n} with their environments {s_1, ..., s_n}
    Encode molecules {x_1, ..., x_n} into vectors {φ(x_1), ..., φ(x_n)}
    Compute environment classification loss L_e(g ∘ φ) = Σ_i ℓ(s_i, g(φ(x_i)))
    Construct perturbed representation δ(x_i) = α ∇_{φ(x_i)} ℓ(s_i, g(φ(x_i)))
    Compute invariant predictor loss L(f ∘ φ)
    Compute competing predictor loss L(f_h ∘ (φ + h)), h ∈ {0, δ}
    Update φ, f, g to minimize Eq. (5)
    Update f_0, f_δ to maximize Eq. (5)
  end for

In molecule property prediction, the training data is a collection of pairs {(x_i, y_i)}, where x_i is a molecular graph and y_i is its activity score, typically binary (active/inactive). The feature extractor φ(·) is a graph convolutional network (GCN) which translates a molecular graph into a continuous vector through directed message passing operations [10]. The predictor f is a feed-forward network that takes φ(x) or φ(x) + δ(x) as input and yields predicted activity y.

The original environment of each compound x_i is defined as its Murcko scaffold [2], which is a subgraph of x_i.
Since a scaffold is a combinatorial object with a large vocabulary of possible values, we define and train the environment classifier in a contrastive fashion [8]. Specifically, for a given molecule x_i with scaffold s_i, we randomly sample n other molecules and take their associated scaffolds {s_k} as negative examples, which form the contrastive set C. The environment classifier g makes use of a feed-forward network g_s that maps each compound or scaffold (subgraph) to a feature vector. The probability that x_i is mapped to its correct scaffold s_i is then defined as

$$ P_g(s = s_i \mid x_i, C) = \frac{\exp\{\mathrm{sim}(g_s(\phi(x_i)), g_s(\phi(s_i)))\}}{\sum_{k \in \{i\} \cup C} \exp\{\mathrm{sim}(g_s(\phi(x_i)), g_s(\phi(s_k)))\}} \qquad (6) $$

where sim(·, ·) stands for cosine similarity. In practice, we use the molecules within the same batch as negative examples.

Our experiments consist of two settings. To compare our method with existing transfer learning techniques, we first evaluate our methods on a standard unsupervised transfer setup. All the models are trained on SARS-CoV-1 data and tested on SARS-CoV-2 compounds. Next, in order to identify drug candidates for SARS-CoV-2, we extend our method by incorporating labeled SARS-CoV-2 data to maximize prediction accuracy and perform virtual screening over the Broad drug repurposing hub [4].
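The contrastive scaffold classifier of Eq. (6) can be sketched as follows. This is a minimal illustration rather than the authors' code: it assumes the vectors g_s(φ(x_i)) and g_s(φ(s_k)) have already been computed and are non-zero, and the function names (`cosine`, `scaffold_probability`, `contrastive_loss`) are hypothetical. Negatives are taken from the same batch, as described above, so molecule i's positive scaffold is the one at index i.

```python
import math

def cosine(u, v):
    """Cosine similarity sim(u, v) between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def scaffold_probability(mol_vec, scaffold_vecs, i):
    """P_g(s = s_i | x_i, C): softmax over cosine similarities, as in Eq. (6).

    scaffold_vecs holds the true scaffold at index i; every other entry
    plays the role of a negative example from the contrastive set C.
    """
    sims = [cosine(mol_vec, sv) for sv in scaffold_vecs]
    m = max(sims)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in sims]
    return exps[i] / sum(exps)

def contrastive_loss(mol_vecs, scaffold_vecs):
    """Summed negative log-likelihood over a batch with in-batch negatives."""
    return -sum(
        math.log(scaffold_probability(mol_vecs[i], scaffold_vecs, i))
        for i in range(len(mol_vecs))
    )

# Toy batch (assumed values): two molecules and their two scaffolds.
mols = [[1.0, 0.0], [0.0, 1.0]]
scaffolds = [[1.0, 0.0], [0.0, 1.0]]
print(scaffold_probability(mols[0], scaffolds, 0))
print(contrastive_loss(mols, scaffolds))
```

Since cosine similarity is bounded in [-1, 1], the probabilities in this form can never become fully saturated; whether a temperature is applied to the similarities is not stated in the text, so none is used here.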

Training data

Our training data consist of three screens related to SARS-CoV. All the data can be found at https://github.com/yangkevin2/coronavirus_data .

• SARS-CoV-2 Mpro inhibition: 881 fragments screened for SARS-CoV-2 main protease (Mpro) collected by the Diamond Light Source group [9]. The dataset contains 78 hits.

• SARS-CoV-2 antiviral activity: 48 FDA-approved drugs screened for antiviral activity against SARS-CoV-2 in vitro [6], including reference drugs such as Remdesivir, Lopinavir and Chloroquine. The dataset contains 27 hits.

• SARS-CoV-1 3CLpro inhibition: Over 290K molecules screened for activity against SARS-CoV-1 3C-like protease (3CLpro) in PubChem AID1706 assay. There are 405 active compounds.

Baselines

We compare the proposed approach with the following baselines:

• Direct transfer: We train a GCN on SARS-CoV-1 data and directly test it on SARS-CoV-2 data.

• Domain adversarial training (DANN) [5]: Since the distribution of molecules is different between SARS-CoV-1 and SARS-CoV-2 datasets, we use domain adversarial training to facilitate transfer. Specifically, we augment our GCN with an additional domain classifier g to enforce the distribution of φ(x) to be the same across training (SARS-CoV-1) and test set (SARS-CoV-2).

• Conditional adversarial domain adaptation (CDAN) [7] conditions the domain classifier g with predicted labels f(φ(x)). In particular, we adopt their multilinear conditioning strategy: the input to g becomes a vector outer-product φ(x) ⊗ f(φ(x)), which has the same dimension as φ(x) for binary classification tasks.

• Scaffold adversarial training (SANN): This is an extension of DANN where the domain classifier g is replaced with our scaffold classifier g_s in Eq. (6). SANN seeks to learn a scaffold-invariant representation φ(x) through the following minimax game (L_e is the scaffold classification loss):

$$ \min_{\phi, f} \max_{g} \; L(f \circ \phi) - \lambda_e L_e(g_s \circ \phi) \qquad (7) $$

• Invariant risk minimization (IRM): The original IRM [1] requires the predictor f to be constant, which does not work well in our setting. Therefore, we adopt an adversarial formulation for IRM proposed in [3], allowing us to use powerful neural predictors:

$$ \min_{\phi, f} \max_{f_1, \cdots, f_K} \sum_e (1 + \lambda_e) L(f \circ \phi) - \lambda_e L(f_e \circ \phi) \qquad (8) $$

Here each of the K environments consists of molecules with the same scaffold. Since the number of environments is large, we impose parameter sharing among the competing predictors f_1, ..., f_K. Specifically, the input of f_j = f′(φ(x), j) is a concatenation of φ(x) and a one-hot encoding of j.

Model hyperparameters

For our model, we set λ_h = λ_e = 0.… and perturbation learning rate α = 0.…, which worked well across all experiments. All methods are trained with Adam using its default configuration. Our GCN implementation is based on chemprop [10].

• For unsupervised transfer, we use chemprop's default hyper-parameter setting. For all methods, the GCN φ has three layers with hidden layer dimension 300. The predictor f is a two-layer MLP.

• For supervised transfer, we perform hyper-parameter optimization to identify the best architecture for the multitask GCN. The GCN φ has two layers with hidden layer dimension 2000. The predictor f is a three-layer MLP. The dropout rate is …. For fair comparison, all the methods use the same architecture in this setting.

Our model is a single-task binary classification model which predicts the SARS-CoV-1 3CLpro inhibition. After training, the model is tested on SARS-CoV-2 Mpro and antiviral data. Each model is evaluated under five independent runs and we report the average AUROC score.
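As a reminder of what the reported metric measures, AUROC can be computed directly from its rank interpretation: the probability that a randomly chosen active compound is scored above a randomly chosen inactive one, with ties counted as one half. The sketch below is a plain illustration under that definition; the function names `auroc` and `mean_auroc` are hypothetical and this is not the evaluation code used in the experiments.

```python
def auroc(labels, scores):
    """AUROC as the fraction of (active, inactive) pairs ranked correctly.

    labels: 1 for active, 0 for inactive; scores: predicted activity.
    Tied pairs contribute one half.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one active and one inactive example")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def mean_auroc(runs):
    """Average AUROC over several independent runs of (labels, scores)."""
    return sum(auroc(y, s) for y, s in runs) / len(runs)

# Toy example (assumed values): one active compound tied with one inactive.
print(auroc([1, 0, 0], [0.5, 0.5, 0.1]))
```

The pairwise form is quadratic in the number of compounds, which is fine for test sets the size of the SARS-CoV-2 screens described above; a sort-based version would be preferable for much larger sets.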

Results

Our results are shown in Table 1. The proposed method significantly outperformed all the baselines, especially on the Mpro inhibition prediction dataset (0.756 versus 0.653 AUROC).

Ablation study

Indeed, the improvement of our model comes from two sources: the additional auxiliary task and the IRM principle. To show the individual contribution of each component, we conduct an ablation study of our method without the IRM principle. The loss function in this case is the scaffold classification loss plus property prediction loss λ_e L_e(g ∘ φ) + L(f ∘ φ). The performance of this method is shown at the end of Table 1 ("without IRM"). The auxiliary scaffold classifier shows quite significant improvement, but is still inferior to our full model trained with the IRM principle.

Table 1: Unsupervised transfer results. Models are trained on SARS-CoV-1 data and tested on SARS-CoV-2 compounds. We report average of AUROC under 5-fold cross validation.

Method            Mpro inhibition AUC    Antiviral activity AUC
Direct Transfer   … ± .021               0.… ± …
DANN [5]          … ± .012               0.… ± …
SANN              … ± .079               0.… ± …
CDAN [7]          … ± .013               0.… ± …
IRM [1]           … ± .022               0.… ± …
Our method        … ± .012               0.… ± …
- without IRM     … ± .007               0.… ± …

We extend all the methods to multitask binary classification models that predict three different properties for each new compound: 1) probability of inhibiting the SARS-CoV-2 Mpro; 2) antiviral activity against SARS-CoV-2; 3) probability of inhibiting SARS-CoV-1 3CLpro.

Each model is evaluated under 5-fold cross validation with the same splits. In each fold, the training set contains the SARS-CoV data and 60% of the SARS-CoV-2 data (Mpro + antiviral), and the test set contains the remaining 40% of the SARS-CoV-2 compounds. We report the mean and standard deviation of AUROC score evaluated on five different folds.

Table 2: 5-fold cross validation of different methods. CoV-2 means that the model is trained on SARS-CoV-2 data. CoV-1 means that the model is additionally trained on SARS-CoV-1 data.

Method          CoV-1   CoV-2   Mpro inhibition AUC   Antiviral activity AUC
Multitask GCN           ✓       … ± …                 … ± …
Multitask GCN   ✓       ✓       … ± …                 … ± …
DANN            ✓       ✓       … ± …                 … ± …
IRM [1]         ✓       ✓       … ± …                 … ± …
Our method      ✓       ✓       … ± …                 … ± …

Results

Our results are shown in Table 2. The proposed method significantly outperformed the two baselines, especially on the antiviral activity prediction dataset (0.89 versus 0.82 AUROC). As an ablation study, we also trained a GCN on only SARS-CoV-2 data (the first row in Table 2). Indeed, the multitask GCN trained with additional SARS-CoV-1 data performs better (0.807 versus 0.740 on antiviral prediction), indicating that the two viruses are closely related.

The best model is then used to predict the SARS-CoV-2 Mpro inhibition and antiviral activity of compounds in the Broad drug repurposing hub. In order to utilize the maximal amount of labeled data, the model is re-trained under 10-fold cross validation with a 90%/10% split (instead of 60%/40%). The resulting 10 models are combined together as an ensemble to predict properties for new compounds. We report the top 20 predicted molecules for Mpro inhibition and antiviral activity in Tables 3 and 4.
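The screening step described above can be sketched in a few lines. The text only says that the 10 models are combined as an ensemble; the sketch assumes, as one common choice, that their predicted probabilities are averaged, and the names `ensemble_scores` and `top_k` and the toy data are hypothetical.

```python
def ensemble_scores(per_model_scores):
    """Average the predictions of several models.

    per_model_scores[m][c] is model m's predicted probability for compound c.
    Plain averaging is an assumption; the combination rule is not specified.
    """
    n_models = len(per_model_scores)
    n_compounds = len(per_model_scores[0])
    return [
        sum(model[c] for model in per_model_scores) / n_models
        for c in range(n_compounds)
    ]

def top_k(compounds, scores, k):
    """Return the k highest-scoring (compound, score) pairs, best first."""
    ranked = sorted(zip(compounds, scores), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]

# Toy library (assumed values): three compounds scored by two models.
library = ["compound_a", "compound_b", "compound_c"]
scores = ensemble_scores([[0.2, 0.8, 0.5], [0.4, 0.6, 0.7]])
print(top_k(library, scores, 2))
```

With k = 20 and the repurposing hub as the library, the same ranking step would produce lists of the kind shown in Tables 3 and 4.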

In this paper, we investigate existing domain extrapolation paradigms and their limitations. To allow the method to extrapolate across combinatorially many environments, we propose a new method which complements invariant risk minimization with adaptive environments. The method is evaluated on molecule property prediction tasks and shows significant improvements over strong baselines.

Table 3: Top 20 SARS-CoV-2 Mpro inhibiting molecules (non-covalent) in the Broad drug repurposing hub.

SMILES    MPro
Nc1ccc(cc1)S(N)(=O)=O    0.797
N

Table 4: Top 20 SARS-CoV-2 antiviral molecules in the Broad drug repurposing hub.

SMILES    Antiviral
C[C@]12CC(=O)[C@H]3[C@@H](CCC4=CC(=O)CC[C@]34C)[C@@H]1CC[C@]2(O)C(=O)CO    0.955
C[C@]1(O)CC[C@H]2[C@@H]3CCC4=CC(=O)CC[C@]4(C)[C@H]3CC[C@]12C    0.953
C[C@]12C[C@H](O)[C@H]3[C@@H](CCC4=CC(=O)CC[C@]34C)[C@@H]1CC[C@]2(O)C(=O)CO    0.948
CC(=O)OCC(=O)[C@@]1(O)CC[C@H]2[C@@H]3CCC4=CC(=O)CC[C@]4(C)[C@H]3C(=O)C[C@]12C    0.945
C[C@H]1C[C@H]2[C@@H]3CC[C@](O)(C(C)=O)[C@@]3(C)CC[C@@H]2[C@@]2(C)CCC(=O)C=C12    0.945
CC(=O)[C@H]1CC[C@H]2[C@@H]3CCC4=CC(=O)CC[C@]4(C)[C@H]3CC[C@]12C    0.944
C[C@@]12[C@H](CC[C@]1(O)[C@@H]1CC[C@@H]3C[C@@H](O)CC[C@]3(C)[C@H]1C[C@H]2O)C1=CC(=O)OC1    0.936
C[C@]12CC[C@H]3[C@@H](CCC4=CC(=O)CC[C@]34C)[C@@H]1CC[C@@H]2C(=O)CO    0.932
CC(C)(C)C(=O)OCC(=O)[C@H]1CC[C@H]2[C@@H]3CCC4=CC(=O)CC[C@]4(C)[C@H]3CC[C@]12C    0.931
C[C@]12CC(=O)C3C(CCC4=CC(=O)CC[C@]34C)C1CC[C@]2(O)C(=O)CO    0.924
CC(=O)O[C@@]1(CC[C@H]2[C@@H]3CCC4=CC(=O)CC[C@]4(C)[C@H]3CC[C@]12C)C(C)=O    0.924
CC(=O)OCC(=O)[C@H]1CC[C@H]2[C@@H]3CCC4=CC(=O)CC[C@]4(C)[C@H]3CC[C@]12C    0.920
C\C=C1/C(=O)C[C@H]2[C@@H]3CCC4=CC(=O)CC[C@]4(C)[C@H]3CC[C@]12C    0.919
C[C@]12CC[C@H]3[C@@H](CC[C@@H]4C[C@@H](O)CC[C@]34C)[C@@]1(O)CC[C@@H]2C1=CC(=O)OC1    0.918
CCCC(=O)O[C@@]1(CC[C@H]2[C@@H]3CCC4=CC(=O)CC[C@]4(C)[C@H]3[C@@H](O)C[C@]12C)C(=O)CO    0.902
CCC(=O)O[C@@]1(CC[C@H]2[C@@H]3CCC4=CC(=O)CC[C@]4(C)[C@H]3CC[C@]12C)C(=O)CO    0.894
CC(=O)[C@@]12OC(C)(O[C@@H]1C[C@H]1[C@@H]3CCC4=CC(=O)CC[C@]4(C)[C@H]3CC[C@]21C)c1ccccc1    0.880
CC(=O)OCC(=O)[C@@]1(O)CCC2C3CCC4=CC(=O)CC[C@]4(C)C3[C@@H](O)C[C@]12C    0.877
CC(=O)[C@@]1(O)CCC2C3CCC4=CC(=O)CC[C@]4(C)C3CC[C@]12C    0.876
C[C@]12CCC3C(CCC4=CC(=O)CC[C@]34C)C1CC[C@@H]2O    0.872

References

[1] Martin Arjovsky, Léon Bottou, Ishaan Gulrajani, and David Lopez-Paz. Invariant risk minimization. arXiv preprint arXiv:1907.02893, 2019.

[2] Guy W Bemis and Mark A Murcko. The properties of known drugs. 1. Molecular frameworks. Journal of Medicinal Chemistry, 39(15):2887–2893, 1996.

[3] Shiyu Chang, Yang Zhang, Mo Yu, and Tommi S Jaakkola. Invariant rationalization. arXiv preprint arXiv:2003.09772, 2020.

[4] Steven M Corsello, Joshua A Bittker, Zihan Liu, Joshua Gould, Patrick McCarren, Jodi E Hirschman, Stephen E Johnston, Anita Vrcic, Bang Wong, Mariya Khan, et al. The drug repurposing hub: a next-generation drug library and information resource. Nature Medicine, 23(4):405–408, 2017.

[5] Yaroslav Ganin, Evgeniya Ustinova, Hana Ajakan, Pascal Germain, Hugo Larochelle, François Laviolette, Mario Marchand, and Victor Lempitsky. Domain-adversarial training of neural networks. The Journal of Machine Learning Research, 17(1):2096–2030, 2016.

[6] Sangeun Jeon, Meehyun Ko, Jihye Lee, Inhee Choi, Soo Young Byun, Soonju Park, David Shum, and Seungtaek Kim. Identification of antiviral drug candidates against SARS-CoV-2 from FDA-approved drugs. bioRxiv, 2020.

[7] Mingsheng Long, Zhangjie Cao, Jianmin Wang, and Michael I Jordan. Conditional adversarial domain adaptation. In Advances in Neural Information Processing Systems, pages 1640–1650, 2018.

[8] Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748, 2018.

[9] Diamond Light Source. SARS-CoV-2 main protease structure and XChem fragment screen. 2020.

[10] Kevin Yang, Kyle Swanson, Wengong Jin, Connor Coley, Philipp Eiden, Hua Gao, Angel Guzman-Perez, Timothy Hopper, Brian Kelley, Miriam Mathea, et al. Analyzing learned molecular representations for property prediction. Journal of Chemical Information and Modeling, 59(8):3370–3388, 2019.

[11] Han Zhao, Remi Tachet des Combes, Kun Zhang, and Geoffrey J Gordon. On learning invariant representation for domain adaptation. arXiv preprint arXiv:1901.09453, 2019.

[12] Peng Zhou, Xing-Lou Yang, Xian-Guang Wang, Ben Hu, Lei Zhang, Wei Zhang, Hao-Rui Si, Yan Zhu, Bei Li, Chao-Lin Huang, et al. A pneumonia outbreak associated with a new coronavirus of probable bat origin.