Chemical Property Prediction Under Experimental Biases
Yang Liu and Hisashi Kashima
Department of Intelligence Science and Technology, Kyoto University
Abstract.
The ability to predict the chemical properties of compounds is crucial in discovering novel materials and drugs with specific desired characteristics. Recent significant advances in machine learning technologies have enabled automatic predictive modeling from past experimental data reported in the literature. However, these datasets are often biased due to various reasons, such as experimental plans and publication decisions, and the prediction models trained using such biased datasets often suffer from over-fitting to the biased distributions and perform poorly on subsequent uses. The present study focuses on mitigating bias in the experimental datasets. To this purpose, we adopt two techniques from causal inference and domain adaptation combined with graph neural networks capable of handling molecular structures. The experimental results in four possible bias scenarios show that the inverse propensity scoring-based method makes solid improvements, while the domain-invariant representation learning approach fails.
The ability to predict the chemical properties of compounds is crucial in discovering novel materials and drugs with specific desired characteristics. To accelerate the discovery process, a variety of computational approaches, including those based on density functional theory, have been widely used; however, they are still expensive and time-consuming. In recent decades, researchers have turned to data-driven approaches using fast-progressing machine learning technologies [31]. This trend has been further accelerated by the remarkable development of deep learning in recent years; in particular, graph neural networks (GNNs) have achieved remarkable performance in predicting chemical properties via automatic feature extraction from molecular structures represented as graphs [6,9]. Their applications have expanded to a variety of tasks, such as molecular generation [23,46,5], molecular explanation [45,2], and analysis of intermolecular interactions [14,39].

The construction of accurate predictive models often relies on large-scale labeled datasets; they are usually collections of knowledge, such as experimental results reported in scientific papers, that are the product of tremendous scientific efforts. Unsurprisingly, scientists do not sample molecules from a vast chemical space uniformly at random nor based on their natural distribution; rather, their decisions on experimental plans or publication of results are biased by physical, economic, or scientific reasons. For instance, a large proportion of molecules are not investigated experimentally due to factors related to molecular mechanics, such as solubility [24], weights [29], toxicity [13] and side effects, or factors related to molecular structure such as crystals [17]. Concerns about the cost and availability of molecules can also be reasons to exclude certain groups of molecules.
Conversely, popularity considerations based on current research trends [15] and the experimental methods in which each lab specializes [37] drive the selection of compounds. These propensities related to the researchers' experience and knowledge can contribute to more efficient search and discovery in the chemical space, but on the other hand, they influence the data in an undesirable way. Prediction models trained using such biased datasets often suffer from over-fitting to the biased distributions, which leads to poor performance on subsequent uses [18,38,3].

Despite the reported evidence on the existence of bias in scientific experiments and its harmful effects, almost no attempts to address this issue have been reported. In this study, we focus on mitigating the sources of bias in the experimental datasets by applying two techniques from causal inference and domain adaptation, i.e., inverse propensity scoring (IPS) [16,32] and domain-invariant representation learning (DIRL) [8,33,43]. In the IPS approach, we first estimate the propensity score function, which represents the probability of each molecule being analysed, and then estimate the chemical property prediction model by weighting the objective function with the inverse of the propensity score. The DIRL approach is based on more recent advancements in representation learning of deep neural networks for transfer learning and causal inference. It consists of a feature extractor, a domain classifier, and a label predictor, where the feature extractor extracts features that help the label predictor while discouraging the domain classifier, and the whole network is optimized in an end-to-end manner. Both approaches are implemented over a GNN to handle the molecular structures of the compounds.

In the experiments, we use the well-known large-scale dataset QM9, consisting of exhaustively enumerated small-molecule structures associated with 12 fundamental chemical properties.
Because it is impossible to know how a publicly available dataset is truly affected by bias, we simulate four practical biased sampling scenarios from the dataset, which introduce significant biases in the observed molecules. Under each biased sampling scenario, we validate our proposed models for predicting 12 chemical properties as 12 regression problems. The experimental results show that the IPS approach improves the predictive performance in all the scenarios on most of the targets with statistical significance compared with the baseline. On the other hand, contrary to our expectations, the (more modern) DIRL approach reduces predictive performance in all of the bias scenarios. This is probably because the information that is useful for the domain classifier is also useful for predicting chemical properties, and is eliminated by the DIRL approach. These results provide useful insights for future developments of the proposed methods.

In summary, the main contributions of this work are as follows:

– We first address the problem of predicting properties of chemical compounds under experimental biases.
– We introduce two bias mitigation techniques, IPS and DIRL, combined with graph neural network-based chemical property prediction.
– We validate the two proposed approaches under various experimentally biased sampling scenarios, and show IPS improves the prediction performance, whereas DIRL worsens it.
Let us assume that we are given a training dataset $D_{\mathrm{train}} = \{(G_i, y_i)\}_{i=1}^{N}$ including $N$ molecular graphs, where $G_i \in \mathcal{G}$ is a molecular graph (biasedly) sampled from the universe of molecules $\mathcal{G}$, and $y_i \in \mathbb{R}$ is the target chemical property. Each molecular graph $G_i = (V_i, E_i, \sigma_i) \in \mathcal{G}$ has a set of nodes (i.e., atoms) $V_i$, a set of edges (i.e., bonds) $E_i \subseteq V_i \times V_i$, and a label function $\sigma : V_i \cup E_i \to L$, where $L$ is the set of possible node labels (i.e., atom types such as hydrogen and oxygen) and edge labels (i.e., bond types such as double bond and aromatic bond).

Our goal is to obtain a prediction function $f : \mathcal{G} \to \mathbb{R}$ that predicts a particular target chemical property over the molecule universe $\mathcal{G}$ that we are interested in. We assume we have a uniformly randomly sampled part of (or possibly the whole) chemical universe, consisting of $M$ molecules $D_{\mathrm{test}} = \{G_i\}_{i=N+1}^{N+M}$.

The main difficulty in this problem is that the training data is constructed from past experiments reported in the literature, and therefore is significantly biased with respect to the uniform distribution over the chemical universe due to decisions taken by the researchers on experimental plans or publication options. There is no guarantee that the predictor derived from the biased training data has high predictive performance even on the chemical universe. Note that $D_{\mathrm{test}}$ can also be a biased sample; however, without loss of generality, we just assume it is a uniformly random subset of the molecules we are interested in.

We start with reviewing the GNN architecture that is the fundamental building block of our model, and then describe two bias canceling schemes, inverse propensity scoring (IPS) and domain-invariant representation learning (DIRL); they are combined to tackle the problem of chemical graph property prediction under experimental biases.
Among many successful graph neural networks, we choose the message-passing GNN architecture proposed by Gilmer et al. [9] for its generality, simplicity, and fair performance in the chemical domain.
A GNN takes a graph $G = (V, E, \sigma) \in \mathcal{G}$ as its input. In the $t$-th layer of the GNN, it updates the current set of node representation vectors $\{h_v^t\}_{v \in V}$ to $\{h_v^{t+1}\}_{v \in V}$. Specifically, the representation vector of node $v$ is updated depending on the current vectors of its neighbor nodes using the update formula

$$m_v^{t+1} = a\Big( \sum_{u \in N(v)} m_t\big(h_v^t, h_u^t, \sigma(v, u)\big) \Big), \quad h_v^{t+1} = u_t\big(h_v^t, m_v^{t+1}\big), \qquad (1)$$

where $N(v)$ denotes the set of neighbor nodes of $v$, while $m_t$ is a so-called message passing function that collects the information (i.e., the representation vectors) of the neighbors; it is a linear function of $(h_v^t, h_u^t)$ depending on the edge label $\sigma(v, u)$. The $a$ is an activation function, for which we choose the rectified linear unit (ReLU). The $u_t$ is the vertex update function, for which we use the gated recurrent unit (GRU). The initial node representations $\{h_v^0\}_{v \in V}$ are initialized depending on their atom types. As the message passing operation is repeatedly applied, the node representation vector gradually incorporates information about its surrounding structure.

After being processed through $T$ layers, the final node representations $\{h_v^T\}_{v \in V}$ are obtained; they are aggregated into the graph-level representation $h_G$ using a readout function $r$, that is,

$$h_G = r\big(\{h_v^T\}_{v \in V}\big), \qquad (2)$$

where we could simply use summation followed by a linear transformation as the readout function. However, in our implementation, we use a slightly more complex solution, i.e., an LSTM pooling layer followed by a linear layer.

The graph-level representation $h_G$ is passed to the final layer to give the outputs of the GNN, such as the chemical property prediction, the propensity score, and the domain classification, as we will see later.

If we assume there are no biases in our training dataset, the distributions of the training dataset and the target (test) dataset are identical.
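The per-layer message-passing update in Eqs. (1)–(2) above can be sketched as follows. This is a simplified, hypothetical NumPy illustration: the message function is a per-edge-type linear map followed by a ReLU, and the GRU update used in the paper is replaced by a plain linear map to keep the sketch short.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def message_passing_step(h, edges, W_msg, W_upd):
    """One simplified message-passing layer (cf. Eq. 1).

    h      : (n_nodes, d) current node representations
    edges  : list of directed edges (u, v, edge_type)
    W_msg  : dict edge_type -> (d, d) weight of the linear message function
    W_upd  : (2d, d) weight of a plain linear update
             (the paper uses a GRU here; a linear map is our simplification)
    """
    m = np.zeros_like(h)
    for u, v, etype in edges:            # aggregate messages from neighbors
        m[v] += h[u] @ W_msg[etype]
    m = relu(m)                          # activation a = ReLU
    # h_v^{t+1} = u_t(h_v^t, m_v^{t+1}), here a linear function of both
    return np.concatenate([h, m], axis=1) @ W_upd
```

Stacking this step $T$ times and summing the rows of the result would correspond to the simplest readout of Eq. (2).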
This means that minimizing the empirical mean of the loss function $\ell$, that is,

$$\frac{1}{N} \sum_{i=1}^{N} \ell(y_i, f(G_i)), \qquad (3)$$

directly leads to obtaining a good prediction model that achieves a small expected loss $\mathbb{E}[\ell(y, f(G))]$ for the test data. However, in our situation where the training dataset is sampled in a biased manner, the minimization of the standard empirical loss results in a biased prediction model.

One of the possible remedies to this problem is the use of a propensity score [16] to adjust the importance weight of each training instance. The propensity score $\pi(G)$ of molecular graph $G$ is the probability that the molecule is included in the experimental data. The loss for each molecule is inversely weighted with the propensity score, which results in the modified objective function:

$$o_{\mathrm{IPS}}(f) = \frac{1}{N} \sum_{i=1}^{N} \frac{\ell(y_i, f(G_i))}{\pi(G_i)}. \qquad (4)$$

With the correct propensity score function $\pi$, the weighted loss function is unbiased with respect to the uniform sampling from the molecular universe.

The IPS approach consists of two steps: propensity score estimation and chemical property prediction. We first estimate the propensity score function representing the probability of each molecule being experimented on. It is estimated from the biased training dataset and the unbiased test set (uniformly sampled from the molecule universe). This step is usually performed by solving a two-class probabilistic classification problem to classify the two datasets using the logistic loss (a.k.a. the cross-entropy loss). Note that the target property values are not used for propensity modeling.

The second step estimates the chemical property prediction model with the loss function weighted using the inverse of the propensity score.
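As a minimal sketch of the second step, the inverse-propensity-weighted squared loss of Eq. (4) can be written as follows. This is a NumPy illustration assuming the propensity scores have already been estimated in the first step; the clipping constant is our own numerical safeguard, not part of the paper's formulation.

```python
import numpy as np

def ips_weighted_mse(y_true, y_pred, propensity, eps=1e-6):
    """Inverse propensity scoring objective (cf. Eq. 4): each example's
    squared error is divided by its estimated probability of having been
    observed, down-weighting over-represented molecules."""
    w = 1.0 / np.clip(propensity, eps, 1.0)   # clip to avoid huge weights
    return float(np.mean(w * (y_true - y_pred) ** 2))
```

With all propensities equal to one, the objective reduces to the ordinary mean squared error of Eq. (3).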
We use the squared loss function as $\ell$, and the problem is cast as a weighted regression problem. The propensity score function and the chemical property prediction model are both implemented as GNNs because they take graph-structured molecules as their inputs.

The domain-invariant representation learning approach [8,33,43] is another option to correct sample selection biases. We adopt the best-known method proposed by Ganin et al. [8], which we refer to as DIRL. It requires three components: a feature extractor, a domain classifier, and a label predictor, denoted by $f_F$, $f_D$, and $f_L$, respectively. In contrast with IPS, which has two separate GNN models, there are two paths from the input (i.e., a molecular graph) to two outputs in one single model; the first path makes chemical property predictions, and is specified by a label predictor $f_L$ concatenated after a feature extractor $f_F$. The second path is for domain classification, which is specified by a domain classifier $f_D$ concatenated after the feature extractor $f_F$. Note that $f_F$ commonly appears in both of the paths. The final deliverable $f$ that we want is the composite function of $f_F$ and $f_L$, that is, $f(G) = f_L(f_F(G))$. The domain classifier itself is not directly what we want for our final goal, but it helps to extract de-biased representations of the inputs in the training phase.
In the training phase, the first path (consisting of $f_F$ and $f_L$) aims to predict the target chemical property by minimizing

$$o_{\mathrm{property}}(f_F, f_L) = \frac{1}{N} \sum_{i=1}^{N} \ell(y_i, f_L(f_F(G_i))), \qquad (5)$$

where $\ell$ is the loss function for chemical properties, namely, the squared loss in our case. In the second path (including $f_F$ and $f_D$), the domain classifier $f_D$ aims to correctly classify the domain (i.e., training or test) of the input, while the feature extractor $f_F$ aims to prevent the correct classification. Therefore, denoting by $c$ the loss function for domain classification (that is, the logistic loss in our case), the objective function for the second path is given as

$$o_{\mathrm{domain}}(f_F, f_D) = \frac{1}{N+M} \sum_{i=1}^{N+M} c(d_i, f_D(f_F(G_i))), \qquad (6)$$

where $d_i$ indicates the domain that $G_i$ belongs to (i.e., training or test). Note that $f_D$ tries to minimize this objective function, while $f_F$ tries to maximize it. The overlap between the minimization and the min-max problem is actually an undesirable feature in the process of solving the optimization problem, especially for the feature extractor aiming to minimize $o_{\mathrm{property}}$ and maximize $o_{\mathrm{domain}}$.

To handle this situation simply with back-propagation, a special layer called the gradient reversal layer is often used. The gradient reversal layer, denoted by $r$, is an identity function $r(x) = x$ in forward calculations, but its derivative is defined as $\frac{\partial r}{\partial x} = -I$. With the gradient reversal layer, the second objective function $o_{\mathrm{domain}}$ is rewritten as

$$o_{\mathrm{domain}}(f_F, f_D) = \frac{1}{N+M} \sum_{i=1}^{N+M} c(d_i, f_D(r(f_F(G_i)))) \qquad (7)$$

so that the whole optimization problem becomes a pure minimization problem. Since we optimize the whole network at once, the overall objective function is the sum of $o_{\mathrm{property}}$ and $o_{\mathrm{domain}}$:

$$o_{\mathrm{DIRL}}(f_F, f_D, f_L) = o_{\mathrm{property}}(f_F, f_L) + o_{\mathrm{domain}}(f_F, f_D).
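The effect of the gradient reversal layer can be illustrated with a tiny scalar example. This is a hypothetical sketch with a hand-written chain rule, not the actual network: the layer is the identity in the forward pass, while in the backward pass it flips the sign of the gradient flowing into the feature extractor.

```python
class GradReverse:
    """Gradient reversal layer: identity forward, negated gradient backward."""
    def forward(self, x):
        return x
    def backward(self, grad_out):
        return -grad_out          # dr/dx = -I

def grad_wrt_extractor(a, b, x, d, reverse=True):
    """Gradient of the domain loss 0.5 * (f_D(r(f_F(x))) - d)**2 with
    respect to the extractor parameter a, where f_F(x) = a*x and
    f_D(z) = b*z (scalar stand-ins for the GNN components)."""
    r = GradReverse()
    z = a * x                     # feature extractor output
    out = b * r.forward(z)        # domain classifier after the reversal layer
    g = (out - d) * b             # dloss/dz just above the reversal layer
    if reverse:
        g = r.backward(g)         # sign flipped on the way back
    return g * x                  # chain through f_F: dz/da = x
```

The domain classifier still receives the ordinary gradient; only the extractor sees the reversed one, so a single minimization step trains $f_D$ to classify domains while pushing $f_F$ toward domain-invariant features.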
(8)

Chemists consciously or unconsciously select the molecules to be used in their experiments, depending on a variety of factors such as molecular structure, properties, cost, and popularity. Since we cannot know why the compounds reported in the literature were selected, we simulate several possible scenarios for sampling our training datasets. In our experiments, we consider the following four possible biased sampling scenarios:
Fig. 1.
Average densities of the training dataset and the test dataset. The x-axes correspond to (a) number of atoms, (b) proportion of double/triple/aromatic bonds, (c) value of the HOMO-LUMO gap (gap), and (d) value of mu, respectively.

– Scenario 1 assumes molecules with a smaller number of atoms have higher sampling chances, because smaller molecules are thought to be more common and better-studied [22]. In the QM9 dataset, the number of atoms ranges from 3 to 27.
– Scenario 2 prefers molecules with smaller proportions of single bonds (i.e., with higher proportions of the other bond types). The spectral manifestations of chemical functional groups greatly depend on the bond types within the group [34]. For simplicity, we assume that less single-bonded molecules would have stronger spectral manifestations.
– Scenario 3 selects molecules with higher values of the HOMO-LUMO gap, because larger HOMO-LUMO gap values often indicate higher stability and lower chemical reactivity [1].
– Scenario 4 assumes scientists focus on compounds with high target property values based on their expertise and experience to conduct experiments. Compounds with higher target values have higher sampling chances.
We use the chemical molecule dataset QM9 [28] as the universe $\mathcal{G}$ of small graphs. QM9 is a publicly available dataset covering 134,000 small stable organic molecules made up of hydrogen (H), carbon (C), oxygen (O), nitrogen (N), and fluorine (F). They are the subset of all the molecules with up to nine heavy atoms (C, O, N, F) out of the 166 billion molecules of the GDB-17 chemical universe [30,28]; therefore we consider the QM9 dataset as the natural universe of molecules made of C, O, N, and F.

Each molecule in QM9 has 12 fundamental properties [28]: dipole moment (mu), isotropic polarizability (alpha), highest occupied molecular orbital energy (homo), lowest unoccupied molecular orbital energy (lumo), gap between homo and lumo (gap), electronic spatial extent (r2), zero point vibrational energy (zpve), internal energy at 0 K (u0), internal energy at 298.15 K (u298), enthalpy at 298.15 K (h298), free energy at 298.15 K (g298), and heat capacity at 298.15 K (cv); they are used as the 12 targets for 12 regression tasks.
According to each of the four biased sampling scenarios, we sample a biased training dataset with a size of approximately 10% of the whole QM9 dataset (i.e., approximately 13,400 compounds). Each compound has a sampling chance determined by a sigmoid function of the corresponding sampling criterion. For example, in Scenario 1, the smallest molecules with three atoms have the largest sampling chances, while the ones with 27 atoms have the smallest chances. The larger the gain of the sigmoid curve becomes, the more the training and test datasets are separated; we tune the gain so that the average sampling probability becomes 10%.

A test dataset of the same size is then sampled uniformly at random from the remaining molecules. We use sampling without replacement, so no graph belongs to the training set and the test set simultaneously. We repeat the sampling procedure 30 times to build training and test datasets for statistical testing. Figure 1 shows the average densities over the 30 trials under the biased sampling scenarios (for Scenario 4, only the first target among the 12 targets is shown due to the page limitation). Note that the density of the test dataset reflects that of the whole chemical universe.
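A sketch of the sampling procedure for Scenario 1 is given below. This is hypothetical NumPy code: the paper tunes the sigmoid gain, while here, for simplicity, we fix the gain and search an offset by bisection so that the average sampling probability matches the 10% rate.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def biased_sampling_probs(indicator, gain=1.0, target_mean=0.10):
    """Sigmoid sampling chances for Scenario 1: molecules with smaller
    indicator values (e.g. fewer atoms) get higher probabilities.
    The offset b is found by bisection so that the mean probability
    matches the desired sampling rate (~10% of the dataset)."""
    lo, hi = indicator.min() - 50.0, indicator.max() + 50.0
    for _ in range(60):                      # bisection on the offset
        b = 0.5 * (lo + hi)
        if sigmoid(gain * (b - indicator)).mean() > target_mean:
            hi = b
        else:
            lo = b
    return sigmoid(gain * (b - indicator))

def draw_biased_training_set(indicator, rng, gain=1.0, target_mean=0.10):
    """Bernoulli draw of a biased training index set using the chances."""
    p = biased_sampling_probs(indicator, gain, target_mean)
    return np.flatnonzero(rng.random(len(indicator)) < p)
```

The remaining molecules outside the drawn index set would then supply the uniformly sampled test set.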
We use the PyTorch Geometric library [7] to implement the GNN models that both of our approaches are based on. Each atom is encoded as a 32-dimensional vector depending on the atom type. The message passing function $m_t$ depends on edge types and is 32-dimensional. The update function $u_t$ is a gated recurrent unit with 32 internal dimensions. The readout function $r$ is a sequence-to-sequence layer followed by two linear layers with 32 internal dimensions.

IPS.
To apply the IPS approach, we require two GNN models, one for the propensity score function and the other for chemical property prediction. In the former, we use $T = 3$ GNN layers and a logistic function as the final layer because the propensity score gives the probability that a chemical compound is observed. We use the Adam [20] optimizer with no weight decay and a batch size of 64 for each iteration. We use a learning rate scheduler with an initial learning rate of 1e-5; it reduces the learning rate by a factor of 0.7 whenever the validation error stops decreasing for five training epochs. The validation datasets are randomly chosen to include 20% of the training set and the test set. The optimized model that achieves the lowest validation error on the validation dataset is applied for inferring the importance weights.

The chemical property prediction model also has $T = 3$ GNN layers, and the other training settings are almost the same as those for the propensity score model. The validation set is 20% of the training set. Because we have 12 target chemical properties, we train 12 different GNN predictors.

Fig. 2. Average MAE on chemical property u0 depending on the indicators under the four biased sampling scenarios. The x-axes correspond to (a) number of atoms, (b) proportion of double/triple/aromatic bonds, (c) value of the HOMO-LUMO gap (gap), and (d) value of u0, respectively.
DIRL.
The network structure for the DIRL approach has two output paths: the domain classifier and the label predictor. This structure shares the common feature extraction layers on the input side. We set the number of GNN layers corresponding to the feature extractor to 3, and those for the domain classifier and the label predictor to 3 and 2, respectively. Similar to the IPS approach, the domain classifier has a readout function for classification (i.e., a logistic function) and the label predictor has one for regression. Note that the feature extractor has no readout function. As is the case with the IPS approach, we use Adam [20] with no weight decay as the optimizer. We also use a learning rate scheduler with an initial learning rate of 1e-5, which reduces the learning rate by a factor of 0.7 whenever the validation error stops decreasing for five training epochs. The batch size is set to 64. In contrast with IPS, the whole network is trained in an end-to-end manner. We train 12 DA-GNNs, one for each of the target properties in each trial. 20% of the training set and the test set are used as the validation set for the domain classifier, and 20% of the training set is used for the label predictor.
Baseline Method.
As the baseline method, we use the same GNN structure as the one for IPS, but without bias mitigation. In contrast to the IPS approach, we use the standard average loss (3) for the training dataset. The same settings as for IPS, such as the number of GNN layers, the choices of the hyperparameters, and the training and validation sets, are used, except for those specific to IPS.
Evaluation metrics.
The prediction accuracy for each option is evaluated in terms of the average mean absolute error (MAE) obtained from the 30 trials. We also perform the paired t-test to check the statistical significance of performance differences.

We show all the MAE comparison results (means and standard deviations over the 30 trials) in Tables 1–4, corresponding to the four simulated biased sampling scenarios. In all four tables, the
ratios in the IPS and DIRL columns indicate the improvements achieved over the baseline. The '*' symbols next to the improvements denote the p-values reported by the paired t-test: '**' means the p-value is less than 0.01, that is, the increase/decrease is statistically significant at the 1% level; similarly, '*' means that the p-value is less than 0.05, that is, statistical significance at the 5% level.

Table 1. MAE results under Scenario 1
[Baseline / IPS / DIRL columns (mean ± standard deviation over 30 trials) for mu, alpha, homo, lumo, gap, r2, zpve, u0, u298, h298, g298, and cv; the numeric entries are not recoverable from the extracted text.]

Table 2. MAE results under Scenario 2
[Same layout as Table 1; numeric entries not recoverable.]

Table 3. MAE results under Scenario 3
[Same layout as Table 1, with the 'gap' row excluded; numeric entries not recoverable.]

Table 4. MAE results under Scenario 4
[Same layout as Table 1; numeric entries not recoverable.]

Table 1 shows the results under Scenario 1, which uses the number of atoms as the indicator for biased sampling. The IPS approach improves the predictive performance with statistical significance compared with the baseline method for eight chemical properties, whereas the DIRL approach degrades the predictive performance for all 12 chemical properties with statistical significance.

Similarly, for Scenario 2, Table 2 shows statistically significant improvements by IPS for nine chemical properties, and DIRL degrades the predictive performance for nine chemical properties.

Table 3 shows the results under Scenario 3, which uses the HOMO-LUMO gap value as the indicator for biased sampling and predicts the 11 other chemical properties. Again, the predictive performance for eight chemical properties is improved with statistical significance by IPS, and DIRL is detrimental to the performance for all 11 chemical properties. Note that the result for the prediction of the HOMO-LUMO gap (denoted by 'gap') is missing in the table, but it is equivalent to the result for the HOMO-LUMO gap in Scenario 4.

Similar results are found in Table 4, corresponding to Scenario 4, where each of the 12 chemical properties is used as the indicator for biased sampling in the prediction task for the property itself; IPS improves in eight properties, while DIRL degrades in ten properties.

The overall results for all the scenarios indicate that the IPS approach improves the performance for many properties and scenarios; it is especially noteworthy that the performance is statistically significantly improved for six properties (alpha, zpve, u0, u298, h298, and g298) in all four scenarios, which shows that IPS is solidly effective and promising in mitigating experimental biases. On the other hand, the DIRL approach is harmful in almost all the scenarios.

The improvements by IPS are more significant for Scenarios 1 and 2 than for Scenarios 3 and 4. The differences might be explained by the accuracy of the propensity score model; the accuracies in the four scenarios are 81.05%, 87.49%, 76.04%, and 79.02%, respectively, which means that the propensity scores are more accurate in Scenarios 1 and 2.
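The paired t-test used for these per-property comparisons is computed from the 30 matched (baseline, method) MAE pairs. A minimal pure-Python sketch of the test statistic is shown below; the p-value would then be read from the t distribution with n − 1 degrees of freedom (e.g., via scipy.stats, which is not assumed here).

```python
import math

def paired_t_statistic(mae_a, mae_b):
    """Paired t statistic over matched trials:
    d_i = a_i - b_i, t = mean(d) / (sd(d) / sqrt(n)), n - 1 dof."""
    d = [a - b for a, b in zip(mae_a, mae_b)]
    n = len(d)
    mean = sum(d) / n
    var = sum((x - mean) ** 2 for x in d) / (n - 1)  # unbiased variance
    return mean / math.sqrt(var / n)
```

Pairing by trial removes the variance shared by the two methods on the same resampled dataset, which is why it is preferred here over an unpaired test.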
Prediction accuracy depending on indicators for biased sampling.
We further investigate why the IPS approach successfully corrects the bias while the DIRL approach fails, by visualizing the prediction accuracy depending on the indicators used in the biased sampling scenarios. Due to space limitations, we only show the predictive performance for the chemical property u0 in Figure 2. The horizontal axes in the figure correspond to the indicators, namely, (a) the number of atoms, (b) the proportion of double/triple/aromatic bonds, (c) the value of the HOMO-LUMO gap (gap), and (d) the value of u0, respectively. The vertical axis corresponds to the average test MAE (that is, the smaller, the more accurate).

Under Scenario 1, most of the molecules used for training have a number of atoms ranging from 3 to 15. For molecules with 15 or more atoms, found mainly in the test dataset, IPS consistently outperforms the other methods. Under Scenario 2, most of the molecules used for training have a proportion of double/triple/aromatic bonds higher than 0.3. Again, on molecules with a proportion less than 0.3, IPS consistently performs better than the others. Similarly, in Scenarios 3 and 4, we observe the advantage of IPS for the smaller indicator values corresponding to the test datasets.
Why does the DIRL approach fail?
The DIRL approach [8] is a more complicated and modern approach for adapting two datasets from different domains with different distributions, and it is known to perform better than the IPS approach on many tasks. However, from our experimental results, it is very surprising that the DIRL approach performs very poorly in chemical property prediction under the biases. We attribute this failure in part to the recent discussion about transferability and discriminability; in those terms, while the model achieves high transferability, it fails in discriminability [4]. The DIRL approach finds a space where the molecules from the training set and the target chemical universe are indistinguishable. However, when the information that is useful for discriminating the two domains is also useful for predicting the chemical property, it is eliminated by the domain-invariant approach, which results in poor discriminability (in other words, prediction power). For example, in Scenario 4, the indicator for introducing the biases and the target chemical property are strongly correlated, as is evident from the definition. Since the other indicators are also very basic chemical properties, it would not be surprising if the information useful for predicting them were also highly related to that for the target chemical properties.
The existence of experimental biases has been discussed in the literature in the fields of chemical, physical, biological, and pharmaceutical sciences. For instance, choices of compounds to be investigated are encouraged or discouraged by various physical and chemical properties of compounds, for example, solubility [24], weight [29], toxicity [13], and side effects, or by factors related to the molecular structure such as crystals [17]. In the pharmaceutical domain, drug likeness is an important factor for target selection, as exemplified by Lipinski's "rule of five" [22]. Concerns about the cost, availability, experimental methods [37], and research trends [15] can also be sources of bias in the selection. The present study focuses on properties of compounds themselves, but experimental biases in their interactions, such as drug-target interactions, have also been reported [44,25]. Although existing studies report negative impacts on the performance of predictive modeling using biased datasets [18,38,3], to the best of our knowledge, there has been no attempt to mitigate the biases to improve the predictive performance.

Recent significant developments in deep neural networks have expanded their scope from vector data to texts and images, and even to graph-structured data. Graph neural networks are actively being studied [12,40] and successfully applied to chemical and physical domains [6,19,36,11,21,42,47]. We used one of the well-known, fairly general GNNs [9] in this study, but more modern architectures have been proposed, e.g., one incorporating attention [36] and another with more expressive power [41].
Most of the existing studies assume unbiased datasets, and the present study is the first to address the bias mitigation problem in learning GNNs, except for one considering classification on a single large graph [10], whereas we consider prediction over a set of many small graphs.

To mitigate the biases in the experimental datasets, the present study applies two techniques from domain adaptation and causal inference. The problem of learning prediction models when the distributions of the training dataset and the test dataset are different is called domain adaptation, covariate shift adaptation [27], or transfer learning [26], and has been a major topic in machine learning. Recently, deep neural networks for domain adaptation based on the idea of domain-invariant representation learning (DIRL) were proposed [8,35]. In this study, we combined DIRL with GNNs, but contrary to our expectations, the approach performed quite poorly. As we discussed in the experimental section, more recent approaches balancing transferability and discriminability might offer a solution to this issue [4].

The problem of biased observations is also discussed in the context of causal inference. In this study, we applied the inverse propensity scoring (IPS) method [16], which yields remarkable improvements in prediction performance. IPS has also been successfully applied to various applications including recommender systems [32]. The DIRL approach has also been proposed in the context of causal inference [33].
Conclusion

We tackled the problem of applying machine learning approaches to chemical property prediction from datasets including experimental biases. We introduced bias mitigation methods combining recent developments in causal inference, domain adaptation, and graph learning, namely, inverse propensity scoring (IPS) and domain-invariant representation learning (DIRL) combined with graph neural networks (GNNs). We simulated four practical biased sampling scenarios on the well-known QM9 dataset. The experimental results confirmed that the IPS approach improved the performance of chemical property prediction with statistical significance, while the DIRL approach was not effective in this domain. Although the domain-invariant approach performed poorly in our settings, our analysis of the failure of DIRL suggests there may be room for improvement if the information needed by the final predictor can somehow be kept from being eliminated by the domain classifier.
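To make the last point concrete, the adversarial objective behind DIRL [8] can be written, under assumed notation (feature extractor $g$, property predictor $f$, domain classifier $d$, trade-off parameter $\lambda$, domain label $s$ indicating whether a compound was sampled), as a sketch:

```latex
\min_{g,\,f} \max_{d} \;
  \mathcal{L}_{\mathrm{prop}}\bigl(f(g(x)),\, y\bigr)
  - \lambda\, \mathcal{L}_{\mathrm{dom}}\bigl(d(g(x)),\, s\bigr)
```

The inner maximization trains $d$ to detect the domain, while $g$ is driven to discard any feature that reveals $s$. When the sampling bias correlates with the target property, this pressure can also remove information that $f$ needs, which is consistent with the failure mode we observed.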
Acknowledgments
We are grateful to Yoshihiro Yamanishi, Ichigaku Takigawa, and Shonosuke Harada for valuable information and discussions.
References
1. Aihara, J.: Reduced HOMO-LUMO gap as an index of kinetic stability for polycyclic aromatic hydrocarbons. The Journal of Physical Chemistry A (37), 7487–7495 (1999)
2. Akita, H., Nakago, K., Komatsu, T., Sugawara, Y., Maeda, S.i., Baba, Y., Kashima, H.: BayesGrad: Explaining predictions of graph convolutional networks. In: Proceedings of the 25th International Conference on Neural Information Processing (ICONIP). pp. 81–92. Springer (2018)
3. Chen, G., Chen, P., Hsieh, C.Y., Lee, C.K., Liao, B., Liao, R., Liu, W., Qiu, J., Sun, Q., Tang, J., et al.: Alchemy: A quantum chemistry dataset for benchmarking AI models. arXiv preprint arXiv:1906.09427 (2019)
4. Chen, X., Wang, S., Long, M., Wang, J.: Transferability vs. discriminability: Batch spectral penalization for adversarial domain adaptation. In: Proceedings of the 36th International Conference on Machine Learning (ICML). pp. 1081–1090 (2019)
5. De Cao, N., Kipf, T.: MolGAN: An implicit generative model for small molecular graphs. In: Proceedings of the ICML 2018 Workshop on Theoretical Foundations and Applications of Deep Generative Models (2018)
6. Duvenaud, D.K., Maclaurin, D., Iparraguirre, J., Bombarell, R., Hirzel, T., Aspuru-Guzik, A., Adams, R.P.: Convolutional networks on graphs for learning molecular fingerprints. In: Advances in Neural Information Processing Systems 28. pp. 2224–2232 (2015)
7. Fey, M., Lenssen, J.E.: Fast graph representation learning with PyTorch Geometric. In: Proceedings of the ICLR Workshop on Representation Learning on Graphs and Manifolds (2019)
8. Ganin, Y., Ustinova, E., Ajakan, H., Germain, P., Larochelle, H., Laviolette, F., Marchand, M., Lempitsky, V.: Domain-adversarial training of neural networks. Journal of Machine Learning Research (1), 2096–2030 (2016)
9. Gilmer, J., Schoenholz, S.S., Riley, P.F., Vinyals, O., Dahl, G.E.: Neural message passing for quantum chemistry. In: Proceedings of the 34th International Conference on Machine Learning (ICML). pp. 1263–1272 (2017)
10. Guo, R., Li, J., Liu, H.: Learning individual causal effects from networked observational data. In: Proceedings of the 13th International Conference on Web Search and Data Mining (WSDM). pp. 232–240 (2020)
11. Hamilton, W., Ying, Z., Leskovec, J.: Inductive representation learning on large graphs. In: Advances in Neural Information Processing Systems. pp. 1024–1034 (2017)
12. Hamilton, W.L., Ying, R., Leskovec, J.: Representation learning on graphs: Methods and applications. IEEE Data Engineering Bulletin (2017)
13. Hann, M.M.: Molecular obesity, potency and other addictions in drug discovery. Medicinal Chemistry Communication (5), 349–355 (2011)
14. Harada, S., Akita, H., Tsubaki, M., Baba, Y., Takigawa, I., Yamanishi, Y., Kashima, H.: Dual graph convolutional neural network for predicting chemical networks. BMC Bioinformatics, 1–13 (2020)
15. Hattori, K., Wakabayashi, H., Tamaki, K.: Predicting key example compounds in competitors' patent applications using structural information alone. Journal of Chemical Information and Modeling (1), 135–142 (2008)
16. Imbens, G.W., Rubin, D.B.: Causal Inference in Statistics, Social, and Biomedical Sciences. Cambridge University Press (2015)
17. Jia, X., Lynch, A., Huang, Y., Danielson, M., Langat, I., Milder, A., Ruby, A.E., Wang, H., Friedler, S.A., Norquist, A.J., et al.: Anthropogenic biases in chemical reaction data hinder exploratory inorganic synthesis. Nature (7773), 251–255 (2019)
18. Kearnes, S., Goldman, B., Pande, V.: Modeling industrial ADMET data with multitask networks. arXiv preprint arXiv:1606.08793 (2016)
19. Kearnes, S., McCloskey, K., Berndl, M., Pande, V., Riley, P.: Molecular graph convolutions: moving beyond fingerprints. Journal of Computer-Aided Molecular Design (8), 595–608 (2016)
20. Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. CoRR abs/1412.6980 (2014)
21. Li, R., Wang, S., Zhu, F., Huang, J.: Adaptive graph convolutional neural networks. arXiv preprint arXiv:1801.03226 (2018)
22. Lipinski, C.A.: Lead- and drug-like compounds: the rule-of-five revolution. Drug Discovery Today: Technologies (4), 337–341 (2004)
23. Liu, Q., Allamanis, M., Brockschmidt, M., Gaunt, A.: Constrained graph variational autoencoders for molecule design. In: Advances in Neural Information Processing Systems 31. pp. 7795–7804 (2018)
24. Llinas, A., Burley, J.C., Box, K.J., Glen, R.C., Goodman, J.M.: Diclofenac solubility: independent determination of the intrinsic solubility of three crystal forms. Journal of Medicinal Chemistry (5), 979–983 (2007)
25. Mestres, J., Gregori-Puigjane, E., Valverde, S., Sole, R.V.: Data completeness: the Achilles heel of drug-target networks. Nature Biotechnology (9), 983–984 (2008)
26. Pan, S.J., Yang, Q.: A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering (10), 1345–1359 (2009)
27. Quionero-Candela, J., Sugiyama, M., Schwaighofer, A., Lawrence, N.D.: Dataset Shift in Machine Learning. The MIT Press (2009)
28. Ramakrishnan, R., Dral, P.O., Rupp, M., Von Lilienfeld, O.A.: Quantum chemistry structures and properties of 134 kilo molecules. Scientific Data, 140022 (2014)
29. Raymer, B., Bhattacharya, S.K.: Lead-like drugs: A perspective: Miniperspective. Journal of Medicinal Chemistry (23), 10375–10384 (2018)
30. Ruddigkeit, L., Van Deursen, R., Blum, L.C., Reymond, J.L.: Enumeration of 166 billion organic small molecules in the chemical universe database GDB-17. Journal of Chemical Information and Modeling (11), 2864–2875 (2012)
31. Rupp, M., Tkatchenko, A., Müller, K.R., Von Lilienfeld, O.A.: Fast and accurate modeling of molecular atomization energies with machine learning. Physical Review Letters (5), 058301 (2012)
32. Schnabel, T., Swaminathan, A., Singh, A., Chandak, N., Joachims, T.: Recommendations as treatments: Debiasing learning and evaluation. In: Proceedings of the 33rd International Conference on Machine Learning (ICML) (2016)
33. Shalit, U., Johansson, F.D., Sontag, D.: Estimating individual treatment effect: generalization bounds and algorithms. In: Proceedings of the 34th International Conference on Machine Learning (ICML). pp. 3076–3085 (2017)
34. Sousa-Silva, C., Petkowski, J.J., Seager, S.: Physical chemistry chemical physics. Phys. Chem. Chem. Phys., 18970–18987 (2019)
35. Tzeng, E., Hoffman, J., Saenko, K., Darrell, T.: Adversarial discriminative domain adaptation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 7167–7176 (2017)
36. Veličković, P., Cucurull, G., Casanova, A., Romero, A., Liò, P., Bengio, Y.: Graph attention networks. In: Proceedings of the Sixth International Conference on Learning Representations (ICLR) (2018)
37. Walker, R., Bedson, P., Lawn, R., Barwick, V.J., Burke, S., Roper, P.: Applications of Reference Materials in Analytical Chemistry. The Royal Society of Chemistry (2001)
38. Wallach, I., Heifets, A.: Most ligand-based classification benchmarks reward memorization rather than generalization. Journal of Chemical Information and Modeling (5), 916–932 (2018)
39. Wang, H., Lian, D., Zhang, Y., Qin, L., Lin, X.: GoGNN: Graph of graphs neural network for predicting structured entity interactions. In: Proceedings of the 29th International Joint Conference on Artificial Intelligence (IJCAI) (2020)
40. Wu, Z., Pan, S., Chen, F., Long, G., Zhang, C., Philip, S.Y.: A comprehensive survey on graph neural networks. IEEE Transactions on Neural Networks and Learning Systems (2020)
41. Xu, K., Hu, W., Leskovec, J., Jegelka, S.: How powerful are graph neural networks? In: Proceedings of the Sixth International Conference on Learning Representations (ICLR) (2018)
42. Yan, S., Xiong, Y., Lin, D.: Spatial temporal graph convolutional networks for skeleton-based action recognition. arXiv preprint arXiv:1801.07455 (2018)
43. Yao, L., Li, S., Li, Y., Huai, M., Gao, J., Zhang, A.: Representation learning for treatment effect estimation from observational data. In: Advances in Neural Information Processing Systems 31. pp. 2633–2643 (2018)
44. Yıldırım, M.A., Goh, K.I., Cusick, M.E., Barabási, A.L., Vidal, M.: Drug-target network. Nature Biotechnology 25