ASGN: An Active Semi-supervised Graph Neural Network for Molecular Property Prediction
Zhongkai Hao, Chengqiang Lu, Zheyuan Hu, Hao Wang, Zhenya Huang, Qi Liu*, Enhong Chen, Cheekong Lee
{hzk171805,lunar,huangzhy,wanghao3,ustc_hzy}@mail.ustc.edu.cn, {qiliuql,cheneh}@ustc.edu.cn, [email protected]
1: Anhui Province Key Lab of Big Data Analysis and Application, School of Computer Science and Technology, University of Science and Technology of China; 2: Tencent America

ABSTRACT
Molecular property prediction (e.g., of energies) is an essential problem in chemistry and biology. Unfortunately, supervised learning methods often suffer from the scarcity of labeled molecules in the chemical space, since property labels are generally obtained by Density Functional Theory (DFT) calculation, which is extremely computationally costly. An effective solution is to incorporate unlabeled molecules in a semi-supervised fashion. However, learning semi-supervised representations for large numbers of molecules is challenging, due to the joint representation of both molecular essence and structure and the conflict between representation learning and property learning. Here we propose a novel framework called Active Semi-supervised Graph Neural Network (ASGN) that incorporates both labeled and unlabeled molecules. Specifically, ASGN adopts a teacher-student framework. In the teacher model, we propose a novel semi-supervised learning method to learn general representations that jointly exploit information from molecular structure and molecular distribution. In the student model, we target the property prediction task to deal with the learning loss conflict. Finally, we propose a novel active learning strategy based on molecular diversity to select informative data during the whole framework's learning. We conduct extensive experiments on several public datasets; the experimental results show the remarkable performance of our ASGN framework.
CCS CONCEPTS
• Theory of computation → Active learning; Semi-supervised learning; • Computer systems organization → Neural networks; Molecular computing.

KEYWORDS
Active Learning; Molecular Property Prediction; Graph Neural Network; Semi-Supervised Learning

* Corresponding Author.
KDD '20, August 23–27, 2020, Virtual Event, CA, USA
© 2020 Association for Computing Machinery.
ACM ISBN 978-1-4503-7998-4/20/08. https://doi.org/10.1145/3394486.3403117
Figure 1: Methods for molecular property prediction. Left: machine learning methods using message passing graph neural networks. Right: DFT calculation.
ACM Reference Format:
Zhongkai Hao, Chengqiang Lu, Zhenya Huang, Hao Wang, Zheyuan Hu, Qi Liu, Enhong Chen, Cheekong Lee. 2020. ASGN: An Active Semi-supervised Graph Neural Network for Molecular Property Prediction. In Proceedings of the 26th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD '20), August 23–27, 2020, Virtual Event, CA, USA. ACM, New York, NY, USA, 9 pages. https://doi.org/10.1145/3394486.3403117
1 INTRODUCTION
Predicting the properties of molecules, such as their energy, is a fundamental problem in many domains, including chemistry, biology, and materials science, and has driven significant research and applications. For example, the process of drug discovery [10] can be accelerated if we can accurately predict molecular properties in time to help develop specific medicines for epidemics such as H1N1 flu, SARS, and COVID-19.

In chemistry, density functional theory (DFT) is a commonly used computational method for molecular property prediction, with a history dating back to the 1970s [4]. It offers accurate and explainable solutions for molecules grounded in a complete theory [21]. In practice, however, it suffers from a critical problem of expensive computational cost, since it must solve many linear equations iteratively. For example, experimental results show that it takes about an hour to calculate the properties of a molecule with only 20 atoms [13]. Such low efficiency limits the application of DFT when screening a large set of molecules.

Recently, researchers have attempted to use cost-effective machine learning methods for molecular property prediction [15]. Along this line, the most representative methods are graph neural networks (GNNs), including MPNN [13], SchNet [30], and MGCN [23], which have shown superior performance. Generally, they treat a molecule as a graph whose nodes denote atoms and whose edges represent interactions between atoms. They design neural layers that project each node into a latent space as a low-dimensional learnable embedding vector and iteratively pass interaction messages along the edges. Finally, the node messages are aggregated to represent the molecule for property prediction.

Though GNNs have achieved great success, they are usually data-hungry and require a large amount of labeled data (i.e., molecules whose properties are known) for training [13]. However, labeled molecules make up an extremely small portion of the whole chemical space, since labels can only be provided by expensive experiments or DFT calculation, which restricts GNN-based development. As shown in the top-left part of Figure 1, there are still many valid molecules in the chemical space whose properties remain unknown but whose structures carry useful information. If we can effectively leverage these unlabeled molecules, we can potentially improve performance. Therefore, in this paper, we aim to explore semi-supervised learning (SSL) that takes full advantage of both labeled and unlabeled molecules for property prediction.

However, this is highly challenging due to the following domain-specific characteristics. First, learning molecular graph representations is non-trivial, because it involves information at both the node level and the graph level, and it differs from traditional applications like social networks: in chemical space we usually face a large number of graphs rather than a single graph with a large number of nodes. Though some existing semi-supervised learning methods, such as ladder networks [29], have performed well in domains such as images and text, they cannot be directly applied to molecular graph learning.
Second, it is difficult to handle the imbalance between labeled and unlabeled molecules, since labeled molecules generally make up an extremely small portion of chemical space. Directly applying previous SSL methods leads to a loss conflict: the large number of unlabeled molecules dominates training with structural representation objectives and crowds out our main goal of property prediction. Third, performance may still be unsatisfactory due to limited labels, so we need to acquire new labeled molecules to improve the model, and, to keep labeling efficient, we need a mechanism that finds the most informative molecules to label.

To address these challenges, we design a novel framework called Active Semi-supervised Graph Neural Network (ASGN) for molecular property prediction that takes advantage of both labeled and unlabeled molecules. Generally, ASGN uses a novel teacher-student framework consisting of two models that work alternately. Specifically, in the teacher model, we propose a novel semi-supervised learning method to learn a general representation that jointly explores molecular features at a global scale and a local scale. The local scale captures the essences of molecules, i.e., atoms and bonds, while the global scale learns the whole molecular graph encoding with respect to the chemical space. Then, to deal with the loss conflict between unsupervised structure representation and property prediction, we introduce a student model that fine-tunes on the property prediction task using only the small labeled set. By doing so, the student model can focus on prediction, achieving lower error than the teacher model and converging much faster, while suffering less over-fitting than training from scratch on the labeled set alone. Moreover, to improve labeling efficiency, we propose a novel active learning strategy to select informative new molecules. That is, ASGN uses the embeddings produced by the teacher model to select a diversified subset of molecules in the chemical space and adds them to the labeled set, fine-tuning the two models repeatedly until the label budget or the desired accuracy is reached. We conduct extensive experiments on real-world datasets, and the results demonstrate the effectiveness of our proposed ASGN. To the best of our knowledge, this is the first attempt to actively incorporate both unlabeled and labeled molecules for property prediction in a semi-supervised manner.
2 RELATED WORK
In this section, we summarize the related work in the following three categories.
Molecular Property Prediction. Predicting the properties of molecules is a fundamental task with applications in many areas such as chemistry and biology [3, 25]. According to quantum physics, the states of a molecule are characterized by the Schrödinger equation [37]. The first class of prediction methods, such as Density Functional Theory (DFT) [4], are simulation-based methods directly derived or approximated from the Schrödinger equation. However, DFT methods are time-consuming because they solve large systems of linear equations, and the complexity of DFT is $O(N^3)$, where $N$ is the number of atoms.

Another class of molecular property prediction methods is data-driven [9, 13, 15, 44]. Researchers attempted to use traditional machine learning methods with empirical descriptors or handcrafted features to represent a molecule, using them for linear or logistic regression [15, 44]. However, these methods cannot achieve the desired accuracy due to the limited effectiveness of handcrafted features and model capacity [13].

Inspired by the remarkable development of graph neural networks in various domains [13, 27, 39, 40], researchers have noticed their potential for molecular property prediction. Generally, by treating a molecule as a graph, several graph neural networks have been applied [14, 24, 40] as architectures that can directly handle non-Euclidean data such as graphs. Variants of graph neural networks like MPNN [13] and SchNet [30] can be applied to molecular property prediction: nodes represent atoms, and edges are weighted by the distances between atoms. The node embeddings are then propagated and updated using the embeddings of their neighborhoods, which is called message passing. A graph embedding can be pooled from the nodes for property prediction.

Semi-supervised Representation Learning. Semi-supervised learning is a popular framework for improving model performance by incorporating unlabeled data into training [45]. The main idea is to use the unlabeled data to learn a general and robust representation that improves the model. On the one hand, methods like ladder networks [29] borrow the idea of jointly
learning representations for unlabeled data (via generation) and labeled data [20]. On the other hand, a recently popular line of work uses self-supervised methods that force networks to be consistent under handcrafted transformations, such as image in-painting [26], rotation [12], and contrastive losses [16]. Usually, these methods use a pseudo-labeling mechanism that assigns each unlabeled example a pseudo label and forces the neural network to predict it; the pre-trained models can then be used for downstream tasks like classification or regression. For example, Gidaris et al. [12] use the rotation degree of an image as a kind of pseudo label. These pseudo labels are often obtained from transformations of the data that do not change its semantic content. Deep Clustering [6] shows that a convolutional neural network can itself be viewed as a strong prior for processing image data; accordingly, it designs a self-supervised method based on learning the clustering results of the features produced by the network.
Active Learning. Active learning is a popular framework for alleviating data deficiency and has been applied to many tasks [11, 19, 41, 43]. An active learning framework starts with a small set of labeled data and a large set of unlabeled data. In each iteration, it develops a model that selects a batch of unlabeled data to be labeled, supplementing the limited labeled data to achieve better performance. Generally, representative methods design the selection strategy from two perspectives: uncertainty and diversity [11, 31]. Uncertainty-based methods quantify the model's uncertainty on new unlabeled data through statistical properties (e.g., variance) and then select the data with the highest value [11, 38]. Comparatively, diversity-based methods aim to choose a small subset that is most representative of the whole dataset [31].

As pointed out in [2], the data selected by uncertainty strategies are almost identical in batch-mode settings, so they may be unsuitable for large datasets like ours. In this paper, we propose a novel diversity-based active learning strategy for informative molecule selection, in which the semi-supervised embeddings are used to compute distances between molecules.
3 PRELIMINARIES
In this section, we give formal definitions of the terminologies and problems in this paper. Following previous works [13, 30], we treat each molecule in chemical space as a graph, and define a molecular graph as follows:
Definition 3.1 (Molecular Graph). A molecule is denoted as a weighted graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$, where the vertex set is $\mathcal{V} = \{v_i : 1 \le i \le |\mathcal{G}|\}$; we use $x_i$ to represent the feature vector of node (atom) $v_i$, indicating its type (e.g., carbon or nitrogen), and $|\mathcal{G}|$ is the total number of atoms. $\mathcal{E} = \{e_{ij} = |r_i - r_j| : 1 \le i, j \le |\mathcal{G}|\}$ is the set of edges connecting atom nodes $v_i$ and $v_j$. Specifically, in a given molecule the coordinates of each atom are $r_i = (r_i^{(1)}, r_i^{(2)}, r_i^{(3)})$, so the edge $e_{ij}$ between two atom nodes is weighted by their coordinate distance $|r_i - r_j|$. A minimal data-structure sketch follows; we then give the formal definition of chemical space.
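To make Definition 3.1 concrete, here is a minimal, hypothetical sketch (the class name and fields are ours, not from the paper's released code) of a molecular graph with atom-type features and distance-weighted edges:

```python
# A minimal, hypothetical sketch of Definition 3.1 (class name and fields are
# ours): atoms as typed nodes, edges weighted by inter-atomic distances.
from dataclasses import dataclass
import numpy as np

@dataclass
class MolecularGraph:
    atom_types: np.ndarray   # shape (n,), e.g. atomic numbers (6 = C, 7 = N)
    coords: np.ndarray       # shape (n, 3), equilibrium coordinates r_i

    def edge_weights(self) -> np.ndarray:
        # e_ij = |r_i - r_j| for every atom pair (fully connected graph)
        diff = self.coords[:, None, :] - self.coords[None, :, :]
        return np.linalg.norm(diff, axis=-1)

# Example: water (O at the origin, two H atoms; coordinates in Angstroms)
water = MolecularGraph(
    atom_types=np.array([8, 1, 1]),
    coords=np.array([[0.0, 0.0, 0.0], [0.96, 0.0, 0.0], [-0.24, 0.93, 0.0]]),
)
print(water.edge_weights().shape)  # (3, 3)
```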
Definition 3.2 (Chemical Space). Generally, the whole chemical space consists of a set of molecules, denoted $\mathcal{M} = \{\mathcal{G}_i : 1 \le i \le N\}$. In practice, only a subset of molecules in the space has been examined to obtain properties (e.g., energy), typically by DFT calculation. We therefore divide the chemical space $\mathcal{M}$ into two subsets $\mathcal{D}_l$ and $\mathcal{D}_u$. Specifically, $\mathcal{D}_l = \{(\mathcal{G}_1, y_1), \cdots, (\mathcal{G}_{N_l}, y_{N_l})\}$ represents the subset of molecules whose properties have been examined, where $y_i \in \mathbb{R}^m$ denotes the real-valued property vector of molecule $\mathcal{G}_i$. Comparatively, $\mathcal{D}_u = \{\mathcal{G}_1, \mathcal{G}_2, \cdots, \mathcal{G}_{N_u}\}$ represents the subset of molecules whose properties remain unknown. Without loss of generality, we call $\mathcal{D}_l$ the "labeled set" and $\mathcal{D}_u$ the "unlabeled set".

With the above definitions, our problem can be formalized as finding a model $f(\mathcal{G}) \to y$ that precisely predicts the properties of molecules using a limited number of labels $|\mathcal{D}_l|$.

4 THE ASGN FRAMEWORK
In this section, we first present an overview of the ASGN framework and then describe its components comprehensively.
In this paper, we propose a novel Active Semi-supervised Graph Neural Network (ASGN) for molecular property prediction that incorporates both labeled and unlabeled molecules in chemical space. The overall framework is illustrated in Figure 2.

Generally, we use a teacher model and a student model that work iteratively; each is a graph neural network. In the teacher network, we learn a general representation of molecular graphs in a semi-supervised fashion, jointly training the embeddings for unsupervised representation learning and property prediction. In the student model, we handle the loss conflict by fine-tuning the parameters transferred from the teacher model on property prediction only. After that, we use the student model to assign pseudo labels to the unlabeled dataset; as feedback, the teacher model can learn the student's knowledge from these pseudo labels. Also, to improve labeling efficiency, we use active learning to select representative new unlabeled molecules for labeling, add them to the labeled set, and fine-tune the two models iteratively until the accuracy target or label budget is reached. Specifically, the key idea is to use the embeddings output by the teacher model to find the most diversified subset of the whole unlabeled set. We then obtain ground-truth labels for these molecules (e.g., by DFT calculation), add them to the labeled set, and repeat the iteration to improve performance. In the following, we first describe the technical details of the teacher and student models.

In the teacher model, we use semi-supervised learning. We first introduce the network backbone, and then the losses for representation learning. Specifically, a property loss on the labeled molecules $\mathcal{D}_l$ and two unsupervised losses (at both the graph and node levels) on all molecules $\mathcal{D}_u \cup \mathcal{D}_l$ guide the training. The task of the teacher model is to learn a general representation for molecular graphs from both the labeled and unlabeled sets. We first introduce a message passing graph neural network (MPGNN) as the backbone that transforms a molecular graph into a representation vector.
Figure 2: The overall framework of our method. We use a teacher model that jointly learns node embeddings at the local level and the distribution of the data at the global level, together with the property task. A student network uses the teacher's weights to fine-tune network parameters on the property prediction task. Active learning and pseudo labeling combine these steps effectively into one framework.

The graph neural network consists of $L$ message passing layers. At the $l$-th layer, it first embeds each node of a graph into a high-dimensional space via $f(v_i) = z_i \in \mathbb{R}^d$. Then the node embeddings are updated by aggregating the embeddings of the neighbors $\mathcal{N}(v_i)$ along the weighted edges, which is called message passing:

$$z_i^{l+1} = \sigma\big(W^l \cdot \mathrm{AGG}(z_i^l, \{e(v_i, v_j) : v_j \in \mathcal{N}(v_i)\})\big), \quad (1)$$

where $\sigma(\cdot)$ is the activation function, $W^l$ is a learnable weight matrix, and AGG is an aggregation function such as sum, mean, or max [24]. Here we choose sum aggregation, which directly adds the messages from the neighbors, as suggested in [42]. $e(v_i, v_j)$ is a vector called the message function, determined by the node embeddings and edge weights, that is passed from node $v_i$ to $v_j$. As interactions decay as the distance between two atoms grows, we use Gaussian radial basis functions [30] to embed the edge information reflecting the interaction strength between nodes:

$$e(v_i, v_j)[k] = z_i^l[k] \cdot \exp\big(-\gamma (\lVert r_i - r_j \rVert - d_k)^2\big), \quad (2)$$

for $1 \le k \le N_f$, where $\{d_k : 1 \le k \le N_f\}$ is a set of pre-defined filter centers. Denser centers mean higher resolution and can capture smaller differences in bond length. After $L$ layers of message passing and aggregation, we aggregate all node embeddings to obtain the whole-graph embedding:

$$z_\mathcal{G} = \mathrm{Pool}(\{z_i^L : v_i \in \mathcal{V}\}). \quad (3)$$

In this paper, we utilize a simple pooling method that directly averages or sums all node embeddings. Finally, a multi-layer perceptron $f_\theta$ produces the property $f_\theta(z_\mathcal{G})$.

Traditionally, an MPGNN is trained in a supervised manner where all labels are given, and the mean squared error (MSE) between the predictions and the labels $y_i$ (i.e., the labeled properties in $\mathcal{D}_l$) guides the optimization of the model parameters:

$$\mathcal{L}_p = \sum_{i=1}^{N_l} \lVert y_i - f_\theta(z_{\mathcal{G}_i}) \rVert^2. \quad (4)$$

However, in practice a training set with few labels easily yields an over-fitted model. Additionally, end-to-end training that only learns a high-level representation guided by the property/label is less effective for structural representation. To overcome these challenges, we propose a semi-supervised representation learning method that considers unsupervised information at both the local and global levels to enhance the expressive power of the model on both labeled and unlabeled molecular graphs. A minimal sketch of the backbone layer follows.
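The code below is a minimal PyTorch sketch of one backbone layer implementing Eqs. (1)-(3). It is illustrative only: the class and parameter names are ours, and details such as how the RBF features are projected into messages are assumptions rather than the authors' released implementation.

```python
# A minimal PyTorch sketch of one message-passing layer with Gaussian RBF edge
# filters (Eqs. (1)-(3)). Names and the RBF-to-message projection are assumed.
import torch
import torch.nn as nn

class RBFMessagePassing(nn.Module):
    def __init__(self, dim=96, n_filters=300, gamma=10.0):
        super().__init__()
        self.W = nn.Linear(dim, dim)                 # learnable W^l of Eq. (1)
        self.filter_net = nn.Linear(n_filters, dim)  # maps RBF features to dim
        # pre-defined filter centers d_k (the paper uses 0-3 nm, step 0.01 nm)
        self.register_buffer("centers", torch.linspace(0.0, 3.0, n_filters))
        self.gamma = gamma

    def forward(self, z, pos, edge_index):
        # z: (n_nodes, dim) embeddings; pos: (n_nodes, 3); edge_index: (2, n_edges)
        src, dst = edge_index
        dist = (pos[src] - pos[dst]).norm(dim=-1, keepdim=True)
        rbf = torch.exp(-self.gamma * (dist - self.centers) ** 2)   # Eq. (2)
        messages = z[src] * self.filter_net(rbf)     # distance-gated messages
        agg = torch.zeros_like(z).index_add_(0, dst, messages)  # sum aggregation
        return torch.relu(self.W(agg))               # Eq. (1), sigma = ReLU

def readout(z, batch, n_graphs):
    # Eq. (3): sum-pool node embeddings into per-graph embeddings z_G
    out = torch.zeros(n_graphs, z.size(1), device=z.device, dtype=z.dtype)
    return out.index_add_(0, batch, z)
```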
Node-level representation learning. In node-level representation learning, we capture domain knowledge from the geometric information of a molecular graph. The main idea is to use the node embeddings to reconstruct the node types and topology (distances between nodes) from the representation. Specifically, we first randomly sample some nodes and edges from the graph, as shown in Figure 2, then pass these nodes' representations to an MLP and use them to reconstruct the node types $f_i$ and the inter-node distances $e_{ij}$. Mathematically, we minimize the following cross-entropy loss:

$$\mathcal{L}_r = -\mathbb{E}_{v_i \sim \mathcal{V}} \left[ \sum_{m=1}^{K_n} f_{im} \log\big(g_{\theta_n}(z_i)\big) \right] - \mathbb{E}_{e_{ij} \sim \mathcal{E}} \left[ \sum_{m=1}^{K_e} e_{ijm} \log\big(g_{\theta_e}(z_i, z_j)\big) \right], \quad (5)$$

where the first term is the loss for node-type reconstruction and the second term is for edge-weight reconstruction; for both terms, we optimize the expectation over samples. $K_n$ is the number of atom types. We turn the continuous edge weights into a discrete classification problem by dividing the continuous distances into several bins, with $K_e$ the total number of bins; $e_{ijm} = 1$ when $d_m$ is the bin center nearest to the weight of edge $e_{ij}$. $g_{\theta_n}$ and $g_{\theta_e}$ are multi-layer perceptrons.

Practically, we randomly sample some nodes and edges to reconstruct their attributes and optimize the expectation over samples; we found such random sampling to be significantly more efficient without sacrificing much performance. We sample $\alpha |\mathcal{G}|$ ($0 < \alpha < 1$) nodes and edges to reconstruct their features. Moreover, we note that representing a molecule as a fully connected graph contains redundant information, because a molecule has only $3n$ degrees of freedom: the coordinates of each atom are determined by the 3 numbers $(r^{(1)}, r^{(2)}, r^{(3)})$. Therefore, sampling edges with size $O(|\mathcal{G}|)$ is an efficient trade-off between performance and algorithmic complexity. By optimizing the reconstruction loss (Eq. (5)), we obtain node embeddings that contain the topology and features of molecular graphs.

Graph-level representation learning. Although node embeddings that can reconstruct the topology of a molecule effectively represent its structure, a recent study [18] shows that adding graph-level representation learning is beneficial for downstream tasks like property prediction. To learn a graph-level representation, the key insight is to use the mutual relations between molecules within the chemical space, i.e., similar molecules roughly have similar properties. Inspired by this intuition, we propose a learning-to-cluster method to enhance the graph-level representation. First, we compute the graph-level embedding with the network. Then we use an implicit clustering method to assign each of the $N$ molecules one of $M$ cluster ids generated by the implicit clustering process. After that, we optimize the model with a penalty loss function. The process is iterated until at least a local minimum is reached.

Next, we introduce the details of graph-level representation learning. We denote by $s$ the cluster id in the rest of this section. First, we pass the graph-level embedding into a multi-layer perceptron and predict the probability distribution $q(s|\mathcal{G})$. We assume there exists a posterior distribution $p(s|\mathcal{G})$ over cluster ids. We optimize the cross-entropy between $p$ and $q$ as follows:

$$H(p, q) = -\sum_{i=1}^{N} \sum_{j=1}^{M} p(s_j|\mathcal{G}_i) \log q(s_j|\mathcal{G}_i). \quad (6)$$

However, we easily obtain a trivial solution if no constraint is applied to $p(s|\mathcal{G})$. The key is to confine the cluster ids to a pre-defined prior distribution $p(s)$ such that $\frac{1}{N}\sum_{i=1}^{N} p(s_j|\mathcal{G}_i) = p(s_j)$ [1, 5]. We choose a uniform distribution with $M$ fixed supports, which means the whole dataset is divided into roughly equal partitions. Practically, we use a hard labeling technique that constrains $p(s|\mathcal{G}_i)$ to be a discrete label by applying the hardmax function. Then we explicitly write the optimization objective as:

$$\min_{p, q} \mathcal{L}_c = -\sum_{i=1}^{N} \sum_{j=1}^{M} p(s_j|\mathcal{G}_i) \log q(s_j|\mathcal{G}_i) \quad (7)$$
$$\text{s.t. } p(s_j|\mathcal{G}_i) \in \{0, 1\}, \quad \frac{1}{N}\sum_{i=1}^{N} p(s_j|\mathcal{G}_i) = p(s_j).$$

We iteratively optimize the predictive distribution $q(s|\mathcal{G})$ by gradient descent on the network parameters, and the posterior distribution $p(s|\mathcal{G})$ by the following method, which can be viewed as an implicit clustering approach. We first rewrite Eq. (7) as:

$$\min \mathcal{L}_c = \min_{Q \in U(p, q)} \langle P, Q \rangle, \quad (8)$$

where $\langle \cdot, \cdot \rangle$ denotes the Frobenius dot-product between two matrices, $P_{ij} = p(s_j|\mathcal{G}_i)$, $Q_{ij} = q(s_j|\mathcal{G}_i)$, and $U(p, q)$ denotes the set of joint distributions with marginals $p$ and $q$. This is a typical optimal transport problem, and we add an entropy regularization and use the Sinkhorn-Knopp algorithm [7] for better convergence speed:

$$\min \mathcal{L}_c = \min_{Q \in U(p, q)} \langle P, Q \rangle - \lambda\, \mathrm{KL}(Q \,\|\, p q^T). \quad (9)$$

In fact, this process can be viewed as a type of clustering [8], so we name this loss the clustering loss for self-supervision. A minimal sketch of the Sinkhorn-Knopp step is given below.
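Below is a minimal sketch of the Sinkhorn-Knopp step that produces posterior cluster assignments under a uniform prior. The exact formulation (kernel $q^{\lambda}$ for the entropic regularization, iteration count, and normalizations) is our interpretation of Eqs. (7)-(9), not the paper's released code:

```python
# A sketch (our interpretation, not the released code) of Sinkhorn-Knopp
# iterations yielding posterior assignments P with uniform marginals:
# every molecule carries mass 1/N, every cluster receives mass 1/M.
import torch

def sinkhorn_assignments(q, lam=25.0, n_iters=50, eps=1e-12):
    # q: (N, M) predicted cluster probabilities q(s_j | G_i) from the MLP head
    N, M = q.shape
    K = q.clamp_min(1e-9) ** lam            # kernel exp(-lam * C), cost C = -log q
    r = torch.full((N,), 1.0 / N)           # row marginals: 1/N per molecule
    c = torch.full((M,), 1.0 / M)           # column marginals: uniform prior p(s)
    u = torch.ones(N)
    for _ in range(n_iters):                # alternating row/column scaling
        v = c / (K.t() @ u + eps)
        u = r / (K @ v + eps)
    P = u[:, None] * K * v[None, :]         # P = diag(u) K diag(v)
    return P / (P.sum(dim=1, keepdim=True) + eps)  # row-normalize per molecule
```

The hard labels required by Eq. (7) can then be taken as `P.argmax(dim=1)`.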
In a nutshell, to train the teacher model in a semi-supervised manner, we jointly optimize the following loss combining Eq. (4), Eq. (5), and Eq. (8):

$$\mathcal{L}_t = \sum_{\mathcal{G} \in \mathcal{D}_l} \mathcal{L}_p + \sum_{\mathcal{G} \in \mathcal{D}_u \cup \mathcal{D}_l} \mathcal{L}_r + \sum_{\mathcal{G} \in \mathcal{D}_u \cup \mathcal{D}_l} \mathcal{L}_c. \quad (10)$$

The student model. Practically, directly optimizing Eq. (10) in the teacher model yields unsatisfactory results for property prediction. The teacher model is heavily loaded, since it must learn several tasks simultaneously. Due to the conflict among these optimization targets, we observe that each target performs worse than when optimized separately. It is also inefficient: if $|\mathcal{D}_l| \ll |\mathcal{D}_u|$, little attention is paid to optimizing $\mathcal{L}_p$ in an epoch, even though property prediction is what we care about most. As a result, the property prediction loss is much higher than for a model that only needs to learn this task. To alleviate this problem, we introduce a student model. We use the teacher model to learn the representation by jointly optimizing the objectives above. When the teacher's learning process ends, we transfer the teacher's weights to the student model and fine-tune the student only on the labeled dataset to learn the target properties, as in Eq. (4) and shown in Figure 2:

$$\mathcal{L}_s = \sum_{\mathcal{G}_i \in \mathcal{D}_l} \lVert y_i - f_{\theta_s}(z_{\mathcal{G}_i}) \rVert^2. \quad (11)$$

After fine-tuning, we use the student model to infer over the whole unlabeled dataset and assign each unlabeled molecule a pseudo label, the student's prediction of its properties, so the unlabeled dataset becomes $\mathcal{D}_u = \{(\mathcal{G}_i, f_{\theta_s}(\mathcal{G}_i)) : 1 \le i \le |\mathcal{D}_u|\}$, where $\theta_s$ are the parameters of the student model. In the next iteration, the teacher model also learns these pseudo labels, so Eq. (10) becomes:

$$\mathcal{L} = \sum_{\mathcal{G} \in \mathcal{D}_u \cup \mathcal{D}_l} \mathcal{L}_p + \sum_{\mathcal{G} \in \mathcal{D}_u \cup \mathcal{D}_l} \mathcal{L}_r + \sum_{\mathcal{G} \in \mathcal{D}_u \cup \mathcal{D}_l} \mathcal{L}_c. \quad (12)$$

This can be viewed as the teacher learning knowledge from the student as feedback, inspired by the idea of knowledge distillation [17]. In summary, we handle the loss conflict with two models whose targets differ: the teacher model learns a general representation, while the student model learns accurate prediction of molecular graph properties. The pre-training of the teacher provides a warm start for the student model. A sketch of this transfer and pseudo-labeling step follows.
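A compact sketch of the weight transfer and pseudo-labeling step (Eq. (11) and the feedback of Eq. (12)); the trainer structure and loader names are hypothetical stand-ins, not the authors' code:

```python
# A sketch of weight transfer plus pseudo-labeling; names are hypothetical.
import torch
import torch.nn.functional as F

def transfer_and_pseudolabel(teacher, student, labeled_loader, unlabeled_loader,
                             lr=1e-3, epochs=20):
    # warm-start the student from the teacher (the "Transfer" arrow in Figure 2)
    student.load_state_dict(teacher.state_dict())
    opt = torch.optim.Adam(student.parameters(), lr=lr)
    for _ in range(epochs):                  # fine-tune on labeled molecules only
        for graphs, y in labeled_loader:
            loss = F.mse_loss(student(graphs), y)    # L_s of Eq. (11)
            opt.zero_grad()
            loss.backward()
            opt.step()
    with torch.no_grad():                    # student labels the unlabeled set;
        pseudo = [student(g) for g in unlabeled_loader]
    return torch.cat(pseudo)                 # fed to the teacher via Eq. (12)
```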
Active molecule selection. We have now incorporated the information in both labeled and unlabeled molecules. However, due to the limited number of available labels, accuracy may still be unsatisfactory, so we need new labeled data to improve performance. Therefore, in each iteration we use the embeddings output by the teacher model to iteratively
select a subset of molecules, whose properties (ground-truth labels) are then computed (e.g., by DFT). We add these molecules output by active learning to the labeled set for fine-tuning the two models iteratively. Along this line, the key strategy of active learning is to find a small batch of the most diversified molecules in the chemical space for labeling. A well-studied way to measure diversity is to sample from a $k$-DPP, as [22] suggests. However, this subset selection is NP-hard, so we adopt a greedy approximation, the $k$-center method. Denoting the unlabeled dataset by $\mathcal{D}_u$ and the labeled dataset by $\mathcal{D}_l$, we use a myopic method: in each iteration we choose the data that maximize the distance between the labeled and unlabeled sets. Concretely, for every $0 < i < b$ within the $k$-th batch, we choose the data point satisfying

$$\operatorname*{argmax}_{j \in [n] \setminus \mathcal{D}_u^k} \ \min_{i \in \mathcal{D}_l^k} d(\mathcal{G}_i, \mathcal{G}_j), \quad (13)$$

where $d(\mathcal{G}_i, \mathcal{G}_j) = \lVert z_{\mathcal{G}_i} - z_{\mathcal{G}_j} \rVert$ is the distance between two molecules, measured with the $L_2$ norm on the teacher's graph embeddings. A minimal sketch of this greedy selection is given after the framework summary below.

Framework summary. In this subsection, we briefly summarize the framework in Algorithm 1. Given an unlabeled set and a labeled set, in each iteration we use the $k$-center active learning strategy to obtain a new batch of data for labeling and add it to the labeled set (Line 4); next, we transfer the teacher's weights to the student network (Line 5) and fine-tune the student network (Line 6); then we use the student model to assign pseudo property labels to the rest of the unlabeled dataset (Line 7). After that, we continue to fine-tune the teacher model jointly on the three tasks (Line 8). Finally, the trained student model is applied to predict the properties of the molecules.

To summarize, we propose a novel approach to predicting the properties of molecules using graph neural networks. First, we use a multi-level representation learning method to obtain general embeddings for molecular graphs: the node embeddings store the essential components of molecular graphs, and they compose into meaningful graph-level embeddings with respect to the whole data distribution. Subsequently, a teacher-student framework effectively combines semi-supervised learning and active learning to deal with label insufficiency. Compared with vanilla semi-supervised learning methods [31], separating the two models alleviates the loss conflict. Compared with naive active learning methods that retrain the model from scratch whenever a new batch of data points is selected, the weights transferred from the teacher provide a warm start for the student, avoid overfitting the small labeled dataset, and accelerate training. Besides, the two models communicate via weight transfer and pseudo-label feedback, so they are mutually promoted.
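A minimal sketch of the greedy farthest-first ($k$-center) selection of Eq. (13); the array names are illustrative, and distances are $L_2$ norms of teacher embeddings:

```python
# A sketch of greedy farthest-first (k-center) selection, Eq. (13).
import numpy as np

def k_center_select(z_unlabeled, z_labeled, b):
    # z_unlabeled: (n_u, d); z_labeled: (n_l, d) teacher graph embeddings z_G
    # distance of each unlabeled molecule to its nearest labeled molecule
    d_min = np.linalg.norm(
        z_unlabeled[:, None, :] - z_labeled[None, :, :], axis=-1).min(axis=1)
    chosen = []
    for _ in range(b):
        j = int(np.argmax(d_min))            # farthest point first, Eq. (13)
        chosen.append(j)
        # the chosen molecule now counts as labeled: update nearest distances
        d_new = np.linalg.norm(z_unlabeled - z_unlabeled[j], axis=-1)
        d_min = np.minimum(d_min, d_new)
        d_min[j] = -np.inf                   # never select the same molecule twice
    return chosen
```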
Algorithm 1: The ASGN framework
Input: unlabeled, labeled, and test datasets $\mathcal{D}_u$, $\mathcal{D}_l$, $\mathcal{D}_{test}$; error measure $\epsilon(\cdot, \mathcal{D})$; batch size $b$; stopping error $\epsilon_0$; label budget $B$
Output: student model $\theta_s$
1: Initialize the teacher and student $\theta_t$, $\theta_s$ and the labeled dataset $\mathcal{D}_l$
2: Pre-train the teacher model by minimizing $\mathcal{L} = \mathcal{L}_r + \mathcal{L}_c + \mathcal{L}_p$ to get graph embeddings $\{z_\mathcal{G} : \mathcal{G} \in \mathcal{D}_u\}$
3: while $\epsilon(\theta_s, \mathcal{D}_{test}) > \epsilon_0$ and $|\mathcal{D}_l| \le B$ do
4:   Use $k$-center active learning on $z_\mathcal{G}$ to query new labeled data $s$; $\mathcal{D}_l \leftarrow \mathcal{D}_l \cup s$, $|s| = b$
5:   Transfer the weights of the teacher to the student: $\theta_s \leftarrow \theta_t$
6:   Fine-tune the student network by minimizing $\mathcal{L}_p$ on $\mathcal{D}_l$
7:   Assign pseudo labels to the unlabeled dataset using the student model: $y_i \leftarrow f_{\theta_s}(\mathcal{G}_i)$, $i \le |\mathcal{D}_u \setminus \mathcal{D}_l|$
8:   Fine-tune the teacher model by minimizing $\mathcal{L} = \mathcal{L}_r + \mathcal{L}_c + \mathcal{L}_p$ and recompute $\{z_\mathcal{G}\}$
9: end while
Return: student model $\theta_s$

5 EXPERIMENTS
In this section, we conduct extensive experiments to show the effectiveness of ASGN on two popular molecular datasets. The code is publicly available at https://github.com/HaoZhongkai/AS_Molecule.

• QM9: The QM9 dataset [28] (http://quantum-machine.org/datasets/) is a well-known benchmark that contains the equilibrium coordinates of about 130,000 molecules along with their quantum mechanical properties. We use 10,000 molecules for testing and 10,000 for validation. Coordinates and properties for all molecules are calculated with DFT methods. Molecules in QM9 contain no more than 9 heavy atoms (atoms heavier than hydrogen).
• OPV: OPV [34] (https://cscdata.nrel.gov/) is a dataset of roughly 100,000 medium-size molecules, each containing 20 to 30 heavy atoms. Again, the properties and equilibrium coordinates are obtained through DFT. We use 5,000 molecules for testing and 5,000 for validation.
We evaluate our method under two experimental settings. We first describe the implementation details and parameters of ASGN. We run all experiments on one Tesla V100 GPU and 16 Intel CPUs.
Graph Neural Network Hyperparameters. For the network backbone, we use 4 message passing layers and an embedding dimension of 96 in Eq. (1). We use the Adam optimizer with a learning rate of 1e-3. We use filters from 0 to 3 nm with an interval of 0.01 nm in Eq. (2).
Semi-Supervised Learning Hyperparameters. The teacher model has an additional linear classifier after the graph neural network. We divide the edge distances into 30 bins in Eq. (5) for reconstruction. We use M = 100 in Eq. (6). The regularization constant λ is set to 25 in Eq. (9). We train (fine-tune) the teacher model for 20 epochs in each iteration. We train the student network until the loss has not decreased for about 20 epochs. These settings are collected in the configuration sketch below.
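For reference, the stated hyperparameters can be collected in a single configuration sketch (a hypothetical dict of ours, not a file from the released repository):

```python
# The hyperparameters stated above, gathered into one hypothetical config.
CONFIG = {
    "mp_layers": 4,                 # message passing layers, Eq. (1)
    "embed_dim": 96,
    "optimizer": "Adam",
    "lr": 1e-3,
    "rbf_range_nm": (0.0, 3.0),     # filter centers d_k of Eq. (2)
    "rbf_step_nm": 0.01,
    "edge_bins": 30,                # K_e in Eq. (5)
    "clusters_M": 100,              # M in Eq. (6)
    "sinkhorn_lambda": 25,          # lambda in Eq. (9)
    "teacher_epochs_per_iter": 20,
    "student_early_stop_epochs": 20,
}
```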
Table 1: Results (MAE) on the QM9 dataset for the effectiveness experiment.

| Property | $U_0$ | $U$ | $G$ | $H$ | $C_v$ | HOMO | LUMO | gap | ZPVE | $R^2$ | $\mu$ | $\alpha$ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Unit | eV | eV | eV | eV | cal/(mol·K) | eV | eV | eV | eV | Bohr² | Debye | Bohr³ |
| Supervised | 0.3204 | 0.2934 | 0.2948 | 0.2722 | 0.2368 | 0.1632 | 0.1686 | 0.2475 | 0.0007 | 10.05 | 0.3201 | 0.5792 |
| Mean-Teachers | 0.3717 | 0.2730 | 0.2535 | 0.2150 | 0.2036 | 0.1605 | 0.1686 | 0.2394 | 0.00054 | 5.22 | 0.3488 | 0.5792 |
| InfoGraph | 0.1410 | 0.1702 | 0.1592 | 0.1552 | 0.1965 | 0.1605 | 0.1659 | 0.2421 | 0.00036 | 4.92 | 0.3168 | 0.5444 |
| ASGN (Ours) | | | | | | | | | | | | |
Table 2: Results (MAE) on the OPV dataset for the effectiveness experiment.

| Property | HOMO | LUMO |
|---|---|---|
| Unit | Hartree | Hartree |
| Supervised | 0.080 | 0.078 |
| Mean-Teachers | 0.078 | 0.075 |
| InfoGraph | 0.077 | 0.076 |
| ASGN (Ours) | | |

Figure 3: Results of the efficiency experiments for the HOMO property on (a) QM9 and (b) OPV datasets.
Active Learning Hyperparameters. In each iteration, we select 1,000 new unlabeled molecules via Eq. (13) to be labeled and added to the training dataset.
Effectiveness experiment. To demonstrate that our method achieves lower error with limited labeled data, we first conduct an effectiveness experiment. Under this setting we have a fixed label budget, i.e., a maximum number of labels. Given the budget, we compare the final mean absolute error (MAE) [30] on the test dataset after training. We use a label budget of 5,000 for both QM9 and OPV, which is about 5% of each dataset; other than these 5,000 labeled molecules, no labels are available. We compare our method with the baselines listed below.
Baselines. For the accuracy experiments, we mainly compare our method with several semi-supervised learning baselines. To ensure fairness, all baselines use the same network backbone (i.e., MPGNN). The baselines are selected from two perspectives: traditional semi-supervised learning and semi-supervised learning for graph data.
• Supervised: We train the network backbone in a fully supervised manner on the small labeled dataset only.
• Mean-Teachers [36]: A semi-supervised learning method that uses a consistency regularization and a moving average of the model's weights as the teacher.
• InfoGraph [35]: The state-of-the-art method for semi-supervised or unsupervised learning on graphs. It maximizes the mutual information between graph-level representations and the substructures of the graphs.
Results. The results are listed in Table 1 for the QM9 dataset and Table 2 for the OPV dataset.

First, we find that our method is significantly better than the baseline methods on all properties. We achieve a reduction of more than 50% on several properties, such as $U_0$, $U$, $\alpha$, and $C_v$, compared with the state-of-the-art method. This shows that our semi-supervised learning method is effective and that incorporating unlabeled data helps the prediction of molecular properties.

Second, the semi-supervised reconstruction captures domain knowledge about molecules and achieves better results than the supervised model (i.e., MPGNN) and Mean-Teachers. The global representation learning at the graph level is beneficial for molecular property prediction, and its performance is better than InfoGraph.

Efficiency experiment. To demonstrate that ASGN is label-efficient, we conduct an efficiency experiment. We start with 5,000 labeled molecules and put the rest in the unlabeled set. Then, in each iteration, after the model selects molecules from $\mathcal{D}_u$, we add them to $\mathcal{D}_l$. During this process, we measure the label rate versus mean absolute error (MAE) curve to show how many labels are saved for a fixed error: for a fixed error, the less labeled data used, the better the model.

Baselines. The baselines are selected from active learning methods, applied on the backbone of ASGN (i.e., MPGNN); we omit methods that cannot be applied to our settings. For ASGN, we use a batch of 2,500 new labeled molecules in every iteration in Eq. (13). The computational cost of the QBC method on the OPV dataset is unaffordable, so we omit it there.
• Random: Choose data points randomly from the unlabeled dataset in each iteration; the model is re-initialized when a new batch of labeled data is selected. This is equivalent to passive learning.
• Query By Committee (QBC) [32]: We jointly train a group of models, named the committee, initialized in the same way but with different parameters. In each iteration we choose the batch of data points with the largest disagreement among committee members. We use a committee of 8 models; training 8 models at the same time is time-consuming.
• Deep Bayesian Active Learning (BALD) [11]: An uncertainty-based method. We approximate the uncertainty by performing Monte Carlo dropout [33] on layers of the network.
• Vanilla k-center [31]: The representation learned by semi-supervised methods actually benefits the selection of new data points, so we also compare our method with the plain k-center active learning strategy.
Figure 4: Ablation results on the necessity of weight transfer.
Results. We plot the results for HOMO (highest occupied molecular orbital) on both the QM9 and OPV datasets in Figure 3, where "Full" denotes the MAE of a supervised MPGNN using all labeled data. We draw the following conclusions.

First, for all datasets and properties, when the label number is fixed, the MAE of our model is much lower than the baselines', which proves the effectiveness of our model and shows that the active learning strategy benefits model training. Additionally, the performance is better than a fully supervised model trained on all labeled data, proving the effectiveness of combining the semi-supervised loss as regularization.

Second, when we set a fixed error target, our model needs substantially fewer labels than the baselines, while Random, QBC, and vanilla k-center do not perform well on molecular data. Additionally, since BALD requires dropout, its performance is better when few labels are available but worse when all labels are used.

In this section, we conduct more experiments on ASGN, including an ablation study to demonstrate how each part of our model affects performance and a visualization experiment to support the interpretability of our model.
First, to show the effectiveness of the teacher-student framework in our model, we conduct an ablation study of ASGN without the teacher model or the student model. We denote ASGN with only the teacher model as ASGN-T, meaning that we jointly learn all tasks without handling the loss conflict. The results for HOMO on the QM9 and OPV datasets are listed in Table 3: with the student network, the model achieves better performance on the property prediction task. We also study the case without the teacher model, denoted ASGN-S, meaning no semi-supervised learning is used; note that ASGN-S is identical to a vanilla k-center active learning method [31]. The results show that the teacher-student framework is necessary.
Figure 5: Visualization of molecular graph embeddings on the QM9 dataset using t-SNE.

Table 3: Results of ablation experiments on the necessity of the teacher-student framework.
| Method | HOMO (QM9, eV): 5k | 10k | 50k | HOMO (OPV, Hartree): 5k | 10k | 50k |
|---|---|---|---|---|---|---|
| ASGN-T | 0.1668 | 0.1523 | 0.0682 | 0.080 | 0.053 | 0.020 |
| ASGN-S | 0.1632 | 0.1252 | 0.0653 | 0.076 | 0.049 | 0.019 |
| ASGN | | | | | | |

The essential step connecting the student and teacher models in our method is transferring the teacher's weights to the student in order to accelerate training. Here we use an ablation experiment to demonstrate the effect of weight transfer. In Figure 4, we plot the MAE of ASGN with and without weight transfer on the QM9 test set for the LUMO (lowest unoccupied molecular orbital) property when 10,000 labeled data are available. The results show that both training and testing MAE converge faster and are more stable with weight transfer, and the final performance is also better.
Our representation learning considers the mutual relations between molecules within the chemical space, and we use this information to predict the clustering and enhance the representation. To demonstrate that the distribution of molecules exhibits a clustered structure, we use t-SNE to visualize the graph-level representations of molecules learned by ASGN, shown in Figure 5. After t-SNE, the molecular embeddings form clusters with clear separation between them, which verifies that we have obtained discriminative graph-level embeddings. Additionally, similar molecules fall into the same cluster, which means the embeddings capture structural information.
6 CONCLUSION
In this paper, we proposed a novel framework to improve molecular property prediction with limited labels by incorporating unlabeled molecules. We designed a teacher-student framework consisting of two graph neural networks that work iteratively. We then introduced the details of our semi-supervised representation learning method for molecular graphs, which considers both graph-level and node-level information. Weight transfer and pseudo labeling are used to optimize the two models and balance the loss functions. Furthermore, we used diversity-based active learning
to select new molecules for labeling. ASGN achieves much better performance than baselines when labels are limited. Additionally, we showed the necessity of each component of ASGN through ablation experiments. In future work, we will attempt to extend our model to more general molecular property prediction.
ACKNOWLEDGMENTS
This research was supported by grants from the National Natural Science Foundation of China (Grants No. 61922073, U1605251). Qi Liu gratefully acknowledges the support of the Youth Innovation Promotion Association of CAS (No. 2014299).
REFERENCES
[1] Yuki Markus Asano, Christian Rupprecht, and Andrea Vedaldi. 2019. Self-labelling via simultaneous clustering and representation learning. arXiv preprint arXiv:1911.05371 (2019).
[2] Jordan T Ash, Chicheng Zhang, Akshay Krishnamurthy, John Langford, and Alekh Agarwal. 2019. Deep batch active learning by diverse, uncertain gradient lower bounds. arXiv preprint arXiv:1906.03671 (2019).
[3] Axel Becke. 2007. The Quantum Theory of Atoms in Molecules: From Solid State to DNA and Drug Design. John Wiley & Sons.
[4] Axel D Becke. 2014. Perspective: Fifty years of density-functional theory in chemical physics. The Journal of Chemical Physics (2014).
[5] Piotr Bojanowski and Armand Joulin. 2017. Unsupervised learning by predicting noise. In Proceedings of the 34th International Conference on Machine Learning, Vol. 70. JMLR.org, 517–526.
[6] Mathilde Caron, Piotr Bojanowski, Armand Joulin, and Matthijs Douze. 2018. Deep clustering for unsupervised learning of visual features. In Proceedings of the European Conference on Computer Vision (ECCV). 132–149.
[7] Marco Cuturi. 2013. Sinkhorn distances: Lightspeed computation of optimal transport. In Advances in Neural Information Processing Systems. 2292–2300.
[8] Marco Cuturi and Arnaud Doucet. 2014. Fast computation of Wasserstein barycenters. In Proceedings of the 31st International Conference on Machine Learning.
[9] Kien Do, Truyen Tran, and Svetha Venkatesh. 2019. Graph transformation policy network for chemical reaction prediction. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 750–760.
[10] Sean Ekins, Ana C Puhl, Kimberley M Zorn, Thomas R Lane, Daniel P Russo, Jennifer J Klein, Anthony J Hickey, and Alex M Clark. 2019. Exploiting machine learning for end-to-end drug discovery and development. Nature Materials 18, 5 (2019), 435.
[11] Yarin Gal, Riashat Islam, and Zoubin Ghahramani. 2017. Deep Bayesian active learning with image data. In Proceedings of the 34th International Conference on Machine Learning, Vol. 70. JMLR.org, 1183–1192.
[12] Spyros Gidaris, Praveer Singh, and Nikos Komodakis. 2018. Unsupervised representation learning by predicting image rotations. arXiv preprint arXiv:1803.07728 (2018).
[13] Justin Gilmer, Samuel S Schoenholz, Patrick F Riley, Oriol Vinyals, and George E Dahl. 2017. Neural message passing for quantum chemistry. In Proceedings of the 34th International Conference on Machine Learning, Vol. 70. JMLR.org, 1263–1272.
[14] Will Hamilton, Zhitao Ying, and Jure Leskovec. 2017. Inductive representation learning on large graphs. In Advances in Neural Information Processing Systems. 1024–1034.
[15] Katja Hansen, Franziska Biegler, Raghunathan Ramakrishnan, Wiktor Pronobis, O Anatole Von Lilienfeld, Klaus-Robert Müller, and Alexandre Tkatchenko. 2015. Machine learning predictions of molecular properties: Accurate many-body potentials and nonlocality in chemical space. The Journal of Physical Chemistry Letters 6, 12 (2015), 2326–2331.
[16] Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. 2019. Momentum contrast for unsupervised visual representation learning. arXiv preprint arXiv:1911.05722 (2019).
[17] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. 2015. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531 (2015).
[18] Weihua Hu, Bowen Liu, Joseph Gomes, Marinka Zitnik, Percy Liang, Vijay S Pande, and Jure Leskovec. 2019. Pre-training graph neural networks. arXiv preprint arXiv:1905.12265 (2019).
[19] Zhenya Huang, Qi Liu, Yuying Chen, Le Wu, Keli Xiao, Enhong Chen, Haiping Ma, and Guoping Hu. 2020. Learning or Forgetting? A Dynamic Approach for Tracking the Knowledge Proficiency of Students. ACM Trans. Inf. Syst. 38, 2 (2020), 19:1–19:33. https://doi.org/10.1145/3379507
[20] Diederik P Kingma and Max Welling. 2013. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114 (2013).
[21] Walter Kohn and Lu Jeu Sham. 1965. Self-consistent equations including exchange and correlation effects. Physical Review (1965).
[22] Alex Kulesza and Ben Taskar. 2011. k-DPPs: Fixed-size determinantal point processes. In Proceedings of the 28th International Conference on Machine Learning.
[23] Chengqiang Lu, Qi Liu, Chao Wang, Zhenya Huang, Peize Lin, and Lixin He. 2019. Molecular property prediction: A multilevel quantum interactions modeling perspective. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33. 1052–1060.
[24] Yao Ma, Suhang Wang, Charu C Aggarwal, and Jiliang Tang. 2019. Graph convolutional networks with eigenpooling. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 723–731.
[25] Dino Oglic, Roman Garnett, and Thomas Gärtner. 2017. Active search in intensionally specified structured spaces. In Thirty-First AAAI Conference on Artificial Intelligence.
[26] Deepak Pathak, Philipp Krahenbuhl, Jeff Donahue, Trevor Darrell, and Alexei A Efros. 2016. Context encoders: Feature learning by inpainting. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2536–2544.
[27] Hongbin Pei, Bingzhe Wei, Kevin Chen-Chuan Chang, Yu Lei, and Bo Yang. 2020. Geom-GCN: Geometric graph convolutional networks. In International Conference on Learning Representations. OpenReview.net. https://openreview.net/forum?id=S1e2agrFvS
[28] Raghunathan Ramakrishnan, Pavlo O Dral, Matthias Rupp, and O Anatole Von Lilienfeld. 2014. Quantum chemistry structures and properties of 134 kilo molecules. Scientific Data (2014).
[29] Antti Rasmus, Mathias Berglund, Mikko Honkala, Harri Valpola, and Tapani Raiko. 2015. Semi-supervised learning with ladder networks. In Advances in Neural Information Processing Systems. 3546–3554.
[30] Kristof Schütt, Pieter-Jan Kindermans, Huziel Enoc Sauceda Felix, Stefan Chmiela, Alexandre Tkatchenko, and Klaus-Robert Müller. 2017. SchNet: A continuous-filter convolutional neural network for modeling quantum interactions. In Advances in Neural Information Processing Systems. 991–1001.
[31] Ozan Sener and Silvio Savarese. 2017. Active learning for convolutional neural networks: A core-set approach. arXiv preprint arXiv:1708.00489 (2017).
[32] H Sebastian Seung, Manfred Opper, and Haim Sompolinsky. 1992. Query by committee. In Proceedings of the Fifth Annual Workshop on Computational Learning Theory. 287–294.
[33] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research 15, 1 (2014), 1929–1958.
[34] Peter C St. John, Caleb Phillips, Travis W Kemper, A Nolan Wilson, Yanfei Guan, Michael F Crowley, Mark R Nimlos, and Ross E Larsen. 2019. Message-passing neural networks for high-throughput polymer screening. The Journal of Chemical Physics (2019).
[35] Fan-Yun Sun, Jordan Hoffmann, Vikas Verma, and Jian Tang. 2019. InfoGraph: Unsupervised and semi-supervised graph-level representation learning via mutual information maximization. arXiv preprint arXiv:1908.01000 (2019).
[36] Antti Tarvainen and Harri Valpola. 2017. Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results. In Advances in Neural Information Processing Systems. 1195–1204.
[37] David J Thouless. 2014. The Quantum Mechanics of Many-Body Systems. Courier Corporation.
[38] Daniel Ting and Eric Brochu. 2018. Optimal subsampling with influence functions. In Advances in Neural Information Processing Systems. 3650–3659.
[39] Hao Wang, Enhong Chen, Qi Liu, Tong Xu, Dongfang Du, Wen Su, and Xiaopeng Zhang. 2018. A united approach to learning sparse attributed network embedding. In 2018 IEEE International Conference on Data Mining (ICDM). IEEE, 557–566.
[40] Hao Wang, Tong Xu, Qi Liu, Defu Lian, Enhong Chen, Dongfang Du, Han Wu, and Wen Su. 2019. MCNE: An end-to-end framework for learning multiple conditional network representations of social network. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 1064–1072.
[41] Likang Wu, Zhi Li, Hongke Zhao, Zhen Pan, Qi Liu, and Enhong Chen. 2020. Estimating early fundraising performance of innovations via graph-based market environment model. In The Thirty-Fourth AAAI Conference on Artificial Intelligence (AAAI 2020). AAAI Press, 6396–6403.
[42] Keyulu Xu, Weihua Hu, Jure Leskovec, and Stefanie Jegelka. 2018. How powerful are graph neural networks? arXiv preprint arXiv:1810.00826 (2018).
[43] Zhilin Yang, Jie Tang, and Yutao Zhang. 2014. Active learning for streaming networked data. In Proceedings of the 23rd ACM International Conference on Information and Knowledge Management. 1129–1138.
[44] Rex Ying, Ruining He, Kaifeng Chen, Pong Eksombatchai, William L Hamilton, and Jure Leskovec. 2018. Graph convolutional neural networks for web-scale recommender systems. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 974–983.
[45] Xiaojin Jerry Zhu. 2005. Semi-supervised learning literature survey. Technical Report, University of Wisconsin-Madison.