Graph Policy Network for Transferable Active Learning on Graphs
Shengding Hu, Zheng Xiong, Meng Qu, Xingdi Yuan, Marc-Alexandre Côté, Zhiyuan Liu, Jian Tang
Shengding Hu
Tsinghua University [email protected]
Zheng Xiong
Tsinghua University [email protected]
Meng Qu
MILA [email protected]
Xingdi Yuan
Microsoft Research [email protected]
Marc-Alexandre Côté
Microsoft Research [email protected]
Zhiyuan Liu
Tsinghua University [email protected]
Jian Tang
HEC Montreal & MILA [email protected]
Abstract
Graph neural networks (GNNs) have been attracting increasing popularity due to their simplicity and effectiveness in a variety of fields. However, a large amount of labeled data is generally required to train these networks, which could be very expensive to obtain in some domains. In this paper, we study active learning for GNNs, i.e., how to efficiently label the nodes on a graph to reduce the annotation cost of training GNNs. We formulate the problem as a sequential decision process on graphs and train a GNN-based policy network with reinforcement learning to learn the optimal query strategy. By jointly optimizing over several source graphs with full labels, we learn a transferable active learning policy which can directly generalize to unlabeled target graphs under a zero-shot transfer setting. Experimental results on multiple graphs from different domains prove the effectiveness of our proposed approach in both settings of transferring between graphs in the same domain and across different domains.
Introduction

Graphs encode the relations between different objects and are ubiquitous in the real world. Learning effective representations of graphs is critical to a variety of applications. Recently, graph neural networks (GNNs) have been attracting growing attention for their effectiveness in graph representation learning [30, 33]. They have achieved great success on various tasks such as node classification [15, 27] and link prediction [4, 32]. Despite their appealing performance, GNNs typically require a large amount of labeled data for training [29]. However, in many domains, such as chemistry [11] and health care [6], it could be very expensive and time-consuming to collect a large amount of labeled data, which significantly limits the performance of GNNs in these domains.

Active learning [1, 22, 25] is a promising strategy to tackle this challenge. It aims to dynamically query the labels of the most informative instances selected from the unlabeled data. Although active learning has been proven effective on independent and identically distributed (i.i.d.) data in a variety of fields [9, 24], how to apply it to graph-structured data with dense correlations remains under-explored. This motivates us to study active learning on graphs, i.e., how to efficiently label the nodes on a graph to reduce the annotation cost of training GNNs.
Towards this goal, several methods have been proposed recently [8, 13, 12, 3, 10, 5]. They follow a general approach of designing a single selection criterion based on the graph characteristics, or adaptively combining several selection criteria, to measure the informativeness of each node, then labeling the most "informative" node at each query step according to the selection criterion. However, these methods suffer from the following limitations. (1) Ignoring long-term performance: Existing methods usually adopt a selection criterion as a surrogate objective function and greedily optimize it at each query step. However, the long-term objective function we truly want to optimize is how to select a sequence of nodes which maximizes the performance score of the GNN trained on them. Maximizing the short-term surrogate criterion usually leads to sub-optimal query strategies. (2) Lack of node interactions: When measuring node informativeness, existing methods usually consider each node independently and ignore the interconnections between different nodes. For example, if an unlabeled node has several labeled neighbors, then it is very likely that this node would provide little additional information for training.

To tackle these limitations, we propose GPA, a Graph Policy network for transferable Active learning on graphs. Our approach formalizes active learning on graphs as a Markov decision process (MDP) and learns the optimal query strategy with reinforcement learning (RL), where the state is defined based on the current graph status, the action is to select a node for annotation at each query step, and the reward is the performance gain of the GNN trained with the selected nodes. By maximizing its long-term returns with policy gradient [26], our policy network can effectively learn to optimize the long-term performance of the GNN in an end-to-end fashion. Moreover, our approach parameterizes the policy network as another GNN to explicitly model node interactions, which effectively propagates useful information over the graph and thereby better measures node informativeness. We train the graph policy network on multiple training graphs where node labels are available, and evaluate it on testing graphs where no labels are available at all (i.e., zero-shot transfer learning).

We evaluate GPA on the standard semi-supervised node classification task under two experimental settings with increasing difficulty: 1) the training graphs and testing graphs are from the same domain; 2) the training graphs and testing graphs are from different domains. Experimental results prove the effectiveness of GPA over competitive baselines under both settings.

Related Work

Graph Neural Networks. Typically, GNNs [30, 33] learn node representations by iteratively aggregating neighborhood information. A key difference between GNN variants lies in how they design the aggregation function [15, 27, 11]. Despite their effectiveness, GNNs typically require massive labeled data for training, which entails high annotation cost in some domains [16]. Consequently, we propose to study active learning on graphs to reduce the annotation cost of training GNNs.
Active Learning. Active learning [1, 25, 22, 2] has been widely studied on i.i.d. data in different domains such as natural language processing [24] and computer vision [9]. Recently, there have also been a handful of studies focusing on active learning for graph-structured data. Some earlier studies [8, 12, 13] are developed based on the graph homophily assumption that neighboring nodes are more likely to have the same label. More recent works utilize the expressive power of GNNs to design more informative selection criteria. AGE [3] measures node informativeness by a linear combination of three heuristics, where the combination weights are sampled from a beta distribution with time-sensitive parameters. ANRMAB [10] also uses a combination of different features but adjusts the combination weights based on a multi-armed bandit framework. Similarly, ActiveHNE [5] tackles active learning on heterogeneous graphs by posing it as a multi-armed bandit problem. However, all these methods measure the informativeness of different nodes independently, without explicitly considering their interactions.
Reinforcement Learning (RL) for Active Learning. Fang et al. [7] use deep Q-networks (DQN) [19] to learn active learning policies for named entity recognition (NER). Similarly, Liu et al. [17] use imitation learning to select the most informative data points for NER. Liu et al. [18] tackle active learning for neural machine translation with reinforcement learning. However, all these studies focus on i.i.d. data. In contrast, we focus on graph-structured data, where different nodes are highly correlated. Specifically, our approach learns a GNN-based policy network to utilize the interactions between different nodes.
Methodology
Problem Definition

We consider a graph denoted as $G = (V, E)$, where $V$ is a set of nodes and $E$ is a set of edges. Each node $v \in V$ is associated with a feature vector $x_v \in \mathcal{X} \subseteq \mathbb{R}^d$ and a label $y_v \in \mathcal{Y} = \{1, 2, \cdots, C\}$. The node set is divided into three subsets: $V_{train}$, $V_{valid}$, and $V_{test}$. In conventional semi-supervised node classification, the labels of a subset $V_{label} \subseteq V_{train}$ are given. The task is to learn a classification network $f_{G, V_{label}}$ (formulated as a GNN) with the graph $G$ and $V_{label}$ to classify the nodes in $V_{test}$.

For active learning on graphs, the labeled training subset is initialized as an empty set, $V_{label}^0 = \emptyset$. A query budget $B$ is given, which allows us to sequentially acquire the labels of $B$ samples from $V_{train}$, where $B \ll |V_{train}|$. At each step $t$, we select an unlabeled node $v^t$ from $V_{train} \setminus V_{label}^{t-1}$ based on an active learning policy $\pi$ and query the label of $v^t$. Next, we update the labeled node set as $V_{label}^t = V_{label}^{t-1} \cup \{v^t\}$. The classification GNN $f$ is then trained with the updated $V_{label}^t$ for one more epoch. When the budget $B$ is used up, we stop querying and continue training the classification GNN $f$ with $V_{label}^B$ until convergence.

We learn and evaluate the active learning policy $\pi$ under a zero-shot transfer learning setting. During the training phase, we collect a set of source graphs $\mathcal{G}_S$ with node labels. We learn the optimal policy $\pi^*$ to maximize $\sum_{G \in \mathcal{G}_S} \mathcal{M}(f_{G, V_{label}^B})$, where $\mathcal{M}(\cdot)$ is a metric used to evaluate the performance of the classification GNN, and $V_{label}^B = (v^1, \ldots, v^B)$ is the node sequence labeled by $\pi^*$ on graph $G$ under the annotation budget $B$. During the evaluation phase, we directly apply the learned policy $\pi^*$ to a set of unlabeled target graphs $\mathcal{G}_T$. For each $G \in \mathcal{G}_T$, we select a sequence of nodes to label based on $\pi^*$ and use them to train the classification GNN on $G$. Our ultimate goal is to learn a transferable active learning policy $\pi^*$ from $\mathcal{G}_S$ which can perform well on $\mathcal{G}_T$ without fine-tuning or retraining.
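For concreteness, the budgeted query loop described above can be sketched as follows. This is a minimal sketch under assumed helper signatures, not the authors' implementation; `select`, `train_one_epoch`, and `train_to_convergence` are hypothetical stand-ins for the policy's node selection and the classification GNN's training routines.

```python
def run_active_learning(select, train_one_epoch, train_to_convergence,
                        train_ids, budget):
    """Budgeted query loop from the problem definition: B sequential queries,
    one classifier training epoch per query, then training until convergence."""
    labeled = []                                   # V_label^0 is the empty set
    for t in range(1, budget + 1):                 # t = 1, ..., B
        candidates = [v for v in train_ids if v not in labeled]
        v_t = select(candidates)                   # v^t chosen by the policy pi
        labeled.append(v_t)                        # V^t = V^{t-1} U {v^t}
        train_one_epoch(labeled)                   # one more epoch on the new label set
    train_to_convergence(labeled)                  # after the budget is used up
    return labeled
```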
Active Learning on Graphs as an MDP

In the task of active learning for GNNs, our goal is to interactively select a sequence of nodes which maximizes the performance of the GNN trained on them. This problem can be naturally formalized as an MDP. Intuitively, given the condition of the current graph (i.e., the outputs of the classification network and information about the available labels in the graph) as the state, the active learning system takes an action by selecting the next node to query. It is then rewarded by the performance gain of the classification GNN trained with the updated set of labeled nodes. Formally, the MDP is defined as follows.

State. We denote the state in graph $G$ at step $t$ as $S_G^t = \{s_v^t \mid v \in V_G\}$, where $s_v^t$ is the state representation of node $v$ and $V_G$ is the node set of graph $G$. We define the state representation of each node based on several commonly-used heuristic criteria in active learning [22] as follows (a code sketch of these features is given after the list).

• We compute the degree of a node to measure its representativeness. The intuition is that high-degree nodes are likely to be hubs in a graph, and thus their labels are more informative. To ensure computational stability, we scale the node degree by a hyperparameter $\alpha$ and clip it to 1, i.e., $s_v^t(1) = \min(\mathrm{degree}(v)/\alpha, 1)$.

• We compute the entropy of the label distribution predicted by the classification GNN $f$ on each node to measure its uncertainty. If the network is not confident about its prediction on certain nodes, then the labels of these nodes are likely more useful. We divide the entropy $H$ by $\log(\#\mathrm{classes})$ to fit its value range within $[0, 1]$ even across graphs with different class numbers, i.e., $s_v^t(2) = H(\bar{y}(v; t)) / \log(\#\mathrm{classes})$, where $\bar{y}(v; t) \in \mathbb{R}^C$ is the class probability of node $v$ predicted by the classification GNN at step $t$.

• In addition to the entropy of the node itself, we are also interested in the divergence between a node's predicted label distribution and its neighbors'. The divergence between neighboring nodes measures local graph similarity, which can help the active learning policy better identify potential clusters and decision boundaries in the graph. Consequently, we compute the average KL divergence and reverse KL divergence between the predicted label distribution of a node $v$ and its neighbors $N_v$ as a measure of local similarity, i.e.,
$$s_v^t(3) = \frac{1}{|N_v|} \sum_{u \in N_v} \mathrm{KL}(\bar{y}(v; t) \,\|\, \bar{y}(u; t)), \qquad s_v^t(4) = \frac{1}{|N_v|} \sum_{u \in N_v} \mathrm{KL}(\bar{y}(u; t) \,\|\, \bar{y}(v; t)).$$

• We use an indicator variable to represent whether a node has been labeled or not, i.e., $s_v^t(5) = \mathbb{1}\{v \in V_{label}^{t-1}\}$.

In addition, we can easily incorporate additional heuristic features (e.g., the features proposed in [3, 10, 5]) into this flexible framework by concatenating them onto the node representation vector. In this paper, however, we only use the five primary features introduced above and rely on the policy network to automatically learn more complicated and informative criteria for node selection.
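The five features above can be assembled per node as follows. This is a minimal sketch assuming a dense float {0,1} adjacency matrix and the classifier's predicted class probabilities as inputs; the $O(N^2 C)$ pairwise-KL step is written for clarity rather than efficiency, and the function and argument names are ours, not the authors'.

```python
import torch

def node_state_features(adj, probs, labeled_mask, alpha=20.0, eps=1e-12):
    """Per-node state s_v^t: (scaled degree, normalized entropy, avg KL,
    avg reverse KL, labeled indicator).
    adj: dense {0,1} float adjacency [N, N]; probs: class probabilities [N, C];
    labeled_mask: bool [N]. Returns a [N, 5] state matrix."""
    N, C = probs.shape
    deg = adj.sum(dim=1)
    s1 = torch.clamp(deg / alpha, max=1.0)                     # min(degree/alpha, 1)
    logp = torch.log(probs + eps)
    s2 = -(probs * logp).sum(dim=1) / torch.log(torch.tensor(float(C)))  # H / log(C)
    # kl[v, u] = KL(p_v || p_u), computed for all pairs at once
    kl = (probs.unsqueeze(1) * (logp.unsqueeze(1) - logp.unsqueeze(0))).sum(-1)
    nbrs = deg.clamp(min=1.0)                                  # avoid division by zero
    s3 = (adj * kl).sum(dim=1) / nbrs                          # avg KL(p_v || p_u), u in N(v)
    s4 = (adj * kl.t()).sum(dim=1) / nbrs                      # avg reverse KL(p_u || p_v)
    s5 = labeled_mask.float()                                  # annotation indicator
    return torch.stack([s1, s2, s3, s4, s5], dim=1)
```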
Action. At time step $t$, the action is to select a node $v^t$ from $V_{train} \setminus V_{label}^{t-1}$ based on $p_G^t \sim \pi(\cdot \mid S_G^t)$, the action probability given by the policy network at step $t$.

Reward. We use the classification GNN's performance score on the validation set after convergence as a trajectory reward. Although this trajectory reward is delayed and sparse compared to using step-wise performance gains as intermediate rewards, empirically it provides a much more stable estimation of the policy's quality, as it is more robust to the random interference factors during the training process of the classification GNN. Given a sequence of labeled nodes $V_{label}^B = (v^1, \ldots, v^B)$, we define the trajectory reward with respect to $V_{label}^B$ as
$$R(V_{label}^B) = \mathcal{M}(f_{G, V_{label}^B}(V_{valid}), y_{valid}), \qquad (1)$$
where $f_{G, V_{label}^B}$ is the classification GNN trained with the graph $G$ and the labels of $V_{label}^B$; $V_{valid}$ and $y_{valid}$ are the nodes and labels of the validation set; $\mathcal{M}$ is the evaluation metric.

State Transition Dynamics. At each query step $t$, a newly labeled node $v^t$ is added to $V_{label}^t$ to update the classification GNN, and the graph state thus transits from $S_G^t$ to $S_G^{t+1}$. Specifically, the selection indicator in the state vector of the selected node $v^t$ changes from 0 to 1. Updating the classification GNN can influence the predicted label distributions, so the entropy and KL divergence terms in the state vector of each node change accordingly. Since it is hard to directly model the transition dynamics $p(S_G^{t+1} \mid S_G^t, v^t)$, we learn the optimal policy in a model-free manner.

Framework. We show an overview of the policy training framework in Figure 1. At query step $t$, we first update the current graph state $S_G^t$ with the graph $G$ and the outputs of the classification GNN $f_{G, V_{label}^{t-1}}$. The policy network $\pi$ takes $S_G^t$ as input and produces a probability distribution over actions $p_G^t$, which represents the probability of annotating each unlabeled node in the candidate pool $V_{train} \setminus V_{label}^{t-1}$. Next, we sample a node $v^t$ based on $p_G^t$ for annotation and add it to the labeled training subset to get $V_{label}^t$. The classification GNN $f$ is trained for one more epoch with $V_{label}^t$ to get $f_{G, V_{label}^t}$, which is then used to generate the graph state $S_G^{t+1}$ for the next step. When $t = B$, we stop the query phase and train $f$ until convergence. Finally, we evaluate $f_{G, V_{label}^B}$ on the validation set $V_{valid}$, and the performance score is used as the trajectory reward $R$ to update the policy network $\pi$.

Figure 1: The RL-based framework for active learning on GNNs. Blue and orange nodes represent unlabeled and labeled nodes in the training set respectively. For simplicity, we omit the validation nodes and test nodes. In the policy network $\pi$, each column represents a layer of GNN, and the graphs in each column correspond to the feature aggregation on different nodes.

Policy Network Architecture. A key characteristic of active learning on graphs is that the nodes are highly correlated with each other through the graph topology. This provides valuable information on the informativeness of each candidate node at different query steps. To automatically extract such information and model the influence of graph structures on the query policy, we parameterize the policy network $\pi$ as a GNN, which iteratively aggregates neighborhood information to update the state representation of each node. Specifically, we implement the policy network as an $L$-layer Graph Convolutional Network (GCN) [15]. The propagation rule for layer $l$ in the GCN is
$$H^{(l+1)} = \sigma\left(\tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}} H^{(l)} W^{(l)}\right), \qquad (2)$$
where $W^{(l)}$ and $H^{(l)}$ are the weight and input feature matrices of layer $l$ respectively, $\tilde{A} = A + I$ is the adjacency matrix with self loops, and $\tilde{D}$ is a diagonal matrix with $\tilde{D}_{ii} = \sum_j \tilde{A}_{ij}$. We use ReLU as the activation function $\sigma$.
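Equation (2) translates directly into code. Below is a minimal sketch of one GCN propagation step using a dense adjacency matrix; it follows the standard symmetric normalization of Kipf and Welling [15] but is our own illustration, not the authors' released code.

```python
import torch
import torch.nn as nn

class GCNLayer(nn.Module):
    """One propagation step of Eq. (2): H' = ReLU(D^{-1/2} (A + I) D^{-1/2} H W)."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim, bias=False)   # the weight matrix W^(l)

    def forward(self, adj, h):
        a_tilde = adj + torch.eye(adj.size(0), device=adj.device)  # add self loops
        d_inv_sqrt = a_tilde.sum(dim=1).pow(-0.5)                  # diagonal of D~^{-1/2}
        norm_adj = d_inv_sqrt.unsqueeze(1) * a_tilde * d_inv_sqrt.unsqueeze(0)
        return torch.relu(norm_adj @ self.linear(h))               # Eq. (2) with sigma = ReLU
```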
In the first layer, we use $H^{(0)} = S_G^t$ as the initial input features. On top of the GCN, we apply a linear layer with output dimension 1 to the final output embedding $H^{(L)}$. The resulting logits are then normalized by a softmax to generate a probability distribution $p_G^t$ over all candidate nodes for annotation in $G$:
$$p_G^t = \pi(\cdot \mid S_G^t) = \mathrm{Softmax}(W H^{(L)} + b). \qquad (3)$$

In the training phase, given a set of $N_S$ labeled source graphs $\mathcal{G}_S = \{G_i \mid i = 1, \ldots, N_S\}$, our goal is to maximize the sum of expected rewards obtained from following policy $\pi$ over the training graphs. The objective function with respect to the policy network parameters $\theta$ is
$$J(\theta) = \sum_{i=1}^{N_S} \mathbb{E}_{P(V_{label}^{B_i} = (v^1, \ldots, v^{B_i}); \theta)}\left[R_i(V_{label}^{B_i})\right], \qquad (4)$$
where $B_i$ is the query budget on graph $G_i$ and $R_i$ is the trajectory reward on graph $G_i$. We utilize REINFORCE [28], a classical policy gradient method, to train the policy network. In each training episode, we iterate over all the training graphs to update the policy network.

In the evaluation phase, given a set of $N_T$ unlabeled target graphs $\mathcal{G}_T = \{G_i \mid i = 1, \ldots, N_T\}$, we directly apply the learned policy $\pi_\theta$ on each test graph to perform active learning. As no fine-tuning or retraining is required, we only need to label $B_G$ samples to train the classification GNN on each unlabeled test graph $G$, which is consistent with the annotation budget. Due to the space limit, we give the detailed pseudo-code for policy training and evaluation in Appendix A.
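Building on the `GCNLayer` sketch above, the policy head of Eq. (3) and a REINFORCE step for Eq. (4) can be sketched as follows. The hidden size of 8 follows the experimental setup in Section 4; the candidate masking realizes the restriction to $V_{train} \setminus V_{label}^{t-1}$, while the baseline-free update and all helper names are our assumptions rather than details specified by the paper.

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical

class GPAPolicy(nn.Module):
    """Two GCN layers (Eq. (2)) followed by a scalar head; the softmax of
    Eq. (3) is applied by Categorical below. A sketch, not the released code."""
    def __init__(self, state_dim=5, hidden=8):
        super().__init__()
        self.gcn1 = GCNLayer(state_dim, hidden)
        self.gcn2 = GCNLayer(hidden, hidden)
        self.head = nn.Linear(hidden, 1)                 # one logit per node

    def forward(self, adj, state, candidate_mask):
        h = self.gcn2(adj, self.gcn1(adj, state))        # H^(L)
        logits = self.head(h).squeeze(-1)
        # restrict the distribution to unlabeled training nodes
        return logits.masked_fill(~candidate_mask, float('-inf'))

def reinforce_update(optimizer, log_probs, reward):
    """One REINFORCE step: ascend reward * sum_t log pi(v_t | S_t).
    No variance-reduction baseline is shown, as the paper does not specify one."""
    loss = -reward * torch.stack(log_probs).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Sampling one query step t (training samples; evaluation takes the argmax):
#   dist = Categorical(logits=policy(adj, state, candidate_mask))
#   v_t = dist.sample()
#   log_probs.append(dist.log_prob(v_t))
```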
Experiments

Datasets. For transferable active learning on graphs from the same domain, we use a multi-graph dataset collected from Reddit, an online forum where users create posts and comment on them; the dataset consists of 5 graphs. For transferable active learning on graphs from different domains, we adopt 5 widely used benchmark datasets: Cora, Citeseer, Pubmed, Coauthor-Physics, and Coauthor-CS [23]. We also use the 5 Reddit graphs in this setting, resulting in 10 graphs in total. Due to the space limit, we give a detailed description of these datasets in Appendix B.

Baselines. We compare our method against the following baseline methods:

(1) Random: At each step, randomly select a node to annotate. This is equivalent to the conventional semi-supervised training of GNNs.

(2) Uncertainty-based policy: At each step, predict the label distribution of each node with the current classification GNN, then annotate the node with the maximal entropy over its label distribution.

(3) Centrality-based policy: At each step, annotate the node with the largest degree.

(4) Coreset [21]: Coreset performs k-means clustering over the outputs of the last hidden layer of the classification network, and was originally proposed for convolutional neural networks. We simply apply this method to the node representations learned by the classification GNN. At each step, the node which is closest to a cluster center is selected for annotation.

(5) AGE [3]: AGE measures the informativeness of each node by combining three heuristics, i.e., the entropy of the predicted label distribution, the node centrality score, and the distance between the node's embedding and its nearest cluster center. To apply AGE under the transfer learning setting, we find the optimal combination weights on each training graph separately using grid search, then use their mean value as the combination weights on test graphs.

(6) ANRMAB [10]: ANRMAB utilizes the same set of selection heuristics as AGE, and proposes a multi-armed bandit framework to dynamically adjust the combination weights of these heuristics. ANRMAB uses the performance scores at historical query steps as the rewards to the multi-armed bandit machine, which enables it to learn the combination weights during the query process. As the code of ANRMAB is not provided, we implement it based on the pseudo-code from their paper.

Table 1: Results of transferable active learning on graphs from the same domain. The active learning policy is trained on Reddit {1, 2} and evaluated on Reddit {3, 4, 5}. Boldface and underline represent the best and second best scores respectively.

| Method | Reddit3 Micro-F1 | Reddit3 Macro-F1 | Reddit4 Micro-F1 | Reddit4 Macro-F1 | Reddit5 Micro-F1 | Reddit5 Macro-F1 |
|---|---|---|---|---|---|---|
| Random | 88.21 | 87.30 | 84.81 | 79.87 | 86.20 | 84.39 |
| Uncertainty | 70.03 | 64.34 | 72.28 | 60.32 | 73.27 | 63.67 |
| Centrality | 90.93 | 90.35 | 83.43 | 75.96 | 84.83 | 79.71 |
| Coreset | 78.34 | 76.11 | 82.18 | 76.71 | 83.29 | 81.99 |
| AGE | 91.09 | 90.44 | 87.55 | 84.39 | 88.02 | 85.99 |
| ANRMAB | 85.26 | 83.06 | 83.14 | 76.80 | 83.65 | 79.99 |
| GPA (Ours) | | | | | | |

Evaluation Metrics and Parameter Settings.
Following the common settings in the GNN literature [31], we use Micro-F1 and Macro-F1 as the evaluation metrics. On each graph, we set the sizes of the validation and test sets to 500 and 1000 respectively and use all remaining nodes as the candidate training samples for annotation. To test the policy, we run 100 independent experiments with different classification network initializations and report the average performance scores on the test set. We implement the policy network as a two-layer GCN [15] with a hidden layer size of 8, optimized with Adam [14]. The policy network is trained for a maximum of 2000 episodes with a batch size of 5. To demonstrate the advantage of active learning in reducing annotation cost, we set the query budget on each graph to (5 × #classes), which is far less than the default labeling budget of (20 × #classes) in the conventional (semi-supervised) GNN literature [15]. We set the scaling hyperparameter to α = 20 based on the average graph degree in our datasets. The classification network is implemented as a two-layer GCN with a hidden layer size of 64, optimized with Adam using a learning rate of 0.03 and a weight decay of 0.0005. Source code is attached in the supplementary material.

Results on Graphs from the Same Domain. Table 1 shows the results of transferable active learning on graphs from the same domain. Our policy successfully transfers to all three test graphs of Reddit under a zero-shot transfer setting. Surprisingly, of the three methods which use a single heuristic as the selection criterion, i.e., uncertainty, centrality, and Coreset, none could consistently outperform random selection. We suspect the reason is that the distribution of the selected samples differs from the underlying distribution of all the nodes. For example, when using uncertainty for selection, nodes that are close to the decision boundaries usually have higher entropy and are more likely to be selected, which introduces a distribution drift into the labeled training data. Consequently, a combination of different heuristics is essential to effectively measure node informativeness, as is done in our method. Compared to the two baselines that combine different heuristics as the selection criterion, GPA consistently outperforms AGE and ANRMAB. The reasons for the performance gain are twofold. First, our approach formulates the active learning problem as an MDP, which directly optimizes the long-term performance of the classification GNN. Second, our approach parameterizes the policy network as a GNN, which can leverage node interactions to better measure node informativeness.
Results on Graphs Across Different Domains. Table 2 shows the results of transferable active learning on graphs across different domains. Our policy successfully transfers to graphs across different domains and achieves the best performance on all the test graphs. Furthermore, comparing with Table 1, we can see that the cross-domain GPA policy performs comparably to the single-domain GPA policy on Reddit {3, 4, 5}, which again suggests the strong transferability of our approach. This is mainly because our policy is learned on a state space closely related to the classification GNN's learning process, instead of depending on any graph-specific features. Moreover, the policy network is optimized jointly over multiple graphs, which helps it learn a universal policy that is naturally transferable to different graphs. Due to the space limit, please refer to Appendix C for further experiments on different training graphs.

Table 2: Results of transferable active learning on graphs from different domains. Cora and Citeseer are used as the training graphs, while all other graphs are used for evaluation. For simplicity, we discard the three single-heuristic methods due to their relatively low performance.

| Method | Metric | Pubmed | Reddit1 | Reddit2 | Reddit3 | Reddit4 | Reddit5 | Physics | CS |
|---|---|---|---|---|---|---|---|---|---|
| Random | Micro-F1 | | | | | | | | |
| | Macro-F1 | | | | | | | | |

Figure 2: Left: Performance of different methods on Reddit 4 under different query budgets. The x-axis represents the label budget, and the y-axis represents the Micro-F1 score. Right: Performance of GPA on different test graphs when trained with different numbers of source graphs.

Effect of Query Budgets. Next, we compare all the algorithms along the dimension of query budgets. In this study, Reddit is used as an example. We train our policy on Reddit {1, 2} under five different query budgets, then evaluate the learned policy on Reddit 4 under the corresponding budgets. All baseline methods are also tested using the same set of budgets. We test each method under each budget 100 times and report the averaged Micro-F1 score with a 95% confidence interval. Figure 2 (left) shows that our policy consistently outperforms all baselines under all budgets. Compared with random selection, which uses a budget of 100 to reach a Micro-F1 of 90.0, our approach only needs a budget of 30 to reach the same result. Meanwhile, AGE uses a budget of 100 to reach a Micro-F1 of 91.7, while our approach only uses a budget of 50 to achieve the same result. We also notice that using only half of the full budget (50), GPA can already achieve a higher Micro-F1 than most of the baselines consuming the full budget of 100.

Effect of the Number of Training Graphs. We study the performance and transferability of the learned policy w.r.t. the number of training graphs. We select {1, 2, 3, 4} graphs from Reddit as the training graphs, and evaluate on the remaining 6 graphs. The results are shown in Figure 2 (right). On average, the policy trained on multiple graphs transfers better than the policy trained on a single graph. The main reason may be that training on a single graph overfits to the specific pattern of that graph, while training on multiple graphs better captures the general pattern across different graphs.
Importance of Modeling Node Interactions
Our GNN-based policy network models node informativeness by considering graph structures. Here, we compare it with a policy network that does not take graph structures into consideration. We parameterize the alternative policy network as a multi-layer perceptron (MLP), which only utilizes single-node information. For a fair comparison, we use a 3-layer MLP with a hidden layer size of 8, which has the same number of parameters as the GCN policy network (a sketch of this MLP policy is given after Table 3). We train the two policy networks on Cora + Citeseer, and evaluate on the remaining graphs. As shown in Table 3, GCN outperforms MLP by a large margin on all test graphs except Coauthor-CS, which evinces the importance of modeling node interactions.

Table 3: Performance comparison between using GCN and MLP as the policy network.

| Method | Metric | Pubmed | Reddit1 | Reddit2 | Reddit3 | Reddit4 | Reddit5 | Physics | CS |
|---|---|---|---|---|---|---|---|---|---|
| GCN | Micro-F1 | | | | | | | | |
| | Macro-F1 | | | | | | | | |
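For reference, the structure-agnostic baseline in this comparison can be sketched as follows: an MLP that scores each node from its own state vector only, ignoring the adjacency entirely. The layer sizes follow the 3-layer, hidden-size-8 setup described above; this is our own illustration, not the authors' code.

```python
import torch.nn as nn

class MLPPolicy(nn.Module):
    """Ablation policy: per-node logits from single-node state features only,
    with no neighborhood aggregation."""
    def __init__(self, state_dim=5, hidden=8):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),              # one logit per node
        )

    def forward(self, state):                  # state: [N, state_dim]
        return self.net(state).squeeze(-1)     # softmax over candidates applied outside
```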
Contribution of State Features
We also investigate the contribution of the state features introduced in Section 3.2. We remove each of them from the state space to see how they influence the learned policy (a sketch of this feature-removal protocol is given after Table 4). We take the Reddit dataset as an example. As shown in Table 4, removing any of the features results in a performance drop, which validates the effectiveness of these features. Among the four features, the binary label indicator seems to contribute the most to the policy's performance. We believe the reason is that propagating the annotation information over the graph helps the policy better identify and model the under-explored areas in the graph.

Table 4: Contribution of each state feature. The policies are trained on Reddit {1, 2} and evaluated on Reddit {3, 4, 5}. Each row corresponds to removing one feature, while "GPA (full model)" means using all the features.

| Features | Reddit3 Micro-F1 | Reddit3 Macro-F1 | Reddit4 Micro-F1 | Reddit4 Macro-F1 | Reddit5 Micro-F1 | Reddit5 Macro-F1 |
|---|---|---|---|---|---|---|
| GPA (full model) | | | | | | |
| – Entropy | 92.48 | 92.12 | 90.12 | 86.94 | 90.35 | 89.88 |
| – Degree | 92.36 | 92.07 | 91.48 | | | |
| – KL | 92.76 | 92.47 | 91.12 | 88.23 | 91.29 | 90.98 |
| – Indicator | 91.11 | 90.20 | 88.90 | 86.12 | 90.72 | 90.41 |
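The ablation protocol itself is simple: drop one feature column from the state matrix before it reaches the policy. A minimal sketch follows; the column ordering is our assumption, and we assume (without the paper stating so explicitly) that the "– KL" row of Table 4 removes both KL directions at once.

```python
import torch

FEATURES = ["degree", "entropy", "kl", "reverse_kl", "indicator"]

def drop_feature(state, name):
    """Remove one state feature (by name) from the [N, 5] state matrix.
    `name="kl"` drops both KL columns, per our reading of Table 4."""
    drop = {"kl", "reverse_kl"} if name == "kl" else {name}
    keep = [i for i, f in enumerate(FEATURES) if f not in drop]
    return state[:, keep]
```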
Case Study. To better understand how the learned policy works, we conduct a case study by visualizing the node sequences selected by different methods on a toy graph collected from Reddit ($|V| = 85$, $|E| = 156$, $|\mathcal{Y}| = 4$). We apply the cross-domain policy learned in Section 4.3 to this toy graph 100 times with different classification network initializations, and report the most frequently selected node at each query step. For comparison, we also report the node sequences selected by AGE and the centrality criterion. Since ANRMAB needs to learn during the query process, it fails to perform meaningful selection on this small toy graph, and thus is not reported.

As shown in Figure 3, GPA better explores the whole graph compared to the other two methods, which helps it better model the data distribution and thus achieve much better performance. Specifically, the node sequences selected by AGE and the centrality criterion are very similar, both biased towards the nodes with large degree. On the contrary, GPA not only utilizes the nodes with large degree (e.g., nodes 1 and 2), but also fully explores the under-represented areas (e.g., nodes 10, 13, and 15) in the graph based on its annotation trajectory. It also learns to automatically switch between different classes to reach a class balance (e.g., no two consecutively selected nodes belong to the same class).

Conclusion

In this paper, we investigate active learning on graphs, which aims to reduce the annotation cost of training GNNs. We formulate the problem as an MDP and learn a transferable query policy with RL. We parameterize the policy network as a GNN to explicitly model graph structures and node interactions. Experimental results suggest that our approach can transfer between graphs in both single-domain and cross-domain settings, substantially outperforming competitive baselines. For future work, we consider dynamically adjusting the importance of different state features based on the current time step [3, 10], or incorporating global graph information into the policy.
Figure 3: Visualization of the node query process on a toy graph from Reddit. The query budget is 15. Each color represents a unique class. The annotated nodes are magnified, and the numbers represent at which step they are selected. The Micro-F1 score of each strategy is reported in the subfigure caption: (a) GPA, 90.41; (b) Centrality, 84.13; (c) AGE, 86.35. For comparison, random selection achieves a Micro-F1 score of 84.74.

Broader Impact
Graph-structured data are ubiquitous in the real world, covering a variety of domains and applications such as social science, biology, medicine, and political science. In many domains such as biology and medicine, annotating a large amount of labeled data can be extremely expensive and time-consuming. Therefore, the algorithm proposed in this paper could help significantly reduce the labeling effort in these domains: we can train systems on domains where labeled data are available, then transfer to those lower-resource domains.

We believe such systems can help accelerate research and development processes that usually take a long time, in domains such as drug development. They can potentially also lower the cost of such research by reducing the need for expert annotations.

However, we also acknowledge potential social and ethical issues related to our work.

1. Our proposed system can effectively reduce the need for human annotations. However, from a broader point of view, this can potentially lead to a reduction of employment opportunities, which may cause layoffs among data annotators.

2. GNNs are widely used in domains related to critical needs such as healthcare and drug development. The community needs to be extra cautious and rigorous, since any mistake may cause harm to patients.

3. Training the policy network for active learning on multiple graphs is relatively time- and computational-resource-consuming. This line of research may produce a larger carbon footprint compared to some other work. Therefore, how to accelerate the training process by developing more efficient algorithms requires further investigation.

Nonetheless, we believe that the directions of active learning and transfer learning provide a hopeful path towards our ultimate goal of data efficiency and interpretable machine learning.
References

[1] Charu C Aggarwal, Xiangnan Kong, Quanquan Gu, Jiawei Han, and S Yu Philip. Active learning: A survey. In Data Classification, pages 599–634. Chapman and Hall/CRC, 2014.
[2] Philip Bachman, Alessandro Sordoni, and Adam Trischler. Learning algorithms for active learning. In Proceedings of the 34th International Conference on Machine Learning, pages 301–310, 2017.
[3] HongYun Cai, Vincent Wenchen Zheng, and Kevin Chen-Chuan Chang. Active learning for graph embedding. arXiv preprint arXiv:1705.05085, 2017.
[4] Zhu Cao, Linlin Wang, and Gerard De Melo. Link prediction via subgraph embedding-based convex matrix completion. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
[5] Xia Chen, Guoxian Yu, Jun Wang, Carlotta Domeniconi, Zhao Li, and Xiangliang Zhang. ActiveHNE: Active heterogeneous network embedding. In Proceedings of the 28th International Joint Conference on Artificial Intelligence, pages 2123–2129, 2019.
[6] Edward Choi, Mohammad Taha Bahadori, Le Song, Walter F Stewart, and Jimeng Sun. GRAM: Graph-based attention model for healthcare representation learning. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 787–795, 2017.
[7] Meng Fang, Yuan Li, and Trevor Cohn. Learning how to active learn: A deep reinforcement learning approach. arXiv preprint arXiv:1708.02383, 2017.
[8] Akshay Gadde, Aamir Anis, and Antonio Ortega. Active semi-supervised learning using sampling theory for graph signals. arXiv preprint arXiv:1405.4324, 2014.
[9] Yarin Gal, Riashat Islam, and Zoubin Ghahramani. Deep Bayesian active learning with image data. In Proceedings of the 34th International Conference on Machine Learning, pages 1183–1192, 2017.
[10] Li Gao, Hong Yang, Chuan Zhou, Jia Wu, Shirui Pan, and Yue Hu. Active discriminative network representation learning. In Proceedings of the 27th International Joint Conference on Artificial Intelligence, pages 2142–2148, 2018.
[11] Justin Gilmer, Samuel S Schoenholz, Patrick F Riley, Oriol Vinyals, and George E Dahl. Neural message passing for quantum chemistry. In Proceedings of the 34th International Conference on Machine Learning, pages 1263–1272, 2017.
[12] Quanquan Gu, Charu Aggarwal, Jialu Liu, and Jiawei Han. Selective sampling on graphs for classification. In Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 131–139, 2013.
[13] Ming Ji and Jiawei Han. A variance minimization criterion to active learning on graphs. In Artificial Intelligence and Statistics, pages 556–564, 2012.
[14] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[15] Thomas N Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907, 2016.
[16] Qimai Li, Xiao-Ming Wu, Han Liu, Xiaotong Zhang, and Zhichao Guan. Label efficient semi-supervised learning via graph filtering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 9582–9591, 2019.
[17] Ming Liu, Wray Buntine, and Gholamreza Haffari. Learning how to actively learn: A deep imitation learning approach. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1874–1883, 2018.
[18] Ming Liu, Wray Buntine, and Gholamreza Haffari. Learning to actively learn neural machine translation. In Proceedings of the 22nd Conference on Computational Natural Language Learning, pages 334–344, 2018.
[19] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.
[20] Prithviraj Sen, Galileo Namata, Mustafa Bilgic, Lise Getoor, Brian Galligher, and Tina Eliassi-Rad. Collective classification in network data. AI Magazine, 29(3):93–93, 2008.
[21] Ozan Sener and Silvio Savarese. Active learning for convolutional neural networks: A core-set approach. arXiv preprint arXiv:1708.00489, 2017.
[22] Burr Settles. Active learning literature survey. Technical report, University of Wisconsin-Madison Department of Computer Sciences, 2009.
[23] Oleksandr Shchur, Maximilian Mumme, Aleksandar Bojchevski, and Stephan Günnemann. Pitfalls of graph neural network evaluation. Relational Representation Learning Workshop, NeurIPS 2018, 2018.
[24] Yanyao Shen, Hyokun Yun, Zachary C Lipton, Yakov Kronrod, and Animashree Anandkumar. Deep active learning for named entity recognition. arXiv preprint arXiv:1707.05928, 2017.
[25] Mel Silberman. Active Learning: 101 Strategies To Teach Any Subject. ERIC, 1996.
[26] Richard S Sutton, David A McAllester, Satinder P Singh, and Yishay Mansour. Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems, pages 1057–1063, 2000.
[27] Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Lio, and Yoshua Bengio. Graph attention networks. arXiv preprint arXiv:1710.10903, 2017.
[28] Ronald J Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3-4):229–256, 1992.
[29] Yuexin Wu, Yichong Xu, Aarti Singh, Yiming Yang, and Artur Dubrawski. Active learning for graph neural networks via node feature propagation. arXiv preprint arXiv:1910.07567, 2019.
[30] Zonghan Wu, Shirui Pan, Fengwen Chen, Guodong Long, Chengqi Zhang, and Philip S Yu. A comprehensive survey on graph neural networks. arXiv preprint arXiv:1901.00596, 2019.
[31] Zhilin Yang, William W Cohen, and Ruslan Salakhutdinov. Revisiting semi-supervised learning with graph embeddings. arXiv preprint arXiv:1603.08861, 2016.
[32] Muhan Zhang and Yixin Chen. Link prediction based on graph neural networks. In Advances in Neural Information Processing Systems, pages 5165–5175, 2018.
[33] Jie Zhou, Ganqu Cui, Zhengyan Zhang, Cheng Yang, Zhiyuan Liu, Lifeng Wang, Changcheng Li, and Maosong Sun. Graph neural networks: A review of methods and applications. arXiv preprint arXiv:1812.08434, 2018.
A Algorithm details

In this appendix, we present the pseudo-code of our approach for both policy training (Algorithm 1) and evaluation (Algorithm 2).
Algorithm 1: Train the policy on multiple labeled source graphs

```
Input:  the labeled source graphs G_S = {G_i}, the query budget on each graph {B_i},
        the maximal number of training episodes M
Result: the query policy pi_theta

Randomly initialize pi_theta
for episode = 1 to M do
    for G_i in G_S do
        V^0_label <- empty set
        Randomly initialize the classification GNN as f_{G_i, V^0_label}
        for t = 1 to B_i do
            Generate the graph state S^t_{G_i} based on f_{G_i, V^{t-1}_label} and G_i
            Compute the probability distribution over candidate nodes p^t_{G_i} = pi_theta(S^t_{G_i})
            Sample an unlabeled node v^t ~ p^t_{G_i} from V_train \ V^{t-1}_label and query its label
            V^t_label <- V^{t-1}_label U {v^t}
            Train the classification GNN for one more epoch to get f_{G_i, V^t_label}
        end
        Train the classification GNN f_{G_i, V^{B_i}_label} until convergence
        Evaluate f_{G_i, V^{B_i}_label} on the validation set to get the reward R(V^{B_i}_label)
        Use R(V^{B_i}_label) to update pi_theta with policy gradient
    end
end
```

Algorithm 2: Evaluate the learned policy on an unlabeled test graph

```
Input:  the unlabeled test graph G, the learned policy pi_theta, the query budget B on graph G
Result: the classification GNN f trained on the selected node sequence tau

tau <- []
V^0_label <- empty set
Randomly initialize the classification GNN as f_{G, V^0_label}
for t = 1 to B do
    Generate the graph state S^t_G based on f_{G, V^{t-1}_label} and G
    Compute the probability distribution over candidate nodes p^t_G = pi_theta(S^t_G)
    Select v^t = argmax_v p^t_G(v) from V_train \ V^{t-1}_label and query its label
    V^t_label <- V^{t-1}_label U {v^t}
    Train the classification GNN for one more epoch to get f_{G, V^t_label}
    tau.append(v^t)
end
Train f_{G, V^B_label} until convergence
Evaluate the converged classification GNN f on the test set V_test of G
```
B Dataset descriptions

Here we present the details of the datasets used in our experiments.

Table 5: Statistics of the datasets used in our experiments. For Reddit, * represents the average value over all individual graphs. The Budget column shows the query budget on each graph, which is set to 5 × #classes by default. We use Physics and CS as the abbreviations of Coauthor-Physics and Coauthor-CS respectively.

| Dataset | Nodes | Edges | Features | Classes | Budget |
|---|---|---|---|---|---|
| Cora | 2708 | 5278 | 1433 | 7 | 35 |
| Citeseer | 3327 | 4676 | 3703 | 6 | 30 |
| Pubmed | 19718 | 44327 | 500 | 3 | 15 |
| Physics | 34493 | 247962 | 2000 | 5 | 25 |
| CS | 18333 | 81894 | 6805 | 15 | 75 |
| Reddit | 4017.6* | 28697.6* | 300 | 10 | 50 |
For transferable active learning on graphs from the same domain, we use a multi-graph dataset collected from Reddit, an online forum where users create posts and comment on them. We use the January 2014 dump of Reddit posts (downloaded from https://bit.ly/3bumUtv and https://bit.ly/2Spg6G2) as the raw data. To generate the corresponding post-connection graph, we regard the posts as nodes, and connect two posts with an edge if they are both commented on or posted by the same two users, instead of only one user. If we did not impose this restriction, all posts commented on or posted by one user would be fully connected, resulting in large cliques in the graph. We conduct the following preprocessing steps (a code sketch of the edge-construction criterion is given at the end of this section):

1. Delete the anonymous posts.
2. Sort the posts by their creation time and separate every 300,000 posts into a group.
3. For each group, sort the subreddits by the total number of posts belonging to each subreddit. We exclude the subreddits which have either too many or too few posts: we choose the subreddits whose post numbers rank between 11 and 20 and remove the posts that do not belong to these subreddits.
4. Build a graph for each group based on the edge-connection criterion.
5. Keep the largest connected component of each graph.

The resulting graphs consist of 4017.6 nodes and 28697.6 edges on average. For the node features, we concatenate each post's title and its description as the feature text, and use 300-dimensional GloVe CommonCrawl word vectors (http://nlp.stanford.edu/data/wordvecs/glove.840B.300d.zip) to compute the average word embedding of the text as the node features.

For transferable active learning on graphs from different domains, we use 5 benchmark datasets in addition to Reddit. Cora, Citeseer, and Pubmed [20] contain citation networks of scientific publications, where each node represents a publication as a sparse bag-of-words feature vector and each edge corresponds to a citation link. Coauthor-Physics and Coauthor-CS [23] are co-authorship graphs, where the nodes represent authors and the edges indicate that two authors have co-authored a paper. Each node is represented by a bag-of-words vector of the keywords in the author's papers, while its label indicates the most active research field of the author.

The statistics of these datasets are shown in Table 5.
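The edge-construction criterion above (connect two posts if and only if at least two distinct users interacted with both) can be sketched as follows. The input format `post_users`, mapping each post to the set of users who posted or commented on it, is a hypothetical representation of ours, not a format the paper specifies.

```python
from collections import defaultdict
from itertools import combinations

def build_post_graph(post_users):
    """Return the set of edges (post_a, post_b) such that at least two distinct
    users posted or commented on both posts. post_users: dict post_id -> set of user_ids."""
    posts_by_user = defaultdict(set)
    for post, users in post_users.items():
        for u in users:
            posts_by_user[u].add(post)
    shared = defaultdict(set)            # (post_a, post_b) -> users active on both
    for u, posts in posts_by_user.items():
        for a, b in combinations(sorted(posts), 2):
            shared[(a, b)].add(u)
    # requiring >= 2 shared users avoids the large cliques that a single
    # shared user would otherwise create
    return {pair for pair, users in shared.items() if len(users) >= 2}
```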
C Additional experimental results

In this appendix, we report additional experimental results of transferable active learning on graphs across different domains. In Table 6, we show the results of training on Cora + Pubmed and testing on the remaining graphs. In Table 7, we show the results of training on Citeseer + Pubmed and testing on the remaining graphs. We observe consistent trends compared to the results reported in Section 4.3: our proposed method significantly outperforms the random baseline and the two active learning baselines, which suggests the effectiveness of our proposed method.

Table 6: Results of transferable active learning on graphs from different domains. Train on Cora + Pubmed, and test on the remaining graphs.

| Method | Metric | Citeseer | Reddit1 | Reddit2 | Reddit3 | Reddit4 | Reddit5 | Physics | CS |
|---|---|---|---|---|---|---|---|---|---|
| Random | Micro-F1 | | | | | | | | |
| | Macro-F1 | | | | | | | | |

Table 7: Results of transferable active learning on graphs from different domains. Train on Citeseer + Pubmed, and test on the remaining graphs.

| Method | Metric | Cora | Reddit1 | Reddit2 | Reddit3 | Reddit4 | Reddit5 | Physics | CS |
|---|---|---|---|---|---|---|---|---|---|
| Random | Micro-F1 | | | | | | | | |
| | Macro-F1 | 71.22 | 87.11 | 94.87 | 91.74 | 88.97 | 90.14 | 81.20 | 83.90 |