Self-supervised Auxiliary Learning with Meta-paths for Heterogeneous Graphs
Dasol Hwang∗, Jinyoung Park∗, Sunyoung Kwon, Kyung-Min Kim, Jung-Woo Ha, Hyunwoo J. Kim
Korea University; Clova AI Research, Naver Corp

Abstract
Graph neural networks have shown superior performance in a wide range of applications by providing a powerful representation of graph-structured data. Recent works show that the representation can be further improved by auxiliary tasks. However, auxiliary tasks for heterogeneous graphs, which contain rich semantic information with various types of nodes and edges, have been less explored in the literature. In this paper, to learn graph neural networks on heterogeneous graphs, we propose a novel self-supervised auxiliary learning method using meta-paths, which are composite relations of multiple edge types. Our proposed method learns to learn a primary task by predicting meta-paths as auxiliary tasks; this can be viewed as a type of meta-learning. The proposed method can identify an effective combination of auxiliary tasks and automatically balance them to improve the primary task. Our methods can be applied to any graph neural network in a plug-in manner without manual labeling or additional data. The experiments demonstrate that the proposed method consistently improves the performance of link prediction and node classification on heterogeneous graphs.
∗ First two authors have equal contribution.

1 Introduction

Graph neural networks [1–3] have been proven effective for learning representations for various tasks such as node classification [4], link prediction [5, 6], and graph classification [7, 8]. The powerful representation yields state-of-the-art performance in a variety of applications including social network analysis [9, 4, 10], citation network analysis [11, 12], visual understanding [13–15], recommender systems [16–18], physics [19, 20], and drug discovery [21, 22]. Despite the wide operating range of graph neural networks, employing auxiliary (pre-text) tasks has been less explored for further improving graph representation learning.

Pre-training with an auxiliary task is a common technique for deep neural networks. Indeed, it is the de facto standard step in natural language processing and computer vision to learn powerful backbone networks such as BERT [23] and ResNet [24] by leveraging large datasets such as BooksCorpus [25], English Wikipedia, and ImageNet [26]. The models trained on the auxiliary task are often beneficial for the primary (target) task of interest. Despite the success of pre-training, few approaches have been generalized to graph-structured data due to fundamental challenges. First, graph structure (e.g., the number of nodes/edges, and diameter) and its meaning can significantly differ between domains, so a model trained on an auxiliary task can harm generalization on the primary task, i.e., negative transfer [27]. Also, many graph neural networks are transductive approaches, which often makes transfer learning between datasets inherently infeasible. So, pre-training on the target dataset has been proposed using auxiliary tasks: graph kernel [28], graph reconstruction [29], and attribute masking [21]. These assume that the auxiliary tasks for pre-training are carefully selected with substantial domain knowledge and expertise in graph characteristics to assist the primary task. Since most graph neural networks operate on homogeneous graphs, which have a single type of nodes and edges, the previous pre-training/auxiliary tasks are not specifically designed for heterogeneous graphs, which have multiple types of nodes and edges. Heterogeneous graphs commonly occur in real-world applications; for instance, a music dataset has multiple types of nodes (e.g., user, song, artist) and multiple types of relations (e.g., user-artist, song-film, song-instrument).

In this paper, we propose a framework to train graph neural networks with automatically selected self-supervised auxiliary tasks that assist the target task without additional data or labels. Our approach first generates meta-paths from heterogeneous graphs without manual labeling and trains a model with meta-path prediction to assist the primary task such as link prediction or node classification. This can be formulated as a meta-learning problem. Furthermore, our method can be adopted by existing GNNs in a plug-in manner, enhancing model performance.

Our contribution is threefold: (i) We propose a self-supervised learning method on heterogeneous graphs via meta-path prediction without additional data. (ii) Our framework automatically selects meta-paths (auxiliary tasks) to assist the primary task via meta-learning. (iii) We develop the Hint Network, which helps the learner network benefit from challenging auxiliary tasks. To the best of our knowledge, this is the first auxiliary task with meta-paths specifically designed for leveraging heterogeneous graph structure. Our experiments show that meta-path prediction improves representational power, and the gain can be further increased by explicitly optimizing the auxiliary tasks for the primary task via meta-learning and the Hint Network, built on various state-of-the-art GNNs.
2 Related Work

Graph Neural Networks have provided promising results for various tasks [16–18, 30–32]. Bruna et al. [33] proposed a neural network that performs convolution on the graph domain using a Fourier basis from spectral graph theory. In contrast, non-spectral (spatial) approaches have been developed [12, 11, 4, 34]. Inspired by self-supervised learning [35–38] and pre-training [23, 39] in computer vision and natural language processing, pre-training for GNNs has recently been proposed [21, 28]. Recent works [34, 40] show promising results that transfer learning can be successful on graphs, but they require additional manually labeled data. To avoid the need for manual labeling, self-supervised learning on the target domain, such as graph kernel [28], graph reconstruction [29], and attribute masking [21], has been proposed. However, these auxiliary tasks must be manually chosen with domain knowledge and are not optimized for the primary task.
Auxiliary Learning is a learning strategy that employs auxiliary tasks to assist a primary task. It is similar to multi-task learning, but auxiliary learning cares only about the performance of the primary task. A number of auxiliary learning methods have been proposed for a wide range of tasks [41–43]. AC-GAN [44] proposed an auxiliary classifier for generative models. Recently, Meta-Auxiliary Learning [45] proposed an elegant solution to generate new auxiliary tasks by collapsing existing classes. However, it is not applicable to some tasks such as link prediction, which has only one positive class. Our approach generates meta-paths on heterogeneous graphs to make new labels and trains models to predict meta-paths as auxiliary tasks.
Meta-learning aims at learning to learn models efficiently and effectively, and generalizes the learning strategy to new tasks. Meta-learning includes black-box methods that approximate gradients without any information about models [46], optimization-based methods that learn an optimal initialization for adapting to new tasks [47–50] or that learn loss functions [48, 51], and metric-learning or non-parametric methods for few-shot learning [52–54]. In contrast to classical learning algorithms that generalize across samples, meta-learning generalizes across tasks. In this paper, we use meta-learning to learn a concept across tasks and transfer the knowledge from auxiliary tasks to the primary task.
3 Method

The goal of our framework is to learn with multiple auxiliary tasks to improve the performance of the primary task. In this work, we demonstrate our framework with meta-path predictions as auxiliary tasks, but the framework could be extended to include other auxiliary tasks. Meta-paths capture diverse and meaningful relations between nodes on heterogeneous graphs [55]. However, learning with auxiliary tasks poses multiple challenges: identifying useful auxiliary tasks, balancing the auxiliary tasks with the primary task, and converting challenging auxiliary tasks into solvable (and relevant) tasks. To address these challenges, we propose SELf-supervised Auxiliary Learning (SELAR). Our framework consists of two main components: 1) learning weight functions to softly select auxiliary tasks and balance them with the primary task via meta-learning, and 2) learning Hint Networks to convert challenging auxiliary tasks into tasks that are more relevant and solvable for the primary task learner.
3.1 Meta-path Prediction

Most existing graph neural networks have been studied with a focus on homogeneous graphs, which have a single type of nodes and edges. However, in real-world applications, heterogeneous graphs [56], which have multiple types of nodes and edges, commonly occur. Learning models on heterogeneous graphs requires different considerations to effectively represent their node and edge heterogeneity.
Heterogeneous graph [57]. Let $G = (V, E)$ be a graph with a set of nodes $V$ and edges $E$. A heterogeneous graph is a graph equipped with a node type mapping function $f_v : V \to \mathcal{T}_v$ and an edge type mapping function $f_e : E \to \mathcal{T}_e$, where $\mathcal{T}_v$ is a set of node types and $\mathcal{T}_e$ is a set of edge types. Each node $v_i \in V$ (and each edge $e_{ij} \in E$, resp.) has one node type, i.e., $f_v(v_i) \in \mathcal{T}_v$ (and one edge type $f_e(e_{ij}) \in \mathcal{T}_e$, resp.). In this paper, we consider heterogeneous graphs with $|\mathcal{T}_e| > 1$ or $|\mathcal{T}_v| > 1$. When $|\mathcal{T}_e| = 1$ and $|\mathcal{T}_v| = 1$, the graph is homogeneous.

Meta-path [55, 58]. A meta-path is a path on a heterogeneous graph $G$, i.e., a sequence of nodes connected with heterogeneous edges, $v_1 \xrightarrow{t_1} v_2 \xrightarrow{t_2} \cdots \xrightarrow{t_l} v_{l+1}$, where $t_l \in \mathcal{T}_e$ denotes the $l$-th edge type of the meta-path. The meta-path can be viewed as a composite relation $R = t_1 \circ t_2 \circ \cdots \circ t_l$ between nodes $v_1$ and $v_{l+1}$, where $R_1 \circ R_2$ denotes the composition of relations $R_1$ and $R_2$. The definition of meta-path generalizes multi-hop connections and has been shown to be useful for analyzing heterogeneous graphs. For instance, in the Book-Crossing dataset, the meta-path 'user-item-written.series-item-user' connects users who like the same book series.

We introduce meta-path prediction as a self-supervised auxiliary task to improve the representational power of graph neural networks. To our knowledge, meta-path prediction has not been studied in the context of self-supervised learning for graph neural networks in the literature. Meta-path prediction is similar to link prediction, but meta-paths allow heterogeneous composite relations. Meta-path prediction can be performed in the same manner as link prediction: if two nodes $u$ and $v$ are connected by a meta-path $p$ with the heterogeneous edges $(t_1, t_2, \ldots, t_l)$, then $y^p_{u,v} = 1$, otherwise $y^p_{u,v} = 0$. The labels can be generated from a heterogeneous graph without any manual labeling: they are obtained by $A^p = A_{t_l} \cdots A_{t_2} A_{t_1}$, where $A_t$ is the adjacency matrix of edge type $t$. The binarized value at $(u, v)$ in $A^p$ indicates whether $u$ and $v$ are connected by the meta-path $p$. In this paper, we use meta-path prediction as a self-supervised auxiliary task.

Let $X \in \mathbb{R}^{|V| \times d}$ and $Z \in \mathbb{R}^{|V| \times d'}$ be input features and their hidden representations learnt by a GNN $f$, i.e., $Z = f(X; \mathbf{w}, A)$, where $\mathbf{w}$ is the parameter of $f$ and $A \in \mathbb{R}^{|V| \times |V|}$ is the adjacency matrix. Then link prediction and meta-path prediction are obtained by a simple operation as

$$\hat{y}^t_{u,v} = \sigma\big(\Phi_t(z_u)^\top \Phi_t(z_v)\big), \quad (1)$$

where $\Phi_t$ is the task-specific network for task $t \in \mathcal{T}$, and $z_u$ and $z_v$ are the node embeddings of nodes $u$ and $v$, e.g., $\Phi_0$ (and $\Phi_1$, resp.) for link prediction (and the first type of meta-path prediction, resp.). The architecture is shown in Fig. 1. To optimize the model, cross entropy is used, as in link prediction. The graph neural network $f$ is shared by the link prediction and meta-path predictions. As in any auxiliary learning method, the meta-paths (auxiliary tasks) should be carefully chosen and properly weighted so that meta-path prediction does not compete with link prediction, especially when the capacity of the GNN is limited. To address these issues, we propose a framework that automatically selects meta-paths and balances them with the link prediction via meta-learning.
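To make the label construction and the task head concrete, below is a minimal sketch assuming dense PyTorch tensors; the helper names and the toy 'user-item' graph are illustrative, not taken from the authors' released code.

```python
import torch
import torch.nn as nn

def meta_path_labels(adjs_by_type, meta_path):
    # With row-source adjacencies (A_t[u, v] = 1 iff a t-edge goes u -> v),
    # the path-count matrix is A_{t_1} A_{t_2} ... A_{t_l}; this matches the
    # paper's A^p = A_{t_l} ... A_{t_1} up to the indexing convention.
    A_p = adjs_by_type[meta_path[0]]
    for t in meta_path[1:]:
        A_p = A_p @ adjs_by_type[t]
    # Binarize path counts: y^p_{u,v} = 1 iff u and v are meta-path connected.
    return (A_p > 0).float()

class TaskHead(nn.Module):
    """Task-specific network Phi_t of Eq. (1): y_hat = sigma(Phi_t(z_u)^T Phi_t(z_v))."""
    def __init__(self, dim):
        super().__init__()
        self.phi = nn.Linear(dim, dim)

    def forward(self, z_u, z_v):
        return torch.sigmoid((self.phi(z_u) * self.phi(z_v)).sum(dim=-1))

# Toy heterogeneous graph: 6 nodes, users 0-2 and items 3-5 (hypothetical).
n = 6
A_ui = torch.zeros(n, n)
A_ui[0, 3] = A_ui[1, 3] = A_ui[2, 4] = 1.0          # user -> item edges
adjs = {"user-item": A_ui, "item-user": A_ui.T}      # reverse type is the transpose
y = meta_path_labels(adjs, ("user-item", "item-user"))  # users sharing an item
```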
3.2 Self-supervised Auxiliary Learning (SELAR)

Figure 1: The SELAR framework for self-supervised auxiliary learning. Our framework learns how to balance (or softly select) auxiliary tasks to improve the primary task via meta-learning. In this paper, the primary task is link prediction (or node classification) and the auxiliary tasks are meta-path predictions that capture the rich information of a heterogeneous graph.

Our framework SELAR is learning to learn a primary task with multiple auxiliary tasks that assist the primary task. This can be formally written as

$$\min_{\mathbf{w}, \Theta} \; \mathbb{E}_{(x,y) \sim \mathcal{D}^{pr}} \big[ \mathcal{L}^{pr}(\mathbf{w}^*(\Theta)) \big] \quad \text{s.t.} \quad \mathbf{w}^*(\Theta) = \operatorname*{argmin}_{\mathbf{w}} \; \mathbb{E}_{(x,y) \sim \mathcal{D}^{pr+au}} \big[ \mathcal{L}^{pr+au}(\mathbf{w}; \Theta) \big], \quad (2)$$

where $\mathcal{L}^{pr}(\cdot)$ is the primary task loss function that evaluates the trained model $f(x; \mathbf{w}^*(\Theta))$ on meta-data $\mathcal{D}^{pr}$, and $\mathcal{L}^{pr+au}$ is the loss function to train a model on training data $\mathcal{D}^{pr+au}$ with the primary and auxiliary tasks. To avoid cluttered notation, $f$, $x$, and $y$ are omitted. Each task $\mathcal{T}_t$ has $N_t$ samples, and $\mathcal{T}_0$ and $\{\mathcal{T}_t\}_{t=1}^{T}$ denote the primary and auxiliary tasks respectively. The proposed formulation in Eq. (2) learns how to assist the primary task by optimizing $\Theta$ via meta-learning. The nested optimization problem given $\Theta$ is a regular training with properly adjusted loss functions to balance the primary and auxiliary tasks. The formulation can be written more specifically as

$$\min_{\mathbf{w}, \Theta} \; \frac{1}{M} \sum_{i=1}^{M} \ell\big(y_i^{(0,\mathrm{meta})}, f(x_i^{(0,\mathrm{meta})}; \mathbf{w}^*(\Theta))\big) \quad (3)$$

$$\text{s.t.} \quad \mathbf{w}^*(\Theta) = \operatorname*{argmin}_{\mathbf{w}} \; \sum_{t=0}^{T} \sum_{i=1}^{N_t} \frac{1}{N_t} \, \mathcal{V}\big(\xi_i^{(t,\mathrm{train})}; \Theta\big) \, \ell^t\big(y_i^{(t,\mathrm{train})}, f^t(x_i^{(t,\mathrm{train})}; \mathbf{w})\big), \quad (4)$$

where $\ell^t$ and $f^t$ denote the loss function and the model for task $t$. We overload $\ell^t$ with its function value, i.e., $\ell^t = \ell^t(y_i^{(t,\mathrm{train})}, f^t(x_i^{(t,\mathrm{train})}; \mathbf{w}))$. $\xi_i^{(t,\mathrm{train})}$ is the embedding vector of the $i$-th sample for task $t$. In our experiments, $\xi_i^{(t,\mathrm{train})}$ is the concatenation of the loss value, the one-hot representation of the task type, and the label of the sample (positive/negative), i.e., $\xi_i^{(t,\mathrm{train})} = [\ell^t; e^t; y_i^{(t,\mathrm{train})}] \in \mathbb{R}^{T+2}$.

To derive our learning algorithm, we first shorten the objective functions in Eq. (3) and Eq. (4) as $\mathcal{L}^{pr}(\mathbf{w}^*(\Theta))$ and $\mathcal{L}^{pr+au}(\mathbf{w}; \Theta)$; this is equivalent to Eq. (2) without expectation. Then our formulation is given as

$$\min_{\mathbf{w}, \Theta} \; \mathcal{L}^{pr}(\mathbf{w}^*(\Theta)) \quad \text{s.t.} \quad \mathbf{w}^*(\Theta) = \operatorname*{argmin}_{\mathbf{w}} \; \mathcal{L}^{pr+au}(\mathbf{w}; \Theta). \quad (5)$$

To circumvent the difficulty of the bi-level optimization, as in previous works [47, 48] on meta-learning, we approximate $\mathbf{w}^*$ with the updated parameters $\hat{\mathbf{w}}$ from one gradient descent step,

$$\mathbf{w}^*(\Theta) \approx \hat{\mathbf{w}}^k(\Theta) = \mathbf{w}^k - \alpha \nabla_{\mathbf{w}} \mathcal{L}^{pr+au}(\mathbf{w}^k; \Theta), \quad (6)$$

where $\alpha$ is the learning rate for $\mathbf{w}$. We do not numerically evaluate $\hat{\mathbf{w}}^k(\Theta)$; instead, we plug the computational graph of $\hat{\mathbf{w}}^k$ into $\mathcal{L}^{pr}(\mathbf{w}^*(\Theta))$ to optimize $\Theta$. Let $\nabla_\Theta \mathcal{L}^{pr}(\hat{\mathbf{w}}^k(\Theta^k))$ be the gradient evaluated at $\Theta^k$. Then the parameters $\Theta$ are updated as

$$\Theta^{k+1} = \Theta^k - \beta \nabla_\Theta \mathcal{L}^{pr}(\hat{\mathbf{w}}^k(\Theta^k)), \quad (7)$$

where $\beta$ is the learning rate for $\Theta$. This update allows softly selecting useful auxiliary tasks (meta-paths) and balancing them with the primary task to improve the performance of the primary task. Without balancing the tasks with the weighting function $\mathcal{V}(\cdot; \Theta)$, auxiliary tasks can dominate training and degrade the performance of the primary task. The model parameters $\mathbf{w}^k$ for the tasks can then be updated with the optimized $\Theta^{k+1}$ from Eq. (7) as

$$\mathbf{w}^{k+1} = \mathbf{w}^k - \alpha \nabla_{\mathbf{w}} \mathcal{L}^{pr+au}(\mathbf{w}^k; \Theta^{k+1}). \quad (8)$$

Remarks. The proposed formulation can suffer from meta-overfitting [59, 60], meaning that the parameters $\Theta$, which learn weights for softly selecting meta-paths and balancing the tasks with the primary task, can overfit to the small meta-dataset.
In our experiments, we found that this overfitting can be alleviated by meta-validation sets [59]. To learn $\Theta$ that generalizes across meta-training sets, we optimize $\Theta$ across $C$ different meta-datasets, like $k$-fold cross validation, using the following update:

$$\Theta^{k+1} = \Theta^k - \beta \, \mathbb{E}_{\mathcal{D}^{pr} \sim \mathrm{CV}} \big[ \nabla_\Theta \mathcal{L}^{pr}(\hat{\mathbf{w}}^k(\Theta^k)) \big], \quad (9)$$

where $\mathcal{D}^{pr} \sim \mathrm{CV}$ denotes a meta-dataset from cross validation. We use 3-fold cross validation, and the gradients of $\Theta$ w.r.t. the different meta-datasets are averaged to update $\Theta^k$; see Algorithm 1. The cross validation is crucial to alleviate meta-overfitting; more discussion is in Section 4.3.
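Before the algorithm, here is a minimal sketch of one possible parameterization of the weighting function $\mathcal{V}(\xi; \Theta)$, following the MW-Net-style MLP of [48]; the hidden width and layer choices are assumptions, not the authors' exact architecture.

```python
import torch
import torch.nn as nn

class WeightingNet(nn.Module):
    """V(xi; Theta): maps a sample embedding xi to a weight in (0, 1).
    feat_dim = T + 2 for xi = [loss; task one-hot; label] as in Section 3.2.
    The single 100-unit hidden layer mirrors MW-Net [48] (an assumption here)."""
    def __init__(self, feat_dim, hidden=100):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
            nn.Sigmoid(),  # weights in (0, 1), so tasks are softly selected
        )

    def forward(self, xi):
        return self.net(xi).squeeze(-1)

def sample_embedding(loss, task_onehot, label):
    # xi = [l^t; e^t; y] as defined below Eq. (4).
    return torch.cat([loss.unsqueeze(-1), task_onehot, label.unsqueeze(-1)], dim=-1)
```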
Algorithm 1: Self-supervised Auxiliary Learning
Input: training data for the primary/auxiliary tasks D^pr, D^au; mini-batch sizes N^pr, N^au
Input: max iterations K; number of folds C
Output: network parameters w^K for the primary task
for k = 1 to K do
    D^pr_m ← MiniBatchSampler(D^pr, N^pr)
    D^au_m ← MiniBatchSampler(D^au, N^au)
    for c = 1 to C do                                          ▷ meta-learning with cross validation
        D^{pr(train)}_m, D^{pr(meta)}_m ← CVSplit(D^pr_m, c)   ▷ split data for CV
        ŵ^k(Θ) ← w^k − α∇_w L^{pr+au}(w^k; Θ) with D^{pr(train)}_m ∪ D^au_m   ▷ Eq. (6)
        g_c ← ∇_Θ L^{pr}(ŵ^k(Θ^k)) with D^{pr(meta)}_m         ▷ Eq. (7)
    end for
    Θ^{k+1} ← Θ^k − β Σ_{c=1}^{C} g_c                          ▷ Eq. (9)
    w^{k+1} ← w^k − α∇_w L^{pr+au}(w^k; Θ^{k+1}) with D^pr_m ∪ D^au_m   ▷ Eq. (8)
end for
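A minimal single-fold (C = 1) sketch of one Algorithm 1 iteration follows; the use of torch.func and the batch layout are assumptions for illustration, not the authors' implementation. The key point is that the one-step update ŵ(Θ) of Eq. (6) is built with create_graph=True so that the meta-gradient of Eq. (7) can flow into Θ.

```python
import torch
from torch.func import functional_call

def weighted_loss(params, model, vnet, batch):
    # L^{pr+au}(w; Theta): per-sample losses reweighted by V(xi; Theta), Eq. (4).
    x, y, task_onehot = batch  # assumes the model maps x to one logit per sample
    out = functional_call(model, params, (x,)).squeeze(-1)
    per_sample = torch.nn.functional.binary_cross_entropy_with_logits(
        out, y, reduction="none")
    # xi = [loss; task one-hot; label]; the loss feature is detached so V sees
    # it as an input, while the weight still multiplies the live loss.
    xi = torch.cat([per_sample.detach().unsqueeze(-1), task_onehot,
                    y.unsqueeze(-1)], dim=-1)
    return (vnet(xi) * per_sample).mean()

def selar_step(model, vnet, opt_w, opt_theta, train_batch, meta_batch, alpha):
    params = dict(model.named_parameters())
    # Eq. (6): differentiable one-step update w_hat(Theta).
    inner = weighted_loss(params, model, vnet, train_batch)
    grads = torch.autograd.grad(inner, list(params.values()), create_graph=True)
    w_hat = {k: v - alpha * g for (k, v), g in zip(params.items(), grads)}
    # Eq. (7): update Theta with the primary-task loss on meta-data.
    x_m, y_m, _ = meta_batch
    out_m = functional_call(model, w_hat, (x_m,)).squeeze(-1)
    meta = torch.nn.functional.binary_cross_entropy_with_logits(out_m, y_m)
    opt_theta.zero_grad(); meta.backward(); opt_theta.step()
    # Eq. (8): regular update of w with the freshly updated Theta.
    opt_w.zero_grad()
    final = weighted_loss(dict(model.named_parameters()), model, vnet, train_batch)
    final.backward(); opt_w.step()
```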
3.3 Hint Networks

Figure 2: HintNet helps the learner network to learn even from challenging and remotely relevant auxiliary tasks. As our framework selects effective auxiliary tasks, the framework with HintNet learns $\mathcal{V}_H$ to decide, via meta-learning, whether to use the hint $\hat{y}_H$ from HintNet; $\hat{y}$ denotes the prediction from the learner network.

Meta-path prediction is generally more challenging than link prediction and node classification since it requires understanding long-range relations across heterogeneous nodes. Meta-path prediction becomes even more difficult when mini-batch training is inevitable due to the size of datasets or models: within a mini-batch, important nodes and edges for meta-paths may not be available. Also, a small learner network, e.g., a two-layer GNN with a limited receptive field, inherently cannot capture long-range relations. These challenges can hinder representation learning and damage the generalization of the primary task. We propose a Hint Network (HintNet), which makes challenging tasks more solvable by correcting the learner's answer with more information when the learner needs it. Specifically, in our experiments, the HintNet corrects the answer of the learner with its own answer obtained from a graph augmented with hub nodes; see Fig. 2.

The amount of help (correction) by HintNet is optimized to maximize the learner's gain. Let $\mathcal{V}_H(\cdot)$ be a weight function that determines the amount of hint, and let $\Theta_H$ be its parameters, which are optimized by meta-learning. Then our formulation with HintNet is given as

$$\min_{\mathbf{w}, \Theta} \; \frac{1}{M} \sum_{i=1}^{M} \ell\big(y_i^{(0,\mathrm{meta})}, f(x_i^{(0,\mathrm{meta})}; \mathbf{w}^*(\Theta, \Theta_H))\big) \quad (10)$$

$$\text{s.t.} \quad \mathbf{w}^*(\Theta) = \operatorname*{argmin}_{\mathbf{w}} \; \sum_{t=0}^{T} \sum_{i=1}^{N_t} \frac{1}{N_t} \, \mathcal{V}\big(\xi_i^{(t,\mathrm{train})}; \Theta\big) \, \ell^t\big(y_i^{(t,\mathrm{train})}, \hat{y}_i^{(t,\mathrm{train})}(\Theta_H)\big), \quad (11)$$

where $\hat{y}_i^{(t,\mathrm{train})}(\Theta_H)$ denotes the convex combination of the learner's answer and HintNet's answer, i.e., $\mathcal{V}_H(\xi_i^{(t,\mathrm{train})}; \Theta_H) f^t(x_i^{(t,\mathrm{train})}; \mathbf{w}) + (1 - \mathcal{V}_H(\xi_i^{(t,\mathrm{train})}; \Theta_H)) f^t_H(x_i^{(t,\mathrm{train})}; \mathbf{w})$. The sample embedding is $\xi_i^{(t,\mathrm{train})} = [e^t; y_i^{(t,\mathrm{train})}; \ell^t; \ell^t_H] \in \mathbb{R}^{T+3}$.
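A minimal sketch of the gating in Eq. (11) follows; the class names and hidden width are illustrative assumptions, not the authors' code.

```python
import torch
import torch.nn as nn

class HintGate(nn.Module):
    """V_H(xi; Theta_H) in (0, 1): how much to trust the learner vs. HintNet.
    feat_dim = T + 3 for xi = [e^t; y; l^t; l^t_H] as in Section 3.3."""
    def __init__(self, feat_dim, hidden=100):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(feat_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 1), nn.Sigmoid())

    def forward(self, xi):
        return self.net(xi).squeeze(-1)

def hinted_prediction(gate, xi, y_learner, y_hint):
    # Eq. (11): y_hat = V_H * f^t(x; w) + (1 - V_H) * f^t_H(x; w), a convex
    # combination of the learner's answer and HintNet's corrected answer.
    v_h = gate(xi)
    return v_h * y_learner + (1.0 - v_h) * y_hint
```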
4 Experiments

We evaluate our proposed methods on four public benchmark datasets of heterogeneous graphs. Our experiments answer the following research questions:

Q1. Is meta-path prediction effective for representation learning on heterogeneous graphs?
Q2. Can meta-path prediction be further improved by the proposed methods (e.g., SELAR, HintNet)?
Q3. Why are the proposed methods effective? Is there any relation to hard negative mining?
Datasets. We use two public benchmark datasets from different domains for link prediction: the music dataset Last-FM and the book dataset Book-Crossing, released with KGNN-LS [61] and RippleNet [30]. We use two datasets for node classification: the citation network dataset ACM and the movie dataset IMDB, used by HAN [55] for node classification tasks. ACM has three types of nodes (Paper (P), Author (A), Subject (S)), four types of edges (PA, AP, PS, SP), and labels (categories of papers). IMDB contains three types of nodes (Movie (M), Actor (A), Director (D)), four types of edges (MA, AM, MD, DM), and labels (genres of movies). ACM and IMDB have node features, which are bag-of-words of keywords and plots. Dataset details are in the supplement.
Baselines. We evaluate our methods with four graph neural networks: GCN [12], GAT [11], GIN [34], and SGConv [62]. We compare five learning strategies: Vanilla, standard training of the base models only with the primary task samples; w/o meta-path, learning the primary task with the sample weighting function V(ξ; Θ); w/ meta-path, training with the primary task and auxiliary tasks (meta-path prediction) with a standard loss function; SELAR, proposed in Section 3.2, learning the primary task with auxiliary tasks optimized by meta-learning; and SELAR+Hint, introduced in Section 3.3. Implementation details are in the supplement.
4.1 Link Prediction

We used five types of meta-paths of length 2 to 4 as auxiliary tasks. Table 1 shows that our methods consistently improve link prediction performance for all the GNNs, compared to Vanilla and to the method using Meta-Weight-Net alone without meta-paths (denoted w/o meta-path). Overall, standard training with meta-paths shows about 2% improvement on average on Last-FM and about 3% on Book-Crossing, whereas meta-learning that only learns sample weights improves just 0.4% and 0.6% on average, and two cases (GCN on Last-FM and SGC on Book-Crossing) show degradation compared to standard training (Vanilla). As expected, SELAR and SELAR with HintNet provide better-optimized auxiliary learning, resulting in 2.2% and 2.5% absolute improvement on Last-FM and 4.1% and 4.4% on Book-Crossing. In particular, GIN on Book-Crossing shows an especially large gain with SELAR+HintNet.

Table 1: Link prediction performance (AUC) of GNNs trained by various learning strategies; the last three columns are ours (∗: degradation from Vanilla; –: not available).

Dataset        Base GNN   Vanilla   w/o meta-path   w/ meta-path   SELAR     SELAR+Hint
Last-FM        GCN        0.7898    –∗              –              –         –
               GAT        –         –               –              –         –
               GIN        0.7895    0.8081          –              –         –
               SGC        –         –               –              –         –
               Avg. Gain  –         +0.0046         +0.0204        +0.0222   +0.0253
Book-Crossing  GCN        0.6918    0.6967          0.6970         –         –
               GAT        –         –               –              –         –
               GIN        0.6782    0.6968          0.7442         0.7554    –
               SGC        0.6781    –∗              –              –         –
               Avg. Gain  –         –               –              –         +0.0441

4.2 Node Classification

Similar to link prediction, our SELAR consistently enhances the node classification performance of all the GNN models, and the improvements are more significant on IMDB, which is larger than the ACM dataset. We believe that ACM is already saturated and the room for improvement is limited. However, our methods still show small yet consistent improvements over all the architectures on ACM. We conjecture that the efficacy of our proposed methods differs depending on graph structure. It is worth noting that introducing meta-path prediction as auxiliary tasks remarkably and consistently improves the performance of primary tasks such as link prediction and node classification compared to the existing methods. The "w/o meta-path" strategy, i.e., meta-learning a sample weighting function on the primary task alone, shows marginal degradation in five out of eight settings (highlighted with ∗). Remarkably, SELAR improved the F1-score of GAT on IMDB by 6.54% compared to the vanilla learning scheme.

Table 2: Node classification performance (F1-score) of GNNs trained by various learning schemes (∗: degradation from Vanilla; –: not available).

Dataset  Base GNN   Vanilla   w/o meta-path   w/ meta-path   SELAR     SELAR+Hint
ACM      GCN        0.9034    –∗              –              –         –
         GAT        0.9179    –∗              –              –         –
         GIN        –         –               –              –         –
         SGC        0.9138    –∗              –              –         –
         Avg. Gain  –         –               –              –         +0.0061
IMDB     GCN        0.5826    0.5952          –              –         –
         GAT        –         –               –              –         –
         GIN        –         –               –              –         –
         SGC        –         –               –              –         –
         Avg. Gain  –         –               –              +0.0340   +0.0172

4.3 Analysis
The effectiveness of meta-path prediction and the proposed learning strategies was addressed above. To answer the last research question, Q3 (why are the proposed methods effective?), we provide an analysis of the weighting function V(ξ; Θ) learned by our framework. We also show evidence that meta-overfitting occurs and can be addressed by cross-validation as in Algorithm 1.

Weighting function. Our proposed methods can automatically balance multiple auxiliary tasks to improve the primary task. To understand this ability, we analyze the weighting function and the loss it adjusts, i.e., $\mathcal{V}(\xi; \Theta)$ and $\mathcal{V}(\xi; \Theta)\,\ell^t(y, \hat{y})$. The positive and negative samples are drawn as solid and dashed lines respectively. We present the weighting function learnt by SELAR+HintNet for GAT, the best-performing configuration on Last-FM, taken from the epoch with the best validation performance. Fig. 3 shows that the learnt weighting function attends to hard examples more than easy ones, whose loss falls in the small range from 0 to 1. Also, the primary-task positive samples are down-weighted relatively less than the auxiliary tasks even when the samples are easy (i.e., the loss ranges from 0 to 1). Our adjusted loss $\mathcal{V}(\xi; \Theta)\,\ell^t(y, \hat{y})$ is closely related to the focal loss, $-(1 - p_t)^\gamma \log(p_t)$: when $\ell^t$ is the cross-entropy, the adjusted loss becomes $-\mathcal{V}(\xi; \Theta) \log(p_t)$, where $p$ is the model's prediction for the correct class and $p_t$ is defined as $p$ if $y = 1$ and $1 - p$ otherwise, as in [63]. The weighting function evolves differently over iterations: at the early stage of training, it often focuses on easy examples first and then changes its focus over time.

Figure 3: Weighting function V(·) learnt by SELAR+HintNet. (a) Weighting function V(ξ; Θ): V(·) gives overall high weights to the primary-task positive samples and decreases the weights of easy samples with a loss in the range from 0 to 1. (b) Adjusted cross entropy −V(ξ; Θ) log(ŷ): the adjusted loss acts like the focal loss, which focuses on hard examples via −(1 − p_t)^γ log(p_t).
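For intuition on the focal-loss analogy, the short snippet below contrasts plain cross entropy with the focal loss of [63]; the learnt V(ξ; Θ) plays the role of the fixed down-weighting factor (1 − p_t)^γ, but is adapted to the primary task instead of being hand-designed.

```python
import torch

def focal_loss(p_t, gamma=2.0):
    # Focal loss [63]: down-weights easy examples (p_t near 1).
    return -(1.0 - p_t) ** gamma * torch.log(p_t)

p_t = torch.tensor([0.6, 0.9, 0.99])   # easier examples have higher p_t
ce = -torch.log(p_t)                   # plain cross entropy
print(focal_loss(p_t) / ce)            # effective weights: 0.16, 0.01, 0.0001
```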
Also, the loss values adjusted by the weighting function learnt by our method differ across tasks. To analyze the contribution of each task, we calculate the average of the task-specific weighted loss on the Last-FM and Book-Crossing datasets. In particular, on Book-Crossing, our method pays more attention to 'user-item' (the primary task) and 'user-item-literary.series.item-user' (an auxiliary task), a meta-path that connects users who like the same book series. This implies that two users who like a book series likely have a similar preference. More results and discussion are available in the supplement.

Meta cross-validation, i.e., cross-validation for meta-learning, helps keep the weighting function from overfitting to the meta-data. Table 3 gives evidence that our algorithms, like other meta-learning methods, can overfit to the meta-data. As in Algorithm 1, our proposed methods, both SELAR and SELAR with HintNet, with cross-validation (denoted '3-fold') alleviate the meta-overfitting problem and provide a significant performance gain, whereas without meta cross-validation (denoted '1-fold') the proposed methods can underperform the vanilla training strategy.

Table 3: Comparison between 1-fold and 3-fold meta-data on the Last-FM dataset (–: not available).

              SELAR               SELAR+Hint
Model  Vanilla   1-fold   3-fold   1-fold   3-fold
GCN    0.7898    0.7885   –        –        –
GAT    0.8090    0.8293   –        –        –
GIN    0.7895    0.8182   –        –        –
SGC    0.7725    0.7391   –        –        –
5 Conclusion

We proposed meta-path prediction as a self-supervised auxiliary task on heterogeneous graphs. Our experiments show that representation learning on heterogeneous graphs can benefit from meta-path prediction, which encourages models to capture rich semantic information. The auxiliary tasks can be further improved by our proposed method SELAR, which automatically balances auxiliary tasks to assist the primary task via a form of meta-learning. The learnt weighting function identifies the meta-paths that are more beneficial for the primary task. Within a task, the weighting function can adjust the cross entropy like the focal loss, which focuses on hard examples by decreasing the weights of easy samples. Moreover, for challenging and remotely relevant auxiliary tasks, our HintNet helps the learner by dynamically correcting the learner's answer, further improving the gain from auxiliary tasks. Our framework based on meta-learning provides learning strategies to balance the primary and auxiliary tasks, as well as easy/hard (and positive/negative) samples. Interesting future directions include applying our framework to other domains and various auxiliary tasks. Our code is publicly available at https://github.com/mlvlab/SELAR.

References

[1] William L Hamilton, Rex Ying, and Jure Leskovec. Representation learning on graphs: Methods and applications. arXiv preprint arXiv:1709.05584, 2017.
[2] Michael M Bronstein, Joan Bruna, Yann LeCun, Arthur Szlam, and Pierre Vandergheynst. Geometric deep learning: going beyond euclidean data. IEEE Signal Processing Magazine, 34(4):18–42, 2017.
[3] Zonghan Wu, Shirui Pan, Fengwen Chen, Guodong Long, Chengqi Zhang, and S Yu Philip. A comprehensive survey on graph neural networks.
IEEE , 2020.[4] William L. Hamilton, Rex Ying, and Jure Leskovec. Inductive representation learning on large graphs.
CoRR , abs/1706.02216, 2017.[5] Michael Schlichtkrull, Thomas N Kipf, Peter Bloem, Rianne Van Den Berg, Ivan Titov, and Max Welling.Modeling relational data with graph convolutional networks. In
European Semantic Web Conference , pages593–607, 2018.[6] Muhan Zhang and Yixin Chen. Link prediction based on graph neural networks. In
NeurIPS , pages5165–5175, 2018.[7] David K Duvenaud, Dougal Maclaurin, Jorge Iparraguirre, Rafael Bombarell, Timothy Hirzel, AlánAspuru-Guzik, and Ryan P Adams. Convolutional networks on graphs for learning molecular fingerprints.In
NeurIPS, pages 2224–2232, 2015. [8] Rex Ying, Jiaxuan You, Christopher Morris, Xiang Ren, William L. Hamilton, and Jure Leskovec. Hierarchical graph representation learning with differentiable pooling.
CoRR , abs/1806.08804, 2018.[9] Jie Chen, Tengfei Ma, and Cao Xiao. FastGCN: Fast learning with graph convolutional networks viaimportance sampling. In
ICLR , 2018.[10] Daixin Wang, Peng Cui, and Wenwu Zhu. Structural deep network embedding. In
SIGKDD, pages 1225–1234, 2016. [11] Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Lio, and Yoshua Bengio. Graph attention networks. arXiv preprint arXiv:1710.10903, 2017. [12] Thomas N. Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. In
ICLR . OpenReview.net, 2017.[13] Danfei Xu, Yuke Zhu, Christopher B Choy, and Li Fei-Fei. Scene graph generation by iterative messagepassing. In
CVPR , pages 5410–5419, 2017.[14] Jianwei Yang, Jiasen Lu, Stefan Lee, Dhruv Batra, and Devi Parikh. Graph r-cnn for scene graph generation.In
ECCV , pages 670–685, 2018.[15] Yunpeng Chen, Marcus Rohrbach, Zhicheng Yan, Yan Shuicheng, Jiashi Feng, and Yannis Kalantidis.Graph-based global reasoning networks. In
CVPR , pages 433–442, 2019.[16] Rianne van den Berg, Thomas N Kipf, and Max Welling. Graph convolutional matrix completion. arXivpreprint arXiv:1706.02263 , 2017.[17] Federico Monti, Michael Bronstein, and Xavier Bresson. Geometric matrix completion with recurrentmulti-graph neural networks. In
NeurIPS , pages 3697–3707, 2017.[18] Rex Ying, Ruining He, Kaifeng Chen, Pong Eksombatchai, William L Hamilton, and Jure Leskovec. Graphconvolutional neural networks for web-scale recommender systems. In
SIGKDD , pages 974–983, 2018.[19] Alvaro Sanchez-Gonzalez, Nicolas Heess, Jost Tobias Springenberg, Josh Merel, Martin Riedmiller, RaiaHadsell, and Peter Battaglia. Graph networks as learnable physics engines for inference and control. arXivpreprint arXiv:1806.01242 , 2018.[20] Peter Battaglia, Razvan Pascanu, Matthew Lai, Danilo Jimenez Rezende, et al. Interaction networks forlearning about objects, relations and physics. In
NeurIPS , pages 4502–4510, 2016.[21] Weihua Hu, Bowen Liu, Joseph Gomes, Marinka Zitnik, Percy Liang, Vijay Pande, and Jure Leskovec.Strategies for pre-training graph neural networks. In
ICLR , 2020.[22] Zhenqin Wu, Bharath Ramsundar, Evan N Feinberg, Joseph Gomes, Caleb Geniesse, Aneesh S Pappu,Karl Leswing, and Vijay Pande. Moleculenet: a benchmark for molecular machine learning.
Chemical Science, 9(2):513–530, 2018.
[23] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018. [24] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In
CVPR , pages 770–778, 2016.[25] Yukun Zhu, Ryan Kiros, Rich Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, and SanjaFidler. Aligning books and movies: Towards story-like visual explanations by watching movies and readingbooks. In
ICCV , pages 19–27, 2015.[26] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchicalimage database. In
CVPR , pages 248–255. Ieee, 2009.[27] Sinno Jialin Pan and Qiang Yang. A survey on transfer learning.
IEEE Transactions on knowledge anddata engineering , 22(10):1345–1359, 2009.[28] Nicolò Navarin, Dinh V Tran, and Alessandro Sperduti. Pre-training graph neural networks with kernels. arXiv preprint arXiv:1811.06930 , 2018.[29] Jiawei Zhang, Haopeng Zhang, Li Sun, and Congying Xia. Graph-bert: Only attention is needed forlearning graph representations. arXiv preprint arXiv:2001.05140 , 2020.[30] Hongwei Wang, Fuzheng Zhang, Jialin Wang, Miao Zhao, Wenjie Li, Xing Xie, and Minyi Guo. Ripplenet:Propagating user preferences on the knowledge graph for recommender systems. In
Proceedings of the 27th ACM International Conference on Information and Knowledge Management, pages 417–426, 2018. [31] Gusi Te, Wei Hu, Amin Zheng, and Zongming Guo. RGCNN: regularized graph CNN for point cloud segmentation. In Susanne Boll, Kyoung Mu Lee, Jiebo Luo, Wenwu Zhu, Hyeran Byun, Chang Wen Chen, Rainer Lienhart, and Tao Mei, editors,
ACM , pages 746–754. ACM, 2018.[32] Joost Bastings, Ivan Titov, Wilker Aziz, Diego Marcheggiani, and Khalil Sima’an. Graph convolutionalencoders for syntax-aware neural machine translation. arXiv preprint arXiv:1704.04675 , 2017.[33] Joan Bruna, Wojciech Zaremba, Arthur Szlam, and Yann LeCun. Spectral networks and locally connectednetworks on graphs. arXiv preprint arXiv:1312.6203 , 2013.[34] Keyulu Xu, Weihua Hu, Jure Leskovec, and Stefanie Jegelka. How powerful are graph neural networks? arXiv preprint arXiv:1810.00826 , 2018.[35] Carl Doersch, Abhinav Gupta, and Alexei A Efros. Unsupervised visual representation learning by contextprediction. In
ICCV , pages 1422–1430, 2015.[36] Mehdi Noroozi and Paolo Favaro. Unsupervised learning of visual representations by solving jigsawpuzzles. In
ECCV , pages 69–84. Springer, 2016.[37] Carl Doersch, Abhinav Gupta, and Alexei A Efros. Unsupervised visual representation learning by contextprediction. In
ICCV , pages 1422–1430, 2015.[38] Bo Dai and Dahua Lin. Contrastive learning for image captioning. In
NeurIPS, pages 898–907, 2017. [39] Jeff Donahue, Yangqing Jia, Oriol Vinyals, Judy Hoffman, Ning Zhang, Eric Tzeng, and Trevor Darrell. DeCAF: A deep convolutional activation feature for generic visual recognition. In
ICML, pages 647–655, 2014. [40] Travers Ching, Daniel S Himmelstein, Brett K Beaulieu-Jones, Alexandr A Kalinin, Brian T Do, Gregory P Way, Enrico Ferrero, Paul-Michael Agapow, Michael Zietz, Michael M Hoffman, et al. Opportunities and obstacles for deep learning in biology and medicine.
Journal of The Royal Society Interface ,15(141):20170387, 2018.[41] Shubham Toshniwal, Hao Tang, Liang Lu, and Karen Livescu. Multitask learning with low-level auxiliarytasks for encoder-decoder based speech recognition. arXiv preprint arXiv:1704.01631 , 2017.[42] John Flynn, Ivan Neulander, James Philbin, and Noah Snavely. Deepstereo: Learning to predict new viewsfrom the world’s imagery. In
CVPR , pages 5515–5524, 2016.[43] Tinghui Zhou, Matthew Brown, Noah Snavely, and David G Lowe. Unsupervised learning of depth andego-motion from video. In
CVPR , pages 1851–1858, 2017.[44] Augustus Odena, Christopher Olah, and Jonathon Shlens. Conditional image synthesis with auxiliaryclassifier gans. In
ICML , pages 2642–2651. JMLR. org, 2017.
[45] Shikun Liu, Andrew Davison, and Edward Johns. Self-supervised generalisation with meta auxiliary learning. In
NeurIPS , pages 1677–1687, 2019.[46] Sachin Ravi and Hugo Larochelle. Optimization as a model for few-shot learning. In
ICLR . OpenReview.net,2017.[47] Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deepnetworks. In
ICML , pages 1126–1135. JMLR. org, 2017.[48] Jun Shu, Qi Xie, Lixuan Yi, Qian Zhao, Sanping Zhou, Zongben Xu, and Deyu Meng. Meta-weight-net:Learning an explicit mapping for sample weighting. In
NeurIPS , 2019.[49] Zhenguo Li, Fengwei Zhou, Fei Chen, and Hang Li. Meta-sgd: Learning to learn quickly for few-shotlearning. arXiv preprint arXiv:1707.09835 , 2017.[50] Yoonho Lee and Seungjin Choi. Gradient-based meta-learning with learned layerwise metric and subspace.In Jennifer G. Dy and Andreas Krause, editors,
ICML , volume 80 of
Proceedings of Machine LearningResearch , pages 2933–2942. PMLR, 2018.[51] Yunhun Jang, Hankook Lee, Sung Ju Hwang, and Jinwoo Shin. Learning what and where to transfer. InKamalika Chaudhuri and Ruslan Salakhutdinov, editors,
ICML , volume 97 of
Proceedings of MachineLearning Research , pages 3030–3039. PMLR, 2019.[52] Gregory Koch, Richard Zemel, and Ruslan Salakhutdinov. Siamese neural networks for one-shot imagerecognition. In
ICML deep learning workshop , volume 2. Lille, 2015.[53] Jake Snell, Kevin Swersky, and Richard Zemel. Prototypical networks for few-shot learning. In
NeurIPS ,pages 4077–4087, 2017.[54] Flood Sung, Yongxin Yang, Li Zhang, Tao Xiang, Philip H. S. Torr, and Timothy M. Hospedales. Learningto compare: Relation network for few-shot learning. In
CVPR , pages 1199–1208. IEEE Computer Society,2018.[55] Xiao Wang, Houye Ji, Chuan Shi, Bai Wang, Peng Cui, P. Yu, and Yanfang Ye. Heterogeneous graphattention network.
CoRR , abs/1903.07293, 2019.[56] Yizhou Sun and J. Han. Mining heterogeneous information networks: A structural analysis approach.
SIGKDD Explorations , 14:20–28, 01 2012.[57] Chuan Shi, Yitong Li, Jiawei Zhang, Yizhou Sun, and S Yu Philip. A survey of heterogeneous informationnetwork analysis.
IEEE Transactions on Knowledge and Data Engineering , 29(1):17–37, 2016.[58] Yizhou Sun, Jiawei Han, Xifeng Yan, Philip S Yu, and Tianyi Wu. Pathsim: Meta path-based top-k similaritysearch in heterogeneous information networks.
Proceedings of the VLDB Endowment , 4(11):992–1003,2011.[59] Antreas Antoniou, Harrison Edwards, and Amos Storkey. How to train your maml. arXiv preprintarXiv:1810.09502 , 2018.[60] Luisa M Zintgraf, Kyriacos Shiarlis, Vitaly Kurin, Katja Hofmann, and Shimon Whiteson. Fast contextadaptation via meta-learning. arXiv preprint arXiv:1810.03642 , 2018.[61] Hongwei Wang, Fuzheng Zhang, Mengdi Zhang, Jure Leskovec, Miao Zhao, Wenjie Li, and ZhongyuanWang. Knowledge-aware graph neural networks with label smoothness regularization for recommendersystems. In
SIGKDD , pages 968–977, 2019.[62] Felix Wu, Tianyi Zhang, Amauri Holanda de Souza Jr, Christopher Fifty, Tao Yu, and Kilian Q Weinberger.Simplifying graph convolutional networks. arXiv preprint arXiv:1902.07153 , 2019.[63] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. Focal loss for dense objectdetection. In
ICCV, pages 2980–2988, 2017. [64] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In Yoshua Bengio and Yann LeCun, editors, ICLR, 2015.

A Summary
We provide additional experimental results and implementation details that are not included in the main paperdue to the space limit. This supplement includes (1) additional experimental results showing that our methodscan be further improved by regularization alleviating meta-overfitting, (2) details of datasets, (3) implementationdetails, (4) task selection, and (5) behaviours of the weighting function at different training stages.
B Meta-Learning and Regularization
We compare the following learning strategies: Vanilla, standard training of the base models only with the primary task; Graph-MW w/o mp, modified MW-Net [48] for graph neural networks, which learns the primary task while weighting the primary task samples; Graph-MW w/ mp, MW-Net [48] for graph neural networks, trained with the primary and auxiliary tasks; SELAR and SELAR+Hint, our models introduced in the main paper; and Regularized SELAR+Hint, exactly the same model as SELAR+Hint but trained with a regularizer added to the HintNet introduced in the main paper (Section 3). Avg. Gain denotes the gain of all GNNs from Vanilla, averaged.

Table 4 shows that SELAR, SELAR+Hint, and Regularized SELAR+Hint consistently improve link prediction performance on the Last-FM and Book-Crossing datasets, compared to Vanilla and Graph-MW. Graph-MW with meta-paths shows 0.25% improvement on average on Last-FM, while our SELAR+Hint provides 2.5% improvement on average; in particular, our Regularized SELAR+Hint gains 2.9% over Vanilla. On Book-Crossing, Graph-MW without and with meta-paths shows 0.6% and 3.8% improvements over Vanilla respectively, indicating that the auxiliary tasks help the primary task on Book-Crossing. Also, our Regularized SELAR+Hint achieves a 4.8% absolute improvement over Vanilla. The regularization, applied to alleviate overfitting, improves the overall performance of SELAR+Hint.
Table 4: Link prediction performance (AUC) of GNNs (–: not available).

Dataset        Base GNN   Vanilla   Graph-MW w/o mp   Graph-MW w/ mp   SELAR     SELAR+Hint   Reg. SELAR+Hint
Last-FM        GCN        0.7898    0.7850            0.7861           0.8163    0.8162       –
               GAT        0.8090    0.8100            0.8244           0.8319    –            –
               GIN        –         –                 –                –         –            –
               SGC        0.7725    0.7759            0.7400           0.7803    0.7857       –
               Avg. Gain  –         +0.0046           +0.0025          +0.0222   +0.0253      +0.0292
Book-Crossing  GCN        0.6918    0.6967            0.7047           0.7081    0.7075       –
               GAT        0.6704    0.6759            0.7075           0.7136    0.7247       –
               GIN        0.6782    0.6968            0.7543           0.7554    0.7587       –
               SGC        0.6781    0.6732            0.7038           0.7070    0.7039       –
               Avg. Gain  –         +0.0061           +0.0380          +0.0414   +0.0441      +0.0476
C Details of datasets
We use two datasets (Last-FM, Book-Crossing) for link prediction tasks and two datasets (ACM, IMDB) for node classification tasks. Last-FM and Book-Crossing do not have node features, while ACM and IMDB have node features, which are bag-of-words of keywords and plots. The Last-FM dataset with its knowledge graph has 122 types of edges, e.g., "artist.origin", "musician.instruments.played", "person.or.entity.appearing.in.film", and "film.actor.film". Book-Crossing with its knowledge graph has 52 types of edges, e.g., "book.genre", "literary.series", "date.of.first.publication", and "written.work.translation". ACM has three types of nodes (Paper (P), Author (A), Subject (S)), four types of edges (PA, AP, PS, SP), and labels (categories of papers). IMDB contains three types of nodes (Movie (M), Actor (A), Director (D)), four types of edges (MA, AM, MD, DM), and labels (genres of movies). Statistics of the datasets are in Table 5.
D Implementation details
All models are randomly initialized and optimized using the Adam [64] optimizer. Hyperparameters such as the learning rate and weight decay rate are tuned using validation sets for all models. For a fair comparison, the number of layers is fixed to two and the dimension of the output node embedding is the same across models. The node embedding z has 16 dimensions for Last-FM and 64 dimensions for the rest of the datasets. Since the datasets have different numbers of samples, we train models for different numbers of epochs, e.g., Last-FM (100).
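A minimal sketch of this setup follows; the choice of PyTorch Geometric and the learning rate / weight decay values are placeholders (the paper tunes hyperparameters on validation sets), not the authors' configuration.

```python
import torch
from torch_geometric.nn import GCNConv  # hypothetical library choice

class Encoder(torch.nn.Module):
    """Two-layer GNN encoder matching the setup above:
    two layers; 16-dim node embeddings for Last-FM (64 for other datasets)."""
    def __init__(self, in_dim, out_dim=16):
        super().__init__()
        self.conv1 = GCNConv(in_dim, out_dim)
        self.conv2 = GCNConv(out_dim, out_dim)

    def forward(self, x, edge_index):
        h = torch.relu(self.conv1(x, edge_index))
        return self.conv2(h, edge_index)

model = Encoder(in_dim=128)  # input feature size is dataset-dependent
opt = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=5e-4)
```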
E Task selection
Our proposed methods identify useful auxiliary tasks and balance them with the primary task. In other words, the loss functions for the tasks are differentially adjusted by the weighting function learnt by SELAR+HintNet. To analyze the weights of the tasks, we calculate the average of the task-specific weighted loss. Table 6 shows the tasks in descending order of task weight. On Last-FM, 'user-item-actor-item' has the largest weight, followed by 'user-item' (the primary task), 'user-item-appearing.in.film-item', 'user-item-instruments-item', 'user-item-user-item', and 'user-item-artist.origin-item'. This indicates that the preference of a given user is closely related to other items connected by an actor, e.g., the edge type 'film.actor.film' in the knowledge graph. Moreover, our method focuses on the 'user-item' interaction for the primary task. On Book-Crossing, our method pays more attention to 'user-item' for the primary task and 'user-item-literary.series.item-user', which means that users who like a book series have similar preferences.
Table 6: Meta-paths in descending order of the average task-specific weighted loss on the Last-FM and Book-Crossing datasets (∗: primary task; –: not available).

Meta-paths (Last-FM)               Meta-paths (Book-Crossing)
user-item-actor-item               user-item∗
user-item∗                         user-item-literary.series.item-user
user-item-appearing.in.film-item   –
user-item-instruments-item         –
user-item-user-item                –
user-item-artist.origin-item       –
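The per-task averages in Table 6 can be computed as in the sketch below, reusing the WeightingNet sketch from Section 3.2; the sample layout is an illustrative assumption.

```python
from collections import defaultdict

def average_task_weighted_loss(vnet, samples):
    """Average of V(xi; Theta) * l^t per task, as reported in Table 6.
    `samples` yields (task_name, xi, loss) triples; names are illustrative."""
    sums, counts = defaultdict(float), defaultdict(int)
    for task, xi, loss in samples:
        sums[task] += (vnet(xi) * loss).item()  # weighted loss of one sample
        counts[task] += 1
    return {task: sums[task] / counts[task] for task in sums}
```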
F Weighting function at different training stages

The weighting functions of our methods dynamically change over time. In Fig. 4, each row is the weighting function learnt by SELAR+HintNet for GCN, GAT, GIN, and SGC on Last-FM. From left, the columns correspond to the first epoch, the epoch with the best validation performance, and the last epoch, respectively. The positive and negative samples are illustrated with solid and dashed lines respectively in Fig. 4. At the beginning of training (the first epoch), one noticeable pattern is that the weighting function focuses more on easy samples. At the epoch with the highest performance, easy samples are down-weighted and the weight is large when the loss is large, implying that hard examples receive more focus. At the last epoch, most weights converge to zero when the loss is extremely small or large; since learning is almost done, the weighting function is learned in a direction that considers both easy and difficult examples less. Especially for GCN and GAT at the epoch with the highest performance, the weights are increasing, meaning that our weighting function assigns smaller importance to easy samples and more attention to hard samples. Among all tasks, the scale of the weights for the primary task is relatively high compared to that of the auxiliary tasks, indicating that our method focuses more on the primary task.