DistilE: Distilling Knowledge Graph Embeddings for Faster and Cheaper Reasoning
Yushan Zhu, Wen Zhang, Hui Chen, Xu Cheng, Wei Zhang, Huajun Chen
Zhejiang University, Alibaba Group, CETC Big Data Research Institute
{weidu.ch, lantu.zw}@alibaba-inc.com, [email protected]

Abstract
Knowledge Graph Embedding (KGE) is a popular method for KG reasoning, and a higher-dimensional embedding usually ensures better reasoning capability. However, high-dimensional KGEs pose huge challenges to storage and computing resources and are not suitable for resource-limited or time-constrained applications, for which faster and cheaper reasoning is necessary. To address this problem, we propose DistilE, a knowledge distillation method that builds a low-dimensional student KGE from a pre-trained high-dimensional teacher KGE. We take the original KGE loss as the hard label loss and design specific soft label losses for different KGEs in DistilE. We also propose a two-stage distillation approach to make the student and teacher adapt to each other and further improve the reasoning capability of the student. DistilE is general enough to be applied to various KGEs. Experimental results on link prediction show that our method successfully distills a good student which performs better than a directly trained model of the same dimension, and sometimes even better than the teacher, and that it achieves a 2×-8× embedding compression rate and more than 10× faster inference than the teacher with a small performance loss. We also experimentally prove the effectiveness of our two-stage training proposal via an ablation study.

Introduction
A Knowledge Graph (KG) is composed of triples representing facts in the form of (head entity, relation, tail entity), abbreviated as (h, r, t). KGs have been proven useful for various AI tasks, such as semantic search (Berant et al. 2013; Berant and Liang 2014), information extraction (Hoffmann et al. 2011; Daiber et al. 2013) and question answering (Zhang et al. 2016; Diefenbach, Singh, and Maret 2018). However, it is well known that KGs are usually far from complete, which motivates much research on knowledge graph completion and reasoning; a common and widely used family of methods is Knowledge Graph Embedding (KGE), such as TransE (Bordes et al. 2013), TransH (Wang et al. 2014), ConvE (Dettmers et al. 2018), etc.

To achieve better performance, training KGEs with higher dimensions is typically preferred, while the model size, i.e., the number of parameters, and the reasoning cost usually increase quickly as the embedding dimension goes up. As shown in Figure 1, as the embedding dimension becomes larger, the performance of the model (MRR) grows more and more slowly, while the model size and reasoning cost still increase quickly.
Figure 1: The changes of performance, model size and reasoning cost along the growth of embedding dimensions. (a) MRR and model size; (b) MRR and reasoning cost.

However, high-dimensional embeddings are impractical in many real-life scenarios. For example, a pre-trained billion-scale knowledge graph is expected to be fine-tuned for downstream tasks and deployed frequently at a cheaper cost. For applications with limited computing resources, such as deploying a KG on edge-computing or mobile devices, or with limited reasoning time, such as online financial prediction, embeddings with lower dimensions provide obvious or even indispensable conveniences.

Although a low dimension enables faster deployment and cheaper reasoning, directly training with a small embedding size normally performs poorly, as shown in Figure 1. Thus we propose a new research question: is it possible to distill low-dimensional KGEs from pre-trained high-dimensional ones so that we can achieve good performance as well as faster and cheaper inference?

Knowledge Distillation (Hinton, Vinyals, and Dean 2015) is a technique for distilling knowledge from a large model (teacher) to build a smaller model (student) and has been widely researched in Computer Vision and Natural Language Processing. The student learns from both the hard labels (ground-truth labels) and the soft labels from the teacher. In this work, we propose a novel distillation method for large-scale knowledge graph training, named
DistilE, which is capable of distilling the essence of a high-dimensional KGE into a smaller embedding size without losing too much accuracy, while performing much better than training directly with the same smaller size.

Conventional distillation methods usually use the logits or softmax outputs of the teacher to supervise the student. Considering the diversity of loss functions among KGE methods, we design specific soft labels for different KGEs in DistilE. For KGEs based on margin loss, such as the TransX series (Bordes et al. 2013; Wang et al. 2014; Lin et al. 2015; Ji et al. 2015), whose outputs have no probabilistic interpretation, we use the triples' scores from the teacher as soft labels, because these scores directly reflect the existence of triples. For KGEs based on cross entropy loss, such as bilinear models (Nickel, Tresp, and Kriegel 2011; Yang et al. 2015), rotation models (Sun et al. 2019b; Zhang et al. 2019) and models based on neural networks (Dettmers et al. 2018; Nguyen et al. 2019), we use the sigmoid outputs of the teacher as soft labels.

We also propose a two-stage distillation approach to further improve the distillation results. The basic idea is that although the trained teacher is already strong, better performance could be achieved if the teacher could also learn from the student instead of being fixed all the time. Sun et al. (2019a) also showed that the overall performance depends on the student's acceptance of the teacher. Therefore, in addition to a standard distillation stage in which the teacher is always static, we devise a second distillation stage in which the teacher is unfrozen and tries to adjust itself to become more acceptable to the student.

We evaluate DistilE with several typical KGEs and standard KG reasoning datasets. Results prove the effectiveness of our method, showing that (1) the low-dimensional KGEs distilled by DistilE perform much better than directly training same-sized embeddings without distillation, (2) the low-dimensional KGEs distilled by DistilE infer significantly faster than the original high-dimensional KGEs, and (3) our two-stage distillation approach works well and can further improve the distillation results.

In summary, our contributions are:
• We propose a novel framework to distill lower-dimensional KGEs from higher-dimensional ones. To the best of our knowledge, this is the first work to apply knowledge distillation to knowledge graph embedding.
• We propose different soft labels for different kinds of KGEs with special structures, and a two-stage distillation to enhance the distillation results.
• We experimentally prove that our proposal can reduce the number of parameters of a KGE by 2-8 times and increase the inference speed by more than 10 times while retaining good performance.
Related Work
Knowledge Distillation and Model Compression
In the last few years, the acceleration and compression of models has attracted a lot of research. Common methods include network pruning (Castellano, Fanelli, and Pelillo 1997; Molchanov et al. 2017), quantization (Lin, Talathi, and Annapureddy 2016; Sachan 2020), parameter sharing (Dehghani et al. 2019; Lan et al. 2020), and knowledge distillation (Hinton, Vinyals, and Dean 2015).

Among them, knowledge distillation has been widely used in Computer Vision and Natural Language Processing. Its core idea is to regard the probability distributions output by the teacher as soft labels that help guide the training of the student. Knowledge distillation has an advantage over the other compression methods mentioned above: in addition to probability distributions, other soft labels can be designed according to need, providing more modeling freedom. Tang et al. (2019) propose distilling the pre-trained language model BERT (Devlin et al. 2019) into a single-layer bidirectional long short-term memory network (BiLSTM). Sun et al. (2019a) propose letting the student learn more knowledge by fitting the teacher's intermediate-layer outputs, rather than only the probability distribution from the softmax layer. Tian, Krishnan, and Isola (2020) argue that there are dependencies between the dimensions of a data representation, and propose maximizing the mutual information between the student's and the teacher's representations. Zhao et al. (2019) give up transferring the softmax layer in BERT distillation and directly approximate the corresponding weight matrices of the student and the teacher.

However, many KGEs, such as those of the TransX series, have neither a deep network structure nor a probability distribution output, so these existing distillation methods are not suitable in our setting. In this work, we introduce the first method for KG embedding compression using knowledge distillation. We design specific distillation soft labels for different KGEs, and also propose a two-stage distillation approach to further improve the distillation effect.
Knowledge Graph Embeddings
In recent years, knowledge graph embedding (KGE) technology has developed rapidly and found wide application. Its key idea is to transform the entities and relations of a KG into a continuous vector space as embeddings, which can then be applied to various downstream KG tasks. RESCAL (Nickel, Tresp, and Kriegel 2011) is the first relation learning method based on tensor decomposition and encodes relations and entities as two-dimensional matrices and vectors respectively. To improve RESCAL, DistMult (Yang et al. 2015) restricts the relation matrix to a diagonal matrix to simplify the model, ComplEx (Trouillon et al. 2016) embeds entities and relations into complex space to better model asymmetric relations, and HolE (Nickel, Rosasco, and Poggio 2016) combines the expressive power of RESCAL with the simplicity of DistMult. TransE (Bordes et al. 2013) is the first translation-based KGE method and regards the relation as a translation from the head entity to the tail entity. Various variants of TransE have been proposed to deal with more complex relations: TransH (Wang et al. 2014) proposes that an entity should have different representations under different relations, TransR (Lin et al. 2015) holds that different relations attend to different attributes of entities, and TransD (Ji et al. 2015) demonstrates that a relation may carry multiple semantics. With the rise of neural networks, many neural KGE methods have emerged. ConvE (Dettmers et al. 2018) and ConvKB (Nguyen et al. 2018) use convolutional neural networks (CNN), and CapsE (Nguyen et al. 2019) uses a capsule neural network as the score function of triples. In addition, rotation models such as RotatE (Sun et al. 2019b), QuatE (Zhang et al. 2019) and DihEdral (Xu and Li 2019) regard the relation as a rotation from the head entity to the tail entity.

However, although KGEs are simple and effective, they have an obvious problem: high-dimensional embeddings pose a huge challenge to storage and computation. For many practical application scenarios it is necessary to reduce the dimension of the embeddings while retaining good performance, yet there is little research on KG embedding compression so far. The only published work on KGE compression is (Sachan 2020); unlike our method, which uses knowledge distillation, it uses quantization to represent entities as vectors of discrete codes.
Method
This section elaborates on our proposal, DistilE. We first identify distillation objectives with specific soft labels for different types of KGE methods. We then introduce our two-stage distillation approach, which continuously adjusts the teacher for better distillation results.
Distillation Objective
Knowledge distillation typically involves two models: a large teacher model with good performance and a small student model. During training, the student is encouraged to 1) fit the hard labels from the data, like the one-hot vector of a sentence's class, with a hard label loss, and 2) imitate the teacher's behavior by fitting soft labels from the teacher with a soft label loss. Soft labels usually refer to the probability distribution output by the teacher.

Given a KG $\mathcal{K} = \{E, R, T\}$, where $E$, $R$ and $T$ are the sets of entities, relations and triples respectively, a KGE learns to express the relationships between entities in a continuous vector space. Specifically, for a triple $(h, r, t)$ with $h, t \in E$ and $r \in R$, the KGE model assigns it a score via a score function $f_r(h, t)$ that indicates the existence of $(h, r, t)$. KGEs with different score functions have different training objectives and corresponding loss functions. The two most commonly used KGE losses are the marginal loss and the cross entropy loss, and they require different distillation objectives.
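To make the role of the score function concrete, the following is a minimal PyTorch sketch of two representative score functions used later in Table 1. This is our illustration under assumed tensor shapes, not the authors' released code.

```python
import torch

def transe_score(h, r, t, p=1):
    """TransE distance score ||h + r - t||_p.

    h, r, t: real tensors of shape (batch, dim).
    Smaller values indicate more plausible triples.
    """
    return torch.norm(h + r - t, p=p, dim=-1)

def complex_score(h, r, t):
    """ComplEx score Re(<h, diag(r), conj(t)>).

    h, r, t: complex tensors of shape (batch, dim).
    Larger values indicate more plausible triples.
    """
    return torch.real(torch.sum(h * r * torch.conj(t), dim=-1))
```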
Objective for KGEs with Marginal Loss.
Marginal loss is often used in translation-based KGEs, including TransE, TransH, TransR, TransD, etc. These models use a distance-based score function $f_r(h, t)$ to judge the existence of $(h, r, t)$. The training goal is to make $f_r(h, t)$ small for positive triples and large for negative ones, forcing the gap between positive and negative scores to exceed a margin $\gamma$. The teacher is trained with the following loss:

$$L^T_{hard} = \sum_{(h,r,t) \in G} \left[ f^T_r(h, t) - f^T_r(h', t') + \gamma \right]_+ , \qquad (1)$$

where $f^T_r(h, t)$ and $f^T_r(h', t')$ are the scores of the positive triple $(h, r, t)$ and the negative triple $(h', r, t')$ given by the teacher. $(h', r, t')$ is generated by randomly replacing $h$ or $t$ in $(h, r, t) \in T$ with $h'$ or $t'$, so the set of negatives can be expressed as:

$$G^- = \{(h', r, t) \notin T \mid h' \in E \wedge h' \neq h\} \cup \{(h, r, t') \notin T \mid t' \in E \wedge t' \neq t\}. \qquad (2)$$
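A straightforward way to realize the negative set of Eq. (2) is rejection sampling over random head or tail replacements. The sketch below is our reading; uniform sampling and a 50/50 head-or-tail choice are assumptions, since the paper only specifies random replacement.

```python
import random

def corrupt(triple, entities, known_triples, k):
    """Generate k negatives for (h, r, t) by replacing h or t with a
    random entity, rejecting corruptions that are known true triples
    (i.e., staying inside the set G^- of Eq. (2))."""
    h, r, t = triple
    negatives = []
    while len(negatives) < k:
        e = random.choice(entities)
        cand = (e, r, t) if random.random() < 0.5 else (h, r, e)
        if cand not in known_triples and cand != triple:
            negatives.append(cand)
    return negatives
```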
Hard Label Loss for Student. The hard label loss of the student is identical to that of the teacher:

$$L^S_{hard} = \sum_{(h,r,t) \in G} \left[ f^S_r(h, t) - f^S_r(h', t') + \gamma \right]_+ . \qquad (3)$$
Soft Label Loss for Student. Since these KGEs lack the probability output layer assumed by conventional knowledge distillation methods, fitting a probability distribution is inapplicable here. A natural choice is to use the teacher's triple score as the soft label for the student, since the triple score carries rich information about the truthfulness of a triple. We therefore encourage the student to fit the teacher by minimizing the difference between their scores. Formally, the soft label loss of the student is:

$$L^S_{soft} = \sum_{(h,r,t) \in G} \left( \left| f^T_r(h, t) - f^S_r(h, t) \right| + \sum_{i=1}^{k} \left| f^T_r(h'_i, t'_i) - f^S_r(h'_i, t'_i) \right| \right), \qquad (4)$$

where $(h'_i, r, t'_i)$ with $i \in [1, k]$ are negative triples and $|x|$ denotes the absolute value of $x$. Since the teacher's score of any triple can be regarded as a soft label, we generate multiple negative triples for each positive one during the experiments and make the student fit all of their soft labels.
Final Loss. The final distillation loss is the weighted sum of the student's soft label loss and hard label loss:

$$L = \alpha L^S_{soft} + (1 - \alpha) L^S_{hard}, \qquad (5)$$

where $\alpha$ is a hyperparameter balancing the importance of the hard label loss and the soft label loss.
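Putting Eqs. (3)-(5) together, the student update for margin-based KGEs can be sketched as follows. This is a hedged illustration: batching, reduction by summation, and applying the hinge against every sampled negative are our choices, not specified verbatim by the paper.

```python
import torch
import torch.nn.functional as F

def margin_distill_loss(s_pos, s_negs, t_pos, t_negs, gamma, alpha):
    """Student loss for margin-based KGEs, Eqs. (3)-(5).

    s_pos:  (batch,)   student scores f^S_r(h, t) of the positive triples.
    s_negs: (batch, k) student scores of the k negatives per positive.
    t_pos, t_negs: the teacher's scores of the same triples; detach them
                   so no gradient reaches the teacher in the first stage.
    """
    # Hard label loss, Eq. (3): hinge on the positive/negative margin.
    hard = F.relu(s_pos.unsqueeze(1) - s_negs + gamma).sum()
    # Soft label loss, Eq. (4): L1 fit of student scores to teacher
    # scores on the positive triple and on every negative.
    soft = (t_pos - s_pos).abs().sum() + (t_negs - s_negs).abs().sum()
    # Final loss, Eq. (5).
    return alpha * soft + (1 - alpha) * hard
```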
Objective for KGEs with Cross Entropy Loss. Cross entropy loss is often used in models whose outputs have a probabilistic interpretation, for example bilinear models such as ComplEx, rotation models such as RotatE, and models based on neural networks. They model link prediction as a classification task and output the probability that the input triple is true, obtained by feeding the triple score through a sigmoid function. The cross entropy loss for training the teacher is:

$$L^T_{hard} = - \sum_{(h,r,t) \in G \cup G^-} \left( y \log p^T_{(h,r,t)} + (1 - y) \log(1 - p^T_{(h,r,t)}) \right), \qquad (6)$$

where $p^T_{(h,r,t)} = \frac{\exp f^T_r(h,t)}{1 + \exp f^T_r(h,t)}$ is a real number in $(0, 1)$ given by the teacher, representing the probability that the triple is a true fact, and $y$ is the ground-truth label of $(h, r, t)$: 1 for positive triples and 0 for negative ones.

Hard Label Loss for Student. Similarly, the hard label loss of the student is the same as that of the teacher:

$$L^S_{hard} = - \sum_{(h,r,t) \in G \cup G^-} \left( y \log p^S_{(h,r,t)} + (1 - y) \log(1 - p^S_{(h,r,t)}) \right). \qquad (7)$$

| Method | Score function $f_r(h,t)$ | Soft label loss for student $L^S_{soft}$ | Hard label loss for student $L^S_{hard}$ |
|---|---|---|---|
| TransE | $-\lVert h + r - t \rVert_p$ | $\sum_{(h,r,t)\in G}\big(\lvert f^T_r(h,t)-f^S_r(h,t)\rvert+\sum_{i=1}^{k}\lvert f^T_r(h'_i,t'_i)-f^S_r(h'_i,t'_i)\rvert\big)$ | $\sum_{(h,r,t)\in G}\big[f^S_r(h,t)-f^S_r(h',t')+\gamma\big]_+$ |
| TransH | $-\lVert (h - w_r^\top h\, w_r) + r - (t - w_r^\top t\, w_r) \rVert$ | (same as TransE) | (same as TransE) |
| ComplEx | $\mathrm{Re}(h^\top \mathrm{diag}(r)\, \bar{t})$ | $-\sum_{(h,r,t)\in G\cup G^-}\big(p^T_{(h,r,t)}\log p^S_{(h,r,t)}+(1-p^T_{(h,r,t)})\log(1-p^S_{(h,r,t)})\big)$ | $-\sum_{(h,r,t)\in G\cup G^-}\big(y\log p^S_{(h,r,t)}+(1-y)\log(1-p^S_{(h,r,t)})\big)$ |
| RotatE | $-\lVert h \circ r - t \rVert$ | (same as ComplEx) | (same as ComplEx) |

Table 1: Score functions, soft label losses and hard label losses for the student of some popular knowledge graph embedding models in DistilE. Here $\bar{x}$ denotes the conjugate of a complex number $x$, $\circ$ denotes the Hadamard product, $f^S_r(h,t)$ and $f^T_r(h,t)$ denote the score functions of the student and the teacher respectively, $p^T_{(h,r,t)} = \frac{\exp f^T_r(h,t)}{1 + \exp f^T_r(h,t)}$ and $p^S_{(h,r,t)} = \frac{\exp f^S_r(h,t)}{1 + \exp f^S_r(h,t)}$.
Soft Label Loss. Since the outputs of these KGEs have a probabilistic interpretation, the soft label loss of the student can be defined as the cross entropy between the probability distributions output by the student and the teacher, as in conventional knowledge distillation:

$$L^S_{soft} = - \sum_{(h,r,t) \in G \cup G^-} \left( p^T_{(h,r,t)} \log p^S_{(h,r,t)} + (1 - p^T_{(h,r,t)}) \log(1 - p^S_{(h,r,t)}) \right). \qquad (8)$$
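For sigmoid-output KGEs, Eqs. (7), (8) and (5) amount to two binary cross entropies that differ only in their targets, as the following sketch shows. This is our illustration; using `binary_cross_entropy_with_logits` with soft targets is one convenient realization, not the authors' stated implementation.

```python
import torch
import torch.nn.functional as F

def ce_distill_loss(s_scores, t_scores, labels, alpha):
    """Student loss for cross-entropy KGEs, Eqs. (7), (8) and (5).

    s_scores, t_scores: (n,) raw triple scores f^S and f^T over G u G^-.
    labels: (n,) float tensor of ground-truth 0/1 labels y.
    """
    p_teacher = torch.sigmoid(t_scores).detach()   # soft labels p^T
    # Hard label loss, Eq. (7): cross entropy against the labels y.
    hard = F.binary_cross_entropy_with_logits(s_scores, labels)
    # Soft label loss, Eq. (8): cross entropy against teacher probabilities.
    soft = F.binary_cross_entropy_with_logits(s_scores, p_teacher)
    # Final loss, Eq. (5).
    return alpha * soft + (1 - alpha) * hard
```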
Final Loss. The final loss for these KGEs is the same as Eq. (5). Table 1 summarizes the score function, soft label loss and hard label loss for the student of some popular knowledge graph embedding models in DistilE.
Two-stage Distillation Approach
In the previous part, we introduced how to let the student extract knowledge from the KGE teacher, where the student is trained with hard labels and the soft labels generated by a fixed teacher. To obtain a better student, we propose a two-stage distillation approach that improves the student's acceptance of the teacher by unfreezing the teacher and engaging it to learn from the student in a second stage of distillation.
The First Stage.
The first stage is similar to conventional knowledge distillation methods: the teacher is frozen and unchanged while training the student, as introduced in the previous section.
The Second Stage.
In this stage, the teacher is unfrozen and tries to adjust itself to improve its acceptance by the student. The basic idea is that we not only train the teacher with the hard labels to guarantee its performance, but also engage it to fit soft labels generated by the student. Essentially, this can be regarded as a process in which the teacher also learns from its student in reverse. As a result, the teacher becomes more adaptable to the student, thereby improving the distillation effect.
For KGEs with Marginal Loss.
The hard label loss for optimizing the teacher is the same as Eq. (1), and the soft label loss can be formulated as:

$$L^T_{soft} = \sum_{(h,r,t) \in G} \left( \left| f^S_r(h, t) - f^T_r(h, t) \right| + \sum_{i=1}^{k} \left| f^S_r(h'_i, t'_i) - f^T_r(h'_i, t'_i) \right| \right). \qquad (9)$$

Eqs. (4) and (9) are identical because the absolute difference of two numbers is symmetric.
The hard label loss for optimizing the teacher is the same as Eq. (6), and the soft label loss can be expressed as:

$$L^T_{soft} = - \sum_{(h,r,t) \in G \cup G^-} \left( p^S_{(h,r,t)} \log p^T_{(h,r,t)} + (1 - p^S_{(h,r,t)}) \log(1 - p^T_{(h,r,t)}) \right). \qquad (10)$$

Final Loss. The final loss of the second stage is a weighted sum of the soft label losses and hard label losses of the teacher and the student:

$$L = \alpha L^S_{soft} + (1 - \alpha) L^S_{hard} + \beta L^T_{soft} + (1 - \beta) L^T_{hard}, \qquad (11)$$

where $\alpha$ and $\beta$ are independent weight hyperparameters for the different parts.
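The two stages can be organized as in the following training-loop sketch. This is illustrative only: the `loss_hard`/`loss_soft` module methods and the shared learning rate are our assumptions, not the authors' API.

```python
import torch

def distill(teacher, student, loader, alpha, beta, epochs1, epochs2, lr):
    """Two-stage distillation sketch over a dataloader of triple batches."""
    opt_s = torch.optim.Adam(student.parameters(), lr=lr)

    # Stage 1 (Eq. 5): teacher frozen; the student fits the hard labels
    # and the teacher's soft labels.
    teacher.requires_grad_(False)
    for _ in range(epochs1):
        for batch in loader:
            loss = (alpha * student.loss_soft(batch, teacher)
                    + (1 - alpha) * student.loss_hard(batch))
            opt_s.zero_grad()
            loss.backward()
            opt_s.step()

    # Stage 2 (Eq. 11): teacher unfrozen; teacher and student are
    # co-optimized so the teacher adapts itself to the student.
    teacher.requires_grad_(True)
    opt_t = torch.optim.Adam(teacher.parameters(), lr=lr)
    for _ in range(epochs2):
        for batch in loader:
            loss = (alpha * student.loss_soft(batch, teacher)
                    + (1 - alpha) * student.loss_hard(batch)
                    + beta * teacher.loss_soft(batch, student)
                    + (1 - beta) * teacher.loss_hard(batch))
            opt_s.zero_grad()
            opt_t.zero_grad()
            loss.backward()
            opt_s.step()
            opt_t.step()
```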
Experiments

We evaluate DistilE on typical KGE benchmarks and are particularly interested in the following questions:
• Whether it is capable of distilling a good student from the teacher that performs better than a same-dimensional model trained from scratch without distillation;
• How much the inference time improves after the distillation procedure;
• Whether and how much the two-stage distillation approach contributes to our proposal.
Datasets and Implementation Details
Datasets.
We experiment on two common knowledge graph completion benchmark datasets, WN18RR (Toutanova et al. 2015) and FB15k-237 (Dettmers et al. 2018), subsets of WordNet (Bordes et al. 2013) and Freebase (Bordes et al. 2013) with redundant inverse relations removed. Table 2 shows the statistics of these two datasets.
| Dataset | #Entity | #Relation | #Train | #Valid | #Test |
|---|---|---|---|---|---|
| WN18RR | 40,943 | 11 | 86,835 | 3,034 | 3,134 |
| FB15k-237 | 14,541 | 237 | 272,115 | 17,535 | 20,466 |

Table 2: Statistics of the datasets used in the experiments.
Evaluation Metrics.
We adopt the standard metrics MR, MRR, and Hit@k (k = 1, 3, 10). Given a test triple (h, r, t), we first replace the head entity h with each entity e ∈ E to generate candidate triples (e, r, t). We then use the score function f_r(e, t) to compute the scores of all candidate triples and sort them in descending order, from which we obtain the rank of (h, r, t)'s score, rank_h, as its head prediction result. For tail prediction, we replace t with every e ∈ E to generate candidate triples (h, r, e) and obtain the tail prediction rank rank_t in the same way. We average rank_h and rank_t as the final rank of (h, r, t). Finally, we compute MR, MRR, and Hit@k over the ranks of all test triples: MR is their mean rank, MRR is their mean reciprocal rank, and Hit@k measures the percentage of test triples with rank ≤ k. We also use the filtered setting (Bordes et al. 2013), removing from the candidate set all triples that exist in the training, validation, and test sets.
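The filtered ranking protocol above can be implemented compactly. The sketch below assumes higher scores mean more plausible (negate distance-based scores before calling) and takes as input the per-triple ranks already averaged over head and tail prediction.

```python
import numpy as np

def filtered_rank(scores, true_idx, filter_idx):
    """Filtered rank of the true entity among all candidates.

    scores:     (|E|,) array of candidate scores, higher = more plausible.
    true_idx:   index of the correct entity (must not be in filter_idx).
    filter_idx: indices of other entities that also form known true
                triples; they are excluded per the filtered setting.
    """
    scores = scores.astype(float).copy()
    scores[filter_idx] = -np.inf           # drop known true candidates
    return int((scores > scores[true_idx]).sum()) + 1

def summarize(ranks, ks=(1, 3, 10)):
    """MR, MRR and Hit@k over the final (head/tail-averaged) ranks."""
    ranks = np.asarray(ranks, dtype=float)
    metrics = {"MR": ranks.mean(), "MRR": (1.0 / ranks).mean()}
    for k in ks:
        metrics[f"Hit@{k}"] = (ranks <= k).mean()
    return metrics
```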
Baselines. We implement DistilE on several teacher KGEs: TransE and TransH are chosen as KGEs with margin loss, and ComplEx and RotatE as KGEs with cross entropy loss.
Implementation Details.
For the teacher, we set the embedding dimension to d_teacher = 64 in the primary experiments (higher-dimensional teachers are examined separately below), and we set d_student = {32, 16, 8} for the students. We fix the batch size and the maximum number of training epochs across all runs. For the remaining hyperparameters we follow the settings in the original papers of the KGEs, e.g., the margin γ = 1.0 in TransE; the margin, soft-constraint weight C and dissimilarity measure in TransH; and γ = 6.0, ε = 2.0 in RotatE. For each positive triple, we generate 5 negatives in TransE and TransH and 25 in ComplEx and RotatE. We choose Adam (Kingma and Ba 2015) as the optimizer with learning rate decay, and we perform a grid search over the learning rate and the balance hyperparameters α and β, reporting the results of the best combination.

Q1: Whether our method successfully distills a good student?
To verify whether DistilE successfully distills a good student, we first train a student with a higher-dimensional teacher, marked as 'DS', and then train a same-dimensional student with only the hard label loss, marked as 'no-DS', which is equivalent to directly training an original KGE model of the same dimension. We then compare their link prediction performance. Tables 3 and 4 show the results on WN18RR and FB15k-237 with different student dimensions.
Results Analysis.
First we analyze the results on WN18RR in Table 3. The performance of the 'no-DS' models decreases significantly as the embedding dimension is reduced: compared with the 64-dimensional teachers, the low-dimensional 'no-DS' models recover only a fraction of the teacher's MRR, Hit@10, Hit@3, and Hit@1. For instance, TransH's MRR drops sharply from the teacher's .180 and ComplEx's from .312 at small dimensions. This illustrates that directly training low-dimensional KGEs produces poor results.

Compared with the 'no-DS' results, our distilled models of the same dimension achieve better results in most settings. For the 32-dimensional students, for example, the MRR of TransE improves from .185 to .201 and its Hit@1 from .030 to .049; the MRR, Hit@10, and Hit@1 of ComplEx improve from .211 to .319, from .368 to .443, and from .143 to .251 respectively; and the MRR and Hit@1 of RotatE improve from .467 to .477 and from .419 to .430 respectively. Averaged over these four KGEs, this amounts to an average improvement of 98.8% on MRR, with corresponding gains on Hit@10, Hit@3, and Hit@1. On the whole, we conclude that compared with 'no-DS', our method greatly improves the performance of low-dimensional models.

More importantly, compared with the 64-dimensional teachers, our 32-dimensional students even surpass them on some metrics. For example, with TransH, the MRR, Hit@3, and Hit@1 of our 32-dimensional student surpass the teacher (.198 vs .180, .360 vs .326, and .020 vs .016 respectively). Our 16-dimensional students, with a 4× model compression rate relative to the 64-dimensional teacher, achieve results close to the teacher on many metrics; taking RotatE as an example, our 16-dimensional student reaches over 90% of the teacher's results on MRR, Hit@10, Hit@3, and Hit@1.

The above analyses are based on WN18RR, and the results on FB15k-237 in Table 4 show a similar phenomenon. Thus we conclude that our method does successfully distill a good student.

Higher Dimensional Teachers.
Since the teacher's dimension may matter, we also conduct experiments with higher-dimensional TransE teachers to evaluate the influence of the teacher's dimension.

| Dim | Method | TransE (MRR/H10/H3/H1) | TransH (MRR/H10/H3/H1) | ComplEx (MRR/H10/H3/H1) | RotatE (MRR/H10/H3/H1) |
|---|---|---|---|---|---|
| 64 | Tea. | .194/.481/.331/.037 | .180/.470/.326/.016 | .312/.444/.356/.240 | .478/.567/.500/.431 |
| 32 | no-DS | .185/.427/.316/.030 | .172/.423/.307/.014 | .211/.368/.225/.143 | .467/.545/.495/.419 |
| 32 | DS(ours) | .201/.486/.319/.049 | .198/.456/.360/.020 | .319/.443/.361/.251 | .477/.562/.497/.430 |
| 16 | no-DS | .131/.311/.230/.012 | .131/.352/.203/.013 | .142/.271/.156/.079 | .413/.479/.438/.374 |
| 16 | DS(ours) | .171/.413/.274/.030 | .161/.415/.264/.016 | .262/.421/.295/.187 | .433/.525/.468/.407 |
| 8 | DS(ours) | .125/.307/.168/.032 | .071/.182/.071/.014 | .137/.268/.157/.068 | .218/.406/.304/.107 |

Table 3: Link prediction results on WN18RR.
| Dim | Method | TransE (MRR/H10/H3/H1) | TransH (MRR/H10/H3/H1) | ComplEx (MRR/H10/H3/H1) | RotatE (MRR/H10/H3/H1) |
|---|---|---|---|---|---|
| 64 | Tea. | .288/.468/.319/.196 | .365/.556/.406/.266 | .274/.471/.307/.178 | .420/.613/.463/.322 |
| 32 | no-DS | .263/.433/.289/.176 | .333/.521/.367/.236 | .197/.349/.221/.115 | .365/.557/.405/.268 |
| 32 | DS(ours) | .286/.463/.317/.196 | .351/.537/.393/.253 | .221/.430/.258/.139 | .410/.599/.456/.312 |
| 16 | no-DS | .252/.410/.259/.154 | .295/.458/.324/.209 | .118/.270/.125/.057 | .297/.484/.337/.203 |
| 16 | DS(ours) | .267/.445/.299/.176 | .337/.516/.369/.246 | .179/.360/.176/.096 | .365/.563/.423/.262 |
| 8 | DS(ours) | .238/.425/.256/.147 | .296/.462/.331/.203 | .183/.368/.212/.099 | .297/.485/.322/.178 |

Table 4: Link prediction results on FB15k-237.

Figure 2: Students' test MRR distilled by teachers with different dimensions on the WN18RR dataset for TransE.

Figure 2 shows a heatmap of the MRR of students distilled from teachers of different dimensions on WN18RR. It shows that (1) for 32-dimensional students, a higher-dimensional teacher achieves slightly better results; (2) for 16-dimensional students, a higher-dimensional teacher does not achieve better results; and (3) for 8-dimensional students, a higher-dimensional teacher achieves worse results. This indicates that our method's compression capability is about 8 times, so it is not always necessary to distill from a bigger teacher. An intuition is that although a bigger teacher is more expressive, an overly high compression ratio may prevent the student from absorbing important information from it. This analysis suggests that for an application requiring an especially low student dimension d, instead of choosing a very high-dimensional teacher with the best pretraining performance, it is better to choose a teacher with dimension ≤ 8 × d.

Q2: Whether the distilled student successfully accelerates training and inference speed?
Figure 3: Test MRR of the 32-dim RotatE student on FB15k-237 as training proceeds, with and without 64-dim teacher guidance.
Training Speed.
Figure 3 shows the convergence of 32-dimensional students with and without distillation. We observe that with distillation, our method converges significantly faster and more stably than 'no-DS' from the beginning and finally achieves better results. After the second distillation stage (S2, the right half of the red line separated by the black dashed line) begins, the MRR fluctuates slightly and quickly converges to an even better result. The reason for the fluctuation is that at the beginning of S2 the teacher starts to adapt according to the soft labels from the student. Although S2 optimizes the student and teacher together and thus introduces additional training, it converges very quickly and does not increase the total training time significantly; as shown in Figure 3, only about 50 epochs are enough for convergence during S2.

| Dim | TransE | TransH | ComplEx | RotatE |
|---|---|---|---|---|
| 64 | 2621.056 / – (1×) | 2621.76 / 472.9 (1×) | 5242.112 / 569.4 (1×) | 5241.408 / 617.2 (1×) |
| 32 | 1310.528 / 137.4 (2.87×) | 1310.88 / 173.2 (2.73×) | 2621.056 / 200.6 (2.84×) | 2620.704 / 237.2 (2.60×) |
| 16 | 655.264 / 69.7 (5.66×) | 655.44 / 93.2 (5.07×) | 1310.528 / 117.7 (4.84×) | 1310.352 / 134.7 (4.58×) |
| 8 | 327.632 / 35.8 (11.01×) | 327.72 / 46.4 (10.19×) | 655.264 / 75.3 (7.56×) | 655.176 / 82.6 (7.47×) |

Table 5: The number of parameters (in thousands) and inference times (relative speed-up in parentheses) for TransE, TransH, ComplEx and RotatE.
Inference Speed. To test the inference speed of the teacher and the student, we conduct link prediction experiments on 93,003 triples sampled from WN18RR. Inference is performed on a single GeForce GTX GPU with a batch size of 1024. To avoid accidental factors, we repeat each experiment 3 times and report the average time. Table 5 shows the inference time as well as the number of parameters. The reduction in parameter count is proportional to the compression rate, so the memory for storing the 32-, 16- and 8-dimensional students shrinks by 2, 4, and 8 times respectively compared with a 64-dimensional model. The table also shows that our distillation achieves almost linear acceleration of inference: the inference time of a 64-dimensional teacher is about 5 times that of the 16-dimensional student, and nearly or even more than 10 times that of the 8-dimensional student.

We observe the same phenomenon on FB15k-237 but omit it due to space limitations. These results confirm that our distilled student successfully accelerates training and inference.
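The parameter counts in Table 5 follow directly from storing one d-dimensional vector per entity and relation, i.e., (|E| + |R|) × d parameters (doubled for complex-valued models), which is easy to verify for WN18RR:

```python
# WN18RR has 40,943 entities and 11 relations (Table 2).
E, R = 40943, 11

for d in (64, 32, 16, 8):
    real_valued = (E + R) * d         # TransE: one d-dim real vector each
    complex_valued = 2 * (E + R) * d  # ComplEx/RotatE: real + imaginary parts
    print(f"d={d}: {real_valued / 1e3:.3f}K / {complex_valued / 1e3:.3f}K")
# d=64 gives 2621.056K and 5242.112K parameters, matching Table 5.
# (TransH adds one extra normal vector per relation: (E + 2R) * d.)
```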
Q3: Whether and how much does the two-stage distillation approach contribute to the result?
To study the impact of the two-stage distillation approach, we conduct an ablation study comparing our full method with two stages (DS) against removing the first stage (-S1) and removing the second stage (-S2). Table 6 summarizes the MRR and Hit@10 results on WN18RR.

| Dim | Method | TransE (MRR/H10) | TransH (MRR/H10) | ComplEx (MRR/H10) | RotatE (MRR/H10) |
|---|---|---|---|---|---|
| 32 | DS | .201/.486 | .198/.467 | .319/.443 | .477/.562 |
| 32 | -S1 | .196/.483 | .179/.463 | .317/.442 | .419/.458 |
| 32 | -S2 | .193/.473 | .169/.456 | .318/.442 | .474/.561 |
| 16 | DS | .171/.413 | .161/.415 | .262/.421 | .432/.525 |
| 16 | -S1 | .137/.352 | .101/.258 | .133/.275 | .291/.342 |
| 16 | -S2 | .163/.395 | .134/.391 | .250/.412 | .433/.525 |
| 8 | DS | .125/.307 | .071/.182 | .137/.268 | .218/.406 |
| 8 | -S1 | .081/.183 | .037/.082 | .051/.097 | .050/.115 |
| 8 | -S2 | .113/.284 | .069/.177 | .136/.269 | .209/.403 |

Table 6: Ablation study on WN18RR (MRR and Hit@10).

After removing S1 with only S2 preserved (-S1), the performance is overall lower than that of DS. Presumably, the reason is that both the teacher and the student adapt to each other in S2; with a randomly initialized student, the student conveys mostly useless information to the teacher, which may mislead and degrade the teacher. In addition, the performance of '-S1' is very unstable: the 32-dimensional students obtain results only slightly worse than DS, while the 16-dimensional and especially the 8-dimensional students perform very poorly. Taking the 8-dimensional student of RotatE as an example, the MRR and Hit@10 of '-S1' are only .050 and .115, compared with .218 and .406 for DS. This is even worse than directly training a same-sized student without distillation, showing that the first stage is necessary for distillation.

After removing S2 with only S1 preserved (-S2), the performance decreases in almost all settings. Taking TransE as an example, compared with DS, the MRR of the 32-, 16- and 8-dimensional '-S2' students drops from .201 to .193, from .171 to .163, and from .125 to .113 respectively, indicating that the second stage can indeed make the teacher and student adapt to each other and further improve the result.

We also observe the same phenomenon on FB15k-237, omitted due to space limitations. These results support the effectiveness of our two-stage distillation: first train the student in S1 until it converges to a certain performance, then co-optimize the teacher and student in S2.

Conclusion and Future Work
Too many embedding parameters in a knowledge graph bring huge storage and computation challenges to practical application scenarios. In this work, we propose a novel KGE distillation method to compress KGEs into lower-dimensional spaces. To successfully apply knowledge distillation to KGEs with special structures, we design specific soft label losses for different KGEs. To enable the student to fully absorb the rich information from the teacher, our method encourages the teacher and student to adapt to each other through a two-stage distillation approach. We evaluated our method on the link prediction task with several different KGEs and benchmark datasets. Results show that our method can effectively reduce model parameters and greatly improve inference speed without much performance loss compared with the teacher, and that it yields better reasoning capability than a directly trained model of the same dimension.

In this work, we only considered transferring knowledge through the final outputs of KGEs. In the future, we would like to explore multi-layer distillation from other network layers of KGEs, and to study knowledge distillation of KGEs in more complex settings, such as adversarial learning and ensemble learning.
References

Berant, J., and Liang, P. 2014. Semantic parsing via paraphrasing. In ACL (1), 1415–1425. The Association for Computer Linguistics.

Berant, J.; Chou, A.; Frostig, R.; and Liang, P. 2013. Semantic parsing on Freebase from question-answer pairs. In EMNLP, 1533–1544. ACL.

Bordes, A.; Usunier, N.; García-Durán, A.; Weston, J.; and Yakhnenko, O. 2013. Translating embeddings for modeling multi-relational data. In NIPS, 2787–2795.

Castellano, G.; Fanelli, A. M.; and Pelillo, M. 1997. An iterative pruning algorithm for feedforward neural networks. IEEE Trans. Neural Networks.

Daiber, J.; Jakob, M.; Hokamp, C.; and Mendes, P. N. 2013. Improving efficiency and accuracy in multilingual entity extraction. In I-SEMANTICS, 121–124. ACM.

Dehghani, M.; Gouws, S.; Vinyals, O.; Uszkoreit, J.; and Kaiser, L. 2019. Universal transformers. In ICLR (Poster). OpenReview.net.

Dettmers, T.; Minervini, P.; Stenetorp, P.; and Riedel, S. 2018. Convolutional 2D knowledge graph embeddings. In AAAI, 1811–1818. AAAI Press.

Devlin, J.; Chang, M.; Lee, K.; and Toutanova, K. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL-HLT (1), 4171–4186. Association for Computational Linguistics.

Diefenbach, D.; Singh, K. D.; and Maret, P. 2018. WDAqua-core1: A question answering service for RDF knowledge bases. In WWW (Companion Volume), 1087–1091. ACM.

Hinton, G. E.; Vinyals, O.; and Dean, J. 2015. Distilling the knowledge in a neural network. CoRR abs/1503.02531.

Hoffmann, R.; Zhang, C.; Ling, X.; Zettlemoyer, L. S.; and Weld, D. S. 2011. Knowledge-based weak supervision for information extraction of overlapping relations. In ACL, 541–550. The Association for Computer Linguistics.

Ji, G.; He, S.; Xu, L.; Liu, K.; and Zhao, J. 2015. Knowledge graph embedding via dynamic mapping matrix. In ACL (1), 687–696. The Association for Computer Linguistics.

Kingma, D. P., and Ba, J. 2015. Adam: A method for stochastic optimization. In ICLR (Poster).

Lan, Z.; Chen, M.; Goodman, S.; Gimpel, K.; Sharma, P.; and Soricut, R. 2020. ALBERT: A lite BERT for self-supervised learning of language representations. In ICLR. OpenReview.net.

Lin, Y.; Liu, Z.; Sun, M.; Liu, Y.; and Zhu, X. 2015. Learning entity and relation embeddings for knowledge graph completion. In AAAI, 2181–2187. AAAI Press.

Lin, D. D.; Talathi, S. S.; and Annapureddy, V. S. 2016. Fixed point quantization of deep convolutional networks. In ICML, volume 48 of JMLR Workshop and Conference Proceedings, 2849–2858. JMLR.org.

Molchanov, P.; Tyree, S.; Karras, T.; Aila, T.; and Kautz, J. 2017. Pruning convolutional neural networks for resource efficient inference. In ICLR (Poster). OpenReview.net.

Nguyen, D. Q.; Nguyen, T. D.; Nguyen, D. Q.; and Phung, D. Q. 2018. A novel embedding model for knowledge base completion based on convolutional neural network. In NAACL-HLT (2), 327–333. Association for Computational Linguistics.

Nguyen, D. Q.; Vu, T.; Nguyen, T. D.; Nguyen, D. Q.; and Phung, D. Q. 2019. A capsule network-based embedding model for knowledge graph completion and search personalization. In NAACL-HLT (1), 2180–2189. Association for Computational Linguistics.

Nickel, M.; Rosasco, L.; and Poggio, T. A. 2016. Holographic embeddings of knowledge graphs. In AAAI, 1955–1961. AAAI Press.

Nickel, M.; Tresp, V.; and Kriegel, H. 2011. A three-way model for collective learning on multi-relational data. In ICML, 809–816. Omnipress.

Sachan, M. 2020. Knowledge graph embedding compression. In ACL, 2681–2691. Association for Computational Linguistics.

Sun, S.; Cheng, Y.; Gan, Z.; and Liu, J. 2019a. Patient knowledge distillation for BERT model compression. In EMNLP/IJCNLP (1), 4322–4331. Association for Computational Linguistics.

Sun, Z.; Deng, Z.; Nie, J.; and Tang, J. 2019b. RotatE: Knowledge graph embedding by relational rotation in complex space. In ICLR (Poster). OpenReview.net.

Tang, R.; Lu, Y.; Liu, L.; Mou, L.; Vechtomova, O.; and Lin, J. 2019. Distilling task-specific knowledge from BERT into simple neural networks. CoRR abs/1903.12136.

Tian, Y.; Krishnan, D.; and Isola, P. 2020. Contrastive representation distillation. In ICLR. OpenReview.net.

Toutanova, K.; Chen, D.; Pantel, P.; Poon, H.; Choudhury, P.; and Gamon, M. 2015. Representing text for joint embedding of text and knowledge bases. In EMNLP, 1499–1509. The Association for Computational Linguistics.

Trouillon, T.; Welbl, J.; Riedel, S.; Gaussier, É.; and Bouchard, G. 2016. Complex embeddings for simple link prediction. In ICML, volume 48 of JMLR Workshop and Conference Proceedings, 2071–2080. JMLR.org.

Wang, Z.; Zhang, J.; Feng, J.; and Chen, Z. 2014. Knowledge graph embedding by translating on hyperplanes. In AAAI, 1112–1119. AAAI Press.

Xu, C., and Li, R. 2019. Relation embedding with dihedral group in knowledge graph. In ACL (1), 263–272. Association for Computational Linguistics.

Yang, B.; Yih, W.; He, X.; Gao, J.; and Deng, L. 2015. Embedding entities and relations for learning and inference in knowledge bases. In ICLR (Poster).

Zhang, Y.; Liu, K.; He, S.; Ji, G.; Liu, Z.; Wu, H.; and Zhao, J. 2016. Question answering over knowledge base with neural attention combining global knowledge information. CoRR abs/1606.00979.

Zhang, S.; Tay, Y.; Yao, L.; and Liu, Q. 2019. Quaternion knowledge graph embeddings. In NeurIPS, 2731–2741.

Zhao, S.; Gupta, R.; Song, Y.; and Zhou, D. 2019. Extreme language model compression with optimal subwords and shared projections. CoRR abs/1909.11687.