GIKT: A Graph-based Interaction Model for Knowledge Tracing
Yang Yang, Jian Shen, Yanru Qu, Yunfei Liu, Kerong Wang, Yaoming Zhu, Weinan Zhang, Yong Yu
Shanghai Jiao Tong University: {yyang,rockyshen,ymzhu,yyu}@apex.sjtu.edu.cn, {liuyunfei,wangkerong,wnzhang}@sjtu.edu.cn; University of Illinois, Urbana-Champaign: [email protected]
Abstract.
With the rapid development of online education, knowledge tracing (KT) has become a fundamental problem: tracing students' knowledge status and predicting their performance on new questions. Questions are often numerous in online education systems, and are always associated with much fewer skills. However, the previous literature fails to involve question information together with high-order question-skill correlations, mostly because of data sparsity and multi-skill problems. From the model perspective, previous models can hardly capture the long-term dependency of student exercise history, and cannot model the interactions between student-questions and student-skills in a consistent way. In this paper, we propose a Graph-based Interaction model for Knowledge Tracing (GIKT) to tackle the above problems. More specifically, GIKT utilizes a graph convolutional network (GCN) to substantially incorporate question-skill correlations via embedding propagation. Besides, considering that relevant questions are usually scattered throughout the exercise history, and that questions and skills are just different instantiations of knowledge, GIKT generalizes the degree of a student's mastery of the question to the interactions between the student's current state, the student's related history exercises, the target question, and related skills. Experiments on three datasets demonstrate that GIKT achieves new state-of-the-art performance, with at least 1% absolute AUC improvement.
Keywords:
Knowledge Tracing · Graph Neural Network · Information Interaction.
Introduction

In online learning platforms such as MOOCs or intelligent tutoring systems, knowledge tracing (KT) [6] is an essential task, which aims at tracing the knowledge state of students. At a colloquial level, KT solves the problem of predicting whether students can answer a new question correctly according to their previous learning history. The KT task has been widely studied and various methods have been proposed to handle it.

Existing KT methods [21,35,2] commonly build predictive models based on the skills that the target questions correspond to rather than the questions themselves. In the KT task, there exist several skills and lots of questions, where one skill is related to many questions and one question may correspond to more than one skill, which can be represented by a relation graph such as the example shown in Figure 1. Due to the assumption that skill mastery can reflect, to some extent, whether students are able to answer the related questions correctly, it is a feasible alternative to make predictions based on the skills, just like previous KT works.
Fig. 1.
A simple example of question-skill relation graph.
Although these pure skill-based KT methods have achieved empirical success, the characteristics of questions are neglected, which may lead to degraded performance. For instance, in Figure 1, even though two questions share the same skills, their different difficulties may result in different probabilities of being answered correctly. To this end, several previous works [14] utilize the question characteristics as a supplement to the skill inputs. However, as the number of questions is usually large while many students attempt only a small subset of questions, most questions are answered by only a few students, leading to the data sparsity problem [28]. Besides, for questions sharing part of their skills, simply augmenting the question characteristics loses latent inter-question and inter-skill information. Based on these considerations, it is important to exploit high-order information between the questions and skills.

In this paper, we first investigate how to effectively extract the high-order relation information contained in the question-skill relation graph. Motivated by the great power of Graph Neural Networks (GNNs) [26,13,10] to extract graph representations by aggregating information from neighbors, we leverage a graph convolutional network (GCN) to learn embeddings for questions and skills from high-order relations. Once the question and skill embeddings are aggregated, we can directly feed question embeddings together with corresponding answer embeddings as the input of KT models.

In addition to the input features, another key issue in KT is the model framework. Recent advances in deep learning stimulate a fruitful line of deep KT works, which leverage deep neural networks to sequentially capture the changes of students' knowledge state.
Two representative deep KT models are Deep Knowledge Tracing (DKT) [21] and Dynamic Key-Value Memory Networks (DKVMN) [35], which leverage Recurrent Neural Networks (RNN) [31] and Memory-Augmented Neural Networks (MANN) respectively to solve KT. However, they are notoriously unable to capture long-term dependencies in a question sequence [1]. To handle this problem, Sequential Key-Value Memory Networks (SKVMN) [1] proposes a hop-LSTM architecture that aggregates hidden states of similar exercises into a new state, and Exercise-Enhanced Recurrent Neural Network with Attention mechanism (EERNNA) [25] uses the attention mechanism to perform weighted-sum aggregation over all history states.

Instead of aggregating related history information into a new state for prediction directly, we take a step further towards improving long-term dependency capture and better modeling the student's mastery degree. Inspired by SKVMN and EERNNA, we introduce a recap module that selects several of the most related hidden exercises according to attention weights, with the intention of noise reduction. Considering the mastery of the new question and its related skills, we generalize the interaction module and interact the relevant exercises and the current hidden state with the aggregated question and skill embeddings. The generalized interaction module can better model the student's mastery degree of questions and skills.
Besides, an attention mechanism is applied to each interaction to make final predictions, which automatically weights the prediction utility of all the interactions.

To sum up, in this paper we propose an end-to-end deep framework, namely Graph-based Interaction for Knowledge Tracing (GIKT). Our main contributions are summarized as follows: 1) By leveraging a graph convolutional network to aggregate question and skill embeddings, GIKT is capable of exploiting high-order question-skill relations, which mitigates the data sparsity problem and the multi-skill issue. 2) By introducing a recap module followed by an interaction module, our model can better represent the student's mastery degree of the new question and its related skills in a consistent way. 3) Empirically, we conduct extensive experiments on three benchmark datasets and the results demonstrate that GIKT outperforms the state-of-the-art baselines substantially.
Related Work

Existing knowledge tracing methods can be roughly categorized into two groups: traditional machine learning methods and deep learning methods. In this paper, we mainly focus on the deep KT methods.
Traditional machine learning KT methods mainly involve two types: Bayesian Knowledge Tracing (BKT) [6] and factor analysis models. BKT is a hidden Markov model which regards each skill as a binary variable and uses Bayes' rule to update the state. Several works extend the vanilla BKT model to incorporate more information, such as slip and guess probability [2], skill difficulty [19] and student individualization [18,34]. On the other hand, factor analysis models focus on learning general parameters from historical data to make predictions. Among the factor analysis models, Item Response Theory (IRT) [8] models parameters for student ability and question difficulty, Performance Factors Analysis (PFA) [20] takes into account the number of positive and negative responses for skills, and Knowledge Tracing Machines [27] leverages Factorization Machines [24] to encode side information of questions and users into the parameter model.

Recently, due to their great capacity and effective representation learning, deep neural networks have been leveraged in the KT literature. Deep Knowledge Tracing (DKT) [21] is the first deep KT method, which uses a recurrent neural network (RNN) to trace the knowledge state of the student. Dynamic Key-Value Memory Networks (DKVMN) [35] can discover the underlying concepts of each skill and trace states for each concept. Based on these two models, several methods have been proposed that consider more information, such as the forgetting behavior of students [16], multi-skill information and prerequisite skill relation graphs labeled by experts [4], or student individualization [15]. GKT [17] builds a skill relation graph and learns their relations explicitly. However, these methods only use skills as input, which causes information loss.

Some deep KT methods take question characteristics into account for prediction.
Dynamic Student Classification on Memory Networks (DSCMN) [14] utilizes question difficulty to help distinguish questions related to the same skills. Exercise-Enhanced Recurrent Neural Network with Attention mechanism (EERNNA) [25] encodes question embeddings from the textual content of questions, so that the question embeddings contain the characteristic information of questions; in reality, however, it is difficult to collect the content of questions. To address the data sparsity problem, DHKT [29] augments DKT by using the relations between questions and skills to get question representations, which, however, fails to capture inter-question and inter-skill relations. In this paper we use GCN to extract the high-order information contained in the question-skill graph.

To handle the long-term dependency issue, Sequential Key-Value Memory Networks (SKVMN) [1] uses a modified LSTM with hops to enhance the capacity of capturing long-term dependencies in an exercise sequence, and EERNNA [25] assumes that the current student knowledge state is a weighted-sum aggregation of all historical student states based on correlations between the current question and historical questions. Our method differs from these two works in that they aggregate related hidden states into a new state for prediction, while we first select the most useful history exercises to reduce the effect of noise in the current state, and then perform pairwise interaction for prediction.
In recent years, graph data has been widely used in deep learning models. However, traditional neural networks suffer from the complex non-Euclidean structure of graphs. Inspired by CNNs, some works use convolutional methods for graph-structured data [13,7]. Graph convolutional networks (GCNs) [13] were proposed for semi-supervised graph classification, updating node representations based on the node itself and its neighbors. In this way, the updated node representations contain the attributes of neighbor nodes, and information from high-order neighbors if multiple graph convolution layers are used. Due to the great success of GCNs, some variants have been further proposed for graph data [26,10].

With the development of Graph Neural Networks (GNNs), many applications based on GNNs have appeared in various domains, such as natural language processing (NLP) [3,33], computer vision (CV) [22,9] and recommender systems [30,23]. As GNNs help to capture high-order information, we use GCN in our GIKT model to encode the relations between skills and questions into their representations. To the best of our knowledge, GIKT is the first work to model question-skill relations via a graph neural network.
Problem Formulation

Knowledge Tracing.
In the knowledge tracing task, students sequentially answer a series of questions that the online learning platform provides. After a student answers each question, feedback on whether the answer is correct is issued. Here we denote an exercise as x_i = (q_i, a_i), where q_i is the question ID and a_i ∈ {0, 1} represents whether the student answered q_i correctly. Given an exercise sequence X = {x_1, x_2, ..., x_{t-1}} and the new question q_t, the goal of KT is to predict the probability p(a_t = 1 | X, q_t) of the student answering it correctly.

Question-Skill Relation Graph.
Each question q_i corresponds to one or more skills {s_1, ..., s_{n_i}}, and one skill s_j is usually related to many questions {q_1, ..., q_{n_j}}, where n_i is the number of skills related to question q_i and n_j is the number of questions related to skill s_j. Here we denote the relations as a question-skill bipartite graph G, defined as {(q, r_qs, s) | q ∈ Q, s ∈ S}, where Q and S correspond to the question and skill sets respectively, and r_qs = 1 if the question q is related to the skill s.

Method

In this section, we introduce our method in detail; the overall framework is shown in Figure 2. We first leverage GCN to learn question and skill representations aggregated on the question-skill relation graph, and a recurrent layer is used to model the sequential change of knowledge state. To capture long-term dependency and exploit useful information comprehensively, we then design a recap module followed by an interaction module for the final prediction.
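The bipartite relation graph above can be stored as two neighbor maps, from which both 1st-hop neighbors (skills of a question) and 2nd-hop neighbors (questions sharing a skill) are derivable. The sketch below is illustrative only; the question/skill IDs are hypothetical toy data, not from the paper's datasets.

```python
def build_relation_graph(edges):
    """Build the question-skill bipartite graph G from (question, skill) pairs.

    Returns two neighbor maps: the skills of each question (1st-hop
    neighbors) and the questions of each skill. 2nd-hop neighbors of a
    question (other questions sharing a skill) follow by composition.
    """
    q_neighbors, s_neighbors = {}, {}
    for q, s in edges:  # each listed pair means r_qs = 1
        q_neighbors.setdefault(q, set()).add(s)
        s_neighbors.setdefault(s, set()).add(q)
    return q_neighbors, s_neighbors

# Toy graph (hypothetical IDs): q1 covers s1; q2 covers s1 and s2.
edges = [("q1", "s1"), ("q2", "s1"), ("q2", "s2")]
q_nb, s_nb = build_relation_graph(edges)

# 2nd-hop neighbors of q1: questions sharing any of its skills.
second_hop = set().union(*(s_nb[s] for s in q_nb["q1"])) - {"q1"}
```

Multi-skill questions (like q2 here) are exactly the case where skill-only models lose information, since one input ID must stand for several skills.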
Fig. 2.
An illustration of GIKT at time step t, where q_t is the new question. First we use GCN to aggregate question and skill embeddings. Then a recurrent neural network is used to model the sequential knowledge state h_t. In the recap module we select the most related hidden exercises of q_t, with a soft selection and a hard selection implementation. The information interaction module performs pairwise interaction between the student's current state, the selected history exercises, the target question and the related skills for the prediction p_t.

Our GIKT method uses embeddings to represent questions, skills and answers. Three embedding matrices E_s ∈ R^{|S|×d}, E_q ∈ R^{|Q|×d} and E_a ∈ R^{2×d} are used for look-up, where d is the embedding size. Each row in E_s or E_q corresponds to a skill or a question, and the two rows in E_a represent incorrect and correct answers respectively. We use s_i, q_i and a_i to denote the i-th row vectors of these matrices. In our framework, we do not pretrain these embeddings; they are trained by optimizing the final objective in an end-to-end manner.

From the training perspective, sparsity in question data raises a big challenge to learning informative question representations, especially for questions with quite limited training examples. From the inference perspective, whether a student can answer a new question correctly depends on the mastery of its related skills and the question characteristics. When he/she has solved similar questions before, he/she is more likely to answer the new question correctly.
In this model, we incorporate the question-skill relation graph G to alleviate sparsity, as well as to utilize prior correlations to obtain better question representations.

Since the question-skill relation graph is bipartite, the 1st-hop neighbors of a question are its corresponding skills, and the 2nd-hop neighbors are other questions sharing the same skills. To extract this high-order information, we leverage a graph convolutional network (GCN) [13] to encode relevant skills and questions into question and skill embeddings.

A graph convolutional network stacks several graph convolution layers to encode high-order neighbor information; in each layer, the node representations are updated from the embeddings of the node itself and its neighbors. Denote the representation of node i in the graph as x_i (x_i can be a skill embedding s_i or a question embedding q_i) and the set of its neighbor nodes as N_i. The l-th GCN layer can then be expressed as:

x_i^l = σ( (1/|N_i|) Σ_{j ∈ N_i ∪ {i}} w^l x_j^{l-1} + b^l ),   (1)

where w^l and b^l are the aggregation weight and bias learned in the l-th GCN layer, and σ is a non-linear transformation such as ReLU.

After embedding propagation by GCN, we obtain the aggregated embeddings of questions and skills, denoted q̃ and s̃. For easy implementation and better parallelization, we sample a fixed number of question neighbors (i.e., n_q) and skill neighbors (i.e., n_s) for each batch. During inference, we run each example multiple times (sampling different neighbors) and average the model outputs to obtain stable prediction results.
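A minimal NumPy sketch of one layer of Eq. (1). It is an illustration under simplifying assumptions: it normalizes by the size of the aggregated set N_i ∪ {i} (a common variant of the 1/|N_i| factor), uses dense Python loops instead of the batched neighbor sampling described above, and the shapes are hypothetical.

```python
import numpy as np

def gcn_layer(X, neighbors, W, b):
    """One graph-convolution layer in the spirit of Eq. (1).

    X:         (num_nodes, d_in) node representations x^{l-1}
    neighbors: list of neighbor index lists, one per node (N_i)
    W, b:      (d_in, d_out) weight and (d_out,) bias of this layer
    """
    out = np.zeros((X.shape[0], W.shape[1]))
    for i, nb in enumerate(neighbors):
        idx = list(nb) + [i]                 # aggregate over N_i ∪ {i}
        msgs = X[idx] @ W + b                # w^l x_j^{l-1} + b^l per neighbor
        # mean aggregation followed by σ = ReLU
        out[i] = np.maximum(msgs.mean(axis=0), 0.0)
    return out

# On the bipartite graph, stacking this layer twice lets a question
# embedding absorb information from questions sharing its skills.
X = np.array([[1.0, 0.0], [0.0, 1.0]])       # toy 2-node graph
H1 = gcn_layer(X, [[1], [0]], np.eye(2), np.zeros(2))
```

Stacking l such layers (the paper uses up to l = 3) propagates information from l-hop neighbors into each node's representation.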
For each history time step t, we concatenate the question and answer embeddings and project them to d dimensions through a non-linear transformation to obtain exercise representations:

e_t = ReLU(W([q̃_t, a_t]) + b),   (2)

where [ , ] denotes vector concatenation. There may exist dependency between different exercises, so we need to model the whole exercise process to capture the changes of the student state and to learn the potential relations between exercises. To model the sequential behavior of a student doing exercises, we use LSTM [11] to learn student states from the input exercise representations:

i_t = σ(W_i[e_t, h_{t-1}, c_{t-1}] + b_i),   (3)
f_t = σ(W_f[e_t, h_{t-1}, c_{t-1}] + b_f),   (4)
o_t = σ(W_o[e_t, h_{t-1}, c_{t-1}] + b_o),   (5)
c_t = f_t · c_{t-1} + i_t · tanh(W_c[e_t, h_{t-1}] + b_c),   (6)
h_t = o_t · tanh(c_t),   (7)

where h_t, c_t, i_t, f_t, o_t represent the hidden state, cell state, input gate, forget gate and output gate respectively. This layer is important for capturing coarse-grained dependency such as potential relations between skills, so we just learn a hidden state h_t ∈ R^d as the current student state, which contains a coarse-grained mastery state of skills.

In a student's exercise history, questions of relevant skills are very likely scattered over a long history. From another point of view, consecutive exercises may not follow a coherent topic. These phenomena raise challenges for LSTM sequence modeling in traditional KT methods: (i) as is well recognized, LSTM can hardly capture long-term dependencies in very long sequences, which means the current student state h_t may "forget" history exercises related to the new target question q_t; (ii) the current student state h_t weights recent exercises more, which may contain noisy information for the new target question q_t.
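One step of Eqs. (3)-(7) can be sketched in NumPy as below. Note the variant used here: the gates also see the previous cell state c_{t-1} (a peephole-style concatenation), while the candidate cell in Eq. (6) sees only [e_t, h_{t-1}], matching the equations above. Parameter shapes are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(e_t, h_prev, c_prev, params):
    """One recurrence step following Eqs. (3)-(7)."""
    Wi, bi, Wf, bf, Wo, bo, Wc, bc = params
    z = np.concatenate([e_t, h_prev, c_prev])        # [e_t, h_{t-1}, c_{t-1}]
    i_t = sigmoid(Wi @ z + bi)                       # input gate,  Eq. (3)
    f_t = sigmoid(Wf @ z + bf)                       # forget gate, Eq. (4)
    o_t = sigmoid(Wo @ z + bo)                       # output gate, Eq. (5)
    zc = np.concatenate([e_t, h_prev])               # [e_t, h_{t-1}]
    c_t = f_t * c_prev + i_t * np.tanh(Wc @ zc + bc) # cell state,  Eq. (6)
    h_t = o_t * np.tanh(c_t)                         # hidden state, Eq. (7)
    return h_t, c_t

# Tiny shape check with random parameters (d = 3 hidden, d_e = 2 input).
rng = np.random.default_rng(0)
d, de = 3, 2
params = (rng.normal(size=(d, de + 2 * d)), np.zeros(d),
          rng.normal(size=(d, de + 2 * d)), np.zeros(d),
          rng.normal(size=(d, de + 2 * d)), np.zeros(d),
          rng.normal(size=(d, de + d)), np.zeros(d))
h, c = lstm_step(rng.normal(size=de), np.zeros(d), np.zeros(d), params)
```

Iterating this step over e_1, ..., e_{t-1} yields the sequence of student states h_1, ..., h_{t-1} consumed by the recap and interaction modules.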
When a student answers a new question, he/she may quickly recall similar questions he/she has done before to help understand the new question. Inspired by this behavior, we propose to select relevant history exercises (question-answer pairs) {e_i | i ∈ [1, ..., t-1]} to better represent the student's ability on the specific question q_t; we call this the history recap module.

We develop two methods to find relevant history exercises. The first one is hard selection, i.e., we only consider the exercises sharing the same skills with the new question:

I_e = {e_i | N_{q_i} = N_{q_t}, i ∈ [1, ..., t-1]}.   (8)

The other method is soft selection, i.e., we learn the relevance between the target question and history states through an attention network, and choose the top-k states with the highest attention scores:

I_e = {e_i | R_{i,t} ≤ k, V_{i,t} ≥ v, i ∈ [1, ..., t-1]},   (9)

where R_{i,t} is the rank of the attention value V_{i,t} = f(q_i, q_t) (e.g., cosine similarity) and v is a lower similarity bound used to filter out less relevant exercises.

Previous KT methods predict a student's performance mainly according to the interaction between the student state h_t and the question representation q_t, i.e., ⟨h_t, q_t⟩. We generalize the interaction in the following aspects: (i) we use ⟨h_t, q̃_t⟩ to represent the student's mastery degree of question q_t and ⟨h_t, s̃_j⟩ to represent the student's mastery degree of the corresponding skill s_j ∈ N_{q_t}; (ii) we generalize the interaction on the current student state to history exercises, which reflects the relevant history mastery, i.e., ⟨e_i, q̃_t⟩ and ⟨e_i, s̃_j⟩ for e_i ∈ I_e, which is equivalent to letting the student answer the target question at history time steps.
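The two selection rules of Eqs. (8)-(9) can be sketched as index selectors. This is a hedged illustration: the function names are ours, and cosine similarity over question embeddings stands in for the attention function f(q_i, q_t), as the equations allow.

```python
import numpy as np

def hard_select(skill_sets, t):
    """Eq. (8): history indices whose skill set N_{q_i} equals N_{q_t}."""
    return [i for i in range(t) if skill_sets[i] == skill_sets[t]]

def soft_select(Q, t, k, v):
    """Eq. (9): top-k history questions by similarity to q_t, keeping
    only those whose value V_{i,t} is at least the lower bound v.

    Q: (t+1, d) question embeddings, row t being the target q_t.
    """
    q_t = Q[t]
    sims = Q[:t] @ q_t / (np.linalg.norm(Q[:t], axis=1)
                          * np.linalg.norm(q_t) + 1e-12)  # V_{i,t}, cosine
    order = np.argsort(-sims)[:k]                         # rank R_{i,t} ≤ k
    return [int(i) for i in order if sims[i] >= v]        # V_{i,t} ≥ v

# Toy check: history questions 0 and 2 share the target's skill set.
hard = hard_select([{1}, {2}, {1}], t=2)
soft = soft_select(np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]), t=2, k=2, v=0.5)
```

The selected indices pick out the exercise representations e_i that enter I_e for the interaction module.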
We also tried other implementations, such as using history states instead of history exercises; the results show that using history exercises performs better, as history states contain other irrelevant information.
Then we consider all the above interactions for prediction and define the generalized interaction module. In order to encourage relevant interactions and reduce noise, we use an attention network to learn bi-attention weights for all interaction terms, and compute the weighted sum as the prediction:

α_{i,j} = Softmax_{i,j}(W^T[f_i, f_j] + b),   (10)

p_t = Σ_{f_i ∈ I_e ∪ {h_t}} Σ_{f_j ∈ Ñ_{q_t} ∪ {q̃_t}} α_{i,j} g(f_i, f_j),   (11)

where p_t is the predicted probability of answering the new question correctly, Ñ_{q_t} represents the aggregated neighbor skill embeddings of q_t, and we implement the function g as an inner product. Similar to the selection of neighbors in the relation graph, we set fixed sizes for I_e and Ñ_{q_t} by sampling from these two sets.

To optimize our model, we update the parameters using gradient descent, minimizing the cross-entropy loss between the predicted probability of answering correctly and the true label of the student's answer:

L = − Σ_t (a_t log p_t + (1 − a_t) log(1 − p_t)).   (12)

Experiments

In this section, we conduct several experiments to investigate the performance of our model. We first evaluate the prediction error by comparing our model with other baselines on three public datasets. Then we make ablation studies on the GCN and the interaction module of GIKT to show their effectiveness in Section 5.5. Finally, we evaluate the design decisions of the recap module to investigate which design performs better in Section 5.6.
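Eqs. (10)-(12) can be sketched as follows. Two assumptions are ours, not stated in the equations: the attention scoring vector W and bias b are hypothetical parameters, and we squash the weighted sum of Eq. (11) with a sigmoid so the output is a valid probability for the loss of Eq. (12).

```python
import numpy as np

def predict(h_t, history_e, q_tilde, skill_tildes, W, b):
    """Generalized interaction module, Eqs. (10)-(11)."""
    F_i = list(history_e) + [h_t]        # I_e ∪ {h_t}
    F_j = list(skill_tildes) + [q_tilde] # Ñ_{q_t} ∪ {q̃_t}
    scores, gs = [], []
    for f_i in F_i:
        for f_j in F_j:
            scores.append(float(W @ np.concatenate([f_i, f_j]) + b))  # Eq. (10) logits
            gs.append(float(f_i @ f_j))  # g(f_i, f_j) as inner product
    scores = np.array(scores)
    alpha = np.exp(scores - scores.max())
    alpha /= alpha.sum()                 # softmax over all pairs -> α_{i,j}
    s = float(alpha @ np.array(gs))      # weighted sum of Eq. (11)
    return 1.0 / (1.0 + np.exp(-s))      # sigmoid to map to (0, 1) (our assumption)

def bce_loss(p, a):
    """Cross-entropy objective of Eq. (12)."""
    p = np.clip(np.asarray(p, dtype=float), 1e-7, 1 - 1e-7)
    a = np.asarray(a, dtype=float)
    return float(-np.sum(a * np.log(p) + (1 - a) * np.log(1 - p)))

# Toy check with random d = 4 embeddings.
rng = np.random.default_rng(1)
d = 4
p = predict(rng.normal(size=d),
            [rng.normal(size=d), rng.normal(size=d)],  # selected history exercises
            rng.normal(size=d),                        # q̃_t
            [rng.normal(size=d)],                      # sampled skill neighbors
            rng.normal(size=2 * d), 0.0)
```

In the full model these parameters are trained jointly with the embeddings and the LSTM by backpropagating the loss.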
Table 1.
Dataset statistics for the three datasets: ASSIST09, ASSIST12 and EdNet.
To evaluate our model, we conduct experiments on three widely-used KT datasets; detailed statistics are shown in Table 1.

– ASSIST09 was collected during the school year 2009-2010 from the ASSISTments online education platform. We conduct our experiments on the "skill-builder" dataset. Following previous work [32], we remove duplicated records and scaffolding problems from the original dataset. This dataset has 3,852 students with 123 skills, 17,737 questions and 282,619 exercises.
– ASSIST12 was collected from the same platform during the school year 2012-2013. In this dataset, each question is related to only one skill, but one skill still corresponds to several questions. After the same preprocessing as for ASSIST09, it has 2,709,436 exercises with 27,485 students, 265 skills and 53,065 questions.
– EdNet was collected by [5]. As the whole dataset is very large, we randomly select 5,000 students with 189 skills, 12,161 questions and 676,974 exercises.

Note that for each dataset we only use sequences longer than 3, as the very short sequences are meaningless. For each dataset, we split 80% of all the sequences into the training set and 20% into the test set. We use the area under the curve (AUC) as the evaluation metric.
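For reference, the AUC metric used here can be computed with the rank-sum (Mann-Whitney) formulation: the probability that a randomly chosen positive example is scored above a randomly chosen negative one. A small self-contained sketch:

```python
import numpy as np

def auc(labels, scores):
    """AUC via pairwise comparison of positive and negative scores,
    counting ties as half a win."""
    labels = np.asarray(labels)
    scores = np.asarray(scores, dtype=float)
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    diff = pos[:, None] - neg[None, :]   # all positive/negative pairs
    wins = (diff > 0).sum() + 0.5 * (diff == 0).sum()
    return float(wins / (len(pos) * len(neg)))
```

For instance, `auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])` gives 0.75, since three of the four positive/negative pairs are ranked correctly.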
– DKVMN [35] uses memory network to store knowledge state of differentconcepts respectively instead of using a single hidden state. https://sites.google.com/site/assistmentsdata/home/assistment-2009-2010-data/skill-builder-data-2009-2010 https://new.assistments.org/ https://sites.google.com/site/assistmentsdata/home/2012-13-school-data-with-affect https://github.com/riiid/ednet IKT: A Graph-based Interaction Model for Knowledge Tracing 11 – DKT-Q is a variant of DKT that we change the input of DKT from skillsto questions so that the DKT model directly uses question information forprediction. – DKT-QS is a variant of DKT that we change the input of DKT to theconcatenation of questions and skills so that the DKT model uses questionand skill information simultaneously for prediction. – GAKT is a variant of the model Exercise-Enhanced Recurrent Neural Net-work with Attention mechanism (EERNNA) [25] as EERNNA utilizes ques-tion text descriptions but we can’t acquire this information from publicdatasets. Thus we utilize our input question embeddings aggregated by GCNas input of EERNNA and follow its framework design for comparison.
We implement all the compared methods with TensorFlow; the code for our method is available online at https://github.com/Rimoku/GIKT. The embedding size of skills, questions and answers is fixed to 100, and all embedding matrices are randomly initialized and updated during training. For the LSTM, a stacked LSTM with two hidden layers is used, where the sizes of the memory cells are set to 200 and 100 respectively. In the embedding propagation module, we set the maximal number of aggregation layers to l = 3. We also apply dropout during training.

Table 2 reports the AUC results of all the compared methods. From the results we observe that our GIKT model achieves the highest performance over the three datasets, which verifies the effectiveness of our model. To be specific, GIKT achieves at least 1% higher AUC than the other baselines. Among the baseline models, traditional machine learning models like BKT and KTM perform worse than the deep learning models, which shows the effectiveness of deep learning methods. DKVMN performs slightly worse than DKT on average, as building states for each concept may lose the relation information between concepts. Besides, GAKT performs worse than our model, which indicates that exploiting high-order skill-question relations through selecting the most related exercises and performing interaction makes a difference.

On the other hand, we find that directly using questions as input may achieve superior performance to using skills. The question-level model DKT-Q has comparable or better performance than DKT on the ASSIST12 and EdNet datasets. However, DKT-Q performs worse than DKT on the ASSIST09 dataset.

Table 2.
The AUC results over three datasets. Among these models, BKT, DKT and DKVMN predict for skills; the other models predict for questions. "*" indicates a statistically significant improvement over the best baseline (two-sided t-test).

Model    ASSIST09  ASSIST12  EdNet
BKT      0.6571    0.6204    0.6027
KTM      0.7169    0.6788    0.6888
DKVMN    0.7550    0.7283    0.6967
DKT      0.7561    0.7286    0.6822
DKT-Q    0.7328    0.7621    0.7285
DKT-QS   0.7715    0.7582    0.7428
GAKT     0.7684    0.7652    0.7281
GIKT     *         *         *

The reason DKT-Q performs worse on ASSIST09 may be that the average number of attempts per question in ASSIST09 is significantly smaller than in the other two datasets, as observed in Table 1, which suggests DKT-Q suffers from the data sparsity problem. Besides, the AUC results of DKT-QS are higher than those of DKT-Q and DKT, except on ASSIST12 as it is a single-skill dataset, which indicates that considering question and skill information together improves overall performance.

To get deeper insights into the effect of each module in GIKT, we design several ablation studies. We first study the influence of the number of aggregation layers, and then we design some variants of the interaction module to investigate their effectiveness.
Effect of Embedding Propagation Layer
We vary the number of aggregation layers in GCN from 0 to 3 to show the effect of the high-order question-skill relations; the results are shown in Table 3. Specifically, when the number of layers is 0, the question and skill embeddings used in our model are indexed from the embedding matrices directly.
Table 3.
Effect of the number of aggregation layers.

Layers  ASSIST09  ASSIST12  EdNet
0       0.7843    0.7738    0.7438
1       0.7844    0.7710    0.7432
2       0.7894    0.7736    0.7466
3
From Table 3 we find that when the number of aggregation layers goes from zero to one, the performance of GIKT changes only slightly, as we already use first-order relations in the recap module and the interaction module. However, GIKT achieves better performance as the number of aggregation layers increases further, which validates the effectiveness of GCN. The results also imply that exploiting the high-order relations contained in the question-skill graph is necessary for adequate results, as the performance with more layers is better than with fewer layers.
Effect of Interaction Module
To verify the impact of the interaction module in GIKT, we conduct ablation studies on four variants of our model. The details of the four settings are listed below and their performance is shown in Table 4.

– GIKT-RHS (Remove History related exercises and Skills related to the new question): we just use the current state of the student and the new question to perform interaction for prediction.
– GIKT-RH (Remove History related exercises): we only use the current state of the student to model mastery of the new question and related skills.
– GIKT-RS (Remove Skills related to the new question): we do not model the mastery of skills related to the new question.
– GIKT-RA (Remove Attention in interaction module): we remove the attention mechanism after interaction, treating each interaction pair as equally important and averaging the prediction scores directly.
Table 4.
Effect of the interaction module.

Model     ASSIST09  ASSIST12  EdNet
GIKT-RHS  0.7814    0.7672    0.7420
GIKT-RH   0.7808    0.7703    0.7463
GIKT-RS   0.7864    0.7754    0.7428
GIKT-RA   0.7856    0.7711    0.7500
GIKT
From Table 4 we have the following findings: our GIKT model, which considers all interaction aspects, achieves the best performance, which shows the effectiveness of the interaction module. Meanwhile, from the results of GIKT-RH we find that relevant history exercises help better model the student's ability on the new question. Besides, the performance of GIKT-RS is slightly worse than GIKT, which implies that modeling the mastery degree of questions and skills simultaneously can further help prediction. Note that as ASSIST12 is a single-skill dataset, using skill information in the interaction module is redundant after selecting history exercises sharing the same skill, so for this dataset we set the number of sampled related skills to 0. Comparing the results of GIKT-RA with GIKT, the worse performance confirms the effectiveness of attention in the interaction module, which distinguishes different interaction terms for better prediction results. By computing different aspects of interaction and taking a weighted sum for the prediction, information from different levels can be fully interacted.
To evaluate the detailed design of the recap module in GIKT, we conduct experiments on several variants. The details of the settings are listed below and their performance is shown in Table 5.

– GIKT-HE (Hard select history Exercises): GIKT-HE selects the related exercises sharing the same skills.
– GIKT-SE (Soft select history Exercises): GIKT-SE selects history exercises according to their attention weights.
– GIKT-HS (Hard select hidden States): GIKT-HS selects the hidden states of the related exercises sharing the same skills.
– GIKT-SS (Soft select hidden States): GIKT-SS selects hidden states according to their attention weights. The results of GIKT reported in previous sections correspond to GIKT-SS.
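The hard/soft distinction can be sketched as follows. This is an illustrative assumption, not the paper's exact implementation: the skill-set intersection test for hard selection and the plain dot-product attention for soft selection are stand-ins for whatever matching and attention layers the model actually uses, and all names here are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 8
exercises = rng.normal(size=(5, d))       # embeddings of past exercises
skill_ids = [{0, 1}, {2}, {0}, {3}, {1}]  # skills tagged on each past exercise
target_skills = {0, 1}                    # skills of the new question
target_q = rng.normal(size=d)             # new question embedding

# Hard selection (GIKT-HE): keep only exercises sharing a skill with the target.
hard = [e for e, s in zip(exercises, skill_ids) if s & target_skills]

# Soft selection (GIKT-SE): weight all past exercises by attention to the target
# and form a single attention-weighted summary vector.
logits = exercises @ target_q
alpha = np.exp(logits) / np.exp(logits).sum()
soft = alpha @ exercises
```

GIKT-HS and GIKT-SS apply the same two strategies to the hidden states produced at each exercise step instead of the raw exercise embeddings.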
Table 5. Results of Different Recap Module Designs

Model    ASSIST09  ASSIST12  EdNet
GIKT-HE
GIKT-HS  0.7788    0.7672    0.7364
GIKT-SS  0.7743    0.7683    0.7417
From Table 5 we find that selecting history exercises performs better than selecting hidden states. This implies that a hidden state contains information irrelevant to the next question, since it encodes a student's general mastery; selecting exercises directly instead reduces noise and helps prediction. The relative performance of hard and soft selection differs across datasets: the attention mechanism achieves better selection coverage, while the hard-selection variant picks exercises via explicit constraints.
In this paper, we propose a framework that incorporates the high-order question-skill relation graph into question and skill representations for knowledge tracing. Besides, to model the student's mastery of the question and related skills, we design a recap module that selects relevant history states to represent the student's ability. We then extend a generalized interaction module to represent the student's mastery of the new question and related skills in a consistent way. To distinguish relevant interactions, we use an attention mechanism for prediction. The experimental results show that our model achieves better performance.
Addendum Version
After the deadline for submitting the camera-ready version to the proceedings, we found that our realization of "soft selection" mentioned in Section 4.4 was somewhat unreasonable, so we adopted another, more suitable realization of this strategy, which causes some differences in the results from the proceedings version. This is the newest version, and we report the revised experimental results here together with the code. As the conference proceedings had already been prepared, the PC chairs suggested that we upload an addendum version ourselves. Please refer to this version.