Knowledge Query Network for Knowledge Tracing: How Knowledge Interacts with Skills
Jinseok Lee
Hong Kong University of Science and Technology, Hong Kong
[email protected]
Dit-Yan Yeung
Hong Kong University of Science and Technology, Hong Kong
[email protected]
ABSTRACT
Knowledge Tracing (KT) is to trace the knowledge of students as they solve a sequence of problems represented by their related skills. This involves abstract concepts of students' states of knowledge and the interactions between those states and skills. Therefore, a KT model is designed to predict whether students will give correct answers and to describe such abstract concepts. However, existing methods either give relatively low prediction accuracy or fail to explain those concepts intuitively. In this paper, we propose a new model called Knowledge Query Network (KQN) to solve these problems. KQN uses neural networks to encode student learning activities into knowledge state and skill vectors, and models the interactions between the two types of vectors with the dot product. Through this, we introduce a novel concept called probabilistic skill similarity that relates the pairwise cosine and Euclidean distances between skill vectors to the odds ratios of the corresponding skills, which makes KQN interpretable and intuitive.

On four public datasets, we have carried out experiments to show the following: 1. KQN outperforms all the existing KT models based on prediction accuracy. 2. The interaction between the knowledge state and skills can be visualized for interpretation. 3. Based on probabilistic skill similarity, a skill domain can be analyzed with clustering using the distances between the skill vectors of KQN. 4. For different values of the vector space dimensionality, KQN consistently exhibits high prediction accuracy and a strong positive correlation between the distance matrices of the skill vectors.
CCS CONCEPTS
• Computing methodologies → Neural networks; • Applied computing → E-learning

KEYWORDS
Knowledge Tracing, Deep Learning, Learning Analytics, Educational Data Mining, Massive Open Online Courses, Intelligent Tutoring Systems, Learner Modeling, Knowledge Modeling, Domain Modeling
INTRODUCTION

One of the advantages of Intelligent Tutoring Systems [15] and massive open online courses [9] is that they can potentially benefit from monitoring and tracking student activities in an adaptive learning environment, where learner modeling comes into play. A learner model provides estimates for the students' state and includes two inter-connected aspects: domain modeling and knowledge modeling [17].

A domain model [17] studies the structure within a domain of problems (for example, it finds out which skill a problem is related to: "1+2=?" to "addition of integers" and "1.3+2.5=?" to "addition of decimals"). Another task of a domain model is to discover the structure of a skill domain, which can be performed either manually or automatically [17]. On the other hand, a knowledge model [1], in an abstract sense, traces students' knowledge while they are solving problems. Knowledge has been described in various forms under the name of knowledge state, which has no universal definition yet. In this paper, the term knowledge state is used for a state that can describe a student's general level of attainment of skills. Often, domain modeling and knowledge modeling are viewed as separate; however, we try to provide approaches for both, where the problem-solving records of students serve as important input features for finding the latent structure of a skill domain.

KT is a research area which analyzes student activities and studies knowledge acquisition, where the main task is to describe a student's knowledge. To elaborate, consider a student who solved a sequence of problems. The student's data is given by a temporal sequence of tuples, each of which consists of the skill that the problem at each time step is related to and the binary correctness that indicates whether or not the student gave a right answer. Calling such a tuple a student response, the KT problem is formulated as follows: 1. given the student responses up to time step t, describe the student's knowledge state at the current time step t; and 2. given the skill at the next time step t + 1, predict the correctness by modeling the interaction between the student's knowledge state at time t and the skill at time t + 1, which we will call knowledge interaction. Note that the knowledge state refers to the dynamic state of a student accumulated from the student responses, while a skill indicates a particular ability that needs to be learned by a student to solve a problem.

Therefore, the quality of a KT model is measured by its ability to describe the knowledge state of a student and by its accuracy in predicting correctness. Additionally, since modeling the knowledge interaction is to describe how a student's knowledge state responds to different skills, it is desirable if a KT model can explain the relationship between skills that can be inferred from the knowledge interaction. For example, we can say that "addition of integers" is independent of "subtraction of integers" if a model observes that a student does not learn the latter while learning the former. Similarly, they are dependent if the change in a student's knowledge state of one skill affects the knowledge state of the other. We believe that modeling such skill relationships can lead to further exploration of the latent structure of the skill domain, which is the subject of domain modeling.

However, existing KT models provide limited definitions of the knowledge state, the knowledge interaction, or both. For example, Bayesian Knowledge Tracing (BKT) [1] imposes a binary assumption on the knowledge state, which is too restrictive to be intuitive, and Deep Knowledge Tracing (DKT) [18] does not give an explanation of the knowledge interaction. In this paper, we propose a new neural network KT model called Knowledge Query Network (KQN) to generalize the knowledge state and explain the knowledge interaction more descriptively. The central idea is to use the dot product between a knowledge state vector and a skill vector to define the knowledge interaction, while leveraging neural networks to encode student responses and skills into vectors of the same dimensionality d. Additionally, we introduce a novel concept called probabilistic skill similarity, which relates the cosine and Euclidean distances between skill vectors to the odds ratios of the corresponding skills. Based on those distances, we explore the latent structure of a skill domain with cluster analysis. Lastly, we show that KQN is stable in predicting correctness and in learning skill vectors by comparing prediction accuracy and the distance matrices of the skill vectors as the vector space dimensionality is varied.

RELATED WORK

Item Response Theory (IRT) is a framework for modeling the relationship between problems and correctness [8]. In its simplest form, it uses a logistic regression model estimating student proficiency and skill difficulty. However, it assumes the proficiency to be constant and does not explain any structure over problems. To overcome those limitations, Bayesian extensions of IRT have been proposed with a hierarchical structure over items (HIRT) and temporal changes in a student's knowledge state (TIRT) [21]. Still, HIRT assumes constant student proficiency, while TIRT lacks the ability for domain analysis.

In BKT, a student's knowledge state is viewed as a set of binary latent variables, one for each skill, with two possible states, known and unknown [1]. A set of observable variables, each of which corresponds to correctness per skill, are then conditioned on the set of the binary variables. For example, let us use a running example of student Ben throughout this paper, who has the records shown in Table 1.
Accordingly, there will be two knowledge state variables and two correctness variables for skills 1 and 2, a total of four variables in two independent BKT models with input data as shown in Table 1b. The knowledge acquisition in BKT is then modeled with a Hidden Markov Model (HMM), where the knowledge interaction is controlled by a set of interpretable equations. Since BKT lacks the ability to forget and to individualize, a number of extensions have been proposed [10, 24]. Most importantly, however, BKT's independence assumption across skills is considered highly constrained and ineffective, since the model cannot leverage the whole data: Ben's history of responses on skill 1 tells us nothing about his responses on skill 2, because there are two separate models, one for each skill.
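To make the HMM formulation concrete, below is a minimal sketch of the standard BKT filtering and prediction equations; this is not code from any of the cited systems, and the parameter names (p_init, p_learn, p_guess, p_slip) follow common convention, with values chosen only for illustration.

```python
def bkt_trace(observations, p_init=0.3, p_learn=0.1, p_guess=0.2, p_slip=0.1):
    """Standard BKT filtering for a single skill.

    observations: 0/1 correctness values for one student on one skill.
    Returns the predicted probability of a correct answer at each step.
    """
    p_known = p_init
    predictions = []
    for c in observations:
        # Predict correctness from the current mastery estimate.
        p_correct = p_known * (1 - p_slip) + (1 - p_known) * p_guess
        predictions.append(p_correct)
        # Posterior of "known" given the observation (Bayes' rule).
        if c == 1:
            posterior = p_known * (1 - p_slip) / p_correct
        else:
            posterior = p_known * p_slip / (1 - p_correct)
        # Transition: the student may learn the skill after practicing it.
        p_known = posterior + (1 - posterior) * p_learn
    return predictions

# Ben's skill-1 observations from Table 1b: incorrect, then correct.
print(bkt_trace([0, 1]))  # [0.41, ~0.30]
```

Note how the model for skill 1 never sees Ben's response on skill 2, which is exactly the independence assumption criticized above.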
Learning Factors Analysis (LFA) [2] models a student's knowledge state as a set of binary variables, one for each skill. Correctness at each time step is predicted with a logistic regression model whose covariates are related to students, skills, and a summary statistic of the student responses, i.e., the number of past opportunities for each skill.

Table 1: The example student Ben's input data: the original and versions preprocessed for different KT models. Feature names are abbreviated as follows: Time to T, Skill ID to SID, Correctness to C, Number of Opportunities to NO, Number of Successes to S, and Number of Failures to F. Note that the original data is used for the neural network models. The NO, S, and F values follow from Ben's history in (a).

(a) Original        (b) BKT (one model per skill)
T  SID  C           T  SID  C      T  SID  C
1  1    0           1  1    0      1  2    0
2  2    0           2  1    1
3  1    1

(c) LFA             (d) PFA
T  SID  NO          T  SID  S  F
1  1    0           1  1    0  0
2  2    0           2  2    0  0
3  1    1           3  1    0  1

Performance Factors Analysis (PFA) [16] extends LFA by separating 'the number of opportunities per skill' into 'the number of correct answers per skill' and 'the number of incorrect answers per skill', keeping the rest the same; e.g., in Ben's case, the input data are preprocessed as shown in Table 1c and Table 1d.

Since an estimate for correctness is explained with student covariates and skill-specific covariates without variable interactions in LFA and PFA, the two models do not describe how a student's knowledge state with respect to one skill is affected by that with respect to another skill; instead, a student parameter, also called student proficiency, is the only factor that relates the knowledge state across skills. Moreover, a skill is explained by the regression coefficients of its skill-specific covariates, from which we cannot tell the structure of a skill domain directly.

As the first neural network KT model, DKT [18], given the student's responses as input, encodes a student's knowledge state as a summary vector calculated by a recurrent neural network (RNN), a renowned neural network technique for modeling temporal data. However, DKT does not define the knowledge interaction directly. In detail, a student response at each time step is formed as a tuple (q_t, a_t), where q_t and a_t refer to the problem ID and correctness, respectively. For example, Ben's original data in Table 1a is expressed as (1, 0) at t = 1.
The input is then passed to an RNN layer, where Long Short-Term Memory (LSTM) [6] was used in the original paper [18], and its output hidden state is passed to a logistic function after an affine transformation, i.e., y_t = σ(W · h_t + b), where h_t is the output hidden state of the LSTM layer and σ is an element-wise logistic function. Finally, the k-th element of y_t is used to predict correctness at the next time step given that the next problem ID is k. Despite its superior prediction performance over the existing classical methods, DKT has been criticized by other papers [10, 21] for its lack of practicality in educational applications. This is because the output hidden state h_t is inherently hard to interpret as the knowledge state, and the model does not give insights into the knowledge interaction.

To make a neural network KT model more interpretable, Dynamic Key-Value Memory Networks (DKVMN) [26] have been introduced by extending memory-augmented neural networks (MANNs) [5, 20]. Like DKT, DKVMN uses the original data as input. DKVMN accumulates temporal information from student responses into a dynamic matrix, the value memory, while embedding skills with a static matrix, the key memory. DKVMN defines the interaction between value vectors and key vectors using attention weights calculated with cosine similarity, and predicts correctness by passing a concatenation of a weighted sum of value vectors and an embedded skill vector to a multilayer perceptron (MLP). Even though the prediction accuracy of DKVMN has proven to be higher than that of DKT, the use of an MLP for the output of the model still makes it hard to explain the knowledge interaction.

KNOWLEDGE QUERY NETWORK

To generalize the knowledge state while describing the knowledge interaction intuitively, we suggest a model that projects a student's knowledge and skills into the same vector space of embedding dimensionality d. An important constraint is to keep the skill vectors on the d-dimensional positive orthant unit sphere, i.e., they have unit length and non-negative coordinates. The logit of a probability estimate for correctness is given by the dot product between the current knowledge state vector and the skill vector of the next problem. This is only possible because both knowledge state and skill vectors lie in the same vector space.

Now, we illustrate why skill vectors are set to unit length and constrained to the positive orthant: the former makes the logit depend only on the direction of the related skill vector, while the latter ensures that learning on one skill does not decrease learning on another. For example, in a 2-D vector space, suppose that Ben has knowledge state KS_2 = (1, 1)^T at t = 2, with s_1 = (1, 0)^T and s_2 = (0, 1)^T for skills 1 and 2, and s_3 = (−1, 0)^T for a third "imaginary" skill, as shown in Figure 1. At t = 3, his knowledge state may change to KS_3 = (2, 1)^T as he answers correctly for skill 1. Then the logit with respect to s_1 increases from KS_2 · s_1 = 1 to KS_3 · s_1 = 2, and the logit with respect to s_2 stays at KS_2 · s_2 = KS_3 · s_2 = 1, but the logit with respect to s_3 decreases from KS_2 · s_3 = −1 to KS_3 · s_3 = −2, which would then decrease the probability estimate for the correctness of skill 3. This is counter-intuitive since the datasets we are dealing with have sets of skills within the same area, e.g., mathematics.
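As a quick numeric check of this example, the NumPy snippet below reproduces the three logits before and after Ben's correct answer on skill 1, using the vector values from Figure 1.

```python
import numpy as np

s1 = np.array([1.0, 0.0])   # skill 1
s2 = np.array([0.0, 1.0])   # skill 2
s3 = np.array([-1.0, 0.0])  # "imaginary" skill outside the positive orthant

ks2 = np.array([1.0, 1.0])  # Ben's knowledge state at t = 2
ks3 = np.array([2.0, 1.0])  # Ben's knowledge state at t = 3

for name, s in [("skill 1", s1), ("skill 2", s2), ("skill 3", s3)]:
    print(name, ks2 @ s, "->", ks3 @ s)
# skill 1: 1.0 -> 2.0 (increases), skill 2: 1.0 -> 1.0 (unchanged),
# skill 3: -1.0 -> -2.0 (decreases; hence the positive-orthant constraint)
```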
Problem Formulation

Let the skill of a problem and the correctness at time t be e_t ∈ {1, · · · , N} and c_t ∈ {0, 1}, respectively. At each time step t = 1, · · · , T − 1, the correctness c_{t+1} is predicted given the skill e_{t+1}. The objective of a KT model is then to find the parameter of the Bernoulli distribution at each time step:

p_{t+1} = P(c_{t+1} = 1 | e_{1:t+1}, c_{1:t}),
c_{t+1} ∼ Bernoulli(p_{t+1}).

Figure 1: Illustration of skill vectors and Ben's knowledge state vectors at t = 2 and t = 3.

Figure 2: KQN architecture drawn at time t.

Model Architecture

KQN consists of three components: a knowledge encoder, a skill encoder, and the knowledge state query. The knowledge encoder converts the temporal information from student responses into a knowledge state vector, while the skill encoder embeds a skill into a skill vector. The two vectors are then passed to the knowledge state query to produce the prediction of correctness given the current knowledge state and the provided skill. The network architecture of KQN is shown in Figure 2.
The model takes two inputs: a student response at the current time step and a skill at the next time step. Each student response is one-hot encoded and given as input x_t to an RNN layer as follows:

x_t ∈ {0, 1}^{2N}, x_t^k = c_t, x_t^{k+N} = 1 − c_t,

where N is the number of skills and k is the skill at time step t. Similarly, the skill k′ at time t + 1 is one-hot encoded into e_{t+1} ∈ {0, 1}^N, where the k′-th element is 1 and all other elements are 0. In Ben's case, his response at t = 1 and the skill at t = 2 are encoded as x_1 = (0, 0, 1, 0)^T and e_2 = (0, 1)^T, respectively.

Let the knowledge state vector KS_t and the embedded skill vector s_{t+1}, both d-dimensional, be the two vectors produced by the knowledge encoder and the skill encoder, respectively. Then the knowledge interaction is defined by the inner product of the two vectors. The logit y_{t+1} and the corresponding prediction probability p_{t+1} are calculated as follows:

y_{t+1} = KS_t · s_{t+1}, p_{t+1} = σ(y_{t+1}),

where σ(u) = 1 / (1 + exp(−u)) is the logistic function and · refers to the inner product. In this way, the knowledge interaction is well-defined for the following reasons:

• If two skills are independent, their corresponding vectors are orthogonal to each other. Accordingly, an increase or a decrease in the logit with respect to one vector does not affect the logit with respect to the other vector.
• If two skills are similar from the probabilistic perspective, then an increase in the logit with respect to one vector leads to an increase in the logit with respect to the other vector, and vice versa.

Note that this definition of the knowledge interaction implies that there can be at most d mutually independent skills. Whether KQN learns the pairwise relationships between skills, represented by pairwise distances, for different values of d was tested in the experiments and is shown in later sections.

As a result, KQN approximates the parameter of the Bernoulli distribution at each time step as follows:

P(c_{t+1} = 1 | e_{1:t+1}, c_{1:t}) = P(c_{t+1} = 1 | x_{1:t}, e_{t+1}) ≈ σ(y_{t+1}) = σ(KS_t · s_{t+1}).

Knowledge State Encoder. Given the input x_{1:t}, the knowledge state encoder produces a knowledge state vector KS_t with the following equations:

h_t = RNN(x_{1:t}), KS_t = W_{h,KS} · h_t + b_{h,KS},

where W_{h,KS} ∈ R^{d × H_RNN}, b_{h,KS} ∈ R^d, RNN is an RNN layer, and H_RNN is the state size of the RNN.
In KQN, LSTM [6] and Gated Recurrent Units (GRU) [3] have been tested as RNN variants. Also, to avoid overfitting, dropout regularization [19, 25] has been applied to the RNN output layer, as was done for DKT in previous work [22].
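The sketch below illustrates the response encoding and the knowledge state query in NumPy. The RNN is stubbed with a random hidden state since only the encoding and the dot-product query are being shown; the helper names are ours, not from the paper's implementation.

```python
import numpy as np

def encode_response(skill, correct, num_skills):
    """One-hot response encoding: x_t[k] = c_t and x_t[k + N] = 1 - c_t."""
    x = np.zeros(2 * num_skills)
    x[skill - 1] = correct                # skills are 1-indexed in the paper
    x[skill - 1 + num_skills] = 1 - correct
    return x

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

N, d, H_RNN = 2, 4, 8                     # skills, embedding dim, RNN state size
print(encode_response(skill=1, correct=0, num_skills=N))  # [0. 0. 1. 0.], Ben at t = 1

rng = np.random.default_rng(0)
h_t = rng.normal(size=H_RNN)              # stand-in for the RNN hidden state
W, b = rng.normal(size=(d, H_RNN)), np.zeros(d)
ks_t = W @ h_t + b                        # KS_t = W_{h,KS} h_t + b_{h,KS}

s_next = np.ones(d) / np.sqrt(d)          # a unit-length skill vector on U^d
print(sigmoid(ks_t @ s_next))             # p_{t+1}: probability of a correct answer
```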
Skill Encoder. The skill encoder embeds the input e_{t+1} into s_{t+1} with an MLP as follows:

o_{t+1} = ReLU(W_2 · ReLU(W_1 · e_{t+1} + b_1) + b_2),
s_{t+1} = L2-normalize(o_{t+1}),
s_{t+1} ∈ U^d = {v ∈ R^d : ||v|| = 1, v_i ≥ 0 for i = 1, · · · , d},

where W_1 ∈ R^{H_MLP × N}, W_2 ∈ R^{d × H_MLP}, b_1 ∈ R^{H_MLP}, b_2 ∈ R^d, and ReLU is an element-wise activation with ReLU(u) = max(0, u) [14]. Note that s_{t+1} is thereby constrained to the d-dimensional positive orthant unit sphere, which we call U^d for the rest of this paper for notational convenience.
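A minimal sketch of the skill encoder under random, untrained weights. It checks the membership in U^d that the ReLU followed by L2 normalization guarantees, as well as the factor-of-2 identity between the squared Euclidean and cosine distances used in the next section.

```python
import numpy as np

def relu(u):
    return np.maximum(0.0, u)

def skill_encoder(e, W1, b1, W2, b2):
    """MLP embedding followed by L2 normalization onto U^d."""
    o = relu(W2 @ relu(W1 @ e + b1) + b2)
    norm = np.linalg.norm(o)
    assert norm > 0, "degenerate all-zero embedding; a trained encoder avoids this"
    return o / norm

rng = np.random.default_rng(1)
N, H_MLP, d = 5, 8, 4
W1, b1 = rng.normal(size=(H_MLP, N)), np.zeros(H_MLP)
W2, b2 = rng.normal(size=(d, H_MLP)), np.zeros(d)

s1 = skill_encoder(np.eye(N)[0], W1, b1, W2, b2)  # one-hot skill inputs
s2 = skill_encoder(np.eye(N)[1], W1, b1, W2, b2)

assert np.isclose(np.linalg.norm(s1), 1.0) and (s1 >= 0).all()  # s1 lies in U^d
# Squared Euclidean distance equals twice the cosine distance on U^d.
assert np.isclose(np.sum((s1 - s2) ** 2), 2 * (1 - s1 @ s2))
```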
Optimization. At each time step, the cross-entropy error between the probability estimate and the target correctness is calculated, and the error terms for t = 1, · · · , T − 1 are summed:

E(θ_model | c_{t+1}, p_{t+1}) = −[c_{t+1} log p_{t+1} + (1 − c_{t+1}) log(1 − p_{t+1})],
E_total(θ_model) = Σ_{t=1}^{T−1} E(θ_model | c_{t+1}, p_{t+1}).

The gradients of the total error with respect to the model parameters θ_model are computed with back-propagation and used by an optimization method.
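For completeness, a one-function sketch of the total cross-entropy objective over one student's sequence (the probability values below are placeholders):

```python
import numpy as np

def total_cross_entropy(probs, targets):
    """Sum of per-step binary cross-entropy terms E over t = 1, ..., T-1.

    probs:   predicted probabilities p_{t+1}.
    targets: observed correctness c_{t+1} in {0, 1}.
    """
    p, c = np.asarray(probs), np.asarray(targets)
    return float(-np.sum(c * np.log(p) + (1 - c) * np.log(1 - p)))

print(total_cross_entropy([0.41, 0.30], [0, 1]))  # ~1.73
```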
PROBABILISTIC SKILL SIMILARITY

Based on the architecture of KQN, we hereby introduce a novel concept called probabilistic skill similarity to measure the distance between skills from the probabilistic perspective.

For any two skill vectors s_1, s_2 learned by KQN, the cosine distance differs from the squared Euclidean distance by only a factor of 2, since both vectors are constrained to U^d:

d_Euclidean(s_1, s_2)^2 = ||s_1 − s_2||^2 = 2(1 − s_1 · s_2) = 2 d_cosine(s_1, s_2), ∀ s_1, s_2 ∈ U^d.

Next, we show how the pairwise distance between two skill vectors is related to the logarithm of their odds ratio. Given a knowledge state vector KS ∈ R^d and a skill vector s ∈ U^d, the probability estimate p for correctness and the corresponding odds o are:

p = P(c = 1 | KS, s), o = p / (1 − p).

Then for any two skill vectors s_1, s_2 ∈ U^d, the logarithm of the odds ratio is characterized by their distance as follows:

log(o_1 / o_2) = log o_1 − log o_2
             = y_1 − y_2
             = KS · s_1 − KS · s_2
             = KS · (s_1 − s_2)
             = (KS · Δ_{1,2}) × ||s_1 − s_2||
             = (KS · Δ_{1,2}) × d_Euclidean(s_1, s_2)
             = (KS · Δ_{1,2}) × sqrt(2 d_cosine(s_1, s_2)),

where Δ_{1,2} = (s_1 − s_2) / ||s_1 − s_2||. Therefore, we say that two skills are probabilistically similar if they are 'close' enough based on the distance between their corresponding vectors.

EXPERIMENTS

KQN has been tested on four tasks: correctness prediction, knowledge interaction visualization, skill domain analysis, and sensitivity analysis of the dimensionality of the vector space. For correctness prediction, the performance of KQN was compared to that of other models on four public datasets: one synthetic and three real-world ones, all available online. Then, for a sample student, the knowledge interaction was visualized with a heat map to demonstrate the knowledge state query with respect to different skills. Next, the skill domain was explored with clustering based on skill distances. Finally, the pairwise distances of the skill vectors at one dimensionality were compared to those at other dimensionalities to conduct the sensitivity analysis of the vector embedding dimensionality.

Datasets

The following four datasets have been used to evaluate the models: ASSISTments 2009-2010, ASSISTments 2015, OLI Engineering Statics 2011, and Synthetic-5. For a fair comparison on the correctness prediction task, we used the versions provided with the DKVMN source code available online (https://github.com/jennyzhang0215/DKVMN). The statistics of the datasets are shown in Table 2.

ASSISTments 2009-2010. This dataset was collected by the ASSISTments online tutoring system [4] (https://sites.google.com/site/assistmentsdata/home/assistment-2009-2010-data/skill-builder-data-2009-2010). It was gathered from skill builder problem sets, where students work on similar questions until they achieve mastery, a certain level of performance. During preprocessing, records without skill names were discarded. After a problem with duplicate records was reported in [22], the dataset was corrected by the ASSISTments system; therefore, results reported by a number of earlier papers are not compared in this paper.

ASSISTments 2015. Compared to ASSISTments 2009-2010, which has 110 distinct skill tags, this dataset (https://sites.google.com/site/assistmentsdata/home/2015-assistments-skill-builder-data) contains 100 distinct ones with more than twice the number of student responses. Data records with invalid correctness values, i.e., values not in {0, 1}, have been removed.

OLI Engineering Statics 2011. This dataset was gathered from a college-level statics course in Fall 2011 [12] (https://pslcdatashop.web.cmu.edu/DatasetInfo?datasetId=507). The concatenation of a problem name and a step name was labeled as a skill tag. Note that the number of skills is much larger than in the other datasets.
Synthetic-5. This dataset was originally generated by the authors of the DKT paper [18] (https://github.com/chrispiech/DeepKnowledgeTracing). Each student response was generated using skill difficulty, student proficiency, and the probability of a random guess set to a constant, based on IRT [8]. The dataset consists of a number of sub-datasets; those with five concepts, from version 0 to version 19, have been used, i.e., a total of 20 sub-datasets.

Table 2: Statistics for all the datasets. Dataset names are abbreviated. 'Size' and 'Max Steps' refer to the total number of student responses and the maximum number of time steps, respectively.

Dataset      Students  Skills  Size     Max Steps
ASSIST2009   4,151     110     325,637  1,261
ASSIST2015   19,840    100     683,801  618
Statics2011  333       1,223   189,297  1,181
Synthetic-5  4,000     50      200,000  50
Implementation Details

All the program code for the implemented KQN and DKT was written in TensorFlow 1.5. For the data splits of each dataset, we used the same splits used by DKVMN for a fair comparison of prediction accuracy. All sequences of student responses were preserved in their original length without truncation. Each dataset except Synthetic-5 was split into training, validation, and test sets with ratios of 8:2 for training to validation and 7:3 for (training+validation) to test. For Synthetic-5, the corresponding ratios were 8:2 and 5:5. Hyperparameters were grid-searched with holdout validation and early stopping; no early stopping was used in the testing phase. The number of epochs was set to 50 and 200 during the validation and testing phases, respectively. During the testing phase, KQN was run five times, and the mean and standard deviation of the performance metric are reported.

The hyperparameters of KQN and their candidate values were as follows:
• Type of the RNN layer in the knowledge state encoder: LSTM, GRU.
• Hidden state size H_RNN of the chosen RNN layer: 32, 64, 128.
• Hidden state size H_MLP of the MLP layer: 32, 64, 128.
• Dimensionality d of the vector space in which the knowledge state and skills are embedded: 32, 64, 128.

The retention rate of 0.6 for the RNN dropout and the batch size of 128 were set as defaults. The Adam optimization method [11] was used to minimize the total error E_total.

Additionally, DKT was run with the skill vectors learned by KQN to evaluate their quality for the correctness prediction task. Specifically, at each time step, the input x_t given to DKT was set to the concatenation of two vectors: the one-hot encoded correctness c_t and the learned skill vector s_t ∈ R^d corresponding to the original skill e_t. We denote DKT with this setup as DKT+KQN. The hyperparameters of DKT+KQN were searched in the same way as for KQN, with the same RNN dropout rate and batch size. LSTM was used for the RNN layer following past works [18, 22].

Knowledge Interaction Visualization. Throughout a student's responses, prediction estimates for correctness with respect to the skills the student solved were calculated with the knowledge state query followed by the logistic function. A sample from the test set of ASSISTments 2009-2010 was used for this task. The estimates were then visualized with a heat map to evaluate how they changed as the student solved the problems.
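The heat map values are simply the knowledge state query evaluated for every (skill, time step) pair. A sketch with placeholder inputs, assuming the trained encoders have already produced the two arrays:

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def query_heatmap(knowledge_states, skill_vectors):
    """Probability grid as in Figure 3: one row per skill, one column per time step.

    knowledge_states: (T, d) array of KS_t from the knowledge encoder.
    skill_vectors:    (K, d) array of embedded skills the student attempted.
    """
    return 100.0 * sigmoid(skill_vectors @ knowledge_states.T)  # in percent

rng = np.random.default_rng(2)
T, K, d = 6, 3, 4            # time steps, queried skills, embedding dim
heat = query_heatmap(rng.normal(size=(T, d)), rng.normal(size=(K, d)))
print(np.round(heat, 1))     # annotate this grid to obtain the heat map
```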
Skill Domain Analysis. Skill distances have been used for clustering on the four datasets. To decide which linkage and type of distance measure to use, we first performed flat clustering on Synthetic-5 with the number of clusters fixed to 5, the ground-truth number of clusters. The quality of the clustering with respect to the original cluster labels was measured with the Adjusted Rand Index (ARI) [7], which has a maximum value of 1 when the clusters match the original partitioning perfectly and an expected value of 0 under random partitioning. Since there are 20 sub-datasets for Synthetic-5, the 20 ARI scores were averaged. The linkage between clusters and the type of distance measure were treated as hyperparameters as follows (a clustering sketch follows at the end of this subsection):

• Cluster linkage: {average, centroid, complete, median, single, ward, weighted}
• Type of distance measure: {cosine, Euclidean}

After deciding which linkage and distance measure to use, the number of clusters n was explored. First, for different values of n = 5, · · · , 14, the skills of ASSISTments 2009-2010 were clustered based on the distances computed from the skill vectors learned by KQN. Then DKT was used to quantify the quality of those clusters as follows. First, all the original skill IDs were substituted with the assigned cluster labels, with the data splits kept the same as those for the correctness prediction task. Then DKT was run five times, and the average and the standard deviation of its test AUCs are reported.
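To make the procedure concrete, here is the clustering sketch referenced above. It uses SciPy's hierarchical clustering, whose method names match the linkage candidates listed, and scikit-learn's ARI; the toy vectors merely stand in for KQN's learned skill vectors.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist
from sklearn.metrics import adjusted_rand_score

def cluster_skills(skill_vectors, n_clusters, method="average", metric="euclidean"):
    """Flat clusters from hierarchical clustering on skill-vector distances."""
    dists = pdist(skill_vectors, metric=metric)  # condensed distance matrix
    tree = linkage(dists, method=method)
    return fcluster(tree, t=n_clusters, criterion="maxclust")

# Toy stand-in for Synthetic-5: 50 skill vectors drawn around 5 concepts.
rng = np.random.default_rng(3)
true_labels = np.repeat(np.arange(5), 10)
centers = rng.normal(size=(5, 8))
vectors = centers[true_labels] + 0.1 * rng.normal(size=(50, 8))

labels = cluster_skills(vectors, n_clusters=5)
print(adjusted_rand_score(true_labels, labels))  # close to 1 on this toy data
```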
Sensitivity Analysis of the Vector Space Dimensionality. Let d be the dimensionality of the vector space in KQN and d_opt be the optimal value of d obtained in the correctness prediction task. KQN was trained on the four datasets with d varied to 0.5 d_opt and 2 d_opt, with the data splits kept the same and the other hyperparameters set to their optimal values. To analyze the effect of d on the correctness prediction task and on the learning of the skill vectors, prediction accuracy was reported, and the three distance matrices of the skill vectors for d = d_opt, 0.5 d_opt, and 2 d_opt were compared.

RESULTS

Correctness Prediction. Prediction accuracy was measured with the Area Under the ROC Curve (AUC) during the testing phase. Note that the AUC of a model that guesses 0 or 1 randomly should be 50%. As representatives of non-neural-network models, BKT, IRT, and their variants were compared with KQN, while DKT and DKVMN were compared as the state-of-the-art neural network models. The AUC results for those models are cited from other papers as follows: those of IRT and its extensions from [21], of BKT and its variants from [10, 22], and of DKT and DKVMN from [26].

Test AUCs for all the datasets are shown in Table 3. Overall, KQN performed better than all the previously available KT models and showed a more stable performance with the lowest standard deviation values.

Table 3: Test AUCs (%). Standard deviations and several baseline entries are not available here and are marked with '–'.

Dataset      IRT+   BKT+  DKVMN  DKT+KQN  KQN
ASSIST2009   77.40  –     81.57  82.05    82.32
ASSIST2015   –      –     72.68  73.41    73.40
Statics2011  –      –     –      80.27    83.20
Synthetic-5  –      80    82.73  –        82.81

For ASSISTments 2009-2010, the test AUC of KQN was 82.32%, beating the previous highest value by 0.75%. DKT+KQN showed an AUC of 82.05%, not only higher than the original DKT performance but also higher than all the others. Surprisingly, for ASSISTments 2015, DKT+KQN achieved the highest test AUC of 73.41%, even slightly higher than KQN. Both KQN and DKT+KQN performed better than all the previous results, which is promising in that KQN should be learning useful skill vectors that are transferable to other models and applications. For OLI Engineering Statics 2011, KQN achieved the highest value of 83.20%, higher than the previous highest by 0.34%. DKT+KQN showed a performance comparable to the vanilla DKT, with a slightly higher test AUC of 80.27% and a standard deviation of 0.22%. Lastly, also for Synthetic-5, KQN performed the best with the highest average of 82.81% and the lowest standard deviation of 0.01%. Interestingly, the standard deviation of DKT+KQN was much lower than that of the original DKT, showing that the learned skill vectors should be contributing to the stable prediction of the model.

In summary, KQN showed the best performance on the correctness prediction task compared to all the previous models, while it achieved the second-best result on ASSISTments 2015, where DKT+KQN had the highest score. In addition to the best mean test AUC scores, our model had much lower standard deviation values compared to other models. DKT+KQN also had low standard deviation values for all the datasets except OLI Engineering Statics 2011. Therefore, we speculate that KQN is able to produce stable prediction estimates due to its ability to learn a meaningful latent structure of the skill vectors.

Table 4: 14 flat clusters of ASSISTments 2009-2010 skills based on the average linkage method and the Euclidean distance. In each cluster, skills are sorted in ascending order of skill ID, and different clusters are separated by dashed lines. [The table lists all 110 skill IDs and names, e.g., 8 Mean, 41 Finding Percents, 100 Slope, and 110 Quadratic Formula to Solve Quadratic Equation; the cluster boundaries are not reproduced here.]
Knowledge Interaction Visualization. For a sample student from ASSISTments 2009-2010, prediction estimates for correctness in percentage are visualized in Figure 3 through the knowledge state query with respect to particular skills. On the x-axis, student responses with skill IDs and correctness values as tuples are marked, while on the y-axis, all the skills that the student solved are sorted in ascending order from the top. The corresponding skill names can be found in Table 4.

Figure 3: Visualization of knowledge interaction by querying the knowledge state with respect to particular skills. On the x-axis, student responses are labeled, while on the y-axis, all the skills contained in the responses are marked. Each column corresponds to one time step t, which increases along the x-axis. Prediction estimates for correctness in percentage (%) are annotated in the grid. It is better viewed in color.

Changes in the probability estimates are mostly intuitive. For example, at t = 2, after the student solved a problem with skill 52 correctly, the probability estimate for skill 52 increased from 72% to 82%. However, some changes are counter-intuitive. For example, at t = 3, as the student solved a problem with skill 92 incorrectly, the corresponding estimate increased from 23% to 24%, even though the change was only 1%. This problem has also been reported for DKT in previous work and is still an open problem [23].

Skill Domain Analysis. Average ARI scores for clustering with different linkage methods and distance measures are reported in Table 5. ARI was highest when the linkage was set to average and the distance measure to Euclidean. Not surprisingly, the average ARI scores did not differ much between the cosine and Euclidean distance measures.

Table 5: Average ARI scores for different linkage methods and distance measures on Synthetic-5.

Linkage   Distance   ARI
average   cosine     0.3180
average   Euclidean  0.3266
centroid  cosine     0.0373
centroid  Euclidean  0.0143
complete  cosine     0.2898
complete  Euclidean  0.2898
median    cosine     0.0368
median    Euclidean  0.0071
single    cosine     0.0703
single    Euclidean  0.0703
ward      cosine     0.3201
ward      Euclidean  0.3234
weighted  cosine     0.2996
weighted  Euclidean  0.3020

After clustering the skills of ASSISTments 2009-2010 with the linkage and the distance measure set to average and Euclidean, respectively, and substituting the original skill IDs with the assigned cluster labels, DKT was run five times.
The test AUCs of DKT are reported in Table 6. They increased gradually as the number of clusters changed from 5 to 14; the lowest test AUC was 79.77% when n = 5.

Table 6: Test AUCs (%) of DKT on ASSISTments 2009-2010 after replacing skill IDs with cluster labels assigned by flat clustering. The average linkage and the Euclidean distance were used. [Only the endpoint rows are available here: n = 5 → 79.77%; n = 14 → 80.64%.]

Sensitivity Analysis of the Vector Space Dimensionality. For the four datasets, the test AUCs of KQN with embedding dimensionality d = d_opt, 0.5 d_opt, and 2 d_opt are shown in Table 8, where d_opt refers to the optimal value chosen from the holdout validation for the correctness prediction task. Only small differences in prediction accuracy were observed as the value of d was varied.

For each pair in {d_opt, 0.5 d_opt, 2 d_opt}, the average difference ξ between the pairwise distances of the skill vectors of two different dimensionalities was calculated as follows:

∀ d_1, d_2 ∈ {d_opt, 0.5 d_opt, 2 d_opt}, d_1 ≠ d_2:
ξ_{d_1,d_2} = Σ_{i>j} |pdist_{d_1}(s_i, s_j) − pdist_{d_2}(s_i, s_j)| / (N(N−1)/2),

where pdist_d(s_i, s_j) refers to the pairwise distance between skill vectors s_i and s_j, and N is the number of skills. ξ is then compared to the average pairwise distance η:

η_d = Σ_{i>j} pdist_d(s_i, s_j) / (N(N−1)/2).

In Table 7, the lowest values of ξ are indicated in bold. As can be seen, ξ_{d_opt, 2d_opt} is always lower than ξ_{d_opt, 0.5 d_opt}. From this, it can be inferred that KQN learned the skill relationships better when d was set high enough, since d controls the maximum number of mutually independent skill vectors. Also, the values of ξ were relatively low compared to the corresponding values of η. For example, ξ_{d_opt, 2d_opt} was only 0.07 when the Euclidean distance was used for ASSISTments 2009-2010, while the corresponding η_{d_opt} and η_{2d_opt} were 1.22 and 1.23, respectively.

To further evaluate the distance matrices, we performed Mantel tests [13], which measure the similarity between two distance matrices with a correlation coefficient ρ and a p-value. ρ has the same range as ordinary correlation coefficients, while the p-value indicates statistical significance. The Pearson correlation and a permutation number of 999 were used for the Mantel tests.
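Below is a sketch of ξ, η, and a simple permutation-based Mantel test under the definitions above; pdist returns the condensed vector of all N(N−1)/2 pairwise distances. This is our re-implementation for illustration, not the code behind the reported numbers.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def xi_eta(skills_a, skills_b, metric="euclidean"):
    """Average difference (xi) and average pairwise distances (eta_a, eta_b)."""
    da, db = pdist(skills_a, metric=metric), pdist(skills_b, metric=metric)
    return np.mean(np.abs(da - db)), da.mean(), db.mean()

def mantel(square_a, square_b, permutations=999, seed=0):
    """Pearson correlation between two square distance matrices with a
    permutation p-value (rows/columns of one matrix permuted jointly)."""
    rng = np.random.default_rng(seed)
    iu = np.triu_indices_from(square_a, k=1)
    rho = np.corrcoef(square_a[iu], square_b[iu])[0, 1]
    count = 0
    for _ in range(permutations):
        p = rng.permutation(square_a.shape[0])
        count += np.corrcoef(square_a[np.ix_(p, p)][iu], square_b[iu])[0, 1] >= rho
    return rho, (count + 1) / (permutations + 1)  # min p-value is 0.001 for 999

# Two correlated stand-ins for skill vectors learned at d and 2d.
rng = np.random.default_rng(4)
base = rng.random((20, 8))
s_d, s_2d = base[:, :4], base
xi, eta_d, eta_2d = xi_eta(s_d, s_2d)
rho, p = mantel(squareform(pdist(s_d)), squareform(pdist(s_2d)))
print(round(xi, 3), round(eta_d, 3), round(eta_2d, 3), round(rho, 3), p)
```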
The results of the Mantel tests are reported in Table 9, where the p-values are omitted since they were 0.001 in all cases, indicating that the values of ρ are statistically significant. The fact that the values of ρ were always highest between d_opt and 2 d_opt confirms that there was the strongest positive correlation between the corresponding distance matrices. Specifically, ρ exceeded 0.536 on all the remaining datasets, proving strong positive correlations as well. Therefore, from Table 7, Table 8, and Table 9, KQN was shown to be stable in predicting correctness and in learning the relationships between the skill vectors as the vector space dimensionality d was varied.

Table 7: Average pairwise distances η and average differences ξ between the pairwise distances. [Most numeric entries are not available here; the first entry, for the cosine distance on ASSIST2009, is ξ_{d_opt, 0.5 d_opt} = 0.11.]

Table 8: Test AUCs (%) of KQN by varying the embedding dimensionality d. d_opt refers to the optimal value found from the correctness prediction task. Note that prediction accuracy may not be highest when d is set to d_opt.

Dataset      d_opt   0.5 d_opt   2 d_opt
ASSIST2009   82.32   82.35       82.32
ASSIST2015   73.40   73.38       73.40
Statics2011  83.20   83.17       83.16
Synthetic-5  82.81   82.79       82.82

Table 9: Mantel tests on the distance matrices. p-values are not marked since they were 0.001 in all cases. [Only the first entry is available here: for the cosine distance on ASSIST2009, ρ between d_opt and 0.5 d_opt is 0.521.]

DISCUSSION AND CONCLUSION

From the experiment results for the four tasks, we list the contributions of this paper as follows:

(1) KQN performs better than all the previous models on the four datasets for the correctness prediction task.
(2) KQN enables the knowledge state of a student to be queried with respect to different skills, which is helpful for interpreting the knowledge interaction through visualization.
(3) KQN's architecture leads to the concept of probabilistic skill similarity, which relates the cosine and Euclidean distances between two skill vectors to the odds ratio of the corresponding skills, as introduced earlier in the paper. This makes the skill vectors and their pairwise distances useful for domain modeling, e.g., with cluster analysis.
(4) KQN is robust to changes in the dimensionality of the vector space for the knowledge state and skill vectors: its prediction accuracy is not degraded, and it learns strongly positively correlated sets of pairwise distances between the skill vectors as the dimensionality is varied; equivalently, KQN learns the latent relationships between skills stably.

Compared to other neural network models, KQN has more parameters to learn. For example, since it includes an MLP in the skill encoder in addition to an RNN in the knowledge state encoder, KQN is computationally heavier than DKT, which only has an RNN for encoding student responses. Heuristically, more GPU memory was required for training KQN compared to DKT+KQN. Still, we believe that the advantages of KQN mentioned above are meaningful enough to compensate for the increase in space complexity.

KQN proposes an alternative approach to the KT problem by defining the knowledge state and skill vectors in the same vector space. It keeps a general form of the knowledge state and skills as vectors while defining the knowledge interaction clearly as the dot product between the two types of vectors. Since the pairwise distances between skill vectors are interpreted through the logarithm of the corresponding odds ratios from the probabilistic perspective, those distances can become useful features for domain modeling to explore the latent structure of the skill domain, which can be a future direction of KT research.
ACKNOWLEDGMENTS
This research has been supported by the project ITS/227/17FP from the Innovation and Technology Fund of Hong Kong.
REFERENCES
[1] John R. Anderson, C. Franklin Boyle, Albert T. Corbett, and Matthew W. Lewis. 1990. Cognitive Modeling and Intelligent Tutoring. Artificial Intelligence 42, 1 (Feb. 1990), 7–49. https://doi.org/10.1016/0004-3702(90)90093-F
[2] Hao Cen, Kenneth Koedinger, and Brian Junker. 2006. Learning Factors Analysis – A General Method for Cognitive Model Evaluation and Improvement. In International Conference on Intelligent Tutoring Systems. Springer, 164–175.
[3] Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). 1724–1734.
[4] Mingyu Feng, Neil Heffernan, and Kenneth Koedinger. 2009. Addressing the assessment challenge with an online system that tutors as it assesses. User Modeling and User-Adapted Interaction 19, 3 (2009), 243–266.
[5] Alex Graves, Greg Wayne, and Ivo Danihelka. 2014. Neural Turing Machines. arXiv preprint arXiv:1410.5401 (2014).
[6] Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation 9, 8 (1997), 1735–1780.
[7] Lawrence Hubert and Phipps Arabie. 1985. Comparing partitions. Journal of Classification 2, 1 (1985), 193–218.
[8] Charles Lee Hulin, Fritz Drasgow, and Charles K. Parsons. 1983. Item Response Theory: Application to Psychological Measurement. Dorsey Press.
[9] Andreas M. Kaplan and Michael Haenlein. 2016. Higher education and the digital revolution: About MOOCs, SPOCs, social media, and the Cookie Monster. Business Horizons 59, 4 (2016), 441–450.
[10] Mohammad Khajah, Robert V. Lindsey, and Michael C. Mozer. 2016. How deep is knowledge tracing? In Proceedings of the 9th International Conference on Educational Data Mining. 94–101.
[11] Diederik P. Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014).
[12] Kenneth R. Koedinger, Ryan S.J.d. Baker, Kyle Cunningham, Alida Skogsholm, Brett Leber, and John Stamper. 2010. A data repository for the EDM community: The PSLC DataShop. Handbook of Educational Data Mining 43 (2010), 43–56.
[13] Pierre Legendre and Louis Legendre. 1998. Numerical Ecology, Volume 24 (Developments in Environmental Modelling).
[14] Vinod Nair and Geoffrey E. Hinton. 2010. Rectified linear units improve restricted Boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning (ICML-10). 807–814.
[15] Hyacinth S. Nwana. 1990. Intelligent tutoring systems: an overview. Artificial Intelligence Review 4, 4 (1990), 251–277.
[16] Philip I. Pavlik, Hao Cen, and Kenneth R. Koedinger. 2009. Performance Factors Analysis – A New Alternative to Knowledge Tracing. In Proceedings of the International Conference on Artificial Intelligence in Education.
[17] Radek Pelánek. 2017. Bayesian knowledge tracing, logistic models, and beyond: an overview of learner modeling techniques. User Modeling and User-Adapted Interaction 27, 3-5 (2017), 313–350.
[18] Chris Piech, Jonathan Bassen, Jonathan Huang, Surya Ganguli, Mehran Sahami, Leonidas J. Guibas, and Jascha Sohl-Dickstein. 2015. Deep knowledge tracing. In Advances in Neural Information Processing Systems. 505–513.
[19] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15, 1 (2014), 1929–1958.
[20] Jason Weston, Sumit Chopra, and Antoine Bordes. 2014. Memory networks. arXiv preprint arXiv:1410.3916 (2014).
[21] Kevin H. Wilson, Yan Karklin, Bojian Han, and Chaitanya Ekanadham. 2016. Back to the basics: Bayesian extensions of IRT outperform neural networks for proficiency estimation. arXiv preprint arXiv:1604.02336 (2016).
[22] Xiaolu Xiong, Siyuan Zhao, Eric Van Inwegen, and Joseph Beck. 2016. Going Deeper with Deep Knowledge Tracing. In Proceedings of the 9th International Conference on Educational Data Mining. 545–550.
[23] Chun-Kit Yeung and Dit-Yan Yeung. 2018. Addressing two problems in deep knowledge tracing via prediction-consistent regularization. In Proceedings of the Fifth Annual ACM Conference on Learning at Scale. ACM, 5.
[24] Michael V. Yudelson, Kenneth R. Koedinger, and Geoffrey J. Gordon. 2013. Individualized Bayesian knowledge tracing models. In International Conference on Artificial Intelligence in Education. Springer, 171–180.
[25] Wojciech Zaremba, Ilya Sutskever, and Oriol Vinyals. 2014. Recurrent neural network regularization. arXiv preprint arXiv:1409.2329 (2014).
[26] Jiani Zhang, Xingjian Shi, Irwin King, and Dit-Yan Yeung. 2017. Dynamic key-value memory networks for knowledge tracing. In Proceedings of the 26th International Conference on World Wide Web. 765–774.