The Impact of Looking Further Ahead: A Comparison of Two Data-driven Unsolicited Hint Types on Performance in an Intelligent Data-driven Logic Tutor
Christa Cody, Mehak Maniktala, Nicholas Lytle, Min Chi, Tiffany Barnes
Abstract
Research has shown assistance can provide many benefits to novices lacking the mental models needed for problem solving in a new domain. However, varying approaches to assistance, such as subgoals and next-step hints, have been implemented with mixed results. Next-Step hints are common in data-driven tutors due to their straightforward generation from historical student data, as well as research showing positive impacts on student learning. However, there is a lack of research exploring the possibility of extending data-driven methods to provide higher-level assistance. Therefore, we modified our data-driven Next-Step hint generator to provide Waypoints, hints that are a few steps ahead, representing problem-solving subgoals. We hypothesized that Waypoints would benefit students with high prior knowledge, and that Next-Step hints would most benefit students with lower prior knowledge. In this study, we investigated the influence of data-driven hint type, Waypoints versus Next-Step hints, on student learning in a logic proof tutoring system, Deep Thought, in a discrete mathematics course. We found that Next-Step hints were more beneficial for the majority of students in terms of time, efficiency, and accuracy on the posttest. However, higher totals of successfully used Waypoints were correlated with improvements in efficiency and time in the posttest. These results suggest that Waypoint hints could be beneficial, but more scaffolding may be needed to help students follow them.
Keywords
Tutoring system, Hints, Assistance, Data-Driven methods
Christa Cody
E-mail: [email protected]
North Carolina State University, Computer Science Department, Raleigh, NC, USA

1 Introduction

Intelligent tutoring systems (ITS) provide adaptive assistance to students and have significant positive effects on learning [42, 29]. Multiple approaches to assistance have been explored, with some very specific assistance, like bottom-out hints [59], designed to ensure that students "do not flounder during problem solving" [36], while other more abstract assistance, like a suggested subgoal [11], is designed to allow more freedom and exploration within the domain. Providing assistance has been shown to reduce the cognitive load of learning by simplifying the task, leading to greater learning outcomes in less time [23, 53]. However, determining what level or type of help students need is a complex task that can affect learning outcomes [1, 59, 61]. A major goal of providing assistance is to level the playing field of learning so that students at any incoming proficiency can master the same material in a similar amount of time. Research has shown that the level of hint and the learner's incoming experience can affect learning outcomes in ITSs [5, 23]. One example of this is the expertise reversal effect, where methods that benefit novices, such as worked examples, become detrimental to students with higher expertise by increasing cognitive load through redundant information [54].

Furthermore, research has found evidence of aptitude-treatment interactions (ATI) within instructional strategies [13, 50], where certain students, particularly lower-performing students, are more likely to be affected by changes in the learning environment. Similar to solving programming problems, solving logic proofs requires students to understand a system of domain principles or rules and to creatively apply them in sequence to achieve a goal. Consequently, support can be directed at any of these facets of problem solving, such as helping a student learn a rule or identify when applying such a rule will move them towards a goal. Therefore, we hypothesized that different hint types could have different effects based on students' incoming proficiency.

Deep Thought's default hints are Next-Step hints, where the next statement to be derived is given to the student and can be derived within one step. Providing the next step to derive allows students to focus their learning on discovering how to reach their new short-term subgoal, rather than what next subgoal to pursue. On the other hand, Next-Step hints may reduce student autonomy or practice in creating appropriate problem-solving strategies.

To evaluate the effect of a new hint type on students' performance, we created Waypoints, which can be thought of as intermediate subgoals, by modifying our Next-Step hint generator to produce hints that mimic subgoals without the need for expert labelling. Our new method produces Waypoint hints that require students to perform 2-3 steps to derive them. Waypoints are intended to serve as near-term subgoals that allow students more room for exploration and latitude in strategy construction.

Another important aspect of assistance in tutoring systems is the ease of generation. Data-driven methods, where actions within the tutor are designed and developed using historical data, have been used to great effect to automate and individualize computer-aided instruction [38, 52, 18, 8].
Deep Thought's data-driven assistance matches current student work with similar historical successful, efficient examples to provide adaptive Next-Step hints using the Hint Factory, a method of generating Next-Step hints. The original Hint Factory opened a new field of data-driven hint generation that was first applied in tutors for logic [51, 8], and then for linked lists [17]. More recently, the Hint Factory approach inspired new research in generating Next-Step hints for novice programming, based on generating assistance using pieces of previous students' solutions [17, 49, 43, 8].

However, there is a lack of research extending this Next-Step hint generation to provide additional forms of assistance. Therefore, we modified our Next-Step hint generator to produce Waypoint hints. Our modifications were inspired by the Approach Maps technique of graph-based mining to discover important subgoals in common student solutions [14]. This extension of Next-Step hint generation to provide a higher level of hint may be used in other systems to easily generate a new hint type that could provide more adaptive assistance to address individual student needs.

Our goals for this study were to 1) perform a study to compare the impacts of Waypoints with Next-Step hints on performance, and 2) determine whether prior proficiency interacted with hint type to impact tutor posttest performance. We investigated the impact of the two types of hints, Next-Step and Waypoints, on student learning via unsolicited, tutor-initiated steps inserted into the student workspace, which we refer to as "Assertions". Assertions are designed to direct student attention to, and promote adoption of, unsolicited Next-Step and Waypoint hints.

Based on the prior research mentioned above, we hypothesized that our Next-Step hints would be most beneficial for students with lower incoming proficiency and lead to better performance on the posttest. We also hypothesized that Waypoints would be more beneficial to students with higher incoming proficiency and lead to better performance on the posttest. In other words, we predicted an aptitude-treatment interaction (ATI) effect [13, 50] where prior student proficiency would impact which students benefit most from a treatment. We predicted an ATI effect for both Waypoint and Next-Step hints, with higher proficiency students benefitting more from Waypoints and lower proficiency students benefitting more from Next-Step hints.

In this paper, we first discuss the context of the logic tutor, Deep Thought, and the method of generation for the different hint types. We then outline our experimental setup, designed to compare these two hint types in terms of their effects on student learning outcomes. Finally, we discuss the study results and how they relate to prior literature, and provide recommendations for future data-driven hint development and research.
2 Related Work

In this section, we discuss various approaches to assistance, such as subgoals, Next-Step hints, and worked examples, within intelligent tutoring systems (referred to here as ITSs or tutors). We also discuss cognitive theories surrounding assistance, including cognitive load and the "zone of proximal development", that have influenced our work.

Guided discovery, helping students discover new knowledge rather than providing direct instruction, is generally more beneficial than allowing students to learn unguided [25, 33]. This finding agrees with the theory of the "zone of proximal development" (ZPD), the space between things a student can do independently and those they can only do with support [60]. Vygotsky hypothesized that the most effective learning occurs when students are assigned tasks within their ZPD, meaning that tasks should neither be so simple that students can do them independently nor so difficult that they cannot make progress even with assistance. This dilemma of choosing an appropriate level of assistance shows that giving or withholding information is a delicate balance with trade-offs [26].

The theory of cognitive load may explain the trade-offs of different approaches to assistance. Providing assistance can reduce the cognitive load needed for students to learn through methods such as simplifying the task [23] or breaking the task down into easier-to-digest components, such as subgoals [37]. However, the cognitive load of a learner is affected by both the elements of information in the task and their own ability [53]. Intuitively, providing assistance that is too hard for a particular student to understand can negatively impact learning. However, providing assistance when it is not needed may also have a negative effect, such as the expertise reversal effect, in which providing students information they already know increases their cognitive load [54]. On the other hand, it is a known problem that many students fail to request help when it is needed; this has been termed help avoidance [1], discussed later in this section.

2.1 Approaches to Assistance in Tutoring Systems

Intelligent tutoring systems (ITSs) have significant positive effects on learning outcomes [42]. Many forms of contextualized assistance have been explored in ITSs, such as hints, worked examples, and error feedback [21, 18, 58, 59]. The most minimal hint type is error-specific feedback, which provides a hint regarding an error the student has made [59]. Our tutor, as described below, includes basic error feedback when rules are not applied correctly.

Many tutors use goal-directed hint sequences to provide several hints in a row, beginning with a more general hint and then transitioning to more specific and directive hints [21]. Our tutor has this capability, but it was disabled for this study to determine the impact of hint type and not the amount of detail each student might request. A standard goal-directed hint sequence within a tutoring system is Point, Teach, and Bottom-out [21]. Pointing hints attempt to remind the student of relevant material. Teaching hints describe how to apply the relevant material. Bottom-out hints tell the student the next step and specifically how to implement it. The hints in Deep Thought would be considered pointing hints, because they point students in the direction they should be moving by giving them a hint statement to work towards.

2.2 Subgoals

One type of assistance higher-expertise learners benefit from is subgoals, a set of steps in the solution process that allows users to "chunk" information for ease of learning [11, 37]. Sweller et al.
[57] found that using more abstract representations of goals in five maze-tracing experiments resulted in "fewer errors and more rapid learning of the structure of the problem." The authors found that the more information solvers knew about the goal, the less they learned about problem structure. However, studies have found that these approaches have trade-offs depending on learner ability and problem difficulty or context [37].

With regard to learners' abilities, research within ITSs has shown that high-ability learners can benefit from lower amounts of guidance, or less direct guidance, while low-ability learners benefit from more concrete (specific and direct) guidance [5, 28]. These findings inspired us to explore how data-driven hint algorithms could be used to derive less direct guidance to benefit high-ability learners.
2.3 Data-Driven Hints and the Hint Factory

The Hint Factory is a data-driven approach developed to generate Next-Step hints for students applying rules to solve open-ended problems in well-defined domains where there are multiple valid solutions [51, 52]. New innovations in generating assistance from individual pieces of previous students' solutions have helped researchers extend the ideas of the Hint Factory to generate Next-Step hints for new domains, including novice programming and linked list construction [17, 49, 43, 8]. The Next-Step hints derived by the Hint Factory and used in our tutor are pointing hints that suggest a statement a student could derive using a single domain rule application. Sweller et al. make the case that providing more explicit instruction is better for novices, who need to establish those individual learning blocks before they can create their own mental models [54, 35].

However, research has shown that allowing students to make successful, unaided attempts at solving a problem can provide a higher learning benefit compared to explicit instruction showing them what to focus on [26]. Hint Factory Next-Step hints have been shown to be successful in supporting student learning and problem solving, with students having access to such hints in logic being 3 times more likely to complete the tutor than those without [52]. These results suggest that Next-Step hints are direct and explicit enough to support learning, but since level 1 hints do not provide the full information to achieve a next step, students must do some unaided exploration to achieve the suggested hint statement. On the other hand, Aleven et al. note that a "one size fits all" strategy for guidance is not likely beneficial [1]. Hence, we are inspired to determine whether even less direct data-driven hints may benefit high-ability learners.

2.4 Aptitude-Treatment Interactions

Aptitude-treatment interactions have been widely studied in the educational domain. Prior research in instructional strategies [13, 50] has shown the existence of aptitude-treatment interaction (ATI), where certain students are more sensitive to variations in the learning environment and may be affected differently by the treatment compared to less sensitive students who perform well regardless of the treatment. Educational researchers have discovered ATI effects based on prior experience level, prior working memory, and incoming self-regulated learning ability [24, 27, 19, 62]. For example, Lehmann et al. explored the effect of working memory on learning outcomes in fluency/disfluency groups, where instructional materials had different levels of text legibility [27]. Based on these findings, we believe that there could be an aptitude-treatment effect associated with hint type, and that students with lower incoming proficiency may be more sensitive to hint type.
2.5 Help Avoidance and Unsolicited Hints

Despite this considerable research on assistance, there is a pervasive problem within ITSs called help avoidance, where students do not leverage the intelligence within the system for help [2]. There are many reasons for help avoidance, one of which is that certain students may lack specific meta-cognitive skills like knowing when to ask for help [1]. As a result, some ITSs employ unsolicited hints (i.e., providing hints when needed without request) to prevent help avoidance [41], and we adopt this unsolicited strategy here.

Zhou et al. found that students were more likely to make effective pedagogical decisions at the problem level rather than the step level, meaning that students were less able to make effective decisions when deciding if they needed a hint on a particular problem-solving step [63]. In another study, researchers found that a large number of students using Andes, the physics tutor, would guess instead of requesting hints [46]. Furthermore, higher learning gains have been observed for low-performing students when unsolicited hints were provided [4]. While one study found that students learned more reliably with hints on demand than with unsolicited hints [47], other studies have shown that providing hints at the appropriate time can augment students' learning experience [10, 45], improve their performance [9], and avoid the negative effects of frustration while saving students time by preventing unproductive struggle [40].

Within our tutor, even though students often have difficulty and hints are readily available via the hint button, most students do not request assistance. In Fall 2017, students using our tutor requested a median of zero hints per problem. In this study, to enable us to compare the impact of hint type, we periodically (frequency defined in Section 3.1) provided unsolicited hints to students based on the condition they were assigned. In prior work, we compared our unsolicited hints to the normal conditions in Deep Thought, on-demand hints only, and found that the unsolicited hints had no impact on the performance metrics in the training and no negative impacts on any performance metrics on the posttest [12]. Furthermore, this work found that providing unsolicited hints reduced the number of steps where students needed help but did not receive it, as detected by our Help-Need model [31, 30]. Therefore, we do not believe that our unsolicited hints are disruptive, but we note that providing unsolicited hints has potential for disrupting students' learning. In the next section, Deep Thought and its interface are discussed in detail, and the hints' generation, usage, and frequency are expanded on.
3 The Deep Thought Tutor

The Deep Thought tutor (see Figure 1, described further below) is used in the context of a discrete mathematics course where students first spend 2 weeks learning about truth tables and proving each logic rule is true, in class and in online multiple-choice homework assignments. Then, students learn about formal proofs, where students iteratively apply logic rules to a set of given statements to derive a specified conclusion.

A formal proof works much like any multi-step procedural problem where domain principles are applied to given and previously-derived facts to derive and justify new statements. For example, in physics, students may be given values for mass and acceleration and be asked to determine force. They would then apply the domain principle of F = m * a along with the given values of m and a to derive a new statement about the value of F. In logic, each derived statement must have a justification, which consists of the domain principle and the relevant prior statements it was applied to. This corresponds to the information used to derive F in the previous physics example. In a formal proof, students are given a few statements (the number may vary) that are known to be true, often referenced as "givens", and a conclusion that is to be derived. Then, students must apply logical rules to the givens to derive new statements. The student repeats this process of identifying rules to apply to certain statements until they derive the conclusion. An example of this process in Deep Thought is covered in this section along with a description of the interface.

Within the discrete math course, students next complete partially-worked examples in a fill-in-the-blank type interface where they are given formal logic proofs with one missing part on each step: either the derived statement, or part of the justification that consists of the rule used to derive it and the statements the rule was applied to. Many example logic proofs are worked in class, with students asked to actively solve logic proofs in small groups, and students are provided with several full worked examples in handouts. After this class work and homework, students are assumed to have reasonable familiarity with logic rule application, but need practice in determining which rules to apply in service to a problem-solving goal. Students are then assigned to complete formal logic proofs using our propositional logic tutor, Deep Thought [39].

The intention of the Deep Thought tutor is to provide students with practice on solving logic proofs with a focus on problem-solving efficiency in both time and the number of steps in their solutions, i.e. shorter proofs in less time, and ideally with few mistakes in justifying or deriving new statements. To do so, the tutor must provide basic functionalities including (1) correctness feedback on each step (on both justification and derived statements), and (2) automated detection of proof completion. Like a compiler, Deep Thought provides these functions that identify errors and clearly show when a problem is complete, but do little to help students with the overall goal of reaching a problem solution through deriving and justifying a series of well-chosen statements. To bridge this gap, the Hint Factory was created to provide data-driven assistance that could point students to appropriate subgoal statements to derive [8, 51, 52].

Deep Thought allows students to solve logic proofs graphically, as shown in Figure 1.
On the left of Figure 1, the workspace is labelled. The workspace is where students can select statements (purple, oval-shaped nodes) and apply rules by selecting rules (blue, oval-shaped nodes) from the middle of the screen under the "Rules" section to derive new statements. In Figure 1, there are 4 givens (at the top of the workspace in purple, oval-shaped nodes) and the conclusion (at the bottom of the workspace in a purple, square-shaped node). Each statement is labelled to show the order in which students derived them, with the exception of the givens and conclusion, which are labelled for ease of reference. There is no particular ordering to the givens. Also, there is an example of our hints on the screen in the blue, oval-shaped nodes labelled "Goal."

To derive a new statement, a student must select statement(s) by clicking them, followed by selecting a rule to apply. In response to the student selecting a rule to apply, the tutor has one of 3 responses: 1) if the student is using an applicable rule, i.e. a rule that logically can be applied to the statements, and the new derived statement is the only potential derivation, then the statement is automatically added to the screen; 2) if the student is using an applicable rule and there are multiple potential derivations, e.g. using "Simplification" on the statement "I ∧ F" where either I or F could be the new derived statement, then the student is prompted to enter the statement they want to derive; and 3) if the student has incorrectly selected a rule that doesn't apply to the selected statements, e.g. the rule requires only one statement to be selected, such as "Simplification" or "Implication", and the student has selected two statements, then the tutor provides a pop-up and a description of the error. Note that in response 2, if the student incorrectly types in the statement to be derived in the prompt, the tutor will pop up an error and the student will have to select the rule again to derive the statement. A sketch of this three-way response logic appears below.

When a new statement is added by the student, the statement becomes an oval-shaped node, similar to the givens' shape, but the color depends on the frequency and necessity of the node based on historical data. To help students avoid deriving unnecessary statements/nodes in the training phase, the tutor colors nodes based on their necessity and frequency in our historical dataset of correct solutions by past students. Nodes that were never necessary to derive the conclusion are colored gray, while frequently-necessary nodes are colored green, and infrequently-necessary nodes are colored yellow.

As the student derives new statements, the nodes are added to the proof with arrows pointing to them to show which statements were used and the rule applied to derive the new statement. On the right of the screen, the Info Box contains information communicated by the tutor about the current problem the student is solving, i.e. what rules may be useful to solving the proof, information about hints on the screen, and information about certain buttons a student may try to use. The bottom left of the screen under the workspace contains buttons that a student may use during the training portion of the tutor (skipping a problem and requesting suggestions, i.e. hints, are not available during the testing portions of the tutor). To the bottom right of the screen, buttons are available that show the student general information about the tutor as well as instructions that provide information about solving proofs and the options available for the students.
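To make the three responses concrete, the following Python sketch mirrors the logic described above. It is illustrative only: the helper names (applicable, derivations, error_description) are hypothetical stand-ins, not Deep Thought's actual implementation.

# A minimal sketch of the tutor's three-way response to a rule application.
# `rule` is assumed to expose hypothetical helpers applicable(),
# derivations(), and error_description(); Deep Thought's real code differs.
def respond_to_rule_application(selected_statements, rule):
    if not rule.applicable(selected_statements):
        # Response 3: the rule does not apply (e.g., wrong premise count),
        # so show a pop-up with a description of the error.
        return ("error", rule.error_description(selected_statements))
    results = rule.derivations(selected_statements)
    if len(results) == 1:
        # Response 1: only one possible derivation, added automatically.
        return ("derived", results[0])
    # Response 2: multiple potential derivations (e.g., Simplification on
    # "I ∧ F"), so prompt the student for the statement they intend to derive.
    return ("prompt", results)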
Fig. 1 On the left of the screen is the Deep Thought workspace. Below the workspace are the hint button and hint message box, the rules are in the middle, and to the right is the Dialogue Box where messages related to unsolicited hints as well as problem information are given.

As stated above, Deep Thought is intended to teach students to solve proofs more efficiently, in terms of time and steps taken to reach the conclusion. The tutor presents proof problems as an initial set of given statements with a conclusion to derive from them using logic rules. Each statement, given or derived, is represented by a node, with the conclusion represented by a node with a question mark '?' above it, indicating that it has not yet been justified (shown to be true using logic rules applied iteratively to the givens).

Each problem-solving step consists of two parts: the justification and the derived statement. The justification is the set of 1-2 existing nodes and the rule applied to them, and the derived statement is the result. Students complete the justification by clicking to select 1-2 nodes and clicking on a rule to apply. Students then type in the derived statement that results from applying the rule to the selected statement nodes. For example, Figure 2 shows a formal proof beginning with 3 givens (at the top of the workspace in purple, oval-shaped nodes) and the conclusion (at the bottom of the workspace in a purple, square-shaped node). The student (1) selects the statement "I ∧ F" and (2) applies the "Simp" rule, i.e. Simplification, to (3) derive a new statement "F". To solve the proof, the student would continue identifying combinations of statements and rules to apply until they derive the conclusion statement "J ∧ K".

Throughout the tutor, including the pre- and post-test problems, Deep Thought provides immediate error feedback for mistakes, either in justifications or derived statements. If a student clicks on the wrong rule, or their derived statement does not follow from the selected nodes and rule, Deep Thought shows a popup message and records the error. For example, if a student selects two nodes and then clicks on the Simp rule, the error prompt reads "Rule requires one premise," then fades away. If the student enters a derived statement that is true, and the justification (consisting of the selected nodes and rule to derive it) is correct, then a new node with the derived statement appears in the workspace.

To complete a problem, the student must iteratively derive and justify new statements until the conclusion statement is derived and justified. When students have completed a problem, the conclusion's question mark is removed, and it is visually connected to the givens through a series of derived nodes and arrows indicating their justifications. Since the system automatically checks each step and detects completion in all phases of the tutor, student solutions cannot be incorrect, but some may be more expert than others. Students are considered to have learned the topic when they perform well on the posttest (described below), especially with regard to problem solutions with fewer steps and fewer mistakes in less time.

Deep Thought includes four phases: introduction, pretest, training, and posttest. The introduction consists of three problems, including two worked examples, where students click through the addition of successive nodes until a conclusion is derived, and a third problem students solve alone to learn the interface.
Then, students take the pretest, consisting of solving a single problem with no hints available. The pretest is used to measure students' incoming proficiency and assign them to conditions via stratified sampling.
Fig. 2
Deriving a new justified node. (1) Selecting the node "I ∧ F" to use. (2) Selecting the rule "Simplification" to apply. (3) The screen after the rule was clicked, showing "F" as a justified node.

Next, students solve 18 problems in the training section. For each training problem, the dialogue box provides information on what rules to focus on while solving a problem, such as "Think about the following rules for this problem: MP, Simp, Add." Students also receive contextual, data-driven hints during training, including both unsolicited hints generated by the system and on-demand hints upon student request, all generated using the same Hint Factory-type approach described below. After completing training, students take a more difficult, non-isomorphic posttest, where they must solve four problems without any help or assistance. Since the posttest is not isomorphic to the pretest, we do not expect the posttest performance to be directly comparable to the pretest performance. Rather, we use the pretest to balance incoming proficiency across groups via stratified sampling, and focus on comparing posttest performance between groups.

Expert solutions for all tutor problems range from 5-8 steps, and student solutions typically contain 5-20 steps. Longer student solutions may simply be inefficient, taking more steps than needed, or they may contain unnecessary nodes that do not lie on a direct path from the givens to the conclusion (unnecessary nodes in a complete solution are easy to detect, because removing them does not disconnect the conclusion from the givens, but they are difficult to detect during problem solving). As first mentioned in Section 3, to help students avoid deriving unnecessary statements/nodes in the training phase, the tutor colors nodes based on their necessity and frequency in our historical dataset of correct solutions by past students. Nodes that were never necessary to derive the conclusion are colored gray, while frequently-necessary nodes are colored green, and infrequently-necessary nodes are colored yellow.

3.1 Assistance

In training problems, students may receive unsolicited hints, depending on their assigned study condition, as well as request on-demand hints. On-demand and unsolicited hints provide the same content. All hints provide a target statement to derive, appearing as a node with a '?' in the workspace. In previous work, we showed this method of providing unsolicited hints, Assertions, resulted in better performance than text-based messages as a method of unsolicited hint delivery [31]. For the remainder of the paper, we refer to both solicited and unsolicited hints as hints.

Deep Thought includes several measures intended to prevent gaming the system, where students attempt to use system features to avoid work, or help abuse, where students request hints when they do not need them [6]. First, whenever a hint is already in the workspace, students may not receive another hint, whether it was solicited or provided automatically by the tutor. Second, no further details are provided for any hint, meaning there is no such thing as a bottom-out hint in this study. In past Hint Factory implementations, we have provided students 4 levels of hints that (1) suggested the next step, (2) the specific rule, (3) the prior statements needed, and finally (4) a bottom-out hint with all this information.
In this study, we use only level 1 pointing hints, and disabled hint levels 2-4.

The tutor generates hints using historical student data from four semesters, each semester with approximately 250-300 students using the tutor. Both hint algorithms produce assistance based on the most frequent and efficient paths available in the student's current proof.

We use the Hint Factory [51] approach to generate hints. The Hint Factory [51, 52, 7] is a data-driven method to generate hints by transforming historical student problem-solving attempts into a Markov Decision Process, using observed frequencies as transition probabilities, and estimating the expected value of each previously-observed problem state based on assigning rewards to complete solutions, small negative rewards (i.e. costs) to steps to positively reward more efficient solutions, and large negative rewards to errors to de-emphasize solutions that cause many students to make mistakes. Individual student problem-solving attempts are represented by a series of states, or snapshots of the work done so far, where transitions occur between states when students add or delete problem nodes, or make an error. The Hint Factory is described in detail in Barnes and Stamper's chapter in the 2011 Handbook on Educational Data Mining [7]. All student solutions are combined into an interaction network [15] that reflects all previously-observed solutions to one specific problem. When a hint is requested by the student or tutor, the Hint Factory is used to select a target problem-solving state with the highest expected value (a sketch of this value computation appears below). Note that this process can be done offline, and a simple table can be used to store problem-solving states and their corresponding hint content for real-time hint provision. Then, the latest statement derived in that state is used as the pointing hint to help students know what to try to derive next.

In this study, we do not provide further information on how to derive or justify the suggested statement, i.e. the statements that a student needs to select and the rules the student may need to apply are not provided to the student, meaning that all hints in this paper can be considered as partially-worked example steps.
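To ground this description, here is a minimal sketch of the value computation, assuming a toy interaction network stored as {state: {action: next_state}}; the reward constants and data layout are illustrative assumptions, not the published Hint Factory parameters.

# Hedged sketch: estimate state values over an interaction network.
# network: {state: {action: next_state}} observed in prior solutions.
# goal_states: states containing a completed proof; error_steps: set of
# (state, action) pairs where students made errors.
GOAL_REWARD, STEP_COST, ERROR_COST, GAMMA = 100.0, -1.0, -10.0, 0.9

def state_values(network, goal_states, error_steps, sweeps=100):
    V = {s: 0.0 for s in network}
    for s in goal_states:
        V[s] = GOAL_REWARD  # complete solutions get a large reward
    for _ in range(sweeps):
        for s, actions in network.items():
            if s in goal_states or not actions:
                continue
            # Steps cost a little and observed errors cost a lot, so paths
            # that caused many student mistakes are de-emphasized.
            V[s] = max(
                (ERROR_COST if (s, a) in error_steps else STEP_COST)
                + GAMMA * V.get(nxt, 0.0)
                for a, nxt in actions.items()
            )
    return V

The published method weights transitions by observed frequencies to compute expected values; the max over observed actions above is a simplification for brevity.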
In this study, two hint types are used: Next-Step and Waypoint hints. Figures 3 and 4 show the two forms of data-driven hints, Next-Step (NS) and Waypoints (WP), respectively, and how students would approach deriving the suggested hint statement for each type. Descriptions of how each hint type is generated, as well as how to derive each hint, are expanded in the following paragraphs.
Next-Step hints are generated using the Hint Factory method as described above, with the target state selected to be the one with the highest expected value that occurs within one rule application from the student's current state. Simply, Next-Step hints suggest the best proposition that can be derived in one step from the student's current proof. This corresponds to the next-step hints derived in all of our prior work using the Hint Factory [8, 7, 52, 51, 14, 15, 43, 44].

Since Next-Step hints are partially worked, they allow students to focus on how to justify them, and reflect on why they were suggested. This removes a considerable load; without a hint, students must also search among many options for what best to derive next. For example, Figure 3 demonstrates the ideal derivation of a Next-Step hint. In Figure 3, Deep Thought has 3 givens at the top of the workspace and one hint statement labelled "Goal", F, on the screen. To derive the hint, the student (1) selects the I ∧ F statement by clicking it. After selecting the statement, which is now shown highlighted in blue, (2) the student clicks the rule labelled "Simp" to apply Simplification to the statement. A pop-up will appear for the student to type in what they are attempting to derive, in which case they enter F. After entering F into the prompt, (3) the statement is shown incorporated into the student's solution in the same fashion regular derivations happen, as in Figure 2, with arrows, coloring, and labelling. In this case, the justified hint appears on the screen as a green, oval-shaped node with an arrow pointing to it from the I ∧ F statement, with the label "4:" to indicate this statement is the fourth statement justified (givens are automatically numbered) and the label "Simp" to indicate the statement was derived using the Simplification rule.

Waypoint hints are generated with the same method as Next-Step hints; however, instead of selecting a hint 1 step away from the student's current state, hints that are 2-3 steps away from the student's current state are selected. A primary motivation for this study was to determine a simple way to extend the Hint Factory to provide less direct data-driven hints, i.e. compared to Next-Step hints, without the need for expert authoring. In our prior work, we derived a new method called data-driven Approach Maps, which applies hierarchical graph mining to interaction networks to discover problem-solving states that represent critical junctures in problem-solving attempts, which we call subgoals [14]. These subgoals occurred every 2-3 steps/states in our short logic proofs (which are typically 5-12 steps long). These subgoals inspired our Waypoints, but we wanted to be able to generate these hints with an easier method that is more extensible to other researchers who may already be using the Hint Factory or methods based on it.

To generate Waypoints without the need to apply data-driven Approach Maps, we modified the Hint Factory to select a target statement that was 2-3 steps away from the current state. Among states that were 2 or 3 steps away, we selected the state with a higher frequency within prior correct solutions. This resulted primarily in states that need only two rule applications to derive, since the diversity of student solutions means that frequency typically decreases in interaction networks the further states are from the start. A sketch of both selection rules appears below.
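The following sketch shows both selection rules, assuming a successors(state, depth) helper that returns the set of states reachable in exactly depth correct steps in the interaction network, plus the state values V and solution frequencies freq described above; all names are illustrative.

# Hedged sketch of hint target selection.
def next_step_target(state, successors, V):
    # Next-Step: the highest-value state one rule application away.
    options = successors(state, 1)
    return max(options, key=lambda s: V.get(s, float("-inf"))) if options else None

def waypoint_target(state, successors, freq):
    # Waypoint: the most frequent state exactly 2 or 3 steps away in prior
    # correct solutions; frequency tends to favor 2-step states in practice.
    options = successors(state, 2) | successors(state, 3)
    return max(options, key=lambda s: freq.get(s, 0)) if options else None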
By expert review of a random sample of Waypoints, we verified that this simple algorithm results in hints similar to those generated using data-driven Approach Maps [14].

Waypoints are intended to serve as subgoals, giving students more room to explore the solution space and develop their own problem-solving strategies. Since Waypoints cannot be achieved with a single rule application, they require students to make their own problem-solving plan to derive them, considering the existing problem statements and how rules might be applied to them to derive and justify the suggested Waypoint statement. For example, Figure 4 demonstrates the ideal derivation of a Waypoint hint.
In Figure 4, (1) Deep Thought has 3 givens at the top of the workspace and one hint statement labelled "Goal", G ∧ ¬H, on the screen. To derive the hint, the student first selects the I ∧ F statement by clicking it, then the student clicks the rule labelled "Simp" to apply Simplification to the statement. A pop-up will appear for the student to type in what they are attempting to derive, in which case they enter F. Note, this step is not shown in the figure, although it is the same process as described in Figure 3. After entering F into the prompt, the F statement is shown incorporated into the student's solution. Next, the student must make a second derivation to derive the hint. The student (2) selects the F → (G ∧ ¬H) statement and the F statement by clicking each one individually, which highlights both nodes, then the student clicks the MP rule to apply Modus Ponens to the statements. As a result, the statement is automatically derived, since the derived statement is the only option, and the justified hint appears on the screen as a green, oval-shaped node with arrows pointing to it from the F → (G ∧ ¬H) statement and F statement, with the label "5:" to indicate this statement is the fifth statement justified (givens are automatically numbered) and the label "MP" to indicate the statement was derived using the Modus Ponens rule.

For both Next-Step and Waypoint hints, the process of deriving the hint is the same: students must select statement(s) and apply rules to derive new statements, which we also refer to as "steps". The only difference is how many times a student must repeat this process to derive the hint statement (ideally once for Next-Step and twice for Waypoint hints, with the exception of some Waypoints that may take three steps to derive). With that in mind, students may also derive new statements that do not contribute to deriving the hint statement; however, when we refer to how many steps it takes for a student to derive a hint, we are speaking in ideal terms.

We consider a continuum of goals for students, where Next-Step hints ideally take one step to derive, Waypoints take 2-3, and the problem conclusion takes about 5 expert steps. With longer problems or more complex problem domains like programming, we would recommend using a more complex algorithm to select Waypoints if they were shown to be effective. In logic proofs, the shortest proof is considered to be the best, so simple metrics on interaction networks can quickly discover optimal solutions and those that many students can discover.

As stated above, Deep Thought only provides pointing hints to suggest statements that can be derived; neither Next-Step nor Waypoint hints tell students which rules to use to derive them; rather, they help students solve problems by suggesting a subgoal that helps them break down multi-step problems. To use a hint in their proof, the suggested hint statement must be justified by applying a rule to previously-justified or given statement(s). Statements that are not justified appear in the tutor interface with a "?" above them to indicate that they need to be derived.

We implemented unsolicited hints so they appear randomly and with enough uniformity and frequency that even students with short proofs would receive hints.
Fig. 3 Next-Step hint. (1) A Next-Step hint appears, F. (2) The student has selected I ∧ F and is applying the Simplification rule. (3) F has been justified.
Fig. 4 Waypoint hint. (1) A Waypoint hint appears, G ∧ ¬H. (2) The first derivation using Simplification has already been completed. The student has selected F → (G ∧ ¬H) and F and is applying MP (Modus Ponens). (3) G ∧ ¬H has been justified.

One limitation of this method of providing hints is that hints were not necessarily provided when they were most needed, which may affect learning outcomes. However, since students in the tutor rarely request hints, it was necessary to provide the hints automatically and frequently to enable us to evaluate our hypotheses. For the Next-Step group, we capped the number of unsolicited hints at 1/…

4 Methods

The Deep Thought tutor was used as a homework assignment for an undergraduate 'discrete mathematics for computer scientists' course in the Fall 2018 semester at a large research university. We analyzed 143 students' data from two test conditions to investigate the impact of hint type on student performance and behavior. Both conditions were identical except for hint type, Next-Step or Waypoint. We used stratified sampling based on pretest performance, then randomly assigned students to Next-Step hints (NS, n = 71) or Waypoints (WP, n = 72), ensuring both conditions were balanced in incoming knowledge. Before analysis, students who dropped the tutor before completion and students with technical errors in their data were removed (NS n = 15, WP n = 14), leaving 56 students in the NS condition and 58 students in the WP condition, for a total of 114 students.

4.1 Hypotheses

The goals of this study were to 1) evaluate the effectiveness of a new hint type, Waypoint hints, 2) compare the impacts of Waypoints and Next-Step hints on performance, and 3) determine if proficiency had an effect on which hint type was more beneficial. Based on prior literature, we developed the following hypotheses:

– H1: Next-Step hints will improve performance for students with lower incoming proficiency.
– H2: Waypoint hints will improve performance for students with higher incoming proficiency.
– H3: Waypoint hints will be more difficult to derive, resulting in a lower justification rate and performance during training compared to Next-Steps.

These hypotheses were based on the basic assumption that Waypoint hints are more difficult to justify and adopt, since Waypoints require students to derive more steps to justify them. On the other hand, this challenge may be precisely what high-proficiency students need for improved learning. To evaluate these hypotheses, we focused on the performance metrics discussed below.

4.2 Performance Evaluation Metrics

In this section we describe the metrics used to evaluate student performance. Recall that the tutor begins with an introduction with two worked examples and one practice problem, followed by the pretest. We used each student's pretest score to measure incoming knowledge/proficiency. Equation 1 shows how the score is calculated. Each metric is normalized, then the time and step metrics are subtracted from 1 to be comparable to accuracy, i.e. so that for time, steps, and accuracy a number closer to 1 indicates the student is performing well.
A student's score is a combination of percentiles for the pretest time, number of steps, and accuracy on a single problem, ranking students based on how fast, efficient, and accurate they are compared to their peers. We chose these features because they each represent a different aspect of a student's problem-solving experience.

Recall that the tutor was designed to improve time and steps to solve problems, and assumes a basic level of fluency or accuracy in rule applications. Therefore, we have no goals or expectations of improving accuracy with this tutor. However, the score includes all three metrics to ensure that our interventions do not decrease accuracy while attempting to improve time and steps. For example, a student may take a short amount of time on a problem but make many mistakes, resulting in a lower accuracy. We use a median split on the combined pretest score to assign students into High and Low proficiency groups for some analyses.
Score = (1 − TotalTime) · (1/3) + (1 − TotalSteps) · (1/3) + Accuracy · (1/3)    (1)
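For illustration, Equation 1 can be computed as in the following sketch, where the percentile helper is one straightforward reading of the normalization described above (the sample numbers are made up):

import numpy as np

def percentile(values):
    # Fraction of peers at or below each value, in [0, 1].
    v = np.asarray(values, dtype=float)
    return np.array([(v <= x).mean() for x in v])

def pretest_scores(total_time, total_steps, accuracy):
    # Equation 1: time and steps are inverted so that 1 is best on all parts.
    return ((1 - percentile(total_time)) / 3
            + (1 - percentile(total_steps)) / 3
            + percentile(accuracy) / 3)

# Median split into High and Low incoming-proficiency groups.
scores = pretest_scores([300, 410, 250, 520], [12, 20, 9, 15],
                        [0.90, 0.70, 0.95, 0.80])
high_group = scores >= np.median(scores)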
Total time is counted from the moment a problem begins until it is solved by deriving and justifying the conclusion. Total steps in a problem include any attempt at deriving a new node, which includes correct and incorrect steps. Accuracy is the percentage of correct steps out of the total steps, which is expected to start relatively high due to prior exposure in the class, and to increase as students practice. Note that the tutor is not designed or assumed to promote large improvements in accuracy, since no penalties are assigned for incorrect rule applications; the tutor simply alerts students upon wrong rule applications and students may try again, even within the pre- and post-tests. Further, problems require new rules and become more difficult as the students progress. As we seek primarily to promote more efficient problem solving, we focus more on steps and time per problem while maintaining reasonable accuracy. This is because it is more difficult for students to learn to determine which steps to derive to achieve shorter, more efficient proofs, compared to learning how to apply the rules, which can be done by memorization and simple practice. Deep Thought is built primarily to allow students to practice the strategy of problem solving, rather than fluency with rules, most of which are assumed to be learned before the tutor.

One important thing to note is that Deep Thought does not include eye-tracking, and the unsolicited hints are provided regardless of whether a student needs them or not, so we cannot determine precisely whether students followed a hint or incidentally derived the hint statement. Therefore, we have defined metrics to quantify when students justified a hint by selecting the statements and rule needed to derive it, as well as when students adopted a hint by first justifying it and then using it directly on their path to derive the conclusion. These two hint-specific metrics are the hint Justification Rate and
Adoption Rate.

The hint Justification Rate is the number of unsolicited hints justified (correctly identifying the rule and prior nodes needed to derive the suggested node) divided by the total number of hints given across the training problems. A hint is said to be justified when a student applies logic rules to existing logic statements to derive the hinted logic statement; when a hint is justified, the tutor removes its '?' and connects it to its predecessor nodes with arrows labeled with the rule used to derive it. A hint justification provides evidence that a student noticed the hint and knew how to apply rules to justify it, but does not tell the full story. As in any problem-solving context, statements can be derived that are not needed in a final solution. Therefore, we also measure the hint Adoption Rate: whether a hint contributes towards deriving the conclusion. A justified hint can be reached on a path from the problem's given statements. When a hint is adopted, it must first be justified and then become necessary to a student's final solution; in other words, the problem would be incomplete if the hinted statement were removed. This is shown visually when a directed path can be found from the hinted statement node to the problem conclusion. Figure 5 shows a completed problem with labels indicating which nodes are considered justified and which nodes were also adopted for the solution. A sketch of the adoption check appears below.
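The adoption check reduces to reachability on the finished proof graph. A minimal sketch, assuming the solution is stored as a mapping from each node to the set of nodes derived from it (an illustrative data layout, not the tutor's internal one):

def is_adopted(derives, hint_node, conclusion):
    # True when a directed path of derivations links a justified hint
    # to the conclusion, i.e. the hint is necessary to the final proof.
    stack, seen = [hint_node], set()
    while stack:
        node = stack.pop()
        if node == conclusion:
            return True
        if node in seen:
            continue
        seen.add(node)
        stack.extend(derives.get(node, ()))
    return False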
We also investigated impacts on help-seeking through the number of on-demand hint requests (when students click the "Get Suggestion" button). Total Requests represents the number of hint requests during the training portion. Data were analyzed to compare groups for the pretest, training, and posttest portions of the tutor. Within each hint group, we also compared performance of students with High or Low pretest scores, based on a median split on the pretest score.

To determine significant differences between hint types, we applied one-way ANCOVA using the pretest as a covariate, with Benjamini-Hochberg corrections to account for multiple tests. To check that the data met assumptions for ANCOVA, we used the Shapiro-Wilk W test and Levene's test, as well as visually inspecting the data via Q-Q plots and histograms. Data that did not meet the assumptions were transformed using log or square-root transformations, then re-inspected. For clarity, all data in tables are reported before transformation. A sketch of this analysis pipeline appears below.
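For illustration, the sketch below shows an equivalent pipeline in Python with statsmodels; the original analysis was run separately and is not claimed to match this code line for line.

import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm
from statsmodels.stats.multitest import multipletests

def ancova_pvalues(df, metrics):
    # df: a pandas DataFrame with columns condition, pretest, and one
    # column per metric (assumed layout). One-way ANCOVA per metric.
    pvals = []
    for m in metrics:
        model = smf.ols(f"{m} ~ C(condition) + pretest", data=df).fit()
        pvals.append(anova_lm(model, typ=2).loc["C(condition)", "PR(>F)"])
    # Benjamini-Hochberg correction across the family of tests.
    reject, adjusted, _, _ = multipletests(pvals, method="fdr_bh")
    return dict(zip(metrics, zip(adjusted, reject)))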
Fig. 5
A completed problem with nodes that were used to derive the conclusion (justified and adopted) and one node that was not used to derive the conclusion (justified but not adopted).

5 Results

Table 1 shows the overall hint metrics for each group during training. We expected the
Total Added (p < .01) and Steps Until Justified (p < .01) metrics to be significantly different, since each step of a problem can have a unique Next-Step (NS) hint, but one Waypoint (WP) requires multiple steps to be derived. Based on prior literature on help avoidance and low help usage within tutors [43], we were pleasantly surprised to find that students in both groups had relatively high justification and adoption rates. The Next-Step group justified a significantly (p < .01) higher percentage of hints, as shown by the Justification Rate. Additionally, among the justified hints, we saw a significantly (p = .01) lower Adoption Rate of the WP hints in students' final proofs; no significant interaction was observed with pretest proficiency. Although both groups' rates are relatively high, the WP group's lower justification and adoption rates are concerning.

This provides evidence in support of H3, that Waypoint hints would be harder to derive; however, this evidence does not address whether this was due to the difficulty of the WP hints or students' lack of effort to derive them. We explore the possible reasons for these differences later in this section.
Table 1 Hint metrics during training. For ANCOVA results controlling for the pretest score, p-values that are at least marginally significant are bolded and significant values also have an asterisk (*).
Metric                  NS (n = 56)   WP (n = 58)
                        Mean (SD)     Mean (SD)     p
Justification Rate      89% (7)       84% (12)      < .01*
Adoption Rate           83% (10)      74% (17)      .01*
Steps Until Justified   1.1 (0.1)     2.2 (0.3)     < .01*
Total Added             49 (9)        30 (7)        < .01*

To understand the overall impact of Next-Step versus Waypoint hints, we examined the performance of both groups on the tutor pretest, training, and posttest, for all students regardless of incoming proficiency, as shown in Table 2. There were no significant differences between the WP and NS groups on the pretest, although the pretest was slightly worse for the NS group. During training, the NS group significantly outperformed the WP group with fewer steps, less time, better accuracy, and better overall score (Total Time p < .01, Total Steps p = .02, Accuracy p = .0…), with all Next-Step students outperforming all WP students. We examined help-seeking behaviors during training and found the NS group requested significantly more hints, although still a small number overall (approximately 1 per problem for NS versus 0.5 per problem on average for WP), so it is not likely that hint requests account for the difference in training performance.
Table 2 Performance metrics for each group on the pretest, training, and posttest; p-values that are at least marginally significant when applying ANCOVA controlled for pretest are bold, and those that are significant also have an asterisk (*).

                        NS (n = 56)   WP (n = 58)
Metric                  Mean (SD)     Mean (SD)     p-value
Pretest
  Total Time
  Total Steps
  Accuracy
Training
  Total Time (min)                                  < …
  Total Steps                                       .02*
  Accuracy
  Total Requests
Posttest
  Total Time (min)
  Total Steps
  Accuracy
More importantly, on the posttest, the NS group significantly outperformed the Waypoint group on Total Steps (p = .02) and Accuracy (p = .0…). This may reflect that students receiving Next-Step hints did not have to determine what to derive next when receiving hints, just the how, and that the suggested hints were efficient, reducing total steps.

5.1 Effects on High- and Low-Pretest Groups

Our hypotheses focused on the differential impact of hints based on incoming proficiency and the difficulty of applying Next-Step versus Waypoint hints. To investigate these hypotheses, we checked for differences in performance between prior proficiency groups within each condition. We performed a median split for incoming proficiency based on pretest scores and compared performance metrics across conditions and proficiency (NS-High n = 27, WP-High n = 30, NS-Low n = 29, WP-Low n = 28).

First, we examined performance metrics for the High group, shown in Table 3. There were no significant differences between the NS and WP High groups on the pretest. For the training, the WP group took longer and made more mistakes, as indicated by the Total Time (F(2, 54) = 17.…, p < .0…) and Accuracy (F(2, 54) = 5.…, p < .0…) metrics, along with two further significant differences (F(2, 54) = 3.…, p = .0…; F(2, 54) = 4.…, p = .0…). Thus H2 was rejected; Waypoint hints did not improve performance for higher proficiency students.
Performance metrics between the NS and WP
High proficiency groups for thepretest, training, and posttest of the tutor. ANOVA results are reported for the pretest.ANCOVA results, controlling for the pretest, are reported for the training and posttest,with p-values that are at least marginally significant in bold and significant p-values alsohave an asterisk *.
High Proficiency
                      NS-High (n = 27)   WP-High (n = 30)
Metric                Mean(SD)           Mean(SD)           p-value
Pretest
  Total Time (min)    …                  …                  …
  Total Steps         …                  …                  …
  Accuracy            …                  …                  …
Training
  Total Time (min)    …                  …                  < …
  Total Steps         …                  …                  …
  Accuracy            …                  …                  < …
  Total Requests      …                  …                  …
  Justification Rate  …                  …                  …
  Adoption Rate       …                  …                  < …
Posttest
  Total Time (min)    …                  …                  …
  Total Steps         …                  …                  …
  Accuracy            …                  …                  …
Next, we examined performance metrics for the Low pretest group. There were no significant differences between the NS and WP Low groups on the pretest. For the training, the WP group took longer and attempted more steps, as indicated by Total Time (F(2, 54) = 3.…, p = 0.02) and Total Steps (F(2, 54) = 12.…, p < .0…); the WP-Low group also had a significantly lower hint Justification Rate (F(2, 54) = 7.…, p < .0…). On the posttest, the WP-Low group continued this pattern, taking more time (F(2, 54) = 1.…, p = 0.09) and attempting more steps (F(2, 54) = 3.…, p = 0.02), with marginally significant and significant results, respectively, indicating that the (hypothesized) worse performance in the training portion may have transferred to their overall proof-solving strategies on the posttest. No interactions were found between pretest proficiency and performance or hint metrics on the training or posttest.

These results confirm hypothesis H that Waypoints are more difficult for students and have a negative impact on training performance. We hypothesized in H that Next-Step hints would improve (training and posttest) performance compared to Waypoint hints for low proficiency students. The overall performance (Table 2) confirmed that the Next-Step hint group produced better training and posttest performance. However, Table 4 confirms that the benefits on the posttest are more prominently seen for the students with lower incoming proficiency, confirming H. We hypothesized in H that the Waypoint hints would cause lower justification rates and worse training performance due to their increased difficulty; this is seen with both the WP-High and WP-Low groups, confirming that hypothesis. There was also a significant difference in the Adoption rates between the NS and WP groups for both High and Low students, with the WP adoption rates being lower. This suggests that students were not, in fact, able to independently discover the strategies that underlie the WP hints.

Table 4
Performance metrics between the NS and WP Low proficiency groups for the pretest, training, and posttest. ANOVA results are reported for the pretest. ANCOVA results, controlling for the pretest, are reported for the training and posttest; p-values that are at least marginally significant are in bold and significant p-values also have an asterisk *.
Low Proficiency
                      NS-Low (n = 29)   WP-Low (n = 28)
Metric                Mean(SD)          Mean(SD)          p-value
Pretest
  Total Time (min)    …                 …                 …
  Total Steps         …                 …                 …
  Accuracy            …                 …                 …
Training
  Total Time (min)    …                 …                 …
  Total Steps         …                 …                 < …
  Accuracy            …                 …                 …
  Total Requests      …                 …                 …
  Justification Rate  …                 …                 < …
  Adoption Rate       …                 …                 …
Posttest
  Total Time (min)    …                 …                 …
  Total Steps         …                 …                 …
  Accuracy            …                 …                 …
Although we expected a lower hint Justification Rate in the WP group, we thought that the increased difficulty would benefit high proficiency students by allowing them more exploration of the problem space. Therefore, we hypothesized in H that students with higher incoming proficiency would do better on the posttest after experiencing the WP hints in training. However, that was not the case. The WP-High group was only able to perform similarly to the NS-High group, and overall performed worse, although not significantly. Therefore, H is rejected. However, the results do seem to indicate that the high incoming-proficiency students were less affected by the treatment than the low incoming-proficiency students, given that more significant differences between conditions appeared in the low incoming-proficiency group. As mentioned earlier, we expected that an aptitude-treatment interaction (ATI) might occur, where certain students are more sensitive to variations in the learning environment and may be affected differently by the treatment, compared to less sensitive (more proficient) students who are able to perform well regardless of treatment.

5.2 Did Waypoints help with strategy for those who could utilize them?

Although the performance results caused us to reject H, we wanted to investigate whether WP hints provided strategy-related benefits to those students who were able to use them. Therefore, we performed correlation analyses, using the Pearson correlation coefficient, between the hint Justification and Adoption rates and posttest performance metrics. For the correlation analyses, we used the R function corr.test, which computes Pearson correlation coefficients, tests their significance using t-tests, and performs optional corrections for multiple comparisons, which we specified as Bonferroni corrections [48].
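The snippet below sketches this correlation analysis with psych::corr.test on simulated data; the column names (justification_rate, adoption_rate, post_time, post_steps) are hypothetical stand-ins, while the adjust = "bonferroni" argument requests the correction described above.

    library(psych)  # provides corr.test [48]

    # Simulated stand-in for the WP group's per-student data
    set.seed(2)
    wp <- data.frame(justification_rate = runif(58),
                     adoption_rate      = runif(58))
    wp$post_time  <- 30 - 10 * wp$adoption_rate      + rnorm(58, sd = 4)
    wp$post_steps <- 80 - 25 * wp$justification_rate + rnorm(58, sd = 8)

    # Pearson correlations of hint-usage rates with posttest metrics,
    # with t-tests and Bonferroni-adjusted p-values
    corr.test(wp[, c("justification_rate", "adoption_rate")],
              wp[, c("post_time", "post_steps")],
              method = "pearson", adjust = "bonferroni")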
Table 5 shows the significant correlations of hint Adoption and Justification rates with performance metrics for the NS and WP groups on the posttest, as well as correlations within the incoming proficiency groups. For the NS group, the only significant correlation found was for the NS-High group, between hint Adoption Rate and Total Steps (p = 0.06). For the WP group, hint Justification Rate was negatively correlated with Total Time (p = 0.03) and with Total Steps (p = 0.0…), and hint Adoption Rate was negatively correlated with Total Time (p < .01) and Total Steps (p < .0…): justifying and adopting WPs were both associated with more efficient proofs that were shorter and achieved in less time. There was also a significant moderate, negative correlation for the WP-Low group between Total Steps (p = 0.02) and hint Adoption Rate, while the WP-High group had a moderate, negative correlation between Total Time (p = 0.02) and hint Adoption Rate.

This result aligns with our reasoning behind H, that Waypoint hints should improve efficiency- and time-related metrics on the posttest, especially for higher proficiency students. However, ultimately, the WP students performed worse. Based on these results, we conclude that more support may be needed for WPs so that students can utilize them as well as NS hints and better achieve efficiency-related benefits.

Table 5
Significant correlations of hint Justification and Adoption rates with posttest performance metrics, for each hint-type group and pretest group.
Condition   Split   Metric-Pair                       Corr    p
NS          High    Adoption-Total Steps              -0.38   0.06
WP          All     Justification-Total Time (min)    -0.30   …
                    Justification-Total Steps         -0.32   …
                    Adoption-Total Time (min)         -0.35   < …
                    Adoption-Total Steps              -0.40   < …
            High    Adoption-Total Time (min)         -0.40   …
                    Adoption-Total Steps              -0.41   …
            Low     Adoption-Total Steps              -0.39   …
To investigate H, we examined how many unused (unjustified) hints students nevertheless attempted to justify. The significantly lower hint Justification Rate of the WP group, shown in Table 1, and the significantly worse training performance of the WP group, shown in Table 2, led us to seek a better understanding of the circumstances surrounding why the WP hints were used proportionately less.

The hint Justification and Adoption rates can only tell us that students were, or were not, using the hints; they provide no insight into whether students were actively attempting to derive a hint. Therefore, we conducted analyses to determine whether the WP hints were truly harder to derive (H). Because WP hints are more steps away than NS hints, students may see a WP hint as too complicated to be helpful and simply ignore it outright; however, if students attempt to derive a hint and are unable to succeed, this is a larger concern. To determine whether students were attempting to derive a hint, we examined the steps taken after the hint was added (3 steps ahead for NS and 5 steps ahead for WP). If a majority of the examined steps contained variables that also appeared in the hint (2 out of 3 steps for NS and 3 out of 5 steps for WP), the hint was considered attempted; a sketch of this heuristic appears below.
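The R sketch below is one possible reading of this heuristic. The step and hint representations (plain strings whose propositional variables are single uppercase letters) and the helper names are our own assumptions; only the look-ahead windows (3 for NS, 5 for WP) and majority thresholds (2 and 3) come from the text.

    # Extract the propositional variables (assumed here to be single
    # uppercase letters) appearing in a proof statement
    vars_in <- function(statement) {
      unique(regmatches(statement, gregexpr("[A-Z]", statement))[[1]])
    }

    # A hint counts as "attempted" if at least `threshold` of the next
    # `k` steps share a variable with the hint (k = 3, threshold = 2 for
    # NS; k = 5, threshold = 3 for WP)
    attempted <- function(hint, next_steps, k, threshold) {
      window <- head(next_steps, k)
      shared <- vapply(window,
                       function(s) length(intersect(vars_in(s), vars_in(hint))) > 0,
                       logical(1))
      sum(shared) >= threshold
    }

    # Hypothetical WP example: only 2 of the 5 steps touch the hint's
    # variables, so the hint is not counted as attempted (returns FALSE)
    attempted("A -> C", c("A", "B -> B", "A & C", "D", "B"), k = 5, threshold = 3)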
Table 6 shows the total unused hints. Total Unused represents the total number of unused hints per person in each group. % Unused/Total is the total number of unused hints divided by the total number of hints added, which gives a clearer picture of the percentage of hints each student left unused relative to how many they were given. % Attempted/Unused is the total number of attempted hints divided by the total number of unused hints, i.e., the percentage of unused hints that were attempted. There was not a significant difference in the total number of unused hints between the groups (F(2, …), p = 0.…), but the groups differed significantly on % Unused/Total (F(2, …), p < .0…) and on % Attempted/Unused (F(2, …), p = 0.0…).

Table 6
The total unused (unjustified) hints, the percentage of hints unused out of all hints added, and the percentage of unused hints that students attempted to derive, for the NS and WP groups. For ANCOVA results controlling for the pretest score, p-values that are at least marginally significant are in bold and significant values also have an asterisk *.
                      NS (n = 56)   WP (n = 58)
Metric                Mean(SD)      Mean(SD)      p-value
Total Unused          …             …             …
% Unused/Total        …             …             < …
% Attempted/Unused    …             …             …

To understand when unsolicited hints were not justified, we determined the circumstances under which this occurred, distinguishing whether students attempted to use the hint and what the eventual outcome was: either Gave Up or Solved Without using the hint.
Gave Up represents any action that ends the problem without solving it, such as restarting or skipping the problem. In this situation, students had a hint on the screen, worked a few steps, then clicked the restart or skip button without justifying the hint. When a student clicks restart or skip, all current progress on the problem is erased. We considered this to be "giving up" because the student removes all progress made on the current problem, which is concerning given that a hint was on the screen. Solved Without represents cases where students completed a proof with an unjustified hint still on the screen. Here, students have a hint but eventually solve the problem without using it. This indicates that the hint was ignored or, at the very least, was not essential to solving the proof. We are less concerned with this case because the students were able to progress. However, since the hint is the most efficient step to work towards, any student who avoided it took a less efficient route to solving the problem. Lastly, although students had the option to delete a hint, no deletions were observed, possibly because students did not know how to delete hints.

Table 7 details these two cases in which a hint was added but the student did not justify it. For significance testing, ANCOVA was used with the pretest score as the covariate. Total Unused, % Unused/Total, and % Attempted/Unused are defined above. We also examined how many steps students took after a hint was given but before they gave up or solved the proof, to determine how much effort was put into trying to derive the hint.
Steps Before is the number of steps the student attempted after receiving the hint and before they gave up or solved the proof. This metric was added to see how long students kept working on the problem after the hint was given; a sketch of this outcome classification appears below.
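As a minimal sketch in R, the functions below classify a problem attempt that ended with an unjustified hint on screen and count Steps Before; the terminal-action labels ("restart", "skip", "solved") and the timestamp-based step count are hypothetical encodings of the tutor's logs, not its actual schema.

    # Classify how a problem ended when an unjustified hint was on screen;
    # the action labels are a hypothetical encoding of the log events
    classify_unused_hint <- function(terminal_action) {
      if (terminal_action %in% c("restart", "skip")) {
        "Gave Up"          # progress erased with a hint still unjustified
      } else if (terminal_action == "solved") {
        "Solved Without"   # proof completed while ignoring the hint
      } else {
        NA_character_
      }
    }

    # Steps Before: proof steps logged between the hint and the terminal action
    steps_before <- function(step_times, hint_time, end_time) {
      sum(step_times > hint_time & step_times < end_time)
    }

    classify_unused_hint("skip")    # "Gave Up"
    classify_unused_hint("solved")  # "Solved Without"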
Table 7
Comparison of unused hints of each subtype by amount, percentage that were attempted, and steps taken before the action occurred.
                   Gave Up                      Solved Without
                   NS (n = 56)   WP (n = 58)    NS (n = 56)   WP (n = 58)
Metric             Mean(SD)      Mean(SD)       Mean(SD)      Mean(SD)
Total Unused       …             …              …             …
% Unused/Total     …             …              …             …
% Att./Unused      …             …              …             …
Steps Before       …             …              …             …
The WP group attempted to derive a significantly higher proportion of their unused hints before giving up (% Att./Unused, Gave Up: F(2, …), p = 0.0…; Solved Without: F(2, …), p = 0.14). The differences in Steps Before were not significant (Gave Up: F(2, …), p = 0.13; Solved Without: F(2, …), p = 0.…), so the key difference between the groups concerns attempted unused hints in the Gave Up case. This result indicates that the WP students had attempted to make progress towards the hints, were unable to justify them, and then gave up. This is more concerning than giving up on a problem in which they had not attempted to derive the hint, and indicates that the WP hints may have been too hard to derive.

The purpose of this analysis was to investigate H and determine the circumstances surrounding why the WP group had a significantly lower hint Justification Rate than the NS group. The results provide evidence in support of H that the Waypoint hints were harder for students to derive.

6 Discussion

This work aims to explore the extension of a Next-Step hint generator to easily create subgoal-inspired assistance. The Next-Step group saw the best overall performance in both the training and the posttest, including for students with lower incoming proficiency, providing supporting evidence for H. Our results indicated that the Waypoint group performed worse overall in both training and posttest, causing us to reject H. Results also showed that the lower proficiency students, specifically, were less able to utilize this form of assistance; however, students who were able to utilize Waypoints did see benefits in terms of time and efficiency on the posttest. Furthermore, we explored the circumstances surrounding when hints were not utilized and found that students in the Waypoint group attempted a larger percentage of the hints before giving up, providing evidence in support of H that Waypoints would be harder to derive. In this section, we discuss the trade-offs of the two hint types.

6.1 Waypoint hints

WPs were intended to help students learn strategies for solving proofs by breaking the problem into smaller subgoals and providing students with more independent problem-solving experience than NSs. However, the majority of WP students appear to have struggled with WP hints instead, a trade-off of the assistance dilemma [26]. The WP group performed worse overall in both the training and posttest portions of the tutor (see Table 2). Another interesting result, shown in Section 5.1, is that the WP Low-pretest group had a significantly lower Justification Rate and a marginally significant decrease in posttest performance metrics. This aligns with literature showing that lower proficiency learners are less able to use abstract guidance [25, 54]. Therefore, the WP hints might not provide enough guidance for these students. Research has shown that complex assistance can hinder learning by taxing cognitive load [53, 55], which can happen when learners try to process new information and incorporate complex assistance at the same time, "thus forcing learners to use random search procedures" [22]. This is a limitation of our study, as Waypoints may produce better results with more scaffolding.

The Justification Rate being significantly lower for the WP group indicates that the lower performance may be due to an inability to properly use the assistance (see Table 1). This is partially supported by Tables 6 and 7, which show that the WP group had a higher percentage of attempts to justify a hint without succeeding, compared to the NS group. The Adoption Rate being significantly lower for the WP group indicates that, even when students in the WP group were able to justify the hints, they were less able to adopt them to connect the WP hints to the conclusion.
Because the hint is a few steps away, students could end up on a solution path different from the path initially targeted by the WP. Consequently, students who were unable to justify the WP, or to adopt it into their solution, were not following the most efficient path, hindering their ability to learn from the strategies behind the WP hints.

One potentially positive result with Waypoints is shown in Table 5: the significant negative correlations of Justification and Adoption rates with total time and total steps on the posttest. Students who were able to justify and adopt the WPs tended to take less time and fewer steps on the posttest. This correlation aligns with our original intention of using WPs to support strategy development by helping students become more efficient in their problem-solving process. It is therefore possible that students with more experience and domain knowledge may better utilize Waypoints and receive strategy-related benefits. However, it is important to remember that correlation analyses cannot determine causality, and there could be variables not included in these analyses that play an important role in these relationships [16]; this interpretation is only a possibility. Based on these results, WPs could be improved by providing more information (perhaps automatically, once we detect that a student is unsuccessfully attempting to justify the hint) or by incorporating ideas from recent research on promising methods of scaffolding goal-based hints [32].

6.2 Next-Step hints

The total time, total steps, and accuracy were significantly different, or trending towards significance, between groups, as shown in Table 2 for the training and posttest. Since the groups had similar pretest scores, these results show that the NS and WP groups entered the tutor performing similarly, but by the posttest the two groups had diverged: the NS group had higher accuracy and fewer total steps. Furthermore, the NS group was able to increase their accuracy between the non-isomorphic pre- and posttest, while the WP group did not show such improvement. This was perhaps due to the increased practice in applying rules to justify both unsolicited and on-demand hints, since the NS group received and justified significantly more hints in both categories.

The differences in time, steps, and accuracy between the groups show that NSs were more beneficial for students. As shown in Table 1, the Justification Rate was significantly higher for NSs. We believe these results may be due to the alignment of NSs with novices' bottom-up problem-solving approaches that focus on what to do in the short term [3, 20, 54]. NSs may also have resulted in an overall lower cognitive load [23], though this supposition is based only on their design and not on data from students. As a justification, students considering NSs only needed to think about which nodes and rules could be used to derive the NS. In contrast, WP students needed to think about which nodes to use, which rules to apply, and what intermediate steps they would have to achieve before deriving the WP. Interestingly, the NS group requested more on-demand hints (see Table 2). This suggests that the NS group may have found the assistance more helpful and become more comfortable requesting help.
Prior research has shown that students are more likely to request help when they receive help that they perceive to be more suitable for their needs [43]. Although WPs were designed to promote more independent, strategic problem solving, it is possible that NSs also helped students learn problem-solving strategies. Based on the hint generator design, NS students following the hints were seeing the most efficient next step given the current proof state. Problems with frequent Next-Step hints could be acting as partially worked examples, which are known to increase efficient problem-solving strategies [56, 34]. Previous research on hint usage during problem solving in programming suggests that hints can sometimes save students time but reduce learning [37]. In our research, NS hints seem to save students time and increase performance on the posttest. This suggests that NS hints may help students learn to solve the problems more efficiently (more quickly and with fewer steps).
7 Conclusion

This paper contributes a study showing an extension of the Hint Factory to create higher-level hints, and the effects of two types of hints on students' efficiency and accuracy in solving logic proofs: Next-Step hints (NSs) and Waypoint hints (WPs). It is important to note that these hints were provided unsolicited as well as through on-demand hint requests, which could affect students' usage and reception of the hints. Furthermore, our hints are provided periodically and not necessarily when a student may need them. However, our prior work has shown that our unsolicited, periodically provided hints do not have any negative impact on training or posttest performance metrics compared to students in an on-demand-only hint group. In this study, NSs helped students become quicker, more accurate, and more efficient in their proofs. However, the more distant goals of WPs seemed to be harder for students, which not only affected the training problems where the assistance occurred, but also resulted in lower accuracy and reduced efficiency on the posttest. Despite the WP group spending more time on problem solving during training, their performance did not benefit as much as the NS group's. Furthermore, learners with lower incoming proficiency were least able to utilize WPs, while NSs provided benefits to both higher and lower proficiency groups. Although NS performed better overall, students who were able to incorporate WPs, especially those in the WP-High group, saw benefits in terms of time and efficiency on the posttest. Another interesting outcome was that the NS group had higher justification rates and requested more help, which agrees with previous research showing that hint quality affects help-seeking behaviors. In the future, WPs could be augmented to reduce cognitive load, without eliminating the multi-step aspect, by simplifying other elements of the task, such as highlighting needed nodes or offering multiple hint levels. Other future work includes using machine learning to determine when to provide a hint rather than providing hints periodically. Finally, we hope to transfer these findings to other open-ended problem domains like programming, in order to offer additional instructional supports and hints to novice students.

References
1. Aleven, V., Koedinger, K.R.: Limitations of student control: Do students know when they need help? In: International Conference on Intelligent Tutoring Systems, pp. 292–303. Springer (2000)
2. Aleven, V., McLaren, B., Roll, I., Koedinger, K.: Toward tutoring help seeking. In: International Conference on Intelligent Tutoring Systems, pp. 227–239. Springer (2004)
3. Anderson, J.R., Farrell, R., Sauers, R.: Learning to program in Lisp. Cognitive Science (2), 87–129 (1984)
4. Arroyo, I., Beck, J.E., Beal, C.R., Wing, R., Woolf, B.P.: Analyzing students' response to help provision in an elementary mathematics intelligent tutoring system. In: Papers of the AIED-2001 Workshop on Help Provision and Help Seeking in Interactive Learning Environments, pp. 34–46. Citeseer (2001)
5. Arroyo, I., Beck, J.E., Woolf, B.P., Beal, C.R., Schultz, K.: Macroadapting AnimalWatch to gender and cognitive differences with respect to hint interactivity and symbolism. In: International Conference on Intelligent Tutoring Systems, pp. 574–583. Springer (2000)
6. Baker, R.S., Corbett, A.T., Koedinger, K.R.: Detecting student misuse of intelligent tutoring systems. In: International Conference on Intelligent Tutoring Systems, pp. 531–540. Springer (2004)
7. Barnes, T., Stamper, J., Croy, M.: Using Markov decision processes for automatic hint generation. Handbook of Educational Data Mining (2011)
8. Barnes, T., Stamper, J., Lehman, L., Croy, M.: A pilot study on logic proof tutoring using hints generated from historical student data. In: Educational Data Mining 2008 (2008)
9. Bartholomé, T., Stahl, E., Pieschl, S., Bromme, R.: What matters in help-seeking? A study of help effectiveness and learner-related factors. Computers in Human Behavior (1), 113–129 (2006)
10. Bunt, A., Conati, C., Muldner, K.: Scaffolding self-explanation to improve learning in exploratory learning environments. In: International Conference on Intelligent Tutoring Systems, pp. 656–667. Springer (2004)
11. Catrambone, R.: The subgoal learning model: Creating better examples so that students can solve novel problems. Journal of Experimental Psychology: General (4), 355 (1998)
12. Cody, C., Maniktala, M., Warren, D., Chi, M., Barnes, T.: Does autonomy help help? The impact of unsolicited hints on help avoidance and performance
13. Cronbach, L.J., Snow, R.E.: Aptitudes and Instructional Methods: A Handbook for Research on Interactions. Irvington (1977)
14. Eagle, M., Barnes, T.: Exploring differences in problem solving with data-driven approach maps. In: Educational Data Mining 2014 (2014)
15. Eagle, M., Johnson, M., Barnes, T.: Interaction networks: Generating high level hints based on network community clustering. International Educational Data Mining Society (2012)
16. Field, A., Miles, J., Field, Z.: Discovering Statistics Using R (2012)
17. Fossati, D., Di Eugenio, B., Ohlsson, S., Brown, C., Chen, L.: Generating proactive feedback to help students stay on track. In: International Conference on Intelligent Tutoring Systems, pp. 315–317. Springer (2010)
18. Fossati, D., Di Eugenio, B., Ohlsson, S., Brown, C., Chen, L.: Data driven automatic feedback generation in the iList intelligent tutoring system. Technology, Instruction, Cognition and Learning (1), 5–26 (2015)
19. Fuchs, D., Kearns, D.M., Fuchs, L.S., Elleman, A.M., Gilbert, J.K., Patton, S., Peng, P., Compton, D.L.: Using moderator analysis to identify the first-grade children who benefit more and less from a reading comprehension program: A step toward aptitude-by-treatment interaction. Exceptional Children (2), 229–247 (2019)
20. Guzdial, M.: Centralized mindset: A student problem with object-oriented programming. In: ACM SIGCSE Bulletin, vol. 27, pp. 182–185. ACM (1995)
21. Hume, G., Michael, J., Rovick, A., Evens, M.: Hinting as a tactic in one-on-one tutoring. The Journal of the Learning Sciences (1), 23–47 (1996)
22. Kalyuga, S.: Enhancing instructional efficiency of interactive e-learning environments: A cognitive load perspective. Educational Psychology Review (3), 387–399 (2007)
23. Kalyuga, S.: Cognitive load theory: How many types of load does it really need? Educational Psychology Review (1), 1–19 (2011)
24. Kalyuga, S., Chandler, P., Tuovinen, J., Sweller, J.: When problem solving is superior to studying worked examples. Journal of Educational Psychology (3), 579 (2001)
25. Kirschner, P.A., Sweller, J., Clark, R.E.: Why minimal guidance during instruction does not work: An analysis of the failure of constructivist, discovery, problem-based, experiential, and inquiry-based teaching. Educational Psychologist (2), 75–86 (2006)
26. Koedinger, K.R., Aleven, V.: Exploring the assistance dilemma in experiments with cognitive tutors. Educational Psychology Review (3), 239–264 (2007)
27. Lehmann, J., Goussios, C., Seufert, T.: Working memory capacity and disfluency effect: An aptitude-treatment-interaction study. Metacognition and Learning (1), 89–105 (2016)
28. Luckin, R., Du Boulay, B., et al.: Ecolab: The development and evaluation of a Vygotskian design framework. International Journal of Artificial Intelligence in Education (2), 198–220 (1999)
29. Ma, W., Adesope, O.O., Nesbit, J.C., Liu, Q.: Intelligent tutoring systems and learning outcomes: A meta-analysis. Journal of Educational Psychology (4), 901 (2014)
30. Maniktala, M., Barnes, T., Chi, M.: Extending the Hint Factory: Towards modelling productivity for open-ended problem-solving. In: Proceedings of the 13th International Conference on Educational Data Mining (2020)
31. Maniktala, M., Cody, C., Barnes, T., Chi, M.: Avoiding help avoidance: Using interface design changes to promote unsolicited hint usage in an intelligent tutor. International Journal of Artificial Intelligence in Education (2020, under review)
32. Marwan, S., Lytle, N., Williams, J.J., Price, T.: The impact of adding textual explanations to next-step hints in a novice programming environment. In: Proceedings of the 2019 ACM Conference on Innovation and Technology in Computer Science Education, pp. 520–526. ACM (2019)
33. Mayer, R.E.: Should there be a three-strikes rule against pure discovery learning? American Psychologist (1), 14 (2004)
34. McLaren, B.M., van Gog, T., Ganoe, C., Yaron, D., Karabinos, M.: Exploring the assistance dilemma: Comparing instructional support in examples and problems. In: International Conference on Intelligent Tutoring Systems, pp. 354–361. Springer (2014)
35. Meerbaum-Salant, O., Armoni, M., Ben-Ari, M.: Habits of programming in Scratch. In: Proceedings of the 16th Annual Joint Conference on Innovation and Technology in Computer Science Education, pp. 168–172. ACM (2011)
36. Merrill, D.C., Reiser, B.J., Ranney, M., Trafton, J.G.: Effective tutoring techniques: A comparison of human tutors and intelligent tutoring systems. The Journal of the Learning Sciences (3), 277–305 (1992)
37. Morrison, B.B., Margulieux, L.E., Guzdial, M.: Subgoals, context, and worked examples in learning computing problem solving. In: Proceedings of the Eleventh Annual International Conference on International Computing Education Research, pp. 21–29. ACM (2015)
38. Mostafavi, B., Barnes, T.: Evolution of an intelligent deductive logic tutor using data-driven elements. International Journal of Artificial Intelligence in Education, pp. 1–32 (2016)
39. Mostafavi, B., Barnes, T.: Evolution of an intelligent deductive logic tutor using data-driven elements. International Journal of Artificial Intelligence in Education (1), 5–36 (2017)
40. Murray, R.C., VanLehn, K.: A comparison of decision-theoretic, fixed-policy and random tutorial action selection. In: International Conference on Intelligent Tutoring Systems, pp. 114–123. Springer (2006)
41. Murray, R.C., VanLehn, K., Mostow, J.: Looking ahead to select tutorial actions: A decision-theoretic approach. International Journal of Artificial Intelligence in Education (3, 4), 235–278 (2004)
42. Murray, T.: Authoring intelligent tutoring systems: An analysis of the state of the art. International Journal of Artificial Intelligence in Education, 98–129 (1999)
43. Price, T.W., Liu, Z., Cateté, V., Barnes, T.: Factors influencing students' help-seeking behavior while programming with human and computer tutors. In: Proceedings of the 2017 ACM Conference on International Computing Education Research, pp. 127–135. ACM (2017)
44. Price, T.W., Zhi, R., Barnes, T.: Hint generation under uncertainty: The effect of hint quality on help-seeking behavior. In: International Conference on Artificial Intelligence in Education, pp. 311–322. Springer (2017)
45. Puustinen, M.: Help-seeking behavior in a problem-solving situation: Development of self-regulation. European Journal of Psychology of Education (2), 271 (1998)
46. Ranganathan, R., VanLehn, K., van de Sande, B.: What do students do when using a step-based tutoring system? Research & Practice in Technology Enhanced Learning (2) (2014)
47. Razzaq, L., Heffernan, N.T.: Hints: Is it better to give or wait to be asked? In: International Conference on Intelligent Tutoring Systems, pp. 349–358. Springer (2010)
48. Revelle, W.: psych: Procedures for Psychological, Psychometric, and Personality Research. Northwestern University, Evanston, Illinois (2019). URL https://CRAN.R-project.org/package=psych. R package version 1.9.12
49. Rivers, K., Koedinger, K.R.: Data-driven hint generation in vast solution spaces: A self-improving Python programming tutor. International Journal of Artificial Intelligence in Education (1), 37–64 (2017)
50. Snow, R.E.: Aptitude-treatment interaction as a framework for research on individual differences in psychotherapy. Journal of Consulting and Clinical Psychology (2), 205 (1991)
51. Stamper, J., Barnes, T., Lehmann, L., Croy, M.: The Hint Factory: Automatic generation of contextualized help for existing computer aided instruction. In: Proceedings of the 9th International Conference on Intelligent Tutoring Systems Young Researchers Track, pp. 71–78 (2008)
52. Stamper, J., Eagle, M., Barnes, T., Croy, M.: Experimental evaluation of automatic hint generation for a logic tutor. International Journal of Artificial Intelligence in Education (1-2), 3–17 (2013)
53. Sweller, J.: Cognitive load during problem solving: Effects on learning. Cognitive Science (2), 257–285 (1988)
54. Sweller, J.: Evolutionary bases of human cognitive architecture: Implications for computing education. In: Proceedings of the Fourth International Workshop on Computing Education Research, pp. 1–2. ACM (2008)
55. Sweller, J.: Cognitive load theory. In: Psychology of Learning and Motivation, vol. 55, pp. 37–76. Elsevier (2011)
56. Sweller, J., Cooper, G.A.: The use of worked examples as a substitute for problem solving in learning algebra. Cognition and Instruction (1), 59–89 (1985)
57. Sweller, J., Levine, M.: Effects of goal specificity on means–ends analysis and learning. Journal of Experimental Psychology: Learning, Memory, and Cognition (5), 463 (1982)
58. Ueno, M., Miyazawa, Y.: IRT-based adaptive hints to scaffold learning in programming. IEEE Transactions on Learning Technologies (2017)
59. VanLehn, K.: The behavior of tutoring systems. International Journal of Artificial Intelligence in Education (3), 227–265 (2006)
60. Vygotsky, L.: Interaction between learning and development. Readings on the Development of Children (3), 34–41 (1978)
61. Wood, H., Wood, D.: Help seeking, learning and contingent tutoring. Computers & Education (2), 153–169 (1999)
62. Yeh, Y.C., Lin, C.F.: Aptitude-treatment interactions during creativity training in e-learning: How meaning-making, self-regulation, and knowledge management influence creativity. Journal of Educational Technology & Society 18