The Impact of Looking Further Ahead: A Comparison of Two Data-driven Unsolicited Hint Types on Performance in an Intelligent Data-driven Logic Tutor
Christa Cody, Mehak Maniktala, Nicholas Lytle, Min Chi, Tiffany Barnes
Abstract
Research has shown assistance can provide many benefits to novices lacking the mental models needed for problem solving in a new domain. However, varying approaches to assistance, such as subgoals and next-step hints, have been implemented with mixed results. Next-Step hints are common in data-driven tutors due to their straightforward generation from historical student data, as well as research showing positive impacts on student learning. However, there is a lack of research exploring the possibility of extending data-driven methods to provide higher-level assistance. Therefore, we modified our data-driven Next-Step hint generator to provide Waypoints, hints that are a few steps ahead, representing problem-solving subgoals. We hypothesized that Waypoints would benefit students with high prior knowledge, and that Next-Step hints would most benefit students with lower prior knowledge. In this study, we investigated the influence of data-driven hint type, Waypoints versus Next-Step hints, on student learning in a logic proof tutoring system, Deep Thought, in a discrete mathematics course. We found that Next-Step hints were more beneficial for the majority of students in terms of time, efficiency, and accuracy on the posttest. However, higher totals of successfully used Waypoints were correlated with improvements in efficiency and time in the posttest. These results suggest that Waypoint hints could be beneficial, but more scaffolding may be needed to help students follow them.
Keywords
Tutoring system, Hints, Assistance, Data-Driven methods
Christa Cody
E-mail: [email protected]
North Carolina State University, Computer Science Department, Raleigh, NC, USA

1 Introduction

Intelligent tutoring systems (ITS) provide adaptive assistance to students and have significant positive effects on learning [42, 29]. Multiple approaches to assistance have been explored, with some very specific assistance, like bottom-out hints [59], designed to ensure that students "do not flounder during problem solving" [36], while other more abstract assistance, like a suggested subgoal [11], is designed to allow more freedom and exploration within the domain. Providing assistance has been shown to reduce the cognitive load of learning by simplifying the task, leading to greater learning outcomes in less time [23, 53]. However, determining what level or type of help students need is a complex task that can affect learning outcomes [1, 59, 61]. A major goal of providing assistance is to level the playing field of learning so that students at any incoming proficiency can master the same material in a similar amount of time. Research has shown that the level of hint and the learner's incoming experience can affect learning outcomes in ITSs [5, 23]. One example of this is the expertise reversal effect, where methods that benefit novices, such as worked examples, become detrimental to students with higher expertise by increasing cognitive load through redundant information [54].

Furthermore, research has found evidence of aptitude-treatment interactions (ATI) within instructional strategies [13, 50], where certain students, particularly lower-performing students, are more likely to be affected by changes in the learning environment. Similar to solving programming problems, solving logic proofs requires students to understand a system of domain principles or rules and to creatively apply them in sequence to achieve a goal. Consequently, support can be directed at any of these facets of problem solving, such as helping a student learn a rule or identify when applying such a rule will move them towards a goal. Therefore, we hypothesized that different hint types could have different effects based on students' incoming proficiency.

Deep Thought's default hints are Next-Step hints, where the next statement to be derived is given to the student and can be derived within one step. Providing the next step to derive allows students to focus their learning on discovering how to reach their new short-term subgoal, rather than what next subgoal to pursue. On the other hand, Next-Step hints may reduce student autonomy or practice in creating appropriate problem-solving strategies.

To evaluate the effect of a new hint type on students' performance, we created Waypoints, which can be thought of as intermediate subgoals, by modifying our Next-Step hint generator to produce hints that mimic subgoals without the need for expert labelling. Our new method produces Waypoint hints that require students to perform 2-3 steps to derive them. Waypoints are intended to serve as near-term subgoals that allow students more room for exploration and latitude in strategy construction.

Another important aspect of assistance in tutoring systems is the ease of generation. Data-driven methods, where actions within the tutor are designed and developed using historical data, have been used to great effect to automate and individualize computer-aided instruction [38, 52, 18, 8].
Deep Thought's data-driven assistance matches current student work with similar historical successful, efficient examples to provide adaptive Next-Step hints using the Hint Factory, a method of generating Next-Step hints. The original Hint Factory opened a new field of data-driven hint generation that was first applied in tutors for logic [51, 8], and then for linked lists [17]. More recently, the Hint Factory approach inspired new research in generating Next-Step hints for novice programming, based on generating assistance using pieces of previous students' solutions [17, 49, 43, 8].

However, there is a lack of research extending this Next-Step hint generation to provide additional forms of assistance. Therefore, we modified our Next-Step hint generator to produce Waypoint hints. Our modifications were inspired by the Approach Maps technique of graph-based mining to discover important subgoals in common student solutions [14]. This extension of Next-Step hint generation to provide a higher level of hint may be used in other systems to easily generate a new hint type that could provide more adaptive assistance to address individual student needs.

Our goals for this study were to 1) perform a study to compare the impacts of Waypoints with Next-Step hints on performance, and 2) determine whether prior proficiency interacted with hint type to impact tutor posttest performance. We investigated the impact of the two types of hints, Next-Step and Waypoints, on student learning via unsolicited, tutor-initiated steps inserted into the student workspace, which we refer to as "Assertions". Assertions are designed to direct student attention to, and promote adoption of, unsolicited Next-Step and Waypoint hints.

Based on the prior research mentioned above, we hypothesized that our Next-Step hints would be most beneficial for students with lower incoming proficiency and lead to better performance on the posttest. We also hypothesized that Waypoints would be more beneficial to students with higher incoming proficiency and lead to better performance on the posttest. In other words, we predicted an aptitude-treatment interaction (ATI) effect [13, 50] where prior student proficiency would impact which students benefit most from a treatment. We predicted an ATI effect for both Waypoint and Next-Step hints, with higher proficiency students benefitting more from Waypoints and lower proficiency students benefitting more from Next-Step hints.

In this paper, we first discuss the context of the logic tutor, Deep Thought, and the method of generation for the different hint types. We then outline our experimental setup, designed to compare these two hint types in terms of their effects on student learning outcomes. Finally, we discuss the study results and how they relate to prior literature, and provide recommendations for future data-driven hint development and research.
2 Related Work

In this section, we discuss various approaches to assistance, such as subgoals, Next-Step hints, and worked examples, within intelligent tutoring systems (referred to here as ITSs or tutors). We also discuss cognitive theories surrounding assistance, including cognitive load and the "zone of proximal development", that have influenced our work.

Guided discovery, helping students discover new knowledge rather than providing direct instruction, is generally more beneficial than allowing students to learn unguided [25, 33]. This finding agrees with the theory of the "zone of proximal development" (ZPD), the space between things a student can do independently and those they can only do with support [60]. Vygotsky hypothesized that the most effective learning occurs when students are assigned tasks within their ZPD, meaning that tasks should neither be so simple that students can do them independently nor so difficult that they cannot make progress even with assistance. This dilemma of choosing an appropriate level of assistance shows that giving or withholding information is a delicate balance with trade-offs [26].

The theory of cognitive load may explain the trade-offs of different approaches to assistance. Providing assistance can reduce the cognitive load needed for students to learn through methods such as simplifying the task [23] or breaking the task down into easier-to-digest components, such as subgoals [37]. However, the cognitive load of a learner is affected by both the elements of information in the task and their own ability [53]. Intuitively, providing assistance that is too hard for a particular student to understand can negatively impact learning. However, providing assistance when it is not needed may also have a negative effect, such as the expertise reversal effect, in which providing students information they already know increases their cognitive load [54]. On the other hand, it is a known problem that many students fail to request help when it is needed; this has been termed help avoidance [1], discussed later in this section.

2.1 Approaches to Assistance in Tutoring Systems

Intelligent tutoring systems (ITSs) have significant positive effects on learning outcomes [42]. Many forms of contextualized assistance have been explored in ITSs, such as hints, worked examples, and error feedback [21, 18, 58, 59]. The most minimal hint type is error-specific feedback, which provides a hint regarding an error the student has made [59]. Our tutor, as described below, includes basic error feedback when rules are not applied correctly.

Many tutors use goal-directed hint sequences to provide several hints in a row, beginning with a more general hint and then transitioning to more specific and directive hints [21]. Our tutor has this capability, but it was disabled for this study to determine the impact of hint type and not the amount of detail each student might request. A standard goal-directed hint sequence within a tutoring system is Point, Teach, and Bottom-out [21]. Pointing hints attempt to remind the student of relevant material. Teaching hints describe how to apply the relevant material. Bottom-out hints tell the student the next step and specifically how to implement it. The hints in Deep Thought would be considered pointing hints, because they point students in the direction they should be moving by giving them a hint statement to work towards.

2.2 Subgoals

One type of assistance higher-expertise learners benefit from is subgoals, a set of steps in the solution process that allows users to "chunk" information for ease of learning [11, 37]. Sweller et al.
[57] found that using more abstract representations of goals in five maze-tracing experiments resulted in "fewer errors and more rapid learning of the structure of the problem." The authors found that the more information solvers knew about the goal, the less they learned about problem structure. However, studies have found that these approaches have trade-offs depending on learner ability and problem difficulty or context [37].

With regard to learners' abilities, research within ITSs has shown that high-ability learners can benefit from lower amounts of guidance, or less direct guidance, while low-ability learners benefit from more concrete (specific and direct) guidance [5, 28]. These findings inspired us to explore how data-driven hint algorithms could be used to derive less direct guidance to benefit high-ability learners.
2.3 Data-Driven Hints and the Hint Factory

The Hint Factory is a data-driven approach developed to generate Next-Step hints for students applying rules to solve open-ended problems in well-defined domains where there are multiple valid solutions [51, 52]. New innovations in generating assistance from individual pieces of previous students' solutions have helped researchers extend the ideas of the Hint Factory to generate Next-Step hints for new domains, including novice programming and linked list construction [17, 49, 43, 8]. The Next-Step hints derived by the Hint Factory and used in our tutor are pointing hints that suggest a statement a student could derive using a single domain rule application. Sweller et al. make the case that providing more explicit instruction is better for novices, who need to establish those individual learning blocks before they can create their own mental models [54, 35].

However, research has shown that allowing students to make successful, unaided attempts at solving a problem can provide a higher learning benefit compared to explicit instruction showing them what to focus on [26]. Hint Factory Next-Step hints have been shown to be successful in supporting student learning and problem solving, with students having access to such hints in logic being 3 times more likely to complete the tutor than those without [52]. These results suggest that Next-Step hints are direct and explicit enough to support learning, but since level 1 hints do not provide the full information to achieve a next step, students must do some unaided exploration to achieve the suggested hint statement. On the other hand, Aleven et al. note that a "one size fits all" strategy for guidance is not likely beneficial [1]. Hence, we are inspired to determine whether even less direct data-driven hints may benefit high-ability learners.

2.4 Aptitude-Treatment Interactions

Aptitude-treatment interactions have been widely studied in the educational domain. Prior research in instructional strategies [13, 50] has shown the existence of aptitude-treatment interaction (ATI), where certain students are more sensitive to variations in the learning environment and may be affected differently by the treatment compared to less sensitive students who perform well regardless of the treatment. Educational researchers have discovered ATI effects based on prior experience level, prior working memory, and incoming self-regulated learning ability [24, 27, 19, 62]. For example, Lehmann et al. explored the effect of working memory on learning outcomes in fluency/disfluency groups, where instructional materials had different levels of text legibility [27]. Based on these findings, we believe that there could be an aptitude-treatment effect associated with hint type, and that students with lower incoming proficiency may be more sensitive to hint type.
2.5 Help Avoidance and Unsolicited Hints

Despite this considerable research on assistance, there is a pervasive problem within ITSs called help avoidance, where students do not leverage the intelligence within the system for help [2]. There are many reasons for help avoidance, one of which is that certain students may lack specific meta-cognitive skills like knowing when to ask for help [1]. As a result, some ITSs employ unsolicited hints (i.e., providing hints when needed without request) to prevent help avoidance [41], and we adopt this unsolicited strategy here.

Zhou et al. found that students were more likely to make effective pedagogical decisions at the problem level rather than the step level, meaning that students were less able to make effective decisions when deciding if they needed a hint on a particular problem-solving step [63]. In another study, researchers found that a large number of students using Andes, the physics tutor, would guess instead of requesting hints [46]. Furthermore, higher learning gains have been observed for low-performing students when unsolicited hints were provided [4]. While one study found that students learned more reliably with hints on demand than with unsolicited hints [47], other studies have shown that providing hints at the appropriate time can augment students' learning experience [10, 45], improve their performance [9], and avoid the negative effects of frustration while saving students time by preventing unproductive struggle [40].

Within our tutor, even though students often have difficulty and hints are readily available via the hint button, most students do not request assistance. In Fall 2017, students using our tutor requested a median of zero hints per problem. In this study, to enable us to compare the impact of hint type, we periodically (frequency defined in Section 3.1) provided unsolicited hints to students based on the condition they were assigned. In prior work, we compared our unsolicited hints to the normal conditions in Deep Thought, on-demand hints only, and found that the unsolicited hints had no impact on the performance metrics in the training and no negative impacts on any performance metrics on the posttest [12]. Furthermore, this work found that providing unsolicited hints reduced the number of steps where students needed help but did not receive it, as detected by our Help-Need model [31, 30]. Therefore, we do not believe that our unsolicited hints are disruptive, but we note that providing unsolicited hints has potential for disrupting students' learning. In the next section, Deep Thought and its interface are discussed in detail, and the hints' generation, usage, and frequency are expanded on.
3 The Deep Thought Tutor

The Deep Thought tutor (see Figure 1, described further below) is used in the context of a discrete mathematics course where students first spend 2 weeks learning about truth tables and proving each logic rule is true, in class and in online multiple-choice homework assignments. Then, students learn about formal proofs, where students iteratively apply logic rules to a set of given statements to derive a specified conclusion.

A formal proof works much like any multi-step procedural problem where domain principles are applied to given and previously-derived facts to derive and justify new statements. For example, in physics, students may be given values for mass and acceleration and be asked to determine force. They would then apply the domain principle of F = m * a along with the given values of m and a to derive a new statement about the value of F. In logic, each derived statement must have a justification, which consists of the domain principle and the relevant prior statements it was applied to. This corresponds to the information used to derive F in the previous physics example. In a formal proof, students are given a few statements (the number may vary) that are known to be true, often referenced as "givens", and a conclusion that is to be derived. Then, students must apply logical rules to the givens to derive new statements. The student repeats this process of identifying rules to apply to certain statements until they derive the conclusion. An example of this process in Deep Thought is covered in this section along with a description of the interface.

Within the discrete math course, students next complete partially-worked examples in a fill-in-the-blank type interface where they are given formal logic proofs with one missing part on each step: either the derived statement, or part of the justification that consists of the rule used to derive it and the statements the rule was applied to. Many example logic proofs are worked in class, with students asked to actively solve logic proofs in small groups, and students are provided with several full worked examples in handouts. After this class work and homework, students are assumed to have reasonable familiarity with logic rule application, but need practice in determining which rules to apply in service to a problem-solving goal. Students are then assigned to complete formal logic proofs using our propositional logic tutor, Deep Thought [39].

The intention of the Deep Thought tutor is to provide students with practice on solving logic proofs with a focus on problem-solving efficiency in both time and the number of steps in their solutions, i.e. shorter proofs in less time, and ideally with few mistakes in justifying or deriving new statements. To do so, the tutor must provide basic functionalities including (1) correctness feedback on each step (on both justification and derived statements), and (2) automated detection of proof completion. Like a compiler, Deep Thought provides these functions that identify errors and clearly show when a problem is complete, but do little to help students with the overall goal of reaching a problem solution through deriving and justifying a series of well-chosen statements. To bridge this gap, the Hint Factory was created to provide data-driven assistance that could point students to appropriate subgoal statements to derive [8, 51, 52].

Deep Thought allows students to solve logic proofs graphically, as shown in Figure 1.
On the left of Figure 1, the workspace is labelled. The workspace is where students can select statements (purple, oval-shaped nodes) and apply rules by selecting rules (blue, oval-shaped nodes) from the middle of the screen under the "Rules" section to derive new statements. In Figure 1, there are 4 givens (at the top of the workspace in purple, oval-shaped nodes) and the conclusion (at the bottom of the workspace in a purple, square-shaped node). Each statement is labelled to show the order in which students derived them, with the exception of the givens and conclusion, which are labelled for ease of reference. There is no particular ordering to the givens. Also, there is an example of our hints on the screen in the blue, oval-shaped nodes labelled "Goal."

To derive a new statement, a student must select statement(s) by clicking them, followed by selecting a rule to apply. In response to the student selecting a rule to apply, the tutor has one of 3 responses: 1) if the student is using an applicable rule, i.e. a rule that logically can be applied to the statements, and the new derived statement is the only potential derivation, then the statement is automatically added to the screen; 2) if the student is using an applicable rule and there are multiple potential derivations, e.g. using "Simplification" on the statement "I ∧ F" where either I or F could be the new derived statement, then the student is prompted to enter the statement they want to derive; and 3) if the student has incorrectly selected a rule that doesn't apply to the selected statements, e.g. the rule requires only one statement to be selected, such as "Simplification" or "Implication", and the student has selected two statements, then the tutor provides a pop-up and a description of the error. Note that in response 2, if the student incorrectly types in the statement to be derived in the prompt, the tutor will pop up an error and the student will have to select the rule again to derive the statement. A sketch of this three-way response logic appears below.

When a new statement is added by the student, the statement becomes an oval-shaped node, similar to the givens' shape, but the color depends on the frequency and necessity of the node based on historical data. To help students avoid deriving unnecessary statements/nodes in the training phase, the tutor colors nodes based on their necessity and frequency in our historical dataset of correct solutions by past students. Nodes that were never necessary to derive the conclusion are colored gray, while frequently-necessary nodes are colored green, and infrequently-necessary nodes are colored yellow.

As the student derives new statements, the nodes are added to the proof with arrows pointing to them to show which statements were used and the rule applied to derive the new statement. On the right of the screen, the Info Box contains information communicated by the tutor about the current problem the student is solving, i.e. what rules may be useful to solving the proof, information about hints on the screen, and information about certain buttons a student may try to use. The bottom left of the screen under the workspace contains buttons that a student may use during the training portion of the tutor (skipping a problem and requesting suggestions, i.e. hints, are not available during the testing portions of the tutor). To the bottom right of the screen, buttons are available that show the student general information about the tutor as well as instructions that provide information about solving proofs and the options available for the students.
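To make the three responses concrete, the following Python sketch mirrors the logic described above. It is illustrative only: the helper names (applicable, derivations, error_description) are hypothetical stand-ins, not Deep Thought's actual implementation.

# A minimal sketch of the tutor's three-way response to a rule application.
# `rule` is assumed to expose hypothetical helpers applicable(),
# derivations(), and error_description(); Deep Thought's real code differs.
def respond_to_rule_application(selected_statements, rule):
    if not rule.applicable(selected_statements):
        # Response 3: the rule does not apply (e.g., wrong premise count),
        # so show a pop-up with a description of the error.
        return ("error", rule.error_description(selected_statements))
    results = rule.derivations(selected_statements)
    if len(results) == 1:
        # Response 1: only one possible derivation, added automatically.
        return ("derived", results[0])
    # Response 2: multiple potential derivations (e.g., Simplification on
    # "I ∧ F"), so prompt the student for the statement they intend to derive.
    return ("prompt", results)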
Fig. 1 On the left of the screen is the Deep Thought workspace. Below the workspace are the hint button and hint message box, the rules are in the middle, and to the right is the Dialogue Box where messages related to unsolicited hints as well as problem information are given.

As stated above, Deep Thought is intended to teach students to solve proofs more efficiently, in terms of time and steps taken to reach the conclusion. The tutor presents proof problems as an initial set of given statements with a conclusion to derive from them using logic rules. Each statement, given or derived, is represented by a node, with the conclusion represented by a node with a question mark '?' above it, indicating that it has not yet been justified (shown to be true using logic rules applied iteratively to the givens).

Each problem-solving step consists of two parts: the justification and the derived statement. The justification is the set of 1-2 existing nodes and the rule applied to them, and the derived statement is the result. Students complete the justification by clicking to select 1-2 nodes and clicking on a rule to apply. Students then type in the derived statement that results from applying the rule to the selected statement nodes. For example, Figure 2 shows a formal proof beginning with 3 givens (at the top of the workspace in purple, oval-shaped nodes) and the conclusion (at the bottom of the workspace in a purple, square-shaped node). The student (1) selects the statement "I ∧ F" and (2) applies the "Simp" rule, i.e. Simplification, to (3) derive a new statement "F". To solve the proof, the student would continue identifying combinations of statements and rules to apply until they derive the conclusion statement "J ∧ K".

Throughout the tutor, including the pre- and post-test problems, Deep Thought provides immediate error feedback for mistakes, either in justifications or derived statements. If a student clicks on the wrong rule, or their derived statement does not follow from the selected nodes and rule, Deep Thought shows a popup message and records the error. For example, if a student selects two nodes and then clicks on the Simp rule, the error prompt reads "Rule requires one premise," then fades away. If the student enters a derived statement that is true, and the justification (consisting of the selected nodes and rule to derive it) is correct, then a new node with the derived statement appears in the workspace.

To complete a problem, the student must iteratively derive and justify new statements until the conclusion statement is derived and justified. When students have completed a problem, the conclusion's question mark is removed, and it is visually connected to the givens through a series of derived nodes and arrows indicating their justifications. Since the system automatically checks each step and detects completion in all phases of the tutor, student solutions cannot be incorrect, but some may be more expert than others. Students are considered to have learned the topic when they perform well on the posttest (described below), especially with regard to problem solutions with fewer steps and fewer mistakes in less time.

Deep Thought includes four phases: introduction, pretest, training, and posttest. The introduction consists of three problems, including two worked examples, where students click through the addition of successive nodes until a conclusion is derived, and a third problem students solve alone to learn the interface.
Then, students take the pretest, consisting of solving a single problem with no hints available. The pretest is used to measure students' incoming proficiency and assign them to conditions via stratified sampling.
Fig. 2
Deriving a new justified node. (1) Selecting the node "I ∧ F" to use. (2) Selecting the rule "Simplification" to apply. (3) The screen after the rule was clicked, showing "F" as a justified node.

Next, students solve 18 problems in the training section. For each training problem, the dialogue box provides information on what rules to focus on while solving a problem, such as "Think about the following rules for this problem: MP, Simp, Add." Students also receive contextual, data-driven hints during training, including both unsolicited hints generated by the system and on-demand hints upon student request, all generated using the same Hint Factory-type approach described below. After completing training, students take a more difficult, non-isomorphic posttest, where they must solve four problems without any help or assistance. Since the posttest is not isomorphic to the pretest, we do not expect the posttest performance to be directly comparable to the pretest performance. Rather, we use the pretest to balance incoming proficiency across groups via stratified sampling, and focus on comparing posttest performance between groups.

Expert solutions for all tutor problems range from 5-8 steps, and student solutions typically contain 5-20 steps. Longer student solutions may simply be inefficient, taking more steps than needed, or they may contain unnecessary nodes that do not lie on a direct path from the givens to the conclusion (unnecessary nodes in a complete solution are easy to detect, because removing them does not disconnect the conclusion from the givens, but they are difficult to detect during problem solving). As first mentioned in Section 3, to help students avoid deriving unnecessary statements/nodes in the training phase, the tutor colors nodes based on their necessity and frequency in our historical dataset of correct solutions by past students. Nodes that were never necessary to derive the conclusion are colored gray, while frequently-necessary nodes are colored green, and infrequently-necessary nodes are colored yellow.

3.1 Assistance

In training problems, students may receive unsolicited hints, depending on their assigned study condition, as well as request on-demand hints. On-demand and unsolicited hints provide the same content. All hints provide a target statement to derive, appearing as a node with a '?' in the workspace. In previous work, we showed this method of providing unsolicited hints, Assertions, resulted in better performance than text-based messages as a method of unsolicited hint delivery [31]. For the remainder of the paper, we refer to both solicited and unsolicited hints as hints.

Deep Thought includes several measures intended to prevent gaming the system, where students attempt to use system features to avoid work, or help abuse, where students request hints when they do not need them [6]. First, whenever a hint is already in the workspace, students may not receive another hint, whether it was solicited or provided automatically by the tutor. Second, no further details are provided for any hint, meaning there is no such thing as a bottom-out hint in this study. In past Hint Factory implementations, we have provided students 4 levels of hints that (1) suggested the next step, (2) the specific rule, (3) the prior statements needed, and finally (4) a bottom-out hint with all this information.
In this study, we use only level 1 pointing hints, and disabled hint levels 2-4.

The tutor generates hints using historical student data from four semesters, each semester with approximately 250-300 students using the tutor. Both hint algorithms produce assistance based on the most frequent and efficient paths available in the student's current proof.

We use the Hint Factory [51] approach to generate hints. The Hint Factory [51, 52, 7] is a data-driven method to generate hints by transforming historical student problem-solving attempts into a Markov Decision Process, using observed frequencies as transition probabilities, and estimating the expected value of each previously-observed problem state based on assigning rewards to complete solutions, small negative rewards (i.e. costs) to steps to positively reward more efficient solutions, and large negative rewards to errors to de-emphasize solutions that cause many students to make mistakes. Individual student problem-solving attempts are represented by a series of states, or snapshots of the work done so far, where transitions occur between states when students add or delete problem nodes, or make an error. The Hint Factory is described in detail in Barnes and Stamper's chapter in the 2011 Handbook on Educational Data Mining [7]. All student solutions are combined into an interaction network [15] that reflects all previously-observed solutions to one specific problem. When a hint is requested by the student or tutor, the Hint Factory is used to select a target problem-solving state with the highest expected value (a sketch of this value computation appears below). Note that this process can be done offline, and a simple table can be used to store problem-solving states and their corresponding hint content for real-time hint provision. Then, the latest statement derived in that state is used as the pointing hint to help students know what to try to derive next.

In this study, we do not provide further information on how to derive or justify the suggested statement, i.e. the statements that a student needs to select and the rules the student may need to apply are not provided to the student, meaning that all hints in this paper can be considered as partially-worked example steps.
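To ground this description, here is a minimal sketch of the value computation, assuming a toy interaction network stored as {state: {action: next_state}}; the reward constants and data layout are illustrative assumptions, not the published Hint Factory parameters.

# Hedged sketch: estimate state values over an interaction network.
# network: {state: {action: next_state}} observed in prior solutions.
# goal_states: states containing a completed proof; error_steps: set of
# (state, action) pairs where students made errors.
GOAL_REWARD, STEP_COST, ERROR_COST, GAMMA = 100.0, -1.0, -10.0, 0.9

def state_values(network, goal_states, error_steps, sweeps=100):
    V = {s: 0.0 for s in network}
    for s in goal_states:
        V[s] = GOAL_REWARD  # complete solutions get a large reward
    for _ in range(sweeps):
        for s, actions in network.items():
            if s in goal_states or not actions:
                continue
            # Steps cost a little and observed errors cost a lot, so paths
            # that caused many student mistakes are de-emphasized.
            V[s] = max(
                (ERROR_COST if (s, a) in error_steps else STEP_COST)
                + GAMMA * V.get(nxt, 0.0)
                for a, nxt in actions.items()
            )
    return V

The published method weights transitions by observed frequencies to compute expected values; the max over observed actions above is a simplification for brevity.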
In this study, two hint types are used: Next-Step and Waypoint hints. Figures 3 and 4 show the two forms of data-driven hints, Next-Step (NS) and Waypoints (WP), respectively, and how students would approach deriving the suggested hint statement for each type. Descriptions of how each hint type is generated, as well as how to derive each hint, are expanded in the following paragraphs.
Next-Step hints are generated using the Hint Factory method as described above, with the target state selected to be the one with the highest expected value that occurs within one rule application from the student's current state. Simply, Next-Step hints suggest the best proposition that can be derived in one step from the student's current proof. This corresponds to the next-step hints derived in all of our prior work using the Hint Factory [8, 7, 52, 51, 14, 15, 43, 44].

Since Next-Step hints are partially worked, they allow students to focus on how to justify them, and reflect on why they were suggested. This removes a considerable load; without a hint, students must also search among many options for what best to derive next. For example, Figure 3 demonstrates the ideal derivation of a Next-Step hint. In Figure 3, Deep Thought has 3 givens at the top of the workspace and one hint statement labelled "Goal", F, on the screen. To derive the hint, the student (1) selects the I ∧ F statement by clicking it. After selecting the statement, which is now shown highlighted in blue, (2) the student clicks the rule labelled "Simp" to apply Simplification to the statement. A pop-up will appear for the student to type in what they are attempting to derive, in which case they enter F. After entering F into the prompt, (3) the statement is shown incorporated into the student's solution in the same fashion regular derivations happen, as in Figure 2, with arrows, coloring, and labelling. In this case, the justified hint appears on the screen as a green, oval-shaped node with an arrow pointing to it from the I ∧ F statement, with the label "4:" to indicate this statement is the fourth statement justified (givens are automatically numbered) and the label "Simp" to indicate the statement was derived using the Simplification rule.

Waypoint hints are generated with the same method as Next-Step hints; however, instead of selecting a hint 1 step away from the student's current state, hints that are 2-3 steps away from the student's current state are selected. A primary motivation for this study was to determine a simple way to extend the Hint Factory to provide less direct data-driven hints, i.e. compared to Next-Step hints, without the need for expert authoring. In our prior work, we derived a new method called data-driven Approach Maps, which applies hierarchical graph mining to interaction networks to discover problem-solving states that represent critical junctures in problem-solving attempts, which we call subgoals [14]. These subgoals occurred every 2-3 steps/states in our short logic proofs (which are typically 5-12 steps long). These subgoals inspired our Waypoints, but we wanted to be able to generate these hints with an easier method that is more extensible to other researchers who may already be using the Hint Factory or methods based on it.

To generate Waypoints without the need to apply data-driven Approach Maps, we modified the Hint Factory to select a target statement that was 2-3 steps away from the current state. Among states that were 2 or 3 steps away, we selected the state with a higher frequency within prior correct solutions. This resulted primarily in states that need only two rule applications to derive, since the diversity of student solutions means that frequency typically decreases in interaction networks the further states are from the start. A sketch of both selection rules appears below.
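The following sketch shows both selection rules, assuming a successors(state, depth) helper that returns the set of states reachable in exactly depth correct steps in the interaction network, plus the state values V and solution frequencies freq described above; all names are illustrative.

# Hedged sketch of hint target selection.
def next_step_target(state, successors, V):
    # Next-Step: the highest-value state one rule application away.
    options = successors(state, 1)
    return max(options, key=lambda s: V.get(s, float("-inf"))) if options else None

def waypoint_target(state, successors, freq):
    # Waypoint: the most frequent state exactly 2 or 3 steps away in prior
    # correct solutions; frequency tends to favor 2-step states in practice.
    options = successors(state, 2) | successors(state, 3)
    return max(options, key=lambda s: freq.get(s, 0)) if options else None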
By expert review of a random sample of Waypoints, we verified that this simple algorithm results in hints similar to those generated using data-driven Approach Maps [14].

Waypoints are intended to serve as subgoals, giving students more room to explore the solution space and develop their own problem-solving strategies. Since Waypoints cannot be achieved with a single rule application, they require students to make their own problem-solving plan to derive them, considering the existing problem statements and how rules might be applied to them to derive and justify the suggested Waypoint statement. For example, Figure 4 demonstrates the ideal derivation of a Waypoint hint.
In Figure 4, (1) Deep Thought has 3 givens at the top of the workspace and one hint statement labelled "Goal", G ∧ ¬H, on the screen. To derive the hint, the student first selects the I ∧ F statement by clicking it, then the student clicks the rule labelled "Simp" to apply Simplification to the statement. A pop-up will appear for the student to type in what they are attempting to derive, in which case they enter F. Note, this step is not shown in the figure, although it is the same process as described in Figure 3. After entering F into the prompt, the F statement is shown incorporated into the student's solution. Next, the student must make a second derivation to derive the hint. The student (2) selects the F → (G ∧ ¬H) statement and the F statement by clicking each one individually, which highlights both nodes, then the student clicks the MP rule to apply Modus Ponens to the statements. As a result, the statement is automatically derived, since the derived statement is the only option, and the justified hint appears on the screen as a green, oval-shaped node with arrows pointing to it from the F → (G ∧ ¬H) statement and F statement, with the label "5:" to indicate this statement is the fifth statement justified (givens are automatically numbered) and the label "MP" to indicate the statement was derived using the Modus Ponens rule.

For both Next-Step and Waypoint hints, the process of deriving the hint is the same: students must select statement(s) and apply rules to derive new statements, which we also refer to as "steps". The only difference is how many times a student must repeat this process to derive the hint statement (ideally once for Next-Step and twice for Waypoint hints, with the exception of some Waypoints that may take three steps to derive). With that in mind, students may also derive new statements that do not contribute to deriving the hint statement; however, when we refer to how many steps it takes for a student to derive a hint, we are speaking in ideal terms.

We consider a continuum of goals for students, where Next-Step hints ideally take one step to derive, Waypoints take 2-3, and the problem conclusion takes about 5 expert steps. With longer problems or more complex problem domains like programming, we would recommend using a more complex algorithm to select Waypoints if they were shown to be effective. In logic proofs, the shortest proof is considered to be the best, so simple metrics on interaction networks can quickly discover optimal solutions and those that many students can discover.

As stated above, Deep Thought only provides pointing hints to suggest statements that can be derived; neither Next-Step nor Waypoint hints tell students which rules to use to derive them; rather, they help students solve problems by suggesting a subgoal that helps them break down multi-step problems. To use a hint in their proof, the suggested hint statement must be justified by applying a rule to previously-justified or given statement(s). Statements that are not justified appear in the tutor interface with a "?" above them to indicate that they need to be derived.

We implemented unsolicited hints so they appear randomly and with enough uniformity and frequency that even students with short proofs would receive hints.
Fig. 3 Next-Step hint. (1) A Next-Step hint appears, F. (2) The student has selected I ∧ F and is applying the Simplification rule. (3) F has been justified.
Fig. 4 Waypoint hint. (1) A Waypoint hint appears, G ∧ ¬H. (2) The first derivation using Simplification has already been completed. The student has selected F → (G ∧ ¬H) and F and is applying MP (Modus Ponens). (3) G ∧ ¬H has been justified.

One limitation of this method of providing hints is that hints were not necessarily provided when they were most needed, which may affect learning outcomes. However, since students in the tutor rarely request hints, it was necessary to provide the hints automatically and frequently to enable us to evaluate our hypotheses. For the Next-Step group, we capped the number of unsolicited hints at 1/…

4 Methods

The Deep Thought tutor was used as a homework assignment for an undergraduate 'discrete mathematics for computer scientists' course in the Fall 2018 semester at a large research university. We analyzed 143 students' data from two test conditions to investigate the impact of hint type on student performance and behavior. Both conditions were identical except for hint type, Next-Step or Waypoint. We used stratified sampling based on pretest performance, then randomly assigned students to Next-Step hints (NS, n = 71) or Waypoints (WP, n = 72), ensuring both conditions were balanced in incoming knowledge. Before analysis, students who dropped the tutor before completion and students with technical errors in their data were removed (NS n = 15, WP n = 14), leaving 56 students in the NS condition and 58 students in the WP condition, for a total of 114 students.

4.1 Hypotheses

The goals of this study were to 1) evaluate the effectiveness of a new hint type, Waypoint hints, 2) compare the impacts of Waypoints and Next-Step hints on performance, and 3) determine if proficiency had an effect on which hint type was more beneficial. Based on prior literature, we developed the following hypotheses:

– H1: Next-Step hints will improve performance for students with lower incoming proficiency.
– H2: Waypoint hints will improve performance for students with higher incoming proficiency.
– H3: Waypoint hints will be more difficult to derive, resulting in a lower justification rate and performance during training compared to Next-Steps.

These hypotheses were based on the basic assumption that Waypoint hints are more difficult to justify and adopt, since Waypoints require students to derive more steps to justify them. On the other hand, this challenge may be precisely what high-proficiency students need for improved learning. To evaluate these hypotheses, we focused on the performance metrics discussed below.

4.2 Performance Evaluation Metrics

In this section we describe the metrics used to evaluate student performance. Recall that the tutor begins with an introduction with two worked examples and one practice problem, followed by the pretest. We used each student's pretest score to measure incoming knowledge/proficiency. Equation 1 shows how the score is calculated. Each metric is normalized, then the time and step metrics are subtracted from 1 to be comparable to accuracy, i.e. so that for time, steps, and accuracy a number closer to 1 indicates the student is performing well.
A student's score is a combination of percentiles for the pretest time, number of steps, and accuracy on a single problem, ranking students based on how fast, efficient, and accurate they are compared to their peers. We chose these features because they each represent a different aspect of a student's problem-solving experience.

Recall that the tutor was designed to improve time and steps to solve problems, and assumes a basic level of fluency or accuracy in rule applications. Therefore, we have no goals or expectations of improving accuracy with this tutor. However, the score includes all three metrics to ensure that our interventions do not decrease accuracy while attempting to improve time and steps. For example, a student may take a short amount of time on a problem but make many mistakes, resulting in a lower accuracy. We use a median split on the combined pretest score to assign students into High and Low proficiency groups for some analyses.
Score = (1 − TotalTime) · (1/3) + (1 − TotalSteps) · (1/3) + Accuracy · (1/3)    (1)
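For illustration, Equation 1 can be computed as in the following sketch, where the percentile helper is one straightforward reading of the normalization described above (the sample numbers are made up):

import numpy as np

def percentile(values):
    # Fraction of peers at or below each value, in [0, 1].
    v = np.asarray(values, dtype=float)
    return np.array([(v <= x).mean() for x in v])

def pretest_scores(total_time, total_steps, accuracy):
    # Equation 1: time and steps are inverted so that 1 is best on all parts.
    return ((1 - percentile(total_time)) / 3
            + (1 - percentile(total_steps)) / 3
            + percentile(accuracy) / 3)

# Median split into High and Low incoming-proficiency groups.
scores = pretest_scores([300, 410, 250, 520], [12, 20, 9, 15],
                        [0.90, 0.70, 0.95, 0.80])
high_group = scores >= np.median(scores)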
Total time is counted from the moment a problem begins until it is solved by deriving and justifying the conclusion. Total steps in a problem include any attempt at deriving a new node, which includes correct and incorrect steps. Accuracy is the percentage of correct steps out of the total steps, which is expected to start relatively high due to prior exposure in the class, and to increase as students practice. Note that the tutor is not designed or assumed to promote large improvements in accuracy, since no penalties are assigned for incorrect rule applications; the tutor simply alerts students upon wrong rule applications and students may try again, even within the pre- and post-tests. Further, problems require new rules and become more difficult as the students progress. As we seek primarily to promote more efficient problem solving, we focus more on steps and time per problem while maintaining reasonable accuracy. This is because it is more difficult for students to learn to determine which steps to derive to achieve shorter, more efficient proofs, compared to learning how to apply the rules, which can be done by memorization and simple practice. Deep Thought is built primarily to allow students to practice the strategy of problem solving, rather than fluency with rules, most of which are assumed to be learned before the tutor.

One important thing to note is that Deep Thought does not include eye-tracking, and the unsolicited hints are provided regardless of whether a student needs them or not, so we cannot determine precisely whether students followed a hint or incidentally derived the hint statement. Therefore, we have defined metrics to quantify when students justified a hint by selecting the statements and rule needed to derive it, as well as when students adopted a hint by first justifying it and then using it directly on their path to derive the conclusion. These two hint-specific metrics are the hint Justification Rate and
Adoption Rate.

The hint Justification Rate is the number of unsolicited hints justified (correctly identifying the rule and prior nodes needed to derive the suggested node) divided by the total number of hints given across the training problems. A hint is said to be justified when a student applies logic rules to existing logic statements to derive the hinted logic statement; when a hint is justified, the tutor removes its '?' and connects it to its predecessor nodes with arrows labeled with the rule used to derive it. A hint justification provides evidence that a student noticed the hint and knew how to apply rules to justify it, but does not tell the full story. As in any problem-solving context, statements can be derived that are not needed in a final solution. Therefore, we also measure the hint Adoption Rate: whether a hint contributes towards deriving the conclusion. A justified hint can be reached on a path from the problem's given statements. When a hint is adopted, it must first be justified and then become necessary to a student's final solution; in other words, the problem would be incomplete if the hinted statement were removed. This is shown visually when a directed path can be found from the hinted statement node to the problem conclusion. Figure 5 shows a completed problem with labels indicating which nodes are considered justified and which nodes were also adopted for the solution. A sketch of the adoption check appears below.
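The adoption check reduces to reachability on the finished proof graph. A minimal sketch, assuming the solution is stored as a mapping from each node to the set of nodes derived from it (an illustrative data layout, not the tutor's internal one):

def is_adopted(derives, hint_node, conclusion):
    # True when a directed path of derivations links a justified hint
    # to the conclusion, i.e. the hint is necessary to the final proof.
    stack, seen = [hint_node], set()
    while stack:
        node = stack.pop()
        if node == conclusion:
            return True
        if node in seen:
            continue
        seen.add(node)
        stack.extend(derives.get(node, ()))
    return False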
We also investigated impacts on help-seeking through the number of on-demand hint requests (when students click the "Get Suggestion" button). Total Requests represents the number of hint requests during the training portion. Data were analyzed to compare groups for the pretest, training, and posttest portions of the tutor. Within each hint group, we also compared performance of students with High or Low pretest scores, based on a median split on the pretest score.

To determine significant differences between hint types, we applied one-way ANCOVA using the pretest as a covariate, with Benjamini-Hochberg corrections to account for multiple tests. To check that the data met assumptions for ANCOVA, we used the Shapiro-Wilk W test and Levene's test, as well as visually inspecting the data via Q-Q plots and histograms. Data that did not meet the assumptions were transformed using log or square-root transformations, then re-inspected. For clarity, all data in tables are reported before transformation. A sketch of this analysis pipeline appears below.
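For illustration, the sketch below shows an equivalent pipeline in Python with statsmodels; the original analysis was run separately and is not claimed to match this code line for line.

import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm
from statsmodels.stats.multitest import multipletests

def ancova_pvalues(df, metrics):
    # df: a pandas DataFrame with columns condition, pretest, and one
    # column per metric (assumed layout). One-way ANCOVA per metric.
    pvals = []
    for m in metrics:
        model = smf.ols(f"{m} ~ C(condition) + pretest", data=df).fit()
        pvals.append(anova_lm(model, typ=2).loc["C(condition)", "PR(>F)"])
    # Benjamini-Hochberg correction across the family of tests.
    reject, adjusted, _, _ = multipletests(pvals, method="fdr_bh")
    return dict(zip(metrics, zip(adjusted, reject)))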
Fig. 5
A completed problem with nodes that were used to derive the conclusion (justified and adopted) and one node that was not used to derive the conclusion (justified but not adopted).

5 Results

Table 1 shows the overall hint metrics for each group during training. We expected the
Total Added (p < .01) and Steps Until Justified (p < .01) metrics to be significantly different, since each step of a problem can have a unique Next-Step (NS) hint, but one Waypoint (WP) requires multiple steps to be derived. Based on prior literature on help avoidance and low help usage within tutors [43], we were pleasantly surprised to find that students in both groups had relatively high justification and adoption rates. The Next-Step group justified a significantly (p < .01) higher percentage of hints, as shown by the Justification Rate. Additionally, among the justified hints, we saw a significantly (p = .01) lower Adoption Rate of the WP hints in students' final proofs; no significant interaction was observed with pretest proficiency. Although both groups' rates are relatively high, the WP group's lower justification and adoption rates are concerning.

This provides evidence in support of H3, that Waypoint hints would be harder to derive; however, this evidence does not address whether this was due to the difficulty of the WP hints or students' lack of effort to derive them. We explore the possible reasons for these differences later in this section.
Table 1 Hint metrics during training. For ANCOVA results controlling for the pretest score, p-values that are at least marginally significant are bolded and significant values also have an asterisk (*).
Metric                  NS (n = 56)   WP (n = 58)
                        Mean (SD)     Mean (SD)     p
Justification Rate      89% (7)       84% (12)      < .01*
Adoption Rate           83% (10)      74% (17)      .01*
Steps Until Justified   1.1 (0.1)     2.2 (0.3)     < .01*
Total Added             49 (9)        30 (7)        < .01*

To understand the overall impact of Next-Step versus Waypoint hints, we examined the performance of both groups on the tutor pretest, training, and posttest, for all students regardless of incoming proficiency, as shown in Table 2. There were no significant differences between the WP and NS groups on the pretest, although the pretest was slightly worse for the NS group. During training, the NS group significantly outperformed the WP group with fewer steps, less time, better accuracy, and better overall score (Total Time p < .01, Total Steps p = .02, Accuracy p = .0…), with all Next-Step students outperforming all WP students. We examined help-seeking behaviors during training and found the NS group requested significantly more hints, although still a small number overall (approximately 1 per problem for NS versus 0.5 per problem on average for WP), so it is not likely that hint requests account for the difference in training performance.
Table 2 Performance metrics for each group on the pretest, training, and posttest; p-values that are at least marginally significant when applying ANCOVA controlled for pretest are bold, and those that are significant also have an asterisk (*).

                        NS (n = 56)   WP (n = 58)
Metric                  Mean (SD)     Mean (SD)     p-value
Pretest
  Total Time
  Total Steps
  Accuracy
Training
  Total Time (min)                                  < …
  Total Steps                                       .02*
  Accuracy
  Total Requests
Posttest
  Total Time (min)
  Total Steps
  Accuracy
More importantly, on the posttest, the NS group significantly outperformed the Waypoint group on Total Steps (p = .02) and Accuracy (p = .0…). This may reflect that students receiving Next-Step hints did not have to determine what to derive next when receiving hints, just the how, and that the suggested hints were efficient, reducing total steps.

5.1 Effects on High- and Low-Pretest Groups

Our hypotheses focused on the differential impact of hints based on incoming proficiency and the difficulty of applying Next-Step versus Waypoint hints. To investigate these hypotheses, we checked for differences in performance between prior proficiency groups within each condition. We performed a median split for incoming proficiency based on pretest scores and compared performance metrics across conditions and proficiency (NS-High n = 27, WP-High n = 30, NS-Low n = 29, WP-Low n = 28).

First, we examined performance metrics for the High group, shown in Table 3. There were no significant differences between the NS and WP High groups on the pretest. For the training, the WP group took longer and made more mistakes, as indicated by the Total Time (F(2, 54) = 17.…, p < .0…) and Accuracy (F(2, 54) = 5.…, p < .0…) metrics, along with two further significant differences (F(2, 54) = 3.…, p = .0…; F(2, 54) = 4.…, p = .0…). Thus H2 was rejected; Waypoint hints did not improve performance for higher proficiency students.
Performance metrics between the NS and WP
High proficiency groups for thepretest, training, and posttest of the tutor. ANOVA results are reported for the pretest.ANCOVA results, controlling for the pretest, are reported for the training and posttest,with p-values that are at least marginally significant in bold and significant p-values alsohave an asterisk *.
High Proficiency
                      NS-High (n = 27)   WP-High (n = 30)
Metric                Mean(SD)           Mean(SD)           p-value
Pretest
  Total Time (min)    …                  …                  …
  Total Steps         …                  …                  …
  Accuracy            …                  …                  …
Training
  Total Time (min)    …                  …                  < …
  Total Steps         …                  …                  …
  Accuracy            …                  …                  < …
  Total Requests      …                  …                  …
  Justification Rate  …                  …                  …
  Adoption Rate       …                  …                  < …
Posttest
  Total Time (min)    …                  …                  …
  Total Steps         …                  …                  …
  Accuracy            …                  …                  …
Next, we examined performance metrics for the Low pretest group. There were no significant differences between the NS and WP Low groups on the pretest. For the training, the WP group took longer and attempted more steps, as indicated by Total Time (F(2, 54) = 3.…, p = 0.02) and Total Steps (F(2, 54) = 12.…, p < .0…); the WP-Low group also had a significantly lower hint Justification Rate (F(2, 54) = 7.…, p < .0…). On the posttest, the WP-Low group continued this pattern, taking more time (F(2, 54) = 1.…, p = 0.09) and attempting more steps (F(2, 54) = 3.…, p = 0.02), with marginally significant and significant results, respectively, indicating that the (hypothesized) worse performance in the training portion may have transferred to their overall proof-solving strategies on the posttest. No interactions were found between pretest proficiency and performance or hint metrics on the training or posttest.

These results confirm hypothesis H that Waypoints are more difficult for students and have a negative impact on training performance. We hypothesized in H that Next-Step hints would improve (training and posttest) performance compared to Waypoint hints for low proficiency students. The overall performance (Table 2) confirmed that the Next-Step hint group produced better training and posttest performance. However, Table 4 confirms that the benefits on the posttest are more prominently seen for the students with lower incoming proficiency, confirming H. We hypothesized in H that the Waypoint hints would cause lower justification rates and worse training performance due to their increased difficulty; this is seen with both the WP-High and WP-Low groups, confirming that hypothesis. There was also a significant difference in the Adoption rates between the NS and WP groups for both High and Low students, with the WP adoption rates being lower. This suggests that students were not, in fact, able to independently discover the strategies that underlie the WP hints.

Table 4
Performance metrics between the NS and WP Low proficiency groups for the pretest, training, and posttest. ANOVA results are reported for the pretest. ANCOVA results, controlling for the pretest, are reported for the training and posttest; p-values that are at least marginally significant are in bold and significant p-values also have an asterisk *.
Low Proficiency
                      NS-Low (n = 29)   WP-Low (n = 28)
Metric                Mean(SD)          Mean(SD)          p-value
Pretest
  Total Time (min)    …                 …                 …
  Total Steps         …                 …                 …
  Accuracy            …                 …                 …
Training
  Total Time (min)    …                 …                 …
  Total Steps         …                 …                 < …
  Accuracy            …                 …                 …
  Total Requests      …                 …                 …
  Justification Rate  …                 …                 < …
  Adoption Rate       …                 …                 …
Posttest
  Total Time (min)    …                 …                 …
  Total Steps         …                 …                 …
  Accuracy            …                 …                 …
Although we expected a lower hint Justification Rate in the WP group, we thought that the increased difficulty would benefit high proficiency students by allowing them more exploration of the problem space. Therefore, we hypothesized in H that students with higher incoming proficiency would do better on the posttest after experiencing the WP hints in training. However, that was not the case. The WP-High group was only able to perform similarly to the NS-High group, and overall performed worse, although not significantly. Therefore, H is rejected. However, the results do seem to indicate that the high incoming-proficiency students were less affected by the treatment than the low incoming-proficiency students, given that more significant differences between conditions appeared in the low incoming-proficiency group. As mentioned earlier, we expected that an aptitude-treatment interaction (ATI) might occur, where certain students are more sensitive to variations in the learning environment and may be affected differently by the treatment, compared to less sensitive (more proficient) students who are able to perform well regardless of treatment.

5.2 Did Waypoints help with strategy for those who could utilize them?

Although the performance results caused us to reject H, we wanted to investigate whether WP hints provided strategy-related benefits to those students who were able to use them. Therefore, we performed correlation analyses, using the Pearson correlation coefficient, between the hint Justification and Adoption rates and posttest performance metrics. For the correlation analyses, we used the R function corr.test, which computes Pearson correlation coefficients, tests their significance using t-tests, and performs optional corrections for multiple comparisons, which we specified as Bonferroni corrections [48].
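The snippet below sketches this correlation analysis with psych::corr.test on simulated data; the column names (justification_rate, adoption_rate, post_time, post_steps) are hypothetical stand-ins, while the adjust = "bonferroni" argument requests the correction described above.

    library(psych)  # provides corr.test [48]

    # Simulated stand-in for the WP group's per-student data
    set.seed(2)
    wp <- data.frame(justification_rate = runif(58),
                     adoption_rate      = runif(58))
    wp$post_time  <- 30 - 10 * wp$adoption_rate      + rnorm(58, sd = 4)
    wp$post_steps <- 80 - 25 * wp$justification_rate + rnorm(58, sd = 8)

    # Pearson correlations of hint-usage rates with posttest metrics,
    # with t-tests and Bonferroni-adjusted p-values
    corr.test(wp[, c("justification_rate", "adoption_rate")],
              wp[, c("post_time", "post_steps")],
              method = "pearson", adjust = "bonferroni")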
Table 5 shows the significant correlations of hint Adoption and Justification rates with performance metrics for the NS and WP groups on the posttest, as well as correlations within the incoming proficiency groups. For the NS group, the only significant correlation found was for the NS-High group, between hint Adoption Rate and Total Steps (p = 0.06). For the WP group, hint Justification Rate was negatively correlated with Total Time (p = 0.03) and with Total Steps (p = 0.0…), and hint Adoption Rate was negatively correlated with Total Time (p < .01) and Total Steps (p < .0…): justifying and adopting WPs were both associated with more efficient proofs that were shorter and achieved in less time. There was also a significant moderate, negative correlation for the WP-Low group between Total Steps (p = 0.02) and hint Adoption Rate, while the WP-High group had a moderate, negative correlation between Total Time (p = 0.02) and hint Adoption Rate.

This result aligns with our reasoning behind H, that Waypoint hints should improve efficiency- and time-related metrics on the posttest, especially for higher proficiency students. However, ultimately, the WP students performed worse. Based on these results, we conclude that more support may be needed for WPs so that students can utilize them as well as NS hints and better achieve efficiency-related benefits.

Table 5
Significant correlations of hint Justification and Adoption rates with posttest performance metrics, for each hint-type group and pretest group.
Condition   Split   Metric-Pair                       Corr    p
NS          High    Adoption-Total Steps              -0.38   0.06
WP          All     Justification-Total Time (min)    -0.30   …
                    Justification-Total Steps         -0.32   …
                    Adoption-Total Time (min)         -0.35   < …
                    Adoption-Total Steps              -0.40   < …
            High    Adoption-Total Time (min)         -0.40   …
                    Adoption-Total Steps              -0.41   …
            Low     Adoption-Total Steps              -0.39   …
To investigate H, we examined how many unused (unjustified) hints students nevertheless attempted to justify. The significantly lower hint Justification Rate of the WP group, shown in Table 1, and the significantly worse training performance of the WP group, shown in Table 2, led us to seek a better understanding of the circumstances surrounding why the WP hints were used proportionately less.

The hint Justification and Adoption rates can only tell us that students were, or were not, using the hints; they provide no insight into whether students were actively attempting to derive a hint. Therefore, we conducted analyses to determine whether the WP hints were truly harder to derive (H). Because WP hints are more steps away than NS hints, students may see a WP hint as too complicated to be helpful and simply ignore it outright; however, if students attempt to derive a hint and are unable to succeed, this is a larger concern. To determine whether students were attempting to derive a hint, we examined the steps taken after the hint was added (3 steps ahead for NS and 5 steps ahead for WP). If a majority of the examined steps contained variables that also appeared in the hint (2 out of 3 steps for NS and 3 out of 5 steps for WP), the hint was considered attempted; a sketch of this heuristic appears below.
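The R sketch below is one possible reading of this heuristic. The step and hint representations (plain strings whose propositional variables are single uppercase letters) and the helper names are our own assumptions; only the look-ahead windows (3 for NS, 5 for WP) and majority thresholds (2 and 3) come from the text.

    # Extract the propositional variables (assumed here to be single
    # uppercase letters) appearing in a proof statement
    vars_in <- function(statement) {
      unique(regmatches(statement, gregexpr("[A-Z]", statement))[[1]])
    }

    # A hint counts as "attempted" if at least `threshold` of the next
    # `k` steps share a variable with the hint (k = 3, threshold = 2 for
    # NS; k = 5, threshold = 3 for WP)
    attempted <- function(hint, next_steps, k, threshold) {
      window <- head(next_steps, k)
      shared <- vapply(window,
                       function(s) length(intersect(vars_in(s), vars_in(hint))) > 0,
                       logical(1))
      sum(shared) >= threshold
    }

    # Hypothetical WP example: only 2 of the 5 steps touch the hint's
    # variables, so the hint is not counted as attempted (returns FALSE)
    attempted("A -> C", c("A", "B -> B", "A & C", "D", "B"), k = 5, threshold = 3)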
Table 6 shows the total unused hints. Total Unused represents the total number of unused hints per person in each group. % Unused/Total is the total number of unused hints divided by the total number of hints added, which gives a clearer picture of the percentage of hints each student left unused relative to how many they were given. % Attempted/Unused is the total number of attempted hints divided by the total number of unused hints, i.e., the percentage of unused hints that were attempted. There was not a significant difference in the total number of unused hints between the groups (F(2, …), p = 0.…), but the groups differed significantly on % Unused/Total (F(2, …), p < .0…) and on % Attempted/Unused (F(2, …), p = 0.0…).

Table 6
The total unused (unjustified) hints, the percentage of hints unused out of all hints added, and the percentage of unused hints that students attempted to derive, for the NS and WP groups. For ANCOVA results controlling for the pretest score, p-values that are at least marginally significant are in bold and significant values also have an asterisk *.
                      NS (n = 56)   WP (n = 58)
Metric                Mean(SD)      Mean(SD)      p-value
Total Unused          …             …             …
% Unused/Total        …             …             < …
% Attempted/Unused    …             …             …

To understand when unsolicited hints were not justified, we determined the circumstances under which this occurred, distinguishing whether students attempted to use the hint and what the eventual outcome was: either Gave Up or Solved Without using the hint.
Gave Up represents any action that ends the problem without solving it, such as restarting or skipping the problem. In this situation, students had a hint on the screen, worked a few steps, then clicked the restart or skip button without justifying the hint. When a student clicks restart or skip, all current progress on the problem is erased. We considered this to be "giving up" because the student removes all progress made on the current problem, which is concerning given that a hint was on the screen. Solved Without represents cases where students completed a proof with an unjustified hint still on the screen. Here, students have a hint but eventually solve the problem without using it. This indicates that the hint was ignored or, at the very least, was not essential to solving the proof. We are less concerned with this case because the students were able to progress. However, since the hint is the most efficient step to work towards, any student who avoided it took a less efficient route to solving the problem. Lastly, although students had the option to delete a hint, no deletions were observed, possibly because students did not know how to delete hints.

Table 7 details these two cases in which a hint was added but the student did not justify it. For significance testing, ANCOVA was used with the pretest score as the covariate. Total Unused, % Unused/Total, and % Attempted/Unused are defined above. We also examined how many steps students took after a hint was given but before they gave up or solved the proof, to determine how much effort was put into trying to derive the hint.
Steps Before is the number of steps the student attempted after receiving the hint and before they gave up or solved the proof. This metric was added to see how long students kept working on the problem after the hint was given; a sketch of this outcome classification appears below.
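As a minimal sketch in R, the functions below classify a problem attempt that ended with an unjustified hint on screen and count Steps Before; the terminal-action labels ("restart", "skip", "solved") and the timestamp-based step count are hypothetical encodings of the tutor's logs, not its actual schema.

    # Classify how a problem ended when an unjustified hint was on screen;
    # the action labels are a hypothetical encoding of the log events
    classify_unused_hint <- function(terminal_action) {
      if (terminal_action %in% c("restart", "skip")) {
        "Gave Up"          # progress erased with a hint still unjustified
      } else if (terminal_action == "solved") {
        "Solved Without"   # proof completed while ignoring the hint
      } else {
        NA_character_
      }
    }

    # Steps Before: proof steps logged between the hint and the terminal action
    steps_before <- function(step_times, hint_time, end_time) {
      sum(step_times > hint_time & step_times < end_time)
    }

    classify_unused_hint("skip")    # "Gave Up"
    classify_unused_hint("solved")  # "Solved Without"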
Table 7
Comparison of unused hints of each subtype by amount, percentage that were attempted, and steps taken before the action occurred.
                   Gave Up                      Solved Without
                   NS (n = 56)   WP (n = 58)    NS (n = 56)   WP (n = 58)
Metric             Mean(SD)      Mean(SD)       Mean(SD)      Mean(SD)
Total Unused       …             …              …             …
% Unused/Total     …             …              …             …
% Att./Unused      …             …              …             …
Steps Before       …             …              …             …
The WP group attempted to derive a significantly higher proportion of their unused hints before giving up (% Att./Unused, Gave Up: F(2, …), p = 0.0…; Solved Without: F(2, …), p = 0.14). The differences in Steps Before were not significant (Gave Up: F(2, …), p = 0.13; Solved Without: F(2, …), p = 0.…), so the key difference between the groups concerns attempted unused hints in the Gave Up case. This result indicates that the WP students had attempted to make progress towards the hints, were unable to justify them, and then gave up. This is more concerning than giving up on a problem in which they had not attempted to derive the hint, and indicates that the WP hints may have been too hard to derive.

The purpose of this analysis was to investigate H and determine the circumstances surrounding why the WP group had a significantly lower hint Justification Rate than the NS group. The results provide evidence in support of H that the Waypoint hints were harder for students to derive.

6 Discussion

This work aims to explore the extension of a Next-Step hint generator to easily create subgoal-inspired assistance. The Next-Step group saw the best overall performance in both the training and the posttest, including for students with lower incoming proficiency, providing supporting evidence for H. Our results indicated that the Waypoint group performed worse overall in both training and posttest, causing us to reject H. Results also showed that the lower proficiency students, specifically, were less able to utilize this form of assistance; however, students who were able to utilize Waypoints did see benefits in terms of time and efficiency on the posttest. Furthermore, we explored the circumstances surrounding when hints were not utilized and found that students in the Waypoint group attempted a larger percentage of the hints before giving up, providing evidence in support of H that Waypoints would be harder to derive. In this section, we discuss the trade-offs of the two hint types.

6.1 Waypoint hints

WPs were intended to help students learn strategies for solving proofs by breaking the problem into smaller subgoals and providing students with more independent problem-solving experience than NSs. However, the majority of WP students appear to have struggled with WP hints instead, a trade-off of the assistance dilemma [26]. The WP group performed worse overall in both the training and posttest portions of the tutor (see Table 2). Another interesting result, shown in Section 5.1, is that the WP Low-pretest group had a significantly lower Justification Rate and a marginally significant decrease in posttest performance metrics. This aligns with literature showing that lower proficiency learners are less able to use abstract guidance [25, 54]. Therefore, the WP hints might not provide enough guidance for these students. Research has shown that complex assistance can hinder learning by taxing cognitive load [53, 55], which can happen when learners try to process new information and incorporate complex assistance at the same time, "thus forcing learners to use random search procedures" [22]. This is a limitation of our study, as Waypoints may produce better results with more scaffolding.

The Justification Rate being significantly lower for the WP group indicates that the lower performance may be due to an inability to properly use the assistance (see Table 1). This is partially supported by Tables 6 and 7, which show that the WP group had a higher percentage of attempts to justify a hint without succeeding, compared to the NS group. The Adoption Rate being significantly lower for the WP group indicates that, even when students in the WP group were able to justify the hints, they were less able to adopt them to connect the WP hints to the conclusion.
Because the hint is a few steps away, students could end up on a solution path different from the path initially targeted by the WP. Consequently, students who were unable to justify the WP, or to adopt it into their solution, were not following the most efficient path, hindering their ability to learn from the strategies behind the WP hints.

One potentially positive result with Waypoints is shown in Table 5: the significant negative correlations of Justification and Adoption rates with total time and total steps on the posttest. Students who were able to justify and adopt the WPs tended to take less time and fewer steps on the posttest. This correlation aligns with our original intention of using WPs to support strategy development by helping students become more efficient in their problem-solving process. It is therefore possible that students with more experience and domain knowledge may better utilize Waypoints and receive strategy-related benefits. However, it is important to remember that correlation analyses cannot determine causality, and there could be variables not included in these analyses that play an important role in these relationships [16]; this interpretation is only a possibility. Based on these results, WPs could be improved by providing more information (perhaps automatically, once we detect that a student is unsuccessfully attempting to justify the hint) or by incorporating ideas from recent research on promising methods of scaffolding goal-based hints [32].

6.2 Next-Step hints

The total time, total steps, and accuracy were significantly different, or trending towards significance, between groups, as shown in Table 2 for the training and posttest. Since the groups had similar pretest scores, these results show that the NS and WP groups entered the tutor performing similarly, but by the posttest the two groups had diverged: the NS group had higher accuracy and fewer total steps. Furthermore, the NS group was able to increase their accuracy between the non-isomorphic pre- and posttest, while the WP group did not show such improvement. This was perhaps due to the increased practice in applying rules to justify both unsolicited and on-demand hints, since the NS group received and justified significantly more hints in both categories.

The differences in time, steps, and accuracy between the groups show that NSs were more beneficial for students. As shown in Table 1, the Justification Rate was significantly higher for NSs. We believe these results may be due to the alignment of NSs with novices' bottom-up problem-solving approaches that focus on what to do in the short term [3, 20, 54]. NSs may also have resulted in an overall lower cognitive load [23], though this supposition is based only on their design and not on data from students. As a justification, students considering NSs only needed to think about which nodes and rules could be used to derive the NS. In contrast, WP students needed to think about which nodes to use, which rules to apply, and what intermediate steps they would have to achieve before deriving the WP. Interestingly, the NS group requested more on-demand hints (see Table 2). This suggests that the NS group may have found the assistance more helpful and become more comfortable requesting help.
Prior research has shown that students are more likely to request help when they receive help that they perceive to be more suitable for their needs [43]. Although WPs were designed to promote more independent, strategic problem solving, it is possible that NSs also helped students learn problem-solving strategies. Based on the hint generator design, NS students following the hints were seeing the most efficient next step given the current proof state. Problems with frequent Next-Step hints could be acting as partially worked examples, which are known to increase efficient problem-solving strategies [56, 34]. Previous research on hint usage during problem solving in programming suggests that hints can sometimes save students time but reduce learning [37]. In our research, NS hints seem to save students time and increase performance on the posttest. This suggests that NS hints may help students learn to solve the problems more efficiently (more quickly and with fewer steps).
7 Conclusion

This paper contributes a study showing an extension of the Hint Factory to create higher-level hints, and the effects of two types of hints on students' efficiency and accuracy in solving logic proofs: Next-Step hints (NSs) and Waypoint hints (WPs). It is important to note that these hints were provided unsolicited as well as through on-demand hint requests, which could affect students' usage and reception of the hints. Furthermore, our hints are provided periodically and not necessarily when a student may need them. However, our prior work has shown that our unsolicited, periodically provided hints do not have any negative impact on training or posttest performance metrics compared to students in an on-demand-only hint group. In this study, NSs helped students become quicker, more accurate, and more efficient in their proofs. However, the more distant goals of WPs seemed to be harder for students, which not only affected the training problems where the assistance occurred, but also resulted in lower accuracy and reduced efficiency on the posttest. Despite the WP group spending more time on problem solving during training, their performance did not benefit as much as the NS group's. Furthermore, learners with lower incoming proficiency were least able to utilize WPs, while NSs provided benefits to both higher and lower proficiency groups. Although NS performed better overall, students who were able to incorporate WPs, especially those in the WP-High group, saw benefits in terms of time and efficiency on the posttest. Another interesting outcome was that the NS group had higher justification rates and requested more help, which agrees with previous research showing that hint quality affects help-seeking behaviors. In the future, WPs could be augmented to reduce cognitive load, without eliminating the multi-step aspect, by simplifying other elements of the task, such as highlighting needed nodes or offering multiple hint levels. Other future work includes using machine learning to determine when to provide a hint rather than providing hints periodically. Finally, we hope to transfer these findings to other open-ended problem domains like programming, in order to offer additional instructional supports and hints to novice students.

References
1. Aleven, V., Koedinger, K.R.: Limitations of student control: Do students know when they need help? In: International Conference on Intelligent Tutoring Systems, pp. 292–303. Springer (2000)
2. Aleven, V., McLaren, B., Roll, I., Koedinger, K.: Toward tutoring help seeking. In: International Conference on Intelligent Tutoring Systems, pp. 227–239. Springer (2004)
3. Anderson, J.R., Farrell, R., Sauers, R.: Learning to program in Lisp. Cognitive Science (2), 87–129 (1984)
4. Arroyo, I., Beck, J.E., Beal, C.R., Wing, R., Woolf, B.P.: Analyzing students' response to help provision in an elementary mathematics intelligent tutoring system. In: Papers of the AIED-2001 Workshop on Help Provision and Help Seeking in Interactive Learning Environments, pp. 34–46. Citeseer (2001)
5. Arroyo, I., Beck, J.E., Woolf, B.P., Beal, C.R., Schultz, K.: Macroadapting AnimalWatch to gender and cognitive differences with respect to hint interactivity and symbolism. In: International Conference on Intelligent Tutoring Systems, pp. 574–583. Springer (2000)
6. Baker, R.S., Corbett, A.T., Koedinger, K.R.: Detecting student misuse of intelligent tutoring systems. In: International Conference on Intelligent Tutoring Systems, pp. 531–540. Springer (2004)
7. Barnes, T., Stamper, J., Croy, M.: Using Markov decision processes for automatic hint generation. Handbook of Educational Data Mining (2011)
8. Barnes, T., Stamper, J., Lehman, L., Croy, M.: A pilot study on logic proof tutoring using hints generated from historical student data. In: Educational Data Mining 2008 (2008)
9. Bartholomé, T., Stahl, E., Pieschl, S., Bromme, R.: What matters in help-seeking? A study of help effectiveness and learner-related factors. Computers in Human Behavior (1), 113–129 (2006)
10. Bunt, A., Conati, C., Muldner, K.: Scaffolding self-explanation to improve learning in exploratory learning environments. In: International Conference on Intelligent Tutoring Systems, pp. 656–667. Springer (2004)
11. Catrambone, R.: The subgoal learning model: Creating better examples so that students can solve novel problems. Journal of Experimental Psychology: General (4), 355 (1998)
12. Cody, C., Maniktala, M., Warren, D., Chi, M., Barnes, T.: Does autonomy help help? The impact of unsolicited hints on help avoidance and performance
13. Cronbach, L.J., Snow, R.E.: Aptitudes and Instructional Methods: A Handbook for Research on Interactions. Irvington (1977)
14. Eagle, M., Barnes, T.: Exploring differences in problem solving with data-driven approach maps. In: Educational Data Mining 2014 (2014)
15. Eagle, M., Johnson, M., Barnes, T.: Interaction networks: Generating high level hints based on network community clustering. International Educational Data Mining Society (2012)
16. Field, A., Miles, J., Field, Z.: Discovering Statistics Using R (2012)
17. Fossati, D., Di Eugenio, B., Ohlsson, S., Brown, C., Chen, L.: Generating proactive feedback to help students stay on track. In: International Conference on Intelligent Tutoring Systems, pp. 315–317. Springer (2010)
18. Fossati, D., Di Eugenio, B., Ohlsson, S., Brown, C., Chen, L.: Data driven automatic feedback generation in the iList intelligent tutoring system. Technology, Instruction, Cognition and Learning (1), 5–26 (2015)
19. Fuchs, D., Kearns, D.M., Fuchs, L.S., Elleman, A.M., Gilbert, J.K., Patton, S., Peng, P., Compton, D.L.: Using moderator analysis to identify the first-grade children who benefit more and less from a reading comprehension program: A step toward aptitude-by-treatment interaction. Exceptional Children (2), 229–247 (2019)
20. Guzdial, M.: Centralized mindset: A student problem with object-oriented programming. In: ACM SIGCSE Bulletin, vol. 27, pp. 182–185. ACM (1995)
21. Hume, G., Michael, J., Rovick, A., Evens, M.: Hinting as a tactic in one-on-one tutoring. The Journal of the Learning Sciences (1), 23–47 (1996)
22. Kalyuga, S.: Enhancing instructional efficiency of interactive e-learning environments: A cognitive load perspective. Educational Psychology Review (3), 387–399 (2007)
23. Kalyuga, S.: Cognitive load theory: How many types of load does it really need? Educational Psychology Review (1), 1–19 (2011)
24. Kalyuga, S., Chandler, P., Tuovinen, J., Sweller, J.: When problem solving is superior to studying worked examples. Journal of Educational Psychology (3), 579 (2001)
25. Kirschner, P.A., Sweller, J., Clark, R.E.: Why minimal guidance during instruction does not work: An analysis of the failure of constructivist, discovery, problem-based, experiential, and inquiry-based teaching. Educational Psychologist (2), 75–86 (2006)
26. Koedinger, K.R., Aleven, V.: Exploring the assistance dilemma in experiments with cognitive tutors. Educational Psychology Review (3), 239–264 (2007)
27. Lehmann, J., Goussios, C., Seufert, T.: Working memory capacity and disfluency effect: An aptitude-treatment-interaction study. Metacognition and Learning (1), 89–105 (2016)
28. Luckin, R., Du Boulay, B., et al.: Ecolab: The development and evaluation of a Vygotskian design framework. International Journal of Artificial Intelligence in Education (2), 198–220 (1999)
29. Ma, W., Adesope, O.O., Nesbit, J.C., Liu, Q.: Intelligent tutoring systems and learning outcomes: A meta-analysis. Journal of Educational Psychology (4), 901 (2014)
30. Maniktala, M., Barnes, T., Chi, M.: Extending the Hint Factory: Towards modelling productivity for open-ended problem-solving. In: Proceedings of the 13th International Conference on Educational Data Mining (2020)
31. Maniktala, M., Cody, C., Barnes, T., Chi, M.: Avoiding help avoidance: Using interface design changes to promote unsolicited hint usage in an intelligent tutor. International Journal of Artificial Intelligence in Education (2020, under review)
32. Marwan, S., Lytle, N., Williams, J.J., Price, T.: The impact of adding textual explanations to next-step hints in a novice programming environment. In: Proceedings of the 2019 ACM Conference on Innovation and Technology in Computer Science Education, pp. 520–526. ACM (2019)
33. Mayer, R.E.: Should there be a three-strikes rule against pure discovery learning? American Psychologist (1), 14 (2004)
34. McLaren, B.M., van Gog, T., Ganoe, C., Yaron, D., Karabinos, M.: Exploring the assistance dilemma: Comparing instructional support in examples and problems. In: International Conference on Intelligent Tutoring Systems, pp. 354–361. Springer (2014)
35. Meerbaum-Salant, O., Armoni, M., Ben-Ari, M.: Habits of programming in Scratch. In: Proceedings of the 16th Annual Joint Conference on Innovation and Technology in Computer Science Education, pp. 168–172. ACM (2011)
36. Merrill, D.C., Reiser, B.J., Ranney, M., Trafton, J.G.: Effective tutoring techniques: A comparison of human tutors and intelligent tutoring systems. The Journal of the Learning Sciences (3), 277–305 (1992)
37. Morrison, B.B., Margulieux, L.E., Guzdial, M.: Subgoals, context, and worked examples in learning computing problem solving. In: Proceedings of the Eleventh Annual International Conference on International Computing Education Research, pp. 21–29. ACM (2015)
38. Mostafavi, B., Barnes, T.: Evolution of an intelligent deductive logic tutor using data-driven elements. International Journal of Artificial Intelligence in Education, pp. 1–32 (2016)
39. Mostafavi, B., Barnes, T.: Evolution of an intelligent deductive logic tutor using data-driven elements. International Journal of Artificial Intelligence in Education (1), 5–36 (2017)
40. Murray, R.C., VanLehn, K.: A comparison of decision-theoretic, fixed-policy and random tutorial action selection. In: International Conference on Intelligent Tutoring Systems, pp. 114–123. Springer (2006)
41. Murray, R.C., VanLehn, K., Mostow, J.: Looking ahead to select tutorial actions: A decision-theoretic approach. International Journal of Artificial Intelligence in Education (3, 4), 235–278 (2004)
42. Murray, T.: Authoring intelligent tutoring systems: An analysis of the state of the art. International Journal of Artificial Intelligence in Education, 98–129 (1999)
43. Price, T.W., Liu, Z., Cateté, V., Barnes, T.: Factors influencing students' help-seeking behavior while programming with human and computer tutors. In: Proceedings of the 2017 ACM Conference on International Computing Education Research, pp. 127–135. ACM (2017)
44. Price, T.W., Zhi, R., Barnes, T.: Hint generation under uncertainty: The effect of hint quality on help-seeking behavior. In: International Conference on Artificial Intelligence in Education, pp. 311–322. Springer (2017)
45. Puustinen, M.: Help-seeking behavior in a problem-solving situation: Development of self-regulation. European Journal of Psychology of Education (2), 271 (1998)
46. Ranganathan, R., VanLehn, K., van de Sande, B.: What do students do when using a step-based tutoring system? Research & Practice in Technology Enhanced Learning (2) (2014)
47. Razzaq, L., Heffernan, N.T.: Hints: Is it better to give or wait to be asked? In: International Conference on Intelligent Tutoring Systems, pp. 349–358. Springer (2010)
48. Revelle, W.: psych: Procedures for Psychological, Psychometric, and Personality Research. Northwestern University, Evanston, Illinois (2019). URL https://CRAN.R-project.org/package=psych. R package version 1.9.12
49. Rivers, K., Koedinger, K.R.: Data-driven hint generation in vast solution spaces: A self-improving Python programming tutor. International Journal of Artificial Intelligence in Education (1), 37–64 (2017)
50. Snow, R.E.: Aptitude-treatment interaction as a framework for research on individual differences in psychotherapy. Journal of Consulting and Clinical Psychology (2), 205 (1991)
51. Stamper, J., Barnes, T., Lehmann, L., Croy, M.: The Hint Factory: Automatic generation of contextualized help for existing computer aided instruction. In: Proceedings of the 9th International Conference on Intelligent Tutoring Systems Young Researchers Track, pp. 71–78 (2008)
52. Stamper, J., Eagle, M., Barnes, T., Croy, M.: Experimental evaluation of automatic hint generation for a logic tutor. International Journal of Artificial Intelligence in Education (1-2), 3–17 (2013)
53. Sweller, J.: Cognitive load during problem solving: Effects on learning. Cognitive Science (2), 257–285 (1988)
54. Sweller, J.: Evolutionary bases of human cognitive architecture: Implications for computing education. In: Proceedings of the Fourth International Workshop on Computing Education Research, pp. 1–2. ACM (2008)
55. Sweller, J.: Cognitive load theory. In: Psychology of Learning and Motivation, vol. 55, pp. 37–76. Elsevier (2011)
56. Sweller, J., Cooper, G.A.: The use of worked examples as a substitute for problem solving in learning algebra. Cognition and Instruction (1), 59–89 (1985)
57. Sweller, J., Levine, M.: Effects of goal specificity on means–ends analysis and learning. Journal of Experimental Psychology: Learning, Memory, and Cognition (5), 463 (1982)
58. Ueno, M., Miyazawa, Y.: IRT-based adaptive hints to scaffold learning in programming. IEEE Transactions on Learning Technologies (2017)
59. VanLehn, K.: The behavior of tutoring systems. International Journal of Artificial Intelligence in Education (3), 227–265 (2006)
60. Vygotsky, L.: Interaction between learning and development. Readings on the Development of Children (3), 34–41 (1978)
61. Wood, H., Wood, D.: Help seeking, learning and contingent tutoring. Computers & Education (2), 153–169 (1999)
62. Yeh, Y.C., Lin, C.F.: Aptitude-treatment interactions during creativity training in e-learning: How meaning-making, self-regulation, and knowledge management influence creativity. Journal of Educational Technology & Society 18