Date of publication xxxx 00, 0000, date of current version xxxx 00, 0000.
Digital Object Identifier xx.xxxx/xxxx.xxxx.DOI
Confidence-Aware Learning Assistant
SHOYA ISHIMARU, TAKANORI MARUICHI, ANDREAS DENGEL, AND KOICHI KISE
German Research Center for Artificial Intelligence (DFKI), Trippstadter Str. 122, 67663 Kaiserslautern, Germany
Graduate School of Engineering, Osaka Prefecture University, 1-1 Gakuen-cho, Naka-ku, Sakai, Osaka 599-8531, Japan
Corresponding author: Shoya Ishimaru (e-mail: [email protected]).
ABSTRACT
Not only correctness but also self-confidence plays an important role in improving the quality of knowledge. Undesirable situations such as confident incorrect and unconfident correct knowledge prevent learners from revising their knowledge, because it is not always easy for them to perceive these situations. To solve this problem, we propose a system that estimates self-confidence while solving multiple-choice questions by eye tracking and gives feedback about which questions should be reviewed carefully. We report the results of three studies measuring its effectiveness. (1) On a well-controlled dataset with 10 participants, our approach detected confidence and unconfidence with 81 % and 79 % average precision, respectively. (2) With the help of 20 participants, we observed that the correct answer rates of questions increased by 14 % and 17 % when giving feedback about correct answers without confidence and incorrect answers with confidence, respectively. (3) We conducted a large-scale data recording in a private school (72 high school students solved 14,302 questions) to investigate effective features and the number of required training samples.
INDEX TERMS
Eye tracking, in-the-wild study, learning augmentation, self-confidence estimation.
I. INTRODUCTION
Quantified learning – sensing learning behaviors to give each learner effective feedback based on context – has high potential in the era of digitalized education [1]. The appearance of smart sensors equipped on personal computers, tablets, smartphones, chairs, eyeglasses, etc. has enabled researchers to estimate various internal states such as engagement, boredom, tiredness, and self-confidence while learning [2], [3]. Among these internal states, the importance of self-confidence has been especially investigated in educational research. Self-confidence is a basis of metacognitive judgments and the most common paradigm in metacognitive domains, ranging from decision-making and reasoning [4], [5] to perceptual judgments [6], [7] and memory evaluations [8], [9]. It is a manifestation of metacognitive assessment of one's own knowledge or scholastic ability and is affected by proficiency, achievement, cognitive anxiety, and the difficulty of a task [10]. Several studies have reported that a positive increment in self-confidence enhances learners' engagement and performance [11], [12].
One of the most critical cases where self-confidence plays an important role is on multiple-choice questions (MCQ). An MCQ is a type of question that asks the learner to select the most appropriate choice from given options. Since the only information obtained from the answer to an MCQ is its correctness, it is hard to distinguish between the cases when a learner answers
FIGURE 1. An overview of Confidence-Aware Learning Assistant (CoALA). The system recommends questions which should be checked again on the basis of the correctness and the self-confidence estimated by eye tracking.
with confidence or when a learner answers randomly without confidence. We consider that there are four levels of the quality of knowledge, from high to low: correct with confidence, incorrect without confidence, correct without confidence, and incorrect with confidence. In particular, the last two cases are serious. In the case of correct without confidence, the answer will be treated as correct by chance, and the learner loses a chance to acquire correct knowledge. In the case of incorrect with confidence, the wrong knowledge may cause further misunderstandings.
As a solution to such serious cases, we propose a system which estimates self-confidence on MCQ by analyzing eye movements. Based on the estimation output, our system generates a report suggesting which questions should be reviewed, as shown in Figure 1. We define such an intelligent system that adapts learning materials for each learner based on self-confidence as Confidence-Aware Learning Assistant (CoALA). The idea of estimating self-confidence on MCQ by eye tracking was originally proposed by Yamada et al. [13]. Compared to their study, we aim at (1) proposing a user-independent approach considering a real scenario and (2) investigating the effectiveness as an end-to-end working system including feedback.
This paper presents our self-confidence estimation algorithm and the results of three studies measuring its effectiveness. In the first study, which involves 10 undergraduate students solving a total of 1,700 questions, we investigated the performance (precision and recall) of the self-confidence estimation. Then we trained a confidence estimator using this dataset and used it for the second study. This corresponds to a realistic scenario of applying the technology: an estimator is trained with the data from different learners. The second study consists of a pre-test, a review, and a post-test with the help of 20 participants. By comparing the results of the pre-test and the post-test, we evaluated how much our system improves the quality of knowledge in terms of the correctness of the answers and the students' confidence. For the third study, we deployed our system in a private school. During this five-week demonstration, 72 high school students solved 145,489 questions, and 14,302 questions were labeled by themselves. On this wild dataset, we discuss the limitations and future directions of our system. In summary, our contributions in this paper include:
• User-independent gaze-based self-confidence estimation on multiple-choice questions, which detects confidence and unconfidence with 81 % and 79 % average precision, respectively.
• An end-to-end confidence-based reviewing system, which increases correct answer rates by 14 % for unconfident correct answers and 17 % for confident incorrect answers compared to a controlled condition.
II. RELATED WORK
A. EYE TRACKING FOR LEARNING ASSISTANT
It has been shown that eye gaze reflects linguistic proficiency [14] and the degree of self-confidence. For example, the behavior of a student who does not understand the contents of a document is characterized by low reading speed and frequent rereading [15]. Thai et al. reported that a student's comprehension of a question appears in his/her eye movements, for example, in the rereading of the question [16]. Moreover, regarding the relation between eye behavior and self-confidence, it has been shown that low self-confidence is characterized by frequent rereading of questions and long gazes on choices [17]. Okoso et al. have proposed a method of extracting the parts of a document that are difficult for a reader to comprehend and found some effective features [18]. Lee et al. have proposed building a virtual tutor to support the learning of a student [19]. This work demonstrated that eye communication with a virtual tutor enhances the efficiency of learning. Oliver et al. have succeeded in estimating the English skill of non-native English speakers from their eye behaviors in an English test. The contribution of their work is estimating the skill with a small error from only a few documents [14]. Yamada et al. have tackled the automatic estimation of self-confidence by sensing and analyzing learners' problem-solving behaviors through eye movements [13]. However, their method works well only if the training can be done for each learner with a sufficient amount of data, which may not be realistic. In other learning subjects, for instance, Ishimaru et al. have investigated the reading behaviors of students on a Physics textbook [20]. They proposed Areas-of-Interest-based and subsequence-based approaches to predict expertise.
B. OTHER SENSING MODALITIES
Some researchers have measured the attention of students in learning by electroencephalogram (EEG) and investigated a correlation between attention and self-efficacy, which refers to the level of confidence of an individual with regard to their ability at task execution [21], [22]. Though a method using EEG can be a solution to estimate self-confidence, the device disturbs a user engaging in a task because it must always be attached to the head. On this point, the eye tracker is preferable because it can be attached to a display.
There has been growing interest in the study of the relation between cognitive performance and the autonomic nervous system (ANS). The activity of the ANS can also be measured by heart rate variability (HRV) [23], electrodermal activity (EDA) [24], [25], and so on. We did not utilize these approaches in this work because we had received comments from students in the private school that wearing sensors while studying requires a high physical workload. If we can record precise physiological signals with remote sensing, we will consider integrating them. For instance, the nose temperature, which can be measured by a commercial infrared thermography camera, can be a good candidate [26]. Abdelrahman et al. have recorded nose and forehead temperatures under different task difficulties and found significant changes [27].
Although mobile eye trackers have appeared, there is still a strong gap between controlled behaviors in the laboratory and natural behaviors in the wild. One of the critical issues in this research field is how we can conduct experiments in natural settings for proposing robust methods. Towards this objective, several researchers have conducted long-term and large-scale experiments (e.g., over 80 hours of recording with a mobile eye tracker [28] and 780 hours of recording with commercial electrooculography glasses [29]). Our work is in this context and has evaluated real learning behaviors.
C. ROLE OF SELF-CONFIDENCE IN LEARNING
Several studies have mentioned correlations between self-confidence or other cognitive states and the behaviors of people in specific tasks, including achievement tests of learning [30], cognitive tests [31], and cooking [32]. According to work by Forbes-Riley et al., adapting a user's self-confidence into a computer tutoring system improves learning efficiency and user satisfaction [33]. Kleitman et al. reported that a high level of self-confidence predicted high grades for primary school children [34]. Indeed, students with self-confidence awareness tend to be recognized for their performances, which develops their level of self-confidence again. This positive feedback loop motivates students to learn by themselves. In another study, Stankov et al. showed that self-confidence can be used to identify misconceptions [35]. A misconception occurs when a learner feels confident in the knowledge and thinks that he/she is answering correctly but actually gives an incorrect answer.
Roderer et al. gathered participants of several ages and found a correlation between the self-confidence of participants and their age: junior participants tended to have higher self-confidence than senior participants [36]. In contrast to this research, we gathered participants of almost the same age so as to investigate self-confidence using only information from answering. The studies referred to above, however, have only shown the importance of self-confidence. Our work, on the other hand, aims not only to find correlations but also to estimate self-confidence for practical applications.
D. POSITION OF OUR WORK
Most of the previous work focuses only on the scientific investigation of the importance of self-confidence. Only a limited number of research trials have tackled the automatic estimation of confidence. Moreover, the use of estimated confidence to improve the quality of knowledge has not been well attempted in the past. We consider that this is due to the following two limitations.
No general estimation – It is difficult to establish self-confidence estimation which is independent of environments, subjects, and learners. In other words, estimation methods may work well under a specific environment, for specific subjects and learners, but may not if those conditions are no longer satisfied. In the latter case, the estimation is unstable and less reliable. The important research question here is whether such estimation is still effective in improving the quality of knowledge.
No end-to-end system – Effectiveness should be evaluated as an end-to-end system including sensing, estimation, and feedback. It is often the case that parts work well but the system built by connecting them does not. Unfortunately, most of the previous work focuses on parts, and little has considered the end-to-end scenario. If the goal is to build a system capable of improving the quality of knowledge of learners, this standpoint is mandatory.
FIGURE 2. Examples of eye gaze on multiple-choice questions: (a) answer with confidence; (b) answer without confidence.
In summary, our work's main aim is to evaluate all the methods of sensing, estimation, and feedback not independently but as an end-to-end system, to prove that it can improve the quality of knowledge.
III. PROPOSED METHOD
The processing in our system consists of the following five steps: data recording, feature calculation, feature selection, classification, and feedback.
A. DATA RECORDING
The eye gaze of a user is recorded by a remote eye tracker attached at the bottom of a display. The output of the eye tracker includes the coordinates of the gaze on the display and their timestamps. Figure 2 shows the difference in eye gaze while solving MCQ with and without confidence. This preliminary observation indicates that confusion among choices appears in eye gaze as transitions between choices. The circle in the figure is the position the user is looking at, which is visualized for demonstration and calibration purposes and is invisible while solving questions.
Eye movements are composed of two events: fixations and saccades. A fixation is an event where the gaze pauses at a certain position over a certain period, usually a minimum of 100 ms. A rapid movement between fixations is called a saccade. We classify raw gaze into fixations and saccades using an algorithm proposed by Buscher et al. [37]. A blink – a rapid closing of the eyelid – is not analyzed in our method because the time required to solve one question (10–60 sec.) is too short to calculate statistical features. In addition, smooth pursuit occurs when a person tracks an object moving at slow speed, but this metric is not considered in our method because all information on the display is fixed.
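The segmentation of raw gaze into fixations can be illustrated with a simple dispersion-threshold (I-DT) detector. This is an illustrative stand-in, not the Buscher et al. algorithm used in the paper; the function name and the thresholds (100 ms minimum duration, 35 px maximum dispersion) are assumptions:

```python
# Sketch of a dispersion-threshold fixation detector (I-DT style).
# Thresholds are assumed values for illustration only.
def detect_fixations(gaze, min_duration_ms=100, max_dispersion_px=35):
    """gaze: list of (t_ms, x, y) samples sorted by time.
    Returns fixations as (t_start, t_end, center_x, center_y)."""
    fixations = []
    window = []
    for sample in gaze:
        window.append(sample)
        xs = [s[1] for s in window]
        ys = [s[2] for s in window]
        dispersion = (max(xs) - min(xs)) + (max(ys) - min(ys))
        if dispersion > max_dispersion_px:
            # Window is no longer compact: emit a fixation if it lasted long enough.
            *head, last = window
            if head and head[-1][0] - head[0][0] >= min_duration_ms:
                cx = sum(s[1] for s in head) / len(head)
                cy = sum(s[2] for s in head) / len(head)
                fixations.append((head[0][0], head[-1][0], cx, cy))
            window = [last]
    if len(window) > 1 and window[-1][0] - window[0][0] >= min_duration_ms:
        cx = sum(s[1] for s in window) / len(window)
        cy = sum(s[2] for s in window) / len(window)
        fixations.append((window[0][0], window[-1][0], cx, cy))
    return fixations
```

The gaps between consecutive fixations can then be treated as saccades.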
B. FEATURE CALCULATION
We define Areas of Interest (AOIs) as rectangles covering the question and each choice in order to recognize deeper behaviors (e.g., the ratio of reading times on the question and choices, the decision process of comparing choices, etc.). Fixations and saccades are automatically associated with the corresponding AOIs in this step. Then we extract the 30 features shown in Table 1. Features 1–14 are related to fixations, and Features 15–28 are about saccades. We also use the reading time and the correctness of the answer as features.
TABLE 1.
The list of the features
No. Feature
1–2 Fixation {count, ratio} on Choices
3–4 Fixation {count, ratio} on Question
5–8 {Sum, mean, max, min} of fixation durations on Choices
9–12 {Sum, mean, max, min} of fixation durations on Question
13–14 Variance of {x, y} coordinate of fixations
15–16 {Sum, mean} of saccade length
17–20 Saccade count {all, on Question, between Choices, between Question and Choices}
21–24 {Sum, mean, max, min} of saccade durations
25–28 {Sum, mean, max, min} of saccade speeds
29 Reading-time
30 Correctness of the answer
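A few of the Table 1 features can be computed as sketched below, once fixations and saccades have been associated with their AOIs. The record layout and key names (`aoi`, `from_aoi`, etc.) are assumptions for illustration, not the paper's implementation:

```python
# Sketch of computing a subset of the Table 1 features from fixations and
# saccades already labeled with their AOI ("question" or "choice_0".."choice_3").
from statistics import mean, pvariance

def extract_features(fixations, saccades, reading_time_s, correct):
    """fixations: list of dicts {"aoi": str, "duration": ms, "x": px, "y": px}
    saccades: list of dicts {"from_aoi": str, "to_aoi": str, "length": px}"""
    on_choices = [f for f in fixations if f["aoi"].startswith("choice")]
    between_choices = [s for s in saccades
                       if s["from_aoi"].startswith("choice")
                       and s["to_aoi"].startswith("choice")
                       and s["from_aoi"] != s["to_aoi"]]
    n = max(len(fixations), 1)
    return {
        "f1_fix_count_choices": len(on_choices),                       # feature 1
        "f2_fix_ratio_choices": len(on_choices) / n,                   # feature 2
        "f5_sum_fix_dur_choices": sum(f["duration"] for f in on_choices),  # feature 5
        "f13_var_x_fixations": pvariance([f["x"] for f in fixations])
                               if len(fixations) > 1 else 0.0,         # feature 13
        "f16_mean_saccade_len": mean(s["length"] for s in saccades)
                                if saccades else 0.0,                  # feature 16
        "f19_saccades_between_choices": len(between_choices),          # feature 19
        "f29_reading_time": reading_time_s,                            # feature 29
        "f30_correctness": int(correct),                               # feature 30
    }
```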
C. FEATURE SELECTION
We select effective features from the above 30 candidates because increasing the number of features does not always increase classification performance. We utilize the following simple hill-climbing strategy, called forward stepwise selection. First, we create a subset of features, which is empty in the initial state. Then we calculate the average precision score of an estimation using each single feature and insert the best-performing feature into the subset. Next, the performances of estimations using the features in the subset plus one new feature are calculated, and the best combination is kept again. These processes are repeated as long as the new subset performs better than the old one. Two-fold cross-validation is used for this feature selection. Note that this step is performed only while training a classifier with training samples. The selected features are then used to classify unknown samples.
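The forward stepwise search described above can be sketched as follows. The `score` callback stands in for the 2-fold cross-validated average precision of a classifier trained on the candidate subset; it is kept abstract here so the search logic stands on its own, and all names are illustrative:

```python
# Sketch of the forward stepwise ("hill-climbing") feature selection.
# score(features) -> float is any evaluator, e.g. the 2-fold CV average
# precision of an SVM trained on that feature subset.
def forward_stepwise(all_features, score):
    selected = []
    remaining = list(all_features)
    best = float("-inf")
    while remaining:
        # Try adding each remaining feature to the current subset.
        trial = [(score(selected + [f]), f) for f in remaining]
        s, f = max(trial)
        if s <= best:  # stop when no candidate improves on the old subset
            break
        best = s
        selected.append(f)
        remaining.remove(f)
    return selected, best
```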
D. CLASSIFICATION
We estimate the self-confidence of answers by a Support Vector Machine (SVM) with the selected features. The Radial Basis Function (RBF) kernel with penalty parameter C = 1 and γ = 0. were selected experimentally and are used for the SVM. In a preliminary study, we tested other machine learning techniques, including Random Forest, and found that SVM performs the best overall in our classification task.

E. FEEDBACK TO A LEARNER
By combining the correctness and the estimated confidence, the answers of a learner are categorized into four groups: correct with confidence, correct without confidence, incorrect with confidence, and incorrect without confidence. As shown in Figure 1, the system highlights questions that should be specially reviewed. A learner can report if the output is wrong; the data are then stored to personalize upcoming estimations.
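The grouping and review-list logic of the feedback step can be sketched as follows; the group names follow the text, while the data layout is an assumption:

```python
# Sketch of the feedback step: combining correctness with estimated
# confidence into the four knowledge-quality groups, and flagging the
# two groups the system asks the learner to review.
GROUPS = {
    (True, True):   "correct with confidence",
    (True, False):  "correct without confidence",    # flagged for review
    (False, True):  "incorrect with confidence",     # flagged for review
    (False, False): "incorrect without confidence",
}

def build_report(answers):
    """answers: list of dicts {"qid": ..., "correct": bool, "confident": bool}.
    Returns (per-question group labels, question ids to review)."""
    labeled, review = {}, []
    for a in answers:
        key = (a["correct"], a["confident"])
        labeled[a["qid"]] = GROUPS[key]
        # Highlight vague knowledge and misunderstandings for careful review.
        if key in ((True, False), (False, True)):
            review.append(a["qid"])
    return labeled, review
```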
IV. STUDY 1: EVALUATION OF SELF-CONFIDENCE ESTIMATION
In this first study, we evaluate the performance of the self-confidence estimation. We involved 10 participants with the same background in creating a well-designed dataset. This section explains the procedures and the results.
FIGURE 3. 11-point precision-recall curves of the confidence and unconfidence detection (average precision – confidence, gaze and reading time: 0.81; confidence, reading time only: 0.80; unconfidence, gaze and reading time: 0.79; unconfidence, reading time only: 0.73).
A. EXPERIMENTAL DESIGN
We invited 10 participants (male: 5, female: 5) to our laboratory to solve 170 MCQ about English vocabulary and grammar. All the participants were first-year Japanese undergraduate students. We utilized a Tobii 4C remote 90 Hz eye tracker for this data recording. Note that an upgrade key provided by Tobii was applied to use this device for scientific purposes. Participants answered the most appropriate word for a blank in a question from the given choices. After answering each question, they answered the survey question “Do you have confidence in your decision?” with Yes or No. Answers to this questionnaire were used as ground truth labels (referred to as true confidence in this paper). We applied the random over-sampling in imbalanced-learn to create a balanced dataset.

B. SELF-CONFIDENCE ESTIMATION PERFORMANCE
Figure 3 shows the 11-point precision-recall curves of the confidence detection and unconfidence detection among all participants. This result indicates that our confidence estimation performs accurately enough, relatively better for confidence detection than for unconfidence detection (average precisions: 81 % and 79 %). Since the labels of confidence were balanced, the chance ratio of the estimation is 50 %. The features selected from this recording were as follows: f5: sum of fixation durations on choices, f13: variance of x coordinate of fixations, f19: the number of saccades between choice areas, f21: sum of saccade durations, and f29: reading-time. Since some of the selected features are correlated with each other, a single feature, i.e., only reading-time, might be enough to classify confident and unconfident answers. However, the eye gaze features improve the performance, in particular for unconfidence detection.

C. OBSERVATION OF MISCLASSIFICATIONS
We describe the difference in eye gaze between the cases where a participant answered with confidence and without confidence. Figure 4 displays some examples of the estimation results. The circles represent the fixations, and the diameter of each circle is proportional to the fixation duration.
FIGURE 4. Examples of eye gaze for each classification result: (a) true confident estimated as confident; (b) true unconfident estimated as confident; (c) true confident estimated as unconfident; (d) true unconfident estimated as unconfident.
Hence, the longer a participant looked at a point, the larger the diameter of the fixation is. The lines between circles represent the saccades.
Figure 4 (a) is an example of the eye gaze of a participant who answered with confidence. Figure 4 (d) is an example of the eye gaze of a participant who answered without confidence. We can see that confidence in answering is characterized by fewer eye movements and smaller fixation diameters, while unconfidence is characterized by complex eye movements and longer fixation durations.
In Figure 4 (b), a participant answered without confidence, but the classifier estimated that he answered with confidence. We assume that he gave up on answering this question correctly because he did not have the necessary knowledge. In such a case, the number of fixations is small, and the participant took a short time to answer the question. These characteristics are common to Figure 4 (a), which represents a confident decision. Therefore, the classifier estimated it as a confident decision.
In Figure 4 (c), a participant answered with confidence, but the classifier estimated that he answered without confidence. We assume that this participant decided his answer carefully by eliminating irrelevant choices one by one. In such a case, we find more fixations and frequent transitions of the eyes between rectangles. This characteristic is common to Figure 4 (d), which represents an unconfident answer.
V. STUDY 2: EVALUATION OF SELF-CONFIDENCE-BASED FEEDBACK
To evaluate the effectiveness of feedback based on self-confidence, we utilized the classifier from the first recording and prepared the end-to-end review feedback system for the second study. This section explains the details of the experiment and addresses the following research hypotheses.
FIGURE 5. The procedure of the feedback study.
• RH1 – Questions answered correctly without confidence (vague knowledge) tend to be forgotten compared to knowledge with confidence.
• RH2 – Questions answered incorrectly with confidence (misunderstandings) tend to be mistaken again compared to wrong knowledge without confidence.
• RH3 – Estimating self-confidence from learning behaviors and giving feedback (e.g., adding questions to a review list, highlighting them while reviewing) avoids such scenarios.
A. EXPERIMENTAL DESIGN
We employed 20 participants (undergraduate and graduate school students, age: 18–25, male: 14, female: 6) and monitored the transition of their performance. For questions, we prepared three levels of MCQ about English grammar: Level 1 (easy, 170 questions), Level 2 (normal, 290 questions), and Level 3 (hard, 160 questions). Each question requires selecting the most appropriate word for a blank from four choices. Eye movements on the questions were recorded by a Tobii 4C remote 90 Hz eye tracker with an upgrade key. Figure 5 shows the experimental procedure. We invited participants for three days and asked them to perform the following tasks. One-day breaks were inserted between task days. Participants who completed the tasks received 5,000 JPY.
Trial (the first day) – Each participant solved 10 questions across the three levels as a trial. Two reasons were behind this trial: getting participants used to the MCQ interface and selecting an appropriate degree of difficulty. If the questions are too easy or too difficult, the dataset will be unbalanced, and we cannot show any transition in performance. Based on the results, we selected the level whose correct answer rate was closest to 50 %.
Pre-Test (the first day) – After choosing the suitable level, each participant answered 120 questions of the selected level and reported his/her self-confidence after each decision. Besides, the result page (see Figure 1) with correctness and self-confidence estimated on the basis of the training dataset appeared after answering every 10 questions. We instructed participants to press the “Read the answer” button for self-review, except for the questions correctly answered with confidence.
We recorded 2,075 answers in total. Based on the correctness and estimated self-confidence, we categorized them into four groups: (1) correct with confidence, (2) correct without confidence, (3) incorrect without confidence, and (4) incorrect with confidence. The role of our system is to identify (2) and (4) and suggest that a learner review them. In order to evaluate the effectiveness of the system, we gave feedback for half of (2) and (4) (see Figure 6). In the following, the samples without feedback are called the controlled groups (2a) and (4a), and the samples with feedback are the experimental groups (2b) and (4b).
FIGURE 6. The distribution of questions in the feedback study: (1) correct with confidence: 53.5 %; (2a) correct without confidence: 5.6 %; (2b) correct without confidence with feedback: 5.1 %; (3) incorrect without confidence: 5.1 %; (4a) incorrect with confidence: 16.5 %; (4b) incorrect with confidence with feedback: 14.1 %.
Review (the third day) – Participants answered review questions generated based on the first day's feedback. Wrong answers (3) and (4) were inserted into the review list. In addition, we added (2b) to the list. During the review, (2b) and (4b) were emphasized on the question page. After solving each question, each participant reported his/her self-confidence. The result page with correctness and estimated self-confidence was shown for every 10 questions. We asked participants to press the “Read the answer” button again for self-review, except for the questions correctly answered with confidence. The order of questions and choices was shuffled from the pre-test.
Post-Test (the fifth day) – Participants solved the same 120 questions as in the pre-test. They reported confidence in their decision for each question and checked the result page every 10 questions, as on the first and third days. The order of questions and choices was shuffled from the review.
B. IMPORTANCE OF SELF-CONFIDENCE ESTIMATION
Figure 7 shows the results of the effect of our review feedback. For this investigation, we divided all questions into two groups: answered correctly or incorrectly at the pre-test. Then we compared their correctness at the post-test under each condition.
If a participant answered correctly at the pre-test, he/she should be able to select the right choice again when asked the same question. However, some answers were wrong at the post-test. Figure 7 (a) reports how many questions were forgotten. As a result, the correct answer ratio of (2a) correct answers without confidence dropped 16 % compared to (1) answers with confidence (evaluated by Welch's t-test). In other words, answers without confidence tend to be forgotten in the near future, and therefore they should be included in the review list. We observed that questions answered without confidence could not always be answered correctly when asked again (RH1 is true). There is not much difference in the correctness at the post-test between wrong answers with and without confidence (RH2 is not always true).
FIGURE 7. The mean correct answer rates among 20 participants at the post-test: (a) correct at the pre-test – (1) correct with confidence: 0.82, (2a) correct without confidence: 0.66, (2b) correct without confidence with feedback: 0.80; (b) incorrect at the pre-test – (3) incorrect without confidence: 0.65, (4a) incorrect with confidence: 0.64, (4b) incorrect with confidence with feedback: 0.81. The symbols ** and * indicate p < 0.01 and p < 0.05, respectively.
C. EFFECT OF FEEDBACK
Figure 7 also shows that the feedback succeeded in improving the mean correct answer rate. The performance of the experimental groups was 14 % higher than that of the controlled group for the feedback about correct and unconfident questions (see Figure 7 (a)), and 17 % higher for incorrect and confident questions (see Figure 7 (b)). Highlighting questions that were answered incorrectly with confidence could increase the probability of maintaining the correct answers in mind (RH3 is true).
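As a minimal sketch of the Welch's t-test used for these group comparisons, the test statistic and the Welch–Satterthwaite degrees of freedom can be computed from two groups of per-participant correct answer rates as below. The data are toy values, not the study's; in practice a library routine such as scipy.stats.ttest_ind with equal_var=False also yields the p-value:

```python
# Sketch of Welch's t-test for two groups with unequal variances:
# t statistic and Welch-Satterthwaite degrees of freedom.
from statistics import mean, variance

def welch_t(a, b):
    va, vb = variance(a) / len(a), variance(b) / len(b)
    t = (mean(a) - mean(b)) / (va + vb) ** 0.5
    # Welch-Satterthwaite approximation of the degrees of freedom.
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df
```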
D. QUALITY OF KNOWLEDGE
Let us show how the quality of knowledge changes through the feedback with the estimated confidence. Figure 8 represents the transitions of levels – correctness and reported confidence – between the pre-test and post-test. The controlled groups (randomly selected no-feedback samples) are not included in this chart. After the review, the number of correct answers with confidence increased compared to the other three groups. In addition, an interesting finding from this chart is that participants were able to assess their state of knowledge better after the review. Many correct but unconfident answers changed to correct with confidence, and the percentages of correct answers without confidence and incorrect answers with confidence decreased. From the results mentioned above, the feedback is effective in improving the quality of knowledge.
FIGURE 8.
Transitions of correctness and estimated confidence before (left) and after (right) the review.
VI. STUDY 3: DEPLOYMENT IN THE WILD
The previous sections have demonstrated how CoALA estimates self-confidence and how much it improves learning performance. However, unexpected problems that do not occur under laboratory conditions commonly happen in the wild. In this section, we report findings from a deployment in a real classroom environment.
A. EXPERIMENTAL DESIGN
We collaborated with a private school and deployed our system there. Students solved MCQ about English vocabulary on the system. Then they printed out a list of words involving incorrect answers and correct answers with low self-confidence. The questions were prepared by the private school. The main purpose of this deployment was not to record data but to demonstrate the system in a real environment. Therefore, unlike the previous two studies, we did not constrain students' natural behaviors. Calibration of the eye tracker was performed once before a student started using the system. We asked for the self-confidence of the decision (ground truth labels) once every five questions. Each student had their own username in order to track who solved which question with or without confidence. The number of solved questions depended on the student. We utilized a Tobii 4C remote 90 Hz eye tracker with an upgrade key. The duration of this demonstration was around five weeks. 83 students used our system, and we collected 145,489 solving behaviors in total. We evaluated our proposed self-confidence estimation on this dataset with leave-one-participant-out cross-validation.
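The leave-one-participant-out protocol can be sketched as follows: every student in turn becomes the test set while the remaining students form the training set. This is equivalent in spirit to scikit-learn's LeaveOneGroupOut; the record layout is an assumption:

```python
# Sketch of leave-one-participant-out cross-validation: each student's
# samples are held out once, and the rest form the training set.
def leave_one_participant_out(samples):
    """samples: list of dicts with a "student" key.
    Yields (held_out_student, train_samples, test_samples)."""
    students = sorted({s["student"] for s in samples})
    for held_out in students:
        train = [s for s in samples if s["student"] != held_out]
        test = [s for s in samples if s["student"] == held_out]
        yield held_out, train, test
```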
B. PRE-PROCESSING
Since real recordings included many noisy behaviors, the following filters were applied to obtain a reliable dataset. (1) Only labeled data were analyzed in this study. (2) Data with invalid usernames (e.g., guest) were filtered out. (3) Data with only a little eye gaze (a ratio of valid gaze coordinates of less than 80 % of one recording) were also ignored. Finally, the wild dataset consists of 14,302 valid samples from 72 students.
FIGURE 9. 11-point precision-recall curves on the wild dataset (average precision – confidence, gaze and reading time: 0.79; confidence, reading time only: 0.76; unconfidence, gaze and reading time: 0.78; unconfidence, reading time only: 0.77).
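The three filters of the pre-processing step can be sketched as follows; the field names and the guest-username check are assumptions for illustration:

```python
# Sketch of the three pre-processing filters applied to the wild dataset.
def filter_wild_dataset(records, min_valid_gaze_ratio=0.8):
    """Keep only labeled records with a real username and enough valid gaze."""
    kept = []
    for r in records:
        if r.get("confidence_label") is None:              # (1) labeled data only
            continue
        if not r["username"] or r["username"] == "guest":  # (2) valid username
            continue
        if r["valid_gaze_ratio"] < min_valid_gaze_ratio:   # (3) enough valid gaze
            continue
        kept.append(r)
    return kept
```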
In a real learning scenario, we cannot ask students to calibrate an eye tracker many times. They frequently move their heads and change seating positions. Therefore, eye gaze in the wild dataset was not as precise as the data recorded in the laboratory. This causes problems in our feature calculation because the AOIs are predefined as absolute coordinates on the display. However, an interesting finding from scan path images is that the relative positional relationship between gazes on the question and on the choices remains correct even if they are shifted. In order to solve this issue, we decided to define AOIs with a new approach. From all fixations in one recording, we calculate the maximum and minimum x and y coordinates. Then the AOIs are defined on the basis of relative positions in this space. In our question format, the area of the question is the top 34 % of this space, and the areas of the choices divide the remaining 66 % bottom part into a cross.

C. CONFIDENCE ESTIMATION RESULTS IN THE WILD
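The relative AOI scheme described in the pre-processing above can be sketched as follows, assuming screen coordinates with y increasing downward. The 34 %/66 % split follows the text, while the function names and return format are assumptions:

```python
# Sketch of relative AOI definition: AOIs are placed inside the bounding
# box of all fixations in one recording, with the question in the top 34 %
# and the four choices splitting the remaining area into a 2x2 grid.
def relative_aoi(fixations):
    """fixations: list of (x, y). Returns a function mapping (x, y) -> AOI."""
    xs = [x for x, _ in fixations]
    ys = [y for _, y in fixations]
    min_x, max_x, min_y, max_y = min(xs), max(xs), min(ys), max(ys)
    split_y = min_y + 0.34 * (max_y - min_y)  # question / choices boundary
    mid_x = (min_x + max_x) / 2               # left / right choice columns
    mid_y = (split_y + max_y) / 2             # upper / lower choice rows
    def classify(x, y):
        if y <= split_y:
            return "question"
        col = 0 if x <= mid_x else 1
        row = 0 if y <= mid_y else 1
        return f"choice_{row * 2 + col}"
    return classify
```

This makes the AOI assignment robust to a constant shift of the gaze, since only the relative positions of fixations within one recording matter.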
We utilized the recorded data for training the estimator, and Figure 9 shows the estimation results. As in the laboratory study, our approach detected confidence and unconfidence better than the estimator using reading-time only. f1: fixation count on choices, f8: minimum fixation duration on choices, f12: minimum fixation duration on question, f29: reading-time, and f30: correctness of the answer were selected as features.

D. EFFECTIVE FEATURES
Figure 10 shows the features selected on the laboratory dataset (the first study) and the wild dataset (the third study). In both conditions, f29: reading-time has a negative correlation with self-confidence and was selected as a feature. Most of the calculated features are negatively correlated with self-confidence. This is because the longer a learner takes to consider a question, the more fixations and saccades are observed. Interestingly, a feature that is highly correlated with self-confidence is not necessarily selected by the classifier. Furthermore, a feature that is not correlated individually can play an important role in combination with other features.
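A forward stepwise selection of this kind can be sketched with scikit-learn's `SequentialFeatureSelector`. The logistic-regression scorer model and the cross-validation settings are assumptions for illustration, not the paper's exact procedure.

```python
import numpy as np
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

def forward_select(X, y, n_features=5):
    """Forward stepwise feature selection: starting from an empty set,
    repeatedly add the feature that most improves cross-validated
    average precision, until n_features are chosen."""
    selector = SequentialFeatureSelector(
        LogisticRegression(max_iter=1000),  # stand-in scorer model
        n_features_to_select=n_features,
        direction="forward",
        scoring="average_precision",
        cv=3,
    )
    selector.fit(X, y)
    # Indices of the selected feature columns
    return np.flatnonzero(selector.get_support())
```

Because features are scored jointly with those already chosen, a feature with weak individual correlation can still be selected, which matches the observation above.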
FIGURE 10. Pearson correlations between self-confidence and each feature on (a) the laboratory dataset and (b) the wild dataset. Features selected by the forward stepwise selection are highlighted in red. (circle: positive, triangle: negative correlation; sorted by absolute value)
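The per-feature Pearson correlations summarized in Figure 10 can be computed in a few lines. Mapping column j to the feature name f(j+1) is an assumption for illustration.

```python
import numpy as np

def feature_confidence_correlations(X, y):
    """Pearson correlation of each feature column with the binary
    self-confidence label, sorted by absolute value (as in Figure 10).

    Returns a list of (feature_name, correlation) pairs; column j is
    assumed to correspond to feature f(j+1).
    """
    corrs = []
    for j in range(X.shape[1]):
        c = np.corrcoef(X[:, j], y)[0, 1]
        corrs.append((f"f{j + 1}", float(c)))
    return sorted(corrs, key=lambda t: abs(t[1]), reverse=True)
```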
E. THE NUMBER OF TRAINING SAMPLES
Figure 11 shows the relation between the number of training samples and the performance. Average precisions increased until the number of training samples reached 200. Increasing beyond 200 samples did not contribute to further improvement, but the more training samples we had, the smaller the standard deviation of the results.
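A learning-curve experiment of this kind can be sketched as follows. The random-forest classifier, subsample sizes, and repeat count are placeholders rather than the paper's exact setup; the sketch assumes both classes appear in every subsample.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import average_precision_score

def learning_curve_ap(X_train, y_train, X_test, y_test,
                      sizes=(50, 100, 200, 400), repeats=10, seed=0):
    """Average precision as a function of training-set size.

    For each size n, draw n training samples at random (without
    replacement), train a classifier, score on the fixed test set,
    and repeat; returns {n: (mean_ap, std_ap)}.
    """
    rng = np.random.default_rng(seed)
    results = {}
    for n in sizes:
        scores = []
        for _ in range(repeats):
            idx = rng.choice(len(X_train), size=n, replace=False)
            clf = RandomForestClassifier(n_estimators=50, random_state=0)
            clf.fit(X_train[idx], y_train[idx])
            proba = clf.predict_proba(X_test)[:, 1]
            scores.append(average_precision_score(y_test, proba))
        results[n] = (float(np.mean(scores)), float(np.std(scores)))
    return results
```

Plotting the mean against n, with the standard deviation as an error band, reproduces the shape of Figure 11: the mean plateaus while the band narrows as n grows.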
VII. DISCUSSION
Studies 1 and 3 have given us interesting findings for improving the system. In the first study, we evaluated our gaze-based self-confidence estimation on MCQ. The combination of gaze features and reading-time estimated self-confidence better than reading-time alone (average precisions: 0.81 and 0.79). One possible reason for the weaker contribution of gaze features compared to the previous report [13] is that we aim to develop a system that starts with a user-independent estimation, although individual learners have their own characteristic eye movements. Our system has a function for collecting learners' feedback on the estimation results (see Figure 1), and the personalization of the estimation remains future work.

The third study demonstrated that our self-confidence estimation works in a wild condition such as a real classroom environment, where the system cannot be calibrated frequently. Instead of utilizing self-calibration approaches [38], [39], calculating features from relative-position-based AOIs performed well enough in our use case. The number of training samples does not seem to be the critical factor in this task. Rather than collecting similar answers, recording solving behaviors on varied types and levels of questions with short and long reading-times should improve the estimation.

Another limitation of our studies is that the characteristics of the questions in the two datasets were different. Since we could not control the difficulty level of the questions in the wild dataset (the third study), the questions seem to have been easy for the participants, and there are more correct answers than incorrect answers.

FIGURE 11. Average precisions on different numbers of samples randomly selected from the wild dataset (confidence and unconfidence detection).

The studies mentioned in Section II mainly focus on individual contexts and components, and thus it is hard to find an evaluation of a whole system in an end-to-end manner. For example, little has been known about how accurate the estimation should be to achieve the goal, which is, in our case, to improve the quality of knowledge. Our second study indicated that questions answered with vague knowledge tend to be forgotten compared to knowledge held with confidence (decreased by 16 %), and our confidence-based feedback avoided this drop.

An important issue is whether it is still meaningful to give feedback based on a noisy estimation of self-confidence. In order to establish a system that improves learners' performance, the end-to-end viewpoint must be incorporated into the evaluation. Although there is still room for improvement in our self-confidence estimation, we observed improved learning performances.
VIII. CONCLUSION
We have proposed Confidence-Aware Learning Assistant (CoALA), which estimates self-confidence on MCQ by analyzing eye movements and generates a report suggesting which questions should be reviewed. The self-confidence estimation algorithm was evaluated in both the laboratory and the wild condition. By utilizing an estimator pre-trained on the laboratory dataset, we conducted a user study of the review feedback. Our end-to-end confidence-based review increased correct answer rates by 14 % for unconfident correct answers and 17 % for confident incorrect answers compared to a controlled condition. By visualizing the transitions of correctness and reported self-confidence between a pre-test and a post-test, we observed that the quality of knowledge increased. We conclude that CoALA is helpful for learners.

In future work, we will apply our method to other subjects, including mathematics, science, and social studies. We expect a successful estimation of self-confidence on any MCQ that a student can answer just by looking at a display and thinking about the question. Moreover, we aim to apply our method to questions that do not include choices. In this work, designing AOIs for the question and each choice was key to obtaining effective features, so we will need to find new features for such questions.
REFERENCES
[1] Andreas Dengel. Digital co-creation and augmented learning. In Proceedings of the 11th International Knowledge Management in Organizations Conference on The Changing Face of Knowledge Management Impacting Society, page 3. ACM, 2016.
[2] Rafael A. Calvo and Sidney D'Mello. Affect detection: An interdisciplinary review of models, methods, and their applications. IEEE Transactions on Affective Computing, 1(1):18–37, 2010.
[3] Mark Warschauer and Carla Meskill. Technology and second language teaching. Handbook of Undergraduate Second Language Education, 15:303–318, 2000.
[4] Rakefet Ackerman and Valerie A. Thompson. Meta-reasoning. Reasoning as Memory, pages 164–182, 2015.
[5] Logan Fletcher and Peter Carruthers. Metacognition and reasoning. Phil. Trans. R. Soc. B, 367(1594):1366–1378, 2012.
[6] Stephen M. Fleming, Brian Maniscalco, Yoshiaki Ko, Namema Amendi, Tony Ro, and Hakwan Lau. Action-specific disruption of perceptual confidence. Psychological Science, 26(1):89–98, 2015.
[7] Megan A. K. Peters and Hakwan Lau. Human observers have optimal introspective access to perceptual processes even for visually masked stimuli. eLife, 4:e09651, 2015.
[8] John Dunlosky, Michael J. Serra, Greg Matvey, and Katherine A. Rawson. Second-order judgments about judgments of learning. The Journal of General Psychology, 132(4):335–346, 2005.
[9] Bridgid Finn and Janet Metcalfe. The role of memory for past test in the underconfidence with practice effect. Journal of Experimental Psychology: Learning, Memory, and Cognition, 33(1):238, 2007.
[10] Richard Clément, Zoltán Dörnyei, and Kimberly A. Noels. Motivation, self-confidence, and group cohesion in the foreign language classroom. Language Learning, 44(3):417–448, 1994.
[11] Elizabeth A. Linnenbrink and Paul R. Pintrich. The role of self-efficacy beliefs in student engagement and learning in the classroom. Reading & Writing Quarterly, 19(2):119–137, 2003.
[12] Jon-Chao Hong, Ming-Yueh Hwang, Kai-Hsin Tai, and Chi-Ruei Tsai. An exploration of students' science learning interest related to their cognitive anxiety, cognitive load, self-confidence and learning progress using inquiry-based learning with an iPad. Research in Science Education, pages 1–20, 2017.
[13] Kento Yamada, Koichi Kise, and Olivier Augereau. Estimation of confidence based on eye gaze: an application to multiple-choice questions. In Proceedings of the 2017 ACM International Joint Conference on Pervasive and Ubiquitous Computing and Proceedings of the 2017 ACM International Symposium on Wearable Computers, pages 217–220. ACM, 2017.
[14] Olivier Augereau, Hiroki Fujiyoshi, and Koichi Kise. Towards an automated estimation of English skill via TOEIC score based on reading analysis. In Pattern Recognition, 2016 23rd International Conference on, pages 1285–1290. IEEE, 2016.
[15] Keith Rayner. Eye movements in reading and information processing: 20 years of research. Psychological Bulletin, 124(3):372, 1998.
[16] Meng-Jung Tsai, Huei-Tse Hou, Meng-Lung Lai, Wan-Yi Liu, and Fang-Ying Yang. Visual attention for solving multiple-choice science problem: An eye-tracking analysis. Computers & Education, 58(1):375–385, 2012.
[17] Kazuaki Kojima, Keiich Muramatsu, and Tatsunori Matsui. Experimental study toward estimation of a learner mental state from processes of solving multiple choice problems based on eye movements. In Proceedings of the 20th International Conference on Computers in Education, pages 81–85, 2012.
[18] Ayano Okoso, Takumi Toyama, Kai Kunze, Joachim Folz, Marcus Liwicki, and Koichi Kise. Towards extraction of subjective reading incomprehension: Analysis of eye gaze features. In Proceedings of the 2015 CHI Conference on Human Factors in Computing Systems: Extended Abstracts, pages 1325–1330. ACM, 2015.
[19] Hanju Lee, Yasuhiro Kanakogi, and Kazuo Hiraki. Building a responsive teacher: how temporal contingency of gaze interaction influences word learning with virtual tutors. Royal Society Open Science, 2(1):140361, 2015.
[20] Shoya Ishimaru, Syed Saqib Bukhari, Carina Heisel, Nicolas Großmann, Pascal Klein, Jochen Kuhn, and Andreas Dengel. Augmented learning on anticipating textbooks with eye tracking. In Positive Learning in the Age of Information, pages 387–398. Springer, 2018.
[21] Jerry Chih-Yuan Sun and Katherine Pin-Chen Yeh. The effects of attention monitoring with EEG biofeedback on university students' attention and self-efficacy: The case of anti-phishing instructional materials. Computers & Education, 106:73–82, 2017.
[22] Lu-Ho Hsia, Iwen Huang, and Gwo-Jen Hwang. Effects of different online peer-feedback approaches on students' performance skills, motivation and self-efficacy in a dance course. Computers & Education, 96:55–71, 2016.
[23] Antonio Luque-Casado, Mikel Zabala, Esther Morales, Manuel Mateo-March, and Daniel Sanabria. Cognitive performance and heart rate variability: the influence of fitness level. PLoS ONE, 8(2):e56935, 2013.
[24] Iuliia Brishtel, Shoya Ishimaru, Olivier Augereau, Koichi Kise, and Andreas Dengel. Assessing cognitive workload on printed and electronic media using eye-tracker and EDA wristband. In Proceedings of the 23rd International Conference on Intelligent User Interfaces Companion, page 45. ACM, 2018.
[25] Hugo D. Critchley. Electrodermal responses: what happens in the brain. The Neuroscientist, 8(2):132–142, 2002.
[26] Shoya Ishimaru, Soumy Jacob, Apurba Roy, Syed Saqib Bukhari, Carina Heisel, Nicolas Großmann, Michael Thees, Jochen Kuhn, and Andreas Dengel. Cognitive state measurement on learning materials by utilizing eye tracker and thermal camera. In Proceedings of the 14th IAPR International Conference on Document Analysis and Recognition, volume 8, pages 32–36. IEEE, 2017.
[27] Yomna Abdelrahman, Eduardo Velloso, Tilman Dingler, Albrecht Schmidt, and Frank Vetere. Cognitive heat: exploring the usage of thermal imaging to unobtrusively estimate cognitive load. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, 1(3):33, 2017.
[28] Julian Steil and Andreas Bulling. Discovery of everyday human activities from long-term visual behaviour using topic models. In Proceedings of the 2015 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pages 75–85. ACM, 2015.
[29] Shoya Ishimaru, Kensuke Hoshika, Kai Kunze, Koichi Kise, and Andreas Dengel. Towards reading trackers in the wild: detecting reading activities by EOG glasses and deep neural networks. In Proceedings of the 2017 ACM International Joint Conference on Pervasive and Ubiquitous Computing and Proceedings of the 2017 ACM International Symposium on Wearable Computers, pages 704–711. ACM, 2017.
[30] Radka Jersakova, Richard J. Allen, Jonathan Booth, Céline Souchay, and Akira R. O'Connor. Understanding metacognitive confidence: Insights from judgment-of-learning justifications. Journal of Memory and Language, 97:187–207, 2017.
[31] Sabina Kleitman and Jennifer Gibson. Metacognitive beliefs, self-confidence and primary learning environment of sixth grade students. Learning and Individual Differences, 21(6):728–735, 2011.
[32] Jennifer A. Pooler, Ruth E. Morgan, Karen Wong, Margaret K. Wilkin, and Jonathan L. Blitstein. Cooking matters for adults improves food resource management skills and self-confidence among low-income participants. Journal of Nutrition Education and Behavior, 49(7):545–553, 2017.
[33] Katherine Forbes-Riley and Diane J. Litman. Adapting to student uncertainty improves tutoring dialogues. In AIED, pages 33–40, 2009.
[34] Sabina Kleitman, Lazar Stankov, Carl Martin Allwood, Sarah Young, and Karina Kar Lee Mak. Metacognitive self-confidence in school-aged children. In Self-directed Learning Oriented Assessments in the Asia-Pacific, pages 139–153. Springer, 2012.
[35] Lazar Stankov, Jihyun Lee, Wenshu Luo, and David J. Hogan. Confidence: A better predictor of academic achievement than self-efficacy, self-concept and anxiety? Learning and Individual Differences, 22(6):747–758, 2012.
[36] Thomas Roderer and Claudia M. Roebers. Can you see me thinking (about my answers)? Using eye-tracking to illuminate developmental differences in monitoring and control skills and their relation to performance. Metacognition and Learning, 9(1):1–23, 2014.
[37] Georg Buscher, Andreas Dengel, and Ludger van Elst. Eye movements as implicit relevance feedback. In Proceedings of the 2008 CHI Conference on Human Factors in Computing Systems: Extended Abstracts, pages 2991–2996. ACM, 2008.
[38] Michael Xuelin Huang, Tiffany C. K. Kwok, Grace Ngai, Stephen C. F. Chan, and Hong Va Leong. Building a personalized, auto-calibrating eye tracker from user interactions. In Proceedings of the 2016 CHI Conference on Human Factors in Computing Systems, pages 5169–5179, 2016.
[39] Thiago Santini, Wolfgang Fuhl, and Enkelejda Kasneci. CalibMe: Fast and unsupervised eye tracker calibration for gaze-based pervasive human-computer interaction. In Proceedings of the 2017 CHI Conference on Human Factors in Computing Systems, pages 2594–2605, 2017.