Date of publication xxxx 00, 0000, date of current version xxxx 00, 0000.
Digital Object Identifier xx.xxxx/xxxx.xxxx.DOI
Confidence-Aware Learning Assistant
SHOYA ISHIMARU, TAKANORI MARUICHI, ANDREAS DENGEL, AND KOICHI KISE
German Research Center for Artificial Intelligence (DFKI), Trippstadter Str. 122, 67663 Kaiserslautern, Germany
Graduate School of Engineering, Osaka Prefecture University, 1-1 Gakuen-cho, Naka-ku, Sakai, Osaka 599-8531, Japan
Corresponding author: Shoya Ishimaru (e-mail: [email protected]).
ABSTRACT
Not only correctness but also self-confidence plays an important role in improving the quality of knowledge. Undesirable situations such as confident incorrect and unconfident correct knowledge prevent learners from revising their knowledge, because it is not always easy for them to perceive these situations. To solve this problem, we propose a system that estimates self-confidence while solving multiple-choice questions by eye tracking and gives feedback about which questions should be reviewed carefully. We report the results of three studies measuring its effectiveness. (1) On a well-controlled dataset with 10 participants, our approach detected confidence and unconfidence with 81 % and 79 % average precision, respectively. (2) With the help of 20 participants, we observed that the correct answer rates of questions increased by 14 % and 17 % when giving feedback about correct answers without confidence and incorrect answers with confidence, respectively. (3) We conducted a large-scale data recording in a private school (72 high school students solved 14,302 questions) to investigate effective features and the number of required training samples.
INDEX TERMS
Eye tracking, in-the-wild study, learning augmentation, self-confidence estimation.
I. INTRODUCTION
Quantified learning – sensing learning behaviors to give each learner effective feedback based on context – has high potential in the era of digitalized education [1]. The appearance of smart sensors equipped on personal computers, tablets, smartphones, chairs, eyeglasses, etc. has enabled researchers to estimate various internal states such as engagement, boredom, tiredness, and self-confidence while learning [2], [3]. Among these internal states, the importance of self-confidence has been especially investigated in educational research. Self-confidence is a basis of metacognitive judgments and the most common paradigm in metacognitive domains, ranging from decision-making and reasoning [4], [5] to perceptual judgments [6], [7] and memory evaluations [8], [9]. It is a manifestation of metacognitive assessment of one's own knowledge or scholastic ability and is affected by proficiency, achievement, cognitive anxiety, and the difficulty of a task [10]. Several studies have reported that a positive increment in self-confidence enhances learners' engagement and performance [11], [12].
One of the most critical cases where self-confidence plays an important role is on multiple-choice questions (MCQ). An MCQ is a type of question that asks the learner to select the most appropriate choice from given options. Since the only information obtained from the answer to an MCQ is its correctness, it is hard to distinguish between the cases when a learner answers
FIGURE 1. An overview of Confidence-Aware Learning Assistant (CoALA). The system recommends questions which should be checked again on the basis of the correctness and the self-confidence estimated by eye tracking.
with confidence or when a learner answers randomly without confidence. We consider that there are four levels of the quality of knowledge, from high to low: correct with confidence, incorrect without confidence, correct without confidence, and incorrect with confidence. In particular, the last two cases are serious. In the case of correct without confidence, the answer will be treated as correct by chance, and the learner loses a chance to acquire correct knowledge. In the case of incorrect with confidence, the wrong knowledge may cause further misunderstandings.
As a solution to such serious cases, we propose a system which estimates self-confidence on MCQ by analyzing eye movements. Based on the estimation output, our system generates a report suggesting which questions should be reviewed, as shown in Figure 1. We define such an intelligent system that adapts learning materials for each learner based on self-confidence as Confidence-Aware Learning Assistant (CoALA). The idea of estimating self-confidence on MCQ by eye tracking was originally proposed by Yamada et al. [13]. Compared to their study, we aim at (1) proposing a user-independent approach considering a real scenario and (2) investigating the effectiveness as an end-to-end working system including feedback.
This paper presents our self-confidence estimation algorithm and the results of three studies measuring its effectiveness. In the first study, which involves 10 undergraduate students solving a total of 1,700 questions, we investigated the performance (precision and recall) of the self-confidence estimation. Then we trained a confidence estimator using this dataset and used it for the second study. This corresponds to a realistic scenario of applying the technology: an estimator is trained with the data from different learners. The second study consists of a pre-test, a review, and a post-test with the help of 20 participants. By comparing the results of the pre-test and the post-test, we evaluated how much our system improves the quality of knowledge in terms of the correctness of the answers and the students' confidence. For the third study, we deployed our system in a private school. During this five-week demonstration, 72 high school students solved 145,489 questions, and 14,302 questions were labeled by themselves. On this wild dataset, we discuss the limitations and future directions of our system. In summary, our contributions in this paper include:
• User-independent gaze-based self-confidence estimation on multiple-choice questions, which detects confidence and unconfidence with 81 % and 79 % average precision, respectively.
• An end-to-end confidence-based reviewing system, which increases correct answer rates by 14 % for unconfident correct answers and 17 % for confident incorrect answers compared to a controlled condition.
II. RELATED WORK
A. EYE TRACKING FOR LEARNING ASSISTANT
It has been shown that eye gaze reflects linguistic proficiency [14] and the degree of self-confidence. For example, the behavior of a student who does not understand the contents of a document is characterized by low reading speed and frequent rereading [15]. Thai et al. reported that a student's comprehension of a question appears in his/her eye movements, for example, in the rereading of the question [16]. Moreover, regarding the relation between eye behavior and self-confidence, it has been shown that low self-confidence is characterized by frequent rereading of questions and long gazes on choices [17]. Okoso et al. have proposed a method of extracting the parts of a document that are difficult for a reader to comprehend and found some effective features [18]. Lee et al. have proposed building a virtual tutor to support the learning of a student [19]. This work demonstrated that eye communication with a virtual tutor enhances the efficiency of learning. Oliver et al. have succeeded in estimating the English skill of non-native English speakers from their eye behaviors in an English test. The contribution of their work is estimating the skill with a small error from only a few documents [14]. Yamada et al. have tackled the automatic estimation of self-confidence by sensing and analyzing learners' problem-solving behaviors through eye movements [13]. However, their method works well only if the training can be done for each learner with a sufficient amount of data, which may not be realistic. In other learning subjects, for instance, Ishimaru et al. have investigated the reading behaviors of students on a Physics textbook [20]. They proposed Areas-of-Interest-based and subsequence-based approaches to predict expertise.
B. OTHER SENSING MODALITIES
Some researchers have measured the attention of students in learning by electroencephalogram (EEG) and investigated a correlation between attention and self-efficacy, which refers to the level of confidence of an individual with regard to their ability at task execution [21], [22]. Though a method using EEG can be a solution to estimate self-confidence, the device disturbs a user engaging in a task because it must always be attached to the head. On this point, the eye tracker is preferable because it can be attached to a display.
There has been growing interest in the study of the relation between cognitive performance and the autonomic nervous system (ANS). The activity of the ANS can also be measured by heart rate variability (HRV) [23], electrodermal activity (EDA) [24], [25], and so on. We did not utilize these approaches in this work because we had received comments from students in the private school that wearing sensors while studying requires a high physical workload. If we can record precise physiological signals with remote sensing, we will consider integrating them. For instance, the nose temperature, which can be measured by a commercial infrared thermography camera, can be a good candidate [26]. Abdelrahman et al. have recorded nose and forehead temperatures under different task difficulties and found significant changes [27].
Although mobile eye trackers have appeared, there is still a strong gap between controlled behaviors in the laboratory and natural behaviors in the wild. One of the critical issues in this research field is how we can conduct experiments in natural settings for proposing robust methods. Towards this objective, several researchers have conducted long-term and large-scale experiments (e.g., over 80 hours of recording with a mobile eye tracker [28] and 780 hours of recording with commercial electrooculography glasses [29]). Our work is in this context and has evaluated real learning behaviors.
C. ROLE OF SELF-CONFIDENCE IN LEARNING
Several studies have mentioned correlations between self-confidence or other cognitive states and the behaviors of people in specific tasks, including achievement tests of learning [30], cognitive tests [31], and cooking [32]. According to work by Forbes-Riley et al., adapting a user's self-confidence into a computer tutoring system improves learning efficiency and user satisfaction [33]. Kleitman et al. reported that a high level of self-confidence predicted high grades for primary school children [34]. Indeed, students with self-confidence awareness tend to be recognized for their performances, which develops their level of self-confidence again. This positive feedback loop motivates students to learn by themselves. In another study, Stankov et al. showed that self-confidence can be used to identify misconceptions [35]. A misconception occurs when a learner feels confident in the knowledge and thinks that he/she is answering correctly but actually gives an incorrect answer.
Roderer et al. gathered participants of several ages and found a correlation between the self-confidence of participants and their age: junior participants tended to have higher self-confidence than senior participants [36]. In contrast to this research, we gathered participants of almost the same age so as to investigate self-confidence using only information from answering. The studies referred to above, however, have only shown the importance of self-confidence. Our work, on the other hand, aims not only to find correlations but also to estimate self-confidence for practical applications.
D. POSITION OF OUR WORK
Most of the previous work focuses only on the scientific investigation of the importance of self-confidence. Only a limited number of research trials have tackled the automatic estimation of confidence. Moreover, the use of estimated confidence to improve the quality of knowledge has not been well attempted in the past. We consider that this is due to the following two limitations.
No general estimation – It is difficult to establish self-confidence estimation which is independent of environments, subjects, and learners. In other words, estimation methods may work well under a specific environment, for specific subjects and learners, but may not if those conditions are no longer satisfied. In the latter case, the estimation is unstable and less reliable. The important research question here is whether such estimation is still effective in improving the quality of knowledge.
No end-to-end system – Effectiveness should be evaluated as an end-to-end system including sensing, estimation, and feedback. It is often the case that parts work well but the system built by connecting them does not. Unfortunately, most of the previous work focuses on parts, and little has considered the end-to-end scenario. If the goal is to build a system capable of improving the quality of knowledge of learners, this standpoint is mandatory.
FIGURE 2. Examples of eye gaze on multiple-choice questions: (a) answer with confidence; (b) answer without confidence.
In summary, our work's main aim is to evaluate all the methods of sensing, estimation, and feedback not independently but as an end-to-end system, to prove that it can improve the quality of knowledge.
III. PROPOSED METHOD
The processing in our system consists of the following five steps: data recording, feature calculation, feature selection, classification, and feedback.
A. DATA RECORDING
The eye gaze of a user is recorded by a remote eye tracker attached at the bottom of a display. The output of the eye tracker includes the coordinates of the gaze on the display and their timestamps. Figure 2 shows the difference in eye gaze while solving MCQ with and without confidence. This preliminary observation indicates that confusion among choices appears in eye gaze as transitions between choices. The circle in the figure is the position the user is looking at, which is visualized for demonstration and calibration purposes and is invisible while solving questions.
Eye movements are composed of two events: fixations and saccades. A fixation is an event where the gaze pauses at a certain position over a certain period, usually a minimum of 100 ms. A rapid movement between fixations is called a saccade. We classify raw gaze into fixations and saccades using an algorithm proposed by Buscher et al. [37]. A blink – a rapid closing of the eyelid – is not analyzed in our method because the time required to solve one question (10–60 sec.) is too short to calculate statistical features. In addition, smooth pursuit occurs when a person tracks an object moving at slow speed, but this metric is not considered in our method because all information on the display is fixed.
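The segmentation of raw gaze into fixations can be illustrated with a simple dispersion-threshold (I-DT) detector. This is an illustrative stand-in, not the Buscher et al. algorithm used in the paper; the function name and the thresholds (100 ms minimum duration, 35 px maximum dispersion) are assumptions:

```python
# Sketch of a dispersion-threshold fixation detector (I-DT style).
# Thresholds are assumed values for illustration only.
def detect_fixations(gaze, min_duration_ms=100, max_dispersion_px=35):
    """gaze: list of (t_ms, x, y) samples sorted by time.
    Returns fixations as (t_start, t_end, center_x, center_y)."""
    fixations = []
    window = []
    for sample in gaze:
        window.append(sample)
        xs = [s[1] for s in window]
        ys = [s[2] for s in window]
        dispersion = (max(xs) - min(xs)) + (max(ys) - min(ys))
        if dispersion > max_dispersion_px:
            # Window is no longer compact: emit a fixation if it lasted long enough.
            *head, last = window
            if head and head[-1][0] - head[0][0] >= min_duration_ms:
                cx = sum(s[1] for s in head) / len(head)
                cy = sum(s[2] for s in head) / len(head)
                fixations.append((head[0][0], head[-1][0], cx, cy))
            window = [last]
    if len(window) > 1 and window[-1][0] - window[0][0] >= min_duration_ms:
        cx = sum(s[1] for s in window) / len(window)
        cy = sum(s[2] for s in window) / len(window)
        fixations.append((window[0][0], window[-1][0], cx, cy))
    return fixations
```

The gaps between consecutive fixations can then be treated as saccades.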
B. FEATURE CALCULATION
We define Areas of Interest (AOIs) as rectangles covering the question and each choice in order to recognize deeper behaviors (e.g., the ratio of reading times on the question and choices, the decision process of comparing choices, etc.). Fixations and saccades are automatically associated with the corresponding AOIs in this step. Then we extract the 30 features shown in Table 1. Features 1–14 are related to fixations, and Features 15–28 are about saccades. We also use the reading time and the correctness of the answer as features.
TABLE 1.
The list of the features
No. Feature
1–2 Fixation {count, ratio} on Choices
3–4 Fixation {count, ratio} on Question
5–8 {Sum, mean, max, min} of fixation durations on Choices
9–12 {Sum, mean, max, min} of fixation durations on Question
13–14 Variance of {x, y} coordinate of fixations
15–16 {Sum, mean} of saccade length
17–20 Saccade count {all, on Question, between Choices, between Question and Choices}
21–24 {Sum, mean, max, min} of saccade durations
25–28 {Sum, mean, max, min} of saccade speeds
29 Reading-time
30 Correctness of the answer
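A few of the Table 1 features can be computed as sketched below, once fixations and saccades have been associated with their AOIs. The record layout and key names (`aoi`, `from_aoi`, etc.) are assumptions for illustration, not the paper's implementation:

```python
# Sketch of computing a subset of the Table 1 features from fixations and
# saccades already labeled with their AOI ("question" or "choice_0".."choice_3").
from statistics import mean, pvariance

def extract_features(fixations, saccades, reading_time_s, correct):
    """fixations: list of dicts {"aoi": str, "duration": ms, "x": px, "y": px}
    saccades: list of dicts {"from_aoi": str, "to_aoi": str, "length": px}"""
    on_choices = [f for f in fixations if f["aoi"].startswith("choice")]
    between_choices = [s for s in saccades
                       if s["from_aoi"].startswith("choice")
                       and s["to_aoi"].startswith("choice")
                       and s["from_aoi"] != s["to_aoi"]]
    n = max(len(fixations), 1)
    return {
        "f1_fix_count_choices": len(on_choices),                       # feature 1
        "f2_fix_ratio_choices": len(on_choices) / n,                   # feature 2
        "f5_sum_fix_dur_choices": sum(f["duration"] for f in on_choices),  # feature 5
        "f13_var_x_fixations": pvariance([f["x"] for f in fixations])
                               if len(fixations) > 1 else 0.0,         # feature 13
        "f16_mean_saccade_len": mean(s["length"] for s in saccades)
                                if saccades else 0.0,                  # feature 16
        "f19_saccades_between_choices": len(between_choices),          # feature 19
        "f29_reading_time": reading_time_s,                            # feature 29
        "f30_correctness": int(correct),                               # feature 30
    }
```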
C. FEATURE SELECTION
We select effective features from the above 30 candidates because increasing the number of features does not always increase classification performance. We utilize the following simple hill-climbing strategy, called forward stepwise selection. First, we create a subset of features, which is empty in the initial state. Then we calculate the average precision score of an estimation using each single feature and insert the best-performing feature into the subset. Next, the performances of estimations using the features in the subset plus one new feature are calculated, and the best combination is kept again. These processes are repeated as long as the new subset performs better than the old one. Two-fold cross-validation is used for this feature selection. Note that this step is performed only while training a classifier with training samples. The selected features are then used to classify unknown samples.
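The forward stepwise search described above can be sketched as follows. The `score` callback stands in for the 2-fold cross-validated average precision of a classifier trained on the candidate subset; it is kept abstract here so the search logic stands on its own, and all names are illustrative:

```python
# Sketch of the forward stepwise ("hill-climbing") feature selection.
# score(features) -> float is any evaluator, e.g. the 2-fold CV average
# precision of an SVM trained on that feature subset.
def forward_stepwise(all_features, score):
    selected = []
    remaining = list(all_features)
    best = float("-inf")
    while remaining:
        # Try adding each remaining feature to the current subset.
        trial = [(score(selected + [f]), f) for f in remaining]
        s, f = max(trial)
        if s <= best:  # stop when no candidate improves on the old subset
            break
        best = s
        selected.append(f)
        remaining.remove(f)
    return selected, best
```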
D. CLASSIFICATION
We estimate the self-confidence of answers by a Support Vector Machine (SVM) with the selected features. The Radial Basis Function (RBF) kernel with penalty parameter C = 1 and γ = 0. were selected experimentally and are used for the SVM. In a preliminary study, we tested other machine learning techniques, including Random Forest, and found that SVM performs the best overall in our classification task.

E. FEEDBACK TO A LEARNER
By combining the correctness and the estimated confidence, the answers of a learner are categorized into four groups: correct with confidence, correct without confidence, incorrect with confidence, and incorrect without confidence. As shown in Figure 1, the system highlights questions that should be specially reviewed. A learner can report if the output is wrong; the data are then stored to personalize upcoming estimations.
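The grouping and review-list logic of the feedback step can be sketched as follows; the group names follow the text, while the data layout is an assumption:

```python
# Sketch of the feedback step: combining correctness with estimated
# confidence into the four knowledge-quality groups, and flagging the
# two groups the system asks the learner to review.
GROUPS = {
    (True, True):   "correct with confidence",
    (True, False):  "correct without confidence",    # flagged for review
    (False, True):  "incorrect with confidence",     # flagged for review
    (False, False): "incorrect without confidence",
}

def build_report(answers):
    """answers: list of dicts {"qid": ..., "correct": bool, "confident": bool}.
    Returns (per-question group labels, question ids to review)."""
    labeled, review = {}, []
    for a in answers:
        key = (a["correct"], a["confident"])
        labeled[a["qid"]] = GROUPS[key]
        # Highlight vague knowledge and misunderstandings for careful review.
        if key in ((True, False), (False, True)):
            review.append(a["qid"])
    return labeled, review
```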
IV. STUDY 1: EVALUATION OF SELF-CONFIDENCE ESTIMATION
In this first study, we evaluate the performance of the self-confidence estimation. We involved 10 participants with the same background in creating a well-designed dataset. This section explains the procedures and the results.
FIGURE 3. 11-point precision-recall curves of the confidence and unconfidence detection (average precision – confidence, gaze and reading time: 0.81; confidence, reading time only: 0.80; unconfidence, gaze and reading time: 0.79; unconfidence, reading time only: 0.73).
A. EXPERIMENTAL DESIGN
We invited 10 participants (male: 5, female: 5) to our laboratory to solve 170 MCQ about English vocabulary and grammar. All the participants were first-year Japanese undergraduate students. We utilized a Tobii 4C remote 90 Hz eye tracker for this data recording. Note that an upgrade key provided by Tobii was applied to use this device for scientific purposes. Participants answered the most appropriate word for a blank in a question from the given choices. After answering each question, they answered the survey question “Do you have confidence in your decision?” with Yes or No. Answers to this questionnaire were used as ground truth labels (referred to as true confidence in this paper). We applied the random over-sampling in imbalanced-learn to create a balanced dataset.

B. SELF-CONFIDENCE ESTIMATION PERFORMANCE
Figure 3 shows the 11-point precision-recall curves of the confidence detection and unconfidence detection among all participants. This result indicates that our confidence estimation performs accurately enough, relatively better for confidence detection than for unconfidence detection (average precisions: 81 % and 79 %). Since the labels of confidence were balanced, the chance ratio of the estimation is 50 %. The features selected from this recording were as follows: f5: sum of fixation durations on choices, f13: variance of x coordinate of fixations, f19: the number of saccades between choice areas, f21: sum of saccade durations, and f29: reading-time. Since some of the selected features are correlated with each other, a single feature, i.e., only reading-time, might be enough to classify confident and unconfident answers. However, the eye gaze features improve the performance, in particular for unconfidence detection.

C. OBSERVATION OF MISCLASSIFICATIONS
We describe the difference in eye gaze between the cases where a participant answered with confidence and without confidence. Figure 4 displays some examples of the estimation results. The circles represent the fixations, and the diameter of each circle is proportional to the fixation duration.
FIGURE 4. Examples of eye gaze for each classification result: (a) true confident estimated as confident; (b) true unconfident estimated as confident; (c) true confident estimated as unconfident; (d) true unconfident estimated as unconfident.
Hence, the longer a participant looked at a point, the larger the diameter of the fixation is. The lines between circles represent the saccades.
Figure 4 (a) is an example of the eye gaze of a participant who answered with confidence. Figure 4 (d) is an example of the eye gaze of a participant who answered without confidence. We can see that confidence in answering is characterized by fewer eye movements and smaller fixation diameters, while unconfidence is characterized by complex eye movements and longer fixation durations.
In Figure 4 (b), a participant answered without confidence, but the classifier estimated that he answered with confidence. We assume that he gave up on answering this question correctly because he did not have the necessary knowledge. In such a case, the number of fixations is small, and the participant took a short time to answer the question. These characteristics are common to Figure 4 (a), which represents a confident decision. Therefore, the classifier estimated it as a confident decision.
In Figure 4 (c), a participant answered with confidence, but the classifier estimated that he answered without confidence. We assume that this participant decided his answer carefully by eliminating irrelevant choices one by one. In such a case, we find more fixations and frequent transitions of the eyes between rectangles. This characteristic is common to Figure 4 (d), which represents an unconfident answer.
V. STUDY 2: EVALUATION OF SELF-CONFIDENCE-BASED FEEDBACK
To evaluate the effectiveness of feedback based on self-confidence, we utilized the classifier from the first recording and prepared the end-to-end review feedback system for the second study. This section explains the details of the experiment and addresses the following research hypotheses.
FIGURE 5. The procedure of the feedback study.
• RH1 – Questions answered correctly without confidence (vague knowledge) tend to be forgotten compared to knowledge with confidence.
• RH2 – Questions answered incorrectly with confidence (misunderstandings) tend to be mistaken again compared to wrong knowledge without confidence.
• RH3 – Estimating self-confidence from learning behaviors and giving feedback (e.g., adding questions to a review list, highlighting them while reviewing) avoids such scenarios.
A. EXPERIMENTAL DESIGN
We employed 20 participants (undergraduate and graduate school students, age: 18–25, male: 14, female: 6) and monitored the transition of their performance. For questions, we prepared three levels of MCQ about English grammar: Level 1 (easy, 170 questions), Level 2 (normal, 290 questions), and Level 3 (hard, 160 questions). Each question requires selecting the most appropriate word for a blank from four choices. Eye movements on the questions were recorded by a Tobii 4C remote 90 Hz eye tracker with an upgrade key. Figure 5 shows the experimental procedure. We invited participants for three days and asked them to perform the following tasks. One-day breaks were inserted between task days. Participants who completed the tasks received 5,000 JPY.
Trial (the first day) – Each participant solved 10 questions across the three levels as a trial. Two reasons were behind this trial: getting participants used to the MCQ interface and selecting an appropriate degree of difficulty. If the questions are too easy or too difficult, the dataset will be unbalanced, and we cannot show any transition in performance. Based on the results, we selected the level whose correct answer rate was closest to 50 %.
Pre-Test (the first day) – After choosing the suitable level, each participant answered 120 questions of the selected level and reported his/her self-confidence after each decision. Besides, the result page (see Figure 1) with correctness and self-confidence estimated on the basis of the training dataset appeared after answering every 10 questions. We instructed participants to press the “Read the answer” button for self-review, except for the questions correctly answered with confidence.
We recorded 2,075 answers in total. Based on the correctness and estimated self-confidence, we categorized them into four groups: (1) correct with confidence, (2) correct without confidence, (3) incorrect without confidence, and (4) incorrect with confidence. The role of our system is to identify (2) and (4) and suggest that a learner review them. In order to evaluate the effectiveness of the system, we gave feedback for half of (2) and (4) (see Figure 6). In the following, the samples without feedback are called the controlled groups (2a) and (4a), and the samples with feedback are the experimental groups (2b) and (4b).
FIGURE 6. The distribution of questions in the feedback study: (1) correct with confidence: 53.5 %; (2a) correct without confidence: 5.6 %; (2b) correct without confidence with feedback: 5.1 %; (3) incorrect without confidence: 5.1 %; (4a) incorrect with confidence: 16.5 %; (4b) incorrect with confidence with feedback: 14.1 %.
Review (the third day) – Participants answered review questions generated based on the first day's feedback. Wrong answers (3) and (4) were inserted into the review list. In addition, we added (2b) to the list. During the review, (2b) and (4b) were emphasized on the question page. After solving each question, each participant reported his/her self-confidence. The result page with correctness and estimated self-confidence was shown for every 10 questions. We asked participants to press the “Read the answer” button again for self-review, except for the questions correctly answered with confidence. The order of questions and choices was shuffled from the pre-test.
Post-Test (the fifth day) – Participants solved the same 120 questions as in the pre-test. They reported confidence in their decision for each question and checked the result page every 10 questions, as on the first and third days. The order of questions and choices was shuffled from the review.
B. IMPORTANCE OF SELF-CONFIDENCE ESTIMATION
Figure 7 shows the results of the effect of our review feedback. For this investigation, we divided all questions into two groups: answered correctly or incorrectly at the pre-test. Then we compared their correctness at the post-test under each condition.
If a participant answered correctly at the pre-test, he/she should be able to select the right choice again when asked the same question. However, some answers were wrong at the post-test. Figure 7 (a) reports how many questions were forgotten. As a result, the correct answer ratio of (2a) correct answers without confidence dropped 16 % compared to (1) answers with confidence (evaluated by Welch's t-test). In other words, answers without confidence tend to be forgotten in the near future, and therefore they should be included in the review list. We observed that questions answered without confidence could not always be answered correctly when asked again (RH1 is true). There is not much difference in the correctness at the post-test between wrong answers with and without confidence (RH2 is not always true).
FIGURE 7. The mean correct answer rates among 20 participants at the post-test: (a) correct at the pre-test – (1) correct with confidence: 0.82, (2a) correct without confidence: 0.66, (2b) correct without confidence with feedback: 0.80; (b) incorrect at the pre-test – (3) incorrect without confidence: 0.65, (4a) incorrect with confidence: 0.64, (4b) incorrect with confidence with feedback: 0.81. The symbols ** and * indicate p < 0.01 and p < 0.05, respectively.
C. EFFECT OF FEEDBACK
Figure 7 also shows that the feedback succeeded in improving the mean correct answer rate. The performance of the experimental groups was 14 % higher than that of the controlled group for the feedback about correct and unconfident questions (see Figure 7 (a)), and 17 % higher for incorrect and confident questions (see Figure 7 (b)). Highlighting questions that were answered incorrectly with confidence could increase the probability of maintaining the correct answers in mind (RH3 is true).
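As a minimal sketch of the Welch's t-test used for these group comparisons, the test statistic and the Welch–Satterthwaite degrees of freedom can be computed from two groups of per-participant correct answer rates as below. The data are toy values, not the study's; in practice a library routine such as scipy.stats.ttest_ind with equal_var=False also yields the p-value:

```python
# Sketch of Welch's t-test for two groups with unequal variances:
# t statistic and Welch-Satterthwaite degrees of freedom.
from statistics import mean, variance

def welch_t(a, b):
    va, vb = variance(a) / len(a), variance(b) / len(b)
    t = (mean(a) - mean(b)) / (va + vb) ** 0.5
    # Welch-Satterthwaite approximation of the degrees of freedom.
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df
```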
D. QUALITY OF KNOWLEDGE
Let us show how the quality of knowledge changes through the feedback with the estimated confidence. Figure 8 represents the transitions of levels – correctness and reported confidence – between the pre-test and post-test. The controlled groups (randomly selected no-feedback samples) are not included in this chart. After the review, the number of correct answers with confidence increased compared to the other three groups. In addition, an interesting finding from this chart is that participants were able to assess their state of knowledge better after the review. Many correct but unconfident answers changed to correct with confidence, and the percentages of correct answers without confidence and incorrect answers with confidence decreased. From the results mentioned above, the feedback is effective in improving the quality of knowledge.
FIGURE 8.
Transitions of correctness and estimated confidence before (left) and after (right) the review.
VI. STUDY 3: DEPLOYMENT IN THE WILD
The previous sections have demonstrated how CoALA estimates self-confidence and how much it improves learning performance. However, unexpected problems that do not occur under laboratory conditions commonly happen in the wild. In this section, we report findings from a deployment in a real classroom environment.
A. EXPERIMENTAL DESIGN
We collaborated with a private school and deployed our system there. Students solved MCQ about English vocabulary on the system. Then they printed out a list of words involving incorrect answers and correct answers with low self-confidence. The questions were prepared by the private school. The main purpose of this deployment was not to record data but to demonstrate the system in a real environment. Therefore, unlike the previous two studies, we did not constrain students' natural behaviors. Calibration of the eye tracker was performed once before a student started using the system. We asked for the self-confidence of the decision (ground truth labels) once every five questions. Each student had their own username in order to track who solved which question with or without confidence. The number of solved questions depended on the student. We utilized a Tobii 4C remote 90 Hz eye tracker with an upgrade key. The duration of this demonstration was around five weeks. 83 students used our system, and we collected 145,489 solving behaviors in total. We evaluated our proposed self-confidence estimation on this dataset with leave-one-participant-out cross-validation.
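The leave-one-participant-out protocol can be sketched as follows: every student in turn becomes the test set while the remaining students form the training set. This is equivalent in spirit to scikit-learn's LeaveOneGroupOut; the record layout is an assumption:

```python
# Sketch of leave-one-participant-out cross-validation: each student's
# samples are held out once, and the rest form the training set.
def leave_one_participant_out(samples):
    """samples: list of dicts with a "student" key.
    Yields (held_out_student, train_samples, test_samples)."""
    students = sorted({s["student"] for s in samples})
    for held_out in students:
        train = [s for s in samples if s["student"] != held_out]
        test = [s for s in samples if s["student"] == held_out]
        yield held_out, train, test
```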
B. PRE-PROCESSING
Since real recordings included many noisy behaviors, the following filters were applied to obtain a reliable dataset. (1) Only labeled data were analyzed in this study. (2) Data with invalid usernames (e.g., guest) were filtered out. (3) Data with only a little eye gaze (a ratio of valid gaze coordinates of less than 80 % of one recording) were also ignored. Finally, the wild dataset consists of 14,302 valid samples from 72 students.
FIGURE 9. 11-point precision-recall curves on the wild dataset (average precision – confidence, gaze and reading time: 0.79; confidence, reading time only: 0.76; unconfidence, gaze and reading time: 0.78; unconfidence, reading time only: 0.77).
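The three filters of the pre-processing step can be sketched as follows; the field names and the guest-username check are assumptions for illustration:

```python
# Sketch of the three pre-processing filters applied to the wild dataset.
def filter_wild_dataset(records, min_valid_gaze_ratio=0.8):
    """Keep only labeled records with a real username and enough valid gaze."""
    kept = []
    for r in records:
        if r.get("confidence_label") is None:              # (1) labeled data only
            continue
        if not r["username"] or r["username"] == "guest":  # (2) valid username
            continue
        if r["valid_gaze_ratio"] < min_valid_gaze_ratio:   # (3) enough valid gaze
            continue
        kept.append(r)
    return kept
```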
In a real learning scenario, we cannot ask students to calibrate an eye tracker many times. They frequently move their heads and change seating positions. Therefore, eye gaze in the wild dataset was not as precise as the data recorded in the laboratory. This causes problems in our feature calculation because the AOIs are predefined as absolute coordinates on the display. However, an interesting finding from scan path images is that the relative positional relationship between gazes on the question and on the choices remains correct even if they are shifted. In order to solve this issue, we decided to define AOIs with a new approach. From all fixations in one recording, we calculate the maximum and minimum x and y coordinates. Then the AOIs are defined on the basis of relative positions in this space. In our question format, the area of the question is the top 34 % of this space, and the areas of the choices divide the remaining 66 % bottom part into a cross.

C. CONFIDENCE ESTIMATION RESULTS IN THE WILD
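The relative AOI scheme described in the pre-processing above can be sketched as follows, assuming screen coordinates with y increasing downward. The 34 %/66 % split follows the text, while the function names and return format are assumptions:

```python
# Sketch of relative AOI definition: AOIs are placed inside the bounding
# box of all fixations in one recording, with the question in the top 34 %
# and the four choices splitting the remaining area into a 2x2 grid.
def relative_aoi(fixations):
    """fixations: list of (x, y). Returns a function mapping (x, y) -> AOI."""
    xs = [x for x, _ in fixations]
    ys = [y for _, y in fixations]
    min_x, max_x, min_y, max_y = min(xs), max(xs), min(ys), max(ys)
    split_y = min_y + 0.34 * (max_y - min_y)  # question / choices boundary
    mid_x = (min_x + max_x) / 2               # left / right choice columns
    mid_y = (split_y + max_y) / 2             # upper / lower choice rows
    def classify(x, y):
        if y <= split_y:
            return "question"
        col = 0 if x <= mid_x else 1
        row = 0 if y <= mid_y else 1
        return f"choice_{row * 2 + col}"
    return classify
```

This makes the AOI assignment robust to a constant shift of the gaze, since only the relative positions of fixations within one recording matter.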
We utilized the recorded data for training the estimator, and Figure 9 shows the estimation results. As in the laboratory study, our approach detected confidence and unconfidence better than the estimator using reading-time only. f1: fixation count on choices, f8: minimum fixation duration on choices, f12: minimum fixation duration on question, f29: reading-time, and f30: correctness of the answer were selected as features.

D. EFFECTIVE FEATURES
Figure 10 shows the features selected on the laboratory dataset (the first study) and the wild dataset (the third study). In both conditions, f29: reading-time has a negative correlation with self-confidence and was selected as a feature. Most of the calculated features are negatively correlated with self-confidence. This is because the longer a learner takes to consider a question, the more fixations and saccades are observed. Interestingly, a feature that is highly correlated with self-confidence is not necessarily selected by the classifier. Furthermore, a feature that is not correlated individually can play an important role in combination with other features.
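A forward stepwise selection of this kind can be sketched with scikit-learn's `SequentialFeatureSelector`. The logistic-regression scorer model and the cross-validation settings are assumptions for illustration, not the paper's exact procedure.

```python
import numpy as np
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

def forward_select(X, y, n_features=5):
    """Forward stepwise feature selection: starting from an empty set,
    repeatedly add the feature that most improves cross-validated
    average precision, until n_features are chosen."""
    selector = SequentialFeatureSelector(
        LogisticRegression(max_iter=1000),  # stand-in scorer model
        n_features_to_select=n_features,
        direction="forward",
        scoring="average_precision",
        cv=3,
    )
    selector.fit(X, y)
    # Indices of the selected feature columns
    return np.flatnonzero(selector.get_support())
```

Because features are scored jointly with those already chosen, a feature with weak individual correlation can still be selected, which matches the observation above.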
FIGURE 10. Pearson correlations between self-confidence and each feature on (a) the laboratory dataset and (b) the wild dataset. Features selected by the forward stepwise selection are highlighted in red. (circle: positive, triangle: negative correlation; sorted by absolute value)
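The per-feature Pearson correlations summarized in Figure 10 can be computed in a few lines. Mapping column j to the feature name f(j+1) is an assumption for illustration.

```python
import numpy as np

def feature_confidence_correlations(X, y):
    """Pearson correlation of each feature column with the binary
    self-confidence label, sorted by absolute value (as in Figure 10).

    Returns a list of (feature_name, correlation) pairs; column j is
    assumed to correspond to feature f(j+1).
    """
    corrs = []
    for j in range(X.shape[1]):
        c = np.corrcoef(X[:, j], y)[0, 1]
        corrs.append((f"f{j + 1}", float(c)))
    return sorted(corrs, key=lambda t: abs(t[1]), reverse=True)
```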
E. THE NUMBER OF TRAINING SAMPLES
Figure 11 shows the relation between the number of training samples and the performance. Average precisions increased until the number of training samples reached 200. Increasing beyond 200 samples did not contribute to further improvement, but the more training samples we had, the smaller the standard deviation of the results.
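A learning-curve experiment of this kind can be sketched as follows. The random-forest classifier, subsample sizes, and repeat count are placeholders rather than the paper's exact setup; the sketch assumes both classes appear in every subsample.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import average_precision_score

def learning_curve_ap(X_train, y_train, X_test, y_test,
                      sizes=(50, 100, 200, 400), repeats=10, seed=0):
    """Average precision as a function of training-set size.

    For each size n, draw n training samples at random (without
    replacement), train a classifier, score on the fixed test set,
    and repeat; returns {n: (mean_ap, std_ap)}.
    """
    rng = np.random.default_rng(seed)
    results = {}
    for n in sizes:
        scores = []
        for _ in range(repeats):
            idx = rng.choice(len(X_train), size=n, replace=False)
            clf = RandomForestClassifier(n_estimators=50, random_state=0)
            clf.fit(X_train[idx], y_train[idx])
            proba = clf.predict_proba(X_test)[:, 1]
            scores.append(average_precision_score(y_test, proba))
        results[n] = (float(np.mean(scores)), float(np.std(scores)))
    return results
```

Plotting the mean against n, with the standard deviation as an error band, reproduces the shape of Figure 11: the mean plateaus while the band narrows as n grows.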
VII. DISCUSSION
Studies 1 and 3 have given us interesting findings for improving the system. In the first study, we evaluated our gaze-based self-confidence estimation on MCQ. The combination of gaze features and reading-time estimated self-confidence better than reading-time alone (average precisions: 0.81 and 0.79). One possible reason for the weaker contribution of gaze features compared to the previous report [13] is that we aim to develop a system that starts with a user-independent estimation, although individual learners have their own characteristic eye movements. Our system has a function for collecting learners' feedback on the estimation results (see Figure 1), and the personalization of the estimation remains future work.

The third study demonstrated that our self-confidence estimation works in a wild condition such as a real classroom environment, where the system cannot be calibrated frequently. Instead of utilizing self-calibration approaches [38], [39], calculating features from relative-position-based AOIs performed well enough in our use case. The number of training samples does not seem to be the critical factor in this task. Rather than collecting similar answers, recording solving behaviors on varied types and levels of questions with short and long reading-times should improve the estimation.

Another limitation of our studies is that the characteristics of the questions in the two datasets were different. Since we could not control the difficulty level of the questions in the wild dataset (the third study), the questions seem to have been easy for the participants, and there are more correct answers than incorrect answers.

FIGURE 11. Average precisions on different numbers of samples randomly selected from the wild dataset (confidence and unconfidence detection).

The studies mentioned in Section II mainly focus on individual contexts and components, and thus it is hard to find an evaluation of a whole system in an end-to-end manner. For example, little has been known about how accurate the estimation should be to achieve the goal, which is, in our case, to improve the quality of knowledge. Our second study indicated that questions answered with vague knowledge tend to be forgotten compared to knowledge held with confidence (decreased by 16 %), and our confidence-based feedback avoided this drop.

An important issue is whether it is still meaningful to give feedback based on a noisy estimation of self-confidence. In order to establish a system that improves learners' performance, the end-to-end viewpoint must be incorporated into the evaluation. Although there is still room for improvement in our self-confidence estimation, we observed improved learning performances.
VIII. CONCLUSION
We have proposed Confidence-Aware Learning Assistant (CoALA), which estimates self-confidence on MCQ by analyzing eye movements and generates a report suggesting which questions should be reviewed. The self-confidence estimation algorithm was evaluated in both the laboratory and the wild condition. By utilizing an estimator pre-trained on the laboratory dataset, we conducted a user study of the review feedback. Our end-to-end confidence-based review increased correct answer rates by 14 % for unconfident correct answers and 17 % for confident incorrect answers compared to a controlled condition. By visualizing the transitions of correctness and reported self-confidence between a pre-test and a post-test, we observed that the quality of knowledge increased. We conclude that CoALA is helpful for learners.

In future work, we will apply our method to other subjects, including mathematics, science, and social studies. We expect a successful estimation of self-confidence on any MCQ that a student can answer just by looking at a display and thinking about the question. Moreover, we aim to apply our method to questions that do not include choices. In this work, designing AOIs for the question and each choice was key to obtaining effective features, so we will need to find new features for such questions.
REFERENCES
[1] Andreas Dengel. Digital co-creation and augmented learning. In Proceedings of the 11th International Knowledge Management in Organizations Conference on The Changing Face of Knowledge Management Impacting Society, page 3. ACM, 2016.
[2] Rafael A. Calvo and Sidney D'Mello. Affect detection: An interdisciplinary review of models, methods, and their applications. IEEE Transactions on Affective Computing, 1(1):18–37, 2010.
[3] Mark Warschauer and Carla Meskill. Technology and second language teaching. Handbook of Undergraduate Second Language Education, 15:303–318, 2000.
[4] Rakefet Ackerman and Valerie A. Thompson. Meta-reasoning. Reasoning as Memory, pages 164–182, 2015.
[5] Logan Fletcher and Peter Carruthers. Metacognition and reasoning. Phil. Trans. R. Soc. B, 367(1594):1366–1378, 2012.
[6] Stephen M. Fleming, Brian Maniscalco, Yoshiaki Ko, Namema Amendi, Tony Ro, and Hakwan Lau. Action-specific disruption of perceptual confidence. Psychological Science, 26(1):89–98, 2015.
[7] Megan A. K. Peters and Hakwan Lau. Human observers have optimal introspective access to perceptual processes even for visually masked stimuli. eLife, 4:e09651, 2015.
[8] John Dunlosky, Michael J. Serra, Greg Matvey, and Katherine A. Rawson. Second-order judgments about judgments of learning. The Journal of General Psychology, 132(4):335–346, 2005.
[9] Bridgid Finn and Janet Metcalfe. The role of memory for past test in the underconfidence with practice effect. Journal of Experimental Psychology: Learning, Memory, and Cognition, 33(1):238, 2007.
[10] Richard Clément, Zoltán Dörnyei, and Kimberly A. Noels. Motivation, self-confidence, and group cohesion in the foreign language classroom. Language Learning, 44(3):417–448, 1994.
[11] Elizabeth A. Linnenbrink and Paul R. Pintrich. The role of self-efficacy beliefs in student engagement and learning in the classroom. Reading & Writing Quarterly, 19(2):119–137, 2003.
[12] Jon-Chao Hong, Ming-Yueh Hwang, Kai-Hsin Tai, and Chi-Ruei Tsai. An exploration of students' science learning interest related to their cognitive anxiety, cognitive load, self-confidence and learning progress using inquiry-based learning with an iPad. Research in Science Education, pages 1–20, 2017.
[13] Kento Yamada, Koichi Kise, and Olivier Augereau. Estimation of confidence based on eye gaze: an application to multiple-choice questions. In Proceedings of the 2017 ACM International Joint Conference on Pervasive and Ubiquitous Computing and Proceedings of the 2017 ACM International Symposium on Wearable Computers, pages 217–220. ACM, 2017.
[14] Olivier Augereau, Hiroki Fujiyoshi, and Koichi Kise. Towards an automated estimation of English skill via TOEIC score based on reading analysis. In Pattern Recognition, 2016 23rd International Conference on, pages 1285–1290. IEEE, 2016.
[15] Keith Rayner. Eye movements in reading and information processing: 20 years of research. Psychological Bulletin, 124(3):372, 1998.
[16] Meng-Jung Tsai, Huei-Tse Hou, Meng-Lung Lai, Wan-Yi Liu, and Fang-Ying Yang. Visual attention for solving multiple-choice science problem: An eye-tracking analysis. Computers & Education, 58(1):375–385, 2012.
[17] Kazuaki Kojima, Keiich Muramatsu, and Tatsunori Matsui. Experimental study toward estimation of a learner mental state from processes of solving multiple choice problems based on eye movements. In Proceedings of the 20th International Conference on Computers in Education, pages 81–85, 2012.
[18] Ayano Okoso, Takumi Toyama, Kai Kunze, Joachim Folz, Marcus Liwicki, and Koichi Kise. Towards extraction of subjective reading incomprehension: Analysis of eye gaze features. In Proceedings of the 2015 CHI Conference on Human Factors in Computing Systems: Extended Abstracts, pages 1325–1330. ACM, 2015.
[19] Hanju Lee, Yasuhiro Kanakogi, and Kazuo Hiraki. Building a responsive teacher: how temporal contingency of gaze interaction influences word learning with virtual tutors. Royal Society Open Science, 2(1):140361, 2015.
[20] Shoya Ishimaru, Syed Saqib Bukhari, Carina Heisel, Nicolas Großmann, Pascal Klein, Jochen Kuhn, and Andreas Dengel. Augmented learning on anticipating textbooks with eye tracking. In Positive Learning in the Age of Information, pages 387–398. Springer, 2018.
[21] Jerry Chih-Yuan Sun and Katherine Pin-Chen Yeh. The effects of attention monitoring with EEG biofeedback on university students' attention and self-efficacy: The case of anti-phishing instructional materials. Computers & Education, 106:73–82, 2017.
[22] Lu-Ho Hsia, Iwen Huang, and Gwo-Jen Hwang. Effects of different online peer-feedback approaches on students' performance skills, motivation and self-efficacy in a dance course. Computers & Education, 96:55–71, 2016.
[23] Antonio Luque-Casado, Mikel Zabala, Esther Morales, Manuel Mateo-March, and Daniel Sanabria. Cognitive performance and heart rate variability: the influence of fitness level. PLoS ONE, 8(2):e56935, 2013.
[24] Iuliia Brishtel, Shoya Ishimaru, Olivier Augereau, Koichi Kise, and Andreas Dengel. Assessing cognitive workload on printed and electronic media using eye-tracker and EDA wristband. In Proceedings of the 23rd International Conference on Intelligent User Interfaces Companion, page 45. ACM, 2018.
[25] Hugo D. Critchley. Electrodermal responses: what happens in the brain. The Neuroscientist, 8(2):132–142, 2002.
[26] Shoya Ishimaru, Soumy Jacob, Apurba Roy, Syed Saqib Bukhari, Carina Heisel, Nicolas Großmann, Michael Thees, Jochen Kuhn, and Andreas Dengel. Cognitive state measurement on learning materials by utilizing eye tracker and thermal camera. In Proceedings of the 14th IAPR International Conference on Document Analysis and Recognition, volume 8, pages 32–36. IEEE, 2017.
[27] Yomna Abdelrahman, Eduardo Velloso, Tilman Dingler, Albrecht Schmidt, and Frank Vetere. Cognitive heat: exploring the usage of thermal imaging to unobtrusively estimate cognitive load. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, 1(3):33, 2017.
[28] Julian Steil and Andreas Bulling. Discovery of everyday human activities from long-term visual behaviour using topic models. In Proceedings of the 2015 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pages 75–85. ACM, 2015.
[29] Shoya Ishimaru, Kensuke Hoshika, Kai Kunze, Koichi Kise, and Andreas Dengel. Towards reading trackers in the wild: detecting reading activities by EOG glasses and deep neural networks. In Proceedings of the 2017 ACM International Joint Conference on Pervasive and Ubiquitous Computing and Proceedings of the 2017 ACM International Symposium on Wearable Computers, pages 704–711. ACM, 2017.
[30] Radka Jersakova, Richard J. Allen, Jonathan Booth, Céline Souchay, and Akira R. O'Connor. Understanding metacognitive confidence: Insights from judgment-of-learning justifications. Journal of Memory and Language, 97:187–207, 2017.
[31] Sabina Kleitman and Jennifer Gibson. Metacognitive beliefs, self-confidence and primary learning environment of sixth grade students. Learning and Individual Differences, 21(6):728–735, 2011.
[32] Jennifer A. Pooler, Ruth E. Morgan, Karen Wong, Margaret K. Wilkin, and Jonathan L. Blitstein. Cooking matters for adults improves food resource management skills and self-confidence among low-income participants. Journal of Nutrition Education and Behavior, 49(7):545–553, 2017.
[33] Katherine Forbes-Riley and Diane J. Litman. Adapting to student uncertainty improves tutoring dialogues. In AIED, pages 33–40, 2009.
[34] Sabina Kleitman, Lazar Stankov, Carl Martin Allwood, Sarah Young, and Karina Kar Lee Mak. Metacognitive self-confidence in school-aged children. In Self-directed Learning Oriented Assessments in the Asia-Pacific, pages 139–153. Springer, 2012.
[35] Lazar Stankov, Jihyun Lee, Wenshu Luo, and David J. Hogan. Confidence: A better predictor of academic achievement than self-efficacy, self-concept and anxiety? Learning and Individual Differences, 22(6):747–758, 2012.
[36] Thomas Roderer and Claudia M. Roebers. Can you see me thinking (about my answers)? Using eye-tracking to illuminate developmental differences in monitoring and control skills and their relation to performance. Metacognition and Learning, 9(1):1–23, 2014.
[37] Georg Buscher, Andreas Dengel, and Ludger van Elst. Eye movements as implicit relevance feedback. In Proceedings of the 2008 CHI Conference on Human Factors in Computing Systems: Extended Abstracts, pages 2991–2996. ACM, 2008.
[38] Michael Xuelin Huang, Tiffany C. K. Kwok, Grace Ngai, Stephen C. F. Chan, and Hong Va Leong. Building a personalized, auto-calibrating eye tracker from user interactions. In Proceedings of the 2016 CHI Conference on Human Factors in Computing Systems, pages 5169–5179, 2016.
[39] Thiago Santini, Wolfgang Fuhl, and Enkelejda Kasneci. CalibMe: Fast and unsupervised eye tracker calibration for gaze-based pervasive human-computer interaction. In Proceedings of the 2017 CHI Conference on Human Factors in Computing Systems, pages 2594–2605, 2017.