Cognitively Aided Zero-Shot Automatic Essay Grading

Sandeep Mathias, Rudra Murthy, Diptesh Kanojia, and Pushpak Bhattacharyya
Department of Computer Science & Engineering, IIT Bombay
IBM Research India Limited
IITB-Monash Research Academy
{sam,diptesh,pb}@cse.iitb.ac.in, [email protected]

Abstract
Automatic essay grading (AEG) is a process in which machines assign a grade to an essay written in response to a topic, called the prompt. Zero-shot AEG is when we train a system to grade essays written to a new prompt which was not present in our training data. In this paper, we describe a solution to the problem of zero-shot automatic essay grading, using cognitive information in the form of gaze behaviour. Our experiments show that using gaze behaviour helps in improving the performance of AEG systems, especially when we provide a new essay, written in response to a new prompt, for scoring.
1 Introduction

One of the major challenges in machine learning is the requirement of a large amount of training data. AEG systems perform at their best when they are trained in a prompt-specific manner, i.e. the essays that they are tested on are written in response to the same prompt as the essays they are trained on (Zesch et al., 2015). These systems perform badly when they are tested against essays written in response to a different prompt. Zero-shot AEG is when our AEG system is used to grade essays written in response to a completely different prompt. In order to solve this challenge of lack of training data, we use cognitive information, learnt from the gaze behaviour of readers, to augment our training data and improve our model.

Automatic essay grading has been around for over half a century, ever since Page (1966)'s work (Beigman Klebanov and Madnani, 2020). While there have been a number of commercial systems like E-Rater (Attali and Burstein, 2006) from the Educational Testing Service (ETS), most modern-day systems use deep learning and neural networks, like convolutional neural networks (Dong and Zhang, 2016), recurrent neural networks (Taghipour and Ng, 2016), or both (Dong et al., 2017). However, all these systems rely on the fact that their training and testing data is from the same prompt.

Quite often, at run time, we may not have essays written in response to our target prompt (i.e. the prompt which our essay is written in response to). Because of this lack of training data, especially when training a model for essays written for a new prompt, many systems may fail at run time. To solve this problem, we propose a multi-task approach, similar to Mathias et al. (2020), where we learn a reader's gaze behaviour to help our system grade new essays.

In this paper, we look at a similar approach to that proposed by Mathias et al. (2020) to grade essays using cognitive information, which is learnt as an auxiliary task in a multi-task learning approach. Multi-task learning is a machine-learning approach where the model tries to solve one or more auxiliary tasks in order to solve a primary task (Caruana, 1998). Similar to Mathias et al. (2020), scoring the essay is the primary task, while learning the gaze behaviour is the auxiliary task.
Contribution.
In this paper, we describe a relatively new problem, zero-shot automatic essay grading, and propose a solution for it using gaze behaviour data. We show an increase in performance when learning gaze behaviour, as opposed to without using it.
We use the following gaze behaviour terms as defined by Mathias et al. (2020). An Interest Area (IA) is a part of the screen that is of interest to us. These areas are where some text is displayed, and not the background to the left / right of, or above / below, the text. Each word is a separate and unique IA. A Fixation is an event where the reader's eye fixates on a part of the screen. We are only concerned with fixations that occur inside interest areas; fixations that occur in the background are ignored. Saccades are eye movements as the eye moves from one fixation point to the next. Regressions are a type of saccade where the reader moves from the current interest area to an earlier one.
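The distinction between forward saccades and regressions can be sketched as follows. This is an illustrative helper (not the authors' code), which assumes each fixation is represented by the index of the interest area (word) it landed on, with background fixations already filtered out:

```python
# Illustrative sketch (not the authors' code): classify each saccade in a
# reading sequence as a forward movement or a regression. A fixation is
# represented by the index of the interest area (word) it landed on.

def classify_saccades(fixation_ias):
    """Label the saccade between each pair of consecutive fixations."""
    return ["regression" if curr < prev else "forward"
            for prev, curr in zip(fixation_ias, fixation_ias[1:])]

# The reader moves forward over words 0, 1, 2, 4, then jumps back to word 1.
print(classify_saccades([0, 1, 2, 4, 1]))
# ['forward', 'forward', 'forward', 'regression']
```

Refixations on the same word are treated as non-regressive here; only a move back to an earlier interest area counts as a regression.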
The rest of the paper is organized as follows. Section 2 describes the motivation for our work. Section 3 describes some of the related work in the area of automatic essay grading. Section 4 describes the essay dataset, as well as the gaze behaviour dataset. Section 5 describes our experiment setup. We report our results and analyze them in Section 6, and conclude our paper in Section 7.
2 Motivation

As stated earlier, in Section 1, one of the challenges for machine-learning systems is the requirement of training data. Quite often, we may not have training data for an essay, especially if the essay is written in response to a new prompt. Without any labeled data, in the form of scored essays, we cannot train a system properly to grade the essays. Zero-shot automatic essay grading is a way in which we overcome this problem. In zero-shot automatic essay grading, we train our system on essays written to different prompts, and test it on essays written in response to the target prompt. One drawback of this approach is that it is not able to use the properties of the target essay set in training the model. Therefore, as a way to alleviate this problem, we learn cognitive information, in the form of gaze behaviour, for the essays, to help our automatic essay grading system grade the essays better.
3 Related Work

While there has been work done on developing systems for automatic essay grading, all of them describe systems which use some of the essays the system is tested on as part of the training data (as well as validation data, where applicable) (Chen and He, 2013; Phandi et al., 2015; Taghipour and Ng, 2016; Dong and Zhang, 2016; Dong et al., 2017; Zhang and Litman, 2018; Cozma et al., 2018; Tay et al., 2018; Mathias et al., 2020).

One of the solutions to this problem was cross-domain AEG, where systems were trained using essays from a set of source prompts and tested on essays written in response to the target prompt. Some of the work done to study cross-domain AEG includes Zesch et al. (2015) (who used task-independent features), Phandi et al. (2015) (who used domain adaptation), Dong and Zhang (2016) (who used hierarchical CNN layers) and Cozma et al. (2018) (who used string kernels and super word embeddings). In all of their works, they defined a source prompt which is used for training and a target prompt which is used for validation and testing.

To the best of our knowledge, we are the first to explore the task of zero-shot automatic essay grading as a way to alleviate the challenge of a lack of graded essays (written in response to the target prompt) for an automatic essay grading system. In our approach, we do not use the target prompt essays even for validation, thereby making it truly zero-shot.
4 Dataset

In this section, we discuss the essay grading dataset and the gaze behaviour dataset which we used.
For our experiments, we use the Automatic Student Assessment Prize (ASAP) AEG dataset. This dataset is one of the most widely-used essay grading datasets, consisting of 12,978 graded essays, written in response to 8 essay prompts. The prompts are either argumentative, narrative, or source-dependent responses. Details of the dataset are summarized in Table 1. The dataset is publicly available for download.

We use the same gaze behaviour dataset as Mathias et al. (2020). We use 5 attributes of gaze behaviour, namely dwell time (the total time that the eye has fixated on a word), first fixation duration (the duration of the first fixation of the reader on a particular word), IsRegression (whether or not there was a regression from a particular interest area), Run Count (the number of times an interest area was fixated on), and Skip (whether or not the interest area was skipped).

Prompt ID | Number of Essays | Score Range | Mean Word Count | Essay Type
Prompt 1  | 1783   | 2-12 | 350 | Persuasive
Prompt 2  | 1800   | 1-6  | 350 | Persuasive
Prompt 3  | 1726   | 0-3  | 150 | Source-Dependent
Prompt 4  | 1770   | 0-3  | 150 | Source-Dependent
Prompt 5  | 1805   | 0-4  | 150 | Source-Dependent
Prompt 6  | 1800   | 0-4  | 150 | Source-Dependent
Prompt 7  | 1569   | 0-30 | 250 | Narrative
Prompt 8  | 723    | 0-60 | 650 | Narrative
Total     | 12,978 |      |     |

Table 1: Statistics of the 8 prompts from the ASAP AEG dataset.
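The five gaze behaviour attributes above can be derived from a reader's fixation sequence. The sketch below is illustrative (not the authors' extraction code) and assumes each fixation is a (word_index, duration_ms) pair, with background fixations already removed:

```python
# Sketch (not the authors' code) of deriving the five gaze behaviour
# attributes from a fixation sequence. Each fixation is a pair
# (word_index, duration_ms); the words are the interest areas.

def gaze_attributes(fixations, n_words):
    dwell = [0] * n_words              # total fixation time per word
    ffd = [0] * n_words                # first fixation duration per word
    run_count = [0] * n_words          # number of fixation runs per word
    is_regression = [False] * n_words  # was there a regression from this word?
    prev = None
    for word, duration in fixations:
        dwell[word] += duration
        if ffd[word] == 0:
            ffd[word] = duration       # first time this word is fixated
        if word != prev:
            run_count[word] += 1       # a new run of fixations on this word
        if prev is not None and word < prev:
            is_regression[prev] = True # the eye moved back from `prev`
        prev = word
    skip = [dwell[w] == 0 for w in range(n_words)]  # never fixated at all
    return dwell, ffd, is_regression, run_count, skip

fix = [(0, 200), (1, 150), (2, 100), (0, 120)]
dwell, ffd, is_reg, runs, skip = gaze_attributes(fix, n_words=4)
# dwell == [320, 150, 100, 0]; word 3 was skipped; a regression
# occurred from word 2 (the eye jumped back to word 0).
```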
Essay Set | Score 0 | Score 1 | Score 2 | Score 3 | Score 4 | Total
Prompt 3  | 2 | 4  | 5  | 1  | N/A | 12
Prompt 4  | 2 | 3  | 4  | 3  | N/A | 12
Prompt 5  | 2 | 1  | 3  | 5  | 1   | 12
Prompt 6  | 2 | 2  | 3  | 4  | 1   | 12
Total     | 8 | 10 | 15 | 13 | 2   | 48

Table 2: Number of essays from each essay set for which we collected gaze behaviour, scored between 0 and 3 (or 4).
The gaze behaviour was collected from 8 different annotators, who read only 48 essays (out of the almost 13,000 essays in the ASAP AEG dataset) from the source-dependent response essay sets. Table 2 summarizes the distribution of essays across the different essay sets that we collected gaze behaviour data for.

Table 3 gives the details of the different annotators used by Mathias et al. (2020). We evaluated each annotator's performance on 3 different metrics: QWK, Correct, and Close. QWK is the Quadratic Weighted Kappa agreement (Cohen, 1968) between the score given by the annotator and the ground truth score from the dataset. Correct is the number of times (out of 48) that the annotator exactly agreed with the ground truth score, and Close is the number of times (out of 48) where the annotator disagreed with the ground truth score by at most 1 score point. More details about the dataset and its creation are found in Mathias et al. (2020).
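The three agreement metrics can be computed as follows. This is a sketch under the standard definition of quadratically weighted kappa, not the evaluation script used in the paper:

```python
# Sketch of the three annotator-agreement metrics: QWK (Cohen, 1968),
# Correct (exact matches), and Close (within one score point).

def quadratic_weighted_kappa(rater_a, rater_b, min_score, max_score):
    """Cohen's kappa with quadratic disagreement weights."""
    n = max_score - min_score + 1
    total = len(rater_a)
    observed = [[0.0] * n for _ in range(n)]
    for a, b in zip(rater_a, rater_b):
        observed[a - min_score][b - min_score] += 1.0
    hist_a = [sum(row) for row in observed]
    hist_b = [sum(observed[i][j] for i in range(n)) for j in range(n)]
    num = den = 0.0
    for i in range(n):
        for j in range(n):
            w = (i - j) ** 2 / (n - 1) ** 2   # quadratic weight in [0, 1]
            num += w * observed[i][j]
            den += w * hist_a[i] * hist_b[j] / total  # chance agreement
    return 1.0 - num / den

def correct_and_close(rater_a, rater_b):
    """Exact matches, and matches within one score point."""
    correct = sum(a == b for a, b in zip(rater_a, rater_b))
    close = sum(abs(a - b) <= 1 for a, b in zip(rater_a, rater_b))
    return correct, close
```

Perfect agreement yields a QWK of 1.0, while perfectly inverted scores yield a negative QWK, which is why the metric is preferred over raw accuracy for graded scales.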
5 Experiment Setup

In this section, we describe our experiment setup, such as the evaluation metric, network architecture, hyperparameters, etc.
For evaluating our system, we use Cohen's Kappa with Quadratic Weights, i.e. Quadratic Weighted Kappa (QWK) (Cohen, 1968). This evaluation metric is the most frequently used for automatic essay grading experiments, because it is sensitive to differences in scores and takes into account chance agreements (Mathias et al., 2018).
Figure 1 shows the architecture of our system. The essay is split into sentences, and each sentence is tokenized and given as input to the Embedding Layer. In this layer, for each token, we output the corresponding word embedding, which is given as input to the next layer, the word-level CNN layer. The word-level CNN layer learns local representations of nearby words, as well as the gaze behaviour. The outputs of the word-level CNN layer are then pooled at the word-level pooling layer to get a sentence representation for each sentence. Each sentence representation is then sent through an LSTM (Hochreiter and Schmidhuber, 1997) layer, whose output is pooled through a sentence-level attention layer to get the essay representation. The essay representation from the sentence-level attention layer is then sent through a Dense layer, from which we learn the essay score. For both tasks (learning gaze behaviour, as well as scoring the essay), we minimize the mean squared error loss.
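The pipeline above can be sketched in PyTorch. This is a minimal sketch, not the authors' implementation: the class and layer names (MultiTaskAEG, gaze_head, etc.) are ours, the hyperparameters follow the values given later in this section, and max-pooling is assumed for the word-level pooling step.

```python
import torch
import torch.nn as nn

class MultiTaskAEG(nn.Module):
    """Sketch of the described architecture: embeddings -> word-level CNN
    -> word-level pooling -> sentence LSTM -> attention pooling -> dense
    score head, with a per-word gaze head attached to the CNN output for
    the auxiliary task. (Assumed names and pooling choice, not the
    authors' code.)"""

    def __init__(self, vocab_size, emb_dim=50, filters=100, hidden=100, n_gaze=5):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.cnn = nn.Conv1d(emb_dim, filters, kernel_size=5, padding=2)
        self.gaze_head = nn.Linear(filters, n_gaze)  # auxiliary task
        self.lstm = nn.LSTM(filters, hidden, batch_first=True)
        self.attn = nn.Linear(hidden, 1)             # sentence-level attention
        self.score_head = nn.Linear(hidden, 1)       # primary task

    def forward(self, essay):  # essay: (n_sents, n_words) token ids
        x = self.emb(essay)                       # (S, W, E)
        x = self.cnn(x.transpose(1, 2)).relu()    # (S, F, W)
        gaze = self.gaze_head(x.transpose(1, 2))  # per-word gaze predictions
        sent = x.max(dim=2).values                # word-level pooling -> (S, F)
        h, _ = self.lstm(sent.unsqueeze(0))       # (1, S, H)
        a = torch.softmax(self.attn(h), dim=1)    # attention over sentences
        essay_repr = (a * h).sum(dim=1)           # (1, H)
        score = torch.sigmoid(self.score_head(essay_repr))  # score in [0, 1]
        return score.squeeze(), gaze
```

Training would minimize the MSE of the predicted score against the gold score, plus a weighted sum of MSE losses over the five gaze outputs.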
ID          | Sex    | Age | Occupation        | TA? | L1 Language | English Score | QWK   | Correct | Close
Annotator 1 | Male   | 23  | Masters student   | Yes | Hindi       | 94% | 0.611 | 19 | 41
Annotator 2 | Male   | 18  | Undergraduate     | Yes | Marathi     | 95% | 0.587 | 24 | 41
Annotator 3 | Male   | 31  | Research scholar  | Yes | Marathi     | 85% | 0.659 | 21 | 43
Annotator 4 | Male   | 28  | Software engineer | Yes | English     | 96% | 0.659 | 26 | 44
Annotator 5 | Male   | 30  | Research scholar  | Yes | Gujarati    | 92% | 0.600 | 19 | 42
Annotator 6 | Female | 22  | Masters student   | Yes | Marathi     | 95% | 0.548 | 19 | 40
Annotator 7 | Male   | 19  | Undergraduate     | Yes | Marathi     | 93% | 0.732 | 21 | 46
Annotator 8 | Male   | 28  | Masters student   | Yes | Gujarati    | 94% | 0.768 | 29 | 45

Table 3: Profile of the annotators.

[Figure 1: Architecture of our gaze behaviour system, showing an input essay of n sentences, with the outputs being the gaze behaviour (whenever applicable) and the overall essay score.]

We use the 50-dimension GloVe pre-trained word embeddings (Pennington et al., 2014). We run our experiments with a batch size of 200. The word-level CNN layer has a kernel size of 5, with 100 filters. The sentence-level LSTM layer has 100 hidden units. We use the RMSProp optimizer (Dauphin et al., 2015) with an initial learning rate of 0.001 and a momentum of 0.9, and set the dropout rate to 0.5. Along with the network hyperparameters, we also weigh the loss functions of the different gaze behaviour attributes differently, using the same weights as Mathias et al. (2020).

While training our model, we scale the essay scores for all the data (training, testing, and validation) to a range of [0, 1]. For calculating the final scores, as well as the QWK, we rescale the predictions of the essay score back to the score range of the essays. We also bin the gaze behaviour attributes as described in Mathias et al. (2020). Binning is done to take into account the idiosyncrasies of the gaze behaviour of individual readers (i.e. some people may read faster, others slower, etc.). Whenever we use gaze behaviour, we scale the value of the gaze behaviour bins to the range of [0, 1] as well. We run our experiments in the following configurations.
No Gaze is a single-task learning experiment, where we only learn to score the essay. Gaze is the multi-task learning approach, where we learn gaze behaviour as an auxiliary task and score the essay as the primary task.
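The score scaling and rescaling described above can be sketched as follows (assumed helper names, not the authors' code):

```python
# Sketch of the score scaling used for training: essay scores are mapped
# to [0, 1], and model predictions are mapped back to the prompt's score
# range before computing the final scores and the QWK.

def scale_score(score, lo, hi):
    """Map a raw essay score in [lo, hi] to [0, 1]."""
    return (score - lo) / (hi - lo)

def rescale_prediction(pred, lo, hi):
    """Map a model prediction in [0, 1] back to the nearest valid score."""
    return round(pred * (hi - lo) + lo)

# Prompt 1 essays are scored on a 2-12 scale:
assert scale_score(7, 2, 12) == 0.5
assert rescale_prediction(0.5, 2, 12) == 7
```

Because each prompt has a different score range (Table 1), the same [0, 1] training target corresponds to different raw scores per prompt, which is why rescaling must use the target prompt's own range.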
We use five-fold cross-validation to evaluate our system. For each fold, the testing data consists of essays from the target prompt, and the training and validation data comprise essays from the other 7 prompts.
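One way to realise this split is sketched below. This is an illustration, not the authors' code: the target prompt is held out entirely, and the ratio by which the remaining 7 prompts are divided between training and validation is an assumption.

```python
import random

def zero_shot_fold(essays, target_prompt, dev_fraction=0.2, seed=0):
    """Build one zero-shot fold. `essays` is a list of (prompt_id, essay)
    pairs. Essays from the target prompt are used only for testing; the
    rest are shuffled and split into training and validation data."""
    test = [e for e in essays if e[0] == target_prompt]
    rest = [e for e in essays if e[0] != target_prompt]
    rng = random.Random(seed)
    rng.shuffle(rest)
    n_dev = int(len(rest) * dev_fraction)
    dev, train = rest[:n_dev], rest[n_dev:]
    return train, dev, test
```

Note that, unlike cross-domain setups, the target prompt's essays appear in neither the training nor the validation data, which is what makes the evaluation truly zero-shot.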
6 Results and Analysis

Table 4 gives the results of our experiments, reported on the target essay set as the mean over the 5 folds. For each fold, we record the performance of the model on the target essay set at the epoch which had the best QWK on the development set.

Target Essay Set | No Gaze | Gaze
Prompt 1 | 0.319 | –
Prompt 2 | 0.391 | –
Prompt 3 | 0.508 | –
Prompt 4 | 0.548 | –
Prompt 5 | 0.548 | –
Prompt 6 | 0.599 | –
Prompt 7 | 0.362 | –
Prompt 8 | –     | –
Mean QWK | –     | –

Table 4: Results of our experiments with and without using gaze behaviour. Improvements which are statistically significant, when gaze behaviour is used, are marked with a *.

From the table, we see that in most of the essay sets we are able to see an improvement in performance. In order to verify whether the improvements were statistically significant, we use the two-tailed paired t-test. Statistically significant improvements when we use gaze behaviour data are marked with a * next to the result.

Out of the 8 essay sets, the only essay set where the performance using gaze behaviour falls short compared to when we do not use gaze behaviour is Prompt 8. One of the main reasons for this is that the essays in Prompt 8 are very long compared to the other essay sets. When they are absent from the training data, the system is unable to learn about the existence of long essays, which could also be the reason that those essays are scored badly.

7 Conclusion

In this paper, we discussed an important problem for automatic essay grading, namely zero-shot automatic essay grading, where we have no labeled essays, written in response to our target prompt, present at the time of training. We showed that, by using gaze behaviour, we are able to learn cognitive information which can help improve our AEG system. In the future, we plan to extend our work to other tasks, like grading of essay traits, using gaze behaviour.
References
Yigal Attali and Jill Burstein. 2006. Automated essay scoring with e-rater® v.2. The Journal of Technology, Learning and Assessment (JTLA), 4(3).

Beata Beigman Klebanov and Nitin Madnani. 2020. Automated evaluation of writing – 50 years and counting. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7796–7810, Online. Association for Computational Linguistics.

Rich Caruana. 1998. Multitask Learning, pages 95–133. Springer US, Boston, MA.

Hongbo Chen and Ben He. 2013. Automated essay scoring by maximizing human-machine agreement. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1741–1752, Seattle, Washington, USA. Association for Computational Linguistics.

Jacob Cohen. 1968. Weighted kappa: Nominal scale agreement provision for scaled disagreement or partial credit. Psychological Bulletin, 70(4):213.

Mădălina Cozma, Andrei Butnaru, and Radu Tudor Ionescu. 2018. Automated essay scoring with string kernels and word embeddings. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 503–509, Melbourne, Australia. Association for Computational Linguistics.

Yann Dauphin, Harm De Vries, and Yoshua Bengio. 2015. Equilibrated adaptive learning rates for non-convex optimization. In Advances in Neural Information Processing Systems, pages 1504–1512.

Fei Dong and Yue Zhang. 2016. Automatic features for essay scoring – an empirical study. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1072–1077, Austin, Texas. Association for Computational Linguistics.

Fei Dong, Yue Zhang, and Jie Yang. 2017. Attention-based recurrent convolutional neural network for automatic essay scoring. In Proceedings of the 21st Conference on Computational Natural Language Learning (CoNLL 2017), pages 153–162, Vancouver, Canada. Association for Computational Linguistics.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.

Sandeep Mathias, Diptesh Kanojia, Kevin Patel, Samarth Agrawal, Abhijit Mishra, and Pushpak Bhattacharyya. 2018. Eyes are the windows to the soul: Predicting the rating of text quality using gaze behaviour. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2352–2362, Melbourne, Australia. Association for Computational Linguistics.

Sandeep Mathias, Rudra Murthy, Diptesh Kanojia, Abhijit Mishra, and Pushpak Bhattacharyya. 2020. Happy are those who grade without seeing: A multi-task learning approach to grade essays using gaze behaviour. In Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing, pages 858–872, Suzhou, China. Association for Computational Linguistics.

Ellis B. Page. 1966. The imminence of... grading essays by computer. The Phi Delta Kappan, 47(5):238–243.

Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543, Doha, Qatar. Association for Computational Linguistics.

Peter Phandi, Kian Ming A. Chai, and Hwee Tou Ng. 2015. Flexible domain adaptation for automated essay scoring using correlated linear regression. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 431–439, Lisbon, Portugal. Association for Computational Linguistics.

Kaveh Taghipour and Hwee Tou Ng. 2016. A neural approach to automated essay scoring. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1882–1891, Austin, Texas. Association for Computational Linguistics.

Yi Tay, Minh Phan, Luu Anh Tuan, and Siu Cheung Hui. 2018. SkipFlow: Incorporating neural coherence features for end-to-end automatic text scoring.

Torsten Zesch, Michael Wojatzki, and Dirk Scholten-Akoun. 2015. Task-independent features for automated essay grading. In Proceedings of the Tenth Workshop on Innovative Use of NLP for Building Educational Applications, pages 224–232, Denver, Colorado. Association for Computational Linguistics.

Haoran Zhang and Diane Litman. 2018. Co-attention based neural network for source-dependent essay scoring. In