Automated essay scoring using efficient transformer-based language models
Christopher M. Ormerod, Akanksha Malhotra, and Amir Jafari

Abstract. Automated Essay Scoring (AES) is a cross-disciplinary effort involving Education, Linguistics, and Natural Language Processing (NLP). The efficacy of an NLP model in AES tests its ability to evaluate long-term dependencies and extrapolate meaning even when text is poorly written. Large pretrained transformer-based language models have dominated the current state-of-the-art in many NLP tasks; however, the computational requirements of these models make them expensive to deploy in practice. The goal of this paper is to challenge the paradigm in NLP that bigger is better when it comes to AES. To do this, we evaluate the performance of several fine-tuned pretrained NLP models with a modest number of parameters on an AES dataset. By ensembling our models, we achieve excellent results with fewer parameters than most pretrained transformer-based models.
1. Introduction
The idea that a computer could analyze writing style dates back to the work of Page in 1968 [31]. Many engines in production today rely on explicitly defined hand-crafted features designed by experts to measure the intrinsic characteristics of writing [2, 7]. These features are combined with frequency-based methods and statistical models to form a collection of methods that are broadly termed Bag-of-Words (BOW) methods [49]. While BOW methods have been very successful from a purely statistical standpoint, [15] showed that they tend to be brittle with respect to novel uses of language and vulnerable to adversarially crafted inputs.

Neural networks learn features implicitly rather than explicitly. It has been shown that initial neural network AES engines tend to be more accurate and more robust to gaming than BOW methods [12, 15]. In the broader NLP community, the recurrent neural network approaches have been replaced by transformer-based approaches, like the Bidirectional Encoder Representations from Transformers (BERT) [13]. These models tend to possess an order of magnitude more parameters than recurrent neural networks, but also boast state-of-the-art results on the General Language Understanding Evaluation (GLUE) benchmarks [42, 45]. One of the main problems in deploying these models is their computational and memory requirements [37]. This study explores the effectiveness of efficient versions of transformer-based models in the domain of AES. There are two aspects of AES that distinguish it from GLUE tasks and that might benefit from the efficiencies introduced in these models: firstly, the text being evaluated can be almost arbitrarily long; secondly, essays written by students often contain many more spelling issues than would be present in typical GLUE tasks. It could be the case that having fewer, more often updated parameters is better in this situation, or more generally wherever smaller training sets are used.

With respect to essay scoring, the Automated Student Assessment Prize (ASAP) AES data-set on Kaggle is a standard openly accessible data-set by which we may evaluate the performance of a given AES model [35]. Since the original test set is no longer available, we use the five-fold validation split presented in [46] for a fair and accurate comparison. The accuracy of BERT and XLNet has been shown to be very solid on the Kaggle dataset [33, 47]. To our knowledge, combining BERT with hand-crafted features forms the current state-of-the-art [40].

Recent works have challenged the paradigm that bigger models are necessarily better [11, 21, 24, 29]. The models in these papers possess some fundamental architectural characteristics that allow for a drastic reduction in model size, some of which may even be an advantage in AES. For this study, we consider the performance of the Albert models [26], a Reformer model [21], a version of the Electra model [11], and the Mobile-BERT model [39] on the ASAP AES data-set. Not only is each of these models more efficient, we show that simple ensembles of them provide the best results known to us on the ASAP AES dataset.

There are several reasons that this study is important. Firstly, the memory requirements of these models scale quadratically with the maximum length allowed, meaning that essays may be longer than the maximal length permitted by most pretrained transformer-based models. By considering efficiencies in the underlying transformer architectures, we can work on extending that maximum length. Secondly, as noted by [28], one of the barriers to effectively putting these models in production is the memory and size constraint of having fine-tuned models for every essay prompt. Lastly, we seek models that impose a smaller computational expense, which in turn has been linked by [37] to a much smaller carbon footprint.
2. Approaches to Automated Essay Scoring
From an assessment point of view, essay tests are useful in evaluating a student's ability to analyze and synthesize information, which assesses the upper levels of Bloom's Taxonomy. Many modern rubrics use a multitude of scoring dimensions to evaluate an essay, including organization, elaboration, and writing style. An AES engine is a model that assigns scores to a piece of text, closely approximating the way a person would score the text, either as a final score or in multiple dimensions.

To evaluate the performance of an engine we use standard inter-rater reliability statistics. It has become standard practice in the development of a training set for an AES engine that each essay is evaluated by two different raters, from which we may obtain a resolved score. The resolved score is usually the same as the two raters' score in the case that they agree, and an adjudicated score in cases in which they do not. In the case of the ASAP AES data-set, the resolved score is taken to be the sum of the two raters for some items and an adjudicated score for others. The goal of a good model is to have a higher agreement with the resolved score than the agreement the two raters have with each other. The most widely used statistic for evaluating the agreement of two different collections of scores is the quadratic weighted kappa (QWK), defined by

(1)    \kappa = 1 - \frac{\sum_{i,j} w_{ij} x_{ij}}{\sum_{i,j} w_{ij} m_{ij}},

where x_{ij} is the observed probability that an essay is assigned score i by the first rater and score j by the second, m_{ij} is the corresponding expected probability under independent raters, and w_{ij} = (i - j)^2 / (k - 1)^2, where k is the number of classes. The other measurements used in the industry are the standardized mean difference (SMD) and the exact match or accuracy (Acc). The total number of essays, the human-human agreement between the two raters, and the score ranges are shown in Table 1.

The usual protocol for training a statistical model is that some portion of the training set is set aside for evaluation while the remaining set is used for training. A portion of the training set is often isolated for purposes of early stopping and hyperparameter tuning. In evaluating the Kaggle dataset, we use the 5-fold cross-validation splits defined by [46] so that our results are comparable to other works [1, 12, 33, 40, 47]. The resulting QWK is defined to be the average of the QWK values on each of the five different splits.

Essay Prompt            1      2      3      4      5      6      7      8
Rater score range       1-6    1-6    0-3    0-3    0-4    0-4    0-12   5-30
Resolved score range    2-12   1-6    0-3    0-3    0-4    0-4    2-24   10-60
Average length          350    350    150    150    150    150    250    650
Training examples       1783   1800   1726   1772   1805   1800   1569   723
QWK                     0.721  0.814  0.769  0.851  0.753  0.776  0.721  0.624
Acc                     65.3%  78.3%  74.9%  77.2%  58.0%  62.3%  29.2%  27.8%
SMD                     0.008  0.027  0.055  0.005  0.001  0.011  0.006  0.069

Table 1. A summary of the Automated Student Assessment Prize Automated Essay Scoring data-set.
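For concreteness, the following is a minimal sketch of how the QWK statistic in (1) can be computed; the function name and the use of NumPy are our own illustration rather than code from any engine discussed here.

```python
import numpy as np

def quadratic_weighted_kappa(rater_a, rater_b, min_score, max_score):
    """Compute QWK between two integer score vectors, per equation (1)."""
    k = max_score - min_score + 1
    # Observed joint probability matrix x_ij.
    observed = np.zeros((k, k))
    for a, b in zip(rater_a, rater_b):
        observed[a - min_score, b - min_score] += 1
    observed /= observed.sum()
    # Expected matrix m_ij under independent raters (outer product of marginals).
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0))
    # Quadratic distance weights w_ij = (i - j)^2 / (k - 1)^2.
    i, j = np.indices((k, k))
    weights = (i - j) ** 2 / (k - 1) ** 2
    return 1.0 - (weights * observed).sum() / (weights * expected).sum()

# Example: two raters scoring five essays on a 0-3 scale.
print(quadratic_weighted_kappa([0, 1, 2, 3, 2], [0, 1, 1, 3, 2], 0, 3))
```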
3. Methodology
Most engines currently in production rely on Latent Semantic Analysis or a multitude of hand-crafted features that measure style and prose. Once sufficiently many features are compiled, a traditional machine learning classifier, like logistic regression, is applied and fit to a training corpus. As a representative of this class of model, we include the results of the "Enhanced AI Scoring Engine" (EASE), which is open source (https://github.com/edx/ease).

When modelling language with neural networks, the first layer of most neural networks is an embedding layer, which sends every token to an element of a semantic vector space. When training a neural network model from scratch, the word embedding vocabulary often comprises the set of tokens that appear in the training set. There are several problems with this approach that arise from word sparsity in language and the presence of spelling mistakes. Alternatively, one may use a pretrained word embedding [30, 32] built from a large corpus with a large vocabulary and a sufficiently high-dimensional semantic vector space. From an efficiency standpoint, these word embeddings alone can account for billions of parameters. We can shrink our embedding and address some issues arising from word sparsity by fixing a vocabulary of subwords using a version of byte-pair-encoding (BPE) [25, 34]. As such, pretrained models like BERT and those we use in this study utilize subwords to fix the size of the vocabulary.

In addition to the word embeddings, positional embeddings and segment embeddings are used to give the model information about the position of each word and about next-sentence prediction. The combination of these three embeddings is the key to reducing the vocabulary size, handling out-of-vocabulary tokens, and preserving the order of words.

Once we have applied the embedding layer to the tokens of a piece of text, the text is essentially a sequence of elements of a vector space, which can be represented by a matrix whose dimensions are governed by the size of the semantic vector space and the number of tokens in the text.

In the field of language modeling, sequence-to-sequence (seq2seq) models of the form considered in this paper were proposed in [38]. Initially, seq2seq models were used for neural machine translation between multiple languages. The seq2seq model has an encoder-decoder structure; the encoder analyzes the input sequence and creates a context vector, while the decoder is initialized with the context vector and is trained to produce the transformed output. Previous language models were based on a fixed-length context vector and suffered from an inability to infer context over long sentences and text in general. An attention mechanism was used to improve the performance of translation for long sentences.

The use of a self-attention mechanism turns out to be very useful in the context of machine translation [41]. This mechanism, and its various derivatives, have been responsible for a large number of accuracy gains over a wide range of NLP tasks more broadly. The form of attention used in this paper can be found in [41]. Given a query matrix Q, key matrix K, and value matrix V, the resulting sequence is given by

(2)    \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right) V.

These matrices are obtained by linear transformations of either the output of a neural network, the output of a previous attention mechanism, or embeddings. The overall success of attention has led to the development of the transformer [41]. The architecture of the transformer is outlined in Figure 1.

In the context of efficient models, if we consider all sequences up to length L and each query is of size d, then each key is also of length d; hence, the matrix QK^T is of size L x L. The implication is that the memory and computational power required to implement this mechanism scale quadratically with length. The above-mentioned models often adhere to a size restriction by letting L = 512.
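As an illustration of equation (2), here is a minimal PyTorch sketch of scaled dot-product attention; batch and head dimensions are omitted for clarity, and the function is our own illustration rather than code from any of the pretrained models discussed.

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V):
    """Equation (2): softmax(Q K^T / sqrt(d_k)) V for L x d_k query/key/value matrices."""
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5  # L x L matrix: quadratic in length
    return F.softmax(scores, dim=-1) @ V

L, d_k = 512, 64
Q, K, V = torch.randn(L, d_k), torch.randn(L, d_k), torch.randn(L, d_k)
out = scaled_dot_product_attention(Q, K, V)  # shape (512, 64)
```

The L x L score matrix is the source of the quadratic memory cost discussed above.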
4. Efficient Language Models
The transformer in language modeling is an innovative architecture that addresses the issues of seq2seq tasks while handling long-term dependencies. It relies on self-attention mechanisms in its network architecture; self-attention manages the interdependencies among the input elements.

In this study, we use five prominent models, all of which are known to perform well as language models and possess an order of magnitude fewer parameters than BERT. It should be noted that many other models, like RoBERTa, XLNet, the GPT models, T5, XLM, and even DistilBERT, all possess more than 60M parameters and hence were excluded from this study.
Figure 1. The basic architecture of a transformer-based model. The left block of N layers is called the encoder, while the right block of N layers is called the decoder. The output of the decoder is a sequence.

We only consider models utilizing an innovation that drastically reduces the number of parameters to at most one quarter the number of parameters of BERT. Of this class of models, we present a list of models and their respective sizes in Table 2.

The backbone architecture of all the above language models is BERT [13]. It has become the state-of-the-art model for many different Natural Language Understanding (NLU) and Natural Language Generation (NLG) tasks, including sequence and document classification.

The first models we consider are the Albert models of [26]. The first efficiency of Albert is that the embedding matrix is factored. In the BERT model, the embedding size must be the same as the hidden layer size. Since the vocabulary of the BERT model is approximately 30,000 and the hidden layer size is 768 in the base model, the embedding alone requires approximately 25M parameters to define. Not only does this significantly increase the size of the model, we also expect that some of those parameters will be updated rarely during the training process. By applying a linear layer after the embedding, we effectively factor the embedding matrix in a way that allows the actual embedding size to be much smaller than the feed-forward layer. In the two models (large and base), the size of the vocabulary is about the same; however, the embedding dimension is effectively 128, making the embedding matrix one sixth the size.
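A minimal PyTorch sketch of the factored embedding described above, using the illustrative sizes from the text (30,000 subwords, embedding dimension 128, hidden size 768); this is our own illustration of the idea, not the Albert implementation itself.

```python
import torch.nn as nn

vocab_size, embed_dim, hidden_size = 30000, 128, 768

# BERT-style embedding: vocab_size x hidden_size, roughly 23M parameters.
bert_style = nn.Embedding(vocab_size, hidden_size)

# Albert-style factorization: vocab_size x 128 plus a 128 -> 768 projection,
# roughly 3.9M parameters, about one sixth the size.
albert_style = nn.Sequential(
    nn.Embedding(vocab_size, embed_dim),
    nn.Linear(embed_dim, hidden_size),
)
```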
Model                   Number of    Training time   Inference time
                        parameters   speedup         speedup
BERT (base)             110M         1.0x            1.0x
Albert (base)           12M          n/a             n/a
Albert (large)          18M          n/a             n/a
Electra (small)         14M          n/a             n/a
Mobile-BERT             24M          n/a             n/a
Reformer                16M          n/a             n/a
Electra + Mobile-BERT   38M          n/a             n/a

Table 2. A summary of the memory requirements and approximate computational requirements of the models we use in the study, compared to BERT. These are estimates based on single-epoch times using a fixed batch size in training and inference.

A second efficiency proposed by Albert is that the layers share parameters. [26] compare a multitude of parameter-sharing scenarios, in which all parameters are shared across layers. The base and large Albert models, with 12 layers and 24 layers respectively, only possess a total of 12M and 18M parameters. The hidden sizes of the base and large models are 768 and 1024 respectively. Increasing the number of layers does not increase the number of parameters, but does come with a computational cost. In pretraining, the Albert models are trained with a sentence-order prediction (SOP) loss function instead of next-sentence prediction (NSP). It is argued by [26] that SOP can solve NSP tasks while the converse is not true, and that this distinction leads to consistent improvements in downstream tasks.

The second model we consider is the small version of the Electra model presented by [11]. Like the Albert model, there is a linear layer between the embedding and the hidden layers, allowing for an embedding size of 128, a hidden layer size of 256, and only four attention heads. The Electra model is pretrained as a pair of neural networks consisting of a generator and a discriminator with weight-sharing between the two networks. The generator's role is to replace tokens in a sequence, and it is therefore trained as a masked language model. The discriminator tries to identify which tokens in the sequence were replaced by the generator.

The third model, Mobile-BERT, was presented in [39]. This model uses the same embedding factorization to decouple the embedding size of 128 from the hidden size of 512. The main innovation of [39] is that they decrease the model size by introducing a pair of linear transformations, called bottlenecks, around the transformer unit so that the transformer unit operates on a hidden size of 128 instead of the full hidden size of 512. This effectively shrinks the size of the underlying building blocks. Mobile-BERT uses absolute position embeddings, and it is efficient at predicting masked tokens and at NLU.

The Reformer architecture of [21] differs from BERT most substantially in its version of the attention mechanism and in that the feed-forward components of the attention layers use reversible residual layers. This means that, as in [18], the inputs of each layer can be computed on demand instead of being stored. [21] noted that they use an approximation of (2) in which the linear transformations used to define Q and K are the same, i.e., Q = K. When we calculate QK^T, and more importantly the softmax, we need only consider values in Q and K that are close. Using random vectors, we may create a Locality-Sensitive Hashing (LSH) scheme, allowing us to chunk key/query vectors into finite collections of vectors that are known to contribute to the softmax. Each chunk may be computed in parallel, changing how the complexity scales with length from O(L^2) to O(L log L). This is essentially a way to utilize the sparse nature of the attention matrix by attempting not to calculate pairs of values unlikely to contribute to the softmax. Using more hashes improves the hashing scheme and better approximates the full attention mechanism.
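The following is a small sketch of the random-projection hashing step described above, in which vectors pointing in similar directions tend to receive the same bucket id; it follows the angular LSH scheme of the Reformer in spirit, but it is our own simplified illustration, not the Reformer implementation.

```python
import torch

def lsh_buckets(x, n_buckets, n_hashes=4):
    """Assign each vector in x (L x d) to a bucket id per hash round.

    Attention then only needs to be computed within a bucket,
    instead of over all L x L pairs of positions.
    """
    L, d = x.shape
    # One random rotation per hash round; buckets come in +/- pairs.
    rotations = torch.randn(n_hashes, d, n_buckets // 2)
    rotated = torch.einsum('ld,hdb->hlb', x, rotations)
    # Nearby vectors maximize the same rotated coordinate, hence share buckets.
    return torch.argmax(torch.cat([rotated, -rotated], dim=-1), dim=-1)

buckets = lsh_buckets(torch.randn(1024, 64), n_buckets=32)  # shape (4, 1024)
```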
5. Results
The networks above are all pretrained as seq2seq models. While some of the pretraining differs between models, as discussed above, we are required to modify a sequence-to-sequence neural network to produce a classification. Typically, this is done by taking the first vector of the output sequence as a finite set of features to which we may apply a classification. Applying a linear layer to these features produces a fixed-length vector that is used for classification.
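As a sketch, assuming a PyTorch encoder whose output is a (batch, length, hidden) tensor, the head might look as follows; the class name and sigmoid output (matching the activation described below) are our own illustration, not the exact training code used here.

```python
import torch
import torch.nn as nn

class RegressionHead(nn.Module):
    """Map the first vector of an encoder's output sequence to a score in [0, 1]."""

    def __init__(self, hidden_size):
        super().__init__()
        self.linear = nn.Linear(hidden_size, 1)

    def forward(self, sequence_output):       # (batch, length, hidden_size)
        first_token = sequence_output[:, 0]   # features from the first position
        return torch.sigmoid(self.linear(first_token)).squeeze(-1)
```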
QWK scores per prompt

Model                    1      2      3      4      5      6      7      8      AVG
EASE                     0.781  0.621  0.630  0.749  0.782  0.771  0.727  0.534  0.699
LSTM                     0.775  0.687  0.683  0.795  0.818  0.813  0.805  0.594  0.746
LSTM+CNN                 0.821  0.688  0.694  0.805  0.807  0.819  0.808  0.644  0.761
LSTM+CNN+Att             0.822  0.682  0.672  0.814  0.803  0.811  0.801  0.705  0.764
BERT (base)              0.792  0.680  0.715  0.801  0.806  0.805  0.785  0.596  0.758
XLNet                    0.777  0.681  0.693  0.806  0.783  0.794  0.787  0.627  0.743
BERT + XLNet             0.808  0.697  0.703  0.819  0.808  0.815  0.807  0.605  0.758
R²BERT                   0.817  0.719  0.698  0.845  0.841  0.847  0.839  0.744  0.794
BERT + Features          0.852  0.651  0.804  0.888  0.885  0.817  0.864  0.645  0.801
Electra (small)          0.816  0.664  0.682  0.792  0.792  0.787  0.827  0.715  0.759
Albert (base)            0.807  0.671  0.672  0.813  0.802  0.816  0.826  0.700  0.763
Albert (large)           0.801  0.676  0.668  0.810  0.805  0.807  0.832  0.700  0.763
Mobile-BERT              0.810  0.663  0.663  0.795  0.806  0.808  0.824  0.731  0.762
Reformer                 0.802  0.651  0.670  0.754  0.771  0.762  0.747  0.548  0.713
Electra + Mobile-BERT    0.823  0.683  0.691  0.805  0.808  0.802  0.835  0.748  0.774
Ensemble                 0.831  0.679  0.690  0.825  0.817  0.822  0.841  0.748  0.782

Table 3. The agreement (QWK) statistics for each prompt. EASE, LSTM, LSTM+CNN, and LSTM+CNN+Att were presented by [46] and [12]. The results of BERT, BERT extensions, and XLNet have been presented in [33, 40, 47]. The remaining rows are the results of this paper.

Given a set of n possible scores, we divide the interval [0, 1] into n even sub-intervals and map each score to the midpoint of its sub-interval. In a similar manner to [46], we treat this classification as a regression problem with a mean-squared-error loss function. This is slightly different from the standard multilabel classification with a cross-entropy loss function that is often applied by default to transformer-based classification problems. Using the standard pretrained models with an untrained linear classification layer with a sigmoid activation function, we applied a small grid search using two learning rates and two batch sizes for each model. The model performing best on the test set was applied to validation.
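A minimal sketch of this score mapping and its inverse; the helper names are our own, and the scheme simply follows the midpoint description above, with nn.MSELoss applied between the head's output and the target.

```python
def score_to_target(score, n):
    """Map a score in {0, ..., n-1} to the midpoint of its sub-interval of [0, 1]."""
    return (score + 0.5) / n

def output_to_score(output, n):
    """Round a model output in [0, 1] back to a score in {0, ..., n-1}."""
    return int(min(n - 1, output * n))

assert output_to_score(score_to_target(3, 6), 6) == 3  # the mapping round-trips
```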
For the Reformer model, we pretrained our own 6-layer Reformer using a hidden layer size of 512, 4 hashing functions, and 4 attention heads. We used a cased SentencePiece tokenization consisting of 16,000 subwords, and the model was trained with a maximum length of 1024 on a large corpus of essay texts from various grades on a single Nvidia RTX 8000. This model addresses the length constraints that other transformer-based essay models struggle with.

We see that both Electra and Mobile-BERT show performance higher than BERT itself despite being smaller and faster. Given the extensive hyperparameter tuning performed in [11, 39], we can only speculate that any additional gains may be due to architectural differences.

Lastly, we took our best models and simply averaged their outputs to obtain an ensemble whose output is in the interval [0, 1], then applied the same rounding transformation to obtain scores in the desired range. We highlight the ensemble of Mobile-BERT and Electra because this ensembled model provides a large increase in performance over BERT with approximately the same computational footprint.
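Since the models share the same output convention, ensembling reduces to averaging; a one-line sketch, assuming each model returns a value in [0, 1]:

```python
def ensemble_score(outputs, n):
    """Average the models' [0, 1] outputs, then round back to one of n scores."""
    avg = sum(outputs) / len(outputs)
    return int(min(n - 1, avg * n))
```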
6. Discussion
The goal of this paper was not to achieve state-of-the-art results, but rather to show that we can achieve significant results within a very modest memory footprint and computational budget. We managed to exceed the previously known results of BERT alone with approximately one third the parameters. Combining these models with the R² variants of [47] or with hand-crafted features, as done in [40], would be interesting, since these additions contribute little to the computational load of the system.

There were noteworthy additions to the literature that we did not consider, for various reasons. The Longformer of [4] uses a sliding attention window in which the resulting self-attention mechanism scales linearly with length; however, the number of parameters of the pretrained Longformer models often coincides with or exceeds that of the BERT model. Like the Reformer model of [21], the Linformer of [43], the Sparse Transformer of [10], and the Performer of [9] exploit the observation that the attention matrix (given by the softmax) can be approximated by a collection of low-rank matrices. Their different mechanisms for doing so mean that their complexity scales differently. The Projection Attention Networks for Document Classification On-Device (PRADO) model of [24] seems promising; however, we did not have access to a version of it we could use. The SHA-RNN of [29] also looks interesting; however, we found that the architecture was difficult to use for transfer learning.

Transfer learning with pretrained language models has improved the performance of document classification tasks in the natural language processing domain. We should note that all of these models are pretrained on a large corpus except for the Reformer model, for which we performed the pretraining as well as the fine-tuning ourselves. Better pretraining could improve upon the results of the Reformer model.

There are several reasons these directions in research are important. Firstly, as more and more attention is given to power-efficient computing for use on small devices, we think the deep learning community will see a greater emphasis on smaller, more efficient models in the future. Secondly, this work provides a stepping stone toward classifying and evaluating documents in which the necessary context extends beyond the limits that current pretrained transformer-based language models allow. Lastly, we think that combining the approaches that have a beneficial effect on performance should give smaller, better, and more environmentally friendly models in the future.
Acknowledgments
We would like to acknowledge Susan Lottridge and Balaji Kodeswaran for their support of this project.
References

[1] Alikaniotis, Dimitrios, Helen Yannakoudakis, and Marek Rei. "Automatic text scoring using neural networks." arXiv preprint arXiv:1606.04289 (2016).
[2] Attali, Yigal, and Jill Burstein. "Automated essay scoring with e-rater V.2." The Journal of Technology, Learning and Assessment 4, no. 3 (2006).
[3] Bahdanau, Dzmitry, Kyunghyun Cho, and Yoshua Bengio. "Neural machine translation by jointly learning to align and translate." arXiv preprint arXiv:1409.0473 (2014).
[4] Beltagy, Iz, Matthew E. Peters, and Arman Cohan. "Longformer: The long-document transformer." arXiv preprint arXiv:2004.05150 (2020).
[5] Brown, Tom B., Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan et al. "Language models are few-shot learners." arXiv preprint arXiv:2005.14165 (2020).
[6] Burstein, Jill, Karen Kukich, Susanne Wolff, Chi Lu, and Martin Chodorow. "Enriching automated essay scoring using discourse marking." (2001).
[7] Chen, Jing, James H. Fife, Isaac I. Bejar, and André A. Rupp. "Building e-rater Scoring Models Using Machine Learning Methods." ETS Research Report Series 2016, no. 1 (2016): 1-12.
[8] Cho, Kyunghyun, Bart Van Merriënboer, Dzmitry Bahdanau, and Yoshua Bengio. "On the properties of neural machine translation: Encoder-decoder approaches." arXiv preprint arXiv:1409.1259 (2014).
[9] Choromanski, Krzysztof, Valerii Likhosherstov, David Dohan, Xingyou Song, Andreea Gane, Tamas Sarlos, Peter Hawkins et al. "Rethinking attention with performers." arXiv preprint arXiv:2009.14794 (2020).
[10] Child, Rewon, Scott Gray, Alec Radford, and Ilya Sutskever. "Generating long sequences with sparse transformers." arXiv preprint arXiv:1904.10509 (2019).
[11] Clark, Kevin, Minh-Thang Luong, Quoc V. Le, and Christopher D. Manning. "Electra: Pre-training text encoders as discriminators rather than generators." arXiv preprint arXiv:2003.10555 (2020).
[12] Dong, Fei, Yue Zhang, and Jie Yang. "Attention-based recurrent convolutional neural network for automatic essay scoring." In Proceedings of the 21st Conference on Computational Natural Language Learning (CoNLL 2017), pp. 153-162. 2017.
[13] Devlin, Jacob, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. "BERT: Pre-training of deep bidirectional transformers for language understanding." arXiv preprint arXiv:1810.04805 (2018).
[14] Dikli, Semire. "An overview of automated scoring of essays." The Journal of Technology, Learning and Assessment 5, no. 1 (2006).
[15] Farag, Youmna, Helen Yannakoudakis, and Ted Briscoe. "Neural automated essay scoring and coherence modeling for adversarially crafted input." arXiv preprint arXiv:1804.06898 (2018).
[16] Flor, Michael, Yoko Futagi, Melissa Lopez, and Matthew Mulholland. "Patterns of misspellings in L2 and L1 English: A view from the ETS Spelling Corpus." Bergen Language and Linguistics Studies 6 (2015).
[17] Foltz, Peter W., Darrell Laham, and Thomas K. Landauer. "The intelligent essay assessor: Applications to educational technology." Interactive Multimedia Electronic Journal of Computer-Enhanced Learning 1, no. 2 (1999): 939-944.
[18] Gomez, Aidan N., Mengye Ren, Raquel Urtasun, and Roger B. Grosse. "The reversible residual network: Backpropagation without storing activations." In Advances in Neural Information Processing Systems, pp. 2214-2224. 2017.
[19] He, Kaiming, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. "Deep residual learning for image recognition." In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770-778. 2016.
[20] Hochreiter, Sepp, and Jürgen Schmidhuber. "Long short-term memory." Neural Computation 9, no. 8 (1997): 1735-1780.
[21] Kitaev, Nikita, Łukasz Kaiser, and Anselm Levskaya. "Reformer: The efficient transformer." arXiv preprint arXiv:2001.04451 (2020).
[22] Kolowich, Steven. "Writing instructor, skeptical of automated grading, pits machine vs. machine." The Chronicle of Higher Education 28 (2014).
[23] Krathwohl, David R. "A revision of Bloom's taxonomy: An overview." Theory into Practice 41, no. 4 (2002): 212-218.
[24] Krishnamoorthi, Karthik, Sujith Ravi, and Zornitsa Kozareva. "PRADO: Projection Attention Networks for Document Classification On-Device." In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 5013-5024. 2019.
[25] Kudo, Taku, and John Richardson. "SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing." arXiv preprint arXiv:1808.06226 (2018).
[26] Lan, Zhenzhong, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. "ALBERT: A lite BERT for self-supervised learning of language representations." arXiv preprint arXiv:1909.11942 (2019).
[27] Luong, Minh-Thang, Hieu Pham, and Christopher D. Manning. "Effective approaches to attention-based neural machine translation." arXiv preprint arXiv:1508.04025 (2015).
[28] Mayfield, Elijah, and Alan W. Black. "Should You Fine-Tune BERT for Automated Essay Scoring?" In Proceedings of the Fifteenth Workshop on Innovative Use of NLP for Building Educational Applications, pp. 151-162. 2020.
[29] Merity, Stephen. "Single headed attention RNN: Stop thinking with your head." arXiv preprint arXiv:1911.11423 (2019).
[30] Mikolov, Tomas, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. "Distributed representations of words and phrases and their compositionality." In Advances in Neural Information Processing Systems, pp. 3111-3119. 2013.
[31] Page, Ellis Batten, Gerald A. Fisher, and Mary Ann Fisher. "Project Essay Grade: A FORTRAN program for statistical analysis of prose." British Journal of Mathematical & Statistical Psychology 21 (1968): 139.
[32] Peters, Matthew E., Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. "Deep contextualized word representations." arXiv preprint arXiv:1802.05365 (2018).
[33] Rodriguez, Pedro Uria, Amir Jafari, and Christopher M. Ormerod. "Language models and Automated Essay Scoring." arXiv preprint arXiv:1909.09482 (2019).
[34] Sennrich, Rico, Barry Haddow, and Alexandra Birch. "Neural machine translation of rare words with subword units." arXiv preprint arXiv:1508.07909 (2015).
[35] Shermis, Mark D. "State-of-the-art automated essay scoring: Competition, results, and future directions from a United States demonstration." Assessing Writing 20 (2014): 53-76.
[36] Shermis, Mark D., and Jill C. Burstein, eds. "Automated essay scoring: A cross-disciplinary perspective." Routledge, 2003.
[37] Strubell, Emma, Ananya Ganesh, and Andrew McCallum. "Energy and policy considerations for deep learning in NLP." arXiv preprint arXiv:1906.02243 (2019).
[38] Sutskever, Ilya, Oriol Vinyals, and Quoc V. Le. "Sequence to sequence learning with neural networks." arXiv preprint arXiv:1409.3215 (2014).
[39] Sun, Zhiqing, Hongkun Yu, Xiaodan Song, Renjie Liu, Yiming Yang, and Denny Zhou. "MobileBERT: A compact task-agnostic BERT for resource-limited devices." arXiv preprint arXiv:2004.02984 (2020).
[40] Uto, Masaki, Yikuan Xie, and Maomi Ueno. "Neural Automated Essay Scoring Incorporating Handcrafted Features." In Proceedings of the 28th International Conference on Computational Linguistics, pp. 6077-6088. 2020.
[41] Vaswani, Ashish, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. "Attention is all you need." In Advances in Neural Information Processing Systems, pp. 5998-6008. 2017.
[42] Wang, Alex, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. "GLUE: A multi-task benchmark and analysis platform for natural language understanding." arXiv preprint arXiv:1804.07461 (2018).
[43] Wang, Sinong, Belinda Li, Madian Khabsa, Han Fang, and Hao Ma. "Linformer: Self-attention with linear complexity." arXiv preprint arXiv:2006.04768 (2020).
[44] Wang, Yequan, Minlie Huang, Xiaoyan Zhu, and Li Zhao. "Attention-based LSTM for aspect-level sentiment classification." In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pp. 606-615. 2016.
[45] Wang, Alex, Yada Pruksachatkun, Nikita Nangia, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. "SuperGLUE: A stickier benchmark for general-purpose language understanding systems." In Advances in Neural Information Processing Systems, pp. 3266-3280. 2019.
[46] Taghipour, Kaveh, and Hwee Tou Ng. "A neural approach to automated essay scoring." In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pp. 1882-1891. 2016.
[47] Yang, Ruosong, Jiannong Cao, Zhiyuan Wen, Youzheng Wu, and Xiaodong He. "Enhancing Automated Essay Scoring Performance via Cohesion Measurement and Combination of Regression and Ranking." In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: Findings, pp. 1560-1569. 2020.
[48] Yang, Zhilin, Zihang Dai, Yiming Yang, Jaime Carbonell, Russ R. Salakhutdinov, and Quoc V. Le. "XLNet: Generalized autoregressive pretraining for language understanding." In Advances in Neural Information Processing Systems, pp. 5753-5763. 2019.
[49] Zhang, Yin, Rong Jin, and Zhi-Hua Zhou. "Understanding bag-of-words model: a statistical framework." International Journal of Machine Learning and Cybernetics 1, no. 1-4 (2010): 43-52.
Cambium Assessment, Inc. Current address: 1000 Thomas Jefferson St., N.W., Washington, D.C. 20007
Email address: [email protected]

University of Colorado, Boulder
Email address: [email protected]

Cambium Assessment, Inc. Current address: 1000 Thomas Jefferson St., N.W., Washington, D.C. 20007
Email address: