Are pre-trained text representations useful for multilingual and multi-dimensional language proficiency modeling?
Taraka Rama
University of North Texas, USA
Sowmya Vajjala
National Research Council, Canada [email protected]
Abstract
Development of language proficiency models for non-native learners has been an active area of interest in NLP research for the past few years. Although language proficiency is multidimensional in nature, existing research typically considers a single "overall proficiency" while building models. Further, existing approaches also consider only one language at a time. This paper describes our experiments and observations about the role of pre-trained and fine-tuned multilingual embeddings in performing multi-dimensional, multilingual language proficiency classification. We report experiments with three languages – German, Italian, and Czech – and model seven dimensions of proficiency ranging from vocabulary control to sociolinguistic appropriateness. Our results indicate that while fine-tuned embeddings are useful for multilingual proficiency modeling, none of the features achieve consistently best performance for all dimensions of language proficiency.

All code, data and related supplementary material can be found at: https://github.com/nishkalavallabhi/MultidimCEFRScoring

Introduction

Automated Essay Scoring (AES) is the task of grading test taker writing using computer programs. It has been an active area of research in NLP for the past 15 years. Although most of the existing research focused on English, recent years saw the development of AES models for second language proficiency assessment for non-English languages, typically modeled using the Common European Framework of Reference (CEFR) scale (Council of Europe, 2002) in Europe.

Most of the past research focused on monolingual AES models. However, the notion of language proficiency is not limited to any one language. As a matter of fact, CEFR (Council of Europe, 2002) provides language-agnostic guidelines to describe different levels of language proficiency, from A1 (beginner) to C2 (advanced). Hence, a universal, multilingual language proficiency model is an interesting possibility to explore.
From an application perspective, it will be useful to know if one can achieve cross-lingual transfer and build an AES system for a new language with little or no training data. Vajjala and Rama (2018) explored these ideas with basic features such as n-grams and POS tag ratios. The usefulness of large, pre-trained multilingual models (with or without fine-tuning) from recent NLP research has not been studied for this task, especially for non-English languages.

Further, AES research generally considers language proficiency as a single construct. However, proficiency encompasses multiple dimensions such as vocabulary richness, grammatical accuracy, coherence/cohesion, usage of idioms etc. (Attali and Burstein, 2004). CEFR guidelines also provide language proficiency rubrics for individual dimensions along with overall proficiency for A1–C2. Modeling multiple dimensions instead of a single "overall proficiency" could result in a more fine-grained assessment for offering specific feedback.

Given this background, we explore the usefulness of multilingual pre-trained embeddings for training multi-dimensional language proficiency scoring models for three languages – German, Czech and Italian. The main contributions of our paper are listed below:

• We address the problem of multi-dimensional modeling of language proficiency for three (non-English) languages.
• We explore whether large pre-trained, multilingual embeddings are useful as feature representations for this task with and without fine-tuning.
• We investigate the possibility of a universal multilingual language proficiency model and zero-shot cross-lingual transfer using embedding representations.

The paper is organized as follows. Section 2 briefly surveys the related work. Section 3 describes our corpus, features, and experimental settings. Section 4 discusses our results in detail. Section 5 concludes the paper with pointers to future work.
Related Work

Automated Essay Scoring (AES) is a well-researched problem in NLP and has been applied to real-world language assessment scenarios for English (Attali and Burstein, 2004). A wide range of features such as document length, lexical/syntactic n-grams, and features capturing linguistic aspects such as vocabulary, syntax and discourse are commonly used (Klebanov and Flor, 2013; Phandi et al., 2015; Zesch et al., 2015). In the recent past, different forms of text embeddings and pre-trained language models have also been explored (Alikaniotis et al., 2016; Dong and Zhang, 2016; Mayfield and Black, 2020), along with approaches to combine linguistic features with neural networks (Shin, 2018; Liu et al., 2019). Ke and Ng (2019) and Klebanov and Madnani (2020) present the most recent surveys on the state of the art in AES (focusing on English).

In terms of modeling, AES has been modeled as a classification, regression, and ranking problem, with approaches ranging from linear regression to deep learning models. Some of the recent work explored the usefulness of multi-task learning (Cummins and Rei, 2018; Berggren et al., 2019) and transfer learning (Jin et al., 2018; Ballier et al., 2020).
Going beyond approaches that work for a single language, Vajjala and Rama (2018) reported on developing methods for multi- and cross-lingual AES.

Much of the existing AES research has been focused on English, but there is a growing body of research on other European languages: German (Hancke and Meurers, 2013), Estonian (Vajjala and Lõo, 2014), Swedish (Pilán et al., 2016), and Norwegian (Berggren et al., 2019), which explored both language-specific (e.g., case markers in Estonian) as well as language-agnostic (e.g., POS n-grams) features (Vajjala and Rama, 2018) for this task. However, to our knowledge, the use of large pre-trained language models such as BERT (Devlin et al., 2018) has not been explored yet for AES in non-English languages.

Further, most of the approaches focused on modeling language proficiency as a single variable. Although there is some research focusing on multiple dimensions of language proficiency (Lee et al., 2009; Attali and Sinharay, 2015; Agejev and Šnajder, 2017; Mathias and Bhattacharyya, 2020), none of them focused on non-English languages or used recent multilingual pre-trained models such as BERT. In this paper, we focus on this problem of multi-dimensional modeling of language proficiency for three languages – German, Italian, and Czech – and explore whether recent research on multilingual embeddings can be useful for non-English AES.
Experimental Setup

In this section, we describe the corpus, features, models, and implementation details. We modeled the task as a classification problem and trained individual models for each of the seven dimensions of language proficiency. The rest of this section describes the different steps involved in our approach in detail.
Corpus

In this paper, we employed the publicly available MERLIN corpus (Boyd et al., 2014), which was also used in the experiments reported in some past research (Hancke, 2013; Vajjala and Rama, 2018) and in the recently conducted REPROLANG challenge (Branco et al., 2020). The MERLIN corpus (available for download at https://merlin-platform.eu/C_download.php) contains CEFR scale based language proficiency annotations for texts produced by non-native learners in three languages – German, Czech, and Italian – in seven dimensions, which are described below:

1. Overall proficiency is the generic label expected to summarize the language proficiency across different dimensions.
2. Grammatical accuracy refers to the usage and control over the language's grammar.
3. Orthographic control refers to the aspects of language connected with writing, such as punctuation, spelling mistakes etc.
4. Vocabulary range refers to the breadth of vocabulary use, including phrases, idiomatic expressions, colloquialisms etc.
5. Vocabulary control refers to the correct and appropriate use of vocabulary.
6. Coherence and cohesion refers to the ability to connect different parts of the text using appropriate vocabulary (e.g., connecting words) and creating a smoothly flowing text.
7. Sociolinguistic appropriateness refers to the awareness of language use in different social contexts: for example, using a proper form of introduction, the ability to express oneself in both formal and informal language, understanding the sociocultural aspects of language use etc.

A detailed description of each dimension at each CEFR level is provided in the structured overview of CEFR scales document (Council of Europe, 2002). In the MERLIN corpus, these annotations were prepared by human graders who were trained on these well-defined rubrics. More details on the examination setup, grade annotation guidelines, rating procedure, inter-rater reliability and reliability of rating measures can be found in the project documentation (Bärenfänger, 2013). We used the texts and their universal dependency parsed versions – shared by Vajjala and Rama (2018) – consisting of 2266 documents in total (1029 German, 803 Italian, 434 Czech).

The German corpus had A1–C1, the Italian corpus had A2–B1, and the Czech corpus had A2–B2 levels for the overall proficiency category. More CEFR levels were represented in the corpus for other proficiency dimensions. In this paper, we treat the annotated labels in the corpus as the gold standard labels.
Missing labels
Fewer than ten documents had an annotation of "0" instead of a CEFR level (A1–C2) for some of the dimensions. The documentation did not provide any reason behind this label assignment, and we removed them from our experiments. In the case of German and Italian, for fewer than ten documents, some individual dimensions had a score of "0" while the overall rating was A1. For these documents, we treated the "0" score as an A1 rating for that dimension. In the case of Czech, about half of the documents for the sociolinguistic appropriateness dimension had a score of "0". The corpus manual does not provide any explanation for the missing annotation; therefore, we excluded this dimension from all experiments involving Czech data. (More details on the corpus distribution can be found in the MERLIN documentation, and the result files we share as supplementary material contain CEFR level distributions for all the classification scenarios, for all languages.)
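The clean-up described above can be sketched as follows. The record layout here (language code, overall rating, and a per-dimension score dict) is a hypothetical stand-in, not the actual MERLIN file format:

```python
def clean_labels(records):
    """Apply the two label fixes described above: map a "0" dimension
    score to A1 when the overall rating is A1 (German/Italian), and drop
    the sociolinguistic dimension entirely for Czech."""
    cleaned = []
    for rec in records:
        rec = dict(rec)                      # don't mutate the caller's data
        rec["scores"] = dict(rec["scores"])
        if rec["lang"] in ("de", "it") and rec["overall"] == "A1":
            for dim, score in rec["scores"].items():
                if score == "0":
                    rec["scores"][dim] = "A1"
        if rec["lang"] == "cs":
            rec["scores"].pop("sociolinguistic", None)
        cleaned.append(rec)
    return cleaned
```

Documents carrying a "0" with a non-A1 overall rating would simply be filtered out before this step, as described in the paragraph above.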
Inter-dimensional correlations
Bärenfänger (2013)'s analyses on the MERLIN corpus show that correlations among the different dimensions (including overall proficiency) vary widely across the three languages. In general, correlations between any two dimensions, and specifically with the overall proficiency dimension, are higher for German and Italian than for Czech. There is no consistently high correlation of overall proficiency with any single dimension. The variations show that these individual dimensions are indeed different from each other as well as from the overall proficiency dimension, and we could expect that a model trained on one dimension need not necessarily reflect the language proficiency of the test taker in another dimension. This further motivates our decision to explore a multi-dimensional proficiency assessment approach in this paper.
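The correlations discussed here are plain pairwise correlations between per-dimension scores. As a small illustration (the scores below are made-up ordinals, A1=1 through C2=6, not MERLIN values), Pearson's r can be computed as:

```python
import math

def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Toy example: overall vs. vocabulary-range ratings for five essays
overall = [2, 3, 3, 4, 5]
vocab = [2, 2, 3, 4, 4]
```

A model trained on one dimension transferring poorly to another is exactly what low inter-dimensional correlation would predict.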
(For details of the inter-dimensional correlations, refer to Table 4 for Czech, Table 11 for German, and Table 17 for Italian in Bärenfänger (2013).)

Features

One of the goals of the paper is to examine if text representations computed from large, pre-trained, multilingual models such as mBERT (Devlin et al., 2018) and LASER (Artetxe and Schwenk, 2019) are useful for the AES task. We trained classifiers based on these two pre-trained models and compare them with two previously used feature sets – a document length baseline and the n-gram features used in Vajjala and Rama (2018). All the features are described below:

• Baseline: Document length (number of tokens) is a standard feature in all AES approaches (Attali and Burstein, 2004).

• Lexical and syntactic features: n-grams of words, of Universal POS tags (UPOS) from the Universal Dependencies project (Nivre et al., 2016), and of dependency triplets consisting of the head POS label, dependent POS label, and the dependency label, extracted using UDPipe (Straka et al., 2016). While word n-grams are useful only in a monolingual setting, the syntactic n-grams were used in multi-/cross-lingual scenarios as well, as they are all derived from the same tagset/dependency relations.

• LASER embeddings map a sentence in a source language to a fixed-dimension vector in a common cross-lingual space, allowing us to map the vectors from different languages into a single space. Since the number of sentences in an essay is variable, we map each sentence in the segmented text to a vector and then compute the average of the vectors to yield a fixed-dimension representation as our feature vector.

• mBERT: We apply the 12-layer pre-trained multilingual BERT (trained on Wikipedias of 104 languages with a shared word-piece vocabulary) for mapping an essay (truncated to a maximum token length that is an upper bound for 93% of the documents) into a fixed-dimension vector. Specifically, we use the vector for the CLS token from the final layer as the feature vector for non-finetuned classification experiments. We used the MERLIN corpus texts to do task-specific fine-tuning of mBERT.

It is possible to use other representations, such as the average of the token embeddings of the last layer instead of the CLS token for mBERT, or to explore other recent pre-trained mono-/multilingual representations. Our goal is not to find the best representation but rather to test if a representative approach could be used for this problem. To our knowledge, only Mayfield and Black (2020) studied the application of BERT for AES in English, and its utility in the context of non-English and multilingual AES models has not been explored. Although it is possible to use "domain" features such as spelling/grammar errors, which are commonly seen in AES systems, our goal in this paper is to explore how far we can go without any language-specific resources for this task. Considering that such representations are expected to capture different aspects of language (Jawahar et al., 2019; Edmiston, 2020), we could hypothesize that some of the domain-specific features are already captured by them.
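As a rough sketch of the two language-agnostic document representations above (not the paper's actual pipeline: UDPipe tagging and LASER encoding are replaced here by pre-computed inputs, and the n-gram range is illustrative, not the paper's setting):

```python
from collections import Counter

def upos_ngrams(tags, n_values=(1, 2, 3)):
    """Count UPOS tag n-grams for one essay, given its flat tag sequence.
    The range of n shown here is an assumption."""
    counts = Counter()
    for n in n_values:
        for i in range(len(tags) - n + 1):
            counts[" ".join(tags[i:i + n])] += 1
    return counts

def mean_pool(sentence_vectors):
    """Average per-sentence embedding vectors (e.g., LASER outputs) into
    a single fixed-dimension essay vector."""
    dim = len(sentence_vectors[0])
    n = len(sentence_vectors)
    return [sum(vec[i] for vec in sentence_vectors) / n for i in range(dim)]
```

Because UPOS tags and the LASER embedding space are shared across languages, both representations place essays from different languages into one common feature space, which is what makes the multi- and cross-lingual experiments below possible.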
Classification Experiments

As discussed in Section 1, our motivation in this paper is to evaluate whether pre-trained multilingual embedding representations are useful for performing multi-dimensional AES, whether they can be used to achieve a universal representation for this task (multilingual) as well as to transfer from one language to another (cross-lingual), and whether the pre-trained embedding representations can be transferred to the AES task (fine-tuning). To explore this, we trained mono-/multi-/cross-lingual classification models using each of the features described in Section 3.2, for each of the seven dimensions.

All the models based on n-grams, LASER and mBERT were tested using traditional classification algorithms: Logistic Regression, Random Forests, and Linear SVM. The fine-tuned mBERT model consists of a softmax classification layer on top of the CLS token's embedding. We used the MERLIN corpus texts to fine-tune mBERT for this task in all three classification scenarios.

We evaluate the classifiers in monolingual and multilingual scenarios through stratified five-fold cross-validation, where the distribution of the labels is preserved across the folds. Owing to the nature of the corpus and the presence of unbalanced classes in all the languages and dimensions in the dataset, we used the weighted F-score to compare model performance, as was done in the REPROLANG challenge (Branco et al., 2020). In the cross-lingual scenario, we trained on the German dataset and tested separately on the Czech and Italian languages.

All the POS and dependency n-gram features were computed using UDPipe (Straka et al., 2016). All the traditional classification models were implemented using the Python library scikit-learn (Pedregosa et al., 2011) with the default settings. LASER embeddings were extracted using the Python package laserembeddings (https://pypi.org/project/laserembeddings/). The extraction of mBERT embeddings and fine-tuning was performed using the Hugging Face library (https://huggingface.co/transformers/v2.2.0/model_doc/bert.html) and PyTorch. The code, processed dataset, and detailed result files are uploaded as supplementary material with this paper.

Experiments and Results
As mentioned earlier, we trained monolingual, multilingual and cross-lingual classification models for all seven proficiency dimensions. We report results with logistic regression, which performed the best in most of the cases. The results for the other classifiers, Random Forest and Linear SVM, are provided in the supplementary material.
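The weighted F-score used throughout these comparisons is the per-class F1 averaged with class-frequency weights; a stdlib-only version for reference (it should match scikit-learn's `f1_score(..., average="weighted")`):

```python
from collections import Counter

def weighted_f1(y_true, y_pred):
    """Per-class F1 weighted by gold-label frequency (support)."""
    support = Counter(y_true)
    total = 0.0
    for cls, n_cls in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == p == cls)
        n_pred = sum(1 for p in y_pred if p == cls)
        prec = tp / n_pred if n_pred else 0.0
        rec = tp / n_cls
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        total += (n_cls / len(y_true)) * f1
    return total
```

Weighting by support is what makes the metric robust to the unbalanced CEFR label distributions noted above: a majority class cannot be ignored, but rare levels still contribute in proportion to their frequency.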
Monolingual Classification

Figures 1, 2 and 3 show the results of monolingual classification for German, Italian and Czech respectively, for all the feature sets and proficiency dimensions.
German
The fine-tuned mBERT model performs the best (Figure 1) for the overall CEFR proficiency dimension, closely followed by POS n-grams. Except for the vocabulary range dimension, though, none of the other dimensions seem to perform on par with overall proficiency in terms of absolute numbers. Fine-tuned mBERT performs the best for the orthographic control dimension, where the rest of the feature sets performed rather poorly. Overall, these results seem to indicate that all our features capture only the 'overall proficiency' dimension well, and to some extent the 'vocabulary range' dimension. All features perform rather poorly at predicting orthographic control.
Italian
Word n-grams perform the best for overall proficiency, closely followed by POS n-grams and the fine-tuned mBERT model. There is not much variation among the features, with little improvement over the strong document length baseline for any feature group. Further, compared to German, the performance on the other dimensions is far worse than on overall proficiency. Orthographic control is the worst performing dimension for Italian as well. Word n-grams are the best feature representation across all dimensions for Italian. Although mBERT fine-tuning improved the performance over the non-fine-tuned version, neither the LASER nor the mBERT based models perform better than word or POS n-grams in any dimension. Thus, while there are some similarities between German and Italian classification, we also observe some differences.
Czech
Across all the dimensions, the results (Figure 3) for Czech are different from German and Italian. The performance of the different systems on the coherence/cohesion dimension is much better than on overall proficiency. Orthographic control, which seemed to be the worst modeled dimension for German and Italian, does better than grammatical accuracy and vocabulary control. There is a larger difference between the baseline performance and the best performance for most of the dimensions than there was for German and Italian.

The main conclusions from the monolingual classification experiments are as follows:

• The feature groups don't capture multiple dimensions of proficiency well, and there is no single feature group that works equally well across all languages.
• Pre-trained and fine-tuned text representations seem to perform comparably to traditional n-gram features in several language-dimension combinations.

One possible reason for the variation across dimensions could be that the corpus consists of texts written by language learners coming from various native language backgrounds. It is possible that, because of this, there are no consistent n-gram patterns to capture in the various dimensions. Further, models such as LASER and mBERT are pre-trained on well-formed texts, and may not be able to capture the potentially erroneous language patterns in the MERLIN texts. We can hypothesize that the overall proficiency label captures some percentage of each dimension, and is probably easier to model than the others. However, even this hypothesis does not hold for Czech, where the coherence/cohesion dimension performs much better than overall proficiency. Clearly, more analysis and experiments are needed to understand these aspects. The current set of experiments indicates that this is a worthwhile future direction to pursue.
[Figure 1: German monolingual five-fold validation results, reporting weighted F-score per proficiency dimension for the document length baseline, word n-grams, POS n-grams, dependency n-grams, LASER, mBERT without tuning, and fine-tuned mBERT. All POS and Dep n-grams are based on the Universal Dependencies framework.]

[Figure 2: Italian monolingual five-fold validation results.]

[Figure 3: Czech monolingual five-fold validation results.]

Multilingual Classification

In multilingual classification, we work with a single dataset formed by combining the essays from all three languages. We trained and tested classifiers for all combinations of feature sets and dimensions on this single large dataset. Since CEFR guidelines for language proficiency are not specific to any one language, we would expect multilingual models to perform on par with individual monolingual models. The results of our multilingual experiments are given in Figure 4.

[Figure 4: Multi-dimensional, multilingual language proficiency classification. The document length baseline for the orthography dimension falls below the minimum threshold of the plot.]

Our results show that the fine-tuned mBERT model performs the best on most of the dimensions, closely followed by the UPOS n-gram features. To understand the relation between the multilingual model and its constituents, we looked at how each language fared in this model. For the overall proficiency dimension, for example, the best result is achieved with the fine-tuned classifier based on mBERT, which is close to the average of the results from the three monolingual models. While German and Italian saw a slight dip in the multilingual setup compared to the monolingual one, Czech saw a 5-point increase due to multilingual classification.

Clearly, multilingual classification is a beneficial setup for languages with lower monolingual performance or less data, without compromising on those languages that had better performance. However, there is still a lot of performance variation in terms of absolute numbers across dimensions. As with the monolingual models, we can potentially attribute this to the fact that we are dealing with a relatively small dataset in MERLIN, with texts written by people with diverse native language backgrounds, although more experiments are needed in this direction to confirm this.

Cross-lingual Classification

Here, we train different classification models on German, and test them on Italian and Czech. We chose German for the training side since German has the largest number of essays in the MERLIN corpus. In the case of mBERT, we performed fine-tuning on the German part of the corpus, and tested the models on Italian and Czech texts respectively. The goal of this experiment is to test if there are any universal patterns in proficiency across the languages, and to understand if zero-shot cross-lingual transfer is possible for this task.

UPOS n-grams consistently performed better than the other features for most of the dimensions, in both cross-lingual setups. There is more performance variation among the different dimensions for Italian compared to Czech. In the case of Czech, similar to the monolingual case, the coherence/cohesion dimension achieved superior performance, even with the baseline document length feature. This is a result worth considering for further qualitative analysis in the future. More details on the results of this experiment can be found in the Figures folder in the supplementary material.
Our cross-lingual experiments seem to indicate that the embedding representations we chose are not useful for zero-shot learning, and that UPOS n-grams may serve as a strong baseline for building AES systems for new languages.
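The zero-shot setup can be illustrated with a toy stand-in: because UPOS-based feature vectors live in the same space for every language, a model fit on German vectors can score Czech or Italian vectors directly. Below, a nearest-centroid classifier replaces the paper's logistic regression, and all vectors are made up:

```python
import math
from collections import defaultdict

def fit_centroids(X, y):
    """One centroid (mean vector) per CEFR label, from the source language."""
    groups = defaultdict(list)
    for vec, label in zip(X, y):
        groups[label].append(vec)
    return {label: [sum(col) / len(vecs) for col in zip(*vecs)]
            for label, vecs in groups.items()}

def predict(centroids, vec):
    """Assign the label of the nearest centroid (Euclidean distance)."""
    return min(centroids, key=lambda lbl: math.dist(centroids[lbl], vec))

# Fit on (toy) German feature vectors; the same centroids can then be
# applied unchanged to Czech or Italian vectors in the shared space.
de_X = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [9.0, 11.0]]
de_y = ["A1", "A1", "B1", "B1"]
cents = fit_centroids(de_X, de_y)
```

The catch, as the results above suggest, is that the transfer only works to the degree that the chosen representation really is language-agnostic; features that encode language identity more than proficiency will not survive the switch of test language.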
Discussion

We observed substantial performance differences across features, dimensions and languages in the various experimental settings. While we don't yet have a procedure to understand the exact reasons for this, examining the confusion matrices (provided in the supplementary material) may give us some insights into the nature of some of these differences. Therefore, we manually inspected a few confusion matrices, posing ourselves three questions:

1. How does a given feature set perform across different dimensions for a given language?
2. How do different features perform for a single dimension for a given language?
3. How does a given feature set perform for a given dimension across the three languages?

In all these cases, we did not notice any major differences, and the confusion matrices followed the trend observed in previous research: immediately proximal levels such as A2/B1 or A1/A2 are harder to distinguish accurately than distant levels such as A1/B2, and levels with larger representation have better results. It is neither possible to cover all possible combinations, nor is it sufficient to rely on confusion matrices alone to gain more insights into the models. Carefully planned interpretability analyses should be conducted in the future to understand these differences further.
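Confusion matrices like the ones inspected here can be rebuilt from gold/predicted label lists in a few lines (the labels and predictions below are illustrative, not results from the paper):

```python
from collections import Counter

def confusion_matrix(y_true, y_pred, labels):
    """Rows are gold CEFR levels, columns are predicted levels."""
    pairs = Counter(zip(y_true, y_pred))
    return [[pairs[(gold, pred)] for pred in labels] for gold in labels]
```

In such a matrix, the "proximal levels are harder" pattern shows up as mass concentrated on the diagonal and its immediate neighbours (A2 predicted as B1), with near-zero counts in the far corners (A1 predicted as B2).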
Conclusion

In this paper, we reported several experiments exploring multi-dimensional, CEFR-scale based language proficiency classification for three languages. Our main conclusions from these experiments can be summarized as follows:

1. UPOS n-gram features perform consistently well for all languages in monolingual classification scenarios for modeling "overall proficiency", closely followed by embedding features in most language-dimension combinations.
2. Fine-tuned large pre-trained models such as mBERT are useful language representations for multilingual classification, and languages with low monolingual performance benefit from a multilingual setup.
3. UPOS features seem to provide a strong baseline for zero-shot cross-lingual transfer, and fine-tuning was not very useful in this case.
4. None of the feature groups consistently perform well across all dimensions, languages, and classification setups.

The first conclusion is similar to Mayfield and Black (2020)'s conclusion on using BERT for English AES. However, these results need not be interpreted as a "no" to pre-trained models. Considering that they are closely behind n-grams in many cases and were slightly better than them for German, we believe they are useful for this task, and more research needs to be done in this direction exploring other language models and fine-tuning options. Pre-trained and fine-tuned models are clearly useful in a multilingual classification setup, and this would be an interesting new direction to pursue for this task. As a continuation of these experiments, one could look for a larger CEFR-annotated corpus for a language such as English, and explore multilingual learning for languages with less data.

The results from the experiments presented in this paper highlight the inherent difficulty in capturing multiple dimensions of language proficiency through existing methods, and the need for more future research in this direction. An important direction for future work is to develop better feature representations that capture specific dimensions of language proficiency, which can potentially work for many languages. Considering that all the dimensions share some commonalities and differences with each other, multi-task learning is another useful direction to explore.
References
Tamara Sladoljev Agejev and Jan Šnajder. 2017. Using analytic scoring rubrics in the automatic assessment of college-level summary writing tasks in L2. In Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 2: Short Papers), pages 181–186.

Dimitrios Alikaniotis, Helen Yannakoudakis, and Marek Rei. 2016. Automatic text scoring using neural networks. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 715–725, Berlin, Germany. Association for Computational Linguistics.

Mikel Artetxe and Holger Schwenk. 2019. Massively multilingual sentence embeddings for zero-shot cross-lingual transfer and beyond. Transactions of the Association for Computational Linguistics, 7:597–610.

Yigal Attali and Jill Burstein. 2004. Automated essay scoring with e-rater® v. 2.0. ETS Research Report Series, 2004(2).

Yigal Attali and Sandip Sinharay. 2015. Automated trait scores for TOEFL® writing tasks. ETS Research Report Series, 2015(1):1–14.

Nicolas Ballier, Stéphane Canu, Caroline Petitjean, Gilles Gasso, Carlos Balhana, Theodora Alexopoulou, and Thomas Gaillat. 2020. Machine learning for learner English: A plea for creating learner data challenges. International Journal of Learner Corpus Research, 6(1):72–103.

Olaf Bärenfänger. 2013. Assessing the reliability and scale functionality of the MERLIN written speech sample ratings. Technical report, European Academy, Bolzano, Italy.

Stig Johan Berggren, Taraka Rama, and Lilja Øvrelid. 2019. Regression or classification? Automated essay scoring for Norwegian. In Proceedings of the Fourteenth Workshop on Innovative Use of NLP for Building Educational Applications, pages 92–102.

Adriane Boyd, Jirka Hana, Lionel Nicolas, Detmar Meurers, Katrin Wisniewski, Andrea Abel, Karin Schöne, Barbora Štindlová, and Chiara Vettori. 2014. The MERLIN corpus: Learner language and the CEFR. In LREC, pages 1281–1288.

António Branco, Nicoletta Calzolari, Piek Vossen, Gertjan van Noord, Dieter van Uytvanck, João Silva, Luís Gomes, André Moreira, and Willem Elbers. 2020. A shared task of a new, collaborative type to foster reproducibility: A first exercise in the area of language science and technology with REPROLANG2020. In Proceedings of the 12th Language Resources and Evaluation Conference, pages 5539–5545.

Ronan Cummins and Marek Rei. 2018. Neural multi-task learning in automated assessment. arXiv preprint arXiv:1801.06830.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

Fei Dong and Yue Zhang. 2016. Automatic features for essay scoring: An empirical study. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1072–1077.

Daniel Edmiston. 2020. A systematic analysis of morphological content in BERT models for multiple languages. arXiv preprint arXiv:2004.03032.

Council of Europe. 2002. Common European Framework of Reference for Languages: Learning, teaching, assessment. Structured overview of all CEFR scales.

Julia Hancke. 2013. Automatic prediction of CEFR proficiency levels based on linguistic features of learner language. Master's thesis, University of Tübingen.

Julia Hancke and Detmar Meurers. 2013. Exploring CEFR classification for German based on rich linguistic modeling. Pages 54–56.

Ganesh Jawahar, Benoît Sagot, and Djamé Seddah. 2019. What does BERT learn about the structure of language? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3651–3657.

Cancan Jin, Ben He, Kai Hui, and Le Sun. 2018. TDNN: A two-stage deep neural network for prompt-independent automated essay scoring. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1088–1097.

Zixuan Ke and Vincent Ng. 2019. Automated essay scoring: A survey of the state of the art. In Proceedings of the 28th International Joint Conference on Artificial Intelligence, pages 6300–6308. AAAI Press.

Beata Beigman Klebanov and Michael Flor. 2013. Word association profiles and their use for automated scoring of essays. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), volume 1, pages 1148–1158.

Beata Beigman Klebanov and Nitin Madnani. 2020. Automated evaluation of writing: 50 years and counting. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7796–7810, Online. Association for Computational Linguistics.

Yong-Won Lee, Claudia Gentile, and Robert Kantor. 2009. Toward automated multi-trait scoring of essays: Investigating links among holistic, analytic, and text feature scores. Applied Linguistics, 31(3):391–417.

Jiawei Liu, Yang Xu, and Lingzhe Zhao. 2019. Automated essay scoring based on two-stage learning. CoRR, abs/1901.07744.

Sandeep Mathias and Pushpak Bhattacharyya. 2020. Can neural networks automatically score essay traits? In Proceedings of the Fifteenth Workshop on Innovative Use of NLP for Building Educational Applications, pages 85–91, Seattle, WA, USA → Online. Association for Computational Linguistics.

Elijah Mayfield and Alan W Black. 2020. Should you fine-tune BERT for automated essay scoring? In Proceedings of the Fifteenth Workshop on Innovative Use of NLP for Building Educational Applications, pages 151–162, Seattle, WA, USA → Online. Association for Computational Linguistics.

Joakim Nivre, Marie-Catherine de Marneffe, Filip Ginter, Yoav Goldberg, Jan Hajic, Christopher D. Manning, Ryan T. McDonald, Slav Petrov, Sampo Pyysalo, Natalia Silveira, Reut Tsarfaty, and Daniel Zeman. 2016. Universal Dependencies v1: A Multilingual Treebank Collection. In
Proceedings ofthe Tenth International Conference on Language Re-sources and Evaluation (LREC 2016) , pages 1659–1666.F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel,B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer,R. Weiss, V. Dubourg, J. Vanderplas, A. Passos,D. Cournapeau, M. Brucher, M. Perrot, and E. Duch-esnay. 2011. Scikit-learn: Machine learning inPython.
Journal of Machine Learning Research ,12:2825–2830.Peter Phandi, Kian Ming A Chai, and Hwee Tou Ng.2015. Flexible domain adaptation for automated es-say scoring using correlated linear regression. In
Proceedings of the 2015 Conference on EmpiricalMethods in Natural Language Processing , pages431–439.Ildik´o Pil´an, David Alfter, and Elena Volodina. 2016.Coursebook texts as a helping hand for classifyinglinguistic complexity in language learners’ writings.
CL4LC 2016 , page 120.Eunjin Shin. 2018. A neural network approach toautomated essay scoring: A comparison with themethod of integrating deep language features usingcoh-metrix.Milan Straka, Jan Hajic, and Jana Strakov´a. 2016. UD-Pipe: Trainable Pipeline for Processing CoNLL-UFiles Performing Tokenization, Morphological Anal-ysis, POS Tagging and Parsing. In
LREC . Sowmya Vajjala and Kaidi L˜oo. 2014. Automatic cefrlevel prediction for estonian learner text. In
Pro-ceedings of the third workshop on NLP for computer-assisted language learning at SLTC 2014, UppsalaUniversity , 107. Link¨oping University ElectronicPress.Sowmya Vajjala and Taraka Rama. 2018. Experimentswith universal cefr classification. In
Proceedings ofthe Thirteenth Workshop on Innovative Use of NLPfor Building Educational Applications , pages 147–153, New Orleans, Louisiana. Association for Com-putational Linguistics.Torsten Zesch, Michael Wojatzki, and Dirk Scholten-Akoun. 2015. Task-independent features for auto-mated essay grading. In
BEA@ NAACL-HLT , pages224–232.
A Supplemental Material