Are pre-trained text representations useful for multilingual and multi-dimensional language proficiency modeling?
Taraka Rama
University of North Texas, USA
Sowmya Vajjala
National Research Council, Canada [email protected]
Abstract
Development of language proficiency models for non-native learners has been an active area of interest in NLP research for the past few years. Although language proficiency is multidimensional in nature, existing research typically considers a single "overall proficiency" while building models. Further, existing approaches also consider only one language at a time. This paper describes our experiments and observations about the role of pre-trained and fine-tuned multilingual embeddings in performing multi-dimensional, multilingual language proficiency classification. We report experiments with three languages – German, Italian, and Czech – and model seven dimensions of proficiency ranging from vocabulary control to sociolinguistic appropriateness. Our results indicate that while fine-tuned embeddings are useful for multilingual proficiency modeling, none of the features achieve consistently best performance for all dimensions of language proficiency.

All code, data and related supplementary material can be found at: https://github.com/nishkalavallabhi/MultidimCEFRScoring

Introduction

Automated Essay Scoring (AES) is the task of grading test taker writing using computer programs. It has been an active area of research in NLP for the past 15 years. Although most of the existing research focused on English, recent years saw the development of AES models for second language proficiency assessment for non-English languages, typically modeled using the Common European Framework of Reference (CEFR) scale (Council of Europe, 2002) in Europe.

Most of the past research focused on monolingual AES models. However, the notion of language proficiency is not limited to any one language. As a matter of fact, CEFR (Council of Europe, 2002) provides language-agnostic guidelines to describe different levels of language proficiency, from A1 (beginner) to C2 (advanced). Hence, a universal, multilingual language proficiency model is an interesting possibility to explore.
From an application perspective, it will be useful to know if one can achieve cross-lingual transfer and build an AES system for a new language with little or no training data. Vajjala and Rama (2018) explored these ideas with basic features such as n-grams and POS tag ratios. The usefulness of large, pre-trained multilingual models (with or without fine-tuning) from recent NLP research has not been studied for this task, especially for non-English languages.

Further, AES research generally considers language proficiency as a single construct. However, proficiency encompasses multiple dimensions such as vocabulary richness, grammatical accuracy, coherence/cohesion, usage of idioms etc. (Attali and Burstein, 2004). CEFR guidelines also provide language proficiency rubrics for individual dimensions along with overall proficiency for A1–C2. Modeling multiple dimensions instead of a single "overall proficiency" could result in a more fine-grained assessment for offering specific feedback.

Given this background, we explore the usefulness of multilingual pre-trained embeddings for training multi-dimensional language proficiency scoring models for three languages – German, Czech and Italian. The main contributions of our paper are listed below:

• We address the problem of multi-dimensional modeling of language proficiency for three (non-English) languages.
• We explore whether large pre-trained, multilingual embeddings are useful as feature representations for this task with and without fine-tuning.
• We investigate the possibility of a universal multilingual language proficiency model and zero-shot cross-lingual transfer using embedding representations.

The paper is organized as follows. Section 2 briefly surveys the related work. Section 3 describes our corpus, features, and experimental settings. Section 4 discusses our results in detail. Section 5 concludes the paper with pointers to future work.
Related Work

Automated Essay Scoring (AES) is a well-researched problem in NLP and has been applied to real-world language assessment scenarios for English (Attali and Burstein, 2004). A wide range of features such as document length, lexical/syntactic n-grams, and features capturing linguistic aspects such as vocabulary, syntax and discourse are commonly used (Klebanov and Flor, 2013; Phandi et al., 2015; Zesch et al., 2015). In the recent past, different forms of text embeddings and pre-trained language models have also been explored (Alikaniotis et al., 2016; Dong and Zhang, 2016; Mayfield and Black, 2020), along with approaches to combine linguistic features with neural networks (Shin, 2018; Liu et al., 2019). Ke and Ng (2019) and Klebanov and Madnani (2020) present the most recent surveys on the state of the art in AES (focusing on English).

In terms of modeling, AES has been modeled as a classification, regression, and ranking problem, with approaches ranging from linear regression to deep learning models. Some of the recent work explored the usefulness of multi-task learning (Cummins and Rei, 2018; Berggren et al., 2019) and transfer learning (Jin et al., 2018; Ballier et al., 2020).
Going beyond approaches that work for a single language, Vajjala and Rama (2018) reported on developing methods for multi- and cross-lingual AES.

Much of the existing AES research has been focused on English, but there is a growing body of research on other European languages: German (Hancke and Meurers, 2013), Estonian (Vajjala and Lõo, 2014), Swedish (Pilán et al., 2016), and Norwegian (Berggren et al., 2019), which explored both language-specific (e.g., case markers in Estonian) as well as language-agnostic (e.g., POS n-grams) features (Vajjala and Rama, 2018) for this task. However, to our knowledge, the use of large pre-trained language models such as BERT (Devlin et al., 2018) has not been explored yet for AES in non-English languages.

Further, most of the approaches focused on modeling language proficiency as a single variable. Although there is some research focusing on multiple dimensions of language proficiency (Lee et al., 2009; Attali and Sinharay, 2015; Agejev and Šnajder, 2017; Mathias and Bhattacharyya, 2020), none of them focused on non-English languages or used recent multilingual pre-trained models such as BERT. In this paper, we focus on this problem of multi-dimensional modeling of language proficiency for three languages – German, Italian, and Czech – and explore whether recent research on multilingual embeddings can be useful for non-English AES.
Experimental Setup

In this section, we describe the corpus, features, models, and implementation details. We modeled the task as a classification problem and trained individual models for each of the seven dimensions of language proficiency. The rest of this section describes the different steps involved in our approach in detail.
Corpus

In this paper, we employed the publicly available MERLIN corpus (Boyd et al., 2014), which was also used in the experiments reported in some past research (Hancke, 2013; Vajjala and Rama, 2018) and in the recently conducted REPROLANG challenge (Branco et al., 2020). The MERLIN corpus (available for download at https://merlin-platform.eu/C_download.php) contains CEFR scale based language proficiency annotations for texts produced by non-native learners in three languages – German, Czech, and Italian – in seven dimensions, which are described below:

1. Overall proficiency is the generic label expected to summarize the language proficiency across different dimensions.
2. Grammatical accuracy refers to the usage and control over the language's grammar.
3. Orthographic control refers to the aspects of language connected with writing, such as punctuation, spelling mistakes etc.
4. Vocabulary range refers to the breadth of vocabulary use, including phrases, idiomatic expressions, colloquialisms etc.
5. Vocabulary control refers to the correct and appropriate use of vocabulary.
6. Coherence and cohesion refers to the ability to connect different parts of the text using appropriate vocabulary (e.g., connecting words) and creating a smoothly flowing text.
7. Sociolinguistic appropriateness refers to the awareness of language use in different social contexts: for example, using a proper form of introduction, the ability to express oneself in both formal and informal language, understanding the sociocultural aspects of language use etc.

A detailed description of each dimension at each CEFR level is provided in the structured overview of CEFR scales document (Council of Europe, 2002). In the MERLIN corpus, these annotations were prepared by human graders who were trained on these well-defined rubrics. More details on the examination setup, grade annotation guidelines, rating procedure, inter-rater reliability and reliability of rating measures can be found in the project documentation (Bärenfänger, 2013). We used the texts and their universal dependency parsed versions – shared by Vajjala and Rama (2018) – consisting of 2266 documents in total (1029 German, 803 Italian, 434 Czech).

The German corpus had A1–C1, the Italian corpus had A2–B1, and the Czech corpus had A2–B2 levels for the overall proficiency category. More CEFR levels were represented in the corpus for other proficiency dimensions. In this paper, we treat the annotated labels in the corpus as the gold standard labels.
Missing labels
Fewer than ten documents had an annotation of "0" instead of a CEFR level (A1–C2) for some of the dimensions. The documentation did not provide any reason behind this label assignment, and we removed them from our experiments. In the case of German and Italian, for fewer than ten documents, some individual dimensions had a score of "0" while the overall rating was A1. For these documents, we treated the "0" score as an A1 rating for that dimension. In the case of Czech, about half of the documents for the sociolinguistic appropriateness dimension had a score of "0". The corpus manual does not provide any explanation for the missing annotation; therefore, we excluded this dimension from all experiments involving Czech data. (More details on the corpus distribution can be found in the MERLIN documentation, and the result files we share as supplementary material contain CEFR level distributions for all the classification scenarios, for all languages.)
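The clean-up described above can be sketched as follows. The record layout here (language code, overall rating, and a per-dimension score dict) is a hypothetical stand-in, not the actual MERLIN file format:

```python
def clean_labels(records):
    """Apply the two label fixes described above: map a "0" dimension
    score to A1 when the overall rating is A1 (German/Italian), and drop
    the sociolinguistic dimension entirely for Czech."""
    cleaned = []
    for rec in records:
        rec = dict(rec)                      # don't mutate the caller's data
        rec["scores"] = dict(rec["scores"])
        if rec["lang"] in ("de", "it") and rec["overall"] == "A1":
            for dim, score in rec["scores"].items():
                if score == "0":
                    rec["scores"][dim] = "A1"
        if rec["lang"] == "cs":
            rec["scores"].pop("sociolinguistic", None)
        cleaned.append(rec)
    return cleaned
```

Documents carrying a "0" with a non-A1 overall rating would simply be filtered out before this step, as described in the paragraph above.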
Inter-dimensional correlations
Bärenfänger (2013)'s analyses on the MERLIN corpus show that correlations among the different dimensions (including overall proficiency) vary widely across the three languages. In general, correlations between any two dimensions, and specifically with the overall proficiency dimension, are higher for German and Italian than for Czech. There is no consistently high correlation of overall proficiency with any single dimension. The variations show that these individual dimensions are indeed different from each other as well as from the overall proficiency dimension, and we could expect that a model trained on one dimension need not necessarily reflect the language proficiency of the test taker in another dimension. This further motivates our decision to explore a multi-dimensional proficiency assessment approach in this paper.
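The correlations discussed here are plain pairwise correlations between per-dimension scores. As a small illustration (the scores below are made-up ordinals, A1=1 through C2=6, not MERLIN values), Pearson's r can be computed as:

```python
import math

def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Toy example: overall vs. vocabulary-range ratings for five essays
overall = [2, 3, 3, 4, 5]
vocab = [2, 2, 3, 4, 4]
```

A model trained on one dimension transferring poorly to another is exactly what low inter-dimensional correlation would predict.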
(For details of the inter-dimensional correlations, refer to Table 4 for Czech, Table 11 for German, and Table 17 for Italian in Bärenfänger (2013).)

Features

One of the goals of the paper is to examine if text representations computed from large, pre-trained, multilingual models such as mBERT (Devlin et al., 2018) and LASER (Artetxe and Schwenk, 2019) are useful for the AES task. We trained classifiers based on these two pre-trained models and compare them with two previously used feature sets – a document length baseline and the n-gram features used in Vajjala and Rama (2018). All the features are described below:

• Baseline: Document length (number of tokens) is a standard feature in all AES approaches (Attali and Burstein, 2004).

• Lexical and syntactic features: n-grams of words, of Universal POS tags (UPOS) from the Universal Dependencies project (Nivre et al., 2016), and of dependency triplets consisting of the head POS label, dependent POS label, and the dependency label, extracted using UDPipe (Straka et al., 2016). While word n-grams are useful only in a monolingual setting, the syntactic n-grams were used in multi-/cross-lingual scenarios as well, as they are all derived from the same tagset/dependency relations.

• LASER embeddings map a sentence in a source language to a fixed-dimension vector in a common cross-lingual space, allowing us to map the vectors from different languages into a single space. Since the number of sentences in an essay is variable, we map each sentence in the segmented text to a vector and then compute the average of the vectors to yield a fixed-dimension representation as our feature vector.

• mBERT: We apply the 12-layer pre-trained multilingual BERT (trained on Wikipedias of 104 languages with a shared word-piece vocabulary) for mapping an essay (truncated to a maximum token length that is an upper bound for 93% of the documents) into a fixed-dimension vector. Specifically, we use the vector for the CLS token from the final layer as the feature vector for non-finetuned classification experiments. We used the MERLIN corpus texts to do task-specific fine-tuning of mBERT.

It is possible to use other representations, such as the average of the token embeddings of the last layer instead of the CLS token for mBERT, or to explore other recent pre-trained mono-/multilingual representations. Our goal is not to find the best representation but rather to test if a representative approach could be used for this problem. To our knowledge, only Mayfield and Black (2020) studied the application of BERT for AES in English, and its utility in the context of non-English and multilingual AES models has not been explored. Although it is possible to use "domain" features such as spelling/grammar errors, which are commonly seen in AES systems, our goal in this paper is to explore how far we can go without any language-specific resources for this task. Considering that such representations are expected to capture different aspects of language (Jawahar et al., 2019; Edmiston, 2020), we could hypothesize that some of the domain-specific features are already captured by them.
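As a rough sketch of the two language-agnostic document representations above (not the paper's actual pipeline: UDPipe tagging and LASER encoding are replaced here by pre-computed inputs, and the n-gram range is illustrative, not the paper's setting):

```python
from collections import Counter

def upos_ngrams(tags, n_values=(1, 2, 3)):
    """Count UPOS tag n-grams for one essay, given its flat tag sequence.
    The range of n shown here is an assumption."""
    counts = Counter()
    for n in n_values:
        for i in range(len(tags) - n + 1):
            counts[" ".join(tags[i:i + n])] += 1
    return counts

def mean_pool(sentence_vectors):
    """Average per-sentence embedding vectors (e.g., LASER outputs) into
    a single fixed-dimension essay vector."""
    dim = len(sentence_vectors[0])
    n = len(sentence_vectors)
    return [sum(vec[i] for vec in sentence_vectors) / n for i in range(dim)]
```

Because UPOS tags and the LASER embedding space are shared across languages, both representations place essays from different languages into one common feature space, which is what makes the multi- and cross-lingual experiments below possible.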
Classification Experiments

As discussed in Section 1, our motivation in this paper is to evaluate whether pre-trained multilingual embedding representations are useful for performing multi-dimensional AES, whether they can be used to achieve a universal representation for this task (multilingual) as well as to transfer from one language to another (cross-lingual), and whether the pre-trained embedding representations can be transferred to the AES task (fine-tuning). To explore this, we trained mono-/multi-/cross-lingual classification models using each of the features described in Section 3.2, for each of the seven dimensions.

All the models based on n-grams, LASER and mBERT were tested using traditional classification algorithms: Logistic Regression, Random Forests, and Linear SVM. The fine-tuned mBERT model consists of a softmax classification layer on top of the CLS token's embedding. We used the MERLIN corpus texts to fine-tune mBERT for this task in all three classification scenarios.

We evaluate the classifiers in monolingual and multilingual scenarios through stratified five-fold cross-validation, where the distribution of the labels is preserved across the folds. Owing to the nature of the corpus and the presence of unbalanced classes in all the languages and dimensions in the dataset, we used the weighted F-score to compare model performance, as was done in the REPROLANG challenge (Branco et al., 2020). In the cross-lingual scenario, we trained on the German dataset and tested separately on the Czech and Italian languages.

All the POS and dependency n-gram features were computed using UDPipe (Straka et al., 2016). All the traditional classification models were implemented using the Python library scikit-learn (Pedregosa et al., 2011) with the default settings. LASER embeddings were extracted using the Python package laserembeddings (https://pypi.org/project/laserembeddings/). The extraction of mBERT embeddings and fine-tuning was performed using the Hugging Face library (https://huggingface.co/transformers/v2.2.0/model_doc/bert.html) and PyTorch. The code, processed dataset, and detailed result files are uploaded as supplementary material with this paper.

Experiments and Results
As mentioned earlier, we trained monolingual, multilingual and cross-lingual classification models for all seven proficiency dimensions. We report results with logistic regression, which performed the best in most of the cases. The results for the other classifiers, Random Forest and Linear SVM, are provided in the supplementary material.
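The weighted F-score used throughout these comparisons is the per-class F1 averaged with class-frequency weights; a stdlib-only version for reference (it should match scikit-learn's `f1_score(..., average="weighted")`):

```python
from collections import Counter

def weighted_f1(y_true, y_pred):
    """Per-class F1 weighted by gold-label frequency (support)."""
    support = Counter(y_true)
    total = 0.0
    for cls, n_cls in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == p == cls)
        n_pred = sum(1 for p in y_pred if p == cls)
        prec = tp / n_pred if n_pred else 0.0
        rec = tp / n_cls
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        total += (n_cls / len(y_true)) * f1
    return total
```

Weighting by support is what makes the metric robust to the unbalanced CEFR label distributions noted above: a majority class cannot be ignored, but rare levels still contribute in proportion to their frequency.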
Monolingual Classification

Figures 1, 2 and 3 show the results of monolingual classification for German, Italian and Czech respectively, for all the feature sets and proficiency dimensions.
German
The fine-tuned mBERT model performs the best (Figure 1) for the overall CEFR proficiency dimension, closely followed by POS n-grams. Except for the vocabulary range dimension, though, none of the other dimensions seem to perform on par with overall proficiency in terms of absolute numbers. Fine-tuned mBERT performs the best for the orthographic control dimension, where the rest of the feature sets performed rather poorly. Overall, these results seem to indicate that all our features capture only the 'overall proficiency' dimension well, and to some extent the 'vocabulary range' dimension. All features perform rather poorly at predicting orthographic control.
Italian
Word n-grams perform the best for overall proficiency, closely followed by POS n-grams and the fine-tuned mBERT model. There is not much variation among the features, with little improvement over the strong document length baseline for any feature group. Further, compared to German, the performance on the other dimensions is far worse than on overall proficiency. Orthographic control is the worst performing dimension for Italian as well. Word n-grams are the best feature representation across all dimensions for Italian. Although mBERT fine-tuning improved the performance over the non-fine-tuned version, neither the LASER nor the mBERT based models perform better than word or POS n-grams in any dimension. Thus, while there are some similarities between German and Italian classification, we also observe some differences.
Czech
Across all the dimensions, the results (Figure 3) for Czech are different from German and Italian. The performance of the different systems on the coherence/cohesion dimension is much better than on overall proficiency. Orthographic control, which seemed to be the worst modeled dimension for German and Italian, does better than grammatical accuracy and vocabulary control. There is a larger difference between the baseline performance and the best performance for most of the dimensions than there was for German and Italian.

The main conclusions from the monolingual classification experiments are as follows:

• The feature groups don't capture multiple dimensions of proficiency well, and there is no single feature group that works equally well across all languages.
• Pre-trained and fine-tuned text representations seem to perform comparably to traditional n-gram features in several language-dimension combinations.

One possible reason for the variation across dimensions could be that the corpus consists of texts written by language learners coming from various native language backgrounds. It is possible that, because of this, there are no consistent n-gram patterns to capture in the various dimensions. Further, models such as LASER and mBERT are pre-trained on well-formed texts, and may not be able to capture the potentially erroneous language patterns in the MERLIN texts. We can hypothesize that the overall proficiency label captures some percentage of each dimension, and is probably easier to model than the others. However, even this hypothesis does not hold for Czech, where the coherence/cohesion dimension performs much better than overall proficiency. Clearly, more analysis and experiments are needed to understand these aspects. The current set of experiments indicates that this is a worthwhile future direction to pursue.
[Figure 1: German monolingual five-fold validation results, reporting weighted F-score per proficiency dimension for the document length baseline, word n-grams, POS n-grams, dependency n-grams, LASER, mBERT without tuning, and fine-tuned mBERT. All POS and Dep n-grams are based on the Universal Dependencies framework.]

[Figure 2: Italian monolingual five-fold validation results.]

[Figure 3: Czech monolingual five-fold validation results.]

Multilingual Classification

In multilingual classification, we work with a single dataset formed by combining the essays from all three languages. We trained and tested classifiers for all combinations of feature sets and dimensions on this single large dataset. Since CEFR guidelines for language proficiency are not specific to any one language, we would expect multilingual models to perform on par with individual monolingual models. The results of our multilingual experiments are given in Figure 4.

[Figure 4: Multi-dimensional, multilingual language proficiency classification. The document length baseline for the orthography dimension falls below the minimum threshold of the plot.]

Our results show that the fine-tuned mBERT model performs the best on most of the dimensions, closely followed by the UPOS n-gram features. To understand the relation between the multilingual model and its constituents, we looked at how each language fared in this model. For the overall proficiency dimension, for example, the best result is achieved with the fine-tuned classifier based on mBERT, which is close to the average of the results from the three monolingual models. While German and Italian saw a slight dip in the multilingual setup compared to the monolingual one, Czech saw a 5-point increase due to multilingual classification.

Clearly, multilingual classification is a beneficial setup for languages with lower monolingual performance or less data, without compromising on those languages that had better performance. However, there is still a lot of performance variation in terms of absolute numbers across dimensions. As with the monolingual models, we can potentially attribute this to the fact that we are dealing with a relatively small dataset in MERLIN, with texts written by people with diverse native language backgrounds, although more experiments are needed in this direction to confirm this.

Cross-lingual Classification

Here, we train different classification models on German, and test them on Italian and Czech. We chose German for the training side since German has the largest number of essays in the MERLIN corpus. In the case of mBERT, we performed fine-tuning on the German part of the corpus, and tested the models on Italian and Czech texts respectively. The goal of this experiment is to test if there are any universal patterns in proficiency across the languages, and to understand if zero-shot cross-lingual transfer is possible for this task.

UPOS n-grams consistently performed better than the other features for most of the dimensions, in both cross-lingual setups. There is more performance variation among the different dimensions for Italian compared to Czech. In the case of Czech, similar to the monolingual case, the coherence/cohesion dimension achieved superior performance, even with the baseline document length feature. This is a result worth considering for further qualitative analysis in the future. More details on the results of this experiment can be found in the Figures folder in the supplementary material.
Our cross-lingual experiments seem to indicate that the embedding representations we chose are not useful for zero-shot learning, and that UPOS n-grams may serve as a strong baseline for building AES systems for new languages.
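The zero-shot setup can be illustrated with a toy stand-in: because UPOS-based feature vectors live in the same space for every language, a model fit on German vectors can score Czech or Italian vectors directly. Below, a nearest-centroid classifier replaces the paper's logistic regression, and all vectors are made up:

```python
import math
from collections import defaultdict

def fit_centroids(X, y):
    """One centroid (mean vector) per CEFR label, from the source language."""
    groups = defaultdict(list)
    for vec, label in zip(X, y):
        groups[label].append(vec)
    return {label: [sum(col) / len(vecs) for col in zip(*vecs)]
            for label, vecs in groups.items()}

def predict(centroids, vec):
    """Assign the label of the nearest centroid (Euclidean distance)."""
    return min(centroids, key=lambda lbl: math.dist(centroids[lbl], vec))

# Fit on (toy) German feature vectors; the same centroids can then be
# applied unchanged to Czech or Italian vectors in the shared space.
de_X = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [9.0, 11.0]]
de_y = ["A1", "A1", "B1", "B1"]
cents = fit_centroids(de_X, de_y)
```

The catch, as the results above suggest, is that the transfer only works to the degree that the chosen representation really is language-agnostic; features that encode language identity more than proficiency will not survive the switch of test language.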
Discussion

We observed substantial performance differences across features, dimensions and languages in the various experimental settings. While we don't yet have a procedure to understand the exact reasons for this, examining the confusion matrices (provided in the supplementary material) may give us some insights into the nature of some of these differences. Therefore, we manually inspected a few confusion matrices, posing ourselves three questions:

1. How does a given feature set perform across different dimensions for a given language?
2. How do different features perform for a single dimension for a given language?
3. How does a given feature set perform for a given dimension across the three languages?

In all these cases, we did not notice any major differences, and the confusion matrices followed the trend observed in previous research: immediately proximal levels such as A2/B1 or A1/A2 are harder to distinguish accurately than distant levels such as A1/B2, and levels with larger representation have better results. It is neither possible to cover all possible combinations, nor is it sufficient to rely on confusion matrices alone to gain more insights into the models. Carefully planned interpretability analyses should be conducted in the future to understand these differences further.
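Confusion matrices like the ones inspected here can be rebuilt from gold/predicted label lists in a few lines (the labels and predictions below are illustrative, not results from the paper):

```python
from collections import Counter

def confusion_matrix(y_true, y_pred, labels):
    """Rows are gold CEFR levels, columns are predicted levels."""
    pairs = Counter(zip(y_true, y_pred))
    return [[pairs[(gold, pred)] for pred in labels] for gold in labels]
```

In such a matrix, the "proximal levels are harder" pattern shows up as mass concentrated on the diagonal and its immediate neighbours (A2 predicted as B1), with near-zero counts in the far corners (A1 predicted as B2).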
Conclusion

In this paper, we reported several experiments exploring multi-dimensional, CEFR-scale based language proficiency classification for three languages. Our main conclusions from these experiments can be summarized as follows:

1. UPOS n-gram features perform consistently well for all languages in monolingual classification scenarios for modeling "overall proficiency", closely followed by embedding features in most language-dimension combinations.
2. Fine-tuned large pre-trained models such as mBERT are useful language representations for multilingual classification, and languages with low monolingual performance benefit from a multilingual setup.
3. UPOS features seem to provide a strong baseline for zero-shot cross-lingual transfer, and fine-tuning was not very useful in this case.
4. None of the feature groups consistently perform well across all dimensions, languages, and classification setups.

The first conclusion is similar to Mayfield and Black (2020)'s conclusion on using BERT for English AES. However, these results need not be interpreted as a "no" to pre-trained models. Considering that they are closely behind n-grams in many cases and were slightly better than them for German, we believe they are useful for this task, and more research needs to be done in this direction exploring other language models and fine-tuning options. Pre-trained and fine-tuned models are clearly useful in a multilingual classification setup, and this would be an interesting new direction to pursue for this task. As a continuation of these experiments, one could look for a larger CEFR-annotated corpus for a language such as English, and explore multilingual learning for languages with less data.

The results from the experiments presented in this paper highlight the inherent difficulty in capturing multiple dimensions of language proficiency through existing methods, and the need for more future research in this direction. An important direction for future work is to develop better feature representations that capture specific dimensions of language proficiency, which can potentially work for many languages. Considering that all the dimensions share some commonalities and differences with each other, multi-task learning is another useful direction to explore.
References
Tamara Sladoljev Agejev and Jan Šnajder. 2017. Using analytic scoring rubrics in the automatic assessment of college-level summary writing tasks in L2. In Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 2: Short Papers), pages 181–186.

Dimitrios Alikaniotis, Helen Yannakoudakis, and Marek Rei. 2016. Automatic text scoring using neural networks. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 715–725, Berlin, Germany. Association for Computational Linguistics.

Mikel Artetxe and Holger Schwenk. 2019. Massively multilingual sentence embeddings for zero-shot cross-lingual transfer and beyond. Transactions of the Association for Computational Linguistics, 7:597–610.

Yigal Attali and Jill Burstein. 2004. Automated essay scoring with e-rater® v. 2.0. ETS Research Report Series, 2004(2).

Yigal Attali and Sandip Sinharay. 2015. Automated trait scores for TOEFL® writing tasks. ETS Research Report Series, 2015(1):1–14.

Nicolas Ballier, Stéphane Canu, Caroline Petitjean, Gilles Gasso, Carlos Balhana, Theodora Alexopoulou, and Thomas Gaillat. 2020. Machine learning for learner English: A plea for creating learner data challenges. International Journal of Learner Corpus Research, 6(1):72–103.

Olaf Bärenfänger. 2013. Assessing the reliability and scale functionality of the MERLIN written speech sample ratings. Technical report, European Academy, Bolzano, Italy.

Stig Johan Berggren, Taraka Rama, and Lilja Øvrelid. 2019. Regression or classification? Automated essay scoring for Norwegian. In Proceedings of the Fourteenth Workshop on Innovative Use of NLP for Building Educational Applications, pages 92–102.

Adriane Boyd, Jirka Hana, Lionel Nicolas, Detmar Meurers, Katrin Wisniewski, Andrea Abel, Karin Schöne, Barbora Štindlová, and Chiara Vettori. 2014. The MERLIN corpus: Learner language and the CEFR. In LREC, pages 1281–1288.

António Branco, Nicoletta Calzolari, Piek Vossen, Gertjan van Noord, Dieter van Uytvanck, João Silva, Luís Gomes, André Moreira, and Willem Elbers. 2020. A shared task of a new, collaborative type to foster reproducibility: A first exercise in the area of language science and technology with REPROLANG2020. In Proceedings of the 12th Language Resources and Evaluation Conference, pages 5539–5545.

Ronan Cummins and Marek Rei. 2018. Neural multi-task learning in automated assessment. arXiv preprint arXiv:1801.06830.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

Fei Dong and Yue Zhang. 2016. Automatic features for essay scoring: An empirical study. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1072–1077.

Daniel Edmiston. 2020. A systematic analysis of morphological content in BERT models for multiple languages. arXiv preprint arXiv:2004.03032.

Council of Europe. 2002. Common European Framework of Reference for Languages: Learning, teaching, assessment. Structured overview of all CEFR scales.

Julia Hancke. 2013. Automatic prediction of CEFR proficiency levels based on linguistic features of learner language. Master's thesis, University of Tübingen.

Julia Hancke and Detmar Meurers. 2013. Exploring CEFR classification for German based on rich linguistic modeling. Pages 54–56.

Ganesh Jawahar, Benoît Sagot, and Djamé Seddah. 2019. What does BERT learn about the structure of language? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3651–3657.

Cancan Jin, Ben He, Kai Hui, and Le Sun. 2018. TDNN: A two-stage deep neural network for prompt-independent automated essay scoring. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1088–1097.

Zixuan Ke and Vincent Ng. 2019. Automated essay scoring: A survey of the state of the art. In Proceedings of the 28th International Joint Conference on Artificial Intelligence, pages 6300–6308. AAAI Press.

Beata Beigman Klebanov and Michael Flor. 2013. Word association profiles and their use for automated scoring of essays. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), volume 1, pages 1148–1158.

Beata Beigman Klebanov and Nitin Madnani. 2020. Automated evaluation of writing: 50 years and counting. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7796–7810, Online. Association for Computational Linguistics.

Yong-Won Lee, Claudia Gentile, and Robert Kantor. 2009. Toward automated multi-trait scoring of essays: Investigating links among holistic, analytic, and text feature scores. Applied Linguistics, 31(3):391–417.

Jiawei Liu, Yang Xu, and Lingzhe Zhao. 2019. Automated essay scoring based on two-stage learning. CoRR, abs/1901.07744.

Sandeep Mathias and Pushpak Bhattacharyya. 2020. Can neural networks automatically score essay traits? In Proceedings of the Fifteenth Workshop on Innovative Use of NLP for Building Educational Applications, pages 85–91, Seattle, WA, USA → Online. Association for Computational Linguistics.

Elijah Mayfield and Alan W Black. 2020. Should you fine-tune BERT for automated essay scoring? In Proceedings of the Fifteenth Workshop on Innovative Use of NLP for Building Educational Applications, pages 151–162, Seattle, WA, USA → Online. Association for Computational Linguistics.

Joakim Nivre, Marie-Catherine de Marneffe, Filip Ginter, Yoav Goldberg, Jan Hajic, Christopher D. Manning, Ryan T. McDonald, Slav Petrov, Sampo Pyysalo, Natalia Silveira, Reut Tsarfaty, and Daniel Zeman. 2016. Universal Dependencies v1: A Multilingual Treebank Collection. In
Proceedings ofthe Tenth International Conference on Language Re-sources and Evaluation (LREC 2016) , pages 1659–1666.F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel,B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer,R. Weiss, V. Dubourg, J. Vanderplas, A. Passos,D. Cournapeau, M. Brucher, M. Perrot, and E. Duch-esnay. 2011. Scikit-learn: Machine learning inPython.
Journal of Machine Learning Research ,12:2825–2830.Peter Phandi, Kian Ming A Chai, and Hwee Tou Ng.2015. Flexible domain adaptation for automated es-say scoring using correlated linear regression. In
Proceedings of the 2015 Conference on EmpiricalMethods in Natural Language Processing , pages431–439.Ildik´o Pil´an, David Alfter, and Elena Volodina. 2016.Coursebook texts as a helping hand for classifyinglinguistic complexity in language learners’ writings.
CL4LC 2016 , page 120.Eunjin Shin. 2018. A neural network approach toautomated essay scoring: A comparison with themethod of integrating deep language features usingcoh-metrix.Milan Straka, Jan Hajic, and Jana Strakov´a. 2016. UD-Pipe: Trainable Pipeline for Processing CoNLL-UFiles Performing Tokenization, Morphological Anal-ysis, POS Tagging and Parsing. In
LREC . Sowmya Vajjala and Kaidi L˜oo. 2014. Automatic cefrlevel prediction for estonian learner text. In
Pro-ceedings of the third workshop on NLP for computer-assisted language learning at SLTC 2014, UppsalaUniversity , 107. Link¨oping University ElectronicPress.Sowmya Vajjala and Taraka Rama. 2018. Experimentswith universal cefr classification. In
Proceedings ofthe Thirteenth Workshop on Innovative Use of NLPfor Building Educational Applications , pages 147–153, New Orleans, Louisiana. Association for Com-putational Linguistics.Torsten Zesch, Michael Wojatzki, and Dirk Scholten-Akoun. 2015. Task-independent features for auto-mated essay grading. In
BEA@ NAACL-HLT , pages224–232.
A Supplemental Material