Personalizing Grammatical Error Correction: Adaptation to Proficiency Level and L1
Maria Nădejde
Grammarly [email protected]
Joel Tetreault*
Dataminr [email protected]
Abstract
Grammar error correction (GEC) systems have become ubiquitous in a variety of software applications, and have started to approach human-level performance for some datasets. However, very little is known about how to efficiently personalize these systems to the user's characteristics, such as their proficiency level and first language, or to emerging domains of text. We present the first results on adapting a general-purpose neural GEC system to both the proficiency level and the first language of a writer, using only a few thousand annotated sentences. Our study is the broadest of its kind, covering five proficiency levels and twelve different languages, and comparing three different adaptation scenarios: adapting to the proficiency level only, to the first language only, or to both aspects simultaneously. We show that tailoring to both aspects achieves the largest performance improvement (3.6 F0.5) relative to a strong baseline.

* This research was conducted while the author was at Grammarly.

Introduction

Guides for English teachers have extensively documented how grammatical errors made by learners are influenced by their native language (L1). Swan and Smith (2001) attribute some of the errors to "transfer" or "interference" between languages. For example, German native speakers are more likely to incorrectly use a definite article with general purpose nouns, or to omit the indefinite article when defining people's professions. Other errors are attributed to the absence of a certain linguistic feature in the native language. For example, Chinese and Russian speakers make more errors involving articles, since these languages do not have articles.

A few grammatical error correction (GEC) systems have incorporated knowledge about L1. Rozovskaya and Roth (2011) use a different prior for each of five L1s to adapt a Naive Bayes classifier for preposition correction. Rozovskaya et al. (2017) expand on this work to eleven L1s and three error types. Mizumoto et al.
(2011) showed for the first time that a statistical machine translation (SMT) system applied to GEC performs better when the training and test data have the same L1. Chollampatt et al. (2016) extend this work by adapting a neural language model to three different L1s and using it as a feature in an SMT-based GEC system. However, we are not aware of prior work addressing the impact of both proficiency level and native language on the performance of GEC systems. Furthermore, neural GEC systems, which have become state-of-the-art (Gehring et al., 2017; Junczys-Dowmunt et al., 2018; Grundkiewicz and Junczys-Dowmunt, 2018), are general purpose and domain agnostic.

We believe the future of GEC lies in providing users with feedback that is personalized to their proficiency level and native language (L1). In this work, we present the first results on adapting a general-purpose neural GEC system for English to both of these characteristics by using fine-tuning, a transfer learning method for neural networks which has been extensively explored for domain adaptation of machine translation systems (Luong and Manning, 2015; Freitag and Al-Onaizan, 2016; Chu et al., 2017; Miceli Barone et al., 2017; Thompson et al., 2018). We show that a model adapted to both L1 and proficiency level outperforms models adapted to only one of these characteristics. Our contributions also include the first results on adapting GEC systems to proficiency levels, and the broadest study of adapting GEC to L1, covering twelve different languages.
Figure 1: Corpus Distributions for CEFR Level, L1 and L1-Level.
Data
In this work, we adapt a general-purpose neural GEC system, initially trained on two million sentences written by both native and non-native speakers and covering a variety of topics and styles. All the sentences have been corrected for grammatical errors by professional editors. To maintain anonymity, we do not include more details about this dataset.

Adaptation of the model to proficiency level and L1 requires a corpus annotated with these features. We use the Cambridge Learner Corpus (CLC) (Nicholls, 2003), comprising examination essays written by English learners with six proficiency levels and more than 100 different native languages. The CLC uses the levels defined by the Common European Framework of Reference for Languages: A1 (Beginner), A2 (Elementary), B1 (Intermediate), B2 (Upper intermediate), C1 (Advanced) and C2 (Proficiency). Each essay is corrected by one annotator, who also identifies the minimal error spans and labels them using about 80 error types. From this annotated corpus we extract a parallel corpus comprising source sentences with grammatical errors and the corresponding corrected target sentences.

We do note the proprietary nature of the CLC, which makes reproducibility difficult, though it has been used in prior research, such as Rei and Yannakoudakis (2016). It was necessary for this study, as the other available GEC corpora are not annotated for both L1 and level. The Lang-8 Learner Corpora (Mizumoto et al., 2011) also provide information about L1, but no information about proficiency levels. The FCE dataset (Yannakoudakis et al., 2011) is a subset of the CLC; however, it covers only one proficiency level and does not contain enough sentences per L1 for our experiments. Previous work on adapting GEC classifiers to L1 (Rozovskaya et al., 2017) used the FCE corpus, and thus did not address adaptation to different proficiency levels. One of our future goals is to create a public corpus for this type of work.
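The extraction of parallel sentences from span-annotated corrections can be sketched as follows. The (start, end, replacement) edit format is a hypothetical simplification for illustration only; the CLC's actual annotation scheme is richer (about 80 error types).

```python
def apply_edits(source, edits):
    """Reconstruct the corrected target sentence from a source sentence
    and a list of (start, end, replacement) character-span edits.

    This simplified edit format is hypothetical, for illustration; it is
    not the CLC's annotation scheme."""
    corrected = source
    # Apply spans right-to-left so earlier offsets remain valid.
    for start, end, replacement in sorted(edits, reverse=True):
        corrected = corrected[:start] + replacement + corrected[end:]
    return corrected
```

For example, `apply_edits("I has a cars .", [(2, 5, "have"), (8, 12, "car")])` yields the target sentence `"I have a car ."`, pairing each erroneous source with its correction.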
Experimental Setup
Our baseline neural GEC system is an RNN-based encoder-decoder neural network with attention and LSTM units (Bahdanau et al., 2015). The system takes as input an English sentence, which may contain grammatical errors, and decodes the corrected sentence. We train the system with the OpenNMT-py toolkit (Klein et al., 2018), using the hyper-parameters listed in the Appendix. To increase the coverage of the neural network's vocabulary without hurting efficiency, we break source and target words into sub-word units. The segmentation into sub-word units is learned from unlabeled data using the Byte Pair Encoding (BPE) algorithm (Sennrich et al., 2016). The vocabulary, consisting of 20,000 BPE sub-units, is shared between the encoder and decoder; although the source and target vocabularies are the same, the embeddings are not tied. We truncate sentences longer than 60 BPE sub-units and train the baseline system with early stopping on a development set sampled from the base dataset; performance did not improve after 15 epochs.

To train and evaluate the adapted models, we extract subsets of sentences from the CLC that have been written by learners with a particular Level, L1, or L1-Level combination. We consider all subsets having at least 11,000 sentences, such that we can allocate 8,000 sentences for training, 1,000 for tuning and 2,000 for testing. We compare adapted models trained and evaluated on the same subset of the data. For example, we adapt a model using the Chinese training data and then evaluate it on the Chinese test set.

Since our base dataset and the CLC come from different domains, we wanted to make sure that improvements from fine-tuning by Level or L1 were not due to simply being in-domain with the test data, which is also drawn from the CLC.
To control for this, we construct another baseline system ("Random") by adapting the general-purpose GEC system to a random sample of learner data drawn from the CLC. In Figure 1 we show the distribution of Level, L1 and L1-Level sentences in a random CLC sample, for the subsets having at least 100 sentences. B1 is the most frequent level, while A2, the lowest proficiency level included in this study, is half as frequent in the random sample. The L1 distribution is dominated by Spanish, with Chinese second with half as many sentences. Among the L1-Level subsets, Spanish-B2 is the most frequent, with Spanish-A2 covering half as many sentences.
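The sub-word segmentation used for the baseline vocabulary can be illustrated with a toy version of the BPE merge-learning loop (Sennrich et al., 2016). This is a minimal sketch of the algorithm's core idea only; it omits the end-of-word markers and other details of the subword-nmt tool used in practice.

```python
from collections import Counter

def learn_bpe_merges(words, num_merges):
    """Toy BPE: repeatedly merge the most frequent adjacent symbol pair."""
    vocab = Counter(tuple(word) for word in words)  # words as symbol tuples
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Re-segment every word with the new merge applied.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges, vocab
```

On a toy corpus of "low" (x5), "lower" (x2) and "newest" (x6), the first two merges learned are ('w', 'e') and ('l', 'o'), so frequent character sequences gradually become single vocabulary sub-units.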
Fine-tuning
We build adapted GEC models using fine-tuning, a transfer learning method for neural networks. We continue training the parameters of the general-purpose model on the "in-domain" subset of the data covering a particular Level, L1, or L1-Level. Thompson et al. (2018) showed that adapting only a single component of the encoder-decoder network is almost as effective as adapting the entire set of parameters. In this work, we fine-tune the parameters of the source embeddings and encoder, while keeping the other parameters fixed.

To avoid quickly over-fitting to the smaller "in-domain" training data, we reduce the batch size (Thompson et al., 2018) and continue using dropout regularization (Miceli Barone et al., 2017). We apply dropout to all the layers and to the source words, as well as variational dropout (Gal and Ghahramani, 2016) on each step, all with probability 0.1. We also reduce the learning rate by a factor of four and use the start_decay_at option, which halves the learning rate after each epoch. Consequently, the updates become small after a few epochs. To enable the comparison between different adaptation scenarios, all fine-tuned models are trained for 10 epochs on 8,000 sentences of "in-domain" data.
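The effect of the reduced learning rate and the per-epoch halving can be sketched as a simple schedule. The base rate of 1.0 and the decay trigger used below are illustrative assumptions, not the paper's appendix hyper-parameters.

```python
def finetune_lr_schedule(base_lr, num_epochs, start_decay_at):
    """Per-epoch learning rates during fine-tuning: start from the base
    rate reduced by a factor of four, then halve after every epoch once
    `start_decay_at` is reached (mirroring OpenNMT-py's start_decay_at
    behaviour). base_lr=1.0 below is an assumption for illustration."""
    lr = base_lr / 4.0
    rates = []
    for epoch in range(1, num_epochs + 1):
        rates.append(lr)
        if epoch >= start_decay_at:
            lr /= 2.0  # halved after each epoch
    return rates

# finetune_lr_schedule(1.0, 10, 1)[:3] -> [0.25, 0.125, 0.0625]
```

By the tenth epoch the rate has dropped below 0.001 of the base rate, which is why the updates become small after a few epochs.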
Results

We report results for the three adaptation scenarios: adapting to Level only, adapting to L1 only, and adapting to both L1 and Level. We summarize the results by showing the average M2 F0.5 score (Dahlmeier and Ng, 2012) across all the test sets included in the respective scenario.

We first note that the strong baseline ("Random"), a model adapted to a random sample of the CLC, achieves improvements of between 11 and 13 F0.5 points on average across all scenarios. While not the focus of the paper, this large improvement shows the performance gains obtained by simply adapting to a new domain (in this case, CLC data). Second, we note that the models adapted only by Level or by L1 are on average better than the "Random" model, by 2.1 and 2.3 F0.5 points respectively. Finally, the models adapted to both Level and L1 outperform all others, beating the "Random" baseline by 3.6 F0.5 points on average.

On all adaptation scenarios we also report the performance of the single best model released by Junczys-Dowmunt et al. (2018). Their model, which we call JD-single, was trained on English learner data of comparable size to our base dataset and optimized using the CoNLL14 training and test data.
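The F0.5 metric reported throughout weighs precision twice as heavily as recall. A minimal implementation of the general F_beta formula used by the M2 scorer:

```python
def f_beta(precision, recall, beta=0.5):
    """F_beta score; beta=0.5 (F0.5) weighs precision twice as much as
    recall, as in the M2 scorer (Dahlmeier and Ng, 2012)."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# The "Random" row of Table 3 (P=61.9, R=35.6) gives roughly 53.9,
# matching the reported 54.0 F0.5 up to rounding of P and R.
```

Note that the M2 scorer computes precision and recall from phrase-level edit counts against gold annotations; only the final combination step is shown here.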
Adaptation by Proficiency Level
We adapt GEC models to five of the CEFR proficiency levels: A2, B1, B2, C1 and C2. The results in Table 1 show that performance improves for all levels compared to the "Random" baseline. The largest improvement, 5.2 F0.5 points, is achieved for A2, the lowest proficiency level. We attribute the large improvement to this level having a higher error rate and a lower lexical diversity, and to it being less represented in the random sample on which the baseline is trained. In contrast, for the B1 and B2 levels, the most frequent in the random sample, improvements are more modest: 0.7 and 0.2 F0.5 points respectively. Our adapted models are better than the JD-single model on all levels, and by a large margin on the A2 and C1 levels.
Adapt      A2    B1    B2    C1    C2    Avg.
No         30.4  34.9  33.1  32.5  33.0  32.8
Random     48.4  47.9  42.5  41.4  39.2  43.8
Level      53.6  48.6  42.7  –     –     45.9
JD-single  44.1  47.1  41.7  37.8  35.0  41.1

Table 1: Adaptation to Proficiency Level, in F0.5.

Adaptation by L1
We adapt GEC models to twelve L1s: Arabic, Chinese, French, German, Greek, Italian, Polish, Portuguese, Russian, Spanish, Swiss-German and Turkish. The results in Table 2 (top) show that all L1-adapted models are better than the baseline, with improvements ranging from 1.2 F0.5 for Chinese and French up to 3.6 F0.5 for Turkish. For the languages that are less frequent in the random sample of the CLC (Greek, Turkish, Arabic, Polish and Russian) we see consistent improvements of over 2 F0.5 points. Our adapted models are better than the JD-single model on all L1s, and by a margin larger than 5 F0.5 points on German, Swiss-German, Italian, Greek and Spanish.

Adapt      AR    CN    FR    DE    GR    IT    PL    PT    RU    ES    CH    TR    Avg.
No         37.5  36.2  32.7  31.4  32.7  29.3  36.0  31.7  35.8  32.1  31.1  35.4  33.5
Random     46.3  45.0  44.9  44.7  46.4  44.9  46.2  45.2  45.3  47.6  44.2  47.0  45.6
L1         –     46.2  46.1  –     –     –     –     –     –     –     –     50.6  47.9
JD-single  47.0  44.7  44.2  41.4  44.1  40.7  46.0  44.6  43.7  44.8  40.7  47.5  44.1

Adapt       CN-B2  CN-C1  FR-B1  DE-B1  IT-B1  PT-B1  ES-A2  ES-B1  ES-B2  Avg.
No          36.1   32.5   31.8   31.2   28.1   31.4   28.9   31.9   33.7   31.8
Random      42.7   39.1   45.3   46.1   43.5   45.2   50.2   46.4   44.1   44.7
Level       43.4   41.0   46.5   46.9   45.3   46.1   56.6   47.5   43.7   46.3
L1          44.1   40.9   46.5   48.1   46.5   46.2   53.8   47.6   44.4   46.5
L1 & Level  45.4   43.2   –      –      –      –      58.2   –      –      48.3
JD-single   43.0   35.8   46.9   43.8   41.6   46.7   43.4   45.0   41.0   43.0

Table 2: Top: Adaptation to L1 Only. Bottom: Adaptation to Level and L1. Evaluation metric: F0.5.

Adaptation by L1 and Proficiency Level
Finally, we adapt GEC models to the following nine L1-Level subsets: Chinese-B2, Chinese-C1, French-B1, German-B1, Italian-B1, Portuguese-B1, Spanish-A2, Spanish-B1 and Spanish-B2. We include these subsets in our study because they meet the requirement of having at least 8,000 sentences for training. All the models adapted to both Level and L1 outperform the models adapted to only one of these features, as shown in Table 2 (bottom). Focusing on the two levels for Chinese native speakers, we see that the model adapted to C1 achieves a larger improvement over the baseline, 4.1 F0.5 points, compared to 2.7 F0.5 points for the B2 level. Again, this is explained by the lower frequency of the C1 level in the random sample of the CLC, which is also reflected in the lowest F0.5 score for the baseline model. Similarly, among the models adapted to different levels of Spanish native speakers, the one adapted to Spanish-A2 achieves the largest gains, of 8 F0.5 points. The Spanish-A2 test set has the highest number of errors per 100 words among all the L1-Level test sets, as shown in Table 1 in the Appendix. Furthermore, the A2 level is only half as frequent as the B1 level in the random sample of the CLC. Finally, our adapted models are better than the JD-single model on all L1-Level subsets, with a margin of 5 F0.5 points on average.

Adapted       P     R     F0.5
Random        61.9  35.6  54.0
CN-C1         61.1  37.0  54.1
CN-B2         62.4  37.5  55.1
+ spellcheck  63.6  40.3  57.0
JD-single     59.1  40.4  54.1
JD-ensemble   63.1  42.6  57.5

Table 3: Results on the CoNLL14 test set for Chinese models.
CoNLL14 Evaluation
We compare our adapted models on the CoNLL14 test set (Ng et al., 2014) in Table 3. The model adapted to Chinese-B2 improves the most over the baseline, achieving 55.1 F0.5. This result aligns with how the test set was constructed: it consists of essays written by university students, mostly Chinese native speakers. When we pre-process the evaluation set before decoding with a commercial spellchecker (details removed for anonymity), our adapted model scores 57.0, which places it near other leading models trained on a similar amount of data, such as Chollampatt and Ng (2018) (56.52) and Junczys-Dowmunt et al. (2018) (57.53), even though we do not use the CoNLL14 in-domain training data. We call their ensemble of four models with language model re-scoring JD-ensemble, and their single best model without language model re-scoring JD-single. We note that the most recent state-of-the-art models (Zhao et al., 2019; Grundkiewicz et al., 2019) are trained on up to one hundred million additional synthetic parallel sentences, while we adapt models with only eight thousand parallel sentences.

Adapt   Det   Prep   Verb   Tense  NNum  Noun   Pron
CN-C1   3.53  5.90   2.99   1.77   8.28  8.02   22.78
FR-B1   2.34  1.99   12.54  5.16   9.16  3.48   1.13
DE-B1   8.85  1.77   2.04   2.37   3.86  7.18   22.75
IT-B1   2.37  5.32   12.48  6.74   4.40  3.29   8.99
ES-A2   6.06  12.52  7.51   8.54   8.73  12.39  10.57

Table 4: L1-Level breakdown by error type: relative improvements in F0.5 over the "Random" baseline.

Error-type Analysis
We conclude our study by reporting improvements on the most frequent error types, excluding punctuation, spelling and orthography errors. We identify the error types in each evaluation set with Errant, a rule-based classifier (Bryant et al., 2017). Table 4 shows the results for the systems adapted to both L1 and Level that improved the most in overall F0.5. The adapted systems consistently outperform the "Random" baseline on most error types. For Chinese-C1, the adapted model achieves the largest gains on pronoun (Pron) and noun number agreement (NNum) errors. The Spanish-A2 adapted model achieves notable gains on preposition (Prep), noun and pronoun errors. Both the French-B1 and Italian-B1 adapted models gain the most on verb errors. For German-B1, the adapted model improves the most on pronoun (Pron) and determiner (Det) errors. The large improvement of 22.75 F0.5 points for the pronoun category is in part an artefact of the small error counts: the adapted model corrects 35 pronouns (P=67.3), while the baseline corrects only 15 pronouns (P=46.9). We leave an in-depth analysis by error type to future work.

Below, we give an example of a confused auxiliary verb that the French-B1 adapted model corrects. The verb phrase corresponding to "go shopping" in French is "faire des achats", where the verb "faire" would translate to "make/do".

Orig:  He told me that celebrity can be bad because he can't do shopping normally.
Rand:  He told me that the celebrity can be bad because he can't do shopping normally.
FR-B1: He told me that celebrity can be bad because he can't go shopping normally.
Ref:   He told me that celebrity can be bad because he can't go shopping normally.

Conclusion

We present the first results on adapting a neural GEC system to the proficiency level and L1 of language learners. This is the broadest study of its kind, covering five proficiency levels and twelve different languages. Models adapted to either proficiency level or L1 are on average better than the baseline by over 2 F0.5 points, while the largest improvement (3.6 F0.5) is achieved when adapting to both characteristics simultaneously. We envision building a single model that combines knowledge across L1s and proficiency levels using a mixture-of-experts approach. Adapted models could also be improved by using the mixed fine-tuning approach, which uses a mix of in-domain and out-of-domain data (Chu et al., 2017).

Acknowledgements
The authors would like to thank the anonymous reviewers for their feedback. We are also grateful to our colleagues for their assistance and insights: Dimitrios Alikaniotis, Claudia Leacock, Dmitry Lider, Courtney Napoles, Jimmy Nguyen, Vipul Raheja and Igor Tytyk.
References
Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In Proceedings of the International Conference on Learning Representations (ICLR).

Christopher Bryant, Mariano Felice, and Ted Briscoe. 2017. Automatic annotation and evaluation of error types for grammatical error correction. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 793–805, Vancouver, Canada. Association for Computational Linguistics.

Shamil Chollampatt, Duc Tam Hoang, and Hwee Tou Ng. 2016. Adapting grammatical error correction based on the native language of writers with neural network joint models. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1901–1911. Association for Computational Linguistics.

Shamil Chollampatt and Hwee Tou Ng. 2018. Neural quality estimation of grammatical error correction. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2528–2539, Brussels, Belgium. Association for Computational Linguistics.

Chenhui Chu, Raj Dabre, and Sadao Kurohashi. 2017. An empirical comparison of domain adaptation methods for neural machine translation. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 385–391. Association for Computational Linguistics.

Daniel Dahlmeier and Hwee Tou Ng. 2012. Better evaluation for grammatical error correction. In Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 568–572, Montréal, Canada. Association for Computational Linguistics.

Markus Freitag and Yaser Al-Onaizan. 2016. Fast domain adaptation for neural machine translation. CoRR, abs/1612.06897.

Yarin Gal and Zoubin Ghahramani. 2016. A theoretically grounded application of dropout in recurrent neural networks. In Proceedings of the 30th International Conference on Neural Information Processing Systems, NIPS'16, pages 1027–1035, USA. Curran Associates Inc.

Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, and Yann N. Dauphin. 2017. Convolutional sequence to sequence learning. In Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 1243–1252, International Convention Centre, Sydney, Australia. PMLR.

Roman Grundkiewicz and Marcin Junczys-Dowmunt. 2018. Near human-level performance in grammatical error correction with hybrid machine translation. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 284–290. Association for Computational Linguistics.

Roman Grundkiewicz, Marcin Junczys-Dowmunt, and Kenneth Heafield. 2019. Neural grammatical error correction systems with unsupervised pre-training on synthetic data. In Proceedings of the Fourteenth Workshop on Innovative Use of NLP for Building Educational Applications, pages 252–263, Florence, Italy. Association for Computational Linguistics.

Marcin Junczys-Dowmunt, Roman Grundkiewicz, Shubha Guha, and Kenneth Heafield. 2018. Approaching neural grammatical error correction as a low-resource machine translation task. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 595–606. Association for Computational Linguistics.

Guillaume Klein, Yoon Kim, Yuntian Deng, Vincent Nguyen, Jean Senellart, and Alexander Rush. 2018. OpenNMT: Neural machine translation toolkit. In Proceedings of the 13th Conference of the Association for Machine Translation in the Americas (Volume 1: Research Papers), pages 177–184. Association for Machine Translation in the Americas.

Minh-Thang Luong and Christopher D. Manning. 2015. Stanford neural machine translation systems for spoken language domain. In International Workshop on Spoken Language Translation.

Antonio Valerio Miceli Barone, Barry Haddow, Ulrich Germann, and Rico Sennrich. 2017. Regularization techniques for fine-tuning in neural machine translation. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 1489–1494. Association for Computational Linguistics.

Tomoya Mizumoto, Mamoru Komachi, Masaaki Nagata, and Yuji Matsumoto. 2011. Mining revision log of language learning SNS for automated Japanese error correction of second language learners. In Proceedings of 5th International Joint Conference on Natural Language Processing, pages 147–155. Asian Federation of Natural Language Processing.

Hwee Tou Ng, Siew Mei Wu, Ted Briscoe, Christian Hadiwinoto, Raymond Hendy Susanto, and Christopher Bryant. 2014. The CoNLL-2014 shared task on grammatical error correction. In Proceedings of the Eighteenth Conference on Computational Natural Language Learning: Shared Task, pages 1–14, Baltimore, Maryland. Association for Computational Linguistics.

Diane Nicholls. 2003. The Cambridge Learner Corpus: Error coding and analysis for lexicography and ELT. In Proceedings of the Corpus Linguistics 2003 conference, volume 16, pages 572–581.

Marek Rei and Helen Yannakoudakis. 2016. Compositional sequence labeling models for error detection in learner writing. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1181–1191, Berlin, Germany. Association for Computational Linguistics.

Alla Rozovskaya and Dan Roth. 2011. Algorithm selection and model adaptation for ESL correction tasks. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1, HLT '11, pages 924–933, Stroudsburg, PA, USA. Association for Computational Linguistics.

Alla Rozovskaya, Dan Roth, and Mark Sammons. 2017. Adapting to learner errors with minimal supervision. Computational Linguistics, 43(4):723–760.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Berlin, Germany. Association for Computational Linguistics.

Michael Swan and Bernard Smith. 2001. Learner English: A Teacher's Guide to Interference and Other Problems. Second Edition. Cambridge University Press.

Brian Thompson, Huda Khayrallah, Antonios Anastasopoulos, Arya D. McCarthy, Kevin Duh, Rebecca Marvin, Paul McNamee, Jeremy Gwinnup, Tim Anderson, and Philipp Koehn. 2018. Freezing subnetworks to analyze domain adaptation in neural machine translation. In Proceedings of the Third Conference on Machine Translation, Volume 1: Research Papers, pages 124–132, Brussels, Belgium. Association for Computational Linguistics.

Helen Yannakoudakis, Ted Briscoe, and Ben Medlock. 2011. A new dataset and method for automatically grading ESOL texts. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 180–189, Portland, Oregon, USA. Association for Computational Linguistics.

Wei Zhao, Liang Wang, Kewei Shen, Ruoyu Jia, and Jingming Liu. 2019. Improving grammatical error correction via pre-training a copy-augmented architecture with unlabeled data. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies.