Wronging a Right: Generating Better Errors to Improve Grammatical Error Detection
Sudhanshu Kasewa, Pontus Stenetorp and Sebastian Riedel
University College London
{sudhanshu.kasewa.16, p.stenetorp, s.riedel}@ucl.ac.uk

Abstract
Grammatical error correction, like other machine learning tasks, greatly benefits from large quantities of high-quality training data, which is typically expensive to produce. While writing a program to automatically generate realistic grammatical errors would be difficult, one could learn the distribution of naturally-occurring errors and attempt to introduce them into other datasets. Initial work on inducing errors in this way using statistical machine translation has shown promise; we investigate cheaply constructing synthetic samples, given a small corpus of human-annotated data, using an off-the-rack attentive sequence-to-sequence model and a straightforward post-processing procedure. Our approach yields error-filled artificial data that helps a vanilla bi-directional LSTM to outperform the previous state of the art at grammatical error detection, and a previously introduced model to gain further improvements of over 5% F0.5 score. When attempting to determine whether a given sentence is synthetic, a human annotator at best achieves an F1 score of 39.39, indicating that our model generates mostly human-like instances.

Introduction

There is an ever-growing number of people learning English as a second language; providing them with quick feedback to facilitate their learning is a crucial, labour-intensive endeavour. Part of this process is identifying and correcting grammatical errors, and several computational techniques have been developed to automate it (Rozovskaya and Roth, 2014; Junczys-Dowmunt and Grundkiewicz, 2016). For example, given an erroneous sentence "I wanted to goes to the beach", the grammatical error correction task is to output the valid sentence "I wanted to go to the beach".
The task can be cast as a two-stage process, detection and correction, which can either be performed sequentially (Yannakoudakis et al., 2017) or jointly (Napoles and Callison-Burch, 2017). Automated error correction performance is arguably still too low for practical consideration, perhaps limited by the amount of training data (Rei et al., 2017). High-quality annotations are expensive to procure, and foreign language learners and commercial entities may feel uncomfortable granting access to their data. Instead, one could attempt to supplement existing manual annotations with synthetic instances. Such artificial samples are beneficial only when they share structure with the true distribution from which human errors are generated. Generative Adversarial Networks (Goodfellow et al., 2014) could be used for this purpose, but they are difficult to train, and require a large collection of sentences that are incorrect. One might attempt self-training (McClosky et al., 2006), where new instances are generated by applying a trained model to unannotated data, using high-confidence predictions as ground-truth labels. However, in such a scheme, the expectation is that the unlabelled text already contains errors, which is not usually the case for most freely available text such as Wikipedia articles, as they strive towards correctness.

In place of using machine translation (MT) to correct grammatical mistakes (Yuan and Felice, 2013; Junczys-Dowmunt and Grundkiewicz, 2014; Yuan and Briscoe, 2016), one might consider swapping the input and output streams, and instead learn to induce errors into error-free text, for the purpose of creating a synthetic training dataset (Felice and Yuan, 2014). Recently, Rei et al.
(2017) used a statistical MT (SMT) system to induce errors into error-free text. Building on this work, and leveraging recent advances in neural MT (NMT), we used an off-the-shelf attentive sequence-to-sequence model (Britz et al., 2017), eliminating the need for specialised software such as a phrase-table generator, decoder, and part-of-speech tagger. We created multiple synthetic datasets from in-domain and out-of-domain sources, and found that stochastic token sampling, and pruning redundant and low-likelihood sentences, were helpful in generating meaningful corruptions. Using the artificial samples thus generated, we improved upon detection results with just a vanilla bi-directional LSTM (Hochreiter and Schmidhuber, 1997). Using a more powerful model, we established new state-of-the-art results that improve on previously published F0.5 scores by over 5%. Additionally, we confirm that our generated instances are human-like, as an annotator identifying generated sentences achieved a maximum F1 score of 39.39.

Related Work

In computer vision, images are blurred, rotated, or otherwise deformed inexpensively to create new training instances (Wang and Perez, 2017), because such manipulation does not significantly alter the image semantics. Similar coarse processes do not work in NLP, since mutating even a single letter or word can change a sentence's meaning, or render it nonsensical. Nonetheless, Vinyals et al. (2015) employed a kind of self-training where they use noisy predictions for unlabelled instances, output by existing state-of-the-art parsers, as ground-truth labels, and improved syntactic parsing performance. Sennrich et al. (2016) synthesised training instances by round-trip-translating a monolingual corpus with weaker versions of an NMT learner, and used them to improve the translation. Bouchard et al.
(2016) developed an efficient algorithm to blend generated and true data for improving generalisation.

Grammar correction is a well-studied task in NLP, and early systems were rule-based pattern recognisers (Macdonald, 1983) and dictionary-based linguistic analysis engines (Richardson and Braden-Harder, 1988). Later systems used statistical approaches, addressing specific kinds of errors such as article insertion (Knight et al., 1994) and spelling correction (Golding and Roth, 1996). Most recently, architectural innovations in neural sequence labelling (Rei et al., 2016; Rei, 2017) raised error detection performance through an improved ability to process unknown words and jointly learning a language model.

Early efforts in artificial error generation included generating specific types of errors, such as mass noun errors (Brockett et al., 2006) and article errors (Rozovskaya and Roth, 2010), and leveraging linguistic information to identify error patterns and transfer them onto grammatically correct text (Foster and Andersen, 2009; Yuan and Felice, 2013). Imamura et al. (2012) investigated methods to generate pseudo-erroneous sentences for error correction in Japanese. Recently, Rei et al. (2017) corrupted error-free text using SMT to create training instances for error detection.
Method

To learn to introduce errors, we use an off-the-shelf attentive sequence-to-sequence neural network (Bahdanau et al., 2014). Given an input sequence, the encoder generates context vectors for each token. Then, the attention mechanism and the decoder work in tandem to emit a distribution over the target vocabulary. At every decoder time-step, the encoder context vectors are scored by the attention mechanism, and a weighted sum is supplied to the decoder, along with its propagated internal state and last output symbol.
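The decoder step described above can be sketched schematically in numpy, assuming Bahdanau-style additive attention; the weight names (W_ctx, W_dec, v) are illustrative, not taken from the paper or its codebase.

```python
import numpy as np

def decoder_step(contexts, dec_state, W_ctx, W_dec, v):
    """One schematic attention step: score each encoder context vector
    against the current decoder state, softmax the scores, and return
    the weighted sum supplied to the decoder, plus the attention weights.

    contexts: (src_len, d) encoder context vectors; dec_state: (d,).
    """
    # additive (Bahdanau-style) scoring of every source position
    scores = np.tanh(contexts @ W_ctx + dec_state @ W_dec) @ v   # (src_len,)
    weights = np.exp(scores - scores.max())
    weights = weights / weights.sum()                            # softmax over source
    return weights @ contexts, weights                           # (d,), (src_len,)
```

The returned summary vector is what the decoder consumes, together with its propagated internal state and last output symbol, to produce the next distribution over the target vocabulary.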
Corruption:
Tokens from this distribution are sampled at every decoder time-step, either by argmax (AM), which emits the most likely word, or by a stochastic alternative such as temperature sampling (TS), as argmax cannot be relied on to generate rare words. A temperature parameter τ > 0 sharpens or softens the distribution:

    p̃_i = f_τ(p)_i = p_i^(1/τ) / Σ_j p_j^(1/τ)

where the i are the components of the probability distribution corresponding to words in the vocabulary. As one interpolates τ from 0 to 1, the behaviour of p̃ transitions from argmax to p, controlling the diversity of the generated tokens.

The sentence generated by TS might be a low-probability sequence from the joint conditional distribution P(v|u), where u is the input sentence and v is the output sentence. One way around this is to use beam search (BS), which checks the likelihood of every possible continuation of a sentence fragment, and maintains a list of the n best translations generated up to the current time-step. AM, TS, and BS are indicative of the trade-off between increasing levels of model flexibility at the cost of computation; we compare them to assess whether the additional computations were helpful in creating high-quality synthetic instances.

Original                                 Corruption
She promised to turn over a new leaf.    She promissed to turn over a new leaf.
At the moment I'm in Spain.              During the moment I'm in Spain.

Table 1: Example sentences generated by our NMT pipeline.

Data augmentation strategy                    Model    FCE (dev)  FCE   CoNLL1  CoNLL2
Rei et al. (2017) FCE_PAT + EVP_PAT           SL       –          47.8  19.5    28.5
Rei et al. (2017) FCE_SMT + EVP_SMT           SL       –          48.4  19.7    28.4
Rei et al. (2017) FCE_SMT+PAT + EVP_SMT+PAT   SL       –          49.1  21.9    30.1
None                                          BiLSTM   47.9       43.6  16.6    24.3
FCE_TS                                        BiLSTM   51.2       47.1  19.7    28.9
EVP_BS                                        BiLSTM   52.1       50.1  20.8    29.0
SW_TS                                         BiLSTM   51.5       50.6  24.2    31.7
FCE_AM+TS + EVP_AM+TS                         BiLSTM   52.3       50.4  22.1    30.8
None                                          SL       52.5       48.2  17.4    25.5
FCE_TS                                        SL       54.8       49.9  20.9    29.2
EVP_BS                                        SL       55.2       54.6  23.3    31.4
SW_TS                                         SL       53.8       52.7  26.8    34.3
FCE_AM+TS + EVP_AM+TS                         SL
FCE_AM+TS + EVP_AM+TS + SW_AM+TS              SL       56.5

Table 2: F0.5 scores on various tests contrasted with published results and unaugmented baseline models.

Post-processing:
Original and corrupted sentences are aligned at the word level using Levenshtein distance. Using the minimal alignment, words in the corrupted sentence are labelled correct, 'c', or incorrect, 'i', as follows:

- If the word is not aligned with itself, label it 'i'.
- Else, if it follows a gap, label it 'i', as at this point a human reader would notice that a word is missing from the sentence.
- Else, if it is the last word but is not aligned to the last word of the source sentence, label it 'i', as a human would realise that the sentence ends abruptly.
- Else, label it 'c'.

These token-labelled corrupted sentences now form an artificial dataset for training an error detector. Duplicate instances and corrupted sentences with more than 5 errors were dropped to remove noise from the downstream training.
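The two stages above, stochastic generation and alignment-based labelling, can be sketched as follows. This is a minimal illustration rather than our actual implementation: the function names are ours, and Python's difflib longest-match alignment stands in for the Levenshtein alignment described above.

```python
import difflib
import numpy as np

def temperature_sample(probs, tau, rng):
    """Sample a token id; tau -> 0 approaches argmax, tau = 1 recovers probs."""
    scaled = np.power(probs, 1.0 / tau)
    scaled /= scaled.sum()
    return rng.choice(len(probs), p=scaled)

def label_corruption(original, corrupted):
    """Label each token of the corrupted sentence 'c' (correct) or 'i' (incorrect)."""
    matcher = difflib.SequenceMatcher(a=original, b=corrupted, autojunk=False)
    align, after_gap = {}, set()
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":                           # tokens aligned with themselves
            align.update(zip(range(j1, j2), range(i1, i2)))
        elif tag == "delete" and j1 < len(corrupted):
            after_gap.add(j1)                        # token right after a missing word
    labels = ["c" if j in align and j not in after_gap else "i"
              for j in range(len(corrupted))]
    # the last word must also align to the last word of the source sentence
    if corrupted and align.get(len(corrupted) - 1) != len(original) - 1:
        labels[-1] = "i"
    return labels
```

For instance, labelling "I wanted to goes to the beach" against the source "I wanted to go to the beach" marks only "goes" as 'i'.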
Experiments

We evaluated our approach on the First Certificate in English (FCE) error detection dataset (Rei and Yannakoudakis, 2016), as well as on two human-annotated test sets (CoNLL1, CoNLL2) from the CoNLL 2014 shared task (Ng et al., 2014). The CoNLL data sets pose a unique challenge: as they are different in style and domain from FCE, we have no matching training data. We compared the effect of different neural generation procedures (AM, TS, BS) and contrasted the downstream performance of a bidirectional LSTM with an elaborate sequence labeller.
Error generation: We minimally modified the open-source implementation of Britz et al. (2017) (https://github.com/google/seq2seq) to implement TS and BS; our code is available at https://github.com/skasewa/wronging. We trained our NMT with a single-layered encoder and decoder with cell size 256, on the parallel corpus version of FCE (Yannakoudakis et al., 2011), with early stopping after the FCE development set score dropped consistently for 20 epochs. We introduced errors into three datasets: FCE itself (450K tokens), the English Vocabulary Profile or EVP (270K tokens), and a subset of Simple Wikipedia or SW (8.4M tokens); of these, FCE and EVP were both used in artificial error generation via SMT and pattern extraction (PAT) by
Rei et al. (2017), enabling us to make a fair experimental comparison. Ten corrupted versions using each of AM, TS (τ = 0. ), and BS were sampled for the FCE and EVP corruptions, while one sufficed for SW. The theoretical time complexity of BS is O(bn) for each sentence, where b is the number of candidates and n is the maximum length of a sentence. Empirically, BS with b = 11 took a factor of 11.3 more time than AM. Examples of generated errors are provided in Table 1.

Figure 1: Improvements using three different methods of generation.

Error detection:
We compare two error detection models: a vanilla bi-directional LSTM (BiLSTM) (Schuster and Paliwal, 1997), and the state-of-the-art sequence labeller (SL) neural network used by Rei et al. (2017). These models were trained on the binary-labelled FCE training set augmented with the corrupted instances. Wherever no model is explicitly stated, the SL model was used. During training, we alternate between the annotated FCE dataset and the synthetic collection. This alternating protocol prevents overfitting on FCE; once training shifts back, the model reinforces connections made from the helpful synthetic corruptions while forgetting the noisy ones.
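The alternating protocol can be sketched as below. The text does not specify the granularity of the alternation, so this sketch assumes whole passes over each collection; Recorder and alternating_epochs are illustrative names, not from our codebase.

```python
class Recorder:
    """Stand-in for a sequence labeller; records the batches it is trained on."""
    def __init__(self):
        self.seen = []

    def train_step(self, batch):
        self.seen.append(batch)

def alternating_epochs(model, annotated, synthetic, rounds):
    """Alternate full passes over the synthetic and annotated collections.

    Shifting back to the annotated data each round reinforces what was
    learned from helpful corruptions, while noisy ones are forgotten.
    """
    for _ in range(rounds):
        for batch in synthetic:
            model.train_step(batch)
        for batch in annotated:
            model.train_step(batch)
```

With annotated batches ["fce1", "fce2"], synthetic batch ["syn1"] and rounds=2, the model is trained in the order syn1, fce1, fce2, syn1, fce1, fce2.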
Results

The results for our baselines and data augmentation strategies can be found in Table 2. Augmented with our NMT-generated data, even our vanilla downstream BiLSTM outperforms the SMT+PAT artificial error augmentation approach of Rei et al. (2017), indicating that our process better generalises the error information in the source dataset. Using the more powerful SL network bests the previous state of the art by over 5% on the FCE
test. Most intriguingly, we note a significant improvement for the CoNLL tests using corruptions from out-of-domain SW. Figure 2 illustrates how we gain performance on these tests with increasing amounts of corrupted SW, which does not hold true for corrupted FCE. This shows that we were able to induce useful errors into a corpus with a large unseen vocabulary and different syntactic biases, and this in turn proved valuable for detecting errors in a third domain, suggesting that our method can transfer learned distributions across stylistic genres.

Figure 2: Training with increasing amounts of corrupted data from FCE and SW.

Using EVP as a standard source, Figure 1 illustrates the variance of the different sampling methods. All generation methods yield corruptions that significantly improve test performance, with instances sampled by beam search consistently outperforming the alternatives.
Error Analysis

The original FCE dataset was annotated using the error taxonomy specified in Nicholls (2003), and contains 75 unique error codes. We annotated samples of EVP corrupted by all three sampling methods, at a reduced resolution, to compare the distribution of errors across FCE and the synthetic corpora. These are presented in Table 3.

At a high level, NMT generates errors more often among more common parts of speech, favouring errors in verbs and nouns rather than in adverbs and conjunctions. It did not make spelling errors as often as in the source dataset; this is likely because it only observed the specific spelling errors present in FCE, and as the vocabulary is restricted to that dataset, it does not encounter those words as frequently in EVP, and thus rarely makes the same spelling mistakes.

Spelling           FCE   AM   TS   BS
Spelling errors    11    1    1    4

Part-of-speech     FCE   AM   TS   BS
Verb               34    16   26   16
Preposition        18    16   10   14
Determiner         16    7    6    10
Noun               13    36   35   43
Pronoun            7     3    3    1
Adverb             5     5    3    5
Adjective          3     15   16   12
Conjunction        2     2    2    1
Quantifier         1     0    0    0

Remedy Type        FCE   AM   TS   BS
Replacement        49    35   34   32
Inclusion          23    30   27   35
Removal            14    33   36   32
Word form          9     2    2    1
Word order         5     0    0    0

Table 3: Error distribution across FCE and manually annotated samples of artificial data. Spelling errors are a percentage of all errors, while Part-of-speech and Remedy Type are compared within their own categories to sum to 100%.

Additionally, the differences in these distributions can partially be attributed to the implicit differences between us and the annotators of FCE.
Human Evaluation

To check whether the synthetic instances passed for human-like, we mixed 50 generated sentences among an equal number of actual ungrammatical instances from FCE-dev and tasked a human evaluator with identifying the artificial statements, in a simple Turing-style test. We created three such sets, one for each of our sampling techniques, and the test subject aimed to identify synthetic samples with high confidence. Results of this test are presented in Table 4.

           AM      TS      BS
Precision  81.25   63.63   50.00
Recall     26.00   28.00   14.00
F1         39.39   38.89   22.22

Table 4: Results of a Turing-style test, where a subject was asked to distinguish between real and fake sentences, sampled from each of the different generated corpora.

The high precision but low recall scores suggest that while it is still possible to spot some corruptions that are quite clearly artificial, the bulk of our samples do not betray their synthetic nature and are indistinguishable from naturally occurring erroneous sentences. In order to fairly compare our work with earlier results, we intended to conduct such a test for sentences generated by the SMT of Rei et al. (2017). Unfortunately, we were only able to source corruptions of FCE-train via this method; therefore, we decided not to perform this test, as its results could not be compared to ours.
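Each F1 value in Table 4 is the harmonic mean of the precision and recall above it; a quick worked check (the helper name here is ours):

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall, in percentage points."""
    return 2 * precision * recall / (precision + recall)

print(round(f1(81.25, 26.00), 2))  # AM column: 39.39
```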
Conclusion

We presented a novel data augmentation technique for grammatical error detection, using neural machine translation to learn the distribution of language-learner errors and induce such errors into grammatically correct text. We explored several different variants of sampling to improve the quality of our synthetic errors. After creating artificial training instances with an off-the-shelf NMT, we bettered previous state-of-the-art results on the canonical test with even a basic BiLSTM, and established a new state of the art using a stronger model. Additionally, we demonstrated that we were able to leverage corruptions of an out-of-domain dataset to set new benchmarks on separate, also out-of-domain, tests without specifically optimising for either.

Our work indicates that neural error generation warrants further investigation with different datasets and architectures, both for error detection and error correction. Among possible future work is using generative adversarial networks as corruption engines, and developing better sequence alignment methods. Some preliminary results with simple corruptions using word substitution and word dropout (Iyyer et al., 2015) appear to be promising, and may feature as components of a future corruption system. Finally, one could use such artificial error-prone corpora as source text for self-training an error detection system.

Acknowledgements
We thank Marek Rei and Mariano Felice for granting access to their data and code. We would also like to thank the anonymous reviewers and Johannes Welbl for valuable feedback and discussions. This work was supported by an Allen Distinguished Investigator Award.
References
Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. CoRR, abs/1409.0473.

Guillaume Bouchard, Pontus Stenetorp, and Sebastian Riedel. 2016. Learning to generate textual data. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1608–1616, Austin, Texas. Association for Computational Linguistics.

Denny Britz, Anna Goldie, Thang Luong, and Quoc Le. 2017. Massive exploration of neural machine translation architectures. ArXiv e-prints.

Chris Brockett, William B. Dolan, and Michael Gamon. 2006. Correcting ESL errors using phrasal SMT techniques. In Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics, pages 249–256, Sydney, Australia. Association for Computational Linguistics.

Mariano Felice and Zheng Yuan. 2014. Generating artificial errors for grammatical error correction. In Proceedings of the Student Research Workshop at the 14th Conference of the European Chapter of the Association for Computational Linguistics, pages 116–126, Gothenburg, Sweden. Association for Computational Linguistics.

Jennifer Foster and Øistein E. Andersen. 2009. GenERRate: Generating errors for use in grammatical error detection. In Proceedings of the Fourth Workshop on Innovative Use of NLP for Building Educational Applications, pages 82–90. Association for Computational Linguistics.

Andrew R. Golding and Dan Roth. 1996. Applying winnow to context-sensitive spelling correction. arXiv preprint cmp-lg/9607024.

Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. 2014. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.

Kenji Imamura, Kuniko Saito, Kugatsu Sadamitsu, and Hitoshi Nishikawa. 2012. Grammar error correction using pseudo-error sentences and domain adaptation. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Short Papers - Volume 2, pages 388–392. Association for Computational Linguistics.

Mohit Iyyer, Varun Manjunatha, Jordan Boyd-Graber, and Hal Daumé III. 2015. Deep unordered composition rivals syntactic methods for text classification. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 1681–1691.

Marcin Junczys-Dowmunt and Roman Grundkiewicz. 2014. The AMU system in the CoNLL-2014 shared task: Grammatical error correction by data-intensive and feature-rich statistical machine translation. In Proceedings of the Eighteenth Conference on Computational Natural Language Learning: Shared Task, pages 25–33, Baltimore, Maryland. Association for Computational Linguistics.

Marcin Junczys-Dowmunt and Roman Grundkiewicz. 2016. Phrase-based machine translation is state-of-the-art for automatic grammatical error correction. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1546–1556, Austin, Texas. Association for Computational Linguistics.

Kevin Knight, Ishwar Chander, Matthew Haines, Vasileios Hatzivassiloglou, Eduard Hovy, Masayo Iida, Steve K. Luk, Akitoshi Okumura, Richard Whitney, and Kenji Yamada. 1994. Integrating knowledge bases and statistics in MT. arXiv preprint cmp-lg/9409001.

Nina H. Macdonald. 1983. Human factors and behavioral science: The UNIX Writer's Workbench software: Rationale and design. Bell Labs Technical Journal, 62(6):1891–1908.

David McClosky, Eugene Charniak, and Mark Johnson. 2006. Effective self-training for parsing. In Proceedings of the Human Language Technology Conference of the NAACL, Main Conference, pages 152–159, New York City, USA. Association for Computational Linguistics.

Courtney Napoles and Chris Callison-Burch. 2017. Systematically adapting machine translation for grammatical error correction. In Proceedings of the 12th Workshop on Innovative Use of NLP for Building Educational Applications, pages 345–356, Copenhagen, Denmark. Association for Computational Linguistics.

Hwee Tou Ng, Siew Mei Wu, Ted Briscoe, Christian Hadiwinoto, Raymond Hendy Susanto, and Christopher Bryant. 2014. The CoNLL-2014 shared task on grammatical error correction. In Proceedings of the Eighteenth Conference on Computational Natural Language Learning: Shared Task, pages 1–14, Baltimore, Maryland. Association for Computational Linguistics.

Diane Nicholls. 2003. The Cambridge Learner Corpus: Error coding and analysis for lexicography and ELT. In Proceedings of the Corpus Linguistics 2003 Conference, volume 16, pages 572–581.

Marek Rei. 2017. Semi-supervised multitask learning for sequence labeling. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2121–2130, Vancouver, Canada. Association for Computational Linguistics.

Marek Rei, Gamal Crichton, and Sampo Pyysalo. 2016. Attending to characters in neural sequence labeling models. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, pages 309–318, Osaka, Japan. The COLING 2016 Organizing Committee.

Marek Rei, Mariano Felice, Zheng Yuan, and Ted Briscoe. 2017. Artificial error generation with machine translation and syntactic patterns. In Proceedings of the 12th Workshop on Innovative Use of NLP for Building Educational Applications, pages 287–292, Copenhagen, Denmark. Association for Computational Linguistics.

Marek Rei and Helen Yannakoudakis. 2016. Compositional sequence labeling models for error detection in learner writing. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1181–1191, Berlin, Germany. Association for Computational Linguistics.

Stephen D. Richardson and Lisa C. Braden-Harder. 1988. The experience of developing a large-scale natural language text processing system: Critique. In Second Conference on Applied Natural Language Processing.

Alla Rozovskaya and Dan Roth. 2010. Training paradigms for correcting errors in grammar and usage. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 154–162. Association for Computational Linguistics.

Alla Rozovskaya and Dan Roth. 2014. Building a state-of-the-art grammatical error correction system. Transactions of the Association for Computational Linguistics, 2:419–434.

Mike Schuster and Kuldip K. Paliwal. 1997. Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing, 45(11):2673–2681.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Improving neural machine translation models with monolingual data. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 86–96, Berlin, Germany. Association for Computational Linguistics.

Oriol Vinyals, Łukasz Kaiser, Terry Koo, Slav Petrov, Ilya Sutskever, and Geoffrey Hinton. 2015. Grammar as a foreign language. In Advances in Neural Information Processing Systems, pages 2773–2781.

Jason Wang and Luis Perez. 2017. The effectiveness of data augmentation in image classification using deep learning. Technical report.

Helen Yannakoudakis, Ted Briscoe, and Ben Medlock. 2011. A new dataset and method for automatically grading ESOL texts. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 180–189, Portland, Oregon, USA. Association for Computational Linguistics.

Helen Yannakoudakis, Marek Rei, Øistein E. Andersen, and Zheng Yuan. 2017. Neural sequence-labelling models for grammatical error correction. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2795–2806, Copenhagen, Denmark. Association for Computational Linguistics.

Zheng Yuan and Ted Briscoe. 2016. Grammatical error correction using neural machine translation. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 380–386, San Diego, California. Association for Computational Linguistics.

Zheng Yuan and Mariano Felice. 2013. Constrained grammatical error correction using statistical machine translation. In Proceedings of the Seventeenth Conference on Computational Natural Language Learning: Shared Task.