MTNT: A Testbed for Machine Translation of Noisy Text
Paul Michel and Graham Neubig
Language Technologies Institute, Carnegie Mellon University
{pmichel1,gneubig}@cs.cmu.edu
Abstract
Noisy or non-standard input text can cause disastrous mistranslations in most modern Machine Translation (MT) systems, and there has been growing research interest in creating noise-robust MT systems. However, as of yet there are no publicly available parallel corpora with naturally occurring noisy inputs and translations, and thus previous work has resorted to evaluating on synthetically created datasets. In this paper, we propose a benchmark dataset for Machine Translation of Noisy Text (MTNT), consisting of noisy comments on Reddit and professionally sourced translations. We commissioned translations of English comments into French and Japanese, as well as French and Japanese comments into English, on the order of 7k-37k sentences per language pair. We qualitatively and quantitatively examine the types of noise included in this dataset, then demonstrate that existing MT models fail badly on a number of noise-related phenomena, even after performing adaptation on a small training set of in-domain data. This indicates that this dataset can provide an attractive testbed for methods tailored to handling noisy text in MT. The data is publicly available at .

Although MT has improved over the past few years due to the advent of Neural Machine Translation (NMT) (Kalchbrenner and Blunsom; Sutskever et al., 2014; Bahdanau et al., 2014; Wu et al., 2016), systems are still not robust to noisy input like this (Belinkov and Bisk, 2018; Khayrallah and Koehn). For example, Google Translate (translate.google.com as of May 2018) translates the above example into French as:

To address this, we propose a dataset of noisy sentences with professionally sourced translations, both in a pair of typologically close languages (English and French) and distant languages (English and Japanese).
We collect noisy comments from the Reddit online discussion website (§3) in English, French and Japanese, and ask professional translators to translate to and from English, resulting in approximately 1000 test samples and from 6k to 36k training samples in four language pairs (English-French (en-fr), French-English (fr-en), English-Japanese (en-ja) and Japanese-English (ja-en)). In addition, we release additional small monolingual corpora in those 3 languages to both provide data for semi-supervised adaptation approaches as well as noisy Language Modeling (LM) experiments. We test standard translation models (§5) and language models (§6) on our data to understand their failure cases and to provide baselines for future work.

The term “noise” can encompass a variety of phenomena in natural language, with variations across languages (e.g. what is a typo in logographic writing systems?) and type of content (Baldwin et al.). To give the reader an idea of the challenges posed to MT and Natural Language Processing (NLP) systems operating on this kind of text, we provide a non-exhaustive list of types of noise, and more generally input variations that deviate from standard MT training data, that we have encountered in Reddit comments:

• Spelling/typographical errors: “across” → “accross”, “receive” → “recieve”, “could have” → “could of”, “temps” → “tant”, “除く” → “覗く”
• Word omission/insertion/repetition: “je n’aime pas” → “j’aime pas”, “je pense” → “moi je pense”
• Grammatical errors: “a ton of” → “a tons of”, “There are fewer people” → “There are less people”
• Spoken language: “want to” → “wanna”, “I am” → “I’m”, “je ne sais pas” → “chais pas”, “何を笑っているの” → “何わろてんねん”
• Internet slang: “to be honest” → “tbh”, “shaking my head” → “smh”, “mort de rire” → “mdr”, “笑” → “w”/“草”
• Proper nouns (with or without correct capitalization): “Reddit” → “reddit”
• Dialects: African American Vernacular English, Scottish, Provençal, Québécois, Kansai, Tohoku...
• Code switching: “This is so cute” → “This is so kawaii”, “C’est trop conventionel” → “C’est trop mainstream”, “現在捏造中...” → “Now 捏造ing...”
• Jargon: on Reddit: “upvote”, “downvote”, “sub”, “gild”
• Emojis and other Unicode characters
• Profanities/slurs (sometimes masked): “f*ck”, “m*rde” . . .
To a certain extent, translating noisy text is a type of adaptation, which has been studied extensively in the context of both Statistical Machine Translation (SMT) and NMT (Axelrod et al.; Li et al.; Luong and Manning, 2015; Chu et al.; Miceli Barone et al.; Wang et al.; Michel and Neubig, 2018). However, it presents many differences from previous domain adaptation problems, where the main goal is to adapt from a particular topic or style. In the case of noisy text, it is not only the case that a particular word will be translated differently than in the general domain (e.g. as in the case of “sub”), but also that there will be increased lexical variation (e.g. due to spelling or typographical errors), and also inconsistency in grammar (e.g. due to omissions of critical words or misuse). The sum of these differences warrants that noisy MT be treated as a separate problem from domain adaptation, and our experimental analysis in Section 5.4 demonstrates that even after performing adaptation, MT systems still make a large number of noise-related errors.
We first collect noisy sentences in our three languages of interest: English, French and Japanese.

[Figure 1: flowchart of the collection pipeline. Monolingual data: fetch comments from the API; normalize (tokenize, lowercase, strip Markdown); pre-filter (remove URLs, other languages, bots); [OPTIONAL] OOV filter (only keep comments with OOV words); LM filter (filter by subword LM score); split the remaining data into monolingual train, test and validation data. Parallel data: send ~15k comments to translation; manually split into sentences and verify for the test set (~1000 sentences); automatically split the remaining comments into sentences for the training set (6k-36k sentences).]

Figure 1: Summary of our collection process and the respective sections addressing them. We apply the same procedure for each language.
We refer to Figure 1 for an overview of the data collection and translation process. We choose Reddit as a source of data because (1) its content is likely to exhibit noise, (2) some of its sub-communities are entirely run in different languages, in particular English, French and Japanese, and (3) Reddit is a popular source of data in curated and publicly distributed NLP datasets (Tan et al.). We collect data using the public Reddit API. Note that data collection and translation are performed at the comment level; we split the parallel data into sentences as a last step.
For each language, we select a set of communities (“subreddits”) that we know contain many comments in that language:
English:
Since an overwhelming majority of the discussions on Reddit are conducted in English, we do not restrict our collection to any community in particular.
French: /r/france, /r/quebec and /r/rance. The first two are among the biggest French-speaking communities on Reddit. The third is a humor/sarcasm-based offspring of /r/france.

Japanese: /r/newsokur, /r/bakanewsjp, /r/newsokuvip, /r/lowlevelaware and /r/steamr. Those are the biggest Japanese-speaking communities, with over 2,000 subscribers. (For accessing the Reddit API, we use this implementation: praw.readthedocs.io/en/latest, and our complete code is available at .)

We collect comments made during the 03/27/2018-03/29/2018 time period for English, 09/2017-03/2018 for French and 11/2017-03/2018 for Japanese. The large difference in collection time is due to the variance in comment throughput and the relative amount of noise between the languages.

Not all comments found on Reddit exhibit noise as described in Section 2. Because we would like to focus our data collection on noisy comments, we devise criteria that allow us to distinguish potentially noisy comments from clean ones. Specifically, we compile a contrast corpus composed of clean text that we can compare to, and find potentially noisy text that differs greatly from the contrast corpus. Given that our final goal is MT robust to noise, we prefer that these contrast corpora consist of the same type of data that is often used to train NMT models. We select different datasets for each language:
English:
The English side of the preprocessed parallel training data provided for the German-English WMT 2017 News translation task, as provided on the website. This amounts to ≈ . million sentences.

French: The entirety of the French side of the parallel training data provided for the English-French WMT 2015 translation task. This amounts to ≈ . million sentences.

Japanese: We aggregate three small/medium-sized MT datasets: KFTT (Neubig, 2011), JESC (Pryzant et al.) and TED talks (Cettolo et al., 2012), amounting to ≈ . million sentences.

We now describe the procedure used to identify comments containing noise.
Pre-filtering
First, we perform three pre-processing steps to discard comments that do not represent natural noisy text in the language of interest:

1. Comments containing a URL, as detected by a regular expression.
2. Comments where the author’s username contains “bot” or “AutoModerator”. This mostly removes automated comments from bots.
3. Comments in another language: we run langid.py (Lui and Baldwin; https://github.com/saffsd/langid.py) and discard comments where p(lang | comment) > . for any language other than the one we are interested in.

This removes cases that are less interesting, i.e. those that could be solved by rule-based pattern matching or are not natural text created by regular users in the target language. Our third criterion in particular discards comments that are blatantly in another language while still allowing comments that exhibit code-switching or that contain proper nouns or typos that might skew the language identification. In preliminary experiments, we found that 14.47%, 6.53% and 7.09% of the collected comments satisfied the above criteria, respectively.

Normalization

After this first pass of filtering, we pre-process the comments before running them through our noise detection procedure. We first strip Markdown syntax (https://daringfireball.net/projects/markdown) from the comments. For English and French, we normalize the punctuation, lowercase and tokenize the comments using the Moses tokenizer. For Japanese, we simply lowercase the alphabetical characters in the comments. Note that this normalization is done for the purpose of noise detection only; the collected comments are released without any kind of pre-processing. We apply the same normalization procedure to the contrast corpora.
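The pre-filtering and normalization steps above can be sketched as follows. This is a minimal sketch, not the released implementation: the language-probability map would come from a language identifier such as langid.py, the probability threshold (which did not survive extraction here) is a placeholder, and the regexes only approximate Markdown stripping and Moses tokenization.

```python
import re

URL_RE = re.compile(r"https?://\S+|www\.\S+")

def keep_comment(body, author, lang_probs, target_lang, threshold=0.5):
    """Apply the three pre-filtering criteria; return True if the comment is kept.
    `threshold` is a placeholder for the (unrecovered) langid.py cutoff."""
    if URL_RE.search(body):                       # 1. contains a URL
        return False
    if "bot" in author.lower() or author == "AutoModerator":
        return False                              # 2. automated comment
    for lang, p in lang_probs.items():            # 3. confidently another language
        if lang != target_lang and p > threshold:
            return False
    return True

def normalize(comment):
    """Normalization used for noise detection only (released data stays raw)."""
    text = re.sub(r"\[([^\]]*)\]\([^)]*\)", r"\1", comment)  # [text](url) -> text
    text = re.sub(r"[*_`~]+", "", text)                      # emphasis markers
    text = re.sub(r"^>\s?", "", text, flags=re.M)            # quote markers
    text = text.lower()
    text = re.sub(r"([.,!?;:()\"])", r" \1 ", text)          # crude tokenizer stand-in
    return re.sub(r"\s+", " ", text).strip()
```

In the actual pipeline the last two lines would be replaced by the Moses tokenizer for English and French, and by plain lowercasing for Japanese.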
Unknown words
In the case of French and English, a clear indication of noise is the presence of out-of-vocabulary (OOV) words: we record all lowercased words encountered in our reference corpus described in Section 3.2 and only keep comments that contain at least one OOV word. Since we did not use word segmentation for the Japanese reference corpus, we found this method not very effective for selecting Japanese comments and therefore skipped this step.
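The OOV filter can be sketched as below, assuming whitespace tokenization (which is also why the step does not transfer to unsegmented Japanese text):

```python
def build_vocabulary(contrast_corpus):
    """Collect all lowercased words seen in the contrast corpus."""
    vocab = set()
    for sentence in contrast_corpus:
        vocab.update(sentence.lower().split())
    return vocab

def has_oov(comment, vocab):
    """Keep only comments containing at least one out-of-vocabulary word."""
    return any(word not in vocab for word in comment.lower().split())
```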
Language model scores
The final step of our noise detection procedure consists of selecting those comments with a low probability under a language model trained on the reference monolingual corpus. This approach mirrors the one used in Moore and Lewis and Axelrod et al. to select data similar to a specific domain using language model perplexity as a metric. We search for comments that have a low probability under a subword language model, for more flexibility in the face of OOV words. We segment the contrast corpora with Byte-Pair Encoding (BPE) using the sentencepiece implementation (https://github.com/google/sentencepiece), with separate vocabulary sizes for English, French and Japanese. We then use a 5-gram Kneser-Ney smoothed language model trained using KenLM (Heafield et al.; https://kheafield.com/code/kenlm/) to calculate the log probability, normalized by the number of tokens, of every sentence in the reference corpus. Given a Reddit comment, we compute the normalized log probability of each of its lines under our subword language model. If for any line this score is below the 1st percentile of scores in the reference corpus, the comment is labeled as noisy and saved.

Once enough data has been collected, we isolate approximately 15k comments in each language by the following procedure:

• Remove all duplicates. In particular, this handles comments that might have been scraped twice, or automatic comments from bots.
• To further weed out outliers (comments that are too noisy, e.g. ASCII art or the wrong language, or not noisy enough), we discard comments that are on either end of the distribution of normalized LM scores within the set of collected comments. We only keep comments whose normalized score is within the 5-70 percentile range for English (resp. 5-60 for French and 10-70 for Japanese). These numbers were chosen by manually inspecting the data.
• Choose samples at random.

We then concatenate the title of the thread where the comment was found to the text, and send everything to an external vendor for manual translation. Upon receiving the translations, we noticed a certain amount of variation in their quality, likely because translating social media text, with all its nuances, is difficult even for humans. In order to ensure the highest quality, we manually filter the data to segment the comments into sentences and weed out poor translations for our test data. We thereby retain around 1,000 sentence pairs in each direction for the final test set.

We gather the samples that were not selected for the test sets to be used for training or fine-tuning models on noisy data. We automatically split comments into sentences with a regular expression detecting sentence delimiters, and then align the source and target sentences. Should this alignment fail (i.e. the source comment contains a different number of sentences than the target comment after automatic splitting), we revert to providing the whole comment without splitting. For the training data, we do not verify the correctness of translations as closely as for the test data. Finally, we isolate samples in each direction to serve as validation data.

Information about the size of the data can be found in Tables 1, 2 and 3 for the test, training and validation sets respectively. We tokenize the English and French data with the Moses (Koehn et al.) tokenizer and the Japanese data with KyTea (Neubig et al., 2011) before counting the number of tokens in each dataset.

Table 1: Test set numbers.

Table 2: Training set numbers.

           samples  source tokens  target tokens
  en-fr        852         16,957         18,948
  fr-en        886         41,578         46,886
  en-ja        852         40,124         46,886
  ja-en        965         25,010         23,289

Table 3: Validation set numbers.

After the creation of the parallel train and test sets, a large number of unused comments remain in each language, which we provide as monolingual corpora. This additional data has two purposes: first, it serves as a resource for in-domain training using semi-supervised methods relying on monolingual data (e.g. Cheng et al.; Zhang and Zong). Second, it provides a language modeling dataset for noisy text in three languages. We select 3,000 comments at random in each dataset to form a validation set to be used to tune hyper-parameters, and provide the rest as training data. The data is provided with one comment per line; newlines within individual comments are replaced with spaces. Table 4 contains information on the size of the datasets. As with the parallel MT data, we provide the number of tokens after tokenization with the Moses tokenizer for English and French and KyTea for Japanese.

            samples  tokens  characters
  en train   81,631   3.99M       18.9M
     dev      3,000    146k        698k
  fr train   26,485   1.52M       7.49M
     dev      3,000    176k        867k
  ja train   32,042    943k        3.9M
     dev      3,000     84k        351k

Table 4: Monolingual data numbers.

                          Spelling  Grammar  Emojis  Profanities
  en newstest2014            0.210    0.189   0.000        0.030
     newsdiscusstest2015     0.621    0.410   0.021        0.076
     MTNT (en-fr)
  fr newstest2014            2.776    0.091   0.000        0.245
     newsdiscusstest2015     1.686    0.457   0.024        0.354
     MTNT
  ja TED                     0.011    0.266   0.000        0.000
     KFTT                    0.021    0.228   0.000        0.000
     JESC                    0.096    0.929   0.090
     MTNT

Table 5: Numbers, per 100 tokens, of quantifiable noise occurrences. For each language and category, the dataset with the highest amount of noise is highlighted.
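The LM-based selection described above can be sketched as follows. In this sketch `logprob_fn` stands in for the subword 5-gram KenLM model, and the nearest-rank percentile is one possible convention (the paper does not specify one):

```python
import math

def normalized_logprob(tokens, logprob_fn):
    """Per-token-normalized log probability of one line.
    `logprob_fn` stands in for the subword 5-gram KenLM model."""
    return sum(logprob_fn(t) for t in tokens) / max(len(tokens), 1)

def percentile(values, q):
    """q-th percentile by nearest rank (no external dependencies)."""
    ordered = sorted(values)
    idx = max(0, math.ceil(q / 100 * len(ordered)) - 1)
    return ordered[idx]

def is_noisy(comment_lines, threshold, logprob_fn):
    """Flag a comment as noisy if any of its lines scores below the
    threshold, i.e. the 1st percentile of contrast-corpus scores."""
    return any(normalized_logprob(line.split(), logprob_fn) < threshold
               for line in comment_lines)
```

The threshold would be computed once from the contrast corpus, e.g. `threshold = percentile(corpus_scores, 1)`.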
In this section, we investigate the proposed data to understand how different categories of noise are represented and to show that our test sets contain more noise overall than established MT benchmarks.
We run a series of tests to count the number of occurrences of some of the types of noise described in Section 2. Specifically, we pass our data through spell checkers to count spelling and grammar errors. Because some of these tests are impractical to run on a large scale, we limit our analysis to the test sets of MTNT.

We use slightly different procedures depending on the tools available for each language. We test for spelling and grammar errors in English data using Grammarly, an online resource for English spell-checking. Due to the unavailability of an equivalent of Grammarly in French and Japanese, we test for spelling and grammar errors using the integrated spell-checker in Microsoft Word 2013 (https://products.office.com/en-us/microsoft-word-2013). Note that Word seems to count proper nouns as spelling errors, giving higher numbers of spelling errors across the board in French as compared to English.

For all languages, we also count the number of profanities and emojis using custom-made lists and regular expressions (available with our code at https://github.com/pmichel31415/mtnt). In order to compare results across datasets of different sizes, we report all counts per 100 tokens.

The results are recorded in the last row of each section in Table 5. In particular, for the languages with a segmental writing system, English and French, spelling errors are the dominant type of noise, followed by grammar errors. Unsurprisingly, the former are much less present in Japanese.

Table 5 also provides a comparison with the relevant side of established MT test sets. For English and French, we compare our data to the newstest2014 and newsdiscusstest2015 test sets. For Japanese, we compare with the test sets of the datasets described in Section 3.2. Overall, MTNT contains more noise in all metrics but one (there are more profanities in JESC, a Japanese subtitle corpus).
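The emoji and profanity counts, normalized per 100 tokens, can be sketched as below. The word lists and Unicode ranges here are abbreviated, hypothetical stand-ins for the lists released with the MTNT code:

```python
import re

# Abbreviated, hypothetical lists; the real ones ship with the MTNT code.
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")
PROFANITY_RE = re.compile(r"\b(?:f[*u]ck|sh[*i]t|m[*e]rde)\b", re.IGNORECASE)

def counts_per_100_tokens(sentences):
    """Count emoji and profanity occurrences, normalized per 100 tokens."""
    tokens = emojis = profanities = 0
    for sent in sentences:
        tokens += len(sent.split())
        emojis += len(EMOJI_RE.findall(sent))
        profanities += len(PROFANITY_RE.findall(sent))
    return {"emojis": 100 * emojis / tokens,
            "profanities": 100 * profanities / tokens}
```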
This confirms that MTNT indeed provides a more appropriate benchmark for translation of noisy or non-standard text. Compared to synthetically created noisy test sets (Belinkov and Bisk, 2018), MTNT contains less systematic spelling errors and more varied types of noise (e.g. emojis and profanities), and is thereby more representative of naturally occurring noise.

Machine Translation Experiments
We evaluate standard NMT models on our proposed dataset to assess its difficulty. Our goal is not to train state-of-the-art models but rather to test standard off-the-shelf NMT systems on our data, and elucidate what features of the data make it difficult.
All our models are implemented in DyNet (Neubig et al., 2017) with the XNMT toolkit (Neubig et al., 2018). We use approximately the same setting for all language pairs: the encoder is a bidirectional LSTM with 2 layers, the attention mechanism is a multi-layered perceptron and the decoder is a 2-layered LSTM. The embedding dimension is 512; all other dimensions are 1024. We tie the target word embeddings and the output projection weights (Press and Wolf). We train with Adam (Kingma and Ba, 2014) with XNMT’s default hyper-parameters, as well as dropout. We use BPE subwords to handle OOV words. Full configuration details as well as code to reproduce the baselines are available at https://github.com/pmichel31415/mtnt.

We train our models on standard MT datasets:

• en ↔ fr: Our training data consists of the europarl-v7 and news-commentary-v10 corpora. We use the newsdiscussdev2015 dev set from WMT15 as validation data and evaluate the model on the newsdiscusstest2015 and newstest2014 test sets.
• en ↔ ja: We concatenate the respective train, validation and test sets of the three corpora mentioned in Section 3.2. In particular, we detokenize the Japanese part of each dataset to make sure that any tokenization we perform will be uniform (in practice we remove ASCII spaces). We concatenate the dev sets associated with these corpora to serve as validation data and evaluate on each respective test set separately.

                        en-fr  fr-en
  newstest2014
  newsdiscusstest2015
  MTNT
  MTNT (+tuning)

                        en-ja  ja-en
  TED
  KFTT
  JESC
  MTNT
  MTNT (+tuning)

Table 6: BLEU scores of NMT models on the various datasets.
We use sacreBLEU (https://github.com/mjpost/sacreBLEU), a standardized BLEU score evaluation script proposed by Post (2018), for BLEU evaluation of our benchmark dataset. It takes in detokenized references and hypotheses and performs its own tokenization before computing the BLEU score. We specify the intl tokenization option. In the case of Japanese text, we run both hypothesis and reference through KyTea before computing the BLEU score. We strongly encourage that evaluation be performed in the same manner in subsequent work, and will provide both scripts and an evaluation web site in order to facilitate reproducibility.

Table 6 lists the BLEU scores for our models on the relevant test sets in the two language pairs, including the results on MTNT.

To better understand the types of errors made by our model, we count the n-grams that are over- and under-generated with respect to the reference translation. Specifically, we compare the count ratios of all 1- to 3-grams in the output and in the reference, and look for the ones with the highest (over-generated) and lowest (under-generated) ratios. We find that in English, the model under-generates the contracted form of the negative (“do not”/“don’t”) or of auxiliaries (“That is”/“I’m”).

Source: Moi faire la gueule dans le métro me manque, c’est grave ?
Target: I miss sulking in the underground, is that bad?
Our model: I do not know what is going on in the metro, that is a serious matter.
+ fine-tuning: I do not want to be in the metro, it’s serious?

Source: :o ’tain je me disais bien que je passais à côté d’un truc vu les upvotes.
Target: :o damn I had the feeling that I was missing something considering the upvotes.
Our model: o, I was telling myself that I was passing over a nucleus in view of the Yupvoots.
+ fine-tuning: o, I was telling myself that I was going next to a nucleus in view of the

Table 7: Example translations in fr-en.
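The over/under-generation analysis above can be sketched as follows; the add-one smoothing used to avoid division by zero is an assumption, not something specified in the paper:

```python
from collections import Counter

def ngram_counts(sentences, max_n=3):
    """Count all 1- to max_n-grams over a list of tokenized sentences."""
    counts = Counter()
    for tokens in sentences:
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                counts[tuple(tokens[i:i + n])] += 1
    return counts

def generation_ratios(hypotheses, references, max_n=3):
    """Ratio of output to reference counts for every n-gram. High ratios
    flag over-generated n-grams, low ratios under-generated ones."""
    hyp = ngram_counts(hypotheses, max_n)
    ref = ngram_counts(references, max_n)
    return {g: (hyp[g] + 1) / (ref[g] + 1) for g in set(hyp) | set(ref)}
```

Sorting the resulting dictionary by value surfaces the most over- and under-generated n-grams, such as the contracted negations discussed above.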
Finally, we test a simple domain adaptation method by fine-tuning our models on the training data described in Section 3.4. We perform one epoch of training with vanilla SGD. We do not use the validation data at all. As evidenced by the results in the last row of Table 6, this drives the BLEU score up by 3.17 to 7.96 points depending on the language pair. However large this increase might be, our model still breaks on very noisy sentences. Table 7 shows three examples in fr-en. Although our model somewhat improves after fine-tuning, the translations remain inadequate in all cases. In the third case, our model downright fails to produce a coherent output. This shows that despite improving the BLEU score, naive domain adaptation by fine-tuning does not solve the problem of translating noisy text.

In addition to our MT experiments, we report character-level language modeling results on the monolingual part of our dataset. We use the data described in Section 3.5 as training and validation sets. We evaluate the trained model on the source side of our en-fr, fr-en and ja-en test sets for English, French and Japanese respectively. We report results for two models: a Kneser-Ney smoothed 6-gram model (implemented with KenLM) and an implementation of the AWD-LSTM proposed in Merity et al. (2018) (https://github.com/salesforce/awd-lstm-lm). We report the bits-per-character (bpc) counts in Table 9.

                6-gram          AWD-LSTM
                dev     test    dev     test
  English       2.081   2.179   1.706   1.810
  French        1.906   2.090   1.449   1.705
  Japanese      5.003   5.497   4.801   5.225

Table 9: Language modeling scores.
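For reference, bits-per-character is just the negative log-likelihood rescaled to base 2 and normalized by the number of characters; a minimal sketch, with `char_logprob` standing in for the trained character-level model:

```python
import math

def bits_per_character(log_likelihood, num_chars):
    """Convert a total natural-log likelihood into bits per character (bpc)."""
    return -log_likelihood / (num_chars * math.log(2))

def char_lm_bpc(text, char_logprob):
    """Score `text` under a character-level model given by `char_logprob`,
    a stand-in for the 6-gram KenLM or AWD-LSTM models used here."""
    ll = sum(char_logprob(c) for c in text)
    return bits_per_character(ll, len(text))
```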
We intend these results to serve as a baseline for future work on language modeling of noisy text in any of those three languages.
Handling noisy text has received growing attention among various language processing tasks due to the abundance of user-generated content on popular social media platforms (Crystal, 2001; Herring, 2003; Danet and Herring, 2007). This content is considered noisy when compared to the news corpora which have been the main data source for language tasks (Baldwin et al.; Eisenstein). It poses several unique challenges because it contains a larger variety of linguistic phenomena that are absent in the news domain and that lead to degraded quality when applying a model to out-of-domain data (Ritter et al.; Luong and Manning, 2015). Additionally, it is a live example of the Cmabrigde Uinervtisy (Cambridge University) effect, where state-of-the-art models become brittle while humans’ language processing capability is more robust (Sakaguchi et al., 2017; Belinkov and Bisk, 2018).

Efforts to address these challenges have focused on creating in-domain datasets and annotations (Owoputi et al.; Kong et al.; Blodgett et al., 2017), and on domain adaptation training (Luong and Manning, 2015). In MT, improvements were obtained for SMT (Formiga and Fonollosa). However, the specific challenges for neural machine translation had not been studied until recently (Belinkov and Bisk, 2018; Sperber et al.; Cheng et al., 2018).
The first provides empirical evidence of non-trivial quality degradation when source sentences contain natural noise or synthetic noise within words, and the last two explore data augmentation and adversarial approaches for efficiently adding noise to training data to improve robustness.

Our work also contributes to recent advances in evaluating neural machine translation quality with regard to specific linguistic phenomena, such as manually annotated test sentences for English to French translation, in order to identify errors due to specific linguistic divergences between the two languages (Isabelle et al.), or automatically generated test sets to evaluate typical errors in English to German translation (Sennrich). Our contribution distinguishes itself from this previous work and other similar initiatives (Peterson, 2011) by providing an open test set consisting of naturally occurring text exhibiting a wide range of phenomena related to noisy input text from contemporaneous social media.
We proposed a new dataset to test MT models for robustness to the types of noise encountered in natural language on the Internet. We contribute parallel training and test data in both directions for two language pairs, English ↔ French and English ↔ Japanese, as well as monolingual data in those three languages. We show that this dataset contains more noise than existing MT test sets and poses a challenge to models trained on standard MT corpora. We further demonstrate that these challenges cannot be overcome by a simple domain adaptation approach alone. We intend this contribution to provide a standard benchmark for robustness to noise in MT and to foster research on models, datasets and evaluation metrics tailored to this specific problem.
References
Amittai Axelrod, Xiaodong He, and Jianfeng Gao. 2011. Domain adaptation via pseudo in-domain data selection. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, pages 355–362.

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. In International Conference on Learning Representations.

Timothy Baldwin, Paul Cook, Marco Lui, Andrew MacKinlay, and Li Wang. 2013. How noisy social media text, how diffrnt social media sources? In Proceedings of the Sixth International Joint Conference on Natural Language Processing, pages 356–364.

Yonatan Belinkov and Yonatan Bisk. 2018. Synthetic and natural noise both break neural machine translation. In International Conference on Learning Representations.

Su Lin Blodgett, Johnny Wei, and Brendan O’Connor. 2017. A dataset and classifier for recognizing social media English. In Proceedings of the 3rd Workshop on Noisy User-generated Text, pages 56–61.

Mauro Cettolo, Christian Girardi, and Marcello Federico. 2012. WIT3: Web inventory of transcribed and translated talks. In Proceedings of the 16th Conference of the European Association for Machine Translation (EAMT), pages 261–268.

Yong Cheng, Zhaopeng Tu, Fandong Meng, Junjie Zhai, and Yang Liu. 2018. Towards robust neural machine translation.

Yong Cheng, Wei Xu, Zhongjun He, Wei He, Hua Wu, Maosong Sun, and Yang Liu. 2016. Semi-supervised learning for neural machine translation. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1965–1974.

Chenhui Chu, Raj Dabre, and Sadao Kurohashi. 2017. An empirical comparison of domain adaptation methods for neural machine translation. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 385–391.

David Crystal. 2001. Language and the Internet. Cambridge University Press.

Brenda Danet and Susan Herring. 2007. The Multilingual Internet: Language, Culture, and Communication Online. Oxford University Press, New York.

Jacob Eisenstein. 2013. What to do about bad language on the internet. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 359–369.

Lluís Formiga and José A. R. Fonollosa. 2012. Dealing with input noise in statistical machine translation. In Proceedings of the Conference on Computational Linguistics 2012: Posters, pages 319–328.

Kenneth Heafield, Ivan Pouzyrevsky, Jonathan H. Clark, and Philipp Koehn. 2013. Scalable modified Kneser-Ney language model estimation. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics, pages 690–696.

Susan Herring, editor. 2003. Media and Language Change. Special issue of Journal of Historical Pragmatics 4:1.

Pierre Isabelle, Colin Cherry, and George Foster. 2017. A challenge set approach to evaluating machine translation. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2486–2496.

Nal Kalchbrenner and Phil Blunsom. 2013. Recurrent continuous translation models. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1700–1709.

Huda Khayrallah and Philipp Koehn. 2018. On the impact of various types of noise on neural machine translation. In Proceedings of the 2nd Workshop on Neural Machine Translation and Generation, pages 74–83.

Diederik P. Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. In International Conference on Learning Representations.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondřej Bojar, Alexandra Constantin, and Evan Herbst. 2007. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics on Interactive Poster and Demonstration Sessions, pages 177–180.

Lingpeng Kong, Nathan Schneider, Swabha Swayamdipta, Archna Bhatia, Chris Dyer, and Noah A. Smith. 2014. A dependency parser for tweets. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1001–1012.

Mu Li, Yinggong Zhao, Dongdong Zhang, and Ming Zhou. 2010. Adaptive development data selection for log-linear model in statistical machine translation. In Proceedings of the 23rd International Conference on Computational Linguistics (Coling 2010), pages 662–670.

Marco Lui and Timothy Baldwin. 2012. langid.py: An off-the-shelf language identification tool. In Proceedings of the Association for Computational Linguistics 2012 System Demonstrations, pages 25–30.

Minh-Thang Luong and Christopher D. Manning. 2015. Stanford neural machine translation systems for spoken language domains. In Proceedings of the International Workshop on Spoken Language Translation.

Stephen Merity, Nitish Shirish Keskar, and Richard Socher. 2018. Regularizing and optimizing LSTM language models. In International Conference on Learning Representations.

Antonio Valerio Miceli Barone, Barry Haddow, Ulrich Germann, and Rico Sennrich. 2017. Regularization techniques for fine-tuning in neural machine translation. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 1490–1495.

Paul Michel and Graham Neubig. 2018. Extreme adaptation for personalized neural machine translation.

Robert C. Moore and William Lewis. 2010. Intelligent selection of language model training data. In Proceedings of the Association for Computational Linguistics 2010 Conference Short Papers.

Graham Neubig et al. 2017. DyNet: The dynamic neural network toolkit. arXiv preprint arXiv:1701.03980.

Graham Neubig, Yosuke Nakata, and Shinsuke Mori. 2011. Pointwise prediction for robust, adaptable Japanese morphological analysis. In The 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies (ACL-HLT), pages 529–533.

Olutobi Owoputi, Brendan O’Connor, Chris Dyer, Kevin Gimpel, Nathan Schneider, and Noah A. Smith. Improved part-of-speech tagging for online conversational text with word clusters. Association for Computational Linguistics.

Kay Peterson. 2011. OpenMT12 evaluation.

Matt Post. 2018. A call for clarity in reporting BLEU scores. arXiv preprint arXiv:1804.08771.

Ofir Press and Lior Wolf. 2017. Using the output embedding to improve language models. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, pages 157–163.

R. Pryzant, Y. Chung, D. Jurafsky, and D. Britz. JESC: Japanese-English subtitle corpus. ArXiv e-prints.

Alan Ritter, Sam Clark, Oren Etzioni, et al. Named entity recognition in tweets: an experimental study. In
Proceedings of the conference on empirical methodsin natural language processing , pages 1524–1534.Association for Computational Linguistics.Keisuke Sakaguchi, Kevin Duh, Matt Post, and Ben-jamin Van Durme. 2017. Robsut wrod reocgini-ton via semi-character recurrent neural network. In
AAAI , pages 3281–3287.Rico Sennrich. How grammatical is character-levelneural machine translation? assessing mt qualitywith contrastive translation pairs. In
Proceedings ofthe 15th Conference of the European Chapter of theAssociation for Computational Linguistics: Volume2, Short Papers , pages 376–382.Matthias Sperber, Jan Niehues, and Alex Waibel. To-ward robust neural machine translation for noisy in-put sequences. In
Proceedings of the InternationalWorkshop on Spoken Language Translation , pages90–96.Ilya Sutskever, Oriol Vinyals, and Quoc V Le. 2014.Sequence to sequence learning with neural net-works. In
Advances in neural information process-ing systems , pages 3104–3112. Chenhao Tan, Vlad Niculae, Cristian Danescu-Niculescu-Mizil, and Lillian Lee. Winning argu-ments: Interaction dynamics and persuasion strate-gies in good-faith online discussions. In
Proceed-ings of the 25th international conference on worldwide web , pages 613–624. International World WideWeb Conferences Steering Committee.Rui Wang, Masao Utiyama, Lemao Liu, Kehai Chen,and Eiichiro Sumita. Instance weighting for neuralmachine translation domain adaptation. In
Proceed-ings of the 2017 Conference on Empirical Methodsin Natural Language Processing , pages 1483–1489.Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc VLe, Mohammad Norouzi, Wolfgang Macherey,Maxim Krikun, Yuan Cao, Qin Gao, KlausMacherey, et al. 2016. Google’s neural ma-chine translation system: Bridging the gap betweenhuman and machine translation. arXiv preprintarXiv:1609.08144 .Jiajun Zhang and Chengqing Zong. Exploiting source-side monolingual data in neural machine translation.In