Machine Translation with Unsupervised Length-Constraints
Jan Niehues
Department of Data Science and Knowledge Engineering (DKE), Maastricht University
[email protected]
Abstract
We have seen significant improvements in machine translation due to the use of deep learning. While the improvements in translation quality are impressive, the encoder-decoder architecture enables many more possibilities. In this paper, we explore one of these: the generation of constrained translations. We focus on length constraints, which are essential if the translation is to be displayed in a given format.

In this work, we propose an end-to-end approach for this task. Compared to a traditional method that first translates and then performs sentence compression, the text compression is learned completely unsupervised. By combining the idea with zero-shot multilingual machine translation, we are also able to perform unsupervised monolingual sentence compression. In order to fulfill the length constraints, we investigate several methods to integrate the constraints into the model.

Using the presented techniques, we are able to significantly improve the translation quality under constraints. Furthermore, we are able to perform unsupervised monolingual sentence compression.
Neural machine translation (NMT) (Sutskever et al., 2014; Bahdanau et al., 2014) exploits neural networks to directly learn to transform sentences in a source language into sentences in a target language. This technique has significantly improved the quality of machine translation (Bojar et al., 2016; Cettolo et al., 2015). The advances in quality also allow for the application of this technology to new real-world applications.

While research systems often purely focus on high translation quality, real-world applications often have additional requirements for the output of the system. One example is the mapping of markup information from the source text to the target text (Zenkel et al., 2019). In this work, we focus on another use case: the generation of translations under given length constraints. Thereby, we focus on compression, that is, the target length is shorter than the length the translation would normally have. When translating from one language to another, the length of the source text usually differs from the length of the target text. While for most applications of machine translation this does not pose a problem, for some applications it significantly deteriorates the user experience. For example, if the translation should be displayed in the same layout as the source text (e.g. on a website), it is advantageous if the length stays the same. Another use case are captions for videos. A human is only capable of reading text at a given speed. For an optimal user experience, it is therefore not only important to present an accurate translation, but also to present the translation with a maximum number of words.

A first approach to address this challenge would be a cascade of a machine translation system and a sentence compression system. In this case, we would need training data to train the machine translation system and additional training data to train the sentence compression system. It is very difficult and sometimes even impossible to collect training data for the sentence compression task. Furthermore, we need a sentence compression model with a parametric length reduction ratio; for a supervised model, we would therefore need examples with different length reduction ratios. Therefore, this work focuses on unsupervised sentence compression. In this case, the cascaded approach cannot directly be applied, and we start with an end-to-end approach to length-constrained translation.

While our work focuses on the end-to-end approach to translation combined with sentence compression, monolingual sentence compression is another important task. For example, human-generated captions are often not an accurate transcription of the audio; in addition, the text is shortened. This is due to cognitive processing constraints: the user is able to listen to more words in a given time than he can read in the same amount of time. When combining length-constrained machine translation with the idea of zero-shot machine translation, the proposed method is also able to perform monolingual sentence compression. Compared to related work, this method learns the text compression in an unsupervised manner without ever seeing a compressed sentence.

The main contribution of this work is an end-to-end approach to length-constrained translation by jointly performing machine translation and sentence compression.
We are able to show that for this task an end-to-end approach outperforms the cascade of machine translation and unsupervised sentence compression. Secondly, we perform an in-depth analysis of how to integrate additional constraints into neural machine translation. We investigate methods to restrict the search to translations with a given length, as well as adaptations of the transformer-based encoder-decoder. In the analysis, we also investigate the portability of the technique to other constraints; to this end, we apply the same techniques to avoiding difficult words. A third contribution of this work is to extend the presented approach to unsupervised monolingual sentence compression. By combining the presented approach with multilingual machine translation, we are also able to generate paraphrases with a given length constraint. The investigation shows that a system trained on several languages is able to successfully generate monolingual paraphrases.

In this work, we address the challenge of generating machine translations with additional constraints. The main application is length-constrained translation, that is, we want to generate a translation with a given target length. We focus thereby on the case of shortening the translations. While the length can be measured in words, sub-word tokens or letters, in the experiments we measure the length in sub-word tokens.

A first straightforward approach is to restrict the search space to generate only translations with a given length. The length of the output is modeled by the probability of the end-of-sentence (EOS) token. By modifying this probability, we introduce a hard constraint that is always fulfilled.

Based on the experiments with this type of length constraint, we then investigate methods to include the length constraints directly into the model. For that, we use two techniques successfully applied in encoder-decoder models: embeddings and positional encodings. In this case, the target length is modeled as a soft constraint. However, by combining both techniques, we can again achieve a hard constraint.
A first strategy to include the additional constraints is to ignore them during training and to restrict the search space during inference to translations that fulfill the constraint. For length constraints, this can be achieved by manipulating the probability of the end-of-sentence (EOS) token. First, we need to ensure that the EOS token is not generated before the desired output length $J$. This can be ensured by setting the probability of the EOS token to zero for all positions before the desired length and renormalizing the distribution:

$$
p'(y_j \mid x_1, \dots, x_I, y_1, \dots, y_{j-1}) =
\begin{cases}
\frac{p(y_j \mid x_1, \dots, x_I, y_1, \dots, y_{j-1})}{1 - p(\mathrm{EOS} \mid x_1, \dots, x_I, y_1, \dots, y_{j-1})} & y_j \neq \mathrm{EOS} \\
0 & y_j = \mathrm{EOS}
\end{cases}
\quad (1)
$$
Finally, we ensure that the search stops at the desired length by setting the probability of the EOS token to one once the output sequence has reached this length:

$$
p'(y_j \mid x_1, \dots, x_I, y_1, \dots, y_{j-1}) =
\begin{cases}
0 & y_j \neq \mathrm{EOS} \\
1 & y_j = \mathrm{EOS}
\end{cases}
\quad (2)
$$
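To make the manipulation concrete, the following is a minimal sketch, not the paper's actual implementation, of how Equations 1 and 2 could be applied to the decoder's output distribution at each step; the tensor layout and the eos_id argument are illustrative assumptions.

```python
import torch

def constrain_eos(log_probs: torch.Tensor, step: int, target_len: int,
                  eos_id: int) -> torch.Tensor:
    """log_probs: (batch, vocab) log-distribution at decoding step `step` (1-based)."""
    probs = log_probs.exp()
    if step < target_len:
        # Eq. 1: renormalize over the non-EOS tokens, then forbid EOS.
        eos_mass = probs[:, eos_id].unsqueeze(1)
        probs = probs / (1.0 - eos_mass)
        probs[:, eos_id] = 0.0
    else:
        # Eq. 2: once the target length is reached, only EOS may be generated.
        probs = torch.zeros_like(probs)
        probs[:, eos_id] = 1.0
    return (probs + 1e-12).log()
```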
While this approach guarantees that the output of the translation system always meets the length condition (hard constraint), it also has one major drawback: until the system reaches the constrained length, it is not aware of how many words it is still allowed to generate. Therefore, it is not able to shorten the beginning of the sentence in order to fulfill the length constraint.

Motivated by this observation, we investigate methods to integrate the length constraint into the model and not only apply it during inference.
The first challenge we need to address when including the length constraint in the model itself is the question of training data. While there are large amounts of parallel training data, it is hard to acquire training data with length constraints. Therefore, we investigate methods to train the model with standard parallel training data.

We perform the training by a type of pseudo-supervision. For each source sentence in training, we also know the translation and thereby its length. The main idea is that we now assume this sentence was generated with the constraint to produce a translation with exactly the length of the given translation. Of course, this is mostly not the case: the human translator generated a translation that appropriately expresses the meaning of the source sentence, not a sentence that fulfills the length constraint.

Therefore, learning in this case is more difficult. Since the translations were generated without the given length constraints, the system might learn to simply ignore the length information and instead generate a normal translation, putting all the information of the source sentence into the target sentence. In this case, we would not have the possibility to control the target length by specifying our desired length. Therefore, we investigate different possibilities for encoding the constrained target length in the model architecture.
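A hedged sketch of the pseudo-supervision described above: every parallel sentence pair is treated as if it had been produced under the constraint "generate exactly len(target) tokens". The function and field names are illustrative, not taken from the paper's code.

```python
def make_length_constrained_example(src_tokens, tgt_tokens):
    # The reference length acts as the (pseudo-)constraint J during training.
    return {
        "source": src_tokens,
        "target": tgt_tokens,
        "target_length": len(tgt_tokens),
    }
```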
In this work, we investigate three different methods to represent the target length in the model. In all cases, our training data consists of a source sentence $X = x_1, \dots, x_I$, a target sentence $Y = y_1, \dots, y_J$ and the target length $J$.

Source embedding
A first method is to include the target length in the source sentence as an additional token. This is motivated by successful approaches to multilingual machine translation (Ha et al., 2016), domain adaptation (Kobus et al., 2017) and formality levels (Sennrich et al., 2016a). We change the training procedure to not use $X$ as the input to the encoder of the NMT system, but instead $J, X$. Thereby, the encoder learns an embedding for each target length seen during training.

There are two challenges with this approach. First, the dependency between the desired length $J$ and the output $Y$ spans a long distance within the model. Therefore, the model might ignore the information and just learn to generate the best translation for a given source sentence. Secondly, the representations of the possible target lengths are independent from each other: the embedding of one length is not constrained to be similar to that of a neighboring length. This poses a special challenge for long sentences, which occur less frequently, so the embeddings of these lengths will not be learned as well as the frequent ones.
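As an illustration, here is a minimal sketch of this source-side encoding, in the spirit of the language tokens used in multilingual NMT; the `<len:J>` token format is an assumption, not the exact markup used in the paper. Each distinct length token receives its own learned embedding once it enters the vocabulary.

```python
def prepend_length_token(src_tokens, target_length):
    # Encode the desired output length J as a pseudo-token at the start
    # of the source sentence.
    return [f"<len:{target_length}>"] + src_tokens

# e.g. prepend_length_token(["das", "ist", "ein", "Test"], 3)
# -> ["<len:3>", "das", "ist", "ein", "Test"]
```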
Target embedding

We address the first challenge by integrating the length constraint directly into the decoder. This is motivated by similar approaches to supervised sentence compression (Kikuchi et al., 2016) and zero-shot machine translation (Ha et al., 2017). We incorporate the information about the number of remaining target words at each target position. For one, this ensures that the length information is not lost during the decoding process. Secondly, since smaller numbers occur more frequently in the corpus towards the end of sentences, the problem of rare sentence lengths matters less.

Formally, at each decoder step $j$, the baseline model starts with the word embedding of the last target word $y_{j-1}$. In the original transformer architecture, the positional encoding is applied on top of the embedding to generate the first hidden representation:

$$
h = \mathrm{pos}(\mathrm{emb}(y_{j-1}), j) \quad (3)
$$

In our proposed architecture, we include the number of remaining target words to be generated, $J - j$. We concatenate $h$ with the length embedding and then apply a linear transformation and a non-linearity to reduce the hidden size to that of the original word embedding:

$$
h' = \mathrm{relu}(\mathrm{lin}(\mathrm{cat}(h, \mathrm{lenEmb}(J - j)))) \quad (4)
$$

The proposed architecture allows the model to consider the number of remaining target words at each decoding step. While the baseline model will only cut the end of the sentence, this model is able to already consider the constraint at the beginning of the sentence.

Positional encoding

Finally, we also address the challenge of representing sentence lengths that are less frequent. The transformer architecture (Vaswani et al., 2017) introduced the positional encoding, which encodes the position within the sentence using a set of trigonometric functions. While their method encodes the position relative to the start of the sentence, we follow Takase and Okazaki (2019) and encode the position relative to the end of the sentence. Thereby, at each position we encode the number of remaining words of the sentence. Formally, we replace $h = \mathrm{pos}(\mathrm{emb}(y_{j-1}), j)$ by $h^* = \mathrm{pos}(\mathrm{emb}(y_{j-1}), J - j)$.

Besides constraining the number of words, other constraints can be implemented just as easily in the same framework. In this work, we show this by limiting the number of complex and difficult words. One use case is the generation of paraphrases in simplified language. A metric to measure text difficulty, the Dale-Chall readability metric (Chall and Dale, 1995), for example, counts such difficult words. In an NMT system, longer words are typically split into subword units by byte-pair encoding (Sennrich et al., 2016b). A complex word like marshmallow is split into several parts like mar@@ shm@@ allow, where @@ indicates that the word is not yet finished.

We can encourage the system to use fewer complex words that need to be represented by several BPE tokens by counting the number of tokens that do not end a word. In this encoding scheme, these are all tokens that end in @@. During decoding, we then try to generate translations with a minimal number of these tokens.
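The following sketch shows how Equations 3 and 4 and the reversed positional encoding could look in code; the dimensions, module names, and the pos_encoding callable are illustrative assumptions, not the paper's implementation. The helper at the end counts the word-internal BPE tokens used as the constraint target in the simplification variant.

```python
import torch
import torch.nn as nn

class LengthAwareDecoderInput(nn.Module):
    """Decoder input computation with a target-length embedding (Eqs. 3 and 4)."""

    def __init__(self, d_model: int, max_len: int = 1024):
        super().__init__()
        self.len_emb = nn.Embedding(max_len, d_model)  # lenEmb(.)
        self.lin = nn.Linear(2 * d_model, d_model)     # lin(.)

    def forward(self, word_emb, j, target_len, pos_encoding):
        # word_emb: (batch, d_model) embedding of y_{j-1}; j: int decoder step;
        # target_len: (batch,) LongTensor of desired lengths J;
        # pos_encoding(p) is assumed to return the sinusoidal vector for position p.
        h = word_emb + pos_encoding(j)                   # Eq. 3
        remaining = torch.clamp(target_len - j, min=0)   # J - j, kept non-negative
        h_cat = torch.cat([h, self.len_emb(remaining)], dim=-1)
        return torch.relu(self.lin(h_cat))               # Eq. 4

# The positional-encoding variant instead replaces Eq. 3 by
#   h* = word_emb + pos_encoding(target_len - j),
# i.e. positions are counted from the end of the sentence.

def count_word_internal_tokens(bpe_tokens):
    # Constraint target for the simplification variant: BPE tokens that do
    # not end a word carry the "@@" continuation marker.
    return sum(1 for t in bpe_tokens if t.endswith("@@"))
```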
The lack of appropriate data is not only a challenge for training the model, but also for evaluation. The default approach to evaluating a machine translation method is to compare the system's translation with a human translation using some automatic metric (e.g. BLEU (Papineni et al., 2001)). In our case, we would need a human-generated translation that also fulfills the additional constraints, for example, a translation whose length is shortened to 80% of the input. Since this type of translation data is not available, we investigate methods to compare the length-constrained output of the system with standard human translations that do not fulfill any specific constraints.
While there is significant research on automatic metrics for machine translation (Ma et al., 2018, 2019), BLEU is still the most commonly used metric. Therefore, a first approach would be to use BLEU to compare the automatic translation with length constraints to the human translation without constraints. When using length constraints, this leads to a low BLEU score due to the length penalty of the metric. But since all systems must fulfill the length constraint, the penalty is the same for all outputs and we can still compare the different outputs.

A problem of using the BLEU metric for this task is illustrated by the example translations in Table 1. The baseline system only uses the length constraint to restrict the search space; the constrained system additionally uses the length constraint as embeddings in the decoder. Looking at this example sentence, a human would always rate the constrained translation better than the baseline translation. The problem of the baseline model is that it often generates a prefix of the full translation. While this does not yield a good constrained translation, it still leads to a relatively high BLEU score: in this case, we have one matching 3-gram, two matching bigrams and four matching unigrams. In contrast, the output of the constrained model only contains words matching the reference scattered over the sentence; in this case, we only have two unigram matches. Guided by this observation, we used different metrics to evaluate the models.
In order to address the challenges mentioned in the last section, we use metrics that are based on sentence embeddings instead of word- or character-based representations of the sentence. This way, it is no longer important that the words occur in the same order in the automatic translation and the reference. Based on the performance of the automatic metrics in the WMT 2018 evaluation campaign, we use the RUSE metric (Shimanaka et al., 2019). It uses sentence embeddings from three different models: InferSent, Quick-Thought and the Universal Sentence Encoder. The quality is then estimated by an MLP based on the representations of the hypothesis and the reference translation.

Reference: So CEOs, a little bit better than average, but here's where it gets interesting.
Baseline: CEOs are a little bit
Constraint: the CEOs are interesting .
Table 1: Example of constrained translation
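To make the n-gram argument explicit, here is a small illustrative computation of clipped n-gram matches for the Table 1 example; the whitespace tokenization is a simplifying assumption, and this is not the full BLEU metric (no brevity penalty, single reference).

```python
from collections import Counter

def ngram_matches(hyp, ref, n):
    # Clipped n-gram matches: each hypothesis n-gram counts at most as often
    # as it appears in the reference.
    hyp_ngrams = Counter(tuple(hyp[i:i + n]) for i in range(len(hyp) - n + 1))
    ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
    return sum(min(c, ref_ngrams[g]) for g, c in hyp_ngrams.items())

ref = ("So CEOs , a little bit better than average , "
       "but here 's where it gets interesting .").split()
baseline = "CEOs are a little bit".split()
for n in (1, 2, 3):
    print(n, ngram_matches(baseline, ref, n))
# -> 4 unigrams, 2 bigrams, 1 trigram, as discussed above
```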
We train our systems on the TED data from the IWSLT 2017 multilingual evaluation campaign (Cettolo et al., 2017). The data contains parallel data between German, English, Italian, Dutch and Romanian. We create three different systems: the first is trained only on the German-English data, the second on German-English and English-German data, and the last on {German, Dutch, Italian, Romanian} and English data in both directions.

The data is preprocessed using standard MT procedures, including tokenization, truecasing and byte-pair encoding with 40K codes. For model selection, the checkpoints performing best on the validation data (dev2010 and tst2010 combined) are averaged; the resulting model is used to translate the tst2017 test set.

In the experiments, we address two different target lengths. In order to not use any information from the reference, we measure length limits relative to the source sentence length by counting subword units. We aim to shorten the translation to produce output that is 80% and 50% of the source sentence length. In all experiments, the length is measured by the number of sub-word tokens.

We use the transformer architecture (Vaswani et al., 2017) and increase the number of layers to eight. The layer size is 512 and the inner size is 2048. Furthermore, we apply word dropout (Gal and Ghahramani, 2016); in addition, layer dropout is used as in the original work. We use the same learning rate schedule as in the original work. The implementation is available at https://github.com/quanpn90/NMTGMinor.git.

Initially, we wanted to investigate the difficulty of adding the length constraints. Therefore, we used the length of the human reference translation as a first target length. One could even argue that this should make the machine translation task easier, since some information about the translation is known. The results of this experiment are shown in Table 2. Since we do not perform compression in this experiment, the aforementioned problem with BLEU should not apply here.

Model        BLEU   RUSE
Baseline     30.80  -0.085
Only Search  28.32  -0.124
Source Emb   28.56  -0.126
Decoder Emb  27.88  -0.140
Decoder Pos  28.80  -0.138
Table 2: Using oracle length
However, the results indicate that the baseline system achieves the best BLEU score as well as the best RUSE score. All other models generate translations that perfectly fit the desired target length, but this leads to a drop in translation quality. Therefore, even if the target length is the same as that of the reference translation, the restriction increases the difficulty of the problem. One reason could be that the machine translation system rarely generates translations that exactly match the reference; by forcing the translation to have an exact predefined length, we increase the difficulty of the problem.
In a first series of experiments, we analyzed the different techniques to encode the length of the output. First, we are interested in whether the different length representations are able to enforce an output that has the length we are aiming at (soft constraints). For the German-to-English translation task, the results of the different encoding variants are shown in Table 3. We report the average difference between the targeted output length, given in BPE units, and the length of the output of the translation system.

First, without adding any constraints, the models generate translations that differ by 3.90 and 10.29 words from the targeted length. By specifying the length on the source side, we can reduce the length difference to half a word in the case of a targeted length of 80% and one and a half words in the case of 50% of the source length. The models using the decoder embeddings and the decoder positional encoding were able to generate translations with nearly perfectly the correct number of words.

             Avg. length difference
Encoding       80%     50%
Baseline       3.90    10.29
Source Emb     0.55    1.40
Decoder Emb    0.07    0.16
Decoder Pos    0.09    0.19
Table 3: Average length difference
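For clarity, here is a small sketch of the statistic reported in Table 3, under the assumption that it is the mean absolute difference between the desired and the produced length, both counted in BPE tokens.

```python
def avg_length_difference(hyp_bpe_lengths, target_lengths):
    # Mean absolute deviation between produced and targeted output length
    # (our reading of the Table 3 statistic).
    pairs = list(zip(hyp_bpe_lengths, target_lengths))
    return sum(abs(h - j) for h, j in pairs) / len(pairs)
```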
Besides fulfilling the length constraints, the translations need to be accurate. Since we want a fair comparison, we evaluate the output with a restricted search space, so that only translations with the correct number of words are generated (hard constraints). The results are summarized in Table 4.

                 RUSE
Encoding       80%       50%
Baseline       -0.272    -0.605
Source Emb     -0.263    -0.587
Decoder Emb    -0.2469   -0.555
Decoder Pos    -0.2598   -0.577
Table 4: German-English translation quality
As shown by the results, we see improvements in translation quality when using the source embedding within the encoder. We see further improvements when representing the targeted length within the decoder: in this case, we can improve the RUSE score by 2% and 5% absolute. The two decoder encodings perform similarly, with a small advantage for the embeddings over the positional encodings. Therefore, in the remainder of the experiments, we use the embeddings.
In a second series of experiments, we combine the constrained translation approach with multilingual machine translation. The combination of both offers the unique opportunity to perform unsupervised sentence compression: we can treat the translation from English to English as a zero-shot direction (Johnson et al., 2017; Ha et al., 2016). This has not been addressed in traditional multilingual machine translation, since in this case the model will often just copy the source sentence to the target side. By adding the length constraints, we force the model to reformulate the sentence in order to fulfill the length constraint.

The results for these experiments are shown in Table 5. We compare three scenarios: first, a model trained only to translate from German to English; secondly, a model trained to translate from German to English and English to German; finally, a model trained on four languages to and from English.

First of all, since the models are trained on relatively small data, we always gain from using more language pairs. Secondly, for all models translating from German to English, the decoder embeddings are clearly better than the baseline. Finally, to perform paraphrasing, we need a multilingual system with several language pairs: both models trained only on the German-English and English-German data fail to generate adequate output. In contrast, if we look at the translation from English to English for the multilingual model, the scores are clearly better than the ones from German to English. Furthermore, the system with decoder embeddings is again clearly better than the baseline system.

In addition, we performed the same experiment with a target length of half the source length (Table 5). Although the absolute scores are significantly lower, since the model has to reduce the length further, the tendency is the same for this direction.
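A hedged sketch of how an English-to-English compression request could be issued to such a multilingual, length-constrained model: the target-language token of multilingual NMT is combined with a length constraint derived from the source. Both the `<2en>` token format and the helper function are illustrative assumptions, not the system's exact markup.

```python
def build_constrained_input(src_tokens, tgt_lang, ratio):
    # Desired output length J, measured relative to the source in tokens.
    target_length = max(1, round(ratio * len(src_tokens)))
    return [f"<2{tgt_lang}>"] + src_tokens, target_length

tokens, j = build_constrained_input(
    "this is a rather long English sentence".split(), "en", 0.5)
# Decoding then proceeds under the constraint J = j; because source and
# target language are both English, this is a zero-shot paraphrasing direction.
```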
In this work, we combine machine translation and sentence compression. In a third series of experiments, we investigate the advantage of modeling this in an end-to-end fashion compared to a cascade of different models. We perform this investigation again for two tasks: German to English and English to English.

The cascade system for German to English first translates the German text to English with a baseline machine translation system; in a second step, the output is compressed with the multilingual MT system. For the English-to-English task, the cascade removes the zero-shot condition: we first translate from English to German with the baseline system and then translate with length constraints from German to English. In the cascade with fixed pivot, also the English-to-German system already fulfills the length constraint.

                  Target length 0.8                   Target length 0.5
             Baseline           Dec. Emb          Baseline           Dec. Emb
Model      DE-EN    EN-EN    DE-EN    EN-EN    DE-EN    EN-EN    DE-EN    EN-EN
DE-EN      -0.272   -        -0.247   -        -0.587   -        -0.554   -
DE+EN      -0.264   -0.817   -0.223   -0.905   -0.598   -0.954   -0.523   -0.978
All        -0.225   -0.102   -0.214   0.020    -0.560   -0.525   -0.548   -0.481
Table 5: Multi-lingual systems
Length  Model               DE-EN    EN-EN
0.8     End2End             -0.247    0.020
        Cascade             -0.259   -0.118
        Cascade Fix. Pivot  -        -0.166
0.5     End2End             -0.555   -0.481
        Cascade             -0.575   -0.521
        Cascade Fix. Pivot  -        -0.544
Table 6: Comparison of end-to-end and cascaded approaches

As shown in Table 6, in all conditions the end-to-end approach outperforms the cascaded version. This is especially the case for English-to-English machine translation. In contrast to typical multilingual machine translation, for this task it seems to be beneficial to perform the zero-shot task instead of using a pivot language.
In a last series of experiments (Table 7), we investigate the ability of the framework to generate simpler sentences. As described in Section 2.4, we concentrate on reducing the number of rare and complex words. Again, we use the decoder embedding, now representing the number of word-internal BPE units in the sentence. We use systems trained on one language pair, two language pairs and eight language pairs. First, the system is able to reduce the number of these BPE tokens in the text significantly: their number is reduced by up to 48%. Since the total number of tokens stays nearly the same, this is also reflected in better readability scores. On the other hand, we see that the translation quality is only affected slightly.
For the length-restricted systems, we also present examples in Table 8. The translations were generated with the multilingual system using the restricted search space, with 0.8 and 0.5 times the source length; the length is thereby measured in subword tokens.

The examples clearly show the problem of the baseline model when using a restricted search space: the model mainly outputs the prefix of the full translation and does not try to fit the main content into the shorter segment. In contrast, the system using the decoder embeddings is aware, when generating a word, of how much space is still available for the content. Therefore, it does not just cut off part of the sentence, but compresses the sentence and extracts its most important part. While the first example concentrates more on the second part of the original sentence, the second one focuses on the beginning. Although the model reducing the length by 50% has to remove some content of the original sentence, the output is still understandable.
            DE-EN            DE+EN            All
Metric      Base     Simp.   Base     Simp.   Base     Simp.
BPE tokens  1961     1053    1978     1041    1899     991
DCI         7.63     7.47    7.69     7.5     7.66     7.45
FRE         83.86    86.18   84.31    85.49   82.98    85.59
BLEU        30.80    30.62   32.25    31.38   32.84    31.29
RUSE        -0.085   -0.092  -0.082   -0.080  -0.042   -0.084

Table 7: Simplification

Source: Und, obwohl es wirklich einfach scheint, ist es tatsächlich richtig schwer, weil es Leute drängt sehr schnell zusammenzuarbeiten.
Reference: And, though it seems really simple, it's actually pretty hard because it forces people to collaborate very quickly.
Base 0.8: and even though it really seems simple , it is actually really hard , because it really pushes
Dec. Emb. 0.8: and although it really seems simple , it is really hard because it drives people to work together .
Base 0.5: and even though it really seems simple , it is really hard
Dec. Emb. 0.5: it is really hard because it drives people to work together .

Source: Konstrukteure erkennen diese Art der Zusammenarbeit als Kern eines iterativen Vorgangs.
Reference: Designers recognize this type of collaboration as the essence of the iterative process.
Base 0.8: now , traditional constructors recognize this kind of collaboration as the core
Dec. Emb. 0.8: designers recognize this kind of collaboration as the core of iterative .
Base 0.5: now , traditional constructors recognize this kind
Dec. Emb. 0.5: developers recognize this kind of collaboration .

Table 8: Examples

The most common approach to model the target length within NMT is the use of coverage models (Tu et al., 2016). More recently, Lakew et al. (2019) used similar techniques to generate translations with the same length as the source sentence. Compared to these works, we try to significantly reduce the length of the sentence and thereby face a situation where the training and testing conditions differ significantly.

This work on length-controlled machine translation is strongly related to sentence compression, where the compression is performed in the monolingual case. First approaches used rules (Dorr et al., 2003) for extractive sentence compression. For abstractive compression, methods using syntactic translation (Cohn and Lapata, 2008) and phrase-based machine translation (Wubben et al., 2012) were investigated. The success of encoder-decoder models in many areas of natural language processing (Sutskever et al., 2014; Bahdanau et al., 2014) motivated their successful application to sentence compression. Kikuchi et al. (2016) and Takase and Okazaki (2019) investigated approaches to directly control the output length. Although their methods use techniques similar to ours, their models are trained in a supervised way. Motivated by recent success in unsupervised machine translation (Artetxe et al., 2018; Lample et al., 2018), a first approach to learn text compression in an unsupervised fashion was presented by Fevry and Phang (2018). Text compression in a supervised fashion for subtitles was investigated by Angerbauer et al. (2019).

The combination of readability and machine translation has also been researched recently. Agrawal and Carpuat (2019) presented an approach to model readability using source-side annotations. In contrast to our work, they concentrated on the scenario where manually created training data is available. In Marchisio et al. (2019), the authors specified the desired readability level either by a source token or through the architecture, using different encoders. While they concentrate on a single task with only a limited number of difficulty classes, the work presented here is able to handle a huge number of possible output classes (e.g., in text compression, the number of words) and can be applied to different tasks.
In this work, we presented a first approach to length-restricted machine translation. In contrast to work on monolingual sentence compression, we thereby focus on unsupervised methods. By combining the approach with multilingual machine translation, we are also able to perform monolingual unsupervised sentence compression.

We have shown that it is important to feed the length constraints to the decoder in order to achieve translations that fulfill the constraints. Furthermore, modeling the task in an end-to-end fashion improves over splitting the task into different sub-tasks; this is even true for zero-shot conditions.

References
Sweta Agrawal and Marine Carpuat. 2019. Controlling Text Complexity in Neural Machine Translation. arXiv:1911.00835 [cs].

Katrin Angerbauer, Heike Adel, and Thang Vu. 2019. Automatic Compression of Subtitles with Neural Networks and its Effect on User Experience. In Proceedings of the 20th Annual Conference of the International Speech Communication Association (Interspeech 2019), pages 594-598.

Mikel Artetxe, Gorka Labaka, Eneko Agirre, and Kyunghyun Cho. 2018. Unsupervised Neural Machine Translation. In International Conference on Learning Representations.

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural Machine Translation by Jointly Learning to Align and Translate. CoRR, abs/1409.0473.

Ondřej Bojar, Rajen Chatterjee, Christian Federmann, Yvette Graham, Barry Haddow, Matthias Huck, Antonio Jimeno Yepes, Philipp Koehn, Varvara Logacheva, Christof Monz, Matteo Negri, Aurélie Névéol, Mariana Neves, Martin Popel, Matt Post, Raphael Rubino, Carolina Scarton, Lucia Specia, Marco Turchi, Karin Verspoor, and Marcos Zampieri. 2016. Findings of the 2016 Conference on Machine Translation. In Proceedings of the First Conference on Machine Translation: Volume 2, Shared Task Papers, pages 131-198, Berlin, Germany. Association for Computational Linguistics.

Mauro Cettolo, Marcello Federico, Luisa Bentivogli, Jan Niehues, Sebastian Stüker, Katsuhito Sudoh, Koichiro Yoshino, and Christian Federmann. 2017. Overview of the IWSLT 2017 Evaluation Campaign. In Proceedings of the 14th International Workshop on Spoken Language Translation (IWSLT 2017), Tokyo, Japan.

Mauro Cettolo, Jan Niehues, Sebastian Stüker, Luisa Bentivogli, Roberto Cattoni, and Marcello Federico. 2015. The IWSLT 2015 Evaluation Campaign. In Proceedings of the Twelfth International Workshop on Spoken Language Translation (IWSLT 2015), Da Nang, Vietnam.

J.S. Chall and E. Dale. 1995. Readability Revisited: The New Dale-Chall Readability Formula. Brookline Books.

Trevor Cohn and Mirella Lapata. 2008. Sentence Compression Beyond Word Deletion. In Proceedings of the 22nd International Conference on Computational Linguistics (Coling 2008), pages 137-144, Manchester, UK. Coling 2008 Organizing Committee.

Bonnie Dorr, David Zajic, and Richard Schwartz. 2003. Hedge Trimmer: A Parse-and-Trim Approach to Headline Generation. In Proceedings of the HLT-NAACL 03 Text Summarization Workshop, pages 1-8.

Thibault Fevry and Jason Phang. 2018. Unsupervised Sentence Compression using Denoising Auto-Encoders. In Proceedings of the 22nd Conference on Computational Natural Language Learning, pages 413-422, Brussels, Belgium. Association for Computational Linguistics.

Yarin Gal and Zoubin Ghahramani. 2016. A Theoretically Grounded Application of Dropout in Recurrent Neural Networks. In Proceedings of the 30th International Conference on Neural Information Processing Systems (NIPS'16), pages 1027-1035, Barcelona, Spain. Curran Associates Inc.

Thanh-Le Ha, Jan Niehues, and Alex Waibel. 2016. Toward Multilingual Neural Machine Translation with Universal Encoder and Decoder. In Proceedings of the 13th International Workshop on Spoken Language Translation (IWSLT 2016), Seattle, USA.

Thanh-Le Ha, Jan Niehues, and Alex Waibel. 2017. Effective Strategies in Zero-Shot Neural Machine Translation. In Proceedings of the 14th International Workshop on Spoken Language Translation (IWSLT 2017), Tokyo, Japan.

Melvin Johnson, Mike Schuster, Quoc V. Le, Maxim Krikun, Yonghui Wu, Zhifeng Chen, Nikhil Thorat, Fernanda Viégas, Martin Wattenberg, Greg Corrado, Macduff Hughes, and Jeffrey Dean. 2017. Google's Multilingual Neural Machine Translation System: Enabling Zero-Shot Translation. Transactions of the Association for Computational Linguistics, 5:339-351.

Yuta Kikuchi, Graham Neubig, Ryohei Sasano, Hiroya Takamura, and Manabu Okumura. 2016. Controlling Output Length in Neural Encoder-Decoders. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1328-1338, Austin, Texas. Association for Computational Linguistics.

Catherine Kobus, Josep Crego, and Jean Senellart. 2017. Domain Control for Neural Machine Translation. In Proceedings of Recent Advances in Natural Language Processing (RANLP 2017), pages 372-378, Varna, Bulgaria.

Surafel Melaku Lakew, Mattia Di Gangi, and Marcello Federico. 2019. Controlling the Output Length of Neural Machine Translation. In Proceedings of the 16th International Workshop on Spoken Language Translation (IWSLT 2019), Hong Kong. Zenodo.

Guillaume Lample, Alexis Conneau, Ludovic Denoyer, and Marc'Aurelio Ranzato. 2018. Unsupervised Machine Translation Using Monolingual Corpora Only. In International Conference on Learning Representations.

Qingsong Ma, Ondřej Bojar, and Yvette Graham. 2018. Results of the WMT18 Metrics Shared Task: Both characters and embeddings achieve good performance. In Proceedings of the Third Conference on Machine Translation: Shared Task Papers, pages 671-688, Brussels, Belgium. Association for Computational Linguistics.

Qingsong Ma, Johnny Wei, Ondřej Bojar, and Yvette Graham. 2019. Results of the WMT19 Metrics Shared Task: Segment-Level and Strong MT Systems Pose Big Challenges. In Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1), pages 62-90, Florence, Italy. Association for Computational Linguistics.

Kelly Marchisio, Jialiang Guo, Cheng-I Lai, and Philipp Koehn. 2019. Controlling the Reading Level of Machine Translation Output. In Proceedings of MT Summit XVII, volume 1, page 11.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2001. BLEU: a Method for Automatic Evaluation of Machine Translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL '02), page 311, Morristown, NJ, USA. Association for Computational Linguistics.

Rico Sennrich, Alexandra Birch, and Barry Haddow. 2016a. Controlling Politeness in Neural Machine Translation via Side Constraints. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT 2016), pages 35-40, San Diego, California, USA.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016b. Neural Machine Translation of Rare Words with Subword Units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1715-1725, Berlin, Germany. Association for Computational Linguistics.

Hiroki Shimanaka, Tomoyuki Kajiwara, and Mamoru Komachi. 2019. RUSE: Regressor Using Sentence Embeddings for Automatic Machine Translation Evaluation. In Proceedings of the Third Conference on Machine Translation: Shared Task Papers, pages 751-758, Stroudsburg, PA, USA. Association for Computational Linguistics.

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to Sequence Learning with Neural Networks. In Advances in Neural Information Processing Systems 27 (NIPS 2014), pages 3104-3112.

Sho Takase and Naoaki Okazaki. 2019. Positional Encoding to Control Output Sequence Length. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 3999-4004, Minneapolis, Minnesota. Association for Computational Linguistics.

Zhaopeng Tu, Zhengdong Lu, Yang Liu, Xiaohua Liu, and Hang Li. 2016. Modeling Coverage for Neural Machine Translation. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 76-85, Berlin, Germany. Association for Computational Linguistics.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention Is All You Need. CoRR, abs/1706.03762.

Sander Wubben, Antal van den Bosch, and Emiel Krahmer. 2012. Sentence Simplification by Monolingual Machine Translation. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1015-1024, Jeju Island, Korea. Association for Computational Linguistics.

Thomas Zenkel, Joern Wuebker, and John DeNero. 2019. Adding Interpretable Attention to Neural Translation Models Improves Word Alignment. arXiv:1901.11359 [cs].