Big Bidirectional Insertion Representations for Documents
Lala Li
Google Research, Brain Team
[email protected]
William Chan
Google Research, Brain Team
[email protected]
Abstract
The Insertion Transformer is well suited for long-form text generation due to its parallel generation capabilities, requiring O(log n) generation steps to generate n tokens. However, modeling long sequences is difficult, as there is more ambiguity captured in the attention mechanism. This work proposes the Big Bidirectional Insertion Representations for Documents (Big BIRD), an insertion-based model for document-level translation tasks. We scale up insertion-based models to long-form documents. Our key contribution is introducing sentence alignment via sentence-positional embeddings between the source and target document. We show an improvement of +4.3 BLEU on the WMT'19 English→German document-level translation task compared with the Insertion Transformer baseline.
1 Introduction

Recently, insertion-based models (Stern et al., 2019; Welleck et al., 2019; Gu et al., 2019; Chan et al., 2019) have been introduced for text generation. Unlike traditional autoregressive left-to-right models (Cho et al., 2014; Sutskever et al., 2014; Vaswani et al., 2017), insertion-based models are not restricted to generating text sequences in a serial left-to-right manner; instead, they are endowed with the capability of parallel generation. More specifically, Stern et al. (2019) and Chan et al. (2019) showed that we can teach neural nets to generate text following a balanced binary tree order. An autoregressive left-to-right model requires O(n) generation steps to generate n tokens, whereas the Insertion Transformer (Stern et al., 2019) and KERMIT (Chan et al., 2019), following a balanced binary tree policy, require only O(log n) generation steps to generate n tokens. This is especially important for long-form text generation, for example, document-level machine translation.

Document-level machine translation is becoming an increasingly important task. Recent research suggests we are nearing human-level parity for sentence-level translation in certain domains (Hassan et al., 2018); however, we lag significantly behind in document-level translation (Läubli et al., 2018). Various papers have proposed incorporating context for document-level translation (Junczys-Dowmunt, 2019), which has been shown to improve translation quality. There are two primary ways a document-level machine translation model can include context that a sentence-level translation model cannot:

1. Source Contextualization. We can include source context: when we generate the target sentence, we can condition on the corresponding source sentence and its neighbours, or even the whole source document. This allows the target sentence to be contextualized to the source document.

2. Target Contextualization. We can include target context: when we generate the target sentence, we can condition on all the target tokens generated thus far in the whole document. This allows the target sentence to be contextualized to other target sentences.

Target contextualization is especially difficult in an autoregressive left-to-right model (i.e., the Transformer (Vaswani et al., 2017)): the model must generate the whole document in a linear fashion, which would be prohibitively expensive, costing O(n) iterations to generate n tokens. Additionally, the model is unable to capture bidirectional context, since the text is always generated in a left-to-right manner. Some prior work has focused on utilizing block coordinate descent-like algorithms during inference (Maruf and Haffari, 2018); however, this adds complexity and additional runtime cost during inference.

Insertion-based models, for example the Insertion Transformer (Stern et al., 2019), are one potential solution. The Insertion Transformer can generate text following a balanced binary tree order. It requires O(log n) iterations to generate n tokens, offering significant inference-time advantages over a serial generation model. The source document is naturally fully conditioned on, which provides full source contextualization. Additionally, the generation order offers bidirectional contextualization, permitting target contextualization that is not solely on a left-to-right basis.

In this paper, we present Big Bidirectional Insertion Representations for Documents (Big BIRD). We address the limitations of scaling up the Insertion Transformer to document-level machine translation. We present a model that can handle long-form documents with thousands of tokens in a fully contextualized manner.

2 Big BIRD

In this section, we present Big Bidirectional Insertion Representations for Documents (Big BIRD). Big BIRD is an extension of the Insertion Transformer (Stern et al., 2019), scaling up from sentences to documents. The key contributions are 1) extending the context window size to cover a document, and 2) informing the model of sentence-positional information, which is aligned between source and target sentences.
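Before describing these in detail, it may help to make the balanced binary tree generation order concrete. The following is a minimal sketch (our illustration; the function name and structure are ours, not from the original implementation) of the schedule such a policy follows:

```python
import math

def binary_tree_schedule(n):
    """Return the parallel insertion schedule for generating n tokens
    under a balanced binary tree policy: at each step, every open span
    inserts its middle token, so the number of steps is O(log n)
    rather than the O(n) of left-to-right decoding."""
    steps = []
    spans = [(0, n)]  # half-open spans of positions still to be generated
    while spans:
        inserted, next_spans = [], []
        for lo, hi in spans:
            mid = (lo + hi) // 2
            inserted.append(mid)
            if lo < mid:
                next_spans.append((lo, mid))
            if mid + 1 < hi:
                next_spans.append((mid + 1, hi))
        steps.append(inserted)
        spans = next_spans
    return steps

print(binary_tree_schedule(7))      # [[3], [1, 5], [0, 2, 4, 6]]
print(math.ceil(math.log2(7 + 1)))  # 3 parallel steps for 7 tokens
```

Generating n tokens thus takes ceil(log2(n + 1)) parallel steps, e.g., 3 steps for 7 tokens, versus 7 steps for left-to-right decoding.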
Insertion Transformer.
In the Insertion Transformer (Stern et al., 2019), sequences are generated via insertion operations. In the context of machine translation, there is a source canvas x and a target canvas y, where the target canvas is updated at each iteration by inserting one token at each plausible location. At time t during training, a hypothesis target canvas ŷ_t must be a subsequence of the final output. For example, if the final output is [A, B, C, D, E], then ŷ_t = [B, D] would be a valid intermediate canvas, in which case the model would be taught to predict [A, C, E]. The model is taught to insert multiple tokens at incomplete slots, or to predict end-of-slot for completed slots. The intermediate canvases are uniformly sampled from the ground-truth target sequence. During inference, the target canvas starts empty, and tokens are inserted iteratively until the model predicts end-of-slot everywhere, or the sequence exceeds the specified maximum length.
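The slot-target construction just described can be sketched as follows (a minimal illustration under our own naming; the actual training details in Stern et al. (2019), e.g., loss weighting across slots, are omitted):

```python
import random

def sample_canvas_and_targets(target, rng=random):
    """Sample a hypothesis canvas uniformly from subsequences of `target`
    and compute the tokens to be taught at each slot. A slot is the gap
    before each canvas token, plus one final gap; an empty slot list means
    the model should predict end-of-slot there."""
    n = len(target)
    k = rng.randrange(n + 1)                # canvas length, uniform over [0, n]
    keep = sorted(rng.sample(range(n), k))  # which positions stay on the canvas
    canvas = [target[i] for i in keep]
    slots = [[] for _ in range(k + 1)]
    slot = 0
    for i, tok in enumerate(target):
        if slot < k and i == keep[slot]:
            slot += 1                       # this token is already on the canvas
        else:
            slots[slot].append(tok)         # must be inserted into the current slot
    return canvas, slots

random.seed(1)
print(sample_canvas_and_targets(list("ABCDE")))
# e.g. a sampled canvas ['B', 'D'] gives slots [['A'], ['C'], ['E']]:
# insert A before B, C between B and D, and E after D, matching the
# worked example in the text above.
```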
Larger Context and Sentence-Positional Embeddings.
Longer sequences lead to more uncertainty for the Insertion Transformer. For example, if a token in ŷ_t appears in multiple sentences in the final output, it is ambiguous to the model which sentence the token belongs to (and therefore where to attend on both the source and target canvases). While there is positional information endowed in the Transformer model, we hypothesize that token-level positional information is insufficient (especially since we have limited training data). We believe that endowing the model with sentence-level positional information (i.e., which sentence each token belongs to) may significantly help disambiguate such situations and help the model build a more robust attention mechanism.

Based on this motivation, and assuming that the datasets contain not only parallel documents but also sentence alignment between source and target documents (which is true for the WMT'19 document-level translation task), we use sentence-positional embeddings on both the source and target sequences, as shown in Figure 1. The intention is to endow the model with this prior knowledge of sentence alignment between the source and target, so that it can more easily attend to the appropriate sentences based on sentence locality. More specifically, on the source side we do not use any sentence separator tokens; on the target side, we start each sentence with a sentence separator ⟨s⟩. During inference, we initialize the output hypothesis with empty ⟨s⟩ sentence separator tokens, where the number of ⟨s⟩ tokens equals the number of source sentences, which is equal to the number of target sentences to be generated. These ⟨s⟩ tokens serve as sentence anchor points and carry sentence-positional information. Figure 1 visualizes the model.

[Figure 1: Big Bidirectional Insertion Representations for Documents.]

In this work, we increased the context window size to cover multiple sentences or a short document. Note that there is only a limit on the maximum number of tokens in the entire sequence; there is no limit on the length of a single sentence, or on the total number of sentences in the sequence.
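As a concrete illustration of this input construction (our own sketch; names and tokenization are simplified, and the learned embedding lookup is omitted):

```python
def build_inputs(src_sentences, tgt_sentences=None, sep="<s>"):
    """Sketch of the input construction described above. Every token
    carries a sentence-positional index in addition to its ordinary token
    position. The source uses no separator tokens; each target sentence is
    prefixed with a <s> separator that acts as a sentence anchor."""
    src_tokens, src_sent_pos = [], []
    for i, sent in enumerate(src_sentences):
        for tok in sent.split():
            src_tokens.append(tok)
            src_sent_pos.append(i)  # which sentence this token belongs to

    if tgt_sentences is None:
        # Inference: initialize the target canvas with one <s> anchor per
        # source sentence (source and target sentence counts are aligned).
        tgt_tokens = [sep] * len(src_sentences)
        tgt_sent_pos = list(range(len(src_sentences)))
    else:
        tgt_tokens, tgt_sent_pos = [], []
        for i, sent in enumerate(tgt_sentences):
            for tok in [sep] + sent.split():
                tgt_tokens.append(tok)
                tgt_sent_pos.append(i)
    return (src_tokens, src_sent_pos), (tgt_tokens, tgt_sent_pos)

(src, src_pos), (tgt, tgt_pos) = build_inputs(
    ["Hello world .", "How are you ?"])
print(tgt, tgt_pos)  # ['<s>', '<s>'] [0, 1] -- the initial target canvas
```

In the full model, each sentence index would presumably be looked up in a learned embedding table and added to the token and token-positional embeddings, in the usual Transformer fashion.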
3 Experiments

We experiment with the WMT'19 English→German document-level translation task (Barrault et al., 2019). The training dataset consists of parallel document-level data (Europarl, Rapid, News-Commentary) and parallel sentence-level data (WikiTitles, Common Crawl, Paracrawl). The test set is newstest2019. The document-level portion contains 68.4k parallel documents, or a total of 7.7M parallel sentences, while the sentence-level portion has 19.9M parallel sentences. We generated a vocabulary of 32k subwords from the training data using the SentencePiece tokenizer (Kudo and Richardson, 2018).

The Big BIRD model is as described in Section 2, and the baseline Insertion Transformer model has exactly the same configuration except without sentence-positional embeddings. To be explicit, our baseline Insertion Transformer model is also given prior knowledge of the number of source sentences in the document: its target canvas is likewise initialized with ⟨s⟩ sentence separator tokens, where the number of ⟨s⟩ tokens equals the number of sentences in the document. All our models follow the same architecture as the Transformer Base model (Vaswani et al., 2017), with a context window of 1536 tokens during training (determined based on the longest document in the test set).

All models were trained with the SM3 optimizer (Anil et al., 2019) with momentum 0.9, a learning rate of 0.1, and a quadratic learning-rate warm-up schedule with 10k warm-up steps. The learning rate was chosen after preliminary comparison runs between Adam and SM3. We opted for the SM3 optimizer over Adam due to its more memory-efficient properties, allowing us to use larger minibatches. Training ran for around 800k steps at a batch size of 512.

During training, each batch consists of 256 sub-documents and 256 sentences. Sub-documents are continuous sentences dynamically sampled from a document. The lengths of sub-documents are uniformly sampled in (0, 1536] tokens, and the number of sub-documents sampled from each document is 1/10 of the number of sentences in the full document. Sentences come directly from the sentence-level data. This 1:1 mixing of sub-documents and sentences results in training examples of vastly different lengths, and therefore many masked positions; we plan to improve this in the future by packing multiple sentences into one example.
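A sketch of this sub-document sampling, under our reading of the description above (helper names are ours, and the authors' exact procedure may differ, e.g., in how the starting sentence is chosen):

```python
import random

def sample_subdocuments(sentences, sentence_lens, max_len=1536, rng=random):
    """Draw len(sentences) // 10 sub-documents from one document. Each
    sub-document is a run of consecutive sentences packed up to a length
    budget drawn uniformly from (0, max_len] tokens."""
    num_samples = max(1, len(sentences) // 10)
    subdocs = []
    for _ in range(num_samples):
        budget = rng.randint(1, max_len)       # uniform over 1..max_len
        start = rng.randrange(len(sentences))  # arbitrary choice: random start
        picked, used = [], 0
        for sent, n in zip(sentences[start:], sentence_lens[start:]):
            if picked and used + n > budget:   # always keep at least one sentence
                break
            picked.append(sent)
            used += n
        subdocs.append(picked)
    return subdocs
```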
We report sacreBLEU (Post, 2018) scores of the two models in Table 1. Our Big BIRD model outperforms the Insertion Transformer model by +4.3 BLEU.

Model                  BLEU
Insertion Transformer  25.3
Big BIRD               29.6

Table 1: WMT'19 English→German Document-Level Translation.

When we inspected the outputs of the two models more closely, we uncovered an interesting phenomenon. The Insertion Transformer, even though its target canvas is also initialized with the correct number of sentence separators ⟨s⟩, struggles to align source and target sentences. For example, it can map two source sentences into one sentence in the target, or vice versa. This is not always bad, as long as the semantics are captured accurately. However, there are cases where misalignment causes a loss of coherency. Table 2 shows such an example, where Big BIRD captures alignment better than the Insertion Transformer, and therefore its translation is more accurate and coherent.

Source: (...) Chelsea faces Videoton in the UEFA Europa Leaguge at 3 p.m. on Thursday in London.
Target: (...) Chelsea trifft in der UEFA Europa League am Donnerstag um 15 Uhr in London auf Videoton.
Insertion Transformer: (...) Chelsea Gesichter am Donnerstag um 15.00 Uhr in London. Chelsea Gesichter Videoton in der UEFA Europa Leaguge.
Translation (Google Translate): Chelsea faces on Thursday at 15.00 in London. Chelsea faces Videoton in UEFA Europa Leaguge.
Big BIRD: (...) Chelsea sieht am Donnerstag um 15.00 Uhr in London Videoton in der UEFA Europa Leaguge.
Translation (Google Translate): Chelsea sees Videoton in UEFA Europa League on Thursday at 15.00 in London.
Table 2: An example where the Insertion Transformer gets confused with sentence alignment: it maps one sentence from the source into two sentences in the translation and loses semantic accuracy. When given sentence alignment explicitly, i.e., Big BIRD, it translates the sentence coherently.
4 Conclusion

In this paper, we presented Big BIRD, an adaptation of the Insertion Transformer to document-level translation. In addition to a large context window, Big BIRD uses sentence-positional embeddings to directly capture sentence alignment between source and target documents. We showed both quantitatively and qualitatively the promise of Big BIRD, with a +4.3 BLEU improvement over the baseline model, and examples where Big BIRD achieves better translation quality via sentence alignment. We believe Big BIRD is a promising direction for document-level understanding and generation.
References
Rohan Anil, Vineet Gupta, Tomer Koren, and Yoram Singer. 2019. Memory-Efficient Adaptive Optimization for Large-Scale Learning. In arXiv.

Loïc Barrault, Ondřej Bojar, Marta R. Costa-jussà, Christian Federmann, Mark Fishel, Yvette Graham, Barry Haddow, Matthias Huck, Philipp Koehn, Shervin Malmasi, Christof Monz, Mathias Müller, Santanu Pal, Matt Post, and Marcos Zampieri. 2019. Findings of the 2019 Conference on Machine Translation. In ACL.

William Chan, Nikita Kitaev, Kelvin Guu, Mitchell Stern, and Jakob Uszkoreit. 2019. KERMIT: Generative Insertion-Based Modeling for Sequences. In arXiv.

Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation. In EMNLP.

Jiatao Gu, Qi Liu, and Kyunghyun Cho. 2019. Insertion-based Decoding with Automatically Inferred Generation Order. In arXiv.

Hany Hassan, Anthony Aue, Chang Chen, Vishal Chowdhary, Jonathan Clark, Christian Federmann, Xuedong Huang, Marcin Junczys-Dowmunt, William Lewis, Mu Li, Shujie Liu, Tie-Yan Liu, Renqian Luo, Arul Menezes, Tao Qin, Frank Seide, Xu Tan, Fei Tian, Lijun Wu, Shuangzhi Wu, Yingce Xia, Dongdong Zhang, Zhirui Zhang, and Ming Zhou. 2018. Achieving Human Parity on Automatic Chinese to English News Translation. In arXiv.

Marcin Junczys-Dowmunt. 2019. Microsoft Translator at WMT 2019: Towards Large-Scale Document-Level Neural Machine Translation. In WMT.

Taku Kudo and John Richardson. 2018. SentencePiece: A Simple and Language Independent Subword Tokenizer and Detokenizer for Neural Text Processing. In EMNLP: System Demonstrations, pages 66-71.

Samuel Läubli, Rico Sennrich, and Martin Volk. 2018. Has Machine Translation Achieved Human Parity? A Case for Document-level Evaluation. In EMNLP.

Sameen Maruf and Gholamreza Haffari. 2018. Document Context Neural Machine Translation with Memory Networks. In ACL.

Matt Post. 2018. A Call for Clarity in Reporting BLEU Scores. In WMT.

Mitchell Stern, William Chan, Jamie Kiros, and Jakob Uszkoreit. 2019. Insertion Transformer: Flexible Sequence Generation via Insertion Operations. In ICML.

Ilya Sutskever, Oriol Vinyals, and Quoc Le. 2014. Sequence to Sequence Learning with Neural Networks. In NIPS.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention Is All You Need. In NIPS.

Sean Welleck, Kiante Brantley, Hal Daume, and Kyunghyun Cho. 2019. Non-Monotonic Sequential Text Generation. In ICML.