SLUA: A Super Lightweight Unsupervised Word Alignment Model via Cross-Lingual Contrastive Learning
Di Wu†, Liang Ding‡, Shuo Yang‡, Dacheng Tao‡
†Peking University  ‡The University of Sydney

Abstract
Word alignment is essential for downstream cross-lingual language understanding and generation tasks. Recently, the performance of neural word alignment models [Zenkel et al., 2020; Garg et al., 2019; Ding et al., 2019] has exceeded that of statistical models. However, they heavily rely on sophisticated translation models. In this study, we propose a Super Lightweight Unsupervised word Alignment (SLUA) model, in which a bidirectional symmetric attention trained with a contrastive learning objective is introduced, and an agreement loss is employed to bind the attention maps, such that the alignments follow a mirror-like symmetry hypothesis. Experimental results on several public benchmarks demonstrate that our model achieves competitive, if not better, performance compared to the state of the art in word alignment, while significantly reducing training and decoding time on average. Further ablation analysis and case studies show the superiority of our proposed SLUA. Notably, we recognize our model as a pioneering attempt to unify bilingual word embedding and word alignment. Encouragingly, our approach achieves a nearly 15× speedup against GIZA++ and at least 50× parameter compression compared with Transformer-based alignment methods. We will release our code to facilitate the community.

1 Introduction

Word alignment, aiming to find the word-level correspondence between a pair of parallel sentences, is a core component of statistical machine translation (SMT) [Brown et al., 1993]. It has also benefited several downstream tasks, e.g., computer-aided translation [Dagan et al., 1993], semantic role labeling [Kozhevnikov and Titov, 2013], cross-lingual dataset creation [Yarowsky et al., 2001], and cross-lingual modeling [Ding et al., 2020]. Recently, in the era of neural machine translation (NMT) [Bahdanau et al., 2015; Vaswani et al., 2017], the attention mechanism plays the role of the alignment model in translation systems. Unfortunately, [Koehn and Knowles, 2017] show that attention may dramatically diverge from word alignment.
Figure 1: Two examples of word alignment. The upper and bottom cases are the Chinese and Japanese references, respectively.

[Li et al., 2019; Ghader and Monz, 2017] also confirm this finding. Although some studies attempt to mitigate this problem, most of them rely on sophisticated translation architectures [Zenkel et al., 2020; Garg et al., 2019]. These methods are trained with a translation objective, which computes the probability of each target token conditioned on the source tokens and the previous target tokens. This brings a tremendous number of parameters and noisy alignments. More recent work avoids the noisy alignments of translation models but relies on expensive human-annotated alignments [Stengel-Eskin et al., 2019]. Given these disadvantages, simple statistical alignment tools, e.g.,
FastAlign [Dyer et al., 2013] and GIZA++ [Och and Ney, 2003] (the latter employs IBM Model 4 as its default setting), are still the most representative solutions due to their efficiency and unsupervised fashion. We argue that the word alignment task is, intuitively, much simpler than translation, and should thus be performed before translation rather than by inducing alignment matrices with heavy neural machine translation models. For example, the IBM word alignment model, e.g., FastAlign, is a prerequisite of SMT. However, research on super lightweight neural word alignment without NMT is currently very scarce.
Inspired by cross-lingual word embeddings (CLWEs) [Luong et al., 2015], we propose a super lightweight unsupervised word alignment model in §3, named SLUA, which encourages the embeddings of aligned words to be closer. We also provide a theoretical justification from the mutual information perspective for our proposed contrastive learning objective in §3.4, demonstrating its reasonableness. Figure 1 shows an English sentence, its corresponding Chinese and Japanese sentences, and their word alignments. The links indicate the correspondence between English⇔Chinese and English⇔Japanese words. If the Chinese word "举行" can be aligned to the English word "held", the reverse mapping should also hold. Specifically, a bidirectional attention mechanism with contrastive estimation is proposed to capture the alignment between parallel sentences. In addition, we employ an agreement loss to constrain the attention maps such that the alignments follow the symmetry hypothesis [Liang et al., 2006].

Our contributions can be summarized as follows:

• We propose a super lightweight unsupervised alignment model (SLUA) that, while merely updating the embedding matrices, achieves better alignment quality on several public benchmark datasets compared to baseline models, while preserving training efficiency comparable to FastAlign.

• To boost the performance of SLUA, we design a theoretically and empirically validated bidirectional symmetric attention with a contrastive learning objective for the word alignment task, in which we introduce an extra objective to follow the mirror-like symmetry hypothesis.

• Further analysis shows that the by-product of our model in the training phase is able to learn bilingual word representations, which opens the possibility of unifying these two tasks in the future.

2 Related Work

Word alignment studies can be divided into two classes:
Statistical Models
Statistical alignment models directly build on the lexical translation models of [Brown et al., 1993], also known as the IBM models. The most popular implementations of these statistical alignment models are FastAlign [Dyer et al., 2013] and GIZA++ [Och and Ney, 2000; Och and Ney, 2003]. For optimal performance, the training pipeline of GIZA++ relies on multiple iterations of IBM Model 1, Model 3, Model 4, and the HMM alignment model [Vogel et al., 1996]. Initialized with parameters from previous models, each subsequent model adds more assumptions about word alignments: Model 2 introduces non-uniform distortion, and Model 3 introduces fertility; Model 4 and the HMM alignment model introduce relative distortion, where the likelihood of the position of each alignment link is conditioned on the position of the previous alignment link. FastAlign [Dyer et al., 2013], which is based on a reparametrization of IBM Model 2, is nearly the fastest existing word aligner while preserving alignment quality. In contrast to GIZA++, our SLUA model achieves a nearly 15× speedup during training while achieving comparable performance. Encouragingly, our model is also at least 1.5× faster to train than FastAlign and consistently outperforms it.

Neural Models
Most neural alignment approaches in the literature, such as [Alkhouli et al., 2018], rely on alignments generated by statistical systems, which are used as supervision for training the neural systems. These approaches tend to learn to copy the alignment errors of the supervising statistical models. [Zenkel et al., 2019] use attention to extract alignments from a dedicated alignment layer of a neural model without using any output from a statistical aligner, but fail to match the quality of GIZA++. [Garg et al., 2019] represent the current state of the art in word alignment, outperforming GIZA++ by training a single model that is able to both translate and align. This model is supervised with a guided alignment loss, and existing word alignments must be provided to the model during training. [Garg et al., 2019] can produce alignments using an end-to-end neural training pipeline guided by attention activations, but this approach underperforms GIZA++; its performance surpasses GIZA++ only when the guided alignment loss is trained on GIZA++ output. [Stengel-Eskin et al., 2019] introduce a discriminative neural alignment model that uses a dot-product-based distance measure between learned source and target representations to predict whether a given source-target pair should be aligned; alignment decisions are conditioned on neighboring decisions using convolutions, and the model is trained on gold alignments. [Zenkel et al., 2020] also use guided alignment training, and with a large number of modules and parameters they surpass the alignment quality of GIZA++.

These methods either use translation models for the alignment task, which introduces an extremely large number of parameters (compared to ours) and makes training and deployment cumbersome, or they train the model with alignment supervision, even though such alignment data is scarce in practice, especially for low-resource languages. These settings make the above approaches less versatile. Instead, our approach is fully unsupervised at the word level, that is, it does not require gold alignments generated by human annotators during training. Moreover, our model achieves comparable performance while being at least 50 times smaller.

3 Methodology
Our model is trained in an unsupervised fashion, where word-level alignments are not provided. Therefore, we need to leverage the sentence-level supervision of the parallel corpus. To achieve this, we introduce a negative sampling strategy with contrastive learning to fully exploit the corpus. Besides, inspired by the concept of cross-lingual word embeddings, we design the model under the following assumption:
If a target token can be aligned to a source token, then the dot product of their embedding vectors should be large.
Figure 2 shows the schema of our approach SLUA.
For a given source-target sentence pair (s, t), s_i, t_j ∈ R^d represent the i-th and j-th word embeddings of the source and target sentences, respectively. In order to capture the contextualized information of each word, we perform a mean pooling
operation with the representations of its surrounding words, with padding used to preserve the sequence length. As a result, the final representation of each word is calculated by element-wise addition of the mean-pooled embedding and its original embedding:

    x_i = MeanPool([s_i]_win) + s_i,    (1)

where win is the pooling window size. We can therefore derive the sentence-level representations (x_1, x_2, ..., x_|s|) and (y_1, y_2, ..., y_|t|) for s and t.

Figure 2: Illustration of SLUA, where a pair of sentences is given as an example. Each x_i and y_j is the representation of a word in the source and target part, respectively. Given y_j, we can calculate a context vector in the source part. The NCE training objective encourages the dot product of this context vector and y_j to be large; the process in the other direction is consistent. By stacking all of the soft weights, two attention maps A_{s→t} and A_{t→s} are produced, which are bound by an agreement loss to encourage symmetry.

Bidirectional Symmetric Attention

Bidirectional symmetric attention is the basic component of our proposed model. The aim of this module is to generate the source-to-target (aka s2t) and target-to-source (aka t2s) soft attention maps. The attention mechanism works as follows: given a source-side word representation x_i as query q_i ∈ R^d, we pack all the target tokens together into a matrix V_t ∈ R^{|t|×d}. The attention context is calculated as:

    Attention(q_i, V_t, V_t) = (a_t^i · V_t)^T,    (2)

where the vector a_t^i ∈ R^{1×|t|} represents the attention probabilities of the query q_i over all the target tokens, in which each element signifies the relevance to the query, derived as:

    a_t^i = Softmax(V_t · q_i)^T.    (3)

For simplicity, we denote the attention context of q_i on the target side as att_t(q_i). The s2t attention map A_{s,t} ∈ R^{|s|×|t|} is constructed by stacking the probability vectors a_t^i corresponding to all the source tokens. Reversely, we can obtain the t2s attention map A_{t,s} in a symmetric way. These two attention matrices A_{s,t} and A_{t,s} are then used to decode alignment links. Taking s2t as an example, given a target token, the source token with the highest attention weight is viewed as the aligned word.

Agreement Mechanism

Intuitively, the two attention matrices A_{s,t} and A_{t,s}^T should be very close. However, the attention mechanism suffers from symmetry errors between the two directions [Koehn and Knowles, 2017]. To bridge this discrepancy, we introduce the agreement mechanism [Liang et al., 2006], acting like a mirror that precisely reflects the matching degree between A_{s,t} and A_{t,s}, which has also been empirically confirmed to help in machine translation [Levinboim et al., 2015]. In particular, we use an agreement loss to bind the above two matrices:

    Loss_disagree = Σ_i Σ_j ( A_{s,t}(i,j) − A_{t,s}(j,i) )^2.    (4)

In §4.6, we empirically show that this agreement is complementary to the bidirectional symmetric constraint, demonstrating the effectiveness of this component.
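To make the computation concrete, the following is a minimal PyTorch sketch of Eqs. (1)-(4): the mean-pooled word representations, the two soft attention maps, and the agreement loss. The function and tensor names (contextual_repr, attention_map, agreement_loss) are illustrative only and do not correspond to any released implementation.

```python
# A minimal sketch of Eqs. (1)-(4), assuming one sentence pair at a time.
import torch
import torch.nn.functional as F

def contextual_repr(emb: torch.Tensor, win: int = 3) -> torch.Tensor:
    """Eq. (1): x_i = MeanPool([s_i]_win) + s_i, padded to keep the length."""
    pooled = F.avg_pool1d(emb.t().unsqueeze(0), kernel_size=win, stride=1,
                          padding=win // 2, count_include_pad=False)
    return pooled.squeeze(0).t() + emb              # (seq_len, d)

def attention_map(queries: torch.Tensor, values: torch.Tensor) -> torch.Tensor:
    """Eq. (3), stacked over all queries: row i is Softmax(V · q_i)."""
    return F.softmax(queries @ values.t(), dim=-1)

def agreement_loss(a_st: torch.Tensor, a_ts: torch.Tensor) -> torch.Tensor:
    """Eq. (4): squared disagreement between A_{s,t} and A_{t,s}^T."""
    return ((a_st - a_ts.t()) ** 2).sum()

# Toy usage: a 5-token source and 4-token target sentence with d = 256.
src = contextual_repr(torch.randn(5, 256))          # (|s|, d)
tgt = contextual_repr(torch.randn(4, 256))          # (|t|, d)
a_st = attention_map(src, tgt)                      # A_{s,t}: (|s|, |t|)
a_ts = attention_map(tgt, src)                      # A_{t,s}: (|t|, |s|)
ctx_t = a_st @ tgt                                  # Eq. (2): att_t(q_i) for all i
s2t_links = a_st.argmax(dim=0)                      # per target token, best source
loss_disagree = agreement_loss(a_st, a_ts)
```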
Contrastive Learning Objective

Suppose that (q_i, att_t(q_i)) is a pair of an s2t word representation and its corresponding attention context sampled from the joint distribution p_t(q, att_t(q)) (hereinafter called a positive pair). The primary objective of the s2t training is to maximize the alignment degree between the elements within a positive pair. Thus, we first define an alignment function using the sigmoid inner product:

    Align(q, att_t(q)) = σ(⟨q, att_t(q)⟩),    (5)

where σ(·) denotes the sigmoid function and ⟨·, ·⟩ is the inner product operation. However, merely optimizing the alignment of positive pairs ignores important positive-negative relational knowledge [Mikolov et al., 2013]. To make the training process more informative, we reform the overall objective in the contrastive learning manner [Saunshi et al., 2019; Oord et al., 2018] with the Noise Contrastive Estimation (NCE) loss [Mikolov et al., 2013]. Specifically, we first sample k negative word representations q_j from the marginal p_t(q). Then we formulate the overall NCE objective as follows:

    Loss_{s→t}^i = −E_{att_t(q_i), q_i, q_j} [ log ( Align(q_i, att_t(q_i)) / ( Align(q_i, att_t(q_i)) + Σ_{j=1}^{k} Align(q_j, att_t(q_i)) ) ) ],    (6)

It is evident that the objective in Eq. (6) explicitly encourages the alignment of the positive pair (q_i, att_t(q_i)) while simultaneously separating the negative pairs (q_j, att_t(q_i)).

Method     EN-FR  FR-EN  sym   RO-EN  EN-RO  sym   DE-EN  EN-DE  sym
NNSA       22.2   24.2   15.7  47.0   45.5   40.3  36.9   36.3   29.5
FastAlign  16.4   15.9   10.5  33.8   35.5   32.1  28.4   32.0   27.0
SLUA       –      –       9.2  –      –      31.6  –      –      24.8

Table 1: AER of each method in each direction. "sym" means grow-diag symmetrization.
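As a sanity check on Eqs. (5) and (6), here is a small sketch of the alignment function and the per-token NCE term; it reuses the toy tensors from the previous sketch, and the names (align_fn, nce_loss) are again ours, not the authors'.

```python
# Sketch of Eqs. (5)-(6); negatives q_j are other in-sentence tokens here,
# though they may also come from other sentences (see the note after Eq. (8)).
import torch

def align_fn(q: torch.Tensor, ctx: torch.Tensor) -> torch.Tensor:
    """Eq. (5): sigmoid of the inner product <q, att_t(q)>."""
    return torch.sigmoid((q * ctx).sum(dim=-1))

def nce_loss(q_i: torch.Tensor, ctx_i: torch.Tensor,
             negatives: torch.Tensor) -> torch.Tensor:
    """Eq. (6) for one positive pair (q_i, att_t(q_i)) and k negatives q_j."""
    pos = align_fn(q_i, ctx_i)                                   # scalar
    neg = align_fn(negatives, ctx_i.expand_as(negatives)).sum()  # sum over k
    return -torch.log(pos / (pos + neg))

# e.g., for source position 0 with k = 3 in-sentence negatives:
# loss_0 = nce_loss(src[0], ctx_t[0], src[1:4])
```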
Moreover, a direct consequence of minimizing Eq. (6) is that the optimal estimate of the alignment between the representation and its attention context is proportional to the ratio of the joint distribution to the product of the marginals, p_t(q, att_t(q)) / (p_t(q) · p_t(att_t(q))), which is the point-wise mutual information. We can further state the following proposition with respect to the mutual information:

Proposition 1. The mutual information between the word representation q and its corresponding attention context att_t(q) is lower-bounded by the negative Loss_{s→t}^i in Eq. (6):

    I(q, att_t(q)) ≥ log(k) − Loss_{s→t}^i,    (7)

where k is the number of negative samples.

The detailed proof can be found in [Oord et al., 2018]. Proposition 1 indicates that the lower bound of the mutual information I(q, att_t(q)) can be maximized by achieving the optimal NCE loss, which provides a theoretical guarantee for our proposed method.

Our training schema over parallel sentences is mainly inspired by the bilingual skip-gram model [Luong et al., 2015] and invertibility modeling [Levinboim et al., 2015]. Therefore, the ultimate training objective should consider both the forward (s→t) and backward (t→s) directions, combined with the mirror agreement loss. Technically, the final training objective is:

    Loss = Σ_{i=1}^{|t|} Loss_{s→t}^i + Σ_{j=1}^{|s|} Loss_{t→s}^j + α · Loss_disagree,    (8)

where Loss_{s→t} and Loss_{t→s} are symmetrical and α is a loss weight that balances the likelihood and disagreement losses.

(In the contrastive learning setting, q_j and att_t(q_i) can be sampled from different sentences. If q_j and att_t(q_i) are from the same sentence, i ≠ j; otherwise, j can be a random index within the sentence length. For simplicity, we use q_j with i ≠ j to denote the negative samples, although with a little ambiguity.)
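Continuing the toy sketch, Eq. (8) combines the per-token NCE terms of both directions with the weighted agreement loss. The schematic below reuses the helper functions from the earlier sketches; the in-sentence negative sampling and the value of alpha are illustrative assumptions.

```python
# Schematic combination of Eq. (8) on one sentence pair, reusing
# attention_map, agreement_loss, and nce_loss from the sketches above.
import torch

def slua_loss(src, tgt, alpha: float = 0.5, k: int = 5) -> torch.Tensor:
    a_st, a_ts = attention_map(src, tgt), attention_map(tgt, src)
    ctx_t, ctx_s = a_st @ tgt, a_ts @ src        # att_t(q_i), att_s(q_j)
    loss = alpha * agreement_loss(a_st, a_ts)
    for i in range(src.size(0)):                 # forward (s -> t) terms
        neg = src[[j for j in torch.randperm(src.size(0)).tolist()
                   if j != i][:k]]
        loss = loss + nce_loss(src[i], ctx_t[i], neg)
    for j in range(tgt.size(0)):                 # backward (t -> s) terms
        neg = tgt[[i for i in torch.randperm(tgt.size(0)).tolist()
                   if i != j][:k]]
        loss = loss + nce_loss(tgt[j], ctx_s[j], neg)
    return loss
```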
Model                  EN-FR  RO-EN  DE-EN
Naive Attention        31.4   39.8   50.9
NNSA                   15.7   40.3   –
FastAlign              10.5   32.1   27.0
SLUA                    9.2   31.6   24.8
[Zenkel et al., 2020]   8.4   24.1   17.9
[Garg et al., 2019]     7.7   26.0   20.2
GIZA++                  5.5   26.5   18.7

Table 2: Alignment performance (AER, with the grow-diag heuristic) of each model.

4 Experiments

Datasets

We evaluate our method on three widely used datasets: English-French (EN-FR), Romanian-English (RO-EN), and German-English (DE-EN). Training and test data for EN-FR and RO-EN are from the NAACL 2003 shared tasks [Mihalcea and Pedersen, 2003]. For RO-EN, we merge in the Europarl v8 corpus, increasing the amount of training data from 49K to 0.4M sentence pairs. For DE-EN, we use the Europarl v7 corpus as training data and test on the gold alignments. All the above data are lowercased and tokenized with Moses. The evaluation metrics are Precision (P), Recall (R), F-score (F1), and Alignment Error Rate (AER).
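For reference, AER follows the standard definition of [Och and Ney, 2000]: given the predicted links A, sure links S, and possible links P (with S ⊆ P),

    AER(A; S, P) = 1 − (|A ∩ S| + |A ∩ P|) / (|A| + |S|),

so lower values indicate better alignments.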
Baselines

Besides two strong statistical alignment models, i.e., FastAlign and GIZA++, we also compare our approach with neural alignment models, which induce alignments either from attention weights or through feature importance measures.
FastAlign
One of the most popular statistical methods, proposed by [Dyer et al., 2013], which log-linearly reparameterizes IBM Model 2.
GIZA++
A statistical generative model [Och and Ney, 2003], in which parameters are estimated using the Expectation-Maximization (EM) algorithm, allowing it to automatically extract a bilingual lexicon from a parallel corpus without any annotated data.
NNSA
An unsupervised neural alignment model proposed by [Legrand et al., 2016], which applies an aggregation operation borrowed from computer vision to design a sentence-level matching loss. In addition to the raw word indices, the following three extra features are introduced: distance to the diagonal, part-of-speech, and unigram character position. To make a fair comparison, we report the results of the raw-feature NNSA.

Figure 3: A visualized alignment example. (a-c) illustrate the effects of gradually adding the symmetric components, (d) shows the result of FastAlign, and (e) is the ground truth. The more emphasis placed on the symmetry of the model, the better the alignment results achieved; meanwhile, as depicted, the attention maps become more and more diagonally concentrated.
Naive Attention
Averaging all attention matrices in the Transformer architecture and selecting, for each target unit, the source unit with the maximal attention value as the alignment. We borrow the results reported in [Zenkel et al., 2019] to highlight the weakness of this naive approach, where significant improvements are achieved after introducing an extra alignment layer.
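The sketch below spells out this baseline's extraction step; attn_stack is a hypothetical tensor collecting the attention maps of all layers and heads.

```python
# Sketch of the naive-attention baseline: average all attention maps, then
# pick the argmax source unit per target unit. `attn_stack` is hypothetical.
import torch

def naive_attention_alignments(attn_stack: torch.Tensor):
    """attn_stack: (num_maps, tgt_len, src_len) attention weights."""
    avg = attn_stack.mean(dim=0)            # (tgt_len, src_len)
    best_src = avg.argmax(dim=-1)           # maximal source unit per target
    return [(int(s), t) for t, s in enumerate(best_src.tolist())]
```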
Others

[Garg et al., 2019] and [Zenkel et al., 2020] represent the current developments in word alignment, both outperforming GIZA++. However, they both implement the alignment model on top of a sophisticated translation model. Furthermore, the former uses the output of GIZA++ as supervision, and the latter introduces a pre-trained state-of-the-art neural translation model. It is therefore unfair to compare our results directly with theirs; we report them in Table 2 as references.
Settings

For our method (SLUA), all the source and target embeddings are initialized with the Xavier method [Glorot and Bengio, 2010]. The embedding size d and the pooling window size win are set to 256 and 3, respectively. The hyper-parameter α is tuned by grid search from 0.0 to 1.0 at 0.1 intervals. For FastAlign, we train from scratch with the open-source pipeline (https://github.com/lilt/alignment-scripts). We also report the results of NNSA and the machine-translation-based models (§4.2). All experiments of SLUA are run on one Nvidia K80 GPU; the CPU is an Intel(R) Xeon(R) CPU E5-2620 v3 @ 2.40GHz. Both FastAlign and SLUA take nearly half an hour to train on one million samples.
Main Results

Table 2 summarizes the AER of our method over several language pairs. Our model outperforms all the other baseline models. Compared to FastAlign, we achieve 1.3, 0.5, and 2.2 AER improvements on EN-FR, RO-EN, and DE-EN, respectively. Notably, our model exceeds the naive attention model by a big margin in terms of AER (ranging from 8.2 to 26.1) over all language pairs. We attribute the poor performance of the straightforward attention model (translation model) to its contextualized word representations: for instance, when translating a verb, contextual information is attended to in order to determine the form (e.g., tense) of the word, which may interfere with word alignment.

Experimental results in the different alignment directions can be found in Table 1. The grow-diag symmetrization benefits all the models.
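The "sym" numbers in Tables 1 and 2 come from merging the two directional alignments. Below is a sketch of a common grow-diag heuristic in the spirit of [Och and Ney, 2003]; the exact variant used in the alignment-scripts pipeline may differ.

```python
# Sketch of grow-diag symmetrization: start from the intersection of the two
# directional link sets and grow along (diagonal) neighbors using the union.
def grow_diag(forward: set, backward: set) -> set:
    alignment = forward & backward
    union = forward | backward
    neighbors = [(-1, 0), (0, -1), (1, 0), (0, 1),
                 (-1, -1), (-1, 1), (1, -1), (1, 1)]
    grew = True
    while grew:                               # grow until no link can be added
        grew = False
        for (i, j) in sorted(alignment):
            for di, dj in neighbors:
                ni, nj = i + di, j + dj
                src_free = all(p != ni for p, _ in alignment)
                tgt_free = all(q != nj for _, q in alignment)
                if (ni, nj) in union and (ni, nj) not in alignment \
                        and (src_free or tgt_free):
                    alignment.add((ni, nj))
                    grew = True
    return alignment

# e.g., grow_diag({(0, 0), (1, 2)}, {(0, 0), (1, 1), (1, 2)})
```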
          china                        distinctive
EN          DE                  EN            DE
china       chinas              distinctive   unverwechselbaren
chinese     china               distinct      besonderheiten
china's     chinesische         peculiar      markante
republic    chinesischer        differences   charakteristische
china'      chinesischem        diverse       einzelnen

          cat                          love
EN          DE                  EN            DE
cat         hundefelle          love          liebe
dog         katzenfell          affection     liebt
toys        hundefellen         loved         liebe
cats        kuchen              loves         lieben
dogs        schlafen            passion       lieb

Table 4: Top 5 nearest English (EN) and German (DE) words for each of the following words: china, distinctive, easily, cat, love, and January.

Efficiency

Taking the experiment on the EN-FR dataset as an example, SLUA converges to its best performance after running 3 epochs and taking 14 minutes in total, where FastAlign and GIZA++ cost 21 and 230 minutes, respectively, to achieve their best results. Notably, the time consumption rises by dozens of times in the neural translation fashion. All experiments of SLUA are run on a single Nvidia P40 GPU.
Ablation Study

To further explore the effects of the components of SLUA (i.e., the bidirectional symmetric attention and the agreement loss), we conduct an ablation study. Table 3 shows the results on the EN-FR dataset. When the model is trained using only Loss_{s→t} or Loss_{t→s} as the loss function, the AER is quite high (20.9 and 23.3, respectively). As expected, the combined loss function improves the alignment quality significantly (14.1 AER). It is noteworthy that, with the rectification of the agreement mechanism, the final combination achieves the best result (9.2 AER), indicating that the agreement mechanism is the most important component of SLUA.

Setup                         P  R  F1  AER
Loss_{s→t}                    –  –  –   20.9
Loss_{t→s}                    –  –  –   23.3
Loss_{s↔t}                    –  –  –   14.1
Loss_{s↔t} + Loss_disagree    –  –  –    9.2

Table 3: Ablation results on the EN-FR dataset.

To better present the improvements brought by adding each component, we visualize an alignment case in Figure 3. As we can see, each component is complementary to the others; that is, the attention map becomes more diagonally concentrated after adding the bidirectional symmetric attention and the agreement constraint.

Figure 4: Example of a DE-EN alignment. (a) is the result of FastAlign, and (b) shows the result of our model, which is closer to the gold alignment. The horizontal axis shows the German sentence "wir glauben nicht , daß wir nur rosinen herauspicken sollten .", and the vertical axis shows the English sentence "we do not believe that we should cherry-pick .".

Alignment Case Study
We analyze an alignment example in Figure 4. Compared to FastAlign, our model correctly aligns "do not believe" in English to "glauben nicht" in German. Our model, being based on word representations, makes better use of semantics to accomplish alignment, such that inverted phrases like "glauben nicht" can be handled well. In contrast, FastAlign, which relies on a positional assumption, fails here. (FastAlign introduces a positional feature h to encourage alignments to occur around the diagonal: h(i, j, m, n) = −|i/m − j/n|, where i and j are source and target indices and m and n are the lengths of the sentence pair.)

Word Embedding Clustering
To further investigate the effectiveness of our model, we also analyze the word embeddings it learns. In particular, following [Collobert et al., 2011], we show some words together with their nearest neighbors, measured by the Euclidean distance between their embeddings. Table 4 shows some examples, demonstrating that our learned representations possess a clear clustering structure both bilingually and monolingually. We attribute the better alignment results to this ability of our model to learn bilingual word representations.
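A sketch of how such neighbor lists can be produced from the learned embedding matrix follows; vocab and emb are assumed stand-ins for the model's vocabulary list and embedding table.

```python
# Euclidean nearest-neighbor probe over the (bilingual) embedding space,
# following the protocol of [Collobert et al., 2011]. Names are illustrative.
import torch

def nearest_neighbors(word: str, vocab: list, emb: torch.Tensor, topk: int = 5):
    """vocab: list of words; emb: (|V|, d) embedding matrix (rows match vocab)."""
    q = emb[vocab.index(word)].unsqueeze(0)          # (1, d) query vector
    dist = torch.cdist(q, emb).squeeze(0)            # Euclidean distances
    order = dist.argsort().tolist()                  # closest first
    return [vocab[i] for i in order if vocab[i] != word][:topk]
```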
5 Conclusion

In this paper, we presented a super lightweight neural alignment model, named SLUA, which achieves better alignment performance than FastAlign and other existing neural alignment models while preserving training efficiency. We empirically and theoretically show its effectiveness and reasonableness over several language pairs. In future work, we will further explore the relationship between CLWEs and word alignments: a promising attempt is to use our model as a bridge to unify cross-lingual embeddings and word alignment tasks. It will also be interesting to design alignment models in a non-autoregressive fashion [Gu et al., 2018; Wu et al., 2020] to achieve efficient inference.

References

[Alkhouli et al., 2018] Tamer Alkhouli, Gabriel Bretschner, and Hermann Ney. On the alignment problem in multi-head attention-based neural machine translation. In WMT, 2018.
[Bahdanau et al., 2015] Dzmitry Bahdanau, KyungHyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. In
ICLR, 2015.
[Brown et al., 1993] Peter F. Brown, Vincent J. Della Pietra, Stephen A. Della Pietra, and Robert L. Mercer. The mathematics of statistical machine translation: Parameter estimation. Computational Linguistics, 1993.
[Collobert et al., 2011] Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. Natural language processing (almost) from scratch. Journal of Machine Learning Research, 2011.
[Dagan et al., 1993] Ido Dagan, Kenneth Church, and Willian Gale. Robust bilingual word alignment for machine aided translation. In Very Large Corpora: Academic and Industrial Perspectives, 1993.
[Ding et al., 2019] Shuoyang Ding, Hainan Xu, and Philipp Koehn. Saliency-driven word alignment interpretation for neural machine translation. In WMT, 2019.
[Ding et al., 2020] Liang Ding, Longyue Wang, and Dacheng Tao. Self-attention with cross-lingual position representation. In ACL, 2020.
[Dyer et al., 2013] Chris Dyer, Victor Chahuneau, and Noah A. Smith. A simple, fast, and effective reparameterization of IBM Model 2. In NAACL, 2013.
[Garg et al., 2019] Sarthak Garg, Stephan Peitz, Udhyakumar Nallasamy, and Matthias Paulik. Jointly learning to align and translate with transformer models. In EMNLP, 2019.
[Ghader and Monz, 2017] Hamidreza Ghader and Christof Monz. What does attention in neural machine translation pay attention to? In IJCNLP, 2017.
[Glorot and Bengio, 2010] Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In AISTATS, 2010.
[Gu et al., 2018] Jiatao Gu, James Bradbury, Caiming Xiong, Victor O.K. Li, and Richard Socher. Non-autoregressive neural machine translation. In ICLR, 2018.
[Koehn and Knowles, 2017] Philipp Koehn and Rebecca Knowles. Six challenges for neural machine translation. In WNMT, 2017.
[Kozhevnikov and Titov, 2013] Mikhail Kozhevnikov and Ivan Titov. Cross-lingual transfer of semantic role labeling models. In ACL, 2013.
[Legrand et al., 2016] Joël Legrand, Michael Auli, and Ronan Collobert. Neural network-based word alignment through score aggregation. In WMT, 2016.
[Levinboim et al., 2015] Tomer Levinboim, Ashish Vaswani, and David Chiang. Model invertibility regularization: Sequence alignment with or without parallel data. In NAACL, 2015.
[Li et al., 2019] Xintong Li, Guanlin Li, Lemao Liu, Max Meng, and Shuming Shi. On the word alignment from neural machine translation. In ACL, 2019.
[Liang et al., 2006] Percy Liang, Ben Taskar, and Dan Klein. Alignment by agreement. In NAACL, 2006.
[Luong et al., 2015] Thang Luong, Hieu Pham, and Christopher D. Manning. Bilingual word representations with monolingual quality in mind. In NAACL Workshop, 2015.
[Mihalcea and Pedersen, 2003] Rada Mihalcea and Ted Pedersen. An evaluation exercise for word alignment. In NAACL, 2003.
[Mikolov et al., 2013] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. In NIPS, 2013.
[Och and Ney, 2000] Franz Josef Och and Hermann Ney. Improved statistical alignment models. In ACL, 2000.
[Och and Ney, 2003] Franz Josef Och and Hermann Ney. A systematic comparison of various statistical alignment models. Computational Linguistics, 2003.
[Oord et al., 2018] Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding. arXiv, 2018.
[Saunshi et al., 2019] Nikunj Saunshi, Orestis Plevrakis, Sanjeev Arora, Mikhail Khodak, and Hrishikesh Khandeparkar. A theoretical analysis of contrastive unsupervised representation learning. In ICML, 2019.
[Stengel-Eskin et al., 2019] Elias Stengel-Eskin, Tzu-Ray Su, Matt Post, and Benjamin Van Durme. A discriminative neural model for cross-lingual word alignment. In EMNLP, 2019.
[Vaswani et al., 2017] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NIPS, 2017.
[Vogel et al., 1996] Stephan Vogel, Hermann Ney, and Christoph Tillmann. HMM-based word alignment in statistical translation. In COLING, 1996.
[Wu et al., 2020] Di Wu, Liang Ding, Fan Lu, and Jian Xie. SlotRefine: A fast non-autoregressive model for joint intent detection and slot filling. In EMNLP, 2020.
[Yarowsky et al., 2001] David Yarowsky, Grace Ngai, and Richard Wicentowski. Inducing multilingual text analysis tools via robust projection across aligned corpora. In HLT, 2001.
[Zenkel et al., 2019] Thomas Zenkel, Joern Wuebker, and John DeNero. Adding interpretable attention to neural translation models improves word alignment. arXiv, 2019.
[Zenkel et al., 2020] Thomas Zenkel, Joern Wuebker, and John DeNero. End-to-end neural word alignment outperforms GIZA++. In ACL, 2020.