Neural Network-based Word Alignment through Score Aggregation
Joël Legrand†, Michael Auli, and Ronan Collobert
Idiap Research Institute, Martigny, Switzerland
École Polytechnique Fédérale de Lausanne (EPFL), Lausanne, Switzerland
Facebook AI Research, Menlo Park
Abstract
We present a simple neural network for word alignment that builds source and target word window representations to compute alignment scores for sentence pairs. To enable unsupervised training, we use an aggregation operation that summarizes the alignment scores for a given target word. A soft-margin objective increases scores for true target words while decreasing scores for target words that are not present. Compared to the popular Fast Align model, our approach improves alignment accuracy by 7 AER on English-Czech, by 6 AER on Romanian-English and by 1.7 AER on English-French alignment.
Introduction

Word alignment is the task of finding the correspondence between source and target words in a pair of sentences that are translations of each other. Generative models for this task (Brown et al., 1990; Och and Ney, 2003; Vogel et al., 1996) still form the basis for many machine translation systems (Koehn et al., 2003; Chiang, 2007). Recent neural approaches include Yang et al. (2013), who introduce a feed-forward network-based model trained on alignments that were generated by a traditional generative model. This treats potentially erroneous alignments as supervision. Tamura et al. (2014) sidestep this issue by using negative sampling to train a recurrent neural network on unlabeled data. They optimize a global loss that requires an expensive beam search to approximate the sum over all alignments.

† This work was conducted while the first author did an internship at Facebook AI Research.
In this paper we introduce a word alignment model that is simpler in structure and which relies on a more tractable training procedure. Our model is a neural network that extracts context information from source and target sentences and then computes simple dot products to estimate alignment links. Our objective function is word-factored and does not require the expensive computation associated with global loss functions. The model can be easily trained on unlabeled data via a novel but simple aggregation operation which has been successfully applied in the computer vision literature (Pinheiro and Collobert, 2015). The aggregation combines the scores of all source words for a particular target word and promotes source words which are likely to be aligned with a given target word according to the knowledge the model has learned so far. At test time, the aggregation operation is removed and source words are aligned to target words by choosing the highest scoring candidates.

Alignment Model

In the following, we consider a target-source sentence pair $(e, f)$, with $e = (e_1, \dots, e_{|e|})$ and $f = (f_1, \dots, f_{|f|})$. Words are represented by $f_j$ and $e_i$, which are indices in source and target dictionaries. For simplicity, we assume here that word indices are the only feature fed to our architecture. Given a source word $f_j$ and a target word $e_i$, our architecture embeds windows (of size $d^f_{win}$ and $d^e_{win}$, respectively) centered around each of these words into a $d_{emb}$-dimensional vector space.
Figure 1: Illustration of the model. The two networks net$^e$ and net$^f$ compute representations for source and target words. The score of an alignment link is a simple dot product between those source and target word representations. The aggregation operation summarizes the alignment scores for each target word.

The embedding operation is performed with two distinct neural networks:
$$\text{net}^e([e]^{d^e_{win}}_i) \in \mathbb{R}^{d_{emb}} \quad \text{and} \quad \text{net}^f([f]^{d^f_{win}}_j) \in \mathbb{R}^{d_{emb}},$$
where we denote the window operator as $[x]^d_i = (x_{i-d/2}, \dots, x_{i+d/2})$. The matching score between a source word $f_j$ and a target word $e_i$ is then given by the dot product:
$$s(i, j) = \text{net}^e([e]^{d^e_{win}}_i) \cdot \text{net}^f([f]^{d^f_{win}}_j). \quad (1)$$
If $e_i$ is aligned to $f_{a_i}$, the score $s(i, a_i)$ should be high, while the scores $s(i, j)$ for all $j \neq a_i$ should be low.

Unsupervised Training

In this paper, we consider an unsupervised setup where the alignment is not known at training time. We thus cannot minimize or maximize the matching scores (1) in a direct manner. Instead, given a target word $e_i$ we consider the aggregated matching scores over the source sentence:
$$s_{aggr}(i, f) = \text{Aggr}_{j=1}^{|f|}\, s(i, j), \quad (2)$$
where Aggr is an aggregation operator (see "Choosing the Aggregation" below). Training considers a positive sentence pair $(e^+, f)$ and a negative sentence pair $(e^-, f)$. Given a word at index $i^+$ in the positive target sentence, we want to maximize the aggregated score $s_{aggr}(i^+, f)$ $(1 \le i^+ \le |e^+|)$ because we know it should be aligned to at least one source word. Conversely, given a word at index $i^-$ in the negative target sentence, we want to minimize $s_{aggr}(i^-, f)$ $(1 \le i^- \le |e^-|)$ because it is unlikely that the source sentence can explain the negative target word. Following these principles, we consider a simple soft-margin loss:
$$L(e^+, e^-, f) = \sum_{i^+=1}^{|e^+|} \log\left(1 + e^{-s_{aggr}(i^+, f)}\right) + \sum_{i^-=1}^{|e^-|} \log\left(1 + e^{+s_{aggr}(i^-, f)}\right). \quad (3)$$
Training is achieved by minimizing (3) and by sampling over triplets $(e^+, e^-, f)$ from the training data. We discuss how we handle unaligned target words below.

Choosing the Aggregation

The aggregation operation (2) is only present during training and acts as a filter which aims to explain a given target word $e_i$ by one or more source words. If we had the word alignments, then we would sum over the source words $f_j$ aligned with $e_i$. However, in our setup alignments are not available at training time, so we must rely on what the model has learned so far to filter the source words. We consider the following strategies:

• Sum: ignore the knowledge learned so far, and assign the same weight to all source words $f_j$ to explain $e_i$. (This can be seen by observing that the gradients for all source words are the same.) In this case, we have
$$s_{aggr}(i, f) = \sum_{j=1}^{|f|} s(i, j).$$

• Max: encourage the best aligned source word $f_j$, according to what the model has learned so far. In this case, the aggregation is written as:
$$s_{aggr}(i, f) = \max_{j=1}^{|f|} s(i, j).$$

• LSE: give similar weights to source words with similar scores. This can be achieved with a LogSumExp aggregation operation (also called LogAdd), defined as:
$$s_{aggr}(i, f) = \frac{1}{r} \log \sum_{j=1}^{|f|} e^{r\, s(i, j)}, \quad (4)$$
where $r$ is a positive scalar (to be chosen) controlling the smoothness of the aggregation. For small $r$, the aggregation is equivalent to a sum, and for large $r$, the aggregation acts as a max.
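To make the training procedure concrete, here is a minimal sketch of the three aggregation operators and the soft-margin loss (3), written in PyTorch under our own naming. It assumes the $|e| \times |f|$ score matrices of Eq. (1) have already been computed; it is an illustration, not the authors' released code.

```python
import torch
import torch.nn.functional as F

def aggregate(scores: torch.Tensor, mode: str = "lse", r: float = 1.0) -> torch.Tensor:
    """Aggregate an |e| x |f| score matrix over source positions (Eq. 2).

    Returns one aggregated score per target word.
    """
    if mode == "sum":   # equal weight on every source word
        return scores.sum(dim=1)
    if mode == "max":   # only the best-scoring source word contributes
        return scores.max(dim=1).values
    if mode == "lse":   # LogSumExp (Eq. 4): interpolates between sum and max via r
        return torch.logsumexp(r * scores, dim=1) / r
    raise ValueError(f"unknown aggregation: {mode}")

def soft_margin_loss(scores_pos: torch.Tensor, scores_neg: torch.Tensor) -> torch.Tensor:
    """Eq. (3): raise aggregated scores for words of the true target sentence
    and lower them for words of a sampled negative target sentence."""
    s_pos = aggregate(scores_pos)  # shape (|e+|,)
    s_neg = aggregate(scores_neg)  # shape (|e-|,)
    # log(1 + exp(-s)) == softplus(-s) and log(1 + exp(s)) == softplus(s)
    return F.softplus(-s_pos).sum() + F.softplus(s_neg).sum()
```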
Alignment Prediction

At test time, we align each target word $e_i$ with the source word $f_j$ for which the matching score $s(i, j)$ in (1) is highest; this may result in a source word being aligned to multiple target words. However, not every target word is aligned, so we consider only alignments with a matching score above a threshold:
$$s(i, j) > \mu^-(e_i) + \alpha\, \sigma^-(e_i), \quad (5)$$
where $\alpha$ is a tunable hyper-parameter,
$$\mu^-(e_i) = \mathbb{E}_{\{\tilde{e}_k = e_i \in \tilde{e},\ \tilde{f}^-_j \in \tilde{f}^-\}}\left[ s(k, j^-) \right]$$
is the expectation over all training sentences $\tilde{e}$ containing the word $e_i$ and all words $\tilde{f}^-_j$ belonging to a corresponding negative source sentence $\tilde{f}^-$, and $\sigma^-(e_i)$ is the respective variance.

Network Architecture

Our model consists of two convolutional neural networks net$^e$ and net$^f$, as shown in Figure 1. Both of them take the same form, so we detail only the target architecture. The discrete features $[e]^{d^e_{win}}_i$ are embedded into a $d^e_{emb}$-dimensional vector space via a lookup-table operation as first introduced in Bengio et al. (2000):
$$x^e_i = LT_{W^e}([e]^{d^e_{win}}_i) = \left( LT_{W^e}(e_{i - d^e_{win}/2}), \dots, LT_{W^e}(e_{i + d^e_{win}/2}) \right),$$
where the lookup-table operation applied at index $k$ returns the $k$-th column of the parameter matrix $W^e$:
$$LT_{W^e}(k) = W^e_{\bullet, k}.$$
The matrix $W^e$ is of size $|\mathcal{V}^e| \times d^e_{emb}$, where $\mathcal{V}^e$ is the target vocabulary and $d^e_{emb}$ is the word embedding size for the target words.

The word embeddings output by the lookup-table are concatenated and fed through two successive 1-D convolution layers. The convolutions use a step size of one and extract context features for each word. The kernel sizes $k^e_1$ and $k^e_2$ determine the size of the window $d^e_{win} = k^e_1 + k^e_2 - 1$ over which features will be extracted by net$^e$. In order to obtain windows centered around each word, we add $(k^e_1 + k^e_2)/2 - 1$ padding words at the beginning and at the end of each sentence. The first layer $cnn^e_1$ applies the linear transformation $M^{e,1}$ exactly $k^e_2$ times to consecutive spans of size $k^e_1$ of the $d^e_{win}$ words in a given window:
$$cnn^e_1(x^e_i) = \left( M^{e,1}\, LT_{W^e}([e]^{k^e_1}_{i-a}), \dots, M^{e,1}\, LT_{W^e}([e]^{k^e_1}_{i+a}) \right),$$
where $a = \lfloor k^e_2 / 2 \rfloor$, $M^{e,1} \in \mathbb{R}^{d^e_{hu} \times (d^e_{emb} k^e_1)}$ is a matrix of parameters, and $d^e_{hu}$ is the number of hidden units ($hu$). The outputs of the first layer $cnn^e_1$ are concatenated to form a vector of size $k^e_2 d^e_{hu}$ which is fed to the second layer:
$$\text{net}^e(x^e_i) = M^{e,2} \tanh(cnn^e_1(x^e_i)), \quad (6)$$
where $M^{e,2} \in \mathbb{R}^{d_{emb} \times (k^e_2 d^e_{hu})}$ is a matrix of parameters, and the $\tanh(\cdot)$ operation is applied element-wise. The parameters $W^e$, $M^{e,1}$ and $M^{e,2}$ are trained by stochastic gradient descent to minimize the loss (3).

Input Features

In addition to the raw word indices, we consider two additional discrete features which are handled in the same way as word features, by introducing an additional lookup-table for each of them. The output of all lookup-tables is concatenated and fed to the two-layer neural network architecture (6).
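The two-layer window encoder described above can be sketched as follows. This is our reconstruction in PyTorch, with zero vectors standing in for the padding words and all module names our own; it is not the paper's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WindowEncoder(nn.Module):
    """One of the two networks (net^e or net^f): lookup-table + two 1-D convolutions."""

    def __init__(self, vocab_size: int, d_emb_in: int = 256, d_hu: int = 256,
                 d_emb_out: int = 256, k1: int = 3, k2: int = 3):
        super().__init__()
        self.lookup = nn.Embedding(vocab_size, d_emb_in)        # LT_W
        self.cnn1 = nn.Conv1d(d_emb_in, d_hu, kernel_size=k1)   # M^1 over spans of k1 words
        self.cnn2 = nn.Conv1d(d_hu, d_emb_out, kernel_size=k2)  # M^2 over k2 cnn1 outputs
        self.pad = (k1 + k2) // 2 - 1  # centers a window of k1+k2-1 words on each position

    def forward(self, word_ids: torch.Tensor) -> torch.Tensor:
        # word_ids: (batch, seq_len) -> (batch, d_emb_in, seq_len)
        x = self.lookup(word_ids).transpose(1, 2)
        if self.pad > 0:  # zero padding here; the paper adds dedicated padding words
            x = F.pad(x, (self.pad, self.pad))
        h = torch.tanh(self.cnn1(x))         # Eq. (6): tanh between the two layers
        return self.cnn2(h).transpose(1, 2)  # (batch, seq_len, d_emb_out)

# Alignment scores (Eq. 1) are then plain dot products between the two encodings:
#   scores = net_e(e_ids) @ net_f(f_ids).transpose(1, 2)   # (batch, |e|, |f|)
```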
Distance to the diagonal.
This feature can be computed for a target word $e_i$ and a source word $f_j$ as:
$$diag(i, j) = \left| \frac{i}{|e|} - \frac{j}{|f|} \right|.$$
It allows the model to learn that aligned sentence pairs use roughly the same word order and that alignment links remain close to the diagonal. We use this feature only for the source network because it encodes relative position information which only needs to be encoded once. If we used absolute position instead, then we would need to encode this information both on the source and the target side.
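As a small illustration (our own helper, not code from the paper), the feature is a pure function of the two positions and sentence lengths:

```python
def diag_distance(i: int, j: int, e_len: int, f_len: int) -> float:
    """Distance to the diagonal: 0 for links on the diagonal, larger off it."""
    return abs(i / e_len - j / f_len)
```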
Part-of-speech.
Word pairs that are good translations of each other are likely to carry the same part of speech in both languages (Melamed, 1995). We therefore add part-of-speech information to the model.
Char n-gram.
We consider unigram character position features. Let $K$ be the maximum size for a word in a dictionary, and let $\mathcal{C}$ denote the dictionary of characters. Every character is represented by its index $c$ (with $1 \le c \le |\mathcal{C}|$). We associate every character $c$ at position $k$ with a vector at position $((k - 1) \cdot |\mathcal{C}|) + c$ in a lookup-table. For a given word, we extract all unigram character position embeddings and average them to obtain a character embedding for the word (a sketch follows the datasets paragraph below).

Datasets

We use the English-French Hansards corpus as distributed by the NAACL 2003 shared task (Mihalcea and Pedersen, 2003). This dataset contains 1.1M sentence pairs; the test and validation sets contain 447 and 37 examples, respectively. We also evaluate on the Romanian-English dataset of the ACL 2005 shared task (Martin et al., 2005), comprising 48K sentence pairs for training, 248 for testing and 17 for validation. For English-Czech experiments, we use the WMT news commentary corpus for training (150K sentence pairs) and a set of 515 sentences for testing (Bojar and Prokopová, 2006).
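Returning to the char n-gram feature above, here is a minimal sketch of one possible implementation under our reading of the scheme, assuming 1-based character indices and a NumPy lookup-table; the sizes and names are illustrative.

```python
import numpy as np

# Hypothetical sizes: |C| characters, words of at most K characters,
# and a d-dimensional vector per (position, character) pair.
C, K, d = 50, 20, 64
rng = np.random.default_rng(0)
char_pos_table = rng.normal(size=(K * C, d))  # one row per (position k, character c)

def char_embedding(char_ids: list[int]) -> np.ndarray:
    """Average the (position, character) vectors of a word.

    char_ids holds 1-based character indices c with 1 <= c <= |C|;
    the row for character c at position k (1-based) is ((k - 1) * |C|) + c.
    """
    rows = [(k - 1) * C + c for k, c in enumerate(char_ids, start=1)]
    # The table rows here are 0-based, so shift the 1-based index down by one.
    return char_pos_table[np.array(rows) - 1].mean(axis=0)
```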
Evaluation

Our models are evaluated in terms of precision, recall, F-measure and Alignment Error Rate (AER). We train models in each language direction and then symmetrize the resulting alignments using either the intersection or the grow-diag-final-and heuristic (Och and Ney, 2003; Koehn et al., 2003). We validated the choice of symmetrization heuristic on each language pair and chose the best one for each model, considering the two aforementioned types as well as grow-diag-final and grow-diag (two of these heuristics are sketched at the end of this subsection).

Additionally, we train phrase-based machine translation models with our alignments using the popular Moses toolkit (Koehn et al., 2007). For English-French, we trained on the news commentary corpus v10; for English-Czech we used the news commentary corpus v11; and for Romanian-English we used the Europarl corpus v8. We tuned our models on the WMT2015 test set for English-Czech as well as for Romanian-English; for English-French we tuned on the WMT2014 test set. Final results are reported on the WMT2016 test set for English-Czech as well as Romanian-English, and for English-French we report results on the WMT2015 test set (as there is no track for this language pair in 2016). We compare our model to Fast Align, a popular log-linear reparameterization of IBM Model 2 (Dyer et al., 2013).
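For concreteness, here is a hedged sketch of two of these heuristics: intersection, and a grow-diag-final-and variant in the spirit of the standard Moses procedure. This is our paraphrase, not the toolkit's implementation, and it assumes both directional alignments are given as sets of (target, source) index pairs.

```python
# Alignments are sets of (i, j) links with i a target index and j a source index.
NEIGHBORS = [(-1, 0), (0, -1), (1, 0), (0, 1), (-1, -1), (-1, 1), (1, -1), (1, 1)]

def intersection(e2f: set[tuple[int, int]], f2e: set[tuple[int, int]]) -> set[tuple[int, int]]:
    """Keep only links predicted by both directional models."""
    return e2f & f2e

def grow_diag_final_and(e2f: set[tuple[int, int]], f2e: set[tuple[int, int]]) -> set[tuple[int, int]]:
    union = e2f | f2e
    alignment = e2f & f2e
    # Grow: add union links adjacent to current links whose row or column is free.
    added = True
    while added:
        added = False
        for i, j in sorted(alignment):
            for di, dj in NEIGHBORS:
                cand = (i + di, j + dj)
                if cand in union and cand not in alignment and (
                    all(cand[0] != a for a, _ in alignment)
                    or all(cand[1] != b for _, b in alignment)
                ):
                    alignment.add(cand)
                    added = True
    # Final-and: add remaining union links where both words are still unaligned.
    for i, j in sorted(union - alignment):
        if all(i != a for a, _ in alignment) and all(j != b for _, b in alignment):
            alignment.add((i, j))
    return alignment
```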
Hyper-parameters

The kernel sizes of the target network net$^e(\cdot)$ are set to $k^e_1 = k^e_2 = 3$ for all language pairs. The kernel sizes of the source network net$^f(\cdot)$ are set to $k^f_1 = k^f_2 = 3$ for Romanian-English as well as English-Czech; for English-French we used $k^f_1 = k^f_2 = 1$. The number of hidden units is $d^e_{hu} = d^f_{hu} = 256$ and $d_{emb}$ is set to 256. The source $\mathcal{V}^f$ and target $\mathcal{V}^e$ dictionaries consist of the 30K most common words for English, French and Romanian, and 80K for Czech. All other words are mapped to a unique UNK token. The word embedding sizes $d^e_{emb}$ and $d^f_{emb}$, as well as the char n-gram embedding size, is . For LSE, we set $r = 1$ in (4).

We initialize the word embeddings with a simple PCA computed over the matrix of word co-occurrence counts (Lebret and Collobert, 2014). The co-occurrence counts were computed over the common crawl corpus provided by WMT16. For part-of-speech tagging we used the Stanford parser on English-French data, and MarMoT (Mueller et al., 2013) for Romanian-English as well as English-Czech.

We trained 4 systems for the ensembles, each using a different random seed to vary the weight initialization as well as the shuffling of the training set. We averaged the alignment scores predicted by each system before decoding. The alignment threshold variables $\mu^-(e_i)$ and $\sigma^-(e_i)$ for decoding were set to $\mu^-(e_i) = \sigma^-(e_i) = 0$.

For systems where $d^e_{win} > 1$ and $d^f_{win} > 1$, we saw a tendency to align frequent words regardless of whether they appeared in the center of the context window or not. For instance, a common mistake would be to align "the cat sat" with "PADDING le chat". To prevent such behavior, we occasionally replaced the center word in a target window by a random word during training (see the sketch after Table 1). We do this for every second training example on average; we tuned this rate on the validation set.

Results

We first explore different choices for the aggregation operator (2). Table 1 shows that the LogSumExp (LSE) aggregator performs best on all datasets for every direction as well as in the symmetrized setting using the grow-diag-final heuristic. All results are based on a single model trained with the 'distance to the diagonal' feature detailed above. (We use kernel sizes $k^e_1 = k^e_2 = 3$ and $k^f_1 = k^f_2 = 1$ for all language pairs in this experiment.) We therefore use LSE for the remaining experiments.
              Max    Sum    LSE
En-Fr         18.1   23.0
Fr-En         20.7   26.9
symmetrized   14.8   24.1
Ro-En         42.2   42.0
En-Ro         40.4   40.2
symmetrized   36.4   35.6
En-Cz         27.9   35.6
Cz-En         26.5   33.6
symmetrized   21.8   32.7

Table 1: Alignment error rates for different aggregation operations in each language direction and with grow-diag-final-and symmetrization.
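The center-word replacement trick described above can be sketched as follows; the 0.5 rate mirrors "every second training example on average", and the function and parameter names are our own.

```python
import random

def corrupt_center(window: list[int], vocab_size: int, rate: float = 0.5) -> list[int]:
    """Occasionally replace the center word of a target window by a random word,
    so the model cannot rely on frequent words appearing anywhere in the window."""
    if random.random() < rate:
        window = window.copy()
        window[len(window) // 2] = random.randrange(vocab_size)
    return window
```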
Table 2 shows the effect of the different input features. Both POS and the distance-to-the-diagonal feature significantly improve accuracy. Position information via the 'distance to the diagonal' feature is helpful for all language pairs, and POS information is more effective for Romanian-English and English-Czech, which involve morphologically rich languages. We use the POS and 'distance to the diagonal' features for the remaining experiments.
In the following results we label our model as NNSA (neural network score aggregation). On English-French data (Table 3) our model outperforms the baseline (Dyer et al., 2013) in each individual language direction as well as in the symmetrized setting. With an ensemble of four models, we outperform the baseline by 1.7 AER (from 11.4 to 9.7), and with an individual model we outperform it by 1.2 AER (from 11.4 to 10.2). Note that the choice of symmetrization heuristic greatly affects accuracy, both for the baseline and NNSA.

              English-French       Romanian-English      English-Czech
              En-Fr  Fr-En  sym    Ro-En  En-Ro  sym     En-Cz  Cz-En  sym
words         22.2   24.2   15.7   47.0   45.5   40.3    36.9   36.3   29.5
+ POS         20.9   23.9   15.3   45.3   42.9   36.9    35.6   33.7   28.2
+ diag        15.1   15.8   12.8   37.6   35.7   32.2    24.8   24.5   21.0
+ POS + diag

Table 2: Alignment error rates using different input features in each language direction and with grow-diag-final-and symmetrization.

                      P      R      F1     AER
English-French
Baseline              49.6   89.8   63.9   16.7
NNSA                  64.7   80.7   71.8   13.2
+ ensemble            61.5   85.8   71.6
French-English
Baseline              52.9   88.4   66.2   16.2
NNSA                  61.7   86.3   72.0   12.1
+ ensemble            62.6   86.7   72.7
symmetrized
Baseline (inter)      69.6   84.0   76.1   11.4
NNSA (gdfa)           60.4   88.5   71.8   10.2
+ ensemble            59.3   89.9   71.4    9.7

Table 3: English-French results on the test set in terms of precision (P), recall (R), F-score (F1) and AER; ensemble denotes a combination of four systems; we use the intersection (inter) and grow-diag-final-and (gdfa) symmetrization heuristics.

On Romanian-English (Table 4) our model outperforms the baseline in both directions as well. Ensembling further improves accuracy and leads to a significant improvement of 6 AER over the best symmetrized baseline result (from 32 to 26).

On English-Czech (Table 5) our model also outperforms the baseline in both directions. We added the character feature to better deal with the morphologically rich nature of Czech; the feature reduced AER by 2.1 in the symmetrized setting. An ensemble improved accuracy further and led to a 7 AER improvement over the best symmetrized baseline result (from 22.8 to 15.8).
Table 6 presents the BLEU evaluation of our alignments. For each language pair, we select the best alignment model reported in Tables 3, 4 and 5, align the training data, and use those alignments to run the standard phrase-based training pipeline. Our BLEU results show the average BLEU score and standard deviation for five runs of minimum error rate training (MERT; Och, 2003). Our alignments achieve slightly better results for Romanian-English as well as English-Czech, while performing on par with Fast Align on English-French translation.

                      P      R      F1     AER
Romanian-English
Baseline              70.0   61.0   65.2   34.8
NNSA                  75.1   65.2   69.8   30.2
+ ensemble            75.8   62.8   68.7
English-Romanian
Baseline              71.3   60.8   65.6   34.4
NNSA                  78.1   61.7   69.0   31.1
+ ensemble            78.4   63.2   70.0
symmetrized
Baseline (gdfa)       69.5   66.5   68.0   32.0
NNSA (gdfa)           74.1   71.8   73.0   27.0
+ ensemble            73.0   74.5   73.7

Table 4: Romanian-English results (cf. Table 3).
                      P      R      F1     AER
English-Czech
Baseline              68.4   73.3   70.7   26.6
NNSA                  72.0   74.3   73.1   24.6
+ char n-gram         73.8   75.4   74.6   23.2
+ ensemble            78.8   77.2   78.0
Czech-English
Baseline              68.6   74.0   71.2   25.7
NNSA                  74.1   74.0   74.0   22.9
+ char n-gram         78.1   74.1   76.1   21.4
+ ensemble            79.1   77.7   78.4
symmetrized
Baseline (inter)      88.1   66.6   76.0   22.8
NNSA (gdfa)           75.7   80.3   76.3   19.9
+ char n-gram         76.9   81.3   79.1   17.8
+ ensemble            78.9   83.2   81.0   15.8

Table 5: Czech-English results (cf. Table 3).

                    Baseline   NNSA
French-English      . ± .      . ± .
Romanian-English    . ± .      . ± .
Czech-English       . ± .      . ± .

Table 6: Average BLEU score and standard deviation for five runs of MERT.

Analysis

In this section, we analyze the word representations learned by our model. We first focus on the source representations: given a source window, we obtain its distributional representation and then compute the Euclidean distance to all other source windows in the training corpus. Table 7 shows the nearest windows for two source windows; the closest windows tend to have similar meanings.

We then analyze the relation between source and target representations: given a source window, we compute the alignment scores for all target sentences in the training corpus. Table 8 shows, for two source windows, which target words have the largest alignment scores. The example "in working together" is particularly interesting since the aligned target words collabore, coordonés and concertés mean collaborate, coordinated and concerted, which all carry the same meaning as the source window phrase.

the voting process       in working together
the voting area          for working together
the voting power         with working together
the voting rules         from working together
the voting system        about working together
the voting patterns      by working together
the voting ballots       and working together
their voting patterns    while working together

Table 7: Analysis of source window representations. Each column shows a window over the source sentence followed by several close neighbors in terms of Euclidean distance (among the 30 nearest).

the voting process    in working together
vote                  travaillé
voteraient            travailleront
votent                collaboration
voter                 travaillant
votant                oeuvrant
scrutin               concerts
suffrage              coordonés
procédure             concert
investiture           collabore
élections             coopération

Table 8: Analysis of source and target representations. Each column shows a source window and the target words which are most aligned according to our model.

Conclusion

In this paper, we present a simple neural network alignment model trained on unlabeled data. Our model computes alignment scores as dot products between representations of windows around source and target words. We apply an aggregation operation borrowed from the computer vision literature to make unsupervised training possible. The aggregation operation acts as a filter over alignment scores and allows us to determine which source words explain a given target word. We improve over Fast Align, a popular log-linear reparameterization of IBM Model 2 (Dyer et al., 2013), by up to 6 AER on Romanian-English, 7 AER on English-Czech and 1.7 AER on English-French alignment. Furthermore, we evaluated our model as part of a full machine translation pipeline and showed that our alignments are better than or on par with Fast Align in terms of BLEU.
References
Yoshua Bengio, R´ejean Ducharme, and PascalVincent. A Neural Probabilistic LanguageModel. In
NIPS , 2000.Ondˇrej Bojar and Magdalena Prokopov´a. Czech-English Word Alignment. In
Proceedings ofthe Fifth International Conference on LanguageResources and Evaluation (LREC 2006) , 2006.Peter F. Brown, John Cocke, Stephen DellaPietra, Vincent J. Della Pietra, Frederick Je-linek, John D. Lafferty, Robert L. Mercer, andPaul S. Roossin. A Statistical Approach to Ma-chine Translation.
Computational Linguistics ,1990.avid Chiang. Hierarchical Phrase-Based Trans-lation.
Computational Linguistics , 2007.Chris Dyer, Victor Chahuneau, and Noah A.Smith. A simple, fast, and effective reparame-terization of IBM Model 2. In
Proc. of NAACL ,2013.Philipp Koehn, Franz J. Och, and Daniel Marcu.Statistical Phrase-based Translation. In
Proc. ofNAACL , 2003.Philipp Koehn, Hieu Hoang, Alexandra Birch,Chris Callison-Burch, Marcello Federico,Nicola Bertoldi, Brooke Cowan, Wade Shen,Christine Moran, Richard Zens, Chris Dyer,Ondˇrej Bojar, Alexandra Constantin, and EvanHerbst. Moses: Open source toolkit for statisti-cal machine translation. In
Proceedings of the45th Annual Meeting of the ACL on InteractivePoster and Demonstration Sessions , ACL ’07,2007.R´emi Lebret and Ronan Collobert. Word Em-beddings through Hellinger PCA. In
Proc. ofEACL , 2014.Joel Martin, Rada Mihalcea, and Ted Pedersen.Word Alignment For Languages With ScarceResources. In
Proc. of WPT , 2005.Dan I. Melamed. Automatic evaluation and uni-form filter cascades for inducing n-best transla-tion lexicons. In
Third Workshop on Very LargeCorpora , 1995.Rada Mihalcea and Ted Pedersen. An EvaluationExercise for Word Alignment. In
Proc. of WPT ,2003.Thomas Mueller, Helmut Schmid, and HinrichSch¨utze. Efficient higher-order CRFs for mor-phological tagging. In
Proceedings of the 2013Conference on Empirical Methods in NaturalLanguage Processing , 2013.Franz J. Och and Hermann Ney. A SystematicComparison of Various Statistical AlignmentModels.
Computational Linguistics , 2003.Franz Josef Och. Minimum error rate training instatistical machine translation. In
Proc of ACL ,2003.Pedro O. Pinheiro and Ronan Collobert. FromImage-level to Pixel-level Labeling with Con-volutional Networks. In
Proc. of CVPR , 2015.Akihiro Tamura, Taro Watanabe, and EiichiroSumita. Recurrent Neural Networks for WordAlignment Model. In
Proc. of ACL , 2014. Stephan Vogel, Hermann Ney, and Christoph Till-mann. HMM-Based Word Alignment in Statis-tical Translation. In
Proc. of COLING , 1996.Nan Yang, Shujie Liu, Mu Li, Ming Zhou, andNenghai Yu. Word Alignment Modeling withContext Dependent Deep Neural Network. In