Neural Machine Translation with Source-Side Latent Graph Parsing
Kazuma Hashimoto and Yoshimasa Tsuruoka
The University of Tokyo, 7-3-1 Hongo, Bunkyo-ku, Tokyo, Japan
{hassy, tsuruoka}@logos.t.u-tokyo.ac.jp

Abstract
This paper presents a novel neural machine translation model which jointly learns translation and source-side latent graph representations of sentences. Unlike existing pipelined approaches using syntactic parsers, our end-to-end model learns a latent graph parser as part of the encoder of an attention-based neural machine translation model, and thus the parser is optimized according to the translation objective. In experiments, we first show that our model compares favorably with state-of-the-art sequential and pipelined syntax-based NMT models. We also show that the performance of our model can be further improved by pre-training it with a small amount of treebank annotations. Our final ensemble model significantly outperforms the previous best models on the standard English-to-Japanese translation dataset.
Neural Machine Translation (NMT) is an active area of research due to its outstanding empirical results (Bahdanau et al., 2015; Luong et al., 2015; Sutskever et al., 2014). Most of the existing NMT models treat each sentence as a sequence of tokens, but recent studies suggest that syntactic information can help improve translation accuracy (Eriguchi et al., 2016b, 2017; Sennrich and Haddow, 2016; Stahlberg et al., 2016). The existing syntax-based NMT models employ a syntactic parser trained by supervised learning in advance, and hence the parser is not adapted to the translation tasks. An alternative approach for leveraging syntactic structure in a language processing task is to jointly learn syntactic trees of the sentences along with the target task (Socher et al., 2011; Yogatama et al., 2017).
Figure 1: An example of the learned latent graphs, for the sentence "All the calculated electronic band structures are metallic." Edges with a small weight are omitted.

Motivated by the promising results of recent joint learning approaches, we present a novel NMT model that can learn a task-specific latent graph structure for each source-side sentence. The graph structure is similar to the dependency structure of the sentence, but it can have cycles and is learned specifically for the translation task. Unlike the aforementioned approach of learning single syntactic trees, our latent graphs are composed of "soft" connections, i.e., the edges have real-valued weights (Figure 1). Our model consists of two parts: one is a task-independent parsing component, which we call a latent graph parser, and the other is an attention-based NMT model. The latent parser can be independently pre-trained with human-annotated treebanks and is then adapted to the translation task.

In experiments, we demonstrate that our model can be effectively pre-trained by the treebank annotations, outperforming a state-of-the-art sequential counterpart and a pipelined syntax-based model. Our final ensemble model outperforms the previous best results by a large margin on the WAT English-to-Japanese dataset.
We model the latent graph parser based on dependency parsing. In dependency parsing, a sentence is represented as a tree structure where each node corresponds to a word in the sentence and a unique root node (ROOT) is added. Given a sentence of length $N$, the parent node $H_{w_i} \in \{w_1, \ldots, w_N, \mathrm{ROOT}\}$ ($H_{w_i} \neq w_i$) of each word $w_i$ ($1 \leq i \leq N$) is called its head. The sentence is thus represented as a set of tuples $(w_i, H_{w_i}, \ell_{w_i})$, where $\ell_{w_i}$ is a dependency label.

In this paper, we remove the constraint of using the tree structure and represent a sentence as a set of tuples $(w_i, p(H_{w_i} \mid w_i), p(\ell_{w_i} \mid w_i))$, where $p(H_{w_i} \mid w_i)$ is the probability distribution of $w_i$'s parent nodes, and $p(\ell_{w_i} \mid w_i)$ is the probability distribution of the dependency labels. For example, $p(H_{w_i} = w_j \mid w_i)$ is the probability that $w_j$ is the parent node of $w_i$. Here, we assume that a special token $\langle \mathrm{EOS} \rangle$ is appended to the end of the sentence, and we treat the $\langle \mathrm{EOS} \rangle$ token as ROOT.
This approach is similar to that of graph-based dependency parsing (McDonald et al., 2005) in that a sentence is represented with a set of weighted arcs between the words. To obtain the latent graph representation of the sentence, we use a dependency parsing model based on multi-task learning proposed by Hashimoto et al. (2017).
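As a concrete picture of this representation, the following Python sketch stores a sentence as tuples $(w_i, p(H_{w_i} \mid w_i), p(\ell_{w_i} \mid w_i))$. It is not the authors' code; the names, the toy label set, and the placeholder probabilities are hypothetical, and an ordinary dependency tree is recovered as the special case where each head distribution is one-hot.

```python
import numpy as np

# A minimal sketch (not the authors' code) of the latent graph representation:
# each word stores a distribution over candidate heads (all other words plus
# ROOT, here the final <EOS> token) and a distribution over dependency labels.

words = ["All", "the", "calculated", "electronic", "band",
         "structures", "are", "metallic", ".", "<EOS>"]   # <EOS> acts as ROOT
n = len(words) - 1                                         # number of real words
labels = ["nsubj", "det", "amod", "root", "other"]         # toy label set

rng = np.random.default_rng(0)

def soft_graph(word_index):
    """Random soft head/label distributions for word i (placeholder values)."""
    head_scores = rng.normal(size=n + 1)
    head_scores[word_index] = -np.inf                      # a word cannot head itself
    p_head = np.exp(head_scores) / np.exp(head_scores).sum()
    p_label = np.full(len(labels), 1.0 / len(labels))
    return p_head, p_label

# The sentence as a set of tuples (w_i, p(H|w_i), p(l|w_i)).
latent_graph = [(w,) + soft_graph(i) for i, w in enumerate(words[:-1])]

# An ordinary dependency tree is the special case where p(H|w_i) is one-hot,
# e.g. "the" -> "structures":
hard = np.zeros(n + 1)
hard[words.index("structures")] = 1.0
print(latent_graph[1][0], hard)
```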
The $i$-th input word $w_i$ is represented with the concatenation of its $d$-dimensional word embedding $v_{dp}(w_i) \in \mathbb{R}^d$ and its character $n$-gram embedding $c(w_i) \in \mathbb{R}^d$: $x(w_i) = [v_{dp}(w_i); c(w_i)]$. $c(w_i)$ is computed as the average of the embeddings of the character $n$-grams in $w_i$.

Our latent graph parser builds upon multi-layer bi-directional Recurrent Neural Networks (RNNs) with Long Short-Term Memory (LSTM) units (Graves and Schmidhuber, 2005). In the first layer, POS tagging is handled by computing a hidden state $h^{(1)}_i = [\overrightarrow{h}^{(1)}_i; \overleftarrow{h}^{(1)}_i]$ for $w_i$, where $\overrightarrow{h}^{(1)}_i = \mathrm{LSTM}(\overrightarrow{h}^{(1)}_{i-1}, x(w_i)) \in \mathbb{R}^d$ and $\overleftarrow{h}^{(1)}_i = \mathrm{LSTM}(\overleftarrow{h}^{(1)}_{i+1}, x(w_i)) \in \mathbb{R}^d$ are hidden states of the forward and backward LSTMs, respectively. $h^{(1)}_i$ is then fed into a softmax classifier to predict a probability distribution $p^{(1)}_i \in \mathbb{R}^{C^{(1)}}$ over word-level tags, where $C^{(1)}$ is the number of POS classes. The model parameters of this layer can be learned not only from human-annotated data, but also by backpropagation from higher layers, which are described in the next section.

Dependency parsing is performed in the second layer. A hidden state $h^{(2)}_i$ is computed by $\overrightarrow{h}^{(2)}_i = \mathrm{LSTM}(\overrightarrow{h}^{(2)}_{i-1}, [x(w_i); y(w_i); \overrightarrow{h}^{(1)}_i])$ and $\overleftarrow{h}^{(2)}_i = \mathrm{LSTM}(\overleftarrow{h}^{(2)}_{i+1}, [x(w_i); y(w_i); \overleftarrow{h}^{(1)}_i])$, where $y(w_i) = W^{(1)}_{\ell} p^{(1)}_i \in \mathbb{R}^d$ is the POS information output from the first layer, and $W^{(1)}_{\ell} \in \mathbb{R}^{d \times C^{(1)}}$ is a weight matrix.

Then, the (soft) edges of our latent graph representation are obtained by computing the probabilities

$$p(H_{w_i} = w_j \mid w_i) = \frac{\exp(m(i,j))}{\sum_{k \neq i} \exp(m(i,k))}, \quad (1)$$

where $m(i,k) = h^{(2)\mathrm{T}}_k W_{dp} h^{(2)}_i$ ($1 \leq k \leq N+1$, $k \neq i$) is a scoring function with a weight matrix $W_{dp} \in \mathbb{R}^{d \times d}$. While the models of Hashimoto et al. (2017), Zhang et al. (2017), and Dozat and Manning (2017) learn the parameters of their parsing models only from human-annotated data, we allow the model parameters to be learned by the translation task.

Next, $[h^{(2)}_i; z(H_{w_i})]$ is fed into a softmax classifier to predict the probability distribution $p(\ell_{w_i} \mid w_i)$, where $z(H_{w_i}) \in \mathbb{R}^d$ is the weighted average of the hidden states of the parent nodes: $\sum_{j \neq i} p(H_{w_i} = w_j \mid w_i) h^{(2)}_j$. This results in the latent graph representation $(w_i, p(H_{w_i} \mid w_i), p(\ell_{w_i} \mid w_i))$ of the input sentence.
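The head-selection step of Equation (1) can be pictured with a few lines of numpy. This is an illustrative sketch only: random vectors stand in for the trained bi-LSTM states, and the function name is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 6, 8                       # toy sentence length and hidden size
H = rng.normal(size=(N + 1, d))   # h^(2)_1..N plus <EOS>/ROOT (stand-ins for bi-LSTM states)
W_dp = rng.normal(size=(d, d))    # scoring matrix of Equation (1)

def head_distribution(i):
    """p(H_{w_i} = w_k | w_i) via the bilinear score m(i, k) = h_k^T W_dp h_i."""
    scores = H @ (W_dp @ H[i])            # m(i, k) for all k
    scores[i] = -np.inf                   # exclude k = i
    e = np.exp(scores - scores.max())
    return e / e.sum()

# Soft edges for every word, and the head-weighted context z(H_{w_i})
# that is concatenated with h^(2)_i for dependency-label prediction.
P = np.stack([head_distribution(i) for i in range(N)])   # shape (N, N+1)
Z = P @ H                                                # z(H_{w_i}) for each word
print(P.shape, Z.shape, P.sum(axis=1))                   # rows of P sum to 1
```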
The latent graph representation described in Section 2 can be used for any sentence-level task, and here we apply it to an Attention-based NMT (ANMT) model (Luong et al., 2015). We modify the encoder and the decoder of the ANMT model to learn the latent graph representation.

The ANMT model first encodes the information about the input sentence and then generates a sentence in another language. The encoder represents the word $w_i$ with a word embedding $v_{enc}(w_i) \in \mathbb{R}^d$. It should be noted that $v_{enc}(w_i)$ is different from $v_{dp}(w_i)$ because each component is modeled separately. The encoder then takes the word embedding $v_{enc}(w_i)$ and the hidden state $h^{(2)}_i$ as the input to a uni-directional LSTM:

$$h^{(enc)}_i = \mathrm{LSTM}(h^{(enc)}_{i-1}, [v_{enc}(w_i); h^{(2)}_i]), \quad (2)$$

where $h^{(enc)}_i \in \mathbb{R}^d$ is the hidden state corresponding to $w_i$. That is, the encoder of our model is a three-layer LSTM network, where the first two layers are bi-directional.

In the sequential LSTMs, relationships between words in distant positions are not explicitly considered. In our model, we explicitly incorporate such relationships into the encoder by defining a dependency composition function:

$$\mathrm{dep}(w_i) = \tanh(W_{dep} [h^{(enc)}_i; h(H_{w_i}); p(\ell_{w_i} \mid w_i)]), \quad (3)$$

where $h(H_{w_i}) = \sum_{j \neq i} p(H_{w_i} = w_j \mid w_i) h^{(enc)}_j$ is the weighted average of the hidden states of the parent nodes.

Note on character n-gram embeddings: In NMT models, sub-word units are widely used to address rare or unknown word problems (Sennrich et al., 2016). In our model, the character $n$-gram embeddings are fed through the latent graph parsing component. To the best of our knowledge, character $n$-gram embeddings have not previously been used in NMT models. Wieting et al. (2016), Bojanowski et al. (2017), and Hashimoto et al. (2017) have reported that character $n$-gram embeddings are useful for improving several NLP tasks by better handling unknown words.

The decoder of our model is a single-layer LSTM network, and the initial state is set with $h^{(enc)}_{N+1}$ and its corresponding memory cell. Given the $t$-th hidden state $h^{(dec)}_t \in \mathbb{R}^d$, the decoder predicts the $t$-th word in the target language using an attention mechanism. The attention mechanism in Luong et al. (2015) computes the weighted average of the hidden states $h^{(enc)}_i$ of the encoder:

$$s(i,t) = \frac{\exp(h^{(dec)}_t \cdot h^{(enc)}_i)}{\sum_{j=1}^{N+1} \exp(h^{(dec)}_t \cdot h^{(enc)}_j)}, \quad (4)$$

$$a_t = \sum_{i=1}^{N+1} s(i,t)\, h^{(enc)}_i, \quad (5)$$

where $s(i,t)$ is a scoring function which specifies how much each source-side hidden state contributes to the word prediction.

In addition, like the attention mechanism over constituency tree nodes (Eriguchi et al., 2016b), our model uses attention to the dependency composition vectors:

$$s'(i,t) = \frac{\exp(h^{(dec)}_t \cdot \mathrm{dep}(w_i))}{\sum_{j=1}^{N} \exp(h^{(dec)}_t \cdot \mathrm{dep}(w_j))}, \quad (6)$$

$$a'_t = \sum_{i=1}^{N} s'(i,t)\, \mathrm{dep}(w_i). \quad (7)$$

To predict the target word, a hidden state $\tilde{h}^{(dec)}_t \in \mathbb{R}^d$ is then computed as follows:

$$\tilde{h}^{(dec)}_t = \tanh(\tilde{W} [h^{(dec)}_t; a_t; a'_t]), \quad (8)$$

where $\tilde{W} \in \mathbb{R}^{d \times 3d}$ is a weight matrix. $\tilde{h}^{(dec)}_t$ is fed into a softmax classifier to predict a target word distribution. $\tilde{h}^{(dec)}_t$ is also used in the transition of the decoder LSTM along with a word embedding $v_{dec}(w_t) \in \mathbb{R}^d$ of the target word $w_t$:

$$h^{(dec)}_{t+1} = \mathrm{LSTM}(h^{(dec)}_t, [v_{dec}(w_t); \tilde{h}^{(dec)}_t]), \quad (9)$$

where the use of $\tilde{h}^{(dec)}_t$ is called input feeding, as proposed by Luong et al. (2015).

The overall model parameters, including those of the latent graph parser, are jointly learned by minimizing the negative log-likelihood of the prediction probabilities of the target words in the training data.
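To make the dependency composition and the dual attention concrete, here is a minimal numpy sketch of Equations (3)-(8) for one decoding step. It is an illustrative re-implementation under simplifying assumptions, not the authors' code: random vectors stand in for trained LSTM states and distributions, the label set size is a toy value, and names such as `decode_step` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 6, 8
H_enc = rng.normal(size=(N + 1, d))          # h^(enc)_1..N+1 from Equation (2)
P = rng.dirichlet(np.ones(N + 1), size=N)    # soft head distributions p(H_{w_i}|w_i)
P_label = rng.dirichlet(np.ones(5), size=N)  # p(l_{w_i}|w_i), toy 5-label set
W_dep = rng.normal(size=(d, d + d + 5))      # Equation (3)
W_out = rng.normal(size=(d, 3 * d))          # \tilde{W} in Equation (8)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Dependency composition vectors dep(w_i), Equation (3).
H_head = P @ H_enc                                               # weighted parent states
Dep = np.tanh(np.concatenate([H_enc[:N], H_head, P_label], axis=1) @ W_dep.T)

def decode_step(h_dec):
    """One attention step, Equations (4)-(8), returning the attentional state."""
    a = softmax(H_enc @ h_dec) @ H_enc       # attention over encoder states
    a_dep = softmax(Dep @ h_dec) @ Dep       # attention over dep(w_i) vectors
    return np.tanh(W_out @ np.concatenate([h_dec, a, a_dep]))

print(decode_step(rng.normal(size=d)).shape)   # (d,)
```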
To speed up the training, we use BlackOut sampling (Ji et al., 2016). By this joint learning using Equations (3) and (7), the latent graph representations are automatically learned according to the target task.

Implementation Tips
Inspired by Zoph et al. (2016), we further speed up BlackOut sampling by sharing noise samples across words in the same sentences. This technique has proven to be effective in RNN language modeling, and we have found that it is also effective in the NMT model. We have also found it effective to share the model parameters of the target word embeddings and the softmax weight matrix for word prediction (Inan et al., 2016; Press and Wolf, 2017). Also, we have found that a parameter averaging technique (Hashimoto et al., 2013) is helpful in improving translation accuracy.
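A minimal sketch of the parameter-averaging tip, assuming checkpoints are stored as dictionaries of numpy arrays; the storage format and function name are hypothetical, and the paper does not specify an implementation.

```python
import numpy as np

def average_checkpoints(checkpoints):
    """Average parameters elementwise across saved checkpoints (dicts of arrays)."""
    avg = {}
    for name in checkpoints[0]:
        avg[name] = np.mean([ckpt[name] for ckpt in checkpoints], axis=0)
    return avg

# Toy usage: three half-epoch snapshots of a single weight matrix.
snapshots = [{"W_enc": np.full((2, 2), float(i))} for i in range(3)]
print(average_checkpoints(snapshots)["W_enc"])   # all entries 1.0
```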
Translation
At test time, we use a novel beam search algorithm which combines statistics of sentence lengths (Eriguchi et al., 2016b) and length normalization (Cho et al., 2014). During the beam search step, we use the following scoring function for a generated word sequence $y = (y_1, y_2, \ldots, y_{L_y})$ given a source word sequence $x = (x_1, x_2, \ldots, x_{L_x})$:

$$\frac{1}{L_y} \sum_{i=1}^{L_y} \log p(y_i \mid x, y_{<i}) + \log p(L_y \mid L_x), \quad (10)$$

where $p(L_y \mid L_x)$ is the probability that sentences of length $L_y$ are generated given source-side sentences of length $L_x$. The statistics are taken from the training data in advance. In our experiments, we have empirically found that this beam search algorithm helps the NMT models avoid generating translation sentences that are too short.
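As a rough illustration of Equation (10), the following Python sketch estimates the length statistics $p(L_y \mid L_x)$ from training length pairs and scores a hypothesis. It is not the authors' decoder; the function names, the smoothing, and the toy data are hypothetical.

```python
import math
from collections import Counter, defaultdict

# Length statistics p(L_y | L_x) estimated from (source, target) length pairs
# of the training data, as used in Equation (10).
def length_model(length_pairs, smoothing=1e-6):
    counts = defaultdict(Counter)
    for lx, ly in length_pairs:
        counts[lx][ly] += 1
    def p(ly, lx):
        total = sum(counts[lx].values())
        return (counts[lx][ly] + smoothing) / (total + smoothing) if total else smoothing
    return p

def hypothesis_score(token_log_probs, lx, p_len):
    """Equation (10): length-normalized log-likelihood plus log p(L_y | L_x)."""
    ly = len(token_log_probs)
    return sum(token_log_probs) / ly + math.log(p_len(ly, lx))

p_len = length_model([(10, 12), (10, 11), (10, 12), (8, 9)])   # toy statistics
print(hypothesis_score([-0.5] * 12, lx=10, p_len=p_len))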
We used an English-to-Japanese translation task of the Asian Scientific Paper Excerpt Corpus (ASPEC) (Nakazawa et al., 2016b) used in the Workshop on Asian Translation (WAT), since it has been shown that syntactic information is useful in English-to-Japanese translation (Eriguchi et al., 2016b; Neubig et al., 2015). We followed the data preprocessing instructions for the English-to-Japanese task in Eriguchi et al. (2016b). The English sentences were tokenized by the tokenizer in the Enju parser (Miyao and Tsujii, 2008), and the Japanese sentences were segmented by the KyTea tool. Among the first 1,500,000 translation pairs in the training data, we selected 1,346,946 pairs where the maximum sentence length is 50. In what follows, we call this dataset the large training dataset. We further selected the first 20,000 and 100,000 pairs to construct the small and medium training datasets, respectively. The development data include 1,790 pairs, and the test data 1,812 pairs.

For the small and medium datasets, we built the vocabulary with words whose minimum frequency is two, and for the large dataset, we used words whose minimum frequency is three for English and five for Japanese. As a result, the vocabulary of the target language was 8,593 for the small dataset, 23,532 for the medium dataset, and 65,680 for the large dataset. A special token <UNK> was used to replace words which were not included in the vocabularies. The character n-grams (n = 2, 3, 4) were also constructed from each training dataset with the same frequency settings.

We tuned the hyper-parameters of the model using the development data. We set d = 100 for the latent graph parser. The word and character n-gram embeddings of the latent graph parser were initialized with the pre-trained embeddings in Hashimoto et al. (2017) (available at https://github.com/hassyGo/charNgram2vec). The weight matrices in the latent graph parser were initialized with uniform random values in $[-\sqrt{6}/\sqrt{row+col}, +\sqrt{6}/\sqrt{row+col}]$, where row and col are the numbers of rows and columns of the matrices, respectively. All the bias vectors and the weight matrices in the softmax layers were initialized with zeros, and the bias vectors of the forget gates in the LSTMs were initialized with ones (Jozefowicz et al., 2015).

We set d = 128 for the small training dataset, d = 256 for the medium training dataset, and d = 512 for the large training dataset. The word embeddings and the weight matrices of the NMT model were initialized with uniform random values in [-0.1, +0.1]. The training was performed by mini-batch stochastic gradient descent with momentum. For the BlackOut objective (Ji et al., 2016), the number of negative samples was set to 2,000 for the small and medium training datasets, and 2,500 for the large training dataset. The mini-batch size was set to 128, and the momentum rate was set to 0.75 for the small and medium training datasets and 0.70 for the large training dataset. A gradient clipping technique was used with a clipping value of 1.0. The initial learning rate was set to 1.0, and the learning rate was halved when translation accuracy decreased. We used the BLEU scores obtained by greedy translation as the translation accuracy and checked it at every half epoch of the model training. We saved the model parameters at every half epoch and used the saved model parameters for the parameter averaging technique. For regularization, we used L2-norm regularization with a small coefficient and applied dropout (Hinton et al., 2012) to Equation (8) with a dropout rate of 0.2.

The beam size for the beam search algorithm was 12 for the small and medium training datasets, and 50 for the large training dataset. We used BLEU (Papineni et al., 2002), RIBES (Isozaki et al., 2010), and perplexity scores as our evaluation metrics. Note that lower perplexity scores indicate better accuracy.

The latent graph parser in our model can be optionally pre-trained by using human annotations for dependency parsing. In this paper we used the widely-used Wall Street Journal (WSJ) training data to jointly train the POS tagging and dependency parsing components. We used the standard training split (Sections 0-18) for POS tagging. We followed Chen and Manning (2014) to generate the training data (Sections 2-21) for dependency parsing. From each training dataset, we selected the first K sentences to pre-train our model. The training dataset for POS tagging includes 38,219 sentences, and that for dependency parsing includes 39,832 sentences.

The parser including the POS tagger was first trained for 10 epochs in advance according to the multi-task learning procedure of Hashimoto et al. (2017), and then the overall NMT model was trained. When pre-training the POS tagging and dependency parsing components, we did not apply dropout to the model and did not fine-tune the word and character n-gram embeddings, to avoid strong overfitting.
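For concreteness, a small helper illustrating the uniform initialization range described above. The $\sqrt{6}/\sqrt{row+col}$ bound follows the range as reconstructed from the text, and the helper name is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_weight(rows, cols):
    """Uniform initialization in [-sqrt(6)/sqrt(rows+cols), +sqrt(6)/sqrt(rows+cols)]."""
    bound = np.sqrt(6.0) / np.sqrt(rows + cols)
    return rng.uniform(-bound, bound, size=(rows, cols))

W = init_weight(100, 100)          # e.g. a parser weight matrix
print(W.min(), W.max())
```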
LGP-NMT is our proposed model that learns Latent Graph Parsing for NMT. LGP-NMT+ is constructed by pre-training the latent parser in LGP-NMT as described in Section 4.3.

SEQ is constructed by removing the dependency composition in Equation (3), forming a sequential NMT model with the multi-layer encoder.
DEP is constructed by using pre-trained dependency relations rather than learning them. That is, $p(H_{w_i} = w_j \mid w_i)$ is fixed to 1.0 such that $w_j$ is the head of $w_i$. The dependency labels are also given by the parser, which was trained by using all the training samples for parsing and tagging.

UNI is constructed by fixing $p(H_{w_i} = w_j \mid w_i)$ to $1/N$ for all the words in the same sentence. That is, uniform probability distributions are used to connect all the words equally.
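The DEP and UNI head distributions can be pictured with a few lines of numpy. This is a toy illustration, not the authors' code; the sentence length and head indices are hypothetical.

```python
import numpy as np

N = 5                                    # toy sentence length (plus <EOS>/ROOT)

def dep_heads(parser_heads):
    """DEP: one-hot head distributions fixed to a pre-trained parser's output."""
    P = np.zeros((N, N + 1))
    for i, head in enumerate(parser_heads):
        P[i, head] = 1.0
    return P

def uni_heads():
    """UNI: each word attends uniformly (weight 1/N) to all other positions."""
    P = np.full((N, N + 1), 1.0 / N)
    np.fill_diagonal(P, 0.0)             # no self-head; N remaining positions
    return P

print(dep_heads([2, 2, 5, 2, 2]))        # hypothetical head indices
print(uni_heads().sum(axis=1))           # each row sums to 1
```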
We first show our translation results using the small and medium training datasets. We report averaged scores with standard deviations across five different runs of the model training.

Table 1 shows the results of using the small training dataset. LGP-NMT performs worse than SEQ and UNI, which shows that the small training dataset is not enough to learn useful latent graph structures from scratch. However, LGP-NMT+ (K = 10,000) outperforms SEQ and UNI, and its standard deviations are the smallest. Therefore, the results suggest that pre-training the parsing and tagging components can improve the translation accuracy of our proposed model. We can also see that DEP performs the worst. This is not surprising because previous studies, e.g., Li et al. (2015), have reported that using syntactic structures does not always outperform competitive sequential models in several NLP tasks.

Table 1: Evaluation on the development data using the small training dataset (20,000 pairs).

Now that we have observed the effectiveness of pre-training our model, one question arises naturally: how many training samples for parsing and tagging are necessary for improving the translation accuracy? Table 2 shows the results of using different numbers of training samples for parsing and tagging. The results of K = 0 and K = 10,000 correspond to those of LGP-NMT and LGP-NMT+ in Table 1, respectively. We can see that using the small amount of training samples performs better than using all the training samples. (We did not observe such a significant difference when using the larger datasets, and we used all the training samples in the remainder of this paper.) One possible reason is that the domains of the translation dataset and the parsing (tagging) dataset are considerably different. The parsing and tagging datasets come from the WSJ, whereas the translation dataset comes from abstract text of scientific papers in a wide range of domains, such as biomedicine and computer science. These results suggest that our model can be improved by a small amount of parsing and tagging data in different domains. Considering the recent Universal Dependencies project (http://universaldependencies.org/), which covers more than 50 languages, our model has the potential of being applied to a variety of language pairs.

Table 2: Effects of the size K of the training datasets for POS tagging and dependency parsing.

Table 3 shows the results of using the medium training dataset. In contrast with using the small training dataset, LGP-NMT is slightly better than SEQ. LGP-NMT significantly outperforms UNI, which shows that our adaptive learning is more effective than using the uniform graph weights. By pre-training our model, LGP-NMT+ significantly outperforms SEQ in terms of the BLEU score. Again, DEP performs the worst among all the models.

Table 3: Evaluation on the development data using the medium training dataset (100,000 pairs).

By using our beam search strategy, the Brevity Penalty (BP) values of our translation results are equal to or close to 1.0, which is important when evaluating the translation results using BLEU scores. A BP value ranges from 0.0 to 1.0, and larger values mean that the translated sentences have relevant lengths compared with the reference translations. As a result, our BLEU evaluation results are affected only by the word n-gram precision scores. BLEU scores are sensitive to the BP values, and thus our beam search strategy leads to more solid evaluation for NMT models.

Table 4 shows the BLEU and RIBES scores on the development data achieved with the large training dataset.
Here we focus on our models and SEQ, because UNI and DEP consistently perform worse than the other models, as shown in Tables 1 and 3. The averaging technique and attention-based unknown word replacement (Jean et al., 2015; Hashimoto et al., 2016) improve the scores.

B./R.       Single        +Averaging    +UnkRep
LGP-NMT     38.05/81.98   38.44/82.23   38.77/82.29
LGP-NMT+    38.75/82.13   39.01/82.40   39.37/82.48
SEQ         38.24/81.84   38.26/82.14   38.61/82.18

Table 4: BLEU (B.) and RIBES (R.) scores on the development data using the large training dataset.
                                      BLEU    RIBES
LGP-NMT                               39.19   82.66
LGP-NMT+                              39.42   82.83
SEQ                                   38.96   82.18
Ensemble of the above three models    41.18   83.40
Cromieres et al. (2016)               38.20   82.39
Neubig et al. (2015)                  38.17   81.38
Eriguchi et al. (2016a)               36.95   82.45
Neubig and Duh (2014)                 36.58   79.65
Zhu (2015)                            36.21   80.91
Lee et al. (2015)                     35.75   81.15
Table 5: BLEU and RIBES scores on the test data.

Again, we see that the translation scores of our model can be further improved by pre-training the model.

Table 5 shows our results on the test data, together with the previous best results summarized in Nakazawa et al. (2016a) and on the WAT website (http://lotus.kuee.kyoto-u.ac.jp/WAT/evaluation/list.php?t=1&o=1). Our proposed models, LGP-NMT and LGP-NMT+, outperform not only SEQ but also all of the previous best results. Notice also that our implementation of the sequential model (SEQ) provides a very strong baseline, whose performance is already comparable to the previous state of the art, even without using ensemble techniques. The RIBES score of LGP-NMT+ is significantly better than that of SEQ according to a confidence interval (p ≤ 0.05) estimated by bootstrap resampling (Noreen, 1989), which shows that our latent parser can be effectively pre-trained with the human-annotated treebank.

The sequential NMT model in Cromieres et al. (2016) and the tree-to-sequence NMT model in Eriguchi et al. (2016b) rely on ensemble techniques, while our results mentioned above are obtained with single models. Moreover, our model is more compact than the previous best NMT model in Cromieres et al. (2016): our training time is within five days on a c4.8xlarge machine of Amazon Web Services with our CPU-based C++ code, while the reported training time in Cromieres et al. (2016) is more than two weeks with their GPU code. By applying the ensemble technique to LGP-NMT, LGP-NMT+, and SEQ, the BLEU and RIBES scores are further improved, and both of the scores are significantly better than the previous best scores.
Translation Example (1)
Input: As a result, it was found that a path which crosses a sphere obliquely existed.
Reference: その結果、球内部を斜めに横切る行路の存在することが分かった。
LGP-NMT: その結果、球を斜めに横切る経路が存在することが分かった。
LGP-NMT+: その結果、球を斜めに横切る経路が存在することが分かった。 (As a result, it was found that a path which obliquely crosses a sphere existed.)
Google trans: その結果、球を横切る経路が斜めに存在することが判明した。
SEQ: その結果、球を横断する経路が斜めに存在することが分かった。 (As a result, it was found that a path which crosses a sphere existed obliquely.)

Translation Example (2)
Input: The androgen controls negatively ImRNA.
Reference: ImRNA はアンドロゲンにより負に調節される。
LGP-NMT+: アンドロゲンは ImRNA を負に制御している。 (The androgen negatively controls ImRNA.)
Google trans: アンドロゲンは負の ImRNA を制御する。
LGP-NMT: アンドロゲンは負の ImRNA を制御する。 (The androgen controls negative ImRNA.)
SEQ: アンドロゲンは負の ImRNA を負に制御する。 (The androgen negatively controls negative ImRNA.)

Figure 2: English-to-Japanese translation examples focusing on the usage of adverbs.
Figure 2 shows two translation examples that illustrate how the proposed model works and what is missing in the state-of-the-art sequential NMT model, SEQ. The English input sentences were created by manually simplifying sentences in the development data. Besides the reference translation, the outputs of our models with and without pre-training, SEQ, and Google Translation are shown; the Google translations were obtained at https://translate.google.com in February and March 2017.

Selectional Preference
In the translation example (1) in Figure 2, we see that the adverb "obliquely" is interpreted differently across the systems. As in the reference translation, "obliquely" is a modifier of the verb "crosses". Our models correctly capture the relationship between the two words, whereas Google Translation and SEQ treat "obliquely" as a modifier of the verb "existed". This error is not a surprise, since the verb "existed" is located closer to "obliquely" than the verb "crosses". A possible reason for the correct interpretation by our models is that they can better capture long-distance dependencies and are less susceptible to surface word distances. This is an indication of our models' ability to capture domain-specific selectional preference that cannot be captured by purely sequential models. It should be noted that simply using standard treebank-based parsers does not necessarily address this error, because our pre-trained dependency parser interprets "obliquely" as a modifier of the verb "existed".
Adverb or Adjective
The translation example (2) in Figure 2 shows another example, where the adverb "negatively" is interpreted as an adverb or an adjective. As in the reference translation, "negatively" is a modifier of the verb "controls". Only LGP-NMT+ correctly captures the adverb-verb relationship, whereas "negatively" is interpreted as the adjective "negative" modifying the noun "ImRNA" in the translation results from Google Translation and LGP-NMT. SEQ interprets "negatively" as both an adverb and an adjective, which leads to the repeated translations. This error suggests that the state-of-the-art NMT models are strongly affected by the word order. By contrast, the pre-training strategy effectively embeds the information about the POS tags and the dependency relations into our model.
We inspected the latent graphs learned by LGP-NMT. Figure 1 shows an example of the learned latent graphs, obtained for a sentence taken from the development data of the translation task. It has long-range dependencies and cycles as well as ordinary left-to-right dependencies. We have observed that the punctuation mark "." is often pointed to by other words with large weights. This is primarily because the hidden state corresponding to the mark in each sentence has rich information about the sentence.

To measure the correlation between the latent graphs and human-defined dependencies, we parsed the sentences in the development data of the WSJ corpus and converted the graphs into dependency trees by Eisner's algorithm (Eisner, 1996). For evaluation, we followed Chen and Manning (2014) and measured the Unlabeled Attachment Score (UAS). The UAS is 24.52%, which shows that the implicitly-learned latent graphs are partially consistent with the human-defined syntactic structures. Similar trends have been reported by Yogatama et al. (2017) in the case of binary constituency parsing. We also checked the most dominant gold dependency labels assigned to the dependencies detected by LGP-NMT. The labels whose ratio is more than 3% are nn, amod, prep, pobj, dobj, nsubj, num, det, advmod, and poss. We see that dependencies between words in distant positions, such as subject-verb-object relations, can be captured.
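A sketch of how such a UAS evaluation might look. Note that the paper decodes trees from the soft graphs with Eisner's algorithm; the sketch below uses a simple per-word argmax instead, purely for illustration, and all data are toy values with hypothetical gold head indices.

```python
import numpy as np

def greedy_heads(P):
    """Pick the highest-weight head per word (a simplification; the paper
    decodes projective trees with Eisner's algorithm instead)."""
    return P.argmax(axis=1)

def uas(predicted_heads, gold_heads):
    """Unlabeled Attachment Score: fraction of words with the correct head."""
    predicted, gold = np.asarray(predicted_heads), np.asarray(gold_heads)
    return float((predicted == gold).mean())

rng = np.random.default_rng(0)
N = 6
P = rng.dirichlet(np.ones(N + 1), size=N)      # toy soft head distributions
gold = [2, 2, 6, 2, 2, 6]                      # hypothetical gold head indices
print(uas(greedy_heads(P), gold))
```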
Figure 3: An example of the pre-trained dependency structures (a) and its corresponding latent graph adapted by our model (b), for the sentence "All the calculated electronic band structures are metallic."

With Pre-Training

We also inspected the pre-trained latent graphs. Figure 3-(a) shows the dependency structure output by the pre-trained latent parser for the same sentence as in Figure 1. This is an ordinary dependency tree, and the head selection is almost deterministic; that is, for each word, the largest weight of the head selection is close to 1.0. By contrast, the weight values are more evenly distributed in the case of LGP-NMT, as shown in Figure 1. After the overall NMT model training, the latent parser is adapted to the translation task, and Figure 3-(b) shows the adapted latent graph. Again, we can see that the adapted weight values are distributed and different from the original pre-trained weight values, which suggests that human-defined syntax is not always optimal for the target task.

The UAS of the pre-trained dependency trees is 92.52% (significantly lower than the score reported in Hashimoto et al. (2017), for the reason described in Section 4.3), and that of the adapted latent graphs is 18.94%. Surprisingly, the resulting UAS (18.94%) is lower than the UAS of our model without pre-training (24.52%). However, in terms of the translation accuracy, our model with pre-training is better than that without pre-training. These results suggest that human-annotated treebanks can provide useful prior knowledge to guide the overall model training by pre-training, but the resulting sentence structures adapted to the target task do not need to highly correlate with the treebanks.
While initial studies on NMT treat each sentence as a sequence of words (Bahdanau et al., 2015; Luong et al., 2015; Sutskever et al., 2014), researchers have recently started investigating the use of syntactic structures in NMT models (Bastings et al., 2017; Chen et al., 2017; Eriguchi et al., 2016a,b, 2017; Li et al., 2017; Sennrich and Haddow, 2016; Stahlberg et al., 2016; Yang et al., 2017). In particular, Eriguchi et al. (2016b) introduced a tree-to-sequence NMT model by building a tree-structured encoder on top of a standard sequential encoder, which motivated the use of the dependency composition vectors in our proposed model. Prior to the advent of NMT, syntactic structures had been successfully used in statistical machine translation systems (Neubig and Duh, 2014; Yamada and Knight, 2001). These syntax-based approaches are pipelined; a syntactic parser is first trained by supervised learning using a treebank such as the WSJ dataset, and then the parser is used to automatically extract syntactic information for machine translation. They rely on the output from the parser, and therefore parsing errors are propagated through the whole system. By contrast, our model allows the parser to be adapted to the translation task, thereby providing a first step towards addressing ambiguous syntactic and semantic problems, such as domain-specific selectional preference and PP attachments, in a task-oriented fashion.

Our model learns latent graph structures in the source-side language. Eriguchi et al. (2017) have proposed a model which learns to parse and translate by using automatically-parsed data. Thus, it is also an interesting direction to learn latent structures in the target-side language.

As for the learning of latent syntactic structures, there are several studies on learning task-oriented syntactic structures. Yogatama et al. (2017) used a reinforcement learning method on shift-reduce action sequences to learn task-oriented binary constituency trees. They have shown that the learned trees do not necessarily highly correlate with the human-annotated treebanks, which is consistent with our experimental results. Socher et al. (2011) used a recursive autoencoder model to greedily construct a binary constituency tree for each sentence. The autoencoder objective works as a regularization term for sentiment classification tasks. Prior to these deep learning approaches, Wu (1997) presented a method for bilingual parsing. One of the characteristics of our model is that it directly uses the soft connections of the graph edges with real-valued weights, whereas all of the above-mentioned methods use one best structure for each sentence. Our model is based on dependency structures, and it is a promising future direction to jointly learn dependency and constituency structures in a task-oriented fashion.

Finally, more closely related to our model, Kim et al. (2017) applied their structured attention networks to a Natural Language Inference (NLI) task for learning dependency-like structures. They showed that pre-training their model with a parsing dataset did not improve accuracy on the NLI task. By contrast, our experiments show that such a parsing dataset can be effectively used to improve translation accuracy by varying the size of the dataset and by avoiding strong overfitting. Moreover, our translation examples show the concrete benefit of learning task-oriented latent graph structures.
We have presented an end-to-end NMT model that jointly learns translation and source-side latent graph representations. By pre-training our model using treebank annotations, our model significantly outperforms both a pipelined syntax-based model and a state-of-the-art sequential model. On English-to-Japanese translation, our model outperforms the previous best models by a large margin. In future work, we will investigate the effectiveness of our approach in different types of target tasks.
Acknowledgments
We thank the anonymous reviewers and Akiko Eriguchi for their helpful comments and suggestions. We also thank Yuchen Qiao and Kenjiro Taura for their help in speeding up our training code. This work was supported by CREST, JST, and JSPS KAKENHI Grant Number 17J09620.
References
Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural Machine Translation by Jointly Learning to Align and Translate. In Proceedings of the 3rd International Conference on Learning Representations.

Joost Bastings, Ivan Titov, Wilker Aziz, Diego Marcheggiani, and Khalil Sima'an. 2017. Graph Convolutional Encoders for Syntax-aware Neural Machine Translation. arXiv, cs.CL 1704.04675.

Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2017. Enriching Word Vectors with Subword Information. Transactions of the Association for Computational Linguistics, 5:135-146.

Danqi Chen and Christopher Manning. 2014. A Fast and Accurate Dependency Parser using Neural Networks. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, pages 740-750.

Huadong Chen, Shujian Huang, David Chiang, and Jiajun Chen. 2017. Improved Neural Machine Translation with a Syntax-Aware Encoder and Decoder. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). To appear.

Kyunghyun Cho, Bart van Merrienboer, Dzmitry Bahdanau, and Yoshua Bengio. 2014. On the Properties of Neural Machine Translation: Encoder-Decoder Approaches. In Proceedings of SSST-8, Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation, pages 103-111.

Fabien Cromieres, Chenhui Chu, Toshiaki Nakazawa, and Sadao Kurohashi. 2016. Kyoto University Participation to WAT 2016. In Proceedings of the 3rd Workshop on Asian Translation, pages 166-174.

Timothy Dozat and Christopher D. Manning. 2017. Deep Biaffine Attention for Neural Dependency Parsing. In Proceedings of the 5th International Conference on Learning Representations.

Jason Eisner. 1996. Efficient Normal-Form Parsing for Combinatory Categorial Grammar. In Proceedings of the 34th Annual Meeting of the Association for Computational Linguistics, pages 79-86.

Akiko Eriguchi, Kazuma Hashimoto, and Yoshimasa Tsuruoka. 2016a. Character-based Decoding in Tree-to-Sequence Attention-based Neural Machine Translation. In Proceedings of the 3rd Workshop on Asian Translation, pages 175-183.

Akiko Eriguchi, Kazuma Hashimoto, and Yoshimasa Tsuruoka. 2016b. Tree-to-Sequence Attentional Neural Machine Translation. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 823-833.

Akiko Eriguchi, Yoshimasa Tsuruoka, and Kyunghyun Cho. 2017. Learning to Parse and Translate Improves Neural Machine Translation. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). To appear.

Alex Graves and Jurgen Schmidhuber. 2005. Framewise Phoneme Classification with Bidirectional LSTM and Other Neural Network Architectures. Neural Networks, 18(5):602-610.

Kazuma Hashimoto, Akiko Eriguchi, and Yoshimasa Tsuruoka. 2016. Domain Adaptation and Attention-Based Unknown Word Replacement in Chinese-to-Japanese Neural Machine Translation. In Proceedings of the 3rd Workshop on Asian Translation, pages 75-83.

Kazuma Hashimoto, Makoto Miwa, Yoshimasa Tsuruoka, and Takashi Chikayama. 2013. Simple Customization of Recursive Neural Networks for Semantic Relation Classification. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1372-1376.

Kazuma Hashimoto, Caiming Xiong, Yoshimasa Tsuruoka, and Richard Socher. 2017. A Joint Many-Task Model: Growing a Neural Network for Multiple NLP Tasks. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing. To appear.

Geoffrey E. Hinton, Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2012. Improving neural networks by preventing co-adaptation of feature detectors. CoRR, abs/1207.0580.

Hakan Inan, Khashayar Khosravi, and Richard Socher. 2016. Tying Word Vectors and Word Classifiers: A Loss Framework for Language Modeling. arXiv, cs.CL 1611.01462.

Hideki Isozaki, Tsutomu Hirao, Kevin Duh, Katsuhito Sudoh, and Hajime Tsukada. 2010. Automatic Evaluation of Translation Quality for Distant Language Pairs. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pages 944-952.

Sébastien Jean, Orhan Firat, Kyunghyun Cho, Roland Memisevic, and Yoshua Bengio. 2015. Montreal Neural Machine Translation Systems for WMT'15. In Proceedings of the Tenth Workshop on Statistical Machine Translation, pages 134-140.

Shihao Ji, S. V. N. Vishwanathan, Nadathur Satish, Michael J. Anderson, and Pradeep Dubey. 2016. BlackOut: Speeding up Recurrent Neural Network Language Models With Very Large Vocabularies. In Proceedings of the 4th International Conference on Learning Representations.

Rafal Jozefowicz, Wojciech Zaremba, and Ilya Sutskever. 2015. An Empirical Exploration of Recurrent Network Architectures. In Proceedings of the 32nd International Conference on Machine Learning, pages 2342-2350.

Yoon Kim, Carl Denton, Luong Hoang, and Alexander M. Rush. 2017. Structured Attention Networks. In Proceedings of the 5th International Conference on Learning Representations.

Hyoung-Gyu Lee, JaeSong Lee, Jun-Seok Kim, and Chang-Ki Lee. 2015. NAVER Machine Translation System for WAT 2015. In Proceedings of the 2nd Workshop on Asian Translation, pages 69-73.

Jiwei Li, Thang Luong, Dan Jurafsky, and Eduard Hovy. 2015. When Are Tree Structures Necessary for Deep Learning of Representations? In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 2304-2314.

Junhui Li, Deyi Xiong, Zhaopeng Tu, Muhua Zhu, Min Zhang, and Guodong Zhou. 2017. Modeling Source Syntax for Neural Machine Translation. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). To appear.

Thang Luong, Hieu Pham, and Christopher D. Manning. 2015. Effective Approaches to Attention-based Neural Machine Translation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1412-1421.

Ryan McDonald, Koby Crammer, and Fernando Pereira. 2005. Online Large-Margin Training of Dependency Parsers. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics, pages 91-98.

Yusuke Miyao and Jun'ichi Tsujii. 2008. Feature Forest Models for Probabilistic HPSG Parsing. Computational Linguistics, 34(1):35-80.

Toshiaki Nakazawa, Hideya Mino, Chenchen Ding, Isao Goto, Graham Neubig, Sadao Kurohashi, and Eiichiro Sumita. 2016a. Overview of the 3rd Workshop on Asian Translation. In Proceedings of the 3rd Workshop on Asian Translation (WAT2016).

Toshiaki Nakazawa, Manabu Yaguchi, Kiyotaka Uchimoto, Masao Utiyama, Eiichiro Sumita, Sadao Kurohashi, and Hitoshi Isahara. 2016b. ASPEC: Asian Scientific Paper Excerpt Corpus. In Proceedings of the 10th Conference on International Language Resources and Evaluation.

Graham Neubig and Kevin Duh. 2014. On the Elements of an Accurate Tree-to-String Machine Translation System. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 143-149.

Graham Neubig, Makoto Morishita, and Satoshi Nakamura. 2015. Neural Reranking Improves Subjective Quality of Machine Translation: NAIST at WAT2015. In Proceedings of the 2nd Workshop on Asian Translation (WAT2015), pages 35-41.

Eric W. Noreen. 1989. Computer-Intensive Methods for Testing Hypotheses: An Introduction. Wiley-Interscience.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: A Method for Automatic Evaluation of Machine Translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311-318.

Ofir Press and Lior Wolf. 2017. Using the Output Embedding to Improve Language Models. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, pages 157-163.

Rico Sennrich and Barry Haddow. 2016. Linguistic Input Features Improve Neural Machine Translation. In Proceedings of the First Conference on Machine Translation, pages 83-91.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural Machine Translation of Rare Words with Subword Units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1715-1725.

Richard Socher, Jeffrey Pennington, Eric H. Huang, Andrew Y. Ng, and Christopher D. Manning. 2011. Semi-Supervised Recursive Autoencoders for Predicting Sentiment Distributions. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, pages 151-161.

Felix Stahlberg, Eva Hasler, Aurelien Waite, and Bill Byrne. 2016. Syntactically Guided Neural Machine Translation. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 299-305.

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to Sequence Learning with Neural Networks. In Advances in Neural Information Processing Systems 27, pages 3104-3112.

John Wieting, Mohit Bansal, Kevin Gimpel, and Karen Livescu. 2016. Charagram: Embedding Words and Sentences via Character n-grams. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1504-1515.

Dekai Wu. 1997. Stochastic Inversion Transduction Grammars and Bilingual Parsing of Parallel Corpora. Computational Linguistics, 23(3):377-404.

Kenji Yamada and Kevin Knight. 2001. A Syntax-based Statistical Translation Model. In Proceedings of the 39th Annual Meeting of the Association for Computational Linguistics, pages 523-530.

Baosong Yang, Derek F. Wong, Tong Xiao, Lidia S. Chao, and Jingbo Zhu. 2017. Towards Bidirectional Hierarchical Representations for Attention-Based Neural Machine Translation. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing. To appear.

Dani Yogatama, Phil Blunsom, Chris Dyer, Edward Grefenstette, and Wang Ling. 2017. Learning to Compose Words into Sentences with Reinforcement Learning. In Proceedings of the 5th International Conference on Learning Representations.

Xingxing Zhang, Jianpeng Cheng, and Mirella Lapata. 2017. Dependency Parsing as Head Selection. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics, pages 665-676.

Zhongyuan Zhu. 2015. Evaluating Neural Machine Translation in English-Japanese Task. In Proceedings of the 2nd Workshop on Asian Translation, pages 61-68.

Barret Zoph, Ashish Vaswani, Jonathan May, and Kevin Knight. 2016. Simple, Fast Noise-Contrastive Estimation for Large RNN Vocabularies. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies.