Learning to Jointly Translate and Predict Dropped Pronouns with a Shared Reconstruction Mechanism
Longyue Wang
Tencent AI Lab [email protected]
Zhaopeng Tu ∗ Tencent AI Lab [email protected]
Andy Way
Dublin City University [email protected]
Qun Liu
Huawei Noah’s Ark Lab [email protected]
Abstract
Pronouns are frequently omitted in pro-drop languages, such as Chinese, generally leading to significant challenges with respect to the production of complete translations. Recently, Wang et al. (2018) proposed a novel reconstruction-based approach to alleviating dropped pronoun (DP) translation problems for neural machine translation models. In this work, we improve the original model from two perspectives. First, we employ a shared reconstructor to better exploit encoder and decoder representations. Second, we jointly learn to translate and predict DPs in an end-to-end manner, to avoid the errors propagated from an external DP prediction model. Experimental results show that our approach significantly improves both translation performance and DP prediction accuracy.
1 Introduction

Pronouns are important in natural languages as they imply rich discourse information. However, in pro-drop languages such as Chinese and Japanese, pronouns are frequently omitted when their referents can be pragmatically inferred from the context. When translating sentences from a pro-drop language into a non-pro-drop language (e.g. Chinese-to-English),
translation models generally fail to translate invisible dropped pronouns (DPs). This phenomenon leads to various translation problems in terms of completeness, syntax and even semantics of translations. A number of approaches have been investigated for DP translation (Le Nagard and Koehn, 2010; Xiang et al., 2013; Wang et al., 2016, 2018).

Wang et al. (2018) is a pioneering work to model DP translation for neural machine translation (NMT) models. They employ two separate reconstructors (Tu et al., 2017) to respectively reconstruct encoder and decoder representations back to the DP-annotated source sentence. The annotation of DPs is provided by an external prediction model, which is trained on the parallel corpus using automatically learned alignment information (Wang et al., 2016). Although this model achieved significant improvements, there nonetheless exist two drawbacks: 1) there is no interaction between the two separate reconstructors, which misses the opportunity to exploit useful relations between encoder and decoder representations; and 2) the external DP prediction model only has an accuracy of 66% in F1-score, which propagates numerous errors to the translation model.

In this work, we propose to improve the original model from two perspectives. First, we use a shared reconstructor to read hidden states from both encoder and decoder. Second, we integrate a DP predictor into NMT to jointly learn to translate and predict DPs. Incorporating these as two auxiliary loss terms can guide both the encoder and decoder states to learn critical information relevant to DPs. Experimental results on a large-scale Chinese–English subtitle corpus show that the two modifications accumulatively improve translation performance, and the best result is +1.5 BLEU points better than that reported by Wang et al. (2018). In addition, the jointly learned DP prediction model significantly outperforms its external counterpart by 9% in F1-score.

∗ Zhaopeng Tu is the corresponding author of the paper. This work was conducted when Longyue Wang was studying and Qun Liu was working at the ADAPT Centre in the School of Computing at Dublin City University.
2 Background

As shown in Figure 1, Wang et al. (2018) introduced two independent reconstructors with their own parameters, which reconstruct the DP-annotated source sentence from the encoder and decoder hidden states, respectively.

Figure 1: Architecture of separate reconstructors.
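For reference, the training objective of this separate-reconstructor baseline can be sketched as follows. The formula is our paraphrase based on the description above and on Tu et al. (2017); the notation, in particular $\gamma_{enc}$ and $\gamma_{dec}$ for the two reconstructors' parameters, is not quoted from Wang et al. (2018).

```latex
% Sketch (our paraphrase) of the baseline objective with two separate
% reconstructors, one over encoder states and one over decoder states.
J(\theta, \gamma_{enc}, \gamma_{dec}) = \arg\max_{\theta, \gamma_{enc}, \gamma_{dec}}
  \Big\{ \log L(y \mid x; \theta)
       + \log R_{enc}(\hat{x} \mid h^{enc}; \theta, \gamma_{enc})
       + \log R_{dec}(\hat{x} \mid h^{dec}; \theta, \gamma_{dec}) \Big\}
```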
The central idea underpinning their approach is to guide the corresponding hidden states to embed the recalled source-side DP information and subsequently to help the NMT model generate the missing pronouns with these enhanced hidden representations.

The DPs can be automatically annotated for training and test data using two different strategies (Wang et al., 2016). In the training phase, where the target sentence is available, we annotate DPs for the source sentence using alignment information. These annotated source sentences can be used to build a neural-based DP predictor, which can then be used to annotate test sentences, since the target sentence is not available during the testing phase. As shown in Table 1, Wang et al. (2016, 2018) explored predicting the exact DP words, the accuracy of which is only 66% in F1-score. By analyzing the translation outputs, we found that 16.2% of errors are newly introduced and caused by errors from the DP predictor. Fortunately, the accuracy of predicting DP positions (DPPs) is much higher, which provides the chance to alleviate the error propagation problem. Intuitively, we can learn to generate DPs at the predicted positions using a jointly trained DP predictor, which is fed with informative representations in the reconstructor. (Unless otherwise indicated, in this paper the terms "DP" and "DP word" are identical.)

Prediction     F1-score   Example
DP Position    88%        你 烤 的 吗 ?
DP Words       66%        你 烤 的 它 吗 ?

Table 1: Evaluation of external models on predicting the positions of DPs ("DP Position") and the exact words of DPs ("DP Words").
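To make the training-phase annotation concrete, the following is a minimal sketch of the alignment-based strategy: an English pronoun that is aligned to nothing on the Chinese side is treated as dropped and written back into the source sentence. The pronoun inventory, alignment format and placement heuristic are illustrative assumptions for this sketch, not the implementation of Wang et al. (2016).

```python
# Simplified, hypothetical sketch of alignment-based DP annotation for the
# training phase (after the idea of Wang et al., 2016); not the authors' code.

EN_PRONOUNS = {"i", "you", "he", "she", "it", "we", "they"}  # assumed inventory

def annotate_dps(src_tokens, tgt_tokens, alignment):
    """src_tokens: Chinese tokens; tgt_tokens: English tokens;
    alignment: iterable of (src_idx, tgt_idx) pairs from an automatic aligner."""
    alignment = set(alignment)
    aligned_tgt = {j for _, j in alignment}
    inserts = []
    for j, word in enumerate(tgt_tokens):
        if word.lower() in EN_PRONOUNS and j not in aligned_tgt:
            # An English pronoun with no Chinese counterpart is treated as a DP.
            # Heuristic placement: before the source word aligned to the next
            # aligned target token (the real placement rules are more elaborate).
            pos = min((i for i, jj in alignment if jj > j), default=len(src_tokens))
            inserts.append((pos, word.lower()))
    annotated = list(src_tokens)
    for pos, pron in sorted(inserts, reverse=True):
        annotated.insert(pos, pron)        # DP annotation: insert the pronoun itself
        # annotated.insert(pos, "<DP>")    # DPP annotation: insert a position tag
    return annotated
```

Inserting a generic placeholder instead of the recovered word (the commented-out line) yields the DPP annotation used in the rest of this paper.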
3 Approach

3.1 Shared Reconstructor

Recent work shows that NMT models can benefit from sharing a component across different tasks and languages. Taking multi-language translation as an example, Firat et al. (2016) share an attention model across languages, while Dong et al. (2015) share an encoder. Our work is most similar to the work of Zoph and Knight (2016) and Anastasopoulos and Chiang (2018), which share a decoder and two separate attention models to read from two different sources. In contrast, we share information at the level of reconstructed frames.

The architecture of our proposed shared reconstruction model is shown in Figure 2(a).

Figure 2: Model architectures, in which the words in red are automatically annotated DPs and DPPs. (a) Shared reconstructor. (b) Shared reconstructor with joint prediction.

Formally, the reconstructor reads from both the encoder and decoder hidden states, as well as the DP-annotated source sentence, and outputs a reconstruction score. It uses two separate attention models to reconstruct the annotated source sentence $\hat{x} = \{\hat{x}_1, \hat{x}_2, \ldots, \hat{x}_T\}$ word by word, and the reconstruction score is computed by

$R(\hat{x} \mid h^{enc}, h^{dec}) = \prod_{t=1}^{T} g_r(\hat{x}_{t-1}, h^{rec}_t, \hat{c}^{enc}_t, \hat{c}^{dec}_t)$

where $h^{rec}_t$ is the hidden state in the reconstructor, computed by Equation (1):

$h^{rec}_t = f_r(\hat{x}_{t-1}, h^{rec}_{t-1}, \hat{c}^{enc}_t, \hat{c}^{dec}_t)$   (1)

Here $g_r(\cdot)$ and $f_r(\cdot)$ are respectively softmax and activation functions for the reconstructor. The context vectors $\hat{c}^{enc}_t$ and $\hat{c}^{dec}_t$ are the weighted sums of $h^{enc}$ and $h^{dec}$, respectively, as in Equations (2) and (3):

$\hat{c}^{enc}_t = \sum_{j=1}^{J} \hat{\alpha}^{enc}_{t,j} \cdot h^{enc}_j$   (2)

$\hat{c}^{dec}_t = \sum_{i=1}^{I} \hat{\alpha}^{dec}_{t,i} \cdot h^{dec}_i$   (3)

Note that the weights $\hat{\alpha}^{enc}$ and $\hat{\alpha}^{dec}$ are calculated by two separate attention models. We propose two attention strategies, which differ as to whether the two attention models have interactions or not.

Independent Attention calculates the two weight matrices independently, as in Equations (4) and (5):

$\hat{\alpha}^{enc} = \mathrm{ATT}_{enc}(\hat{x}_{t-1}, h^{rec}_{t-1}, h^{enc})$   (4)

$\hat{\alpha}^{dec} = \mathrm{ATT}_{dec}(\hat{x}_{t-1}, h^{rec}_{t-1}, h^{dec})$   (5)

where $\mathrm{ATT}_{enc}(\cdot)$ and $\mathrm{ATT}_{dec}(\cdot)$ are two separate attention models with their own parameters.
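For illustration, below is a minimal sketch of one reconstruction step with the independent-attention variant. The concatenation-based attention scorer, the GRU update, and all module names and dimensions are assumptions made for this sketch; our actual implementation follows the released code of Wang et al. (2018).

```python
import torch
import torch.nn as nn

class SharedReconstructorStep(nn.Module):
    """One step of the shared reconstructor with independent attention,
    following Eqs. (1)-(5); shapes and the scoring form are illustrative."""
    def __init__(self, emb_dim, enc_dim, dec_dim, rec_dim, vocab_size):
        super().__init__()
        self.att_enc = nn.Linear(emb_dim + rec_dim + enc_dim, 1)      # ATT_enc, Eq. (4)
        self.att_dec = nn.Linear(emb_dim + rec_dim + dec_dim, 1)      # ATT_dec, Eq. (5)
        self.cell = nn.GRUCell(emb_dim + enc_dim + dec_dim, rec_dim)  # f_r, Eq. (1)
        self.readout = nn.Linear(emb_dim + rec_dim + enc_dim + dec_dim, vocab_size)  # g_r

    def attend(self, layer, query, states):
        # query: (batch, q_dim); states: (batch, length, s_dim)
        q = query.unsqueeze(1).expand(-1, states.size(1), -1)
        scores = layer(torch.cat([q, states], dim=-1)).squeeze(-1)    # (batch, length)
        alpha = torch.softmax(scores, dim=-1)                         # attention weights
        return torch.bmm(alpha.unsqueeze(1), states).squeeze(1)       # context, Eqs. (2)/(3)

    def forward(self, x_prev_emb, h_rec_prev, h_enc, h_dec):
        query = torch.cat([x_prev_emb, h_rec_prev], dim=-1)
        c_enc = self.attend(self.att_enc, query, h_enc)               # \hat{c}^{enc}_t
        c_dec = self.attend(self.att_dec, query, h_dec)               # \hat{c}^{dec}_t
        h_rec = self.cell(torch.cat([x_prev_emb, c_enc, c_dec], dim=-1), h_rec_prev)
        logits = self.readout(torch.cat([x_prev_emb, h_rec, c_enc, c_dec], dim=-1))
        return h_rec, logits, c_enc, c_dec
```

The softmax over the returned logits corresponds to $g_r(\cdot)$, and multiplying the probabilities of the reference words $\hat{x}_t$ over all steps yields the reconstruction score $R$.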
Interactive Attention feeds the context vector produced by one attention model into the other attention model. The intuition behind this is that the interaction between the two attention models can lead to a better exploitation of the encoder and decoder representations. As the interactive attention is directional, we have two options (Equations (6) and (7)), which modify either $\mathrm{ATT}_{enc}(\cdot)$ or $\mathrm{ATT}_{dec}(\cdot)$ while leaving the other one unchanged:

• enc→dec: $\hat{\alpha}^{dec} = \mathrm{ATT}_{dec}(\hat{x}_{t-1}, h^{rec}_{t-1}, h^{dec}, \hat{c}^{enc}_t)$   (6)

• dec→enc: $\hat{\alpha}^{enc} = \mathrm{ATT}_{enc}(\hat{x}_{t-1}, h^{rec}_{t-1}, h^{enc}, \hat{c}^{dec}_t)$   (7)
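Continuing the sketch above under the same assumptions, the enc→dec variant only changes how the decoder-side attention query is built: the encoder context computed first is appended to it, as in Equation (6); the dec→enc variant is symmetric.

```python
# Continues the SharedReconstructorStep sketch above (torch, nn imported there).
class InteractiveEncToDecStep(SharedReconstructorStep):
    """enc->dec interactive attention (Eq. (6)): the encoder context is fed
    into the decoder-side attention model; dimensions remain assumptions."""
    def __init__(self, emb_dim, enc_dim, dec_dim, rec_dim, vocab_size):
        super().__init__(emb_dim, enc_dim, dec_dim, rec_dim, vocab_size)
        # ATT_dec now also reads the encoder context \hat{c}^{enc}_t.
        self.att_dec = nn.Linear(emb_dim + rec_dim + enc_dim + dec_dim, 1)

    def forward(self, x_prev_emb, h_rec_prev, h_enc, h_dec):
        query = torch.cat([x_prev_emb, h_rec_prev], dim=-1)
        c_enc = self.attend(self.att_enc, query, h_enc)                           # Eq. (4)
        c_dec = self.attend(self.att_dec, torch.cat([query, c_enc], dim=-1), h_dec)  # Eq. (6)
        h_rec = self.cell(torch.cat([x_prev_emb, c_enc, c_dec], dim=-1), h_rec_prev)
        logits = self.readout(torch.cat([x_prev_emb, h_rec, c_enc, c_dec], dim=-1))
        return h_rec, logits, c_enc, c_dec
```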
3.2 Jointly Predicting DPs

Inspired by recent successes of multi-task learning (Dong et al., 2015; Luong et al., 2016), we propose to jointly learn to translate and predict DPs (as shown in Figure 2(b)). To ease the learning difficulty, we leverage the information of DPPs predicted by an external model, which achieves an accuracy of 88% in F1-score. Accordingly, we transform the original DP prediction problem into DP word generation given the pre-predicted DP positions. Since the DPP-annotated source sentence serves as the reconstructed input, we introduce an additional DP-generation loss, which measures how well the DPs are generated from the corresponding hidden states in the reconstructor.

Let $dp = \{dp_1, dp_2, \ldots, dp_D\}$ be the list of DPs in the annotated source sentence, and $h^{rec} = \{h^{rec}_1, h^{rec}_2, \ldots, h^{rec}_D\}$ be the corresponding hidden states in the reconstructor. The generation probability is computed by

$P(dp \mid h^{rec}) = \prod_{d=1}^{D} P(dp_d \mid h^{rec}_d) = \prod_{d=1}^{D} g_p(dp_d \mid h^{rec}_d)$   (8)

where $g_p(\cdot)$ is a softmax for the DP predictor.

3.3 Training and Testing

We train both the encoder-decoder and the shared reconstructor together in a single end-to-end process, and the training objective is

$J(\theta, \gamma, \psi) = \arg\max_{\theta, \gamma, \psi} \Big\{ \underbrace{\log L(y \mid x; \theta)}_{\text{likelihood}} + \underbrace{\log R(\hat{x} \mid h^{enc}, h^{dec}; \theta, \gamma)}_{\text{reconstruction}} + \underbrace{\log P(dp \mid \hat{h}^{rec}; \theta, \gamma, \psi)}_{\text{prediction}} \Big\}$   (9)

where $\{\theta, \gamma, \psi\}$ are respectively the parameters associated with the encoder-decoder, the shared reconstructor, and the DP prediction model. The auxiliary reconstruction objective $R(\cdot)$ guides the related part of the parameter matrix $\theta$ to learn better latent representations, which are used to reconstruct the DPP-annotated source sentence. The auxiliary prediction loss $P(\cdot)$ guides the related parts of both the encoder-decoder and the reconstructor to learn better latent representations, which are used to predict the DPs in the source sentence.
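In an implementation, the three terms of Equation (9) simply become a sum of cross-entropy losses over the translation, reconstruction and DP-prediction outputs. The sketch below assumes equal loss weights and omits padding masks; all variable names are illustrative.

```python
import torch.nn.functional as F

def joint_loss(trans_logits, y, rec_logits, x_hat, dp_logits, dp_words):
    """Negative of Eq. (9). The *_logits tensors are (batch, length, vocab) and
    the targets are (batch, length) index tensors; unweighted terms and the
    absence of padding masks are simplifying assumptions of this sketch."""
    likelihood = F.cross_entropy(trans_logits.transpose(1, 2), y)        # -log L(y|x)
    reconstruction = F.cross_entropy(rec_logits.transpose(1, 2), x_hat)  # -log R(x_hat|.)
    prediction = F.cross_entropy(dp_logits.transpose(1, 2), dp_words)    # -log P(dp|h_rec)
    return likelihood + reconstruction + prediction
```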
Following Tu et al. (2017) and Wang et al. (2018), we use the reconstruction score as a reranking technique to select the best translation candidate from the generated n-best list at testing time. Different from Wang et al. (2018), we reconstruct the DPP-annotated source sentence, which is predicted by an external model.

4 Experiments

To compare our work with the results reported by previous work (Wang et al., 2018), we conducted experiments on their released Chinese⇒English TV Subtitle corpus (https://github.com/longyuewangdcu/tvsub). The training, validation, and test sets contain 2.15M, 1.09K, and 1.15K sentence pairs, respectively. We used the case-insensitive 4-gram NIST BLEU metric (Papineni et al., 2002) for evaluation, and the sign-test (Collins et al., 2005) to test for statistical significance.

We implemented our models on the code repository released by Wang et al. (2018) (https://github.com/tuzhaopeng/nmt). We used the same configurations (e.g. vocabulary size = 30K, hidden size = 1000) and reproduced their reported results. It should be emphasized that we did not use the pre-training strategy as done in Wang et al. (2018), since we found that training from scratch achieved a better performance in the shared reconstructor setting.

#  Model                                               #Params   Train   Decode   BLEU
   Existing system (Wang et al., 2018)
1  Baseline                                              86.7M   1.60K    15.23   31.80
2  Baseline (+DPs)                                        86.7M   1.59K    15.20   32.67
3  Separate-Recs ⇒ (+DPs)                                +73.8M   0.57K    12.00   35.08
   Our system
4  Baseline (+DPPs)                                       86.7M   1.54K    15.19   33.18
5  Shared-Rec independent ⇒ (+DPPs)                      +86.6M   0.52K    11.87   35.27†‡
6  Shared-Rec independent ⇒ (+DPPs) + joint prediction   +87.9M   0.51K    11.88   35.88†‡
7  Shared-Rec enc→dec ⇒ (+DPPs) + joint prediction       +91.9M   0.48K    11.84   †‡
8  Shared-Rec dec→enc ⇒ (+DPPs) + joint prediction       +89.9M   0.49K    11.85   35.99†‡

Table 2: Evaluation of translation performance for Chinese–English. "Baseline" is trained and evaluated on the original data, while "Baseline (+DPs)" and "Baseline (+DPPs)" are trained on the data annotated with DPs and DPPs, respectively. Training and decoding (beam size is 10) speeds are measured in words per second. "†" and "‡" indicate a statistically significant difference (p < .) from "Baseline (+DPPs)" and "Separate-Recs ⇒ (+DPs)", respectively.

Table 2 shows the translation results. It is clear that the proposed models significantly outperform the baselines in all cases, although there are considerable differences among the different variations.
Baselines (Rows 1–4): The three baselines (Rows 1, 2, and 4) differ regarding the training data used. "Separate-Recs ⇒ (+DPs)" (Row 3) is the best model reported in Wang et al. (2018), which we employed as another strong baseline. The baseline trained on the DPP-annotated data ("Baseline (+DPPs)", Row 4) outperforms the other two counterparts, indicating that the error propagation problem does affect the performance of translating DPs. This suggests the necessity of jointly learning to translate and predict DPs.

Our Models (Rows 5–8): Using our shared reconstructor (Row 5) not only outperforms the corresponding baseline (Row 4), but also surpasses its separate reconstructor counterpart (Row 3). Introducing a joint prediction objective (Row 6) achieves a further improvement of +0.61 BLEU points. These results verify that the shared reconstructor and jointly predicting DPs can accumulatively improve translation performance.

Among the variations of shared reconstructors (Rows 6–8), we found that an interactive attention from encoder to decoder (Row 7) achieves the best performance, which is +3.45 BLEU points better than our baseline (Row 4) and +1.45 BLEU points better than the best result reported by Wang et al. (2018) (Row 3). We attribute the superior performance of "Shared-Rec enc→dec" to the fact that the attention context over encoder representations embeds useful DP information, which can help to better attend to the representations of the corresponding pronouns on the decoder side. Similar to Wang et al. (2018), the proposed approach improves BLEU scores at the cost of decreased training and decoding speed, which is due to the large number of newly introduced parameters resulting from the incorporation of reconstructors into the NMT model.
Model      P      R      F1
External   0.67   0.65   0.66
Joint      0.74   0.76   0.75

Table 3: Evaluation of DP prediction accuracy. The "External" model is separately trained on DP-annotated data with external neural methods (Wang et al., 2016), while the "Joint" model is jointly trained with the NMT model (Section 3.2).
DP Prediction Accuracy
As shown in Table 3, the jointly learned model significantly outperforms the external one by 9% in F1-score. We attribute this to the useful contextual information embedded in the reconstructor representations, which are used to generate the exact DP words.
Model                   Test    Δ
Baseline (+DPPs)        33.18   –
Separate-Recs (+DPs)    34.02   +0.84
Shared-Rec (+DPPs)

Table 4: Translation results when reconstruction is used in training only, while not used in testing.

Contribution Analysis

Table 4 lists translation results when the reconstruction model is used in training only. We can see that the proposed model outperforms both the strong baseline and the best model reported in Wang et al. (2018). This is encouraging since no extra resources and computation are introduced to online decoding, which makes the approach highly practical, for example for translation in industry applications.
Model                   Auto.   Man.    Δ
Separate-Recs (+DPs)    35.08   38.38   +3.30
Shared-Rec (+DPPs)      36.53   38.94   +2.41

Table 5: Translation performance gap ("Δ") between manually ("Man.") and automatically ("Auto.") labelling DPs/DPPs for input sentences in testing.
Effect of DPP Labelling Accuracy
For each sentence in testing, the DPs and DPPs are labelled automatically by two separate external prediction models, whose accuracies are respectively 66% and 88% measured in F1-score. We investigate the best performance the models can achieve with manual labelling, which can be regarded as an "Oracle", as shown in Table 5. As can be seen, there still exists a significant gap in performance, which could be narrowed by improving the accuracy of our DPP generator. In addition, our models show a relatively smaller distance from the oracle performance ("Man."), indicating that the error propagation problem is alleviated to some extent.
5 Conclusion

In this paper, we proposed effective approaches to translating DPs with NMT models: a shared reconstructor and jointly learning to translate and predict DPs. Through experiments we verified that 1) shared reconstruction is helpful for sharing knowledge between the encoder and decoder; and 2) joint learning of the DP prediction model indeed alleviates the error propagation problem by improving prediction accuracy. The two approaches accumulatively improve translation performance. The method is not restricted to the DP translation task and could potentially be applied to other sequence generation problems where additional source-side information could be incorporated.

In future work we plan to: 1) build a fully end-to-end NMT model for DP translation, which does not depend on any external component (i.e. the DPP predictor);
2) exploit cross-sentence context (Wang et al., 2017) to further improve DP translation; and 3) investigate a new research strand that adapts our model to the inverse translation direction by learning to drop pronouns instead of recovering DPs.

Acknowledgments
The ADAPT Centre for Digital Content Technology is funded under the SFI Research Centres Programme (Grant 13/RC/2106) and is co-funded under the European Regional Development Fund. We thank the anonymous reviewers for their insightful comments.
References
Antonios Anastasopoulos and David Chiang. 2018. Tied multitask learning for neural speech translation. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 82–91, New Orleans, Louisiana, USA.

Michael Collins, Philipp Koehn, and Ivona Kucerova. 2005. Clause restructuring for statistical machine translation. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics, pages 531–540, Ann Arbor, Michigan, USA.

Daxiang Dong, Hua Wu, Wei He, Dianhai Yu, and Haifeng Wang. 2015. Multi-task learning for multiple language translation. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics, pages 1723–1732, Beijing, China.

Orhan Firat, Kyunghyun Cho, and Yoshua Bengio. 2016. Multi-way, multilingual neural machine translation with a shared attention mechanism. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 866–875, San Diego, California, USA.

Ronan Le Nagard and Philipp Koehn. 2010. Aiding pronoun translation with co-reference resolution. In Proceedings of the Joint 5th Workshop on Statistical Machine Translation and MetricsMATR, pages 252–261, Uppsala, Sweden.

Minh-Thang Luong, Quoc V. Le, Ilya Sutskever, Oriol Vinyals, and Lukasz Kaiser. 2016. Multi-task sequence to sequence learning. In Proceedings of the 4th International Conference on Learning Representations, San Juan, Puerto Rico.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, Philadelphia, Pennsylvania, USA.

Zhaopeng Tu, Yang Liu, Lifeng Shang, Xiaohua Liu, and Hang Li. 2017. Neural machine translation with reconstruction. In Proceedings of the 31st AAAI Conference on Artificial Intelligence, pages 3097–3103, San Francisco, California, USA.

Longyue Wang, Zhaopeng Tu, Shuming Shi, Tong Zhang, Yvette Graham, and Qun Liu. 2018. Translating pro-drop languages with reconstruction models. In Proceedings of the 32nd AAAI Conference on Artificial Intelligence, pages 4937–4945, New Orleans, Louisiana, USA.

Longyue Wang, Zhaopeng Tu, Andy Way, and Qun Liu. 2017. Exploiting cross-sentence context for neural machine translation. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2816–2821, Copenhagen, Denmark.

Longyue Wang, Zhaopeng Tu, Xiaojun Zhang, Hang Li, Andy Way, and Qun Liu. 2016. A novel approach for dropped pronoun translation. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 983–993, San Diego, California, USA.

Bing Xiang, Xiaoqiang Luo, and Bowen Zhou. 2013. Enlisting the ghost: Modeling empty categories for machine translation. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics, pages 822–831, Sofia, Bulgaria.

Barret Zoph and Kevin Knight. 2016. Multi-source neural translation. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 30–34, San Diego, California, USA.