Improving Character-based Decoding Using Target-Side Morphological Information for Neural Machine Translation
Peyman Passban, Qun Liu, Andy Way
ADAPT Centre, School of Computing, Dublin City University, Ireland
[email protected]
Abstract
Recently, neural machine translation (NMT) has emerged as a powerful alternative to conventional statistical approaches. However, its performance drops considerably in the presence of morphologically rich languages (MRLs). Neural engines usually fail to tackle the large vocabulary and high out-of-vocabulary (OOV) word rate of MRLs. Therefore, it is not suitable to exploit existing word-based models to translate this set of languages. In this paper, we propose an extension to the state-of-the-art model of Chung et al. (2016), which works at the character level and boosts the decoder with target-side morphological information. In our architecture, an additional morphology table is plugged into the model. Each time the decoder samples from a target vocabulary, the table sends auxiliary signals from the most relevant affixes in order to enrich the decoder's current state and constrain it to provide better predictions. We evaluated our model to translate English into German, Russian, and Turkish as three MRLs and observed significant improvements.
1 Introduction

Morphologically complex words (MCWs) are multi-layer structures which consist of different subunits, each of which carries semantic information and has a specific syntactic role. Table 1 gives a Turkish example to show this type of complexity. This example is a clear indication that word-based models are not suitable to process such complex languages. Accordingly, when translating MRLs, it might not be a good idea to treat words as atomic units, as this demands a large vocabulary that imposes extra overhead. Since MCWs can appear in various forms we require a very large vocabulary to i) cover as many morphological forms and words as we can, and ii) reduce the number of OOVs. Neural models by their nature are complex, and we do not want to make them more complicated by working with large vocabularies. Furthermore, even if we have quite a large vocabulary set, clearly some words would remain uncovered by it. This means that a large vocabulary not only complicates the entire process, but also does not necessarily mitigate the OOV problem. For these reasons we propose an NMT engine which works at the character level.

Word                           Translation
terbiye                        good manners
terbiye.siz                    rude
terbiye.siz.lik                rudeness
terbiye.siz.lik.leri           their rudeness
terbiye.siz.lik.leri.nden      from their rudeness

Table 1: Illustrating subword units in MCWs. The bold-faced part indicates the stem.
In this paper, we focus on translating into MRLs and issues associated with word formation on the target side. To provide a better translation we do not necessarily need a large target lexicon, as an MCW can be gradually formed during decoding by means of its subunits, similar to the solution proposed in character-based decoding models (Chung et al., 2016). Generating a complex word character-by-character is a better approach compared to word-level sampling, but it has other disadvantages.

One character can co-occur with another with almost no constraint, but a particular word or morpheme can only collocate with a very limited number of other constituents. Unlike words, characters are not meaning-bearing units and do not preserve syntactic information, so (in the extreme case) the chance of the decoder sampling each character is almost equal across characters, whereas this situation is less likely for words. The only constraint that prioritizes which character should be sampled is the information stored in the decoder, which we believe is insufficient to cope with all ambiguities. Furthermore, when everything is segmented into characters, a target sentence with a limited number of words is changed into a very long sequence of characters, which clearly makes it harder for the decoder to remember such a long history. Accordingly, character-based information flows in the decoder may not be as informative as word- or morpheme-based information.

In the character-based NMT model everything is almost the same as in its word-based counterpart except the target vocabulary, whose size is considerably reduced from thousands of words to just hundreds of characters. If we consider the decoder as a classifier, it should in principle be able to perform much better over hundreds of classes (characters) rather than thousands (words), but the performance of character-based models is almost the same as, or only slightly better than, their word-based versions. This underlines the fact that the character-based decoder is perhaps not fed with sufficient information to provide improved performance compared to word-based models.

Character-level decoding limits the search space by dramatically reducing the size of the target vocabulary, but at the same time widens the search space by working with characters, whose sampling seems to be harder than that of words. The freedom in the selection and sampling of characters can mislead the decoder, which prevents us from taking maximum advantage of character-level decoding. If we can control the selection process with other constraints, we may obtain further benefit from restricting the vocabulary set, which is the main goal followed in this paper.

In order to address the aforementioned problems we redesign the neural decoder in three different scenarios. In the first scenario we equip the decoder with an additional morphology table including target-side affixes. We place an attention module on top of the table which is controlled by the decoder. At each step, as the decoder samples a character, it searches the table to find the most relevant information which can enrich its state. Signals sent from the table can be interpreted as additional constraints. In the second scenario we share the decoder between two output channels. The first one samples the target character and the other one predicts the morphological annotation of the character.
This multi-tasking approach forces the decoder to send morphology-aware information to the final layer, which results in better predictions. In the third scenario we combine these two models. Section 3 provides more details on our models.

Together with different findings that will be discussed in the next sections, there are two main contributions in this paper. We redesigned and tuned the NMT framework for translating into MRLs. It is quite challenging to show the impact of external knowledge such as morphological information in neural models, especially in the presence of large parallel corpora. However, our models are able to incorporate morphological information into decoding and boost its quality. We inject the decoder with morphological properties of the target language. Furthermore, the novel architecture proposed here is not limited to morphological information alone and is flexible enough to provide other types of information for the decoder.

There are several models for NMT of MRLs which are designed to deal with morphological complexities. García-Martínez et al. (2016) and Sennrich and Haddow (2016) adapted the factored machine translation approach to neural models. Morphological annotations can be treated as extra factors in such models. Jean et al. (2015) proposed a model to handle very large vocabularies. Luong et al. (2015) addressed the problem of rare words and OOVs with the help of a post-translation phase to exchange unknown tokens with their potential translations. Sennrich et al. (2016) used subword units for NMT. The model relies on frequent subword units instead of words. Costa-jussà and Fonollosa (2016) designed a model for translating from MRLs. The model encodes source words with a convolutional module proposed by Kim et al. (2016). Each word is represented by a convolutional combination of its characters.

Luong and Manning (2016) used a hybrid model for representing words. In their model, unseen and complex words are encoded with a character-based representation, with other words encoded via the usual surface-form embeddings. Vylomova et al. (2016) compared different representation models (word-, morpheme-, and character-level models) which try to capture complexities on the source side, for the task of translating from MRLs.

Chung et al. (2016) proposed an architecture which benefits from different segmentation schemes. On the encoder side, words are segmented into subunits with the byte-pair segmentation model (bpe) (Sennrich et al., 2016), and on the decoder side, one target character is produced at each time step. Accordingly, the target sequence is treated as a long chain of characters without explicit segmentation. Grönroos et al. (2017) focused on translating from English into Finnish and implicitly incorporated morphological information into NMT through multi-task learning. Passban (2018) comprehensively studied the problem of translating MRLs and addressed potential challenges in the field.

Among all the models reviewed in this section, the network proposed by Chung et al. (2016) can be seen as the best alternative for translating into MRLs, as it works at the character level on the decoder side and it was evaluated in different settings on different languages. Consequently, we consider it as the baseline model in our experiments.
We propose a compatible neural architecture for translating into MRLs. The model benefits from subword- and character-level information and improves upon the state-of-the-art model of Chung et al. (2016). We manipulated the model to incorporate morphological information and developed three new extensions, which are discussed in Sections 3.1, 3.2, and 3.3.
In the first extension an additional table containing the morphological information of the target language is plugged into the decoder to assist with word formation. Each time the decoder samples from the target vocabulary, it searches the morphology table to find the most relevant affixes given its current state. Items selected from the table act as guiding signals to help the decoder sample a better character.

Our base model is an encoder-decoder model with attention (Bahdanau et al., 2014), implemented using gated recurrent units (GRUs) (Cho et al., 2014). We use a four-layer model in our experiments. Similar to Chung et al. (2016) and Wu et al. (2016), we use bidirectional units to encode the source sequence. Bidirectional GRUs are placed only at the input layer. The forward GRU reads the input sequence in its original order and the backward GRU reads the input in the reverse order. Each hidden state of the encoder at one time step is a concatenation of the forward and backward states at the same time step. This type of bidirectional processing provides a richer representation of the input sequence.

On the decoder side, one target character is sampled from the target vocabulary at each time step. In the original encoder-decoder model, the probability of predicting the next token $y_i$ is estimated based on i) the current hidden state of the decoder, ii) the last predicted token, and iii) the context vector. This process can be formulated as $p(y_i | y_1, \dots, y_{i-1}, x) = g(h_i, y_{i-1}, c_i)$, where $g(\cdot)$ is a softmax function, $y_i$ is the target token (to be predicted), $x$ is the representation of the input sequence, $h_i$ is the decoder's hidden state at the $i$-th time step, and $c_i$ indicates the context vector, which is a weighted summary of the input sequence generated by the attention module. $c_i$ is generated via the procedure shown in (1):

$$c_i = \sum_{j=1}^{n} \alpha_{ij} s_j, \qquad \alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{n} \exp(e_{ik})}; \qquad e_{ij} = a(s_j, h_{i-1}) \qquad (1)$$

where $\alpha_{ij}$ denotes the weight of the $j$-th hidden state of the encoder ($s_j$) when the decoder predicts the $i$-th target token, and $a(\cdot)$ is a combinatorial function which can be modeled through a simple feed-forward connection. $n$ is the length of the input sequence.

In our first extension, the prediction probability is conditioned on one more constraint in addition to those three existing ones, as in $p(y_i | y_1, \dots, y_{i-1}, x) = g(h_i, y_{i-1}, c_i, c^m_i)$, where $c^m_i$ is the morphological context vector and carries information from those useful affixes which can enrich the decoder's information. $c^m_i$ is generated via an attention module over the morphology table which works in a similar manner to the word-based attention model.
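As an illustration of how a context vector such as $c_i$ in (1) can be computed, the following NumPy sketch shows one possible realization; this is not the authors' code, and the one-hidden-layer form of the scorer $a(\cdot)$ as well as all parameter names and shapes are assumptions. The same routine can later be pointed at the morphology table to obtain $c^m_i$.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - x.max())
    return e / e.sum()

def attention(h_prev, states, W_s, W_h, v):
    """Context vector as in (1).

    h_prev : decoder state h_{i-1}, shape (d_h,)
    states : vectors being attended over (encoder states s_1..s_n here,
             affix embeddings f_1..f_|A| later), shape (n, d_s)
    W_s, W_h, v : parameters of the feed-forward scorer a(.)
                  (assumed shapes: (d_s, d_a), (d_h, d_a), (d_a,))
    """
    scores = np.tanh(states @ W_s + h_prev @ W_h) @ v   # e_ij = a(s_j, h_{i-1})
    weights = softmax(scores)                            # alpha_ij (or beta_iu)
    context = weights @ states                           # c_i = sum_j alpha_ij * s_j
    return context, weights
```

For example, with four encoder states of size 8 and a decoder state of size 8, the function returns an 8-dimensional context vector and four weights that sum to one; in practice this computation sits inside the recurrent decoder, but it is isolated here to keep the correspondence with (1) visible.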
The attention procedure for generating $c^m_i$ is formulated as in (2):

$$c^m_i = \sum_{u=1}^{|\mathcal{A}|} \beta_{iu} f_u, \qquad \beta_{iu} = \frac{\exp(e^m_{iu})}{\sum_{v=1}^{|\mathcal{A}|} \exp(e^m_{iv})}; \qquad e^m_{iu} = a^m(f_u, h_{i-1}) \qquad (2)$$

where $f_u$ represents the embedding of the $u$-th affix ($u$-th column) in the morphology/affix table $\mathcal{A}$, $\beta_{iu}$ is the weight assigned to $f_u$ when predicting the $i$-th target token, and $a^m$ is a feed-forward connection between the morphology table and the decoder.

The attention module in general can be considered as a search mechanism, e.g. in the original encoder-decoder architecture the basic attention module finds the most relevant input words to make the prediction. In multi-modal NMT (Huang et al., 2016; Calixto et al., 2017) an extra attention module is added to the basic one in order to search the image input and find the most relevant image segments. In our case we have a similar additional attention module which searches the morphology table.

In this scenario, the morphology table including the target language's affixes can be considered as an external knowledge repository that sends auxiliary signals which accompany the main input sequence at all time steps. Such a table certainly includes useful information for the decoder. As we are not sure which affix preserves those pieces of useful information, we use an attention module to search for the best match. The attention module over the table works as a filter which excludes irrelevant affixes and amplifies the impact of relevant ones by assigning different weights ($\beta$ values).
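Continuing the sketch above, and again only as an illustration with assumed parameter names, the affix table can be treated as just another set of attendable vectors, and the resulting $c^m_i$ is appended to the features that feed the character softmax:

```python
import numpy as np
# Reuses attention() and softmax() from the previous sketch.

def morphology_context(h_prev, affix_table, Wm_f, Wm_h, vm):
    """Compute c^m_i as in (2): attention over the affix embeddings f_u,
    with a separate scorer a^m (parameters Wm_f, Wm_h, vm).
    The beta weights act as a soft filter over the affix inventory."""
    c_m, beta = attention(h_prev, affix_table, Wm_f, Wm_h, vm)
    return c_m, beta

def extended_readout(h_i, y_prev_emb, c_i, c_m_i, W_out, b_out):
    """g(h_i, y_{i-1}, c_i, c^m_i): a softmax over the target character set,
    fed with the concatenation of all four conditioning vectors."""
    features = np.concatenate([h_i, y_prev_emb, c_i, c_m_i])
    return softmax(features @ W_out + b_out)
```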
In the first scenario, we embedded a morphology table into the decoder in the hope that it can enrich the sampling information. Mathematically speaking, such an architecture establishes an extra constraint for sampling and can control the decoder's predictions. However, this is not the only way of constraining the decoder. In the second scenario, we add extra supervision to the network via another predictor (output channel). The first channel is responsible for generating translations and predicts one character at each time step, and the other one tries to understand the morphological status of the decoder by predicting the morphological annotation ($l_i$) of the target character.

The approach in the second scenario proposes a multi-task learning architecture, by which in one task we learn translations and in the other one morphological annotations. Therefore, all network modules, especially the last hidden layer just before the predictors, should provide information which is useful enough to make correct predictions in both channels, i.e. the decoder should preserve translation as well as morphological knowledge. Since we are translating into MRLs, this type of mixed information (morphology+translation) can be quite useful.

In our setting, the morphological annotation $l_i$ predicted via the second channel shows to which part of the word or morpheme the target character belongs, i.e. the label for the character is the morpheme that includes it. We clarify the prediction procedure via an example from our training set (see Section 4). When the Turkish word 'terbiyesizlik' is generated, the first channel is supposed to predict t, e, r, up to k, one after another. For the same word, the second channel is supposed to predict stem-C for the first seven steps, as the first seven characters 'terbiye' belong to the stem of the word. The C sign indicates that stem-C is a class label. The second channel should also predict siz-C when the first channel predicts s (eighth character), i (ninth character), and z (tenth character), and lik-C when the first channel samples the last three characters. Clearly, the second channel is a classifier which works over the {stem-C, siz-C, lik-C, ...} classes. Figure 1 illustrates a segment of a sentence including this Turkish word and explains which class tags should be predicted by each channel.

Figure 1: The target label that each output channel is supposed to predict when generating the Turkish sequence 'bu terbiyesizlik için', meaning 'because of this rudeness'.

To implement the second scenario we require a single-source double-target training corpus: [source sentence] → [sequence of target characters & sequence of morphological annotations] (see Section 4). The objective function should also be manipulated accordingly. Given a training set $\{x_t, y_t, m_t\}_{t=1}^{T}$, the goal is to maximize the joint objective shown in (3):

$$\lambda \sum_{t=1}^{T} \log P(y_t | x_t; \theta) + (1 - \lambda) \sum_{t=1}^{T} \log P(m_t | x_t; \theta) \qquad (3)$$

where $x_t$ is the $t$-th input sentence whose translation is the sequence of target characters $y_t$, $m_t$ is the sequence of morphological annotations, and $T$ is the size of the training set. $\theta$ is the set of network parameters and $\lambda$ is a scalar to balance the contribution of each term. $\lambda$ is adjusted on the development set during training.
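For one training pair, the objective in (3) can be sketched as follows; this is an illustrative NumPy snippet rather than the training code, and it assumes the per-step output distributions of the two channels and the gold character and annotation sequences are already available:

```python
import numpy as np

def sequence_log_prob(step_dists, gold_ids):
    """Sum of log-probabilities of the gold symbols under per-step distributions."""
    return sum(np.log(dist[g]) for dist, g in zip(step_dists, gold_ids))

def joint_loss(char_dists, gold_chars, morph_dists, gold_labels, lam):
    """Negative joint objective of (3) for a single training example.

    char_dists / morph_dists: per-time-step distributions from the two channels
    gold_chars: target character ids; gold_labels: morphological class ids
                (e.g. stem-C, siz-C, lik-C in the 'terbiyesizlik' example above)
    lam: the balancing scalar lambda, tuned on the development set
    """
    ll_translation = sequence_log_prob(char_dists, gold_chars)
    ll_morphology = sequence_log_prob(morph_dists, gold_labels)
    return -(lam * ll_translation + (1.0 - lam) * ll_morphology)
```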
In the first scenario, we aim to provide the decoder with useful information about morphological properties of the target language, but we are not sure whether the signals sent from the table are what we really need. They might be helpful or even harmful, so there should be a mechanism to control their quality. In the second scenario we have a similar problem, as the last layer requires some information to predict the correct morphological class through the second channel, but there is no guarantee that the information in the decoder is sufficient for this sort of prediction. In order to address these problems, in the third extension we combine both scenarios, as they are complementary and can potentially help each other.

The morphology table acts as an additional useful source of knowledge as it already consists of affixes, but its content should be adapted according to the decoder and its actual needs. Accordingly, we need a trainer to update the table properly. The extra prediction channel plays this role for us, as it forces the network to predict the target language's affixes at the output layer. The error computed in the second channel is back-propagated through the network, including the morphology table, and updates its affix information into what the decoder actually needs for its prediction. Therefore, the second output channel helps us train better affix embeddings.

The morphology table also helps the second predictor. Without the table, the last layer only includes information about the input sequence and previously predicted outputs, which is not directly related to morphological information. The second attention module retrieves useful affixes from the morphology table and concatenates them to the last layer, which means the decoder is explicitly fed with morphological information. Therefore, these two modules mutually help each other. The external channel helps update the morphology table with high-quality affixes (backward pass) and the table sends its high-quality signals to the prediction layer (forward pass). The relation between these modules and the NMT architecture is illustrated in Figure 2.

Figure 2: The architecture of the NMT model with an auxiliary prediction channel and an extra morphology table. This network includes only one decoder layer and one encoder layer. ⊕ shows the attention modules.

As previously reviewed, different models try to capture complexities on the encoder side, but to the best of our knowledge the only model which proposes a technique to deal with complex constituents on the decoder side is that of Chung et al. (2016), which should be an appropriate baseline for our comparisons. Moreover, it outperforms other existing NMT models, so we prefer to compare our network to the best existing model. This model is referred to as CDNMT in our experiments. In the next sections we first explain our experimental setting, corpora, and how we build the morphology table (Section 4.1), and then report experimental results (Section 4.2).
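Before turning to the experiments, the following compact sketch summarizes how the pieces described in this section fit together in a single decoding step of the combined model (cf. Figure 2). It reuses attention() and softmax() from the earlier sketch; the plain tanh recurrence standing in for the GRU, the parameter names in the dictionary, and all shapes are illustrative assumptions, not the authors' implementation.

```python
import numpy as np
# Reuses attention() and softmax() from the earlier sketch.

def decoder_step(h_prev, y_prev_emb, enc_states, affix_table, p):
    """One decoding step of the combined model (third extension), mirroring Figure 2.
    A plain tanh recurrence stands in for the GRU; p is a dict of parameters."""
    # attention over encoder states (alpha weights) and over the affix table (beta weights)
    c_i, alpha = attention(h_prev, enc_states, p["W_s"], p["W_h"], p["v"])
    c_m, beta = attention(h_prev, affix_table, p["Wm_f"], p["Wm_h"], p["vm"])
    # recurrent state update from the previous character embedding and both contexts
    h_i = np.tanh(np.concatenate([y_prev_emb, c_i, c_m]) @ p["W_in"] + h_prev @ p["U_rec"])
    # shared features feed two output channels: the next character and its morphological class
    features = np.concatenate([h_i, y_prev_emb, c_i, c_m])
    p_char = softmax(features @ p["W_char"] + p["b_char"])     # channel 1: target character
    p_morph = softmax(features @ p["W_morph"] + p["b_morph"])  # channel 2: annotation l_i
    return h_i, p_char, p_morph, (alpha, beta)
```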
4.1 Experimental Setting

In order to make our work comparable, we follow the same experimental setting used in CDNMT, including the GRU size, the affix and word embedding size, and the beam width. Our models are trained using stochastic gradient descent with Adam (Kingma and Ba, 2015). Chung et al. (2016) and Sennrich et al. (2016) demonstrated that bpe boosts NMT, so similar to CDNMT we also preprocess the source side of our corpora using bpe. We use WMT-15 corpora to train the models, newstest-2013 for tuning, and newstest-2015 as the test sets. For English–Turkish (En–Tr) we use the OpenSubtitles2016 collection (Lison and Tiedemann, 2016). The training sides of the English–German (En–De), English–Russian (En–Ru), and En–Tr corpora contain a few million parallel sentences each. We randomly select a few thousand sentences for each of the development and test sets for En–Tr. For all language pairs we keep the most frequent characters as the target-side character set and replace the remainder (infrequent characters) with a specific character.

One of the key modules in our architecture is the morphology table. In order to implement it we use a look-up table whose columns include embeddings for the target language's affixes (each column represents one affix), which are updated during training. As previously mentioned, the table is intended to provide useful morphological information, so it should be initialized properly, for which we use a morphology-aware embedding-learning model. To this end, we use the neural language model of Botha and Blunsom (2014), in which each word is represented via a linear combination of the embedding of its surface form and the embeddings of its subunits, e.g. the representation of terbiyesizlik is the sum of the surface-form embedding of terbiyesizlik and the embeddings of terbiye, siz, and lik. Given a sequence of words, the neural language model tries to predict the next word, so it learns sentence-level dependencies as well as intra-word relations. The model trains surface-form and subword-level embeddings, which provides us with high-quality affix embeddings.

Our neural language model is a recurrent network with a single GRU layer, which is trained on the target sides of our parallel corpora. Before training the neural language model, we need to manipulate the training corpus to decompose words into morphemes, for which we use Morfessor (Smit et al., 2014), an unsupervised morphological analyzer. Using Morfessor, each word is segmented into different subunits, where we consider the longest part as the stem of each word; what appears before the stem is taken as a member of the set of prefixes (there might be one or more prefixes) and what follows the stem is considered as a member of the set of suffixes.

Since Morfessor is an unsupervised analyzer, in order to minimize segmentation errors and avoid noisy results we filter its output and exclude rarely occurring subunits (the frequency cut-off may seem a little high, but for a corpus with millions of words it is not a strict threshold in practice). After decomposing, filtering, and separating stems from affixes, we extracted the affixes reported in Table 2. We emphasize that there might be wrong segmentations in Morfessor's output, e.g. Turkish is a suffix-based language, so there are no prefixes in this language, but based on what Morfessor generated we extracted 11 different types of prefixes. We do not post-process Morfessor's outputs.

Language    Prefix    Suffix
German          75       160
Russian        110       260
Turkish         11       293

Table 2: The number of affixes extracted for each language.

Using the neural language model we train word, stem, and affix embeddings, and initialize the look-up table (but not other parts) of the decoder using those affix embeddings. The look-up table thus includes high-quality affixes trained on the target side of the parallel corpus with which we train the translation model. Clearly, such an affix table is an additional knowledge source for the decoder. It preserves information which is very close to what the decoder actually needs. However, there might be some missing pieces of information or some incompatibility between the decoder and the table, so we do not freeze the morphology table during training, but let the decoder update it with respect to its needs in the forward and backward passes.
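The affix-inventory construction just described can be sketched as follows. The snippet assumes that word segmentations have already been produced by Morfessor and are available as lists of morphs (the call to Morfessor itself is omitted), and MIN_COUNT is a placeholder for the unspecified frequency cut-off; the resulting affix set is what the embedding look-up table is sized and initialized for.

```python
from collections import Counter

MIN_COUNT = 500   # placeholder for the (unspecified) frequency cut-off

def split_affixes(morphs):
    """Longest morph is taken as the stem; what precedes it are prefixes,
    what follows it are suffixes (as described in Section 4.1)."""
    stem_idx = max(range(len(morphs)), key=lambda i: len(morphs[i]))
    return morphs[:stem_idx], morphs[stem_idx], morphs[stem_idx + 1:]

def build_affix_inventory(segmented_corpus):
    """segmented_corpus: iterable of words, each given as a list of Morfessor
    morphs, e.g. ['terbiye', 'siz', 'lik']. Returns the filtered affix set."""
    counts = Counter()
    for morphs in segmented_corpus:
        prefixes, _, suffixes = split_affixes(morphs)
        counts.update(prefixes)
        counts.update(suffixes)
    # exclude rare subunits to reduce Morfessor's segmentation noise
    return {affix for affix, c in counts.items() if c >= MIN_COUNT}

# Example input format (the threshold would be lowered for such toy data):
# build_affix_inventory([['terbiye', 'siz', 'lik'], ['ev', 'ler', 'i']])
```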
4.2 Experimental Results
Table 3 summarizes our experimental results. We report results for the bpe→char setting, which means the source token is a bpe unit and the decoder samples a character at each time step. CDNMT is the baseline model. Table 3 includes scores reported for the original CDNMT model (Chung et al., 2016) as well as the scores from our reimplementation. To make our work comparable and show the impact of the new architecture, we tried to replicate CDNMT's results in our experimental setting; we kept everything (parameters, iterations, epochs, etc.) unchanged and evaluated the extended model in the same setting. Table 3 reports BLEU scores (Papineni et al., 2002) of our NMT models.

Model         En→De    En→Ru    En→Tr
CDNMT         21.33    26.00    -
CDNMT*
CDNMT*_m
CDNMT*_o
CDNMT*_mo

Table 3: CDNMT* is our implementation of CDNMT. m and o indicate that the base model is extended with the morphology table and the additional output channel, respectively. mo is the combination of both extensions. The improvement provided by the boldfaced number compared to CDNMT* is statistically significant according to paired bootstrap re-sampling (Koehn, 2004).

Table 3 can be interpreted from different perspectives, but the main findings are summarized as follows:

• The morphology table yields significant improvements for all languages and settings.

• The morphology table boosts the En–Tr engine more than the others, and we think this is because of the nature of the language. Turkish is an agglutinative language in which morphemes are clearly separable from each other, but in German and Russian morphological transformations rely more on fusional operations rather than agglutination.

• It seems that there is a direct relation between the size of the morphology table and the gain provided for the decoder, because Russian and Turkish have bigger tables and benefit from the table more than German, which has fewer affixes.

• The auxiliary output channel is even more useful than the morphology table for all settings but En–Ru, and we think this is because of the morpheme-per-word ratio in Russian. The number of morphemes attached to a Russian word is usually higher than for German and Turkish words in our corpora, and this makes the prediction harder for the classifier (the more suffixes attached to a word, the harder the classification task).

• The combination of the morphology table and the extra output channel provides the best result for all languages.

Figure 3 depicts the impact of the morphology table and the extra output channel for each language.
Figure 3: The y axis shows the difference between the BLEU score of CDNMT* and the extended model. The first, second, and third bars show the m, o, and mo extensions, respectively.

To further study our models' behaviour and ensure that our extensions do not generate random improvements, we visualized some attention weights when generating 'terbiyesizlik'. The upper part of Figure 4 shows attention weights for all Turkish affixes, where the y axis shows different time steps and the x axis includes attention weights of all affixes (304 columns) for those time steps, e.g. the first row and the first column represents the attention weight assigned to the first Turkish affix when sampling t in 'terbiyesizlik'. While at first glance the figure may appear somewhat confusing, it provides some interesting insights, which we elaborate on next.

In addition to the whole attention matrix we also visualized a subset of weights to show how the morphology table provides useful information.
Figure 4: Visualizing the attention weights between the morphology table and the decoder when generating 'terbiyesizlik'.

In the bottom part of Figure 4 we study the behaviour of the morphology table for the first (t), fifth (i), ninth (i), and twelfth (i) time steps when generating the same Turkish word 'terbiyesizlik'. t is the first character of the word. We also have three i characters from different morphemes, where the first one is part of the stem, the second one belongs to the suffix 'siz', and the third one to 'lik'. It is interesting to see how the table reacts to the same character from different parts of the word. For each time step we selected the top affixes which have the highest attention weights. The set of top affixes can be different for each step, so we made a union of those sets, which gives us the affixes shown. The bottom part of Figure 4 shows the attention weights for those affixes at each time step.

After analyzing the weights we observed interesting properties about the morphology table and the auxiliary attention module. The main findings about the behaviour of the table are as follows (our observations are not based on this example alone; we studied other random examples, and the table shows consistent behaviour for all of them):

• The model assigns high attention weights to stem-C for almost all time steps. However, the weights assigned to this class for t and i are much higher than those of affix characters (as they are part of the stem). The vertical lines in both figures confirm this feature (bad behaviour).

• For some unknown reason there are some affixes which have no direct relation to a particular time step but receive a high attention weight, such as maz at the t time step (bad behaviour).

• For almost all time steps the highest attention weight belongs to the class which is expected to be selected, e.g. the weights for (i, stem-C) or (i, siz-C) (good behaviour).

• The morphology table may send bad or good signals, but it is consistent for similar or co-occurring characters, e.g. for the last three time steps l, i, and k, almost the same set of affixes receives the highest attention weights. This consistency is exactly what we are looking for, as it can define a reliable external constraint for the decoder to guide it. Vertical lines on the figure also confirm this fact. They show that for a set of consecutive characters which belong to the same morpheme, the attention module sends a signal from a particular affix (good behaviour).

• There are some affixes which might not be directly related to a time step but receive high attention weights. This is because those affixes either include the same character which the decoder tries to predict (e.g. i-C for i, or t-C and tin-C for t), or frequently appear with that part of the word which includes the target character (e.g. mi-C has a high weight when predicting t because t belongs to terbiye, which frequently collocates with mi-C: terbiye+mi) (good behaviour).

Finally, in order to complete our evaluation study we feed the English-to-German NMT model with the sentence 'Terms and conditions for sending contributions to the BBC' to show how the model behaves differently and generates a better target sentence. Translations generated by our models are illustrated in Table 4.
Reference:   Geschäftsbedingungen für das Senden von Beiträgen an die BBC
CDNMT*:      allgemeinen geschaftsbedingungen fur die versendung von Beiträgen an die BBC
CDNMT*_mo:   Geschäft s bedingungen für die versendung von Beiträgen zum BBC

Table 4: Comparing translation results for the CDNMT* (baseline) and CDNMT*_mo (improved) models when the input sentence is 'Terms and conditions for sending contributions to the BBC'.

The table demonstrates that our architecture is able to control the decoder and limit its selections, e.g. the word 'allgemeinen' generated by the baseline model is redundant. There is no constraint to inform the baseline model that this word should not be generated, whereas our proposed architecture controls the decoder in such situations. After analyzing our model, we realized that there are strong attention weights assigned to the w-space (indicating white-space characters) and BOS (beginning of the sequence) columns of the affix table while sampling the first character of the word 'Geschäft', which shows that the decoder is informed about the start point of the sequence. Similar to the baseline model's decoder, our decoder can sample any character, including 'a' of 'allgemeinen' or 'G' of 'Geschäft'. Translation information stored in the baseline decoder is not sufficient for selecting the right character 'G', so the decoder wrongly starts with 'a' and continues along a wrong path up to generating the whole word. However, our decoder's information is accompanied by signals from the affix table which force it to start with a better initial character, whose sampling leads to generating the correct target word.

Another interesting feature in the table is the new structure 'Geschäft s bedingungen' generated by the improved model. As the reference translation shows, in the correct form these two structures should be glued together via 's', which can be considered as an infix. As our model is supposed to detect this sort of intra-word relation, it treats the whole structure as two compounds which are connected to one another via an infix. Although this is not a correct translation and it would be trivial to post-edit into the correct output form, it is interesting to see how our mechanism forces the decoder to pay attention to intra-word relations.

Apart from these two interesting findings, the number of wrong character selections in the baseline model is considerably reduced in the improved model because of our enhanced architecture.
In this paper we proposed a new architecture to incorporate morphological information into the NMT pipeline. We extended the state-of-the-art NMT model (Chung et al., 2016) with a morphology table. The table can be considered as an external knowledge source which is helpful as it increases the capacity of the model by increasing the number of network parameters. We tried to benefit from this advantage. Moreover, we managed to fill the table with morphological information to further boost the NMT model when translating into MRLs. Apart from the table we also designed an additional output channel which forces the decoder to predict morphological annotations. The error signals coming from the second channel during training inform the decoder about morphological properties of the target language. Experimental results show that our techniques are useful for NMT of MRLs.

As future work we will follow three main ideas. i) We will try to find more efficient ways to supply morphological information to both the encoder and decoder. ii) We plan to benefit from other types of information such as syntactic and semantic annotations to boost the decoder, as the table is not limited to morphological information alone and can preserve other sorts of information. iii) Finally, we will target sequence generation for fusional languages. Although our model showed significant improvements for both German and Russian, the proposed model is more suitable for generating sequences in agglutinative languages.

Acknowledgments
We thank our anonymous reviewers for their valuable feedback, as well as the Irish Centre for High-End Computing (ICHEC) for providing computational infrastructure. This work has been supported by the ADAPT Centre for Digital Content Technology, which is funded under the SFI Research Centres Programme (Grant 13/RC/2106) and is co-funded under the European Regional Development Fund.

References
Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. In Proceedings of the International Conference on Learning Representations. Banff, Canada.

Jan A. Botha and Phil Blunsom. 2014. Compositional morphology for word representations and language modelling. In Proceedings of the 31st International Conference on Machine Learning (ICML). Beijing, China, pages 1899–1907.

Iacer Calixto, Qun Liu, and Nick Campbell. 2017. Doubly-attentive decoder for multi-modal neural machine translation. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Vancouver, Canada, pages 1913–1924.

Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder–decoder for statistical machine translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). Doha, Qatar, pages 1724–1734.

Junyoung Chung, Kyunghyun Cho, and Yoshua Bengio. 2016. A character-level decoder without explicit segmentation for neural machine translation. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Berlin, Germany, pages 1693–1703.

Marta R. Costa-jussà and José A. R. Fonollosa. 2016. Character-based neural machine translation. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). Berlin, Germany, pages 357–361.

Mercedes García-Martínez, Loïc Barrault, and Fethi Bougares. 2016. Factored neural machine translation. arXiv preprint arXiv:1609.04621.

Stig-Arne Grönroos, Sami Virpioja, and Mikko Kurimo. 2017. Extending hybrid word-character neural machine translation with multi-task learning of morphological analysis. In Proceedings of the Second Conference on Machine Translation. Copenhagen, Denmark, pages 296–302.

Po-Yao Huang, Frederick Liu, Sz-Rung Shiang, Jean Oh, and Chris Dyer. 2016. Attention-based multimodal neural machine translation. In Proceedings of the First Conference on Machine Translation. pages 639–645.

Sébastien Jean, Kyunghyun Cho, Roland Memisevic, and Yoshua Bengio. 2015. On using very large target vocabulary for neural machine translation. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). Beijing, China.

Yoon Kim, Yacine Jernite, David Sontag, and Alexander M. Rush. 2016. Character-aware neural language models. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence (AAAI-16). Phoenix, Arizona, USA, pages 2741–2749.

Diederik Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In International Conference on Learning Representations (ICLR). San Diego, USA.

Philipp Koehn. 2004. Statistical significance tests for machine translation evaluation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP). Barcelona, Spain, pages 388–395.

Pierre Lison and Jörg Tiedemann. 2016. OpenSubtitles2016: Extracting large parallel corpora from movie and TV subtitles. In Proceedings of the 10th International Conference on Language Resources and Evaluation. Portorož, Slovenia, pages 923–929.

Minh-Thang Luong and Christopher D. Manning. 2016. Achieving open vocabulary neural machine translation with hybrid word-character models. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Berlin, Germany, pages 1054–1063.

Thang Luong, Ilya Sutskever, Quoc Le, Oriol Vinyals, and Wojciech Zaremba. 2015. Addressing the rare word problem in neural machine translation. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). Beijing, China, pages 11–19.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics. Philadelphia, Pennsylvania, USA, pages 311–318.

Peyman Passban. 2018. Machine Translation of Morphologically Rich Languages Using Deep Neural Networks. Ph.D. thesis, School of Computing, Dublin City University, Ireland.

Rico Sennrich and Barry Haddow. 2016. Linguistic input features improve neural machine translation. In Proceedings of the First Conference on Machine Translation. Berlin, Germany, pages 83–91.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Berlin, Germany, pages 1715–1725.

Peter Smit, Sami Virpioja, Stig-Arne Grönroos, and Mikko Kurimo. 2014. Morfessor 2.0: Toolkit for statistical morphological segmentation. In Proceedings of the Demonstrations at the 14th Conference of the European Chapter of the Association for Computational Linguistics. Gothenburg, Sweden, pages 21–24.

Ekaterina Vylomova, Trevor Cohn, Xuanli He, and Gholamreza Haffari. 2016. Word representation models for morphologically rich languages in neural machine translation. CoRR abs/1606.04217. http://arxiv.org/abs/1606.04217.

Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, Jeff Klingner, Apurva Shah, Melvin Johnson, Xiaobing Liu, Lukasz Kaiser, Stephan Gouws, Yoshikiyo Kato, Taku Kudo, Hideto Kazawa, Keith Stevens, George Kurian, Nishant Patil, Wei Wang, Cliff Young, Jason Smith, Jason Riesa, Alex Rudnick, Oriol Vinyals, Greg Corrado, Macduff Hughes, and Jeffrey Dean. 2016. Google's neural machine translation system: Bridging the gap between human and machine translation. CoRR abs/1609.08144.