Machine Translation: A Literature Review
Ankush Garg, Mayank Agarwal
Department of Computer Science
University of Massachusetts Amherst
{agarg,mayankagarwa}@cs.umass.edu

Abstract
Machine translation (MT) plays an important role in benefiting linguists, sociologists, computer scientists, and others by processing natural language to translate it into some other natural language. This demand has grown exponentially over the past couple of years, considering the enormous exchange of information between different regions with different regional languages. Machine translation poses numerous challenges, some of which are: a) not all words in one language have an equivalent word in another language, b) two given languages may have completely different structures, and c) words can have more than one meaning. Owing to these challenges, along with many others, MT has been an active area of research for more than five decades. Numerous methods have been proposed in the past which either aim at improving the quality of the translations they generate, or study the robustness of these systems by measuring their performance on many different languages. In this literature review, we discuss statistical approaches (in particular word-based and phrase-based) and neural approaches, which have gained widespread prominence owing to their state-of-the-art results across multiple major languages.
1 Introduction

Machine translation is a sub-field of computational linguistics that aims to automatically translate text from one language to another using a computing device. To the best of our knowledge, Petr Petrovich Troyanskii was the first person to formally introduce machine translation [22]. In 1939, Troyanskii approached the Academy of Sciences with proposals for mechanical translation, but barring preliminary discussions these proposals were never worked upon. Thereafter, in 1949, Warren Weaver [46] proposed using computers to solve the task of machine translation. Since then, machine translation has been studied extensively under different paradigms over the years. Earlier research focused on rule-based systems, which gave way to example-based systems in the 1980s. Statistical machine translation gained prominence starting in the late 1980s, and different word-based and phrase-based techniques requiring little to no linguistic information were introduced. With the advent of deep neural networks in 2012, the application of these networks in machine translation systems became a major area of research. Recently, researchers announced achieving human parity on automatic Chinese-to-English news translation [18] using neural machine translation. While early machine translation systems were primarily used to translate scientific and technical documents, contemporary applications are varied. These include various online translation systems for the exchange of bilingual information, teaching systems, and many others.

In this literature review, we survey two major sub-fields of machine translation: statistical machine translation, and neural machine translation. The rest of the review is structured as follows: Section 2 briefly discusses the early work in machine translation. Section 3 reviews statistical machine translation, focusing on word-based and phrase-based translation techniques. Section 4 elaborates on neural machine translation techniques, where we also discuss different attention mechanisms and architectures with special purposes. Section 5 briefly describes the current research in the field. Finally, we conclude the report with Section 6.

2 Early Work
In the 1970s, rule-based machine translation (RBMT) was the primary focus of research. Such systems fall into one of the following three categories: direct systems (these map the input sentence directly to the output sentence), transfer RBMT systems (these use morphological and syntactic analysis to translate sentences), and interlingual RBMT systems (these transform the input sentence into an abstract representation and map this abstract representation to the final output). One such work on interlingual RBMT systems is by Carbonell et al. in 1978 [7]. The proposed approach translates text by: 1) converting the source text to a language-free conceptual representation, 2) augmenting this representation with information that was implicit in the source text, and 3) converting this augmented representation to the target language. The authors argue that translation requires a detailed understanding of the source text, which semantic rules are inadequate to capture, and which therefore needs to be augmented with detailed domain knowledge as well.

Rule-based MT is complicated for certain language pairs (e.g., English-Japanese) owing to the different structures of the languages. In 1984, Nagao [33] proposed a translation system that works by the analogy principle. Titled "machine translation by example-guided inference", the system relies on a large dataset of example sentences and their translations to learn the correspondence between English and Japanese words, as well as the structure of the languages. The author describes different approaches to build such a system and also discusses ways to curate the data required for it. This paper, to the best of our knowledge, is the first to introduce example-based learning, and it paved the way for further research in building machine translation systems that do not rely on manually curated rules and exceptions.
3 Statistical Machine Translation

Statistical machine translation (SMT), as introduced by Brown et al. [5], takes the view that every sentence S in a source language has a possible translation T in the target language. Building on top of this fundamental assumption, SMT-based approaches assign to each (S, T) sentence pair the probability P(T | S), which is interpreted as the probability that sentence T is the translated equivalent in the target language of the sentence S in the source language. Accordingly, statistical approaches define the problem of machine translation as:

    \hat{T} = \arg\max_T P(T \mid S)    (1)
            = \arg\max_T P(T)\, P(S \mid T)    (2)

The components P(T) and P(S | T) in the equation above are referred to as the language model of the target language and the translation model, respectively. Hereafter, we refer to the language model of the target language as the language model itself. Together, the language model and the translation model compute the joint probability of the sentences S and T. The argmax operation over all sentences in the target language denotes the search problem and is referred to as the decoder. The decoder performs the actual translation: given a sentence S, it searches for a sentence T in the target language with the highest probability P(T | S).

Since the current formulation requires a translation model from the target language to the source language, an important question arises: why can't the process used to build this translation model be used to build a model that computes P(T | S) directly? This would eliminate the need for the language model of the target language, and such a model could be used in conjunction with the decoder to get the translation of the original sentence. Brown et al. [6] state this to be a means to get well-formed sentences. To model P(T | S) directly and use it for translation would require the probabilities to be concentrated over well-formed sentences in the target language domain. Instead, this is achieved through the joint usage of the language model and the translation model: sentences which are not well-formed are expected to have a lower language model probability, which offsets the necessity for the translation model to have its probabilities concentrated over well-formed sentences.

In the following sections, we first briefly review language models, since they are modelled independently of the translation model and typically remain consistent across works in SMT. Thereafter, we review the research work in SMT categorized into two sections: word-based SMT, and phrase-based SMT.

3.1 Language Models

Given a target string T of length m consisting of words t_1, t_2, \ldots, t_m, we can write the language model probability P(T) as:

    P(T) = P(t_1, t_2, \ldots, t_m) = P(t_1) \prod_{i=2}^{m} P(t_i \mid t_1, \ldots, t_{i-1})    (3)

This converts the language modelling problem into one that requires computing probabilities for a word given its history. However, computing these probabilities is infeasible since there can be too many histories for a word. Thus, this requirement is relaxed by truncating the dependence of the current word to a fixed subset of its history. In an n-gram model, it is assumed that the current word depends only on the previous (n-1) words. For example, in a trigram model, P(w_i \mid w_1, \ldots, w_{i-1}) = P(w_i \mid w_{i-2}, w_{i-1}). These probabilities can now be computed through counting to get a maximum likelihood estimate (MLE). There are other formulations of language models, such as ones that make use of neural networks, or that formulate the problem as a maximum entropy language model.
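To make the counting concrete, here is a minimal Python sketch of MLE estimation for a bigram model (the toy corpus, the sentence-boundary symbols, and the absence of smoothing are illustrative simplifications, not details of the works surveyed):

```python
from collections import defaultdict

def train_bigram_lm(corpus):
    """Estimate P(w_i | w_{i-1}) by counting (maximum likelihood, no smoothing)."""
    history_counts = defaultdict(int)
    bigram_counts = defaultdict(int)
    for sentence in corpus:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        for prev, cur in zip(tokens, tokens[1:]):
            history_counts[prev] += 1
            bigram_counts[(prev, cur)] += 1
    return {bg: c / history_counts[bg[0]] for bg, c in bigram_counts.items()}

lm = train_bigram_lm(["the dog barks", "the dog sleeps"])
print(lm[("the", "dog")], lm[("dog", "barks")])  # 1.0 0.5
```

Real systems smooth these counts to handle unseen n-grams; see the language modelling references below.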
However, we won't delve deeper into language models, since the majority of research in SMT is focused on different formulations of the translation model; the reader can refer to the following resources for more information: [16], [37], [3].

3.2 Word-Based SMT

After Warren Weaver's proposal in 1949 [46] to apply statistical techniques from the then nascent field of communication theory to the task of using computers to translate text from one language to another, research in the area lay dormant for a while. It wasn't until 1988 that Brown et al. [4] outlined an approach to use statistical inference tools to solve the task. The authors argued that translation ought to be based on a complex glossary of correspondences of fixed locations. This glossary would map words as well as phrases (contiguous and non-contiguous) to corresponding translations. For example, the following could be the contents of a glossary mapping English words/phrases to their French counterparts: [word = mot], [not = ne pas], [seat belt = ceinture de sécurité]. The authors base their approach on the following decomposition of the task: 1) partition the source text into a set of fixed locations, 2) use the glossary and contextual information to select the corresponding set of fixed locations in the target language, and 3) arrange the words of the target fixed locations into a sequence that forms the target sentence. The proposed glossary in the paper is based on a model of the translation process P(T | S), and comes to the critical conclusion that a probabilistic method is required to identify the corresponding words in the target and source sentences. To learn the parameters of this glossary, the paper introduces the concept of a "generation pattern", which, as we will see later, is similar to the critical concept of alignment in machine translation. The authors were experimenting with English-French language pairs (languages with similar word order, making the translation quite local), and the fact that the proposed glossary did not incorporate this property motivated them to propose another formulation of the glossary: one that models the locality of the language pairs through distortion probabilities P(k | h, m, n), where k refers to the k-th word in T, h refers to the h-th word in S, and m and n are the lengths of T and S. To the best of our knowledge, this work was the first to formalize the field of statistical machine translation, and though it provided only some intermediate results and no translation examples, it stimulated interest in the application of statistical methods to machine translation.

Two years later, in 1990, Brown et al. [5] provided the first experimental results for a statistical machine translation technique translating sentences from French to English. The proposed method translates 5% of the sentences exactly to their actual translation, but if alternate and different translations are considered reasonable, the model's accuracy rises to 48%. The authors further argue that this system reduces the manual work of translation by about 60%, as measured in units of keystrokes.

The translation model proposed by Brown et al. in [5] introduces the critical concept of alignment. As defined by the authors, an alignment between a pair of strings (S, T) indicates the origin in T of each word in S. One such alignment for the sentence pair "Le programme a été mis en application" (S) and "And the program has been implemented" (T) is shown in figure 1. This particular alignment states that the origin of the word "the" in the English sentence lies in the word "Le"; for "program" it is "programme"; and similarly "implemented" originates from the words "mis en application". Closely related to this concept of alignment is fertility. Fertility is the number of words in S that each word in T produces. Thus, for the same example, the word "And" has fertility 0 (since it is not aligned with any word in the French translation), "the" has fertility 1, and "implemented" has fertility 3.

Figure 1: Alignment between two sentences.
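To make the example concrete, the sketch below encodes this alignment as, for each French word, the (1-based) position of the English word it is aligned to, and recovers the fertilities described above (the encoding itself is our own illustrative choice):

```python
french  = ["Le", "programme", "a", "été", "mis", "en", "application"]
english = ["And", "the", "program", "has", "been", "implemented"]
alignment = [2, 3, 4, 5, 6, 6, 6]  # french[i] aligns to english[alignment[i] - 1]

def fertilities(english, alignment):
    """Fertility of an English word = number of French words aligned to it."""
    fert = [0] * len(english)
    for eng_pos in alignment:
        fert[eng_pos - 1] += 1
    return dict(zip(english, fert))

print(fertilities(english, alignment))
# {'And': 0, 'the': 1, 'program': 1, 'has': 1, 'been': 1, 'implemented': 3}
```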
Building on top of their previous work, Brown et al. [6] describe a set of five statistical models, each with a different model of the alignment probability distribution. Specifically, they modify their translation model to include the alignment variable A:

    P(S \mid T) = \sum_{A} P(S, A \mid T)    (4)

For Models 1 and 2, the authors decompose P(S, A | T) into three probability distributions: 1) a distribution over the length of the target sentence, 2) an alignment model defining a distribution over the alignment configuration, and 3) the translation probabilities over the target sentence, given the alignment configuration and the source sentence. The main distinction between Models 1 and 2 is in their modelling of the alignment probabilities. Model 1 assumes a uniform distribution over all alignments for a sentence pair, while Model 2 uses a zero-order alignment model where alignments at different positions are independent of each other. Additionally, the trained parameters of Model 1 are used to initialize Model 2.

For Models 3, 4, and 5, the authors decompose P(S, A | T) differently and parameterize fertilities directly. The generative process is broken into two parts: 1) given T, compute the fertility of each word in T and a set of words in S which connect to it; this is called the tableau \tau. 2) Permute the words in the tableau to form the source sentence S; this permutation is denoted \pi. Accordingly, P(S, A | T) is decomposed as:

    P(S, A \mid T) = \sum_{(\tau, \pi) \in (S, A)} P(\tau, \pi \mid T)    (5)

P(\tau, \pi \mid T) is further decomposed to yield the following parameters for Models 3 and 4: fertility probabilities, translation probabilities, and distortion probabilities. The main distinction between Models 3 and 4 lies in the modelling of the distortion probabilities. Model 3 uses zero-order distortion probabilities, where the distortion for a particular position depends only on its current position and the lengths of T and S. Model 4, on the other hand, parameterizes these distortion probabilities by two sets of parameters: one to place the head of each word/phrase, and the other to place the rest of the words. This was done because Model 3 did not account well for the tendency of phrases to move around as a unit.

Both of these models (Models 3 and 4) are, however, deficient. The authors define a model to be deficient when it does not concentrate its probability over events of interest but rather distributes it over generalized strings. Model 5, the final proposed model, aims to avoid deficiency and does so by reformulating Model 4 with a suitably refined alignment model. Since each of the five proposed models has a particular decomposition of the translation model, the authors have tried to gain insights into the capabilities of these individual distributions as well as of the final model itself. It is found that while the individual distributions model the particular events well, there is room for improvement in the model's capacity to translate.
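As a concrete illustration of the simplest of these models, the following is a minimal EM sketch for Model 1's word translation probabilities t(s | t) under its uniform alignment assumption (the toy sentence pairs, the uniform initialization, and the omission of the NULL word are simplifications for brevity):

```python
from collections import defaultdict

def ibm_model1(pairs, iterations=10):
    """EM for IBM Model 1: expected alignment counts under uniform alignments,
    then renormalize to get t(s | t). A sketch without the NULL word."""
    t = defaultdict(lambda: 1.0)  # uniform start; normalization emerges via EM
    for _ in range(iterations):
        counts = defaultdict(float)
        totals = defaultdict(float)
        for src, tgt in pairs:  # token lists: source sentence, target sentence
            for s in src:
                norm = sum(t[(s, w)] for w in tgt)
                for w in tgt:
                    frac = t[(s, w)] / norm      # expected count of s aligning to w
                    counts[(s, w)] += frac
                    totals[w] += frac
        t = defaultdict(float, {(s, w): c / totals[w] for (s, w), c in counts.items()})
    return t

pairs = [(["le", "chien"], ["the", "dog"]), (["le", "chat"], ["the", "cat"])]
t = ibm_model1(pairs)
print(round(t[("chien", "dog")], 3))  # approaches 1.0 as EM disambiguates "le"
```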
The fundamental basis of the five models presented by Brown et al. [6] was the introduction of a hidden alignment variable in the translation model. These alignment probabilities were then modelled differently in the different models. Vogel et al. [44] propose a new alignment model that is based on hidden Markov models (HMMs) and aims to effectively model the strong localization effect observed when translating between certain languages (e.g., language pairs from the Indo-European languages). The translation model P(S | T) is accordingly broken down into two components: the HMM alignment probabilities and the translation probabilities. The key aspect of this approach is that it makes the alignment probabilities depend on the relative position of the word alignment rather than the absolute position. This HMM model is shown to result in smaller perplexities than Model 2 of Brown et al. [6] and also produces smoother alignments.

With the increased focus on research in alignment models, Och and Ney [35] present an annotation and evaluation scheme for word-alignment models. The proposed annotation scheme made it possible to explicitly annotate ambiguous alignments along with sure alignments. This provided an extra degree of freedom for the human annotators generating reference alignments. To evaluate the performance of a word alignment model, the authors propose an alignment error rate (AER), which depends on the sure and ambiguous reference alignments, and on the alignment produced by the model.

3.3 Phrase-Based SMT

Despite the revolutionary nature of word-based systems, they still failed to deal with cases, gender, and homonymy: every single word was translated in a single, true way, according to the machine. In a phrase-based translation system there is no restriction of translating the source sentence into the target sentence word by word. This was a significant departure from the word-based IBM models. In phrase-based systems, a lexical unit is a sequence of words (of any length), as opposed to a single word in the IBM models. Each pair of units (one each from the source and target language) has a score or 'weight' associated with it. For example, a lexical entry could look like:

(le chien, the dog, 0.002)
Definition. More formally, a phrase-based lexicon L is a set of lexical entries, where each lexical entry is a tuple (f, e, g) in which:

• f is a sequence of one or more foreign (source) language words
• e is a sequence of one or more English (target) language words
• g is a 'score' of the lexical entry, which is a real number
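A minimal sketch of such a lexicon as a data structure, with purely illustrative entries and scores:

```python
# Toy phrase-based lexicon: entries (f, e, g) indexed by source phrase.
phrase_table = {
    ("le", "chien"): [(("the", "dog"), 0.002), (("the", "hound"), 0.0004)],
    ("le",): [(("the",), 0.61)],
}

def candidate_translations(source_phrase):
    """Return target phrases for a source phrase, best-scoring first."""
    return sorted(phrase_table.get(tuple(source_phrase), []),
                  key=lambda entry: entry[1], reverse=True)

print(candidate_translations(["le", "chien"]))
```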
Phrase-based translation models improved translation quality over the IBM models, and many researchers tried to advance the state of the art with these models. The alignment template model of Och et al. [35] can be reframed as a phrase translation system; Yamada and Knight [48] use phrase translation in a syntax-based translation system; Marcu et al. [31] introduced a joint-probability model for phrase translation. At its core, a phrase-based translation system has a phrase translation probability table (defined above) to map phrases in the source language to phrases in the target language. The phrase translation table is learnt from word alignment models using a bilingual corpus. We do not delve into the details of learning phrase lexicons from word alignments and encourage the reader to refer to [34], [35] for details. We, instead, focus our discussion on the modelling aspect of phrase-based systems and the variations among different models.

Phrase-based systems decompose the translation probability defined in equation (2) as follows:

    P(S \mid T) = P(\bar{s} \mid \bar{t})    (6)
                = \prod_{i=1}^{I} \phi(\bar{s}_i \mid \bar{t}_i)\, d(a_i - b_{i-1})    (7)

Here \bar{s} is the sequence of phrases in the source sentence, \bar{t} is the sequence of phrases in the target sentence, and I is the number of phrases (in the source sentence). d(a_i - b_{i-1}) is the relative distortion probability distribution, where a_i denotes the start position of the source language phrase that was translated into the i-th target language phrase, and b_{i-1} denotes the end position of the source language phrase translated into the (i-1)-th target language phrase. \phi(\bar{s}_i \mid \bar{t}_i) is the phrase translation probability (or, equivalently, the phrase translation table) learnt from a bilingual corpus, and the distortion probability is either learnt or can be as simple as \alpha^{|a_i - b_{i-1} - 1|}. The distortion probability distribution accounts for the reordering of phrases in the target language after they have been translated individually. Once all the factors (phrase translation table, distortion distribution, language model) are learnt, the decoding operation (equation (1)) generates translated sentences. The reader is encouraged to look at [14], [15], [45] for more information on the design of decoders and their nuances.
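The following sketch scores one segmented translation according to equation (7), using the simple exponential distortion model mentioned above; the phrase table phi and the segmentation are hypothetical inputs:

```python
import math

def segmentation_logprob(phrase_pairs, spans, phi, alpha=0.5):
    """Log of equation (7): phrase translation probabilities phi(s_i | t_i)
    times distortion penalties alpha^|a_i - b_{i-1} - 1|.
    phrase_pairs: (source_phrase, target_phrase) tuples, in target order;
    spans: (start, end) 1-based source positions of each source phrase."""
    logp, prev_end = 0.0, 0
    for (src, tgt), (start, end) in zip(phrase_pairs, spans):
        logp += math.log(phi[(src, tgt)])                    # translation score
        logp += abs(start - prev_end - 1) * math.log(alpha)  # distortion penalty
        prev_end = end
    return logp

phi = {("le chien", "the dog"): 0.2, ("dort", "sleeps"): 0.5}  # toy table
pairs = [("le chien", "the dog"), ("dort", "sleeps")]
print(segmentation_logprob(pairs, [(1, 2), (3, 3)], phi))  # monotone: no penalty
```

A monotone segmentation (each phrase starting right after the previous one ends) incurs no distortion penalty, since the exponent |a_i - b_{i-1} - 1| is zero.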
Marcu et al. [31] present a different formulation of the phrase-based model to learn the phrase translation table and distortion distribution. They argue that lexical correspondences can be established not only at the word level but also at the phrase level. They model the translation task as a joint probability model where the translations between phrases are learnt directly, without using word alignment models. Their joint probability model is defined as:

    p(E, F) = \sum_{C \in \mathcal{C} \mid L(E, F, C)} \prod_{c_i \in C} \left[ t(\bar{e}_i, \bar{f}_i) \prod_{k=1}^{|\bar{f}_i|} d(\mathrm{pos}(\bar{f}_i^k), \mathrm{poscm}(\bar{e}_i)) \right]    (8)

The generative story of this model is as follows:
1. Generate a bag of concepts C, where each concept c_i is a hidden variable.
2. For each concept c_i \in C, generate a pair of phrases (\bar{e}_i, \bar{f}_i) according to the distribution t(\bar{e}_i, \bar{f}_i), where \bar{e}_i and \bar{f}_i each contain at least one word.
3. Order the phrases generated in each language so as to create two linear sequences of phrases; these sequences correspond to a sentence pair in the bilingual corpus. This ordering is modelled with the d(·) distribution.

A set of concepts can be linearized into a sentence pair (E, F) if E and F can be obtained by permuting the phrases \bar{e}_i and \bar{f}_i that characterize all concepts c_i \in C.

To learn this model, they also propose a heuristics-based learning algorithm. The model could not be learnt exhaustively with the EM algorithm, as there is an exponential number of alignments that can generate the sentence pair (E, F). They use a French-English parallel corpus of 100,000 sentence pairs from the Hansard corpus to train their model. Their model boosts the BLEU score by 6 points compared to IBM Model 4 (which has a BLEU score of 22).

Och et al. [36] proposed a maximum entropy model for phrase-based translation, where the translation probability is formulated as a conditional log-linear model. The conditional probability of a sentence in the target language given a sentence in the source language is:

    Pr(\bar{e}_1^I \mid \bar{f}_1^J) = p_{\lambda_1^M}(\bar{e}_1^I \mid \bar{f}_1^J)    (9)
        = \frac{\exp\left[\sum_{m=1}^{M} \lambda_m h_m(\bar{e}_1^I, \bar{f}_1^J)\right]}{\sum_{\bar{e}'^I_1} \exp\left[\sum_{m=1}^{M} \lambda_m h_m(\bar{e}'^I_1, \bar{f}_1^J)\right]}    (10)

In this framework, there is a set of M feature functions h_m(\bar{e}_1^I, \bar{f}_1^J). For each feature function, there exists a model parameter \lambda_m, m = 1, \ldots, M. The model is trained with the GIS (generalized iterative scaling) algorithm [12]. Since the normalization constant is intractable, it is approximated using the n most probable sentences; the n-best list of translations is approximately computed by an extended version of the search algorithm of Och et al. [35]. The main advantage of the maximum entropy model is that any feature function can be added easily (e.g., a language model, distortion model, word penalty, or phrase translation model), and the weights of these individual feature functions (models) can be learnt jointly. They experiment with various feature functions, including a language model, word penalty, and phrase translation dictionary, and achieve state-of-the-art results on the VERBMOBIL task, a speech translation task in the domains of appointment scheduling, travel planning, and hotel reservation.
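As a closing illustration for this section, here is a toy sketch of the log-linear scoring of equation (10): since the normalizer is intractable and constant across candidates, decoding can simply rank candidates by the weighted feature sum. The feature names, weights, and values below are illustrative assumptions:

```python
def loglinear_score(features, weights):
    """Unnormalized log-linear score sum_m lambda_m * h_m(e, f) from eq. (10)."""
    return sum(weights[name] * value for name, value in features.items())

weights = {"lm": 0.5, "phrase_table": 0.3, "word_penalty": -0.2}
candidates = {
    "the dog barks": {"lm": -4.1, "phrase_table": -2.0, "word_penalty": 3},
    "dog the barks": {"lm": -7.3, "phrase_table": -2.0, "word_penalty": 3},
}
best = max(candidates, key=lambda e: loglinear_score(candidates[e], weights))
print(best)  # "the dog barks": the language model feature breaks the tie
```

In practice the weights \lambda_m are tuned jointly, e.g. with GIS as above or by directly optimizing translation quality on held-out data.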
4 Neural Machine Translation

In most statistical approaches to machine translation, the most crucial component of the system is the phrase translation model. It is either the joint probability of co-occurrence of source and target language phrases, P(e_i, f_i), or the conditional probability of generating a target language phrase given the source language phrase, P(e_i | f_i). Such models treat phrases which are distinct on the surface as distinct units. Although these distinct phrases share many properties, linguistic or otherwise, they rarely share parameters of the model while predicting translations. There is no concrete notion of 'phrase similarity' in such models. Besides ignoring phrase similarities, this leads to the very common problem of sparsity: it is difficult for the model to adapt itself to unseen phrases at test time. Finally, this makes it difficult to adapt such models to other, similar domains.

Continuous representations of linguistic units, be they characters, words, sentences, or documents, have shown promising results on various language processing tasks. One of the early works which introduced this idea was proposed by Bengio et al. [3]. They model words with continuous, fixed-dimension word vectors using a neural network and achieve state-of-the-art results on the language modelling task. This approach has also shown promising results in dealing with the sparsity issue. Collobert et al. [10] have shown that continuous representations for words are able to capture the syntactic, semantic, and morphological properties of words. Continuous representations for characters have also shown notable results on the language modelling task, as proposed by Sutskever et al. [42]. Recently, continuous representations have been proposed for phrases and sentences and have been shown to carry task-dependent information that helps downstream language processing tasks (Grefenstette et al. [17], Socher et al. [40], Hermann et al. [19]).

The approaches discussed above make use of neural networks to model continuous representations of linguistic units. Deep neural networks have shown tremendous progress in computer vision (e.g., Krizhevsky et al. [26]) and speech recognition (e.g., Hinton et al. [20] and Dahl et al. [11]) tasks. Since then, they have also been successfully applied to solve many NLP tasks like paraphrase detection (Socher et al. [41]) and word embedding extraction (Mikolov et al. [32]). Neural networks have also been applied to advance the state of the art in statistical machine translation: Schwenk [38] summarizes the usage of feedforward neural networks in the framework of phrase-based SMT systems.

In the next two sections (4.1.1 and 4.1.2), we discuss some background work which is common to almost all neural machine translation systems.
4.1.1 Recurrent Language Models

A recurrent neural network (RNN) is a neural network that consists of a hidden state h and an optional output y, and that operates on a variable-length sequence x = (x_1, \ldots, x_T). At each time step t, the hidden state h_t of the RNN is updated by:

    h_t = f(h_{t-1}, x_t)    (11)

where f is a non-linear activation function, usually implemented with an LSTM cell (Hochreiter and Schmidhuber [21]). Using a softmax function over a vocabulary of size V, an RNN can be trained to predict the distribution over x_t given the history of words (x_{t-1}, x_{t-2}, \ldots, x_1) at each time step t. By combining the probabilities at each time step, we can compute the probability of the sequence x (e.g., a target language sentence) using

    p(x) = \prod_{t=1}^{T} p(x_t \mid x_{t-1}, \ldots, x_1)    (12)

which is called a recurrent language model (RLM).

4.1.2 RNN Encoder-Decoder

Though the RNN encoder-decoder architecture was proposed by Cho et al. [9] for a machine translation task, it remains the base model for most NLP sequence-to-sequence models (and especially machine translation). We discuss this model in its general form here, and delve into the details of different neural architectures in the next section.

An encoder-decoder neural model (figure 2), from a probabilistic perspective, is a general method to learn the conditional distribution over a variable-length sequence given yet another variable-length sequence, e.g. p(y_1, \ldots, y_{T'} \mid x_1, \ldots, x_T). The encoder is an RNN which reads each symbol of the input sequence x one word at a time until it encounters the end-of-sequence symbol. The hidden state of the RNN at the last time step is the summary c of the whole input sequence. The decoder operates very similarly to the RLM discussed previously, except that the hidden state of the decoder h_t now depends on the summary c too. Hence the hidden state of the decoder at time step t is calculated by

    h_t = f(h_{t-1}, y_{t-1}, c)    (13)

and the conditional distribution of the next symbol (e.g., the next word in the target language sentence given the source language sentence) is

    p(y_t \mid y_1, \ldots, y_{t-1}, c) = g(h_t, y_{t-1}, c)    (14)

The two components of the encoder-decoder model are jointly trained to maximize the conditional log-likelihood

    \max_\theta \frac{1}{N} \sum_{n=1}^{N} \log p_\theta(y_n \mid x_n)    (15)

Figure 2: An encoder-decoder architecture.

We now describe some of the neural machine translation (NMT) methods proposed recently.

Motivated by the success of deep neural networks and their ability to represent a linguistic unit with a continuous representation, Kalchbrenner et al. [25] propose a class of probabilistic translation models, the Recurrent Continuous Translation Model (RCTM), for machine translation. The RCTM has a generation aspect and a conditional aspect. The generation of a sentence in the target language is modelled with a target recurrent language model. The conditioning on the source sentence is modelled with a convolutional neural network (CNN). In their model, the CNN takes a sentence as input and generates a fixed-size representation of this source sentence. This representation of the source sentence is presented to the recurrent language model to produce the translation in the target language. The entire model (CNN and RNN) is trained jointly with back-propagation.

To the best of our knowledge, this is the first work which explores the idea of modelling the task of machine translation entirely with neural networks, with no component from statistical machine translation systems. They propose two CNN architectures to map a source sentence into a fixed-size continuous representation.
Though CNN architectures had shown tremendous success in the image space, this paper was among the first to explore them extensively in the text space. We, therefore, discuss these architectures in some detail here.

The Convolutional Sentence Model (CSM) creates a representation for a sentence that is progressively built up from representations of the n-grams in the sentence. The CSM architecture embodies a hierarchical structure, similar to parse trees, to create a sentence representation. The lower layers in the CNN architecture operate locally on n-grams, and the upper layers act increasingly globally on the entire sentence. The lack of a need for a parse tree makes it easy to apply these models to languages for which parsers are not available. Also, generation of the sentence in the target language is not dependent on one particular parse tree. Similar to the CSM, the authors propose another CNN model called the Convolutional n-gram Model (CGM). The CGM is obtained by truncating the CSM at the level where n-grams are represented, for the chosen value of n. The CGM can also be inverted (icgm) to obtain a representation for a sentence from the representations of its n-grams. The transformation icgm unfolds the n-gram representation onto a representation of a target sentence with m target words (where m is also predicted by the network, according to a Poisson distribution). A pictorial representation of the two models is shown in figure 3.

Figure 3: A graphical depiction of the two RCTMs. Arrows represent full matrix transformations while lines are vector transformations corresponding to columns of weight matrices.

The experimentation is performed on a bilingual corpus of 144,953 pairs of sentences of length less than 80 words, drawn from the news commentary section of the Eighth Workshop on Machine Translation (WMT) 2013 training data. The source language is English and the target language is French. The low perplexity values achieved by the RCTMs on the test set, compared to the IBM models (Models 1-4), suggest that continuous representations and the transformations between them make up well for the lack of explicit alignments. To make sure that the RCTM architecture (with the CGM) doesn't just take a bag-of-words approach, they randomize the ordering of the words in the source sentences and train their model. This model achieves much worse perplexity values, which shows that the model is indeed sensitive to source sentence structure. They also compare the performance of the RCTM with the cdec system. cdec employs 12 engineered features including, among others, 5 translation models, 2 language model features, and a word penalty feature (WP). The RCTM models achieve comparable (marginally better) performance than the cdec system on BLEU score. The results indicate that the RCTMs are able to learn both the translation and language modelling distributions without explicitly modelling them.
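A schematic numpy sketch of the CSM's bottom-up idea: narrow 1-D convolutions repeatedly merge adjacent columns (word, then phrase, vectors) until one fixed-size sentence vector remains. The random weights and dimensions are placeholders, and the real CSM's level-specific weights are omitted:

```python
import numpy as np

def conv_layer(X, W):
    """Narrow 1-D convolution: X is (d, n) unit vectors, W is (k, d, d).
    Output column j merges the k adjacent input columns j..j+k-1."""
    d, n = X.shape
    k = W.shape[0]
    out = np.zeros((d, n - k + 1))
    for j in range(n - k + 1):
        out[:, j] = np.tanh(sum(W[i] @ X[:, j + i] for i in range(k)))
    return out

rng = np.random.default_rng(0)
d, n, k = 8, 5, 2                       # embedding dim, sentence length, kernel
X = rng.standard_normal((d, n))         # word embeddings of a 5-word sentence
while X.shape[1] > 1:                   # stack layers until one column remains
    X = conv_layer(X, rng.standard_normal((k, d, d)) * 0.1)
print(X.shape)  # (8, 1): a fixed-size sentence representation
```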
Cho et al. [9] propose an RNN encoder-decoder architecture very similar to the one above, but with one major difference: while Kalchbrenner et al. [25] use a CNN to map a source sentence into a fixed-size continuous representation, Cho et al. [9] use an encoder RNN to map the source sequence into a vector. However, they use this architecture to learn phrase translation probabilities. The training is done on the phrase translation pairs extracted by a phrase-based translation system. The model re-scores all the phrase-pair probabilities, which are then used as additional features in a log-linear phrase-based translation system. They use the WMT'14 translation task to build an English/French SMT system coupled with features from the encoder-decoder model. Through a quantitative analysis of the system (on BLEU score), they show that the baseline SMT system's performance improved when the RLM was used. Additionally, adding features from the proposed encoder-decoder architecture increased the performance further, suggesting that signals from multiple neural systems indeed add up and are not redundant. They then perform a qualitative analysis of their proposed model to investigate the quality of the target phrases it generates. The target phrases (given a source phrase) proposed by the model look more visually appealing than the top target phrases from the translation table. They also plot the phrase representations (after dimensionality reduction) on a 2-d plane and show that syntactically and semantically similar phrases are clustered together.

While Cho et al. [9] proposed an end-to-end RNN architecture, they use it only to obtain an additional phrase translation table, to be eventually used in an SMT-based system. Sutskever et al. [43] gave a more formal introduction to the sequence-to-sequence RNN encoder-decoder architecture. Though their motivation was to investigate the ability of very deep neural networks to solve sequence-to-sequence problems, they run their experiments on the machine translation task. They proposed an architecture very similar to that of Cho et al. [9] with three major changes: 1) they used LSTM cells in the encoder and decoder RNNs, 2) they trained their system on complete sentence pairs and not just phrases, and 3) they used stacked LSTMs (with 4-6 layers) in both the decoder and the encoder. Finally, they also reverse the source sentences in the training data and train their system on the reversed source sentences (keeping the target language sentences in their original order). They don't provide a clear motivation for why they did so, but informally, reversing the source sentence helps in capturing local dependencies around a word from either direction. Their experimental results (on the WMT-14 English/French MT dataset) show that reversing the source sentences achieves a higher BLEU score on the test set than the model with no reversing. Though their model doesn't beat the state-of-the-art MT system, it achieves performance very close to it, without employing any attention methods or bi-directional RNNs (which are used by the state-of-the-art system). This suggests that deep models indeed help in sequence-to-sequence learning with the RNN encoder-decoder architecture.

Neural machine translation has shown very promising results for many language pairs. Despite that, it had only been applied to formal texts like the WMT shared tasks. Luong et al. [28] study the effectiveness of NMT systems in spoken language domains using the IWSLT 2015 dataset. They explore two scenarios: NMT adaptation and NMT for low-resource translation. For the NMT adaptation task, they take an existing state-of-the-art English-German system [29], which consists of 8 individual models trained on WMT data with mostly formal texts (4.5M sentence pairs). They further train it on the English-German spoken language data provided by IWSLT 2015 (200K sentence pairs). They show that NMT adaptation is very effective: models trained on a large amount of data in one domain can be finetuned on a small amount of data in another domain. This boosts the performance of an English-German NMT system by 3.8 BLEU points.
For the NMT low-resource translation task, they use the provided English-Vietnamese parallel data (133K sentence pairs). At such a small scale of data, they could not train deep LSTMs with 4 layers as in the English-German case. Instead, they opt for 2-layer LSTM models with 500-dimensional embeddings and LSTM cells. Though their system is a little behind the IWSLT baseline (the baseline's BLEU score is 27.0 versus their model's 26.4), it still shows that NMT systems can be quite effective in other domains too, and not just on formal texts.
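Before turning to attention, the following numpy sketch traces the data flow of the encoder-decoder from section 4.1.2 (equations (11), (13), and (14)), with a plain tanh recurrence standing in for LSTM cells and untrained random weights; the toy vocabulary, greedy decoding loop, and start symbol are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 12, 16                             # toy vocabulary size, hidden size
E  = rng.standard_normal((V, d)) * 0.1    # embeddings (shared here for brevity)
Wh = rng.standard_normal((d, d)) * 0.1    # recurrence weights
Wx = rng.standard_normal((d, d)) * 0.1    # input weights
Wc = rng.standard_normal((d, d)) * 0.1    # summary-vector weights (decoder only)
Wo = rng.standard_normal((V, d)) * 0.1    # output projection

def encode(src_ids):
    """Eq. (11): h_t = f(h_{t-1}, x_t); the final state is the summary c."""
    h = np.zeros(d)
    for t in src_ids:
        h = np.tanh(Wh @ h + Wx @ E[t])
    return h

def decoder_step(h, y_prev, c):
    """Eqs. (13)-(14): next decoder state and next-word distribution, given c."""
    h = np.tanh(Wh @ h + Wx @ E[y_prev] + Wc @ c)
    logits = Wo @ h
    p = np.exp(logits - logits.max())
    return h, p / p.sum()

c = encode([3, 7, 5])                     # encode a toy source sentence
h, y = np.zeros(d), 0                     # 0 plays the role of a start symbol
for _ in range(4):                        # greedily emit four target tokens
    h, probs = decoder_step(h, y, c)
    y = int(probs.argmax())
    print(y, round(float(probs[y]), 3))
```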
A potential issue with this encoder-decoder approach is that the neural network needs to be able to compress all the necessary information of a source sentence into a fixed-length vector. This may make it difficult for the network to cope with long sentences, especially those that are longer than the sentences in the training corpus. Cho et al. [8] showed that indeed the performance of a basic encoder-decoder deteriorates rapidly as the length of the input sentence increases. Bahdanau et al. [2] proposed an attention mechanism to deal with this issue. They propose a model where the source sentence is not encoded into one fixed-length vector. Instead, the encoder maps the source sentence into a sequence of vectors, and the decoder chooses a subset of these vectors at each time step to generate tokens in the target language. We now discuss this model more formally. In the proposed architecture, the conditional probability is defined as:

    p(y_i \mid y_1, \ldots, y_{i-1}, x) = g(y_{i-1}, s_i, c_i)    (16)

where s_i is an RNN hidden state for time i, computed by:

    s_i = f(s_{i-1}, y_{i-1}, c_i)    (17)

The context vector c_i depends on a sequence of annotations (h_1, \ldots, h_{T_x}) to which the encoder maps the input sentence. The context vector c_i is then computed as a weighted sum of these annotations h_j:

    c_i = \sum_{j=1}^{T_x} \alpha_{ij} h_j    (18)

The weight \alpha_{ij} of each annotation is computed by

    \alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{T_x} \exp(e_{ik})}    (19)

where e_{ij} = a(s_{i-1}, h_j) is an alignment model, implemented with a feed-forward neural network. The alignment model scores how well the inputs around position j and the output at position i match. We can understand the approach of taking a weighted sum of all the annotations as computing an expected annotation, where the expectation is over possible alignments. They use a bidirectional RNN for the encoder and a unidirectional RNN for the decoder, and use the bilingual, parallel corpora provided by the ACL WMT '14 task (English/French). With their experiments on source sentences of different lengths, they show that the performance of the conventional encoder-decoder drops quickly when the sentence length increases beyond 30. The proposed model, on the other hand, remains less volatile with respect to sentence length and continues to achieve good performance on long sentences. They also plot alignment visualizations for each target word produced by the decoder. The visualizations show that the alignments between English and French are highly monotonic, which indeed is the case with these languages.

Luong et al. [29] propose two attention approaches: a global approach which always attends to all source words, and a local one which only looks at a subset of the source words at a time. The global approach is very similar to the one proposed by Bahdanau et al. [2] but is architecturally simpler. The local attention can be viewed as a blend between the soft and hard alignment approaches proposed by Xu et al. [47]. The local attention model is computationally less expensive and differentiable, making it easier to implement and train. We don't discuss the global attention model here, as it is very similar to the one proposed by Bahdanau et al. [2], and direct the reader to the original paper for details and comparisons. We focus our discussion on the local attention model, which is the major contribution of this work. In local attention, the model first generates an aligned position p_t for each target word at time t. The context vector c_t is then derived as a weighted average over the set of source hidden states within the window [p_t - D, p_t + D], where D is empirically selected. Unlike in the global approach, the local alignment vector a_t is now fixed-dimensional, i.e., a_t \in \mathbb{R}^{2D+1}. The model predicts the aligned position p_t as follows:

    p_t = S \cdot \mathrm{sigmoid}(v_p^\top \tanh(W_p h_t))    (20)

W_p and v_p are model parameters which are learned to predict positions, and S is the source sentence length. To favor alignment points near p_t, a Gaussian distribution centered around p_t is used. The alignment weights are then defined as:

    a_t(s) = \mathrm{align}(h_t, \bar{h}_s) \exp\left(-\frac{(s - p_t)^2}{2\sigma^2}\right)    (21)

The align function can be as simple as a dot product or can be learned with a feed-forward neural network. The standard deviation is empirically set as \sigma = D/2, and s is an integer within the window centered around p_t. They evaluate the effectiveness of the model on the WMT translation tasks between English and German in both directions, using the WMT'14 training data, with newstest2014 (2737 sentences) and newstest2015 (2169 sentences) as their test data. Apart from achieving higher BLEU scores (even on longer sentences) than the baseline NMT system (without attention) and other conventional SMT approaches, they also visualize the quality of the alignments produced by the model during decoding. After learning, they extract only one-to-one alignments by selecting the source word with the highest alignment weight per target word, and compare them with the gold alignment data provided by RWTH for 508 English-German Europarl sentences. They use the alignment error rate (AER) to measure the alignment quality of the model. The results show that they achieve AER scores comparable to the one-to-many alignments obtained by the Berkeley aligner (Liang et al. [27]).
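A sketch of the local attention computation of equations (20)-(21), with a dot-product align function, sigma = D/2, and random vectors standing in for trained encoder/decoder states; predicting p_t itself (equation (20)) is skipped and a value is passed in directly:

```python
import numpy as np

def local_attention(h_t, src_states, p_t, D):
    """Equation (21): softmax of dot-product align scores over the window
    [p_t - D, p_t + D], damped by a Gaussian centered at p_t (sigma = D/2)."""
    S = len(src_states)
    lo, hi = max(0, int(p_t) - D), min(S, int(p_t) + D + 1)
    scores = np.array([h_t @ src_states[s] for s in range(lo, hi)])
    align = np.exp(scores - scores.max())
    align /= align.sum()                                    # softmax over window
    positions = np.arange(lo, hi)
    a_t = align * np.exp(-((positions - p_t) ** 2) / (2 * (D / 2) ** 2))
    c_t = (a_t[:, None] * src_states[lo:hi]).sum(axis=0)    # context vector
    return a_t, c_t

rng = np.random.default_rng(0)
a_t, c_t = local_attention(rng.standard_normal(8),
                           rng.standard_normal((10, 8)), p_t=4.3, D=2)
print(np.round(a_t, 3), c_t.shape)  # window weights and an 8-dim context vector
```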
A significant weakness of conventional NMT systems is their inability to correctly translate very rare words: end-to-end NMT systems tend to have relatively small vocabularies, with a single unk symbol that represents every possible out-of-vocabulary (OOV) word. Standard phrase-based systems, on the other hand, do not suffer from this problem to the same extent, as they make use of explicit alignments and phrase tables, which allows them to memorize the translations of even extremely rare words.

Jean et al. [23] propose a method based on importance sampling that allows the use of a large target vocabulary without increasing training complexity. They divide the training set into multiple individual partitions, each having its own target vocabulary V'. More concretely, before training begins, each target sentence is sequentially examined and unique words are accumulated until the number of unique words reaches a predefined threshold τ. The accumulated vocabulary is then used for this partition of the corpus during training. The process is repeated until the end of the training set is reached. The intrigued reader can refer to the original paper for a more formal description of the model and the training procedure. The proposed approach is evaluated on English-to-French and English-to-German tasks, with bilingual parallel corpora from WMT'14 used for training the model. Apart from showing the efficiency of the proposed model through BLEU scores comparable to the state-of-the-art WMT'14 submitted model, they propose heuristic-based changes to the traditional NMT decoder to make it sample efficiently from an extremely large target vocabulary.

Yet another method was proposed recently to address the OOV problem.
Luong et al. [30] train an NMT system on data that is augmented by the output of a word alignment algorithm, allowing the NMT system to emit, for each OOV word in the target sentence, the position of its corresponding word in the source sentence. This information is later utilized in a post-processing step that translates every OOV word using a dictionary. Experiments on the WMT'14 English-to-French translation task show that this method provides an improvement of up to 2.8 BLEU points over an equivalent NMT system that does not use this technique.

5 Current Research

Kaiser et al. [24] recently proposed an interesting neural network architecture: 'One Model to Learn Them All'. It is a multi-model architecture that can simultaneously learn many tasks across domains. At its core, it has four components: modality nets, an encoder, an I/O mixer, and a decoder. Modality nets (one each for every type of data: text, speech, audio, image) map the input into a representation. The encoder takes this representation and processes it with attention blocks and a mixture-of-experts [39]. The decoder, in a similar fashion, produces an output representation, which is given to the respective modality net to produce the output. Both the encoder and the decoder are built with convolutional blocks. Their experiments on various tasks (including machine translation) show that the model performs, if not at par, then close to the state-of-the-art systems on the individual tasks. They also show that the attention and mixture-of-experts blocks, designed for textual data (especially machine translation), do not hurt the performance of completely unrelated tasks like classification on ImageNet [13].

Another area of current research relates to the requirement of a large parallel corpus to train NMT systems. The lack of such corpora for low-resource languages (e.g., Basque), as well as for combinations of major languages (e.g., German-Russian), poses a challenge for such systems. Artetxe et al. [1] propose an unsupervised approach to neural machine translation which relies solely on monolingual corpora. The system architecture is a standard encoder-decoder setup with attention, with the encoder shared across the two decoders. The encoder contains pre-trained cross-lingual word embeddings which are kept fixed during training. Ideally, this architecture could be trained to encode a given sentence using the shared encoder and decode it using the appropriate decoder, but it is prone to learning a trivial copying task. To circumvent this, the authors propose using denoising and on-the-fly backtranslation. Denoising randomizes the order of the words to force the network to learn meaningful information about the language, and on-the-fly backtranslation translates text from the available monolingual corpus to the other language to get a pseudo-parallel sentence pair. This architecture improves over the baseline scores by at least 40% on both German-English and French-English translation. This unsupervised approach is also shown to improve with the availability of a small parallel corpus.
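To sketch the denoising idea, here is one simple noise model (our illustrative choice; the paper's exact scheme may differ) that randomly swaps adjacent words, so that reconstructing the original sentence forces the shared encoder to learn more than a copy:

```python
import random

def add_noise(sentence, k=3):
    """Corrupt a sentence with k random adjacent-word swaps; the autoencoding
    objective is then to recover the original sentence from this noisy input."""
    tokens = sentence.split()
    for _ in range(k):
        i = random.randrange(len(tokens) - 1)
        tokens[i], tokens[i + 1] = tokens[i + 1], tokens[i]
    return " ".join(tokens)

random.seed(0)
print(add_noise("the dog chased the cat across the yard"))
```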
6 Conclusion

Machine translation has been an active area of research within the field of AI for many years. Statistical machine translation, with the advent of the IBM models (Models 1-4), paved the way for more advanced approaches based on phrase-based and syntax-based models. These methods have shown tremendous progress on many language pairs and have been successfully deployed in large-scale systems, like Google Translate (up until 2014). Over the past couple of years, neural machine translation has taken the front seat in this task. Owing to their ease of learning, their ability to model complex feature functions, and their striking performance in translating major languages of the world, NMT systems have become the natural choice for researchers to study their behavior, the feature space they learn, and the effect of variations in architectures. Despite that, much remains to be done, both from a modelling perspective and in terms of architecture changes. We believe that a unified architecture similar to the one proposed in 'One Model to Learn Them All' holds potential for benefiting from multiple tasks learnt simultaneously. Finally, we are also seeing unsupervised methods being applied to learn MT systems from monolingual data alone. This research area is especially important considering the number of languages in the world and the limited amount of labelled data available for them.

Bibliography

[1] Mikel Artetxe, Gorka Labaka, Eneko Agirre, and Kyunghyun Cho. Unsupervised neural machine translation. arXiv preprint arXiv:1710.11041, 2017.
[2] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014.
[3] Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Jauvin. A neural probabilistic language model. Journal of Machine Learning Research, 3(Feb):1137–1155, 2003.
[4] Peter Brown, John Cocke, S Della Pietra, V Della Pietra, Frederick Jelinek, Robert Mercer, and Paul Roossin. A statistical approach to language translation. In Proceedings of the 12th Conference on Computational Linguistics - Volume 1, pages 71–76. Association for Computational Linguistics, 1988.
[5] Peter F Brown, John Cocke, Stephen A Della Pietra, Vincent J Della Pietra, Fredrick Jelinek, John D Lafferty, Robert L Mercer, and Paul S Roossin. A statistical approach to machine translation. Computational Linguistics, 16(2):79–85, 1990.
[6] Peter F Brown, Vincent J Della Pietra, Stephen A Della Pietra, and Robert L Mercer. The mathematics of statistical machine translation: Parameter estimation. Computational Linguistics, 19(2):263–311, 1993.
[7] Jaime G Carbonell, Richard E Cullinford, and Anatole V Gershman. Knowledge-based machine translation. Technical report, Yale University, Department of Computer Science, 1978.
[8] KyungHyun Cho, Bart van Merrienboer, Dzmitry Bahdanau, and Yoshua Bengio. On the properties of neural machine translation: Encoder-decoder approaches. CoRR, abs/1409.1259, 2014. URL http://arxiv.org/abs/1409.1259.
[9] Kyunghyun Cho, Bart van Merriënboer, Çağlar Gülçehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1724–1734, Doha, Qatar, October 2014. Association for Computational Linguistics.
[10] Ronan Collobert and Jason Weston. A unified architecture for natural language processing: Deep neural networks with multitask learning. In Proceedings of the 25th International Conference on Machine Learning, ICML '08, pages 160–167, New York, NY, USA, 2008. ACM. doi: 10.1145/1390156.1390177. URL http://doi.acm.org/10.1145/1390156.1390177.
[11] G. E. Dahl, Dong Yu, Li Deng, and A. Acero. Context-dependent pre-trained deep neural networks for large-vocabulary speech recognition. Trans. Audio, Speech and Lang. Proc., 20(1):30–42, January 2012. doi: 10.1109/TASL.2011.2134090. URL http://dx.doi.org/10.1109/TASL.2011.2134090.
[12] J. N. Darroch and D. Ratcliff. Generalized iterative scaling for log-linear models. The Annals of Mathematical Statistics, volume 43, 1972.
[13] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, 2009.
[14] Ulrich Germann. Greedy decoding for statistical machine translation in almost linear time. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology - Volume 1, pages 1–8. Association for Computational Linguistics, 2003.
[15] Ulrich Germann, Michael Jahr, Kevin Knight, Daniel Marcu, and Kenji Yamada. Fast and optimal decoding for machine translation. Artificial Intelligence, 154(1-2):127–143, 2004.
[16] Joshua T Goodman. A bit of progress in language modeling. Computer Speech & Language, 15(4):403–434, 2001.
[17] Edward Grefenstette, Mehrnoosh Sadrzadeh, Stephen Clark, Bob Coecke, and Stephen Pulman. Concrete sentence spaces for compositional distributional models of meaning. CoRR, abs/1101.0309, 2011. URL http://arxiv.org/abs/1101.0309.
[18] Hany Hassan, Anthony Aue, Chang Chen, Vishal Chowdhary, Jonathan Clark, Christian Federmann, Xuedong Huang, Marcin Junczys-Dowmunt, William Lewis, Mu Li, et al. Achieving human parity on automatic Chinese to English news translation. arXiv preprint arXiv:1803.05567, 2018.
[19] Karl Moritz Hermann and Phil Blunsom. A simple model for learning multilingual compositional semantics. CoRR, abs/1312.6173, 2013. URL http://arxiv.org/abs/1312.6173.
[20] Geoffrey Hinton, Li Deng, Dong Yu, George Dahl, Abdel-rahman Mohamed, Navdeep Jaitly, Andrew Senior, Vincent Vanhoucke, Patrick Nguyen, Tara Sainath, and Brian Kingsbury. Deep neural networks for acoustic modeling in speech recognition. Signal Processing Magazine, 2012.
[21] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Comput., 9(8):1735–1780, November 1997. doi: 10.1162/neco.1997.9.8.1735. URL http://dx.doi.org/10.1162/neco.1997.9.8.1735.
[22] John Hutchins and Evgenii Lovtskii. Petr Petrovich Troyanskii (1894–1950): A forgotten pioneer of mechanical translation. Machine Translation, 15(3):187–221, 2000.
[23] Sébastien Jean, Kyunghyun Cho, Roland Memisevic, and Yoshua Bengio. On using very large target vocabulary for neural machine translation. CoRR, abs/1412.2007, 2014. URL http://arxiv.org/abs/1412.2007.
[24] Lukasz Kaiser, Aidan N. Gomez, Noam Shazeer, Ashish Vaswani, Niki Parmar, Llion Jones, and Jakob Uszkoreit. One model to learn them all. CoRR, abs/1706.05137, 2017. URL http://arxiv.org/abs/1706.05137.
[25] Nal Kalchbrenner and Phil Blunsom. Recurrent continuous translation models. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing (EMNLP), Seattle, October 2013. Association for Computational Linguistics.
[26] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet classification with deep convolutional neural networks. In Proceedings of the 25th International Conference on Neural Information Processing Systems - Volume 1, NIPS'12, pages 1097–1105, USA, 2012. Curran Associates Inc. URL http://dl.acm.org/citation.cfm?id=2999134.2999257.
[27] Percy Liang, Ben Taskar, and Dan Klein. Alignment by agreement. In Proceedings of the Main Conference on Human Language Technology Conference of the North American Chapter of the Association of Computational Linguistics, HLT-NAACL '06, pages 104–111, Stroudsburg, PA, USA, 2006. Association for Computational Linguistics. doi: 10.3115/1220835.1220849. URL https://doi.org/10.3115/1220835.1220849.
[28] Minh-Thang Luong and Christopher D. Manning. Stanford neural machine translation systems for spoken language domains. 2015.
[29] Minh-Thang Luong, Hieu Pham, and Christopher D. Manning. Effective approaches to attention-based neural machine translation. CoRR, abs/1508.04025, 2015. URL http://arxiv.org/abs/1508.04025.
[30] Thang Luong, Ilya Sutskever, Quoc V. Le, Oriol Vinyals, and Wojciech Zaremba. Addressing the rare word problem in neural machine translation. CoRR, abs/1410.8206, 2014. URL http://arxiv.org/abs/1410.8206.
[31] Daniel Marcu and William Wong. A phrase-based, joint probability model for statistical machine translation. In Proceedings of the ACL-02 Conference on Empirical Methods in Natural Language Processing - Volume 10, EMNLP '02, pages 133–139, Stroudsburg, PA, USA, 2002. Association for Computational Linguistics. doi: 10.3115/1118693.1118711. URL https://doi.org/10.3115/1118693.1118711.
[32] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. Distributed representations of words and phrases and their compositionality. CoRR, abs/1310.4546, 2013. URL http://arxiv.org/abs/1310.4546.
[33] Makoto Nagao. A framework of a mechanical translation between Japanese and English by analogy principle. Artificial and Human Intelligence, pages 351–354, 1984.
[34] Franz Josef Och. Statistical machine translation: from single-word models to alignment templates. PhD thesis, Bibliothek der RWTH Aachen, 2002.
[35] Franz Josef Och and Hermann Ney. Improved statistical alignment models. In Proceedings of the 38th Annual Meeting on Association for Computational Linguistics, pages 440–447. Association for Computational Linguistics, 2000.
[36] Franz Josef Och and Hermann Ney. Discriminative training and maximum entropy models for statistical machine translation. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, ACL '02, pages 295–302, Stroudsburg, PA, USA, 2002. Association for Computational Linguistics. doi: 10.3115/1073083.1073133. URL https://doi.org/10.3115/1073083.1073133.
[37] Ronald Rosenfeld. Two decades of statistical language modeling: Where do we go from here? Proceedings of the IEEE, 88(8):1270–1278, 2000.
[38] Holger Schwenk. Continuous space translation models for phrase-based statistical machine translation. In COLING, 2012.
[39] Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc V. Le, Geoffrey E. Hinton, and Jeff Dean. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. CoRR, abs/1701.06538, 2017. URL http://arxiv.org/abs/1701.06538.
[40] Richard Socher, Christopher D. Manning, and Andrew Y. Ng. Learning continuous phrase representations and syntactic parsing with recursive neural networks. In Proceedings of the NIPS-2010 Deep Learning and Unsupervised Feature Learning Workshop, 2010.
[41] Richard Socher, Eric H. Huang, Jeffrey Pennington, Andrew Y. Ng, and Christopher D. Manning. Dynamic pooling and unfolding recursive autoencoders for paraphrase detection. In Proceedings of the 24th International Conference on Neural Information Processing Systems, NIPS'11, pages 801–809, USA, 2011. Curran Associates Inc. URL http://dl.acm.org/citation.cfm?id=2986459.2986549.
[42] Ilya Sutskever, James Martens, and Geoffrey Hinton. Generating text with recurrent neural networks. In Proceedings of the 28th International Conference on International Conference on Machine Learning, ICML'11, pages 1017–1024, USA, 2011. Omnipress. URL http://dl.acm.org/citation.cfm?id=3104482.3104610.
[43] Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. Sequence to sequence learning with neural networks. In Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 2, NIPS'14, pages 3104–3112, Cambridge, MA, USA, 2014. MIT Press. URL http://dl.acm.org/citation.cfm?id=2969033.2969173.
[44] Stephan Vogel, Hermann Ney, and Christoph Tillmann. HMM-based word alignment in statistical translation. In Proceedings of the 16th Conference on Computational Linguistics - Volume 2, pages 836–841. Association for Computational Linguistics, 1996.
[45] Ye-Yi Wang and Alex Waibel. Decoding algorithm in statistical machine translation. In Proceedings of the Eighth Conference on European Chapter of the Association for Computational Linguistics, pages 366–372. Association for Computational Linguistics, 1997.
[46] Warren Weaver. Translation. Machine Translation of Languages, 14:15–23, 1955.
[47] Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron C. Courville, Ruslan Salakhutdinov, Richard S. Zemel, and Yoshua Bengio. Show, attend and tell: Neural image caption generation with visual attention. CoRR, abs/1502.03044, 2015. URL http://arxiv.org/abs/1502.03044.
[48] Kenji Yamada and Kevin Knight. A syntax-based statistical translation model. In Proceedings of the 39th Annual Meeting of the Association for Computational Linguistics, 2001.