Universal Vector Neural Machine Translation With Effective Attention
Satish Mylapore, Ryan Quincy Paul, Joshua Yi, and Robert D. Slater
Master of Science in Data Science, Southern Methodist University, Dallas, TX 75275, USA
{smylaporesaravanabhava, rqpaul, jyi, rslater}@smu.edu

Abstract.
Neural Machine Translation (NMT) leverages one or more trained neural networks for the translation of phrases. Sutskever introduced a sequence to sequence based encoder-decoder model which became the standard for NMT based systems. Attention mechanisms were later introduced to address issues with the translation of long sentences and to improve overall accuracy. In this paper, we propose a singular model for Neural Machine Translation based on encoder-decoder models. Most translation models are trained as one model for one translation. We introduce a neutral/universal model representation that can be used to predict more than one language, depending on the source and a provided target. Secondly, we introduce an attention model by adding an overall learning vector to the multiplicative model. With these two changes, the novel universal model reduces the number of models needed for multiple-language translation applications.
1 Introduction

Neural Machine Translation (NMT) [1] is a significant recent development in large scale translation [2,3]. The traditional translation model introduced by Koehn et al. 2003 [4] was trained on a single large neural model with layers that are trained separately, requiring many resources and effort. Today, most industry players have adopted a neural network based machine translation system derived from the Recurrent Neural Network (RNN) encoder-decoder model introduced by Cho et al. 2014 [5]. For machine translation, the encoder is used with the source language to encode the sentence input into a vector representation for the decoder. The decoder uses the encoded sequence to begin predicting the target sequence. There were several advancements to this model through the introduction of different types of RNNs such as LSTM (Long Short Term Memory) [6,7,8,9], GRU (Gated Recurrent Unit) [10], and Bi-RNN (Bidirectional RNN) [11], which were introduced to address the vanishing gradient problem [12] encountered during the training of simple recurrent neural networks. Gated recurrent networks failed to fully resolve the central problem of the encoder-decoder network [13], which is the ability to learn and maintain information from the encoder for longer sentences. This is where attention mechanisms were introduced: Graves et al. 2014, based on the cosine similarity of the sentences [14]; Bahdanau et al. 2014, which concatenates the encoder and decoder information [15]; and Luong et al. 2015, which uses the dot product of the encoder and decoder information to score the attention on the target sequence [3,16]. The introduction of attention mechanisms increased the scalability of machine translation at the cost of performance during training.

The latest development in the machine translation space is the introduction of the Transformer model by Vaswani et al. 2017 [17]. The Transformer model relies on self-attention rather than recurrent networks. It promotes self-attention in both the encoder and the decoder, where the encoding of the source sequences is done in parallel. This reduces the training time significantly. The decoder prediction is auto-regressive, which means it predicts one word at a time in a regressive state. Vaswani et al. claim that the results of the Transformer model show a significant improvement in prediction accuracy when compared to other recent models in the NMT space, demonstrated on a German translation task [17].

The Transformer model is still in the incubation and adoption stage in current industry practice. This is due to its restricted context length during translation (fixed-length context). Furthermore, at present all RNN encoder-decoder based machine translation models still use a single model architecture for a translation job. For example, if a task requires translation from Spanish to English, one model will be trained. Another model would be trained to translate from English to Spanish. One model corresponds to one translation task, hence separate models are required. In this research, we seek to build a singular model to translate multiple languages. For the purpose of this research we have considered English-Spanish and Spanish-English translation using the same model. All machine translation mechanisms to date use language specific encoders for each source language [18].
This paper details a novel method of hosting multiple neural machine translation tasks within the same model, as follows. Section 2 covers related work on the fundamental concepts of the sequence to sequence Recurrent Neural Network based Encoder-Decoder model and the additive attention model by Bahdanau, and wraps up with the Dual Learning method introduced by Microsoft. Section 3 outlines the architecture for the universal vector model and discusses each layer. Section 4 discusses the training method for the universal model, while Section 5 explains the dataset and how it is used for training. Section 6 is an overview of the BLEU score. The translation results of the Universal Vector model are explained in Section 7, then Section 8 presents the analysis of the BLEU score, loss results, and attention model performance. Section 9 goes over limitations and potential steps to take in the future, with Section 10 discussing previously considered experiments. Finally, the paper concludes with some parting thoughts on the development of this novel model in Section 11.

2 Related Work

This section goes over the work related to building the Universal Vector Neural Machine Translation model. First, the Recurrent Neural Network based Encoder-Decoder models proposed by Sutskever et al. and Cho et al. will be discussed. Next, the attention mechanism first proposed by Bahdanau et al. will be detailed. Finally, the Dual Learning model training approach is explained.
2.1 RNN Encoder-Decoder Models

Many NMTs are built upon the fundamental Recurrent Neural Network (RNN) based Encoder-Decoder model as proposed by Sutskever et al. (2014) and by Cho et al. (2014) [5,10,19,20]. This model uses two networks, an encoder and a decoder, to learn sequences of information and make predictions. In this model a sequence of input x is provided to the encoder, an RNN. An RNN allows for outputs of iterations through a network to be passed on as input to future iterations [21,22,23,24]. x is processed word by word (1, 2, ..., t) over multiple iterations. Each iteration calculates a hidden state that is based on the current word in a phrase (x_t) and the hidden states of previous iterations (h_{t-1}). This is represented at a high level in Equation 1 below, with a non-linear function f calculating hidden states at each position [15].

h_t = f(x_t, h_{t-1})    (1)

Once all hidden states have been calculated, a function q returns a single fixed-length context vector c with each hidden state as input, as in Equation 2 below. c represents the full summary of the output of the encoder network [15].

c = q({h_1, ..., h_{T_x}})    (2)

The output of the encoder, c, is then fed into the decoder, which is another trained RNN. The decoder emits the prediction for each output y_t at iteration t, where these conditional outputs come together as a probability distribution, as in Equation 3 below [15].

p(y_t | {y_1, ..., y_{t-1}}, c) = g(y_{t-1}, s_t, c)    (3)

g is another non-linear function that takes in the previously predicted word (y_{t-1}), the hidden state of the current iteration of the network (s_t), and the context vector from above (c). p represents a predicted target sequence of words for a given input sequence of words with conditional probability [25]. This is the basis of the Encoder-Decoder model that has been used heavily in neural machine translation.
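To make Equations 1-3 concrete, the following is a minimal NumPy sketch (not the authors' implementation) of a recurrent encoder that folds a source sentence into a fixed-length context vector. The dimensions, the tanh recurrence standing in for f, and the choice of q as "take the last hidden state" are illustrative assumptions.

import numpy as np

def encode(xs, W_x, W_h, b):
    """Fold a source sentence into a fixed-length context vector.

    Implements Eq. 1, h_t = f(x_t, h_{t-1}), with f taken to be a
    plain tanh recurrence purely for illustration.
    """
    h = np.zeros(W_h.shape[0])
    hidden_states = []
    for x in xs:                               # process word by word
        h = np.tanh(W_x @ x + W_h @ h + b)     # Eq. 1
        hidden_states.append(h)
    # Eq. 2: here q simply selects the final hidden state as the
    # summary c of the entire source sentence.
    c = hidden_states[-1]
    return hidden_states, c

rng = np.random.default_rng(0)
embed_dim, hidden_dim = 8, 16
xs = [rng.normal(size=embed_dim) for _ in range(5)]   # a 5-word sentence
W_x = rng.normal(size=(hidden_dim, embed_dim)) * 0.1
W_h = rng.normal(size=(hidden_dim, hidden_dim)) * 0.1
b = np.zeros(hidden_dim)
states, c = encode(xs, W_x, W_h, b)
print(c.shape)   # (16,) -- one fixed-length vector regardless of sentence length

Note that c has the same size for any input length, which is exactly the property that makes long sentences hard for the plain encoder-decoder and motivates attention.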
2.2 Attention

Attention mechanisms have gained visibility recently as they are able to improve the performance of translation by helping the encoder and decoder to align, providing guidance on what parts of a large sentence will be most useful in predicting the next word [15,16,17,26]. In recent years many attention models have been introduced, such as that of Bahdanau et al. [15], which concatenates (referred to as "concat" in Luong et al., 2015 [16] and as "additive attention" in Vaswani et al., 2017 [17]) forward and backward information from the source. This model changes the fundamental RNN Encoder-Decoder described above in a variety of ways.

The encoder is built using a bi-directional recurrent neural network that contains two models. Each model computes hidden states in either direction from a given input x_i. This yields two hidden states, →h_i and ←h_i. These two hidden states are concatenated together to form a vector h_i, as in Equation 4 below, that represents the whole sentence emanating out from a given input word; these vectors are referred to as annotations [15].

h_i = [→h_i ; ←h_i]    (4)

Due to RNNs' tendency toward recency bias, the words immediately surrounding a given input (x_i) will be better represented in that input word's annotation (h_i). This is reflected when calculating attention, which begins with a replacement of the fixed-length context vector c mentioned in Section 2.1: a new context vector c_j is calculated for every output word y_j. This begins with a scoring function e_{ij}, which represents the importance of the hidden state output from the previous iteration of the decoder, s_{j-1}, to a given annotation h_i, as in Equation 5 below [15]. A higher score represents higher importance.

e_{ij} = a(s_{j-1}, h_i)    (5)

e_{ij} is then fed into a softmax function, Equation 6 below, which returns a vector of weights α_{ij} that sum to one, representing the weight of each annotation with respect to the given position of y_j [15].

α_{ij} = exp(e_{ij}) / Σ_{k=1}^{T_x} exp(e_{ik})    (6)

Finally, the context vector c_j unique to each word y_j output by the decoder is calculated with the summation found in Equation 7 below.

c_j = Σ_{i=1}^{T_x} α_{ij} h_i    (7)

Vector c_j is used in the calculation of the decoder hidden state s_j found in Equation 8. s_j, the previously predicted word y_{j-1}, and c_j are then used in calculating the output of each iteration of the decoder at step j, as in Equation 9 below. The output is a vector of probabilities over each possible word that could be predicted at y_j. The context vector c_j weighs in words at input positions i that scored a higher importance e_{ij} in Equation 5 more than others, which represents attention. This is in contrast to taking the whole vector of input words into account at every j-th position of y [15].

s_j = f(s_{j-1}, y_{j-1}, c_j)    (8)

p(y_j | {y_1, ..., y_{j-1}}, x) = g(y_{j-1}, s_j, c_j)    (9)

This is an early implementation of attention proposed by Bahdanau et al. 2014 [15]. Many other forms of attention have been proposed since. Luong et al. refer to Bahdanau's attention mechanism as "global attention." In turn, Luong et al. proposed a "local attention" method that focuses on smaller portions of context instead of applying attention weights to the entire source text [16]. The new attention mechanism proposed in this paper combines the two.
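As a worked illustration of Equations 5-7, the sketch below computes one decoder step of additive attention. It is a sketch rather than the paper's implementation; the one-layer feed-forward form of the scoring function a and all weight shapes are assumptions.

import numpy as np

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

def attend(s_prev, annotations, v_a, W_a, U_a):
    """One decoder step of additive attention (Eqs. 5-7).

    annotations: array of shape (T_x, d), the h_i vectors from Eq. 4.
    s_prev: previous decoder hidden state s_{j-1}.
    """
    # Eq. 5: e_ij = a(s_{j-1}, h_i), with a as a one-layer feed-forward net
    scores = np.array([v_a @ np.tanh(W_a @ s_prev + U_a @ h)
                       for h in annotations])
    alpha = softmax(scores)      # Eq. 6: weights over source positions
    c_j = alpha @ annotations    # Eq. 7: weighted sum of annotations
    return c_j, alpha

rng = np.random.default_rng(1)
d, attn_dim, T_x = 16, 10, 5
H = rng.normal(size=(T_x, d))                 # annotations for 5 source words
c_j, alpha = attend(rng.normal(size=d), H,
                    rng.normal(size=attn_dim),
                    rng.normal(size=(attn_dim, d)),
                    rng.normal(size=(attn_dim, d)))
print(alpha.round(2), alpha.sum())            # weights sum to 1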
2.3 Dual Learning

In a paper from Microsoft Research [27], the team considered a dual learning mechanism to handle the complexities of labeling training data. The dual learning mechanism considers two agents: one agent for the forward translation model (source to target language) and a second agent for the dual translation (target to source language). These models use two different corpora for training, which are not parallel datasets. This enables reinforcement learning for the convergence of source and target language. The inputs considered in the Microsoft Research paper are "Monolingual corpora D_A and D_B, initial translation models θ_{AB} and θ_{BA}, language models LM_A and LM_B, hyper-parameter α, beam search size K, learning rates γ_{1,t}, γ_{2,t}" [27]. The experiment used to test dual training uses two separate models, one for each translation direction.

In this paper, there are two contributions based on RNN Encoder-Decoder based machine translation. First, a neutral/universal vector representation for machine translation is introduced. Then a modified attention mechanism, based on the global attention mechanism proposed in Luong et al. [16] and Bahdanau et al. [15], is discussed. Finally, the proposed neutral vector representation with the modified attention mechanism is tested and the results are presented.

3 Model Architecture

The architecture of the model is built on top of the basic sequence to sequence model and modified to translate more than one language. A high level architecture diagram is found in Fig. 1 below. This model contains two networks, an encoder and a decoder, with embedded inputs and outputs for each. It also contains the modified attention mechanism and a fully connected layer. In the current structure, the source text is inserted into the Input Embedding layer, which contains the Encoder RNN. There are multiple Input Embedding layers to handle different source texts such as Spanish, English, German, etc. From the Input Embedding layer, the results (context vectors) are fed into the Target Embedding layer, which contains the Decoder RNN along with the modified Attention layer. As is the case with the encoder portion of the system, there are multiple Target Embedding layers for multiple target languages. Lastly, the output from the Target Embedding layer is passed into the Target Fully Connected layer. The result is a vector of probabilities for words in the target language. From this vector, the predicted phrase is converted from a numeric vector representation to words in a natural language.
Fig. 1. Model architecture detailing the encoder and decoder networks and their inputs.
3.1 Embedding Layers

The model starts with an encoder layer to generate a vector that will be fed to the decoder to generate predictions in a target language. Embedding vectors for the encoder are built as layers, which are considered the source input to the encoding layer, as in Equation 10 below.

e_s ~ (e_{s_1}, e_{s_2}, ..., e_{s_n})    (10)

Here, e_s is the embedding vector and each subscript of s represents a different language used as a source for translation. This is used as the first layer in the encoder network. Similarly, an embedding layer for the decoder is also built, as in Equation 11 below.

e_t ~ (e_{t_1}, e_{t_2}, ..., e_{t_n})    (11)

e_t is the embedding vector and each subscript of t represents a different language used as the target prediction.

3.2 Modified Attention Mechanism

The modified attention mechanism considers a context vector c created by the encoder. This vector c is created based on all the hidden vectors of the hidden states during the encoding phase, and it has a representation of each word from the source. The attention score used to predict each target word is calculated by the dot product of the hidden value of each prediction and the encoded output [17]. This scoring mechanism is based on the global attention method proposed by Luong et al. [16]. Learning weights were introduced into the dot product score, which is calculated using Equation 12 below. The purpose of this is to learn the overall weights of the dot product score.

score(h_t, h_s) = v_a^T (h_t^T W_a h_s)    (12)

The context vector is computed by taking the dot product with the encoder output. This adds global alignment to the context vector, which is used to estimate the score of the next prediction. The attention mechanism is used to align decoder predictions of the target vector. Attention weights for each target language are defined as in Equation 13 below, where a is the attention weight and t is the target language.

a_t ~ (a_{t_1}, a_{t_2}, ..., a_{t_n})    (13)

3.3 Fully Connected Layer

The last layer is a fully connected layer, which is sized to the target language, as seen in Equation 14 below, where fc is a connected layer and t is each target language. The purpose of the fully connected layer is to act as a classifier for each targeted translated text.

fc ~ (fc_{t_1}, fc_{t_2}, ..., fc_{t_n})    (14)

4 Model Training

The model training process considers training for each set of translations in a sequence. For this experiment, Spanish and English are considered for training, with Gated Recurrent Units (GRUs) as the recurrent unit to address long term dependencies [6,28,29,30]. Here W_{en}, V_{en}, W_{sp}, and V_{sp} all act as attention weight matrices, e_{sp} is the embedding layer for Spanish and e_{en} is the embedding layer for English, and d_{sp} is the fully connected layer for Spanish and d_{en} is the fully connected layer for English. The weight matrices for each gate in the GRU are represented as Γ_{ue} and Γ_{ud} for the update gates, Γ_{re} and Γ_{rd} for the relevance gates, c_e and c_d for the context gates, while h_{(te)} and h_{(td)} represent the hidden vectors. The weights are initialized using the Glorot Uniform Initializer [31]. The Adam optimization algorithm is used with a learning rate α and a decayed learning rate γ. Loss is measured using a discrete classification methodology that leverages the sparse softmax cross-entropy with logits loss. Spanish-English is used as the parallel dataset, where each example pair is trained in parallel.
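Before turning to the training algorithm, the sketch below gives one plausible reading of the modified score in Equation 12; it is an interpretation, not the authors' verified code. Here v_a is treated as a learned per-dimension weight vector folded into Luong's multiplicative score h_t^T W_a h_s, and all shapes are assumptions.

import numpy as np

def modified_scores(h_t, H_s, W_a, v_a):
    """Score every encoder output against decoder state h_t (Eq. 12).

    h_t : decoder hidden state, shape (d,)
    H_s : encoder outputs, shape (T_x, d)
    W_a : learned alignment matrix, shape (d, d)
    v_a : learned overall weight vector, shape (d,) -- an assumed
          reading of how v_a enters the multiplicative score
    """
    # Per source position: a v_a-weighted version of h_t^T W_a h_s,
    # i.e. h_t^T diag(v_a) W_a h_s
    return np.array([v_a @ (h_t * (W_a @ h_s)) for h_s in H_s])

Under this reading, setting v_a to all ones recovers the plain multiplicative score, so the extra learned vector can only add capacity on top of the Luong-style alignment.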
A pseudo algorithm of the training process is given below.

Algorithm 1 Model Training Process
Require: Parallel dataset D with phrases in both Lang A and Lang B
1:  repeat
2:  for all phrases p made up of (Lang A, Lang B) in D do
3:    Block 1: Encode the Lang A (Spanish) example and compute the encoder GRU layer Γ_{ue}, Γ_{re}, c_e, h_{(te)}
4:    Decode to predict Lang B (English) using the encoder output c_e, h_{(te)}
5:    Compute W_{en} and V_{en} for the alignment model
6:    Compute Γ_{ud}, Γ_{rd}, c_d, h_{(td)} for each prediction
7:    Compute d_{en}
8:    Compute loss L_1 using the sparse softmax cross-entropy with logits loss
9:    Block 2: Encode the Lang B (English) example and compute the encoder GRU layer Γ_{ue}, Γ_{re}, c_e, h_{(te)}
10:   Decode to predict Lang A (Spanish) using the encoder output c_e, h_{(te)}
11:   Compute W_{sp} and V_{sp} for the alignment model
12:   Compute Γ_{ud}, Γ_{rd}, c_d, h_{(td)} for each prediction
13:   Compute d_{sp}
14:   Compute loss L_2 using the sparse softmax cross-entropy with logits loss
15:   Compute total loss L = L_1 + L_2
16:   Optimize using Adam optimization with learning rate α and decayed learning rate γ
17:  end for
18:  until all phrases (Lang A, Lang B) have been processed

This process is repeated for all the examples. Here, we keep the encoder and decoder the same for all the languages that are trained for prediction. If another translation is added, then the blocks are repeated for each language.
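The following TensorFlow sketch mirrors one iteration of Algorithm 1 under stated assumptions: `model` is a hypothetical Keras model wrapping the shared encoder/decoder with per-language embedding and fully connected heads, teacher-forcing details are glossed over, and `masked_loss` is an assumed helper; it is not the authors' code.

import tensorflow as tf

loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(
    from_logits=True, reduction="none")
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-3)

def masked_loss(targets, logits):
    # Average the per-token loss, ignoring padded positions (token id 0).
    mask = tf.cast(tf.not_equal(targets, 0), tf.float32)
    per_token = loss_fn(targets, logits)
    return tf.reduce_sum(per_token * mask) / tf.reduce_sum(mask)

@tf.function
def train_step(spanish_batch, english_batch):
    with tf.GradientTape() as tape:
        # Block 1: Spanish -> English through the shared encoder/decoder,
        # selecting the Spanish embedding and English output head.
        logits_en = model(spanish_batch, english_batch,
                          source_lang="sp", target_lang="en")
        loss_1 = masked_loss(english_batch, logits_en)
        # Block 2: English -> Spanish through the same shared networks.
        logits_sp = model(english_batch, spanish_batch,
                          source_lang="en", target_lang="sp")
        loss_2 = masked_loss(spanish_batch, logits_sp)
        total = loss_1 + loss_2   # total loss L = L_1 + L_2 (Algorithm 1, line 15)
    grads = tape.gradient(total, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return total

Because both directions contribute to a single summed loss before the Adam step, every update pulls the shared encoder and decoder toward both translation tasks at once.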
5 Dataset

Parallel datasets for Spanish and English are used for training the Universal Vector model. Data is taken from Many Things, an online resource for English as a Second Language students [32]. It contains 122,936 pairs of phrases in English with corresponding Spanish translations. The primary source of the dataset used in this study, along with many more language pairings, is Many Things [32]; we used a copy hosted by the TensorFlow team at http://storage.googleapis.com/download.tensorflow.org/data/spa-eng.zip.

The Universal Vector model is trained using a modified version of the Dual Training method proposed by Xia et al. [27]. The model is trained with a sequence for each training example, first Spanish to English and then English to Spanish, for every iteration of the dataset mentioned above. Sample phrases in both English and Spanish were used to test the predictive ability of the network in both languages. The model has been trained at 20, 30, and 40 epochs to see the effectiveness of the model as the amount of training increases.
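Since the corpus is distributed as tab-delimited sentence pairs, loading it reduces to a few lines. The sketch below pulls the TensorFlow-hosted copy cited above; the exact extraction layout can vary by TensorFlow version, so the path handling is an assumption.

import os
import tensorflow as tf

# Download the TensorFlow-hosted copy of the Many Things pairs [32].
path_to_zip = tf.keras.utils.get_file(
    "spa-eng.zip",
    origin="http://storage.googleapis.com/download.tensorflow.org/data/spa-eng.zip",
    extract=True)
# The archive unpacks next to the zip; adjust if your version differs.
path_to_file = os.path.join(os.path.dirname(path_to_zip), "spa-eng", "spa.txt")

with open(path_to_file, encoding="utf-8") as f:
    lines = f.read().strip().split("\n")

# Each line holds an English phrase and its Spanish translation, tab-separated.
pairs = [line.split("\t")[:2] for line in lines]
print(len(pairs))   # 122,936 phrase pairs
print(pairs[0])     # one short English/Spanish pair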
6 BLEU Score

A Bilingual Evaluation Understudy (BLEU) score was used as a metric to determine the effectiveness of our NMT. BLEU was developed as a replacement for human-based validation of machine translation, which was becoming an expensive bottleneck due to the need for language expertise. The formula to calculate the score is language independent, does not need to be trained, and is able to mimic human evaluation. The function takes in the translated sentence and one or more reference sentences that it will be compared with. Groups of words, or n-grams, in the translated sentence to be evaluated are matched with n-grams in the reference sentences.

The first step in the scoring process is to calculate a precision score by taking the number of matching n-grams between the evaluated sentence and the reference sentences. This number is then divided by the total count of the n-grams in both the reference sentences and the evaluated translation. This equation can be found below in Equation 15 [33]. Another consideration when determining a score for a translation is the length of the output. There are many ways to say the same thing in most languages, but using too many words can introduce ambiguity and using too few words may not provide enough nuance.

p_n = Σ_{C ∈ {Candidates}} Σ_{n-gram ∈ C} Count_clip(n-gram) / Σ_{C' ∈ {Candidates}} Σ_{n-gram' ∈ C'} Count(n-gram')    (15)

Penalties are in place to ensure sentences of proper length score better. The precision score equation has a built-in penalty for candidate sentences that are too long, as more n-grams will increase the denominator and lead to a smaller score. For translations that are too short, a penalty is introduced in the form of a Brevity Penalty (BP), as in Equation 16 below [33]. r is the count of the words in the reference sentence that is closest in length to the translated sentence being evaluated; c is the length of the candidate sentence. If there is a match, the BP is 1 and no penalty is assessed. If there is not an exact match in length, then a penalty is assessed according to an exponentiation of e.

BP = 1, if c > r;  BP = e^{(1 - r/c)}, if c ≤ r    (16)

The overall BLEU score for a candidate sentence is the product of the brevity penalty and the exponential of the sum of the log of each precision score multiplied by a positive weight w_n. This weight is based on the number of n-gram sizes N such that w_n = 1/N. The overall score is found by using Equation 17 below. Equation 18 below is a form of the equation that provides values that are more easily ranked among other candidate translated sentences, by applying a log to the whole score.

BLEU = BP · exp(Σ_{n=1}^{N} w_n log p_n)    (17)

log BLEU = min(1 - r/c, 0) + Σ_{n=1}^{N} w_n log p_n    (18)

The NLTK BLEU score package is used for evaluation of the model.
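For reference, scoring a candidate against a reference with NLTK looks like the following. The example sentences are taken from the test phrases in Section 7, and the bigram-only weights (w_n = 1/N with N = 2) plus the smoothing choice are assumptions made to suit the short sentence lengths discussed below.

from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = [["they", "abandoned", "their", "country"]]
candidate = ["they", "abandoned", "his", "country"]

# Weight unigrams and bigrams only, since the test phrases are too
# short for informative 3- and 4-gram matches.
score = sentence_bleu(reference, candidate,
                      weights=(0.5, 0.5),
                      smoothing_function=SmoothingFunction().method1)
print(round(score, 3))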
7 Translation Results

The following section covers the translation results obtained from the Universal Vector model, discussing translations from English to Spanish and from Spanish to English. Example phrases in each language were fed to the model. Two example pairs of phrases are found in Table 1 below.
Table 1. Example phrases used for testing

English                         Spanish
They abandoned their country    Ellos abandonaron su país
This is my life                 Esta es mi vida
The results of the English to Spanish task can be found in Table 2 below. In the case of our model, a BLEU score cannot capture the accuracy since it is based on matching n-grams; the sentences were too short to have anything larger than matching bigrams, which are too small for the scoring algorithm. The result of the first phrase perfectly matched the reference sentence found in Table 1. The output of the second phrase switched the gender of the word for "this" from "esta" to "esto". Without more context before a phrase, the model is not able to consistently determine the genders of specific words.

When English and Spanish are flipped, the model provided similar results. The resulting English outputs can be found below in Table 3. Small differences are present, again due to small gender differences that short sentences will be expected to yield without proper context for pronouns.

Table 2. English input and Spanish output
English Input                   Spanish Output
They abandoned their country    Ellos abandonaron su país
This is my life                 Esto es mi vida
Table 3. Spanish input and English output

Spanish Input                 English Output
Ellos abandonaron su país     They abandoned his country
Esta es mi vida               This is my life
Sentences longer than four or five words yielded very poor results. This is due to the small dataset and the low number of training iterations compared with other papers in the NMT space, such as most of those cited in this paper. With a larger dataset and more training time, the model would better handle longer phrases.
8 Analysis

The following section covers the analysis; the subsections that follow discuss the BLEU score, the loss results, and the attention model.
8.1 BLEU Score Analysis

Applying the BLEU score to the Universal Vector model resulted in unfavorable scores. Table 4 shows the results of the BLEU score from Spanish to English and English to Spanish. BLEU score calculations are provided as part of this work to show the minimum capability of this model to translate more than one language using a single universal model. The scores from this work should not be compared with other translation models like BERT and other Transformer based models [17,34]. There are two main reasons for this. First, the tested sentences were short in length. Second, the short sentences did not meet the minimal n-gram length of 2 for proper scoring. The use of longer sentences could have solved these issues; however, the model had difficulty translating longer sentences at the level of training we were able to accomplish in the time given (60 epochs).
Table 4. BLEU Score Results

Sentence            Direction            BLEU Score
esto es mi vida.    English to Spanish
this is my life.    Spanish to English

8.2 Loss Analysis
Fig. 2. Loss analysis by epoch during training of the model.
Since the BLEU score could not properly capture model accuracy for testing, more attention was placed on minimizing loss. The loss explains how well the model is performing by minimizing error; a lower loss correlates to a better performing model. Figure 2 shows the performance of the Universal Vector model by loss and training time over each epoch. The results are shown between 40 and 60 epochs to show where the loss curve flattens. The figure shows that the loss gradually declined between 41 and 48 epochs and then began to stabilize.

8.3 Attention Analysis

Heat maps were created to visualize how the attention mechanism directed the focus of the decoder when predicting the corresponding text in a translation. The diagram has each word in the source language on the top and each word in the predicted sentence in the target language on the left axis. Fig. 3 below was generated when the Spanish phrase "Esta es mi vida" was fed into the model. On the left is the output of the model, which is a prediction of the English translation. As a visual reminder, the heat map does not necessarily show how words are correlated from source to target. Instead, the visual representation of the heat map gives insight into the parts of the input that the attention model focuses on when translating. For example, the yellow box in the upper left shows heavy focus on the Spanish word "esta" when the model predicts the English word "this". From there, the heavy areas of focus follow a diagonal line down the map. This means that as the decoder moves on to predict words later in the sentence, the focus is directed to later parts of the source sentence, which is generally good. Longer sentences would show more defined and more varied areas of heat as they get more complicated. Overall, the maps generated from the small sentence sizes that the model can handle show potential that the modified attention mechanism is working as intended.
Fig. 3. Heat map showing areas of focus from Spanish to English.
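A heat map like Fig. 3 can be rendered directly from the attention weight matrix. The sketch below is illustrative only: the `attention` values are hypothetical stand-ins for the α weights that would be collected during decoding (target words by rows, source words by columns).

import matplotlib.pyplot as plt
import numpy as np

# Hypothetical attention weights; real values would be the alpha
# weights gathered at each decoder step.
attention = np.array([[0.7, 0.1, 0.1, 0.1],
                      [0.1, 0.6, 0.2, 0.1],
                      [0.1, 0.1, 0.7, 0.1],
                      [0.1, 0.1, 0.2, 0.6]])
source = ["Esta", "es", "mi", "vida"]
target = ["This", "is", "my", "life"]

fig, ax = plt.subplots()
ax.matshow(attention, cmap="viridis")   # brighter cells = heavier focus
ax.set_xticks(range(len(source)))
ax.set_xticklabels(source)
ax.set_yticks(range(len(target)))
ax.set_yticklabels(target)
plt.show()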
9 Limitations and Future Work

For further model experimentation on translation of more than two languages, a parallel dataset containing a triad of language phrases is required. While the architecture and model built as part of this experiment are designed to handle more than two languages, we only consider using a single model for two languages. As of today, most available parallel datasets are bilingual. In the future, a parallel dataset with three or more languages will be used to train and modify the current universal vector representation model. Furthermore, larger datasets will be used with more training iterations, akin to other papers in the NMT space. A more standardized test, such as those provided by the annual Workshop on Machine Translation, can be used on the model's translated text.
10 Previously Considered Experiments
Connected Learning was the first attempt at a novel proposal. At the time of the initial research, there were no other papers proposing the methods that made up this new idea. This method would allow the weights to learn source and target language in a Z format: first the model is trained in the direction Source → Target, then immediately trained again in the direction Target → Source, and finally the weights are retrained from Source → Target.

In connected learning, training is done on the source sequence of vectors x = (x_1, ..., x_{T_x}) and the target sequence of vectors y = (y_1, ..., y_{T_y}). For each of the sequence pairs of vectors, the source and target are swapped twice by utilizing the hidden output as the input when swapped. For example, if vector sequence x represents Spanish and vector sequence y is English, the model would first generate h_t = f(h_{t-1}, x_t) and c = q(h_1, ..., h_{T_x}), use the context vector c when combining the hidden state of the recurrent network, and then provide vector sequence y as the source and generate values for x.

The belief was that the weights in the contextual information would hold all the target information; however, the model could not converge to a local optimum point where it was aligned to both source and target information.
11 Conclusion
In this paper, the idea of a "Universal Vector" is proposed as a new facet of NMT that can be used to translate between multiple languages in the same vector space. Models are usually built to translate in one direction. Some work exists that uses both directions between a source and target language for reinforcement learning of training sets. However, the "Universal Vector" model is a singular model that can be trained in both directions (source to target and target to source) for more than one pair of languages.

The "Universal Vector" model detailed in this paper was built to test this proposition by modifying an RNN based Encoder-Decoder model. Existing attention mechanisms were also modified and used to create context vectors that increased performance in predicting the next translated text for overall target phrase translation. Multiple fully connected layers are added, one for each target language, to facilitate translations into multiple target languages.

The model is trained with parallel English and Spanish datasets. Phrases from both languages are trained from English to Spanish and Spanish to English within a recurrent network using Dual Training based methods. It was tested with many examples of both Spanish and English phrases. The attention mechanism was evaluated by viewing heat maps of where the model selectively focused on input text for its corresponding translated text.

While the results are promising, with more time and resources the experiment would provide better results. With more computing power the model can be trained using more words and more languages in a reasonable amount of time. In the future, better accepted benchmarks in translation, such as those provided by the annual Workshop on Machine Translation, can be used. While limited in scope, these results point to the potential for greater accuracy using a singular model for translating between multiple languages.
References
1. Nal Kalchbrenner and Phil Blunsom. Recurrent continuous translation models. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1700-1709, Seattle, Washington, USA, October 2013. Association for Computational Linguistics.
2. Sébastien Jean, Kyunghyun Cho, Roland Memisevic, and Yoshua Bengio. On using very large target vocabulary for neural machine translation. Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), 2015.
3. Thang Luong, Ilya Sutskever, Quoc V. Le, Oriol Vinyals, and Wojciech Zaremba. Addressing the rare word problem in neural machine translation. CoRR, abs/1410.8206, 2014.
4. Philipp Koehn, Franz J. Och, and Daniel Marcu. Statistical phrase-based translation. In Proceedings of the 2003 Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics, pages 127-133, 2003.
5. Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using RNN encoder-decoder for statistical machine translation. Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2014.
6. Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9:1735-1780, 12 1997.
7. David E. Rumelhart, Geoffrey E. Hinton, and Ronald J. Williams. Learning Representations by Back-Propagating Errors, pages 696-699. MIT Press, Cambridge, MA, USA, 1988.
8. H. Schwenk. CSLM - a modular open-source continuous space language modeling toolkit. Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, pages 1198-1202, 01 2013.
9. Martin Sundermeyer, Ralf Schlüter, and Hermann Ney. LSTM neural networks for language modeling. 09 2012.
10. Kyunghyun Cho, Bart van Merrienboer, Dzmitry Bahdanau, and Yoshua Bengio. On the properties of neural machine translation: Encoder-decoder approaches. Proceedings of SSST-8, Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation, 2014.
11. M. Schuster and K. K. Paliwal. Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing, 45(11):2673-2681, 1997.
12. Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio. On the difficulty of training recurrent neural networks. In Proceedings of the 30th International Conference on International Conference on Machine Learning - Volume 28, ICML'13, pages III-1310-III-1318. JMLR.org, 2013.
13. Jean Pouget-Abadie, Dzmitry Bahdanau, Bart van Merriënboer, Kyunghyun Cho, and Yoshua Bengio. Overcoming the curse of sentence length for neural machine translation using automatic segmentation. In Proceedings of SSST-8, Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation, pages 78-85, Doha, Qatar, October 2014. Association for Computational Linguistics.
14. Alex Graves, Greg Wayne, and Ivo Danihelka. Neural Turing machines. CoRR, abs/1410.5401, 2014.
15. Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. CoRR, abs/1409.0473, 2014.
16. Thang Luong, Hieu Pham, and Christopher D. Manning. Effective approaches to attention-based neural machine translation. Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, 2015.
17. Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NIPS, 2017.
18. Karl Moritz Hermann and Phil Blunsom. Multilingual distributed representations without word alignment. In Proceedings of ICLR, April 2014.
19. Mikel L. Forcada and Ramón P. Ñeco. Recursive hetero-associative memories for translation. In Proceedings of the International Work-Conference on Artificial and Natural Neural Networks: Biological and Artificial Computation: From Neuroscience to Technology, IWANN '97, pages 453-462, Berlin, Heidelberg, 1997. Springer-Verlag.
20. Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. Sequence to sequence learning with neural networks. In NIPS, 2014.
21. F. A. Gers, J. Schmidhuber, and F. Cummins. Learning to forget: continual prediction with LSTM. In Ninth International Conference on Artificial Neural Networks (ICANN 99), volume 2, pages 850-855, 1999.
22. Zachary Chase Lipton. A critical review of recurrent neural networks for sequence learning. CoRR, abs/1506.00019, 2015.
23. Hojjat Salehinejad, Julianne Baarbe, Sharan Sankar, Joseph Barfett, Errol Colak, and Shahrokh Valaee. Recent advances in recurrent neural networks. CoRR, abs/1801.01078, 2018.
24. Alex Sherstinsky. Fundamentals of recurrent neural network (RNN) and long short-term memory (LSTM) network. CoRR, abs/1808.03314, 2018.
25. Razvan Pascanu, Caglar Gulcehre, Kyunghyun Cho, and Yoshua Bengio. How to construct deep recurrent neural networks. In Proceedings of the Second International Conference on Learning Representations (ICLR 2014), 2014.
26. Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, Jeff Klingner, Apurva Shah, Melvin Johnson, Xiaobing Liu, Łukasz Kaiser, Stephan Gouws, Yoshikiyo Kato, Taku Kudo, Hideto Kazawa, Keith Stevens, George Kurian, Nishant Patil, Wei Wang, Cliff Young, Jason Smith, Jason Riesa, Alex Rudnick, Oriol Vinyals, Greg Corrado, Macduff Hughes, and Jeffrey Dean. Google's neural machine translation system: Bridging the gap between human and machine translation. CoRR, abs/1609.08144, 2016.
27. Di He, Yingce Xia, Tao Qin, Liwei Wang, Nenghai Yu, Tie-Yan Liu, and Wei-Ying Ma. Dual learning for machine translation. In Proceedings of the 30th International Conference on Neural Information Processing Systems, NIPS'16, pages 820-828, USA, 2016. Curran Associates Inc.
28. Y. Bengio, P. Simard, and P. Frasconi. Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks, 5(2):157-166, 1994.
29. Sepp Hochreiter. Untersuchungen zu dynamischen neuronalen Netzen. Diploma thesis, Technische Universität München, 1991.
30. Sepp Hochreiter, Yoshua Bengio, Paolo Frasconi, and Jürgen Schmidhuber. Gradient flow in recurrent nets: the difficulty of learning long-term dependencies. 2001.
31. Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In Yee Whye Teh and Mike Titterington, editors, Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, volume 9 of Proceedings of Machine Learning Research, pages 249-256, Chia Laguna Resort, Sardinia, Italy, 13-15 May 2010. PMLR.
32. Charles Kelly. Tab-delimited Bilingual Sentence Pairs, March 2020.
33. Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311-318, 2002.