Bilingual is At Least Monolingual (BALM): A Novel Translation Algorithm that Encodes Monolingual Priors
Jeffrey Cheng*
Department of Computer Science, University of Pennsylvania, Philadelphia, PA 19104
[email protected]

Chris Callison-Burch
Department of Computer Science, University of Pennsylvania, Philadelphia, PA 19104
[email protected]

* Please find my most recent contact information at http://jeffreyscheng.com/.

Preprint. Work in progress.
Abstract
State-of-the-art machine translation (MT) models do not use knowledge of any single language's structure; this is the equivalent of asking someone to translate from English to German while knowing neither language. BALM is a framework that incorporates monolingual priors into an MT pipeline; by casting the input and output languages into embedded spaces using BERT, we can solve machine translation with much simpler models. We find that German-to-English translation on the Multi30k dataset can be solved with a simple feedforward network under the BALM framework with near-SOTA BLEU scores.
We motivate this research with a problem, an observation, and a technology.
1. The problem domain: Machine translation (MT) requires many pairs of parallel texts (also known as bilingual corpora): pairs of documents that are sentence-wise identical in meaning but written in different languages. Since translation is a variable-length learning problem in both input and output, it requires a large quantity of data. Machine translation on pairs of languages where parallel texts are scarce is an open problem (e.g. very few documents are written in both Haitian Creole and Sanskrit, so translation between the two is hard).
2. An observation about current solutions: It would be unreasonable – ludicrous, even – to ask a person who speaks neither English nor German to perform English-German translation. And yet most MT algorithms incorporate no prior knowledge of either their input or output languages at any level (syntax or semantics); instead, these MT algorithms learn the translation problem directly. This goes against the conventional wisdom of building inductive biases based on domain expertise (e.g. translational invariance, rotational invariance, hierarchical encoding) into models.
3. An underutilized technology: It is hypothesized that the bidirectional encoder representations from transformers (BERT) algorithm allows practitioners to incorporate prior knowledge about a language by encoding sentences as fixed-length embeddings. If this hypothesis is true, then BERT can leverage the observation about how humans perform translation in order to address the MT problem posed. However, no work has proven whether BERT's mean-pooled fixed-length encodings can actually serve as sentence embeddings.

Suppose we want to translate between two languages A and B. Further suppose that there are large quantities of written text in both of these languages, but there are very few parallel texts in languages A and B. Current MT algorithms are unable to solve this problem, despite the fact that the vast majority of pairs of languages fall under this description. Our premise is to develop a novel algorithm – the Bilingual-is-At-Least-Monolingual model (BALM) – which uses BERT to incorporate prior knowledge about language A and language B independently. This encoding of prior knowledge allows BALM to learn the translation problem as a fixed-length mapping problem (which is easier and more data-efficient). We now motivate our concept more deeply.
MT is difficult in part because natural language is variable-length: the number of tokens in sentences varies greatly. Most classification models (logistic regression, decision trees, support vector machines, feedforward neural networks) have a rigid API that only allows for fixed-length inputs and fixed-length outputs. Thus, they are not only unable to learn MT; they are unable to compute even a single forward pass over the data.

Certain architectures are designed for variable-length problems, notably recurrent neural nets (RNNs) and recursive neural nets. However, even RNNs cannot learn MT directly, since their API is still not flexible enough.

• RNNs can map from variable-length inputs to fixed-length outputs.
• RNNs can map from fixed-length inputs to variable-length outputs.
• RNNs can map from variable-length inputs to variable-length outputs iff the input and output have the same length.

There are no atomic models that directly allow for arbitrarily variable-length inputs and outputs. Since MT has natural language as both input and output, there are therefore no atomic models that can directly solve MT. MT practitioners are thus forced to use rigid compound models that either use recurrence or enforce an awkward maximum sequence length. We would like to relax the conditions of the MT problem to allow simpler models, such as feedforward neural networks.
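To make the API constraint concrete, consider the following minimal PyTorch sketch (illustrative only; the dimensions are arbitrary and not taken from our experiments). A GRU accepts sequences of any length but always emits a summary of one fixed size:

```python
import torch
import torch.nn as nn

# A GRU maps a variable-length input sequence to a fixed-length summary:
# whatever the sequence length, the final hidden state has a fixed size.
gru = nn.GRU(input_size=16, hidden_size=32, batch_first=True)

short_sentence = torch.randn(1, 5, 16)   # 5 tokens
long_sentence = torch.randn(1, 50, 16)   # 50 tokens

_, h_short = gru(short_sentence)
_, h_long = gru(long_sentence)

# Both summaries have the same shape, (1, 1, 32), regardless of input length.
print(h_short.shape, h_long.shape)
```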
Neural machine translation (NMT) was popularized by encoder-decoder networks' impressive performance between English and French in 2015. [1]

Bahdanau et al.'s seq2seq architecture got around the variable-length problem by a simple composition of two attentioned RNNs: an encoder to embed sentences into a fixed-length thought vector and a decoder to interpret the thought vector as language. The attention mechanism allowed the models to recognize nonconsecutive patterns in sequence data directly, without sole reliance on memory gates. Seq2seq represented a clean improvement over prior efforts in statistical machine translation because of its compactness and because of its end-to-end differentiability.

Shortly after the success of seq2seq, the NMT literature noted two fundamental challenges with using recurrent models for translation.

1. Hochreiter and Schmidhuber found that each variable-length touchpoint within recurrent models exacerbates their susceptibility to vanishing gradients, since the path along the computation graph from the end loss to early weights increases linearly in the sequence length. The chain rule is multiplicative along the gradient path; therefore, the gradient may decay exponentially, which prevents efficient training. [5] Since MT has two such variable-length touchpoints (variable-length inputs and variable-length outputs), recurrent models for MT are difficult to tune and train reliably.

2. Neural networks require a large quantity of data – even simple problems like MNIST require 60K examples and several epochs in order to converge. MT is a particularly noisy domain. Koehn and Knowles note that the combination of these two factors makes NMT an extremely data-hungry problem setup. [7] Koehn and Knowles' observation is exacerbated by the fact that most pairs of languages are not like the English-French translation problem solved by Bahdanau et al. English and French are both extremely widely used languages, and the pair has an enormous number of parallel texts. Most pairs of languages have few parallel texts, and the data requirement imposed by such a model for universal translation scales quadratically in the number of languages. For example, anyone can find a public corpus of Romanian text on the order of billions of words with a quick search. [9] However, the largest parallel text source between Romanian and another language is only about 1 million tokens. [12]

Further developments in attention led to Google Brain's development of the transformer, which currently holds state-of-the-art results in machine translation between English-German and English-French. [13] Transformers utilize an encoder-decoder scheme using three kinds of multi-headed attention (encoder self-attention, decoder self-attention, and encoder-decoder attention). The transformer is a fixed-length, attention-based model, which reduces the vanishing gradient problem. However, even the smallest pre-trained English transformer models have over 110 million parameters – the size of this model worsens the data-hungriness problem.

Several works have attempted to design algorithms that artificially augment the sizes of NLP and MT datasets.

• Sennrich et al developed backtranslation, a semi-supervised algorithm that concatenates monolingual utterances to bilingual corpora by using NMT to impute a translation. Backtranslation demonstrates modest increases in BLEU.
However, neural nets under the backtranslation setup still do not learn the structure of any single language; the imputed examples are still used for directly learning the translation task. [10]

• Burlot et al found that variants of backtranslation and GAN-based data augmentation can increase the size of the dataset slightly without compromising the generalizability of the neural net. However, since learning curves are extremely sublinear with respect to duration, these data augmentation schemes provide small, diminishing returns. [2]

• Sriram, Jun, and Satheesh developed cold fusion (an NLP technique, contrasted with deep fusion), which performs simultaneous inference and language modeling in order to reduce dependence on dataset size. This is the exact inductive bias that humans use and is the jumping-off point for our work. However, cold fusion is unable to extend from monolingual inference to MT. This is likely because at the time of Sriram et al's writing (August 2017), no general sentence embedding algorithm existed. As explained later, we now bypass this constraint with Google's BERT algorithm. [11]

None of these algorithms adequately addresses our first observation: that MT given natural language understanding in a single language is a much easier learning task than directly learning MT. Therefore, since data quantity is a performance bottleneck and since bilingual training data for translation models is typically difficult to obtain, we would like to pre-train the models as much as possible by learning efficient representations using monolingual data before attempting translation. Kiros et al make progress towards pretrained monolingual language modeling for MT by learning multimodal embeddings on words. [6] We will progress in the same vein but jump directly to pre-training sentence embeddings rather than word embeddings by using the encoding prowess of transformers. This approach could have the added advantages of context-based disambiguation and better understanding of syntactic arrangement.
Deep bidirectional transformers for language understanding (BERT) is an extremely popular pre-trained transformer model that is supposedly able to encode sentences as fixed-length vectors using only a monolingual corpus; this represents an improvement over previous language embedding technologies such as ELMo, which could only embed words into vectors. [3] BERT uses the encoder stack of a transformer architecture to return a context-driven embedding for each word in a sequence – practitioners have found that taking the mean pool of the word embeddings in a sentence functions well as a sentence embedding. [14] However, BERT's sentence embedding property has never been verified with an autoencoder.

We note that if BERT is able to embed sentences into fixed-length vectors, the problem of translation no longer has the difficulty of being a variable-length input problem. Similarly, if we can find an inversion of the BERT model, the MT problem will no longer be a variable-length output problem. Combining these two hypothetical successes would cast MT into a fixed-length learning problem, which can be solved with any arbitrary classification algorithm. We would then be able to use much simpler models than attentioned RNNs (such as feedforward neural nets or even logistic regression).
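To make the mean-pooling operation concrete, the sketch below (a minimal illustration assuming the pytorch-pretrained-bert package used later in this paper; exact method names may differ across library versions) collapses BERT's per-token outputs into a single 768-dimensional sentence vector:

```python
import torch
from pytorch_pretrained_bert import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained('bert-base-cased')
model = BertModel.from_pretrained('bert-base-cased')
model.eval()

sentence = "A brown dog is drinking from the bowl."
tokens = tokenizer.tokenize(sentence)
token_ids = torch.tensor([tokenizer.convert_tokens_to_ids(tokens)])

with torch.no_grad():
    # Request only the final encoder layer's hidden states:
    # shape (batch=1, sequence_length, 768).
    encoded_layer, _ = model(token_ids, output_all_encoded_layers=False)

# Mean-pool across the sequence-length dimension to obtain a
# fixed-length sentence embedding of size 768.
sentence_embedding = torch.mean(encoded_layer, dim=1)
print(sentence_embedding.shape)  # torch.Size([1, 768])
```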
Our research plan proceeds in four steps.

1. Create an English sentence autoencoder using a pre-trained English BERT as the encoder and a newly initialized recurrent decoder.

   • If the autoencoder can reconstruct sentences with high fidelity, we will have verified that mean-pooling over BERT's word embeddings does create sentence embeddings. We will also have found a useful English thought-space that can be used for transfer learning.

   • It is still interesting if the autoencoder fails, since this disproves the running assumption of many practitioners that a mean-pooled BERT output acts as a sentence embedding. We would then attribute BERT's positive performance as a transfer learning tool to its ability to learn domain-specific embedding features but an inability to compress all sentence features into a single vector.

2. Conditional on a successful autoencoder, create a German-English translation model using German BERT as an encoder, a translation model mapping between English and German thought-spaces, and the aforementioned BERT-inverting decoder.

3. Conditional on a successful translator, compare its learning curve with SOTA models and check for convergence with shorter duration (fewer minibatches / epochs). If BALM is successful here, then it shows promise as an algorithm for MT in pairs of languages with few bilingual corpora.

4. Conditional on a successful translator, check for reasonably good translations as evaluated by bilingual evaluation understudy (BLEU). [8] Similarly, conditional on a successful translator, check for qualitatively good translations on in-sample and out-of-sample sentence examples. If BALM is successful here, then it shows promise as an algorithm for general MT as a substitute for current SOTA methods.
We begin with a non-standard definition that will clarify the model premises.
Definition 2.1 (Thought-space). For a language $L$, a thought-space $S_{L,k} \subseteq \mathbb{R}^k$ is a $k$-dimensional embedding of sentences in language $L$ for some fixed $k \in \mathbb{N}$.

The key insight of seq2seq learning is that by learning an intermediate thought-space, the overall sequence-to-sequence task becomes easier, since the subproblems are learnable by RNNs. Our insight here is that we can make machine translation significantly easier by learning two intermediate thought-spaces.

Suppose we want to translate German → English, with $L_{\text{English}}$ and $L_{\text{German}}$ representing the formal languages. We learn the two intermediate thought-spaces $S_{\text{English},k}$ and $S_{\text{German},k}$ – each language has its own embedded space. We thus construct the Bilingual-is-At-Least-Monolingual (BALM) model for German-to-English translation with the following three submodules:

1. A German BERT encoder learns to embed German natural language into a fixed-length embedding. Its outputs are German "thoughts."
This is learnable by monolingual datasets. In fact, one can easily download a pretrained BERT model that does this; no bilingual corpora are necessary. Formally,
$$B_{\text{German}} : L_{\text{German}} \to S_{\text{German},k} \quad \text{(encodes from German for fixed } k\text{)}$$

2. A feedforward neural net translates from fixed-length German thoughts into fixed-length English thoughts. Note that since this is a fixed-length problem, the learning task is significantly easier in this step and should require less data.
$$F_{\text{German}\to\text{English}} : S_{\text{German},k} \to S_{\text{English},k} \quad \text{(translates German thoughts into English thoughts)}$$

3. A recurrent English decoder learns to reconstruct English natural language from fixed-length English thoughts. This is learnable by monolingual datasets.
$$B_{\text{English}}^{-1} : S_{\text{English},k} \to L_{\text{English}} \quad \text{(decodes into English for fixed } k\text{)}$$

Under this framework, translation has been stripped of its variable-length property – we only need to train one atomic model with the bilingual parallel texts!

We must first pre-train $B_{\text{German}}$ and $B_{\text{English}}^{-1}$. We will take the former for granted since there are published pre-trained German BERT models. We will train $B_{\text{English}}^{-1}$ with an autoencoder, then transfer-learn by using the pre-trained weights of $B_{\text{German}}$ and $B_{\text{English}}^{-1}$ in a full translation model. For the autoencoder we utilize a pre-trained English BERT embedding model $B_{\text{English}}$:
$$B_{\text{English}} : L_{\text{English}} \to S_{\text{English},k} \quad \text{(encodes from English for fixed } k\text{)}$$

In order to build an autoencoder with a BERT encoder, we need to be able to invert the function $B_{\text{English}}$:
$$B_{\text{English}}^{-1} : S_{\text{English},k} \to L_{\text{English}} \quad \text{(decodes into English for fixed } k\text{)}$$

We implement $B_{\text{English}}^{-1}$ as an RNN with gated recurrent units (GRUs). We choose not to use vanilla RNNs due to concerns about vanishing gradients – natural language frequently reaches sequences of lengths greater than 50. We choose the GRU over the LSTM because its API is more compatible with the BERT encoder: BERT returns one hidden state, and GRUs recur with one hidden state while LSTMs recur with two.

The autoencoder $A$ is thus the composition $A = B_{\text{English}}^{-1} \circ B_{\text{English}}$. Note that $A(x) \approx x$.

Once the autoencoder has converged, we copy $B_{\text{English}}^{-1}$ as a pre-trained decoder and use a pre-trained German BERT model $B_{\text{German}}$. This allows us to transfer-learn from the language modeling task to the translation task. The full
BALM translation model is:
$$B_{\text{German}} : L_{\text{German}} \to S_{\text{German},k}$$
$$F_{\text{German}\to\text{English}} : S_{\text{German},k} \to S_{\text{English},k}$$
$$B_{\text{English}}^{-1} : S_{\text{English},k} \to L_{\text{English}}$$
The translator $T$ is thus the composition $T = B_{\text{English}}^{-1} \circ F_{\text{German}\to\text{English}} \circ B_{\text{German}}$. Note that while the translator is training, the feedforward network $F_{\text{German}\to\text{English}}$ is the only model being trained from initialization. We merely fine-tune the two transferred models $B_{\text{German}}$ and $B_{\text{English}}^{-1}$.

Given the two compound models that need to be trained (the autoencoder $A$ and the BALM translator $T$), we proceed to our MT methodology. We utilize Elliott et al's Multi30k translation dataset (https://github.com/multi30k/dataset), a multilingual extension of the Flickr30k image-captioning dataset. [4] Each image has an English-language caption and a human-generated German-language caption. We depict one such example below.
Figure 1: We see two images here. (Right) Each image has an independently written caption in both English and German. (Left) Each independent caption has a corresponding translation in the dataset. Image courtesy of Elliott et al.
By the nature of image captioning, this dataset is biased towards static structures (e.g. "This is a white house.") and animals doing things ("The brown dog is drinking from the bowl."). Notably, there are no questions, commands, or abstract statements in the data; thus, we should expect that our models are only able to create good representations of descriptive text.

We load this dataset using the built-in loader within torchtext, an NLP extension of Facebook's PyTorch framework. Natural language cannot be directly fed into an encoding model, so we must:

1. Tokenize the text.
2. Convert the text to categorical id values.
3. Initialize one-hot vectors.
4. Embed the vectors into a low-dimensional space for learning.

We utilize a hybrid ecosystem of standard PyTorch (https://pytorch.org/docs/stable/index.html), the NLP extension torchtext (https://torchtext.readthedocs.io/en/latest/), and pytorch-pretrained-bert (https://github.com/huggingface/pytorch-pretrained-BERT), a library within the PyTorch ecosystem for pre-trained transformers developed by huggingface.ai. The following diagram depicts the pipeline infrastructure and the dimensionality of the flow; a brief sketch of the tokenization steps follows it.
Figure 2: Three libraries transform natural language data into model-ready batch-wise tensors. Original flowchart diagram and original descriptions.
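As a minimal sketch of steps 1 and 2 (tokenization and id conversion) using the pytorch-pretrained-bert tokenizer – the example sentence is our own, and the dataset loading via torchtext is omitted here:

```python
from pytorch_pretrained_bert import BertTokenizer

# Tokenize text and convert it to categorical id values (steps 1 and 2).
# The one-hot and embedding steps are handled inside BERT's own embedding
# layer and our nn.Embedding module, so only token ids cross the boundary.
tokenizer = BertTokenizer.from_pretrained('bert-base-cased')

caption = "Two young men are playing frisbee in the park."
tokens = tokenizer.tokenize(caption)                 # step 1: WordPiece tokens
token_ids = tokenizer.convert_tokens_to_ids(tokens)  # step 2: ids in [0, 28996)

print(tokens)
print(token_ids)
```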
We now expand on the model and explain the forward passes of the autoencoder and the translator.
We implement the BALM autoencoder model in native PyTorch with the following modules:

• The pre-trained English BERT encoder. This is downloaded from the pytorch-pretrained-bert package and has 110 million parameters. We allow the gradient updates of the autoencoder to backpropagate through the pre-trained BERT model in order to fine-tune its embedding. Note that we are not using this encoder in the BALM translation model; the fine-tuning is solely to improve learning outcomes for the GRU decoder model.

• A mean pool layer, implemented with torch.mean. The BERT encoder has 12 self-attention layers, each of which outputs a hidden state. We could theoretically learn over all 12 hidden states, since this would capture the entire set of low-level and high-level features embedded by the transformer. However, learning over such a rich feature space would present memory issues for the GPU; thus, we take only the last hidden state to obtain a vector of size 768 for each token. We then mean-pool across the sentence-length dimension to get a sentence embedding of size 768; this operation is optimized by PyTorch's CUDA interface.

• The English GRU decoder layer. This is implemented as a single-layer RNN with gated recurrent units and hidden dimension equal to that of BERT's encoding. In order to produce an output at a given timestep t, the GRU takes the hidden state at t and passes it through a linear layer of output size equal to the vocabulary length (28996).

• A word embedding layer (not depicted below). We implement teacher forcing on the GRU decoder in order to learn tail-end patterns early in training. In order to teacher-force correctly, the incoming target words must have the same dimensionality as the GRU's hidden layers. Thus, we learn a custom word embedding, implemented with nn.Embedding.

The autoencoder's forward pass thus uses three trainable modules: the word embedding layer, the transformer encoder stack from BERT, and the GRU decoder. The embedding and the decoder are trained from initialization. The sequence of layers and rationale for architecture choices are depicted in the following flow diagram; a minimal code sketch of the forward pass follows the diagram.
Figure 3: Original flow chart diagram and original descriptions.
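The sketch below is a simplified, illustrative reconstruction of the autoencoder's forward pass rather than the exact code in our repository; the class name and the teacher-forcing interface are our own simplifications:

```python
import torch
import torch.nn as nn
from pytorch_pretrained_bert import BertModel

VOCAB_SIZE = 28996   # bert-base-cased vocabulary
HIDDEN_SIZE = 768    # BERT hidden dimension

class BALMAutoencoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = BertModel.from_pretrained('bert-base-cased')  # fine-tuned
        self.embedding = nn.Embedding(VOCAB_SIZE, HIDDEN_SIZE)       # trained from scratch
        self.decoder = nn.GRU(HIDDEN_SIZE, HIDDEN_SIZE, batch_first=True)
        self.out = nn.Linear(HIDDEN_SIZE, VOCAB_SIZE)

    def forward(self, token_ids, target_ids):
        # Encode: last BERT layer, shape (batch, seq_len, 768).
        encoded, _ = self.encoder(token_ids, output_all_encoded_layers=False)
        # Mean-pool over the sentence-length dimension -> (batch, 768).
        thought = torch.mean(encoded, dim=1)
        # Use the sentence embedding as the GRU's initial hidden state.
        hidden = thought.unsqueeze(0)                      # (1, batch, 768)
        # Teacher forcing: feed the embedded target tokens as decoder inputs.
        decoder_inputs = self.embedding(target_ids)        # (batch, tgt_len, 768)
        outputs, _ = self.decoder(decoder_inputs, hidden)  # (batch, tgt_len, 768)
        return self.out(outputs)                           # logits over the vocabulary
```

During training, the output logits are compared against the input token ids with a cross-entropy loss, so that the reconstruction satisfies $A(x) \approx x$.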
Translator Implementation
We implement the translator with the same framework as the autoencoder, replacing the English BERT encoder with a German BERT encoder. We also add an intermediate module to learn the desired function $F_{\text{German}\to\text{English}}$ mapping between the thought-spaces of the two languages.

• The pre-trained German BERT encoder. This is downloaded from the pytorch-pretrained-bert package and has 110 million parameters; the specific model used in this experiment is actually a multilingual BERT encoder that can take in tokens in English, French, or German. We fine-tune the model by allowing gradient updates.

• A mean pool layer, implemented with torch.mean. The rationale for this module is the same as the rationale for the mean pool layer in the autoencoder.

• The feedforward network representing a thought translator (implemented with two nn.Linear units and ReLU activations). This neural net's purpose is to learn the fixed-length function $F_{\text{German}\to\text{English}}$; it maps from the 768-dimensional German thought-space to the 768-dimensional English thought-space through a single hidden layer.

• The English GRU decoder layer. This is transfer-learned from the autoencoder. We allow the gradient updates of the translator to backpropagate in order to fine-tune the model for translation.

• A word embedding layer (not depicted below) for teacher forcing.

The translator's forward pass thus uses four trainable modules: the word embedding layer, the transformer encoder stack from BERT, the feedforward network, and the GRU decoder. The sequence of layers and rationale for architecture choices are depicted in the flow diagram that follows the brief sketch below.
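The thought translator itself is a small fixed-length model. The following sketch is illustrative only; the hidden width shown is a placeholder of our choosing, not the tuned value:

```python
import torch.nn as nn

# Fixed-length thought translator F_{German -> English}:
# maps a 768-dimensional German thought to a 768-dimensional English thought.
HIDDEN = 768  # placeholder width, not the tuned value

thought_translator = nn.Sequential(
    nn.Linear(768, HIDDEN),
    nn.ReLU(),
    nn.Linear(HIDDEN, 768),
)

# In the translator's forward pass:
#   english_thought = thought_translator(german_thought)
# where german_thought is the mean-pooled output of the German BERT encoder.
```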
Figure 4: Original flow chart diagram and original descriptions.

We tune the hyperparameters of both pipelines with three objectives:

• Minimize loss on the validation set.
• Minimize iteration-to-iteration variability.
• Keep total memory allocation under 10GB in order to run the model on a standard Nvidia GeForce GTX 1080 GPU.

The final tuned hyperparameters are listed below. Surprisingly, the models seemed to be relatively insensitive to choices of hyperparameters. We attribute this to the large amount of pretraining in the submodules for both the autoencoding and the translating tasks; given a good "seed" embedding from a careful selection of hyperparameters during the pre-training phase, perhaps the task itself becomes relatively invariant to hyperparameter tuning.
Figure 5: The final tuned hyperparameters of both the autoencoder setup and the translator setup. They are the same for both pipelines for consistency.
Given the above hyperparameters, we run the autoencoder model and the translator model in sequence. Each model's training run of 200 epochs took roughly 20 hours: a relatively quick convergence for an NLP task. We make three observations about the autoencoder's learning curve.

• The convergence is extremely fast. Many NLP tasks take several hundred epochs to converge; the elbow of this learning curve is around 15 epochs.

• The convergence is extremely smooth. Given how noisy natural language is as a domain, this is surprising.

• The convergence of the learning curve is right at the Bayes error: the loss approaches zero, whereas random guessing over 28996 classes has an expected cross-entropy loss of
$$\mathbb{E}[L(y, \hat{y})] \geq L(y, \mathbb{E}[\hat{y}]) \quad \text{(by Jensen's inequality; } L \text{ is convex in } \hat{y}\text{)}$$
$$= -\sum_{z} \mathbb{I}(z = y) \ln \mathbb{E}[\hat{y}_z] \quad \text{(by the definition of cross-entropy)}$$
$$= -\ln \tfrac{1}{28996} \approx 10.27.$$

While a low cross-entropy loss is a signal of a good model and is conveniently differentiable, it doesn't directly give us the quality of reconstructions. We turn to the bilingual evaluation understudy (BLEU) score for a more holistic measure of the autoencoder's reconstruction. The BLEU score measures the "adequacy, fidelity, and fluency" of proposed translations by measuring the proportion of n-grams shared between the proposed translation and the ground-truth translation. [8]

The BALM autoencoder achieves a remarkably high BLEU score of 0.605.
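For reference, corpus-level BLEU can be computed with NLTK's implementation; the sketch below is illustrative (toy sentences of our own) and is not necessarily the scorer used in our pipeline:

```python
from nltk.translate.bleu_score import corpus_bleu

# One list of references per hypothesis; every sentence is pre-tokenized.
references = [
    [["a", "brown", "dog", "is", "drinking", "from", "the", "bowl"]],
    [["two", "men", "are", "playing", "frisbee", "in", "the", "park"]],
]
hypotheses = [
    ["a", "brown", "dog", "is", "drinking", "from", "the", "bowl"],  # exact match
    ["two", "men", "play", "frisbee", "in", "the", "park"],          # partial match
]

# corpus_bleu pools n-gram counts over the whole corpus and returns a
# score in [0, 1]; the autoencoder's 0.605 is on this scale.
print(corpus_bleu(references, hypotheses))
```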
Finally, in order to qualitatively interpret the output of a model, we convert the output ids back into tokens. We utilize the reverse-tokenization builtins from pytorch-pretrained-bert. We observe the autoencoder's outputs for training examples, test examples, handwritten caption-like examples, and non-caption-like examples.
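A minimal sketch of this reverse tokenization with the pytorch-pretrained-bert tokenizer (the sample ids are generated from a sentence rather than from the model, and the WordPiece re-joining is simplified):

```python
from pytorch_pretrained_bert import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-cased')

# Pretend these ids came out of the GRU decoder.
output_ids = tokenizer.convert_tokens_to_ids(
    tokenizer.tokenize("A brown dog is drinking from the bowl.")
)

# Reverse-tokenize: map ids back to WordPiece tokens, then re-join them.
tokens = tokenizer.convert_ids_to_tokens(output_ids)
reconstruction = " ".join(tokens).replace(" ##", "")
print(reconstruction)
```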
Figure 6: Selected examples of the BALM autoencoder's reconstructions.
We note that on examples from the Multi30k dataset (train and test), the autoencoder has perfect reconstructions. For manually written examples that are somewhat similar to captions, the autoencoder gets the rough syntax and semantically similar words. On examples that are not like captions (not declarative sentences), the autoencoder is only able to capture the sentiment of the sentence.

Overall, this is exactly what we'd expect of a properly trained autoencoder on the Multi30k dataset. We take the favorable learning curve, the high BLEU score, and the selected reconstructions as strong evidence that BERT creates good sentence embeddings and that the BALM autoencoder is able to capture the English thought-space.
We notice similar features in the translator's learning curve as compared to the autoencoder's.

• The convergence is extremely fast. The translator converges just a hair slower than the autoencoder.

• The convergence is extremely smooth. Again, the translator's loss is slightly noisier than the autoencoder's.

• The convergence of the learning curve is right at the Bayes error. Recall that random guessing over 28996 classes has an expected cross-entropy loss of at least 10.27. After epoch 140, the translator model has an average loss of 0.014.
We again use the BLEU score for evaluation – this time, we use it as it was intended: for bilingual translation. We find that the score is 0.248. Although this is not nearly as high as our autoencoder score and falls short of the state-of-the-art performance on Multi30k of 0.35, we take this BLEU score as a weak success. In general, MT practitioners consider a BLEU score above 15% to indicate that nontrivial learning is happening – and this is being achieved with a feedforward network!
Figure 7: Selected examples of the BALM translator's reconstructions.
We see that the translation model does relatively well on the training examples. In fact, it produces the strongest signal of natural language understanding: a correct translation that is synonymous with, but not identical to, the given translation. However, the model struggles a bit on the test set. There is clear n-gram similarity, but the model misses the key verb and returns a poorly formed sentence.

We first conclude that NLP practitioners are correct to assume that a simple mean pool over BERT's word embeddings serves as a rich sentence embedding. We take the extremely high performance of the BERT-driven sentence autoencoder – nearly zero cross-entropy loss, an extremely high BLEU score of 0.605, and impressive performance on out-of-dataset examples – as strong evidence that the English thought-space learned by BERT captures all salient features of English natural language (at least within the subdomain of image captioning). This work definitely contradicts Ray Mooney's famous exclamation at the Association for Computational Linguistics (ACL) 2014: "You can't cram the meaning of a whole %&!$# sentence into a single $&!#* vector!"
Second, the BALM algorithm does seem to converge faster than both seq2seq and transformer-based MT systems, although its final performance is not as strong. This bodes well for the initial premise of this research, which was to translate between pairs of languages that have very few parallel bilingual texts.
Consider the Haiti earthquake. Many organizations, such as the United Nations, UNICEF, and the Red Cross, immediately reached out to offer humanitarian aid; however, there are few translators between, say, English and Haitian Creole. Before this crisis, automated translation between frequently spoken languages and Haitian Creole did not exist because of a lack of parallel texts.

Microsoft's best efforts reflect positively on the company but poorly on the state of the art of NLP at the time. Their fast-tracked model involved crowdsourcing parallel texts from universities (e.g. Carnegie Mellon University), websites (e.g. haitisurf.com), and the online Haitian community. It took Microsoft Research's NLP group 5 days to support Haitian Creole on Bing's translator service.

There are roughly 4500 languages with more than 1000 speakers. Now consider a model that could translate between any of the $\binom{4500}{2} \approx 10$ million pairs of these languages. The value of this research is that by better utilizing more plentiful data resources – monolingual rather than bilingual texts – we open up the possibility of high-quality machine translation beyond the few pairs of frequently used languages.
This work has opened up many doors to follow-up analyses and use cases.

1. Interpretability
A major issue with neural methods in general is that they are considered opaque: as a semiparametric family of models, it is difficult to analyze what is happening in parameter space. [7] However, an MT algorithm under the BALM framework has two usable thought-spaces that can be used to visualize any mistakes. Suppose that a mistake occurs in translation due to the German encoder creating a faulty embedding; we can check this by using a German thought-decoder. Suppose that a mistake occurs in translation due to the English decoder reconstructing poorly; we can check this by using an English encoder on the poor reconstruction to see if we reproduce the output of the thought translator. And suppose that a mistake occurs in the thought translator; we can use the English decoder to pinpoint the source of the error. It would be interesting to see whether certain submodules are more susceptible to certain kinds of errors and whether it's possible to create algorithmic tools for identifying or preventing such mistakes.
2. Regularization and Learnability
We see that a shallow feedforward neural network can function as a reasonable thought translator. Would a more complex fixed-length model be able to raise the performance of BALM to SOTA? A deeper neural network would not be more expressive but could improve the computational learning properties of the system. Or perhaps the feedforward neural network is too complex and overfits on the data: we saw in the translator that it faithfully reproduced the training examples but struggled with test examples. Perhaps an extremely simple model like logistic regression would serve as a regularization scheme.

Other forms of regularization could include changing the optimizer (we used Adam by default); some works seem to demonstrate that SGD acts as a better regularizer because of its increased stochasticity. Perhaps combining SGD with BALM would reduce the amount of overfitting.
3. Transfer Learning on Different MT Tasks
We test here on only one dataset: Multi30k. We do not know if this algorithm will even generalize to other German-to-English translation tasks outside of image captioning.

Github repository: https://github.com/jeffreyscheng/senior-thesis-translation
Pre-trained BALM models: https://drive.google.com/drive/folders/1bt8hU24U_Uwn9j7gque2IjGzv4dphz3M?usp=sharing
References

[1] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate, 2014.

[2] Franck Burlot and François Yvon. Using monolingual data in neural machine translation: a systematic study. Proceedings of the Third Conference on Machine Translation, 2018.

[3] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding, 2018.

[4] Desmond Elliott, Stella Frank, Khalil Sima’an, and Lucia Specia. Multi30k: Multilingual English-German image descriptions. CoRR, abs/1605.00459, 2016.

[5] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.

[6] Ryan Kiros, Ruslan Salakhutdinov, and Richard S. Zemel. Unifying visual-semantic embeddings with multimodal neural language models. CoRR, abs/1411.2539, 2014.

[7] Philipp Koehn and Rebecca Knowles. Six challenges for neural machine translation. CoRR, abs/1706.03872, 2017.

[8] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics.

[10] Rico Sennrich, Barry Haddow, and Alexandra Birch. Improving neural machine translation models with monolingual data. CoRR, abs/1511.06709, 2015.

[11] Anuroop Sriram, Heewoo Jun, Sanjeev Satheesh, and Adam Coates. Cold fusion: Training seq2seq models together with language models.