Evolution of Transfer Learning in Natural Language Processing
Aditya Malte†, Dept. of Computer Engineering, Pune Institute of Computer Technology, Maharashtra, India.
Pratik Ratadiya†, Dept. of Computer Engineering, Pune Institute of Computer Technology, Maharashtra, India.
† Equal contribution of both the authors.
Abstract—In this paper, we present a study of the recent advancements which have brought transfer learning to NLP through the use of semi-supervised training. We discuss cutting-edge methods and architectures such as BERT, GPT, ELMo and ULMFit, among others. Classically, tasks in natural language processing have been performed through rule-based and statistical methodologies. However, owing to the vast nature of natural languages, these methods do not generalise well and fail to learn the nuances of language. Thus, machine learning algorithms such as Naive Bayes and decision trees, coupled with traditional representations such as Bag-of-Words and N-grams, were used to overcome this problem. Eventually, with the advent of advanced recurrent neural network architectures such as the LSTM, we were able to achieve state-of-the-art performance in several natural language processing tasks such as text classification and machine translation. We discuss how transfer learning has brought about the well-known ImageNet moment for NLP. Several advanced architectures such as the Transformer and its variants have allowed practitioners to leverage knowledge gained from unrelated tasks to drastically speed up convergence and provide better performance on the target task. This survey represents an effort at providing a succinct yet complete understanding of the recent advances in natural language processing using deep learning, with a special focus on detailing transfer learning and its potential advantages.
Index Terms—Natural language processing, transfer learning, self-attention, language modeling
I. INTRODUCTION
Natural language processing, the science of how to make computers effectively process natural text, has recently witnessed rapid advancements thanks to increased processing power, more data and better algorithms. It forms the heart of several use cases such as opinion mining, conversational agents and machine translation, among others. Traditionally, NLP tasks were achieved through rule-based systems, in essence, a set of manually crafted rules that determined the behaviour of the system. Examples include rule-based machine translation, where linguists iteratively framed new rules to make the translations more accurate. However, owing to the vast and heuristic nature of natural language, machine learning gained a stronger ground in performing NLP tasks. Machine learning models such as SVM, Naive Bayes and random forests found use in tasks such as sentiment analysis, spam detection and hate speech detection. On the other hand, natural language generation (NLG) tasks such as machine translation, question answering and abstractive summarization were achieved through models such as the Transformer and Seq2Seq architectures. Two important breakthroughs that provided significant impetus to the NLP and NLG domains were the arrival of transfer learning and rapid improvements in the performance of language models. We feel it necessary to discuss these concepts before moving further:
A. Language Modeling
Language modeling is an NLP task where the model has to predict the succeeding word given the previous words of the sequence as context. This task requires the language model (LM) to learn the nuances and inter-dependencies among the various words of the language. Standard benchmark datasets for the language modeling task include the WikiText dataset, the BookCorpus dataset [1] and the 1B Word benchmark [2]. Perplexity is generally used as a metric to evaluate the performance of language models. Perplexity is defined as follows:

PP(W) = \sqrt[N]{\prod_{i=1}^{N} \frac{1}{P(w_i \mid w_1, \ldots, w_{i-1})}}   (1)

Given a sequence of N words of the corpus, w_1 w_2 \ldots w_N, P(w_i \mid w_1, \ldots, w_{i-1}) is the probability assigned by the language model to word w_i given the preceding words of the sequence. A lower perplexity generally signals a better performing language model, as it indicates a lower entropy in the generated text. Language modeling is used for tasks such as next word prediction, text auto-completion and checking linguistic acceptability. The concept of semi-supervised learning in NLP allows us to understand why and how language modeling has played a fundamental role in various architectures that have allowed for transfer learning in NLP.
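To make the metric concrete, the following is a minimal sketch (our own illustration, not from the paper) of computing perplexity from the per-token probabilities assigned by some language model; the probability values used here are purely hypothetical.

```python
import math

def perplexity(token_probs):
    """Perplexity from the probabilities a language model assigns to each
    token of a sequence (each conditioned on its prefix).

    Perplexity = exp(-1/N * sum(log p_i)), which is equivalent to the
    N-th root of the product of inverse probabilities in Eq. (1).
    """
    n = len(token_probs)
    log_likelihood = sum(math.log(p) for p in token_probs)
    return math.exp(-log_likelihood / n)

# Hypothetical per-token probabilities for a 4-word sequence.
probs = [0.20, 0.05, 0.10, 0.30]
print(round(perplexity(probs), 2))  # lower is better
```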
B. Transfer Learning
Traditionally, NLP models were trained after random initialization of the model parameters (also called weights). Transfer learning, a technique where a neural network is fine-tuned on a specific task after being pre-trained on a general task, allowed deep learning models to converge faster and with relatively lower requirements of fine-tuning data. Historically, transfer learning has been mainly associated with the fine-tuning of deep neural networks trained on the ImageNet dataset [3] for other computer vision tasks. However, with recent advances in natural language processing, it has become possible to perform transfer learning in this domain as well. In this survey, we seek to discuss the recent strides made in transfer learning, language modeling and natural language generation through advancements in algorithms and techniques. Transfer learning can be used for applications where there is a lack of a large training set. The target dataset should ideally be related to the pre-training dataset for effective transfer learning. This type of training is generally referred to as semi-supervised training, where the neural network is first trained as a language model on a general dataset followed by supervised training on a labelled training dataset, thus establishing a dependence of supervised fine-tuning on unsupervised language modeling.
The paper is structured as follows: Section II elaborates on the various algorithms and architectures that have served as a base on which more advanced models have been built. Section III provides information about the Transformer architecture, which drastically improved the prospects of using transfer learning for NLP tasks. Section IV then goes on to discuss the evolution of language modeling and transfer learning through models such as BERT, ELMo, ULMFit and so on. We conclude our survey and suggest future improvements in Section V.
Fig. 1: Developments in Transfer Learning and Language Modeling
II. BACKGROUND
A. Vanilla RNNs
Machine learning models have been widely used for an array of supervised as well as unsupervised learning tasks such as regression, classification, clustering and recommendation modelling. Models such as the multi-layer perceptron, support vector machines and logistic regression, however, did not perform well in sequence modeling tasks such as text classification, language modeling and time-series forecasting. These models suffered from an inability to retain information throughout the sequence and treated each input independently. In essence, the lack of a memory element precluded these models from performing well on sequence modelling tasks. Recurrent Neural Networks [4], or RNNs, attempted to redress this shortcoming by introducing loops within the network, thus allowing the retention of information.
Fig. 2: Recurrent Neural Network [7]

h_t = \phi(W h_{t-1} + U x_t + b)   (2)

As shown in Fig. 2 and the corresponding equation, the current hidden state of the neuron can be modelled as a function of the hidden state of the previous time step h_{t-1}, the current input x_t, weight matrices U and W, and bias b. These weights of the network are then updated through a training algorithm called Backpropagation Through Time (BPTT) [5]. BPTT is, in essence, the backpropagation algorithm with some modifications. The network is propagated for each time step, an operation that is often referred to as unrolling the RNN. The parameters of the neural network remain the same throughout the unrolling operation of the RNN. Corresponding errors between the predicted output and the ground truth are then calculated for each time step. Gradients of the error with respect to all parameters are then calculated and accumulated using the backpropagation algorithm. It is only after the unrolling is complete that all the parameters of the RNN are updated using this accumulated error gradient.
Vanilla RNNs were successful in a wide variety of tasks including speech recognition, translation and language modelling. Despite their initial success, vanilla RNNs were only able to model short-term dependencies. They failed to model long-term dependencies, primarily because the information was often "forgotten" after the unit activations were multiplied several times by small numbers. Further, they suffered from various issues while training, such as the vanishing gradient problem (the error gradient used for weight updates reducing to very low values) and the exploding gradient problem. Thus, successfully training and applying vanilla RNNs was a challenging task.
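The recurrence of Eq. (2) can be written out directly; the following is a minimal NumPy sketch (not from the paper) of a vanilla RNN cell unrolled over a sequence, with tanh standing in for the activation φ and toy dimensions chosen for illustration.

```python
import numpy as np

def rnn_forward(x_seq, W, U, b, h0):
    """Unroll a vanilla RNN over a sequence.
    h_t = tanh(W @ h_{t-1} + U @ x_t + b)   (Eq. 2, with phi = tanh)
    """
    h = h0
    states = []
    for x_t in x_seq:
        h = np.tanh(W @ h + U @ x_t + b)
        states.append(h)
    return states

# Toy dimensions: 3-dimensional input, 4-dimensional hidden state.
rng = np.random.default_rng(0)
W = rng.normal(size=(4, 4)) * 0.1
U = rng.normal(size=(4, 3)) * 0.1
b = np.zeros(4)
x_seq = [rng.normal(size=3) for _ in range(5)]
states = rnn_forward(x_seq, W, U, b, h0=np.zeros(4))
print(len(states), states[-1].shape)  # 5 (4,)
```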
B. Long Short Term Memory
The problem of 'long-term' dependencies faced by earlier recurrent neural networks was solved by designing a special kind of RNN architecture called the LSTM (Long Short Term Memory) [6]. LSTMs were designed to keep track of information for an extended number of timesteps. They have an overall chain-based architecture similar to RNNs, but the crucial difference lies in the improvements in the internal node structure. While a node in an RNN consists of a single neural layer, there are four layers connected uniquely in LSTMs, as shown in the figure.
Fig. 3: Chain structure of LSTMs [7]
The key feature of LSTMs is the information carrier connection present at the top, known as the 'cell state'. It proves useful for carrying information along longer distances, with only minor linear operations taking place at each node. The ability to add or delete certain information of the cell state at each node is provided by structures called gates. An LSTM has three such gates, each comprising a sigmoid neural net layer and a pointwise multiplier. The 'forget gate' decides what information is to be retained in the cell state by using a sigmoid layer which outputs a value between 0 and 1 (0 indicates 'forget everything' while 1 indicates 'retain completely'). The 'input gate' layer is used to determine new information which is to be added to the cell state. It involves deciding which values are to be updated and the new candidate values to do so. The previous cell state values and the new candidate values are then combined to get the final new cell state. The output to be forwarded by the node is decided by combining the values of the current cell state with the results of the 'output gate'. The output is generally supporting information relevant to the previous word. Mathematically, the operations performed using the three gates can be expressed as:

f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)   (3)
i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)   (4)
\tilde{C}_t = \tanh(W_C \cdot [h_{t-1}, x_t] + b_C)   (5)
C_t = f_t * C_{t-1} + i_t * \tilde{C}_t   (6)
o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)   (7)
h_t = o_t * \tanh(C_t)   (8)

where f_t, i_t and o_t indicate the outputs of the forget gate, input gate and output gate respectively, W and b indicate the weights and biases, C_{t-1} indicates the previous cell state, \tilde{C}_t represents the new candidate values and C_t is the current cell state.
The advantage of using LSTMs is that they offer more control in a network than conventional recurrent networks. The system is more sophisticated and can retain information over longer timesteps. However, the added gates lead to more computation, and thus LSTMs tend to be slower.
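A minimal NumPy sketch (our illustration, not the authors' code) of a single LSTM step implementing Eqs. (3)-(8); the sigmoid helper and toy dimensions are assumptions made for the example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, Wf, Wi, Wc, Wo, bf, bi, bc, bo):
    """One LSTM step, Eqs. (3)-(8). Each W acts on [h_{t-1}; x_t]."""
    z = np.concatenate([h_prev, x_t])
    f_t = sigmoid(Wf @ z + bf)            # forget gate (3)
    i_t = sigmoid(Wi @ z + bi)            # input gate  (4)
    c_tilde = np.tanh(Wc @ z + bc)        # candidate values (5)
    c_t = f_t * c_prev + i_t * c_tilde    # new cell state (6)
    o_t = sigmoid(Wo @ z + bo)            # output gate (7)
    h_t = o_t * np.tanh(c_t)              # new hidden state (8)
    return h_t, c_t

# Toy sizes: input dim 3, hidden dim 4 (so [h; x] has dim 7).
rng = np.random.default_rng(0)
Ws = [rng.normal(size=(4, 7)) * 0.1 for _ in range(4)]
bs = [np.zeros(4) for _ in range(4)]
h, c = np.zeros(4), np.zeros(4)
h, c = lstm_step(rng.normal(size=3), h, c, *Ws, *bs)
print(h.shape, c.shape)  # (4,) (4,)
```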
C. Gated Recurrent Units
Gated Recurrent Units [8], or GRUs, introduced by Cho et al. in 2014, are a curtailed variation of LSTMs designed to reduce the computation issues of the latter. The forget and input gates of the LSTM are combined into a single 'update gate'. The cell state and hidden state are also merged together, with a 'reset gate' controlling how much of the previous state contributes to the new candidate. The operations now performed are as follows:

z_t = \sigma(W_z \cdot [h_{t-1}, x_t])   (9)
r_t = \sigma(W_r \cdot [h_{t-1}, x_t])   (10)
\tilde{h}_t = \tanh(W \cdot [r_t * h_{t-1}, x_t])   (11)
h_t = (1 - z_t) * h_{t-1} + z_t * \tilde{h}_t   (12)

Fig. 4: Internal structure of GRUs [7]
GRUs have the advantage of being able to control the flow of information without having an explicit memory unit, unlike LSTMs. A GRU exposes the hidden content of the node without any control. The performance is almost on par with LSTMs but with more efficient computation. However, with large data, LSTMs with higher expressiveness may lead to better results.
D. Average SGD Weight Dropped (AWD)-LSTM
AWD-LSTM [9], despite its relatively simple 3-layer LSTM architecture, was proven to be highly effective for language modeling tasks. It employed the DropConnect algorithm to mediate the problem of overfitting that had been inherent in the RNN architecture. Besides, the authors used the Non-monotonically Triggered ASGD (NT-ASGD) algorithm to optimize the network.
DropConnect Algorithm
Neural networks, being prone to overfitting, traditionally utilised Dropout as regularization. Dropout, an algorithm that randomly (with a probability p) ignores unit activations during the training phase, allows for the regularization of a neural network. By diminishing the probability of neurons developing inter-dependencies, it increases the individual power of a neuron and thus reduces overfitting. However, dropout has not been able to provide commensurate results in the case of RNN architectures. In essence, it inhibits the RNN's capability of developing long-term dependencies, as there is a loss of information caused by randomly ignoring unit activations. To this end, the DropConnect algorithm randomly drops weights instead of neuron activations. It does so by randomly (with probability 1-p) setting weights of the neural network to zero during the training phase, thus redressing the issue of information loss in the recurrent neural network while still performing regularization.
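As an illustration (not the authors' implementation), the following NumPy sketch contrasts dropout, which zeroes activations, with DropConnect, which zeroes individual weights before the matrix multiply; the keep probability p and the layer sizes are hypothetical values.

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(activations, p=0.5):
    """Standard dropout: randomly zero unit activations with prob 1-p."""
    mask = rng.random(activations.shape) < p
    return activations * mask / p  # inverted-dropout scaling

def dropconnect_linear(x, W, p=0.5):
    """DropConnect: randomly zero individual weights with prob 1-p,
    then apply the (sparsified) weight matrix."""
    mask = rng.random(W.shape) < p
    return (W * mask) @ x

x = rng.normal(size=8)
W = rng.normal(size=(4, 8))
print(dropout(W @ x).shape)            # (4,) - activations dropped
print(dropconnect_linear(x, W).shape)  # (4,) - weights dropped
```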
Non-monotonically Triggered ASGD (NT-ASGD)
Stochastic gradient descent has been demonstrated to offer good performance for language modeling tasks through saddle point avoidance and linear convergence. Thus, the authors go on to investigate a variant of SGD: averaged SGD. Averaged SGD, almost identical to vanilla SGD, differs in the fact that an averaging of the weights (which are cached) is performed after a threshold number of iterations T has passed. While theoretically able to control the effects of noise, averaged SGD has found little use while training neural networks. This has been mainly attributed by the authors to ambiguous guidelines regarding the tuning of its hyperparameters: the learning rate scheduler and the averaging trigger. A commonly used strategy with the SGD optimizer is to reduce the learning rate by a fixed quantity when the validation error worsens or fails to improve. Similarly, one may perform the averaging operation after the validation error worsens. The Non-monotonically Triggered ASGD employs a similar technique. It differs in the fact that, instead of performing averaging when the validation error worsens, NT-ASGD performs the averaging operation if the validation error fails to improve. NT-ASGD introduces two new hyperparameters: the logging interval L and the non-monotone interval n. The authors found that keeping n = 5 provided good performance in general. Better results were achieved compared to SGD while training their model.
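The triggering logic can be summarised in a few lines. The following is a schematic sketch under our own simplifying assumptions (a scalar parameter, a toy quadratic loss, logging every step, and n = 5 as quoted above); it is not the authors' implementation.

```python
import random

def nt_asgd(w, grad, val_loss, lr=0.1, n=5, max_steps=200):
    """Schematic NT-ASGD: plain SGD until the validation loss has not
    improved over the last n logged values, then average the iterates."""
    logs, avg, k = [], None, 0
    for t in range(max_steps):
        w = w - lr * grad(w)                       # one SGD update
        if avg is not None:                        # averaging triggered
            k += 1
            avg = avg + (w - avg) / k              # running mean of iterates
        loss = val_loss(w)
        if avg is None and len(logs) >= n and loss > min(logs[-n:]):
            avg, k = w, 1                          # non-monotonic trigger
        logs.append(loss)
    return avg if avg is not None else w

# Toy demo: minimise (w - 3)^2 with noisy gradients.
g = lambda w: 2 * (w - 3) + random.gauss(0, 0.5)
f = lambda w: (w - 3) ** 2
print(round(nt_asgd(0.0, g, f), 2))  # close to 3.0
```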
E. Seq2Seq Architecture
We take the example of neural machine translation to explain the working of the attention mechanism and the advantages it provides. The Seq2Seq architecture [10] has been used to perform a wide variety of tasks including neural machine translation (NMT), abstractive summarization and chatbot systems. The traditional Seq2Seq architecture consists of an encoder RNN (LSTM/GRU) followed by a decoder RNN. The encoder encodes the given sequence into a fixed-length vector. The decoder generates the output sequence after taking the fixed-length vector as its source hidden state. While giving significant improvements in the domain of neural machine translation, encoding the context of complex and long sequences into a single vector impeded the performance of the network. This was because a fixed-length vector was often incapable of effectively encoding the context of the given sequence. Consequently, this led to the birth of the attention mechanism, a technique that allowed the neural network to identify which input tokens are relevant to a corresponding target token in the output.
F. Attention Mechanism
Instead of encoding a single vector to represent the sequence, the attention mechanism [11] computes a context vector over all tokens in the input sequence for each token in the output. The decoder computes a relevancy score for all tokens on the input side. These scores are then normalized by performing a softmax operation to obtain the attention weights. These weights are then used to perform a weighted sum of the encoder's hidden states, thus obtaining the context vector c_t.

\alpha_{ts} = \frac{\exp(score(h_t, h_s))}{\sum_{s'=1}^{S} \exp(score(h_t, h_{s'}))}   (13)
c_t = \sum_s \alpha_{ts} h_s   (14)
a_t = f(c_t, h_t) = \tanh(W_c [c_t; h_t])   (15)

A hyperbolic tangent operation is performed on the concatenation of the context vector and the target hidden state to get the attention vector a_t. This attention vector generally provides a better representation of the sequence than traditional fixed-size vector methods. By identifying the relevant input tokens while generating the output token, the attention mechanism is able to redress the problem of compressing the context of the text into a fixed-size vector. Using this mechanism, Bahdanau et al. [11] were able to achieve state-of-the-art performance in machine translation tasks.
III. THE TRANSFORMER ARCHITECTURE
Owing to the significant improvements gained due to the attention mechanism, Vaswani et al. [12] proposed the Transformer architecture. The Transformer achieved new state-of-the-art results in various tasks such as machine translation, entailment and so on. As shown in Fig. 5, the Transformer consists of an encoder and a decoder.
Fig. 5: Architecture of the Transformer
Furthermore, the encoder consists of a Multi-Head Attention layer, residual connections, a normalization layer and a generic feed-forward layer. The decoder is almost identical to the encoder but contains a "Masked" Multi-Head Attention layer.
Encoder: The encoder takes as input the input embedding added with the positional encoding. The positional encoding allows for the retention of position and order related information. The authors employ the following equations to compute the positional encoding:
PE(pos, 2i) = \sin(pos / 10000^{2i/d_{model}})   (16)
PE(pos, 2i+1) = \cos(pos / 10000^{2i/d_{model}})   (17)

where pos is the position of the word in the sequence and i is the dimension. This positional encoding is added to the input embedding. It is then followed by a residual connection R, defined as follows:

R(x) = LayerNorm(x + MultiHeadAttention(x))   (18)

where x is the value of the input embedding added with the positional encoding. Thus we are effectively able to encode the semantic and position-related information using the input and positional encodings.
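A compact NumPy sketch (our illustration, using the standard sinusoidal formulation of Eqs. (16)-(17)) of building the positional encoding matrix; the sequence length and model dimension are arbitrary toy values.

```python
import numpy as np

def positional_encoding(max_len, d_model):
    """Sinusoidal positional encodings, Eqs. (16)-(17):
    even dimensions use sine, odd dimensions use cosine."""
    pos = np.arange(max_len)[:, None]                  # (max_len, 1)
    i = np.arange(d_model // 2)[None, :]               # (1, d_model/2)
    angles = pos / np.power(10000, 2 * i / d_model)    # (max_len, d_model/2)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = positional_encoding(max_len=50, d_model=16)
print(pe.shape)  # (50, 16) - added to the input embeddings
```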
Decoder: As previously stated, the decoder is almost identical to the encoder but for the "Masked" Multi-Headed Attention layer.
Scaled Dot-Product Attention:
While generating embeddings for each word in the input sequence, the authors made use of the self-attention mechanism. Self-attention, similar to vanilla attention, allows the Transformer to identify words in the input sequence that are relevant to the current token. Specifically, the authors made use of Scaled Dot-Product Attention.
Attention(Q, K, V) = softmax\left(\frac{QK^T}{\sqrt{d_k}}\right) V   (19)

As shown above, Scaled Dot-Product Attention takes three vectors as input: the query, key and value. In effect, we perform a weighted average of the value vector V, where the weights are assigned using a "compatibility function" of the query with the corresponding key.
• Embedding E is the output value of the previous hidden layer
• Query, Q = EW_q
• Key, K = EW_k
• Value, V = EW_v
where W_q, W_k and W_v are weight matrices. A dot product of the query vector Q and key vector K is performed with a scaling factor 1/\sqrt{d_k}. The scaling factor is used to avoid the softmax input falling in a range where its gradients become negligible. A softmax operation is performed on the scaled dot product, and the resulting weights are multiplied with the value vector through a matrix multiplication operation to obtain the final attention output.
Fig. 6: Multi-Headed and Scaled Dot-Product Attention
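The following is a small NumPy sketch (our own, not from the paper) of Eq. (19), using the Q, K and V projections described above; the dimensions and weight matrices are toy placeholders.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Eq. (19): softmax(QK^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # compatibility of queries with keys
    weights = softmax(scores, axis=-1)  # attention weights per query
    return weights @ V                  # weighted average of the values

# Toy example: 4 tokens, d_k = d_v = 8; E plays the role of hidden states.
rng = np.random.default_rng(0)
E = rng.normal(size=(4, 8))
Wq, Wk, Wv = (rng.normal(size=(8, 8)) * 0.1 for _ in range(3))
out = scaled_dot_product_attention(E @ Wq, E @ Wk, E @ Wv)
print(out.shape)  # (4, 8)
```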
Multi-Headed Attention: Additionally, to reduce the number of operations needed to compute attention scores, the authors make use of Multi-Headed Attention. Multi-Head Attention splits the vector space into 'n' parts. These divisions are then passed to 'n' attention heads to perform the self-attention operation, and the results of these operations are then concatenated. In addition to reducing the number of operations, Multi-Headed Attention allows the model to "jointly attend to information from different representation subspaces at different positions".
Masked Multi-Headed Attention:
Masked Multi-Headed Attention is similar to Multi-Headed Attention but performs an additional "masking" operation: while computing self-attention over the output embeddings, the decoder is allowed to attend only to the previous positions (a small sketch of this masking is given at the end of this section). Without masking, the Transformer would be able to attend to subsequent positions and consequently to the output prediction in the sequence. To prevent this undesirable phenomenon, the attention scores of all subsequent positions are set to −∞ before computing the self-attention that is passed to higher layers. Finally, a softmax layer is used to compute the output word probabilities. The word probabilities are in the form of a vector whose size equals the size of the vocabulary of the training set. Based on the attention mechanism and without using any recurrence mechanism, the Transformer effectively supplanted the recurrent neural network architectures (LSTM, GRU, etc.) as the state-of-the-art in several NLP tasks. It is used for a wide variety of tasks including machine translation and constituency parsing.
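To illustrate the masking step described above (our sketch, not the reference implementation), scores at subsequent positions are set to −∞ before the softmax so that they receive zero weight.

```python
import numpy as np

def causal_mask(scores):
    """Set attention scores for subsequent (future) positions to -inf,
    so that after the softmax each position only attends to itself and
    earlier positions."""
    n = scores.shape[-1]
    mask = np.triu(np.ones((n, n), dtype=bool), k=1)  # strictly upper triangle
    masked = scores.copy()
    masked[mask] = -np.inf
    return masked

scores = np.arange(16, dtype=float).reshape(4, 4)
print(causal_mask(scores))
# Row i keeps columns 0..i; columns i+1.. are -inf and vanish after softmax.
```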
IV. EVOLUTION OF TRANSFER LEARNING AND LANGUAGE MODELING
A. ULMFit
Universal Language Model Fine-tuning (ULMFit) [13], a method to fine-tune a pre-trained language model, was one of the forerunners of inductive transfer learning in NLP. By definition, the language modeling task entails that, given a sequence of tokens, the model has to predict the likelihood of the next token based on the sequence. The ULMFit method achieved good results by using the then state-of-the-art AWD-LSTM for its experiments. The proposed network was a simple 3-layer neural network without any attention mechanism, skip connections, etc. ULMFit allowed for transfer learning by employing the following three steps in the stated order:
1) Generic Pretraining of the Language Model: The authors pre-trained their language model on WikiText-103, a large general-purpose dataset that consists of 28,595 preprocessed articles and 103 million words. This step allowed the language model to capture the general properties of the given language. Also, while computationally expensive, this task needs to be performed only once.
2) Fine-tuning the Language Model on the Target Task: This step allowed the language model to capture the inherent nuances of the target task, thus allowing better performance. Furthermore, this step can be performed on a relatively smaller dataset, thus requiring relatively less computational power. The authors proposed two novel methods, Discriminative Fine-tuning and Slanted Triangular Learning Rates, to perform this task.
Discriminative fine-tuning: Different layers of the model extract different features from the text. Thus, the use of different learning rates for different layers seemed apt. To this end, the authors formulated Discriminative Fine-Tuning, a variation of the SGD optimizer, as follows:

\theta_t^l = \theta_{t-1}^l - \eta^l \cdot \nabla_{\theta^l} J(\theta)   (20)

where \theta is the weight, t is the iteration, l is the layer, \nabla_{\theta^l} J(\theta) is the error gradient, \eta^l is the learning rate of layer l and J indicates the error function. The authors found that choosing a learning rate for the last layer and then setting the learning rate of lower layers by the relation \eta^{l-1} = \eta^l / 2.6 worked well.
Slanted Triangular Learning Rates: The Slanted Triangular Learning Rate is motivated by the authors as follows: by using a high initial learning rate, the model quickly converges to an appropriate region in the hyperspace. The learning rate is then linearly decayed in order to refine the parameters on the target task at a finer rate. The authors defined three new equations:

cut = \lfloor T \cdot cut\_frac \rfloor
p = t / cut,  if t < cut
p = 1 - \frac{t - cut}{cut \cdot (1/cut\_frac - 1)},  otherwise
\eta_t = \eta_{max} \cdot \frac{1 + p \cdot (ratio - 1)}{ratio}

where T is the total number of iterations, cut_frac is the fraction of iterations for which the LR is increased, and p is the fraction of the way through the increasing or decreasing phase. Additionally, ratio specifies how much smaller the lowest learning rate is compared to \eta_{max}, and cut is the cutoff iteration where the schedule switches from an increasing to a decreasing LR. A short sketch of this schedule is given below.
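The following is a minimal sketch (our illustration, not the authors' code) of the slanted triangular learning rate schedule; the hyperparameter values here (cut_frac = 0.1, ratio = 32, eta_max = 0.01) are assumptions chosen for the example.

```python
import math

def slanted_triangular_lr(t, T, eta_max=0.01, cut_frac=0.1, ratio=32):
    """Slanted triangular LR: linear warm-up for cut_frac of training,
    then a longer linear decay down to eta_max / ratio."""
    cut = math.floor(T * cut_frac)
    if t < cut:
        p = t / cut
    else:
        p = 1 - (t - cut) / (cut * (1 / cut_frac - 1))
    return eta_max * (1 + p * (ratio - 1)) / ratio

T = 1000
for t in (0, 50, 100, 500, 999):
    print(t, round(slanted_triangular_lr(t, T), 5))
# LR rises quickly to eta_max at t = cut, then decays linearly.
```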
3) Fine-Tuning the Classifier on the Target Task: To perform task-specific classification, two linear blocks initialized from scratch are added to the language model. The authors follow standard practices such as batch normalization and dropout to perform regularization. Besides, the ReLU activation function is used, similar to the practice in computer vision models.
Concat Pooling: To preserve information that may be contained in only a few words of the document, the input provided to this classifier is a concatenation of the last hidden state and the max-pooled and average-pooled outputs of the previous hidden states (a short sketch is given at the end of this subsection). For this purpose, the authors concatenate as many hidden states as would fit in GPU memory. The concatenation h_c is as given below:

h_c = concat(h_T, maxpool(H), averagepool(H))   (21)

where H = \{h_1, h_2, \ldots, h_T\}.
Gradual Unfreezing: Keeping all parameters trainable, i.e. updating all parameters during training, would lead to a rapid loss of the information learnt during the pre-training phase. To tackle this, the authors gradually "unfreeze" the layers. In essence, they start by unfreezing the last layer and then perform fine-tuning. They repeat this process until all layers of the AWD-LSTM have been fine-tuned.
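A small NumPy sketch (our own) of the concat-pooled representation of Eq. (21); H here is a toy matrix of hidden states, one row per time step.

```python
import numpy as np

def concat_pooling(H):
    """Eq. (21): concatenate the last hidden state with the max-pooled
    and mean-pooled hidden states over all time steps."""
    h_last = H[-1]             # last hidden state h_T
    h_max = H.max(axis=0)      # element-wise max over time
    h_mean = H.mean(axis=0)    # element-wise mean over time
    return np.concatenate([h_last, h_max, h_mean])

H = np.random.default_rng(0).normal(size=(7, 4))  # 7 time steps, dim 4
print(concat_pooling(H).shape)  # (12,) -> fed to the classifier head
```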
Bidirectional LM: The authors train both a forward LM as well as a backward LM, and average the predictions given by the two language models.
One can apply transfer learning using ULMFit by using pre-trained models trained on datasets such as WikiText-103. Fine-tuning is then performed by training the network on the target task using supervised learning.
B. Embeddings from Language Models (ELMo)
Traditional word embeddings involve assigning a unique vector to each word. These word embeddings are fixed and independent of the context in which the words are being used. Peters et al. came up with a new word representation called "Embeddings from Language Models (ELMo)" [14], in which the representation assigned to each word is a function of the entire sentence of which the word is a part. These embeddings are obtained from the internal layers of a deep bidirectional LSTM that is trained with a coupled language model objective (biLM) on a large text corpus. These representations are more elaborate as they depend on all of the internal layers of the biLM. The word representations are computed on top of two-layer biLMs with character convolutions, as a linear function of the internal network states. For a given set of tokens, the biLM is trained by maximizing the log likelihood of each token under both the forward LM (conditioned on the previous words) and the backward LM (conditioned on the future words):

\sum_{k=1}^{N} \left( \log p(t_k \mid t_1, \ldots, t_{k-1}; \Theta_x, \overrightarrow{\Theta}_{LSTM}, \Theta_s) + \log p(t_k \mid t_{k+1}, \ldots, t_N; \Theta_x, \overleftarrow{\Theta}_{LSTM}, \Theta_s) \right)   (22)

where \Theta_x and \Theta_s indicate the parameters for the token representations and the softmax layer respectively. The ELMo model uses the intermediate layer representations of the biLM. ELMo combines all the layers of the biLM representation into a single vector ELMo_k to be used later in the fine-tuning task. Generally, for a given task, the embedding is a weighted combination of all biLM layers (a short sketch of this mixing is given at the end of this subsection):

ELMo_k^{task} = \gamma^{task} \sum_{j=0}^{L} s_j^{task} h_{k,j}^{LM}   (23)

where s^{task} are the weights and \gamma^{task} is a scalar quantity. s^{task} is obtained by normalizing the weights, i.e. passing them through a softmax layer. \gamma^{task}, on the other hand, allows us to scale the ELMo vector and plays an important role in the optimization process. The layer representation h_{k,j}^{LM} = [\overrightarrow{h}_{k,j}^{LM}; \overleftarrow{h}_{k,j}^{LM}] for each biLSTM layer, viz. a combination of the forward and backward context representations. Such a deep representation helps ELMo capture two important factors: (1) complex characteristics like the syntax and semantics of the words being used, and (2) their variation across linguistic contexts. For any supervised task, the weights of the biLM are frozen. ELMo_k is then concatenated with the task input, and the obtained representations are forwarded to the task-specific architecture. A pretrained biLM can be used to obtain representations for any task and, with fine-tuning, it has shown a decrease in perplexity, thereby benefiting from transfer learning. The addition of ELMo embeddings to existing models has helped them process more useful information from the sentences and thus enhanced performance in many applications like question answering, semantic role labelling, named entity extraction and many more. ELMo provides an enhancement over traditional word embedding techniques by incorporating context while generating the embeddings. The vector generated by ELMo can be used for a general downstream task. This is sometimes done by passing the generated word embeddings to another neural network (e.g. an LSTM) which is then trained on the target task. Furthermore, concatenation of ELMo embeddings with other word embeddings is also done to provide better performance.
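A tiny NumPy sketch (our own illustration) of the task-specific layer mixing in Eq. (23): softmax-normalised weights over the biLM layers, followed by a scalar scale; the layer activations here are random placeholders.

```python
import numpy as np

def elmo_embedding(layer_reps, s_raw, gamma):
    """Eq. (23): weight the biLM layer representations for one token with
    softmax-normalised weights s and scale the result by gamma."""
    s = np.exp(s_raw - s_raw.max())
    s = s / s.sum()                         # softmax over layers
    return gamma * np.einsum("j,jd->d", s, layer_reps)

L, dim = 3, 6                               # e.g. 2 biLSTM layers + token layer
layer_reps = np.random.default_rng(0).normal(size=(L, dim))
print(elmo_embedding(layer_reps, s_raw=np.zeros(L), gamma=1.0).shape)  # (6,)
```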
C. OpenAI Transformer
Although large text corpora are available, labelled data is tough to find and manually labelling the data is an equally tedious task. Radford et al. at OpenAI proposed a semi-supervised approach called Generative Pre-Training (GPT) [15], which involves unsupervised pre-training of the model followed by task-specific supervised fine-tuning for language understanding tasks. The Transformer is used as the base model for this purpose. The unsupervised learning helps to set the initial parameters of the model based on a language modeling objective. The subsequent supervised learning helps the parameters adjust to the target task. Initially, a multi-layer Transformer decoder is used to produce an output distribution over the target tokens based on a multi-headed self-attention mechanism:

h_0 = U W_c + W_p   (24)
h_l = transformer\_block(h_{l-1}) \quad \forall l \in [1, n]   (25)
P(u) = softmax(h_n W_c^T)   (26)

where h_l is the l-th Transformer layer's activations, W_c is the token embedding matrix and W_p is the position embedding matrix. The supervised learning task then takes the final Transformer block activations h_l^m, which are passed through a softmax layer to predict the output label y:

P(y \mid x^1, \ldots, x^m) = softmax(h_l^m W_y)   (27)

in order to maximize

L(C) = \sum_{(x, y)} \log P(y \mid x^1, \ldots, x^m)   (28)

where C indicates the labeled dataset, with each sequence consisting of tokens x^1, x^2, \ldots, x^m. During transfer learning, the input is converted into a single contiguous sequence of tokens so as to fit the pre-trained model. The OpenAI Transformer builds on generative pre-training to improve performance on downstream tasks by providing a better start than random initialization. A single model is able to produce quality results with minimal task-specific customization or hyperparameter tuning, thereby showing its robustness. This model was able to outperform other approaches on tasks like natural language inference, question answering and sentence similarity.
D. Bidirectional Encoder Representations from Transformers (BERT)
BERT [16], proposed by J. Devlin et al., is a novel approach to incorporate bidirectionality in a single Transformer model. This is a particularly challenging task: direct approaches to incorporating bidirectionality in Transformer models fail, since direct bidirectional conditioning would allow each word to see itself through the context from multiple layers, thereby ruling out the possibility of using the model as a language model. In essence, it was traditionally only possible to train a unidirectional encoder, a left-to-right or a right-to-left model. However, bidirectional models that could see the complete sequence context would inherently be more powerful than unidirectional models or a concatenation of two unidirectional (left-to-right and right-to-left) models. To this end, the authors trained their model on two unsupervised prediction tasks:
Masked LM
To overcome the challenges posed while applying bidirectionality in Transformers, J. Devlin et al. proposed masking random tokens in the sequence. The Transformer was trained such that it had to predict only the words that had been masked, while being able to view the whole sequence. WordPiece tokenization is used to generate the sequence of tokens, where rare words are split into sub-tokens. Masking of 15% of the WordPiece tokens is performed. Masking essentially replaces the selected words with [MASK] tokens. However, instead of always replacing the selected words with a [MASK] token, the data generator employs the following approach:
• Replace the word with the [MASK] token 80% of the time
• Replace the word with another random word 10% of the time
• Keep the word as it is 10% of the time
Performing prediction on only 15% of all words, instead of on all words, would entail that BERT is much slower to converge. However, BERT showed immediate improvements in absolute accuracy while converging at only a slightly slower pace than traditional unidirectional left-to-right models.
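A simplified sketch (our own, not the BERT reference implementation) of the 80/10/10 masking decision applied to roughly 15% of the tokens; the toy vocabulary and tokens are placeholders.

```python
import random

random.seed(0)
VOCAB = ["the", "cat", "sat", "on", "mat", "dog", "ran"]  # toy vocabulary

def mask_tokens(tokens, mask_prob=0.15):
    """Select ~15% of positions; for each selected position use [MASK] 80%
    of the time, a random word 10%, and the original word 10%."""
    out, targets = list(tokens), {}
    for i, tok in enumerate(tokens):
        if random.random() >= mask_prob:
            continue
        targets[i] = tok                     # model must predict the original
        r = random.random()
        if r < 0.8:
            out[i] = "[MASK]"
        elif r < 0.9:
            out[i] = random.choice(VOCAB)    # random replacement
        # else: keep the original token unchanged
    return out, targets

masked, targets = mask_tokens(["the", "cat", "sat", "on", "the", "mat"])
print(masked, targets)
```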
Next Sentence Prediction
This task entails predicting whether the first sequence provided immediately precedes the second. It allows the Transformer to perform better on several downstream tasks, such as question answering and natural language inference, that involve understanding the relationship between two input sequences. The dataset used for training had a balanced 50/50 distribution, created as follows: an actual pair of neighbouring sentences was chosen for positive examples, and the second sentence was chosen at random for negative examples. The input sequence for this pair classification task is generated as: [CLS] <Sentence A> [SEP] <Sentence B> [SEP], where sentences A and B are the two sentences after performing the masking operations. The [CLS] token is the first token, used to obtain a fixed vector representation that is consequently used for classification, and [SEP] is used to separate the two input sequences. The authors were able to achieve an impressive accuracy of 97-98% on the next sentence prediction task.
Pre-Training Procedure
The authors used the BooksCorpus and the English Wikipedia as pre-training data. They used two variations of BERT, BERT-BASE (12 layers) and BERT-LARGE (24 layers), that primarily differ in their depth. The maximum length of the input sequence is restricted to 512 tokens; all subsequent tokens in the sequence are neglected. A dropout value of 0.1 is used as regularization. Furthermore, the authors made use of GELU instead of ReLU as the activation function; GELU (Gaussian Error Linear Units) has been shown to provide improvements compared to ReLU and ELU. Training of the models was performed on TPUs: specifically, BERT-BASE was trained on 16 TPU chips for 4 days, and BERT-LARGE was trained on 64 TPU chips, also for 4 days.
Fine-Tuning Procedure
The pre-trained BERT can be fine-tuned on a relatively small dataset and requires less processing power. BERT was able to improve upon the previous state-of-the-art in several tasks involving natural language inference, question answering, semantic similarity and linguistic acceptability, among other tasks. The pattern of the input and output sequence varies depending on the type of the task. The tasks can be broadly divided into four categories:
• Single Sentence Classification Tasks: These tasks are performed by adding layers on the classification embedding [CLS] and passing the input sequence preceded by the [CLS] token.
• Sentence Pair Classification Tasks: The two sentences are passed to BERT after being separated by the [SEP] token. Classification can be performed by adding layers on the [CLS] representation.
• Question Answering Tasks
• Single Sentence Tagging Tasks
Subsequently, two multilingual BERT models, uncased and cased, covering over 102 languages were released. Furthermore, OpenAI released GPT-2 [17], essentially the GPT architecture trained as a language model on a very large amount of data.
E. Universal Sentence Encoder
The amount of supervised training data available for language processing tasks is limited. The use of pre-trained word embeddings has proved to be useful in this case, as they perform a limited amount of transfer learning. Cer et al. [18] proposed a new approach which involves directly encoding sentences, instead of words, into vectors. The sentence-encoded vectors are found to require minimal task-specific data to produce good results. The encoder models are available in two architectures, taking into consideration the two principal concerns of training transfer learning models, viz. complexity and accuracy.
1) Transformer-based architecture:
The first model makes use of the Transformer architecture to construct sentence embeddings. The encoding subgraph of the Transformer architecture is used for this purpose. The attention mechanism is used to find context-aware word representations, which are then converted into a fixed-length sentence encoding vector. The input to the Transformer model is a lowercased Penn Treebank 3 (PTB) [19] tokenized string, and a 512-dimensional sentence embedding is produced as the output. A single encoding model is trained over multiple tasks to make it as general as possible. This model achieves superior accuracy over the other architecture, but at the expense of increased computation requirements and complexity.
2) Deep Averaging Network (DAN) architecture:
In this model, the input embeddings for words and bi-grams are averaged and then passed through a deep neural network. The input and output formats are the same as those of the Transformer encoder. Multi-task learning, similar to that of the Transformer encoder model, is used for training. The main advantage of this model is that it performs the required operations in linear time. The main difference between the Transformer and DAN encoder models is their time complexity (O(n^2) and O(n) respectively, in the sentence length n). The memory requirement of the Transformer model increases quickly with increasing sentence length, while that of the DAN model remains constant. The trade-off between complexity and accuracy should be noted when deciding on a particular architecture for a given task. The unsupervised training data used in both cases included Wikipedia, web news and discussion forums. Augmentation is performed by training on supervised data from the Stanford Natural Language Inference (SNLI) corpus, which improved the performance further. The Universal Sentence Encoder can be used for a variety of transfer tasks including sentiment analysis, sentence classification, text similarity, etc. For determining the pairwise semantic similarity between two sentences, the similarity of the sentence embeddings produced by the encoder can be calculated and converted into an angular distance to get the final result:

sim(u, v) = 1 - \frac{\arccos\left(\frac{u \cdot v}{\lVert u \rVert \, \lVert v \rVert}\right)}{\pi}   (29)

The sentence embeddings outperform word embeddings on the aforementioned tasks. However, combining word and sentence embeddings for transfer learning produced the best overall results. The Universal Sentence Encoder assists the most when limited training data is available for the transfer task. It can be used for downstream tasks by passing the generated embedding to a classifier such as an SVM or another deep neural network.
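A short NumPy sketch (our own) of the angular similarity in Eq. (29), applied to two hypothetical sentence embedding vectors.

```python
import numpy as np

def angular_similarity(u, v):
    """Eq. (29): 1 - arccos(cosine_similarity) / pi, in [0, 1]."""
    cos = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    cos = np.clip(cos, -1.0, 1.0)  # guard against floating-point drift
    return 1.0 - np.arccos(cos) / np.pi

u = np.array([0.1, 0.7, 0.2])   # hypothetical sentence embeddings
v = np.array([0.2, 0.6, 0.1])
print(round(angular_similarity(u, v), 3))  # closer to 1 = more similar
```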
F. Transformer-XL
The Transformer-XL [20] was able to model very long-range dependencies. It did so by overcoming one limitation of the vanilla Transformer: fixed-length context. Vanilla Transformers were incapable of accommodating very long sequences owing to this limitation. Hence, practitioners resorted to alternatives such as splitting the corpus into segments which could be managed by the Transformer. This led to a loss of context among individual segments despite them being part of the same corpus. The Transformer-XL, however, was able to take the entire large corpus as input, thus preserving this contextual information. In essence a vanilla Transformer, it relied on two novel techniques, a recurrence mechanism and relative positional encoding, to provide the improvement.
1) Recurrence mechanism:
Instead of training the Transformer on individual segments of the corpus (without regard to previous context), the authors propose caching the hidden states computed from the previous segments. Consequently, the model computes self-attention (and other operations) on the current hidden/input states as well as these cached hidden states. The number of states cached during training is limited due to the memory limitations of the GPU. However, during inference, the authors can increase the number of cached hidden states used to model long-term dependencies.
2) Relative Positional Encoding:
An inherent problem with using the said recurrence mechanism is preserving relative positional information while reusing cached states. The authors overcame this problem by incorporating relative positional encoding in the attention mechanism (instead of in the hidden states) of the Transformer. They do so by encoding the positional information dynamically in the attention scores themselves: the distance between the key vectors and the query vector is the temporal information that is encoded in the attention scores. In essence, computing attention is facilitated because the temporal distance remains available to the model while previous states are preserved. Furthermore, information regarding the absolute position can be recovered recursively. The Transformer-XL was able to surpass the state-of-the-art results for language modelling tasks on the enwiki8 and text8 datasets.
G. XLNet
The BERT model, proposed by Devlin et al., was an Auto-Encoding (AE) model that suffered from the following problems:
• The use of [MASK] tokens during the pre-training phase led to a discrepancy, as these tokens were absent during the fine-tuning phase.
• The model neglected inherent dependencies among two or more [MASK] tokens, thus leading to sub-optimal performance.
A new model, XLNet [21], was able to overcome these difficulties by using a modification of generalized autoregressive pretraining.
Generalized Autoregressive Pretraining Phase
Instead of using unidirectional language modeling or a bidirectional masked language model to predict tokens, the paper proposed passing all permutations of a given sequence to the model and predicting a particular token missing from this sequence. Despite the random re-ordering of the sequence, order-related information remains preserved, as the positional encodings of the tokens remain the same for all permutations of the input sequence. The use of this modified form of pretraining helped overcome the two main challenges posed by the BERT architecture. Along with that, XLNet incorporated Transformer-XL into its core architecture. This allowed for better modeling of long-range dependencies compared to BERT. Through the use of these two major modifications, XLNet provided new state-of-the-art results in 18 natural language processing tasks. Significant gains were observed compared to BERT, especially in tasks such as machine reading comprehension, which require the modeling of long-range dependencies. The authors attribute this improvement mainly to the use of Transformer-XL in the XLNet architecture. XLNet, similar to BERT, can be used for a wide range of single sentence, sentence pair and reading comprehension tasks, among others.
V. CONCLUSION
We have thus provided a lucid summary of recent advances in transfer learning in the domain of natural language processing. We hope that this survey will help the reader gain a quick and profound understanding of this domain. Recent advances in this domain, despite being a step forward, come with their own challenges. Specifically, large architectures such as BERT, XLNet and Transformer-XL make training and deployment difficult owing to the large amount of processing power required. Furthermore, employing large and opaque models impedes their explainability, thus making one question their deployment in the real world. Thirdly, while newer models can provide improvements over their predecessors, the lack of a standard benchmark dataset for pre-training these models makes one question whether these improvements are due to an architectural innovation or simply because the said model was pre-trained on a larger amount of data. For instance, it is difficult to gauge whether the XLNet model bettered the BERT model because of an architectural improvement or because it was pre-trained on a larger corpus. Thus, there is a need to decide upon a standard pre-training dataset to remove this ambiguity. Lighter models such as DistilBERT and ALBERT are a step in the right direction and could potentially help bridge the gap between performance and processing power. On the other hand, innovations brought about during the training phase, such as in the RoBERTa model, might help seek out better performance using the same model architecture.
REFERENCES
[1] Yukun Zhu, Ryan Kiros, Richard Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, Sanja Fidler, "Aligning Books and Movies: Towards Story-like Visual Explanations by Watching Movies and Reading Books", arXiv:1506.06724 [cs.CV]
[2] Ciprian Chelba, Tomas Mikolov, Mike Schuster, Qi Ge, Thorsten Brants, Phillipp Koehn, Tony Robinson, "One Billion Word Benchmark for Measuring Progress in Statistical Language Modeling", arXiv:1312.3005 [cs.CL]
[3] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, Li Fei-Fei, "ImageNet: A Large-Scale Hierarchical Image Database", 2009 IEEE Conference on Computer Vision and Pattern Recognition
[4] Zachary C. Lipton, John Berkowitz, Charles Elkan, "A Critical Review of Recurrent Neural Networks for Sequence Learning", arXiv:1506.00019 [cs.LG]
[5] P. J. Werbos, "Backpropagation Through Time: What It Does and How to Do It", Proceedings of the IEEE, Volume 78, Issue 10, Oct 1990
[6] Sepp Hochreiter, Jurgen Schmidhuber, "Long Short-Term Memory", Neural Computation 9(8):1735-1780, 1997
[7] Christopher Olah, "Understanding LSTM Networks", https://colah.github.io/posts/2015-08-Understanding-LSTMs/
[8] K. Cho, B. van Merrienboer, D. Bahdanau, Y. Bengio, "On the Properties of Neural Machine Translation: Encoder-Decoder Approaches", arXiv preprint arXiv:1409.1259, 2014
[9] Stephen Merity, Nitish Shirish Keskar, Richard Socher, "Regularizing and Optimizing LSTM Language Models", arXiv:1708.02182 [cs.CL]
[10] Ilya Sutskever, Oriol Vinyals, Quoc V. Le, "Sequence to Sequence Learning with Neural Networks", arXiv:1409.3215 [cs.CL]
[11] Dzmitry Bahdanau, Kyunghyun Cho, Yoshua Bengio, "Neural Machine Translation by Jointly Learning to Align and Translate", arXiv:1409.0473 [cs.CL]
[12] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin, "Attention Is All You Need", Advances in Neural Information Processing Systems 30 (NIPS 2017)
[13] Jeremy Howard, Sebastian Ruder, "Universal Language Model Fine-tuning for Text Classification", arXiv:1801.06146 [cs.CL]
[14] Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, Luke Zettlemoyer, "Deep Contextualized Word Representations", Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1, June 2018
[15] Alec Radford, Karthik Narasimhan, Tim Salimans, Ilya Sutskever, "Improving Language Understanding by Generative Pre-Training", 2018
[16] Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova, "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding", arXiv:1810.04805 [cs.CL]
[17] Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, "Language Models are Unsupervised Multitask Learners", 2019
[18] Daniel Cer, Yinfei Yang, Sheng-yi Kong, Nan Hua, Nicole Limtiaco, Rhomni St. John, Noah Constant, Mario Guajardo-Cespedes, Steve Yuan, Chris Tar, Yun-Hsuan Sung, Brian Strope, Ray Kurzweil, "Universal Sentence Encoder", arXiv:1803.11175 [cs.CL]
[19] Mitchell P. Marcus, Mary Ann Marcinkiewicz, Beatrice Santorini, "Building a Large Annotated Corpus of English: The Penn Treebank", Computational Linguistics, Volume 19, Issue 2, June 1993
[20] Zihang Dai, Zhilin Yang, Yiming Yang, Jaime Carbonell, Quoc V. Le, Ruslan Salakhutdinov, "Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context", arXiv:1901.02860 [cs.LG]
[21] Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, Quoc V. Le, "XLNet: Generalized Autoregressive Pretraining for Language Understanding"