[PDF] Neural Transfer Learning with Transformers for Social Science Text Analysis

Abstract

During the last years, there have been substantial increases in the prediction performances of natural language processing models on text-based supervised learning tasks. Especially deep learning models that are based on the Transformer architecture (Vaswani et al., 2017) and are used in a transfer learning setting have contributed to this development. As Transformer-based models for transfer learning have the potential to achieve higher prediction accuracies with relatively few training data instances, they are likely to benefit social scientists that seek to have as accurate as possible text-based measures but only have limited resources for annotating training data. To enable social scientists to leverage these potential benefits for their research, this paper explains how these methods work, why they might be advantageous, and what their limitations are. Additionally, three Transformer-based models for transfer learning, BERT (Devlin et al., 2019), RoBERTa (Liu et al., 2019), and the Longformer (Beltagy et al., 2020), are compared to conventional machine learning algorithms on three social science applications. Across all evaluated tasks, textual styles, and training data set sizes, the conventional models are consistently outperformed by transfer learning with Transformer-based models, thereby demonstrating the potential benefits these models can bring to text-based social science research.

Full PDF

NNeural Transfer Learning with Transformers for SocialScience Text Analysis

Sandra Wankm¨uller

Ludwig-Maximilians-University Munich [email protected]

Abstract.

During the last years, there have been substantial increases in the predictionperformances of natural language processing models on text-based supervised learningtasks. Especially deep learning models that are based on the Transformer architecture(Vaswani et al., 2017) and are used in a transfer learning setting have contributed tothis development. As Transformer-based models for transfer learning have the potentialto achieve higher prediction accuracies with relatively few training data instances, theyare likely to beneﬁt social scientists that seek to have as accurate as possible text-basedmeasures but only have limited resources for annotating training data. To enable socialscientists to leverage these potential beneﬁts for their research, this paper explains howthese methods work, why they might be advantageous, and what their limitations are.Additionally, three Transformer-based models for transfer learning, BERT (Devlin et al.,2019), RoBERTa (Liu et al., 2019), and the Longformer (Beltagy et al., 2020), are com-pared to conventional machine learning algorithms on three social science applications.Across all evaluated tasks, textual styles, and training data set sizes, the conventionalmodels are consistently outperformed by transfer learning with Transformer-based mod-els, thereby demonstrating the potential beneﬁts these models can bring to text-basedsocial science research.

Keywords.

Natural Language Processing, Deep Learning, Neural Networks, TransferLearning, Transformer, BERT 1 a r X i v : . [ c s . C L ] F e b Introduction

Deep learning architectures are well suited to capture the contextual and sequentialnature of language and have enabled natural language processing (NLP) researchersto more accurately perform a wide range of tasks such as text classiﬁcation, machinetranslation, or reading comprehension (Ruder, 2020). Despite the fact that deep learningtechniques tend to exhibit higher prediction accuracies in text-based supervised learningtasks compared to traditional machine learning algorithms (Budhwar et al., 2018; Iyyeret al., 2014; Ruder, 2020), they are far from being a standard tool for social scienceresearchers that use supervised learning for text analysis.Although there are exceptions (e.g. Amsalem et al., 2020; Chang and Masterson, 2020;Muchlinski et al., 2020; Rudkowsky et al., 2018; Wu, 2020; Zhang and Pan, 2019), forthe implementation of supervised learning tasks social scientists typically resort to bag-of-words vector space representations of texts that serve as an input to conventionalmachine learning models such as support vector machines (SVMs), naive Bayes, randomforests, boosting algorithms, or regression with regularization (e.g. via the least absoluteshrinkage and selection operator (LASSO)) (see e.g. Anastasopoulos and Bertelli, 2020;Barber´a et al., 2021; Ceron et al., 2015; Colleoni et al., 2014; Diermeier et al., 2011;D’Orazio et al., 2014; Fowler et al., 2020; Greene et al., 2019; Katagiri and Min, 2019;Kwon et al., 2018; Miller et al., 2020; Mitts, 2019; Park et al., 2020; Pilny et al., 2019;Ramey et al., 2019; Seb˝ok and Kacsuk, 2020; Theocharis et al., 2016; Welbers et al.,2017). One among several likely reasons why deep learning methods so far have not been widelyused for supervised learning tasks by social scientists might be that training deep learningmodels is resource intensive. To estimate the exceedingly high number of parametersthat deep learning models typically comprise, large amounts of training examples arerequired. For research questions relating to domains in which it is diﬃcult to access orlabel large enough amounts of training data instances, deep learning becomes infeasibleor prohibitively costly.Recent developments within NLP on transfer learning alleviate this problem. Transferlearning is a learning procedure in which representations learned on a source task aretransmitted to improve learning on the target task (Ruder, 2019). In sequential transferlearning, commonly a language representation model, that comprises highly general,close to universal representations of language, is trained by conducting the source task(Ruder, 2019). Using these learned general-purpose representations as inputs to target This is not to say that social scientists would not have started to leverage the foundations of deeplearning approaches in NLP: During the last years, the use of real-valued vector representations of terms,known as word embeddings, enabled social scientists to explore new research questions or to study oldresearch questions by new means (Han et al., 2018; Kozlowski et al., 2019; Rheault et al., 2016; Rheaultand Cochrane, 2020; Rodman, 2020; Watanabe, 2020). Yet, the analysis of social science relevant textsby means of deep learning up til now mostly is conducted by research teams that are not primarily socialscience trained (e.g. Abercrombie and Batista-Navarro, 2018; Ansari et al., 2020; Budhwar et al., 2018;Glavaˇs et al., 2017; Iyyer et al., 2014; Zarrella and Marsh, 2016).

Starting with core machine learning concepts, this section works out the diﬀerences thatconventional machine learning models vs. deep learning models imply for text analysis.Together with the next section, this section serves as the basis for the subsequentlyprovided introductions to transfer learning and Transformer-based models.

Given raw input data X = { d , . . . , d n , . . . , d N } (e.g. a corpus comprising N raw textﬁles) and a corresponding output variable y = { y , . . . , y n , . . . , y N } , the aim in supervisedmachine learning is to search the space of possible mappings between X and y to ﬁnd theparameters θ of a function such that the estimated function’s predictions, ˆ y = f ( X , ˆ θ ),are maximally close to the true values y and thus—given new, yet unseen data X test —thetrained model will generate accurate predictions (Chollet, 2020). The distance between y and f ( X , ˆ θ ) is measured by a loss function, L ( y , f ( X , ˆ θ )), that guides the searchprocess (Chollet, 2020). Learning in supervised machine learning hence means ﬁnding f ( X , ˆ θ ) and essentially isa two-step process (Goodfellow et al., 2016): The ﬁrst step is to learn representations ofthe data, f R ( X , ˆ θ R ), and the second is to learn mappings from these representations ofthe data to the output. ˆ y = f ( X , ˆ θ ) = f O ( f R ( X , ˆ θ R ) , ˆ θ O ) (1) Conventional machine learning algorithms cover the second step: They learn a func-tion mapping data representations to the output. This in turn implies that the ﬁrststep falls to the researcher who has to (manually) generate representations of the dataherself.When working with texts, the raw data X typically are a corpus of text documents. Avery common approach then is to transform the raw text ﬁles via multiple preprocessingprocedures into a document-feature matrix X = { x , . . . , x n , . . . , x N } (cid:62) (see Figure 1).In a document-feature matrix each document is represented as a feature vector x n = { x n , . . . , x nk , . . . , x nK } (Turney and Pantel, 2010). Element x nk in this vector givesthe value of document n on feature k —which typically is the (weighted) number oftimes that the k th textual feature occurs in the n th document (Turney and Pantel,2010). To conduct the second learning step, the researcher then commonly applies aconventional machine learning algorithm on the document-feature matrix to learn the4 agree :-)I confess this is a problem.These people need help –now!Come on! This is nonsense!… I agree :-) confess this … … … … …… … … … … … … raw data representations of the data outpute.g. corpus e.g. document-feature matrix Figure 1:

Learning as a Two-Step Process.

Learning in machine learning essentially isa two-step process. In text-based applications of conventional machine learning approaches, theraw data X ﬁrst are (manually) preprocessed such that each example is represented as a featurevector in the document-feature matrix X . Second, these feature representations of the data arefed as inputs to a traditional machine learning algorithm that learns a mapping between datarepresentations X and outputs y . relation between the document-feature representation of the data X and the providedresponse values y .There are two diﬃculties with this approach. The ﬁrst is that it may be hard for theresearcher to a priori know which features are useful for the task at hand (Goodfellowet al., 2016). The performance of a supervised learning algorithm will depend on therepresentation of the data in the document-feature matrix (Goodfellow et al., 2016).In a classiﬁcation task, features that capture observed linguistic variation that helps inassigning the texts into the correct categories are more informative and will lead to abetter classiﬁcation performance than features that capture variation that is not helpfulin distinguishing between the classes (Goodfellow et al., 2016). Yet, determining whichsets of possibly highly abstract and complex features are informative and which are notis highly diﬃcult (Goodfellow et al., 2016): A researcher can choose from a multitude ofpossible preprocessing steps such as stemming, lowercasing, removing stopwords, addingpart-of-speech (POS) tags, or applying a sentiment lexicon. The set of selected pre-processing steps as well as the order in which they are implemented deﬁne the way inwhich the texts at hand are represented and thus aﬀect the research ﬁndings (Denny andSpirling, 2018; Goodfellow et al., 2016). Social scientists may be able to use some of theirdomain knowledge in deciding upon a few speciﬁc preprocessing decisions (e.g. whetherit is likely that excluding a predeﬁned list of stopwords will be beneﬁcial because it re- For a more detailed list of possible steps see Turney and Pantel (2010) and Denny and Spirling(2018). x n —which implies that each document is treated as a bag-of-words(Turney and Pantel, 2010). Bag-of-words representations disregard word order and syn-tactic or semantic dependencies between words in a sequence (Turney and Pantel, 2010). Yet, text is contextual and sequential by nature. Word order carries meaning. And thecontext in which a word is embedded in is essential in determining the meaning of aword. When represented as a bag-of-words, the sentence ‘The opposition party leaderattacked the prime minister.’ cannot be distinguished from the sentence ‘The primeminister attacked the opposition party leader.’ . Moreover, the fact that the word ‘party’ here refers to a political party rather than a festive social gathering only becomes clearfrom the context. Although bag-of-words models perform relatively well given the sim-ple representations of text they build upon (Grimmer and Stewart, 2013), it has beenshown that capturing contextual and sequential information is likely to enhance predic-tion performances (see for example Socher et al., 2013).

In contrast to conventional machine learning algorithms, deep learning models conductboth learning steps: they learn representations of the data and a function mapping datarepresentations to the output.Deep learning is one branch of representation learning in which features are not designedand selected manually but are learned by a machine learning algorithm (Goodfellowet al., 2016). In deep learning models, an abstract representation of the data is learnedby applying the data to a stack of several simple functions (Goodfellow et al., 2016).Each function takes as an input the representation of the data created by (the sequenceof) previous functions and generates a new representation: f ( X , ˆ θ ) = f O ( . . . f R ( f R ( f R ( X , ˆ θ R ) , ˆ θ R ) , ˆ θ R ) . . . , ˆ θ O ) (2)The ﬁrst function f R takes as an input the data and provides as an output a ﬁrstrepresentation of the data. The second function f R takes as an input the representationlearned by the ﬁrst function and generates a new representation and so on. Deep learning By counting the occurrence of word sequences of length N , N -gram models extend unigram-basedbag-of-words models and allow for capturing information from small contexts around words. Yet, byincluding N -grams as features the dimensionality of the feature space increases, thereby increasing theproblem of high dimensionality and sparsity. Moreover, texts often exhibit dependencies between wordsthat are positioned much farther apart than what could be captured with N -gram models (Chang andMasterson, 2020). deep (Goodfellow et al., 2016). The representation layers, { R , R , R , . . . } , are namedhidden layers (Goodfellow et al., 2016). The dimensionality of the vector-valued hiddenlayer outputs is a model’s width and the number of hidden layers is the depth of amodel (Goodfellow et al., 2016). Deep learning models diﬀer in their architecture. Mosttypes of models, however, are based on a chain of functions as speciﬁed in equation 2(Goodfellow et al., 2016).Deep learning models do learn representations of the data. Yet, when applying themto text-based applications they do not take as an input the raw text documents. Theystill have to be fed with a data format they can read. Rather than taking as an inputa raw text document d n = { a , . . . , a t , . . . , a T } , which is a sequence of words (or moreprecisely: a sequence of tokens), deep learning models take as an input a sequence ofword embeddings { z [ a ] , . . . , z [ a t ] , . . . , z [ a T ] } .Word embeddings are real-valued vector representations of the unique terms in the vo-cabulary (Mikolov et al., 2013c). The vocabulary Z consists of the U unique terms inthe corpus: Z = { z , . . . , z u , . . . , z U } . Each vocabulary term z u can be represented asa word embedding—a K -dimensional real-valued vector z u ∈ R K . And as each token a t is an instance of one of the unique terms in the vocabulary, each token has a corre-sponding embedding. A document { a , . . . , a t , . . . , a T } consequently can be mapped to { z [ a ] , . . . , z [ a t ] , . . . , z [ a T ] } .A deep learning model thus is fed with a sequence of real-valued vectors instead of asequence of tokens. The values of the word embedding vectors are parameters that arelearned in the optimization process. Hence, the representation for each unique textualtoken is learned when training the model. Except for the decision of how to separatea continuous text sequence into tokens (and sometimes also whether to lowercase thetokens and consequently to only learn embeddings for lowercased tokens), deep learningmodels learn representations of the data rather than operating on manually tailored rep-resentations. Also note that deep learning NLP models—instead of learning one vectorrepresentation for each term—learn one vector per term in each hidden layer.As deep learning NLP models learn to represent a raw sequence of tokens as a sequenceof vectors instead of taking it as a bag-of-words, deep learning models can learn depen-dencies between tokens and take into account contextual information. Deep learningarchitectures as recurrent neural networks (RNNs) (Elman, 1990) and the Transformer(Vaswani et al., 2017) are especially suited to capture dependencies between sequen-tial input data and are able to encode information from the textual context a token isembedded in. 7 .4 Well-Performing and Eﬃcient Models for NLP Tasks The no free lunch theorem (Wolpert and Macready, 1997) states that if averaging overall possible classiﬁcation tasks every classiﬁcation algorithm will achieve the same gen-eralization performance (Goodfellow et al., 2016). This theorem implies that there isno universally superior machine learning algorithm (Goodfellow et al., 2016). A speciﬁcalgorithm will perform well on one task but less well on another.Hence, given a type of learning task, researchers ideally should develop and use machinelearning algorithms that are especially suited to conduct this type of task. This is,researchers should employ algorithms that are good at approximating the functionsthat map from feature inputs to provided outputs in the real-world applications theyencounter (Goodfellow et al., 2016).In NLP tasks, models have to learn mappings between textual inputs and task-speciﬁcoutputs. Common NLP tasks are binary or multi-class classiﬁcation tasks in which themodel’s task is to assign one out of two or one out of several class labels to each textsequence. Other NLP tasks, for example, require the model to answer multiple choicequestions about provided input texts or to make predictions about the entailment of asentence pair (Wang et al., 2019b).Due to their layered architecture and the vector-valued representations deep learningmodels tend to have a high capacity. This is, they can approximate a large variety ofcomplex functions (Goodfellow et al., 2016). On less complex data structures, large deeplearning models may risk overﬁtting and conventional machine learning approaches withlower expressivity may be more suitable. The ability to express complicated functionsas well as the ability to automatically learn representations and the ability to encodeinformation on connections between sequential inputs, however, seems essential whenworking with textual data: Empirically, deep learning models outperform classic machinelearning algorithms on nearly all NLP tasks (Ruder, 2020).Despite these advantages, deep learning methods so far are not widely used for super-vised learning tasks within the quantitative text analysis community in social science.Some studies do use deep learning models (e.g. Amsalem et al., 2020; Chang and Mas-terson, 2020; Muchlinski et al., 2020; Rudkowsky et al., 2018; Wu, 2020; Zhang and Pan,2019). Yet, until now social scientists typically approach supervised classiﬁcation tasksby generating bag-of-words based representations of texts that then are passed on toconventional machine learning algorithms as SVMs, naive Bayes, regularized regression,or tree-based methods (see e.g. Anastasopoulos and Bertelli, 2020; Barber´a et al., 2021;Ceron et al., 2015; Colleoni et al., 2014; Diermeier et al., 2011; D’Orazio et al., 2014;Fowler et al., 2020; Greene et al., 2019; Katagiri and Min, 2019; Kwon et al., 2018; Milleret al., 2020; Mitts, 2019; Park et al., 2020; Pilny et al., 2019; Ramey et al., 2019; Seb˝okand Kacsuk, 2020; Theocharis et al., 2016; Welbers et al., 2017).There are several factors that might have contributed to the sporadic rather thanwidespread use of deep learning models within text-based social science. One likely8eason is that deep learning models have considerably more parameters to be learnedin training than classic machine learning models. Consequently, deep learning modelsare computationally highly intensive and require substantially larger amounts of train-ing examples. How much more training data are needed depends on the width anddepth of the model, the task, and training data quality. Thus, precise numbers on theamounts of parameters and required training examples cannot be speciﬁed. To never-theless put the sizes in relation note that a SVM with a linear kernel that learns toconstruct a hyperplane in a 3000-dimensional feature space which separates instancesinto two categories based on 1000 support vectors has around 3 million parameters. TheTransformer-based models presented in this article, in contrast, have well above 100million parameters.Recent developments of deep language representation models for transfer learning, how-ever, reduce the amount of training instances needed to achieve the same level of per-formance than when not using transfer learning (Howard and Ruder, 2018). The intro-duction of the Transformer (Vaswani et al., 2017) additionally has improved the studyof text. Due to its self-attention mechanisms the Transformer is better able to encodecontextual information and dependencies between tokens than previous deep learningarchitectures (Vaswani et al., 2017). To enable social science research to leverage thesepotentials, this study presents and explains the workings and usage of Transformer mod-els for transfer learning.

This section provides an introduction to the basics of deep learning and thereby laysthe foundation for the following sections. First, based on the example of feedforwardneural networks (FNNs) the core elements of neural network architectures are explicated.Then, the optimization process via stochastic gradient descent with backpropagation(Rumelhart et al., 1986) will be presented. Subsequently, the architecture of recurrentneural networks (RNNs) (Elman, 1990) is outlined.

The most elementary deep learning model is a feedforward neural network (FNN), alsonamed multilayer perceptron (Goodfellow et al., 2016). A feedforward neural networkwith L hidden layers, vector input x and a scalar output y can be visualized as in Figure2 and be described as follows (Ruder, 2019, see also): h = σ ( W x + b ) (3) h = σ ( W h + b ) (4) . . . h l = σ l ( W l h l − + b l ) (5)9 . .y = σ O ( w O h L + b O ) (6)The input to the neural network is the K -dimensional vector x (see equation 3). x enters an aﬃne function characterized by weight matrix W and bias vector b , whereby W ∈ R K × K , and b ∈ R K . σ is a nonlinear activation function and h ∈ R K is the K -dimensional representation of the data in the ﬁrst hidden layer. This is,the neural networks takes the input data x and via combining an aﬃne function with anonlinear activation function generates a new, transformed representation of the originalinput: h . The hidden state h in turn serves as the input for the next layer thatproduces representation h ∈ R K . This continues through the layers until the lasthidden representation, h L ∈ R K L , enters the output layer (see equation 6). 𝒙 𝒉 ! 𝒉 " 𝑾 " 𝒘 𝑥 ! 𝑥 " 𝑥 𝑥 $ ℎ !! ℎ !" ℎ ! ℎ !$ ℎ %! ℎ %" ℎ % ℎ %$ 𝒉 $ ℎ &! ℎ &" ℎ & ℎ &$ 𝑦 Figure 2:

Feedforward Neural Network.

Feedforward Neural Network with L hiddenlayers, four units per hidden layer and scalar output y . The solid lines indicate the lineartransformations encoded in weight matrix W . The dotted lines indicate the connections betweenseveral consecutive hidden layers. The activation functions in neural networks typically are chosen to be nonlinear (Good-fellow et al., 2016). The reason is that if the activation functions were set to be linear,the output of the neural network would merely be a linear function of x (Goodfellowet al., 2016). Hence, the use of nonlinear activation functions is essential for the ca-pacity of neural networks to approximate a wide range of functions and highly complexfunctions (Ruder, 2019).In the hidden layers, the rectiﬁed linear unit (ReLU) (Nair and Hinton, 2010) is oftenused as an activation function σ l (Goodfellow et al., 2016). If q = { q , . . . , q k , . . . , q K } is the K -dimensional vector resulting from the aﬃne transformation in the l th hiddenlayer, q = W l h l − + b l (see equation 5), then ReLU is applied on each element q k : σ l ( q ) k = max { , q k } (7)10 l ( q ) k then is the k th element of hidden state vector h l . In the output layer, the activation function σ O is selected so as to produce an output thatmatches the task-speciﬁc type of output values. In binary classiﬁcation tasks with y n ∈{ , } the standard logistic function, often simply referred to as the sigmoid function,is a common choice (Goodfellow et al., 2016). For a single observational unit n , thesigmoid function’s scalar output value is interpreted as the probability that y n = 1.If y n , however, can assume one out of C unordered response category values, y n ∈{ , . . . , c, . . . , C } , the softmax function, a generalization of the sigmoid function thattakes as an input and produces as an output a vector of length C , is typically employed(Goodfellow et al., 2016). For the n th example, the c th element of the softmax outputvector gives the probability predicted by the model that unit n falls into class c . In supervised learning tasks, a neural network is provided with input x n and correspond-ing output y n for each training example. All the weights and bias terms are parametersto be learned in the process of optimization (Goodfellow et al., 2016). The set of pa-rameters hence is θ = { W , . . . , W l , . . . , W L , W O , b , . . . , b l , . . . , b L , b O } .For a single training example n , the loss function L ( y n , f ( x n , ˆ θ )) measures how well thevalue predicted for n by model f ( x n , ˆ θ ), that is characterized by the estimated parameterset ˆ θ , matches the true value y n . In the optimization process the aim is to ﬁnd the set ofvalues for the weights and biases that minimizes the average of the observed losses over alltraining set instances, also known as the empirical risk: R ( ˆ θ ) = N (cid:80) Nn =1 L ( y n , f ( x n , ˆ θ ))(Goodfellow et al., 2016).Neural networks commonly employ variants of gradient descent with backpropagation inthe optimization process (Goodfellow et al., 2016). To approach the local minimum ofthe empirical risk function, the gradient descent algorithm makes use of the fact that thedirection of the negative gradient of function R at current point ˆ θ i gives the direction inwhich R is decreasing fastest—the direction of the steepest descent (Goodfellow et al.,2016). The gradient is a vector of partial derivatives. It is the derivative of R at point ˆ θ i and is commonly denoted as ∇ ˆ θ i R ( ˆ θ i ).So, in the ith iteration, the gradient descent algorithm computes the negative gradientof R at current point ˆ θ i and then moves from ˆ θ i into the direction of the negativegradient: ˆ θ i +1 = ˆ θ i − η ∇ ˆ θ i R ( ˆ θ i ) (8)whereby η ∈ R + is the learning rate. If η is small enough, then R ( ˆ θ i ) ≥ R ( ˆ θ i +1 ) ≥R ( ˆ θ i +2 ) ≥ . . . . This is, repeatedly computing the gradient and updating into the Activation functions that are similar to ReLU are the Exponential Linear Unit (ELU) (Clevert et al.,2016), Leaky ReLU and the Gaussian Error Linear Unit (GELU) (Hendrycks and Gimpel, 2016). Thelatter is used in BERT (Devlin et al., 2019). η , will generate asequence moving toward the local minimum (Li et al., 2020a).In each iteration, the gradients for all parameters are computed via the backpropagationalgorithm (Rumelhart et al., 1986). A very frequently employed approach, known asmini-batch gradient descent or stochastic gradient descent, is to compute the gradientsbased on a small random sample, a mini-batch, of M training set observations: ∇ ˆ θ i R ( ˆ θ i ) = 1 M M (cid:88) m =1 ∇ ˆ θ i L ( y m , f ( x m , ˆ θ i )) (9)The gradient computed on a sample of training instances usually provides a good ap-proximation of the loss function gradient evaluated on the entire training set (Goodfellowet al., 2016). It requires fewer computational resources and thus leads to faster conver-gence (Goodfellow et al., 2016). M typically assumes a value in the range from 2 to afew hundred (Ruder, 2019).The size of the mini-batch M and the learning rate η are hyperparameters in trainingneural network models. Especially the learning rate is often carefully tuned (Li et al.,2020a). A too high learning rate leads to large oscillations in the loss function values,whereas a too low learning rate implies slow convergence and risks getting stuck at a non-optimal point with a high loss value (Goodfellow et al., 2016). Commonly, the learningrate is not kept constant but is set to decrease over time (Goodfellow et al., 2016).Furthermore, there are variants of stochastic gradient descent, as AdaGrad (Duchi et al.,2011), RMSProp (Hinton et al., 2012), and Adam (Kingma and Ba, 2015), that have avarying learning rate for each individual parameter (Goodfellow et al., 2016). The recurrent neural network (RNN) (Elman, 1990) is the most basic neural networkto process sequential input data of variable length such as texts (Goodfellow et al.,2016). Given an input sequence of T token embeddings { z [ a ] , . . . , z [ a t ] , . . . , z [ a T ] } , RNNssequentially process each token. Here, one input token embedding z [ a t ] corresponds toone time step t and the hidden state h t is updated at each time step. At each step t ,the hidden state h t is a function of the hidden state generated in the previous time step, h t − , as well as new input data, z [ a t ] (see Figure 3) (Amidi and Amidi, 2019; Elman,1990).The hidden states h t , that are passed on and transformed through time, serve as themodel’s memory (Elman, 1990; Ruder, 2019). They capture the information of the se-quence that entered until t (Goodfellow et al., 2016). Due to this sequential architecture,RNNs theoretically can model dependencies over the entire range of an input sequence The backpropagation algorithm makes use of the chain rule to compute the gradients. Helpfulintroductions to the backpropagation algorithm can be found in Li et al. (2020a), Li et al. (2020b) andHansen (2020). ! ℎ " ℎ " ! " " " … " … ℎ [& ! ] [& " ] [& ] [& ] Figure 3:

Recurrent Neural Network.

Architecture of a basic recurrent neural networkunfolded through time. At time step t , the hidden state h t is a function of the previous hiddenstate, h t − , and current input embedding z [ a t ] . y t is the output produced at t . (Amidi and Amidi, 2019). Yet, practically recurrent models have problems in learningdependencies extending beyond sequences of 10 or 20 tokens (Goodfellow et al., 2016).The reason is that when backpropagating the gradients through the time steps (this isknown as Backpropagation Through Time (BPTT)), the gradients may vanish and thusfail to transmit a signal over long ranges (Goodfellow et al., 2016).The Long Short-Term Memory (LSTM) model (Hochreiter and Schmidhuber, 1997) ex-tends the RNN with input, output, and forget gates that enable the model to accumulate,remember as well as forget provided information (Goodfellow et al., 2016). This makesthem better suited than the basic RNNs to model dependencies stretching over long timespans as found in textual data (Ruder, 2019). Transformer-based models for transfer learning fuse two lines of recent NLP research:transfer learning and attention. Both concepts and respective developments will beoutlined in the following two sections such that after reading these sections the reader willnot only be prepared to understand the workings of transfer learning with Transformer-based models like BERT but also will be informed about latest major developmentswithin NLP. This ﬁrst section deﬁnes transfer learning and describes the beneﬁts anduses of transfer learning.

The classic approach in supervised learning is to have a training data set containing alarge number of annotated instances, { x n , y n } Nn =1 , that are provided to a model thatlearns a function relating the x n to the y n (Ruder, 2019). If the train and test datainstances have been drawn from the same distribution over the feature space, the trained13odel can be expected to make accurate predictions for the test data, i.e. to generalizewell (Ruder, 2019). Given another task (e.g. another set of labels to learn) or anotherdomain (e.g. another set of documents with a diﬀerent thematic focus), the standardsupervised learning procedure would be to sample and create a new training data set forthis new task and domain (Pan and Yang, 2010; Ruder, 2019). This is, for each new taskand domain a new model is trained from the start (Pan and Yang, 2010; Ruder, 2019).There is no transferal of already existing, potentially relevant and useful informationfrom related domains or tasks to the task at hand (Ruder, 2019).Trained supervised learning models thus are not good at generalizing to data exhibit-ing characteristics diﬀerent from the data they have been trained on (Ruder, 2019).Moreover, the (manual) labeling of thousands to millions of training instances for eachnew task makes supervised learning highly resource intensive and prohibitively costly tobe applied for all potentially useful and interesting tasks (Pan and Yang, 2010; Ruder,2019). In situations in which the number of annotated training examples is restricted orthe researcher lacks the resources to label a suﬃciently large number of training instancesclassic supervised learning fails (Ruder, 2019).This is where transfer learning comes in. Transfer learning refers to statistical learningprocedures in which knowledge learned by training a model on a related task and domain,the source task and source domain, is transferred to the learning process of the targettask in the target domain (Pan and Yang, 2010; Ruder, 2019).Ruder (2019) provides a taxonomy of transfer learning scenarios in NLP. In transduc-tive transfer learning, the source and target tasks are the same but annotated trainingexamples only are available for the source domain (Ruder, 2019). Here, knowledge istransferred across domains (domain adaptation) or languages (cross-lingual learning)(Ruder, 2019). In inductive transfer learning source and target tasks diﬀer but theresearcher has access to at least some labelled training samples in the target domain(Ruder, 2019). In this setting, tasks can be learned simultaneously (multi-task learning)or sequentially (sequential transfer learning) (Ruder, 2019). In this article the focus is on sequential transfer learning, which at present is the mostfrequently employed type of transfer learning (Ruder, 2019). In sequential transfer learn-ing the source task diﬀers from the target task and training is conducted in a sequentialmanner (Ruder, 2019). Here, two stages are distinguished (see also Figure 4): First, amodel is trained on the source task (pretraining phase) (Ruder, 2019). Subsequently,the knowledge gained in the pretraining phase is transmitted to the model trained onthe target task (adaptation phase) (Ruder, 2019). In NLP, the knowledge that is trans-ferred are the parameter values learned during training the source model—and thus alsoincludes the values of the token representation vectors. The source model thus by someauthors is also referred to as a language representation model (see e.g. Devlin et al.,2019). 14 !" $ '($)&' %&'(&)*+*+, ! '($)&' , '($)&' $ '($)&' )-)%()(*.+ We laughed out loud [EOS]He is eating an apple [EOS]She told me to come [EOS]She nodded [EOS]New York woke up [EOS]The sun shone bright ! !" , !" Figure 4:

Pretraining and Adaptation Phase in Sequential Transfer Learning.

In the pretraining phase of sequential transfer learning a model is trained on the source task { X source , y source } . The representations (i.e. the parameters) learned in the pretraining phasethen are taken to the learning process of the target task, { X target , y target } . A once pretrainedmodel can serve as an input for a large variety of target tasks. The discoveries with the ImageNet Dataset suggest that in order to learn general, closeto universal representations, that are relevant for a wide range of tasks within an entirediscipline, two things are required: a pretraining data set that contains a large amount16f training samples and is representative of the feature distribution studied across thediscipline and a suitable pretraining task (Ruder, 2018, 2019).

The most fundamental pretraining approaches in NLP are self-supervised (Ruder, 2019).Among these, a very common pretraining task is language modeling (Bengio et al., 2003).A language model learns to assign a probability to a sequence of tokens (Bengio et al.,2003). As the probability for a sequence of T tokens, P ( a , . . . , a t , . . . , a T ), can becomputed as P ( a , . . . , a t , . . . , a T ) = T (cid:89) t =1 P ( a t | a , . . . , a t − ) (10)or as P ( a , . . . , a t , . . . , a T ) = (cid:89) t = T P ( a t | a T , . . . , a t +1 ) (11)language modeling involves predicting the conditional probability of token a t given allits preceding tokens, P ( a t | a , . . . , a t − ), or implicates predicting the conditional prob-ability of token a t given all its succeeding tokens, P ( a t | a T , . . . , a t +1 ) (Bengio et al.,2003; Yang et al., 2020). A forward language model models the probability in equation10, a backward language model computes the probability in equation 11 (Peters et al.,2018).When being trained on a forward and/or backward language modeling task in pre-training, a model is likely to learn general structures and aspects of language, such aslong-range dependencies, compositional structures, semantics and sentiment, that arerelevant for a wide range of possible target tasks (Howard and Ruder, 2018; Ruder,2018). Hence, language modeling can be argued to be a well-suited pretraining task(Howard and Ruder, 2018). In contrast to computer vision where pretraining on the ImageNet Dataset is ubiquitous,there is no standard pretraining data set in NLP (Aßenmacher and Heumann, 2020;Ruder, 2019). The text corpora that have been employed for pretraining vary widelyregarding the number of tokens they contain as well as their accessibility (Aßenmacherand Heumann, 2020). Most models are trained on a combination of diﬀerent corpora.Several models (e.g. Devlin et al., 2019; Lan et al., 2020; Liu et al., 2019; Yang et al.,2019) use the English Wikipedia and the BooksCorpus Dataset (Zhu et al., 2015). Manymodels (e.g. Brown et al., 2020; Liu et al., 2019; Radford et al., 2019; Yang et al., 2019)additionally also use pretraining corpora made up of web documents obtained fromcrawling the web. A detailed and systematic overview over these data sets is provided by Aßenmacher and Heumann(2020). ——————————————————————————————————NLP Tasks.

GLUE comprises nine English language understanding subtasks in whichbinary or multi-class classiﬁcation problems have to be solved (Wang et al., 2019b). Thetasks vary widely regarding training data set size, textual style, and diﬃculty (Wanget al., 2019b). The Stanford Sentiment Treebank (SST-2) (Socher et al., 2013) sub-task, for example, is a single-sentence binary classiﬁcation task in which the sentimentsexpressed in sentences from movie reviews are to be classiﬁed as being positive vs. neg-ative (Wang et al., 2019b). In the subtask based on the Multi-Genre Natural LanguageInference Corpus (MNLI) (Williams et al., 2018), in contrast, the model is presentedwith a sentence pair consisting of a premise and a hypothesis. The model has to make aprediction whether the premise entails, contradicts, or neither entails nor contradicts thehypothesis (Wang et al., 2019b). SuperGLUE was introduced when the performance ofdeep neural networks used for transfer learning surpassed human performance on GLUE(Wang et al., 2019a). SuperGLUE comprises eight subtasks considered to be more dif-ﬁcult than the GLUE tasks, e.g. natural language inference or reading comprehensiontasks (Wang et al., 2019a). Models that are evaluated based on SQUAD are presentedwith passages from Wikipedia articles and corresponding questions (Rajpurkar et al.,2016). The task is to return the text segment answering the question or discern thatthe question is not answerable by the provided passage (Rajpurkar et al., 2016, 2018).Finally, RACE is made up of 97 ,

687 multiple choice questions on 27 ,

933 text passagestaken from English exams for Chinese students testing reading comprehension ability(Lai et al., 2017). ——————————————————————————————————4.3.3 Deep Pretraining

Another issue beside a suitable pretraining task and a large pretraining corpus is thedepth (i.e. the number of hidden layers) of the neural language representation model.From the perspective of transfer learning, the early seminal word embedding models,such as continuous bag-of-words (CBOW) (Mikolov et al., 2013a), skip-gram (Mikolovet al., 2013a,b), and Global Vectors (GloVe) (Pennington et al., 2014), are self-supervised18retraining approaches (Ruder, 2019). In these models, word embeddings are learnedbased on neural network architectures that have no hidden layer (Ruder, 2019). Thisimplies that these models for each term learn one vector representation, that encodes oneinformation (Peters et al., 2018). Representing each term with a single vector, however,may become problematic if the meaning of a term changes with context (e.g. ‘party’ )(Peters et al., 2018). Moreover, for models to deduce meaning from sequences of words,several diﬀerent types of information (e.g. syntactic and semantic information) are likelyto be required (Peters et al., 2018).In contrast to the early word embedding models, deep learning models learn multi-layered contextualized representations (e.g. McCann et al., 2018; Howard and Ruder,2018; Peters et al., 2018). In deep neural networks for NLP, each layer learns one vectorrepresentation for a term (Peters et al., 2018). Hence, a single term is represented byseveral vectors—one vector from each layer. Although it cannot be speciﬁed a prioriwhich information is encoded in which hidden layer in a speciﬁc model trained on aspeciﬁc task, one can expect that the information encoded in lower layers is less complexand more general whereas information encoded in higher layers is more complex andmore task-speciﬁc (Yosinski et al., 2014). The representations learned by a deep neurallanguage model thus may, for example, encode syntactic aspects at lower layers andsemantic context-dependent information in higher layers (see for example Peters et al.,2018). As comparisons across NLP benchmark data sets show, deep neural languagemodels provide contextualized representations of language that generalize better acrossa wide range of speciﬁc target tasks compared to the one layer representations from earlyword embedding architectures (see e.g. McCann et al., 2018).

There are two basic ways of how to implement the adaptation phase in transfer learning:feature extraction vs. ﬁne-tuning (see Figure 5) (Ruder, 2019). In a feature extractionapproach the representations learned in the pretraining phase are frozen and not alteredduring adaptation (Ruder, 2019). In ﬁne-tuning, in contrast, the pretrained parametersare not ﬁxed but are updated in the adaptation phase (Ruder, 2019).An example for a feature extraction approach is ELMo (Embeddings from LanguageModels) (Peters et al., 2018). In pretraining, ELMo learns three layers of word vectors(Peters et al., 2018). These learned representations then are frozen and extracted toserve as the input for a target task-speciﬁc model that learns a linear combination ofthe three layers of pretrained vectors (Peters et al., 2018). Here, only the weights of thelinear model but not the frozen pretrained vectors are trained (Peters et al., 2018).In ﬁne-tuning typically the same model architecture used in pretraining is also usedfor adaptation (Peters et al., 2019). The model is merely added a task-speciﬁc outputlayer (Peters et al., 2019). The parameters learned in the pretraining phase serve asinitializations for the model in the adaptation phase (Ruder, 2019). When training themodel on the target task, the gradients are allowed to backpropagate to the pretrained19 " !" . !" . !" . '($)&' . !" ⇝ . '($)&' . !" ⇝ . '($)&' . !" ⇝ . '($)&' . !" ⇝ . '($)&' Figure 5:

Feature Extraction vs. Fine-Tuning in the Adaptation Phase.

In afeature extraction approach the parameters learned during pretraining on the source task arenot changed. Only the parameters in the separate target-task model architecture are trained. Inﬁne-tuning, the entire pretrained model architecture is trained on the target task such that alsothe pretrained parameters are updated. representations and thus to induce changes on these pretrained representations (Ruder,2019). In contrast to the feature extraction approach, the pretrained parameters henceare allowed to be ﬁne-tuned to capture task-speciﬁc adjustments (Ruder, 2019).When ﬁne-tuning BERT on a target task, for example, a target task-speciﬁc outputlayer is put on top of the pretraining architecture (Devlin et al., 2019). Then the entirearchitecture is trained, meaning that all parameters (also those learned in pretraining)are updated (Devlin et al., 2019).The performance of feature extraction vs. ﬁne-tuning is likely to be a function of thedissimilarity between the source and the target task (Peters et al., 2019). At present, thehighest performances in text classiﬁcation and sentiment analysis tasks are achieved bypretrained language representation models that are ﬁne-tuned to the target task (Ruder,2020).Fine-tuning, however, can be a tricky matter. The central parameter in ﬁne-tuning is thelearning rate η with which the gradients are updated during training on the target task(see equation 8) (Ruder, 2019). Too much ﬁne-tuning (i.e. a too high learning rate) canlead to catastrophic forgetting—a situation in which the general representations learnedduring pretraining are overwritten and therefore forgotten when ﬁne-tuning the model(Kirkpatrick et al., 2017). A too careful ﬁne-tuning scheme (i.e. a too low learning rate),in contrast, may lead to a very slow convergence process (Howard and Ruder, 2018). Ingeneral it is recommended that the learning rate should be lower than the learning rateused during pretraining such that the representations learned during pretraining are notaltered too much (Ruder, 2019). 20o eﬀectively ﬁne-tune a pretrained model without catastrophic forgetting, Howard andRuder (2018) present a set of learning rate schedules that vary the learning rate overthe course of the adaptation phase. They also introduce discriminative ﬁne-tuning—aprocedure in which each layer has a diﬀerent learning rate during ﬁne-tuning to accountfor the fact that diﬀerent layers encode diﬀerent types of information (Howard andRuder, 2018; Yosinski et al., 2014). As RNNs and derived architectures as the LSTM model are made to process sequentialinput data, they seem the natural model of choice for processing sequences of textualtokens. The problem of recurrent architectures, however, is that they model dependen-cies by sequentially propagating through the positions of the input sequence. Whetheror how well dependencies are learned by RNNs is a function of the distance between thetokens that depend on each other (Goodfellow et al., 2016). The longer the distancebetween the tokens, the less well the dependency tends to be learned (Goodfellow et al.,2016).A solution to this problem is provided by the attention mechanism, ﬁrst introduced byBahdanau et al. (2015) for Neural Machine Translation (NMT), that allows to model de-pendencies between tokens irrespective of the distance between them. The Transformer isa deep learning architecture that is solely based on attention mechanisms (Vaswani et al.,2017). It overcomes the ineﬃciencies of recurrent models and introduces self-attention(Vaswani et al., 2017). To provide a solid understanding of these methods, this sectionﬁrst explains the attention mechanism and then introduces the Transformer.

The common task encountered in NMT is to translate a sequence of S tokens in language G , { g , . . . , g s , . . . , g S } , to a sequence of T tokens in language A , { a , . . . , a t , . . . , a T } (Sutskever et al., 2014). The length of the input and output sequences may diﬀer: ‘Heis giving a speech.’ translated to German is ‘Er h¨alt eine Rede.’ . The task thus is one ofsequence-to-sequence modeling whereby the sequences are of variable length (Sutskeveret al., 2014).The classic architecture to solve this task is an encoder-decoder structure (see Figure6) (Sutskever et al., 2014). The encoder maps the input tokens { g , . . . , g s , . . . , g S } into a single vector of ﬁxed dimensionality, context vector c , that is then provided tothe decoder that generates the sequence of output tokens { a , . . . , a t , . . . , a T } from c (Sutskever et al., 2014). In the original NMT articles, encoder and decoder are recurrentmodels (Sutskever et al., 2014). Hence, the encoder sequentially processes each inputtoken embedding z [ g s ] . The hidden state at time step s , h s , is a nonlinear function of the21 ! ℎ " ℎ ! [, ! ] [, " ] [& ! ] … [& " ] … ℎ " [, % ] [, & ] [, ' ] [, ( ] ℎ ( ℎ ) ℎ * ℎ + = % ℎ …&'%()*+,*%()*+ [, ( ] () *+ /*9*./ , +0)):ℎ [345][345] 37 ℎä1- … Figure 6:

Encoder-Decoder Architecture.

Encoder-decoder structure in neural machinetranslation. In this example, the six token input sentence { He, is, giving, a, speech, [ EOS ] } istranslated to German: { Er, h¨alt, eine, Rede, [EOS] } . The end-of-sentence symbol [ EOS ] is usedto signal to the model the end of a sentence. The encoder processes one input token embedding z [ g s ] at a time and updates the input hidden state h s at each time step. The last encoder hiddenstate h serves as context vector c that captures all the information from the input senquence.The decoder generates one translated output token at a time. Each output hidden state h t isa function of the preceding hidden state h t − , the preceding predicted output token embedding z [ a t − ] , and context vector c . h s − , and input token embedding z [ g s ] (Cho et al., 2014): h s = σ ( h s − , z [ g s ] ) (12)The last encoder hidden state, h S , corresponds to context vector c that then is passed onto the decoder which—given the information encoded in c —produces a variable-lengthsequence output { a , . . . , a t , . . . , a T } . The decoder also operates in a recurrent manner:one output token a t is produced at one time step. In contrast to the encoder, the hiddenstate of the decoder at time step t now is not only a function of the previous hiddenstate h t − but also the embedding of the predicted previous output token z [ a t − ] , andcontext vector c (see also Figure 6) (Cho et al., 2014): h t = σ ( h t − , z [ a t − ] , c ) (13)A problem with this traditional encoder-decoder structure is that all the informationabout the input sequence—regardless of the length of the input sequence—is capturedin a single context vector c (Bahdanau et al., 2015).The attention mechanism, that has been introduced to NMT by Bahdanau et al. (2015)and was reﬁned by Luong et al. (2015), resolves this problem. In the attention mech-anism, the encoder produces hidden states for each input token and at each time stepthe decoder can attend to, and thus derive information from, all encoder-produced inputhidden states when computing its hidden state h t (see Figure 7). More precisely, thedecoder hidden state at time point t , h t , is a function of the initial decoder hidden state ˜ h t , the previous output token z [ a t − ] , and an output token-speciﬁc context vector c t (Luong et al., 2015). h t = σ ( ˜ h t , z [ a t − ] , c t ) (14)Note that now at each time step there is a context vector c t that is speciﬁc to the t thoutput token (Bahdanau et al., 2015). The attention mechanism rests in the computa-tion of c t , which is a weighted sum over the input hidden states { h , . . . , h s , . . . , h S } (Bahdanau et al., 2015): c t = S (cid:88) s =1 α ts h s (15)The weight α ts is computed as α ts = exp ( score ( ˜ h t , h s )) (cid:80) S ∗ s ∗ =1 exp ( score ( ˜ h t , h s ∗ )) (16)whereby score is a scoring function assessing the compatibility between output tokenrepresentation ˜ h t and input token representation h s (Luong et al., 2015). score could Note that equation 14 blends the speciﬁcations of Luong et al. (2015) and Bahdanau et al. (2015).Luong et al. (2015) do not include z [ a t − ] . Luong et al. (2015) also do not explicitly state how theycompute ˜ h t . Bahdanau et al. (2015) use h t − instead of ˜ h t to represent the state of the decoder at t (orrather: at the moment just before producing the t th output token). ˜ h t and h s (Luong et al., 2015). The weight α ts isa measure of the degree of alignment of the sth input token, represented by h s , withthe t th output token, represented as ˜ h t . α ts is the probability that output token ˜ h t isaligned with (in the context of NMT: translated from) input token h s (Bahdanau et al.,2015). ℎ ! ℎ " ℎ ! [, ! ] [, " ] [& ! ] … [& " ] … ℎ " [, % ] [, & ] [, ' ] [, ( ] ℎ ( ℎ ) ℎ * ℎ + ℎ …&'%()*+,*%()*+ [, ( ] () *+ /*9*./ , +0)):ℎ [345][345] 37 ℎä1- … .ℎ % ' %! ' %" ' %$ ' % ' %& ' %' Figure 7:

Attention in an Encoder-Decoder Architecture.

Visualization of theattention mechanism in an encoder-decoder structure at time step t . In the attention mechanism,at each time step, i.e. for each output token, there is a token-speciﬁc context vector c t . c t is computed as the weighted sum over all input hidden states { h , . . . , h } . The weights are { α t, , . . . , α t, } . α t, captures the alignment between the t th output token, as represented by theinitial output hidden state ˜ h t , and input token hidden state h . Input hidden states that do not match with output token representation ˜ h t receive asmall weight such that their contribution vanishes, whereas input hidden states that arerelevant to output token ˜ h t receive high weights, thereby increasing their contribution(Alammar, 2018c). Hence, c t considers all input hidden states and especially attends tothose input hidden states that match with the current output token. As context vector c t is constructed at each time step, i.e. for each output token, based on a weighted sumof all input hidden states, the attention architecture allows for modeling dependenciesbetween tokens irrespective of their distance (Vaswani et al., 2017). The introduction of the attention mechanism constituted a signiﬁcant development insequence-to-sequence modeling. Yet, the original articles on attention use recurrentarchitectures in the encoder and decoder. The sequential nature of recurrent modelsimplies that within each training example sequence each token has to be processedone after another—a computationally not eﬃcient strategy (Vaswani et al., 2017). Toovercome this ineﬃciency and to enable parallel processing within training sequences,24aswani et al. (2017) introduced the Transformer architecture that completely abandonsrecurrence and solely rests on attention mechanisms.The Transformer consists of a sequence of six encoders followed by a stack of six decoders(see Figure 8) (Vaswani et al., 2017). Each encoder consists of two components: amulti-head self-attention layer (to be explained below) and a feedforward neural network(Vaswani et al., 2017). Each decoder also has a multi-head self-attention layer followed bya multi-head encoder-decoder attention layer and a feedforward neural network (Vaswaniet al., 2017).In the following, the structure and workings of the Transformer are explicated in detail.The Transformer architecture is depicted in Figure 8 and a comprehensive visualizationof the precisely explicated Transformer encoder is provided in Figure 9. When readingthrough this outline, two aspects are to be kept in mind. First, instead of processing eachtoken embedding of each training example one after another, the Transformer encodertakes as an input the whole set of embeddings for one training example and processesthis set of embeddings in parallel (Alammar, 2018b). Second, the focus of the followingexplanation will be on the ﬁrst of the Transformer encoders. The other encoders operatein the same way with the exception that they do not take as an input a set of T wordembeddings, { z [ a ] , . . . , z [ a t ] , . . . , z [ a T ] } , but T updated vector representations, denotedas { h ∗ , . . . , h ∗ t , . . . , h ∗ T } , that are produced by the previous encoder (see upper partof Figure 9) (Alammar, 2018b). Moreover, the T word embeddings entering the ﬁrstencoder are position-aware word embeddings (see bottom of Figure 9) (Vaswani et al.,2017). A position-aware word embedding is the sum of a pure word embedding vectorand a positional encoding vector (Vaswani et al., 2017). The positional encoding vectorcontains information on the position of the t th token within the input sequence, therebymaking the model aware of token positions (Alammar, 2018b). The ﬁrst element in a Transformer encoder is the multi-head self-attention layer. In theself-attention layer, the provided input sequence attends to itself. Instead of improvingthe representation of an output token by attending to relevant tokens in the input se-quence, the idea of self-attention is to improve the representation of an input token byattending to the tokens in the sequence in which it is embedded in (Alammar, 2018b).For example: If ‘The company is issuing a statement as it is bankrupt.’ were a sentenceto be processed, then the word embedding for the word ‘it’ that enters the Transformerwould not contain any information regarding which other token in the sentence ‘it’ isreferring to. Is it the company or the statement? In the self-attention mechanismthe representation for ‘it’ is updated by attending to, and incorporating informationfrom, other relevant tokens in this sentence (Alammar, 2018b). It therefore is to be Note that the number of encoders and decoders as well as the dimensionality of the input wordembeddings and the key, query and value vectors (introduced in the following) are Transformer hyper-parameters that are simply set by the authors to speciﬁc values. Other suitable values could be usedinstead. , B&'*)&;=*&?, ?7%) N?<+;=/+)>

Figure 8:

Transformer Architecture.

In the original article by Vaswani et al. (2017),the Transformer is made up of a stack of six encoders proceeded by a stack of six decoders.In contrast to recurrent architectures where each input token is handled one after another, theTransformer encoders processes the entire set of input token representations in parallel. Herethe input token embeddings are { z [ a ] , . . . , z [ a ] } . The sixth encoder passes the key and queryvectors of the input tokens, { k , q , . . . , k , q } , to each of the decoders. The decoders generatea prediction for one output token at a time. The hidden state of the last decoder is handed toa linear and softmax layer to produce a probability distribution over the vocabulary signifyingthe prediction for the next output token. [- ! ] . [- " ] . [- ] . [- $ ] . [- % ] . [- & ] ./0 $ ./0 ./0 ! ./0 " ./0 % ./0 & . [- ]∗ . [- " ]∗ . [- ! ]∗ . [- $ ]∗ . [- % ]∗ . [- & ]∗

30 73>367 4 0.22?ℎ 9:;+ + + + + += = = = = =, ! - ! > ! , - > , $ - $ > $ , " - " > " , % - % > % , & - & > & / ! / " / $ / / & / ' ! " $ & ' !∗ "∗ $∗ &∗ '∗ ℎ ! ℎ " ℎ $ ℎ ℎ & ℎ ' ℎ !∗ ℎ "∗ ℎ $∗ ℎ ℎ &∗ ℎ '∗ '()*+,-.**/0123(2/4/(056+,-.**/0123(2/4/(0 - - ;.5*<.6= - (,.5)0 - /01 & >5?.)A(),56/E54/(0C.2/*:56 >.5)0 - /01 & >5?.)A(),56/E54/(0 ...................................................... Figure 9:

Transformer Encoder Architecture.

This visualization details the processesin the ﬁrst Transformer encoder. The encoder comprises a multi-head self-attention layer and afeedforward neural network (FNN); each followed by residual learning and layer normalization.The ﬁrst encoder takes as an input position-aware word embeddings, { z [ a ] , . . . , z [ a ] } , that aretransformed into eight sets of key, query and value vectors. One set is { k , q , v , . . . , k , q , v } .These are processed in the multi-head self-attention layer to produce eight sets of context vectors(one set being { c , . . . , c } ). The sets then are concatenated and transformed linearly to be-come the updated representations { u , . . . , u } . After residual learning and layer normalization, { u ∗ , . . . , u ∗ } enter the FNN, whose output—after residual learning and layer normalization—are the updated representations produced by the ﬁrst Transformer encoder: { h ∗ , . . . , h ∗ } . Therepresentations { h ∗ , . . . , h ∗ } constitute the input to the next encoder, where they are ﬁrst trans-formed to sets of key, query and value vectors. ‘it’ absorbed some of the representation for ‘company’ and so encodesinformation on the alignment between ‘it’ and ‘company’ (Alammar, 2018b).The ﬁrst operation within a self-attention layer is that each input word embedding z [ a t ] istransformed into three separate vectors, called key k t , query q t , and value v t (see Figure9). The key, query, and value vectors are three diﬀerent projections of the input tokenembedding z [ a t ] (Alammar, 2018b). They are generated by matrix multiplication of z [ a t ] with three diﬀerent weight matrices, W k , W q , and W v (Vaswani et al., 2017): k t = z [ a t ] W k q t = z [ a t ] W q v t = z [ a t ] W v (17)Then, for each token t context vector c t is computed as a weighted sum of the valuevectors of all tokens { , . . . , t ∗ , . . . , T ∗ } that are in the same sequence as t (Vaswaniet al., 2017): c t = T ∗ (cid:88) t ∗ =1 α t,t ∗ v t ∗ (18)The weight α t,t ∗ captures the probability that token t , represented by q t , is aligned withtoken t ∗ that is represented as k t ∗ : α t,t ∗ = exp ( score ( q t , k t ∗ )) (cid:80) T ∗ t ∗ =1 exp ( score ( q t , k t ∗ )) (19)whereby score is ( q t k (cid:62) t ∗ ) / (cid:112) | k t ∗ | (Vaswani et al., 2017). Thus attention vector c t iscalculated as in a basic attention mechanism—except that the attention now is withrespect to the value vectors of the tokens t ∗ that are in the same sequence as token t (see also Figure 10). The self-attention mechanism outlined so far is conducted eight times in parallel (Vaswaniet al., 2017). Hence, for each token t eight diﬀerent sets of query, key and value vec-tors are generated and there will be not one but eight attention vectors { c t , . . . , c t } (Vaswani et al., 2017). In doing so, each attention vector can attend to diﬀerent tokens ineach of the eight diﬀerent representation spaces (Vaswani et al., 2017). For example, inone representation space the attention vector for token t may learn syntactic structuresand in another representation space the attention vector may attend to semantic con-nections (Vaswani et al., 2017). In the example sentence from above, the ﬁrst attention Note that for reasons of computational eﬃciency, k t , q t , and v t typically are set to have a lowerdimensionality than z [ a t ] (Vaswani et al., 2017). In the original article Vaswani et al. (2017) let the wordembeddings and the updated word vector representations outputted by the encoders have dimensionality512, whereas each of k t , q t , and v t has dimensionality 64. Note that in a self-attention mechanism the query vector of token t , q t , is scored with the keyvectors of all tokens in the same sequence—and thus also with itself (see Figure 10). The score of thequery vector of t , q t , and the key vector of t , k t , is likely to be very high compared to the other scores(Alammar, 2018b). Yet, the model also is likely to learn that attention to other tokens in the sequencecan be beneﬁcial for representing token t (Alammar, 2018b). ! < [, ! ] < [, " ] < [, ] < [, $ ] < [, % ] < [, !! ] % - ' (,! ) ! * ! < [, & ] < [, ' ] < [, ( ] < [, ) ] < [, !* ]( " ) " * " ( $ ) $ * $ ( ) * ( & ) & * & ( ' ) ' * ' ( * ) * * * ( ( ) ( * ( ( + ) + * + ( !, ) !, * !, ( !! ) !! * !! ' (," ' (,$ ' (, ' (,& ' (,' ' (,* ' (,( ' (,+ ' (,!, ' (,!! Figure 10:

Attention Mechanism in the Transformer.

Illustration of the attentionmechanism in the ﬁrst Transformer encoder for the 8th token ( ‘it’ ) in the example sentence ‘Thecompany is issuing a statement as it is bankrupt.’ . The arrows pointing from the value vectors { v , . . . , v } to context vector c are the weights { α , , . . . , α ,t ∗ , . . . , α , } . A single weight α ,t ∗ gives the probability that token 8, represented by q , is aligned with token t ∗ representedas k t ∗ . The larger α ,t ∗ is assumed to be in this example, the thicker the arrow and the darkerthe corresponding value vector. The dotted lines symbolize the computation of the weights { α , , . . . , α ,t ∗ , . . . , α , } . ‘it’ , c , , may have a high attention weight for ‘company’ , whereasanother attention vector, say c , , may more strongly attend to ‘bankrupt’ (Alammar,2018b). Because of the self-attention mechanism is implemented eight times in paral-lel and generates eight attention vectors (or heads), the procedure is called multi-headself-attention (Vaswani et al., 2017).The eight attention vectors subsequently are concatenated into a single vector, c t =[ c t ; . . . ; c t ], and multiplied with a corresponding weight matrix W to produce vector u t (Vaswani et al., 2017). u t constitutes an updated representation of token t . Itincorporates the information of tokens in the same sequence that was captured by self-attention.Before being passed to the feedforward neural network, u t is added to z [ a t ] , therebyallowing for residual learning (He et al., 2015). Then, layer normalization as suggestedin Ba et al. (2016) is conducted to reduce training time (Vaswani et al., 2017). u ∗ t = LayerN orm ( u t + z [ a t ] ) (20) u ∗ t then enters the feedforward neural network with a ReLU activation function (Vaswaniet al., 2017) h t = max (0 , u ∗ t W + b ) W + b (21)followed by a residual connection with layer normalization (Vaswani et al., 2017): h ∗ t = LayerN orm ( h t + u ∗ t ) (22) h ∗ t ﬁnally is the representation of token t produced by the encoder. The entire sequenceof representations, { h ∗ , . . . , h ∗ t , . . . , h ∗ T } , serves as the input for the next encoder thatgenerates eight sets of query, key and value vectors from each representation h ∗ t to imple-ment multi-head self attention and to ﬁnally produce an updated set of representations, { h ∗ , . . . , h ∗ t , . . . , h ∗ T } ∗ , that are passed to the next encoder and so on.The last encoder from the stack of encoders produces the key and value vectors from itsproduced sequence of updated representations and passes these to each encoder-decodermulti-head attention layer in each decoder (see again Figure 8) (Vaswani et al., 2017).The decoder then generates one output token t at a time (Vaswani et al., 2017). The In residual learning, instead of leaning a new representation in each layer, merely the residual changeis learned (He et al., 2015). Here u t can be conceived of as the residual on the original representation z [ a t ] . Residual learning has been shown to facilitate the optimization of very deep neural networks (Heet al., 2015). In layer normalization, for each training instance, the values of the hidden units within a layerare standardized by using the mean and standard deviation of the layer’s hidden units (Ba et al.,2016). Layer normalization reduces training time and enhances generalization performance due to itsregularizing eﬀects (Ba et al., 2016). masked meaning that the attention vector for output token t can onlyattend to output tokens preceding token t (Vaswani et al., 2017). The Transformer advances the study of text as it enables the representation for eachtoken to encode information from other tokens in the same sequence. Irrespective ofthe distance between the tokens, this allows for modeling dependencies between tokensand context-dependent meanings of tokens. The utilization of pretrained representationsin a transfer learning setting is likely to increase prediction performances in text-basedsupervised learning tasks—especially for small training corpora.Taken together, the Transformer architecture in combination with transfer learning liter-ally transformed the ﬁeld of NLP. After the introduction of the Transformer by Vaswaniet al. (2017) several models for transfer learning that included elements of the Trans-former were developed (e.g. Clark et al., 2020; Devlin et al., 2019; Lan et al., 2020; Liuet al., 2019; Radford et al., 2018; Raﬀel et al., 2020; Yang et al., 2019). These modelsand their derivatives signiﬁcantly outperformed previous state-of-the-art models and sofar generate the highest performances on a wide spectrum of NLP tasks (see for examplethe leaderbords of common benchmark datasets as GLUE , SuperGLUE , SQuAD and RACE ).An important step within these developments was the introduction of BERT (Devlinet al., 2019). By establishing new state-of-the-art performance levels for eleven NLPtasks (Devlin et al., 2019), BERT demonstrated the power of transfer learning andcaused great excitement in the NLP community (Alammar, 2018a). The introduction ofBERT ﬁnally paved the way to a new transfer learning-based mode of learning in whichit is common to use a pretrained language representation model oﬀ-the-shelf and adaptit to a speciﬁc target task as needed (Alammar, 2018a).BERT can be considered the seminal Transformer-based model for transfer learning. https://gluebenchmark.com/leaderboard https://super.gluebenchmark.com/leaderboard https://rajpurkar.github.io/SQuAD-explorer/ The classic language modeling pretraining task (see equations 10 and 11), that for ex-ample is employed by language representation models such as ELMo, ULMFiT andOpenAI GPT, has one disadvantage: it is strictly unidirectional (Devlin et al., 2019).A forward language model predicts the probability for the next token t given the so farpredicted tokens, P ( a t | a , . . . , a t − ). Here, the model can only access information fromthe preceding tokens { a , . . . , a t − } but not from the following tokens { a t +1 , . . . , a T } . Ina self-attention mechanism this means that the context vector for token t can merelyattend to, and hence can only incorporate information from, the representations of pre-ceding but not from succeeding tokens (Devlin et al., 2019). The same is true for abackward language model in which the next token is predicted given all its followingtokens, P ( a t | a t +1 , . . . , a T ) (Yang et al., 2019). A backward language model can onlyoperate on, and capture information from, succeeding tokens.There is reason to believe that a representation of token t from a bidirectional modelthat simultaneously can attend to preceding and succeeding tokens constitutes a betterrepresentation of token t than a representation stemming from a unidirectional languagemodel (Devlin et al., 2019). The concatenation of representations learned by a forwardlanguage model with the representations of a backward language model, however, doesnot generate representations that genuinely draw from left and right contexts (Devlinet al., 2019). The reason is that the forward and backward representations are learnedseparately and each representation captures information only from a unidirectional con-text (Yang et al., 2019).The authors inventing BERT sought to tackle this issue: BERT is a language repre-sentation model for sequential transfer learning that utilizes the Transformer encoderand masked language modeling, which is an adapted variant of the traditional languagemodeling pretraining task, to learn deep and bidirectional representations (Devlin et al.,2019).The following subsections provide information on the basic architecture of BERT, explainthe masked language modeling task, explicate the input format required by BERT, anddescribe the pretraining and ﬁne-tuning speciﬁcations. Figures 11 and 12 illustrate theseaspects. 32 .1.1 Architecture BERT consists of a stack of Transformer encoders and comes in two diﬀerent modelsizes (Devlin et al., 2019): BERT BASE consists of 12 stacked Transformer encoders.In each encoder there are 12 attention heads in the multi-head self-attention layer.The dimensionality of the input word embeddings and the updated hidden word vectorrepresentations is 768. With this model size, BERT

BASE has 110 million parameters.BERT

LARGE has 24 Transformer encoders with 16 attention heads and a hidden vectorsize of 1024. BERT

LARGE has 340 million parameters.As in the original Transformer, the ﬁrst BERT encoder takes as an input a sequenceof embedded tokens, { z [ a ] , . . . , z [ a t ] , . . . , z [ a T ] } , processes the embeddings in parallelthrough the self-attention layer and the feedforward neural network to generate a set ofupdated token representations, { h ∗ , . . . , h ∗ t , . . . , h ∗ T } , that are then passed to the nextencoder that also generates updated representations to be passed to the next encoderand so on until the representations ﬁnally enter output layers for prediction (Alammar,2018a). To conduct the masked language modeling task, in each input sequence, 15% of thetoken embeddings are selected at random (Devlin et al., 2019). The selected tokensare indexed as { , . . . , q, . . . , Q } here. 80% of the Q selected tokens will be replaced bythe ‘[MASK]’ token (Devlin et al., 2019). 10% of the selected tokens are supplantedwith another random token and 10% of selected tokens remain unchanged (Devlin et al.,2019). The task then is to correctly predict all Q tokens sampled for the task based ontheir respective input token representation.To illustrate (see Figure 11): Assume that in the lowercased example sequence consistingof the segment pair ‘he starts to speak. the nervous crowd is watch-ing him.’ the tokens ‘speak’ and ‘nervous’ were sampled to be masked. ‘speak’ is replaced by the ‘[MASK]’ token and ‘nervous’ is replaced by the random token ‘that’ . The model’s task is topredict the tokens ‘speak’ and ‘nervous’ from the representation vectors it learns atthe positions of the input token embeddings of ‘[MASK]’ and ‘that’ . In doing so, self-attention is possible with regard to all—instead of only preceding or only succeeding—tokens in the same sequence and thus the learned representations for all tokens in thesequence can capture encoded information from bidirectional contexts (Devlin et al.,2019). Moreover, as the masked language modeling task is to predict the correct tokenfor all of the Q selected tokens —not only those replaced by the ‘[MASK]’ token—, thisensures that BERT does not know for which tokens it will have to make a prediction(Devlin et al., 2019). Thereby BERT is forced to learn a suitable representation for each token in the entire sequence; whether selected for prediction or not (Devlin et al., In the feedforward neural networks Devlin et al. (2019) employ the Gaussian Error Linear Unit(GELU) (Hendrycks and Gimpel, 2016) instead of the ReLU activation function. This change in theactivation function has also been used for the OpenAI GPT (Radford et al., 2018). " $&%=&$ &9 [-./] &ℎ" ("=B9A$ 2=91I 1%&2ℎ A [G ! ]∗ *9$ ' A [G " ] A [G ] A [G ! ] A [G $ ] A [G % ] A [G & ] A [G ' ] A [G ( ] A [G ) ] A [G "* ] A [G "" ] A [G " ] A [G "! ] A [G "$ ] *9$ ( *9$ ) *9$ * *9$ + *9$ , *9$ - *9$ . *9$ / *9$ '0 *9$ '' *9$ '( *9$ ') *9$ '* $") $") $") $") $") $") $") $") $") $") $") $") $") $") A [G ]∗ A [G " ]∗ A [G $ ]∗ A [G % ]∗ A [G & ]∗ A [G ' ]∗ A [G ( ]∗ A [G ) ]∗ A [G "* ]∗ A [G "" ]∗ A [G " ]∗ A [G "! ]∗ A [G "$ ]∗ + + + + + + + + + + + + + ++ + + + + + + + + + + + + + = = = = = = = = = = = = = = ℎ $∗ ℎ ℎ !∗ ℎ "∗ ℎ %∗ ℎ &∗ ℎ J∗ ℎ K∗ ℎ L∗ ℎ !M∗ ℎ !!∗ ℎ ! ℎ !$∗ ℎ !"∗ ;;< ;;< ;;. @ F)/1/056<.G:.0H.'()*+,-.**/012952B.*<.G:.0H.3(2/4/(056+,-.**/012<.1,.04+,-.**/012@/056 I0J:4C.J).2.0454/(0<45HB (=+0H(*.)2F:4J:4>5?.) $*"%J$&%=&$

Figure 11:

Pretraining BERT.

Architecture of BERT in pretraining. Two tokens aresampled for the masked language modeling task. ‘speak’ is replaced by the ‘[MASK]’ token and ‘nervous’ is replaced by a random token ( ‘that’ ). The model has to make predictions for bothtokens (not only the ‘[MASK]’ token). ℎ" 9**9 *%=&D +"%I"= &ℎ" 3 ? [G ! ]∗ *9$ ' ? [G " ] ? [G ] ? [G ! ] ? [G $ ] ? [G % ] ? [G & ] ? [G ' ] ? [G ( ] ? [G ) ] ? [G "* ] ? [G "" ] ? [G " ] ? [G "! ] ? [G "$ ] *9$ ( *9$ ) *9$ * *9$ + *9$ , *9$ - *9$ . *9$ / *9$ '0 *9$ '' *9$ '( *9$ ') *9$ '* $") $") $") $") $") $") $") $") $") $") $") $") $") $") ? [G ]∗ ? [G " ]∗ ? [G $ ]∗ ? [G % ]∗ ? [G & ]∗ ? [G ' ]∗ ? [G ( ]∗ ? [G ) ]∗ ? [G "* ]∗ ? [G "" ]∗ ? [G " ]∗ ? [G "! ]∗ ? [G "$ ]∗ + + + + + + + + + + + + + ++ + + + + + + + + + + + + + = = = = = = = = = = = = = = ℎ $∗ ℎ ℎ !∗ ℎ "∗ ℎ %∗ ℎ &∗ ℎ J∗ ℎ K∗ ℎ L∗ ℎ !M∗ ℎ !!∗ ℎ ! ℎ !$∗ ℎ !"∗ '()*+,-.**/012I0J:4<.G:.0H.3(2/4/(056+,-.**/012<.1,.04+,-.**/012@/056 I0J:4C.J).2.0454/(0<45HB (=+0H(*.)2F:4J:4>5?.) '∗ ) Figure 12:

Fine-Tuning BERT.

Architecture of BERT during ﬁne-tuning on a single se-quence classiﬁcation task.

To accommodate for the pretraining tasks and to prepare for a wide range of downstreamtarget tasks, the input format accepted by BERT consists of the following elements (seeFigure 11): • Each sequence of tokens { a , . . . , a t , . . . , a T } is set to start with the classiﬁcationtoken ‘[CLS]’ (Devlin et al., 2019). After ﬁne-tuning, the ‘[CLS]’ token functionsas an aggregate representation of the entire sequence and is used as an inputfor single sequence classiﬁcation target tasks such as sentence sentiment analysis(Devlin et al., 2019). • The separation token ‘[SEP]’ is used to separate diﬀerent segments (Devlin et al.,2019). • Each token a t is represented by the sum of its token embedding with a positionalembedding and a segment embedding (Devlin et al., 2019). – Token embeddings: BERT employs the WordPiece tokenizer and uses a vo-cabulary of 30 ,

000 features (Devlin et al., 2019; Wu et al., 2016). WordPiece(Schuster and Nakajima, 2012) is a variant of the Byte-Pair Encoding (BPE)subword tokenization algorithm. (For more information on subword tokeniza-tion algorithms see the explanatory box at the end of this subsection.) – Positional embeddings: Due to memory restrictions, the maximum sequencelength that BERT can process is limited to 512 tokens (Devlin et al., 2018).Consequently, the positional embeddings can distinguish at maximum 512positions and BERT cannot handle input sequences that comprise more than512 tokens (Devlin et al., 2019). – Segment embeddings: The segment embeddings allow the model to distin-guish segments. All tokens belonging to the same segment have the samesegment embedding (Devlin et al., 2019). • In practical software-based implementations, it is not uncommon for BERT-likemodels to require all input sequences to have the same length (The HuggingFace36eam, 2020a). To meet this requirement, the text sequences are tailored to thesame length by padding or truncation (The HuggingFace Team, 2020a). Truncationis typically employed if text sequences exceed the maximum accepted sequencelength and means that excess tokens are removed. In padding, a padding token( ‘[PAD]’ ) is repeatedly added to a sequence until the desired length is reached(McCormick and Ryan, 2019). ——————————————————————————————————Subword Tokenization Algorithms.

Subword tokenization algorithms try to ﬁndan eﬀective balance between word-level tokenization (which tends to result in a largevocabulary—and hence a large embedding matrix that consumes a lot of memory andthus taxes limited computational resources) and character-level tokenization (which gen-erates a small and ﬂexible vocabulary but does not yield as well-performing representa-tions of text) (The HuggingFace Team, 2020c; Radford et al., 2019). Subword tokeniza-tion algorithms typically result in vocabularies in which frequently occurring charactersequences are merged to form words whereas less common character sequences becomesubwords or remain separated as single characters (Radford et al., 2019). The BPEalgorithm and variants thereof are subword tokenization algorithms employed in manyTransformer-based models (e.g. Devlin et al., 2019; Liu et al., 2019; Radford et al., 2019).The base BPE algorithm starts with a list of all the unique characters in a corpus andthen learns to merge the characters into longer character sequences (eventually formingsubwords and words) until a desired vocabulary size is reached (Sennrich et al., 2016).In the WordPiece variant of BPE, the algorithm merges at each step the character pairthat, when merged, results in the highest increase in the likelihood of the training corpuscompared to all other pairs (Schuster and Nakajima, 2012). ——————————————————————————————————6.1.4 Pretraining BERT

In the masked language modeling pretraining task, for each token q , that has beensampled for prediction, the updated token representation produced by the last encoder h ∗ q is fed into a single-layer feedforward neural network with a softmax output layer togenerate a probability distribution over the terms in the vocabulary predicting the termcorresponding to q (see Figure 11) (Alammar, 2018a; Devlin et al., 2019). For the nextsentence prediction task, the representation for the [CLS] token, h ∗ , is processed viaa single-layer feedforward neural network with a softmax output to give the predictedprobability of the second segment succeeding the ﬁrst segment (see Figure 11) (Alammar,2018a; Devlin et al., 2019). The loss function in pretraining is the sum of the averageloss from the masked language modeling task and the average loss from next sentenceprediction (Devlin et al., 2019). 37ERT is pretrained based on the BooksCorpus (Zhu et al., 2015) and text passagesfrom the English Wikipedia (Devlin et al., 2019). Taken together the pretraining corpusconsists of 3 . i th iteration for each individualparameter the estimate of the gradient’s average for this parameter is updated based ona parameter-speciﬁc learning rate (Devlin et al., 2019; Kingma and Ba, 2015). Theyuse a learning rate schedule in which the global Adam learning rate (that is individuallyadapted per parameter) linearly increases during the ﬁrst 10,000 iterations (the warmup)to reach a maximum value of 1 e − L weight decay with the hyperparameter λ , that balances the relative weight given to the penalty vs. the loss function, set to 0 . p = 0 . , ,

000 iterations, which implies that they train themodel for around 40 epochs; i.e. they make around 40 passes over the entire 3 . The token representations learned during pretraining BERT afterward can be frozen andtaken as an input for a target task-speciﬁc architecture as in a classic feature extractionapproach (Devlin et al., 2019). The more common way to use BERT, however, is toﬁne-tune BERT on the target task. Here, only the output layer from pretraining isexchanged with an output layer tailored for the target task (Devlin et al., 2019). Otherthan that, the same model architecture is used in pretraining and ﬁne-tuning (compareFigures 11 and 12) (Devlin et al., 2019).If the target task is to classify single input sequences into a set of predeﬁned categories(see Figure 12), the hidden state vector generated by the last Transformer encoder forthe [CLS] token, h ∗ , enters an output softmax layer to generate a vector o (Devlin et al.,2018). o = sof tmax ( h ∗ W ) (23) o ’s dimensionality corresponds to C —the number of categories in the target classiﬁcationtask. For each of its C elements, o gives the predicted probability of the input sequence Here the individual learning rate is inversely proportional to the average of the squared gradient—such that the learning rate is smaller for large gradients and higher for smaller gradients (Goodfellowet al., 2016). The gradient’s average and the squared gradient’s average are exponentially weightedmoving averages with decay rates β , β ∈ [0 ,

1) to assign an exponentially decaying weight to gradientsfrom long ago iterations (Goodfellow et al., 2016; Kingma and Ba, 2015). Devlin et al. (2019) set β to0 . β to 0 . c . Note that during ﬁne-tuning not only weight matrix W in equation23 but all parameters of BERT are updated (Devlin et al., 2018).Based on their experiences with adapting BERT on various target tasks, the authorsrecommend to use for ﬁne-tuning a mini-batch size of 16 or 32 sequences and a globalAdam learning rate of 5 e −

5, 3 e −

5, or 2 e − A helpful way to describe and categorize the various Transformer-based models for trans-fer learning, is to diﬀerentiate them according to their pretraining objective and theirmodel architecture (The HuggingFace Team, 2020b). The major groups of models in thiscategorization scheme are autoencoding models, autoregressive models, and sequence-to-sequence models (The HuggingFace Team, 2020b).

In their pretraining task, autoencoding models are presented with input sequences thatare altered at some positions (Yang et al., 2019). The task is to correctly predict theuncorrupted sequence (Yang et al., 2019). The models’ architecture is composed ofthe encoders of the Transformer which implies that autoencoding models can access theentire set of input sequence tokens and can learn bidirectional token representations (TheHuggingFace Team, 2020b). Autoencoding models tend to be especially high performingin sequence or token classiﬁcation target tasks (The HuggingFace Team, 2020b). BERTwith its masked language modeling pretraining task is a typcial autoencoding model(Yang et al., 2019).Among the various extensions of BERT that have been developed since its introductionin 2018, RoBERTa (Liu et al., 2019) and ALBERT (Lan et al., 2020) are widely known.RoBERTa makes changes in the pretraining and hyperparameter settings of BERT. Forexample: RoBERTa is only pretrained on the masked language modeling and not thenext sentence prediction task (Liu et al., 2019). Masking is performed dynamically eachtime before a sequence is presented to the model instead of being conducted once in datapreprocessing (Liu et al., 2019). Instead of WordPiece tokenization, RoBERTa employsbyte-level BPE and a vocabulary size of 50 ,

000 features (Liu et al., 2019). Moreover,RoBERTa is pretrained on more data and more heterogeneous data (e.g. also on webcorpora) with a mini-batch size of 8 ,

000 for a longer time (ca. 160 epochs) (Liu et al.,2019). These changes enhance the original BERT prediction performance on GLUE,SQuaD, and RACE (Liu et al., 2019).ALBERT (Lan et al., 2020) aims at a parameter eﬃcient design. By decoupling the sizeof the input word embedding layers from the size of the hidden layers and by sharingparameters across all layers, ALBERT substantially reduces the number of parame-ters to be learned (e.g. by a factor of 18 comparing ALBERT-Large to BERT

LARGE )39Lan et al., 2020). Parameter reduction has regularizing eﬀects, and—because it savescomputational resources—allows to construct a deeper model with more and/or largerhidden layers whose increased capacity beneﬁts performance on target tasks while stillcomprising fewer parameters than the original BERT

LARGE (Lan et al., 2020).Whereas BERT, RoBERTa, and ALBERT make use of the masked language modelingtask, ELECTRA introduces a new, more resource eﬃcient pretraining objective, namedreplaced token detection (Clark et al., 2020). ELECTRA addresses the issue that inmasked language modeling for each input sequence predictions are made only for those15% of tokens that have been sampled for the task, thereby reducing the amount ofwhat could be learned from each training sequence (Clark and Luong, 2020). In pre-training, ELECTRA has to predict for each input token in each sequence whether thetoken comes from the original sequence or has been replaced by a plausible fake to-ken (Clark and Luong, 2020; Clark et al., 2020). Thus, ELECTRA (the discriminator)solves a binary classiﬁcation task for each token and is much more eﬃcient in pretrain-ing requiring fewer computational resources (Clark and Luong, 2020; Clark et al., 2020).The plausible fake tokens come from a generator that is trained on a masked languagemodeling task together with the ELECTRA discriminator (Clark et al., 2020). Afterpretraining, the generator is removed and only the ELECTRA discriminator is used forﬁne-tuning (Clark and Luong, 2020). On the GLUE benchmark, ELECTRA achievesperformances comparable to RoBERTa and XLNet whilst using only a small proportionof their computational resources in pretraining (Clark et al., 2020).One major disadvantage of pretrained language representation models that are basedon the self-attention mechanism in the Transformer is that currently available hard-ware does not allow Transformer-based models to process long text sequences (Beltagyet al., 2020). The reason is that the memory and time required increase quadraticallywith sequence length (Beltagy et al., 2020). Long text sequences thereby quickly exceedmemory limits of presently existing graphics processing units (GPUs) (Beltagy et al.,2020). Transformer-based pretrained models therefore typically induce a maximum se-quence length (typically of 512 tokens). Simple workarounds for processing sequenceslonger than 512 tokens (e.g. truncating texts or processing them in chunks) leads to in-formation loss and potential errors (Beltagy et al., 2020). To solve this problem, variousworks present procedures for altering the Transformer architecture such that longer textdocument can be processed (Beltagy et al., 2020; Child et al., 2019; Dai et al., 2019;Kitaev et al., 2020; Wang et al., 2020).Here, one of these models, the Longformer (Beltagy et al., 2020), is presented in moredetail. The Longformer introduces a new variant of the attention mechanism such thattime and memory complexity does not scale quadratically but linearly with sequencelength and thus longer texts can be processed (Beltagy et al., 2020). The attentionmechanism in the Longformer is composed of a sliding window as well as global attentionmechanisms for speciﬁc preselected tokens (Beltagy et al., 2020). In the sliding window,each input token t —instead of attending to all tokens in the sequence—attends only to aﬁxed number of tokens to the left and right of t (Beltagy et al., 2020). In order to learn40epresentations better adapted to speciﬁc NLP tasks, the authors use global attentionfor speciﬁc tokens on speciﬁc tasks (e.g. for the ‘[CLS]’ token in sequence classiﬁcationtasks) (Beltagy et al., 2020). These preselected tokens directly attend to all tokensin the sequence and enter the computation of the attention vectors of all other tokens(Beltagy et al., 2020). The position embeddings of the Longformer allow to process textsequences of up to 4 ,

096 tokens (Beltagy et al., 2020). This Longformer-speciﬁc attentionmechanism can be used as a plug-in replacement of the original attention mechanism inany Transformer-based model for transfer learning (Beltagy et al., 2020). In the originalarticle, the Longformer attention mechanism is inserted into the RoBERTa architecture(Beltagy et al., 2020). The Longformer then is pretrained by continuing to pretrainRoBERTa with the Longformer attention mechanism on the masked language modelingtask (Beltagy et al., 2020).

Autoregressive models are pretrained on the classic language modeling task (see equa-tions 10 and 11) (Yang et al., 2019). They learn a forward language model in whichthey are trained to predict the next token given all the preceding tokens in the sequence, P ( a t | a , . . . , a t − ), and/or a backward language model in which the next token is pre-dicted given all its succeeding tokens, P ( a t | a t +1 , . . . , a T ) (Yang et al., 2019). Hence,autoregressive models are not capable to genuinely learn bidirectional representationsthat draw from left and right contexts (Yang et al., 2019).In correspondence with this pretraining objective, their architecture typically is basedonly on the decoders of the Transformer (The HuggingFace Team, 2020b). An autore-gressive model predicts one token at a time and the self-attention layer of its decodersare masked such that the model only can attend to the preceding but not the proceedingtokens (The HuggingFace Team, 2020b). Due to the characteristics of their pretrainingtask, autoregressive models typically are very good at target tasks in which they have togenerate text (The HuggingFace Team, 2020b). Autoregressive models, however, can besuccessfully ﬁne-tuned to a large variety of downstream tasks (The HuggingFace Team,2020b). Examples of autoregressive models are XLNet (Yang et al., 2019), as well asOpenAI GPT (Radford et al., 2018) and its successors GPT-2 (Radford et al., 2019) andGPT-3 (Brown et al., 2020; The HuggingFace Team, 2020b) .Strictly speaking, XLNet (Yang et al., 2019) is not an autoregressive model (The Hug-gingFace Team, 2020b). Yet, the permutation language modeling objective that it intro-duces builds on the autoregressive language modeling framework (Yang et al., 2019). Theauthors of XLNet seek a pretraining objective that learns bidirectional representationsas in autoencoding models whilst overcoming problems of autoencoding representations:ﬁrst, the pretrain-ﬁnetune discrepancy that results from the ‘[MASK]’ tokens only oc-curring in pretraining, and, second, the assumption that the tokens selected for themasked language modeling task in one sequence are independent of each other (Yanget al., 2019). Given a sequence whose tokens are indexed { , . . . , T } , the permutationlanguage modeling objective makes use of the permutations of the token index { , . . . , T } { , . . . , T } , the task is to predictthe next token in the permutation order given the previous tokens in the permutation(Yang et al., 2019). In doing so, the learned token representations can access informationfrom left and right contexts whilst the autoregressive nature of the modeling objectiveavoids the pretrain-ﬁnetune discrepancy and the independence assumption (Yang et al.,2019). XLNet achieves high prediction performances across many NLP task (Yang et al.,2020).The research direction taken by the series of GPT models—OpenAI GPT (Radford et al.,2018), GPT-2 (Radford et al., 2019) and GPT-3 (Brown et al., 2020)—increasingly turnstoward a new goal: Ultimately the aim of this strand of NLP research is to have a modelthat generalizes well to a wide spectrum of target tasks without being explicitly trainedon the target tasks (Brown et al., 2020; Radford et al., 2019). Especially the work onGPT-3 has demonstrated that large language representation models that are pretrainedon language modeling tasks on excessively large corpora can sometimes come close toachieving acceptable prediction performances without ﬁne-tuning (i.e. without gradientupdates) by merely being presented with 10 to 100 target task-speciﬁc examples (few-shot learning), a single target example (one-shot learning), or even none example (zero-shot learning) (Brown et al., 2020). So far, key to increasing the few-shot no-ﬁne-tuninglearning performances seems to be an increase in the models’ capacity to learn complexfunctions as determined by the number of model parameters (Brown et al., 2020). Additionally, and in correspondence with an increase in model parameters, the size ofthe employed training corpora increase rapidly as well (Brown et al., 2020; Radfordet al., 2019). Given its sheer size, training the GPT-3 is prohibitively expensive (Brownet al., 2020; Riedl, 2020). Moreover, whereas the source code of language representationmodels typically is open sourced by the companies (e.g. Google, Facebook, Microsoft)that developed these models, OpenAI decided not to share the code on GPT-3 andinstead to allow using GPT-3 for downstream tasks via an API, thereby raising questionsregarding accessibility and replicability of language representation models for research(Brockman et al., 2020; Riedl, 2020).

The architecture of sequence-to-sequence models contains Transformer encoders anddecoders (The HuggingFace Team, 2020b). They are pretrained on sequence-to-sequencetasks, e.g. translation, and consequently are especially suited for sequence-to-sequencelike downstream tasks as translating or summarizing input sequences (The HuggingFaceTeam, 2020b). The Transformer itself is a sequence-to-sequence model for translationtasks (The HuggingFace Team, 2020b).The T5 (Raﬀel et al., 2020) is another well-known sequence-to-sequence model applicableto a large variety of target tasks (The HuggingFace Team, 2020b). The T5 is very close Whilst the original OpenAI GPT comprises 117 million parameters, GPT-2 has 1 ,

542 million (Rad-ford et al., 2019) and GPT-3 has 175 ,

000 million parameters (Brown et al., 2020).

42o the original Transformer encoder-decoder architecture (Raﬀel et al., 2020). It is basedon the idea to consider all NLP tasks as text-to-text problems (Raﬀel et al., 2020). Toachieve this, each input sequence that is fed to the model is preceded by a task-speciﬁcpreﬁx, that instructs the model what to do. For example (see Raﬀel et al., 2020): Atranslation task in this scheme has the input: ‘translate from English to German: Ilove this movie.’ and the model is trained to output: ‘Ich liebe diesen Film.’ . For asentiment classiﬁcation task on the SST-2 Dataset, the input would be: ‘sst2 sentence:I love this movie.’ and the model is trained to predict one of ‘positive’ or ‘negative’ .The fact that there is a shared scheme for all NLP tasks, allows T5 to be pretrained on amultitude of diﬀerent NLP tasks before being ﬁne-tuned on a speciﬁc target task (Raﬀelet al., 2020). In multi-task pretraining T5 is trained on a self-supervised objective similarto the masked language modeling task in BERT as well as various diﬀerent supervisedtasks (such as translation or natural language inference) (Raﬀel et al., 2020). With thismulti-task pretraining setting, in which the parameters learned during pretraining areshared across diﬀerent tasks, the T5, rather than being a standard sequential transferlearning model, implements a softened version of multi-task learning (Raﬀel et al., 2020;Ruder, 2019). While neural transfer learning with Transformers has triggered signiﬁcant developmentsin NLP, the models are not free of shortcomings. One major problem, as discussedabove, is the ﬁxed (typically relatively small) maximum sequence length the modelscan process. Increasingly powerful hardware is likely to result in the ability to handleincreasingly longer sequences. Whatever the given current computational restrictions,eﬃcient modiﬁcations of the self-attention mechanism, as for example presented by theLongformer (Beltagy et al., 2020), allow for longer sequences to be processed than withthe original Transformer and thereby constitute important steps toward alleviating thismajor drawback.The performance of NLP models as evaluated via accuracy measures on held-out testsets, has risen substantially with the NLP developments during the last years (see e.g.Wang et al., 2019a). When evaluating the models via other means (e.g. behavioral test-ing (Ribeiro et al., 2020)), however, it is revealed that accuracy-based performances oncommon benchmark data sets overestimate the models’ linguistic and language under-standing capabilities (Hendrycks et al., 2020; Ribeiro et al., 2020). BERT, for example,is found to have high failure rates for simple negation tests (e.g. classifying 84 .

4% ofpositive or neutral tweets in which a negative sentiment expression is negated into thenegative category) (Ribeiro et al., 2020).Another major concern, also for researchers that seek to apply deep pretrained modelsfor transfer learning, is the comparability of the models (Aßenmacher and Heumann,2020). In NLP, it is an established procedure to evaluate pretrained language repre-sentation models on a set of common benchmark data sets. Yet, because models withdiﬀerent architectures and number of parameters are pretrained on corpora of various43izes and textual types for varying amounts of time with dissimilar amounts of comput-ing resources, this does not create comparability between the models (Aßenmacher andHeumann, 2020). Currently, the discipline lacks procedures that would allow to satis-factorily and in a fair way diﬀerentiate the eﬀects that distinct components—e.g. themodeling objective used in pretraining, architectural elements, model sizes, hyperparam-eter settings, pretraining data—have on model performance (Aßenmacher and Heumann,2020). Progress in this direction that also takes into account eﬃciency concerns and theﬁne-tuning process would be highly useful for researchers interested in applying pre-trained models to their domain-speciﬁc target tasks.

To practically implement deep learning models, it is advisable to have access to a graphicsprocessing unit (GPU). In contrast to a central processing unit (CPU), a GPU comprisesmany more cores and can conduct thousands of operations in parallel (Caulﬁeld, 2009).GPUs thus handle tasks that can be broken down into smaller, simultaneously executablesubtasks much more eﬃciently than CPUs (Caulﬁeld, 2009). When training a neuralnetwork via stochastic gradient descent, each single hidden unit within a layer usuallycan be updated independently of the other hidden units in the same layer (Goodfellowet al., 2016). Hence, neural networks lend themselves to parallel processing.A major route to access and use GPUs is via NVIDIA’s CUDA framework (Goodfellowet al., 2016). Yet, instead of additionally learning how to write CUDA code, researchersuse libraries that enable CUDA GPU processing (Goodfellow et al., 2016). As of to-day, PyTorch (Paszke et al., 2019) and TensorFlow (Abadi et al., 2015) are the mostcommonly used libraries that allow training neural networks via CUDA-enabled GPUs.Both libraries have Python interfaces. Therefore, to eﬃciently train deep learning mod-els via GPU acceleration, researchers can use a programming language they are familiarwith.Another obstacle is having a GPU at hand that can be used for computation. Thecomputing infrastructures of universities and research institutes typically provide theirmembers access to GPU facilities. Free GPU usage also is available via Google Colab-oratory (or Colab for short): https://colab.research.google.com/notebooks/intro.ipynb.Colab is a computing service that allows its user to run Python code via the browser(Google Colaboratory, 2020). Here, GPUs can be used free of cost. The free resources,however, are not guaranteed and there may be usage limits. One issue researchers haveto keep in mind when using Colab is that at each session another type of GPU maybe assigned. Documenting the used computing environment hence is vital to ensurereplicability.To leverage the power of neural transfer learning, researchers also require access to al-ready pretrained models that they can ﬁne-tune on their speciﬁc tasks. HuggingFace’sTransformers (Wolf et al., 2020) is an open-source library that contains thousands of44retrained NLP models ready to download and use. The pretrained models can beaccessed via the respective Python package that also provides compatibility with Py-Torch and TensorFlow.In the applications presented in the following neural transfer learning is conductedin Python 3 (Van Rossum and Drake, 2009) making use of PyTorch (Paszke et al.,2019) and HuggingFace’s Transformers (Wolf et al., 2020). The code is executed inGoogle Colab. Whenever a GPU is used, an NVIDIA Tesla T4 is employed. Thecode and data that support the ﬁndings of this study are openly available in ﬁgshareat https://doi.org/10.6084/m9.ﬁgshare.13490871. Especially the shared Colab Note-books also serve as templates that other researchers can easily adapt for their NLPtasks. To explore the use of transfer learning with Transformer-based models for text analysesin social science contexts, the prediction performances of BERT, RoBERTa, and theLongformer are compared to conventional machine learning algorithms on three diﬀerentdata sets of varying size and textual style.1. The Ethos Dataset (Duthie and Budzynska, 2018) is a corpus of 3 .

644 sentencesfrom debates in the UK parliament (train: 2 . . .

5% of the sentences being non-ethotic, 12 .

9% attacking and 4 .

6% supporting another’s ethos, the data are quiteimbalanced.2. The Legalization of Abortion Dataset comprises 933 tweets (train: 653; test: 280).The data set is a subset of the Stance Dataset (Mohammad et al., 2017) that wasused for detecting the attitude toward ﬁve diﬀerent targets from tweets. Moham- https://github.com/huggingface/transformers https://huggingface.co/transformers/index.html More speciﬁcally, bag-of-words and word vector-based text preprocessing is implemented in R (RCore Team, 2020) using the packages quanteda (Benoit et al., 2018), stringr (Wickham, 2019), text2vec(Selivanov et al., 2020), and rstudioapi (Ushey et al., 2020). Training and evaluating the pretrainedTransformer models as well as the conventional machine learning algorithms is conducted in Python 3(Van Rossum and Drake, 2009) employing the modules and packages gdown (Kentaro, 2020), imbalanced-learn (Lemaˆıtre et al., 2017), matplotlib (Hunter, 2007), NumPy (Oliphant, 2006), pandas (McKinney,2010), seaborn (Waskom and Team, 2020), scikit-learn (Pedregosa et al., 2011), PyTorch (Paszke et al.,2019), watermark (Raschka, 2020), HuggingFace’s Transformers (Wolf et al., 2020), and the XGBoostPython package (Chen and Guestrin, 2016). .

3% of thetweets express an opposing and 17 .

9% a favorable position toward legalization ofabortion whilst 23 .

8% express a neutral or no position.3. The Wikipedia Toxic Comment Dataset (Jigsaw/Conversation AI, 2018) contains159 .

571 comments from Wikipedia Talk pages that were annotated by humanraters for their toxicity. On Wikipedia Talk pages contributors discuss changesto Wikipedia pages and articles. Toxic comments are comments that are ob-scene, threatening, insulting, express hatred toward social groups and identities,“are rude, disrespectful, or otherwise likely to make people leave the discussion”(Dixon, 2017). This data set was used as the training data set in Kaggle’s ToxicComment Classiﬁcation Challenge (Jigsaw/Conversation AI, 2018). Whereas thetasks associated with the Ethos and the Legalization of Abortion Datasets aremulti-class classiﬁcation tasks, the task here is a simple binary classiﬁcation taskin which the aim is to separate toxic from non-toxic comments. 9 .

6% of commentsin the data are labelled toxic. In this work, the Wikipedia Toxic Comment Datasetis used to assess in how far the algorithms’ performances vary with training set size.To do so, the following steps are conducted to get ﬁve diﬀerently sized trainingdata sets evaluated on the same test set:(a) A set of 11 .

000 comments is sampled uniformly at random from the 159 . .

000 comments is drawn from the set of 11 .

000 commentsto become the test data set. The remaining 10 .

000 comments constitute theﬁrst training data set.(c) From the training set of 10 .

000 comments a random subset of 5 .

000 commentsis randomly drawn to become the second training set. From this subset againa smaller training subset of 2 .

000 texts is sampled from which a subset of1 .

000 and then 500 comments are drawn.(d) To account for the uncertainty induced by operating on training set samples,steps (a) to (c) are repeated ﬁve times to have ﬁve sets of ﬁve training datasets of varying size.In each of the three applications—Ethos, Abortion, and Toxic—the generalization per-formance of the pretrained Transformer models for transfer learning is examined sideby side with Support Vector Machines (SVMs) (Boser et al., 1992; Cortes and Vapnik,1995) and the gradient tree boosting algorithm XGBoost (Chen and Guestrin, 2016).SVMs have been widely used in social science text applications (e.g. Diermeier et al.,2011; D’Orazio et al., 2014; Miller et al., 2020; Ramey et al., 2019; Seb˝ok and Kacsuk, https://en.wikipedia.org/wiki/Help:Talk pages In each application, SVM and XGBoost are applied on a feature representation resultingfrom a basic bag-of-words (BOW) preprocessing procedure and a feature representationbased on GloVe word embeddings (Pennington et al., 2014). Hence, two types of prepro-cessing procedures are employed on the raw texts to provide data representation inputsfor the conventional models.1. Basic BOW: The texts are tokenized into unigrams. Punctuation, numbers, andsymbols are removed in the Ethos application but kept in the other applications.Afterward, the tokens are lowercased and stemmed. Then, tokens occurring inless than a tiny share of documents (e.g. 0 .

1% in the Ethos application) and morethan a large share of documents (e.g. 33% in the Ethos application) are excluded.Finally the entries in the document-feature matrix are weighted such that merepresence (1) vs. absence (0) of each feature within each document is recorded.2. GloVe Representation: GloVe (Pennington et al., 2014) is one of the seminalearly word embedding models that learns one word vector representation perterm. GloVe embeddings are learned via a log-bilinear model operating on the co-occurrence statistics of terms in a web data corpus from CommonCrawl comprising42 billion tokens (Pennington et al., 2014). Here, for each unigram that occurs atleast 3 (Ethos, Abortion) or 5 (Toxic) times in the respective corpus, the 300 di-mensional GloVe word vector is identiﬁed. Each document then is represented bythe mean over its unigrams’ GloVe word vectors. Note that due to making useof pretrained feature representations that are not updated during training, GloVeRepresentation constitutes a transfer learning approach with feature extraction.By averaging over the unigrams’ word embeddings, the word order, however, isnot taken into account.The Transformer-based models are applied after the documents have been transformedto the required input format. To match the requirements, in each document the tokensare lowercased and the special ‘[CLS]’ and ‘[SEP]’ tokens added. Then, each tokenis converted to an index identifying its token embedding and associated with an indexidentifying its segment embedding. Additionally, each document is padded to samelength. In the Ethos and Legalization of Abortion corpora this length correspondsto the maximum document length among the training set documents, which is 139and 53 tokens respectively. The comments from Wikipedia Talk pages pose a problemhere: An inspection of the distribution of sequence lengths in the sampled subsets ofthe Wikipedia Toxic Comment Dataset (see Figure 13) shows that the vast majorityof comments are shorter than the maximum number of 512 tokens that BERT andRoBERTa can distinguish—but there is a long tail of comments exceeding 512 tokens.47o address this issue, two diﬀerent approaches are explored: For BERT, following thebest strategy identiﬁed by Sun et al. (2019), in each comment that is longer than 512tokens only the ﬁrst 128 and the last 382 tokens are kept while the tokens positionedin the middle are removed. RoBERTa, in contrast, is replaced with the Longformerin the Toxic application. For the Longformer the sequence length is set to 2 ∗

512 =1024 tokens. This ensures that in each run only a small one- or two-digit number ofsequences that are longer than 1024 tokens are truncated by removing tokens from themiddle whilst padding the texts to a shared length that still can be processed withgiven memory restrictions. Except for the removal of tokens positioned in the middleof overlong input documents, all described formatting steps for the Transformer-basedmodels are implemented in HuggingFace’s Transformers library and therefore can beeasily applied.

Figure 13:

Boxplot of the Number of Tokens in a Sequence in a SampledSubset of the Wikipedia Toxic Comment Dataset.

The boxplot is generated basedon the 10 .

000 sampled training instances for the ﬁrst iteration.

In order to determine the algorithms’ hyperparameter settings, for each evaluated combi-nation of algorithm and preprocessing procedure a grid search across sets of hyperparam-eter values is performed via ﬁve fold cross validation on the training set. To handle theclass imbalances in the Ethos and Wikipedia Toxic Comment Datasets, at each train-testsplit the training data are randomly oversampled. In random oversampling, instances of To ensure that in the Toxic application for each diﬀerently sized training data set and for each ofthe ﬁve conducted runs the same hyperparameter setting is used, for each combination of algorithmand preprocessing procedure hyperparameter tuning is conducted only once on the training data setthat is sampled in the ﬁrst round and comprises 1 .

000 instances. For each algorithm, the so determinedhyperparameter setting is used throughout the Toxic application. / .

1. Toﬁne-tune the models within the memory resources provided by Colab, small batch sizesare used. In the Ethos and Abortion applications a batch size of 16 is selected. A batch inthe Toxic application comprises 8 (and for the Longformer 4) text instances. Moreover,for the pretrained models the base size of the model architecture is used instead of thelarge or extra large model versions. So, for example, BERT

BASE instead of BERT

LARGE is applied. Larger models are likely to lead to higher performances. Yet, because theyhave more parameters it takes more computing resources to ﬁne-tune them and especiallyfor small data sets ﬁne-tuning might lead to results that vary noticeably across randomrestarts (Devlin et al., 2019). Whilst these so far mentioned speciﬁcations are keptﬁxed, the hyperparameter grid search explores model performances across combinationsof diﬀerent learning rates and epoch numbers. To ensure that in the optimization processthe gradient updates, that are conducted based on small batches, are not too strong,smaller global Adam learning rates { e − , e − , e − } are inspected. Thenumbers of epochs explored typically is { , , } .Hyperparameter tuning for the SVMs compares a linear kernel and a Radial Basis Func-tion (RBF) kernel with values of { . , . , . } for penalty weight C and—in the case ofthe RBF kernel—values of { . , . , . } for parameter γ that speciﬁes the radius ofinﬂuence for single training examples. With regard to the XGBoost algorithms, the gridsearch explores 50 vs. 250 trees each with a maximum depth of 5 vs. 8 and XGBoostlearning rates of 0 . , .

01 and 0 . Note that when selecting a small batch size (e.g. because of memory restrictions) this is not adisadvantage, but rather the opposite: Research suggests that smaller batch sizes not only require lessmemory but also have better generalization performances (Keskar et al., 2017; Masters and Luschi, 2018).To ensure that the learning process with small batch sizes does not get too volatile, one merely has toaccount for the fact that smaller batch sizes require correspondingly smaller learning rates (Brownlee,2020a). For details on SVM and XGBoost hyperparameters see also scikit-learn Developers (2020a,c) andxgboost Developers (2020). .4 Results At the end of hyperparameter tuning, the best performing set of hyperparameters ac-cording to the macro-averaged F1-score and overﬁtting considerations is selected. Thenthe model with the chosen hyperparameter setting is trained on the entire training dataset and evaluated on the test set to obtain the macro-averaged F1-score performancemetric presented here.The F1-score for a particular class c is the harmonic mean of precision and recall forthis class (Manning et al., 2008). Recall indicates what proportion of instances thattruly belong to class c have been correctly classiﬁed as being c (Manning et al., 2008).Precision informs about what share of instances that have been predicted to be in class c truly belong to class c (Manning et al., 2008). The F1-score can range from 0 to1 with 1 being the highest value signifying perfect classiﬁcation. The macro-averagedF1-score is the unweighted mean of the F1-scores of each class (scikit-learn Developers,2020b). By not weighting the F1-scores according to class sizes, algorithms that are badat predicting the minority classes are penalized more severely (scikit-learn Developers,2020b).The results are presented in Table 1. Here, for each tested training data set size in theToxic application, { , , , , . } , the mean of the macro-averaged F1-scores across the ﬁve iterations is printed. Figure 14 additionally visualizes the resultsfor the toxic comment classiﬁcation task.Across all evaluated classiﬁcation tasks and training data set sizes, the Transformer-based models for transfer learning tend to achieve higher macro-averaged F1-scores thanthe conventional machine learning algorithms SVM and XGBoost. As has been observedbefore, the classic machine learning algorithms produce acceptable results given thesimple representations of text they are applied on. Yet, BERT, RoBERTa, and theLongformer consistently outperform the best performing conventional model by a marginof at minimum around 0 .

05 to 0 .

11. These moderate to considerably higher predictionperformances across all evaluated textual styles, sequence lengths and especially thesmaller training data set sizes, demonstrate the potential beneﬁts that neural transferlearning with Transformers can bring to analyses in which only a small to medium-sizedtraining data set exists and/or the aim is to have a text-based measure that is as accurateas possible.A more detailed examination of the macro-averaged F1-scores reveals further ﬁndingsworth discussing: • Averaged GloVe representations partly, though not consistently, produce a slightadvantage over basic BOW preprocessing. This emphasizes that employing trans-fer learning on conventional machine learning algorithms by extracting pretrainedfeatures (here: GloVe embeddings) and taking them as the data representation in-put might be beneﬁcial—even if averaging over the embeddings erases informationon word order and dependencies. 50 thos Abortion Toxic0.5K Toxic1K Toxic2K Toxic5K Toxic10KSVM BOW 0.566 0.526 0.711 0.754 0.782 0.802 0.817SVM GloVe 0.585 0.545 0.739 0.786 0.789 0.822 0.840XGBoost BOW 0.563 0.540 0.709 0.734 0.742 0.775 0.777XGBoost GloVe 0.513 0.506 0.710 0.753 0.774 0.804 0.823BERT 0.695 0.593 0.832 0.857 0.888 0.905 0.901RoBERTa/Longf. 0.747 0.617 0.849 0.875 0.884 0.890 0.906

Table 1:

Macro-Averaged F1-Scores.

Macro-averaged F1-scores of the evaluated modelsfor the Ethos, Abortion and Toxic classiﬁcation tasks. In the Toxic application, for each testedtraining data set size, { , , , , . } , the mean of the macro-averaged F1-scoresacross the ﬁve iterations is shown. The column labelled Toxic0.5K gives the mean of the macro-averaged F1-scores for the Toxic classiﬁcation task with a training set size of 500 instances.SVM BOW and XGBoost BOW denote SVM and XGBoost with bag-of-words preprocessing.SVM GloVe and XGBoost GloVe refer to SVM and XGBoost with GloVe representations. InRoBERTa/Longf., RoBERTa is applied for the Ethos and the Abortion target tasks whereas theLongformer is used for the toxic comment classiﬁcation tasks. Grey colored cells highlight thebest performing model for the task.

500 1000 2000 5000 10000

Training Data Set Size0.50.60.70.80.91.0 F - S c o r e ( m a c r o ) SVM BOWSVM GloVeXGBoost BOWXGBoost GloVeBERTLongformer

Figure 14:

Performances on Toxic Application with Varying Training Data SetSizes.

For each training data set size and each model, the plotted symbols indicate the meanof the test set macro-averaged F1-scores across the ﬁve iterations. The shaded areas range fromthe minimum to the maximum macro-averaged F1-score obtained across the ﬁve iterations. For the Ethos and Abortion applications, RoBERTa outperforms BERT to a smallextent. This ﬁnding is consistent with previous research (Liu et al., 2019). In gen-eral, it is diﬃcult to disentangle the eﬀects of single modiﬁcations of the originalBERT architecture and pretraining that BERT-extensions as RoBERTa imple-ment (Aßenmacher and Heumann, 2020). It is likely, however, that one importantcontribution is the longer pretraining on more and more varied data (Liu et al.,2019). Whereas BERT is pretrained on a corpus of books and Wikipedia articles,RoBERTa additionally is pretrained on three more large data sets that are based ontext passages from the web (Liu et al., 2019). The larger and more heterogeneouspretraining corpus is likely to enable RoBERTa to learn language representationsthat better generalize across a diverse set of target task corpora as inspected here. • In the Ethos application, BERT and RoBERTa do not only exceed the perfor-mances of the other evaluated models but also the best performing model devel-oped by Duthie and Budzynska (2018) that had created the Ethos Dataset withthe corresponding multi-class classiﬁcation task. To diﬀerentiate non-ethotic frompositive and from negative ethotic sentences, Duthie and Budzynska (2018) hadcreated an elaborate NLP pipeline including a POS tagger, dependency parsing,anaphora resolution, entity extraction, sentiment classiﬁcation, and a deep RNN.Duthie and Budzynska (2018) report a macro-averaged F1-score of 0 .

65 for theirbest model. BERT and RoBERTa surpass this performance. As the pretrainedBERT and ROBERTa models are simply ﬁne-tuned to the Ethos classiﬁcationtarget task without implementing (and having to come up with) an extensiveand complex preprocessing pipeline, this demonstrates the eﬃciency and power oftransfer learning. • With all models achieving only mediocre performances, the Abortion classiﬁcationtask, for which only 653 short Tweets are available as training instances, seemsto be especially diﬃcult. BERT and RoBERTa still surpass SVM and XGBoostbut with a slightly smaller margin. By applying a SVM with a linear kernel basedon word and character n-gram feature representations, Mohammad et al. (2017)reach classiﬁcation performance levels that are higher than the ones reached by themodels presented here—and also are higher than the performance of a RNN modelfor transfer learning that won the SemEval-2016 competition using this dataset(Zarrella and Marsh, 2016). Mohammad et al. (2017) merely compute the F1-score for the favorable and opposing categories leaving out the neutral position.They report a score of 0 .

664 for their N -gram based SVM classiﬁer (Mohammadet al., 2017). Here the corresponding score values are 0 .

633 for BERT, 0 .

648 forRoBERTa as well as 0 .

616 for the best performing conventional model SVM GloVe.The Abortion classiﬁcation task with short tweets in which the used hashtags tendto be indicative of the stance toward the issue (Mohammad et al., 2017), seems tobe an example for a task in which deep learning models only produce a moderateadvantage or—if it is easy to select BOW representations that very well capturelinguistic variation that helps in discriminating the texts into the categories (as52eems to be the case with the representations used by Mohammad et al. (2017))—even no advantage over traditional machine learning algorithms. • Across all evaluated training data set sizes, the Transformer-based models withtransfer learning tend to be better at solving the toxic comment classiﬁcation taskcompared to the conventional algorithms (see Figure 14). As is to be expected,the performance levels for all models decrease with decreasing training data setsizes. Yet, although the neural models have much more parameters to learn, theirmacro-averaged F1-scores do not decrease more sharply than those of the tradi-tional machine learning algorithms. Especially as training data sets become small,the eﬀectiveness of the pretrained representations becomes salient. Here, the pre-trained representations seem to function as a quite eﬀective input to the targettask. • Whereas the Longformer processes text sequences of 1024 tokens, the input se-quences for BERT were truncated at 512 tokens for the toxic comment classiﬁcationtask. Despite this large diﬀerence in sequence lengths, BERT only slightly under-performs compared to the Longformer—and matches the Longformer for largertraining data set sizes. As only a small share of comments in the Wikipedia ToxicComment Datatset are longer than 512 tokens (see again Figure 13), the Long-former’s advantage of being able to process longer text sequences does not materi-alize here. Removing tokens from the middle of comments that exceed 512 tokensdoes not harm BERT’s prediction performance and is an eﬀective workaround inthis application. For applications based on corpora in which the mass of the se-quence length distribution is above 512 tokens, however, the Longformer’s abilityto process and thus capture the information contained in these longer documents,is likely to be important for prediction performance. • Note that the time consumed during training diﬀers substantively between the con-ventional and the Transformer models. Larger training data sets and smaller batchsizes increase the time required for ﬁne-tuning the pretrained Transformer-basedmodels on the target task. Across the applications presented here, the absolutetraining time varies between 1 and 276 seconds for SVM BOW, between 32 and2272 seconds for BERT and between 31 and 9707 seconds for RoBERTa/Longformer.Achieving higher prediction performances requires higher computational resourcesnot only regarding memory but also regarding time.

Advances in NLP research on transfer learning and the attention mechanism, that isincorporated in the Transformer, has paved the way to a new mode of learning in whichresearchers from across domains can hope to achieve higher prediction performances bytaking a readily available pretrained model and ﬁne-tuning it with minimal resourceson their NLP task of interest (Alammar, 2018a). To use these potential advantages53or social science text analysis, this study has presented and applied Transformer-basedmodels for transfer learning. In the supervised classiﬁcation tasks evaluated in thisstudy, transfer learning with Transformer models consistently outperformed traditionalmachine learning across all tasks and data set sizes.Employing transfer learning with Transformer models, however, will not always performbetter compared to other machine learning algorithms and is not the most adequatestrategy for each and every text-based research question. As the attention mechanism isspecialized in capturing dependencies and contextual meanings, these models are likelyto generate more accurate predictions if contextual information and long-range depen-dencies between tokens are relevant for the task at hand. They are less likely to providemuch of an advantage if the function to be learned between textual inputs and desiredoutputs is less complex—for example because single N -grams, such as hashtags, arestrongly indicative of class labels (see e.g. the Abortion application).Transformer-based models for transfer learning furthermore are useful for supervisedclassiﬁcation tasks in which the aim is to achieve a high as possible prediction perfor-mance rather than having an interpretable model. Social scientists that, for example,wish to have as precise as possible text-based measures for concepts they employ mayﬁnd Transformer-based models for transfer learning highly useful, whereas researchersthat, for example, seek to know which textual features are most important in discrim-inating between class labeled documents (e.g. Slapin and Kirkland, 2020) will not ﬁndmuch use in these models.Moreover, due to the sequence length limitations of Transformer-based models for trans-fer learning, the applicability of these models currently is restricted to NLP tasks oper-ating on only moderately long text sequences. NLP research that seeks to reduce thememory resources consumed by the attention mechanism and thus allows for processinglonger text sequences (e.g. Beltagy et al., 2020; Wang et al., 2020) is highly impor-tant. Further research progress in this direction would open up the potential of transferlearning with Transformers for a wider range of social science text analyses. Data Availability Statement.

The code and data that support the ﬁndings of thisstudy are openly available in ﬁgshare at https://doi.org/10.6084/m9.ﬁgshare.13490871.54 eferences

Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., Corrado, G. S.,Davis, A., Dean, J., Devin, M., Ghemawat, S., Goodfellow, I., Harp, A., Irving, G.,Isard, M., Jia, Y., Jozefowicz, R., Kaiser, L., Kudlur, M., ..., and Zheng, X. (2015).TensorFlow: Large-scale machine learning on heterogeneous systems. [arXiv preprint].arxiv:1603.04467.Abercrombie, G. and Batista-Navarro, R. (2018). ‘Aye’ or ‘no’? Speech-level sentimentanalysis of hansard UK parliamentary debate transcripts. In Calzolari, N., Choukri,K., Cieri, C., Declerck, T., Goggi, S., Hasida, K., Isahara, H., Maegaard, B., Mariani,J., Mazo, H., Moreno, A., Odijk, J., Piperidis, S., and Tokunaga, T., editors,

Proceed-ings of the Eleventh International Conference on Language Resources and Evaluation(LREC 2018) , pages 4173–4180. European Language Resources Association (ELRA).Alammar, J. (2018a).

The illustrated BERT, ELMo, and co.: How NLPcracked transfer learning . Jay Alammar. Retrieved July 6, 2020, fromhttp://jalammar.github.io/illustrated-bert/.Alammar, J. (2018b).

The illustrated Transformer . Jay Alammar. Retrieved July 6,2020, from http://jalammar.github.io/illustrated-transformer/.Alammar, J. (2018c).

Visualizing a neural machine translation model: Me-chanics of seq2seq models with attention . Jay Alammar. Retrieved July6, 2020, from https://jalammar.github.io/visualizing-neural-machine-translation-mechanics-of-seq2seq-models-with-attention/.Amidi, A. and Amidi, S. (2019).

Recurrent neural networks cheatsheet . Stanford Uni-versity. https://stanford.edu/˜shervine/teaching/cs-230/cheatsheet-recurrent-neural-networks.Amsalem, E., Fogel-Dror, Y., Shenhav, S. R., and Sheafer, T. (2020). Fine-grained anal-ysis of diversity levels in the news.

Communication Methods and Measures , 14(4):266–284.Anastasopoulos, L. J. and Bertelli, A. M. (2020). Understanding delegation throughmachine learning: A method and application to the European Union.

AmericanPolitical Science Review , 114(1):291–301.Ansari, M. Z., Aziz, M., Siddiqui, M., Mehra, H., and Singh, K. (2020). Analysis ofpolitical sentiment orientations on Twitter.

Procedia Computer Science , 167:1821–1828.Aßenmacher, M. and Heumann, C. (2020). On the comparability of pre-trained languagemodels. In Ebling, S., Tuggener, D., H¨urlimann, M., Cieliebak, M., and Volk, M.,editors,

Proceedings of the 5th Swiss Text Analytics Conference (SwissText) & 16thConference on Natural Language Processing (KONVENS) . CEUR-WS.org.55a, J. L., Kiros, J. R., and Hinton, G. E. (2016). Layer normalization. [arXiv preprint].arXiv:1607.06450.Bahdanau, D., Cho, K., and Bengio, Y. (2015). Neural machine translation by jointlylearning to align and translate. In Bengio, Y. and LeCun, Y., editors, .Barber´a, P., Boydstun, A. E., Linn, S., McMahon, R., and Nagler, J. (2021). Automatedtext classiﬁcation of news articles: A practical guide.

Political Analysis , 29(1):19–42.Beltagy, I., Peters, M. E., and Cohan, A. (2020). Longformer: The long-documenttransformer. [arXiv preprint]. arXiv:2004.05150.Bengio, Y., Ducharme, R., Vincent, P., and Janvin, C. (2003). A neural probabilisticlanguage model.

Journal of Machine Learning Research , 3:1137–1155.Benoit, K., Watanabe, K., Wang, H., Nulty, P., Obeng, A., M¨uller, S., and Matsuo, A.(2018). quanteda: An R package for the quantitative analysis of textual data.

Journalof Open Source Software , 3(30):774.Boser, B. E., Guyon, I. M., and Vapnik, V. N. (1992). A training algorithm for optimalmargin classiﬁers. In Haussler, D., editor,

Proceedings of the Fifth Annual Work-shop on Computational Learning Theory , COLT ’92, pages 144–152. Association forComputing Machinery.Branco, P., Torgo, L., and Ribeiro, R. (2015). A survey of predictive modelling underimbalanced distributions. [arXiv preprint]. arXiv:1505.01658.Brockman, G., Murati, M., Welinder, P., and OpenAI (2020).

OpenAI API . OpenAI.https://openai.com/blog/openai-api/.Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakan-tan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger,G., Henighan, T., Child, R., Ramesh, A., Ziegler, D. M., Wu, J., Winter, C., ...,and Amodei, D. (2020). Language models are few-shot learners. [arXiv preprint].arXiv:2005.14165.Brownlee, J. (2020a).

How to control the stability of training neural networks with thebatch size . Machine Learning Mastery. https://machinelearningmastery.com/how-to-control-the-speed-and-stability-of-training-neural-networks-with-gradient-descent-batch-size/.Brownlee, J. (2020b).

Random oversampling and undersampling for imbalanced classi-ﬁcation . Machine Learning Mastery. https://machinelearningmastery.com/random-oversampling-and-undersampling-for-imbalanced-classiﬁcation/.Budhwar, A., Kuboi, T., Dekhtyar, A., and Khosmood, F. (2018). Predicting the voteusing legislative speech. In Zuiderwijk, A. and Hinnant, C. C., editors,

Proceedings of he 19th Annual International Conference on Digital Government Research: Gover-nance in the Data Age . Association for Computing Machinery.Caulﬁeld, B. (2009). What’s the diﬀerence between a CPU and a GPU?

NVIDIA Blog.https://blogs.nvidia.com/blog/2009/12/16/whats-the-diﬀerence-between-a-cpu-and-a-gpu/.Ceron, A., Curini, L., and Iacus, S. M. (2015). Using sentiment analysis to monitorelectoral campaigns: Method matters—evidence from the United States and Italy.

Social Science Computer Review , 33(1):3–20.Chang, C. and Masterson, M. (2020). Using word order in political text classiﬁcationwith long short-term memory models.

Political Analysis , 28(3):395–411.Chen, T. and Guestrin, C. (2016). XGBoost: A scalable tree boosting system. In

Pro-ceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discoveryand Data Mining , KDD ’16, pages 785–794. Association for Computing Machinery.Child, R., Gray, S., Radford, A., and Sutskever, I. (2019). Generating long sequenceswith sparse Transformers. [arXiv preprint]. arXiv:1904.10509.Cho, K., van Merri¨enboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H.,and Bengio, Y. (2014). Learning phrase representations using RNN encoder–decoderfor statistical machine translation. In Moschitti, A., Pang, B., and Daelemans, W.,editors,

Proceedings of the 2014 Conference on Empirical Methods in Natural LanguageProcessing (EMNLP) , pages 1724–1734. Association for Computational Linguistics.Chollet, F. (2020).

Deep Learning with Python . Manning Publications.Clark, K., Luong, M.-T., Le, Q. V., and Manning, C. D. (2020). ELECTRA: Pre-training text encoders as discriminators rather than generators. In . OpenReview.net.Clark, K. and Luong, T. (2020).

More eﬃcient NLP model pre-training with ELEC-TRA . Google AI Blog. https://ai.googleblog.com/2020/03/more-eﬃcient-nlp-model-pre-training.html.Clevert, D., Unterthiner, T., and Hochreiter, S. (2016). Fast and accurate deep networklearning by Exponential Linear Units (ELUs). In Bengio, Y. and LeCun, Y., editors, .Colleoni, E., Rozza, A., and Arvidsson, A. (2014). Echo chamber or public sphere?Predicting political orientation and measuring political homophily in Twitter usingbig data.

Journal of Communication , 64(2):317–332.Cortes, C. and Vapnik, V. (1995). Support-vector networks.

Machine Learning ,20(3):273–297.Dai, Z., Yang, Z., Yang, Y., Carbonell, J., Le, Q. V., and Salakhutdinov, R. (2019).57ransformer–XL: Attentive language models beyond a ﬁxed-length context. [arXivpreprint]. arXiv:1901.02860.Deng, J., Dong, W., Socher, R., Li, L., Kai Li, and Li Fei-Fei (2009). ImageNet: A large-scale hierarchical image database. In , pages 248–255. IEEE.Denny, M. J. and Spirling, A. (2018). Text preprocessing for unsupervised learning: Whyit matters, when it misleads, and what to do about it.

Political Analysis , 26(2):168–189.Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. (2018). BERT: Pre-trainingof deep bidirectional Transformers for language understanding. [arXiv preprint].arXiv:1810.04805v1.Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. (2019). BERT: Pre-trainingof deep bidirectional Transformers for language understanding. In Burstein, J., Do-ran, C., and Solorio, T., editors,

Proceedings of the 2019 Conference of the NorthAmerican Chapter of the Association for Computational Linguistics: Human Lan-guage Technologies , pages 4171–4186. Association for Computational Linguistics.Diermeier, D., Godbout, J.-F., Yu, B., and Kaufmann, S. (2011). Language and ideologyin Congress.

British Journal of Political Science , 42(1):31––55.Dixon, L. (2017).

Hi! and welcome to our ﬁrst toxicity classiﬁcationchallenge

Proceedings of Machine Learning Research (PMLR) , 32(1):647–655.D’Orazio, V., Landis, S. T., Palmer, G., and Schrodt, P. (2014). Separating the wheatfrom the chaﬀ: Applications of automated document classiﬁcation using support vec-tor machines.

Political Analysis , 22(2):224–242.Duchi, J., Hazan, E., and Singer, Y. (2011). Adaptive subgradient methods for onlinelearning and stochastic optimization.

Journal of Machine Learning Research , 12:2121–2159.Duthie, R. and Budzynska, K. (2018). A deep modular RNN approach for ethos mining.In Lang, J., editor,

Proceedings of the Twenty-Seventh International Joint Conferenceon Artiﬁcial Intelligence, IJCAI-18 , pages 4041–4047. International Joint Conferenceson Artiﬁcial Intelligence (IJCAI).Elman, J. L. (1990). Finding structure in time.

Cognitive Science , 14(2):179–211.Fowler, E. F., Franz, M. M., Martin, G. J., Peskowitz, Z., and Ridout, T. N. (2020).Political advertising online and oﬄine.

American Political Science Review .58lavaˇs, G., Nanni, F., and Ponzetto, S. P. (2017). Cross-lingual classiﬁcation of topicsin political texts. In Hovy, D., Volkova, S., Bamman, D., Jurgens, D., O’Connor,B., Tsur, O., and Do˘gru¨oz, A. S., editors,

Proceedings of the Second Workshop onNLP and Computational Social Science , pages 42–46. Association for ComputationalLinguistics.Goodfellow, I., Bengio, Y., and Courville, A. (2016).

Deep learning . MIT Press.Google Colaboratory (2020).

Google Colaboratory: Frequently Asked Questions . Re-trieved October 28, 2020, from https://research.google.com/colaboratory/faq.html.Greene, K. T., Park, B., and Colaresi, M. (2019). Machine learning human rights andwrongs: How the successes and failures of supervised learning algorithms can informthe debate about information eﬀects.

Political Analysis , 27(2):223–230.Grimmer, J. and Stewart, B. M. (2013). Text as data: The promise and pitfalls ofautomatic content analysis methods for political texts.

Political Analysis , 21(3):267–297.Han, R., Gill, M., Spirling, A., and Cho, K. (2018). Conditional word embedding andhypothesis testing via Bayes-by-backprop. In Riloﬀ, E., Chiang, D., Hockenmaier,J., and Tsujii, J., editors,

Proceedings of the 2018 Conference on Empirical Meth-ods in Natural Language Processing , pages 4890–4895. Association for ComputationalLinguistics.Hansen, C. (2020).

Activation functions explained: GELU, SELU, ELU, ReLUand more

Neural Computa-tion , 9(8):1735–1780.Howard, J. and Ruder, S. (2018). Universal language model ﬁne-tuning for text clas-siﬁcation. In Gurevych, I. and Miyao, Y., editors,

Proceedings of the 56th Annual eeting of the Association for Computational Linguistics (Volume 1: Long Papers) ,pages 328–339. Association for Computational Linguistics.Hunter, J. D. (2007). Matplotlib: A 2D graphics environment. Computing in Science &Engineering , 9(3):90–95.Iyyer, M., Enns, P., Boyd-Graber, J., and Resnik, P. (2014). Political ideology detectionusing recursive neural networks. In Toutanova, K. and Wu, H., editors,

Proceedings ofthe 52nd Annual Meeting of the Association for Computational Linguistics (Volume1: Long Papers) , pages 1113–1122. Association for Computational Linguistics.Jigsaw/Conversation AI (2018).

Toxic comment classiﬁcation challenge

American Political Science Review , 113(1):156–172.Kentaro, W. (2020). gdown: Download a large ﬁle from google drive. [Computer soft-ware]. https://github.com/wkentaro/gdown.Keskar, N. S., Mudigere, D., Nocedal, J., Smelyanskiy, M., and Tang, P. T. P. (2017). Onlarge-batch training for deep learning: Generalization gap and sharp minima. [arXivpreprint]. arXiv:1609.04836.Kingma, D. P. and Ba, J. (2015). Adam: A method for stochastic optimization. InBengio, Y. and LeCun, Y., editors, .Kirkpatrick, J., Pascanu, R., Rabinowitz, N., Veness, J., Desjardins, G., Rusu, A. A.,Milan, K., Quan, J., Ramalho, T., Grabska-Barwinska, A., Hassabis, D., Clopath, C.,Kumaran, D., and Hadsell, R. (2017). Overcoming catastrophic forgetting in neuralnetworks.

Proceedings of the National Academy of Sciences , 114(13):3521–3526.Kitaev, N., Kaiser, (cid:32)L., and Levskaya, A. (2020). Reformer: The eﬃcient Transformer.[arXiv preprint]. arXiv:2001.04451.Kozlowski, A. C., Taddy, M., and Evans, J. A. (2019). The geometry of culture: Ana-lyzing the meanings of class through word embeddings.

American Sociological Review ,84(5):905–949.Kwon, K. H., Priniski, J. H., and Chadha, M. (2018). Disentangling user samples:A supervised machine learning approach to proxy-population mismatch in Twitterresearch.

Communication Methods and Measures , 12(2-3):216–237.Lai, G., Xie, Q., Liu, H., Yang, Y., and Hovy, E. (2017). RACE: Large-scale readingcomprehension dataset from examinations. [arXiv preprint]. arXiv:1704.04683.Lan, Z., Chen, M., Goodman, S., Gimpel, K., Sharma, P., and Soricut, R. (2020).ALBERT: A Lite BERT for self-supervised learning of language representations. In . OpenReview.net.60emaˆıtre, G., Nogueira, F., and Aridas, C. K. (2017). Imbalanced-learn: A Pythontoolbox to tackle the curse of imbalanced datasets in machine learning.

Journal ofMachine Learning Research , 18(17):1–5.Li, F.-F., Krishna, R., and Xu, D. (2020a). CS231n: Convolutional neural net-works for visual recognition — optimization I. Lecture Notes, Stanford University.https://cs231n.github.io/optimization-1/.Li, F.-F., Krishna, R., and Xu, D. (2020b). CS231n: Convolutional neural net-works for visual recognition — optimization II. Lecture Notes, Stanford University.https://cs231n.github.io/optimization-2/.Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettle-moyer, L., and Stoyanov, V. (2019). RoBERTa: A robustly optimized BERT pretrain-ing approach. [arXiv preprint]. arXiv:1907.11692.Loshchilov, I. and Hutter, F. (2019). Decoupled weight decay regularization. In . OpenReview.net.Luong, T., Pham, H., and Manning, C. D. (2015). Eﬀective approaches to attention-based neural machine translation. In M`arquez, L., Callison-Burch, C., and Su, J.,editors,

Proceedings of the 2015 Conference on Empirical Methods in Natural LanguageProcessing , pages 1412–1421. Association for Computational Linguistics.Manning, C. D., Raghavan, P., and Sch¨utze, H. (2008).

Introduction to InformationRetrieval . Cambridge University Press.Masters, D. and Luschi, C. (2018). Revisiting small batch training for deep neuralnetworks. [arXiv preprint]. arXiv:1804.07612.McCann, B., Bradbury, J., Xiong, C., and Socher, R. (2018). Learned in translation:Contextualized word vectors. [arXiv preprint]. arXiv:1708.00107.McCormick, C. and Ryan, N. (2019).

BERT ﬁne-tuning tutorial withPyTorch . Chris McCormick. Retrieved September 11, 2020, fromhttps://mccormickml.com/2019/07/22/BERT-ﬁne-tuning/.McKinney, W. (2010). Data structures for statistical computing in Python. In van derWalt, S. and Millman, J., editors,

Proceedings of the 9th Python in Science Conference(SciPy 2010) , pages 51–56.Mikolov, T., Chen, K., Corrado, G., and Dean, J. (2013a). Eﬃcient estimation of wordrepresentations in vector space. [arXiv preprint]. arXiv:1301.3781.Mikolov, T., Sutskever, I., Chen, K., Corrado, G., and Dean, J. (2013b). Distributedrepresentations of words and phrases and their compositionality. In Burges, C. J. C.,Bottou, L., Welling, M., Ghahramani, Z., and Weinberger, K. Q., editors,

Proceed-ings of the 26th International Conference on Neural Information Processing Systems ,NIPS’13, pages 3111–3119. Curran Associates Inc.61ikolov, T., Yih, W.-t., and Zweig, G. (2013c). Linguistic regularities in continuousspace word representations. In Vanderwende, L., Daum´e III, H., and Kirchhoﬀ, K.,editors,

Proceedings of the 2013 Conference of the North American Chapter of the As-sociation for Computational Linguistics: Human Language Technologies , pages 746–751. Association for Computational Linguistics.Miller, B., Linder, F., and Mebane, W. R. (2020). Active learning approaches for la-beling text: Review and assessment of the performance of active learning approaches.

Political Analysis , 28(4):532–551.Mitts, T. (2019). From isolation to radicalization: Anti-Muslim hostility and supportfor ISIS in the West.

American Political Science Review , 113(1):173–194.Mohammad, S. M., Sobhani, P., and Kiritchenko, S. (2017). Stance and sentiment intweets.

ACM Transactions on Internet Technology , 17(3):26:1–26:22.Muchlinski, D., Yang, X., Birch, S., Macdonald, C., and Ounis, I. (2020). We need to godeeper: Measuring electoral violence using convolutional neural networks and socialmedia.

Political Science Research and Methods .Nair, V. and Hinton, G. E. (2010). Rectiﬁed Linear Units improve restricted Boltzmannmachines. In F¨urnkranz, J. and Joachims, T., editors,

Proceedings of the 27th In-ternational Conference on International Conference on Machine Learning , ICML’10,pages 807–814. Omnipress.Oliphant, T. E. (2006).

A Guide to NumPy . Trelgol Publishing USA.Pan, S. J. and Yang, Q. (2010). A survey on transfer learning.

IEEE Transactions onKnowledge and Data Engineering , 22(10):1345–1359.Park, B., Greene, K., and Colaresi, M. (2020). Human rights are (increasingly) plu-ral: Learning the changing taxonomy of human rights from large-scale text revealsinformation eﬀects.

American Political Science Review , 114(3):888–910.Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T.,Lin, Z., Gimelshein, N., Antiga, L., Desmaison, A., Kopf, A., Yang, E., DeVito, Z.,Raison, M., Tejani, A., Chilamkurthy, S., Steiner, B., Fang, L., ..., and Chintala, S.(2019). Pytorch: An imperative style, high-performance deep learning library. InWallach, H., Larochelle, H., Beygelzimer, A., d’Alch´e Buc, F., Fox, E., and Garnett,R., editors,

Advances in Neural Information Processing Systems 32 , pages 8024–8035.Curran Associates, Inc.Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel,M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau,D., Brucher, M., Perrot, M., and Duchesnay, E. (2011). Scikit-learn: Machine learningin Python.

Journal of Machine Learning Research , 12:2825–2830.Pennington, J., Socher, R., and Manning, C. (2014). GloVe: Global vectors for wordrepresentation. In Moschitti, A., Pang, B., and Daelemans, W., editors,

Proceedings of he 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) ,pages 1532–1543. Association for Computational Linguistics.Peters, M., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K., and Zettlemoyer,L. (2018). Deep contextualized word representations. In Walker, M., Ji, H., and Stent,A., editors, Proceedings of the 2018 Conference of the North American Chapter ofthe Association for Computational Linguistics: Human Language Technologies , pages2227–2237. Association for Computational Linguistics.Peters, M. E., Ruder, S., and Smith, N. A. (2019). To tune or not to tune? Adaptingpretrained representations to diverse tasks. In Augenstein, I., Gella, S., Ruder, S.,Kann, K., Can, B., Welbl, J., Conneau, A., Ren, X., and Rei, M., editors,

Proceedingsof the 4th Workshop on Representation Learning for NLP (RepL4NLP-2019) , pages7–14. Association for Computational Linguistics.Pilny, A., McAninch, K., Slone, A., and Moore, K. (2019). Using supervised machinelearning in automated content analysis: An example using relational uncertainty.

Communication Methods and Measures , 13(4):287–304.R Core Team (2020).

R: A language and environment for statistical computing

Journal of Machine Learning Research , 21(140):1–67.Rajpurkar, P., Jia, R., and Liang, P. (2018). Know what you don’t know: Unanswerablequestions for SQuAD. [arXiv preprint]. arXiv:1806.03822.Rajpurkar, P., Zhang, J., Lopyrev, K., and Liang, P. (2016). SQuAD: 100,000+ questionsfor machine comprehension of text. [arXiv preprint]. arXiv:1606.05250.Ramey, A. J., Klingler, J. D., and Hollibaugh, G. E. (2019). Measuring elite personalityusing speech.

Political Science Research and Methods , 7(1):163–184.Raschka, S. (2020). watermark. [Computer software].https://github.com/rasbt/watermark.Rheault, L., Beelen, K., Cochrane, C., and Hirst, G. (2016). Measuring emotion inparliamentary debates with automated textual analysis.

PLoS ONE , 11(12):e0168843.63heault, L. and Cochrane, C. (2020). Word embeddings for the analysis of ideologicalplacement in parliamentary corpora.

Political Analysis , 28(1):112–133.Ribeiro, M. T., Wu, T., Guestrin, C., and Singh, S. (2020). Beyond accuracy: Behav-ioral testing of NLP models with CheckList. In Jurafsky, D., Chai, J., Schluter, N.,and Tetreault, J., editors,

Proceedings of the 58th Annual Meeting of the Associa-tion for Computational Linguistics , pages 4902–4912. Association for ComputationalLinguistics.Riedl, M. (2020).

AI democratization in the era of GPT-3 . The Gradient.https://thegradient.pub/ai-democratization-in-the-era-of-gpt-3/.Rodman, E. (2020). A timely intervention: Tracking the changing meanings of politicalconcepts with word vectors.

Political Analysis , 28(1):87–111.Ruder, S. (2018).

NLP’s ImageNet moment has arrived . Sebastian Ruder.https://ruder.io/nlp-imagenet/.Ruder, S. (2019).

Neural transfer learning for natural lan-guage processing . PhD thesis, National University of Ireland.https://ruder.io/thesis/neural transfer learning for nlp.pdf.Ruder, S. (2020).

NLP-Progress . Retrieved August 4, 2020, fromhttps://nlpprogress.com/.Rudkowsky, E., Haselmayer, M., Wastian, M., Jenny, M., Emrich, S., and Sedlmair,M. (2018). More than bags of words: Sentiment analysis with word embeddings.

Communication Methods and Measures , 12(2–3):140–157.Rumelhart, D. E., Hinton, G. E., and Williams, R. J. (1986). Learning representationsby back-propagating errors.

Nature , 323:533–536.Schuster, M. and Nakajima, K. (2012). Japanese and Korean voice search. In , pages5149–5152. IEEE.scikit-learn Developers (2020a). . https://scikit-learn.org/stable/modules/svm.html.scikit-learn Developers (2020b).

Classiﬁcation metrics . https://scikit-learn.org/stable/modules/model evaluation.html

RBF SVM Parameters . https://scikit-learn.org/stable/auto examples/svm/plot rbf parameters.html.Seb˝ok, M. and Kacsuk, Z. (2020). The multiclass classiﬁcation of newspaper articleswith machine learning: The hybrid binary snowball approach.

Political Analysis .Selivanov, D., Bickel, M., and Wang, Q. (2020). text2vec: Modern text mining frameworkfor R . [Computer software]. http://text2vec.org/.64ennrich, R., Haddow, B., and Birch, A. (2016). Neural machine translation of rarewords with subword units. In Erk, K. and Smith, N. A., editors,

Proceedings of the54th Annual Meeting of the Association for Computational Linguistics (Volume 1:Long Papers) , pages 1715–1725. Association for Computational Linguistics.Slapin, J. B. and Kirkland, J. H. (2020). The sound of rebellion: Voting dissent and leg-islative speech in the UK House of Commons.

Legislative Studies Quarterly , 45(2):153–176.Socher, R., Perelygin, A., Wu, J., Chuang, J., Manning, C. D., Ng, A., and Potts,C. (2013). Recursive deep models for semantic compositionality over a sentimenttreebank. In Yarowsky, D., Baldwin, T., Korhonen, A., Livescu, K., and Bethard, S.,editors,

Proceedings of the 2013 Conference on Empirical Methods in Natural LanguageProcessing , pages 1631–1642. Association for Computational Linguistics.Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. (2014).Dropout: A simple way to prevent neural networks from overﬁtting.

Journal of Ma-chine Learning Research , 15(56):1929–1958.Sun, C., Qiu, X., Xu, Y., and Huang, X. (2019). How to ﬁne-tune BERT for textclassiﬁcation? [arXiv preprint]. arXiv:1905.05583.Sutskever, I., Vinyals, O., and Le, Q. V. (2014). Sequence to sequence learning withneural networks. In Ghahramani, Z., Welling, M., Cortes, C., Lawrence, N. D., andWeinberger, K. Q., editors,

Proceedings of the 27th International Conference on NeuralInformation Processing Systems - Volume 2 , NIPS’14, pages 3104–3112. MIT Press.The HuggingFace Team (2020a).

Everything you always wanted to know aboutpadding and truncation . HuggingFace’s Transformers. Retrieved November 11,2020, from https://huggingface.co/transformers/preprocessing.html

Summary of the models . Hug-gingFace’s Transformers. Retrieved November 13, 2020, fromhttps://huggingface.co/transformers/model summary.html.The HuggingFace Team (2020c).

Tokenizer summary . Hug-gingFace’s Transformers. Retrieved November 19, 2020, fromhttps://huggingface.co/transformers/tokenizer summary.html.Theocharis, Y., Barber´a, P., Fazekas, Z., Popa, S. A., and Parnet, O. (2016). A badworkman blames his tweets: The consequences of citizens’ uncivil Twitter use wheninteracting with party candidates.

Journal of Communication , 66(6):1007–1031.Turney, P. D. and Pantel, P. (2010). From frequency to meaning: Vector space modelsof semantics.

Journal of Artiﬁcial Intelligence Research , 37:141–188.Ushey, K., Allaire, J. J., Wickham, H., and Ritchie, G. (2020). rstudioapi: Safely accessthe RStudio API . [Computer software]. https://github.com/rstudio/rstudioapi.65an Rossum, G. and Drake, F. L. (2009).

Python 3 reference manual . CreateSpace.Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser,L., and Polosukhin, I. (2017). Attention is all you need. In Guyon, I., Luxburg,U. V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., and Garnett, R., edi-tors,

Advances in Neural Information Processing Systems 30 , pages 5998–6008. CurranAssociates, Inc.Wang, A., Pruksachatkun, Y., Nangia, N., Singh, A., Michael, J., Hill, F., Levy, O., andBowman, S. (2019a). SuperGLUE: A stickier benchmark for general-purpose languageunderstanding systems. In Wallach, H., Larochelle, H., Beygelzimer, A., d’Alch´e Buc,F., Fox, E., and Garnett, R., editors,

Advances in Neural Information ProcessingSystems , pages 3266–3280. Curran Associates, Inc.Wang, A., Singh, A., Michael, J., Hill, F., Levy, O., and Bowman, S. R. (2019b). GLUE:A multi-task benchmark and analysis platform for natural language understanding.In . OpenRe-view.net.Wang, S., Li, B. Z., Khabsa, M., Fang, H., and Ma, H. (2020). Linformer: Self-attentionwith linear complexity. [arXiv preprint]. arXiv:2006.04768.Waskom, M. and Team (2020). Seaborn. [Computer software].https://zenodo.org/record/4379347.Watanabe, K. (2020). Latent semantic scaling: A semisupervised text analysis techniquefor new domains and languages.

Communication Methods and Measures .Welbers, K., Atteveldt, W. V., and Benoit, K. (2017). Text analysis in R.

Communica-tion Methods and Measures , 11(4):245–265.Wickham, H. (2019). stringr: Simple, consistent wrappers for common string operations .[Computer software]. https://stringr.tidyverse.org/.Williams, A., Nangia, N., and Bowman, S. (2018). A broad-coverage challenge corpusfor sentence understanding through inference. In Walker, M., Ji, H., and Stent, A.,editors,

Proceedings of the 2018 Conference of the North American Chapter of theAssociation for Computational Linguistics: Human Language Technologies, Volume 1(Long Papers) , pages 1112–1122. Association for Computational Linguistics.Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault,T., Louf, R., Funtowicz, M., Davison, J., Shleifer, S., von Platen, P., Ma, C., Jernite,Y., Plu, J., Xu, C., Scao, T. L., Gugger, S., ..., and Rush, A. M. (2020). Hugging-Face’s Transformers: State-of-the-art natural language processing. [arXiv preprint].arXiv:1910.03771.Wolpert, D. H. and Macready, W. G. (1997). No free lunch theorems for optimization.

IEEE Transactions on Evolutionary Computation xgboost.XGBClassiﬁer . Retrieved November 23, 2020, fromhttps://xgboost.readthedocs.io/en/latest/python/python api.html

Advances in Neural Information Processing Systems 32 , pages 5753–5763.Curran Associates, Inc.Yosinski, J., Clune, J., Bengio, Y., and Lipson, H. (2014). How transferable are featuresin deep neural networks? In Ghahramani, Z., Welling, M., Cortes, C., Lawrence,N. D., and Weinberger, K. Q., editors,

Advances in Neural Information ProcessingSystems 27 , pages 3320–3328. Curran Associates, Inc.Zarrella, G. and Marsh, A. (2016). MITRE at SemEval-2016 task 6: Transfer learningfor stance detection. In Bethard, S., Carpuat, M., Cer, D., Jurgens, D., Nakov, P.,and Zesch, T., editors,

Proceedings of the 10th International Workshop on SemanticEvaluation (SemEval-2016) , pages 458–463. Association for Computational Linguis-tics.Zhang, H. and Pan, J. (2019). CASM: A deep-learning approach for identifying collectiveaction events with text and image data from social media.

Sociological Methodology ,49(1):1–57.Zhu, Y., Kiros, R., Zemel, R., Salakhutdinov, R., Urtasun, R., Torralba, A., and Fidler,S. (2015). Aligning books and movies: Towards story-like visual explanations bywatching movies and reading books. In