FinBERT: Financial Sentiment Analysis with Pre-trained Language Models
Submitted in partial fulfillment for the degree of Master of Science

Dogu Tan Araci (12255068)
Master Information Studies, Data Science
Faculty of Science, University of Amsterdam
2019-06-25

Internal Supervisor: Dr Pengjie Ren (UvA, ILPS), [email protected]
External Supervisor: Dr Zulkuf Genc (Naspers Group), [email protected]
ABSTRACT
Financial sentiment analysis is a challenging task due to the specialized language and lack of labeled data in that domain. General-purpose models are not effective enough because of the specialized language used in a financial context. We hypothesize that pre-trained language models can help with this problem because they require fewer labeled examples and they can be further trained on domain-specific corpora. We introduce FinBERT, a language model based on BERT, to tackle NLP tasks in the financial domain. Our results show improvement in every measured metric over the current state-of-the-art results for two financial sentiment analysis datasets. We find that even with a smaller training set and fine-tuning only a part of the model, FinBERT outperforms state-of-the-art machine learning methods.
1 INTRODUCTION

Prices in an open market reflect all of the available information regarding the assets exchanged in an economy [16]. When new information becomes available, all actors in the economy update their positions and prices adjust accordingly, which makes beating the markets consistently impossible. However, the definition of "new information" might change as new information retrieval technologies become available, and early adoption of such technologies might provide an advantage in the short term.

Analysis of financial texts, be it news, analyst reports or official company announcements, is a possible source of new information. With an unprecedented amount of such text being created every day, manually analyzing it and deriving actionable insights from it is too big a task for any single entity. Hence, automated sentiment or polarity analysis of texts produced by financial actors using natural language processing (NLP) methods has gained popularity during the last decade [4].

The principal research interest for this thesis is polarity analysis, which is classifying text as positive, negative or neutral, in a specific domain. It requires addressing two challenges: 1) The most sophisticated classification methods that make use of neural nets require vast amounts of labeled data, and labeling financial text snippets requires costly expertise. 2) Sentiment analysis models trained on general corpora are not suited to the task, because financial texts have a specialized language with a unique vocabulary and a tendency to use vague expressions instead of easily identified negative/positive words.

Using carefully crafted financial sentiment lexicons such as Loughran and McDonald (2011) [11] may seem a solution, because they incorporate existing financial knowledge into textual analysis. However, they are based on "word counting" methods, which come short in analyzing the deeper semantic meaning of a given text.

NLP transfer learning methods look like a promising solution to both of the challenges mentioned above, and are the focus of this thesis. The core idea behind these models is that by training language models on very large corpora and then initializing down-stream models with the weights learned from the language modeling task, a much better performance can be achieved. The initialized layers can range from the single word embedding layer [23] to the whole model [5]. This approach should, in theory, be an answer to the scarcity of labeled data: language models don't require any labels, since the task is predicting the next word, and they can learn how to represent semantic information. That leaves fine-tuning on labeled data only the task of learning how to use this semantic information to predict the labels.

One particular component of the transfer learning methods is the ability to further pre-train the language model on a domain-specific unlabeled corpus. Thus, the model can learn the semantic relations in the text of the target domain, which is likely to have a different distribution than a general corpus. This approach is especially promising for a niche domain like finance, since its language and vocabulary are dramatically different from general usage.

The goal of this thesis is to test these hypothesized advantages of using and fine-tuning pre-trained language models for the financial domain. For that, we try to predict the sentiment of a sentence from a financial news article towards the financial actor depicted in the sentence, using the Financial PhraseBank created by Malo et al.
(2014) [17] and the FiQA Task 1 sentiment scoring dataset [15].

The main contributions of this thesis are the following:
• We introduce FinBERT, which is a language model based on BERT for financial NLP tasks. We evaluate FinBERT on two financial sentiment analysis datasets.
• We achieve the state-of-the-art on FiQA sentiment scoring and Financial PhraseBank.
• We implement two other pre-trained language models, ULMFit and ELMo, for financial sentiment analysis and compare these with FinBERT.
• We conduct experiments to investigate several aspects of the model, including: effects of further pre-training on a financial corpus, training strategies to prevent catastrophic forgetting, and fine-tuning only a small subset of model layers for decreasing training time without a significant drop in performance.

The rest of the thesis is structured as follows: First, relevant literature in both financial polarity analysis and pre-trained language models is discussed (Section 2). Then, the evaluated models are described (Section 3). This is followed by the description of the experimental setup (Section 4). In Section 5, we present the experimental results on the financial sentiment datasets. We then analyze FinBERT further from different perspectives in Section 6. Finally, we conclude with Section 7.

2 RELATED LITERATURE

This section describes previous research conducted on sentiment analysis in finance (2.1) and text classification using pre-trained language models (2.2).
2.1 Sentiment Analysis in Finance

Sentiment analysis is the task of extracting sentiments or opinions of people from written language [10]. We can divide the recent efforts into two groups: 1) machine learning methods with features extracted from text by "word counting" [1, 19, 28, 30], and 2) deep learning methods, where text is represented by a sequence of embeddings [2, 25, 32]. The former suffers from an inability to represent the semantic information that results from a particular sequence of words, while the latter is often deemed too "data-hungry", as it learns a much higher number of parameters [18].

Financial sentiment analysis differs from general sentiment analysis not only in domain, but also in purpose. The purpose behind financial sentiment analysis is usually guessing how the markets will react to the information presented in the text [9]. Loughran and McDonald (2016) present a thorough survey of recent works on financial text analysis utilizing machine learning with "bag-of-words" approaches or lexicon-based methods [12]. For example, in Loughran and McDonald (2011), they create a dictionary of financial terms with assigned values such as "positive" or "uncertain" and measure the tone of a document by counting words with a specific dictionary value [11]. Another example is Pagolu et al. (2016), where n-grams from tweets with financial information are fed into supervised machine learning algorithms to detect the sentiment regarding the financial entity mentioned.

One of the first papers that used deep learning methods for textual financial polarity analysis was Kraus and Feuerriegel (2017) [7]. They apply an LSTM neural network to ad-hoc company announcements to predict stock-market movements and show that method to be more accurate than traditional machine learning approaches. They find that pre-training their model on a larger corpus improves the result; however, their pre-training is done on a labeled dataset, which is a more limiting approach than ours, as we pre-train a language model as an unsupervised task.

There are several other works that employ various types of neural architectures for financial sentiment analysis. Sohangir et al. (2018) [26] apply several generic neural network architectures to a StockTwits dataset, finding CNN to be the best performing architecture. Lutz et al. (2018) [13] take the approach of using doc2vec to generate sentence embeddings of particular company ad-hoc announcements and utilize multi-instance learning to predict stock market outcomes. Maia et al. (2018) [14] use a combination of text simplification and an LSTM network to classify a set of sentences from financial news according to their sentiment, and achieve state-of-the-art results for the Financial PhraseBank, which is used in this thesis as well.

Due to the lack of large labeled financial datasets, it is difficult to utilize neural networks to their full potential for sentiment analysis. Even when their first (word embedding) layers are initialized with pre-trained values, the rest of the model still needs to learn complex relations with a relatively small amount of labeled data. A more promising solution could be initializing almost the entire model with pre-trained values and fine-tuning those values with respect to the classification task.
2.2 Text Classification Using Pre-trained Language Models

Language modeling is the task of predicting the next word in a given piece of text. One of the most important recent developments in natural language processing is the realization that a model trained for language modeling can be successfully fine-tuned for most down-stream NLP tasks with small modifications. These models are usually trained on very large corpora and then, with the addition of suitable task-specific layers, fine-tuned on the target dataset [6]. Text classification, which is the focus of this thesis, is one of the obvious use-cases for this approach.

ELMo (Embeddings from Language Models) [23] was one of the first successful applications of this approach. With ELMo, a deep bidirectional language model is pre-trained on a large corpus. For each word, the hidden states of this model are used to compute a contextualized representation. Using the pre-trained weights of ELMo, contextualized word embeddings can be calculated for any piece of text. Initializing embeddings for down-stream tasks with these was shown to improve performance on most tasks compared to static word embeddings such as word2vec or GloVe. For text classification tasks like SST-5, it achieved state-of-the-art performance when used together with a bi-attentive classification network [20].

Although ELMo makes use of pre-trained language models for contextualizing representations, the information extracted using the language model is still present only in the first layer of any model using it. ULMFit (Universal Language Model Fine-tuning) [5] was the first paper to achieve true transfer learning for NLP: using novel techniques such as discriminative fine-tuning, slanted triangular learning rates and gradual unfreezing, the authors were able to efficiently fine-tune a whole pre-trained language model for text classification. They also introduced further pre-training of the language model on a domain-specific corpus, assuming that target task data comes from a different distribution than the general corpus the initial model was trained on.

ULMFit's main idea of efficiently fine-tuning a pre-trained language model for down-stream tasks was brought to another level with Bidirectional Encoder Representations from Transformers (BERT) [3], which is also the main focus of this paper. BERT has two important differences from what came before: 1) It defines the language modeling task as predicting randomly masked tokens in a sequence rather than the next token, in addition to a task of classifying two sentences as following each other or not. 2) It is a very big network trained on an unprecedentedly large corpus. These two factors enabled it to achieve state-of-the-art results on multiple NLP tasks such as natural language inference and question answering. The specifics of fine-tuning BERT for text classification have not been researched thoroughly; one recent work in this direction is Sun et al. (2019) [27].

3 METHOD

In this section, we present our BERT implementation for the financial domain, named FinBERT, after giving a brief background on the relevant neural architectures.
3.1 Background

3.1.1 LSTM and GloVe. Long short-term memory (LSTM) is a type of recurrent neural network that allows long-term dependencies in a sequence to persist in the network by using "forget" and "update" gates. It is one of the primary architectures for modeling any sequential data generation process, from stock prices to natural language. Since a text is a sequence of tokens, the first choice for any LSTM natural language processing model is determining how to initially represent a single token. Using pre-trained weights for the initial token representation is the common practice. One such pre-training algorithm is GloVe (Global Vectors for Word Representation) [22]. GloVe is a model for calculating word representations with the unsupervised task of training a log-bilinear regression model on a word-word co-occurrence matrix from a large corpus. It is an effective model for representing words in a vector space; however, it doesn't contextualize these representations with respect to the sequence they are actually used in. (The pre-trained GloVe weights can be found at https://nlp.stanford.edu/projects/glove/; a code sketch of this initialization step follows Section 3.1.3.)

3.1.2 ELMo. ELMo embeddings [23] are contextualized word representations in the sense that the surrounding words influence the representation of a word. At the center of ELMo, there is a bidirectional language model with multiple LSTM layers. The goal of a language model is to learn the probability distribution over sequences of tokens in a given vocabulary. ELMo models the probability of a token given the previous (and, separately, the following) tokens in the sequence. The model also learns how to weight the representations from different LSTM layers in order to calculate one contextualized vector per token. Once the contextualized representations are extracted, they can be used to initialize any down-stream NLP task. (The pre-trained ELMo models can be found at https://allennlp.org/elmo.)

3.1.3 ULMFit. ULMFit is a transfer learning model for down-stream NLP tasks that makes use of language model pre-training [5]. Unlike ELMo, with ULMFit the whole language model is fine-tuned together with the task-specific layers. The underlying language model used in ULMFit is AWD-LSTM, which uses sophisticated dropout tuning strategies to better regularize its LSTM model [21]. For classification using ULMFit, two linear layers are added to the pre-trained AWD-LSTM, the first of which takes the pooled last hidden states as input. ULMFit comes with novel training strategies for further pre-training the language model on a domain-specific corpus and fine-tuning on the down-stream task. We implement these strategies with FinBERT, as explained in Section 3.2.
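Returning to the token initialization described in Section 3.1.1, the sketch below loads GloVe vectors into a PyTorch embedding layer. The file path and the vocab dictionary (word to index) are illustrative assumptions, not artifacts of the thesis.

```python
import numpy as np
import torch
import torch.nn as nn

def glove_embedding_layer(glove_path, vocab, dim=300):
    """Build an embedding layer initialized with GloVe vectors.

    vocab maps each word to a row index; words missing from the GloVe
    file keep a small random initialization.
    """
    matrix = np.random.normal(scale=0.1, size=(len(vocab), dim)).astype("float32")
    with open(glove_path, encoding="utf-8") as f:
        for line in f:
            word, *values = line.rstrip().split(" ")  # "word v1 v2 ... v300"
            if word in vocab:
                matrix[vocab[word]] = np.asarray(values, dtype="float32")
    return nn.Embedding.from_pretrained(torch.from_numpy(matrix), freeze=False)
```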
3.1.4 Transformer. The Transformer is an attention-based architecture for modeling sequential information that is an alternative to recurrent neural networks [29]. It was proposed as a sequence-to-sequence model, therefore including encoder and decoder mechanisms. Here, we focus only on the encoder part (the decoder is quite similar). The encoder consists of multiple identical Transformer layers. Each layer has a multi-headed self-attention layer and a fully connected feed-forward network. For one self-attention layer, three mappings from embeddings (key, query and value) are learned. Using each token's query vector and all tokens' key vectors, similarity scores are calculated with dot products. These scores are used to weight the value vectors to arrive at the new representation of the token. With multi-headed self-attention, the outputs of several such attention layers are concatenated, so that the sequence can be evaluated from varying "perspectives". The resulting vectors then go through fully connected networks with shared parameters.

As argued by Vaswani et al. (2017) [29], the Transformer architecture has several advantages over RNN-based approaches: because of RNNs' sequential nature, they are much harder to parallelize on GPUs, and too many steps between far-away elements in a sequence make it hard for information to persist.
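To make the self-attention computation concrete, below is a minimal sketch of a single attention head; the scaling by the square root of the key dimension follows Vaswani et al. (2017) [29], while the sizes and variable names are illustrative choices of our own.

```python
import math
import torch

def scaled_dot_product_attention(x, w_q, w_k, w_v):
    """x: (seq_len, d_model); w_q, w_k, w_v: (d_model, d_k) learned projections."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v           # queries, keys, values
    scores = q @ k.T / math.sqrt(k.shape[-1])     # pairwise similarity scores
    weights = torch.softmax(scores, dim=-1)       # one distribution per token
    return weights @ v                            # weighted sum of value vectors

x = torch.randn(5, 64)                            # 5 tokens, d_model = 64
w_q, w_k, w_v = (torch.randn(64, 16) for _ in range(3))
print(scaled_dot_product_attention(x, w_q, w_k, w_v).shape)  # torch.Size([5, 16])
```

In the multi-headed case, several such heads run in parallel and their outputs are concatenated before the shared feed-forward network.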
3.1.5 BERT. BERT [3] is in essence a language model that consists of a set of Transformer encoders stacked on top of each other. However, it defines the language modeling task differently from ELMo and AWD-LSTM. Instead of predicting the next word given the previous ones, BERT "masks" a randomly selected 15% of all tokens; with a softmax layer over the vocabulary on top of the last encoder layer, the masked tokens are predicted. A second task BERT is trained on is "next sentence prediction": given two sentences, the model predicts whether or not these two actually follow each other. The input sequence is represented with token and position embeddings. Two special tokens, [CLS] and [SEP], are added to the beginning and end of the sequence respectively. For all classification tasks, including next sentence prediction, the [CLS] token is used.

BERT has two versions: BERT-base, with 12 encoder layers, a hidden size of 768, 12 multi-head attention heads and 110M parameters in total, and BERT-large, with 24 encoder layers, a hidden size of 1024, 16 multi-head attention heads and 340M parameters. Both of these models have been trained on BookCorpus [33] and English Wikipedia, which together contain more than 3,500M words.
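As an illustration of the masked-LM objective, the sketch below masks a random 15% of token ids. It is a simplification of the original procedure (BERT additionally leaves some selected tokens unchanged or replaces them with random tokens); the function and names are our own.

```python
import torch

def mask_tokens(input_ids, mask_token_id, mask_prob=0.15):
    """Return masked inputs and labels for a simplified masked-LM objective."""
    labels = input_ids.clone()
    mask = torch.rand(input_ids.shape) < mask_prob   # choose ~15% of positions
    labels[~mask] = -100          # ignored by nn.CrossEntropyLoss by default
    masked = input_ids.clone()
    masked[mask] = mask_token_id  # replace chosen tokens with [MASK]
    return masked, labels
```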
3.2 FinBERT

In this subsection we describe our implementation of BERT: 1) how further pre-training on a domain corpus is done, 2-3) how we implemented BERT for classification and regression tasks, and 4) the training strategies we used during fine-tuning to prevent catastrophic forgetting.

3.2.1 Further pre-training. Howard and Ruder (2018) [5] show that further pre-training a language model on a target domain corpus improves the eventual classification performance. For BERT, there is no decisive research showing that this would be the case as well. (The pre-trained weights are made public by the creators of BERT; the code and weights can be found at https://github.com/google-research/bert.) Regardless, we implement further pre-training in order to observe whether such adaptation is beneficial for the financial domain.

For further pre-training, we experiment with two approaches. The first is pre-training the model on a relatively large corpus from the target domain. For that, we further pre-train a BERT language model on a financial corpus (details of the corpus can be found in Section 4.2.1). The second approach is pre-training the model only on the sentences from the training classification dataset. Although the second corpus is much smaller, using data from the direct target might provide better target domain adaptation.

3.2.2 Classification. Sentiment classification is conducted by adding a dense layer after the last hidden state of the [CLS] token. This is the recommended practice for using BERT for any classification task [3]. Then, the classifier network is trained on the labeled sentiment dataset. An overview of all the steps involved in the procedure is presented in Figure 1.
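A minimal sketch of this classification head, written against the Hugging Face transformers API as an assumption (the thesis builds directly on the weights released by the BERT authors rather than on this library):

```python
import torch.nn as nn
from transformers import BertModel

class FinBertClassifier(nn.Module):
    """Dense layer over the final hidden state of the [CLS] token."""

    def __init__(self, num_labels=3, dropout=0.1):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        self.dropout = nn.Dropout(dropout)
        self.classifier = nn.Linear(self.bert.config.hidden_size, num_labels)

    def forward(self, input_ids, attention_mask=None):
        outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        cls_state = outputs.last_hidden_state[:, 0]  # [CLS] is the first token
        return self.classifier(self.dropout(cls_state))
```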
3.2.3 Regression. While the focus of this paper is classification, we also implement regression with almost the same architecture on a different dataset with continuous targets. The only difference is that the loss function used is mean squared error instead of cross entropy loss.
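The two heads differ only in output dimension and loss function; a sketch (768 is BERT-base's hidden size):

```python
import torch.nn as nn

hidden_size = 768                                   # BERT-base hidden size
classification_head = nn.Linear(hidden_size, 3)     # trained with nn.CrossEntropyLoss()
regression_head = nn.Linear(hidden_size, 1)         # trained with nn.MSELoss()
```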
3.2.4 Training strategies to prevent catastrophic forgetting. As pointed out by Howard and Ruder (2018) [5], catastrophic forgetting is a significant danger with this fine-tuning approach, because the fine-tuning procedure can quickly cause the model to "forget" the information from the language modeling task as it tries to adapt to the new task. In order to deal with this phenomenon, we apply the three techniques proposed by Howard and Ruder (2018): slanted triangular learning rates, discriminative fine-tuning and gradual unfreezing.

A slanted triangular learning rate applies a learning rate schedule in the shape of a slanted triangle: the learning rate first increases linearly up to some point, and after that point decreases linearly.

Discriminative fine-tuning uses lower learning rates for lower layers of the network. Assume our learning rate at layer l is α_l. Then for a discrimination rate of θ, we calculate the learning rate for layer l−1 as α_{l−1} = θ·α_l. The assumption behind this method is that the lower layers represent deep-level language information, while the upper ones include information for the actual classification task; therefore we fine-tune them to different extents.

With gradual unfreezing, we start training with all layers but the classifier layer frozen. During training we gradually unfreeze all of the layers starting from the highest one, so that the lower-level features become the least fine-tuned ones. Hence, during the initial stages of training the model is prevented from "forgetting" the low-level language information that it learned during pre-training.
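Below is a minimal sketch of the first two schedules in PyTorch; the layer ordering, θ = 0.85 and the step counts are illustrative assumptions rather than the thesis's exact settings (gradual unfreezing is sketched in Section 6.2).

```python
import torch

def discriminative_param_groups(layers, base_lr=2e-5, theta=0.85):
    """One optimizer parameter group per layer, with lr(l-1) = theta * lr(l).

    `layers` is ordered from lowest (embeddings) to highest (classifier).
    """
    groups, lr = [], base_lr
    for layer in reversed(layers):          # start from the top layer
        groups.append({"params": list(layer.parameters()), "lr": lr})
        lr *= theta                         # lower layers get smaller rates
    return groups

def slanted_triangular(step, total_steps, warmup=0.2):
    """Multiplier that rises linearly to 1, then decays linearly to 0."""
    cut = max(1, int(total_steps * warmup))
    if step < cut:
        return step / cut
    return max(0.0, (total_steps - step) / max(1, total_steps - cut))

# Usage sketch: LambdaLR scales each group's base rate, so the
# discriminative rates and the triangular schedule compose.
# optimizer = torch.optim.Adam(discriminative_param_groups(layers))
# scheduler = torch.optim.lr_scheduler.LambdaLR(
#     optimizer, lambda step: slanted_triangular(step, total_steps=500))
```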
4 EXPERIMENTAL SETUP

4.1 Research Questions

We aim to answer the following research questions:
(RQ1) What is the performance of FinBERT in short sentence classification compared with other transfer learning methods like ELMo and ULMFit?
(RQ2) How does FinBERT compare to the state-of-the-art in financial sentiment analysis with discrete or continuous targets?
(RQ3) How does further pre-training BERT on a financial domain corpus, or on the target corpus, affect classification performance?
(RQ4) What are the effects of training strategies like slanted triangular learning rates, discriminative fine-tuning and gradual unfreezing on classification performance? Do they prevent catastrophic forgetting?
(RQ5) Which encoder layer performs best (or worst) for sentence classification?
(RQ6) How much fine-tuning is enough? That is, after pre-training, how many layers should be fine-tuned to achieve comparable performance to fine-tuning the whole model?

Table 1: Distribution of sentiment labels and agreement levels in Financial PhraseBank

Agreement level   Positive   Negative   Neutral   Count
100%              25.2%      13.4%      61.4%     2262
75% - 99%         26.6%      9.8%       63.6%     1191
66% - 74%         36.7%      12.3%      50.9%     765
50% - 65%         31.1%      14.4%      54.5%     627
All               28.1%      12.4%      59.4%     4845
4.2 Datasets

4.2.1 TRC2-financial. In order to further pre-train BERT, we use a financial corpus we call TRC2-financial. It is a subset of Reuters' TRC2, which consists of 1.8M news articles published by Reuters between 2008 and 2010. We filter for financial keywords in order to make the corpus more relevant and to keep it within the limits of the compute power available. The resulting corpus, TRC2-financial, includes 46,143 documents with more than 29M words and nearly 400K sentences. (The corpus can be obtained for research purposes by applying at https://trec.nist.gov/data/reuters/reuters.html.)

4.2.2 Financial PhraseBank. The main sentiment analysis dataset used in this paper is the Financial PhraseBank from Malo et al. (2014) [17]. Financial PhraseBank consists of 4845 English sentences selected randomly from financial news found in the LexisNexis database. These sentences were then annotated by 16 people with a background in finance and business. The annotators were asked to give labels according to how they think the information in the sentence might affect the mentioned company's stock price. The dataset also includes information regarding the agreement levels on sentences among annotators. The distribution of agreement levels and sentiment labels can be seen in Table 1. We set aside 20% of all sentences as a test set and 20% of the remainder as a validation set. In the end, our train set includes 3101 examples. For some of the experiments, we also make use of 10-fold cross validation.

Figure 1: Overview of pre-training, further pre-training and classification fine-tuning
4.2.3 FiQA. FiQA [15] is a dataset that was created for the WWW '18 conference financial opinion mining and question answering challenge. We use the data for Task 1, which includes 1,174 financial news headlines and tweets with their corresponding sentiment scores. Unlike Financial PhraseBank, the targets for this dataset are continuous, ranging between [−1, 1], with 1 being the most positive. Each example also has information regarding which financial entity is targeted in the sentence. We do 10-fold cross validation for evaluation of the model on this dataset. (The data can be found at https://sites.google.com/view/fiqa/home.)

4.3 Baseline Methods

For contrastive experiments, we consider baselines with three different methods: an LSTM classifier with GloVe embeddings, an LSTM classifier with ELMo embeddings, and a ULMFit classifier. It should be noted that these baseline methods are not experimented with as thoroughly as BERT; therefore, the results should not be interpreted as definitive conclusions that one method is better than another.
4.3.1 LSTM classifiers. We implement two classifiers using bidirectional LSTM models. In both of them, a hidden size of 128 is used, with the last hidden state having size 256 due to bidirectionality. A fully connected feed-forward layer maps the last hidden state to a vector of three, representing the likelihoods of the three labels. The difference between the two models is that one uses GloVe embeddings, while the other uses ELMo embeddings. A dropout probability of 0.3 and a learning rate of 3e-5 are used in both models. We train them until there is no improvement in validation loss for 10 epochs.
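A sketch of this baseline architecture; the embedding tensor is assumed to be pre-computed (GloVe or ELMo vectors), and the sizes match the description above.

```python
import torch
import torch.nn as nn

class BiLSTMClassifier(nn.Module):
    """Baseline sketch: bidirectional LSTM over pre-computed embeddings."""

    def __init__(self, emb_dim=300, hidden=128, num_labels=3, dropout=0.3):
        super().__init__()
        self.lstm = nn.LSTM(emb_dim, hidden, batch_first=True, bidirectional=True)
        self.dropout = nn.Dropout(dropout)
        self.out = nn.Linear(2 * hidden, num_labels)   # 256-dim due to bidirectionality

    def forward(self, embeddings):
        # embeddings: (batch, seq_len, emb_dim) of GloVe or ELMo vectors
        _, (h_n, _) = self.lstm(embeddings)
        last = torch.cat([h_n[0], h_n[1]], dim=-1)     # concat both directions
        return self.out(self.dropout(last))
```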
4.3.2 ULMFit. As explained in Section 3.1.3, classification with ULMFit consists of three steps. The first step, pre-training a language model, is already done, and the pre-trained weights have been released by Howard and Ruder (2018). We first further pre-train the AWD-LSTM language model on the TRC2-financial corpus for 3 epochs. After that, we fine-tune the model for classification on the Financial PhraseBank dataset, by adding a fully connected layer to the output of the pre-trained language model.
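A sketch of this pipeline using the fastai v1 API, as an assumption about tooling; the paths, file names and epoch counts are illustrative.

```python
from fastai.text import (TextLMDataBunch, TextClasDataBunch, AWD_LSTM,
                         language_model_learner, text_classifier_learner)

data_lm = TextLMDataBunch.from_csv("data/", "trc2_financial.csv")   # hypothetical files
lm = language_model_learner(data_lm, AWD_LSTM, drop_mult=0.3)
lm.fit_one_cycle(3)                     # further pre-training on TRC2-financial
lm.save_encoder("finance_encoder")

data_clas = TextClasDataBunch.from_csv("data/", "phrasebank.csv", vocab=data_lm.vocab)
clf = text_classifier_learner(data_clas, AWD_LSTM, drop_mult=0.3)
clf.load_encoder("finance_encoder")
clf.fit_one_cycle(1)                    # fine-tune the classifier head first
clf.unfreeze()
clf.fit_one_cycle(2)                    # then fine-tune the whole model
```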
4.4 Evaluation Metrics

For evaluation of classification models, we use three metrics: accuracy, cross entropy loss and macro F1 average. We weight the cross entropy loss with the square root of the inverse frequency rate; for example, if a label constitutes 25% of all examples, we weight the loss attributed to that label by 2. The macro F1 average calculates F1 scores for each of the classes and then takes their average. Since our data, the Financial PhraseBank, suffers from label imbalance (almost 60% of all sentences are neutral), this gives another good measure of classification performance. For evaluation of the regression model, we report mean squared error and R², as these are both standard and also reported by the state-of-the-art papers for the FiQA dataset.
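The class weighting can be computed directly from the label frequencies in Table 1 (All row); a sketch:

```python
import math
import torch
import torch.nn as nn

# Label frequencies from Table 1 (All row): positive, negative, neutral.
freqs = [0.281, 0.124, 0.594]
weights = torch.tensor([math.sqrt(1.0 / f) for f in freqs])
loss_fn = nn.CrossEntropyLoss(weight=weights)  # e.g. a 25% label gets weight 2
```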
4.5 Implementation Details

For our FinBERT implementation, we use a dropout probability of 0.1, a warm-up proportion of 0.2, a maximum sequence length of 64 tokens and a learning rate of 2e-5.

5 EXPERIMENTAL RESULTS

5.1 Financial PhraseBank

The results of FinBERT, the baseline methods and the state-of-the-art on the Financial PhraseBank classification task can be seen in Table 2. We present results on both the whole dataset and the subset with 100% annotator agreement.

Table 2: Experimental results on the Financial PhraseBank dataset

                  All data                       Data with 100% agreement
Model             Loss   Accuracy   F1 Score     Loss   Accuracy   F1 Score
LSTM              0.81   0.71       0.64         0.57   0.81       0.74
LSTM with ELMo    0.72   0.75       0.70         0.50   0.84       0.77
ULMFit            0.41   0.83       0.79         0.20   0.93       0.91
LPS               -      0.71       0.71         -      0.79       0.80
HSC               -      0.71       0.76         -      0.83       0.86
FinSSLX           -      -          -            -      0.91       0.88
FinBERT           0.37   0.86       0.84         0.13   0.97       0.95
Bold face indicates the best result in the corresponding metric. LPS [17], HSC [8] and FinSSLX [14] results are taken from their respective papers. For LPS and HSC, overall accuracy is not reported in the papers; we calculated it using the recall scores reported for the different classes. For the models implemented by us, we report 10-fold cross validation results.

For all of the measured metrics, FinBERT performs clearly the best, among both the methods we implemented ourselves (LSTM and ULMFit) and the models reported by other papers (LPS [17], HSC [8], FinSSLX [14]). The LSTM classifier with no language model information performs the worst. In terms of accuracy, it is close to LPS and HSC (even better than LPS on examples with full agreement); however, it produces a low F1 score, because it performs much better on the neutral class. The LSTM classifier with ELMo embeddings improves upon the LSTM with static embeddings in all of the measured metrics. It still suffers from a low average F1 score due to poor performance on the less represented labels, but its performance is comparable with LPS and HSC, besting them in accuracy. So contextualized word embeddings produce performance close to the machine learning based methods for a dataset of this size.

ULMFit significantly improves on all of the metrics, and it doesn't suffer from the model performing much better on some classes than on others. It also handily beats the machine learning based models LPS and HSC, which shows the effectiveness of language model pre-training. AWD-LSTM is a very large model and would be expected to suffer from over-fitting with this small a dataset, but due to language model pre-training and effective training strategies, it is able to overcome the small data problem. ULMFit also outperforms FinSSLX, which has a text simplification step as well as pre-training of word embeddings on a large financial corpus with sentiment labels.

FinBERT outperforms ULMFit, and consequently all of the other methods, in all metrics. In order to measure the performance of the models on different sizes of labeled training datasets, we ran the LSTM classifiers, ULMFit and FinBERT on 5 different configurations. The results can be seen in Figure 2, where the cross entropy losses on the test set for each model are drawn. 100 training examples is too low for all of the models. However, once the training size reaches 250, ULMFit and FinBERT start to successfully differentiate between labels, with an accuracy as high as 80% for FinBERT. All of the methods consistently get better with more data, but ULMFit and FinBERT do better with 250 examples than the LSTM classifiers do with the whole dataset. This shows the effectiveness of language model pre-training.
Figure 2: Test loss for different training set sizes
5.2 FiQA

The results for the FiQA sentiment dataset are presented in Table 3. Our model outperforms the state-of-the-art models in both MSE and R². It should be noted that the test set these two papers [31] [24] use is the official FiQA Task 1 test set. Since we don't have access to it, we report the results with 10-fold cross validation. There is no indication in [15] that the train and test sets they publish come from different distributions, and our model can be interpreted to be at a disadvantage, since we need to set aside a subset of the training set as a test set, while the state-of-the-art papers can use the complete training set.

Table 3: Experimental results on the FiQA sentiment dataset

Model                      MSE    R²
Yang et al. (2018)         0.08   0.40
Piao and Breslin (2018)    0.09   0.41
FinBERT                    0.07   0.55

Bold face indicates the best result in the corresponding metric. Yang et al. (2018) [31] and Piao and Breslin (2018) [24] report results on the official test set. Since we don't have access to that set, our MSE and R² are calculated with 10-fold cross validation.

6 FURTHER ANALYSIS

6.1 Effects of further pre-training

We first measure the effect of further pre-training on the performance of the classifier. We compare three models: 1) no further pre-training (denoted by Vanilla BERT), 2) further pre-training on the classification training set (denoted by FinBERT-task), and 3) further pre-training on the domain corpus, TRC2-financial (denoted by FinBERT-domain). Models are evaluated with loss, accuracy and macro average F1 score on the test dataset. The results can be seen in Table 4.

Table 4: Performance with different pre-training strategies

Model            Loss   Accuracy   F1 Score
Vanilla BERT     0.38   0.85       0.84
FinBERT-task     0.39   0.86       0.85
FinBERT-domain   0.37   0.86       0.84

Bold face indicates the best result in the corresponding metric. Results are reported on 10-fold cross validation.

The classifier that was further pre-trained on the financial domain corpus performs best among the three, though the difference is not very large. There might be four reasons behind this result: 1) the corpus might have a different distribution than the task set, 2) BERT classifiers might not improve significantly with further pre-training, 3) short sentence classification might not benefit significantly from further pre-training, and 4) performance is already so good that there is not much room for improvement. We think the last explanation is the likeliest, because for the subset of the Financial PhraseBank on which all annotators agree, the accuracy of Vanilla BERT is already 0.96. Performance on the other agreement levels should be lower, as even humans can't fully agree on them. More experiments with another labeled financial dataset would be necessary to conclude that the effect of further pre-training on a domain corpus is not significant.
6.2 Catastrophic forgetting

For measuring the performance of the techniques against catastrophic forgetting, we try the following settings: no adjustment (None), only the slanted triangular learning rate (STL), the slanted triangular learning rate with gradual unfreezing (STL + GU), the slanted triangular learning rate with discriminative fine-tuning (STL + DFT), and all three techniques together. We report the performance of these settings with the loss on the test set and the trajectory of validation loss over training epochs. The results can be seen in Table 5 and Figure 3.

Figure 3: Validation loss trajectories with different training strategies

Table 5: Performance with different fine-tuning strategies

Strategy    Loss   Accuracy   F1 Score
None        0.48   0.83       0.83
STL         0.40   0.81       0.82
STL + GU    0.40   0.86       0.84
STL + DFT   0.42   0.79       0.79
All three   0.37   0.86       0.84

Bold face indicates the best result in the corresponding metric. Results are reported on 10-fold cross validation. STL: slanted triangular learning rates, GU: gradual unfreezing, DFT: discriminative fine-tuning.

Applying all three of the strategies produces the best performance in terms of test loss and accuracy. Gradual unfreezing and discriminative fine-tuning have the same reasoning behind them: higher-level features should be fine-tuned more than lower-level ones, since the information learned from language modeling is mostly present in the lower levels. We see from Table 5 that using discriminative fine-tuning together with slanted triangular learning rates performs worse than using slanted triangular learning rates alone. This suggests that gradual unfreezing is the most important technique for our case.

One way catastrophic forgetting can show itself is a sudden increase in validation loss after several epochs: as the model is trained, it quickly starts to overfit when no countermeasure is taken. As can be seen in Figure 3, that is the case when none of the aforementioned techniques are applied; the model achieves its best performance on the validation set after the first epoch and then starts to overfit. With all three techniques applied, the model is much more stable. The other combinations lie between these two cases.
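Gradual unfreezing itself reduces to a small loop over the ordered layers; a sketch with names of our own (the helper and its calling convention are assumptions for illustration):

```python
def gradual_unfreeze(layers, epoch):
    """Make the top `epoch + 1` layers trainable, freezing the rest.

    `layers` is ordered from lowest (embeddings) to highest; the classifier
    head (last element) is trainable from the start.
    """
    n = len(layers)
    for i, layer in enumerate(layers):
        trainable = i >= max(0, n - 1 - epoch)  # unfreeze one more layer per epoch
        for p in layer.parameters():
            p.requires_grad = trainable
```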
6.3 Choosing the right layer for classification

BERT has 12 Transformer encoder layers, and it is not necessarily a given that the last layer captures the most relevant information for the classification task during language model training. For this experiment, we investigate which of the 12 Transformer encoder layers gives the best result for classification. We put the classification layer after the [CLS] token's representation from the respective layer. We also try taking the average of all layers.

Table 6: Performance with different encoder layers used for classification

Layer for classification   Loss   Accuracy   F1 Score
Layer-1                    0.65   0.76       0.77
Layer-2                    0.54   0.78       0.78
Layer-3                    0.52   0.76       0.77
Layer-4                    0.48   0.80       0.77
Layer-5                    0.52   0.80       0.80
Layer-6                    0.45   0.82       0.82
Layer-7                    0.43   0.82       0.83
Layer-8                    0.44   0.83       0.81
Layer-9                    0.41   0.84       0.82
Layer-10                   0.42   0.83       0.82
Layer-11                   0.38   0.84       0.83
Layer-12                   0.37   0.86       0.84
All layers - mean          0.41   0.84       0.84

As shown in Table 6, the last layer contributes the most to model performance in terms of all the measured metrics. This might be indicative of two factors: 1) when the higher layers are used, the model being trained is larger and hence possibly more powerful, and 2) the lower layers capture deeper semantic information, and hence struggle to fine-tune that information for classification.
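Per-layer [CLS] representations can be extracted directly from the encoder; the sketch below uses the Hugging Face transformers API as an assumption (the thesis uses its own implementation based on the released BERT weights).

```python
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased", output_hidden_states=True)

inputs = tokenizer("Pre-tax loss totaled euro 0.3 million.", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs)

states = out.hidden_states                 # embeddings + one entry per encoder layer
cls_layer9 = states[9][:, 0]               # [CLS] vector after encoder layer 9
cls_mean = torch.stack(states[1:]).mean(0)[:, 0]   # mean over all 12 encoder layers
```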
6.4 Training only a subset of the layers

BERT is a very large model, and even on small datasets, fine-tuning the whole model requires significant time and computing power. Therefore, if a slightly lower performance can be achieved by fine-tuning only a subset of all parameters, it might be preferable in some contexts, especially if the training set is very large; this change might make BERT more convenient to use. Here we experiment with fine-tuning only the last k encoder layers.

The results are presented in Table 7. Fine-tuning only the classification layer does not come close to the performance of fine-tuning other layers as well. However, fine-tuning only the last layer already handily outperforms state-of-the-art machine learning methods like HSC. After Layer-9, the performance becomes virtually the same, only to be outperformed by fine-tuning the whole model. This result shows that in order to utilize BERT, an expensive training of the whole model is not mandatory: a fair trade-off can be made, with a small decrease in model performance for much less training time.
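Freezing the lower layers amounts to switching off their gradients; a sketch assuming a Hugging Face-style model with `bert.embeddings` and `bert.encoder.layer` (the attribute names are assumptions):

```python
def unfreeze_from(model, first_layer):
    """Fine-tune only encoder layers >= first_layer, plus the classifier head.

    first_layer follows Table 7's numbering: 0 trains everything including
    the embeddings; 9 trains only Layer-9 through Layer-12 and the head.
    """
    for p in model.bert.embeddings.parameters():
        p.requires_grad = first_layer == 0
    for i, layer in enumerate(model.bert.encoder.layer):
        for p in layer.parameters():
            p.requires_grad = i + 1 >= first_layer
```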
Table 7: Performance when starting fine-tuning from different layers

First layer unfrozen   Loss   Accuracy   Training time
Embeddings layer       0.37   0.86       332s
Layer-1                0.39   0.83       302s
Layer-2                0.39   0.83       291s
Layer-3                0.38   0.83       272s
Layer-4                0.38   0.82       250s
Layer-5                0.40   0.83       240s
Layer-6                0.40   0.81       220s
Layer-7                0.39   0.82       205s
Layer-8                0.39   0.84       188s
Layer-9                0.39   0.84       172s
Layer-10               0.41   0.84       158s
Layer-11               0.45   0.82       144s
Layer-12               0.47   0.81       133s
Classification layer   1.04   0.52       119s

6.5 Where does the model fail?

With 97% accuracy on the subset of the Financial PhraseBank with 100% annotator agreement, we think it is an interesting exercise to examine the cases where the model failed to predict the true label. In this section we present several examples where the model makes the wrong prediction. In Malo et al. (2014) [17], it is indicated that most of the inter-annotator disagreements are between the positive and neutral labels (agreement for separating positive-negative, negative-neutral and positive-neutral is 98.7%, 94.2% and 75.2% respectively). The authors attribute this to the difficulty of distinguishing "commonly used company glitter and actual positive statements". We also present the confusion matrix, in order to observe whether this is the case for FinBERT as well.
Example 1: "Pre-tax loss totaled euro 0.3 million, compared to a loss of euro 2.2 million in the first quarter of 2005."
True value: Positive. Predicted: Negative.

Example 2: "This implementation is very important to the operator, since it is about to launch its Fixed to Mobile convergence service in Brazil."
True value: Neutral. Predicted: Positive.

Example 3: "The situation of coated magazine printing paper will continue to be weak."
True value: Negative. Predicted: Neutral.
The first example illustrates the most common type of failure: the model fails to do the math to determine which figure is higher, and in the absence of words indicating direction, like "increased", it might predict neutral. However, there are many similar cases where it does make the true prediction. Examples 2 and 3 are different versions of the same type of failure: the model fails to distinguish a neutral statement about a given situation from a statement that indicates polarity about the company. In the third example, information about the company's business would probably help.

The confusion matrix is presented in Figure 4. 73% of the failures happen between the positive and neutral labels, while the same number is 5% for negative and positive. That is consistent with both the inter-annotator agreement numbers and common sense: it is easier to differentiate between positive and negative, but it can be more challenging to decide whether a statement indicates a positive outlook or merely an objective observation.

Figure 4: Confusion matrix

7 CONCLUSION AND FUTURE WORK

In this paper, we implemented BERT for the financial domain by further pre-training it on a financial corpus and fine-tuning it for sentiment analysis (FinBERT). To the best of our knowledge, this work is the first application of BERT for finance, and one of the few that experiments with further pre-training on a domain-specific corpus. On both of the datasets we used, we achieved state-of-the-art results by a significant margin. For the classification task, we increased the state-of-the-art by 15% in accuracy.

In addition to BERT, we also implemented other pre-trained language models, ELMo and ULMFit, for comparison purposes. ULMFit, further pre-trained on a financial corpus, beat the previous state-of-the-art for the classification task, though to a smaller degree than BERT. These results show the effectiveness of pre-trained language models for a down-stream task such as sentiment analysis, especially with a small labeled dataset. The complete dataset included more than 3000 examples, but FinBERT was able to surpass the previous state-of-the-art even with a training set as small as 500 examples. This is an important result, since deep learning techniques for NLP have traditionally been labeled as too "data-hungry", which is apparently no longer the case.

We conducted extensive experiments with BERT, investigating the effects of further pre-training and several training strategies. We couldn't conclude that further pre-training on a domain-specific corpus was significantly better than not doing so for our case. Our theory is that BERT already performs well enough with our dataset that there is not much room for the improvement further pre-training can provide. We also found that learning rate regimes that fine-tune the higher layers more aggressively than the lower ones perform better and are more effective at preventing catastrophic forgetting. Another conclusion from our experiments was that comparable performance can be achieved with much less training time by fine-tuning only the last 2 layers of BERT.

Financial sentiment analysis is not a goal on its own; it is only as useful as the financial decisions it can support. One way our work might be extended is using FinBERT directly with stock market return data (both in terms of directionality and volatility) on financial news. FinBERT is good enough for extracting explicit sentiments, but modeling implicit information that is not necessarily apparent even to those writing the text should be a challenging task.
Another possible extension could be using FinBERT for other natural language processing tasks, such as named entity recognition or question answering, in the financial domain.
ACKNOWLEDGEMENTS

I would like to show my gratitude to Pengjie Ren and Zulkuf Genc for their excellent supervision. They provided me with both the independence to set my own course for the research and valuable suggestions when I needed them. I would also like to thank the Naspers AI team for entrusting me with this project and always encouraging me to share my work. I am grateful to NIST for sharing the Reuters TRC-2 corpus with me, and to Malo et al. for making the excellent Financial PhraseBank publicly available.
REFERENCES
[1] Basant Agarwal and Namita Mittal. 2016. Machine Learning Approach for Sentiment Analysis. Springer International Publishing, Cham, 21–45. https://doi.org/10.1007/978-3-319-25343-5_3
[2] Oscar Araque, Ignacio Corcuera-Platas, J. Fernando Sánchez-Rada, and Carlos A. Iglesias. 2017. Enhancing deep learning sentiment analysis with ensemble techniques in social applications. Expert Systems with Applications 77 (jul 2017), 236–246. https://doi.org/10.1016/j.eswa.2017.02.002
[3] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. (2018). arXiv:1810.04805
[4] Li Guo, Feng Shi, and Jun Tu. 2016. Textual analysis and machine learning: Crack unstructured data in finance and accounting. The Journal of Finance and Data Science 2, 3 (sep 2016), 153–170. https://doi.org/10.1016/J.JFDS.2017.02.001
[5] Jeremy Howard and Sebastian Ruder. 2018. Universal Language Model Fine-tuning for Text Classification. (jan 2018). arXiv:1801.06146 http://arxiv.org/abs/1801.06146
[6] Neel Kant, Raul Puri, Nikolai Yakovenko, and Bryan Catanzaro. 2018. Practical Text Classification With Large Pre-Trained Language Models. (2018). arXiv:1812.01207 http://arxiv.org/abs/1812.01207
[7] Mathias Kraus and Stefan Feuerriegel. 2017. Decision support from financial disclosures with deep neural networks and transfer learning. Decision Support Systems 104 (2017), 38–48. https://doi.org/10.1016/j.dss.2017.10.001 arXiv:1710.03954
[8] Srikumar Krishnamoorthy. 2018. Sentiment analysis of financial news articles using performance indicators. Knowledge and Information Systems 56, 2 (aug 2018), 373–394. https://doi.org/10.1007/s10115-017-1134-1
[9] Xiaodong Li, Haoran Xie, Li Chen, Jianping Wang, and Xiaotie Deng. 2014. News impact on stock price return via sentiment analysis. Knowledge-Based Systems 69 (oct 2014), 14–23. https://doi.org/10.1016/j.knosys.2014.04.022
[10] Bing Liu. 2012. Sentiment Analysis and Opinion Mining. Synthesis Lectures on Human Language Technologies 5, 1 (may 2012), 1–167. https://doi.org/10.2200/s00416ed1v01y201204hlt016
[11] Tim Loughran and Bill McDonald. 2011. When Is a Liability Not a Liability? Textual Analysis, Dictionaries, and 10-Ks. Journal of Finance 66, 1 (feb 2011), 35–65. https://doi.org/10.1111/j.1540-6261.2010.01625.x
[12] Tim Loughran and Bill McDonald. 2016. Textual Analysis in Accounting and Finance: A Survey. Journal of Accounting Research 54, 4 (2016), 1187–1230. https://doi.org/10.1111/1475-679X.12123
[13] Bernhard Lutz, Nicolas Pröllochs, and Dirk Neumann. 2018. Sentence-Level Sentiment Analysis of Financial News Using Distributed Text Representations and Multi-Instance Learning. Technical Report. arXiv:1901.00400 http://arxiv.org/abs/1901.00400
[14] Macedo Maia, André Freitas, and Siegfried Handschuh. 2018. FinSSLx: A Sentiment Analysis Model for the Financial Domain Using Text Simplification. In 2018 IEEE International Conference on Semantic Computing (ICSC). IEEE, 318–319. https://doi.org/10.1109/ICSC.2018.00065
[15] Macedo Maia, Siegfried Handschuh, André Freitas, Brian Davis, Ross McDermott, Manel Zarrouk, and Alexandra Balahur. 2018. WWW '18: Companion of The Web Conference 2018, Lyon, France, April 23-27, 2018. ACM. https://doi.org/10.1145/3184558
[16] Burton G. Malkiel. 2003. The Efficient Market Hypothesis and Its Critics. Journal of Economic Perspectives 17, 1 (feb 2003), 59–82.
[17] Pekka Malo, Ankur Sinha, Pekka Korhonen, Jyrki Wallenius, and Pyry Takala. 2014. Good debt or bad debt: Detecting semantic orientations in economic texts. Journal of the Association for Information Science and Technology 65, 4 (2014), 782–796. https://doi.org/10.1002/asi.23062 arXiv:1307.5336v2
[18] Gary Marcus. 2018. Deep Learning: A Critical Appraisal. arXiv e-prints (Jan. 2018). arXiv:cs.AI/1801.00631
[19] Justin Martineau and Tim Finin. 2009. Delta TFIDF: An Improved Feature Space for Sentiment Analysis. In ICWSM, Eytan Adar, Matthew Hurst, Tim Finin, Natalie S. Glance, Nicolas Nicolov, and Belle L. Tseng (Eds.). The AAAI Press. http://dblp.uni-trier.de/db/conf/icwsm/icwsm2009.html
[20] Bryan McCann, James Bradbury, Caiming Xiong, and Richard Socher. 2017. Learned in Translation: Contextualized Word Vectors. In Advances in Neural Information Processing Systems.
[21] Stephen Merity, Nitish Shirish Keskar, and Richard Socher. 2017. Regularizing and Optimizing LSTM Language Models. CoRR abs/1708.02182 (2017). arXiv:1708.02182 http://arxiv.org/abs/1708.02182
[22] Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global Vectors for Word Representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics, Doha, Qatar, 1532–1543. https://doi.org/10.3115/v1/D14-1162
[23] Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. (2018). https://doi.org/10.18653/v1/N18-1202 arXiv:1802.05365
[24] Guangyuan Piao and John G. Breslin. 2018. Financial Aspect and Sentiment Predictions with Deep Neural Networks. In Companion of The Web Conference 2018. 1973–1977. https://doi.org/10.1145/3184558.3191829
[25] Aliaksei Severyn and Alessandro Moschitti. 2015. Twitter Sentiment Analysis with Deep Convolutional Neural Networks. In Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval - SIGIR '15. ACM Press. https://doi.org/10.1145/2766462.2767830
[26] Sahar Sohangir, Dingding Wang, Anna Pomeranets, and Taghi M. Khoshgoftaar. 2018. Big Data: Deep Learning for financial sentiment analysis. Journal of Big Data 5, 1 (2018). https://doi.org/10.1186/s40537-017-0111-6
[27] Chi Sun, Xipeng Qiu, Yige Xu, and Xuanjing Huang. 2019. How to Fine-Tune BERT for Text Classification? (2019). arXiv:1905.05583 http://arxiv.org/abs/1905.05583
[28] Abinash Tripathy, Ankit Agrawal, and Santanu Kumar Rath. 2016. Classification of sentiment reviews using n-gram machine learning approach. Expert Systems with Applications 57 (sep 2016), 117–126. https://doi.org/10.1016/j.eswa.2016.03.028
[29] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention Is All You Need. In NIPS (2017). arXiv:1706.03762 http://arxiv.org/abs/1706.03762
[30] Casey Whitelaw, Navendu Garg, and Shlomo Argamon. 2005. Using appraisal groups for sentiment analysis. In Proceedings of the 14th ACM International Conference on Information and Knowledge Management - CIKM '05. ACM Press. https://doi.org/10.1145/1099554.1099714
[31] Steve Yang, Jason Rosenfeld, and Jacques Makutonin. 2018. Financial Aspect-Based Sentiment Analysis using Deep Representations. (2018). arXiv:1808.07931 http://arxiv.org/abs/1808.07931
[32] Lei Zhang, Shuai Wang, and Bing Liu. 2018. Deep learning for sentiment analysis: A survey. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 8, 4 (mar 2018), e1253. https://doi.org/10.1002/widm.1253
[33] Yukun Zhu, Ryan Kiros, Richard Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. 2015. Aligning Books and Movies: Towards Story-like Visual Explanations by Watching Movies and Reading Books. (jun 2015). arXiv:1506.06724 http://arxiv.org/abs/1506.06724