Modeling Institutional Credit Risk with Financial News
Tam Tran-The
MassMutual Data Science, 470 Atlantic Ave, Boston, Massachusetts
[email protected]
Abstract
Credit risk management, the practice of mitigating losses by understanding the adequacy of a borrower's capital and loan loss reserves, has long been imperative to any financial institution's long-term sustainability and growth. MassMutual is no exception. The company is keen on effectively monitoring downgrade risk, or the risk associated with the event when the credit rating of a company deteriorates. Current work in downgrade risk modeling depends on multiple variations of quantitative measures provided by third-party rating agencies and risk management consultancy companies. As these structured numerical data become increasingly commoditized among institutional investors, there has been a wide push into using alternative sources of data, such as financial news, earnings call transcripts, or social media content, to possibly gain a competitive edge in the industry. The volume of qualitative information or unstructured text data has exploded in the past decades and is now available for due diligence to supplement quantitative measures of credit risk. This paper proposes a predictive downgrade model using solely news data represented by neural network embeddings. The standalone model achieves an Area Under the Receiver Operating Characteristic Curve (AUC) of more than 80%. The output probability from this news model, as an additional feature, improves the performance of our benchmark model using only quantitative measures by more than 5% in terms of both AUC and recall rate. A qualitative evaluation also indicates that news articles related to our predicted downgrade events are especially relevant and high-quality in our business context.
1 Introduction

Credit risk refers to the possibility of loss resulting from a borrower's or a bond issuer's failure to repay a loan or meet contractual obligations. One subcategory of credit risk is downgrade risk, which occurs when third-party rating agencies, such as Moody's and S&P, lower their ratings on a bond or a company. For example, a change by S&P from a B to a CCC rating is considered a downgrade event. Such rating information is used extensively by regulatory organizations, specifically the National Association of Insurance Commissioners (NAIC), to ensure the financial solvency of insurance companies. More precisely, the NAIC requires a company with a higher exposure to risk to hold a higher amount of capital in reserve. Accurately monitoring rating classes and potential deterioration events is therefore critical to MassMutual. Receiving enough notice before any possible impending downgrade event would help the company manage its investment portfolio more efficiently by preparing the necessary capital and/or switching holding positions.

In this paper, we design 3 natural language processing (NLP) frameworks to recognize credit-relevant patterns (i.e., future 1-year downgrade events) in news articles about more than 2.2K companies of interest. The study demonstrates that with an appropriate methodology, we can achieve an AUC of more than 80% using solely news data to predict downgrade events and yield a more than 5% performance gain when this adverse credit signal in media coverage is added to the quantitative downgrade risk model.

The structure of this paper is as follows: Section 2 provides background on credit risk modeling and the NLP techniques used to convert text into meaningful numerical data; Section 3 describes the dataset, benchmark model and main model development; Section 4 presents performance results and stability test details of the final model; and Section 5 discusses limitations of this work and future directions.
2 Background

The motivation to develop credit risk models stems from the need to construct quantitative estimates of the amount of economic capital needed to support a financial institution's risk-taking activities. Specifically, minimum capital in reserve is often set in proportion to the risk exposure of a company's portfolio. Typical credit risk models take as input the conditions of the general economy and those of the specific firms in question, and generate as output a credit quality measure.
When it comes to using text in a machine learning model, one of the main challenges is how to represent texts as numerical inputs so that we can feed them into the model. This project aims to experiment with different methodologies to represent news articles and evaluate the algorithms' performances based on the downgrade prediction task. The following are some NLP techniques we use in the study:

Latent Dirichlet Allocation (LDA). As it is believed that every article we have is composed of major themes, we use topic modeling to extract the hidden thematic structure in the text. Topic modeling is a type of statistical model used to discover the abstract topics that occur in a collection of documents. Latent Dirichlet allocation (LDA), a generative probabilistic model, is a common topic modeling algorithm. The basic idea is that documents are represented as random mixtures over latent topics, where each topic is characterized by a distribution over words. Mathematically, this can be formulated as:

$$P(\theta, z, w \mid \alpha, \beta) = P(\theta \mid \alpha) \prod_{n=1}^{N} P(z_n \mid \theta)\, P(w_n \mid z_n, \beta)$$

Given α, a parameter vector on the per-document topic distributions, and β, a parameter vector on the per-topic word distributions, we are finding the joint distribution of a topic mixture θ, a set of N topics z, and a set of N words w (Blei et al. 2003). In the context of our study, each news article is represented as a set of topic probabilities, and the number of topics is a tunable parameter. To evaluate an LDA model, we use a coherence measure, which gives an estimate of how well each topic can be represented as a composition of parts that can be combined (Röder, Both, and Hinneburg 2015).

Sentiment Lexicons. There is a growing body of sentiment lexicons, or affective word lists, to examine the tone and sentiment of textual data. Some lexicons we experiment with are:
• Loughran and McDonald, designed to particularly reflect tone in financial texts (Loughran and McDonald 2011)
• VADER, specifically attuned to sentiment in microblog-like contexts (Hutto and Gilbert 2014)
• AFINN, also constructed based on micro-blog posts including Internet slang words (Nielsen 2011)
• SentiWordNet and OPINION, for general sentiment analysis
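To make the LDA formulation above concrete, here is a minimal sketch (not the paper's code) of fitting an LDA model with gensim and scoring it with a coherence measure; the toy corpus, the number of topics, and the choice of the c_v coherence variant are illustrative assumptions.

```python
from gensim.corpora import Dictionary
from gensim.models import LdaModel, CoherenceModel

# Toy corpus standing in for the preprocessed, tokenized news articles.
docs = [
    ["factory", "closure", "tariff", "loss"],
    ["merger", "acquisition", "board", "approval"],
    ["earnings", "loss", "dividend", "cut"],
]
dictionary = Dictionary(docs)
corpus = [dictionary.doc2bow(doc) for doc in docs]

lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2,
               random_state=42, passes=10)

# Coherence score used to compare models with different topic counts.
coherence = CoherenceModel(model=lda, texts=docs, dictionary=dictionary,
                           coherence="c_v").get_coherence()

# Each article is then represented as a vector of topic probabilities.
topic_probs = lda.get_document_topics(corpus[0], minimum_probability=0.0)
```

In practice, the fit is repeated over a grid of topic counts and the model with the highest coherence is kept, which is how 10 topics are selected in Section 3.3.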
Neural Net Embeddings. Based on the underlying idea that "a word is characterized by the company it keeps," each word is represented by an embedding, or a vector of continuous numbers, so that words with similar semantics (relative to the task) are closer to one another in the vector space. By using a neural network on a supervised/unsupervised task, the resulting weights/parameters that have been adjusted to minimize loss on the task are the embeddings.
• Doc2Vec Document Embeddings. Doc2Vec is an unsupervised framework that learns fixed-length feature representations from variable-length pieces of text. In the framework, a vector representation is trained by stochastic gradient descent and backpropagation to be useful for predicting the next word in the sentence. The Doc2Vec model we use in this study is a distributed memory one, where each paragraph vector acts as a memory that remembers what is missing from the current context (Le and Mikolov 2014). After being trained, the paragraph vectors can be used as features for the paragraph and be fed directly into conventional machine learning algorithms such as logistic regression.
• fastText Word Embeddings. Inspired by the same hypothesis as Doc2Vec, fastText word representations are trained to predict the words that appear in their contexts. What distinguishes the fastText model is that it takes morphology into account. More precisely, the model represents each word by a sum of its character n-grams, which allows us to compute representations for words that did not appear in the training data and proves helpful when working with morphologically rich languages (Bojanowski et al. 2016).

3 Data and Model Development

3.1 Data

News. For each company at a point in time (daily), we have all news articles corresponding to the record and a label indicating whether the company is going to downgrade within the future 1-year period.
• Data source. We have access to a news data source from Thomson Reuters via CreditEdge's API provided by Moody's. This source brings us daily article pieces at the company level. Reuters articles in the dataset can take various forms: news stories about a company's activities (e.g., merger & acquisition, board assignment, etc.), market snapshots (e.g., individual stock movements, etc.), or quarterly earnings summaries. There are approximately 21K observations identified at company and date level, covering more than 2.2K Moody's unique permanent company identifiers (PID) and spanning from mid November 2017 to early October 2019 (although 99% of these articles are in 2019).
• Data preprocessing. Since we are working in a very narrow domain (financial news of a company of interest) and the data is particularly noisy, we take multiple cleaning and preprocessing steps. First, we remove small machine-generated articles, such as NYSE/NASDAQ/AMEX order imbalance pieces, articles that only include video links, or articles whose accuracy hasn't been verified by Reuters. Second, we remove unnecessary headers, footers, and HTML metadata, and lowercase all text. Third, we manually read through a handful of articles and flag which formats tend to mention multiple companies in the text and which tend to talk about a single company. Last but not least, for pieces that cover information on multiple companies, we use a combination of named entity recognition and fuzzy matching techniques, as well as rule-based exact matching, to extract only sentences about the company of relevance (a sketch of this step is shown below, after Figure 1).
A few more preprocessing layers, such as tokenization, stemming and stop word removal, are added depending on which modeling approach we take. By performing thorough text preprocessing, we hope to increase the signal-to-noise ratio so that news is more likely to have a material impact on our downgrade prediction. To ensure there is only one row of data per company per day, for companies that have multiple articles on a single day we concatenate these articles together.

Figure 1: Overview of the modeling pipeline. For details of the benchmark model, see Section 3.2. For details of the news model, see Section 3.3. Note that features for the final model include the outputted downgrade probability from the news model and all quantitative features from the benchmark model.
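As a rough illustration of the entity-filtering step described in the preprocessing list above, the following sketch combines spaCy's named entity recognizer with fuzzy string matching. The rapidfuzz library, the ORG label filter, and the similarity threshold of 85 are assumptions for illustration, not details from the paper.

```python
import spacy
from rapidfuzz import fuzz

nlp = spacy.load("en_core_web_sm")

def extract_company_sentences(article: str, company_name: str,
                              threshold: int = 85) -> list[str]:
    """Keep only sentences whose ORG entities fuzzy-match the target company."""
    doc = nlp(article)
    kept = []
    for sent in doc.sents:
        orgs = [ent.text for ent in sent.ents if ent.label_ == "ORG"]
        if any(fuzz.partial_ratio(org.lower(), company_name.lower()) >= threshold
               for org in orgs):
            kept.append(sent.text)
    return kept

sentences = extract_company_sentences(
    "Acme Corp cut its dividend. Rival Beta Inc gained market share.", "Acme")
```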
Ratings. Credit ratings for each company are available from both Moody's and S&P. We take the worst combination between the two rating systems for a company at a point in time (daily) to derive the current rating and the worst rating within the next 1-year period, from which we determine whether that company has an impending downgrade event or not (a labeling sketch is given below, after Table 1).

Table 1: Downgrade events by year

downgrade  year  count
0          2017      42
0          2018     152
0          2019  20,519
1          2019     324

For our benchmark model, all features are variations of quantitative credit measures provided by third-party rating agencies and risk management consultancy companies; this is further discussed in Section 3.2.
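The labeling logic just described can be sketched in pandas as follows, assuming a daily panel with one row per company (pid) and numeric rating ranks where a higher rank means a worse rating; all column names are hypothetical.

```python
import pandas as pd

def label_downgrades(df: pd.DataFrame) -> pd.DataFrame:
    """df columns (hypothetical): pid, date (datetime), moodys_rank, sp_rank."""
    df = df.sort_values(["pid", "date"]).copy()
    # Worst combination of the two rating systems on each day.
    df["current_rank"] = df[["moodys_rank", "sp_rank"]].max(axis=1)

    def worst_within_year(group: pd.DataFrame) -> pd.Series:
        s = group.set_index("date")["current_rank"]
        horizon = pd.Timedelta(days=365)
        worst = [s[(s.index > d) & (s.index <= d + horizon)].max()
                 for d in s.index]
        return pd.Series(worst, index=group.index)

    df["worst_future_rank"] = df.groupby("pid", group_keys=False).apply(worst_within_year)
    # Downgrade event: the rating deteriorates at some point in the next year.
    df["downgrade"] = (df["worst_future_rank"] > df["current_rank"]).astype(int)
    return df
```

The quadratic scan over dates is kept for readability; a production version would use a rolling or merge-based window.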
3.2 Benchmark Model

The benchmark downgrade model of this study is a logistic regression. For each training data point, we have a vector of features X and an observed class Y, where X provides quantitative metrics about a company on a daily basis and Y = 1 indicates that the company is downgrading within 1 year from the date of the data observation. This setup ensures that both our benchmark and news models have the same level of data frequency and can be compared and combined later. Assuming that P(Y = 1 | X = x) = p(x; θ) for some function p parameterized by θ, what we want to model is:

$$P(Y = 1 \mid X = x, \theta) = \frac{1}{1 + e^{-\theta^{\top} x}}$$

The model is trained on 9 features that are variations in terms of either term structure (i.e., 1-year, 5-year) or transformation (i.e., lag, diff) of the following quantitative measures purchased from risk management providers:
• The probability that a firm will default over 1 year based on company-specific attributes, industry-related measures and relevant macro-economic factors
• The probability that a firm will downgrade within 1 year based on market-implied ratings and rating outlooks
• Historical credit rating for any firm at a point in time

Since our data is highly imbalanced, with downgrade events accounting for 1.5% of the dataset, we employ the SMOTE algorithm to replicate observations from the minority class. Overall, this benchmark model achieves an AUC of 82.7% and a recall rate of 69.5%.
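A minimal, self-contained sketch of this setup using scikit-learn and imbalanced-learn on synthetic stand-in data is shown below; the 9 proprietary features, the hyperparameters, and the cross-validation scheme of the actual model are not reproduced here.

```python
import numpy as np
from imblearn.over_sampling import SMOTE
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 9 quantitative features, ~1.5% positive (downgrade) rate.
rng = np.random.default_rng(42)
X = rng.random((2000, 9))
y = (rng.random(2000) < 0.015).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# Oversample the rare downgrade class on the training split only,
# so the hold-out test set keeps its natural class balance.
X_res, y_res = SMOTE(random_state=42).fit_resample(X_train, y_train)

clf = LogisticRegression(max_iter=1000).fit(X_res, y_res)
probs = clf.predict_proba(X_test)[:, 1]
print("AUC:", roc_auc_score(y_test, probs))
print("Recall:", recall_score(y_test, clf.predict(X_test)))
```

On real features this pattern yields the 82.7% AUC and 69.5% recall reported above; on the random stand-in data it will hover around chance.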
3.3 News Model

Figure 1 is an overview of our modeling pipeline. Throughout the training process, we deliberately employ logistic regression due to the algorithm's great interpretability, which is a focus of our business users who work in a highly regulated industry. This section details 3 different approaches we take to translate unstructured texts into meaningful numerical data that can be fed into a logistic regression model. We evaluate these NLP methodologies on our downstream task at hand, which is downgrade prediction, to select the best news model that can be incorporated into and improve the existing benchmark model.
Approach 1: Lexicon-based Sentiment and Topic Scores. We first experiment with one of the simplest sentiment analysis approaches, which is to compare the words of an article against a labeled word list, where each word has been scored for valence. Based on 5 different lexicons (Loughran & McDonald, VADER, AFINN, SentiWordNet, and OPINION), we compute 5 sentiment scores for each news article. Specifically, we check whether a word in the text exists in each positive or negative word list, then count the frequency of that word. For positive words, we do not count those that have a negation in one of the three places preceding them (see the sketch below). The sentiment score is then calculated from these positive and negative word counts. Additionally, we train an LDA model with 10 different topics, where each topic is a combination of keywords and each keyword contributes a certain weight to the topic. We decide on 10 as the optimal number of topics by running multiple LDA models with the number of topics ranging from 5 to 40 and picking the one that gives the highest coherence value, as shown in Figure 2. Two features that indicate the number of articles for each company at a point in time and whether an article contains any variation of the word "downgrade" are also created. In the end, our news model is trained on 17 features, including 5 sentiment scores, 10 topic probabilities, and 2 count indicators. This news model standalone has an unimpressive AUC of 59.8%. Although sentiment and topic scores are helpful in giving us a basic analysis of our news data at hand, these features are far from sufficient for accurate downgrade prediction.
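An illustrative sketch of this lexicon scoring with the negation rule follows; the tiny word lists, the negation set, and the length normalization are simplified assumptions, since the exact scoring formula is not reproduced here.

```python
NEGATIONS = {"not", "no", "never", "n't"}

def sentiment_score(tokens: list[str],
                    positive_words: set[str],
                    negative_words: set[str]) -> float:
    pos = neg = 0
    for i, tok in enumerate(tokens):
        if tok in positive_words:
            # Skip positive words with a negation in the 3 preceding tokens.
            if any(t in NEGATIONS for t in tokens[max(0, i - 3):i]):
                continue
            pos += 1
        elif tok in negative_words:
            neg += 1
    return (pos - neg) / len(tokens) if tokens else 0.0

tokens = "the company did not report strong earnings and cut its dividend".split()
score = sentiment_score(tokens, {"strong", "growth"}, {"cut", "loss"})  # < 0
```

One such score is computed per lexicon, giving the 5 sentiment features mentioned above.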
Figure 2: Coherence values of the LDA model with different topic numbers. The optimal number is 10 topics, with a coherence value of 0.704.
Approach 2: Doc2Vec Embeddings. In this approach, we train a distributed-memory Doc2Vec model based on the vocabulary of our training dataset over 20 epochs and generate a document embedding for each news article. We choose to use Doc2Vec paragraph vectors since they are learned from unlabeled data and theoretically can work well for tasks that have a small amount of labeled data (Le and Mikolov 2014), which is our case in this study. These embeddings are taken as input into a logistic regression model, which results in an AUC of 71.6%.
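A minimal gensim sketch of this distributed-memory Doc2Vec training (dm=1, 20 epochs) is shown below; the vector size, min_count, and toy corpus are illustrative assumptions.

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Toy stand-ins for the tokenized news articles.
articles = [
    ["company", "cuts", "dividend", "after", "loss"],
    ["board", "approves", "merger", "with", "rival"],
]
tagged = [TaggedDocument(words=a, tags=[i]) for i, a in enumerate(articles)]

model = Doc2Vec(vector_size=100, dm=1, epochs=20, min_count=1, seed=42)
model.build_vocab(tagged)
model.train(tagged, total_examples=model.corpus_count, epochs=model.epochs)

train_vec = model.dv[0]                                   # embedding of article 0
new_vec = model.infer_vector(["restructuring", "firm", "hired"])  # unseen text
```

These per-article vectors are the features fed to the logistic regression.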
Approach 3: fastText Embeddings. Taking advantage of transfer learning, we use 1M word vectors pre-trained by a team of Facebook researchers on Wikipedia 2017, the UMBC webbase corpus and the statmt.org news dataset (Mikolov et al. 2018). Since the fastText model was trained on massive data sources, its representations perform very well at transferring to other NLP problems and improve the generalization of models learned on a limited amount of data. Unlike Doc2Vec, which produces representations at the document level, fastText generates a 300-dimensional vector for each word. To create an embedding for the whole document, we take the average of the embeddings of the words contained in it. With these final document embeddings fed into a logistic regression, our news model achieves an AUC of 80.9%, which is the best performance for this standalone model so far.

Figure 3: AUC comparison between benchmark, standalone news, and final models (standalone news model AUC = 0.809; benchmark model AUC = 0.827; final model (benchmark + news) AUC = 0.877). The final model achieves the highest value, 87.7%, resulting in a 5% gain compared with the benchmark model.

Table 2: Performance of standalone news model

approach                               AUC (%)
sentiment scores and LDA topic scores  59.8
Doc2Vec embeddings                     71.6
fastText embeddings (averaging out)    80.9
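The averaging step can be sketched as below. Loading the pre-trained wiki-news vectors through gensim's downloader (under the name "fasttext-wiki-news-subwords-300") is an assumption about how the vectors are obtained, not the paper's actual loading code.

```python
import numpy as np
import gensim.downloader as api

# 300-dimensional vectors pre-trained on Wikipedia 2017, UMBC and statmt news.
wv = api.load("fasttext-wiki-news-subwords-300")

def doc_embedding(tokens: list[str], dim: int = 300) -> np.ndarray:
    """Average the word vectors of in-vocabulary tokens; zeros if none match."""
    vecs = [wv[t] for t in tokens if t in wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

emb = doc_embedding("company cuts dividend after quarterly loss".split())
```

Each document embedding then becomes one 300-dimensional feature row for the logistic regression.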
Based on the performance results, we decide to employ fastText embeddings as the main feature of our news model. The downgrade probability outputted from this model is then combined with the other quantitative measures in the benchmark, coupled with the SMOTE oversampling technique, to build a final augmented model. Results of this final model are further explored in Section 4.
4 Results

The following results are based on evaluation on a 20% hold-out test set that is untouched and unseen throughout the training process. Both the benchmark and final models are tested on this same set to ensure a fair comparison.
4.1 Model Performance

AUC. One of the main metrics we use to compare the performance of different models is AUC, which tells how good a model is at distinguishing classes (the higher the number, the better). Figure 3 indicates that a downgrade model using news alone can achieve a test AUC of 80.9% and that there is a 5% gain when adding this news probability on top of the existing model (from 82.7% for the benchmark to 87.7% for the final model).
Recall. Since the cost of a false negative in our business context (an impending downgrade event that the model is not able to capture) is very steep, we put an emphasis on optimizing the model's recall rate. Although a standalone news model produces a humble recall rate of 54.2%, incorporating news into the benchmark model improves the recall rate from 69.5% to 74.6% on the test set, resulting in a 5.1% gain in recall.

Figure 4: Cumulative gains chart. Given limited resources, the final model is able to provide a more optimal list of companies to analyze.
Cumulative Gains. Given the limited human and time resources that can be dedicated to the task of analyzing companies on the list of anticipated downgrades, we would like to capture as many downgrades as possible using as few test cases as possible. Figure 4 indicates that with news information added, analyzing the top 10% of companies with the highest predicted downgrade probabilities achieves a recall rate of 74.6%, a 27.1% improvement compared with the benchmark model. Similarly, analyzing the top 20% of companies using the final model results in a recall rate of 83.1%, an increase of 13.6% compared with the benchmark.

Overall, our model with the additional downgrade probability from news not only has a higher accuracy rate (in terms of AUC and recall) but is also more efficient, providing a more optimal list of companies to analyze given constrained resources.
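For clarity, the cumulative gains numbers behind Figure 4 can be computed as in this short sketch; the arrays are hypothetical stand-ins for the test-set labels and predicted downgrade probabilities.

```python
import numpy as np

def gain_at(y_true: np.ndarray, y_prob: np.ndarray, pct: float) -> float:
    """Share of all actual downgrades captured in the top pct of the ranking."""
    order = np.argsort(-y_prob)                # highest predicted risk first
    k = int(np.ceil(pct * len(y_true)))
    return y_true[order][:k].sum() / y_true.sum()

y_true = np.array([1, 0, 0, 1, 0, 0, 0, 0, 1, 0])
y_prob = np.array([0.9, 0.2, 0.1, 0.8, 0.3, 0.05, 0.4, 0.15, 0.6, 0.25])
print(gain_at(y_true, y_prob, 0.2))  # downgrades captured in the top 20%
```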
4.2 Qualitative Evaluation

To enhance the pragmatic sense of our model, we carefully examine all news articles associated with the true positives (actual downgrade events that the model is able to capture) and false negatives (actual downgrade events that the model is unable to capture) in our test set. The investigation shows that the model can pick up high-quality and very relevant article pieces to infer a downgrade event for a company.
True Positive Samples. There are 44 true positives corresponding to 16 unique companies in our test data. The following is a representative list of articles appearing in our true positive cases. Each example is shown at the company level, contains only keywords due to space limits, and gives a good summary of the entire article piece.
• Company A: close factories; cut down 12,000 jobs; has been among the hardest to be hit by the trade war so far
• Company B: lose clients and get sued following misconduct revelations
• Company C: challenged by [an investor] for stretching itself financially to buy rival oil driller
• Company D: struggle with a host of issues; cut its dividend and report a wider-than-expected loss in its main engineering and construction unit
• Company E: hire restructuring firms and may choose to seek bankruptcy protection
• Company F: plan to wind down its dress-barn retail operations, resulting in the closure of about 650 stores
• Company G: anticipate having discussions on a constructive basis relating to its underperformance
• Company H: make sophisticated missiles that use rare earth metals in their guidance systems, and sensors
• Company I: face lawsuit over art fraud; had been the willing auction house that knowingly and intentionally made the fraud possible

Although none of the articles explicitly mention that a company is being downgraded, the majority of them provide valuable insights into a company's financial health or how the company is perceived in the market. These signals could potentially be considered leading indicators of a downgrade. The model also seems able to pick up subtle and indirect signs of an impending downgrade event. For example, in the case of Company H, the article refers to the company's heavy dependence on rare earth metals. This does not appear to relate to a downgrade at first glance; however, further examination suggests that rare earth metals have recently become a political spotlight due to the trade tension between the U.S., which considers this mineral critical to the country's economic and national security, and China, which is the largest producer and manufacturer of this element in the world. Company H indeed was downgraded on 07-31-2019.
False Negative Samples. The test set includes 15 false negatives associated with 6 distinct companies. After review, we see that these cases fall under one of the two following categories:
• The news articles follow unpopular format structures compared with others in the dataset, which leads to their noisy quality even after preprocessing.
• The news articles appearing in this false negative list are about companies that also exist in the true positive list. However, these pieces have a positive/neutral tone, were published shortly earlier, and might revolve around a different activity than the one mentioned in the articles of the true positives. For example:
– Company I's shareholders approve a proposed acquisition; clients are pleased with the company's help to file their first lawsuit against the government. (This piece was published on 06-21-2019. The one in the true positive list was published on 06-25-2019. Company I's downgrade event happened on 09-17-2019.)
– Company C could divest most or all of [an investment vehicle] after buying out [another company]. (This piece was published on 06-14-2019. The one in the true positive list was published on 06-27-2019. Company C's downgrade event happened on 08-01-2019.)

Our takeaways here are that bad news can "outweigh" good/neutral news when it comes to contributing to the prediction of a downgrade, and that companies' ratings can take a downturn very shortly after the emergence of negative sentiment about them in the market.

Figure 5: Histogram of the performance gain in 100 experiment runs. The gain is positive 100 times, with a mean value of 6.3% and a standard error of 1.6%.
4.3 Stability Test

To evaluate the robustness of our final model's performance gain, we run 100 experiments using 100 different random seeds, which are involved in the train/test splitting, cross-validation and SMOTE oversampling processes. The performance gain in AUC from adding news to the benchmark model is positive in 100 out of 100 experiments. The mean value of these gains is 6.3% with a standard error of 1.6%, as shown in Figure 5. This suggests that our positive performance gain is reliable and that the 5% increase in AUC mentioned in Section 4.1 is on the lower side. As we gain access to a larger set of data, the standard error should theoretically be pushed closer to 0.
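Schematically, the stability test is the loop below. The functions fit_benchmark and fit_final are hypothetical stand-ins for the full training and evaluation routines (with splitting, cross-validation and SMOTE all driven by the seed), so this is scaffolding rather than a runnable end-to-end script.

```python
import numpy as np

def fit_benchmark(seed: int) -> float:
    """Hypothetical stand-in: retrain the quantitative-only model with this
    seed and return its test AUC."""
    raise NotImplementedError

def fit_final(seed: int) -> float:
    """Hypothetical stand-in: retrain the benchmark + news model with this
    seed and return its test AUC."""
    raise NotImplementedError

gains = np.array([fit_final(s) - fit_benchmark(s) for s in range(100)])
print(f"positive runs: {(gains > 0).sum()}/100")
print(f"mean gain: {gains.mean():.3f}, spread: {gains.std(ddof=1):.3f}")
```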
5 Conclusion and Future Work

In the hope of obtaining unique insights into companies' financial health that are not available in traditional quantitative credit measures, we train a downgrade risk model solely on news information. We demonstrate that news coverage, if represented appropriately in the data, can help detect adverse credit signals and considerably improve the performance of an existing model trained on conventional credit measures.

There are multiple avenues for research opened by this work. Throughout our model development, we noticed that how we preprocess the text data can introduce a material change in final model performance. Extracting only relevant information about the company of interest plays a key role in increasing the gain. Thus, we would like to further explore more advanced information extraction techniques, such as coreference resolution to find sentences that do not explicitly include a company's name but still refer to the same entity, or sentence segmentation to pick out sentences when the boundary is ambiguous. In addition, because the news model is built solely on online articles, it is inherently subject to media bias. Fact-checking news or tackling different forms of bias within mass media is outside the scope of this study but offers an exciting opportunity for future research.
Acknowledgments

The author is grateful to Zizhen Wu and Jasmine Geng for helpful discussions about model development, to Yi Li and Yi Wang for knowledge sharing about the raw datasets, and to Nailong Zhang, Jasmine Geng, Adam Fox and Sears Merritt for reviewing the paper.
References

[Blei et al. 2003] Blei, D. M.; Ng, A. Y.; Jordan, M. I.; and Lafferty, J. 2003. Latent dirichlet allocation. Journal of Machine Learning Research 3:993–1022.

[Bojanowski et al. 2016] Bojanowski, P.; Grave, E.; Joulin, A.; and Mikolov, T. 2016. Enriching word vectors with subword information. CoRR abs/1607.04606.

[Hutto and Gilbert 2014] Hutto, C. J., and Gilbert, E. 2014. VADER: A parsimonious rule-based model for sentiment analysis of social media text. In Adar, E.; Resnick, P.; Choudhury, M. D.; Hogan, B.; and Oh, A. H., eds., ICWSM. The AAAI Press.

[Le and Mikolov 2014] Le, Q., and Mikolov, T. 2014. Distributed representations of sentences and documents. In Proceedings of the 31st International Conference on International Conference on Machine Learning - Volume 32, ICML'14, II-1188–II-1196. JMLR.org.

[Loughran and McDonald 2011] Loughran, T., and McDonald, B. 2011. When is a liability not a liability? Textual analysis, dictionaries, and 10-Ks. Journal of Finance 66(1):35–65.

[Mikolov et al. 2018] Mikolov, T.; Grave, E.; Bojanowski, P.; Puhrsch, C.; and Joulin, A. 2018. Advances in pre-training distributed word representations. In Proceedings of the International Conference on Language Resources and Evaluation (LREC 2018).

[Nielsen 2011] Nielsen, F. Å. 2011. A new ANEW: evaluation of a word list for sentiment analysis in microblogs. CoRR abs/1103.2903.

[Röder, Both, and Hinneburg 2015] Röder, M.; Both, A.; and Hinneburg, A. 2015. Exploring the space of topic coherence measures. In Proceedings of the Eighth ACM International Conference on Web Search and Data Mining (WSDM '15).