Distillation of News Flow into Analysis of Stock Reactions

Abstract

The gargantuan plethora of opinions, facts and tweets on financial business offers the opportunity to test and analyze the influence of such text sources on future directions of stocks. It also creates, though, the necessity to distill via statistical technology the informative elements of this prodigious and indeed colossal data source. Using mixed text sources from professional platforms, blog fora and stock message boards, we distill sentiment variables via different lexica. These are employed for an analysis of stock reactions: volatility, volume and returns. An increased sentiment, especially for those with negative prospection, will influence volatility as well as volume. This influence is contingent on the lexical projection and differs across Global Industry Classification Standard (GICS) sectors. Based on review articles on 100 S&P 500 constituents for the period of October 20, 2009, to October 13, 2014, we project into the BL, MPQA and LM lexica and use the distilled sentiment variables to forecast individual stock indicators in a panel context. Exploiting different lexical projections to test different stock reaction indicators, we aim at answering the following research questions: (i) Are the lexica consistent in their analytic ability? (ii) To which degree is there an asymmetric response given the sentiment scales (positive vs. negative)? (iii) Does the news of high-attention firms diffuse faster and result in more timely and efficient stock reactions? (iv) Is there a sector-specific reaction from the distilled sentiment measures? We find there is significant incremental information in the distilled news flow and the sentiment effect is characterized as an asymmetric, attention-specific and sector-specific response of stock reactions.


Junni L. Zhang
Guanghua School of Management and Center for Statistical Science, Peking University, Beijing, 100871, China

Wolfgang K. Härdle
Humboldt-Universität zu Berlin, Unter den Linden 6, Berlin 10099, Germany
and
Sim Kee Boon Institute for Financial Economics, Singapore Management University, Administration Building, 81 Victoria Street, Singapore 188065

Cathy Y. Chen
Chung Hua University, 707, Sec. 2, WuFu Rd., Hsinchu, Taiwan 30012
and
Humboldt-Universität zu Berlin, Unter den Linden 6, Berlin 10099, Germany

Elisabeth Bommes
Humboldt-Universität zu Berlin, Unter den Linden 6, Berlin 10099, Germany

September 23, 2020

This research was supported by the Deutsche Forschungsgemeinschaft through the SFB 649 'Economic Risk', Humboldt-Universität zu Berlin. We would like to thank the Research Data Center (RDC) for the data used in this study. We would also like to thank the International Research Training Group (IRTG) 1792. This is a post-peer-review, pre-copyedit version of an article published in the Journal of Business and Economic Statistics. The final authenticated version is available online at: http://dx.doi.org/10.1080/07350015.2015.1110525

Keywords: Investor Sentiment, Attention Analysis, Sector Analysis, Volatility Simulation, Trading Volume, Returns

JEL Classifications: C81, G14, G17

Introduction

News drives financial markets. News is nowadays massively available on a variety of modern digital platforms with a wide spectrum of granularity scales. It is exactly this combination of granularity and massiveness that makes it virtually impossible to process all the news relevant to certain financial assets. How to distinguish between “noise” and “signal” is also the relevant question here. With a few exceptions, the majority of empirical studies on news impact have therefore concentrated on specific identifiable events like scheduled macroeconomic announcements, political decisions, or asset-specific news. Recent studies have looked at continuous news flow from an automated sentiment machine, which has been found to be relevant to high-frequency return, volatility and trading volume. Both approaches have limitations since they concentrate on identifiable indicators (events) or use specific automated linguistic algorithms.

This paper uses text data of different granularity from blog fora, news platforms and stock message boards. Using several lexical projections, we define pessimistic (optimistic) sentiment with specific meaning as the average proportions of negative (positive) words in articles published in specific time windows before the focal trading day, and examine their impacts on stock trading volume, volatility and return. We analyze those effects in a panel data context and study their influence on stock reactions. These reactions might be interesting since large institutions, the more sophisticated investors, usually express their views on stock prospects or predictions through published analyst forecasts. However, analysts' recommendations may be contaminated by their career concerns and compensation scheme; they may also be in alliance with other financial institutions such as investment banks, brokerage houses or target companies (Hong and Kubik, 2003; Liu, 2012). Due to the possible conflicts of interest from analysts and their powerful influence on naive small investors, the opinions from individual small investors may be trustworthy since their personal opinions hardly create any manipulation that governs stock reactions. The advent of social media such as Seeking Alpha enables small investors to share and express their opinions frequently, in real time and responsively.

We show that small investors' opinions contribute to stock markets and create a “news-driven” stock reaction. The conversation on the internet or social media is valuable since the introduction of conversation among a subset of market participants may have large effects on the stock price equilibrium (Cao et al., 2001). Other literature such as Antweiler and Frank (2004), Das and Chen (2007) and Chen et al. (2014) demonstrates the value of individual opinions on financial markets. They show that small investor opinions predict future stock returns and earnings surprises even after controlling for financial analyst recommendations.

The projections (of a text into sentiment variables) we employ are based on three sentiment lexica: the BL, LM and MPQA lexica. They are used to construct sentiment variables that feed into the stock reaction analysis. Exploiting different lexical projections, and using different stock reaction indicators, we aim at answering the following research questions:

(i) Are the lexica consistent in their analytic ability to produce stock reaction indicators, including volatility, detrended log trading volume and return?
(ii) To which degree is there an asymmetric response given the sentiment scales (positive vs. negative)?
(iii) Does the news of high-attention firms diffuse faster and result in more timely and efficient stock reactions?
(iv) Is there a sector-specific reaction from the distilled sentiment measures?

Question (i) addresses the variation of news content across different granularity and lexica. Whereas earlier literature focuses on numerical input indices like Reuters NewsContent or the Google Search Volume Index, we would like to investigate the usefulness of automated news inputs for, e.g., statistical arbitrage algorithms. Question (ii) examines the effect of different sentiment scales on stock reactions like volatility, trading volume and returns.
Three lexica are employed that produce different numerical values and thus raise the concern of how much structure is captured in the resulting sentiment measure. An answer to this question will give us insight into whether the well-known asymmetric response (bad vs. good news) is appropriately reflected in the lexical projections. Questions (iii) and (iv) finally analyze whether stylized facts play a role in our study. This is answered via a panel data scheme using GICS sector indicators and attention ratios.

Groß-Klußmann and Hautsch (2011) analyze in a high-frequency context market reactions to the intraday stock-specific “Reuters NewsScope Sentiment” engine. Their findings support the hypothesis of news influence on volatility and trading volume, but are in contrast to our study since they are based on a single news source and confined to a limited number of assets for which high-frequency data are available.

Antweiler and Frank (2004) analyze text contributions from stock message boards and find that the amount and bullishness of messages have predictive value for trading volume and volatility. On message boards, the self-disclosed sentiment to hold a stock position is not bias-free, as indicated in Zhang and Swanson (2010). Tetlock (2007) concludes that negative sentiment in a Wall Street Journal column has explanatory power for downward movement of the Dow Jones. Bollen et al. (2011) classify messages from the micro-blogging platform Twitter in six different mood states and find that public mood helps to predict changes in daily Dow Jones values. Zhang et al. (2012) extend this by filtering the Twitter messages (tweets) for keywords indicating a financial context, and they consider different markets such as commodities and currencies. Si et al. (2013) use a refined filtering process to obtain stock-specific tweets and conclude that topic-based Twitter sentiment improves day-to-day stock forecast accuracy. Sprenger et al. (2014) also use tweets on the stock level and conclude that the number of retweets and followers may be used to assess the quality of investment advice. Chen et al. (2014) use articles and corresponding comments on Seeking Alpha, a social media platform for investment research, and show predictive value of negative sentiment for stock returns and earnings surprises. According to Wang et al. (2014), the correlation of Seeking Alpha sentiment and returns is higher than that between returns and sentiment in Stocktwits, messages from a micro-blogging platform specialized in finance.

Using either individual lexical projections or a sentiment index comprising the common component of the three lexical projections, we find that the text sentiment shows an incremental influence on the stocks collected from S&P 500 constituents. An asymmetric response of the stock reaction indicators to the negative and positive sentiments is confirmed and supports the leverage effect, that is, the stocks react more to negative sentiment. The reaction to the distilled sentiment measures is attention-specific and sector-specific as well. Due to the advent of social media, the opinions of small traders, which have been ignored until now, do shed some light on stock market activity.

The rest of the paper is organized as follows. Section 2 describes the data gathering process, summarizes definitions of variables and introduces the different sentiment lexica. In Section 3, we present the regression and simulation results using the entire sample and samples grouped by attention ratio and sectors. The conclusion follows in Section 4.

While there are many possible sources of financial articles on the web, there are also legal and practical obstacles to clear before obtaining the data. The text source Seeking Alpha, as used in Chen et al. (2014), prohibits any application of automatic programs to download parts of the website (web scraper) in their Terms of Use (TOS). While the usage of web scrapers for non-commercial academic research is in principle legal, these TOS are still binding as stated in Truyens and Eecke (2014). For messages on Yahoo! Finance, another popular source of financial text data used in Antweiler and Frank (2004) and Zhang and Swanson (2010), the TOS are not a hindrance but only limited message history is provided. As of December 2014, only the last 10,000 messages are shown in each stock-specific message board, and this roughly corresponds to a two-month period for stocks that people talk about frequently, like Apple. In contrast to these two examples, NASDAQ offers a platform for financial articles by selected contributors including social media websites such as Seeking Alpha and Motley Fool, and investment research firms such as Zacks. The TOS neither prohibit web scraping nor is the history of shown articles limited. We have collected 116,691 articles and corresponding stock symbols, spanning roughly five years from October 20, 2009 to October 13, 2014. The data is downloaded by using a self-written web scraper to automate the downloading process.

The process of gathering and processing the article data and producing the sentiment scores can be seen in Figure 1. Firstly, the URLs of all articles on NASDAQ are gathered and every webpage containing an article is downloaded. Each URL can be used in the next steps as a unique identifier of individual articles to ensure that one article is not used twice due to real-time updates of the NASDAQ webpage. In the pre-processing step, the page navigation and design elements of NASDAQ are removed. The specifics of each article, namely contributor, publication date, mentioned stock symbols, title and article text, are identified and read out. In case of the article text, the results are stored in individual text files. This database is available for research purposes at RDC, CRC 649, Humboldt-Universität zu Berlin.

Figure 1: Flowchart of data gathering process

Furthermore, we collected stock-specific financial data. Daily prices and trading volume, defined as number of shares traded, of all stock symbols that are S&P 500 constituents are collected from Datastream, while Compustat is used to gather the GICS sector for these stocks.

We consider three stock reaction indicators: log volatility, detrended log trading volume and return. For stock symbol $i$ and trading day $t$, we first compute the Garman and Klass (1980) range-based measure of volatility defined as:

$$\sigma_{i,t} = 0.511\,(u - d)^2 - 0.019\,\{c\,(u + d) - 2ud\} - 0.383\,c^2 \qquad (1)$$

with

$$u = \log(P^H_{i,t}) - \log(P^O_{i,t}), \quad d = \log(P^L_{i,t}) - \log(P^O_{i,t}), \quad c = \log(P^C_{i,t}) - \log(P^O_{i,t}),$$

and $P^H_{i,t}$, $P^L_{i,t}$, $P^O_{i,t}$, $P^C_{i,t}$ as the daily highest, lowest, opening and closing stock prices, respectively.
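The range-based measure in (1) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `garman_klass` and its argument names are hypothetical, and the coefficients are the standard ones of the Garman and Klass (1980) estimator.

```python
import math


def garman_klass(p_open, p_high, p_low, p_close):
    """Garman-Klass range-based measure for one stock and one trading day.

    Hypothetical helper: takes the daily opening, highest, lowest and
    closing prices and returns the value of equation (1).
    """
    u = math.log(p_high) - math.log(p_open)   # normalized high
    d = math.log(p_low) - math.log(p_open)    # normalized low
    c = math.log(p_close) - math.log(p_open)  # normalized close
    return (0.511 * (u - d) ** 2
            - 0.019 * (c * (u + d) - 2.0 * u * d)
            - 0.383 * c ** 2)


# Example: a day that opens and closes at 100 with a symmetric 1% log range.
value = garman_klass(100.0, 100.0 * math.exp(0.01), 100.0 * math.exp(-0.01), 100.0)
```

Taking `math.log(value)` afterwards gives the log volatility used as a stock reaction indicator; note that the logarithm is undefined on a day with no price movement, where the measure is exactly zero.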
Chen et al. (2006) and Shu and Zhang (2006) show that the Garman and Klass range-based measure of volatility essentially provides equivalent results to high-frequency realized volatility. For example, Shu and Zhang (2006) find that an empirical test with S&P 500 index return data shows that the range-based variances are quite close to the high-frequency realized variance computed using the sum of 15-minute squared returns. Andersen and Bollerslev (1997) show that the high-frequency realized volatility is very sensitive to the selected interval. In addition, it is also affected by the bid/ask spread. The range-based measure of volatility, on the other hand, avoids the problems caused by microstructure effects. However, Alizadeh et al. (2002) argue that range-based measures such as the Garman-Klass estimator do not make use of the log-normality of volatility. As shown by Andersen et al. (2001), log realized volatility is less skewed and less leptokurtic in comparison to raw realized volatility. Therefore, we use $\log \sigma_{i,t}$ instead, which also avoids regressing on a strictly positive variable in the subsequent analysis.

Following Girard and Biswas (2007), we estimate the detrended log trading volume for each stock by using a quadratic time trend equation:

$$V^*_{i,t} = \alpha + \beta_1 (t - t_0) + \beta_2 (t - t_0)^2 + V_{i,t}, \qquad (2)$$

where $t_0$ is the starting point of the time window in consideration, $V^*_{i,t}$ is the raw daily log trading volume and the residual $V_{i,t}$ is the detrended log trading volume. In order to avoid imposing a look-ahead bias, for each trading day $t$, we use a rolling window of the past 120 observations, $V^*_{i,t-120}, \ldots, V^*_{i,t-1}$ with $t_0 = t - 120$, to estimate (2) and predict $V^*_{i,t}$, and then calculate $V_{i,t} = V^*_{i,t} - \hat{V}^*_{i,t}$.
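The rolling detrending step can be illustrated with a small pure-Python sketch. It assumes a list of raw daily log volumes for one stock, indexed by trading day; the names `solve_linear` and `detrended_log_volume` are hypothetical, and the quadratic trend is fitted by ordinary least squares on rescaled time, which is an implementation choice rather than something the text prescribes.

```python
def solve_linear(a, b):
    """Solve the small linear system a x = b by Gaussian elimination."""
    n = len(b)
    m = [row[:] + [b[k]] for k, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        rest = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - rest) / m[r][r]
    return x


def detrended_log_volume(log_volume, t, window=120):
    """Residual of log_volume[t] from a quadratic trend fitted on the
    previous `window` observations only (no look-ahead)."""
    if t < window:
        raise ValueError("need at least `window` past observations")
    t0 = t - window
    # Rescaled time x = (s - t0) / window: an equivalent reparametrization
    # of the quadratic trend that keeps the normal equations well conditioned.
    xs = [(s - t0) / window for s in range(t0, t)]
    ys = log_volume[t0:t]
    power_sums = [sum(x ** p for x in xs) for p in range(5)]
    xtx = [[power_sums[r + c] for c in range(3)] for r in range(3)]
    xty = [sum(y * x ** r for x, y in zip(xs, ys)) for r in range(3)]
    a, b1, b2 = solve_linear(xtx, xty)
    predicted = a + b1 + b2  # trend evaluated at x = (t - t0) / window = 1
    return log_volume[t] - predicted
```

Calling `detrended_log_volume(series, t)` for every trading day `t >= 120` yields the detrended series; the first 120 days have no value because the window is not yet full.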
Furthermore, we calculate the returns as $R_{i,t} = \log P^C_{i,t} - \log P^C_{i,t-1}$.

We focus on 100 stock symbols that are S&P 500 constituents on all 1,255 trading days between October 20, 2009 and October 14, 2014, that belong to one of nine major GICS sectors for stock symbols that are S&P 500 constituents on at least one trading day during this period, and that have the most trading days with articles. The distribution of GICS sectors among these 100 symbols is given in Table 1. Out of the 116,691 articles collected, there are 43,459 articles associated with these 100 stock symbols; the number of articles for these stocks ranges from 340 to 5,435, and the number of trading days with articles ranges from 271 to 1,039. Most of the articles are not about one single symbol but contain references to several stocks.

| GICS Sector | No. Stocks |
|---|---|
| Consumer Discretionary | 21 |
| Consumer Staples | 9 |
| Energy | 6 |
| Financials | 12 |
| Health Care | 15 |
| Industrials | 10 |
| Information Technology | 21 |
| Materials | 4 |
| Telecommunication Services | 2 |

Table 1: Distribution of GICS sectors among the 100 stock symbols

To distill sentiment variables from each article, we use and compare three sentiment lexica. The first lexicon (BL) is a list of 6,789 sentiment words (2,006 positive and 4,783 negative) compiled over many years starting from Hu and Liu (2004) and maintained by Bing Liu at the University of Illinois at Chicago. We filter each article with this lexicon and calculate the proportions of positive and negative words. The second lexicon (LM) is based on Loughran and McDonald (2011), which is specifically designed for financial applications, and contains 354 positive words, 2,329 negative words, 297 uncertainty words, 886 litigious words, 19 strong modal words and 26 weak modal words.
To be consistent with the usage of the other lexica, we only consider the list of positive and negative words and calculate the proportions of positive and negative words for each article.

The third lexicon is the MPQA (Multi-Perspective Question Answering) Subjectivity Lexicon by Wilson et al. (2005), which we later refer to as the MPQA lexicon. This lexicon contains 8,222 entries. In order to show the rather tedious distillation process, let us look at six example entries:

type=weaksubj len=1 word1=abandoned pos1=adj stemmed1=n priorpolarity=negative
type=weaksubj len=1 word1=abandonment pos1=noun stemmed1=n priorpolarity=negative
type=weaksubj len=1 word1=abandon pos1=verb stemmed1=y priorpolarity=negative
type=strongsubj len=1 word1=abase pos1=verb stemmed1=y priorpolarity=negative
type=strongsubj len=1 word1=abasement pos1=anypos stemmed1=y priorpolarity=negative
type=strongsubj len=1 word1=abash pos1=verb stemmed1=y priorpolarity=negative

Here type refers to whether the word is classified as strongly subjective, indicating that the word is subjective in most contexts, or weakly subjective, indicating that the word only has certain subjective usages; len denotes the length of the word; word1 is the spelling of the word; pos1 is the part-of-speech tag of the word, which could take values adj (adjective), noun, verb, adverb, or anypos (any part-of-speech tag); stemmed1 is an indicator for whether this word is stemmed, where stemming refers to the process of reducing inflected (or sometimes derived) words to their word stem, base or root form; and priorpolarity refers to the polarity of the word, which could take values negative, positive, neutral, or both (both negative and positive). The MPQA lexicon contains 4,913 entries with negative polarity, 2,718 entries with positive polarity, 570 entries with neutral polarity, and 21 entries with both polarities. To be consistent with the usage of the other two lexica, we only consider positive and negative polarity.

We first use the NLTK package in Python to tokenize sentences and (un-stemmed) words in each article, and derive the part-of-speech tag for each word. We filter each tokenized article with the list of entries with stemmed1=n in the MPQA lexicon to count the number of positive and negative words. We then use the Porter Stemmer in the NLTK package to stem each word and filter each article with the list of entries with stemmed1=y in the MPQA lexicon. If a word has been assigned polarity in the first filtering step, it will no longer be counted in the second filtering step. For each article, we can thus count the numbers of negative and positive words, and divide them by the length of the article to get the proportions of negative and positive words.

Regardless of which lexicon is used, we use a variation of the approach in Hu and Liu (2004) to account for sentiment negation. If the word distance between a negation word (“not”, “never”, “no”, “neither”, “nor”, “none”, “n't”) and the sentiment word is no larger than 5, the positive or negative polarity of the word is changed to be the opposite of its original polarity.

Among the words that appear at least three times in our list of articles, there are 470 positive and 918 negative words that are unique to the BL lexicon, 267 positive and 916 negative words that are unique to the LM lexicon, and 512 positive and 181 negative words that are unique to the MPQA lexicon. The LM lexicon contains fewer unique positive words than the other two lexica, and the MPQA lexicon contains fewer unique negative words than the other two lexica. Table 2 presents the lists of ten most frequent positive words and ten most frequent negative words that are unique to these three lexica. Since the BL and MPQA lexica are designed for general purpose and the LM lexicon is designed specifically for financial applications, the unique words under the BL and MPQA lexica indeed look more general.

Words in the general-purpose lexica may be misclassified for financial applications; for example, the word “proprietary” in the negative list of the BL lexicon may refer to things like “a secure proprietary operating system that no other competitor can breach” and hence have a positive tone in financial applications, and the word “division” in the negative list of the MPQA lexicon may only refer to divisions of companies. However, financial analysis using textual information is unavoidably noisy, and words in the LM lexicon can also be misclassified; for example, the word “closing” in the negative list of the LM lexicon may actually refer to a positive event of closing a profitable deal. Also, the LM lexicon does not take into account financial words such as “debt” and “risks” in the BL lexicon.

We next investigate the pairwise relationship among the above three lexica.
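The lexicon filtering and the negation rule described earlier in this subsection can be sketched as follows. This is a simplified illustration, not the pipeline used for the paper: it replaces NLTK tokenization, part-of-speech tagging and stemming with a small regular-expression tokenizer, it assumes the word distance is measured in either direction, and the names `tokenize` and `sentiment_proportions` as well as the toy word sets are hypothetical.

```python
import re

NEGATIONS = {"not", "never", "no", "neither", "nor", "none", "n't"}

# Simplified tokenizer (assumption): lower-case words, with "n't" split off
# as its own token, e.g. "isn't" -> "is", "n't".
TOKEN_RE = re.compile(r"n't|[a-z]+(?=n't)|[a-z]+")


def tokenize(text):
    return TOKEN_RE.findall(text.lower())


def sentiment_proportions(text, positive, negative, max_distance=5):
    """Return (proportion of positive words, proportion of negative words).

    A sentiment word within `max_distance` tokens of a negation word has its
    polarity flipped before it is counted.
    """
    tokens = tokenize(text)
    if not tokens:
        return 0.0, 0.0
    negation_idx = [i for i, tok in enumerate(tokens) if tok in NEGATIONS]
    n_pos = n_neg = 0
    for i, tok in enumerate(tokens):
        if tok in positive:
            polarity = 1
        elif tok in negative:
            polarity = -1
        else:
            continue
        if any(j != i and abs(i - j) <= max_distance for j in negation_idx):
            polarity = -polarity
        if polarity > 0:
            n_pos += 1
        else:
            n_neg += 1
    return n_pos / len(tokens), n_neg / len(tokens)


# Toy lexicon for illustration only; the real lexica have thousands of words.
pos_words = {"good", "gain"}
neg_words = {"bad", "loss"}
pos_share, neg_share = sentiment_proportions("This is not good", pos_words, neg_words)
```

In the toy example the only sentiment word, "good", sits one token after "not", so it is counted as negative and the article gets a negative proportion of one quarter. A window of five tokens is crude: it can also flip words that the negation does not actually govern, which is one source of the noise discussed above.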
Among the words that appear at least three times in our list of articles, there are 131 positive and 322 negative words that are shared only by the BL and LM lexica, 971 positive and 1,164 negative words that are shared only by the BL and MPQA lexica, and 32 positive and 30 negative words that are shared only by the LM and MPQA lexica. It is not surprising that the two general-purpose lexica, BL and MPQA, share the most positive and negative words. Out of the two general-purpose lexica, the BL lexicon shares more positive and negative words with the special-purpose LM lexicon. Table 3 presents the lists of ten most frequent positive words and ten most frequent negative words that are shared only by two of these three lexica. Words shared by the two general-purpose lexica (BL and MPQA) may be misclassified for financial applications; for example, the word “gross” shared by the negative lists of these two lexica may refer to “the annual gross domestic product” and have a neutral tone. However, words shared by the LM lexicon and one of the general-purpose lexica may also be misclassified; for example, the word “critical” shared by the negative lists of the BL and LM lexica may appear in sentences such as “mobile devices are becoming critical tools in the worlds of advertising and market research” and have a positive tone.

| BL Positive (470) | BL Negative (918) | LM Positive (267) | LM Negative (916) | MPQA Positive (512) | MPQA Negative (181) |
|---|---|---|---|---|---|
| Available (5,836) | Debt (12,540) | Opportunities (4,720) | Declined (9,809) | Just (17,769) | Low (12,739) |
| Led (5,774) | Fell (9,274) | Strength (4,393) | Dropped (4,894) | Help (17,334) | Division (5,594) |
| Lead (4,711) | Fool (5,473) | Profitability (4,174) | Late (4,565) | Profit (15,253) | Least (5,568) |
| Recovery (4,357) | Issues (3,945) | Highest (3,409) | Claims (3,785) | Even (13,780) | Stake (4,445) |
| Work (3,808) | Risks (2,850) | Greater (3,321) | Closing (3,604) | Deal (13,032) | Slightly (3,628) |
| Helped (3,631) | Issue (2,821) | Surpassed (2,464) | Closed (3,378) | Interest (12,237) | Close (3,105) |
| Enough (3,380) | Falling (2,768) | Enable (2,199) | Challenges (2,574) | Above (12,203) | Trial (2,544) |
| Pros (2,841) | Aggressive (1,796) | Strengthen (2,157) | Force (2,157) | Accord (11,760) | Decrease (2,205) |
| Integrated (2,652) | Hedge (1,640) | Alliance (1,842) | Unemployment (2,062) | Natural (10,135) | Disease (2,001) |
| Savings (2,517) | Proprietary (1,560) | Boosted (1,831) | Question (1,891) | Potential (9,905) | Little (1,775) |

Table 2: Lists of ten most frequent positive words and ten most frequent negative words that are unique to the BL, MPQA or LM lexica, along with their frequencies given in parentheses.

| BL and LM Positive (131) | BL and LM Negative (322) | BL and MPQA Positive (971) | BL and MPQA Negative (1,164) | LM and MPQA Positive (32) | LM and MPQA Negative (30) |
|---|---|---|---|---|---|
| Gains (7,604) | Losses (5,938) | Free (133,395) | Gross (8,228) | Despite (7,413) | Against (8,877) |
| Gained (7,493) | Missed (3,165) | Well (30,270) | Risk (7,471) | Able (5,246) | Cut (3,401) |
| Improved (7,407) | Declining (3,053) | Like (24,617) | Limited (5,884) | Opportunity (4,398) | Challenge (1,042) |
| Improve (5,726) | Failed (2,421) | Top (14,899) | Motley (5,165) | Profitable (3,580) | Serious (1,022) |
| Restructuring (3,210) | Concerned (1,991) | Guidance (11,715) | Crude (5,109) | Efficiency (2,615) | Contrary (401) |
| Gaining (3,150) | Declines (1,654) | Significant (10,576) | Cloud (4,906) | Popularity (1,588) | Severely (348) |
| Enhance (2,753) | Suffered (1,435) | Worth (10,503) | Fall (4,732) | Exclusive (1,225) | Despite (342) |
| Outperform (2,518) | Weaker (1,288) | Gold (9,303) | Mar (3,190) | Tremendous (611) | Argument (324) |
| Stronger (1,657) | Critical (1,131) | Support (9,120) | Hard (2,957) | Dream (581) | Seriously (240) |
| Win (1,491) | Drag (1,095) | Recommendation (8,993) | Cancer (2,521) | Satisfaction (410) | Staggering (209) |

Table 3: Lists of ten most frequent positive words and ten most frequent negative words that are shared only by BL and LM lexica, only by BL and MPQA lexica, or only by LM and MPQA lexica, along with their frequencies given in parentheses.

The above discussion shows that projections using the three lexica are all noisy; therefore it is worthwhile to compare results from these projections. For each stock symbol $i$ and each trading day $t$, we derive the sentiment variables listed in Table 4 based on articles associated with symbol $i$ and published on or after trading day $t$ and before trading day $t + 1$.

| Sentiment Variable | Description |
|---|---|
| $I_{i,t}$ | Indicator for whether there is an article. |
| $Pos_{i,t}$ (BL) | The average proportion of positive words using the BL lexicon. |
| $Neg_{i,t}$ (BL) | The average proportion of negative words using the BL lexicon. |
| $Pos_{i,t}$ (LM) | The average proportion of positive words using the LM lexicon. |
| $Neg_{i,t}$ (LM) | The average proportion of negative words using the LM lexicon. |
| $Pos_{i,t}$ (MPQA) | The average proportion of positive words using the MPQA lexicon. |
| $Neg_{i,t}$ (MPQA) | The average proportion of negative words using the MPQA lexicon. |

Table 4: Sentiment variables for articles published on or after trading day $t$ and before trading day $t + 1$.

Table 5 presents summary statistics of the sentiment variables derived using the BL, LM and MPQA lexical projections for 43,569 symbol-day combinations with $I_{i,t} = 1$, where $I_{i,t}$ is defined in Table 4 and indicates whether there is an article associated with symbol $i$ and published on or after trading day $t$ and before trading day $t + 1$. This number is slightly different from the number of articles associated with the 100 selected symbols (43,459), since an article can be associated with multiple symbols. The positive proportion is the largest under the MPQA projection, and the smallest under the LM projection. The negative proportions under the three projections are similar. Polarity in Table 5 measures the relative dominance between positive sentiment and negative sentiment. For example, the situation $Pos_{i,t}(\mathrm{BL}) > Neg_{i,t}(\mathrm{BL})$ accounts for 88.04% of the 43,569 observations. Note that under each projection, there is a small percentage of the observations for which

P os i,t = N eg i,t . Under both the BL and MPQA projections, positive sentiment is moredominant and widespread than negative sentiment. The LM projection, however, resultsin a relative balance between positive and negative sentiment.To check whether the sentiment polarity actually reﬂects the sentiment of the articles,14ariable (cid:98) µ (cid:98) σ Max Q1 Q2 Q3 Polarity

P os i,t (BL) 0.033 0.012 0.134 0.025 0.032 0.040 88.04%

N eg i,t (BL) 0.015 0.010 0.091 0.008 0.014 0.020 10.51%

P os i,t (LM) 0.014 0.007 0.074 0.009 0.013 0.018 55.70%

N eg i,t (LM) 0.012 0.009 0.085 0.006 0.011 0.016 40.17%

P os i,t (MPQA) 0.038 0.012 0.134 0.031 0.038 0.045 96.26%

N eg i,t (MPQA) 0.013 0.008 0.133 0.007 0.012 0.017 2.87%

Note: Sample mean, sample standard deviation, maximum value, 1st, 2nd and 3rd quartiles,and polarity. These descriptive statistics are conditional on I i,t = 1. Table 5: Summary Statistics for Text Sentiment VariablesManual BL Label LM Label MPQA LabelLabel Pos Neg Neu Pos Neg Neu Pos Neg Neu TotalPos 56 4 1 41 12 8 61 0 0 61Neg 9 2 1 0 9 3 9 2 1 12Neu 22 5 0 10 15 2 26 0 1 27Total 87 11 2 51 36 13 96 2 2 100Table 6: Sentiment Classiﬁcation Results for 100 Randomly Selected Articleswe actually carefully checked and read the contents of 100 randomly selected articles andmanually classiﬁed their polarity (positive, negative and neutral), and also use the lexicalprojections to automatically classify these articles as follows. If the proportion of positivewords for an article is larger than (or small than, or equal to) the proportion of negativewords for the same article, then this article is automatically classiﬁed as positive (or nega-tive, or neutral). Table 6 reports the results. It appears that the BL and MPQA projectionsput too much weight on positive sentiment, and are not powerful in detecting negative sen-timent. In contrast, the LM sentiment is powerful in detecting negative sentiment, but isnot so good in detecting positive sentiment.Figure 2 and 3 respectively show the monthly correlation between positive and negativeproportions under two of the three projections. In general, the negative proportions are15ore correlated than positive proportions. Also, the correlation between the BL and LMprojections and that between the BL and MPQA projections are larger than the correlationbetween the LM and MPQA projections, which is consistent with the discussion about thelist of words shared by two of the three projections (see Table 3). . . . . Date C o rr e l a t i on Figure 2: Monthly Correlation between Positive Sentiment: BL and LM (solid), BL andMPQA (dashed), LM and MPQA (dotted) . . . . 
[Figure 3: Monthly Correlation between Negative Sentiment: BL and LM (solid), BL and MPQA (dashed), LM and MPQA (dotted). Axes: Date, Correlation.]

3.1.2 Main Results

Recall from Section 2.1 that we focus on three stock reaction indicators: log volatility $\log \sigma_{i,t}$, where $\sigma_{i,t}$ is defined in (1), detrended log trading volume $V_{i,t}$ as in (2), and returns $R_{i,t}$. We first analyze these three indicators one trading day into the future, using the following (separate) panel regressions:

\begin{align}
\log \sigma_{i,t+1} &= \alpha + \beta_1 I_{i,t} + \beta_2 Pos_{i,t} + \beta_3 Neg_{i,t} + \beta_4^{\top} X_{i,t} + \gamma_i + \varepsilon_{i,t}, \tag{3} \\
V_{i,t+1} &= \alpha + \beta_1 I_{i,t} + \beta_2 Pos_{i,t} + \beta_3 Neg_{i,t} + \beta_4^{\top} X_{i,t} + \gamma_i + \varepsilon_{i,t}, \tag{4} \\
R_{i,t+1} &= \alpha + \beta_1 I_{i,t} + \beta_2 Pos_{i,t} + \beta_3 Neg_{i,t} + \beta_4^{\top} X_{i,t} + \gamma_i + \varepsilon_{i,t}, \tag{5}
\end{align}

where $\gamma_i$ is the fixed effect for stock symbol $i$ satisfying $\sum_i \gamma_i = 0$. $X_{i,t}$ is a vector of control variables that includes a set of market variables to control for systematic risk, namely (1) the S&P 500 index return ($R_{M,t}$) to control for general market returns and (2) the CBOE VIX index on date $t$ to measure generalized risk aversion ($VIX_t$), and a set of firm idiosyncratic variables, namely (3) the lagged log volatility ($\log \sigma_{i,t}$), (4) the lagged return ($R_{i,t}$) and (5) the lagged detrended log trading volume ($V_{i,t}$), where the lagged dependent variable is used to capture persistence and omitted variables. These three indicators essentially have a triple dynamic correlation, and they have been modeled as a trivariate vector autoregressive (VAR) model, see Chen et al. (2001) and Chen et al. (2002). Our indicators in Eqs. (4) to (5) not only have a dynamic relationship with their own lagged values, but are also impacted by the other lagged indicators.
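To illustrate how the stock fixed effects $\gamma_i$ in the panel regressions above can be handled, the following minimal sketch applies the within transformation (demeaning each variable by stock) and then runs ordinary least squares on the demeaned data. It is an illustration under stated assumptions, not the authors' implementation: the helper names `demean_by_group`, `solve` and `within_ols` are hypothetical, the toy data are made up and noise-free, only two regressors are used, and no standard errors are computed.

```python
from collections import defaultdict


def demean_by_group(values, groups):
    """Subtract each group's mean (the within transformation that removes gamma_i)."""
    sums, counts = defaultdict(float), defaultdict(int)
    for v, g in zip(values, groups):
        sums[g] += v
        counts[g] += 1
    return [v - sums[g] / counts[g] for v, g in zip(values, groups)]


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def within_ols(y, columns, groups):
    """Fixed-effects slopes: OLS of demeaned y on demeaned regressors."""
    yd = demean_by_group(y, groups)
    xd = [demean_by_group(col, groups) for col in columns]
    k = len(xd)
    xtx = [[sum(p * q for p, q in zip(xd[i], xd[j])) for j in range(k)] for i in range(k)]
    xty = [sum(p * q for p, q in zip(xd[i], yd)) for i in range(k)]
    return solve(xtx, xty)


# Hypothetical toy panel: two stocks, four days each, response built without noise.
groups = ["A"] * 4 + ["B"] * 4
neg = [0.010, 0.020, 0.015, 0.030, 0.020, 0.010, 0.025, 0.012]
pos = [0.030, 0.035, 0.040, 0.020, 0.050, 0.030, 0.031, 0.045]
gamma = {"A": 0.5, "B": -0.5}
y = [1.0 + 2.0 * n - 1.0 * p + gamma[g] for n, p, g in zip(neg, pos, groups)]

betas = within_ols(y, [neg, pos], groups)
print(betas)  # slopes on neg and pos; the intercept and gamma_i are differenced out
```

Demeaning removes both the intercept and the stock effects, so only the slope coefficients are recovered; this is why the sketch returns two numbers for two regressors.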
We incorporate clustered standard errors by Arellano (1987) as they allow for both time and cross-sectional dependence in the residuals. Petersen (2009) concludes that standard errors clustered on both dimensions are unbiased and achieve correctly sized confidence intervals, while ordinary least squares standard errors might be biased in a panel data setting.

To answer our research question (i), if the three lexica are not consistent in their analytic ability to produce stock reaction indicators, we would expect that the value and the significance of $\beta_1$, $\beta_2$ or $\beta_3$ (the coefficients of $I_{i,t}$, $Pos_{i,t}$ and $Neg_{i,t}$) vary across the three lexical projections. For question (ii), if the positive and negative sentiments have asymmetric impacts, we would expect that $\beta_2$ and $\beta_3$ have different signs or significance. To address question (iii), we would expect that the value and the significance of $\beta_1$, $\beta_2$ or $\beta_3$ vary with different attention levels, and in particular that the coefficient size is larger for higher attention firms. As to question (iv), we would expect that the coefficients of the sentiment variables are sector-specific.

[Table 7: Entire Panel Regression Results. Columns: BL, LM, MPQA, PCA. Panel A: Future Log Volatility $\log \sigma_{i,t+1}$; Panel B: Future Detrended Log Trading Volume $V_{i,t+1}$; Panel C: Future Returns $R_{i,t+1}$. Rows in each panel: $I_{i,t}$, $Pos_{i,t}$, $Neg_{i,t}$ and the controls $R_{M,t}$, $VIX_t$, $\log \sigma_{i,t}$, $R_{i,t}$, $V_{i,t}$. *** refers to a p value less than 0.01, ** refers to a p value more than or equal to 0.01 and smaller than 0.05, and * refers to a p value more than or equal to 0.05 and less than 0.1. Values in parentheses are clustered standard errors.]

We will discuss the analysis of different attention levels and different sectors in Sections 3.2 and 3.3, respectively, and focus now on the entire sample. The regression results are given in Table 7. Results in Panel A indicate that the arrival of articles ($I_{i,t}$) distilled using the LM method is strongly negatively related to future log volatility, and that, contingent on arriving articles, the negative sentiment distilled using the three methods is significantly positively related to future log volatility, whereas the positive sentiment distilled using the BL and MPQA methods is significantly negatively related to future log volatility. Results in Panel B show that, contingent on arriving articles, the positive and negative sentiment have asymmetric strong impacts on future detrended log trading volume: the negative sentiment across the three lexica strongly drives up future detrended log trading volume, whereas the positive sentiment distilled using the BL and MPQA methods is strongly negatively related to future detrended log trading volume. The arrival of articles also strongly drives up future detrended log trading volume across the three lexica. These findings support the mixture of distribution hypothesis originated by Clark (1973). As to future returns in Panel C, across the three lexica and contingent on arriving articles, the positive sentiment is strongly positively related to future returns whereas the negative sentiment is unrelated to future returns.
This finding makes a case against one unpleasant finding from Antweiler and Frank (2004), in which bullishness is not statistically significant for future returns. It is interesting to note that the coefficients for the control variables do not vary much across lexical projections, which indicates that the sentiment measures are not strongly correlated with the control variables and indeed provide incremental information.

It is difficult to diagnose a consensual performance from Table 7 because each lexicon may not fully reflect the complete sentiment and may have its own idiosyncratic nature, as is evident from Table 2. To overcome the problem that none of the lexica is perfectly complete, we design an artificial sentiment index, the first principal component, to capture a common component of the three lexica and to take into account the fact from Figures 2 and 3 that they reveal shared sentiment. The positive (negative) sentiment index explains 94.14% (92.33%) of the total sample variance. As seen in the last column of Table 7, these general positive and negative sentiment indices help to achieve more consistent and interpretable results. The negative sentiment index spurs future stock volatility and trading volume. However, the positive sentiment index has only a very limited influence on future volatility, and suppresses trading volume but increases stock returns.

3.1.3 Sentiment Effect with Larger Lags and Neutral Sentiment

Based on the sequential arrival of information hypothesis (hereafter SAIH, Copeland, 1976, 1977), information arrives to traders at different times and hence relationships with lags larger than one can exist. Hence, we extend the lag length under investigation to two to five trading days and run regressions using the entire sample.

From Table 8, volatility still reacts to news lagged by two days, but not to earlier news: the two-day lagged negative sentiment extracted by BL and LM is influential, indicating that the SAIH is observed here but that the lagged relationship is restricted to the one and two days after an article was posted. In this sense, the market seems efficient in that it takes no longer than two days to incorporate information. Likewise, we find that negative sentiment lagged by two days still has an influence on future returns: the coefficients across the three lexical projections are significant but positive. The negative sign at lag one, although insignificant, turns positive at lag two, reflecting that stock returns revert to their mean value, which is consistent with Antweiler and Frank (2004). In other words, the sign at lag one indicates a slight negative influence on tomorrow's stock returns, but returns revert to their mean value two days later, as shown by the positive sign, once the negative news vanishes. The sooner the reversion, the more efficient the market. For the detrended log trading volume, the lagged effect is relatively insignificant.

Financial markets are characterized by the clustering of information (news) arrival, which gives rise to volatility clustering (Engle, 2004). The clustering of arrival of sentiment information motivates us to accumulate the sentiment variables from past trading days. Let $I_{i,t:(t+h-1)}$, $Pos_{i,t:(t+h-1)}$ and $Neg_{i,t:(t+h-1)}$ denote the indicator of arrival of articles, the average proportion of positive words and the average proportion of negative words based on articles published on or after trading day $t$ and before trading day $t+h$. Strikingly, the accumulated sentiment effect projected by the BL and LM methods on future volatility shown in Table 9 is very clear and remains asymmetric, that is, volatility reacts only to negative, not to positive, sentiment.
Sometimes sentiment news arrives consecutively and its accumulated influence lasts up to five trading days (one week). The accumulated sentiment effect can also be observed on the detrended log trading volume when accumulating over four and five lagged days, and on future returns when accumulating over two lagged days.

We also considered the proportion of neutral words and examined its impact. Based on the neutral proportion defined by the MPQA method, we find in general that neutral words have no influence on the stock indicators. The results can be provided upon request.

[Table 8: Entire Panel Regression Results with Larger Lags (Noncumulative Articles). Column groups: BL, LM, MPQA, each with $I_{i,t}$, $Pos_{i,t}$ and $Neg_{i,t}$. Rows: lag $h = 2, 3, 4, 5$ within Panel A: Future Volatility $\log \sigma_{i,t+h}$; Panel B: Future Detrended Log Trading Volume $V_{i,t+h}$; Panel C: Future Returns $R_{i,t+h}$. *** refers to a p value less than 0.01, ** refers to a p value more than or equal to 0.01 and smaller than 0.05, and * refers to a p value more than or equal to 0.05 and less than 0.1. Values in parentheses are standard errors.]

[Table 9: Entire Panel Regression Results with Larger Lags (Cumulative Articles). Column groups: BL, LM, MPQA, each with $I_{i,t:(t+h-1)}$, $Pos_{i,t:(t+h-1)}$ and $Neg_{i,t:(t+h-1)}$. Rows: lag $h = 2, 3, 4, 5$ within Panel A: Future Volatility $\log \sigma_{i,t+h}$; Panel B: Future Detrended Log Trading Volume $V_{i,t+h}$; Panel C: Future Returns $R_{i,t+h}$. *** refers to a p value less than 0.01, ** refers to a p value more than or equal to 0.01 and smaller than 0.05, and * refers to a p value more than or equal to 0.05 and less than 0.1. Values in parentheses are standard errors.]
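The accumulated variables used for Table 9 can be sketched as follows. This is an illustration under assumptions, not the authors' code: the function name `accumulate` and the toy data are hypothetical, the window average is taken over the days that carry articles, and the sentiment values are set to zero when no article arrives in the window.

```python
def accumulate(days, t, h):
    """Return (I, Pos, Neg) accumulated over trading days t, ..., t + h - 1.

    `days` holds one (indicator, pos, neg) tuple per trading day for a single
    stock. Averaging over article days only is an assumption of this sketch.
    """
    window = [d for d in days[t:t + h] if d[0] == 1]
    if not window:
        return 0, 0.0, 0.0  # no article in the window
    n = len(window)
    pos = sum(d[1] for d in window) / n
    neg = sum(d[2] for d in window) / n
    return 1, pos, neg


# Hypothetical daily records: articles arrive on days 0 and 2 only.
days = [(1, 0.03, 0.01), (0, 0.0, 0.0), (1, 0.05, 0.03), (0, 0.0, 0.0)]
print(accumulate(days, 0, 3))  # window covers days 0, 1 and 2
print(accumulate(days, 3, 1))  # window covers day 3 only, which has no article
```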

The text sentiment effects, as reported in Table 7, allow us deeper insights and analysis. More precisely, we may address the important question of asymmetric reactions to the given sentiment scales. To do so, we employ Monte Carlo techniques to investigate different facets of the sentiment effects. The components of this Monte Carlo study are: (1) to simulate the appearance of articles with presumed probabilities; (2) to provide a realistic set of scenarios regarding the frequency and content (positive vs. negative) of articles; (3) to obtain the volatility induced by the generated articles (using Table 7); (4) to demonstrate the impact of synthetic text on future volatility; (5) to visualize and test an asymmetry effect as formulated in research question (ii).

The simulation scenarios (for each variable involved) are summarized briefly as follows. We employ a Bernoulli random variable $I_{i,t}$ indicating that articles arrive at a specific frequency $p_i$, where for each individual stock symbol $i$, $p_i$ is estimated by the fraction of days with at least one relevant article. Given the outcome of this article indicator, we generate the corresponding positive and negative proportions through a copula approach using the conditional inversion method as described in Frees and Valdez (1998). We follow the two-step approach that is widely used in the literature, for example in Patton (2006), Hotta et al. (2006) and Di Clemente and Romano (2004). In the first step, the marginal distributions are modeled by their corresponding empirical distribution function (edf) to avoid imposing a parametric distribution; in the second step, a Gaussian copula is estimated to take the inherent dependence among variables into account. For the sentiment variables, this approach is applied to each firm separately, since each firm has a different $p_i$ and only days with at least one article relevant to the firm are included in the estimation.
To simulate market returns $R_{M,t}$ and individual returns $R_{i,t}$ for all 100 symbols, we first filter these variables by estimated MA(1)-GARCH(1,1) processes and standardize the residuals by dividing them by the estimated standard deviations. We then apply the copula approach to the standardized residuals, and the simulated standardized residuals are transformed into simulated values of $R_{M,t}$ or $R_{i,t}$ by multiplying them by the median of the previously estimated standard deviations for the market or the specific firm $i$. The company-specific fixed effects $\gamma_i$ are not incorporated, as the simulated volatility for different firms would otherwise not be graphically comparable. For the other control variables, the CBOE VIX index $VIX_t$ is fixed at its mean value over the sample period, and past log volatility and past detrended log trading volume are not used in the simulation.

Figure 4 demonstrates, for one simulation, the association between the negative and positive proportions as distilled via our three projection methods and their simulated future volatility outcomes. We estimate a local linear regression model (solid line) and corresponding 95% uniform confidence bands based on Sun and Loader (1994). Both are estimated using Locfit by Loader (1999) in the R environment. Loader and Sun (1997) discuss the robustness of this approach and conclude that the results are conservative but reasonable for heavy tailed error distributions. The bandwidth is chosen automatically by the plug-in selector of Ruppert et al. (1995). We limit the visible area to sentiment values between 0 and 0.04 as well as volatility values between 1.45 and 1.65 to make the different lexica visually comparable. Nevertheless, all simulated values are utilized in the estimation of the regression curve and confidence bands. The clustered points lying on the vertical axis indicate the absence of articles. The range of this cluster, from 0.77 to 2.57, is caused by the impact of the simulated control variables as well as the idiosyncratic impact captured by the residual term.

Apparently, an asymmetry effect becomes visible. One observes that the slopes of the volatility curves given negative sentiment are mainly positive, while the curves for positive sentiment seem to be rather flat and even go down in the case of the BL and MPQA methods. One can also compare the confidence bands to address the question of whether negative sentiment has a significantly higher effect on volatility than positive sentiment. The confidence bands of $Pos$ and $Neg$ do not overlap for sentiment values between 0.023 and 0.056 for BL, between 0.017 and 0.039 for LM and between 0.023 and 0.05 for MPQA.

This asymmetry effect parallels the well known imbalance of future volatility given good vs. bad news. The leverage effect depicts a negative relation between the lagged return and the risk resulting from bad news that causes higher volatility. Black (1976) and Christie (1982) find that bad news in the financial market produces such an asymmetric effect on future volatility relative to good news. This leverage effect has also been shown by Bekaert and Wu (2000) and Feunou and Tédongap (2012). In the same vein, Glosten et al. (1993) introduce GARCH with differing effects of negative and positive shocks, taking into account the leverage effect.
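The article-arrival and copula steps of the simulation described in this section can be sketched in a few lines. This is a simplified illustration, not the authors' implementation: the function names `empirical_quantile` and `simulate_day`, the sample values and the correlation `rho` are hypothetical, and `p` and `rho` are taken as given here rather than estimated per firm. For a bivariate Gaussian copula, the conditional draw reduces to sampling the second normal variate given the first.

```python
import math
import random
from statistics import NormalDist


def empirical_quantile(sorted_sample, u):
    """Invert the empirical distribution function at probability u."""
    idx = min(int(u * len(sorted_sample)), len(sorted_sample) - 1)
    return sorted_sample[idx]


def simulate_day(p, pos_sample, neg_sample, rho, rng):
    """Draw (I, Pos, Neg) for one stock-day.

    p is the article frequency and rho the Gaussian copula correlation; the
    marginals are the empirical distributions of the supplied samples.
    """
    if rng.random() >= p:  # Bernoulli(p): no article arrives
        return 0, 0.0, 0.0
    z1 = rng.gauss(0.0, 1.0)
    # Conditional draw: z2 given z1 is normal with mean rho*z1, variance 1-rho^2.
    z2 = rho * z1 + math.sqrt(1.0 - rho ** 2) * rng.gauss(0.0, 1.0)
    u1, u2 = NormalDist().cdf(z1), NormalDist().cdf(z2)
    pos = empirical_quantile(sorted(pos_sample), u1)
    neg = empirical_quantile(sorted(neg_sample), u2)
    return 1, pos, neg


# Hypothetical inputs for one firm: observed proportions on article days.
rng = random.Random(42)
pos_obs = [0.02, 0.03, 0.04, 0.05]
neg_obs = [0.005, 0.010, 0.015, 0.020]
sample = [simulate_day(0.35, pos_obs, neg_obs, 0.3, rng) for _ in range(5)]
print(sample)
```

Because the marginals are empirical, every simulated proportion is one of the observed values; only the joint dependence is governed by the copula.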

[Figure 4: Monte Carlo Simulation based on Entire Sample Results. Panels plot simulated volatility against the negative proportion (BL, h = 0.001; LM, h = 0.001; MPQA, h = 0.001) and against the positive proportion (BL, h = 0.002; LM, h = 0.001; MPQA, h = 0.003).]

While people post their text to express their opinions, or their comments on other articles, they are undoubtedly paying attention to the firm mentioned in their articles. In this respect, article posting is a revealed attention measure. In fact, in our collected 43,459 articles across 100 stocks, it is obvious that firms do not receive equal attention. To reflect these differences, we define the attention ratio for a symbol as the number of days with articles divided by the total number of days in the sample period, 1,255. The symbol “AAPL” (Apple Computer Inc.) attracts the most attention with an attention ratio of 0.818. Articles involving AAPL arrive in social media almost every day (81.8 days out of 100 days). However, the symbol “TRV” (Travelers Companies, Inc.) has the lowest attention ratio, 0.204, which means that one finds a related article every five trading days, i.e. one week. Different from the “indirect” attention measures derived from stock indicators such as trading volumes, extreme returns or price limits, this attention measure is a kind of “direct” measure of investor attention, and shares the same idea as the Search Volume Index (SVI) constructed by Google. Beyond the SVI, our attention can be further projected to “Positive” or “Negative” attention. In our main research question (ii), we are interested in whether the well known asymmetric response (bad vs. good news) is appropriately reflected in the lexical projections. Assuming that investors are more risk-averse, they should be more aware of negative articles and pay more attention to them.

Attention is one of the basic elements in traditional asset pricing models. The conventional asset pricing models assume that information is instantaneously incorporated into asset prices when it arrives. The basic assumption behind this argument is that investors pay “sufficient” attention to the asset. Under this condition, the market price of an asset should be very efficient in incorporating any relevant news. In this respect, the high attention firms should be more responsive to the text sentiment distilled from the articles, and their market prices should reflect this efficiency. As such, the high attention samples stand on the side of the traditional asset pricing models, and the findings from them are expected to support the efficient market hypothesis. However, attention in reality is a scarce cognitive resource, and investors have limited attention instead (Kahneman, 1973). Further research on this topic by Merton (1987), Sims (2003) and Peng and Xiong (2006) confirms that limited attention can affect asset pricing. The low attention firms, receiving very limited attention, may ineffectively or insufficiently reflect the text sentiment information, so that their corresponding stock reactions could be greatly bounded. This argument is in accordance with the fact that limited attention causes stock prices to deviate from their fundamental values (Hong and Stein, 1999), implying a potential arbitrage opportunity.

Grouping the samples by their attention ratios and examining the responses of the different attention groups may offer a clue to the aforementioned conjectures. The criterion used to group the sample firms is based on the quantiles of the attention ratio. Firms whose attention ratios are above the 75% quantile (0.3693) are grouped as “extremely high”, between the 50% (0.3026) and 75% quantiles as “high”, between the 25% (0.2455) and 50% quantiles as “median”, and lower than the 25% quantile as “low”. For each attention group, Table 10 reports across lexical projections the mean values of the positive ($\mu_{Pos}$) and negative ($\mu_{Neg}$) sentiment proportions, calculated by averaging $Pos_{i,t}$ or $Neg_{i,t}$ over all relevant symbol-day combinations, the proportion of relevant symbol-day combinations with $Neg_{i,t} > Pos_{i,t}$, the average attention ratio, and the average number of days with articles, calculated by averaging the number of days with articles over all relevant symbols. The “extremely high” group receives an average attention ratio of 55.14%, indicating that on average these firms have been looked at every two days. By contrast, the low attention group, with an average attention ratio of 21.97%, receives attention at a weekly frequency (5 trading days). By comparing the magnitude of $\mu_{Neg}$, one observes that investors are inclined to express negative sentiments in the “extremely high” group. One may therefore conclude that higher attention comes with “negative text” or, conversely, that negative articles create higher attention. This is evident for example in the case of the LM method, where the proportion of symbol-day combinations with dominance of negative sentiment is 46% in the “extremely high” group. For the constituents in this particular attention group, we find on average 691 days with articles over a total of 1255 sample days (5 years), which is almost three times the average number of days with articles for the low attention group.

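| Attention      | BL µ_Pos | BL µ_Neg | BL Neg > Pos | LM µ_Pos | LM µ_Neg | LM Neg > Pos | MPQA µ_Pos | MPQA µ_Neg | MPQA Neg > Pos | Attention Ratio | Number of Days with Articles |
|----------------|----------|----------|--------------|----------|----------|--------------|------------|------------|----------------|-----------------|------------------------------|
| Extremely high | 0.032    | 0.016    | 0.119        | 0.013    | 0.014    | 0.460        | 0.038      | 0.013      | 0.027          | 0.551           | 691                          |
| High           | 0.032    | 0.015    | 0.113        | 0.013    | 0.012    | 0.403        | 0.038      | 0.013      | 0.031          | 0.343           | 430                          |
| Median         | 0.035    | 0.014    | 0.083        | 0.014    | 0.011    | 0.339        | 0.039      | 0.012      | 0.027          | 0.273           | 356                          |
| Low            | 0.036    | 0.014    | 0.086        | 0.015    | 0.011    | 0.333        | 0.040      | 0.012      | 0.031          | 0.220           | 264                          |

Table 10: The Summary Statistics for Different Attention Ratio Groups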
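The attention ratio and the quantile grouping behind Table 10 can be written down directly. The sketch below uses the sample length and the quantile cut-offs stated in the text; the function names `attention_ratio` and `attention_group` are hypothetical, and assigning a ratio that falls exactly on a cut-off to the lower group is an assumption of this sketch.

```python
TOTAL_DAYS = 1255  # trading days in the sample period

# Quantile cut-offs of the attention ratio as reported in the text.
Q25, Q50, Q75 = 0.2455, 0.3026, 0.3693


def attention_ratio(days_with_articles, total_days=TOTAL_DAYS):
    """Number of days with articles divided by the number of sample days."""
    return days_with_articles / total_days


def attention_group(ratio):
    """Map an attention ratio to one of the four attention groups."""
    if ratio > Q75:
        return "extremely high"
    if ratio > Q50:
        return "high"
    if ratio > Q25:
        return "median"
    return "low"


print(attention_group(0.818))  # attention ratio reported for AAPL
print(attention_group(0.204))  # attention ratio reported for TRV
print(round(attention_ratio(691), 3))  # 691 article days out of 1255
```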

The central interest of this research lies in understanding to what extent the distilled news flow and its derived parameters (like attention) affect the relation between text sentiment and stock reactions. We employ a panel regression for each of the given attention groups, so that each panel regression comprises 25 sample firms. The results are displayed in Table 11. For the “extremely high” group, the text sentiment carries a major and highly significant influence on future volatility consistently across the three lexical projections. As a caveat, though, please note that the sentiment effect on volatility shown in Panel A is exclusive to negative news: contingent on arriving articles, stock volatility rarely reacts to positive or optimistic news. Panel B summarizes the attention analysis for the detrended log trading volume. For the “extremely high” group, in the LM and MPQA projection methods, the arrival of articles ($I_{i,t}$) brings relevant information and creates a growing trading volume, especially when it comes with negative news. The corresponding results for stock returns are also reasonable. The stock returns of the “high” group react clearly to the sentiments: contingent on arriving articles, they rise for optimistic news and decline for pessimistic consensus. In the case of the LM method, the significant positive coefficient of $Neg_{i,t}$ for the “extremely high” group suggests that the market participants act according to the uncertain market hypothesis developed by Brown et al. (1988) and based on the overreaction hypothesis by Bondt and Thaler (1985). Here, the market participants set new prices before the full range of the news content is resolved. In case of unfavorable news, the investors set stock prices significantly below their conditional expected values and thus react in a risk-averse manner. On the subsequent day, the mispriced stock price will revert to its true value.

The empirical evidence collected so far suggests that the distilled news of high attention firms effectively drives their stock volatilities, trading volumes and returns. They are highly responsive to the sentiment across lexical projections.

[Table 11: Attention Analysis: The Impact on Future Volatility, Trading Volume and Returns. Column groups: BL, LM, MPQA, each with $I_{i,t}$, $Pos_{i,t}$ and $Neg_{i,t}$. Rows: attention groups Low, Median, High and Extremely High within Panel A: Future Volatility $\log \sigma_{i,t+1}$; Panel B: Future Detrended Log Trading Volume $V_{i,t+1}$; Panel C: Future Returns $R_{i,t+1}$. *** refers to a p value less than 0.01, ** refers to a p value more than or equal to 0.01 and smaller than 0.05, and * refers to a p value more than or equal to 0.05 and less than 0.1. Values in parentheses are clustered standard errors.]

Given the high attention received, any relevant information, including the articles written by individual traders, has been fully incorporated into their asset prices and dynamics. Due to this efficiency, the articles posted and discussed today can predict stock reactions tomorrow. For lower attention firms, one cannot make such a strong claim. Investors may think those firms are negligible and may therefore underreact to the available information. The underreaction from limited attention is likely to cause stock prices to deviate from their fundamental values, and an arbitrage opportunity may emerge. Our evidence is in line with Da et al. (2011), who support the attention-induced price pressure hypothesis. By using the SVI from Google as an attention measure, they find stronger attention-induced price pressure among stocks in which individual investor attention matters most. Beyond their study, we find that high attention is usually accompanied by negative articles, and that negative articles contribute more to attention and cause more stock reactions, supporting an asymmetric response.

It is interesting to note that the coefficients for the control variables do not vary much across lexical projections in each attention group (results not shown here), which indicates that for each attention group, the sentiment measures are not strongly correlated with the control variables and provide incremental information.

As in Section 3.1.4, we present a realistic Monte Carlo scenario for the different attention groups using the results from Table 11. We keep the parameter settings of the data generation and the calculation of confidence bands as before. Figure 5 summarizes the associations between the negative proportions and the simulated future volatilities across the attention groups. The scatter plots of the high attention panel are quite dense, whereas those of the low attention group are sparser because articles arrive less frequently. Interestingly, the higher volatilities of high attention firms are prominently driven by negative text sentiment, but have an inverse relationship with positive sentiment. Comparing the confidence bands, we can conclude for all three lexica that the effect of negative sentiment differs significantly from that of positive sentiment. The regions where the bands do not overlap are quite large for BL (0.022 - 0.[…]) and for the other two lexica (0.020 - 0.[…] and 0.019 - 0.[…]).

The stock reactions that we analyze in relation to text sentiment can be further segmented into sector specific responses. Given a growing body of literature suggesting that industry plays a role in stock reactions (see Fama and French (1997), Chen et al. (2007), Hong et al. (2007)), we investigate whether this relation is industry-specific in nature. A detailed analysis of sector specific reactions would go far beyond the scope of this paper and is in fact the subject of research by Chen et al. (2015). We therefore only highlight a few insights from lexical sentiment for the business sectors. We ignore the "Telecommunication Services" sector since it contains only two stock symbols. Descriptive statistics for the other 8 sectors are displayed in Table 12 across the three lexical projections. It is of interest to study how the proportion of negative over positive sentiment varies across the 8 sectors. One observes that, consistently over all lexical projections, the financial sector has the highest average discrepancy between negative and positive proportions. By contrast, the health care sector has (except for MPQA) the lowest average discrepancy. Investors show their discrepant opinions or disagreement in a very extreme case of $Neg > Pos = 0.$[…]

[Figure 5 panels: simulated volatility plotted against sentiment proportion. High attention: BL negative (h = 0.002), LM negative (h = 0.002), MPQA negative (h = 0.002), BL positive (h = 0.004), LM positive (h = 0.002), MPQA positive (h = 0.004). Low attention: BL negative (h = 0.001), LM negative (h = 0.001), MPQA negative (h = 0.001), BL positive (h = 0.002), LM positive (h = 0.001), MPQA positive (h = 0.003).]

Figure 5: Monte Carlo Simulation based on Attention Analysis Results
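The curves in Figure 5 are smoothed fits of simulated volatility on a sentiment proportion, each with a bandwidth h. The sketch below shows one way such a fit can be computed, assuming a local linear smoother with a Gaussian kernel; the paper refers to Section 3.1.4 for its actual procedure, so the kernel choice, the function name `local_linear` and the toy data are assumptions made for illustration only. Confidence bands are not computed here.

```python
import math


def local_linear(x, y, x0, h):
    """Local linear estimate of E[y | x = x0] using a Gaussian kernel.

    Solves the weighted least squares problem
        min_{a,b} sum_i w_i * (y_i - a - b * (x_i - x0))**2
    and returns the intercept a, which is the fitted value at x0.
    """
    s0 = s1 = s2 = t0 = t1 = 0.0
    for xi, yi in zip(x, y):
        d = xi - x0
        w = math.exp(-0.5 * (d / h) ** 2)  # kernel weight with bandwidth h
        s0 += w
        s1 += w * d
        s2 += w * d * d
        t0 += w * yi
        t1 += w * d * yi
    det = s0 * s2 - s1 * s1
    if det <= 0.0:
        raise ValueError("not enough distinct points near x0 for this bandwidth")
    return (s2 * t0 - s1 * t1) / det


# Toy data (hypothetical): negative proportions and a noiseless linear response.
neg_prop = [k * 0.001 for k in range(41)]        # 0.000, 0.001, ..., 0.040
sim_vol = [0.5 + 10.0 * v for v in neg_prop]     # stand-in for simulated volatility

for x0 in (0.010, 0.020, 0.030):
    print(x0, round(local_linear(neg_prop, sim_vol, x0, h=0.002), 6))
```

A smaller h follows the data more closely but needs denser observations. This fits the pattern in the panel titles, where the sparser low attention plots list bandwidths no larger than those of the corresponding high attention plots.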

                                    BL                         LM                        MPQA             Attention
Sector                      μ_Pos    μ_Neg  Neg>Pos    μ_Pos    μ_Neg  Neg>Pos    μ_Pos    μ_Neg  Neg>Pos    Ratio
Consumer Discretionary      0.034    0.014    0.088    0.014    0.011    0.346    0.038    0.012    0.030    0.332
Consumer Staples            0.034    0.014    0.099    0.014    0.012    0.365    0.037    0.013    0.025    0.324
Energy                      0.028    0.015    0.152    0.011    0.011    0.467    0.038    0.014    0.033    0.370
Financials                  0.032    0.019    0.195    0.013    0.018    0.594    0.038    0.015    0.045    0.413
Health Care                 0.035    0.014    0.059    0.014    0.011    0.344    0.039    0.014    0.031    0.287
Industrials                 0.035    0.012    0.069    0.013    0.011    0.355    0.041    0.011    0.018    0.336
Information Technology      0.033    0.015    0.101    0.014    0.012    0.373    0.038    0.023    0.012    0.364
Materials                   0.034    0.014    0.097    0.013    0.013    0.498    0.039    0.031    0.013    0.287

Note: This table reports, for the BL, LM and MPQA methods, the mean values of the positive (μ_Pos) and negative (μ_Neg) sentiment proportions as well as the proportion of relevant symbol-day combinations in which negative sentiment dominates (Neg>Pos). For each sector, an article is counted only if a firm appearing in the article belongs to this sector. The attention ratio for each sector is calculated as the number of days with articles related to this sector divided by the total number of days in the sample period.
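The quantities in Table 12 follow directly from the definitions in the note. As a minimal sketch of that bookkeeping, the code below computes them for a handful of made-up symbol-day records; the record layout, the function name `sector_summary` and all numbers are hypothetical and serve only to illustrate the definitions.

```python
from collections import defaultdict

# Hypothetical symbol-day records:
# (sector, day, positive proportion, negative proportion)
records = [
    ("Financials", "2014-01-02", 0.030, 0.040),
    ("Financials", "2014-01-03", 0.040, 0.020),
    ("Financials", "2014-01-06", 0.020, 0.030),
    ("Health Care", "2014-01-02", 0.040, 0.010),
    ("Health Care", "2014-01-06", 0.030, 0.020),
]
TOTAL_DAYS = 4  # assumed number of days in the toy sample period


def sector_summary(rows, total_days):
    """Per-sector means, share of Neg > Pos records, and attention ratio."""
    by_sector = defaultdict(list)
    for sector, day, pos, neg in rows:
        by_sector[sector].append((day, pos, neg))

    summary = {}
    for sector, obs in by_sector.items():
        n = len(obs)
        summary[sector] = {
            "mu_pos": sum(pos for _, pos, _ in obs) / n,
            "mu_neg": sum(neg for _, _, neg in obs) / n,
            # Share of symbol-day records where negative sentiment dominates.
            "neg_gt_pos": sum(1 for _, pos, neg in obs if neg > pos) / n,
            # Distinct days with at least one article, over all sample days.
            "attention_ratio": len({day for day, _, _ in obs}) / total_days,
        }
    return summary


for sector, stats in sector_summary(records, TOTAL_DAYS).items():
    print(sector, {k: round(v, 3) for k, v in stats.items()})
```

Counting distinct days rather than records in the attention ratio matters once several firms of one sector have articles on the same day.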

Table 12: Summary statistics in each sector

Attention also varies across sectors. The financials sector has attracted the highest attention, with an attention ratio of 0.413. This may be attributed to (1) investors' widespread involvement in this industry, since nearly everyone maintains a relationship with banks to deposit money, trade securities or meet other financial needs; (2) the outbreak of the US subprime crisis and the European sovereign debt crisis, which drew great attention to this sector; and (3) the sector's sensitivity to changes in the economy, monetary policy and regulatory policy. The health care sector, however, attracts much less attention (attention ratio 0.287), which could be explained by stable demand and reduced sensitivity to economic cycles. Given these observations we will now continue our analysis of stock reactions for these two sectors only, and leave a bundle of interesting issues to further research.

To address the important question of whether stock reactions are sector dependent, we further analyze how text sentiment affects future volatility, trading volume and returns. To do so we employ the panel regression described in (4)-(5) and report the results in Table 13. The variable $I_{i,t}$ indicates the arrival of articles on this sector. Contingent on arriving articles, all three sentiment projections in the financial sector yield significant and positive effects of the negative proportions on future log volatility, meaning that an increase in negative text sentiment results in higher volatility. The exclusive response to negative sentiment in the financial sector is indeed in line with our evidence for the entire panel. In the health care sector, however, the estimates are not statistically significant, so no such claim can be made. Potentially, investor inattention to the health care sector may cause a significant mispricing of its stocks.
Investors possibly neglect the news on this sector posted on social media, or information in this sector diffuses slowly, which could lead to a delayed reaction.

[Table 13 body: for each lexicon (BL, LM, MPQA) the columns are $I_{i,t}$, $Pos_{i,t}$ and $Neg_{i,t}$; the rows are Financials and Health Care within Panel A (future volatility $\log \sigma_{i,t+1}$), Panel B (future detrended log trading volume $V_{i,t+1}$) and Panel C (future returns $R_{i,t+1}$). The coefficient estimates and standard errors are not recoverable from the extracted text.]
*** refers to a p value less than 0.01, ** refers to a p value greater than or equal to 0.01 and less than 0.05, and * refers to a p value greater than or equal to 0.05 and less than 0.1. Values in parentheses are standard errors.

Table 13: Sector Analysis: The Impact on Future Volatility, Trading Volume and Returns

Trading volume is another stock reaction that we may attribute to text sentiment. Using the BL and the MPQA projection methods, we find that the arrival of articles brings relevant information and therefore stimulates trading volume. It is interesting to note that, contingent on arriving articles, the negative sentiment distilled using the BL and LM methods is significantly positively related to stock returns on the next trading day. To investigate the reason for this, we also ran a contemporaneous regression for the financials sector (results not shown) and found a significantly negative impact of the negative sentiment distilled using the BL and MPQA methods on contemporaneous returns $R_{i,t}$, with coefficients about twice the size of those in the lagged regression in Table 13. This might suggest that market participants monitor financial companies quite carefully and overreact in case of bad news. On the next day, the participants fully recognize the scope of the news and reverse part of their prior decisions, and hence negative sentiment on trading day $t$ has a positive impact on returns on trading day $t + 1$. This is also in line with the finding in Kuhnen (2015), which suggests that being in a negative domain leads people to form overly pessimistic beliefs about stocks. After the 2008 financial crisis and the bankruptcy of some major financial companies, this might be the case for the financials sector.

From these analyses, we learn that investors indeed pay different amounts of attention to the sectors that interest them, and that their attention effectively governs the variation of the equities. Attention constraints in some sectors may affect investors' trading decisions and the speed of price adjustments.

In this paper, to analyze the reaction of stocks' future log volatility, future detrended log trading volume and future returns to social media news, we distill sentiment measures from news using two general-purpose lexica (BL and MPQA) and a lexicon specifically designed for financial applications (LM). We demonstrate that these sentiment measures carry incremental information for future stock reactions. Such information varies across lexical projections, across groups of stocks that attract different levels of attention, and across different sectors. Positive and negative sentiments also have asymmetric impacts on future stock reaction indicators. A detailed summary of the results is given in Table 14 in the Supplementary Material. There is no definite picture of which lexicon is the best. This is an important contribution of our paper to the line of research on textual analysis for financial markets. In addition, the advanced statistical tools that we have utilized, including panel regression and confidence bands, are novel contributions to this line of research.
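The lagged panel regressions referred to above relate sentiment on trading day $t$ to a stock reaction on day $t + 1$. The following is a deliberately simplified sketch of that idea, assuming a single regressor and firm fixed effects removed by within-firm demeaning; the specification (4)-(5) contains further terms (article arrival, positive sentiment, control variables) that are not reproduced here. The function names, firm labels and data are hypothetical.

```python
def lagged_pairs(series):
    """Pair sentiment on day t with the reaction on day t+1 for one firm.

    series: list of (sentiment_t, reaction_t), ordered by trading day.
    """
    return [(series[k][0], series[k + 1][1]) for k in range(len(series) - 1)]


def within_slope(panel):
    """Fixed-effects (within) slope of y on x.

    panel: dict mapping firm -> list of (x, y) pairs.
    Demeaning x and y within each firm removes the firm-specific intercept.
    """
    sxy = sxx = 0.0
    for obs in panel.values():
        mean_x = sum(x for x, _ in obs) / len(obs)
        mean_y = sum(y for _, y in obs) / len(obs)
        for x, y in obs:
            sxy += (x - mean_x) * (y - mean_y)
            sxx += (x - mean_x) ** 2
    if sxx == 0.0:
        raise ValueError("no within-firm variation in the regressor")
    return sxy / sxx


# Hypothetical daily data per firm: (negative proportion, log volatility).
# Constructed so that next-day log volatility = firm level + 2 * negative proportion.
raw = {
    "AAA": [(0.010, -4.00), (0.030, -3.98), (0.020, -3.94), (0.015, -3.96)],
    "BBB": [(0.020, -5.00), (0.005, -4.96), (0.025, -4.99), (0.010, -4.95)],
}
panel = {firm: lagged_pairs(series) for firm, series in raw.items()}
print(round(within_slope(panel), 3))
```

Running the same estimator on unlagged pairs gives the contemporaneous counterpart, which is the comparison used in the overreaction argument for the financials sector.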

References

Alizadeh, S., Brandt, M. W., and Diebold, F. X. (2002). Range-based estimation of stochastic volatility models. The Journal of Finance, 57(3):1047–1091.

Andersen, T. G. and Bollerslev, T. (1997). Intraday periodicity and volatility persistence in financial markets. Journal of Empirical Finance, 4(2-3):115–158.

Andersen, T. G., Bollerslev, T., Diebold, F. X., and Ebens, H. (2001). The distribution of realized stock return volatility. Journal of Financial Economics, 61(1):43–76.

Antweiler, W. and Frank, M. Z. (2004). Is all that talk just noise? The information content of internet stock message boards. The Journal of Finance, 59(3):1259–1294.

Arellano, M. (1987). Computing robust standard errors for within-groups estimators. Oxford Bulletin of Economics and Statistics, 49(4):431–34.

Bekaert, G. and Wu, G. (2000). Asymmetric volatility and risk in equity markets. Review of Financial Studies, 13:1–42.

Black, F. (1976). Studies of stock price volatility changes. In Proceedings of the 1976 Meetings of the American Statistical Association, Business and Economic Statistics Section, American Statistical Association, pages 177–181.

Bollen, J., Mao, H., and Zeng, X. (2011). Twitter mood predicts the stock market. Journal of Computational Science, 2(1):1–8.

Bondt, W. F. M. D. and Thaler, R. (1985). Does the stock market overreact? Journal of Finance, 40(3):793.

Brown, K. C., Harlow, W., and Tinic, S. M. (1988). Risk aversion, uncertain information, and market efficiency. Journal of Financial Economics, 22(2):355–385.

Cao, H. H., Coval, J. D., and Hirshleifer, D. A. (2001). Sidelined investors, trading-generated news, and security returns. Dice Working Paper No. 2000-2.

Chen, G. M., Firth, M., and Rui, O. M. (2001). The dynamic relation between stock returns, trading volume, and volatility. The Financial Review, 36(3):153–174.

Chen, G. M., Firth, M., and Rui, O. M. (2002). The dynamic relationship between stock returns and trading volume: Domestic and cross-country evidence. Journal of Banking and Finance, 36(3):51–78.

Chen, H., De, P., Hu, Y. J., and Hwang, B.-H. (2014). Wisdom of crowds: The value of stock opinions transmitted through social media. Review of Financial Studies, 27(5):1367–1403.

Chen, L., Lakonishok, J., and Swaminathan, B. (2007). Industry classifications and return comovement. Financial Analysts Journal, 63:56–70.

Chen, Y. H. C., Bommes, E., Härdle, W. K., and Zhang, J. (2015). News and big news: a text sentiment analysis for GICS specific stock reactions. SFB 649, discussion paper.

Chen, Z., Daigler, R. T., and Parhizgari, A. M. (2006). Persistence of volatility in futures markets. Journal of Futures Markets, 26:571–594.

Christie, A. A. (1982). The stochastic behavior of common stock variance. Journal of Financial Economics, 10:407–432.

Clark, P. K. (1973). A subordinated stochastic process model with finite variance for speculative prices. Econometrica: Journal of the Econometric Society, pages 135–155.

Copeland, T. E. (1976). A model of asset trading under the assumption of sequential information arrival. Journal of Finance, 41:135–156.

Copeland, T. E. (1977). A probability model of asset trading. Journal of Financial and Quantitative Analysis, 12:563–578.

Da, Z., Engelberg, J., and Gao, P. (2011). In search of attention. The Journal of Finance, 66(5):1461–1499.

Das, S. and Chen, M. (2007). Yahoo! for Amazon: Sentiment extraction from small talk on the web. Management Science, pages 1375–1388.

Di Clemente, A. and Romano, C. (2004). Measuring and optimizing portfolio credit risk: A copula-based approach. Economic Notes, 33(3):325–357.

Engle, R. F. (2004). Risk and volatility: economic models and financial practice. The American Economic Review, 94:405–420.

Fama, E. F. and French, K. R. (1997). Industry costs of equity. Journal of Financial Economics, 43:153–193.

Feunou, B. and Tédongap, R. (2012). A stochastic volatility model with conditional skewness. Journal of Business and Economic Statistics, 30:576–591.

Frees, E. W. and Valdez, E. A. (1998). Understanding relationships using copulas. North American Actuarial Journal, 2(1):1–25.

Garman, M. B. and Klass, M. J. (1980). On the estimation of security price volatilities from historical data. The Journal of Business, 53(1):67–78.

Girard, E. and Biswas, R. (2007). Trading volume and market volatility: developed versus emerging stock markets. Financial Review, 42(3):429–459.

Glosten, L. R., Jagannathan, R., and Runkle, D. E. (1993). Relationship between the expected value and the volatility of the nominal excess return on stocks. The Journal of Finance, 48(5):1779–1801.

Groß-Klußmann, A. and Hautsch, N. (2011). When machines read the news: Using automated text analytics to quantify high frequency news-implied market reactions. Journal of Empirical Finance, 18(2):321–340.

Hong, H. and Kubik, J. D. (2003). Analyzing the analysts: Career concerns and biased earnings forecasts. The Journal of Finance, 58(1):313–351.

Hong, H. and Stein, J. C. (1999). A unified theory of underreaction, momentum trading, and overreaction in asset markets. The Journal of Finance, 54(6):2143–2184.

Hong, H., Torous, W., and Valkanov, R. (2007). Do industries lead stock markets? Journal of Financial Economics, 83:367–396.

Hotta, L. K., Lucas, E. C., and Palaro, H. P. (2006). Estimation of VaR using copula and extreme value theory. SSRN Electronic Journal.

Hu, M. and Liu, B. (2004). Mining and summarizing customer reviews. Pages 168–177.

Kahneman, D. (1973). Attention and Effort. Prentice-Hall, Englewood Cliffs, NJ.

Kuhnen, C. M. (2015). Asymmetric learning from financial information. The Journal of Finance, 70(5):2029–2062.

Liu, B. (2012). Sentiment Analysis and Opinion Mining. Morgan and Claypool.

Loader, C. (1999). Local Regression and Likelihood. Statistics and Computing. Springer, New York.

Loader, C. and Sun, J. (1997). Robustness of tube formula based confidence bands. Journal of Computational and Graphical Statistics, 6:242.

Loughran, T. and McDonald, B. (2011). When is a liability not a liability? Textual analysis, dictionaries, and 10-Ks. The Journal of Finance, 66(1):35–65.

Merton, R. C. (1987). A simple model of capital market equilibrium with incomplete information. The Journal of Finance, 42:483–510.

Patton, A. J. (2006). Modelling asymmetric exchange rate dependence. International Economic Review, 47(2):527–556.

Peng, L. and Xiong, W. (2006). Investor attention, overconfidence and category learning. Journal of Financial Economics, 80:563–602.

Petersen, M. A. (2009). Estimating standard errors in finance panel data sets: Comparing approaches. Review of Financial Studies, 22(1):435–480.

Ruppert, D., Sheather, S. J., and Wand, M. P. (1995). An effective bandwidth selector for local least squares regression. Journal of the American Statistical Association, 90:1257–1270.

Shu, J. and Zhang, J. E. (2006). Testing range estimators of historical volatility. Journal of Futures Markets, 26:297–313.

Si, J., Mukherjee, A., Liu, B., Li, Q., Li, H., and Deng, X. (2013). Exploiting topic based Twitter sentiment for stock prediction. In ACL (2), pages 24–29.

Sims, C. A. (2003). Implications of rational inattention. Journal of Monetary Economics, 50:665–690.

Sprenger, T. O., Tumasjan, A., Sandner, P. G., and Welpe, I. M. (2014). Tweets and trades: the information content of stock microblogs. European Financial Management, 20(5):926–957.

Sun, J. and Loader, C. (1994). Simultaneous confidence bands for linear regression and smoothing. The Annals of Statistics, pages 1328–1345.

Tetlock, P. C. (2007). Giving content to investor sentiment: The role of media in the stock market. The Journal of Finance, 62(3):1139–1168.

Truyens, M. and Eecke, P. V. (2014). Legal aspects of text mining. In Calzolari, N. (Conference Chair), Choukri, K., Declerck, T., Loftsson, H., Maegaard, B., Mariani, J., Moreno, A., Odijk, J., and Piperidis, S., editors, Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14), Reykjavik, Iceland. European Language Resources Association (ELRA).

Wang, G., Wang, T., Wang, B., Sambasivan, D., Zhang, Z., Zheng, H., and Zhao, B. Y. (2014). Crowds on Wall Street: Extracting value from social investing platforms. Working Paper.

Wilson, T., Wiebe, J., and Hoffmann, P. (2005). Recognizing contextual polarity in phrase-level sentiment analysis. Proceedings of HLT-EMNLP-2005.

Zhang, X., Fuehres, H., and Gloor, P. A. (2012). Predicting asset value through Twitter buzz. In Advances in Collective Intelligence 2011, pages 23–34. Springer.

Zhang, Y. and Swanson, P. E. (2010). Are day traders bias free? Evidence from internet stock message boards. Journal of Economics and Finance, 34(1):96–112.

Supplementary Material

Table 14 summarizes all the results from the entire panel sample analysis, the attention analysis and the sector analysis. Take the "BL" row in Panel A as an example. The arrival of articles ($I_{i,t}$) and the positive sentiment distilled using the BL method ($Pos_{i,t}$) have no significant impact on future volatility $\log \sigma_{i,t+1}$ in the entire sample analysis, the attention analysis or the sector analysis. The negative sentiment distilled using the BL method ($Neg_{i,t}$) is significantly positively related to future volatility in the entire sample analysis and for the "Extremely High" group in the attention analysis, and is significantly negatively related to future volatility for the "Health Care" sector in the sector analysis.

Lexicon Type of Analysis I i,t Pos i,t

Neg i,t