A Machine Learning Based Regulatory Risk Index for Cryptocurrencies
Xinwen Ni* a, Wolfgang Karl Härdle a,b,c,d,f, and Taojun Xie g
a School of Business and Economics, Humboldt-Universität zu Berlin, Berlin, Germany
b Sim Kee Boon Institute for Financial Economics, Singapore Management University, Singapore
c W.I.S.E. - Wang Yanan Institute for Studies in Economics, Xiamen University, Fujian, China
d Department of Probability and Mathematical Statistics, Faculty of Mathematics and Physics, Charles University, Prague, Czech Republic
f Department of Information Management and Finance, National JiaoTong University, Taiwan
g Asia Competitiveness Institute, Lee Kuan Yew School of Public Policy, National University of Singapore, Singapore
October 26, 2020
Abstract
Cryptocurrencies' values often respond aggressively to major policy changes, but none of the existing indices informs on the market risks associated with regulatory changes. In this paper, we quantify the risks originating from new regulations on FinTech and cryptocurrencies (CCs), and analyse their impact on market dynamics. Specifically, a Cryptocurrency Regulatory Risk IndeX (CRRIX) is constructed.

Keywords:
Cryptocurrency, Regulatory Risk, Index, LDA, News Classification
JEL classification:
C45, G11, G18

* The authors gratefully acknowledge financial support from the Deutsche Forschungsgemeinschaft through the International Research Training Group IRTG 1792 "High Dimensional Non Stationary Time Series", the Yushan Scholar Program, the Czech Science Foundation under grant no. 19-28231X, as well as the European Union's Horizon 2020 research and innovation program "FIN-TECH: A Financial supervision and Technology compliance training programme" under the grant agreement No 825215 (Topic: ICT-35-2018, Type of action: CSA), Humboldt-Universität zu Berlin.

Introduction
Today, there are nearly 2,500 cryptocurrencies worth more than $252.5 billion trading in the market (Dolan, 2020). The original boom of cryptocurrencies occurred in an unregulated environment. Even as news outlets and investors paid closer attention to the market, regulators and international actors remained largely distant from the action, and prices continued to soar unabated. However, the situation has changed since 2014.

Regulations are designed to protect investors, to put a stop to money laundering, or to prevent fiat currency from being crowded out. Despite these good intentions, speculation about and implementation of regulations have resulted in volatile price movements in the cryptocurrency markets. Recent incidents, including China's ban on cryptocurrency exchanges and rumours of Korea doing the same, have caused major sell-offs and losses among investors. It is therefore important to identify the extent to which new regulations, and speculation about them, have affected the cryptocurrency markets. Ignoring this source of risk, regulators could end up with self-defeating outcomes and thus create a systemic risk bias.

In this paper, we aim to quantify the risks originating from introducing regulations in the Cryptocurrency (CC) markets and identify their impact on cryptocurrency investments. To measure regulatory risk, particularly the effects of regulations, some researchers have considered event-study methods (Schwert, 1981; Binder, 1985; Buckland and Fraser, 2001, among others). However, the cryptocurrency market is young and so different from other financial markets that a previous regulatory event, such as a particular country banning the market, may not occur again. Therefore, we need a measurement tool that represents the risk level, allows comparison, and tracks changes over time. An index matches all those requirements.

Indices have already been applied to track the cryptocurrency markets.
The CC index developed in Trimborn and Härdle (2018), known as CRIX, is a benchmark and tracks the price movements in the CC markets on a daily basis (seen in crix.berlin or thecrix.de). A volatility measure, VCRIX (Kim et al., 2019), similar to VIX, is also presented there to reflect the market's volatility. The CCi30 is a rules-based index designed to objectively measure the overall growth, daily and long-term movement of the blockchain sector; it does so by tracking the 30 largest cryptocurrencies by market capitalization (seen in http://cci30.com). However, none of these indices addresses regulatory risk.

In this section, we first review the development of cryptocurrency in general. Based on that, we display the picture of trends of regulatory dynamics in the CC market. The research questions touch the identification of the regulatory risk and the construction of the regulatory risk index.
The history of digital or programmable monies can be traced back as far as four decades ago, although the word "cryptocurrency" became popular only recently. In the 1980s, the concept of E-cash was introduced by David Chaum in a paper entitled "Blind Signatures for Untraceable Payments" (Fiorillo, 2018). Subsequently, DigiCash, B-Money, and Bit Gold were proposed by Chaum, Wei Dai, and Nick Szabo in the 1990s, respectively. Most of these digital monies did not survive as they failed to address the practical issues of "double spending" and "third-party trust".

The milestone development in this field came after the global financial crisis. Nakamoto, in the seminal 2008 white paper, proposed Bitcoin. This is a peer-to-peer electronic cash system implemented via the blockchain technology, with the participation of a network of computer owners known as miners. The blockchain technology, later known as the distributed ledger technology, or DLT, ensures that transaction records are easy to update but costly to change, avoiding the "double spending" issue. Miners, after solving complex mathematical puzzles, are rewarded with a predetermined amount of bitcoins. The amount of the reward can only be amended with the agreement of a majority of the miners. Such a mechanism avoids the "third-party trust" problem in fiat monies, whose issuance depends on the central banks' sole discretion. Since its birth, the BTC model has defined the meaning of "cryptocurrency", which now typically refers to a decentralized digital network that facilitates secured transactions using cryptographic methods.

Bitcoin has sparked a series of events since 2009. In Figure 1, we show a timeline of a few major events in the last decade. Some events pertained to innovations that aimed at improving the technology behind cryptocurrencies (highlighted in blue). For example, competing CCs, also known as altcoins, began to emerge in 2011.
The events highlighted in yellow are the ones involving CCs being used in legitimate real-life applications. A well-known example was the two pizzas in 2010 that cost 10,000 bitcoins. Through this series of events, the general public became aware of the strengths and weaknesses of CCs.

Figure 1: Cryptocurrency timeline

As good and bad news took turns to be reported, the price of bitcoin, highlighted in gray, and that of the other CCs also experienced volatile movements. Notably, events relating to losses at cryptocurrency exchanges have been associated with the largest price movements. For instance, when Mt. Gox went bankrupt in 2014 after losing over 850,000 bitcoins, the price of bitcoin fell from over $1,000 to around $400. This event had been foreseeable through textual analysis of BTC blogs and other solid media channels. Linton et al. (2017) and chapter 3 of Härdle et al. (2017) employ a technique similar to ours to evaluate discussions in social media. A recent price movement was an upswing of 1300% in the year 2017, followed by a fall of more than half in May 2018.
Among the features of cryptocurrencies, anonymity has been the most controversial one. Users of cryptocurrencies like this feature because it makes it difficult to trace one's spending history, but regulators dislike it for the exact same reason. We thus see an interesting interaction here. Users have proposed numerous improvements to enhance anonymity. Zcash and Monero were designed to facilitate anonymous transactions. At the same time, incidents such as the Silk Road going live and terrorists using cryptocurrencies for remittance kept the regulators on their toes. Regulators, on the one hand, insisted on know-your-customer (KYC) measures to trace any illegitimate transactions, but on the other hand, prepared to launch their own cryptocurrencies (Barrdear and Kumhof, 2016; George et al., 2020). Fighting against illegitimate transactions became one of the first tasks for the cryptocurrency regulators.

Trading activity at the exchanges was the next issue that regulators reacted to. The anonymity feature and a lack of regulation at the cryptocurrency exchanges cultivated illegal and unethical behaviors, such as money laundering, pump-and-dump activities and scams. Ignorant users of the cryptocurrency exchanges faced high risks while trading. Responding to these, starting in 2011, the US Treasury Department's Financial Crimes Enforcement Network (FinCEN) began oversight of cryptocurrency exchanges, transmitters, and administrators under the Bank Secrecy Act related to anti-money laundering and combating the financing of terrorism (AML/CFT) (Lee and Deng, 2018). In the same year, the United States Department of Homeland Security (DHS) initiated its first investigation relating to cryptocurrency. The number of cases rose to over 200 in 2017 (seen in Figure 1).

However, there has been a dilemma in regulating the cryptocurrency space.
While it was important to protect retail investors and to prevent unlawful transactions, the new technology driving the development of cryptocurrencies needed to be incubated until the ecosystem matured. It was then imperative for the regulators to move towards systematic governance. In Figure 2, we show a timeline of countries' publications of guidance on the cryptocurrency space.

Figure 2: Timeline for Cryptocurrency Guidance

In order to apply the existing laws and regulations, the very first step for the regulators is to identify the nature of cryptocurrency as a means of holding, transferring or investing "money". Since the birth of cryptocurrency, the debate on whether BTC, or generally cryptocurrencies, are currency or asset has evolved with the growth of the market (Glaser et al., 2014; Baur et al., 2018). Despite using "currency" as the name, most countries that have permitted cryptocurrencies view these monies as assets. In 2014, the US Internal Revenue Service (IRS) stipulated that virtual currency is considered property for the purpose of federal tax. Purchases using cryptocurrencies as the media of exchange are considered barter trades, or exchanges between properties and services (Blandin et al., 2019). The Swiss Financial Market Supervisory Authority (FINMA), in 2018, published guidelines for initial coin offerings (ICOs). In those guidelines, tokens were categorised into three types based on their economic functions: payment tokens, utility tokens and asset tokens (Caytas, 2018).

Regulatory responses to cryptocurrencies vary from strict bans to government adoption. Most of these regulations were designed to protect retail investors or to ensure that cryptocurrencies are not used for illegal activities. The Financial Action Task Force (FATF) recommended that regulations be implemented to prevent the use of cryptocurrencies in money laundering and terrorist finance (Gold and McBride, 2019).
At the G20 Summit in 2018, the FATF urged all countries to take necessary preventive measures against the misuse of cryptocurrencies. In the European Union, by 2020, all member states will apply AML/CFT rules to cryptocurrency exchanges and wallet operators (Houben and Snyers, 2018). On the contrary, some countries began to change their attitude towards the adoption of DLT. In China, the use of BTC as currency by financial institutions was prohibited in 2013 (Glaser et al., 2014), but in October 2019, the president of China announced that the country would encourage enterprises to seize the opportunity in the up-and-coming technology. Subsequently, the People's Bank of China announced that it would launch a digital Yuan, commonly known as a central bank digital currency (Zhong, 2019). Such support for DLT led to a new round of debate globally.

This disparity in regulatory approaches creates interesting dynamics in the cryptocurrency markets. As pointed out earlier, good and bad news is likely to induce different movements in the price of cryptocurrencies. As more regulations are on their way, and because the cryptocurrency market is globally unified, policy changes in one country, or even a rumor about the attitude adjustment of one government, would be widely discussed in the market.

AML/CFT: Anti-money laundering / combating the financing of terrorism.
In most cases, regulation changes do not come out of nowhere. A good example is the Fed's rate cut (or hike). Before the Fed's announcement of changes, there are discussions and predictions about the Fed's move in the newspapers. Numerous studies have attempted to examine whether text mining technology can contribute to forecasting for financial markets (Nassirtoussi et al., 2014), e.g. Ghiassi et al. (2013), Geva and Zahavi (2014), Tu and Härdle (2018). But not many use textual data to analyze financial regulatory risk. Gulen and Ion (2016), Baker et al. (2016) and Kang and Ratti (2013) argue that news from newspapers can be a good indicator of macro policy uncertainty.

As discussed before, indices have been introduced to trace the movement of the CC market, but none of them addresses regulatory risk, which plays an important role for the future of CCs. In this paper, we try to quantify the risks brought by introducing policies in the CC market and further discuss their impact on CC investment. We construct a regulatory risk index for the CC market, which can serve as a tool for passive investors, for fund managers, and even for policy-makers.

There are mainly three kinds of indices with respect to the data sources applied to construct them. First, and most commonly, some indices use real market data, e.g. VIX, S&P 500, and DAX, which employ actual market price or volume data. Second, some are based on a regular survey, e.g. the IFO Business Climate Index and the Purchasing Managers' Index (PMI), which relies on a monthly survey of supply chain managers. Recently, the third source, news data, or generally speaking text data, has become popular, e.g. the Thomson Reuters MarketPsych Indices, sentiment indices, which are standard input to trading desks.

Research Question 1
How to identify the regulatory risk for Cryptocurrencies?
Research Question 2
How to construct an index of regulatory risk for the Cryptocurrency market based on news data?
Research Question 3
What is the impact of regulatory risk on the market?
Since CCs have been frequently mentioned in newspapers only in very recent years, we choose to use news data from top online cryptocurrency news platforms (Guides, 2018). The representative news platforms Coindesk and Bitcoin Magazine were considered in this paper, because both of them are not only pioneers and leaders in the market but also offer news data that traces back to the beginning of BTC's boom.

Coindesk is a news website with a particular focus on Blockchain, Bitcoin and Cryptocurrencies as a whole. The site launched in April 2013 and has released close to 25,000 articles. The articles have already been classified into categories: markets, technology, business, policy & regulation, and people. The aim of this paper is to introduce a policy uncertainty index by calculating the frequency of policy-related news. The pre-classified news data from Coindesk perfectly matches our demand and will be further applied as training data for the ML models. The textual data from the source was collected via a dynamic web scraper.

After checking for duplicates, we eventually keep 16,528 articles from 01 April 2013 to 18 July 2019. The data covers 76 months, 329 weeks and 2,300 days. Out of the total of over 16,000 articles, 2,468 are marked as policy-related news. The data is available for further research at the Blockchain Research Center (BRC) and on Quantlet.de.

Figure 3 represents the average number of news items per week related to policy and the average number of all news items. The number of daily articles increased in 2014 and 2018, both in total and in regulation-related terms. In those years, the price of Bitcoin underwent volatile movements, declining by more than 70% both in 2014 and in 2018. Before these massive corrections, the ends of 2013 and 2017 marked periods of price discovery, with all-time highs in USD valuation being broken every other day.
During the same periods, the number of blockchain-related news articles increased, indicating growing interest in blockchain and distributed ledger technology. There is no doubt that, simultaneously, the market attracted policy-makers' strong attention as well.

Bitcoin Magazine (https://bitcoinmagazine.com) is another leading and pioneering platform supplying information on the new market. It was founded in Feb 2012, one year earlier than Coindesk. There are fewer articles than on Coindesk.

BRC: https://hu.berlin/BRC

The CRIX and VCRIX are chosen to represent the value and the volatility of the entire cryptocurrency market for the later analysis. The CRIX (CRyptocurrency IndeX), created by Trimborn and Härdle (2018), closely tracks the performance of the entire cryptocurrency market. Its construction is robust in the sense that it takes into account the dynamics of market structure, thus ensuring the representativity and the tracking performance of the index. It follows that the constituents of CRIX change over time, depending on market conditions and the relative dominance of CCs. The CRIX series begins in July 2014, and is available through thecrix.de. Reallocation of the CRIX happens on a monthly and quarterly basis. It adopts a liquidity rule when incorporating a certain cryptocurrency into CRIX, and hence guarantees the tradability of CRIX, which is good for ETFs and traders. CRIX has been widely investigated in the pioneering research on cryptocurrencies, including Hafner (2020), Klein et al. (2018), Trimborn et al. (2018), and da Gama Silva et al. (2019).

Like VIX or VDAX, which provide a measure of implied volatility, VCRIX, created by Kim et al. (2019), is a volatility index able to grasp the risk in the cryptocurrency market. This index accurately addresses the market dynamics on the basis of CRIX and thus proved to be a proper basis for option pricing. Similar to CRIX, the VCRIX data can also be downloaded from thecrix.de.
Based on the rich text corpus, one can now enter the machine learning text mining step to separate policy-related news from the rest. In this paper, the classification problem is simply binary: policy-related or not. In the literature, SVM is widely applied to solve this kind of binary problem. However, this method does not perform well in imbalanced cases, whilst our target is to classify the regulatory news (a very small subgroup) against all others. In our training data, the ratio is . Indeed, there are multiple ways to address the imbalanced-data problem, such as oversampling or class-weighted SVM assigning higher misclassification penalties. But the pre-processing of oversampling or undersampling changes the distribution of labels and further changes the distribution of the test data. Our index is constructed based on the frequency of regulatory news and is therefore sensitive to the distribution of classes.

On the other hand, we can assume that when policy-related topics are discussed, similar topics with their key words are used and their distributions are close. Based on that assumption, the Latent Dirichlet Allocation (LDA) method can be employed to analyze the topic distribution and word distribution of the corpus and further identify the policy-related articles via similarity calculation.
The Latent Dirichlet Allocation (LDA) technique, proposed by Blei et al. (2003), is an unsupervised machine learning algorithm that learns the unobserved topics of a corpus (individual news articles in this paper). This technique is widely applied to establish the thematic structure of text and other discrete data in the linguistic, information retrieval, biological and even engineering literatures (see Blei, 2012 for a review of topic modelling and its application to various text collections).

The LDA technique is based on a generative statistical method to identify the distribution of words that contribute to a topic, while simultaneously constructing documents with different probabilities of topics, meaning that each topic z is annotated with a collection of the most probable words w, and each document d is annotated with a collection of the most probable topics z. It is an unsupervised algorithm which requires no labeled texts and learns these two latent (unobserved) distributions p(w | z) and p(z | d) by acquiring model parameters that maximize the probability of each word appearing in each document, with the number of topics K given.

Then, with the Bayes theorem, the probability of the observed word w_n appearing in document d_m is given by:

p(d_m, w_n) = p(d_m) p(w_n | d_m)                                      (1)
            = p(d_m) \sum_{k=1}^{K} p(w_n | z_k) p(z_k | d_m)          (2)

where z_k is a latent variable indicating the k-th topic from which the words were drawn (Z in Figure 4), p(w_n | z_k) is a distribution for each topic over the vocabulary (φ in Figure 4), and p(z_k | d_m) denotes the topic proportions for the m-th document (article in this paper) (θ in Figure 4). Intuitively, φ indicates which words weigh more in a topic, while θ states the importance of those topics to a document.

Figure 4: Graphic LDA Model

Both θ and φ follow the Dirichlet distribution with hyper-parameters α and β respectively. With higher α, the topic distribution per article turns out to be more specific, while similarly, higher β leads to a more specific word distribution per topic. In general, α links to the similarity of documents, meaning that a higher alpha value implies that documents are embodied by more similar weights of each topic. The same holds for β, meaning that a higher beta value indicates that topics contain more similar weights of each word. In the Python package gensim, the symmetric or asymmetric hyper-parameters are learned from the data. The generative process of LDA is based on the following joint distribution of the observed variables w and the unobserved variables z, θ and φ, given α and β:

p(w, z, θ, φ; α, β) = \prod_{k=1}^{K} p(φ_k; β) \prod_{d=1}^{M} p(θ_d; α) \prod_{n=1}^{N} p(z_{d,n} | θ_d) p(w_{d,n} | φ_{z_{d,n}}),    (3)

Number of Topics for LDA

The standard LDA model proposed by Blei et al. (2003) has a significant weakness: it requires pre-determination of the number of topics, meaning that users must set the number of unobserved topics manually before applying the method. The quality of the LDA model depends heavily on the choice of the topic number.
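The generative process of equation (3) can be illustrated numerically. The following is a minimal numpy sketch with toy sizes and hyper-parameters, not the paper's gensim estimation code:

```python
import numpy as np

rng = np.random.default_rng(0)

K, M, V, N = 3, 5, 10, 8   # topics, documents, vocabulary size, words per document
alpha, beta = 0.1, 0.01    # Dirichlet hyper-parameters

# phi_k ~ Dirichlet(beta): word distribution of each topic (K x V)
phi = rng.dirichlet(np.full(V, beta), size=K)
# theta_d ~ Dirichlet(alpha): topic proportions of each document (M x K)
theta = rng.dirichlet(np.full(K, alpha), size=M)

docs = []
for d in range(M):
    words = []
    for _ in range(N):
        z = rng.choice(K, p=theta[d])   # draw topic z_{d,n} ~ Cat(theta_d)
        w = rng.choice(V, p=phi[z])     # draw word  w_{d,n} ~ Cat(phi_{z_{d,n}})
        words.append(w)
    docs.append(words)

# marginal word probability per document, as in equation (2):
# p(w | d) = sum_k p(w | z_k) p(z_k | d); each row sums to 1
p_w_given_d = theta @ phi
```

Note that K must be fixed in advance before anything can be generated or estimated, which is exactly the weakness discussed above.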
Therefore, many believe that choosing the best value for the topic number is more art than science (Azqueta-Gavaldón, 2017).

A common solution is to plug in a set of values and pick the optimal topic number either based on some intrinsic criterion, such as the coherence of the topics, or based on some extrinsic criterion, such as accuracy on a specific task, e.g. paraphrase identification. There also exist other methods to help with choosing the number of topics. For example, nonparametric Bayesian models, e.g. the Hierarchical Dirichlet Process, are employed to automatically generate the number of topics (Teh et al., 2004). However, it is computationally inefficient to apply such nonparametric models to LDA (Wallach et al., 2009).

Coherence measures, which are based on word co-occurrence, are widely applied to quantify the quality of topic models. Poor-quality topics of the "chained", "intruded" and "random" type can be detected with coherence measures (Mimno et al., 2011). Newman et al. (2010) proposed a coherence measure that is comparable to the human rating of topics. Their coherence measure (C_UCI) takes the set of the top J words (w_1, ..., w_J) for a given topic and sums a confirmation measure over all word pairs. The function is given as follows:

C_UCI = \frac{2}{J(J-1)} \sum_{i=1}^{J-1} \sum_{j=i+1}^{J} \log \frac{P(w_i, w_j) + \epsilon}{P(w_i) \cdot P(w_j)}    (4)

where the probabilities are estimated on Wikipedia, which is used as an external reference corpus. Mimno et al. (2011) employ an asymmetrical confirmation measure between top word pairs in the calculation of the coherence C_UMass:

C_UMass = \frac{2}{J(J-1)} \sum_{i=2}^{J} \sum_{j=1}^{i-1} \log \frac{P(w_i, w_j) + \epsilon}{P(w_j)}    (5)

Unlike C_UCI, the probabilities in equation (5) are estimated on the original corpus used to train the topic models. Röder et al. (2015) build a coherence framework and report a measure (C_v) with the best performance.
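As an illustration of such coherence measures, equation (5) (C_UMass) fits in a few lines of plain Python; the toy corpus below is hypothetical, and in practice the coherence values are computed by a library such as gensim:

```python
import math

def c_umass(top_words, docs, eps=1e-12):
    """Topic coherence C_UMass (equation 5): document co-occurrence
    probabilities P(.) are estimated on the training corpus itself."""
    n_docs = len(docs)
    doc_sets = [set(d) for d in docs]

    def p(*ws):
        # fraction of documents containing all words in ws
        return sum(all(w in s for w in ws) for s in doc_sets) / n_docs

    J = len(top_words)
    total = 0.0
    for i in range(1, J):        # i = 2..J in the paper's 1-based indexing
        for j in range(i):       # j = 1..i-1
            w_i, w_j = top_words[i], top_words[j]
            total += math.log((p(w_i, w_j) + eps) / p(w_j))
    return 2.0 / (J * (J - 1)) * total

# toy corpus: "coin" and "tax" co-occur often, so that pair is more coherent
docs = [["coin", "tax", "law"], ["coin", "tax"], ["coin", "mining"], ["tax", "law"]]
coherent = c_umass(["coin", "tax"], docs)      # log((2/4)/ (3/4)) = log(2/3)
incoherent = c_umass(["coin", "law"], docs)    # log((1/4) / (3/4)) = log(1/3)
```

Word pairs that appear together in many documents score closer to zero; rarely co-occurring pairs drive the score down, flagging a low-quality topic.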
Different from C_UMass and C_UCI, C_v defines the confirmation using normalized point-wise mutual information (NPMI) for the j-th element of the context vector \vec{v}_i of word w_i:

v_{ij} = NPMI(w_i, w_j)^γ = \left( \frac{\log \frac{P(w_i, w_j) + \epsilon}{P(w_i) \cdot P(w_j)}}{- \log (P(w_i, w_j) + \epsilon)} \right)^γ    (6)

where γ denotes the weight of the NPMI. In this paper, we use the coherence value C_v as the criterion to select the model.

Semantic similarity problems can be classified according to different levels of granularity, specifically ranging from word-to-word to sentence-to-sentence to document-to-document similarities (Niraula et al., 2013). In this paper, our task is to analyze document-to-document similarity, particularly as a binary decision problem in which an article is policy-related or not. We rely on one probabilistic method, LDA, which regards documents as distributions over topics and topics as distributions over words. So, we assume that policy-related articles have similar topic distributions.

The Hellinger distance can be applied to compute the distance between two distributions. For document p and document q, the distributions of topics are z_p = (z_{p,1}, ..., z_{p,k}, ..., z_{p,K}) and z_q = (z_{q,1}, ..., z_{q,k}, ..., z_{q,K}) respectively. The Hellinger distance d_H between those two news items with K topics is given as follows:

d_H(z_p, z_q) = \frac{1}{\sqrt{2}} \sqrt{\sum_{k=1}^{K} \left( \sqrt{z_{p,k}} - \sqrt{z_{q,k}} \right)^2}    (7)

There are reasons why we choose the Hellinger distance rather than other distances to calculate news similarity. First, if we denote \hat{f}(x) as a kernel density estimator, the asymptotic distribution of \sqrt{nh} (\hat{f}(x) - f(x)) depends on f(x), the true density; however, after taking the square root, the asymptotic distribution of \sqrt{nh} (\sqrt{\hat{f}(x)} - \sqrt{f(x)}) eliminates its dependency on f(x) (see the proof in the Appendix).
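Equation (7) translates directly into numpy; the topic proportions below are hypothetical:

```python
import numpy as np

def hellinger(p, q):
    """Hellinger distance between two topic distributions (equation 7).
    Bounded in [0, 1]: 0 for identical distributions, 1 for disjoint supports."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return np.sqrt(np.sum((np.sqrt(p) - np.sqrt(q)) ** 2)) / np.sqrt(2.0)

z_p = np.array([0.7, 0.2, 0.1])   # illustrative topic proportions over K = 3 topics
z_q = np.array([0.6, 0.3, 0.1])
z_r = np.array([0.0, 0.0, 1.0])
```

Symmetry and the [0, 1] bounds follow immediately from the formula, which is what makes the values easy to read and compare.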
Second, when applied in a binary decision criterion, the Hellinger distance is not sensitive to class skew (Cieslak and Chawla, 2008), implying that it performs well with imbalanced data. Besides, the Hellinger distance is bounded by [0, 1] for all values of z_{p,k} and z_{q,k}, which makes it easy to read and compare. The highest value, 1, indicates the maximal distance and therefore means that the two compared distributions differ from each other significantly, whereas the value 0 implies the highest similarity and shortest distance. Meanwhile, d_H is symmetric, meaning d_H(z_p, z_q) = d_H(z_q, z_p).

As the next step, we calculate the average distance between article i and all policy-related news:

\bar{d}_{l,i} = \frac{1}{N_r} \sum_{j=1}^{N_r} d_H(z_{l,i}, z_{r,j})    (8)

where N_r is the number of regulatory news items, z_{l,i} denotes the topic distribution of article i, and l = {r, non or u}, meaning that article i is regulatory news "r", non-regulatory news "non" or unclassified news "u". z_{r,j} represents the topic distribution of regulatory news item j (j = 1, ..., N_r).

Since we assume that regulatory news items have smaller distances between each other than the other news, the average distances \bar{d}_r = {\bar{d}_{r,1}, ..., \bar{d}_{r,N_r}} for all policy-related news should be relatively smaller than \bar{d}_{non} = {\bar{d}_{non,1}, ..., \bar{d}_{non,N_{non}}} for all non-policy-related news. Then, if \bar{d}_{u,i} of the unclassified article i is small and close to \bar{d}_r, we mark that news item as policy-related. In this paper, we set the threshold d equal to the τ-th quantile of \bar{d}_r, with τ = 0.95.

Construction of CRRIX

As mentioned before, the construction of CRRIX is simply the coverage frequency of policy-related news, as follows:
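The labelling rule built on equation (8), with the threshold at the τ-th quantile of the regulatory group's own mean distances, can be sketched as follows; the topic distributions and helper names are illustrative, not the paper's code:

```python
import numpy as np

def hellinger(p, q):
    # equation (7)
    return np.sqrt(np.sum((np.sqrt(p) - np.sqrt(q)) ** 2)) / np.sqrt(2.0)

def mean_dist(z_i, group):
    # equation (8): average Hellinger distance of article i to all regulatory news
    return np.mean([hellinger(z_i, z_j) for z_j in group])

def classify(z_unlabelled, z_reg, tau=0.95):
    """Label an article policy-related when its mean distance to the
    pre-identified regulatory news falls below the tau-quantile of d_bar_r."""
    d_bar_reg = np.array([mean_dist(z, z_reg) for z in z_reg])
    threshold = np.quantile(d_bar_reg, tau)
    labels = [bool(mean_dist(z, z_reg) <= threshold) for z in z_unlabelled]
    return labels, threshold

# hypothetical regulatory articles: all concentrated on the first topic
z_reg = np.array([[0.80, 0.10, 0.10],
                  [0.75, 0.15, 0.10],
                  [0.85, 0.10, 0.05]])
z_new = np.array([[0.78, 0.12, 0.10],    # close to the regulatory profile
                  [0.05, 0.05, 0.90]])   # far from it
labels, thr = classify(z_new, z_reg)
```

An article resembling the regulatory topic profile falls below the threshold and is labelled policy-related; a dissimilar one is not.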
CRRIX_{s,t} = \frac{N_{s,t,reg}}{N_{s,t,all}}    (9)

where s is the periodicity, s = {daily, weekly or monthly}, and N_{s,t,reg} and N_{s,t,all} are the numbers of regulatory news items and of all news items at time t.

First, we did some pre-processing of the data (words): stopwords are eliminated (words that do not contribute informatively or semantically to an article, e.g. "at", "or", "and"), and all words have been converted to lower case. We calculate the coherence value of LDA models for a range of topic numbers, given an automatically generated hyper-parameter α (α = 0.01) and a fixed β. Figure 5 indicates that the best-performing model, with the optimal number of topics for the corpus in this paper, is the one with K = 14. With K = 14, the model has the highest coherence value, and when the number of topics increases beyond the optimal choice, the coherence value becomes relatively stable. We also test the robustness with different α and β from 0.01 to 0.3. The above-mentioned combination performs best, but the coherence value does not change much for a given topic number.

Figure 5: Coherence value for different numbers of topics K

In the topic-correlation diagrams (K = 14), a red cell represents strongly uncorrelated topics, while a blue cell indicates high correlation. The left diagram was generated using the Jaccard distance, and for the right one, we apply the Hellinger distance to calculate the differences between topics. All elements except those on the diagonal in both diagrams are red or reddish, which means the 14 topics are relatively different from each other and our LDA model performs well in this respect. However, even though the Jaccard distance is robust and widely used in ML methodologies, it is less sensitive than the Hellinger distance. Therefore, in the later discussion, we only apply Hellinger's method in the distance calculation.

In order to further show the performance of the trained LDA model, we compare the topics with other sources.
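Equation (9) amounts to a grouped ratio of counts. A minimal pandas sketch, with a hypothetical labelled article stream:

```python
import pandas as pd

# hypothetical article stream: publication date plus a policy-related flag
articles = pd.DataFrame({
    "date": pd.to_datetime(
        ["2017-09-04", "2017-09-05", "2017-09-06",
         "2017-09-12", "2017-09-13", "2017-09-14", "2017-09-15"]),
    "is_regulatory": [1, 1, 0, 0, 0, 1, 0],
})

# equation (9): CRRIX_{s,t} = N_{s,t,reg} / N_{s,t,all}, here with s = weekly
weekly = (articles.set_index("date")
                  .resample("W")["is_regulatory"]
                  .agg(["sum", "count"]))
crrix = weekly["sum"] / weekly["count"]   # two weekly values: 2/3 and 1/4
```

Daily and monthly variants follow by changing the resampling rule ("D" or "M").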
On the leading cryptocurrency news platform Coindesk, the news items are labelled with these categories: "Opinions", "Tech", "Business", "Policy & Regulations", "Market" and "Feature". These categories appear in Table 1 (column 1) together with their equivalent topics (column 2) generated by our LDA model and the list of representative words for each topic (column 3).

Coindesk Subcategory | LDA Topic            | Top Keywords
Opinions             | Opinions             | Bitcoin, say, people, make, go, get, take, would, could, way
Tech                 | Technology           | System, blockchain, use, transaction, chain, technology, security, work, datum, network
Business             | Business             | Company, say, business, new, service, base, startup, firm, founder, CEO
Policy & Regulation  | Regulation           | Currency, business, virtual, law, state, regulation, money, digital, exchange, tax
Market               | Investment           | Bitcoin, market, currency, price, exchange, value, investor, Litecoin, trade, investment
                     | Trading and Exchange | Exchange, BTC, account, customer, trading, user, deposit, page, trade, fund
Feature              | Mining               | mine, power, asic, block, hash, chip, network, unit, hardware, pool
                     | Coins                | Coin, project, Dogecoin, game, Altcoin, developer, community, donate, crowdfunder, token

Table 1: Categories (Coindesk.com) matched by LDA topics

From the table we can see that, for the major categories on the popular platform, we can find the corresponding topics in our model. In the case of "Market", the topics go beyond the categories proposed by the platform. We must admit that part of the category "Business" overlaps with the category "Market". Even though we select the topic "Trading and Exchange" to match the category "Market", it could also be put under the bigger concept of "Business". In this sense, the machine learning LDA technique performs better and clearly identifies topics that keep their distance from each other.

We use the trained LDA model to calculate the Hellinger distances. We find that the distribution of average distances between each regulatory news item and all other regulatory news, \bar{d}_r, and that of average distances between each non-regulatory news item and all regulatory news, \bar{d}_{non}, are significantly different. Policy-related news items are similar, with smaller distances, whereas most non-policy-related news items are further away. Then we calculate the average Hellinger distance \bar{d}_{u,i} for each unclassified article i. Those which are smaller than 0.392, the 0.95 quantile of \bar{d}_r, are classified into the policy-related group.

The classification results of our LDA-based method were compared with those generated by Naive Bayes and SVM, two broadly used supervised ML classification methods. The confusion matrices of the classification results against the manually classified data can be found in Table 2. The "True" value is given by human annotation, and the "Pred" value is predicted by the applied ML technique.

True \ Pred | NB (1 / 0) | SVM_cw (1 / 0) | LDA (1 / 0) | Total
1           | 0 / 582    | 0 / 582        | 361 / 221   | 582
0           | 0 / 4004   | 0 / 4004       | 188 / 3816  | 4004
Total       | 0 / 4586   | 0 / 4586       | 549 / 4037  | 4586
Accuracy    | 0.873      | 0.873          | 0.907       |

Table 2: Confusion matrix of classification for three methods (Naive Bayes, Class-weighted SVM and LDA)
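A quick check on the cells of Table 2 shows why accuracy alone is misleading for this imbalanced task: the all-negative classifiers reach high accuracy with zero recall on the policy class. The helper functions below are illustrative:

```python
# confusion-matrix cells from Table 2 (rows: true 1/0; columns: predicted 1/0)
methods = {
    "NB":     {"tp": 0,   "fn": 582, "fp": 0,   "tn": 4004},
    "SVM_cw": {"tp": 0,   "fn": 582, "fp": 0,   "tn": 4004},
    "LDA":    {"tp": 361, "fn": 221, "fp": 188, "tn": 3816},
}

def accuracy(m):
    total = m["tp"] + m["fn"] + m["fp"] + m["tn"]
    return (m["tp"] + m["tn"]) / total

def recall(m):
    # share of truly policy-related news actually identified
    return m["tp"] / (m["tp"] + m["fn"])
```

NB and the class-weighted SVM score 4004/4586 ≈ 0.873 accuracy while finding none of the 582 policy-related articles (recall 0), so they cannot feed a frequency-based index; the LDA rule recovers 361/582 ≈ 0.62 of them while keeping accuracy above 0.9.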
In Table 2, the label 1 means that the given news item is classified as policy-related, and the label 0 means non-policy-related. The accuracies of all three methods are relatively high, over 0.87 (see Table 2). However, the supervised ML methods, Naive Bayes and class-weighted SVM, simply label all articles as non-policy-related. Even with their high accuracy, those methods cannot help with our research question. The purpose of the classification in this paper is to find the ratio of policy-related news over all news; if ever more news enter the calculation but zero policy-related news are identified, the index degenerates.

Meanwhile, the accuracy of our method is higher, 0.91. Although Table 2 shows that the type I error of the LDA classification is also high (221 of the 582 policy-related articles are missed, almost 38 percent), the method can still be used to build the index and in this sense performs much better than NB and SVM.

Figure 7: CRRIX (monthly) with news and price highlights

Multiple reasons could contribute to the misclassification. One could be that the LDA-based criteria are relatively strict: only news whose average distance to all policy-related news is as small as that of 95% of the pre-identified regulatory news is counted as regulatory news, and articles with a slightly larger Hellinger distance are all excluded. Another reason could come from the data itself. The cryptocurrency market is young, and the core policies discussed by the public and announced by governments were time-varying. Using all time slots from 2013 to 2019 in a single model might therefore be biased. A dynamic LDA with a rolling window would solve this problem, but it requires sufficient data points.
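The failure mode of the supervised baselines is easy to reproduce from the class counts in Table 2 (582 policy-related vs. 4,004 non-policy-related articles): a degenerate classifier that labels everything non-policy-related already attains the reported 0.873 accuracy while finding zero policy-related articles. A toy check:

```python
# Class counts taken from Table 2: 582 policy-related, 4004 non-policy-related.
n_pos, n_neg = 582, 4004
total = n_pos + n_neg

# A degenerate classifier that predicts "non-policy-related" for everything,
# which is what NB and the class-weighted SVM effectively do here.
tp, fn = 0, n_pos   # all policy-related articles are missed
tn, fp = n_neg, 0   # all non-policy-related articles are correct

accuracy = (tp + tn) / total
recall = tp / n_pos  # share of policy-related news actually found

print(round(accuracy, 3))  # 0.873 -- high accuracy, yet useless for the index
print(recall)              # 0.0   -- no policy-related article identified
```

This is why accuracy alone is not the right yardstick under heavy class imbalance: the index needs the positive class to be recovered.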
We did not pursue the dynamic LDA in this paper, since the data is limited.

Figure 8: VCRIX (in red) and the Regulatory Risk Index of the Cryptocurrency Market, CRRIX (in blue)

After each article is labeled according to its distance to the pre-identified policy-related news, we simply count the number of articles under the class "regulation" for a given time slot and divide it by the number of all articles in the same period. Since the daily time series is too noisy, especially at the early stage of the market, we only consider weekly time steps. Figure 7 indicates that the peaks and jumps of the index are mainly led by big policy changes. A red arrow marks a positive policy event, which brought an increase in the price of Bitcoin, whereas a green arrow marks a negative one. The number next to each arrow is the weekly return rate (positive for red arrows, negative for green arrows).

Figure 7 reveals that policy changes are accompanied by drastic price fluctuations which bring high risk to the market. Our index successfully captures those big changing moments.

Figure 8 shows that our regulatory risk index is closely related to VCRIX, a volatility index for the CC market. Especially in the period from September 2017 to March 2018, the extremely high volatility is driven by policy uncertainty. The movements of VCRIX and the regulatory risk index are synchronous; the correlation between the two indices is 0.44712. Our regulatory risk index could thus contribute to forecasting market movements.

Number of lags (non-zero): 1
ssr-based F test:      F = 23.1736,    p = 0.0000, df_denom = 825, df_num = 1
ssr-based chi2 test:   chi2 = 23.2579, p = 0.0000, df = 1
likelihood ratio test: chi2 = 22.9372, p = 0.0000, df = 1
parameter F test:      F = 23.1736,    p = 0.0000, df_denom = 825, df_num = 1

Table 3: Granger causality test results for lag 1

We further test the causality between CRRIX and VCRIX. First we perform a Dickey-Fuller test to confirm the stationarity of both time series. The results reject the non-stationarity hypothesis (the p-value equals . for VCRIX and .
for CRRIX). Here we only show the Granger causality test results for lag 1, in Table 3. The null hypothesis of the Granger causality test is that the time series CRRIX does NOT Granger-cause the time series VCRIX. For lag 1, we reject the null hypothesis. This means that the past (lag 1) values of CRRIX have a statistically significant effect on the current value of VCRIX. The results hold for lags 1 to 7.

Conclusion

In this paper, via the machine learning tool LDA, we quantify the risks originating from introducing regulations on the cryptocurrency market and identify their impact on cryptocurrency investments. Indices have been constructed to track the cryptocurrency markets; however, none of these indices directly addresses regulatory risks. The indices introduced in Baker et al. (2016) focus on economic policy uncertainty in general. In a similar spirit, we construct a regulatory risk index for cryptocurrencies that is based on the frequency of policy-related news coverage. Unlike the classical annotation approach, which involves a meticulous manual process, we employ LDA, a comparatively inexpensive and efficient ML method, to classify policy-related news.

We first reviewed the development of cryptocurrencies and the trend of regulatory dynamics in general. Based on that, we formulated the research questions: What exactly is the regulatory risk for cryptocurrencies? How can an index of regulatory risk for the cryptocurrency market be constructed from news data? What is the impact of regulatory risk on the market? To answer these questions, we first collected news data from the top online cryptocurrency news platforms (Guides, 2018), Coindesk and Bitcoin Magazine, via a dynamic web scraper.
In addition, CRIX and VCRIX are chosen to represent the value and the volatility of the entire cryptocurrency market for the later analysis.

To calculate the coverage frequency, we treat the problem of semantic similarity as a binary decision problem, in which an article is policy-related or not, using Latent Dirichlet Allocation (LDA), which models the underlying topics of a corpus of documents, where each topic is a mixture over words and each document is a mixture over topics.

The topics given by LDA were comparable with those of the leading cryptocurrency news platforms. For the major categories in the popular platforms we could find corresponding topics in our model, and the identified topics were clearly separated from each other. According to our model, the top words for the regulation topic are: currency, business, virtual, law, state, regulation, money, digital, exchange and tax.

We use the trained LDA model to calculate the Hellinger distances. Articles with a small average distance are classified into the group of policy-related news. The results were compared with those of the Naive Bayes and class-weighted SVM methods. Since our data is very imbalanced, the performance of those two supervised ML methods was not helpful. Our LDA-based distance classification, however, attained a high accuracy of 0.91 and could be used to construct the index.

The final results of the regulatory risk index are shown in Figures 7 and 8. Our index successfully captures the big policy-changing moments. The movements of VCRIX and the regulatory risk index are synchronous, and the Granger test confirmed the causality from CRRIX to market volatility.

References
Auer, Raphael and Stijn Claessens, "Cryptocurrency market reactions to regulatory news," Discussion Paper DP14602, CEPR, April 2020.

Azqueta-Gavaldón, Andrés, "Developing news-based economic policy uncertainty index with unsupervised machine learning," Economics Letters, 2017, 47–50.

Bai, Yiqi and Jie Wang, "News classifications with labeled LDA," in "2015 7th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management (IC3K)," Vol. 1, IEEE, 2015, pp. 75–83.

Baker, Scott R, Nicholas Bloom, and Steven J Davis, "Has economic policy uncertainty hampered the recovery?," Government Policies and the Delayed Economic Recovery, 2012.

— , — , and — , "Measuring economic policy uncertainty," The Quarterly Journal of Economics, 2016, (4), 1593–1636.
Barrdear, John and Michael Kumhof, "The macroeconomics of central bank issued digital currencies," Staff Working Paper 605, Bank of England, July 2016.

Baur, Dirk G, Kihoon Hong, and Adrian D Lee, "Bitcoin: Medium of exchange or speculative assets?," Journal of International Financial Markets, Institutions and Money, 2018, 177–189.

Binder, John J, "Measuring the effects of regulation with stock price data," The RAND Journal of Economics, 1985, pp. 167–183.

Blandin, Apolline, Ann Sofie Cloots, Hatim Hussain, Michel Rauchs, Rasheed Saleuddin, Jason G Allen, Katherine Cloud, and Bryan Zheng Zhang, "Global cryptoasset regulatory landscape study," Available at SSRN, 2019.
Blei, David M, "Probabilistic topic models," Communications of the ACM, 2012, (4), 77–84.

— , Andrew Y Ng, and Michael I Jordan, "Latent Dirichlet allocation," Journal of Machine Learning Research, 2003, (Jan), 993–1022.

Borke, Lukas and Wolfgang K Härdle, "Q3-D3-LSA: D3.js and Generalized Vector Space Models for Statistical Computing," in "Handbook of Big Data Analytics," Springer, 2018, pp. 377–424.
Buckland, Roger and Patricia Fraser, "Political and regulatory risk in water utilities: Beta sensitivity in the United Kingdom," Journal of Business Finance & Accounting, 2001, (7-8), 877–904.

Caytas, Joanna Diane, "Regulation of Cryptocurrencies and Initial Coin Offerings in Switzerland: Declared Vision of a 'Crypto Nation'," Includes Chapter News, 2018, (1), 53.

Chen, Jingnian, Houkuan Huang, Shengfeng Tian, and Youli Qu, "Feature selection for text classification with Naïve Bayes," Expert Systems with Applications, 2009, (3), 5432–5435.

Chen, Xingyuan, Yunqing Xia, Peng Jin, and John Carroll, "Dataless text classification with descriptive LDA," in "Twenty-Ninth AAAI Conference on Artificial Intelligence," 2015.

Cieslak, David A and Nitesh V Chawla, "Learning decision trees for unbalanced data," in "Joint European Conference on Machine Learning and Knowledge Discovery in Databases," Springer, 2008, pp. 241–256.

da Gama Silva, Paulo Vitor Jordão, Marcelo Cabus Klotzle, Antonio Carlos Figueiredo Pinto, and Leonardo Lima Gomes, "Herding behavior and contagion in the cryptocurrency market," Journal of Behavioral and Experimental Finance, 2019, 41–50.

Dnes, Antony W and Jonathan S Seaton, "The regulation of British Telecom: An event study," Journal of Institutional and Theoretical Economics (JITE)/Zeitschrift für die gesamte Staatswissenschaft, 1999, pp. 610–616.

Dolan, Shelagh, 2020.
Fiorillo, Steve, "Bitcoin History: Timeline, Origins and Founder," The Street, 2018.

Flamary, Rémi and Nicolas Courty, "POT Python Optimal Transport library," 2017.

George, Ammu, Taojun Xie, and Joseph Alba, "Central Bank Digital Currency with Adjustable Interest Rate in Small Open Economies," Technical Working Paper, Asia Competitiveness Institute, June 2020.

Geva, Tomer and Jacob Zahavi, "Empirical evaluation of an automated intraday stock recommendation system incorporating both market data and textual news," Decision Support Systems, 2014, 212–223.

Ghiassi, Manoochehr, James Skinner, and David Zimbra, "Twitter brand sentiment analysis: A hybrid system using n-gram analysis and dynamic artificial neural network," Expert Systems with Applications, 2013, (16), 6266–6282.

Glaser, Florian, Kai Zimmermann, Martin Haferkorn, Moritz Christian Weber, and Michael Siering, "Bitcoin - asset or currency? Revealing users' hidden intentions," Revealing Users' Hidden Intentions (April 15, 2014), ECIS, 2014.

Gold, Zack and Megan McBride, "Cryptocurrency: A Primer for Policy-Makers," 2019.

Griffiths, Thomas L and Mark Steyvers, "Finding scientific topics," Proceedings of the National Academy of Sciences, 2004, (suppl 1), 5228–5235.

Guides, Trading Strategy, "Top 10 Cryptocurrency Blogs (You Should Follow in 2019)," 2018.

Gulen, Huseyin and Mihai Ion, "Policy uncertainty and corporate investment," The Review of Financial Studies, 2016, (3), 523–564.

Hafner, Christian, "Testing for bubbles in cryptocurrencies with time-varying volatility," Available at SSRN 3105251, 2018.
Hafner, Christian M, "Testing for bubbles in cryptocurrencies with time-varying volatility," Journal of Financial Econometrics, 2020, (2), 233–249.

Härdle, Wolfgang Karl, Cathy Yi-Hsuan Chen, and Ludger Overbeck, Applied Quantitative Finance, Springer, 2017.

Houben, Robby and Alexander Snyers, Cryptocurrencies and blockchain: Legal context and implications for financial crime, money laundering and tax evasion.

Kang, Wensheng and Ronald A Ratti, "Oil shocks, policy uncertainty and stock market return," Journal of International Financial Markets, Institutions and Money, 2013, 305–318.

Kim, Alisa, Simon Trimborn, and Wolfgang K Härdle, "VCRIX - A Volatility Index for Crypto-Currencies," Available at SSRN 3480348, 2019.

Klein, Tony, Hien Pham Thu, and Thomas Walther, "Bitcoin is not the New Gold - A comparison of volatility, correlation, and portfolio performance," International Review of Financial Analysis, 2018, 105–116.

Lee, David and Robert H Deng, "Handbook of blockchain, digital finance, and inclusion: Cryptocurrency, FinTech, InsurTech, and regulation," 2018.

Lee, Seonggyu, Jinho Kim, and Sung-Hyon Myaeng, "An extension of topic models for text classification: A term weighting approach," in "2015 International Conference on Big Data and Smart Computing (BIGCOMP)," IEEE, 2015, pp. 217–224.

Lin, Yung-Shen, Jung-Yi Jiang, and Shie-Jue Lee, "A similarity measure for text classification and clustering," IEEE Transactions on Knowledge and Data Engineering, 2013, (7), 1575–1590.

Linton, Marco, Ernie Gin Swee Teo, Elisabeth Bommes, CY Chen, and Wolfgang Karl Härdle, "Dynamic topic modelling for cryptocurrency community forums," in "Applied Quantitative Finance," Springer, 2017, pp. 355–372.

Mimno, David, Hanna M Wallach, Edmund Talley, Miriam Leenders, and Andrew McCallum, "Optimizing semantic coherence in topic models," in "Proceedings of the Conference on Empirical Methods in Natural Language Processing," Association for Computational Linguistics, 2011, pp. 262–272.
Nakamoto, Satoshi, "Bitcoin: A peer-to-peer electronic cash system," Technical Report, Manubot, 2019.

— et al., "Bitcoin: A peer-to-peer electronic cash system," 2008.

Nassirtoussi, Arman Khadjeh, Saeed Aghabozorgi, Teh Ying Wah, and David Chek Ling Ngo, "Text mining for market prediction: A systematic review," Expert Systems with Applications, 2014, (16), 7653–7670.

Newman, David, Jey Han Lau, Karl Grieser, and Timothy Baldwin, "Automatic evaluation of topic coherence," in "Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics," Association for Computational Linguistics, 2010, pp. 100–108.

Niraula, Nobal, Rajendra Banjade, Dan Ștefănescu, and Vasile Rus, "Experiments with semantic similarity measures based on LDA and LSA," in "International Conference on Statistical Language and Speech Processing," Springer, 2013, pp. 188–199.

Prager, Robin A, "The effects of deregulating cable television: evidence from the financial markets," Journal of Regulatory Economics, 1992, (4), 347–363.

Redondo, Eric, Connor Lamon, and Eric Nielsen, "Cryptocurrency Price Change Prediction Using News."

Röder, Michael, Andreas Both, and Alexander Hinneburg, "Exploring the space of topic coherence measures," in "Proceedings of the Eighth ACM International Conference on Web Search and Data Mining," 2015, pp. 399–408.

Schwert, G William, "Measuring the effects of regulation: evidence from the capital markets," Journal of Law and Economics, 1981, (1), 121–58.

Sebastiani, Fabrizio, "Machine learning in automated text categorization," ACM Computing Surveys (CSUR), 2002, (1), 1–47.

Teh, Yee Whye, Michael I Jordan, Matthew J Beal, and David M Blei, "Sharing clusters among related groups: Hierarchical Dirichlet processes," in "Proceedings of the 17th International Conference on Neural Information Processing Systems," MIT Press, 2004, pp. 1385–1392.

Tong, Simon and Daphne Koller, "Support vector machine active learning with applications to text classification," Journal of Machine Learning Research, 2001, (Nov), 45–66.

Trimborn, Simon and Wolfgang Karl Härdle, "CRIX an Index for cryptocurrencies," Journal of Empirical Finance, 2018, 107–122.

— , Mingyang Li, and Wolfgang K Härdle, "Investing with cryptocurrencies - A liquidity constrained investment approach," published version: Trimborn, S., Li, M. and W. K. Härdle (2019), "Investing with Cryptocurrencies - a Liquidity Constrained Investment Approach," Journal of Financial Econometrics, doi.org/10.1093/jjfinec/nbz016, 2018.
Tu, Jun, Wolfgang Karl Härdle, and Ya Qian, "Information Arrival, News Sentiment, Volatilities and Jumps of Intraday Returns," 2018.
Wallach, Hanna M, David M Mimno, and Andrew McCallum, "Rethinking LDA: Why priors matter," in "Advances in Neural Information Processing Systems," 2009, pp. 1973–1981.

Zhong, Raymond.

Appendix
A Proofs for Hellinger Distance
A.1 Asymptotic distribution of $\sqrt{nh}\,[\hat f(x) - f(x)]$ depends on $f(x)$

Proof.
Let $\hat f_h(x) = \frac{1}{n} \sum_{i=1}^n K_h(x - X_i)$ denote the kernel density estimator. Decompose
\[
\sqrt{nh}\,[\hat f(x) - f(x)] = \sqrt{nh}\,\{\hat f(x) - \operatorname{E}[\hat f(x)]\} + \sqrt{nh}\,\{\operatorname{E}[\hat f(x)] - f(x)\}. \tag{10}
\]
For the second part,
\begin{align*}
\operatorname{Bias}\{\hat f_h(x)\} &= \operatorname{E}\{\hat f_h(x)\} - f(x) \\
&= \frac{1}{n} \sum_{i=1}^n \operatorname{E}\{K_h(x - X_i)\} - f(x) \\
&= \operatorname{E}\{K_h(x - X)\} - f(x) \\
&= \int \frac{1}{h} K\!\left(\frac{x-u}{h}\right) f(u)\, du - f(x). \tag{11}
\end{align*}
Apply the transformation $s = \frac{u-x}{h}$, i.e. $u = hs + x$, $\left|\frac{du}{ds}\right| = h$. A second-order Taylor expansion of $f(u)$ around $x$ is given by
\[
f(x + hs) = f(x) + f'(x)\, hs + \tfrac{1}{2} f''(x)\, h^2 s^2 + o(h^2). \tag{12}
\]
Then
\begin{align*}
\operatorname{Bias}\{\hat f_h(x)\} &= \int K(s)\, f(x + hs)\, ds - f(x) \\
&= \int K(s) \left[ f(x) + f'(x)\, hs + \tfrac{1}{2} f''(x)\, h^2 s^2 + o(h^2) \right] ds - f(x) \\
&= f(x) \int K(s)\, ds + f'(x)\, h \int s K(s)\, ds + \tfrac{1}{2} f''(x)\, h^2 \int s^2 K(s)\, ds - f(x) + o(h^2) \\
&= \frac{h^2}{2} f''(x)\, \mu_2(K) + o(h^2), \quad \text{as } h \to 0, \tag{13}
\end{align*}
where $\int K(s)\, ds = 1$, $\int s K(s)\, ds = 0$ and $\int s^2 K(s)\, ds = \mu_2(K)$.

For the first part,
\begin{align*}
\operatorname{Var}\{\hat f_h(x)\} &= \operatorname{Var}\left\{ \frac{1}{n} \sum_{i=1}^n K_h(x - X_i) \right\} = \frac{1}{n^2} \sum_{i=1}^n \operatorname{Var}\{K_h(x - X_i)\} = \frac{1}{n} \operatorname{Var}\{K_h(x - X)\} \\
&= \frac{1}{n} \left( \operatorname{E}\!\left[ K_h^2(x - X) \right] - \{ \operatorname{E}[K_h(x - X)] \}^2 \right) \\
&= \frac{1}{n} \int \frac{1}{h^2} K^2\!\left(\frac{x-t}{h}\right) f(t)\, dt - \frac{1}{n} \left( f(x) + \operatorname{Bias}\{\hat f(x)\} \right)^2. \tag{14}
\end{align*}
Substituting $s = \frac{t-x}{h}$,
\[
\operatorname{Var}\{\hat f_h(x)\} = \frac{1}{nh} \int K^2(s)\, f(x + hs)\, ds - \frac{1}{n} \left( f(x) + o(h) \right)^2. \tag{15}
\]
Applying a Taylor approximation yields
\begin{align*}
\operatorname{Var}\{\hat f_h(x)\} &= \frac{1}{nh} \int K^2(s) \left( f(x) + hs f'(x) + \tfrac{1}{2} f''(x)\, h^2 s^2 + o(h^2) \right) ds - \frac{1}{n} \left( f(x) + o(h) \right)^2 \\
&= \frac{1}{nh}\, \|K\|_2^2\, f(x) + o\!\left(\frac{1}{nh}\right), \quad \text{as } nh \to \infty, \tag{16}
\end{align*}
where $\int K^2(s)\, ds = \|K\|_2^2$. With $\operatorname{Var}(\hat f(x)) \to 0$ as $nh \to \infty$,
\[
\frac{\hat f(x) - \operatorname{E}[\hat f(x)]}{\sqrt{\operatorname{Var}(\hat f(x))}} \xrightarrow{d} \operatorname{N}(0, 1). \tag{17}
\]
Substituting the expression for $\operatorname{Var}(\hat f(x))$,
\[
\sqrt{nh}\, \{\hat f(x) - \operatorname{E}[\hat f(x)]\} \xrightarrow{d} \operatorname{N}\!\left(0,\, f(x)\, \|K\|_2^2\right). \tag{18}
\]
If the bandwidth tends to zero faster than the optimal rate, then
\[
\sqrt{nh}\, \{\operatorname{E}[\hat f(x)] - f(x)\} \to 0 \tag{19}
\]
and the bias term vanishes from the asymptotic distribution,
\[
\sqrt{nh}\, [\hat f(x) - f(x)] \xrightarrow{d} \operatorname{N}\!\left(0,\, f(x)\, \|K\|_2^2\right). \tag{20}
\]

A.2 Asymptotic distribution of $\sqrt{nh}\,[\sqrt{\hat f(x)} - \sqrt{f(x)}]$ does not depend on $f(x)$

Proof.
From the transformation (delta method) theorem we know that if $\sqrt{n}(t - \mu) \xrightarrow{\mathcal{L}} \operatorname{N}_p(0, \Sigma)$, then
\[
\sqrt{n}\, [f(t) - f(\mu)] \xrightarrow{\mathcal{L}} \operatorname{N}_q\!\left(0,\, \mathcal{D}^\top \Sigma\, \mathcal{D}\right) \quad \text{for } n \to \infty. \tag{21}
\]
Denote $g(x) = x^{1/2}$, so that $\frac{dg}{dx} = \frac{1}{2} x^{-1/2}$. With $\sqrt{nh}\, (\hat f(x) - f(x)) \xrightarrow{d} \operatorname{N}(0, f(x) \|K\|_2^2)$, it follows that
\[
\sqrt{nh}\, \left[\sqrt{\hat f(x)} - \sqrt{f(x)}\right] \xrightarrow{d} \operatorname{N}\!\left(0,\, \tfrac{1}{4} \|K\|_2^2\right), \tag{22}
\]
since $\{g'(f(x))\}^2 f(x) \|K\|_2^2 = \frac{1}{4} f(x)^{-1} f(x) \|K\|_2^2 = \frac{1}{4} \|K\|_2^2$. Hence the asymptotic distribution of $\sqrt{nh}\,[\sqrt{\hat f(x)} - \sqrt{f(x)}]$ does not depend on $f(x)$.
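The variance-stabilizing property in (22) is easy to check numerically. The following is a minimal Monte Carlo sketch (the sample size, bandwidth rate, Gaussian kernel and standard normal data are illustrative choices, not those of the paper). For the Gaussian kernel, $\|K\|_2^2 = 1/(2\sqrt{\pi})$, so the standard deviation of $\sqrt{nh}\,[\sqrt{\hat f(x)} - \sqrt{f(x)}]$ should be close to $\|K\|_2 / 2 \approx 0.266$ at every evaluation point $x$, regardless of $f(x)$.

```python
import numpy as np

rng = np.random.default_rng(42)

def phi(x):
    """Standard normal density (true f, and also the kernel K)."""
    return np.exp(-0.5 * x**2) / np.sqrt(2 * np.pi)

def kde(x, sample, h):
    """Gaussian-kernel density estimate at the point x."""
    return np.mean(phi((x - sample) / h)) / h

n, reps = 50_000, 400
h = n ** -0.5          # undersmoothing, so the bias term in (19) vanishes
points = [0.0, 1.0]    # two points where f(x) differs markedly

stats = {x: [] for x in points}
for _ in range(reps):
    sample = rng.standard_normal(n)
    for x in points:
        stats[x].append(np.sqrt(n * h) * (np.sqrt(kde(x, sample, h)) - np.sqrt(phi(x))))

# Theory: sd = ||K||_2 / 2 = (2 sqrt(pi))**-0.5 / 2 ~ 0.266, independent of x.
for x in points:
    print(x, round(float(np.std(stats[x])), 3))
```

The simulated standard deviations at $x = 0$ and $x = 1$ agree with each other and with the theoretical value, illustrating why the Hellinger-type square-root transform gives a pivot-like quantity.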