Investigation of Flash Crash via Topological Data Analysis
Wonse Kim a,1,2, Younng-Jin Kim a,3, Gihyun Lee a,1, Woong Kook a,2,4,∗
a Department of Mathematical Sciences, Seoul National University, Seoul, South Korea
Abstract
Topological data analysis has been acknowledged as one of the most successful mathematical data analytic methodologies in various fields including medicine, genetics, and image analysis. In this paper, we explore the potential of this methodology in finance by applying persistence landscapes and dynamic time series analysis to an extreme event in the stock market, known as the Flash Crash. We provide the results of our empirical investigation to confirm the effectiveness of our new method not only for the characterization of this extreme event but also for its prediction.
Keywords:
Persistence landscape, Flash Crash, Time series analysis
1. Introduction
Topological data analysis (TDA) is a relatively new and rapidly growing field of mathematics that uses a wide variety of techniques based on homology theory and statistics. TDA extracts hidden information from the shape of data that previous data analytic methods could not reveal, and various fields including medicine, image analysis, and genetics [7, 9] have benefited tremendously from this methodology. Nevertheless, in the field of finance, there have been only a small number of applications so far [5]. The purpose of this paper is to apply persistence landscapes to detect extreme anomalies in financial markets, and to demonstrate how TDA can provide not only a global characterization of data but also dynamic local features for prediction purposes.

On May 6, 2010, there was an unprecedented sudden intraday stock market crash in U.S. financial markets known as the Flash Crash, and that event has recently emerged as an interesting research topic in finance [3, 4, 6, 8]. In this paper, we combine methods of TDA, including the persistence landscape method proposed in [5], with techniques of time series analysis, and create a new method for detecting intraday stock market crashes based on the $L^p$-norm of the persistence landscape. We then empirically show that our method can characterize and predict the Flash Crash.

The rest of the paper is organized as follows. Section 2 introduces the background and techniques of topological data analysis that are used in the paper, and provides a summary of the Flash Crash event. Section 3 describes our data, constructed from major stock market indexes, and explains our new method based on TDA and statistical analysis. Section 4 reports empirical results that validate the effectiveness of our method for prediction purposes. Section 5 concludes the paper with further practical implications of our results.

∗ Corresponding author
Email addresses: [email protected] (Wonse Kim), [email protected] (Younng-Jin Kim), [email protected] (Gihyun Lee), [email protected] (Woong Kook)
1 W. Kim and G. Lee are partially supported by BK21 PLUS SNU Mathematical Sciences Division.
2 W. Kim and W. Kook are partially supported by National Research Foundation of Korea (NRF) Grant funded by the Korean Government (MSIP) [No. 2018R1A2A3075511].
3 Y.-J. Kim is supported by National Research Foundation of Korea (NRF) Grant funded by the Korean Government (NRF-2015-Global Ph.D. Fellowship Program).
4 W. Kook is partially supported by National Research Foundation of Korea (NRF) Grant funded by the Korean Government (MSIP) [No. 2017R1A5A1015626].
Preprint submitted to Topology and its Applications, August 27, 2020
2. Background
In this section, we gather key concepts of topological data analysis that will be used later in the paper. Consider a data set $\mathcal{X} = \{x_1, x_2, \ldots, x_n\}$, which is a finite subset of a Euclidean space $\mathbb{R}^d$. Let $R(\mathcal{X}, \epsilon)$ denote the Vietoris-Rips complex for the data set $\mathcal{X}$ and a distance $\epsilon > 0$, i.e., $R(\mathcal{X}, \epsilon)$ is the simplicial complex on the vertex set $\mathcal{X}$ such that

\[ \text{a subset } \sigma \text{ of } \mathcal{X} \text{ is a simplex in } R(\mathcal{X}, \epsilon) \text{ if and only if } d(x, y) < \epsilon \text{ for all } x, y \in \sigma. \tag{1} \]

Here $d(x, y)$ is the Euclidean distance between $x$ and $y$.

In what follows, let us denote by $H_i(K)$ the $i$-th homology group of a simplicial complex $K$. Throughout this paper we only use homology with real coefficients. It follows from the definition (1) of the Vietoris-Rips complex that $R(\mathcal{X}, \epsilon) \subset R(\mathcal{X}, \epsilon')$ whenever $\epsilon < \epsilon'$. Hence the Vietoris-Rips complexes $R(\mathcal{X}, \epsilon)$, $\epsilon > 0$, form a filtration, which is called the Vietoris-Rips filtration. This filtration induces homomorphisms between homology groups. That is, for each $i$, we get a canonical linear map

\[ H_i(R(\mathcal{X}, \epsilon)) \longrightarrow H_i(R(\mathcal{X}, \epsilon')) \quad \text{whenever } \epsilon < \epsilon'. \tag{2} \]

The set of homology groups $\{H_i(R(\mathcal{X}, \epsilon))\}_{\epsilon > 0}$ and the linear maps (2) form a persistence module in the sense of [2], which enables us to track the birth and death of homology classes as the parameter $\epsilon > 0$ varies.

Given $\epsilon > 0$ and $i$, let $0 \neq \alpha \in H_i(R(\mathcal{X}, \epsilon))$. By using the Vietoris-Rips filtration we see that there exist positive real numbers $b_\alpha$ and $d_\alpha$ satisfying $b_\alpha \leq \epsilon \leq d_\alpha$ and
• $\alpha$ is not in the image of the map $H_i(R(\mathcal{X}, \delta)) \to H_i(R(\mathcal{X}, \epsilon))$ if $\delta < b_\alpha$.
• For $b_\alpha \leq \delta \leq \epsilon$, there is $0 \neq \beta \in H_i(R(\mathcal{X}, \delta))$ whose image under the map $H_i(R(\mathcal{X}, \delta)) \to H_i(R(\mathcal{X}, \epsilon))$ is $\alpha$.
• For $\epsilon \leq \delta \leq d_\alpha$, the image of $\alpha$ under the map $H_i(R(\mathcal{X}, \epsilon)) \to H_i(R(\mathcal{X}, \delta))$ is non-zero.
• The map $H_i(R(\mathcal{X}, \epsilon)) \to H_i(R(\mathcal{X}, \delta))$ sends $\alpha$ to zero in $H_i(R(\mathcal{X}, \delta))$ if $\delta > d_\alpha$.

Let $(b_\alpha, d_\alpha)$ denote the open interval corresponding to the birth and death of a given homology class $\alpha$. In what follows, let $T_i$ denote the set of open intervals

\[ T_i = \{ (b_\alpha, d_\alpha) \mid \alpha \text{ is a non-zero } i\text{-dimensional homology class} \}. \]

Here the elements of $T_i$ are counted with multiplicity, i.e., if $\alpha$ and $\beta$ are distinct $i$-dimensional homology classes, then $(b_\alpha, d_\alpha)$ and $(b_\beta, d_\beta)$ are regarded as different elements of $T_i$ even though these two open intervals may coincide.

Given a pair of real numbers $b < d$, set

\[ f_{(b,d)}(x) = \begin{cases} x - b & \text{for } b \leq x \leq \frac{b+d}{2}, \\ -x + d & \text{for } \frac{b+d}{2} \leq x \leq d, \\ 0 & \text{otherwise.} \end{cases} \]

The persistence landscape associated with the data set $\mathcal{X}$ is a function $\lambda(\mathcal{X}) : \mathbb{N} \times \mathbb{R} \to \mathbb{R}$ which encodes the information about the birth and death of homology classes [1]. It is defined by

\[ \lambda(\mathcal{X})(k, x) = k\text{-max}\{ f_{(b,d)}(x) \mid (b, d) \in T_i \}, \quad (k, x) \in \mathbb{N} \times \mathbb{R}. \]

Here $k$-max denotes the $k$-th largest value counted with multiplicity. Note that the persistence landscape $\lambda(\mathcal{X})$ can be seen as a sequence of functions $\{\lambda(\mathcal{X})_k\}_{k \geq 1}$, where $\lambda(\mathcal{X})_k(x) = \lambda(\mathcal{X})(k, x)$.

For $1 \leq p < \infty$, the $L^p$-norm of the persistence landscape $\lambda(\mathcal{X}) = \{\lambda(\mathcal{X})_k\}_{k \geq 1}$ is defined by

\[ \|\lambda(\mathcal{X})\|_p = \sum_{k=1}^{\infty} \left( \int_{-\infty}^{\infty} |\lambda(\mathcal{X})_k(x)|^p \, dx \right)^{1/p}. \tag{3} \]

As $\mathcal{X}$ is a finite set, all simplicial complexes $R(\mathcal{X}, \epsilon)$, $\epsilon > 0$, are finite. This shows that for $k$ large enough we have $\lambda(\mathcal{X})_k = 0$, and hence there is no issue of convergence of the series given in (3).

On May 6, 2010, there was a sudden intraday stock market crash in U.S. financial markets known as the Flash Crash. The crash started at 2:32 p.m. EDT and lasted for approximately 36 minutes, during which major U.S. stock indexes such as the S&P 500, Dow Jones Industrial Average, and Nasdaq Composite dropped by more than 6% and rebounded very rapidly. Figure 1 shows the intraday price time series of S&P 500 futures, Dow Jones futures, and NASDAQ futures on May 6, 2010. Since such an extreme price swing lasting only for a very short time had never been reported before in financial markets, many financial researchers have been trying to understand the Flash Crash [3, 4, 6, 8].
Figure 1: Intraday price time series of S&P 500 futures, Dow Jones futures, and NASDAQ futures on May 6, 2010.
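To make the persistence landscape and its norm (3) from this section concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes the birth-death intervals in T_i have already been computed by some persistent homology software and are given as a list of (b, d) pairs. The function names tent, landscape, and landscape_norm are illustrative, and the integral in (3) is only approximated by the trapezoidal rule on a uniform grid.

```python
def tent(b, d, x):
    """f_(b,d)(x): equals x - b up to the midpoint, d - x after it, 0 outside [b, d]."""
    return max(0.0, min(x - b, d - x))


def landscape(intervals, k, x):
    """lambda(k, x): the k-th largest tent value at x (k is 1-based), counted with
    multiplicity; returns 0 when fewer than k intervals exist."""
    values = sorted((tent(b, d, x) for b, d in intervals), reverse=True)
    return values[k - 1] if k <= len(values) else 0.0


def landscape_norm(intervals, p=1.0, steps=2000):
    """Approximate sum_k (integral |lambda_k(x)|^p dx)^(1/p) as in (3).

    The grid resolution `steps` is an assumption of this sketch; every lambda_k
    vanishes outside [min b, max d], so integrating over that range suffices.
    """
    if not intervals:
        return 0.0
    lo = min(b for b, _ in intervals)
    hi = max(d for _, d in intervals)
    h = (hi - lo) / steps
    xs = [lo + j * h for j in range(steps + 1)]
    total = 0.0
    # lambda_k is identically zero for k larger than the number of intervals.
    for k in range(1, len(intervals) + 1):
        ys = [landscape(intervals, k, x) ** p for x in xs]
        integral = h * (sum(ys) - 0.5 * (ys[0] + ys[-1]))  # trapezoidal rule
        total += integral ** (1.0 / p)
    return total
```

For a single interval (0, 2) the first landscape function is a tent of height 1 and area 1, so `landscape_norm([(0, 2)], p=1)` should be close to 1; repeating the interval adds a second identical layer, which reflects the multiplicity convention for T_i.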
3. Data and Methods
We purchased one-minute price data for three futures based on three major U.S. stock market indexes, (a) S&P 500 futures, (b) Dow Jones futures, and (c) NASDAQ futures, from April 1, 2010 to May 28, 2010 (42 trading days) from BacktestMarket. For each futures contract $i$ ($i = a, b, c$) and for each minute $j$, we calculate the one-minute log-return as

\[ r_{i,j} = \log\left( \frac{P_{i,j}}{P_{i,j-1}} \right), \]

where $P_{i,j}$ represents the closing price of futures $i$ at minute $j$. Thus, for each minute $j$, we have a 3-dimensional vector defined by

\[ X_j = (r_{a,j}, r_{b,j}, r_{c,j}). \]

Given a window size $w = 50$, for each $t \geq w$ we then construct the 3-dimensional time series defined by

\[ \mathcal{X}_t = \{ X_{t-w+1}, X_{t-w+2}, \ldots, X_{t-1}, X_t \}. \]

For each 3-dimensional time series $\mathcal{X}_t$, we compute the $L^p$-norm of its persistence landscape, $\|\lambda(\mathcal{X}_t)\|_p$, so we get a time series of $L^p$-norms of persistence landscapes,

\[ Y = (Y_1, Y_2, \ldots, Y_n) = \left( \|\lambda(\mathcal{X}_w)\|_p, \|\lambda(\mathcal{X}_{w+1})\|_p, \ldots, \|\lambda(\mathcal{X}_{w+(n-1)})\|_p \right), \]

as in [5].

In order to detect an abnormality of the time series $Y$, we develop a new abnormality measure based on the notions of Exponential Moving Average (EMA) and Exponential Moving Variance (EMVar) processes. Given the initial values

\[ \mathrm{EMA}_1 = Y_1, \quad \mathrm{EMVar}_1 = 0, \]

the subsequent values $\mathrm{EMA}_i$ and $\mathrm{EMVar}_i$ are computed using the following recursive formulae:

\[ \delta_i = Y_i - \mathrm{EMA}_{i-1}, \]
\[ \mathrm{EMA}_i = \mathrm{EMA}_{i-1} + \alpha \cdot \delta_i, \]
\[ \mathrm{EMVar}_i = (1 - \alpha) \cdot (\mathrm{EMVar}_{i-1} + \alpha \cdot \delta_i^2). \]

Then our new abnormality measure $Z_t$ is defined by

\[ Z_t = \frac{Y_t - \mathrm{EMA}_{t-1}}{\sqrt{\mathrm{EMVar}_{t-1}}}. \]

The new measure $Z_t$ represents the extent to which the current value $Y_t$ deviates from the previous values.

4. Empirical Results

Figure 2 displays the intraday time series plots of our new abnormality measure $Z_t$ and the prices of S&P 500 futures on May 6, 2010, the day of the Flash Crash. As Figure 2 shows, just before the extreme price swing of the Flash Crash begins, the abnormality measure $Z_t$ takes the value 28,034.92. This value is exceptionally large compared to the values of $Z_t$ at other times of the day. Figure 3 (resp., Figure 4) presents the intraday time series plots of $Z_t$ and S&P 500 futures on four randomly selected days before (resp., after) the day of the Flash Crash within our sample period. Figures 3 and 4 show that the largest value of $Z_t$ on the day of the Flash Crash, 28,034.92, is also extremely large compared to the values of $Z_t$ on other days.
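The construction of Section 3 behind these results can be sketched in a few lines of Python. This is a hedged illustration under stated assumptions rather than the code used in the study: the function names are hypothetical, the series Y of landscape norms is assumed to be computed elsewhere, the variance update uses the squared deviation, and the smoothing parameter alpha is a free input whose value is not specified here.

```python
import math


def log_returns(prices):
    """One-minute log-returns r_j = log(P_j / P_{j-1}) from consecutive closing prices."""
    return [math.log(cur / prev) for prev, cur in zip(prices, prices[1:])]


def sliding_windows(vectors, w=50):
    """Point clouds {X_{t-w+1}, ..., X_t}, one for every index t with a full window."""
    return [vectors[t - w + 1 : t + 1] for t in range(w - 1, len(vectors))]


def abnormality_scores(y, alpha):
    """Z_t = (Y_t - EMA_{t-1}) / sqrt(EMVar_{t-1}) for t = 2, ..., n.

    Because EMVar_1 = 0, the first score is undefined; None is returned
    wherever the previous variance is zero (a choice made for this sketch).
    """
    ema, emvar = y[0], 0.0  # initial values EMA_1 = Y_1, EMVar_1 = 0
    scores = []
    for value in y[1:]:
        delta = value - ema  # delta_t = Y_t - EMA_{t-1}
        scores.append(delta / math.sqrt(emvar) if emvar > 0 else None)
        ema = ema + alpha * delta
        emvar = (1 - alpha) * (emvar + alpha * delta ** 2)
    return scores
```

With the toy input `abnormality_scores([1.0, 2.0, 3.0], alpha=0.5)`, the first score is None and the second is 3.0, since the moving average is 1.5 and the moving variance is 0.25 when the third observation arrives. Note that Z_t is normalized by the variance accumulated before time t, so a jump in Y_t is measured against past variability only.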
5. Conclusion
In this paper, we present a new method for detecting abnormal phenomena based on TDA and time series analysis, and investigate the Flash Crash via the new method. The empirical results from our study reveal that, via the new method based on TDA, we can predict the Flash Crash. Therefore, our results show that TDA can be used not only in forecasting long-term market crash events [5], but also in predicting intraday market crash events.
References

[1] P. Bubenik, Statistical topological data analysis using persistence landscapes, J. Mach. Learn. Res. 16 (2015) 77–102.
[2] P. Bubenik, J.A. Scott, Categorification of persistent homology, Discrete Comput. Geom. 51 (2014) 600–627.
[3] Commodity Futures Trading Commission, Securities & Exchange Commission, Findings regarding the market events of May 6, 2010, 2010 (accessed 10.10.2019).
[4] D. Easley, M.M.L. De Prado, M. O'Hara, The microstructure of the "flash crash": Flow toxicity, liquidity crashes, and the probability of informed trading, J. Portf. Manag. 37 (2011) 118–128.
[5] M. Gidea, Y. Katz, Topological data analysis of financial time series: Landscapes of crashes, Physica A 491 (2018) 820–834.
[6] A. Kirilenko, A.S. Kyle, M. Samadi, T. Tuzun, The flash crash: High-frequency trading in an electronic market, J. Finance 72 (2017) 967–998.
[7] L. Li, W.-Y. Cheng, B.S. Glicksberg, O. Gottesman, R. Tamler, R. Chen, E.P. Bottinger, J.T. Dudley, Identification of type 2 diabetes subgroups through topological analysis of patient similarity, Sci. Transl. Med. 7 (2015) 1–16.
[8] M. Paddrik, R. Hayes, A. Todd, S. Yang, P. Beling, W. Scherer, An agent-based model of the E-Mini S&P 500 applied to Flash Crash analysis, in: Proceedings of the 2012 IEEE Conference on Computational Intelligence for Financial Engineering & Economics, IEEE, New York, NY, 2012, pp. 1–8.
[9] G. Singh, F. Mémoli, G.E. Carlsson, Topological methods for the analysis of high dimensional data sets and 3D object recognition, in: M. Botsch, R. Pajarola, B. Chen, M. Zwicker (Eds.), Symposium on Point-Based Graphics 2007, Taylor & Francis Inc., Natick, MA, 2007, pp. 91–100.

Figure 2: Intraday time series plots of our new abnormality measure, $Z_t$, and prices of S&P 500 futures on May 6, 2010, the day of the Flash Crash.
Figure 3: Intraday time series plots of our new abnormality measure, $Z_t$, and prices of S&P 500 futures on April 1, 8, 13, and 26, 2010.
Figure 4: Intraday time series plots of our new abnormality measure, $Z_t$, and prices of S&P 500 futures on May 7, 12, 19, and 28, 2010.