Optimal Segmented Linear Regression for Financial Time Series Segmentation
Chi-Jen Wu, Wei-Sheng Zeng and Jan-Ming Ho
Institute of Information Science, Academia Sinica
{cjwu}@arbor.ee.ntu.edu.tw, {wszeng, hoho}@iis.sinica.edu.tw

ABSTRACT
Given financial time series data, one of the most fundamental and interesting challenges is to learn the stock dynamics signals hidden in the data. A good example is representing the time series as line segments, which is often used as a pre-processing step for learning market signal patterns in financial computing. In this paper, we focus on the problem of computing the optimal segmentation of such time series based on segmented linear regression models. The major contribution of this paper is to define the Multi-Segment Linear Regression (MSLR) problem of computing the optimal segmentation of a financial time series, such that the global mean square error of the segmented linear regression is minimized. We present an optimal algorithm, OMSLR, with a two-level dynamic programming (DP) design and show its optimality. The two-level DP design of the OMSLR algorithm can mitigate the complexity of searching for the best trading strategies in financial markets. It runs in $O(kn^2)$ time, where $n$ is the length of the time series and $k$ is the number of non-overlapping segments that cover all data points.

Index Terms — time series, financial signal processing, segmented linear regression, stock market signal.
1. INTRODUCTION
Learning investment or trading signals from financial market data is one of the most fundamental and interesting research challenges in both academia and industry [1, 2]. For example, the American hedge fund Renaissance Technologies has leveraged financial signal processing technologies in stock trading for a long time. However, financial time series are difficult to summarize or represent due to their highly non-stationary nature [3]. Given financial time series data, an initial processing step of learning the signal patterns is often to represent the time series as line segments to alleviate data uncertainty and noise [4].

In a segmentation process, a time series is divided into $k$ non-overlapping segments, and each segment is represented by a model that describes the data points in the segment. The segment representation is measured using error functions that depend on the requirements of the application. Time series segmentation is widely used for dimensionality reduction in economics, engineering and science [5].

Time series segmentation has been extensively discussed in different domains and with various models, resulting in a large number of works [6, 7]. The first version of the time series segmentation problem was reported in 1961 [8], where a dynamic programming (DP) algorithm with time complexity $O(kn^2)$ is also described. Time series segmentation also arises in data mining applications. The article [9] gives a review of applications of segmentation methods in data mining. The methods are classified into three categories: sliding-window, bottom-up and top-down methods. The experimental comparisons showed that the bottom-up method performs better than the other methods.

In the past few years, a few algorithms [10, 11, 12, 13] have been proposed to reduce the time complexity of the time series segmentation problem. The objective is to simplify the representation of large-scale time series data.
Piecewise Linear Approximation (PLA) [10] is a widely used approach for the segmentation task. Acharya et al. [12] presented near-linear time algorithms that achieve a significant improvement over the DP approach on large time series. Interested readers can refer to Esling and Agon [7], who present a survey on approximate segmentation of time series. To the best of our knowledge, previous approaches have not addressed the problem of optimal segmentation of financial time series. Most of them discuss segmentation methods in terms of approximate representation [10], on-line processing [11] and their time complexity [12].

In this paper, we are interested in the open question [14] of how to best choose $k$, the optimal number of segments used to represent a particular time series. For financial trading strategies, $k$ is a measure of the number of changes in market trend. It is also an indicator of how many times to trade in the market while receiving a reasonable amount of trading profit. Instead of answering the open question directly, we start by minimizing the global square error for a given $k$, and also derive the optimal representation of each of the $k$ segments.

Firstly, we formulate the Multi-Segment Linear Regression (MSLR) problem and define the MSLR square error as the performance index. Then, we present the Optimal Multi-Segment Linear Regression (OMSLR) algorithm, a two-level DP approach for producing the globally optimal segmentation. Finally, we show the optimality of the proposed OMSLR algorithm. The time complexity of the OMSLR algorithm is $O(kn^2)$, where $n$ is the length of the time series and $k$ is the number of non-overlapping segments that cover all data points. To the best of our knowledge, this paper is the first to investigate the globally optimal segmentation problem in time series processing, especially for financial time series.

This paper is organized as follows.
In Section 2 we present the formulation of segmentation as an optimization problem, named the MSLR problem. In Section 3 we present the OMSLR algorithm. Segmentation experiments are presented in Section 4, and the results are summarized in Section 5.
2. FORMULATION OF PROBLEM MSLR
A formal definition of the Multi-Segment Linear Regression (MSLR) problem is described in this section. Given a time series $X = \{x_1, x_2, \ldots, x_n\}$ and an integer $k$, the objective of the MSLR problem is to partition $X$ into $k$ contiguous and non-overlapping intervals, i.e., $[l_{i-1}, l_i)$ for $1 \le i \le k-1$ and $[l_{k-1}, l_k]$, where $l_0 = 1$, $l_k = n$, $1 \le l_i \le n$, $l_i \in \mathbb{N}$, such that the multi-segment linear regression square error $\psi(1, n \mid \phi_k(X_n))$ with respect to the $k$-segment partition $\phi_k(X_n) = \{1, l_1, \ldots, l_k\}$ is minimized. Note that $\psi(1, n \mid \phi_k(X_n))$ is also denoted as the Global Mean Square Error of the multi-segment linear regression representation, or GMSE for short. In other words, we have

\[
\psi(X_n \mid \phi_k(X_n)) = \sum_{i=1}^{k-1} \sigma(l_{i-1}, l_i - 1) + \sigma(l_{k-1}, l_k)
= \psi(X_{l_{k-1}-1} \mid \phi_{k-1}(X_{l_{k-1}-1})) + \sigma(l_{k-1}, l_k), \tag{1}
\]

where $\sigma(i, j) = \sum_{m=i}^{j} (x_m - \mu(i, j, m))^2$ and $\mu(i, j, m) = \beta_{ij} \, m + \alpha_{ij}$, $i \le m \le j$, with $\alpha_{ij}$ and $\beta_{ij}$ being the linear regression parameters on the interval $[i, j]$ of the time series $X_n$, i.e., $X_{ij} = \{x_i, x_{i+1}, \ldots, x_j\}$. Thus we have $\alpha_{ij}$, $\beta_{ij}$ and $\sigma(i, j)$ as follows:

\[
\alpha_{ij} = \bar{x}_{ij} - \beta_{ij} \, \bar{t}_{ij}, \qquad
\beta_{ij} = \frac{\sum_{m=i}^{j} (x_m - \bar{x}_{ij})(m - \bar{t}_{ij})}{\sum_{m=i}^{j} (m - \bar{t}_{ij})^2}, \qquad
\sigma(i, j) = \sum_{m=i}^{j} (x_m - \beta_{ij} \, m - \alpha_{ij})^2, \tag{2}
\]

where $\bar{t}_{ij} = (i + j)/2$ and $\bar{x}_{ij} = \sum_{m=i}^{j} x_m / (j - i + 1)$. It can be shown that the above equations can be rewritten in iterative forms so that they can be computed in $O(n^2)$ time for all $1 \le i \le j \le n$. Due to space limitations we skip the detailed derivations here.
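As an illustrative sketch (not the authors' code), the single-segment regression parameters and squared error of Equation (2) can be computed directly; the function below assumes the time index of $x_m$ is simply $m$ (0-based here for convenience):

```python
def segment_error(x, i, j):
    """Least-squares line fit over x[i..j] (inclusive, 0-based) and its
    squared error sigma(i, j), mirroring Equation (2).  Sketch only:
    the time index of x[m] is taken to be m itself."""
    pts = range(i, j + 1)
    n_seg = j - i + 1
    t_bar = (i + j) / 2.0                       # mean time index
    x_bar = sum(x[m] for m in pts) / n_seg      # mean value
    denom = sum((m - t_bar) ** 2 for m in pts)
    beta = (sum((x[m] - x_bar) * (m - t_bar) for m in pts) / denom
            if denom > 0 else 0.0)              # slope (flat if one point)
    alpha = x_bar - beta * t_bar                # intercept
    sigma = sum((x[m] - beta * m - alpha) ** 2 for m in pts)
    return alpha, beta, sigma
```

A production version would maintain running sums of $x_m$, $m$, $x_m m$, and $m^2$ so that each $\sigma(i, j)$ is obtained in $O(1)$ after $O(n)$ preprocessing, which is what makes the $O(n^2)$ table of all pairs feasible.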
3. OMSLR ALGORITHM
We present the OMSLR algorithm for the MSLR problem as follows. Given a time series $X = \{x_1, x_2, \ldots, x_n\}$ and an integer $k$, the OMSLR algorithm iteratively segments the time series $X_j = \{x_1, x_2, \ldots, x_j\}$, where $1 \le j \le n$, into $i$ segments, starting with $i = 1$ up to $i = k$. Since Equation (1) is an iterative function, we design a DP algorithm to compute the matrix $M$, in which $M[i, j] = (\gamma_{i,j}, \rho_{i,j})$, for $i = 1 \to k$, as a representation of the best way of partitioning $X_j$ into $i$ segments, for all $j$, $1 \le j \le n$. Here $\gamma_{i,j}$, $1 \le \gamma_{i,j} \le j$, denotes the starting point of the last segment of $\hat{\phi}_i(X_j)$, and $\rho_{i,j}$ is the global mean square error of the $i$-segment partition of $X_j$ based on $\hat{\phi}_i(X_j)$. $\gamma_{i,j}$ and $\rho_{i,j}$ can be computed by the following recurrences:

\[
\gamma_{i,j} = \arg\min_{i \le d \le j} \left[ \rho_{i-1,d-1} + \sigma(d, j) \right], \qquad
\rho_{i,j} = \rho_{i-1,\gamma_{i,j}-1} + \sigma(\gamma_{i,j}, j). \tag{3}
\]

Algorithm 1: OMSLR
Input: $X$, a time series data set; $n$, the length of $X$; $k$, the number of segments
Initialize: $\sigma \gets$ the pre-computed matrix of single-segment errors; $\gamma_{1,j} \gets 1$, $\rho_{1,j} \gets \sigma(1, j)$ for all $j$
/* 2-segment linear regression */
for $j = 1, 2, \ldots, n$ do
  $\gamma_{2,j} \gets \arg\min_{1 < d \le j} [\sigma(1, d-1) + \sigma(d, j)]$; $\rho_{2,j} \gets \sigma(1, \gamma_{2,j}-1) + \sigma(\gamma_{2,j}, j)$
end for
/* dynamic programming for $k \ge 3$ */
for $i = 3, \ldots, k$ do
  for $j = 1, 2, \ldots, n$ do
    $\gamma_{i,j} \gets \arg\min_{i \le d \le j} [\rho_{i-1,d-1} + \sigma(d, j)]$; $\rho_{i,j} \gets \rho_{i-1,\gamma_{i,j}-1} + \sigma(\gamma_{i,j}, j)$
  end for
end for

Theorem 1. Given a time series $X_n = \{x_1, x_2, \ldots, x_n\}$ and the number of segments $k$, the $i$-segment partition $\hat{\phi}_i(X_j)$, for all $j$, $1 \le j \le n$, and all $i$, $1 \le i \le k$, as computed by Algorithm OMSLR is optimal.

Proof. We give a sketch of the proof and prove Theorem 1 by induction, using contradiction in the induction step. We skip the case $k = 1$, which is plain linear regression. For the case $k = 2$, it is obvious that $\hat{\phi}_2(X_j)$ is optimal for all $j$, $1 \le j \le n$, since Algorithm OMSLR enumerates all feasible solutions.

In the induction step, we assume that $\hat{\phi}_i(X_j)$ is optimal for all $j$, $1 \le j \le n$. To show that $\hat{\phi}_{i+1}(X_j)$ is also optimal for all $j$, suppose there exists an integer $\tau$, $1 \le \tau \le n$, such that $\hat{\phi}_{i+1}(X_\tau)$ is not optimal. Let $\phi^*_{i+1}(X_\tau) = \{1, l^*_1, \ldots, l^*_i, l^*_{i+1} = \tau\}$ be the optimal $(i+1)$-segment partition of $X_\tau$. Then we have the following inequality:

\[
\psi(X_\tau \mid \hat{\phi}_{i+1}(X_\tau)) > \psi(X_\tau \mid \phi^*_{i+1}(X_\tau)). \tag{4}
\]

The induction hypothesis says that the $i$-segment partition $\hat{\phi}_i(X_{l^*_i - 1})$ is optimal, which implies Equation (5):

\[
\psi(X_{l^*_i - 1} \mid \phi^*_i(X_{l^*_i - 1})) = \psi(X_{l^*_i - 1} \mid \hat{\phi}_i(X_{l^*_i - 1})), \tag{5}
\]

and Algorithm OMSLR also guarantees Equation (6):

\[
\psi(X_\tau \mid \hat{\phi}_{i+1}(X_\tau)) \le \psi(X_{l^*_i - 1} \mid \hat{\phi}_i(X_{l^*_i - 1})) + \sigma(l^*_i, \tau)
= \psi(X_{l^*_i - 1} \mid \phi^*_i(X_{l^*_i - 1})) + \sigma(l^*_i, \tau)
= \psi(X_\tau \mid \phi^*_{i+1}(X_\tau)). \tag{6}
\]

Equations (5) and (6) contradict the assumption of Equation (4). Thus we have Theorem 1.

The running time of Algorithm OMSLR is easily seen to be $O(kn^2)$. Due to space limitations, we omit the detailed proof of Theorem 2.

Theorem 2. The running time of Algorithm OMSLR is at most $O(kn^2)$ for $X_n = \{x_1, x_2, \ldots, x_n\}$, where $k$ is the number of non-overlapping segments of $X_n$.

To illustrate how Algorithm OMSLR handles the $k$-segment partition of $X_n$, we show step-by-step results in Fig. 1. To keep things simple we assume $k = 5$ and $n = 22$ in this example. In Fig. 1, each step of the $i$-segment partition is demonstrated, for $i = 2 \to 5$. As shown by Theorem 1, each $i$-segment partition is an optimal result.

Fig. 1: Step-by-step results of OMSLR, for $k = 2 \to 5$.

As presented in Sections 2 and 3, the first step of the OMSLR algorithm is to generate the matrix $M$, which can be computed iteratively based on Equation (2) in a DP manner, so that the algorithm, by backtracking the matrix $M$, can derive each optimal $i$-segment partition by a DP approach as shown in Algorithm 1.

Fig. 2: The experimental results. (a) The segmentations of OMSLR and Bottom-up. (b) The comparison of GMSE.
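The two-level design can be sketched in code. The following is an illustrative reconstruction under our reading of the recurrence (not the authors' implementation): level 1 precomputes the single-segment errors, level 2 fills the $(\gamma, \rho)$ tables, and backtracking recovers the breakpoints.

```python
def omslr(x, k):
    """Two-level DP sketch of OMSLR.  rho[i][j] is the minimal total
    squared error of an (i+1)-segment partition of x[0..j]; gamma[i][j]
    is where its last segment starts.  O(k n^2) once each sigma(i, j)
    is available in O(1)."""
    n = len(x)
    # level 1: single-segment errors (a real implementation would update
    # the regression sums incrementally instead of re-summing each pair)
    sigma = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            pts = range(i, j + 1)
            t_bar = (i + j) / 2.0
            x_bar = sum(x[m] for m in pts) / (j - i + 1)
            beta = (sum((x[m] - x_bar) * (m - t_bar) for m in pts)
                    / sum((m - t_bar) ** 2 for m in pts))
            alpha = x_bar - beta * t_bar
            sigma[i][j] = sum((x[m] - beta * m - alpha) ** 2 for m in pts)
    # level 2: the DP recurrence over gamma/rho
    rho = [[float("inf")] * n for _ in range(k)]
    gamma = [[0] * n for _ in range(k)]
    rho[0] = sigma[0][:]                # one segment: plain regression
    for i in range(1, k):
        for j in range(i, n):
            for d in range(i, j + 1):   # last segment is x[d..j]
                cost = rho[i - 1][d - 1] + sigma[d][j]
                if cost < rho[i][j]:
                    rho[i][j], gamma[i][j] = cost, d
    # backtrack the breakpoints of the k-segment partition of x
    bounds, j = [], n - 1
    for i in range(k - 1, 0, -1):
        bounds.append(gamma[i][j])
        j = gamma[i][j] - 1
    return sorted(bounds), rho[k - 1][n - 1]
```

Because the tables hold every $i$-segment partition of every prefix, any $\hat{\phi}_i(X_j)$ with $i \le k$ can be read off by backtracking without recomputation, which is the property the two-level design trades memory for.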
Algorithm 1 is clearly a two-level DP design. Leveraging the two-level design, the OMSLR algorithm can return any $i$-segment partition of $X_n$ with a reasonable size of $i$ without having to re-compute from scratch. Due to its low complexity, the OMSLR algorithm offers an opportunity to search for the best trading strategies in financial computing.

4. EXPERIMENTAL RESULTS

We provide an experimental evaluation of two algorithms, i.e., our OMSLR and the Bottom-up algorithm [9], examining the performance in terms of the Global Mean Square Error (GMSE), as defined in Section 2, with respect to the value of $k$. The results are summarized in Fig. 2.

In the first experiment, we compare the step-by-step segment partitions as $k$ varies from 2 to 5 in Fig. 2(a). For illustration purposes, we plot a time series with a small sample size. The data contains only 42 data points and spans the period from 2008-08-01 to 2008-09-30, selected from S&P 500 index historical daily price data. Fig. 2(a) shows that OMSLR has smaller GMSE than the Bottom-up algorithm. It also shows that OMSLR always maintains optimality in partitioning the time series into a multi-segment linear representation for each value of $k$.

In the second experiment, we focus on analyzing the relationship between $k$ and GMSE with a large sample size. We use S&P 500 index historical 1-minute price data from 2010-07-01 to 2010-07-07, with a total of 1,560 data points. We compare the GMSE calculated by the OMSLR and Bottom-up algorithms for each $k$ from 1 to 200. Fig. 2(b) demonstrates that the GMSE generated by both algorithms decreases monotonically, and sharply at the beginning. Therefore, a search method can be designed to locate the best value of $k$ under a given GMSE bound, since the curve is a monotonically decreasing function.
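Since GMSE is non-increasing in $k$, such a search can be a simple binary search for the smallest $k$ meeting the bound. In this sketch, `gmse_of` is a hypothetical callable (e.g. a wrapper that runs OMSLR with $k$ segments and returns its GMSE), not part of the paper:

```python
def min_segments(gmse_of, k_max, bound):
    """Smallest k in [1, k_max] with gmse_of(k) <= bound, assuming
    gmse_of is non-increasing in k (as observed in Fig. 2(b)).
    Returns None if even k_max segments cannot meet the bound."""
    lo, hi = 1, k_max
    if gmse_of(hi) > bound:
        return None
    while lo < hi:
        mid = (lo + hi) // 2
        if gmse_of(mid) <= bound:
            hi = mid          # mid is feasible; try fewer segments
        else:
            lo = mid + 1      # infeasible; need more segments
    return lo
```

Each probe costs one segmentation, so locating $k$ takes $O(\log k_{\max})$ runs rather than the linear sweep used in the experiment.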
Compared to the Bottom-up algorithm, OMSLR requires a much smaller number of segments to find a multi-segment linear regression representation of the given time series that satisfies a given GMSE bound.

5. CONCLUSION AND FUTURE WORK

In this paper we study the problem of optimal segmentation of financial time series based on segmented linear regression models. We present the OMSLR algorithm based on a two-level DP design. We show that the algorithm is optimal with time complexity $O(kn^2)$. We also demonstrate its application in analyzing financial time series. The representation generated by the OMSLR algorithm may be fed into other intelligent applications, e.g., to predict the future trend of a financial market. The algorithm may also find further applications, e.g., as a benchmark for other on-line stock trading algorithms [15]. The on-line version of the OMSLR algorithm can also be used in stock trading. We may also use the algorithm to process data from other application domains, such as medical and data science applications.

REFERENCES

[1] John J. Murphy, "Technical Analysis of the Financial Markets: A Comprehensive Guide to Trading Methods and Applications," New York Institute of Finance, January 1999.
[2] Ali N. Akansu, Sanjeev R. Kulkarni, and Dmitry M. Malioutov, "Financial Signal Processing and Machine Learning," Wiley-IEEE Press, May 2016.
[3] Y. Abu-Mostafa and A. F. Atiya, "Introduction to financial forecasting," Applied Intelligence, vol. 6, pp. 205–213, 2004.
[4] Victor Lavrenko, Matt Schmill, Dawn Lawrie, Paul Ogilvie, David Jensen, and James Allan, "Mining of concurrent text and time series," in Proceedings of the 6th ACM SIGKDD Workshop on Text Mining, 2000, pp. 37–44, ACM.
[5] Ella Bingham, Aristides Gionis, Niina Haiminen, Heli Hiisilä, Heikki Mannila, and Evimaria Terzi, "Segmentation and dimensionality reduction," in Proceedings of the 2006 SIAM International Conference on Data Mining, pp. 372–383.
[6] Jessica Lin, Eamonn Keogh, Li Wei, and Stefano Lonardi, "Experiencing SAX: a novel symbolic representation of time series," Data Mining and Knowledge Discovery, vol. 15, no. 2, pp. 107–144, 2007.
[7] Philippe Esling and Carlos Agon, "Time-series data mining," ACM Computing Surveys, vol. 45, no. 1, Dec. 2012.
[8] Richard Bellman, "On the approximation of curves by line segments using dynamic programming," Communications of the ACM, vol. 4, no. 6, p. 284, June 1961.
[9] Eamonn Keogh, Selina Chu, David Hart, and Michael Pazzani, "Segmenting time series: A survey and novel approach," in Data Mining in Time Series Databases, 2004, pp. 1–21.
[10] H. Shatkay and S. B. Zdonik, "Approximate queries and representations for large data sequences," in Proceedings of the Twelfth International Conference on Data Engineering, 1996, pp. 536–545.
[11] Guy Rosman, Mikhail Volkov, Danny Feldman, John W. Fisher III, and Daniela Rus, "Coresets for k-segmentation of streaming data," in Proceedings of the 27th International Conference on Neural Information Processing Systems (NIPS'14), Volume 1, Cambridge, MA, USA, 2014, pp. 559–567, MIT Press.
[12] Jayadev Acharya, Ilias Diakonikolas, Jerry Li, and Ludwig Schmidt, "Fast algorithms for segmented regression," in Proceedings of the 33rd International Conference on Machine Learning (ICML'16), Volume 48, 2016, pp. 2878–2886, JMLR.org.
[13] Evimaria Terzi and Panayiotis Tsaparas, "Efficient algorithms for sequence segmentation," in Proceedings of the Sixth SIAM International Conference on Data Mining, 2006.
[14] Eamonn J. Keogh and Michael J. Pazzani, "An enhanced representation of time series which allows fast and accurate classification, clustering and relevance feedback," in Proceedings of the Fourth International Conference on Knowledge Discovery and Data Mining, 1998, pp. 239–243, AAAI Press.
[15] L. Conegundes and A. C. M. Pereira, "Beating the stock market with a deep reinforcement learning day trading system," in 2020 International Joint Conference on Neural Networks (IJCNN).