Optimal Segmented Linear Regression for Financial Time Series Segmentation
Chi-Jen Wu, Wei-Sheng Zeng and Jan-Ming Ho
Institute of Information Science, Academia Sinica
{cjwu}@arbor.ee.ntu.edu.tw, {wszeng, hoho}@iis.sinica.edu.tw

ABSTRACT
Given financial time series data, one of the most fundamental and interesting challenges is to learn the stock dynamics signals hidden in the data. A good example is representing the time series as line segments, which is often used as a pre-processing step for learning market signal patterns in financial computing. In this paper, we focus on the problem of computing the optimal segmentation of such time series based on segmented linear regression models. The major contribution of this paper is to define the Multi-Segment Linear Regression (MSLR) problem of computing the optimal segmentation of a financial time series, such that the global mean square error of the segmented linear regression is minimized. We present an optimal algorithm, OMSLR, with a two-level dynamic programming (DP) design and show its optimality. The two-level DP design of the OMSLR algorithm can mitigate the complexity of searching for the best trading strategies in financial markets. It runs in $O(kn^2)$ time, where $n$ is the length of the time series and $k$ is the number of non-overlapping segments that cover all data points.

Index Terms — time series, financial signal processing, segmented linear regression, stock market signal.
1. INTRODUCTION
Learning investment or trading signals from financial market data is one of the most fundamental and interesting research challenges in both academia and industry [1, 2]. For example, the American hedge fund Renaissance Technologies has leveraged financial signal processing technologies in stock trading for a long time. However, financial time series are difficult to summarize or represent due to their highly non-stationary nature [3]. Given financial time series data, an initial processing step of learning the signal patterns is often to represent the time series as line segments to alleviate data uncertainty and noise [4].

In a segmentation process, a time series is divided into $k$ non-overlapping segments, and each segment is represented by a model that describes the data points in the segment. The segment representation is measured using error functions that depend on the requirements of the application. Time series segmentation is widely used for dimensionality reduction in economics, engineering and science [5].

Time series segmentation has been extensively discussed in different domains and with various models, resulting in a large number of works [6, 7]. The first version of the time series segmentation problem was reported in 1961 [8], where a dynamic programming (DP) algorithm with time complexity $O(kn^2)$ is also described. Time series segmentation also arises in data mining applications. The article [9] gives a review of applications of segmentation methods in data mining. The methods are classified into three categories: sliding-window, bottom-up and top-down methods. The experimental comparisons showed that the bottom-up method performs better than the other methods.

In the past few years, a few algorithms [10, 11, 12, 13] have been proposed to reduce the time complexity of the time series segmentation problem. The objective is to simplify the representation of large-scale time series data.
Piecewise Linear Approximation (PLA) [10] is a widely used approach for the segmentation task. Acharya et al. [12] presented near-linear time algorithms that achieve a significant improvement over the DP approach on large time series. Interested readers can refer to Esling and Agon [7], who present a survey on approximate segmentation of time series. To the best of our knowledge, previous approaches have not addressed the problem of optimal segmentation of financial time series. Most of them discuss segmentation methods in terms of approximate representation [10], on-line processing [11] and their time complexity [12].

In this paper, we are interested in the open question [14] of how to best choose $k$, the optimal number of segments used to represent a particular time series. For financial trading strategies, $k$ is a measure of the number of changes in market trend. It is also an indicator of how many times to trade in the market while receiving a reasonable amount of trading profit. Instead of answering the open question directly, we start by minimizing the global square error for a given $k$, and also derive the optimal representation of each of the $k$ segments.

Firstly, we formulate the Multi-Segment Linear Regression (MSLR) problem and define the MSLR square error as the performance index. Then, we present the Optimal Multi-Segment Linear Regression (OMSLR) algorithm, a two-level DP approach for producing the globally optimal segmentation. Finally, we show the optimality of the proposed OMSLR algorithm. The time complexity of the OMSLR algorithm is $O(kn^2)$, where $n$ is the length of the time series and $k$ is the number of non-overlapping segments that cover all data points. To the best of our knowledge, this paper is the first to investigate the globally optimal segmentation problem in time series processing, especially for financial time series.

This paper is organized as follows.
In Section 2 we present the formulation of segmentation as an optimization problem, named the MSLR problem. In Section 3 we present the OMSLR algorithm. Segmentation experiments are presented in Section 4, and the results are summarized in Section 5.
2. FORMULATION OF PROBLEM MSLR
A formal definition of the Multi-Segment Linear Regression (MSLR) problem is described in this section. Given a time series $X = \{x_1, x_2, \ldots, x_n\}$ and an integer $k$, the objective of the MSLR problem is to partition $X$ into $k$ contiguous and non-overlapping intervals, i.e., $[l_{i-1}, l_i)$ for $1 \le i \le k-1$ and $[l_{k-1}, l_k]$, where $l_0 = 1$, $l_k = n$, $1 \le l_i \le n$, $l_i \in \mathbb{N}$, such that the multi-segment linear regression square error $\psi(1, n \mid \phi_k(X_n))$ with respect to the $k$-segment partition $\phi_k(X_n) = \{1, l_1, \ldots, l_k\}$ is minimized. Note that $\psi(1, n \mid \phi_k(X_n))$ is also denoted as the Global Mean Square Error of the multi-segment linear regression representation, or GMSE for short. In other words, we have

\[
\psi(X_n \mid \phi_k(X_n)) = \sum_{i=1}^{k-1} \sigma(l_{i-1}, l_i - 1) + \sigma(l_{k-1}, l_k)
= \psi(X_{l_{k-1}-1} \mid \phi_{k-1}(X_{l_{k-1}-1})) + \sigma(l_{k-1}, l_k), \tag{1}
\]

where $\sigma(i, j) = \sum_{m=i}^{j} (x_m - \mu(i, j, m))^2$ and $\mu(i, j, m) = \beta_{ij} \, m + \alpha_{ij}$, $i \le m \le j$, with $\alpha_{ij}$ and $\beta_{ij}$ being the linear regression parameters on the interval $[i, j]$ of the time series $X_n$, i.e., $X_{ij} = \{x_i, x_{i+1}, \ldots, x_j\}$. Thus we have $\alpha_{ij}$, $\beta_{ij}$ and $\sigma(i, j)$ as follows:

\[
\alpha_{ij} = \bar{x}_{ij} - \beta_{ij} \, \bar{t}_{ij}, \qquad
\beta_{ij} = \frac{\sum_{m=i}^{j} (x_m - \bar{x}_{ij})(m - \bar{t}_{ij})}{\sum_{m=i}^{j} (m - \bar{t}_{ij})^2}, \qquad
\sigma(i, j) = \sum_{m=i}^{j} (x_m - \beta_{ij} \, m - \alpha_{ij})^2, \tag{2}
\]

where $\bar{t}_{ij} = (i + j)/2$ and $\bar{x}_{ij} = \sum_{m=i}^{j} x_m / (j - i + 1)$. It can be shown that the above equations can be rewritten in iterative forms so that they can be computed in $O(n^2)$ time for all $1 \le i \le j \le n$. Due to space limitations we skip the detailed derivations here.
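As an illustrative sketch (not the authors' code), the single-segment regression parameters and squared error of Equation (2) can be computed directly; the function below assumes the time index of $x_m$ is simply $m$ (0-based here for convenience):

```python
def segment_error(x, i, j):
    """Least-squares line fit over x[i..j] (inclusive, 0-based) and its
    squared error sigma(i, j), mirroring Equation (2).  Sketch only:
    the time index of x[m] is taken to be m itself."""
    pts = range(i, j + 1)
    n_seg = j - i + 1
    t_bar = (i + j) / 2.0                       # mean time index
    x_bar = sum(x[m] for m in pts) / n_seg      # mean value
    denom = sum((m - t_bar) ** 2 for m in pts)
    beta = (sum((x[m] - x_bar) * (m - t_bar) for m in pts) / denom
            if denom > 0 else 0.0)              # slope (flat if one point)
    alpha = x_bar - beta * t_bar                # intercept
    sigma = sum((x[m] - beta * m - alpha) ** 2 for m in pts)
    return alpha, beta, sigma
```

A production version would maintain running sums of $x_m$, $m$, $x_m m$, and $m^2$ so that each $\sigma(i, j)$ is obtained in $O(1)$ after $O(n)$ preprocessing, which is what makes the $O(n^2)$ table of all pairs feasible.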
3. OMSLR ALGORITHM
We present the OMSLR algorithm for the MSLR problem as follows. Given a time series $X = \{x_1, x_2, \ldots, x_n\}$ and an integer $k$, the OMSLR algorithm iteratively segments the time series $X_j = \{x_1, x_2, \ldots, x_j\}$, where $1 \le j \le n$, into $i$ segments, starting with $i = 1$ up to $i = k$. Since Equation (1) is an iterative function, we design a DP algorithm to compute the matrix $M$, in which $M[i, j] = (\gamma_{i,j}, \rho_{i,j})$, for $i = 1 \to k$, as a representation of the best way of partitioning $X_j$ into $i$ segments, for all $j$, $1 \le j \le n$. Here $\gamma_{i,j}$, $1 \le \gamma_{i,j} \le j$, denotes the starting point of the last segment of $\hat{\phi}_i(X_j)$, and $\rho_{i,j}$ is the global mean square error of the $i$-segment partition of $X_j$ based on $\hat{\phi}_i(X_j)$. $\gamma_{i,j}$ and $\rho_{i,j}$ can be computed by the following recurrences:

\[
\gamma_{i,j} = \arg\min_{i \le d \le j} \left[ \rho_{i-1,d-1} + \sigma(d, j) \right], \qquad
\rho_{i,j} = \rho_{i-1,\gamma_{i,j}-1} + \sigma(\gamma_{i,j}, j). \tag{3}
\]

Algorithm 1: OMSLR
Input: $X$, a time series data set; $n$, the length of $X$; $k$, the number of segments
Initialize: $\sigma \gets$ the pre-computed matrix of single-segment errors; $\gamma_{1,j} \gets 1$, $\rho_{1,j} \gets \sigma(1, j)$ for all $j$
/* 2-segment linear regression */
for $j = 1, 2, \ldots, n$ do
  $\gamma_{2,j} \gets \arg\min_{1 < d \le j} [\sigma(1, d-1) + \sigma(d, j)]$; $\rho_{2,j} \gets \sigma(1, \gamma_{2,j}-1) + \sigma(\gamma_{2,j}, j)$
end for
/* dynamic programming for $k \ge 3$ */
for $i = 3, \ldots, k$ do
  for $j = 1, 2, \ldots, n$ do
    $\gamma_{i,j} \gets \arg\min_{i \le d \le j} [\rho_{i-1,d-1} + \sigma(d, j)]$; $\rho_{i,j} \gets \rho_{i-1,\gamma_{i,j}-1} + \sigma(\gamma_{i,j}, j)$
  end for
end for

Theorem 1. Given a time series $X_n = \{x_1, x_2, \ldots, x_n\}$ and the number of segments $k$, the $i$-segment partition $\hat{\phi}_i(X_j)$, for all $j$, $1 \le j \le n$, and all $i$, $1 \le i \le k$, as computed by Algorithm OMSLR is optimal.

Proof. We give a sketch of the proof and prove Theorem 1 by induction, using contradiction in the induction step. We skip the case $k = 1$, which is plain linear regression. For the case $k = 2$, it is obvious that $\hat{\phi}_2(X_j)$ is optimal for all $j$, $1 \le j \le n$, since Algorithm OMSLR enumerates all feasible solutions.

In the induction step, we assume that $\hat{\phi}_i(X_j)$ is optimal for all $j$, $1 \le j \le n$. To show that $\hat{\phi}_{i+1}(X_j)$ is also optimal for all $j$, suppose there exists an integer $\tau$, $1 \le \tau \le n$, such that $\hat{\phi}_{i+1}(X_\tau)$ is not optimal. Let $\phi^*_{i+1}(X_\tau) = \{1, l^*_1, \ldots, l^*_i, l^*_{i+1} = \tau\}$ be the optimal $(i+1)$-segment partition of $X_\tau$. Then we have the following inequality:

\[
\psi(X_\tau \mid \hat{\phi}_{i+1}(X_\tau)) > \psi(X_\tau \mid \phi^*_{i+1}(X_\tau)). \tag{4}
\]

The induction hypothesis says that the $i$-segment partition $\hat{\phi}_i(X_{l^*_i - 1})$ is optimal, which implies Equation (5):

\[
\psi(X_{l^*_i - 1} \mid \phi^*_i(X_{l^*_i - 1})) = \psi(X_{l^*_i - 1} \mid \hat{\phi}_i(X_{l^*_i - 1})), \tag{5}
\]

and Algorithm OMSLR also guarantees Equation (6):

\[
\psi(X_\tau \mid \hat{\phi}_{i+1}(X_\tau)) \le \psi(X_{l^*_i - 1} \mid \hat{\phi}_i(X_{l^*_i - 1})) + \sigma(l^*_i, \tau)
= \psi(X_{l^*_i - 1} \mid \phi^*_i(X_{l^*_i - 1})) + \sigma(l^*_i, \tau)
= \psi(X_\tau \mid \phi^*_{i+1}(X_\tau)). \tag{6}
\]

Equations (5) and (6) contradict the assumption of Equation (4). Thus we have Theorem 1.

The running time of Algorithm OMSLR is easily seen to be $O(kn^2)$. Due to space limitations, we omit the detailed proof of Theorem 2.

Theorem 2. The running time of Algorithm OMSLR is at most $O(kn^2)$ for $X_n = \{x_1, x_2, \ldots, x_n\}$, where $k$ is the number of non-overlapping segments of $X_n$.

To illustrate how Algorithm OMSLR handles the $k$-segment partition of $X_n$, we show step-by-step results in Fig. 1. To keep things simple we assume $k = 5$ and $n = 22$ in this example. In Fig. 1, each step of the $i$-segment partition is demonstrated, for $i = 2 \to 5$. As shown by Theorem 1, each $i$-segment partition is an optimal result.

Fig. 1: Step-by-step results of OMSLR, for $k = 2 \to 5$.

As presented in Sections 2 and 3, the first step of the OMSLR algorithm is to generate the matrix $M$, which can be computed iteratively based on Equation (2) in a DP manner, so that the algorithm, by backtracking the matrix $M$, can derive each optimal $i$-segment partition by a DP approach as shown in Algorithm 1.

Fig. 2: The experimental results. (a) The segmentations of OMSLR and Bottom-up. (b) The comparison of GMSE.
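The two-level design can be sketched in code. The following is an illustrative reconstruction under our reading of the recurrence (not the authors' implementation): level 1 precomputes the single-segment errors, level 2 fills the $(\gamma, \rho)$ tables, and backtracking recovers the breakpoints.

```python
def omslr(x, k):
    """Two-level DP sketch of OMSLR.  rho[i][j] is the minimal total
    squared error of an (i+1)-segment partition of x[0..j]; gamma[i][j]
    is where its last segment starts.  O(k n^2) once each sigma(i, j)
    is available in O(1)."""
    n = len(x)
    # level 1: single-segment errors (a real implementation would update
    # the regression sums incrementally instead of re-summing each pair)
    sigma = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            pts = range(i, j + 1)
            t_bar = (i + j) / 2.0
            x_bar = sum(x[m] for m in pts) / (j - i + 1)
            beta = (sum((x[m] - x_bar) * (m - t_bar) for m in pts)
                    / sum((m - t_bar) ** 2 for m in pts))
            alpha = x_bar - beta * t_bar
            sigma[i][j] = sum((x[m] - beta * m - alpha) ** 2 for m in pts)
    # level 2: the DP recurrence over gamma/rho
    rho = [[float("inf")] * n for _ in range(k)]
    gamma = [[0] * n for _ in range(k)]
    rho[0] = sigma[0][:]                # one segment: plain regression
    for i in range(1, k):
        for j in range(i, n):
            for d in range(i, j + 1):   # last segment is x[d..j]
                cost = rho[i - 1][d - 1] + sigma[d][j]
                if cost < rho[i][j]:
                    rho[i][j], gamma[i][j] = cost, d
    # backtrack the breakpoints of the k-segment partition of x
    bounds, j = [], n - 1
    for i in range(k - 1, 0, -1):
        bounds.append(gamma[i][j])
        j = gamma[i][j] - 1
    return sorted(bounds), rho[k - 1][n - 1]
```

Because the tables hold every $i$-segment partition of every prefix, any $\hat{\phi}_i(X_j)$ with $i \le k$ can be read off by backtracking without recomputation, which is the property the two-level design trades memory for.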
Algorithm 1 is clearly a two-level DP design. Leveraging the two-level design, the OMSLR algorithm can return any $i$-segment partition of $X_n$ with a reasonable size of $i$ without having to re-compute from scratch. Due to its low complexity, the OMSLR algorithm offers an opportunity to search for the best trading strategies in financial computing.

4. EXPERIMENTAL RESULTS

We provide an experimental evaluation of two algorithms, i.e., our OMSLR and the Bottom-up algorithm [9], examining the performance in terms of the Global Mean Square Error (GMSE), as defined in Section 2, with respect to the value of $k$. The results are summarized in Fig. 2.

In the first experiment, we compare the step-by-step segment partitions as $k$ varies from 2 to 5 in Fig. 2(a). For illustration purposes, we plot a time series with a small sample size. The data contains only 42 data points and spans the period from 2008-08-01 to 2008-09-30, selected from S&P 500 index historical daily price data. Fig. 2(a) shows that OMSLR has smaller GMSE than the Bottom-up algorithm. It also shows that OMSLR always maintains optimality in partitioning the time series into a multi-segment linear representation for each value of $k$.

In the second experiment, we focus on analyzing the relationship between $k$ and GMSE with a large sample size. We use S&P 500 index historical 1-minute price data from 2010-07-01 to 2010-07-07, with a total of 1,560 data points. We compare the GMSE calculated by the OMSLR and Bottom-up algorithms for each $k$ from 1 to 200. Fig. 2(b) demonstrates that the GMSE generated by both algorithms decreases monotonically, and sharply at the beginning. Therefore, a search method can be designed to locate the best value of $k$ under a given GMSE bound, since the curve is a monotonically decreasing function.
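Since GMSE is non-increasing in $k$, such a search can be a simple binary search for the smallest $k$ meeting the bound. In this sketch, `gmse_of` is a hypothetical callable (e.g. a wrapper that runs OMSLR with $k$ segments and returns its GMSE), not part of the paper:

```python
def min_segments(gmse_of, k_max, bound):
    """Smallest k in [1, k_max] with gmse_of(k) <= bound, assuming
    gmse_of is non-increasing in k (as observed in Fig. 2(b)).
    Returns None if even k_max segments cannot meet the bound."""
    lo, hi = 1, k_max
    if gmse_of(hi) > bound:
        return None
    while lo < hi:
        mid = (lo + hi) // 2
        if gmse_of(mid) <= bound:
            hi = mid          # mid is feasible; try fewer segments
        else:
            lo = mid + 1      # infeasible; need more segments
    return lo
```

Each probe costs one segmentation, so locating $k$ takes $O(\log k_{\max})$ runs rather than the linear sweep used in the experiment.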
Compared to the Bottom-up algorithm, OMSLR requires a much smaller number of segments to find a multi-segment linear regression representation of the given time series that satisfies a given GMSE bound.

5. CONCLUSION AND FUTURE WORK

In this paper we study the problem of optimal segmentation of financial time series based on segmented linear regression models. We present the OMSLR algorithm based on a two-level DP design. We show that the algorithm is optimal with time complexity $O(kn^2)$. We also demonstrate its application in analyzing financial time series. The representation generated by the OMSLR algorithm may be fed into other intelligent applications, e.g., to predict the future trend of a financial market. The algorithm may also find further applications, e.g., as a benchmark for other on-line stock trading algorithms [15]. The on-line version of the OMSLR algorithm can also be used in stock trading. We may also use the algorithm to process data from other application domains, such as medical and data science applications.

REFERENCES

[1] John J. Murphy, "Technical Analysis of the Financial Markets: A Comprehensive Guide to Trading Methods and Applications," New York Institute of Finance, January 1999.
[2] Ali N. Akansu, Sanjeev R. Kulkarni, and Dmitry M. Malioutov, "Financial Signal Processing and Machine Learning," Wiley-IEEE Press, May 2016.
[3] Y. Abu-Mostafa and A. F. Atiya, "Introduction to financial forecasting," Applied Intelligence, vol. 6, pp. 205–213, 2004.
[4] Victor Lavrenko, Matt Schmill, Dawn Lawrie, Paul Ogilvie, David Jensen, and James Allan, "Mining of concurrent text and time series," in Proceedings of the 6th ACM SIGKDD Workshop on Text Mining, 2000, pp. 37–44, ACM.
[5] Ella Bingham, Aristides Gionis, Niina Haiminen, Heli Hiisilä, Heikki Mannila, and Evimaria Terzi, "Segmentation and dimensionality reduction," in Proceedings of the 2006 SIAM International Conference on Data Mining, pp. 372–383.
[6] Jessica Lin, Eamonn Keogh, Li Wei, and Stefano Lonardi, "Experiencing SAX: a novel symbolic representation of time series," Data Mining and Knowledge Discovery, vol. 15, no. 2, pp. 107–144, 2007.
[7] Philippe Esling and Carlos Agon, "Time-series data mining," ACM Computing Surveys, vol. 45, no. 1, Dec. 2012.
[8] Richard Bellman, "On the approximation of curves by line segments using dynamic programming," Communications of the ACM, vol. 4, no. 6, p. 284, June 1961.
[9] Eamonn Keogh, Selina Chu, David Hart, and Michael Pazzani, "Segmenting time series: A survey and novel approach," in Data Mining in Time Series Databases, 2004, pp. 1–21.
[10] H. Shatkay and S. B. Zdonik, "Approximate queries and representations for large data sequences," in Proceedings of the Twelfth International Conference on Data Engineering, 1996, pp. 536–545.
[11] Guy Rosman, Mikhail Volkov, Danny Feldman, John W. Fisher III, and Daniela Rus, "Coresets for k-segmentation of streaming data," in Proceedings of the 27th International Conference on Neural Information Processing Systems (NIPS'14), Volume 1, Cambridge, MA, USA, 2014, pp. 559–567, MIT Press.
[12] Jayadev Acharya, Ilias Diakonikolas, Jerry Li, and Ludwig Schmidt, "Fast algorithms for segmented regression," in Proceedings of the 33rd International Conference on Machine Learning (ICML'16), Volume 48, 2016, pp. 2878–2886, JMLR.org.
[13] Evimaria Terzi and Panayiotis Tsaparas, "Efficient algorithms for sequence segmentation," in Proceedings of the Sixth SIAM International Conference on Data Mining, 2006.
[14] Eamonn J. Keogh and Michael J. Pazzani, "An enhanced representation of time series which allows fast and accurate classification, clustering and relevance feedback," in Proceedings of the Fourth International Conference on Knowledge Discovery and Data Mining, 1998, pp. 239–243, AAAI Press.
[15] L. Conegundes and A. C. M. Pereira, "Beating the stock market with a deep reinforcement learning day trading system," in 2020 International Joint Conference on Neural Networks (IJCNN).