A Statistical Machine Learning Approach to Yield Curve Forecasting
Rajiv Sambasivan, Department of Computer Science, Chennai Mathematical Institute
Sourish Das, Department of Mathematics, Chennai Mathematical Institute

March 7, 2017
Abstract
Yield curve forecasting is an important problem in finance. In this work we explore the use of Gaussian Processes, in conjunction with a dynamic modeling strategy much like the Kalman Filter, to model the yield curve. Gaussian Processes have been successfully applied to model functional data in a variety of applications. A Gaussian Process is used to model the yield curve, and the hyper-parameters of the Gaussian Process model are updated as the algorithm receives yield curve data. Yield curve data are typically available as a time series with a frequency of one day. We compare existing methods to forecast the yield curve with the proposed method. The results of this study show that while a competing method (a multivariate time series method) performed well in forecasting the yields at the short term structure region of the yield curve, Gaussian Processes perform well in the medium and long term structure regions. Accuracy in the long term structure region of the yield curve has important practical implications. The Gaussian Process framework yields uncertainty and probability estimates directly, in contrast to other competing methods; analysts are frequently interested in this information. In this study the proposed method has been applied to yield curve forecasting, but it can also be applied to model high frequency time series data or data streams in other domains.
1 Introduction

Accurate yield curve forecasting is of critical importance in financial applications. Investors watch the bond market closely, as it is a very good predictor of future economic activity and levels of inflation, which in turn affect prices of goods, stocks and real estate. The yield curve is a key representation of the state of the bond market. The slope of the yield curve is an important indicator of short term interest rates and is followed closely by investors (see Nielsen [2017]). As a consequence, it has been the focus of considerable research, and several statistical techniques and tools commonly used in econometrics and finance have been applied to model the yield curve (see, for example, Diebold and Li [2006], Chen and Niu [2014] and Spencer Hays and Huang [2012]). In this work, we take a machine learning perspective on this problem. Gaussian Processes (GP) are a widely used machine learning technique (Rasmussen and Williams [2005]). We propose a dynamic method that uses Gaussian Processes to model the yield curve. Yield curve data can be viewed as functional data, and Gaussian Process regression has been applied with great success in many domains. Results from this study suggest that Gaussian Process regression performs better than methods currently used for yield curve forecasting in the medium and long term regions of the yield curve. Achieving higher accuracy at longer term structures is more difficult than at shorter term structures, because data points in the long term structure region of the yield curve are farther apart than in the short term region. A multivariate time series based approach is also commonly used to model the yield curve; this technique had the best results in the short term region of the yield curve. This suggests that the two techniques could be used together.
The multivariate time series based approach could be used for short term forecasts, and the GP approach could be used for medium and long term forecasting.

The dynamic Gaussian Process method has been applied to model yield curve data in this work. However, functional data presents as a time series in many domains. For example, the hourly user requests processed at a data center could be viewed as functional data, where the hourly user traffic for a day is the variable we wish to forecast. In Das et al. [2016], the daily sea ice surface area in the arctic region, observed in one year periods, is treated as functional data; observed sea ice surface area for years past could be used to forecast the sea ice surface area for a future year. This suggests that the method proposed in this study could be useful in other application domains too. This is an area of future work.

The rest of this paper is organized as follows. In section 2, we present an overview of relevant aspects of functional data analysis. In section 3, the details of the various methods used to model yield curves, including the proposed method, are provided. In section 4, we describe the methodology for validating the performance of the methods for yield curve forecasting. In section 5, we describe the results of the study. Finally, in section 6, we present the conclusions from this study.

2 Functional Data Analysis

In functional data, the data have a functional representation. Yield curve data are represented in terms of the yields associated with a set of term structures. For example, the data for this study consist of 11 terms, with a yield associated with each term. This constitutes a map (a function) with 11 elements between terms and yields. The $i$th yield curve is modeled as a function that maps terms to yields:

$y_i = f(\tau_i) + \epsilon_i.$

The function $f(\tau)$ can be represented as

$f(\tau) = \sum_{k=1}^{K} \beta_k \phi_k(\tau) = \phi\beta, \quad (1)$

and we say $\phi$ is a basis system for $f(\tau)$. That is, $y = \phi\beta + \epsilon$.
Many basis functions have been used for functional representation, each having a particular niche of applications to which it is well suited. The sine-cosine functions of increasing frequencies

$y_i = \beta_0 + \beta_1 \sin(\omega\tau) + \beta_2 \cos(\omega\tau) + \beta_3 \sin(2\omega\tau) + \beta_4 \cos(2\omega\tau) + \cdots + \epsilon_i \quad (2)$

form the Fourier basis, where the constant $\omega = 2\pi/P$ defines the period $P$ of oscillation of the first sine/cosine pair. A comparison of Equation 2 with Equation 1 shows that

$\phi = \{1, \sin(\omega\tau), \cos(\omega\tau), \sin(2\omega\tau), \cos(2\omega\tau), \ldots\}$

is the Fourier basis and $\beta^T = \{\beta_0, \beta_1, \beta_2, \ldots\}$ are the corresponding unknown coefficients. Other bases are:

- Nelson-Siegel basis: $\phi = \{1, \frac{1-e^{-\lambda\tau}}{\lambda\tau}, \frac{1-e^{-\lambda\tau}}{\lambda\tau} - e^{-\lambda\tau}\}$.
- Exponential basis: $\phi = \{1, e^{\lambda_1 t}, e^{\lambda_2 t}, \ldots\}$.
- Gaussian basis: $\phi = \{1, e^{-\lambda(t-c_1)^2}, e^{-\lambda(t-c_2)^2}, \ldots\}$.

To model functional data, we need to pick a basis that is appropriate for the particular problem. Once the basis has been picked, the $\beta$'s in Equation 1 need to be determined; we discuss techniques to do this next. One popular technique to determine the $\beta$'s is Ordinary Least Squares (OLS). This procedure minimizes the sum of squared errors (SSE) between the function and the actual values of the response:

$SSE = (y - \phi\beta)^T (y - \phi\beta).$

The OLS estimate of $\beta$ is $\hat{\beta} = (\phi^T\phi)^{-1}\phi^T y$, and the estimator of $f(t)$ is $\hat{f} = \phi\hat{\beta} = \phi(\phi^T\phi)^{-1}\phi^T y$. The OLS method overfits the model; by overfitting we mean it tries to model the white noise (see Ramsay and Silverman [2002] for a detailed discussion).
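As an illustration, the Fourier representation of Equation 2 and the OLS recovery of its coefficients can be sketched as follows; the grid, period and coefficient values are arbitrary choices for the example:

```python
import numpy as np

def fourier_basis(tau, n_pairs, period):
    """Fourier basis of Equation 2: {1, sin(w tau), cos(w tau), ...}."""
    w = 2.0 * np.pi / period
    cols = [np.ones_like(tau)]
    for j in range(1, n_pairs + 1):
        cols.append(np.sin(j * w * tau))
        cols.append(np.cos(j * w * tau))
    return np.column_stack(cols)

def ols_fit(phi, y):
    """OLS estimate: beta_hat = (phi^T phi)^{-1} phi^T y."""
    beta_hat, *_ = np.linalg.lstsq(phi, y, rcond=None)
    return beta_hat

# Noise-free synthetic data generated from known coefficients
tau = np.linspace(0.0, 10.0, 40)
phi = fourier_basis(tau, n_pairs=2, period=10.0)
beta_true = np.array([1.0, 0.5, -0.3, 0.2, 0.1])
y = phi @ beta_true
beta_hat = ols_fit(phi, y)
```

With noise-free data the OLS estimate recovers the generating coefficients exactly; with noisy data it would also chase the noise, which is the overfitting problem discussed above.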
Regularization is one solution to the overfitting problem. It achieves this by penalizing the complexity of the solution:
$PSSE = (y - \phi\beta)^T (y - \phi\beta) + \lambda P(f),$

where $P(f)$ measures the "roughness" of $f$ and $\lambda$ is a continuous tuning parameter:

- As $\lambda \uparrow \infty$, roughness is increasingly penalized and $f(t)$ becomes smooth.
- As $\lambda \downarrow 0$, the penalty reduces; $f(t)$ models small shocks and tends to overfit as it moves towards OLS.

Essentially, $P(f)$ measures the curvature of $f(t)$.

2.3 The D Operator
We define the $D$-operator as follows:

- $Df(t) = \frac{\partial}{\partial t} f(t)$ is the instantaneous slope of $f(t)$.
- $D^2 f(t) = \frac{\partial^2}{\partial t^2} f(t)$ is the curvature of $f(t)$.

We measure the size of the curvature for all of $f$ by

$P(f) = \int [D^2 f(t)]^2 \, dt = \int \beta^T [D^2\phi(t)][D^2\phi(t)]^T \beta \, dt = \beta^T R \beta,$

where $[R]_{jk} = \int [D^2\phi_j(t)][D^2\phi_k(t)]^T \, dt$ is the penalty matrix. The penalized sum of squared errors function is

$PSSE = (y - \phi\beta)^T (y - \phi\beta) + \lambda \beta^T R \beta.$

One could certainly use a higher order operator as the penalty; that is beyond the scope of this paper. The penalized least squares estimate for $\beta$ is

$\hat{\beta} = (\phi^T\phi + \lambda R)^{-1} \phi^T y.$

Note that this resembles the 'Ridge Estimator'.
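The penalized estimator can be sketched numerically. The polynomial basis, the grid-based approximation of the penalty matrix $R$, and the value of $\lambda$ below are illustrative assumptions, not choices made in the paper:

```python
import numpy as np

def penalized_fit(phi, y, R, lam):
    """Penalized least squares: beta_hat = (phi^T phi + lam R)^{-1} phi^T y."""
    return np.linalg.solve(phi.T @ phi + lam * R, phi.T @ y)

# Toy example: polynomial basis {1, t, t^2, t^3} on a grid
t = np.linspace(0.0, 1.0, 50)
phi = np.column_stack([t**p for p in range(4)])

# R_jk = integral of D^2 phi_j * D^2 phi_k, approximated on the grid;
# D^2 t^p = p (p - 1) t^(p - 2)
d2 = np.column_stack([p * (p - 1) * t**max(p - 2, 0) for p in range(4)])
R = (d2.T @ d2) / len(t)

rng = np.random.default_rng(0)
y = np.sin(2 * np.pi * t) + 0.1 * rng.standard_normal(len(t))

beta_smooth = penalized_fit(phi, y, R, lam=10.0)
beta_ols = penalized_fit(phi, y, R, lam=0.0)  # lam -> 0 recovers OLS
```

By construction, the penalized solution has curvature $\beta^T R \beta$ no larger than the OLS solution, which is the smoothing effect of increasing $\lambda$.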
Parameter learning in the methods discussed above involved learning an optimal representation by minimizing a loss function. These approaches posit that there is a fixed, unique set of parameters associated with the functional representation of the yield curve. A contrasting methodology, the Bayesian methodology, treats these parameters differently: the unknown parameters are assumed to be random variables with a valid probability measure on the parameter space.
Consider the model

$y = f(t) + \epsilon,$

where $\epsilon \sim N(0, \sigma_\epsilon^2 I)$. This implies $y \sim N(f(t), \sigma_\epsilon^2 I)$. The function $f(t)$ has the following representation:

$f(t) = \phi\beta = \sum_{k=1}^{\infty} \phi_k(t) \beta_k.$
We want to estimate $\beta$. We adopt a Bayesian methodology, so we assume the $\beta$'s are uncorrelated random variables and the $\phi_k(t)$ are known deterministic real-valued functions. Then, by the Kosambi-Karhunen-Loeve theorem, $f(t)$ is a stochastic process. If we assume $\beta \sim N(0, \sigma_\beta^2 I)$, then $f(t) = \phi\beta$ follows a Gaussian process, and the induced process on $f(t)$ is known as a 'Gaussian Process Prior'. The prior on $\beta$ is

$p(\beta) \propto \exp\left(-\frac{1}{2\sigma_\beta^2} \beta^T \beta\right).$

The induced prior on $f = \phi\beta$ is

$p(f) \propto \exp\left(-\frac{1}{2} f^T K^{-1} f\right),$

where the prior mean and covariance of $f$ are given by (see Rasmussen and Williams [2005]):

$E[f] = \phi E[\beta] = 0, \quad cov[f] = E[f f^T] = \phi E[\beta \beta^T] \phi^T = \sigma_\beta^2 \phi \phi^T = K.$

An alternative generic formulation of the model is

$f(\tau) = \mu(\tau) + W(\tau), \quad y = \mu(\tau) + W(\tau) + \epsilon,$

where $W(\tau) \sim N(0, K)$ and $\mu(\tau)$ is a parametric mean function. If there are $m$ points, then

$f \sim N_m(\mu(\tau), K), \quad \epsilon \sim N_m(0, \sigma_\epsilon^2 I_m), \quad y \sim N_m(f(\tau), K + \sigma_\epsilon^2 I_m). \quad (3)$

The likelihood function is given by

$L(f \mid y, \phi, \sigma_\epsilon^2) \propto (\sigma_\epsilon^2)^{-m/2} \exp\left(-\frac{1}{2} (y - f)^T [K + \sigma_\epsilon^2 I]^{-1} (y - f)\right),$

and the negative log-likelihood function can then be expressed as

$l(f) \propto \frac{1}{2} (y - f)^T [K + \sigma_\epsilon^2 I]^{-1} (y - f).$

The corresponding negative log-posterior function is

$p(f) \propto \frac{1}{2} \left( (y - f)^T [K + \sigma_\epsilon^2 I]^{-1} (y - f) + f^T K^{-1} f \right).$

Hence the Gaussian process prior induces a weighted least squares criterion with the $L_2$ penalty $P(f) = f^T K^{-1} f$. The posterior distribution over these functions is computed by applying Bayes theorem, and the posterior is used to make predictions. The estimated value of $y$ for a given $t$ is the mean (expected) value of the functions sampled from the posterior at that value of $t$.
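A minimal numerical sketch of these posterior computations, assuming a squared exponential kernel, a zero mean function and synthetic data (all choices here are illustrative):

```python
import numpy as np

def rbf_kernel(a, b, length_scale=1.0, variance=1.0):
    """Squared exponential kernel K(a, b)."""
    d = a[:, None] - b[None, :]
    return variance * np.exp(-0.5 * (d / length_scale) ** 2)

def gp_posterior(t_star, t, y, mu, sigma2, **kern):
    """Posterior mean and covariance of f at test points t_star."""
    K = rbf_kernel(t, t, **kern)
    Ks = rbf_kernel(t_star, t, **kern)
    Kss = rbf_kernel(t_star, t_star, **kern)
    A = np.linalg.solve(K + sigma2 * np.eye(len(t)), np.eye(len(t)))
    mean = mu(t_star) + Ks @ A @ (y - mu(t))
    cov = Kss - Ks @ A @ Ks.T
    return mean, cov

# Zero mean function, near-noiseless observations of a smooth curve
mu = lambda t: np.zeros_like(t)
t = np.linspace(0.0, 5.0, 20)
y = np.sin(t)
t_star = np.array([2.5])
mean, cov = gp_posterior(t_star, t, y, mu, sigma2=1e-4, length_scale=1.0)
```

The posterior mean interpolates the observations, and the posterior covariance quantifies the remaining uncertainty at the test point; this is the uncertainty information the GP framework provides directly.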
The expected value of the estimate at $t^*$ is given by

$\hat{f}(t^*) = E(f \mid t^*, y) \quad (4)$
$= \mu(t^*) + K(t^*, t) [K(t, t) + \sigma_\epsilon^2 I]^{-1} (y - \mu(t)). \quad (5)$

The variance of the estimate at $t^*$ is given by

$cov(f^*) = K(t^*, t^*) - K(t^*, t) [K(t, t) + \sigma_\epsilon^2 I]^{-1} K(t, t^*). \quad (6)$

3 Forecasting Methods
In this section we discuss the methods used to forecast yield curves, including the proposed dynamic Gaussian Process method.
3.1 The Nelson-Siegel Model

The Nelson-Siegel model (Nelson and Siegel [1987], Chen and Niu [2014]) specifies the yield curve as

$y(\tau) = \beta_0 + \beta_1 \left( \frac{1 - e^{-\lambda\tau}}{\lambda\tau} \right) + \beta_2 \left( \frac{1 - e^{-\lambda\tau}}{\lambda\tau} - e^{-\lambda\tau} \right) + \epsilon(\tau), \quad \epsilon(\tau) \sim N(0, \sigma_\epsilon^2), \quad (7)$

where $y(\tau)$ is the yield at maturity $\tau$. The three factors $\beta_0$, $\beta_1$ and $\beta_2$ are denoted the level, slope and curvature respectively. The parameter $\lambda$ controls the exponential decay rate of the loadings for the slope and curvature. These factors have the following econometric interpretations:

- The factor $\beta_0$ captures the strength of the long term component of the yield curve.
- The factor $\beta_1$ captures the strength of the short term component of the yield curve.
- The factor $\beta_2$ captures the strength of the medium term component of the yield curve.

The goodness-of-fit of the yield curve is not very sensitive to the specific choice of $\lambda$ (Nelson and Siegel [1987]); therefore Chen and Niu [2014] treated $\lambda$ as a known quantity. The factors of the Nelson-Siegel model need to be estimated from the data for the yield curve. Yield curve data are instances of functional data. When this technique is applied to successive yield curves, there could be a pattern in the evolution of the Nelson-Siegel coefficients over time. Section 3.3 provides a mathematical framework for this problem.

3.2 Multivariate Time Series

A common method to model yield curve data is to use a Vector Auto-Regressive (VAR) model to represent the yields for the term structures (Diebold and Rudebusch [2013]). An auto-regressive model of order $k$ is represented by

$y_i(\tau) = \beta_0 + \beta_1 y_{i-1}(\tau) + \cdots + \beta_k y_{i-k}(\tau). \quad (8)$

Equation 8 represents a regression of the $i$th yield curve on the previous $k$ yield curves. A model selection criterion, like the Bayesian Information Criterion, is used to determine the optimal order $k$ for the data. Forecasting is then performed using the optimal model. The results of modeling are presented in section 5.
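Estimating the Nelson-Siegel factors of Equation 7 by least squares for a fixed $\lambda$, with $\lambda$ chosen by a grid search over the residual sum of squares, can be sketched as follows (the maturities, grid and factor values are illustrative):

```python
import numpy as np

def ns_loadings(tau, lam):
    """Nelson-Siegel loadings of Equation 7 at maturities tau (months)."""
    x = lam * tau
    slope = (1.0 - np.exp(-x)) / x
    return np.column_stack([np.ones_like(tau), slope, slope - np.exp(-x)])

def fit_ns(tau, y, lam):
    """Least squares estimates of (beta0, beta1, beta2) for a fixed lambda."""
    beta, *_ = np.linalg.lstsq(ns_loadings(tau, lam), y, rcond=None)
    return beta

def grid_search_lambda(tau, y, grid):
    """Pick the lambda in `grid` minimizing the residual sum of squares."""
    def sse(lam):
        r = y - ns_loadings(tau, lam) @ fit_ns(tau, y, lam)
        return float(r @ r)
    return min(grid, key=sse)

# Synthetic curve generated from known lambda and factors, then recovered
tau = np.array([1, 3, 6, 12, 24, 36, 60, 84, 120, 240, 360], dtype=float)
lam_true, beta_true = 0.06, np.array([5.0, -1.5, 1.0])
y = ns_loadings(tau, lam_true) @ beta_true
lam_hat = grid_search_lambda(tau, y, grid=np.linspace(0.01, 0.12, 12))
beta_hat = fit_ns(tau, y, lam_hat)
```

For fixed $\lambda$ the model is linear in the factors, which is why a grid search over $\lambda$ combined with OLS for the factors is a simple and common estimation strategy.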
3.3 The Dynamic Nelson-Siegel Model

The Dynamic Nelson-Siegel (DNS) model (Nelson and Siegel [1987], Chen and Niu [2014]) for the yield curve has the following representation:

$y_t(\tau_j) = \beta_{1t} + \beta_{2t} \left( \frac{1 - e^{-\lambda\tau_j}}{\lambda\tau_j} \right) + \beta_{3t} \left( \frac{1 - e^{-\lambda\tau_j}}{\lambda\tau_j} - e^{-\lambda\tau_j} \right) + \epsilon_t(\tau_j),$
$\beta_{it} = \theta_{0i} + \theta_{1i} \beta_{i,t-1} + \eta_i, \quad i = 1, 2, 3,$

where:

- $\epsilon_t(\tau_j) \sim N(0, \sigma_\epsilon^2)$,
- $\eta_i \sim N(0, \sigma_\eta^2)$,
- $t = 1, 2, \ldots, T$ represents the time steps in days,
- $j = 1, 2, \ldots, m$ represents the term structure or maturity,
- $y_t(\tau)$ is the yield for maturity $\tau$ (in months) at time $t$.

The three factors $\beta_{1t}$, $\beta_{2t}$ and $\beta_{3t}$ are denoted the level, slope and curvature respectively. The parameter $\lambda$ controls the exponential decay rate of the loadings for the slope and curvature. The goodness-of-fit of the yield curve is not very sensitive to the specific choice of $\lambda$ (Nelson and Siegel [1987]); therefore Chen and Niu [2014] chose $\lambda$ to be known. In practice, $\lambda$ can be determined through a grid-search method. There are eight static parameters $\theta = (\theta_{01}, \theta_{11}, \theta_{02}, \theta_{12}, \theta_{03}, \theta_{13}, \sigma_\epsilon^2, \sigma_\eta^2)$ in the model. In matrix notation the DNS model can be presented as

$\beta_t = \theta_0 + Z \beta_{t-1} + \eta_t, \quad (9)$
$y_t = \phi \beta_t + \epsilon_t, \quad (10)$

where $y_t = (y_t(\tau_1), y_t(\tau_2), \ldots, y_t(\tau_m))^T$, $\beta_t = (\beta_{1t}, \beta_{2t}, \beta_{3t})^T$, $\epsilon_t = (\epsilon_1, \epsilon_2, \ldots, \epsilon_m)^T$, and

$\phi = \begin{pmatrix} 1 & f_1(\tau_1) & f_2(\tau_1) \\ 1 & f_1(\tau_2) & f_2(\tau_2) \\ \vdots & \vdots & \vdots \\ 1 & f_1(\tau_m) & f_2(\tau_m) \end{pmatrix}_{m \times 3},$

such that:

- $f_1(\tau_j) = \frac{1 - e^{-\lambda\tau_j}}{\lambda\tau_j}$,
- $f_2(\tau_j) = \frac{1 - e^{-\lambda\tau_j}}{\lambda\tau_j} - e^{-\lambda\tau_j}$, for $j = 1, 2, \ldots, m$. The index $j$ represents the term structure or maturity; there are 11 term structures in this study ($m = 11$).
- $\theta_0 = (\theta_{01}, \theta_{02}, \theta_{03})^T$ and $Z = \mathrm{diag}(\theta_{11}, \theta_{12}, \theta_{13})$.
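The DNS state space (9)-(10) can be sketched by simulating the factor dynamics and then recovering each day's factors by least squares, which is step one of the Diebold-Li procedure described below; all parameter values here are illustrative assumptions:

```python
import numpy as np

def dns_loading_matrix(tau, lam):
    """The m x 3 loading matrix phi of Equation 10."""
    x = lam * tau
    f1 = (1.0 - np.exp(-x)) / x
    f2 = f1 - np.exp(-x)
    return np.column_stack([np.ones_like(tau), f1, f2])

rng = np.random.default_rng(1)
tau = np.array([1, 3, 6, 12, 24, 36, 60, 84, 120, 240, 360], dtype=float)
phi = dns_loading_matrix(tau, lam=0.06)

# System equation (9) with diagonal Z, then observation equation (10)
theta0 = np.array([0.5, -0.1, 0.05])
Z = np.diag([0.9, 0.8, 0.7])
beta = np.array([5.0, -1.5, 1.0])
curves = []
for _ in range(50):
    beta = theta0 + Z @ beta + 0.01 * rng.standard_normal(3)
    curves.append(phi @ beta + 0.01 * rng.standard_normal(len(tau)))

# Diebold-Li step one: recover each day's factors by least squares
factors = np.array([np.linalg.lstsq(phi, y, rcond=None)[0] for y in curves])
```

The recovered factor series would then be modeled with a VAR, as described in the text.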
Note that $\epsilon_t \sim N_m(0, \sigma_\epsilon^2 I_m)$ and $\eta_t \sim N_3(0, \sigma_\eta^2 I_3)$. Equation (9) is the system equation and (10) is the observation equation. Diebold and Li [2006] suggest that the factors of the Nelson-Siegel model be estimated using a least squares procedure. The data for each yield curve produce a set of factors associated with the Nelson-Siegel representation of that curve. Since the dataset is a collection of yield curves, sequential application of the least squares procedure yields a sequence of Nelson-Siegel factors. The evolution of these factors can be represented using a Vector Auto-Regressive model. A model selection methodology, like the Bayesian Information Criterion, can be used to determine the optimal lag order for the model. Once an optimal model structure has been determined, forecasting is performed using the optimal model.

3.4 Dynamic Gaussian Process

Here we introduce the dynamic Gaussian process prior model. The observation equation is

$y_t = \mu_t(\tau) + \epsilon_t,$

where $y_t$ and $\epsilon_t$ are defined as in (10) and $\mu_t(\tau)$ is the mean function. The system equation is defined as

$\mu_t(\tau) = \mu_{t-1}(\tau) + W_t, \quad (11)$

where $W_t(\tau) \sim N_m(0, K_{t-1})$, with $K_{t-1} = K(\tau, \tau' \mid \rho_{t-1})$ and $\rho_{t-1}$ the hyper-parameter estimated at time $t-1$. The key notion here is that, given the data $Y_t = (y_t, y_{t-1}, \ldots, y_1)$, inference about $\mu_t$ and prediction of $y_{t+1}$ can be carried out via Bayes theorem, which can be expressed as

$P(\mu_t(\tau) \mid Y_t) \propto P(y_t \mid \mu_t(\tau), Y_{t-1}) \times P(\mu_t(\tau) \mid Y_{t-1}). \quad (12)$

Note that the expression on the left of (12) is the posterior process of $\mu(\tau)$ at time $t$, whereas the first and second expressions on the right side of (12) are the likelihood and the prior process of $\mu(\tau)$, respectively.
Suppose the posterior process at time point $t-1$ is

$\mu_{t-1} \mid Y_{t-1} \sim N_m(\hat{\mu}_{t-1}(\tau), \hat{K}_{t-1}),$

where $\hat{\mu}_{t-1}(\tau)$ is the posterior mean function and $\hat{K}_{t-1}$ is the posterior covariance function of the process at time point $t-1$. Following the structure of the GP regression model as presented in (3) and (11), the prior predictive process at time point $t$ is

$\mu_t \mid Y_{t-1} \sim N_m(\hat{\mu}_{t-1}(\tau), \hat{K}_{t-1}),$

the likelihood function is

$y_t \mid \mu_t(\tau), Y_{t-1} \sim N_m(\mu_t(\tau), \sigma_t^2 I_m),$

and the marginal likelihood function is

$y_t \mid Y_{t-1} \sim N_m(\hat{\mu}_{t-1}(\tau), \hat{K}_{t-1} + \sigma_{t-1}^2 I_m). \quad (13)$

Note that in (13), $\hat{\mu}_{t-1}(\tau)$ is measurable under the $\sigma$-field generated by $Y_{t-1}$. We can estimate the hyper-parameters $\theta_{t-1} = (\rho_{t-1}, \sigma_{t-1}^2)$ using an optimization procedure that maximizes the marginal likelihood (13). Let $\hat{\theta}_{t-1}$ be the hyper-parameter estimate obtained by optimizing (13). We can then provide an estimate for the observation at time $t$ using the expected value of $y_t \mid Y_{t-1}$ (obtained from (13)):

$\hat{\mu}_t(\tau^*) = E(\mu_t(\tau^*) \mid Y_{t-1}) = K(\tau^*, \tau \mid \hat{\rho}_{t-1}) [K(\tau, \tau \mid \hat{\rho}_{t-1}) + \hat{\sigma}_{t-1}^2 I]^{-1} y_{t-1}(\tau).$

Once we have obtained the observation at time $t$, we can update the posterior process over $y_t$ as

$y_t(\tau) \mid Y_t \sim N_m(\hat{\mu}_{t.updated}(\tau), \hat{K}_{t.updated}),$

where the corresponding covariance function is

$\hat{K}_{t.updated} = K(\tau^*, \tau^* \mid \hat{\rho}_t) - K(\tau^*, \tau \mid \hat{\rho}_t) [K(\tau, \tau \mid \hat{\rho}_t) + \hat{\sigma}_t^2 I]^{-1} K(\tau, \tau^* \mid \hat{\rho}_t),$

and the mean function, or the expected value of $\mu_t$ at $\tau^*$, is

$\hat{\mu}_{t.updated}(\tau^*) = E(\hat{\mu}_t(\tau^*) \mid Y_t) = \hat{\mu}_t(\tau^*) + K(\tau^*, \tau \mid \hat{\rho}_t) [K(\tau, \tau \mid \hat{\rho}_t) + \hat{\sigma}_t^2 I]^{-1} (y_{t+1} - \hat{\mu}_t(\tau)).$

The details of an algorithmic implementation of this procedure are provided in section 3.4.1.
3.4.1 Algorithm

There are two distinct phases of the algorithm, corresponding to the time steps $t = 0$ (the first yield curve in the dataset) and $t > 0$ (the subsequent yield curves in the dataset). The details of each phase are provided below.

Time step $t = 0$:

1. Hyper-parameter Estimation:
Estimate the hyper-parameters $\hat{\theta}_0$ of the Gaussian Process $y_0 \sim N_m(0, K + \sigma_0^2 I_m)$. Here $K = K(\tau, \tau^* \mid \rho_0)$ and $\theta_0 = (\rho_0, \sigma_0^2)$ are the hyper-parameters at time $t = 0$. The hyper-parameters are obtained by maximizing the marginal log-likelihood using an optimization algorithm (gradient descent, conjugate gradient descent, etc.).

2. Predict:
Provide an estimate of the yield for time step $t = 1$ using

$\hat{\mu}_0(\tau^*) = E(y_0(\tau^*) \mid Y_0) = K(\tau^*, \tau \mid \hat{\rho}_0) [K(\tau, \tau \mid \hat{\rho}_0) + \hat{\sigma}_0^2 I]^{-1} y_0(\tau).$

The predictive interval for the time point can be provided using the distribution $\mu_0 \mid Y_0 \sim N_m(\hat{\mu}_0, \hat{K}_0)$.

3. Update:

(a) Update the posterior covariance function as

$\hat{K}_{0.updated} = K(\tau^*, \tau^* \mid \hat{\rho}_0) - K(\tau^*, \tau \mid \hat{\rho}_0) [K(\tau, \tau \mid \hat{\rho}_0) + \hat{\sigma}_0^2 I]^{-1} K(\tau, \tau^* \mid \hat{\rho}_0).$

(b) Update the posterior mean function as

$\hat{\mu}_{0.updated}(\tau^*) = E(\mu_0(\tau^*) \mid Y_0) = K(\tau^*, \tau \mid \hat{\rho}_0) [K(\tau, \tau \mid \hat{\rho}_0) + \hat{\sigma}_0^2 I]^{-1} (y_1 - \hat{\mu}_0),$

which is the mean function associated with the Hyper-parameter Estimation step for time step $t = 1$.

Time step $t \geq 1$:

1. Hyper-parameter Estimation:
Estimate the hyper-parameters $\hat{\theta}_t$ of the Gaussian Process $y_t \mid Y_{t-1} \sim N_m(\hat{\mu}_{updated}(\tau), K + \sigma_t^2 I_m)$. Here $K = K(\tau, \tau^* \mid \rho_t)$ and $\theta_t = (\rho_t, \sigma_t^2)$ are the hyper-parameters at time step $t$. The hyper-parameters are obtained by maximizing the marginal log-likelihood using an optimization algorithm.

2. Predict:
Provide an estimate of the yield for time step $t + 1$ using

$\hat{\mu}_t(\tau^*) = E(y_{t+1}(\tau^*) \mid Y_t) = K(\tau^*, \tau \mid \hat{\rho}_t) [K(\tau, \tau \mid \hat{\rho}_t) + \hat{\sigma}_t^2 I]^{-1} y_t(\tau).$

The predictive interval for time point $t + 1$ can be provided using the process $y_{t+1} \mid Y_t \sim N_m(\hat{\mu}_t, \hat{K}_t)$.

3. Update:

(a) Update the posterior covariance function as

$\hat{K}_{t.updated} = K(\tau^*, \tau^* \mid \hat{\rho}_t) - K(\tau^*, \tau \mid \hat{\rho}_t) [K(\tau, \tau \mid \hat{\rho}_t) + \hat{\sigma}_t^2 I]^{-1} K(\tau, \tau^* \mid \hat{\rho}_t).$

(b) Update the posterior mean function for term $\tau^*$ as

$\hat{\mu}_{t.updated}(\tau^*) = E(\mu_t(\tau^*) \mid Y_t) = \hat{\mu}_t(\tau^*) + K(\tau^*, \tau \mid \hat{\rho}_t) [K(\tau, \tau \mid \hat{\rho}_t) + \hat{\sigma}_t^2 I]^{-1} (y_{t+1} - \hat{\mu}_t(\tau)),$

which is the mean function for the Hyper-parameter Estimation step of the subsequent iteration.

The covariance function to use with the algorithm is a modeling decision and is problem specific; see Duvenaud [2017] and Rasmussen and Williams [2005] for guidelines. For the data used in this study, a combination of a linear kernel and a squared exponential (Radial Basis Function) kernel produced good results.
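A compressed sketch of the predict/update sweep is given below. For brevity the hyper-parameters are held fixed rather than re-estimated by marginal likelihood maximization at each step, and the weights combining the linear and squared exponential kernel terms are arbitrary assumptions; the toy yield curves are synthetic:

```python
import numpy as np

def kernel(a, b, rho=1.0, amp=1.0):
    """Linear + squared exponential kernel (the combination used in the
    study); the weight 0.01 on the linear term is a placeholder."""
    lin = np.outer(a, b)
    d = a[:, None] - b[None, :]
    return 0.01 * lin + amp * np.exp(-0.5 * (d / rho) ** 2)

def dynamic_gp_step(tau, y_t, mu_prev, sigma2, rho):
    """One predict/update sweep: one-step-ahead mean forecast and the
    updated posterior covariance (Update step (a))."""
    K = kernel(tau, tau, rho)
    A = np.linalg.solve(K + sigma2 * np.eye(len(tau)), y_t - mu_prev)
    mu_next = mu_prev + K @ A                     # forecast mean for t+1
    K_upd = K - K @ np.linalg.solve(K + sigma2 * np.eye(len(tau)), K)
    return mu_next, K_upd

tau = np.array([1, 3, 6, 12, 24, 36, 60, 84, 120, 240, 360], dtype=float)
curves = [4.0 + 0.01 * t + np.log1p(tau / 12.0) for t in range(5)]  # toy data

mu = np.zeros_like(tau)
for y_t in curves[:-1]:
    mu, K_upd = dynamic_gp_step(tau, y_t, mu, sigma2=1e-4, rho=24.0)

forecast = mu  # one-step-ahead forecast of the final curve
```

Because the toy curves drift slowly, the one-step-ahead forecast tracks the final curve closely; in the full algorithm, $\rho$ and $\sigma^2$ would be refreshed at each step as described above.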
Remark 1: Note that in this dynamic process, the posterior at the last time point ($t-1$) is used as the prior predictive process for the next time point $t$.

Remark 2: In this algorithm, we estimate the hyper-parameters at every stage. This is feasible because $m$ is small in our case; however, it may not be possible in many practical problems.

3.5 Relationship between the Dynamic Gaussian Process and the Optimal Bayesian Filter

Smith [1981] proposes a Bayesian framework for dynamic models where the prior ($\pi$) at time step $t$ has a power law form:

$\pi(y_t) \propto p(y_t \mid Y_{t-1}, \delta_t), \quad (14)$

where:

- $\pi(y_t)$ represents the prior at time step $t$,
- $\delta_t$ represents the power at time step $t$.

The main idea behind the power filter approach is to propagate information from one time step to the next. The posterior density at stage $t$ is given by

$\pi(y_t) \propto f(y_t \mid Y_{t-1}) \cdot \left[ \pi_{t|t-1}(y_t) \right]^{\delta_t}, \quad 0 \leq \delta_t \leq 1, \quad (15)$

where:

- $f(y_t \mid Y_{t-1})$ represents the likelihood,
- $\left[ \pi_{t|t-1}(y_t) \right]^{\delta_t}$ represents the prior.

When $\delta_t = 1$ and the likelihood and the prior are multivariate Gaussian, we obtain the dynamic GP. Das and Dey [2013] develop the power filter for dynamic generalized linear models; they show that the power filter yields an efficient information processing rule in a dynamic model setting.
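For a scalar Gaussian case, the update of Equation 15 has a closed form, since raising a Gaussian prior $N(m_0, v_0)$ to the power $\delta$ simply inflates its variance to $v_0/\delta$; the numbers below are illustrative:

```python
def power_filter_update(m0, v0, y, v, delta):
    """Gaussian conjugate update with the prior tempered to power delta
    (Equation 15): the tempered prior N(m0, v0) becomes N(m0, v0/delta)."""
    prec = delta / v0 + 1.0 / v          # posterior precision
    mean = (delta * m0 / v0 + y / v) / prec
    return mean, 1.0 / prec

# delta = 1 recovers the standard Bayes update; delta -> 0 discounts the prior
m1, v1 = power_filter_update(m0=0.0, v0=1.0, y=2.0, v=1.0, delta=1.0)
m_disc, _ = power_filter_update(m0=0.0, v0=1.0, y=2.0, v=1.0, delta=0.1)
```

With $\delta = 1$ the prior and likelihood are weighted equally (the dynamic GP case); with $\delta < 1$ the posterior mean moves closer to the new observation, illustrating how the power discounts past information.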
5 Results and Discussion

This study examines data over a ten year period. The Root Mean Square Error (RMSE) was used as the metric to assess the performance of the methods. It is defined by:
$RMSE = \sqrt{ \frac{ \sum_{i=1}^{N} \sum_{\tau=1}^{11} (\hat{y}[\tau, i] - y[\tau, i])^2 }{ N } }, \quad (16)$

where:

- $\hat{y}[\tau, i]$ is the estimated yield for day $i$ associated with term $\tau$,
- $y[\tau, i]$ is the actual yield for day $i$ associated with term $\tau$,
- $N$ is the number of yield curves that are estimated using the procedure,
- there are 11 terms in each yield curve.

A summarized view of the results for this ten year period is shown in Table 1.

Term       GP     MVTS   TSNS
1 Month    0.104  0.088  0.121
3 Months   0.071  0.066  0.080
6 Months   0.054  0.047  0.088
1 Year     0.047  0.043  0.085
2 Years    0.052  0.055  0.088
3 Years    0.058  0.061  0.114
5 Years    0.065  0.068  0.126
7 Years    0.065  0.070  0.149
10 Years   0.063  0.067  0.197
20 Years   0.061  0.065  0.977
30 Years   0.060  0.063  10.838

Table 1: RMSE for term structures for all methods

An inspection of Table 1 shows that the Nelson-Siegel model based time series (TSNS) does relatively poorly in comparison to the multivariate time series (MVTS) method and the dynamic GP. We examine the performance of these methods over three regions of the term structure: short, medium and long. The short term category includes term structures up to 1 year. The medium term category consists of bonds with maturities of 2, 3 and 5 years. The long term category consists of bonds that mature at 7, 10, 20 and 30 years. The multivariate time series appears to do well in the short term region of the yield curve, while the dynamic GP method does well in the medium and long term regions.

Figure 1 shows estimates using the techniques discussed for a sample of days in the dataset (February 06, 2008; February 10, 2010; February 08, 2012; and February 07, 2014). This provides a cross-sectional view of the estimates from the various methods used in this study. Note that Figure 1 also provides the actual yield associated with the term structures on these days.
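Equation 16 as written (the sum of squared errors over all terms and days, divided by $N$) can be computed as:

```python
import numpy as np

def rmse(y_hat, y):
    """Equation 16: y_hat and y are N x 11 arrays (N days, 11 terms)."""
    n = y.shape[0]
    return float(np.sqrt(np.sum((y_hat - y) ** 2) / n))

# Two days, 11 terms each; every estimate off by 0.1
y = np.zeros((2, 11))
y_hat = y + 0.1
error = rmse(y_hat, y)  # sqrt(22 * 0.01 / 2) = sqrt(0.11)
```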
On some days, like February 10, 2010 and February 08, 2012, the estimates from all methods are close to the actual yields; on other days the estimates may be quite different. However, an inspection of Figure 1 shows that the estimates from the dynamic GP agree quite well with the actual yields over the 10 year period considered for this study. Estimates from the methods are usually quite close, so a small amount of Gaussian noise (jitter) was added to discriminate the curves in Figure 1.

Figure 1: Estimates for a Sample of the Data Using the Methods Discussed. (a) Feb 06, 2008; (b) Feb 10, 2010; (c) Feb 08, 2012; (d) Feb 07, 2014.

As evident from Table 1, the performance of the Nelson-Siegel based method is inferior to that of the multivariate time series method and the dynamic GP method. Therefore, in the analysis that follows, we limit our discussion to the comparison of the dynamic GP method and the multivariate time series method. Figures 2 through 4 show the squared error of the estimates associated with these two methods.

The short term performance of the dynamic GP and the multivariate time series methods over the 10 year period is shown in Figure 2. A review of Figure 2 shows that the performance of the multivariate time series method is generally better than that of the dynamic GP method in the short term region of the yield curve.

Figure 2: Short Term Performance - GP versus MVTS over a 10 Year Period. (a) 1 Month; (b) 3 Months; (c) 6 Months; (d) 1 Year.

The medium term performance of the dynamic GP and the multivariate time series methods over the 10 year period is shown in Figure 3, and the long term performance in Figure 4. An analysis of the medium and long term performance curves shows that the dynamic GP performs better than the multivariate time series method in the medium and long term structure regions. This is consistent with the summarized RMSE over the 10 year period in Table 1.

Figure 3: Medium Term Performance - GP versus MVTS over a 10 Year Period. (a) 2 Years; (b) 3 Years; (c) 5 Years.

Figure 4: Long Term Performance - GP versus MVTS over a 10 Year Period. (a) 7 Years; (b) 10 Years; (c) 20 Years; (d) 30 Years.
The GPy python package (GPy [2012-2014]) was used for developing the Gaussian Process models reported in this work. The vars R package (Pfaff [2008b]) was used for the time series based methods, i.e., to forecast the Nelson-Siegel coefficients or the yield curve term rates.
6 Conclusion

Gaussian processes have been used for functional data analysis in several domains (see Rasmussen and Williams [2005]). The results of this study suggest that they can also be used for yield curve forecasting. The nature of yield curve data is such that there is more data in the short and medium term structure regions than in the long term structure region, which makes long term forecasts challenging. This study revealed that the proposed dynamic GP method can forecast this region of the yield curve well. Analysts could use a mix of methods to forecast the yield curve: the data for this study spans a large time interval (over ten years), and the results indicate that the multivariate time series approach is more accurate for forecasting the short term structures, while the proposed dynamic Gaussian Process based method is a better choice for the medium and long term structures. The proposed method has been applied to a forecasting problem in the financial domain; however, it can be applied to other domains as well. Demand forecasting is a common business requirement: in an IT data center, we might be interested in forecasting the hourly number of user requests serviced by a group of computers, and the hourly energy demand might be of interest to an electrical utility company. In summary, we believe that the dynamic Gaussian Process model could be useful in other application domains too.
References
T. W. Anderson. An Introduction to Multivariate Statistical Analysis. Wiley, 1984.

Ming-Hui Chen, Joseph G. Ibrahim, and Qi-Man Shao. Power prior distributions for generalized linear models. Journal of Statistical Planning and Inference, 84(1):121-137, 2000.

Ying Chen and Linlin Niu. Adaptive dynamic Nelson-Siegel term structure model with applications. Journal of Econometrics, 180(1):98-115, 2014.

Purba Das, Ananya Lahiri, and Sourish Das. Understanding sea ice melting via functional data analysis. arXiv preprint arXiv:1610.07024, 2016.

Sourish Das and Dipak K. Dey. On dynamic generalized linear models with applications. Methodology and Computing in Applied Probability, pages 1-15, 2013.

Francis Diebold and Canlin Li. Forecasting the term structure of government bond yields. Journal of Econometrics, 130(1):337-364, 2006.

Francis X. Diebold and Glenn D. Rudebusch. Yield Curve Modeling and Forecasting: The Dynamic Nelson-Siegel Approach. Princeton University Press, 2013.

David Duvenaud. Kernel Cookbook, 2017.

Jerome Friedman, Trevor Hastie, and Robert Tibshirani. The Elements of Statistical Learning, volume 1. Springer Series in Statistics, Springer, Berlin, 2001.

GPy. GPy: A Gaussian process framework in Python. http://github.com/SheffieldML/GPy, 2012-2014.

Mayetri Gupta and Joseph G. Ibrahim. An information matrix prior for Bayesian analysis in generalized linear models with high dimensional data. Statistica Sinica, 2009.

Joseph G. Ibrahim, Ming-Hui Chen, and Debajyoti Sinha. On optimality properties of the power prior. Journal of the American Statistical Association, 98(461):204-213, 2003.

Richard J. Meinhold and Nozer D. Singpurwalla. Understanding the Kalman filter. The American Statistician, 37(2):123-127, 1983.

Charles R. Nelson and Andrew F. Siegel. Parsimonious modeling of yield curves. The Journal of Business, 60(4):473-489, 1987.

Barry Nielsen. Bond Yield Curve Holds Predictive Powers, 2017.

B. Pfaff. Analysis of Integrated and Cointegrated Time Series with R. Springer, New York, second edition, 2008a. ISBN 0-387-27960-1.

Bernhard Pfaff. VAR, SVAR and SVEC models: Implementation within R package vars. Journal of Statistical Software, 27(4), 2008b.

R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria, 2016.

James O. Ramsay and Bernard W. Silverman. Applied Functional Data Analysis: Methods and Case Studies, volume 77. Citeseer, 2002.

Carl Edward Rasmussen and Christopher K. I. Williams. Gaussian Processes for Machine Learning (Adaptive Computation and Machine Learning). The MIT Press, 2005. ISBN 026218253X.

Guido Rossum. Python reference manual. Technical report, Amsterdam, The Netherlands, 1995.

J. Q. Smith. The multiparameter steady model. Journal of the Royal Statistical Society, Series B (Methodological), pages 256-260, 1981.

Spencer Hays, Haipeng Shen, and Jianhua Z. Huang. Functional dynamic factor models with applications to yield curve forecasting. Annals of Applied Statistics, 6(3):870-894, 2012.

Mike West. Bayesian model monitoring. Journal of the Royal Statistical Society, Series B (Methodological).

Treasury Rates, 2017.