Machine Learning for Yield Curve Feature Extraction: Application to Illiquid Corporate Bonds (Preliminary Draft)
Greg Kirczenow, Ali Fathi and Matt Davison
[email protected]  [email protected]  [email protected]

∗ This is a preliminary draft exposition (for discussion purposes) of the main ideas presented at Toronto-CAIMS 2018.

Abstract
This paper studies the application of machine learning to extracting market-implied features from historical risk-neutral corporate bond yields. We consider the example of a hypothetical illiquid fixed income market. After choosing a surrogate liquid market, we apply the Denoising Autoencoder algorithm from the field of computer vision and pattern recognition to learn the features of the missing yields from the historically implied data of the instruments traded in the chosen liquid market. The outputs of the trained machine learning algorithm are then compared with those of a point-in-time two-dimensional interpolation algorithm known as the Thin Plate Spline, and the performance of the two approaches is assessed.
1. Introduction
In many fixed income markets it can be difficult to find bond yields for all rating and tenor pairs. This can arise in situations where market liquidity is insufficient to facilitate price discovery, whether due to noisy or simply incomplete data. The central idea of the present paper is to design an unsupervised machine that learns the features of corporate bond yield curves by observing a sufficient number of historical examples in a liquid market, and then uses the learned shapes to fill in the missing yields in the illiquid market. Roughly speaking, the machine is a tool for interpolation/extrapolation; however, it also incorporates its memory of typical yield curve shapes.

This paper is highly influenced by algorithms that have been successfully employed for pattern recognition and computer vision. Specifically, our algorithm of choice is the denoising autoencoder (DAE), which is popular in image reconstruction problems and in training deep neural networks (see [(1)] and [(9)] for background). More concretely, we consider the example of a hypothetical illiquid fixed income market. Because the market is illiquid, reliable implied yields are available for only a limited set of traded ratings/tenors. After choosing a surrogate liquid market, we apply the DAE algorithm to learn the features of the missing yields from the historically implied data of the instruments traded in the chosen liquid market.

The results of the trained machine learning algorithm are compared with the outputs of a point-in-time two-dimensional interpolation algorithm known as the Thin Plate Spline (TPS). While the TPS is a fairly flexible algorithm, in our view it is an inferior tool since it only interpolates/extrapolates the rating/tenor grid on the spot, without any insight into how the shape of the yield curve for a given rating should look. As a result, it can produce comparatively poor estimates in cases where key areas of the surface are missing.

The paper is organized as follows. In Section 2, the setup of the problem is explained. In Section 3, a high-level description of the DAE algorithm and various network architectures for DAEs is given; for the sake of completeness, the details of the TPS algorithm are also included. In Section 4, the details of the data used and the implementation of the algorithms are presented. Finally, the results of the experiments are summarized in Section 5.
2. Problem Definition
The problem setup is fairly straightforward. Consider a fixed income trader who is supposed to provide quotes on prices for a variety of illiquid corporate/sovereign bonds for different ratings and tenors in a specific market. The trader only has access to bond yields for a few anchor points on the rating/tenor grid, and she is tasked with completing the rating/tenor matrix based on the given points.

This is a problem of interpolation/extrapolation under market no-arbitrage constraints. These constraints are embodied collectively in the features present in the shape of the curves. For instance, in a normal market environment, the yield curves for each rating are upward sloping, with longer-term interest rates being higher than shorter-term ones. The curve might exhibit other features, such as humps, based on the supply and demand for the instruments. In typical situations for our problem, some of the yields at the short end of the curves for some of the ratings are known.

Figure 1: A typical yield curve structure (source: Bloomberg).

There exist numerous approaches for interpolation of the yield curve along the tenor coordinate (see [(2)] for a survey of the methods). However, the situation we are interested in consists of filling in a sparse matrix grid along both the rating and the tenor coordinates; therefore, a two-dimensional generalization of the above-mentioned methods is needed.

Thin Plate Splines (TPS) [(3), (6), (10)] are natural analogues of the 1-D smoothing splines in 2-D, and are effective tools for filling in a 2-D grid and transforming it into a surface while the oscillatory behaviour of the interpolating surface is controlled (see Section 3.3 below). As noted in [(2)], spline interpolators do not necessarily produce the empirically expected features in the yield curve. This is also true for the 2-D TPS algorithm. Therefore, the raw outputs of the TPS algorithm must undergo an additional modification step in order to produce the desired curve features (such as monotonicity etc.).

Another subtlety involved in applying ordinary interpolating algorithms to our case of a sparse rating/tenor matrix is the assignment of a numerical coordinate vector to the rating dimension. In other words, in order to implement a 2-D interpolator such as TPS, an extra non-canonical criterion must be set in order to convert the ordinal values of the ratings to a quantitative coordinate vector. One such assignment is sketched below.
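As a concrete illustration of this last point, the following sketch assigns numerical coordinates to the ordinal ratings before a 2-D interpolator can be applied. The equally spaced integer scale is purely an assumption for illustration; any monotone mapping could be used, and the choice affects the resulting interpolation.

```python
# Hypothetical mapping of ordinal ratings to numerical coordinates.
# One unit per rating notch is an assumption, not part of the method.
RATINGS = ["AAA", "AA", "A+", "A", "A-", "BBB+", "BBB", "BBB-",
           "BB+", "BB", "BB-", "B+", "B"]
TENORS_YEARS = [0.25, 0.5, 1, 2, 3, 4, 5, 7, 8, 9, 10, 15, 20, 25, 30]

rating_coord = {r: i for i, r in enumerate(RATINGS)}

def grid_coordinates(rating, tenor):
    """Return the (rating, tenor) pair as a numerical 2-D coordinate."""
    return (rating_coord[rating], tenor)
```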
3. Theoretical Background
In this section we provide a brief presentation of the theoretical background needed to understand the algorithms implemented in this paper. We start by defining the class of autoencoder algorithms. Next we specify the subclass of this family of algorithms called denoising autoencoders. We also describe both the Fully Connected Neural Network (FCNN) and Convolutional Neural Network (CNN) architectures for these algorithms. Finally, we introduce the Thin Plate Spline (TPS) algorithm, which is used as the baseline model in this paper. We refer the reader to [(1)] for details on neural networks, CNNs and autoencoders. The reader can consult [(3)] or [(10)] for more details on the TPS algorithm.
In this section, we follow the exposition in [(1)] and [(9)] closely, where the reader may find the basics of neural networks. Autoencoders can be described as a class of unsupervised learning algorithms which are designed to transfer the inputs to the output through some (non-linear) reproducing procedure. They are called unsupervised since the algorithm only uses the inputs $X$ for learning. An autoencoder transforms an input vector $x \in \mathbb{R}^d$ to a hidden representation $h(x) \in \mathbb{R}^{d'}$ via a mapping $h(x) = g(Wx + b)$, where $W$ is a $d' \times d$ weight matrix and $b$ is called a bias vector. The mapping $g(\cdot)$ is also called the encoder part of the autoencoder. The decoder part of the algorithm then maps the obtained hidden representation back to a reconstructed vector $z \in \mathbb{R}^d$ via a mapping $z = k(W^{*} h(x) + c)$.

Figure 2: A general autoencoder architecture.

Therefore, observation $x_i$ is first mapped to a corresponding $h(x_i)$ and then to a reconstruction $z_i$. If we define $\theta = \{W, b\}$ and $\theta' = \{W^{*}, c\}$, the optimal weights are found by solving

$$\operatorname*{argmin}_{\{\theta,\theta'\}} \; \frac{1}{N}\sum_{i=1}^{N} L(x_i, z_i). \qquad (1)$$

Here, $L(x, z)$ is a chosen loss function. A principal recipe in statistical learning for density estimation [(8)] can guide us in choosing the cost function for training the autoencoder. In general, one starts with the choice of a joint distribution of the data $p(x \mid \theta)$, where $\theta$ is the vector of distribution parameters. Next, one defines a relationship between $\theta$ and $h(x)$ and sets up a loss function $L(x, k \circ g(x)) = -\log p(x \mid \theta)$, where $k \circ g(x) = z$ is the autoencoder functional form. For instance, the choice of the squared error loss $L(x, z) = \lVert x - z \rVert^2$ is equivalent to the choice of a Gaussian distribution with mean $\theta$ and the identity covariance matrix for the data,

$$p(x \mid \theta) = \frac{1}{(2\pi)^{N/2}} \exp\!\left(-\frac{1}{2}\sum_{i=1}^{N}(x_i - \theta_i)^2\right), \qquad (2)$$

where $\theta = c + W^{*} h(x)$ (see [(1)]).

The basic idea of an autoencoder as explained above is to minimize an expected loss function $\mathbb{E}[L(x, k \circ g(x))]$. Here, the loss function penalizes the dissimilarity of $k \circ g(x)$ from $x$. This may drive the functional form to be merely an identity mapping $k \circ g(x) = x$ if the algorithm architecture allows for it. In other words, if the hidden layer $h(x)$ of the autoencoder is wider than the input vector (overcomplete), the algorithm can simply copy the inputs to the hidden layer and hence not learn meaningful features of the inputs.

To avoid this situation, a denoising autoencoder (DAE) can be trained. (This section follows the exposition in [(9)] closely; see [(1)] and references therein for more details.) The idea is to minimize the expected loss $\mathbb{E}[L(x, k \circ g(\tilde{x}))]$, where $\tilde{x}$ is a noisy version of $x$. The DAE must therefore learn enough features in order to salvage the original data from the corrupted version rather than simply copying its input.

The process starts by first injecting noise into the initial input $x$ to obtain a partially corrupted copy $\tilde{x}$ via a stochastic procedure, so $\tilde{x} \sim q_D(\tilde{x} \mid x)$. In [(9)] the following corruption process is implemented: for each input $x$, a proportion $\nu$ (fixed in advance) of the coordinates are randomly set equal to zero while the others are left untouched.∗ The corrupted copy of the input $\tilde{x}$ is then fed into a regular autoencoder (see Figure 3); a small sketch of this masking step follows.

∗ An alternative is to use Gaussian additive noise whose variance is set via hyper-parameter tuning (see [(1)]).
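The sketch below is a minimal NumPy implementation of the masking corruption just described (a fraction ν of coordinates forced to zero at random); the function name and interface are ours, for illustration only.

```python
import numpy as np

def corrupt(x, nu, rng=None):
    """Return a copy of x with a fraction nu of its entries set to zero.

    This mimics the masking noise q_D(x_tilde | x) of Vincent et al. (2008):
    the chosen coordinates are forced to 0, the rest are left untouched.
    """
    rng = np.random.default_rng() if rng is None else rng
    x_tilde = x.copy().ravel()
    n_corrupt = int(round(nu * x_tilde.size))
    idx = rng.choice(x_tilde.size, size=n_corrupt, replace=False)
    x_tilde[idx] = 0.0
    return x_tilde.reshape(x.shape)
```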
It is important to notice that the parameters of the model are trained to minimize the average reconstruction error $\mathbb{E}[L(x, z)]$, not $\mathbb{E}[L(\tilde{x}, z)]$; namely, the aim is to obtain $z$ as close as possible to the uncorrupted input $x$. Here, the output $z$ is related deterministically to $\tilde{x}$ rather than to $x$.

The encoder and decoder parts of an autoencoder are feed-forward neural networks with variable architectures (see [(9)] for definitions and details). Figure 2 above shows the fully connected neural network (FCNN) architecture used for the DAE in this paper (see Section 4.2). As seen in the figure, the input is filtered through the encoder to produce the code; the decoder then reconstructs the output based on the code. The decoder architecture depicted in Figure 2 is the mirror image of the encoder; this is not a requirement but a popular choice of architecture. The only requirement is that the dimensions of the input space and the output space must be identical.
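For concreteness, here is a minimal Keras sketch of a fully connected DAE of the kind just described. The input size (13 ratings × 15 tenors = 195) follows the data of Section 4, but the number of hidden units is a placeholder rather than the value tuned in the paper.

```python
from tensorflow import keras
from tensorflow.keras import layers

N_RATINGS, N_TENORS = 13, 15
INPUT_DIM = N_RATINGS * N_TENORS  # 195 grid points, flattened

def build_fcnn_dae(hidden_units=256):
    """Single overcomplete hidden layer FCNN denoising autoencoder.

    Inputs are corrupted, flattened rate surfaces scaled to [0, 1];
    targets are the uncorrupted surfaces, so the network learns to
    reconstruct missing yields rather than copy its input.
    """
    inputs = keras.Input(shape=(INPUT_DIM,))
    h = layers.Dense(hidden_units, activation="relu")(inputs)
    h = layers.BatchNormalization()(h)  # batch norm after the hidden activation
    outputs = layers.Dense(INPUT_DIM, activation="sigmoid")(h)
    model = keras.Model(inputs, outputs)
    model.compile(optimizer="adam", loss="mse")
    return model

# Example usage, with x_corrupt / x_clean as in the data construction of Section 4:
# model = build_fcnn_dae()
# model.fit(x_corrupt, x_clean, epochs=100, batch_size=16)
```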
Figure 3: An autoencoder corrupts and then reconstructs the input.

As emphasized in [(5)], FCNN architectures for autoencoders and DAEs both ignore the 2D structure of an image (the 2D structure of the rating/tenor surface in our case). In contrast, the most successful models in image pattern recognition capture the localized features that repeat themselves in the input (see [(5)] and references therein). These algorithms employ CNN architectures for both the decoder and encoder parts of the autoencoder (see Figure 4 and [(1)] for details on CNNs).

Figure 4: The optimal CNN network trained on historical rate surfaces.
3.3 Thin Plate Splines

Thin Plate Splines (TPS) are 2D generalizations of 1D cubic splines [see (3), (6), (10)]. TPS is an interpolation algorithm for the grid points $\{(X_i, y_i)\}_{i=1}^{m}$, where $X_i = (x_i^1, x_i^2) \in \mathbb{R}^2$. The spline surface $f(x^1, x^2)$ is constructed such that it passes through the grid points as closely as possible while exhibiting a controlled amount of oscillation. This bias/variance optimal surface is obtained by minimizing the following action functional over an appropriate function space (see (10)):

$$S[f] = \sum_i \lvert f(X_i) - y_i \rvert^2 + \lambda \int \lvert H(f) \rvert \, dx. \qquad (3)$$

Here, $\lvert H(f) \rvert$ denotes the sum of squared entries of the Hessian matrix of $f$ and $\lambda \in \mathbb{R}^{+}$ is a regularization parameter. The first term is a measure of fitting error, and the integral in the second term measures the oscillating behaviour of the interpolation surface.

The TPS solution is the minimizer of the above-defined action. Applying the Euler-Lagrange recipe for finding the minimum of the functional in (3), one obtains a 4th order PDE with certain orthogonality conditions on the space of solutions [see (3), (6)]. The solution in the 3D Euclidean space (2D TPS) is given by the following closed-form formula:

$$f(X) = \sum_{i=1}^{m} a_i \, G(X, X_i) + \left(b_0 + b_1 x^1 + b_2 x^2\right). \qquad (4)$$

In practice, in the 3D case one can set $G(X, s) = u(\lvert X - s \rvert)$, where $u(x) = x^2 \log x$. Using the constraints $f(X_i) = y_i$, one obtains the following linear system:

$$Y = Ma + Nb, \qquad (5)$$

where $M$ is an $m \times m$ matrix with entries $M_{ij} = G(X_i, X_j)$ and $N$ is an $m \times 3$ matrix with rows $[1 \; X_i^{T}]$. The system is subject to the orthogonality condition [see (6)] $N^{T} a = 0$. It turns out that the matrix $M$ is non-singular and the system (5) can be solved by

$$b = (N^{T} M^{-1} N)^{-1} N^{T} M^{-1} Y, \qquad (6)$$
$$a = M^{-1}(Y - Nb). \qquad (7)$$

The hyper-parameter $\lambda$ must be set in advance. In practice, $\lambda$ is chosen through a hyper-parameter tuning process such as $k$-fold cross validation (see [(3)] and [(7)]).
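To make equations (4)-(7) concrete, the following is a compact NumPy sketch of the closed-form TPS solve. The R fields package is what is actually used in Section 4; folding the regularization parameter into the kernel matrix as M + λI is our assumption about the smoothed variant, not a detail taken from the paper.

```python
import numpy as np

def tps_kernel(r):
    """Radial basis u(r) = r^2 log(r), with u(0) = 0 by convention."""
    out = np.zeros_like(r, dtype=float)
    mask = r > 0
    out[mask] = r[mask] ** 2 * np.log(r[mask])
    return out

def fit_tps(X, y, lam=0.0):
    """Solve the linear system (5)-(7) for the TPS coefficients a and b.

    X   : (m, 2) array of (rating, tenor) coordinates of the known points
    y   : (m,) array of observed yields
    lam : smoothing parameter; lam = 0 gives exact interpolation, while
          lam > 0 uses the common M + lam*I smoothing variant (an assumption).
    """
    m = X.shape[0]
    r = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    M = tps_kernel(r) + lam * np.eye(m)
    N = np.hstack([np.ones((m, 1)), X])                # rows [1, x1, x2]
    M_inv_Y = np.linalg.solve(M, y)
    M_inv_N = np.linalg.solve(M, N)
    b = np.linalg.solve(N.T @ M_inv_N, N.T @ M_inv_Y)  # eq. (6)
    a = np.linalg.solve(M, y - N @ b)                  # eq. (7)
    return a, b

def tps_evaluate(X_new, X, a, b):
    """Evaluate f(x) = sum_i a_i G(x, X_i) + b_0 + b_1 x1 + b_2 x2 (eq. (4))."""
    r = np.linalg.norm(X_new[:, None, :] - X[None, :, :], axis=-1)
    return tps_kernel(r) @ a + np.hstack([np.ones((X_new.shape[0], 1)), X_new]) @ b
```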
4. Experimental Results
The data are collected from Bloomberg (under the University of Western Ontario licence) and consist of mid-yields of corporate industrial bond indices. There are 13 indices (AAA, AA, A+, A, A-, BBB+, BBB, BBB-, BB+, BB, BB-, B+, B), and each index provides yields at 15 tenors (3m, 6m, 1y, 2y, 3y, 4y, 5y, 7y, 8y, 9y, 10y, 15y, 20y, 25y, 30y), resulting in a 2D surface with 195 points for each observation date. The yields are computed by Bloomberg from constituent bonds and are taken as given. The data are daily and range from Jan 29, 2018 to April 27, 2018 (63 observations).

Yields at 20, 25 and 30 years are unavailable for the BB+, BB, BB-, B+ and B indices. These missing points are populated by finding the spread between the missing tenor and the 15-year tenor for the generic double-B and single-B industrial corporate indices, and assuming that the same spread applies to the corresponding more granularly rated indices; a sketch of this fill is given below.

It is observed that the yields are weakly monotonic in rating, with the exception of the B+ and B indices at the long end of the curve for a single observation. Yields are also monotonic in tenor over the period observed, except in the very long end of the curve for some indices.
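The following is a small illustrative sketch of the spread-based fill just described. The DataFrame layout, index names and helper name are assumptions for illustration, not the actual Bloomberg data format.

```python
import pandas as pd

def fill_long_end(yields, granular_indices, generic_index,
                  missing_tenors=("20y", "25y", "30y"), anchor="15y"):
    """Populate missing long-tenor yields for granular rated indices.

    For each missing tenor, the spread of the generic index over its own
    15y point is added to the granular index's 15y yield. `yields` is a
    DataFrame indexed by index name with tenors as columns (illustrative
    layout only).
    """
    filled = yields.copy()
    for tenor in missing_tenors:
        spread = yields.loc[generic_index, tenor] - yields.loc[generic_index, anchor]
        for idx in granular_indices:
            filled.loc[idx, tenor] = yields.loc[idx, anchor] + spread
    return filled
```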
Training and test data for the neural networks are constructed as follows. First, the yields are scaled such that the maximum yield is 1. This aligns the model inputs and outputs with the range of the sigmoid activation function. Then, 10 percent of the observations are held out for testing. Since the purpose of the algorithm is to reconstruct a rate surface from known inputs, rather than to predict a rate surface out of time, the test set is selected at random from the data.

Finally, to provide the algorithm with additional information for training, each observation from both the training and test sets is repeated 10 times and subjected to a randomized corruption procedure. Following [(9)], a fixed number of elements from each example are chosen at random and their value is forced to 0, creating sparse rate surfaces. Each neural network is then trained to reconstruct complete rate surfaces using the sparse surfaces as inputs; a sketch of this construction follows.
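A sketch of this construction, under stated assumptions: the corrupt() helper is the masking routine sketched in Section 3, the corruption level ν = 0.75 is taken from Table 1, and the exact split, rounding and random seed used for the paper's experiments are not reproduced here.

```python
import numpy as np

def make_datasets(surfaces, nu=0.75, n_copies=10, test_frac=0.10, seed=0):
    """surfaces: array of shape (n_days, 13, 15) of observed rate surfaces."""
    rng = np.random.default_rng(seed)
    scaled = surfaces / surfaces.max()          # scale so the maximum yield is 1
    idx = rng.permutation(len(scaled))
    n_test = int(round(test_frac * len(scaled)))
    test, train = scaled[idx[:n_test]], scaled[idx[n_test:]]

    def expand(batch):
        """Repeat each surface n_copies times and corrupt each copy."""
        targets = np.repeat(batch, n_copies, axis=0)
        inputs = np.stack([corrupt(s, nu, rng) for s in targets])
        return inputs, targets

    return expand(train), expand(test)
```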
The neural networks are implemented in Keras with a TensorFlow backend. Adam optimization with standard parameter values and mean squared error loss is used for both networks. The Python package Hyperas, with its built-in Tree-of-Parzen-Estimators (TPE) algorithm [(11)], is used to conduct hyperparameter optimization for the learning rate, decay rate, batch size, number of nodes in the FCNN hidden layer, and number of CNN layers and filters. The Rectified Linear Unit (ReLU) is used as the activation function for hidden layers, and the sigmoid as the activation function for the output layer. Batch normalization is used after each hidden layer activation in both networks.

The FCNN uses a single overcomplete hidden layer (see Figure 2). Implicit in this structure is the idea that the network must learn relationships between each pair of ratings in the surface, no matter the distance between them in tenor/rating space. However, economic intuition suggests that adjacent yields on the surface will be strongly correlated. The CNN captures this feature in its network architecture, which consists of several layers of convolutions with 3x3 filters and pooling, followed by convolutions and upsampling to restore the rate surface dimensions. As such, the CNN has considerably fewer parameters than the FCNN.
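A minimal Keras sketch of a convolutional encoder/decoder of this kind is given below. The number of layers, filter counts and pooling factors are illustrative placeholders, not the hyperparameter-tuned network reported in the paper.

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_cnn_dae(input_shape=(13, 15, 1)):
    """Convolutional DAE: 3x3 convolutions with pooling in the encoder,
    convolutions with upsampling in the decoder to restore the
    rating/tenor surface dimensions. Filter counts are placeholders."""
    inputs = keras.Input(shape=input_shape)
    # Encoder
    x = layers.Conv2D(16, 3, padding="same", activation="relu")(inputs)
    x = layers.BatchNormalization()(x)
    x = layers.MaxPooling2D(pool_size=(1, 3))(x)   # pool along the tenor axis only
    x = layers.Conv2D(8, 3, padding="same", activation="relu")(x)
    x = layers.BatchNormalization()(x)
    # Decoder
    x = layers.UpSampling2D(size=(1, 3))(x)
    x = layers.Conv2D(16, 3, padding="same", activation="relu")(x)
    x = layers.BatchNormalization()(x)
    outputs = layers.Conv2D(1, 3, padding="same", activation="sigmoid")(x)
    model = keras.Model(inputs, outputs)
    model.compile(optimizer="adam", loss="mse")
    return model
```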
The Tps function in the R package fields v9.6 was used to fit the TPS surfaces; we refer the reader to [(7)] for the details and setup of this functionality. The TPS algorithm was fitted to each of the 70 sparse matrices in the test set, and the full matrix was then predicted using the calibrated TPS parameters. The results were compared with the actual full test matrices to derive the relevant error metrics (see Section 4.4 below).

Figure 5: A sample full/sparse yield matrix (Source: Bloomberg).

Average performance on the 70 test set examples is shown in the table below.

    Metric           TPS   FCNN   CNN
    MAE (bps)         11      8    10
    MAE (percent)      3      2     3
    RMSE (bps)        18     11    13
    RMSE (percent)     4      3     3

Table 1: Test set performance (ν = 0.75).

The Mean Absolute Error (MAE) and Root Mean Square Error (RMSE) are calculated over all points on the rate surface for all test set examples, and the values in percent are computed relative to the true observed yields (a sketch of the calculation is given at the end of this section).

For the Bloomberg data set, the two networks have similar performance. Recall that this time series is extremely short, with only 2 months of observations. The shape of the surface over this time period is rather stable.

The FCNN and CNN procedures were also run on a proprietary data set with a substantially longer history, including a variety of different surface shapes. For this data set, the CNN outperforms the FCNN, suggesting that it may have superior flexibility in learning a variety of tasks.
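The error metrics of Table 1 can be computed as sketched below. The conversion to basis points assumes the yields are expressed in decimal form (e.g. 0.035 for 3.5 percent), which is an assumption about the data scaling rather than a detail stated in the paper.

```python
import numpy as np

def error_metrics(y_true, y_pred):
    """MAE and RMSE over all points of all test surfaces,
    in basis points and as a percentage of the true yields."""
    err = y_pred - y_true
    mae_bps = np.mean(np.abs(err)) * 1e4
    rmse_bps = np.sqrt(np.mean(err ** 2)) * 1e4
    mae_pct = np.mean(np.abs(err) / y_true) * 100
    rmse_pct = np.sqrt(np.mean((err / y_true) ** 2)) * 100
    return {"MAE (bps)": mae_bps, "MAE (%)": mae_pct,
            "RMSE (bps)": rmse_bps, "RMSE (%)": rmse_pct}
```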
5. Summary of Results
This paper has demonstrated a novel financial application of well-known algorithms from the image recognition literature. FCNN and CNN DAEs are shown to be capable of extracting features from liquid markets. Assuming that these same features are present in illiquid markets, the algorithms can be used to estimate missing information. This paper provides an example for corporate bonds, but similar approaches are likely to be fruitful in other areas such as equity volatility surfaces.

We note that the algorithm does not necessarily respect monotonicity with respect to bond ratings in all cases; in the example we chose, about 10 percent of the estimated yields violate this condition. While in principle this could be corrected by simply adjusting the loss function, we are more interested in finding a solution that allows the algorithm to discover this feature without prior knowledge of it.
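For reference, a simple check of the rating-monotonicity condition mentioned above (yields should be non-decreasing as credit quality worsens) could look as follows; this is a hypothetical helper for illustration, and the 10 percent figure quoted in the text was not produced by this exact code.

```python
import numpy as np

def rating_monotonicity_violations(surface, tol=0.0):
    """Fraction of grid points where the yield decreases as the rating
    worsens (rows ordered from AAA down to B), beyond a tolerance."""
    diffs = np.diff(surface, axis=0)   # next (worse) rating minus current
    return np.mean(diffs < -tol)
```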
References
1. I. Goodfellow, Y. Bengio, A. Courville. Deep Learning. MIT Press, 2017.
2. P. Hagan, G. West. Methods for Constructing a Yield Curve. Wilmott Magazine, May 2008.
3. T. Hastie, R. Tibshirani, J. Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer Series in Statistics, 2016.
4. J. Hyman. Accurate monotonicity preserving cubic interpolation. SIAM Journal on Scientific and Statistical Computing, 4(4):645-654, 1983.
5. J. Masci, U. Meier, D. Ciresan, J. Schmidhuber. Stacked Convolutional Auto-Encoders for Hierarchical Feature Extraction. In Artificial Neural Networks and Machine Learning (ICANN 2011), pp. 52-59, 2011.
6. J. Meinguet. Multivariate interpolation at arbitrary points made simple. Journal of Applied Mathematics and Physics (ZAMP), vol. 30, pp. 292-304, 1979.
7. D. Nychka, R. Furrer, J. Paige, S. Sain. Package fields. CRAN, 2015.
8. V. Vapnik. An Overview of Statistical Learning Theory. IEEE Transactions on Neural Networks, vol. 10, no. 5, 1999.
9. P. Vincent, H. Larochelle, Y. Bengio, P. Manzagol. Extracting and composing robust features with denoising autoencoders. In Proceedings of the 25th International Conference on Machine Learning, pp. 1096-1103. ACM, 2008.
10. G. Wahba. Spline Models for Observational Data. CBMS-NSF Regional Conference Series in Applied Mathematics, SIAM, 1990.
11. J. Bergstra, D. Yamins, D. D. Cox.