Surrogate Monte Carlo
A. Christian Silva∗ and Fernando F. Ferreira†
Department of Physics-FFCLRP, Universidade de São Paulo (USP), Ribeirão Preto-SP, 14040-901, Brazil
and
Centre for Interdisciplinary Research on Complex Systems, Universidade de São Paulo (USP), 03828-000 São Paulo, Brazil
This article proposes an artificial data generating algorithm that is simple and easy to customize. The fundamental concept is to perform random permutations of Monte Carlo generated random numbers which conform to the unconditional probability distribution of the original real time series. Similar to constrained surrogate methods, random permutations are only accepted if a given objective function is minimized. The objective function is selected in order to describe the most important features of the stochastic process. The algorithm is demonstrated by producing simulated log-returns of the S&P 500 stock index.
I. INTRODUCTION
The generation of artificial time series has traditionally been associated with Monte Carlo simulations. For instance, one assumes a stochastic process and finds the parameters of such a process by fitting the stochastic model to the observed time series. Once the parameters are known, the researcher generates artificial data by Monte Carlo simulations. One of the simplest examples is the use of Geometric Brownian motion to model stock prices [6]. Other examples span the most diverse research areas, from fluid dynamics to traffic flow [7, 11, 15].

In recent years, deep learning methods have been proposed as an alternative to the traditional Monte Carlo simulation [2, 5]. Using deep neural networks can bypass the specification of the stochastic process and therefore be model-free. In this scenario the researcher has to train the model on the empirical time series without necessarily understanding the underlying dynamics. The benefit of such a methodology is that one can, in principle, produce high-fidelity artificial data with features that cannot be modeled by traditional methods, either due to their complexity or because prior research was not able to uncover such features. The risk of using deep learning is of course overfitting and, for some, the black-box nature of such models.

This article proposes a methodology that is also model-free by combining random number generation and surrogates. Surrogates have a long history in the physical sciences, most notably in the non-linear dynamics/chaos communities [9]. The general idea is to transform the original time series by performing suitable permutations of the observations such that the resulting time series is composed of exactly the same observations but in a different order. This order is chosen to satisfy some constraints (the constrained surrogate generation method of [14]).

∗ [email protected] † ferff[email protected]
The goal is to perform hypothesis tests by comparing the original time series and the surrogate, since both have exactly the same unconditional distribution [10]. A typical application is to test whether the original time series is generated by a non-linear process by comparing surrogates which preserve the linear autocorrelations (power spectrum) of the original time series but nothing else [10]. If one requires more and more constraints on the permutations when generating surrogates, fewer surrogates can be generated. In the limit, if one insists on having dynamics which are identical to the original time series, the only surrogate is the time series itself. The recipe for producing surrogate time series which are statistically similar to the original is to require the “correct” set of constraints which capture the essence of the dynamics.

Surrogates are traditionally created using the exact same observations of the measured time series, which is not desired when simulating. The idea is to simulate from the “correct” joint probability distribution without necessarily modeling such a distribution. The method proposed here is to first draw random numbers from the unconditional stationary probability distribution of the original time series. This creates an independent and identically distributed time series of random numbers. One then permutes such random numbers imposing constraints, much like what is done for surrogates [14]. This procedure produces an artificial time series whose joint probability distribution agrees with the measured joint probability density within some accuracy.

Notice, however, that this procedure might not converge since, contrary to the standard surrogate method, we are now applying permutations to random numbers which have a low probability of containing corner-case samples (too many zeros, etc.). If that happens, no permutation might satisfy the constraints within the desired accuracy.
This risk is particularly high if the time series is “small” and the number of constraints is “high”.

The next sections detail the algorithm and illustrate the methodology by comparing artificial daily log-returns of the S&P 500 with the actual returns. Finally, the appendix presents further details on the convergence of the algorithm as well as surrogate Monte Carlo applied to a known AR(1) model as a simple sanity check.

II. UNCONDITIONAL PROBABILITY DENSITY
There are many strategies that can be used to produce random numbers with an unconditional probability distribution that agrees approximately with observations. Parametrically, one popular approach in finance is to assume Student's t-distribution [1]. However, the choice here is to remain non-parametric and to adapt the inverse transform method [6] to the discrete empirical distribution of the asset returns.

The algorithm is as follows. First one builds the empirical cumulative distribution function (CDF) of the asset returns X. In practice, the CDF is a table of values that maps a given asset return (x) to a number between zero and one (u). The next step is to generate uniform random numbers between zero and one (U ∼ Unif[0, 1]) and find the return x which corresponds to the number u by looking at the CDF table.

Note that both U and X are real numbers and therefore there is no perfect match to any value in the table. This issue is resolved by linear interpolation. Therefore any value of u will have a corresponding x except when u → 0, 1. These limiting cases are extrapolations which have to be decided by additional considerations. In this work, the choice is to truncate to the largest and smallest historical x. Therefore when u → 0, 1, x → x_min, x_max respectively. This is the simplest option; however, a parametric model for the tails could be a better choice, since it is very difficult to envision any other method given the very few data points.

Figure 1 compares the empirical CDF of the daily S&P 500 log-returns with the Monte Carlo generated log-returns of price P defined by

x_t = log( P_D / P_{D−t} )    (1)

where x_t is the return over time interval t and P_D is the price on day D. The SPY exchange traded fund is used as a proxy for the S&P 500 prices, with data adjusted for dividends. The close prices, downloaded from Yahoo Finance, go from 1993-02 to 2020-08, a total of 6930 days.

Figure 1 shows that the agreement between data and simulated data is good over many orders of magnitude. Notice that the CDF is folded (also known as a mountain plot [12]), that is, if CDF(x) is defined as ∫_{−∞}^{x} p(u) du, then the plot shows CDF(x) for x ≤ 0 and 1 − CDF(x) for x > 0. This particular presentation of the CDF highlights the tails of the distribution, which are much harder to estimate and simulate.

FIG. 1. Cumulative distribution function (CDF) of the S&P 500 daily log-returns from 1993 to 2020 (black circles) together with the Monte Carlo generated log-returns (red solid line).
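The empirical inverse transform described above can be sketched in a few lines. This is a minimal illustration, not the authors' code (the function name `empirical_inverse_transform` is my own); it assumes linear interpolation of the sorted sample and truncation at the historical extremes, as in the text.

```python
import numpy as np

def empirical_inverse_transform(returns, n, rng=None):
    """Draw n i.i.d. samples whose distribution matches the empirical
    CDF of `returns`, using linear interpolation between sorted values.
    Tails are truncated at the historical min and max, as in the text."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.sort(np.asarray(returns, dtype=float))
    # Empirical CDF levels assigned to the sorted observations.
    u_grid = np.arange(1, len(x) + 1) / (len(x) + 1)
    u = rng.uniform(0.0, 1.0, size=n)
    # np.interp clamps u outside [u_grid[0], u_grid[-1]], which realizes
    # the truncation u -> 0, 1  =>  x -> x_min, x_max.
    return np.interp(u, u_grid, x)
```

Calling this with the historical log-returns produces the i.i.d. sample that the surrogate step of Section III will then reorder.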
III. SURROGATES
The idea of using the constrained randomization algorithm introduced by [14] presumes that one has a set of features which can be measured, and that these features are enough to approximate the joint probability density of which the measured time series is one realization.

The key issue is to select such features, and it is here that deep learning methods are most practical, since in principle these methods would not require the end user to know such features a priori. This automation is not without dangers and represents a paradigm shift which is sometimes at odds with the traditional scientific method of recognizing the fundamental dynamical drivers [4]. The current work takes a more traditional approach; however, learning algorithms could be used in conjunction with the ideas proposed here (to identify such features, for instance).

The approach taken here leverages extensive research validated over the past 25 years that has identified some of the most relevant features for financial assets (stylized facts) [3]. In particular, volatility clustering and the leverage effect are well documented features which should be present when producing artificial asset returns. The first is responsible for the long memory dependence of the volatility and can be identified by the slow decay of the autocorrelation of the absolute returns, and the second by the relation between future volatility and present returns.

Given N observed elements x_t and a Monte Carlo generated time series of elements z_t, one starts by defining an objective function. The objective function is a sum of cross-correlation functions which capture the autocorrelation of the returns as well as the leverage effect and volatility clustering. The correlation function is defined as

C_{f,g}(τ) = ⟨f(u_t) g(u_{t−τ})⟩ / ( √⟨f²(u_t)⟩ √⟨g²(u_t)⟩ )    (2)

where f(u) and g(u) are arbitrary functions of the variable u and ⟨·⟩ stands for time series averages.
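As a concrete reading of Equation (2), a direct numerical implementation could look as follows (a sketch; the name `cross_corr` is my own):

```python
import numpy as np

def cross_corr(f_vals, g_vals, tau):
    """C_{f,g}(tau) of Eq. (2): the lagged product average
    <f(u_t) g(u_{t-tau})> normalized by sqrt(<f^2>) sqrt(<g^2>).
    f_vals and g_vals are the series f(u_t) and g(u_t)."""
    f_vals = np.asarray(f_vals, dtype=float)
    g_vals = np.asarray(g_vals, dtype=float)
    num = np.mean(f_vals[tau:] * g_vals[:len(g_vals) - tau])
    den = np.sqrt(np.mean(f_vals ** 2)) * np.sqrt(np.mean(g_vals ** 2))
    return num / den
```

With `f_vals` and `g_vals` both equal to the demeaned returns, `tau = 0` gives 1 (up to floating point), recovering the textbook normalization.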
For example, if f(u) = g(u) = u − ⟨u⟩, then C_{u−⟨u⟩, u−⟨u⟩}(τ) is the usual textbook lag-τ autocorrelation function.

In order to keep the notation simple and without loss of generality, the returns x_t have the mean removed before calculating C in Equation (2). The quantity of interest here is

ρ(x) = Σ_{τ=1..L} [ C_{x,x}(τ) + C_{x,|x|}(τ) ] + Σ_{τ=1..K} [ C_{|x|,|x|}(τ) + C_{x²,x²}(τ) ]    (3)

where the first term accounts for the autocorrelation of the returns, the second term accounts for the leverage effect, the last two terms for the volatility clustering, and L and K are the maximum lags τ included in the sums. Finally, the objective function ∆ is a function of Equation (3):

∆ = |ρ(x) − ρ(z)|    (4)

where x is the time series of the actual data and z is the time series of the Monte Carlo surrogates. Equation (4) is minimized using the simulated annealing proposed in [14]. The simulated annealing algorithm is implemented by the nonlinear time series analysis software TISEAN [8] using its default parameters. A simple sample code written in Julia that illustrates the algorithm for the S&P 500 data can be found on GitHub [13].

Figure 2 compares the S&P 500 time series with the surrogate Monte Carlo generated time series after 100 million optimization steps. Surrogate Monte Carlo (SMC) produces data which is visually similar to the original time series. Notice that SMC log-returns show bursts of large and persistent returns as well as occasional sharp drops, similar to the actual returns.

FIG. 2. Left panel: time series of the price of the S&P 500 together with three realizations of the surrogate Monte Carlo generated data (price of $1 invested in 1993). Right panel: daily log-returns of the artificial time series as well as the actual S&P 500 data from 1993-02 to 2020-08.
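The paper minimizes Equation (4) with TISEAN's simulated annealing; to illustrate the mechanics, the sketch below uses a greedy swap rule instead (a pair swap is kept only if ∆ does not increase). The function names and the greedy acceptance are my simplifications, not the authors' implementation.

```python
import numpy as np

def rho(x, L, K):
    """rho(x) of Eq. (3) for a demeaned series x: autocorrelation and
    leverage terms up to lag L, volatility-clustering terms up to lag K."""
    x = np.asarray(x, dtype=float)
    a, s = np.abs(x), x ** 2

    def c(f, g, tau):  # C_{f,g}(tau) of Eq. (2)
        num = np.mean(f[tau:] * g[:len(g) - tau])
        return num / (np.sqrt(np.mean(f ** 2)) * np.sqrt(np.mean(g ** 2)))

    total = sum(c(x, x, t) + c(x, a, t) for t in range(1, L + 1))
    total += sum(c(a, a, t) + c(s, s, t) for t in range(1, K + 1))
    return total

def smc(z, rho_target, L, K, steps, rng):
    """Greedy surrogate Monte Carlo: swap random pairs of the Monte Carlo
    sample z, keeping a swap only if Delta = |rho(z) - rho_target|
    (Eq. (4)) does not increase."""
    z = np.array(z, dtype=float)
    delta = abs(rho(z, L, K) - rho_target)
    for _ in range(steps):
        i, j = rng.integers(0, len(z), size=2)
        z[i], z[j] = z[j], z[i]
        new_delta = abs(rho(z, L, K) - rho_target)
        if new_delta <= delta:
            delta = new_delta
        else:
            z[i], z[j] = z[j], z[i]  # revert the swap
    return z, delta
```

In practice one would pass a sample `z` drawn by the inverse transform of Section II and `rho_target = rho(x, L, K)` of the demeaned historical returns; annealing (occasionally accepting uphill moves) helps escape the local minima a purely greedy rule can get stuck in.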
Figure 3 illustrates the convergence of the optimization algorithm using L = 40 and K = 200 in Equation (3). This choice of values for L and K was sufficient for convergence within the 99% confidence level. The solid red line is SMC generated data and the black circles the actual data. The agreement is very good.

IV. CONCLUSION
The surrogate Monte Carlo algorithm builds on the idea that simulated data is an approximation of the actual data. Therefore, if one knows the most important features that describe the data well, one can produce data that complies with these features using the simple algorithm described here. This approach is different from the traditional use of surrogates in hypothesis testing, where one tries to isolate a feature to test the significance of another [10].

FIG. 3. Autocorrelation as a function of time lag in days. Symbols are for the S&P 500 and the solid red lines come from the surrogate Monte Carlo algorithm. Dashed horizontal lines: 99% confidence level. Top: autocorrelation of the daily absolute returns (C_{|x|,|x|}(τ)). Middle: cross-correlation of the daily returns with the daily absolute returns (C_{x,|x|}(τ)). Bottom: autocorrelation of the daily returns (C_{x,x}(τ)).

The complexity of this algorithm lies in deciding on the features one wants to use as constraints, as well as in its computational cost. The optimization problem is NP-hard with no general convergence guarantee. However, the experiments conducted in this article did show good convergence. The algorithm appears to be well suited to model stochastic processes in general and financial data in particular.

V. APPENDIX

A. Details on convergence
The importance of the probability density is illustrated in this section. Suppose one wants to exactly replicate a deterministic function by reordering random numbers. This is precisely surrogate Monte Carlo (SMC) where the objective function is the deterministic function itself. The quality of the replication then depends only on the random number sample. First, one needs to draw random numbers which conform with the probability density of the actual process. Second, one needs to have enough samples.

Take the deterministic function y(t) = sin(2πt/T) with a period of T = 200. Figure 4 illustrates the importance of using the “correct” unconditional probability density by comparing the empirical probability density built by looking at the sinusoidal time series of 10000 data points with the uniform probability density between minus one and one.

The experiment is as follows: first draw 10000 random numbers using either probability distribution (PDFs illustrated in the first column of Figure 4). Next, apply random permutations to these random numbers per the SMC algorithm with the goal of replicating the original sinusoid. Columns 2 and 3 of Figure 4 show the quality of the agreement after stopping the algorithm at 100 million steps. In particular, the phase diagram (middle column) is a circle of radius one, as theoretically expected, whereas if one starts with uniform random numbers one recovers a nearly solid disk.

FIG. 4. Top row: histogram of the sine function from which SMC draws random numbers; to the right, the phase space diagram of the sine function (solid line) and the recovered diagram (points) obtained by applying SMC to the random numbers; finally, the top right shows one period of the sine function together with the average plus/minus one standard deviation over 50 periods of the SMC reconstructed sine. Bottom row: the histogram of uniform random numbers between minus one and plus one and the resulting phase diagram, followed by the one-period time series. The difference between the rows illustrates the importance of using the correct probability density.
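The sine experiment can be reproduced in miniature with a greedy stand-in for the annealing (a sketch; the name `smc_sine` and the step counts are my own). Because the objective here is pointwise distance to the deterministic target, the change in squared error caused by swapping two positions can be computed locally:

```python
import numpy as np

def smc_sine(n=400, T=200, steps=300_000, seed=0):
    """Reorder draws from the sine's own amplitude distribution so that
    they approximate y(t) = sin(2*pi*t/T); a swap of positions i and j
    is kept only if it does not raise the squared error to the target."""
    rng = np.random.default_rng(seed)
    t = np.arange(n)
    target = np.sin(2 * np.pi * t / T)
    # Sampling the sine at uniformly random phases reproduces its
    # (arcsine) unconditional amplitude distribution.
    z = np.sin(2 * np.pi * rng.uniform(size=n))
    for _ in range(steps):
        i, j = rng.integers(0, n, size=2)
        # Error change is restricted to the two affected positions.
        old = (z[i] - target[i]) ** 2 + (z[j] - target[j]) ** 2
        new = (z[j] - target[i]) ** 2 + (z[i] - target[j]) ** 2
        if new <= old:
            z[i], z[j] = z[j], z[i]
    return target, z
```

Starting instead from the uniform density, `z = rng.uniform(-1, 1, n)`, leaves a visibly worse fit, which is the point of Figure 4.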
The effect of the length of the time series can be illustrated by comparing a time series of 500 points with the original time series of 10000 points for the same sinusoid. Fewer samples clearly affect the capacity to empirically estimate the probability density, and therefore the simulated time series is a worse approximation of the original function. Figure 5 shows that the recovered sinusoid (red circles) has gaps and its phase diagram is very noisy compared to the expected unit circle.

FIG. 5. Phase diagram of the SMC reconstructed sinusoid starting with 500 (0.5K) data points (red circles) compared to 10000 (10K) data points (blue triangles). The quality of the recovered sinusoid using 500 data points is much worse.
B. Toy example
We simulate an AR(1) time series with coefficient p = 0.6 and apply the surrogate Monte Carlo (SMC) algorithm to it. Figure 6 shows that SMC recovers the empirical autocorrelation of the AR(1)(p = 0.6) process.

One can also get an idea of the rate of convergence of the SMC algorithm by applying SMC to AR(1) processes with different autocorrelation coefficients. It is expected that larger autocorrelation will lead to a longer run time, since one starts with a time series of white noise and then recovers the autocorrelation by re-arranging the order of the elements. Larger autocorrelation therefore reduces the number of possible permutations of the original time series that conform with the constraints. In our experiments, the SMC algorithm takes approximately three times as long to converge for the larger of the two coefficients tested.
FIG. 6. Autocorrelation function of an AR(1) process for 10000 data points (symbols) as well as the theoretical curve (solid blue line). The surrogate Monte Carlo (SMC) algorithm starts from independent Gaussian random variables and converges to the empirical autocorrelation of the AR(1) process after approximately 600 thousand iterations.
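A minimal version of this sanity check can be sketched as follows (names and a greedy rule in place of annealing are my own; the run reported in Figure 6 uses 10000 points and roughly 600 thousand iterations, far more than this toy):

```python
import numpy as np

def ar1_smc(n=500, p=0.6, lags=5, steps=30_000, seed=2):
    """Draw fresh Gaussian white noise and permute it until its
    autocorrelations at lags 1..lags approach those of an AR(1)(p)
    sample, mirroring the appendix experiment."""
    rng = np.random.default_rng(seed)
    x = np.zeros(n)
    for t in range(1, n):
        x[t] = p * x[t - 1] + rng.standard_normal()
    x -= x.mean()

    def acf(y):
        return np.array([np.mean(y[k:] * y[:-k])
                         for k in range(1, lags + 1)]) / np.mean(y * y)

    target = acf(x)
    z = rng.standard_normal(n)   # i.i.d. start: acf ~ 0 at every lag
    z -= z.mean()
    cost = np.sum((acf(z) - target) ** 2)
    start_cost = cost
    for _ in range(steps):
        i, j = rng.integers(0, n, size=2)
        z[i], z[j] = z[j], z[i]
        new_cost = np.sum((acf(z) - target) ** 2)
        if new_cost <= cost:
            cost = new_cost
        else:
            z[i], z[j] = z[j], z[i]  # revert the swap
    return target, acf(z), start_cost, cost
```

Running this with a larger coefficient `p` requires noticeably more steps for the same final cost, consistent with the run-time observation above.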
VI. ACKNOWLEDGMENTS
ACS thanks Andrei Da Silva for his help in writing the sample Julia code.

[1] Lisa Borland. Financial Market Models, pages 257–273. 11 2018.
[2] Hans Buehler, Blanka Horvath, Terry Lyons, Imanol Perez Arribas, and Ben Wood. A data-driven market simulator for small data environments, 2020.
[3] Anirban Chakraborti, Ioane Muni Toke, Marco Patriarca, and Frederic Abergel. Econophysics review: I. Empirical facts. Quantitative Finance, 11(7):991–1012, 2011.
[4] Bradley Efron. Prediction, estimation, and attribution. Journal of the American Statistical Association, 115(530):636–655, 2020.
[5] Cristóbal Esteban, Stephanie L. Hyland, and Gunnar Raetsch. Real-valued (medical) time series generation with recurrent conditional GANs, 2017.
[6] P. Glasserman. Monte Carlo Methods in Financial Engineering. Applications of mathematics: stochastic modelling and applied probability. Springer, 2004.
[7] Paul Grassia, E. Hinch, and Ludwig Nitsche. Computer simulations of Brownian motion of complex systems. Journal of Fluid Mechanics, 282:373–403, 01 1995.
[8] Rainer Hegger, Holger Kantz, and Thomas Schreiber. TISEAN 3.0.1.
[9] H. Kantz and T. Schreiber. Nonlinear Time Series Analysis. Cambridge nonlinear science series. Cambridge University Press, 2004.
[10] Gemma Lancaster, Dmytro Iatsenko, Aleksandra Pidde, Valentina Ticcinelli, and Aneta Stefanovska. Surrogate data for hypothesis testing of physical systems. Physics Reports, 748:1–60, 2018.
[11] Alan McKane. Brownian agents and active particles: Collective dynamics in the natural and social sciences. Journal of Physics A: Mathematical and General, 36:12348, 11 2003.
[12] Katherine L. Monti. Folded empirical distribution function curves—mountain plots. The American Statistician, 49(4):342–345, 1995.
[13] https://github.com/silvaac/SMCarticle
[14] Thomas Schreiber. Constrained randomization of time series data. Phys. Rev. Lett., 80:2105–2108, Mar 1998.
[15] F. Schweitzer. Brownian Agents and Active Particles: Collective Dynamics in the Natural and Social Sciences. Springer.