A Comparison of Economic Agent-Based Model Calibration Methods
Donovan Platt∗
Mathematical Institute, University of Oxford
Institute for New Economic Thinking (INET) at the Oxford Martin School

Abstract
Interest in agent-based models of financial markets and the wider economy has increased consistently over the last few decades, in no small part due to their ability to reproduce a number of empirically-observed stylised facts that are not easily recovered by more traditional modelling approaches. Nevertheless, the agent-based modelling paradigm faces mounting criticism, focused particularly on the rigour of current validation and calibration practices, most of which remain qualitative and stylised fact-driven. While the literature on quantitative and data-driven approaches has seen significant expansion in recent years, most studies have focused on the introduction of new calibration methods that are neither benchmarked against existing alternatives nor rigorously tested in terms of the quality of the estimates they produce. We therefore compare a number of prominent ABM calibration methods, both established and novel, through a series of computational experiments in an attempt to determine the respective strengths and weaknesses of each approach and the overall quality of the resultant parameter estimates. We find that Bayesian estimation, though less popular in the literature, consistently outperforms frequentist, objective function-based approaches and results in reasonable parameter estimates in many contexts. Despite this, we also find that agent-based model calibration techniques require further development in order to definitively calibrate large-scale models. We therefore make suggestions for future research.
Keywords: Agent-based modelling, Calibration, Simulated minimum distance, Bayesian estimation
JEL Classification: C · C

∗ Corresponding author, [email protected]

Introduction

The modelling of economic systems presents a significant challenge – the heterogeneity of their constituent agents, the complex nature of the interactions within them, and the non-linearity of their emergent dynamics make them extremely difficult, if not impossible, to represent using traditional methods. As a result, the use of empirically-inconsistent assumptions and a disregard for the heterogeneity and non-linearity that characterise economic systems has for many decades been the dominant paradigm in economic modelling (Geanakoplos and Farmer; Farmer and Foley; Fagiolo and Roventini).

Such criticisms of traditional economic models, or more specifically those derived from general equilibrium theory and its various extensions, have become increasingly prominent in the wake of the Great Recession of the late 2000s, where such models, which had long been thought to be trustworthy, were shown to be inadequate, both in predicting the possibility of a financial crisis and in providing concrete solutions to resolve it (Farmer and Foley; Geanakoplos et al.; Fagiolo and Roventini). In this context, the need for substantial improvements to existing methodologies or the development of a viable alternative is clear.

Recent advances in computing power, along with successes in other domains such as ecology, have resulted in the emergence of a growing community arguing that agent-based models (ABMs), which simulate systems at the level of individual agents and the interactions between them, may provide a more principled approach to the modelling of the economy (Farmer and Foley; Fagiolo and Roventini). Indeed, recent decades have seen the emergence of a wide variety of economic and financial ABMs that largely dispense with the unrealistic assumptions that characterise traditional approaches in favour of more realistic alternatives rooted in empirically-observed behaviours (Chen; LeBaron).

This paradigm shift has ultimately resulted in a degree of success. In more detail, ABMs are well-known for their ability to replicate empirically-observed stylised facts, or qualitative properties that appear consistently in empirically-measured data, despite the fact that such properties are not readily recovered using traditional approaches (LeBaron; Barde). The literature regarding such stylised facts is very well-developed. An authoritative survey in the context of financial time series is presented by Cont and indicates that empirically-observed asset returns typically have a fat-tailed distribution, demonstrate no serial autocorrelation and present evidence of volatility clustering. In a more general economic context, the distribution of firm sizes is known to follow a Zipf distribution (Axtell) and the distribution of firm growth rates is typically fat-tailed (Dosi et al.). The preceding examples are by no means exhaustive and, in general, recent years have seen the emergence of ABMs that are increasingly ambitious in scope and capable of reproducing increasingly large sets of stylised facts.

Despite the aforementioned successes, ABMs face strong criticisms of their own, focused particularly on the inadequacy of current validation and calibration practices (Grazzini and Richiardi).
In the vast majority of studies, particularly those that introduce large-scale models, validation procedures are qualitative in nature and seldom venture beyond the demonstration of a candidate model's ability to reproduce a set of empirically-observed stylised facts (Panayi et al.; Guerini and Moneta). Calibration in such investigations is equally rudimentary, and typically takes the form of manual parameter adjustments or ad-hoc processes that aim to select parameters that allow the model to reproduce the set of stylised facts considered during validation. While such stylised fact-centric methodologies may seem reasonable at first glance, the very large number of models able to recover a similar number of stylised facts, the wide variety of behavioural rules employed in these models, and the difficulty experienced in attempting to identify the causal effects of many behavioural rules on emergent dynamics, render robust model comparison an impossibility when using qualitative, stylised fact-centric methods (LeBaron; Barde; Lamperti et al.). This leads to what is often referred to as the "wilderness of bounded rationality" problem (Geanakoplos and Farmer; Barde).

In response to these criticisms, a small but growing literature dealing with more sophisticated quantitative validation and calibration techniques has emerged (Fagiolo et al.). While significant progress has been made, particularly in the last three years, this literature still suffers from a number of key weaknesses. Firstly, it is overly compartmentalised. By this, we mean that most publications within this research area focus on the proposal of new methods and seldom compare the proposed techniques to other contemporary alternatives. This, combined with the fact that the theoretical properties of many of these new techniques are not well understood (Grazzini et al.), leaves the modeller with a difficult choice between a large number of methods with no obvious reason to favour one approach over another. Secondly, severe computational limitations have resulted in most techniques only ever being applied to highly-simplified models that are several decades old and no longer a good representation of the current state of economic agent-based modelling (Lamperti et al.; Fagiolo et al.). This leads to significant doubt regarding the applicability of current methods to the large-scale models that now dominate the literature.

We therefore aim to address the above using a principled, yet practical approach. Specifically, we compare a number of prominent ABM calibration methods, both established and novel, through a series of computational experiments involving the calibration of various candidate models in an attempt to determine the respective strengths and weaknesses of each approach.

Footnotes: This includes, but is not limited to, assumptions of perfect rationality and the existence of representative agents. A more detailed discussion on general equilibrium theory and criticisms thereof is presented by Geanakoplos and Farmer. Dynamic stochastic general equilibrium (DSGE) models are a prominent, contemporary example. See Macal and North for a brief introduction to agent-based modelling. See, for example, the Eurace (Cincotti et al.) and Schumpeter Meeting Keynes (Dosi et al.) models. The procedure employed by Jacob Leal et al. is a representative example.
Thereafter, we apply the most promising of the considered calibration techniques to a large-scale ABM of the UK housing market in order to assess the extent to which the performance achieved in the context of simple models is maintained when confronting state-of-the-art ABMs. Through these computational experiments, we obtain a broader and unified understanding of the current state of ABM calibration and make suggestions for future research.

As previously alluded to, there is a significant overlap between the ABM validation and calibration literature. In most cases, calibration is concerned with selecting model parameters that result in dynamics that are as close as possible to those observed in a particular dataset, measured according to some criterion. The same criterion may then also be used for validation purposes. Therefore, while we focus exclusively on the problem of ABM calibration, many of these discussions would also be applicable to ABM validation.

Figure 1: An illustration of the various calibration strategies one might consider when attempting to calibrate a particular ABM: direct observation, analytical methods, and simulation-based methods, with the latter divided into frequentist and Bayesian approaches.
At this point, it is worth noting that some authors use the terms calibration and estimation interchangeably, while others maintain that there are nuanced differences between them. As an example, Grazzini and Richiardi suggest that calibration is concerned only with obtaining agreement between the dynamics of a model and those of real-world data, while estimation additionally aims to ensure that the obtained model parameter values are an accurate reflection of the parameter values associated with the real-world data-generating process. Estimation would thus place additional emphasis on the uncertainty surrounding the obtained parameter values, while this may not be of much concern in the case of calibration. As stated by Hansen and Heckman, however, the distinction between these terms is often inconsistent and not entirely clear. We will therefore use the terms interchangeably, since the methods employed are likely to be very similar in either case.

We now proceed with a comprehensive review of the ABM calibration literature, beginning with the categorisation of ABM calibration strategies into three distinct classes, illustrated in Figure 1, which we then discuss in more detail.

Direct Observation

Since ABMs model systems by directly simulating the interactions of their microconstituents, it often arises that a number of model parameters reflect directly observable (or easily inferable) quantities, such as the net worth of firms or the distribution of ages among homeowners in a particular country. In this case, sophisticated estimation techniques are not required and the parameter values can simply be read directly from the data.

For some relatively simple models, such as the CATS model considered by Bianchi et al., values for almost all of the model's parameters can be determined in this way. While the same is not true for more sophisticated models, it still often arises that a large number of parameter values can be determined using a similar strategy, such as in the UK housing market model proposed by Baptista et al., where this process is referred to as micro-calibration.

Footnotes: In most cases, this would refer to the bias and consistency of the associated estimators. Perhaps the most notable of these is the Brock and Hommes (1998) model. These could include a set of stylised facts or some objective function.
Analytical Methods

Most ABMs include behavioural rules that require parameters that do not represent directly observable quantities. Therefore, even if appropriate values for many of a model's parameters can be determined through direct observation, it would still be necessary to apply statistical estimation techniques to a given dataset to select appropriate values for the remaining parameters. A logical first choice might be maximum likelihood estimation, or alternatively the method of moments, given that they are fairly general and well-understood methods.

In a very limited number of cases, maximum likelihood estimation has been applied successfully to ABMs (Alfarano et al.), but in general it and related methods are not appropriate as they rely on obtaining an analytical expression for the joint density function of time series simulated by the model. This is possible only for very simple models and even then is nontrivial.

Simulation-Based Methods

Owing to the fact that most ABMs of interest are incompatible with methods requiring analytical solutions for key model properties, it is inevitable that methods involving the generation of simulated data will need to be considered. Such approaches represent the key focus of most recent literature and we thus discuss them and related issues in extensive detail.

Footnote: The interested reader may also wish to refer to the surveys of Lux and Zwinkels and Fagiolo et al.

Frequentist Inference: Traditional Approaches to Simulated Minimum Distance
The vast majority of existing calibration attempts have adopted a frequentist approach, often employing variations of what are commonly referred to as simulated minimum distance (SMD) methods (Grazzini and Richiardi; Grazzini et al.). Broadly speaking, these methods involve the construction of an objective function that measures the distance between simulated and measured time series for a given set of parameters, followed by the application of optimisation methods. More precisely, we have

\[ \operatorname{argmin}_{\theta \in \Theta} f(\theta), \]

where θ is a vector of parameters, Θ is the space of feasible parameters, and f : Θ → ℝ is a function measuring the distance between real and simulated time series.

While truly general and standardised methods are yet to appear in the context of economic ABMs, methods which consider weighted sums of the squared errors between simulated and empirically-measured moments (or other quantities that can be estimated from time series data) rose to prominence in early literature (Gilli and Winker), and the method of simulated moments (MSM) has been applied in numerous investigations. Key motivations for the consideration of MSM include its prevalence throughout the econometric literature, its transparency (Franke), and its well-understood mathematical properties.

MSM is not without a number of shortcomings, however. The single most significant concern is that the selection of moments is entirely arbitrary, and a given set of moments will only represent a limited number of aggregate properties of the data and may therefore not sufficiently capture important dynamics. Further, the weight matrix obtained using the methodology described by Winker et al., the standard method in this context, may lead to numerical challenges in some cases due to the inversion of near-singular matrices (Fabretti; Platt and Gebbie).

A related approach is indirect inference (II), introduced by Gourieroux et al. Rather than relying on estimated moments, II involves the use of what is called an auxiliary model, essentially a simple model that is amenable to estimation via analytical methods such as maximum likelihood. An objective function is constructed by estimating the auxiliary model on both empirically-measured and simulated data and comparing the obtained parameters, with minimisation implying the greatest similarity between the two sets of estimated parameters. In general, II faces similar criticisms to those levelled against MSM (Grazzini et al.), since it involves the selection of an arbitrary auxiliary model.
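To make the structure of such an objective function concrete, the following sketch illustrates an MSM-style distance in Python. It is our own illustration rather than the implementation used in any of the cited studies: the moment set, the identity weight matrix, and the helper `simulate(theta, T, seed)` are assumptions standing in for whatever choices a given study makes.

```python
import numpy as np

def summary_moments(x):
    """A small, illustrative set of summary moments for a univariate series."""
    xc = x - x.mean()
    return np.array([
        np.var(x),                          # variance
        np.mean(xc ** 4) / np.var(x) ** 2,  # kurtosis
        np.corrcoef(x[:-1], x[1:])[0, 1],   # lag-1 autocorrelation
    ])

def msm_objective(theta, x_emp, simulate, n_reps=50, weight=None):
    """MSM-style distance: weighted squared difference between empirical moments
    and the ensemble-averaged moments of series simulated at parameter set theta."""
    m_emp = summary_moments(x_emp)
    m_sim = np.mean([summary_moments(simulate(theta, len(x_emp), seed=i))
                     for i in range(n_reps)], axis=0)
    w = np.eye(len(m_emp)) if weight is None else weight  # identity in place of an estimated weight matrix
    diff = m_sim - m_emp
    return float(diff @ w @ diff)
```

Minimising `msm_objective` over θ then yields the MSM estimate; the weight matrix would ordinarily be estimated from the data rather than set to the identity.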
Frequentist Inference: New Approaches to Simulated Minimum Distance

As previously stated, traditional approaches to SMD such as MSM and II suffer from a number of weaknesses, the most significant of which is the need to select an arbitrary set of moments or auxiliary model. Not surprisingly, a number of alternatives have emerged aiming to address this particular problem, focusing on the structure of a given time series and its patterns rather than aggregate properties in an attempt to exploit the full informational content of the data.

Among the most straightforward of these alternatives is choosing the objective function to be the sum of appropriately-weighted squared differences between the values of the simulated and empirically-measured time series at a number of points in time, as has been done by Recchioni et al. More sophisticated approaches to quantifying differences in the structure of simulated and empirically-measured time series are presented by Lamperti and Barde in the form of information-theoretic criteria called the generalised subtracted L-divergence (GSL-div) and Markov information criterion (MIC) respectively. Recent years have also seen the emergence of attempts at comparing the causal mechanisms underlying real and simulated data through the use of SVAR regressions, as suggested by Guerini and Moneta.

While the above approaches largely succeed in attenuating concerns related to the selection of arbitrary moments or auxiliary models, they introduce new challenges. While MSM and II have well-understood theoretical properties, many of the recently introduced alternatives have not been subjected to rigorous mathematical analyses (Grazzini et al.). As a consequence, comparisons between different calibration methods tend to be non-trivial, as many of the aforementioned techniques have few conceptual similarities, and most arguments in favour of a particular method are likely to be superficial. This leaves the modeller with a choice between a large number of potential calibration approaches, with very little evidence available on which to base such a decision.

Footnotes: See Franke, Franke and Westerhoff, Fabretti, Grazzini and Richiardi, Chen and Lux, and Platt and Gebbie for examples. The estimator is both consistent and asymptotically normal (McFadden). An example of the application of II in the context of ABMs is presented by Bianchi et al.
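As a purely illustrative sketch of the word-distribution idea underlying criteria such as the GSL-div, one might discretise two series into symbols, collect words of a given window length, and compare the resulting frequency distributions. The divergence used below is a Jensen-Shannon-style stand-in, and Lamperti's exact estimator and weighting across window lengths are not reproduced here.

```python
import numpy as np
from collections import Counter

def symbolise(series, n_bins=5):
    """Discretise a series into integer symbols using equal-width bins over its support."""
    edges = np.linspace(series.min(), series.max(), n_bins + 1)
    return np.digitize(series, edges[1:-1])

def word_distribution(symbols, window):
    """Empirical distribution of words (windows of a given length) in a symbol sequence."""
    words = [tuple(symbols[i:i + window]) for i in range(len(symbols) - window + 1)]
    counts = Counter(words)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def word_divergence(p_sim, p_emp):
    """Jensen-Shannon-style divergence between two word distributions,
    standing in for the subtracted L-divergence used by the GSL-div."""
    support = set(p_sim) | set(p_emp)
    div = 0.0
    for w in support:
        p, q = p_sim.get(w, 0.0), p_emp.get(w, 0.0)
        m = 0.5 * (p + q)
        if p > 0:
            div += 0.5 * p * np.log2(p / m)
        if q > 0:
            div += 0.5 * q * np.log2(q / m)
    return div
```

Repeating the comparison over several window lengths and aggregating the results gives a structural distance that does not depend on a pre-selected set of moments.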
Bayesian Inference

Although less common in the literature, an additional alternative to traditional SMD methods is Bayesian inference, which does not require the selection of arbitrary aggregate features and also allows for the incorporation of known, prior information regarding the parameter values (Grazzini et al.). This is achieved through the following application of Bayes' theorem:

\[ P(\theta \mid X) \propto P(X \mid \theta) P(\theta), \]

where X represents empirically-measured data, P(θ) represents one's prior views regarding the parameter values, and P(X | θ) is the likelihood of the model generating X when initialised using parameter set θ. In contrast to SMD methods, which produce point estimates, it should be apparent that we do not obtain a single parameter set, but rather a distribution. This distribution can, however, be analysed to produce a suitable point estimate by taking the mean, median or mode.

The most significant challenge presented by the approach is the estimation of the likelihood, P(X | θ). Grazzini et al. indicate that non-parametric methods such as kernel density estimation (KDE) may be too computationally demanding to be feasible in most ABM applications, necessitating fairly strong assumptions, while parametric approaches are often not flexible enough to accurately represent the distribution.
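A minimal sketch of this style of estimation is given below, assuming an i.i.d. treatment of observations, a Gaussian KDE likelihood and a random walk Metropolis-Hastings sampler. The function names, the proposal scale and the number of iterations are illustrative assumptions, not the settings of any particular study.

```python
import numpy as np
from scipy.stats import gaussian_kde

def log_likelihood(theta, x_emp, simulate, n_reps=20):
    """Approximate log-likelihood: fit a KDE to pooled simulated observations and
    evaluate the empirical observations under it (observations treated as i.i.d.)."""
    sims = np.concatenate([simulate(theta, len(x_emp), seed=i) for i in range(n_reps)])
    kde = gaussian_kde(sims)
    return float(np.sum(np.log(kde(x_emp) + 1e-12)))

def metropolis_hastings(x_emp, simulate, log_prior, theta0, n_iter=2000, step=0.05):
    """Random walk Metropolis-Hastings sampler over the parameter vector."""
    theta = np.asarray(theta0, dtype=float)
    log_post = log_likelihood(theta, x_emp, simulate) + log_prior(theta)
    samples = []
    for _ in range(n_iter):
        proposal = theta + step * np.random.randn(len(theta))
        lp = log_prior(proposal)
        if np.isfinite(lp):
            cand = log_likelihood(proposal, x_emp, simulate) + lp
            if np.log(np.random.rand()) < cand - log_post:
                theta, log_post = proposal, cand
        samples.append(theta.copy())
    return np.array(samples)  # the mean of the post-burn-in samples serves as a point estimate
```

The posterior mean of the retained samples is the natural point estimate when, as later in this paper, a squared-error-style loss is used to judge calibration quality.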
Addressing Computational Difficulties

While the lack of comparisons between newer methods is indeed a significant weakness of the ABM calibration literature, computational difficulties remain an even larger obstacle in the development of robust and widely-applicable ABM calibration strategies. Since most models of interest are likely to be costly to simulate, and given the large amount of data that needs to be generated for comparison in most ABM calibration frameworks, few attempts have been made to calibrate large-scale ABMs. Instead, most investigations favour proof-of-concept demonstrations on simpler, closed-form models, such as those of Brock and Hommes (1998) and Farmer and Joshi. Ultimately, despite the conceptual similarities between large-scale models and simpler variants, the extent to which existing results can be generalised remains an open question.

Recently, there has been a relatively modest increase in the awareness of this point, and increasing emphasis has been placed on addressing these computational difficulties in some circles (Grazzini et al.; Lamperti et al.). A somewhat promising suggestion is the use of surrogate modelling to circumvent the intensive ABM simulation process, with representative examples including Gaussian process interpolation, also known as kriging (Salle and Yildizoglu), and the more general machine learning surrogate approach of Lamperti et al. Recent years have also seen the emergence of cloud computing platforms, such as Amazon Web Services, which provide users with access to computing resources that are essentially rented for a required task, rather than purchased outright, granting access to computing power several orders of magnitude greater than may be possible through traditional on-site means. Therefore, the use of such services may provide researchers with the means to tackle more ambitious calibration problems without the need to resort to surrogate modelling. This is in fact the approach that has been taken in this investigation.

In order to effectively compare various calibration techniques, it is necessary to develop a series of tests that quantify the notion of calibration success and result in measures that can be directly compared. One might assume that a natural approach would be the calibration of a set of candidate models to empirically-observed data using a variety of different approaches and assessing the resulting goodness of fit. Unfortunately, such an approach is likely to be suboptimal for a number of reasons.

Firstly, regardless of the quality and sophistication of a candidate model, it is likely to be misspecified to some extent when compared to the true data-generating process, especially in an economic context. As an example, a given model may be overly simplified and fail to capture the nuances of the underlying data-generating process, resulting in a poor fit to the data regardless of the merits of each calibration method.

Footnotes: More detailed discussions relating to the GSL-div and MIC can be found in the description of the implemented calibration methods. It is worth noting that it is also possible to maximise P(X | θ) directly, rather than employing it within a Bayesian framework, resulting in a simulation-based approximation to maximum likelihood estimation. While Kukacka and Barunik apply such a procedure to the Brock and Hommes (1998) model, achieving a degree of success, Bayesian methods are, in general, more robust to overfitting and thus better suited to more complex models (Murphy). We describe the approach of Grazzini et al. in more detail in the discussion of Bayesian estimation below.
Secondly, the notion of goodness of fit is difficult to quantify in the context of ABMs. Indeed, every SMD objective function is an attempt at quantifying it, with each differing in what they consider to be the most important characteristics of the data. We therefore introduce a general approach to calibration method comparison that addresses the above concerns.
We begin by letting X^s_i(θ, T) be the output of a candidate model, M, for parameter set θ, output length T, and random seed i, where the use of a superscript s indicates that the quantity being described is derived from or related to simulated rather than real data. Since empirically-observed data is nothing more than a single realisation of the true data-generating process, which may itself be viewed as a model with its own set of parameters, it follows that we may consider X = X^s_{i*}(θ_true, T_emp) as a proxy for real data to which M may be calibrated.

In this case, we are essentially calibrating a perfectly specified model to data for which the true set of parameters, θ_true, is known. It can be argued that a good calibration method would, in this idealised setting, be able to recover the true set of parameters to some extent and that methods which produce estimates closer to θ_true would be considered superior. It also follows that methods which perform well in this context are far more likely to perform well in the more realistic case of a misspecified model. We therefore define the loss function

\[ L(\theta_{true}, \hat{\theta}) = \lVert \theta_{true} - \hat{\theta} \rVert, \]

where θ̂ is the parameter estimate produced by a given calibration method. Therefore, using this approach, we not only address concerns related to misspecified models, but are also able to directly compare calibration methods according to their associated loss function values.

Of course, we are not suggesting that this is a definitive tool for calibration method comparison, but do believe it to be a principled approach that is able to provide meaningful insights in this context.

Given that we have now defined a metric that allows calibration methods to be directly compared for a given model, M, and true parameter set, θ_true, it is relatively straightforward to develop a comprehensive series of comparative tests.

We begin by noting that the difficulty of a given test depends on three factors: the complexity of the dynamics produced by M, the number of free parameters in θ_true, and the length of the time series data to which M is calibrated, T_emp. We must therefore consider a set of models, where each model differs in the complexity and overall nature of its resultant dynamics, and a variety of true parameter sets of different cardinalities for each model. We can then determine the loss function values associated with each calibration method for the various models and true parameter sets. This will ultimately provide insight into which calibration techniques deliver the best performance and in which situations.

At this point, it should be noted that the models we consider will, in general, not be ergodic. Therefore, each calibration experiment involves the comparison of our proxy for real data, X^s_{i*}(θ_true, T_emp), with an ensemble of R Monte Carlo replications for each candidate set of parameters, X^s_i(θ, T_sim), i = i_1, i_1 + 1, ..., i_1 + R − 1, where i_1 ∈ ℕ and we assume that i* ∉ {i_1, i_1 + 1, ..., i_1 + R − 1}.

Footnotes: Refer to Barde and van der Hoog for a first attempt at using similar methods to calibrate a large-scale economic ABM. In this case, we consider univariate time series, but analogous arguments apply to the case of panel data.

We now provide descriptions of the implemented models and brief motivations for their consideration in this study. The first four are computationally inexpensive, appear frequently in the existing calibration literature, and are capable of producing dynamics ranging from very basic to relatively sophisticated. This initial set is selected to identify the most promising among the considered methods, and those that perform well in this simplified context will then be applied to a far more computationally expensive, large-scale ABM of the UK housing market.

The computationally inexpensive nature of the initial set of models allows us to perform a thorough series of tests involving all of the considered calibration methods in a realistic period of time, after which only methods that are likely to be successful are applied in the more complex setting. Indeed, methods which do not perform well in this simplified context will almost certainly perform poorly when applied to the UK housing market model, and we therefore need not waste computational resources on such tests.
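A compact sketch of this protocol is given below, with `simulate` and `calibrate` standing in for any of the models and calibration methods considered, and the loss taken as the Euclidean norm. The function names and default values are illustrative assumptions.

```python
import numpy as np

def run_experiment(simulate, calibrate, theta_true, t_emp=1000, seed_emp=0):
    """Generate a pseudo-empirical series from known parameters, calibrate the
    (perfectly specified) model to it, and score the estimate with the Euclidean loss."""
    x_emp = simulate(theta_true, t_emp, seed=seed_emp)   # proxy for real data
    # Monte Carlo ensembles used inside `calibrate` must use seeds distinct from seed_emp.
    theta_hat = calibrate(x_emp, simulate)
    loss = np.linalg.norm(np.asarray(theta_true) - np.asarray(theta_hat))
    return theta_hat, loss
```

Repeating this for each model, free parameter set and calibration method produces directly comparable loss values.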
AR(1) Model

The first model we consider is an autoregressive model of order 1, given by

\[ x_{t+1} = a_1 x_t + \epsilon_{t+1}, \]

where ε_t ∼ N(0, 1). It should be apparent that the above is a basic, single-parameter model capable of producing a limited range of dynamics and that its calibration should be a trivial exercise. It has been included in order to create a baseline test for which we expect all of the considered calibration methods to perform well.

Footnotes: Note that we do not vary the empirical time series length in our experiments, since the effects associated with the number of free parameters and the complexity of the model dynamics are generally dominant. The interested reader should refer to Grazzini for a detailed discussion on ergodicity and its relationship with ABM calibration.
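As an illustration, a minimal simulator for this model and a corresponding Monte Carlo ensemble might look as follows; the parameter value, series length and seed choices are arbitrary examples.

```python
import numpy as np

def simulate_ar1(theta, t, seed):
    """Simulate the AR(1) model x_{t+1} = a_1 x_t + e_{t+1}, with e ~ N(0, 1)."""
    a1, = theta
    rng = np.random.default_rng(seed)
    x = np.zeros(t)
    for i in range(1, t):
        x[i] = a1 * x[i - 1] + rng.standard_normal()
    return x

# An ensemble of R Monte Carlo replications for a candidate parameter set,
# each with its own seed (distinct from the pseudo-empirical seed).
ensemble = [simulate_ar1([0.7], 1000, seed=s) for s in range(1, 101)]
```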
ARMA(2,2)-ARCH(2) Model

The second model we consider is an ARMA(2,2) model with ARCH(2) errors, given by

\[ x_{t+1} = a_0 + a_1 x_t + a_2 x_{t-1} + b_1 \sigma_t \epsilon_t + b_2 \sigma_{t-1} \epsilon_{t-1} + \sigma_{t+1} \epsilon_{t+1}, \qquad \sigma^2_{t+1} = c_0 + c_1 \epsilon^2_t + c_2 \epsilon^2_{t-1}, \]

where ε_t ∼ N(0, 1).

The above is a logical successor to the previously considered AR(1) model, as it is drawn from the same general class, traditional econometric time series models, but is capable of producing more nuanced dynamics and has a parameter space of significantly larger cardinality. Additionally, the calibration of such models is relatively straightforward using analytical methods. It would thus be worthwhile to assess the performance of simulation-based methods in the context of models known to be amenable to accurate estimation using analytical approaches.

A similar model was considered by Barde when testing the MIC, though it was not used in full calibration experiments and involved the comparison of the obtained objective function values for a limited number of parameter combinations.
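A simulator consistent with the reconstruction of the equations given above (the exact specification in the original source may differ slightly) could be sketched as follows.

```python
import numpy as np

def simulate_arma_arch(theta, t, seed):
    """Simulate the ARMA(2,2)-ARCH(2) process as reconstructed above:
    x_{t+1} = a0 + a1 x_t + a2 x_{t-1} + b1 s_t e_t + b2 s_{t-1} e_{t-1} + s_{t+1} e_{t+1},
    s^2_{t+1} = c0 + c1 e_t^2 + c2 e_{t-1}^2, with e ~ N(0, 1)."""
    a0, a1, a2, b1, b2, c0, c1, c2 = theta
    rng = np.random.default_rng(seed)
    x, s, e = np.zeros(t), np.ones(t), rng.standard_normal(t)
    for i in range(2, t):
        s[i] = np.sqrt(c0 + c1 * e[i - 1] ** 2 + c2 * e[i - 2] ** 2)
        x[i] = (a0 + a1 * x[i - 1] + a2 * x[i - 2]
                + b1 * s[i - 1] * e[i - 1] + b2 * s[i - 2] * e[i - 2] + s[i] * e[i])
    return x
```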
Random Walks with Structural Breaks

The third model we consider is a random walk capable of replicating simple structural breaks, given by

\[ x_{t+1} = x_t + d_{t+1} + \epsilon_{t+1}, \]

where ε_t ∼ N(0, σ²_t) and

\[ (d_t, \sigma_t) = \begin{cases} (d_1, \sigma_1) & t \leq \tau \\ (d_2, \sigma_2) & t > \tau. \end{cases} \]

Despite the simplicity of the model's equations, its replication of structural breaks and non-stationarity should present a significant challenge to the considered methods.

It should also be noted that the above model was used by Lamperti to test the GSL-div, though, in much the same way that the ARMA(2,2)-ARCH(2) model was employed by Barde during the testing of the MIC, this involved the comparison of the objective function values obtained for a limited number of parameter value combinations rather than attempts at calibration.
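A minimal simulator for this process, with the break time τ treated as part of the parameter vector, might be sketched as follows.

```python
import numpy as np

def simulate_break_rw(theta, t, seed):
    """Random walk with a single structural break at time tau:
    drift and noise scale switch from (d1, s1) to (d2, s2) after tau."""
    d1, s1, d2, s2, tau = theta
    rng = np.random.default_rng(seed)
    x = np.zeros(t)
    for i in range(1, t):
        d, s = (d1, s1) if i <= tau else (d2, s2)
        x[i] = x[i - 1] + d + s * rng.standard_normal()
    return x
```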
Brock and Hommes (1998) Model

The fourth model we consider is the heterogeneous agent model proposed by Brock and Hommes (1998), which has a closed-form solution given by

\[ x_{t+1} = \frac{1}{1+r} \sum_{h=1}^{H} n_{h,t+1} \left( g_h x_t + b_h \right) + \epsilon_{t+1}, \qquad n_{h,t+1} = \frac{\exp(\beta U_{h,t})}{\sum_{h=1}^{H} \exp(\beta U_{h,t})}, \qquad U_{h,t} = (x_t - R x_{t-1})(g_h x_{t-2} + b_h - R x_{t-1}), \]

where ε_t ∼ N(0, σ²) and R = 1 + r.

The above is well-known in the literature as an early example of a class of ABMs that attempt to model the trading of assets on an artificial stock market by simulating the interactions of heterogeneous traders that follow various trading strategies. Each strategy, h, has an associated trend-following component, g_h, and bias, b_h, both of which are real-valued parameters that are of particular interest in our investigation. The model also includes positive-valued parameters that affect all trader agents, regardless of the strategy they are currently employing, specifically β, which controls the rate at which agents switch between various strategies, and the prevailing market interest rate, r.

Despite its relative simplicity, the Brock and Hommes (1998) model is capable of producing a range of sophisticated dynamics, including chaotic behaviours, and is computationally inexpensive to simulate, unlike most contemporary ABMs. These desirable features have led to it being a popular choice in many ABM calibration exercises throughout the literature, making it a natural inclusion in this investigation.

Footnote: The interested reader should refer to Brock and Hommes (1998) for a detailed discussion of the model's underlying assumptions and the derivation of the above closed-form solution.
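For illustration, a simulator based on the closed-form solution reconstructed above might look as follows; the strategy parameters and noise level are supplied by the caller, and the numerical-stability shift inside the softmax is our addition rather than part of the original model.

```python
import numpy as np

def simulate_brock_hommes(g, b, beta, r, sigma, t, seed):
    """Simulate the Brock and Hommes (1998) model as reconstructed above, with H
    strategies defined by trend components g[h] and biases b[h]."""
    rng = np.random.default_rng(seed)
    g, b = np.asarray(g, float), np.asarray(b, float)
    R = 1.0 + r
    x = np.zeros(t)
    for i in range(3, t):
        U = (x[i - 1] - R * x[i - 2]) * (g * x[i - 3] + b - R * x[i - 2])  # strategy fitness
        n = np.exp(beta * U - np.max(beta * U))
        n /= n.sum()                                                       # strategy fractions
        x[i] = (n @ (g * x[i - 1] + b)) / R + sigma * rng.standard_normal()
    return x
```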
INET Oxford Housing Market Model

The fifth and final model we consider is a large-scale ABM of the UK housing market, which was introduced in the working paper by Baptista et al. As previously discussed, the consideration of such models in the calibration literature is still very rare, with most recent attempts focusing on far simpler alternatives, such as the Brock and Hommes (1998) model. Since the estimation of large-scale models remains an open problem and is the ultimate goal of current research in the field, attempts to calibrate this model will form a central component of this investigation.

Unfortunately, the sophistication of the model is such that we are unable to describe it here. Nevertheless, the interested reader should refer to the original working paper and the publicly available source code, which can be found at: https://github.com/EconomicSL/housing-model

Footnote: The model has roughly 100 parameters.

Given that we have now described the overall experimental procedure and selected models in detail, we finally discuss each of the implemented calibration methods, which have been selected such that each major area of the literature is represented.
SMD Objective Functions
Since SMD is the most popular ABM calibration paradigm, it is appropriate that it features prominently in this investigation. We therefore consider several SMD objective functions, including:

• MSM. Despite its need for the selection of an arbitrary set of moments, MSM remains a popular choice throughout the ABM calibration and econometric literature in general. Its inclusion is therefore essential as a benchmark against which to compare newer approaches. While there are a number of different interpretations of the MSM framework and many moment sets that could potentially be considered, we ultimately chose the implementation described by Franke, combined with the moment set proposed by Chen and Lux, consisting of the variance, kurtosis, autocorrelation coefficients for the raw series, absolute value series and squared series at lag 1, and the autocorrelation coefficients for the absolute value series and squared series at lag 5 (a sketch of this moment set is given after this list).

• The GSL-div (Lamperti). In essence, this recently introduced information-theoretic criterion compares the distributions of temporal patterns occurring in different time series. It achieves this by discretising both the simulated and empirically-observed data into series of windows of a given length, viewing each window as a word in a corresponding vocabulary, and proceeding to calculate the subtracted L-divergence between the word distributions implied by each dataset using frequency-based estimators. Thereafter, these L-divergence estimates are weighted and summed for a desired number of window lengths.

• The MIC (Barde). Despite also being an information-theoretic criterion and therefore a direct competitor to the GSL-div, the MIC adopts a fundamentally different approach to achieve its goals and therefore warrants consideration in our investigation. In more detail, it constructs an N-th order Markov process based on the model-simulated data using an algorithm known as context tree weighting (Willems et al.) and makes use of the obtained transition probabilities and the sum of the binary log scores along the length of the empirically-observed data to produce an estimate of the cross entropy.
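For concreteness, the moment set listed in the MSM bullet above can be computed as follows. This is a sketch only: the weighting scheme of Franke is not reproduced here.

```python
import numpy as np

def autocorr(x, lag):
    """Sample autocorrelation of a series at a given lag."""
    return np.corrcoef(x[:-lag], x[lag:])[0, 1]

def chen_lux_moments(x):
    """Moment set listed above: variance, kurtosis, lag-1 autocorrelations of the raw,
    absolute and squared series, and lag-5 autocorrelations of the absolute and squared series."""
    xc = x - x.mean()
    return np.array([
        np.var(x),
        np.mean(xc ** 4) / np.var(x) ** 2,
        autocorr(x, 1), autocorr(np.abs(x), 1), autocorr(x ** 2, 1),
        autocorr(np.abs(x), 5), autocorr(x ** 2, 5),
    ])
```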
Optimisation Algorithms

In the preceding subsection, we briefly discussed three distinct approaches that may be considered when constructing an objective function in the SMD framework. A second and equally important concern is the method used to minimise a given objective function. Unfortunately, the resulting optimisation problem is, in general, incredibly difficult.

Firstly, the simulation-based nature of the aforementioned SMD objective functions means that we cannot make use of gradient-based methods or related approaches that require analytical expressions for the value of the objective function at a given point. This ultimately forces us to consider heuristic methods, which often produce solutions with properties that are not well understood and do not guarantee convergence to a global minimum (Gilli and Winker; Fabretti). Secondly, and even more importantly, such methods typically require a significant number of objective function evaluations, each of which is very computationally expensive in our case.

Given that there are no best practices in the literature, we will consider (and ultimately compare) two contemporary heuristics, namely:

• Particle swarm optimisation, an evolutionary algorithm which mimics the flocking and swarming behaviours of organisms in ecological systems. Algorithms of this nature are now standard in the optimisation literature, with an accessible overview provided by Kaveh. (A minimal sketch of a particle swarm minimiser is given after this list.)

• The approach of Knysh and Korkolis, a surrogate model method based on the work of Regis and Shoemaker and designed for the optimisation of expensive black-box functions. The approach begins by efficiently sampling the parameter space using Latin hypercube sampling and evaluating the objective function at the sampled points. Thereafter, it proceeds by constructing a computationally inexpensive approximation to the objective function using radial basis functions, which is then minimised to obtain an initial solution. This solution is then gradually improved through the consideration of further evaluations of the original objective function in an iterative procedure known as the CORS algorithm.
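The following is a minimal, generic particle swarm minimiser for a black-box objective over box constraints. The hyperparameter values are illustrative defaults, not the settings used in our experiments.

```python
import numpy as np

def particle_swarm(objective, bounds, n_particles=20, n_iter=50, w=0.7, c1=1.5, c2=1.5, seed=0):
    """Minimal particle swarm minimiser for a black-box objective over box bounds."""
    rng = np.random.default_rng(seed)
    lo, hi = np.array(bounds, dtype=float).T
    pos = rng.uniform(lo, hi, size=(n_particles, len(lo)))
    vel = np.zeros_like(pos)
    p_best, p_val = pos.copy(), np.array([objective(p) for p in pos])
    g_best = p_best[np.argmin(p_val)]
    for _ in range(n_iter):
        r1, r2 = rng.random(pos.shape), rng.random(pos.shape)
        vel = w * vel + c1 * r1 * (p_best - pos) + c2 * r2 * (g_best - pos)
        pos = np.clip(pos + vel, lo, hi)
        vals = np.array([objective(p) for p in pos])
        improved = vals < p_val
        p_best[improved], p_val[improved] = pos[improved], vals[improved]
        g_best = p_best[np.argmin(p_val)]
    return g_best, float(np.min(p_val))
```

In our setting, `objective` would be one of the SMD objective functions above, each evaluation of which requires a full ensemble of model simulations.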
Bayesian Estimation

As previously stated, the consideration of Bayesian inference is still relatively rare in the ABM calibration literature. Nevertheless, the work of Grazzini et al. provides a first attempt at its use within this context and is thus worth comparing to the dominant SMD paradigm.

In essence, the approach of Grazzini et al. assumes that observations in each of the considered time series (model-simulated or empirically-observed) are i.i.d. random samples from unconditional distributions that characterise observations in each series. Under this assumption, KDE is used to construct a density function that characterises the distribution of observations in the simulated series for a given set of parameter values, which can then be used to determine the likelihood of the empirically-observed series for this parameter set. Applying Bayes' theorem results in a posterior distribution that is amenable to sampling using a random walk Metropolis-Hastings algorithm.

While Grazzini et al. consider only full posterior distributions, we require point estimates to ensure compatibility with our loss function. In most cases, and particularly for parameters taking on continuous values, the mean or median are preferred, since the mode may be an atypical point (Murphy). In our case, we consider the posterior mean, since it corresponds to the optimal estimate for the squared error loss and thus our chosen loss function.

Footnote: Note that this is similar to the kriging approach of Salle and Yildizoglu, discussed earlier.

Results

In the preceding sections, we described a procedure for comparing the effectiveness of various ABM calibration techniques using computational experiments. We now present the results of these experiments and discuss their implications.

The first series of tests involves the application of all of the implemented calibration techniques to a set of simple time series models. While the main elements of the experimental procedure have been outlined above, we have also provided a more detailed overview of the technical details of these experiments in Appendix A, including discussions related to dataset construction and the setting of the hyperparameters for each calibration method.

AR(1) Model
We now proceed with the presentation of the results of our proposed comparative experiments, beginning with those obtained when attempting to calibrate the AR(1) model. Since the model has only a single parameter, a_1 ∈ [0, 1], and forms part of a baseline test that all of the considered methods are expected to perform well in, our discussion will be relatively brief.

Table 1: Calibration Results for the AR(1) Model (estimate of a_1 and loss L(θ̂, θ_true) for θ_true and each method).

Referring to Figure 2a, where we plot the objective function curves for each of the considered SMD methods and the corresponding minima found by each optimisation algorithm, we find that, as expected, each method produces estimates close to the true parameter value with little difficulty. Similarly, when referring to Figure 2b, where we present the posterior distribution of a_1 obtained using Bayesian estimation, we find that the posterior mean (our selected point estimate) is indeed close to the true parameter value and is estimated with a relatively small degree of uncertainty, as is evident from the low variance implied by the obtained density function.

More detailed results, including the loss function values associated with each point estimate, are presented in Table 1, with the loss function suggesting that Bayesian estimation delivers the best performance for the simple AR(1) model, followed by the MIC, MSM, and finally the GSL-div.

Figure 2: A graphical illustration of the calibration results obtained for the AR(1) model. (a) SMD objective function curves for parameter a_1 (GSL-div, MSM and MIC, with the minima found by particle swarm optimisation and the Knysh and Korkolis method marked). (b) Posterior distribution of parameter a_1 (prior density, posterior density and posterior mean, with the true value marked).

ARMA(2,2)-ARCH(2) Model
The next model to be calibrated, an ARMA(2,2)-ARCH(2) model, has a larger parameter space and is capable of producing more complex dynamics than the AR(1) model. It therefore warrants a more comprehensive investigation, leading us to consider two free parameter sets, [a_0, a_1] and [b_1, b_2, c_0, c_1, c_2], where all parameters are assumed to lie in the interval [0, 1], with the exception of a_1, which we assume lies in the interval [0, 0.8].

As is evident in Figure 3, where we plot the objective function surfaces associated with the first free parameter set, differences in the relative performance of the GSL-div, MSM and MIC have become more pronounced when attempting to calibrate this more sophisticated model. While MSM and the MIC produce reasonable parameter estimates, the GSL-div appears to have performed relatively poorly, yielding an estimate for a_0 that is significantly different from its true value. A more thorough inspection of the objective function surface reveals that changes in the value of a_0 appear to have a limited effect on the GSL-div. The fact that a_0 is simply an additive constant suggests that the GSL-div is unable to differentiate between series of the form {x_t | t ≥ 0} and {x_t + C | t ≥ 0}, C ∈ ℝ, an important limitation that is not yet discussed in the literature.

Figure 3: SMD objective function surfaces for free parameter set 1 of the ARMA(2,2)-ARCH(2) model. (a) GSL-div. (b) MSM. (c) MIC.
In the case of Bayesian estimation, we see that the method of Grazzini et al. once again performs well, with the means of the posterior distributions of a_0 and a_1, shown in Figure 4, producing good estimates of the true parameter values. The loss function values presented in Table 2 also suggest that Bayesian estimation is the best performing method, as was the case for the AR(1) model, followed by MSM, the MIC and finally the GSL-div.
Figure 4: Marginal posterior distributions for free parameter set 1 of the ARMA(2,2)-ARCH(2) model (prior density, posterior density and posterior mean for a_0 and a_1, with the true values marked).

At this point, it is natural to ask whether similar behaviours are observed for free parameter sets of greater cardinality. Referring to Figure 5, it is apparent that the performance of Bayesian estimation is indeed maintained in the more challenging context of 5 free parameters, with the resultant estimates again being reasonably accurate. Nevertheless, the uncertainty of estimation is greater than in the previously presented cases, as suggested by the increased variance of the posterior distributions obtained for the second free parameter set. This is to be expected, however, since the number of parameter combinations that produce a reasonable fit to a given dataset is likely to increase with the number of free parameters, which could most likely be addressed by the consideration of additional data or a larger number of Monte Carlo replications.

Table 2: Calibration Results for Free Parameter Set 1 of the ARMA(2,2)-ARCH(2) Model (estimates of a_0 and a_1 and loss L(θ̂, θ_true) for θ_true and each method).

Figure 5: Marginal posterior distributions for free parameter set 2 of the ARMA(2,2)-ARCH(2) model (b_1, b_2, c_0, c_1 and c_2).

In contrast to Bayesian estimation, SMD methods do not seem to maintain a similar level of performance as the number of dimensions is increased, as indicated by the significant differences between the obtained estimates and true parameter values presented in Table 3. One should also be aware that the optimisation algorithms no longer agree on the obtained estimates, which can be seen as indicative of convergence to local minima.

A possible explanation for the observed phenomena is that the objective functions are characterised by the presence of large, flat regions of local minima (possibly containing the true parameter values) as opposed to a single global minimum.

Footnotes: Note that since our calibration experiments involve the comparison of an ensemble of Monte Carlo replications, all of which are generated using different random seeds, to a single series, itself generated using a unique random seed, the observed level of noise is to be expected. Note that this ability to quantify the uncertainty of estimation, albeit rather crudely, offers a distinct advantage over SMD methods, which simply produce point estimates and lack an inherent method of deducing the degree of uncertainty without repeated estimation attempts on multiple (bootstrapped) datasets.

Table 3: Calibration Results for Free Parameter Set 2 of the ARMA(2,2)-ARCH(2) Model (estimates of b_1, b_2, c_0, c_1 and c_2 and loss L(θ̂, θ_true) for θ_true and each method).

Random Walks with Structural Breaks

While the ARMA(2,2)-ARCH(2) model is indeed capable of producing significantly more nuanced dynamics than the AR(1) model, it is still unable to replicate a range of behaviours observed in both real economic data and the outputs of large-scale ABMs. Among these phenomena, structural breaks, or sudden and dramatic deviations from prior temporal trends, are of particular interest. This leads us to investigate the extent to which a random walk model capable of producing simple structural breaks can be successfully calibrated using the considered methods.

Figure 6: A graphical illustration of the calibration results obtained for parameter τ of the random walk model. (a) SMD objective function curves for parameter τ. (b) Posterior distribution of parameter τ.

In this simplified context, a successful method would not only be able to calibrate parameters which determine the pre- and post-break dynamics, but should also be able to identify the point at which the structural break occurs, even if not precisely. We therefore consider two free parameter sets, [τ] ∈ [0, 1000] and [σ_1, σ_2], σ_i ∈ [0, 1], which correspond to each of the aforementioned aspects.

Beginning with Figure 6a, which presents the SMD objective function curves resulting from attempts to determine the location of the structural break, we see that both MSM and the MIC are capable of inferring the correct location to some extent, although the estimate associated with MSM is far closer to the true value. The GSL-div, on the other hand, delivers far less compelling results, with its associated objective function curve suggesting that values across a region spanning over half of the total parameter range are preferable to the true value.

Moving away from SMD methods, Figure 6b illustrates the ability of Bayesian estimation to estimate τ to a reasonable extent and with a relatively low degree of uncertainty. Furthermore, the loss function values presented in Table 4 suggest that Bayesian estimation again delivers the best performance among the considered methods.

Table 4: Calibration Results for Parameter τ of the Random Walk Model

                τ      L(θ̂, θ_true)
θ_true          700    0
GSL-div/PS      354    346
GSL-div/KK      355    355
MSM/PS          741    41
MSM/KK          742    42
MIC/PS          559    141
MIC/KK          575    125
BE              732    32

At this point, one can identify the emergence of a persistent trend that appears throughout the preceding experiments. In particular, notice that Bayesian estimation always delivers the best calibration performance (as measured by the loss function) and has, in all of the cases considered thus far, been relatively successful in recovering the parameters used to generate the artificial datasets. In contrast to this, SMD methods are less consistent in their performance, with some methods performing well in certain contexts and poorly in others. In most cases, however, it would appear that MSM and the MIC perform better than the GSL-div.

Figure 7: Marginal posterior distributions for free parameter set 2 of the random walk model (σ_1 and σ_2).

Unsurprisingly, Figures 7 and 8 demonstrate that this trend largely persists in the case of the second free parameter set. Note, however, that the loss function values presented in Table 5 suggest that the GSL-div minimum obtained using particle swarm optimisation is a better estimate of the true parameter values than the posterior mean obtained using Bayesian estimation, while, paradoxically, the GSL-div minimum obtained using the method of Knysh and Korkolis is a significantly worse estimate. A closer examination of Figure 8 reveals that the GSL-div objective function surface includes a large, flat region containing the true parameter values. Therefore, the proximity of the estimate obtained using particle swarm optimisation to the true parameter values seems to simply be a fortunate coincidence of the initialisation of the optimisation algorithm as opposed to a robust result.

Figure 8: SMD objective function surfaces for free parameter set 2 of the random walk model. (a) GSL-div. (b) MSM. (c) MIC.

Recall that a similar lack of agreement between the minima obtained by the considered optimisation algorithms was observed for the second free parameter set of the ARMA(2,2)-ARCH(2) model, which we suggested may also stem from large, flat regions being present in the objective functions. The visual presence of the flat region in Figure 8 provides some evidence that this may indeed be the case.

Table 5: Calibration Results for Free Parameter Set 2 of the Random Walk Model (estimates of σ_1 and σ_2 and loss L(θ̂, θ_true) for θ_true and each method).

The estimates presented in Table 5 are, for the most part, not consistent with the true parameter set. This suggests that, in the case of SMD methods, one has to be aware that even if a unique global minimum can be obtained in higher dimensions, it may not necessarily be a good estimate of the parameters of interest. A plausible explanation for this inconsistency is the possibility that the distance function may be unable to detect relevant aspects of the considered data. More concretely, MSM involves the estimation of various quantities for the full length of the time series and is therefore not well-suited to cases where different regimes characterise different regions of the data. In contrast to this, the MIC evaluates the binary log scores at the level of each observation, making it a more logical choice for problems of this nature. It is therefore no surprise that the loss function values associated with the MIC in Table 5 are nearly 4 times smaller than those associated with MSM.

Brock and Hommes (1998) Model
Figure 9: Marginal posterior distributions for free parameter set 1 of the Brock and Hommes (1998) model.

Figure 10: Marginal posterior distributions for free parameter set 2 of the Brock and Hommes (1998) model.

Figure 11: SMD objective function surfaces for free parameter set 1 of the Brock and Hommes (1998) model. (a) GSL-div. (b) MSM. (c) MIC.
The final model to be investigated is the Brock and Hommes (1998) model, considered a classic example of the use of heterogeneous agents to model asset price time series. As previously stated, the model is ubiquitous in the calibration literature and is often used to test new approaches. Despite this, experiments of the type we consider here, or attempts to recover known parameter values, are still relatively rare. Given the size of the model's parameter space, there are potentially many free parameter sets of interest that could be considered, though we have ultimately settled on two, [g_1, b_1] and [g_1, b_1, g_2, b_2], with all parameters assumed to lie in the interval [0, 1], with the exception of b_1, which we assume lies in the interval [−1, 0].

As expected, the results obtained using Bayesian estimation, shown in Figures 9 and 10, are similar to those observed for previous models, with the associated estimates being comparable to the true parameter values and the uncertainty of estimation increasing with the cardinality of the free parameter set. Similarly, the results obtained for free parameter set 1 using SMD methods are also consistent with the previously identified trend that certain methods work well in some cases, but not in others. More concretely, Figure 11 indicates that MSM produces a reasonable estimate of the true parameter values, while this is not the case for the MIC and GSL-div.

Table 6: Calibration Results for Free Parameter Set 1 of the Brock and Hommes (1998) Model (estimates of g_1 and b_1 and loss L(θ̂, θ_true) for θ_true and each method).

Taking a final look at the constructed loss function, we see that the values presented in Table 6 indicate that Bayesian estimation is the best performing method for free parameter set 1, as has been observed for all models and parameter sets considered up to this point. Rather surprisingly, however, we observe some deviation from the established trends in the case of free parameter set 2, with the loss function values presented in Table 7 suggesting that it is now the GSL-div that produces the best estimates, despite evidence of convergence to local minima.

Table 7: Calibration Results for Free Parameter Set 2 of the Brock and Hommes (1998) Model (estimates of g_1, b_1, g_2 and b_2 and loss L(θ̂, θ_true) for θ_true and each method).

Overall Assessment
From the preceding computational experiments, it should be apparent that the performance of Bayesian estimation far exceeds that of any of the considered SMD methods. In more detail, we find that for every model and parameter set tested, the method of Grazzini et al. produces estimates that are comparable to the true parameter values, while also being the best performing method for all but the second free parameter set of the Brock and Hommes (1998) model.

SMD methods, on the other hand, are far less consistent. While there are indeed a number of cases where each objective function performs well, we often observe the emergence of cases where the associated estimates are completely different to the true parameter values. We suggest that two main factors contribute to the emergence of these phenomena. Firstly, in some cases, it appears that a given method may be unable to capture important differences between two datasets. For example, recall that the GSL-div is unable to discriminate between series that differ only by an additive constant and MSM performs poorly when regions of the considered datasets are characterised by different regimes. Secondly, we find that the identification of global minima can become problematic, especially in higher dimensions, resulting in estimates that largely depend on the idiosyncrasies of the considered heuristics.

Given its clear dominance, Bayesian estimation will therefore be the method of choice when attempting to tackle the more ambitious problem of calibrating large-scale ABMs.
Although we would ultimately wish to fully calibrate the housing market model, we limit this investigation to two more tractable, though still challenging, calibration subproblems involving 4 and 19 free parameters respectively. These free parameter sets may seem trivially small in the context of a 100 parameter model; nevertheless, existing attempts aimed at calibrating models of a similar scale are yet to consider more than 8 free parameters (Barde and van der Hoog), making a set of 19 free parameters a relatively ambitious undertaking. As in the case of the simple models we previously considered, we provide the technical details associated with these experiments in Appendix A.

Free Parameter Set 1
Since Market Average Price Decay, Sale Epsilon, P Investor and Min Investor Percentile are expected to be relatively important parameters, we consider them in our first calibration attempt.

Referring to Figure 12, we see that despite a significant increase in the sophistication of the candidate model, the performance of the method of Grazzini et al. (2017) is unaffected, resulting in estimates that are reasonably close to the true parameter values and behaviours that are consistent with those observed in the 4 and 5 parameter cases of the ARMA(2, 2)-ARCH(2) and Brock and Hommes (1998) models respectively (see Figures and ).

Figure 12: Marginal posterior distributions for free parameter set 1 of the housing market model. Each panel (Market Average Price Decay, Sale Epsilon, P Investor, Min Investor Percentile) shows the prior density, the posterior density, the posterior mean and the true parameter value.
This is a very promising result and suggests that the method is relatively robust and can be trusted in a wide variety of contexts to reliably calibrate several parameters. Nevertheless, we will ultimately need to consider a much larger number of free parameters, which we do shortly, in order to assess the extent to which this performance is maintained as the number of dimensions is increased.
Free Parameter Set 2

Referring to Figure 13, we observe the following (the accuracy bands referred to below are illustrated in the sketch after this list):

1. The estimates obtained for 3 of the considered free parameters, P Investor, Hold Period, and Decision to Sell HPC, differ from the true parameter values by less than 5% of the explored parameter range. We consider these parameters to have been estimated with a high degree of accuracy.

2. The estimates obtained for 5 of the considered free parameters, Min Investor Percentile, HPA Expectation Factor, Decision to Sell Alpha, Decision to Sell Beta, and Desired Rent Income Fraction, differ from the true parameter values by 5−10% of the explored parameter range. Further, the estimates obtained for 3 of the considered free parameters, Sale Epsilon, Derived Parameter G, and Tenancy Length Average, differ from the true parameter values by 10−15%.

3. The estimates for the remaining 8 free parameters differ from the true parameter values by more than 15% of the explored parameter range, with the estimates for 5 of these parameters differing from the true parameter values by more than 35%. In this case, we consider calibration to have been unsuccessful.
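To make the accuracy bands used above concrete, the following minimal sketch classifies estimates according to their absolute error expressed as a percentage of the explored parameter range; the example estimates and bounds are illustrative placeholders, not the actual housing market settings.

import numpy as np

def error_as_pct_of_range(theta_hat, theta_true, lower, upper):
    """Absolute error of each estimate, expressed as a percentage of the
    width of the explored parameter interval."""
    theta_hat, theta_true = np.asarray(theta_hat), np.asarray(theta_true)
    widths = np.asarray(upper) - np.asarray(lower)
    return 100 * np.abs(theta_hat - theta_true) / widths

def band(error_pct):
    """Accuracy bands mirroring those used in the text."""
    if error_pct < 5:
        return "<5%"
    if error_pct < 10:
        return "5-10%"
    if error_pct < 15:
        return "10-15%"
    return ">15%"

# Purely illustrative estimates and bounds (not the actual housing market values).
errors = error_as_pct_of_range(theta_hat=[0.22, 0.48], theta_true=[0.20, 0.60],
                               lower=[0.0, 0.0], upper=[1.0, 1.0])
for e in errors:
    print(f"error = {e:.1f}% of range -> band {band(e)}")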
Figure 13: Marginal posterior distributions for free parameter set 2 of the housing market model. Each panel (Market Average Price Decay, Sale Epsilon, P Investor, Min Investor Percentile, HPA Expectation Factor, HPA Years to Check, Derived Parameter G, Derived Parameter K, Derived Parameter KL, Tenancy Length Average, Hold Period, Decision to Sell Alpha, Decision to Sell Beta, Decision to Sell HPC, Decision to Sell Interest, BTL Choice Intensity, Desired Rent Income Fraction, Psychological Cost of Renting, Sensitivity of Rent or Purchase) shows the prior density, the posterior density, the posterior mean and the true parameter value.

We therefore find that the method of Grazzini et al. (2017) is able to obtain reasonable estimates for 11 of the 19 considered free parameters. While this may seem somewhat disappointing initially, one should bear in mind that, as previously stated, previous investigations involving models of a similar scale have only succeeded in the calibration of 8 free parameters. Additionally, one should recall that we have made use of only a single time series, rather than the full set of model outputs, and that the method makes very strong independence assumptions. Therefore, it is entirely reasonable to assume that if additional model outputs are considered and the method's assumptions relaxed, Bayesian estimation could provide a robust framework for the calibration of large-scale economic ABMs.

Goodness of Fit Tests
While defining a comprehensive notion of goodness of fit in the context of ABMs is non-trivial, recall that the method of Grazzini et al. (2017) essentially sets parameters such that the unconditional distribution of the model-simulated series matches that of its empirically-sampled equivalent, a notion of goodness of fit that is relatively easy to test using the two-sample Kolmogorov-Smirnov (KS) test. Figure 14a therefore presents the KS test p values obtained when our artificial data is compared to each of the 50 series in the ensemble of Monte Carlo replications that results when the model is initialised using the parameter values obtained for the first free parameter set using Bayesian estimation. We observe that in all but a small fraction of cases, the KS test suggests that the model-simulated and (artificial) empirically-observed series come from the same distribution, as we would expect, given the proximity of our estimate to the true parameter values.

Figure 14: Goodness of fit testing for the housing market model. (a) Parameter set 1; (b) Parameter set 2. Each panel plots the KS test p values obtained for the ensemble of Monte Carlo replications.
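The KS-based check described above is straightforward to reproduce. The sketch below applies SciPy's two-sample KS test to an ensemble of simulated series; randomly generated data stands in for the actual HPI output, and the 5% threshold is an illustrative assumption.

import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)

# Stand-ins for the (artificial) empirical series and an ensemble of 50
# Monte Carlo replications of the calibrated model; in practice these would
# be the first-differenced HPI series described in Appendix A.
empirical = rng.normal(size=1000)
ensemble = [rng.normal(size=1000) for _ in range(50)]

# Two-sample KS test of each replication against the empirical series.
p_values = [ks_2samp(empirical, simulated).pvalue for simulated in ensemble]

rejected = sum(p < 0.05 for p in p_values)
print(f"{rejected} of {len(p_values)} replications rejected at the 5% level")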
Repeating the goodness of fit tests for the second free parameter set, we find that the obtained fit is significantly worse, with the KS test suggesting that the model-simulated and (artificial) empirically-observed series come from the same distribution in only a minority of cases (see Figure 14b). This is entirely expected, however, since we obtained inaccurate estimates for 8 of the considered free parameters, and provides further motivation for efforts to improve existing Bayesian methods. It should also be noted that the obtained posterior distributions demonstrate a far stronger tendency to be multimodal when compared to our previously considered cases, which can affect the accuracy of point estimates (Murphy 2012); see Appendix A.

From the preceding discussions, it should be apparent that despite the impressive progress that has been made during the preceding decade, ABM calibration techniques still require further development before truly robust and general methods can be devised.

In particular, we find that SMD methods, which are favoured in most of the contemporary literature, deliver less compelling performance than is currently suggested by most surveys. This is likely due to the fact that computational experiments of the type we consider here, in which we attempt to recover known parameter values through calibration and compare a variety of methods in identical situations, are almost nonexistent. Furthermore, most existing ABM calibration studies do not emphasise assessments of the uncertainty or accuracy of the associated estimates.

This appears to have led to a lack of awareness regarding the weaknesses of SMD methods and an overall assessment that is perhaps excessively optimistic. As previously suggested, there certainly is some awareness regarding the difficulty associated with minimising SMD objective functions as a result of the haphazard nature of their hypersurfaces and the computational cost of their evaluation. Nevertheless, our computational experiments have revealed that even if these objective functions can be minimised, the associated estimators may suffer from a significant bias in certain situations, which, at this stage, appears impossible to characterise a priori.

On the other hand, Bayesian methods, which seem to have been largely ignored, deliver far more compelling performance in a wide range of circumstances, failing only when confronted with a large-scale model and a free parameter set of significant cardinality. This would suggest that a paradigm shift is required, with Bayesian methods and the improvement of existing Bayesian estimation techniques needing to become key areas of focus in future research.

Along these lines, we suggest that attempts be made to improve the method of Grazzini et al. (2017), with a particular focus on relaxing the required independence assumptions and allowing for temporal dependencies to be captured when comparing datasets.

Acknowledgements

The author would like to thank Amazon Web Services for the generous provision of access to cloud computing resources and the UK government for the award of a Commonwealth Scholarship. Responsibility for the conclusions herein lies entirely with the author.

References
S. Alfarano, T. Lux, and F. Wagner. Estimation of agent-based models: The case of an asymmetric herding model. Computational Economics.
S. Alfarano, T. Lux, and F. Wagner. Estimation of a simple agent-based model of financial markets: An application to Australian stock and foreign exchange data. Physica A: Statistical Mechanics and its Applications.
S. Alfarano, T. Lux, and F. Wagner. Empirical validation of stochastic models of interacting agents. The European Physical Journal B: Condensed Matter and Complex Systems.
R. Axtell. Zipf distribution of U.S. firm sizes. Science.
R. Baptista, J. Farmer, M. Hinterschweiger, K. Low, D. Tang, and A. Uluc. Macroprudential policy in an agent-based model of the UK housing market. Bank of England Staff Working Paper.
S. Barde. Direct comparison of agent-based models of herding in financial markets. Journal of Economic Dynamics and Control.
S. Barde. A practical, accurate, information criterion for nth order Markov processes. Computational Economics.
S. Barde and S. van der Hoog. An empirical validation protocol for large-scale agent-based models. Studies in Economics, School of Economics, University of Kent.
C. Bianchi, P. Cirillo, M. Gallegati, and P. Vagliasindi. Validating and calibrating agent-based models: A case study. Computational Economics.
W. Brock and C. Hommes. Heterogeneous beliefs and routes to chaos in a simple asset pricing model. Journal of Economic Dynamics and Control.
S. Chen. Agent-based computational macroeconomics: A survey. In T. Terano, H. Deguchi, and K. Takadama, editors, Meeting the Challenge of Social Problems via Agent-Based Simulation. Springer-Verlag.
Z. Chen and T. Lux. Estimation of sentiment effects in financial markets: A simulated method of moments approach. Computational Economics.
S. Cincotti, M. Raberto, and A. Teglio. Credit money and macroeconomic instability in the agent-based model and simulator Eurace. Economics: The Open-Access, Open-Assessment E-Journal.
R. Cont. Empirical properties of asset returns: stylized facts and statistical issues. Quantitative Finance.
G. Dosi, G. Fagiolo, and A. Roventini. Schumpeter meeting Keynes: A policy-friendly model of endogenous growth and business cycles. Journal of Economic Dynamics and Control.
G. Dosi, M. Pereira, and M. Virgillito. On the robustness of the fat-tailed distribution of firm growth rates: A global sensitivity analysis. Journal of Economic Interaction and Coordination.
A. Fabretti. On the problem of calibrating an agent based model for financial markets. Journal of Economic Interaction and Coordination.
G. Fagiolo and A. Roventini. Macroeconomic policy in DSGE and agent-based models redux: New developments and challenges ahead. Journal of Artificial Societies and Social Simulation.
G. Fagiolo, M. Guerini, F. Lamperti, A. Moneta, and A. Roventini. Validation of agent-based models in economics and finance. LEM Papers Series, Laboratory of Economics and Management, Sant'Anna School of Advanced Studies, Pisa, Italy.
J. Farmer and D. Foley. The economy needs agent-based modelling. Nature.
J. Farmer and S. Joshi. The price dynamics of common trading strategies. Journal of Economic Behavior and Organization.
R. Franke. Applying the method of simulated moments to estimate a small agent-based asset pricing model. Journal of Empirical Finance.
R. Franke and F. Westerhoff. Structural stochastic volatility in asset pricing dynamics: Estimation and model contest. Journal of Economic Dynamics and Control.
J. Geanakoplos and J. Farmer. The virtues and vices of equilibrium and the future of financial economics. Complexity.
J. Geanakoplos, R. Axtell, J. Farmer, P. Howitt, B. Conlee, J. Goldstein, M. Hendrey, N. Palmer, and C. Yang. Getting at systemic risk via an agent-based model of the housing market. American Economic Review.
M. Gilli and P. Winker. A global optimization heuristic for estimating agent based models. Computational Statistics and Data Analysis.
C. Gourieroux, A. Monfort, and E. Renault. Indirect inference. Journal of Applied Econometrics.
J. Grazzini. Analysis of the emergent properties: Stationarity and ergodicity. Journal of Artificial Societies and Social Simulation.
J. Grazzini and M. Richiardi. Estimation of ergodic agent-based models by simulated minimum distance. Journal of Economic Dynamics and Control.
J. Grazzini, M. Richiardi, and M. Tsionas. Bayesian estimation of agent-based models. Journal of Economic Dynamics and Control.
M. Guerini and A. Moneta. A method for agent-based models validation. Journal of Economic Dynamics and Control.
L. Hansen and J. Heckman. The empirical foundations of calibration. Journal of Economic Perspectives.
S. Jacob Leal, M. Napoletano, A. Roventini, and G. Fagiolo. Rock around the clock: An agent-based model of low- and high-frequency trading. Journal of Evolutionary Economics.
A. Kaveh. Particle swarm optimization. In Advances in Metaheuristic Algorithms for Optimal Design of Structures. Springer-Verlag.
P. Knysh and Y. Korkolis. blackbox: A procedure for parallel optimization of expensive black-box functions. arXiv.
J. Kukacka and J. Barunik. Estimation of financial agent-based models with simulated maximum likelihood. Journal of Economic Dynamics and Control.
F. Lamperti. An information theoretic criterion for empirical validation of simulation models. Econometrics and Statistics.
F. Lamperti, A. Roventini, and A. Sani. Agent-based model calibration using machine learning surrogates. LEM Papers Series, Laboratory of Economics and Management, Sant'Anna School of Advanced Studies, Pisa, Italy.
B. LeBaron. Agent-based computational finance. In L. Tesfatsion and K. Judd, editors, Handbook of Computational Economics. Elsevier.
T. Lux and R. Zwinkels. Empirical validation of agent-based models. SSRN.
C. Macal and M. North. Tutorial on agent-based modelling and simulation. Journal of Simulation.
D. McFadden. A method of simulated moments for estimation of discrete response models without numerical integration. Econometrica.
K. Murphy. Machine Learning: A Probabilistic Perspective. MIT Press.
E. Panayi, M. Harman, and W. Wetherilt. Agent-based modelling of stock markets using existing order book data. In F. Giardini and F. Amblard, editors, Multi-Agent-Based Simulation XIII. Springer-Verlag.
D. Platt and T. Gebbie. Can agent-based models probe market microstructure? Physica A: Statistical Mechanics and its Applications.
M. Recchioni, G. Tedeschi, and M. Gallegati. A calibration procedure for analyzing stock price dynamics in an agent-based framework. Journal of Economic Dynamics and Control.
R. Regis and C. Shoemaker. Constrained global optimization of expensive black box functions using radial basis functions. Journal of Global Optimization.
I. Salle and M. Yildizoglu. Efficient sampling and meta-modeling for computational economic models. Computational Economics.
F. Willems, Y. Shtarkov, and T. Tjalkens. The context-tree weighting method: Basic properties. IEEE Transactions on Information Theory.
P. Winker, M. Gilli, and V. Jeleskovic. An objective function for simulation based inference on exchange rate data. Journal of Economic Interaction and Coordination.

A Technical Details

The primary purpose of this appendix is to provide readers with a complete specification of the technical details associated with each computational experiment described in the preceding sections. It is hoped that this will, in principle, allow others to reproduce the presented results.

A.1 Simple Time Series Models

A.1.1 Dataset Construction
Recall that we do not make use of empirically-observed data. Rather, we consider data generated by a particular model for a chosen set of parameter values and assess the extent to which the true parameter set can be recovered through the calibration of the model to the artificial data. We therefore begin our series of numerical experiments by generating a single dataset for each of the models described in Section , which have been initialised using the parameter values presented in Table 8.
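As a concrete illustration of this approach, the sketch below generates a single artificial series from a simple AR(1) model at a known parameter value, which would then play the role of the empirical data in a recovery experiment; the chosen coefficient, series length and noise scale are illustrative assumptions rather than the values reported in Table 8.

import numpy as np

def simulate_ar1(a, n, sigma=1.0, seed=0):
    """Generate an AR(1) series x_t = a * x_{t-1} + eps_t with a known 'true' a."""
    rng = np.random.default_rng(seed)
    x = np.zeros(n)
    for t in range(1, n):
        x[t] = a * x[t - 1] + rng.normal(scale=sigma)
    return x

# The artificial 'empirical' dataset: a single realisation at a chosen theta_true.
theta_true = 0.7  # illustrative value only
artificial_data = simulate_ar1(a=theta_true, n=1000)
print(artificial_data[:5])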
Table 8: True Parameter Values for the Set of Simple Time Series Models

Model    θ_true (Symbols)                     θ_true (Values)
1        a
2        a, a, a, b, b, c, c, c               0, 0.7, 0.1, 0.2, 0.2, 0.25, 0.5, 0.3
3        τ, σ, σ, d, d
4        g, b, g, b, g, b, g, b, r, β         0, 0, 0.9, 0.2, 0.9, −

The model numbers above correspond to the order in which the models were introduced in Section .

It should be noted that each calibration method makes certain assumptions regarding the data and we should therefore ensure that all such assumptions are satisfied by each artificial dataset before proceeding any further. The most notable of these assumptions, and one common to the majority of the considered methods, is that of stationarity. Referring to Figure 15, where we present the artificial data obtained for the selected parameter values, we see that the assumption of stationarity is indeed satisfied in almost all cases, with the only exception being the random walk. This is easily addressed, however, since the output of the random walk can be transformed to induce stationarity by considering the series of first differences (this transformation is applied to all series generated by the random walk model in all calibration experiments).
Figure 15: An illustration of the artificial data considered in the calibration experiments: (a) AR(1), stationary, p = 1.0; (b) ARMA(2, 2)-ARCH(2), stationary, p = 1.0; (c) random walk, non-stationary, p = 0.0; (d) random walk (differences), stationary, p = 0.963; (e) Brock and Hommes (1998), stationary, p = 1.0. The provided p values are obtained from the stationarity test proposed by Grazzini ( ).
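As a minimal illustration of this transformation, the sketch below generates a random walk, takes first differences, and checks stationarity. The exact test of Grazzini ( ) is not reproduced here; the augmented Dickey-Fuller test from statsmodels is used as a common stand-in, which is an assumption of this sketch.

import numpy as np
from statsmodels.tsa.stattools import adfuller

rng = np.random.default_rng(1)

# A random walk is non-stationary; its first differences are stationary.
random_walk = np.cumsum(rng.normal(size=1000))
differences = np.diff(random_walk)

for name, series in [("random walk", random_walk), ("first differences", differences)]:
    # adfuller returns (test statistic, p value, ...); a small p value rejects
    # the unit-root null, suggesting stationarity.
    stat, p_value = adfuller(series)[:2]
    print(f"{name}: ADF statistic = {stat:.2f}, p = {p_value:.3f}")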
A.1.2 Experimental Procedure

While Section provides a relatively complete description of the experimental procedure, it is still necessary to clarify a number of finer details.

Firstly, we are required to choose appropriate lengths for both the empirically-observed and model-simulated time series. Figure reveals that we have selected an empirical time series length of T = T_s = . We must also choose the size of the simulated ensemble, or the number of Monte Carlo replications performed for each candidate parameter set. In this case, one must be aware of the increased computational cost associated with the simulation of each additional series, while still ensuring that the ensemble is sufficiently large to reduce the variance associated with Monte Carlo simulation to acceptable levels. We therefore consider an ensemble of 250 realisations in the case of SMD methods and 100 realisations in the case of Bayesian estimation.

At this point, it should be noted that SMD methods generally require a greater number of Monte Carlo replications when compared to the Bayesian estimation approach of Grazzini et al. (2017). This is because most SMD methods involve the estimation of quantities such as moments or occurrence frequencies for each series, which are then averaged over the ensemble. Each additional series therefore provides only one additional value of the quantity of interest. In the case of Bayesian estimation, however, each new series adds an additional T_s observations to the sample to which KDE is applied.

Finally, we note that while the optimisation algorithms employed in the case of SMD methods can simply be iterated until convergence, we need to specify the desired number of sampled points in the case of the Metropolis-Hastings algorithm used during Bayesian estimation. In this simplified context, we make use of 4 independent instantiations of the algorithm, each sampling 5000 points, the first 1500 of which are discarded. This results in a combined sample consisting of 14000 sampled points.
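To make this sampling bookkeeping concrete, the following minimal sketch runs several independent random-walk Metropolis-Hastings chains and combines them after burn-in, reproducing the 4 × (5000 − 1500) = 14000 arithmetic above. The toy log-posterior, proposal scale and two-dimensional parameter vector are illustrative assumptions rather than the actual KDE-based posterior used in the paper.

import numpy as np

rng = np.random.default_rng(2)

def log_posterior(theta):
    """Toy stand-in for the (unnormalised) log posterior of the calibrated
    model; the real density would be built from the KDE-based likelihood."""
    return -0.5 * np.sum((theta - 0.5) ** 2) / 0.1**2

def metropolis_hastings(n_samples, theta0, step=0.05):
    """Random-walk Metropolis-Hastings returning a chain of samples."""
    chain = [np.asarray(theta0, dtype=float)]
    log_p = log_posterior(chain[-1])
    for _ in range(n_samples - 1):
        proposal = chain[-1] + rng.normal(scale=step, size=len(chain[-1]))
        log_p_new = log_posterior(proposal)
        if np.log(rng.uniform()) < log_p_new - log_p:
            chain.append(proposal)
            log_p = log_p_new
        else:
            chain.append(chain[-1])
    return np.array(chain)

# 4 independent chains of 5000 points each, discarding the first 1500 of each,
# giving a combined sample of 4 * (5000 - 1500) = 14000 points.
chains = [metropolis_hastings(5000, theta0=rng.uniform(size=2)) for _ in range(4)]
combined = np.concatenate([c[1500:] for c in chains])
print(combined.shape)  # (14000, 2)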
A.1.3 Method Hyperparameters

It often arises that calibration methods have parameters of their own that need to be carefully chosen in order to maximise performance. In the case of the GSL-div, MSM and Bayesian estimation, we simply adopt the suggestions of the original authors.
Table 9: MIC Hyperparameter Values and Validity Tests (columns: Model, b_l, b_u, r, L, Uniform, Uncorrelated). The model numbers correspond to the order in which the models were introduced in Section .

The setting of the MIC's hyperparameters is, however, slightly more nuanced, since suitable parameter settings vary from model to model. Fortunately, Barde ( ) introduces a set of tests that can be applied to ensure that the chosen values are appropriate. Our selected values and the results of these tests are presented in Table 9. It should be noted that Barde ( ) finds that the chosen value of L, while not necessarily ideal, is still a reasonable setting.

Note that this excludes our artificial datasets; while they are indeed model-generated, we consider them to be equivalent to empirically-observed data. For series of length 1000, Barde ( ) suggests that ensemble sizes of 250 or 500 be considered when employing the MIC, motivating our choice.

A.2 INET Oxford Housing Market Model

A.2.1 Dataset Construction
Unlike the models we have previously considered, the housing market model produces panel data as opposed to a single time series. While one would ultimately want to make use of the full spectrum of the model's outputs when attempting to calibrate a significant number of parameters, the method of Grazzini et al. (2017) is limited to the univariate case. For this reason, we are required to choose a single time series from the model-generated ensemble. Fortunately, this decision is not difficult, with the housing price index (HPI) being a logical choice in this case.

Figure 16: HPI time series for the chosen values of Market Average Price Decay, Sale Epsilon, P Investor and Min Investor Percentile. (a) Original time series (non-stationary, p = 0.0062); (b) first difference series with the transient period discarded (stationary, p = 0.6477). p values are obtained from the stationarity test proposed by Grazzini ( ).

Referring to Figure 16a, we see that the consideration of HPI results in further complications. Firstly, notice that the HPI time series is characterised by an initial transient period, followed by cyclical dynamics that persist for the remainder of the simulation. Since this transient period corresponds to the model's initialisation procedure rather than its true dynamics, it should be discarded. Secondly, we see that the model output is non-stationary, hence we consider the time series of first differences, as in the case of the random walk model (see Figure 16b).
A.2.2 Experimental Procedure

Our experimental procedure remains largely unchanged from that employed during the calibration of the set of simple time series models. Nevertheless, the increased computational cost associated with the housing market model requires us to make a number of small adjustments. Firstly, we reduce the number of Monte Carlo replications to 50, such that the amount of time taken to perform a single evaluation of the posterior density function is more computationally tractable. Secondly, we reduce the overall number of sampled points. In the case of parameter set 1, we consider 2 independent instantiations of the Metropolis-Hastings algorithm, each sampling 5000 points, resulting in a total sample size of 7000 (after discarding the first 1500 samples of each instantiation). In the case of parameter set 2, we again consider 2 independent instantiations of the Metropolis-Hastings algorithm, though each instantiation now samples 10000 points, resulting in a total sample size of 17000 (after burning-in).