Automated and Distributed Statistical Analysis of Economic Agent-Based Models
Andrea Vandin, Daniele Giachini, Francesco Lamperti, Francesca Chiaromonte
Andrea Vandin (a), Daniele Giachini (a), Francesco Lamperti (a,c), Francesca Chiaromonte (a,b)
(a) Institute of Economics and EMbeDS, Sant'Anna School of Advanced Studies, Pisa, Italy
(b) Dept. of Statistics and Huck Institutes of the Life Sciences, Penn State University, USA
(c) RFF-CMCC European Institute on Economics and the Environment, Milan, Italy
Abstract
We propose a novel approach to the statistical analysis of simulation models and, especially, agent-based models (ABMs). Our main goal is to provide a fully automated and model-independent tool-kit to inspect simulations and perform counter-factual analysis. Our approach: (i) is easy to use by the modeller, (ii) improves reproducibility of results, (iii) optimizes running time given the modeller's machine, (iv) automatically chooses the number of required simulations and simulation steps to reach user-specified statistical confidence, and (v) automatically performs a variety of statistical tests. In particular, our framework is designed to distinguish the transient dynamics of the model from its steady state behaviour (if any), estimate properties of the model in both "phases", and provide indications on the ergodic (or non-ergodic) nature of the simulated processes – which, in turn, allows one to gauge the reliability of a steady state analysis. Estimates are equipped with statistical guarantees, allowing for robust comparisons across computational experiments. To demonstrate the effectiveness of our approach, we apply it to two models from the literature: a large scale macro-financial ABM and a small scale prediction market model. Compared to prior analyses of these models, we obtain new insights and we are able to identify and fix some erroneous conclusions.

Keywords: ABM, Statistical Model Checking, Ergodicity analysis, Steady state analysis, Transient analysis, Warmup estimation, T-test and power, Prediction markets, Macro ABM

JEL Classification: C15, C18, C63, D53, E30
1. Introduction
In this article we present a model-independent and fully automated approach to the statistical analysis of simulation models and, especially, agent-based models (ABMs). Leveraging a tool-box of efficient algorithms to inspect simulations and perform model-based counter-factual analysis, our approach (i) is easy to use by the modeller, (ii) improves reproducibility of the results, (iii) distributes simulations across the cores of a machine or across computer networks, (iv) automatically chooses a sufficient number of simulations and simulation steps to reach a user-specified statistical confidence, and (v) automatically runs a variety of statistical tests that are often overlooked by practitioners. In particular, the proposed approach allows one to distinguish the transient dynamics of the model from its steady state behaviour (if any), to estimate properties of the model in both "phases", to check whether the ergodicity assumption is reasonable, and to equip the results with statistical guarantees, allowing for robust comparison of model behaviours across computational experiments.

In the last two decades, the use of ABMs has spread across several fields – including ecology (Grimm and Railsback, 2013), health care (Effken et al., 2012), sociology (Macy and Willer, 2002), geography (Brown et al., 2005), medicine (An and Wilensky, 2009), research in bioterrorism (Carley et al., 2006), and military tactics (Ilachinski, 1997). In economics, ABMs contributed to the understanding of a variety of micro and macro phenomena (Tesfatsion and Judd, 2006). They provided an alternative environment for policy-testing in the aftermath of the last financial crisis, when more traditional approaches (e.g., dynamic stochastic general equilibrium models and computable general equilibrium models) failed (Fagiolo and Roventini, 2012, 2017). Moreover, they were recently used for macroeconomic forecasting with promising results (Delli Gatti and Grazzini, 2020).

The key advantage of ABMs is the flexibility they allow in modelling realistic micro-level behaviours (e.g., bounded rationality, routines, stochastic decision processes) and agents' interactions (e.g., imitation, network effects, spatial influence), which give rise to aggregate dynamics that are qualitatively different from those at the individual level (emergent properties). This advantage comes at the cost of model complexity, which typically prevents analytical treatment and forces the modeller to rely on numerical simulations.

Typically, little attention has been devoted to simulation protocols. Yet decisions about (i) how many steps to run, (ii) how many steps to "cut" up front as transient (aka, the warm-up period), and (iii) how many runs to perform under each parameter configuration, deeply influence an analysis and the reliability of its results. For example, statements like "the results have been averaged over n simulations" or "we run a Monte Carlo exercise of size n", without a proper justification for the choice of n, are rather ubiquitous (see e.g. Beygelzimer et al., 2012; Kets et al., 2014; Caiani et al., 2016; Lamperti et al., 2018, 2019; Dosi et al., 2019; Fagiolo et al., 2020). This can lead to ineffective estimates of model behaviour, with low precision and poor statistical confidence.
While irrelevant for "thought experiments", these aspects deserve more attention when different policies are compared in counter-factual simulation experiments, or multiple parameter configurations are explored to discriminate among emerging behaviours. Secchi and Seri (2017) conducted a study on 55 ABMs published between 2010 and 2013 in high-quality management and organizational science journals, analyzing the power of t-tests in simulations on different model parametrizations. Their study showed that, in most cases, simulation exercises did not offer acceptable statistical quality, casting doubt on the results and their implications. The main cause of low statistical accuracy turned out to be an insufficient number (n) of simulations performed. Similarly, a poor handling of transient behaviours can distort results. As we will show in Section 7, discarding "the initial w periods from each simulation to focus on the stationary behaviour" (see, e.g., Kets et al., 2014) without a proper justification for the choice of w can lead to misleading conclusions about steady state properties. Furthermore, as we illustrate in Section 8, ergodicity tests are necessary in order to establish whether performing a steady state analysis makes sense at all.

In our opinion, these problems are due to the fact that the simulation-based analysis of ABMs (i.e., the inspection of models' simulations) is often handcrafted, resulting in a time-consuming and error-prone process (see also Lee et al., 2015). The simulation, operations research, and computer science communities have substantially advanced the engineering of such tasks, developing automatic techniques equipped with statistical guarantees (see, e.g., Law and Kelton, 2015). While cross-discipline fertilisation has recently increased (Dahlke et al., 2020), these developments are often overlooked by the so-called ACE (agent-based computational economics) community. In Section 2, we use the model by Grazzini (2012) to illustrate an example concerning the identification of transient dynamics.

In this article we introduce a novel fully-automated and engineered approach to ABM inspection. We borrow techniques from the communities mentioned above to efficiently analyse stochastic simulation models and equip results with statistical guarantees. In particular, we propose novel, streamlined, parallelized and automated algorithms to carry out both transient analysis (estimating the average dynamics of the model at specific time points and characterizing the associated uncertainty) and steady state analysis (estimating the average dynamics of the model in the long run by estimating and removing the transient period) – and show that such algorithms are computationally efficient. We also equip our steady state analysis with a methodology for ergodicity diagnostics, which provides indications on whether the model behaves ergodically, and thus on the reliability of the steady state analysis itself.

Our work contributes to two strands of the ACE literature. First, it complements many recent proposals for the validation of simulated models (see the surveys in Fagiolo et al., 2019; Lux and Zwinkels, 2018). For example, the method proposed by Guerini and Moneta (2017) for macro ABMs requires the model to be in a steady state as well as the removal of all observations belonging to the transient period, and calibration approaches based on simulated moments (Winker et al., 2007; Franke and Westerhoff, 2012; Grazzini and Richiardi, 2015), as well as recent Bayesian techniques
(e.g., Grazzini et al., 2017), typically apply to ergodic models. From a different perspective, our transient analysis can evaluate multiple features at each time step – including the probabilities of observing certain patterns – and thus can support the use of validation metrics recently proposed in the literature (e.g. Barde, 2016; Lamperti, 2018a,b). Second, we contribute to the analysis of the complexities of ABMs' output (Lee et al., 2015; Mandes and Winker, 2017; Kukacka and Kristoufek, 2020) by providing fast and practical tools to inspect models with statistical guarantees (Secchi and Seri, 2017), and by complementing the proposals in Seri and Secchi (2017) for determining the adequate number of simulation runs to use. Finally, we offer an automated environment to carry out tests across experiments that are typical in the macro ABM literature (see e.g. Dosi et al., 2015).

We validate our approach on two models from the literature. In Section 6 we replicate and enrich the transient analysis from Caiani et al. (2016) on a large scale benchmark stock flow consistent macro ABM. We optimize the number of simulations to reach a given (user-defined) level of statistical precision for each time point of interest. We show how this is necessary to establish, in a statistically sound manner, differences across model configurations – thereby facilitating counter-factual policy analysis. In Section 7 we perform a steady state analysis of the prediction market model of Kets et al. (2014). This model has been chosen because of its analytical tractability, which provides an effective ground truth against which we test our framework: the model has been analytically studied in Bottazzi and Giachini (2019b), proving asymptotic results about agents' wealth and market price. We show that an erroneous identification of the transient period led to misleading qualitative and quantitative results in the original simulation-based analysis by Kets et al. (2014); the number of long-run surviving agents and the relationships among market price and other model parameters were incorrectly characterized, and the agents' relative wealths were miscomputed. Instead, our framework allows us to correctly detect the transient period and correctly characterize the (analytically known) steady state properties established by Bottazzi and Giachini (2019b). In Section 8 we also apply our methodology for ergodicity analysis to (non-ergodic) variants of this prediction market model, showing how it can be used to further increase the reliability of a steady state analysis. Finally, we highlight how the distributed nature of our algorithms allows us to automatically parallelise simulations in the cores of a computer without modifying the original (purely sequential) model. This affords, e.g., a 20x speed-up on a machine with 20 physical cores – in the case of the macro ABM model, this resulted in a decrease of analysis run-time from 15 days to about 16 hours.

From a technical perspective, our framework builds on MultiVeStA (Sebastio and Vandin, 2013; Gilmore et al., 2017), a statistical analysis tool presented in Section 5. MultiVeStA supports Java, R, C++, and Python simulators, and is maintained by one of the authors. The extension presented here regards support for statistical tests (and their power) to compare different model parametrizations, and the ex-novo development of steady state analysis. Furthermore, by applying it to two known ABM models, we also contribute to increasing the accessibility of ABMs and to the replicability of their results. Material for replicating the experiments presented in this paper is available at https://github.com/andrea-vandin/MultiVeStA/wiki.

The remainder of this article is organized as follows. Section 2 discusses the analysis of simulation output. Sections 3 and 4 present our algorithms and methodology, respectively, and Section 5 introduces MultiVeStA. Sections 6 and 7 illustrate our transient and steady state analysis techniques using two ABM models from the literature.
Section 8 showcases our methodology for ergodicity analysis. Section 9 demonstrates the run-time gains afforded by MultiVeStA's parallelization capabilities, and Section 10 provides some conclusions.
2. Analysis of simulation output
ABM analysis typically employs stochastic simulations, relying on Monte Carlo methods, to derive reliable estimates of the true model characteristics (Richiardi et al., 2006; Lee et al., 2015; Fagiolo et al., 2019).

Without loss of generality, one can represent an ABM as a mapping map: I → O from a set of input parameters I into an output set O. I is usually a multidimensional space spanned by the support of each parameter. O is typically larger and more complex, as it comprises time-series realizations of a very large number of micro- and macro-level variables. In most cases we can think of the output of an ABM as a discrete-time stochastic process $(\mathbf{Y}_t)_{t>0}$ describing the longitudinal evolution of a vector of variables of interest (e.g., the wealth of an agent, the GDP of a country, etc.). For simplicity, here we focus on the case in which $(\mathbf{Y}_t)_{t>0}$ contains only one time series of interest $(Y_t)_{t>0}$. However, our framework straightforwardly covers the concurrent analysis of multiple time series.

Figure 1(a) depicts n independent simulations of $Y_t$ (one per row), each comprising $t = 1, \ldots, m$ steps (one per column). (By independent simulations we mean runs obtained using a different random seed for each replication, with the simulator status reset to an initial configuration at the beginning of each replication.) The outcome of a simulation i is therefore a sequence $\{y_{i,1}, \ldots, y_{i,m}\}$ denoting a realization of length m. Clearly, the observations within the same row i are not independent, while those in the same column t are independent and identically distributed (IID).

Figure 1: Transient and steady state analysis using n simulations of m steps each. (a) Simulations; (b) transient analysis; (c) steady state analysis by Replication and Deletion (RD).

Here we focus on two typical classes of properties:

• Transient properties concerning $E[Y_t]$: what is the expected value of a model's property at a given time t (or within a time range, or at the occurrence of a specific event)?

• Steady state properties concerning $E[Y] = \lim_{t\to\infty} E[Y_t]$: what is the expected value of a model's property at steady state (i.e., when the system has reached a statistical equilibrium, or after the initial warm-up period)?

An example of transient property is given in Section 6: what is the expected unemployment rate in each of the first 400 quarters of a macro ABM? In this example a transient analysis is particularly important because the model has been designed to study fluctuations in the quarters following a given initial condition. In contrast, an example of steady state property is given in Sections 7 and 8: given a market with repeated sessions, what is the expected wealth of each agent at steady state? In this example a steady state analysis is particularly important because the model has been designed to study problems of market selection and informative efficiency. Obviously a steady state analysis is meaningful only "around" a statistical equilibrium. This requires that $\lim_{t\to\infty} E[Y_t]$ exists and is finite.
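To make the data layout of Figure 1 concrete, the following minimal sketch (not part of MultiVeStA; the AR(1)-style toy process and all names are illustrative stand-ins for any simulated output series) builds the n × m matrix of observations and computes the two kinds of means used by the estimators discussed below.

    # Illustrative only: n independent runs of m steps, stored as in Figure 1.
    import numpy as np

    rng = np.random.default_rng(42)
    n, m, w = 50, 1000, 200            # runs, steps, assumed warmup length
    y = np.empty((n, m))               # y[i, t]: observation at step t of run i
    y[:, 0] = 10.0                     # start far from the long-run mean
    for t in range(1, m):
        y[:, t] = 0.9 * y[:, t - 1] + rng.normal(size=n)

    # Transient estimator: vertical (column) mean at a fixed time t (IID sample).
    transient_estimate = y[:, 30].mean()

    # Steady state estimator: horizontal (row) means after discarding the first w
    # observations, then the mean of those means (Replication and Deletion).
    horizontal_means = y[:, w:].mean(axis=1)
    steady_state_estimate = horizontal_means.mean()
    print(transient_estimate, steady_state_estimate)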
We first present two complementary techniques for steady state analysis that rely on such assumption, and then (in Section 4) combine them into a methodology for ergodicity diagnostics, that is, for assessing whether the assumption is reasonable or clearly violated. (We leave to future work extensions of our framework that would allow us to detect the number and nature of the statistical equilibria of a simulation model.)

Figures 1(b) and (c) depict how to compute statistical estimates for $E[Y_t]$ and $E[Y]$. Such estimates can and should be accompanied by appropriate measures of uncertainty, e.g., computing "α-δ confidence intervals" (CI) around them. Given two user-specified parameters $\alpha \in (0, 1)$ and $\delta \in \mathbb{R}^+$, we will show how to guarantee, with statistical confidence $(1-\alpha)\cdot 100\%$, that the true expected value lies within an interval of width at most δ centred at its estimate, and how to optimize the number of runs needed to obtain such guarantee. These steps, which are sometimes overlooked in the ABM community, can make the statistical analysis of any stochastic simulation model sounder and more informative for policy analysis. We now provide more details on our proposals for transient and steady state analyses.

Transient analysis.
Procedures for transient analysis are well-established and relatively simple. As shown in Figure 1(b), for a given time of interest t (a column) we obtain a natural (unbiased) estimator for $E[Y_t]$ by computing the vertical mean $\bar{Y}_t$ of the observations at t (across the rows). Since these observations are IID, we can use standard statistical techniques based on the law of large numbers to build CIs as follows (see Chapter 9 of Law and Kelton, 2015):

$$\bar{Y}_t \pm t_{n-1,1-\alpha/2}\cdot\sqrt{\frac{s_t^2}{n}}, \qquad (1)$$

where n is the number of simulations, $s_t^2$ is the sample variance of the observations at t, and the multiplier $t_{n-1,1-\alpha/2}$ is obtained from the tabulation of the Student's t distribution with n−1 degrees of freedom (the probability that a value sampled from this distribution is smaller than $t_{n-1,1-\alpha/2}$ is $1-\alpha/2$). For any fixed confidence level α, the width of the CI decreases as n increases. Therefore, in an automated procedure for computing an α-δ CI, we can continue performing new simulations until the width becomes smaller than the desired δ (the target width can also be expressed as a fraction of the mean value, i.e., δ% of $\bar{Y}_t$). Note that the CI width shrinks slowly, at the rate of the square root of n. Therefore, it is important to perform the correct number of simulations to guarantee the target width without performing unnecessary ones; Section 3 offers an automated procedure for doing so. Furthermore, since in many cases different times t might have different variances $s_t^2$, we account for the fact that different numbers of simulations might be required at each t to get CIs of homogeneous width across times. As we will see in Section 6.3, this is particularly important for counter-factual analysis.

A common exercise that builds upon transient analysis is to compare estimates obtained for different model configurations (typically corresponding to different sets of input parameter values), so as to assess whether the configurations differ significantly in terms of the output variable(s) under consideration. Given the outcomes of the transient analyses for the two configurations and a user-defined significance level $a_w$, our tool-box performs a Welch's t-test of means' equality (Welch, 1947) for every t of interest. MultiVeStA also computes the power of such test (Chow et al., 2002) in detecting a difference of at least a given (precision) ε (see Section 3.1.2); in Section 6.3 we show that a reasonable choice is to set ε = δ.
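The sequential use of Equation (1) can be sketched as follows. This is an illustrative reimplementation under simplifying assumptions (a hypothetical run_simulation stand-in, blocks of bl runs, δ given as an absolute value), not MultiVeStA's actual code.

    # Keep adding blocks of bl simulations until the (1 - alpha) CI width <= delta.
    import numpy as np
    from scipy import stats

    def run_simulation(rng, t=30):
        # Toy stand-in for one model run, returning the observation at step t.
        y = 10.0
        for _ in range(t):
            y = 0.9 * y + rng.normal()
        return y

    def auto_ci(alpha=0.05, delta=0.1, bl=20, max_sims=10**5, seed=1):
        rng = np.random.default_rng(seed)
        obs = []
        while True:
            obs.extend(run_simulation(rng) for _ in range(bl))
            n = len(obs)
            mean, s2 = np.mean(obs), np.var(obs, ddof=1)
            half = stats.t.ppf(1 - alpha / 2, n - 1) * np.sqrt(s2 / n)  # Eq. (1)
            if 2 * half <= delta or n >= max_sims:   # CI width at most delta
                return mean, half, n

    print(auto_ci())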
As depicted in Figure 1(c), a steady state analysis can be performed similarly to a transient analysis by adding a pre-processing step. We first compute the horizontal mean $\bar{Y}_i(w)$ within each simulation i, ignoring a given number of initial observations w. Since all these means are IID, we can compute their vertical mean $\bar{Y}(w)$ and build a CI around it as in Equation (1).

Unfortunately, this approach has intricacies that hinder its automatic implementation and can lead to relevant analysis errors. Depending on the chosen number w of initial observations to discard, the estimator $\bar{Y}(w)$ of E[Y] might carry a bias due to the transient behavior of the system, and not give us reliable information on its steady state (see Section 7.2 for a notable example from the literature). In order to avoid this issue, we need to identify the correct number w of steps the system needs to exit its transient (or warmup) period, and discard the initial w observations from each simulation. Such procedure is known as Replication and Deletion (RD, Law and Kelton, 2015). Effectively identifying the length of the warmup period is a difficult problem. The most popular approaches in the ABM community are rooted in Welch's method (Welch, 1983):

1. Perform n simulations of given length m and compute averages $\bar{Y}_t$, $t = 1, \ldots, m$, as in Figure 1(b);
2. Plot $\bar{Y}_t$, $t = 1, \ldots, m$ (one might smooth the plot, e.g., employing moving-windows averages, in effect further averaging each $\bar{Y}_t$ with a few neighbouring steps);
3. Choose the time w after which the plot seems to converge. If no such time exists, iterate the procedure from point (1), performing a new batch of n simulations of length m, and computing averages over all simulations.

Being only semi-automated and based on a visual assessment, this procedure is time consuming, error-prone, and not backed by a strong statistical justification. It also critically depends on choosing a large enough "time horizon" m – of course progressively larger m can be tried, adding to the computational burden.

More recently, Grazzini (2012) presented an alternative approach where a single simulation of length m is performed and divided into windows of length wi (m and wi are arbitrarily chosen). If the distribution of the means computed within each window passes a randomness test (in particular the Runs Test by Gibbons, 1986; Wald and Wolfowitz, 1940), then the author concludes that the system is in steady state. The use of statistical tests rather than visual assessments makes the approach more reliable, fostering its use in the ABM literature (e.g., Guerini and Moneta, 2017; Lamperti et al., 2020). However, the approach is still not fully automated – and relies on the arbitrary choice of m and wi; quoting the author, "with appropriate settings the tests can detect non-stationarity" (Grazzini, 2012). In the next section we introduce a fully automated statistical procedure for estimating the end of the warmup period.
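To illustrate the windowed randomness check just described, here is a minimal hand-rolled sketch of an above/below-median runs test applied to window means. The window length wi, the toy series, and the threshold interpretation are illustrative assumptions, not the exact procedure of Grazzini (2012).

    import numpy as np
    from scipy import stats

    def runs_test(x):
        # Wald-Wolfowitz runs test on signs relative to the median.
        signs = x > np.median(x)
        n1, n2 = signs.sum(), (~signs).sum()
        runs = 1 + np.count_nonzero(signs[1:] != signs[:-1])
        mu = 1 + 2 * n1 * n2 / (n1 + n2)              # expected number of runs
        var = (mu - 1) * (mu - 2) / (n1 + n2 - 1)     # its variance
        z = (runs - mu) / np.sqrt(var)
        return 2 * stats.norm.sf(abs(z))              # two-sided p-value

    rng = np.random.default_rng(0)
    series = rng.normal(size=10_000)                  # stand-in for one long run
    wi = 100                                          # arbitrary window length
    window_means = series.reshape(-1, wi).mean(axis=1)
    # A small p-value rejects randomness, suggesting the transient is not over.
    print("p-value:", runs_test(window_means))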
3. Automated simulation-based analysis with statistical guarantees

Our approach to transient and steady state analysis is fully automated, in that all parameters are computed automatically or have default values. The user specifies the properties to be studied, the α and δ parameters to be employed in the CI construction, and an optional maximum number of allowed simulations (if this number is reached before satisfying the CI constraints, the analysis terminates with the currently computed CIs). As in Section 2, we focus the description on a single variable $Y_t$, but our treatment applies straightforwardly to the analysis of multiple model characteristics (indeed, MultiVeStA implements multi-variable analyses).

Section 3.1.1 describes how to estimate transient properties expressed as expected values, $E[Y_t]$, and how to build CIs around them. After this, Section 3.1.2 describes how to statistically compare estimates from different experiments.

Algorithm 1 illustrates autoIR, a simple automated algorithm for transient analysis that takes in input bl (discussed later), a time of interest t, and α and δ, and produces in output an estimate of $E[Y_t]$ and a corresponding CI. The algorithm automatically determines the number n of simulations required to guarantee that the $(1-\alpha)\cdot 100\%$ CI has width at most δ. Lines 2-4 set t as time horizon m, and initialize the counter n of computed simulations and the list µ to store the observations at step t from each simulation (the $y_{i,t}$ in Figure 1(b)). Lines 6-11 perform a block of bl simulations (by default 20; Law and Kelton, 2015), populating µ. In Line 7, y is a list of size m containing a value $y_{i,t}$ for each time point t from 1 to m for the current simulation i, but only the value for t = m is used, adding it to µ. After performing bl simulations, autoIR computes the mean $\bar{\mu}$ and variance $s^2$ of µ, used to compute the width d of the current CI. If d is greater than δ, autoIR performs another block of bl simulations, otherwise it returns the current CI. (For the sake of presentation, all algorithms in the paper consider δ given as an absolute value. The case of δ given in percentage terms relative to the studied means is trivially obtained by changing the comparisons d > δ into $d/\bar{\mu} > \delta$.)

The implementation of autoIR in MultiVeStA allows one to concurrently estimate $E[Y_t]$ for different time points t (e.g., average bankruptcies in each t from 1 to 400 in Section 6). This is done by computing, at each iteration, mean, variance and CI only for the elements of y (Line 7) that correspond to time points whose current CI width is still above δ. At each iteration of a block of bl simulations, the time horizon m is updated with the largest t still to be processed.

MultiVeStA allows one to compare, in a statistically meaningful and reliable way, expected values corresponding to different settings or parametrizations of a model. Given that the compared means might come from experiments with different sample sizes and variances, we use Welch's t-test (Welch, 1947), whose power can be computed as in Chow et al. (2002).

The Welch's t-test.
Given estimates from two transient analyses for a set of time points T, our tool performs a test for equality of means for each t ∈ T following Welch (1947). In symbols, given two experiments {j, k}, define the set of triplets $D = \{(\bar{Y}_{i,t}, s^2_{i,t}, n_{i,t}) \mid i \in \{j, k\}, t \in T\}$, each containing the mean, the sample variance, and the number of simulations for time t in experiment i. MultiVeStA takes D as input and, for each t, computes

$$\tau_t = \frac{\bar{Y}_{j,t} - \bar{Y}_{k,t}}{\sqrt{f_{j,t} + f_{k,t}}}, \qquad (2)$$

where $f_{i,t} = s^2_{i,t}/n_{i,t}$, $i \in \{j, k\}$. Following Welch (1947), under the null hypothesis that the difference between the two means is zero, each $\tau_t$ follows a Student's t-distribution with degrees of freedom approximated as in Satterthwaite (1946):

$$\nu_t \approx \frac{(f_{j,t} + f_{k,t})^2}{f_{j,t}^2/(n_{j,t}-1) + f_{k,t}^2/(n_{k,t}-1)}.$$

Therefore, given a statistical significance $a_w$, MultiVeStA uses $\tau_t$ to perform the test of no difference between the two means, producing 1 if $\tau_t \in [-t_{\nu_t,1-a_w/2}, t_{\nu_t,1-a_w/2}]$ (the null hypothesis of equal means is not rejected) and 0 otherwise. The significance $a_w$ is user-specified, and can be set equal to the α used for the transient analysis.

Algorithm 1 autoIR: Transient analysis
Require: bl, α, δ, t (default: 20, 0.05, 0.1, NA)
 1: // Set t as time horizon m, and initialize data structures
 2: m ← t
 3: n ← 0
 4: µ ← empty list
 5: repeat
 6:     for i ∈ {1, ..., bl} do
 7:         y ← drawIndependentSimulation(m)
 8:         // Add y_{i,t} to µ
 9:         µ.add(y[m])
10:         n ← n + 1
11:     end for
12:     (µ̄, s²) ← computeMeanAndVariance(µ)
13:     d ← computeCIWidth(µ̄, s², n, α)
14: until d ≤ δ
15: return the (1−α)·100% CI [µ̄ − d/2, µ̄ + d/2] of width at most δ

Algorithm 2 autoRD: Steady state analysis by Replication and Deletion
Require: B, b, bs, minVar, bl, α, δ (default: 128, 4, 16, 1E-7, 20, 0.05, 0.1)
 1: w ← autoWarmup(B, b, bs, minVar)
 2: m ← w · 2
 3: n ← 0
 4: µ ← empty list
 5: repeat
 6:     for i ∈ {1, ..., bl} do
 7:         y ← drawIndependentSimulation(m)
 8:         y′ ← (y_{w+1}, ..., y_m)
 9:         µ.add(computeMean(y′))
10:         n ← n + 1
11:     end for
12:     (µ̄, s²) ← computeMeanAndVariance(µ)
13:     d ← computeCIWidth(µ̄, s², n, α)
14: until d ≤ δ
15: return the (1−α)·100% CI [µ̄ − d/2, µ̄ + d/2] of width at most δ

Power of the test.
Following Chow et al. (2002), MultiVeStA estimates the power $1-\beta_t$ of Welch's t-test in detecting a difference of at least a given precision ε between the two means at time t. This is

$$\beta_t = T_{\nu_t}\!\left( t_{\nu_t,1-a_w/2} \,\middle|\, \frac{|\varepsilon|}{\sqrt{f_{j,t} + f_{k,t}}} \right), \qquad (3)$$

where $T_{\nu_t}(x \mid \theta)$ is the cumulative distribution function of a non-central t-distribution with $\nu_t$ degrees of freedom and non-centrality parameter θ, evaluated at point x. Calculating the power of Welch's t-test requires specifying the minimum difference ε (Chow et al., 2002). As a rule of thumb, we suggest setting ε ≥ δ, the parameter used in the transient analysis, which expresses a precision for the estimated mean. In Section 6, setting ε = δ leads to very good power for the considered macro ABM.
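A compact sketch of Equations (2)-(3) follows, using SciPy's Student-t and non-central t distributions. The inputs and the two-sided critical value convention are assumptions for illustration, not MultiVeStA's internal code.

    import numpy as np
    from scipy import stats

    def welch_test_and_power(mj, s2j, nj, mk, s2k, nk, a_w=0.05, eps=0.1):
        # Each configuration is summarized by (mean, sample variance, #runs).
        fj, fk = s2j / nj, s2k / nk
        tau = (mj - mk) / np.sqrt(fj + fk)                        # Eq. (2)
        nu = (fj + fk) ** 2 / (fj**2 / (nj - 1) + fk**2 / (nk - 1))
        crit = stats.t.ppf(1 - a_w / 2, nu)
        equal_means = abs(tau) <= crit            # 1: null of equal means kept
        theta = abs(eps) / np.sqrt(fj + fk)       # non-centrality parameter
        beta = stats.nct.cdf(crit, nu, theta)     # Eq. (3)
        return equal_means, 1 - beta              # decision, power

    print(welch_test_and_power(0.05, 1e-4, 120, 0.06, 2e-4, 90))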
A statistically sound analysis of steady state properties poses challenges that have been thoroughly investigated by the simulation community, at the boundary of computer science and operations research. Two main approaches have emerged (Alexopoulos and Goldsman, 2004; Whitt, 1991; Law and Kelton, 2015): those based on Replication and Deletion (RD; see Section 2), and those based on batch means (BM) (Conway, 1963; Alexopoulos and Seila, 1996; Steiger et al., 2005). Unlike RD, which computes many short simulations, BM computes one long run which is evenly divided into adjacent non-overlapping subsamples labelled as batches. Intuitively, if certain statistical properties hold, each batch can be used as a simulation in RD, as depicted in Figure 2. This can be seen as a generalized version of the proposal by Grazzini (2012), which allows one to estimate the end of the warmup period rather than to check whether a given time is subsequent to such end. (More precisely, Grazzini (2012) appears to employ a non-automated version of BM. Yet the first automated version of BM was published in 1979 (Law and Carson, 1979). This is a clear signal of the potential, and often overlooked, complementarities between the simulation community and the ABM community in economics.)

Figure 2: Steady state analysis by Batch Means (BM) using one long simulation: (i) we split the simulation into batches (consecutive steps) of size b, and we compute the mean within each batch (the batch means $B_i$); (ii) we compute the mean of such means, the grand mean, ignoring the first l batches, where the warmup is assumed to terminate. We obtain $\bar{B}(l)$, an estimator for E[Y].

There is no single best approach between RD and BM (Alexopoulos and Goldsman, 2004; Whitt, 1991; Kelton and Law, 1984). They are complementary, and therefore have complementary (dis)advantages. RD, which uses many short simulations, suffers from biases due to initial conditions. BM, which uses many short batches from one long simulation, is less affected by initialisation bias but suffers from correlations among batch means.

While some automated BM-based procedures have been proposed (e.g., Steiger et al., 2005; Tafazzoli et al., 2011; Gilmore et al., 2017), to the best of our knowledge, little attention has been paid to RD. Interestingly, Lada et al. (2013) tried to combine the two approaches exploiting their respective strengths: they use BM for warmup analysis, and automate the use of RD by discarding the estimated transient behaviour from each simulation. Following a similar approach, we extract and condense the warmup analysis capabilities inspired by BM into a simple self-standing procedure for warmup estimation (autoWarmup), and we introduce automated RD- and BM-based algorithms (autoRD and autoBM, respectively) which use autoWarmup. In all algorithms (see Figure 3) we favour simplicity and accessibility.

Algorithm 2 illustrates autoRD. The difficult part in automating RD is the warmup analysis. However, in our setting we can easily do this by invoking autoWarmup (Line 1; see Section 3.2.2 below). For now it is sufficient to know that w is the last step of the estimated warmup period. Once w has been determined, we have to set a substantially larger time horizon (Law and Kelton, 2015). We can do this using a (small) multiplier; in Line 2 the default multiplier for w is 2. The code of autoRD also presents a second modification with respect to that of autoIR: we replaced Line 9 of Algorithm 1 with Lines 8-9 of Algorithm 2 to discard the first w observations from y, and add the mean of the remaining values of y (the horizontal mean in Figure 1(c)) to µ.

Algorithm 3 provides pseudo-code for our automatic warmup estimation, inspired by existing BM-based approaches for steady state analysis (Steiger et al., 2005; Gilmore et al., 2017; Tafazzoli et al., 2011). Indeed, such algorithms include a form of warmup analysis that we extract and refine into a simple self-standing procedure.

Algorithm 3 autoWarmup: Warmup estimation
Require: B, b, bs, minVar (default: 128, 4, 16, 1E-7)
 1: // Draw the first B · bs steps
 2: µ ← array(B)
 3: for i ∈ {1, ..., B} do
 4:     µ[i] ← drawBatchAndComputeMean(bs)
 5: end for
 6: (a, ρ) ← goodnessOfFitTests(µ, b, minVar)
 7: // Keep doubling bs and time horizon until tests pass
 8: while a > a∗ or ρ > ρ∗ do
 9:     for i ∈ {1, ..., B/2} do
10:         µ[i] ← (µ[2·i − 1] + µ[2·i]) / 2
11:     end for
12:     bs ← 2 · bs
13:     for i ∈ {B/2 + 1, ..., B} do
14:         µ[i] ← drawBatchAndComputeMean(bs)
15:     end for
16:     (a, ρ) ← goodnessOfFitTests(µ, b, minVar)
17: end while
18: return warmup period estimated to terminate after B · bs steps

Algorithm 4 goodnessOfFitTests
Require: µ, b, minVar
 1: µ′ ← (µ_{b+1}, ..., µ_B)
 2: (µ̄, s²) ← computeMeanAndVariance(µ′)
 3: (a, ρ) ← (0, 0)
 4: if s² > minVar then
 5:     a ← AndersonDarlingNormalityTest(µ′, µ̄, s²)
 6:     ρ ← lag1Autocorrelation(µ′, µ̄, s²)
 7: end if
 8: return (a, ρ)

Algorithm 5 autoBM: Steady state analysis by Batch Means
Require: B, b, bs, minVar, α, δ (default: 128, 4, 16, 1E-7, 0.05, 0.1)
 1: autoWarmup(B, b, bs, minVar)
 2: // Fast-forward simulation after warmup
 3: µ ← array(B)
 4: for i ∈ {1, ..., B} do
 5:     µ[i] ← drawBatchAndComputeMean(bs)
 6: end for
 7: (a, ρ, µ̄, d) ← goodnessOfFitTestsAndCI(µ, b, minVar, α)
 8: // Keep doubling bs and time horizon until tests pass
 9: while a > a∗ or ρ > ρ∗ or d > δ do
10:     for i ∈ {1, ..., B/2} do
11:         µ[i] ← (µ[2·i − 1] + µ[2·i]) / 2
12:     end for
13:     bs ← 2 · bs
14:     for i ∈ {B/2 + 1, ..., B} do
15:         µ[i] ← drawBatchAndComputeMean(bs)
16:     end for
17:     (a, ρ, µ̄, d) ← goodnessOfFitTestsAndCI(µ, b, minVar, α)
18: end while
19: return the (1−α)·100% CI [µ̄ − d/2, µ̄ + d/2] of width at most δ, adjusted to account for residual correlation

Algorithm 6 goodnessOfFitTestsAndCI
Require: µ, b, minVar, α
 1: µ′ ← (µ_{b+1}, ..., µ_B)
 2: (µ̄, s²) ← computeMeanAndVariance(µ′)
 3: (a, ρ, d, n) ← (0, 0, 0, B − b)
 4: if s² > minVar then
 5:     a ← AndersonDarlingNormalityTest(µ′, µ̄, s²)
 6:     ρ ← lag1Autocorrelation(µ′, µ̄, s²)
 7:     d ← computeCIWidth(µ̄, s², n, α, ρ)
 8: end if
 9: return (a, ρ, µ̄, d)

Figure 3: BM-based algorithms for estimating the initial warmup period (left), and for studying steady state properties (right).
128 and bs = B adjacentnon-overlapping batches , each containing bs steps. After this, the array µ stores the mean of each batch (therefore thename batch means ): each entry µ [ i ] stores the corresponding B i (see Figure 2). The algorithm then proceeds iterativelyby performing statistical tests to check whether m is large enough to cover the warmup period, doubling the numberof performed steps while keeping the number of batches fixed (doubling the steps bs in each batch) until all tests arepassed. The key point is that if the process satisfies properties required for steady state analysis (Law and Carson(1979); Steiger and Wilson (2001); see also Section 4), then such iterative procedure will lead to approximately IIDnormally distributed batch means µ for a su ffi ciently large value of bs .BM-based approaches perform di ff erent statistical tests on µ to check whether m is large enough for completing thewarmup period: Tafazzoli et al. (2011) use the von Neumann (1941) randomness test, while Steiger et al. (2005) use atest for stationary multivariate normality on groups of 4 consecutive batches followed by a check for low correlationamong consecutive batch means (i.e., the lag-1 autocorrelation of µ ). Gilmore et al. (2017) apply the Anderson-Darling test for normality on µ , followed by a check for low lag-1 autocorrelation of µ . In all cases, a few (typically4) initial batches are ignored as they are likely the most a ff ected ones by the initial transient. We follow the latterapproach, as specified in the subprocedure goodnessOfFitTests of Algorithm 4: Line 1 skips b (by default 4)10nitial batches, obtaining µ (cid:48) , then Line 2 computes the variance of the batch means, used in Line 4 to decide whetherthe statistical tests are necessary or pass by default. The rationale is that if the variance among the batch means isbelow a minimum threshold (parameter minVar with default value 1E-7), then the process is likely converging to adeterministic fixed point, therefore we can safely assume that the intial warmup period has terminated. Concerningnormality, Line 5 uses the Anderson-Darling test implemented in the SSJ library (L’Ecuyer, 2016; L’Ecuyer et al.,2002) to check whether it is statistically plausible that µ (cid:48) has been sampled from a normal distribution specified byits mean and variance, and obtain a p-value a . Line 6 stores ρ , the lag1-autocorrelation of µ (cid:48) . The subprocedure thusreturns a and ρ , which are used in Line 8 of autoWarmup to decide whether the tests are passed – using minimumthresholds a ∗ and ρ ∗ based on prior publications (Gilmore et al., 2017; Steiger et al., 2005) . If any of the testsfail, then an iteration of the while loop in Lines 8-17 is performed to double the number of steps m by doubling thecurrent batch size bs . We note that the current B batch means are squeezed in the first half of µ , (Lines 9-11), and m new steps are performed to create the new batch means in the second half of µ (Lines 13-15). The statistical tests areperformed on the new batch means, and new iterations of the loop are performed until both statistical tests are passes.The algorithm terminates returning the final value of m = B × bs as the estimated end of the warmup period. Algorithm 5 illustrates our automatic BM-based procedure for steady state analysis. Line 1 invokes autoWarmup ,which moves the simulator to the end of the estimated warmup period. No other information from autoWarmup isused. 
The algorithm then proceeds similarly to autoWarmup, the only difference being that we add a third statistical test: we also compute the width d of the CI according to the current batch means. This is obtained by invoking goodnessOfFitTestsAndCI from Algorithm 6 rather than goodnessOfFitTests. Since the tests for normality and absence of correlation were already passed during autoWarmup, one might expect that they are no longer necessary. We note, however, that the tests passed for the last value of bs used in autoWarmup, so they could potentially fail for initial small values of bs. When all statistical tests are passed, we return the computed (1−α)·100% CI of width at most δ, adjusted by the computed residual correlation among the batch means. This is done similarly to Steiger et al. (2005), using an inverse Cornish-Fisher expansion (Stuart and Ord, 1994) based on a standard normal density; see Section 2 of Steiger et al. (2005) for the exact formula.

autoBM vs autoRD. We note that both autoBM and autoRD proceed by iterations, during which new samples are drawn and new statistical tests are performed. In autoBM, new simulation steps are added onto the same long simulation (the number of simulations n is constant, the time horizon m grows). In autoRD, new simulations of fixed length are added (n grows, m is constant). In some sense, the computational burden of autoRD is higher, as w steps from each newly performed simulation are ignored. However, the simulations performed in each iteration of autoRD can be trivially parallelized, so the additional computation can be efficiently handled. Rather, which approach to prefer depends on the model at hand and on the available hardware:

• The longer the initial warmup period, the more advantageous autoBM is relative to autoRD.
• The larger the degree of parallelism supported by the hardware, the more advantageous autoRD is relative to autoBM.
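The doubling mechanics shared by autoWarmup and autoBM can be sketched as follows. This is a simplified illustration (toy step() process, SciPy's Anderson-Darling critical value at the 1% level, a fixed ρ∗), not the exact tests and thresholds used in MultiVeStA.

    import numpy as np
    from scipy import stats

    def batch_mean(step, bs):
        # One batch: bs consecutive simulation steps, averaged.
        return np.mean([step() for _ in range(bs)])

    def auto_warmup(step, B=128, b=4, bs=16, rho_star=0.2):
        mu = np.array([batch_mean(step, bs) for _ in range(B)])
        while True:
            x = mu[b:] - mu[b:].mean()                   # skip first b batches
            rho = (x[:-1] * x[1:]).sum() / (x * x).sum() # lag-1 autocorrelation
            ad = stats.anderson(mu[b:], dist='norm')
            if ad.statistic <= ad.critical_values[-1] and abs(rho) <= rho_star:
                return B * bs                            # estimated warmup end
            mu = (mu[0::2] + mu[1::2]) / 2               # collapse means pairwise
            bs *= 2                                      # ...and double batch size
            mu = np.append(mu, [batch_mean(step, bs) for _ in range(B // 2)])

    rng = np.random.default_rng(3)
    state = [10.0]                                       # start far from steady state
    def step():
        state[0] = 0.9 * state[0] + rng.normal()
        return state[0]

    print("warmup estimated to end by step", auto_warmup(step))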
Figure 4: Procedure for ergodicity diagnostics based on autoRD and autoBM to assess the reliability of a steady state analysis.

The ABM community favours the RD approach due to its simplicity, but its trivially parallelizable nature is not always exploited due to limited computer engineering skills. Notably, some studies have shown that the BM approach might provide more accurate results in specific cases (Whitt, 1991; Alexopoulos and Goldsman, 2004). The implementation of both algorithms in MultiVeStA enables modelers to freely choose between the two and to exploit the distributed nature of the tool-box to parallelize simulations. The next section shows how RD and BM can be combined to obtain a methodology for ergodicity diagnostics.
4. Ergodicity diagnostics and detection of multiple stationary points using autoRD and autoBM
Consistency and unbiasedness of the estimates produced by autoBM and autoRD rely on the underlying process possessing the strong mixing property, which guarantees that two sufficiently distant observations in $(Y_t)_{t>0}$ are approximately independent (there are various definitions of the property; we utilize the φ-mixing definition provided in Steiger and Wilson, 2001). Indeed, once normality of batch means in autoWarmup is well approximated and autocorrelation is low, we can be confident that future observations will not have initialization bias (Steiger et al., 2005). If at a certain point in time the batch means resemble a sample of IID observations from the same Gaussian population, then the non-random effects of initial conditions must have disappeared. Further, the strong mixing assumption ensures that such a point in time can eventually be reached by increasing the batch size (Law and Carson, 1979; Steiger and Wilson, 2001). Here we describe a procedure that combines autoRD and autoBM to assess whether this assumption is met. The procedure, depicted in Figure 4, is fully implemented in MultiVeStA and is validated in Section 8 on variants of a prediction market model.

We start by performing both autoBM and autoRD for given α and δ (step 1). If any of the two fails to converge in due time (step 2), we have evidence that the process is eventually non-stationary (or fails to reach its stationary phase within the allotted computational time/resources). In such cases, performing any steady state analysis could be misleading and should be avoided (step 3).

If both autoBM and autoRD successfully terminate, we can be confident that the process possesses ergodic properties (Gray, 2009; Billingsley, 1995) – meaning, intuitively, that the horizontal means of its realisations (i.e., the means across simulations as in Fig. 1(c)) indeed converge asymptotically to a finite number. However, there could be potentially different limits for different simulations. In these cases, a natural check for ergodicity is to compare the results of autoBM and autoRD (step 4). This is in line with previous approaches to ergodicity analysis from the literature (e.g., Grazzini, 2012), where BM-like means across one long simulation are compared with RD-like means across several simulations. The difference is that our BM and RD results are obtained using automated algorithms
Indeed, under the null hypothesisthat the process is strongly mixing and that the initial warmup phase has been e ff ectively discarded, the central limittheorem for weakly correlated variables states that the horizontal means should be approximately normally distributed.In particular, we perform an Anderson-Darling normality test (with significance level 1%) on the sample of horizontalmeans. If the null hypothesis is not rejected we again have no evidence of violation of the strong mixing assumption,and we therefore return the values computed by either of the two algorithms (step 7).
5. Operationalizing the framework: Statistical Model Checking and MultiVeStA
This section discusses how we operationalise our approach. In particular, we frame our approach to ABM analysis in the context of Statistical Model Checking and show how we integrate it into MultiVeStA, a model-agnostic statistical model checker that can be integrated with existing simulators.
Statistical Model Checking (SMC) (Agha and Palmskog, 2018; Legay et al., 2019) is a successful simulation-based verification approach from computer science. SMC allows one to study quantitative properties of large-scale models through completely automated analysis procedures equipped with statistical guarantees. Following the principle of separation of concerns, the idea is to offer a simple external language to express properties of interest that can be queried on the model using predefined analysis procedures. The goal of SMC is therefore to offer a one-click-analysis experience to the modeler, who is freed from the burden of modifying the model to generate large CSV files every time a new analysis is required, and of then analyzing such CSV files in an error-prone semi-automated manner. This guarantees that the analysis procedures are written once and then extensively tested, decreasing the possibility of errors. Making a parallel with databases, we do not have to explicitly manipulate the internal representation of the data every time a new query is needed; rather, we define the data to be selected using compact languages (e.g., SQL).

Several statistical model checkers exist, most of which require implementing models in proprietary languages. We consider black-box SMC (Sen et al., 2004; Younes, 2005), where the idea is to offer a model-independent analysis framework that can be easily attached to existing simulation models, effectively enriching them with automated statistical analysis techniques. In particular, we use MultiVeStA (Sebastio and Vandin, 2013; Gilmore et al., 2017), redesigned and extended here with the techniques from the previous sections, tailored to the ABM community.

MultiVeStA only needs to interact with a simulator by triggering 3 basic actions: (i) reset(seed), to reset the simulator to its "initial state" and update the random seed used to generate pseudo-random numbers (this is necessary to reset the model before performing a new simulation); (ii) next, to perform one step of simulation; (iii) eval(obs), to evaluate an observation in the current simulation state, where an observation (obs) can be any feature of the aggregate model or of any group of agents. A new model can be integrated with MultiVeStA by implementing an adaptor between MultiVeStA and the considered simulator, obtained by instantiating MultiVeStA's (Java) interface. As a consequence, it natively supports Java-based simulators, but it has also been integrated with C- and Python-based simulators, and it has recently been extended to support R-based ones.

For the ABM macro model from Section 6 we are interested in two aggregate features of the model: the number of bankruptcies and the unemployment rate in a given step. Therefore, the model has been integrated such that these can be obtained using eval("bankruptcy") and eval("unemploymentRate"), respectively. Instead, the prediction market models from Sections 7 and 8 have been integrated such that eval(i) gives a particular feature of agent i (its current wealth), and eval("price") gives a certain aggregate feature of the model (the prevailing price).
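To illustrate the integration contract, the sketch below wraps a toy simulator behind the three operations. Class, state, and observation names are illustrative, and a real integration instantiates MultiVeStA's Java adaptor interface rather than this Python stand-in.

    import random

    class ToySimulatorAdaptor:
        def __init__(self):
            self.rng, self.step_count, self.price = None, 0, 1.0

        def reset(self, seed):               # (i) reset state, reseed RNG
            self.rng = random.Random(seed)
            self.step_count, self.price = 0, 1.0

        def next(self):                      # (ii) perform one simulation step
            self.price *= 1.0 + self.rng.gauss(0.0, 0.01)
            self.step_count += 1

        def eval(self, obs):                 # (iii) observe the current state
            if obs == "steps":
                return self.step_count
            if obs == "price":
                return self.price
            raise ValueError(f"unknown observation: {obs}")

    sim = ToySimulatorAdaptor()
    sim.reset(seed=42)
    sim.next()
    print(sim.eval("price"), sim.eval("steps"))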
MultiVeStA offers a powerful and flexible property specification language, MultiQuaTEx, which allows one to express transient and steady state properties, including warmup analysis.

Transient properties. Intuitively, a MultiQuaTEx query might describe a random variable (e.g., the number of bankruptcies in an ABM macro model at a certain point in time during a simulation). Following the discussion in Section 2, the expected value of a MultiQuaTEx query is estimated as the mean $\bar{x}$ of n samples (taken from n simulations), with n large enough (but minimal) to guarantee that the $(1-\alpha)\cdot 100\%$ CI centred at $\bar{x}$ has size at most δ, for given α and δ. MultiQuaTEx actually allows one to express more random variables in one query, all analysed independently reusing the same simulations. Listing 1 depicts a MultiQuaTEx query used in Section 6 to study the evolution of the number of bankruptcies and of the unemployment rate in an ABM macro model.

Coming to the structure of a MultiQuaTEx query, it contains a list of parametric operators that can be used in an eval autoIR command to specify the properties to be estimated. Lines 1-4 of Listing 1 define the parametric operator obsAtStep having two parameters, t and obs, respectively the step and observation of interest. Such operator is evaluated, in every simulation, as the value of obs at time point t. Before discussing the body of the operator, we note that Line 5 uses it twice, for the number of bankruptcies and the unemployment rate, for each step from 1 to 400 (with increment 1). Therefore 800 properties will be studied (400 for each observation), all evaluated using the same simulations and each with its own CI. The body of an operator (Lines 1-4) might contain:

1. conditional statements (the if-then-else-fi);
2. real-valued observations on the current simulation state (the s.eval in Line 1 and Line 2);
3. a next operator that triggers the execution of a simulation step (Line 3);
4. recursion, used in Line 3 to evaluate obsAtStep(t,obs) in the next simulation step;
5. arithmetic expressions.

This is general enough to express a wide family of properties varying over time. In the case of Listing 1, we check whether we have reached the step of interest (Line 1), in which case we return the required observation (Line 2). Otherwise, we perform a step of simulation (Line 3), and evaluate recursively the operator in the next simulation state.

    obsAtStep(t,obs) = if (s.eval("steps") == t)
                       then s.eval(obs)
                       else next(obsAtStep(t,obs))
                       fi;
    eval autoIR(E[obsAtStep(t,"bankruptcy")], E[obsAtStep(t,"unemploymentRate")], t, 1, 1, 400);

Listing 1: A transient MultiQuaTEx query

    obs(o) = s.eval(o);
    eval warmup(E[obs(0)], E[obs(1)], E[obs(2)], E[obs("price")]);
    eval autoBM(E[obs(0)], E[obs(1)], E[obs(2)], E[obs("price")]);
    eval autoRD(E[obs(0)], E[obs(1)], E[obs(2)], E[obs("price")]);

Listing 2: A steady state MultiQuaTEx query. Only one of the three eval commands should be used at a time.
Figure 5: MultiVeStA’s client-server architecture enabling parallelization of simulations.
Steady state properties and warmup analysis.
MultiQuaTEx has been extended to support MultiVeStA's steady state and warmup analysis capabilities discussed in Section 3.2. Listing 2 provides a steady state MultiQuaTEx query used in Section 7 to study the average value at steady state of the wealth of three agents (0, 1, and 2), and of the price in our test-bed market selection model. The query is simple, as in this case the operator obs just returns the observation of interest, while Lines 2-4 show how to run the three types of supported analysis. In particular, a steady state query is composed of two parts: a list of next-free operators, and one of the three eval commands in Listing 2, provided with a list of operators to study.

Intuitively, a steady state MultiQuaTEx query defines observations on single simulation states, implicitly studied at steady state. In particular, warmup performs the warmup estimation procedure (Section 3.2.2) for each of the listed properties. Indeed, every random variable defined on a process might have a different warmup period; we will see examples of this in Section 7. Instead, autoBM performs a warmup estimation on each property, and begins computing the batch means procedure (Section 3.2.3) on each of them as soon as the property completes its warmup period. The command autoRD is similar, but it first completes the warmup analysis for all considered properties, and then feeds this information to the replication and deletion procedure from Section 3.2.1. In all cases, the default values described in Section 3 are used if not otherwise specified by the user when running the analysis.

MultiQuaTEx supports two further eval commands: manualBM and manualRD. These behave the same as autoBM and autoRD, respectively, but skip the warmup analysis phase and require as input an estimate of the warmup period. These might be useful in case one has this information from previous analyses. In Section 7 we use them to replicate erroneous steady state analyses from the literature based on a wrong estimation of the warmup period.

MultiVeStA has a client-server architecture, as sketched in Figure 5. This is a classic software architecture for distributing tasks across the cores of a machine or the nodes of a network. We distribute the simulations of autoIR and autoRD. In the figure, arrows denote visibility/control/activation of the source component on the target one:

• A user runs the client specifying the model, query, CI parameters, and the parallelism degree N. Transparently to the user, the client triggers, distributes, and handles the necessary simulations, providing the results to the user.
• The client creates N servers, among which it distributes the analysis tasks.
• Each server runs independently, therefore in parallel, the required simulations. Each server creates its own instance of the simulator, and controls it through the adaptor to perform the simulations.
Figure 6: The novel architecture of the MultiVeStA client. [The diagram shows the client's main modules: a Transient Analyser (auto transient analysis), a Steady state Analyser (auto warmup estimation, auto replication & deletion, auto batch means), post-processing facilities (t-test & power, comparison of results, ergodicity diagnosis), a MultiQuaTEx compiler with query editor, output facilities (plotter, CSV generator, visualizer), and placeholders for further features to come.]

As discussed, we extended MultiVeStA with a number of analysis techniques. In particular, we mainly extended the client, where the analysis logic is localized. The new architecture of the client is depicted in Figure 6. It consists of a number of modules, the central ones regarding steady state and transient analysis. Further modules regard: post-processing of analyses computed by MultiVeStA, like t-tests and power computation to compare results obtained for different model configurations (Section 3.1.2), or the methodology for ergodicity analysis (Section 4); support for the creation and parsing of MultiQuaTEx queries, offered by a novel compiler for MultiQuaTEx queries; and visualization of the analysis results through a plotter and a CSV file creator.
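The parallelization idea can be sketched in a few lines: independent replications, each with its own seed and simulator instance, are farmed out to N workers and collected by a coordinating process (run_replication and all parameters are illustrative stand-ins, not MultiVeStA's client-server protocol).

    from concurrent.futures import ProcessPoolExecutor
    import numpy as np

    def run_replication(seed, t=30):
        # Each worker builds its own simulator state from its own seed.
        rng = np.random.default_rng(seed)
        y = 10.0
        for _ in range(t):
            y = 0.9 * y + rng.normal()
        return y

    if __name__ == "__main__":
        N, n_sims = 4, 200                  # parallelism degree, replications
        with ProcessPoolExecutor(max_workers=N) as pool:
            results = list(pool.map(run_replication, range(n_sims)))
        print(np.mean(results))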
6. Application: Transient analysis of a large macro ABM
We apply the transient analysis from Section 3.1 to the large-scale macro-financial ABM of Caiani et al. (2016).
The model has been developed to bridge the stock flow consistent approach (SFC; Godley and Lavoie, 2006) with the macroeconomic agent-based literature (see, e.g., Delli Gatti et al., 2005; Cincotti et al., 2010; Dosi et al., 2010; Dawid et al., 2012; Popoyan et al., 2020); a rather detailed overview of the macro ABM literature can be found in Fagiolo and Roventini (2017) and Dosi and Roventini (2019). It depicts an economy composed of households selling their labor to firms in exchange for wages, consuming, and saving into deposits at (commercial) banks. Households own shares of firms and banks in proportion to their wealth, and receive a share of firms' and banks' profits as dividends; they also pay taxes as set by the Government, which runs fiscal policy. There are two categories of firms. Consumption firms produce a homogeneous good using labor and the capital goods manufactured by the other class of firms, capital firms. Firms may apply for loans in order to finance production and investment. Retained profits enter the financial system as banks' deposits. Banks provide credit to firms, buy bonds issued by the Government, and need to satisfy mandatory capital and liquidity ratios. Finally, a Central Bank holds banks' reserve accounts and the government account, accommodates banks' demand for cash advances at a fixed discount rate, and possibly buys government bonds that have not been purchased by banks.

The labour market is composed of workers, firms, and the public sector. Firms in the capital good sector (indexed by k) demand workers based on their desired level of production $y^D_{kt}$ and the productivity of labor ($\mu_N$), which is assumed to be constant and exogenous:

$$N^D_{kt} = \frac{y^D_{kt}}{\mu_N}. \qquad (4)$$

Differently, the request for workers by consumption good firms (indexed by c) is given by

$$N^D_{ct} = u^D_{ct}\,\kappa_{ct}\,l_k, \qquad (5)$$

where $\kappa_{ct}$ is the capital stock, $l_k$ is a constant expressing the capital-to-labor ratio, and $u^D_{ct}$ is the utilization capacity needed to obtain the desired production. Workers can be fired under two circumstances: workers in excess of production needs are randomly sampled from the pool of firm employees and fired, and workers can lose their job because of an exogenous positive employee turnover (a fixed share of workers is fired in every period). Finally, a constant share of households is employed by the public sector, and public servants are also subject to an exogenous turnover.

After having planned production, firms and the government interact with unemployed households on the labor market. Workers follow an adaptive heuristic to set the wage they ask for: if over the year (i.e., four periods) they have been unemployed for more than two quarters, they lower the asked wage by a stochastic amount. In the opposite case, they increase their asked wage. The share of workers that is not employed at the end of each session of interaction in the labour market represents the prevailing unemployment rate.

After production, firms sell their products and need to compensate for the inputs they received. Firms may default when they run out of liquidity to pay wages or to honour the debt service. Defaulted firms are bailed-in by households (who are the owners of firms and banks and receive dividends) and depositors, as the authors seek to maintain the number of firms constant. Hence, the bankruptcy rate emerges as the ratio between defaulted firms before the bailing-in event and the total number of firms in the economy.
As the defaulted firms create non-performing loans that might trigger vicious cycles and, ultimately, a financial crisis, they offer key information on the turbulence and riskiness of the business cycle.

autoIR: automatic computation of confidence intervals

The model is run in the baseline configuration considered in Section 5.1 of Caiani et al. (2016). The artificial time series show that the model first experiences a sequence of expansionary and recessionary regimes, and then converges, in most cases, to a relatively stable behaviour where aggregate variables (including the unemployment and the bankruptcy rates) fluctuate around particular values, and nominal aggregates grow at similar rates. Our focus is centred on the first part of such process.

As a first exercise, we reproduce the behaviour of the economic indicators we selected in the first 400 steps of the simulation, and construct CIs around their mean according to Equation (1) (Figure 7). Of course, all variables present in the original model could be analysed using the very same procedure; we selected two for illustrative purposes.
[Figure 7 about here: unemploymentRate with 97.5% CI (left) and bankruptcy with 97.5% CI (right) over the first 400 steps.]

Figure 7: Unemployment rate and bankruptcy over time. The dashed lines are the computed 97.5% CIs of width δ at most 0.005 and 0.5, respectively.

In particular, we choose α = 0.025 and set δ at the maximal allowed width of the confidence intervals around our central estimates (i.e., δ_U = 0.005 for the unemployment rate and δ_B = 0.5 for bankruptcies), letting autoIR automatically decide the number of simulations needed to obtain the desired confidence intervals. We stress that MultiVeStA automatically determines the number of runs required to obtain the desired CI for each point in time and for whatever variable of interest. As shown in the top of Figure 8, this required at most 378 simulations for both properties.

As a proof of concept for our approach, the inspection of Figure 7 confirms that our algorithms do not modify the model and deliver the same dynamics (see Figure 2 of Caiani et al., 2016). However, Caiani et al. (2016) performed 100 simulations for all the 400 time steps, without providing information on the obtained confidence intervals. We highlight that the artificial time series we generate are only loosely comparable to Figure 2 of Caiani et al. (2016): Caiani et al. apply a bandpass filter over their series and show only the emerging trend component, while we show the "raw" series that the model generates. We note that the latter is the prevailing practice in the literature (see, for example, the models reviewed in Fagiolo and Roventini, 2017).

The ability to specify the precision of the confidence intervals comes with a number of advantages. First, it is a flexible requirement that can be expressed either in absolute or relative terms (see Section 3), leaving the chance to statistically compare the expected behaviour of the model to a certain target (say, an employment rate not higher than 5% or an inflation rate of 2%) or to its mean (e.g., allowing one to compute for each period the probability of observing bankruptcy rates 10% higher than the average). Second, and more relevantly, it allows evaluating the robustness (and the uncertainty) of the dynamics simulated by the model. In particular, Figure 8 shows how the widths of the confidence intervals vary, for each time point and property, across the simulation span for various numbers of simulations. The top of the figure shows the intermediate CI widths obtained after every iteration of the blocks of simulations performed by MultiVeStA (see the discussion in Section 3.1.1; we use bl = 42), demonstrating that the minimum number of simulations is computed for the given δ. Instead, the setting used by Caiani et al. (2016) might lead to CIs of different widths which follow the trend of the computed means. This is particularly evident for the case of firms' bankruptcies, cf. Figure 8 (bottom-right), while the same does not happen for unemployment rates (bottom-left of the figure).
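To fix ideas, the block-wise scheme that autoIR automates can be sketched as follows. This is an illustrative re-implementation, not MultiVeStA's code: run_simulation is an assumed function returning the value of the property of interest at step t for a fresh run (in practice, each run's trajectory is reused for all time points), and the CI is a normal-quantile approximation in the spirit of Equation (1).

import numpy as np
from scipy import stats

def autoir_ci(run_simulation, t, delta, alpha=0.025, bl=42, max_sims=10_000):
    """Add blocks of bl simulations until the two-sided (1 - alpha) CI
    for the mean of the property at step t is narrower than delta."""
    samples = []
    while len(samples) < max_sims:
        base = len(samples)
        samples.extend(run_simulation(seed=base + i, steps=t) for i in range(bl))
        x = np.asarray(samples, float)
        half = stats.norm.ppf(1 - alpha / 2) * x.std(ddof=1) / np.sqrt(len(x))
        if 2 * half <= delta:          # CI width small enough: stop
            break
    return x.mean(), (x.mean() - half, x.mean() + half), len(x)

The loop stops at the smallest multiple of bl for which the CI meets the requested width, which is precisely why different properties and time points end up using different numbers of runs.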
[Figure 8 about here: panels "Unemployment rate" and "Bankruptcies". Top row: CIs with 42, 84, 126, 168, 210, 252, 294, 336, and 378 simulations against the required widths of 0.005 and 0.5. Bottom row: CIs with 100 simulations.]
Figure 8: Top: intermediate widths of the 97.5% CIs for bankruptcy and unemployment rate computed by MultiVeStA. Bottom: comparison among CI widths computed by MultiVeStA and by setting 100 simulations for all properties and time points as in Caiani et al. (2016) (obtained using autoIR and setting both bl and the maximum number of simulations performed to 100).

The figure suggests that each property and time point should be studied using its own best number of simulations, confirming that a trade-off between an insufficient and an excessively large number of simulations exists. When the number is too low, the across-runs variability might not be adequately washed out, and the representation of (stochastic) uncertainty could depend on the level of the relevant variable; conversely, when the number of simulations is too large, simulations are redundant and the same representation of uncertainty can be effectively offered while saving computational time. Finally, in line with Secchi and Seri (2017), the right-hand panels of both figures confirm that the arbitrary choice of n = 100 is not necessarily adequate for all properties and time points.

The second exercise we perform uses the confidence intervals previously computed (and the means, variances and numbers of samples returned by MultiVeStA) to automatize the series of tests that identify statistical differences across model configurations discussed in Section 3.1.2. Indeed, one of the most common approaches in the macro ABM literature is to focus on key parameters or mechanisms of interest, often reflecting either behavioural attitudes or policy strength, and to test how the dynamics of the model respond to changes. Just to make a few examples, Dosi et al. (2015) compare a series of rules for monetary and fiscal policy, Lamperti et al. (2020) explore feed-in tariffs and R&D subsidies, and Caiani et al. (2019) extend the model analysed in this section to study various progressive tax schemes and their effects on growth and inequality. The difference across experiments is tested by comparing the value of some statistic of interest (e.g., the growth rate of output), usually averaged over the entire time span, by means of t-tests (e.g., in Dosi et al., 2015; Popoyan et al., 2020). Obviously, the ability of the test to discern across experiments and to validate the counter-factual policy intervention is affected by the choice of n, as an insufficient number of runs is likely to make model configurations (i.e., experiments) difficult to distinguish. Even further, it is not infrequent that statistical tests about differences across experiments are completely missing (e.g., Cincotti et al., 2010; Caiani et al., 2019), which weakens the potential of the paper and the eventual policy recommendations.

Our tool-box provides an automatic series of t-tests across experiments, where the expected value of any variable of interest in any pre-determined set of experiments is tested against a baseline configuration for each step of the transient period. As discussed in Section 3.1.2, tests are run post-mortem and consist in Welch's t-tests (Equation (2)), whose power can be computed with respect to a minimum distance ε between the means that the test is expected to detect (Equation (3)). Figure 9 shows the results in our test-bed macro ABM. Among different possible experiments, we evaluate the effects that changes in the degree of agents' risk aversion (C) produce on the number of bankruptcies and the unemployment rate. A minimal sketch of such a test is given below.
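To illustrate, the Welch test and the associated power computation (Equations (2) and (3)) can be sketched as follows, using only the per-experiment means, variances and sample sizes returned by MultiVeStA. This is an illustrative re-implementation, and the numbers in the example call are made up.

from scipy import stats

def welch_test_and_power(m1, v1, n1, m2, v2, n2, alpha_w=0.025, eps=0.5):
    """Welch's t-test for H0: mu1 == mu2, plus the power of the test
    against an alternative where the means differ by eps.
    m*, v*, n*: sample mean, sample variance, number of runs."""
    se2 = v1 / n1 + v2 / n2                       # squared standard error
    t_stat = (m1 - m2) / se2 ** 0.5               # Welch statistic
    # Welch-Satterthwaite degrees of freedom
    df = se2 ** 2 / ((v1 / n1) ** 2 / (n1 - 1) + (v2 / n2) ** 2 / (n2 - 1))
    p_value = 2 * stats.t.sf(abs(t_stat), df)     # two-sided p-value
    # Power against |mu1 - mu2| = eps, via the noncentral t distribution
    t_crit = stats.t.ppf(1 - alpha_w / 2, df)
    nc = eps / se2 ** 0.5                         # noncentrality parameter
    power = stats.nct.sf(t_crit, df, nc) + stats.nct.cdf(-t_crit, df, nc)
    return t_stat, p_value, power

# illustrative numbers: bankruptcy rates at one time step, two experiments
print(welch_test_and_power(2.1, 4.0, 378, 2.9, 5.0, 378))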
As shown by Caiani and co-authors, when the risk aversion of the agents increases, the economy tends to completely avoid the recession phase experienced in the baseline configuration. While they did not offer a statistical analysis of these differences, our approach automatically embeds it. In particular, we contrast model behaviours across the baseline value of C and a 50% increase of the latter. Figure 9 shows the results of our tests, comparing the set-up of the original analysis (i.e., with n = 100 for all properties and time points, left column) to our approach (n automatically determined for each property and time point, right column). While increasing risk aversion delays the peak of bankruptcy rates, we show that no statistical difference between the two experiments is found except for the central part of the simulation, that is, when the economy first experiences a deep crisis and then recovers (see Figure 2 in Caiani et al., 2016). This is evident in both set-ups and suggests that raising risk aversion modifies the shape of the crisis (smoother surge of bankruptcies and slower decline) but neither its existence nor its duration, which is further confirmed by the behaviour of the unemployment rate (see Figure 10). We also highlight that our approach identifies a statistically significant difference between the two experiments for the slight increase in business insolvencies between periods 200 and 250.

Though using n = 100 or our approach makes little difference in terms of type 1 errors, a key advantage is evident when comparing powers. Indeed, our setup guarantees a much higher power of the tests, thereby dramatically reducing the chance of not rejecting the null hypothesis of equality across experiments when it is actually false (see also Secchi and Seri, 2017). We notice that the same holds for the unemployment rate (see Figure 10), though the discrepancy is less marked. Further, our approach delivers, for given significance α_w and setting ε equal to the δ used for the transient analysis, a good and stable power across the simulation horizon, i.e., above 0.8, which is usually considered an acceptable threshold in the applied statistics literature (see, e.g., Secchi and Seri, 2017; Cohen, 1992; Lehr, 1992). This comes from the fact that for each property and time point MultiVeStA had to run the correct number of samples to obtain a constant CI width embedded in the choice of δ (and from the assumption that the minimum difference we want to detect, ε, is equal to δ). Indeed, it is interesting to note how the t-test for bankruptcies obtained for the setting with 100 simulations (Figure 9, bottom-left) has a low power which appears to decrease specularly to how the corresponding CI width increases in Figure 8 (bottom-right). Hence, we can derive a rule of thumb to support the modeller's choice of the two free parameters in a set-up that compares different experiments: first of all, α_w can be set equal to the α used for the transient analysis, from 5% to 1%, whose extrema are the most common levels of statistical significance in the social sciences. Then, by setting δ in the transient analysis equal to the ε of interest, we expect to obtain t-tests with good power. If this is not the case, one can perform the transient analysis for smaller values of δ while keeping ε constant. In case the maximum budget of simulations originally chosen does not allow one to meet such conditions, a trade-off exists between accepting a higher chance of type II errors (non-detection of false negatives) and the computational resources at the modeller's disposal. However, we stress that while increasing the size of the simulation exercise might come at the expense of computational time, the proposed tool automatically parallelizes simulations (see Section 9), improving the efficiency of our approach.

Figure 9: Bankruptcies under different risk aversions for consumption firms: are they point-wise equal? (a) CI widths for α = 0.025 and N = 100 simulations; (b) CI widths for α = 0.025 and δ = 0.5; (c) t-test: are the means in (a) point-wise equal? (significance α_w = 0.025); (d) t-test: are the means in (b) point-wise equal? (significance α_w = 0.025); (e) power of the t-test in (c) for difference ε = 0.5; (f) power of the t-test in (d) for difference ε = 0.5. Red dots denote initial steps with variances so small as to yield intermediate results below the numerical tolerance of our implementation of the test (1E-15).

Figure 10: Unemployment rates under different risk aversions for consumption firms: are they point-wise equal? (a) CI widths for α = 0.025 and N = 100 simulations; (b) CI widths for α = 0.025 and δ = 0.005; (c) t-test: are the means in (a) point-wise equal? (significance α_w = 0.025); (d) t-test: are the means in (b) point-wise equal? (significance α_w = 0.025); (e) power of the t-test in (c) for difference ε = 0.005; (f) power of the t-test in (d) for difference ε = 0.005. Red dots denote initial steps with variances so small as to yield intermediate results below the numerical tolerance of our implementation of the test (1E-15).

Finally, we remark that when using the original code and simulation environment (JMAB) of the Caiani et al. paper alone, it is not possible to perform any form of statistical analysis automatically, as one is required to process the CSV files created by the framework. Our integration of MultiVeStA provides JMAB with the analysis capabilities described so far, encompassing both the transient and the steady state analysis, while leaving the simulation environment unaltered.
7. Application: Steady state analysis in a model of market selection
Here we use MultiVeStA to perform a statistical analysis of the steady state expected value of wealth shares and market price in a simple repeated prediction market model. The model we consider has been extensively studied in the literature (Beygelzimer et al., 2012; Kets et al., 2014; Bottazzi and Giachini, 2017, 2019b) and offers a perfect testbed for our procedures for automated steady state analysis. In particular, its steady state properties have been numerically investigated by Kets et al. (2014) and, later on, studied analytically in Bottazzi and Giachini (2019b), showing that the numerical results of Kets et al. (2014) were inaccurate both qualitatively and quantitatively. As briefly reported by Bottazzi and Giachini (2019b) and as we shall see, the source of inaccuracy can be traced back to the strong autocorrelation and initial condition bias that the process possesses. The post-mortem nature of the numerical analysis carried out by Kets et al. (2014) is unable to properly deal with those issues. Our approach, instead, uses statistical tests and procedures able to manage both the autocorrelation and the initial condition bias in an automated way. Thus, in what follows, we first introduce the model, then we repeat the numerical analyses of Kets et al. (2014) showing how and why the inaccuracies emerge, and finally we use autoRD and autoBM to accurately perform the steady state analyses. In fact, we match the correct analytical results from Bottazzi and Giachini (2019b).

The model consists in a pure exchange economy in discrete time, indexed by $t \in \mathbb{N}$, where N agents repeatedly bet on the occurrence of a binary event. That is, in every t two contracts are available for wagering: the first pays 1 dollar if the event occurs and zero otherwise, while the second pays 1 dollar if the event does not occur and zero otherwise. We model the event by means of a Bernoulli random variable $s_t$, such that $s_t = 1$ if the event at time t has occurred and $s_t = 0$ otherwise, with probability of occurrence $\pi^* \in (0,1)$. Every agent $i \in \{1, 2, \ldots, N\}$ assigns a subjective probability $\pi_i$ to the realization of the event at any time t. Agent i has initial wealth equal to $w_{i,0}$, which at the end of every betting round evolves into $w_{i,t}$ depending on the results of her betting. The total initial wealth in the market is normalized to 1, such that, since wealth is only redistributed by the betting system, it is $\sum_{i=1}^{N} w_{i,t} = 1$ for every t. Hence, one can indifferently refer to $w_{i,t}$ as both the wealth and the wealth share of agent i at time t. Every agent i bets on the occurrence of the event at time t a fraction $\alpha_{i,t}$ of her wealth $w_{i,t-1}$, while $1 - \alpha_{i,t}$ is the fraction bet against the occurrence. As in Kets et al. (2014) and Bottazzi and Giachini (2017, 2019b), we focus on the so-called fractional Kelly rule, that is, for all i and t,

$\alpha_{i,t} = c\,\pi_i + (1 - c)\,p_t,$  (6)

with $c \in (0, 1]$.

Table 1: Parameters used for the prediction market model (N = 3 agents with heterogeneous beliefs $\pi_1, \pi_2, \pi_3$; c = 0.01; equal initial wealth shares $w_{1,0} = w_{2,0} = w_{3,0} = 0.33$).

In every period, both contracts are issued in unitary supply. Hence, calling $p_{1,t}$ and $p_{2,t}$ the price of the first and second contract, respectively, we have for all t

$1 = \sum_{i=1}^{N} \frac{\alpha_{i,t}}{p_{1,t}}\, w_{i,t-1}$ and $1 = \sum_{i=1}^{N} \frac{1-\alpha_{i,t}}{p_{2,t}}\, w_{i,t-1}$.

Since wealth sums up to 1 in every period, one has $p_{1,t} + p_{2,t} = 1$; hence we call $p_{1,t} = p_t$ and $p_{2,t} = 1 - p_t$.
Substituting with Equation (6) and applying simple algebraic manipulations, one obtains

$p_t = \sum_{i=1}^{N} \pi_i\, w_{i,t-1} \quad \forall t.$  (7)

After the market round, the outcome of the binary event is revealed and the wealth of agent i evolves according to

$w_{i,t} = \frac{\alpha_{i,t}}{p_t}\, w_{i,t-1} = \left(1 - c + c\,\frac{\pi_i}{p_t}\right) w_{i,t-1}$ if $s_t = 1$, $\qquad w_{i,t} = \frac{1-\alpha_{i,t}}{1-p_t}\, w_{i,t-1} = \left(1 - c + c\,\frac{1-\pi_i}{1-p_t}\right) w_{i,t-1}$ if $s_t = 0$.  (8)

In this setting, Kets et al. (2014) want to explore the selection dynamics of the model and are particularly interested in the asymptotic (steady state) value of expected wealth shares and price for $c \to 0$. Indeed, they conjecture that in such a limit the steady state expectation of $p_t$ matches $\pi^*$ and use the case c = 0.01 as a proxy.
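For concreteness, the dynamics of Equations (6)-(8) can be reproduced in a few lines of code. This is an illustrative sketch, not the implementation analysed in the paper; in particular, the belief values in the example call are placeholders.

import numpy as np

def simulate_market(pi, w0, pi_star, c=0.01, T=100_000, seed=0):
    """Simulate the fractional Kelly prediction market of Eqs. (6)-(8).
    pi: agents' beliefs; w0: initial wealth shares (summing to 1);
    pi_star: true probability of the event. Returns price and wealth paths."""
    rng = np.random.default_rng(seed)
    pi, w = np.asarray(pi, float), np.asarray(w0, float).copy()
    prices, wealth = np.empty(T), np.empty((T, len(w)))
    for t in range(T):
        p = pi @ w                         # Eq. (7): price from lagged wealth
        if rng.random() < pi_star:         # Bernoulli event s_t = 1
            w *= 1 - c + c * pi / p        # Eq. (8), winning branch
        else:                              # s_t = 0
            w *= 1 - c + c * (1 - pi) / (1 - p)
        prices[t], wealth[t] = p, w
    return prices, wealth

# placeholder beliefs; equal initial wealth as in Table 1
prices, wealth = simulate_market(pi=[0.2, 0.5, 0.8], w0=[1/3, 1/3, 1/3],
                                 pi_star=0.6, T=10_000)

Note that the update in Equation (8) preserves the normalization of wealth shares by construction, so no explicit re-normalization is needed.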
In Section 7.2, we replicate exactly the analysis of Kets et al. (2014), reproducing their Figures 3(c) and 3(d). Thus, we follow the procedure proposed in Kets et al. (2014) to estimate steady state expected wealth shares and price for several values of π* under the parametrization in Table 1. In doing that, we highlight some issues related to initial condition bias and strong autocorrelation. Indeed, the warmup period appears not to be correctly determined, and the autocorrelation within the observations of each performed simulation is not correctly accounted for. This is due to the procedure for computing CIs used in Kets et al. (2014), and it has the result of producing extremely wide CIs.

After this, in Section 7.3, we perform the steady state analyses using our approach. We show that, thanks to the provided automatic procedures, our estimates (and the corresponding confidence intervals) of steady state expected wealth shares and price are correctly determined. Our conclusions differ not just quantitatively, but also qualitatively from those in Kets et al. (2014), and match those from Bottazzi and Giachini (2019b). Overall, our analysis shows the importance of using an automated procedure equipped with statistical guarantees.

7.2. manualRD using the original (wrong) warmup estimation
As discussed in Section 5.3, by using manualRD MultiVeStA allows one to manually set an a-priori estimate of the warm-up period. MultiVeStA also allows one to fix the maximum number of simulations used in an analysis based on independent replications. In general, it is always advisable not to fix such parameters a-priori, but to use the offered automated procedures, so as to avoid biases in the estimates and excessively large CIs. In this section we exemplify these issues by fixing a priori the erroneous warmup estimate and number of simulations used in Kets et al. (2014), and we discuss the problems this introduced in the obtained results. Notice that the mathematical specification of the model may appear different from the one presented in Kets et al. (2014); however, as explained in Bottazzi and Giachini (2019b), the two specifications are indeed equivalent.

Figure 11: Steady state analysis of expected agents' wealth and market price according to the manual warm-up and simulation length settings of Kets et al. (2014). We obtain these results by using manualRD, setting the end of the warmup periods to 90 000 and the time horizons to 100 000. By setting both bl (the number of simulations in a block) and the maximum number of simulations performed to 1 000, we use precisely 1 000 simulations to estimate each property, perfectly matching the setting used in Kets et al. (2014). We consider 39 equally spaced values for π*, from 0.025 to 0.975.

Kets et al. (2014) performed an RD-based steady state analysis of the agents' wealth ($w_{1,t}$, $w_{2,t}$, $w_{3,t}$) and of the market price $p_t$ using the parametrization of Table 1. The authors arbitrarily estimated the end of the warmup period at 90 000 steps, and fixed the time horizon of each simulation to 100 000 steps and the number of performed simulations to 1 000. This means that estimates were computed by averaging the last 10 000 observations of each simulation (the horizontal means of Figure 1(c)) and by then further averaging the so-computed means from each simulation (the vertical means). We shall see how this led to estimates highly biased by the initial conditions.

Regarding the computation of confidence intervals, Kets et al. (2014) did not follow the standard approach relying on the central limit theorem used by MultiVeStA. Rather, Kets et al. (2014) considered how the above discussed 1 000 averages built for each simulation distribute. In particular, the 5-th and 95-th percentiles of such distribution are taken as the bounds of confidence intervals with 10% statistical significance. The problem with this approach is that, differently from the approach used by MultiVeStA, it is based on the assumption that each of the considered 1 000 averages has the same distribution as an average across independent replications computed at a time t large enough to have reached steady state. This is not correct because, in addition to the initial condition bias, the process is characterized by strong autocorrelation. We shall discuss how this led to erroneous interpretations of the results. Both CI constructions are sketched below.
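To make the two constructions concrete, the following sketch implements the replication-deletion scheme together with both the percentile-based CI attributed above to Kets et al. (2014) and the CLT-based CI across replication means. Here run is an assumed function returning one full simulated path (e.g., the prices produced by the hypothetical simulate_market above).

import numpy as np
from scipy import stats

def replication_deletion(run, n_reps=1000, horizon=100_000, warmup=90_000):
    """run(seed) -> full path of the property; returns the per-replication
    'horizontal means' over the post-warmup observations."""
    return np.array([run(seed)[warmup:horizon].mean() for seed in range(n_reps)])

def percentile_ci(means):
    # Percentile construction: 5th/95th percentiles of the time averages.
    # It implicitly treats each (autocorrelated, possibly biased) time
    # average as a draw from the steady state distribution itself.
    return np.percentile(means, [5, 95])

def clt_ci(means, alpha=0.025):
    # CLT across independent replication means: two-sided (1 - alpha) CI.
    m = means.mean()
    se = means.std(ddof=1) / np.sqrt(len(means))
    z = stats.norm.ppf(1 - alpha / 2)
    return m - z * se, m + z * se

The CLT-based interval shrinks with the number of replications, whereas the percentile interval reflects the dispersion of the time averages themselves, which is inflated by autocorrelation within each run.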
Agents' wealth. In Figure 11 we report the outcomes of the exercise, replicating those from Figures 3(c) and 3(d) in Kets et al. (2014) and considering model variants for 39 different values of π*. Looking at the left panel, one should conclude that there exist model configurations in which all agents have strictly positive expected wealth share in steady state. This is, however, in contrast with the analytical analysis from Proposition 4.1 of Bottazzi and Giachini (2019b), which proves that no more than two agents can have asymptotically positive wealth share. Thus, the fact that Kets et al. (2014) incorrectly suggest that more than two traders can have positive expected wealth in steady state is an artifact of the initial condition bias that affects their analysis. Notice that the convergence to zero of the wealth share of at least one trader is asymptotic; thus, wealth shares show a bias for any finite t. However, such bias decreases with t and can be made negligible by choosing a sufficiently long warm-up and simulation length. What we observe is that discarding the first 90 000 observations of every run is simply not enough.

Figure 12: Estimates of the steady state difference between the expected price and π* using the settings from Figure 11. We consider 49 equally spaced values for π*, from 0.31 to 0.79. The dashed lines are CIs built on the same samples using two different procedures. Left: CIs erroneously computed (using an external Octave script) according to the procedure in Kets et al. (2014). Right: CIs as computed by MultiVeStA using the approach based on the central limit theorem.
Market price. The right panel of Figure 11 shows the average price and should support one of the main results of Kets et al. (2014): the expected market price matches π* when c = 0.01 and π* is strictly between the lowest and the highest of the agents' beliefs. Bottazzi and Giachini (2019b) suggest that such a conclusion is not correct and that the source of inaccuracy should be found in the way in which CIs are built by Kets et al. (2014). In order to better understand this aspect, we create a new plot in Figure 12 focusing on the difference between the market price and π*. In the left panel of the figure we report the CIs (dashed lines) obtained by applying the procedure of Kets et al. (2014), while in the right panel we show those obtained by MultiVeStA. As one can notice, the procedure of Kets et al. (2014) produces large CIs, backing the claim of the authors. Instead, the CIs obtained by MultiVeStA using the approach based on the central limit theorem are much tighter and disprove the claim of the authors. To understand the source of disagreement between the two approaches, consider the following argument: if the 10 000 observations of p_t from each simulation used to compute the average price of every replication were independent and at steady state, then we would not have spotted any significant difference. Indeed, according to the Central Limit Theorem, each time average would be (approximately) distributed as a normal random variable with mean the steady state expectation of the price and standard deviation the steady state standard deviation of the price over √10 000. The initial condition bias lets the expected time average be different from the steady state expectation. The strong autocorrelation in the price process (Bottazzi and Giachini, 2019b) lets the confidence intervals be too wide. While the manualRD procedure used here can do nothing about the initial condition bias, with respect to confidence intervals MultiVeStA does not assume anything about the distribution of the time averages; it simply relies on the Central Limit Theorem. Indeed, the average across 1 000 independent replications of the 10 000-period time averages is (approximately) distributed as a normal random variable with standard deviation the time averages' standard deviation over √1 000. Fixing the statistical confidence (α) and the confidence interval width (δ) instead of the total number of independent replicas is thus a much more reliable and efficient procedure to test hypotheses on steady state expectations.

Figure 13: Estimates of the steady state difference between the expected price and π* using the settings from Figure 11 for warmup estimation and time horizon. The number of simulations is automatically chosen by MultiVeStA according to the used values of δ, for α = 0.025. We consider 49 equally spaced values for π*, from 0.31 to 0.79. Left: δ = 0.002; Center: δ = 0.001; Right: δ = 0.0005.

Market price for different α-δ. In Figure 13, we test the hypothesis from Kets et al. (2014) that no difference between the average price and π* exists under the parametrization in Table 1. We use again manualRD, keeping the same settings for warmup estimation and time horizon discussed above, while the number of simulations is automatically chosen by MultiVeStA according to different values of δ (0.002, 0.001, and 0.0005), for α = 0.025 (i.e., a statistical confidence of 97.5%). If the hypothesis from Kets et al. (2014) were correct, the difference should almost never be significantly different from zero for any δ considered. Instead, we notice that δ plays an important role in assessing the outcome of the hypothesis testing. Indeed, while with δ = 0.002 the computed CIs for the difference between market price and π* include 0 for almost all π*, with δ = 0.001 the CIs almost never include 0. This confirms the point of Bottazzi and Giachini (2019b) and the results we have obtained in the right panel of Figure 12: the hypothesis of no difference between the average price and π* is generically rejected. Focusing on δ = 0.001, in most cases the hypothesis that no difference between the average price and π* exists can be rejected with fewer than 1 000 simulations (this is typically the case at the extrema; indeed, 0.31 and 0.79 require, respectively, 420 and 480 replicas). In other cases, instead, 1 000 independent replications are not enough, and one may risk getting to the wrong conclusion simply because of an insufficient number of replicas (this may occur for π* around 0.55, where MultiVeStA needs 1 440 simulations to reach the required interval width).

Notice, however, that due to the arbitrary choice of the end of the warmup period, all the estimates are biased by the initial conditions. We next use MultiVeStA's automated steady state analysis (autoRD and autoBM) to accurately estimate steady state expectations and to finally assess, on the so-obtained results, the hypothesis of Kets et al. (2014).

7.3. autoRD and autoBM using automatic warmup estimation
Now we repeat the exercises using the automated tools for steady state analysis provided by MultiVeStA, starting by estimating expected wealth shares.

Figure 14: Steady state levels of average wealth shares. Left: Replication-Deletion. Right: Batch Means. We set α = 0.025 and consider 39 equally spaced values for π*, from 0.025 to 0.975.

Agents' wealth. Our results are displayed in Figure 14. In the left panel we use the Replication-Deletion approach (autoRD), while on the right we use the Batch Means approach (autoBM). As one can notice: i) our results comply with the theoretical and numerical ones of Bottazzi and Giachini (2019b, cf. Figure 7), and ii) no significant difference can be spotted between the two pictures. Hence, our automated procedures allow one to avoid (or, at least, reduce) the biases generated by initial conditions. As a practical example, let us consider the statistical analysis of steady state expectations for π* = 0.6. The manualRD procedure, using the settings from Section 7.2, estimates strictly positive expected wealth shares for all of agents 1, 2, and 3, whereas autoRD and autoBM estimate the expected wealth share of agent 1 to be 0, consistently with the theory. Although there are no significant differences between the estimates generated by autoRD and autoBM, the time required for producing the results changes. Indeed, the analysis runtime of autoRD (using parallelism degree 3) is about 3 times larger than the one of autoBM. As discussed in Section 3.2.4, cases like this with large warmup periods tend to favour autoBM; a minimal sketch of the batch means idea follows.
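The core of a batch means analysis can be sketched in a few lines. This is a simplified illustration (fixed number of batches, externally supplied warmup), not the actual autoBM algorithm, whose warmup detection and batch sizing are adaptive.

import numpy as np
from scipy import stats

def batch_means_ci(path, warmup, n_batches=32, alpha=0.025):
    """CI for the steady state mean from a single long run: drop the
    warmup, split the rest into contiguous batches, and apply the CLT
    to the (approximately independent) batch means."""
    x = np.asarray(path[warmup:], float)
    usable = len(x) - len(x) % n_batches          # trim to a multiple
    batches = x[:usable].reshape(n_batches, -1).mean(axis=1)
    m = batches.mean()
    se = batches.std(ddof=1) / np.sqrt(n_batches)
    half = stats.t.ppf(1 - alpha / 2, n_batches - 1) * se
    return m - half, m + half  # if wider than delta, extend the run

# e.g., on a long price path from the hypothetical simulate_market above:
# lo, hi = batch_means_ci(prices, warmup=1024)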
Market price. Next, we estimate the steady state expected value of the market price. As we did in Figures 12 and 13, in order to magnify the confidence bands, in Figure 15 we show our estimates of the difference between the expected price and π* for different values of π* ∈ (0.31, 0.79). The left and right panels show the autoRD and autoBM results, respectively. We notice that, for any value of π* considered, the estimated difference between the expected price and π* does not change in a significant manner between the two plots. The emerging expected difference presents a clear pattern: it is larger (in absolute value) when π* is close to the belief of one of the agents. Moreover, the expected difference appears to be negative when π* is closer to the belief of the agent whose belief is, relatively, the smallest (i.e., among the surviving ones), while it tends to be positive when it is the other way round. These features are in line with the results obtained by Bottazzi and Giachini (2019b) in their Figure 4. Hence, the reliability of the steady state analysis performed by MultiVeStA is confirmed. Moreover, we can conclude that, contrary to what Kets et al. (2014) argue, the steady state value of the average price does not generally match π* when c = 0.01.

Figure 15: Steady state levels of average price. Left: Replication-Deletion. Right: Batch Means. We set α = 0.025 and consider 49 equally spaced values for π*, from 0.31 to 0.79; each one of these points requires a separate MultiVeStA query that has been invoked and aggregated with the others by means of an external Octave script.
8. Application: Ergodicity diagnosis in a CRRA prediction market model with noise
We apply our methodology for ergodicity diagnosis to variants of the prediction market model. For all the analyses we set α = 0.05 and use a common maximal CI width δ.

Here we modify the model studied in the previous section to allow for violations of the ergodicity assumption. Following Bottazzi and Giachini (2019a), it is enough to assume that the market is populated by N = 2 agents whose different behavioral assumptions change the betting rules. Indeed, we keep the assumption that agents 1 and 2 have heterogeneous beliefs (π_1 and π_2, respectively, with π_1 < π_2) and we add risk preferences, assuming that the relative risk aversion coefficient of agent i is γ_i > 0, with i = 1, 2.
Thus, we replace Equation (6) with

$\alpha_{1,t} = (1 - b_{1,t})\, p_t$ and $\alpha_{2,t} = (1 - b_{2,t})\, p_t + b_{2,t},$  (9)

where

$b_{1,t} = \frac{(p_t(1-\pi_1))^{1/\gamma_1} - (\pi_1(1-p_t))^{1/\gamma_1}}{(p_t(1-\pi_1))^{1/\gamma_1} + p_t\,\pi_1^{1/\gamma_1}\,(1-p_t)^{(1-\gamma_1)/\gamma_1}}$ and $b_{2,t} = \frac{(\pi_2(1-p_t))^{1/\gamma_2} - (p_t(1-\pi_2))^{1/\gamma_2}}{(\pi_2(1-p_t))^{1/\gamma_2} + (1-p_t)\,(1-\pi_2)^{1/\gamma_2}\,p_t^{(1-\gamma_2)/\gamma_2}}.$  (10)

Bottazzi and Giachini (2019a) show that, depending on the parameter values, in particular γ_1 and γ_2, several long-run selection scenarios are possible. Indeed, one can generically have that: i) one of the two agents has asymptotically unitary wealth share; ii) both agents maintain positive wealth shares asymptotically; iii) path dependent scenarios in which either agent 1 obtains unitary wealth share asymptotically while agent 2 loses everything, or vice-versa. Focusing on the market price p_t, in case i) p_t converges to π_i (with i the dominating agent); in case ii) p_t fluctuates in the interval (π_1, π_2); and in case iii) p_t either converges to π_1 or to π_2, depending on the particular sequence of events realized. Case iii) is the one we are interested in: in such a case the ergodicity assumption is violated. However, the asymptotic convergence of the price to one out of two points makes it quite easy to spot the lack of ergodicity and the presence of the two possible long-run price values. Hence, we complicate the setting by assuming that there exists a third agent in the model who does not trade nor interact in any way with agents 1 and 2. He simply observes the price and reports it. Such report is, however, noisy. Defining ˜p_t the price such external agent reports, we assume ˜p_t = p_t + v_t with v_t = θ v_{t−1} + u_t and u_t a uniformly distributed random variable: u_t ∼ U(−η, η), η > 0.
Such price reports are not taken into account by agents 1 and 2; hence all the properties of p_t deriving from the analysis of Bottazzi and Giachini (2019a) remain unaffected. Moreover, v_t is an autoregressive process of order 1 with zero mean. Hence, assuming |θ| < 1, we have that in the long run ˜p_t fluctuates around either π_1 or π_2, depending on the sequence of realized events. Thus, the lack of ergodicity that ˜p_t shows is somehow "well-behaved". That is, if one isolates the sequences in which p_t converges to a given π_i, one will obtain that the time averages (i.e., horizontal means) of the corresponding observations of ˜p_t, for t large enough, are approximately normally distributed with mean π_i. Hence, we can say that ˜p_t presents two stationary points. At the same time, studying the ergodicity of ˜p_t is much more complicated than performing the same task on p_t.

In what follows, we fix η and consider three scenarios that differ in the persistence parameter θ. In the first one, we consider θ = 0. We refer to it as the "IID noise" scenario, since in the long run the fluctuations of ˜p_t around either π_1 or π_2 are IID. In the second scenario, instead, we consider the opposite case, setting a strictly positive θ so that the noise is strongly autocorrelated; we refer to it as the "AR noise" scenario. The third scenario considers parameters such that the model falls in case ii): long-run survival of both agents. This makes p_t fluctuate in the interval (π_1, π_2) indefinitely. Moreover, we keep the autocorrelated noise, which, while not affecting the ergodic properties of ˜p_t, should make it relatively harder for our methodology to work. We refer to this case as "Ergodic". Table 2 summarizes the parametrization used in our analyses. While the settings for the IID noise and AR noise scenarios ensure the emergence of multiple stationary points, leading to a non-ergodic scenario, the assumptions for the Ergodic scenario guarantee the persistent fluctuation of p_t (Bottazzi and Giachini, 2019a).

Table 2: Parameters used for the prediction market model with CRRA traders and noisy price reporting. The three scenarios (IID noise, AR noise, Ergodic) share N = 2 and π* = 0.45, and differ in the risk aversions and in the noise parameters (η, θ).
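The noisy reporting layer can be reproduced on top of any simulated price path. Below is a minimal sketch; the η value in the example call is a placeholder for illustration, not the paper's calibration.

import numpy as np

def noisy_report(prices, theta, eta, seed=0):
    """Reported price p~_t = p_t + v_t with v_t = theta * v_{t-1} + u_t,
    u_t ~ U(-eta, eta): theta = 0 gives the IID noise scenario,
    0 < theta < 1 the AR noise scenario."""
    rng = np.random.default_rng(seed)
    u = rng.uniform(-eta, eta, size=len(prices))
    v = np.zeros(len(prices))
    for t in range(1, len(prices)):
        v[t] = theta * v[t - 1] + u[t]
    return np.asarray(prices) + v

# e.g. p_tilde = noisy_report(prices, theta=0.0, eta=0.05)  # IID noise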
IID noise. We start our analysis by applying autoBM and autoRD to the IID noise case. The former requires 33 792 steps of simulation. It signals that the warmup ends after the first batch of 1 024 steps, and provides an estimate of the steady state mean. autoRD signals that the warmup ends after 1 032 steps and, after 2 604 independent replications, provides its own estimate of the steady state mean.
[Figure 16 about here: two panels ("Parallelization study" and "Parallelization study 15-60") plotting the runtime over the sequential one against the parallelism degree, for the Prediction Market and MacroABM case studies, together with the optimal speedup.]
Figure 16: Analysis of the runtime speed-ups on a machine with 20 physical cores.

Following the steps of our diagnosis methodology, we performed step 1 and passed the termination check of step 2. After that, step 4 requires comparing the results of autoBM and autoRD. The difference between the two results is larger than δ, suggesting an ergodicity problem. According to our method, we already have an indication of non-ergodicity. However, for illustrative reasons we also performed the Anderson-Darling normality test on the horizontal means computed by autoRD (step 5), obtaining a p-value equal to 6.092E-251, which allows us to reject the null hypothesis that the horizontal means are normally distributed. Hence, our methodology is able to correctly spot that the IID noisy price lacks ergodicity.
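The core of the diagnosis (steps 4 and 5) can be condensed as follows. This is an illustration of the comparison logic, assuming batch-means and replication-deletion estimates are available (e.g., from the sketches above); note that SciPy's Anderson-Darling routine returns critical values rather than the p-values reported in the text, so this sketch rejects by comparing the statistic with the 1% critical value.

import numpy as np
from scipy import stats

def ergodicity_diagnosis(bm_estimate, rd_horizontal_means, delta):
    """Compare a batch-means estimate (one long run) with a
    replication-deletion estimate (many runs): under ergodicity the two
    must agree; disagreement beyond delta signals an ergodicity problem."""
    rd_estimate = rd_horizontal_means.mean()
    if abs(bm_estimate - rd_estimate) > delta:
        return "non-ergodic (estimates disagree)"
    # Complementary check: under ergodicity, the horizontal means should
    # be approximately normal; Anderson-Darling tests this assumption.
    ad = stats.anderson(rd_horizontal_means, dist="norm")
    if ad.statistic > ad.critical_values[-1]:   # reject at the 1% level
        return "non-ergodic (horizontal means not normal)"
    return "no evidence against ergodicity"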
AR noise. We consider now the AR noise case, starting with autoBM. The algorithm estimates the warmup to end after 1 024 steps and returns its estimate of the steady state mean. Using autoRD, we obtain that the warmup is estimated to end after 1 032 steps. The total number of independent replications needed by autoRD to reach the required CI width is 2 709, yielding its own estimate of the steady state mean. The two procedures provide different results for the used δ, suggesting an ergodicity problem (step 4). This is confirmed by the Anderson-Darling normality test of step 5, which computes a p-value small enough to reject the normality of the horizontal means.
Ergodic. Finally, we perform a robustness check on our methodology by applying it to the Ergodic scenario. Using autoBM, the warmup is estimated to end after 1 024 steps, producing an estimate of the steady state mean. Using autoRD, one gets that the warmup is estimated to end after 1 032 steps. The number of independent replications needed for reaching CIs of width δ is 210, and the resulting estimate differs from the autoBM one by less than δ (step 4). The normality test from step 5 cannot reject the null hypothesis of normality, as we get a large p-value.
9. Parallelization study
One of the key issues in the analysis of ABMs in the social sciences concerns computational time; while some approaches have recently proposed to take advantage of machine learning surrogates (Lamperti et al., 2018; van der Hoog, 2019), the most direct approach to speed up simulation is an efficient parallelisation of the experiments. In this section we discuss how MultiVeStA can efficiently and automatically parallelize the various runs. Notably, we demonstrate the potential analysis speedups by showing an analysis that requires about 15 days when performed sequentially, and about 16 hours when parallelizing it on a machine with 20 cores. In particular, we show the actual runtime gains obtained on the analysis of our case studies when using different degrees of parallelism on a machine with 1 CPU Intel Xeon Gold 6252 (20 physical cores) and 94GB of RAM. This machine allows one to perform up to 20 processes in parallel, but hyperthreading further allows for limited speedups also with parallelism degrees higher than 20.

Figure 16 (left) shows the results of our study considering the sequential case and parallelism degrees N multiple of 5 up to 60. Intuitively, in the ideal case an analysis using parallelism degree N should take 1/N of the time required by a sequential analysis (i.e., with N = 1); the "optimal speedup" line depicts this ratio for all considered N. The blue and yellow dots, instead, show the actual speedups obtained for our two case studies. In particular, in order to compare with the optimal speedup, for each value of N we provide the ratio between the runtime obtained with parallelism degree N and the one of the sequential case. For the prediction market model, we consider the autoRD analysis from Figure 14 (left) for π* = 0.45,
while for the macro model we consider the analysis from Figure 7. Notably, the analysis of the macro model took about 15 days when executed sequentially, while it goes down to about 18 hours for N = 20, and 16 hours for N = 25. The analysis failed for higher values of N due to the high memory requirements of the model. Instead, the analysis of the prediction market model requires about 14 minutes when run sequentially and about 50 seconds for N = 20. This analysis could be performed for all considered N, with a minimum runtime of about 38 seconds. The speedups remain close to optimal up to N = 20, while they tend to deteriorate for higher values of N. Figure 16 (right) focuses on the values of N from 15 onwards. We see that the speedups obtained for the macro model tend to be closer to the optimal ones. This is because its simulations are computationally intensive, taking more than 1 hour each. Therefore, the overhead (i.e., the extra computations) introduced by the communications between the MultiVeStA client and servers has almost no impact on the overall runtime. Instead, the prediction market model is not particularly computationally expensive, making the extra communications influence the overall runtime more. In particular, the figure shows that relatively limited speedups are obtained for N greater than 25. This is expected, as discussed. Interestingly, increasing N beyond 40 actually worsens the performance, as the processor is in any case not able to execute more than 20 processes in parallel while the overhead costs increase. A sketch of this embarrassingly parallel scheme is given below.
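While MultiVeStA uses its own Java client-server infrastructure, the underlying idea, distributing independent replications over processes, can be approximated in Python with the standard library; the sketch below reuses the hypothetical simulate_market function introduced earlier (it must be defined at module level so that worker processes can import it).

from multiprocessing import Pool
import numpy as np

def one_replication(seed):
    # each worker runs one independent replication and returns its
    # post-warmup time average (horizontal mean)
    prices, _ = simulate_market(pi=[0.2, 0.5, 0.8], w0=[1/3] * 3,
                                pi_star=0.6, T=100_000, seed=seed)
    return prices[90_000:].mean()

if __name__ == "__main__":
    with Pool(processes=20) as pool:              # parallelism degree N = 20
        means = np.array(pool.map(one_replication, range(1000)))
    print(means.mean(), means.std(ddof=1))

As in the study above, the gains from such a scheme saturate once the number of worker processes exceeds the number of physical cores, while communication overhead keeps growing.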
10. Conclusion
In this article we presented a fully automated framework for the statistical analysis of simulation models and, in particular, agent-based models (ABMs). The framework, implemented through the statistical analyzer MultiVeStA, provides a novel toolkit to the ABM community. These tools range from transient analysis, with statistical tests to compare results for different model configurations, to warmup estimation and the exploration of steady state properties, including a procedure for diagnosing ergodicity, and hence the reliability of any steady state analysis.

Our approach can be easily applied to simulators written in Java, Python, R or C++, and we also added native support in JMAB, a framework for building macro stock-flow consistent ABMs. Our tools allow modellers to automate their explorations, save time, and avoid mistakes originating from semi-automated and error-prone tasks. Importantly, this facilitates the reproducibility of experiments and promotes the use of a minimal set of default analyses that should be performed when proposing or studying a model.

We validated our approach on two models from the literature: a large scale macro-financial ABM and a small scale prediction market ABM (and variants thereof). We obtained new insights on these models, identifying and fixing erroneous results from prior analyses. Our framework also allows one to easily parallelize simulations within the cores of a machine or in a computer network. For instance, we reduced the analysis runtime for the macro ABM from 15 days to 16 hours on a machine with a CPU with 20 cores. Indeed, our toolkit enables modellers to run extensive tests in a unique environment (i.e., without the need of exporting data) while optimizing computational time (which is often precious; see also the discussion in Lamperti et al., 2018).

Our approach is rooted in results from the simulation, computer science, and operations research communities, which we aim to make available to the ABM community. Connecting these communities is critical to leverage the most effective techniques and approaches across fields. For example, the stationarity analysis proposed by Grazzini (2012) mentioned in Section 2 can be viewed as a non-automated version of the batch means approach by Conway (1963) and Law and Carson (1979).

In the near future, we plan to integrate MultiVeStA with other popular platforms used to build and analyse simulation models, including the LSD environment for ABMs (Valente, 2008) and the JAS-mine environment for discrete-event simulations (Richiardi and Richardson, 2017). We see this article as a first step in bringing practices from the statistical model checking (SMC) tool-set to the ABM computational economics community. Of particular interest in this respect are SMC techniques developed to mitigate two classic problems of Monte Carlo methods: dealing with models that present rare events (Legay et al., 2016), and using machine learning techniques to reduce the number of simulations (Bortolussi et al., 2015). Finally, we will expand the family of automated analysis techniques offered in MultiVeStA. For instance, we will extend and refine our ergodicity diagnostics procedure, e.g., tackling the problem of identifying multiple stationary points (assuming they are finitely many) by means of clustering algorithms.
We also plan to further improve our proposals for the analysis of simulation output, e.g., by introducing corrections for multiple testing across the time domain, and to move beyond it, e.g., by considering sensitivity analysis and parameter calibration, which are prominent in the ABM community.

References
Agha, G. and K. Palmskog (2018). A survey of statistical model checking. ACM Trans. Model. Comp. Simul. 28(1), 6:1–6:39.
Alexopoulos, C. and D. Goldsman (2004). To batch or not to batch? ACM Transactions on Modeling and Computer Simulation (TOMACS) 14(1), 76–114.
Alexopoulos, C. and A. F. Seila (1996). Implementing the batch means method in simulation experiments. In Proceedings of the 28th Conference on Winter Simulation, WSC '96, Washington, DC, USA, pp. 214–221. IEEE Computer Society.
An, G. and U. Wilensky (2009). From artificial life to in silico medicine. In M. Komosinski and A. Adamatzky (Eds.), Artificial Life Models in Software, pp. 183–214. London: Springer London.
Barde, S. (2016). Direct comparison of agent-based models of herding in financial markets. Journal of Economic Dynamics and Control 73, 329–353.
Belzner, L., R. De Nicola, A. Vandin, and M. Wirsing (2014). Reasoning (on) service component ensembles in rewriting logic. In Specification, Algebra, and Software, Volume 8373 of LNCS, pp. 188–211. Springer.
Belzner, L., R. Hennicker, and M. Wirsing (2016). OnPlan: A framework for simulation-based online planning. In Formal Aspects of Component Software, Volume 9539 of LNCS, pp. 1–30. Springer.
Beygelzimer, A., J. Langford, and D. Pennock (2012). Learning performance of prediction markets with Kelly bettors. arXiv preprint arXiv:1201.6655.
Billingsley, P. (1995). Probability and Measure. John Wiley & Sons.
Bortolussi, L., D. Milios, and G. Sanguinetti (2015). Machine learning methods in statistical model checking and system design - tutorial. In E. Bartocci and R. Majumdar (Eds.), Runtime Verification - 6th International Conference, RV 2015, Vienna, Austria, September 22-25, 2015, Proceedings, Volume 9333 of Lecture Notes in Computer Science, pp. 323–341. Springer.
Bottazzi, G. and D. Giachini (2017). Wealth and price distribution by diffusive approximation in a repeated prediction market. Physica A: Statistical Mechanics and its Applications 471, 473–479.
Bottazzi, G. and D. Giachini (2019a). Betting, selection, and luck: A long-run analysis of repeated betting markets. Entropy 21(6), 585.
Bottazzi, G. and D. Giachini (2019b). Far from the madding crowd: Collective wisdom in prediction markets. Quantitative Finance 19(9), 1461–1471.
Brown, D. G., S. Page, R. Riolo, M. Zellner, and W. Rand (2005). Path dependence and the validation of agent-based spatial models of land use. International Journal of Geographical Information Science 19(2), 153–174.
Caiani, A., A. Godin, E. Caverzasi, M. Gallegati, S. Kinsella, and J. E. Stiglitz (2016). Agent based-stock flow consistent macroeconomics: Towards a benchmark model. Journal of Economic Dynamics and Control 69, 375–408.
Caiani, A., A. Russo, and M. Gallegati (2019). Does inequality hamper innovation and growth? An AB-SFC analysis. Journal of Evolutionary Economics 29(1), 177–228.
Carley, K. M., D. B. Fridsma, E. Casman, A. Yahja, N. Altman, L.-C. Chen, B. Kaminsky, and D. Nave (2006). BioWar: scalable agent-based model of bioattacks. IEEE Transactions on Systems, Man, and Cybernetics-Part A: Systems and Humans 36(2), 252–265.
Chow, S.-C., J. Shao, and H. Wang (2002). A note on sample size calculation for mean comparisons based on noncentral t-statistics. Journal of Biopharmaceutical Statistics 12(4), 441–456.
Ciancia, V., D. Latella, M. Massink, R. Paškauskas, and A. Vandin (2016). A tool-chain for statistical spatio-temporal model checking of bike sharing systems. In Leveraging Applications of Formal Methods, Volume 9952 of LNCS, pp. 657–673.
Cincotti, S., M. Raberto, and A. Teglio (2010). Credit money and macroeconomic instability in the agent-based model and simulator Eurace. Economics: The Open-Access, Open-Assessment E-Journal 4.
Cohen, J. (1992). A power primer. Psychological Bulletin 112(1), 155.
Conway, R. W. (1963). Some tactical problems in digital simulation. Management Science 10(1), 47–61.
Dahlke, J., K. Bogner, M. Mueller, T. Berger, A. Pyka, and B. Ebersberger (2020). Is the juice worth the squeeze? Machine learning (ML) in and for agent-based modelling (ABM).
Dawid, H., S. Gemkow, P. Harting, S. Van der Hoog, and M. Neugart (2012). The Eurace@Unibi model: An agent-based macroeconomic model for economic policy analysis.
Delli Gatti, D., C. Di Guilmi, E. Gaffeo, G. Giulioni, M. Gallegati, and A. Palestrini (2005). A new approach to business fluctuations: heterogeneous interacting agents, scaling laws and financial fragility. Journal of Economic Behavior & Organization 56(4), 489–512.
Delli Gatti, D. and J. Grazzini (2020). Rising to the challenge: Bayesian estimation and forecasting techniques for macroeconomic agent based models. Journal of Economic Behavior & Organization 178, 875–902.
Dosi, G., G. Fagiolo, M. Napoletano, A. Roventini, and T. Treibich (2015). Fiscal and monetary policies in complex evolving economies. Journal of Economic Dynamics and Control 52(C), 166–189.
Dosi, G., G. Fagiolo, and A. Roventini (2010). Schumpeter meeting Keynes: A policy-friendly model of endogenous growth and business cycles. Journal of Economic Dynamics and Control 34(9), 1748–1767.
Dosi, G. and A. Roventini (2019). More is different... and complex! The case for agent-based macroeconomics. Journal of Evolutionary Economics 29(1), 1–37.
Dosi, G., A. Roventini, and E. Russo (2019). Endogenous growth and global divergence in a multi-country agent-based model. Journal of Economic Dynamics and Control 101, 101–129.
Effken, J. A., K. M. Carley, J.-S. Lee, B. B. Brewer, and J. A. Verran (2012). Simulating nursing unit performance with OrgAhead: strengths and challenges. Computers, Informatics, Nursing: CIN 30(11), 620.
Fagiolo, G., D. Giachini, and A. Roventini (2020). Innovation, finance, and economic growth: an agent-based approach. Journal of Economic Interaction and Coordination 15(3), 703–736.
Fagiolo, G., M. Guerini, F. Lamperti, A. Moneta, and A. Roventini (2019). Validation of agent-based models in economics and finance. In Computer Simulation Validation, pp. 763–787. Springer.
Fagiolo, G. and A. Roventini (2012). Macroeconomic policy in DSGE and agent-based models. Revue de l'OFCE 124, 67–116.
Fagiolo, G. and A. Roventini (2017). Macroeconomic policy in DSGE and agent-based models redux: New developments and challenges ahead. Journal of Artificial Societies and Social Simulation 20(1).
Franke, R. and F. Westerhoff (2012). Structural stochastic volatility in asset pricing dynamics: Estimation and model contest. Journal of Economic Dynamics and Control 36(8), 1193–1211.
Gibbons, J. D. (1986). Nonparametric Statistical Inference (2nd ed.). Statistics: Textbooks and Monographs, Vol. 65. Marcel Dekker, New York and Basel.
Gilmore, S., D. Reijsbergen, and A. Vandin (2017). Transient and steady-state statistical analysis for discrete event simulators. In Integrated Formal Methods - 13th International Conference, IFM 2017, Turin, Italy, September 20-22, 2017, Proceedings, pp. 145–160.
Gilmore, S., M. Tribastone, and A. Vandin (2014). An analysis pathway for the quantitative evaluation of public transport systems. In Integrated Formal Methods, Volume 8739 of LNCS, pp. 71–86. Springer.
Godley, W. and M. Lavoie (2006). Monetary Economics: An Integrated Approach to Credit, Money, Income, Production and Wealth. Springer.
Gray, R. M. (2009). Probability, Random Processes, and Ergodic Properties, Volume 1. Springer.
Grazzini, J. (2012). Analysis of the emergent properties: Stationarity and ergodicity. Journal of Artificial Societies and Social Simulation 15(2), 7.
Grazzini, J. and M. Richiardi (2015). Estimation of ergodic agent-based models by simulated minimum distance. Journal of Economic Dynamics and Control 51, 148–165.
Grazzini, J., M. G. Richiardi, and M. Tsionas (2017). Bayesian estimation of agent-based models. Journal of Economic Dynamics and Control 77, 26–47.
Grimm, V. and S. F. Railsback (2013). Individual-based Modeling and Ecology. Princeton University Press.
Guerini, M. and A. Moneta (2017). A method for agent-based models validation. Journal of Economic Dynamics and Control 82, 125–141.
Ilachinski, A. (1997). Irreducible semi-autonomous adaptive combat (ISAAC): An artificial-life approach to land warfare. Technical report, DTIC Document.
Kelton, W. D. and A. M. Law (1984). An analytical evaluation of alternative strategies in steady-state simulation. Oper. Res. 32(1), 169–184.
Kets, W., D. M. Pennock, R. Sethi, and N. Shah (2014). Betting strategies, market selection, and the wisdom of crowds. In Twenty-Eighth AAAI Conference on Artificial Intelligence.
Kukacka, J. and L. Kristoufek (2020). Do 'complex' financial models really lead to complex dynamics? Agent-based models and multifractality. Journal of Economic Dynamics and Control 113, 103855.
Lada, E. K., A. C. Mokashi, and J. R. Wilson (2013). ARD: An automated replication-deletion method for simulation analysis. In Proceedings of the 2013 Winter Simulation Conference, pp. 802–813. IEEE.
Lamperti, F. (2018a). Empirical validation of simulated models through the GSL-div: an illustrative application. Journal of Economic Interaction and Coordination 13(1), 143–171.
Lamperti, F. (2018b). An information theoretic criterion for empirical validation of simulation models. Econometrics and Statistics 5, 83–106.
Lamperti, F., V. Bosetti, A. Roventini, and M. Tavoni (2019). The public costs of climate-induced financial instability. Nature Climate Change 9(11), 829–833.
Lamperti, F., G. Dosi, M. Napoletano, A. Roventini, and A. Sapio (2018). Faraway, so close: Coupled climate and economic dynamics in an agent-based integrated assessment model. Ecological Economics 150, 315–339.
Lamperti, F., G. Dosi, M. Napoletano, A. Roventini, and A. Sapio (2020). Climate change and green transitions in an agent-based integrated assessment model. Technological Forecasting and Social Change 153, 119806.
Lamperti, F., A. Roventini, and A. Sani (2018). Agent-based model calibration using machine learning surrogates. Journal of Economic Dynamics and Control 90, 366–389.
Journal of Economic Dynamicsand Control 90 , 366–389.Law, A. M. and J. S. Carson (1979). A sequential procedure for determining the length of a steady-state simulation.
Operations Research 27 (5),1011–1025.Law, A. M. and D. M. Kelton (2015).
Simulation Modeling and Analysis (5th ed.). McGraw-Hill Higher Education, http: // / simulation-book / .L’Ecuyer, P. (2016). SSJ: Stochastic simulation in Java, software library. http://simul.iro.umontreal.ca/ssj/ .L’Ecuyer, P., L. Meliani, and J. Vaucher (2002). SSJ: A framework for stochastic simulation in Java. In E. Y¨ucesan, C.-H. Chen, J. L. Snowdon,and J. M. Charnes (Eds.), Proceedings of the 2002 Winter Simulation Conference , pp. 234–242. IEEE Press.Lee, J. S., T. Filatova, A. Ligmann-Zielinska, B. Hassani-Mahmooei, F. Stonedahl, I. Lorscheid, A. Voinov, G. Polhill, Z. Sun, and D. C. Parker(2015). The complexities of agent-based modeling output analysis.
The journal of artificial societies and social simulation 18 (4).Legay, A., A. Lukina, L. Traonouez, J. Yang, S. A. Smolka, and R. Grosu (2019). Statistical Model Checking. In B. Ste ff en and G. J. Woeginger(Eds.), Computing and Software Science: State of the Art and Perspectives , Volume 10000 of
LNCS , pp. 478–504. Springer.Legay, A., S. Sedwards, and L. Traonouez (2016). Rare events for statistical model checking an overview. In K. G. Larsen, I. Potapov, and J. Srba(Eds.),
Reachability Problems - 10th International Workshop, RP 2016, Aalborg, Denmark, September 19-21, 2016, Proceedings , Volume 9899of
Lecture Notes in Computer Science , pp. 23–35. Springer.Lehr, R. (1992). Sixteen s-squared over d-squared: A relation for crude sample size estimates.
Statistics in medicine 11 (8), 1099–1102.Lux, T. and R. C. Zwinkels (2018). Empirical validation of agent-based models. In
Handbook of computational economics , Volume 4, pp. 437–488.Elsevier.Macy, M. W. and R. Willer (2002). From factors to actors: Computational sociology and agent-based modeling.
Annual review of sociology ,143–166.Mandes, A. and P. Winker (2017). Complexity and model comparison in agent based modeling of financial markets.
Journal of EconomicInteraction and Coordination 12 (3), 469–506.Pianini, D., S. Sebastio, and A. Vandin (2014). Distributed statistical analysis of complex systems modeled through a chemical metaphor. In
Pianini, D., S. Sebastio, and A. Vandin (2014). Distributed statistical analysis of complex systems modeled through a chemical metaphor. In International Conference on High Performance Computing & Simulation, HPCS 2014, Bologna, Italy, 21–25 July 2014, pp. 416–423.
Popoyan, L., M. Napoletano, and A. Roventini (2020). Winter is possibly not coming: Mitigating financial instability in an agent-based model with interbank market. Journal of Economic Dynamics and Control, 103937.
Richiardi, M. G., R. Leombruni, N. J. Saam, and M. Sonnessa (2006). A common protocol for agent-based social simulation. Journal of Artificial Societies and Social Simulation 9.
Richiardi, M. G. and R. E. Richardson (2017). JAS-mine: A new platform for microsimulation and agent-based modelling. International Journal of Microsimulation 10(1), 106–134.
Satterthwaite, F. E. (1946). An approximate distribution of estimates of variance components. Biometrics Bulletin 2(6), 110–114.
Sebastio, S. and A. Vandin (2013). MultiVeStA: Statistical model checking for discrete event simulators. In Proceedings of the 7th International Conference on Performance Evaluation Methodologies and Tools (ValueTools 2013), pp. 310–315.
Secchi, D. and R. Seri (2017). Controlling for false negatives in agent-based models: A review of power analysis in organizational research. Computational and Mathematical Organization Theory 23(1), 94–121.
Sen, K., M. Viswanathan, and G. Agha (2004). Statistical model checking of black-box probabilistic systems. In CAV 2004, pp. 202–215. Springer.
Seri, R. and D. Secchi (2017). How many times should one run a computational simulation? In Simulating Social Complexity, pp. 229–251. Springer.
Steiger, N. M., E. K. Lada, J. R. Wilson, J. A. Joines, C. Alexopoulos, and D. Goldsman (2005). ASAP3: A batch means procedure for steady-state simulation analysis. ACM Transactions on Modeling and Computer Simulation (TOMACS) 15(1), 39–73.
Steiger, N. M. and J. R. Wilson (2001). Convergence properties of the batch means method for simulation output analysis. INFORMS Journal on Computing 13(4), 277–293.
Stuart, A. and J. K. Ord (1994). Kendall's Advanced Theory of Statistics, Volume 1: Distribution Theory (6th ed.).
Tafazzoli, A., J. R. Wilson, E. K. Lada, and N. M. Steiger (2011). Performance of Skart: A skewness- and autoregression-adjusted batch-means procedure for simulation analysis. INFORMS Journal on Computing 23(2), 297–314.
ter Beek, M. H., A. Legay, A. Lluch-Lafuente, and A. Vandin (2020). A framework for quantitative modeling and analysis of highly (re)configurable systems. IEEE Transactions on Software Engineering 46(3), 321–345.
ter Beek, M. H., A. Legay, A. Lluch-Lafuente, and A. Vandin (2015). Quantitative analysis of probabilistic models of software product lines with statistical model checking. In Proceedings of the 6th Workshop on Formal Methods and Analysis in SPL Engineering, FMSPLE@ETAPS 2015, London, UK, 11 April 2015, pp. 56–70.
Tesfatsion, L. and K. L. Judd (2006). Handbook of Computational Economics: Agent-Based Computational Economics. Elsevier.
Valente, M. (2008). Laboratory for simulation development: LSD. Technical report, LEM Working Paper Series.
van der Hoog, S. (2019). Surrogate modelling in (and of) agent-based models: A prospectus. Computational Economics 53(3), 1245–1263.
von Neumann, J. (1941). Distribution of the ratio of the mean square successive difference to the variance. The Annals of Mathematical Statistics 12(4), 367–395.
Wald, A. and J. Wolfowitz (1940). On a test whether two samples are from the same population. The Annals of Mathematical Statistics 11(2), 147–162.
Welch, B. L. (1947). The generalization of 'Student's' problem when several different population variances are involved. Biometrika 34(1/2), 28–35.
Welch, P. D. (1983). The statistical analysis of simulation results. The Computer Performance Modeling Handbook 22, 268–328.
Whitt, W. (1991). The efficiency of one long run versus independent replications in steady-state simulation. Management Science 37(6), 645–666.
Winker, P., M. Gilli, and V. Jeleskovic (2007). An objective function for simulation based inference on exchange rate data. Journal of Economic Interaction and Coordination 2(2), 125–145.
Younes, H. L. (2005). Probabilistic verification for "black-box" systems. In CAV 2005, pp. 253–265. Springer.