Bayesian nonparametric sparse VAR models
Monica Billio †§, Roberto Casarin †, Luca Rossini †‡
† Ca' Foscari University of Venice, Italy; ‡ Free University of Bozen-Bolzano, Italy
Abstract.
High-dimensional vector autoregressive (VAR) models require a large number of parameters to be estimated and may suffer from inferential problems. We propose a new Bayesian nonparametric (BNP) Lasso prior (BNP-Lasso) for high-dimensional VAR models that can improve estimation efficiency and prediction accuracy. Our hierarchical prior overcomes overparametrization and overfitting issues by clustering the VAR coefficients into groups and by shrinking the coefficients of each group toward a common location. The clustering and shrinking effects induced by the BNP-Lasso prior are well suited for the extraction of causal networks from time series, since they account for some stylized facts in real-world networks: sparsity, community structures and heterogeneity in edge intensity. In order to fully capture the richness of the data and to achieve a better understanding of financial and macroeconomic risk, it is therefore crucial that the model used to extract networks accounts for these stylized facts.
Keywords:
Bayesian nonparametrics; Bayesian model selection; Connectedness; Large vector autoregression; Multilayer networks; Network communities; Shrinkage.

∗ The authors are grateful to the Guest Editor and the reviewers for their useful comments, which significantly improved the quality of the paper. We would like to thank all the conference participants for helpful discussions at: "9th RCEA" and "12th RCEA" in Rimini; "Internal Seminar" at Ca' Foscari University; "Statistics Seminar" at University of Kent; "9th CFE" in London; "ISBA 2016" in Sardinia; "3rd BAYSM" at University of Florence; "7th ESOBE" in Venice; "10th CFE" in Seville; "7th ICEEE" in Messina; "Bomopav 2017" in Venice; "Big Data in Predictive Dynamic Econometric Modeling" at University of Pennsylvania; and "11th BNP" in Paris. We benefited greatly from suggestions and discussions with Conception Ausin, Francis Diebold, Lorenzo Frattarolo, Pedro Galeano, Jim Griffin, Mark Jensen, Gary Koop, Dimitris Korobilis, Stefano Tonellato and Mike West. This research used the SCSCF multiprocessor cluster system at University Ca' Foscari. § Corresponding author: [email protected] (M. Billio). Other contacts: [email protected] (R. Casarin) and [email protected] (L. Rossini).

Introduction
In the last decade, high-dimensional models and large datasets have increased their importance in economics (e.g., see Scott and Varian, 2014) and finance. In macroeconomics, some authors investigate the use of large datasets (between 10 and 170 series) to improve forecasts (e.g., see Banbura et al., 2010; Stock and Watson, 2012; Koop, 2013; Carriero et al., 2015; McCracken and Ng, 2016; Kaufmann and Schumacher, 2017), while in finance, large datasets (between 10 and 200 series) have been used to analyse financial crises, contagion effects and their impact on the real economy (e.g., see Brownlees and Engle, 2016; Barigozzi and Brownlees, 2018). Moreover, the level of interaction, or connectedness, between financial institutions represents a powerful tool in monitoring financial stability (e.g., see Diebold and Yilmaz, 2015; Scott, 2016), whereas measuring interdependence in business cycles and in financial markets (Diebold and Yilmaz, 2014; Demirer et al., 2018) is essential in pursuing economic stability. Making inference on dependence structures in large sets of time series and representing them as graphs (or networks) becomes a relevant and challenging issue in financial and macroeconomic risk measurement (e.g., see Billio et al., 2012; Diebold and Yilmaz, 2015, 2016; Bianchi et al., 2018).

For forecasting economic variables and analysing their dynamic properties and causal structure, vector autoregressive (VAR) models have been widely used. In order to avoid overparametrization and overfitting issues, Bayesian inference and suitable classes of prior distributions have been successfully used for VARs (e.g., see Litterman, 1986; Doan et al., 1984; Sims and Zha, 1998) and for panel VARs (Canova and Ciccarelli, 2004). Nevertheless, these prior distributions may not be effective when dealing with very large VAR models. Thus, new prior distributions have been proposed to deal with zero restrictions in VAR parameters. George et al. (2008) introduce Stochastic Search Variable Selection (SSVS) based on a spike-and-slab prior distribution. SSVS has been extended in many directions (e.g., see Korobilis, 2013; Koop and Korobilis, 2016; Korobilis, 2016). Wang (2010) and Ahelgebey et al. (2016a,b) propose graphical VARs relying on graphical prior distributions. Gefang (2014) proposes a Bayesian Lasso based on a doubly adaptive elastic-net prior. We thus propose a new Bayesian nonparametric (BNP) prior distribution for VARs, which can improve both estimation efficiency and prediction accuracy, thanks to the combination of two types of restrictions, called clustering and shrinking effects.

Our BNP Lasso prior distribution (BNP-Lasso) groups the VAR coefficients into clusters and shrinks the coefficients within a cluster toward a common location. The shrinking effects are due to the Lasso-type distribution at the first stage of the hierarchy. (See Tables S.1 and S.2 in the Supplementary Material for a review; see Karlsson (2013) for a comprehensive review.)
Let N be the number of units (e.g., countries, regions or micro studies) in a panel dataset and y_{i,t} = (y_{i,1,t}, \ldots, y_{i,m,t})' a vector of m variables available for the i-th unit, with i = 1, \ldots, N. A panel VAR is defined as the system of regression equations:

y_{i,t} = b_i + \sum_{j=1}^{N} \sum_{l=1}^{p} B_{ijl} y_{j,t-l} + \varepsilon_{i,t},   t = 1, \ldots, T and i = 1, \ldots, N,   (1)

where b_i = (b_{i,1}, \ldots, b_{i,m})' is the vector of constant terms and B_{ijl} the (m x m) matrix of unit- and lag-specific coefficients. We assume that the \varepsilon_{i,t} = (\varepsilon_{i,1,t}, \ldots, \varepsilon_{i,m,t})' are i.i.d. for t = 1, \ldots, T, with Gaussian distribution N_m(0, \Sigma_i), and that Cov(\varepsilon_{i,t}, \varepsilon_{j,t}) = \Sigma_{ij}. Eq. (1) can be written in the more compact form

y_{i,t} = B_i x_t + \varepsilon_{i,t},   i = 1, \ldots, N,   (2)

where B_i = (b_i, B_{i11}, \ldots, B_{i1p}, \ldots, B_{iN1}, \ldots, B_{iNp}) is an (m x (1 + Nmp)) matrix of coefficients and x_t = (1, x'_{1,t}, \ldots, x'_{N,t})' is the vector of lagged variables, with x_{i,t} = (y'_{i,t-1}, \ldots, y'_{i,t-p})' for i = 1, \ldots, N. The system of equations in (2) can be written in the SUR regression form:

y_t = (I_{Nm} \otimes x'_t) \beta + \varepsilon_t,   (3)

where \beta = vec(B), B = (B'_1, \ldots, B'_N)', \varepsilon'_t = (\varepsilon'_{1,t}, \ldots, \varepsilon'_{N,t}), \otimes is the Kronecker product and vec(.) the column-wise vectorization operator that stacks the columns of a matrix into a column vector (Magnus and Neudecker, 1999, pp. 31-32). The number of coefficients in (3) is n = Nm(1 + Nmp), which can be large for high-dimensional panels of time series.

In order to avoid overparameterization, unstable predictions and overfitting problems, we follow a hierarchical specification strategy for the prior distributions (e.g., Canova and Ciccarelli (2004), Kaufmann (2010), Bassetti et al. (2014)). Some classes of hierarchical prior distributions are used to incorporate interdependences across groups of parameters with various degrees of information pooling (see Karlsson, 2013, for a review). Other classes of priors (e.g., MacLehose and Dunson (2010), Wang (2010), Bianchi et al. (2018)) are used to induce sparsity. Our new class of hierarchical priors for the VAR coefficients \beta combines sparsity and information pooling within groups of coefficients.

We assume that \beta can be exogenously partitioned into M blocks, \beta_i = (\beta_{i1}, \ldots, \beta_{i n_i}), i = 1, \ldots, M. In our empirical applications, blocks correspond to the VAR coefficients at different lags. This assumption is not essential for the sparsity and information pooling features of our prior. Nevertheless, we envisage two advantages in using this specification strategy. From a modeling perspective, it adds flexibility to the model, since it allows for different degrees of prior information in the blocks of parameters. For example, lag-specific shrinking effects, as in a Minnesota-type prior, or country-specific shrinking effects in a panel VAR model can be easily introduced in our framework. From a computational perspective, it allows for breaking the posterior simulation procedure into several steps, overcoming the difficulties of dealing with high-dimensional parameter vectors.

For each block, we introduce shrinking effects by using a Bayesian-Lasso prior f_i(\beta_i) = \prod_{j=1}^{n_i} NG(\beta_{ij} | \mu, \gamma, \tau), where

NG(\beta | \mu, \gamma, \tau) = \int_0^{+\infty} N(\beta | \mu, \lambda) Ga(\lambda | \gamma, \tau/2) d\lambda,   (4)

is the normal-gamma distribution with \mu, \gamma and \tau the location, shape and scale parameters, respectively (see Appendix A).
The normal-gamma distribution induces shrinkage toward the prior mean \mu. In the Bayesian Lasso (Park and Casella (2008)), \mu is set equal to zero. We extend the Lasso model by allowing for multiple location, shape and scale parameters, such that f_i(\beta_i) = \prod_{j=1}^{n_i} NG(\beta_{ij} | \mu^*_{ij}, \gamma^*_{ij}, \tau^*_{ij}), and reduce the curse of dimensionality and overfitting by employing a Dirichlet process (DP) as a prior for the normal-gamma parameters. More specifically, let \theta^*_{ij} = (\mu^*_{ij}, \gamma^*_{ij}, \tau^*_{ij}) be the normal-gamma parameter vector; our BNP-Lasso prior for \theta^*_{ij} and \beta_{ij} is

\beta_{ij} | \theta^*_{ij} \sim_{ind} NG(\beta_{ij} | \theta^*_{ij}),   and   \theta^*_{ij} | Q_i \sim_{i.i.d.} Q_i,   (5)

with j = 1, \ldots, n_i, where Q_i is a random measure. Following the strategy in Muller et al. (2004), Pennell and Dunson (2006) and Hatjispyros et al. (2011), we assume Q_i is a convex combination of a common random measure P_0 and a block-specific random measure P_i, that is

Q_1(d\theta_1) = \pi_1 P_0(d\theta_1) + (1 - \pi_1) P_1(d\theta_1),
...
Q_M(d\theta_M) = \pi_M P_0(d\theta_M) + (1 - \pi_M) P_M(d\theta_M).   (6)

The random measure P_0 favours sparsity by shrinking coefficients toward zero, as in the standard Bayesian Lasso, i.e.

P_0(d\theta) \sim \delta_{(0, \gamma_0, \tau_0)}(d(\mu, \gamma, \tau)),   with (\gamma_0, \tau_0) \sim GS(\gamma_0, \tau_0 | \nu_0, p_0, s_0, n_0),   (7)

where \delta_{\theta^*}(\theta) denotes the Dirac measure indicating that the random vector \theta has a degenerate distribution with mass at the location \theta^*, and GS(\gamma, \tau | \nu, p, s, n) is a gamma scale-shape distribution with parameters \nu, p, s and n (see Appendix A). The random measure P_i, i > 0, is a DP prior (DPP), which induces a random partition of the parameter space and a parameter clustering effect (see Hirano (2002), Griffin and Steel (2011), Bassetti et al. (2014)), and shrinks coefficients toward multiple non-zero locations, i.e.

P_i(d\theta) \sim_{i.i.d.} DPP(\tilde\alpha, H),   with H \sim N(\mu | c, d) GS(\gamma, \tau | \nu_1, p_1, s_1, n_1),   (8)

where \tilde\alpha and H are the DPP concentration parameter and base measure, respectively. See Appendix A for a definition of the DPP. For the mixing parameter \pi_i, we assume a Beta distribution, \pi_i \sim_{i.i.d.} Be(\pi_i | 1, \alpha_i).

The amount of shrinkage in P_0 and P_i is determined by the hyperparameters of GS(\nu, p, s, n). In our empirical applications, we assume the hyperparameter values \nu_0 = 30, s_0 = 1/2, p_0 = 0.5 and n_0 = 18 for the sparse component, \nu_1 = 3, s_1 = 1/2, p_1 = 0.5 and n_1 = 10 for the DPP component, and \alpha_i = 1 for the mixing parameter, as in MacLehose and Dunson (2010).

For the variance-covariance matrix \Sigma, we assume a graphical prior distribution as in Carvalho et al. (2007) and Wang (2010), where the zero restrictions on the covariances are induced by a graph G, that is, by an ordered pair of sets (N_G, E_G), where N_G is a vertex set and E_G an edge set. Conditionally on a specified graph G, we assume a Hyper-Inverse Wishart prior distribution for \Sigma, that is:

\Sigma \sim HIW_G(b, L),   (9)

where b and L are the degrees-of-freedom and scale hyperparameters, respectively. See Appendix A for further details on Hyper-Inverse Wishart distributions.

The prior over the graph structure is defined as a product of Bernoulli distributions with parameter \psi, which is the probability of having an edge. That is, an m-node graph G = (N_G, E_G) has prior probability

p(G) \propto \prod_{i,j} \psi^{e_{ij}} (1 - \psi)^{1 - e_{ij}} = \psi^{|E_G|} (1 - \psi)^{\kappa - |E_G|},   (10)

with e_{ij} = 1 if (i, j) \in E_G, where |N_G| and |E_G| are the cardinalities of the vertex and edge sets, respectively, and \kappa = \binom{|N_G|}{2} is the maximum number of edges.
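A draw of the normal-gamma locations under the two-component prior Q_i = \pi_i P_0 + (1 - \pi_i) P_i can be sketched with a truncated stick-breaking representation of the DP; the following is a simplified illustration, assuming a standard-normal base measure for the locations \mu (the function name, truncation level and base measure are assumptions, not the paper's specification):

```python
import numpy as np

def draw_block_prior(n, pi, alpha, rng, trunc=50):
    """One draw of n coefficient locations under
    Q = pi * delta_0 + (1 - pi) * DP(alpha, H), with a truncated
    stick-breaking DP and an illustrative base measure H = N(0, 1)."""
    # truncated stick-breaking weights of the DP component
    v = rng.beta(1.0, alpha, size=trunc)
    w = v * np.concatenate(([1.0], np.cumprod(1.0 - v)[:-1]))
    atoms = rng.normal(0.0, 1.0, size=trunc)        # candidate locations mu_k from H
    spike = rng.random(n) < pi                      # select the sparse component P0
    k = rng.choice(trunc, size=n, p=w / w.sum())    # DP cluster labels for the rest
    return np.where(spike, 0.0, atoms[k])           # location of each coefficient

rng = np.random.default_rng(1)
mu = draw_block_prior(n=200, pi=0.5, alpha=1.0, rng=rng)
```

In a typical draw, about a fraction \pi of the locations sits exactly at zero (sparsity), while the remaining ones concentrate on a handful of shared atoms (clustering), which is the qualitative behaviour induced by Eqs. (6)-(8).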
To induce sparsity, we choose \psi = 2/(m - 1), which provides a prior mode at m edges.

In summary, our hierarchical prior in Eqs. (5)-(10) is represented through the Directed Acyclic Graph (DAG) in Fig. 1. Shadow and empty circles indicate the observable and non-observable random variables, respectively. The directed arrows show the causal dependence structure of the model. The left panel shows the priors for \beta_i and \Sigma, which are the first stage of the hierarchy. The second stage (right panel) involves the sparse dependent DP prior for the shrinking parameters \mu, \gamma and \tau.

Figure 1: DAG of the Bayesian nonparametric model for inference on VARs. It exhibits the hierarchical structure of priors and hyperparameters. The directed arrows show the causal dependence structure of the model. The left panel is related to the first stage of the hierarchy and the right panel to the second stage.

Our prior has an infinite mixture representation with a countably infinite number of clusters, which is one of the appealing features of BNP:

f_i(\beta_i | Q_i) = \sum_{k=0}^{\infty} \check{w}_{ik} NG(\beta_i | \check\theta_{ik}),   (11)

where

\check{w}_{ik} = \pi_i if k = 0, and (1 - \pi_i) w_{ik} if k > 0;   \check\theta_{ik} = (0, \gamma_0, \tau_0) if k = 0, and (\mu_{ik}, \gamma_{ik}, \tau_{ik}) if k > 0.

See Appendix B for a proof. This representation shows that our prior not only shrinks coefficients to zero, as in the Bayesian Lasso (Park and Casella (2008)) or the Elastic-net (Zou and Hastie (2005)), but also induces a probabilistic clustering effect in the parameter set, as in other Bayesian nonparametric models (Hirano (2002) and Bassetti et al. (2014)).

Our prior places no bounds on the number of mixture components and can fit arbitrarily complex distributions. Nevertheless, since the mixture weights w_{ik} decrease exponentially quickly, only a small number of clusters will be used to model the data a priori. The prior number of clusters is a random variable with distribution driven by the concentration parameter \tilde\alpha.
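The prior number of clusters implied by the DP can be illustrated with the equivalent Chinese restaurant process; a minimal simulation under the customary sequential seating scheme (this is a generic illustration, not code from the paper):

```python
import numpy as np

def crp_num_clusters(n, alpha, rng):
    """Number of clusters after seating n items in a Chinese restaurant
    process with concentration alpha (the DP prior on partitions)."""
    counts = []  # current cluster sizes
    for _ in range(n):
        probs = np.array(counts + [alpha], dtype=float)
        probs /= probs.sum()
        k = rng.choice(len(probs), p=probs)
        if k == len(counts):
            counts.append(1)   # open a new cluster with prob. alpha/(alpha+i-1)
        else:
            counts[k] += 1
    return len(counts)

rng = np.random.default_rng(2)
ks = [crp_num_clusters(100, alpha=1.0, rng=rng) for _ in range(300)]
# E[K_n] = sum_{i=1}^{n} alpha/(alpha + i - 1), roughly 5.2 for n = 100, alpha = 1,
# so even 100 coefficients are expected to share only a handful of atoms a priori.
```

This is the sense in which a small \tilde\alpha encourages parsimony: the expected number of clusters grows only logarithmically in the number of coefficients.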
In our applications, we set \tilde\alpha = 1.

Since the posterior distribution is not tractable, the Bayesian estimators cannot be obtained analytically. In this paper, we rely on simulation-based inference methods and develop a Gibbs sampler algorithm for approximating the posterior distribution. For the sake of simplicity and without loss of generality, we shall describe the sampling strategy for the two-block case, i.e. M = 2.

In order to develop a more efficient MCMC procedure, we follow a data augmentation approach. For each block i = 1, 2, we introduce two sets of allocation variables, \xi_{ij}, d_{ij}, j = 1, \ldots, n_i, a set of stick-breaking variables, v_{ik}, k = 1, 2, \ldots, and a set of slice variables, u_{ij}, j = 1, \ldots, n_i. The allocation variable \xi_{ij} selects the sparse component P_0 when \xi_{ij} is equal to zero and the non-sparse component P_i when it is equal to one. The second allocation variable, d_{ij}, selects the component of the Dirichlet mixture P_i to which each single coefficient \beta_{ij} is allocated. The sequence of stick-breaking variables defines the mixture weights, whereas the slice variable u_{ij} allows us to deal with the infinite number of mixture components by identifying a finite number of stick-breaking variables to be sampled and an upper bound for the allocation variables d_{ij}.

We demarginalize the normal-gamma distribution by introducing a latent variable \lambda_{ij} for each \beta_{ij} and obtain the joint posterior distribution

f(\Theta, \Sigma, \Lambda, U, D, V, \Xi | Y) \propto \prod_{t=1}^{T} (2\pi |\Sigma|)^{-1/2} \exp( -\frac{1}{2} (y_t - X'_t \beta)' \Sigma^{-1} (y_t - X'_t \beta) )
  \cdot \prod_{j=1}^{n_1} f_1(\beta_{1j}, \lambda_{1j}, u_{1j}, d_{1j}, \xi_{1j}) \prod_{j=1}^{n_2} f_2(\beta_{2j}, \lambda_{2j}, u_{2j}, d_{2j}, \xi_{2j})   (12)
  \cdot \prod_{k>0} Be(v_{1k} | 1, \alpha_1) Be(v_{2k} | 1, \alpha_2) \cdot HIW_G(b, L) GS(\gamma_0, \tau_0 | \nu_0, p_0, s_0, n_0)
  \cdot \prod_{k>0} N(\mu_{1k} | c, d) GS(\gamma_{1k}, \tau_{1k} | \nu_1, p_1, s_1, n_1) N(\mu_{2k} | c, d) GS(\gamma_{2k}, \tau_{2k} | \nu_1, p_1, s_1, n_1),

where U = {u_{ij} : j = 1, \ldots, n_i and i = 1, 2} and V = {v_{ik} : k = 1, 2, \ldots and i = 1, 2} are the collections of slice variables and stick-breaking components, respectively; D = {d_{ij} : j = 1, \ldots, n_i and i = 1, 2} and \Xi = {\xi_{ij} : j = 1, \ldots, n_i and i = 1, 2} are the allocation variables; \Theta = {(\mu_0, \gamma_0, \tau_0), (\mu_{ik}, \gamma_{ik}, \tau_{ik}) : i = 1, 2, k = 1, 2, \ldots} are the atoms; \pi = (\pi_1, \pi_2) are the block-specific probabilities of shrinking coefficients to zero; and

f_i(\beta_{ij}, \lambda_{ij}, u_{ij}, d_{ij}, \xi_{ij}) = ( I(u_{ij} < \tilde{w}_{d_{ij}}) N(\beta_{ij} | 0, \lambda_{ij}) Ga(\lambda_{ij} | \gamma_0, \tau_0/2) )^{1 - \xi_{ij}}
  \cdot ( I(u_{ij} < w_{i d_{ij}}) N(\beta_{ij} | \mu_{i d_{ij}}, \lambda_{ij}) Ga(\lambda_{ij} | \gamma_{i d_{ij}}, \tau_{i d_{ij}}/2) )^{\xi_{ij}} \pi_i^{1 - \xi_{ij}} (1 - \pi_i)^{\xi_{ij}}.   (13)

See Appendix B for a derivation.

We obtain random samples from the posterior distribution by Gibbs sampling. The Gibbs sampler iterates over the following steps, using the conditional independence between variables as described in Appendix C:

1. The slice and stick-breaking variables U and V are updated given [\Theta, \beta, \Sigma, G, \Lambda, D, \Xi, \pi, Y];
2. The latent scale variables \Lambda are updated given [\Theta, \beta, \Sigma, G, U, V, D, \Xi, \pi, Y];
3. The parameters of the normal-gamma distribution \Theta are updated given [\beta, \Sigma, G, \Lambda, U, V, D, \Xi, \pi, Y];
4. The coefficients \beta of the VAR model are updated given [\Theta, \Sigma, G, \Lambda, U, V, D, \Xi, \pi, Y];
5. The variance-covariance matrix \Sigma is updated given [\Theta, \beta, G, \Lambda, U, V, D, \Xi, \pi, Y];
6. The graph G is updated given [\Theta, \beta, \Sigma, \Lambda, U, V, D, \Xi, \pi, Y];
7.
The allocation variables D and \Xi are updated given [\Theta, \beta, \Sigma, G, \Lambda, U, V, \pi, Y];
8. The probability \pi of shrinking coefficients to zero is updated given [\Theta, \beta, \Sigma, G, \Lambda, U, V, D, \Xi, Y].

Pairwise Granger causality has been used to extract linkages and networks describing relationships between variables of interest, e.g. financial and macroeconomic linkages (Billio et al., 2012; Barigozzi and Brownlees, 2018). The pairwise approach does not consider conditioning on relevant covariates, thus generating spurious causality effects. The conditional Granger approach includes relevant covariates; however, the large number of variables relative to the number of data points can lead to overparametrization and consequently to inefficiency in gauging the causal relationships (e.g., see Ahelgebey et al., 2016a,b). Our BNP-Lasso can be used to extract the network while reducing overfitting and curse-of-dimensionality problems. Also, it allows us to define edge-coloured graphs, which account for various stylized facts recently investigated in financial networks, such as the presence of communities, hubs and linkage heterogeneity.

For the sake of simplicity, we consider one lag, one block of coefficients and one unit, i.e. M = 1, N = 1 and p = 1. We denote with G = (V, E) the directed graph with vertex set V = {1, \ldots, m} and edge set E \subset V x V. A directed edge from i to j is the ordered pair {i, j} \in E with i, j \in V. The adjacency matrix A associated with G has (j, i)-th element

a_{ji} = 1 if {i, j} \in E, and 0 otherwise.

See Newman (2010) for an introduction to graph theory and network analysis and Jackson (2008) for an economic perspective on networks. A weighted graph is defined by the ordered triplet G = (V, E, C), where C is the weight matrix with (j, i)-th element c_{ji} representing the weight associated with the edge {i, j}.
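In this setting, an edge {i, j} corresponds to a non-zero coefficient of y_{i,t-1} in the equation for y_{j,t}, so the adjacency and weight matrices can be read off a VAR(1) coefficient matrix; a minimal sketch (the function name and the small example are illustrative, and the exact-zero test stands in for the posterior edge-selection of the model):

```python
import numpy as np

def var_to_graph(B):
    """Adjacency A and weights C from a VAR(1) coefficient matrix B:
    a_{ji} = 1 iff the coefficient B[j, i] of y_{i,t-1} in the
    equation of y_{j,t} is non-zero, and c_{ji} = B[j, i]."""
    A = (B != 0).astype(int)
    C = np.where(A == 1, B, 0.0)
    return A, C

# 4-variable example: y_1 Granger-causes y_2 and y_3, and y_4 causes y_1
B = np.zeros((4, 4))
B[1, 0] = 0.3   # edge {1, 2}
B[2, 0] = 0.3   # edge {1, 3}
B[0, 3] = 0.1   # edge {4, 1}
A, C = var_to_graph(B)
out_degree = A.sum(axis=0)   # omega^+_i = sum_j a_{ji} (column sums)
```

Here node 1 has out-degree 2 and node 4 has out-degree 1, while the weights in C carry the edge intensities used below.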
When c_{ji} belongs to a countable set, G = (V, E, C) is called an edge-coloured graph.

In time series graphs, causal networks encode the conditional independence structure of a multivariate process (Eichler, 2013). An edge {i, j} \in E exists if and only if the VAR coefficient B_{ji} of the variable y_{i,t-1} in the equation of y_{j,t} is not null. An advantage of using the BNP-Lasso VAR is that the inference procedure provides edge probabilities, edge weights and an edge partition as a natural output, and graph uncertainty can be easily included in the network analysis. More specifically, we assume an edge from i to j exists, i.e. {i, j} \in E, if \hat\xi_{1,g(i,j)} \neq 0, where \hat\xi_{1,g(i,j)} is the Maximum a Posteriori (MAP) estimator of \xi_{1,g(i,j)}. The posterior clustering of the VAR coefficients allows us to define an edge-coloured graph G = (V, E, C). Following a standard procedure in BNP (e.g., see Bassetti et al., 2014), we use the MCMC draws of the allocation variables d_i and d_j to approximate the posterior probabilities of joint classification of an edge, \phi_{ij} = P(d_i = d_j | Y, \xi_i \neq 0, \xi_j \neq 0),

\hat\phi_{ij} = \frac{1}{H} \sum_{h=1}^{H} \delta_{d_i^h}(d_j^h),   (14)

and to find the posterior partition \hat{D}, which minimizes the sum of squared deviations from the joint classification probabilities, i.e.

\hat{D} = \arg\min_{D^h \in \{D^1, \ldots, D^H\}} \sum_{i=1}^{\tilde{m}} \sum_{j=1}^{\tilde{m}} ( \delta_{d_i^h}(d_j^h) - \hat\phi_{ij} )^2.   (15)

The number of distinct elements \hat{K} in the partition \hat{D} provides an estimate of the number of edge intensity levels (colours). The weight c_{ji} indicating the intensity of the edge {i, j} can be computed by using the location atom posterior means \hat\mu^*_k:

c_{ji} = \hat\mu^*_k   if \hat{d}_{1,g(i,j)} = k and \hat\xi_{1,g(i,j)} = 1,   k = 1, \ldots, \hat{K}.   (16)

Panel (a) in Fig. 2 provides an example of a weighted network representation for a VAR(1) with m = 4 variables and \hat{K} = 2 intensity levels.

A better understanding of node centrality and of risk connectedness can be achieved by exploiting the information in the edge weights (e.g., see Bianchi et al., 2018). One can view the number of edges as more important than the weights, so that the presence of many edges with any weight might be considered more important than the total sum of edge weights. However, edges with large weights might be considered to have a much greater impact than edges with only small weights (e.g., see Opsahl et al., 2010). The BNP-Lasso VAR can be used to better analyse the contribution of the weights to node centrality. The posterior partition of the edges (VAR coefficients) into a finite number of groups allows us to represent G = (V, E, C) as a \hat{K}-multilayer graph G = (V, {E_k}_{k=1}^{\hat{K}}, {C_k}_{k=1}^{\hat{K}}), where the k-th layer is the sub-graph G_k = (V, E_k), with E_k = {{i, j} \in E : c_{ji} = \hat\mu^*_k} the set of all edges with intensity \hat\mu^*_k (see Kivela et al. (2014)). Panels (b) and (c) of Fig. 2 provide an example with two layers.

Figure 2: Weighted graphs with \hat{K} = 2 intensity levels, \hat\mu^*_1 = 0.3 and \hat\mu^*_2 = 0.1; v_i represents the variable i in the 4-dimensional VAR(1), and a clockwise-oriented edge from node j to node i represents a non-null coefficient for the variable y_{j,t-1} in the i-th equation of the VAR. Panel (a): weighted graph G = (V, E, C) (top), with vertex set V = {v_1, v_2, v_3, v_4} and edge set E = {e_1, \ldots, e_7}, the adjacency matrix A (middle) and the weight matrix C (bottom). Panel (b): the subgraph G_1 = (V, E_1), with the three edges of intensity level \hat\mu^*_1 = 0.3. Panel (c): the subgraph G_2 = (V, E_2), with the four edges of intensity \hat\mu^*_2 = 0.1. Panel (d): graph with two communities (red and blue). Panel (e): graph with a hub node.

Node centrality and network connectivity can now be analysed by exploiting the multilayer representation. The out-degree \omega^+_i and the weighted out-degree (Opsahl et al., 2010) \sigma^+_i of node i can be decomposed along the \hat{K} layers:

\omega^+_i = \sum_{j=1}^{m} a_{ji} = \sum_{k=1}^{\hat{K}} \omega^+_{i,k},   where \omega^+_{i,k} = \sum_{j=1}^{m} a_{ji} I(c_{ji} = \hat\mu^*_k),   (17)

\sigma^+_i = \sum_{j=1}^{m} c_{ji} = \sum_{k=1}^{\hat{K}} \sigma^+_{i,k} = \sum_{k=1}^{\hat{K}} \hat\mu^*_k \omega^+_{i,k},   where \sigma^+_{i,k} = \sum_{j=1}^{m} c_{ji} I(c_{ji} = \hat\mu^*_k).   (18)

Similarly, it is possible to decompose the vertex in-degree and total-degree measures. These decompositions are useful in quantifying the contribution of each node to the system-wide connectedness. In our financial and macroeconomic applications, we find that nodes with a large out-degree become less relevant when edge intensity is considered. To exemplify, node v_1 has the largest out-degree, \omega^+_1 = 3, in Panel (a) of Fig. 2, but is not central in the layer G_1 (Panel (b)). In this layer node v_1 has out-degree \omega^+_{1,1} = 1, whereas node v_4 has the largest out-degree, \omega^+_{4,1} = 2. Following the weighted degree measure, node v_4 is central in G and in G_1, with out-degrees \sigma^+_4 = 0.6 and \sigma^+_{4,1} = 0.6, respectively, but it is not central in G_2, where its degree is \sigma^+_{4,2} = 0.

The connectivity analysis presented above relies on the number of direct connections among nodes. In order to gain a broader idea of the network connectivity patterns, indirect connections need to be considered. The notion of a path s_{ij} between two vertices i and j, called endvertices, accounts for the indirect connections between them. A path s_{ij} is defined by ordered sequences of distinct vertices V(s_{ij}) = {v_0, v_1, \ldots, v_l} and edges E(s_{ij}) = {e_1, \ldots, e_l}, with e_1 = {v_0, v_1}, e_l = {v_{l-1}, v_l}, and v_0 = i and v_l = j. The number of edges |E(s_{ij})| = l in a path is called the path length.

The distance between two nodes i and j is defined as the length of the shortest path, i.e. d(i, j) = min{|E(s_{ij})| s.t. s_{ij} is a path between i and j}. It can be used to define the d-step ego network of a node i \in V, that is, the sub-graph of G induced by i and all nodes with distance at most d from i, with V(i) = {j \in V s.t. d(i, j) \leq d} and E(i) = {{i, j} \in E s.t. i, j \in V(i)} \cup {{j, i} \in E s.t. i, j \in V(i)}. In our empirical applications, ego-networks will be used to show the heterogeneity in the linkage strength of a given node.

The distance is a local connectivity measure, which does not account for the whole interconnected topology; thus we consider the average path length (APL), a global connectivity measure defined as

APL(G) = \frac{1}{m(m-1)} \sum_{i=1}^{m} \sum_{j=1}^{m} d(i, j).   (19)

A low APL may indicate a random graph structure of either Erdos-Renyi type or Watts-Strogatz type (see Newman (2010)).
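The APL in Eq. (19) can be computed by running a breadth-first search from every node of the directed graph; a small self-contained sketch (function names and the toy graph are illustrative):

```python
from collections import deque

def shortest_paths(adj, s):
    """BFS distances from node s in a directed graph given as an
    adjacency list {node: [successors]}."""
    dist = {s: 0}
    q = deque([s])
    while q:
        u = q.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
    return dist

def apl(adj):
    """Average path length over all ordered pairs of nodes, Eq. (19)."""
    m = len(adj)
    total = 0
    for i in adj:
        d = shortest_paths(adj, i)
        total += sum(v for k, v in d.items() if k != i)
    return total / (m * (m - 1))

# directed 4-cycle: from each node the distances are 1, 2 and 3, so APL = 2
cycle = {0: [1], 1: [2], 2: [3], 3: [0]}
```

The sketch assumes all ordered pairs are connected, as in the strongly connected toy graph; for disconnected pairs a convention (e.g., restricting the average to reachable pairs) would be needed.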
The APL assumes shocks follow the shortest possible path, but economic shocks are in general unlikely to be restricted to follow specific paths and are also likely to have feedback effects. A measure which accounts for all possible paths connecting two nodes is the eigenvector centrality (a modified Bonacich beta centrality, see Bianchi et al. (2018)):

h(G) = (I_m - m^{-1} A A')^{-1} (I_m - m^{-1} A) \iota,   with \iota = (1, \ldots, 1)'.   (20)

If a VAR of order p > 1 is considered, then not only the direct causal effects from y_{i,t-p} to y_{j,t} should be considered, as in Granger causality, but also the indirect causal effects from y_{i,t-p} to y_{j,t} through y_{k,t-l}, 1 < l < p, as in Sims causality (Eichler, 2013). Moreover, VAR shocks can be correlated and their identification can be challenging in high dimensions. In order to solve all these issues, Diebold and Yilmaz (2014) propose a global connectivity measure, which relies on a generalized identification framework and on the H-step-ahead forecast error variance:

\Theta(G) = \sum_{i,j=1, i \neq j}^{m} \sum_{h=0}^{H-1} \frac{(e'_i \Phi_h \Sigma e_j)^2}{m \sigma_{jj} c_i},   H = 1, 2, \ldots,   (21)

where c_i is a normalizing constant and \Phi_h satisfies the recursion \Phi_h = B_1 \Phi_{h-1} + B_2 \Phi_{h-2} + \ldots + B_p \Phi_{h-p}, with \Phi_0 = I_m and \Phi_h = O_m for h < 0. The second and third terms of the decomposition

(e'_i \Phi_h \Sigma e_j)^2 = ( \sum_{l=1}^{h-1} e'_i B_l \Phi_{h-l} \Sigma e_j )^2 + 2 ( \sum_{l=1}^{h-1} e'_i B_l \Phi_{h-l} \Sigma e_j ) (e'_i B_h \Sigma e_j) + (e'_i B_h \Sigma e_j)^2   (22)

provide the contribution of the h-lag connectivity structure to the global measure. See Appendix B for a derivation. The presence of nodes with a large out-degree in the h-th lag graph can impact significantly on the system-wide connectedness. Both global measures will be used in combination with the multilayer decomposition of the coloured graph.

Finally, two other useful notions are those of community and hub. Assume the vertex set V is partitioned into a sequence of disjoint subsets V_1, \ldots, V_K; then an element V_k of the partition is called a community. An example of a community-structured graph is in Panel (d) of Fig. 2. By construction, the partition of the nodes induces an edge partition (e.g., the two communities in Panel (d)). Since our BNP-Lasso prior allows for any type of partition of the edge set, typical community structures appearing in stochastic block models and modular graphs (see Newman, 2010) can be extracted with our model. Panel (e) of Fig. 2 provides an example of a hub, that is, a node with a larger number of edges in comparison with the other nodes in the graph. The hub node in Panel (e) has out-degree 3, whereas the other nodes have null out-degree.

We study the goodness of fit of our model and, following standard practice in BNP analysis (e.g., see Griffin and Steel (2006), Griffin and Steel (2011), Griffin and Kalli (2018)), we simulate data from a parametric model which is in the family of the likelihood of our model. We consider a VAR(1) model

y_t = B_1 y_{t-1} + \varepsilon_t,   \varepsilon_t \sim_{i.i.d.} N_m(0, \Sigma),   t = 1, \ldots, T,

where B_1 represents the collection of pairwise relationships between the time series in the panel (i.e., countries or institutions).
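Generating data from such a VAR(1) is straightforward; a minimal data-generating sketch (the coefficient matrix below is an illustrative stable choice, not one of the paper's experimental settings):

```python
import numpy as np

def simulate_var1(B1, Sigma, T, rng):
    """Generate T observations from y_t = B1 y_{t-1} + eps_t,
    with eps_t ~ N_m(0, Sigma), starting from y_0 = 0."""
    m = B1.shape[0]
    y = np.zeros((T + 1, m))
    eps = rng.multivariate_normal(np.zeros(m), Sigma, size=T)
    for t in range(1, T + 1):
        y[t] = B1 @ y[t - 1] + eps[t - 1]
    return y[1:]

rng = np.random.default_rng(4)
m = 4
B1 = 0.4 * np.eye(m)          # illustrative stable coefficient matrix
Y = simulate_var1(B1, np.eye(m), T=500, rng=rng)
```

With a stable B1, the simulated series fluctuate around zero with a finite stationary variance, which is what the weak-stationarity checks below are designed to guarantee.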
A non-zero coefficient indicates that there is a linkage between two nodes in the network. We consider three dimension settings for y_t (m = 20 (small), m = 40 (medium) and m = 80 (large)) and different parameter settings. For each setting, we conduct 50 experiments, generating 50 independent datasets and estimating our BNP-Lasso VAR model. In all the experiments, we have chosen the hyperparameters as in Section 2.2, and set the degrees-of-freedom parameter b = 3 and the scale matrix L = I_n. We iterate the Gibbs sampler described in Section 3 for 5,000 iterations and, after discarding a burn-in sample and applying thinning, we use the MCMC samples of Ξ to estimate the network adjacency matrix and the samples of D to estimate the number of colours and to classify the edge intensity. We provide in Fig. 3 the typical causality network estimated in four experiments. See the Supplementary Material for further results.

As one can see from Fig. 3, two experimental settings have been considered for B, which reflect two typical network configurations detected in many empirical studies. The first setting represents a system which is robust to random failures, but vulnerable to targeted attacks. We consider many nodes (financial institutions) with low degree, and a small number of nodes, known as hubs (e.g., Barabási and Albert, 1999; Acemoglu et al., 2012), or systemically important institutions (e.g., Billio et al., 2012; Battiston et al., 2012), with a number of linkages that exceeds the average. The presence of hubs characterizes periods of high systemic risk, when the default of the largest institutions (the hubs) can lead to the breakdown of the whole system. Hubs have been detected in the S&P100 return networks (e.g., Bianchi et al., 2018) and in EuroStoxx-600 realized volatility networks (e.g., Ahelgebey et al., 2016b). Also, we assume that nodes cluster into groups which are internally densely connected, with no connections among the groups. This feature is known as community structure, or modularity (Newman, 2006; Leicht and Newman, 2008), and has been detected not only in financial-return networks, but also in bank-firm (Bargigli and Gallegati, 2013), regional bank (Puliga et al., 2016) and inter-bank market (De Souza et al., 2016) networks. The presence of a community structure is a relevant feature, since it can prevent the spread of contagion to the whole system. The setting described corresponds to a block-diagonal matrix B with (4 × 4) blocks B_j (j = 1, …, m/4) with entries b_{lk,j} ∼ U(−·, ·), l, k = 1, …, 4. The matrix B is checked for the weak-stationarity condition of the VAR.

The second setting represents a stable financial or economic condition, which is reflected in a network with few connections, randomly distributed across node pairs. This network configuration has been observed during periods of low systemic risk. More specifically, we consider a random matrix B of dimension (80 × 80) in which a small number of randomly chosen entries are drawn from a uniform distribution U(−·, ·), while the remaining 6,250 elements are set to zero. The generated matrix is then checked for weak stationarity.

Figure 3: Weighted network estimated in four Monte Carlo experiments, for different model dimensions: m = 20 (panel (a)), m = 40 (panel (b)) and m = 80, with block entries (panel (c)) and with random entries (panel (d)). In each graph the nodes represent the m variables of the VAR model, and a clockwise-oriented edge from node j to node i represents a non-null coefficient for the variable y_{j,t−1} in the i-th equation of the VAR. Blue edges represent negative coefficients, while red edges represent positive coefficients.

In addition to the two settings previously described, we consider a VAR(4) parametrized on the posterior mean of the macroeconomic application presented in Section 4. The network configuration at each lag, given in Fig. 5, exhibits edge heterogeneity and hubs.

The burn-in period and the thinning rate have been chosen following a graphical inspection of the posterior progressive averages and the standard MCMC convergence diagnostics. Table 1 shows the diagnostics averaged over 50 independent experiments. Since, in a data augmentation framework, the mixing of the MCMC may depend on the autocorrelation in both parameter and latent variable samples, we focus on λ_ij, u_ij and β_ij.
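The weak-stationarity check used in both designs can be sketched as follows. This is a minimal illustration for a VAR(1) with m = 20; the block size matches the design above, but the uniform bounds (±0.5) and the function names are placeholders, not the values or code used in the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def block_diagonal_var_matrix(m, block_size=4, low=-0.5, high=0.5):
    """Draw a block-diagonal VAR(1) coefficient matrix B whose
    (block_size x block_size) diagonal blocks have i.i.d. uniform entries."""
    B = np.zeros((m, m))
    for s in range(0, m, block_size):
        B[s:s + block_size, s:s + block_size] = rng.uniform(
            low, high, (block_size, block_size))
    return B

def is_weakly_stationary(B):
    """A VAR(1) is weakly stationary iff the spectral radius of B is below 1."""
    return bool(np.max(np.abs(np.linalg.eigvals(B))) < 1.0)

# redraw until the weak-stationarity condition holds, as in the simulation design
B = block_diagonal_var_matrix(20)
while not is_weakly_stationary(B):
    B = block_diagonal_var_matrix(20)
```

For a VAR(p), the same check would be applied to the companion-form matrix built from B_1, …, B_p.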
The diagnostics for λ_ij, after removing 500 burn-in iterations, indicate that the chain has converged.

Table 1: Geweke (CD) and Kolmogorov–Smirnov (KS) convergence tests; inefficiency factor (INEFF) and autocorrelation (ACF) computed at lag 10 for the L2 norm of λ_ij for each experiment, on the whole (before) and thinned (after) MCMC samples. All quantities are averages over 50 independent experiments.

                     CD       KS       INEFF             ACF(10)
                                       before   after    before   after
m = 20             -0.629    0.3667   13.5092  7.0346   0.1724   0.0188
m = 40             -1.197    0.002    21.3259  8.5001   0.3366   0.1334
m = 80 (block)     -1.354    0.5806   20.1303  9.6897   0.3147   0.1229
m = 80 (random)     2.451    0.0158   17.6933  5.8095   0.267    0.0185

The inefficiency factor and the autocorrelation at lag 10 on the original sample suggest high levels of dependence in the sample, which can be mitigated by thinning the chains. We keep only every 5th sample and obtain a substantial reduction of the sample autocorrelation. See the Supplementary Material for further details.

We compare our prior (BNP) with the Bayesian Lasso (B-Lasso, Park and Casella, 2008), the Elastic-net (EN, Zou and Hastie, 2005) and the Stochastic Search Variable Selection (SSVS) of George et al. (2008). For the SSVS, we assume the default hyperparameter values τ₀ = 0.…, τ₁ = 4 and π = 0.5. Following Korobilis (2016), we use the Mean Square Deviation (MSD) to measure the performance of the four different priors. See the Supplementary Material for a more detailed description. For each parameter setting we generate 50 independent datasets and estimate the models under comparison. Fig. 4 shows, by means of box plots, the median and quartiles of the MSD statistics based on the 50 Monte Carlo experiments. In all settings, the BNP-Lasso appears to work better than the other shrinkage methods in recovering the true VAR coefficients, since the interquartile ranges (i.e., the boxes) do not overlap. However, one can better appreciate the superior relative performance of the BNP-Lasso when the model dimension increases (from m = 20 (left) to m = 80 (right)). In our simulation results, the BNP-Lasso shrinking effect is midway between those of the SSVS and the Bayesian Lasso: it is smaller than that of the B-Lasso and Elastic-net and comparable to that of the SSVS when the true β_ij is non-null, and it is larger than that of the SSVS when β_ij = 0. The reduction in MSD is largely a result of decreased bias: the BNP-Lasso clustering effect tends to identify the correct coefficient clustering, reducing the bias. In the Supplementary Material, the comparison in other settings (Fig. S.8–S.11) confirms our findings. Moreover, Fig. S.12 shows that all shrinkage methods outperform the Minnesota prior, thus confirming the results given in Korobilis and Pettenuzzo (2018) for a VAR of dimension m = 20. Our findings extend their results to higher dimensions (m = 40 and m = 80) and to other shrinkage methods (BNP-Lasso, Elastic-net). Also, we find that the performance of the Minnesota prior deteriorates when the model dimension increases.
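The MSD criterion used in this comparison is the average squared deviation of the estimated coefficients from the true ones; a minimal sketch (the function name and interface are illustrative, not taken from the paper's code):

```python
import numpy as np

def mean_square_deviation(B_hat, B_true):
    """Mean square deviation between the estimated and the true
    VAR coefficient matrices, averaged over all entries."""
    B_hat = np.asarray(B_hat, dtype=float)
    B_true = np.asarray(B_true, dtype=float)
    return float(np.mean((B_hat - B_true) ** 2))
```

One such value is computed for each Monte Carlo replication and each prior, and the box plots in Fig. 4 compare the resulting distributions over the 50 replications.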
Figure 4: Box plots of the Mean Square Deviation (MSD) of the estimated VAR coefficients from their true values, based on 50 independent Monte Carlo exercises, for our Bayesian nonparametric model (BNP), the Elastic-net (EN), the Bayesian Lasso (BLA) and the Stochastic Search Variable Selection (SSVS). Panels: m = 20, m = 40 and m = 80 (random). On each box, the central mark indicates the median, while the bottom and top edges of the box are the 25th and 75th percentiles, respectively. The whiskers extend to the most extreme points and the outliers are plotted using ◦.

Following the literature on international business cycles, we consider a multi-country macroeconomic dataset (e.g., see Kose et al., 2003; Francis et al., 2017; Kaufmann and Schumacher, 2017) and extract a network of linkages between the cycles of the OECD countries by applying the BNP-Lasso VAR. We consider the quarterly GDP growth rate (logarithmic first differences) from 1961:Q1 to 2015:Q2, for a total of T = 215 observations. Due to missing values in some series, we choose a subset of highly industrialised countries: Rest of the World (Australia, Canada, Japan, Mexico, South Africa, Turkey, United States) and Europe (Austria, Belgium, Denmark, Finland, France, Germany, Greece, Ireland, Iceland, Italy, Luxembourg, Netherlands, Norway, Portugal, Spain, Sweden, Switzerland, United Kingdom).

The posterior probability of a macroeconomic linkage is 0.13, which provides evidence of sparsity in the network and indicates that a small proportion of linkages is responsible for connectedness in business cycle risk. The posterior number of clusters and the co-clustering matrix (see Supplementary Material) reveal the existence of three levels of linkage intensity, customarily called "negative", "positive" and "strong positive". Fig. 5 draws the coloured networks at the different time lags, with negative (blue) and positive (red) weights. In a multilayer graph, each intensity level (or edge colour) identifies the set of nodes and edges belonging to a specific layer. The network connectivity and node centrality can be investigated either at the global level or for each single layer.

Figure 5: Weighted Granger networks for the GDP growth rates of 25 OECD countries for the sample period 1961:Q1 to 2015:Q2. We consider different lags: (a) t−1, (b) t−2, (c) t−3, (d) t−4. Node size indicates node degree, blue edges represent negative weights and red ones positive weights. At lag l, a clockwise-oriented edge from node j to node i represents a non-null coefficient for the variable y_{j,t−l} in the i-th equation of the VAR.

The network topology analysis in Table 2 reveals a significant connectivity (graph density and average degree) and complexity (average path length) at all lags. Consequently, each lag-specific network contributes to the build-up of the global connectedness, as shown in Eq. (22). It thus becomes important to study the dynamical structure of the directional connectedness received from other countries (in-degree) or transmitted to other countries (out-degree). The multilayer analysis (coloured edges in Fig. 5 and Table 2) will be used to identify the countries that contribute most, at each lag, to the system-wide connectedness (Diebold and Yilmaz, 2014).

Table 2: Statistics (links, average degree, density, average path length) for the single-layer networks of the GDP growth networks of 25 OECD countries for the sample period 1961:Q1 to 2015:Q2, at lags t−1 to t−4. The sub-graphs (layers) are induced by positive (red) and negative (blue) edge intensity. The average path length represents the average graph-distance between all pairs of nodes; connected nodes have graph distance 1.

At the first two lags, core European countries (Austria, Belgium, Finland, France, Germany, Netherlands) are mainly receivers, whereas the periphery European countries (Greece, Ireland, Italy, Portugal, Spain) transmit the highest percentage of shocks.
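The single-layer statistics of Table 2 can be computed directly from a layer's adjacency matrix. Below is a minimal sketch for an unweighted directed graph; the function is an illustration (not the paper's code), and unreachable ordered pairs are simply excluded from the average path length:

```python
import numpy as np
from collections import deque

def layer_statistics(A):
    """Average degree, density and average path length of a directed graph
    given its (n x n) 0/1 adjacency matrix A (no self-loops)."""
    A = np.asarray(A)
    n = A.shape[0]
    links = int(A.sum())
    avg_degree = links / n                  # average out-degree (= average in-degree)
    density = links / (n * (n - 1))         # share of possible directed edges
    dists = []                              # BFS graph-distances over ordered pairs
    for s in range(n):
        dist = {s: 0}
        queue = deque([s])
        while queue:
            u = queue.popleft()
            for v in np.flatnonzero(A[u]):
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        dists.extend(d for t, d in dist.items() if t != s)
    avg_path_length = sum(dists) / len(dists) if dists else float("nan")
    return avg_degree, density, avg_path_length
```

Applied to each colour-specific sub-graph, this reproduces the kind of per-layer summaries reported in Table 2.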
At the other lags, core countries both receive and transmit a high percentage of shocks. Decomposing the degree by the intensity levels (see Eq. (17)), we see that the average degree at the first lag (2.92) is mainly driven by the positive and strong positive edge intensities.

Based on the literature on risk connectedness among financial institutions and markets (see Hautsch et al. (2015) and Diebold and Yilmaz (2014)), we construct daily realized volatilities using intraday high-low-close price indexes of 118 institutions of the Euro Stoxx 600, obtained from Datastream, from January 3, 2005 to September 19, 2014. The dataset consists of 42 banks, 31 financial services, 31 insurance companies and 22 real estate companies from Austria, Belgium, Finland, France, Germany, Greece, Ireland, Italy, Luxembourg, Netherlands, Portugal and Spain. As in Ahelgebey et al. (2016b), we build the realized volatility as
\[
RV_t = 0.5\,(\log H_t - \log L_t)^2 - (2 \log 2 - 1)(\log C_t - \log C_{t-1})^2,
\]
where H_t, L_t and C_t denote the high, low and closing price of a given stock on day t, respectively, and we consider a VAR(1) model.

There is strong evidence of sparsity, with a probability of 0.16 of edge existence (Fig. 6). Based on the posterior number of clusters, we identify five levels of linkage strength, customarily called "negative" (blue), "weak positive" (green), "positive" (orange) and "strong positive" (red). Negative linkages imply that some institutions react with a volatility decrease to positive volatility shocks, thus reducing systemic risk. The top-left plot in Fig. 6 shows the coloured realized-volatility network. The node size increases with the node eigenvector centrality, which is evaluated on the overall network without accounting for the linkage intensity. We find that insurance companies and banks (e.g., AGEAS and Bank of Ireland) are central and present different types of linkages (e.g., see the AGEAS two-step ego network in the bottom plot). Also, a community structure can be detected, with some small communities of banks from the same country, such as those related to Italy (Monte dei Paschi and other popular banks) and Greece (Alpha Bank, National Bank of Greece, Bank of Piraeus).

Evaluating the node eigenvector centrality in the subgraphs of positive and strong positive linkages, we find that the real estate sector is also central (e.g., Immofinanz; top-right plot). Our approach thus helps in better understanding the role of financial institutions in the build-up of systemic risk, revealing different aspects of their activity, in particular for the insurance and real estate industries.
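The range-based volatility measure above can be computed in a few lines; a minimal sketch (array inputs and the function name are illustrative):

```python
import numpy as np

def realized_volatility(high, low, close):
    """Range-based daily volatility,
    RV_t = 0.5*(log H_t - log L_t)^2 - (2*log(2) - 1)*(log C_t - log C_{t-1})^2,
    returned for t = 1, ..., T-1 (the first close serves only as a lag)."""
    h = np.log(np.asarray(high, dtype=float))
    l = np.log(np.asarray(low, dtype=float))
    c = np.log(np.asarray(close, dtype=float))
    return 0.5 * (h[1:] - l[1:]) ** 2 - (2.0 * np.log(2.0) - 1.0) * np.diff(c) ** 2
```

One such series per institution would then be stacked into the vector y_t entering the VAR(1).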
Conclusion

This paper introduces a novel Bayesian nonparametric Lasso prior (BNP-Lasso) for VAR models, which combines Dirichlet process and Bayesian Lasso priors. The two-stage hierarchical prior allows for clustering the VAR coefficients and for shrinking them either toward zero or toward random locations, thus inducing sparsity in the VAR coefficients. The simulation studies illustrate the good performance of our model in high-dimensional settings, compared to some existing priors, such as the Stochastic Search Variable Selection, the Elastic-net and the simple Bayesian Lasso.

The BNP-Lasso VAR is well suited for extracting networks which account for some stylized facts observed in financial and macroeconomic networks, such as community structures and linkage heterogeneity. The linkage partition can be easily obtained within the BNP-Lasso framework, and the extraction of a coloured network allows for analyzing the centrality of the nodes in the intensity-specific subgraphs.

In the macroeconomic (financial) application, our BNP-Lasso VAR helps in understanding the centrality of countries (institutions) and their role in the build-up of risk. In all applications, there is evidence of network sparsity and of different levels of linkage intensity, meaning that few linkages could be responsible for the connectedness. We show evidence that some nodes are central when the intensity level is considered, even if they are not central in the overall network. Also, a few nodes possess linkages with different intensities, whereas most of the nodes have only one type of linkage.
References
Acemoglu, D., Carvalho, V. M., Ozdaglar, A., and Tahbaz-Salehi, A. (2012). The network origins of aggregate fluctuations. Econometrica, 80(5):1977–2016.
Acemoglu, D., Ozdaglar, A., and Tahbaz-Salehi, A. (2015). Systemic risk and stability in financial networks. The American Economic Review, 105(2):564–608.
Ahelgebey, D. F., Billio, M., and Casarin, R. (2016a). Bayesian Graphical Models for Structural Vector Autoregressive Processes. Journal of Applied Econometrics, 31(2):357–386.
Ahelgebey, D. F., Billio, M., and Casarin, R. (2016b). Sparse Graphical Vector Autoregression: A Bayesian Approach. Annals of Economics and Statistics, 123:333–361.
Banbura, M., Giannone, D., and Reichlin, L. (2010). Large Bayesian vector autoregressions. Journal of Applied Econometrics, 25(1):71–92.
Barabási, A.-L. and Albert, R. (1999). Emergence of scaling in random networks. Science, 286(5439):509–512.
Bargigli, L. and Gallegati, M. (2013). Finding communities in credit networks. Economics, 7(17):1.
Barigozzi, M. and Brownlees, C. (2018). NETS: Network Estimation for Time Series. Working Paper.
Bassetti, F., Casarin, R., and Leisen, F. (2014). Beta-product dependent Pitman-Yor processes for Bayesian inference. Journal of Econometrics, 180(1):49–72.
Battiston, S., Puliga, M., Kaushik, R., Tasca, P., and Caldarelli, G. (2012). Debtrank: Too central to fail? Financial networks, the fed and systemic risk. Scientific Reports, 2.
Bianchi, D., Billio, M., Casarin, R., and Guidolin, M. (2018). Modeling systemic risk with Markov switching graphical SUR models.
Journal of Econometrics, forthcoming.
Billio, M., Getmansky, M., Lo, A. W., and Pelizzon, L. (2012). Econometric measures of connectedness and systemic risk in the finance and insurance sectors. Journal of Financial Economics, 104(3):535–559.
Brownlees, C. and Engle, R. (2016). SRISK: A Conditional Capital Shortfall Measure of Systemic Risk. The Review of Financial Studies, 30(1):48–79.
Canova, F. and Ciccarelli, M. (2004). Forecasting and turning point prediction in a Bayesian panel VAR model. Journal of Econometrics, 120(2):327–359.
Caron, F. and Fox, E. B. (2017). Sparse graphs using exchangeable random measures. Journal of the Royal Statistical Society: Series B, 79(5):1295–1366.
Carriero, A., Clark, T., and Marcellino, M. (2015). Bayesian VARs: Specification choices and forecast accuracy. Journal of Applied Econometrics, 30(1):46–73.
Carvalho, C. M., Massam, H., and West, M. (2007). Simulation of hyper-inverse Wishart distributions in graphical models. Biometrika, 94(3):647–659.
De Souza, S. R. S., Silva, T. C., Tabak, B. M., and Guerra, S. M. (2016). Evaluating systemic risk using bank default probabilities in financial networks. Journal of Economic Dynamics and Control, 66:54–75.
Demirer, M., Diebold, F. X., Liu, L., and Yilmaz, K. (2018). Estimating Global Bank Network Connectedness. Journal of Applied Econometrics, 33(1):1–15.
Di Lucca, M., Guglielmi, A., Müller, P., and Quintana, F. (2013). A simple class of Bayesian nonparametric autoregression models. Bayesian Analysis, 8(1):63–88.
Diebold, F. X. and Yilmaz, K. (2014). On the network topology of variance decompositions: Measuring the connectedness of financial firms. Journal of Econometrics, 182(1):119–134.
Diebold, F. X. and Yilmaz, K. (2015). Measuring the Dynamics of Global Business Cycle Connectedness, pages 45–89. Oxford University Press.
Diebold, F. X. and Yilmaz, K. (2016). Trans-Atlantic Equity Volatility Connectedness: U.S. and European Financial Institutions, 2004–2014. Journal of Financial Econometrics, 14(1):81–127.
Doan, T., Litterman, R., and Sims, C. A. (1984). Forecasting and conditional projection using realistic prior distributions. Econometric Reviews, 3(1):1–100.
Eichler, M. (2013). Causal inference with multiple time series: principles and problems. Philosophical Transactions of the Royal Society A, 371(1997).
Ferguson, T. S. (1973). A Bayesian Analysis of Some Nonparametric Problems.
The Annals of Statistics, 1(2):209–230.
Francis, N., Owyang, M., and Savascin, O. (2017). An endogenously clustered factor approach to international business cycles. Journal of Applied Econometrics, 32(7):1261–1276.
Gefang, D. (2014). Bayesian doubly adaptive elastic-net Lasso for VAR shrinkage. International Journal of Forecasting, 30(1):1–11.
George, E. I., Sun, D., and Ni, S. (2008). Bayesian stochastic search for VAR model restrictions. Journal of Econometrics, 142(1):553–580.
Griffin, J. and Kalli, M. (2018). Bayesian nonparametric vector autoregressive models. Journal of Econometrics, 203(2):267–282.
Griffin, J. and Steel, M. F. J. (2006). Inference with non-Gaussian Ornstein–Uhlenbeck processes for stochastic volatility. Journal of Econometrics, 134(2):605–644.
Griffin, J. E. and Steel, M. F. J. (2011). Stick-breaking autoregressive processes. Journal of Econometrics, 162(2):383–396.
Hatjispyros, S. J., Nicoleris, T. N., and Walker, S. G. (2011). Dependent mixtures of Dirichlet processes. Computational Statistics & Data Analysis, 55(6):2011–2025.
Hautsch, N., Schaumburg, J., and Schienle, M. (2015). Financial network systemic risk contributions. Review of Finance, 19(2):685–738.
Hirano, K. (2002). Semiparametric Bayesian inference in autoregressive panel data models. Econometrica, 70(2):781–799.
Hjort, N. L., Holmes, C., Müller, P., and Walker, S. G. (2010). Bayesian Nonparametrics. Cambridge University Press.
Huber, F. and Feldkircher, M. (2017). Adaptive shrinkage in Bayesian vector autoregressive models. Journal of Business & Economic Statistics, 0(0):1–13.
Jackson, M. (2008). Social and Economic Networks. Princeton University Press.
Jensen, J. M. and Maheu, M. J. (2010). Bayesian semiparametric stochastic volatility modeling. Journal of Econometrics, 157(2):306–316.
Jones, B., Carvalho, C., Dobra, A., Hans, C., Carter, C., and West, M. (2005). Experiments in stochastic computation for high-dimensional graphical models. Statistical Science, 20(4):388–400.
Kalli, M., Griffin, J. E., and Walker, S. G. (2011). Slice sampling mixture models.
Statistics and Computing, 21(1):93–105.
Karlsson, S. (2013). Forecasting with Bayesian Vector Autoregression. In Elliott, G., Granger, C., and Timmermann, A., editors, Handbook of Economic Forecasting, volume 2, chapter 15, pages 791–897. Elsevier.
Kaufmann, S. (2010). Dating and forecasting turning points by Bayesian clustering with dynamic structure: a suggestion with an application to Austrian data. Journal of Applied Econometrics, 25(2):309–344.
Kaufmann, S. and Schumacher, C. (2017). Identifying relevant and irrelevant variables in sparse factor models. Journal of Applied Econometrics, 32(6):1123–1144.
Kivelä, M., Arenas, A., Barthelemy, M., Gleeson, J. P., Moreno, Y., and Porter, M. A. (2014). Multilayer networks. Journal of Complex Networks, 2(3):203.
Koop, G. (2013). Forecasting with medium and large Bayesian VARs. Journal of Applied Econometrics, 28(2):177–203.
Koop, G. and Korobilis, D. (2016). Model uncertainty in panel vector autoregressions. European Economic Review, 81:115–131.
Korobilis, D. (2013). VAR forecasting using Bayesian variable selection. Journal of Applied Econometrics, 28(2):204–230.
Korobilis, D. (2016). Prior selection for panel vector autoregressions. Computational Statistics & Data Analysis, 101:110–120.
Korobilis, D. and Pettenuzzo, D. (2018). Adaptive Hierarchical Priors for High-Dimensional Vector Autoregressions. Journal of Econometrics, forthcoming.
Kose, M. A., Otrok, C., and Whiteman, C. H. (2003). International Business Cycles: World, region and country specific factors. American Economic Review, 93(4):1216–1239.
Leicht, E. A. and Newman, M. E. (2008). Community structure in directed networks. Physical Review Letters, 100(11):118703.
Litterman, R. (1986). Forecasting with Bayesian vector autoregressions: five years of experience. Journal of Business and Economic Statistics, 4(1):25–38.
Lo, A. Y. (1984). On a class of Bayesian nonparametric estimates: I. Density estimates. The Annals of Statistics, 12(1):351–357.
MacEachern, S. N. (1999). Dependent nonparametric processes. In
ASA Proceedings of the Section on Bayesian Statistical Science, Alexandria, VA, pages 50–55. American Statistical Association.
MacEachern, S. N. (2001). Decision theoretic aspects of dependent nonparametric processes. In George, E., editor, Bayesian Methods with Applications to Science, Policy and Official Statistics, pages 551–560. Creta: ISBA.
MacLehose, R. and Dunson, D. (2010). Bayesian semiparametric multiple shrinkage. Biometrics, 66(2):455–462.
Magnus, J. R. and Neudecker, H. (1999). Matrix Differential Calculus with Applications in Statistics and Econometrics, 2nd Edition. John Wiley.
McCracken, M. W. and Ng, S. (2016). FRED-MD: A monthly database for macroeconomic research. Journal of Business & Economic Statistics, 34(4):574–589.
Miller, R. (1980). Bayesian analysis of the two-parameter gamma distribution. Technometrics, 22(1):65–69.
Müller, P., Quintana, F., and Rosner, G. (2004). A method for combining inference across related nonparametric Bayesian models. Journal of the Royal Statistical Society B, 66:735–749.
Newman, M. (2010). Networks: An Introduction. Oxford University Press.
Newman, M. E. (2006). Modularity and community structure in networks. Proceedings of the National Academy of Sciences, 103(23):8577–8582.
Nieto-Barajas, L. E. and Quintana, F. A. (2016). A Bayesian nonparametric dynamic AR model for multiple time series analysis. Journal of Time Series Analysis, 37(5):675–689.
Opsahl, T., Agneessens, F., and Skvoretz, J. (2010). Node centrality in weighted networks: Generalizing degree and shortest paths. Social Networks, 32(3):245–251.
Park, T. and Casella, G. (2008). The Bayesian Lasso. Journal of the American Statistical Association, 103(482):681–686.
Pennell, M. L. and Dunson, D. B. (2006). Bayesian semiparametric dynamic frailty models for multiple event time data. Biometrics, 62:1044–1052.
Pitman, J. and Yor, M. (1997). The two-parameter Poisson-Dirichlet distribution derived from a stable subordinator.
Annals of Probability, 25:855–900.
Puliga, M., Flori, A., Pappalardo, G., Chessa, A., and Pammolli, F. (2016). The accounting network: How financial institutions react to systemic crisis. PLOS One, 11(10):e0162855.
Richardson, S. and Green, P. J. (1997). On Bayesian analysis of mixtures with an unknown number of components. Journal of the Royal Statistical Society Series B, 59:731–792.
Scott, H. (2016). Connectedness and Contagion: Protecting the Financial System from Panics. MIT Press.
Scott, S. L. and Varian, H. R. (2014). Predicting the present with Bayesian structural time series. International Journal of Mathematical Modelling and Numerical Optimisation, 5(1-2):4–23.
Sethuraman, J. (1994). A constructive definition of the Dirichlet process prior. Statistica Sinica, 4:639–650.
Sims, C. A. and Zha, T. (1998). Bayesian methods for dynamic multivariate models. International Economic Review, 39(4):949–968.
Stock, J. H. and Watson, M. W. (2012). Generalized shrinkage methods for forecasting using many predictors. Journal of Business & Economic Statistics, 30(4):481–493.
Taddy, M. A. and Kottas, A. (2009). Markov switching Dirichlet process mixture regression. Bayesian Analysis, 4(4):793–816.
Walker, S. G. (2007). Sampling the Dirichlet mixture model with slices. Communications in Statistics - Simulation and Computation, 36(1):45–54.
Wang, H. (2010). Sparse seemingly unrelated regression modelling: Applications in finance and econometrics. Computational Statistics & Data Analysis, 54(11):2866–2877.
Zou, H. and Hastie, T. (2005). Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society B, 67(2):301–320.
A Further details on prior specification
A.1 Normal-Gamma distribution
A normal-gamma random variable $X \sim \mathcal{NG}(\mu, \gamma, \tau)$ has probability density function
\[
f(x \mid \mu, \gamma, \tau) = \frac{\tau^{(2\gamma+1)/4}\,|x-\mu|^{\gamma-1/2}}{2^{\gamma-1/2}\sqrt{\pi}\,\Gamma(\gamma)}\, K_{\gamma-1/2}\!\left(\sqrt{\tau}\,|x-\mu|\right),
\]
where $K_{\gamma}(\cdot)$ represents the modified Bessel function of the second kind with index $\gamma$, $\mu \in \mathbb{R}$ is the location parameter, and $\gamma > 0$ and $\tau > 0$ are the shape and scale parameters. The distribution reduces to the double exponential for $\gamma = 1$ and can be represented as a scale mixture of normals,
\[
\mathcal{NG}(x \mid \mu, \gamma, \tau) = \int_0^{+\infty} \mathcal{N}(x \mid \mu, \lambda)\, \mathcal{G}a(\lambda \mid \gamma, \tau/2)\, d\lambda, \tag{A.1}
\]
where $\mathcal{G}a(\cdot \mid a, b)$ denotes a gamma distribution with mean $a/b$ and variance $a/b^2$.

A.2 Gamma scale-shape distribution
A Gamma scale-shape random vector $(X, Y) \sim \mathcal{GS}(\nu, p, s, n)$ has pdf
\[
g(x, y \mid \nu, p, s, n) \propto \frac{y^{\nu x - 1}\, p^{x-1}\, \exp\{-sy\}}{\Gamma(x)^n}, \tag{A.2}
\]
with parameters $\nu > 0$, $0 < p < 1$, $s > 0$ and $n > 0$. The joint density factorizes as $g(x, y) = g(y \mid x)\, g(x)$, where
\[
g(y \mid x) = \frac{s^{\nu x}\, y^{\nu x - 1}\, e^{-sy}}{\Gamma(\nu x)},
\]
which is the density of a $\mathcal{G}a(\nu x, s)$, and
\[
g(x) = \int_0^{\infty} g(x, y)\, dy = C\, \frac{\Gamma(\nu x)}{\Gamma(x)^n}\, \frac{p^{x-1}}{s^{\nu x}}
\]
is the marginal density, with normalizing constant $C$ such that $\int_0^{\infty} g(x)\, dx = 1$. We show in Figure A.1 the density function $g(x)$ for the two parameter settings used in the empirical application: $\nu = 30$, $s = 1/\ldots$, $p = 0.\ldots$, $n = 18$ (dashed line) and $\nu = 3$, $s = 1/\ldots$, $p = 0.\ldots$, $n = 10$ (solid line).

Figure A.1: Probability density function, $g(x)$, for the sparse (dashed line) and non-sparse (solid line) case, respectively.

A.3 Hyper-inverse Wishart Distribution
A Hyper-inverse Wishart random variable $\Sigma \sim \mathcal{HIW}_G(b, L)$ has pdf
\[
p(\Sigma) = \prod_{P \in \mathcal{P}} p(\Sigma_P) \left( \prod_{S \in \mathcal{S}} p(\Sigma_S) \right)^{-1},
\]
where $\mathcal{S} = \{S_1, \ldots, S_{n_S}\}$ and $\mathcal{P} = \{P_1, \ldots, P_{n_P}\}$ are the sets of separators and of prime components, respectively, of the graph $G$. In particular, $b$ is the degree of freedom, $L$ is the scale matrix and
\[
p(\Sigma_P) \propto |\Sigma_P|^{-(b + 2\,\mathrm{Card}(P))/2} \exp\left\{ -\frac{1}{2}\, \mathrm{tr}(\Sigma_P^{-1} L_P) \right\},
\]
with $L_P$ the positive-definite symmetric diagonal block of $L$ corresponding to $\Sigma_P$.

A.4 Dirichlet Process prior
The Dirichlet Process, $DP(\tilde{\alpha}, H)$, can be defined by using the stick-breaking representation (Sethuraman, 1994), given by
\[
P_i(\cdot) = \sum_{j=1}^{\infty} w_{ij}\, \delta_{\{\theta_{ij}\}}(\cdot), \qquad i = 1, \ldots, M.
\]
Following the definition of the dependent stick-breaking processes (see MacEachern (1999) and MacEachern (2001)), the atoms $\theta_{ij}$ are i.i.d. sequences of random elements with probability distribution $H$ ($\theta_{ij} \overset{i.i.d.}{\sim} H$), and the weights $w_{ij}$ are determined through the stick-breaking construction: for $j = 1$, $w_{i1} = v_{i1}$, and for $j > 1$,
\[
w_{ij} = v_{ij} \prod_{k=1}^{j-1} (1 - v_{ik}), \qquad i = 1, \ldots, M,
\]
with $v_j = (v_{1j}, \ldots, v_{Mj})$ independent random variables taking values in $[0, 1]^M$, distributed as $\mathcal{B}e(1, \tilde{\alpha})$, such that $\sum_{j \geq 1} w_{ij} = 1$ almost surely for every $i$.

B Proof of the results in the paper
Proof of result in Eq. (11). Let $K(\beta_i \mid \theta_i)$ be the joint density kernel of the sequences $\beta_{ij} \overset{ind}{\sim} \mathcal{NG}(\beta_{ij} \mid \theta^*_{ij})$, $j = 1, \ldots, n_i$, and $\theta^*_{ij} \mid Q_i \overset{i.i.d.}{\sim} Q_i$, the atoms of the random probability measure $Q_i$. Let $f_i(\beta_i \mid Q_i)$ be the random probability density function obtained by integrating out the atoms (Lo (1984)):
\[
f_i(\beta_i \mid Q_i) = \int K(\beta_i \mid \theta_i)\, Q_i(d\theta_i). \tag{B.1}
\]
Using the definition of $Q_i$ as a convex combination of a sparse component and a DPP in equation (B.1), we obtain
\[
f_i(\beta_i \mid Q_i) = \pi_i \int K(\beta_i \mid \theta_i)\, P_0(d\theta_i) + (1 - \pi_i) \int K(\beta_i \mid \theta_i)\, P_i(d\theta_i). \tag{B.2}
\]
We use the stick-breaking representation of the DPP (Appendix A) and get
\[
f_i(\beta_i \mid Q_i) = \pi_i \int \mathcal{NG}(\beta_i \mid \mu, \gamma, \tau)\, P_0(d(\mu, \gamma, \tau)) + (1 - \pi_i) \int \mathcal{NG}(\beta_i \mid \mu, \gamma, \tau)\, P_i(d(\mu, \gamma, \tau)) = \pi_i\, \mathcal{NG}(\beta_i \mid 0, \gamma_0, \tau_0) + (1 - \pi_i) \sum_{k=1}^{\infty} w_{ik}\, \mathcal{NG}(\beta_i \mid \mu_{ik}, \gamma_{ik}, \tau_{ik}). \tag{B.3}
\]
By defining
\[
\check{w}_{ik} = \begin{cases} \pi_i, & k = 0, \\ (1 - \pi_i) w_{ik}, & k > 0, \end{cases} \qquad \check{\theta}_{ik} = \begin{cases} (0, \gamma_0, \tau_0), & k = 0, \\ (\mu_{ik}, \gamma_{ik}, \tau_{ik}), & k > 0, \end{cases}
\]
we have the infinite mixture representation $f_i(\beta_i \mid Q_i) = \sum_{k=0}^{\infty} \check{w}_{ik}\, \mathcal{NG}(\beta_i \mid \check{\theta}_{ik})$.

Proof of result in Eq. (13). We introduce a set of slice latent variables, $u_{ij}$, $j = 1, \ldots, n_i$, which allows us to represent the full conditional of $\beta_{ij}$ as follows:
\[
f_i(\beta_{ij}, u_{ij} \mid (\mu_i, \gamma_i, \tau_i), w_i) = \pi_i \sum_{k=0}^{\infty} \mathbb{I}(u_{ij} < \tilde{w}_{ik})\, \mathcal{NG}(\beta_{ij} \mid 0, \gamma_{ik}, \tau_{ik}) + (1 - \pi_i) \sum_{k=1}^{\infty} \mathbb{I}(u_{ij} < w_{ik})\, \mathcal{NG}(\beta_{ij} \mid \mu_{ik}, \gamma_{ik}, \tau_{ik}) = \pi_i\, \mathbb{I}(u_{ij} < \tilde{w}_0)\, \mathcal{NG}(\beta_{ij} \mid 0, \gamma_0, \tau_0) + (1 - \pi_i) \sum_{k=1}^{\infty} \mathbb{I}(u_{ij} < w_{ik})\, \mathcal{NG}(\beta_{ij} \mid \mu_{ik}, \gamma_{ik}, \tau_{ik}),
\]
where $\tilde{w}_{ik} = \tilde{w}_0 = 1$ if $k = 0$ and $\tilde{w}_{ik} = 0$ for $k > 0$, and $(\mu_{i0}, \gamma_{i0}, \tau_{i0}) = (0, \gamma_0, \tau_0)$. The slice variables allow us to represent the infinite mixture as a finite mixture conditionally on a random number of components.
More specifically, define the set
\[
A_{w_i}(u_{ij}) = \{ k : u_{ij} < w_{ik} \}, \qquad j = 1, \ldots, n_i;
\]
then it can be proved that the cardinality of $A_{w_i}$ is almost surely finite. Posterior draws for $u_{ij}$ are easily obtained by slice sampling. The conditional joint pdf for $\beta_{ij}$ and $u_{ij}$, $f_i(\beta_{ij}, u_{ij} \mid (\mu_i, \gamma_i, \tau_i), w_i)$, is equal to
\[
\pi_i\, \mathbb{I}(u_{ij} < \tilde{w}_0)\, \mathcal{NG}(\beta_{ij} \mid 0, \gamma_0, \tau_0) + (1 - \pi_i) \sum_{k \in A_{w_i}(u_{ij})} \mathcal{NG}(\beta_{ij} \mid \mu_{ik}, \gamma_{ik}, \tau_{ik}).
\]
We iterate the data augmentation principle and for each $f_i$ we introduce two allocation variables, $\xi_{ij}$ and $d_{ij}$, associated with the sparse and non-sparse components, respectively, of the random measure $Q_i$. The joint pdf is
\[
f_i(\beta_{ij}, u_{ij}, d_{ij}, \xi_{ij}) = \left( \pi_i\, \mathbb{I}(u_{ij} < \tilde{w}_{d_{ij}})\, \mathcal{NG}(\beta_{ij} \mid 0, \gamma_0, \tau_0) \right)^{1 - \xi_{ij}} \cdot \left( (1 - \pi_i)\, \mathbb{I}(u_{ij} < w_{i d_{ij}})\, \mathcal{NG}(\beta_{ij} \mid \mu_{i d_{ij}}, \gamma_{i d_{ij}}, \tau_{i d_{ij}}) \right)^{\xi_{ij}}.
\]
From (4), we de-marginalize the Normal-Gamma distribution by introducing a latent variable $\lambda_{ij}$ for each $\beta_{ij}$ and conclude that the joint distribution is
\[
f_i(\beta_{ij}, \lambda_{ij}, u_{ij}, d_{ij}, \xi_{ij}) = \left( \mathbb{I}(u_{ij} < \tilde{w}_{d_{ij}})\, \mathcal{N}(\beta_{ij} \mid 0, \lambda_{ij})\, \mathcal{G}a(\lambda_{ij} \mid \gamma_0, \tau_0/2) \right)^{1 - \xi_{ij}} \cdot \left( \mathbb{I}(u_{ij} < w_{i d_{ij}})\, \mathcal{N}(\beta_{ij} \mid \mu_{i d_{ij}}, \lambda_{ij})\, \mathcal{G}a(\lambda_{ij} \mid \gamma_{i d_{ij}}, \tau_{i d_{ij}}/2) \right)^{\xi_{ij}} \pi_i^{1 - \xi_{ij}} (1 - \pi_i)^{\xi_{ij}}.
\]

Proof of result in Eq. (22). For $h \leq p$, $\Phi_h$ satisfies $\Phi_h = R_h + B_h$, where $R_h = \sum_{l=1}^{h-1} B_l \Phi_{h-l}$. By the properties of the Hadamard product $\circ$, it follows that
\[
(e_i' \Phi_h \Sigma e_j)^2 = e_i' \left( ((R_h + B_h)\Sigma) \circ ((R_h + B_h)\Sigma) \right) e_j = e_i' ((R_h \Sigma) \circ (R_h \Sigma)) e_j + e_i' ((R_h \Sigma) \circ (B_h \Sigma)) e_j + e_i' ((B_h \Sigma) \circ (R_h \Sigma)) e_j + e_i' ((B_h \Sigma) \circ (B_h \Sigma)) e_j = (e_i' (R_h \Sigma) e_j)^2 + 2\, (e_i' R_h \Sigma e_j)(e_i' B_h \Sigma e_j) + (e_i' B_h \Sigma e_j)^2.
\]

C Gibbs sampling details
We introduce for k ≥ k -thmixture component, D ik = { j ∈ , . . . , n i : d ij = k, ξ ij = 1 } and the set of the non-empty mixture components, D ∗ = { k | ∪ i D ik (cid:54) = 0 } . The number of stick-breakingcomponents is denoted by D ∗ = max i { max j ∈{ ,...,n i } d ij } . As noted by Kalli et al.(2011), the sampling of infinitely many elements of Θ and V is not necessarily, sinceonly the elements in the full conditional distributions of ( D, Ξ) are needed and themaximum number is N ∗ = max i { N ∗ i } , where N ∗ i is the smallest integer such that (cid:80) N ∗ i k =1 w ik > − u ∗ i , where u ∗ i = min ≤ j ≤ n i { u ij } . Update the stick-breaking and slice variables V and U We treat V as three blocks: V ∗ = { V k : k ∈ D ∗ } , V ∗∗ = ( v kD ∗ +1 , . . . v kN ∗ ) and V ∗∗∗ = { V k : k > N ∗ } . In order to sample ( U, V ), a further blocking is used:i) Sampling from the full conditional of V ∗ , by using f ( v ij | . . . ) ∝ B e (cid:32) n i (cid:88) j =1 I ( d ij = d, ξ ij = 1) , α + n i (cid:88) j =1 I ( d ij > d, ξ ij = 1) (cid:33) . for k ≤ D ∗ . The elements of ( V ∗∗ , V ∗∗∗ ) are sampled from the prior B e (1 , α ) . ii) Sampling from full conditional posterior distribution of Uf ( u ij | . . . ) ∝ (cid:26) I ( u ij < w d ij ) ξ ij if ξ ij = 1 , I ( u ij < − ξ ij if ξ ij = 0 , Update the mixing parameters λ Regarding the mixing parameters λ ij , the full conditional posterior distribution is f ( λ ij | . . . ) ∝ λ C ij − ij exp (cid:26) − (cid:20) A ij λ ij + B ij λ ij (cid:21)(cid:27) ∝ G i G ( A ij , B ij , C ij ) , G i G denotes a Generalize Inverse Gaussian with A ij > B ij > C ij ∈ R A ij = (cid:2) (1 − ξ ij ) τ + ξ ij τ id ij (cid:3) , B ij = (cid:2) (1 − ξ ij ) β ij + ξ ij ( β ij − µ id ij ) (cid:3) ,C ij = (cid:20) (1 − ξ ij ) γ + γ id ij ξ ij − (cid:21) . The matrix Λ i = diag { λ i } has the elements of λ i = ( λ i , . . . , λ in i ) (cid:48) on the diagonal. Update the atoms
Update the atoms Θ

We consider the sparse and the non-sparse case. In the sparse case, the full conditional distribution of $\mu_0$ is $f(\mu_0 \mid \ldots) = \delta_{\{0\}}(\mu_0)$ and the full conditional distribution of $(\gamma_0, \tau_0)$ is
\[
f((\gamma_0, \tau_0) \mid \ldots) \propto \mathcal{GS}\Big(\gamma_0, \tau_0 \,\Big|\, \nu_0 + \sum_{i=1}^M n_{i,0},\; p_0 \prod_{i=1}^M \prod_{j \mid \xi_{ij} = 0} \lambda_{ij},\; s_0 + \frac{1}{2} \sum_{i=1}^M \sum_{j \mid \xi_{ij} = 0} \lambda_{ij},\; n_0 + \sum_{i=1}^M n_{i,0}\Big),
\]
where $n_{i,0} = \sum_{j=1}^{n_i} (1 - \xi_{ij}) = n_i - n_{i,1}$ and $n_{i,1} = \sum_{j=1}^{n_i} \xi_{ij}$. In the non-sparse case, we generate samples $(\mu_{ik}, \gamma_{ik}, \tau_{ik})$, $k = 1, \ldots, N^*$, by applying a single-move Gibbs sampler. The full conditional for $\mu_{ik}$ is
\[
f(\mu_{ik} \mid \ldots) \propto \mathcal{N}(\mu_{ik} \mid c, d) \prod_{j \mid \xi_{ij} = 1, d_{ij} = k} \mathcal{N}(\beta_{ij} \mid \mu_{ik}, \lambda_{ij}) \propto \mathcal{N}(\tilde{E}_k, \tilde{V}_k),
\]
with $\tilde{E}_k = \tilde{V}_k \big(\frac{c}{d} + \sum_{j \mid \xi_{ij}=1, d_{ij}=k} \frac{\beta_{ij}}{\lambda_{ij}}\big)$ and $\tilde{V}_k = \big(\frac{1}{d} + \sum_{j \mid \xi_{ij}=1, d_{ij}=k} \frac{1}{\lambda_{ij}}\big)^{-1}$, the mean and variance, respectively. The joint conditional posterior of $(\gamma_{ik}, \tau_{ik})$ is
\[
f((\gamma_{ik}, \tau_{ik}) \mid \ldots) \propto \mathcal{GS}\Big(\gamma_{ik}, \tau_{ik} \,\Big|\, \nu_0 + n_{i,k},\; p_0 \prod_{j \mid \xi_{ij}=1, d_{ij}=k} \lambda_{ij},\; s_0 + \frac{1}{2} \sum_{j \mid \xi_{ij}=1, d_{ij}=k} \lambda_{ij},\; n_0 + n_{i,k}\Big),
\]
for $k \in \mathcal{D}^*$, where $n_{i,k} = \sum_{j=1}^{n_i} \xi_{ij} \mathbb{I}(d_{ij} = k)$, and from the prior $H$ for $k \notin \mathcal{D}^*$. In order to draw samples from $\mathcal{GS}$ in both cases, we apply a collapsed Gibbs sampler: samples from $f(\gamma)$ are obtained by a Metropolis-Hastings (MH) algorithm and samples from $f(\tau^2 \mid \gamma)$ are obtained from a Gamma distribution.

Update the coefficients β

The full conditional posterior distribution of $\beta_i$ is
\[
f(\beta_i \mid \ldots) \propto \mathcal{N}_{n_i}(\tilde{v}_i, M_i),
\]
with mean $\tilde{v}_i = M_i \big(\sum_t X_t' \Sigma^{-1} y_t + \Lambda_i^{-1} (\mu_i^* \odot \xi_i)\big)$ and variance $M_i = \big(\sum_t X_t' \Sigma^{-1} X_t + \Lambda_i^{-1}\big)^{-1}$, where $\mu_i^* = (\mu_{i d_{i1}}, \ldots, \mu_{i d_{i n_i}})'$ and $\xi_i = (\xi_{i1}, \ldots, \xi_{i n_i})'$.

Update the covariance matrix Σ

By using the sets S and P as described in Appendix A and a decomposable graph, the likelihood of the graphical Gaussian model can be factorized as the ratio between the likelihood in the prime components and in the separators. Thus, the posterior for Σ is
\[
p(\Sigma \mid \ldots) \propto \mathcal{HIW}_G\Big(b + T,\; L + \sum_{t=1}^T (y_t - X_t' \beta)(y_t - X_t' \beta)'\Big).
\]

Update the graph G
We apply an MCMC for multivariate graphical models for $G$ (see Jones et al. (2005)) and, due to the prior independence assumption,
\[
p(y \mid G) = \int \int \prod_{t=1}^T (2\pi)^{-n/2} |\Sigma|^{-1/2} \exp\Big(-\frac{1}{2} (y_t - X_t' \beta)' \Sigma^{-1} (y_t - X_t' \beta)\Big)\, p(\beta)\, p(\Sigma \mid G)\, d\beta\, d\Sigma.
\]
Following Jones et al. (2005), we apply a local-move MH based on the conditional posterior $p(G \mid \ldots)$ with an add/delete edge-move proposal.

Update the allocation variables D and Ξ

Sampling from the full conditional of $D$ in the non-sparse case is obtained from
\[
P(d_{ij} = d, \xi_{ij} = 1 \mid \ldots) \propto (1 - \pi_i)\, \frac{\mathcal{N}(\beta_{ij} \mid \mu_{id}, \lambda_{ij})\, \mathcal{G}a(\lambda_{ij} \mid \gamma_{id}, \tau_{id}^2/2)}{\sum_{k \in \mathcal{A}_{w_i}(u_{ij})} \mathcal{N}(\beta_{ij} \mid \mu_{ik}, \lambda_{ij})\, \mathcal{G}a(\lambda_{ij} \mid \gamma_{ik}, \tau_{ik}^2/2)}, \quad d \in \mathcal{A}_{w_i}(u_{ij}),
\]
while in the sparse case from
\[
P(d_{ij} = d, \xi_{ij} = 0 \mid \ldots) \propto \pi_i\, \mathbb{I}(u_{ij} < \tilde{w}_d), \quad d \in \mathcal{A}_{\tilde{w}}(u_{ij}),
\]
where $\mathcal{A}_{\tilde{w}}(u_{ij}) = \{k : u_{ij} < \tilde{w}_k\} = \{0\}$, because $\tilde{w}_k = 0$ for all $k > 0$, so that
\[
P(d_{ij} = d, \xi_{ij} = 0 \mid \ldots) \propto \begin{cases} \pi_i\, \mathbb{I}(u_{ij} < \tilde{w}_0)\, \mathcal{N}(\beta_{ij} \mid 0, \lambda_{ij})\, \mathcal{G}a(\lambda_{ij} \mid \gamma_0, \tau_0^2/2) & \text{if } d = 0, \\ 0 & \text{if } d > 0. \end{cases}
\]

Update the prior restriction probabilities π

The full conditional for $\pi_i$ is
\[
f(\pi_i \mid \ldots) \propto \mathcal{B}e\Big(n_i + 1 - \sum_{j=1}^{n_i} \mathbb{I}(\xi_{ij} = 1),\; \alpha_i + \sum_{j=1}^{n_i} \mathbb{I}(\xi_{ij} = 1)\Big).
\]
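The update for $\pi_i$ amounts to counting restricted and unrestricted coefficients and drawing from the resulting Beta, i.e. $\mathcal{B}e(1 + \#\{\xi_{ij} = 0\}, \alpha_i + \#\{\xi_{ij} = 1\})$. A minimal sketch (the indicator vector and $\alpha_i = 1$ are illustrative, not from the paper):

```python
import numpy as np

def update_pi(xi_i, alpha_i, rng):
    """Conjugate Beta draw for the restriction probability pi_i.

    Posterior: Be(n_i + 1 - sum(xi_ij), alpha_i + sum(xi_ij)),
    i.e. Be(1 + #{xi_ij = 0}, alpha_i + #{xi_ij = 1}).
    """
    n1 = int(np.sum(xi_i))   # coefficients in the non-sparse component
    n0 = len(xi_i) - n1      # coefficients shrunk exactly to zero
    return rng.beta(1 + n0, alpha_i + n1)

# Illustrative allocation indicators for one equation:
xi = np.array([0, 0, 1, 0, 1])
pi_i = update_pi(xi, alpha_i=1.0, rng=np.random.default_rng(0))
```

With three restricted and two unrestricted coefficients, the draw comes from $\mathcal{B}e(4, 3)$, so equations with many zeroed coefficients receive a larger sparsity probability on average.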