The Variational Bayesian Inference for Network Autoregression Models
Wei-Ting Lai
Department of Statistics, National Cheng Kung University, Tainan, Taiwan
Ray-Bing Chen
Department of Statistics, National Cheng Kung University, Tainan, Taiwan; Institute of Data Science, National Cheng Kung University, Tainan, Taiwan
Ying Chen
Department of Mathematics, National University of Singapore, Singapore; Risk Management Institute, National University of Singapore, Singapore; Institute of Data Science, National University of Singapore, Singapore
Thorsten Koch
Chair of Software and Algorithms for Discrete Optimization, Technische Universität Berlin, Berlin, Germany; Department of Applied Algorithmic Intelligence Methods, Zuse Institute Berlin, Berlin, Germany
Abstract:
We develop a variational Bayesian (VB) approach for estimating large-scale dynamic network models in the network autoregression framework. The VB approach allows for the automatic identification of the dynamic structure of such a model and obtains a direct approximation of the posterior density. Compared to Markov chain Monte Carlo (MCMC) based sampling approaches, the VB approach achieves enhanced computational efficiency without sacrificing estimation accuracy. In the simulation study conducted here, the proposed VB approach detects various types of proper active structures for dynamic network models. Compared to the alternative approach, the proposed method achieves similar or better accuracy, and its computational time is halved. In a real data analysis scenario of day-ahead natural gas flow prediction in the German gas transmission network with 51 nodes between October 2013 and September 2015, the VB approach delivers promising forecasting accuracy along with clearly detected structures in terms of dynamic dependence.
Keywords:
Dynamic network; EM algorithm; MCMC algorithm; Vector autoregression.

1 Introduction
Networks have emerged and become available in various fields, such as energy transmission, logistics and transportation, and financial systems. Networks are dynamic in terms of their temporal dependence, and they often have large scales. The understanding and inference of network dynamics have profound implications for operations and decision making in modern industries. Tremendous growth and heterogeneity in both nodes/edges and dependence over time are the key characteristics of such networks. However, conventional statistical methods either assume that networks are static or consider only low-dimensional temporal data. This creates a need for efficient computational approaches that are able to reveal the essential dependence structures in high dimensions and simultaneously deliver accurate inferences at a low computational cost.

Industrial networks contain series of temporal-spatial data collected over time. While the nodes/edges are often fixed or change only trivially, the lead-lag temporal dependence can no longer be ignored in network inference. Graph theory has been widely used for unraveling structural information in large-scale network analysis. For example, Fan et al. (2009) and Guo et al. (2011) proposed sparse graphical network models. Liu et al. (2012) proposed the semiparametric Gaussian copula graphical model. Despite their efficiency, these graphical models assume static networks, and the evolution of the network dependence is not considered in their estimation processes.

The temporal dependence of a network can be represented in the vector autoregressive (VAR) modeling framework. In the VAR framework, each node is considered as one time series, and the network dependence is measured by the lead-lag cross-correlations among multiple time series. Both the theoretical properties and empirical performance of VAR have been well studied with respect to multivariate data; see Lütkepohl (2007) and Bańbura et al. (2010). However, the application of VAR to large-scale network analysis is still challenging. Given a network with $m$ nodes, which correspond to $m$ time series, and supposing that the dynamics depend on the last $p$ lags, there are $pm^2$ unknown coefficients in the VAR model. When the number of nodes $m$ becomes large or the temporal dependence $p$ increases, VAR is overparameterized, and this leads to low estimation accuracy or even infeasibility with regard to model inference.

A unique feature of industrial networks is that their lead-lag temporal dependences, such as concurrent dependence among their nodes, are less dense than the networks themselves. Individual industrial networks are also much sparser than social networks. It is conceivable that large-scale industrial networks are driven by a few essential and cohesive connections among their nodes that facilitate network evolution. This motivates the modeling of large-scale dynamic networks using sparse VAR. In particular, penalties are imposed on the parameter space of the VAR framework with various possible types of structural assumptions. For the purpose of both estimation and interpretability, structural sparsity can be enabled in elements, groups, and lags. Lag sparsity investigates the effect of time-lagged information, while group sparsity highlights the impacts of certain nodes on others. In addition to the universal effect on a group of lag coefficients, a sparse element reflects a single effect.
Basu& Michailidis (2015) investigated the theoretical properties of ℓ -regularized estimates, wheremultiple time series were assumed to be stable Gaussian processes. Melnyk & Banerjee (2016)established bounds on the non-asymptotic estimation error of the Lasso-type estimator forstructured VAR parameters. Nicholson et al. (2014) proposed several structures for VAR andLasso, group Lasso and sparse group penatly functions to achieve sparsity in the elementsand groups of a network; see also Hsu et al. (2008), Song & Bickel (2011), and Chen et al.(2020). Moreover, VAR models can be easily extended to the above three kinds of sparsityby building up a hierarchical lag structure in the autoregression model via the inclusion ofhigh-dimensional exogenous variables.The estimation of the structured of a VAR framework faces two challenges. First, thesparse structure needs to be specified to avoid the overfitting of the high-dimensional models.Second, the framework should be able to adapt to different kinds of stochastic behaviors,as empirical data are likely non-Gaussian. Bayesian methods are natural choices becausethey deliver stable performances without prespecified assumptions about the structure anddistribution of the model. Stochastic search variable selection (SSVS), for example, is themost commonly used Bayesian variable selection approach. It introduces latent indicatorsembedded in the priors and stochastically searches subsets by generating posterior sampleswith the Markov Chain Monte Carlo (MCMC) algorithm; see George & McCulloch (1993).3eweke (1996) proposed the component-wise Gibbs method for the purpose of improvingcomputational efficiency. By using the spike-and-slab prior, it avoids computing inversematrices and thus reduces the computational cost. However, the Gibbs sampler requestseither random or systematic updating of the coefficients. Chen et al. (2011) proposed thestochastic matching pursuit (SMP) algorithm, which updates the coefficients of each step forobtaining the best fit based on the current residual vector. In terms of structural selection,Farcomeni (2010) introduced Bayesian-constrained variable selection. Chen et al. (2016)proposed the groupwise Gibbs sampler. As a proof of concept, Chu et al. (2019) implementeda Bayesian variable selection approach in the VAR framework and named it the VAGSA,the vector autoregression-based Gibbs sampler algorithm. For the high-dimensional VARmodel, Kastner & Huber (2020) proposed a large Bayesian vector autoregression approachwith a Dirichlet-Laplace prior and factor stochastic volatility (FSV), and they applied it onhigh-dimensional US economic data.Nevertheless, these MCMC algorithms are known to be computationally expensive forsequentially generating posterior samples. Variational inference, as an alternative, showsgreat potential in terms of improving computational speed without sacrificing much accu-racy. It obtains an approximation of the target posterior density using the Kullbak-Leiblerdivergence, based on which an EM-type algorithm is devised a reduced computational cost.Titsias & Lázaro-Gredilla (2011) and Carbonetto & Stephens (2012) introduced variationalBayesian approaches with spike-and-slab priors for dealing with variable selection problemsin linear regression models. Cai et al. (2020) proposed a variational Bayesian method forsparse group selection in linear models and extended it to multiple response models.In our study, we propose a variational Bayesian (VB) approach for estimating a large-scale dynamic network model. 
The serial dependence in a given network is represented in a vector autoregression framework with three possible types of structural assumptions and various nesting types. Here, we also call this model a network autoregression (NAR) model. We derive variational inferences and develop the corresponding algorithms. The VB approach allows for the automatic identification of the dynamic structure of the data and obtains an approximation of the posterior density directly. Compared to MCMC-based sampling approaches, such as the VAGSA in Chu et al. (2019), the VB approach achieves enhanced numerical performance with similar accuracy. A simulation study shows that compared with existing methods, the VB approach not only detects the proper active structures in various dynamic network models but also halves the computational time, with similar or better accuracy. In a real data analysis, we predict day-ahead natural gas flows for a German network with 51 nodes over 2 years from Oct 1, 2013, to Sep 30, 2015. Germany's gas transport system is essential to the European energy supply. The adequate, high-precision estimation of supply and demand is a crucial issue for efficient control and operations in gas transmission. The VB approach delivers a clear dynamic dependence structure, providing interpretability and insights for understanding and managing the gas transmission network. To the best of our knowledge, this is the first attempt to derive variational inference for a large-scale dynamic network analysis in a structured NAR/VAR framework.

This paper is organized as follows. Section 2 introduces the dynamic network model in the VAR framework. Several types of structural assumptions are also demonstrated. Section 3 presents the proposed variational Bayesian algorithms for large-scale dynamic network inference. Section 4 investigates the finite-sample performance of the proposed VB approach. Section 5 reports the network inference for day-ahead gas flow forecasting with 51 high-pressure nodes in the natural gas transmission network in Germany. Section 6 provides a brief conclusion of our work.

2 Model
Let $Y_t \in \mathbb{R}^{1 \times m}$ denote a vectorized time series of networks with $m$ nodes at time $t$ over the time period $[1, T]$. Without loss of generality, $Y_t$ is demeaned. We consider a dynamic network model in a vector autoregression framework, where $Y_t$ is assumed to depend on the past values of the network at lags 1 to $p$, i.e.,

$$Y_t = Y_{t-1} B_1 + \cdots + Y_{t-p} B_p + \epsilon_t, \quad t = p+1, p+2, \ldots, T.$$

Here, we let $B_\ell$ be an $m \times m$ coefficient matrix for lag-$\ell$, $\ell = 1, 2, \cdots, p$, which is used to measure the lead-lag temporal dependence in the network. The number of coefficients in $B_\ell$ is $m^2$. Given $m = 51$ in the gas transmission network, there are $51^2 = 2601$ unknown coefficients for each $B_\ell$. Obviously, the diagonal elements of $B_\ell$ indicate the serial dependence of each node on its own lag-$\ell$ value, and the off-diagonal elements indicate the dependence on the other nodes' lag-$\ell$ values. The term $\{\epsilon_t\}_{t=p+1}^{T}$ is a sequence of serially uncorrelated $1 \times m$ random vectors with a mean vector of zero and a covariance matrix $\Sigma$. Then, the dynamic network model can be represented in matrix form as follows:

$$\mathbf{Y} = \mathbf{X} \mathbf{B} + \boldsymbol{\epsilon}, \qquad (1)$$

where $\mathbf{Y} \in \mathbb{R}^{(T-p) \times m}$ is the response matrix, $\mathbf{X} = (\mathbf{X}_1, \ldots, \mathbf{X}_p)$ is a $(T-p) \times mp$ matrix with $\mathbf{X}_\ell = (Y'_{p+1-\ell}, Y'_{p+2-\ell}, \ldots, Y'_{T-\ell})'$, $\ell = 1, 2, \cdots, p$, $\mathbf{B} = (B'_1, B'_2, \ldots, B'_p)'$, and $\boldsymbol{\epsilon} = (\epsilon'_{p+1}, \epsilon'_{p+2}, \cdots, \epsilon'_T)'$. (A short code sketch constructing $\mathbf{Y}$ and $\mathbf{X}$ is given at the end of this section.)

In this paper, we consider a structured NAR/VAR framework, i.e., the dynamic dependence is sparse, which is motivated by large-scale industrial networks such as the German gas transmission network. Figure 1 displays the lag-1 and lag-3 cross-correlations of 11 nodes arbitrarily selected from the German natural gas transmission network. The 11 nodes belong to 4 different types: municipal (labeled with M), industrial (I), border (B), and others (O). The left-hand side of Figure 1 is the lag-1 cross-correlation matrix, and the right-hand side of Figure 1 is the cross-correlation matrix for lag-3. This shows the coexistence of strong serial dependence and sparsity in elements, groups, and lags. According to Figure 1, due to their cross-correlation values, nodes sharing the same type may possess similar patterns. Thus, the dynamics of the network are not driven by each node individually, every group of nodes, or each lagged network in the past.

Figure 1: The cross-correlations of 11 nodes arbitrarily selected from the German natural gas transmission network.

While element sparsity and lag sparsity are clear, there are different kinds of group sparsity. Following Song & Bickel (2011), we categorize the various structures into three types and discuss them as follows.

• UG Structure: The universal grouping (UG) structure in the coefficient matrix $B_\ell$ means that the off-diagonal coefficients have the same sparsity pattern across the different columns. For example, municipal nodes M3, M4, M5, and M6 may have similar patterns, and we can treat them as a group. Therefore, if one node affects another node, the others are all influenced by this node. As shown in Figure 2, node M4 affects other nodes, while nodes M3, M5, and M6 do not affect other nodes. As such, the coefficient matrix can be separated into a diagonal matrix reflecting the dynamic dependence of each node on its own past, and a sparse matrix showing the temporal dependence of each node on others. In network analysis, the columns that have the same row sparsity are grouped together.

Figure 2: An illustration of the coefficient matrix (Total = Own + Others) for the universal grouping structure.

• SG Structure: For the segmentation grouping (SG) structure in $B_\ell$, the nodes in the same segment interact with each other but are independent from the other nodes. This is associated with the empirical observation that there are different types of nodes in the network. Figure 3 illustrates the SG structure among municipal and industrial nodes. The network is divided into disjoint segments of the municipal and industrial types. The estimation process can be conducted segment by segment.

Figure 3: An illustration of the coefficient matrix (Total = Own + Others) for the segmentation grouping structure.

• NG Structure: Consider the "no grouping" (NG) structure in $B_\ell$. Each node has its own impact on the other nodes over time. This is a scenario where no regular pattern is identified among the nodes. Figure 4 shows an example with different types of nodes in which there are no similar patterns among them. In this case, we separate the nodes during the inference procedure.

Figure 4: An illustration of the coefficient matrix (Total = Own + Others) for the "no grouping" structure.

It is easy to see that the first and third types of structures, UG and NG, are two special cases of the SG structure in which the numbers of segments are 1 and $m$, respectively. While the three types of structures are defined for each coefficient matrix, they may appear in different time lags. The identification of a proper structure can help not only increase the estimation accuracy but also reduce the computational time of the algorithm.
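To make the stacking in Eq. (1) concrete, the following minimal sketch (in Python with NumPy; the function name `build_nar_design` is our own illustration and not from any referenced package) builds the response matrix $\mathbf{Y}$ and the lagged design matrix $\mathbf{X}$ from a demeaned $T \times m$ series.

```python
import numpy as np

def build_nar_design(series: np.ndarray, p: int):
    """Stack a (T x m) demeaned series into the NAR/VAR regression Y = X B + eps.

    Returns Y of shape (T-p, m) and X = (X_1, ..., X_p) of shape (T-p, m*p),
    where the lag-ell block X_ell collects the observations Y_{p+1-ell}, ..., Y_{T-ell}.
    """
    T, m = series.shape
    Y = series[p:, :]                                  # rows Y_{p+1}, ..., Y_T
    X = np.hstack([series[p - ell:T - ell, :] for ell in range(1, p + 1)])
    return Y, X

# toy usage: m = 3 nodes, p = 2 lags
rng = np.random.default_rng(0)
series = rng.standard_normal((100, 3))
series -= series.mean(axis=0)                          # demean, as in Section 2
Y, X = build_nar_design(series, p=2)
print(Y.shape, X.shape)                                # (98, 3) (98, 6)
```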
3 Variational Approximation Algorithm
We introduce the variational Bayesian approach for NAR/VAR structure inference in high-dimensional scenarios; it is expected to automatically choose the proper structure and to perform estimation at a low computational cost. Here, we derive only the VB method for the segmentation grouping structure, because it nests the other two types of structures as special cases; thus, it is a derivation under a general setup. To simplify the procedure, we assume that all coefficient matrices share the same segmentation grouping structure. Denote $S = \{s_1, s_2, \cdots, s_g\}$ as an overall index set for the columns in each $B_\ell$, where $s_k$ is the index set of the $k$th segment with size $0 < |s_k| < m$ and $\sum_{k=1}^{g} |s_k| = m$, and $g$ is the total number of segments (groups). In this section, we start by introducing the Bayesian structure selection approach, and then we present the details of the proposed VB method.

3.1 Bayesian structure selection
For the Bayesian hierarchical model used to select segmentation structures, Chu et al. (2019) introduced two indicators, $\gamma_{\ell,i}$ and $\eta_{\ell,i,k}$, to identify the active structures. These indicators denote the active nonzero coefficients of the $k$th segment in the $i$th row of $B_\ell$ with respect to the lag values of the node itself and of the others. For its own lags, $\gamma_{\ell,i} = 1$ denotes that the element of the $i$th row and the $i$th column in the coefficient matrix for lag-$\ell$, $B_{\ell,i,i}$, is nonzero, and $\gamma_{\ell,i} = 0$ indicates a coefficient of zero, i.e., $B_{\ell,i,i} = 0$. In addition, $\eta_{\ell,i,k} = 1$ indicates that the elements of the $i$th row of $B_\ell$ in the columns of the $k$th segment (the off-diagonal elements) are all nonzero, i.e., $B_{\ell,i,\tilde{s}_k} \neq \mathbf{0}$, and $\eta_{\ell,i,k} = 0$ otherwise. Consider the Bayesian structure selection approach for determining the segmentation structure in the NAR/VAR model. Following the literature, the prior distributions of the indicators $\gamma_{\ell,i}$ and $\eta_{\ell,i,k}$ are chosen to be independent Bernoulli distributions with $P(\gamma_{\ell,i} = 1) = \pi_1$ and $P(\eta_{\ell,i,k} = 1) = \pi_2$, i.e., $Ber(\pi_1)$ and $Ber(\pi_2)$, respectively. The priors of the elements in matrix $B_\ell$ depend on $\boldsymbol{\gamma}_\ell$ and $\boldsymbol{\eta}_\ell$, which are the vectors of the indicators in lag-$\ell$, as follows:

$$B_{\ell,i,i} \mid \gamma_{\ell,i}, \sigma_B^2 \sim \gamma_{\ell,i} N(0, \sigma_B^2) + (1 - \gamma_{\ell,i})\,\delta_0, \qquad (2)$$

$$B_{\ell,i,\tilde{s}_k} \mid \eta_{\ell,i,k}, \sigma_B^2 \sim \eta_{\ell,i,k}\, MN_{1\times|\tilde{s}_k|}(\mathbf{0}, I, \sigma_B^2 I_{|\tilde{s}_k|}) + (1 - \eta_{\ell,i,k})\,\delta_{\mathbf{0}}, \qquad (3)$$

where $\tilde{s}_k$ is the index set of the $k$th segment except $i$, i.e., for $i \in s_k$, $\tilde{s}_k = s_k \setminus \{i\}$, and $|\tilde{s}_k|$ denotes the number of elements contained in $\tilde{s}_k$. In addition, $MN_{1\times|\tilde{s}_k|}(\mathbf{0}, I, \sigma_B^2 I_{|\tilde{s}_k|})$ is a $1 \times |\tilde{s}_k|$ multivariate normal distribution with a mean vector of $\mathbf{0}$ and a covariance matrix $\sigma_B^2 I_{|\tilde{s}_k|}$, and $\delta_0$ and $\delta_{\mathbf{0}}$ are a point mass at $0$ and at the zero vector $\mathbf{0}$, respectively. Therefore, the coefficient prior is a mixture of a normal distribution and a point mass. Last, $(B_{\ell,i,i}, \gamma_{\ell,i})$ and $(B_{\ell,i,\tilde{s}_k}, \eta_{\ell,i,k})$ for $k = 1, 2, \cdots, g$, $i = 1, 2, \ldots, m$, and $\ell = 1, 2, \cdots, p$ are assumed to be independent.

Based on these prior assumptions regarding the coefficients and the additional inverse-Wishart prior for the covariance matrix of the NAR/VAR model, coefficient inference can be performed by the vector autoregression Gibbs sampler (VAGSA, Chu et al., 2019). Basically, in the VAGSA, we need to iteratively generate the posterior samples of the two indicators and the coefficients for further inference. Similar to other MCMC algorithms, the VAGSA is computationally expensive, especially when the number of nodes increases. Instead of generating posterior samples as an approximation of the posterior distribution, the variational Bayesian approach is adopted in our study to directly obtain an approximation of the posterior density function. Here, an EM-type algorithm is used to solve the corresponding optimization problem, and this is expected to significantly improve the computational efficiency of our approach.
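As a quick illustration of the spike-and-slab priors in Eqs. (2) and (3), the sketch below (our own Python illustration under the SG structure, not code from Chu et al., 2019) draws one row of a lag-$\ell$ coefficient matrix given the prior inclusion probabilities.

```python
import numpy as np

rng = np.random.default_rng(1)

def draw_row_prior(i, segments, m, pi1, pi2, sigma2_B):
    """Draw the i-th row of a lag-ell coefficient matrix from priors (2)-(3).

    segments: list of index arrays s_1, ..., s_g partitioning {0, ..., m-1}.
    The diagonal entry is active with probability pi1; the off-diagonal block
    of the segment containing i is active as a whole with probability pi2;
    entries outside that segment are zero under the SG structure.
    """
    row = np.zeros(m)
    if rng.random() < pi1:                      # gamma_{ell,i} = 1
        row[i] = rng.normal(0.0, np.sqrt(sigma2_B))
    s_k = next(s for s in segments if i in s)   # segment containing node i
    s_tilde = [j for j in s_k if j != i]        # s_k without i
    if s_tilde and rng.random() < pi2:          # eta_{ell,i,k} = 1
        row[s_tilde] = rng.normal(0.0, np.sqrt(sigma2_B), size=len(s_tilde))
    return row

segments = [np.arange(0, 3), np.arange(3, 6)]   # two segments of 3 nodes each
print(draw_row_prior(1, segments, m=6, pi1=0.5, pi2=0.5, sigma2_B=0.25))
```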
3.2 The variational inference procedure

Before introducing the variational Bayesian method, we first reparametrize the NAR/VAR model. Recall that in the VAGSA, the coefficient prior is a mixture distribution with a normal distribution and a delta function at point zero. When an indicator is equal to zero, we can simply set the corresponding coefficient to zero. To simplify the model structure, we reparametrize the coefficient as the product of the coefficient and the indicator. That is, $B_{\ell,i,i} = \gamma_{\ell,i}\widetilde{B}_{\ell,i,i}$ and $B_{\ell,i,\tilde{s}_k} = \eta_{\ell,i,k}\widetilde{B}_{\ell,i,\tilde{s}_k}$. Here, the priors of the indicators, $\gamma_{\ell,i}$ and $\eta_{\ell,i,k}$, are still independent Bernoulli distributions with probabilities $\pi_1$ and $\pi_2$, respectively. In addition, the prior of $\widetilde{B}_{\ell,i,i}$ is chosen as a normal distribution with mean zero and variance $\sigma_B^2$, and the prior of $\widetilde{B}_{\ell,i,\tilde{s}_k}$ comes from the multivariate normal distribution with a mean zero vector and a covariance matrix $\sigma_B^2 I_{|\tilde{s}_k|}$. Among them, the priors of $\widetilde{B}_{\ell,i,i}$ and $\widetilde{B}_{\ell,i,\tilde{s}_k}$ are independent of $\gamma_{\ell,i}$ and $\eta_{\ell,i,k}$. Thus, $\gamma_{\ell,i}\widetilde{B}_{\ell,i,i}$ and $\eta_{\ell,i,k}\widetilde{B}_{\ell,i,\tilde{s}_k}$ have the same effects as those shown in Eqs. (2) and (3).

Instead of generating the posterior samples directly, the variational Bayesian approach identifies the best approximate distribution of the true posterior for the further Bayesian inference. Following the ordinary variational Bayesian approach (Bishop, 2006), the Kullback-Leibler (KL) divergence is used to measure the dissimilarity of the approximated posterior distribution from the true posterior distribution. Let $\theta = \{\pi_1, \pi_2, \Sigma, \sigma_B^2\}$ be the set of parameters and $\{\boldsymbol{\eta}, \boldsymbol{\gamma}, \widetilde{\mathbf{B}}\}$ be the set of the indicator variables and coefficient matrices, where $\boldsymbol{\eta}$ is the vector of $\eta_{\ell,i,k}$, $\boldsymbol{\gamma}$ is the vector of $\gamma_{\ell,i}$, and $\widetilde{\mathbf{B}} = (\widetilde{B}'_1, \widetilde{B}'_2, \ldots, \widetilde{B}'_p)'$. Given the prior assumptions, the posterior density function of $\boldsymbol{\eta}, \boldsymbol{\gamma}, \widetilde{\mathbf{B}}$ is proportional to the following joint density function:

$$P(\mathbf{Y}, \boldsymbol{\eta}, \boldsymbol{\gamma}, \widetilde{\mathbf{B}} \mid \mathbf{X}, \theta) = P(\mathbf{Y} \mid \boldsymbol{\eta}, \boldsymbol{\gamma}, \widetilde{\mathbf{B}}, \mathbf{X}, \theta)\, P(\boldsymbol{\eta}, \boldsymbol{\gamma}, \widetilde{\mathbf{B}} \mid \mathbf{X}, \theta) = MN_{T\times m}(\mathbf{X}\mathbf{B}, I, \Sigma) \prod_{\ell=1}^{p}\prod_{i=1}^{m} N(0, \sigma_B^2)\, \pi_1^{\gamma_{\ell,i}}(1-\pi_1)^{(1-\gamma_{\ell,i})} \prod_{k=1}^{g} MN_{1\times|\tilde{s}_k|}(\mathbf{0}, I, \sigma_B^2 I_{|\tilde{s}_k|})\, \pi_2^{\eta_{\ell,i,k}}(1-\pi_2)^{(1-\eta_{\ell,i,k})}. \qquad (4)$$

Define $q(\boldsymbol{\eta}, \boldsymbol{\gamma}, \widetilde{\mathbf{B}})$ as an approximate posterior density function of $P(\widetilde{\mathbf{B}}, \boldsymbol{\eta}, \boldsymbol{\gamma} \mid \mathbf{Y}, \mathbf{X})$.
Since

$$KL(q \,\|\, P) = \int \sum_{\boldsymbol{\gamma}} \sum_{\boldsymbol{\eta}} q(\widetilde{\mathbf{B}}, \boldsymbol{\gamma}, \boldsymbol{\eta}) \log\left(\frac{q(\widetilde{\mathbf{B}}, \boldsymbol{\gamma}, \boldsymbol{\eta})}{P(\widetilde{\mathbf{B}}, \boldsymbol{\gamma}, \boldsymbol{\eta} \mid \mathbf{Y}, \mathbf{X}; \theta)}\right) d\widetilde{\mathbf{B}} = \underbrace{\int \sum_{\boldsymbol{\gamma}} \sum_{\boldsymbol{\eta}} q(\widetilde{\mathbf{B}}, \boldsymbol{\gamma}, \boldsymbol{\eta}) \log\left(\frac{q(\widetilde{\mathbf{B}}, \boldsymbol{\gamma}, \boldsymbol{\eta})}{P(\widetilde{\mathbf{B}}, \boldsymbol{\gamma}, \boldsymbol{\eta}, \mathbf{Y} \mid \mathbf{X}; \theta)}\right) d\widetilde{\mathbf{B}}}_{-L(q)} + \underbrace{\int \sum_{\boldsymbol{\gamma}} \sum_{\boldsymbol{\eta}} q(\widetilde{\mathbf{B}}, \boldsymbol{\gamma}, \boldsymbol{\eta}) \log P(\mathbf{Y} \mid \mathbf{X}; \theta)\, d\widetilde{\mathbf{B}}}_{\log P(\mathbf{Y} \mid \mathbf{X}; \theta)}, \qquad (5)$$

we have

$$\log P(\mathbf{Y} \mid \mathbf{X}; \theta) = L(q) + KL(q \,\|\, P). \qquad (6)$$

According to Eq. (5), the marginal likelihood $\log P(\mathbf{Y} \mid \mathbf{X}; \theta)$ is independent of $q$ and can be treated as a fixed constant. Thus, $L(q)$ can be treated as a lower bound of $\log P(\mathbf{Y} \mid \mathbf{X}; \theta)$, and minimizing the KL divergence with respect to $q$ is equivalent to maximizing the lower bound $L(q)$. Here, one key is to specify the approximating density $q$ via a factorization structure. Following Titsias & Lázaro-Gredilla (2011) and Cai et al. (2020), due to the independence assumption among the disjoint groups and the own lags, the following hierarchically factorized distribution is chosen as an approximate density function $q(\boldsymbol{\eta}, \boldsymbol{\gamma}, \widetilde{\mathbf{B}})$, i.e.,

$$q(\boldsymbol{\eta}, \boldsymbol{\gamma}, \widetilde{\mathbf{B}}) = \prod_{\ell=1}^{p}\prod_{i=1}^{m}\prod_{k=1}^{g} q_{\ell,i}(\widetilde{B}_{\ell,i,i}, \gamma_{\ell,i})\, q_{\ell,i,k}(\widetilde{B}_{\ell,i,\tilde{s}_k}, \eta_{\ell,i,k}),$$

where $q_{\ell,i}(\widetilde{B}_{\ell,i,i}, \gamma_{\ell,i})$ and $q_{\ell,i,k}(\widetilde{B}_{\ell,i,\tilde{s}_k}, \eta_{\ell,i,k})$ are chosen from the corresponding prior distributions. Then, a variational extension of the EM algorithm is adopted in the estimation process by maximizing the corresponding $L(q)$ with respect to the parameters.
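Schematically, the resulting procedure alternates two steps until the lower bound stabilizes. The driver-loop sketch below is our own illustration; `e_step`, `m_step`, and `elbo` stand for the updates derived in the remainder of this section and are not a published implementation.

```python
import numpy as np

def variational_em(Y, X, theta0, e_step, m_step, elbo, tol=1e-4, max_iter=500):
    """Generic variational EM loop for the NAR model.

    e_step(Y, X, theta)  -> q      (variational parameters: phi, mu, Sigma_B)
    m_step(Y, X, q)      -> theta  (pi1, pi2, Sigma, sigma2_B)
    elbo(Y, X, q, theta) -> float  (lower bound L(q))
    Iterates until the increase of L(q) falls below `tol`.
    """
    q, theta, bound = None, theta0, -np.inf
    for _ in range(max_iter):
        q = e_step(Y, X, theta)          # update q(eta, gamma, B~) given theta
        theta = m_step(Y, X, q)          # update theta given q
        new_bound = elbo(Y, X, q, theta)
        if new_bound - bound < tol:      # stopping criterion on L(q)
            break
        bound = new_bound
    return q, theta, bound
```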
In the E-step, one takes the expectation of $L(q)$ with respect to $\boldsymbol{\eta}$, $\boldsymbol{\gamma}$, and $\widetilde{\mathbf{B}}$, and then in the M-step, one optimizes $L(q)$ with respect to $\theta$. The two steps are iterated until the lower bound $L(q)$ converges.

In the E-step, $q(\boldsymbol{\eta}, \boldsymbol{\gamma}, \widetilde{\mathbf{B}})$ is updated as follows:

$$q(\boldsymbol{\eta}, \boldsymbol{\gamma}, \widetilde{\mathbf{B}}) = \prod_{\ell=1}^{p}\prod_{i=1}^{m}\prod_{k=1}^{g} q_{\ell,i}(\widetilde{B}_{\ell,i,i} \mid \gamma_{\ell,i})\, q_{\ell,i}(\gamma_{\ell,i})\, q_{\ell,i,k}(\widetilde{B}_{\ell,i,\tilde{s}_k} \mid \eta_{\ell,i,k})\, q_{\ell,i,k}(\eta_{\ell,i,k})$$
$$= \prod_{\ell=1}^{p}\prod_{i=1}^{m}\prod_{k=1}^{g} \big(\phi_{1,\ell,i}\, N(\mu_{1,\ell,i,i}, \Sigma_{B_{\ell,i,i}})\big)^{\gamma_{\ell,i}} \big((1-\phi_{1,\ell,i})\, N(0, \sigma_B^2)\big)^{(1-\gamma_{\ell,i})} \big(\phi_{2,\ell,i,k}\, MN_{1\times|\tilde{s}_k|}(\boldsymbol{\mu}_{2,\ell,i,\tilde{s}_k}, I, \Sigma_{B_{\ell,i,\tilde{s}_k}})\big)^{\eta_{\ell,i,k}} \big((1-\phi_{2,\ell,i,k})\, MN_{1\times|\tilde{s}_k|}(\mathbf{0}, I, \sigma_B^2 I_{|\tilde{s}_k|})\big)^{(1-\eta_{\ell,i,k})},$$

where

$$\Sigma_{B_{\ell,i,i}} = \left(\mathbf{X}_\ell^{(i)\prime}\mathbf{X}_\ell^{(i)}(\Sigma^{-1})_{i,i} + \frac{1}{\sigma_B^2}\right)^{-1},$$

$$\mu_{1,\ell,i,i} = \Sigma_{B_{\ell,i,i}}\left((\Sigma^{-1})_{i,i}\,\mathbf{X}_\ell^{(i)\prime}\Big(\mathbf{Y}^{(i)} - \sum_{j\neq\ell}^{p}\mathbf{X}_j E(B_j^{(i)}) - \mathbf{X}_\ell^{(-i)}E(B_\ell^{(-i,i)})\Big) + E\Big(tr\big((\Sigma^{-1})_{-i,i}\,\mathbf{X}_\ell^{(i)\prime}(\mathbf{Y}^{(-i)} - \mathbf{X}\mathbf{B}^{(-i)})\big)\Big)\right),$$

$$\phi_{1,\ell,i} = \mathrm{Inv\text{-}logit}\left\{\mathrm{logit}(\pi_1) - \frac{1}{2}\log(\sigma_B^2) + \frac{1}{2}\log\big(\det(\Sigma_{B_{\ell,i,i}})\big) + \frac{1}{2}(\Sigma_{B_{\ell,i,i}})^{-1}\mu_{1,\ell,i,i}^2\right\},$$

$$\Sigma_{B_{\ell,i,\tilde{s}_k}} = \left(\mathbf{X}_\ell^{(i)\prime}\mathbf{X}_\ell^{(i)}(\Sigma^{-1})_{\tilde{s}_k,\tilde{s}_k} + \frac{1}{\sigma_B^2} I_{|\tilde{s}_k|}\right)^{-1},$$

$$\boldsymbol{\mu}_{2,\ell,i,\tilde{s}_k} = \left(\mathbf{X}_\ell^{(i)\prime}\Big(\mathbf{Y}^{(\tilde{s}_k)} - \sum_{j\neq\ell}^{p}\mathbf{X}_j E(B_j^{(\tilde{s}_k)}) - \mathbf{X}_\ell^{(-i)}E(B_\ell^{(-i,\tilde{s}_k)})\Big)(\Sigma^{-1})_{\tilde{s}_k,\tilde{s}_k} + E\Big(tr\big(\mathbf{X}_\ell^{(i)\prime}(\mathbf{Y}^{(-\tilde{s}_k)} - \mathbf{X}\mathbf{B}^{(-\tilde{s}_k)})(\Sigma^{-1})_{-\tilde{s}_k,\tilde{s}_k}\big)\Big)\right)\Sigma_{B_{\ell,i,\tilde{s}_k}},$$

$$\phi_{2,\ell,i,k} = \mathrm{Inv\text{-}logit}\left\{\mathrm{logit}(\pi_2) - \frac{1}{2}\log\big(\det(\sigma_B^2 I_{|\tilde{s}_k|})\big) + \frac{1}{2}\log\big(\det(\Sigma_{B_{\ell,i,\tilde{s}_k}})\big) + \frac{1}{2} tr\big((\Sigma_{B_{\ell,i,\tilde{s}_k}})^{-1}\boldsymbol{\mu}'_{2,\ell,i,\tilde{s}_k}\boldsymbol{\mu}_{2,\ell,i,\tilde{s}_k}\big)\right\}.$$

Here, $\mathbf{X}_\ell^{(i)}$ and $\mathbf{X}_\ell^{(-i)}$ denote the $i$th column of $\mathbf{X}_\ell$ and the matrix $\mathbf{X}_\ell$ excluding the $i$th column, respectively; the same notational convention is used for $\mathbf{Y}$ and $B$. In addition, $B_\ell^{(-i,i)}$ and $(\Sigma^{-1})_{-i,i}$ denote the $i$th column of $B_\ell$ and of $\Sigma^{-1}$ without the $i$th element, respectively, and $\mathrm{Inv\text{-}logit}(\cdot)$ is the inverse logistic function.

In the M-step, we take the derivative of $L(q)$ with respect to $\theta$ and set it to the zero vector. Thus, the components of $\theta$ are updated as the solutions of the normal equations, i.e.,

$$\pi_1 = \frac{\sum_{\ell=1}^{p}\sum_{i=1}^{m} \phi_{1,\ell,i}}{mp}, \qquad \pi_2 = \frac{\sum_{\ell=1}^{p}\sum_{i=1}^{m}\sum_{k=1}^{g} \phi_{2,\ell,i,k}}{\sum_{k=1}^{g} |\tilde{s}_k|\, mp}, \qquad \Sigma = \frac{E\big((\mathbf{Y} - \mathbf{X}\mathbf{B})'(\mathbf{Y} - \mathbf{X}\mathbf{B})\big)}{T},$$

$$\sigma_B^2 = \frac{\sum_{\ell=1}^{p}\sum_{i=1}^{m} \phi_{1,\ell,i}\big(\Sigma_{B_{\ell,i,i}} + \mu_{1,\ell,i,i}^2\big) + \sum_{\ell=1}^{p}\sum_{i=1}^{m}\sum_{k=1}^{g} \phi_{2,\ell,i,k}\, tr\big(\Sigma_{B_{\ell,i,\tilde{s}_k}} + \boldsymbol{\mu}'_{2,\ell,i,\tilde{s}_k}\boldsymbol{\mu}_{2,\ell,i,\tilde{s}_k}\big)}{\sum_{\ell=1}^{p}\sum_{i=1}^{m}\big(\phi_{1,\ell,i} + \sum_{k=1}^{g} |\tilde{s}_k|\, \phi_{2,\ell,i,k}\big)}.$$

More details regarding both steps are given in the supplementary materials.

Since we iterate the E- and M-steps sequentially, a natural stopping criterion is that the difference between the values of $L(q)$ in two consecutive iterations is less than a certain threshold. One can then approximate the posterior inclusion probabilities as follows:

$$P(\eta_{\ell,i,k} = 1 \mid \mathbf{Y}, \mathbf{X}, \hat{\theta}) \approx q(\eta_{\ell,i,k} = 1 \mid \hat{\theta}) = \phi_{2,\ell,i,k}, \qquad P(\gamma_{\ell,i} = 1 \mid \mathbf{Y}, \mathbf{X}, \hat{\theta}) \approx q(\gamma_{\ell,i} = 1 \mid \hat{\theta}) = \phi_{1,\ell,i}.$$

Thus, based on the median probability criterion (Barbieri & Berger, 2004), $B_{\ell,i,i}$ or $B_{\ell,i,\tilde{s}_k}$ is identified as nonzero if $\phi_{1,\ell,i}$ or $\phi_{2,\ell,i,k}$ is larger than or equal to $1/2$. Finally, the one-step-ahead prediction $\hat{Y}_{T+1}$ can be obtained by

$$\hat{Y}_{T+1} = Y_T\hat{B}_1 + Y_{T-1}\hat{B}_2 + \cdots + Y_{T+1-p}\hat{B}_p,$$

where the $\hat{B}_\ell$ are estimated based on the identified active structures. (A short sketch of this selection-plus-forecast step is given at the end of this section.)

The EM-type method is a locally optimal approach and may be sensitive to its initialization. Improper initial values of $\theta$, in particular improper choices of the initial prior probabilities, may lead to poor estimation results. In our study, the initial values of the prior probabilities, $\pi_1$ and $\pi_2$, are set to be small, e.g., $\pi_1 = \pi_2 = 0.01$. The initial values of the coefficient matrix $\mathbf{B}$ are obtained via the least-squares estimation method, and the initial value of $\Sigma$ is set as the sample variance of $\mathbf{Y}$ divided by a constant. The threshold value of the stopping criterion is set to a prespecified small value.
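The sketch below (our own illustration; the inclusion probabilities and posterior means are assumed to come from the converged E-step, with a simplified bookkeeping of the segment blocks) shows the selection-plus-forecast step just described.

```python
import numpy as np

def select_and_forecast(series, mu1, phi1, mu2, phi2, segments, p):
    """Median probability selection followed by a one-step-ahead forecast.

    series: (T x m) observed series.
    mu1, phi1: (p x m) posterior means / inclusion probs of diagonal entries.
    mu2, phi2: dicts keyed by (ell, i) holding the off-diagonal posterior mean
    vector and the inclusion probability of the segment block of row i.
    segments: list of length m; segments[i] is the index list of the segment
    containing node i.
    """
    T, m = series.shape
    y_next = np.zeros(m)
    for ell in range(1, p + 1):
        B = np.zeros((m, m))
        for i in range(m):
            if phi1[ell - 1, i] >= 0.5:            # gamma identified as active
                B[i, i] = mu1[ell - 1, i]
            s_tilde = [j for j in segments[i] if j != i]
            if s_tilde and phi2[(ell, i)] >= 0.5:  # eta identified as active
                B[i, s_tilde] = mu2[(ell, i)]
        y_next += series[T - ell, :] @ B           # Y_{T+1-ell} B_ell
    return y_next
```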
4 Simulation

This section investigates the finite-sample performance of the proposed VB approach under known data generation processes. We first revisit the simulation studies of the VAGSA in Chu et al. (2019). Given dynamic networks, we care not only about the structure selection ability but also about the computational efficiency of our approach. For medium-sized examples, i.e., $m = 10$ and $20$, we directly compare the performances of the VB method and the VAGSA. When the dimensionality increases to $m = 50$, the VAGSA requires an extremely long computational time, and thus we illustrate only the results of the proposed VB method.

4.1 Medium-sized networks

We follow the simulation setups in Chu et al. (2019) and generate medium-sized network series with $m = 10$ and $20$ for a fair comparison. We assume that there are multiple lags $p$ in the VAR framework. Three dependence structures are considered in the simulations.

m10UG and m20UG: The true models follow the universal grouping structure with $m = 10$ and $m = 20$, respectively; the numbers of true lags and nonzero coefficients follow the corresponding settings in Chu et al. (2019).

m10SG and m20SG: The segmentation structure is considered in these two simulations with $m = 10$ and $20$. Let $S_{m,g}$ denote a specific group structure with $m$ time series and $g$ disjoint groups. We set $S_{10,3}$ to consist of three groups of sizes 3, 3, and 4, and $S_{20,4}$ to consist of four groups of sizes 5, 5, 3, and 7. There are 40 and 109 nonzero coefficients, respectively.

m10NG and m20NG: In the "no grouping" cases, there are 18 and 27 nonzero coefficients for $m = 10$ and $m = 20$, respectively.

In addition, the error terms are generated from a multivariate normal distribution with zero mean. We consider two types of covariance matrices when generating the stochastic noises, namely, an identity matrix $I_m$, indicating that there are no concurrent correlations among the nodes, and symmetric matrices $\Sigma_1$ (for $m = 10$) and $\Sigma_2$ (for $m = 20$), whose diagonals consist of prespecified values and whose off-diagonal correlation coefficients decay geometrically in $|i' - i|$ for $i' \neq i$.

For each simulation, we generate a data series in which all but the last observation are used for model training and the last point is used to demonstrate the prediction ability of the proposed method. Each simulation is replicated 100 times, and the segmentation structure is assumed to be known. We simply fix the number of lags $p$ at 10 and set the threshold value of the stopping criterion to a small prespecified value in all simulations.

As mentioned before, we also implement the VAGSA in these simulation cases for comparison purposes. Following Chu et al. (2019), we first add the inverse-Wishart prior for $\Sigma$ with $m$ degrees of freedom and a scale matrix $I_m$. For the other parameters, the prior probabilities $\pi_1$ and $\pi_2$ are fixed at 0.5, and the variance in the mixture prior distribution, $\sigma_B^2$, is 0.5 for the cases m10UG, m20UG, m10SG, and m20SG and 15 for the other two cases, m10NG and m20NG. When we implement the VAGSA, there are 3000 sweeps in total, and we take the last 1000 samples for inference. For the details of implementing the VAGSA, please refer to Chu et al. (2019).
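As an illustration of this kind of data generation, the following sketch simulates a stationary network series from a single-lag VAR with a hand-picked sparse coefficient matrix; it is a simplified stand-in for, not a reproduction of, the designs in Chu et al. (2019).

```python
import numpy as np

def simulate_var1(B, Sigma, T, burn=200, seed=0):
    """Simulate Y_t = Y_{t-1} B + eps_t with eps_t ~ N(0, Sigma)."""
    rng = np.random.default_rng(seed)
    m = B.shape[0]
    # stationarity: all eigenvalues of B must lie inside the unit circle
    assert np.max(np.abs(np.linalg.eigvals(B))) < 1, "unstable VAR"
    y = np.zeros(m)
    out = np.empty((T, m))
    for t in range(T + burn):
        y = y @ B + rng.multivariate_normal(np.zeros(m), Sigma)
        if t >= burn:
            out[t - burn] = y
    return out

m = 10
B = np.zeros((m, m))
np.fill_diagonal(B, 0.4)          # own-lag dependence on the diagonal
B[3, :3] = 0.2                    # one node's past affects a group of others
series = simulate_var1(B, np.eye(m), T=300)
```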
4.2 Large-scale networks

To illustrate the feasibility of the proposed VB approach for high-dimensional scenarios, we conduct simulations with $m = 50$ nodes. The setups of the large-scale simulations are similar to those used in Section 4.1. Here, we also consider the three different grouping structures, and the details of the three cases are as follows.

m50UG: There are 355 nonzero coefficients in the active coefficient matrices $B_1$, $B_3$, and $B_5$.

m50SG: We set $S_{50,8}$ to consist of eight disjoint groups, the first three of which contain 5, 5, and 3 nodes. There are 360 nonzero coefficients in $B_1$, $B_3$, and $B_5$.

m50NG: There are 128 nonzero coefficients in $B_1$, $B_3$, and $B_5$.

Figure 5 visualizes the true sparsity of the network dependence, with the nonzero coefficients marked in bold. Note that only the active coefficient matrices are displayed, while the others, such as lag-2 and lag-4, are zero everywhere and therefore omitted.

Figure 5: The true active coefficient matrices for the m50UG, m50SG, and m50NG cases.

Similarly, two different noise covariance matrices are used for data generation, i.e., a $50 \times 50$ identity matrix $I_{50}$ and a symmetric matrix $\Sigma_3$, whose diagonal elements take prespecified values repeated in blocks and whose correlation coefficients decay geometrically in $|i' - i|$ for $i' \neq i$.

In this simulation example, we generate data with $T = 701$. The first 700 samples are used for model training, and the last sample $Y_{701}$ is used to illustrate the prediction ability of the model.

4.3 Results

Based on the simulation setups in Sections 4.1 and 4.2, we independently repeat each case 100 times. When we implement the proposed VB approach, the initial setups are basically the same as those used in Section 3, except for the case of m50NG with the covariance matrix $\Sigma_3$; for this case, we set $\pi_1$ and $\pi_2$ to a larger value instead of 0.01. To illustrate the performance of the proposed method, we consider four measurements. The true positive rate (TPR) and the false positive rate (FPR) show the accuracy of structure identification: the TPR is the proportion of truly active elements that are identified as active, and the FPR is the proportion of truly inactive elements that are incorrectly identified as active. The average model size (AMS) measures how many active elements are identified; compared to the TPR and the FPR, the AMS gives an overall indicator of identification accuracy. The last measurement is the average of the mean squared prediction errors (MSPEs), which reports the prediction ability of the model. The latter two measurements are defined as follows:

$$AMS = \frac{1}{N}\sum_{j=1}^{N} \big(\text{model size in the } j\text{th replicate}\big),$$

where $N$ is the number of replicates, and

$$MSPE = \frac{1}{mN}\sum_{i=1}^{m}\sum_{j=1}^{N}\big(Y_{T+1,i,j} - \hat{Y}_{T+1,i,j}\big)^2.$$

Last, the average CPU time per replication is also reported in seconds. Here, we run R code for the proposed VB method and the MATLAB code of the VAGSA. All codes are run on a PC with an Intel(R) Core(TM) i7-4770 CPU @ 3.40 GHz and 16.0 GB of RAM.
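For concreteness, the four measurements can be computed as in the sketch below; this is our own formalization of the definitions above, with `active_true` and `active_est` as hypothetical boolean masks of the true and identified structures.

```python
import numpy as np

def structure_metrics(active_true, active_est):
    """TPR, FPR, and model size from boolean masks of the coefficients."""
    tp = np.sum(active_true & active_est)
    fp = np.sum(~active_true & active_est)
    tpr = tp / np.sum(active_true)
    fpr = fp / np.sum(~active_true)
    return tpr, fpr, int(np.sum(active_est))

def mspe(Y_true, Y_pred):
    """Mean squared prediction error over nodes (i) and replicates (j).

    Y_true, Y_pred: (N x m) arrays holding the held-out point per replicate.
    """
    N, m = Y_true.shape
    return np.sum((Y_true - Y_pred) ** 2) / (m * N)
```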
The results of all cases are summarized in Table 1. For the medium-sized networks with $m = 10$ and $m = 20$, both the VB and VAGSA results are reported. For the large-scale networks with $m = 50$, only the VB results are reported, and the measurements for the VAGSA are denoted by "-".

Table 1: Numerical comparison between the VB method and the VAGSA in terms of the TPR (%), FPR (%), AMS, MSPE, and average CPU time for each simulation case.

Consider first the performances for the medium-sized networks. According to Table 1, both Bayesian methods achieve similar structural identification and inference accuracy for medium-sized networks. In particular, the TPRs are nearly perfect, with values close to 100%, implying that almost all active elements are successfully identified. In terms of the AMS, both methods share similar values, all close to the true model sizes. In addition, the VB method performs better in terms of the FPR than the alternative VAGSA; for the UG and SG cases, the FPR values of the VB method are less than a quarter of those of the VAGSA. The VAGSA performs slightly better in the four NG cases; however, the differences between the two methods are minor. Checking the selection results in detail, we find that the VB approach might fail to identify a variable with a small coefficient in the NG cases. Finally, consider the prediction ability. The VB method has slightly better accuracy than the VAGSA in most cases, as reflected in the MSPE. Most importantly, there is a dramatic improvement in the average CPU time when using the VB method: in general, it needs only about half of the computational cost required by the VAGSA, and for some cases with an identity noise covariance it saves even more CPU time, without sacrificing accuracy.

The good performance of the VB method continues for the large-scale networks, for which the VB method remains computationally feasible. Figures 6 and 7 show the average estimates of the coefficient matrices for the two different covariance matrices; these estimated coefficient matrices are all close to the true matrices. Take the UG case as an example: the structure identification results are perfect, with a TPR of 100% and an FPR of 0%, and the AMS is still close to the true value. In the NG case with the covariance matrix $\Sigma_3$, the corresponding TPR is slightly lower than the TPRs in the other cases. This may be because the VB method approximates the true posterior distribution, and thus active variables with small coefficients might not be detected. Overall, the proposed VB method performs well in identifying the active structures in the large-scale network cases. In addition, the average CPU times show that the case with the highest-level structure, namely UG, requires the least computational time, followed by the SG and NG cases.

Figure 6: Estimated coefficient matrices with $\Sigma = I_{50}$ for the m50UG, m50SG, and m50NG cases.

Figure 7: Estimated coefficient matrices with $\Sigma = \Sigma_3$ for the m50UG, m50SG, and m50NG cases.

5 Empirical Study
We apply the VB approach to estimate the temporal dependence of the German natural gas flow network and to perform day-ahead forecasting within the NAR/VAR framework. The data cover two years, from 1st October 2013 to 30th September 2015. The gas flows contain both inflows (supply) and outflows (demand) recorded 7 days a week at 51 distribution nodes belonging to 4 categories with different functions. There are 34 municipal nodes that provide gas to local residential areas and small business districts. The 11 industry nodes are responsible for factory production. Moreover, there is 1 border node, which is important because border nodes in Germany serve as network transfer points for natural gas imported and exported via Germany. The remaining 5 nodes are categorized as "others" and serve as switch nodes or perform other functions. Figure 8 displays the time series of the 51 nodes, which show different dynamic patterns. For further analysis, we normalize the values of the flows. In addition, note that the data at the municipal nodes have been seasonally adjusted via the daily temperature, and 3 nodes in the "others" category, O3, O4, and O5, have also been seasonally adjusted.

Figure 8: Normalized daily gas flow values by node type: municipal (34 nodes), industrial (11 nodes), border (1 node), and others (5 nodes).

We adopt the NAR/VAR model with the segmentation structure to analyze the gas network data. Here, the segmented grouping structures are defined based on the types of the nodes involved; that is, each type of node is treated as a group. In addition, we set the number of lags $p$ to 14 to incorporate serial dependence up to two weeks.

We use the data from Oct 1st, 2013, to March 31, 2015, as the in-sample training data. The threshold value of the stopping criterion for the proposed VB method is set to a small value, and the other initial settings are the same as those we set in the simulation studies in Section 4. After learning the model via the VB method, we perform one-step-ahead forecasts. That is, at each daily point, we shift one more day, use the expanded sample to retrain the model coefficients based on the active segmentation structures, and perform a one-day-ahead forecast. We repeat this iterative forecast for the remaining 0.5 years.

Overall, we obtain 183 forecasts. To illustrate the performance of the proposed VB method, in addition to the mean absolute percentage error (MAPE), the normalized root mean squared error (NRMSE) over the 51 nodes is also used. Both measures are defined as follows:

$$MAPE = \frac{100}{183 \times 51}\sum_{t=547}^{729}\sum_{i=1}^{51}\frac{|Y_{t+1,i} - \hat{Y}_{t+1,i}|}{|Y_{t+1,i}|}, \qquad (7)$$

$$NRMSE = 100 \times \left(\sum_{t=547}^{729}\sum_{i=1}^{51}\big(Y_{t+1,i} - \hat{Y}_{t+1,i}\big)^2\right)^{1/2} \Big/ \sum_{t=547}^{729}\sum_{i=1}^{51} Y_{t+1,i}. \qquad (8)$$
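The sketch below translates Eqs. (7) and (8), as reconstructed above, into NumPy; it is our own illustration rather than the code used in the study.

```python
import numpy as np

def mape(Y_true, Y_pred):
    """Eq. (7): mean absolute percentage error in percent.

    Y_true, Y_pred: (n_days x n_nodes) arrays of realized and forecast flows.
    """
    return 100.0 * np.mean(np.abs(Y_true - Y_pred) / np.abs(Y_true))

def nrmse(Y_true, Y_pred):
    """Eq. (8): root sum of squared errors, normalized by the total flow."""
    return 100.0 * np.sqrt(np.sum((Y_true - Y_pred) ** 2)) / np.sum(Y_true)
```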
Figure 9 displays the estimated coefficient matrices for lag-1 to lag-3, and these three coefficient matrices are quite sparse. For lag-1, it can be observed that the 4th and 14th rows, corresponding to municipal nodes M4 and M14, affect the other municipal nodes. In the 35th row, an industry node, I1, influences the other industry nodes but not the other three types. The 51st row corresponds to node O5; the "others" nodes only have relations with the municipal nodes. On the other hand, the only border node is related only to itself and has no interactions with the others. For lag-2, only the row of I1 influences nodes in another category. For lag-3, some nodes have minor influences on themselves. As the lag increases, fewer and fewer nodes influence others. Furthermore, there is no more serial dependence in the network after lag-10.

Figure 9: The estimated coefficient matrices for lag-1 to lag-3.

Table 2 reports the overall model fitting and forecasting performances via the obtained MAPE and NRMSE values; the details of the individual nodes are shown in the Appendix. Basically, the VB method delivers good performances for both the in-sample training period from 1st October 2014 to 31st March 2015 and the out-of-sample forecasting period from 1st April 2015 to 30th September 2015, and the corresponding NRMSE values are all small. Thus, we can claim that the NAR/VAR model with segmentation structures fits the in-sample training data well and provides good prediction capability for out-of-sample forecasting.
Table 2: The MAPEs and NRMSEs for the fitted and forecast results classified by type.

In sample from 01/10/2014 to 31/03/2015

Type        Number   MAPE (%)   Range (%)        SD      NRMSE   Range          SD
Municipal   34       6.28       (3.20, 14.70)    1.93    0.09    (0.04, 0.36)   0.05
Industrial  11       11.69      (0.89, 34.22)    11.74   0.12    (0.01, 0.31)   0.10
Border      1        -          -                -       -       -              -
Others      5        -          -                -       -       -              -

Out of sample from 01/04/2015 to 30/09/2015

Type        Number   MAPE (%)   Range (%)        SD      NRMSE   Range          SD
Municipal   34       7.89       (4.66, 11.85)    1.60    0.10    (0.06, 0.15)   0.02
Industrial  11       15.98      (1.70, 54.57)    19.00   0.12    (0.02, 0.30)   0.10
Border      1        -          -                -       -       -              -
Others      5        -          -                -       -       -              -

According to Table A1 in the Appendix, most nodes have small mean values and standard deviations with respect to the MAPE and NRMSE. For nodes I5, I8, and I10, both the MAPE and NRMSE values are large. A possible reason for this phenomenon may be the violation of the stationarity assumption in the NAR/VAR model, because the corresponding trends of the daily data change frequently. In addition, in the "others" type, due to some extreme values on a few special days, node O1 may not have good model fitting results. When we consider the out-of-sample forecasting performance, the corresponding MAPE and NRMSE values share patterns similar to those of the in-sample training results. Nodes I5, I8, and I10 still have large values for both measurements. However, node O1 does have good MAPE and NRMSE values, because there are no extreme values in the prediction period. Overall, the out-of-sample forecasting performance is acceptable, as the average MAPE and NRMSE values are 9.78% and 0.11, respectively.

6 Conclusion
In this paper, we focus on Bayesian analysis with NAR/VAR models, especially when many time series are involved. First, to limit the model complexity, the model structure assumptions in Song & Bickel (2011) are adopted. Then, model inference can be performed via a Bayesian structure selection approach. Here, to denote active structures, indicators are added to the NAR/VAR model. Instead of generating the posterior samples of these indicators via an MCMC algorithm, we directly obtain a proper approximation of the posterior density function via a variational Bayesian approach. Based on a factorized approximation assumption for the latent structures, a variational EM-type algorithm is used to obtain the best approximation; then, according to the approximated posterior probabilities of the indicators, the median probability criterion is used to determine the active structures in the corresponding coefficient matrices. The simulation results support the notion that the proposed variational Bayesian approach not only identifies the proper active structures in NAR/VAR models but also significantly reduces the computational cost. Finally, German gas flow network data with 51 nodes are analyzed to illustrate the performance of the proposed method. The analytical results of the proposed VB method reveal the trends of the nodes, which may be useful for assisting operators with performing appropriate operations.

There are several possible future research directions. The first concerns the parameter tuning strategy for the proposed VB method. In our experience, the performance may be sensitive to the chosen initial parameters, especially the prior probabilities $\pi_1$ and $\pi_2$. Following Ormerod et al. (2017) and Zhang et al. (2019), we suggest setting the $\pi_i$ to small probabilities, as this tends to select a compact model. However, the model may then fail to identify active variables with small coefficients; for example, in the simulated NG case with $m = 50$ and the covariance matrix $\Sigma_3$, we set $\pi_1$ and $\pi_2$ to a larger value instead of 0.01. Thus, choosing proper initial setups is an important issue for the proposed VB method. The cross-validation approach could be examined as one possible strategy, and other possibilities are related to proper information criteria. The second research direction is to take the group sparsity assumption into account; that is, the variables within an active segment still satisfy the elementwise sparsity assumption. Following Chen et al. (2016), we would need to add new indicators for the variables within each segment. The idea of the variational Bayesian approach in Cai et al. (2020) could be modified for analyses using NAR models. In our real example, the segmentation structures are defined based on the types of the nodes present. However, these prespecified gas station types may not yield the best segmentation structures. Following Chu et al. (2019), we may apply a data-driven clustering approach to identify other segmentation structures for the corresponding NAR models. Thus, it would be an interesting research direction to integrate a clustering algorithm with the variational Bayesian approach; that is, to identify the proper segmentation structures and determine the active structures simultaneously.

References
Bańbura, M., Giannone, D., & Reichlin, L. (2010). Large Bayesian vector auto regressions. Journal of Applied Econometrics, 25(1), 71–92.

Barbieri, M. M., & Berger, J. O. (2004). Optimal predictive model selection. The Annals of Statistics, 32, 870–897.

Basu, S., & Michailidis, G. (2015). Regularized estimation in sparse high-dimensional time series models. The Annals of Statistics, 43(4), 1535–1567.

Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer.

Cai, M., Dai, M., Ming, J., Peng, H., Liu, J., & Yang, C. (2020). BIVAS: a scalable Bayesian method for bi-level variable selection with applications. Journal of Computational and Graphical Statistics, 29(1), 40–52.

Carbonetto, P., & Stephens, M. (2012). Scalable variational inference for Bayesian variable selection in regression, and its accuracy in genetic association studies. Bayesian Analysis, 7, 73–108.

Chang, S.-M., Chen, R.-B., & Chi, Y. (2016). Bayesian variable selections for probit models with componentwise Gibbs samplers. Communications in Statistics - Simulation and Computation, 45(8), 2752–2766.

Chen, R.-B., Chu, C.-H., Lai, T.-Y., & Wu, Y. N. (2011). Stochastic matching pursuit for Bayesian variable selection. Statistics and Computing, 21, 247–259.

Chen, R.-B., Chu, C.-H., Yuan, S., & Wu, Y. N. (2016). Bayesian sparse group selection. Journal of Computational and Graphical Statistics, 25, 665–683.

Chen, Y., Koch, T., Zakiyeva, N., & Zhu, B. (2020). Modeling and forecasting the dynamics of the natural gas transmission network in Germany with the demand and supply balance constraint. Applied Energy, 276, 115597.

Chu, C.-H., Lo Huang, M.-N., Huang, S.-F., & Chen, R.-B. (2019). Bayesian structure selection for vector autoregression model. Journal of Forecasting, 38, 422–439.

Fan, J., Feng, Y., & Wu, Y. (2009). Network exploration via the adaptive LASSO and SCAD penalties. The Annals of Applied Statistics, 3(2), 521–541.

Farcomeni, A. (2010). Bayesian constrained variable selection. Statistica Sinica, 20, 1043–1062.

George, E. I., & McCulloch, R. E. (1993). Variable selection via Gibbs sampling. Journal of the American Statistical Association, 88(423), 881–889.

Geweke, J. (1996). Variable selection and model comparison in regression. In Bayesian Statistics, 5, 609–620.

Guo, J., Levina, E., Michailidis, G., & Zhu, J. (2011). Joint estimation of multiple graphical models. Biometrika, 98(1), 1–15.

Hsu, N.-J., Hung, H.-L., & Chang, Y.-M. (2008). Subset selection for vector autoregressive processes using LASSO. Computational Statistics & Data Analysis, 52(7), 3645–3657.

Kastner, G., & Huber, F. (2020). Sparse Bayesian vector autoregressions in huge dimensions. Journal of Forecasting.

Liu, H., Han, F., Yuan, M., Lafferty, J., & Wasserman, L. (2012). High-dimensional semiparametric Gaussian copula graphical models. The Annals of Statistics, 40(4), 2293–2326.

Lütkepohl, H. (2007). General-to-specific or specific-to-general modelling? An opinion on current econometric terminology. Journal of Econometrics, 136(1), 319–324.

Melnyk, I., & Banerjee, A. (2016). Estimating structured vector autoregressive models. International Conference on Machine Learning, 830–839.

Nicholson, W. B., Matteson, D. S., & Bien, J. (2014). Structured regularization for large vector autoregressions. Cornell University.

Ormerod, J. T., You, C., & Müller, S. (2017). A variational Bayes approach to variable selection. Electronic Journal of Statistics, 11(2), 3549–3594.

Song, S., & Bickel, P. J. (2011). Large vector auto regressions. arXiv preprint arXiv:1106.3915.

Titsias, M. K., & Lázaro-Gredilla, M. (2011). Spike and slab variational inference for multi-task and multiple kernel learning. Advances in Neural Information Processing Systems, 24, 2339–2347.

Zhang, C.-X., Xu, S., & Zhang, J.-S. (2019). A novel variational Bayesian method for variable selection in logistic regression models. Computational Statistics & Data Analysis, 133, 1–19.

Appendix
Table A1: The MAPEs and NRMSEs for the in-sample training period from 01/10/2014 to 31/03/2015, classified by type: per-node MAPE (%) and NRMSE values, with their means and standard deviations, for the 34 municipal, 11 industry, 1 border, and 5 "others" nodes. The same per-node layout is repeated for the out-of-sample forecasting period from 01/04/2015 to 30/09/2015.

Supplementary Document
B.1 Variational EM algorithm for the NAR model
In this supplementary document, we provide more details about the EM algorithm for the VB approach with respect to the NAR/VAR model.
B.1.1 E-Step
Let $\theta = \{\pi_1, \pi_2, \Sigma, \sigma_B^2\}$ be the collection of NAR model parameters and $\{\boldsymbol{\eta}, \boldsymbol{\gamma}, \widetilde{\mathbf{B}}\}$ be the set of latent variables. The joint probability of $\mathbf{Y}, \boldsymbol{\eta}, \boldsymbol{\gamma}, \widetilde{\mathbf{B}}$ is as follows:

$$P(\mathbf{Y}, \boldsymbol{\eta}, \boldsymbol{\gamma}, \widetilde{\mathbf{B}} \mid \mathbf{X}, \theta) = P(\mathbf{Y} \mid \boldsymbol{\eta}, \boldsymbol{\gamma}, \widetilde{\mathbf{B}}, \mathbf{X}, \theta)\, P(\boldsymbol{\eta}, \boldsymbol{\gamma}, \widetilde{\mathbf{B}} \mid \mathbf{X}, \theta) = MN_{T\times m}(\mathbf{X}\mathbf{B}, I, \Sigma) \prod_{\ell=1}^{p}\prod_{i=1}^{m} N(0, \sigma_B^2)\, \pi_1^{\gamma_{\ell,i}}(1-\pi_1)^{(1-\gamma_{\ell,i})} \prod_{k=1}^{g} MN_{1\times|\tilde{s}_k|}(\mathbf{0}, I, \sigma_B^2 I_{|\tilde{s}_k|})\, \pi_2^{\eta_{\ell,i,k}}(1-\pi_2)^{(1-\eta_{\ell,i,k})}.$$

As mentioned before, we maximize the lower bound $L(q)$ with respect to the approximate density function $q$. In the reparametrized spike-and-slab prior, each pair of variables $\{\widetilde{B}_{\ell,i,i}, \gamma_{\ell,i}\}$ and $\{\widetilde{B}_{\ell,i,\tilde{s}_k}, \eta_{\ell,i,k}\}$ is strongly correlated, since their product is the underlying variable that interacts with the data. Thus, a sensible approximation must treat each pair as a unit, so that $\{\widetilde{B}_{\ell,i,i}, \gamma_{\ell,i}\}$ and $\{\widetilde{B}_{\ell,i,\tilde{s}_k}, \eta_{\ell,i,k}\}$ are placed in the same factor of the variational distribution. The simplest factorization that achieves this is

$$q(\boldsymbol{\eta}, \boldsymbol{\gamma}, \widetilde{\mathbf{B}}) = \prod_{\ell=1}^{p}\prod_{i=1}^{m}\prod_{k=1}^{g} q_{\ell,i}(\widetilde{B}_{\ell,i,i}, \gamma_{\ell,i})\, q_{\ell,i,k}(\widetilde{B}_{\ell,i,\tilde{s}_k}, \eta_{\ell,i,k}), \qquad (9)$$

which can be written as

$$q(\boldsymbol{\eta}, \boldsymbol{\gamma}, \widetilde{\mathbf{B}}) = \prod_{\ell=1}^{p}\prod_{i=1}^{m}\prod_{k=1}^{g} q_{\ell,i}(\widetilde{B}_{\ell,i,i} \mid \gamma_{\ell,i})\, q_{\ell,i}(\gamma_{\ell,i})\, q_{\ell,i,k}(\widetilde{B}_{\ell,i,\tilde{s}_k} \mid \eta_{\ell,i,k})\, q_{\ell,i,k}(\eta_{\ell,i,k}),$$

where $q_{\ell,i}(\widetilde{B}_{\ell,i,i} \mid \gamma_{\ell,i})$, $q_{\ell,i,k}(\widetilde{B}_{\ell,i,\tilde{s}_k} \mid \eta_{\ell,i,k})$, $q_{\ell,i}(\gamma_{\ell,i})$, and $q_{\ell,i,k}(\eta_{\ell,i,k})$ are the approximated posterior distributions of $\widetilde{B}_{\ell,i,i} \mid \gamma_{\ell,i}$, $\widetilde{B}_{\ell,i,\tilde{s}_k} \mid \eta_{\ell,i,k}$, $\gamma_{\ell,i}$, and $\eta_{\ell,i,k}$, respectively, obtained from the prior distributions. We have assumed that the segments are independent and that the elements with their own lags are also independent. With this assumption, we rewrite the lower bound as

$$L(q) = E_{q(\boldsymbol{\gamma}), q(\boldsymbol{\eta})}\Big[E_{q(\widetilde{\mathbf{B}} \mid \boldsymbol{\eta}, \boldsymbol{\gamma})}\big[\log P(\mathbf{Y}, \boldsymbol{\eta}, \boldsymbol{\gamma}, \widetilde{\mathbf{B}} \mid \mathbf{X}, \theta) - \log q(\boldsymbol{\eta}, \boldsymbol{\gamma}, \widetilde{\mathbf{B}})\big]\Big].$$
Expanding the expectations,
$$L(q) = \sum_{\boldsymbol{\gamma}}\prod_{\ell}\prod_{i} q(\gamma_{\ell,i}) \sum_{\boldsymbol{\eta}}\prod_{\ell}\prod_{i}\prod_{k} q(\eta_{\ell,i,k}) \int_{\tilde{\mathbf{B}}} \Big(\log P(\mathbf{Y}, \boldsymbol{\gamma}, \boldsymbol{\eta}, \tilde{\mathbf{B}} \mid \mathbf{X}, \theta) - \log q(\boldsymbol{\gamma}, \boldsymbol{\eta}, \tilde{\mathbf{B}})\Big) \prod_{\ell}\prod_{i}\prod_{k} q(\tilde{B}_{\ell,i,\tilde{s}_k} \mid \eta_{\ell,i,k}) \prod_{\ell}\prod_{i} q(\tilde{B}_{\ell,i,i} \mid \gamma_{\ell,i})\, d\tilde{\mathbf{B}}.$$
Interchanging the sums and integrals and isolating the factor associated with a particular pair $(\ell, i)$, all terms not involving $\{\tilde{B}_{\ell,i,i}, \gamma_{\ell,i}\}$ can be collected into a constant, which gives
$$L(q) = E_{q(\boldsymbol{\eta}), q(\tilde{\mathbf{B}}_{\boldsymbol{\eta}} \mid \boldsymbol{\eta})}\Big[E_{q(\tilde{B}_{\ell,i,i} \mid \gamma_{\ell,i}=0)}\Big[E_{\ell'\neq\ell,i}\big(\log P(\mathbf{Y}, \tilde{\mathbf{B}}, \boldsymbol{\gamma}_{\ell',i}, \gamma_{\ell,i}=0, \boldsymbol{\eta} \mid \mathbf{X}, \theta)\big) + E_{\ell,i'\neq i}\big(\log P(\mathbf{Y}, \tilde{\mathbf{B}}, \boldsymbol{\gamma}_{\ell,i'}, \gamma_{\ell,i}=0, \boldsymbol{\eta} \mid \mathbf{X}, \theta)\big) - \log q(\tilde{B}_{\ell,i,i} \mid \gamma_{\ell,i}=0)\Big]\Big]\, q(\gamma_{\ell,i}=0)$$
$$+\; E_{q(\boldsymbol{\eta}), q(\tilde{\mathbf{B}}_{\boldsymbol{\eta}} \mid \boldsymbol{\eta})}\Big[E_{q(\tilde{B}_{\ell,i,i} \mid \gamma_{\ell,i}=1)}\Big[E_{\ell'\neq\ell,i}\big(\log P(\mathbf{Y}, \tilde{\mathbf{B}}, \boldsymbol{\gamma}_{\ell',i}, \gamma_{\ell,i}=1, \boldsymbol{\eta} \mid \mathbf{X}, \theta)\big) + E_{\ell,i'\neq i}\big(\log P(\mathbf{Y}, \tilde{\mathbf{B}}, \boldsymbol{\gamma}_{\ell,i'}, \gamma_{\ell,i}=1, \boldsymbol{\eta} \mid \mathbf{X}, \theta)\big) - \log q(\tilde{B}_{\ell,i,i} \mid \gamma_{\ell,i}=1)\Big]\Big]\, q(\gamma_{\ell,i}=1) + \mathrm{constant},$$
where $E_{\ell'\neq\ell,i}(\cdot)$ denotes the expectation taken with respect to all the $i$th-column variables except lag $\ell$, and $E_{\ell,i'\neq i}(\cdot)$ denotes the expectation taken over all lag-$\ell$ variables except the $i$th column. For a fixed $\gamma_{\ell,i}=1$, the part of $L(q)$ involving $q(\tilde{B}_{\ell,i,i} \mid \gamma_{\ell,i}=1)$ can be represented as
$$E_{q(\boldsymbol{\eta}), q(\tilde{\mathbf{B}}_{\boldsymbol{\eta}} \mid \boldsymbol{\eta})}\Big[E_{q(\tilde{B}_{\ell,i,i} \mid \gamma_{\ell,i}=1)}\Big[E_{\ell'\neq\ell,i}\big(\log P(\mathbf{Y}, \tilde{\mathbf{B}}, \boldsymbol{\gamma}_{\ell',i}, \gamma_{\ell,i}=1, \boldsymbol{\eta} \mid \mathbf{X}, \theta)\big) + E_{\ell,i'\neq i}\big(\log P(\mathbf{Y}, \tilde{\mathbf{B}}, \boldsymbol{\gamma}_{\ell,i'}, \gamma_{\ell,i}=1, \boldsymbol{\eta} \mid \mathbf{X}, \theta)\big) - \log q(\tilde{B}_{\ell,i,i} \mid \gamma_{\ell,i}=1)\Big]\Big].$$
This expression can be treated as the negative KL divergence between $q(\tilde{B}_{\ell,i,i} \mid \gamma_{\ell,i}=1)$ and the density proportional to the exponential of $E_{\ell'\neq\ell,i}\big(\log P(\mathbf{Y}, \tilde{\mathbf{B}}, \boldsymbol{\gamma}_{\ell',i}, \gamma_{\ell,i}=1, \boldsymbol{\eta} \mid \mathbf{X}, \theta)\big) + E_{\ell,i'\neq i}\big(\log P(\mathbf{Y}, \tilde{\mathbf{B}}, \boldsymbol{\gamma}_{\ell,i'}, \gamma_{\ell,i}=1, \boldsymbol{\eta} \mid \mathbf{X}, \theta)\big)$.
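The step from this negative KL form to the update below is the standard mean-field argument; stated in general form (a textbook identity, not a result specific to this paper), for any factor $q_j$ of a factorized $q$,
$$\log q_j^{*}(z_j) = E_{q_{-j}}\big[\log P(\mathbf{Y}, \mathbf{z} \mid \mathbf{X}, \theta)\big] + \mathrm{const},$$
where $q_{-j}$ collects all factors other than $q_j$ and the constant normalizes $q_j^{*}$ to integrate to one. With all other factors held fixed, the KL divergence vanishes exactly at this choice, so the bound is maximized factor by factor.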
Thus, when
$$\log\big(q(\tilde{B}_{\ell,i,i} \mid \gamma_{\ell,i}=1)\big) = E_{\ell'\neq\ell,i}\big(\log P(\mathbf{Y}, \tilde{\mathbf{B}}, \boldsymbol{\gamma}_{\ell',i}, \gamma_{\ell,i}=1, \boldsymbol{\eta} \mid \mathbf{X}, \theta)\big) + E_{\ell,i'\neq i}\big(\log P(\mathbf{Y}, \tilde{\mathbf{B}}, \boldsymbol{\gamma}_{\ell,i'}, \gamma_{\ell,i}=1, \boldsymbol{\eta} \mid \mathbf{X}, \theta)\big), \qquad (10)$$
we obtain the best approximation $q(\tilde{B}_{\ell,i,i} \mid \gamma_{\ell,i}=1)$. The cases for $\gamma_{\ell,i}=0$, $\eta_{\ell,i,k}=1$, and $\eta_{\ell,i,k}=0$ can be derived by the same procedure. Since $\gamma_{\ell,i}$ and $\eta_{\ell,i,k}$ follow independent Bernoulli distributions, with Eq. (10) we can introduce variational parameters for $q(\gamma_{\ell,i})$ and $q(\eta_{\ell,i,k})$ and then derive the conditional distributions of $\tilde{B}_{\ell,i,i}$ given $\gamma_{\ell,i}$ and of $\tilde{B}_{\ell,i,\tilde{s}_k}$ given $\eta_{\ell,i,k}$. Finally, we optimize $L(q)$ to find the variational parameters.

First, we derive $q(\tilde{B}_{\ell,i,i} \mid \gamma_{\ell,i})$ and $q(\tilde{B}_{\ell,i,\tilde{s}_k} \mid \eta_{\ell,i,k})$, which involve the joint probability function. To find the optimal form of Eq. (10), we rearrange the joint probability function and retain only the terms involving $\ell$, $i$, and $\tilde{s}_k$ as follows:
$$\log\big(P(\mathbf{Y}, \tilde{\mathbf{B}}, \boldsymbol{\gamma}, \boldsymbol{\eta} \mid \mathbf{X}; \theta)\big) = -\frac{Tm}{2}\log(2\pi) - \frac{T}{2}\log(\det(\Sigma)) - \frac{1}{2}\,\mathrm{tr}\big(\Sigma^{-1}(\mathbf{Y}-\mathbf{X}\mathbf{B})'(\mathbf{Y}-\mathbf{X}\mathbf{B})\big) \qquad (11)$$
$$- \frac{pm}{2}\big(\log(2\pi) + \log(\sigma_B^2)\big) - \frac{\sum_{\ell}\sum_{i}\tilde{B}_{\ell,i,i}^2}{2\sigma_B^2} - \frac{pm(m-1)}{2}\big(\log(2\pi) + \log(\sigma_B^2)\big) - \frac{1}{2}\sum_{\ell}\sum_{i}\sum_{k}\mathrm{tr}\big((\sigma_B^2 I_{|\tilde{s}_k|})^{-1}\tilde{B}_{\ell,i,\tilde{s}_k}'\tilde{B}_{\ell,i,\tilde{s}_k}\big)$$
$$+ \sum_{\ell}\sum_{i}\gamma_{\ell,i}\log(\pi_1) + \sum_{\ell}\sum_{i}(1-\gamma_{\ell,i})\log(1-\pi_1) + \sum_{\ell}\sum_{i}\sum_{k}\eta_{\ell,i,k}\log(\pi_0) + \sum_{\ell}\sum_{i}\sum_{k}(1-\eta_{\ell,i,k})\log(1-\pi_0)$$
$$\propto -\frac{T}{2}\log(\det(\Sigma)) - \frac{1}{2}\,\mathrm{tr}\big(\Sigma^{-1}(\mathbf{Y}-\mathbf{X}\mathbf{B})'(\mathbf{Y}-\mathbf{X}\mathbf{B})\big) - \frac{1}{2}\sum_{\ell}\sum_{i}\log(\sigma_B^2) - \frac{\sum_{\ell}\sum_{i}\tilde{B}_{\ell,i,i}^2}{2\sigma_B^2}$$
$$- \frac{1}{2}\sum_{\ell}\sum_{i}\sum_{k}\log\big(\det(\sigma_B^2 I_{|\tilde{s}_k|})\big) - \frac{1}{2}\sum_{\ell}\sum_{i}\sum_{k}\mathrm{tr}\big((\sigma_B^2 I_{|\tilde{s}_k|})^{-1}\tilde{B}_{\ell,i,\tilde{s}_k}'\tilde{B}_{\ell,i,\tilde{s}_k}\big)$$
$$+ \sum_{\ell}\sum_{i}\gamma_{\ell,i}\log\Big(\frac{\pi_1}{1-\pi_1}\Big) + \sum_{\ell}\sum_{i}\sum_{k}\eta_{\ell,i,k}\log\Big(\frac{\pi_0}{1-\pi_0}\Big) + pm\log(1-\pi_1) + pmg\log(1-\pi_0) + \mathrm{constant}.$$
We can derive $\log(q(\tilde{B}_{\ell,i,i} \mid \gamma_{\ell,i}))$ by taking the expectation in Eq. (10). When $\gamma_{\ell,i}=1$, we have
$$\log\big(q(\tilde{B}_{\ell,i,i} \mid \gamma_{\ell,i}=1)\big) = -\frac{1}{2}\Big[(\Sigma^{-1})_{i,i}\,(\mathbf{X}_{\ell}^{(i)})'\mathbf{X}_{\ell}^{(i)} + \frac{1}{\sigma_B^2}\Big]\tilde{B}_{\ell,i,i}^2 + (\Sigma^{-1})_{i,i}\,\tilde{B}_{\ell,i,i}'\Big[(\mathbf{X}_{\ell}^{(i)})'\big(\mathbf{Y}^{(i)} - \mathbf{X}_{-\ell}E(\mathbf{B}_{-\ell}^{(i)}) - \mathbf{X}_{\ell}^{(-i)}E(\mathbf{B}_{\ell}^{(-i,i)})\big)\Big] + E\Big(\mathrm{tr}\big((\Sigma^{-1})_{-i,i}\,\tilde{B}_{\ell,i,i}'(\mathbf{X}_{\ell}^{(i)})'(\mathbf{Y}^{(-i)} - (\mathbf{X}\mathbf{B})^{(-i)})\big)\Big) + \mathrm{constant}.$$
We can see that this is a quadratic form in $\tilde{B}_{\ell,i,i}$, so the posterior $q(\tilde{B}_{\ell,i,i} \mid \gamma_{\ell,i}=1)$ follows a normal distribution $N(\mu_{1,\ell,i,i}, \Sigma_{B_{\ell,i,i}})$, where
$$\Sigma_{B_{\ell,i,i}} = \Big((\mathbf{X}_{\ell}^{(i)})'\mathbf{X}_{\ell}^{(i)}(\Sigma^{-1})_{i,i} + \frac{1}{\sigma_B^2}\Big)^{-1},$$
$$\mu_{1,\ell,i,i} = \Sigma_{B_{\ell,i,i}}\Bigg((\Sigma^{-1})_{i,i}(\mathbf{X}_{\ell}^{(i)})'\Big(\mathbf{Y}^{(i)} - \sum_{j\neq\ell}^{p}\mathbf{X}_{j}E(\mathbf{B}_{j}^{(i)}) - \mathbf{X}_{\ell}^{(-i)}E(\mathbf{B}_{\ell}^{(-i,i)})\Big) + E\Big(\mathrm{tr}\big((\Sigma^{-1})_{-i,i}(\mathbf{X}_{\ell}^{(i)})'(\mathbf{Y}^{(-i)} - \mathbf{X}\mathbf{B}^{(-i)})\big)\Big)\Bigg).$$
Similarly, $\log(q(\tilde{B}_{\ell,i,\tilde{s}_k} \mid \eta_{\ell,i,k}))$ is obtained by taking the expectation in Eq. (10); when $\eta_{\ell,i,k}=1$, we have
$$\log\big(q(\tilde{B}_{\ell,i,\tilde{s}_k} \mid \eta_{\ell,i,k}=1)\big) = -\frac{1}{2}\,\mathrm{tr}\big((\Sigma^{-1})_{\tilde{s}_k,\tilde{s}_k}(\mathbf{X}_{\ell}^{(i)})'\mathbf{X}_{\ell}^{(i)}\tilde{B}_{\ell,i,\tilde{s}_k}'\tilde{B}_{\ell,i,\tilde{s}_k}\big) - \frac{1}{2}\,\mathrm{tr}\big((\sigma_B^2 I_{|\tilde{s}_k|})^{-1}\tilde{B}_{\ell,i,\tilde{s}_k}'\tilde{B}_{\ell,i,\tilde{s}_k}\big)$$
$$+ \mathrm{tr}\Big(\tilde{B}_{\ell,i,\tilde{s}_k}'(\mathbf{X}_{\ell}^{(i)})'\big(\mathbf{Y}^{(\tilde{s}_k)} - \mathbf{X}_{-\ell}E(\mathbf{B}_{-\ell}^{(\tilde{s}_k)}) - \mathbf{X}_{\ell}^{(-i)}E(\mathbf{B}_{\ell}^{(-i,\tilde{s}_k)})\big)(\Sigma^{-1})_{\tilde{s}_k,\tilde{s}_k}\Big) + E\Big(\mathrm{tr}\big(\tilde{B}_{\ell,i,\tilde{s}_k}'(\mathbf{X}_{\ell}^{(i)})'(\mathbf{Y}^{(-\tilde{s}_k)} - (\mathbf{X}\mathbf{B})^{(-\tilde{s}_k)})(\Sigma^{-1})_{-\tilde{s}_k,\tilde{s}_k}\big)\Big) + \mathrm{constant},$$
so the posterior $q(\tilde{B}_{\ell,i,\tilde{s}_k} \mid \eta_{\ell,i,k}=1)$ follows the $1\times|\tilde{s}_k|$ matrix normal distribution $N_{1\times|\tilde{s}_k|}(\boldsymbol{\mu}_{1,\ell,i,\tilde{s}_k}, \mathbf{I}, \Sigma_{B_{\ell,i,\tilde{s}_k}})$, in which
$$\Sigma_{B_{\ell,i,\tilde{s}_k}} = \Big((\mathbf{X}_{\ell}^{(i)})'\mathbf{X}_{\ell}^{(i)}(\Sigma^{-1})_{\tilde{s}_k,\tilde{s}_k} + \frac{1}{\sigma_B^2}I_{|\tilde{s}_k|}\Big)^{-1},$$
$$\boldsymbol{\mu}_{1,\ell,i,\tilde{s}_k} = \Bigg((\mathbf{X}_{\ell}^{(i)})'\Big(\mathbf{Y}^{(\tilde{s}_k)} - \sum_{j\neq\ell}^{p}\mathbf{X}_{j}E(\mathbf{B}_{j}^{(\tilde{s}_k)}) - \mathbf{X}_{\ell}^{(-i)}E(\mathbf{B}_{\ell}^{(-i,\tilde{s}_k)})\Big)(\Sigma^{-1})_{\tilde{s}_k,\tilde{s}_k} + E\Big(\mathrm{tr}\big((\mathbf{X}_{\ell}^{(i)})'(\mathbf{Y}^{(-\tilde{s}_k)} - \mathbf{X}\mathbf{B}^{(-\tilde{s}_k)})(\Sigma^{-1})_{-\tilde{s}_k,\tilde{s}_k}\big)\Big)\Bigg)\Sigma_{B_{\ell,i,\tilde{s}_k}}.$$
By the same process, when $\gamma_{\ell,i}=0$ and $\eta_{\ell,i,k}=0$, we have
$$q(\tilde{B}_{\ell,i,i} \mid \gamma_{\ell,i}=0) \sim N(0, \sigma_B^2), \qquad q(\tilde{B}_{\ell,i,\tilde{s}_k} \mid \eta_{\ell,i,k}=0) \sim N_{1\times|\tilde{s}_k|}(\mathbf{0}, \mathbf{I}, \sigma_B^2 I_{|\tilde{s}_k|}).$$
Let $\phi_{1,\ell,i}$ and $\phi_{1,\ell,i,k}$ denote the probabilities of $\gamma_{\ell,i}=1$ and $\eta_{\ell,i,k}=1$, respectively; then
$$q(\boldsymbol{\eta}, \boldsymbol{\gamma}, \tilde{\mathbf{B}}) = \prod_{\ell=1}^{p}\prod_{i=1}^{m}\big(\phi_{1,\ell,i}\,N(\mu_{1,\ell,i,i}, \Sigma_{B_{\ell,i,i}})\big)^{\gamma_{\ell,i}}\big((1-\phi_{1,\ell,i})\,N(0, \sigma_B^2)\big)^{(1-\gamma_{\ell,i})} \prod_{k=1}^{g}\big(\phi_{1,\ell,i,k}\,N_{1\times|\tilde{s}_k|}(\boldsymbol{\mu}_{1,\ell,i,\tilde{s}_k}, \mathbf{I}, \Sigma_{B_{\ell,i,\tilde{s}_k}})\big)^{\eta_{\ell,i,k}}\big((1-\phi_{1,\ell,i,k})\,N_{1\times|\tilde{s}_k|}(\mathbf{0}, \mathbf{I}, \sigma_B^2 I_{|\tilde{s}_k|})\big)^{(1-\eta_{\ell,i,k})},$$
where $\phi_{1,\ell,i}$ and $\phi_{1,\ell,i,k}$ are
$$\phi_{1,\ell,i} = \mathrm{Inv\text{-}logit}\Big\{\mathrm{logit}(\pi_1) - \frac{1}{2}\log(\sigma_B^2) + \frac{1}{2}\log\big(\det(\Sigma_{B_{\ell,i,i}})\big) + \frac{1}{2}(\Sigma_{B_{\ell,i,i}})^{-1}\mu_{1,\ell,i,i}^2\Big\}$$
and
$$\phi_{1,\ell,i,k} = \mathrm{Inv\text{-}logit}\Big\{\mathrm{logit}(\pi_0) - \frac{1}{2}\log\big(\det(\sigma_B^2 I_{|\tilde{s}_k|})\big) + \frac{1}{2}\log\big(\det(\Sigma_{B_{\ell,i,\tilde{s}_k}})\big) + \frac{1}{2}\mathrm{tr}\big((\Sigma_{B_{\ell,i,\tilde{s}_k}})^{-1}\boldsymbol{\mu}_{1,\ell,i,\tilde{s}_k}'\boldsymbol{\mu}_{1,\ell,i,\tilde{s}_k}\big)\Big\}.$$
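A minimal numerical sketch of these E-step updates for a single diagonal element (Python with NumPy; the data, the noise level, and the choice to set the cross-lag and cross-column expectation corrections to zero are toy assumptions made for brevity, not part of the paper's derivation):

import numpy as np

rng = np.random.default_rng(1)

# Toy inputs for one (l, i) pair: T observations of the lagged regressor and response.
T = 50
x_li = rng.normal(size=T)                            # column X_l^{(i)}: lag-l values of node i
y_i = 0.6 * x_li + rng.normal(scale=0.5, size=T)     # column Y^{(i)} of the response
sigma2_B, sigma2_inv_ii, pi1 = 1.0, 1.0 / 0.25, 0.3  # slab variance, (Sigma^{-1})_{i,i}, prior prob.
resid = y_i  # residual after removing other lags/columns (corrections assumed zero here)

# Posterior variance and mean of B_tilde_{l,i,i} given gamma_{l,i} = 1.
Sigma_B = 1.0 / (x_li @ x_li * sigma2_inv_ii + 1.0 / sigma2_B)
mu_1 = Sigma_B * sigma2_inv_ii * (x_li @ resid)

# Variational inclusion probability phi_{1,l,i} via the inverse logit.
logit = lambda p: np.log(p / (1.0 - p))
inv_logit = lambda z: 1.0 / (1.0 + np.exp(-z))
phi_1 = inv_logit(logit(pi1) - 0.5 * np.log(sigma2_B)
                  + 0.5 * np.log(Sigma_B) + 0.5 * mu_1**2 / Sigma_B)
print(Sigma_B, mu_1, phi_1)

With an active coefficient of 0.6 in the toy data, the posterior mean lands near 0.6 and phi_1 is pushed toward one, reflecting how the inverse-logit update trades the prior log-odds against the evidence in the quadratic term.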
B.1.2 M-Step

During the M-step, we update the parameters $\theta = \{\pi_0, \pi_1, \Sigma, \sigma_B^2\}$ by solving $\frac{\partial L(q)}{\partial \theta} = 0$. Considering $\pi_1$ and $\pi_0$, by setting $\frac{\partial L(q)}{\partial \pi_1} = 0$ and $\frac{\partial L(q)}{\partial \pi_0} = 0$, we obtain
$$\pi_1 = \frac{\sum_{\ell=1}^{p}\sum_{i=1}^{m}\phi_{1,\ell,i}}{pm}, \qquad \pi_0 = \frac{\sum_{\ell=1}^{p}\sum_{i=1}^{m}\sum_{k=1}^{g}\phi_{1,\ell,i,k}}{\sum_{k=1}^{g}|\tilde{s}_k|\,pm}.$$
For $\Sigma$ and $\sigma_B^2$, setting $\frac{\partial L(q)}{\partial \Sigma} = 0$ and $\frac{\partial L(q)}{\partial \sigma_B^2} = 0$, we obtain
$$\Sigma = \frac{E\big((\mathbf{Y}-\mathbf{X}\mathbf{B})'(\mathbf{Y}-\mathbf{X}\mathbf{B})\big)}{T}, \qquad \sigma_B^2 = \frac{\sum_{\ell=1}^{p}\sum_{i=1}^{m}\phi_{1,\ell,i}\big(\Sigma_{B_{\ell,i,i}} + \mu_{1,\ell,i,i}^2\big) + \sum_{\ell=1}^{p}\sum_{i=1}^{m}\sum_{k=1}^{g}\phi_{1,\ell,i,k}\,\mathrm{tr}\big(\Sigma_{B_{\ell,i,\tilde{s}_k}} + \boldsymbol{\mu}_{1,\ell,i,\tilde{s}_k}'\boldsymbol{\mu}_{1,\ell,i,\tilde{s}_k}\big)}{\sum_{\ell=1}^{p}\sum_{i=1}^{m}\Big(\phi_{1,\ell,i} + \sum_{k=1}^{g}|\tilde{s}_k|\,\phi_{1,\ell,i,k}\Big)}.$$
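These closed-form updates translate directly into code. The sketch below (Python with NumPy; the function name, all argument names, the array shapes, and the plug-in residual replacing the expectation in the $\Sigma$ update are illustrative assumptions) computes the four M-step quantities from the E-step outputs:

import numpy as np

def m_step(phi1_diag, phi1_grp, Sigma_B_diag, mu1_diag, tr_second_moment_grp,
           group_sizes, resid):
    """One illustrative M-step under plug-in assumptions.

    phi1_diag:  (p, m) array of probabilities phi_{1,l,i} for the diagonal indicators.
    phi1_grp:   (p, m, g) array of probabilities phi_{1,l,i,k} for the segment indicators.
    Sigma_B_diag, mu1_diag: (p, m) posterior variances/means of B_tilde_{l,i,i}.
    tr_second_moment_grp: (p, m, g) values of tr(Sigma_B + mu' mu) per segment.
    group_sizes: (g,) NumPy array of segment sizes |s_k|.
    resid: (T, m) plug-in residual Y - X E(B), standing in for the expectation.
    """
    p, m = phi1_diag.shape
    T = resid.shape[0]
    pi1 = phi1_diag.sum() / (p * m)
    pi0 = phi1_grp.sum() / (group_sizes.sum() * p * m)
    Sigma = resid.T @ resid / T              # plug-in for E((Y - XB)'(Y - XB)) / T
    num = (phi1_diag * (Sigma_B_diag + mu1_diag**2)).sum() \
          + (phi1_grp * tr_second_moment_grp).sum()
    den = phi1_diag.sum() + (phi1_grp * group_sizes).sum()  # segments weighted by |s_k|
    return pi1, pi0, Sigma, num / den        # updated {pi1, pi0, Sigma, sigma2_B}

Note how the denominator of the sigma2_B update weights each segment probability by its size $|\tilde{s}_k|$, matching the closed form above: each active segment contributes $|\tilde{s}_k|$ slab coordinates to the variance estimate, while each active diagonal element contributes one.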