High Dimensional Bayesian Network Classification with Network Global-Local Shrinkage Priors
Sharmistha Guha & Abel Rodriguez
September 25, 2020
Abstract
This article proposes a novel Bayesian classification framework for networks with labeled nodes. While the literature on statistical modeling of network data typically involves the analysis of a single network, the recent emergence of complex data in several biological applications, including brain imaging studies, presents a need to devise a network classifier for subjects. This article considers an application from a brain connectome study, where the overarching goal is to classify subjects into two separate groups based on their brain network data, along with identifying influential regions of interest (ROIs), referred to as nodes. Existing approaches either treat all edge weights as a long vector or summarize the network information with a few summary measures. Both of these approaches ignore the full network structure, may lead to less desirable inference in small samples, and are not designed to identify significant network nodes. We propose a novel binary logistic regression framework with the network as the predictor and a binary response, the network predictor coefficient being modeled using a novel class of network global-local shrinkage priors. The framework is able to accurately detect nodes and edges in the network influencing the classification. Our framework is implemented using an efficient Markov chain Monte Carlo algorithm. Theoretically, we show asymptotically optimal classification for the proposed framework when the number of network edges grows faster than the sample size. The framework is empirically validated by extensive simulation studies and the analysis of a brain connectome dataset.
Keywords: Brain connectome; High dimensional binary regression; Global-local shrinkage prior; Node selection; Network predictor; Posterior consistency.
1 Introduction

Of late, the statistical literature has paid substantial attention to the unsupervised analysis of a single network, thought to be generated from a variety of classic models, including random graph models Erdős and Rényi (1960), exponential random graph models Frank and Strauss (1986), social space models Hoff et al. (2002); Hoff (2005, 2009), and stochastic block models Nowicki and Snijders (2001). These models have found prominence in social networking applications where the nodes in the network are exchangeable. However, there are pertinent biological and physiological applications in which network nodes are labeled and a network is available corresponding to every individual. Section 6 presents one such example from a brain connectome study, where brain networks are available for multiple individuals who are classified as subjects with high or low IQ (Intelligence Quotient). In this study, the human brain has been divided according to the Desikan atlas Desikan et al. (2006), which identifies 34 cortical regions of interest (ROIs) in each of the left and right hemispheres of the human brain, implying 68 cortical ROIs in all. A brain network for each subject is represented by a symmetric adjacency matrix whose rows and columns are labeled corresponding to different ROIs (shared among the networks corresponding to all individuals) and whose entries correspond to estimates of the number of fibers connecting pairs of brain regions. The scientific goal in this setting pertains to developing a predictive rule for classifying a new subject as having low or high IQ based on his/her observed brain network with labeled nodes. Additionally, it is of specific interest for neuroscientists to identify influential brain regions (nodes in the brain network) and significant connections between different brain regions predictive of IQ.

Guha and Rodriguez (2020) discuss the network regression problem with a continuous response and an undirected network predictor. However, there are pertinent biological and physiological studies where a network along with a binary response is obtained for each subject. The goal of these studies is usually to classify the networks according to the binary response and to predict the associated binary response from a network. We refer to this problem as the network or graph classification problem. Additionally, Guha and Rodriguez (2020) focus on a specific network shrinkage prior, whereas this article generalizes the inference to a class of network global-local shrinkage priors, which includes the prior specification in Guha and Rodriguez (2020) as a special case.

Earlier literature on network or graph classification has been substantially motivated by the problem of classification of chemical compounds Srinivasan et al. (1996); Helma et al. (2001), where a graph represents a compound's molecular structure. In such analyses, certain discriminative patterns in a graph are identified and used as features for training a standard classification method Deshpande et al. (2005); Fei and Huan (2010). Another type of method is based on graph kernels Vishwanathan et al. (2010), which define a similarity measure between two networks. Both of these approaches are computationally feasible only for small networks, do not account for uncertainty, and do not facilitate identification of influential network nodes.
When the number of network nodes is moderately large, a common approach to network classification is to compute a few summary measures (average degree, clustering coefficient, or average path length) from the network and then apply standard classification methods to these summaries (see, e.g., Bullmore and Sporns (2009) and references therein). These procedures have recently been employed in exploring the relationship between the brain network and neuropsychiatric diseases, such as Parkinson's Olde Dubbelink et al. (2013) and Alzheimer's Daianu et al. (2013), but the analyses are sensitive to the chosen network topological measures, with substantially different results obtained for different types of summary statistics. Indeed, global summary statistics collapse all local network information, which can affect the accuracy of classification. Furthermore, identification of the impact of specific nodes on the response, which is of clear interest in our setting, is not feasible. As with network regression problems, an alternate approach vectorizes the network predictor, treating the edge weights together as a long vector, and then develops a high dimensional regression model with this long vector of edge weights as predictors Richiardi et al. (2011); Craddock et al. (2009); Zhang et al. (2012). This approach can take advantage of recent developments in high dimensional binary regression, from both the penalized optimization Tibshirani (1996) and Bayesian shrinkage Park and Casella (2008); Carvalho et al. (2010); Armagan et al. (2013a) perspectives. However, as mentioned in Guha and Rodriguez (2020), it treats the links of the network as exchangeable, ignoring the fact that coefficients involving common nodes can be expected to be correlated a priori. In a related work, Vogelstein et al. (2013) propose to look for a minimal set of nodes which best explains the difference between two groups of networks, which requires solving a combinatorial problem. Similarly, Durante and Dunson (2017) propose a high dimensional Bayesian tensor factorization model for a population of networks that allows testing for local edge differences between two groups of subjects. Both of these approaches tend to focus mainly on classification and are not designed to detect important nodes and edges impacting the response.

Our goal in this article is to develop a high-dimensional Bayesian network classifier that additionally infers influential nodes and edges impacting classification. To achieve this goal, we formulate a high dimensional logistic network regression model with the binary response regressed on the network predictor corresponding to each subject. The network predictor coefficient is assigned a prior from the class of
Bayesian network global-local shrinkage priors discussed in this article. The proposed prior imparts low-rank and near sparse structures a priori on the network predictor coefficient. The low-rank structure of the coefficient is designed to address the transitivity effect on the network predictor coefficient and captures the effect of network edge coefficients on classification due to the interaction between nodes. On the other hand, the near sparse structure accounts for the residual effect due to edges.

One important contribution of this article is a careful study of the asymptotic properties of the proposed binary network classification (BNC) framework. In particular, we focus on consistency properties for the posterior distribution of the BNC framework using a specific network global-local shrinkage prior, namely the
Bayesian Network Lasso prior. The theory of posterior contraction for high dimensional regression models has gained traction lately, though the literature is less developed for shrinkage priors than for point-mass priors. For example, Castillo et al. (2012) and Belitser and Nurushev (2015) have established posterior concentration and variable selection properties for certain point-mass priors in normal-means models. The latter article also establishes asymptotically nominal coverage of Bayesian credible sets. Results on posterior concentration and variable selection in high dimensional linear models are also established by Castillo et al. (2015) and Martin et al. (2017) for certain point-mass priors. In contrast, the literature on posterior contraction properties for high dimensional Bayesian shrinkage priors is relatively limited. To this end, Armagan et al. (2013b) were the first to show posterior consistency in the ordinary linear regression model with shrinkage priors in low-dimensional settings, under the assumption that the number of covariates does not exceed the number of observations. Using direct calculations, van der Pas et al. (2014) show that the posterior based on the ordinary horseshoe prior concentrates at the optimal rate for normal-means problems. Recently, Song and Liang (2017) consider a general class of continuous shrinkage priors and obtain posterior contraction rates in ordinary high dimensional linear regression models. In the same vein, Wei and Ghosal (2017) offer an analysis of posterior concentration for logistic regression models with shrinkage priors on coefficients. While Wei and Ghosal (2017) are the first to delineate a theoretical approach for ordinary high dimensional binary classification models with shrinkage priors, the study of posterior contraction properties for more structured binary network classification problems in the Bayesian paradigm has not appeared in the literature. In fact, developing the theory for Bayesian network classification with the Bayesian Network Lasso prior proposed in this article faces two major challenges. First, the novel Bayesian Network Lasso prior imparts a more complex prior structure (incorporating a low-rank structure in the prior mean of edge coefficients, as described in Guha and Rodriguez (2020)) than that in Wei and Ghosal (2017), introducing additional theoretical challenges. Second, we aim at proving a challenging but practically desirable result of asymptotically optimal classification when the number of edges in the network predictor grows at a super-linear rate as a function of the sample size. Both of these present obstacles which we overcome in this work. The theoretical results provide insights on how the number of nodes in the network predictor, or the sparsity in the true network predictor coefficients, should vary with the sample size n to achieve asymptotically optimal classification. We must mention that developing a similar theory for the Bayesian Network Horseshoe prior proposed in this article faces more challenges due to the complex prior structure in the parameters. We plan to tackle that problem as part of future work.

Section 2 develops the model and the prior distributions. Section 3 discusses theoretical developments justifying the asymptotically desirable prediction from the proposed model. Section 4 details posterior computation. Results from various simulation experiments and a brain connectome data analysis are presented in Sections 5 and 6, respectively.
Finally, Section 7 concludes the article with a brief discussion of the proposed methodology.

2 Model and Prior Distributions

In the context of network classification, we propose the high dimensional logistic regression model of the binary response $y_i \in \{0, 1\}$ on the undirected network predictor $A_i$ as
$$y_i \sim \mathrm{Ber}\left[\frac{\exp(\psi_i)}{1 + \exp(\psi_i)}\right], \qquad \psi_i = \mu + \langle A_i, \Gamma \rangle_F, \qquad (1)$$
where $\Gamma$ is a $V \times V$ symmetric network coefficient matrix whose $(k,l)$th element is given by $\gamma_{k,l}/2$, with $\gamma_{k,k} = 0$ for all $k = 1, ..., V$.
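To fix ideas, the following minimal sketch (not the authors' implementation) evaluates model (1) for a single subject; the matrices A and Gamma and the intercept mu below are illustrative inputs.

```r
# Evaluate the logistic network regression model (1) for one subject.
frobenius_inner <- function(A, Gamma) sum(A * Gamma)  # <A, Gamma>_F

set.seed(1)
V     <- 10
A     <- matrix(rnorm(V^2), V, V); A <- (A + t(A)) / 2; diag(A) <- 0
Gamma <- matrix(rnorm(V^2, sd = 0.1), V, V)
Gamma <- (Gamma + t(Gamma)) / 2; diag(Gamma) <- 0
mu    <- 2

psi <- mu + frobenius_inner(A, Gamma)  # linear predictor psi_i
p1  <- exp(psi) / (1 + exp(psi))       # P(y_i = 1)
y   <- rbinom(1, 1, p1)                # Bernoulli response
```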
Model (1) can be expressed in the form of a generalized linear model. To be more specific,
$$\langle A_i, \Gamma \rangle_F = \sum_{1 \le k < l \le V} a_{i,k,l}\,\gamma_{k,l},$$
so that
$$p(y_i \mid \mu, \gamma) = \frac{\exp(\psi_i)^{y_i}}{1 + \exp(\psi_i)}, \qquad (2)$$
$$\psi_i = \mu + x_i'\gamma, \qquad (3)$$
where $x_i$ collects the upper triangular entries of $A_i$ and $\gamma = (\gamma_{1,2}, ..., \gamma_{V-1,V})'$ is the corresponding vector of edge coefficients. To model $\gamma$, each node $k$ is endowed with a latent variable $u_k \in \mathbb{R}^R$, and we set $\Lambda = \mathrm{diag}(\lambda_1, ..., \lambda_R)$. The class of network global-local shrinkage priors centers each edge coefficient on a low-rank term,
$$\gamma_{k,l} \mid s_{k,l} \sim N(u_k'\Lambda u_l,\; s_{k,l}), \qquad (4)$$
with the distribution of the local scale $s_{k,l}$ determining the member of the class. In particular, taking $s_{k,l} = \lambda_{k,l}^2\sigma^2$ with $\lambda_{k,l} \sim C^+(0, 1)$ and $\sigma \sim C^+(0, 1)$ yields the Bayesian Network Horseshoe prior. The rest of the hierarchy on the $\lambda_r$'s and $u_k$'s follows as in Guha and Rodriguez (2020).

3 Theoretical Developments

This section establishes convergence results for (1) with the $\gamma_{k,l}$'s following the Bayesian Network Lasso shrinkage prior. From the hierarchical specification given in (4), the Bayesian Network Lasso shrinkage prior is given by
$$\gamma_{k,l} \mid s_{k,l} \sim N(u_k'\Lambda u_l,\; s_{k,l}), \qquad s_{k,l} \sim \mathrm{Exp}(\theta_n^2/2),$$
allowing the shrinkage parameter $\theta_n$ to vary as a function of $n$ Armagan et al. (2013a). Our theoretical investigation fixes $\theta_n$; the exact expression is given in Condition (F) in the next subsection.

Here we consider an asymptotic setting in which the number of nodes in the network predictor, $V_n$, grows with the sample size $n$. This paradigm attempts to capture the fact that the number of elements in $A_i$, given by $V_n(V_n - 1)/2$, can be substantially larger than the sample size. Since model (1) is equivalent to model (3), the size of the coefficient $\gamma$ in (3) is also a function of $n$, given by $q_n = V_n(V_n - 1)/2$. This creates theoretical challenges, related to (but distinct from) those faced in showing posterior consistency for high dimensional continuous Armagan et al. (2013a) and binary Wei and Ghosal (2017) regressions.

Let $y^n = (y_1, ..., y_n)'$. Using the superscript (0) to indicate true parameters, the true data generating model is given by
$$y_i \sim \mathrm{Bernoulli}\left[\frac{\exp(\psi_i^{(0)})}{1 + \exp(\psi_i^{(0)})}\right], \qquad \psi_i^{(0)} = \langle A_i, \Gamma^{(0)} \rangle_F, \qquad (5)$$
where $\Gamma^{(0)}$ is the true network coefficient. Let $\gamma^{(0)}$ be the vectorized upper triangular part of $\Gamma^{(0)}$. We assume $\gamma_{k,l}^{(0)} = u_k^{(0)\prime}\Lambda u_l^{(0)} + \gamma_{2,k,l}^{(0)}$, where $u_k^{(0)}$ is an $R_0$-dimensional vector, $k = 1, ..., V$. $\gamma_2^{(0)}$ is the vector of all $\gamma_{2,k,l}^{(0)}$, $k < l$, and we denote the number of nonzero elements of $\gamma_2^{(0)}$ by $s_{2,n}$, i.e., $\|\gamma_2^{(0)}\|_0 = s_{2,n}$.

For any $\epsilon > 0$, define
$$\mathcal{A}_n = \left\{ \gamma : \frac{1}{n}\sum_{i=1}^{n} \left| f_\gamma(x_i) - f_{\gamma^{(0)}}(x_i) \right| \le \epsilon \right\}$$
as a neighborhood around the true density. Further suppose $\pi_n(\cdot)$ and $\Pi_n(\cdot)$ are the prior and posterior densities of $\gamma$ with $n$ observations, so that
$$\Pi_n(\mathcal{A}_n^c) = \frac{\int_{\mathcal{A}_n^c} p_\gamma(y^n)\,\pi_n(\gamma)\,d\gamma}{\int p_\gamma(y^n)\,\pi_n(\gamma)\,d\gamma},$$
where $p_\gamma(y^n)$ denotes the likelihood of the $n$-dimensional response vector $y^n$.

To show the posterior contraction results, we follow Wei and Ghosal (2017) and Armagan et al. (2013a), with substantial modifications required due to the nature of our proposed network lasso prior distribution. In proving the results, we make a couple of simplifications. It is assumed that the dimension $R$ of $u_k$ is fixed and is the same as $R_0$, the dimension of $u_k^{(0)}$. Consequently, the effective dimensionality is not required to be estimated, and hence $\Lambda = I$ is a non-random matrix.
Additionally, we assume $M$ to be non-random and $M = I$. We emphasize that both these assumptions are not essential for the posterior contraction result to hold, and are introduced only to simplify calculations.

For two sequences $\{C_{1,n}\}_{n\ge 1}$ and $\{C_{2,n}\}_{n\ge 1}$, we write $C_{1,n} = o(C_{2,n})$ if $C_{1,n}/C_{2,n} \to 0$ as $n \to \infty$. To begin with, we state the following assumptions under which posterior contraction will be shown.

(A) $\sup_{r=1,...,R;\, k=1,...,V_n} |u_{k,r}^{(0)}| < \infty$;

(B) $V_n = o(n/\log(n))$;

(C) $\|A_i\|_\infty$ is bounded for all $i = 1, ..., n$; without loss of generality, assume $\|A_i\|_\infty \le 1$;

(D) $s_{2,n}\log(q_n) = o(n)$;

(E) $\|\gamma_2^{(0)}\|_\infty < \infty$;

(F) $\theta_n = C q_n n^{\rho/2}\sqrt{\log(n)}$ for some $C > 0$ and $\rho \in (1, 2)$.

Remark: Conditions (A), (C) and (E) are technical conditions ensuring that each of the entries in the true network coefficient and the network predictor is bounded. Condition (B) puts an upper bound on the growth of the number of network nodes with the sample size needed to achieve asymptotically optimal classification. Similarly, (D) puts a restriction on the number of nonzero elements of $\gamma_2^{(0)}$ with respect to $n$.

The following theorem shows contraction of the posterior asymptotically under mild sufficient conditions on $V_n$ and $s_{2,n}$. The proof of the theorem is provided in Appendix F.

Theorem 3.1 Under assumptions (A)-(F), for the Bayesian Network Lasso prior on $\gamma$, $\Pi_n(\mathcal{A}_n) \to 1$ in $P_{\gamma^{(0)}}$-probability as $n \to \infty$, for any $\epsilon > 0$.

4 Posterior Computation

We have implemented both the Bayesian Network Lasso and Network Horseshoe shrinkage priors on $\gamma$. Using the result in Polson et al. (2013), the data augmented representation of the distribution of $y_i$ given in (2) is
$$p(y_i \mid \omega_i) = 2^{-1}\exp(k_i\psi_i)\exp(-\omega_i\psi_i^2/2), \qquad \omega_i \sim PG(1, 0), \qquad (6)$$
where $k_i = y_i - 1/2$. Let $x_i = (a_{i,1,2}, a_{i,1,3}, ..., a_{i,1,V}, a_{i,2,3}, a_{i,2,4}, ..., a_{i,2,V}, ..., a_{i,V-1,V})'$ be of dimension $q \times 1$, where $q = V(V-1)/2$, and assume $X = (x_1 : \cdots : x_n)'$ is an $n \times q$ matrix. Then the conditional likelihood of $y = (y_1, ..., y_n)'$ given $\omega = (\omega_1, ..., \omega_n)'$ and $\gamma$ is given by
$$p(y \mid X, \gamma, \omega) \propto \prod_{i=1}^{n} p(y_i \mid x_i, \gamma, \omega_i, ...) \propto \prod_{i=1}^{n} \exp\left\{(y_i - 0.5)(\mu + x_i'\gamma) - \omega_i(\mu + x_i'\gamma)^2/2\right\} \propto \prod_{i=1}^{n} \exp\left\{-\frac{\omega_i}{2}\left[\frac{(y_i - 0.5)}{\omega_i} - (\mu + x_i'\gamma)\right]^2\right\}.$$
In matrix notation, the likelihood may be written as $p(y \mid X, \gamma, \omega, ...) \propto N(t \mid \mu\mathbf{1} + X\gamma, \Omega^{-1})$, where $t = ((y_1 - 0.5)/\omega_1, ..., (y_n - 0.5)/\omega_n)' = (k_1/\omega_1, ..., k_n/\omega_n)'$ and $\Omega = \mathrm{diag}(\omega_1, ..., \omega_n)$. While the full posterior distributions of the parameters are not of closed form, they mostly belong to standard families, so drawing posterior samples using MCMC can be readily implemented. Appendix D and Appendix E describe the full conditional distributions of the parameters for the Bayesian Network Lasso and Network Horseshoe priors on $\gamma$, respectively.

Let $\Omega^{(1)}, ..., \Omega^{(L)}$, $\Gamma^{(1)}, ..., \Gamma^{(L)}$ and $\mu^{(1)}, ..., \mu^{(L)}$ be the $L$ post burn-in MCMC samples for $\Omega$, $\Gamma$ and $\mu$, respectively, after suitable thinning. To classify a newly observed network $M^*$ as a member of one of the two groups, we compute
$$S^{(l)} = \frac{\exp(\mu^{(l)} + \langle M^*, \Gamma^{(l)} \rangle)}{1 + \exp(\mu^{(l)} + \langle M^*, \Gamma^{(l)} \rangle)}, \qquad l = 1, ..., L.$$
$M^*$ is classified as a member of the 'low' or 'high' group according to whether $\frac{1}{L}\sum_{l=1}^{L} S^{(l)}$ is less than or greater than 0.5, respectively.
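The following is a schematic single Gibbs update implied by the Pólya-Gamma augmentation in (6), written for a generic Gaussian prior γ ~ N(m0, S0) rather than the full network lasso/horseshoe hierarchies of Appendices D and E; it assumes the BayesLogit package for the Pólya-Gamma draws and absorbs the intercept μ into an intercept column of X.

```r
# One Gibbs pass under the Polya-Gamma augmentation: a sketch, not the
# authors' sampler (which places the network shrinkage prior on gamma).
library(BayesLogit)  # rpg() draws Polya-Gamma variates

pg_gibbs_step <- function(y, X, gamma, m0, S0_inv) {
  psi   <- drop(X %*% gamma)
  omega <- rpg(length(y), 1, psi)               # omega_i | - ~ PG(1, psi_i)
  kappa <- y - 0.5                              # k_i = y_i - 1/2
  prec  <- crossprod(X * sqrt(omega)) + S0_inv  # X' Omega X + S0^{-1}
  V     <- chol2inv(chol(prec))
  m     <- V %*% (crossprod(X, kappa) + S0_inv %*% m0)
  drop(m + t(chol(V)) %*% rnorm(ncol(X)))       # gamma | omega, y ~ N(m, V)
}
```

Because the conditional for γ is Gaussian and the ω_i's are conditionally independent Pólya-Gamma variates, the two updates can be alternated without any Metropolis correction.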
To judge sensitivity to the choice of the cut-off, the simulation section presents the area under the curve (AUC) of ROC curves built from the true positive rates (TPR) and false positive rates (FPR) of classification corresponding to a range of cut-off values.

Node $k$ is recognized to be influential in the classification process if $\frac{1}{L}\sum_{l=1}^{L}\xi_k^{(l)} > 0.5$, where $\xi_k^{(1)}, ..., \xi_k^{(L)}$ are the $L$ post burn-in MCMC samples of $\xi_k$. Again, one of the goals of the proposed framework is to identify influential network edges impacting the response. We employ the algorithm described in Appendix C to identify influential edges. The algorithm takes care of multiplicity correction by controlling the false discovery rate (FDR) at the 5% level. Finally, we present an estimate of $P(R_{eff} = r \mid Data)$ computed by $\frac{1}{L}\sum_{l=1}^{L} I\left(\sum_{m=1}^{R} 1\{\lambda_m^{(l)} \neq 0\} = r\right)$, where $I(A)$ for an event $A$ is 1 if the event $A$ happens and 0 otherwise, and $\lambda_m^{(1)}, ..., \lambda_m^{(L)}$ are the $L$ post burn-in MCMC samples of $\lambda_m$.

5 Simulation Studies

This section evaluates the inferential and classification ability of our proposed Bayesian network classification (BNC) framework, along with a number of competitors, using synthetic networks generated under various simulation settings. Our proposed network classification approaches with the Bayesian Network Lasso prior and the Bayesian Network Horseshoe prior are referred to as the Bayesian Network Lasso classifier (BNLC) and the Bayesian Network Horseshoe classifier (BNHC), respectively. In each simulation, we assess the ability of the BNLC and BNHC approaches to correctly identify influential nodes and edges, to accurately estimate predictive edge coefficients, and to classify a network with precise characterization of uncertainties. Classification performance of both methods is assessed using the area under the receiver operating characteristic (ROC) curve (AUC).

To study all competitors under various data generation schemes, we simulate the response from (1):
$$y_i \sim \mathrm{Ber}\left(\frac{\exp(\mu + \langle A_i, \Gamma \rangle_F)}{1 + \exp(\mu + \langle A_i, \Gamma \rangle_F)}\right), \qquad (7)$$
where $\Gamma$ is a symmetric matrix with zero diagonal entries. The intercept $\mu$ is fixed at 2 in all simulation scenarios. We consider two different schemes for generating the network $A_i$, referred to as Simulation 1 and Simulation 2, respectively.

Simulation 1. In Simulation 1, the network edges (i.e., the elements of the matrix $A_i$) are simulated independently from a standard normal distribution; Simulation 1 thus assumes that the network predictor follows an Erdős–Rényi-type graph.

Simulation 2. In Simulation 2, the network predictor $A_i$ corresponding to the $i$th sample is generated from a stochastic blockmodel. Here nodes in a simulated network are organized into communities, so that nodes in the same community tend to have stronger connections than nodes belonging to different communities. This scenario simulates networks which closely mimic brain connectome networks Bullmore and Sporns (2009). To simulate networks with such community structures, we assign each node a community label, $f_k \in \{1, 2, 3\}$, $k = 1, ..., V$. The node assignments are the same for all networks in the population. Given the community labels, the $(k, k')$th element of $A$ is simulated from $N(m_{f_k, f_{k'}}, \sigma^2)$, where $m_{k,l}$ is set to a common positive constant when $k = l$. When $k \neq l$, i.e., when the concerned edges connect nodes belonging to different clusters, we sample a fixed number of edge locations randomly and simulate the values from $N(0, \sigma^2)$. We take $\sigma = 1$ and three clusters with 8, 9 and 8 nodes, respectively, in the three communities. We note that the network predictors simulated from this stochastic blockmodel also exhibit transitivity.
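A sketch of the Simulation 2 network generator follows. The within-community mean m_within is an illustrative value (the exact constant is not reproduced here), and, for simplicity, all between-community entries are drawn from N(0, sigma^2) rather than being restricted to a fixed number of random locations.

```r
# Three-community stochastic blockmodel generator, mimicking Simulation 2.
simulate_sbm_network <- function(V = 25, labels = rep(1:3, c(8, 9, 8)),
                                 m_within = 0.8, sigma = 1) {
  A <- matrix(0, V, V)
  for (k in 1:(V - 1)) {
    for (l in (k + 1):V) {
      mu_kl   <- if (labels[k] == labels[l]) m_within else 0
      A[k, l] <- rnorm(1, mu_kl, sigma)
      A[l, k] <- A[k, l]  # keep the adjacency matrix symmetric
    }
  }
  A
}
```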
Simulating the network predictor coefficient $\Gamma$. In both Simulations 1 and 2, the network predictor coefficient $\Gamma$ is constructed as the sum of two matrices, $\Gamma_1$ and $\Gamma_2$, whose construction we detail below.

In both Simulations 1 and 2, we draw $V$ latent variables $u_{k,g}$, each of dimension $R_g$, from a mixture distribution given by
$$u_{k,g} \sim \pi_1 N_{R_g}(u_{m,g}, u_{s,g}) + (1 - \pi_1)\,\delta_0; \qquad k \in \{1, ..., V\}, \qquad (8)$$
where $\delta_0$ is the Dirac delta function at zero and $\pi_1$ is the probability of any $u_{k,g}$ being nonzero. Define a symmetric matrix $\Gamma_1$ whose $(k,l)$th element is given by $u_{k,g}'u_{l,g}$ for $k < l$ and equals 0 if $k = l$. Note that if $u_{k,g}$ is zero, then the $k$th node has no contribution to the mean function in (7), i.e., the $k$th node becomes non-influential in predicting the response. Since $(1 - \pi_1)$ is the probability of a node being inactive, it is referred to as the node sparsity parameter in the context of the data generation mechanism under Simulations 1 and 2. All elements of $u_{m,g}$ are taken to be 0, and $u_{s,g}$ is taken to be 1.

We also construct another symmetric sparse matrix $\Gamma_2$ to add residual edge effects corresponding to edges connecting a few randomly selected nodes. Let $\pi_2$ be the proportion of nonzero elements of $\Gamma_2$, set at either 0.05 or 0.1. We randomly choose a proportion $\pi_2$ of locations from the set of all pairs $(k, l)$. The nonzero entries are drawn using one of the three following strategies:

Strategy 1: Nonzero entries are simulated from a normal distribution with mean 1.
Strategy 2: Nonzero entries are simulated from a normal distribution with mean 0.5.
Strategy 3: All nonzero entries are fixed at a common constant.

$(1 - \pi_2)$ is referred to as the residual edge sparsity. Note that this specification of the true edge coefficients largely preserves the transitivity property in $\Gamma$. To see this, note that $\Gamma_2$ is highly sparse, so that $\gamma_{k,l} = \gamma_{1,k,l}$ for most pairs $(k, l)$, $k < l$. For those pairs, $\gamma_{1,k,l} \neq 0$ and $\gamma_{1,l,l'} \neq 0$ imply that $u_{k,g} \neq 0$, $u_{l,g} \neq 0$ and $u_{l',g} \neq 0$. Thus it follows that $\gamma_{1,k,l'} = u_{k,g}'u_{l',g} \neq 0$.

For a comprehensive picture of Simulation 1 and Simulation 2, we consider 4 different cases in each simulation, as summarized in Tables 1 and 2, respectively. In each of these cases, the network predictor coefficient and the response are generated by changing the node sparsity $(1 - \pi_1)$, the residual edge sparsity $(1 - \pi_2)$ and the true dimension $R_g$ of the latent variables $u_{k,g}$. The tables also present the maximum fitted dimension $R$ of the latent variables $u_k$ for the logistic regression model (2). Note that the various cases also allow model mis-specification through unequal choices of $R$ and $R_g$.

Cases     R_g   R   Node Sparsity (1 - pi_1)   Residual Edge Sparsity (1 - pi_2)   Strategy
Case 1    2     2   0.5                        0.95                                Strategy 1
Case 2    3     5   0.6                        0.95                                Strategy 1
Case 3    2     5   0.5                        0.90                                Strategy 2
Case 4    2     5   0.4                        0.90                                Strategy 3

Table 1: Different cases for Simulation 1. The true dimension R_g is the dimension of the vectors u_{k,g} used to generate the data; the maximum dimension R is the dimension of the vectors u_k used to fit the model. Node sparsity and residual edge sparsity are described in the text.

Cases     R_g   R   Node Sparsity (1 - pi_1)   Residual Edge Sparsity (1 - pi_2)   Strategy
Case 1    2     2   0.5                        0.95                                Strategy 1
Case 2    2     4   0.5                        0.95                                Strategy 1
Case 3    2     3   0.7                        0.95                                Strategy 1
Case 4    2     5   0.4                        0.90                                Strategy 3

Table 2: Different cases for Simulation 2. The true dimension R_g is the dimension of the vectors u_{k,g} used to generate the data; the maximum dimension R is the dimension of the vectors u_k used to fit the model. Node sparsity and residual edge sparsity are described in the text.
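The coefficient construction just described can be sketched as follows: Gamma_1 is low rank and driven by node activity via (8), and Gamma_2 carries sparse residual edge effects (Strategy 1 is used here for concreteness); pi1 and pi2 are the node-activation and residual edge proportions.

```r
# Generate Gamma = Gamma_1 + Gamma_2 as in the simulation design (a sketch).
simulate_gamma <- function(V = 25, Rg = 2, pi1 = 0.5, pi2 = 0.05) {
  active <- rbinom(V, 1, pi1)                   # mixture (8): which nodes are active
  U      <- matrix(rnorm(V * Rg), V, Rg) * active
  Gamma1 <- tcrossprod(U); diag(Gamma1) <- 0    # (k,l) entry is u_k' u_l
  Gamma2 <- matrix(0, V, V)
  ut     <- which(upper.tri(Gamma2))
  picked <- sample(ut, ceiling(pi2 * length(ut)))
  Gamma2[picked] <- rnorm(length(picked), mean = 1)  # Strategy 1 residual entries
  Gamma2 <- Gamma2 + t(Gamma2)                  # symmetrize the residual part
  Gamma1 + Gamma2
}
```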
As competitors, we use generic variable selection and shrinkage methods that treat the edges between nodes together as a long predictor vector in a high dimensional regression, thereby ignoring the relational nature of the predictor. More specifically, we use the Lasso Tibshirani (1996), a popular penalized optimization scheme, and the Bayesian Lasso (BLasso for short) Park and Casella (2008) and Bayesian Horseshoe (BHS for short) priors Carvalho et al. (2010), which are popular Bayesian shrinkage regression methods, all three under the logistic regression framework. We use the glmnet package in R Friedman et al. (2010) to implement the frequentist Lasso, while we write our own code for BLasso and BHS. A comparison with these methods will indicate any relative advantage of exploiting the structure of the network predictor. Additionally, we compare our methods to a frequentist approach that develops network classification in the presence of a network predictor and a binary response Relión et al. (2017). We refer to this approach as Relión.

All Bayesian competitors are allowed to draw 50,000 MCMC samples, out of which the first 30,000 are discarded as burn-in. Convergence is assessed by comparing different simulated sequences of representative parameters starting at different initial values Gelman et al. (2014b). All posterior inference is carried out based on the remaining 20,000 MCMC samples after suitably thinning the post burn-in chain. We monitor the auto-correlation plots and effective sample sizes of the iterates, and they are found to be satisfactorily uncorrelated. In all of our simulations, we set V = 25 nodes and n = 250 samples.

We present analyses for ν = 20 and a_∆ = b_∆ = 1. For BNLC, there are two additional hyper-parameters, ι and ζ, both of which are set to 1. Note that the choice a_∆ = b_∆ = 1 ensures that the prior on models is such that we have a uniform distribution on the number of active nodes and, conditional on the size of the model, a uniform distribution on all possible models of that size. The choice ν = 20 ensures that the prior distribution of M is concentrated around a scaled identity matrix. Since the model is invariant to rotations of the latent positions, we want the prior on the u_k's to also be invariant under rotation; this requires centering M around a matrix proportional to the identity. Our choice of ι and ζ fixes the prior mean of s_{k,l}.
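Before turning to the results, the node-selection rule used throughout can be summarized in a small post-processing sketch; xi_samples is a hypothetical L x V matrix of post burn-in indicator draws.

```r
# P(xi_k = 1 | Data) is the average of the post burn-in indicator samples;
# a node is flagged as influential when this probability exceeds 0.5.
node_probs  <- function(xi_samples) colMeans(xi_samples)
influential <- function(xi_samples) which(node_probs(xi_samples) > 0.5)
```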
Figures 1 and 2 show the posterior probability of the kth node being detected as influential, i.e., P(ξ_k = 1 | Data), by BNLC and BNHC for each node and each case within Simulations 1 and 2, respectively. Some interesting observations emerge from the results. We find that both methods work well with lower node sparsity and higher residual edge sparsity. Decreasing the residual edge sparsity and increasing the node sparsity have adverse effects on the performance. In general, BNLC shows relatively better performance than BNHC in cases with higher node sparsity and/or lower residual edge sparsity. We provide a brief discussion below to support these observations.

For BNHC, case 2 exhibits a few false positives, and the separation of posterior probabilities for truly active and truly inactive nodes is much more stark in case 1 than in case 2. BNLC does a better job of node identification than BNHC in case 2. The residual edge effect does have an impact on the probabilities, which is evident by comparing cases 1 and 3. For BNHC, case 3 (Simulation 1) displays poor performance, with a higher number of both false positives and false negatives. The performance of BNLC appears to be better than that of BNHC in case 3. Fixing the residual edge sparsity and increasing the node sparsity has a negative impact on node identification, as seen by comparing performances in cases 3 and 4 (Simulation 1). For Simulation 2, both competitors perform quite well in cases 1 and 2. Again, case 3 (Simulation 2) represents a higher node sparsity, so that both BNHC and BNLC do not perform well in this case. Similar to Simulation 1, BNHC shows inferior performance to BNLC in case 3. While BNHC offers a few false positives and false negatives in case 4 (Simulation 2), the performance appears to be much better than in case 3. Notice that case 3 has both higher node sparsity and higher residual edge sparsity than case 4. While they have opposing effects, it appears that higher node sparsity has more of an adverse effect here compared to a small perturbation in the residual edge sparsity. Recall that Relión et al. (2017) is the only other competitor which is designed to detect influential nodes. It detects all nodes to be influential in all simulation cases.

Figure 1: Simulation 1: node-level results for BNLC (a) and BNHC (b). Clear background denotes uninfluential and dark background denotes influential nodes in the truth. There are 25 rows (corresponding to 25 nodes) and 4 columns corresponding to the 4 cases in Simulation 1. The model-detected posterior probability of being influential is super-imposed onto the corresponding node.

Figure 2: Simulation 2: node-level results for BNLC (a) and BNHC (b). Clear background denotes uninfluential and dark background denotes influential nodes in the truth. There are 25 rows (corresponding to 25 nodes) and 4 columns corresponding to the 4 cases in Simulation 2. The model-detected posterior probability of being influential is super-imposed onto the corresponding node.

We apply the algorithm with a mixture of skewed t-distributions described in Appendix C to detect influential edges from the post burn-in MCMC samples of the edge coefficients, using a threshold of t = 0.05. The proposed approach controls the FDR below a threshold of 0.05 to account for multiplicity correction. Tables 3 and 4 provide the true positive rates (TPR) and false positive rates (FPR) in detecting important edges for Simulations 1 and 2, respectively. It is observed that when node sparsity is moderate and residual edge sparsity is high (cases 1 and 2), both BNLC and BNHC offer moderate performance in terms of identifying true positives, and include very few false positives. In these cases, BNHC generally exhibits a slightly higher FPR than BNLC. In the case of high node sparsity (e.g., case 3, Simulation 2), both methods unfortunately show much lower true positive rates. Again, lower edge sparsity (case 3, Simulation 1) has almost no effect on the FPR of BNLC, but decreases the TPR substantially. For BNHC, both TPR and FPR increase when the residual edge sparsity is reduced. Nevertheless, both of them perform significantly better than Lasso in almost all cases. The competitor in Relión et al. (2017) appears to have suboptimal performance, as it identifies all edges as important in all the simulation scenarios, resulting in high FPRs.
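A generic Bayesian FDR thresholding sketch, offered here as a stand-in for the skewed-t mixture algorithm of Appendix C, selects the largest set of edges whose average posterior probability of being null stays at or below the target level.

```r
# Keep the top-ranked edges while the running Bayes FDR stays <= the target.
select_edges_fdr <- function(incl_prob, fdr = 0.05) {
  ord  <- order(incl_prob, decreasing = TRUE)
  bfdr <- cumsum(1 - incl_prob[ord]) / seq_along(ord)  # running Bayes FDR
  if (!any(bfdr <= fdr)) return(integer(0))
  ord[seq_len(max(which(bfdr <= fdr)))]                # indices of selected edges
}
```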
          BNLC          BNHC          Lasso         Relión (2017)
Cases     TPR    FPR    TPR    FPR    TPR    FPR    TPR    FPR
Case 1    0.65   0.01   0.72   0.12   0.50   0.22   1      1
Case 2    0.64   0.00   0.63   0.02   0.40   0.14   1      1
Case 3    0.45   0.00   0.86   0.40   0.42   0.22   1      1
Case 4    0.72   0.09   0.70   0.12   0.54   0.16   1      1

Table 3: True positive rates (TPR) and false positive rates (FPR) for edges for the cases in Simulation 1.

          BNLC          BNHC          Lasso         Relión (2017)
Cases     TPR    FPR    TPR    FPR    TPR    FPR    TPR    FPR
Case 1    0.63   0.00   0.84   0.08   0.44   0.20   1      1
Case 2    0.56   0.00   0.63   0.12   0.53   0.22   1      1
Case 3    0.46   0.02   0.59   0.08   0.31   0.16   1      1
Case 4    0.68   0.03   0.75   0.06   0.34   0.12   1      1

Table 4: True positive rates (TPR) and false positive rates (FPR) for edges for the cases in Simulation 2.

The results in Tables 3 and 4 indicate a higher number of edges identified as influential by BNHC than by BNLC in all simulations. Digging a bit deeper, we report in Table 5 the ratio of the number of edges in the intersection of both methods to the number of total edges identified by each method independently. In all simulation cases, almost all edges identified as influential by BNLC are also identified as influential by BNHC. In cases 2 and 4 (Simulation 1), the fractions corresponding to BNLC and BNHC are very similar, indicating similar edge identification by both of them. However, this fraction appears to be lower for BNHC in cases 1 and 3 (Simulation 1). This again shows that the edges identified by BNLC are also identified by BNHC, with BNHC identifying more edges. The discrepancy turns out to be larger in case 3 (Simulation 1), where BNHC has identified many more edges. Simulation 2 shows a similar trend. We further track the top 10, 20 and 30 edges identified by BNLC and record how many of these edges belong to the top 10, 20 and 30 edges identified by BNHC. Table 5 shows a high level of intersection among the top edges identified by these two methods.

A number of interesting observations emerge from the analysis. First of all, as mentioned earlier, the edges identified by BNLC are generally also identified by BNHC. BNHC tends to identify more edges, leading to higher TPR and FPR. Broadly, in the presence of higher node sparsity, the discrepancy is greater, with BNHC having much higher TPR and FPR. Interestingly, the absolute values of the edge coefficients follow very similar rankings for BNHC and BNLC, which leads to high intersections among the top edges selected by these methods. Perhaps the difference in the shrinkage mechanisms imposed by BNHC and BNLC is responsible for their difference in tail behavior, leading to differences in edge selection.

         Simulation 1                                       Simulation 2
Cases    N_{BL,BH}/N_{BL}  N_{BL,BH}/N_{BH}  Top10 Top20 Top30    N_{BL,BH}/N_{BL}  N_{BL,BH}/N_{BH}  Top10 Top20 Top30
1        0.94              0.61              9     19    27       1.00              0.58              7     17    26
2        0.85              0.83              8     14    21       1.00              0.46              8     17    26
3        1.00              0.25              9     13    24       0.97              0.70              9     18    28
4        0.91              0.87              8     18    27       0.91              0.75              8     17    27

Table 5: N_{BL,BH} represents the number of edges identified by both BNLC and BNHC. Similarly, N_{BL} and N_{BH} represent the number of edges identified by BNLC and BNHC, respectively. Top 10 represents the number of edges common among the top ten edges identified by BNLC and BNHC; Top 20 and Top 30 are defined analogously.
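The edge-ranking comparison behind the "Top 10/20/30" columns of Table 5 amounts to ranking edges by the absolute value of their posterior mean coefficients under each method and counting the overlap, as in the following sketch.

```r
# Overlap of the top-k edges ranked by |posterior mean| under two methods.
top_overlap <- function(mean_a, mean_b, k) {
  top_a <- order(abs(mean_a), decreasing = TRUE)[1:k]
  top_b <- order(abs(mean_b), decreasing = TRUE)[1:k]
  length(intersect(top_a, top_b))
}
```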
The mean squared errors (MSE) associated with point estimation of the edge coefficients for the different competitors are presented in Tables 6 and 7, corresponding to Simulations 1 and 2, respectively. For the Bayesian competitors, point estimates are computed using the posterior means of the edge coefficients. In all cases, BNLC and BNHC consistently outperform all other competitors, with the binary Bayesian Lasso exhibiting the next best performance. In all simulation cases, BNLC comprehensively outperforms BNHC in terms of estimating the edge coefficients. Consistent with earlier observations, both competitors tend to be less accurate when node sparsity increases.

Table 6: Mean squared error (MSE) of estimating edge coefficients for all competitors (BNLC, BNHC, Lasso, Relión (2017), binary BLasso, binary BHS) in the four cases of Simulation 1.

Table 7: Mean squared error (MSE) of estimating edge coefficients for all competitors in the four cases of Simulation 2.

Figure 3 records the AUC for all competitors in Simulations 1 and 2. In almost all cases, the AUCs for BNHC and BNLC turn out to be higher than those of the other competitors. On the other hand, Relión et al. (2017) appears to classify samples close to randomly, with AUC around 0.5.

Figure 3: Classification performance in the form of the area under the curve (AUC) of ROC for all cases in Simulation 1 (a) and Simulation 2 (b), for BNLC, BNHC, binary Lasso, Relión, binary BLasso and binary BHS.

Figures 4 and 5 present posterior probabilities of the effective dimensionality of the latent positions u_1, ..., u_V for BNLC and BNHC in Simulations 1 and 2, respectively. Note that the true dimension of the latent space is known and recorded for all simulations in Tables 1 and 2. In all 8 cases, the posterior mode corresponds to the true dimension of the latent space, and the posterior of R_eff in BNHC concentrates more sharply around R_g in all cases.

Figure 4: Posterior probability distribution of the effective dimensionality for the BNLC and BNHC models in all 4 cases of Simulation 1. Filled bullets indicate the true value of the effective dimensionality.

Figure 5: Posterior probability distribution of the effective dimensionality for the BNLC and BNHC models in all 4 cases of Simulation 2. Filled bullets indicate the true value of the effective dimensionality.

To assess how sensitive the inferences from BNLC and BNHC are, we analyze BNLC and BNHC with different combinations of hyperparameters. Specifically, for BNLC we use the five combinations (i) a_∆ = 1, b_∆ = 9; (ii) ν = 20, δ = 5; (iii) ν = 50, δ = 5; (iv) ν = 20, δ = 0.2; (v) ν = 50, δ = 0.2. Combination (i) ensures a small prior mean for the ξ_k's, while combinations (ii)-(v) allow a range of prior means for θ and M. On the other hand, the three combinations we employ for BNHC are (i)' a = 1, b = 9; (ii)' ν = 10; (iii)' ν = 50. With these hyperparameter combinations for BNLC and BNHC, we analyze the data simulated in case 4, Simulation 1 (case chosen randomly), report performances on influential node and edge identification, and report the MSE values for estimating the network coefficient matrix. All these inferences with different choices of hyperparameters are compared among themselves and with the inferences reported earlier for case 4, Simulation 1.
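The MSE criterion used in Tables 6-8 can be computed directly from the posterior draws, as in the sketch below; gamma_draws is a hypothetical L x q matrix of post burn-in samples of the edge coefficients.

```r
# Squared error between the posterior mean edge coefficients and the truth.
mse_edges <- function(gamma_draws, gamma_true) {
  mean((colMeans(gamma_draws) - gamma_true)^2)
}
```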
Table 8 records the MSE values for estimating the network coefficient under all these combinations. The MSE values for BNLC range between 0.10 and 0.30, and the MSE values for BNHC range between 0.19 and 0.28, across the different choices of hyperparameters (see Table 8). Figure 6 shows the posterior probabilities of a node being identified as influential under all these hyperparameter combinations. The probabilities are only slightly affected by the change of hyperparameters. In fact, under hyperparameter combinations (i), (ii) and (iv), BNLC identifies the same set of nodes as influential as the original BNLC prior. Under combination (iii), BNLC does not identify node 9 as influential, which has been identified as influential by the original BNLC prior. Under combination (v), BNLC identifies one additional node (node 21) as influential over the set of nodes identified by the original prior. Under hyperparameter combination (i)', BNHC identifies the same set of nodes as the original BNHC prior except for four nodes (including nodes 4 and 25), which are identified as influential by the original prior but not by combination (i)'. Combinations (ii)' and (iii)' also identify the same set of nodes as the original BNHC prior except for three nodes (including nodes 9 and 25). Finally, Table 9 offers the TPR and FPR values corresponding to the identification of influential edges for BNLC and BNHC under the various combinations of hyperparameters. The TPR for BNHC under combination (iii)' turns out to be a little higher than the rest, but overall the numbers do not show a lot of variation. We emphasize that the results turn out to be better than our competitors under all combinations.

               BNLC                               BNHC
Combinations   (i)    (ii)   (iii)  (iv)   (v)    (i)'   (ii)'  (iii)'
MSE            0.14   0.30   0.22   0.10   0.22   0.19   0.28   0.28

Table 8: Mean squared error (MSE) of estimating the network coefficient in BNLC and BNHC for different combinations of hyper-parameters.

Figure 6: P(ξ_k = 1 | Data) for BNLC (a) and BNHC (b) under different hyper-parameter combinations in the simulated data for case 4 (Simulation 1).

               BNLC                               BNHC
Combinations   (i)    (ii)   (iii)  (iv)   (v)    (i)'   (ii)'  (iii)'
TPR            0.80   0.76   0.82   0.83   0.78   0.64   0.88   0.82
FPR            0.16   0.21   0.17   0.21   0.18   0.19   0.24   0.18

Table 9: True positive rates (TPR) and false positive rates (FPR) of identifying influential edges in BNLC and BNHC for different combinations of hyper-parameters.

6 Analysis of Brain Connectome Data

In this section, we present the inferential and classification ability of BNLC and BNHC in the context of a weighted diffusion tensor imaging (DTI) dataset. Our dataset contains information on the full scale intelligence quotient (FSIQ) for multiple individuals. FSIQ is a measure of an individual's complete cognitive capacity. It is derived from the administration of selected sub-tests from the Wechsler Intelligence Scales (WIS), designed to provide a measure of an individual's overall level of general cognitive and intellectual functioning, and is a summary score derived from an individual's performance on a variety of tasks that measure acquired knowledge, verbal reasoning, attention to verbal materials, fluid reasoning, spatial processing, attentiveness to details, and visual-motor integration Caplan et al. (2011). A substantial body of literature has suggested that there is an IQ threshold (usually described as an IQ of approximately 120 points) that may be characterized as superior reasoning ability Brown et al. (2009); Carson et al. (2003).
Following this literature, we convert the FSIQ scores into a binary response variable y, which takes value 0 if FSIQ is less than or equal to 120, and value 1 if FSIQ is greater than 120. Thus, we classify the subjects in our study as belonging to the low IQ group if y = 0, and to the high IQ group if y = 1.

Along with the FSIQ measurements, brain connectome information for n = 114 subjects is gathered using weighted diffusion tensor imaging (DTI). DTI is a brain imaging technique that enables measurement of the restricted diffusion of water in tissue in order to produce neural tract images. The brain imaging data we use has been pre-processed using the NDMG pre-processing pipeline Kiar et al. (2016); Kiar et al. (2017a); Kiar et al. (2017b). In the context of DTI, the human brain is divided according to the Desikan atlas Desikan et al. (2006), which identifies 34 cortical regions of interest (ROIs) in each of the left and right hemispheres of the human brain, implying 68 cortical ROIs in all. Similar to Guha and Rodriguez (2020), this results in a brain network given by a 68 × 68 matrix for each individual. Our scientific goals in this setting include identification of brain regions or network nodes significantly related to FSIQ and classification of a subject into the low IQ or high IQ group based on his/her brain connectome information.

Identical prior distributions for all the parameters as in the simulation studies have been used. BNLC and BNHC are both fitted with R = 4, which is found to be sufficient for this study. Further, Guha and Rodriguez (2020) show robust inference as long as the chosen R is bigger than the effective dimensionality of the latent variables. Similar to Guha and Rodriguez (2020), we also carry out a sensitivity study to check the impact of R on predictive inference. The choices of hyperparameters for BNLC and BNHC are made similarly to the simulation studies; a brief explanation of these choices is provided in the simulation section. The MCMC chain is run for 50,000 iterations, with the first 30,000 discarded as burn-in, and convergence is assessed as before Gelman et al. (2014a). All inference is based on the remaining 20,000 post burn-in iterates, appropriately thinned.

As in the simulation studies, we put our emphasis on identifying influential brain regions of interest (ROIs) associated with FSIQ. The BNLC model estimates posterior probabilities over 0.5 of being influential for 38 ROIs, out of which 20 regions are in the left hemisphere and 18 regions are in the right hemisphere. Among the regions detected in both hemispheres, a large number belong to the frontal, temporal and cingulate lobes. Using the same principle, the BNHC model identifies 48 nodes to be influential. Out of the 48 influential nodes, 26 are detected in the left hemisphere and the rest in the right hemisphere. The ROIs are mainly detected in the temporal, frontal, parietal and cingulate lobes in both hemispheres. Figure 8 plots the estimated posterior probability of an ROI being detected as influential by the BNLC and BNHC models. Notably, there are 29 ROIs identified by both BNLC and BNHC, given in Table 10.

A large number of the 29 influential nodes detected by both BNLC and BNHC are part of the frontal lobes in both hemispheres. Numerous studies have linked the frontal region to an individual's intelligence and cognitive functions Yoon et al. (2017); Stuss et al. (1985); Razumnikova (2007); Miller and Milner (1985); Kolb and Milner (1981).
Our method also finds a significant association between FSIQ and the left inferior parietal lobule, the left precuneus and the supramarginal gyri in both hemispheres, all in the parietal lobe; these regions were also found to be significantly related to FSIQ by Yoon et al. (2017).

We additionally look into ROIs which are detected by only one of the two methods (say, BNLC), and report the posterior probabilities of these ROIs being active under the other method (i.e., BNHC). Figure 7 shows the posterior probabilities of nodes being active under the 'other' method as discussed above. It is observed that the nodes selected by BNHC but not by BNLC have probabilities not very far from 0.5.

Figure 7: Posterior probabilities of being active, under one method, for the nodes selected as influential by the other method only. The left panel lists nodes selected by BNLC but not by BNHC, and the right panel lists nodes selected by BNHC but not by BNLC.

Influential edges are then sought among the edges connecting the nodes detected as influential, i.e., among $\binom{38}{2}$ and $\binom{48}{2}$ possibilities for BNLC and BNHC, respectively. Since a different number of nodes are detected as influential by BNHC and BNLC, to make a fair comparison we consider the 29 nodes detected as influential by both methods, and use our algorithm to find the number of influential edges among these $\binom{29}{2}$ possibilities for both BNLC and BNHC. The numbers turn out to be 96 and 184, respectively. We note that there are a few nodes which are identified as influential by either BNHC or BNLC, but none of the edges connecting these nodes are found to be influential. As an example, although the frontal pole and the temporal pole in the left hemisphere are identified as influential nodes by BNLC, none of the edges connecting these two nodes turn out to be influential. This phenomenon may be due to the use of the FDR in the edge selection procedure, which finds edges that are most likely to be active while controlling for false discoveries. Hence, not identifying an edge does not necessarily mean that the edge is not active; it just means that there are others that satisfy the criteria better.

Similar to the simulation studies, we dig deeper to analyze the discrepancy in the number of influential edges identified by BNLC and BNHC. Specifically, we rank the $\binom{29}{2} = 406$ edges connecting the nodes found to be influential by both BNLC and BNHC according to the absolute values of their posterior means. Table 11 shows between 23% and 74% intersection among the top-ranked edges.

To examine the predictive ability of the Bayesian network classification model, we report the area under the curve (AUC) of the ROC curve for BNLC and BNHC, along with all competing methods. The AUCs are computed using a 10-fold cross validation approach.
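The cross-validated AUC computation can be sketched as follows, assuming the pROC package; fit_and_score is a hypothetical user-supplied function that fits the classifier on the training indices and returns out-of-fold posterior probabilities of y = 1 for the test indices.

```r
# 10-fold cross-validated AUC (a sketch, not the authors' exact pipeline).
library(pROC)
cv_auc <- function(y, fit_and_score, folds = 10) {
  idx    <- sample(rep(1:folds, length.out = length(y)))  # random fold labels
  scores <- numeric(length(y))
  for (f in 1:folds) {
    scores[idx == f] <- fit_and_score(train = which(idx != f),
                                      test  = which(idx == f))
  }
  as.numeric(auc(roc(y, scores, quiet = TRUE)))
}
```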
The AUC estimates presented in Table 12 indicate better performance of both BNLC and BNHC, with BNLC slightly outperforming BNHC. The frequentist binary Lasso turns out to be the next best performer, while BLasso and BHS perform very similarly to a random classifier. Finally, the effective dimensionality of the model is investigated for both BNLC and BNHC, and the posterior means turn out to be 2.17 and 2, respectively.

Figure 8: Lateral and medial views of the brain (left and right hemispheres) showing all 68 regions of interest (ROIs) for BNLC (a) and BNHC (b). The size and color of the ROIs vary according to the value of the posterior probability of being actively related to the binary response.

Hemisphere   Lobe        Nodes
Left         Temporal    fusiform, middle temporal gyrus, parahippocampal, temporal pole, transverse temporal
             Cingulate   isthmus cingulate cortex
             Frontal     pars opercularis, pars orbitalis, pars triangularis, frontal pole
             Occipital   lingual
             Parietal    inferior parietal lobule, precuneus, supramarginal gyrus
             Insula      insula
Right        Temporal    parahippocampal, superior temporal gyrus, temporal pole
             Cingulate   caudal anterior cingulate, isthmus cingulate cortex
             Frontal     lateral orbitofrontal, medial orbitofrontal, pars opercularis, pars orbitalis, rostral middle frontal gyrus, superior frontal gyrus
             Occipital   pericalcarine
             Parietal    supramarginal gyrus
             Insula      insula

Table 10: Nodes identified as influential by both BNLC and BNHC.

Top 100   Top 200   Top 300
23        99        222

Table 11: Top 100 represents the number of edges common among the top 100 edges identified by BNLC and BNHC; Top 200 and Top 300 are defined analogously.

Table 12: AUC for BNLC, BNHC, Lasso, Relión (2017), binary BLasso and binary BHS, computed using 10-fold cross validation.

We have already discussed how the hyperparameters are chosen for the simulation studies and the data analysis. To assess how sensitive the inferences from BNLC and BNHC are, we analyze BNLC and BNHC with different combinations of hyperparameters. Specifically, for BNLC we use the five combinations (i)-(v) given in Section 5.5, and for BNHC the three combinations (i)'-(iii)' also mentioned in Section 5.5. We report performances on the number of influential nodes identified.
We also find the number of influential edges connecting the influential nodes. Table 13 records the number of nodes identified as influential, along with the number of these nodes intersecting with the original analysis, for all hyperparameter combinations. Influential edges are then sought among the edges connecting the influential nodes in BNLC and BNHC, respectively, for all hyperparameter combinations. Table 14 presents the number of edges detected as influential, as well as the number of intersecting edges with the original analysis. Again, due to the high dimensionality of the problem, some variation in the number of identified edges with different choices of hyperparameters is expected, though the variation turns out not to be very significant.

Figure 9: Plot showing whether an edge connecting two influential nodes is influential or not, for BNLC (a) and BNHC (b). The map is an M × M symmetric matrix, where M denotes the number of influential nodes, and each cell denotes an edge connecting the corresponding pair of nodes. The axis labels are the abbreviated names of the influential ROIs in the left (starting with 'lh-') and right (starting with 'rh-') hemispheres of the brain. Full names of the ROIs can be obtained from the widely available Desikan brain atlas. A white cell represents an influential edge, while a red cell represents a non-influential edge.

Table 13: Number of nodes identified as influential in BNLC and BNHC for the different combinations of hyper-parameters.

Table 14: Number of edges identified as influential in BNLC and BNHC for the different combinations of hyper-parameters, along with the number intersecting with the original analysis.

Finally, to check sensitivity to the choice of R on the performance of BNLC and BNHC, we run the data analysis for BNHC and BNLC with R = 8 and R = 10, and report the posterior mean of the effective dimensionality along with the AUC. Table 15 reports the posterior mean of the effective dimensionality, which shows a very moderate increase with increasing R. However, increasing R seems to have almost no effect on the AUC.

                          BNLC                      BNHC
                          R = 4   R = 8   R = 10    R = 4   R = 8   R = 10
Posterior mean Eff. Dim.  2.17    2.78    2.96      2.00    2.74    3.04
AUC                       0.61    0.63    0.59      0.59    0.60    0.59

Table 15: AUC and posterior mean of the effective dimensionality for BNLC and BNHC under different choices of R.

7 Discussion

We develop a binary Bayesian network regression model that enables classifying multiple networks with "labeled nodes" into two groups, identifies influential network nodes, and predicts the class to which a newly observed network belongs. Our contribution lies in carefully constructing a class of network global-local shrinkage priors on the network predictor coefficient while recognizing the latent network structure in the predictor variable. In particular, we investigate two specific network shrinkage priors from this general class, leading to the two network classifiers BNLC and BNHC.
Our extensive simulation study shows competitive performance between BNLC and BNHC in terms of inference and classification, with no clear winner, and both are found to outperform the other competitors. Another major contribution of the proposed framework is the theoretical understanding of the Bayesian network classifier model with the network lasso shrinkage prior. Specifically, we develop theory guaranteeing accurate classification as the sample size tends to infinity. The theoretical developments allow the number of possible interconnections in the network predictor to grow at a faster rate than the sample size. We analyze a brain connectome dataset with brain connectivity networks between different regions of interest for multiple individuals, along with information on whether an individual is in a low or a high IQ category. The proposed classifiers show satisfactory out-of-sample classification and identify important brain regions actively influencing the FSIQ of an individual.

References

Armagan, A., Dunson, D. B., and Lee, J. (2013a). Generalized double Pareto shrinkage. Statistica Sinica, (1), 119-143.

Armagan, A., Dunson, D. B., Lee, J., Bajwa, W. U., and Strawn, N. (2013b). Posterior consistency in linear models under shrinkage priors. Biometrika, (4), 1011-1018.

Belitser, E. and Nurushev, N. (2015). Needles and straw in a haystack: robust confidence for possibly sparse sequences. arXiv preprint arXiv:1511.01803.

Brown, T. E., Reichel, P. C., and Quinlan, D. M. (2009). Executive function impairments in high IQ adults with ADHD. Journal of Attention Disorders, (2), 161-167.

Bullmore, E. and Sporns, O. (2009). Complex brain networks: graph theoretical analysis of structural and functional systems. Nature Reviews. Neuroscience, (3), 186-198.

Caplan, B., Kreutzer, J. S., and DeLuca, J. (2011). Encyclopedia of Clinical Neuropsychology; With 199 Figures and 139 Tables. Springer.

Carson, S. H., Peterson, J. B., and Higgins, D. M. (2003). Decreased latent inhibition is associated with increased creative achievement in high-functioning individuals. Journal of Personality and Social Psychology, (3), 499.

Carvalho, C. M., Polson, N. G., and Scott, J. G. (2010). The horseshoe estimator for sparse signals. Biometrika, (2), 465-480.

Castillo, I., van der Vaart, A., et al. (2012). Needles and straw in a haystack: Posterior concentration for possibly sparse sequences. The Annals of Statistics, (4), 2069-2101.

Castillo, I., Rousseau, J., et al. (2015). A Bernstein-von Mises theorem for smooth functionals in semiparametric models. The Annals of Statistics, (6), 2353-2383.

Craddock, R. C., Holtzheimer III, P. E., Hu, X. P., and Mayberg, H. S. (2009). Disease state prediction from resting state functional connectivity. Magnetic Resonance in Medicine, (6), 1619-1628.

Daianu, M., Jahanshad, N., Nir, T. M., Toga, A. W., Jack Jr, C. R., Weiner, M. W., and Thompson, for the Alzheimer's Disease Neuroimaging Initiative, P. M. (2013). Breakdown of brain connectivity between normal aging and Alzheimer's disease: a structural k-core network analysis. Brain Connectivity, (4), 407-422.

Deshpande, M., Kuramochi, M., Wale, N., and Karypis, G. (2005). Frequent substructure-based approaches for classifying chemical compounds. IEEE Transactions on Knowledge and Data Engineering, (8), 1036-1050.

Desikan, R. S., Ségonne, F., Fischl, B., Quinn, B. T., Dickerson, B. C., Blacker, D., Buckner, R. L., Dale, A. M., Maguire, R. P., Hyman, B. T., et al. (2006). An automated labeling system for subdividing the human cerebral cortex on MRI scans into gyral based regions of interest. NeuroImage, (3), 968-980.
Durante, D. and Dunson, D. B. (2017). Bayesian inference and testing of group differences in brain networks. Bayesian Analysis, doi:10.1214/16-BA1030. Advance publication.

Erdős, P. and Rényi, A. (1960). On the evolution of random graphs. Publication of the Mathematical Institute of the Hungarian Academy of Sciences, (1), 17-60.

Fei, H. and Huan, J. (2010). Boosting with structure information in the functional space: an application to graph classification. In Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 643-652. ACM.

Frank, O. and Strauss, D. (1986). Markov graphs. Journal of the American Statistical Association, (395), 832-842.

Friedman, J., Hastie, T., and Tibshirani, R. (2010). Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software, (1), 1-22.

Gelman, A., Carlin, J. B., Stern, H. S., Dunson, D. B., Vehtari, A., and Rubin, D. B. (2014a). Bayesian Data Analysis, volume 2. CRC Press, Boca Raton, FL.

Gelman, A., Hwang, J., and Vehtari, A. (2014b). Understanding predictive information criteria for Bayesian models. Statistics and Computing, (6), 997-1016.

Ghosal, S., Roy, A., et al. (2006). Posterior consistency of Gaussian process prior for nonparametric binary regression. The Annals of Statistics, (5), 2413-2429.

Guha, S. and Rodriguez, A. (2020). Bayesian regression with undirected network predictors with an application to brain connectome data. Journal of the American Statistical Association, pages 1-34.

Helma, C., King, R. D., Kramer, S., and Srinivasan, A. (2001). The predictive toxicology challenge 2000-2001. Bioinformatics, (1), 107-108.

Hoff, P. D. (2005). Bilinear mixed-effects models for dyadic data. Journal of the American Statistical Association, (469), 286-295.

Hoff, P. D. (2009). Multiplicative latent factor models for description and prediction of social networks. Computational and Mathematical Organization Theory, (4), 261.

Hoff, P. D., Raftery, A. E., and Handcock, M. S. (2002). Latent space approaches to social network analysis. Journal of the American Statistical Association, (460), 1090-1098.

Kiar, G., Gray Roncal, W., Mhembere, D., Bridgeford, E., Burns, R., and Vogelstein, J. (2016). ndmg: NeuroData's MRI graphs pipeline.

Kiar, G., Gorgolewski, K., and Kleissas, D. (2017a). Example use case of SIC with the ndmg pipeline (SIC: ndmg). GigaScience Database.

Kiar, G., Gorgolewski, K. J., Kleissas, D., Roncal, W. G., Litt, B., Wandell, B., Poldrack, R. A., Wiener, M., Vogelstein, R. J., Burns, R., et al. (2017b). Science in the cloud (SIC): A use case in MRI connectomics. GigaScience, (5), 1-10.

Klenke, A. (2013). Probability Theory: A Comprehensive Course. Springer Science & Business Media.

Kolb, B. and Milner, B. (1981). Performance of complex arm and facial movements after focal brain lesions. Neuropsychologia, (4), 491-503.

Makalic, E. and Schmidt, D. F. (2015). A simple sampler for the horseshoe estimator. IEEE Signal Processing Letters, (1), 179-182.

Martin, R., Mess, R., Walker, S. G., et al. (2017). Empirical Bayes posterior concentration in sparse high-dimensional linear models. Bernoulli, (3), 1822-1847.

Miller, L. and Milner, B. (1985). Cognitive risk-taking after frontal or temporal lobectomy-II. The synthesis of phonemic and semantic information. Neuropsychologia, (3), 371-379.

Nowicki, K. and Snijders, T. A. B. (2001). Estimation and prediction for stochastic blockstructures. Journal of the American Statistical Association, (455), 1077-1087.
Olde Dubbelink, K. T., Hillebrand, A., Stoffers, D., Deijen, J. B., Twisk, J. W., Stam, C. J., and Berendse, H. W. (2013). Disrupted brain network topology in Parkinson's disease: a longitudinal magnetoencephalography study. Brain, (1), 197-207.

Park, T. and Casella, G. (2008). The Bayesian lasso. Journal of the American Statistical Association, (482), 681-686.

Polson, N. G., Scott, J. G., and Windle, J. (2013). Bayesian inference for logistic models using Pólya-Gamma latent variables. Journal of the American Statistical Association, (504), 1339-1349.

Razumnikova, O. M. (2007). Creativity related cortex activity in the remote associates task. Brain Research Bulletin, (1), 96-102.

Relión, J. D. A., Kessler, D., Levina, E., and Taylor, S. F. (2017). Network classification with applications to brain connectomics. arXiv preprint arXiv:1701.08140.

Richiardi, J., Eryilmaz, H., Schwartz, S., Vuilleumier, P., and Van De Ville, D. (2011). Decoding brain states from fMRI connectivity graphs. NeuroImage, (2), 616-626.

Song, Q. and Liang, F. (2017). Nearly optimal Bayesian shrinkage for high dimensional regression. arXiv preprint arXiv:1712.08964.

Srinivasan, A., Muggleton, S. H., Sternberg, M. J., and King, R. D. (1996). Theories for mutagenicity: A study in first-order and feature-based induction. Artificial Intelligence, (1-2), 277-299.

Stuss, D., Ely, P., Hugenholtz, H., Richard, M., LaRochelle, S., Poirier, C., and Bell, I. (1985). Subtle neuropsychological deficits in patients with good recovery after closed head injury. Neurosurgery, (1), 41-47.

Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B (Methodological), (1), 267-288.

Van Der Pas, S. L., Kleijn, B. J., Van Der Vaart, A. W., et al. (2014). The horseshoe estimator: Posterior concentration around nearly black vectors. Electronic Journal of Statistics, (2), 2585-2618.

Vishwanathan, S. V. N., Schraudolph, N. N., Kondor, R., and Borgwardt, K. M. (2010). Graph kernels. Journal of Machine Learning Research, (Apr), 1201-1242.

Vogelstein, J. T., Roncal, W. G., Vogelstein, R. J., and Priebe, C. E. (2013). Graph classification using signal-subgraphs: Applications in statistical connectomics. IEEE Transactions on Pattern Analysis and Machine Intelligence, (7), 1539-1551.

Wei, R. and Ghosal, S. (2017). Contraction properties of shrinkage priors in logistic regression.

Yoon, Y. B., Shin, W.-G., Lee, T. Y., Hur, J.-W., Cho, K. I. K., Sohn, W. S., Kim, S.-G., Lee, K.-H., and Kwon, J. S. (2017). Brain structural networks associated with intelligence and visuomotor ability. Scientific Reports, (1), 2177.

Zhang, J., Cheng, W., Wang, Z., Zhang, Z., Lu, W., Lu, G., and Feng, J. (2012). Pattern classification of large-scale functional brain networks: identification of informative neuroimaging markers for epilepsy. PLoS ONE, (5), e36733.

8.1 Appendix A

This section provides the full conditionals for all the parameters in the Bayesian binary network regression with the network lasso shrinkage prior on $\boldsymbol{\gamma}$. Assume $\boldsymbol{W} = (\boldsymbol{u}_1'\Lambda\boldsymbol{u}_2, ..., \boldsymbol{u}_1'\Lambda\boldsymbol{u}_V, ..., \boldsymbol{u}_{V-1}'\Lambda\boldsymbol{u}_V)'$, $\boldsymbol{D} = \mathrm{diag}(s_{1,2}, ..., s_{V-1,V})$ and $\boldsymbol{\gamma} = (\gamma_{1,2}, ..., \gamma_{V-1,V})'$.
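Since this notation packs the upper triangle of the network into long vectors, a small sketch may help fix ideas. The code below is illustrative only: the helper names and the choice of NumPy are ours, and it simply shows how $\boldsymbol{\gamma}$, $\boldsymbol{W}$ and the diagonal of $\boldsymbol{D}$ line up with node pairs $(k, l)$, $k < l$.

```python
import numpy as np

V, R = 5, 3                      # V nodes, R-dimensional latent vectors
rng = np.random.default_rng(1)

U = rng.normal(size=(V, R))      # rows are the latent node vectors u_k
lam = np.array([1.0, 1.0, 0.0])  # diagonal of Lambda (binary in the model)
S = rng.exponential(size=(V, V)) # s_{k,l}; only the upper triangle is used

# Vectorize over node pairs (k, l) with k < l, in the same order for all
# quantities, so gamma, W and the diagonal of D align entry by entry.
pairs = [(k, l) for k in range(V) for l in range(k + 1, V)]
W = np.array([U[k] @ np.diag(lam) @ U[l] for k, l in pairs])
D_diag = np.array([S[k, l] for k, l in pairs])

# The network lasso prior then reads gamma ~ N(W, diag(D_diag)).
gamma = rng.normal(loc=W, scale=np.sqrt(D_diag))
print(gamma.shape)               # (V*(V-1)/2,) = (10,)
```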
Thus, with $n$ data points, the hierarchical model with the network lasso prior in the binary setting can be written as
$$\boldsymbol{t} \sim \mathrm{N}(\mu\boldsymbol{1} + \boldsymbol{X}\boldsymbol{\gamma},\ \boldsymbol{\Omega}^{-1}), \qquad \boldsymbol{\gamma} \sim \mathrm{N}(\boldsymbol{W}, \boldsymbol{D}),$$
$$\boldsymbol{u}_k \mid \xi_k = 1 \sim \mathrm{N}(\boldsymbol{0}, \boldsymbol{Q}), \qquad \boldsymbol{u}_k \mid \xi_k = 0 \sim \delta_{\boldsymbol{0}}, \qquad \xi_k \sim \mathrm{Ber}(\Delta), \qquad \mu \sim \mathrm{flat},$$
$$s_{k,l} \sim \mathrm{Exp}(\theta^2/2), \qquad \theta^2 \sim \mathrm{Gamma}(\zeta, \iota), \qquad \boldsymbol{Q} \sim \mathrm{IW}(\nu, \boldsymbol{I}), \qquad \Delta \sim \mathrm{Beta}(a_\Delta, b_\Delta),$$
$$\omega_i \sim \mathrm{PG}(1, 0), \qquad \lambda_r \sim \mathrm{Ber}(\pi_r), \qquad \pi_r \sim \mathrm{Beta}(1, r^\eta), \qquad \eta > 1,$$
where $\boldsymbol{\Omega} = \mathrm{diag}(\omega_1, ..., \omega_n)$ and $\Lambda = \mathrm{diag}(\lambda_1, ..., \lambda_R)$. The full conditional distributions of the model parameters are given below.

• $\mu \mid - \sim \mathrm{N}\Big(\frac{\boldsymbol{1}'\boldsymbol{\Omega}(\boldsymbol{t} - \boldsymbol{X\gamma})}{\boldsymbol{1}'\boldsymbol{\Omega}\boldsymbol{1}},\ \frac{1}{\boldsymbol{1}'\boldsymbol{\Omega}\boldsymbol{1}}\Big)$.

• $\boldsymbol{\gamma} \mid - \sim \mathrm{N}(\boldsymbol{\mu}_{\gamma|\cdot}, \boldsymbol{\Sigma}_{\gamma|\cdot})$, where $\boldsymbol{\mu}_{\gamma|\cdot} = (\boldsymbol{X}'\boldsymbol{\Omega}\boldsymbol{X} + \boldsymbol{D}^{-1})^{-1}(\boldsymbol{X}'\boldsymbol{\Omega}(\boldsymbol{t} - \mu\boldsymbol{1}) + \boldsymbol{D}^{-1}\boldsymbol{W})$ and $\boldsymbol{\Sigma}_{\gamma|\cdot} = (\boldsymbol{X}'\boldsymbol{\Omega}\boldsymbol{X} + \boldsymbol{D}^{-1})^{-1}$.

• $s_{k,l} \mid - \sim \mathrm{GIG}\big(\tfrac{1}{2},\ (\gamma_{k,l} - \boldsymbol{u}_k'\Lambda\boldsymbol{u}_l)^2,\ \theta^2\big)$, where GIG denotes the generalized inverse Gaussian distribution.

• $\theta^2 \mid - \sim \mathrm{Gamma}\big(\zeta + \tfrac{V(V-1)}{2},\ \iota + \tfrac{1}{2}\sum_{1 \le k < l \le V} s_{k,l}\big)$.

• Following the Pólya-Gamma data augmentation scheme for logistic models (Polson et al., 2013), we obtain $\omega_i \mid - \sim \mathrm{PG}(1, \mu + \boldsymbol{x}_i'\boldsymbol{\gamma})$, for $i = 1, .., n$.
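As a companion to these conditionals, here is a minimal sketch of one Gibbs scan over the $(\omega_i, \boldsymbol{\gamma}, \mu, s_{k,l}, \theta^2)$ block, holding the latent node quantities $(\boldsymbol{u}_k, \Lambda)$, and hence $\boldsymbol{W}$, fixed. The `polyagamma` package for PG draws, the inverse-Gaussian route to the GIG draw, and all variable names are our assumptions rather than the paper's implementation.

```python
import numpy as np
# polyagamma is an assumed external dependency providing PG(h, z) draws.
from polyagamma import random_polyagamma

rng = np.random.default_rng(0)

def gibbs_scan(y, X, W, s, theta2, mu, gamma, zeta, iota):
    """One sweep over (omega, gamma, mu, s, theta2); W is held fixed."""
    n, q = X.shape

    # 1) omega_i | - ~ PG(1, mu + x_i' gamma); t_i = (y_i - 1/2) / omega_i are
    #    the pseudo-responses of the augmentation (Polson et al., 2013).
    psi = mu + X @ gamma
    omega = random_polyagamma(1, psi, random_state=rng)
    t = (y - 0.5) / omega

    # 2) gamma | - ~ N(mu_g, Sigma_g), Sigma_g = (X' Omega X + D^{-1})^{-1}.
    Dinv = np.diag(1.0 / s)
    Sigma_g = np.linalg.inv(X.T @ (omega[:, None] * X) + Dinv)
    mu_g = Sigma_g @ (X.T @ (omega * (t - mu)) + Dinv @ W)
    gamma = rng.multivariate_normal(mu_g, Sigma_g)

    # 3) mu | - ~ N(1' Omega (t - X gamma) / (1' Omega 1), 1 / (1' Omega 1)).
    denom = omega.sum()
    mu = rng.normal((omega * (t - X @ gamma)).sum() / denom,
                    np.sqrt(1.0 / denom))

    # 4) s_{k,l} | - ~ GIG(1/2, (gamma_{k,l} - W_{k,l})^2, theta2), drawn via
    #    the inverse-Gaussian form of Park and Casella (2008):
    #    1/s ~ InvGauss(sqrt(theta2 / resid^2), theta2).
    resid2 = (gamma - W) ** 2
    s = 1.0 / rng.wald(np.sqrt(theta2 / resid2), theta2)

    # 5) theta2 | - ~ Gamma(zeta + V(V-1)/2, iota + sum(s)/2); q = V(V-1)/2.
    theta2 = rng.gamma(zeta + q, 1.0 / (iota + 0.5 * s.sum()))
    return omega, gamma, mu, s, theta2
```

A full sampler would also update $(\boldsymbol{u}_k, \xi_k, \Lambda, \boldsymbol{Q}, \Delta)$; those conditionals are not reproduced in this excerpt and are omitted here.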
8.2 Appendix B

This section provides the full conditionals for all the parameters in the Bayesian network classifier model introduced in this article with the network horseshoe prior. Assume $\boldsymbol{W} = (\boldsymbol{u}_1'\Lambda\boldsymbol{u}_2, ..., \boldsymbol{u}_1'\Lambda\boldsymbol{u}_V, ..., \boldsymbol{u}_{V-1}'\Lambda\boldsymbol{u}_V)'$, $\boldsymbol{D} = \mathrm{diag}(\sigma^2 s_{1,2}^2, ..., \sigma^2 s_{V-1,V}^2)$ and $\boldsymbol{\gamma} = (\gamma_{1,2}, ..., \gamma_{V-1,V})'$. Thus, with $n$ data points, the hierarchical model with the network horseshoe prior in the binary setting can be written as
$$\boldsymbol{t} \sim \mathrm{N}(\mu\boldsymbol{1} + \boldsymbol{X}\boldsymbol{\gamma},\ \boldsymbol{\Omega}^{-1}), \qquad \boldsymbol{\gamma} \sim \mathrm{N}(\boldsymbol{W}, \boldsymbol{D}),$$
$$\boldsymbol{u}_k \mid \xi_k = 1 \sim \mathrm{N}(\boldsymbol{0}, \boldsymbol{Q}), \qquad \boldsymbol{u}_k \mid \xi_k = 0 \sim \delta_{\boldsymbol{0}}, \qquad \xi_k \sim \mathrm{Ber}(\Delta), \qquad \mu \sim \mathrm{flat},$$
$$s_{k,l} \sim C^+(0, 1), \qquad \sigma \sim C^+(0, 1), \qquad \boldsymbol{Q} \sim \mathrm{IW}(\nu, \boldsymbol{I}), \qquad \Delta \sim \mathrm{Beta}(a_\Delta, b_\Delta),$$
$$\omega_i \sim \mathrm{PG}(1, 0), \qquad \lambda_r \sim \mathrm{Ber}(\pi_r), \qquad \pi_r \sim \mathrm{Beta}(1, r^\eta), \qquad \eta > 1.$$
Note that, following Makalic and Schmidt (2015), the half-Cauchy priors $s_{k,l} \sim C^+(0,1)$ and $\sigma \sim C^+(0,1)$ admit the inverse-gamma mixture representation
$$s_{k,l}^2 \mid \nu_{k,l} \sim \mathrm{IG}\Big(\frac{1}{2}, \frac{1}{\nu_{k,l}}\Big), \qquad \nu_{k,l} \sim \mathrm{IG}\Big(\frac{1}{2}, 1\Big), \qquad \sigma^2 \mid \varsigma \sim \mathrm{IG}\Big(\frac{1}{2}, \frac{1}{\varsigma}\Big), \qquad \varsigma \sim \mathrm{IG}\Big(\frac{1}{2}, 1\Big).$$
With the model formulation described above, the full conditional distributions of the model parameters are given by the following distributions:

• $\mu \mid - \sim \mathrm{N}\Big(\frac{\boldsymbol{1}'\boldsymbol{\Omega}(\boldsymbol{t} - \boldsymbol{X\gamma})}{\boldsymbol{1}'\boldsymbol{\Omega}\boldsymbol{1}},\ \frac{1}{\boldsymbol{1}'\boldsymbol{\Omega}\boldsymbol{1}}\Big)$.

• $\boldsymbol{\gamma} \mid - \sim \mathrm{N}(\boldsymbol{\mu}_{\gamma|\cdot}, \boldsymbol{\Sigma}_{\gamma|\cdot})$, where $\boldsymbol{\mu}_{\gamma|\cdot} = (\boldsymbol{X}'\boldsymbol{\Omega}\boldsymbol{X} + \boldsymbol{D}^{-1})^{-1}(\boldsymbol{X}'\boldsymbol{\Omega}(\boldsymbol{t} - \mu\boldsymbol{1}) + \boldsymbol{D}^{-1}\boldsymbol{W})$ and $\boldsymbol{\Sigma}_{\gamma|\cdot} = (\boldsymbol{X}'\boldsymbol{\Omega}\boldsymbol{X} + \boldsymbol{D}^{-1})^{-1}$.

• $s_{k,l}^2 \mid - \sim \mathrm{IG}\Big(1,\ \frac{1}{\nu_{k,l}} + \frac{(\gamma_{k,l} - \boldsymbol{u}_k'\Lambda\boldsymbol{u}_l)^2}{2\sigma^2}\Big)$, and $\nu_{k,l} \mid - \sim \mathrm{IG}\Big(1,\ 1 + \frac{1}{s_{k,l}^2}\Big)$.

• $\sigma^2 \mid - \sim \mathrm{IG}\Big(\frac{1}{2}\Big(1 + \frac{V(V-1)}{2}\Big),\ \frac{1}{\varsigma} + \frac{1}{2}\sum_{1 \le k < l \le V}\frac{(\gamma_{k,l} - \boldsymbol{u}_k'\Lambda\boldsymbol{u}_l)^2}{s_{k,l}^2}\Big)$, and $\varsigma \mid - \sim \mathrm{IG}\Big(1,\ 1 + \frac{1}{\sigma^2}\Big)$.

• Following the Pólya-Gamma data augmentation scheme (Polson et al., 2013), we obtain $\omega_i \mid - \sim \mathrm{PG}(1, \mu + \boldsymbol{x}_i'\boldsymbol{\gamma})$, for $i = 1, .., n$.
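The inverse-gamma conditionals above make the horseshoe scale updates one-liners. Below is a minimal sketch of that block, with our own variable names and under the same caveats as the lasso sketch; since NumPy only provides gamma draws, inverse-gamma draws are taken as reciprocals of gamma draws.

```python
import numpy as np

def inv_gamma(shape, scale, rng):
    """Draw IG(shape, scale) as the reciprocal of a Gamma(shape, 1/scale) draw."""
    return 1.0 / rng.gamma(shape, 1.0 / scale)

def horseshoe_scale_updates(gamma, W, s2, nu, sigma2, varsigma, rng):
    """Makalic-Schmidt (2015) slice of the Gibbs sampler for the network horseshoe."""
    resid2 = (gamma - W) ** 2      # (gamma_{k,l} - u_k' Lambda u_l)^2, vectorized
    q = resid2.size                # q = V(V-1)/2 edges

    # Local scales and their auxiliaries.
    s2 = inv_gamma(1.0, 1.0 / nu + resid2 / (2.0 * sigma2), rng)
    nu = inv_gamma(1.0, 1.0 + 1.0 / s2, rng)

    # Global scale and its auxiliary.
    sigma2 = inv_gamma(0.5 * (1.0 + q),
                       1.0 / varsigma + 0.5 * np.sum(resid2 / s2), rng)
    varsigma = inv_gamma(1.0, 1.0 + 1.0 / sigma2, rng)
    return s2, nu, sigma2, varsigma
```

These updates would slot into the same scan as the lasso sketch, replacing its steps 4) and 5), with the prior covariance diagonal now $\sigma^2 s_{k,l}^2$.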
8.3 Appendix C

Similar to the assumptions made by Wei and Ghosal (2017) in their proof of posterior consistency for binary logistic regression, we prove our results assuming that the centering parameter $\mu = 0$ in both the true and the data generating models. We note that the main structure of the proof remains unchanged under this assumption, and the result proved in this article can be trivially extended to the setting with nonzero $\mu$.

We begin by defining some notation. Throughout the proof, $\Pi(\cdot)$ denotes the generic probability notation. We define the log-likelihood function by
$$w_{\boldsymbol{\gamma},n}(\boldsymbol{y}^n) = \sum_{i=1}^{n}\big[(\boldsymbol{x}_i'\boldsymbol{\gamma})y_i - z(\boldsymbol{x}_i'\boldsymbol{\gamma})\big], \qquad z(\boldsymbol{x}_i'\boldsymbol{\gamma}) = \log\big(1 + \exp(\boldsymbol{x}_i'\boldsymbol{\gamma})\big). \tag{9}$$
We also introduce the function $C_{\boldsymbol{y}^n,n}(\cdot)$ to quantify the curvature of $w_{\boldsymbol{\gamma},n}(\boldsymbol{y}^n)$ around $\boldsymbol{\gamma}^{(0)}$,
$$C_{\boldsymbol{y}^n,n}(\boldsymbol{\gamma}) = w_{\boldsymbol{\gamma},n}(\boldsymbol{y}^n) - w_{\boldsymbol{\gamma}^{(0)},n}(\boldsymbol{y}^n) - \nabla w_{\boldsymbol{\gamma}^{(0)},n}(\boldsymbol{y}^n)'(\boldsymbol{\gamma} - \boldsymbol{\gamma}^{(0)}), \tag{10}$$
where $\nabla w_{\boldsymbol{\gamma}^{(0)},n}(\boldsymbol{y}^n)$ is the derivative of $w_{\boldsymbol{\gamma},n}(\boldsymbol{y}^n)$ with respect to $\boldsymbol{\gamma}$, evaluated at $\boldsymbol{\gamma}^{(0)}$. The likelihood $p_{\boldsymbol{\gamma}}(\boldsymbol{y}^n)$ can be written using the above notation as $p_{\boldsymbol{\gamma}}(\boldsymbol{y}^n) = \exp(w_{\boldsymbol{\gamma},n}(\boldsymbol{y}^n))$. The notations $E_{\boldsymbol{\gamma}}(\cdot)$ and $E_{\boldsymbol{\gamma}^{(0)}}(\cdot)$ are reserved for expectations with respect to the distributions of $\boldsymbol{y}^n \mid \boldsymbol{\gamma}$ and $\boldsymbol{y}^n \mid \boldsymbol{\gamma}^{(0)}$, respectively.

The proof of Theorem 3.1 relies in part on the existence of an exponentially consistent sequence of tests.

Definition. An exponentially consistent sequence of test functions $\Phi_n$ for testing $H_0: \boldsymbol{\gamma} = \boldsymbol{\gamma}^{(0)}$ vs. $H_1: \boldsymbol{\gamma} \in \mathcal{A}_n^c$ satisfies
$$E_{\boldsymbol{\gamma}^{(0)}}(\Phi_n) \le d_1\exp(-h_1 n), \qquad \sup_{\boldsymbol{\gamma}\in\mathcal{A}_n^c} E_{\boldsymbol{\gamma}}(1 - \Phi_n) \le d_2\exp(-h_2 n)$$
for some $d_1, d_2, h_1, h_2 > 0$.

Lemma 8.1. For some $h > 0$, there exists a sequence of test functions for testing $H_0: \boldsymbol{\gamma} = \boldsymbol{\gamma}^{(0)}$ vs. $H_1: \boldsymbol{\gamma} \in \mathcal{A}_n^c$ which satisfies
$$E_{\boldsymbol{\gamma}^{(0)}}(\Phi_n) \le \exp(-hn), \qquad \sup_{\boldsymbol{\gamma}\in\mathcal{A}_n^c} E_{\boldsymbol{\gamma}}(1 - \Phi_n) \le \exp(-hn). \tag{11}$$

Proof. The construction of the test is provided in the proofs of Theorem 2 and Lemma 4 in Ghosal et al. (2006).

We also state another result which will be used subsequently in the proof.

Lemma 8.2. Let $\boldsymbol{u}_k^{(0)} = (u_{k,1}^{(0)}, ..., u_{k,R}^{(0)})'$ for $k = 1, .., V_n$, and let $\upsilon_{k,l}$ be the only positive root of the equation
$$x^2 + x\big(\|\boldsymbol{u}_k^{(0)}\| + \|\boldsymbol{u}_l^{(0)}\|\big) - \eta = 0, \qquad k < l. \tag{12}$$
Assume $\upsilon = \min_{k,l}\upsilon_{k,l}$. Then, for $\boldsymbol{W} = (\boldsymbol{u}_1'\boldsymbol{u}_2, ..., \boldsymbol{u}_{V_n-1}'\boldsymbol{u}_{V_n})'$ and $\boldsymbol{W}^{(0)} = (\boldsymbol{u}_1^{(0)'}\boldsymbol{u}_2^{(0)}, ..., \boldsymbol{u}_{V_n-1}^{(0)'}\boldsymbol{u}_{V_n}^{(0)})'$,
$$\Pi\big(\|\boldsymbol{W} - \boldsymbol{W}^{(0)}\|_\infty < \eta\big) \ge \Pi\big(\|\boldsymbol{u}_k - \boldsymbol{u}_k^{(0)}\| \le \upsilon,\ \forall k = 1, .., V_n\big). \tag{13}$$

Proof. For $k < l$,
$$\big|\boldsymbol{u}_k'\boldsymbol{u}_l - \boldsymbol{u}_k^{(0)'}\boldsymbol{u}_l^{(0)}\big| = \Big|\sum_{r=1}^{R} u_{k,r}u_{l,r} - \sum_{r=1}^{R} u_{k,r}^{(0)}u_{l,r}^{(0)}\Big| \le \Big|\sum_{r=1}^{R}(u_{k,r} - u_{k,r}^{(0)})u_{l,r}\Big| + \Big|\sum_{r=1}^{R}(u_{l,r} - u_{l,r}^{(0)})u_{k,r}^{(0)}\Big|$$
$$\le \|\boldsymbol{u}_k - \boldsymbol{u}_k^{(0)}\|\,\|\boldsymbol{u}_l\| + \|\boldsymbol{u}_l - \boldsymbol{u}_l^{(0)}\|\,\|\boldsymbol{u}_k^{(0)}\| \le \|\boldsymbol{u}_k - \boldsymbol{u}_k^{(0)}\|\big[\|\boldsymbol{u}_l - \boldsymbol{u}_l^{(0)}\| + \|\boldsymbol{u}_l^{(0)}\|\big] + \|\boldsymbol{u}_l - \boldsymbol{u}_l^{(0)}\|\,\|\boldsymbol{u}_k^{(0)}\|.$$
If $\|\boldsymbol{u}_k - \boldsymbol{u}_k^{(0)}\| \le \upsilon$ for all $k = 1, .., V_n$, the above inequality implies $|\boldsymbol{u}_k'\boldsymbol{u}_l - \boldsymbol{u}_k^{(0)'}\boldsymbol{u}_l^{(0)}| \le \upsilon(\upsilon + \|\boldsymbol{u}_l^{(0)}\|) + \upsilon\|\boldsymbol{u}_k^{(0)}\| \le \eta$ for all $k < l$, by the definition of $\upsilon$. Hence $\Pi(\|\boldsymbol{W} - \boldsymbol{W}^{(0)}\|_\infty < \eta) \ge \Pi(\|\boldsymbol{u}_k - \boldsymbol{u}_k^{(0)}\| \le \upsilon,\ \forall k = 1, .., V_n)$.

Proof of Theorem 3.1. Suppose $E_n = \big\{\boldsymbol{y} : \|\nabla w_{\boldsymbol{\gamma}^{(0)},n}(\boldsymbol{y})\|_\infty \le \sqrt{nq_n}\big\}$. Then the probability of the vector $\boldsymbol{y}^n$ belonging to the set $E_n$ satisfies
$$P_{\boldsymbol{\gamma}^{(0)}}(\boldsymbol{y}^n \in E_n) \ge 1 - P_{\boldsymbol{\gamma}^{(0)}}\Big(\max_{1\le j\le q_n}\Big|\sum_{i=1}^{n}\big(y_i - \nabla z(\boldsymbol{x}_i'\boldsymbol{\gamma}^{(0)})\big)x_{ij}\Big| > \sqrt{nq_n}\Big) \ge 1 - 2q_n e^{-q_n/2},$$
where the last step follows from the Hoeffding inequality. Note that as $n \to \infty$, $q_n \to \infty$, hence $P_{\boldsymbol{\gamma}^{(0)}}(\boldsymbol{y}^n \in E_n) \to 1$. Hence, in the subsequent proof we can assume without loss of generality that $\boldsymbol{y}^n \in E_n$. It can be observed that
$$\Pi_n(\mathcal{A}_n^c) = \frac{\int_{\mathcal{A}_n^c} p_{\boldsymbol{\gamma}}(\boldsymbol{y}^n)\,\pi_n(\boldsymbol{\gamma})\,d\boldsymbol{\gamma}}{\int p_{\boldsymbol{\gamma}}(\boldsymbol{y}^n)\,\pi_n(\boldsymbol{\gamma})\,d\boldsymbol{\gamma}} = \frac{\int_{\mathcal{A}_n^c} \frac{p_{\boldsymbol{\gamma}}(\boldsymbol{y}^n)}{p_{\boldsymbol{\gamma}^{(0)}}(\boldsymbol{y}^n)}\,\pi_n(\boldsymbol{\gamma})\,d\boldsymbol{\gamma}}{\int \frac{p_{\boldsymbol{\gamma}}(\boldsymbol{y}^n)}{p_{\boldsymbol{\gamma}^{(0)}}(\boldsymbol{y}^n)}\,\pi_n(\boldsymbol{\gamma})\,d\boldsymbol{\gamma}} = \frac{N_n}{D_n} \le \Phi_n + (1 - \Phi_n)\frac{N_n}{D_n}, \tag{14}$$
where $\Phi_n$ is the exponentially consistent sequence of tests given in Lemma 8.1. The above inequality holds since $N_n/D_n \le 1$; this is in turn true as both are integrals of the same nonnegative function, $D_n$ being the integral over the entire set of possible $\boldsymbol{\gamma}$'s, while $N_n$ is the integral over the subset $\mathcal{A}_n^c$.
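For completeness, one way to carry out the Hoeffding step is sketched below, assuming the columns of the design are normalized so that $|x_{ij}| \le 1$; this normalization, and hence the exact constants, are our assumptions for illustration rather than the paper's stated conditions.

```latex
% Hoeffding step sketch: Z_i = y_i - \nabla z(x_i' \gamma^{(0)}) \in [-1, 1],
% so each summand Z_i x_{ij} lies in an interval of length at most 2.
\begin{align*}
P\Big(\Big|\sum_{i=1}^n Z_i x_{ij}\Big| > \sqrt{n q_n}\Big)
  &\le 2\exp\Big(-\frac{2 n q_n}{\sum_{i=1}^n 2^2}\Big) = 2 e^{-q_n/2}, \\
P\Big(\max_{1\le j\le q_n}\Big|\sum_{i=1}^n Z_i x_{ij}\Big| > \sqrt{n q_n}\Big)
  &\le 2 q_n e^{-q_n/2} \longrightarrow 0 \quad\text{as } q_n \to \infty .
\end{align*}
```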
In proving Theorem 3.1, we proceed in the following three steps.

(a) Step 1 shows that $\Phi_n \to 0$ as $n \to \infty$, almost surely.
(b) Step 2 shows that $\exp(hn/2)(1 - \Phi_n)N_n \to 0$ as $n \to \infty$, almost surely.
(c) Finally, Step 3 shows that $\exp(hn/2)D_n \to \infty$ as $n \to \infty$.

Here $h$ is as defined in Lemma 8.1. By (14), (a)-(c) imply $\Pi_n(\mathcal{A}_n^c) \to 0$. We now prove (a)-(c).

(a) Step 1. An application of the Markov inequality and (11) in Lemma 8.1 yields
$$P_{\boldsymbol{\gamma}^{(0)}}\big(\Phi_n > \exp(-nh/2)\big) \le E_{\boldsymbol{\gamma}^{(0)}}(\Phi_n)\exp(nh/2) \le \exp(-nh/2).$$
Therefore $\sum_{n=1}^{\infty} P_{\boldsymbol{\gamma}^{(0)}}(\Phi_n > \exp(-nh/2)) < \infty$. Applying the Borel-Cantelli lemma, there exist $n_0$ and a set $\Omega_0$ with $P_{\boldsymbol{\gamma}^{(0)}}(\Omega_0) = 0$ such that, for all $n > n_0$, $\Phi_n(\omega) < \exp(-nh/2)$ for every $\omega \in \Omega_0^c$. Since $\exp(-nh/2) \to 0$, this means that
$$\Phi_n \to 0, \quad n \to \infty, \quad \text{a.s.} \tag{15}$$

(b) Step 2. We have
$$E_{\boldsymbol{\gamma}^{(0)}}\big((1 - \Phi_n)N_n\big) = \int (1 - \Phi_n)\int_{\mathcal{A}_n^c}\frac{p_{\boldsymbol{\gamma}}(\boldsymbol{y}^n)}{p_{\boldsymbol{\gamma}^{(0)}}(\boldsymbol{y}^n)}\,\pi_n(\boldsymbol{\gamma})\,d\boldsymbol{\gamma}\; p_{\boldsymbol{\gamma}^{(0)}}(\boldsymbol{y}^n)\,d\boldsymbol{y}^n = \int_{\mathcal{A}_n^c} E_{\boldsymbol{\gamma}}(1 - \Phi_n)\,\pi_n(\boldsymbol{\gamma})\,d\boldsymbol{\gamma}$$
$$\le \sup_{\boldsymbol{\gamma}\in\mathcal{A}_n^c} E_{\boldsymbol{\gamma}}(1 - \Phi_n)\;\Pi(\mathcal{A}_n^c) \le \sup_{\boldsymbol{\gamma}\in\mathcal{A}_n^c} E_{\boldsymbol{\gamma}}(1 - \Phi_n) \le \exp(-nh).$$
Consider the set $G_{n,h} = \{(1 - \Phi_n)N_n\exp(nh/2) > \exp(-nh/4)\}$. The above inequality, together with the Markov inequality, gives $P_{\boldsymbol{\gamma}^{(0)}}(G_{n,h}) \le \exp(-nh/4)$, so that $\sum_{n=1}^{\infty} P_{\boldsymbol{\gamma}^{(0)}}(G_{n,h}) < \infty$. Again, since $h$ is fixed, applying the Borel-Cantelli lemma, $P_{\boldsymbol{\gamma}^{(0)}}(\limsup_{n\to\infty} G_{n,h}) = 0$. Using the definition of the limsup of the sets $G_{n,h}$ (Klenke, 2013), $G_{n,h}$ happens infinitely often with probability zero. Thus there exist $n_1$ and a set $\Omega_1$ with $P_{\boldsymbol{\gamma}^{(0)}}(\Omega_1) = 0$ such that, for all $n > n_1$, $(1 - \Phi_n(\omega))N_n\exp(nh/2) < \exp(-nh/4)$ for every $\omega \in \Omega_1^c$. Since $\exp(-nh/4) \to 0$, this means that
$$\exp(nh/2)(1 - \Phi_n)N_n \to 0 \quad \text{a.s.} \tag{16}$$

(c) Step 3. We have
$$D_n = \int \frac{p_{\boldsymbol{\gamma}}(\boldsymbol{y}^n)}{p_{\boldsymbol{\gamma}^{(0)}}(\boldsymbol{y}^n)}\,\pi(\boldsymbol{\gamma})\,d\boldsymbol{\gamma} = \int \exp\big(\nabla w_{\boldsymbol{\gamma}^{(0)},n}(\boldsymbol{y}^n)'(\boldsymbol{\gamma} - \boldsymbol{\gamma}^{(0)}) + C_{\boldsymbol{y}^n,n}(\boldsymbol{\gamma})\big)\,\pi(\boldsymbol{\gamma})\,d\boldsymbol{\gamma}$$
$$\ge \int \exp\Big(-\|\nabla w_{\boldsymbol{\gamma}^{(0)},n}(\boldsymbol{y}^n)\|_\infty\,\|\boldsymbol{\gamma} - \boldsymbol{\gamma}^{(0)}\|_1 - \frac{n}{8}\|\boldsymbol{\gamma} - \boldsymbol{\gamma}^{(0)}\|_1^2\Big)\,\pi(\boldsymbol{\gamma})\,d\boldsymbol{\gamma}$$
$$\ge \int \exp\Big(-\sqrt{nq_n}\,\|\boldsymbol{\gamma} - \boldsymbol{\gamma}^{(0)}\|_1 - \frac{n}{8}\|\boldsymbol{\gamma} - \boldsymbol{\gamma}^{(0)}\|_1^2\Big)\,\pi(\boldsymbol{\gamma})\,d\boldsymbol{\gamma}$$
$$\ge \exp\Big(-\frac{\sqrt{nq_n}\,\eta}{n^{\rho/2}} - \frac{n\eta^2}{8n^{\rho}}\Big)\,\Pi\Big(\|\boldsymbol{\gamma} - \boldsymbol{\gamma}^{(0)}\|_1 < \frac{\eta}{n^{\rho/2}}\Big),$$
where $\rho$ is as defined in the statement of the theorem. The inequality in the second line follows from a Taylor series expansion, taking into account that $\nabla^2 z(\cdot) \le 1/4$ (with $z(\cdot)$ defined in (9)), which holds since $\frac{d}{df}\frac{e^f}{1+e^f} = \frac{e^f}{(1+e^f)^2} \le \frac{1}{4}$. The inequality in the third line follows from the fact that $\boldsymbol{y}^n \in E_n$.

First, observe that, given all the hierarchical parameters, the Bayesian network lasso prior distribution on $\boldsymbol{\gamma}$ can be written as $\boldsymbol{\gamma} = \boldsymbol{W} + \boldsymbol{\gamma}_2$, where $\boldsymbol{\gamma}_2$ follows the ordinary Bayesian lasso shrinkage prior. With this observation, one can see that
$$\Pi\Big(\|\boldsymbol{\gamma} - \boldsymbol{\gamma}^{(0)}\|_1 < \frac{\eta}{n^{\rho/2}}\Big) \ge \Pi\Big(\|\boldsymbol{\gamma}_2 - \boldsymbol{\gamma}_2^{(0)}\|_1 < \frac{\eta}{2n^{\rho/2}}\Big)\,\Pi\Big(\|\boldsymbol{W} - \boldsymbol{W}^{(0)}\|_1 < \frac{\eta}{2n^{\rho/2}}\Big),$$
where $\boldsymbol{W}$ and $\boldsymbol{W}^{(0)}$ are as defined in Lemma 8.2. We will show sequentially that

(i) $-\log \Pi\big(\|\boldsymbol{W} - \boldsymbol{W}^{(0)}\|_1 < \frac{\eta}{2n^{\rho/2}}\big) = o(n)$, and
(ii) $-\log \Pi\big(\|\boldsymbol{\gamma}_2 - \boldsymbol{\gamma}_2^{(0)}\|_1 < \frac{\eta}{2n^{\rho/2}}\big) = o(n)$.

(i) Note that, with $R$ (the dimension of the latent variables) and $\Delta$ (the probability of a node being influential) as defined before, we obtain
$$\Pi\Big(\|\boldsymbol{W} - \boldsymbol{W}^{(0)}\|_1 < \frac{\eta}{2n^{\rho/2}}\Big) \ge \Pi\big(\|\boldsymbol{u}_k - \boldsymbol{u}_k^{(0)}\| \le \upsilon_n,\ \forall k = 1, .., V_n\big) \ge E\Big[\Pi\big(\|\boldsymbol{u}_k - \boldsymbol{u}_k^{(0)}\| \le \upsilon_n,\ \forall k \mid \Delta\big)\Big]$$
$$\ge E\Bigg[\prod_{k=1}^{V_n}\Big\{\exp\Big(-\frac{\boldsymbol{u}_k^{(0)'}\boldsymbol{u}_k^{(0)}}{2}\Big)\,\Pi\big(\|\boldsymbol{u}_k\| \le \upsilon_n \mid \Delta\big)\Big\}\Bigg], \tag{17}$$
where the first inequality follows from Lemma 8.2 by replacing $\eta$ with $\eta/(2n^{\rho/2})$, with a slight abuse of notation, and $\upsilon_n$ is defined accordingly. The last inequality follows from Anderson's lemma. We now use the fact that $\int_{-a}^{a}\exp(-x^2/2)\,dx \ge 2a\exp(-a^2/2)$ to conclude
$$\Pi\big(\|\boldsymbol{u}_k\| \le \upsilon_n \mid \Delta\big) \ge \prod_{r=1}^{R}\Pi\Big(|u_{k,r}| \le \frac{\upsilon_n}{\sqrt{R}}\,\Big|\,\Delta\Big) = \prod_{r=1}^{R}\Bigg((1-\Delta) + \frac{\Delta}{\sqrt{2\pi}}\int_{-\upsilon_n/\sqrt{R}}^{\upsilon_n/\sqrt{R}} e^{-x^2/2}\,dx\Bigg)$$
$$\ge \prod_{r=1}^{R}\Bigg((1-\Delta) + \frac{\Delta}{\sqrt{2\pi}}\exp\Big(-\frac{\upsilon_n^2}{2R}\Big)\frac{2\upsilon_n}{\sqrt{R}}\Bigg) = \Bigg[(1-\Delta) + \frac{2\Delta}{\sqrt{2\pi R}}\,\upsilon_n\exp\Big(-\frac{\upsilon_n^2}{2R}\Big)\Bigg]^{R}.$$
Therefore,
$$E\prod_{k=1}^{V_n}\Pi\big(\|\boldsymbol{u}_k\| \le \upsilon_n \mid \Delta\big) \ge E\Bigg[\sum_{h=1}^{RV_n}\binom{RV_n}{h}(1-\Delta)^h\Delta^{RV_n-h}\Big(\frac{2\upsilon_n}{\sqrt{2\pi R}}\Big)^{RV_n-h}\exp\Big(-\frac{(RV_n-h)\upsilon_n^2}{2R}\Big)\Bigg]$$
$$\ge \sum_{h=1}^{RV_n}\binom{RV_n}{h}\,\mathrm{Beta}(RV_n-h+1,\ h+1)\Big(\frac{2\upsilon_n}{\sqrt{2\pi R}}\Big)^{RV_n-h}\exp\Big(-\frac{(RV_n-h)\upsilon_n^2}{2R}\Big)$$
$$= \sum_{h=1}^{RV_n}\frac{(RV_n)!}{h!(RV_n-h)!}\,\frac{h!(RV_n-h)!}{(RV_n+1)!}\Big(\frac{2\upsilon_n}{\sqrt{2\pi R}}\Big)^{RV_n-h}\exp\Big(-\frac{(RV_n-h)\upsilon_n^2}{2R}\Big)$$
$$\ge \frac{RV_n}{RV_n+1}\Big(\frac{2\upsilon_n}{\sqrt{2\pi R}}\Big)^{RV_n}\exp\Big(-\frac{V_n\upsilon_n^2}{2}\Big),$$
where the last inequality follows from Lemma 8.2 by considering the fact that
$$\upsilon_n = \min_{k,l}\frac{-\big[\|\boldsymbol{u}_k^{(0)}\| + \|\boldsymbol{u}_l^{(0)}\|\big] + \sqrt{\big[\|\boldsymbol{u}_k^{(0)}\| + \|\boldsymbol{u}_l^{(0)}\|\big]^2 + 2\eta/n^{\rho/2}}}{2} \le \frac{\sqrt{\eta}}{n^{\rho/4}},$$
so that $0 < 2\upsilon_n/\sqrt{2\pi R} < 1$ for all large $n$. It now follows from (17) that
$$-\log \Pi\Big(\|\boldsymbol{W} - \boldsymbol{W}^{(0)}\|_1 < \frac{\eta}{2n^{\rho/2}}\Big) \le \sum_{k=1}^{V_n}\frac{\boldsymbol{u}_k^{(0)'}\boldsymbol{u}_k^{(0)}}{2} + \frac{V_n\eta}{4n^{\rho/2}} - RV_n\log\Big(\frac{2\sqrt{\eta}}{\sqrt{2\pi R}\,n^{\rho/4}}\Big) + \log(RV_n+1) - \log(RV_n) = o(n),$$
by assumptions (A) and (B). This proves (i).
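The elementary integral bound invoked in the proof of (i) can be checked in one line; this verification is our own addition, not part of the original argument:

```latex
% For |x| <= a we have e^{-x^2/2} >= e^{-a^2/2}, hence
\begin{equation*}
\int_{-a}^{a} e^{-x^2/2}\,dx \;\ge\; \int_{-a}^{a} e^{-a^2/2}\,dx \;=\; 2a\,e^{-a^2/2},
\qquad a > 0 .
\end{equation*}
```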
We will now prove (ii). Let $S = \{j : \gamma_{2,j}^{(0)} \neq 0\}$, with $s_{0,n} = |S|$, and define $\boldsymbol{s}$ as the vector formed by the upper triangular part of the matrix with $(k,l)$th entry $s_{k,l}$. It follows that
$$\Pi\Big(\|\boldsymbol{\gamma}_2 - \boldsymbol{\gamma}_2^{(0)}\|_1 < \frac{\eta}{2n^{\rho/2}}\Big) \ge \Pi\Big(|\gamma_{2,j} - \gamma_{2,j}^{(0)}| < \frac{\eta}{4\sqrt{q_n}\,n^{\rho/2}},\ j \in S\Big)\;\Pi\Bigg(\sum_{j\notin S}|\gamma_{2,j}| < \frac{(q_n - s_{0,n})\,\eta}{4q_n n^{\rho/2}}\Bigg). \tag{18}$$
We lower bound the two components of the product in (18) individually. By Chebyshev's inequality,
$$\Pi\Bigg(\sum_{j\notin S}|\gamma_{2,j}| < \frac{(q_n - s_{0,n})\,\eta}{4q_n n^{\rho/2}}\Bigg) \ge 1 - \frac{E\big[\sum_{j\notin S}|\gamma_{2,j}|\big]\,4q_n n^{\rho/2}}{(q_n - s_{0,n})\,\eta} = 1 - \frac{4q_n n^{\rho/2}}{\theta_n\,\eta}, \tag{19}$$
using the fact that marginally $\gamma_{2,j}$ follows a double exponential distribution with $E|\gamma_{2,j}| = 1/\theta_n$. Next,
$$\Pi\Big(|\gamma_{2,j} - \gamma_{2,j}^{(0)}| < \frac{\eta}{4\sqrt{q_n}\,n^{\rho/2}},\ j \in S\Big) = E\Bigg[\prod_{j\in S}\Pi\Big(|\gamma_{2,j} - \gamma_{2,j}^{(0)}| < \frac{\eta}{4\sqrt{q_n}\,n^{\rho/2}}\,\Big|\,\boldsymbol{s}_S\Big)\Bigg].$$
Using the fact that $\int_a^b e^{-x^2/2}\,dx \ge e^{-(a^2+b^2)/2}(b-a)$, one obtains
$$\prod_{j\in S}\Pi\Big(|\gamma_{2,j} - \gamma_{2,j}^{(0)}| < \frac{\eta}{4\sqrt{q_n}\,n^{\rho/2}}\,\Big|\,\boldsymbol{s}_S\Big) \ge \prod_{j\in S}\frac{2\eta}{4\sqrt{q_n}\,n^{\rho/2}\sqrt{2\pi s_j}}\exp\Bigg(-\frac{|\gamma_{2,j}^{(0)}|^2 + \eta^2/(16q_n n^{\rho})}{s_j}\Bigg).$$
Thus, writing $\alpha_j^2 = |\gamma_{2,j}^{(0)}|^2 + \eta^2/(16q_n n^{\rho})$ and integrating over the prior $s_j \sim \mathrm{Exp}(\theta_n^2/2)$,
$$\Pi\Big(|\gamma_{2,j} - \gamma_{2,j}^{(0)}| < \frac{\eta}{4\sqrt{q_n}\,n^{\rho/2}},\ j \in S\Big) \ge \prod_{j\in S}\frac{\eta\,\theta_n^2}{4\sqrt{q_n}\,n^{\rho/2}\sqrt{2\pi}}\int_0^\infty \frac{1}{\sqrt{s_j}}\exp\Big(-\frac{\alpha_j^2}{s_j} - \frac{\theta_n^2 s_j}{2}\Big)\,ds_j.$$
Use the change of variable $s_j = z_j^2$ and the normalizing constant of the inverse Gaussian density to deduce
$$\int_0^\infty \frac{1}{\sqrt{s_j}}\exp\Big(-\frac{\alpha_j^2}{s_j} - \frac{\theta_n^2 s_j}{2}\Big)\,ds_j = \sqrt{\frac{2\pi}{\theta_n^2}}\,\exp\big(-\sqrt{2}\,\theta_n\alpha_j\big).$$
Therefore,
$$\Pi\Big(|\gamma_{2,j} - \gamma_{2,j}^{(0)}| < \frac{\eta}{4\sqrt{q_n}\,n^{\rho/2}},\ j \in S\Big) \ge \Big(\frac{\eta\,\theta_n}{4\sqrt{q_n}\,n^{\rho/2}}\Big)^{s_{0,n}}\exp\Big(-\sqrt{2}\,\theta_n\sum_{j\in S}\alpha_j\Big). \tag{20}$$
Combining the results from (19) and (20),
$$\Pi\Big(\|\boldsymbol{\gamma}_2 - \boldsymbol{\gamma}_2^{(0)}\|_1 < \frac{\eta}{2n^{\rho/2}}\Big) \ge \Big(\frac{\eta\,\theta_n}{4\sqrt{q_n}\,n^{\rho/2}}\Big)^{s_{0,n}}\exp\Big(-\sqrt{2}\,\theta_n\sum_{j\in S}\alpha_j\Big)\Bigg(1 - \frac{4q_n n^{\rho/2}}{\theta_n\,\eta}\Bigg).$$
Referring to Assumption (F) for the rate of $\theta_n$,
$$-\log \Pi\Big(\|\boldsymbol{\gamma}_2 - \boldsymbol{\gamma}_2^{(0)}\|_1 < \frac{\eta}{2n^{\rho/2}}\Big) \le s_{0,n}\log\Big(\frac{4\sqrt{q_n}\,n^{\rho/2}}{\eta\,\theta_n}\Big) + \sqrt{2}\,\theta_n\sum_{j\in S}\sqrt{|\gamma_{2,j}^{(0)}|^2 + \frac{\eta^2}{16q_n n^{\rho}}} - \log\Bigg(1 - \frac{4q_n n^{\rho/2}}{\theta_n\,\eta}\Bigg) = o(n), \tag{21}$$
under assumptions (B)-(F). This proves (ii).

Finally,
$$-\log(D_n) \le \frac{\sqrt{nq_n}\,\eta}{n^{\rho/2}} + \frac{n\eta^2}{8n^{\rho}} - \log \Pi\Big(\|\boldsymbol{\gamma} - \boldsymbol{\gamma}^{(0)}\|_1 < \frac{\eta}{n^{\rho/2}}\Big) = \eta\sqrt{q_n}\,n^{(1-\rho)/2} + \frac{\eta^2}{8}\,n^{1-\rho} - \log \Pi\Big(\|\boldsymbol{\gamma} - \boldsymbol{\gamma}^{(0)}\|_1 < \frac{\eta}{n^{\rho/2}}\Big).$$
Using (21), the result in (i), the fact that $(1-\rho)/2 \in (-1/4, 0)$, and assumption (B), we obtain $-\log(D_n) = o(n)$. Consequently, $\log\{\exp(nh/2)D_n\} = nh/2 + \log D_n \to \infty$, which proves (c) and, in view of (14)-(16), completes the proof of Theorem 3.1.