High Dimensional Bayesian Network Classification with Network Global-Local Shrinkage Priors
Sharmistha Guha & Abel Rodriguez
September 25, 2020
Abstract
This article proposes a novel Bayesian classification framework for networks with labeled nodes. While the literature on statistical modeling of network data typically involves the analysis of a single network, the recent emergence of complex data in several biological applications, including brain imaging studies, presents a need to devise a network classifier for subjects. This article considers an application from a brain connectome study, where the overarching goal is to classify subjects into two separate groups based on their brain network data, along with identifying influential regions of interest (ROIs), referred to as nodes. Existing approaches either treat all edge weights as a long vector or summarize the network information with a few summary measures. Both of these approaches ignore the full network structure, may lead to less desirable inference in small samples, and are not designed to identify significant network nodes. We propose a novel binary logistic regression framework with the network as the predictor and a binary response, the network predictor coefficient being modeled using a novel class of network global-local shrinkage priors. The framework is able to accurately detect nodes and edges in the network influencing the classification. Our framework is implemented using an efficient Markov chain Monte Carlo algorithm. Theoretically, we show asymptotically optimal classification for the proposed framework when the number of network edges grows faster than the sample size. The framework is empirically validated by extensive simulation studies and the analysis of a brain connectome dataset.
Keywords: Brain connectome; High dimensional binary regression; Global-local shrinkage prior; Node selection; Network predictor; Posterior consistency.
1 Introduction

Of late, the statistical literature has paid substantial attention to the unsupervised analysis of a single network, thought to be generated from a variety of classic models, including random graph models Erdős and Rényi (1960), exponential random graph models Frank and Strauss (1986), social space models Hoff et al. (2002); Hoff (2005, 2009), and stochastic block models Nowicki and Snijders (2001). These models have found prominence in social networking applications where the nodes in the network are exchangeable. However, there are pertinent biological and physiological applications in which network nodes are labeled and a network is available corresponding to every individual. Section 6 presents one such example from a brain connectome study, where brain networks are available for multiple individuals who are classified as subjects with high or low IQ (Intelligence Quotient). In this study, the human brain has been divided according to the Desikan atlas Desikan et al. (2006), which identifies 34 cortical regions of interest (ROIs) in each of the left and right hemispheres of the human brain, implying 68 cortical ROIs in all. A brain network for each subject is represented by a symmetric adjacency matrix whose rows and columns are labeled corresponding to different ROIs (shared among the networks corresponding to all individuals) and whose entries correspond to estimates of the number of fibers connecting pairs of brain regions. The scientific goal in this setting pertains to developing a predictive rule for classifying a new subject as having low or high IQ based on his/her observed brain network with labeled nodes. Additionally, it is of specific interest for neuroscientists to identify influential brain regions (nodes in the brain network) and significant connections between different brain regions predictive of IQ.

Guha and Rodriguez (2020) discuss the network regression problem with a continuous response and an undirected network predictor. However, there are pertinent biological and physiological studies where a network along with a binary response is obtained for each subject. The goal of these studies is usually to classify the networks according to the binary response and to predict the associated binary response from a network. We refer to this problem as the network or graph classification problem. Additionally, Guha and Rodriguez (2020) focus on a specific network shrinkage prior, whereas this article generalizes the inference to a class of network global-local shrinkage priors, which includes the prior specification in Guha and Rodriguez (2020) as a special case.

Earlier literature on network or graph classification has been substantially motivated by the problem of classification of chemical compounds Srinivasan et al. (1996); Helma et al. (2001), where a graph represents a compound's molecular structure. In such analyses, certain discriminative patterns in a graph are identified and used as features for training a standard classification method Deshpande et al. (2005); Fei and Huan (2010). Another type of method is based on graph kernels Vishwanathan et al. (2010), which define a similarity measure between two networks. Both of these approaches are computationally feasible only for small networks, do not account for uncertainty, and do not facilitate identification of influential network nodes.
When the number of network nodes is moderately large, a common approach to network classification is to compute a few summary measures (average degree, clustering coefficient, or average path length) from the network and then apply standard classification methods to these summaries (see, e.g., Bullmore and Sporns (2009) and references therein). These procedures have recently been employed in exploring the relationship between the brain network and neuropsychiatric diseases, such as Parkinson's Olde Dubbelink et al. (2013) and Alzheimer's Daianu et al. (2013), but the analyses are sensitive to the chosen network topological measures, with substantially different results obtained for different types of summary statistics. Indeed, global summary statistics collapse all local network information, which can affect the accuracy of classification. Furthermore, identification of the impact of specific nodes on the response, which is of clear interest in our setting, is not feasible. As with network regression problems, an alternate approach vectorizes the network predictor, treating the edge weights together as a long vector, and then develops a high dimensional regression model with this long vector of edge weights as predictors Richiardi et al. (2011); Craddock et al. (2009); Zhang et al. (2012). This approach can take advantage of recent developments in high dimensional binary regression, from both the penalized optimization Tibshirani (1996) and Bayesian shrinkage Park and Casella (2008); Carvalho et al. (2010); Armagan et al. (2013a) perspectives. However, as mentioned in Guha and Rodriguez (2020), it treats the links of the network as exchangeable, ignoring the fact that coefficients involving common nodes can be expected to be correlated a priori. In a related work, Vogelstein et al. (2013) propose to look for a minimal set of nodes which best explains the difference between two groups of networks, which requires solving a combinatorial problem. Similarly, Durante and Dunson (2017) propose a high dimensional Bayesian tensor factorization model for a population of networks that allows testing for local edge differences between two groups of subjects. Both of these approaches tend to focus mainly on classification and are not designed to detect important nodes and edges impacting the response.

Our goal in this article is to develop a high-dimensional Bayesian network classifier that additionally infers influential nodes and edges impacting classification. To achieve this goal, we formulate a high dimensional logistic network regression model with the binary response regressed on the network predictor corresponding to each subject. The network predictor coefficient is assigned a prior from the class of
Bayesian network global-local shrinkage priors discussed in this article. The proposed prior imparts low-rank and near sparse structures a priori on the network predictor coefficient. The low-rank structure of the coefficient is designed to address the transitivity effect on the network predictor coefficient and captures the effect of network edge coefficients on classification due to the interaction between nodes. On the other hand, the near sparse structure accounts for the residual effect due to edges.

One important contribution of this article is a careful study of the asymptotic properties of the proposed binary network classification (BNC) framework. In particular, we focus on consistency properties for the posterior distribution of the BNC framework using a specific network global-local shrinkage prior, namely the
Bayesian Network Lasso prior. The theory of posterior contraction for high dimensional regression models has gained traction lately, though the literature is less developed for shrinkage priors than for point-mass priors. For example, Castillo et al. (2012) and Belitser and Nurushev (2015) have established posterior concentration and variable selection properties for certain point-mass priors in normal-means models. The latter article also establishes asymptotically nominal coverage of Bayesian credible sets. Results on posterior concentration and variable selection in high dimensional linear models are also established by Castillo et al. (2015) and Martin et al. (2017) for certain point-mass priors. In contrast, the literature on posterior contraction properties for high dimensional Bayesian shrinkage priors is relatively limited. To this end, Armagan et al. (2013b) were the first to show posterior consistency in the ordinary linear regression model with shrinkage priors in low-dimensional settings, under the assumption that the number of covariates does not exceed the number of observations. Using direct calculations, van der Pas et al. (2014) show that the posterior based on the ordinary horseshoe prior concentrates at the optimal rate for normal-means problems. Recently, Song and Liang (2017) consider a general class of continuous shrinkage priors and obtain posterior contraction rates in ordinary high dimensional linear regression models. In the same vein, Wei and Ghosal (2017) offer an analysis of posterior concentration for logistic regression models with shrinkage priors on coefficients. While Wei and Ghosal (2017) are the first to delineate a theoretical approach for ordinary high dimensional binary classification models with shrinkage priors, the study of posterior contraction properties for more structured binary network classification problems in the Bayesian paradigm has not appeared in the literature. In fact, developing the theory for Bayesian network classification with the Bayesian Network Lasso prior proposed in this article faces two major challenges. First, the novel Bayesian Network Lasso prior imparts a more complex prior structure (incorporating a low-rank structure in the prior mean of edge coefficients, as described in Guha and Rodriguez (2020)) than that in Wei and Ghosal (2017), introducing additional theoretical challenges. Second, we aim at proving a challenging but practically desirable result of asymptotically optimal classification when the number of edges in the network predictor grows at a super-linear rate as a function of the sample size. Both of these present obstacles which we overcome in this work. The theoretical results provide insights on how the number of nodes in the network predictor, or the sparsity in the true network predictor coefficients, should vary with the sample size n to achieve asymptotically optimal classification. We must mention that developing a similar theory for the Bayesian Network Horseshoe prior proposed in this article faces more challenges due to the complex prior structure in the parameters. We plan to tackle that problem as part of future work.

Section 2 develops the model and the prior distributions. Section 3 discusses theoretical developments justifying the asymptotically desirable prediction from the proposed model. Section 4 details posterior computation. Results from various simulation experiments and a brain connectome data analysis are presented in Sections 5 and 6, respectively.
Finally, Section 7 concludes the article with a brief discussion of the proposed methodology.

2 Model and Prior Distributions

In the context of network classification, we propose the high dimensional logistic regression model of the binary response $y_i \in \{0, 1\}$ on the undirected network predictor $A_i$ as
$$y_i \sim \mathrm{Ber}\left[\frac{\exp(\psi_i)}{1 + \exp(\psi_i)}\right], \qquad \psi_i = \mu + \langle A_i, \Gamma \rangle_F, \qquad (1)$$
where $\Gamma$ is a $V \times V$ symmetric network coefficient matrix whose $(k,l)$th element is given by $\gamma_{k,l}/2$, with $\gamma_{k,k} = 0$ for all $k = 1, ..., V$.
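To fix ideas, the following minimal sketch (not the authors' implementation) evaluates model (1) for a single subject; the matrices A and Gamma and the intercept mu below are illustrative inputs.

```r
# Evaluate the logistic network regression model (1) for one subject.
frobenius_inner <- function(A, Gamma) sum(A * Gamma)  # <A, Gamma>_F

set.seed(1)
V     <- 10
A     <- matrix(rnorm(V^2), V, V); A <- (A + t(A)) / 2; diag(A) <- 0
Gamma <- matrix(rnorm(V^2, sd = 0.1), V, V)
Gamma <- (Gamma + t(Gamma)) / 2; diag(Gamma) <- 0
mu    <- 2

psi <- mu + frobenius_inner(A, Gamma)  # linear predictor psi_i
p1  <- exp(psi) / (1 + exp(psi))       # P(y_i = 1)
y   <- rbinom(1, 1, p1)                # Bernoulli response
```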
Model (1) can be expressed in the form of a generalized linear model. To be more specific,
$$\langle A_i, \Gamma \rangle_F = \sum_{1 \le k < l \le V} a_{i,k,l}\,\gamma_{k,l},$$
so that
$$p(y_i \mid \mu, \gamma) = \frac{\exp(\psi_i)^{y_i}}{1 + \exp(\psi_i)}, \qquad (2)$$
$$\psi_i = \mu + x_i'\gamma, \qquad (3)$$
where $x_i$ collects the upper triangular entries of $A_i$ and $\gamma = (\gamma_{1,2}, ..., \gamma_{V-1,V})'$ is the corresponding vector of edge coefficients. To model $\gamma$, each node $k$ is endowed with a latent variable $u_k \in \mathbb{R}^R$, and we set $\Lambda = \mathrm{diag}(\lambda_1, ..., \lambda_R)$. The class of network global-local shrinkage priors centers each edge coefficient on a low-rank term,
$$\gamma_{k,l} \mid s_{k,l} \sim N(u_k'\Lambda u_l,\; s_{k,l}), \qquad (4)$$
with the distribution of the local scale $s_{k,l}$ determining the member of the class. In particular, taking $s_{k,l} = \lambda_{k,l}^2\sigma^2$ with $\lambda_{k,l} \sim C^+(0, 1)$ and $\sigma \sim C^+(0, 1)$ yields the Bayesian Network Horseshoe prior. The rest of the hierarchy on the $\lambda_r$'s and $u_k$'s follows as in Guha and Rodriguez (2020).

3 Theoretical Developments

This section establishes convergence results for (1) with the $\gamma_{k,l}$'s following the Bayesian Network Lasso shrinkage prior. From the hierarchical specification given in (4), the Bayesian Network Lasso shrinkage prior is given by
$$\gamma_{k,l} \mid s_{k,l} \sim N(u_k'\Lambda u_l,\; s_{k,l}), \qquad s_{k,l} \sim \mathrm{Exp}(\theta_n^2/2),$$
allowing the shrinkage parameter $\theta_n$ to vary as a function of $n$ Armagan et al. (2013a). Our theoretical investigation fixes $\theta_n$; the exact expression is given in Condition (F) in the next subsection.

Here we consider an asymptotic setting in which the number of nodes in the network predictor, $V_n$, grows with the sample size $n$. This paradigm attempts to capture the fact that the number of elements in $A_i$, given by $V_n(V_n - 1)/2$, can be substantially larger than the sample size. Since model (1) is equivalent to model (3), the size of the coefficient $\gamma$ in (3) is also a function of $n$, given by $q_n = V_n(V_n - 1)/2$. This creates theoretical challenges, related to (but distinct from) those faced in showing posterior consistency for high dimensional continuous Armagan et al. (2013a) and binary Wei and Ghosal (2017) regressions.

Let $y^n = (y_1, ..., y_n)'$. Using the superscript (0) to indicate true parameters, the true data generating model is given by
$$y_i \sim \mathrm{Bernoulli}\left[\frac{\exp(\psi_i^{(0)})}{1 + \exp(\psi_i^{(0)})}\right], \qquad \psi_i^{(0)} = \langle A_i, \Gamma^{(0)} \rangle_F, \qquad (5)$$
where $\Gamma^{(0)}$ is the true network coefficient. Let $\gamma^{(0)}$ be the vectorized upper triangular part of $\Gamma^{(0)}$. We assume $\gamma_{k,l}^{(0)} = u_k^{(0)\prime}\Lambda u_l^{(0)} + \gamma_{2,k,l}^{(0)}$, where $u_k^{(0)}$ is an $R_0$-dimensional vector, $k = 1, ..., V$. $\gamma_2^{(0)}$ is the vector of all $\gamma_{2,k,l}^{(0)}$, $k < l$, and we denote the number of nonzero elements of $\gamma_2^{(0)}$ by $s_{2,n}$, i.e., $\|\gamma_2^{(0)}\|_0 = s_{2,n}$.

For any $\epsilon > 0$, define
$$\mathcal{A}_n = \left\{ \gamma : \frac{1}{n}\sum_{i=1}^{n} \left| f_\gamma(x_i) - f_{\gamma^{(0)}}(x_i) \right| \le \epsilon \right\}$$
as a neighborhood around the true density. Further suppose $\pi_n(\cdot)$ and $\Pi_n(\cdot)$ are the prior and posterior densities of $\gamma$ with $n$ observations, so that
$$\Pi_n(\mathcal{A}_n^c) = \frac{\int_{\mathcal{A}_n^c} p_\gamma(y^n)\,\pi_n(\gamma)\,d\gamma}{\int p_\gamma(y^n)\,\pi_n(\gamma)\,d\gamma},$$
where $p_\gamma(y^n)$ denotes the likelihood of the $n$-dimensional response vector $y^n$.

To show the posterior contraction results, we follow Wei and Ghosal (2017) and Armagan et al. (2013a), with substantial modifications required due to the nature of our proposed network lasso prior distribution. In proving the results, we make a couple of simplifications. It is assumed that the dimension $R$ of $u_k$ is fixed and is the same as $R_0$, the dimension of $u_k^{(0)}$. Consequently, the effective dimensionality is not required to be estimated, and hence $\Lambda = I$ is a non-random matrix.
Additionally, we assume $M$ to be non-random and $M = I$. We emphasize that both these assumptions are not essential for the posterior contraction result to hold, and are introduced only to simplify calculations.

For two sequences $\{C_{1,n}\}_{n\ge 1}$ and $\{C_{2,n}\}_{n\ge 1}$, we write $C_{1,n} = o(C_{2,n})$ if $C_{1,n}/C_{2,n} \to 0$ as $n \to \infty$. To begin with, we state the following assumptions under which posterior contraction will be shown.

(A) $\sup_{r=1,...,R;\, k=1,...,V_n} |u_{k,r}^{(0)}| < \infty$;

(B) $V_n = o(n/\log(n))$;

(C) $\|A_i\|_\infty$ is bounded for all $i = 1, ..., n$; without loss of generality, assume $\|A_i\|_\infty \le 1$;

(D) $s_{2,n}\log(q_n) = o(n)$;

(E) $\|\gamma_2^{(0)}\|_\infty < \infty$;

(F) $\theta_n = C q_n n^{\rho/2}\sqrt{\log(n)}$ for some $C > 0$ and $\rho \in (1, 2)$.

Remark: Conditions (A), (C) and (E) are technical conditions ensuring that each of the entries in the true network coefficient and the network predictor is bounded. Condition (B) puts an upper bound on the growth of the number of network nodes with the sample size needed to achieve asymptotically optimal classification. Similarly, (D) puts a restriction on the number of nonzero elements of $\gamma_2^{(0)}$ with respect to $n$.

The following theorem shows contraction of the posterior asymptotically under mild sufficient conditions on $V_n$ and $s_{2,n}$. The proof of the theorem is provided in Appendix F.

Theorem 3.1 Under assumptions (A)-(F), for the Bayesian Network Lasso prior on $\gamma$, $\Pi_n(\mathcal{A}_n) \to 1$ in $P_{\gamma^{(0)}}$-probability as $n \to \infty$, for any $\epsilon > 0$.

4 Posterior Computation

We have implemented both the Bayesian Network Lasso and Network Horseshoe shrinkage priors on $\gamma$. Using the result in Polson et al. (2013), the data augmented representation of the distribution of $y_i$ given in (2) is
$$p(y_i \mid \omega_i) = 2^{-1}\exp(k_i\psi_i)\exp(-\omega_i\psi_i^2/2), \qquad \omega_i \sim PG(1, 0), \qquad (6)$$
where $k_i = y_i - 1/2$. Let $x_i = (a_{i,1,2}, a_{i,1,3}, ..., a_{i,1,V}, a_{i,2,3}, a_{i,2,4}, ..., a_{i,2,V}, ..., a_{i,V-1,V})'$ be of dimension $q \times 1$, where $q = V(V-1)/2$, and assume $X = (x_1 : \cdots : x_n)'$ is an $n \times q$ matrix. Then the conditional likelihood of $y = (y_1, ..., y_n)'$ given $\omega = (\omega_1, ..., \omega_n)'$ and $\gamma$ is given by
$$p(y \mid X, \gamma, \omega) \propto \prod_{i=1}^{n} p(y_i \mid x_i, \gamma, \omega_i, ...) \propto \prod_{i=1}^{n} \exp\left\{(y_i - 0.5)(\mu + x_i'\gamma) - \omega_i(\mu + x_i'\gamma)^2/2\right\} \propto \prod_{i=1}^{n} \exp\left\{-\frac{\omega_i}{2}\left[\frac{(y_i - 0.5)}{\omega_i} - (\mu + x_i'\gamma)\right]^2\right\}.$$
In matrix notation, the likelihood may be written as $p(y \mid X, \gamma, \omega, ...) \propto N(t \mid \mu\mathbf{1} + X\gamma, \Omega^{-1})$, where $t = ((y_1 - 0.5)/\omega_1, ..., (y_n - 0.5)/\omega_n)' = (k_1/\omega_1, ..., k_n/\omega_n)'$ and $\Omega = \mathrm{diag}(\omega_1, ..., \omega_n)$. While the full posterior distributions of the parameters are not of closed form, they mostly belong to standard families, so drawing posterior samples using MCMC can be readily implemented. Appendix D and Appendix E describe the full conditional distributions of the parameters for the Bayesian Network Lasso and Network Horseshoe priors on $\gamma$, respectively.

Let $\Omega^{(1)}, ..., \Omega^{(L)}$, $\Gamma^{(1)}, ..., \Gamma^{(L)}$ and $\mu^{(1)}, ..., \mu^{(L)}$ be the $L$ post burn-in MCMC samples for $\Omega$, $\Gamma$ and $\mu$, respectively, after suitable thinning. To classify a newly observed network $M^*$ as a member of one of the two groups, we compute
$$S^{(l)} = \frac{\exp(\mu^{(l)} + \langle M^*, \Gamma^{(l)} \rangle)}{1 + \exp(\mu^{(l)} + \langle M^*, \Gamma^{(l)} \rangle)}, \qquad l = 1, ..., L.$$
$M^*$ is classified as a member of the 'low' or 'high' group according to whether $\frac{1}{L}\sum_{l=1}^{L} S^{(l)}$ is less than or greater than 0.5, respectively.
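The following is a schematic single Gibbs update implied by the Pólya-Gamma augmentation in (6), written for a generic Gaussian prior γ ~ N(m0, S0) rather than the full network lasso/horseshoe hierarchies of Appendices D and E; it assumes the BayesLogit package for the Pólya-Gamma draws and absorbs the intercept μ into an intercept column of X.

```r
# One Gibbs pass under the Polya-Gamma augmentation: a sketch, not the
# authors' sampler (which places the network shrinkage prior on gamma).
library(BayesLogit)  # rpg() draws Polya-Gamma variates

pg_gibbs_step <- function(y, X, gamma, m0, S0_inv) {
  psi   <- drop(X %*% gamma)
  omega <- rpg(length(y), 1, psi)               # omega_i | - ~ PG(1, psi_i)
  kappa <- y - 0.5                              # k_i = y_i - 1/2
  prec  <- crossprod(X * sqrt(omega)) + S0_inv  # X' Omega X + S0^{-1}
  V     <- chol2inv(chol(prec))
  m     <- V %*% (crossprod(X, kappa) + S0_inv %*% m0)
  drop(m + t(chol(V)) %*% rnorm(ncol(X)))       # gamma | omega, y ~ N(m, V)
}
```

Because the conditional for γ is Gaussian and the ω_i's are conditionally independent Pólya-Gamma variates, the two updates can be alternated without any Metropolis correction.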
To judge sensitivity to the choice of the cut-off, the simulation section presents the area under the curve (AUC) of ROC curves built from the true positive rates (TPR) and false positive rates (FPR) of classification corresponding to a range of cut-off values.

Node $k$ is recognized to be influential in the classification process if $\frac{1}{L}\sum_{l=1}^{L}\xi_k^{(l)} > 0.5$, where $\xi_k^{(1)}, ..., \xi_k^{(L)}$ are the $L$ post burn-in MCMC samples of $\xi_k$. Again, one of the goals of the proposed framework is to identify influential network edges impacting the response. We employ the algorithm described in Appendix C to identify influential edges. The algorithm takes care of multiplicity correction by controlling the false discovery rate (FDR) at the 5% level. Finally, we present an estimate of $P(R_{eff} = r \mid Data)$ computed by $\frac{1}{L}\sum_{l=1}^{L} I\left(\sum_{m=1}^{R} 1\{\lambda_m^{(l)} \neq 0\} = r\right)$, where $I(A)$ for an event $A$ is 1 if the event $A$ happens and 0 otherwise, and $\lambda_m^{(1)}, ..., \lambda_m^{(L)}$ are the $L$ post burn-in MCMC samples of $\lambda_m$.

5 Simulation Studies

This section evaluates the inferential and classification ability of our proposed Bayesian network classification (BNC) framework, along with a number of competitors, using synthetic networks generated under various simulation settings. Our proposed network classification approaches with the Bayesian Network Lasso prior and the Bayesian Network Horseshoe prior are referred to as the Bayesian Network Lasso classifier (BNLC) and the Bayesian Network Horseshoe classifier (BNHC), respectively. In each simulation, we assess the ability of the BNLC and BNHC approaches to correctly identify influential nodes and edges, to accurately estimate predictive edge coefficients, and to classify a network with precise characterization of uncertainties. Classification performance of both methods is assessed using the area under the receiver operating characteristic (ROC) curve (AUC).

To study all competitors under various data generation schemes, we simulate the response from (1):
$$y_i \sim \mathrm{Ber}\left(\frac{\exp(\mu + \langle A_i, \Gamma \rangle_F)}{1 + \exp(\mu + \langle A_i, \Gamma \rangle_F)}\right), \qquad (7)$$
where $\Gamma$ is a symmetric matrix with zero diagonal entries. The intercept $\mu$ is fixed at 2 in all simulation scenarios. We consider two different schemes for generating the network $A_i$, referred to as Simulation 1 and Simulation 2, respectively.

Simulation 1. In Simulation 1, the network edges (i.e., the elements of the matrix $A_i$) are simulated independently from a standard normal distribution; Simulation 1 thus assumes that the network predictor follows an Erdős–Rényi-type graph.

Simulation 2. In Simulation 2, the network predictor $A_i$ corresponding to the $i$th sample is generated from a stochastic blockmodel. Here nodes in a simulated network are organized into communities, so that nodes in the same community tend to have stronger connections than nodes belonging to different communities. This scenario simulates networks which closely mimic brain connectome networks Bullmore and Sporns (2009). To simulate networks with such community structures, we assign each node a community label, $f_k \in \{1, 2, 3\}$, $k = 1, ..., V$. The node assignments are the same for all networks in the population. Given the community labels, the $(k, k')$th element of $A$ is simulated from $N(m_{f_k, f_{k'}}, \sigma^2)$, where $m_{k,l}$ is set to a common positive constant when $k = l$. When $k \neq l$, i.e., when the concerned edges connect nodes belonging to different clusters, we sample a fixed number of edge locations randomly and simulate the values from $N(0, \sigma^2)$. We take $\sigma = 1$ and three clusters with 8, 9 and 8 nodes, respectively, in the three communities. We note that the network predictors simulated from this stochastic blockmodel also exhibit transitivity.
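A sketch of the Simulation 2 network generator follows. The within-community mean m_within is an illustrative value (the exact constant is not reproduced here), and, for simplicity, all between-community entries are drawn from N(0, sigma^2) rather than being restricted to a fixed number of random locations.

```r
# Three-community stochastic blockmodel generator, mimicking Simulation 2.
simulate_sbm_network <- function(V = 25, labels = rep(1:3, c(8, 9, 8)),
                                 m_within = 0.8, sigma = 1) {
  A <- matrix(0, V, V)
  for (k in 1:(V - 1)) {
    for (l in (k + 1):V) {
      mu_kl   <- if (labels[k] == labels[l]) m_within else 0
      A[k, l] <- rnorm(1, mu_kl, sigma)
      A[l, k] <- A[k, l]  # keep the adjacency matrix symmetric
    }
  }
  A
}
```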
Simulating the network predictor coefficient $\Gamma$. In both Simulations 1 and 2, the network predictor coefficient $\Gamma$ is constructed as the sum of two matrices, $\Gamma_1$ and $\Gamma_2$, whose construction we detail below.

In both Simulations 1 and 2, we draw $V$ latent variables $u_{k,g}$, each of dimension $R_g$, from a mixture distribution given by
$$u_{k,g} \sim \pi_1 N_{R_g}(u_{m,g}, u_{s,g}) + (1 - \pi_1)\,\delta_0; \qquad k \in \{1, ..., V\}, \qquad (8)$$
where $\delta_0$ is the Dirac delta function at zero and $\pi_1$ is the probability of any $u_{k,g}$ being nonzero. Define a symmetric matrix $\Gamma_1$ whose $(k,l)$th element is given by $u_{k,g}'u_{l,g}$ for $k < l$ and equals 0 if $k = l$. Note that if $u_{k,g}$ is zero, then the $k$th node has no contribution to the mean function in (7), i.e., the $k$th node becomes non-influential in predicting the response. Since $(1 - \pi_1)$ is the probability of a node being inactive, it is referred to as the node sparsity parameter in the context of the data generation mechanism under Simulations 1 and 2. All elements of $u_{m,g}$ are taken to be 0, and $u_{s,g}$ is taken to be 1.

We also construct another symmetric sparse matrix $\Gamma_2$ to add residual edge effects corresponding to edges connecting a few randomly selected nodes. Let $\pi_2$ be the proportion of nonzero elements of $\Gamma_2$, set at either 0.05 or 0.1. We randomly choose a proportion $\pi_2$ of locations from the set of all pairs $(k, l)$. The nonzero entries are drawn using one of the three following strategies:

Strategy 1: Nonzero entries are simulated from a normal distribution with mean 1.
Strategy 2: Nonzero entries are simulated from a normal distribution with mean 0.5.
Strategy 3: All nonzero entries are fixed at a common constant.

$(1 - \pi_2)$ is referred to as the residual edge sparsity. Note that this specification of the true edge coefficients largely preserves the transitivity property in $\Gamma$. To see this, note that $\Gamma_2$ is highly sparse, so that $\gamma_{k,l} = \gamma_{1,k,l}$ for most pairs $(k, l)$, $k < l$. For those pairs, $\gamma_{1,k,l} \neq 0$ and $\gamma_{1,l,l'} \neq 0$ imply that $u_{k,g} \neq 0$, $u_{l,g} \neq 0$ and $u_{l',g} \neq 0$. Thus it follows that $\gamma_{1,k,l'} = u_{k,g}'u_{l',g} \neq 0$.

For a comprehensive picture of Simulation 1 and Simulation 2, we consider 4 different cases in each simulation, as summarized in Tables 1 and 2, respectively. In each of these cases, the network predictor coefficient and the response are generated by changing the node sparsity $(1 - \pi_1)$, the residual edge sparsity $(1 - \pi_2)$ and the true dimension $R_g$ of the latent variables $u_{k,g}$. The tables also present the maximum fitted dimension $R$ of the latent variables $u_k$ for the logistic regression model (2). Note that the various cases also allow model mis-specification through unequal choices of $R$ and $R_g$.

Cases     R_g   R   Node Sparsity (1 - pi_1)   Residual Edge Sparsity (1 - pi_2)   Strategy
Case 1    2     2   0.5                        0.95                                Strategy 1
Case 2    3     5   0.6                        0.95                                Strategy 1
Case 3    2     5   0.5                        0.90                                Strategy 2
Case 4    2     5   0.4                        0.90                                Strategy 3

Table 1: Different cases for Simulation 1. The true dimension R_g is the dimension of the vectors u_{k,g} used to generate the data; the maximum dimension R is the dimension of the vectors u_k used to fit the model. Node sparsity and residual edge sparsity are described in the text.

Cases     R_g   R   Node Sparsity (1 - pi_1)   Residual Edge Sparsity (1 - pi_2)   Strategy
Case 1    2     2   0.5                        0.95                                Strategy 1
Case 2    2     4   0.5                        0.95                                Strategy 1
Case 3    2     3   0.7                        0.95                                Strategy 1
Case 4    2     5   0.4                        0.90                                Strategy 3

Table 2: Different cases for Simulation 2. The true dimension R_g is the dimension of the vectors u_{k,g} used to generate the data; the maximum dimension R is the dimension of the vectors u_k used to fit the model. Node sparsity and residual edge sparsity are described in the text.
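The coefficient construction just described can be sketched as follows: Gamma_1 is low rank and driven by node activity via (8), and Gamma_2 carries sparse residual edge effects (Strategy 1 is used here for concreteness); pi1 and pi2 are the node-activation and residual edge proportions.

```r
# Generate Gamma = Gamma_1 + Gamma_2 as in the simulation design (a sketch).
simulate_gamma <- function(V = 25, Rg = 2, pi1 = 0.5, pi2 = 0.05) {
  active <- rbinom(V, 1, pi1)                   # mixture (8): which nodes are active
  U      <- matrix(rnorm(V * Rg), V, Rg) * active
  Gamma1 <- tcrossprod(U); diag(Gamma1) <- 0    # (k,l) entry is u_k' u_l
  Gamma2 <- matrix(0, V, V)
  ut     <- which(upper.tri(Gamma2))
  picked <- sample(ut, ceiling(pi2 * length(ut)))
  Gamma2[picked] <- rnorm(length(picked), mean = 1)  # Strategy 1 residual entries
  Gamma2 <- Gamma2 + t(Gamma2)                  # symmetrize the residual part
  Gamma1 + Gamma2
}
```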
As competitors, we use generic variable selection and shrinkage methods that treat the edges between nodes together as a long predictor vector in a high dimensional regression, thereby ignoring the relational nature of the predictor. More specifically, we use the Lasso Tibshirani (1996), a popular penalized optimization scheme, and the Bayesian Lasso (BLasso for short) Park and Casella (2008) and Bayesian Horseshoe (BHS for short) priors Carvalho et al. (2010), which are popular Bayesian shrinkage regression methods, all three under the logistic regression framework. We use the glmnet package in R Friedman et al. (2010) to implement the frequentist Lasso, while we write our own code for BLasso and BHS. A comparison with these methods will indicate any relative advantage of exploiting the structure of the network predictor. Additionally, we compare our methods to a frequentist approach that develops network classification in the presence of a network predictor and a binary response Relión et al. (2017). We refer to this approach as Relión.

All Bayesian competitors are allowed to draw 50,000 MCMC samples, out of which the first 30,000 are discarded as burn-in. Convergence is assessed by comparing different simulated sequences of representative parameters starting at different initial values Gelman et al. (2014b). All posterior inference is carried out based on the remaining 20,000 MCMC samples after suitably thinning the post burn-in chain. We monitor the auto-correlation plots and effective sample sizes of the iterates, and they are found to be satisfactorily uncorrelated. In all of our simulations, we set V = 25 nodes and n = 250 samples.

We present analyses for ν = 20 and a_∆ = b_∆ = 1. For BNLC, there are two additional hyper-parameters, ι and ζ, both of which are set to 1. Note that the choice a_∆ = b_∆ = 1 ensures that the prior on models is such that we have a uniform distribution on the number of active nodes and, conditional on the size of the model, a uniform distribution on all possible models of that size. The choice ν = 20 ensures that the prior distribution of M is concentrated around a scaled identity matrix. Since the model is invariant to rotations of the latent positions, we want the prior on the u_k's to also be invariant under rotation; this requires centering M around a matrix proportional to the identity. Our choice of ι and ζ fixes the prior mean of s_{k,l}.
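Before turning to the results, the node-selection rule used throughout can be summarized in a small post-processing sketch; xi_samples is a hypothetical L x V matrix of post burn-in indicator draws.

```r
# P(xi_k = 1 | Data) is the average of the post burn-in indicator samples;
# a node is flagged as influential when this probability exceeds 0.5.
node_probs  <- function(xi_samples) colMeans(xi_samples)
influential <- function(xi_samples) which(node_probs(xi_samples) > 0.5)
```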
Figures 1 and 2 show the posterior probability of the kth node being detected as influential, i.e., P(ξ_k = 1 | Data), by BNLC and BNHC for each node and each case within Simulations 1 and 2, respectively. Some interesting observations emerge from the results. We find that both methods work well with lower node sparsity and higher residual edge sparsity. Decreasing the residual edge sparsity and increasing the node sparsity have adverse effects on the performance. In general, BNLC shows relatively better performance than BNHC in cases with higher node sparsity and/or lower residual edge sparsity. We provide a brief discussion below to support these observations.

For BNHC, case 2 exhibits a few false positives, and the separation of posterior probabilities for truly active and truly inactive nodes is much more stark in case 1 than in case 2. BNLC does a better job of node identification than BNHC in case 2. The residual edge effect does have an impact on the probabilities, which is evident by comparing cases 1 and 3. For BNHC, case 3 (Simulation 1) displays poor performance, with a higher number of both false positives and false negatives. The performance of BNLC appears to be better than that of BNHC in case 3. Fixing the residual edge sparsity and increasing the node sparsity has a negative impact on node identification, as seen by comparing performances in cases 3 and 4 (Simulation 1). For Simulation 2, both competitors perform quite well in cases 1 and 2. Again, case 3 (Simulation 2) represents a higher node sparsity, so that both BNHC and BNLC do not perform well in this case. Similar to Simulation 1, BNHC shows inferior performance to BNLC in case 3. While BNHC offers a few false positives and false negatives in case 4 (Simulation 2), the performance appears to be much better than in case 3. Notice that case 3 has both higher node sparsity and higher residual edge sparsity than case 4. While they have opposing effects, it appears that higher node sparsity has more of an adverse effect here compared to a small perturbation in the residual edge sparsity. Recall that Relión et al. (2017) is the only other competitor which is designed to detect influential nodes. It detects all nodes to be influential in all simulation cases.

Figure 1: Simulation 1: node-level results for BNLC (a) and BNHC (b). Clear background denotes uninfluential and dark background denotes influential nodes in the truth. There are 25 rows (corresponding to 25 nodes) and 4 columns corresponding to the 4 cases in Simulation 1. The model-detected posterior probability of being influential is super-imposed onto the corresponding node.

Figure 2: Simulation 2: node-level results for BNLC (a) and BNHC (b). Clear background denotes uninfluential and dark background denotes influential nodes in the truth. There are 25 rows (corresponding to 25 nodes) and 4 columns corresponding to the 4 cases in Simulation 2. The model-detected posterior probability of being influential is super-imposed onto the corresponding node.

We apply the algorithm with a mixture of skewed t-distributions described in Appendix C to detect influential edges from the post burn-in MCMC samples of the edge coefficients, using a threshold of t = 0.05. The proposed approach controls the FDR below a threshold of 0.05 to account for multiplicity correction. Tables 3 and 4 provide the true positive rates (TPR) and false positive rates (FPR) in detecting important edges for Simulations 1 and 2, respectively. It is observed that when node sparsity is moderate and residual edge sparsity is high (cases 1 and 2), both BNLC and BNHC offer moderate performance in terms of identifying true positives, and include very few false positives. In these cases, BNHC generally exhibits a slightly higher FPR than BNLC. In the case of high node sparsity (e.g., case 3, Simulation 2), both methods unfortunately show much lower true positive rates. Again, lower edge sparsity (case 3, Simulation 1) has almost no effect on the FPR of BNLC, but decreases the TPR substantially. For BNHC, both TPR and FPR increase when the residual edge sparsity is reduced. Nevertheless, both of them perform significantly better than Lasso in almost all cases. The competitor in Relión et al. (2017) appears to have suboptimal performance, as it identifies all edges as important in all the simulation scenarios, resulting in high FPRs.
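A generic Bayesian FDR thresholding sketch, offered here as a stand-in for the skewed-t mixture algorithm of Appendix C, selects the largest set of edges whose average posterior probability of being null stays at or below the target level.

```r
# Keep the top-ranked edges while the running Bayes FDR stays <= the target.
select_edges_fdr <- function(incl_prob, fdr = 0.05) {
  ord  <- order(incl_prob, decreasing = TRUE)
  bfdr <- cumsum(1 - incl_prob[ord]) / seq_along(ord)  # running Bayes FDR
  if (!any(bfdr <= fdr)) return(integer(0))
  ord[seq_len(max(which(bfdr <= fdr)))]                # indices of selected edges
}
```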
          BNLC          BNHC          Lasso         Relión (2017)
Cases     TPR    FPR    TPR    FPR    TPR    FPR    TPR    FPR
Case 1    0.65   0.01   0.72   0.12   0.50   0.22   1      1
Case 2    0.64   0.00   0.63   0.02   0.40   0.14   1      1
Case 3    0.45   0.00   0.86   0.40   0.42   0.22   1      1
Case 4    0.72   0.09   0.70   0.12   0.54   0.16   1      1

Table 3: True positive rates (TPR) and false positive rates (FPR) for edges for the cases in Simulation 1.

          BNLC          BNHC          Lasso         Relión (2017)
Cases     TPR    FPR    TPR    FPR    TPR    FPR    TPR    FPR
Case 1    0.63   0.00   0.84   0.08   0.44   0.20   1      1
Case 2    0.56   0.00   0.63   0.12   0.53   0.22   1      1
Case 3    0.46   0.02   0.59   0.08   0.31   0.16   1      1
Case 4    0.68   0.03   0.75   0.06   0.34   0.12   1      1

Table 4: True positive rates (TPR) and false positive rates (FPR) for edges for the cases in Simulation 2.

The results in Tables 3 and 4 indicate a higher number of edges identified as influential by BNHC than by BNLC in all simulations. Digging a bit deeper, we report in Table 5 the ratio of the number of edges in the intersection of both methods to the number of total edges identified by each method independently. In all simulation cases, almost all edges identified as influential by BNLC are also identified as influential by BNHC. In cases 2 and 4 (Simulation 1), the fractions corresponding to BNLC and BNHC are very similar, indicating similar edge identification by both of them. However, this fraction appears to be lower for BNHC in cases 1 and 3 (Simulation 1). This again shows that the edges identified by BNLC are also identified by BNHC, with BNHC identifying more edges. The discrepancy turns out to be larger in case 3 (Simulation 1), where BNHC has identified many more edges. Simulation 2 shows a similar trend. We further track the top 10, 20 and 30 edges identified by BNLC and record how many of these edges belong to the top 10, 20 and 30 edges identified by BNHC. Table 5 shows a high level of intersection among the top edges identified by these two methods.

A number of interesting observations emerge from the analysis. First of all, as mentioned earlier, the edges identified by BNLC are generally also identified by BNHC. BNHC tends to identify more edges, leading to higher TPR and FPR. Broadly, in the presence of higher node sparsity, the discrepancy is greater, with BNHC having much higher TPR and FPR. Interestingly, the absolute values of the edge coefficients follow very similar rankings for BNHC and BNLC, which leads to high intersections among the top edges selected by these methods. Perhaps the difference in the shrinkage mechanisms imposed by BNHC and BNLC is responsible for their difference in tail behavior, leading to differences in edge selection.

         Simulation 1                                       Simulation 2
Cases    N_{BL,BH}/N_{BL}  N_{BL,BH}/N_{BH}  Top10 Top20 Top30    N_{BL,BH}/N_{BL}  N_{BL,BH}/N_{BH}  Top10 Top20 Top30
1        0.94              0.61              9     19    27       1.00              0.58              7     17    26
2        0.85              0.83              8     14    21       1.00              0.46              8     17    26
3        1.00              0.25              9     13    24       0.97              0.70              9     18    28
4        0.91              0.87              8     18    27       0.91              0.75              8     17    27

Table 5: N_{BL,BH} represents the number of edges identified by both BNLC and BNHC. Similarly, N_{BL} and N_{BH} represent the number of edges identified by BNLC and BNHC, respectively. Top 10 represents the number of edges common among the top ten edges identified by BNLC and BNHC; Top 20 and Top 30 are defined analogously.
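The edge-ranking comparison behind the "Top 10/20/30" columns of Table 5 amounts to ranking edges by the absolute value of their posterior mean coefficients under each method and counting the overlap, as in the following sketch.

```r
# Overlap of the top-k edges ranked by |posterior mean| under two methods.
top_overlap <- function(mean_a, mean_b, k) {
  top_a <- order(abs(mean_a), decreasing = TRUE)[1:k]
  top_b <- order(abs(mean_b), decreasing = TRUE)[1:k]
  length(intersect(top_a, top_b))
}
```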
The mean squared errors (MSE) associated with point estimation of the edge coefficients for the different competitors are presented in Tables 6 and 7, corresponding to Simulations 1 and 2, respectively. For the Bayesian competitors, point estimates are computed using the posterior means of the edge coefficients. In all cases, BNLC and BNHC consistently outperform all other competitors, with the binary Bayesian Lasso exhibiting the next best performance. In all simulation cases, BNLC comprehensively outperforms BNHC in terms of estimating the edge coefficients. Consistent with earlier observations, both competitors tend to be less accurate when node sparsity increases.

Table 6: Mean squared error (MSE) of estimating edge coefficients for all competitors (BNLC, BNHC, Lasso, Relión (2017), binary BLasso, binary BHS) in the four cases of Simulation 1.

Table 7: Mean squared error (MSE) of estimating edge coefficients for all competitors in the four cases of Simulation 2.

Figure 3 records the AUC for all competitors in Simulations 1 and 2. In almost all cases, the AUCs for BNHC and BNLC turn out to be higher than those of the other competitors. On the other hand, Relión et al. (2017) appears to classify samples close to randomly, with AUC around 0.5.

Figure 3: Classification performance in the form of the area under the curve (AUC) of ROC for all cases in Simulation 1 (a) and Simulation 2 (b), for BNLC, BNHC, binary Lasso, Relión, binary BLasso and binary BHS.

Figures 4 and 5 present posterior probabilities of the effective dimensionality of the latent positions u_1, ..., u_V for BNLC and BNHC in Simulations 1 and 2, respectively. Note that the true dimension of the latent space is known and recorded for all simulations in Tables 1 and 2. In all 8 cases, the posterior mode corresponds to the true dimension of the latent space, and the posterior of R_eff in BNHC concentrates more sharply around R_g in all cases.

Figure 4: Posterior probability distribution of the effective dimensionality for the BNLC and BNHC models in all 4 cases of Simulation 1. Filled bullets indicate the true value of the effective dimensionality.

Figure 5: Posterior probability distribution of the effective dimensionality for the BNLC and BNHC models in all 4 cases of Simulation 2. Filled bullets indicate the true value of the effective dimensionality.

To assess how sensitive the inferences from BNLC and BNHC are, we analyze BNLC and BNHC with different combinations of hyperparameters. Specifically, for BNLC we use the five combinations (i) a_∆ = 1, b_∆ = 9; (ii) ν = 20, δ = 5; (iii) ν = 50, δ = 5; (iv) ν = 20, δ = 0.2; (v) ν = 50, δ = 0.2. Combination (i) ensures a small prior mean for the ξ_k's, while combinations (ii)-(v) allow a range of prior means for θ and M. On the other hand, the three combinations we employ for BNHC are (i)' a = 1, b = 9; (ii)' ν = 10; (iii)' ν = 50. With these hyperparameter combinations for BNLC and BNHC, we analyze the data simulated in case 4, Simulation 1 (case chosen randomly), report performances on influential node and edge identification, and report the MSE values for estimating the network coefficient matrix. All these inferences with different choices of hyperparameters are compared among themselves and with the inferences reported earlier for case 4, Simulation 1.
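The MSE criterion used in Tables 6-8 can be computed directly from the posterior draws, as in the sketch below; gamma_draws is a hypothetical L x q matrix of post burn-in samples of the edge coefficients.

```r
# Squared error between the posterior mean edge coefficients and the truth.
mse_edges <- function(gamma_draws, gamma_true) {
  mean((colMeans(gamma_draws) - gamma_true)^2)
}
```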
Table 8 records the MSE values for estimating the network coefficient under all these combinations. The MSE values for BNLC range between 0.10 and 0.30, and the MSE values for BNHC range between 0.19 and 0.28, across the different choices of hyperparameters (see Table 8). Figure 6 shows the posterior probabilities of a node being identified as influential under all these hyperparameter combinations. The probabilities are only slightly affected by the change of hyperparameters. In fact, under hyperparameter combinations (i), (ii) and (iv), BNLC identifies the same set of nodes as influential as the original BNLC prior. Under combination (iii), BNLC does not identify node 9 as influential, which has been identified as influential by the original BNLC prior. Under combination (v), BNLC identifies one additional node (node 21) as influential over the set of nodes identified by the original prior. Under hyperparameter combination (i)', BNHC identifies the same set of nodes as the original BNHC prior except for four nodes (including nodes 4 and 25), which are identified as influential by the original prior but not by combination (i)'. Combinations (ii)' and (iii)' also identify the same set of nodes as the original BNHC prior except for three nodes (including nodes 9 and 25). Finally, Table 9 offers the TPR and FPR values corresponding to the identification of influential edges for BNLC and BNHC under the various combinations of hyperparameters. The TPR for BNHC under combination (iii)' turns out to be a little higher than the rest, but overall the numbers do not show a lot of variation. We emphasize that the results turn out to be better than our competitors under all combinations.

               BNLC                               BNHC
Combinations   (i)    (ii)   (iii)  (iv)   (v)    (i)'   (ii)'  (iii)'
MSE            0.14   0.30   0.22   0.10   0.22   0.19   0.28   0.28

Table 8: Mean squared error (MSE) of estimating the network coefficient in BNLC and BNHC for different combinations of hyper-parameters.

Figure 6: P(ξ_k = 1 | Data) for BNLC (a) and BNHC (b) under different hyper-parameter combinations in the simulated data for case 4 (Simulation 1).

               BNLC                               BNHC
Combinations   (i)    (ii)   (iii)  (iv)   (v)    (i)'   (ii)'  (iii)'
TPR            0.80   0.76   0.82   0.83   0.78   0.64   0.88   0.82
FPR            0.16   0.21   0.17   0.21   0.18   0.19   0.24   0.18

Table 9: True positive rates (TPR) and false positive rates (FPR) of identifying influential edges in BNLC and BNHC for different combinations of hyper-parameters.

6 Analysis of Brain Connectome Data

In this section, we present the inferential and classification ability of BNLC and BNHC in the context of a weighted diffusion tensor imaging (DTI) dataset. Our dataset contains information on the full scale intelligence quotient (FSIQ) for multiple individuals. FSIQ is a measure of an individual's complete cognitive capacity. It is derived from the administration of selected sub-tests from the Wechsler Intelligence Scales (WIS), designed to provide a measure of an individual's overall level of general cognitive and intellectual functioning, and is a summary score derived from an individual's performance on a variety of tasks that measure acquired knowledge, verbal reasoning, attention to verbal materials, fluid reasoning, spatial processing, attentiveness to details, and visual-motor integration Caplan et al. (2011). A substantial body of literature has suggested that there is an IQ threshold (usually described as an IQ of approximately 120 points) that may be characterized as superior reasoning ability Brown et al. (2009); Carson et al. (2003).
Following this literature, we convert the FSIQ scores into a binary response variable y, which takes value 0 if FSIQ is less than or equal to 120, and value 1 if FSIQ is greater than 120. Thus, we classify the subjects in our study as belonging to the low IQ group if y = 0, and to the high IQ group if y = 1.

Along with the FSIQ measurements, brain connectome information for n = 114 subjects is gathered using weighted diffusion tensor imaging (DTI). DTI is a brain imaging technique that enables measurement of the restricted diffusion of water in tissue in order to produce neural tract images. The brain imaging data we use has been pre-processed using the NDMG pre-processing pipeline Kiar et al. (2016); Kiar et al. (2017a); Kiar et al. (2017b). In the context of DTI, the human brain is divided according to the Desikan atlas Desikan et al. (2006), which identifies 34 cortical regions of interest (ROIs) in each of the left and right hemispheres of the human brain, implying 68 cortical ROIs in all. Similar to Guha and Rodriguez (2020), this results in a brain network given by a 68 × 68 matrix for each individual. Our scientific goals in this setting include identification of brain regions or network nodes significantly related to FSIQ and classification of a subject into the low IQ or high IQ group based on his/her brain connectome information.

Identical prior distributions for all the parameters as in the simulation studies have been used. BNLC and BNHC are both fitted with R = 4, which is found to be sufficient for this study. Further, Guha and Rodriguez (2020) show robust inference as long as the chosen R is bigger than the effective dimensionality of the latent variables. Similar to Guha and Rodriguez (2020), we also carry out a sensitivity study to check the impact of R on predictive inference. The choices of hyperparameters for BNLC and BNHC are made similarly to the simulation studies; a brief explanation of these choices is provided in the simulation section. The MCMC chain is run for 50,000 iterations, with the first 30,000 discarded as burn-in, and convergence is assessed as before Gelman et al. (2014a). All inference is based on the remaining 20,000 post burn-in iterates, appropriately thinned.

As in the simulation studies, we put our emphasis on identifying influential brain regions of interest (ROIs) associated with FSIQ. The BNLC model estimates posterior probabilities over 0.5 of being influential for 38 ROIs, out of which 20 regions are in the left hemisphere and 18 regions are in the right hemisphere. Among the regions detected in both hemispheres, a large number belong to the frontal, temporal and cingulate lobes. Using the same principle, the BNHC model identifies 48 nodes to be influential. Out of the 48 influential nodes, 26 are detected in the left hemisphere and the rest in the right hemisphere. The ROIs are mainly detected in the temporal, frontal, parietal and cingulate lobes in both hemispheres. Figure 8 plots the estimated posterior probability of an ROI being detected as influential by the BNLC and BNHC models. Notably, there are 29 ROIs identified by both BNLC and BNHC, given in Table 10.

A large number of the 29 influential nodes detected by both BNLC and BNHC are part of the frontal lobes in both hemispheres. Numerous studies have linked the frontal region to an individual's intelligence and cognitive functions Yoon et al. (2017); Stuss et al. (1985); Razumnikova (2007); Miller and Milner (1985); Kolb and Milner (1981).
Our method also finds a significant association between FSIQ and the left inferior parietal lobule, the left precuneus and the supramarginal gyri in both hemispheres, all in the parietal lobe; these regions were also found to be significantly related to FSIQ by Yoon et al. (2017).

We additionally look into ROIs which are detected by only one of the two methods (say, BNLC), and report the posterior probabilities of these ROIs being active under the other method (i.e., BNHC). Figure 7 shows the posterior probabilities of nodes being active under the 'other' method as discussed above. It is observed that the nodes selected by BNHC but not by BNLC have probabilities not very far from 0.5.

Figure 7: Posterior probabilities of being active, under one method, for the nodes selected as influential by the other method only. The left panel lists nodes selected by BNLC but not by BNHC, and the right panel lists nodes selected by BNHC but not by BNLC.

Influential edges are then sought among the edges connecting the nodes detected as influential, i.e., among $\binom{38}{2}$ and $\binom{48}{2}$ possibilities for BNLC and BNHC, respectively. Since a different number of nodes are detected as influential by BNHC and BNLC, to make a fair comparison we consider the 29 nodes detected as influential by both methods, and use our algorithm to find the number of influential edges among these $\binom{29}{2}$ possibilities for both BNLC and BNHC. The numbers turn out to be 96 and 184, respectively. We note that there are a few nodes which are identified as influential by either BNHC or BNLC, but none of the edges connecting these nodes are found to be influential. As an example, although the frontal pole and the temporal pole in the left hemisphere are identified as influential nodes by BNLC, none of the edges connecting these two nodes turn out to be influential. This phenomenon may be due to the use of the FDR in the edge selection procedure, which finds edges that are most likely to be active while controlling for false discoveries. Hence, not identifying an edge does not necessarily mean that the edge is not active; it just means that there are others that satisfy the criteria better.

Similar to the simulation studies, we dig deeper to analyze the discrepancy in the number of influential edges identified by BNLC and BNHC. Specifically, we rank the $\binom{29}{2} = 406$ edges connecting the nodes found to be influential by both BNLC and BNHC according to the absolute values of their posterior means. Table 11 shows between 23% and 74% intersection among the top-ranked edges.

To examine the predictive ability of the Bayesian network classification model, we report the area under the curve (AUC) of the ROC curve for BNLC and BNHC, along with all competing methods. The AUCs are computed using a 10-fold cross validation approach.
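The cross-validated AUC computation can be sketched as follows, assuming the pROC package; fit_and_score is a hypothetical user-supplied function that fits the classifier on the training indices and returns out-of-fold posterior probabilities of y = 1 for the test indices.

```r
# 10-fold cross-validated AUC (a sketch, not the authors' exact pipeline).
library(pROC)
cv_auc <- function(y, fit_and_score, folds = 10) {
  idx    <- sample(rep(1:folds, length.out = length(y)))  # random fold labels
  scores <- numeric(length(y))
  for (f in 1:folds) {
    scores[idx == f] <- fit_and_score(train = which(idx != f),
                                      test  = which(idx == f))
  }
  as.numeric(auc(roc(y, scores, quiet = TRUE)))
}
```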
The AUC estimates presented in Table 12 indicate better performance of both BNLC and BNHC, with BNLC slightly outperforming BNHC. The frequentist binary Lasso turns out to be the next best performer, while BLasso and BHS perform very similarly to a random classifier. Finally, the effective dimensionality of the model is investigated for both BNLC and BNHC, and the posterior means turn out to be 2.17 and 2, respectively.

Figure 8: Lateral and medial views of the brain (left and right hemispheres) showing all 68 regions of interest (ROIs) for BNLC (a) and BNHC (b). The size and color of the ROIs vary according to the value of the posterior probability of being actively related to the binary response.

Hemisphere   Lobe        Nodes
Left         Temporal    fusiform, middle temporal gyrus, parahippocampal, temporal pole, transverse temporal
             Cingulate   isthmus cingulate cortex
             Frontal     pars opercularis, pars orbitalis, pars triangularis, frontal pole
             Occipital   lingual
             Parietal    inferior parietal lobule, precuneus, supramarginal gyrus
             Insula      insula
Right        Temporal    parahippocampal, superior temporal gyrus, temporal pole
             Cingulate   caudal anterior cingulate, isthmus cingulate cortex
             Frontal     lateral orbitofrontal, medial orbitofrontal, pars opercularis, pars orbitalis, rostral middle frontal gyrus, superior frontal gyrus
             Occipital   pericalcarine
             Parietal    supramarginal gyrus
             Insula      insula

Table 10: Nodes identified as influential by both BNLC and BNHC.

Top 100   Top 200   Top 300
23        99        222

Table 11: Top 100 represents the number of edges common among the top 100 edges identified by BNLC and BNHC; Top 200 and Top 300 are defined analogously.

Table 12: AUC for BNLC, BNHC, Lasso, Relión (2017), binary BLasso and binary BHS, computed using 10-fold cross validation.

We have already discussed how the hyperparameters are chosen for the simulation studies and the data analysis. To assess how sensitive the inferences from BNLC and BNHC are, we analyze BNLC and BNHC with different combinations of hyperparameters. Specifically, for BNLC we use the five combinations (i)-(v) given in Section 5.5, and for BNHC the three combinations (i)'-(iii)' also mentioned in Section 5.5. We report performances on the number of influential nodes identified.
We also find the number of influential edges connecting the influential nodes. Table 13 records the number of nodes identified as influential, along with the number of these nodes intersecting with the original analysis, for all hyperparameter combinations. Influential edges are then sought among the edges connecting the influential nodes in BNLC and BNHC, respectively, for all hyperparameter combinations. Table 14 presents the number of edges detected as influential, as well as the number of intersecting edges with the original analysis. Again, due to the high dimensionality of the problem, some variation in the number of identified edges with different choices of hyperparameters is expected, though the variation turns out not to be very significant.

Figure 9: Plot showing whether an edge connecting two influential nodes is influential or not, for BNLC (a) and BNHC (b). The map is an M × M symmetric matrix, where M denotes the number of influential nodes, and each cell denotes an edge connecting the corresponding pair of nodes. The axis labels are the abbreviated names of the influential ROIs in the left (starting with 'lh-') and right (starting with 'rh-') hemispheres of the brain. Full names of the ROIs can be obtained from the widely available Desikan brain atlas. A white cell represents an influential edge, while a red cell represents a non-influential edge.

Table 13: Number of nodes identified as influential in BNLC and BNHC for the different combinations of hyper-parameters.

Table 14: Number of edges identified as influential in BNLC and BNHC for the different combinations of hyper-parameters, along with the number intersecting with the original analysis.

Finally, to check sensitivity to the choice of R on the performance of BNLC and BNHC, we run the data analysis for BNHC and BNLC with R = 8 and R = 10, and report the posterior mean of the effective dimensionality along with the AUC. Table 15 reports the posterior mean of the effective dimensionality, which shows a very moderate increase with increasing R. However, increasing R seems to have almost no effect on the AUC.

                          BNLC                      BNHC
                          R = 4   R = 8   R = 10    R = 4   R = 8   R = 10
Posterior mean Eff. Dim.  2.17    2.78    2.96      2.00    2.74    3.04
AUC                       0.61    0.63    0.59      0.59    0.60    0.59

Table 15: AUC and posterior mean of the effective dimensionality for BNLC and BNHC under different choices of R.

7 Discussion

We develop a binary Bayesian network regression model that enables classifying multiple networks with "labeled nodes" into two groups, identifies influential network nodes, and predicts the class to which a newly observed network belongs. Our contribution lies in carefully constructing a class of network global-local shrinkage priors on the network predictor coefficient while recognizing the latent network structure in the predictor variable. In particular, we investigate two specific network shrinkage priors from this general class, leading to the two network classifiers BNLC and BNHC.
Our extensive simulation study shows competitive performance between BNLC and BNHC in terms of inference and classification, with no clear winner, and both are found to outperform the other competitors. Another major contribution of the proposed framework is the theoretical understanding of the Bayesian network classifier model with the network lasso shrinkage prior. Specifically, we develop theory guaranteeing accurate classification as the sample size tends to infinity. The theoretical developments allow the number of possible interconnections in the network predictor to grow at a faster rate than the sample size. We analyze a brain connectome dataset with brain connectivity networks between different regions of interest for multiple individuals, along with information on whether an individual is in a low or a high IQ category. The proposed classifiers show satisfactory out-of-sample classification and identify important brain regions actively influencing the FSIQ of an individual.

References

Armagan, A., Dunson, D. B., and Lee, J. (2013a). Generalized double Pareto shrinkage. Statistica Sinica, (1), 119-143.

Armagan, A., Dunson, D. B., Lee, J., Bajwa, W. U., and Strawn, N. (2013b). Posterior consistency in linear models under shrinkage priors. Biometrika, (4), 1011-1018.

Belitser, E. and Nurushev, N. (2015). Needles and straw in a haystack: robust confidence for possibly sparse sequences. arXiv preprint arXiv:1511.01803.

Brown, T. E., Reichel, P. C., and Quinlan, D. M. (2009). Executive function impairments in high IQ adults with ADHD. Journal of Attention Disorders, (2), 161-167.

Bullmore, E. and Sporns, O. (2009). Complex brain networks: graph theoretical analysis of structural and functional systems. Nature Reviews. Neuroscience, (3), 186-198.

Caplan, B., Kreutzer, J. S., and DeLuca, J. (2011). Encyclopedia of Clinical Neuropsychology; With 199 Figures and 139 Tables. Springer.

Carson, S. H., Peterson, J. B., and Higgins, D. M. (2003). Decreased latent inhibition is associated with increased creative achievement in high-functioning individuals. Journal of Personality and Social Psychology, (3), 499.

Carvalho, C. M., Polson, N. G., and Scott, J. G. (2010). The horseshoe estimator for sparse signals. Biometrika, (2), 465-480.

Castillo, I., van der Vaart, A., et al. (2012). Needles and straw in a haystack: Posterior concentration for possibly sparse sequences. The Annals of Statistics, (4), 2069-2101.

Castillo, I., Rousseau, J., et al. (2015). A Bernstein-von Mises theorem for smooth functionals in semiparametric models. The Annals of Statistics, (6), 2353-2383.

Craddock, R. C., Holtzheimer III, P. E., Hu, X. P., and Mayberg, H. S. (2009). Disease state prediction from resting state functional connectivity. Magnetic Resonance in Medicine, (6), 1619-1628.

Daianu, M., Jahanshad, N., Nir, T. M., Toga, A. W., Jack Jr, C. R., Weiner, M. W., and Thompson, for the Alzheimer's Disease Neuroimaging Initiative, P. M. (2013). Breakdown of brain connectivity between normal aging and Alzheimer's disease: a structural k-core network analysis. Brain Connectivity, (4), 407-422.

Deshpande, M., Kuramochi, M., Wale, N., and Karypis, G. (2005). Frequent substructure-based approaches for classifying chemical compounds. IEEE Transactions on Knowledge and Data Engineering, (8), 1036-1050.

Desikan, R. S., Ségonne, F., Fischl, B., Quinn, B. T., Dickerson, B. C., Blacker, D., Buckner, R. L., Dale, A. M., Maguire, R. P., Hyman, B. T., et al. (2006). An automated labeling system for subdividing the human cerebral cortex on MRI scans into gyral based regions of interest. NeuroImage, (3), 968-980.
Durante, D. and Dunson, D. B. (2017). Bayesian inference and testing of group differences in brain networks. Bayesian Analysis, doi:10.1214/16-BA1030. Advance publication.

Erdős, P. and Rényi, A. (1960). On the evolution of random graphs. Publication of the Mathematical Institute of the Hungarian Academy of Sciences, (1), 17-60.

Fei, H. and Huan, J. (2010). Boosting with structure information in the functional space: an application to graph classification. In Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 643-652. ACM.

Frank, O. and Strauss, D. (1986). Markov graphs. Journal of the American Statistical Association, (395), 832-842.

Friedman, J., Hastie, T., and Tibshirani, R. (2010). Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software, (1), 1-22.

Gelman, A., Carlin, J. B., Stern, H. S., Dunson, D. B., Vehtari, A., and Rubin, D. B. (2014a). Bayesian Data Analysis, volume 2. CRC Press, Boca Raton, FL.

Gelman, A., Hwang, J., and Vehtari, A. (2014b). Understanding predictive information criteria for Bayesian models. Statistics and Computing, (6), 997-1016.

Ghosal, S., Roy, A., et al. (2006). Posterior consistency of Gaussian process prior for nonparametric binary regression. The Annals of Statistics, (5), 2413-2429.

Guha, S. and Rodriguez, A. (2020). Bayesian regression with undirected network predictors with an application to brain connectome data. Journal of the American Statistical Association, pages 1-34.

Helma, C., King, R. D., Kramer, S., and Srinivasan, A. (2001). The predictive toxicology challenge 2000-2001. Bioinformatics, (1), 107-108.

Hoff, P. D. (2005). Bilinear mixed-effects models for dyadic data. Journal of the American Statistical Association, (469), 286-295.

Hoff, P. D. (2009). Multiplicative latent factor models for description and prediction of social networks. Computational and Mathematical Organization Theory, (4), 261.

Hoff, P. D., Raftery, A. E., and Handcock, M. S. (2002). Latent space approaches to social network analysis. Journal of the American Statistical Association, (460), 1090-1098.

Kiar, G., Gray Roncal, W., Mhembere, D., Bridgeford, E., Burns, R., and Vogelstein, J. (2016). ndmg: NeuroData's MRI graphs pipeline.

Kiar, G., Gorgolewski, K., and Kleissas, D. (2017a). Example use case of SIC with the ndmg pipeline (SIC: ndmg). GigaScience Database.

Kiar, G., Gorgolewski, K. J., Kleissas, D., Roncal, W. G., Litt, B., Wandell, B., Poldrack, R. A., Wiener, M., Vogelstein, R. J., Burns, R., et al. (2017b). Science in the cloud (SIC): A use case in MRI connectomics. GigaScience, (5), 1-10.

Klenke, A. (2013). Probability Theory: A Comprehensive Course. Springer Science & Business Media.

Kolb, B. and Milner, B. (1981). Performance of complex arm and facial movements after focal brain lesions. Neuropsychologia, (4), 491-503.

Makalic, E. and Schmidt, D. F. (2015). A simple sampler for the horseshoe estimator. IEEE Signal Processing Letters, (1), 179-182.

Martin, R., Mess, R., Walker, S. G., et al. (2017). Empirical Bayes posterior concentration in sparse high-dimensional linear models. Bernoulli, (3), 1822-1847.

Miller, L. and Milner, B. (1985). Cognitive risk-taking after frontal or temporal lobectomy-II. The synthesis of phonemic and semantic information. Neuropsychologia, (3), 371-379.

Nowicki, K. and Snijders, T. A. B. (2001). Estimation and prediction for stochastic blockstructures. Journal of the American Statistical Association, (455), 1077-1087.
Olde Dubbelink, K. T., Hillebrand, A., Stoffers, D., Deijen, J. B., Twisk, J. W., Stam, C. J., and Berendse, H. W. (2013). Disrupted brain network topology in Parkinson's disease: a longitudinal magnetoencephalography study. Brain, (1), 197-207.

Park, T. and Casella, G. (2008). The Bayesian lasso. Journal of the American Statistical Association, (482), 681-686.

Polson, N. G., Scott, J. G., and Windle, J. (2013). Bayesian inference for logistic models using Pólya-Gamma latent variables. Journal of the American Statistical Association, (504), 1339-1349.

Razumnikova, O. M. (2007). Creativity related cortex activity in the remote associates task. Brain Research Bulletin, (1), 96-102.

Relión, J. D. A., Kessler, D., Levina, E., and Taylor, S. F. (2017). Network classification with applications to brain connectomics. arXiv preprint arXiv:1701.08140.

Richiardi, J., Eryilmaz, H., Schwartz, S., Vuilleumier, P., and Van De Ville, D. (2011). Decoding brain states from fMRI connectivity graphs. NeuroImage, (2), 616-626.

Song, Q. and Liang, F. (2017). Nearly optimal Bayesian shrinkage for high dimensional regression. arXiv preprint arXiv:1712.08964.

Srinivasan, A., Muggleton, S. H., Sternberg, M. J., and King, R. D. (1996). Theories for mutagenicity: A study in first-order and feature-based induction. Artificial Intelligence, (1-2), 277-299.

Stuss, D., Ely, P., Hugenholtz, H., Richard, M., LaRochelle, S., Poirier, C., and Bell, I. (1985). Subtle neuropsychological deficits in patients with good recovery after closed head injury. Neurosurgery, (1), 41-47.

Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B (Methodological), (1), 267-288.

Van Der Pas, S. L., Kleijn, B. J., Van Der Vaart, A. W., et al. (2014). The horseshoe estimator: Posterior concentration around nearly black vectors. Electronic Journal of Statistics, (2), 2585-2618.

Vishwanathan, S. V. N., Schraudolph, N. N., Kondor, R., and Borgwardt, K. M. (2010). Graph kernels. Journal of Machine Learning Research, (Apr), 1201-1242.

Vogelstein, J. T., Roncal, W. G., Vogelstein, R. J., and Priebe, C. E. (2013). Graph classification using signal-subgraphs: Applications in statistical connectomics. IEEE Transactions on Pattern Analysis and Machine Intelligence, (7), 1539-1551.

Wei, R. and Ghosal, S. (2017). Contraction properties of shrinkage priors in logistic regression.

Yoon, Y. B., Shin, W.-G., Lee, T. Y., Hur, J.-W., Cho, K. I. K., Sohn, W. S., Kim, S.-G., Lee, K.-H., and Kwon, J. S. (2017). Brain structural networks associated with intelligence and visuomotor ability. Scientific Reports, (1), 2177.

Zhang, J., Cheng, W., Wang, Z., Zhang, Z., Lu, W., Lu, G., and Feng, J. (2012). Pattern classification of large-scale functional brain networks: identification of informative neuroimaging markers for epilepsy. PLoS ONE, (5), e36733.

8.1 Appendix A

This section provides the full conditionals for all the parameters in the Bayesian binary network regression with the network lasso shrinkage prior on $\boldsymbol{\gamma}$. Assume $\boldsymbol{W} = (\boldsymbol{u}_1'\Lambda\boldsymbol{u}_2, ..., \boldsymbol{u}_1'\Lambda\boldsymbol{u}_V, ..., \boldsymbol{u}_{V-1}'\Lambda\boldsymbol{u}_V)'$, $\boldsymbol{D} = \mathrm{diag}(s_{1,2}, ..., s_{V-1,V})$ and $\boldsymbol{\gamma} = (\gamma_{1,2}, ..., \gamma_{V-1,V})'$.
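Since this notation packs the upper triangle of the network into long vectors, a small sketch may help fix ideas. The code below is illustrative only: the helper names and the choice of NumPy are ours, and it simply shows how $\boldsymbol{\gamma}$, $\boldsymbol{W}$ and the diagonal of $\boldsymbol{D}$ line up with node pairs $(k, l)$, $k < l$.

```python
import numpy as np

V, R = 5, 3                      # V nodes, R-dimensional latent vectors
rng = np.random.default_rng(1)

U = rng.normal(size=(V, R))      # rows are the latent node vectors u_k
lam = np.array([1.0, 1.0, 0.0])  # diagonal of Lambda (binary in the model)
S = rng.exponential(size=(V, V)) # s_{k,l}; only the upper triangle is used

# Vectorize over node pairs (k, l) with k < l, in the same order for all
# quantities, so gamma, W and the diagonal of D align entry by entry.
pairs = [(k, l) for k in range(V) for l in range(k + 1, V)]
W = np.array([U[k] @ np.diag(lam) @ U[l] for k, l in pairs])
D_diag = np.array([S[k, l] for k, l in pairs])

# The network lasso prior then reads gamma ~ N(W, diag(D_diag)).
gamma = rng.normal(loc=W, scale=np.sqrt(D_diag))
print(gamma.shape)               # (V*(V-1)/2,) = (10,)
```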
Thus, with $n$ data points, the hierarchical model with the network lasso prior in the binary setting can be written as
$$\boldsymbol{t} \sim \mathrm{N}(\mu\boldsymbol{1} + \boldsymbol{X}\boldsymbol{\gamma},\ \boldsymbol{\Omega}^{-1}), \qquad \boldsymbol{\gamma} \sim \mathrm{N}(\boldsymbol{W}, \boldsymbol{D}),$$
$$\boldsymbol{u}_k \mid \xi_k = 1 \sim \mathrm{N}(\boldsymbol{0}, \boldsymbol{Q}), \qquad \boldsymbol{u}_k \mid \xi_k = 0 \sim \delta_{\boldsymbol{0}}, \qquad \xi_k \sim \mathrm{Ber}(\Delta), \qquad \mu \sim \mathrm{flat},$$
$$s_{k,l} \sim \mathrm{Exp}(\theta^2/2), \qquad \theta^2 \sim \mathrm{Gamma}(\zeta, \iota), \qquad \boldsymbol{Q} \sim \mathrm{IW}(\nu, \boldsymbol{I}), \qquad \Delta \sim \mathrm{Beta}(a_\Delta, b_\Delta),$$
$$\omega_i \sim \mathrm{PG}(1, 0), \qquad \lambda_r \sim \mathrm{Ber}(\pi_r), \qquad \pi_r \sim \mathrm{Beta}(1, r^\eta), \qquad \eta > 1,$$
where $\boldsymbol{\Omega} = \mathrm{diag}(\omega_1, ..., \omega_n)$ and $\Lambda = \mathrm{diag}(\lambda_1, ..., \lambda_R)$. The full conditional distributions of the model parameters are given below.

• $\mu \mid - \sim \mathrm{N}\Big(\frac{\boldsymbol{1}'\boldsymbol{\Omega}(\boldsymbol{t} - \boldsymbol{X\gamma})}{\boldsymbol{1}'\boldsymbol{\Omega}\boldsymbol{1}},\ \frac{1}{\boldsymbol{1}'\boldsymbol{\Omega}\boldsymbol{1}}\Big)$.

• $\boldsymbol{\gamma} \mid - \sim \mathrm{N}(\boldsymbol{\mu}_{\gamma|\cdot}, \boldsymbol{\Sigma}_{\gamma|\cdot})$, where $\boldsymbol{\mu}_{\gamma|\cdot} = (\boldsymbol{X}'\boldsymbol{\Omega}\boldsymbol{X} + \boldsymbol{D}^{-1})^{-1}(\boldsymbol{X}'\boldsymbol{\Omega}(\boldsymbol{t} - \mu\boldsymbol{1}) + \boldsymbol{D}^{-1}\boldsymbol{W})$ and $\boldsymbol{\Sigma}_{\gamma|\cdot} = (\boldsymbol{X}'\boldsymbol{\Omega}\boldsymbol{X} + \boldsymbol{D}^{-1})^{-1}$.

• $s_{k,l} \mid - \sim \mathrm{GIG}\big(\tfrac{1}{2},\ (\gamma_{k,l} - \boldsymbol{u}_k'\Lambda\boldsymbol{u}_l)^2,\ \theta^2\big)$, where GIG denotes the generalized inverse Gaussian distribution.

• $\theta^2 \mid - \sim \mathrm{Gamma}\big(\zeta + \tfrac{V(V-1)}{2},\ \iota + \tfrac{1}{2}\sum_{1 \le k < l \le V} s_{k,l}\big)$.

• Following the Pólya-Gamma data augmentation scheme for logistic models (Polson et al., 2013), we obtain $\omega_i \mid - \sim \mathrm{PG}(1, \mu + \boldsymbol{x}_i'\boldsymbol{\gamma})$, for $i = 1, .., n$.
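As a companion to these conditionals, here is a minimal sketch of one Gibbs scan over the $(\omega_i, \boldsymbol{\gamma}, \mu, s_{k,l}, \theta^2)$ block, holding the latent node quantities $(\boldsymbol{u}_k, \Lambda)$, and hence $\boldsymbol{W}$, fixed. The `polyagamma` package for PG draws, the inverse-Gaussian route to the GIG draw, and all variable names are our assumptions rather than the paper's implementation.

```python
import numpy as np
# polyagamma is an assumed external dependency providing PG(h, z) draws.
from polyagamma import random_polyagamma

rng = np.random.default_rng(0)

def gibbs_scan(y, X, W, s, theta2, mu, gamma, zeta, iota):
    """One sweep over (omega, gamma, mu, s, theta2); W is held fixed."""
    n, q = X.shape

    # 1) omega_i | - ~ PG(1, mu + x_i' gamma); t_i = (y_i - 1/2) / omega_i are
    #    the pseudo-responses of the augmentation (Polson et al., 2013).
    psi = mu + X @ gamma
    omega = random_polyagamma(1, psi, random_state=rng)
    t = (y - 0.5) / omega

    # 2) gamma | - ~ N(mu_g, Sigma_g), Sigma_g = (X' Omega X + D^{-1})^{-1}.
    Dinv = np.diag(1.0 / s)
    Sigma_g = np.linalg.inv(X.T @ (omega[:, None] * X) + Dinv)
    mu_g = Sigma_g @ (X.T @ (omega * (t - mu)) + Dinv @ W)
    gamma = rng.multivariate_normal(mu_g, Sigma_g)

    # 3) mu | - ~ N(1' Omega (t - X gamma) / (1' Omega 1), 1 / (1' Omega 1)).
    denom = omega.sum()
    mu = rng.normal((omega * (t - X @ gamma)).sum() / denom,
                    np.sqrt(1.0 / denom))

    # 4) s_{k,l} | - ~ GIG(1/2, (gamma_{k,l} - W_{k,l})^2, theta2), drawn via
    #    the inverse-Gaussian form of Park and Casella (2008):
    #    1/s ~ InvGauss(sqrt(theta2 / resid^2), theta2).
    resid2 = (gamma - W) ** 2
    s = 1.0 / rng.wald(np.sqrt(theta2 / resid2), theta2)

    # 5) theta2 | - ~ Gamma(zeta + V(V-1)/2, iota + sum(s)/2); q = V(V-1)/2.
    theta2 = rng.gamma(zeta + q, 1.0 / (iota + 0.5 * s.sum()))
    return omega, gamma, mu, s, theta2
```

A full sampler would also update $(\boldsymbol{u}_k, \xi_k, \Lambda, \boldsymbol{Q}, \Delta)$; those conditionals are not reproduced in this excerpt and are omitted here.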
8.2 Appendix B

This section provides the full conditionals for all the parameters in the Bayesian network classifier model introduced in this article with the network horseshoe prior. Assume $\boldsymbol{W} = (\boldsymbol{u}_1'\Lambda\boldsymbol{u}_2, ..., \boldsymbol{u}_1'\Lambda\boldsymbol{u}_V, ..., \boldsymbol{u}_{V-1}'\Lambda\boldsymbol{u}_V)'$, $\boldsymbol{D} = \mathrm{diag}(\sigma^2 s_{1,2}^2, ..., \sigma^2 s_{V-1,V}^2)$ and $\boldsymbol{\gamma} = (\gamma_{1,2}, ..., \gamma_{V-1,V})'$. Thus, with $n$ data points, the hierarchical model with the network horseshoe prior in the binary setting can be written as
$$\boldsymbol{t} \sim \mathrm{N}(\mu\boldsymbol{1} + \boldsymbol{X}\boldsymbol{\gamma},\ \boldsymbol{\Omega}^{-1}), \qquad \boldsymbol{\gamma} \sim \mathrm{N}(\boldsymbol{W}, \boldsymbol{D}),$$
$$\boldsymbol{u}_k \mid \xi_k = 1 \sim \mathrm{N}(\boldsymbol{0}, \boldsymbol{Q}), \qquad \boldsymbol{u}_k \mid \xi_k = 0 \sim \delta_{\boldsymbol{0}}, \qquad \xi_k \sim \mathrm{Ber}(\Delta), \qquad \mu \sim \mathrm{flat},$$
$$s_{k,l} \sim C^+(0, 1), \qquad \sigma \sim C^+(0, 1), \qquad \boldsymbol{Q} \sim \mathrm{IW}(\nu, \boldsymbol{I}), \qquad \Delta \sim \mathrm{Beta}(a_\Delta, b_\Delta),$$
$$\omega_i \sim \mathrm{PG}(1, 0), \qquad \lambda_r \sim \mathrm{Ber}(\pi_r), \qquad \pi_r \sim \mathrm{Beta}(1, r^\eta), \qquad \eta > 1.$$
Note that, following Makalic and Schmidt (2015), the half-Cauchy priors $s_{k,l} \sim C^+(0,1)$ and $\sigma \sim C^+(0,1)$ admit the inverse-gamma mixture representation
$$s_{k,l}^2 \mid \nu_{k,l} \sim \mathrm{IG}\Big(\frac{1}{2}, \frac{1}{\nu_{k,l}}\Big), \qquad \nu_{k,l} \sim \mathrm{IG}\Big(\frac{1}{2}, 1\Big), \qquad \sigma^2 \mid \varsigma \sim \mathrm{IG}\Big(\frac{1}{2}, \frac{1}{\varsigma}\Big), \qquad \varsigma \sim \mathrm{IG}\Big(\frac{1}{2}, 1\Big).$$
With the model formulation described above, the full conditional distributions of the model parameters are given by the following distributions:

• $\mu \mid - \sim \mathrm{N}\Big(\frac{\boldsymbol{1}'\boldsymbol{\Omega}(\boldsymbol{t} - \boldsymbol{X\gamma})}{\boldsymbol{1}'\boldsymbol{\Omega}\boldsymbol{1}},\ \frac{1}{\boldsymbol{1}'\boldsymbol{\Omega}\boldsymbol{1}}\Big)$.

• $\boldsymbol{\gamma} \mid - \sim \mathrm{N}(\boldsymbol{\mu}_{\gamma|\cdot}, \boldsymbol{\Sigma}_{\gamma|\cdot})$, where $\boldsymbol{\mu}_{\gamma|\cdot} = (\boldsymbol{X}'\boldsymbol{\Omega}\boldsymbol{X} + \boldsymbol{D}^{-1})^{-1}(\boldsymbol{X}'\boldsymbol{\Omega}(\boldsymbol{t} - \mu\boldsymbol{1}) + \boldsymbol{D}^{-1}\boldsymbol{W})$ and $\boldsymbol{\Sigma}_{\gamma|\cdot} = (\boldsymbol{X}'\boldsymbol{\Omega}\boldsymbol{X} + \boldsymbol{D}^{-1})^{-1}$.

• $s_{k,l}^2 \mid - \sim \mathrm{IG}\Big(1,\ \frac{1}{\nu_{k,l}} + \frac{(\gamma_{k,l} - \boldsymbol{u}_k'\Lambda\boldsymbol{u}_l)^2}{2\sigma^2}\Big)$, and $\nu_{k,l} \mid - \sim \mathrm{IG}\Big(1,\ 1 + \frac{1}{s_{k,l}^2}\Big)$.

• $\sigma^2 \mid - \sim \mathrm{IG}\Big(\frac{1}{2}\Big(1 + \frac{V(V-1)}{2}\Big),\ \frac{1}{\varsigma} + \frac{1}{2}\sum_{1 \le k < l \le V}\frac{(\gamma_{k,l} - \boldsymbol{u}_k'\Lambda\boldsymbol{u}_l)^2}{s_{k,l}^2}\Big)$, and $\varsigma \mid - \sim \mathrm{IG}\Big(1,\ 1 + \frac{1}{\sigma^2}\Big)$.

• Following the Pólya-Gamma data augmentation scheme (Polson et al., 2013), we obtain $\omega_i \mid - \sim \mathrm{PG}(1, \mu + \boldsymbol{x}_i'\boldsymbol{\gamma})$, for $i = 1, .., n$.
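The inverse-gamma conditionals above make the horseshoe scale updates one-liners. Below is a minimal sketch of that block, with our own variable names and under the same caveats as the lasso sketch; since NumPy only provides gamma draws, inverse-gamma draws are taken as reciprocals of gamma draws.

```python
import numpy as np

def inv_gamma(shape, scale, rng):
    """Draw IG(shape, scale) as the reciprocal of a Gamma(shape, 1/scale) draw."""
    return 1.0 / rng.gamma(shape, 1.0 / scale)

def horseshoe_scale_updates(gamma, W, s2, nu, sigma2, varsigma, rng):
    """Makalic-Schmidt (2015) slice of the Gibbs sampler for the network horseshoe."""
    resid2 = (gamma - W) ** 2      # (gamma_{k,l} - u_k' Lambda u_l)^2, vectorized
    q = resid2.size                # q = V(V-1)/2 edges

    # Local scales and their auxiliaries.
    s2 = inv_gamma(1.0, 1.0 / nu + resid2 / (2.0 * sigma2), rng)
    nu = inv_gamma(1.0, 1.0 + 1.0 / s2, rng)

    # Global scale and its auxiliary.
    sigma2 = inv_gamma(0.5 * (1.0 + q),
                       1.0 / varsigma + 0.5 * np.sum(resid2 / s2), rng)
    varsigma = inv_gamma(1.0, 1.0 + 1.0 / sigma2, rng)
    return s2, nu, sigma2, varsigma
```

These updates would slot into the same scan as the lasso sketch, replacing its steps 4) and 5), with the prior covariance diagonal now $\sigma^2 s_{k,l}^2$.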
8.3 Appendix C

Similar to the assumptions made by Wei and Ghosal (2017) in their proof of posterior consistency for binary logistic regression, we prove our results assuming that the centering parameter $\mu = 0$ in both the true and the data generating models. We note that the main structure of the proof remains unchanged under this assumption, and the result proved in this article can be trivially extended to the setting with nonzero $\mu$.

We begin by defining some notation. Throughout the proof, $\Pi(\cdot)$ denotes the generic probability notation. We define the log-likelihood function by
$$w_{\boldsymbol{\gamma},n}(\boldsymbol{y}^n) = \sum_{i=1}^{n}\big[(\boldsymbol{x}_i'\boldsymbol{\gamma})y_i - z(\boldsymbol{x}_i'\boldsymbol{\gamma})\big], \qquad z(\boldsymbol{x}_i'\boldsymbol{\gamma}) = \log\big(1 + \exp(\boldsymbol{x}_i'\boldsymbol{\gamma})\big). \tag{9}$$
We also introduce the function $C_{\boldsymbol{y}^n,n}(\cdot)$ to quantify the curvature of $w_{\boldsymbol{\gamma},n}(\boldsymbol{y}^n)$ around $\boldsymbol{\gamma}^{(0)}$,
$$C_{\boldsymbol{y}^n,n}(\boldsymbol{\gamma}) = w_{\boldsymbol{\gamma},n}(\boldsymbol{y}^n) - w_{\boldsymbol{\gamma}^{(0)},n}(\boldsymbol{y}^n) - \nabla w_{\boldsymbol{\gamma}^{(0)},n}(\boldsymbol{y}^n)'(\boldsymbol{\gamma} - \boldsymbol{\gamma}^{(0)}), \tag{10}$$
where $\nabla w_{\boldsymbol{\gamma}^{(0)},n}(\boldsymbol{y}^n)$ is the derivative of $w_{\boldsymbol{\gamma},n}(\boldsymbol{y}^n)$ with respect to $\boldsymbol{\gamma}$, evaluated at $\boldsymbol{\gamma}^{(0)}$. The likelihood $p_{\boldsymbol{\gamma}}(\boldsymbol{y}^n)$ can be written using the above notation as $p_{\boldsymbol{\gamma}}(\boldsymbol{y}^n) = \exp(w_{\boldsymbol{\gamma},n}(\boldsymbol{y}^n))$. The notations $E_{\boldsymbol{\gamma}}(\cdot)$ and $E_{\boldsymbol{\gamma}^{(0)}}(\cdot)$ are reserved for expectations with respect to the distributions of $\boldsymbol{y}^n \mid \boldsymbol{\gamma}$ and $\boldsymbol{y}^n \mid \boldsymbol{\gamma}^{(0)}$, respectively.

The proof of Theorem 3.1 relies in part on the existence of an exponentially consistent sequence of tests.

Definition. An exponentially consistent sequence of test functions $\Phi_n$ for testing $H_0: \boldsymbol{\gamma} = \boldsymbol{\gamma}^{(0)}$ vs. $H_1: \boldsymbol{\gamma} \in \mathcal{A}_n^c$ satisfies
$$E_{\boldsymbol{\gamma}^{(0)}}(\Phi_n) \le d_1\exp(-h_1 n), \qquad \sup_{\boldsymbol{\gamma}\in\mathcal{A}_n^c} E_{\boldsymbol{\gamma}}(1 - \Phi_n) \le d_2\exp(-h_2 n)$$
for some $d_1, d_2, h_1, h_2 > 0$.

Lemma 8.1. For some $h > 0$, there exists a sequence of test functions for testing $H_0: \boldsymbol{\gamma} = \boldsymbol{\gamma}^{(0)}$ vs. $H_1: \boldsymbol{\gamma} \in \mathcal{A}_n^c$ which satisfies
$$E_{\boldsymbol{\gamma}^{(0)}}(\Phi_n) \le \exp(-hn), \qquad \sup_{\boldsymbol{\gamma}\in\mathcal{A}_n^c} E_{\boldsymbol{\gamma}}(1 - \Phi_n) \le \exp(-hn). \tag{11}$$

Proof. The construction of the test is provided in the proofs of Theorem 2 and Lemma 4 in Ghosal et al. (2006).

We also state another result which will be used subsequently in the proof.

Lemma 8.2. Let $\boldsymbol{u}_k^{(0)} = (u_{k,1}^{(0)}, ..., u_{k,R}^{(0)})'$ for $k = 1, .., V_n$, and let $\upsilon_{k,l}$ be the only positive root of the equation
$$x^2 + x\big(\|\boldsymbol{u}_k^{(0)}\| + \|\boldsymbol{u}_l^{(0)}\|\big) - \eta = 0, \qquad k < l. \tag{12}$$
Assume $\upsilon = \min_{k,l}\upsilon_{k,l}$. Then, for $\boldsymbol{W} = (\boldsymbol{u}_1'\boldsymbol{u}_2, ..., \boldsymbol{u}_{V_n-1}'\boldsymbol{u}_{V_n})'$ and $\boldsymbol{W}^{(0)} = (\boldsymbol{u}_1^{(0)'}\boldsymbol{u}_2^{(0)}, ..., \boldsymbol{u}_{V_n-1}^{(0)'}\boldsymbol{u}_{V_n}^{(0)})'$,
$$\Pi\big(\|\boldsymbol{W} - \boldsymbol{W}^{(0)}\|_\infty < \eta\big) \ge \Pi\big(\|\boldsymbol{u}_k - \boldsymbol{u}_k^{(0)}\| \le \upsilon,\ \forall k = 1, .., V_n\big). \tag{13}$$

Proof. For $k < l$,
$$\big|\boldsymbol{u}_k'\boldsymbol{u}_l - \boldsymbol{u}_k^{(0)'}\boldsymbol{u}_l^{(0)}\big| = \Big|\sum_{r=1}^{R} u_{k,r}u_{l,r} - \sum_{r=1}^{R} u_{k,r}^{(0)}u_{l,r}^{(0)}\Big| \le \Big|\sum_{r=1}^{R}(u_{k,r} - u_{k,r}^{(0)})u_{l,r}\Big| + \Big|\sum_{r=1}^{R}(u_{l,r} - u_{l,r}^{(0)})u_{k,r}^{(0)}\Big|$$
$$\le \|\boldsymbol{u}_k - \boldsymbol{u}_k^{(0)}\|\,\|\boldsymbol{u}_l\| + \|\boldsymbol{u}_l - \boldsymbol{u}_l^{(0)}\|\,\|\boldsymbol{u}_k^{(0)}\| \le \|\boldsymbol{u}_k - \boldsymbol{u}_k^{(0)}\|\big[\|\boldsymbol{u}_l - \boldsymbol{u}_l^{(0)}\| + \|\boldsymbol{u}_l^{(0)}\|\big] + \|\boldsymbol{u}_l - \boldsymbol{u}_l^{(0)}\|\,\|\boldsymbol{u}_k^{(0)}\|.$$
If $\|\boldsymbol{u}_k - \boldsymbol{u}_k^{(0)}\| \le \upsilon$ for all $k = 1, .., V_n$, the above inequality implies $|\boldsymbol{u}_k'\boldsymbol{u}_l - \boldsymbol{u}_k^{(0)'}\boldsymbol{u}_l^{(0)}| \le \upsilon(\upsilon + \|\boldsymbol{u}_l^{(0)}\|) + \upsilon\|\boldsymbol{u}_k^{(0)}\| \le \eta$ for all $k < l$, by the definition of $\upsilon$. Hence $\Pi(\|\boldsymbol{W} - \boldsymbol{W}^{(0)}\|_\infty < \eta) \ge \Pi(\|\boldsymbol{u}_k - \boldsymbol{u}_k^{(0)}\| \le \upsilon,\ \forall k = 1, .., V_n)$.

Proof of Theorem 3.1. Suppose $E_n = \big\{\boldsymbol{y} : \|\nabla w_{\boldsymbol{\gamma}^{(0)},n}(\boldsymbol{y})\|_\infty \le \sqrt{nq_n}\big\}$. Then the probability of the vector $\boldsymbol{y}^n$ belonging to the set $E_n$ satisfies
$$P_{\boldsymbol{\gamma}^{(0)}}(\boldsymbol{y}^n \in E_n) \ge 1 - P_{\boldsymbol{\gamma}^{(0)}}\Big(\max_{1\le j\le q_n}\Big|\sum_{i=1}^{n}\big(y_i - \nabla z(\boldsymbol{x}_i'\boldsymbol{\gamma}^{(0)})\big)x_{ij}\Big| > \sqrt{nq_n}\Big) \ge 1 - 2q_n e^{-q_n/2},$$
where the last step follows from the Hoeffding inequality. Note that as $n \to \infty$, $q_n \to \infty$, hence $P_{\boldsymbol{\gamma}^{(0)}}(\boldsymbol{y}^n \in E_n) \to 1$. Hence, in the subsequent proof we can assume without loss of generality that $\boldsymbol{y}^n \in E_n$. It can be observed that
$$\Pi_n(\mathcal{A}_n^c) = \frac{\int_{\mathcal{A}_n^c} p_{\boldsymbol{\gamma}}(\boldsymbol{y}^n)\,\pi_n(\boldsymbol{\gamma})\,d\boldsymbol{\gamma}}{\int p_{\boldsymbol{\gamma}}(\boldsymbol{y}^n)\,\pi_n(\boldsymbol{\gamma})\,d\boldsymbol{\gamma}} = \frac{\int_{\mathcal{A}_n^c} \frac{p_{\boldsymbol{\gamma}}(\boldsymbol{y}^n)}{p_{\boldsymbol{\gamma}^{(0)}}(\boldsymbol{y}^n)}\,\pi_n(\boldsymbol{\gamma})\,d\boldsymbol{\gamma}}{\int \frac{p_{\boldsymbol{\gamma}}(\boldsymbol{y}^n)}{p_{\boldsymbol{\gamma}^{(0)}}(\boldsymbol{y}^n)}\,\pi_n(\boldsymbol{\gamma})\,d\boldsymbol{\gamma}} = \frac{N_n}{D_n} \le \Phi_n + (1 - \Phi_n)\frac{N_n}{D_n}, \tag{14}$$
where $\Phi_n$ is the exponentially consistent sequence of tests given in Lemma 8.1. The above inequality holds since $N_n/D_n \le 1$; this is in turn true as both are integrals of the same nonnegative function, $D_n$ being the integral over the entire set of possible $\boldsymbol{\gamma}$'s, while $N_n$ is the integral over the subset $\mathcal{A}_n^c$.
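For completeness, one way to carry out the Hoeffding step is sketched below, assuming the columns of the design are normalized so that $|x_{ij}| \le 1$; this normalization, and hence the exact constants, are our assumptions for illustration rather than the paper's stated conditions.

```latex
% Hoeffding step sketch: Z_i = y_i - \nabla z(x_i' \gamma^{(0)}) \in [-1, 1],
% so each summand Z_i x_{ij} lies in an interval of length at most 2.
\begin{align*}
P\Big(\Big|\sum_{i=1}^n Z_i x_{ij}\Big| > \sqrt{n q_n}\Big)
  &\le 2\exp\Big(-\frac{2 n q_n}{\sum_{i=1}^n 2^2}\Big) = 2 e^{-q_n/2}, \\
P\Big(\max_{1\le j\le q_n}\Big|\sum_{i=1}^n Z_i x_{ij}\Big| > \sqrt{n q_n}\Big)
  &\le 2 q_n e^{-q_n/2} \longrightarrow 0 \quad\text{as } q_n \to \infty .
\end{align*}
```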
In proving Theorem 3.1, we proceed in the following three steps.

(a) Step 1 shows that $\Phi_n \to 0$ as $n \to \infty$, almost surely.
(b) Step 2 shows that $\exp(hn/2)(1 - \Phi_n)N_n \to 0$ as $n \to \infty$, almost surely.
(c) Finally, Step 3 shows that $\exp(hn/2)D_n \to \infty$ as $n \to \infty$.

Here $h$ is as defined in Lemma 8.1. By (14), (a)-(c) imply $\Pi_n(\mathcal{A}_n^c) \to 0$. We now prove (a)-(c).

(a) Step 1. An application of the Markov inequality and (11) in Lemma 8.1 yields
$$P_{\boldsymbol{\gamma}^{(0)}}\big(\Phi_n > \exp(-nh/2)\big) \le E_{\boldsymbol{\gamma}^{(0)}}(\Phi_n)\exp(nh/2) \le \exp(-nh/2).$$
Therefore $\sum_{n=1}^{\infty} P_{\boldsymbol{\gamma}^{(0)}}(\Phi_n > \exp(-nh/2)) < \infty$. Applying the Borel-Cantelli lemma, there exist $n_0$ and a set $\Omega_0$ with $P_{\boldsymbol{\gamma}^{(0)}}(\Omega_0) = 0$ such that, for all $n > n_0$, $\Phi_n(\omega) < \exp(-nh/2)$ for every $\omega \in \Omega_0^c$. Since $\exp(-nh/2) \to 0$, this means that
$$\Phi_n \to 0, \quad n \to \infty, \quad \text{a.s.} \tag{15}$$

(b) Step 2. We have
$$E_{\boldsymbol{\gamma}^{(0)}}\big((1 - \Phi_n)N_n\big) = \int (1 - \Phi_n)\int_{\mathcal{A}_n^c}\frac{p_{\boldsymbol{\gamma}}(\boldsymbol{y}^n)}{p_{\boldsymbol{\gamma}^{(0)}}(\boldsymbol{y}^n)}\,\pi_n(\boldsymbol{\gamma})\,d\boldsymbol{\gamma}\; p_{\boldsymbol{\gamma}^{(0)}}(\boldsymbol{y}^n)\,d\boldsymbol{y}^n = \int_{\mathcal{A}_n^c} E_{\boldsymbol{\gamma}}(1 - \Phi_n)\,\pi_n(\boldsymbol{\gamma})\,d\boldsymbol{\gamma}$$
$$\le \sup_{\boldsymbol{\gamma}\in\mathcal{A}_n^c} E_{\boldsymbol{\gamma}}(1 - \Phi_n)\;\Pi(\mathcal{A}_n^c) \le \sup_{\boldsymbol{\gamma}\in\mathcal{A}_n^c} E_{\boldsymbol{\gamma}}(1 - \Phi_n) \le \exp(-nh).$$
Consider the set $G_{n,h} = \{(1 - \Phi_n)N_n\exp(nh/2) > \exp(-nh/4)\}$. The above inequality, together with the Markov inequality, gives $P_{\boldsymbol{\gamma}^{(0)}}(G_{n,h}) \le \exp(-nh/4)$, so that $\sum_{n=1}^{\infty} P_{\boldsymbol{\gamma}^{(0)}}(G_{n,h}) < \infty$. Again, since $h$ is fixed, applying the Borel-Cantelli lemma, $P_{\boldsymbol{\gamma}^{(0)}}(\limsup_{n\to\infty} G_{n,h}) = 0$. Using the definition of the limsup of the sets $G_{n,h}$ (Klenke, 2013), $G_{n,h}$ happens infinitely often with probability zero. Thus there exist $n_1$ and a set $\Omega_1$ with $P_{\boldsymbol{\gamma}^{(0)}}(\Omega_1) = 0$ such that, for all $n > n_1$, $(1 - \Phi_n(\omega))N_n\exp(nh/2) < \exp(-nh/4)$ for every $\omega \in \Omega_1^c$. Since $\exp(-nh/4) \to 0$, this means that
$$\exp(nh/2)(1 - \Phi_n)N_n \to 0 \quad \text{a.s.} \tag{16}$$

(c) Step 3. We have
$$D_n = \int \frac{p_{\boldsymbol{\gamma}}(\boldsymbol{y}^n)}{p_{\boldsymbol{\gamma}^{(0)}}(\boldsymbol{y}^n)}\,\pi(\boldsymbol{\gamma})\,d\boldsymbol{\gamma} = \int \exp\big(\nabla w_{\boldsymbol{\gamma}^{(0)},n}(\boldsymbol{y}^n)'(\boldsymbol{\gamma} - \boldsymbol{\gamma}^{(0)}) + C_{\boldsymbol{y}^n,n}(\boldsymbol{\gamma})\big)\,\pi(\boldsymbol{\gamma})\,d\boldsymbol{\gamma}$$
$$\ge \int \exp\Big(-\|\nabla w_{\boldsymbol{\gamma}^{(0)},n}(\boldsymbol{y}^n)\|_\infty\,\|\boldsymbol{\gamma} - \boldsymbol{\gamma}^{(0)}\|_1 - \frac{n}{8}\|\boldsymbol{\gamma} - \boldsymbol{\gamma}^{(0)}\|_1^2\Big)\,\pi(\boldsymbol{\gamma})\,d\boldsymbol{\gamma}$$
$$\ge \int \exp\Big(-\sqrt{nq_n}\,\|\boldsymbol{\gamma} - \boldsymbol{\gamma}^{(0)}\|_1 - \frac{n}{8}\|\boldsymbol{\gamma} - \boldsymbol{\gamma}^{(0)}\|_1^2\Big)\,\pi(\boldsymbol{\gamma})\,d\boldsymbol{\gamma}$$
$$\ge \exp\Big(-\frac{\sqrt{nq_n}\,\eta}{n^{\rho/2}} - \frac{n\eta^2}{8n^{\rho}}\Big)\,\Pi\Big(\|\boldsymbol{\gamma} - \boldsymbol{\gamma}^{(0)}\|_1 < \frac{\eta}{n^{\rho/2}}\Big),$$
where $\rho$ is as defined in the statement of the theorem. The inequality in the second line follows from a Taylor series expansion, taking into account that $\nabla^2 z(\cdot) \le 1/4$ (with $z(\cdot)$ defined in (9)), which holds since $\frac{d}{df}\frac{e^f}{1+e^f} = \frac{e^f}{(1+e^f)^2} \le \frac{1}{4}$. The inequality in the third line follows from the fact that $\boldsymbol{y}^n \in E_n$.

First, observe that, given all the hierarchical parameters, the Bayesian network lasso prior distribution on $\boldsymbol{\gamma}$ can be written as $\boldsymbol{\gamma} = \boldsymbol{W} + \boldsymbol{\gamma}_2$, where $\boldsymbol{\gamma}_2$ follows the ordinary Bayesian lasso shrinkage prior. With this observation, one can see that
$$\Pi\Big(\|\boldsymbol{\gamma} - \boldsymbol{\gamma}^{(0)}\|_1 < \frac{\eta}{n^{\rho/2}}\Big) \ge \Pi\Big(\|\boldsymbol{\gamma}_2 - \boldsymbol{\gamma}_2^{(0)}\|_1 < \frac{\eta}{2n^{\rho/2}}\Big)\,\Pi\Big(\|\boldsymbol{W} - \boldsymbol{W}^{(0)}\|_1 < \frac{\eta}{2n^{\rho/2}}\Big),$$
where $\boldsymbol{W}$ and $\boldsymbol{W}^{(0)}$ are as defined in Lemma 8.2. We will show sequentially that

(i) $-\log \Pi\big(\|\boldsymbol{W} - \boldsymbol{W}^{(0)}\|_1 < \frac{\eta}{2n^{\rho/2}}\big) = o(n)$, and
(ii) $-\log \Pi\big(\|\boldsymbol{\gamma}_2 - \boldsymbol{\gamma}_2^{(0)}\|_1 < \frac{\eta}{2n^{\rho/2}}\big) = o(n)$.

(i) Note that, with $R$ (the dimension of the latent variables) and $\Delta$ (the probability of a node being influential) as defined before, we obtain
$$\Pi\Big(\|\boldsymbol{W} - \boldsymbol{W}^{(0)}\|_1 < \frac{\eta}{2n^{\rho/2}}\Big) \ge \Pi\big(\|\boldsymbol{u}_k - \boldsymbol{u}_k^{(0)}\| \le \upsilon_n,\ \forall k = 1, .., V_n\big) \ge E\Big[\Pi\big(\|\boldsymbol{u}_k - \boldsymbol{u}_k^{(0)}\| \le \upsilon_n,\ \forall k \mid \Delta\big)\Big]$$
$$\ge E\Bigg[\prod_{k=1}^{V_n}\Big\{\exp\Big(-\frac{\boldsymbol{u}_k^{(0)'}\boldsymbol{u}_k^{(0)}}{2}\Big)\,\Pi\big(\|\boldsymbol{u}_k\| \le \upsilon_n \mid \Delta\big)\Big\}\Bigg], \tag{17}$$
where the first inequality follows from Lemma 8.2 by replacing $\eta$ with $\eta/(2n^{\rho/2})$, with a slight abuse of notation, and $\upsilon_n$ is defined accordingly. The last inequality follows from Anderson's lemma. We now use the fact that $\int_{-a}^{a}\exp(-x^2/2)\,dx \ge 2a\exp(-a^2/2)$ to conclude
$$\Pi\big(\|\boldsymbol{u}_k\| \le \upsilon_n \mid \Delta\big) \ge \prod_{r=1}^{R}\Pi\Big(|u_{k,r}| \le \frac{\upsilon_n}{\sqrt{R}}\,\Big|\,\Delta\Big) = \prod_{r=1}^{R}\Bigg((1-\Delta) + \frac{\Delta}{\sqrt{2\pi}}\int_{-\upsilon_n/\sqrt{R}}^{\upsilon_n/\sqrt{R}} e^{-x^2/2}\,dx\Bigg)$$
$$\ge \prod_{r=1}^{R}\Bigg((1-\Delta) + \frac{\Delta}{\sqrt{2\pi}}\exp\Big(-\frac{\upsilon_n^2}{2R}\Big)\frac{2\upsilon_n}{\sqrt{R}}\Bigg) = \Bigg[(1-\Delta) + \frac{2\Delta}{\sqrt{2\pi R}}\,\upsilon_n\exp\Big(-\frac{\upsilon_n^2}{2R}\Big)\Bigg]^{R}.$$
Therefore,
$$E\prod_{k=1}^{V_n}\Pi\big(\|\boldsymbol{u}_k\| \le \upsilon_n \mid \Delta\big) \ge E\Bigg[\sum_{h=1}^{RV_n}\binom{RV_n}{h}(1-\Delta)^h\Delta^{RV_n-h}\Big(\frac{2\upsilon_n}{\sqrt{2\pi R}}\Big)^{RV_n-h}\exp\Big(-\frac{(RV_n-h)\upsilon_n^2}{2R}\Big)\Bigg]$$
$$\ge \sum_{h=1}^{RV_n}\binom{RV_n}{h}\,\mathrm{Beta}(RV_n-h+1,\ h+1)\Big(\frac{2\upsilon_n}{\sqrt{2\pi R}}\Big)^{RV_n-h}\exp\Big(-\frac{(RV_n-h)\upsilon_n^2}{2R}\Big)$$
$$= \sum_{h=1}^{RV_n}\frac{(RV_n)!}{h!(RV_n-h)!}\,\frac{h!(RV_n-h)!}{(RV_n+1)!}\Big(\frac{2\upsilon_n}{\sqrt{2\pi R}}\Big)^{RV_n-h}\exp\Big(-\frac{(RV_n-h)\upsilon_n^2}{2R}\Big)$$
$$\ge \frac{RV_n}{RV_n+1}\Big(\frac{2\upsilon_n}{\sqrt{2\pi R}}\Big)^{RV_n}\exp\Big(-\frac{V_n\upsilon_n^2}{2}\Big),$$
where the last inequality follows from Lemma 8.2 by considering the fact that
$$\upsilon_n = \min_{k,l}\frac{-\big[\|\boldsymbol{u}_k^{(0)}\| + \|\boldsymbol{u}_l^{(0)}\|\big] + \sqrt{\big[\|\boldsymbol{u}_k^{(0)}\| + \|\boldsymbol{u}_l^{(0)}\|\big]^2 + 2\eta/n^{\rho/2}}}{2} \le \frac{\sqrt{\eta}}{n^{\rho/4}},$$
so that $0 < 2\upsilon_n/\sqrt{2\pi R} < 1$ for all large $n$. It now follows from (17) that
$$-\log \Pi\Big(\|\boldsymbol{W} - \boldsymbol{W}^{(0)}\|_1 < \frac{\eta}{2n^{\rho/2}}\Big) \le \sum_{k=1}^{V_n}\frac{\boldsymbol{u}_k^{(0)'}\boldsymbol{u}_k^{(0)}}{2} + \frac{V_n\eta}{4n^{\rho/2}} - RV_n\log\Big(\frac{2\sqrt{\eta}}{\sqrt{2\pi R}\,n^{\rho/4}}\Big) + \log(RV_n+1) - \log(RV_n) = o(n),$$
by assumptions (A) and (B). This proves (i).
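The elementary integral bound invoked in the proof of (i) can be checked in one line; this verification is our own addition, not part of the original argument:

```latex
% For |x| <= a we have e^{-x^2/2} >= e^{-a^2/2}, hence
\begin{equation*}
\int_{-a}^{a} e^{-x^2/2}\,dx \;\ge\; \int_{-a}^{a} e^{-a^2/2}\,dx \;=\; 2a\,e^{-a^2/2},
\qquad a > 0 .
\end{equation*}
```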
We will now prove (ii). Let $S = \{j : \gamma_{2,j}^{(0)} \neq 0\}$, with $s_{0,n} = |S|$, and define $\boldsymbol{s}$ as the vector formed by the upper triangular part of the matrix with $(k,l)$th entry $s_{k,l}$. It follows that
$$\Pi\Big(\|\boldsymbol{\gamma}_2 - \boldsymbol{\gamma}_2^{(0)}\|_1 < \frac{\eta}{2n^{\rho/2}}\Big) \ge \Pi\Big(|\gamma_{2,j} - \gamma_{2,j}^{(0)}| < \frac{\eta}{4\sqrt{q_n}\,n^{\rho/2}},\ j \in S\Big)\;\Pi\Bigg(\sum_{j\notin S}|\gamma_{2,j}| < \frac{(q_n - s_{0,n})\,\eta}{4q_n n^{\rho/2}}\Bigg). \tag{18}$$
We lower bound the two components of the product in (18) individually. By Chebyshev's inequality,
$$\Pi\Bigg(\sum_{j\notin S}|\gamma_{2,j}| < \frac{(q_n - s_{0,n})\,\eta}{4q_n n^{\rho/2}}\Bigg) \ge 1 - \frac{E\big[\sum_{j\notin S}|\gamma_{2,j}|\big]\,4q_n n^{\rho/2}}{(q_n - s_{0,n})\,\eta} = 1 - \frac{4q_n n^{\rho/2}}{\theta_n\,\eta}, \tag{19}$$
using the fact that marginally $\gamma_{2,j}$ follows a double exponential distribution with $E|\gamma_{2,j}| = 1/\theta_n$. Next,
$$\Pi\Big(|\gamma_{2,j} - \gamma_{2,j}^{(0)}| < \frac{\eta}{4\sqrt{q_n}\,n^{\rho/2}},\ j \in S\Big) = E\Bigg[\prod_{j\in S}\Pi\Big(|\gamma_{2,j} - \gamma_{2,j}^{(0)}| < \frac{\eta}{4\sqrt{q_n}\,n^{\rho/2}}\,\Big|\,\boldsymbol{s}_S\Big)\Bigg].$$
Using the fact that $\int_a^b e^{-x^2/2}\,dx \ge e^{-(a^2+b^2)/2}(b-a)$, one obtains
$$\prod_{j\in S}\Pi\Big(|\gamma_{2,j} - \gamma_{2,j}^{(0)}| < \frac{\eta}{4\sqrt{q_n}\,n^{\rho/2}}\,\Big|\,\boldsymbol{s}_S\Big) \ge \prod_{j\in S}\frac{2\eta}{4\sqrt{q_n}\,n^{\rho/2}\sqrt{2\pi s_j}}\exp\Bigg(-\frac{|\gamma_{2,j}^{(0)}|^2 + \eta^2/(16q_n n^{\rho})}{s_j}\Bigg).$$
Thus, writing $\alpha_j^2 = |\gamma_{2,j}^{(0)}|^2 + \eta^2/(16q_n n^{\rho})$ and integrating over the prior $s_j \sim \mathrm{Exp}(\theta_n^2/2)$,
$$\Pi\Big(|\gamma_{2,j} - \gamma_{2,j}^{(0)}| < \frac{\eta}{4\sqrt{q_n}\,n^{\rho/2}},\ j \in S\Big) \ge \prod_{j\in S}\frac{\eta\,\theta_n^2}{4\sqrt{q_n}\,n^{\rho/2}\sqrt{2\pi}}\int_0^\infty \frac{1}{\sqrt{s_j}}\exp\Big(-\frac{\alpha_j^2}{s_j} - \frac{\theta_n^2 s_j}{2}\Big)\,ds_j.$$
Use the change of variable $s_j = z_j^2$ and the normalizing constant of the inverse Gaussian density to deduce
$$\int_0^\infty \frac{1}{\sqrt{s_j}}\exp\Big(-\frac{\alpha_j^2}{s_j} - \frac{\theta_n^2 s_j}{2}\Big)\,ds_j = \sqrt{\frac{2\pi}{\theta_n^2}}\,\exp\big(-\sqrt{2}\,\theta_n\alpha_j\big).$$
Therefore,
$$\Pi\Big(|\gamma_{2,j} - \gamma_{2,j}^{(0)}| < \frac{\eta}{4\sqrt{q_n}\,n^{\rho/2}},\ j \in S\Big) \ge \Big(\frac{\eta\,\theta_n}{4\sqrt{q_n}\,n^{\rho/2}}\Big)^{s_{0,n}}\exp\Big(-\sqrt{2}\,\theta_n\sum_{j\in S}\alpha_j\Big). \tag{20}$$
Combining the results from (19) and (20),
$$\Pi\Big(\|\boldsymbol{\gamma}_2 - \boldsymbol{\gamma}_2^{(0)}\|_1 < \frac{\eta}{2n^{\rho/2}}\Big) \ge \Big(\frac{\eta\,\theta_n}{4\sqrt{q_n}\,n^{\rho/2}}\Big)^{s_{0,n}}\exp\Big(-\sqrt{2}\,\theta_n\sum_{j\in S}\alpha_j\Big)\Bigg(1 - \frac{4q_n n^{\rho/2}}{\theta_n\,\eta}\Bigg).$$
Referring to Assumption (F) for the rate of $\theta_n$,
$$-\log \Pi\Big(\|\boldsymbol{\gamma}_2 - \boldsymbol{\gamma}_2^{(0)}\|_1 < \frac{\eta}{2n^{\rho/2}}\Big) \le s_{0,n}\log\Big(\frac{4\sqrt{q_n}\,n^{\rho/2}}{\eta\,\theta_n}\Big) + \sqrt{2}\,\theta_n\sum_{j\in S}\sqrt{|\gamma_{2,j}^{(0)}|^2 + \frac{\eta^2}{16q_n n^{\rho}}} - \log\Bigg(1 - \frac{4q_n n^{\rho/2}}{\theta_n\,\eta}\Bigg) = o(n), \tag{21}$$
under assumptions (B)-(F). This proves (ii).

Finally,
$$-\log(D_n) \le \frac{\sqrt{nq_n}\,\eta}{n^{\rho/2}} + \frac{n\eta^2}{8n^{\rho}} - \log \Pi\Big(\|\boldsymbol{\gamma} - \boldsymbol{\gamma}^{(0)}\|_1 < \frac{\eta}{n^{\rho/2}}\Big) = \eta\sqrt{q_n}\,n^{(1-\rho)/2} + \frac{\eta^2}{8}\,n^{1-\rho} - \log \Pi\Big(\|\boldsymbol{\gamma} - \boldsymbol{\gamma}^{(0)}\|_1 < \frac{\eta}{n^{\rho/2}}\Big).$$
Using (21), the result in (i), the fact that $(1-\rho)/2 \in (-1/4, 0)$, and assumption (B), we obtain $-\log(D_n) = o(n)$. Consequently, $\log\{\exp(nh/2)D_n\} = nh/2 + \log D_n \to \infty$, which proves (c) and, in view of (14)-(16), completes the proof of Theorem 3.1.