Conflict diagnostics for evidence synthesis in a multiple testing framework

Abstract

Evidence synthesis models that combine multiple datasets of varying design, to estimate quantities that cannot be directly observed, require the formulation of complex probabilistic models that can be expressed as graphical models. An assessment of whether the different datasets synthesised contribute information that is consistent with each other, and in a Bayesian context, with the prior distribution, is a crucial component of the model criticism process. However, a systematic assessment of conflict suffers from the multiple testing problem, through testing for conflict at multiple locations in a model. We demonstrate the systematic use of conflict diagnostics, while accounting for the multiple hypothesis tests of no conflict at each location in the graphical model. The method is illustrated by a network meta-analysis to estimate treatment effects in smoking cessation programs and an evidence synthesis to estimate HIV prevalence in Poland.



Anne M. Presanis, David Ohlssen, Kai Cui, Magdalena Rosinska, Daniela De Angelis

October 16, 2018

Medical Research Council Biostatistics Unit, University of Cambridge, U.K.
Novartis Pharmaceuticals Corporation, East Hanover, NJ, U.S.A.
Department of Epidemiology, National Institute of Public Health, National Institute of Hygiene, Warsaw, Poland
e-mail: [email protected]


KEYWORDS: Conflict; evidence synthesis; graphical models; model criticism; multiple testing; network meta-analysis.

Evidence synthesis refers to the use of complex statistical models that combine multiple, disparate and imperfect sources of evidence to estimate quantities on which direct information is unavailable or inadequate (e.g. Ades and Sutton, 2006; Welton et al., 2012; De Angelis et al., 2014). Such evidence synthesis models are typically graphical models represented by a directed acyclic graph (DAG) $G(\boldsymbol{V}, \boldsymbol{E})$, where $\boldsymbol{V}$ and $\boldsymbol{E}$ are sets of nodes and edges respectively, encoding conditional independence assumptions (Lauritzen, 1996). With increased computational power, models of the form of $G(\boldsymbol{V}, \boldsymbol{E})$ have proliferated, requiring also the development of model criticism tools adapted to the challenges of evidence synthesis. In a Bayesian framework, any of the prior distribution, the assumed form of the likelihood and structural and functional assumptions may conflict with the observed data or with each other. To assess the consistency of each of these components, various mixed- or posterior-predictive checks have been proposed. In particular, the "conflict p-value" (Marshall and Spiegelhalter, 2007; Gåsemyr and Natvig, 2009; Presanis et al., 2013; Gåsemyr, 2016) is a diagnostic calculated by splitting $G(\boldsymbol{V}, \boldsymbol{E})$ into two independent sub-graphs ("partitions") at a particular "separator" node $\phi$, to measure the consistency of the information provided by each partition about the node (a "node-split"). Gåsemyr and Natvig (2009) and Presanis et al. (2013) demonstrate how the conflict p-value may be evaluated in different contexts, including both one- and two-sided hypothesis tests, and Gåsemyr (2016) demonstrates the uniformity of the conflict p-value in a wide range of models.

The conflict p-value may be used in a targeted manner, searching for conflict at particular nodes in a DAG. However, in complex evidence syntheses, often the location of potential conflict may be unclear.
A systematic assessment of conflict throughout a DAG is then required to locate problem areas (e.g. Krahn et al., 2013). Such systematic assessment, however, suffers from the multiple testing problem, either through testing for conflict at each node in $G(\boldsymbol{V}, \boldsymbol{E})$ or through the separation of $G(\boldsymbol{V}, \boldsymbol{E})$ into more than two partitions to simultaneously test for conflict between each pair-wise partition. Here we account for these multiple tests by adopting the general hypothesis testing framework of Hothorn et al. (2008) and Bretz et al. (2011), allowing for simultaneous multiple hypotheses in a parametric setting. They propose different possible tests to account for multiplicity: we concentrate here on maximum-T type tests.

In Section 2, we define evidence synthesis before introducing the particular models that motivate our work on systematic conflict assessment: a network meta-analysis and a model for estimating HIV prevalence. Section 3 describes the methods we use to test for conflict and account for the multiple tests we perform. We apply these methods to our examples in Section 4 and end with a discussion in Section 5.

Formally, our goal is to estimate $K$ basic parameters $\boldsymbol{\theta} = (\theta_1, \ldots, \theta_K)$ given a collection of $N$ independent data sources $\boldsymbol{y} = (\boldsymbol{y}_1, \ldots, \boldsymbol{y}_N)$, where each $\boldsymbol{y}_i$, $i \in 1, \ldots, N$, may be a vector or array of data points. Each $\boldsymbol{y}_i$ provides information on a functional parameter $\psi_i$ (or potentially a vector of functions $\boldsymbol{\psi}_i$). When $\psi_i = \theta_k$ is the identity function, the data $\boldsymbol{y}_i$ are said to directly inform $\theta_k$. Otherwise, $\psi_i = \psi_i(\boldsymbol{\theta})$ is a function of multiple parameters in $\boldsymbol{\theta}$: the $\boldsymbol{y}_i$ therefore provide indirect information on these parameters. Given the conditional independence of the datasets $\boldsymbol{y}_i$, the likelihood is $L(\boldsymbol{\theta}; \boldsymbol{y}) = \prod_{i=1}^{N} L_i(\psi_i(\boldsymbol{\theta}); \boldsymbol{y}_i)$, where $L_i(\psi_i(\boldsymbol{\theta}); \boldsymbol{y}_i)$ is the likelihood contribution of $\boldsymbol{y}_i$ given the basic parameters $\boldsymbol{\theta}$.
In a Bayesian context, for a prior distribution $p(\boldsymbol{\theta})$, the posterior distribution $p(\boldsymbol{\theta} \mid \boldsymbol{y}) \propto p(\boldsymbol{\theta}) L(\boldsymbol{\theta}; \boldsymbol{y})$ summarises all information, direct and indirect, on $\boldsymbol{\theta}$. Let $\boldsymbol{\psi} = (\psi_1, \ldots, \psi_N)$ be the set of functional parameters informed by data and $\boldsymbol{\phi} = \{\boldsymbol{\theta}, \boldsymbol{\psi}\}$ be the set of all unknown quantities, whether basic or functional. In this setup, the DAG $G(\boldsymbol{V}, \boldsymbol{E})$ representing the evidence synthesis model has a set of nodes $\boldsymbol{V} = \{\boldsymbol{\phi}, \boldsymbol{y}\}$ representing either known or unknown quantities; and the directed edges $\boldsymbol{E}$ represent dependencies between nodes. Each 'child' node is independent of its 'siblings' conditional on their direct 'parents'. The joint distribution of all nodes $\boldsymbol{V}$ is the product of the conditional distributions of each node given its direct parents. An example DAG of an evidence synthesis model is given in Figure 1(i). Circles denote unknown quantities: either basic parameters $\boldsymbol{\theta}$ that are 'founder' nodes at the top of a DAG having a prior distribution (double circles); or functional parameters $\boldsymbol{\psi}$. Squares denote observed quantities, solid arrows represent stochastic distributional relationships, and dashed arrows represent deterministic functional relationships. This DAG could be extended to more complex hierarchical priors and models, where repetition over variables is represented by 'plates', rounded rectangles around the repeated nodes, labelled by the range of repetition. In general, the set $\boldsymbol{V}$ may be larger than the set of basic and functional parameters, including also other intermediate nodes in the DAG, for example unit-level parameters in a hierarchical model. For brevity, from here on we will abbreviate any DAG to the notation $G(\boldsymbol{\phi}, \boldsymbol{y})$.

Network meta-analysis (NMA) is a specific type of evidence synthesis (Salanti, 2012) that generalises meta-analysis from the synthesis of studies measuring a treatment effect (e.g. of treatment B versus treatment A in a randomised clinical trial), to the synthesis of data on more than two treatment arms.
The studies included in the NMA may not all measure the same treatment effects, but each study provides data on at least two of the treatments. For example, considering a set of treatments $\{A, B, C, D\}$, the network of trials may consist of studies of different "designs", i.e. with different subsets of the treatments included in each trial (Jackson et al., 2014), such as $\{ABC, ABD, BD, CD\}$. As with meta-analysis, NMA models can be implemented in either a two-stage or single-stage approach, as described more comprehensively elsewhere (Salanti, 2012; Jackson et al., 2014). Here we concentrate on a single-stage approach, where the original data $Y^{J}_{di}$ for each treatment $J$ of study $i$ of design $d$ are available. A full likelihood model specifies

$$Y^{J}_{di} \sim f(p^{J}_{di} \mid w^{J}_{di})$$

for some distribution $f(\cdot)$ and treatment outcome $p^{J}_{di}$ with associated information $w^{J}_{di}$. For example, if the data are numbers of events out of total numbers at risk of the event, then $w^{J}_{di}$ might be the denominator for treatment $J$. We might assume the data are realisations of a Binomial random variable, $Y^{J}_{di} \sim \mathrm{Bin}(w^{J}_{di}, p^{J}_{di})$, where the proportion $p^{J}_{di}$ is a function of a study-specific baseline $\alpha_{di}$ representing a design/study-specific baseline treatment $B_d$ and a study-specific treatment contrast (log odds ratio) $\mu^{B_d J}_{di}$, through a logistic model, $\mathrm{logit}(p^{J}_{di}) = \alpha_{di} + \mu^{B_d J}_{di}$. The intercept is $\alpha_{di} = \mathrm{logit}(p^{B_d}_{di})$. Completing the model specification requires a parameterisation of the treatment effects $\mu^{AJ}_{di}$. A common-effect model, for a network-wide reference treatment $A$, is given by

$$\mu^{AJ}_{di} = \eta^{AJ} \tag{1}$$

for each $J \neq A$, i.e. it assumes that all studies of all designs measure the same treatment effects. The $\eta^{AJ}$ are basic parameters, one fewer in number than the treatments in the network, representing the relative effectiveness of treatment $J$ compared to the network baseline treatment $A$. All other contrasts $\eta^{JK}$, $J, K \neq A$, are functional parameters, defined by assuming a set of consistency equations $\eta^{JK} = \eta^{AK} - \eta^{AJ}$ for each $J, K \neq A$. These equations define a transitivity property of the treatment effects. The extension to a random-effects model, still under the consistency assumption, implies

$$\mu^{AJ}_{di} = \eta^{AJ} + \beta^{AJ}_{di} \tag{2}$$

where usually the random effects $\beta^{AJ}_{di}$, reflecting between-study heterogeneity, are assumed normally distributed around 0, with a covariance structure defined as a square matrix $\Sigma_\beta$ such that all entries on the leading diagonal are $\sigma^2_\beta$ and all remaining entries are $\sigma^2_\beta / 2$. The set of basic parameters is denoted $\boldsymbol{\eta}_b = (\eta^{AJ})_{J \neq A}$ and the corresponding set of functional parameters is denoted $\boldsymbol{\eta}_f = (\eta^{JK} = \eta^{AK} - \eta^{AJ})_{J, K \neq A}$. Note that the common-effect model is a special case of the random-effects model. In the Bayesian paradigm, we specify prior distributions for the basic parameters $\boldsymbol{\eta}_b$, the (nuisance) study-specific baselines $\alpha_{di}$, and in the case of the random treatment effects model, the common standard deviation parameter $\sigma_\beta$ in terms of which the variance-covariance matrix $\Sigma_\beta$ is defined.
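The consistency equations and the random-effects covariance structure can be made concrete with a minimal Python sketch. The function names and the numerical values are hypothetical and purely illustrative, and the off-diagonal value of half the variance is assumed here to be the usual homogeneous-variance convention for $\Sigma_\beta$.

```python
import numpy as np


def functional_contrasts(eta_basic):
    """Derive each functional contrast eta^{JK} = eta^{AK} - eta^{AJ}.

    eta_basic maps a non-reference treatment label J to the basic
    parameter eta^{AJ} (log odds ratio versus reference treatment A).
    """
    labels = sorted(eta_basic)
    return {
        (j, k): eta_basic[k] - eta_basic[j]
        for idx, j in enumerate(labels)
        for k in labels[idx + 1:]
    }


def random_effects_covariance(n_contrasts, sigma_beta):
    """Build Sigma_beta: sigma^2 on the diagonal, sigma^2 / 2 elsewhere (assumed)."""
    var = sigma_beta ** 2
    half = np.full((n_contrasts, n_contrasts), var / 2.0)
    return half + np.eye(n_contrasts) * var / 2.0


# Hypothetical basic parameters for treatments B, C, D versus reference A.
eta_b = {"B": 0.4, "C": 0.7, "D": 1.1}
eta_f = functional_contrasts(eta_b)
Sigma_beta = random_effects_covariance(len(eta_b), sigma_beta=0.8)
```

With three basic parameters the sketch yields the three functional contrasts BC, BD and CD, which is the transitivity property described above.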
Note that any change in parameterisation of the model, for example changing treatment labels, will affect the joint prior distribution, making invariance challenging or even impossible in a Bayesian setting.

A smoking cessation example

Dias et al. (2010), amongst many others (Lu and Ades, 2006; Higgins et al., 2012; Jackson et al., 2015), considered an NMA of studies of smoking cessation. The network consists of 24 studies of 8 different designs, including 2 three-arm trials. Four smoking cessation counselling programs are compared (Figure 2): A no intervention; B self-help; C individual counselling; D group counselling. The data (Supplementary Material Table A.1) are the number of individuals out of those participating who have successfully ceased to smoke at 6-12 months after enrollment. Here we fit the common- and random-effect models under a consistency assumption and diffuse priors: Normal(0, ) on the log-odds scale for $\boldsymbol{\eta}_b$ and $\alpha_{di}$; and Uniform(0, 5) for $\sigma_\beta$. We find (Supplementary Material Table A.2) that the deviance information criterion (DIC, Spiegelhalter et al. (2002)) prefers the random-effect model, suggesting it is necessary to explain the heterogeneity in the network. The estimates of the treatment effects from the random-effect model are both somewhat different and more uncertain than those from the common-effect model, agreeing with estimates found by others, including Dias et al. (2010). Moreover, the posterior expected deviance for the random-effect model, $E_{\boldsymbol{\theta} \mid \boldsymbol{y}}(D) = 54$, is slightly larger than the number of observations (50), suggesting still some lack of fit to the data.

A single node-split model

This residual lack of fit and the general potential in NMA for variability between groups of direct and indirect information from multiple studies that is excess to between-study heterogeneity ("inconsistency", Lu and Ades (2006)) have motivated various approaches to the detection and resolution of inconsistency (Lumley, 2002; Lu and Ades, 2006; Dias et al., 2010; Higgins et al., 2012; White et al., 2012; Jackson et al., 2014). Dias et al. (2010) apply the idea of node-splitting, based on Marshall and Spiegelhalter (2007), to the NMA context, splitting a single mean treatment effect $\eta^{JK}$ in the random-effects consistency model (2). A DAG is partitioned into direct evidence from studies directly comparing $J$ and $K$ versus indirect evidence from all remaining studies. Specifically, for any study $i$ of design $d$ that directly compares $J$ and $K$, the study-specific treatment effect is expressed in terms of the direct treatment effect: $\mu^{JK}_{di} = \eta^{JK}_{\mathrm{Dir}} + \beta^{JK}_{di}$; whereas the indirect version of the treatment effect is estimated from the remaining studies via the consistency equation: $\eta^{JK}_{\mathrm{Ind}} = \eta^{AK} - \eta^{AJ}$. The posterior distribution of the contrast or inconsistency parameter $\delta^{JK} = \eta^{JK}_{\mathrm{Dir}} - \eta^{JK}_{\mathrm{Ind}}$ is then examined to check posterior support for the null hypothesis $\delta^{JK} = 0$.

Multiple node-splits

Although the single node-split approach in Dias et al. (2010) has been extended to automate the generation of different single node-splitting models for conflict assessment (van Valkenhoef et al., 2016), the simultaneous splitting of multiple nodes in an NMA has not yet been considered. In Section 4.1, we use multiple splits to investigate conflict in the smoking cessation network beyond heterogeneity, accounting for the multiplicity.

As further illustration of systematic conflict detection, we consider an evidence synthesis approach to estimating HIV prevalence in Poland, among the exposure group of men who have sex with men (MSM) (Rosinska et al., 2016). The data aggregated to the national level are given in Supplementary Material Table A.3. There are three basic parameters to be estimated: the proportion of the male population who are MSM, $\rho$; the prevalence of HIV infection in the MSM group, $\pi$; and the proportion of those infected who are diagnosed, $\kappa$ (Figure 3(a)).

Likelihood

The total population of Poland, N = 15 , , $y_1, \ldots, y_5$ directly inform, respectively: $\rho$; prevalence of diagnosed infection $\pi\kappa$; prevalence of undiagnosed infection $\pi(1 - \kappa)$; and lower ($D_L$) and upper ($D_U$) bounds for the number of diagnosed infections $D = N\rho\pi\kappa$ (Figure 3(a), Supplementary Material Table A.3). These data are modelled independently as either Binomial ($y_1, y_2, y_3$) or Poisson ($y_4, y_5$).

Priors

The number diagnosed $D$ is constrained a priori to lie between the stochastic bounds $D_L$ and $D_U$, which in turn are given vague log-normal priors. Since $D$ is already defined as a function of the basic parameters, the constraint is implemented via introduction of an auxiliary Bernoulli datum of observed value 1, with probability parameter given by a functional parameter $c = \Pr(D_L \leq D \leq D_U)$ (Figure 3(a)). The basic parameters $\rho$, $\pi$ and $\kappa$ are given independent uniform prior distributions on $[0, 1]$.

Exploratory model criticism

This initial analysis reveals a lack of fit to some of the data (Supplementary Material Table A.3), with particularly high posterior mean deviances for the data informing $\rho$ and $\pi\kappa$. This lack of fit in turn may suggest the existence of conflict in the DAG (Spiegelhalter et al., 2002). In Rosinska et al. (2016), conflict between evidence sources was not directly considered or formally measured; instead, the lack of fit was resolved by modelling potential biases in the data in a series of sensitivity analyses. By contrast, in Section 4.2 we systematically assess the consistency of evidence coming from the prior model and from each likelihood contribution, by splitting the DAG at each functional parameter (Figure 3(b)).

Briefly, as in Presanis et al. (2013), consider partitioning a DAG $G(\boldsymbol{\phi}, \boldsymbol{y})$ into two independent partitions, at a separator node $\phi$. The separator could either be a founder node, i.e. a basic parameter, or a node internal to the DAG, and is split into two copies $\phi_a$ and $\phi_b$, one in each partition (Figure 1(ii,iii)). Suppose that partition $G(\boldsymbol{\phi}_a, \boldsymbol{y}_a)$ contains the data vector $\boldsymbol{y}_a$ and provides inference resulting in a posterior distribution $p(\phi_a \mid \boldsymbol{y}_a)$, and that similarly partition $G(\boldsymbol{\phi}_b, \boldsymbol{y}_b)$ results in $p(\phi_b \mid \boldsymbol{y}_b)$. The aim is to assess the null hypothesis that $\phi_a = \phi_b$. For $\phi$ taking discrete values, we can directly evaluate $p(\phi_a = \phi_b \mid \boldsymbol{y}_a, \boldsymbol{y}_b)$. If the support of $\phi$ is continuous, we consider the posterior probability of $\delta = h(\phi_a) - h(\phi_b)$, where $h(\cdot)$ is a function that transforms $\phi$ to a scale for which a uniform prior is appropriate. The two-sided "conflict p-value" is defined as

$$c = 2 \times \min\left\{ \Pr\{ p_\delta(\delta \mid \boldsymbol{y}_a, \boldsymbol{y}_b) < p_\delta(0 \mid \boldsymbol{y}_a, \boldsymbol{y}_b) \},\; 1 - \Pr\{ p_\delta(\delta \mid \boldsymbol{y}_a, \boldsymbol{y}_b) < p_\delta(0 \mid \boldsymbol{y}_a, \boldsymbol{y}_b) \} \right\},$$

where $p_\delta$ is the posterior density of the difference $\delta$, so that the smaller $c$ is, the greater the conflict.
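As a rough illustration of how a two-sided conflict p-value might be computed from posterior simulation output, the Python sketch below uses a tail-area form, $2 \times \min\{\Pr(\delta < 0), 1 - \Pr(\delta < 0)\}$. This is a simplification that assumes the posterior of $\delta$ is roughly symmetric and unimodal; it does not evaluate the posterior density $p_\delta$ directly. The function name, the Beta draws standing in for partition posteriors, and the sample sizes are all hypothetical.

```python
import numpy as np


def logit(p):
    """Map a proportion to the log-odds scale."""
    return np.log(p) - np.log1p(-p)


def two_sided_conflict_p(samples_a, samples_b, h=lambda x: x):
    """Tail-area conflict p-value from independent posterior draws.

    Assumption: the posterior of delta = h(phi_a) - h(phi_b) is roughly
    symmetric and unimodal, so a tail area around zero is a sensible summary.
    Because the two partitions are independent, pairing the draws
    element-wise gives draws from the posterior of delta.
    """
    delta = h(np.asarray(samples_a)) - h(np.asarray(samples_b))
    p_below = np.mean(delta < 0.0)
    return 2.0 * min(p_below, 1.0 - p_below)


rng = np.random.default_rng(1)
# Hypothetical posterior draws of a proportion from different partitions.
phi_a = rng.beta(30, 70, size=20_000)   # centred near 0.30
phi_b = rng.beta(32, 68, size=20_000)   # centred near 0.32: little conflict
phi_c = rng.beta(60, 40, size=20_000)   # centred near 0.60: strong conflict

c_agree = two_sided_conflict_p(phi_a, phi_b, h=logit)
c_clash = two_sided_conflict_p(phi_a, phi_c, h=logit)
```

In this sketch a small value signals conflict between the partitions, matching the interpretation of $c$ above; the logit transform plays the role of $h(\cdot)$ for a proportion.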
Generalising now to multiple tests of conflict, suppose that $G(\boldsymbol{\phi}, \boldsymbol{y})$ is partitioned into $Q$ independent sub-graphs, $G_1(\boldsymbol{\phi}_1, \boldsymbol{y}_1), \ldots, G_Q(\boldsymbol{\phi}_Q, \boldsymbol{y}_Q)$, where each disjoint subset of the data $\boldsymbol{y}_q$, $q \in 1, \ldots, Q$, is chosen to identify part of the basic parameter space $\boldsymbol{\theta}_q = (\theta_{q1}, \ldots, \theta_{q b_q})$, where $b_q$ is the number of basic parameters in partition $q$. Note that $\boldsymbol{\theta}_q \subset \boldsymbol{\phi}_q$ for each $q \in 1, \ldots, Q$, whereas the complementary subset $\boldsymbol{\phi}_q \setminus \boldsymbol{\theta}_q$ consists of functional and other non-basic parameters. To test the consistency of information provided by each partition about a set of $J$ separator nodes $(\phi^{(s)}_1, \ldots, \phi^{(s)}_J) \subseteq \boldsymbol{\phi}$ from the original model, a set of contrasts $\boldsymbol{\delta}_j = (\delta_{j1}, \ldots, \delta_{jC_j})$ is formed for each $j \in 1, \ldots, J$, one contrast per pair of partitions in which $\phi_j$ appears. A maximum of $\binom{Q}{2}$ contrasts are possible for each separator, i.e. $C_j \leq \binom{Q}{2}$. Each contrast $\delta_{jc}$ is defined as

$$\delta_{jc} = h_j(\phi_{jq_A} \mid \boldsymbol{y}_A) - h_j(\phi_{jq_B} \mid \boldsymbol{y}_B)$$

for the pair of partitions $c = \{q_A, q_B\}$ and node-split copies $\{\phi_{jq_A}, \phi_{jq_B}\}$. The functions $h_j(\cdot)$ transform the separator nodes $\{\phi_{jq_A}, \phi_{jq_B}\}$ to an appropriate scale for a uniform (Jeffreys) prior to be applicable, if either is a founder node in either partition.

Denote the separator nodes in each partition by $\boldsymbol{\phi}^{(s)}_q = \{\phi_{jq}, j \in 1, \ldots, m_q, q \in 1, \ldots, Q\}$, where $m_q \leq J$ is the number of separator nodes in partition $q$. Writing these nodes as a stacked vector $\boldsymbol{\phi}_S = (\boldsymbol{\phi}^{(s)}_1, \ldots, \boldsymbol{\phi}^{(s)}_Q) = (\phi_{11}, \ldots, \phi_{m_1 1}, \phi_{12}, \ldots, \phi_{m_2 2}, \ldots, \phi_{1Q}, \ldots, \phi_{m_Q Q})^T$, and the transformed version as $\boldsymbol{\phi}_H = h(\boldsymbol{\phi}_S)$, the total set of contrasts is $\boldsymbol{\Delta} = (\boldsymbol{\delta}_1, \ldots, \boldsymbol{\delta}_J)^T = C_\Delta^T \boldsymbol{\phi}_H$ for a contrast matrix $C_\Delta^T$. Note that not every separator node necessarily appears in every partition, so although $\boldsymbol{\phi}_H$ has maximum length $J \times Q$, in practice its length is $m = \sum_{q=1}^{Q} m_q \leq J \times Q$.
The contrast matrix $C_\Delta^T$ therefore has dimension $p \times m$, so that it maps from the space of the $m$ separator nodes (including node-split copies) to that of the $p = \sum_{j=1}^{J} C_j$ contrasts. A test for consistency of the information in each partition may be expressed as a test of the null hypothesis that

$$H_0: \boldsymbol{\Delta} = C_\Delta^T \boldsymbol{\phi}_H = \boldsymbol{0}. \tag{3}$$

Using standard asymptotic theory (Bernardo and Smith, 1994, see also derivation in Supplementary Material Appendix B), it can be shown that if the joint posterior distribution of all parameters $\boldsymbol{\phi}$ in all partitions is asymptotically multivariate normal (i.e. if the prior is flat enough relative to the likelihood), and if $\partial \boldsymbol{\Delta}(\boldsymbol{\phi}) / \partial \boldsymbol{\phi} = C_\Delta^T$ is non-singular with continuous entries, then the posterior mean of $\boldsymbol{\Delta}$ is $\bar{\boldsymbol{\Delta}} = C_\Delta^T \bar{\boldsymbol{\phi}}_H \overset{a}{\approx} C_\Delta^T \hat{\boldsymbol{\phi}}_H$ and the posterior variance-covariance matrix of $\boldsymbol{\Delta}$ is $S_\Delta \overset{a}{\approx} C_\Delta^T V_H C_\Delta$, where: $\hat{\boldsymbol{\phi}}_H$ is the maximum likelihood estimate of $\boldsymbol{\phi}_H$; the matrix $V_H = J_h(\hat{\boldsymbol{\phi}}_S)^T V_S J_h(\hat{\boldsymbol{\phi}}_S)$; $J_h(\hat{\boldsymbol{\phi}}_S)$ is the Jacobian of the transformation $h(\boldsymbol{\phi}_S)$; and $V_S$ is a block diagonal matrix consisting of the inverse observed information matrices for the separator nodes in each partition along the diagonal. The posterior summaries $\bar{\boldsymbol{\Delta}}$ and $S_\Delta$, i.e. the Bayes' estimator under a mean-squared error Bayes' risk function and corresponding variance-covariance matrix, may therefore be used under the general simultaneous inference framework of Hothorn et al. (2008) and Bretz et al. (2011) to construct a multiplicity-adjusted test that $\boldsymbol{\Delta} = \boldsymbol{0}$.

Given the estimator $\bar{\boldsymbol{\Delta}}$ and corresponding variance-covariance matrix $S_\Delta$, define a vector of test statistics $T_n = D_n^{-1/2}(\bar{\boldsymbol{\Delta}} - \boldsymbol{\Delta})$, where $n$ is the dimension of the data $\boldsymbol{y}$ and $D_n = \mathrm{diag}(S_\Delta)$. Then it can be shown (Hothorn et al., 2008; Bretz et al., 2011) that $T_n$ tends in distribution to a multivariate normal distribution, $T_n \overset{a}{\sim} N_m(\boldsymbol{0}, R)$, where $R := D_n^{-1/2} S_\Delta D_n^{-1/2} \in \mathbb{R}^{m,m}$ is the posterior correlation matrix for the vector (length $m$) of contrasts $\boldsymbol{\Delta}$.
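The quantities just defined can be assembled numerically as in the Python sketch below. It forms the contrasts, their covariance and correlation matrix, and a global Wald-type statistic, and it anticipates the max-T adjustment of Hothorn et al. (2008) that this section goes on to describe. The null distribution of the maximum absolute z-score is approximated here by Monte Carlo simulation, which is a swapped-in technique chosen for brevity rather than the numerical integration used by dedicated software. The function name, argument names and all numerical inputs are hypothetical.

```python
import numpy as np
from scipy import stats


def simultaneous_conflict_tests(phi_hat, V_H, C_t, n_sim=200_000, seed=0):
    """Global and max-T adjusted local tests of H0: C_t @ phi_H = 0.

    phi_hat : (m,) point estimates of the transformed separator nodes
    V_H     : (m, m) their covariance matrix
    C_t     : (p, m) contrast matrix, one row per contrast
    """
    delta_bar = C_t @ phi_hat                 # estimated contrasts
    S = C_t @ V_H @ C_t.T                     # covariance of the contrasts
    d = np.sqrt(np.diag(S))
    z = delta_bar / d                         # z-scores under H0
    R = S / np.outer(d, d)                    # correlation matrix

    # Global Wald-type test using the Moore-Penrose inverse of R.
    x2 = float(z @ np.linalg.pinv(R) @ z)
    p_chi2 = float(stats.chi2.sf(x2, df=np.linalg.matrix_rank(R)))

    # Simulate the null distribution of Z_max = max_k |Z_k| with Z ~ N(0, R).
    rng = np.random.default_rng(seed)
    null_draws = rng.multivariate_normal(np.zeros(len(z)), R, size=n_sim)
    z_max = np.abs(null_draws).max(axis=1)
    p_adjusted = np.array([np.mean(z_max >= abs(zk)) for zk in z])

    return {
        "z": z,
        "R": R,
        "p_unadjusted": 2.0 * stats.norm.sf(np.abs(z)),
        "p_adjusted": p_adjusted,
        "p_global_chi2": p_chi2,
        "p_global_maxT": float(p_adjusted.min()),
    }


# Hypothetical example: two separator nodes, each with a copy in two partitions.
phi_hat = np.array([0.10, 0.50, 0.15, 1.60])
V_H = np.diag([0.04, 0.09, 0.05, 0.07])       # block diagonal across partitions
C_t = np.array([[1.0, 0.0, -1.0, 0.0],
                [0.0, 1.0, 0.0, -1.0]])
result = simultaneous_conflict_tests(phi_hat, V_H, C_t)
```

In this toy input the second contrast is large relative to its standard deviation, so its adjusted p-value stays small, while the adjustment pushes the first contrast's p-value further towards 1.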
Under the null hypothesis (3), $T_n = D_n^{-1/2} \bar{\boldsymbol{\Delta}} \overset{a}{\sim} N_m(\boldsymbol{0}, R)$, and hence, assuming $S_\Delta$ is fixed and known, the authors show that a global $\chi^2$-test of conflict can be formulated:

$$X^2 = T_n^T R^{+} T_n \xrightarrow{d} \chi^2(\mathrm{Rank}(R))$$

where the superscript $+$ denotes the Moore-Penrose inverse of the corresponding matrix and $\mathrm{Rank}(R)$ is the degrees of freedom. Importantly, it is also possible to construct multiply-adjusted local (individual) conflict tests, based on the $m$ z-scores corresponding to $T_n$ and the null distribution of the maximum of these, $Z_{\max}$ (Hothorn et al., 2008; Bretz et al., 2011). This latter null distribution is obtained by integrating the limiting $m$-dimensional multivariate normal distribution over $[-z, z]$ to obtain the cumulative distribution function $P(Z_{\max} \leq z)$. The individual conflict p-values are then calculated as $P(|z_k| < Z_{\max})$, $k \in 1, \ldots, m$, with a corresponding global conflict p-value (an alternative to the $\chi^2$-test) given by $P(|z_{\max}| < Z_{\max})$.

Examples

We now illustrate the idea of systematic multiple node-splitting to assess conflict in our two motivating examples. All analyses were carried out in OpenBUGS 3.2.2 (Lunn et al., 2009) and R 3.2.3 (R Core Team, 2015). We use the R2OpenBUGS package (Sturtz et al., 2005) to run OpenBUGS from within R and the multcomp package (Bretz et al., 2011) to carry out the simultaneous local and global max-T tests.

Consider first an NMA in general, and for simplicity, assume there are no multi-arm trials and a common-effect model (equation (1)) for the data. The basic parameters $\boldsymbol{\eta}_b$ form a spanning tree of the network of evidence (Figure 2), i.e. a graph with no cycles, such that each node in the network can be reached from every other node, either directly or indirectly through other nodes (van Valkenhoef et al., 2012). Multiple possible partitionings of the evidence network exist, so a choice must be made (Figure 2). Suppose the spanning tree $\boldsymbol{\eta}_b$ is identifiable by a set of evidence $\boldsymbol{Y}_b$ containing outcomes from all trials designed to directly estimate the treatment effects in $\boldsymbol{\eta}_b$. Then every treatment effect is identifiable from $\boldsymbol{Y}_b$, by definition of a spanning tree and the fact that each treatment effect represented by edges outside the spanning tree is a functional parameter in the set $\boldsymbol{\eta}_f$, equal to a linear combination of the basic parameters. The data $\boldsymbol{Y}_b$ therefore indirectly inform the functional parameters $\boldsymbol{\eta}_f$, whereas the remaining data, $\boldsymbol{Y}_f = \boldsymbol{Y} \setminus \boldsymbol{Y}_b$, directly inform $\boldsymbol{\eta}_f$. A comparison between the direct and indirect evidence on $\boldsymbol{\eta}_f$ is therefore possible, to assess conflict between the two types of evidence. The network is split into two partitions, $\{\boldsymbol{\eta}_f^{\mathrm{Dir}}, \boldsymbol{Y}_f\}$ (the "direct evidence partition", DE) and $\{\boldsymbol{\eta}_f^{\mathrm{Ind}}, \boldsymbol{Y}_b\}$ (the "spanning tree partition", ST) and the direct and indirect versions of the functional parameters compared: $\boldsymbol{\Delta} = \boldsymbol{\eta}_f^{\mathrm{Dir}} - \boldsymbol{\eta}_f^{\mathrm{Ind}}$. Depending on the studies that are in the DE partition, the basic parameters $\boldsymbol{\eta}_b$ may also be weakly identifiable in the DE partition, due to prior information.
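To make the spanning-tree argument concrete, the Python sketch below takes a set of treatments and a chosen spanning tree, lists the edges outside the tree, and writes each as a signed sum of the tree's basic parameters by walking the unique path through the tree. It is only an illustration of the bookkeeping: the function name and output format are hypothetical, and it assumes the supplied edges really do form a spanning tree.

```python
from collections import deque
from itertools import combinations


def indirect_contrasts(treatments, tree_edges):
    """Express each edge outside the spanning tree via the basic parameters.

    Returns {(J, K): {(U, V): sign}}, meaning that eta^{JK} equals the sum of
    sign * eta^{UV} over the tree edges on the path from J to K.
    Assumes tree_edges form a spanning tree of the treatments.
    """
    oriented = {frozenset(e): tuple(e) for e in tree_edges}
    neighbours = {t: [] for t in treatments}
    for u, v in tree_edges:
        neighbours[u].append(v)
        neighbours[v].append(u)

    def tree_path(start, goal):
        # Breadth-first search; in a tree the path found is the only one.
        previous = {start: None}
        queue = deque([start])
        while queue:
            node = queue.popleft()
            if node == goal:
                break
            for nxt in neighbours[node]:
                if nxt not in previous:
                    previous[nxt] = node
                    queue.append(nxt)
        path = [goal]
        while previous[path[-1]] is not None:
            path.append(previous[path[-1]])
        return path[::-1]

    result = {}
    for j, k in combinations(treatments, 2):
        if frozenset((j, k)) in oriented:
            continue                      # edge is itself a basic parameter
        path = tree_path(j, k)
        coeffs = {}
        for u, v in zip(path, path[1:]):
            edge = oriented[frozenset((u, v))]
            coeffs[edge] = 1 if edge == (u, v) else -1
        result[(j, k)] = coeffs
    return result


treatments = ["A", "B", "C", "D"]
star_tree = indirect_contrasts(treatments, [("A", "B"), ("A", "C"), ("A", "D")])
alt_tree = indirect_contrasts(treatments, [("A", "B"), ("A", "C"), ("B", "D")])
```

For the tree (AB, AC, AD) the sketch recovers the consistency equation $\eta^{BC} = \eta^{AC} - \eta^{AB}$, and for an alternative tree such as (AB, AC, BD) the edges left outside the tree change accordingly.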
Since an NMA model may be formulated as a DAG, this Direct/Indirect partitioning is equivalent to a multi-node split in the DAG at the functional parameters (Supplementary Material Figure A.2).

Generalising now to more complex situations, if the direct data $\boldsymbol{Y}_f$ form a sub-network of evidence, the question arises of whether these data should be split into further partitions, by identifying a spanning tree for the sub-network. Then the vector $\boldsymbol{\Delta}$ of contrasts to test would involve comparisons between more than two partitions, e.g. for three partitions:

$$\boldsymbol{\Delta} = \left( \boldsymbol{\eta}_{f1} - \boldsymbol{\eta}_{f2},\; \boldsymbol{\eta}_{f1} - \boldsymbol{\eta}_{f3},\; \boldsymbol{\eta}_{f2} - \boldsymbol{\eta}_{f3} \right)^T$$

If we now consider a random-effects rather than a common-effect model (equation (2)), a decision must be made on how to handle the variance components in $\Sigma_\beta$. One approach would be to split the variance components simultaneously with the means, so that $\boldsymbol{\Delta}$ also includes contrasts for the variances. Alternatively, if the variance components are not well identified by the evidence in a partition, a common variance component could be assumed. Such commonality could potentially allow for feedback between partitions, since they would not be fully independent (Marshall and Spiegelhalter, 2007; Presanis et al., 2013).

Finally, for multi-arm trials, the key consideration is that multi-arm studies should have internal consistency, and hence their observations should not be split between partitions. A choice must therefore be made whether to initially include multi-arm data in the ST data $\boldsymbol{Y}_b$, in the DE data $\boldsymbol{Y}_f$, or in a third partition of their own. In the last case, any study-specific treatment effect $\mu^{JK}_{di}$, where $d$ is a multi-arm design, could be compared at least with the ST partition, where $\eta^{JK}$ is definitely identified. Potentially, it could also be compared simultaneously with the DE partition, if the edge $JK$ is identifiable in the DE partition. The comparison can be made even if $JK$ is not identifiable, or only weakly identifiable from the prior, but if the prior is diffuse, then no conflict will be detected due to the uncertainty. Such a comparison is therefore not particularly meaningful, unless we are interested in prior-data conflict.

Smoking cessation example

To illustrate concretely the above issues, we consider first the spanning tree $(AB, AC, AD)$ corresponding to the parameters $\boldsymbol{\eta}_b = \{\eta^{AB}, \eta^{AC}, \eta^{AD}\}$ for the smoking cessation example. Figures 2(b-d) demonstrate different ways of splitting the evidence based on this spanning tree, depending on how we treat the evidence from multi-arm trials. In Figures 2(b,c), we consider just two partitions, with the multi-arm evidence either left in the ST partition $\{\boldsymbol{\eta}_f^{\mathrm{Ind}}, \boldsymbol{Y}_b\}$ or included in the DE partition $\{\boldsymbol{\eta}_f^{\mathrm{Dir}}, \boldsymbol{Y}_f\}$, respectively. We compare the direct and indirect evidence on each of the edges or treatment comparisons $(BC, BD, CD)$. In Figure 2(d), we consider a series of spanning trees ($(AB, AC, AD)$, $(BC, BD)$ and $(CD)$), together with a final partition consisting of evidence from multi-arm trials, resulting in four partitions.

We also consider an alternative choice of spanning tree, $(AB, AC, BD)$, as in Figures 2(e,f). In these two models, we again make a choice between including the multi-arm evidence in either the ST or DE partitions and compare the evidence in each partition on edges $(AD, BC, CD)$. In all cases, we assume random heterogeneity effects and make the choice to assume common variance components across the partitions, splitting only the means.

Table 1 gives posterior mean (sd) estimates of the treatment effects (log odds ratios) for edges outside the spanning tree, from each partition, where the subscript 1 denotes the ST partition and 2 denotes the DE partition for the two-partition models (b,c,e,f). For the four-partition model (d), 1-3 denote the sequential spanning tree partitions and 4 the multi-arm trial partition. Also given, for each edge outside the original spanning tree, are the posterior mean (sd) differences between partitions and both the local and global posterior probabilities of no difference, adjusted for the multiple tests and their correlation. First, note that the global test of no conflict varies by model, and hence by what partitions of evidence are compared with each other: the posterior probability of no conflict in model (b) is 94 . . 4% and 27.4% for models (c) and (e). These latter two models appear to detect some mild evidence of conflict, despite the large uncertainty in many of the partition-specific treatment effect estimates, with several of the posterior standard deviations of the same order of magnitude as the corresponding posterior means, if not larger. The DIC is also slightly smaller for the two models (c) and (e) which detect potential conflict, compared to those that don't. This lack of invariance of the global test to the partitions employed suggests it is not enough to rely on a single node-splitting model to search for conflict in a DAG. Moreover, it motivates looking at local tests for conflict in different node-splitting models, to locate the specific items of evidence that may conflict with each other.

A closer look at the local posterior probabilities of no conflict for each edge outside the initial spanning tree reveals that the potential conflict detected by models (c) and (e) involves edges including treatment D (posterior probabilities 17.8% and 18.6% for edges BD and CD in model (c), 12.4% and 10.5% for edges AD and CD in model (e)). Each of these four local tests involves a partition where the estimated treatment effect for the relevant edge is implausibly large ( > > 400 on the odds ratio scale) and where the sample sizes of the studies involved are small (e.g. studies 7, 20, 23 and 24 in Supplementary Material Table A.1).

Unlike models (c), (e) and (f), where in both partitions, each sub-network spans all 4 treatments, in models (b) and (d), the spanning tree chosen, $(AB, AC, AD)$, is such that for each sub-network outside the spanning tree, not all the treatments are included (Figure 2). This results in a lack of identifiability for the basic parameters $\boldsymbol{\eta}_b$ in partition 2 of model (b) and in partitions 2 and 3 of model (d) (Table 1), where their estimates are dominated by their diffuse prior distribution (Normal(0, ) on the log odds ratio scale). There is therefore no potential for detecting conflict about the basic parameters $\boldsymbol{\eta}_b$, only about the functional parameters $\boldsymbol{\eta}_f$.

The different results obtained from each of the five models are understandable, since each model partitions the evidence in a different way, and the detection of conflict relies on the conflicting evidence being in different rather than the same partitions. However, where the same evidence is in the same partition for different models (for example, the evidence directly informing the AC edge in models (c) and (d)), approximately the same estimate is reached in each model, as expected (0 . . 26) in model (c), 0 . . 28) in model (d), Table 1).

Figure 3(b) demonstrates the multiple node-splits we make to systematically assess conflict in the original DAG of Figure 3(a), separating out the contributions of the prior model and each likelihood contribution. These node-splits result in 5 partitions, with 6 contrasts to test for equality to zero. Denoting the nodes in the “prior” partition (above the red arrows in Figure 3(b)) by the subscript $p$ and the nodes in each “likelihood” partition (below the red arrows in Figure 3(b)) by $d$, the vector of contrasts to test is then

$$\Delta = \big( h(\rho_p) - h(\rho_d),\; h(\pi_p\kappa_p) - h([\pi\kappa]_d),\; h(\pi_p(1-\kappa_p)) - h([\pi(1-\kappa)]_d),\; g(D_{Lp}) - g(D_{Ld}),\; g(D_{Up}) - g(D_{Ud}),\; g(D_p) - g(D_d) \big)^T$$

where $h(\cdot)$ and $g(\cdot)$ denote the logit and log functions respectively. These contrasts are represented by the red dot-dashed arrows in Figure 3(b). In the “prior” partition, the priors given to the basic parameters are those of the original model (Section 2.2). In each “likelihood” partition, the basic parameters are given Jeffreys’ priors so that the posteriors represent only the likelihood. These priors are Beta(1/2, 1/2) for the proportions and $p(D_{Bd}) \propto 1/D_{Bd}^{1/2}$ for the lower and upper bounds ($B = L, U$) for $D$. $D_d$ is given a Uniform prior between $D_{Ld}$ and $D_{Ud}$.

Figure 4 shows the posterior distributions of the contrasts $\Delta$, where 0 lies in these distributions and the corresponding unadjusted ($p_U$) and multiply-adjusted ($p_A$) individual conflict p-values testing for equality to 0. A global χ-squared (Wald) test gives a conflict p-value of 0.… […] $D_U$ (posterior probability of zero difference is $p_U$ = 0.…), $D$ ($p_U$ = 0.…), $\rho$ ($p_U$ = 0.…) […] $D_U$ and $D$, at $p_A$ = 0.175 and $p_A$ = 0.058 respectively. Note that the posterior contrasts in Figure 4 are slightly non-normal, hence we interpret the adjusted posterior probabilities of no conflict as exploratory, rather than as absolute measures.

Examining more closely the posterior distributions of the “prior” and “likelihood” versions of the node $D$ (Supplementary Material Figure A.3, upper panel), we can better visualise the prior-data conflict: the “likelihood” version lies very much in the lower tail of the “prior” version. This is in spite of, or rather because of, the flat Uniform priors of the prior model, which translate into a non-Uniform implied prior for the function $D_p = N\rho_p\pi_p\kappa_p$.

The “saturated” model splitting apart each component of evidence in the DAG allows us to assess prior-data conflict in this model, but not conflict between different combinations of the likelihood evidence, due to lack of identifiability: in each likelihood partition in Figure 3(b), clearly only the parameter directly informed by the data, whether basic or functional, can be identified. To assess consistency of evidence between likelihood terms, we employ a cross-validatory “leave-n-out” approach, for n = 1 and n = 2, splitting in each case the relevant nodes directly informed by the left-out data items. Note that other possibilities exist, such as splitting at the basic parameters, depending on which data are left out. Table 2 gives unadjusted ($p_U$) and various multiply-adjusted ($p_{AW}$, $p_{AL}$, $p_{AA}$) individual posterior probabilities of no difference between nodes split between partitions 1 (the “left-out” evidence) and 2 (the remaining evidence). These posterior probabilities highlight inconsistency in the network of the four data items informing the three nodes $\rho$, $\pi\kappa$ and $D = N\rho\pi\kappa$. Splits at these three nodes demonstrate low posterior probabilities of no difference in the “leave-1-out” models (A), (B) and (E), and in the “leave-2-out” models (B), (C) and (J) in particular.
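As a rough illustration of how the “leave-n-out” partitions can be enumerated, the Python sketch below lists every way of leaving out n of five data items. The item labels are hypothetical placeholders chosen for this example, one per observed quantity in the HIV model; only the resulting counts, 5 leave-1-out splits and 10 leave-2-out splits, are taken from the text, where they correspond to models (A)-(E) and (A)-(J) in Table 2.

```python
from itertools import combinations

# Hypothetical labels for the five data items, named after the node each informs.
DATA_ITEMS = ["rho", "pi_kappa", "pi_1_minus_kappa", "D_L", "D_U"]


def leave_n_out_partitions(items, n):
    """Yield (left_out, remaining) pairs for every choice of n left-out items."""
    for left_out in combinations(items, n):
        remaining = tuple(item for item in items if item not in left_out)
        yield left_out, remaining


leave_1 = list(leave_n_out_partitions(DATA_ITEMS, 1))
leave_2 = list(leave_n_out_partitions(DATA_ITEMS, 2))
print(len(leave_1), len(leave_2))
for left_out, remaining in leave_2:
    # Partition 1 holds the left-out evidence, partition 2 the remainder.
    print("partition 1:", left_out, "| partition 2:", remaining)
```

Each pair defines one node-split model, so the number of tests to adjust for grows combinatorially with n.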
There is no potential for the evidence on the prevalence of undiagnosed infection, $\pi(1-\kappa)$, to conflict with any other evidence, since $\pi$ and $\kappa$ are not separately identifiable from the remaining four data items alone. Hence all of the posterior probabilities of no difference concerning the node $\pi(1-\kappa)$ are high.

The conflict in this four-item network is well illustrated by the node-split model (J), where the count data on the lower and upper bounds for $D$ are “left out” in partition 1. Supplementary Material Figure A.3 (lower panel) shows the posterior distributions for each of $D_L$, $D_U$ and $D$ in both partitions. Since in partition 2 the data on the limits for $D$ have been excluded, the posterior distributions for the bounds (solid black and red lines) are flat and hugely variable. Despite this, the posterior distribution for $D$ is relatively tightly peaked, due to the indirect evidence on $D$ provided by the data informing $\rho$ and $\pi\kappa$. It is this indirect evidence that conflicts with the direct evidence informing $D$ via the data on the bounds for $D$.

Discussion

We have proposed here the systematic assessment of conflict in an evidence synthesis, in particular accounting for the multiple tests for consistency entailed, through the simultaneous inference framework proposed by Hothorn et al. (2008) and Bretz et al. (2011). We have chosen the max-T tests, which allow for multiply-adjusted local and global testing simultaneously. Note that the use of this (typically classical) simultaneous inference framework relies on the asymptotic multivariate normality of the joint posterior distribution. In cases where the likelihood does not dominate the prior, resulting in a skewed or otherwise non-normal posterior, we treat the results of conflict analysis as exploratory, rather than absolute measures of conflict. If the posterior is skewed but still uni-modal, a global, implicitly multiply-adjusted, test for conflict can be formulated in terms of the Mahalanobis distance of each posterior sample from their mean, as we proposed in Presanis et al. (2013). This is a multivariate equivalent of calculating the tail area probability for regions further away from the posterior mean than the point 0. However, the Mahalanobis-based test does not allow us to obtain local tests for conflict, nor does it apply in the case of a multi-modal posterior. In the latter case, kernel density estimation could be used to obtain the multivariate tail area probability, although such estimation is computationally challenging for large posterior dimension.

Although generalised evidence syntheses have mostly been carried out in a Bayesian framework, there are examples (e.g. Commenges and Hejblum, 2013) that are either frequentist or not fully Bayesian. In the NMA field, maximum likelihood and Bayesian methods are both common (e.g. White et al., 2012; Jackson et al., 2014).
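The two kinds of test discussed above can be sketched numerically. The Python code below is a minimal illustration, assuming that posterior draws of the contrast vector are available as a NumPy array. The function names, the Monte Carlo approximation of the null distribution of the maximum statistic, and the simulated draws are all assumptions made for this sketch; it is not the implementation used for the results reported here.

```python
import math
import numpy as np


def max_t_pvalues(delta_hat, cov, n_sim=100_000, seed=0):
    """Single-step max-T p-values for H0: each contrast equals zero.

    Returns (unadjusted, adjusted, global_p) under a normal approximation.
    """
    delta_hat = np.asarray(delta_hat, dtype=float)
    cov = np.asarray(cov, dtype=float)
    sd = np.sqrt(np.diag(cov))
    t_abs = np.abs(delta_hat / sd)            # standardised contrasts
    corr = cov / np.outer(sd, sd)             # correlation matrix of the contrasts
    rng = np.random.default_rng(seed)
    z = rng.multivariate_normal(np.zeros(len(sd)), corr, size=n_sim)
    max_abs = np.abs(z).max(axis=1)           # null draws of max_j |Z_j|
    unadjusted = np.array([math.erfc(t / math.sqrt(2.0)) for t in t_abs])
    adjusted = np.array([(max_abs >= t).mean() for t in t_abs])
    return unadjusted, adjusted, adjusted.min()


def mahalanobis_global_pvalue(samples):
    """Share of posterior draws at least as far from the mean as the point 0."""
    samples = np.asarray(samples, dtype=float)
    mean = samples.mean(axis=0)
    precision = np.linalg.inv(np.atleast_2d(np.cov(samples, rowvar=False)))
    centred = samples - mean
    d2 = np.einsum("ij,jk,ik->i", centred, precision, centred)
    d2_zero = mean @ precision @ mean         # squared distance of 0 from the mean
    return float((d2 >= d2_zero).mean())


# Simulated stand-in for posterior draws of three contrasts (not real data).
rng = np.random.default_rng(1)
draws = rng.multivariate_normal([0.2, 2.5, -0.1], np.eye(3), size=20_000)
p_u, p_a, p_global = max_t_pvalues(draws.mean(axis=0), np.cov(draws, rowvar=False))
print(p_u.round(3), p_a.round(3), round(float(p_global), 3))
print(mahalanobis_global_pvalue(draws))
```

In this sketch the global max-T p-value is simply the smallest adjusted local p-value, which is what allows local and global conclusions to be read from the same set of tests, whereas the Mahalanobis-based quantity gives a global answer only.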
An advantage of the simultaneous inference framework (Hothorn et al., 2008; Bretz et al., 2011) is that, given any estimator of a vector of differences $\Delta$ and its corresponding variance-covariance matrix $S_\Delta$, regardless of the method used to obtain the estimates, the global and local max-T tests can be formulated.

Conflict p-values can be seen as cross-validatory posterior predictive checks (Presanis et al., 2013). There is a large literature on various types of Bayesian predictive diagnostics, including prior-, posterior- and mixed-predictive checks (e.g. Box, 1980; Gelman et al., 1996; Marshall and Spiegelhalter, 2007). A key issue much discussed in this literature is the lack of uniformity of posterior predictive p-values under the null hypothesis (Gelman, 2013), with such p-values conservative due to the double use of data. Much work has therefore been devoted to either alternative p-values (e.g. Bayarri and Berger, 2000) or post-processing of p-values to calibrate them (e.g. Steinbakk and Storvik, 2009). Gelman (2013) argues that the importance of uniformity depends on the context in which the model checks are conducted: in general non-uniformity is not an issue, but if the posterior predictive tests rely on parameters or imputed latent data, then care should be taken. Since conflict p-values are cross-validatory, the issue of conservatism and the double use of data does not apply. In fact, for a wide class of standard hierarchical models, Gåsemyr (2016) has demonstrated the uniformity of the conflict p-value.

As illustrated by both applications, the choice of different ways of partitioning the evidence in a DAG can lead to different conclusions over the existence of conflict. This is to be expected when considering the local conflict p-values, since conflicting evidence may need to be in different partitions in order to be detectable.
This is analogous to the idea of “masking” in cross-validatory outlier detection, where outliers may not be detected if multiple outliers exist (Chaloner and Brant, 1988). In the case of the global tests for conflict, the NMA example showed that these are also not invariant to the choice of partition. In the NMA literature, alternative methods accounting for inconsistency include models that introduce “inconsistency parameters” that absorb any variability due to conflict beyond between-study heterogeneity (Lu and Ades, 2006; Higgins et al., 2012; Jackson et al., 2014). Higgins et al. (2012) and Jackson et al. (2014) have pointed out that the apparent algorithm that Lu and Ades (2006) follow for identifying inconsistency parameters does not guarantee that all such parameters are identified, nor that the Lu-Ades model is invariant to the choice of baseline treatment. The authors further posit, and more recently have proved (Jackson et al., 2015), that their “design-by-treatment interaction model”, which introduces an inconsistency parameter systematically for each non-baseline treatment within each design, contains each possible Lu-Ades model as a sub-model. In related ongoing work, we note that each Lu-Ades model corresponds to a particular choice of node-splitting model, one being a reparameterisation of the other. The lack of invariance of results of testing for inconsistency from one Lu-Ades model to another is therefore not surprising, since, as we illustrated here, different choices of node-splitting model correspond to different partitions of evidence being compared. The lack of invariance of a global test for conflict to the choice of node-splitting model, although unsurprising, is perhaps unsatisfactory. However, as we illustrated in this paper, it clearly emphasises the need for a more comprehensive and systematic assessment of conflict throughout a DAG, both at a local level and across different types of node-split model, than a single global test can provide.
We therefore recommend that, although a global test may be an initial step in any conflict analysis, testing for conflict throughout a DAG is required to be sure of detecting any potential conflict. One strategy is to start from splitting every possible node in the DAG, as we did in the HIV example, before looking at more targeted leave-n-out approaches. The design-by-treatment interaction model provides a way of doing so and we are further investigating the relationship of the (fixed inconsistency effects) design-by-treatment interaction model to such a “saturated” node-splitting model.

Note that in the NMA example considered here, we have concentrated on a “contrast-based” as opposed to “arm-based” parameterisation (Hong et al., 2016; Dias and Ades, 2016). Also, we have considered the case where each study has a study-specific baseline treatment $B_d$ and the network as a whole has a baseline treatment $A$. However, alternative parameterisations could be considered, such as using a two-way linear predictor with main effects for both treatment and study, treating the counter-factual or missing treatment designs as missing data (Jones et al., 2011; Piepho et al., 2012). Although we have not yet explored alternative parameterisations, we posit that systematic node-splitting could be equally well applied.

As with any cross-validatory work, the systematic assessment of conflict at every node in a DAG can quickly become computationally burdensome as a model grows in dimension. An area for future research is the systematic analysis of conflict using efficient algorithms (Lunn et al., 2013; Goudie et al., 2015) in a Markov melding framework (Goudie et al., 2016), which allows for an efficient modular approach to model building.

Acknowledgements

This work was supported by the Medical Research Council [Unit Programme number U105260566]; and the Polish National Science Centre [grant no. DEC-2012/05/E/ST1/02218]. The authors also thank Ian White and Dan Jackson for their very helpful comments.

References

Ades, A. E. and A. J. Sutton (2006). Multiparameter evidence synthesis in epidemiology and medical decision-making: current approaches. JRSS(A) 169(1), 5–35.

Bayarri, M. J. and J. O. Berger (2000). P-values for composite null models. JASA 95(452), 1127–1142.

Bernardo, J. M. and A. F. M. Smith (1994). Bayesian Theory. John Wiley & Sons, Inc.

Box, G. E. P. (1980). Sampling and Bayes’ inference in scientific modelling and robustness. JRSS(A) 143(4), 383–430.

Bretz, F., T. Hothorn, and P. Westfall (2011). Multiple Comparisons Using R (First ed.). Chapman and Hall/CRC.

Chaloner, K. and R. Brant (1988). A Bayesian approach to outlier detection and residual analysis. Biometrika 75(4), 651–659.

Commenges, D. and B. Hejblum (2013). Evidence synthesis through a degradation model applied to myocardial infarction. Lifetime Data Analysis 19(1), 1–18.

De Angelis, D., A. M. Presanis, P. J. Birrell, G. S. Tomba, and T. House (2014). Four key challenges in infectious disease modelling using data from multiple sources. Epidemics.

Dias, S. and A. E. Ades (2016). Absolute or relative effects? arm-based synthesis of trial data. Res. Syn. Meth. 7(1), 23–28.

Dias, S., N. J. Welton, D. M. Caldwell, and A. E. Ades (2010). Checking consistency in mixed treatment comparison meta-analysis. Stat. Med. 29(7-8), 932–944.

Gelman, A. (2013). Two simple examples for understanding posterior p-values whose distributions are far from uniform. Electron. J. Stat. 7(0), 2595–2602.

Gelman, A., X.-L. Meng, and H. Stern (1996). Posterior predictive assessment of model fitness via realized discrepancies. Statistica Sinica 6, 733–807.

Goudie, R. J. B., R. Hovorka, H. R. Murphy, and D. Lunn (2015). Rapid model exploration for complex hierarchical data: application to pharmacokinetics of insulin aspart. Stat. Med. 34(23), 3144–3158.

Goudie, R. J. B., A. M. Presanis, D. J. Lunn, D. De Angelis, and L. Wernisch (2016). Model surgery: joining and splitting models with Markov melding. https://arxiv.org/abs/1607.06779.

Gåsemyr, J. (2016). Uniformity of node level conflict measures in Bayesian hierarchical models based on directed acyclic graphs. Scand. J. Stat. 43(1), 20–34.

Gåsemyr, J. and B. Natvig (2009). Extensions of a conflict measure of inconsistencies in Bayesian hierarchical models. Scand. J. Stat. 36(4), 822–838.

Higgins, J. P. T., D. Jackson, J. K. Barrett, G. Lu, A. E. Ades, and I. R. White (2012). Consistency and inconsistency in network meta-analysis: concepts and models for multi-arm studies. Res. Syn. Meth. 3(2), 98–110.

Hong, H., H. Chu, J. Zhang, and B. P. Carlin (2016). A Bayesian missing data framework for generalized multiple outcome mixed treatment comparisons. Res. Syn. Meth. 7(1), 6–22.

Hothorn, T., F. Bretz, and P. Westfall (2008). Simultaneous inference in general parametric models. Biometrical J. 50(3), 346–363.

Jackson, D., J. K. Barrett, S. Rice, I. R. White, and J. P. T. Higgins (2014). A design-by-treatment interaction model for network meta-analysis with random inconsistency effects. Stat. Med. 33(21), 3639–3654.

Jackson, D., P. Boddington, and I. R. White (2015). The design-by-treatment interaction model: a unifying framework for modelling loop inconsistency in network meta-analysis. Res. Syn. Meth. 7(3), 329–32.

Jones, B., J. Roger, P. W. Lane, A. Lawton, C. Fletcher, J. C. Cappelleri, H. Tate, and P. Moneuse (2011). Statistical approaches for conducting network meta-analysis in drug development. Pharma. Stat. 10(6), 523–531.

Krahn, U., H. Binder, and J. König (2013). A graphical tool for locating inconsistency in network meta-analyses. BMC Med. Res. Method. 13(1), 35+.

Lauritzen, S. L. (1996). Graphical Models. Oxford Statistical Science Series. OUP.

Lu, G. and A. E. Ades (2006). Assessing evidence inconsistency in mixed treatment comparisons. JASA 101(474), 447–459.

Lumley, T. (2002). Network meta-analysis for indirect treatment comparisons. Stat. Med. 21(16), 2313–2324.

Lunn, D., J. K. Barrett, M. Sweeting, and S. Thompson (2013). Fully Bayesian hierarchical modelling in two stages, with application to meta-analysis. JRSS(C) 62(4), 551–572.

Lunn, D., D. J. Spiegelhalter, A. Thomas, and N. Best (2009). The BUGS project: Evolution, critique and future directions. Stat. Med. 28(25), 3049–3067.

Marshall, E. C. and D. J. Spiegelhalter (2007). Identifying outliers in Bayesian hierarchical models: a simulation-based approach. Bayesian Analysis 2, 409–444.

Piepho, H. P., E. R. Williams, and L. V. Madden (2012). The use of two-way linear mixed models in multi-treatment meta-analysis. Biometrics 68(4), 1269–1277.

Presanis, A. M., D. Ohlssen, D. J. Spiegelhalter, and D. De Angelis (2013). Conflict diagnostics in directed acyclic graphs, with applications in Bayesian evidence synthesis. Stat. Sci. 28(3), 376–397.

R Core Team (2015). R: a language and environment for statistical computing. Vienna, Austria: R Foundation for Statistical Computing.

Rosinska, M., P. Gwiazda, D. De Angelis, and A. M. Presanis (2016). Bayesian evidence synthesis to estimate HIV prevalence in men who have sex with men in Poland at the end of 2009. Epidemiol. Infect. 144, 1175–1191.

Salanti, G. (2012). Indirect and mixed-treatment comparison, network, or multiple-treatments meta-analysis: many names, many benefits, many concerns for the next generation evidence synthesis tool. Res. Syn. Meth. 3(2), 80–97.

Spiegelhalter, D. J., N. G. Best, B. P. Carlin, and A. van der Linde (2002). Bayesian measures of model complexity and fit. JRSS(B) 64(4), 583–639.

Steinbakk, G. H. and G. O. Storvik (2009). Posterior predictive p-values in Bayesian hierarchical models. Scand. J. Stat. 36(2), 320–336.

Sturtz, S., U. Ligges, and A. Gelman (2005). R2WinBUGS: a package for running WinBUGS from R. J. Stat. Softw. 12(3), 1–16.

van Valkenhoef, G., S. Dias, A. E. Ades, and N. J. Welton (2016). Automated generation of node-splitting models for assessment of inconsistency in network meta-analysis. Res. Syn. Meth. 7(1), 80–93.

van Valkenhoef, G., T. Tervonen, B. de Brock, and H. Hillege (2012). Algorithmic parameterization of mixed treatment comparisons. Stat. Comp. 22(5), 1099–1111.

Welton, N. J., A. J. Sutton, N. J. Cooper, K. R. Abrams, and A. E. Ades (2012). Evidence Synthesis in a Decision Modelling Framework. John Wiley & Sons, Ltd.

White, I. R., J. K. Barrett, D. Jackson, and J. P. T. Higgins (2012). Consistency and inconsistency in network meta-analysis: model estimation using multivariate meta-regression. Res. Syn. Meth. 3(2), 111–125.

Figure 1: (i) Example DAG $G(V, E)$ showing a generic evidence synthesis. (ii) & (iii) Example node-split at separator node $\phi$: (ii) original model $G(\phi, y)$; (iii) node-split model. In (ii): the data $y = \{y_\phi, y_{\setminus\phi}\}$ comprise data $y_\phi$ that are direct descendants of $\phi$; and the remaining data $y_{\setminus\phi}$. In (iii): when splitting $G(\phi, y)$ into partitions $a$ and $b$, the data vector $y_\phi$ is split into $y_{a,\phi}$ and $y_{b,\phi}$, whereas $y_{\setminus\phi}$ remains only in partition $a$. The partition $a$ data are therefore $y_a = \{y_{a,\phi}, y_{\setminus\phi}\}$ and the partition $b$ data are $y_b = y_{b,\phi}$.

Table 1: Multiply adjusted posterior mean (sd) estimates of conflict between partitions, for each model (b)-(f) respectively. In the two-partition models (b, c, e, f), partition 1 is the spanning tree (indirect) evidence partition and partition 2 is the direct data partition. In model (d), partitions 1-3 are the sequential spanning trees and partition 4 is the multi-arm study partition.

ST:

AB, AC, AD; AB, AC, BD

Model: (b) (c) (d) (e) (f)
Posterior: Mean (SD) for each model
AB: -0.415 (5.276)   0.319 (0.983)   -0.230 (5.849)   6.261 (3.251)   1.513 (1.041)
AB: -0.044 (10.009)
AC: -0.165 (5.262)   0.615 (0.866)   -0.379 (5.848)   6.173 (3.140)   1.496 (0.835)
AD: …
Δ_AD: -11.804 (6.268)   -1.354 (2.279)
BC: …
BD: -0.017 (0.948)
Δ_BD: -0.140 (1.067)   8.639 (5.069)   8.079 (5.444)
CD: -0.345 (0.714)
Δ_CD: -0.294 (0.902)   8.459 (5.033)   7.430 (5.526)   -6.443 (3.287)   -0.682 (1.893)

Table 2: … $p_U$ denotes the unadjusted conflict p-value; $p_{AW}$ is the p-value adjusted for the multiple tests carried out within each model (A)-(J) for the leave-2-out approach; $p_{AL}$ is the p-value adjusted for the 23 tests carried out in all models (A)-(J) for the leave-2-out approach; and $p_{AA}$ is the p-value adjusted for 28 tests carried out in all leave-1-out models (A)-(E) and all leave-2-out models (A)-(J).

Columns: Model; Partition 1; Partition 2; Node split; p_U; p_AW; p_AL; p_AA

Leave-1-out
(A)  y; {y, y, y, y}; ρ: < …
     y; {y, y, y, y}; πκ: < …
     y; {y, y, y, y}; π(1 − κ): 0.…
     y; {y, y, y, y}; D_L: …
     y; {y, y, y, y}; D_U: < …
     {y, y}; {y, y, y}; ρ: …; πκ: …
     {y, y}; {y, y, y}; ρ: < …; π(1 − κ): 0.4906, 0.7717, 1.0000, 1.0000
(C)  {y, y}; {y, y, y}; πκ: < …, < …, < …, < …; π(1 − κ): 0.8322, 0.9000, 1.0000, 1.0000; π κ: …
     {y, y}; {y, y, y}; ρ: < …; D_L: …
     {y, y}; {y, y, y}; ρ: …; D_U: …
     {y, y}; {y, y, y}; πκ: …; D_L: …
     {y, y}; {y, y, y}; πκ: …; D_U: …
     {y, y}; {y, y, y}; π(1 − κ): 0.1471, 0.3330, 0.9855, 0.9944; D_L: …
     {y, y}; {y, y, y}; π(1 − κ): 0.5237, 0.8100, 1.0000, 1.0000; D_U: < …
     {y, y}; {y, y, y}; D_L: …; D_U: < …; D: < …

Figure 2: Smoking cessation evidence network, under (a) a consistency assumption; (b)-(f) inconsistency assumptions, where the evidence is partitioned in different ways.
In (b), (c), (e) and (f), the direct evidence (dashed lines) is compared with the indirect evidence (solid lines) on each contrast where there is a dashed line. In (d), the evidence is separated into three spanning trees and a fourth partition for the multi-arm trial evidence.

Figure 3: (a) DAG of initial model for synthesising Polish HIV prevalence data. (b) DAG of multiple node-split model comparing priors to each likelihood contribution. Note that the square brackets are used in denoting the nodes in the likelihood partition ($[\pi\kappa]_d$, $[\pi(1-\kappa)]_d$) to emphasise the fact that these two nodes are independent parameters not functionally related to each other.

Figure 4: Posterior distributions of the contrasts $\Delta$ for the HIV prevalence example. The red lines denote 0 difference; $p_U$ is the unadjusted and $p_A$ the multiply-adjusted individual conflict p-value. Panels (x-axis: Difference; y-axis: Density): logit($\rho_p$) − logit($\rho_d$); logit($\pi_p\kappa_p$) − logit($\pi_d\kappa_d$); logit($\pi_p(1-\kappa_p)$) − logit($\pi_d(1-\kappa_d)$); log($D_{Lp}$) − log($D_{Ld}$); log($D_{Up}$) − log($D_{Ud}$); log($D_p$) − log($D_d$) = log($N\rho_p\pi_p\kappa_p$) − log($N\rho_d\pi_d\kappa_d$).

Supplementary Material

A Figures and Tables
Figure A.1: (a) DAG of NMA under assumptions of a common treatment effect $\eta_{JK}$ (no heterogeneity) and consistency $\eta_{JK} = \eta_{AK} - \eta_{AJ}$. (b) DAG of NMA under assumptions of random treatment effects, to account for heterogeneity, and consistency.

Table A.1: Smoking cessation data set
Columns: Study; Design; y_A, n_A, y_A/n_A; y_B, n_B, y_B/n_B; y_C, n_C, y_C/n_C; y_D, n_D, y_D/n_D

Table A.2: … posterior mean deviance $E_{\theta|y}(D)$; deviance evaluated at posterior means $D(E_{\theta|y}\theta)$; effective number of parameters $p_D$; and deviance information criterion DIC.

Model:                                   Common-effect   Random-effect
μ_JK (AB, AC, AD, BC, BD, CD),
  posterior mean and posterior sd:       …               …
E_{θ|y}(D)                               267             54
D(E_{θ|y} θ)                             240             10
p_D                                      27              44
DIC                                      294             98

Table A.3: Results from initial HIV model: observations; posterior mean (sd) estimates; posterior mean deviance $E_{\theta|y}(D)$; deviance evaluated at posterior means $D(E_{\theta|y}\theta)$; effective number of parameters $p_D$; and deviance information criterion DIC.

Parameter θ   y     n      y/n     ŷ              θ̂               E_{θ|y}(D)   D(E_{θ|y} θ)   p_D   DIC
ρ             35    1536   0.023   14.6 (1.5)     0.010 (0.001)   21.0         20.7           0.4   21.4
πκ            113   2840   0.040   92.5 (8.9)     0.033 (0.003)   5.5          4.4            1.1   6.5
π(1 − κ)      136   2725   0.050   136.7 (11.3)   0.050 (0.004)   1.0          0.0            1.0   2.0
D_L           836                  836.2 (28.9)   836.2 (28.9)    1.0          0.0            1.0   2.0
D_U           …

Figure A.2: DAG of common-effect network meta-analysis model, split into direct (DE) and indirect (ST) evidence informing the functional parameters $\eta_f$, i.e. those edges outside of the spanning tree formed by the basic parameters $\eta_b$.

Figure A.3: Upper panel (log($D_p$) vs log($D_d$); density estimates, N = 30000, bandwidth = 0.05): Posterior distributions of the nodes $D_p$ and $D_d$ for the HIV prevalence example, on the log scale. The right-hand blue line denotes where the total population of Poland (N = 15,… […] a priori for the number diagnosed. The left-hand blue line denotes the value log(N × …), i.e. the prior mean of log($D_p$) = log($N\rho_p\pi_p\kappa_p$). Lower panel (density estimates, N = 30000, bandwidth = 0.05): Posterior distributions of the nodes $D_{L1}$, $D_{L2}$, $D_{U1}$, $D_{U2}$, $D_1$ and $D_2$ for the HIV prevalence “leave-2-out” node-split model (J), on the log scale. The dashed lines represent the nodes in partition 1, i.e. the “left-out” partition, where the posteriors are based only on the likelihood given by the two left-out data items and Jeffreys’ priors for $D_L$, $D_U$. The solid lines give the corresponding posteriors in partition 2, i.e. based on all the original model priors and on the remaining three data items.

B Asymptotics

Let $p(\theta_1), \ldots, p(\theta_Q)$ denote the set of prior distributions for the basic parameters $\theta_q$ in each partition $q$. Then by the independence of each partition, the joint posterior distribution of all parameters $\phi$ in all partitions is

$$p(\phi \mid y) \propto \prod_{q=1}^{Q} p(\theta_q)\, p(y_q \mid \theta_q).$$

If the joint prior distribution is dominated by the likelihood, then asymptotically (Bernardo and Smith, 1994), the joint posterior distribution of all nodes is multivariate normal:

$$\phi \mid y \overset{a}{\sim} N_{\sum_q n_q}\big( (\hat{\phi}_1, \ldots, \hat{\phi}_Q),\; V \big)$$

where $n_q$ is the total number of parameters in partition $q$, whether basic or not, and $V$ is the inverse observed information matrix for the parameters $\phi$. Since the vector of separator nodes, $\phi_S = (\phi^{(s)}_1, \ldots, \phi^{(s)}_Q)$, is a subset of $\phi$, their joint posterior is also multivariate normal:

$$\phi_S \mid y \overset{a}{\sim} N_m\big( (\hat{\phi}^{(s)}_1, \ldots, \hat{\phi}^{(s)}_Q),\; V_S \big) \qquad (4)$$

where $m = \sum_q m_q$ is the total number of separator nodes, including node-split copies, and $V_S$ is the appropriate sub-matrix of $V$. Since the partitions are independent, $V_S$ is a block diagonal matrix consisting of the inverse observed information matrices for separator nodes in each partition along the diagonal.

By theorem 5.17 of Bernardo and Smith (1994), since (4) holds and if $J_h(\phi_S) = \partial h(\phi_S) / \partial \phi_S$ is non-singular with continuous entries, then the posterior distribution of the transformed separator nodes, $\phi_H = h(\phi_S)$, is also asymptotically normal:

$$\phi_H \mid y \overset{a}{\sim} N_m\big( h(\hat{\phi}^{(s)}_1, \ldots, \hat{\phi}^{(s)}_Q),\; J_h(\hat{\phi}_S)^T V_S J_h(\hat{\phi}_S) \big)$$

The Jacobian $J_h(\phi_S)$ exists and is non-singular for the sorts of transformations we use in practice, for example log and logit transformations.

A further application of theorem 5.17 of Bernardo and Smith (1994) results in a posterior distribution of the contrasts $\Delta$ that is also asymptotically multivariate normal, if $\partial \Delta(\phi_H) / \partial \phi_H = C_\Delta^T$ is non-singular with continuous entries, which as a contrast matrix it is:

$$\Delta \mid y \overset{a}{\sim} N_p\big( C_\Delta^T h(\hat{\phi}^{(s)}_1, \ldots, \hat{\phi}^{(s)}_Q),\; C_\Delta^T J_h(\hat{\phi}_S)^T V_S J_h(\hat{\phi}_S) C_\Delta \big) \qquad (5)$$
$$= N_p\big( C_\Delta^T \hat{\phi}_H,\; C_\Delta^T V_H C_\Delta \big)$$

for $V_H = J_h(\hat{\phi}_S)^T V_S J_h(\hat{\phi}_S)$. Asymptotically, therefore, the posterior mean $\bar{\Delta} = C_\Delta^T \bar{\phi}_H \overset{a}{\approx} C_\Delta^T \hat{\phi}_H$ and the posterior variance-covariance matrix of $\Delta$ is S ∆∆