Statistical inference with anchored Bayesian mixture of regressions models: A case study analysis of allometric data
Deborah Kunkel and Mario Peruggia
1. School of Mathematical & Statistical Sciences, Clemson University, Clemson, SC, USA
2. Department of Statistics, The Ohio State University, Columbus, OH, USA
Abstract
We present a case study in which we use a mixture of regressions model to improve on an ill-fitting simple linear regression model relating log brain mass to log body mass for 100 placental mammalian species. The slope of this regression model is of particular scientific interest because it corresponds to a constant that governs a hypothesized allometric power law that relates brain mass to body mass. A specific line of investigation is to determine whether the regression intercept and slope may vary across subgroups of related species.

We model these data using an anchored Bayesian mixture of regressions model, which modifies the specification of a standard Bayesian Gaussian mixture by pre-assigning small subsets of observations to given mixture components with probability one. These pre-classified observations (called anchor points) break the relabeling invariance typical of exchangeable model specifications (the so-called label-switching problem) that often prevents direct interpretation of component-specific parameters in mixture models. Anchoring can also be seen as a method for specifying weakly data-dependent prior distributions on the component-specific parameters. A careful choice of which observations to pre-classify to which mixture components is key to the specification of a well-fitting anchor model.

In the article we compare three distinct broad strategies for the selection of anchor points. The first strategy assumes that the underlying mixture of regressions model holds and assigns anchor points to different components so as to maximize the information about their labeling. The second strategy makes no assumption about the relationship between x and y and instead identifies anchor points using a bivariate Gaussian mixture model that does not condition on body mass.
The third strategy begins with the assumption that there is only one mixture regression component and identifies a clustering structure based on case-deletion importance sampling weights. The anchor points are then selected as typical representatives of the identified clusters. We compare the performance of the three strategies on the allometric data set and use auxiliary taxonomic information about the species to evaluate the model-based classifications that we can estimate from these models.

Keywords: Allometry, Case-deletion weights, Clustering, EM algorithm, Semi-supervised learning
Acknowledgments
This material is based on work supported by the National Science Foundation under grant no. SES-1424481.
In the natural sciences, allometry studies the relationships between physical and physiological measurements taken on various animal species (Peters, 1983; Gayon, 2000). Of particular interest is to determine how other measurements may be affected by body mass. Examples include the relationships between body mass and brain mass, body mass and metabolic rate, and body mass and gestation duration. It is often postulated that pairs (x, y) of such measurements may be related via a power law of the form y = cx^b, for some unknown constants c and b, typically assumed to be positive. The estimation of the exponent b is often of primary scientific interest. On a logarithmic scale, the power law turns into the linear relationship log y = (log c) + b log x and the investigative focus shifts toward the estimation of the slope of the regression line.

Given a set of (x, y) pairs of traits measured on a variety of species, it is by now generally accepted that fitting a single linear regression model to the entire data set provides too crude a summary, especially when many species from different taxa and genetically diverse groups are included in the data set (Jerison, 1955; Bennett and Harvey, 1985a,b). More refined approaches rely on the incorporation of evolutionary information (possibly inferred from a taxonomy) to perform an analysis based on models for derived quantities that can be treated as independent, rather than for the original measured traits that exhibit species-related dependencies. For example, this is the case for a popular type of analysis based on phylogenetically independent contrasts (Felsenstein, 1985; Garland Jr et al., 1992). MacEachern and Peruggia (2002) show that traditional Bayesian variance components models applied directly to allometric data for which taxonomic information is available can produce a good fit and yield easily interpretable inferences.

In this article we reanalyze the data that MacEachern and Peruggia (2002) used to illustrate their methods.
The data comprise the body and brain mass measurements on 100 species of placental mammals originally reported by Sacher and Staffeldt (1974) as well as a taxonomy that assigns each species to an order and sub-order based on its morphological and physiological traits. For example, the African lion (Panthera Leo) is listed as belonging to the order Carnivora and sub-order Fissipeda and the chimpanzee (Pan Troglodytes) is listed as belonging to order Primates and sub-order Anthropoidea. In total, the data contain species that represent 13 orders and 19 sub-orders.

The data are shown in Figure 1. The left panel displays a scatterplot of the centered (i.e., mean adjusted) log body mass and log brain mass for the 100 mammalian species. The overlaid line is the least-squares fit from a naive simple linear regression model. The residuals for this model are shown in the right panel of Figure 1. This plot raises some concerns about model fit. First, there is a slight increase in residual variability as log body mass grows. Second, other features of the residuals can be traced back to the species orders (distinguished by plotting color): all Primates (red points) have positive residuals, while most Rodentia (blue points) have negative residuals. This structure points to the fact that the least squares line does not properly account for within-order similarities in the allometric relationship.

Figure 1: Mammals data (left) with the estimated least-squares regression line and residuals (right) from the least-squares regression fit.

MacEachern and Peruggia (2002) present a detailed evaluation that uncovers the lack-of-fit of this model, introducing Bayesian model diagnostic techniques based on case-deletion importance sampling weights (Geweke, 1989; Bradlow and Zaslavsky, 1997; Peruggia, 1997; Epifani et al., 2008; Thomas et al., 2018). They also present a Bayesian variance components model that includes additive random effects for orders and sub-orders. These random effects induce positive correlations between the residuals of species belonging to the same taxonomic groups and subgroups. The random effect adjustment effectively creates a separate regression line with its own intercept for the species within each subgroup. MacEachern and Peruggia (2002) show that this way of accounting for the taxonomic information significantly ameliorates the quality of the fit.

Suppose now that no information about the taxonomy were available. We might wish to remedy the lack-of-fit of the simple linear regression model, but a random effects model is not an option when the true group memberships are not known. In such situations, a reasonable modeling alternative is to assume that there exist finitely many subgroups of observations for which separate regression lines are appropriate, and yet it is unknown which observations to associate with which regression. This leads naturally to the formulation of a mixture of regressions model in which, conditional on unobserved group membership indicators, observations falling in separate groups follow separate regression models.
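To make the naive model concrete, the least-squares fit on the log scale can be sketched as follows. This is illustrative code with simulated stand-in data (the slope and noise level below are made up for the example); the actual analysis uses the measurements of Sacher and Staffeldt (1974).

```python
import numpy as np

# Simulated stand-in for centered log body mass and log brain mass;
# the slope plays the role of the power-law exponent b.
rng = np.random.default_rng(0)
log_body = rng.uniform(-6, 6, size=100)
log_brain = 2.0 + 0.75 * log_body + rng.normal(scale=0.5, size=100)

# Naive simple linear regression: log brain = (log c) + b * log body.
X = np.column_stack([np.ones_like(log_body), log_body])
intercept, slope = np.linalg.lstsq(X, log_brain, rcond=None)[0]
residuals = log_brain - (intercept + slope * log_body)
```

Plotting `residuals` against `log_body`, as in the right panel of Figure 1, is what reveals the within-order structure that a single line misses.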
Using these mammalian species data as the foundation for a case study, we consider the specification of a Bayesian mixture of regressions to account for unobserved heterogeneity in a data set. The fundamental difference between the mixture setting and the random effects setting considered by MacEachern and Peruggia (2002) is that, in the former, group memberships are latent and are the subject of inferential investigation. An additional difference is that we will allow for group-specific slopes in addition to group-specific intercepts.

A unique challenge arises when modeling with Bayesian mixture models: a typical Bayesian formulation would specify, a priori, a fully exchangeable mixture, yielding a fully exchangeable posterior. A fully exchangeable model is indifferent to any arbitrary relabeling of the mixture components implying that, a posteriori, each observation is equally likely to follow any of the postulated regression models. This phenomenon, often referred to as label switching, precludes any informative assignment of labels to the mixture components and prevents meaningful interpretation of the corresponding component-specific parameter estimates. Kunkel and Peruggia (2018) introduce a modeling device called anchoring that breaks the labeling invariance of the exchangeable version of a Bayesian Gaussian mixture model. The idea is to identify small subsets of representative observations, called the anchor points, and allocate them to separate homogeneous groups before performing the analysis. Kunkel and Peruggia (2018) characterize the equivalence between this process and the specification of a weakly data-dependent prior for the component-specific parameters. A shrewd selection of anchor points is essential for the successful implementation of the strategy.
They propose two optimality criteria, one based on minimizing the entropy of the asymptotic distribution of class labeling and one based on a modified EM algorithm.

Motivated by the mixture of regressions model and its application to our allometric data, we extend the work of Kunkel and Peruggia (2018) in two main directions. First, we generalize the EM anchoring strategy from the case of multivariate Gaussian mixtures to the case of mixture of regressions models. Second, we introduce a new strategy for the selection of anchor points that builds on the work on case-deletion analysis presented in MacEachern and Peruggia (2002) and Thomas et al. (2018). We compare this new strategy for the selection of anchor points (which we call CDW-reg) to the two EM anchoring strategies based on the bivariate Gaussian mixture model (which we call EM-MVN) and based on the mixture of regressions model (which we call EM-reg). The three strategies differ with respect to the model that they subsume (no regression structure, a single regression line, or several regression lines) and how they extract information from the data to determine which points to select as representative of the various mixture components (a principal components analysis of the log case-deletion weights or the EM algorithm).

In Section 2 we present the mixture of regressions model and its corresponding anchor model. In Section 3 we present the EM method for selecting anchor points under the mixture of regressions and multivariate Gaussian mixture models. In Section 4 we describe the CDW-reg method for choosing anchor points via clustering of case-deletion weights. The results of our analysis of the mammals data are presented in Section 5.
In this section we give a formal definition of the mixture of regressions model and its anchored version that we will employ to analyze the mammals data.
The k-component mixture of linear regressions model specifies that the observations (y_i, x_i), i = 1, ..., n, are drawn from k homogeneous subgroups within the population. The response y_i is a scalar and x_i ∈ R^p is a row vector of predictors whose first element is equal to one. The data are denoted by y′ = (y_1, ..., y_n) and by the n × p matrix X with i-th row equal to x_i. The mixture likelihood is

f(y | X, β, σ², η) = ∏_{i=1}^n ∑_{j=1}^k η_j φ(y_i; x_i β_j, σ²),   (1)

where η′ = (η_1, ..., η_k) is a vector of mixture probabilities that satisfy ∑_{j=1}^k η_j = 1 and φ(·; a, b) denotes the density function of a normal distribution with mean a and variance b, evaluated at its argument. The component-specific parameters in this model are β = (β_1, ..., β_k), where β_j ∈ R^p is the vector of regression coefficients associated with the j-th component.

This model can be written equivalently by introducing latent allocations s = (s_1, ..., s_n), where s_i = j indicates that observation i was generated from component j. Conditional on s_i = j, i = 1, ..., n, the mean of y_i is x_i β_j and the model is a random effects regression. Introducing a probability distribution on s in which P(S_i = j) = η_j, for j = 1, ..., k, produces the mixture of regressions model in (1).

We specify the following exchangeable prior:

β_j | µ_β, V ∼ N_p(µ_β, V), j = 1, ..., k,
σ⁻² | a, b ∼ Gamma(a, b),   (2)
η ∼ Dirichlet(α, ..., α),

where V is a diagonal matrix whose p-th diagonal element is v_p and the Gamma distribution is parameterized to have mean a/b. This type of exchangeable specification is a natural way of expressing prior ignorance about the relation between y and x within each group: we make no assumptions that induce different prior distributions on the β_j.
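In code, the mixture likelihood (1) can be evaluated directly. The sketch below is our own illustration (all names are ours), assuming a common error variance σ² across components as in the model above.

```python
import numpy as np
from scipy.stats import norm

def mixture_loglik(y, X, beta, sigma2, eta):
    """Log-likelihood of the k-component mixture of regressions in (1).

    y: (n,) responses; X: (n, p) design matrix (first column of ones);
    beta: (k, p) component-specific coefficients; sigma2: common error
    variance; eta: (k,) mixture probabilities summing to one.
    """
    means = X @ beta.T                       # (n, k): entry (i, j) is x_i beta_j
    dens = norm.pdf(y[:, None], loc=means, scale=np.sqrt(sigma2))
    return np.log(dens @ eta).sum()          # sum_i log sum_j eta_j phi(...)
```

With k = 1 and eta = (1,), this reduces to the ordinary Gaussian regression log-likelihood.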
An unfortunate consequence of the exchangeable specification is that the posterior density of β is invariant to relabeling: for any permutation of the integers 1, ..., k, denoted by ρ_q(1:k), p(β | y) is equal to p(β_{ρ_q(1:k)} | y) for any value of β. In other words, the posterior density of a value of β is unchanged if the component labels are permuted. This produces k! symmetric regions in the posterior distribution of β, each corresponding to one of the k! possible labelings of the mixture components, and the marginal distributions of the component-specific parameters will all be identical. This is undesirable for several reasons (Jasra et al., 2005). Inferentially, it is impossible to use estimates of the labeled parameters β_1, ..., β_k to infer distinctions among features of different groups. Computationally, the label switching phenomenon hampers fitting the model by Markov chain Monte Carlo simulation.

In previous work (Kunkel and Peruggia, 2018), we proposed a new class of models called anchor models, which pre-classify some observations in order to induce a non-exchangeable, data-dependent prior which can alleviate the inferential nuisances caused by this symmetry in the mixture model.

To specify an anchor model, we select a small number of points to be assigned a priori to particular components of the mixture model. The model is defined using k index sets A_1, ..., A_k, where A_j contains the indices of the observations to be pre-labeled (or "anchored") to component j.
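In code, the effect of anchoring on an observation's likelihood contribution can be sketched as follows; this is our own illustrative rendering of the anchored likelihood formalized in the next display, with hypothetical names throughout.

```python
import numpy as np
from scipy.stats import norm

def anchored_loglik(y, X, beta, sigma2, eta, anchors):
    """Log-likelihood of an anchored mixture of regressions.

    anchors: dict mapping component j (0-based) to the index set A_j of
    observations pre-classified to that component; unanchored observations
    keep the full mixture density.
    """
    means = X @ beta.T                       # (n, k) component means
    dens = norm.pdf(y[:, None], loc=means, scale=np.sqrt(sigma2))
    anchored = {i: j for j, idx in anchors.items() for i in idx}
    ll = 0.0
    for i in range(len(y)):
        if i in anchored:                    # anchored: its component's density only
            ll += np.log(dens[i, anchored[i]])
        else:                                # unanchored: mixture density
            ll += np.log(dens[i] @ eta)
    return ll
```

With an empty `anchors` dictionary this is the ordinary exchangeable mixture likelihood; each anchored case removes one mixture sum.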
The number of points in A_j, denoted by m_j, is chosen ahead of time and each observation is anchored to at most one component; i.e., A_j ∩ A_{j′} = ∅ for j ≠ j′. Given the anchor points, A = ∪_{j=1}^k A_j, the anchored version of the model (1) has likelihood f_A(y | X, β, σ², η) = ∏_{i=1}^n f_A(y_i | x_i, β, σ², η), with

f_A(y_i | x_i, β, σ², η) = { ∑_{j=1}^k η_j φ(y_i; x_i β_j, σ²), if i ∉ A;  φ(y_i; x_i β_j, σ²), if i ∈ A_j, j = 1, ..., k. }   (3)

Similarly, if the latent allocation variables are used, we have the equivalent representation f_A(y | X, β, σ², η) = ∏_{i=1}^n ∑_{s_i=1}^k P_A(S_i = s_i) f(y_i | s_i, x_i, β, σ², η), where f(y_i | s_i, x_i, β, σ², η) = φ(y_i; x_i β_{s_i}, σ²), P_A(S_i = j) = η_j I(i ∉ A) + I(i ∈ A_j), i = 1, ..., n, j = 1, ..., k, and I(·) is an indicator function that equals 1 when its argument is true and zero otherwise.

An anchored mixture model can be regarded as a hybrid between a random effects model, in which the class membership is known for all observations, and a pure mixture model, in which the class membership is unknown for all observations. It can be shown that the pre-labeled anchor points in the sets A_1, ..., A_k eliminate the prior exchangeability of the model that leads to label-switching if at least k − 1 of the anchor sets are nonempty.

Kunkel and Peruggia (2018) describe the anchored expectation-maximization (EM) algorithm, a modified version of the EM algorithm for maximum a posteriori estimation of anchored mixture models. For a generic vector γ = (γ_1, ..., γ_k) of the component-specific model parameters, the standard EM algorithm for mixture models uses the latent variable representation of the model and iterates between an E-step, which estimates a probability distribution p(s) on the latent allocations conditional on the current estimate of γ, and an M-step, which updates γ to maximize the expected (with respect to p(s)) joint posterior density of γ and s. Instead, the E-step of the modified algorithm estimates a probability distribution q(s) that corresponds to an anchor model for some choice of A = ∪_{j=1}^k A_j, that is, a distribution for which each A_j specifies m_j observations allocated to component j with probability 1. The sets A_j are chosen to make q(·) as close as possible to the unanchored distribution p(s). This method seeks to estimate an anchoring structure that produces a close approximation to the unanchored posterior distribution of γ near one of its local modes. The steps of this algorithm are outlined below for a generic parameter γ and mixture probability distribution p(y_i | γ) = ∑_{j=1}^k η_j f(y_i | γ_j).

3.1 General anchored EM

Initialize γ⁰, η⁰ and repeat the following steps until convergence.

E step:
Calculate the following distribution on the latent allocations: p(s | γ^{t−1}, η^{t−1}) = ∏_{i=1}^n p(s_i | γ^{t−1}, η^{t−1}), where

p(s_i = j | γ^{t−1}, η^{t−1}) = r_ij^t = η_j^{t−1} f(y_i | γ_j^{t−1}) / ∑_{l=1}^k η_l^{t−1} f(y_i | γ_l^{t−1}).   (4)

Anchor step:
Find A^t = ∪_{j=1}^k A_j^t to maximize ∑_{j=1}^k ∑_{i ∈ A_j} r_ij^t subject to A_j ∩ A_{j′} = ∅ for all j ≠ j′. Calculate the distribution q^t(s | γ^{t−1}, η^{t−1}) = ∏_{i=1}^n q^t(s_i | γ^{t−1}, η^{t−1}), where

q^t(s_i = j | γ^{t−1}, η^{t−1}) = r̃_ij^t = { r_ij^t, if i ∉ A^t;  1, if i ∈ A_j^t;  0, if i ∈ A_{j′}^t, j′ ≠ j. }   (5)

M step:
Update γ^t as arg max_γ E_{q^t} log(p(γ, η, s | y)), where the expectation is taken with respect to the distribution q^t defined in the Anchor step.

The Anchor step in the algorithm above amounts to assigning to component j the points with the highest posterior probability of allocation to component j given the current estimate of the model parameters, as long as this does not anchor any observations to more than one component. If that happens, an approximate solution may be used or linear programming algorithms can produce an exact solution if necessary. We have found the algorithm to be sensitive to the selection of a starting point and prone to visit local maxima, especially when modeling groups that are not well-separated. We thus recommend running the algorithm several times and selecting the solution with the largest ending value of E_{q^t} log(p(γ, η, s | y)).

This algorithm is applied to a specific model by substituting the appropriate j-th component density for f(y_i | γ_j^{t−1}) in Equation (4) of the E step above and deriving appropriate updates in the M step. The M step updates the values of the component-specific parameters to maximize ∑_{i=1}^n ∑_{j=1}^k r̃_ij^t log(f(y | γ, s)) + log(p(γ)) + log(p(η)), analogously to the standard EM algorithm for maximum a posteriori estimation. In the next subsections we outline the steps of this algorithm for two mixture models that we will use in our analysis of the mammals data.

The EM-reg anchoring method applies the anchored EM algorithm to the mixture of regressions model in (3). The E step requires updating the r_ij^t as

r_ij^t = η_j^{t−1} φ(y_i; x_i β_j^{t−1}, (σ²)^{t−1}) / ∑_{l=1}^k η_l^{t−1} φ(y_i; x_i β_l^{t−1}, (σ²)^{t−1}),   (6)

and the M step updates of β^t, (σ²)^t, and η^t are

β_j^t = (X′ R_j^t X + V^{−1})^{−1} (X′ R_j^t y + V^{−1} µ_β), j = 1, ..., k,   (7)

(σ^{−2})^t = (a + n/2 − 1) / (b + 0.5 ∑_{j=1}^k (y − X β_j^t)′ R_j^t (y − X β_j^t)),   (8)

η_j^t = (∑_{i=1}^n r̃_ij^t + α − 1) / (∑_{l=1}^k ∑_{i=1}^n r̃_il^t + k(α − 1)), j = 1, ..., k,   (9)

where R_j^t is an n × n diagonal matrix whose i-th diagonal element is r̃_ij^t.

We now introduce a second anchored EM scheme, EM-MVN, which we will apply to data of the type described in Section 2.1. The EM-MVN scheme assumes, in the anchoring stage, that the vectors z_i = (y_i, x_2i, ..., x_pi)′ arise from a mixture of multivariate normal distributions whose j-th component has mean θ_j ∈ R^p and covariance matrix Σ_j. The elements of z_i are the response y_i and the p − 1 predictors x_2i, ..., x_pi, excluding the first element of x_i equal to one. The distribution of z_i under the p-variate Gaussian mixture model is ∑_{j=1}^k η_j φ_p(z_i; θ_j, Σ_j), where φ_p(·; a, B) denotes the p-variate Gaussian density function with mean a and covariance B, evaluated at its argument.

The M step is derived assuming a conjugate normal-Wishart prior on (θ_j, Σ_j), in which θ_j | µ, κ, Σ_j ∼ N_p(µ, κ^{−1} Σ_j) and Σ_j^{−1} | ν, W ∼ Wishart_p(ν, W), a Wishart distribution with prior scale W and degrees of freedom ν. The updates for θ_j and Σ_j are given by:

θ_j^t = (κ^{−1} µ + ∑_i r_ij^t z_i) / (κ^{−1} + ∑_{i=1}^n r_ij^t), j = 1, ..., k,   (10)

Σ_j^t = (W^{−1} + ∑_i r_ij^t (z_i − θ_j^t)(z_i − θ_j^t)′ + κ^{−1} (µ − θ_j^t)(µ − θ_j^t)′) / (ν − p + ∑_{i=1}^n r_ij^t), j = 1, ..., k.   (11)

The update for η_j is unchanged from (9).

Multivariate Gaussian mixture models are frequently used for classification and are a natural default choice for selecting anchor points. In contrast to the EM-reg method described in the previous subsection, this selection method does not condition on X, nor does it assume that an underlying mixture of regressions holds. In practice, the two methods select very different anchor points. This owes to the fact that the selection is determined by the allocation probabilities r_ij, which, under the MVN model, will be large for observations near the p-variate mean θ_j of component j. For the EM-reg method, however, the magnitude of r_ij is driven by the observation's proximity to the regression line determined by both β_j and the x-values. This fundamental difference noticeably affects the resulting inferences, as we will see in the sequel.

For the analysis of the mammals data presented in the remainder of the paper, we used the two EM methods outlined in the previous sections to specify anchor models with k = 3 mixture components and m = 3 anchor points in each component. The choice of this number of components is motivated by a desire to restrain model complexity while retaining sufficient flexibility to capture salient morphological differences between groups of species. The choice of this number of anchor points per component aims at imposing enough data-dependent prior information so as to control the label switching problem. We set the prior hyperparameters in (2) for EM-reg to be a = 5, b = 1, µ_β = (3. , . )′, and v_1 = 1, v_2 = 0. For EM-MVN we set µ = z̄, the sample mean vector of the z_i, κ = 1. , W = 1. I, where I denotes the 2 × 2 identity matrix, and ν = 2. The left and center panels of Figure 2 show the anchor points selected by EM-reg and EM-MVN, respectively.

Figure 2: Selected anchor points for the mammals data. The legend indicates the order of each anchor point.

The EM-reg method chooses points that provide prior information about the distinct linear relationships for each component. The anchor points for the green and blue groups identify approximately parallel lines that differ only in intercept. The red points suggest a line with a steeper slope than that of the other two groups. None of the EM-reg anchor points have extreme x or y values while they fall almost exactly in straight-line patterns. This is in contrast to the choices operated by method CDW-reg (described in Section 4), which selected more influential points to be among the anchors, as shown in the right panel of Figure 2.

The middle panel of Figure 2 shows the EM-MVN anchor points. For these data, the EM-MVN method selects tight clusters of anchor points that are close to each other in both the x and y directions. The selected anchor points do not suggest any distinct features among the regression lines in our target model because they show almost no variability in the x direction; this is a consequence of using a selection method that does not condition on x. It is also interesting to observe that this method, favoring anchor points that fall on the extreme ends of the observed data, ends up selecting some of the same high leverage points identified by CDW-reg.
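For concreteness, the anchored EM scheme for the mixture of regressions, combining the E step (6), a greedy version of the Anchor step, and the M-step updates (7)-(9), can be sketched as follows. This is our own illustrative implementation, not the code used for the analyses; names are ours, the Anchor step is the greedy approximation rather than an exact linear-programming solution, and the η update uses the standard MAP form for a symmetric Dirichlet(α) prior.

```python
import numpy as np
from scipy.stats import norm

def anchored_em_reg(y, X, k, m, V_inv, mu_beta, a, b, alpha,
                    n_iter=100, seed=0):
    """Illustrative anchored EM for the mixture of regressions model."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    beta = rng.normal(size=(k, p))        # random start; rerun from many starts
    sigma2, eta = float(np.var(y)), np.full(k, 1.0 / k)
    for _ in range(n_iter):
        # E step (6): responsibilities r[i, j]
        dens = norm.pdf(y[:, None], loc=X @ beta.T, scale=np.sqrt(sigma2))
        r = dens * eta
        r /= r.sum(axis=1, keepdims=True)
        # Anchor step: greedily anchor the m highest-responsibility points
        # per component, keeping the anchor sets disjoint, as in (5)
        r_tilde = r.copy()
        taken, counts = set(), {j: 0 for j in range(k)}
        for i, j in zip(*np.unravel_index(np.argsort(r, axis=None)[::-1], r.shape)):
            if i not in taken and counts[j] < m:
                r_tilde[i] = 0.0
                r_tilde[i, j] = 1.0
                counts[j] += 1
                taken.add(i)
        # M step (7)-(9)
        rss = 0.0
        for j in range(k):
            XtR = X.T * r_tilde[:, j]                       # X' R_j
            beta[j] = np.linalg.solve(XtR @ X + V_inv, XtR @ y + V_inv @ mu_beta)
            resid = y - X @ beta[j]
            rss += resid @ (r_tilde[:, j] * resid)
        sigma2 = (b + 0.5 * rss) / (a + n / 2 - 1)          # inverse of (8)
        totals = r_tilde.sum(axis=0)
        eta = (totals + alpha - 1) / (totals.sum() + k * (alpha - 1))
    return beta, sigma2, eta, r_tilde
```

As recommended in the text, such a routine should be run from several starting points, keeping the solution with the largest ending objective value.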
In a Bayesian context, case-deletion analysis quantifies the influence of an individual observation on the overall analysis by comparing the posterior distribution conditional on the entire data set and the posterior distribution conditional on the reduced data set obtained by omitting the observation under consideration. Case-deletion analysis is an effective tool for identifying influential observations in Bayesian models (Bradlow and Zaslavsky, 1997; MacEachern and Peruggia, 2002). It also provides an avenue to assess the similarity of the impact of observations on the inferential conclusions and, as a byproduct, a means to cluster the observations on the basis of a similarity measure derived from the specified model. We now derive a strategy for selecting the anchor points of a Bayesian Gaussian mixture model as "typical" representatives of the clusters identified via a preliminary case-deletion analysis based on a model which assumes the existence of a single mixture component.

Consider a set of observations y = (y_1, ..., y_n) following a model with density f(y | θ) conditional on a set of parameters θ having prior distribution π(θ). The posterior distribution of θ given y is proportional to the joint distribution of y and θ, which is q(y, θ) = f(y | θ) π(θ). Similarly, denoting by y_{\i} the reduced data set obtained by deleting observation i, the posterior distribution of θ given y_{\i} is proportional to q_{\i}(y_{\i}, θ) = f(y_{\i} | θ) π(θ).

The ratio between the case-deleted and full posterior reacts to the influence of the deleted case on the inferential conclusions. For this reason it is useful to understand the behavior of the random variables

w_i(θ) = q_{\i}(y_{\i}, θ) / q(y, θ), i = 1, ..., n,   (12)

when θ follows the posterior distribution conditional on the entire data set. In practice, we can compute the normalized empirical versions of the ratios in (12) using a sample θ_1, ..., θ_M from (approximately) the posterior distribution of θ given y. These quantities, known as case-deletion importance sampling weights, are given by

w̄_i(θ_m) = w_i(θ_m) / ∑_{ℓ=1}^M w_i(θ_ℓ), i = 1, ..., n; m = 1, ..., M.   (13)

An important practical use of the normalized case-deletion weights is to reweight empirical averages based on samples from the full posterior to approximate expectations with respect to the case-deleted posterior (Geweke, 1989). The behavior of the resulting weighted estimator is affected by the theoretical properties of the unnormalized weights in (12). For example, a fundamental result in Geweke (1989) requires that the unnormalized weights have finite variance with respect to the full posterior for a central limit theorem for the weighted estimator to hold.

In addition, the theoretical variability of the w_i(θ) and the sample variability of the w_i(θ_m) and w̄_i(θ_m), m = 1, ..., M, are indicators of the influence of observation i on the posterior distribution, with higher variability indicating larger influence (Bradlow and Zaslavsky, 1997; Peruggia, 1997; Epifani et al., 2008). The covariance matrix (with respect to the full posterior distribution of θ) of the log weights, C = [C_ij] = [Cov(log w_i(θ), log w_j(θ))], is a particularly useful quantity: C_ij can be interpreted as summarizing the degree of similarity between the influence of deletion of case i and case j. Further, for models of conditional independence, Thomas et al. (2018) detail the existence of a close relationship between C and measures of influence based on infinitesimal geometric perturbations of the multiplicative contributions of each observation to the overall likelihood. In the perturbed likelihood, the multiplicative factor corresponding to each observation is raised to a power ω_i, i = 1, ..., n. The original likelihood is recovered by setting ω_i = 1, i = 1, ..., n. Thomas et al. (2018) show that, in a neighborhood of ω = (1, ..., 1)′, C characterizes the curvature of the n-dimensional surface representing the Kullback-Leibler divergence of the posterior based on the original likelihood from the posterior based on the geometrically perturbed likelihood. Thus, C contains information about the directions in R^n along which the Kullback-Leibler divergence surface changes more rapidly in response to geometric likelihood perturbations. This insight can be used to assess the directional influence of cases.

As an exploratory tool for assessing such influence, Thomas et al. (2018) recommend computing the sample covariance matrix Ĉ based on the sample θ_1, ..., θ_M and performing a principal component analysis (PCA) based on an eigendecomposition of the estimated matrix. Typically, the first several components explain most of the observed variability in the log weights and summarize well the main directions of influence. A PCA display consisting of a scatterplot of the first two or three normalized eigenvectors helps to reveal structure in the data: points with high loadings in one or more components are particularly influential and points with similar loadings in all components have similar influence.

We propose leveraging these ideas to form a strategy for choosing anchor points based on the case-deletion weights, which we call CDW-reg. First, we fit a simple model that does not accommodate latent heterogeneity in the data. To specify an appropriate mixture model that improves on this naive model, we identify clusters based on a PCA summary of the log case-deletion weights. Thereby we uncover homogeneous subsets of observations that have similar directional influence on the posterior distribution and for which separate regression structures are possibly needed. Our anchor points are chosen to be representatives of clusters in the PCA display.
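As a sketch of the computations behind the PCA display (our own illustrative code, with hypothetical names): for a model of conditional independence, q_{\i}(y_{\i}, θ)/q(y, θ) = 1/f(y_i | θ), so the log weight for case i is just the negated per-observation log-likelihood contribution evaluated at each posterior draw.

```python
import numpy as np

def cdw_pca(loglik_contrib, n_components=3):
    """Case-deletion weight summaries from an MCMC sample.

    loglik_contrib: (M, n) matrix of per-observation log-likelihood
    contributions log f(y_i | theta_m) over M posterior draws.  Under
    conditional independence, log w_i(theta_m) = -loglik_contrib[m, i];
    (13) normalizes each case's weights over the M draws, and the
    eigendecomposition of the sample covariance of the log weights
    yields the PCA display of the directions of influence.
    """
    log_w = -loglik_contrib                        # (M, n) log case-deletion weights
    # Normalized importance weights, one set per deleted case (Eq. 13);
    # subtract the per-case max before exponentiating for stability.
    w_bar = np.exp(log_w - log_w.max(axis=0))
    w_bar /= w_bar.sum(axis=0)
    C_hat = np.cov(log_w, rowvar=False)            # sample covariance, estimate of C
    evals, evecs = np.linalg.eigh(C_hat)
    order = np.argsort(evals)[::-1][:n_components] # leading components
    return w_bar, evals[order], evecs[:, order]
```

Scatterplotting the columns of the returned eigenvector matrix against each other gives a display of the kind shown in Figure 3.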
To apply the proposed technique to the analysis of the mammals data, we began by fitting a Bayesian simple linear regression model. This is the same model found to be inadequate by MacEachern and Peruggia (2002). The left panel of Figure 3 shows the PCA display based on the first three eigenvalues of the eigendecomposition of the sample covariance matrix Ĉ of the empirical case-deletion log weights for the 100 species. The weights were computed from an MCMC run of length 10,000 implemented using the JAGS software (Plummer, 2003).

Figure 3: PCA display based on the first three eigenvalues from the eigendecomposition of the sample covariance matrix of the log case-deletion weights (left) and resulting k-means clustering with highlighted selected anchor points (right).

Again postulating three mixture components, we ran the R implementation of the k-means clustering algorithm (Hartigan and Wong, 1979), identifying the red, green, and blue clusters in the PCA display. The scatterplot in the right panel of Figure 3 shows how the identified PCA clusters map back into observation space. The clustering appears to be responding primarily to the size and sign of the residuals from the fitted simple linear regression line. The red cases tend to have larger positive residuals, the blue cases tend to have larger negative residuals, and the green cases are in the middle. As an exploratory step, we superimposed on the three clusters the least-squares regression lines estimated from the points in each cluster.
The blue and green lines are essentially parallel, while the red line has a slightly larger slope. There is a hint of non-constant variance in the scatterplot, and the differences in slopes are conceivably driven by the reaction of the PCA-induced clustering to this feature of the data (in addition to the size of the residuals).

As a next step, we identified three representative observations within each PCA cluster to be used as anchor points in the anchored Bayesian mixture of regressions model. To do so, for each of the three original clusters, we again ran the k-means clustering algorithm to identify three PCA sub-clusters. A representative point from each sub-cluster was chosen as the point nearest to the sub-cluster centroid. The selected nine points (three for each cluster) are drawn as solid dots in the PCA display and scatterplot of Figure 3. The same nine points are also displayed in the right panel of Figure 2, where the orders to which the nine species belong are also identified.

When observed in the scatterplot, the selected anchor points can be seen to be adequately spaced out in the x dimension, a feature that will provide estimation stability for the Bayesian mixture of regressions model. We also see that, within each cluster, the set of anchor points comprises some points with apparently high influence (points with large positive and negative residuals in the red and blue groups, respectively, and one point with a very small x value in the green group) and points that fall very near the least-squares lines. This feature follows from the way in which k-means allocates member species to the various sub-clusters, with some sub-clusters comprising mostly "usual" observations and other, typically smaller, sub-clusters comprising mostly "unusual" observations.

We now present the inferential results obtained by fitting the mixture of regressions model to the mammals data under the three sets of anchor points.
One byproduct of this analysis will be to identify groups of mammalian species that are similar with respect to the brain/body mass relationship. Our model fit uses the same hyperparameters specified in the EM-reg selection method: a = 5, b = 1, µβ = (3. , . )′, and v = 1, v = .5. For each anchor model we used the JAGS software (Plummer, 2003) to obtain 20,000 posterior samples of the model parameters via MCMC simulation and assessed convergence using trace plots.
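To make the role of the anchor points concrete, the sketch below evaluates the log likelihood of an anchored mixture of regressions: anchored observations contribute their pre-assigned component's density with no mixing, while the remaining observations are marginalized over the latent labels. The function name, array layout, and dict encoding of the anchor assignments are illustrative assumptions; the actual fit reported here was done by MCMC in JAGS:

```python
import numpy as np
from scipy.stats import norm
from scipy.special import logsumexp

def anchored_mixture_loglik(y, x, beta, sigma, log_pi, anchors):
    """Log likelihood of an anchored mixture of regressions.

    beta    : (J, 2) array of component (intercept, slope) pairs
    log_pi  : (J,) log mixing weights
    anchors : dict {observation index: component label} of pre-classified
              anchor points (an illustrative encoding).
    """
    J = beta.shape[0]
    comp = np.stack([norm.logpdf(y, beta[j, 0] + beta[j, 1] * x, sigma)
                     for j in range(J)])                   # (J, n) log densities
    ll = 0.0
    for i in range(len(y)):
        if i in anchors:
            ll += comp[anchors[i], i]    # anchored: label known with prob. one
        else:
            ll += logsumexp(log_pi + comp[:, i])   # marginalize over s_i
    return ll
```

Pre-classifying an observation removes the corresponding mixing term, which is what breaks the relabeling invariance of the exchangeable specification.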
Figure 4: Posterior mean regression lines for each mixture component under the EM-reg, EM-MVN, and CDW anchor models. The data are color-coded according to their MAP estimated allocation.
We used the posterior means of the model parameters to estimate component regression lines for each anchor model. (These are the component-specific, posterior predictive regression lines.) These lines are shown in Figure 4, with Components 1, 2, and 3 drawn in red, green, and blue, respectively. We also estimated group membership for each species by finding a maximum a posteriori (MAP) estimate of its latent allocation, s_i. Using the posterior samples s_i^(m), m = 1, . . . , M, the Monte Carlo estimate of the MAP allocation is ŝ_i = arg max_j Σ_{m=1}^{M} I(s_i^(m) = j). Each data point in Figure 4 is color-coded according to its corresponding value of ŝ_i. The anchored points, whose group assignments are assumed to be known, are shown as solid symbols to distinguish them from the remaining observations, whose group assignments are estimated.

All three methods have identified Component 1 (shown in red) to be a subgroup whose slope is considerably steeper than those of the other groups, representing a subset of species whose brain masses show an unusually large increase as body masses increase. The estimated regression lines are similar across the methods, with EM-reg estimating the lowest slope (0.88) and EM-MVN estimating the highest (0.93). The three methods assign many of the same species to Component 1, identifying a string of points in the upper-right end of the scatterplot to be those with the unusually steep regression line. The only difference is that the EM-reg model assigns four additional points to Component 1, two of which are its anchor points. These points, which appear in the lower left portion of the left panel of Figure 4, are assigned to Component 2 under the other two anchor models. These species have low body masses compared to the other species allocated to Component 1, and they fall in an area where the estimated regression lines for Components 1 and 2 overlap.
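As an aside on computation, the Monte Carlo MAP allocation used for the color-coding in Figure 4 reduces, for each observation, to a majority vote over the sampled labels. A minimal sketch (the (M, n) layout of the sampled allocations is our assumption):

```python
import numpy as np

def map_allocation(s_samples, n_components):
    # s_samples: (M, n) integer array; entry (m, i) is the sampled label
    # s_i^(m). For each observation i, return the label with the highest
    # posterior frequency across the M draws.
    counts = np.stack([(s_samples == j).sum(axis=0)
                       for j in range(n_components)])      # (J, n) tallies
    return counts.argmax(axis=0)
```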
In fact, one of the EM-reg anchor points (triangle) falls almost exactly on the intersection of the two lines; it is only its selection as an anchor point that ensures it is assigned to Component 1.

Component 3, plotted in blue in Figure 4, represents, broadly speaking, species with brains that are small relative to species of the same size. The estimated regression lines are similar for the CDW-reg and EM-reg models. Both models estimate that this group has a small intercept and a slope of about 0.7, which is similar to that of Component 2 (plotted in green). Under EM-reg, 41 of the 100 species are assigned to this component, and these species have a wide range of body masses (x-values). The CDW-reg model assigns fewer species to this group (31), but these species share the characteristics of having small brains relative to their body masses and of spanning a wide range of body masses. The features of Component 3 are quite different under the EM-MVN model; the regression line has a shallow slope and a large intercept, causing the line to move away from the observed data for small values of x. The estimated regression line under EM-MVN is predominantly representative of the x-y relationship among the six species that EM-MVN assigns to Component 3. These species form a small cluster of observations with large body masses on the far right end of the scatterplot. They have small brains relative to their body mass and are also assigned to Component 3 under CDW-reg and EM-reg.

Component 2, with regression lines drawn in green, represents the species with moderate brain masses relative to their body masses. Under the EM-MVN model, this group is estimated to describe the majority of species; the model has identified two relatively small subsets of species that differ meaningfully from this main group. The two other methods have assigned comparatively fewer observations to this component.
The EM-reg model estimates a Component 2 line with a larger intercept and shallower slope than the other methods. This difference can be attributed to the influence of the EM-reg anchor points, which fall in the center of the scatterplot; the green line fits these points quite well, as well as many other points in this region. For the other two methods, both of which have anchor points with low x-values, the line fits the data well in the low-x range, and less well in the middle of the scatterplot; in particular, there is a string of points at moderate log body masses that the line fits poorly. The EM-MVN anchoring does not condition on x, which favors points on the periphery of the scatterplot that are representative of bivariate elliptical clusters. The effect of these anchor points is that the inferred regression line moves to accommodate these points, and it need not reflect the x-y relationship in other areas of the sample space. A very different behavior is seen in the EM-reg method: here we have conditioned not only on x, but also on a model specifying three distinct regression lines. The selected anchor points are highly representative of three distinct lines, and the resulting inference produces estimates that closely fit the lines to the anchor points.

The estimated component assignments ŝ give a model-based grouping of the species which ignores the additional data on the species' taxonomic orders and suborders.
A comparison of these groups with the true taxonomy of the species can shed light on the allometric questions posed at the beginning of this article: the species assigned to the same component by ŝ have similar estimated regression slopes, and if there is a correspondence between ŝ and the species' true orders, it may indicate that certain taxonomic groups have distinct body mass/brain mass relationships.

Figure 5: Data for the species in orders Primates, Artiodactyla, and Rodentia color-coded according to their model-based ŝ. The rows correspond to the three anchor models.

Figure 5 displays the data from species in three of the 13 orders, color-coded according to the model-based cluster assignment. The displayed orders are Primates, Artiodactyla, and Rodentia, which account for 21, 21, and 24 of the 100 species in the data, respectively.

The first column of the figure shows the Primate species, which, for all three models, are split between Components 1 and 2. Given body mass, most large-brained species are assigned to Component 1 and most of those with smaller brains are assigned to Component 2. Component 1 also contains the three species of the Cetacea order (not pictured) under all three model fits. Interestingly, all three models assign the same seven species of Primates to Component 2.
Three of these species belong to the sub-order Prosimii (all other primates in the data set belong to the sub-order Anthropoidea). So, the mixture model is sensitive to this aspect of the taxonomic classification, recognizing that the Prosimii species have small brains given their body masses. The two extreme Primate species in the x dimension are the Gorilla in the upper end, with a brain smaller than expected, and the Hepale Leucocephala in the lower end. The latter species' brain is actually large given its body mass, but it is classified as a Component 2 species because none of the three models uses Component 1 to describe Primate species in the lower x range.

The clustering among the Rodentia highlights some additional interesting features of the model fit. Two species are estimated to belong to the large-brained Component 1 under the EM-reg method, in part because one of the anchor points for Component 1 was from the Rodentia order. Although these estimated groups have no well-defined meaning, we may view these red Rodentia species as something of a misclassification, since Component 1 primarily describes the Primates and Cetacea species under the other two models. A partial explanation for this is that larger body masses are a characterizing feature of these species, but the EM-reg method chooses anchor points conditional on x and thus is less likely to identify groups that are characterized by differences in x-values.

When we specify a finite mixture model, we assume the existence of distinct subgroups in the population. The group membership of any individual observation is unknown, and the mixture likelihood combines with the prior specifications (including the prior group membership probabilities) to determine both the estimated features of the mixture components and the posterior probabilities of group membership.
A random effects model similarly allows inference on group-specific features, but requires auxiliary, deterministic information indicating which observations are to be grouped together, and the similarities among these observations drive the estimated features of the various groups. In an anchored mixture model, the anchored observations affect the model fit in the same way as labeled observations in a random effects model affect inferences on their groups: they inform the distinct features of their mixture components. These features then influence the probabilities of group membership of the remaining unanchored observations. Thus, the similarities among a component's anchor points are a key driver of what types of groups the mixture model identifies.

In our case study, the EM-reg method chose anchor points to fall along straight lines, and its accompanying mixture model defined groups based on proximity to these lines. On the other hand, the EM-MVN method selected anchor points to form tight clusters within the bivariate scatterplot, inducing a mixture model whose components are more apt to represent groups that are similar in both the x and y dimensions. The CDW-reg method selected anchor points that were representative of groups of observations exerting similar influence on the posterior inferences from a naive model. Which method we prefer is somewhat situation-dependent. For example, if we truly believe the mixture model we have specified, a method such as EM-reg, which subsumes this model at the anchoring stage, is perfectly appropriate. However, in our case study this perspective was restrictive in its inability to detect differences in groups that depend on x.

The anchoring of EM-MVN assumes a completely different model and, in some sense, produces a mixture of regressions model that is more descriptive of clusters in bivariate space than of groups of similar x-y relationships.
The resulting inferences lack cohesiveness because of the mismatch between the anchoring method and the model of analysis. The CDW-reg method falls between these two extremes: the anchor points are selected using the principle that a mixture model is appropriate because the simple model is inadequate; the mixture components are identified as groups exhibiting similar misfit; and the estimated model is consistent with the mixture of regressions but flexible in accommodating unexpected features in the data. This flexibility suggests that the CDW-reg method has potential as an effective strategy for specifying a wide variety of anchored mixture models. Its reliance on clustering performed in the derived space determined by the PCA eigendecomposition of the log case-deletion weights enables the method to select anchors by homing in on features that directly affect the quality of model fit.

In part, the choice to fix the number of mixture components to three in the interest of modeling parsimony limits the ability of the mixture model to adapt to finer structure in the data. On the other hand, adding more components and companion anchor points would make the reliance on a data-dependent prior too substantial.

The known taxonomic structure of the species in our case study provides insight about the model-based classifications produced by the three anchor models which extends beyond the particular data we analyzed. In our data, the groups of species from the same order exhibit distinct characteristics that might be primarily revealed by similarities in brain or body mass (the Artiodactyla species, for example, all have above-average body mass), by similarities in the change in brain mass as body mass changes, or both (as we see for the large primates). The EM-MVN model is useful in identifying groups of the former type because it selects anchor points that correspond to bivariate clusters in the data.
However, this model is less flexible in identifying homogeneous groups that span a wide range of x values. The EM-reg model is particularly well-suited for identifying groups of the latter type because its anchor points are selected using the mixture of regressions, but it ignores differences in x, which provide valuable information for classification in this setting, where body mass is a distinguishing feature of some of the taxonomic groups.

Thoughtful application of anchor models requires understanding of these differing characteristics. The field of semi-supervised learning has addressed many relevant questions regarding the combination of labeled and unlabeled data, such as selecting some observations to artificially pre-label (Yarowsky, 1995) and constraining certain observations to the same group in an effort to learn a distance metric that discerns similarity among future observations (Shental et al., 2002). We think that integration of our methods with these well-studied problems is a promising direction for future research.

In sum, to select a method for finding anchor points, it is important to consider which types of distinguishing features the underlying model uses to identify similar observations and to determine how such features become relevant to answering the scientific questions of interest.

References
Bennett, P. M. and Harvey, P. H. (1985a). Relative brain size and ecology in birds. Journal of Zoology, 207(2):151–169.

Bennett, P. M. and Harvey, P. H. (1985b). Brain size, development and metabolism in birds and mammals. Journal of Zoology, 207(4):491–509.

Bradlow, E. T. and Zaslavsky, A. M. (1997). Case influence analysis in Bayesian inference. Journal of Computational and Graphical Statistics, 6(3):314–331.

Epifani, I., MacEachern, S. N., and Peruggia, M. (2008). Case-deletion importance sampling estimators: Central limit theorems and related results. Electronic Journal of Statistics, 2:774–806.

Felsenstein, J. (1985). Phylogenies and the comparative method. The American Naturalist, 125(1):1–15.

Garland Jr, T., Harvey, P. H., and Ives, A. R. (1992). Procedures for the analysis of comparative data using phylogenetically independent contrasts. Systematic Biology, 41(1):18–32.

Gayon, J. (2000). History of the concept of allometry. American Zoologist, 40(5):748–758.

Geweke, J. (1989). Bayesian inference in econometric models using Monte Carlo integration. Econometrica, 57(6):1317–1339.

Hartigan, J. A. and Wong, M. A. (1979). Algorithm AS 136: A k-means clustering algorithm. Journal of the Royal Statistical Society, Series C (Applied Statistics), 28(1):100–108.

Jasra, A., Holmes, C. C., and Stephens, D. A. (2005). Markov chain Monte Carlo methods and the label switching problem in Bayesian mixture modeling. Statistical Science, 20(1):50–67.

Jerison, H. J. (1955). Brain to body ratios and the evolution of intelligence. Science, 121(3144):447–449.

Kunkel, D. and Peruggia, M. (2018). Anchored Bayesian Gaussian mixture models. arXiv e-prints.

MacEachern, S. N. and Peruggia, M. (2002). Bayesian tools for EDA and model building: A brainy study. In Case Studies in Bayesian Statistics, pages 345–362. Springer.

Peruggia, M. (1997). On the variability of case-deletion importance sampling weights in the Bayesian linear model. Journal of the American Statistical Association, 92(437):199–207.

Peters, R. H. (1983). The Ecological Implications of Body Size. Cambridge University Press.

Plummer, M. (2003). JAGS: A program for analysis of Bayesian graphical models using Gibbs sampling. In Proceedings of the 3rd International Workshop on Distributed Statistical Computing (DSC 2003).

Sacher, G. A. and Staffeldt, E. F. (1974). Relation of gestation time to brain weight for placental mammals: Implications for the theory of vertebrate growth. The American Naturalist, 108(963):593–615.

Shental, N., Hertz, T., Weinshall, D., and Pavel, M. (2002). Adjustment learning and relevant component analysis. In Heyden, A., Sparr, G., Nielsen, M., and Johansen, P., editors, Computer Vision — ECCV 2002, pages 776–790. Springer, Berlin, Heidelberg.

Thomas, Z. M., MacEachern, S. N., and Peruggia, M. (2018). Reconciling curvature and importance sampling based procedures for summarizing case influence in Bayesian models. Journal of the American Statistical Association, 113(524):1669–1683.

Yarowsky, D. (1995). Unsupervised word sense disambiguation rivaling supervised methods. In Proceedings of the 33rd Annual Meeting of the Association for Computational Linguistics, pages 189–196.