Copula-type Estimators for Flexible Multivariate Density Modeling using Mixtures
Minh-Ngoc Tran, Paolo Giordani, Xiuyan Mun, Robert Kohn, Mike Pitt
Abstract
Copulas are popular as models for multivariate dependence because they allow the marginal densities and the joint dependence to be modeled separately. However, they usually require that the transformation from uniform marginals to the marginals of the joint dependence structure is known. This can only be done for a restricted set of copulas, e.g. a normal copula. Our article introduces copula-type estimators for flexible multivariate density estimation which also allow the marginal densities to be modeled separately from the joint dependence, as in copula modeling, but which overcome the lack of flexibility of most popular copula estimators. An iterative scheme is proposed for estimating copula-type estimators and its usefulness is demonstrated through simulation and real examples. The joint dependence is modeled by mixture of normals and mixture of normals factor analyzers models, and by mixture of $t$ and mixture of $t$-factor analyzers models. We develop efficient Variational Bayes algorithms for fitting these models, in which model selection is performed automatically. Based on these mixture models, we construct four classes of copula-type densities which are far more flexible than current popular copula densities, and which outperform them in simulation and on several real data sets.

Keywords.
Mixtures of factor analyzers; Mixtures of normals; Mixtures of $t$; Mixtures of $t$-factor analyzers; Variational Bayes.

∗ M.-N. Tran ([email protected]), X. Mun ([email protected]) and R. Kohn ([email protected]) are at the Australian School of Business, University of New South Wales, Australia. P. Giordani ([email protected]) is at the Research Division, Swedish Central Bank, Sweden. M. Pitt ([email protected]) is at the Economics Department, University of Warwick, UK.

Introduction
Multivariate density estimation is a fundamental problem in statistics and related fields. One of the common approaches to multivariate density estimation is mixture modeling, which estimates the multivariate density of interest by a multivariate mixture of densities, such as a multivariate mixture of normal densities or a multivariate mixture of $t$ densities (Titterington et al., 1985; McLachlan and Peel, 2000). Mixture models provide an automatic method for estimating the density of non-standard and high-dimensional data. In principle, with sufficient data relative to the dimension of the multivariate data, a mixture model can fit a data set arbitrarily well and capture most of its features. In practice, however, transforming the marginals can greatly facilitate obtaining statistically efficient estimates of a target multivariate density. This can be done informally by taking known transformations of the marginals, for example by taking logs, or more formally, as we have done, by estimating the marginals flexibly and then transforming.

A drawback in using mixture models is that we do not have much flexibility in modeling the marginals, because all of the implied marginals are restricted to some particular form. For example, if the multivariate density of interest is estimated by a multivariate mixture of normals then the marginals of the target are estimated by the implied univariate mixtures of normals. These implied marginals may not even be close to the best models for the target marginals, which can be a kernel density, a univariate mixture of $t$ or some parametric form. Furthermore, Giordani et al. (2012) observe that implicit estimation of marginals is in some cases less efficient than direct estimation, even when the true model is used to fit the joint distribution. They conjecture that the large number of parameters in the joint model that need to be estimated makes the estimation practically less efficient, while direct estimation of the marginals does not deteriorate with the dimension.

Copula modeling is a widely used approach to multivariate density estimation (Joe, 1997; Nelsen, 1999). This approach is flexible in the sense that it allows one to model the marginals and the joint dependence separately. For computational reasons, the joint dependence is often estimated by a mathematically convenient model such as a multivariate normal or a multivariate $t$ distribution. Such conveniently parametric copula models may not be appropriate for modeling data sets that have a complex joint dependence structure. For example, different areas in the domain of the data may have different dependence structures (see the motivating example in Section 2 and the Iris data in Section 3). In such cases, a multivariate mixture model will capture the joint dependence of the data better than a simple model such as a normal or a $t$ model. It is therefore desirable to use flexible models, such as multivariate mixture models, to estimate the joint dependence.

This article proposes a new class of multivariate density estimators, called copula-type estimators, which are motivated by the need for flexible models of complex joint dependence structures while preserving the possibility, offered by copulas, of modeling the marginal distributions separately. Except in some special cases, copula-type estimators are not copula estimators, although they still allow the marginals to be separately estimated.
The construction of copula-type estimators allows us to estimate them using an iterative scheme. The construction also covers many popular copula estimators found in the literature. The article focuses on a class of copula-type estimators using multivariate mixture models to capture the joint dependence of the target density. In particular, four copula-type estimators are considered: a copula-type estimator based on a multivariate mixture of normals, one based on a multivariate mixture of $t$, one based on a mixture of factor analyzers, and one based on a mixture of $t$-factor analyzers. These four copula-type estimators allow us to achieve flexibility, efficiency and robustness in multivariate density estimation. Their estimation is based on efficient Variational Bayes algorithms for fitting mixture models, in which model selection (and factor selection) is automatically incorporated. See, e.g., Ormerod and Wand (2009) for an introduction to the Variational Bayes method. We believe that our algorithm for fitting mixtures of $t$ and mixtures of $t$-factor analyzers is the first method in the literature which is able to do parameter estimation and component and factor selection simultaneously and automatically.

The article is organized as follows. Section 2 presents the main results. Section 3 presents a simulation study and several applications to real data. Section 4 concludes the article. Proofs and technical details are presented in the Appendices.

Suppose that we are given a data set $\mathcal{D}_Y=\{\mathbf{y}_i=(y_{i1},\dots,y_{id})',\ i=1,\dots,n\}$ of realizations of a random vector $\mathbf{Y}=(Y_1,\dots,Y_d)'$, and we wish to estimate the distribution of $\mathbf{Y}$. We denote random variables by upper-case letters and their realizations by lower-case letters, and write vector variables in bold. We write $\mathbf{y}$ for a general multivariate argument and $\mathbf{y}_i$ for a particular realization. We restrict the discussion in this paper to continuous marginals.

In copula modeling, one often assumes that $\mathbf{Y}=(Y_1,\dots,Y_d)'$ inherits its joint dependence structure from another continuous random vector $\mathbf{X}=(X_1,\dots,X_d)'$. Let $G(\mathbf{x})$ be the joint cumulative distribution function (cdf) of $\mathbf{X}$ and $G_j(x_j)$, $j=1,\dots,d$, be its marginal cdf's. Write the corresponding probability density functions (pdf's) as $g(\mathbf{x})$ and $g_j(x_j)$, $j=1,\dots,d$. The joint dependence of $\mathbf{Y}$ is assumed to be constructed from $\mathbf{X}$ as follows. First, let $U_j=G_j(X_j)$, $j=1,\dots,d$.
Each $U_j$ has a uniform distribution on $[0,1]$, while their joint dependence is induced from that of $G$, i.e. the cdf of $\mathbf{U}$ can be written as
$$C(\mathbf{u}\,|\,G)=G\big(G_1^{-1}(u_1),\dots,G_d^{-1}(u_d)\big),\quad \mathbf{u}=(u_1,\dots,u_d)'. \qquad (1)$$
This function is referred to as a copula function or a copula (induced by $G$). This way of constructing a copula is known as the inverse method (Nelsen, 1999).

Given univariate (continuous) cdf's $F_1,\dots,F_d$, let $Y_j=F_j^{-1}(U_j)$, $j=1,\dots,d$. Then each random variable $Y_j$ admits $F_j$ as its cdf, while their joint dependence is induced from that of the vector $\mathbf{X}$, i.e. the cdf $F$ of $\mathbf{Y}$ can be expressed in terms of $G$ as
$$F(\mathbf{y})=C\big(F_1(y_1),\dots,F_d(y_d)\,|\,G\big)=G\Big(G_1^{-1}(F_1(y_1)),\dots,G_d^{-1}(F_d(y_d))\Big). \qquad (2)$$
We refer to $F(\mathbf{y})$ (or its pdf $f(\mathbf{y})$) as a copula cdf, which can be thought of as an approximation to the true cdf of $\mathbf{Y}$. It is easy to see that the $i$th marginal cdf of $F$ is $F_i$. Figure 1 demonstrates this $\mathbf{X}\leftrightarrow\mathbf{U}\leftrightarrow\mathbf{Y}$ and $G\leftrightarrow C\leftrightarrow F$ relationship diagrammatically. The three random vectors $\mathbf{X}$, $\mathbf{U}$ and $\mathbf{Y}$ have different marginals but the same joint dependence structure, in the sense that their cdf's can all be written in terms of the copula $C$.

[Figure 1 (diagram): $\mathbf{X}=(X_1,\dots,X_d)'\sim G(\mathbf{x})$, $X_j\sim G_j(x_j)$; applying $U_j=G_j(X_j)$ gives $\mathbf{U}=(U_1,\dots,U_d)'\sim C(\mathbf{u}\,|\,G)$, $U_j\sim\mathcal{U}[0,1]$; applying $Y_j=F_j^{-1}(U_j)$ gives $\mathbf{Y}=(Y_1,\dots,Y_d)'\sim F(\mathbf{y})$, $Y_j\sim F_j(y_j)$. Caption: the $\mathbf{X}\leftrightarrow\mathbf{U}\leftrightarrow\mathbf{Y}$ and $G\leftrightarrow C\leftrightarrow F$ relationship. $\mathbf{X}$, $\mathbf{U}$ and $\mathbf{Y}$ have the same joint dependence structure but different marginals.]

Two examples of popular copulas are the normal and $t$ copulas. In the normal copula, $G$ is assumed to be the cdf of a multivariate normal distribution $N_d(\mathbf{0},V)$; in the $t$ copula, $G$ is assumed to be the cdf of a multivariate $t$ distribution $t_d(\mathbf{0},\nu,V)$, with $\nu$ the degrees of freedom and $V$ a scale matrix with diagonal entries 1. For both the normal and $t$ copulas the scale matrix $V$ is a correlation matrix.

Inference in copula modeling consists of two problems. The first is how to estimate the marginal cdf's $F_j$ and the second is how to select and estimate an appropriate copula $C$, or equivalently $G$. This section focuses on the second problem, i.e. on estimating an appropriate joint dependence structure. We assume for now that the marginal cdf's $F_j$ are known; marginal estimation is discussed in Section 2.5. By making the transformation $u_{ij}=F_j(y_{ij})$, $j=1,\dots,d$, $i=1,\dots,n$, we obtain a data set $\mathcal{D}_U=\{\mathbf{u}_i,\ i=1,\dots,n\}$ in the $U$-space, and the problem reduces to reconstructing the source of the dependence structure in $\mathbf{X}$ based on $\mathcal{D}_U$. It is worth emphasizing that the data $\mathcal{D}_U$ contain all the information we have about the joint dependence of $F$ (or $G$).

The main problem with many current approaches for fitting joint dependence using copulas is that if an inappropriate choice of copula is made, then the transformed data in the $X$-space may be harder to model than the original data $\mathcal{D}_Y$. The following discussion and example consider this issue. Suppose that we wish to estimate the joint dependence in $\mathcal{D}_U$ by a multivariate cdf $\widehat{G}$, where $\widehat{G}$ is assumed known up to some parameters that need to be estimated from the data. For example, $\widehat{G}$ may be a multivariate normal cdf whose mean is $\mathbf{0}$ and whose covariance matrix is a correlation matrix that needs to be estimated from the data. We further assume that the marginal cdf's $\widehat{G}_j$ of $\widehat{G}$ are fully known. This is the case, for example, in the normal copula or the $t$ copula with fixed degrees of freedom.
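To make the construction concrete, the following sketch (in Python, with illustrative parameter values of our own choosing; scipy supplies the cdf's and quantile functions) simulates the $\mathbf{X}\leftrightarrow\mathbf{U}\leftrightarrow\mathbf{Y}$ chain of Figure 1 when $G$ is bivariate normal, so that the induced copula is the normal copula:

# Illustrative sketch (parameter values are ours, not the paper's):
# simulate the X <-> U <-> Y chain of Figure 1 with G bivariate normal,
# so that the induced copula C(u|G) is the normal copula.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
rho, n = 0.6, 1000

# X ~ G = N_2(0, V), with V a correlation matrix.
V = np.array([[1.0, rho], [rho, 1.0]])
x = rng.multivariate_normal(np.zeros(2), V, size=n)

# U_j = G_j(X_j): uniform marginals, dependence inherited from G (eq. (1)).
u = stats.norm.cdf(x)

# Y_j = F_j^{-1}(U_j): marginals F_j, same dependence structure (eq. (2)).
y1 = stats.norm.ppf(u[:, 0], loc=1.0, scale=np.sqrt(3.0))  # F_1 = N(1, 3)
y2 = stats.t.ppf(u[:, 1], df=5)                            # F_2 = t with 5 df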
Returning to estimation, a simple method for estimating $\widehat{G}$ is as follows: first, transform the data $\mathcal{D}_U$ to a data set $\mathcal{D}_X^{\widehat{G}}$ in the $X$-space via $x_{ij}=\widehat{G}_j^{-1}(u_{ij})$, $j=1,\dots,d$, $i=1,\dots,n$; then, fit $\widehat{G}$ to $\mathcal{D}_X^{\widehat{G}}$. For example, in fitting a normal copula we first make the transformation $x_{ij}=\Phi^{-1}(u_{ij})$, with $\Phi$ the standard normal cdf, and then fit a multivariate normal distribution $N_d(\mathbf{0},V)$ (with $V$ a correlation matrix) to this $X$-space data set. The idea (hope) is that the transformed data $\mathcal{D}_X^{\widehat{G}}$ are easier to model than $\mathcal{D}_Y$. However, in some cases $\mathcal{D}_X^{\widehat{G}}$ cannot be fitted well by $\widehat{G}$: with an inappropriate choice of $\widehat{G}$, the transformed data may be harder to model than the original data $\mathcal{D}_Y$. This is illustrated in the example below.

A motivating example.
We construct a two-dimensional vector $\mathbf{Y}$ whose joint dependence is induced from another vector $\mathbf{X}$ as in Figure 1. $\mathbf{X}$ is distributed as a multivariate mixture of two normals with density
$$g(\mathbf{x})=0.5\,N(\boldsymbol\mu_1,V_1)+0.5\,N(\boldsymbol\mu_2,V_2), \qquad (3)$$
where
$$\boldsymbol\mu_1=\begin{pmatrix}2\\2\end{pmatrix},\quad
\boldsymbol\mu_2=\begin{pmatrix}-2\\-2\end{pmatrix},\quad
V_1=\begin{pmatrix}1&0.6\\0.6&1\end{pmatrix},\quad
V_2=\begin{pmatrix}1&-0.6\\-0.6&1\end{pmatrix},$$
and the marginals are $Y_1\sim N(1,3)$
and $Y_2\sim t(0,1,5)$. The top row of Figure 2 plots the data generated for $\mathbf{Y}$ and $\mathbf{X}$, respectively. The left panel of the middle row plots the data $\mathcal{D}_U$ obtained via $u_{ij}=G_j(x_{ij})$, which contain all the information about the joint dependence of $\mathbf{Y}$ (and $\mathbf{X}$). If we use a normal copula to model the dependence structure in $\mathbf{Y}$, we need to fit a bivariate normal distribution to the data shown in the right panel of the middle row, which are obtained via $x_{ij}=\Phi^{-1}(u_{ij})$. Clearly a multivariate normal density does not provide a good fit to this data set, and it is necessary to have a more flexible model than a multivariate normal distribution to capture the joint dependence.

[Figure 2 about here.]

Figure 2: Motivating example: The first row shows the original $Y$-data and the true dependence structure in the $X$-space, which is equivalently transformed to the $U$-space (middle-left panel) via $u_{ij}=G_j(x_{ij})$. The middle-right panel shows the transformed data $\mathcal{D}_X^{\widehat{G}}$ when $\widehat{G}$ is a normal distribution. The bottom-left panel shows the transformed data $\mathcal{D}_X^{H}$ when the $H_j$ are mixtures of two normals obtained by the iterative scheme. The last panel plots the log-likelihood values versus iterations.

The example above motivates the use of flexible models to estimate the joint dependence. Suppose that $G(\mathbf{x})=G(\mathbf{x}\,|\,\boldsymbol\theta)$ belongs to some class of multivariate cdf's, such as the cdf's of multivariate mixtures of normals, with unknown parameter vector $\boldsymbol\theta$. From (2), the pdf of $\mathbf{Y}$ is
$$f(\mathbf{y}\,|\,\boldsymbol\theta)=\frac{g(\mathbf{x}\,|\,\boldsymbol\theta)}{\prod_{j=1}^d g_j(x_j\,|\,\boldsymbol\theta)}\prod_{j=1}^d f_j(y_j), \qquad (4)$$
where $x_j=G_j^{-1}(F_j(y_j)\,|\,\boldsymbol\theta)$. It is possible, in principle, to estimate $\boldsymbol\theta$ by maximum likelihood or by its posterior mode based on the pdf (4). However, when $G(\mathbf{x}\,|\,\boldsymbol\theta)$ is a complex cdf such as a mixture cdf, optimization over $\boldsymbol\theta$ is computationally very difficult, for two reasons. First, we cannot in general compute the gradient of the likelihood analytically, because the pdf (4) has $\boldsymbol\theta$ deeply embedded in the inverse transformations $x_j=G_j^{-1}(F_j(y_j)\,|\,\boldsymbol\theta)$. Second, this is a high-dimensional optimization problem with the complex constraints that the scale matrices in the mixture need to be positive definite. For example, suppose that $G(\cdot\,|\,\boldsymbol\theta)$ is the cdf of a mixture of $K$ normals; then the dimension of $\boldsymbol\theta$ is
$$\dim(\boldsymbol\theta)=K-1+dK+\frac{d(d+1)}{2}K=\frac{K(d+1)(d+2)}{2}-1,$$
which can be in the thousands for even a moderate $d$ (for example, $K=3$ and $d=40$ give $\dim(\boldsymbol\theta)=2582$); here, we have $K$ components, $K-1$ mixing weights, $Kd$ mean parameters and $Kd(d+1)/2$ scale parameters. We attempted this direct optimization even in small ($d=2$) problems, but the optimization algorithm repeatedly failed to converge.

In the next section we propose a class of copula-type (CT) estimators which estimate the marginals from $\mathcal{D}_Y$ as well as flexibly estimating the dependence structure.

We now describe a framework for constructing flexible multivariate density estimators which allows complex and flexible models to be used for estimating the joint dependence structure. Note that we are assuming that the marginal cdf's $F_j(y_j)$ are given or separately estimated, so that we start with the transformed data $\mathcal{D}_U$ and wish to capture the joint dependence of $\mathbf{X}$.

Our estimator for the distribution of interest is constructed as follows. Suppose that univariate cdf's $H_j$ are an initial guess of the marginal cdf's $G_j$, $j=1,\dots,d$. Recall that $G$ is the cdf of $\mathbf{X}$ and the $G_j$ are its marginal cdf's; $G$ is unknown and we wish to estimate it. Let $\mathcal{D}_X^{H}$ be the data set in the $X$-space obtained by the transformation $x_{ij}=H_j^{-1}(u_{ij})$. Now fit a multivariate cdf $\widehat{G}$ to $\mathcal{D}_X^{H}$. For example, $H_j$ can be the cdf of a univariate mixture of normals and $\widehat{G}$ the cdf of a multivariate mixture of normals. Let
$$\widehat{C}(\mathbf{u}\,|\,H,\widehat{G})=\widehat{G}\big(H_1^{-1}(u_1),\dots,H_d^{-1}(u_d)\big). \qquad (5)$$
We note that $\widehat{G}$ is selected from the class of cdf's corresponding to mixtures of normals, mixtures of factor analyzers, mixtures of $t$ and mixtures of $t$-factor analyzers. That is, $\widehat{G}$ is specified up to class, e.g. mixture of normals, with the parameters, number of components and number of factors unknown and to be estimated from the data.

The following result provides an explicit expression for the estimator.
Proposition 1. The cdf of the estimator for the distribution of $\mathbf{Y}$ is
$$\widehat{F}(\mathbf{y}\,|\,H,\widehat{G})=\widehat{C}\big(F_1(y_1),\dots,F_d(y_d)\,|\,H,\widehat{G}\big). \qquad (6)$$
The pdf of the estimator is
$$\widehat{f}(\mathbf{y}\,|\,H,\widehat{G})=\widehat{g}(\mathbf{x})\prod_{j=1}^d\frac{f_j(y_j)}{h_j(x_j)}, \qquad (7)$$
and its $j$th marginal pdf is
$$\widehat{f}_j(y_j\,|\,H,\widehat{G})=\frac{\widehat{g}_j(x_j)}{h_j(x_j)}\,f_j(y_j),$$
with $x_j=H_j^{-1}(F_j(y_j))$ and $\widehat{g}$, $\widehat{g}_j$, $f_j$, $h_j$ the density functions corresponding to $\widehat{G}$, $\widehat{G}_j$, $F_j$, $H_j$ respectively.

We note that equation (5) is not necessarily a copula. It is also important to note that $\widehat{f}(\mathbf{y}\,|\,H,\widehat{G})$ in (7) is a valid multivariate density for any $\widehat{g}$, $h_j$ and $f_j$. To see this, using the equality $h_j(x_j)\,dx_j=f_j(y_j)\,dy_j$, we can prove that $\int \widehat{f}(\mathbf{y}\,|\,H,\widehat{G})\,d\mathbf{y}=1$. This justifies the stopping criterion used in the iterative scheme in Section 2.3.

The following result guarantees that under some conditions the marginals of the estimator $\widehat{f}$ converge to the true marginals $f_j$. We say that a fitting method is reliable if the resulting estimator $\widehat{g}(\mathbf{x})$ converges in total variation norm to the underlying density $h(\mathbf{x})$ that generates the data $\mathcal{D}_X^{H}$, i.e.
$$d_{\mathrm{TV}}(\widehat{g},h)=\frac12\int|\widehat{g}(\mathbf{x})-h(\mathbf{x})|\,d\mathbf{x}\to 0,$$
as the sample size increases.

Proposition 2 (Marginal consistency). Suppose that the method for fitting $\widehat{G}$ to $\mathcal{D}_X^{H}$ is reliable. Then $\widehat{f}_j(y_j\,|\,H,\widehat{G})$ converges in total variation to the true marginal $f_j(y_j)$, $j=1,\dots,d$, as the sample size increases.

The proofs of the two propositions are in Appendix A.

We call the function $\widehat{C}$ in (5) a copula-type function, and refer to (6) or (7) as a copula-type estimator. This is because $\widehat{C}$ has a similar form to the copula function $C$ in (1), and under some conditions (see below) a copula-type function becomes a copula function.

This approach to multivariate density estimation is flexible for the following reasons.

• It allows us to use complex and principled models, such as multivariate mixture models, to estimate the joint dependence.
• With appropriate choices of the $H_j$ and $\widehat{G}$, the framework covers some popular copulas in the literature. For example, with $H_j=\Phi$, $\widehat{G}=N_d(\mathbf{0},V)$ and $V$ a correlation matrix, we obtain the normal copula model; with $H_j=t(0,\nu,1)$, $\widehat{G}=t_d(\mathbf{0},\nu,V)$ and $V$ a scale matrix with diagonal entries 1, we obtain the $t$ copula model. Note that in these two cases $\widehat{f}_j\equiv f_j$, $j=1,\dots,d$. More generally, a copula-type function is a copula function if $\widehat{G}$ admits the $H_j$'s as its marginal cdf's.

• If $H_j\equiv F_j$, then $\widehat{G}$ is fitted directly to the original data, i.e. no marginal adaptation is used. Copula-type modeling then reduces to the usual multivariate modeling, such as multivariate mixture modeling.

We note that unless $\widehat{G}_j=H_j$, copula-type estimators are not true copula estimators, because the marginal pdf's $\widehat{f}_j(y_j\,|\,H,\widehat{G})$ of a copula-type estimator are not exactly the separately estimated marginal pdf's $f_j$. In order for a copula-type estimator to be a copula estimator it is necessary to impose the constraint $\widehat{G}_j=H_j$. However, imposing this constraint usually makes the estimation of $\widehat{G}$ very difficult, especially when complex models are used to estimate the joint dependence. Furthermore, this constraint need not lead to better performance; see the remarks at the end of Section 2.3. Finally, Proposition 2 guarantees that in large samples a copula-type estimator converges to an exact copula estimator if the model for $\mathcal{D}_X^{H}$ is sufficiently flexible.

In general, we should choose the univariate cdf's $H_j$ such that the transformed data $\mathcal{D}_X^{H}$ look as if they can be effectively fitted by the candidate set of multivariate distributions $\widehat{G}$. In our case, this means that the transformed data can be parsimoniously fitted by a multivariate mixture of normals or a multivariate mixture of $t$. This is difficult if the $H_j$ are only chosen once. We propose an iterative scheme which is useful for estimating the $H_j$ and $\widehat{G}$ in general. Assume that $\widehat{G}(\mathbf{x})$ belongs to some family of multivariate cdf's, such as multivariate normal mixture cdf's: $\widehat{G}(\mathbf{x})=\widehat{G}(\mathbf{x}\,|\,\boldsymbol\theta)$, with $\boldsymbol\theta$ the parameters. We start with some initial univariate cdf's $H_j(x_j)=H_j^{(0)}(x_j)$, fit $\widehat{G}(\mathbf{x}\,|\,\boldsymbol\theta)$ to $\mathcal{D}_X^{H}$ to get an estimate $\widehat{\boldsymbol\theta}$ of $\boldsymbol\theta$, and then repeat the procedure with $H_j(x_j)$ set to $\widehat{G}_j(x_j\,|\,\widehat{\boldsymbol\theta})$.

1. Start with some initial univariate cdf's $H_j(x_j)=H_j^{(0)}(x_j)$.
2. Transform the data $\mathcal{D}_U$ to $\mathcal{D}_X^{H}$ via $x_{ij}=H_j^{-1}(u_{ij})$, $j=1,\dots,d$, $i=1,\dots,n$.
3. Fit $\widehat{G}(\mathbf{x}\,|\,\boldsymbol\theta)$ to $\mathcal{D}_X^{H}$ to get an estimate $\widehat{\boldsymbol\theta}$ of $\boldsymbol\theta$.
4. Set $H_j(x_j)=\widehat{G}_j(x_j\,|\,\widehat{\boldsymbol\theta})$, with $\widehat{G}_j(x_j\,|\,\widehat{\boldsymbol\theta})$ the $j$th marginal cdf of $\widehat{G}(\mathbf{x}\,|\,\widehat{\boldsymbol\theta})$. Go back to Step 2.

We suggest stopping the iteration when the log-likelihood
$$\sum_{\mathbf{y}\in\mathcal{D}_Y}\log \widehat{f}(\mathbf{y}\,|\,H,\widehat{G})$$
does not improve any further. The scheme uses Variational Bayes at each iteration to choose the parameters, as well as automatically choosing the number of components and the number of factors. Note that $\widehat{f}(\mathbf{y}\,|\,H,\widehat{G})$ is a valid density. We observe that the log-likelihood often increases in the first few iterations and then decreases; see the last panel in Figure 2. A possible choice for the initial marginal distributions $H_j^{(0)}$ is the standard normal cdf $\Phi$. When $\widehat{G}$ is the cdf of a multivariate mixture of normals or a multivariate mixture of $t$, we suggest selecting the $H_j^{(0)}$ as the implied marginals of the multivariate mixture distribution estimated from the original data $\mathcal{D}_Y$.
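A compact sketch of Steps 1 to 4 follows (Python; the fitting routine and the implied-marginal and log-likelihood functions are hypothetical stand-ins for the Variational Bayes mixture fits used in the paper):

# Sketch of the iterative scheme (hypothetical interfaces): fit_mixture
# fits G-hat to X-space data; implied_icdf(fit, j) returns the inverse of
# the j-th implied marginal cdf G-hat_j; ct_loglik(fit, u) evaluates the
# log-likelihood sum of the copula-type density (7), used as stopping rule.
import numpy as np

def iterative_ct_fit(u, fit_mixture, implied_icdf, init_icdf, ct_loglik,
                     max_iter=50):
    d = u.shape[1]
    icdfs = [init_icdf] * d              # Step 1: H_j^(0), e.g. Phi^{-1}
    best_fit, best_ll = None, -np.inf
    for _ in range(max_iter):
        # Step 2: x_ij = H_j^{-1}(u_ij)
        x = np.column_stack([icdfs[j](u[:, j]) for j in range(d)])
        fit = fit_mixture(x)             # Step 3: fit G-hat(. | theta)
        ll = ct_loglik(fit, u)
        if ll <= best_ll:                # stop once the log-likelihood
            break                        # no longer improves
        best_fit, best_ll = fit, ll
        # Step 4: H_j <- implied marginal cdf of the fit; back to Step 2
        icdfs = [implied_icdf(fit, j) for j in range(d)]
    return best_fit, best_ll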
We found that the resulting estimates are insensitive to the initial distribution chosen, and we show the usefulness of this scheme through numerical examples.

A motivating example (continued).
We now apply the iterative scheme to estimate the joint dependence in $\mathbf{Y}$, with $H_j^{(0)}(x_j)=\Phi(x_j)$ and $\widehat{G}(\mathbf{x}\,|\,\boldsymbol\theta)$ a multivariate mixture of two normals. The procedure stops after 13 iterations, when the log-likelihood is maximized. The bottom-right panel in Figure 2 plots the log-likelihood values against the iteration number. The bottom-left panel shows the transformed data $\mathcal{D}_X^{H}$ after the iterative scheme stops. Clearly, the joint dependence structure of this estimated $\widehat{G}$ is similar to that of the true distribution $G$. In fact, the two component correlation matrices of $\widehat{G}$ are
$$\begin{pmatrix}1&0.62\\0.62&1\end{pmatrix}\quad\text{and}\quad\begin{pmatrix}1&-0.59\\-0.59&1\end{pmatrix},$$
which are close to the true matrices $V_1$ and $V_2$.
Remark 1. A different, but related, estimator constructed within our framework is
$$\widehat{f}(\mathbf{y}\,|\,\widehat{G})=\widehat{g}(\mathbf{x})\prod_{j=1}^d\frac{f_j(y_j)}{\widehat{g}_j(x_j)}, \qquad (8)$$
with $\widehat{G}$ obtained after the iteration above has terminated, i.e. we use the copula induced by $\widehat{G}$ to construct the estimator. The estimator (8) is a copula estimator, as its marginals are equal to the $f_j$. However, our experiments show that the copula-type estimator (7) usually has a slightly better performance in terms of the log predictive density score (see Section 3) than the copula estimator (8). We conjecture that this is because the expression (7) takes into account the actual marginal transformations of the data, $H_j(x_j)=F_j(y_j)$, while (8) uses only the estimated joint dependence.
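For completeness, a one-line check (written out for $d=2$ and the first marginal) that (8) is a genuine copula density: with $x_j=\widehat{G}_j^{-1}(F_j(y_j))$, so that $\widehat{g}_2(x_2)\,dx_2=f_2(y_2)\,dy_2$,
$$\int \widehat{f}(\mathbf{y}\,|\,\widehat{G})\,dy_2
=\frac{f_1(y_1)}{\widehat{g}_1(x_1)}\int \widehat{g}(x_1,x_2)\,dx_2
=\frac{f_1(y_1)}{\widehat{g}_1(x_1)}\,\widehat{g}_1(x_1)=f_1(y_1).$$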
Remark 2. The proposed method can easily be extended to the case where the marginals depend on covariates $\mathbf{z}$. Assume that $\{(\mathbf{y}_i,\mathbf{z}_i),\ i=1,\dots,n\}$ are observations from a multivariate distribution $F(\mathbf{y}\,|\,\mathbf{z})$ whose joint dependence is independent of $\mathbf{z}$. Let $u_{ij}=F_j(y_{ij}\,|\,\mathbf{z}_i)$, $j=1,\dots,d$, $i=1,\dots,n$, where $F_j(y_j\,|\,\mathbf{z})$ is the $j$th marginal cdf. We can now use the iterative scheme to estimate the $H_j$ and $\widehat{G}$. The pdf of the estimator is expressed as
$$\widehat{f}(\mathbf{y}\,|\,\mathbf{z},H,\widehat{G})=\widehat{g}(\mathbf{x})\prod_{j=1}^d\frac{f_j(y_j\,|\,\mathbf{z})}{h_j(x_j)},$$
with $x_j=H_j^{-1}(F_j(y_j\,|\,\mathbf{z}))$. Extension to the case where the distribution functions $C$ and $G$ depend on covariates is more difficult and is left for future research.

This paper considers in particular four copula-type distributions using multivariate mixture models to estimate the joint dependence.

The first copula-type estimator uses a multivariate mixture of normals to model the joint dependence and is denoted by CT-MN. See, e.g., Titterington et al. (1985) and McLachlan and Peel (2000) for an introduction to mixture models.

A mixture of normals model may be over-parameterized when modeling high-dimensional data, as the number of model parameters increases at least quadratically with the dimension. This is because the number of parameters in each component increases quadratically, and the number of components is also likely to increase with dimension. Parameter estimation is typically less efficient statistically if the number of observations is small relative to the number of parameters. In such cases, it is desirable to reduce the number of parameters. The mixture of factor analyzers model introduced in Ghahramani and Hinton (1997) provides an effective way to parsimoniously model high-dimensional data, and inherits the advantages of flexibility from mixture modeling and of dimensionality reduction from the factor representation. The second copula-type estimator is based on a mixture of factor analyzers and is denoted by CT-MFA.

Krupskii and Joe (2013) propose a general one-component factor copula model which they estimate by maximum likelihood. However, they do not address the two main issues in our article, i.e. the estimation of a copula of a mixture and how to make the marginals in that copula consistent with the joint distribution.

The third copula-type estimator uses a multivariate mixture of $t$ to model the joint dependence and is denoted by CT-M$t$. The heavy tails of $t$ distributions can make this estimator successful when modeling data with outliers or atypical observations. See, e.g., Peel and McLachlan (2000) for a discussion of the multivariate mixture of $t$ model. The fourth copula-type estimator uses a mixture of $t$-factor analyzers to estimate the joint dependence, and we denote it by CT-M$t$FA.

We note that our component and factor selection approach will indicate if a simpler normal or $t$ copula, or a factor version of these models, is sufficient to fit the data. We can also use the cross-validation log predictive score (LPDS; see the definition in Section 3) to make a similar assessment.

Appendix B presents more details on mixture modeling and on how to fit a mixture model to the data using Variational Bayes methods.
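To indicate the parsimony gained by the factor representation used in CT-MFA and CT-M$t$FA (the representation itself is given in Appendix B; the count below ignores rotational identification constraints), each component scale matrix is modeled as
$$V_k=\Lambda_k\Lambda_k'+\Psi_k,\qquad \Lambda_k\in\mathbb{R}^{d\times\gamma_k},\quad \Psi_k\ \text{diagonal},$$
which requires $d\gamma_k+d$ free parameters per component instead of the $d(d+1)/2$ of an unrestricted scale matrix; for example, $d=40$ and $\gamma_k=3$ give $160$ rather than $820$ parameters per component.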
Estimating the marginal densities of $\mathbf{Y}$ is typically much easier than estimating the joint dependence structure. There are a number of efficient approaches to estimating a univariate density, for example parametric estimation, kernel density estimation and univariate mixture estimation. Given a class $\mathcal{F}$ of univariate density estimators, the best estimator can be selected using the cross-validation LPDS (see Section 3). In the examples below, we consider for the class $\mathcal{F}$ a kernel density estimator, a univariate mixture of normals estimator, a univariate mixture of $t$ estimator, an implied univariate mixture of normals estimator (i.e. the univariate density estimator for the marginal implied by the multivariate mixture of normals for the joint) and an implied univariate mixture of $t$ estimator. When fitting univariate mixtures to the marginals, the number of components is selected by Variational Bayes for the real examples. For the simulated example, we used the true model that generated the data as the best model for the marginals, because we wished to focus on how well the joint density was being estimated.

A common measure of the performance of a density estimator is the log predictive density score (LPDS) (see, e.g., Good, 1952; Geisser, 1980). Let $\mathcal{D}_T$ be a test data set that is independent of the training set $\mathcal{D}$. Suppose that $\widehat{p}(\mathbf{y}\,|\,\mathcal{D})$ is a density estimator based on $\mathcal{D}$. The LPDS of the estimator $\widehat{p}$ is defined by
$$\mathrm{LPDS}(\widehat{p})=\frac{1}{|\mathcal{D}_T|}\sum_{\mathbf{y}_i\in\mathcal{D}_T}-\log \widehat{p}(\mathbf{y}_i\,|\,\mathcal{D}),$$
with $|\mathcal{D}_T|$ the number of observations in $\mathcal{D}_T$. The smaller the LPDS, the better the estimator. For the real examples considered in this section we use the cross-validated LPDS. Suppose that the data set $\mathcal{D}$ is split into roughly $B$ equal parts $\mathcal{D}_1,\dots,\mathcal{D}_B$; the $B$-fold cross-validated LPDS is defined as
$$\mathrm{LPDS}(\widehat{p})=\frac{1}{|\mathcal{D}|}\sum_{j=1}^{B}\sum_{\mathbf{y}_i\in\mathcal{D}_j}-\log \widehat{p}(\mathbf{y}_i\,|\,\mathcal{D}\setminus\mathcal{D}_j).$$
When computing this cross-validated LPDS, the marginal models are fixed at the best models which have already been selected (again, by cross-validated LPDS for each marginal). That is, the models for the marginals and the copula model for the joint are specified up to class, with the parameters (including the number of components and number of factors) estimated from each data set
$\mathcal{D}\setminus\mathcal{D}_j$. We take $B=5$ or $B=10$, as recommended by Hastie et al. (2009), pp. 241-244.

Giordani et al. (2012) propose a class of multivariate density estimators to improve on standard multivariate estimators. They do so by allowing the user to adjust any initial multivariate estimator by the best-fitting density for each marginal. Giordani et al. (2012) introduce two marginally adjusted estimators using the mixture of normals and mixture of factor analyzers models as the initial estimators. These estimators are denoted by MAMN and MAMFA. A total of 12 estimators are considered below for comparison. The first six are mixture-based estimators: a multivariate mixture of normals (MN), a multivariate mixture of $t$ (M$t$), a mixture of factor analyzers (MFA), a mixture of $t$-factor analyzers (M$t$FA), and the two marginally adjusted estimators, MAMN and MAMFA. The others are copula-based estimators: a normal copula (NC), a $t$ copula ($t$C), CT-MN, CT-M$t$, CT-MFA and CT-M$t$FA.
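A sketch of the $B$-fold cross-validated LPDS computation (Python; the fit/logpdf interfaces are hypothetical and stand for any of the 12 estimators):

# B-fold cross-validated LPDS (hypothetical interfaces): fit(train) returns
# a fitted estimator; logpdf(est, y) returns per-observation log densities.
# Smaller LPDS is better.
import numpy as np

def cv_lpds(y, fit, logpdf, B=10, seed=0):
    n = y.shape[0]
    folds = np.array_split(np.random.default_rng(seed).permutation(n), B)
    total = 0.0
    for idx in folds:
        train = np.setdiff1d(np.arange(n), idx)   # D \ D_j
        est = fit(y[train])                       # refit without fold j
        total -= logpdf(est, y[idx]).sum()        # -log p-hat(y_i | D \ D_j)
    return total / n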
We consider the data generating process of the motivating example in Section 2. Given a dimension $d$, a training data set $\mathcal{D}$ of size $n$ is generated from (3), where $\boldsymbol\mu_1=(-2,\dots,-2)'$ and $\boldsymbol\mu_2=(2,\dots,2)'$ are vectors of size $d$, $V_1=(V_{1,ij})_{i,j}$ and $V_2=(V_{2,ij})_{i,j}$ with $V_{1,ij}=0.6^{|i-j|}$ and $V_{2,ij}=(-0.6)^{|i-j|}$, and $Y_j\sim t(0,1,5)$ for all $j=1,\dots,d$. A test data set $\mathcal{D}_T$ of 1000 realizations is then generated in the same manner to compute the log predictive density scores. For each $d$ and $n$ combination, we compute the 12 density estimators based on $\mathcal{D}$, their LPDS based on $\mathcal{D}_T$ and their CPU times, and replicate this computation 50 times. Tables 1 and 2 summarize the LPDS and CPU times averaged over the replications for various $d$ and $n$.

We draw the following conclusions. 1) The copula-type estimators perform best, except for the CT-M$t$ when $d=40$ and $n=200$, as fitting a mixture of $t$ in the large-$d$, small-$n$ case is challenging because of the very large number of parameters that need to be estimated. We observe that the CT-M$t$ works well when $n$ is large enough. 2) Dimension reduction via the factor analyzers models is useful when $d$ is large. 3) The marginally adjusted estimators, MAMN and MAMFA, always outperform their initial estimators, MN and MFA. 4) The normal and $t$ copulas work poorly. This is not surprising, as the joint dependence of the data has a mixture structure. 5) The copula-type estimators are more time consuming than the others, principally because in the Variational Bayes algorithms, component (and factor) selection takes place at every iteration. The code is written in Matlab and run on an Intel Core i7 3.2GHz desktop.

 d   n     MN      Mt      MFA     MtFA    MAMN    MAMFA   NC      tC      CT-MN   CT-Mt   CT-MFA  CT-MtFA
 5   200   5.01    4.96    5.17    4.97    4.94    5.15    5.62    5.62    4.45    –       –       –
           (0.16)  (0.12)  (0.19)  (0.13)  (0.15)  (0.24)  (0.15)  (0.15)  (0.16)  (0.19)  (0.46)  (0.18)
 –   500   8.85    8.80    9.03    8.80    8.73    8.97    10.22   10.22   7.74    –       –       –
           (1.52)  (0.10)  (0.36)  (0.32)  (1.44)  (0.27)  (0.01)  (0.01)  (0.41)  (1.29)  (0.05)  (0.17)
 –   500   35.90   40.52   34.44   34.32   35.18   33.56   38.55   38.55   30.02   67.15   28.69   –
           (0.12)  (0.03)  (0.08)  (0.05)  (0.13)  (0.11)  (0.16)  (0.17)  (0.10)  (2.50)  (0.13)  (0.09)
 –   1000  33.94   35.19   33.64   33.58   33.39   32.87   38.18   38.18   29.62   31.75   –       –
Table 1: Simulation: The averaged LPDS values of the 12 density estimators for various dimensions $d$ and numbers of observations $n$. The numbers in brackets are standard deviations over the replications. In each case, the minimum LPDS is in bold.

 d   n     MN    Mt    MFA   MtFA  MAMN  MAMFA  NC    tC    CT-MN  CT-Mt  CT-MFA  CT-MtFA
 5   200   0.03  0.18  1.39  6     0.16  1.53   0.01  0.14  17     42     102     140
     500   0.09  0.29  12    20    0.21  12     0.04  0.34  40     214    456     429
 10  200   0.02  0.29  3.30  8     0.25  3.53   0.02  0.20  43     145    83      251
     500   0.06  0.35  30    43    0.31  30.8   0.02  0.60  81     381    604     967
 40  200   0.03  0.47  8.46  13    0.65  9.08   0.02  0.55  43     65     103     278
     500   0.12  1.60  51    61    1.06  52     0.03  2.05  184    424    421     1148
     1000  0.24  5.14  91    112   2.07  98     0.08  3.81  201    724    672     1634
Table 2: Simulation: The CPU times (in seconds) of the 12 density estimators, averaged over the replications.
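For reference, the simulation design above can be sketched as follows (Python; the exact digits of the component means and scale entries are partially garbled in the source, so the values below follow the reconstruction above, with the $t$ marginal scale taken to be 1):

# Sketch of the simulation DGP: X is a 50/50 mixture of two d-variate
# normals (unit component variances, so G_j is a known univariate mixture
# cdf), and the marginals of Y are t with 5 degrees of freedom.
import numpy as np
from scipy import stats

def simulate(n, d, seed=0):
    rng = np.random.default_rng(seed)
    i, j = np.indices((d, d))
    V1 = 0.6 ** np.abs(i - j)            # V_{1,ij} = 0.6^{|i-j|}
    V2 = (-0.6) ** np.abs(i - j)         # V_{2,ij} = (-0.6)^{|i-j|}
    comp = rng.random(n) < 0.5
    x = np.where(comp[:, None],
                 rng.multivariate_normal(np.full(d, -2.0), V1, size=n),
                 rng.multivariate_normal(np.full(d, 2.0), V2, size=n))
    # u_ij = G_j(x_ij): each marginal of the mixture is 0.5 N(-2,1) + 0.5 N(2,1)
    u = 0.5 * stats.norm.cdf(x, loc=-2.0) + 0.5 * stats.norm.cdf(x, loc=2.0)
    return stats.t.ppf(u, df=5)          # y_ij = F_j^{-1}(u_ij), F_j a t cdf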
This data set (Fisher, 1936) consists of observations of the lengths and widths of the sepals and petals of 150 Iris plants. We are interested in estimating the joint density of these four variables. For visualization purposes, we first consider the density estimation problem in 2 dimensions, and estimate the joint density of the sepal width and the petal length. The first row in Figure 3 shows the original $Y$-data and the $U$-space data, respectively. We use the univariate mixture of $t$ model to estimate the marginals: a univariate $t$ mixture with two components is selected for the sepal width and a univariate $t$ model is selected for the petal length. The lower-left panel in Figure 3 shows the transformed $X$-space data via $x_{ij}=\Phi^{-1}(u_{ij})$ when the normal copula is used. If a normal copula is used then it is necessary to fit a bivariate normal to these data; clearly, it is unreasonable to do so. The last panel shows the $X$-space data (after the iterative scheme stops) when we use the CT-MN model. A multivariate mixture of two normals is selected by the iterative scheme to estimate the dependence structure. This mixture model seems to fit this data set well, visually showing that the CT-MN model captures the joint dependence structure in the data better than the normal copula model. Indeed, the 10-fold cross-validation LPDS values of CT-MN and NC are 1.35 and 1.70, respectively.
 Estimators  MN    Mt    MFA   MtFA  MAMN  MAMFA
 LPDS        1.74  1.70  2.73  2.25  1.71  –

 Estimators  NC    tC    CT-MN  CT-Mt  CT-MFA  CT-MtFA
 LPDS        2.52  2.52  –      –      –       –

Table 3: Iris data: 10-fold cross validation LPDS values for various estimators. The minimum LPDS is in bold.

We now consider estimating the joint density of all four variables, and demonstrate the performance of various estimators using the LPDS criterion. The best estimator for the first marginal is the implied mixture of normals, and for the last three marginals the directly-estimated mixtures of $t$. Table 3 summarizes the 10-fold cross-validation LPDS values of these estimators. We draw the following conclusions. 1) CT-M$t$ performs the best. 2) The copula-type estimators outperform the normal and $t$ copula estimators. 3) Dimension reduction via the factor analyzers models does not help, probably because of the small dimension. The improvement of the mixture-based copula-type estimators over the mixture estimators shows that it is important to estimate the marginals separately. The improvement of the copula-type estimators over the normal and $t$ copula estimators shows that it is important to have flexibility in estimating the joint dependence.

Figure 3: Iris data: The top row shows the original $Y$-data of sepal width and petal length and the $U$-space data. The lower-left panel shows the transformed $X$-space data when the normal copula is used. The last panel shows the $X$-space data (after the iterative scheme stops) when the CT-MN model is used.

Malaria is an infectious disease caused by parasitic protozoans of the genus Plasmodium. This data set consists of the relative expression levels of parasite genes taken at several time points of the life cycle of the parasites. The original data set, consisting of the expression levels of 4221 genes taken at 46 time points, was further processed by Jasra et al. (2007) using K-means clustering and principal component analysis to reduce the number of observations from 4221 to 1000 and the number of variables from 46 to 6. We use the processed data to demonstrate our proposed estimators.

The best estimators for the first three marginals are kernel densities, and for the last three are a mixture of $t$, a kernel density and a mixture of normals, respectively. Table 4 summarizes the 5-fold cross-validated LPDS values. We reach broadly the same conclusions as in the previous example: 1) the CT-M$t$ outperforms the others; 2) the copula-type estimators work better than the parametric copulas; and 3) dimension reduction does not help in this low-dimensional example.

 Estimators  MN     Mt     MFA    MtFA   MAMN   MAMFA
 LPDS        10.76  10.71  11.65  11.11  10.73  –

 Estimators  NC     tC     CT-MN  CT-Mt  CT-MFA  CT-MtFA
 LPDS        11.84  11.74  10.20  –      –       –

Table 4: Gene expression data: 5-fold cross validation LPDS values for various estimators. The minimum LPDS is in bold.
This data set consists of 13 chemical constituents found in 178 samples of wines from a region of Italy. The data set and detailed information about it are available at the UC Irvine Machine Learning Repository, http://archive.ics.uci.edu/ml/datasets/Wine. The small number of observations relative to the number of variables in this data set shows the usefulness of dimension reduction via the factor representation. The best models for the marginals vary between a kernel density and a directly-estimated mixture of $t$ (details not shown). Table 5 summarizes the multivariate model fitting results. The best estimator is the CT-MFA. In general, dimension reduction improves the performance, e.g. the CT-MFA is better than the CT-MN, and the CT-M$t$FA is better than the CT-M$t$. The NC and $t$C work almost as well as the CT-MN and CT-M$t$ respectively. This is because the CT-MN and CT-M$t$ estimators reduce to the NC and $t$C estimators respectively when the joint dependence does not have a mixture structure.

 Estimators  MN     Mt     MFA    MtFA   MAMN   MAMFA
 LPDS        19.67  20.60  20.53  19.51  19.24  –

 Estimators  NC     tC     CT-MN  CT-Mt  CT-MFA  CT-MtFA
 LPDS        19.22  19.03  19.22  –      –       –

Table 5: Wine data: 5-fold cross validation LPDS values for various estimators. The minimum LPDS is in bold.
Conclusion
The article introduces copula-type estimators for flexible multivariate density estimation which we believe improve on current popular copula estimators. The new estimators allow the marginal densities to be modeled separately from the joint dependence, as in all copula estimators, but have the ability to model complex joint dependence structures. In particular, the joint dependence in the copula-type estimators that we propose is modeled by mixture models. The mixtures are fitted by Variational Bayes algorithms which automatically incorporate the model selection problem. An iterative scheme is proposed for estimating copula-type estimators and its usefulness is demonstrated through examples.

A practical issue is determining when a mixture-based copula-type estimator is needed for a given data set. As can be seen from the examples, a mixture-based copula-type estimator works well when the underlying joint dependence has a mixture structure. Such an estimator can be obtained by the Variational Bayes fitting algorithm in our paper: if the multivariate mixture $\widehat{G}$ estimated by the iterative scheme has more than one component, then it is likely that the underlying joint dependence has a mixture structure. In our experience, if the underlying joint dependence does not have a mixture structure, then the estimated multivariate mixture will have only one component and the resulting copula-type estimator will be very similar to the corresponding normal or $t$ copula estimator.

In practice, it is necessary to select an estimator among the four mixture-based copula-type estimators proposed in the article. In our experience, the CT-M$t$ often works well in small dimensions and the CT-M$t$FA is best in high dimensions. However, we suggest fitting all four estimators to the data and then selecting the best estimator using some criterion such as the log predictive density score.

An alternative approach is to use marginally adjusted estimators (Giordani et al., 2012), which try to improve on standard multivariate estimators, such as a mixture of multivariate normals, by modifying them to take account of the best-fitting marginal densities. We believe that the copula-type and the marginal adaptation approaches complement each other, in the sense that marginal adaptation attempts to correct deficiencies in standard multivariate estimators, while copula-type estimation attempts to make the popular copula models more flexible. The practitioner may use both approaches and choose the best performing one by some criterion such as the log predictive score.

We note, however, that if we wish to incorporate dependence on covariates in the marginals, then it is easier to do so using the copula-type estimators than the marginally adjusted estimators, because of the need to estimate the normalizing constants in the marginally adjusted estimators.
Acknowledgment
The authors would like to thank the referees for insightful comments which helped to improve the presentation and content of the paper. The research of Minh-Ngoc Tran, Xiuyan Mun and Robert Kohn was partially supported by Australian Research Council grant DP0667069. We thank Professor Ajay Jasra for the genome data.
Appendix A
Proof of Proposition 1.
Without loss of generality, assume that $d=2$. By construction, $H_i(X_i)=F_i(Y_i)$, $i=1,2$, and $\mathbf{X}=(X_1,X_2)'\sim \widehat{G}(x_1,x_2)$. The distribution $\widehat{F}(\mathbf{y}\,|\,H,\widehat{G})$ of $\mathbf{Y}=(Y_1,Y_2)'$ is
$$\begin{aligned}
\widehat{F}(\mathbf{y}\,|\,H,\widehat{G})&=\mathrm{P}(Y_1\le y_1,\ Y_2\le y_2)\\
&=\mathrm{P}\big(X_1\le H_1^{-1}(F_1(y_1)),\ X_2\le H_2^{-1}(F_2(y_2))\big)\\
&=\widehat{G}\big(H_1^{-1}(F_1(y_1)),H_2^{-1}(F_2(y_2))\big)\\
&=\widehat{C}\big(F_1(y_1),F_2(y_2)\,|\,H,\widehat{G}\big).
\end{aligned}$$
Taking derivatives with respect to $y_1$ and $y_2$, we obtain the density function of $\mathbf{Y}$,
$$\widehat{f}(\mathbf{y}\,|\,H,\widehat{G})=\widehat{g}(x_1,x_2)\prod_{i=1}^{2}\frac{f_i(y_i)}{h_i(x_i)},$$
with $H_i(x_i)=F_i(y_i)$. Now, noting that $h_i(x_i)\,dx_i=f_i(y_i)\,dy_i$, the density of the first marginal $Y_1$ is
$$\widehat{f}_1(y_1\,|\,H,\widehat{G})=\int \widehat{f}(\mathbf{y}\,|\,H,\widehat{G})\,dy_2=\int \widehat{g}(x_1,x_2)\frac{f_1(y_1)}{h_1(x_1)}\,dx_2=\widehat{g}_1(x_1)\frac{f_1(y_1)}{h_1(x_1)}.$$

Proof of Proposition 2.
By construction, the data $\mathcal{D}_X^{H}$ are realizations of a random vector $\mathbf{X}=(X_1,\dots,X_d)'$ obtained by the transformation $X_i=H_i^{-1}(U_i)$, with the $U_i$ uniformly distributed on $[0,1]$, $i=1,\dots,d$. Therefore the $h_i(x_i)$ are the marginal pdf's of $\mathbf{X}$. Denoting by $h(\mathbf{x})$ the joint pdf of $\mathbf{X}$, we have that
$$h_i(x_i)=\int h(\mathbf{x})\,d\mathbf{x}_{-i},\quad\text{with}\quad \mathbf{x}_{-i}=(x_1,\dots,x_{i-1},x_{i+1},\dots,x_d)'.$$
Since $h_i(x_i)\,dx_i=f_i(y_i)\,dy_i$, by the second result in Proposition 1,
$$\begin{aligned}
d_{\mathrm{TV}}(\widehat{f}_i,f_i)&=\frac12\int|\widehat{f}_i(y_i)-f_i(y_i)|\,dy_i
=\frac12\int|\widehat{g}_i(x_i)-h_i(x_i)|\,dx_i\\
&=\frac12\int_{x_i}\Big|\int_{\mathbf{x}_{-i}}\big(\widehat{g}(\mathbf{x})-h(\mathbf{x})\big)\,d\mathbf{x}_{-i}\Big|\,dx_i
\le \frac12\int|\widehat{g}(\mathbf{x})-h(\mathbf{x})|\,d\mathbf{x}\to 0
\end{aligned}$$
as the sample size increases, because the fitting method is reliable.

Appendix B: Variational Bayes algorithms for fitting mixture models
Using Variational Bayes for fitting mixture models has proven useful and efficient. See, e.g., Ormerod and Wand (2009) for an introduction to Variational Bayes. Giordani et al. (2012) develop efficient Variational Bayes algorithms for fitting a multivariate mixture of normals and a mixture of factor analyzers in which the number of components and the number of factors in each component are automatically selected. We present here Variational Bayes algorithms for fitting a multivariate mixture of $t$ and a mixture of $t$-factor analyzers, in which the model selection problem is also automatically incorporated.

Fitting a mixture of $t$

The density of the mixture of $t$ model is of the form
$$p(\mathbf{x})=\sum_{k=1}^{K}\pi_k\,t_d(\mathbf{x};\boldsymbol\mu_k,V_k,\nu_k), \qquad (9)$$
where $t_d(\mathbf{x};\boldsymbol\mu,V,\nu)$ denotes the density of a $d$-variate $t$ distribution with location $\boldsymbol\mu$, scale matrix $V$ and degrees of freedom $\nu$. The mean of this $t$ distribution is $\boldsymbol\mu$ if $\nu>1$, and its variance matrix is $(\nu/(\nu-2))V$ if $\nu>2$. The key to our Variational Bayes fitting approach is the expression of $t$ distributions as scale mixtures of normals (Andrews and Mallows, 1974). The distribution of $\mathbf{X}\sim t_d(\mathbf{x};\boldsymbol\mu,V,\nu)$ can be expressed hierarchically as $\mathbf{X}\,|\,w\sim N_d(\boldsymbol\mu,V/w)$ with $w\sim \mathcal{G}\big(\tfrac{\nu}{2},\tfrac{\nu}{2}\big)$. The mixture model (9) can then be written as
$$\mathbf{x}_i\,|\,\delta_i=j,\ w_{ij}\sim N_d\big(\boldsymbol\mu_j,(w_{ij}T_j)^{-1}\big),\quad p(\delta_i=j)=\pi_j,\quad w_{ij}\sim \mathcal{G}\Big(\frac{\nu_j}{2},\frac{\nu_j}{2}\Big),\quad i=1,\dots,n,\ j=1,\dots,K,$$
with $\delta_i$ and $w_{ij}$ latent variables. Here $T_j=V_j^{-1}$. For now we consider the degrees of freedom $\boldsymbol\nu=(\nu_1,\dots,\nu_K)$ as fixed hyperparameters; this will be relaxed below. The model parameters are $\boldsymbol\theta=(\boldsymbol\pi,\mathbf{w},\boldsymbol\delta,T,\boldsymbol\mu)$. We consider the following decomposition
$$p(\boldsymbol\theta)=p(\boldsymbol\pi)\,p(\boldsymbol\delta\,|\,\boldsymbol\pi)\,p(\mathbf{w})\,p(T)\,p(\boldsymbol\mu\,|\,T) \qquad (10)$$
with the conjugate priors
$$p(\boldsymbol\pi)\sim \text{Dirichlet}(\boldsymbol\alpha^0),\quad
p(\mathbf{w})=\prod_{i=1}^n\sum_{j=1}^K \mathbb{1}(\delta_i=j)\,p(w_{ij}),\quad
p(\boldsymbol\delta\,|\,\boldsymbol\pi)=\prod_{i=1}^n\sum_{j=1}^K \mathbb{1}(\delta_i=j)\,\pi_j,$$
$$T_j\sim \text{Wishart}(\tau_j^0,\Sigma_j^{0,-1}),\qquad
\boldsymbol\mu_j\,|\,T_j\sim N_d\big(\mathbf{0},(\kappa_j^0 T_j)^{-1}\big),$$
where $\boldsymbol\alpha^0$, $\tau_j^0$, $\Sigma_j^0$ and $\kappa_j^0$ are hyperparameters. Note that at the moment the degrees of freedom $\nu_j$ are also considered as hyperparameters. From the decomposition (10) (cf. Ormerod and Wand, 2009), the optimal Variational Bayes posteriors are
$$q_{ij}=q(\delta_i=j)\propto\exp\Big([\log\pi_j]+\frac{\nu_j+d-2}{2}[\log w_{ij}]-\frac{\nu_j+z_{ij}}{2}[w_{ij}]+\frac12[\log|T_j|]+\frac{\nu_j}{2}\log\frac{\nu_j}{2}-\log\Gamma\Big(\frac{\nu_j}{2}\Big)\Big),$$
$$q(\boldsymbol\pi)\sim\text{Dirichlet}(\boldsymbol\alpha)\ \text{with}\ \alpha_j=\alpha_j^0+\sum_i q_{ij},\qquad
q(w_{ij})\sim\mathcal{G}\Big(\frac{\nu_j+d}{2},\frac{\nu_j+z_{ij}}{2}\Big),$$
$$q(\boldsymbol\mu_j\,|\,T_j)\sim N_d\big(\boldsymbol\mu_j^q,(\kappa_j T_j)^{-1}\big),\quad
\kappa_j=\kappa_j^0+\sum_i q_{ij}[w_{ij}],\quad
\boldsymbol\mu_j^q=\frac{1}{\kappa_j}\sum_i q_{ij}[w_{ij}]\,\mathbf{x}_i,$$
$$q(T_j)\sim\text{Wishart}(\tau_j,\Sigma_j^{-1}),\quad
\tau_j=\tau_j^0+1+\sum_i q_{ij},\quad
\Sigma_j=\Sigma_j^0+\kappa_j^0\boldsymbol\mu_j^q(\boldsymbol\mu_j^q)'+\sum_i q_{ij}[w_{ij}](\mathbf{x}_i-\boldsymbol\mu_j^q)(\mathbf{x}_i-\boldsymbol\mu_j^q)'.$$
Here $[\,\cdot\,]$ denotes expectation with respect to the Variational Bayes posterior $q$, i.e. $[\,\cdot\,]:=\mathrm{E}_q(\cdot)$. In the above,
$$z_{ij}=[(\mathbf{x}_i-\boldsymbol\mu_j)'T_j(\mathbf{x}_i-\boldsymbol\mu_j)]=\tau_j(\mathbf{x}_i-\boldsymbol\mu_j^q)'\Sigma_j^{-1}(\mathbf{x}_i-\boldsymbol\mu_j^q)+\frac{d}{\kappa_j},$$
and $[\log\pi_j]=\Psi(\alpha_j)-\Psi(\sum_j\alpha_j)$, $[\log w_{ij}]=\Psi\big(\frac{\nu_j+d}{2}\big)-\log\big(\frac{\nu_j+z_{ij}}{2}\big)$, $[w_{ij}]=(\nu_j+d)/(\nu_j+z_{ij})$ and $[\log|T_j|]=\sum_{h=1}^d\Psi\big(\frac{\tau_j+1-h}{2}\big)+d\log 2-\log|\Sigma_j|$. Let $L(\boldsymbol\nu)$ be the lower bound on $\log p(\mathbf{x}\,|\,\boldsymbol\nu)$.

Estimating the degrees of freedom is challenging in both the Bayesian and frequentist approaches. In our setting, the optimal Variational Bayes posterior of $\nu_j$ does not have any standard form. We proceed as follows. Let $p(\boldsymbol\nu)$ be a prior on $\boldsymbol\nu$. We use a point mass distribution for the Variational Bayes posterior of $\boldsymbol\nu$, i.e. $q(\boldsymbol\nu)=\delta(\boldsymbol\nu-\boldsymbol\nu^q)$, with $\delta(\cdot)$ the Dirac delta distribution. The lower bound on $\log p(\mathbf{x})$ is then
$$\int q(\boldsymbol\nu)\log\frac{p(\boldsymbol\nu)\,p(\mathbf{x}\,|\,\boldsymbol\nu)}{q(\boldsymbol\nu)}\,d\boldsymbol\nu=\log p(\boldsymbol\nu^q)+\log p(\mathbf{x}\,|\,\boldsymbol\nu^q). \qquad (11)$$
With $L(\boldsymbol\nu^q)$ the lower bound on $\log p(\mathbf{x}\,|\,\boldsymbol\nu^q)$, the lower bound on $\log p(\mathbf{x})$ is
$$\mathcal{L}=\log p(\boldsymbol\nu^q)+L(\boldsymbol\nu^q). \qquad (12)$$
This needs to be optimized with respect to $\boldsymbol\nu^q$. We write $\boldsymbol\nu$ instead of $\boldsymbol\nu^q$ in what follows.

It is well known in Bayesian fitting of $t$ distributions that an improper prior on the degrees of freedom leads to an improper posterior, while in frequentist fitting the MLE may not converge because of the non-regularity of the likelihood. A truncated prior is commonly used. We follow Lin et al. (2004) and use the uniform prior on $(0,\lambda]$, with $\lambda$ sufficiently large, say $\lambda=100$. Then maximizing (12) is equivalent to maximizing the following function in $\nu_j$:
$$\sum_{i=1}^n q_{ij}\Big(\frac{\nu_j}{2}\log\frac{\nu_j}{2}-\frac{\nu_j+d}{2}\log\frac{\nu_j+z_{ij}}{2}+\log\Gamma\Big(\frac{\nu_j+d}{2}\Big)-\log\Gamma\Big(\frac{\nu_j}{2}\Big)\Big) \qquad (13)$$
subject to $\nu_j\in(0,\lambda]$, $j=1,\dots,K$.
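Since the $\nu_j$ are restricted to integers in $(0,\lambda]$ (see below), maximizing (13) reduces to a one-dimensional enumeration. A minimal sketch (Python; the function and variable names are ours):

# Update nu_j by enumerating integers 1..lam (lam = 100) and maximizing
# the objective (13); q_j holds the q_ij for component j, z_j the z_ij,
# and d is the dimension of the data.
import numpy as np
from scipy.special import gammaln

def update_nu(q_j, z_j, d, lam=100):
    best_nu, best_obj = 1, -np.inf
    for nu in range(1, lam + 1):
        obj = np.sum(q_j * (0.5 * nu * np.log(0.5 * nu)
                            - 0.5 * (nu + d) * np.log(0.5 * (nu + z_j))
                            + gammaln(0.5 * (nu + d))
                            - gammaln(0.5 * nu)))
        if obj > best_obj:
            best_nu, best_obj = nu, obj
    return best_nu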
This is somewhat similar to the M-step update of the degrees of freedom in the EM algorithm of Peel and McLachlan (2000). However, Peel and McLachlan (2000) did not impose any constraint on $\nu_j$, which may cause divergence of the solution. For simplicity, we take $\nu_j$ to be an integer.

Given an initial number of components $K$, the Variational Bayes algorithm sequentially updates the parameters $q_{ij}$, $\alpha_j$, $\kappa_j$, $\boldsymbol\mu_j^q$, $\Sigma_j$ and $\nu_j$ until some stopping rule is met. Often, this iterative scheme stops when the lower bound (12) is not improved any further, or when the updates are stable in the sense that the difference of the main parameters $\boldsymbol\mu_j^q$ and $\Sigma_j$ in two successive iterations is smaller than a tolerance value. We refer to this update procedure as the standard Variational Bayes algorithm.

To select $K$, we start with a reasonably large value of $K$ and remove redundant components on the basis of maximizing the lower bound, as follows. After the standard Variational Bayes procedure has converged, we try removing the components with smallest $\sum_i q_{ij}$, and actually remove these components if the final optimized lower bound is improved. That is, unlike the existing algorithms in which components with posterior probabilities $\frac1n\sum_i q_{ij}$ smaller than a specific threshold value are eliminated (Corduneanu and Bishop, 2001; McGrory and Titterington, 2007), we first rank components for elimination and eliminate plausible components until the lower bound is not improved any further. We found that our strategy quickly and efficiently eliminates redundant components, while not requiring any specific threshold value, which may be hard to determine. We refer to this algorithm for determining $K$ as the Elimination Variational Bayes (EVB) algorithm. It might be desirable to include split steps which split poorly-fitted components. However, implementation of split steps is difficult in the $t$ mixture context, because it is not clear how to initialize new components optimally.

Fitting a mixture of $t$-factor analyzers

The density of a mixture of $t$-factor analyzers is (9) with the scale matrices having the factor representation $V_k=\Lambda_k\Lambda_k'+\Psi_k$. This model was first considered in McLachlan et al. (2007), who develop an EM algorithm for fitting it. Model selection in fitting this model consists of selecting the number of components $K$ and the number of factors $\gamma_k$ in each component. Therefore the number of models in the model space is huge, which makes the model selection problem challenging when using model selection criteria such as AIC or BIC, because one needs to search over the whole model space. To reduce the model space, McLachlan et al. (2007) consider the same number of factors $\gamma_j\equiv\gamma$ for all components. We relax this assumption here and develop below a Variational Bayes algorithm for fitting the model in which $K$ and the $\gamma_k$ are automatically determined. We believe this is the first algorithm in the literature for fitting the (full) mixture of $t$-factor analyzers model which is able to do parameter estimation and model selection simultaneously and automatically.

The model can be written as
$$\mathbf{x}_i\,|\,\delta_i=j,\ \mathbf{z}_{ij},\ w_{ij}\sim N_d\big(\boldsymbol\mu_j+\Lambda_j\mathbf{z}_{ij},(w_{ij}\psi_j)^{-1}I\big),\quad
\mathbf{z}_{ij}\sim N_{k_j}(\mathbf{0},w_{ij}^{-1}I),\quad
w_{ij}\sim\mathcal{G}\Big(\frac{\nu_j}{2},\frac{\nu_j}{2}\Big),\quad
p(\delta_i=j)=\pi_j,$$
with $\mathbf{z}_{ij}$, $w_{ij}$, $\delta_i$ latent variables. Following McLachlan et al. (2007), we assume $\Psi_j=\psi_j^{-1}I$, which helps avoid spikes or near-singularities in the likelihood.
We consider the following priors on the model parameters: $p(\boldsymbol\pi)\sim\text{Dirichlet}(\boldsymbol\alpha^0)$, a diffuse prior on $\boldsymbol\mu_j$, and
$$p(\Lambda_j\,|\,\boldsymbol\tau_j)=\prod_{l=1}^{k_j}N_d(\mathbf{0},\tau_{jl}^{-1}I),$$
with $\boldsymbol\tau_j=(\tau_{j1},\dots,\tau_{jk_j})$, and we put gamma priors $\mathcal{G}(a^0,b^0)$ on the $\tau_{jl}$ and $\psi_j$. The form of the prior $p(\Lambda_j\,|\,\boldsymbol\tau_j)$ plays a key role in determining the local dimensions $k_j$: a very small value of $\tau_{jl}^{-1}$ suggests that factor $l$ of component $j$ should be removed. This approach is introduced in Ghahramani and Beal (2000) for mixtures of (normal) factor analyzers.

The Variational Bayes optimal posteriors for the parameters are as follows:
$$q(\mathbf{z}_{ij})\sim N_{k_j}(\boldsymbol\mu_{x_{ij}},\Sigma_{x_{ij}}),\quad
\Sigma_{x_{ij}}=[w_{ij}]^{-1}\big(I+[\psi_j][\Lambda_j'\Lambda_j]\big)^{-1},\quad
\boldsymbol\mu_{x_{ij}}=\Sigma_{x_{ij}}[w_{ij}][\psi_j][\Lambda_j]'(\mathbf{x}_i-\boldsymbol\mu_{\mu_j}),$$
$$q(w_{ij})\sim\mathcal{G}\Big(\frac{\nu_j}{2}+a_{w_{ij}},\frac{\nu_j}{2}+b_{w_{ij}}\Big),\quad
a_{w_{ij}}=\frac{k_j+d}{2},\quad
b_{w_{ij}}=\frac12[\psi_j]c_{ij}+\frac12\boldsymbol\mu_{x_{ij}}'\boldsymbol\mu_{x_{ij}}+\frac12\mathrm{tr}(\Sigma_{x_{ij}}),$$
$$q_{ij}\propto\exp\Big([\log\pi_j]+\Big(\frac{\nu_j+k_j+d}{2}-1\Big)[\log w_{ij}]+\frac d2[\log\psi_j]-\Big(\frac{\nu_j}{2}+b_{w_{ij}}\Big)[w_{ij}]-\frac{k_j}{2}\log 2\pi+\frac{\nu_j}{2}\log\frac{\nu_j}{2}-\log\Gamma\Big(\frac{\nu_j}{2}\Big)\Big),$$
$$q(\boldsymbol\pi)\sim\text{Dirichlet}(\boldsymbol\alpha),\quad \alpha_j=\alpha_j^0+\sum_{i=1}^n q_{ij},$$
$$q(\boldsymbol\mu_j)\sim N_d(\boldsymbol\mu_{\mu_j},\sigma_{\mu_j}I),\quad
\sigma_{\mu_j}=\Big([\psi_j]\sum_{i=1}^n q_{ij}[w_{ij}]\Big)^{-1},\quad
\boldsymbol\mu_{\mu_j}=\sigma_{\mu_j}[\psi_j]\sum_{i=1}^n q_{ij}[w_{ij}]\big(\mathbf{x}_i-[\Lambda_j][\mathbf{z}_{ij}]\big),$$
$$q(\Lambda_{jl})\sim N_d(\boldsymbol\mu_{\Lambda_{jl}},\sigma_{jl}I),\quad
\sigma_{jl}=\Big([\tau_{jl}]+[\psi_j]\sum_{i=1}^n q_{ij}[w_{ij}][x_{ij,l}^2]\Big)^{-1},$$
$$\boldsymbol\mu_{\Lambda_{jl}}=\sigma_{jl}[\psi_j]\sum_{i=1}^n q_{ij}[w_{ij}]\Big([x_{ij,l}](\mathbf{x}_i-\boldsymbol\mu_{\mu_j})-\sum_{s\neq l}\boldsymbol\mu_{\Lambda_{js}}[x_{ij,l}x_{ij,s}]\Big),$$
$$q(\tau_{jl})\sim\mathcal{G}(a_{\tau_{jl}},b_{\tau_{jl}}),\quad
a_{\tau_{jl}}=a^0+\frac d2,\quad
b_{\tau_{jl}}=b^0+\frac12\boldsymbol\mu_{\Lambda_{jl}}'\boldsymbol\mu_{\Lambda_{jl}}+\frac d2\sigma_{jl},$$
$$q(\psi_j)\sim\mathcal{G}(a_{\psi_j},b_{\psi_j}),\quad
a_{\psi_j}=a^0+\frac d2\sum_{i=1}^n q_{ij},\quad
b_{\psi_j}=b^0+\frac12\sum_{i=1}^n q_{ij}[w_{ij}]c_{ij},$$
$$c_{ij}=(\mathbf{x}_i-\boldsymbol\mu_{\mu_j})'(\mathbf{x}_i-\boldsymbol\mu_{\mu_j})-2(\mathbf{x}_i-\boldsymbol\mu_{\mu_j})'[\Lambda_j]\boldsymbol\mu_{x_{ij}}+\boldsymbol\mu_{x_{ij}}'[\Lambda_j'\Lambda_j]\boldsymbol\mu_{x_{ij}}+d\,\sigma_{\mu_j}+\mathrm{tr}\big(\Sigma_{x_{ij}}[\Lambda_j'\Lambda_j]\big).$$
The expectation terms are given by
$$[\Lambda_j'\Lambda_j]=[\Lambda_j]'[\Lambda_j]+d\,\mathrm{diag}(\sigma_{j1},\dots,\sigma_{jk_j})
\quad\text{and}\quad
[x_{ij,l}x_{ij,s}]=(\Sigma_{x_{ij}})_{l,s}+\mu^{(l)}_{x_{ij}}\mu^{(s)}_{x_{ij}}.$$
Similar to the reasoning in the previous section, the degrees of freedom $\nu_j$ are estimated by maximizing
$$\sum_{i=1}^n q_{ij}\Big(\frac{\nu_j}{2}\log\frac{\nu_j}{2}-\Big(\frac{\nu_j}{2}+a_{w_{ij}}\Big)\log\Big(\frac{\nu_j}{2}+b_{w_{ij}}\Big)+\log\Gamma\Big(\frac{\nu_j}{2}+a_{w_{ij}}\Big)-\log\Gamma\Big(\frac{\nu_j}{2}\Big)\Big)$$
subject to $\nu_j\in(0,\lambda]$, $j=1,\dots,K$.

Our standard Variational Bayes algorithm sequentially updates the parameters $\Sigma_{x_{ij}}$, $\boldsymbol\mu_{x_{ij}}$, $a_{w_{ij}}$, $b_{w_{ij}}$, $q_{ij}$, $\alpha_j$, $\sigma_{\mu_j}$, $\boldsymbol\mu_{\mu_j}$, $\sigma_{jl}$, $\boldsymbol\mu_{\Lambda_{jl}}$, $a_{\tau_{jl}}$, $b_{\tau_{jl}}$, $a_{\psi_j}$, $b_{\psi_j}$ and $\nu_j$, until the difference of the main parameters $\boldsymbol\mu_{\mu_j}$ and $\boldsymbol\mu_{\Lambda_{jl}}$ in two successive iterations is smaller than a tolerance value. Other stopping rules can be used as well.

We now present our strategy for determining the local dimensions $k_j$. We remove factor $l$ of component $j$ if the posterior mean of $\tau_{jl}^{-1}$ is smaller than a threshold $\epsilon$. Note that the mean of $\tau_{jl}^{-1}$ is $b_{\tau_{jl}}/(a_{\tau_{jl}}-1)$. Before fitting, we found it necessary to standardize the data so that the columns of $\mathbf{x}$ have standard deviations of 1; this makes the analysis more stable and facilitates the choice of $\epsilon$. After fitting, it is straightforward to write the resulting density back in the original units. From our experience, $\epsilon = 10^{-}$ is a good choice. To select $K$, we follow the same Elimination Variational Bayes strategy as in the previous section.

In summary, our strategy for model selection in fitting the M$t$FA model is as follows (a code sketch is given after this list).

• Step 1: Start with a reasonably large value of $K$ and with the initial number of factors $k_j=\lfloor(2d+1-\sqrt{8d+1})/2\rfloor$, the largest value allowed for the number of factors in factor analysis.
• Step 2: After the standard Variational Bayes procedure has converged, remove factors with $b_{\tau_{jl}}/(a_{\tau_{jl}}-1)<\epsilon$.

• Step 3: Remove redundant components via the EVB algorithm.

• Step 4: Repeat Steps 2 and 3 until the lower bound is not improved any further.
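A high-level sketch of the elimination loop follows; the vb_fit and drop interfaces are hypothetical stand-ins for the standard Variational Bayes updates and for deleting a component from a fitted parameter set:

# Elimination Variational Bayes (sketch): rank components by responsibility
# mass, tentatively remove the smallest, and keep the removal only if the
# optimized lower bound improves.
import numpy as np

def evb(x, K0, vb_fit, drop):
    # vb_fit(x, K, init) runs the standard VB updates to convergence and
    # returns (params, q, bound); q is the n x K responsibility matrix.
    params, q, bound = vb_fit(x, K0, init=None)
    while q.shape[1] > 1:
        k = np.argmin(q.sum(axis=0))                 # smallest sum_i q_ij
        K = q.shape[1] - 1
        cand = vb_fit(x, K, init=drop(params, k))    # refit without component k
        if cand[2] <= bound:                         # lower bound not improved:
            break                                    # keep the current fit
        params, q, bound = cand
    return params, q, bound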
References

Andrews, D. and Mallows, C. (1974). Scale mixtures of normal distributions. Journal of the Royal Statistical Society, Series B, 36:99-102.

Corduneanu, A. and Bishop, C. (2001). Variational Bayesian model selection for mixture distributions. In Jaakkola, T. and Richardson, T., editors, Artificial Intelligence and Statistics, volume 14, pages 27-34. Morgan Kaufmann.

Fisher, R. (1936). The use of multiple measurements in taxonomic problems. Annals of Eugenics, 7, Part II:179-188.

Geisser, S. (1980). Discussion of "Sampling and Bayes inference in scientific modelling and robustness" by G.E.P. Box. Journal of the Royal Statistical Society, Series A, 143:416-417.

Ghahramani, Z. and Beal, M. J. (2000). Variational inference for Bayesian mixtures of factor analyzers. In Solla, S. A., Leen, T. K., and Müller, K., editors, NIPS, volume 12, pages 449-455. MIT Press.

Ghahramani, Z. and Hinton, G. (1997). The EM algorithm for mixtures of factor analyzers. Technical report, Dept. of Computer Science, University of Toronto.

Giordani, P., Mun, X., Tran, M.-N., and Kohn, R. (2012). Flexible multivariate density estimation with marginal adaptation. Journal of Computational and Graphical Statistics. To appear.

Good, I. (1952). Rational decisions. Journal of the Royal Statistical Society, Series B, 14:107-114.

Hastie, T. J., Tibshirani, R. J., and Friedman, J. H. (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer Series in Statistics. Springer, New York, second edition.

Jasra, A., Stephens, D., and Holmes, C. (2007). Population-based reversible jump Markov chain Monte Carlo. Biometrika, 97(4):787-807.

Joe, H. (1997). Multivariate Models and Dependence Concepts. Chapman & Hall, London.

Krupskii, P. and Joe, H. (2013). Factor copula models for multivariate data. Journal of Multivariate Analysis. To appear.

Lin, T. I., Lee, J. C., and Ni, H. F. (2004). Bayesian analysis of mixture modelling using the multivariate t distribution. Statistics and Computing, 14:119-130.

McGrory, C. A. and Titterington, D. M. (2007). Variational approximations in Bayesian model selection for finite mixture distributions. Computational Statistics and Data Analysis, 51:5352-5367.

McLachlan, G. and Peel, D. (2000). Finite Mixture Models. John Wiley and Sons, New York.

McLachlan, G. J., Bean, R. W., and Jones, L. B. (2007). Extension of the mixture of factor analyzers model to incorporate the multivariate t-distribution. Computational Statistics & Data Analysis, 51:5327-5338.

Nelsen, R. (1999). An Introduction to Copulas. Springer-Verlag, New York.

Ormerod, J. T. and Wand, M. P. (2009). Explaining variational approximations. The American Statistician, 64(2):140-153.

Peel, D. and McLachlan, G. J. (2000). Robust mixture modelling using the t distribution. Statistics and Computing, 10:339-348.

Titterington, D. M., Smith, A. F. M., and Makov, U. E. (1985). Statistical Analysis of Finite Mixture Distributions. Wiley, Chichester.