Factor copula models for mixed data
aa r X i v : . [ s t a t . M E ] J u l Factor copula models for mixed data
Sayed H. Kadhem ∗ Aristidis K. Nikoloulopoulos † Abstract
We develop factor copula models for analysing the dependence among mixed continuous and discrete re-sponses. Factor copula models are canonical vine copulas that involve both observed and latent variables,hence they allow tail, asymmetric and non-linear dependence. They can be explained as conditional inde-pendence models with latent variables that don’t necessarily have an additive latent structure. We focuson important issues that would interest the social data analyst, such as model selection and goodness-of-fit.Our general methodology is demonstrated with an extensive simulation study and illustrated by re-analysingthree mixed response datasets. Our study suggests that there can be a substantial improvement over the stan-dard factor model for mixed data and makes the argument for moving to factor copula models.
Key Words:
Conditional independence; Goodness-of-fit; Latent variable models; Model selection; Taildependence/asymmetry; Canonical vines.
It is very common in social science, e.g., in surveys, to deal with datasets that have mixed continuous anddiscrete responses. In the literature, two broad frameworks have been considered to model the dependenceamong such mixed continuous and discrete responses, namely the latent variable and copula framework.There are two approaches for modelling multivariate mixed data with latent variables: the underlying vari-able approach that treats all variables as continuous by assuming the discrete responses as a manifestation ofunderlying continuous variables that follow the normal distribution (e.g., Muth´en 1984; Lee et al. 1992; Quinn2004), and, the response function approach that postulates distributions on the observed variables conditionalon the latent variables usually from the exponential family (e.g., Moustaki 1996; Moustaki and Knott 2000;Wedel and Kamakura 2001; Huber et al. 2004; Moustaki and Victoria-Feser 2006). The former method as-sumes that the continuous and underlying variables follow a multivariate normal (MVN) distribution, while thelatter assumes that the observed variables are conditionally independent given the latent variables which again ∗ School of Computing Sciences, University of East Anglia, Norwich Research Park, Norwich NR4 7TJ, U.K. † Correspondence to: [email protected], Aristidis K. Nikoloulopoulos, School of Computing Sciences, University ofEast Anglia, Norwich NR4 7TJ, U.K. p latent (unobservable)variables/factors where p << d , where d is the number of observed variables and hence estimation is morefeasible when many categorical or discrete variables are involved. Nevertheless, both approaches are restrictedto MVN assumption that is not valid if tail asymmetry or tail dependence exists in the mixed data which is arealistic scenario. Ma and Genton (2010), Montanari and Viroli (2010), and Irincheeva et al. (2012a) stress thatthe MVN assumption might not be adequate, and acknowledge that the effect of misspecifying the distributionof the latent variables could lead to biased model estimates and poor fit. To this end, Irincheeva et al. (2012b)proposed a more flexible response function approach by strategically multiplying the MVN density of the latentvariables by a polynomial function to achieve departures from normality.As we have discussed, the underlying variable approach exploits the use of the MVN assumption to modelthe joint distribution of mixed data. The univariate margins are transformed to normality and then the MVNdistribution is fitted to the transformed data. This construction is apparently the MVN copula applied to mixeddata (Shen and Weissfeld, 2006; Hoff, 2007; Song et al., 2009; He et al., 2012; Jiryaie et al., 2016), but previouspapers (e.g., Quinn 2004) do not refer to copulas as the approach can be explained without copulas.Smith and Khaled (2012), St¨ober et al. (2015) and Zilko and Kurowicka (2016) called vine copulas to modelmixed data. Vine copulas have two major advantages over the MVN copula as emphasized in Panagiotelis et al.(2017). The first is that the computational complexity of computing the joint probability distribution functiongrows quadratically with d , whereas for the MVN copula the computational complexity grows exponentiallywith d . The second is that vine copulas are highly flexible through their specification from bivariate parametriccopulas with different tail dependence or asymmetry properties. Note in passing that they have as special casethe MVN copula, if all the bivariate copulas are bivariate normal (BVN).In this paper, we extend the factor copula models in Krupskii and Joe (2013) and Nikoloulopoulos and Joe(2015) to the case of mixed continuous and discrete responses. Factor copula models are vine copulas thatinvolve both observed and latent variables, hence they allow flexible (tail) dependence. These types of modelsare more interpretable and fit better than vine copula models, when dependence can be explained through latentvariables. That is theoretical concepts that cannot be measured directly such as intelligence in psychology,or welfare and poverty in economics. Furthermore, the factor copula models are closed under margins (samemodel with one more or less variable) whereas vine copulas without latent variables are not closed undermargins. Finally, they can be explained as conditional independence models, i.e., they are a response function2pproach with dependence coming from latent (unobservable) variables/factors. Our model is the most generalconditional independence model with univariate parameters separated from dependence parameters and latentvariables that don’t necessarily have an additive latent structure. Another mixed-variable model in the literaturethat is called factor copula model (Murray et al., 2013) is restricted to the MVN copula, hence has an additivelatent structure and is not as general.The remainder of the paper proceeds as follows. Section 2 introduces the factor copula models for mixeddata and discusses its relationship with existing models. Estimation techniques and computational details areprovided in Section 3. Sections 4 and 5 propose methods for model selection and goodness-of-fit, respectively.Section 6 presents applications of our methodology to three mixed response data sets. Section 7 contains anextensive simulation study to gauge the small-sample efficiency of the proposed estimation, model selectionand goodness-of-fit techniques. We conclude with some discussion in Section 8. The p -factor model assumes that the mixed continuous and discrete responses Y = ( Y , . . . , Y d ) are condition-ally independent given p latent variables X , . . . , X p . In line with Krupskii and Joe (2013) and Nikoloulopoulos and Joe(2015), we will use a general copula construction, based on a set of bivariate copulas that link observed to la-tent variables, to specify the factor copula models for mixed continuous and discrete variables. The idea in thederivation of this p -factor model will be shown below for the 1-factor and 2-factor case. It can be extended to p ≥ factors or latent variables in a similar manner.For the 1-factor model, let X be a latent variable, which we assume to be standard uniform (withoutloss of generality). From Sklar (1959), there is a bivariate copula C X j such that Pr( X ≤ x, Y j ≤ y ) = C X j (cid:0) x, F j ( y ) (cid:1) for ≤ x ≤ where F j is the cumulative distribution function (cdf) of Y j . Then it followsthat F j | X ( y | x ) := Pr( Y j ≤ y | X = x ) = ∂C X j (cid:0) x, F j ( y ) (cid:1) ∂x . (1)Letting C j | X ( F j ( y ) | x ) = ∂C X j ( x, F j ( y )) /∂x for shorthand notation and y = ( y , . . . , y d ) be realizations of Y , the density ‡ for the 1-factor model is f Y ( y ) = Z d Y j =1 f j | X ( y j | x ) dx, (2) ‡ We mean the density of Y w.r.t. the product measure on the respective supports of the marginal variables. For discrete marginswith integer values this is the counting measure on the set of possible outcomes, for continuous margins we consider the Lebesguemeasure in R . f j | X ( y | x ) = (cid:26) C j | X (cid:0) F j ( y + 1) | x (cid:1) − C j | X (cid:0) F j ( y ) | x (cid:1) if Y j is dicrete ; c X j (cid:0) x, F j ( y ) (cid:1) f j ( y ) if Y j is continuous , is the density of Y j = y conditional on X = x ; c X j is the bivariate copula density of X and Y j and f j is theunivariate density of Y j .For the 2-factor model, consider two latent variables X , X that are, without loss of generality, independentuniform U (0 , random variables. Let C X j be defined as in the 1-factor model, and let C X j be a bivariatecopula such that Pr( X ≤ x , Y j ≤ y | X = x ) = C X j (cid:0) x , F j | X ( y | x ) (cid:1) , where F j | X is given in (1). Then for ≤ x , x ≤ , Pr( Y j ≤ y | X = x , X = x ) = ∂∂x Pr( X ≤ x , Y j ≤ y | X = x )= ∂∂x C X j (cid:0) x , F j | X ( y | x ) (cid:1) = C j | X (cid:0) F j | X ( y | x ) | x (cid:1) . The density for the 2-factor model is f Y ( y ) = Z Z d Y j =1 f X j | X (cid:0) x , y j | x (cid:1) dx dx , (3)where f X j | X ( x , y | x ) = (cid:26) C j | X (cid:0) F j | X ( y | x ) | x (cid:1) − C j | X (cid:0) F j | X ( y − | x ) | x (cid:1) if Y j is discrete ; c jX ; X (cid:0) F j | X ( y | x ) , x (cid:1) c X j (cid:0) x , F j ( y ) (cid:1) f j ( y ) if Y j is continuous . Note that the copula C X j links the j th response to the first latent variable X , and the copula C X j linksthe j th response to the second latent variable X conditional on X . In our general statistical model there areno constraints in the choices of the parametric marginal F j or copula { C X j , C X j } distributions. In this subsection, we provide choices of parametric bivariate copulas that can be used to link the latent withobserved variables. We will consider copula families that have different tail dependence (Joe, 1993) or tailorder (Hua and Joe, 2011).A bivariate copula C is reflection symmetric if its density satisfies c ( u , u ) = c (1 − u , − u ) for all ≤ u , u ≤ . Otherwise, it is reflection asymmetric often with more probability in the joint upper tail or jointlower tail. Upper tail dependence means that c (1 − u, − u ) = O ( u − ) as u → and lower tail dependence means that c ( u, u ) = O ( u − ) as u → . If ( U , U ) ∼ C for a bivariate copula C , then (1 − U , − U ) ∼ b C ,where b C ( u , u ) = u + u − C (1 − u , − u ) is the survival or reflected copula of C ; this “reflection”4f each uniform U (0 , random variable about / changes the direction of tail asymmetry. Under someregularity conditions (e.g., existing finite density in the interior of the unit square, ultimately monotone in thetail), if there exists κ L ( C ) > and some L ( u ) that is slowly varying at + (i.e., L ( ut ) L ( u ) ∼ , as u → + for all t > ), then κ L ( C ) is the lower tail order of C . The upper tail order κ U ( C ) can be defined by thereflection of ( U , U ) , i.e., C (1 − u, − u ) ∼ u κ U ( C ) L ∗ ( u ) as u → + , where C is the survival function ofthe copula and L ∗ ( u ) is a slowly varying function. With κ = κ L or κ U , a bivariate copula has intermediate taildependence if κ ∈ (1 , , tail dependence if κ = 1 , and tail quadrant independence if κ = 2 with L ( u ) beingasymptomatically a constant.After briefly providing definitions of tail dependence and tail order we provide below a list of bivariateparametric copulas with varying tail behaviour: • Reflection symmetric copulas with intermediate tail dependence such as the BVN copula with κ L = κ U = 2 / (1 + θ ) , where θ is the copula (correlation) parameter. • Reflection symmetric copulas with tail quadrant independence ( κ L = κ U = 2 ), such as the Frank copula. • Reflection asymmetric copulas with upper tail dependence only such as – the Gumbel copula with κ L = 2 /θ and κ U = 1 , where θ is the copula paprameter; – the Joe copula with κ L = 2 and κ U = 1 . • Reflection symmetric copulas with tail dependence, such as the t ν copula with κ L = κ U = 2 − T ν +1 (cid:16) − p ( ν + 1)(1 − θ ) / (1 + θ ) (cid:17) , where θ is the correlation parameter of the bivariate t distribu-tion with ν degrees of freedom, and T ν is the univariate t cdf with ν degrees of freedom. • Reflection asymmetric copulas with upper and lower tail dependence that can range independently from0 to 1, such as the BB1 and BB7 copulas with κ L = 1 and κ U = 1 . • Reflection asymmetric copulas with tail quadrant independence, such as the the BB8 and BB10 copulas.The BVN, Frank, and t ν are comprehensive copulas, i.e., they interpolate between countermonotonicity(perfect negative dependence) to comonotonicity (perfect positive dependence). The other aforementionedparametric families of copulas, namely Gumbel, Joe, BB1, BB7, BB8 and BB10 interpolate between indepen-dence and perfect positive dependence. Nevertheless, negative dependence can be obtained from these copulasby considering reflection of one of the uniform random variables on (0, 1). If ( U , U ) ∼ C for a bivariatecopula C with positive dependence, then 5 (1 − U , U ) ∼ b C (1) , where b C (1) ( u , u ) = u − C (1 − u , u ) is the 1-reflected copula of C withnegative lower-upper tail dependence; • ( U , − U ) ∼ b C (2) , where b C (2) ( u , u ) = u − C ( u , − u ) is the 2-reflected copula of C withnegative upper-lower dependence. Negative upper-lower tail dependence means that c (1 − u, u ) = O ( u − ) as u → + and negative lower-uppertail dependence means that c ( u, − u ) = O ( u − ) as u → + (Joe, 2011). Figure 1: Contour plots of bivariate copulas with standard normal margins and dependence parameters corresponding to Kendall’s τ value of 0.5. BVN Frank t −3 −2 −1 0 1 2 3 − − − −3 −2 −1 0 1 2 3 − − − −3 −2 −1 0 1 2 3 − − − Joe reflected Joe 1-reflected Joe −3 −2 −1 0 1 2 3 − − − −3 −2 −1 0 1 2 3 − − − −3 −2 −1 0 1 2 3 − − − BB10 reflected BB10 2-reflected BB10 −3 −2 −1 0 1 2 3 − − − −3 −2 −1 0 1 2 3 − − − −3 −2 −1 0 1 2 3 − − − In Figure 1, to depict the concepts of refection symmetric or asymmetric tail dependence or quadrant tail6ndependence, we plot contour plots of the corresponding copula densities with standard normal margins anddependence parameters corresponding to Kendall’s τ value of 0.5. Suppose that data are y ij , j = 1 , . . . , d, i = 1 , . . . , n , where i is an index for individuals or clusters and j isan index for the within-cluster measurements. For continuous random variables, we estimate each marginaldistribution non-parametrically by the empirical distribution function of Y j , viz. F j ( y ij ) = 1 n + 1 n X i =1 ( Y ij ≤ y ij ) = R ij / ( n + 1) , where R ij denotes the rank of Y ij as in the semi-parametric estimation of Genest et al. (1995) and Shih and Louis(1995). Hence we allow the distribution of the continuous margins to be quite free and not restricted by para-metric families.Nevertheless, rank-based methods cannot be used for discrete variables with copulas (Genest and Neˇslehov´a,2007). Hence, • For an ordinal response variable Y j we use a univariate probit model (Agresti, 2010, Section 3.3.2). Let Z j be a standard normal latent variable, such that Y j = y j if α y j − ,j ≤ Z j ≤ α y j j , y j = 1 , . . . , K j , where K j is the number of categories of Y j (without loss of generality, we assume α j = −∞ and α K j j = ∞ ). From this definition, the ordinal response Y j is assumed to have density f j ( y j ; γ j ) = Φ( α y j j ) − Φ( α y j − ,j ) , where γ j = ( a j , . . . , a K j − ,j ) is the vector of the univariate cutpoints. • For a count response variable Y j we use the negative binomial distribution (Lawless, 1987). It allows forover-dispersion and its probability mass function is f j ( y j ; γ j ) = Γ( ξ − + y j )Γ( ξ − ) y j ! µ yj ξ yj (1 + ξ − ) ξ − + y j , y j = 0 , , , . . . , µ > , ξ > , where γ j = { µ j , ξ j } is the vector with the mean and dispersion parameters. In the limit ξ → thenegative binomial reduces to Poisson, which belongs to the exponential family of distributions and it isthe only distribution for count data that existing latent variable models for mixed data can accommodate.7o this end, for a discrete random variable Y j , we approach estimation by maximizing the univariate log-likelihoods ℓ j ( γ j ) = n X i =1 log f j ( y ij ; γ j ) over the vector of the univariate parameters γ j . That is equivalent with the first step of the IFM method in Joe(1997, 2005).After estimating the univariate marginal distributions we proceed to estimation of the dependence param-eters. For the 1-factor and 2-factor models, we let C X j and C X j be parametric bivariate copulas, say withdependence parameters θ j and δ j , respectively. Let also θ = { γ j , θ j : j = 1 , . . . , d } and θ = { γ j , θ j , δ j : j = 1 , . . . , d } to denote the set of all parameters for the 1- and 2-factor model, respectively. Estimation can beachieved by maximizing the joint log-likelihood ℓ Y ( θ ) = n X i =1 log f Y ( y i , . . . , y id ; θ ) . (4)over the copula parameters θ j or δ j , j = 1 , . . . , d with the univariate parameters/distributions fixed as estimatedat the first step of the proposed two-step estimation approach. The estimated parameters can be obtained byusing a quasi-Newton Nash (1990) method applied to the logarithm of the joint likelihood. This numericalmethod requires only the objective function, i.e., the logarithm of the joint likelihood, while the gradients arecomputed numerically and the Hessian matrix of the second order derivatives is updated in each iteration. Thestandard errors (SE) of the estimates can be obtained via the gradients and the Hessian computed numericallyduring the maximization process. These SEs are adequate to assess the flatness of the log-likelihood. ProperSEs that account for the estimation of univariate parameters can be obtained by maximizing the joint likelihoodin (4) at one step over θ .For factor copula models numerical evaluation of the joint density f Y ( y ; θ ) can be easily done using Gauss-Legendre quadrature (Stroud and Secrest, 1966). To compute one-dimensional integrals for the 1-factor model,we use the following approximation: f Y ( y ) = Z d Y j =1 f j | X ( y j | x ) dx ≈ n q X q =1 w q d Y j =1 f j | X ( y j | x q ) , where { x q : q = 1 , . . . , n q } are the quadrature points and { w q : q = 1 , . . . , n q } are the quadrature weights. Tocompute two-dimensional integrals for the 2-factor model, the approximation uses Gauss-Legendre quadrature8oints in a double sum: f Y ( y ) = Z Z d Y j =1 f X j | X (cid:0) x , y j | x (cid:1) dx dx ≈ n q X q =1 n q X q =1 w q w q d Y j =1 f X j | X (cid:0) x q , y j | x q (cid:1) . With Gauss-Legendre quadrature, the same nodes and weights are used for different functions; this helpsin yielding smooth numerical derivatives for numerical optimization via quasi-Newton (Nash, 1990). Ourcomparisons show that n q = 25 is adequate with good precision. In this section we propose an heuristic method that automatically selects the bivariate parametric copula familiesthat link the observed to the latent variables. For multivariate mixed data, it is infeasible to estimate all possiblecombinations of bivariate parametric copula families, and compare them on the basis of information criteria. Wedevelop an algorithm that can quickly select a factor copula model that accurately captures the (tail) dependencefeatures in the data on hand. The linking copulas at each factor are selected with a sequential algorithm underthe initial assumption that linking copulas are Frank, and in then sequentially copulas with non-tail quadrantindependence are assigned to any of pairs where necessary to account for tail asymmetry (discrete data) or taildependence (continuous data). Tail dependence of discrete variables is not clear, because maxima of discretevariables may not converge to a nondegenerate distribution (Feidt et al., 2010).For the 1-factor model, the proposed model selection algorithm is summarized in the following steps:1. For j = 1 , . . . , d estimate the marginal distributions F j ( y ) .2. Fit the 1-factor copula model with Frank copulas to link each of the d observed variables with the latentvariable, i.e., maximise the log-likelihood function of the factor copula model in (4) over the vector ofcopula parameters ( θ , . . . , θ d ) .3. If the j th linking copula has ˆ θ j > , then select a set of copula candidates with ability to interpolatebetween independence and comonotonicity, otherwise select a set of copula candidates with ability tointerpolate between countermonotonicity and independence.4. For j = 1 , . . . , d , 9a) Fit all the possible 1-factor copula models iterating over all the copula candidates for the j th vari-able.(b) Select the copula family that corresponds to the lowest information criterion, say the Akaike, thatis AIC = − × ℓ + 2 × copula parameters.(c) Fix the selected linking copula family for the j th variable.(d) Iterate through step (a) – (c) to select the copulas that link all the observed variables to the 1st factor.For more than one factors we can select the appropriate linking copulas accordingly. We first select copulafamilies in the first factor, and then we proceed to the next factor and apply exactly the same algorithm. Factor copula models with different bivariate linking copulas can be compared via the log-likelihood or AICat the maximum likelihood estimate. In addition, we will use the Vuong’s test (Vuong, 1989) to show if afactor copula model provides better fit than the standard factor model with a latent additive structure, that isa factor copula model with BVN bivariate linking copulas (Krupskii and Joe, 2013; Nikoloulopoulos and Joe,2015). The Vuong’s test is the sample version of the difference in Kullback-Leibler divergence between twomodels and can be used to differentiate two parametric models which could be non-nested. This test has beenused extensively in the copula literature to compare vine copula models (e.g., Brechmann et al. 2012; Joe 2014;Nikoloulopoulos 2017). We provide specific details in Section 5.1.Furthermore, to assess the overall goodness-of-fit of the factor copula models for mixed data, we will useappropriately the limited information M statistic (Maydeu-Olivares and Joe, 2006). The M statistic has beendeveloped for goodness-of-fit testing in multidimensional contingency tables. Nikoloulopoulos and Joe (2015)has used the M statistic to assess the goodness-of-fit of factor copula models for ordinal data. We build onthe aforementioned papers and propose a methodology to assess the overall goodness-of-fit of factor copulamodels for mixed continuous and discrete responses. Since the M statistic has been developed for multivariateordinal data, we propose to first transform the continuous and count variables to ordinal and then calculate the M statistic at the maximum likelihood estimate before discretization. We provide the specifics for the M statistic in Section 5.2. In this subsection, we summarize the Vuong’s test for comparing parametric models (Vuong, 1989). Assumethat we have Models 1 and 2 with parametric densities f (1) Y and f (2) Y , respectively. We can compare10 f Y = n − h n X i =1 n E f Y log f Y ( y i ) − E f Y log f (1) Y ( y i ; θ ) oi , ∆ f Y = n − h n X i =1 n E f Y log f Y ( y i ) − E f Y log f (2) Y ( y i ; θ ) oi . where θ , θ are the parameters in Models 1 and 2, respectively, that lead to the closest Kullback-Leiblerdivergence to the true f Y ; equivalently, they are the limits in probability of the MLEs based on Models 1 and2, respectively.Model 1 is closer to the true f Y , i.e., is the better fitting model if ∆ = ∆ f Y − ∆ f Y < , and Model 2 isthe better fitting model if ∆ > . The sample version of ∆ with MLEs b θ , b θ is ¯ D = n X i =1 D i /n, where D i = log (cid:20) f (2) Y ( y i ; b θ ) f (1) Y ( y i ; b θ ) (cid:21) . Vuong (1989) has shown that asymptotically that √ n ¯ D/s ∼ N (0 , , where s = n − P ni =1 ( D i − ¯ D ) . Hence, its confidence interval (CI) is ¯ D ± . × √ n σ . goodness-of-fit statistic In this subsection to self maintain the paper we provide the form of the limited information M statistic inMaydeu-Olivares and Joe (2006). Assume d ordinal variables Y , . . . , Y d where the j th (1 ≤ j ≤ d ) variableconsists of K j ≥ categories labelled as , , . . . , K j − . Consider the set of univariate and bivariate residualsthat do not include category 0. This is a residual vector of dimension s = d X j =1 ( K j −
1) + X ≤ j
CI (0.025,0.034) (-0.014,0.005) (0.272,0.309) M p -value < . < . < . < . ¶ t Frank Selected ˆ τ ˆ τ SE ˆ τ SE copulas ˆ τ SE1st factorfood 0.61 0.34 0.03 0.48 0.01 BB10 0.38 0.00clothes 0.51 0.32 0.03 0.42 0.01 BB10 0.36 0.01leisure 0.49 0.35 0.02 0.42 0.01 BB10 0.38 0.01dishwasher 0.14 -0.07 0.03 0.08 0.01 reflected Joe 0.19 0.02car 0.12 -0.13 0.03 0.07 0.01 reflected Joe 0.09 0.01motorcycle 0.01 -0.10 0.02 -0.08 0.01 Frank 0.02 0.01bicycles 0.07 -0.10 0.02 -0.05 0.01 Frank 0.04 0.012nd factorfood 0.36 0.66 0.01 0.66 0.01 BB10 0.53 0.01clothes 0.18 0.46 0.02 0.40 0.01 BVN 0.28 0.01leisure 0.07 0.41 0.02 0.36 0.01 BB10 0.30 0.01dishwasher 0.33 0.37 0.01 0.26 0.01 BVN 0.42 0.01car 0.48 0.46 0.02 0.36 0.01 reflected Joe 0.35 0.01motorcycle 0.19 0.15 0.01 0.21 0.02 reflected Joe 0.17 0.01bicycles 0.27 0.27 0.01 0.31 0.01 reflected Gumbel 0.27 0.01AIC 54243.96 53466.18 53514.74 46244.56Vuong
CI (0.033,0.045) (0.027,0.045) (0.387,0.419) M p -value < . < . < . < . ¶ : The resulting model is the same as the standard factor model. rotation, the unique loadings show that one factor is loaded only on the discrete variables (dishwasher, car,motorcycle, and bicycles), while both factors are loaded on the continuous variables (food, clothes, and leisure).This reveals that the one latent variable which is only associated with the continuous variables measures theexpenses, while the the other which is associated with all the mixed variables measures the possession.18 able 4: Maximum deviations D j j of observed and expected counts for each bivariate margin ( j , j ) for the 1- and 2-factor copulamodels for the Swiss consumption survey dataset. D j ,j BVN t Frank Selected BVN t Frank Selected D ,
347 317 303 167 349 311 270 40 D ,
511 468 456 183 509 460 428 133 D ,
158 177 163 70 159 185 161 56 D ,
231 189 223 119 233 181 230 60 D ,
87 117 88 60 87 130 72 12 D ,
78 92 79 88 78 110 89 81 D ,
442 418 431 69 433 403 393 54 D ,
59 80 84 145 38 56 64 86 D ,
96 107 107 201 60 47 93 36 D ,
18 3 18 27 19 15 29 39 D ,
51 76 60 83 49 91 52 61 D ,
182 146 141 196 253 216 168 83 D ,
82 105 106 191 59 13 83 61 D ,
59 58 69 71 13 23 27 45 D ,
62 54 64 103 65 67 69 59 D ,
289 276 286 223 66 74 207 2 D , D ,
82 81 81 88 28 20 46 54 D ,
111 123 111 77 15 22 19 20 D ,
101 96 95 68 33 25 40 64 D ,
70 74 70 61 80 96 87 52
An extensive simulation study is conducted (a) to gauge the small-sample efficiency of the proposed estimationmethod, (b) to examine the reliability of using the heuristic algorithm to select the correct bivariate linkingcopulas, and (c) to study the small-sample performance of the M statistic after transforming the continuousand count variables to categorical.We set the type of the variables, the univariate margins and the bivariate linking copulas, along with theirunivariate and dependence parameters to mimic the data analysed in Section 6. The binary variables don’t havetail asymmetries, hence parametric copulas are less distinguishable. Therefore instead of binary, we simulatedfrom ordinal with 3 categories. We randomly generated samples of size n = { , , } from the selectedfactor copula models.Table 5 contains the resultant biases, root mean square errors (RMSE), and standard deviations (SD), scaledby n , for the estimates obtained using the estimation approach in Section 3. The results show that the proposedestimation approach is highly efficient according to the simulated biases, SDs and RMSEs.Table 6 contains four common nominal levels of the M statistic under the factor copula models for mixed19 able 5: Small sample of sizes n = { , , } simulations ( replications) from the selected factor copula models in Section 6 with resultant biases, root mean square errors (RMSE) and standarddeviations (SD), scaled by n , for the estimated parameters. Political-economic dataset – 1-factor model τ -0.51 0.58 0.80 0.68 0.74 n
100 300 500 100 300 500 100 300 500 100 300 500 100 300 500 n Bias 0.88 2.30 3.17 -1.36 -3.39 -4.87 0.75 -0.55 0.64 0.21 -0.27 0.19 0.29 2.57 0.73 n SD 4.28 7.60 9.63 4.19 7.50 9.08 5.41 10.91 11.98 4.58 8.43 9.84 4.46 14.92 11.78 n RMSE 4.37 7.95 10.13 4.40 8.23 10.31 5.47 10.92 12.00 4.59 8.44 9.84 4.47 15.13 11.80General social survey – 1-factor model τ n
100 300 500 100 300 500 100 300 500 100 300 500 100 300 500 100 300 500 100 300 500 n Bias -0.11 -0.86 -1.66 -0.06 -0.10 -0.19 0.29 0.17 0.53 0.19 0.26 0.27 0.72 0.89 0.94 -0.18 -0.37 -0.30 -0.09 -0.03 -0.03 n SD 7.46 12.41 16.01 6.55 11.12 14.00 8.53 13.76 17.97 8.33 14.07 17.75 9.45 14.63 18.92 6.89 11.89 15.05 7.75 12.90 16.54 n RMSE 7.46 12.44 16.10 6.55 11.12 14.00 8.53 13.76 17.98 8.33 14.07 17.75 9.47 14.65 18.94 6.89 11.90 15.05 7.75 12.90 16.54Swiss consumption survey – 1-factor model τ n
100 300 500 100 300 500 100 300 500 100 300 500 100 300 500 100 300 500 100 300 500 n Bias -15.95 -0.78 -0.04 -7.85 -1.57 -3.22 -8.03 -1.62 -3.22 0.08 0.09 0.26 0.10 0.06 -0.11 0.23 0.12 0.40 0.13 0.06 0.20 n SD 8.81 9.93 13.16 9.58 6.24 7.98 9.54 6.52 8.11 7.69 13.02 16.80 7.72 13.02 17.02 7.46 13.01 16.90 7.51 12.79 16.67 n RMSE 18.23 9.96 13.16 12.38 6.43 8.60 12.47 6.72 8.73 7.69 13.02 16.81 7.72 13.02 17.02 7.46 13.01 16.91 7.51 12.79 16.67General social survey – 2-factor model n = 500 τ n Bias 1.18 -7.19 1.40 0.31 0.19 1.45 -0.44 -0.96 0.19 -0.05 0.22 2.59 -2.47 0.00 n SD 16.17 17.21 19.25 18.83 18.63 19.05 17.66 18.52 18.32 22.72 17.68 26.90 21.77 16.33 n RMSE 16.21 18.65 19.30 18.84 18.63 19.11 17.67 18.54 18.32 22.72 17.68 27.03 21.91 16.33Swiss consumption survey – 2-factor model n = 500 τ n Bias -2.31 -1.60 -0.69 -3.01 -1.04 -0.54 2.89 -4.27 0.64 1.00 3.41 1.27 0.59 -4.15 n SD 7.43 13.67 16.12 27.11 25.31 21.27 21.31 20.37 17.98 19.05 21.20 20.89 19.41 21.55 n RMSE 7.78 13.77 16.14 27.27 25.33 21.28 21.51 20.82 17.99 19.08 21.47 20.93 19.42 21.95 ata with continuous and count data transformed to ordinal with K = { , , } categories. The observed levelsare close to nominal levels. Hence, it is demonstrated that the M statistic remains reliable for mixed data andthat the information loss under discretization is minimal. Table 6: Small sample of sizes n = { , , } distribution for M ( replications). Empirical rejection levels at α = { . , . , . , . } , degrees of freedom (df), and mean under the factor copula models. Continuous and count variables aretransformed to ordinal with K = { , , } categories. n = 100 n = 300 n = 500 K = 3 K = 4 K = 5 K = 3 K = 4 K = 5 K = 3 K = 4 K = 5 Political-economic dataset – 1-factor modeldf 92 121 152 92 121 152 92 121 152mean 89.3 100.2 148.0 90.6 118.1 151.6 100.9 119.1 151.4 α = 0 . α = 0 . α = 0 . α = 0 . α = 0 . α = 0 . α = 0 . α = 0 . α = 0 . α = 0 . α = 0 . α = 0 . n = 500 n = 500 K = 3 K = 4 K = 5 K = 3 K = 4 K = 5 df 154 232 322 65 119 185mean 154.8 234.0 323.3 65.6 119.7 185.5 α = 0 . α = 0 . α = 0 . α = 0 . able 7: Frequencies of the true bivariate copula identified using the model selection algorithm from 100 simulation runs. Political-economic dataset – 1-factor modelContinuous Ordinal n n Joe 2-reflected Joe t t reflected Gumbel 2-reflected Joe 2-reflected Gumbel100 68 63 27 19 27 56 28300 89 79 41 43 49 65 55500 91 85 61 65 74 73 68Swiss consumption survey – 1-factor modelContinuous Ordinal Count n reflected BB10 BB10 BB10 reflected Joe reflected Joe reflected Joe reflected Joe100 27 94 91 61 60 41 56300 50 99 98 64 71 63 68500 70 98 98 68 74 71 72General social survey – 2-factor model1st Factor Continuous Ordinal Count n reflected Gumbel reflected Joe BVN 1-reflected Joe 1-reflected Joe reflected Joe Gumbel100 22 40 10 19 19 50 6300 26 52 11 42 36 79 16500 19 67 13 52 53 83 392nd Factor Continuous Ordinal Count n Gumbel 2-reflected Joe reflected Joe Gumbel t BVN 2-reflected Gumbel100 13 28 28 7 14 21 17300 26 39 56 30 45 28 47500 32 67 65 53 59 33 70Swiss consumption survey – 2-factor model1st Factor Continuous Ordinal Count n BB10 BB10 BB10 reflected Joe reflected Joe Frank Frank100 57 77 55 31 28 23 34300 81 94 82 51 40 19 21500 88 94 87 49 50 21 162nd Factor Continuous Ordinal Count n BB10 BVN BB10 BVN reflected Joe reflected Joe reflected Gumbel100 5 14 28 10 29 31 10300 27 29 43 22 49 40 16500 39 39 60 31 55 63 31 amongst parametric families of copulas (Nikoloulopoulos and Karlis, 2008). For example, • in the results from the 2-factor model for the general social survey, the true copula for the first continuousvariable (1st factor) is the reflected Gumbel with τ = 0 . and is only selected a considerable small22umber of times. The algorithm instead selected with a high probability the reflected Joe (results notshown here due to space constraints), because both reflected Joe and Gumbel copulas provide similardependence properties, i.e., lower tail dependence. • in the results from the 2-factor model for the Swiss consumption survey, the variables with Frank copulashave the lowest selection rates. This is due to the fact that their true Kendall’s τ ’s parameters are close to0 (independence). We have extended the factor copula models proposed in Krupskii and Joe (2013) and Nikoloulopoulos and Joe(2015) to the case of mixed continuous and discrete responses. They include the standard factor model as aspecial case and they can be seen to provide a substantial improvement over the latter on the basis of the log-likelihood principle, Vuong’s and M statistics. Hence, superior statistical inference for the loading parametersof interest can be achieved.This improvement relies on the fact that the latent variable distribution is expressed via factor copu-las instead of the MVN distribution. The latter is restricted to linear and reflection symmetric dependence.Rizopoulos and Moustaki (2008) stressed that the inadequacy of normally distributed latent variables can becaused by the non-linear dependence on the latent variables. The factor copula can provide flexible reflectionasymmetric tail and non-linear dependence as it is a truncated canonical vine copula (Brechmann et al., 2012)rooted at the latent variables. Joe et al. (2010) show that in order for a vine copula to have (tail) dependence forall bivariate margins, it is only necessary for the bivariate copulas in level 1 to have (tail) dependence and it isnot necessary for the conditional bivariate copulas in levels , . . . , d − to have tail dependence. The 1-factorcopula has bivariate copulas with tail dependence in the 1st level and independence copulas in all the remaininglevels of the vine (truncated after the 1st level). The 2-factor copula has bivariate copulas with tail dependencein the 1st and 2nd level and independence copulas in all the remaining levels (truncated after the 2nd level).Even in the cases, where the effect of misspecifying the bivariate linking copula choice to build the factorcopula models can be seen as minimal for the Kendall’s τ (loading) parameters, the tail dependence varies, asexplained in Section 2.1, and is a property to consider when choosing amongst different families of copulas andhence affects prediction. Rabe-Hesketh et al. (2003) highlighted the importance of the correct distributionalassumptions for the prediction of latent scores. The estimated density of latent scores is simply the estimateddensity of latent variables. The latent scores will essentially show the effect of different model assumptions23 igure 3: Comparison of the political-economic risk rankings obtained via our selected model, the standard factor model, and themixed-data factor analysis of Quinn (2004). Factor analysis of Quinn (2004) S e l e c t ed f a c t o r c opu l a m ode l Factor analysis of Quinn (2004) F a c t o r c opu l a m ode l w i t h BV N Factor copula model with BVN S e l e c t ed f a c t o r c opu l a m ode l because it is an inference that depends on the joint distribution. Figure 3 demonstrates these effects by revisitingthe political-economic dataset in Section 6.1 and comparing the political-economic risk ranking obtained viaour selected model, the factor copula model with BVN copulas (standard factor model), and the mixed-datafactor analysis of Quinn (2004). Note that even between the factor copula model with BVN copulas and thefactor analysis model of Quinn (2004), there are small to moderate changes, because while these models sharethe same latent variables distribution, the former model does not assume the observed variables to be normallydistributed, but rather uses the empirical distribution of the continuous observed variables, i.e. allows themargins to be quite free and not restricted by normal distribution. The differences at the lower panel graph aresolely due the miss-specification the latent variable distribution.As stated by many researchers (e.g., Rabe-Hesketh and Skrondal 2001, 2004), the major difficulty of all24he models with latent variables is identifiability. This issue as acknowledged by Hastie et al. (2001, page 494)has left many analysts skeptical of factor models, and may account for its lack of popularity in contemporarystatistics. For example, for the standard factor model or the more flexible model in Irincheeva et al. (2012b)one of loadings in the second factor has to be set to zero, because the model with d loadings is not identifiable.The standard factor model arises as special case of our model if we use as bivariate linking copulas the BVNcopulas. Hence, for the 2-factor copula model with BVN copulas, one of the BVN copulas in the secondfactor has to be set as an independence copula. However, using other than BVN copulas, the 2-factor copulamodel is near-identifiable with d bivariate linking copulas as it has been shown by Krupskii and Joe (2013)and Nikoloulopoulos and Joe (2015).We tackled issues that particularly interest the social data analyst as model selection. Model selectionin previous papers on factor copula models was mainly based on simple diagnostics that are informative ifthe variables are of the same type, e.g. item response (ordinal) data in Nikoloulopoulos and Joe (2015) orfinancial data in Krupskii and Joe (2013). This is not the case for mixed continuous and discrete data as thecontinuous data might have reflection asymmetric tail dependence, while the extreme-value behavior of discretedistributions is often degenerate (Feidt et al., 2010). As regard as to the issue of goodness-of-fit testing, weproposed a technique that is based on the M goodness-of-fit statistic (Maydeu-Olivares and Joe, 2006) inmultidimensional contingency tables to overcome the shortage of goodness-of-fit statistics for mixed continuousand discrete response data (e.g., Moustaki and Knott 2000). Acknowledgements
We would like to thank Professor Harry Joe (University of British Columbia) for comments leading to animproved presentation and Dr Irina Irincheeva (University of Bern) and Professor Marc Genton (King AbdullahUniversity of Science and Technology) for providing the Swiss consumption survey dataset. The simulationspresented in this paper were carried out on the High Performance Computing Cluster supported by the Researchand Specialist Computing Support service at the University of East Anglia.
References
Agresti, A. (2010).
Analysis of Ordinal Categorical Data . John Wiley & Sons.Brechmann, E. C., Czado, C., and Aas, K. (2012). Truncated regular vines in high dimensions with applicationsto financial data.
Canadian Journal of Statistics , 40(1):68–85.Feidt, A., Genest, C., and Neˇslehov´a, J. (2010). Asymptotics of joint maxima for discontinuous random vari-ables.
Extremes , 13:35–53. 25enest, C., Ghoudi, K., and Rivest, L.-P. (1995). A semiparametric estimation procedure of dependence pa-rameters in multivariate families of distributions.
Biometrika , 82:543–552.Genest, C. and Neˇslehov´a, J. (2007). A primer on copulas for count data.
The Astin Bulletin , 37:475–515.Hastie, T., Tibshirani, R., and Friedman, J. (2001).
The elements of statistical learning . Springer.He, J., Li, H., Edmondson, A. C., Rader, D. J., and Li, M. (2012). A gaussian copula approach for the analysisof secondary phenotypes in case-control genetic association studies.
Biostatistics , 13:497–508.Hoff, P. D. (2007). Extending the rank likelihood for semiparametric copula estimation.
The Annals of AppliedStatistics , 1(1):265–283.Hua, L. and Joe, H. (2011). Tail order and intermediate tail dependence of multivariate copulas.
Journal ofMultivariate Analysis , 102(10):1454–1471.Huber, P., Ronchetti, E., and Victoria-Feser, M.-P. (2004). Estimation of generalized linear latent variablemodels.
Journal of the Royal Statistical Society. Series B (Statistical Methodology) , 66(4):893–908.Irincheeva, I., Cantoni, E., and Genton, M. (2012a). A non-gaussian spatial generalized linear latent variablemodel.
Journal of Agricultural, Biological, and Environmental Statistics , 17(3):332–353.Irincheeva, I., Cantoni, E., and Genton, M. G. (2012b). Generalized linear latent variable models with flexibledistribution of latent variables.
Scandinavian Journal of Statistics , 39(4):663–680.Jiryaie, F., Withanage, N., Wu, B., and de Leon, A. (2016). Gaussian copula distributions for mixed data, withapplication in discrimination.
Journal of Statistical Computation and Simulation , 86(9):1643–1659.Joe, H. (1993). Parametric families of multivariate distributions with given margins.
Journal of MultivariateAnalysis , 46(2):262 – 282.Joe, H. (1997).
Multivariate Models and Dependence Concepts.
Chapman & Hall, London.Joe, H. (2005). Asymptotic efficiency of the two-stage estimation method for copula-based models.
Journal ofMultivariate Analysis , 94:401–419.Joe, H. (2011). Tail dependence in vine copulae. In Kurowicka, D. and Joe, H., editors,
Dependence Modeling:Vine Copula Handbook , pages 165–187, Singapore. World Scientific.Joe, H. (2014).
Dependence Modelling with Copulas . Chapman and Hall/CRC.Joe, H., Li, H., and Nikoloulopoulos, A. K. (2010). Tail dependence functions and vine copulas.
Journal ofMultivariate Analysis , 101(1):252 – 270.Kaiser, H. F. (1958). The varimax criterion for analytic rotation in factor analysis.
Psychometrika , 23(3):187–200.Krupskii, P. and Joe, H. (2013). Factor copula models for multivariate data.
Journal of Multivariate Analysis ,120:85–101.Lawless, J. F. (1987). Negative binomial and mixed Poisson regression.
The Canadian Journal of Statistics ,15(3):209–225.Lee, S.-Y., Poon, W.-Y., and Bentler, P. M. (1992). Structural equation models with continuous and polytomous26ariables.
Psychometrika , 57(1):89–105.Ma, Y. and Genton, M. G. (2010). Explicit estimating equations for semiparametric generalized linear latentvariable models.
Journal of the Royal Statistical Society: Series B (Statistical Methodology) , 72(4):475–495.Martin, A. D., Quinn, K. M., and Park, J. H. (2011). MCMCpack: Markov chain monte carlo in R.
Journal ofStatistical Software , 42(9):22.Maydeu-Olivares, A. and Joe, H. (2006). Limited information goodness-of-fit testing in multidimensionalcontingency tables.
Psychometrika , 71(4):713–732.Maydeu-Olivares, A. and Joe, H. (2014). Assessing approximate fit in categorical data analysis.
MultivariateBehavioral Research , 49(4):305–328.Montanari, A. and Viroli, C. (2010). A skew-normal factor model for the analysis of student satisfaction towardsuniversity courses.
Journal of Applied Statistics , 37(3):473–487.Moustaki, I. (1996). A latent trait and a latent class model for mixed observed variables.
British Journal ofMathematical and Statistical Psychology , 49:313–334.Moustaki, I. and Knott, M. (2000). Generalized latent trait models.
Psychometrika , 65(3):391–411.Moustaki, I. and Victoria-Feser, M.-P. (2006). Bounded-influence robust estimation in generalized linear latentvariable models.
Journal of the American Statistical Association , 101(474):644–653.Murray, J. S., Dunson, D. B., Carin, L., and Lucas, J. E. (2013). Bayesian gaussian copula factor models formixed data.
Journal of the American Statistical Association , 108(502):656–665.Muth´en, B. (1984). A general structural equation model with dichotomous, ordered categorical, and continuouslatent variable indicators.
Psychometrika , 49(1):115–132.Nash, J. (1990).
Compact Numerical Methods for Computers: Linear Algebra and Function Minimisation .Hilger, New York. 2nd edition.Nikoloulopoulos, A. K. (2013). On the estimation of normal copula discrete regression models using thecontinuous extension and simulated likelihood.
Journal of Statistical Planning and Inference , 143:1923–1937.Nikoloulopoulos, A. K. (2016). Efficient estimation of high-dimensional multivariate normal copula modelswith discrete spatial responses.
Stochastic Environmental Research and Risk Assessment , 30(2):493–505.Nikoloulopoulos, A. K. (2017). A vine copula mixed effect model for trivariate meta-analysis of diagnostic testaccuracy studies accounting for disease prevalence.
Statistical Methods in Medical Research , 26(5):2270–2286.Nikoloulopoulos, A. K. and Joe, H. (2015). Factor copula models for item response data.
Psychometrika ,80(1):126–150.Nikoloulopoulos, A. K., Joe, H., and Li, H. (2012). Vine copulas with asymmetric tail dependence and appli-cations to financial return data.
Computational Statistics & Data Analysis , 56:3659–3673.Nikoloulopoulos, A. K. and Karlis, D. (2008). Copula model evaluation based on parametric bootstrap.
Com- utational Statistics & Data Analysis , 52:3342–3353.Panagiotelis, A., Czado, C., Joe, H., and St¨ober, J. (2017). Model selection for discrete regular vine copulas. Computational Statistics & Data Analysis , 106:138–152.Quinn, K. M. (2004). Bayesian factor analysis for mixed ordinal and continuous responses.
Political Analysis ,12(4):338–353.Rabe-Hesketh, S., Pickles, A., and Skrondal, A. (2003). Correcting for covariate measurement error in logisticregression using nonparametric maximum likelihood estimation.
Statistical Modelling , 3(3):215–232.Rabe-Hesketh, S. and Skrondal, A. (2001). Parameterization of multivariate random effects models for cate-gorical data.
Biometrics , 57(4):1256–1263.Rabe-Hesketh, S. and Skrondal, A. (2004).
Generalized latent variable modeling: multilevel, longitudinal, andstructural equation models . CRC Press, Boca Raton.Rizopoulos, D. and Moustaki, I. (2008). Generalized latent variable models with non-linear effects.
BritishJournal of Mathematical and Statistical Psychology , 61(2):415–438.Shen, C. and Weissfeld, L. (2006). A copula model for repeated measurements with non-ignorable non-monotone missing outcome.
Statistics in medicine , 25(14):2427–40.Shih, J. H. and Louis, T. A. (1995). Inferences on the association parameter in copula models for bivariatesurvival data.
Biometrics , 51:1384–1399.Sklar, A. (1959). Fonctions de r´epartition `a n dimensions et leurs marges. Publications de l’Institut de Statis-tique de l’Universit´e de Paris , 8:229–231.Smith, M. S. and Khaled, M. A. (2012). Estimation of copula models with discrete margins via bayesian dataaugmentation.
Journal of the American Statistical Association , 107(497):290–303.Song, P. X.-K., Li, M., and Yuan, Y. (2009). Joint regression analysis of correlated data using gaussian copulas.
Biometrics , 65(1):60–68.St¨ober, J., Hong, H. G., Czado, C., and Ghosh, P. (2015). Comorbidity of chronic diseases in the elderly:Patterns identified by a copula design for mixed responses.
Computational Statistics & Data Analysis , 88:28–39.Stroud, A. and Secrest, D. (1966).
Gaussian Quadrature Formulas . Prentice-Hall, Englewood Cliffs, NJ.Vuong, Q. H. (1989). Likelihood ratio tests for model selection and non-nested hypotheses.
Econometrica ,57(2):307–333.Wedel, M. and Kamakura, W. A. (2001). Factor analysis with (mixed) observed and latent variables in theexponential family.
Psychometrika , 66(4):515–530.Zilko, A. A. and Kurowicka, D. (2016). Copula in a multivariate mixed discrete–continuous model.