Mixed-type multivariate response regression with covariance estimation
A unified method for multivariate mixed-type response regression
Karl Oskar Ekvall
Division of Biostatistics, Institute of Environmental Medicine, Karolinska Institute, [email protected]
Aaron J. Molstad
Department of Statistics, Genetics Institute, University of Florida, [email protected]
Abstract
We propose a new method for multivariate response regressions where the elements of the response vector can be of mixed types, for example some continuous and some discrete. Our method is based on a model which assumes the observable mixed-type response vector is connected to a latent multivariate normal linear regression through a link function. We explore the properties of this model and show its parameters are identifiable under reasonable conditions. We propose an algorithm for approximate maximum likelihood estimation that works "off-the-shelf" with many different combinations of response types, and which scales well in the dimension of the response vector. Our method typically gives better predictions and parameter estimates than fitting separate models for the different response types and allows for approximate likelihood ratio testing of relevant hypotheses such as independence of responses. The usefulness of the proposed method is illustrated using simulations and through three data examples.

1 Introduction
In many regression applications, there are multiple response variables of mixed types: some are continuous and some discrete, some are necessarily positive and some need not be, etc. Joint modeling of the responses can lead to more efficient estimation, better prediction, and allows for the testing of joint hypotheses without the need for multiple testing corrections. Popular regression models, however, typically assume all responses are of the same type. For example, the classical multivariate normal linear regression model assumes responses are conditionally multivariate normally distributed given the predictors. Modeling mixed-type responses is more complicated, and over the years a substantial effort has been made to address this. Many methods for specific combinations of response types, such as some Bernoulli and some normal, some Poisson and some normal, and so on, have been proposed (Olkin and Tate, 1961; Poon and Lee, 1987; Catalano and Ryan, 1992; Cox and Wermuth, 1992; Fitzmaurice and Laird, 1995; Catalano, 1997; Fitzmaurice and Laird, 1997; Gueorguieva and Agresti, 2001; de Leon, 2005; Gueorguieva and Sanacora, 2006; Yang et al., 2007; Faes et al., 2008). Models that allow for many different response distributions have also been considered (Sammel et al., 1997; Dunson, 2000; Gueorguieva, 2001; Rabe-Hesketh et al., 2004; de Leon and Carrière, 2007; Goldstein et al., 2009; Bonat and Jørgensen, 2016; Ekvall and Jones, 2020; Bai et al., 2020; Kang et al., 2020).

Existing methods for mixed-type responses typically either (i) assume dependence between responses can be parsimoniously parameterized, (ii) can be prohibitively time consuming to fit unless there are very few dependent responses, or (iii) are fit using algorithms that require substantial user modification depending on which types of responses are modeled. For example, multivariate generalized linear mixed models assume dependence between mixed-type responses is due to random effects.
Usually, the random effects are fewer than the responses and have a distribution depending on a relatively small number of parameters. Such parsimonious structures can sometimes be motivated by subject-specific knowledge or the sampling design, and they are convenient from a computational perspective since estimates can be obtained by unconstrained maximization of a likelihood (approximation) using off-the-shelf solvers. However, they also lead to restrictions on the joint distribution of the responses that can be difficult to fully appreciate and which complicate inference. For example, the strength of dependence between observations is often determined by the same parameters determining marginal means and variances, and it is often not clear which parameters are identifiable (Jiang, 2007; Lele et al., 2010).

To address these issues, we propose a method for multivariate mixed-type response regressions which

1. gives better predictions and estimates than fitting separate models for the different response types in a wide range of settings,
2. allows for the testing of relevant joint hypotheses while avoiding the multiple testing burden,
3. works with many different combinations of mixed-type responses off-the-shelf, and
4. is fast enough to be practically useful.

Our method is based on a model which assumes a latent multivariate linear regression is connected to observable responses through a link function. Specifically, a response vector Y ∈ R^r, non-stochastic predictor vector x ∈ R^p, and latent vector W ∈ R^r satisfy, for some B ∈ R^{p×r},

g{E(Y | W)} = W and W = B^T x + ε,  (1)

where ε ∼ N(0, Σ) and g : R^r → R^r is a known, injective link function. The elements of Y are conditionally uncorrelated given W, with conditional variances that can depend on W in a way to be specified. Our interest is in inference on (B, Σ) and the prediction of new responses, using n independent observations (y_i, x_i) ∈ R^r × R^p, i = 1, ..., n, from (1).
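To fix ideas, the following sketch simulates data from (1) with one identity-link (normal) response and one log-link (Poisson) response; the function name and all parameter values are illustrative, not from the paper.

```python
import numpy as np

# A minimal simulation sketch of model (1): one continuous response with
# identity link and unit conditional variance, one count response with log link.
def simulate_mixed(n, B, Sigma, seed=None):
    rng = np.random.default_rng(seed)
    p, r = B.shape
    X = rng.normal(size=(n, p))                                  # predictors x_i
    W = X @ B + rng.multivariate_normal(np.zeros(r), Sigma, n)   # latent W = B^T x + eps
    y1 = rng.normal(W[:, 0], 1.0)       # continuous response, identity link
    y2 = rng.poisson(np.exp(W[:, 1]))   # count response, log link
    return np.column_stack([y1, y2]), X

Y, X = simulate_mixed(1000, np.array([[0.5, 0.2]]), np.array([[1.0, 0.5], [0.5, 1.0]]), seed=0)
```

Because the two responses share the latent vector W, they are marginally dependent whenever the off-diagonal of Σ is nonzero.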
We do not specify exactly how n, r, and p compare but have in mind settings where n is substantially larger than r and at least as large as p. Some intuition for the model can be gained by noting that the multivariate normal linear regression model is a special case of (1) where Y = W and, hence, g is the identity. Here, however, interest is primarily in cases where g is not the identity and cov(Y | W) is positive definite; in particular, W is truly latent and not a deterministic function of Y.

In general, our methods only require specifying the first two moments of Y | W, but we will also discuss fully specified parametric models consistent with (1). For example, a multivariate generalized linear mixed model where each response Y_j has its own random intercept ε_j and linear predictor W_j = B_j^T x + ε_j, where B_j is the jth column of B, satisfies (1). McCulloch (2008) studied special cases of that model and noted its potential for modeling mixed-type responses, and similar models have been considered for repeated measures (dependent) data (Jaffa et al., 2016). Despite the mathematical connection, there are fundamental differences between our setting and those where mixed models are typically considered. In particular, in our setting the number of latent variables is the same as the number of responses (r), their covariance matrix is unstructured, and we have n independent observations of a response vector and its corresponding predictors. A particularly useful property of our parameterization, which we discuss in Section 2, is that off-diagonal elements of Σ affect the covariances of responses but not their means or variances. Moreover, if an off-diagonal element is zero, then the corresponding responses are uncorrelated, and, under regularity conditions, the covariance between two responses Y_j and Y_k is a strictly increasing function of Σ_jk.
We use these observations to establish identifiability of the parameters and to design a test for whether responses are uncorrelated, without that hypothesis also implying restrictions on means and variances. If the responses are assumed to be conditionally independent given W, then this is also a test of independence. The following example illustrates the model.

Example 1.
Suppose we observe n independently sampled bivariate responses, consisting of one continuous and one (non-negative) count variable, and a single predictor, x ∈ R. A possible version of (1) for these data takes B ∈ R^{1×2}, W = B^T x + ε, and

E(Y | W) = [W_1, exp(W_2)]^T and cov(Y | W) = diag{1, exp(W_2)},

which corresponds to g(t) = [t_1, log(t_2)]^T, t = [t_1, t_2]^T. One way to obtain this version of (1) from a fully specified parametric model is to assume the conditional distribution of Y_1 | W is normal with mean W_1 and variance 1; and that of Y_2 | W is Poisson with mean (and hence variance) exp(W_2). Then, assuming Y_1 and Y_2 are conditionally independent given W, the marginal distribution of Y = [Y_1, Y_2]^T is characterized by

f_{B,Σ}(y) ∝ ∫_{R^2} exp{ −(y_1 − w_1)^2/2 + y_2 w_2 − e^{w_2} − (w − B^T x)^T Σ^{−1} (w − B^T x)/2 } dw,

where y = [y_1, y_2]^T and w = [w_1, w_2]^T. The density f_{B,Σ}(y) is a model for the distribution of a mixed-type response vector and how it is affected by the predictor. We emphasize that this model can be useful even if W has no practical interpretation. Indeed, we do not assign meaning to W in general but focus on the resulting distribution of Y.

We propose an algorithm for fitting (1) that, loosely speaking, iteratively fits a sequence of multivariate normal models whose moments approximate those implied by (1). This gives an algorithm that on a high level is similar to penalized quasi-likelihood (Breslow and Clayton, 1993; Breslow, 2004) but for mixed-type responses. In each step of the algorithm, an objective function is minimized by (block) coordinate descent in B and Σ. The update for B is a least squares problem and hence has a closed form solution. The update for Σ is a more complicated optimization problem over a set of symmetric and positive (semi-)definite matrices and is fundamentally different from optimization problems solved by common mixed models packages.
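Returning to Example 1, the unnormalized marginal density f_{B,Σ}(y) has no closed form but can be evaluated by direct numerical integration; a minimal sketch (truncating R^2 to [−10, 10]^2 is our choice, not from the paper):

```python
import numpy as np
from scipy import integrate, linalg

def marginal_density_unnormalized(y, x, B, Sigma):
    """Unnormalized f_{B,Sigma}(y) of Example 1 by 2D quadrature."""
    mu = (B.T @ np.atleast_1d(x)).ravel()        # latent mean B^T x
    Sinv = linalg.inv(Sigma)
    def integrand(w2, w1):                       # inner variable first for dblquad
        w = np.array([w1, w2])
        quad = (w - mu) @ Sinv @ (w - mu)
        return np.exp(-0.5 * (y[0] - w1) ** 2 + y[1] * w2 - np.exp(w2) - 0.5 * quad)
    val, _ = integrate.dblquad(integrand, -10.0, 10.0, lambda w1: -10.0, lambda w1: 10.0)
    return val

val = marginal_density_unnormalized(np.array([0.5, 1.0]), np.array([0.2]),
                                    np.array([[0.5, 0.2]]), np.array([[1.0, 0.3], [0.3, 1.0]]))
```

Quadrature is feasible here only because r = 2; for larger r the fitting algorithm described below avoids such integrals entirely.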
We solve this problem using an accelerated projected gradient descent algorithm. Because this algorithm does not require the computation of second derivatives, it scales well in the dimension of the response vector. The algorithm natively supports restrictions on Σ that can be expressed as Σ ∈ M for a set M such that the projection onto it can be computed, and this is essential for our method: first, some combinations of responses require identifiability restrictions on Σ, and these are straightforward to incorporate in the projection step of our algorithm. Secondly, we develop a procedure for approximate likelihood ratio testing which uses the projection step to impose null hypothesis restrictions such as independence between responses. Similarly, if one has subject-specific knowledge suggesting a particular structure for Σ such as, say, a first order autoregressive structure, then one can take M = {Σ ∈ R^{r×r} : Σ_ij = ρ^{|i−j|}, ρ ∈ [−1, 1]}.

In Section 2 we examine properties of (1) in more detail and give conditions that ensure the parameters are identifiable. Section 3 contains the proposed algorithm and details on its implementation. Section 4 describes an approximate likelihood ratio testing procedure, and in Section 5 we examine the prediction, estimation, and testing performance of the proposed method using simulations. Data examples are in Section 6.

Because it makes our development no more difficult, in what follows we consider a slightly more general version of (1):

g{E(Y | W)} = W and W = Xβ + ε,  (2)

where X ∈ R^{r×q} is a non-stochastic design matrix and β ∈ R^q. The classical multivariate response regression setting in (1) which motivates our study, where each response has the same vector of predictors, is a special case with X = I_r ⊗ x^T, β = vec(B), and q = rp, where ⊗ is the Kronecker product and vec(·) the vectorization operator.
Similarly, a seemingly unrelated regression (Zellner, 1962), where each response has its own predictor vector and coefficient vector, is also a special case. Stochastic predictors whose distribution does not depend on model parameters can be accommodated by conditioning on them in appropriate places. In particular, statements we make about the marginal distribution of Y then instead apply to the distribution of Y | X, assuming ε is independent of X.

We assume the link function g satisfies

g(t) = [g_1(t_1), ..., g_r(t_r)]^T,

where t = [t_1, ..., t_r]^T and g_j : R → R is injective for every j. Thus, the jth latent variable has a direct effect on the jth response but no other responses. For simplicity, we also assume there are known functions c_j : R → R and dispersion parameters ψ_j > 0, j = 1, ..., r, such that

E(Y | W) = [c_1'(W_1), ..., c_r'(W_r)]^T and cov(Y | W) = diag[ψ_1 c_1''(W_1), ..., ψ_r c_r''(W_r)],

where the primes denote derivatives. This assumption is not crucial to our development but makes notation substantially more convenient and is consistent with assuming generalized linear models (McCullagh and Nelder, 1989) for the distributions of Y_j | W_j, j = 1, ..., r. Then, c_j is a conditional cumulant function of the jth response and W_j the corresponding natural "parameter". When ψ_j = 1, the generalized linear model distributions are one-parameter exponential family distributions, as in Example 1. Our setting does not assume ψ_j = 1 for every j, but we do assume the ψ_j are known in what follows.

It is often difficult to interpret parameters in latent variable models, and modeling mixed-type responses makes it more complicated yet. Similarly, it is unclear in many latent variable models which parameters are identifiable. We address some such concerns in this section.
We study two examples in detail and state a more general result that can also be useful for analyzing other settings. The parameters β and Σ are straightforward to interpret in the latent regression, but interpreting them in the marginal distribution of Y implied by (2) requires more work. To that end, note that the mean vector and covariance matrix of Y are, respectively, by iterated expectations,

E(Y) = E{g^{−1}(W)} and cov(Y) = cov{g^{−1}(W)} + E{cov(Y | W)}.  (3)

We make a number of observations based on these expressions: first, because cov(Y | W) is diagonal by assumption, the covariance between responses is determined by the covariance matrix of g^{−1}(W). Second, since E(Y_j) and E(Y_j^2) are determined by the univariate distribution of Y_j, off-diagonal elements of Σ do not affect means and variances of the responses, but typically do affect covariances of different responses. Third, since g^{−1} and cov(Y | W) are non-linear and non-constant in general, E(Y) and E{cov(Y | W)} in general depend on both β and diagonal elements of Σ. Fourth, since var(Y_j) is increasing in ψ_j and cov{g^{−1}(W)} does not depend on ψ, cor(Y_j, Y_k) is decreasing in ψ_j and ψ_k. This is intuitive since responses are conditionally uncorrelated and hence, loosely speaking, a large element of ψ indicates substantial variation in the corresponding response is independent of the variation in the other responses.

In some settings, more precise statements are possible by analyzing closed form expressions for the moments in (3), as the next example illustrates.

Example 2 (Normal and quasi-Poisson moments). Suppose there are r = 4 responses such that

E(Y_j | W) = W_j and var(Y_j | W) = ψ_j, j = 1, 2,
E(Y_j | W) = exp(W_j) and var(Y_j | W) = ψ_j exp(W_j), j = 3, 4.

These moments are consistent with assuming Y_j | W ∼ N(W_j, ψ_j) for j = 1, 2, and, if ψ_3 = ψ_4 = 1, Y_j | W ∼ Poi{exp(W_j)}, j = 3, 4.
When not assuming ψ_3 = ψ_4 = 1, we say these moments are consistent with normal and (conditional) quasi-Poisson distributions. We have picked r = 4 so that the covariance matrix includes covariances and variances for all possible combinations of such responses. We examine the effects of these assumptions on the marginal moments of Y, or the moments of Y | X if X is stochastic. Some algebra shows (Supplementary Material):

E(Y_j) = X_j^T β, j = 1, 2, and E(Y_j) = exp(X_j^T β + Σ_jj/2), j = 3, 4,

and the distinct entries of cov(Y) are, up to changes in subscripts,

var(Y_1) = ψ_1 + Σ_11,
cov(Y_1, Y_2) = Σ_12,
cov(Y_1, Y_3) = Σ_13 e^{X_3^T β + Σ_33/2},
var(Y_3) = ψ_3 e^{X_3^T β + Σ_33/2} + e^{2X_3^T β + Σ_33}(e^{Σ_33} − 1),
cov(Y_3, Y_4) = e^{X_3^T β + X_4^T β + (Σ_33 + Σ_44)/2}(e^{Σ_34} − 1).

Clearly, both E(Y) and cov(Y) depend on β and Σ, but regardless of type, the variance of Y_j is increasing in Σ_jj, the mean is increasing in X_j^T β, and the covariance between Y_j and Y_k is increasing in Σ_jk. We will later use these observations to prove a result which implies β and Σ are identifiable in this example.

Consider the linear dependence between mixed-type responses with conditional normal and quasi-Poisson moments, Y_1 and Y_3, say. The sign of their correlation is the sign of Σ_13, and the squared correlation satisfies, by the Cauchy–Schwarz inequality,

cor(Y_1, Y_3)^2 = Σ_13^2 / [(ψ_1 + Σ_11){e^{Σ_33} − 1 + ψ_3/E(Y_3)}] ≤ Σ_11 Σ_33 / [(ψ_1 + Σ_11){e^{Σ_33} − 1 + ψ_3/E(Y_3)}],

which is upper bounded by Σ_33/(e^{Σ_33} − 1). Thus, strong linear dependence between Y_1 and Y_3 is possible if Σ_33 is small, so that e^{Σ_33} − 1 is not much larger than Σ_33.

To gain some intuition for how two responses with quasi-Poisson moments behave, suppose for simplicity that Σ_33 = Σ_44, ψ_3 = ψ_4, and X_3^T β = X_4^T β. Then the correlation between Y_3 and Y_4 is

(e^{Σ_34} − 1) / {e^{Σ_33} − 1 + ψ_3/E(Y_3)}.

For a small ψ_3, this correlation is approximately (e^{Σ_34} − 1)/(e^{Σ_33} − 1), which, for |Σ_34| ≤ Σ_33, is upper bounded by 1 and lower bounded by {e^{−Σ_33} − 1}/{e^{Σ_33} − 1}. The latter expression tends to −1 as Σ_33 → 0 and to 0 as Σ_33 → ∞.
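The displayed moment formulas can be checked by simulation; a minimal sketch for two quasi-Poisson components (r = 2 here for brevity, ψ = 1, and all values illustrative):

```python
import numpy as np

# Monte Carlo check of the quasi-Poisson moment formulas above: for
# W ~ N(mu, Sigma) and Y_j | W Poisson with mean exp(W_j),
# E(Y_j) = exp(mu_j + Sigma_jj / 2) and
# cov(Y_1, Y_2) = exp(mu_1 + mu_2 + (Sigma_11 + Sigma_22) / 2) * (exp(Sigma_12) - 1).
rng = np.random.default_rng(1)
mu = np.array([0.3, -0.2])
Sigma = np.array([[0.5, 0.2], [0.2, 0.4]])
W = rng.multivariate_normal(mu, Sigma, size=400_000)
Y = rng.poisson(np.exp(W))

mean_theory = np.exp(mu + np.diag(Sigma) / 2)
cov_theory = np.exp(mu.sum() + np.diag(Sigma).sum() / 2) * (np.exp(Sigma[0, 1]) - 1)
```

The empirical means and covariance of Y should match the formulas up to Monte Carlo error.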
Thus, strong negative correlation between Y_3 and Y_4 requires a small Σ_33.

Example 2 is convenient to analyze because the moments have closed form expressions. In more complicated settings, the following result can be useful. It implies that the mean of Y_j and the covariance of Y_j and Y_k are strictly increasing in, respectively, the mean of W_j and the covariance between W_j and W_k. The result is intuitive, but we have not found it stated and proved in the literature.

Lemma 2.1.
Let φ_{μ,Σ} be a bivariate normal density with marginal densities φ_{μ_1,σ_1^2} and φ_{μ_2,σ_2^2} and covariance σ_12 = Σ_12 = Σ_21; then for any increasing, non-constant g, h : R → R, the functions defined by

μ_1 ↦ ∫ g(t) φ_{μ_1,σ_1^2}(t) dt and σ_12 ↦ ∫ g(t_1) h(t_2) φ_{μ,Σ}(t) dt

are, assuming the (Lebesgue) integrals exist, strictly increasing on R and (−σ_1 σ_2, σ_1 σ_2), respectively.

We illustrate the usefulness of this result in another example.
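Before that example, a quick numerical illustration of the second monotonicity in Lemma 2.1 (a sketch, not a proof; the choice g = h = the logistic function and the construction below are ours):

```python
import numpy as np

# With g = h the logistic function and standard normal marginals,
# E{g(T_1) h(T_2)} should increase with the correlation of (T_1, T_2).
rng = np.random.default_rng(2)
z1, z2 = rng.normal(size=(2, 200_000))
logistic = lambda t: 1.0 / (1.0 + np.exp(-t))

def cross_moment(rho):
    # common-factor construction: (T_1, T_2) standard bivariate normal, corr rho
    t1, t2 = z1, rho * z1 + np.sqrt(1.0 - rho ** 2) * z2
    return np.mean(logistic(t1) * logistic(t2))

vals = [cross_moment(r) for r in (-0.8, -0.3, 0.0, 0.3, 0.8)]
```

At ρ = 0 the cross moment is E{g(T_1)}E{h(T_2)} = 1/4, and it increases strictly with ρ, as the lemma predicts.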
Example 3 (Normal and Bernoulli responses). Suppose r = 2 with Y_1 | W ∼ N(W_1, ψ_1) and Y_2 Bernoulli distributed with

E(Y_2 | W) = logit^{−1}(W_2) = 1/{1 + exp(−W_2)}.

Suppose also for simplicity there are no predictors but that each response has its own intercept: W = β + ε. The marginal distribution of Y_2 is Bernoulli with

E(Y_2) = ∫ [1/{1 + exp(−β_2 − Σ_22^{1/2} t)}] φ(t) dt,

where φ(·) is the standard normal density. Using this expression, one can show that, if Σ_22 is fixed, E(Y_2) → 0 as β_2 → −∞ and E(Y_2) → 1 as β_2 → ∞. That is, any success probability is attainable by varying β_2 and, hence, some parameter restrictions are needed for identifiability. One possibility, which has been used in similar settings, is to fix Σ_22 to some known value, say 1 (Dunson, 2000; Bai et al., 2020). While fixing Σ_22 = 1 does not impose any restrictions on the distribution of Y_2 as long as β_2 can vary freely, it may impose restrictions on the joint distribution of Y = [Y_1, Y_2]^T, properties of which we consider next.

Equation (3) implies

cov(Y_1, Y_2) = cov{W_1, logit^{−1}(W_2)} = ∫_{R^2} t_1 / {1 + exp(−t_2)} φ_{β,Σ}(t) dt − β_1 E(Y_2),

where φ_{β,Σ}(·) is the bivariate normal density with mean β and covariance matrix Σ. The integral does not admit a closed form expression, but Lemma 2.1 says the covariance is strictly increasing in Σ_12, which can be used to show the parameters are identifiable in this example if Σ_22 is known (Theorem 2.2).

To understand which values cov(Y_1, Y_2) can take, consider the limiting case as Σ_12 → Σ_11^{1/2} Σ_22^{1/2} and assume for simplicity β_1 = β_2 = 0. In the limit, the covariance matrix is singular and the distribution of W the same as that obtained by letting W_1 = (√Σ_11/√Σ_22) W_2. Then

cov{W_1, logit^{−1}(W_2)} = ∫ Σ_11^{1/2} t / {1 + exp(−Σ_22^{1/2} t)} φ(t) dt.

One can also verify that β_2 = 0 implies E(Y_2) = 1/2 regardless of Σ_22, and hence var(Y_2) = 1/4 and

cor(Y_1, Y_2) = cov{W_1, logit^{−1}(W_2)} / √{var(Y_1) var(Y_2)} = [2/√(ψ_1 + Σ_11)] ∫ Σ_11^{1/2} t / {1 + exp(−Σ_22^{1/2} t)} φ(t) dt.

By using the dominated convergence theorem as ψ_1 → 0 and Σ_22 → ∞, the last right hand side can be shown to tend to, and be upper bounded by, √(2/π) ≈ 0.798. This correlation corresponds to a limiting case and is an upper bound on the attainable correlation between Bernoulli and normal responses.
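A Monte Carlo check of this limiting correlation (a sketch; the specific value Σ_22 = 400 standing in for the limit Σ_22 → ∞, and the sample size, are our choices):

```python
import numpy as np

# Approximating the limiting case of Example 3: beta = 0, psi_1 -> 0,
# Sigma_12 at its maximum so W_1 is proportional to W_2, and Sigma_22 large.
# cor(Y_1, Y_2) should then be close to sqrt(2/pi) ~ 0.798.
rng = np.random.default_rng(3)
w2 = np.sqrt(400.0) * rng.normal(size=1_000_000)   # W_2 ~ N(0, 400)
w1 = w2 / np.sqrt(400.0)                           # Sigma_11 = 1, perfect latent correlation
y1 = w1                                            # psi_1 -> 0 means Y_1 = W_1
y2 = rng.binomial(1, 1.0 / (1.0 + np.exp(-w2)))    # Bernoulli with logit link
corr = np.corrcoef(y1, y2)[0, 1]
```

The observed correlation falls just below the theoretical bound √(2/π), as expected since Σ_22 is large but finite.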
For context, we note √(2/π) is also the correlation between a standard normal random variable, say Z, and a Bernoulli variable that is 1 if Z is greater than zero, and zero otherwise; and it is an upper bound on the correlation between normal and Bernoulli responses sharing a random intercept in a multivariate generalized linear mixed model with a probit link for the Bernoulli responses (McCulloch, 2008, Equation 17).

We conclude this section with two results on identifiability. The results are stated for some common choices of conditional moments of Y | W, but the proof idea can apply also in other settings. Essentially, Lemma 2.1 ensures that β and off-diagonal elements of Σ are identifiable under quite general conditions, but some care is needed to show that the diagonal elements of Σ are identifiable, as the preceding example illustrates. The proof of the following theorem is in the Supplementary Material.

Theorem 2.2.
Suppose {(Y_i, X_i) ∈ R^r × R^{r×q}; i = 1, ..., n} is an independent sample from (2) and that

1. g_j, j = 1, ..., r, is either the identity, natural logarithm, or logit (log-odds) function,
2. Σ_jj is fixed and known for every j corresponding to a logit g_j,
3. the sum of X_i^T X_i over i = 1, ..., n is invertible;

then distinct (β, Σ) correspond to distinct {E(Y), cov(Y)} ∈ R^{rn} × R^{rn×rn}, where Y ∈ R^{rn} is a vector of all responses.

Theorem 2.2 asserts distinct parameters correspond to distinct moments. Identifiability in the usual sense that distinct parameters correspond to distinct distributions is obtained as an immediate corollary by further specifying the family of distributions; we omit the proof.
Corollary 2.3.
If the conditions of Theorem 2.2 hold; the Y_j are conditionally independent given W; and, for every j = 1, ..., r, the conditional distribution of Y_j | W is either normal with mean W_j and variance ψ_j, Poisson with mean exp(W_j), or Bernoulli with mean 1/{1 + exp(−W_j)}; then β and Σ are identifiable.

We propose an algorithm for estimating β and Σ that is based on linearization of the conditional mean function w ↦ ∇c(w) = g^{−1}(w). This idea is essentially equivalent to linearization of the link function g, traditionally used to motivate algorithms for generalized linear models (McCullagh and Nelder, 1989, Section 2.5) and generalized linear mixed models (Schall, 1991). This linearization admits an algorithm that is similar to penalized quasi-likelihood (Breslow and Clayton, 1993; Breslow, 2004). Because the motivating ideas are relatively well known, we give only a brief overview and focus more on solving the resulting optimization problem, which is quite different from those typically considered in the mixed models literature.

Consider the elementwise first order Taylor approximation of g^{−1}(·) = ∇c(·) around an arbitrary w ∈ R^r,

E(Y | W) = ∇c(W) ≈ ∇c(w) + ∇²c(w)(W − w).

Applying expectations and covariances on both sides yields

E(Y) ≈ ∇c(w) + ∇²c(w)(Xβ − w) := m(w, β) and cov{E(Y | W)} ≈ ∇²c(w) Σ ∇²c(w).

Further applying the approximation E{cov(Y | W)} = E{diag(ψ) ∇²c(W)} ≈ diag(ψ) ∇²c(w) leads to an approximate covariance matrix:

cov(Y) ≈ diag(ψ) ∇²c(w) + ∇²c(w) Σ ∇²c(w) := C(w, Σ).

Intuitively, we expect m(w, β) and C(w, Σ) to be good approximations if W takes values near w with high probability. Now, consider a working model which says Y_1, ..., Y_n are independent with

Y_i ∼ N{m(w_i, β), C(w_i, Σ)},  (4)

for observation-specific approximation points w_i ∈ R^r, i = 1, ..., n.
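For concreteness, the working-model moments can be computed as follows for the normal + Poisson pair of Example 1 (a sketch; the function name and values are ours):

```python
import numpy as np

# Working-model moments m(w, beta) and C(w, Sigma) for Example 1, where
# c(w) = (w_1^2/2, exp(w_2)), so grad c(w) = (w_1, exp(w_2)) and the
# Hessian of c at w is diag(1, exp(w_2)).
def working_moments(w, X, beta, Sigma, psi):
    grad_c = np.array([w[0], np.exp(w[1])])
    D = np.diag([1.0, np.exp(w[1])])           # Hessian of c at w
    m = grad_c + D @ (X @ beta - w)            # approximate mean m(w, beta)
    C = np.diag(psi) @ D + D @ Sigma @ D       # approximate covariance C(w, Sigma)
    return m, C

m, C = working_moments(np.array([0.2, 0.1]), np.eye(2), np.array([0.2, 0.1]),
                       np.array([[1.0, 0.5], [0.5, 1.0]]), np.array([1.0, 1.0]))
```

When w = Xβ, the linear term vanishes and m(w, β) reduces to ∇c(w), consistent with the Taylor expansion above.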
The corresponding negative log-likelihood is, up to scaling and additive constants,

h_n(β, Σ | w_1, ..., w_n) = ∑_{i=1}^n [ log det{C(w_i, Σ)} + {y_i − m(w_i, β)}^T C(w_i, Σ)^{−1} {y_i − m(w_i, β)} ].

Minimizers of h_n are approximate maximum likelihood estimates, whose quality depends on the accuracy of the working model (4). Based on these motivations, a natural algorithm for estimating β and Σ would iterate between updating the pair (β, Σ) by minimizing h_n with the w_i held fixed, and then updating the w_i to get a more accurate working model. In our algorithm, we update the w_i by setting them equal to the "posterior" prediction of W_i having observed y_i; that is, the maximizer of the conditional density

f_{β,Σ}(w_i | y_i) ∝ exp{ y_i^T w_i − c(w_i) − (1/2)(w_i − X_i β)^T Σ^{−1} (w_i − X_i β) }.

This update for w_i is closely related to Laplace approximation arguments used to motivate common algorithms for mixed models (see e.g. Breslow, 2004). If β and Σ are the true parameters, the working model approximates the moments of the ith response vector around the mode of the distribution of W_i | Y_i. To summarize, we propose a blockwise iterative algorithm whose (k+1)th iterates are obtained using the updating equations

(β^{(k+1)}, Σ^{(k+1)}) = argmin_{β,Σ} h_n(β, Σ | w_1^{(k)}, ..., w_n^{(k)})  (5)

(w_1^{(k+1)}, ..., w_n^{(k+1)}) = argmax_{w_1,...,w_n} ∑_{i=1}^n log f_{β^{(k+1)},Σ^{(k+1)}}(w_i | y_i).  (6)

This algorithm can be run for a pre-specified number of iterations or until convergence of the β and Σ iterates, for example. While the complete algorithm is not designed to minimize a particular objective function, the individual updates, which we discuss in more detail in the following subsections, minimize objective functions that can be tracked to determine convergence within each update. In our experience, the values of Σ and β tend to converge after (at most) tens of iterations of (5) and (6).

In the following subsections, we first discuss the separate optimization problems in (5) and (6) and then state the full algorithm formally. Computing times are discussed in Section 5.

3.2 Updating β and Σ

To solve the optimization problem in (5), we use a blockwise coordinate descent algorithm. Treating w = {w_1, ..., w_n} as fixed throughout this subsection (and ignoring the iterate counting superscript), this algorithm iterates between updating β with Σ held fixed and vice versa. Specifically, the (l+1)th iterates of the algorithm for solving (5) can be expressed

β^{(l+1)} = argmin_β h_n(β, Σ^{(l)} | w_1, ..., w_n)  (7)

Σ^{(l+1)} = argmin_Σ h_n(β^{(l+1)}, Σ | w_1, ..., w_n).  (8)

The partial minimization with respect to β, (7), is straightforward.
Dropping terms which do not depend on β from h_n, and letting ỹ_i = y_i − ∇c(w_i) + ∇²c(w_i) w_i and X̃_i = ∇²c(w_i) X_i, the objective function from (7) can be expressed

h_n(β, Σ^{(l)} | w_1, ..., w_n) ∝ ∑_{i=1}^n (ỹ_i − X̃_i β)^T C(w_i, Σ^{(l)})^{−1} (ỹ_i − X̃_i β),

which is simply a (weighted) residual sum-of-squares. Hence, (7) can be computed in closed form:

β^{(l+1)} = { ∑_{i=1}^n X̃_i^T C(w_i, Σ^{(l)})^{−1} X̃_i }^{−1} ∑_{i=1}^n X̃_i^T C(w_i, Σ^{(l)})^{−1} ỹ_i.  (9)

Turning to (8), minimizing h_n with respect to Σ is non-trivial owing to the non-convexity of h_n and the constraint that Σ is positive semi-definite. One possibility is to parameterize Σ in a way that lends itself to unconstrained optimization (see e.g. Pinheiro and Bates, 1996) and use a generic solver. However, such parameterizations are inconvenient when imposing additional constraints on Σ. For example, as discussed in Section 2, when some responses are binary we restrict diagonal elements of Σ to be equal to a prespecified constant for identifiability. Similarly, when testing the correlation of response components, we fit our model with certain off-diagonal elements of Σ constrained to be zero. Thus, we need an algorithm that both allows restrictions on the elements of Σ and ensures estimates are positive semi-definite.

By picking an appropriate (convex) M ⊆ R^{r×r}, the optimization problem in (8) can be more precisely characterized as

argmin_Σ h_n(β^{(l+1)}, Σ | w_1, ..., w_n) subject to Σ ∈ M.  (10)

To handle both the non-convexity and the general constraints, we propose to solve (10) using a variation of the inertial proximal algorithm proposed by Ochs et al. (2014). This is an accelerated projected gradient descent algorithm that can be used to minimize an objective function which is the sum of a non-convex smooth function and a convex non-smooth function.
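Before turning to the Σ update, note that the closed-form β update (9) amounts to a generalized least squares solve; a sketch for generic inputs (the function and variable names are ours):

```python
import numpy as np

# The closed-form update (9): accumulate the normal equations of a
# weighted (generalized) least squares problem and solve them.
def beta_update(y_tilde, X_tilde, C_list):
    q = X_tilde[0].shape[1]
    A = np.zeros((q, q))
    b = np.zeros(q)
    for yt, Xt, C in zip(y_tilde, X_tilde, C_list):
        Cinv_Xt = np.linalg.solve(C, Xt)       # C(w_i, Sigma)^{-1} X_tilde_i
        A += Xt.T @ Cinv_Xt                    # sum of X~_i^T C^{-1} X~_i
        b += Cinv_Xt.T @ yt                    # sum of X~_i^T C^{-1} y~_i
    return np.linalg.solve(A, b)

# sanity setup: exact linear data with identity weights reduces to OLS
rng = np.random.default_rng(5)
beta_true = np.array([1.0, -0.5])
X_list = [rng.normal(size=(2, 2)) for _ in range(5)]
y_list = [X @ beta_true for X in X_list]
beta_hat = beta_update(y_list, X_list, [np.eye(2)] * 5)
```

Using `solve` rather than forming C^{−1} explicitly avoids an unnecessary inversion per observation.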
In our case, h_n (as a function of Σ) is the non-convex smooth function, and the convex non-smooth function is the indicator that Σ ∈ M, that is, the function which equals ∞ if Σ ∉ M and zero otherwise. This algorithm, like many popular accelerated first order algorithms, e.g., FISTA (Beck and Teboulle, 2009), uses "inertia" in the sense that the standard projected gradient step is modified to account for the direction of change from Σ^{(t−1)} to Σ^{(t)}, where Σ^{(t)} denotes the tth iterate of the algorithm used to solve (8). This can lead to faster convergence than the standard projected gradient descent algorithm.

To summarize briefly, with β and w held fixed, our algorithm for solving (10) has (t+1)th iterate

Σ^{(t+1)} = P_M[ Σ^{(t)} − α ∇_Σ h_n(β, Σ^{(t)} | w_1, ..., w_n) + γ{Σ^{(t)} − Σ^{(t−1)}} ],

where γ ∈ (0, 1), α is determined using backtracking line search (see Algorithm 4 of Ochs et al. (2014) for a precise description of the backtracking line search conditions), and P_M is the projection onto M. Hence, each update requires computing the projection

P_M(Σ) = argmin_{M ∈ M} ||Σ − M||_F,

where the subscript F indicates the Frobenius norm. We assume in what follows that this projection is defined; it suffices, for example, that M is non-empty, closed, and convex (e.g. Megginson, 1998, Corollary 5.1.19). In our software, we allow M to be the intersection of a constraint set and the set of positive semi-definite matrices with eigenvalues bounded below by ϵ ≥ 0, e.g.,

M = {Σ = Σ^T : Σ ⪰ ϵI_r, Σ_ij = Σ_ji = c_ij for (i, j) ∈ C},

where C denotes the set of indices which are constrained and c_ij is the value which the (i, j)th entry of Σ is constrained to equal. To compute projections onto M of this form, we implement Dykstra's alternating projection algorithm (Boyle and Dykstra, 1986).
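This projection step can be sketched as follows (Dykstra's iteration as we understand it; the tolerance, iteration count, and names are illustrative):

```python
import numpy as np

# Projection onto M = {S = S^T : S >= eps*I, S_ij = c_ij for (i, j) constrained}
# via Dykstra's alternating projections; both individual projections are closed form.
def proj_psd(S, eps):
    # eigenvalue thresholding: projection onto {S = S^T : S >= eps * I}
    vals, vecs = np.linalg.eigh((S + S.T) / 2.0)
    return vecs @ np.diag(np.maximum(vals, eps)) @ vecs.T

def proj_equality(S, constraints):
    # overwrite constrained entries: projection onto the equality constraint set
    S = S.copy()
    for (i, j), c in constraints.items():
        S[i, j] = S[j, i] = c
    return S

def proj_M(S, eps=1e-4, constraints=None, n_iter=100):
    if not constraints:                        # special case C = empty set: closed form
        return proj_psd(S, eps)
    P, p_inc, q_inc = S.copy(), np.zeros_like(S), np.zeros_like(S)
    for _ in range(n_iter):                    # Dykstra's correction-term iteration
        Y = proj_psd(P + p_inc, eps)
        p_inc = P + p_inc - Y
        P = proj_equality(Y + q_inc, constraints)
        q_inc = Y + q_inc - P
    return P

Sig = proj_M(np.array([[-1.0, 0.3], [0.3, 2.0]]), eps=1e-4, constraints={(0, 0): 1.0})
```

Unlike naive alternating projections, Dykstra's correction terms ensure convergence to the point of M closest to the input, not just to some point in the intersection.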
This algorithm iterates between projections onto each of the two sets whose intersection defines M. Because the projection onto {Σ = Σ^T : Σ ⪰ ϵI_r} and the projection onto {Σ = Σ^T : Σ_ij = Σ_ji = c_ij for (i, j) ∈ C} can both be computed in closed form, this algorithm tends to be very efficient. As a special case, if C = ∅, then P_M can be computed in closed form.

Thus, to apply this algorithm, we need only evaluate the gradient of h_n with respect to Σ. Letting r_i = ỹ_i − X̃_i β and D_i = ∇²c(w_i), we can write

h_n(β, Σ | w_1, ..., w_n) = ∑_{i=1}^n [ log det{D_i Σ D_i + D_i diag(ψ)} + r_i^T {D_i Σ D_i + D_i diag(ψ)}^{−1} r_i ].  (11)

Letting C_i(Σ) = {D_i Σ D_i + D_i diag(ψ)}^{−1}, for i = 1, ..., n, routine calculations give

∇_Σ h_n(β, Σ; w_1, ..., w_n) = ∑_{i=1}^n D_i { C_i(Σ) − C_i(Σ) r_i r_i^T C_i(Σ) } D_i.

This algorithm is terminated when the objective function values converge.

3.3 Updating the approximation points

We use a trust region algorithm for updating w_i, i = 1, ..., n, a detailed description of which can be found in Nocedal and Wright (2006). Essentially, the trust region algorithm approximates the objective function locally by a quadratic and requires the computation of gradients and Hessians. These are, for i = 1, ..., n, assuming Σ^{−1} exists,

∇_{w_i} log f(w_i | y_i; β, Σ) = y_i − ∇c(w_i) − Σ^{−1}(w_i − X_i β)

and

∇²_{w_i} log f(w_i | y_i; β, Σ) = −∇²c(w_i) − Σ^{−1}.

Since ∇²c(w_i) and Σ^{−1} are positive definite and the latter does not depend on w_i, the objective function is strongly concave. Thus, it has a unique maximizer and stationary point. In practice, however, Σ can be singular or near-singular, and our experience suggests that, especially in early iterations and for starting values far from the solution, the Hessian −Σ^{−1} − ∇²c(w_i) can be nearly singular or have a very large condition number even if Σ is non-singular.
To improve stability, we suggest regularizing by (i) adding an $L_2$-penalty on $w_i - X_i\beta$ and (ii) replacing $\Sigma$ by $\Sigma_\epsilon = \Sigma + \epsilon I_r$ for some small $\epsilon > 0$ when updating $w_i$. The optimization problem for updating $w_i$ can then be written
$$\arg\min_{w} \Big\{ -y_i^T w + c(w) + \tfrac{1}{2}(w - X_i\beta)^T \Sigma_\epsilon^{-1} (w - X_i\beta) + \tau \|w - X_i\beta\|^2 \Big\},$$
where $\tau \geq 0$. In practice, we have found that setting $\tau$ to the largest eigenvalue of the current iterate of $\Sigma$ works well. The intuition for shrinking $w_i$ toward $X_i\beta$ is that, when $\beta$ is the true parameter, the latter is the mean of $W_i$. The penalty and the regularization of $\Sigma$ are only included in the update for $w_i$, not in the objective function for updating $\beta$ and $\Sigma$. That is, regularization is used to find useful points around which to apply the working model (4), not in the fitting of that model. Since the $n$ objective functions for the $w_i$ are separate, the updates can be done in parallel. We use the implementation in the R package trust (Geyer, 2020). We summarize the steps of our algorithm for estimating $\beta$ and $\Sigma$ in Algorithm 1 below.

Algorithm 1: Blockwise iterative algorithm for estimating $(\beta, \Sigma)$
1. Given $\epsilon_\beta > 0$, $\epsilon_\Sigma > 0$, and $\mathcal{M}$, initialize $\Sigma^{(1)} \in \mathcal{M}$ and $\beta^{(1)} \in \mathbb{R}^q$. Set $k = 1$.
2. $w_i^{(k+1)} \leftarrow \arg\max_{w \in \mathbb{R}^r} \big\{ \log f_{\beta^{(k)}, \Sigma^{(k)}}(w \mid y_i) - \tau \|w - X_i\beta^{(k)}\|^2 \big\}$ for $i = 1, \dots, n$.
3. Set $\tilde\Sigma^{(1)} = \Sigma^{(k)}$. For $l = 1, 2, \dots$ until convergence:
 (a) $\tilde\beta^{(l+1)} \leftarrow \arg\min_\beta h_n(\beta, \tilde\Sigma^{(l)} \mid w_1^{(k+1)}, \dots, w_n^{(k+1)})$ using (9).
 (b) Set $\bar\Sigma^{(0)} = \bar\Sigma^{(1)} = \tilde\Sigma^{(l)}$. For $t = 1, 2, \dots$ until convergence:
  i. $\bar\Sigma^{(t+1)} \leftarrow P_{\mathcal{M}}\big[\bar\Sigma^{(t)} - \alpha \nabla_\Sigma h_n(\tilde\beta^{(l+1)}, \bar\Sigma^{(t)} \mid w_1^{(k+1)}, \dots, w_n^{(k+1)}) + \gamma\{\bar\Sigma^{(t)} - \bar\Sigma^{(t-1)}\}\big]$.
 (c) $\tilde\Sigma^{(l+1)} \leftarrow \bar\Sigma^{(t^*)}$, where $\bar\Sigma^{(t^*)}$ is the final iterate from 3(b).
4. $(\beta^{(k+1)}, \Sigma^{(k+1)}) \leftarrow (\tilde\beta^{(l^*)}, \tilde\Sigma^{(l^*)})$, where $(\tilde\beta^{(l^*)}, \tilde\Sigma^{(l^*)})$ are the final iterates from 3.
5. If $\|\beta^{(k+1)} - \beta^{(k)}\| \leq \epsilon_\beta$ and $\|\Sigma^{(k+1)} - \Sigma^{(k)}\|_F \leq \epsilon_\Sigma$, terminate. Otherwise, set $k \leftarrow k + 1$ and return to 2.

Initial values can affect the final estimates of $(\beta, \Sigma)$. For this reason, we propose a two-step initialization approach which we find leads to good initial values. In the first step, we run Algorithm 1 after initializing $w_i = 0$, $\beta = 0$, and $\Sigma = I_r$, under the restriction that $\Sigma$ is diagonal. Once this algorithm has converged, in the second step, we run Algorithm 1 again, initializing $(\beta, \Sigma)$ and the $w_i$ at their final iterates from the first step. However, we drop the constraint that $\Sigma$ is diagonal and allow $\Sigma$ to be unrestricted (i.e., $\Sigma$ need not belong to $\mathcal{M}$). We also replace steps 3(b)-(c) by a trust region algorithm which often converges quickly but does not guarantee positive semi-definiteness. Once this algorithm has converged, we use the final iterates of $(\beta, \Sigma)$ and the $w_i$ as our initial values for Algorithm 1 under the restriction that $\Sigma \in \mathcal{M}$.
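As an illustration of the projection step used throughout Algorithm 1, the sketch below (Python with NumPy, for illustration only; not the authors' R implementation, and all function names are ours) computes $P_{\mathcal{M}}$ by Dykstra's alternating projections when $\mathcal{M}$ intersects the eigenvalue-floored positive semi-definite cone with a set of fixed entries:

```python
import numpy as np

def proj_psd(S, eps):
    """Project a symmetric matrix onto {Sigma : Sigma >= eps * I} by
    flooring its eigenvalues at eps (closed form)."""
    vals, vecs = np.linalg.eigh(S)
    return vecs @ np.diag(np.maximum(vals, eps)) @ vecs.T

def proj_fixed(S, idx, vals):
    """Project onto {Sigma : Sigma_ij = Sigma_ji = c_ij for (i, j) in C} by
    overwriting the constrained entries (closed form)."""
    S = S.copy()
    for (i, j), c in zip(idx, vals):
        S[i, j] = S[j, i] = c
    return S

def proj_M(S, idx, vals, eps=1e-4, n_iter=100):
    """Dykstra's alternating projections onto the intersection of the two sets."""
    x = S.copy()
    p = np.zeros_like(S)  # correction term for the PSD set
    q = np.zeros_like(S)  # correction term for the fixed-entry set
    for _ in range(n_iter):
        y = proj_psd(x + p, eps)
        p = x + p - y
        x = proj_fixed(y + q, idx, vals)
        q = y + q - x
    return x

# One inertial projected gradient step for Sigma (alpha from line search):
# Sigma_new = proj_M(Sigma - alpha * grad + gamma * (Sigma - Sigma_prev), idx, vals)
```

Unlike plain alternating projections, Dykstra's correction terms ensure convergence to the nearest point of the intersection in Frobenius norm, which is exactly what the projected gradient step requires.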
In terms of computing time, we found this approach is often faster than running Algorithm 1 directly, and it tends to lead to better estimates of $(\beta, \Sigma)$. If $r$ is relatively large (e.g., as in Section 6.2), the trust region update of $\Sigma$ used to get initial values can be slow, since it requires repeatedly computing a Hessian of dimension $\{r(r+1)/2\} \times \{r(r+1)/2\}$; the second initialization step can then be skipped.

To perform inference on $(\beta, \Sigma)$, we propose a novel procedure for hypothesis testing that is based on the likelihood for the working model (4). We focus on testing hypotheses of the form $(\beta, \Sigma) \in \mathcal{H}_0$ versus $(\beta, \Sigma) \in \mathcal{H}_A$ and propose the test statistic
$$T_n = h_n(\tilde\beta, \tilde\Sigma \mid \tilde w_1, \dots, \tilde w_n) - h_n(\bar\beta, \bar\Sigma \mid \tilde w_1, \dots, \tilde w_n),$$
where $(\tilde\beta, \tilde\Sigma)$ and the approximation points $\tilde w = \{\tilde w_1, \dots, \tilde w_n\}$ are obtained by running Algorithm 1 with the restrictions implied by $\mathcal{H}_0$, and
$$(\bar\beta, \bar\Sigma) = \arg\min_{(\beta, \Sigma) \in \mathcal{H}_A} h_n(\beta, \Sigma \mid \tilde w_1, \dots, \tilde w_n).$$
That is, $(\tilde\beta, \tilde\Sigma)$ and $\tilde w$ are estimates and expansion points, respectively, from fitting the null model, while $\bar\beta$ and $\bar\Sigma$ are obtained by maximizing the working likelihood from (4) with the expansion points fixed at those obtained by fitting the null model. Thus, the statistics $\bar\beta$ and $\bar\Sigma$ are not the same as the unconstrained estimates of $\beta$ and $\Sigma$. We fix the expansion points to ensure $(\tilde\beta, \tilde\Sigma)$ and $(\bar\beta, \bar\Sigma)$ are maximizers of the same working likelihood, but under different restrictions. We chose the null model's expansion points to be conservative; that is, to favor the null hypothesis model. If the working model is accurate, then we expect $T_n$ to be, under the null hypothesis, approximately chi-square distributed with degrees of freedom equal to the number of additional restrictions implied by $\mathcal{H}_0$ relative to $\mathcal{H}_A$. For example, a test of independence between all $r$ responses may take $\mathcal{H}_0 = \{(\beta, \Sigma) \in \mathbb{R}^q \times \mathcal{M} : \Sigma_{ij} = \Sigma_{ji} = 0 \text{ for all } i \neq j\}$, which imposes $(r^2 - r)/2$ restrictions.
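Computing the approximate p-value then only requires the upper tail of the appropriate chi-square distribution. When the number of restrictions is even, as for the independence test with $(r^2 - r)/2 = 36$ restrictions when $r = 9$, the survival function has a closed form, so a sketch needs no statistical libraries (the numeric value of $T_n$ below is made up purely for illustration):

```python
import math

def chi2_sf(x, df):
    """P(X > x) for X chi-square with even df, using the closed form
    exp(-x/2) * sum_{i < df/2} (x/2)^i / i! (the Erlang survival function)."""
    assert df % 2 == 0 and df > 0
    total, term = 0.0, 1.0
    for i in range(df // 2):
        if i > 0:
            term *= (x / 2.0) / i
        total += term
    return math.exp(-x / 2.0) * total

r = 9
df = (r * r - r) // 2     # number of restrictions for the independence test
T_n = 52.3                # hypothetical observed test statistic, for illustration
p_value = chi2_sf(T_n, df)
```

For an odd number of restrictions, one would instead use a regularized incomplete gamma function from a statistical library; the even-df series above is shown only because it is self-contained.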
Null models corresponding to hypotheses that constrain elements of $\Sigma$ are straightforward to fit by simply including those constraints in the definition of the set $\mathcal{M}$ in (10). We summarize the hypothesis testing procedure in Algorithm 2.

Algorithm 2: Hypothesis testing procedure for $(\beta, \Sigma) \in \mathcal{H}_0$ versus $(\beta, \Sigma) \in \mathcal{H}_A$
1. Given $\mathcal{H}_0$ and $\mathcal{H}_A$, initialize $(\beta^{(1)}, \Sigma^{(1)}) \in \mathcal{H}_0$.
2. For $k = 1, 2, \dots$ until convergence:
 (a) $w_i^{(k+1)} \leftarrow \arg\max_{w \in \mathbb{R}^r} \big\{ \log f_{\beta^{(k)}, \Sigma^{(k)}}(w \mid y_i) - \tau \|w - X_i\beta^{(k)}\|^2 \big\}$ for $i = 1, \dots, n$.
 (b) $(\beta^{(k+1)}, \Sigma^{(k+1)}) \leftarrow \arg\min_{(\beta, \Sigma) \in \mathcal{H}_0} h_n(\beta, \Sigma \mid w_1^{(k+1)}, \dots, w_n^{(k+1)})$.
3. Set $(\tilde\beta, \tilde\Sigma) = (\beta^{(k^*)}, \Sigma^{(k^*)})$ and $\tilde w_i = w_i^{(k^*)}$ for $i = 1, \dots, n$, where $k^*$ denotes the final iteration from 2.
4. Compute $(\bar\beta, \bar\Sigma) = \arg\min_{(\beta, \Sigma) \in \mathcal{H}_A} h_n(\beta, \Sigma \mid \tilde w_1, \dots, \tilde w_n)$ using the algorithm for solving (5).
5. Return $T_n = h_n(\tilde\beta, \tilde\Sigma \mid \tilde w_1, \dots, \tilde w_n) - h_n(\bar\beta, \bar\Sigma \mid \tilde w_1, \dots, \tilde w_n)$.

We investigate the size and power of the proposed procedure through extensive numerical experiments in Section 5.4.

In this section, we study the numerical performance of our method from multiple perspectives. In the first subsection, we compare our method to two generalized linear mixed models which model the responses of different types separately. In the second subsection, we focus on comparing two versions of our method: that with $\Sigma$ constrained to be diagonal and that with the off-diagonal elements of $\Sigma$ unconstrained. Finally, in the last subsection, we examine the power and size of our proposed hypothesis testing procedure under various data generating models. Our software, and code that can be used to reproduce the results, are available at https://github.com/koekvall/lvmmr .

We are not aware of any publicly available (or otherwise) software that fits our model outside of special cases. Thus, to evaluate our method, we compare to existing methods which assume related but somewhat different models. We consider $r = 9$ response variables: three conditionally normally distributed, three conditionally Poisson, and three conditionally Bernoulli. A reasonable alternative to our method is to use separate generalized linear mixed models for the three response types. However, common software cannot fit these models due to the unusual random effects structure required; see the Supplementary Material for additional details.
To get a comparison, we consider two simplifications of the type-specific models: (i) $\Sigma_{jj} = \sigma_j^2$ and $\Sigma_{jk} = 0$ for $j \neq k$, or (ii) $\Sigma = \sigma^2 \mathbf{1}\mathbf{1}^T$ for some $\sigma^2 > 0$ and $\mathbf{1} = [1, \dots, 1]^T$. Option (i) corresponds to assuming all responses are independent and that latent variables corresponding to different responses have their own variances, and option (ii) to using a shared random effect for observations in the same cluster, where a cluster here consists of responses of the same type in the same response vector. For short, we refer to these as, respectively, independent and clustered generalized linear mixed models. With these simplifications, there are several software packages that can fit the separate models. Somewhat arbitrarily, we pick the glmm package (Knudson et al., 2020) in R to fit (i), and we fit (ii) using lme4 (Bates et al., 2015) in R. Briefly, the former uses a Monte Carlo approximation of the likelihood and the latter uses adaptive Gaussian quadrature. We emphasize that both are based on models that are misspecified in general in our setting and, accordingly, our experience from experimenting with several software packages is that differences in performance are due to misspecification rather than any particular implementation. We consider two versions of our method, one which constrains $\Sigma$ to be diagonal and one where $\Sigma$ is (nearly) unconstrained. For both versions, $\Sigma$ is positive semi-definite and, following Theorem 2.2, we constrain $\Sigma_{jj} = 1$ for all $j$ corresponding to Bernoulli responses. The true value of $\Sigma_{jj}$ used to generate data is in general not equal to one, however. We do not constrain diagonal elements of $\Sigma$ when fitting independent or clustered generalized linear mixed models because the software used does not support it.

The comparisons focus on out-of-sample prediction errors. Our predictions are formed by plugging estimates into the expressions for $E(Y_i) = E\{E(Y_i \mid W_i)\}$ discussed in Section 2.
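When no closed form is available, each coordinate of such a prediction reduces to a one-dimensional integral over the latent normal. For instance, assuming a Bernoulli coordinate with the canonical (logistic) inverse link, so that $E(Y_{ij} \mid W_{ij}) = 1/(1 + e^{-W_{ij}})$ and $W_{ij} \sim N(\mu, \sigma^2)$ with $\mu = X_{i,j}^T \beta_j$, a midpoint-rule sketch might look as follows (Python, purely illustrative; the logistic form and the function names are our assumptions for this sketch):

```python
import math

def sigmoid(w):
    return 1.0 / (1.0 + math.exp(-w))

def pred_bernoulli(mu, var, n_grid=4000, width=10.0):
    """Approximate E(Y_ij) = E{sigmoid(W_ij)}, W_ij ~ N(mu, var), by a
    midpoint rule over the interval mu +/- width * sd."""
    sd = math.sqrt(var)
    lo = mu - width * sd
    h = 2.0 * width * sd / n_grid
    total = 0.0
    for k in range(n_grid):
        w = lo + (k + 0.5) * h
        dens = math.exp(-0.5 * ((w - mu) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))
        total += sigmoid(w) * dens * h
    return total
```

By symmetry, $E\{\operatorname{sigmoid}(W)\} = 1/2$ exactly when $\mu = 0$, which is a convenient correctness check; $r$ such integrals, one per response, give the full prediction vector.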
When a closed-form expression is unavailable, the expectation is easily obtained by $r$ one-dimensional numerical integrations. As a reference, we compare to (oracle) predictions using the true $\beta$ and $\Sigma$. The responses have different predictors and we partition $\beta$ accordingly: $\beta = [\beta_1^T, \dots, \beta_r^T]^T$ with $\beta_j \in \mathbb{R}^{p_j}$ and $q = \sum_{j=1}^r p_j$, and we write $X_{i,j} \in \mathbb{R}^{p_j}$ for the $i$th observation of the predictors for the $j$th response. Then, $X_i \in \mathbb{R}^{r \times q}$ has $j$th row $[0, \dots, 0, X_{i,j}^T, 0, \dots, 0] \in \mathbb{R}^q$, there being $\sum_{k=1}^{j-1} p_k$ leading zeros and $\sum_{k=j+1}^{r} p_k$ trailing. Thus, $W_{i,j} = X_{i,j}^T \beta_j + \varepsilon_{i,j}$. In all simulations, each $X_{i,j}$ consists of a one in the first element (an intercept) and, in the remaining $p_j - 1$ elements, independent realizations of a Uniform$[-1, 1]$ random variable, where for simplicity we take $p_j = p_k$ for all $j$ and $k$. For $j = 1, \dots, r$, the true regression coefficient $\beta_j$ has first element (the intercept) equal to $\beta_{j1}$ and all other elements chosen as independent realizations of a Uniform$[-0.5, 0.5]$ random variable. We set $\beta_{j1} = 2$ if the response is normal or (quasi-)Poisson, and equal to zero if the response is Bernoulli. Similarly, if the response is normal, we set $\psi_j$ to a fixed value smaller than one; otherwise, we set $\psi_j = 1$.

We consider three different structures for $\Sigma$. Namely, for some $\rho \in (0, 1)$, we set $\Sigma$ equal to a fixed scalar multiple of $\tilde\Sigma$, where $\tilde\Sigma$ is (i) autoregressive, meaning $\tilde\Sigma_{jk} = \rho^{|j-k|}$; (ii) compound symmetric, meaning $\tilde\Sigma_{jk} = \rho I(j \neq k) + I(j = k)$; or (iii) block diagonal, meaning $\tilde\Sigma_{jk} = \rho I(j \neq k) + I(j = k)$ if $j$ and $k$ belong to the same block, where the blocks are $\{1, 4, 7\}$, $\{2, 5, 8\}$, and $\{3, 6, 9\}$, and $\tilde\Sigma_{jk} = 0$ otherwise. The first through third responses are normal, the fourth through sixth Bernoulli, and the seventh through ninth quasi-Poisson. Hence, each block in the block-diagonal structure in (iii) consists of responses of all unique types. To be
To beclear, these structures are used to generate the data but treated as unknown and, hence, not imposed hen fitting models.For each of the structures in Σ , we investigate the effects of the sample size ( n ), the number ofpredictors ( p j , j = 1 , . . . , r ), and the correlation parameter ( ρ ). In Figure 2, we display the averagerelative squared prediction error, which we define as the ratio of each method’s sum of squaredprediction error to the sum of squared prediction error of the oracle prediction. Averages are basedon 500 independent replications, and for each replication, the out of sample predictions are on anindependent test set of observations.Results are displayed in Figure 1. In the top row, we observe that as n increases with p j = 5 and ρ = 0 . , each method’s performance improves relative to the oracle performance. However,across all settings, our method’s performance is best. When the covariance structure is non-sparse (e.g.,autoregressive or compound symmetric), we see that the clustered generalized linear mixed modelscan outperform both our method with the diagonality constraint and the independent generalized linearmixed models. The same relative performances are observed as p increases with n = 200 and ρ = 0 . fixed; and with ρ increasing and ( n, p j ) = (200 , . It is notable that when Σ is block diagonal, bothversions of our method outperform the competitors. In the case of clustered generalized linear mixedmodels, this is due to the fact that the specified covariance structure cannot not well approximate thetrue covariance. For independent generalized linear mixed models, this is likely due to the fact that glmm (nor other software) can impose the identifiability condition on Σ for the Bernoulli responses.Table 1 shows the times to fit our model and the models assumed by glmm and glmer . Ourmethod, notably, scales well in n , is frequently substantially faster than glmm , but somewhat slowerthan glmer . 
These timings are intuitive, as glmm uses a Monte Carlo approximation that is useful for high-dimensional integrals but is not optimized for our setting with many independent observations. By contrast, glmer is fast because it here fits a model which incorrectly assumes there is only one variance parameter; this gives a much simpler optimization problem than the one our method solves, and faster numerical integration than what is possible when there are several possibly dependent random effects.

Table 1: Median computing times (in seconds) for our method with off-diagonals of $\Sigma$ unconstrained (lvmmr), independent generalized linear mixed models fit using glmm, and clustered generalized linear mixed models fit using glmer under the settings considered in the top row of Figure 1. AR(1), BD, and CS correspond to autoregressive, block diagonal, and compound symmetric covariance structures, respectively.

Next, we compare the two versions of our method (off-diagonals of $\Sigma$ constrained versus unconstrained) from the previous subsection in terms of their performance on responses of the different types. Because both versions have correctly specified univariate response distributions and are fit using the same algorithm except for the constraints on $\Sigma$, these simulations investigate the usefulness of joint modeling of mixed-type responses (i.e., of leaving the off-diagonals of $\Sigma$ unconstrained). In the results in Figure 1, averages are taken over all response types; however, it is also of interest to analyze results stratified by type. We generate data in the same manner as in the previous section, except that we set $\psi_j$ to a common value smaller than one for both the normal and the quasi-Poisson responses; see the Supplementary Material for how we generate quasi-Poisson responses with $\psi_j \neq 1$.
Notably, neither glmm nor glmer could accommodate conditionally quasi-Poisson distributed responses.

In Figure 2, we display the average squared prediction errors relative to the oracle prediction. In the first row of Figure 2, we see that as $n$ increases, both methods' relative mean squared prediction errors approach the oracle prediction error. However, the predictions from joint modeling outperform those based on the model constraining $\Sigma$ to be diagonal. The smallest differences between the two methods are observed for the Bernoulli responses; we investigate this point further later in the section. A similar result is observed in the second row of Figure 2, where the number of predictors per response increases. We see that as $p_j$ approaches 10, both methods' relative performance degrades, although for all three response types, the predictions from joint modeling degrade more gradually. Finally, in the bottom row of Figure 2, we display results as $\rho$ increases. When $\rho$ is smallest, especially under the block diagonal and AR(1) models, there is a less substantial difference between the two methods. As $\rho$ approaches 0.95, under each of the covariance structures, the difference between the two methods becomes greater. This result is also observed for the Bernoulli responses, but to a lesser degree than for the normal and quasi-Poisson responses.

Figure 1: Average relative squared prediction errors over 500 independent replications and $r = 9$ response variables as (top row) the sample size varies with $\rho$ fixed and $p_j = 5$ for $j = 1, \dots, 9$; (middle row) the number of predictors varies with $n = 200$ and $\rho$ fixed; and (bottom row) the correlation parameter $\rho$ varies with $n = 200$ and $p_j = 5$ for $j = 1, \dots, 9$. Methods compared are independent generalized linear mixed models using glmm (dark purple, triangle); clustered generalized linear mixed models using glmer (black, dot); our method with $\Sigma$'s off-diagonals constrained to be zero (orange, plus sign); and our method with $\Sigma$'s off-diagonals unconstrained (light purple, square).

In the second part of this simulation study, we focus on modeling many Bernoulli responses and a single normal response. We include these results to highlight the fact that even though the relative mean squared prediction error for the Bernoulli responses is only slightly improved by joint modeling, which was also observed in Figure 2, one can realize substantial prediction accuracy gains for the single normal response variable by exploiting dependence among the components of $W_i$. Data are generated in the same manner as in Section 5.2, except with a single normal response and eight Bernoulli response variables.

Figure 2: Average relative squared prediction errors over 500 independent replications as (top row) the sample size varies with $\rho$ fixed and $p_j = 5$ for $j = 1, \dots, 9$; (middle row) the number of predictors varies with $n = 200$ and $\rho$ fixed; and (bottom row) the correlation $\rho$ varies with $n = 200$ and $p_j = 5$ for $j = 1, \dots, 9$. Purple lines denote predictions based on our method with $\Sigma$'s off-diagonals constrained to be zero; magenta lines represent our method with off-diagonals unconstrained. Triangles denote averages over all normal responses and replications; squares over all quasi-Poisson responses and replications; and circles over all Bernoulli responses and replications.

Results are displayed in Figure 3. As before, we observe that as $\rho$ increases, the difference between joint and separate modeling becomes more apparent. Notably, the relative mean squared prediction error for the single normal response variable improves more dramatically under both the autoregressive and compound symmetric covariance structures. Under the block diagonal covariance, the differences are less apparent.
This agrees with intuition: under the block diagonal covariance structure, the normal response is correlated with only two of the Bernoulli responses, whereas under the other structures it is correlated with all eight Bernoulli responses. Together with the results in Figure 2, these results suggest that substantial efficiency gains can be achieved using our method for joint modeling of mixed-type responses, even when most response variables are binary.

Figure 3: Average relative squared prediction errors over 500 independent replications as the correlation parameter $\rho$ varies with $n = 200$ and $p_j = 5$ for $j = 1, \dots, 9$. Purple lines represent our algorithm with diagonal $\Sigma$ and magenta lines represent unstructured $\Sigma$. Triangles denote the average of the normal response over the replications, and circles denote the average over all Bernoulli responses and replications.

In this section, we examine the approximate likelihood ratio testing procedure described in Algorithm 2. Let $\mathbb{D}^r_{++}$ be the set of $r \times r$ diagonal, positive definite matrices.
We study the size and power of the proposed approximate likelihood ratio tests for
$$H_0: \Sigma \in \mathbb{D}^r_{++} \quad \text{versus} \quad H_A: \Sigma \in \mathbb{S}^r_{++}, \qquad (12)$$
and, assuming all responses have the same predictors as in (1),
$$H_0: B_{kj} = 0 \text{ for every } j = 1, \dots, r \quad \text{versus} \quad H_A: B \in \mathbb{R}^{p \times r}, \qquad (13)$$
where $B_{kj}$ denotes the $k$th predictor's effect on the $j$th response variable, i.e., the $(k, j)$th element of $B$. Without loss of generality, we set $k = 2$. Thus, the null hypothesis in (13) implies that the first predictor (ignoring the intercept) has no effect on any response. This highlights how joint hypotheses are straightforward to test in our model; multiple testing corrections, which are often needed when using separate models for the $r$ responses, are not needed here.

Figure 4: (Top) Proportion of $H_0: \Sigma \in \mathbb{D}^r_{++}$ rejected (out of 2500 independent replications) at the 0.05 level. (Bottom) Proportion of $H_0: B_{kj} = 0$ rejected at the 0.05 level. In both, horizontal dashed and dotted black lines correspond to 0.10 and 0.05, respectively. Colors denote sample sizes: the black solid line corresponds to $n = 200$, the dark blue solid line to $n = 400$, the purple solid line to $n = 600$, the magenta solid line to $n = 800$, and the light orange solid line to $n = 1000$.
Test statistics for (12) and (13) are computed using the procedure in Algorithm 2.

Simulations in this section use data generated as in Section 5.3, but with $X_{i,1} = X_{i,2} = \cdots = X_{i,r}$ for all $i = 1, \dots, n$ and $B = [\beta_1, \dots, \beta_r] \in \mathbb{R}^{p \times r}$. Reported averages are based on 2500 independent replications. In the first setting, we consider $n \in \{200, 400, 600, 800, 1000\}$ and generate responses with $\Sigma_{jk} = \rho^{|j-k|}$ for an equally spaced grid of $\rho$ values starting at $\rho = 0$. In the top row of Figure 4, we display the proportion of rejections at the 0.05 significance level based on the approximate likelihood ratio test statistic, using quantiles of the chi-square distribution with $(r^2 - r)/2$ degrees of freedom. When $\rho = 0$ (i.e., $H_0$ is true), approximately 0.10 of the tests were incorrectly rejected when $n = 200$ under all three covariance structures. For larger $n$, the proportion of incorrect rejections falls below 0.075 for all three covariance structures, and the proportion of rejections approaches 0.05 (the nominal level) as $n$ increases further. As $\rho$ increases, we observe that even with $n = 200$, the proportion of correctly rejected null hypotheses approaches one. The power depends on both the magnitude of $\rho$ and the sample size: as one or both increase, the power increases as well.

In the second setting, we fix $\rho$ and study how the effect size of the $B_{kj}$ affects power when testing (13). To this end, after generating $B$ as in Section 5.3, for $j = 1, \dots, r$ independently, we replace $B_{kj}$ with a realization of a uniform random variable on an interval symmetric about zero whose half-width is controlled by a parameter $\gamma \in [0, 1]$. The second row of Figure 4 shows that when $\gamma = 0$, so that $B_{kj} = 0$ for all $j = 1, \dots, r$, the proportion of rejections is slightly above 0.10 for $n = 200$, but close to 0.05 (the nominal size) for larger $n$. There is also an indication that correlation between responses benefits power. For example, the power curves under compound symmetry tend to be above the corresponding ones under the block diagonal structure.

We analyze a dataset collected on 333 women who were having difficulty becoming pregnant (Cannon et al., 2013).
The goal is to model four mixed-type response variables, all related to the ability to conceive. The predictors are age and three variables related to antral follicles: small antral follicle count, average antral follicle count, and maximum follicle stimulating hormone level. Antral follicle counts can be measured via noninvasive ultrasound and, thus, are often used to model fertility. We standardize predictors before model fitting for ease of interpretation.

The four response variables quantify the ability to conceive in different ways: two are approximately normally distributed (square-root estradiol level and log-total gonadotropin level) and two are counts (number of egg cells and number of embryos). We modeled the latter using our model with conditional quasi-Poisson distributions. Following the simulation studies, we set $\psi_j$ to a small fixed value (a negative power of ten) for the two normally distributed responses and for the two count responses. First, we test the hypothesis $H_0: \Sigma \in \mathbb{D}^4_{++}$ versus $H_A: \Sigma \in \mathbb{S}^4_{++}$ using the test described in Section 5.4 and find strong evidence against the null hypothesis, with a p-value far below any conventional significance level. That is, there is strong evidence suggesting the four responses are not independent given the predictors. Fitting the unrestricted model using our software took less than three seconds on a laptop computer with a 2.3 GHz 8-Core Intel Core i9 processor. The hypothesis testing procedure took less than six seconds on the same machine.

We estimated the correlation matrices $\widehat{\mathrm{cor}}(Y_i \mid X_i = \bar X)$ for the four observed responses and $\widehat{\mathrm{cor}}(W_i \mid X_i)$ for the four latent variables, where the response variable ordering is square-root estradiol level, log-total gonadotropin level, number of egg cells, and number of embryos. In particular, the estimated correlation between the number of egg cells and the number of embryos is 0.69 among the observed responses and 0.74 among the latent variables. Of course, $\mathrm{cor}(Y_i \mid X_i)$ depends on $X_i$, and its estimate is here evaluated at $\bar X = \sum_{i=1}^n X_i / n$. The estimates indicate the number of egg cells and the number of embryos are highly dependent, whereas estradiol and gonadotropin levels are negatively correlated with these two variables. In general, the estimated linear dependence is slightly stronger for the latent variables than for the observed responses.

We also test whether the small antral follicle count is a significant predictor of any of the response variables after accounting for age, average antral follicle count, and maximum follicle stimulating hormone level. The number of small antral follicles (2-5 mm) is highly correlated with the total number of antral follicles (2-10 mm), and it has been argued that only the total antral follicle count is needed in practice (La Marca and Sunkara, 2013). Fitting our model with $\Sigma \in \mathbb{S}^4_{++}$, we reject the null hypothesis that the four regression coefficients (one for each response) corresponding to small antral follicle count are all zero.

In our second data analysis, we focus on jointly modeling common somatic mutations and gene expression measured on patients with breast cancer collected by The Cancer Genome Atlas Project (TCGA). A somatic mutation is an alteration in the DNA of a somatic cell. Somatic mutations are
believed to play a central role in the development of cancer. Because somatic mutations modify gene expression, either directly or indirectly, it is natural to model somatic mutations and gene expression jointly.

Figure 5: A heatmap of the estimated correlation matrix for the $W_i \mid X_i$ as described in Section 6.2. Names with the -SM suffix represent somatic mutations; names with the -GEx suffix represent gene expression. Variables were sorted by hierarchical clustering to improve visualization.

The somatic mutations we model are binary variables indicating the presence or absence of a somatic mutation in the region of a particular gene. We focus on the ten genes where somatic mutations were present in more than 5% of the subjects in our dataset. Thus, we have $r = 20$ responses, coming from ten genes, each with one response corresponding to gene expression and one to the presence of a somatic mutation. For gene expression, we model log-transformed RPKM measurements as normal random variables. We treat each patient's age as the lone predictor.

First, we test the covariance matrix for block-diagonality. Under $H_0$, we assume the entries of $\Sigma$ corresponding to the correlations between somatic mutations and gene expression measurements are zero (i.e., we assume there is no correlation between somatic mutations and gene expression). We observe a test statistic of approximately $T_n = 1016$, for a p-value far below any conventional significance level.

In Figure 5, we display the estimated correlation matrix for the $W_i \mid X_i$.

Table 2: Regression coefficient estimates (i.e., $\hat B$) for the three predictors and two response variables in the OAI data analysis. The rightmost column gives the p-value for the test that the corresponding row of $B$ is entirely zero.

Coefficient   WOMAC score   Days missed   p-value
Intercept     -0.23822      -4.67790
BMI            0.02885       0.15758      < 0.05
Age            0.00172      -0.03269      > 0.05
Sex            0.13314      -0.31994      < 0.05

We observe that the
latent variables corresponding to somatic mutations and gene expression in CDH1 are highly negatively correlated, whereas for GATA3, the somatic mutation and gene expression latent variables have a strong positive correlation. Latent variables for many of the somatic mutations are highly correlated (e.g., TTN, MLL3, MUC4, MUC12, MUC16). However, latent variables corresponding to some somatic mutations, e.g., those in the region of TP53, exhibit small or even negative correlations with many others (e.g., GATA3, CDH1, PIK3CA).

In this section, we analyze data collected through the Osteoarthritis Initiative (OAI), a prospective observational study of knee osteoarthritis progression ( nda.nih.gov/oai/ ). Following McCulloch (2008), we model two outcome variables: the Western Ontario and McMaster Universities disability score (WOMAC) and the number of days of work missed in the three months preceding data collection. The WOMAC scores are modeled as normal random variables after adding one and performing a log-transformation, whereas the numbers of days of work missed are treated as quasi-Poisson random variables. To model these data, we consider BMI, age, and sex as predictors. As in the data analysis in Section 6.1, we set $\psi_j$ to a small fixed value (a negative power of ten) for the normally distributed response and for the quasi-Poisson response.

The goal of our analysis was to test for the effect of each of the three predictors on both responses simultaneously. Our analysis included only those subjects who had no missingness in either response variables or predictors, so that $n = 1602$. Fitting the full model to the data, we obtain the coefficient estimates listed in Table 2. Based on the results, we would conclude that both BMI and Sex are significant predictors for both response variables, while Age did not reach the 0.05 significance cutoff.
Acknowledgements

We are grateful to Galin Jones and Adam Rothman for their comments on an earlier version of the paper. We thank Charles McCulloch for sharing the OAI dataset analyzed in Section 6.3. Substantial parts of the work were done while Ekvall was at the Vienna University of Technology (TU Wien), supported by the Austrian Science Fund (FWF) [P30690-N35].
References
Bai, H., Zhong, Y., Gao, X., and Xu, W. (2020). Multivariate mixed response model with pairwise composite-likelihood method. Stats, 3(3).
Bates, D., Mächler, M., Bolker, B., and Walker, S. (2015). Fitting linear mixed-effects models using lme4. Journal of Statistical Software, 67(1).
Beck, A. and Teboulle, M. (2009). A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1).
Bonat, W. H. and Jørgensen, B. (2016). Multivariate covariance generalized linear models. Journal of the Royal Statistical Society: Series C (Applied Statistics), 65(5).
Boyle, J. P. and Dykstra, R. L. (1986). A method for finding projections onto the intersection of convex sets in Hilbert spaces. In Dykstra, R., Robertson, T., and Wright, F. T., editors, Advances in Order Restricted Statistical Inference, Lecture Notes in Statistics, New York, NY. Springer.
Breslow, N. (2004). Whither PQL? In Proceedings of the Second Seattle Symposium in Biostatistics. Springer, New York.
Breslow, N. E. and Clayton, D. G. (1993). Approximate inference in generalized linear mixed models. Journal of the American Statistical Association, 88(421).
Cannon, A. R., Cobb, G. W., Hartlaub, B. A., Legler, J. M., Lock, R. H., Moore, T. L., Rossman, A. J., and Witmer, J. (2013). Stat2: Building Models for a World of Data. W. H. Freeman and Company, New York.
Catalano, P. J. (1997). Bivariate modelling of clustered continuous and ordered categorical outcomes. Statistics in Medicine, 16(8).
Catalano, P. J. and Ryan, L. M. (1992). Bivariate latent variable models for clustered discrete and continuous outcomes. Journal of the American Statistical Association, 87(419).
Cox, D. R. and Wermuth, N. (1992). Response models for mixed binary and quantitative variables. Biometrika, 79(3).
de Leon, A. R. (2005). Pairwise likelihood approach to grouped continuous model and its extension. Statistics & Probability Letters, 75(1).
de Leon, A. R. and Carrière, K. C. (2007). General mixed-data model: Extension of general location and grouped continuous models. Canadian Journal of Statistics, 35(4).
Dunson, D. B. (2000). Bayesian latent variable models for clustered mixed outcomes. Journal of the Royal Statistical Society. Series B (Statistical Methodology), 62(2).
Ekvall, K. O. and Jones, G. L. (2020). Consistent maximum likelihood estimation using subsets with applications to multivariate mixed models. Annals of Statistics, 48(2).
Faes, C., Aerts, M., Molenberghs, G., Geys, H., Teuns, G., and Bijnens, L. (2008). A high-dimensional joint model for longitudinal outcomes of different nature. Statistics in Medicine, 27(22).
Fitzmaurice, G. M. and Laird, N. M. (1995). Regression models for a bivariate discrete and continuous outcome with clustering. Journal of the American Statistical Association, 90(431).
Fitzmaurice, G. M. and Laird, N. M. (1997). Regression models for mixed discrete and continuous responses with potentially missing values. Biometrics, 53(1).
Geyer, C. J. (2020). trust: Trust Region Optimization.
Goldstein, H., Carpenter, J., Kenward, M. G., and Levin, K. A. (2009). Multilevel models with multivariate mixed response types. Statistical Modelling, 9(3).
Gueorguieva, R. V. (2001). A multivariate generalized linear mixed model for joint modelling of clustered outcomes in the exponential family. Statistical Modelling, 1(3).
Gueorguieva, R. V. and Agresti, A. (2001). A correlated probit model for joint modeling of clustered binary and continuous responses. Journal of the American Statistical Association, 96(455).
Gueorguieva, R. V. and Sanacora, G. (2006). Joint analysis of repeatedly observed continuous and ordinal measures of disease severity. Statistics in Medicine, 25(8).
Jaffa, M. A., Gebregziabher, M., Luttrell, D. K., Luttrell, L. M., and Jaffa, A. A. (2016). Multivariate generalized linear mixed models with random intercepts to analyze cardiovascular risk markers in type-1 diabetic patients. Journal of Applied Statistics, 43(8).
Jiang, J. (2007). Linear and Generalized Linear Mixed Models and Their Applications. Springer Series in Statistics. Springer-Verlag, New York.
Kang, X., Chen, X., Jin, R., Wu, H., and Deng, X. (2020). Multivariate regression of mixed responses for evaluation of visualization designs. IISE Transactions.
Knudson, C., Benson, S., Geyer, C., and Jones, G. (2020). Likelihood-based inference for generalized linear mixed models: inference with the R package glmm. Stat.
La Marca, A. and Sunkara, S. K. (2013). Individualization of controlled ovarian stimulation in IVF using ovarian reserve markers: from theory to practice. Human Reproduction Update, 20(1).
Lele, S. R., Nadeem, K., and Schmuland, B. (2010). Estimability and likelihood inference for generalized linear mixed models using data cloning. Journal of the American Statistical Association, 105(492).
McCullagh, P. and Nelder, J. A. (1989). Generalized Linear Models. Chapman & Hall/CRC, Boca Raton, FL.
McCulloch, C. E. (2008). Joint modelling of mixed outcome types using latent variables. Statistical Methods in Medical Research, 17(1).
Megginson, R. E. (1998). An Introduction to Banach Space Theory. Number 183 in Graduate Texts in Mathematics. Springer, New York.
Nocedal, J. and Wright, S. (2006). Numerical Optimization. Springer-Verlag GmbH.
Ochs, P., Chen, Y., Brox, T., and Pock, T. (2014). iPiano: inertial proximal algorithm for nonconvex optimization. SIAM Journal on Imaging Sciences, 7(2).
Olkin, I. and Tate, R. F. (1961). Multivariate correlation models with mixed discrete and continuous variables. Annals of Mathematical Statistics, 32(2).
Pinheiro, J. C. and Bates, D. M. (1996). Unconstrained parametrizations for variance-covariance matrices. Statistics and Computing, 6(3).
Poon, W.-Y. and Lee, S.-Y. (1987). Maximum likelihood estimation of multivariate polyserial and polychoric correlation coefficients. Psychometrika, 52(3).
Rabe-Hesketh, S., Skrondal, A., and Pickles, A. (2004). Generalized multilevel structural equation modeling. Psychometrika, 69(2).
Sammel, M. D., Ryan, L. M., and Legler, J. M. (1997). Latent variable models for mixed discrete and continuous outcomes. Journal of the Royal Statistical Society. Series B (Methodological), 59(3).
Schall, R. (1991). Estimation in generalized linear models with random effects. Biometrika, 78(4).
Yang, Y., Kang, J., Mao, K., and Zhang, J. (2007). Regression models for mixed Poisson and continuous longitudinal data. Statistics in Medicine, 26(20).
Zellner, A. (1962). An efficient method of estimating seemingly unrelated regressions and tests for aggregation bias. Journal of the American Statistical Association, 57(298).