The MELODIC family for simultaneous binary logistic regression in a reduced space
Mark de Rooij, Methodology & Statistics Unit, Institute of Psychology, Leiden University
Patrick J. F. Groenen, Econometric Institute, Erasmus School of Economics, Erasmus University
February 17, 2021
Please address correspondence to Mark de Rooij, Methodology and Statistics department, Institute of Psychology, Leiden University, PO Box 9555, 2300 RB Leiden, The Netherlands. Email: rooijm at fsw.leidenuniv.nl

Abstract

Logistic regression is a commonly used method for binary classification. Researchers often have more than a single binary response variable and simultaneous analysis is beneficial because it provides insight into the dependencies among response variables as well as between the predictor variables and the responses. Moreover, in such a simultaneous analysis the equations can lend each other strength, which might increase predictive accuracy. In this paper, we propose the MELODIC family for simultaneous binary logistic regression modeling. In this family, the regression models are defined in a Euclidean space of reduced dimension, based on a distance rule. The model may be interpreted in terms of logistic regression coefficients or in terms of a biplot. We discuss a fast iterative majorization (or MM) algorithm for parameter estimation. Two applications are shown in detail: one relating personality characteristics to drug consumption profiles and one relating personality characteristics to depressive and anxiety disorders. We present a thorough comparison of our MELODIC family with alternative approaches for multivariate binary data.
Keywords: Multivariate logistic regression; Euclidean distance; Multidimensional unfolding; MM algorithm.
Introduction
Logistic regression (Berkson, 1944; Cox, 1958; Agresti, 2003) is one of the most commonly used tools for binary classification. Although the logistic function has been known since the early 19th century, the logistic regression model was developed in the second half of the 20th century (Cramer, 2002). Adaptations of logistic regression models have been developed to make it more flexible, through basis expansion, or less flexible, by means of regularization. For an overview, see Friedman et al. (2001).

Researchers regularly have more than a single response variable. That is, there are applications where several response variables can be predicted from a common set of predictor variables (Breiman and Friedman, 1997). Examples include:

• The analysis of depressive and anxiety disorders. Mental disorders are highly prevalent in modern western societies and a high degree of comorbidity can often be observed among these disorders. In the Netherlands Study for Depression and Anxiety (Penninx et al., 2008) data were collected on a large number of subjects about their personality and about mental disorders (Spinhoven et al., 2009).

• The analysis of drug consumption profiles. Fehrman et al. (2017) are interested in subjects' drug consumption profiles and how these relate to personality characteristics such as sensation seeking and impulsivity. They collected data about the consumption of 18 different drugs.

• In clinical trials, the effect of treatments is established where the outcome can be dichotomous, cured or not cured. Treatments, however, come with side effects, and these can be coded as present or absent. It is important to study treatment and side effects together in order to obtain the whole picture. For an empirical example, see Molenberghs and Verbeke (2006).

• Psychosocial problems frequently occur in young adults. To screen for these problems in community settings, for example during large-scale general health check-ups, the Strengths and Difficulties Questionnaire (SDQ) can be used, as it is a relatively short instrument. The SDQ has two parts: a self-report and a parent-report. To be useful as a screening tool it should have good validity properties, that is, it should be able to predict certain psychosocial problems. Vugteveen et al. (2018) investigated the validity of the SDQ with respect to four diagnoses.

With multiple binary outcomes it is possible to fit a logistic regression model separately for each outcome, but it is often wise to build a single multivariate model. In such a multivariate model the dependencies between the various outcomes can be better understood and strength can be borrowed between the different outcomes. Second, such a multivariate model is more parsimonious in the sense that fewer parameters have to be estimated. Furthermore, estimated regression weights may be better in terms of mean squared error. Stein et al. (1956), for example, showed that simple averages of a multivariate normal distribution are inferior to shrunken averages in terms of mean squared error, where the shrinkage tends toward the average of averages. Breiman and Friedman (1997) show that shrinkage of coefficients of several multiple regression models toward each other is beneficial in terms of predictive accuracy.
In a similar vein, building several logistic regressions in a reduced space might provide better estimates of the regression coefficients in terms of mean squared error.

There are basically two broad ways of analyzing multivariate data and performing dimension reduction. The first is based on inner products, from which principal component analysis (Pearson, 1901; Hotelling, 1936; Jolliffe, 2002) and reduced rank regression (Izenman, 1975; Ter Braak and Looman, 1994) are derived; the second is based on distances, which have led to multidimensional scaling (Torgerson, 1952, 1958; Gower, 1966; Guttman, 1968) and multidimensional unfolding (Coombs, 1950; Roskam, 1968; Heiser, 1981; Busing, 2010). This distance framework is conceptually easier than the inner product framework and leads to more straightforward interpretation (De Rooij and Heiser, 2005). Distances, especially Euclidean and Manhattan, are all around us and can already be understood by very young children.

In multidimensional unfolding, we generally have a dissimilarity matrix between two sets of objects. The goal is to find a low-dimensional mapping including points for the row objects and the column objects such that the distances between the points of the two sets are as close as possible (often in the "least squares" sense) to the observed dissimilarities. We will develop a family of models based on similar ideas.

In this paper, we will develop a family of logistic models within a distance framework. We call it the MELODIC family, written out, the MultivariatE LOgistic DIstance to Categories family. More specifically, we will develop a framework of models in which both participants and the categories of the different response variables have a position in low-dimensional Euclidean space. The distances between the position of a participant and the positions of the two categories of a single response variable determine the probabilities for these two response options. The position of a participant will be parameterized as a linear combination of the predictor variables.

The family extends the recently proposed multivariate logistic distance models (Worku and De Rooij, 2018), which built on earlier logistic distance models (Takane et al., 1987; Takane, 1987; De Rooij, 2009), and can be considered as examples in the "Gifi goes logistic" framework as laid out by De Leeuw (2005) and Evans (2014).

In the next section, we will develop the general model and two constrained variants. We will discuss properties of the models and provide interpretational rules. Two types of these rules can be distinguished: the numerical and the graphical. These two modes of interpretation for a single model are beneficial because there are those people who say that "a graph is worth a thousand words", the so-called graph people (Friendly and Kwan, 2011), while others (the table people) firmly disagree (Gelman, 2011). In Section 3, we develop an Iterative Majorization or MM algorithm (De Leeuw and Heiser, 1977; Groenen, 1993; Heiser, 1995; Hunter and Lange, 2004) for estimating the parameters of our models by minimizing a deviance function. Section 4 describes two illustrative applications. In Section 5, we discuss related statistical models and provide some comparisons. We conclude, in Section 6, with a general discussion of our developments and some possibilities for further investigation.
The MELODIC family

We consider a system with $P$ explanatory, predictor, or independent variables $X$ and $R$ outcome, response, or dependent binary variables $Y$. That is, we have a sample of observations $\{x_i, y_i\}_{i=1}^{n}$ with $x_i \in \mathbb{R}^P$ and $y_i \in \{0, 1\}^R$. The response variables will be recoded into indicator vectors $g_{ir}$ of length two, where the first element equals 1 if $y_{ir} = 0$ and the second element equals 1 if $y_{ir} = 1$. We will use the following notation.

• $i = 1, \ldots, n$ for individuals (participants, subjects, objects).
• $p = 1, \ldots, P$ for predictor variables (explanatory or independent variables).
• $r = 1, \ldots, R$ for response variables (outcome or dependent variables).
• $m = 1, \ldots, M$, an indicator for the dimensions.
• There is a set of predictor variables $\mathcal{X} = \{X_p\}_{p=1}^{P}$. Observed values of the predictor variables are collected in the $n \times P$ matrix $X$ with elements $x_{ip}$. We assume, without loss of generality, that the predictor variables are centered, that is, $\mathbf{1}^\top X = \mathbf{0}^\top$.
• There is a set of response variables $\mathcal{Y} = \{Y_r\}_{r=1}^{R}$. Observed values of the response variables are collected in the $n \times R$ matrix $Y$ with elements $y_{ir} \in \{0, 1\}$. We will code the responses in a super indicator matrix $G$ having $C = 2R$ categories, that is, $G = [G_1 | G_2 | \ldots | G_R]$.
• $B$ represents a $P \times M$ matrix with regression weights for the predictor variables.
• $u_i$ is an $M$-vector with coordinates for person $i$ in $M$-dimensional Euclidean space. These coordinates will be collected in the $n \times M$ matrix $U$ with elements $u_{im}$.
• $V_r$ is a $2 \times M$ matrix having the coordinates of category 0 (i.e., $v_{r0}$) in the first row and in the second row the coordinates of category 1 (i.e., $v_{r1}$), both for response variable $r$. These matrices will be collected in the $2R \times M$ matrix $V = [V_1^\top, \ldots, V_R^\top]^\top$ with elements $v_{rcm}$.
• The observations are $\{x_i, y_i\}_{i=1}^{n}$.
• We define a block diagonal matrix $J$ with $2 \times 2$ diagonal blocks $I - \frac{1}{2}\mathbf{1}\mathbf{1}^\top$.
• We use tildes for current estimates in the iterative process; that is, $\tilde{B}$ represents the matrix with estimates in a given cycle of the algorithm.
• $\mathrm{Diag}()$ denotes the operator that takes the diagonal values of a matrix and places them in a vector.

We define the conditional probability that person $i$ is in class $c$ ($c \in \{0, 1\}$) of response variable $r$, $\pi_{rc}(x_i) = P(Y_{ir} = c \mid x_i)$, as
\[
\pi_{rc}(x_i) = \frac{\exp(-\delta(u_i, v_{rc}))}{\exp(-\delta(u_i, v_{r0})) + \exp(-\delta(u_i, v_{r1}))}, \qquad (1)
\]
where $\delta(\cdot, \cdot)$ denotes half the squared Euclidean distance
\[
\delta(u_i, v_{rc}) = \frac{1}{2} \sum_{m=1}^{M} (u_{im} - v_{rcm})^2 = \frac{1}{2} \sum_{m=1}^{M} \left( u_{im}^2 + v_{rcm}^2 - 2 u_{im} v_{rcm} \right), \qquad (2)
\]
in $M$-dimensional Euclidean space. The dimensionality $M$ has to be chosen by the researcher, with possible values between 1 and $\min(P, R)$. The coordinates of the subjects ($u_i$) are assumed to be a linear combination of the predictor variables, that is, $u_i^\top = x_i^\top B$, where $B$ is a $P \times M$ matrix with regression weights. The coordinates of category $c$ of response variable $r$ on dimension $m$ are denoted by $v_{rcm}$ and collected in the $M$-vector $v_{rc}$.

Every subject $i$ is thus represented in an $M$-dimensional Euclidean space. Moreover, this subject has a distance to a point representing category 1 of response variable $r$ and to a point representing category 0 of response variable $r$. These two distances determine the probability for the subject to answer with either of these categories; the smaller the distance, the larger the probability. In other words, a subject is most likely to be in the closest class.
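For concreteness, Equations (1) and (2) can be evaluated directly from $X$, $B$, and $V$. The R sketch below is ours, not the authors' implementation; the function name melodic_probs and the row ordering of V are illustrative conventions.

    melodic_probs <- function(X, B, V) {
      # X: n x P centered predictors; B: P x M weights; V: 2R x M category
      # coordinates with rows ordered (v_10, v_11, v_20, v_21, ...).
      U <- X %*% B                      # subject coordinates, u_i' = x_i' B
      R <- nrow(V) / 2
      P1 <- matrix(0, nrow(X), R)       # P1[i, r] = pi_{r1}(x_i)
      for (r in 1:R) {
        d0 <- 0.5 * rowSums(sweep(U, 2, V[2 * r - 1, ])^2)  # delta(u_i, v_r0)
        d1 <- 0.5 * rowSums(sweep(U, 2, V[2 * r, ])^2)      # delta(u_i, v_r1)
        P1[, r] <- exp(-d1) / (exp(-d0) + exp(-d1))         # Equation (1)
      }
      P1
    }

Because only differences of the half squared distances enter Equation (1), any term common to both categories, such as $\frac{1}{2}\sum_m u_{im}^2$, cancels; this fact is exploited in the algorithm of Section 3.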
The log odds in favor of category 1 and against category 0 for response variable $r$ given the subject's position are
\[
\log \frac{\pi_{r1}(x_i)}{1 - \pi_{r1}(x_i)} = \log \frac{\pi_{r1}(x_i)}{\pi_{r0}(x_i)} = \delta(u_i, v_{r0}) - \delta(u_i, v_{r1}), \qquad (3)
\]
a simple difference of squared Euclidean distances. This log odds can be further worked out as
\[
\log \frac{\pi_{r1}(x_i)}{\pi_{r0}(x_i)} = \sum_{m=1}^{M} \left[ \frac{1}{2} \left( v_{r0m}^2 - v_{r1m}^2 \right) + x_i^\top b_m (v_{r1m} - v_{r0m}) \right], \qquad (4)
\]
where we see that the effect of predictor variable $p$ on response variable $r$ is determined by the regression coefficients $b_{pm}$ and the distance between the two categories. In general, the further apart the two categories are, the better they are discriminated by the predictor variables. If the two categories fall on the same position in the Euclidean space, they are indistinguishable (Anderson, 1984) based on this set of predictor variables.

Let us define $a_r^* = \frac{1}{2}\sum_{m=1}^{M} (v_{r0m}^2 - v_{r1m}^2)$ and $b_r^* = \sum_{m=1}^{M} b_m (v_{r1m} - v_{r0m})$. Then the log odds can be written as
\[
\log \frac{\pi_{r1}(x_i)}{\pi_{r0}(x_i)} = a_r^* + x_i^\top b_r^*, \qquad (5)
\]
showing that the model can be interpreted as standard binary logistic regression models. We call the $b_r^*$ the model implied coefficients.
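The model-implied intercepts and coefficients of Equation (5) follow directly from $B$ and $V$. A small sketch, continuing the illustrative naming conventions above:

    implied_coefficients <- function(B, V) {
      R  <- nrow(V) / 2
      V0 <- V[seq(1, 2 * R, by = 2), , drop = FALSE]  # rows v_r0
      V1 <- V[seq(2, 2 * R, by = 2), , drop = FALSE]  # rows v_r1
      a_star <- 0.5 * rowSums(V0^2 - V1^2)            # intercepts a*_r
      b_star <- B %*% t(V1 - V0)                      # P x R; column r holds b*_r
      list(a = a_star, b = b_star)
    }

Each column of b_star can then be read exactly like an ordinary logistic regression coefficient vector for the corresponding response variable.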
In the general model described above, the categories of the response variables lie freely somewhere in the $M$-dimensional space. Sometimes, however, researchers already have an idea about the underlying structure of the response variables. In the literature about depressive and anxiety disorders, for example, one theory says that fear and distress are its underlying dimensions, where each dimension comprises a subset of the disorders. In terms of our models, this means that the categories of the response variables pertaining to the distress dimension lie on a single dimension (i.e., the coordinates of the categories of these responses for the other dimensions equal zero). Similarly, for the categories of the response variables pertaining to the fear dimension, the coordinates on the distress dimension all equal zero.

If a specific response variable pertains to, say, dimension 1, the class coordinates on all other dimensions are set to zero, that is, $v_{r0m} = v_{r1m} = 0$, $\forall m \neq 1$. Such a structure simplifies the model and its interpretation, because in the log odds definition (see Equation 4) the last term becomes zero for several dimensions, and only the regression weights of the dimension to which the response variable pertains are important for the discrimination of the categories of that response variable.

One further constraint is to let all response variables have the same discriminatory ability. In that case, $(v_{r1m} - v_{r0m}) = 1$ for the dimensions to which response variable $r$ pertains. For this constrained model, Worku and De Rooij (2018) showed that the parameters can be estimated using standard software for logistic regression by using a structured design matrix for the predictors. This paper describes them as members of a larger family of models, the MELODIC family.

When the dimensionality equals two ($M = 2$), the model can be easily represented graphically. This representation shows 1) the categories of the response variables as points, 2) a decision line for every response variable designating the predicted class at a specific point, 3) variable axes for the predictor variables, and 4) the subjects' positions as points. Many aspects of the interpretation of these graphical representations follow the theory of biplots as discussed in Gower and Hand (1995) and Gower et al. (2011).

Let us first look at a graphical representation for a single response variable and a set of subjects, of which three are highlighted. Figure 1 gives such a graph, where A0 and A1 are the two categories of a response variable named A, and $i$, $j$, and $k$ represent three subjects.
The line halfway between classes A0 and A1 represents the decision line; in other words, the line represents the points for which the odds are even. The log odds that subject $i$ chooses A0 instead of A1 are clearly in favor of class A1, because that is the closest class.

The squared distances from subject $i$ to categories A0 and A1 additively decompose into one part toward the line through A0 and A1 (i.e., the A01 line) and one part along this line. Equation 3 shows that the log odds are defined in terms of a difference in squared distances, and therefore the part toward the A01 line drops out of the equation. In more detail, for this example we have
\[
\log \frac{\pi_{A1}(x_i)}{\pi_{A0}(x_i)} = \delta(u_i, v_{A0}) - \delta(u_i, v_{A1}).
\]
According to the Pythagorean theorem, the squared distance $\delta(u_i, v_{A0})$ can be decomposed into $\delta(u_i, v_{A\perp}) + \delta(v_{A\perp}, v_{A0})$, where $v_{A\perp}$ is the coordinate of the projection of $u_i$ on the A01 line. Using this decomposition for both terms we obtain
\[
\log \frac{\pi_{A1}(x_i)}{\pi_{A0}(x_i)} = \left( \delta(u_i, v_{A\perp}) + \delta(v_{A\perp}, v_{A0}) \right) - \left( \delta(u_i, v_{A\perp}) + \delta(v_{A\perp}, v_{A1}) \right) = \delta(v_{A\perp}, v_{A0}) - \delta(v_{A\perp}, v_{A1}).
\]
For person $j$ or $k$, we can use the same decomposition. The projections for the three subjects are, however, equal, and therefore the log odds of category A0 against A1 for persons $i$, $j$, and $k$ are equal (and for all three subjects in favor of category A1). As noted above, the decision line represents the set of positions where the odds are even, that is, the log odds are equal to zero. More generally, iso log odds curves, which are curves where the log odds equal any constant, are straight lines parallel to these decision lines and orthogonal to the A01 line. An example of such an iso log odds line is the one through the points representing the three subjects (blue dotted).

Figure 1: Graphical representation with a single dichotomous response variable A (with categories A0 and A1) and three participants ($i$, $j$, and $k$). The solid green line represents the decision line where the probabilities for A0 and A1 are equal. The blue dotted line projects the points representing the three subjects onto the A01 line (blue dashed-dotted line). All points on this dotted line represent observations with the same log odds.

The variable axes can be understood as representations of subjects with varying scores on the corresponding predictor variables and an average score on all other predictor variables. In this way we can interpret the variable axis by moving along the variable axes and computing the log odds for each response variable. More formally, let us denote by $d_r$ the $M$-vector with differences $(v_{r1m} - v_{r0m})$, and let predictor variable $p$ be represented by its regression weights $b_p$ (with $b_p^\top$ row $p$ of matrix $B$); then we can write the effect of predictor variable $p$ on response variable $r$ as
\[
b_p^\top d_r = \|b_p\| \cdot \|d_r\| \cdot \cos(b_p, d_r),
\]
showing that the log odds are largest when the direction of the variable axis for predictor variable $p$ is parallel to the line connecting the two categories of response variable $r$, while it is zero if the variable axis is orthogonal to this line. Figure 2 illustrates this property.
In Figure 2, the variable axes are represented for four predictor variables. We use the convention that the labels attached to the variable axes are placed on the positive side of the variable. The two categories of a response variable (A0 and A1) are depicted by points. The variable axis for predictor variable $X_1$ is parallel to the A01 line, indicating that this variable discriminates this response variable well, while the variable axis for predictor variable $X_2$ is almost orthogonal to the A01 line, indicating that $X_2$ does not discriminate between these two classes. We could draw the projections of A0 and A1 on each of the variable axes to see the discriminatory power: the further apart these projections are, the higher the power. For example, the projections of the two points onto variable $X_2$ are very close to each other, indicating that $X_2$ does not discriminate these two classes well.

The discriminative power depends not only on the distance between the projections but also on the estimated value of the regression coefficients. Larger regression weights indicate more general discriminative power for the complete set of response variables. We will indicate the value of the regression weight by using markers along the variable axis in steps of 1 standard deviation. The further apart these markers are, the larger the regression weights and the higher the discriminative power will be.

With $R$ binary response variables, the number of different possible response profiles is $2^R$. When $M = R$, each of these profiles can be perfectly represented. In lower dimensional space ($M < R$), however, not all response profiles find a place in the solution. For example, in a one-dimensional space, only $R + 1$ different response profiles are represented. In two-dimensional space, the number of represented profiles is
\[
\sum_{m=0}^{2} \binom{R}{m}
\]
(Coombs and Kao, 1955). In the constrained models, the number of represented response profiles is lower than in the general model because all decision lines are either horizontal or vertical. With five response variables, of which three pertain to the first dimension and two to the second, the model represents $4 \times 3 = 12$ response profiles (assuming we do not have a response variable pertaining to multiple dimensions), which is even smaller than the 16 in the general unconstrained model and much smaller than the $2^R = 32$ possible response profiles.
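These counts are easy to verify; a quick check of the five-response example in base R (our illustration):

    sum(choose(5, 0:2))  # general two-dimensional model: 16 representable profiles
    (3 + 1) * (2 + 1)    # constrained model, three and two responses per dimension: 12
    2^5                  # all possible response profiles: 32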
Figure 2: Graphical representation with the two class points of a single response variable named A (the classes are represented by the points labeled A0 and A1), variable axes for four predictor variables, and subject points (small dots). The dashed-dotted line connecting A0 and A1 represents the vector $d_A$. Predictor variable $X_1$ discriminates well between the two classes because its direction is parallel to the line joining the two classes, whereas predictor variable $X_2$ hardly discriminates between the two classes because its variable axis is almost orthogonal to the line joining the two classes.

An MM algorithm

In this section, we develop an MM algorithm, where the first M stands for "majorize" and the second M for "minimize". Such algorithms are also known as iterative majorization (IM) algorithms. MM algorithms have the property of guaranteed descent, and in MM algorithms it is easy to use low rank restrictions. The global idea of MM algorithms is that, instead of minimizing the original loss function, we seek an auxiliary function that 1) touches the original function at the current estimates, 2) lies above the original function, and 3) is easy to minimize. For a detailed treatment of the general principles of IM or MM, we refer to Heiser (1995) and Hunter and Lange (2004). In the following subsections, we will majorize a deviance function with a least squares function. This results in a fast algorithm. Before developing the algorithm, we will discuss identification of model parameters.
Before we develop an algorithm for the estimation of model parameters, we must discuss indeterminacies, that is, admissible transformations that change neither the estimated probabilities nor the loss value.

1. Multidimensional scaling and unfolding models in general have translational freedom. We center $X$ so that the origin of the Euclidean space is fixed at the average value of the predictor variables.

2. The model has rotational freedom: any map can be rotated without changing the distances or the probabilities. We will require that $n^{-1} B^\top X^\top X B = I$ so that the rotational indeterminacy is removed. Reflection can be removed by requiring that the regression weights for the first predictor variable are positive.

3. The model
\[
\pi_{rc}(x_i) = \frac{\exp(\theta_{irc})}{\exp(\theta_{ir0}) + \exp(\theta_{ir1})}
\]
is invariant under an additive constant in the "linear predictor" $\theta_{irc}$, that is,
\[
\frac{\exp(\theta_{irc})}{\exp(\theta_{ir0}) + \exp(\theta_{ir1})} = \frac{\exp(\theta_{irc} + \zeta_r)}{\exp(\theta_{ir0} + \zeta_r) + \exp(\theta_{ir1} + \zeta_r)}. \qquad (6)
\]
For our distance model, $\theta_{irc} = -\delta(u_i, v_{rc})$, we can add a constant to the squared distances per response variable, implying that the term $\frac{1}{2}\sum_{m=1}^{M} u_{im}^2$ can be removed from the distance formulation. We will see a further simplification below.

The deviance function to be minimized is
\[
L(B, V) = -\sum_{i=1}^{n} \sum_{r=1}^{R} \sum_{c=0}^{1} g_{irc} \log \pi_{rc}(x_i) = -\sum_{i=1}^{n} \sum_{r=1}^{R} \sum_{c=0}^{1} g_{irc} \log \left( \frac{\exp(\theta_{irc})}{\exp(\theta_{ir0}) + \exp(\theta_{ir1})} \right), \qquad (7)
\]
where for our model $\theta_{irc} = -\delta(u_i, v_{rc})$.
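The deviance can be scripted by combining Equation (7) with the probability sketch given earlier. In the sketch below we put it on the conventional $-2 \times$ log-likelihood scale, which is the scale on which the AIC and BIC values in Section 4 behave; apart from this constant factor, the function matches Equation (7). Names are again illustrative.

    melodic_deviance <- function(X, Y, B, V) {
      # Y: n x R matrix of 0/1 responses; melodic_probs() as sketched above
      P1 <- melodic_probs(X, B, V)
      -2 * sum(Y * log(P1) + (1 - Y) * log(1 - P1))
    }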
Groenen et al. (2003), De Leeuw (2006), and Groenen and Josse (2016) show that the function
\[
f_{ir}(\theta_i) = -\sum_{c=0}^{1} g_{irc} \log \frac{\exp(\theta_{irc})}{\exp(\theta_{ir0}) + \exp(\theta_{ir1})}
\]
is majorized by
\[
g_{ir}(\theta_i, \tilde{\theta}_i) = f_{ir}(\tilde{\theta}_i) + (\theta_i - \tilde{\theta}_i)^\top \nabla f_{ir}(\tilde{\theta}_i) + \frac{1}{4} \|\theta_i - \tilde{\theta}_i\|^2,
\]
with
\[
\nabla f_{ir}(\tilde{\theta}_i) = -(g_{ir} - \tilde{\pi}_{ir}),
\]
and where the tilde means that it is evaluated using the current estimates. A proof of the majorizing property is given in the appendix of Groenen and Josse (2016).

By analogy to Groenen and Josse, we therefore have that
\[
L(B, V) \leq \frac{1}{4} \|Z - \Theta\|^2 + L(\tilde{B}, \tilde{V}) - \|G - \tilde{\Pi}\|^2 = \frac{1}{4} \|Z - \Theta\|^2 + \text{constant} = g(B, V) + \text{constant}, \qquad (8)
\]
where $Z = \{z_{irc}\}$ with $z_{irc} = \tilde{\theta}_{irc} + 2(g_{irc} - \tilde{\pi}_{irc})$ and $\Theta = \{\theta_{irc}\}$ with $\theta_{irc} = -\sum_{m=1}^{M} (\frac{1}{2} v_{rcm}^2 - u_{im} v_{rcm})$. In matrix terms we can write $\Theta$ as
\[
\Theta = -\left( \frac{1}{2} \mathbf{1} d_v^\top - XBV^\top \right),
\]
with $d_v = \mathrm{Diag}(VV^\top)$, and $Z$ as $Z = \tilde{\Theta} + 2(G - \tilde{\Pi})$. Therefore, the objective function to be minimized in every iteration is
\[
g(B, V) = \frac{1}{4} \left\| Z + \frac{1}{2} \mathbf{1} d_v^\top - XBV^\top \right\|^2.
\]
The third indeterminacy, outlined above, allows us to rewrite the minimization function as
\[
g(B, V) = \frac{1}{4} \left\| ZJ + \frac{1}{2} \mathbf{1} d_v^\top J - XBV^\top J \right\|^2, \qquad (9)
\]
with $J$ a symmetric block diagonal matrix with $2 \times 2$ diagonal blocks $J_r$, the usual centering matrices, $J_r = I - \frac{1}{2}\mathbf{1}\mathbf{1}^\top$.

Let us define the matrices $A_l = I_R \otimes [1, 1]^\top$ and $A_k = I_R \otimes [1, -1]^\top$, with $\otimes$ the Kronecker product, such that $V$ can be reparametrized as $V = A_l L + A_k K$, with $L$ the $R \times M$ matrix with response variable locations and $K$ the $R \times M$ matrix representing the discriminatory power for the response variables. With two response variables, for example, the reparametrization reads
\[
V = \begin{bmatrix} v_{10}^\top \\ v_{11}^\top \\ v_{20}^\top \\ v_{21}^\top \end{bmatrix} = \begin{bmatrix} l_1^\top \\ l_1^\top \\ l_2^\top \\ l_2^\top \end{bmatrix} + \begin{bmatrix} k_1^\top \\ -k_1^\top \\ k_2^\top \\ -k_2^\top \end{bmatrix},
\]
where the second matrix on the right-hand side of the equation represents $A_l L$ (the locations or midpoints of the class coordinates) and the last matrix on the right-hand side of the equation represents $A_k K$. The larger the absolute values in row $r$ of matrix $K$, the better the two categories of this response variable can be discriminated by the predictor variables; if the values equal zero, the categories cannot be discriminated and the positions of the two categories of a response variable fall in the same place.

Using this reparametrization in our loss function, the last term $XBV^\top J$ simplifies to
\[
XBV^\top J = XBK^\top A_k^\top,
\]
because the term $XBL^\top A_l^\top J = \mathbf{0}$, as $A_l^\top J = \mathbf{0}$. Let us now have a closer look at the second term of the loss function (Equation 9), that is, $\frac{1}{2} \mathbf{1} d_v^\top J$. This term can be rewritten using
\[
d_v = \mathrm{Diag}(VV^\top) = \mathrm{Diag}\left( (A_l L + A_k K)(A_l L + A_k K)^\top \right) = \mathrm{Diag}\left( A_l L L^\top A_l^\top + A_k K K^\top A_k^\top + A_l L K^\top A_k^\top + A_k K L^\top A_l^\top \right),
\]
where $\mathrm{Diag}(X)$ creates a column vector of the diagonal elements of $X$. It can be verified that the terms $J\,\mathrm{Diag}(A_l L L^\top A_l^\top)$ and $J\,\mathrm{Diag}(A_k K K^\top A_k^\top)$ are both equal to zero. Therefore
\[
J\,\mathrm{Diag}(VV^\top) = 2 J\,\mathrm{Diag}(A_k K L^\top A_l^\top).
\]
Focusing on a single response variable $r$, the corresponding part of the previous equation equals $2 k_r^\top l_r \, [1, -1]^\top$, from which it follows that $d_v^\top J$ can be written as
\[
d_v^\top J = 2\, \mathrm{Diag}(KL^\top)^\top A_k^\top.
\]
Going back to our loss function and using the decomposition into $K$ and $L$, it becomes
\[
g(B, K, L) = \frac{1}{4} \left\| ZJ + \mathbf{1}\, \mathrm{Diag}(KL^\top)^\top A_k^\top - XBK^\top A_k^\top \right\|^2 = \frac{1}{2} \left\| \frac{1}{2} ZJA_k + \mathbf{1}\, \mathrm{Diag}(KL^\top)^\top - XBK^\top \right\|^2,
\]
which can be rewritten as
\[
g(B, K, L) = \frac{1}{2} \left\| \frac{1}{2} ZJA_k + \mathbf{1} a^\top - XBK^\top \right\|^2,
\]
where $a$ is a vector with elements $a_r = \sum_{m=1}^{M} k_{rm} l_{rm}$. The elements of $a$ can be estimated independently of $K$, because values of $L$ always exist that together with the $k_{rm}$ can reconstruct $a_r$ (see below). Therefore, this latter loss function can be solved separately for 1) $a$ and 2) $B$ and $K$.

To update $a$, define $\tilde{Z} = \frac{1}{2} ZJA_k$; then the update is
\[
a^{+} = -\tilde{Z}^\top \mathbf{1} / n.
\]
To find the update for $B$ and $K$, let us define $\tilde{Z} = \frac{1}{2} ZJA_k + \mathbf{1} a^\top$, so that we have to minimize
\[
\left\| \tilde{Z} - XBK^\top \right\|^2
\]
under the restriction that $n^{-1} B^\top X^\top X B = I$. As
\[
\left\| \tilde{Z} - XBK^\top \right\|^2 = \left\| \tilde{Z} - XN \right\|^2 + \left\| N - BK^\top \right\|^2_{X^\top X},
\]
with $N = (X^\top X)^{-1} X^\top \tilde{Z}$ the unconstrained update, an update of $B$ and $K$ is found by the generalized singular value decomposition of $N$ (Takane, 2013). These two steps can be combined, that is,
\[
R_x^{-1} X^\top \tilde{Z} = P \Phi Q^\top,
\]
where $R_x$ is the matrix square root of the matrix $X^\top X$, that is, $X^\top X = R_x R_x^\top$. The updates for $B$ and $K$ can be obtained as
\[
B^{+} = \sqrt{n}\, R_x^{-1} P_M,
\]
where $P_M$ denotes the $M$ columns of $P$ corresponding to the $M$ largest singular values, and
\[
K^{+} = \frac{1}{\sqrt{n}}\, Q_M \Phi_M.
\]
Finally, an update for $L$ can be obtained from $a$ and $K$. For every response variable we have
\[
a_r = \sum_{m=1}^{M} k_{rm} l_{rm},
\]
which is an unidentified system; that is, there are many choices of $l_{rm}$ that provide a solution. These solutions correspond to any position on the decision line (or plane or hyperplane in higher-dimensional spaces), as discussed in relation to Figure 1. We find the position on this hyperplane that is closest to the origin of the Euclidean space, that is,
\[
l_{rm}^{+} = \frac{a_r k_{rm}}{\sum_{m} k_{rm}^2}.
\]
A summary of the algorithm can be found in Algorithm 1.
The number of parameters of the model is:

• $PM - M(M+1)/2$ for the regression weights $B$;
• $RM$ parameters in the matrix $K$;
• $R$ parameters in $L$;

which sum to $(P + R)M + R - M(M+1)/2$.

Algorithm 1: MELODIC Algorithm

Data: $X$, $G$, $M$. Result: $B$, $K$, $L$.
Compute: $R_x^{-1} X^\top G = P \Phi Q^\top$.
Initialize: $B^{(0)} = \sqrt{n}\, R_x^{-1} P_M$.
Compute: $V = \frac{1}{\sqrt{n}}\, Q_M \Phi_M$.
Initialize: $K^{(0)}$ by taking the odd-numbered rows of $JV$.
Initialize: $L^{(0)}$ by taking the odd-numbered rows of $(I - J)V$.
Compute: $\tilde{\Pi}$ and $\tilde{\Theta}$.
while $t = 0$ or $(L_{t-1} - L_t)/L_t > \varepsilon$ (for a small tolerance $\varepsilon$) do
    $t = t + 1$.
    Compute: $Z = \tilde{\Theta} + 2(G - \tilde{\Pi})$.
    Compute: $\tilde{Z} = \frac{1}{2} Z J A_k$.
    Compute: $a = -\tilde{Z}^\top \mathbf{1} / n$.
    Compute: $\tilde{Z} = \frac{1}{2} Z J A_k + \mathbf{1} a^\top$.
    Compute: $R_x^{-1} X^\top \tilde{Z} = P \Phi Q^\top$.
    Update: $B^{(t)} = \sqrt{n}\, R_x^{-1} P_M$.
    Update: $K^{(t)} = \frac{1}{\sqrt{n}}\, Q_M \Phi_M$.
    Update: $l_{rm}^{(t)} = a_r k_{rm} / (\sum_m k_{rm}^2)$, $\forall r$.
    Compute: $\tilde{\Pi}$, $\tilde{\Theta}$, and $L_t$.
end
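A compact R sketch of Algorithm 1 is given below. It is our reading of the algorithm, not the authors' package code: the tolerance eps, the iteration cap maxit, and all object names are illustrative choices, and the deviance is reported on the $-2$ log-likelihood scale used above.

    melodic_mm <- function(X, G, M, eps = 1e-6, maxit = 1000) {
      n <- nrow(X); R <- ncol(G) / 2
      e     <- eigen(crossprod(X), symmetric = TRUE)
      Rx    <- e$vectors %*% (sqrt(e$values) * t(e$vectors))   # (X'X)^{1/2}
      Rxinv <- solve(Rx)
      J  <- diag(2 * R) - kronecker(diag(R), matrix(0.5, 2, 2)) # block centering
      Ak <- kronecker(diag(R), matrix(c(1, -1), 2, 1))
      Al <- kronecker(diag(R), matrix(1, 2, 1))
      probs <- function(Theta) {            # per-response softmax over 2 classes
        E <- exp(Theta)
        S <- E[, seq(1, 2 * R, 2)] + E[, seq(2, 2 * R, 2)]
        E / S[, rep(1:R, each = 2)]
      }
      theta_of <- function(B, V)            # Theta = -(1 d_v'/2 - X B V')
        X %*% B %*% t(V) - 0.5 * outer(rep(1, n), rowSums(V^2))
      # initialization via an SVD of Rx^{-1} X'G, as in Algorithm 1
      s <- svd(Rxinv %*% crossprod(X, G))
      B <- sqrt(n) * Rxinv %*% s$u[, 1:M, drop = FALSE]
      V <- (1 / sqrt(n)) * s$v[, 1:M, drop = FALSE] %*% diag(s$d[1:M], M)
      Theta <- theta_of(B, V); Pi <- probs(Theta)
      dev_old <- -2 * sum(G * log(Pi)); conv <- FALSE; t <- 0
      while (!conv && t < maxit) {
        t  <- t + 1
        Z  <- Theta + 2 * (G - Pi)          # working matrix from the majorization
        Zt <- 0.5 * Z %*% J %*% Ak          # n x R
        a  <- -colSums(Zt) / n              # update of a
        s  <- svd(Rxinv %*% crossprod(X, Zt + outer(rep(1, n), a)))
        B  <- sqrt(n) * Rxinv %*% s$u[, 1:M, drop = FALSE]
        K  <- (1 / sqrt(n)) * s$v[, 1:M, drop = FALSE] %*% diag(s$d[1:M], M)
        L  <- a * K / rowSums(K^2)          # l_rm = a_r k_rm / sum_m k_rm^2
        V  <- Al %*% L + Ak %*% K
        Theta <- theta_of(B, V); Pi <- probs(Theta)
        dev  <- -2 * sum(G * log(Pi))
        conv <- (dev_old - dev) / dev < eps
        dev_old <- dev
      }
      list(B = B, K = K, L = L, V = V, deviance = dev_old)
    }

Note that the guaranteed-descent property of MM makes the relative-decrease stopping rule safe: the deviance can never increase from one iteration to the next.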
Constrained models

Sometimes researchers have an idea in advance of which responses belong together in which dimensions. Let us denote the set of response variables that pertains to dimension $m$ by $D_m$. Furthermore, let us denote the set of dimensions to which response variable $r$ pertains as $S_r$. Then, for $m \notin S_r$, we restrict
\[
l_{rm} = k_{rm} = 0.
\]
Much of the unconstrained MM algorithm can be used, except for two aspects: (1) the orthonormality restriction on $XB$ needs to be relaxed and (2) the updates will be done dimension-wise. We still need a scale restriction per dimension on $XB$. The equivalent of (9) can be written as
\[
g_c(B, K, L) = \frac{1}{2} \left\| \frac{1}{2} Z J A_k + \mathbf{1} a^\top - \sum_{m=1}^{M} X b_m k_m^\top \right\|^2,
\]
which shows that the last term can be decomposed in dimensional terms. To update for dimension $s$, we first define
\[
\tilde{Z} = \frac{1}{2} Z J A_k + \mathbf{1} a^\top - \sum_{m \neq s} X b_m k_m^\top.
\]
Next, we define the matrix $\tilde{Z}_s$ to be the subset of the matrix $\tilde{Z}$ consisting of the columns for which $r \in D_s$. Similar to the unconstrained model, a generalized singular value decomposition of $\tilde{Z}_s$ gives updates for $b_s$ and $k_s$ by taking the highest singular value and the corresponding vectors. The update for $a$ is the same as in the unconstrained model. The update for $L$ is similar to the unconstrained model, but we only give non-zero values to the dimensions to which response variable $r$ pertains. A summary of the algorithm is given in Algorithm 2.

The total number of parameters for the constrained model depends on the specific constraints. Nevertheless, we have:

• $(P - 1)M$ parameters for the regression weights $B$;
• the number of parameters in the matrix $K$, which equals the number of ones in an indicator matrix of size $R \times M$ indicating which response belongs to which dimension (see the next section for an example);
• $R$ parameters in $L$.

Algorithm 2: MELODIC Algorithm for Constrained Model

Data: $X$, $G$, $M$. Result: $B$, $K$, $L$.
Compute: $R_x^{-1} X^\top G = P \Phi Q^\top$.
Initialize: $B^{(0)} = \sqrt{n}\, R_x^{-1} P_M$.
Compute: $V = \frac{1}{\sqrt{n}}\, Q_M \Phi_M$.
Initialize: $K^{(0)}$ by taking the odd-numbered rows of $JV$.
Initialize: $L^{(0)}$ by taking the odd-numbered rows of $(I - J)V$.
Set elements in $K^{(0)}$ and $L^{(0)}$ to zero, following the constraints.
Compute: $\tilde{\Pi}$ and $\tilde{\Theta}$.
while $t = 0$ or $(L_{t-1} - L_t)/L_t > \varepsilon$ do
    $t = t + 1$.
    Compute: $Z = \tilde{\Theta} + 2(G - \tilde{\Pi})$.
    Compute: $\tilde{Z} = \frac{1}{2} Z J A_k$.
    Compute: $a^{+} = -\tilde{Z}^\top \mathbf{1} / n$.
    for $s = 1, \ldots, M$ do
        Compute: $\tilde{Z}_s$.
        Compute: $R_x^{-1} X^\top \tilde{Z}_s = P \Phi Q^\top$.
        Update: $b_s^{(t)} = \sqrt{n}\, R_x^{-1} P_1$.
        Update: $k_s^{(t)} = \frac{1}{\sqrt{n}}\, Q_1 \Phi_1$.
    end
    Update: $l_{rm}^{(t)} = a_r k_{rm} / (\sum_{m \in S_r} k_{rm}^2)$, for $m \in S_r$ and $\forall r$.
    Compute: $\tilde{\Pi}$, $\tilde{\Theta}$, and $L_t$.
end
Two empirical applications
In this section we discuss two empirical applications of the model. The first data set considers profiles of drug consumption; the second, profiles of mental disorders. For the first data set we use the unconstrained model. For the second data set we start with a set of constrained models representing different theories.
Drug consumption data

The drug consumption data (Fehrman et al., 2017) have records for 1885 respondents. For each respondent, nine attributes are measured. We have personality measurements based on the big five personality traits, neuroticism (N), extraversion (E), openness to experience (O), agreeableness (A), and conscientiousness (C), and two other personality characteristics, namely impulsivity (I) and sensation seeking (S). Data were also collected on age and gender. (Level of education, ethnicity, and country of origin are also available in the original data base; we omitted these from the analysis.)

In addition, participants were questioned concerning their use of 18 legal and illegal drugs. For each drug, participants were asked whether they never used the drug, used it over a decade ago, in the last decade, in the last year, month, week, or day. In our analysis we coded whether participants used the particular drug in the last year (yes or no). Furthermore, we focused on the drugs with a prevalence of at least 10% and at most 90%, which are Amphetamine, Benzodiazepine, Cannabis, Cocaine, Ecstasy, Ketamine, legal highs, LSD, Methadone, Mushrooms, and Nicotine ($R = 11$).

The first step in the analysis is to select the dimensionality. We fit models in one to seven dimensions and compute information criteria for comparison. The results are given in Table 1, where we can see that either the two- or three-dimensional solution is optimal according to the AIC and BIC statistics.
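Scripted, this dimension-selection step amounts to fitting the model for a range of $M$ and computing AIC and BIC from the deviance and the parameter count $(P + R)M + R - M(M+1)/2$ derived in Section 3. The helper below reuses the melodic_mm() sketch from earlier; it is illustrative, not the authors' code.

    select_dimension <- function(X, G, M_max) {
      n <- nrow(X); P <- ncol(X); R <- ncol(G) / 2
      t(sapply(1:M_max, function(M) {
        fit  <- melodic_mm(X, G, M)
        npar <- (P + R) * M + R - M * (M + 1) / 2   # number of free parameters
        c(M = M, deviance = fit$deviance,
          AIC = fit$deviance + 2 * npar,
          BIC = fit$deviance + log(n) * npar)
      }))
    }

As a sanity check against Table 2 below: with a deviance of 18409 on 46 parameters, this gives AIC = 18409 + 2 x 46 = 18501 and BIC = 18409 + log(1885) x 46, which is approximately 18756.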
Table 2: AIC and BIC statistics for two-dimensional models with 1 predictor left out of the model.

    Predictor left out    Deviance   Parameters   AIC     BIC
    Sensation seeking     18409      46           18501   18756

We should also check the influence of the predictor variables. Each of the predictor variables is left out of the two-dimensional model in turn. The AIC and BIC statistics are shown in Table 2, where it can be seen that only impulsivity can be considered for being left out of the model; all other predictors would lead to a substantial loss in fit. We decided to further interpret the model using all predictor variables except impulsivity.

The graphical representation of the two-dimensional model is shown in Figure 3 (we left out the decision lines to avoid clutter). The first thing that catches the eye in Figure 3 is that the categories for "yes" (labels ending with 1) are all to the right-hand side of the categories for "no" (labels ending with 0).

Figure 3: Two-dimensional solution for the drug consumption data. Predictor variable labels are printed on the border of the graph, where N = Neuroticism; E = Extraversion; O = Openness to experience; A = Agreeableness; C = Conscientiousness; S = Sensation seeking. We use the convention that labels are printed at the positive side of the variable. Markers on the variable axes indicate standard deviation increases/decreases from the mean. Category points are labeled by the name of the drug together with a 1 (yes) or 0 (no); that is, Am1 denotes the class point for use of Amphetamine and Am0 for no use of Amphetamine. Am = Amphetamine; Be = Benzodiazepine; Ca = Cannabis; Co = Cocaine; Ex = Ecstasy; Ke = Ketamine; Le = legal highs; LSD = LSD; Me = Methadone; Mu = Mushrooms; Ni = Nicotine.

We may further wonder how well each response variable is represented in the two-dimensional space. To do this, we define a measure called Quality of Representation, $Q_r$, which is defined by
\[
Q_r = \frac{L_{(0,r)} - L_r}{L_{(0,r)} - L_{lr}},
\]
where $L_{(0,r)}$ is the deviance of the intercept-only logistic regression model for response variable $r$, $L_r$ is the part of our loss function for response variable $r$, and $L_{lr}$ is the deviance from a logistic regression with the same predictor variables. Thus, $Q_r$ can be interpreted as the proportion of loss in deviance imposed by the MELODIC model compared to an unconstrained logistic regression for response variable $r$. The quality of representation for the response variables in this analysis is given in the last row of Table 3, where it can be seen that most response variables are well represented. The response variable "cocaine" (Co) is worst represented, although still with 89.8% recovered.
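In code, $Q_r$ compares three deviances per response, with glm() supplying the intercept-only and unconstrained logistic fits and the MELODIC part computed from the fitted probabilities. A sketch under the same naming assumptions as before:

    quality_of_representation <- function(y, p1, Xr) {
      # y: one binary response; p1: fitted MELODIC probabilities for it;
      # Xr: the predictor matrix used in the model
      dev0 <- deviance(glm(y ~ 1, family = binomial))          # L_(0,r)
      devl <- deviance(glm(y ~ Xr, family = binomial))         # L_lr
      devr <- -2 * sum(y * log(p1) + (1 - y) * log(1 - p1))    # L_r
      (dev0 - devr) / (dev0 - devl)
    }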
Depression and Anxiety data

Depression and anxiety disorders are common at all ages. Approximately one out of three people in the Netherlands will be faced with them at some time during their lives. It is not clear why some people recover quickly and why others suffer for long periods of time. The Netherlands Study of Depression and Anxiety (NESDA) was therefore designed to investigate the course of depression and anxiety disorders over a period of several years. For more information about the study design, see Penninx et al. (2008). In our application, we will analyze data from the first wave, focusing on the relationship between personality and depression and anxiety disorders. The data were previously analyzed by Spinhoven et al. (2009). Data were collected from three different populations: from primary health care, from generalized mental health care, and from the general population. Our analysis will focus on the population of generalized health care.

We have data for 786 participants. The diagnoses Dysthymia (D), Major Depressive Disorder (MDD), Generalized Anxiety Disorder (GAD), Social Phobia (SP), and Panic Disorder (PD) were established with the Composite International Diagnostic Interview (CIDI) psychiatric interview.
Personality was operationalized using the 60-item NEO Five-Factor Inventory (NEO-FFI). The NEO-FFI questionnaire measures the following five personality domains: Neuroticism, Extraversion, Agreeableness, Conscientiousness, and Openness to Experience. In addition to these five predictors, three background variables were measured: age, gender, and education in years.

The prevalences in the data are 21.25% for dysthymia, 76.21% for major depressive disorder, 30.41% for generalized anxiety disorder, 41.6% for social phobia, and 52.8% for panic disorder. Of the 786 participants, 272 have a single disorder; the others all have multiple disorders. There are 235 participants with two disorders, 147 with three, 96 with four, and 36 participants with five disorders.

Due to the high comorbidity among disorders, the scientific field of psychiatry developed three different theories:

1. a unidimensional structure where all the disorders are represented by a single dimension;
2. a two-dimensional structure with one dimension representing distress (D, MDD, GAD) and the other fear (SP, PD);
3. a two-dimensional structure with one dimension representing depression (D, MDD) and the other anxiety (GAD, SP, PD).

We can of course define another two-dimensional structure (Theory 4) in which dysthymia and major depressive disorder pertain to the first dimension, social phobia and panic disorder to the second, and generalized anxiety disorder to both dimensions.

Each of the three two-dimensional theories gives rise to a different response variable by dimension indicator matrix (rows ordered D, MDD, GAD, SP, PD):
\[
D_2 = \begin{bmatrix} 1 & 0 \\ 1 & 0 \\ 1 & 0 \\ 0 & 1 \\ 0 & 1 \end{bmatrix}, \qquad
D_3 = \begin{bmatrix} 1 & 0 \\ 1 & 0 \\ 0 & 1 \\ 0 & 1 \\ 0 & 1 \end{bmatrix}, \qquad
D_4 = \begin{bmatrix} 1 & 0 \\ 1 & 0 \\ 1 & 1 \\ 0 & 1 \\ 0 & 1 \end{bmatrix}.
\]
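Written out in R (response order D, MDD, GAD, SP, PD), the three matrices are easy to construct; in a constrained fit, the corresponding entries of $K$ and $L$ are forced to zero wherever the indicator is zero. Object names are illustrative.

    resp <- c("D", "MDD", "GAD", "SP", "PD")
    D2 <- matrix(c(1,0, 1,0, 1,0, 0,1, 0,1), 5, 2, byrow = TRUE,
                 dimnames = list(resp, c("distress", "fear")))
    D3 <- matrix(c(1,0, 1,0, 0,1, 0,1, 0,1), 5, 2, byrow = TRUE,
                 dimnames = list(resp, c("depression", "anxiety")))
    D4 <- matrix(c(1,0, 1,0, 1,1, 0,1, 0,1), 5, 2, byrow = TRUE,
                 dimnames = list(resp, c("dim1", "dim2")))
    # In Algorithm 2 the constraint acts elementwise, e.g. K <- K * D2; L <- L * D2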
We fitted the four models reflecting the four theories to the data. The fit statistics can be found in Table 4, where it can be seen that all four theories give about the same fit, but the distress-fear hypothesis (corresponding to $D_2$) has a slightly lower AIC and the unidimensional model a lower BIC value than the other theories.

Figure 4: Two-dimensional solution following Theory 2 for the depression and anxiety data. Predictor variable labels are printed on the border of the graph, where N = Neuroticism; E = Extraversion; O = Openness to experience; A = Agreeableness; C = Conscientiousness, and Edu = Education. We use the convention that labels are printed at the positive side of the variable. Markers on the variable axes indicate standard deviation increases/decreases from the mean. Category points are labeled by the name of the disorder together with a 1 (yes) or 0 (no), with D = Dysthymia, M = Major Depressive Disorder, G = Generalized Anxiety Disorder, S = Social Phobia (SP), and P = Panic Disorder (PD).

Related statistical models

The MELODIC family is a statistical toolbox for simultaneous logistic regressions in a reduced dimensional space for the analysis of multivariate binary data. Other statistical models have been proposed for such data; in this section we show some relationships and comparisons. The related approaches can be divided into two types of models: marginal models and conditional models.
Marginal models are like standard regression models, dealing in some way with the dependency among responses. In generalized estimating equations (GEE; Liang and Zeger, 1986; Zeger and Liang, 1986), a working correlation structure is adopted and estimation and inference are adjusted based on this structure using a sandwich estimator (White, 1980). GEE has been mainly developed in the longitudinal context but can also be applied to multivariate responses. Maximum likelihood estimation is also possible for marginal models (see, for example, Bergsma et al., 2009), but this is computationally more demanding.

Our MELODIC family can be seen as a member of the GEE family, where implicitly we adopt an independent working correlation structure. Moreover, we lay a dimensional structure on the response space.
Figure 5: Graphical representation of the two-dimensional unconstrained solution for the depression and anxiety data. Predictor variable labels are printed on the border of the graph, where N = Neuroticism; E = Extraversion; O = Openness to experience; A = Agreeableness; C = Conscientiousness, and Edu = Education. We use the convention that labels are printed at the positive side of the variable. Markers on the variable axes indicate standard deviation increases/decreases from the mean. Category points are labeled by the name of the disorder together with a 1 (yes) or 0 (no), with D = Dysthymia, M = Major Depressive Disorder, G = Generalized Anxiety Disorder, S = Social Phobia (SP), and P = Panic Disorder (PD).
A standard GEE model would be equal to our one-dimensional model, with the constraint that all responses are equally well discriminated (see Worku and De Rooij, 2018). Ziegler et al. (1998) discuss a set-up where the predictor variables have different effects on each of the response variables. This set-up would correspond to our model in maximum dimensionality ($R$), where each response pertains to a single dimension. Asar and İlk (2014) propose a method in which selected predictor variables have the same effect on selected response variables. This is similar to our constrained model, where predictors have a similar effect on response variables pertaining to a given dimension.

Reduced rank vector generalized linear models (RR-VGLMs; Yee and Hastie, 2003) are a family of models closely related to our own. Whereas we started our development from a distance perspective, these RR-VGLMs start from an inner-product perspective. De Rooij and Heiser (2005) give an extensive discussion of the interpretational differences of the two perspectives. Moreover, they show conditions under which the inner product rule and the distance rule are equivalent. If we had defined our model as
\[
\pi_{rc}(x_i) = \frac{\alpha_{rc} \exp(-\delta(u_i, v_{rc}))}{\alpha_{r0} \exp(-\delta(u_i, v_{r0})) + \alpha_{r1} \exp(-\delta(u_i, v_{r1}))},
\]
then it would be equivalent to an RR-VGLM for binary data. However, the $\alpha$-parameters would problematize the interpretation of the graphical representation, as the distance rule outlined in Section 2.4 would no longer be valid and the decision lines would not be equidistant from the two class points.

In conditional models, latent variables are included in order to model the dependency among the responses. The main examples are generalized linear mixed models and latent class models. The family of GLMMs includes item response models and factor analysis models (Skrondal and Rabe-Hesketh, 2004). When these are expanded with predictor variables, explanatory item response models (De Boeck and Wilson, 2004) and structural equation models are obtained. Explanatory item response models have mainly been developed for unidimensional latent variables. Some progress has been made with multidimensional item response models, but the underlying structure should be known a priori (as in our constrained models). Explanatory multidimensional item response models still need further development, partly because estimation is often quite troublesome in such models due to the intractable integral in the likelihood function (Tuerlinckx et al., 2006).

Latent class models (Lazarsfeld and Henry, 1968; McCutcheon, 1987) have been developed for multivariate binary data including predictors (Vermunt, 2010). In latent class models, as in GLMMs, the dependency among the response variables is modeled using a latent variable, which in this case is categorical. No dimensional structure underlying the outcomes is imposed, only a choice of the number of categories of the categorical latent variable. These latent class models often require a large sample size in order to obtain stable and reliable results (Gudicha et al., 2016).

Hubbard et al. (2010), when comparing generalized estimating equations and generalized linear mixed models, noted that "mixed models involve unverifiable assumptions on the data-generating distribution, which lead to potentially misleading estimates and biased inference".
More specifically, although the distribution of the random effects cannot be identified from the data, the estimates and inference change according to different choices of the distribution of these random effects. This makes the application of conditional models problematic. Another issue for conditional models is the number of indicator or response variables. In our example on depressive and anxiety disorders, for example, there are only five response variables. Distributing these five response variables over two underlying dimensions results in a dimension with only two dichotomous indicators. It is generally acknowledged that this number is much too low for valid inference. This small number is less of an issue in our family, because we do not assume a particular distribution for the underlying dimension.

A final practical problem in these conditional models is that researchers often first try to find the dimensional structure and in a second step include the predictor variables. The measurement model (step 1), however, might change substantially when the predictor variables are included, leading to a completely different interpretation. To solve this problem, researchers have developed three-step (Bolck et al., 2004, sometimes called the BCH approach) and two-step estimators (Bakk and Kuha, 2018) within the context of latent class models, which were recently adapted for other conditional models. In our family of models we have no division into measurement and structural model; the two go hand in hand.

Discussion

In this study, we presented distance models for simultaneous logistic regression analysis of multiple binary response variables based on ideas of multidimensional unfolding. Row objects (participants in our examples) are presented together with the two categories of the response variables in a low-dimensional Euclidean space, where the relative distance between a point representing a participant and the points representing the classes of a response variable determines the probability for each class. The model is estimated by minimizing a deviance function. These models take into account the dependency among the response variables by using a low-dimensional Euclidean space: with the increase in value of a predictor variable, the probabilities of all response variables change simultaneously. We christened the models the MELODIC family, that is, the MultivariatE LOgistic DIstance to Categories family. We presented versions of the model both for cases in which we have an a priori theory about the dimensional structure of the response variables and for cases when we do not have such a theory. Two empirical applications were shown, one with and one without such an a priori structure.

In the case of a two-dimensional model, the result can be interpreted using a biplot. In the case of a higher-dimensional solution, similar biplots can be constructed for pairs of dimensions. A coherent interpretation of the complete model from such bi-dimensional plots might, however, be more difficult. Alternatively, the model can be interpreted by the implied logistic regression coefficients, which have a change-in-log-odds interpretation similar to ordinary logistic regression models. We illustrated both methods of interpretation in the empirical examples. The fact that the model can be interpreted using a graph and using tables is beneficial, because users of statistical models can be divided into two groups: those who prefer visualizations and those who prefer numbers.
With the MELODIC family, a user can choose the mode of interpretation that is most suitable.

We developed a fast iterative majorization algorithm to estimate the parameters of the model. The algorithm converges monotonically to the global optimum of the deviance function. The algorithm alternates between 1) updating an auxiliary vector $a$, which is simply obtained by taking an average, and 2) updating the regression weights ($B$) and item discriminations ($K$), which can be obtained from a generalized singular value decomposition. All model parameters can be obtained from these updates.

A measure of quality of representation for each response variable was also proposed, ranging between 0 and 1. A higher value implies only a small loss of fit with regard to a univariate logistic regression with the specific response variable and the same set of predictor variables. This measure can be used as a diagnostic tool to assess whether response variables are well represented by the model. In our second application, we saw that for one response variable the quality of representation was low. Further exploratory analysis suggested a different substantive theory.

In applications, we need to select the predictor variables as well as the dimensionality of the model. We used information criteria such as the AIC and BIC in the empirical applications. Alternatively, cross validation or other model selection criteria can be used. We did not discuss uncertainty estimation for our model. Assuming the model is true, we can compute the Hessian matrix and derive standard errors for the parameter estimates from this matrix. Following the GEE set-up, we could develop a sandwich estimator for the covariance matrix of the parameters. Alternatively, the bootstrap can be used (Efron and Tibshirani, 1986). The two latter approaches acknowledge the fact that the model is an approximation (Buja et al., 2019a,b) and probably not a completely accurate representation of a population model. In that sense, the sandwich estimator and the bootstrap can estimate uncertainty with respect to a target model in the population. Focusing on predictive accuracy instead of explanatory value (cf. Shmueli, 2010), the performance of our model in comparison to that of independently fitted logistic regression models is expected to be higher (Breiman and Friedman, 1997).

In this manuscript we only focused on linear effects of the predictor variables. Non-linear effects of predictor variables on the responses can easily be incorporated as long as they can be translated into a design matrix $X$, such as with the use of quadratic and cubic effects or with splines defined in terms of a truncated power basis (Friedman et al., 2001). Non-linear variable axes can be presented in the graphical representation as smooth curves, where effects are still additive. Interactions can also be included in the model. In the graphical representation, conditional variable axes then need to be represented. In the case of an interaction between variables $X_1$ and $X_2$, the graphical representation has a variable axis for $X_1$ for each value of $X_2$ (or the other way around). For an example of biplots with such interactions among predictors, see De Rooij (2011). Note that both the nonlinearity and the interactions are effects with respect to all response variables.

The past two decades have seen a rise in penalized estimation methods, which are methods that impose penalties on the parameters of the model. For the MELODIC family, these could be penalties on the regression weights to generalize the model to the case where $P \gg n$, such as $L_1$ (lasso penalty; Tibshirani, 1996) or $L_2$ penalties (ridge penalty; Hoerl and Kennard, 1970). To implement these in the MELODIC family, we would need to alter the identification constraints. In the current algorithm we used $n^{-1} B^\top X^\top X B = I$ to identify the model, but this scaling does not seem to be in line with a penalty on $B$. Therefore, the fixed scaling normalization should be placed on the discrimination values ($K$). Another potential type of penalty is a nuclear norm penalty (Fazel, 2002). In the outlined algorithm (see Algorithm 1) we use a singular value decomposition. If we apply an $L_1$ penalty to the singular values, the discrimination ($K$) between the categories of the response variables slowly diminishes. Because the singular values are ordered, the discrimination in the higher dimensions becomes zero first, and later the discrimination of the first dimensions also becomes zero. If we choose the optimal value of the penalty parameter by cross validation, such a penalty could be used for dimension selection.
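To make the nuclear-norm idea concrete: in the SVD step of Algorithm 1, the singular values would be soft-thresholded before forming $K$. The fragment below shows our reading of this, spliced into the melodic_mm() sketch from Section 3; lambda is a penalty parameter (to be chosen, e.g., by cross validation) and is not part of the algorithm as presented above.

    # inside the while-loop of melodic_mm(), replacing the unpenalized SVD step
    s     <- svd(Rxinv %*% crossprod(X, Zt + outer(rep(1, n), a)))
    d_pen <- pmax(s$d - lambda, 0)      # L1 shrinkage of the singular values
    keep  <- which(d_pen > 0)           # dimensions whose discrimination survives
    B <- sqrt(n) * Rxinv %*% s$u[, keep, drop = FALSE]
    K <- (1 / sqrt(n)) * s$v[, keep, drop = FALSE] %*% diag(d_pen[keep], length(keep))

Because the singular values are ordered, increasing lambda removes the highest dimensions first, which is what allows the penalty to act as a dimension-selection device.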
We are currently building an R package that enables empirical researchers to apply the models to their own data. For the moment, the R code of the examples presented can be obtained from the first author.

References
References

Agresti, A. (2003). Categorical data analysis. John Wiley & Sons.
Anderson, J. A. (1984). Regression and ordered categorical variables. Journal of the Royal Statistical Society: Series B (Methodological), 46(1):1–22.
Asar, Ö. and İlk, Ö. (2014). Flexible multivariate marginal models for analyzing multivariate longitudinal data, with applications in R. Computer Methods and Programs in Biomedicine, 115(3):135–146.
Bakk, Z. and Kuha, J. (2018). Two-step estimation of models between latent classes and external variables. Psychometrika, 83(4):871–892.
Bergsma, W., Croon, M., and Hagenaars, J. (2009). Marginal Models: For Dependent, Clustered, and Longitudinal Categorical Data. Statistics for Social and Behavioral Sciences. Springer New York.
Berkson, J. (1944). Application of the logistic function to bio-assay. Journal of the American Statistical Association, 39(227):357–365.
Bolck, A., Croon, M., and Hagenaars, J. (2004). Estimating latent structure models with categorical variables: One-step versus three-step estimators. Political Analysis, 12(1):3–27.
Breiman, L. and Friedman, J. H. (1997). Predicting multivariate responses in multiple linear regression. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 59(1):3–54.
Buja, A., Brown, L., Berk, R., George, E., Pitkin, E., Traskin, M., Zhang, K., Zhao, L., et al. (2019a). Models as approximations I: Consequences illustrated with linear regression. Statistical Science, 34(4):523–544.
Buja, A., Brown, L., Kuchibhotla, A. K., Berk, R., George, E., Zhao, L., et al. (2019b). Models as approximations II: A model-free theory of parametric regression. Statistical Science, 34(4):545–565.
Busing, F. M. T. A. (2010). Advances in multidimensional unfolding. Doctoral thesis, Leiden University.
Coombs, C. H. (1950). Psychological scaling without a unit of measurement. Psychological Review, 57(3):145.
Coombs, C. H. and Kao, R. (1955). Nonmetric factor analysis. University of Michigan, Department of Engineering Research, Bulletin.
Cox, D. R. (1958). The regression analysis of binary sequences. Journal of the Royal Statistical Society: Series B (Methodological), 20(2):215–232.
Cramer, J. S. (2002). The origins of logistic regression. Tinbergen Institute Discussion Paper, 02-119(4).
De Boeck, P. and Wilson, M. (2004). Explanatory item response models: A generalized linear and nonlinear approach. Springer Science & Business Media.
De Leeuw, J. (2005). Gifi goes logistic: SCASA keynote.
De Leeuw, J. (2006). Principal component analysis of binary data by iterated singular value decomposition. Computational Statistics & Data Analysis, 50(1):21–39.
De Leeuw, J. and Heiser, W. J. (1977). Convergence of correction matrix algorithms for multidimensional scaling. Geometric Representations of Relational Data, pages 735–752.
De Rooij, M. (2009). Ideal point discriminant analysis revisited with a special emphasis on visualization. Psychometrika, 74(2):317.
De Rooij, M. (2011). Transitional ideal point models for longitudinal multinomial outcomes. Statistical Modelling, 11(2):115–135.
De Rooij, M. and Heiser, W. J. (2005). Graphical representations and odds ratios in a distance-association model for the analysis of cross-classified data. Psychometrika, 70(1):99–122.
Efron, B. and Tibshirani, R. (1986). Bootstrap methods for standard errors, confidence intervals, and other measures of statistical accuracy. Statistical Science, pages 54–75.
Evans, G. W. (2014). Logistic Gifi: A Logistic Distance Association Model for Exploratory Analysis of Categorical Data. PhD thesis, UCLA.
Fazel, M. (2002). Matrix rank minimization with applications. Doctoral thesis, Stanford University.
Fehrman, E., Muhammad, A. K., Mirkes, E. M., Egan, V., and Gorban, A. N. (2017). The five factor model of personality and evaluation of drug consumption risk. In Data Science, pages 231–242. Springer.
Friedman, J., Hastie, T., and Tibshirani, R. (2001). The elements of statistical learning. Springer Series in Statistics, New York.
Friendly, M. and Kwan, E. (2011). Comment: Why tables are really much better than graphs. Journal of Computational and Graphical Statistics, 20(1):18.
Gelman, A. (2011). Why tables are really much better than graphs. Journal of Computational and Graphical Statistics, 20(1):3–7.
Gower, J. and Hand, D. (1995). Biplots. Chapman & Hall/CRC Monographs on Statistics & Applied Probability. Taylor & Francis.
Gower, J., Lubbe, S., and Roux, N. (2011). Understanding Biplots. Wiley.
Gower, J. C. (1966). Some distance properties of latent root and vector methods used in multivariate analysis. Biometrika, 53(3-4):325–338.
Groenen, P. J. F. (1993). The majorization approach to multidimensional scaling. DSWO Press, Leiden.
Groenen, P. J. F., Giaquinto, P., and Kiers, H. A. L. (2003). Weighted majorization algorithms for weighted least squares decomposition models. Econometric Institute Research Papers EI 2003-09, Erasmus University Rotterdam, Erasmus School of Economics (ESE), Econometric Institute.
Groenen, P. J. F. and Josse, J. (2016). Multinomial multiple correspondence analysis. arXiv preprint arXiv:1603.03174.
Gudicha, D. W., Tekle, F. B., and Vermunt, J. K. (2016). Power and sample size computation for Wald tests in latent class models. Journal of Classification, 33(1):30–51.
Guttman, L. (1968). A general nonmetric technique for finding the smallest coordinate space for a configuration of points. Psychometrika, 33(4):469–506.
Heiser, W. J. (1981). Unfolding analysis of proximity data. Doctoral dissertation, Leiden University.
Heiser, W. J. (1995). Convergent computation by iterative majorization: Theory and applications in multidimensional data analysis. In Krzanowski, W. J., editor, Recent Advances in Descriptive Multivariate Analysis, pages 157–189. Clarendon Press.
Hoerl, A. E. and Kennard, R. W. (1970). Ridge regression: Biased estimation for nonorthogonal problems. Technometrics, 12(1):55–67.
Hotelling, H. (1936). Simplified calculation of principal components. Psychometrika, 1(1):27–35.
Hubbard, A. E., Ahern, J., Fleischer, N. L., Van der Laan, M., Satariano, S. A., Jewell, N., Bruckner, T., and Satariano, W. A. (2010). To GEE or not to GEE: Comparing population average and mixed models for estimating the associations between neighborhood risk factors and health. Epidemiology, pages 467–474.
Hunter, D. R. and Lange, K. (2004). A tutorial on MM algorithms. The American Statistician, 58(1):30–37.
Izenman, A. J. (1975). Reduced-rank regression for the multivariate linear model. Journal of Multivariate Analysis, 5(2):248–264.
Jolliffe, I. T. (2002). Principal Component Analysis. Springer.
Lazarsfeld, P. F. and Henry, N. W. (1968). Latent structure analysis. Houghton Mifflin Co.
Liang, K.-Y. and Zeger, S. L. (1986). Longitudinal data analysis using generalized linear models. Biometrika, 73(1):13–22.
McCutcheon, A. L. (1987). Latent class analysis. Newbury Park, CA: Sage.
Molenberghs, G. and Verbeke, G. (2006). Models for discrete longitudinal data. Springer Science & Business Media.
Pearson, K. (1901). Principal components analysis. The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science, 6(2):559.
Penninx, B. W., Beekman, A. T., Smit, J. H., Zitman, F. G., Nolen, W. A., Spinhoven, P., Cuijpers, P., De Jong, P. J., Van Marwijk, H. W., Assendelft, W. J., et al. (2008). The Netherlands Study of Depression and Anxiety (NESDA): Rationale, objectives and methods. International Journal of Methods in Psychiatric Research, 17(3):121–140.
Roskam, E. E. (1968). Metric Analysis of Ordinal Data in Psychology. Voorschoten, The Netherlands: VAM.
Shmueli, G. (2010). To explain or to predict? Statistical Science, 25:289–310.
Skrondal, A. and Rabe-Hesketh, S. (2004). Generalized latent variable modeling: Multilevel, longitudinal, and structural equation models. CRC Press.
Spinhoven, P., De Rooij, M., Heiser, W., Smit, J. H., and Penninx, B. W. (2009). The role of personality in comorbidity among anxiety and depressive disorders in primary care and specialty care: A cross-sectional analysis. General Hospital Psychiatry, 31(5):470–477.
Stein, C. et al. (1956). Inadmissibility of the usual estimator for the mean of a multivariate normal distribution. In Proceedings of the Third Berkeley Symposium on Mathematical Statistics and Probability, Volume 1: Contributions to the Theory of Statistics. The Regents of the University of California.
Takane, Y. (1987). Analysis of contingency tables by ideal point discriminant analysis. Psychometrika, 52(4):493–513.
Takane, Y. (2013). Constrained principal component analysis and related techniques. CRC Press.
Takane, Y., Bozdogan, H., and Shibayama, T. (1987). Ideal point discriminant analysis. Psychometrika, 52(3):371–392.
Ter Braak, C. J. and Looman, C. W. (1994). Biplots in reduced-rank regression. Biometrical Journal, 36(8):983–1003.
Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological), 58(1):267–288.
Torgerson, W. S. (1952). Multidimensional scaling: I. Theory and method. Psychometrika, 17(4):401–419.
Torgerson, W. S. (1958). Theory and Methods of Scaling. Wiley.
Tuerlinckx, F., Rijmen, F., Verbeke, G., and De Boeck, P. (2006). Statistical inference in generalized linear mixed models: A review. British Journal of Mathematical and Statistical Psychology, 59(2):225–255.
Vermunt, J. K. (2010). Latent class modeling with covariates: Two improved three-step approaches. Political Analysis, pages 450–469.
Vugteveen, J., De Bildt, A., Hartman, C., and Timmerman, M. (2018). Using the Dutch multi-informant Strengths and Difficulties Questionnaire (SDQ) to predict adolescent psychiatric diagnoses. European Child & Adolescent Psychiatry, 27(10):1347–1359.
White, H. (1980). A heteroskedasticity-consistent covariance matrix estimator and a direct test for heteroskedasticity. Econometrica, pages 817–838.
Worku, H. M. and De Rooij, M. (2018). A multivariate logistic distance model for the analysis of multiple binary responses. Journal of Classification, 35(1):124–146.
Yee, T. W. and Hastie, T. J. (2003). Reduced-rank vector generalized linear models. Statistical Modelling, 3(1):15–41.
Zeger, S. L. and Liang, K.-Y. (1986). Longitudinal data analysis for discrete and continuous outcomes. Biometrics, pages 121–130.
Ziegler, A., Kastner, C., and Blettner, M. (1998). The generalised estimating equations: An annotated bibliography.