MM Algorithms for Variance Components Models
MMM Algorithms for Variance Components Models
Hua ZhouDepartment of BiostatisticsUniversity of California, Los [email protected] Liuyi HuDepartment of StatisticsNorth Carolina State [email protected] ZhouDivision of Epidemiologyand BiostatisticsUniversity of [email protected] Kenneth LangeDepartments of Biomathematics,Human Genetics, and StatisticsUniversity of California, Los [email protected]
Abstract
Variance components estimation and mixed model analysis are central themes instatistics with applications in numerous scientific disciplines. Despite the best effortsof generations of statisticians and numerical analysts, maximum likelihood estimationand restricted maximum likelihood estimation of variance component models remainnumerically challenging. Building on the minorization-maximization (MM) principle,this paper presents a novel iterative algorithm for variance components estimation.MM algorithm is trivial to implement and competitive on large data problems. Thealgorithm readily extends to more complicated problems such as linear mixed models,multivariate response models possibly with missing data, maximum a posteriori estima-tion, penalized estimation, and generalized estimating equations (GEE). We establishthe global convergence of the MM algorithm to a KKT point and demonstrate, bothnumerically and theoretically, that it converges faster than the classical EM algorithmwhen the number of variance components is greater than two and all covariance matricesare positive definite.
Keywords: generalized estimating equations (GEE), global convergence, matrixconvexity, linear mixed model (LMM), maximum a posteriori (MAP) estimation, max-imum likelihood estimation (MLE), minorization-maximization (MM), missing data,multivariate response, penalized estimation, restricted maximum likelihood (REML),variance components model a r X i v : . [ s t a t . C O ] S e p Introduction
Variance components and linear mixed models are among the most potent tools in a statis-tician’s toolbox. They are essential topics in graduate-level linear model courses and thesubject of many current papers and research monographs (Rao and Kleffe, 1988; Searle et al.,1992; Rao, 1997; Khuri et al., 1998; Demidenko, 2013). Their applications in agriculture,biology, economics, genetics, epidemiology, and medicine are too numerous to cover here indetail. The recommended books (Verbeke and Molenberghs, 2000; Weiss, 2005; Fitzmauriceet al., 2011) stress longitudinal data analysis.Given an observed n × response vector y and n × p predictor matrix X , the simplestvariance components model postulates that Y ∼ N ( Xβ , Ω ) , where Ω = m (cid:88) i =1 σ i V i , and the V , . . . , V m are m fixed positive semidefinite matrices. The parameters of the modelcan be divided into mean effects ( β , . . . , β p ) and variance components ( σ , . . . , σ m ) , summa-rized by vectors β and σ . Throughout we assume Ω is positive definite. The extension tosingular Ω will not be pursued here. Estimation revolves around the log-likelihood function L ( β , σ ) = −
12 ln det Ω −
12 ( y − Xβ ) T Ω − ( y − Xβ ) . (1)Among the commonly used methods for estimating variance components, maximum likeli-hood estimation (MLE) (Hartley and Rao, 1967) and restricted (or residual) MLE (REML)(Harville, 1977) are the most popular. REML first projects y to the null space of X andthen estimates variance components based on the projected responses. If the columns of thematrix B span the null space of X T , then REML estimates the σ i by maximizing the log-likelihood of the redefined response vector B T Y , which is normally distributed with mean and covariance B T Ω B = (cid:80) mi =1 σ i B T V i B .There exists a large literature on iterative algorithms for finding MLE and REML (Lairdand Ware, 1982; Lindstrom and Bates, 1988, 1990; Harville and Callanan, 1990; Callananand Harville, 1991; Bates and Pinheiro, 1998; Schafer and Yucel, 2002). Fitting variancecomponent models remains a challenge in models with a large sample size n or a large num-ber of variance components m . Newton’s method (Lindstrom and Bates, 1988) convergesquickly but is numerically unstable owing to the non-concavity of the log-likelihood. Fisher’sscoring algorithm replaces the observed information matrix in Newton’s method by the ex-pected information matrix and yields an ascent algorithm when safeguarded by step halving.However the calculation and inversion of expected information matrices cost O ( mn )+ O ( m ) flops for unstructured V i and quickly become impractical when either n or m is large. The2xpectation-maximization (EM) algorithm initiated by Dempster et al. is a third alterna-tive (Dempster et al., 1977; Laird and Ware, 1982; Laird et al., 1987; Lindstrom and Bates,1988; Bates and Pinheiro, 1998). Compared to Newton’s method, the EM algorithm is easyto implement and numerically stable, but painfully slow to converge. In practice, a strat-egy of priming Newton’s method by a few EM steps leverages the stability of EM and thefaster convergence of second-order methods. Quasi-Newton methods dispense with explicitcalculation of the observed information while achieving a superlinear rate of convergence.In this paper we derive a minorization-maximization (MM) algorithm for finding theMLE and REML estimates of variance components. We prove global convergence of theMM algorithm to a Karush-Kuhn-Tucker (KKT) point and explain why MM generally con-verges faster than EM for models with more than two variance components. We also sketchextensions of the MM algorithm to the multivariate response model with possibly missingresponses, the linear mixed model (LMM), maximum a posteriori (MAP) estimation, pe-nalized estimation, and generalized estimating equations (GEE). The numerical efficiency ofthe MM algorithm is illustrated through simulated data sets and a genomic example withmore than 200 variance components. Background on MM algorithms
Throughout we reserve Greek letters for parameters and indicate the current iteration num-ber by a superscript t . The MM principle for maximizing an objective function f ( θ ) involvesminorizing the objective function f ( θ ) by a surrogate function g ( θ | θ ( t ) ) around the currentiterate θ ( t ) of a search (Lange et al., 2000). Minorization is defined by the two conditions f ( θ ( t ) ) = g ( θ ( t ) | θ ( t ) ) (2) f ( θ ) ≥ g ( θ | θ ( t ) ) , θ (cid:54) = θ ( t ) . In other words, the surface θ (cid:55)→ g ( θ | θ ( t ) ) lies below the surface θ (cid:55)→ f ( θ ) and is tangentto it at the point θ = θ ( t ) . Construction of the minorizing function g ( θ | θ ( t ) ) constitutesthe first M of the MM algorithm. The second M of the algorithm maximizes the surrogate g ( θ | θ ( t ) ) rather than f ( θ ) . The point θ ( t +1) maximizing g ( θ | θ ( t ) ) satisfies the ascentproperty f ( θ ( t +1) ) ≥ f ( θ ( t ) ) . This fact follows from the inequalities f ( θ ( t +1) ) ≥ g ( θ ( t +1) | θ ( t ) ) ≥ g ( θ ( t ) | θ ( t ) ) = f ( θ ( t ) ) , (3)reflecting the definition of θ ( t +1) and the tangency and domination conditions (2). Theascent property makes the MM algorithm remarkably stable. The validity of the descent3roperty depends only on increasing g ( θ | θ ( t ) ) , not on maximizing g ( θ | θ ( t ) ) . With obviouschanges, the MM algorithm also applies to minimization rather than to maximization. Tominimize a function f ( θ ) , we majorize it by a surrogate function g ( θ | θ ( t ) ) and minimize g ( θ | θ ( t ) ) to produce the next iterate θ ( t +1) . The acronym should not be confused with themaximization-maximization algorithm in the variational Bayes context (Jeon, 2012).The MM principle (De Leeuw, 1994; Heiser, 1995; Kiers, 2002; Lange et al., 2000; Hunterand Lange, 2004; Wu and Lange, 2010) finds applications in multidimensional scaling (Borgand Groenen, 2005), ranking of sports teams (Hunter, 2004), variable selection (Hunter andLi, 2005), optimal experiment design (Yu, 2010), multivariate statistics (Zhou and Lange,2010), geometric programming (Lange and Zhou, 2014), and many other areas (Lange, 2010,Chapter 12). The celebrated EM principle (Dempster et al., 1977) is a special case ofthe MM principle. The Q function produced in the E step of an EM algorithm minorizesthe log-likelihood up to an irrelevant constant. Thus, both EM and MM share the sameadvantages: simplicity, stability, graceful adaptation to constraints, and the tendency toavoid large matrix inversion. The more general MM perspective frees algorithm derivationfrom the missing data straitjacket and invites wider applications (Wu and Lange, 2010).Figure 1 shows the minorization functions of EM and MM for a variance components modelwith m = 2 variance components.EM and MM algorithms often exhibit slow convergence. Fortunately, this defect canbe remedied by off-the-shelf acceleration techniques for fixed point iterations. The recentlydeveloped squared iterative method (SQUAREM) (Varadhan and Roland, 2008) and thequasi-Newton acceleration method (Zhou et al., 2011) are particularly attractive, given theirsimplicity and minimal memory and computational costs. Our numerical experiments featurethe unadorned MM algorithm and the quasi-Newton accelerated MM (aMM) algorithm basedon one secant pair. Using more secant pairs is likely to further improve performance. Convex matrix functions
For symmetric matrices we write A (cid:22) B when B − A is positive semidefinite and A ≺ B if B − A is positive definite. A matrix-valued function f is said to be (matrix) convex if f [ λ A + (1 − λ ) B ] (cid:22) λf ( A ) + (1 − λ ) f ( B ) for all A , B , and λ ∈ [0 , . Our derivation of the MM variance components algorithm hingeson the convexity of the two functions mentioned in the next lemma. Lemma 1. (a) The matrix fractional function f ( A , B ) = A T B − A is jointly convex in the m × n matrix A and the m × m positive definite matrix B . (b) The log determinant function f ( B ) = ln det B is concave on the set of positive definite matrices. .60.650.7 σ σ -520.5-521-521.5-522-522.5-52017.5 logLEMMM Figure 1: Log-likelihood surface of a 2-variance component model and the surrogate functionsof EM and MM minorizing the objective function at point ( σ t )1 , σ t )2 ) = (18 . , . .5 roof. The matrix fractional function is matrix convex because its epigraph { ( A , B , C ) : B (cid:31) , A T B − A (cid:22) C } = (cid:40) ( A , B , C ) : B (cid:31) , (cid:32) B AA T C (cid:33) (cid:23) (cid:41) is a convex set. Here C varies over the set of n × n positive semidefinite matrices. Theequivalence of these two epigraph representations is proved in Boyd and Vandenberghe (2004,A.5.5). For the concavity of the log determinant, see Boyd and Vandenberghe (2004, p74). Our strategy for maximizing the log-likelihood (1) is to alternate updating the mean param-eters β and the variance components σ . Updating β given σ is a standard general leastsquares problem with solution β ( t +1) = ( X T Ω − ( t ) X ) − X T Ω − ( t ) y . (4)Updating σ given β ( t ) depends on two minorizations. If we assume that all of the V i arepositive definite, then the joint convexity of the map ( X , Y ) (cid:55)→ X T Y − X for positivedefinite Y implies that Ω ( t ) Ω − Ω ( t ) = (cid:32) m (cid:88) i =1 σ t ) i V i (cid:33) (cid:32) m (cid:88) i =1 σ i V i (cid:33) − (cid:32) m (cid:88) i =1 σ t ) i V i (cid:33) = (cid:32) m (cid:88) i =1 σ t ) i (cid:80) j σ t ) j (cid:80) j σ t ) j σ t ) i σ t ) i V i (cid:33) · (cid:32) m (cid:88) i =1 σ t ) i (cid:80) j σ t ) j (cid:80) j σ t ) j σ t ) i σ i V i (cid:33) − · (cid:32) m (cid:88) i =1 σ t ) i (cid:80) j σ t ) j (cid:80) j σ t ) j σ t ) i σ t ) i V i (cid:33) (cid:22) m (cid:88) i =1 σ t ) i (cid:80) j σ t ) j (cid:32) (cid:80) j σ t ) j σ t ) i σ t ) i V i (cid:33) (cid:32) (cid:80) j σ t ) j σ t ) i σ i V i (cid:33) − (cid:32) (cid:80) j σ t ) j σ t ) i σ t ) i V i (cid:33) = m (cid:88) i =1 σ t ) i σ i V i V − i V i = m (cid:88) i =1 σ t ) i σ i V i . When one or more of the V i are rank deficient, we replace each V i by V i,(cid:15) = V i + (cid:15) I for (cid:15) > small. Sending (cid:15) to 0 in the just proved majorization Ω ( t ) (cid:15) Ω − (cid:15) Ω ( t ) (cid:15) (cid:22) m (cid:88) i =1 σ t ) i σ i V i,(cid:15) Ω ( t ) Ω − Ω ( t ) (cid:22) m (cid:88) i =1 σ t ) i σ i V i in the general case. Negating both sides leads to the minorization − ( y − Xβ ) T Ω − ( y − Xβ ) (cid:23) − ( y − Xβ ) T Ω − ( t ) (cid:32) m (cid:88) i =1 σ t ) i σ i V i (cid:33) Ω − ( t ) ( y − Xβ ) (5)that effectively separates the variance components σ , . . . , σ m in the quadratic term of thelog-likelihood (1).The convexity of the function A (cid:55)→ − log det A is equivalent to the supporting hyperplaneminorization − ln det Ω ≥ − ln det Ω ( t ) − tr[ Ω − ( t ) ( Ω − Ω ( t ) )] (6)that separates σ , . . . , σ m in the log determinant term of the log-likelihood (1). 
Combinationof the minorizations (5) and (6) gives the overall minorization g ( σ | σ t ) )= −
12 tr( Ω − ( t ) Ω ) −
12 ( y − Xβ ( t ) ) T Ω − ( t ) (cid:32) m (cid:88) i =1 σ t ) i σ i V i (cid:33) Ω − ( t ) ( y − Xβ ( t ) ) + c ( t ) (7) = m (cid:88) i =1 (cid:34) − σ i Ω − ( t ) V i ) − σ t ) i σ i ( y − Xβ ( t ) ) T Ω − ( t ) V i Ω − ( t ) ( y − Xβ ( t ) ) (cid:35) + c ( t ) , where c ( t ) is an irrelevant constant. Maximization of g ( σ | σ t ) ) with respect to σ i yieldsthe lovely multiplicative update σ t +1) i = σ t ) i (cid:115) ( y − Xβ ( t ) ) T Ω − ( t ) V i Ω − ( t ) ( y − Xβ ( t ) )tr( Ω − ( t ) V i ) , i = 1 , . . . , m. (8)To preserve the uniqueness and continuity of the algorithm map, we must take σ t +1) i = 0 whenever σ t ) i = 0 . As a sanity check on our derivation, consider the partial derivative ∂∂σ i L ( β , σ ) = −
12 tr( Ω − V i ) + 12 ( y − Xβ ) T Ω − V i Ω − ( y − Xβ ) . (9)Given σ t ) i > , it is clear from the update formula (8) that σ t +1) i < σ t ) i when ∂∂σ i L < .Conversely σ t +1) i > σ t ) i when ∂∂σ i L > . Algorithm 1 summarizes the MM algorithm forMLE of the univariate response model (1). 7 nput : y , X , V , . . . , V m Output : MLE ˆ β , ˆ σ , . . . , ˆ σ m Initialize σ (0) i > , i = 1 , . . . , m repeat Ω ( t ) ← (cid:80) mi =1 σ t ) i V i β ( t ) ← arg min β ( y − Xβ ) T Ω − ( t ) ( y − Xβ ) σ t +1) i ← σ t ) i (cid:114) ( y − X β ( t ) ) T Ω − ( t ) V i Ω − ( t ) ( y − X β ( t ) )tr( Ω − ( t ) V i ) , i = 1 , . . . , m until objective value converges Algorithm 1:
MM algorithm for MLE of the variance components of model (1).The update formula (8) assumes that the numerator under the square root sign is non-negative and the denominator is positive. The numerator requirement is a consequence ofthe positive semidefiniteness of V i . The denominator requirement can be verified throughthe Hadamard (elementwise) product representation tr( Ω − ( t ) V i ) = T ( Ω − ( t ) (cid:12) V i ) . Thefollowing lemma of Schur (1911) is crucial. We give a self-contained probabilistic proof. Lemma 2 (Schur) . The Hadamard product of a positive definite matrix with a positivesemidefinite matrix with positive diagonal entries is positive definite.Proof.
Let X = ( X , . . . , X n ) T be a random normal vector with mean and positive definitecovariance matrix A . Let Y = ( Y , . . . , Y n ) T be a random normal vector independent of X with mean and positive semidefinite covariance matrix B having positive diagonal entries.Then Z = X (cid:12) Y has covariances E( Z i Z j ) = E( X i Y i X j Y j ) = E( X i X j )E( Y i Y j ) = a ij b ij . Itfollows that Cov( Z ) = A (cid:12) B . To show A (cid:12) B is positive definite, suppose on the contrarythat v T ( A (cid:12) B ) v = Var( v T Z ) = 0 for some v (cid:54) = . Then v T Z ) = E (cid:16) (cid:88) i v i X i Y i (cid:17) = E (cid:104)(cid:16) (cid:88) i v i X i Y i (cid:17) | Y (cid:105) = E[( v (cid:12) Y ) T A ( v (cid:12) Y )] implies v (cid:12) Y = with probability 1. Since v (cid:54) = , Y i = 0 with probability 1 for some i .This contradicts the assumption b ii = Var( Y i ) > for all i .We can now obtain the following characterization of the MM iterates. Proposition 1.
Assume V i has strictly positive diagonal entries. Then tr( Ω − ( t ) V i ) > forall t . Furthermore if σ i > and Ω − ( t ) ( y − Xβ ( t ) ) / ∈ null ( V i ) for all t , then σ t ) i > forall t . When V i is positive definite, σ t ) i > holds if and only if y (cid:54) = Xβ ( t ) .Proof. The first claim follows easily from Schur’s lemma. The second claim follows byinduction. The third claim follows from the observation that null ( V i ) = { } .8n most applications, V m = I . Proposition 1 guarantees that if σ m > and the residualvector y − Xβ ( t ) is nonzero, then σ t ) m remains positive and thus Ω ( t ) remains positivedefinite throughout all iterations. This fact does not prevent any of the sequences σ t ) i from converging to 0. In this sense, the MM algorithm acts like an interior point method,approaching the optimum from inside the feasible region. Univariate response: two variance components
Input : y , X , V , V Output : MLE ˆ β , ˆ σ , ˆ σ Simultaneous congruence decomposition: ( D , U ) ← ( V , V ) Transform data: ˜ y ← U T y , ˜ X ← U T X Initialize σ (0)1 , σ (0)1 > repeat w ( t ) i ← ( σ t )1 d i + σ t )2 ) − , i = 1 , . . . , n β ( t ) ← arg min β (cid:80) ni =1 w ( t ) i (˜ y i − ˜ x Ti β ) σ t +1)1 ← σ t )1 (cid:114) ( ˜ y − ˜ X β ( t ) ) T ( σ t )1 D + σ t )2 I ) − D ( σ t )1 D + σ t )2 I ) − ( ˜ y − ˜ X β ( t ) )tr[( σ t )1 D + σ t )2 I ) − D ] σ t +1)2 ← σ t )2 (cid:114) ( ˜ y − ˜ X β ( t ) ) T ( σ t )1 D + σ t )2 I ) − ( ˜ y − ˜ X β ( t ) )tr[( σ t )1 D + σ t )2 I ) − ] until objective value converges Algorithm 2:
Simplified MM algorithm for MLE of model (1) with m = 2 variancecomponents and Ω = σ V + σ V .The major computational cost of Algorithm 1 is inversion of the covariance matrix Ω ( t ) at each iteration. Problem specific structures such as block diagonal matrices or a diagonalmatrix plus a low-rank matrix are often exploited to speed up matrix inversion. The specialcase of m = 2 variance components deserves attention as repeated matrix inversion can beavoided by invoking the simultaneous congruence decomposition for two symmetric matrices,one of which is positive definite (Rao, 1973; Horn and Johnson, 1985). This decompositionis also called the generalized eigenvalue decomposition (Golub and Van Loan, 1996; Boydand Vandenberghe, 2004). If one assumes Ω = σ V + σ V and lets ( V , V ) (cid:55)→ ( D , U ) bethe decomposition with U nonsingular, U T V U = D diagonal, and U T V U = I , then Ω ( t ) = U − T ( σ t )1 D + σ t )2 I n ) U − Ω − ( t ) = U ( σ t )1 D + σ t )2 I n ) − U T det( Ω ( t ) ) = det( σ t )1 D + σ t )2 I n ) det( U − T U − ) (10) = det( σ t )1 D + σ t )2 I n ) det( V ) . ˜ y = U T y and the revised predictor matrix ˜ X = U T X , theupdate (8) requires only vector operations and costs O ( n ) flops. Updating the fixed effects isa weighted least squares problem with the transformed data ( ˜ y , ˜ X ) and observation weights w ( t ) i = ( σ t )1 d i + σ t )2 ) − . Algorithm 2 summarizes the simplified MM algorithm for twovariance components. Numerical experiments
This section compares the numerical performance of MM, quasi-Newton accelerated MM,EM, and Fisher scoring on simulated data from a two-way ANOVA random effects modeland a genetic model. For ease of comparison, all algorithm runs start from σ = andterminate when the relative change ( L ( t +1) − L ( t ) ) / ( | L ( t ) | + 1) in the log-likelihood is lessthan − . Two-way ANOVA:
We simulated data from a two-way ANOVA random effects model y ijk = µ + α i + β j + ( αβ ) ij + (cid:15) ijk , where α i ∼ N (0 , σ ) , β j ∼ N (0 , σ ) , ( αβ ) ij ∼ N (0 , σ ) , and (cid:15) ijk ∼ N (0 , σ e ) are jointlyindependent. This corresponds to m = 4 variance components. In the simulation, we set σ = σ = σ e and varied the ratio σ /σ e ; the numbers of levels a and b in factor 1 and factor2, respectively; and the number of observations c in each combination of factor levels. Foreach simulation scenario, we simulated 50 replicates. The sample size was n = abc for eachreplicate.Tables 1 and 2 show the average number of iterations and the average runtimes whenthere are a = b = 5 levels of each factor. Based on these results and further results notshown for other combinations of a and b , we draw the following conclusions. Fisher scoringtakes the fewest iterations. The MM algorithm always takes fewer iterations than the EMalgorithm. Accelerated MM further improves the convergence rate of MM. The faster rateof convergence of Fisher scoring is outweighed by the extra cost of evaluating and invertingthe covariance matrix. When the sample size n = abc is large, Fisher scoring takes muchlonger than either EM or MM. Genetic model:
We simulated a quantitative trait y from a genetic model with twovariance components and covariance matrix Ω = σ a (cid:98) Φ + σ e I , where (cid:98) Φ is a full-rank empiri-cal kinship matrix estimated from the genome-wide measurements of 212 individuals usingOption 29 of the Mendel software (Lange et al., 2013, 2005). In this example, Fisher scoringexcels at smaller σ a /σ e ratios, while accelerated MM is fastest at larger σ a /σ e ratios.In summary, the MM algorithm appears competitive even in small-scale examples. Mod-ern applications often involve a large number of variance components. In this setting, the10M algorithm suffers from slow convergence and Fisher scoring from an extremely high costper iteration. Our genomic example in Section 7 reinforces this point. The KKT necessary conditions for a local maximum σ = ( σ , . . . , σ m ) of the log-likelihood(1) require each component of the score vector to satisfy ∂∂σ i L ( σ ) ∈ { } σ i > −∞ , σ i = 0 . In this section we establish the global convergence of Algorithm 1 to a KKT point. To reducethe notational burden, we assume that X is null and omit estimation of fixed effects β . Theanalysis easily extends to the MLE case. Our convergence analysis relies on characterizing theproperties of the objective function L ( σ ) and the MM algorithmic mapping σ (cid:55)→ M ( σ ) defined by equation (8). Special attention must be paid to the boundary values σ i = 0 . Weprove convergences for two cases, which cover most applications. Assumption 1.
All V i are positive definite. Assumption 2. V is positive definite, each V i is nontrivial, H = span { V , . . . , V m } hasdimension q < n , and y / ∈ H . The genetic model in Section 3 satisfies Assumption 1, while the two-way ANOVA modelsatisfies Assumption 2. The key condition y / ∈ span { V , . . . , V m } in the second case is criticalfor the existence of an MLE or an REML (Demidenko and Massam, 1999; Grządziel andMichalski, 2014). We will derive a sequence of lemmas en route to the global convergenceresult declared in Theorem 1. Lemma 3.
Under Assumption 1 or 2, the log-likelihood function (1) is coercive in the sensethat the super-level set S c = { σ ≥ : L ( σ ) ≥ c } is compact for every c .Proof. Let us first prove the assertion when all of the covariance matrices V i are positivedefinite. If we set r = (cid:107) σ (cid:107) and α i = r − σ i for each i , then the log-likelihood satisfies L ( σ ) = − n r −
12 ln det (cid:16) m (cid:88) i =1 α i V i (cid:17) − r y T (cid:16) m (cid:88) i =1 α i V i (cid:17) − y . The functions ln det (cid:16) (cid:80) mi =1 α i V i (cid:17) and y T (cid:16) (cid:80) mi =1 α i V i (cid:17) − y of α are defined and continuouson the unit simplex and hence bounded there. The dominant term − n ln r of the loglikelihoodtends to −∞ as r tends to ∞ . 11 /σ e Method c = a = b = 5 levels ofboth factors. Standard errors are given in parentheses.12 /σ e Method c = × − seconds) of MM, quasi-Newton accelerated MM (aMM),EM, and Fisher scoring (FS) for fitting a two-way ANOVA model with a = b = 5 levels ofboth factors. Standard errors are given in parentheses.13 a /σ e Method Iteration Runtime ( − sec) Objective0 MM 88.10(29.01) 778.24(305.37) -374.35(9.82)aMM 23.65(5.74) 293.16(146.23) -374.34(9.82)EM 231.93(123.39) 3509.02(1851.11) -374.41(9.83)FS 5.05(1.24) 137.76(65.74) -374.36(9.83)0.05 MM 84.97(31.18) 710.56(260.24) -377.19(10.85)aMM 23.05(5.45) 272.04(67.01) -377.18(10.85)EM 220.57(124.70) 3292.87(1865.91) -377.25(10.85)FS 5.08(1.21) 136.47(33.18) -377.21(10.83)0.1 MM 82.45(34.39) 673.96(268.23) -379.62(10.54)aMM 22.55(6.01) 269.55(86.69) -379.61(10.54)EM 199.70(113.47) 2917.71(1607.33) -379.68(10.54)FS 4.97(1.03) 129.71(40.51) -379.62(10.54)1 MM 31.00(15.59) 160.21(80.45) -409.66(11.26)aMM 12.55(5.38) 90.21(43.54) -409.66(11.26)EM 51.10(28.70) 550.55(321.89) -409.67(11.26)FS 4.60(0.59) 80.28(25.56) -409.66(11.26)10 MM 72.67(39.23) 374.80(209.31) -532.57(9.11)aMM 20.15(10.18) 146.25(81.06) -531.24(9.28)EM 294.20(717.05) 3079.82(7520.30) -532.71(9.11)FS 10.18(4.92) 168.63(80.34) -532.08(9.21)20 MM 78.35(34.32) 425.40(188.08) -591.36(7.05)aMM 14.80(6.53) 117.14(71.14) -589.13(7.15)EM 362.07(764.60) 4144.92(8862.65) -591.62(6.82)FS 10.93(4.75) 181.48(83.96) -590.68(7.08)Table 3: Average performance of MM, quasi-Newton accelerated MM (aMM), EM, andFisher scoring (FS) for fitting a genetic model. Standard errors are given in parentheses.14o prove the assertion under Assumption 2, consider first the case V = I n . Setting α i = σ i /σ for i = 2 , . . . , m reduces the loglikelihood to L ( σ , α ) = − n σ −
12 ln det (cid:16) I n + m (cid:88) i =2 α i V i (cid:17) − σ y T (cid:16) I n + m (cid:88) i =2 α i V i (cid:17) − y . (11)The middle term on the right satisfies −
12 ln det (cid:16) I n + m (cid:88) i =2 α i V i (cid:17) ≤ because det ( I n + (cid:80) mi =2 α i V i ) ≥ det I n = 1 . Now let U = ( U q , U n − q ) be an n × n orthogonalmatrix whose left columns U q span H and whose right columns U n − q span H ⊥ . The identity U T (cid:16) I n + m (cid:88) i =2 α i V i (cid:17) U = (cid:32) I q + (cid:80) mi =2 α i U Tq V i U q I n − q (cid:33) follows from the orthogonality relations U Tn − q V i = U Tn − q U q = ( n − q ) × n . This in turn implies (cid:16) I n + m (cid:88) i =2 α i V i (cid:17) − = U (cid:32) ( I q + (cid:80) mi =2 α i U Tq V i U q ) − I n − q (cid:33) U T (cid:23) U (cid:32) I n − q (cid:33) U T = U n − q U Tn − q . Therefore the quadratic term in equation (11) is bounded below by the positive constant y T (cid:16) I n + m (cid:88) i =2 α i V i (cid:17) − y ≥ y T U n − q U Tn − q y = (cid:107) P H ⊥ y (cid:107) > . Here the assumption y / ∈ H guarantees the projection property P H ⊥ y (cid:54) = .Next we show that the loglikelihood tends to −∞ when σ tends to or ∞ or when (cid:107) α (cid:107) tends to ∞ . The second of the two inequalities L ( σ , α ) ≤ − n σ −
12 ln det (cid:16) I n + m (cid:88) i =2 α i V i (cid:17) − σ (cid:107) P H ⊥ y (cid:107) ≤ − n σ − σ (cid:107) P H ⊥ y (cid:107) renders the claim about σ obvious. To prove the claim about α , we make the worst casechoice σ i = (cid:107) P H ⊥ y (cid:107) in the first inequality. It follows that L ( σ , α ) ≤ −
12 ln det (cid:16) I n + m (cid:88) i =2 α i V i (cid:17) − n (cid:107) P H ⊥ y (cid:107) − n . α j tends to ∞ , then the inequality −
12 ln det (cid:16) I n + m (cid:88) i =2 α i V i (cid:17) ≤ −
12 ln det (cid:16) I n + α j V j (cid:17) = − n (cid:88) k =1 ln(1 + α j λ jk ) holds, where the λ jk are the eigenvalues of V j . At least one of these eigenvalues is positivebecause V j is nontrivial. It follows that L ( σ , α ) tends to −∞ in this case as well.For the general case where V is non-singular but not necessarily I n , let V / be thesymmetric square root of V and write V + m (cid:88) i =2 σ i V i = V / (cid:32) I + m (cid:88) i =2 σ i V − / V i V − / (cid:33) V / . The above arguments still apply since each V − / V i V − / is nontrivial and y belongs to thespan { V , . . . , V m } = S if and only if V − / y belongs to V − / SV − / . Lemma 4.
The iterates possess the ascent property L ( M ( σ t ) )) ≥ L ( σ t ) ) . Furthermore,when L ( M ( σ ∗ )) = L ( σ ∗ ) , σ ∗ fulfills the fixed point condition M ( σ ∗ ) = σ ∗ , and each com-ponent satisfies either (i) σ ∗ i = 0 or (ii) σ ∗ i > and ∂∂σ i L ( σ ∗ ) = 0 .Proof. The ascent property is built into any MM algorithm. Suppose L ( M ( σ ∗ )) = L ( σ ∗ ) ata point σ ∗ ∈ R m + . Then equality must hold in the string of inequalities (3). It follows that g ( M ( σ ∗ ) | σ ∗ ) = g ( σ ∗ | σ ∗ ) and hence that M ( σ ∗ ) = σ ∗ . If σ ∗ i > , the stationarity condition ∂∂σ i L ( σ ∗ ) = ∂∂σ i g ( σ ∗ | σ ∗ ) = 0 applies. The equivalence of the two displayed partial derivatives is a consequence of the factthat the difference f ( σ ) − g ( σ | σ ∗ ) achieves its minimum of 0 at σ = σ ∗ . Lemma 5.
The distance between successive iterates (cid:107) σ t +1) − σ t ) (cid:107) converges to 0.Proof. Suppose on the contrary that (cid:107) σ t +1) − σ t ) (cid:107) does not converge to 0. Then one canextract a subsequence { t k } k ≥ such that (cid:107) σ t k +1) − σ t k ) (cid:107) ≥ (cid:15) > (12)for all k . Let C be the compact super-level set { σ : L ( σ ) ≥ L ( σ ) } . Since the sequence { σ t k ) } k ≥ is confined to C , one can pass to a subsequence if necessary and assume that σ t k ) converges to a limit σ ∗ and that σ t k +1) converges to a limit σ ∗∗ . Taking limits in therelation σ t k +1) = M ( σ t k ) ) and invoking the continuity M ( σ ) imply that σ ∗∗ = M ( σ ∗ ) .16ecause the sequence L ( σ t k ) ) is monotonically increasing in k and bounded above on C ,it converges to a limit L ∗ . Hence, the continuity of L ( σ ) implies L ( σ ∗ ) = lim k L ( σ t k ) ) = L ∗ = lim k L ( σ t k +1) ) = L ( σ ∗∗ ) = L ( M ( σ ∗ )) . Lemma 4 therefore gives σ ∗∗ = M ( σ ∗ ) = σ ∗ , contradicting the bound (cid:107) σ ∗ − σ ∗∗ (cid:107) ≥ (cid:15) entailed by inequality (12). Theorem 1.
The MM sequence { σ t ) } t ≥ has at least one limit point. Every limit point isa fixed point of M ( σ ) . If the set of fixed points is discrete, then the MM sequence convergesto one of them. Finally, when the iterates converge, their limit is a KKT point.Proof. The sequence { σ t ) } t ≥ is contained in the super-level compact set C defined inLemma 5 and therefore admits a convergent subsequence σ t k ) with limit σ ∞ ) . As arguedin Lemma 5, L ( σ ∞ ) ) = L ( M ( σ ∞ ) )) . Lemma 4 now implies that σ ∞ ) is a fixed point ofthe algorithm map M ( σ ) .According to Ostrowski’s theorem (Lange, 2010, Proposition 8.2.1), the set of limit pointsof a bounded sequence { σ t ) } t ≥ is connected and compact provided (cid:107) σ t +1) − σ t ) (cid:107) → .If the set of fixed points is discrete, then the connected subset of limit points reduces to asingle point. Hence, the bounded sequence σ t ) converges to this point. When the limitexists, one can check that σ ∞ ) satisfies the KKT conditions by proving that each zerocomponent of σ ∞ ) has a non-positive partial derivative. Suppose on the contrary σ ∞ ) i = 0 and ∂∂σ i L ( σ ∞ ) ) > . By continuity ∂∂σ i L ( σ t ) ) > for all large t . Therefore, σ t +1) i > σ t ) i for all large t by the observation made after equation (9). This behavior is inconsistent withthe assumption that σ t ) i → . Examination of Tables 2 and 3 suggests that the MM algorithm usually converges faster thanthe EM algorithm. We now provide theoretical justification for this observation. Again fornotational convenience, we consider the REML case where X is null. Since the EM principleis just a special instance of the MM principle, we can compare their convergence propertiesin a unified framework. Consider an MM map M ( θ ) for maximizing the objective function f ( θ ) via the surrogate function g ( θ | θ ( t ) ) . Close to the optimal point θ ∞ , θ ( t +1) − θ ∞ ≈ dM ( θ ∞ )( θ ( t ) − θ ∞ ) , where dM ( θ ∞ ) is the differential of the mapping M at the optimal point θ ∞ of f ( θ ) . Hence,the local convergence rate of the sequence θ ( t +1) = M ( θ ( t ) ) coincides with the spectral17adius of dM ( θ ∞ ) . Familiar calculations (McLachlan and Krishnan, 2008; Lange, 2010)demonstrate that dM ( θ ∞ ) = I − [ d g ( θ ∞ | θ ∞ )] − d f ( θ ∞ ) . In other words, the local convergence rate is determined by how well the surrogate surface g ( θ | θ ∞ ) approximates the objective surface f ( θ ) near the optimal point θ ∞ . In the EMliterature, dM ( θ ∞ ) is called the rate matrix (Meng and Rubin, 1991). Fast convergenceoccurs when the surrogate g ( θ | θ ∞ ) hugs the objective f ( θ ) tightly around θ ∞ . Figure 1shows a case where the MM surrogate locally dominates the EM surrogate. We demonstratethat this is no accident.McLachlan and Krishnan (2008) derive the EM surrogate g EM ( σ | σ t ) ) = − m (cid:88) i =1 (cid:104) rank( V i ) ln σ i + rank( V i ) σ t ) i σ i − σ t ) i σ i tr( Ω − ( t ) V i ) (cid:105) − m (cid:88) i =1 σ t ) i σ i y T Ω − ( t ) V i Ω − ( t ) y minorizing the log-likelihood up to an irrelevant constant. Section S.3 of the SupplementaryMaterials gives a detailed derivation for the more general multivariate response case. Therank of the covariance matrix V i appears because V i may not be invertible. Both of thesurrogates g EM ( σ | σ ∞ ) ) and g MM ( σ | σ ∞ ) ) are parameter separated. This impliesthat both second differentials d g EM ( σ ∞ ) | σ ∞ ) ) and d g MM ( σ ∞ ) | σ ∞ ) ) are diagonal. 
Asmall diagonal entry of either matrix indicates fast convergence of the corresponding variancecomponent. Our next result shows that, under Assumption 1, on average the diagonal entriesof d g EM ( σ ∞ ) | σ ∞ ) ) dominate those of d g MM ( σ ∞ ) | σ ∞ ) ) when m > . Thus, the EMalgorithm tends to converge more slowly than the MM algorithm, and the difference is morepronounced as the number of variance components m grows. Theorem 2.
Let σ ∞ ) (cid:31) m be a common limit point of the EM and MM algorithms. Thenboth second differentials d g MM ( σ ∞ ) | σ ∞ ) ) and d g EM ( σ ∞ ) | σ ∞ ) ) are diagonal with d g EM ( σ ∞ ) | σ ∞ ) ) ii = − rank( V i )2 σ ∞ ) i d g MM ( σ ∞ ) | σ ∞ ) ) ii = − y T Ω − ( ∞ ) V i Ω − ( ∞ ) y σ ∞ ) i = − tr( Ω − ( ∞ ) V i ) σ ∞ ) i . Furthermore, the average ratio m m (cid:88) i =1 d g MM ( σ ∞ ) | σ ∞ ) ) ii d g EM ( σ ∞ ) | σ ∞ ) ) ii = 2 mn m (cid:88) i =1 tr( Ω − ( ∞ ) σ ∞ ) i V i ) = 2 m < for m > when all V i have full rank n . roof. See Section ?? of the Supplementary Materials.Both the EM and MM algorithms must evaluate the traces tr( Ω − ( t ) V i ) and quadraticforms ( y − Xβ ( t ) ) T Ω − ( t ) V i Ω − ( t ) ( y − Xβ ( t ) ) at each iteration. Since these quantities arealso the building blocks of the approximate rate matrices d g ( σ t ) | σ t ) ) , one can rationallychoose either the EM or MM updates based on which has smaller diagonal entries measuredby the (cid:96) , (cid:96) , or (cid:96) ∞ norms. At negligible extra cost, this produces a hybrid algorithm thatretains the ascent property and enjoys the better of the two convergence rates. Besides its competitive numerical performance, Algorithm 1 is attractive for its simplicityand ease of generalization. In this section, we outline MM algorithms for multivariate re-sponse models possibly with missing data, linear mixed models, MAP estimation, penalizedestimation, and generalized estimating equations.
Consider the multivariate response model with n × d response matrix Y , mean E Y = XB ,and covariance Ω = Cov(vec Y ) = m (cid:88) i =1 Γ i ⊗ V i . The p × d coefficient matrix B collects the fixed effects, the Γ i are unknown d × d covariancematrices, and the V i are known n × n covariance matrices. If the vector vec Y is normallydistributed, then Y equals a sum of independent matrix normal distributions (Gupta andNagar, 1999). We now make this assumption and pursue estimation of B and the Γ i , whichwe collectively denote as Γ . Under the normality assumption, the Kronecker product identity vec( CDE ) = ( E T ⊗ C )vec( D ) yields the log-likelihood L ( B , Γ ) = −
12 ln det Ω −
12 vec( Y − XB ) T Ω − vec( Y − XB ) (13) = −
12 ln det Ω −
12 [vec Y − ( I d ⊗ X )vec B ] T Ω − [vec Y − ( I d ⊗ X )vec B ] . Updating B given Γ ( t ) is accomplished by solving the general least squares problem metearlier in the univariate case. Maximization of the log-likelihood (13) is difficult due to therequirement that each Γ i be positive semidefinite. Typical solutions involve reparameteri-zation of the covariance matrix (Pinheiro and Bates, 1996). The MM algorithm derived inthis section gracefully accommodates the covariance constraints.19pdating Γ given B ( t ) requires generalizing the minorization (5). In view of Lemma 1and the identities ( A ⊗ B )( C ⊗ D ) = ( AC ) ⊗ ( BD ) and ( A ⊗ B ) − = A − ⊗ B − , we have Ω ( t ) Ω − Ω ( t ) = m (cid:34) m m (cid:88) i =1 Γ ( t ) i ⊗ V i (cid:35) (cid:34) m m (cid:88) i =1 Γ i ⊗ V i (cid:35) − (cid:34) m m (cid:88) i =1 Γ ( t ) i ⊗ V i (cid:35) (cid:22) m m m (cid:88) i =1 ( Γ ( t ) i ⊗ V i )( Γ i ⊗ V i ) − ( Γ ( t ) i ⊗ V i )= m (cid:88) i =1 ( Γ ( t ) i Γ − i Γ ( t ) i ) ⊗ V i , or equivalently Ω − (cid:22) Ω − ( t ) (cid:104) m (cid:88) i =1 ( Γ ( t ) i Γ − i Γ ( t ) i ) ⊗ V i (cid:105) Ω − ( t ) . (14)This derivation relies on the invertibility of the matrices V i . One can relax this assumptionby substituting V (cid:15),i = V i + (cid:15) I n for V i and sending (cid:15) to 0.The majorization (14) and the minorization (6) jointly yield the surrogate g ( Γ | Γ ( t ) ) = − m (cid:88) i =1 (cid:110) tr[ Ω − ( t ) ( Γ i ⊗ V i )]+(vec R ( t ) ) T [( Γ ( t ) i Γ − i Γ ( t ) i ) ⊗ V i ] (vec R ( t ) ) (cid:111) + c ( t ) , where R ( t ) is the n × d matrix satisfying vec R ( t ) = Ω − ( t ) vec( Y − XB ( t ) ) and c ( t ) is anirrelevant constant. Based on the Kronecker identities (vec A ) T vec B = tr( A T B ) and vec( CDE ) = ( E T ⊗ C )vec( D ) , the surrogate can be rewritten as g ( Γ | Γ ( t ) ) = − m (cid:88) i =1 (cid:110) tr[ Ω − ( t ) ( Γ i ⊗ V i )] + tr( R ( t ) T V i R ( t ) Γ ( t ) i Γ − i Γ ( t ) i ) (cid:111) + c ( t ) = − m (cid:88) i =1 (cid:110) tr[ Ω − ( t ) ( Γ i ⊗ V i )] + tr( Γ ( t ) i R ( t ) T V i R ( t ) Γ ( t ) i Γ − i ) (cid:111) + c ( t ) . The first trace is linear in Γ i with the coefficient of entry ( Γ i ) jk equal to tr( Ω − ( t ) jk V i ) = Tn ( V i (cid:12) Ω − ( t ) jk ) n , where Ω − ( t ) jk is the ( j, k ) -th n × n block of Ω − ( t ) . The matrix M i of these coefficients can bewritten as M i = T T . . . T T T . . . T ... ... . . . ... T T . . . T V i . . . V i ... . . . ... V i . . . V i (cid:12) Ω ( − t ) . . .
00 1 . . . ... ... . . . ... . . . = ( I d ⊗ n ) T [( d Td ⊗ V i ) (cid:12) Ω − ( t ) ]( I d ⊗ n ) . nput : Y , X , V , . . . , V m output : MLE (cid:98) B , (cid:98) Γ , . . . , (cid:98) Γ m Initialize Γ (0) i positive definite, i = 1 , . . . , m repeat Ω ( t ) ← (cid:80) mi =1 Γ ( t ) i ⊗ V i B ( t ) ← arg min B [vec Y − ( I d ⊗ X )vec B ] T Ω − ( t ) [vec Y − ( I d ⊗ X )vec B ] R ( t ) ← reshape ( Ω − ( t ) vec( Y − XB ( t ) ) , n, d ) for i = 1 , . . . , m do Cholesky L ( t ) i L ( t ) Ti ← ( I d ⊗ n ) T [( d Td ⊗ V i ) (cid:12) Ω − ( t ) ]( I d ⊗ n ) Γ ( t +1) i ← L − ( t ) Ti [ L ( t ) Ti ( Γ ( t ) i R ( t ) T V i R ( t ) Γ ( t ) i ) L ( t ) i ] / L − ( t ) i end until objective value converges Algorithm 3:
The MM algorithm for MLE of the multivariate response model (13).The directional derivative of g ( Γ | Γ ( t ) ) with respect to Γ i in the direction ∆ i is −
12 tr( M i ∆ i ) + 12 tr( Γ ( t ) i R ( t ) T V i R ( t ) Γ ( t ) i Γ − i ∆ i Γ − i )= −
12 tr( M i ∆ i ) + 12 tr( Γ − i Γ ( t ) i R ( t ) T V i R ( t ) Γ ( t ) i Γ − i ∆ i ) . Because all directional derivatives of g ( Γ | Γ ( t ) ) vanish at a stationarity point, the matrixequation M i = Γ − i Γ ( t ) i R ( t ) T V i R ( t ) Γ ( t ) i Γ − i (15)holds. Fortunately, this equation admits an explicit solution. For positive scalers a and b , thesolution to the equation b = x − ax − is x = ± (cid:112) a/b . The matrix analogue of this equationis the Riccati equation B = X − AX − , whose solution is summarized in the next lemma. Lemma 6.
Assume A and B are positive definite and L is the Cholesky factor of B .Then Y = L − T ( L T AL ) / L − is the unique positive definite solution to the matrix equation B = X − AX − .Proof. Direct substitution shows that Y solves the equivalent equation XBX = A . Toshow uniqueness, suppose Y − AY − = B and Z − AZ − = B . The equations ( B / Y B / ) = B / Y BY B / = B / AB / ( B / ZB / ) = B / ZBZB / = B / AB / imply B / Y B / = B / ZB / by virtue of the uniqueness of symmetric square root. Since B − / is positive definite, Y = Z . 21he Cholesky factor L in Lemma 6 can be replaced by the symmetric square root of B .The solution, which is unique, remains the same. The Cholesky decomposition is preferredfor its cheaper computational cost and better numerical stability.Algorithm 3 summarizes the MM algorithm for fitting the multi-response model (3). Eachiteration invokes m Cholesky decompositions and symmetric square roots of d × d positivedefinite matrices. Fortunately in most applications, d is a small number. The followingresult guarantees the non-singularity of the Cholesky factor throughout iterations. Proposition 2.
Assume V i has strictly positive diagonal entries. Then the symmetric matrix M i = ( I d ⊗ n ) T [( d Td ⊗ V i ) (cid:12) Ω − ( t ) ]( I d ⊗ n ) is positive definite for all t . Furthermore if Γ (0) i (cid:31) and no column of R ( t ) lies in the null space of V i for all t , then Γ ( t ) i (cid:31) for all t .Proof. If V i has strictly positive diagonal entries, then so does d Td ⊗ V i , and the Hadamardproduct ( d Td ⊗ V i ) (cid:12) Ω − ( t ) is positive definite by Schur’s lemma. Since the matrix I d ⊗ n has full column rank d , the matrix M i is also positive definite. Finally, if no column of R ( t ) lies in the null space of V i , and Γ ( t ) is positive define, then Γ ( t ) i R ( t ) T V i R ( t ) Γ ( t ) i is positivedefinite. The second claim follows by induction and Lemma 6. Multivariate response, two variance components
When there are m = 2 variance components Ω = Γ ⊗ V + Γ ⊗ V , repeated inversion of the nd × nd covariance matrix Ω reduces to a single nd × nd simultaneous congruence decom-position and, per iteration, two d × d Cholesky decompositions and one d × d simultaneouscongruence decomposition. The simultaneous congruence decomposition of the matrix pair ( V , V ) involves generalized eigenvalues d = ( d , . . . , d n ) and a nonsingular matrix U suchthat U T V U = D = diag( d ) and U T V U = I . If the simultaneous congruence decomposi-tion of ( Γ ( t )1 , Γ ( t )2 ) is ( Λ ( t ) , Φ ( t ) ) with Φ ( t ) T Γ ( t )1 Φ ( t ) = Λ ( t ) = diag( λ ( t ) ) and Φ ( t ) T Γ ( t )2 Φ ( t ) = I d ,then Ω ( t ) = ( Φ − ( t ) ⊗ U − ) T ( Λ ( t ) ⊗ D + I d ⊗ I n )( Φ − ( t ) ⊗ U − ) Ω − ( t ) = ( Φ ( t ) ⊗ U )( Λ ( t ) ⊗ D + I d ⊗ I n ) − ( Φ ( t ) ⊗ U ) T det Ω ( t ) = det( Λ ( t ) ⊗ D + I d ⊗ I n ) det[( Φ − ( t ) ⊗ U − ) T ( Φ − ( t ) ⊗ U − )]= det( Λ ( t ) ⊗ D + I d ⊗ I n ) det( Γ ( t )2 ⊗ V )= det( Λ ( t ) ⊗ D + I d ⊗ I n ) det( Γ ( t )2 ) n det( V ) d . Updating the fixed effects reduces to a weighted least squares problem for the transformedresponses ˜ Y = U T Y , transformed predictor matrix ˜ X = U T X , and observation weights ( λ ( t ) k d i + 1) − . Algorithm 4 summarizes the simplified MM algorithm. Detailed derivationsare relegated to Section ?? of the Supplementary Materials.22 nput : Y , X , V , V Output : MLE (cid:98) B , (cid:98) Γ , (cid:98) Γ Simultaneous congruence decomposition: ( D , U ) ← ( V , V ) Transform data: ˜ Y ← U T Y , ˜ X ← U T X Initialize Γ (0)1 , Γ (0)2 positive definite repeat Simultaneous congruence decomposition ( Λ ( t ) , Φ ( t ) ) ← ( Γ ( t )1 , Γ ( t )2 ) B ( t ) ← arg min B [vec( ˜ Y Φ ( t ) ) − ( Φ ( t ) T ⊗ ˜ X )vec B ] T ( Λ ( t ) ⊗ D + I d ⊗ I n ) − [vec( ˜ Y Φ ( t ) ) − ( Φ ( t ) T ⊗ ˜ X )vec B ] Cholesky L ( t )1 L ( t ) T ← Φ ( t ) diag (cid:16) tr (cid:16) D ( λ ( t ) k D + I n ) − (cid:17) , k = 1 , . . . , d (cid:17) Φ ( t ) T Cholesky L ( t )2 L ( t ) T ← Φ ( t ) diag (cid:16) tr (cid:16) ( λ ( t ) k D + I n ) − (cid:17) , k = 1 , . . . , d (cid:17) Φ ( t ) T N ( t )1 ← D / [( ˜ Y − ˜ XB ( t ) ) Φ ( t ) ) (cid:11) ( dλ ( t ) T + n Td )] Λ ( t ) Φ − ( t ) N ( t )2 ← [( ˜ Y − ˜ XB ( t ) ) Φ ( t ) ) (cid:11) ( dλ ( t ) T + n Td )] Φ − ( t ) Γ ( t +1) i ← L − ( t ) Ti ( L ( t ) Ti M ( t ) Ti M ( t ) i L ( t ) i ) / L − ( t ) i , i = 1 , until objective value converges Algorithm 4:
MM algorithm for multivariate response model Ω = Γ ⊗ V + Γ ⊗ V with two variance components matrices. Note that (cid:11) denotes a Hadamard quotient. In many applications the multivariate response model (13) involves missing responses. Forinstance, in testing multiple longitudinal traits in genetics, some trait values y ij may bemissing due to dropped patient visits, while their genetic covariates are complete. Missingdata destroys the symmetry of the log-likelihood (13) and complicates finding the MLE.Fortunately, MM algorithm 3 easily adapts to this challenge.The familiar EM argument (McLachlan and Krishnan, 2008, Section 2.2) shows that − n Ω ( t ) −
12 tr { Ω − ( t ) [vec( Z ( t ) − XB ( t ) )vec( Z ( t ) − XB ( t ) ) T + C ( t ) ] } (16)minorizes the observed log-likelihood at the current iterate ( B ( t ) , Γ ( t )1 , . . . , Γ ( t ) m ) . Here Z ( t ) isthe completed response matrix given the observed responses Y ( t ) obs and the current parametervalues. The complete data Y is assumed to be normally distributed N (vec( XB ( t ) ) , Ω ( t ) ) .The block matrix C ( t ) is 0 except for a lower-right block consisting of a Schur complement.To maximize the surrogate (16), we invoke the familiar minorization (6) and majorization(14) to separate the variance components Γ i . At each iteration we impute missing entries bytheir conditional means, compute their conditional variances and covariances to supply theSchur complement, and then update the fixed effects and variance components by the explicitupdates of Algorithm 3. The required conditional means and conditional variances can be23onveniently obtained in the process of inverting Ω ( t ) by the sweep operator of computationalstatistics (Lange, 2010, Section 7.3). The linear mixed model plays a central role in longitudinal data analysis. For the sake ofsimplicity, consider the single-level LMM (Laird and Ware, 1982; Bates and Pinheiro, 1998)for n independent data clusters ( y i , X i , Z i ) with Y i = X i β + Z i γ i + (cid:15) i , i = 1 , . . . , n, where β is a vector of fixed effects, the γ i ∼ N ( , R i ( θ )) are independent random effects, and (cid:15) i ∼ N ( , σ I n i ) captures random noise independent of γ i . We assume the matrices Z i havefull column rank. The within-cluster covariance matrices R i ( θ ) depend on a parameter vector θ ; typical choices for R i ( θ ) impose autocorrelation, compound symmetry, or unstructuredcorrelation. It is clear that Y i is normal with mean X i β , covariance Ω i = Z i R i ( θ ) Z Ti + σ I n i ,and log-likelihood L i ( β , θ , σ ) = −
12 ln det Ω i −
12 ( y i − X i β ) T Ω − i ( y i − X i β ) . The next three facts about pseudo-inverses are used in deriving the MM algorithm for LMM.
Lemma 7. If A has full column rank and B has full row rank, then ( AB ) + = B + A + .Proof. Under the hypotheses, the representations A + = ( A T A ) + A T = ( A T A ) − A T and B + = B T ( BB T ) − are well known. The choice B + A + = B T ( BB T ) − ( A T A ) − A T satisfiesthe four equations characterizing the pseudo-inverse of AB . Lemma 8. If A and B are positive semidefinite matrices with the same range, then lim (cid:15) ↓ ( B + (cid:15) I )( A + (cid:15) I ) − ( B + (cid:15) I ) = BA + B . Proof.
Suppose A has spectral decomposition (cid:80) i λ i u i u Ti . The matrix P = (cid:80) λ i > u i u Ti projects onto the range of A and therefore also projects onto the range of B . It follows that P B = B and by symmetry that BP = B . This allows us to write ( B + (cid:15) I )( A + (cid:15) I ) − ( B + (cid:15) I )= BP ( A + (cid:15) I ) − P B + (cid:15) BP ( A + (cid:15) I ) − + (cid:15) ( A + (cid:15) I ) − P B + (cid:15) ( A + (cid:15) I ) − . The last three of these terms vanish as (cid:15) ↓ ; the first term tends to the claimed limit. Theseassertions follow from the expressions P ( A + (cid:15) I ) − P = P ( A + (cid:15) I ) − = ( A + (cid:15) I ) − P = (cid:88) λ i > λ i + (cid:15) u i u Ti and (cid:15) ( A + (cid:15) I ) − = (cid:80) i (cid:15) λ i + (cid:15) u i u Ti . 24 emma 9. If R and S are positive definite matrices, and the conformable matrix Z has fullcolumn rank, then the matrices ZRZ T and ZSZ T share a common range.Proof. In fact, both matrices have range equal to the range of Z . The matrices Z and ZR / clearly have the same range. Furthermore, the matrices ZR / and ZR / R / Z T also havethe same range.The convexity of the map ( X , Y ) (cid:55)→ X T Y − X and Lemmas 7, 8, and 9 now yield viathe obvious limiting argument the majorization Ω ( t ) Ω − Ω ( t ) = ( Z i R i ( θ ( t ) ) Z Ti + σ t ) I n i )( Z i R i ( θ ) Z Ti + σ I n i ) − ( Z i R i ( θ ( t ) ) Z Ti + σ t ) I n i ) (cid:22) ( Z i R i ( θ ( t ) ) Z Ti )( Z i R i ( θ ) Z Ti ) + ( Z i R i ( θ ( t ) ) Z Ti ) + σ t ) σ I n i = [ Z i R i ( θ ( t ) ) Z Ti Z T + i ] R − i ( θ )[ Z + i Z i R i ( θ ( t ) ) Z Ti ] + σ t ) σ I n i In combination with the minorization (6), this gives the surrogate g i ( θ , σ | θ ( t ) , σ t ) ) = −
12 tr( Z Ti Ω − ( t ) i Z i R i ( θ )) − r ( t ) Ti R − i ( θ ) r ( t ) i − σ Ω − ( t ) i ) − σ t ) σ ( y i − X i β ( t ) ) T Ω − t ) i ( y i − X i β ( t ) ) + c ( t ) , for the log-likelihood L i ( θ , σ ) , where r ( t ) i = ( Z + i Z i R i ( θ ( t ) ) Z Ti ) Ω − ( t ) i ( y i − X i β ( t ) ) = R i ( θ ( t ) ) Z Ti Ω − ( t ) i ( y i − X i β ( t ) ) . The parameters θ and σ are nicely separated. To maximize the overall minorization function (cid:80) i g i ( θ , σ | θ ( t ) , σ t ) ) , we update σ via σ t +1) = σ t ) (cid:118)(cid:117)(cid:117)(cid:116) (cid:80) i ( y i − X i β ( t ) ) T Ω − t ) i ( y i − X i β ( t ) ) (cid:80) i tr( Ω − ( t ) i ) . For structured models such as autocorrelation and compound symmetry, updating θ is a low-dimensional optimization problem that can be approached through the stationarity condition (cid:88) i vec (cid:16) Z Ti Ω ( t ) i Z i − R − i ( θ ) r ( t ) i r ( t ) Ti R − i ( θ ) (cid:17) T ∂∂θ j vec R i ( θ ) = 0 for each component θ j . For the unstructured model with R i ( θ ) = R for all i , the stationaritycondition reads (cid:88) i Z Ti Ω ( t ) i Z i = R − (cid:32)(cid:88) i r ( t ) i r ( t ) Ti (cid:33) R − Y i = X i β + Z i γ i + · · · Z im γ im + (cid:15) i . Minorization separates parameters for each level (variance component). Depending on thecomplexity of the covariance matrices, maximization of the surrogate can be accomplishedanalytically. For the sake of brevity, details are omitted.
Suppose β follows an improper flat prior, the variance components σ i follow inverse gammapriors with shapes α i > and scales γ i > , and these priors are independent. The log-posterior density then reduces to ln f ( β , σ | y , X )= −
12 ln det Ω −
12 ( y − Xβ ) T Ω − ( y − Xβ ) − m (cid:88) i =1 ( α i + 1) ln σ i − m (cid:88) i =1 γ i σ i + c, (17)where c is an irrelevant constant. The MAP estimator of ( β , σ ) is the mode of the posteriordistribution. The update (4) of β given σ remains the same. To update σ given β , applythe same minorizations (5) and (6) to the first first two terms of equation (17). This separatesparameters and yields a convex surrogate for each σ i . The minimum of the σ i surrogate isdefined by the stationarity condition −
12 tr( Ω − ( t ) V i ) + σ t ) i σ i ( y − Xβ ( t ) ) T Ω − ( t ) V i Ω − ( t ) ( y − Xβ ( t ) ) − α i + 1 σ i + γ i σ i . Multiplying this by σ i gives a quadratic equation in σ i . The positive root should be takento meet the nonnegativity constraint on σ i .For the multivariate response model (13), we assume the variance components Γ i followindependent inverse Wishart distributions with degrees of freedom ν i > d − and scalematrix Ψ i (cid:31) . The log density of the posterior distribution is L ( B , Γ | X , Y ) = −
12 ln det Ω −
12 vec( Y − XB ) T Ω − vec( Y − XB ) − m (cid:88) i =1 ( ν i + d + 1) ln det Γ i − m (cid:88) i =1 tr( Ψ i Γ − i ) + c, (18)where c is an irrelevant constant. Invoking the minorizations (6) and (14) for the first twoterms and the supporting hyperplane minorization − ln det Γ i ≥ − ln det Γ ( t ) i − tr { Γ − ( t ) i ( Γ i − Γ ( t ) i ) } − ln det Γ i gives the surrogate function g ( Γ | Γ ( t ) ) = − m (cid:88) i =1 tr( Ω − ( t ) ( Γ i ⊗ V i )) − m (cid:88) i =1 tr (cid:16) Γ ( t ) i R ( t ) T V i R ( t ) Γ ( t ) i Γ − i (cid:17) − m (cid:88) i =1 ( ν i + d + 1)tr( Γ − ( t ) i Γ i ) − m (cid:88) i =1 tr( Ψ i Γ − i ) + c ( t ) . The optimal Γ i satisfies the stationarity condition ( I d ⊗ n ) T [( d Td ⊗ V i ) (cid:12) Ω − ( t ) ]( I d ⊗ n ) + ( ν i + d + 1) Γ − ( t ) i = Γ − i ( Γ ( t ) i R ( t ) T V i R ( t ) Γ ( t ) i + Ψ i ) Γ − i and can be found using Lemma 6. In the statistical analysis of high-dimensional data, the imposition of sparsity leads to bet-ter interpretation and more stable parameter estimation. MM algorithms mesh well withpenalized estimation. The simple variance components model (1) illustrates this fact. Forthe selection of fixed effects, minimizing the lasso-penalized log-likelihood − L ( β , σ ) + λ (cid:88) j | β j | is often recommended (Schelldorfer et al., 2011). The only change to the MM Algorithm1 is that in estimating β , one solves a lasso penalized general least squares problem ratherthan an ordinary general least squares problem. The updates of the variance components σ i remain the same. For selection among a large number of variance components, one canminimize the ridge-penalized log-likelihood − L ( β , σ ) + λ m (cid:88) i =1 σ i subject to the nonnegativity constraints σ i ≥ . Here the standard deviations σ i are theunderlying parameters. The variance update (8) becomes σ t +1) i = σ t ) i (cid:115) ( y − Xβ ( t ) ) T Ω − ( t ) V i Ω − ( t ) ( y − Xβ ( t ) )tr( Ω − ( t ) V i ) + 2 λ , i = 1 , . . . , m. The updates for the fixed effects β are unaffected. Equation (19) clearly exhibits shrinkagebut no thresholding. The lasso penalized log-likelihood − L ( β , σ ) + λ m (cid:88) i =1 σ i (19)27ubject to nonnegativity constraint σ i ≥ achieves both ends. The update of σ i is chosenamong the positive roots of a quartic equation and the boundary 0, whichever yields a lowerobjective value. One can extend the MM algorithms to binary and discrete response data with the frame-work of generalized estimating equations (GEE) (Liang and Zeger, 1986). Again consider n independent data clusters ( y i , X i ) . In longitudinal studies, ( y i , X i ) would be the responsesand clinical covariates of subject i at different time points. In genetic studies, ( y i , X i ) wouldbe the trait values and covariates of individuals within family i . GEE captures the within-cluster correlation by specifying the first two moments of the conditional distribution of y i given X i , namely µ ij = E ( y ij | x ij ) and σ ij = Var ( y ij | x ij ) . If one assumes that y ij follows anexponential family with canonical link, then µ ij ( β ) = µ ( θ ij ) and σ ij ( β ) = φµ (cid:48) ( θ ij ) , i = 1 , . . . , n, j = 1 , . . . 
, n i , where µ ( t ) is a differentiable canonical link function, µ (cid:48) ( t ) is its first derivative, θ ij = x Tij β is the linear systematic part of y ij associated with the covariates, φ is an over-dispersionparameter, and β is the vector of fixed effects.The GEE estimator of β solves the equation n (cid:88) i =1 d µ i ( β ) T V − i ( α , β )[ y i − µ i ( β )] = p , where y i = ( y i , . . . , y in i ) T , µ i ( β ) = [ µ i ( β ) , . . . , µ in i ( β )] T , V i = cov ( y i ) is the working co-variance matrix of the i -th subject, and d µ i ( β ) is the differential of µ i ( β ) . In longitudinalstudies, V i is often parameterized as V i ( α , β ) = A / i ( β ) R i ( α ) A / i ( β ) , where A / ( β ) isa diagonal matrix with standard deviations σ ij along its diagonal, and R ( α ) is a corre-lation matrix with parameters α . This parameterization is too restrictive in many otherapplications. For instance, in genetic studies, it is critical to dissect the variance into dif-ferent sources such as additive, dominance, and household environment (Lange, 2002). Thissuggests the variance component parameterization R i ( σ ) = σ a Φ i + σ d ∆ ,i + σ h H i + σ e I n i , σ a + σ d + σ h + σ e = 1 , where in the i -th family Φ i is the theoretical kinship matrix, ∆ ,i is the dominance variancematrix, and H i is the household indicator matrix. The matrices Φ i , ∆ ,i , and H i arecorrelation matrices, and the simplex constraint ensures R i is as well. In general, the variance28omponent parameterization R i ( σ ) = (cid:80) mj =1 σ j R ij with the simplex constraint in force isreasonable. In this setting the GEE update of β given σ solves the equation n (cid:88) i =1 X Ti A − / i ( β ) R − i ( σ ) A − / ( β )[ y i − µ i ( β )] = p . This is just the classical GEE update. The difficulty lies in updating σ given β . We proposeminimizing the sum (cid:80) ni =1 ψ ( R i ) , where ψ ( t ) is a scalar convex loss function. Example lossfunctions include the Mahalanobis criterion n (cid:88) i =1 ψ ( R i ) = n (cid:88) i =1 ( y i − (cid:98) µ i ) T (cid:98) A − / i R − i ( σ ) (cid:98) A − / i ( y i − (cid:98) µ i ) and the sum of squared Frobenius distances n (cid:88) i =1 ψ ( R i ) = n (cid:88) i =1 tr (cid:104) ( y i − (cid:98) µ i )( y i − (cid:98) µ i ) T − (cid:98) A / i R i ( σ ) (cid:98) A / i (cid:105) . The convexity of ψ ( t ) entails a minorization similar to the minorization (5). Minimizingthe surrogate then yields the MM update σ t +1) j = (cid:80) ni =1 ( y i − (cid:98) µ i ) T (cid:98) A − / i R − i ( σ t ) ) R ij R − i ( σ t ) ) (cid:98) A − / i ( y i − (cid:98) µ i ) (cid:80) mj =1 (cid:80) ni =1 ( y i − (cid:98) µ i ) T (cid:98) A − / i R − i ( σ t ) ) R ij R − i ( σ t ) ) (cid:98) A − / i ( y i − (cid:98) µ i ) . Under ψ ( t ) the MM update boils down to projection onto the simplex. Further explorationof these ideas probably deserves another paper and will be omitted here for the sake ofbrevity. Quantitative trait loci (QTL) mapping aims to identify genes associated with a quantitativetrait. Current sequencing technology measures millions of genetic markers in study subjects.Traditional single-marker tests suffer from low power due to the low frequency of manymarkers and the corrections needed for multiple hypothesis testing. 
QTL mapping

Quantitative trait loci (QTL) mapping aims to identify genes associated with a quantitative trait. Current sequencing technology measures millions of genetic markers in study subjects. Traditional single-marker tests suffer from low power due to the low frequency of many markers and the corrections needed for multiple hypothesis testing. Region-based association tests are a powerful alternative for analyzing next-generation sequencing data with abundant rare variants.

Suppose $y$ is an $n \times 1$ vector of quantitative trait measurements on $n$ people, $X$ is an $n \times p$ predictor matrix (incorporating predictors such as sex, smoking history, and principal components for ethnic admixture), and $G$ is an $n \times m$ genotype matrix of $m$ genetic variants in a pre-defined region. The linear mixed model assumes
$$Y = X\beta + G\gamma + \epsilon, \qquad \gamma \sim N(0, \sigma_g^2 I_m), \qquad \epsilon \sim N(0, \sigma_e^2 I_n),$$
where $\beta$ are fixed effects, $\gamma$ are random genetic effects, and $\sigma_g^2$ and $\sigma_e^2$ are variance components for the genetic and environmental effects, respectively. Thus the phenotype vector $Y$ has covariance $\sigma_g^2 GG^T + \sigma_e^2 I_n$, where $GG^T$ is the kernel matrix capturing the overall effect of the $m$ variants. Current approaches test the null hypothesis $\sigma_g^2 = 0$ for each region separately and then adjust for multiple testing (Lee et al., 2014). We instead consider the joint model
$$y = X\beta + s_1^{-1/2} G_1 \gamma_1 + \cdots + s_m^{-1/2} G_m \gamma_m + \epsilon, \qquad \gamma_i \sim N(0, \sigma_i^2 I), \qquad \epsilon \sim N(0, \sigma_e^2 I_n),$$
and select the variance components $\sigma_i^2$ via the penalization (19). Here $s_i$ is the number of variants in region $i$, and the weights $s_i^{-1/2}$ put all variance components on the same scale.

We illustrate this approach using the COPDGene exome sequencing study (Regan et al., 2010). After quality control, 399 individuals and 646,125 genetic variants remain for analysis. Genetic variants are grouped into 16,619 genes in order to expose the genes associated with the complex trait height. We include age, sex, and the top 3 principal components in the mean effects. Because the number of genes vastly exceeds the sample size $n = 399$, we first pare the 16,619 genes down to 200 genes according to their marginal likelihood ratio test p-values and then carry out penalized estimation of the 200 variance components in the joint model (19). This is similar to the sure independence screening strategy for selecting mean effects (Fan and Lv, 2008). Genes are ranked according to the order in which they enter the lasso solution path. Table 4 lists the top 10 genes together with their marginal LRT p-values, and Figure 2 displays the corresponding segment of the lasso solution path. It is noteworthy that the ranking of genes by penalized estimation differs from the ranking according to marginal p-values; the same phenomenon occurs in the selection of highly correlated mean predictors. This penalization approach for selecting variance components warrants further theoretical study. It is reassuring that the simple MM algorithm scales to such high-dimensional problems.
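Before turning to the results in Table 4, we note that the kernels of the joint model are trivial to assemble. The short sketch below, with helper names of our own choosing, builds $V_i = s_i^{-1} G_i G_i^T$ for each region plus an identity kernel for the environmental component; the resulting list can then be handed to the penalized MM updates sketched earlier.

```python
import numpy as np

def region_kernels(genotype_blocks):
    """Build the kernels of the joint region-based model.

    genotype_blocks: list of (n, s_i) genotype matrices G_i, one per gene/region.
    Returns [G_1 G_1^T / s_1, ..., G_m G_m^T / s_m, I_n]; the trailing identity
    matrix carries the environmental variance component sigma_e^2.
    """
    n = genotype_blocks[0].shape[0]
    kernels = [G @ G.T / G.shape[1] for G in genotype_blocks]  # s_i^{-1} G_i G_i^T
    kernels.append(np.eye(n))                                  # environmental component
    return kernels
```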
Table 4: Top 10 genes selected by the lasso penalized variance component model (19) in an association study of 200 genes and the complex trait height. The table lists each gene's lasso rank and marginal LRT p-value; the selected genes include C9orf21, PLS1, ATP5D, ADCY4, SLC22A25, RCSD1, PCDH7, and AVIL.

Figure 2: Solution path of the lasso penalized variance component model (19) in an association study of 200 genes and the complex trait height.

Discussion

The current paper leverages the MM principle to design powerful and versatile algorithms for variance components estimation. The MM algorithms derived are notable for their simplicity, generality, numerical efficiency, and theoretical guarantees. Both ordinary MLE and REML are apt to benefit. Other extensions are possible. In nonlinear models (Bates and Watts, 1988; Lindstrom and Bates, 1990), the mean response is a nonlinear function of the fixed effects $\beta$. One can easily modify the MM algorithms to update $\beta$ by a few rounds of Gauss-Newton iteration; the variance components updates remain unchanged.

One can also extend our MM algorithms to elliptically symmetric densities
$$f(y) = \frac{e^{-\kappa(\delta^2)}}{(2\pi)^{n/2} (\det \Omega)^{1/2}}$$
defined for $y \in \mathbb{R}^n$, where $\delta^2 = (y - \mu)^T \Omega^{-1} (y - \mu)$ denotes the squared Mahalanobis distance between $y$ and $\mu$. Here we assume that the function $\kappa(s)$ is strictly increasing and strictly concave. Examples of elliptically symmetric densities include the multivariate $t$, slash, contaminated normal, power exponential, and stable families. Previous work (Huber and Ronchetti, 2009; Lange and Sinsheimer, 1993) has focused on using the MM principle to convert parameter estimation for these robust families into parameter estimation under the multivariate normal. One can chain the relevant majorization $\kappa(s) \le \kappa(s^{(t)}) + \kappa'(s^{(t)})(s - s^{(t)})$ with our previous minorizations, simultaneously splitting variance components and passing to the more benign setting of the multivariate normal.
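As a small illustration of the $\kappa$-majorization, consider the multivariate $t$ family, for which $\kappa(s) = \tfrac{\nu+n}{2}\log(1 + s/\nu)$. The tangent-line majorization replaces the data term $-\kappa(\delta^2)$ by $-\kappa'(\delta^{2(t)})\,\delta^2$ plus a constant, that is, a normal log-likelihood whose quadratic form carries the weight $w^{(t)} = 2\kappa'(\delta^{2(t)}) = (\nu+n)/(\nu+\delta^{2(t)})$. The sketch below, with names of our choosing, computes this weight; the variance component minorizations then apply to the reweighted normal problem.

```python
import numpy as np

def t_case_weight(resid, Omega, nu):
    """Weight produced by the tangent-line majorization of kappa for the
    multivariate t density with nu degrees of freedom.

    resid: (n,) residual y - mu,  Omega: (n, n) covariance,  nu: degrees of freedom.
    Returns w = (nu + n) / (nu + delta2), where delta2 is the squared
    Mahalanobis distance at the current iterate.
    """
    n = resid.shape[0]
    delta2 = resid @ np.linalg.solve(Omega, resid)  # (y - mu)^T Omega^{-1} (y - mu)
    return (nu + n) / (nu + delta2)
```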
Acknowledgments

The research is partially supported by NSF grant DMS-1055210 and NIH grants R01 HG006139, R01 GM53275, and R01 GM105785. The authors thank Michael Cho, Dandi Qiao, and Edwin Silverman for their assistance in processing and assessing COPDGene exome sequencing data. COPDGene is supported by NIH R01 HL089897 and R01 HL089856.
References
Bates, D. and Pinheiro, J. (1998). Computational methods for multilevel models. Technical Report Technical Memorandum BL0112140-980226-01TM, Bell Labs, Lucent Technologies, Murray Hill, NJ.

Bates, D. M. and Watts, D. G. (1988). Nonlinear Regression Analysis and Its Applications. Wiley Series in Probability and Mathematical Statistics: Applied Probability and Statistics. John Wiley & Sons, Inc., New York.

Borg, I. and Groenen, P. J. (2005). Modern Multidimensional Scaling: Theory and Applications. Springer Science & Business Media.

Boyd, S. and Vandenberghe, L. (2004). Convex Optimization. Cambridge University Press, Cambridge.

Callanan, T. P. and Harville, D. A. (1991). Some new algorithms for computing restricted maximum likelihood estimates of variance components. J. Statist. Comput. Simulation, 38(1-4):239–259.

De Leeuw, J. (1994). Block-relaxation algorithms in statistics. In Information Systems and Data Analysis, pages 308–324. Springer.

Demidenko, E. (2013). Mixed Models. Wiley Series in Probability and Statistics. John Wiley & Sons, Inc., Hoboken, NJ, second edition. Theory and applications with R.

Demidenko, E. and Massam, H. (1999). On the existence of the maximum likelihood estimate in variance components models. Sankhyā Ser. A, 61(3):431–443.

Dempster, A., Laird, N., and Rubin, D. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39(1):1–38.

Fan, J. and Lv, J. (2008). Sure independence screening for ultrahigh dimensional feature space (with discussion). J. R. Statist. Soc. B, 70:849–911.

Fitzmaurice, G. M., Laird, N. M., and Ware, J. H. (2011). Applied Longitudinal Analysis. Wiley Series in Probability and Statistics. John Wiley & Sons, Inc., Hoboken, NJ, second edition.

Golub, G. H. and Van Loan, C. F. (1996). Matrix Computations. Johns Hopkins Studies in the Mathematical Sciences. Johns Hopkins University Press, Baltimore, MD, third edition.

Grządziel, M. and Michalski, A. (2014). A note on the existence of the maximum likelihood estimate in variance components models. Discuss. Math. Probab. Stat., 34(1-2):159–167.

Gupta, A. and Nagar, D. (1999). Matrix Variate Distributions. Monographs and Surveys in Pure and Applied Mathematics. Taylor & Francis.

Hartley, H. O. and Rao, J. N. K. (1967). Maximum-likelihood estimation for the mixed analysis of variance model. Biometrika, 54:93–108.

Harville, D. and Callanan, T. (1990). Computational aspects of likelihood-based inference for variance components. In Gianola, D. and Hammond, K., editors, Advances in Statistical Methods for Genetic Improvement of Livestock, volume 18 of Advanced Series in Agricultural Sciences, pages 136–176. Springer Berlin Heidelberg.

Harville, D. A. (1977). Maximum likelihood approaches to variance component estimation and to related problems. J. Amer. Statist. Assoc., 72(358):320–340. With a comment by J. N. K. Rao and a reply by the author.

Heiser, W. J. (1995). Convergent computation by iterative majorization: theory and applications in multidimensional data analysis. Recent Advances in Descriptive Multivariate Analysis, pages 157–189.

Horn, R. A. and Johnson, C. R. (1985). Matrix Analysis. Cambridge University Press, Cambridge.

Huber, P. J. and Ronchetti, E. M. (2009). Robust Statistics. Wiley Series in Probability and Statistics. John Wiley & Sons, Inc., Hoboken, NJ, second edition.

Hunter, D. R. (2004). MM algorithms for generalized Bradley-Terry models. Ann. Statist., 32(1):384–406.

Hunter, D. R. and Lange, K. (2004). A tutorial on MM algorithms. Amer. Statist., 58(1):30–37.

Hunter, D. R. and Li, R. (2005). Variable selection using MM algorithms. Ann. Statist., 33(4):1617–1642.

Jeon, M. (2012). Estimation of Complex Generalized Linear Mixed Models for Measurement and Growth. PhD thesis, University of California, Berkeley.

Khuri, A. I., Mathew, T., and Sinha, B. K. (1998). Statistical Tests for Mixed Linear Models. Wiley Series in Probability and Statistics: Applied Probability and Statistics. John Wiley & Sons, Inc., New York. A Wiley-Interscience Publication.

Kiers, H. A. (2002). Setting up alternating least squares and iterative majorization algorithms for solving various matrix optimization problems. Computational Statistics & Data Analysis, 41(1):157–170.

Laird, N., Lange, N., and Stram, D. (1987). Maximum likelihood computations with repeated measures: application of the EM algorithm. J. Amer. Statist. Assoc., 82(397):97–105.

Laird, N. M. and Ware, J. H. (1982). Random-effects models for longitudinal data. Biometrics, 38(4):963–974.

Lange, K. (2002). Mathematical and Statistical Methods for Genetic Analysis. Statistics for Biology and Health. Springer-Verlag, New York, second edition.

Lange, K. (2010). Numerical Analysis for Statisticians. Statistics and Computing. Springer, New York, second edition.

Lange, K., Hunter, D. R., and Yang, I. (2000). Optimization transfer using surrogate objective functions. J. Comput. Graph. Statist., 9(1):1–59. With discussion, and a rejoinder by Hunter and Lange.

Lange, K., Papp, J., Sinsheimer, J., Sripracha, R., Zhou, H., and Sobel, E. (2013). Mendel: the Swiss army knife of genetic analysis programs. Bioinformatics, 29:1568–1570.

Lange, K., Sinsheimer, J., and Sobel, E. (2005). Association testing with Mendel. Genetic Epidemiology, 29:36–50.

Lange, K. and Sinsheimer, J. S. (1993). Normal/independent distributions and their applications in robust regression. Journal of Computational and Graphical Statistics, 2:175–198.

Lange, K. and Zhou, H. (2014). MM algorithms for geometric and signomial programming. Mathematical Programming Series A, 143:339–356.

Lee, S., Abecasis, G., Boehnke, M., and Lin, X. (2014). Rare-variant association analysis: study designs and statistical tests. The American Journal of Human Genetics, 95(1):5–23.

Liang, K. and Zeger, S. (1986). Longitudinal data analysis using generalized linear models. Biometrika, 73:13–22.

Lindstrom, M. J. and Bates, D. M. (1988). Newton-Raphson and EM algorithms for linear mixed-effects models for repeated-measures data. J. Amer. Statist. Assoc., 83(404):1014–1022.

Lindstrom, M. J. and Bates, D. M. (1990). Nonlinear mixed effects models for repeated measures data. Biometrics, 46(3):673–687.

McLachlan, G. J. and Krishnan, T. (2008). The EM Algorithm and Extensions. Wiley Series in Probability and Statistics. Wiley-Interscience [John Wiley & Sons], Hoboken, NJ, second edition.

Meng, X.-L. and Rubin, D. B. (1991). Using EM to obtain asymptotic variance-covariance matrices: the SEM algorithm. Journal of the American Statistical Association, 86(416):899–909.

Pinheiro, J. and Bates, D. (1996). Unconstrained parametrizations for variance-covariance matrices. Statistics and Computing, 6(3):289–296.

Rao, C. R. (1973). Linear Statistical Inference and its Applications, 2nd ed. John Wiley & Sons.

Rao, C. R. and Kleffe, J. (1988). Estimation of Variance Components and Applications, volume 3 of North-Holland Series in Statistics and Probability. North-Holland Publishing Co., Amsterdam.

Rao, P. S. R. S. (1997). Variance Components Estimation, volume 78 of Monographs on Statistics and Applied Probability. Chapman & Hall, London. Mixed models, methodologies and applications.

Regan, E. A., Hokanson, J. E., Murphy, J. R., Make, B., Lynch, D. A., Beaty, T. H., Curran-Everett, D., Silverman, E. K., and Crapo, J. D. (2010). Genetic epidemiology of COPD (COPDGene) study design. COPD, 7:32–43.

Schafer, J. L. and Yucel, R. M. (2002). Computational strategies for multivariate linear mixed-effects models with missing values. J. Comput. Graph. Statist., 11(2):437–457.

Schelldorfer, J., Bühlmann, P., and van de Geer, S. (2011). Estimation for high-dimensional linear mixed-effects models using ℓ1-penalization. Scand. J. Stat., 38(2):197–214.

Schur, J. (1911). Bemerkungen zur Theorie der beschränkten Bilinearformen mit unendlich vielen Veränderlichen. J. Reine Angew. Math., 140:1–28.

Searle, S. R., Casella, G., and McCulloch, C. E. (1992). Variance Components. Wiley Series in Probability and Mathematical Statistics: Applied Probability and Statistics. John Wiley & Sons, Inc., New York. A Wiley-Interscience Publication.

Varadhan, R. and Roland, C. (2008). Simple and globally convergent methods for accelerating the convergence of any EM algorithm. Scand. J. Statist., 35(2):335–353.

Verbeke, G. and Molenberghs, G. (2000). Linear Mixed Models for Longitudinal Data. Springer Series in Statistics. Springer-Verlag, New York.

Weiss, R. E. (2005). Modeling Longitudinal Data. Springer Texts in Statistics. Springer, New York.

Wu, T. T. and Lange, K. (2010). The MM alternative to EM. Statistical Science, 25:492–505.

Yu, Y. (2010). Monotonic convergence of a general algorithm for computing optimal designs. Ann. Statist., 38(3):1593–1606.

Zhou, H., Alexander, D., and Lange, K. (2011). A quasi-Newton acceleration for high-dimensional optimization algorithms. Statistics and Computing, 21:261–273.

Zhou, H. and Lange, K. (2010). MM algorithms for some discrete multivariate distributions.