# Fisher Scoring for crossed factor Linear Mixed Models


Thomas Maullin-Sapey · Thomas E. Nichols


## Abstract

The analysis of longitudinal, heterogeneous or unbalanced clustered data is of primary importance to a wide range of applications. The Linear Mixed Model (LMM) is a popular and flexible extension of the linear model specifically designed for such purposes. Historically, a large proportion of material published on the LMM concerns the application of popular numerical optimization algorithms, such as Newton-Raphson, Fisher Scoring and Expectation Maximization, to single-factor LMMs (i.e. LMMs that only contain one "factor" by which observations are grouped). However, in recent years, the focus of the LMM literature has moved towards the development of estimation and inference methods for more complex, multi-factored designs. In this paper, we present and derive new expressions for the extension of an algorithm classically used for single-factor LMM parameter estimation, Fisher Scoring, to multiple, crossed-factor designs. Through simulation and real data examples, we compare five variants of the Fisher Scoring algorithm with one another, as well as against a baseline established by the R package lmer, and find evidence of correctness and strong computational efficiency for four of the five proposed approaches. Additionally, we provide a new method for LMM Satterthwaite degrees of freedom estimation based on analytical results, which does not require iterative gradient estimation. Via simulation, we find that this approach produces estimates with both lower bias and lower variance than the existing methods.

Thomas Maullin-Sapey · Big Data Institute, Li Ka Shing Centre for Health Information and Discovery, Old Road Campus, Oxford, OX3 7LF, UK · E-mail: [email protected] · ORCID ID: 0000-0002-1890-330X

Thomas E. Nichols · Big Data Institute, Li Ka Shing Centre for Health Information and Discovery, Old Road Campus, Oxford, OX3 7LF, UK · E-mail: [email protected] · ORCID ID: 0000-0002-4516-5103

## Keywords

Fisher Scoring · Linear Mixed Model · Crossed Factors


The complex nature of LMM computation has partly arisen from the gradual expansion of the definition of "Linear Mixed Model". Previously, the term "Linear Mixed Model" was primarily used to refer to an extension of the linear regression model which contains random effects grouped by a single random factor. Examples of this definition can be seen in Laird and Ware (1982), in which "Linear Mixed Model" refers only to "single-factor" longitudinal models, and in Lindstrom and Bates (1988), where more complex, multi-factor models are described as an "extension" of the "Linear Mixed Model". Consequently, throughout the late 1970s and 1980s, one of the main focuses of the LMM literature was to provide parameter estimation methods, such as Fisher Scoring, Newton-Raphson and Expectation Maximization, for the single-factor LMM (e.g. Dempster et al. (1977), Jennrich and Schluchter (1986) and Laird et al. (1987)). By exploiting structural features of the single-factor model, implementation of these methods required only conceptually simple mathematical operations. For instance, the Fisher Scoring algorithm proposed by Jennrich and Schluchter (1986) relies only upon vector addition and matrix multiplication, inversion, and reshaping operations.

In recent times, usage of the term "Linear Mixed Models" has grown substantially to include models which contain random effects grouped by multiple random factors. Examples of this more general definition are found in Pinheiro and Bates (2009) and Tibaldi et al. (2007). Under this more general "multi-factor" LMM definition, models can be described as exhibiting either a hierarchical factor structure (i.e. factor groupings are nested inside one another) or a crossed factor structure (i.e. factor groupings are not nested inside one another).
For instance, a study involving students grouped by the factors "school" and "class" contains hierarchical factors (as every class belongs to a specific school). In contrast, a study that involves observations grouped by "subject" and "location", where subjects change location between recorded measurements, contains crossed factors (as each subject is not generally associated with a specific location, or vice versa). In either case, due to the more complex nature of the multi-factor LMM, the computational approaches used in the single-factor setting cannot be directly applied to the multi-factor model. For this reason, parameter estimation of the multi-factor LMM has often been viewed as a much more "difficult" problem than its single-factor counterpart (see, for example, the discussion in chapters 2 and 8 of West et al. (2014)).

Several authors have proposed and implemented methods for multi-factor LMM estimation. However, such methods typically require conceptually complex mathematical operations, which are not naturally amenable to vectorized computation, or restrictive simplifying model assumptions, which prevent some designs from being estimated. For example, the popular R package lmer performs LMM estimation via minimization of a penalized least squares cost function based on a variant of Henderson's mixed model equations (Henderson et al. (1959), Bates et al. (2015)). However, sparse matrix methods are required to achieve fast evaluation of the cost function, and advanced numerical approximation methods are needed for optimization (e.g. the Bound Optimization BY Quadratic Approximation, BOBYQA, algorithm; Powell (2009)). The commonly used SAS and SPSS packages, PROC-MIXED and MIXED, employ a Newton-Raphson algorithm proposed by Wolfinger, Tobias, and Sall (1994). In this approach, though, derivatives and Hessians must be computed using the sweep operator, W-transformation, and Cholesky decomposition updating methodology (SAS Institute Inc. (2015), IBM Corp (2015)).
The Hierarchical Linear Models (HLM) package takes an alternative option and restricts its inputs to only LMMs which contain hierarchical factors (Raudenbush and Bryk (2002)); although, the Hierarchical Cross-classified Models (HCM) sub-module does make allowances for specific use-case crossed LMMs. To perform parameter estimation, HLM employs a range of different methods, each tailored to a particular model design. As a result, there are many models for which HLM does not provide support.

Methods have also been proposed for evaluating the score vector (the derivative of the log-likelihood function) and the Fisher Information matrix (the expected Hessian of the negative log-likelihood function) required to perform Fisher Scoring for the multi-factor LMM. For example, alongside the Newton-Raphson approach adopted by PROC-MIXED and MIXED, Wolfinger et al. (1994) also describe a Fisher Scoring algorithm. However, evaluation of the expressions they provide requires use of the sweep operator, W-transformation, and Cholesky decomposition updating methodology; operations for which widespread vectorized support does not yet exist. More recently, expressions for score vectors and Fisher Information matrices were provided by Zhu and Wathen (2018). However, the approach Zhu and Wathen (2018) adopt to derive these expressions produces an algorithm that requires independent computation for each variance parameter in the LMM. This algorithm's serialized nature results in substantial overheads in terms of computation time, thus limiting the method's utility in practical situations where time-efficiency is a crucial consideration.

To our knowledge, no approach has yet been provided for multi-factor LMM parameter estimation which utilizes only simple, universally available operations that are naturally amenable to vectorized computation.
In this work we revisit a Fisher Scoring approach suggested for the single-factor LMM, described in Demidenko (2013), extending it to the multi-factor setting, with the intention of revisiting the motivating example of the mass-univariate model in later work.

The novel contribution of this work is to provide new derivations and closed-form expressions for the score vector and Fisher Information matrix of the multi-factor LMM. We show how these expressions can be employed for LMM parameter estimation via the Fisher Scoring algorithm and can be further adapted for constrained optimization of the random effects covariance matrix. Additionally, we demonstrate that such closed-form expressions have use beyond the scope of the Fisher Scoring algorithm. An example of this is provided in the setting of mixed model inference, where the degrees of freedom of the approximate T-statistic are not known and must be estimated (Verbeke and Molenberghs (2001)). We show how our methods may be combined with the Satterthwaite-based method for approximating degrees of freedom for the LMM by using an approach based on the work of Kuznetsova et al. (2017).

In this paper, we first propose five variants of the Fisher Scoring algorithm. Following this, we provide a discussion of initial starting values for the algorithm, approaches to ensuring positive definiteness of the random effects covariance matrix, and methods for improving the algorithm's computational efficiency during implementation. Detail on constrained optimization, allowing for structural assumptions to be placed on the random effects covariance matrix, is then provided. Following this, new expressions for Satterthwaite estimation of the degrees of freedom of the approximate T-statistic for the multi-factor LMM are given.
Finally, we verify the correctness of the proposed algorithms and degrees of freedom estimates via simulation and real data examples, benchmarking the performance against that of the R package lmer.

## 1.2 Preliminaries

In this section, we introduce preliminary statistical concepts and notation which will be employed throughout the remainder of this work. In Section 1.2.1, a formal definition of the LMM is provided and in Section 1.2.2, additional notation is detailed.

### 1.2.1 The Linear Mixed Model

Both the single-factor and multi-factor LMM, for $n$ observations, take the following form:

$$Y = X\beta + Zb + \varepsilon, \quad (1)$$

$$\varepsilon \sim N(0, \sigma^2 I_n), \quad b \sim N(0, \sigma^2 D), \quad (2)$$

where the known quantities are:

- $Y$: the $(n \times 1)$ vector of responses.
- $X$: the $(n \times p)$ fixed effects design matrix.
- $Z$: the $(n \times q)$ random effects design matrix.

And the unknown parameters are:

- $\beta$: the $(p \times 1)$ fixed effects parameter vector.
- $\sigma^2$: the scalar fixed effects variance.
- $D$: the $(q \times q)$ random effects covariance matrix.

From (1) and (2), the marginal distribution of the response vector $Y$ is:

$$Y \sim N(X\beta, \sigma^2(I_n + ZDZ')).$$

The log-likelihood for the LMM specified by (1) and (2) is derived from the marginal distribution of $Y$. Dropping constant terms, this log-likelihood is given by:

$$l(\theta) = -\frac{1}{2}\left\{ n\log(\sigma^2) + \sigma^{-2}e'V^{-1}e + \log|V| \right\}, \quad (3)$$

where $\theta$ is shorthand for all the parameters $(\beta, \sigma^2, D)$, $V = I_n + ZDZ'$ and $e = Y - X\beta$. Throughout the main body of this work, we shall consider parameter estimation performed via Maximum Likelihood (ML) estimation of (3). However, we note that the approaches we describe for ML estimation can also be easily adapted to use a Restricted Maximum Likelihood (ReML) criterion. Further detail on ReML estimation is provided in Appendix 6.4.

The distinction between the multi-factor and single-factor LMM lies in the specification of the random effects in the model. Random effects are often described in terms of factors, categorical variables that group the random effects, and levels, individual instances of such a categorical variable. At this point, we highlight that the term "factor", in this work, refers only to categorical variables which group random effects and does not refer to groupings of fixed effects.
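As a concrete illustration, the log-likelihood (3) can be evaluated directly with standard dense linear algebra. The following is a minimal numpy sketch; the function name and toy dimensions are our own, and no attempt is made at the efficiency considerations discussed later in the paper:

```python
import numpy as np

def lmm_loglik(beta, sigma2, D, X, Z, Y):
    """Evaluate the LMM log-likelihood (3), constant terms dropped.

    Uses V = I_n + Z D Z' and e = Y - X beta.
    """
    n = len(Y)
    V = np.eye(n) + Z @ D @ Z.T
    e = Y - X @ beta
    _, logdetV = np.linalg.slogdet(V)       # log|V|, computed stably
    quad = e @ np.linalg.solve(V, e)        # e' V^{-1} e without forming V^{-1}
    return -0.5 * (n * np.log(sigma2) + quad / sigma2 + logdetV)

# Toy example: n = 12 observations, p = 2 fixed effects, q = 3 random effects.
rng = np.random.default_rng(0)
X = rng.standard_normal((12, 2))
Z = rng.standard_normal((12, 3))
A = rng.standard_normal((3, 3))
D = A @ A.T + np.eye(3)                     # a valid (positive definite) D
beta = np.array([1.0, -0.5])
Y = X @ beta + rng.standard_normal(12)
ll = lmm_loglik(beta, 1.0, D, X, Z, Y)
```

Up to an additive constant, this agrees with the log-density of $N(X\beta, \sigma^2 V)$ evaluated at $Y$, which provides a simple correctness check.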
We denote the total number of factors in the model as $r$ and denote the $k$th factor in the model as $f_k$ for $k \in \{1, \ldots, r\}$. For a given factor $f_k$, $l_k$ will be used to denote the number of levels possessed by $f_k$, and $q_k$ the number of random effects which $f_k$ groups. The single-factor LMM corresponds to the case $r = 1$, while the multi-factor setting corresponds to the case $r > 1$. Returning to the earlier "subject" and "location" example, the factor $f_1$ is "subject" and the factor $f_2$ is "location"; therefore, $r = 2$. The number of subjects is $l_1$ and the number of locations is $l_2$. The number of covariates grouped by the first factor is $q_1$, and the number grouped by the second factor, $q_2$, is 1 (i.e. the random intercept).

The values of $r$, $\{q_k\}_{k\in\{1,\ldots,r\}}$ and $\{l_k\}_{k\in\{1,\ldots,r\}}$ determine the structure of the random effects design matrix, $Z$, and random effects covariance matrix, $D$. To specify $Z$ formally is notationally cumbersome and of little relevance to the aims of this work. For this reason, the reader is referred to the work of Bates et al. (2015) for further detail on the construction of $Z$. Here it will be sufficient to note that, under the assumption that its columns are appropriately ordered, $Z$ will be comprised of $r$ horizontally concatenated blocks. The $k$th block of $Z$, denoted $Z_{(k)}$, has dimension $(n \times l_k q_k)$ and describes the random effects which are grouped by the $k$th factor. Additionally, each block, $Z_{(k)}$, can be further partitioned column-wise into $l_k$ blocks of dimension $(n \times q_k)$. The $j$th block of $Z_{(k)}$, denoted $Z_{(k,j)}$, corresponds to the random effects which belong to the $j$th level of the $k$th factor. In summary,

$$Z = [Z_{(1)}, Z_{(2)}, \ldots, Z_{(r)}], \quad Z_{(k)} = [Z_{(k,1)}, Z_{(k,2)}, \ldots, Z_{(k,l_k)}] \quad (\text{for } k \in \{1, \ldots, r\}).$$

An important property of the matrix $Z$ is that, for arbitrary $k \leq r$, $Z_{(k,i)}$ and $Z_{(k,j)}$ satisfy the below condition for all $i \neq j$:

If row $w$ of $Z_{(k,i)}$ contains any non-zero values, then row $w$ of $Z_{(k,j)}$ contains only zero values. $\quad (4)$

A consequence of the above property is that, for any arbitrary factor $f_k$, the rows of $Z_{(k)}$ can be permuted in order to obtain a block diagonal matrix. As, for the single-factor LMM, $Z \equiv Z_{(1)}$, it follows that the observations of the single-factor LMM can be arranged such that $Z$ is block diagonal. This feature of the single-factor LMM simplifies the derivation of the Fisher Information matrix and score vector required for Fisher Scoring.
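To make the block structure concrete, the sketch below builds $Z$ for a toy crossed design with two factors, each contributing only a random intercept (so $q_1 = q_2 = 1$), and checks property (4) within a factor. The helper and data layout are illustrative only; this is not the construction of Bates et al. (2015):

```python
import numpy as np

def intercept_blocks(levels, n_levels):
    """Return the blocks {Z_(k,j)}: Z_(k,j)[i, 0] = 1 iff observation i is at level j."""
    return [np.array(levels == j, dtype=float).reshape(-1, 1)
            for j in range(n_levels)]

# Toy crossed design: 6 observations, factor 1 = "subject" (3 levels),
# factor 2 = "location" (2 levels); subjects change location between measurements.
subject = np.array([0, 0, 1, 1, 2, 2])
location = np.array([0, 1, 0, 1, 0, 1])

Z1_blocks = intercept_blocks(subject, 3)     # {Z_(1,j)}, each (n x q_1)
Z2_blocks = intercept_blocks(location, 2)    # {Z_(2,j)}, each (n x q_2)

# Z = [Z_(1), Z_(2)] with Z_(k) = [Z_(k,1), ..., Z_(k,l_k)]
Z = np.hstack(Z1_blocks + Z2_blocks)

# Property (4): within a factor, any row that is non-zero in Z_(k,i)
# must be all-zero in Z_(k,j) for i != j.
for Zi in Z1_blocks:
    for Zj in Z1_blocks:
        if Zi is not Zj:
            assert not np.any((Zi != 0) & (Zj != 0))
```

Note that the same disjointness does not hold *across* the two factors, which is precisely why the rows of the full crossed-design $Z$ cannot, in general, be permuted into block diagonal form.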
However, this simplification cannot be generalized to the multi-factor LMM. In general, it is not true that the rows of $Z$ can be permuted in such a way that the resultant matrix is block-diagonal. As emphasized in Section 1.1, due to this, many of the results derived in the single-factor LMM have not been generalized to the multi-factor setting.

To describe the random effects covariance matrix, $D$, it is assumed that factors are independent from one another and that, for each factor $f_k$, there is a $(q_k \times q_k)$ unknown covariance matrix, $D_k$, representing the "within-factor" covariance for the random effects grouped by $f_k$. The random effects covariance matrix, $D$, appearing in (2), can now be given as:

$$D = \bigoplus_{k=1}^{r} (I_{l_k} \otimes D_k),$$

where $\bigoplus$ represents the direct sum, and $\otimes$ the Kronecker product. Note that, whilst $D$ is large in dimension (having dimension $(q \times q)$, where $q = \sum_k l_k q_k$), $D$ contains only $q_u = \sum_k q_k(q_k+1)/2$ unique elements. Typically, $q_u \ll q$. As a result, the task of describing the random effects covariance matrix $D$ reduces in practice to specifying only a small number of parameters.

### 1.2.2 Notation

In this section, we introduce notation which will be used throughout the remainder of this work. The conventional hat operator notation is adopted to denote estimators resulting from likelihood maximisation procedures. For example, $\beta$ represents the true (unknown) fixed effects parameter vector for the LMM, whilst the maximum likelihood estimate of $\beta$ is denoted $\hat{\beta}$. The conventional subscript notation, $A_{[X,Y]}$, is used to denote the sub-matrix of matrix $A$, composed of all elements of $A$ with row indices $x \in X$ and column indices $y \in Y$. The replacement of $X$ or $Y$ with a colon, $:$, represents all rows or all columns of $A$, respectively. If a scalar, $x$ or $y$, replaces $X$ or $Y$, this represents the elements with row index $x = X$ or column index $y = Y$, respectively.
The notation $(k)$ may also replace $X$ and $Y$, where $(k)$ represents the indices of the columns of $Z$ which correspond to factor $f_k$. Similarly, $X$ and $Y$ may be substituted for the ordered pair $(k,j)$, where $(k,j)$ represents the indices of the columns of $Z$ which correspond to level $j$ of factor $f_k$. We highlight again our earlier notation, $Z_{(k,j)}$, which, due to its frequent occurrence, acts as a shorthand for $Z_{[:,(k,j)]}$, i.e. the columns of $Z$ corresponding to level $j$ of factor $f_k$.

Finally, we shall also adopt the notations 'vec', 'vech', $N_k$, $K_{m,n}$, $D_k$ and $L_k$ as used in Magnus and Neudecker (1980), defined as follows:

- 'vec' represents the mathematical vectorization operator which transforms an arbitrary $(k \times k)$ matrix, $A$, to a $(k^2 \times 1)$ column vector, $\text{vec}(A)$, composed of the columns of $A$ stacked into a column vector. (Note: this is not to be confused with the concept of computational vectorization discussed in Section 1.1.)
- 'vech' represents the half-vectorization operator which transforms an arbitrary square matrix, $A$, of dimension $(k \times k)$ to a $(k(k+1)/2 \times 1)$ column vector, $\text{vech}(A)$, composed by stacking the elements of $A$ which fall on and below the diagonal into a column vector.
- $N_k$ is defined as the unique matrix of dimension $(k^2 \times k^2)$ which implements symmetrization for any arbitrary square matrix $A$ of dimension $(k \times k)$ in vectorized form. I.e. $N_k$ satisfies the following relation: $N_k\,\text{vec}(A) = \text{vec}(A + A')/2$.
- $K_{m,n}$ is the unique "Commutation" matrix of dimension $(mn \times mn)$ which permutes, for any arbitrary matrix $A$ of dimension $(m \times n)$, the vectorization of $A$ to obtain the vectorization of the transpose of $A$. I.e. $K_{m,n}$ satisfies the following relation: $K_{m,n}\,\text{vec}(A) = \text{vec}(A')$.
- $D_k$ is the unique "Duplication" matrix of dimension $(k^2 \times k(k+1)/2)$, which maps the half-vectorization of any arbitrary symmetric matrix $A$ of dimension $(k \times k)$ to its vectorization. I.e. $D_k$ satisfies the following relation: $\text{vec}(A) = D_k\,\text{vech}(A)$.
- $L_k$ is the unique "Elimination" matrix of dimension $(k(k+1)/2 \times k^2)$, which maps the vectorization of any arbitrary lower triangular matrix $A$ of dimension $(k \times k)$ to its half-vectorization. I.e. $L_k$ satisfies the following relation: $\text{vech}(A) = L_k\,\text{vec}(A)$.

To help track the notational conventions employed in this work, an index of notation is provided in the Supplementary Material.

## 2 The Fisher Scoring algorithm

Fisher Scoring is employed to obtain maximum likelihood estimates of the parameters $(\beta, \sigma^2, D)$. The Fisher Scoring algorithm update rule takes the following form:

$$\theta_{s+1} = \theta_s + \alpha_s I(\theta_s)^{-1}\frac{dl(\theta_s)}{d\theta}, \quad (5)$$

where $\theta_s$ is the vector of parameter estimates given at iteration $s$, $\alpha_s$ is a scalar step size, the score vector of $\theta_s$, $\frac{dl(\theta_s)}{d\theta}$, is the derivative of the log-likelihood with respect to $\theta$ evaluated at $\theta = \theta_s$, and $I(\theta_s)$ is the Fisher Information matrix of $\theta_s$:

$$I(\theta_s) = E\left[\left(\frac{dl(\theta)}{d\theta}\right)\left(\frac{dl(\theta)}{d\theta}\right)' \;\bigg|\; \theta = \theta_s\right].$$

A more general formulation of Fisher Scoring, which allows for low-rank Fisher Information matrices, is given by Rao and Mitra (1972):

$$\theta_{s+1} = \theta_s + \alpha_s I(\theta_s)^{+}\frac{dl(\theta_s)}{d\theta}, \quad (6)$$

where superscript plus, $^+$, denotes the Moore-Penrose (or "pseudo") inverse. For notational brevity, when discussing algorithms of the form (5) and (6) in the following sections, the subscript $s$, representing iteration number, will be suppressed unless its inclusion is necessary for clarity.

For the LMM, several different representations of the parameters of interest, $(\beta, \sigma^2, D)$, can be used for numerical optimization and result in different Fisher Scoring iteration schemes.
In this section, we consider the following three representations for $\theta$:

$$\theta_h = \begin{bmatrix} \beta \\ \sigma^2 \\ \text{vech}(D_1) \\ \vdots \\ \text{vech}(D_r) \end{bmatrix}, \quad \theta_f = \begin{bmatrix} \beta \\ \sigma^2 \\ \text{vec}(D_1) \\ \vdots \\ \text{vec}(D_r) \end{bmatrix}, \quad \theta_c = \begin{bmatrix} \beta \\ \sigma^2 \\ \text{vech}(\Lambda_1) \\ \vdots \\ \text{vech}(\Lambda_r) \end{bmatrix},$$

where $\Lambda_k$ represents the lower triangular Cholesky factor of $D_k$, such that $D_k = \Lambda_k\Lambda'_k$. We will refer to the representations ($\theta_h$, $\theta_f$ and $\theta_c$) as the "half", "full" and "Cholesky" representations of $(\beta, \sigma^2, D)$, respectively. In a slight abuse of notation, the function $l$ will be allowed to take any representation of $\theta$ as input, with the interpretation unchanged (i.e. $l(\theta_f) = l(\theta_h) = l(\theta_c)$). For example, if the full representation is being used, the log-likelihood will be denoted $l(\theta_f)$, but if the half representation is being used, the log-likelihood will be denoted $l(\theta_h)$.

In the following sections, the score vectors and Fisher Information matrices required to perform five variants of Fisher Scoring for the multi-factor LMM will be stated, with proofs provided in Appendices 6.1 and 6.2. For notational convenience, we denote the sub-matrix of the Fisher Information matrix of $\theta_h$, with row indices corresponding to parameter vector $a$ and column indices corresponding to parameter vector $b$, as $I^h_{a,b}$. In other words, $I^h_{a,b}$ is the sub-matrix of $I(\theta_h)$ defined by:

$$I^h_{a,b} = E\left[\left(\frac{dl(\theta_h)}{da}\right)\left(\frac{dl(\theta_h)}{db}\right)' \;\bigg|\; \theta = \theta_s\right].$$

For further simplification of notation, when $a$ and $b$ are equal, the second subscript will be dropped and the matrix $I^h_{a,b} = I^h_{a,a}$ will be denoted simply as $I^h_a$. Analogous notation is used for the full and Cholesky representations.

### 2.1.1 Fisher Scoring (FS)

The first variant of Fisher Scoring we provide uses the "half" representation of $(\beta, \sigma^2, D)$, $\theta_h$, and is based on the standard form of Fisher Scoring given by (5).
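As a brief aside, the three representations above can be packed numerically as follows. This is a minimal numpy illustration; the function names are ours, and the packing order simply follows the displayed vectors:

```python
import numpy as np

def vech(A):
    """Half-vectorization: stack the on-and-below-diagonal elements column by column."""
    k = A.shape[0]
    return np.concatenate([A[j:, j] for j in range(k)])

def representations(beta, sigma2, D_blocks):
    """Return (theta_h, theta_f, theta_c) for parameters (beta, sigma^2, {D_k})."""
    theta_h = np.concatenate([beta, [sigma2]] + [vech(Dk) for Dk in D_blocks])
    theta_f = np.concatenate([beta, [sigma2]]
                             + [Dk.flatten(order="F") for Dk in D_blocks])  # vec(D_k)
    theta_c = np.concatenate([beta, [sigma2]]
                             + [vech(np.linalg.cholesky(Dk)) for Dk in D_blocks])
    return theta_h, theta_f, theta_c

# Toy parameters: p = 2, r = 2 factors with q_1 = 2 and q_2 = 1.
beta = np.array([1.0, -0.5])
sigma2 = 0.8
D1 = np.array([[2.0, 1.0], [1.0, 2.0]])
D2 = np.array([[4.0]])
theta_h, theta_f, theta_c = representations(beta, sigma2, [D1, D2])
```

Note how the duplicate off-diagonal element of $D_1$ appears once in $\theta_h$ but twice in $\theta_f$, which is the source of the rank deficiency discussed for the full representation below.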
This may be considered the most natural approach to a Fisher Scoring algorithm, as $\theta_h$ is an unmodified vector of the unique parameters of the LMM and (5) is the standard update rule. For this approach, the elements of the score vector are:

$$\frac{dl(\theta_h)}{d\beta} = \sigma^{-2}X'V^{-1}e, \quad (7)$$

$$\frac{dl(\theta_h)}{d\sigma^2} = -\frac{n}{2}\sigma^{-2} + \frac{\sigma^{-4}}{2}e'V^{-1}e. \quad (8)$$

For $k \in \{1, \ldots, r\}$:

$$\frac{dl(\theta_h)}{d\,\text{vech}(D_k)} = \frac{1}{2}D'_{q_k}\text{vec}\left(\sum_{j=1}^{l_k} Z'_{(k,j)}V^{-1}\left(\frac{ee'}{\sigma^2} - V\right)V^{-1}Z_{(k,j)}\right), \quad (9)$$

and the entries of the Fisher Information matrix are given by:

$$I^h_{\beta} = \sigma^{-2}X'V^{-1}X, \quad I^h_{\beta,\sigma^2} = 0_{p,1}, \quad I^h_{\sigma^2} = \frac{n}{2}\sigma^{-4}. \quad (10)$$

For $k \in \{1, \ldots, r\}$:

$$I^h_{\beta,\text{vech}(D_k)} = 0_{p,\,q_k(q_k+1)/2}, \quad I^h_{\sigma^2,\text{vech}(D_k)} = \frac{\sigma^{-2}}{2}\text{vec}'\left(\sum_{j=1}^{l_k} Z'_{(k,j)}V^{-1}Z_{(k,j)}\right)D_{q_k}. \quad (11)$$

For $k_1, k_2 \in \{1, \ldots, r\}$:

$$I^h_{\text{vech}(D_{k_1}),\text{vech}(D_{k_2})} = \frac{1}{2}D'_{q_{k_1}}\sum_{j=1}^{l_{k_2}}\sum_{i=1}^{l_{k_1}}\left(Z'_{(k_1,i)}V^{-1}Z_{(k_2,j)} \otimes Z'_{(k_1,i)}V^{-1}Z_{(k_2,j)}\right)D_{q_{k_2}}, \quad (12)$$

where $0_{n,k}$ denotes the $(n \times k)$-dimensional matrix of zeros. As this algorithm takes the most natural approach to Fisher Scoring, it is referred to as FS in the remainder of this text. Pseudocode for the FS algorithm is given in Algorithm 1. Discussion of the initial estimates used in the algorithm is deferred to Section 2.2.

Algorithm 1:

Fisher Scoring (FS)

```
Assign θ_h to an initial estimate using (25) and (26)
while current l(θ_h) and previous l(θ_h) differ by more than a predefined tolerance do
    Calculate dl(θ_h)/dθ_h using (7)-(9)
    Calculate I(θ_h) using (10)-(12)
    Assign θ_h = θ_h + α I(θ_h)^(-1) dl(θ_h)/dθ_h
    Assign α = α/2 if l(θ_h) has decreased in value
end
```

### 2.1.2 Full Fisher Scoring (FFS)

The second variant of Fisher Scoring considered in this work uses the "full" representation, $\theta_f$, of the model parameters, and shall therefore be referred to as "Full Fisher Scoring" (FFS). In this approach, for each factor $f_k$, the elements of $\text{vec}(D_k)$ are treated as distinct from one another, with numerical optimization for $D_k$ performed over the space of all $(q_k \times q_k)$ matrices. This approach differs from the FS method proposed in the previous section, in which optimization was performed over the space of only those $(q_k \times q_k)$ matrices that are symmetric. This optimization procedure is realized by treating symmetric matrix elements of $D_k$ as distinct and, for a given element, using the partial derivative with respect to the element during optimization instead of the total derivative with respect to both the element and its symmetric counterpart. This change is reflected by denoting the elements of the score vector which correspond to $\text{vec}(D_k)$ using a partial derivative operator, $\partial$, rather than the total derivative operator, $d$. The primary motivation for the inclusion of the FFS approach is that it serves as a basis upon which the constrained covariance approaches of Section 2.5 can be built. However, it should be noted that, as it does not require the construction or use of duplication matrices, FFS also provides simplified expressions and potential improvement in terms of computation speed.
As a result, FFS is of some theoretical and practical interest and is detailed in full here.

An immediate obstacle to this approach is that the Fisher Information matrix of $\theta_f$ is rank-deficient and, therefore, cannot be inverted. Intuitively, this is to be expected, as repeated entries in $\theta_f$ result in repeated rows in $I(\theta_f)$. Mathematically, this can be seen by noting that $I^f_{\text{vec}(D_k)}$ can be expressed as a product containing the matrix $N_{q_k}$ (defined in Section 1.2.2), which is low-rank by construction (see Appendix 6.3). Consequently, the Fisher Scoring update rule for $\theta_f$ must be based on the pseudo-inverse formulation of Fisher Scoring given in (6).

As the derivatives of the log-likelihood with respect to $\beta$ and $\sigma^2$ do not depend upon the parameterisation of $D$, both FFS and FS employ the same expressions for the elements of the score vector which correspond to $\beta$ and $\sigma^2$, given by (7) and (8) respectively. The score vector for $\{\text{vec}(D_k)\}_{k\in\{1,\ldots,r\}}$ used by FFS is given by:

$$\frac{\partial l(\theta_f)}{\partial\,\text{vec}(D_k)} = \frac{1}{2}\text{vec}\left(\sum_{j=1}^{l_k} Z'_{(k,j)}V^{-1}\left(\frac{ee'}{\sigma^2} - V\right)V^{-1}Z_{(k,j)}\right). \quad (13)$$

The entries of the Fisher Information matrix of $\theta_f$, based on the derivatives given in (7), (8) and (13), are given by:

$$I^f_{\beta} = I^h_{\beta}, \quad I^f_{\beta,\sigma^2} = I^h_{\beta,\sigma^2}, \quad I^f_{\sigma^2} = I^h_{\sigma^2}.$$

For $k \in \{1, \ldots, r\}$:

$$I^f_{\beta,\text{vec}(D_k)} = 0_{p,\,q_k^2}, \quad I^f_{\sigma^2,\text{vec}(D_k)} = \frac{\sigma^{-2}}{2}\text{vec}'\left(\sum_{j=1}^{l_k} Z'_{(k,j)}V^{-1}Z_{(k,j)}\right). \quad (14)$$

For $k_1, k_2 \in \{1, \ldots, r\}$:

$$I^f_{\text{vec}(D_{k_1}),\text{vec}(D_{k_2})} = \frac{1}{2}\sum_{j=1}^{l_{k_2}}\sum_{i=1}^{l_{k_1}}\left(Z'_{(k_1,i)}V^{-1}Z_{(k_2,j)} \otimes Z'_{(k_1,i)}V^{-1}Z_{(k_2,j)}\right)N_{q_{k_2}}. \quad (15)$$

Derivations for the above can be found in Appendices 6.1 and 6.2. In addition, in Appendix 6.3, we show that the Full Fisher Scoring algorithm can also be expressed in the form:

$$\theta^f_{s+1} = \theta^f_s + \alpha_s F(\theta^f_s)^{-1}\frac{\partial l(\theta^f_s)}{\partial\theta}, \quad (16)$$

where, unlike in (5) and (6), $F(\theta_f)$ is not the Fisher Information matrix. Rather, $F(\theta_f)$ is a matrix of equal dimensions to $I(\theta_f)$, with all of its elements equal to those of $I(\theta_f)$ apart from those specified by (15), which are instead, for $k_1, k_2 \in \{1, \ldots, r\}$:

$$F_{\text{vec}(D_{k_1}),\text{vec}(D_{k_2})} = \frac{1}{2}\sum_{j=1}^{l_{k_2}}\sum_{i=1}^{l_{k_1}}\left(Z'_{(k_1,i)}V^{-1}Z_{(k_2,j)} \otimes Z'_{(k_1,i)}V^{-1}Z_{(k_2,j)}\right), \quad (17)$$

where the same subscript notation has been adopted to index $F(\theta_f)$ as was adopted for $I(\theta_f)$.

The above approach, which utilizes the matrix $F(\theta_f)$ rather than $I(\theta_f)$, is adopted to match the formulation of the Fisher Scoring algorithm given in Section 2.11 of Demidenko (2013) for the single-factor LMM. Pseudocode for the FFS algorithm, using the representation of the update rule given by (16), is provided by Algorithm 2.

Algorithm 2:

Full Fisher Scoring (FFS)

```
Assign θ_f to an initial estimate using (25) and (26)
while current l(θ_f) and previous l(θ_f) differ by more than a predefined tolerance do
    Calculate ∂l(θ_f)/∂θ_f using (7), (8) and (13)
    Calculate F(θ_f) using (14) and (17)
    Assign θ_f = θ_f + α F(θ_f)^(-1) ∂l(θ_f)/∂θ_f
    Assign α = α/2 if l(θ_f) has decreased in value
end
```

### 2.1.3 Simplified Fisher Scoring (SFS)

The third Fisher Scoring algorithm proposed in this work relies on the half-representation of the parameters $(\beta, \sigma^2, D)$ and takes an approach, similar to that of coordinate ascent, which is commonly adopted in the single-factor setting (c.f. Demidenko (2013)). In this approach, instead of performing a single update step upon the entire vector $\theta_h$ in the form of (5), updates for $\beta$, $\sigma^2$ and $\{D_k\}_{k\in\{1,\ldots,r\}}$ are performed individually in turn. For $\beta$ and $\sigma^2$, each iteration uses the standard Generalized Least Squares (GLS) estimators given by:

$$\beta_{s+1} = (X'V_s^{-1}X)^{-1}X'V_s^{-1}Y, \quad (18)$$

$$\sigma^2_{s+1} = \frac{e'_{s+1}V_s^{-1}e_{s+1}}{n}. \quad (19)$$

To update the random effects covariance matrix, $D_k$, for each factor $f_k$, individual Fisher Scoring updates are applied to $\text{vech}(D_k)$. These updates are performed using the block of the Fisher Information matrix corresponding to $\text{vech}(D_k)$, given by (12), and take the following form for $k \in \{1, \ldots, r\}$:

$$\text{vech}(D_{k,s+1}) = \text{vech}(D_{k,s}) + \alpha_s\left(I^h_{\text{vech}(D_{k,s})}\right)^{-1}\frac{dl(\theta^h_s)}{d\,\text{vech}(D_{k,s})}. \quad (20)$$

In line with the naming convention used in Demidenko (2013), this method shall be referred to as "Simplified" Fisher Scoring (SFS). This is due to the relative simplicity, both in terms of notational and computational complexity, of the updates (18)-(20) used in the SFS algorithm in comparison to those used in the FS algorithm of Section 2.1.1, given by (7)-(12). Pseudocode for the SFS algorithm is given by Algorithm 3.

Algorithm 3:

Simplified Fisher Scoring (SFS)

```
Assign θ_h to an initial estimate using (25) and (26)
while current l(θ_h) and previous l(θ_h) differ by more than a predefined tolerance do
    Update β using (18)
    Update σ² using (19)
    for k ∈ {1, ..., r} do
        Update vech(D_k) using (20)
    end
    Assign α = α/2 if l(θ_h) has decreased in value
end
```
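For illustration, a single $\beta$ and $\sigma^2$ update of the form (18)-(19) can be written as follows. This is a numpy sketch (the function name is ours), with $V$ assumed already assembled from the current estimate of $D$:

```python
import numpy as np

def gls_update(X, Y, V):
    """One SFS update of beta and sigma^2, following (18) and (19)."""
    n = len(Y)
    Vinv_X = np.linalg.solve(V, X)                      # V^{-1} X
    Vinv_Y = np.linalg.solve(V, Y)                      # V^{-1} Y
    beta = np.linalg.solve(X.T @ Vinv_X, X.T @ Vinv_Y)  # (X'V^{-1}X)^{-1} X'V^{-1}Y
    e = Y - X @ beta
    sigma2 = e @ np.linalg.solve(V, e) / n              # e'V^{-1}e / n
    return beta, sigma2

# Sanity check: with V = I_n, (18) reduces to ordinary least squares.
rng = np.random.default_rng(1)
X = rng.standard_normal((20, 3))
Y = rng.standard_normal(20)
beta, sigma2 = gls_update(X, Y, np.eye(20))
```

Solving against $V$ rather than explicitly inverting it is the usual numerically stable choice here.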

### 2.1.4 Full Simplified Fisher Scoring (FSFS)

The Full Simplified Fisher Scoring algorithm (FSFS) combines the "Full" and "Simplified" approaches described in Sections 2.1.2 and 2.1.3. In the FSFS algorithm, individual updates are applied to $\beta$ and $\sigma^2$ using (18) and (19) respectively, and to $\{\text{vec}(D_k)\}_{k\in\{1,\ldots,r\}}$ using a Fisher Scoring update step based on the matrix $F_{\text{vec}(D_k)}$ given by (17). The update rule for $\{\text{vec}(D_k)\}_{k\in\{1,\ldots,r\}}$ takes the following form:

$$\text{vec}(D_{k,s+1}) = \text{vec}(D_{k,s}) + \alpha_s F^{-1}_{\text{vec}(D_{k,s})}\frac{\partial l(\theta^f_s)}{\partial\,\text{vec}(D_{k,s})}. \quad (21)$$

In Appendix 6.3, we justify the above update rule by demonstrating that it is equivalent to using an update rule of the form (6), given by:

$$\text{vec}(D_{k,s+1}) = \text{vec}(D_{k,s}) + \alpha_s\left(I^f_{\text{vec}(D_{k,s})}\right)^{+}\frac{\partial l(\theta^f_s)}{\partial\,\text{vec}(D_{k,s})}.$$

Pseudocode for the FSFS algorithm is given by Algorithm 4.

Algorithm 4:

Full Simplified Fisher Scoring (FSFS)

```
Assign θ_f to an initial estimate using (25) and (26)
while current l(θ_f) and previous l(θ_f) differ by more than a predefined tolerance do
    Update β using (18)
    Update σ² using (19)
    for k ∈ {1, ..., r} do
        Update vec(D_k) using (21)
    end
    Assign α = α/2 if l(θ_f) has decreased in value
end
```

### 2.1.5 Cholesky Simplified Fisher Scoring (CSFS)

The final variant of the Fisher Scoring algorithm we consider is based on the "simplified" approach described in Section 2.1.3 and uses the Cholesky parameterisation of $(\beta, \sigma^2, D)$, $\theta_c$. This approach follows directly from the below application of the chain rule of differentiation for vector-valued functions:

$$\frac{dl(\theta_c)}{d\,\text{vech}(\Lambda_k)} = \frac{\partial\,\text{vech}(D_k)}{\partial\,\text{vech}(\Lambda_k)}\frac{\partial l(\theta_c)}{\partial\,\text{vech}(D_k)}.$$

An expression for the derivative which appears second in the above product was given by (9). It therefore follows that, in order to evaluate the score vector of $\text{vech}(\Lambda_k)$ (i.e. the derivative of $l$ with respect to $\text{vech}(\Lambda_k)$), only an expression for the first term of the above product is required. This term can be evaluated as $L_{q_k}(\Lambda'_k \otimes I_{q_k})(I_{q_k^2} + K_{q_k,q_k})D_{q_k}$, proof of which is provided in Appendix 6.6.

Through similar arguments to those used to prove Corollaries 4-6 of Appendix 6.2, it can be shown that the Fisher Information matrix for $\theta_c$ is given by:

$$I^c_{\beta} = I^h_{\beta}, \quad I^c_{\beta,\sigma^2} = I^h_{\beta,\sigma^2}, \quad I^c_{\sigma^2} = I^h_{\sigma^2}.$$

For $k \in \{1, \ldots, r\}$:

$$I^c_{\beta,\text{vech}(\Lambda_k)} = 0_{p,\,q_k(q_k+1)/2}, \quad I^c_{\sigma^2,\text{vech}(\Lambda_k)} = I^h_{\sigma^2,\text{vech}(D_k)}\left(\frac{\partial\,\text{vech}(D_k)}{\partial\,\text{vech}(\Lambda_k)}\right)'. \quad (22)$$

For $k_1, k_2 \in \{1, \ldots, r\}$:

$$I^c_{\text{vech}(\Lambda_{k_1}),\text{vech}(\Lambda_{k_2})} = \left(\frac{\partial\,\text{vech}(D_{k_1})}{\partial\,\text{vech}(\Lambda_{k_1})}\right)I^h_{\text{vech}(D_{k_1}),\text{vech}(D_{k_2})}\left(\frac{\partial\,\text{vech}(D_{k_2})}{\partial\,\text{vech}(\Lambda_{k_2})}\right)'. \quad (23)$$

From the above, it can be seen that a non-"simplified" Cholesky-based variant of the Fisher Scoring algorithm, akin to the FS and FFS algorithms described in Sections 2.1.1 and 2.1.2, may be constructed.
This would involve constructing the Fisher Information matrix of θ^c, I(θ^c), using (22) and (23), and employing a Fisher Scoring procedure similar to that specified by Algorithms 1 and 2. Whilst this approach is feasible in terms of implementation, preliminary tests indicated that its computation time is significantly worse than that of the previously proposed algorithms. For this reason, only the "simplified" version of the Cholesky Fisher Scoring algorithm, analogous to the "simplified" approaches described in Sections 2.1.3 and 2.1.4, is considered here. The Cholesky Simplified Fisher Scoring (CSFS) algorithm adopts (18) and (19) as update rules for β and σ², respectively, and the following update rule for {vech(Λ_k)}_{k∈{1,...,r}}:

\[
\operatorname{vech}(\Lambda_{k,s+1}) = \operatorname{vech}(\Lambda_{k,s}) + \alpha_s \left( I^c_{\operatorname{vech}(\Lambda_{k,s})} \right)^{-1} \frac{dl(\theta^c_s)}{d\,\operatorname{vech}(\Lambda_{k,s})}. \tag{24}
\]

Pseudocode summarizing the CSFS approach is given by Algorithm 5.

Algorithm 5:

Cholesky Simplified Fisher Scoring (CSFS)

    Assign θ^c to an initial estimate using (25) and (26)
    while current l(θ^c) and previous l(θ^c) differ by more than a predefined tolerance do
        Update β using (18)
        Update σ² using (19)
        for k ∈ {1, ..., r} do
            Update vech(Λ_k) using (24)
        end
        Assign α = α/2 if l(θ^c) has decreased in value
    end

2.2 Initial values

The question of which initial values of β, σ² and D will be used as starting points for optimization is an important consideration for the Fisher Scoring algorithm. Denoting these initial values as β₀, σ₀² and D₀, respectively, this work follows the recommendations of Demidenko (2013) and evaluates β₀ and σ₀² using the OLS estimates given by:

\[
\beta_0 = (X'X)^{-1}X'Y, \qquad \sigma^2_0 = \frac{e'e}{n}, \tag{25}
\]

where e is defined as e = Y − Xβ₀. For the initial estimate of {D_k}_{k∈{1,...,r}}, an approach similar to that suggested in Demidenko (2013) is also adopted, which replaces V with I_n in the update rule for vec(D_k), equation (21). The resulting initial estimate for {D_k}_{k∈{1,...,r}} is given by:

\[
\operatorname{vec}(D_{k,0}) = \left( \sum_{j=1}^{l_k} Z'_{(k,j)}Z_{(k,j)} \otimes Z'_{(k,j)}Z_{(k,j)} \right)^{-1} \operatorname{vec}\left( \sum_{j=1}^{l_k} Z'_{(k,j)} \left( \frac{ee'}{\sigma^2_0} - I_n \right) Z_{(k,j)} \right). \tag{26}
\]

2.3 Ensuring non-negative definiteness

In order to ensure that numerical optimization produces a valid covariance matrix when performing LMM parameter estimation, optimization must be constrained so that, for each factor f_k, D_k is non-negative definite. This is a well-documented obstacle for numerical approaches to LMM parameter estimation, for which many solutions have been suggested. A common approach, utilized in Bates et al. (2015) for example, is to reparameterize optimization in terms of the Cholesky factor of the covariance matrix, Λ_k, as opposed to the covariance matrix itself, D_k. This approach is adopted by the CSFS method, described in Section 2.1.5.
However, the methods described in Sections 2.1.1-2.1.4 require an alternative approach, as their optimization does not account for the constraint that D_k must be non-negative definite. For the FS, FFS, SFS and FSFS algorithms, we use an approach described by Demidenko (2013) for the single-factor LMM. Denoting the eigendecomposition of D_k as D_k = U_k Σ_k U'_k, we project D_k onto the space of all (q_k × q_k) non-negative definite matrices by, following each update of the covariance matrix D, replacing {D_k}_{k∈{1,...,r}} with {D_{+,k}}_{k∈{1,...,r}}, where

\[
D_{+,k} = U_k \Sigma_{+,k} U'_k, \qquad \Sigma_{+,k} = \max(\Sigma_k, 0),
\]

with 'max' representing the element-wise maximum. For each factor f_k, this ensures that D_k, and as a result D, is non-negative definite.

2.4 Computational efficiency

This section discusses the computational efficiency of evaluating the Fisher Information matrices and score vectors of Sections 2.1.1-2.1.5. This discussion is, in large part, motivated by the mass-univariate setting (c.f. Section 1.1), in which not one but hundreds of thousands of models must be estimated concurrently. As a result, the discussion prioritizes both time efficiency and memory consumption and, further, operates under the assumption that sparse matrix methodology, such as that employed by the R package 'lmer', cannot be used. This assumption is made primarily because of the current lack of support for computationally vectorized sparse matrix operations.

A primary concern, for both memory consumption and time efficiency, stems from the fact that many of the matrices used in the evaluation of the score vectors and Fisher Information matrices have dimensions which scale with n, the number of observations. For example, V has dimensions (n × n) and is frequently inverted in the FSFS algorithm. In practice, it is not uncommon for studies to have n ranging into the thousands.
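The eigenvalue-truncation projection of Section 2.3 above is straightforward to implement. The sketch below is illustrative (NumPy; the function name is ours, not the paper's):

```python
import numpy as np

def project_nnd(Dk):
    """Project a symmetric matrix onto the cone of non-negative
    definite matrices by truncating negative eigenvalues at zero,
    as in the eigendecomposition approach of Section 2.3."""
    # eigh returns real eigenvalues/vectors for a symmetric input
    eigvals, U = np.linalg.eigh(Dk)
    Sigma_plus = np.maximum(eigvals, 0.0)   # element-wise max(Sigma_k, 0)
    return U @ np.diag(Sigma_plus) @ U.T    # D_{+,k} = U_k Sigma_{+,k} U_k'

# Example: an indefinite symmetric matrix (eigenvalues 3 and -1)
Dk = np.array([[1.0, 2.0],
               [2.0, 1.0]])
D_plus = project_nnd(Dk)
```

After projection the smallest eigenvalue is zero, and a matrix that is already non-negative definite is returned unchanged (up to floating-point error).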
As such, inverting V directly may not be computationally feasible. To address this issue, we define the "product forms" as:

\[
P = X'X, \quad Q = X'Y, \quad R = X'Z, \quad S = Y'Y, \quad T = Y'Z, \quad U = Z'Z.
\]

Working with the product forms is preferable to using the original matrices X, Y and Z, as the dimensions of the product forms do not scale with n. Instead, their dimensions depend only on the number of fixed effects, p, and the second dimension of the random effects design matrix, q. As an example, consider the expressions below, which appear frequently in (13) and (17) and have been reformulated in terms of the product forms:

\[
Z'_{(k,i)} V^{-1} Z_{(k,j)} = \left( U - U(I_q + DU)^{-1}DU \right)_{[(k,i),(k,j)]},
\]
\[
Z'_{(k,j)} V^{-1} e = \left( \left(I_q - U(I_q + DU)^{-1}D\right)(T' - R'\beta) \right)_{[(k,j),\,:]}.
\]

For computational purposes, the right-hand sides of the above expressions are much more convenient than the left-hand sides. To evaluate either left-hand side, an (n × n) inversion of the matrix V must be performed. In contrast, the right-hand sides are expressible purely in terms of the product forms and (β, σ², D), with the only inversion required being that of the (q × q) matrix (I_q + DU). These examples can be generalized further. In fact, all of the previous expressions (7)-(26) can be rewritten in terms of only the product forms and (β, σ², D). This observation is important, as it implies that an algorithm for mixed model parameter estimation may take the matrices X, Y and Z as inputs but discard them entirely once the product forms have been constructed. As a result, neither computation time nor memory consumption scales with n.

Another potential source of concern regarding computation speed arises from noting that the algorithms we have presented contain many summations of the following two forms:

\[
\sum_{i=1}^{c} A_i B'_i \qquad \text{and} \qquad \sum_{i=1}^{c}\sum_{j=1}^{c} G_{i,j} \otimes H_{i,j}, \tag{27}
\]

where the matrices {A_i} and {B_i} are of dimension (m₁ × m₂), and the matrices {G_{i,j}} and {H_{i,j}} are of dimension (n₁ × n₂). We denote by A and B the matrices formed from vertical concatenation of the {A_i} and {B_i} matrices, respectively, and by G and H the matrices formed from block-wise concatenation of the {G_{i,j}} and {H_{i,j}}, respectively. Instances of such summations can be found, for example, in equations (9) and (12).

From a computational standpoint, summations of the forms shown in (27) are typically realized by 'for' loops. This cumulative approach can become a bottleneck for LMM computation, since the number of summands typically corresponds to the number of levels, l_k, of some factor f_k. In particular applications of the LMM, such as repeated measures and longitudinal studies, some factors may possess large numbers of levels. As 'for' loops of this kind could hinder computation and result in slow performance, we provide alternative methods for calculating summations of the forms shown in (27).

For the former summation in (27), we utilize the "generalized vectorisation", or "vec_m", operator, defined by Turkington (2013) as the operator which performs the below mapping for a horizontally partitioned matrix M:

\[
M = \begin{bmatrix} M_1 & M_2 & \dots & M_c \end{bmatrix} \;\rightarrow\; \operatorname{vec}_m(M) = \begin{bmatrix} M_1 \\ M_2 \\ \vdots \\ M_c \end{bmatrix},
\]

where the partitions {M_i} are evenly sized and each contains m columns. Using the definition of the generalized vectorisation operator, the former summation in (27) can be reformulated as:

\[
\sum_{i=1}^{c} A_i B'_i = \operatorname{vec}_m(A')' \operatorname{vec}_m(B').
\]

The right-hand side of the above expression is of practical utility, as the vec_m operator can be implemented efficiently: it can be performed using matrix resizing operations, such as the 'reshape' operators commonly available in programming languages such as MATLAB and Python.
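Both of the efficiency devices discussed so far can be checked numerically: the product-form route that replaces the (n × n) inversion of V with a (q × q) one (via the Woodbury identity), and loop-free evaluation of the two summation forms in (27). The sketch below is illustrative (NumPy, small random dimensions, hypothetical variable names); for the Kronecker-sum form it uses einsum, an equivalent vectorized route, rather than the commutation-matrix identity derived in the text.

```python
import numpy as np

rng = np.random.default_rng(0)
n, q, c, m1, m2 = 60, 4, 5, 3, 2

# --- Product forms: Z'V^{-1}Z via a (q x q) inversion only ---
Z = rng.normal(size=(n, q))
Aq = rng.normal(size=(q, q))
D = Aq @ Aq.T                                 # an illustrative NND covariance
U = Z.T @ Z                                   # product form U = Z'Z
V = np.eye(n) + Z @ D @ Z.T                   # V = I_n + Z D Z'
lhs = Z.T @ np.linalg.solve(V, Z)             # direct (n x n) solve
rhs = U - U @ np.linalg.solve(np.eye(q) + D @ U, D @ U)  # only a (q x q) solve

# --- First form of (27): sum_i A_i B_i' without a 'for' loop ---
As = rng.normal(size=(c, m1, m2))             # the {A_i}
Bs = rng.normal(size=(c, m1, m2))             # the {B_i}
loop = sum(As[i] @ Bs[i].T for i in range(c))
# vec_m via reshape: stack the A_i' blocks vertically, then one matmul
vecm_At = As.transpose(0, 2, 1).reshape(c * m2, m1)   # vec_m(A')
vecm_Bt = Bs.transpose(0, 2, 1).reshape(c * m2, m1)   # vec_m(B')
no_loop = vecm_At.T @ vecm_Bt

# --- Second form of (27): sum_{i,j} G_ij kron H_ij via einsum ---
a, b, d, e = 2, 3, 2, 2
Gs = rng.normal(size=(c, c, a, b))
Hs = rng.normal(size=(c, c, d, e))
loop_kron = sum(np.kron(Gs[i, j], Hs[i, j]) for i in range(c) for j in range(c))
vectorized = np.einsum('ijpr,ijqs->pqrs', Gs, Hs).reshape(a * d, b * e)
```

Each vectorized result agrees with its explicit-loop counterpart to floating-point precision, while involving no Python-level iteration over levels.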
The computational costs associated with this approach are significantly lower than those incurred when evaluating the summation directly using 'for' loops.

For the latter summation in (27), we first define the notation M̃ to denote the transformation below, for a block-wise partitioned matrix M:

\[
M = \begin{bmatrix} M_{1,1} & M_{1,2} & \dots & M_{1,c} \\ M_{2,1} & M_{2,2} & \dots & M_{2,c} \\ \vdots & \vdots & \ddots & \vdots \\ M_{c,1} & M_{c,2} & \dots & M_{c,c} \end{bmatrix} \;\rightarrow\; \tilde{M} = \begin{bmatrix} \operatorname{vec}(M_{1,1})' \\ \operatorname{vec}(M_{1,2})' \\ \vdots \\ \operatorname{vec}(M_{1,c})' \\ \operatorname{vec}(M_{2,1})' \\ \vdots \\ \operatorname{vec}(M_{c,c})' \end{bmatrix}. \tag{28}
\]

Using a modification of Lemma 3.1 (i) of Neudecker and Wansbeek (1983), we obtain the below identity:

\[
\operatorname{vec}\left( \sum_{i,j} G_{i,j} \otimes H_{i,j} \right) = (I_{n_2} \otimes K_{n_2,n_1} \otimes I_{n_1}) \operatorname{vec}(\tilde{H}'\tilde{G}), \tag{29}
\]

where K_{n_2,n_1} is the (n₁n₂ × n₁n₂) commutation matrix (c.f. Section 1.2.2). The matrices H̃ and G̃ can be obtained from H and G using resizing operators with little computation time. The matrix (I_{n_2} ⊗ K_{n_2,n_1} ⊗ I_{n_1}) is a permutation matrix depending only on the dimensions of H and G. As a result, it can be calculated once and stored as a permutation vector. The matrix multiplication in the above expression therefore does not need to be performed directly; it can instead be evaluated by permuting the elements of vec(H̃'G̃) according to the permutation vector representing (I_{n_2} ⊗ K_{n_2,n_1} ⊗ I_{n_1}). To summarize, expressions of the latter form shown in (27) can be evaluated using only reshape operations, a matrix multiplication, and a permutation. This method of evaluation provides notably faster computation than the corresponding 'for' loop evaluated over all values of i and j.

The above methods, combined with the product form approach, were used to obtain the results of Sections 2.7-3.2.

2.5 Constrained covariance structures

In many applications, it is desirable to assume a constrained structure for the covariance matrix D_k. Examples include compound symmetry, first-order auto-regression and a Toeplitz structure.
A more comprehensive list of commonly employed covariance structures in LMM analyses can be found, for example, in Wolfinger (1996). In this section, we describe how the Fisher Scoring algorithms of the previous sections can be adjusted to model dependence between covariance elements.

When a constraint is placed on the covariance matrix D_k, it is assumed that the elements of D_k can be defined as continuous, differentiable functions of some smaller parameter vector, vecu(D_k). Colloquially, vecu(D_k) may be thought of as the vector of "unique" parameters required to specify the constrained parameterization of D_k. To perform constrained optimization, we adopt a similar approach to the previous sections and define the constrained representation of θ as θ^{con} = [β', σ², vecu(D₁)', ..., vecu(D_r)']'. Denoting the Jacobian matrix ∂vec(D_k)/∂vecu(D_k) as C_k, the score vector and Fisher Information matrix of θ^{con} can be constructed as follows; for k ∈ {1, ..., r}:

\[
\frac{dl(\theta^{con})}{d\,\operatorname{vecu}(D_k)} = C_k \frac{\partial l(\theta^{con})}{\partial \operatorname{vec}(D_k)}, \qquad
I^{con}_{\beta,\operatorname{vecu}(D_k)} = 0_{p,\tilde{q}_k}, \qquad
I^{con}_{\sigma^2,\operatorname{vecu}(D_k)} = I^f_{\sigma^2,\operatorname{vec}(D_k)} C'_k, \tag{30}
\]

and, for k₁, k₂ ∈ {1, ..., r}:

\[
I^{con}_{\operatorname{vecu}(D_{k_1}),\operatorname{vecu}(D_{k_2})} = C_{k_1} I^f_{\operatorname{vec}(D_{k_1}),\operatorname{vec}(D_{k_2})} C'_{k_2},
\]

where q̃_k is the length of vecu(D_k) and I^f is the "full" Fisher Information matrix defined in Section 2.1.2. The above expressions can be derived directly using the definition of the Fisher Information matrix and the chain rule. In the remainder of this work, the matrix C_k is referred to as a "constraint matrix", due to the fact that it "imposes" constraints during optimization. A fuller discussion of constraint matrices, alongside examples, is provided in Appendix 6.5.

We note here that this approach can be extended to situations in which all covariance parameters can be expressed in terms of one set of parameters, ρ_D, common to all {vec(D_k)}_{k∈{1,...,r}}.
In such situations, a constraint matrix, C, may be defined as the Jacobian matrix ∂[vec(D₁)', ..., vec(D_r)']'/∂ρ_D, and Fisher Information matrices and score vectors may be derived in a similar manner to the above. Models requiring this type of commonality between factors are rare in the LMM literature, since the covariance matrices {D_k}_{k∈{1,...,r}} are typically defined independently of one another. However, one such example is provided by the ACE model employed for twin studies, in which the elements of vec(D) can all be expressed in terms of three variance components, ρ_D = [σ²_a, σ²_c, σ²_e]'. Further information on the ACE model is presented in Section 2.8.2.

2.6 Degrees of freedom estimation

Several methods exist for drawing inference on the fixed effects parameter vector, β. Often, research questions can be expressed as hypotheses of the below form:

\[
H_0: L\beta = 0, \qquad H_1: L\beta \neq 0,
\]

where L is a fixed and known (1 × p)-sized contrast matrix specifying a hypothesis, or prior belief, about linear relationships between the elements of β, upon which an inference is to be made. In the setting of the LMM, a commonly employed statistic for testing hypotheses of this form is the approximate T-statistic, given by:

\[
T = \frac{L\hat{\beta}}{\sqrt{\hat{\sigma}^2 L (X'\hat{V}^{-1}X)^{-1} L'}},
\]

where V̂ = I_n + ZD̂Z'. As noted in Dempster et al. (1981), using an estimate of D in this manner results in an under-estimation of the true variability in β̂. For this reason, the relation T ∼ t_v is not exact and is treated only as an approximating distribution (i.e. an "approximate T-statistic"), where the degrees of freedom, v, must be estimated empirically. A standard method for estimating the degrees of freedom, v, utilizes the Welch-Satterthwaite equation, originally described in Satterthwaite (1946) and Welch (1947), given by:

\[
v(\hat{\eta}) = \frac{2\left(S^2(\hat{\eta})\right)^2}{\operatorname{Var}\left(S^2(\hat{\eta})\right)}, \tag{31}
\]

where η̂ represents an estimate of the variance parameters η = (σ², D₁, ..., D_r) and S²(η̂) is given by:

\[
S^2(\hat{\eta}) = \hat{\sigma}^2 L (X'\hat{V}^{-1}X)^{-1} L'.
\]

Typically, as the variance estimates obtained under ML estimation are biased downwards, ReML estimation is employed to obtain the estimate of η used in the above expression. If ML were used to estimate η instead, the degrees of freedom, v(η̂), would be underestimated and, consequently, conservative p-values and a reduction in statistical power for resulting hypothesis tests would be observed.

To obtain an approximation for v(η̂), a Taylor expansion is applied to the unknown variance in the denominator of (31):

\[
\operatorname{Var}(S^2(\hat{\eta})) \approx \left( \frac{dS^2(\hat{\eta})}{d\hat{\eta}} \right)' \operatorname{Var}(\hat{\eta}) \left( \frac{dS^2(\hat{\eta})}{d\hat{\eta}} \right). \tag{32}
\]

This approach is well-documented and has been adopted, most notably, by the R package lmerTest, first presented in Kuznetsova et al. (2017). To obtain the derivative of S² and the asymptotic variance-covariance matrix, lmerTest utilizes numerical estimation methods. As an alternative, we define the "half" representation of η̂, η̂^h, by η̂^h = [σ̂², vech(D̂₁)', ..., vech(D̂_r)']', and present the below exact closed-form expressions for the derivative of S² in terms of η̂^h:

\[
\frac{dS^2(\hat{\eta}^h)}{d\hat{\sigma}^2} = L(X'\hat{V}^{-1}X)^{-1}L', \qquad
\frac{dS^2(\hat{\eta}^h)}{d\,\operatorname{vech}(\hat{D}_k)} = \hat{\sigma}^2 D'_{q_k} \left( \sum_{j=1}^{l_k} \hat{B}_{(k,j)} \otimes \hat{B}_{(k,j)} \right),
\]

where D_{q_k} denotes the duplication matrix and B̂_{(k,j)} is given by:

\[
\hat{B}_{(k,j)} = Z'_{(k,j)} \hat{V}^{-1} X (X'\hat{V}^{-1}X)^{-1} L'.
\]

We obtain an expression for Var(η̂^h) by noting that the asymptotic variance of η̂^h is given by I(η̂^h)^{-1}, where I(η̂^h) is a sub-matrix of I(θ̂^h), given by equations (10)-(12). In summary, we have provided all the closed-form expressions necessary to perform Satterthwaite degrees of freedom estimation for any LMM described by (1) and (2). For the remainder of this work, estimation of v using the above expressions is referred to as the "direct-SW" method.
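The direct-SW computation can be sketched end-to-end for a toy random-intercept design. Everything below is illustrative: the "estimates" σ̂², d̂ and the variance matrix Var(η̂^h) are hypothetical stand-ins (in a real fit the latter would come from the inverse Fisher Information), and the helper names are ours.

```python
import numpy as np

def direct_sw_dof(S2, grad, var_eta):
    """Welch-Satterthwaite estimate (31), with the delta-method
    variance (32): v = 2 * S2^2 / (grad' Var(eta) grad)."""
    var_S2 = (grad.T @ var_eta @ grad).item()
    return 2.0 * S2**2 / var_S2

rng = np.random.default_rng(0)
n, p, l = 40, 2, 5

X = np.column_stack([np.ones(n), rng.normal(size=n)])
groups = rng.integers(0, l, size=n)
Z = np.zeros((n, l))
Z[np.arange(n), groups] = 1.0              # random-intercept design (q = 1)

sigma2_hat, d_hat = 1.2, 0.5               # hypothetical variance estimates
V_hat = np.eye(n) + d_hat * Z @ Z.T        # V = I_n + Z D Z'
L = np.array([[0.0, 1.0]])                 # contrast on the slope

W = X.T @ np.linalg.solve(V_hat, X)        # X' V^{-1} X
S2 = (sigma2_hat * L @ np.linalg.solve(W, L.T)).item()   # S^2(eta-hat)

# closed-form gradient of S^2 w.r.t. (sigma^2, vech(D)); for q = 1 the
# duplication-matrix term reduces to a sum of squared B_(k,j) scalars
Bs = Z.T @ np.linalg.solve(V_hat, X) @ np.linalg.solve(W, L.T)
dS2_dsigma2 = (L @ np.linalg.solve(W, L.T)).item()
dS2_dD = sigma2_hat * (Bs**2).sum()
grad = np.array([[dS2_dsigma2], [dS2_dD]])

var_eta = np.diag([0.05, 0.02])            # hypothetical Var(eta-hat^h)
v = direct_sw_dof(S2, grad, var_eta)
```

With exact inputs in place of the hypothetical ones, v feeds directly into the approximate T ∼ t_v reference distribution.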
This name reflects the direct approach taken for the evaluation of the right-hand side of (32). We note that this approach may also be extended to models using the constrained covariance structures of Section 2.5, by employing the Fisher Information matrix given by equation (30) and transforming the derivative of S²(η) appropriately (details of which are provided by Theorem 9 of Appendix 6.7.3). We conclude this section by noting that this method can also be used in a similar manner for estimating the degrees of freedom of an approximate F-statistic based on the multi-factor LMM, as described in Appendix 6.7.2.

2.7 Simulation methods

To assess the accuracy and efficiency of each of the proposed LMM parameter estimation methods described in Sections 2.1.1-2.1.5, and of the direct-SW degrees of freedom estimation method described in Section 2.6, extensive simulations were conducted. The methods used to perform these simulations are described in this section, and the results of the simulations are presented in Section 3.1. All reported results were obtained using an Intel(R) Xeon(R) Gold 6126 2.60GHz processor with 16GB RAM.

The algorithms of Sections 2.1.1-2.1.5 have been implemented in the programming language Python. Results taken across three simulation settings, each with a different design structure, are provided for parameter estimation performed using each of the algorithms. The Fisher Scoring algorithms presented in this work are compared against one another, the R package lmer, and the baseline truth used to generate the simulations. All methods are contrasted in terms of output, computation time and, for the methods presented in this paper, the number of iterations until convergence. Convergence of each method was assessed by verifying whether successive log-likelihood estimates differed by more than a predefined tolerance. 1000 individual simulation instances were run for each simulation setting. All model parameters were held fixed across all runs.
In every simulation setting, test data was generated according to model (1). Each of the three simulation settings imposed a different structure on the random effects design and covariance matrices, Z and D. The first simulation setting employed a single-factor design (r = 1) with two random effects (i.e. q₁ = 2) and 50 levels (i.e. l₁ = 50). The second simulation setting employed a crossed two-factor design (r = 2), in which both factors had the same number of random effects (q₁ = q₂) and the first factor had l₁ = 100 levels. The third simulation setting employed a crossed three-factor design (r = 3), in which all three factors had the same number of random effects (q₁ = q₂ = q₃), with the numbers of levels given by l₁ = l₂ = 50 and l₃ = 10, respectively. In all simulations, the number of observations, n, was held fixed at 1000. In each simulated design, the first column of the fixed effects design matrix, X, and the first random effect in the random effects design matrix, Z, were treated as intercepts. Within each simulation instance, the remaining (non-zero) elements of X and Z, as well as those of the error vector ε, were generated at random according to the standard univariate normal distribution. The random effects vector b was simulated according to a normal distribution with covariance D, where D was predefined, exhibited no particular constrained structure, and contained a mixture of both zero and non-zero off-diagonal elements. The assignment of observations to levels for each factor was performed at random, with the probability of an observation belonging to any specific level held constant and uniform across all levels.

Fisher Scoring methods for the single-factor design have been well-studied (c.f. Demidenko (2013)), and the first simulation setting is included only for comparison purposes. The second and third simulation settings represent more complex crossed factor designs for which the proposed Fisher Scoring based parameter estimation methods did not previously exist. To assess the performance of the proposed algorithms, the Mean Absolute Error (MAE) and Mean Relative Difference (MRD) were used as performance metrics. The methods considered for parameter estimation were those described in Sections 2.1.1-2.1.5, and the baseline used for comparison was either the baseline truth used to generate the simulated data or the lmer-computed estimates.
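A minimal sketch of this data-generation scheme for the single-factor setting (r = 1, q₁ = 2, l₁ = 50, n = 1000). The covariance values and fixed effects below are illustrative stand-ins, not the values used in the paper's simulations:

```python
import numpy as np

rng = np.random.default_rng(2025)
n, q, l, p = 1000, 2, 50, 3

# Fixed effects design: intercept plus standard normal covariates
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
beta = np.array([1.0, 2.0, -1.0])        # illustrative fixed effects

# Random assignment of observations to levels, uniform over levels
levels = rng.integers(0, l, size=n)

# Random effects design: per observation, an intercept and one covariate,
# placed in the column block belonging to that observation's level
Z = np.zeros((n, q * l))
raw = np.column_stack([np.ones(n), rng.normal(size=n)])
for i in range(n):
    Z[i, levels[i] * q:(levels[i] + 1) * q] = raw[i]

# Illustrative (2 x 2) covariance with a non-zero off-diagonal element
Dk = np.array([[1.0, 0.3],
               [0.3, 0.5]])
D = np.kron(np.eye(l), Dk)               # block-diagonal covariance of b

b = rng.multivariate_normal(np.zeros(q * l), D)
eps = rng.normal(size=n)                 # residual errors, sigma^2 = 1
Y = X @ beta + Z @ b + eps
```

The resulting (Y, X, Z) triple has exactly the structure assumed by model (1), and can be handed to any of the estimation algorithms above.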

All methods were contrasted in terms of the MAE and MRD of both β and the variance product σ²D. The variance product σ²D was chosen for comparison instead of the individual components σ² and D, as the variance product is typically of intrinsic interest during the inference stage of conventional statistical analyses and is often employed for further computation. For each of the methods proposed in Sections 2.1.1-2.1.5, both ML and ReML estimation variants were assessed, where the ReML variants of each method were performed according to the adjustments detailed in Appendix 6.4. The computation time for each method was also recorded and contrasted against lmer. To ensure a fair comparison of computation time, the recorded computation times for lmer were based only on the performance of the 'optimizeLmer' function, which is the module employed for parameter estimation by lmer. The specific values of β, σ² and D employed for each simulation setting can be found in the Supplementary Material.

The accuracy and validity of the direct-SW degrees of freedom estimation method proposed in Section 2.6 were assessed through further simulation. To achieve this, as LMM degrees of freedom are assumed to be specific to the experiment design, a single design was chosen at random from each of the three simulation settings described in Section 2.7.1. With the fixed effects and random effects matrices, X and Z, now held constant across simulations, 1000 simulations were run for each simulation setting. The random effects vector, b, and random error vector, ε, were allowed to vary across simulations according to normal distributions with the appropriate variances. In each simulation, degrees of freedom were estimated via the direct-SW method for a predefined contrast vector, corresponding to a fixed effect that was truly zero.

The direct-SW estimated degrees of freedom were compared to a baseline truth and to the degrees of freedom estimates produced by the R package lmerTest.
Baseline truth was established in each simulation setting by simulating a large number of realisations of S²(η̂), giving a single Monte Carlo value for the denominator of v(η̂) from Equation (31). Following this, in each of the 1000 simulation instances described above, the numerator of (31) was recorded, giving 1000 estimates of v(η̂). The final estimate was obtained as an average of these 1000 values. All lmerTest degrees of freedom estimates were obtained using the same simulated data as was used by the direct-SW method and were computed using the 'contest1D' function from the lmerTest package. Whilst all baseline truth measures were computed using parameter estimates obtained via the FSFS algorithm of Section 2.1.4, we believe this has not induced bias into the simulations since, as discussed in Section 3.1.1, all simulated FSFS-derived parameter estimates considered in this work agreed with those provided by lmer within a tolerance on the scale of machine error.

2.8 Real data methods

To illustrate the usage of the methodology presented in Sections 2.1-2.6 in practical situations, we provide two real data examples. These examples are described fully in this section, and the results are presented in Section 3.2. Again, all reported results were obtained using an Intel(R) Xeon(R) Gold 6126 2.60GHz processor with 16GB RAM. For each reported computation time, averages were taken across 50 repeated runs of the given analysis.

The first example presented here is based on data from the longitudinal evaluation of school change and performance (LESCP) dataset (Turnbull et al. (1999)). This dataset has notably been previously analyzed by Hong and Raudenbush (2008) and was chosen for inclusion in this work because it previously formed the basis for between-software comparisons of LMM software packages by West et al. (2014). The LESCP study was conducted in 67 American schools, in which SAT (Student Aptitude Test) math scores were recorded for randomly selected samples of students. As in West et al.
(2014), one of the 67 schools from this dataset was chosen as the focus for this analysis. For each individual SAT score, unique identifiers (IDs) for the student who took the test and for the teacher who prepared the student for the test were recorded. As many students were taught by multiple teachers and all teachers taught multiple students, the grouping of SAT scores by student ID and the grouping of SAT scores by teacher ID constitute two crossed factors. In total, n = 234 SAT scores were considered for analysis. The SAT scores were taken from 122 students taught by a total of 12 teachers, with each student sitting between 1 and 3 tests.

The research question in this example concerns how well a student's grade (i.e. year of schooling) predicted their mathematics SAT score (i.e. did students improve in mathematics over the course of their education?). For this question, between-student variance and between-teacher variance must be taken into consideration, as different students possess different aptitudes for mathematics exams and different teachers possess different aptitudes for teaching mathematics. In practice, this is achieved by employing an LMM which includes random intercept terms for both the grouping of SAT scores according to student ID and the grouping of SAT scores according to teacher ID. For the k-th mathematics SAT test taken by the i-th student, under the supervision of the j-th teacher, such a model could be stated as follows:

\[
\mathrm{MATH}_{i,j,k} = \beta_0 + \beta_1 \times \mathrm{YEAR}_{i,j,k} + s_i + t_j + \varepsilon_{i,j,k},
\]

where MATH_{i,j,k} is the SAT score achieved and YEAR_{i,j,k} is the grade of the student at the time the test was taken. In the above model, β₀ and β₁ are unknown parameters, and s_i, t_j and ε_{i,j,k} are mean-zero random variables which differ only in terms of their covariance. s_i is the random intercept which models between-student variance, t_j is the random intercept which models between-teacher variance, and ε_{i,j,k} is the random error term. The random variables s_i, t_j and ε_{i,j,k} are assumed to be mutually independent and to follow the below distributions:

\[
s_i \sim N(0, \sigma_s^2), \qquad t_j \sim N(0, \sigma_t^2), \qquad \varepsilon_{i,j,k} \sim N(0, \sigma^2),
\]

where the parameters σ²_s, σ²_t and σ² are the unknown student, teacher and residual variance parameters, respectively. The random effects in the SAT score model can be described using the notation presented in previous sections as follows: r = 2, q₁ = q₂ = 1 (the single random effect for each factor being the random intercepts s_i and t_j, respectively), and l₁ = 122, l₂ = 12 (i.e. there are 122 students and 12 teachers). When the model is expressed in the form described by equations (1) and (2), the random effects design matrix Z is a 0-1 matrix whose non-zero pattern indicates the student and teacher associated with each test score. The random effects covariance matrices for the two factors, student ID and teacher ID, are given by D₁ = [σ²_s] and D₂ = [σ²_t], respectively.

For the SAT score model, the estimated parameters obtained using each of the methods detailed in Sections 2.1.1-2.1.5 are reported. For comparison, the parameter estimates obtained by the R package lmer are also given. As the estimates of the fixed effects parameters, β₀ and β₁, are of primary interest, ML was employed to obtain all reported parameter estimates. For this example, methods are contrasted in terms of output, computation time and the number of iterations performed.

The second example presented in this work aims to demonstrate the flexibility of the constrained covariance approaches described in Section 2.5. This example is based on the ACE model: an LMM commonly employed to analyze the results of twin studies by accounting for the complex covariance structure exhibited between related individuals. The ACE model achieves this by separating between-subject response variation into three categories: variance due to additive genetic effects (σ²_a), variance due to common environmental factors (σ²_c), and residual error (σ²_e). For this example, we utilize data from the Wu-Minn Human Connectome Project (HCP) (Van Essen et al. (2013)). The HCP dataset contains brain imaging data collected from 1,200 healthy young adults, aged between 22 and 35, including data from 300 twin pairs and their siblings. We do not make use of the imaging data in the HCP dataset but, instead, focus on the baseline variables for cognition and alertness.

The primary research question considered in this example focuses on how well a subject's quality of sleep predicts their English reading ability. The variables of primary interest used to address this question are subject scores on the Pittsburgh Sleep Quality Index (PSQI) questionnaire and an English language recognition test (ENG). Other variables included the subjects' age in years (AGE) and sex (SEX), as well as an age-sex interaction effect. A secondary research question asks: "How much of the between-subject variance observed in English reading ability test scores can be explained by additive genetic and common environmental factors?". To address this question, the covariance parameters σ²_a, σ²_c and σ²_e must be estimated.

To model the covariance components of the ACE model, a simplifying assumption was made that all family units share common environmental factors. Following the work of Winkler et al. (2015), family units were first categorized by their internal structure into what shall be referred to as "family structure types" (i.e. unique combinations of full-siblings, half-siblings and identical twin pairs which form a family unit present in the HCP dataset). In the HCP dataset, 19 such family structure types were identified. In the following text, each family structure type shall be treated as a factor in the model. For the i-th observation to belong to the j-th level of the k-th factor may be interpreted as the i-th subject belonging to the j-th family exhibiting family structure of type k. The model employed for this example is given by:

\[
\mathrm{ENG}_{k,j,i} = \beta_0 + \beta_1 \times \mathrm{AGE}_i + \beta_2 \times \mathrm{SEX}_i + \beta_3 \times \mathrm{AGE}_i \times \mathrm{SEX}_i + \beta_4 \times \mathrm{PSQI}_i + \gamma_{k,j,i} + \varepsilon_{k,j,i},
\]

where both γ_{k,j,i} and ε_{k,j,i} are mean-zero random variables. The random error, ε_{k,j,i}, is defined by ε_{k,j,i} ∼ N(0, σ²_e), and the random term γ_{k,j,i} models the within-"family unit" covariance. Further detail on the specification of γ_{k,j,i} can be found in Appendix 6.8.1.

In the notation of the previous sections, the number of factors in the ACE model, r, is equal to the number of family structure types present in the model (i.e. 19). For each family structure type present in the model, family structure type k, l_k is the number of families who exhibit such structure and q_k is the number of subjects present in any given family unit with such structure. As there is a unique random effect (i.e. a unique random variable, γ_{k,j,i}) associated with each individual subject, none of which are scaled by any coefficients, the random effects design matrix, Z, is the (n × n) identity matrix. Describing {D_k}_{k∈{1,...,r}} requires the known matrices K_{a_k} and K_{c_k}, which specify the kinship (expected shared genetic material) and the environmental effects shared between individuals, respectively (see Appendix 6.8.2 for more details). Given K_{a_k} and K_{c_k}, the covariance components σ² and {D_k}_{k∈{1,...,r}} are given by σ² = σ²_e and D_k = σ_e^{-2}(σ²_a K_{a_k} + σ²_c K_{c_k}), respectively.

As the covariance components σ²_a and σ²_c are of practical interest in this example, optimization is performed according to the ReML criterion, using the adjustments described in Appendix 6.4, and the covariance structure is constrained via the methods outlined in Section 2.5. Further detail on the constrained approach for the ACE model can be found in Appendix 6.8.3. Discussion of computational efficiency for the ACE model is also provided in Appendix 6.8.4.

Section 3.2.2 reports the maximized restricted log-likelihood values and parameter estimates obtained using the Fisher Scoring method.
Also given are approximate T-statistics for each fixed effects parameter, alongside corresponding degrees of freedom estimates and p-values obtained via the methods outlined in Section 2.6. To verify correctness, the restricted log-likelihood of the ACE model was also maximized numerically using the implementation of Powell's bi-directional search based optimization method (Powell (1964)) provided by the SciPy Python package. The maximized restricted log-likelihood values and parameter estimates produced were then contrasted against those provided by the Fisher Scoring method. The OLS (Ordinary Least Squares) estimates, which would be obtained had the additive genetic and common environmental variance components not been accounted for in the LMM analysis, are also provided for comparison.

The results of the parameter estimation simulations of Section 2.7.1 are reported in Tables 1 and 2. In all three settings, all parameter estimates and maximised likelihood criteria produced by the FS, FFS, SFS and FSFS methods were identical to those produced by lmer up to a machine error level of tolerance. In terms of the maximisation of likelihood criteria, there was no evidence to suggest that lmer outperformed any of the FS, FFS, SFS or FSFS methods, or vice versa. The CSFS method provided worse performance, however, with maximised likelihood criteria which were lower than those reported by lmer in approximately 2.5% of the simulations run for simulation setting 3. The reported maximised likelihood criteria for these simulations were all indicative of convergence failure. For each parameter estimation method considered during simulation, the observed performance was unaffected by the choice of likelihood estimation criteria (i.e. ML or ReML) employed.

The MRD and MAE values provided evidence for strong agreement between the parameter estimates produced by lmer and the FS, FFS, SFS and FSFS methods. The largest MRD values observed for these methods, taken relative to lmer, across all simulations and likelihood criteria, were 1.– × 10⁻– and 2.– × 10⁻– for β and σ²D, respectively. The largest observed MAE values for these methods, taken relative to lmer, across all simulations, were 1.– × 10⁻– and 4.– × 10⁻– for β and σ²D, respectively. These results provide clear evidence that the discrepancies between the parameter estimates produced by lmer and those produced by the FS, FFS, SFS and FSFS methods are extremely small in magnitude. Due to the extremely small magnitudes of these differences, MAE and MRD values are not reported in further detail here. For more detail, see the Supplementary Material.

Method                 FS       FFS      SFS      FSFS     CSFS     lmer
Simulation 1
  t (Time/s)           0.041    0.037    0.026    0.055    0.057    0.–
  n_it (iterations)    5.243    5.243    6.048    6.048    10.461   –
Simulation 2
  t (Time/s)           0.129    0.112    0.105    0.125    0.220    0.–
  n_it (iterations)    6.694    6.694    8.323    8.323    14.281   –
Simulation 3
  t (Time/s)           1.081    0.889    1.872    1.941    3.970    5.–
  n_it (iterations)    8.133    8.133    16.054   16.054   33.437   –

Table 1: The average time in seconds and the number of iterations reported for maximum likelihood estimation performed using the FS, FFS, SFS, FSFS and CSFS methods. For each simulation setting, results displayed are taken from averaging across 1000 individual simulations. Also given are the average times taken by the R package lmer to perform maximum likelihood parameter estimation on the same simulated data.

Method                 FS       FFS      SFS      FSFS     CSFS     lmer
Simulation 1
  t (Time/s)           0.057    0.052    0.043    0.064    0.088    0.–
  n_it (iterations)    5.261    5.261    6.053    6.053    10.404   –
Simulation 2
  t (Time/s)           0.175    0.154    0.157    0.169    0.306    0.–
  n_it (iterations)    6.758    6.758    8.413    8.413    14.283   –
Simulation 3
  t (Time/s)           1.615    1.378    2.968    3.002    6.604    7.–
  n_it (iterations)    8.084    8.084    16.018   16.018   34.611   –

Table 2: The average time in seconds and the number of iterations reported for restricted maximum likelihood estimation performed using the FS, FFS, SFS, FSFS and CSFS methods. For each simulation setting, results displayed are taken from averaging across 1000 individual simulations. Also given are the average times taken by the R package lmer to perform restricted maximum likelihood parameter estimation on the same simulated data.

As q (i.e. the number of columns in the random effects design matrix) is small in the settings considered, the results of these simulations demonstrate strong computational efficiency for the FS, FFS, SFS and FSFS methods. However, we emphasise that no claim is made to suggest that the observed results will generalise to larger values of q or to all possible designs. For large sparse LMMs that include a large number of random effects, without further adaptation of the methods presented in this work to employ sparse matrix methodology, it is expected that lmer will provide superior performance. We again emphasise here that our purpose in undertaking this work is not to compete with the existing LMM parameter estimation tools. Instead, the focus of this work is to develop methodology which provides performance comparable to the existing tools but can be used in situations in which the existing tools may not be applicable due to practical implementation considerations.

Two potential causes for the poor performance of the CSFS method have been identified below. The first cause is presented in Demidenko (2013), in which it is argued that the Cholesky parameterised algorithm will provide slower convergence than the unconstrained alternative methods due to structural differences between the likelihood surfaces over which parameter estimation is performed. This reasoning offers a likely explanation for the higher number of iterations and computation time observed for the CSFS method across all simulations. However, it does not explain the small number of simulations in simulation setting 3 in which evidence of convergence failure was observed. The second possible explanation for the poor performance of the CSFS algorithm presents a potential reason for the observed convergence failure.
Pinheiro and Bates (1996) note that the Cholesky parameterisation of an arbitrary matrix is not unique and that the number of possible choices for the Cholesky factorisation of a square positive definite matrix of dimension (q × q) increases with the dimension q. This means that, for each covariance matrix D_k, there are multiple Cholesky factors, Λ_k, which satisfy Λ_k Λ′_k = D_k. The larger D_k is in dimension, the greater the number of Λ_k that correspond to D_k. Pinheiro and Bates (1996) argue that when optimal solutions are numerous and close together in the parameter space, numerical problems can arise during optimisation. We note that, when compared with the other simulation settings considered here, the design for simulation setting 3 contains the largest number of factors and random effects. Consequently, the covariance matrices, D_k, are large for this simulation setting, and numerous Cholesky factors which correspond to the same optimal solution for the covariance matrix, D, exist. For this reason, simulation setting 3 is the most susceptible to numerical problems of the kind described by Pinheiro and Bates (1996). We suggest that this is a likely explanation for the 2.5% of simulations in which convergence failure was observed. In summary, the simulation results offer strong evidence for the correctness and efficiency of the proposed Fisher Scoring methods, with the exception of the CSFS method, which exhibits slower performance and convergence failure in rare cases.
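The non-uniqueness described above can be illustrated directly: if Λ is any factor satisfying ΛΛ′ = D, then flipping the sign of any column of Λ yields a distinct factor of the same D. A minimal numerical sketch (using NumPy; the example matrix is invented for illustration):

```python
import numpy as np

# A small positive definite covariance matrix, D.
D = np.array([[4.0, 2.0],
              [2.0, 3.0]])

# The standard (lower triangular, positive diagonal) Cholesky factor.
L = np.linalg.cholesky(D)

# Flipping the sign of any column of L gives a different factor of D:
# (L S)(L S)' = L S S' L' = L L' = D for any diagonal S with entries +-1.
S = np.diag([1.0, -1.0])
L_flipped = L @ S

assert not np.allclose(L, L_flipped)            # distinct factors...
assert np.allclose(L @ L.T, D)
assert np.allclose(L_flipped @ L_flipped.T, D)  # ...same covariance matrix

# With q columns there are at least 2**q such sign combinations, so the
# number of factors mapping to the same optimum grows with dimension q.
```

This is the mechanism behind the "numerous, close together" optima: the Cholesky-parameterised likelihood surface has many equivalent maxima, one per admissible factor of the optimal D.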

Across all degrees of freedom simulations, results indicated that the degrees of freedom estimates produced by the direct-SW method possessed both lower bias and lower variance than those produced by lmerTest. The degrees of freedom estimates, for each of the three simulation settings, are summarized in Table 3.

It can be seen from Table 3 that, throughout all simulation settings, both direct-SW and lmerTest appear to underestimate the true value of the degrees of freedom. However, the bias observed for the lmerTest estimates is notably more severe than that of the direct-SW method, suggesting that the estimates produced by direct-SW have a higher accuracy than those produced by lmerTest. The observed difference in the standard deviation of the degrees of freedom estimates between the lmerTest and direct-SW methods is less pronounced. However, in all simulation settings, lower standard deviations are reported for direct-SW, suggesting that the estimates produced by direct-SW have a higher precision than those produced by lmerTest.

For both lmerTest and direct-SW, the observed bias and variance increase with simulation complexity. The simulations provided here indicate that the severity of disagreement between direct-SW and lmerTest increases as the complexity of the random effects design increases. This observation matches the expectation that the accuracy of the numerical gradient estimation employed by lmerTest will worsen as the complexity of the parameter space increases. Reported mean squared errors are also provided, indicating further that the direct-SW method outperformed lmerTest in terms of accuracy and precision in all simulation settings.

3.2 Real data results

For the SAT score example described in Section 2.8.1, the log-likelihood, fixed effect parameter and variance component estimates produced by the Fisher Scoring algorithms were identical to those produced by lmer. Further, the reported computation times for this example suggested little difference between all methods considered in terms of computational efficiency. Of the Fisher Scoring methods considered, the FS and FFS methods took the fewest iterations to converge, while the SFS and FSFS methods took the most. The results presented here exhibit strong agreement with those reported in West et al. (2014), in which the same model was used as the basis for a between-software comparison of LMM software packages. For completeness, the full table of results can be found in the Supplementary Material. The results for the twin study example described in Section 2.8.2 are presented in Table 4.

Method                   Truth     Direct-SW   lmerTest
Simulation 1
  Mean                   –.93      910.68      906.–
  Standard deviation     –         –.36        2.–
  Mean squared error     –         –.64        24.–
Simulation 2
  Mean                   –.61      842.17      837.–
  Standard deviation     –         –.16        7.–
  Mean squared error     –         –.20        99.–
Simulation 3
  Mean                   –.01      700.96      695.–
  Standard deviation     –         –.19        17.–
  Mean squared error     –         –.01        454.–

Table 3: The mean, standard deviation and mean squared error for 1,000 degrees of freedom estimates in each simulation setting. Results are displayed for both the lmerTest and Direct-SW methods, alongside "true" mean values, which were established using the moment-matching based approach outlined in Section 2.7.2 and computed using 1,000,000 simulation instances.

In this work, we have presented derivations for, and demonstrated potential applications of, score vector and Fisher Information matrix expressions for LMMs containing multiple random factors. While many of the examples presented in this paper were benchmarked against existing software, it is not the authors' intention to suggest that the proposed methods are superior to existing software packages. Instead, this work aims to complement existing LMM parameter estimation research. This aim is realized through careful exposition of the score vectors and Fisher Information matrices and detailed description of methodology and algorithms.

Modern approaches to LMM parameter estimation typically depend upon conceptually complex mathematical operations which require support from a range of software packages and infrastructure. This work has been, in large part, motivated by current deficiencies in vectorized support for such operations. Vectorized computation is a crucial requirement for medical imaging applications, in which hundreds of thousands of mixed models must be estimated concurrently. It is with this application in mind that this work has been undertaken. We intend to pursue the application itself in future work.

Although the methods presented in this work are well suited for the desired application, we stress that there are

Estimation Method              OLS        Powell     FS
Performance
  l (Log-likelihood)           −–         −–         −–
  t (Time in seconds)          <0.001     93.32      2.–
Fixed effects parameters (Standard Errors)
  β₀ (Intercept)               121.– (–)  – (–)      – (–)
  β₁ (Age)                     −– (–)     −– (–)     −– (–)
  β₂ (Sex)                     −– (–)     −– (–)     −– (–)
  β₃ (Age:Sex)                 0.– (–)    – (–)      – (–)
  β₄ (PSQI score)              −– (–)     −– (–)     −– (–)
Covariance parameters
  σ²a (Additive genetic)       0.00       48.20      50.–
  σ²c (Common environment)     0.00       27.53      26.–
  σ²e (Residual error)         110.04     33.57      32.–
Tests for Fixed Effects
  T-statistic                  –.67       34.83      34.–
  degrees of freedom           –.00       920.59     921.–
  p-value                      <0.–*      <0.–*      <0.–*
  T-statistic                  −–         −–         −–
  degrees of freedom           –.00       907.44     908.–
  p-value                      0.–        0.–        0.–
  T-statistic                  −–         −–         −–
  degrees of freedom           –.00       833.01     832.–
  p-value                      0.–*       0.–        0.–
  T-statistic                  –.58       2.03       2.–
  degrees of freedom           –.00       830.53     830.–
  p-value                      0.–*       0.–*       0.–*
  T-statistic                  −–         −–         −–
  degrees of freedom           –.00       933.67     928.–
  p-value                      <0.–*      0.–*       0.–*

Table 4

Performance metrics, parameter estimates and approximate T-test results for the twin study example. Standard errors for the fixed effects parameter estimates are given in brackets alongside the corresponding estimates. For each model parameter, a T-statistic, direct-SW degrees of freedom estimate, and p-value are provided, corresponding to the approximate t-test for a non-zero effect. P-values that are significant at the 5% level are indicated using a * symbol.

many situations where current software packages will likely provide superior performance. One such situation can be found by observing that the Fisher Scoring method requires the storage and inversion of a matrix of dimensions (q × q). This is problematic, both in terms of memory and computational accuracy, for designs which involve very large numbers of random effects, grouping factors or factor levels. While such applications have not been considered extensively in this work, we note that many of the expressions provided in this document may benefit from combination with sparse matrix methodology to overcome this issue. We suggest that the potential for improved computation time via the combination of the Fisher Scoring approaches described in this paper with sparse matrix methodology may also be the subject of future research.

For matrices A and B of dimensions (p × q) and (m × n) respectively, the following definition of the matrix (partial) derivative has been implicitly employed:

∂vec(A)/∂vec(B) =
  [ ∂A[1,1]/∂B[1,1]   …   ∂A[p,1]/∂B[1,1]   …   ∂A[p,q]/∂B[1,1] ]
  [        ⋮           ⋱         ⋮           ⋱         ⋮        ]
  [ ∂A[1,1]/∂B[m,1]   …   ∂A[p,1]/∂B[m,1]   …   ∂A[p,q]/∂B[m,1] ]
  [        ⋮           ⋱         ⋮           ⋱         ⋮        ]
  [ ∂A[1,1]/∂B[m,n]   …   ∂A[p,1]/∂B[m,n]   …   ∂A[p,q]/∂B[m,n] ],

with the equivalent definition holding for the total derivative.
In addition, for a scalar value, a, and matrix, B, defined as above, we employ the following matrix (partial) derivative definition:

∂a/∂B =
  [ ∂a/∂B[1,1]   …   ∂a/∂B[1,n] ]
  [     ⋮         ⋱       ⋮     ]
  [ ∂a/∂B[m,1]   …   ∂a/∂B[m,n] ],

with, again, the equivalent definition holding for the total derivative. The above notation can be useful since, by definition, the two derivatives defined above satisfy the relation:

vec(∂a/∂B) = ∂a/∂vec(B).

The above matrix derivative definitions are commonly employed in the mathematical and statistical literature, but are not universal. For example, sources such as Neudecker and Wansbeek (1983) and Magnus and Neudecker (1999) define ∂vec(A)/∂vec(B) as the transpose of the definition supplied above. This discrepancy between definitions is noteworthy, as some of the referenced results used in the following proofs are the transpose of those found in the original source. For a full account and in-depth discussion of the different definitions of matrix derivatives used in the literature, we recommend the work of Turkington (2013).

Lemma 1

Let g be a column vector, A be a square matrix and {B_s} be a set of arbitrary matrices of equal size. Let K be an unstructured matrix on which none of g, A or any of the {B_s} depend. Further, assume A, {B_s}, g and K can be multiplied as required. The following matrix derivative expression then holds:

∂/∂K [ g′(A + Σ_t B_t K B′_t)⁻¹ g ]
  = −Σ_s B′_s (A′ + Σ_t B_t K′ B′_t)⁻¹ g g′ (A′ + Σ_t B_t K′ B′_t)⁻¹ B_s.   (33)

Proof

To begin, denote M = A + Σ_s B_s K B′_s. For clarity and convenience, in this proof we shall briefly depart from the usual notation of H[i,j] for the (i,j)-th element of an arbitrary matrix H and instead employ Ricci calculus notation, in which the (i,j)-th element of H is given by H_ij and all summations are made implicit. The left hand side of (33) is now given, in Ricci calculus notation, as:

∂/∂K_mn [ (g′)_i (M⁻¹)_ij g_j ] = (g′)_i g_j ∂/∂K_mn [M⁻¹]_ij.   (34)

We employ a result from Giles (2008) which states, in Ricci calculus notation:

∂[F⁻¹]_ij / ∂x = −(F⁻¹)_iq (∂F_qp/∂x) (F⁻¹)_pj.

Applying this result to (34), it can be seen that the right hand side of (34) is equal to:

−(g′)_i g_j (M⁻¹)_iq (∂M_qp/∂K_mn) (M⁻¹)_pj.

Evaluating the derivative in the above expression and rearranging the terms gives:

−(B′_s)_np (M⁻¹)_pj g_j (g′)_i (M⁻¹)_iq (B_s)_qm = −[B′_s M⁻¹ g g′ M⁻¹ B_s]_nm.

By applying a transposition to the matrices in the above expression, in order to interchange the indices m and n, the result follows. □
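Lemma 1 can be checked numerically by comparing the stated closed form against a finite-difference derivative of the scalar f(K) = g′(A + Σ_t B_t K B′_t)⁻¹ g (an illustrative sketch with invented dimensions; A is given a dominant diagonal only to keep M well conditioned):

```python
import numpy as np

rng = np.random.default_rng(1)
n, q = 4, 3
A = rng.standard_normal((n, n)) + 10 * np.eye(n)   # keep M well conditioned
Bs = [rng.standard_normal((n, q)) for _ in range(2)]
K = rng.standard_normal((q, q))
g = rng.standard_normal(n)

def f(Kmat):
    # f(K) = g' (A + sum_t B_t K B_t')^{-1} g
    M = A + sum(B @ Kmat @ B.T for B in Bs)
    return float(g @ np.linalg.solve(M, g))

# Closed form from Lemma 1 (equation 33), with M' = A' + sum_t B_t K' B_t'.
Mp_inv = np.linalg.inv(A.T + sum(B @ K.T @ B.T for B in Bs))
closed = -sum(B.T @ Mp_inv @ np.outer(g, g) @ Mp_inv @ B for B in Bs)

# Finite-difference derivative, element by element.
eps = 1e-6
fd = np.zeros((q, q))
for m in range(q):
    for nn in range(q):
        Kp = K.copy(); Kp[m, nn] += eps
        Km = K.copy(); Km[m, nn] -= eps
        fd[m, nn] = (f(Kp) - f(Km)) / (2 * eps)

assert np.allclose(closed, fd, atol=1e-4)
```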

Let A, {B_s} and K be defined as in Lemma 1. Then the following is true:

∂/∂K log|A + Σ_t B_t K B′_t| = Σ_s B′_s (A + Σ_t B_t K′ B′_t)⁻¹ B_s.   (35)

Proof

As in the proof of Lemma 1, the notation M = A + Σ_t B_t K′ B′_t will be adopted and the proof will be given in Ricci calculus notation. Expanding the derivative with respect to K_mn using the partial derivatives of the elements of M gives:

∂/∂K_mn log|M| = (∂log|M| / ∂M_ij)(∂M_ij / ∂K_mn).   (36)

We now make use of the following well-known result, which can be found in Giles (2008):

∂log|X| / ∂X_ij = (X′⁻¹)_ij.   (37)

Inserting (37) into (36) and noting that A_ij does not depend on K_mn yields:

∂/∂K_mn log|M| = (M′⁻¹)_ij ∂(B_s K B′_s)_ij / ∂K_mn.

Evaluating the derivative on the right hand side of the above now returns:

∂/∂K_mn log|M| = (M′⁻¹)_ij (B_s)_im (B′_s)_nj = (B′_s)_mi (M′⁻¹)_ij (B_s)_jn.

Combining the summation terms in the above expression yields:

∂/∂K_mn log|M| = (B′_s M′⁻¹ B_s)_mn.

Finally, converting the above into matrix notation gives (35) as desired. □
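The log-determinant derivative of Lemma 2 admits the same style of finite-difference check (an illustrative sketch with invented dimensions; as in the proof, the closed form is written in terms of the transposed matrix M′):

```python
import numpy as np

rng = np.random.default_rng(2)
n, q = 4, 3
A = rng.standard_normal((n, n)) + 10 * np.eye(n)  # dominant diagonal keeps det > 0
Bs = [rng.standard_normal((n, q)) for _ in range(2)]
K = 0.1 * rng.standard_normal((q, q))

def logdet(Kmat):
    # log |A + sum_t B_t K B_t'|
    M = A + sum(B @ Kmat @ B.T for B in Bs)
    sign, ld = np.linalg.slogdet(M)
    assert sign > 0
    return ld

# Closed form: sum_s B_s' (M')^{-1} B_s, with M' the transpose of M.
M = A + sum(B @ K @ B.T for B in Bs)
Mt_inv = np.linalg.inv(M.T)
closed = sum(B.T @ Mt_inv @ B for B in Bs)

eps = 1e-6
fd = np.zeros((q, q))
for m in range(q):
    for nn in range(q):
        Kp = K.copy(); Kp[m, nn] += eps
        Km = K.copy(); Km[m, nn] -= eps
        fd[m, nn] = (logdet(Kp) - logdet(Km)) / (2 * eps)

assert np.allclose(closed, fd, atol=1e-4)
```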

The partial derivative matrix ∂l/∂D_k is given by the following:

∂l(θ)/∂D_k = (1/2) Σ_{j=1}^{l_k} Z′_{(k,j)} V⁻¹ (ee′σ⁻² − V) V⁻¹ Z_{(k,j)}.   (38)

Proof

To derive (38), we use the expression for the log-likelihood of the LMM, given by (3). As the first term inside the brackets of (3) does not depend on D_k, we need only consider the second and third terms for differentiation. We now note that, by (4) and the block diagonal structure of D, it can be seen that:

V = I_n + ZDZ′ = I_n + Σ_{k=1}^{r} Σ_{j=1}^{l_k} Z_{(k,j)} D_k Z′_{(k,j)}.   (39)

By substituting (39) into the second term of (3), it can be seen that:

σ⁻² e′V⁻¹e = σ⁻² e′ ( I_n + Σ_{k=1}^{r} Σ_{j=1}^{l_k} Z_{(k,j)} D_k Z′_{(k,j)} )⁻¹ e.

By applying the result of Lemma 1, the partial derivative matrix of the above, taken with respect to D_k, can be evaluated to:

∂/∂D_k [ σ⁻² e′V⁻¹e ] = −σ⁻² Σ_{j=1}^{l_k} Z′_{(k,j)} V⁻¹ ee′ V⁻¹ Z_{(k,j)}.

Similarly, by substituting (39) into the third term of (3), it can be seen that:

log|V| = log| I_n + Σ_{k=1}^{r} Σ_{j=1}^{l_k} Z_{(k,j)} D_k Z′_{(k,j)} |.

By now applying the result of Lemma 2, the partial derivative matrix of the above expression, taken with respect to D_k, can be seen to be given by:

∂/∂D_k [ log|V| ] = Σ_{j=1}^{l_k} Z′_{(k,j)} V⁻¹ Z_{(k,j)}.

Combining the previous two derivative expressions with the factor of −1/2 appearing in (3), we obtain (38) as desired. □

Through applying the vectorisation operator to (38), the following corollary is obtained. This is the result stated by (13) in Section 2.1.2.
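Equation (38) lends itself to a direct numerical check. For a small design with a single random factor block (an illustrative sketch; the dimensions and values are arbitrary, and only the D-dependent log-likelihood terms are evaluated), a finite-difference derivative should reproduce the closed form:

```python
import numpy as np

rng = np.random.default_rng(5)
n, q = 8, 3
Z = rng.standard_normal((n, q))
e = rng.standard_normal(n)
sigma2 = 1.5
D = 0.5 * np.eye(q)          # covariance block, treated as unstructured here

def loglik_terms(Dmat):
    # The D-dependent part of the log-likelihood:
    # -1/2 ( log|V| + e'V^{-1}e / sigma^2 ),  with  V = I + Z D Z'.
    V = np.eye(n) + Z @ Dmat @ Z.T
    sign, ld = np.linalg.slogdet(V)
    return -0.5 * (ld + e @ np.linalg.solve(V, e) / sigma2)

# Closed form (38): 1/2 Z' V^{-1} (e e' / sigma^2 - V) V^{-1} Z.
V = np.eye(n) + Z @ D @ Z.T
Vi = np.linalg.inv(V)
closed = 0.5 * Z.T @ Vi @ (np.outer(e, e) / sigma2 - V) @ Vi @ Z

eps = 1e-6
fd = np.zeros((q, q))
for a in range(q):
    for b in range(q):
        Dp = D.copy(); Dp[a, b] += eps
        Dm = D.copy(); Dm[a, b] -= eps
        fd[a, b] = (loglik_terms(Dp) - loglik_terms(Dm)) / (2 * eps)

assert np.allclose(closed, fd, atol=1e-4)
```

Note the finite difference perturbs each element of D independently, so it recovers the partial derivative matrix of (38), not the total derivative accounting for symmetry.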

Corollary 1

The partial derivative vector of the log-likelihood with respect to vec(D_k) is given by:

∂l(θ)/∂vec(D_k) = (1/2) vec( Σ_{j=1}^{l_k} Z′_{(k,j)} V⁻¹ (ee′σ⁻² − V) V⁻¹ Z_{(k,j)} ).

Using Theorem 5.12 of Turkington (2013), which states that, in our notation:

dl(θ)/dvec(D_k) = D_{q_k} D′_{q_k} ∂l(θ)/∂vec(D_k),

the following corollary is now obtained.

Corollary 2

The total derivative vector of the log-likelihood with respect to vec(D_k) is given by:

dl(θ)/dvec(D_k) = (1/2) D_{q_k} D′_{q_k} vec( Σ_{j=1}^{l_k} Z′_{(k,j)} V⁻¹ (ee′σ⁻² − V) V⁻¹ Z_{(k,j)} ).

Finally, by noting that the vectorisation and half-vectorisation operators satisfy the relationship:

vech(D_k) = D⁺_{q_k} vec(D_k),

the following corollary is obtained. This is the result stated by (9) in Section 2.1.1.

Corollary 3

The total derivative vector of the log-likelihood with respect to vech(D_k) is given by:

dl(θ)/dvech(D_k) = (1/2) D′_{q_k} vec( Σ_{j=1}^{l_k} Z′_{(k,j)} V⁻¹ (ee′σ⁻² − V) V⁻¹ Z_{(k,j)} ).

This result diverges slightly from the proof given for the single-factor LMM in Demidenko (2013), which uses the inverse duplication matrix, D⁺_{q_k}, in the place of the transposed duplication matrix, D′_{q_k}, in the above. The authors believe that the derivative given in Demidenko (2013) is, in fact, vech(∂l(θ)/∂vec(D_k)), and not dl(θ)/dvech(D_k) as stated. It should be noted, however, that this is a minor amendment and has little impact on the performance of the algorithm in practice.

6.2 Fisher Information matrix

In this appendix, we derive the Fisher Information matrix for the full representation of the parameter vector, θ_f. The derivation of the elements I^f_{β,β}, I^f_{β,σ²} and I^f_{σ²,σ²} for the multi-factor LMM is identical to that used for the single-factor LMM given in Demidenko (2013) and, therefore, will not be repeated here. Provided here are only derivations of the components of the Fisher Information matrix which relate to D_k, for some factor f_k. To derive these results, we will follow a similar argument to that used by Demidenko (2013) for the single-factor setting and will begin by noting that the Fisher Information matrix elements, {I^f_{a,b}}_{a,b ∈ {β, σ², vec(D_1),...,vec(D_r)}}, can be expressed by the following formula:

I^f_{a,b} = cov( ∂l(θ_f)/∂a , ∂l(θ_f)/∂b ).

Theorems 2 and 3 provide derivation of equation (14) and Theorem 4 provides derivation of equation (15). Following this, Corollaries 4-6 detail the derivation of equations (11) and (12).
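The duplication matrix identities used throughout these derivations (vec(D) = D_q vech(D), vech(D) = D⁺_q vec(D), and D⁺_q D_q = I) can be verified numerically. The sketch below constructs D_q for q = 3 (illustrative code, not from the paper):

```python
import numpy as np

def duplication_matrix(q):
    # D_q maps vech(X) to vec(X) for a symmetric (q x q) matrix X.
    D = np.zeros((q * q, q * (q + 1) // 2))
    col = 0
    for j in range(q):
        for i in range(j, q):          # vech runs down columns, i >= j
            D[j * q + i, col] = 1.0    # entry (i, j) of X in vec order
            D[i * q + j, col] = 1.0    # mirrored entry (j, i)
            col += 1
    return D

q = 3
Dq = duplication_matrix(q)
Dq_plus = np.linalg.pinv(Dq)           # D_q^+ = (D_q' D_q)^{-1} D_q'

# A random symmetric matrix and its vec / vech.
X = np.random.default_rng(3).standard_normal((q, q))
X = X + X.T
vec = X.flatten(order="F")                                        # column-stack
vech = np.array([X[i, j] for j in range(q) for i in range(j, q)])  # lower half

assert np.allclose(Dq @ vech, vec)       # vec(X) = D_q vech(X)
assert np.allclose(Dq_plus @ vec, vech)  # vech(X) = D_q^+ vec(X)
assert np.allclose(Dq_plus @ Dq, np.eye(q * (q + 1) // 2))  # D_q^+ D_q = I
```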

Theorem 2

For any arbitrary integer k between 1 and r, the covariance of the partial derivatives of l(θ_f) with respect to β and vec(D_k) is given by:

I^f_{β,vec(D_k)} = cov( ∂l(θ_f)/∂β , ∂l(θ_f)/∂vec(D_k) ) = 0_{p,q²_k}.

Proof

First, let u and T_{(k,j)} denote the following quantities:

u = σ⁻¹ V⁻¹/² e,   T_{(k,j)} = Z′_{(k,j)} V⁻¹/².   (40)

As e ∼ N(0, σ²V), it follows trivially that u ∼ N(0, I_n). Let c be an arbitrary column vector of length q_k. Noting that the derivative of the log-likelihood function has mean zero, the below can be seen to be true:

cov( ∂l(θ|y)/∂β , c′ (∂l(θ|y)/∂D_k) c ) = E[ (∂l(θ|y)/∂β) c′ (∂l(θ|y)/∂D_k) c ].

By rewriting in terms of u and T_{(k,j)}, and noting that E[u] = 0, the right hand side of the above can be seen to simplify to:

E[ σ⁻¹ X′V⁻¹/² u c′ ( Σ_{j=1}^{l_k} (T_{(k,j)}u)(T_{(k,j)}u)′ ) c ] = σ⁻¹ X′V⁻¹/² Σ_{j=1}^{l_k} E[ u c′ T_{(k,j)} u u′ ] T′_{(k,j)} c = 0.

That the above is equal to a matrix of zeros follows directly from noting that the third moment of the Normal distribution is 0. The result of the theorem now follows. □

Theorem 3

For any arbitrary integer k between 1 and r, the covariance of the partial derivatives of l(θ_f) with respect to σ² and vec(D_k) is given by:

I^f_{σ²,vec(D_k)} = cov( ∂l(θ_f)/∂σ² , ∂l(θ_f)/∂vec(D_k) ) = (1/(2σ²)) vec′( Σ_{j=1}^{l_k} Z′_{(k,j)} V⁻¹ Z_{(k,j)} ).

Proof

To begin, u and T_{(k,j)} are defined as in (40). Rewriting the covariance in Theorem 3 in terms of (40) and removing constant terms gives:

(1/(4σ²)) cov( u′u , vec( Σ_{j=1}^{l_k} (T_{(k,j)}u)(T_{(k,j)}u)′ ) ).

Noting that the Kronecker product satisfies the property vec(aa′) = a ⊗ a for all column vectors a, we obtain that the above is equal to:

(1/(4σ²)) cov( u′u , Σ_{j=1}^{l_k} (T_{(k,j)}u) ⊗ (T_{(k,j)}u) ).

We now note that the Kronecker product satisfies vec(ABC) = (C′ ⊗ A)vec(B) for arbitrary matrices A, B and C of appropriate dimensions. Utilizing this and applying the mixed product property of the Kronecker product to the above expression yields:

(1/(4σ²)) cov( vec′(I_n)(u ⊗ u) , Σ_{j=1}^{l_k} (T_{(k,j)} ⊗ T_{(k,j)})(u ⊗ u) ).

By moving constant values outside of the covariance function using standard results, the above becomes:

(1/(4σ²)) vec′(I_n) cov(u ⊗ u) Σ_{j=1}^{l_k} (T_{(k,j)} ⊗ T_{(k,j)})′.

Noting that u ∼ N(0, I_n), we now employ a result from Magnus and Neudecker (1986) which states that cov(u ⊗ u) = 2N_n. Substituting this result into the previous expression gives:

(1/(2σ²)) vec′(I_n) N_n Σ_{j=1}^{l_k} (T_{(k,j)} ⊗ T_{(k,j)})′.

We now note a further result given in Magnus and Neudecker (1986): the matrix N_n satisfies (A ⊗ A)N_k = N_n(A ⊗ A) for all A, n and k such that the resulting matrix multiplications are well defined. Applying this result to the above expression, the below can be obtained:

(1/(2σ²)) vec′(I_n) Σ_{j=1}^{l_k} (T_{(k,j)} ⊗ T_{(k,j)})′ N_{q_k}.

Finally, again using the relationship vec(ABC) = (C′ ⊗ A)vec(B), which is satisfied by the Kronecker product for arbitrary matrices A, B and C of appropriate dimensions, the above can be seen to reduce to:

(1/(2σ²)) vec′( Σ_{j=1}^{l_k} T_{(k,j)} T′_{(k,j)} ) N_{q_k}.

Using the definition of T_{(k,j)}, given in (40), and noting that the matrix N_{q_k} satisfies the relationship vec′(A)N_{q_k} = vec′(A) for all appropriately sized symmetric matrices A, the result now follows. □

Theorem 4

For any arbitrary integers k and k′ between 1 and r, the covariance of the partial derivatives of l(θ_f) with respect to vec(D_k) and vec(D_{k′}) is given by:

I^f_{vec(D_k),vec(D_{k′})} = cov( ∂l(θ_f)/∂vec(D_k) , ∂l(θ_f)/∂vec(D_{k′}) )
  = (1/2) N_{q_k} Σ_{j=1}^{l_{k′}} Σ_{i=1}^{l_k} ( Z′_{(k,i)} V⁻¹ Z_{(k′,j)} ⊗ Z′_{(k,i)} V⁻¹ Z_{(k′,j)} ).

Proof

By Corollary 1, and using the notation introduced in (40), it can be seen that the partial derivative vector with respect to vec(D_k) is given by:

∂l(θ|y)/∂vec(D_k) = (1/2) vec( Σ_{j=1}^{l_k} (T_{(k,j)}u)(T_{(k,j)}u)′ − Σ_{j=1}^{l_k} T_{(k,j)} T′_{(k,j)} ).

By substituting the above into the covariance expression appearing in Theorem 4 and removing constant terms, the below is obtained:

(1/4) cov( Σ_{j=1}^{l_k} vec[ (T_{(k,j)}u)(T_{(k,j)}u)′ ] , Σ_{j=1}^{l_{k′}} vec[ (T_{(k′,j)}u)(T_{(k′,j)}u)′ ] ).

Noting that, for any column vector a, the Kronecker product satisfies the property vec(aa′) = a ⊗ a, the above can be seen to be equal to:

(1/4) cov( Σ_{j=1}^{l_k} (T_{(k,j)}u) ⊗ (T_{(k,j)}u) , Σ_{j=1}^{l_{k′}} (T_{(k′,j)}u) ⊗ (T_{(k′,j)}u) ).

Using the mixed product property of the Kronecker product and moving constant matrices outside of the covariance function, using standard results, gives:

(1/4) Σ_{j=1}^{l_k} (T_{(k,j)} ⊗ T_{(k,j)}) cov(u ⊗ u) Σ_{j=1}^{l_{k′}} (T_{(k′,j)} ⊗ T_{(k′,j)})′.

Noting that u ∼ N(0, I_n) and using the result from Magnus and Neudecker (1986) which states that cov(u ⊗ u) = 2N_n yields:

(1/2) Σ_{j=1}^{l_k} (T_{(k,j)} ⊗ T_{(k,j)}) N_n Σ_{j=1}^{l_{k′}} (T′_{(k′,j)} ⊗ T′_{(k′,j)}).

We now utilize the fact that, as noted by Magnus and Neudecker (1986), the matrix N_n satisfies (A ⊗ A)N_k = N_n(A ⊗ A) for all A, n and k such that the resulting matrix multiplications are well defined. Using this and the mixed product property of the Kronecker product, the above can now be seen to be equal to:

(1/2) N_{q_k} Σ_{j=1}^{l_{k′}} Σ_{i=1}^{l_k} ( T_{(k,i)} T′_{(k′,j)} ) ⊗ ( T_{(k,i)} T′_{(k′,j)} ).

From the definition of T_{(k,j)}, it can be seen that the above is equal to the result of Theorem 4.
□

We now turn attention to the derivation of (11) and (12). As in the proofs of Corollaries 2 and 3 of Appendix 6.1, we begin by noting that:

dl(θ)/dvech(D_k) = D⁺_{q_k} dl(θ)/dvec(D_k)
                 = D⁺_{q_k} D_{q_k} D′_{q_k} ∂l(θ)/∂vec(D_k)
                 = D′_{q_k} ∂l(θ)/∂vec(D_k),

where the first equality follows from the definition of the duplication matrix and the second equality follows from Theorem 5.12 of Turkington (2013). Applying the above identity to Theorems 2, 3 and 4, and moving the matrix D′_{q_k} outside the covariance function in each, leads to the following three corollaries which, when taken in combination, provide equations (11) and (12).

Corollary 4

For any arbitrary integer k between 1 and r, the covariance of the total derivatives of l(θ_h) with respect to β and vech(D_k) is given by:

I^h_{β,vech(D_k)} = cov( dl(θ_h)/dβ , dl(θ_h)/dvech(D_k) ) = 0_{p,q_k(q_k+1)/2}.

Corollary 5

For any arbitrary integer k between 1 and r, the covariance of the total derivatives of l(θ_h) with respect to σ² and vech(D_k) is given by:

I^h_{σ²,vech(D_k)} = cov( dl(θ_h)/dσ² , dl(θ_h)/dvech(D_k) ) = (1/(2σ²)) vec′( Σ_{j=1}^{l_k} Z′_{(k,j)} V⁻¹ Z_{(k,j)} ) D_{q_k}.

Corollary 6

For any arbitrary integers k and k′ between 1 and r, the covariance of the total derivatives of l(θ_h) with respect to vech(D_k) and vech(D_{k′}) is given by:

I^h_{vech(D_k),vech(D_{k′})} = cov( dl(θ_h)/dvech(D_k) , dl(θ_h)/dvech(D_{k′}) )
  = (1/2) D′_{q_k} Σ_{j=1}^{l_{k′}} Σ_{i=1}^{l_k} ( Z′_{(k,i)} V⁻¹ Z_{(k′,j)} ⊗ Z′_{(k,i)} V⁻¹ Z_{(k′,j)} ) D_{q_{k′}}.

We note that the results of Corollaries 4, 5 and 6 do not contain the matrix N_{q_k}, which appears in the corresponding theorems (Theorems 2, 3 and 4). This is due to another result of Magnus and Neudecker (1986), which states that D′_k N_k = D′_k for any integer k. This concludes the derivations of the Fisher Information matrix expressions given in Sections 2.1.1-2.1.4.

Lemma 3

Let $n$, $p$ and $k$ be arbitrary positive integers with $p < n$. Let $\tilde{N}$ be of dimension $(n \times n)$ with $\text{rank}(\tilde{N}) < n$, $\tilde{K}$ be of dimension $(n \times p)$ with $\text{rank}(\tilde{K}) = p$, $C$ be of dimension $(n \times n)$ with $\text{rank}(C) = n$, and $\delta$ be of size $(n \times k)$. If the following properties are satisfied:

1. $\tilde{K}\tilde{K}^+ = \tilde{N}$, where $\tilde{K}^+ = (\tilde{K}'\tilde{K})^{-1}\tilde{K}'$ is the generalized inverse of $\tilde{K}$.
2. $\tilde{N}$ is symmetric and idempotent.
3. $\tilde{N}C = C\tilde{N} = \tilde{N}C\tilde{N}$.
4. $\tilde{N}\delta = \delta$,

then it follows that:

$$(\tilde{N}C)^+\delta = C^{-1}\delta.$$

Proof

In proving the above lemma, we make use of the following result given by Barnett (1990):

- If $A$ is a matrix of dimension $(m \times l)$ with $\text{rank}(A) = l < m$, and $B$ is a matrix of dimension $(l \times m)$ with $\text{rank}(B) = l$, then:

$$(AB)^+ = B'(BB')^{-1}(A'A)^{-1}A'. \qquad (41)$$

By property 1, we have that:

$$(\tilde{N}C)^+ = (\tilde{K}\tilde{K}^+C)^+.$$

By applying (41) to the above, with $A = \tilde{K}$ and $B = \tilde{K}^+C$, and simplifying the result, the following is obtained:

$$(\tilde{K}\tilde{K}^+C)^+ = C'\tilde{K}^{+\prime}(\tilde{K}^+CC'\tilde{K}^{+\prime})^{-1}\tilde{K}^+. \qquad (42)$$

We now consider the inversion at the centre of the above expression, i.e. the inversion of $(\tilde{K}^+CC'\tilde{K}^{+\prime})$. This inverse is given by:

$$(\tilde{K}^+CC'\tilde{K}^{+\prime})^{-1} = \tilde{K}'(C')^{-1}C^{-1}\tilde{K}.$$

We will demonstrate this by showing that the product of the matrix inside the inversion on the left-hand side and the matrix on the right-hand side is the identity matrix:

$$(\tilde{K}^+CC'\tilde{K}^{+\prime})(\tilde{K}'(C')^{-1}C^{-1}\tilde{K}) = \tilde{K}^+CC'\tilde{K}^{+\prime}\tilde{K}'(C')^{-1}C^{-1}\tilde{K}.$$

Using properties 1, 2 and 3 we see that the above is equal to:

$$\tilde{K}^+\tilde{N}CC'(C')^{-1}C^{-1}\tilde{K}.$$

Simplifying and applying property 1 now yields that the above is equal to the below:

$$\tilde{K}^+\tilde{N}\tilde{K} = \tilde{K}^+\tilde{K} = I,$$

where $I$ is the identity matrix, as required. Returning to (42) and applying properties 1 and 2, we now have that:

$$(\tilde{N}C)^+ = C'\tilde{N}(C')^{-1}C^{-1}\tilde{N}.$$

By applying properties 2 and 3, the above can now be reduced to:

$$(\tilde{N}C)^+ = \tilde{N}C^{-1}\tilde{N}. \qquad (43)$$

We now note that, by property 3, $\tilde{N}C = C\tilde{N}$. By multiplying the left and right sides of this identity by $C^{-1}$, it can be seen that:

$$C^{-1}\tilde{N} = \tilde{N}C^{-1}.$$

Multiplying the right-hand side of the above by $\tilde{N}$ and applying property 2, we obtain:

$$C^{-1}\tilde{N} = \tilde{N}C^{-1}\tilde{N}.$$

Substituting the above into (43) gives:

$$(\tilde{N}C)^+ = C^{-1}\tilde{N}.$$

The result of Lemma 3 now follows by multiplying the right-hand side by $\delta$ and applying property 4. ⊓⊔

Theorem 5

The below two update rules are equivalent, where $F(\theta^f_s)$ is the matrix defined in Section 2.1.2 and $I(\theta^f_s)$ is the Fisher Information matrix of $\theta^f_s$:

$$\theta^f_{s+1} = \theta^f_s + \alpha_s F(\theta^f_s)^{-1}\frac{\partial l(\theta^f_s)}{\partial \theta^f}, \qquad \theta^f_{s+1} = \theta^f_s + \alpha_s I(\theta^f_s)^{+}\frac{\partial l(\theta^f_s)}{\partial \theta^f}.$$

Proof

To prove Theorem 5, it suffices to prove the below equality:

$$I(\theta^f_s)^{+}\frac{\partial l(\theta^f_s)}{\partial \theta^f} = F(\theta^f_s)^{-1}\frac{\partial l(\theta^f_s)}{\partial \theta^f}.$$

To prove the above, we employ Lemma 3. To achieve this, we proceed by defining:

- $\tilde{N}$ to be the block diagonal matrix $\text{diag}(I_{p+1}, N_{q_1}, \ldots, N_{q_r})$,
- $\tilde{K}$ to be the block diagonal matrix $\text{diag}(I_{p+1}, D_{q_1}, \ldots, D_{q_r})$,
- $\delta$ to be the column vector $\frac{\partial l(\theta^f)}{\partial \theta^f}$,
- $C$ to be the matrix $F(\theta^f)$,

and claim that $F(\theta^f)\tilde{N} = I(\theta^f)$. To prove this equality, we first note the following two properties of the matrix $N_k$, given by Magnus and Neudecker (1986):

1. $N_{k_1}(A \otimes A) = N_{k_1}(A \otimes A)N_{k_2} = (A \otimes A)N_{k_2}$ for any arbitrary matrix $A$ and scalars $k_1$ and $k_2$, such that the matrix multiplications are well defined.
2. $N_k\,\text{vec}(A) = \text{vec}(A)$ for any arbitrary symmetric matrix $A$ of appropriate dimension.

The first of the above properties, in combination with equations (15) and (17), can be seen to imply that, for any $k_1$ and $k_2$ between 1 and $r$:

$$I^f_{\text{vec}(D_{k_1}),\text{vec}(D_{k_2})} = F_{\text{vec}(D_{k_1}),\text{vec}(D_{k_2})}N_{q_{k_2}}.$$

In a similar manner, the second of the above properties, in combination with equation (14), can be seen to imply that, for any $k$ between 1 and $r$:

$$I^f_{\sigma^2,\text{vec}(D_k)} = F_{\sigma^2,\text{vec}(D_k)}N_{q_k}.$$

Expanding the multiplication $F(\theta^f_s)\tilde{N}$ block-wise demonstrates that the two equations above are sufficient to prove the equality $F(\theta^f_s)\tilde{N} = I(\theta^f_s)$. The matrices $\tilde{N}$, $\tilde{K}$ and $C$ can be seen to satisfy properties 1-3 of Lemma 3 by noting standard results concerning the matrices $D_k$, $K_{m,n}$ and $N_n$, provided in Magnus and Neudecker (1986). Property 4 of Lemma 3 can be seen to hold for the matrix $\tilde{N}$ and column vector $\delta$ by noting the symmetry of $\frac{\partial l(\theta^f_s)}{\partial D_k}$ (see Theorem 1). By combining the previous arguments and applying Lemma 3, the below is obtained:

$$I(\theta^f)^{+}\delta = (F(\theta^f)\tilde{N})^{+}\delta = F(\theta^f)^{-1}\delta.$$

The result of Theorem 5 now follows. ⊓⊔

Theorem 6

For any fixed integer $k$ between $1$ and $r$, the below two update rules are equivalent, where $F_{\text{vec}(D_{k,s})}$ is as defined in Section 2.1.2 and $I^f_{\text{vec}(D_{k,s})}$ is the Fisher Information matrix of $\text{vec}(D_k)$:

$$\text{vec}(D_{k,s+1}) = \text{vec}(D_{k,s}) + \alpha_s F^{-1}_{\text{vec}(D_{k,s})}\frac{\partial l(\theta^f_s)}{\partial \text{vec}(D_{k,s})}, \qquad \text{vec}(D_{k,s+1}) = \text{vec}(D_{k,s}) + \alpha_s\left(I^f_{\text{vec}(D_{k,s})}\right)^{+}\frac{\partial l(\theta^f_s)}{\partial \text{vec}(D_{k,s})}.$$

Proof

The proof of the above proceeds by defining $\tilde{N} = N_{q_k}$, $\tilde{K} = D_{q_k}$, $C = F_{\text{vec}(D_k)}$ and $\delta$ to be the partial derivative appearing in the statement of the theorem. Following this, the proof follows the same structure as that of Theorem 5. For brevity, the argument is not repeated here. ⊓⊔

The ReML criterion is based upon the likelihood of the residual vector, $e$, instead of the response vector, $Y$. Neglecting constant terms, the Restricted Maximum log-likelihood function, $l_R$, is given by:

$$l_R(\theta^h) = l(\theta^h) - \frac{1}{2}\left(-p\log(\sigma^2) + \log|X'V^{-1}X|\right), \qquad (44)$$

where $l(\theta^h)$ is given in (3). To derive the ReML-based FS algorithm, akin to that described in Section 2.1.1, the following adjustments to the score vectors given by (8) and (9) must be used:

$$\frac{dl_R(\theta^h)}{d\sigma^2} = \frac{dl(\theta^h)}{d\sigma^2} + \frac{p}{2\sigma^2},$$

$$\frac{dl_R(\theta^h)}{d\text{vech}(D_k)} = \frac{dl(\theta^h)}{d\text{vech}(D_k)} + \frac{1}{2}D'_{q_k}\text{vec}\left(\sum_{j=1}^{l_k} Z'_{(k,j)}V^{-1}X(X'V^{-1}X)^{-1}X'V^{-1}Z_{(k,j)}\right).$$

The latter of the above results can be derived through trivial adjustment of the proofs of Theorem 1 and Corollaries 1-3 given in Appendix 6.1. As the derivative of $l_R(\theta^h)$ with respect to $\beta$ is identical to that of $l(\theta^h)$, given in Section 2.1.1, we do not list it again above.

For a derivation of the ReML score vectors of $\beta$ and $\sigma^2$, we direct the reader to the work of Demidenko (2013), where proofs may be found for the single-factor LMM. As these proofs do not depend on the number of factors in the model, they apply in the multi-factor LMM setting without further adjustment.

Due to the asymptotic equivalence of the ML and ReML estimates, the parameter estimates produced by ML and ReML have the same asymptotic covariance matrix.
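As an illustrative check of the $\sigma^2$ score adjustment above, the analytic ReML score can be compared against a finite-difference derivative of $l_R$. The sketch below is not the paper's implementation; the matrix $V$, the residual vector and all numerical values are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 12, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
A = rng.normal(size=(n, n))
V = np.eye(n) + 0.1 * A @ A.T          # arbitrary SPD marginal covariance
Vinv = np.linalg.inv(V)
e = rng.normal(size=n)                 # stand-in residual vector Y - X beta

def l(s2):
    """Log-likelihood as a function of sigma^2, constants dropped."""
    return -0.5 * (n * np.log(s2) + np.linalg.slogdet(V)[1] + e @ Vinv @ e / s2)

def l_R(s2):
    """ReML log-likelihood, following the form of equation (44)."""
    return l(s2) - 0.5 * (-p * np.log(s2) + np.linalg.slogdet(X.T @ Vinv @ X)[1])

s2 = 1.3
score_ml = -n / (2 * s2) + e @ Vinv @ e / (2 * s2**2)   # analytic dl/d(sigma^2)
score_reml = score_ml + p / (2 * s2)                    # adjusted score, as above
h = 1e-6
fd = (l_R(s2 + h) - l_R(s2 - h)) / (2 * h)              # central finite difference
assert abs(fd - score_reml) < 1e-5
```

The finite-difference derivative of $l_R$ agrees with the ML score plus the $p/(2\sigma^2)$ correction, matching the adjustment stated above.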
Consequently, the Fisher Information matrix for the parameter estimates produced by ReML, which can be shown to be equal to the inverse of the asymptotic covariance matrix, is identical to that of the parameter estimates produced by ML. As a result, the ReML-based FS algorithm utilizes the Fisher Information matrix specified by (10)-(12) and requires no further derivation.

To summarize, the ReML-based FS algorithm for the multi-factor LMM is almost identical to that given in Algorithm 1. The only adaptation required occurs on line 3 of Algorithm 1, where the score vectors must be substituted for their ReML counterparts, provided above. Analogous adjustments can be made for the algorithms presented in Sections 2.1.2-2.1.5.

6.5 Constraint Matrices

In this appendix, we provide further detail on the approach to constrained optimization which was outlined in Section 2.5. We begin by providing examples of how the notation described in Section 2.5 may be used in practice. For a given $k$, we introduce the notation $\{d_i\}$ and $\{\tilde{d}_i\}$ to represent the sets of unique parameters required to specify the constrained and unconstrained representations of $D_k$, respectively. For example, if $D_k$ is $(3 \times 3)$ in dimension and we wish to induce Toeplitz structure onto $D_k$, then $D_k$ can be considered to take the following "constrained" and "unconstrained" representations:

$$D_k = \begin{bmatrix} d_1 & d_4 & d_7 \\ d_2 & d_5 & d_8 \\ d_3 & d_6 & d_9 \end{bmatrix} = \begin{bmatrix} \tilde{d}_1 & \tilde{d}_2 & \tilde{d}_3 \\ \tilde{d}_2 & \tilde{d}_1 & \tilde{d}_2 \\ \tilde{d}_3 & \tilde{d}_2 & \tilde{d}_1 \end{bmatrix}.$$

The notation $\text{vecu}(A)$ is used to denote the column vector of unique variables in the matrix $A$, given in order of first occurrence as listed in column-major order. In the above example, this implies $\text{vecu}(D_k) = [\tilde{d}_1, \tilde{d}_2, \tilde{d}_3]'$ whilst $\text{vec}(D_k) = [d_1, \ldots, d_9]'$. Using this notation, the constraint matrix $C_k$ may be defined element-wise. To see this, note that the chain rule for the total derivative is given by:

$$\frac{dl(\theta^{con})}{d\tilde{d}_j} = \sum_i \frac{\partial d_i}{\partial \tilde{d}_j}\frac{\partial l(\theta^{con})}{\partial d_i}.$$
Comparing this with the derivative expression provided in equation (30), it can be seen that the elements of $C_k$ are given by:

$$C_k[j,i] = \frac{\partial d_i}{\partial \tilde{d}_j}.$$

In general, derivation of the constraint matrix for the $k$th factor is relatively straightforward for most practical applications. Table 5 supports this statement by providing examples of the constraint matrix for the $k$th factor, operating under the assumption that $D_k$ exhibits one of several commonly employed covariance structures. In addition, we note that two instances in which a constraint matrix approach was employed implicitly can be found in the main body of this work. In Section 2.1.1, we took $C_k = D'_{q_k}$ to enforce symmetry in $D_k$ and, in Section 2.1.5, we took $C_k = (d\text{vech}(D_k)/d\text{vech}(\Lambda_k))D'_{q_k}$ in order to derive the Fisher Scoring update in terms of the Cholesky factor, $\Lambda_k$.

Table 5: The constraint matrix for the $k$th factor, $C_k$, in the setting in which the $k$th factor groups 3 random effects, under various covariance structures. Here $\text{vec}(D_k) = [d_1, \ldots, d_9]'$ is ordered column-major and $\tilde{d}_1, \tilde{d}_2, \ldots$ denote the unique parameters of each structure.

- Diagonal, $D_k = \text{diag}(\tilde{d}_1, \tilde{d}_2, \tilde{d}_3)$: $C_k = \begin{bmatrix}1&0&0&0&0&0&0&0&0\\0&0&0&0&1&0&0&0&0\\0&0&0&0&0&0&0&0&1\end{bmatrix}.$
- Variance components, $D_k = \tilde{d}_1I_3$: $C_k = \begin{bmatrix}1&0&0&0&1&0&0&0&1\end{bmatrix}.$
- Toeplitz, $D_k[i,j] = \tilde{d}_{|i-j|+1}$: $C_k = \begin{bmatrix}1&0&0&0&1&0&0&0&1\\0&1&0&1&0&1&0&1&0\\0&0&1&0&0&0&1&0&0\end{bmatrix}.$
- Compound symmetry, $D_k = \tilde{d}_1I_3 + \tilde{d}_2(J_3 - I_3)$ with $J_3$ the matrix of ones: $C_k = \begin{bmatrix}1&0&0&0&1&0&0&0&1\\0&1&1&1&0&1&1&1&0\end{bmatrix}.$
- AR(1), $D_k[i,j] = \tilde{d}_1\tilde{d}_2^{|i-j|}$: $C_k = \begin{bmatrix}1&\tilde{d}_2&\tilde{d}_2^2&\tilde{d}_2&1&\tilde{d}_2&\tilde{d}_2^2&\tilde{d}_2&1\\0&\tilde{d}_1&2\tilde{d}_1\tilde{d}_2&\tilde{d}_1&0&\tilde{d}_1&2\tilde{d}_1\tilde{d}_2&\tilde{d}_1&0\end{bmatrix}.$

It is also worth noting that this approach to constrained optimization allows for different covariance structures to be enforced on different factors in the LMM. For instance, a researcher may enforce the first factor in an LMM to have an AR(1) structure while allowing the second factor to exhibit a completely different structure, such as compound symmetry. As noted in Section 2.5, this approach can also be extended further to allow all elements of the vectorized covariance matrices, $\text{vec}(D_1), \ldots, \text{vec}(D_r)$, to be expressed as functions of one set of parameters, $\rho_D$, common to all $\{\text{vec}(D_k)\}_{k\in\{1,\ldots,r\}}$. To further detail this, we define $v(D)$ and extend the definition of $\theta^{con}$:

$$v(D) = \begin{bmatrix}\text{vec}(D_1) \\ \vdots \\ \text{vec}(D_r)\end{bmatrix}, \qquad \theta^{con} = \begin{bmatrix}\beta \\ \sigma^2 \\ \rho_D\end{bmatrix}.$$

Through the same process used to obtain the equations presented in Section 2.5, the below can be obtained:

$$\frac{dl(\theta^{con})}{d\rho_D} = C\frac{\partial l(\theta^{con})}{\partial v(D)}, \qquad I^{con}_{\rho_D} = CI^f_{v(D)}C', \qquad (45)$$

where $C$ is the constraint matrix, defined to be the Jacobian matrix of partial derivatives of $v(D)$ taken with respect to $\rho_D$. The score vector and Fisher Information matrix associated to $v(D)$ are readily seen to be given by block-wise combinations of the matrix expressions (13) and (15).

6.6 Cholesky Fisher Scoring

In this appendix, we prove the identity stated by the below theorem, Theorem 7, which was utilized in Section 2.1.5.

Theorem 7

Let $D_k$ be a square symmetric positive-definite matrix with Cholesky decomposition given by $D_k = \Lambda_k\Lambda'_k$, where $\Lambda_k$ is the lower triangular Cholesky factor. The below expression gives the derivative of $\text{vech}(D_k)$ with respect to $\text{vech}(\Lambda_k)$:

$$\frac{\partial\text{vech}(D_k)}{\partial\text{vech}(\Lambda_k)} = L_{q_k}(\Lambda'_k \otimes I_{q_k})(I_{q_k^2} + K_{q_k,q_k})D_{q_k}.$$

Proof

By the chain rule for vector-valued functions, as stated by Turkington (2013), the derivative in Theorem 7 can be expanded in the following manner:

$$\frac{\partial\text{vech}(D_k)}{\partial\text{vech}(\Lambda_k)} = \frac{\partial\text{vec}(\Lambda_k)}{\partial\text{vech}(\Lambda_k)}\frac{\partial\text{vec}(D_k)}{\partial\text{vec}(\Lambda_k)}\frac{\partial\text{vech}(D_k)}{\partial\text{vec}(D_k)}.$$

We now consider each of the above derivatives in turn. The first derivative in the product is given by Theorem 5.9 of Turkington (2013), which states:

$$\frac{\partial\text{vec}(\Lambda_k)}{\partial\text{vech}(\Lambda_k)} = L_{q_k}.$$

The second derivative in the product is given by a result of Magnus and Neudecker (1999), which states that:

$$\frac{\partial\text{vec}(\Lambda_k\Lambda'_k)}{\partial\text{vec}(\Lambda_k)} = (\Lambda'_k \otimes I_{q_k})(I_{q_k^2} + K_{q_k,q_k}).$$

For the final derivative in the product, Theorem 5.10 of Turkington (2013) gives:

$$\frac{\partial\text{vech}(D_k)}{\partial\text{vec}(D_k)} = D_{q_k}.$$

Combining the above derivatives yields the desired result. ⊓⊔

To derive the estimator given by (31), we first note that the approximate T-statistic used for LMM null hypothesis testing is given by:

$$T = \frac{L\hat{\beta}}{\sqrt{L\widehat{\text{Var}}(\hat{\beta})L'}},$$

where $\hat{\beta}$ is the estimator of $\beta$ obtained from numerical optimization and $\widehat{\text{Var}}(\hat{\beta})$ is an estimate of the variance of the $\beta$ estimator, given by:

$$\widehat{\text{Var}}(\hat{\beta}) = \hat{\sigma}^2(X'\hat{V}^{-1}X)^{-1}.$$

We highlight that the above expression differs from the T-statistic typically employed in the setting of linear regression hypothesis testing. Under the standard assumptions for linear regression, it is possible to derive an exact expression for the distribution of the OLS variance estimate of $\hat{\beta}$, given by $\hat{\sigma}^2(X'X)^{-1}$. However, the above estimator, $\widehat{\text{Var}}(\hat{\beta})$, differs from the OLS estimate as it includes an estimate of $V$, which has an unknown distribution. As a direct result, no equivalent theory is available to obtain the distribution of $\widehat{\text{Var}}(\hat{\beta})$ in the setting of the LMM. A consequence of this is that, whilst $T$ follows a Student's $t_{n-p}$ distribution in the setting of linear regression, the distribution of $T$ in the LMM setting is unknown.
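To make the quantities above concrete, a minimal numerical sketch is given below. The design, contrast and data are invented, and $V$ is taken as known for simplicity (in practice it would be replaced by the Fisher Scoring estimate $\hat{V}$):

```python
import numpy as np

rng = np.random.default_rng(11)
n, p = 40, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
V = np.eye(n)                          # toy choice; in practice V-hat from Fisher Scoring
Y = X @ np.array([1.0, 0.5, 0.0]) + rng.normal(size=n)
L = np.array([[0.0, 1.0, 0.0]])        # contrast testing the second coefficient

Vinv = np.linalg.inv(V)
XtVX = X.T @ Vinv @ X
beta_hat = np.linalg.solve(XtVX, X.T @ Vinv @ Y)     # GLS estimate of beta
e = Y - X @ beta_hat
sigma2_hat = (e @ Vinv @ e) / n
var_hat = sigma2_hat * (L @ np.linalg.inv(XtVX) @ L.T).item()   # L Var-hat(beta-hat) L'
T = (L @ beta_hat).item() / np.sqrt(var_hat)
```

The quantity `var_hat` plays the role of $L\widehat{\text{Var}}(\hat{\beta})L'$ above; it is the distribution of this quantity that the Satterthwaite argument below approximates.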
The aim of this section is to approximate the degrees of freedom of $T$ under the assumption that it follows a Student's $t$-distribution.

To meet this aim, we now denote by $\text{Var}(\hat{\beta})$ the unknown true variance of the $\hat{\beta}$ estimator. By multiplying both the numerator and denominator of the T-statistic by $(L\text{Var}(\hat{\beta})L')^{-1/2}$, the below expression for $T$ is obtained:

$$T = Z\bigg/\sqrt{\frac{L\widehat{\text{Var}}(\hat{\beta})L'}{L\text{Var}(\hat{\beta})L'}},$$

where $Z$ is a random variable with standard normal distribution, given by $Z = L\hat{\beta}/\sqrt{L\text{Var}(\hat{\beta})L'}$. We now consider the definition of the Student's $t$-distribution, which states that a random variable is $t$-distributed with $v$ degrees of freedom if and only if it takes the following form:

$$T = \frac{Z}{\sqrt{X/v}},$$

where $Z \sim N(0,1)$ and $X \sim \chi^2_v$. When the degrees of freedom $v$ are unknown, as both $X$ and $T$ are distributed with $v$ degrees of freedom, it suffices to estimate the degrees of freedom of the chi-square variable $X$ instead of the $t$-distributed variable $T$. By equating the approximate T-statistic with the defining form of a $t$-distributed random variable, under the assumption that the T-statistic is truly $t$-distributed, the below expression for $X$ can be obtained:

$$X = \frac{vL\widehat{\text{Var}}(\hat{\beta})L'}{L\text{Var}(\hat{\beta})L'} \sim \chi^2_v.$$

We now define the notation $S^2(\hat{\eta})$ to represent the estimated variance of $L\hat{\beta}$ as a function of the estimated variance parameters given by $\hat{\eta} = (\hat{\sigma}^2, \hat{D}_1, \ldots, \hat{D}_r)$. Similarly, we define $\Sigma^2$ to be the true unknown variance of $L\hat{\beta}$. Noting that $\Sigma^2 = L\text{Var}(\hat{\beta})L'$ and by rearranging the above, the below can be obtained:

$$S^2(\hat{\eta}) = L\widehat{\text{Var}}(\hat{\beta})L' \sim \frac{\Sigma^2}{v}\chi^2_v.$$

By considering the variance of the above expression and noting that, as $X \sim \chi^2_v$, $\text{Var}(X) = 2v$, the below expression is obtained:

$$\text{Var}(S^2(\hat{\eta})) = \frac{(\Sigma^2)^2}{v^2}\text{Var}(X) = \frac{2(\Sigma^2)^2}{v}.$$

To obtain an estimator of $v$ in terms of $S^2(\hat{\eta})$, the above is rearranged and the approximation $\Sigma^2 \approx S^2(\hat{\eta})$ is used.
This yields the below approximation:

$$\hat{v}(\hat{\eta}) = \frac{2\left(S^2(\hat{\eta})\right)^2}{\text{Var}(S^2(\hat{\eta}))}.$$

Under the assumption that $T \sim t_v$, $\hat{v}(\hat{\eta})$ is an estimator for the degrees of freedom of $X$ and, therefore, for the degrees of freedom of $T$. This concludes the derivation of (31).

In this appendix, we detail how the Satterthwaite method described in Appendix 6.7.1 may be extended in order to estimate the degrees of freedom of the approximate F-statistic. The approximate F-statistic for the LMM takes the below form:

$$F = \frac{\hat{\beta}'L'[L\widehat{\text{Var}}(\hat{\beta})L']^{-1}L\hat{\beta}}{r},$$

where, throughout this section only, the notation $F$ will represent the approximate F-statistic and $r$ will be used to denote $r = \text{rank}(L)$. The aim of this section is to approximate $F$ with an $F_{r,v}$ distribution, where $v$, the denominator degrees of freedom, is estimated using a Satterthwaite method of approximation. To simplify the notation, we will first consider $Q$, given by:

$$Q = rF = \hat{\beta}'L'[L\text{Var}(\hat{\beta})L']^{-1}L\hat{\beta}.$$

We begin by considering the eigendecomposition of $L\text{Var}(\hat{\beta})L'$, which we will denote:

$$L\text{Var}(\hat{\beta})L' = U\Lambda U',$$

where $U$ is an orthonormal matrix of eigenvectors and $\Lambda$ is a diagonal matrix of eigenvalues. Using standard properties of the eigendecomposition, the below is obtained:

$$Q = (U'L\hat{\beta})'\Lambda^{-1}(U'L\hat{\beta}) = \sum_{i=1}^{r}\frac{(U'L\hat{\beta})_i^2}{\lambda_i},$$

where $(U'L\hat{\beta})_i$ is the $i$th element of the vector $U'L\hat{\beta}$ and $\lambda_i$ is the $i$th diagonal element (eigenvalue) of $\Lambda$, $\Lambda[i,i]$. For purposes which will become clear, we define $\tilde{L}$ as $U'L$. By noting that $\lambda_i = (U'L\text{Var}(\hat{\beta})L'U)_{[i,i]}$, the above can be seen to be equal to:

$$Q = \sum_{i=1}^{r}\left(\frac{\tilde{L}_i\hat{\beta}}{\sqrt{\tilde{L}_i\text{Var}(\hat{\beta})\tilde{L}'_i}}\right)^2,$$

where $\tilde{L}_i$ represents the $i$th row of $\tilde{L}$. Noting now the definition of $\tilde{L}$, it can be seen that the above is a sum of independent squared T-statistics of the form discussed in Appendix 6.7.1.
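The scaled chi-square reasoning behind (31) can be checked by simulation: if $S^2 \sim (\Sigma^2/v)\chi^2_v$, then $2E[S^2]^2/\text{Var}(S^2) = v$. A quick Monte Carlo sketch, with arbitrary illustrative values:

```python
import numpy as np

rng = np.random.default_rng(0)
v_true, Sigma2 = 8.0, 2.5
# Draws of S^2 under the scaled chi-square model assumed in the derivation above
s2 = (Sigma2 / v_true) * rng.chisquare(v_true, size=200_000)
v_hat = 2 * np.mean(s2)**2 / np.var(s2)   # estimator (31), with moments from simulation
assert abs(v_hat - v_true) < 0.3
```

With the true moments in place of the sample moments, the estimator recovers $v$ exactly; the Monte Carlo error here reflects only the finite sample of draws.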
By denoting the unknown degrees of freedom of each of the T-statistics in the above sum by $v_i$, it follows that:

$$E(Q) = \sum_{i=1}^{r}E\left[\left(\frac{\tilde{L}_i\hat{\beta}}{\sqrt{\tilde{L}_i\text{Var}(\hat{\beta})\tilde{L}'_i}}\right)^2\right] = \sum_{i=1}^{r}\frac{v_i}{v_i - 2},$$

where the second equality follows from the fact that a squared T-statistic with degrees of freedom $v_i$ is an $F_{1,v_i}$-statistic, which in turn has expectation $\frac{v_i}{v_i - 2}$. Noting that, in practice, the true variance of $\hat{\beta}$ is unknown, the methods of Appendix 6.7.1 must be employed to estimate each of the $v_i$ in the above sum.

To use the above expression for $E(Q)$ to estimate the denominator degrees of freedom of $F$, three cases are considered: $v > 2$, $v = 2$ and $r = 1$. Each of these will now be considered in turn.

Case 1 ($v > 2$): By recalling that $Q = rF$, and by the standard formula for the expectation of an F distribution with denominator degrees of freedom $v$ greater than 2, the below is obtained:

$$E(Q) = E(rF) = \frac{rv}{v-2}.$$

Setting the above equal to the previous expression for $E(Q)$ and rearranging, the below expression for $v$ is obtained:

$$v = \frac{2\sum_{i=1}^{r}\frac{v_i}{v_i-2}}{\left(\sum_{i=1}^{r}\frac{v_i}{v_i-2}\right) - r}.$$

By using the Satterthwaite degrees of freedom approximation method described in Appendix 6.7.1 to estimate $\{v_i\}_{i\in\{1,\ldots,r\}}$, it follows that the above formula can be used to estimate the degrees of freedom of the approximate F-statistic when $v > 2$.

Case 2 ($v = 2$): When $v = 2$, the expectation of $Q$ is infinite and, resultantly, the equality used in the argument of Case 1 no longer holds. A common convention for estimation of $v$ in this case is to note that, for $v > 2$, $Q$ satisfies the following relationship with $v$:

$$v = \frac{2E(Q)}{E(Q) - r}.$$

By taking the limit from above of both sides of this expression, and noting that $E(Q) \to \infty$ as $v \downarrow 2$, the following estimate for $v$ is obtained:

$$v = \lim_{v \downarrow 2}\frac{2E(Q)}{E(Q) - r} = 2.$$

Case 3 ($r = 1$): For the case $r = 1$, it is well known that if $F \sim F_{1,v}$ for some integer $v$, then $F$ is the square of a T-statistic with the same degrees of freedom, $v$. Resultantly, in this case, the unknown denominator degrees of freedom of the F-statistic, $v$, can be estimated using the methods of Appendix 6.7.1.

6.7.3 The derivative of $S^2(\hat{\eta}^h)$ with respect to $\text{vech}(\hat{D}_k)$

In this appendix, proof of the derivative result which was given in Section 2.6 is provided. Following this, an extension of this result is provided for models containing constrained covariance matrices of the form described in Section 2.5. The result given in Section 2.6 is restated by the below theorem.

Theorem 8

Define $\hat{\eta}^h$ as in Section 2.6 and define the function $S^2(\hat{\eta}^h)$ by:

$$S^2(\hat{\eta}^h) = \hat{\sigma}^2L(X'\hat{V}^{-1}X)^{-1}L'.$$

The derivative of $S^2(\hat{\eta}^h)$ with respect to the half-vectorisation of $\hat{D}_k$ is given by:

$$\frac{dS^2(\hat{\eta}^h)}{d\text{vech}(\hat{D}_k)} = \hat{\sigma}^2D'_{q_k}\left(\sum_{j=1}^{l_k}\hat{B}_{(k,j)} \otimes \hat{B}_{(k,j)}\right),$$

where $\hat{B}_{(k,j)}$ is given by:

$$\hat{B}_{(k,j)} = Z'_{(k,j)}\hat{V}^{-1}X(X'\hat{V}^{-1}X)^{-1}L'.$$

Proof

By the chain rule for vector-valued functions, as stated by Turkington (2013), and by noting that the function $S^2(\hat{\eta}^h)$ outputs a $(1 \times 1)$ scalar value, it can be seen that:

$$\frac{\partial S^2(\hat{\eta}^h)}{\partial\text{vec}(\hat{D}_k)} = \hat{\sigma}^2\frac{\partial\left(L(X'\hat{V}^{-1}X)^{-1}L'\right)}{\partial\text{vec}(\hat{D}_k)} = \hat{\sigma}^2\frac{\partial\text{vec}(\hat{V})}{\partial\text{vec}(\hat{D}_k)}\frac{\partial\text{vec}(\hat{V}^{-1})}{\partial\text{vec}(\hat{V})}\frac{\partial\text{vec}(X'\hat{V}^{-1}X)}{\partial\text{vec}(\hat{V}^{-1})}\frac{\partial\left(L(X'\hat{V}^{-1}X)^{-1}L'\right)}{\partial\text{vec}(X'\hat{V}^{-1}X)}.$$

We now consider each of the terms in this product in turn. Using the expansion given in (39), the first derivative in the product evaluates to the below:

$$\frac{\partial\text{vec}(\hat{V})}{\partial\text{vec}(\hat{D}_k)} = \sum_{j=1}^{l_k}Z'_{(k,j)} \otimes Z'_{(k,j)}.$$

For the second derivative in the product, we apply result (4.17) from Turkington (2013), which states:

$$\frac{\partial\text{vec}(\hat{V}^{-1})}{\partial\text{vec}(\hat{V})} = -\hat{V}^{-1} \otimes \hat{V}^{-1}.$$

For the third term of the product, a result stated in Chapter 5.7 of Turkington (2013) gives the following:

$$\frac{\partial\text{vec}(X'\hat{V}^{-1}X)}{\partial\text{vec}(\hat{V}^{-1})} = X \otimes X.$$

By a variant of (4.17) from Turkington (2013), given on the line following the statement of (4.17), the below is obtained:

$$\frac{\partial\left(L(X'\hat{V}^{-1}X)^{-1}L'\right)}{\partial\text{vec}(X'\hat{V}^{-1}X)} = -\text{vec}\left((X'\hat{V}^{-1}X)^{-1}L'\left((X'\hat{V}^{-1}X)^{-1}L'\right)'\right) = -\left((X'\hat{V}^{-1}X)^{-1}L'\right) \otimes \left((X'\hat{V}^{-1}X)^{-1}L'\right).$$

Following this, application of the mixed-product property of the Kronecker product and multiplication by the transposed duplication matrix yields the result of Theorem 8. ⊓⊔

Theorem 9

Define $\rho_{\hat{D}}$ and $C$ as in Section 2.5, and $\hat{B}_{(k,j)}$ and $\hat{\eta}^h$ as in Theorem 8. The derivative of $S^2(\hat{\eta}^h)$ with respect to $\rho_{\hat{D}}$ is given by:

$$\frac{dS^2(\hat{\eta}^h)}{d\rho_{\hat{D}}} = \hat{\sigma}^2C\hat{B},$$

where $\hat{B}$ is defined by:

$$\hat{B} = \left[\left(\sum_{j=1}^{l_1}\hat{B}'_{(1,j)} \otimes \hat{B}'_{(1,j)}\right), \ldots, \left(\sum_{j=1}^{l_r}\hat{B}'_{(r,j)} \otimes \hat{B}'_{(r,j)}\right)\right]'.$$

Proof

From the proof of Theorem 8, it can be seen that the partial derivative of $S^2(\hat{\eta}^h)$ with respect to $\text{vec}(\hat{D}_k)$ is given by:

$$\frac{\partial S^2(\hat{\eta}^h)}{\partial\text{vec}(\hat{D}_k)} = \hat{\sigma}^2\sum_{j=1}^{l_k}\hat{B}_{(k,j)} \otimes \hat{B}_{(k,j)}.$$

By defining $v(\hat{D})$ as in Section 2.5, it can be seen from the above that the partial derivative of $S^2(\hat{\eta}^h)$ with respect to $v(\hat{D})$ is given by:

$$\frac{\partial S^2(\hat{\eta}^h)}{\partial v(\hat{D})} = \hat{\sigma}^2\hat{B}.$$

By the chain rule for vector-valued functions, as stated by Turkington (2013), the below can now be obtained:

$$\frac{dS^2(\hat{\eta}^h)}{d\rho_{\hat{D}}} = \frac{\partial v(\hat{D})}{\partial\rho_{\hat{D}}}\frac{\partial S^2(\hat{\eta}^h)}{\partial v(\hat{D})} = C\frac{\partial S^2(\hat{\eta}^h)}{\partial v(\hat{D})},$$

where the second equality follows from the definition of $C$. Substituting the partial derivative of $S^2(\hat{\eta}^h)$ with respect to $v(\hat{D})$ into the above completes the proof. ⊓⊔

Appendix 6.8.1 details how the random effects vector, $b$, and covariance matrix, $D$, may be constructed for the ACE model. Following this, Appendix 6.8.2 details how the matrices which model the additive genetic and common environmental variance components of the ACE model are defined. Next, Appendix 6.8.3 provides an overview of the constrained optimization procedure adopted for the ACE model, alongside an expression for the constraint matrix employed for optimization. Finally, Appendix 6.8.4 provides detail on how the parameter estimation method that was employed to obtain the results of Section 3.2.2 was efficiently implemented.

In this appendix, we provide detail on how the random effects vector, $b$, and covariance matrix, $D$, are defined for the ACE model. To do so, we first describe the covariance of the random terms $\gamma_{k,j,i}$, which appear in the twin study model from Section 2.8.2. Following this, we use the $\gamma_{k,j,i}$ terms to construct the random effects vector, $b$. To simplify notation, we make the assumption that, for each family structure type, subjects within family units which exhibit such a structure are ordered consistently.
For example, if the family structure describes "families containing one twin-pair and one half-sibling", it may be assumed that the members of every family who exhibit such a structure are given in the same order: (twin, twin, half-sibling).

As noted in Section 2.8.2, $\gamma_{k,j,i}$ models the within-"family unit" covariance. As such, $\text{cov}(\gamma_{k,j,i}, \gamma_{k,j',i'}) = 0$ for any two distinct family units, $j \neq j'$. Within any individual family unit (e.g. family unit $j$ of structure type $k$), the random effects $\{\gamma_{k,j,i}\}_{i\in\{1,\ldots,q_k\}}$ are defined to have the below covariance matrix:

$$\text{cov}\begin{bmatrix}\gamma_{k,j,1} \\ \gamma_{k,j,2} \\ \vdots \\ \gamma_{k,j,q_k}\end{bmatrix} = \sigma^2_aK^a_k + \sigma^2_cK^c_k,$$

where $K^a_k$ and $K^c_k$ are the known, predefined kinship (expected additive genetic material) and common environmental effects matrices, respectively (see Appendix 6.8.2 for further detail).

The random effects vector, $b$, may be constructed by vertical concatenation of the random $\gamma_{k,j,i}$ terms, i.e. $b = [\gamma_{1,1,1}, \gamma_{1,1,2}, \ldots, \gamma_{r,l_r,q_r}]'$. To derive an expression for the covariance matrix of $b$, we note from equation (2) that $\sigma^2_eD$ is equal to $\text{cov}(b)$. Equating this with the previous expression, it may now be seen that $D$ is block diagonal, with its $k$th unique diagonal block given by:

$$D_k = \sigma^{-2}_e(\sigma^2_aK^a_k + \sigma^2_cK^c_k).$$

In twin studies, twin pairs are typically described as either Monozygotic (MZ) or Dizygotic (DZ). MZ twin pairs are identical twins, whilst DZ twins are non-identical. The additive genetic kinship matrix, $K^a_k$, describes the additive genetic similarity between members of a family unit. For a family of $q_k$ members, $K^a_k$ is a $(q_k \times q_k)$ dimensional constant matrix with its $(i,j)$th element given by:

$$(K^a_k)_{[i,j]} = \begin{cases}1 & \text{if subjects } i \text{ and } j \text{ are MZ twins or } i = j, \\ 1/2 & \text{if subjects } i \text{ and } j \text{ are DZ twins or full siblings}, \\ 1/4 & \text{if subjects } i \text{ and } j \text{ are half siblings}, \\ 0 & \text{if subjects } i \text{ and } j \text{ are unrelated}.\end{cases}$$

The common environmental matrix, $K^c_k$, describes the similarity between members of a family unit which is attributed to individuals being reared in a shared environment. $K^c_k$ is a $(q_k \times q_k)$ dimensional constant matrix with its $(i,j)$th element given by:

$$(K^c_k)_{[i,j]} = \begin{cases}1 & \text{if subjects } i \text{ and } j \text{ were reared together}, \\ 0 & \text{if subjects } i \text{ and } j \text{ were not reared together}.\end{cases}$$
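As a hypothetical example, for a family unit ordered (MZ twin, MZ twin, half-sibling), with all three members reared together, the matrices above and the resulting covariance block could be assembled as follows (all variance component values are invented for illustration):

```python
import numpy as np

# Kinship entries: 1 (MZ twin or self), 1/2 (DZ twin / full sibling),
# 1/4 (half sibling), 0 (unrelated). Ordering: (MZ twin, MZ twin, half-sibling).
K_a = np.array([[1.0,  1.0,  0.25],
                [1.0,  1.0,  0.25],
                [0.25, 0.25, 1.0 ]])
K_c = np.ones((3, 3))                            # all three subjects reared together

sigma2_a, sigma2_c, sigma2_e = 0.4, 0.2, 1.1     # invented variance components
D_k = (sigma2_a * K_a + sigma2_c * K_c) / sigma2_e   # D_k = sigma_e^{-2}(sigma_a^2 K^a + sigma_c^2 K^c)
assert np.allclose(D_k, D_k.T)
```

Note that $K^a_k$ and $K^c_k$ are fixed by the family structure; only the scalar variance components are estimated.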
For more information on the construction of kinship and common environmental matrices for twin studies, the reader is referred to Lawlor et al. (2009).

In this section, a brief overview of the constrained optimization procedure which was adopted for parameter estimation of the ACE model is provided. An expression for the constraint matrix, $C$, is then provided by Theorem 10, alongside derivation.

To perform parameter estimation for the ACE model, an approach based on the ReML criterion described in Appendix 6.4 and the constrained covariance structure methods outlined in Section 2.5 was adopted. The resulting optimization procedure was performed in terms of the parameter vector $\theta_{ACE} = (\beta', \tau_a, \tau_c, \sigma^2_e)'$, where $\tau_a = \sigma^{-1}_e\sigma_a$ and $\tau_c = \sigma^{-1}_e\sigma_c$. $\beta$ and $\sigma^2_e$ were updated according to the GLS update rules provided by equations (18) and (19), respectively, whilst updates for the parameter vector $[\tau_a, \tau_c]'$ were performed via a Fisher Scoring update rule. The Fisher Scoring update rule employed was of the form (5), with $\theta$ substituted for $[\tau_a, \tau_c]'$. To obtain the gradient vector and information matrix required to perform this update step, equation (45) of Appendix 6.5 was employed. An expression for the required constraint matrix, $C$, alongside derivation, is given by Theorem 10 below.

Theorem 10

Let $v(D)$ denote the vector of covariance parameters $[\text{vec}(D_1)', \text{vec}(D_2)', \ldots, \text{vec}(D_r)']'$. For the ACE model, the constraint matrix (Jacobian) which maps a partial derivative vector taken with respect to $v(D)$ to a total derivative vector taken with respect to $\tau = [\tau_a, \tau_c]'$ is given by:

$$C = \left(1_{(1,r)} \otimes \begin{bmatrix}2\tau_a & 0 \\ 0 & 2\tau_c\end{bmatrix}\right)\left(\bigoplus_{k=1}^{r}\begin{bmatrix}\text{vec}(K^a_k)' \\ \text{vec}(K^c_k)'\end{bmatrix}\right),$$

where $\oplus$ represents the direct sum and $1_{(1,r)}$ the $(1 \times r)$ row vector of ones.

Proof: To prove Theorem 10, we first define $\tilde{\tau}_{a,1}, \ldots, \tilde{\tau}_{a,r}$ and $\tilde{\tau}_{c,1}, \ldots, \tilde{\tau}_{c,r}$ as variables which satisfy the below expressions, for all $k \in \{1,\ldots,r\}$:

$$\tau_a = \tilde{\tau}_{a,k}, \qquad \tau_c = \tilde{\tau}_{c,k}, \qquad \text{vec}(D_k) = \tilde{\tau}^2_{a,k}\text{vec}(K^a_k) + \tilde{\tau}^2_{c,k}\text{vec}(K^c_k).$$

Following this, we define the vector $\tilde{\tau}$ as $\tilde{\tau} = [\tilde{\tau}_{a,1}, \tilde{\tau}_{c,1}, \ldots, \tilde{\tau}_{a,r}, \tilde{\tau}_{c,r}]'$. By the chain rule for vector-valued functions, as stated by Turkington (2013), and by the definition of $C$ provided in Section 2.5, it can be seen that:

$$C = \frac{dv(D)}{d\tau} = \frac{\partial\tilde{\tau}}{\partial\tau}\frac{\partial v(D)}{\partial\tilde{\tau}}. \qquad (46)$$

By direct evaluation, the first partial derivative in the product on the right-hand side of the above can be seen to be equal to the following:

$$\frac{\partial\tilde{\tau}}{\partial\tau} = \begin{bmatrix}1 & 0 & 1 & 0 & \cdots \\ 0 & 1 & 0 & 1 & \cdots\end{bmatrix} = 1_{(1,r)} \otimes I_2.$$

To evaluate the second derivative in the product, we first consider the partial derivative of $\text{vec}(D_k)$ with respect to the parameter vector $[\tilde{\tau}_{a,k}, \tilde{\tau}_{c,k}]'$ for arbitrary $k \in \{1,\ldots,r\}$. From the definitions of $\tilde{\tau}_{a,k}$ and $\tilde{\tau}_{c,k}$, it can be seen that:

$$\frac{\partial\text{vec}(D_k)}{\partial[\tilde{\tau}_{a,k}, \tilde{\tau}_{c,k}]'} = \frac{\partial}{\partial[\tilde{\tau}_{a,k}, \tilde{\tau}_{c,k}]'}\left(\tilde{\tau}^2_{a,k}\text{vec}(K^a_k) + \tilde{\tau}^2_{c,k}\text{vec}(K^c_k)\right) = \begin{bmatrix}2\tilde{\tau}_{a,k}\text{vec}(K^a_k)' \\ 2\tilde{\tau}_{c,k}\text{vec}(K^c_k)'\end{bmatrix}.$$

By similar reasoning it can be seen, for arbitrary $k_1, k_2 \in \{1,\ldots,r\}$ such that $k_1 \neq k_2$, that the below is true:

$$\frac{\partial\text{vec}(D_{k_2})}{\partial[\tilde{\tau}_{a,k_1}, \tilde{\tau}_{c,k_1}]'} = 0_{(2,\,q^2_{k_2})},$$

where $0_{(2,\,q^2_{k_2})}$ is the $(2 \times q^2_{k_2})$ dimensional matrix of zero elements. By combining the above expressions and noting the definitions of $v(D)$ and $\tilde{\tau}$, it can now be seen that the derivative of $v(D)$ with respect to $\tilde{\tau}$ is given by the below:

$$\frac{\partial v(D)}{\partial\tilde{\tau}} = \bigoplus_{k=1}^{r}\begin{bmatrix}2\tilde{\tau}_{a,k}\text{vec}(K^a_k)' \\ 2\tilde{\tau}_{c,k}\text{vec}(K^c_k)'\end{bmatrix}.$$

By substituting the above partial derivative results into (46), the following expression can now be obtained for the constraint matrix, $C$:

$$C = \left(1_{(1,r)} \otimes I_2\right)\left(\bigoplus_{k=1}^{r}\begin{bmatrix}2\tilde{\tau}_{a,k}\text{vec}(K^a_k)' \\ 2\tilde{\tau}_{c,k}\text{vec}(K^c_k)'\end{bmatrix}\right).$$

By substituting $\tau_a = \tilde{\tau}_{a,k}$ and $\tau_c = \tilde{\tau}_{c,k}$ and rearranging, the result stated in Theorem 10 can now be obtained.
⊓⊔

A key aspect in which the ACE model differs from the other LMMs considered in this work is that, in the ACE model, the second dimension of the random effects design matrix, $q$, is equal to the number of observations, $n$. Resultantly, the random effects design matrix, $Z$, which is given by the $(n \times n)$ identity matrix, has notably large dimensions. Due to the large size of $Z$, the methods of Section 2.4, which assume that the second dimension of $Z$ is much smaller than $n$ (i.e. $q \ll n$), cannot be applied to the ACE model in order to improve computation speed. In this appendix, we briefly outline how the FSFS algorithm of Section 2.1.4 may be implemented efficiently, in conjunction with the ReML adjustments described in Appendix 6.4 and the constrained optimization methods of Section 2.5, to perform parameter estimation for the ACE model.

To begin, we consider the GLS updates, given by equations (18) and (19), which are employed for updating the parameters $\beta$ and $\sigma^2$, respectively. By defining the matrix $\bar{D}_k$ by $\bar{D}_k = I_{q_k} + D_k$, and noting that the matrix $Z$ is the $(n \times n)$ identity matrix, the below matrix equality can be seen to hold for the ACE model:

$$X'V^{-1}X = \sum_{k=1}^{r}\sum_{j=1}^{l_k}X'_{(k,j)}\bar{D}^{-1}_kX_{(k,j)},$$

where $X_{(k,j)}$ represents the design matrix for the $j$th family unit which possesses family structure of type $k$. Via well-known properties of the Kronecker product, the above equality can be seen to be equivalent to the following:

$$\text{vec}(X'V^{-1}X) = \sum_{k=1}^{r}\left(\sum_{j=1}^{l_k}X'_{(k,j)} \otimes X'_{(k,j)}\right)\text{vec}(\bar{D}^{-1}_k).$$

The above equality is noteworthy as the matrix expression which appears inside the large brackets on the right-hand side is fixed and does not depend on $\beta$, $\sigma^2$ or $D$. Further, the matrix expression inside the brackets is typically small in size, with dimensions $(p^2 \times q_k^2)$.
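The vectorized identity above is easy to verify numerically on an invented toy layout (a single family-structure type with three family units of size two):

```python
import numpy as np

rng = np.random.default_rng(7)
p, q, l = 2, 2, 3                                 # toy sizes: p covariates, l families of size q
Xb = [rng.normal(size=(q, p)) for _ in range(l)]  # per-family design blocks X_(k,j)
Dbar = np.eye(q) + 0.5 * np.ones((q, q))          # \bar{D}_k = I + D_k (SPD)
Dbar_inv = np.linalg.inv(Dbar)

# Left-hand side: vec of the summed quadratic forms
lhs = sum(Xj.T @ Dbar_inv @ Xj for Xj in Xb).flatten(order="F")
# Right-hand side: the fixed, precomputable (p^2 x q^2) Kronecker sum times vec(Dbar^{-1})
fixed = sum(np.kron(Xj.T, Xj.T) for Xj in Xb)
rhs = fixed @ Dbar_inv.flatten(order="F")
assert np.allclose(lhs, rhs)
```

Because `fixed` does not change across iterations, it can be built once before optimization, which is exactly the computational saving described here.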
As a result, this expression can be calculated and stored prior to the iterative procedure of the FSFS algorithm. This means that the above equality offers a quick and convenient method for repeated evaluation of the matrix X′V⁻¹X, which is employed for many calculations throughout the FSFS algorithm. Similar logic can be applied to the evaluation of the vector X′V⁻¹Y and the scalar value Y′V⁻¹Y. Using this approach, it can now be seen that the GLS estimators, (18) and (19), can be calculated using only {D_k}_{k∈{1,…,r}} and pre-calculated matrices which are small in dimension. Since the matrices {D_k}_{k∈{1,…,r}} are also typically small in dimension, this calculation is extremely quick and computationally efficient.

Next, we consider the matrix F_{vec(D_k)} and the partial derivative vector of l_R(θ^f), taken with respect to vec(D_k). Expressions for these quantities can be obtained using equations (15) and (13), respectively, in combination with the adjustments described in Appendix 6.4. Again noting that the random effects design matrix, Z, is equal to the identity matrix, the ReML equivalent of equation (13) simplifies substantially to the below expression:

∂l_R(θ^f)/∂vec(D_k) =

(1/2) vec( D̄_k⁻¹ ( E_k E_k′ σ⁻² − l_k D̄_k + Σ_{j=1}^{l_k} X_{(k,j)} (X′V⁻¹X)⁻¹ X′_{(k,j)} ) D̄_k⁻¹ ),

where E_k = vec_{q_k}(e′_{(k)})′, vec_m is defined as the generalized vectorization operator described in Section 2.4 and e_{(k)} is defined as the residual vector corresponding to subjects who belong to a family unit with family structure type k. Using equation (15), the sub-matrix of F corresponding to vec(D_k) can also be seen to be given by the following simplified expression:

F_{vec(D_k)} = (l_k/2) ( D̄_k⁻¹ ⊗ D̄_k⁻¹ ).

The two expressions above can be evaluated computationally in an extremely efficient manner, primarily for two reasons. The first of these is that the matrices involved in computation are typically small; D_k, E_k and X_{(k,j)} have dimensions (q_k × q_k), (q_k × l_k) and (q_k × p), respectively. The second reason is that, excluding the inversions of D̄_k and X′V⁻¹X, the only operations required to evaluate the above expressions are computationally quick to perform. These operations are: matrix multiplication, the reshape operation, matrix addition and the Kronecker product.

Lastly, we consider the evaluation of the score vector and Fisher Information matrix associated to τ = [τ_a, τ_c]′ = σ_e⁻¹ [σ_a, σ_c]′, which shall be denoted as dl_R(θ^{ACE})/dτ and I^{ACE}_τ, respectively. We define K_k = [vec(K_{a_k}), vec(K_{c_k})]′. By employing equation (45) of Appendix 6.5, and using the constraint matrix derived in Appendix 6.8.3, the score vector and Fisher Information matrix associated to τ can be reduced to the following:

I^{ACE}_τ = 4(ττ′) ⊙ ( Σ_{k=1}^{r} K_k F_{vec(D_k)} K_k′ ),

dl_R(θ^{ACE})/dτ = 2τ ⊙ ( Σ_{k=1}^{r} K_k ∂l_R(θ^f)/∂vec(D_k) ),

where ⊙ represents the Hadamard (entry-wise) product.
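The Hadamard product form of the score vector follows from the chain rule, since ∂vec(D_k)/∂τ_a = 2τ_a vec(K_{a_k}). This structure can be checked numerically on any smooth objective; the sketch below uses log det(I + D_k) as an arbitrary stand-in for the restricted log-likelihood, with hypothetical (3 × 3) structure matrices (none of these choices come from the paper), and compares the Hadamard chain-rule gradient against central finite differences:

```python
import numpy as np

rng = np.random.default_rng(1)
q_k = 3

def vec(A):
    # Column-stacking vectorization.
    return A.reshape(-1, order="F")

# Illustrative symmetric structure matrices (stand-ins for K_{a_k}, K_{c_k})
B = rng.normal(size=(q_k, q_k))
K_a = B @ B.T                  # hypothetical additive-genetic structure (PSD)
K_c = np.ones((q_k, q_k))      # hypothetical shared-environment structure

# K_k = [vec(K_{a_k}), vec(K_{c_k})]' has shape (2, q_k^2)
K = np.vstack([vec(K_a), vec(K_c)])

def D_of(tau):
    # D_k as a function of tau = [tau_a, tau_c]
    return tau[0] ** 2 * K_a + tau[1] ** 2 * K_c

def objective(tau):
    # Toy smooth function standing in for the restricted log-likelihood
    return np.linalg.slogdet(np.eye(q_k) + D_of(tau))[1]

tau = np.array([0.7, 0.4])

# Gradient with respect to vec(D_k): d/dD log det(I + D) = (I + D)^{-1}
g_vecD = vec(np.linalg.inv(np.eye(q_k) + D_of(tau)))

# Chain rule in the form used above: dl/dtau = 2 tau ⊙ (K_k dl/dvec(D_k))
analytic = 2 * tau * (K @ g_vecD)

# Central finite-difference check
eps = 1e-6
numeric = np.array([
    (objective(tau + eps * e) - objective(tau - eps * e)) / (2 * eps)
    for e in np.eye(2)
])

print(np.allclose(analytic, numeric, atol=1e-6))  # True
```

The same computation also makes the rank-deficiency issue visible: setting tau[0] or tau[1] to zero annihilates the corresponding row of the Hadamard-scaled gradient.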
The results of Section 3.2.2 were produced by using the GLS updates for the parameters β and σ² and by employing the two expressions above to perform Fisher Scoring update steps for the parameter vector τ.

It should be noted that, in the ACE model, misspecification of variance components can result in a low-rank Fisher Information matrix. This can be seen by noting the positioning of the Hadamard product in the above Fisher Information matrix expression and by considering the case in which one of the elements of τ equals 0. If not accounted for appropriately, this issue may, in practice, result in failure of the algorithm to converge. A simple, practical solution to this issue is to execute the algorithm multiple times using different initial starting points: once with a starting point where τ_a is set to zero, once with τ_c set to zero, and once with neither τ_a nor τ_c set to zero. The results of Section 3.2.2 were obtained using this approach.

References

Barnett S (1990) Matrices: Methods and Applications. Oxford Applied Mathematics and Computing Science Series, Clarendon Press
Bates D, Mächler M, Bolker B, Walker S (2015) Fitting linear mixed-effects models using lme4. Journal of Statistical Software 67(1):1–48, DOI 10.18637/jss.v067.i01
Demidenko E (2013) Mixed Models: Theory and Applications with R, 2nd edn. DOI 10.1002/9781118651537
Dempster AP, Laird NM, Rubin DB (1977) Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society Series B (Methodological) 39(1):1–38

Dempster AP, Rubin DB, Tsutakawa RK (1981) Estimation in covariance components models. Journal of the American Statistical Association 76(374):341–353, DOI 10.1080/01621459.1981.10477653
Giles MB (2008) Collected matrix derivative results for forward and reverse mode algorithmic differentiation
Henderson CR, Kempthorne O, Searle SR, von Krosigk CM (1959) The estimation of environmental and genetic trends from records subject to culling. Biometrics 15(2):192–218

Hong G, Raudenbush SW (2008) Causal inference for time-varying instructional treatments. Journal of Educational and Behavioral Statistics 33(3):333–362, DOI 10.3102/1076998607307355
IBM Corp (2015) IBM SPSS Advanced Statistics 23. Armonk, NY: IBM Corp.
Jennrich RI, Schluchter MD (1986) Unbalanced repeated-measures models with structured covariance matrices. Biometrics 42(4):805–820

Kuznetsova A, Brockhoff P, Christensen R (2017) lmerTest package: Tests in linear mixed effects models. Journal of Statistical Software 82(13):1–26, DOI 10.18637/jss.v082.i13

Laird N, Lange N, Stram D (1987) Maximum likelihood computations with repeated measures: Application of the EM algorithm. Journal of the American Statistical Association 82(397):97–105, DOI 10.1080/01621459.1987.10478395
Laird NM, Ware JH (1982) Random-effects models for longitudinal data. Biometrics 38(4):963–974

Lawlor D, Mishra G (2009) Family Matters: Designing, Analysing and Understanding Family Based Studies in Life Course Epidemiology. Life Course Approach to Adult Health, OUP Oxford
Li X, Guo N, Li Q (2019) Functional neuroimaging in the new era of big data. Genomics, Proteomics & Bioinformatics 17(4):393–401, DOI 10.1016/j.gpb.2018.11.005
Lindstrom MJ, Bates DM (1988) Newton-Raphson and EM algorithms for linear mixed-effects models for repeated-measures data. Journal of the American Statistical Association 83(404):1014–1022

Magnus JR, Neudecker H (1980) The elimination matrix: Some lemmas and applications. SIAM Journal on Algebraic Discrete Methods 1(4):422–449, DOI 10.1137/0601049
Magnus JR, Neudecker H (1986) Symmetry, 0-1 matrices, and Jacobians: A review. Econometric Theory p 46
Magnus JR, Neudecker H (1999) Matrix Differential Calculus with Applications in Statistics and Econometrics, rev. edn. Wiley Series in Probability and Statistics, John Wiley
Neudecker H, Wansbeek T (1983) Some results on commutation matrices, with statistical applications. Canadian Journal of Statistics 11(3):221–231, DOI 10.2307/3314625
Patterson HD, Thompson R (1971) Recovery of inter-block information when block sizes are unequal. Biometrika 58(3):545–554

Pinheiro J, Bates D (2009) Mixed-Effects Models in S and S-PLUS. Statistics and Computing, Springer
Pinheiro JC, Bates DM (1996) Unconstrained parametrizations for variance-covariance matrices. Statistics and Computing 6:289–296
Powell M (2009) The BOBYQA algorithm for bound constrained optimization without derivatives. Technical Report, Department of Applied Mathematics and Theoretical Physics
Powell MJD (1964) An efficient method for finding the minimum of a function of several variables without calculating derivatives. The Computer Journal 7(2):155–162, DOI 10.1093/comjnl/7.2.155, URL https://doi.org/10.1093/comjnl/7.2.155

Rao C, Mitra SK (1972) Generalized Inverse of Matrices and Its Applications. Probability and Statistics Series, Wiley
Raudenbush SW, Bryk AS (2002) Hierarchical Linear Models: Applications and Data Analysis Methods, 2nd edn. Advanced Quantitative Techniques in the Social Sciences 1, SAGE Publications
SAS Institute Inc (2015) SAS/STAT® 14.1 User's Guide: The MIXED Procedure. Cary, NC: SAS Institute Inc.
Satterthwaite FE (1946) An approximate distribution of estimates of variance components. Biometrics Bulletin 2(6):110–114

Smith SM, Nichols TE (2018) Statistical challenges in "big data" human neuroimaging. Neuron 97(2):263–268, DOI 10.1016/j.neuron.2017.12.018
Tibaldi FS, Verbeke G, Molenberghs G, Renard D, Van den Noortgate W, de Boeck P (2007) Conditional mixed models with crossed random effects. British Journal of Mathematical and Statistical Psychology 60(2):351–365, DOI 10.1348/000711006X110562
Turkington DA (2013) Generalized Vectorization, Cross-Products, and Matrix Calculus. Cambridge University Press, DOI 10.1017/CBO9781139424400
Turnbull BJ, Welsh ME, Heid CA, Davis W, Ratnofsky AC (1999) The Longitudinal Evaluation of School Change and Performance (LESCP) in Title I Schools. Interim report to Congress
Van Essen D, Smith S, Barch D, Behrens T, Yacoub E, Ugurbil K (2013) The WU-Minn Human Connectome Project: An overview. NeuroImage 80, DOI 10.1016/j.neuroimage.2013.05.041
Verbeke G, Molenberghs G (2001) Linear Mixed Models for Longitudinal Data. Springer Series in Statistics, Springer New York
Welch BL (1947) The generalization of 'Student's' problem when several different population variances are involved. Biometrika 34(1/2):28–35