Adjusted QMLE for the spatial autoregressive parameter
Federico Martellosio∗
University of Surrey, UK
[email protected]

Grant Hillier
CeMMAP and University of Southampton, UK
[email protected]
February 28, 2019
Abstract
One simple, and often very effective, way to attenuate the impact of nuisance parameters on maximum likelihood estimation of a parameter of interest is to recenter the profile score for that parameter. We apply this general principle to the quasi-maximum likelihood estimator (QMLE) of the autoregressive parameter λ in a spatial autoregression. The resulting estimator for λ has better finite sample properties compared to the QMLE for λ, especially in the presence of a large number of covariates. It can also solve the incidental parameter problem that arises, for example, in social interaction models with network fixed effects, or in spatial panel models with individual or time fixed effects. However, spatial autoregressions present specific challenges for this type of adjustment, because recentering the profile score may cause the adjusted estimate to be outside the usual parameter space for λ. Conditions for this to happen are given, and implications are discussed. For inference, we propose confidence intervals based on a Lugannani–Rice approximation to the distribution of the adjusted QMLE of λ. Based on our simulations, the coverage properties of these intervals are excellent even in models with a large number of covariates.

1 Introduction

Among the several difficulties posed by nuisance parameters, a fundamental problem in a frequentist framework is that the profile likelihood function for a parameter of interest is typically not a genuine likelihood, in the sense that it does not correspond to the density of any observable random variable. To tackle this issue, a number of modified profile likelihoods have been proposed (see, e.g., Laskar and King, 2001; Pace and Salvan, 2006). Such modified profile likelihoods are genuine likelihoods only in special cases, but they can often be interpreted as approximations to a genuine likelihood.
In practice, they tend to perform better than the profile likelihood, particularly when there is little information in the data about the nuisance parameters, which is likely to happen, for example, if the number of nuisance parameters is large relative to the sample size. The better performance of modified profile likelihoods is not necessarily captured by first-order asymptotic theory. Indeed, if the number of nuisance parameters does not depend on the sample size, modified profile likelihoods usually produce estimators that are first-order asymptotically equivalent to the maximum likelihood estimator (MLE). On the other hand, if the number of nuisance parameters increases with the sample size, a modified profile likelihood may be preferable even in terms of first-order asymptotic properties (see, e.g., Jiang, 1996; Sartori, 2003; Arellano and Hahn, 2007).

One consequence of a profile likelihood not being a genuine likelihood is that the expectation of the profile score is generally nonzero, which means that setting the profile score to zero does not provide an unbiased estimating equation. A simple modified profile likelihood is therefore obtained by centering the profile score. We refer to the modified profile likelihood obtained in this way as the adjusted profile likelihood, and to the associated MLE as the adjusted MLE. The principle underlying this type of adjustment has been motivated from various perspectives (e.g., Neyman and Scott, 1948; Conniffe, 1987; McCullagh and Tibshirani, 1990), and has been applied to several statistical models (e.g., Macaskill, 1993; Dhaene and Jochmans, 2015).

The present paper is concerned with the adjusted MLE of the autoregressive parameter λ in a spatial autoregression. Reliable estimation of λ is important in many applications in economics, as well as in other fields.

∗ Corresponding author. School of Economics, University of Surrey, Guildford, Surrey, GU2 7XH, UK. Tel: +44 (0) 1483 683473.
For example, in social interaction models λ captures the endogenous effect, the assessment of which may be crucial for policy purposes (see Manski, 1993; Moffitt, 2001; Lee et al., 2010). More precisely, we will consider the quasi-MLE (QMLE), that is, the MLE obtained by maximizing the Gaussian likelihood, but without assuming that the error distribution is truly Gaussian. The literature on estimating (cross-sectional or panel) spatial autoregressions is now very large, the QMLE being perhaps the most popular estimator. An early reference for the QMLE in spatial autoregressions is Ord (1975), whereas a rigorous first-order asymptotic analysis of the QMLE was given only much later, in an influential paper by Lee (2004a), under conditions that have become standard in the literature. Recently, higher-order approximations to the distribution of the QMLE of λ have become available (Robinson and Rossi, 2015), and, motivated by the fact that the QMLE of λ can suffer from substantial bias, a number of studies have suggested bias reduction procedures (e.g., Bao and Ullah, 2007; Bao, 2013; Yang, 2015). In fact, part of the bias in the QMLE of a parameter of interest can be attributed to the bias in the profile likelihood estimating equation, so centering the profile score can itself be interpreted as a bias reduction technique. Lee and Yu (2010), while discussing bias reduction in spatial panel data models, state explicitly that correction methods based on modifying the score “might be possible and would be of interest in future research” (see their footnote 24). Yu et al. (2015) show that, in the absence of incidental parameters, the adjusted QMLE of λ is first-order asymptotically equivalent to the QMLE of λ. Yu et al. (2015) also derive the second-order bias of the adjusted QMLE, and compare by simulation the adjusted QMLE with other bias reduction techniques.
Prior to that article, Durban and Currie (2000) had provided a preliminary investigation of the adjusted QMLE in a class of models that include spatial autoregressions. Notwithstanding the last two papers, the general distributional properties of the adjusted QMLE of λ remain unclear. Indeed, Yu et al. (2015) conclude their analysis by suggesting that the adjusted QMLE of λ “deserves a deep study in future”.

Our main contributions are as follows. First, on studying the properties of the adjusted profile likelihood, we find that the distributions of the QMLE of λ and its adjusted version may be supported on different intervals. This is due to the fact that, in spatial autoregressions, the parameter λ is usually restricted to a certain interval containing the origin. Such a restriction is incorporated in the QMLE (which, strictly speaking, should therefore be referred to as a restricted QMLE), but may be violated once the profile score is recentered. We discuss the implications of the supports being different, and give conditions for this to happen. Second, for inference about λ, we propose confidence intervals based on the Lugannani and Rice (1980) saddlepoint approximation to the cdf of the adjusted QMLE. Contrary to the commonly employed Wald confidence intervals, these confidence intervals have excellent coverage properties even when the dimension of the nuisance parameter is large. Third, we consider the social interaction model of Lee et al. (2010), and show that the adjusted QMLE solves the incidental parameter problem due to the network fixed effects without requiring the condition on W (row normalization) that is needed for the estimator in Lee et al. (2010). Fourth, we compare the QMLE and the adjusted QMLE by Monte Carlo simulation, and find evidence that the latter is preferable in a variety of circumstances.

The rest of the paper is structured as follows. Section 2 introduces the spatial autoregressive (SAR) model and the QMLE of λ.
Section 3 discusses the properties of the adjusted profile likelihood for λ, and introduces the confidence intervals based on the adjusted QMLE. Section 4 briefly considers the spatial error model, which is less popular than the SAR model in economic applications, but provides important motivation for the adjusted ML procedure. Section 5 contains simulation evidence on the performance of the adjusted QMLE and associated confidence intervals. Section 6 discusses extensions. Appendix A contains some auxiliary results needed for the proofs, which can be found in Appendix B. Various technical materials related to the paper can be found in the online Supplement.

Throughout the paper, all vectors and matrices are real valued unless otherwise indicated. The null space of a matrix A is denoted by null(A), the column space by col(A), and the orthogonal complement of col(A) by col⊥(A). Also, M_A denotes the orthogonal projector onto col⊥(A) (M_A := I_n − A(A′A)^{−1}A′ if A has full column rank). Finally, µ_{R^n} denotes the Lebesgue measure on R^n, and “a.s.” stands for almost surely, with respect to µ_{R^n}.

2 Preliminaries
2.1 The model

We consider the spatial autoregressive (SAR) model

y = λWy + Xβ + σε,   (2.1)

where y is the n × 1 vector of observations, λ is a scalar parameter, W is a spatial weights matrix, X is an n × k matrix of regressors with full column rank and with k ≤ n − 1, β ∈ R^k, σ is a positive scale parameter, and ε is a zero mean n × 1 random vector. We take X and W to be non-stochastic, and we assume that W is completely known. Alternatively, we could allow X and W to be random, and interpret the analysis as conditional on them, provided that they are independent of ε. Some of the columns of X may be spatial lags of some other columns, to allow for the estimation of, in the terminology of social network analysis, contextual effects. When there is no X, the model is called a pure SAR model. We allow for equation (2.1) to also represent a spatial panel data model, or a model in which individuals are divided into several networks. In both cases, W is a block diagonal matrix, with the number of blocks being given by, respectively, the number of time points and the number of networks. Additive fixed effects along one or both of the panel dimensions, or network fixed effects, can be added. For the purpose of estimation, fixed effects are treated as parameters, and hence can be included in β. Throughout the paper we assume that W has at least one (real) negative eigenvalue and at least one (real) positive eigenvalue. This assumption is virtually always satisfied in applications, especially because the diagonal entries of W are usually set to zero. The smallest real eigenvalue of W is denoted by ω_min, and the largest real eigenvalue is normalized, without loss of generality, to 1.

On rewriting equation (2.1) as S(λ)y = Xβ + σε, where S(λ) := I_n − λW, it is clear that in order for y to be uniquely determined, given X and ε, it is necessary that S(λ) is nonsingular.
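As a concrete illustration (our own sketch, not code from the paper: the path-graph W, the sample size, and the parameter values are all assumptions made for the example), model (2.1) can be simulated by solving S(λ)y = Xβ + σε, and the extreme real eigenvalues of W determine the parameter space discussed next:

```python
import numpy as np

# Minimal sketch of simulating model (2.1): y = lam*W*y + X*beta + sigma*eps,
# i.e. y = S(lam)^{-1}(X*beta + sigma*eps) with S(lam) = I_n - lam*W.
# The path-graph W, n, and parameter values are illustrative assumptions.
rng = np.random.default_rng(0)
n = 20
lam_true, sigma = 0.4, 1.0
beta = np.array([1.0, -0.5])

A = np.zeros((n, n))                      # adjacency of a path graph (zero diagonal)
for i in range(n - 1):
    A[i, i + 1] = A[i + 1, i] = 1.0
W = A / A.sum(axis=1, keepdims=True)      # row-normalized: largest eigenvalue is 1

X = np.column_stack([np.ones(n), rng.standard_normal(n)])   # intercept + covariate
eps = rng.standard_normal(n)

S = np.eye(n) - lam_true * W              # S(lam) = I_n - lam*W, nonsingular here
y = np.linalg.solve(S, X @ beta + sigma * eps)

# Extreme real eigenvalues of W, which determine the usual parameter space
# Lambda = (1/omega_min, 1) described in the text.
real_eigs = np.linalg.eigvals(W).real
omega_min, omega_max = real_eigs.min(), real_eigs.max()
```

For this bipartite path graph, ω_min = −1 and the largest eigenvalue is 1, as assumed in the paper's normalization.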
This requires λ ≠ ω^{−1}, for any nonzero real eigenvalue ω of W (nonreal complex eigenvalues of W do not need to be considered here, because λ is assumed to be real, and ω^{−1} is real if and only if ω is). In both applications and theoretical studies, the parameter space for λ is usually restricted much further, namely to the largest interval containing the origin in which S(λ) is nonsingular, that is,

Λ := (ω_min^{−1}, 1),

or a subset thereof (possibly independent of n) such as (−1, 1), one reason being that a model with λ outside Λ is difficult to interpret. From a large sample perspective, the restriction to (−1, 1) is often imposed, along with a uniform boundedness condition on W, to guarantee that the variances of the y_i's do not explode as n grows. We will also maintain the following assumption, which involves both W and X.

Assumption 1.
There is no real eigenvalue ω of W for which M_X(ωI_n − W) = 0.

To provide some intuition for Assumption 1, we note that the condition M_X(ωI_n − W) = 0 is equivalent to col(ωI_n − W) ⊆ col(X), and we distinguish two cases. First, if Assumption 1 is violated for the eigenvalue ω = 0 (i.e., col(W) ⊆ col(X)), it is evident from equation (2.1) that estimating β and λ separately must be problematic. Second, if Assumption 1 is violated for some eigenvalue ω ≠ 0, then for any y ∈ R^n it is possible to find a β ∈ R^k such that S(ω^{−1})y = Xβ, meaning that a SAR model with λ = ω^{−1} can provide perfect fit for any y. It is clear that in this case any sensible inferential procedure should suggest that λ = ω^{−1} and σ² = 0, for any y, and whatever the true values of λ and σ² are. The following example provides a simple illustration of Assumption 1.
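Before turning to the example, note that Assumption 1 is straightforward to check numerically for a given pair (W, X). The tolerance-based helper below is our own illustrative sketch, not code from the paper; the rank-one W used to trigger a violation at ω = 0 is likewise a constructed example.

```python
import numpy as np

def assumption1_holds(W, X, tol=1e-8):
    """False if some real eigenvalue w of W gives M_X (w*I_n - W) = 0,
    i.e. col(w*I_n - W) contained in col(X); True otherwise (numerical check)."""
    n = W.shape[0]
    Qx, _ = np.linalg.qr(X)
    M_X = np.eye(n) - Qx @ Qx.T            # orthogonal projector onto col(X)-perp
    eigs = np.linalg.eigvals(W)
    for w in eigs[np.abs(eigs.imag) < tol].real:
        if np.max(np.abs(M_X @ (w * np.eye(n) - W))) < tol:
            return False
    return True

n = 6
x = np.arange(1.0, n + 1.0)
X = np.column_stack([np.ones(n), x])       # intercept plus one covariate

# Violation at the eigenvalue w = 0: a symmetric rank-one W with col(W) in col(X).
W_bad = np.outer(x, x) / (x @ x)

# A generic zero-diagonal W (neighbours on a line) does not violate Assumption 1.
A = np.zeros((n, n))
for i in range(n - 1):
    A[i, i + 1] = A[i + 1, i] = 1.0
W_ok = A / A.sum(axis=1, keepdims=True)
```

For W_bad, every column lies in the span of x ⊆ col(X), so M_X W_bad = 0 and the check fails at ω = 0; for W_ok, every matrix ωI_n − W_ok has rank at least n − 1 > k, so its column space cannot fit inside col(X).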
Example 2.1 (Group interaction). There are R ≥ 2 groups, each consisting of m > 1 individuals, and each individual interacts uniformly with all other members of its own group. This type of interaction can be represented by the block-diagonal weights matrix

W = I_R ⊗ B_m,   (2.2)

where B_m := (m − 1)^{−1}(ι_m ι′_m − I_m), with ι_m an m × 1 vector of ones. For this W, ω_min = −(m − 1)^{−1} and col(ω_min I_n − W) = col(I_R ⊗ ι_m). Noting that I_R ⊗ ι_m is the design matrix of the group fixed effects, it follows that, in a SAR model with weights matrix (2.2), Assumption 1 is violated (for ω = ω_min) whenever group fixed effects are included in the model (recall that the fixed effects are treated as parameters to be estimated and hence are included in β). It should be noted that the presence of group intercepts does not cause a violation of Assumption 1 when the model is unbalanced, that is, not all groups have the same size.

Indeed, it is well known that the model of Example 2.1 suffers from an identifiability problem. Specifically, Lee (2007b) and Bramoullé et al. (2009) show that the parameters of a SAR model with weights matrix (2.2), group fixed effects, and contextual effects are not identifiable after removal of the group fixed effects by a within transformation. It is easily verified that, in this model, the identifiability problem occurs even without contextual effects.

(Λ and the entries of W are allowed to depend on n, although this is not emphasized in our notation. For the specific case of the QMLE, see part (i) of Lemma S.1.2 in Section S.1 of the Supplement. Section S.1 of the Supplement also contains further technical remarks about Assumption 1.)

2.2 The QMLE

We now define the QMLE of the parameters in model (2.1). By quasi-likelihood we mean the likelihood that would prevail under the condition ε ∼ N(0, I_n). Omitting additive constants, the quasi-log-likelihood is

l(β, σ², λ) := −(n/2) log(σ²) + log|det(S(λ))| − (1/(2σ²)) (S(λ)y − Xβ)′(S(λ)y − Xβ),   (2.3)

for any λ such that S(λ) is nonsingular.
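Equation (2.3) can be coded directly. As a sanity check of our own (not from the paper), note that under ε ∼ N(0, I_n) we have y ∼ N(S(λ)^{−1}Xβ, σ²(S(λ)′S(λ))^{−1}), so (2.3) must agree with the exact Gaussian log-density of y up to the omitted constant −(n/2) log(2π). The group sizes and parameter values below are illustrative assumptions; the weights matrix is the one of Example 2.1.

```python
import numpy as np
from scipy.stats import multivariate_normal

def loglik(beta, sig2, lam, y, X, W):
    """Quasi-log-likelihood (2.3), omitting the constant -(n/2)*log(2*pi)."""
    n = len(y)
    S = np.eye(n) - lam * W
    r = S @ y - X @ beta
    return -0.5 * n * np.log(sig2) + np.linalg.slogdet(S)[1] - 0.5 * (r @ r) / sig2

# Group-interaction weights of Example 2.1 (R and m are illustrative choices).
R, m = 2, 4
n = R * m
B = (np.ones((m, m)) - np.eye(m)) / (m - 1)     # B_m = (m-1)^{-1}(iota iota' - I_m)
W = np.kron(np.eye(R), B)                       # W = I_R kron B_m

rng = np.random.default_rng(1)
X = np.column_stack([np.ones(n), rng.standard_normal(n)])
y = rng.standard_normal(n)
beta, sig2, lam = np.array([0.3, -0.2]), 1.5, 0.4

# Exact Gaussian log-density of y versus (2.3) plus the omitted constant.
S = np.eye(n) - lam * W
exact = multivariate_normal(np.linalg.solve(S, X @ beta),
                            sig2 * np.linalg.inv(S.T @ S)).logpdf(y)
via_23 = loglik(beta, sig2, lam, y, X, W) - 0.5 * n * np.log(2 * np.pi)
```

The same snippet also confirms the eigenvalue claim of Example 2.1: the smallest real eigenvalue of this W is −(m − 1)^{−1}.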
To avoid tedious repetitions, in the remainder of the paper we will often omit the “quasi-” in front of “log-likelihood”.

The QMLE in most common use is the maximizer of l(β, σ², λ) under the condition that λ is in Λ (or in a subset thereof). That is, the QMLE is

(β̂_ML, σ̂²_ML, λ̂_ML) = argmax_{β ∈ R^k, σ² > 0, λ ∈ Λ} l(β, σ², λ).

Maximization with respect to β and σ² gives β̂_ML(λ) := (X′X)^{−1}X′S(λ)y and σ̂²(λ) := n^{−1} y′S′(λ)M_X S(λ)y. The corresponding profile, or concentrated, log-likelihood for λ is, again omitting additive constants,

l(λ) := l(β̂_ML(λ), σ̂²(λ), λ) = −(n/2) log(σ̂²(λ)) + log|det(S(λ))|.   (2.4)

The QMLE of λ can be equivalently defined as

λ̂_ML := argmax_{λ ∈ Λ} l(λ).   (2.5)

The function l(λ) is a.s. well defined (see Section S.2 in the Supplement), and, clearly, is continuously differentiable whenever it is well defined. Also, it is easy to see that, under Assumption 1, l(λ) a.s. goes to −∞ at each real zero of det(S(λ)) (cf. Hillier and Martellosio, 2018a). Thus, l(λ) has a.s. at least one critical point corresponding to a maximum in any interval between two consecutive real zeros of det(S(λ)). We also define the unrestricted QMLE λ̂_uML := argmax_{λ ∈ Λ_u} l(λ), where Λ_u := {λ ∈ R : det(S(λ)) ≠ 0}. The estimators λ̂_ML and λ̂_uML are different, because the global maximum of l(λ) over Λ_u is not necessarily in Λ, even when the true value of λ is in Λ.

3 The adjusted QMLE

A profile likelihood for a parameter of interest does not take into account the sampling variability associated to the estimation of the nuisance parameters, and hence, as mentioned in the introduction, is generally not a genuine likelihood. As a consequence, a profile score estimating equation is generally biased, which is likely to induce bias in the QMLE of the parameter of interest. The adjusted QMLE solves the estimating equation obtained by recentering the profile score.
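Before implementing the adjustment, the unadjusted QMLE of (2.4)–(2.5) can be sketched by a simple grid search over Λ. The design below (row-normalized path-graph W, intercept plus one covariate, true λ = 0.5) is our own illustrative assumption, not the paper's simulation design.

```python
import numpy as np

def profile_loglik(lam, y, X, W):
    """Profile log-likelihood l(lam) of (2.4), up to an additive constant."""
    n = len(y)
    S = np.eye(n) - lam * W
    Qx, _ = np.linalg.qr(X)
    Sy = S @ y
    r = Sy - Qx @ (Qx.T @ Sy)                   # M_X S(lam) y
    return -0.5 * n * np.log(r @ r / n) + np.linalg.slogdet(S)[1]

rng = np.random.default_rng(2)
n = 60
A = np.zeros((n, n))
for i in range(n - 1):
    A[i, i + 1] = A[i + 1, i] = 1.0
W = A / A.sum(axis=1, keepdims=True)            # row-normalized path graph
X = np.column_stack([np.ones(n), rng.standard_normal(n)])
y = np.linalg.solve(np.eye(n) - 0.5 * W,
                    X @ np.array([1.0, 1.0]) + rng.standard_normal(n))

# Lambda = (1/omega_min, 1) = (-1, 1) here; maximize l(lam) on a fine grid
# slightly inside the open interval.
omega_min = np.min(np.linalg.eigvals(W).real)
grid = np.linspace(1 / omega_min + 1e-3, 1 - 1e-3, 4001)
vals = np.array([profile_loglik(l, y, X, W) for l in grid])
lam_ml = grid[np.argmax(vals)]
```

In practice the grid maximizer would be refined by a derivative-based step; the grid suffices here because l(λ) is smooth between consecutive real zeros of det(S(λ)).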
3.1 The adjusted profile score

We now apply this general adjustment to the estimation of (σ², λ) in the SAR model. The score for (σ², λ) is centered assuming only that E(ε) = 0 and var(ε) = I_n. Note that we treat only β as the nuisance parameter, not (β, σ²), because an adjusted estimator of σ² is required for estimation of λ.

On concentrating just β out of the Gaussian log-likelihood (2.3), the profile log-likelihood for (σ², λ) is

l(σ², λ) := l(β̂_ML(λ), σ², λ) = −(n/2) log(σ²) + log|det(S(λ))| − (1/(2σ²)) y′S′(λ)M_X S(λ)y,   (3.1)

with profile score

s(σ², λ) := ( −n/(2σ²) + (1/(2σ⁴)) y′S′(λ)M_X S(λ)y ,  (1/σ²) y′W′M_X S(λ)y − tr(G(λ)) )′,   (3.2)

where G(λ) := W S^{−1}(λ). We now compute the expectation of s(σ², λ) under the SAR model y = λWy + Xβ + σε. Assuming that E(ε) = 0 and var(ε) = I_n, we have

E(s(σ², λ)) = ( −n/(2σ²) + (n − k)/(2σ²) ,  tr(M_X G(λ)) − tr(G(λ)) )′.   (3.3)

For a pure model, E(s(σ², λ)) = 0 for any λ such that S(λ) is nonsingular, meaning that the estimating equation s(σ², λ) = 0 is unbiased. When regressors are present, however, the unaccounted variability in the estimation of β causes the estimating equation s(σ², λ) = 0 to be biased. Note that the expectation (3.3) does not depend on β, so recentering the score is straightforward. The adjusted profile score for (σ², λ), defined as s_a(σ², λ) := s(σ², λ) − E(s(σ², λ)), is

s_a(σ², λ) = ( s_a1(σ², λ) , s_a2(σ², λ) )′ = ( −(n − k)/(2σ²) + (1/(2σ⁴)) y′S′(λ)M_X S(λ)y ,  (1/σ²) y′W′M_X S(λ)y − tr(M_X G(λ)) )′.   (3.4)

Setting s_a1(σ², λ) = 0 gives σ̂²_a(λ) := (n/(n − k)) σ̂²(λ). That is, recentering the score automatically delivers the usual degrees of freedom correction for σ̂²(λ).
The adjusted QMLE of λ must then be a zero of

s_a2(λ) := s_a2(σ̂²_a(λ), λ) = (n − k) [y′W′M_X S(λ)y] / [y′S′(λ)M_X S(λ)y] − tr(M_X G(λ)),   (3.5)

or, which is a.s. the same, must solve the estimating equation

y′S′(λ)R(λ)S(λ)y = 0,   (3.6)

where

R(λ) := M_X ( G(λ) − [tr(M_X G(λ)) / (n − k)] I_n ).

We emphasize that, by construction, the estimating equation (3.6) is exactly unbiased provided that E(ε) = 0 and var(ε) = I_n (no further distributional assumptions are required). Given the adjusted profile score s_a(σ², λ), one can define the function with gradient equal to s_a(σ², λ), which we refer to as the adjusted likelihood for (σ², λ), denoted by l_a(σ², λ). Letting Re[·] denote the real part of a complex number, the following result gives a closed form expression for l_a(σ², λ).

(Treating σ² as a nuisance parameter too, and consequently recentering the score for λ only (i.e., the score associated to the log-likelihood (2.4)), would not produce an adjusted estimator for σ². Also, the exact recentering of s(λ) would require stronger assumptions than what is required for the recentering of s(σ², λ); see Section S.7 of the Supplement. If the matrices X or W were stochastic, but independent of ε, we would be conditioning on them in the centering above. Finally, depending on the sample size and on the structure of W, λ̂_ML can be considerably biased even in the pure case, despite the fact that the estimating equation s(σ², λ) = 0 is unbiased in that case. The adjustment studied in this paper is not designed to reduce this type of bias in λ̂_ML. Instead, it aims at reducing the bias due to the nuisance parameter β.)

Proposition 3.1.
In a SAR model, the adjusted log-likelihood for (σ², λ), up to an additive constant, is

l_a(σ², λ) = −((n − k)/2) log(σ²) − (1/(2σ²)) y′S′(λ)M_X S(λ)y + Re[tr(M_X log S(λ))],   (3.7)

for any λ such that S(λ) is nonsingular, and for a suitable choice (made in the proof) of the branch of the matrix logarithm log S(λ).

From expression (3.7), we immediately obtain the adjusted likelihood for λ only,

l_a(λ) := l_a(σ̂²_a(λ), λ) = −((n − k)/2) log(y′S′(λ)M_X S(λ)y) + Re[tr(M_X log S(λ))]   (3.8)

(note that this is the likelihood with score s_a2(λ)). Equations (3.7) and (3.8) may be useful for graphical or optimization purposes, but are not used for the analytical results that follow in the paper. Indeed, even the results that (for simplicity and to aid intuition) are stated in terms of the adjusted likelihood, are proved using only the expression for the adjusted score, and could equally be formulated in terms of the adjusted score.

Remark 3.1.
Like its unadjusted version l(σ², λ), the adjusted profile log-likelihood l_a(σ², λ) is not, in general, a genuine likelihood function. Further adjustments could be implemented to make l_a(σ², λ) closer to a genuine likelihood. In particular, one could normalize s_a(σ², λ) to make it information unbiased (i.e., variance equal to minus the expectation of the derivative of the score), as suggested by McCullagh and Tibshirani (1990). Such adjustments might improve the performance of first-order asymptotic approximations to estimators or test statistics, but are not considered in this paper because they do not affect the location of the zeros of s_a(σ², λ).

So far, the adjusted QMLE of λ has been introduced as a zero of s_a2(λ), but of course s_a2(λ) may have many (real) zeros. The question therefore arises as to how exactly the adjusted QMLE should be defined. Recall from Section 2.2 that λ̂_ML is defined as the maximizer of l(λ) over Λ. One may therefore be tempted to define the adjusted QMLE of λ as the maximizer of l_a(λ) over Λ. However, we shall see that s_a2(λ) may have no zeros in Λ (or, equivalently, the adjusted profile likelihood l_a(λ) may have no maximum on Λ), so it makes sense to define the adjusted QMLE on an interval larger than Λ, Λ_a say. By analogy with the unadjusted case, we shall define Λ_a to be the shortest open interval containing the origin with the property that l_a(λ) → −∞ a.s. at both extremes of Λ_a. But first, in order to fully understand the problem, we need to study the behavior of the profile score s_a2(λ), or, equivalently, of l_a(λ), near the zeros of det(S(λ)).

3.2 l_a(λ) near a zero of det(S(λ))

Perhaps unexpectedly, the profile score s(σ², λ) and its adjusted version s_a(σ², λ) may have very different behavior as λ approaches a real zero of det(S(λ)).
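Before examining that behavior, the formulas of Section 3.1 can be verified numerically. The sketch below (our own illustrative design: path-graph W, intercept plus one covariate, true λ = 0.3) checks that (3.5) and (3.6) coincide, that the centering device tr(M_X(G(λ) − cI_n)) = 0 underlying the exact unbiasedness of (3.6) holds, and that s_a2(λ) is indeed the λ-derivative of l_a(λ) in (3.8).

```python
import numpy as np
from scipy.linalg import logm

rng = np.random.default_rng(3)
n = 40
A = np.zeros((n, n))
for i in range(n - 1):
    A[i, i + 1] = A[i + 1, i] = 1.0
W = A / A.sum(axis=1, keepdims=True)            # illustrative path-graph weights
X = np.column_stack([np.ones(n), rng.standard_normal(n)])
k = X.shape[1]
Qx, _ = np.linalg.qr(X)
M_X = np.eye(n) - Qx @ Qx.T
y = np.linalg.solve(np.eye(n) - 0.3 * W,
                    X @ np.array([1.0, 1.0]) + rng.standard_normal(n))

def s_a2(lam):
    """Adjusted profile score (3.5)."""
    S = np.eye(n) - lam * W
    G = W @ np.linalg.inv(S)                    # G(lam) = W S(lam)^{-1}
    Sy = S @ y
    return (n - k) * (y @ W.T @ M_X @ Sy) / (Sy @ M_X @ Sy) - np.trace(M_X @ G)

def l_a(lam):
    """Adjusted profile log-likelihood (3.8), up to an additive constant."""
    S = np.eye(n) - lam * W
    Sy = S @ y
    return -0.5 * (n - k) * np.log(Sy @ M_X @ Sy) + np.real(np.trace(M_X @ logm(S)))

lam = 0.3
S = np.eye(n) - lam * W
G = W @ np.linalg.inv(S)
Sy = S @ y
c = np.trace(M_X @ G) / (n - k)
R = M_X @ (G - c * np.eye(n))                   # R(lam) from (3.6)

# (3.5) and (3.6) agree: s_a2 = (n - k) * y'S'R S y / (y'S'M_X S y).
same = (n - k) * (y @ S.T @ R @ S @ y) / (Sy @ M_X @ Sy)

# Exact unbiasedness of (3.6) rests on tr(M_X (G - c I)) = 0 and M_X X = 0.
unbias = np.trace(M_X @ (G - c * np.eye(n)))

# s_a2 is the lam-derivative of l_a (central-difference check).
h = 1e-5
num_deriv = (l_a(lam + h) - l_a(lam - h)) / (2 * h)
```

Setting s_a1 = 0 likewise reproduces σ̂²_a(λ) = n/(n − k) · σ̂²(λ), the degrees-of-freedom correction noted in Section 3.1.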
Obviously, different behavior of s(σ², λ) and s_a(σ², λ) implies different behavior of the functions l(σ², λ) and l_a(σ², λ), and hence of their profile versions l(λ) and l_a(λ). For simplicity, we state the results in this section in terms of the univariate functions l(λ) and l_a(λ). We shall see that the different behavior of l(λ) and l_a(λ) accounts for important differences in the properties of the associated estimators for λ.

To start with, note that the analysis is straightforward for the pure model. In that case, we trivially have l_a(λ) = l(λ), and plugging σ̂²(λ) = n^{−1} y′S′(λ)S(λ)y in equation (2.4) reveals that l(λ) a.s. approaches −∞ near any real zero of det(S(λ)). The presence of regressors complicates the analysis. We shall confine attention to semisimple eigenvalues of W. An eigenvalue is said to be semisimple if its algebraic and geometric multiplicities are equal (for the matrix theoretic definitions and results used in this section see, for instance, Meyer, 2000). While it simplifies the analysis considerably, the restriction to semisimple eigenvalues does not imply a great loss of generality. For example, all eigenvalues of a diagonalizable matrix are semisimple, and any simple eigenvalue is semisimple (an eigenvalue being simple if it has algebraic multiplicity equal to one). The behavior of l_a(λ) close to the eigenvalue ω = 1 is often particularly important, and that eigenvalue is in most cases semisimple in applications. For example, it is semisimple when W is row stochastic (even if W is not diagonalizable, and the algebraic multiplicity of ω = 1 is larger than one), or when W is irreducible.

(A row stochastic matrix is a square nonnegative matrix whose row sums are all equal to 1. Irreducibility can be defined in terms of the graph of a matrix as follows.
Let the graph of an n × n matrix A be the directed graph on n vertices in which there is an edge from vertex i to vertex j if and only if A(i, j) ≠ 0. Also, call a graph strongly connected if there is a sequence of directed edges from any vertex i to any vertex j. Then, A is irreducible if and only if the graph of A is strongly connected; e.g., Meyer, 2000, p. 671.)

The next result gives conditions for l_a(λ) to diverge or be bounded near a real zero of det(S(λ)) in terms of a projector onto an eigenspace of W. Without the restriction to semisimple eigenvalues, the conditions would need to be stated in terms of projections onto generalized eigenspaces. If the eigenvalue ω of W is semisimple, then null(W − ωI_n) (the eigenspace of W associated to the eigenvalue ω) and col(W − ωI_n) are complementary subspaces of C^n, and therefore we can define a (unique) projector, denoted by Q_ω, onto null(W − ωI_n) along col(W − ωI_n). (The matrix Q_ω is a “spectral projector” of W; see, for instance, Meyer (2000), where explicit general representations for such projectors can also be found. Particular cases are given in the proof of Lemma 3.1 and in the proof of Lemma S.4.1 in the Supplement.) Recalling that the zeros of det(S(λ)) are the reciprocals of the nonzero eigenvalues of W, we can prove the following result.

Theorem 1. Suppose Assumption 1 holds. In a SAR model, for any semisimple nonzero real eigenvalue ω of W, lim_{λ→ω^{−1}} l_a(λ) is a.s.

(i) −∞ if tr(M_X Q_ω) > 0;
(ii) bounded if tr(M_X Q_ω) = 0;
(iii) +∞ if tr(M_X Q_ω) < 0.

We can now compare the behavior of the adjusted profile log-likelihood l_a(λ) near the real zeros of det(S(λ)), as established by Theorem 1, with the behavior of the unadjusted profile log-likelihood l(λ) near those points. Recall from Section 2.2 that lim_{λ→ω^{−1}} l(λ) = −∞ a.s., for any nonzero real eigenvalue ω of W (under Assumption 1). Thus, l(λ) and its adjusted version l_a(λ) have the same behavior near a point ω^{−1}, for a semisimple nonzero real eigenvalue ω, only in case (i) of Theorem 1. In case (ii), l_a(λ) can be extended to a function that is a.s. continuous at λ = ω^{−1}. This can be achieved by extending the domain of l_a(λ) to include ω^{−1}, and setting l_a(ω^{−1}) := lim_{λ→ω^{−1}} l_a(λ). From now on, when we say that l_a(λ) is continuous at λ = ω^{−1} we implicitly assume that this extension has been performed. In case (iii) of Theorem 1, l_a(λ) is unbounded from above near ω^{−1}. From unreported numerical experiments, it appears that tr(M_X Q_ω) < 0 is rare for pairs (W, X) that are likely to be encountered in applications. We shall also see shortly that tr(M_X Q_ω) < 0 is impossible when W is symmetric. (In line with this, for all simulation designs in Section 5 we find that l_a(λ) is always bounded from above on Λ_a. A complete understanding of when l_a(λ) may be unbounded from above would be of interest, but is left for future research. Note that the fact that l_a(λ) can be unbounded from above is not surprising, since l(λ) itself can be unbounded from above; see Lemma S.1.2 in the Supplement.)

Ruling out the pathological case (iii), it is useful to try and understand which of cases (i) and (ii) in Theorem 1 is likely to occur in applications. At first sight, the condition tr(M_X Q_ω) = 0 in case (ii) may look very restrictive. Indeed, Lemma A.1 in Appendix A shows that tr(M_X Q_ω) ≠ 0 for generic, in the measure theoretic sense, full column rank X (and for any fixed W). However, X typically contains an intercept (or group intercepts), and this implies that tr(M_X Q_ω) = 0 occurs, at ω = 1, for a very large class of matrices W used in practice. To see why this is the case, the next lemma provides a condition for tr(M_X Q_ω) = 0 in terms of the eigenspace null(W − ωI_n) (the eigenspace of W associated to the eigenvalue ω).

Lemma 3.1. (i) For any semisimple eigenvalue ω of W, tr(M_X Q_ω) = 0 if null(W − ωI_n) ⊆ col(X);
(ii) For any eigenvalue ω of a symmetric W, tr(M_X Q_ω) = 0 if null(W − ωI_n) ⊆ col(X), and tr(M_X Q_ω) > 0 otherwise.

Part (i) of Lemma 3.1 establishes that null(W − ωI_n) ⊆ col(X) is sufficient for case (ii) of Theorem 1 to apply, and hence for l(λ) and l_a(λ) to have different behaviour near λ = ω^{−1} (for any semisimple nonzero real eigenvalue ω of W, and provided that Assumption 1 holds). It turns out that this sufficient condition is very often satisfied in applications. Two examples are given next, the second one being a generalization of the first.

Example 3.1 (Row stochastic and irreducible weights matrix). In applications of spatial autoregressions, W is often row stochastic and irreducible (cf. footnote 6), and an intercept is included in the regression.
By the Perron–Frobenius Theorem (e.g., Horn and Johnson,1985, Theorem 8.4.4), ω = 1 is a simple (and hence semisimple) eigenvalue of W , and theassociated eigenspace null( W − I n ) is spanned by a vector of identical entries, and thereforeis in col( X ). It follows, under Assumption 1, that l ( λ ) a.s. approaches −∞ as λ →
1, while l_a(λ) is a.s. continuous at λ = 1.

Example 3.2 (Block diagonal weights matrix). Example 3.1 generalizes immediately to the case when W is a block diagonal matrix whose blocks are row-stochastic and irreducible matrices of, say, size m_r × m_r. That is, using direct sum notation, W = ⊕_{r=1}^R W_r. This situation arises, for instance, in a social interaction model on R networks (e.g., Lee et al., 2010), or in a spatial panel model where individuals are followed over time (in which case r is the time dimension; see, e.g., Lee and Yu, 2010). When W has this structure, the eigenspace null(W − I_n) is spanned by the columns of the (network or time) fixed effects matrix ⊕_{r=1}^R ι_{m_r}, and is therefore contained in col(X) as long as the regression contains those fixed effects. In that case, and provided that Assumption 1 holds, l(λ) a.s. approaches −∞ as λ → 1, while l_a(λ) is a.s. continuous at λ = 1.

According to part (ii) of Lemma 3.1, when W is symmetric, the condition null(W − ωI_n) ⊆ col(X) is also necessary for l_a(λ) to be a.s. continuous at λ = ω⁻¹. Thus, for symmetric W, Theorem 1 reduces to the simple statement that lim_{λ→ω⁻¹} l_a(λ) is a.s. bounded if null(W − ωI_n) ⊆ col(X), and equal to −∞ otherwise, for any semisimple nonzero real eigenvalue ω.

We end this section by providing a graphical comparison of l(λ) and l_a(λ). Consider a SAR model with weights matrix W equal to the row-normalized adjacency matrix of an Erdős–Rényi G(n, p) graph (Erdős and Rényi, 1959). The G(n, p) graph is a random graph on n vertices, with an edge between any two vertices being present with probability p, independently of every other edge. Suppose that the regression contains an intercept, and that, for simplicity, the graph is connected. Then, since W is row-stochastic and irreducible, l_a(λ) is a.s. continuous at λ = 1 and a.s. approaches −∞ as λ approaches any other singularity of S(λ). Figure 1 displays l(λ) and l_a(λ), for one random draw of G(n, p) and for one random draw from the intercept-only model y = λ₀Wy + ι_n + ε, with ε ∼ N(0, I_n), for a fixed value λ₀. The log-likelihood functions are plotted for λ ∈ (0, ω⁻¹), where ω is the third largest eigenvalue of W, and l_a(λ) has been lowered so that the two likelihoods have the same maximum value. Note that, contrary to l(λ), l_a(λ) does not go to −∞ as λ → 1.

Figure 1: The profile likelihood l(λ) (dashed) and its adjusted version l_a(λ) (solid) for a SAR model on a connected G(100, 0.05) graph.
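To make the setting of Figure 1 concrete, the following sketch (NumPy only) builds a row-normalized W from a connected G(n, p) draw and evaluates the unadjusted profile log-likelihood l(λ) = −(n/2) log(y′S′(λ)M_X S(λ)y) + log|det S(λ)| (up to a constant). The draw y = λ₀Wy + ι_n + ε uses a hypothetical λ₀ = 0.5, since the value used for the paper's figure is not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 0.05

# draw Erdos-Renyi G(n, p) graphs until a connected one appears
while True:
    A = np.triu((rng.random((n, n)) < p).astype(float), 1)
    A = A + A.T
    # (I + A)^(n-1) has all entries positive iff the graph is connected
    if (np.linalg.matrix_power(np.eye(n) + A, n - 1) > 0).all():
        break

W = A / A.sum(axis=1, keepdims=True)   # row normalization: W is row-stochastic

# intercept-only design, M_X = I - (1/n) iota iota'
M = np.eye(n) - np.ones((n, n)) / n

# one draw from y = lam0*W*y + iota + eps (lam0 = 0.5 is hypothetical)
lam0 = 0.5
y = np.linalg.solve(np.eye(n) - lam0 * W, np.ones(n) + rng.standard_normal(n))

def profile_loglik(lam):
    """l(lam) up to an additive constant: -(n/2) log(y'S'M_X S y) + log|det S(lam)|."""
    S = np.eye(n) - lam * W
    r = S @ y
    return -0.5 * n * np.log(r @ M @ r) + np.linalg.slogdet(S)[1]

# zeros of det(S(lam)) sit at the reciprocals of the real eigenvalues of W;
# for a row-stochastic irreducible W the smallest positive one is lam = 1
real_ev = np.sort(np.linalg.eigvals(W).real)
print("smallest positive singularity:", 1 / real_ev[-1])
print("l(0.5) =", profile_loglik(0.5))
```

Evaluating `profile_loglik` on a grid over (0, 1) reproduces the qualitative shape of the dashed curve in Figure 1; l(λ) diverges to −∞ at the singularity λ = 1, which is where l_a(λ) (not implemented here, as its exact form requires the recentering term of Section 3) instead remains bounded.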
Theorem 1 establishes that the adjusted profile log-likelihood l_a(λ) may, in contrast to l(λ), be a.s. continuous at the extremes of the parameter space Λ. As a consequence, there is no guarantee that l_a(λ) has a maximum over Λ, which suggests that l_a(λ) should be maximized over a larger set, Λ_a say. As anticipated in Section 3.1, it is natural to define Λ_a as the shortest open interval containing the origin with the property that l_a(λ) → −∞ a.s. at both extremes of Λ_a. For example, in the case of Figure 1, Λ = (ω_min⁻¹, 1) and Λ_a = (ω_min⁻¹, ω₂⁻¹). Note that the extremes of Λ_a must always be zeros of det(S(λ)), because l_a(λ) is a.s. continuous between consecutive real zeros of det(S(λ)). Hence, our definition of Λ_a requires the following assumption.

(For the particular random draw of ε underlying Figure 1, λ̂_ML and its adjusted version λ̂_aML (defined as in Section 3.3 below) are about 0.478 and 0.506, respectively. Over 10 draws from ε ∼ N(0, I_n) (and for the same draw of G(n, p) used for Figure 1), empirical bias and RMSE are −0.039 and 0.128 for λ̂_ML, and −0.0·· and 0.126 for λ̂_aML.)

Assumption 2. There is at least one negative zero and at least one positive zero of det(S(λ)) such that l_a(λ) → −∞ a.s. as λ approaches those zeros.

It is clear from Section 3.2 that Assumption 2 can be violated only in very special cases. In those cases, one could take the left (resp., right) endpoint of Λ_a to be −∞ (resp., +∞), but we refrain from doing this, for simplicity.

It is natural to define the adjusted QMLE of λ as the maximizer of l_a(λ) over Λ_a, that is,

λ̂_aML := argmax_{λ ∈ Λ_a} l_a(λ).

Maximization of l_a(λ) over a subset of Λ_a would yield a censored version of λ̂_aML. In particular, let Λ̄ be Λ augmented with one of its endpoints if l_a(λ) is bounded near that endpoint (augmented with both endpoints if l_a(λ) is bounded near both endpoints), and let λ̄_aML := argmax_{λ ∈ Λ̄} l_a(λ) be the estimator obtained by maximizing l_a(λ) over Λ̄. Since Λ̄ ⊆ Λ_a, λ̄_aML is a censored version of λ̂_aML.

Note that the set Λ_a, contrary to Λ, may depend on X (because, by Theorem 1, whether or not l_a(λ) → −∞ a.s. at some zero of det(S(λ)) depends on X). For a fixed X, Λ_a = Λ if l_a(λ) → −∞ a.s. as λ approaches the extremes of Λ, and Λ_a ⊃ Λ otherwise. But it is also possible to compare Λ and Λ_a for generic X (in the measure-theoretic sense), or better, since the model typically contains an intercept, for a generic matrix X containing an intercept. The following example is important.

Example 3.3.
Suppose W is row-stochastic and irreducible, and that an intercept is included in the model. Let X = (ι_n, X̃) and 𝒳 := {X̃ ∈ R^{n×(k−1)} : rank(X) = k}. We know from Example 3.1 that, in this case, l_a(λ) is a.s. continuous at λ = 1, for all X̃ ∈ 𝒳. Consider now an arbitrary semisimple nonzero real eigenvalue ω ≠ 1 of W. By Lemma A.1 in Appendix A and Theorem 1, l_a(λ) is a.s. unbounded from below or from above near ω⁻¹ for μ_{R^{n×(k−1)}}-almost every X̃ ∈ 𝒳. Assuming that ω_min and the second largest (positive) eigenvalue of W, denoted by ω₂, are semisimple, it follows that Λ_a = (ω_min⁻¹, ω₂⁻¹), for μ_{R^{n×(k−1)}}-almost every X̃ ∈ 𝒳 \ {P_{ω_min} ∪ P_{ω₂}}, where P_ω is the set of pathological X̃ such that l_a(λ) is a.s. unbounded from above near ω⁻¹ (i.e., P_ω := {X̃ ∈ R^{n×(k−1)} : rank(X) = k and tr(M_X Q_ω) < 0}). As discussed earlier, P_ω is expected to be very small or even μ_{R^{n×(k−1)}}-null in cases of interest in applications. By Lemma 3.1, P_ω is empty if W is symmetric, so in that case Λ_a = (ω_min⁻¹, ω₂⁻¹) for μ_{R^{n×(k−1)}}-almost every X̃ ∈ 𝒳.

(The fact that l_a(λ) is a.s. continuous between consecutive real zeros of det(S(λ)) also means that, ignoring the pathological cases in which l_a(λ) is a.s. unbounded from above, Λ_a is the smallest open set containing the origin on which l_a(λ) is guaranteed to have a maximum a.s.)

Some remarks on the result Λ_a = (ω_min⁻¹, ω₂⁻¹) for generic non-pathological X̃ are in order. First, note that ω₂⁻¹ > 1, regardless of X. Thus, how much larger Λ_a is compared to Λ depends only on the eigenvalue gap 1 − ω₂. There is considerable evidence in the graph theory literature that the eigenvalue gap 1 − ω₂ tends to be large when the graph underlying W has good connectivity and randomness properties (especially when W is the normalized adjacency matrix of a regular, undirected, loopless graph; see, e.g., Brouwer and Haemers, 2011, Chapter 4), and this is indeed something we will come upon in our Monte Carlo experiments later. Second, on replacing the intercept ι_n with group intercepts ⊕_{r=1}^R ι_{m_r}, as in Example 3.2, the result in Example 3.3 generalizes immediately to the case when W is block diagonal with row-stochastic and irreducible blocks. Third, the argument used in Example 3.3 can also be applied to compare Λ_a and Λ for weights matrices that are not row-stochastic and irreducible (or are not block diagonal with row-stochastic and irreducible blocks). To do this, note that the result Λ_a = (ω_min⁻¹, ω₂⁻¹) for generic non-pathological X̃ in Example 3.3 arises because, in that case, ι_n ∈ col(X) and ι_n spans the eigenspace null(W − I_n), so that the condition in part (i) of Lemma 3.1 is satisfied. When W is not row-stochastic and irreducible, such a special interaction between col(X) and W will typically not occur, and as a result Λ_a = Λ = (ω_min⁻¹, 1) for generic non-pathological X̃. This is most easily seen when W is symmetric. In that case, by part (ii) of Lemma 3.1 and Theorem 1, Λ_a is different from Λ only if null(W − I_n) or null(W − ω_min I_n) are in col(X). In general there is no reason why col(X) should contain those eigenspaces.

The original motivation for seeking an unbiased estimating equation is, of course, bias reduction, or, more generally, improved inference on λ (and possibly σ²). The simulation evidence reported in Section 5 below is unequivocal that the hoped-for bias reduction is certainly achieved, and without detriment to the mean squared error. However, there is one caveat that must be mentioned here, which we discuss next.

We have seen that the intervals Λ and Λ_a on which the distributions of λ̂_ML and λ̂_aML are supported are different in many cases of interest. That is, there are circumstances in which λ̂_aML lies outside Λ. This raises the question of whether, from the point of view of interpreting the parameter λ, that outcome is acceptable. Ultimately, this will depend on the context, but censoring the estimator to ensure that it lies in Λ will clearly entail a sacrifice in terms of bias. The extent of the censoring would usually be greater the larger is λ in absolute value, and would also depend on characteristics of W such as its sparseness (see the simulation evidence in Section 5).

(The situation is somewhat similar to, but more subtle than, that in which the MLE for a variance (or covariance matrix) satisfies the expected non-negativity (or positive definiteness) requirement, but a bias-corrected version of it does not. For example, subtraction of the estimated first (bias) term in an asymptotic expansion for the expectation of the MLE can have this undesirable outcome. We also note that several other estimators of λ, for example IV, GMM, or indirect inference estimators, may have support larger than Λ; see, e.g., Kelejian and Prucha, 1998; Lee, 2007a; Kyriacou et al., 2017.)
When λ̂_aML is outside Λ, often the unrestricted maximizer of l(λ) is also outside Λ (i.e., λ̂_ML ≠ λ̂_uML). This suggests that, whenever Λ_a ≠ Λ, λ̂_aML should be regarded as a modification of the QMLE that maximizes l(λ) over Λ_a, not over Λ.

Confidence intervals for a parameter of interest based on the QMLE may perform poorly if the data contain little information about the nuisance parameters. This may be the case, for instance, when the number of nuisance parameters is large relative to the sample size. The theory presented in the paper so far allows the construction of confidence intervals for λ with accurate coverage even when there is little information about β.

We say that a differentiable function is single-peaked on an open interval if, on that interval, it has a maximum and no other stationary points corresponding to minima or maxima (so the function is non-decreasing to the left of the peak, and non-increasing to the right of it). We shall see later in this subsection that l_a(λ) is single-peaked on Λ_a quite generally (Proposition 3.2), but before doing that we show how single-peakedness can be used to construct confidence intervals for λ. If l_a(λ) is single-peaked on Λ_a, then the cdf of λ̂_aML admits the representation

Pr(λ̂_aML ≤ z; β, σ², λ) = Pr(y′S′(z)R(z)S(z)y ≤ 0),   (3.9)

for any z ∈ Λ_a. That is, the cdf of λ̂_aML at z equals the cdf of the quadratic form y′S′(z)R(z)S(z)y at zero. Thus, one can use some approximation to the cdf (at zero) of y′S′(z)R(z)S(z)y to obtain an approximation to the cdf (at z) of λ̂_aML. Since the cumulant generating function of a quadratic form is available in closed form, at least under normality, one natural candidate is the Lugannani–Rice saddlepoint approximation (Lugannani and Rice, 1980).
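The saddlepoint ingredient can be sketched generically. For y ∼ N(0, I_n) and a symmetric indefinite matrix A with eigenvalues λ_i, the cgf of y′Ay is K(t) = −½ Σ_i log(1 − 2tλ_i), and the Lugannani–Rice formula approximates Pr(y′Ay ≤ 0) from the saddlepoint t̂ solving K′(t̂) = 0. In (3.9), the matrix S′(z)R(z)S(z) would play the role of A; here a generic spectrum is used purely for illustration.

```python
import numpy as np
from scipy.optimize import brentq
from scipy.stats import norm

def lr_cdf_at_zero(eigs):
    """Lugannani-Rice approximation to Pr(y'Ay <= 0) for y ~ N(0, I),
    where eigs are the eigenvalues of the symmetric matrix A.
    The cgf of y'Ay is K(t) = -0.5 * sum log(1 - 2 t lam_i)."""
    lam = np.asarray(eigs, float)
    assert lam.min() < 0 < lam.max() and lam.sum() != 0  # indefinite, saddlepoint != 0
    t_lo, t_hi = 1 / (2 * lam.min()), 1 / (2 * lam.max())  # cgf domain endpoints
    K  = lambda t: -0.5 * np.sum(np.log1p(-2 * t * lam))
    K1 = lambda t: np.sum(lam / (1 - 2 * t * lam))
    K2 = lambda t: np.sum(2 * lam ** 2 / (1 - 2 * t * lam) ** 2)
    # K' runs from -inf to +inf on the domain, so the saddlepoint is bracketed
    that = brentq(K1, t_lo * (1 - 1e-9), t_hi * (1 - 1e-9))
    w = np.sign(that) * np.sqrt(-2 * K(that))   # at x = 0: w = sign(t)sqrt(2(t*0 - K(t)))
    u = that * np.sqrt(K2(that))
    return norm.cdf(w) + norm.pdf(w) * (1 / w - 1 / u)

# example: A = diag(1.5, -1, -1, -1), i.e. Pr(1.5 z1^2 <= z2^2 + z3^2 + z4^2)
approx = lr_cdf_at_zero([1.5, -1.0, -1.0, -1.0])
rng = np.random.default_rng(1)
z = rng.standard_normal((200_000, 4))
mc = np.mean(z ** 2 @ np.array([1.5, -1.0, -1.0, -1.0]) <= 0)
print(round(approx, 3), round(mc, 3))
```

Even with only four eigenvalues the approximation agrees with the Monte Carlo probability to roughly two decimal places, consistent with the accuracy claims discussed below.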
In fact, using a saddlepoint approximation to the cdf of a score to recoverthe cdf of the corresponding estimator had already been investigated in Daniels (1983) (seealso Butler, 2007, Chapter 12). The approximate cdf of ˆ λ aML can then be inverted to obtainconfidence intervals for λ . This approach to constructing confidence intervals has been appliedby Hillier and Martellosio (2018a) to the (unadjusted) QMLE of λ . Whilst revising thepresent paper we discovered that essentially the same approach had previously been suggestedby Paige et al. (2009) for general quadratic estimating equations, and then by Jeganathanet al. (2015) specifically for some spatial models. Remark 3.2.
Let θ̂ be an estimator obtained from the estimating equation q(θ) = 0. Paige et al. (2009) use the device Pr(θ̂ ≤ z) = Pr(q(z) ≤ 0) under the assumption that q(θ) is monotonically decreasing in θ, whereas we have used it under the more general assumption that the function with derivative q(θ) is single-peaked. It is worth pointing out that, for the case of λ̂_ML in a SAR model, the difference between the two assumptions is immaterial when all the eigenvalues of W are real. Indeed, the score associated with the profile log-likelihood (2.4) is s(λ) := n(y′W′M_X S_λ y)/(y′S′_λ M_X S_λ y) − tr(G_λ), and hence the equation s(λ) = 0 is a.s. equivalent to f(λ) = 0, where the function f(λ) := n y′W′M_X S_λ y − tr(G_λ) y′S′_λ M_X S_λ y is monotonically decreasing in λ if all eigenvalues of W are real (see Li et al., 2013).

(The notation Pr(λ̂_aML ≤ z; β, σ², λ) emphasizes that, of course, we are assuming that y is generated by the SAR model (2.1). But it is worth remarking that, for a general random vector y, we would still have Pr(λ̂_aML ≤ z) = Pr(y′S′(z)R(z)S(z)y ≤ 0).)

Let P̃r(λ̂_aML ≤ z; β, σ², λ) denote the approximation to the cdf of λ̂_aML obtained by the Lugannani–Rice formula (see Section S.6 of the Supplement for details). The parameters β and σ² in the approximation can be replaced with their QMLEs given λ, β̂_ML(λ) and σ̂²(λ), to give the approximation P̃r(λ̂_aML ≤ z; λ) := P̃r(λ̂_aML ≤ z; β̂_ML(λ), σ̂²(λ), λ). Confidence intervals for λ based on λ̂_aML can then be constructed by inverting P̃r(λ̂_aML ≤ z; λ). More specifically, replace z with the observed value of λ̂_aML, say λ̂_aML^obs, and let λ̲ := inf{λ : P̃r(λ̂_aML ≤ λ̂_aML^obs; λ) = 1 − α₁} and λ̄ := sup{λ : P̃r(λ̂_aML ≤ λ̂_aML^obs; λ) = α₂}. Then (λ̲, λ̄) is an approximate 100(1 − α₁ − α₂)% two-sided confidence interval for λ.

The Lugannani–Rice approximation we employ is, as is frequently the case, a normal-based one. That is, it is constructed from the cumulant generating function of y′S′(z)R(z)S(z)y that obtains if ε ∼ N(0, I_n). It is important to emphasize, however, that we intend the approximation to be used generally.
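The monotone reformulation in Remark 3.2 suggests computing the (unadjusted) QMLE by root-finding on f(λ). A sketch with a symmetric (hence real-eigenvalued), spectrally normalized W and simulated data; the design values n, k, and λ₀ are illustrative only.

```python
import numpy as np
from scipy.optimize import brentq

rng = np.random.default_rng(2)
n, k = 60, 3

# symmetric W (all eigenvalues real), rescaled by its spectral radius
A = np.triu((rng.random((n, n)) < 0.1).astype(float), 1)
A = A + A.T
W = A / np.abs(np.linalg.eigvalsh(A)).max()

X = np.column_stack([np.ones(n), rng.standard_normal((n, k - 1))])
M = np.eye(n) - X @ np.linalg.solve(X.T @ X, X.T)          # M_X

lam0 = 0.4                                                 # illustrative true value
y = np.linalg.solve(np.eye(n) - lam0 * W, X @ np.ones(k) + rng.standard_normal(n))

def f(lam):
    """f(lam) = n y'W'M_X S_lam y - tr(G_lam) y'S_lam'M_X S_lam y; same zeros as s(lam)."""
    S = np.eye(n) - lam * W
    r = S @ y
    G = W @ np.linalg.inv(S)
    return n * ((W @ y) @ M @ r) - np.trace(G) * (r @ M @ r)

# f is monotonically decreasing when all eigenvalues of W are real (Li et al., 2013),
# so the QMLE is the unique root of f on Lambda = (1/omega_min, 1)
lam_min = np.linalg.eigvalsh(W).min()
lam_hat = brentq(f, 1 / lam_min + 1e-6, 1 - 1e-6)
print("QMLE of lambda:", lam_hat)
```

Because tr(G_λ) diverges to −∞ and +∞ at the two endpoints of Λ, f changes sign over the bracket, so `brentq` is guaranteed a root; monotonicity of f makes that root unique.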
Indeed, there is considerable evidence in the literature that a normal-based Lugannani–Rice approximation is typically very accurate for the distribution of a statistic that has a limiting normal distribution. There is also evidence that a normal-based Lugannani–Rice approximation works generally very well for the distribution of a quadratic form in nonnormal variables as long as we are far from the lower tail of the distribution (Wood et al., 1993). This seems to be relevant in our case, as we are only interested in the cdf of y′S′(z)R(z)S(z)y at 0, and R(z) is indefinite.

(Note that the set {λ : α₂ ≤ P̃r(λ̂_aML ≤ z; λ) ≤ 1 − α₁} is in general a union of intervals. It is an interval if P̃r(λ̂_aML ≤ z; λ) is monotonic in λ. It seems reasonable to expect this monotonicity to hold in many cases, at least approximately.)

The key condition for representation (3.9) to hold is that l_a(λ) is single-peaked on Λ_a. We now discuss this condition. The following very mild assumption allows us to make use of the results derived earlier for semisimple eigenvalues.

Assumption 3. Every eigenvalue of W corresponding to a zero of det(S(λ)) in Λ_a is semisimple.

(When Λ_a = Λ, there is nothing to assume, because in that case Λ_a does not contain any zero of det(S(λ)). When Λ_a ⊃ Λ, in most cases of practical interest the only zero of det(S(λ)) in Λ_a is the one corresponding to the eigenvalue 1, which, as pointed out earlier, is virtually always semisimple in applications.)

Proposition 3.2. Suppose Assumptions 1, 2, and 3 hold. If

(n − k) tr(M_X G²(λ)) > [tr(M_X G(λ))]²  for all λ ∈ Λ_a such that det(S(λ)) ≠ 0,   (3.10)

then the adjusted profile log-likelihood function l_a(λ) is a.s. single-peaked on Λ_a.

It turns out that if W is symmetric, condition (3.10) is satisfied for any X (see Lemma A.2 in Appendix A), and hence l_a(λ) is a.s. single-peaked on Λ_a in that case. If W is not symmetric, the condition depends on both W and X. Extensive numerical experimentation with nonsymmetric W suggests that, for the vast majority of pairs (W, X) likely to be met in applications, l_a(λ) is a.s. single-peaked on Λ_a. And, for pairs (W, X) such that condition (3.10) is not satisfied, the probability (for some distribution of y that is absolutely continuous w.r.t. μ_{R^n}) that l_a(λ) is multi-peaked is generally very small. Note that the right-hand side of representation (3.9) is likely to provide a good approximation to the cdf of λ̂_aML whenever the probability that l_a(λ) is multi-peaked is nonzero but small. The performance of the saddlepoint confidence intervals is assessed by numerical simulation in Section 5. Naturally, the confidence intervals can be inverted to obtain hypothesis tests on λ, but we do not investigate the power properties of such tests here.
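A condition of the form (3.10) is easy to check numerically for a given pair (W, X). In the sketch below, the helper `condition_310` is ours; it simply evaluates the inequality (n − k) tr(M_X G²(λ)) > [tr(M_X G(λ))]² over a grid of λ values, and also illustrates the symmetric case, where the condition holds for any X.

```python
import numpy as np

def condition_310(W, X, lam_grid):
    """Evaluate (n-k) tr(M_X G(lam)^2) > [tr(M_X G(lam))]^2 over a grid of lambdas."""
    n, k = X.shape
    M = np.eye(n) - X @ np.linalg.solve(X.T @ X, X.T)
    out = []
    for lam in lam_grid:
        G = W @ np.linalg.inv(np.eye(n) - lam * W)
        out.append((n - k) * np.trace(M @ G @ G) > np.trace(M @ G) ** 2)
    return np.array(out)

rng = np.random.default_rng(3)
n = 40
X = np.column_stack([np.ones(n), rng.standard_normal((n, 2))])
grid = np.linspace(-0.9, 0.9, 19)

# nonsymmetric, row-normalized W: the condition depends on the pair (W, X)
A = (rng.random((n, n)) < 0.15).astype(float)
np.fill_diagonal(A, 0.0)
A[A.sum(axis=1) == 0, 0] = 1.0                      # guard against zero rows
W = A / A.sum(axis=1, keepdims=True)
print("row-normalized W:", condition_310(W, X, grid).all())

# symmetric W: the condition holds for any X (Lemma A.2)
As = np.triu((rng.random((n, n)) < 0.15).astype(float), 1)
As = As + As.T
Ws = As / np.abs(np.linalg.eigvalsh(As)).max()
print("symmetric W:", condition_310(Ws, X, grid).all())
```

The grid here stays inside the interval where S(λ) is invertible; in practice one would take the grid over Λ_a for the specific W at hand.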
In this section we apply the score adjustment to a social interaction model with network fixed effects and contextual effects (e.g., Lee, 2007b; Bramoullé et al., 2009; Lee et al., 2010; Lin, 2015; Fortin and Yazbeck, 2015). There are R networks, with network r having m_r individuals. The model is

y_r = λW_r y_r + X̃_r γ + W_r X̃_r δ + α_r ι_{m_r} + σε_r,   r = 1, ..., R,   (3.11)

where W_r is the weights matrix of network r, α_r is an unobserved network fixed effect, X̃_r is an m_r × k̃ matrix of regressors, and γ and δ are k̃ × 1 parameter vectors. In stacked form, y = (y₁′, ..., y_R′)′, W = ⊕_{r=1}^R W_r, β = (γ′, δ′, α₁, ..., α_R)′, ε = (ε₁′, ..., ε_R′)′, and X = (X̃, WX̃, ⊕_{r=1}^R ι_{m_r}), with X̃ := (X̃₁′, ..., X̃_R′)′. We assume that X has full column rank.

To avoid the incidental parameter problem that arises in the case of many networks of fixed dimensions, inference in this model usually proceeds by first eliminating the network fixed effects. Define the orthogonal projector M_r := I_{m_r} − (1/m_r) ι_{m_r} ι′_{m_r}, and let F_r be the m_r × (m_r − 1) matrix of orthonormal eigenvectors of M_r corresponding to the eigenvalue 1, so that F′_r F_r = I_{m_r−1}, F_r F′_r = M_r, and F′_r ι_{m_r} = 0. In a likelihood framework, the standard procedure is that proposed by Lee et al. (2010), which eliminates the fixed effects by premultiplying each equation in (3.11) by F′_r. Under the condition that each weights matrix W_r has all row sums equal to one (i.e., W_r ι_{m_r} = ι_{m_r} for all r = 1, ..., R), we have F′_r W_r = F′_r W_r F_r F′_r, and therefore the transformed model is

y*_r = λW*_r y*_r + X̃*_r γ + W*_r X̃*_r δ + σε*_r,   r = 1, ..., R,   (3.12)

where y*_r := F′_r y_r, W*_r := F′_r W_r F_r, X̃*_r := F′_r X̃_r, and ε*_r := F′_r ε_r. Note that var(ε*_r) = I_{m_r−1} if var(ε_r) = I_{m_r}. We denote the profile (quasi) log-likelihood for (σ², λ) based on the transformed model (3.12) by l_LLL(σ², λ). If the condition W_r ι_{m_r} = ι_{m_r}, for each r = 1, ..., R, is not satisfied, the transformation by F′_r does not yield a reduced form and hence a likelihood. In that case, only non-likelihood procedures, such as the GMM of Liu and Lee (2010), are available.

(It is interesting to note that, for the unadjusted profile likelihood l(λ), single-peakedness over Λ holds if n tr(G²(λ)) > [tr(G(λ))]² for all λ ∈ Λ, a condition that does not depend on X; see Hillier and Martellosio, 2018a.)

The score adjustment discussed in the present paper provides an alternative likelihood solution to the large-R incidental parameter problem. It provides a large-R consistent estimator of all model parameters (see Remark 3.3), like the Lee et al. (2010) procedure, but, contrary to Lee et al. (2010), it does not require the constant row sums condition.
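The properties of F_r used above are easy to verify numerically. A sketch (NumPy; m and the example W are arbitrary) that builds F_r from the eigendecomposition of M_r and checks that, when W_r ι = ι, premultiplication by F′_r yields the reduced form (3.12):

```python
import numpy as np

def F_matrix(m):
    """m x (m-1) orthonormal eigenvectors of M = I - (1/m) iota iota' for eigenvalue 1."""
    M = np.eye(m) - np.ones((m, m)) / m
    vals, vecs = np.linalg.eigh(M)
    return vecs[:, vals > 0.5]          # keep the m-1 eigenvectors with eigenvalue 1

m = 6
F = F_matrix(m)
iota = np.ones(m)
M = np.eye(m) - np.outer(iota, iota) / m

assert np.allclose(F.T @ F, np.eye(m - 1))   # F'F = I_{m-1}
assert np.allclose(F @ F.T, M)               # FF' = M
assert np.allclose(F.T @ iota, 0)            # F'iota = 0

# for a row-stochastic W (W iota = iota), F'W = F'WFF', which is what makes
# premultiplication by F' deliver the reduced-form model (3.12)
rng = np.random.default_rng(4)
A = (rng.random((m, m)) < 0.6).astype(float)
np.fill_diagonal(A, 0.0)
A[A.sum(axis=1) == 0, 0] = 1.0               # guard against zero rows
W = A / A.sum(axis=1, keepdims=True)
assert np.allclose(F.T @ W, F.T @ W @ F @ F.T)
print("all identities verified")
```

The last identity fails for a generic W with non-constant row sums, which is exactly why the Lee et al. (2010) transformation requires W_r ι_{m_r} = ι_{m_r}.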
The score adjustment is performed exactly as for the general SAR model, treating the α_r's as parameters, profiling out the parameter β = (γ′, δ′, α₁, ..., α_R)′ as in equation (3.1), and recalling that the expectation (3.3) does not depend on β.

The next proposition shows that the score adjustment method is equivalent to the Lee et al. (2010) method if the latter applies (i.e., W_r ι_{m_r} = ι_{m_r} for all r) and there are no covariates (i.e., the model is y_r = λW_r y_r + α_r ι_{m_r} + σε_r, r = 1, ..., R).

Proposition 3.3.
Assume that, in model (3.11), there are no regressors and W_r ι_{m_r} = ι_{m_r}, for each r = 1, ..., R. Then l_a(σ², λ) = l_LLL(σ², λ), for any y ∈ R^n.

When covariates are present, the estimators of σ² and λ (and hence of β) obtained by the score adjustment method are different from the ones obtained by the Lee et al. (2010) method. Both methods solve the large-R incidental parameter problem, but, in addition to not requiring the constant row sums condition, the score adjustment approach also implements a correction that deals with the nuisance parameters γ and δ. The two methods are compared by simulation in Section 5.2.

Remark 3.3.
A formal consistency proof for the adjusted QMLE in model (3.11) would require stating several regularity conditions, which is not our aim here. Heuristically, however, consistency follows from a standard argument that we sketch here (see, e.g., Neyman and Scott, 1948; Dhaene and Jochmans, 2015). To start with, note that in SAR models without incidental parameters, the expectation E(s(σ², λ)) in equation (3.3) is O(1) as n → ∞ under standard conditions (see, e.g., Lee, 2004b, Lemma A.9), and therefore the necessary condition plim_{n→∞} n⁻¹ s(σ², λ) = 0 for (σ̂², λ̂_ML) to be consistent is satisfied. On the other hand, when the number of incidental parameters grows with R, as in model (3.11), and n increases only because R increases, plim_{n→∞} n⁻¹ s(σ², λ) ≠ 0, because the bias of the profile score s(σ², λ) is typically O(R); however, the fact that the expectation (3.3) is independent of the nuisance parameter β immediately implies that plim_{n→∞} n⁻¹ {s(σ², λ) − E(s(σ², λ))} = 0, which is essentially what delivers consistency of the adjusted QMLE.

(Our focus is on finite sample results. From the point of view of first-order asymptotics, we expect the two estimators to be equivalent under the conditions in Lee et al. (2010). Such conditions include, in particular, the fact that k̃ does not vary with n; if k̃ increased with n, the adjusted QMLE might have an advantage even from the point of view of first-order asymptotics.)

The spatial error model

y = Xβ + u,   u = λWu + σε,   (4.1)

is considerably less popular than the SAR model (2.1) in economic applications but, as we shall see, provides important motivation for the score adjustment considered in this paper. We now briefly consider this model, leaving all details to Section S.8 of the Supplement. It is convenient to generalize the error structure in model (4.1) to A(θ)u = σε, where A(θ) is a square matrix that is invertible for any value of the parameter θ in a subset Θ of R^p.
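As a preview of the adjusted likelihood for this error family (equation (4.4) below, with the spatial error choice A(θ) = I − θW), a numerical sketch; the simulated design (n, k, θ₀ = 0.3, and the normalization of W) is purely illustrative.

```python
import numpy as np

def adjusted_profile_loglik(theta, y, X, W):
    """l_a(theta) of (4.4) with A(theta) = I - theta*W:
    -((n-k)/2) log(y'U y) + log|det A| - (1/2) log det(X'A'A X),
    where U = A' M_{AX} A and M_{AX} projects off col(A X)."""
    n, k = X.shape
    A = np.eye(n) - theta * W
    AX = A @ X
    M = np.eye(n) - AX @ np.linalg.solve(AX.T @ AX, AX.T)
    r = M @ (A @ y)                     # y'U y = (M A y)'(M A y) since M is idempotent
    return (-0.5 * (n - k) * np.log(r @ r)
            + np.linalg.slogdet(A)[1]
            - 0.5 * np.linalg.slogdet(AX.T @ AX)[1])

# illustrative simulated design with theta0 = 0.3
rng = np.random.default_rng(5)
n, k = 80, 3
A0 = (rng.random((n, n)) < 0.08).astype(float)
np.fill_diagonal(A0, 0.0)
W = A0 / np.abs(np.linalg.eigvals(A0)).max()       # spectral normalization
X = np.column_stack([np.ones(n), rng.standard_normal((n, k - 1))])
y = X @ np.ones(k) + np.linalg.solve(np.eye(n) - 0.3 * W, rng.standard_normal(n))

grid = np.linspace(-0.8, 0.9, 35)
vals = [adjusted_profile_loglik(t, y, X, W) for t in grid]
print("argmax over grid:", grid[int(np.argmax(vals))])
```

Because l_a(θ) here is the REML-type likelihood discussed below, maximizing it over a grid (or with a one-dimensional optimizer) already yields the adjusted estimator of θ for this family.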
In the case of the spatial error model, we may take A(θ) = S(λ). The profile (quasi) log-likelihood for (σ², θ) based on the assumption ε ∼ N(0, I_n) is (up to an additive constant)

l(σ², θ) := l(β̂_ML(θ), σ², θ) = −(n/2) log(σ²) + log|det(A(θ))| − (1/(2σ²)) y′U(θ)y,   (4.2)

where U(θ) := A′(θ) M_{A(θ)X} A(θ). As for the case of the SAR model, the score associated with (4.2) can be (exactly) recentered assuming only that E(ε) = 0 and var(ε) = I_n. The corresponding adjusted likelihood is

l_a(σ², θ) = −((n − k)/2) log(σ²) − (1/(2σ²)) y′U(θ)y + log|det(A(θ))| − (1/2) log(det(X′A′(θ)A(θ)X)).   (4.3)

Remarkably, and contrary to the case of a SAR model, l_a(σ², θ) is a genuine likelihood. Hence, the corresponding score provides an unbiased and information unbiased estimating equation (under the assumptions that E(ε) = 0 and var(ε) = I_n). In fact, l_a(σ², θ) is equivalent to the likelihood used for restricted ML (REML) estimation, which has been shown to be quite generally preferable to ML estimation in a regression model with correlated errors (see, e.g., Thompson, 1962; Patterson and Thompson, 1971; Rahman and King, 1997).

Maximization of (4.3) for fixed θ gives σ̂²(θ) = (1/(n − k)) y′U(θ)y, and thus the profile adjusted likelihood for θ only is

l_a(θ) := l_a(σ̂²(θ), θ) = −((n − k)/2) log(y′U(θ)y) + log|det(A(θ))| − (1/2) log(det(X′A′(θ)A(θ)X)).   (4.4)

To fully understand the effect of the score adjustment, it is useful to relate the likelihood l_a(θ) to invariance properties of the model. Assuming that the distribution of ε does not depend on the parameters β and σ² (and that the parameters are identifiable), it is straightforward to check that the model is invariant under the group G_X of transformations y → κy + Xδ in the sample space, for any κ > 0, any δ ∈ R^k, and for a fixed X (for the relevant definition of invariance, see Lehmann and Romano, 2005, Chapter 6). The likelihood l_a(θ) corresponds to the density of a maximal invariant under the group G_X (and l_a(σ², θ) corresponds to the density of a maximal invariant under the group of transformations y → y + Xδ, δ ∈ R^k). That is, using the adjusted likelihood l_a(θ) corresponds to imposing that inference should be invariant with respect to the group under which the model itself is invariant, as advocated by the "principle of invariance".

We do not give details here, but it can be shown that, as λ approaches any real zero of det(S(λ)), l(λ) → −∞ a.s., whereas l_a(λ) may either a.s. approach −∞ or be a.s. bounded in a spatial error model. Thus, as might have been expected, in the spatial error model the profile log-likelihood and its adjusted version behave very much as in the SAR model. So, in particular, it makes sense to define the adjusted QMLE of λ on a support different from Λ, as in the SAR model. More generally, this implies that when the covariance parameter θ is restricted to a certain set, the REML estimator of θ does not necessarily respect that restriction, something that, to the best of our knowledge, has not been noted before in the literature.

We conduct Monte Carlo experiments to investigate the performance of the adjusted MLE of λ in the SAR model. First, we consider the case of a single network, and then the case of the multiple networks model of Section 3.6. The number of replications is 10 in all experiments.

A Watts–Strogatz random graph is formed by rewiring with probability p each link in an h-ahead h-behind circular matrix (Watts and Strogatz, 1998).
The extreme cases p = 0 and p = 1 correspond, respectively, to an h-ahead h-behind circular matrix and to an Erdős–Rényi random graph. For intermediate values of p, Watts–Strogatz graphs can reproduce important characteristics of many real-world networks, including high clustering and small distances between most nodes. In our first numerical experiment, we take W in model (2.1) to be the normalized adjacency matrix of (one realization of) a Watts–Strogatz graph. Normalization of W is either a row normalization or a spectral normalization (by spectral normalization we mean that the matrix is rescaled by its spectral radius). Note that the two normalizations are equivalent when p = 0, due to the fact that an h-ahead h-behind circular matrix has constant row sums. The matrix X contains an intercept, k̃ regressors, and, to allow for contextual effects, the spatially lagged version of those regressors. That is, X = (ι_n, X̃, WX̃), where X̃ is n × k̃. The matrix X̃ is drawn in each repetition; half of its columns are drawn from independent N(0, I_n) distributions, half from independent uniform distributions on [0, 1]. For p > 0, W is drawn once and then kept constant across repetitions.

(Given the correspondence between l_a(λ) and the density of a maximal invariant under the group G_X, the fact that l_a(λ) may be bounded as λ → ω⁻¹, for some nonzero real eigenvalue ω of W, provides an explanation of why the limiting power of an invariant test for λ = 0 may be strictly in (0, 1) as λ → ω⁻¹ (see Martellosio, 2010). On the other hand, if l_a(λ) → −∞, then the power of an invariant test for λ = 0 may approach 0 or 1 as λ → ω⁻¹, depending on the location of the critical region. There is one important difference, though. Contrary to the case of the SAR model, in the spatial error model l_a(λ) can never approach +∞ a.s. as λ approaches a real zero of det(S(λ)). This is because l_a(λ) is a genuine likelihood, so a.s. unboundedness from above of l_a(λ) as λ approaches a zero of det(S(λ)) would imply the existence of a density that diverges to +∞ almost everywhere on R^n as λ approaches that zero.)

(An h-ahead h-behind circular matrix A is the adjacency matrix of a graph made of n vertices on a circle such that each vertex is linked to its h nearest neighbors on each side. That is, A(i, j) = 1 if 0 < |i − j| mod (n − 1 − h) ≤ h, and A(i, j) = 0 otherwise.)
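The design just described can be sketched as follows (a simple rewiring variant: newly created edges may themselves be revisited by the loop, which is immaterial for illustration):

```python
import numpy as np

def watts_strogatz_W(n, h, p, rng, normalization="row"):
    """Weights matrix from an h-ahead h-behind circular matrix with links
    rewired with probability p (a simple Watts-Strogatz variant)."""
    A = np.zeros((n, n))
    for i in range(n):                              # h neighbors on each side
        for s in range(1, h + 1):
            A[i, (i + s) % n] = A[i, (i - s) % n] = 1.0
    # rewire each (upper-triangular) edge with probability p to a new endpoint
    for i, j in zip(*np.triu_indices(n, 1)):
        if A[i, j] == 1.0 and rng.random() < p:
            A[i, j] = A[j, i] = 0.0
            choices = np.flatnonzero((A[i] == 0.0) & (np.arange(n) != i))
            jnew = rng.choice(choices)
            A[i, jnew] = A[jnew, i] = 1.0
    assert (A.sum(axis=1) > 0).all()                # no isolated vertices
    if normalization == "row":
        return A / A.sum(axis=1, keepdims=True)
    return A / np.abs(np.linalg.eigvals(A)).max()   # spectral normalization

rng = np.random.default_rng(6)
W_row = watts_strogatz_W(200, 5, 0.2, rng, "row")
W_spec = watts_strogatz_W(200, 5, 0.2, rng, "spectral")
print(W_row.sum(axis=1)[:3])                        # row-stochastic
print(np.abs(np.linalg.eigvals(W_spec)).max())      # spectral radius ~ 1
```

With p = 0 the two normalizations coincide, since the circulant base matrix has constant row sums 2h; for p > 0 they differ, which is what drives the differences between the two halves of Tables 1 and 2.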
We set β = ι_{k+1}, σ = 1, and n = 200. The errors ε_i are generated from independent standard normal distributions.

In the present context, λ̂_ML is, subject to regularity conditions, consistent and asymptotically normal as n → ∞ (Lee, 2004a), and λ̂_ML and λ̂_aML are first-order asymptotically equivalent (Yu et al., 2015). Table 1 compares the finite sample performance of λ̂_ML and λ̂_aML for a range of values of p, h, and λ, when k̃ = 2. The columns headed by λ̂_ML and λ̂_aML give the bias (s.d.) of the estimators. ∆% stands for the percentage change (of the absolute bias or RMSE) from λ̂_ML to λ̂_aML. For the case of row normalization, we also report ω₂⁻¹, the percentage of times when λ̂_aML > 1, denoted by %(λ̂_aML > 1), and results for λ̄_aML (which, recall, is the estimator that is set to 1 when λ̂_aML > 1). Note that Λ_a must be the same in each repetition, with its right endpoint being equal to ω₂⁻¹ when W is row normalized, and equal to 1 when W is spectrally normalized (and p > 0). Accordingly, λ̂_aML may be greater than 1 when W is row normalized, while λ̂_aML < 1 when W is spectrally normalized.

The results in Table 1 show that, when p = 0, λ̂_aML provides a significant improvement compared to λ̂_ML, both in terms of bias and RMSE. As p increases, the reduction in bias afforded by λ̂_aML increases, but at the cost of greater variability, for both normalizations. The bias of λ̂_aML is small, unless h is large and p is small. Note that ω₂⁻¹ increases with p and h (cf. the paragraph after Example 3.3), and %(λ̂_aML > 1) is nondecreasing in p, h, and λ, and can be very large when p, h, and λ are large. As %(λ̂_aML > 1) increases, λ̄_aML becomes less biased, and more variable, compared to λ̂_aML.

Table 2 analyzes the impact of the number of regressors. We take λ = 0.· and h = 5. For these values of λ and h, Pr(λ̂_aML > 1) is either zero or negligible in the row normalized case, so, contrary to the previous table, we do not report ω₂⁻¹, %(λ̂_aML > 1), or λ̄_aML. As expected, the bias reduction afforded by λ̂_aML increases as k̃ increases. As k̃ increases, the relative performance of λ̂_aML, compared to λ̂_ML, improves also in terms of RMSE if p is not large.

(Given our theoretical framework, it could be argued that a design with X̃ fixed across repetitions would be more appropriate. However, we prefer to vary X̃ across repetitions, in an attempt to investigate the performance of the adjusted QMLE in an environment where X̃ is random.)

Table 1: Comparison of λ̂_ML and λ̂_aML on a Watts–Strogatz network of size n = 200, with k̃ = 2 regressors. ∆% refers to a percentage change from λ̂_ML to λ̂_aML. [Columns, under row and spectral normalization: bias (s.d.) of λ̂_ML, λ̂_aML, and λ̄_aML, with ∆%|bias|, ∆%RMSE, ω₂⁻¹, and %(λ̂_aML > 1), by p, h, and λ; numeric entries not reproduced.]
1) bias(s.d.) bias(s.d.) bias(s.d.) ∆% | bias | ∆%RMSE0 5 0 − . . − . . − . − .
80 1 .
01 0 . − . . − . . − . . − . − . . − . . − . . − . − .
58 1 .
01 0 . − . . − . . − . . − . − . . − . . − . . − . − .
09 1 .
01 0 . − . . − . . − . . − . − . − . . − . . − . − .
50 1 .
02 0 . − . . − . . − . . − . − . . − . . . . − . − .
51 1 .
02 0 . − . . − . . − . . − . − . . − . . − . . − . − .
12 1 .
02 0 . − . . − . . − . . − . − . − . . − . . − . − .
78 1 .
60 3 . − . . − . . − . . − . − . . − . . − . . − . − .
43 1 .
60 10 . − . . − . . − . . − . − . . − . . − . . − . − .
11 1 .
60 31 . − . . − . . − . . − . − . . − . . − . . − . − .
91 1 .
15 0 . − . . − . . − . . − . − . . − . . − . . − . − .
30 1 .
15 0 . − . . − . . − . . − . − . . − . . − . . − . − .
14 1 .
15 0 . − . . − . . − . . − . − . − . . − . . − . − .
29 1 .
22 0 . − . . − . . − . . − . − . . − . . − . . − . − .
47 1 .
22 0 . − . . − . . − . . − . − . . − . . − . . − . − .
30 1 .
22 6 . − . . − . . − . . − . − . − . . − . . − . − .
77 2 .
34 5 . − . . − . . − . . − . − . . − . . − . . − . − .
85 2 .
34 17 . − . . − . . − . . − . − . . − . . − . . − . − .
44 2 .
34 38 . − . . − . . − . . − . − . . − . . − . . − .
77 0 .
59 1 .
49 0 . − . . − . . − . . − .
81 0 . . − . . − . . − . − .
43 1 .
49 0 . − . . − . . − . . − . − . . − . . − . . − . − .
95 1 .
49 12 . − . . − . . − . . − . − . − . . − . . − .
37 0 .
66 1 .
80 0 . − . . − . . − . . − .
90 0 . . − . . − . . − . − .
21 1 .
80 0 . − . . − . . − . . − . − . . − . . − . . − . − .
89 1 .
80 25 . − . . − . . − . . − . − . − . . − . . − .
64 2 .
15 4 .
37 7 . − . . − . . − . . − . − . . − . . − . . − . − .
75 4 .
37 23 . − . . − . . − . . − . − . . − . . − . . − . − .
71 4 .
37 43 . − . . − . . − . . − . − .
261 5 0 − . . − . . − .
25 1 .
32 1 .
76 0 . − . . − . . − . . − .
45 1 . . − . . − . . − .
73 0 .
48 1 .
76 0 . − . . − . . − . . − .
40 0 . . − . . − . . − . − .
02 1 .
76 17 . − . . − . . − . . − . − . − . . . . − .
87 2 .
74 2 .
45 0 .
00 0 . . − . . . . − .
97 2 . . − . . − . . − .
57 2 .
60 2 .
45 0 . − . . − . . − . . − .
75 2 . . − . . − . . − . − .
71 2 .
45 29 . − . . − . . − . . − . − . − . . . . − .
59 5 .
34 7 .
27 8 . − . . − . . − . . − .
19 1 . . − . . − . . − . − .
38 7 .
27 23 . − . . − . . − . . − . − . . − . . − . . − . − .
67 7 .
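The designs in Tables 1 and 2 use normalized adjacency matrices of Watts–Strogatz graphs. The construction can be sketched as follows (our own minimal implementation, not the authors' code; we take h to be the number of neighbors on each side of the ring and p the rewiring probability, and the two normalizations are the row and spectral normalizations referred to in the tables):

```python
import numpy as np

def watts_strogatz_adjacency(n, h, p, rng):
    """Symmetric 0/1 adjacency of a Watts-Strogatz graph: a ring lattice with
    h neighbours on each side, each original edge rewired with probability p."""
    A = np.zeros((n, n))
    for i in range(n):
        for d in range(1, h + 1):
            A[i, (i + d) % n] = A[(i + d) % n, i] = 1.0
    for i in range(n):                    # rewire each "forward" edge of i
        for d in range(1, h + 1):
            if rng.random() < p:
                j = (i + d) % n
                candidates = [m for m in range(n) if m != i and A[i, m] == 0]
                if candidates:
                    new = candidates[rng.integers(len(candidates))]
                    A[i, j] = A[j, i] = 0.0
                    A[i, new] = A[new, i] = 1.0
    return A

def row_normalize(A):
    return A / A.sum(axis=1, keepdims=True)       # rows sum to one

def spectral_normalize(A):
    return A / np.max(np.abs(np.linalg.eigvals(A)))  # spectral radius one

rng = np.random.default_rng(0)
A = watts_strogatz_adjacency(200, 5, 0.2, rng)
W_row, W_spec = row_normalize(A), spectral_normalize(A)
```

Note that each node keeps exactly one edge per "forward" slot during rewiring, so no row of the adjacency matrix can become empty and row normalization is always well defined in this sketch.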
5.2 Multiple networks

We now generate data according to model (3.11). For simplicity, we consider a balanced case: each of the R networks has the same number, m, of individuals. The matrices W_r and X̃ are generated as in Section 5.1. That is, each W_r is the normalized adjacency matrix of (a realization of) a Watts–Strogatz graph, and the k̃ regressors are drawn, in each repetition, from a N(0, 1) distribution. Errors and the fixed effects α_r are drawn (in each repetition) from N(0, 1) distributions, and we set γ = δ = ι_k̃ and σ² = 1.

Table 3 compares λ̂_aML to the estimator λ̂_LLL obtained by maximizing the Lee et al. (2010) likelihood l_LLL(σ², λ), for a range of values of R, m, and k̃. We choose λ = 0, h = 5, and p = 0.2. When W is row normalized, λ̂_aML performs similarly to λ̂_LLL. In fact, it has lower bias (but note that the bias of λ̂_LLL is already very small in most of the cases considered in the table) and slightly lower RMSE. As expected, the relative performance of λ̂_LLL improves as k̃ increases. When W is spectrally normalized, λ̂_LLL cannot be obtained, so only results for λ̂_aML are reported. The simulation results show that λ̂_aML performs very satisfactorily even in that case.

The simulated model is one in which fixed effects and regressors are random and independent of each other, and W is non-stochastic. Note that we could allow for some correlation between fixed effects and regressors, but this is not needed for our purposes. Some results for larger R are given in Table S.2 in the Supplement. Note that, as for the design in Section 5.1, in the present setting the set Λ_a is the same in all repetitions. Similarly to the cross-sectional case, the reduction in RMSE is larger for smaller values of p (for example, in the extreme case p = 0, ∆%RMSE is −4.417, −14.707, −24.278 when k̃ is 2, …, 10, respectively, and R = 10, m = 20).

Table 2: Comparison of λ̂_ML and λ̂_aML on a Watts–Strogatz network of size n = 200, with h = 5 and a fixed λ (the value of λ is not recoverable from the extracted source). ∆% refers to a percentage change from λ̂_ML to λ̂_aML. For each combination of p and k̃, the table reports bias(s.d.) for λ̂_ML and λ̂_aML, ∆%|bias|, and ∆%RMSE, under both row and spectral normalization. [The numerical entries of Table 2 are not recoverable from the extracted source and are omitted.]

Table 3: Comparison of λ̂_LLL and λ̂_aML in model (3.11) with λ = 0, h = 5, and p = 0.2. ∆% refers to a percentage change from λ̂_LLL to λ̂_aML. For each combination of k̃, R, and m, the table reports bias(s.d.) for λ̂_LLL and λ̂_aML, ∆%|bias|, and ∆%RMSE under row normalization, and bias(s.d.) for λ̂_aML under spectral normalization. [The numerical entries of Table 3 are not recoverable from the extracted source and are omitted.]

Table 4 reports empirical coverages of Wald confidence intervals based on first-order asymptotic normality of λ̂_LLL (in the columns headed by W_LLL), empirical coverages of Wald confidence intervals based on first-order asymptotic normality of λ̂_aML (in the columns headed by W_aML), and empirical coverages of saddlepoint confidence intervals based on λ̂_aML (in the columns headed by s_aML). Since λ̂_LLL is not available when W is spectrally normalized, we only report the case of row normalized W. The nominal size is 95%. We consider equi-tailed two-sided confidence intervals, and right-sided confidence intervals of the form (−∞, λ_U), where λ_U is a suitably selected upper end-point. The errors ε_ri are generated independently from either (a) a standard normal distribution, or (b) a gamma distribution with shape parameter 1 and scale parameter 1, demeaned by the population mean; the demeaned gamma distribution has mean 0, variance 1, skewness 2, and excess kurtosis 6.

Wald confidence intervals based on λ̂_aML offer an improvement over Wald confidence intervals based on λ̂_LLL when k̃ is not too small, and particularly in the case of right-sided confidence intervals. The Lugannani–Rice approximation delivers a further improvement, and indeed the coverages of the saddlepoint confidence intervals based on λ̂_aML are excellent in all cases considered in the table, even under the gamma distribution. Conversely, the empirical coverage of the Wald confidence intervals based on λ̂_LLL is acceptable in the two-sided case when R = m = 30, but quickly deteriorates as k̃ increases or n decreases, and is considerably worse in the right-sided case.

When W is spectrally normalized, coverages of the confidence intervals based on λ̂_aML are similar to the case of row normalization. The 95% Wald confidence intervals are λ̂ ± 1.96√v̂ (two-sided) and (−∞, λ̂ + 1.645√v̂) (right-sided), where λ̂ is either λ̂_LLL or λ̂_aML, and v̂ denotes the asymptotic variance given in Proposition 6.1 of Lee et al. (2010), evaluated at the LLL or aML estimates of λ, β, σ². Right-sided intervals are relevant, for example, for a one-sided test of λ = 0 obtained by inverting the confidence interval. Table S.4 in the Supplement reports coverages under other severely non-normal distributions; again, the saddlepoint confidence intervals based on the adjusted QMLE are very accurate in all cases considered.

Table 4: Empirical coverages of 95% confidence intervals in model (3.11) with λ = 0, h = 5, and p = 0.2. The error distribution is either a standard normal or a centered gamma(1, 1); W is row normalized. For each combination of k̃, R, and m, the table reports, under each of the normal and gamma distributions, two-sided and right-sided coverages in the columns W_LLL, W_aML, and s_aML.
[The numerical entries of Table 4 (empirical coverages, roughly between 0.895 and 0.952) are not recoverable in their original arrangement from the extracted source and are omitted.]
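The two estimators compared throughout this section can be sketched numerically. In the notation of the paper, S(λ) = I_n − λW, G(λ) = W S(λ)⁻¹, and M_X is the orthogonal projector onto col⊥(X); the QMLE solves the profile score equation for λ, while the adjusted QMLE solves its recentered version. A minimal sketch (our own, on a deterministic row-normalized ring lattice rather than the paper's Watts–Strogatz design):

```python
import numpy as np

def make_scores(y, X, W):
    """Profile score s(lam) for the QMLE and its recentered version s_a(lam)."""
    n, k = X.shape
    M = np.eye(n) - X @ np.linalg.solve(X.T @ X, X.T)   # M_X
    Wy = W @ y

    def score(lam, adjusted):
        S = np.eye(n) - lam * W
        Sy = S @ y
        ratio = (Wy @ M @ Sy) / (Sy @ M @ Sy)
        G = W @ np.linalg.inv(S)
        if adjusted:                       # recentered: E[score] = 0
            return (n - k) * ratio - np.trace(M @ G)
        return n * ratio - np.trace(G)

    return score

def find_root(f, lo, hi, ngrid=400):
    """Locate a sign change on a grid, then refine by bisection."""
    xs = np.linspace(lo, hi, ngrid)
    fs = np.array([f(x) for x in xs])
    i = int(np.argmax(fs[:-1] * fs[1:] < 0))
    lo, hi = xs[i], xs[i + 1]
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if f(lo) * f(mid) <= 0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)

rng = np.random.default_rng(1)
n, k, lam0 = 60, 4, 0.3
A = sum(np.eye(n, k=d) + np.eye(n, k=-d) + np.eye(n, k=n - d) + np.eye(n, k=d - n)
        for d in (1, 2))                   # ring lattice, 2 neighbours each side
W = A / A.sum(axis=1, keepdims=True)       # row-normalized weights matrix
X = np.column_stack([np.ones(n), rng.standard_normal((n, k - 1))])
y = np.linalg.solve(np.eye(n) - lam0 * W, X @ np.ones(k) + rng.standard_normal(n))

score = make_scores(y, X, W)
lam_ml = find_root(lambda l: score(l, adjusted=False), -0.9, 0.99)
lam_aml = find_root(lambda l: score(l, adjusted=True), -0.9, 0.99)
```

The only difference between the two estimating equations is the recentering: n is replaced by n − k, and tr(G(λ)) by tr(M_X G(λ)).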
Recentering the profile score for a parameter of interest is one possible way to deal with nuisance parameters. In this paper, we have applied this general principle to the estimation of the autoregressive parameter λ in a spatial autoregression. The resulting adjusted QMLE for λ successfully reduces the bias in the QMLE, provides confidence intervals with excellent coverage properties (and hence tests for λ with excellent size properties) even when the dimension of the nuisance parameter is large, and is as straightforward to compute as the original QMLE. The adjusted QMLE can also solve the incidental parameter problem that occurs, for example, in social network models with network fixed effects. However, because the parameter space for λ is usually restricted to a certain interval, the spatial autoregressive setting presents challenges for the score adjustment procedure that do not arise in other models. Namely, the distributions of the QMLE and of its adjusted version can be supported on different intervals, which means that a comparison between the two estimators is not straightforward. Our simulations suggest that the adjusted QMLE generally performs better than the QMLE in terms of RMSE, particularly in models with a large number of covariates.

This paper has focused on a simple version of a spatial autoregressive model. In empirical applications, it is typically desirable to extend the model in various directions. For example, one may want to allow for endogeneity coming from X or W, or for model errors that are themselves subject to a spatial autoregressive structure. Such extensions would not preclude the use of the score adjustment procedure, but would typically imply that the expectation of the profile score for λ depends on nuisance parameters.
In that case, as discussed in McCullagh and Tibshirani (1990), the expectation would need to be obtained numerically, rather than analytically, and the resulting estimating equation would be unbiased up to some order, rather than exactly unbiased.

It is also worth mentioning that it should be possible to use the score adjustment procedure in conjunction with modifications of the QMLE that allow for unknown heteroskedasticity (see, e.g., Liu and Yang, 2015). Finally, the adjusted QMLE should also be effective in models where the number of regressors k increases with the sample size (see, e.g., Gupta and Robinson, 2018). If k increases sufficiently quickly with n, then the adjusted QMLE should have advantages over the QMLE even in terms of first-order asymptotics.

Appendix A Auxiliary results
Lemma A.1.
Let W be a weights matrix, and ω a semisimple real eigenvalue of W. The set of full column rank matrices X such that tr(M_X Q_ω) = 0 is a µ_{R^{n×k}}-null set.

Proof.
We need to show that A := {X ∈ R^{n×k} : rank(X) = k and tr(M_X Q_ω) = 0} is a µ_{R^{n×k}}-null set. This is equivalent to showing that B := {X ∈ R^{n×k} : det(X′X) tr(M_X Q_ω) = 0} is a µ_{R^{n×k}}-null set, because A = {X ∈ R^{n×k} : rank(X) = k} ∩ B. But det(X′X) tr(M_X Q_ω) is a polynomial in the entries of X, as is clear from writing det(X′X) tr(M_X Q_ω) = det(X′X) tr(Q_ω) − tr(X adj(X′X) X′ Q_ω). Hence B is an algebraic variety, and as such it is either a µ_{R^{n×k}}-null set or the whole of R^{n×k}. The latter case is easily ruled out.

Lemma A.2. If W is symmetric, condition (3.10) is satisfied for any X.

Proof. If W is symmetric, v′(G(λ) − tI_n)²v ≥ 0 for any t ∈ R, any v ∈ R^n, and any λ such that det(S(λ)) ≠ 0. It follows that, for any t ∈ R, any u ∈ R^m, any m × n matrix C, and any λ such that det(S(λ)) ≠ 0, u′C(G(λ) − tI_n)²C′u ≥ 0, and hence tr(C(G(λ) − tI_n)²C′) ≥ 0. Let us now choose C to be an (n − k) × n matrix such that CC′ = I_{n−k} and C′C = M_X, and t to be tr(M_X G(λ))/(n − k). Then, tr(C(G(λ) − tI_n)²C′) = tr(M_X G²(λ)) − [tr(M_X G(λ))]²/(n − k) = −(n − k)δ_a(λ). The proof is completed on noting that tr(C(G(λ) − tI_n)²C′) = 0 if and only if G(λ), and hence W, is a scalar multiple of I_n, which is ruled out by the maintained assumption that W has at least one negative and at least one positive eigenvalue.

Lemma A.3. For any semisimple nonzero real eigenvalue ω of W:
(i) if tr(M_X Q_ω) > 0, then lim_{λ↑ω⁻¹} tr(M_X G(λ)) = +∞ and lim_{λ↓ω⁻¹} tr(M_X G(λ)) = −∞;
(ii) if tr(M_X Q_ω) = 0, then tr(M_X G(λ)) is bounded as λ → ω⁻¹;
(iii) if tr(M_X Q_ω) < 0, then lim_{λ↑ω⁻¹} tr(M_X G(λ)) = −∞ and lim_{λ↓ω⁻¹} tr(M_X G(λ)) = +∞.

Proof.
Let Sp(W) denote the spectrum (defined as the set of distinct eigenvalues) of W. For any λ ∈ R such that det(S(λ)) ≠ 0, consider the function f(z) := z(1 − λz)⁻¹ from C \ {λ⁻¹} to C, and let f^(i)(z) denote its i-th order derivative. Since f(z) is defined at each eigenvalue of W, the matrix G(λ) admits, for any such λ, the spectral resolution (e.g., Meyer, 2000, p. 603)

G(λ) = Σ_{χ∈Sp(W)} Σ_{i=0}^{k_χ−1} [f^(i)(χ)/i!] (W − χI_n)^i T_χ,  (A.1)

where k_χ denotes the index of an eigenvalue χ, and T_χ denotes the projector onto the generalized eigenspace null((W − χI_n)^{k_χ}) along col((W − χI_n)^{k_χ}). If an eigenvalue ω of W is semisimple (and only in that case), then k_ω = 1, and hence

tr(M_X G(λ)) = f(ω) tr(M_X Q_ω) + Σ_{χ∈Sp(W)\{ω}} Σ_{i=0}^{k_χ−1} [f^(i)(χ)/i!] tr(M_X (W − χI_n)^i T_χ).  (A.2)

The stated result obtains, because all derivatives f^(i)(χ) are bounded for λ ≠ χ⁻¹.

Lemma A.4.
For any semisimple nonzero real eigenvalue ω of W, if y ∉ null(M_X S(ω⁻¹)), then lim_{λ→ω⁻¹} l_a(λ) is
(i) −∞ if tr(M_X Q_ω) > 0;
(ii) bounded if tr(M_X Q_ω) = 0;
(iii) +∞ if tr(M_X Q_ω) < 0.

Proof.
The result can be proved by looking at the two terms that make up s_a2(λ) in expression (3.5): (n − k)(y′W′M_X S(λ)y)/(y′S(λ)′M_X S(λ)y) and −tr(M_X G(λ)). Consider an arbitrary nonzero real eigenvalue ω of W. If y ∉ null(M_X S(ω⁻¹)), the function λ ↦ (n − k)(y′W′M_X S(λ)y)/(y′S(λ)′M_X S(λ)y) is continuous at λ = ω⁻¹, because it is well defined at λ = ω⁻¹, and is well defined in a neighborhood of λ = ω⁻¹ (the last claim follows by Lemma S.1.1 in the online supplement to Hillier and Martellosio, 2018a). The proof is completed on using Lemma A.3 to establish the limiting behavior of the term −tr(M_X G(λ)) as λ → ω⁻¹.

Lemma A.5. For any semisimple nonzero real eigenvalue ω of W, [tr(M_X G(λ))]² − (n − k) tr(M_X G²(λ)) → +∞ as λ → ω⁻¹ if tr(M_X Q_ω) < 0.

Proof.
This proof is based on the proof of Lemma A.3. For any λ ∈ R such that det(S(λ)) ≠ 0, consider the function g(z) := f²(z) = z²(1 − λz)⁻² from C \ {λ⁻¹} to C, and let g^(i)(z) denote its i-th order derivative. Then, for any such λ, a spectral resolution of G²(λ) is given by the right hand side of equation (A.1), with all f^(i)(·) replaced by g^(i)(·). Hence, for an arbitrary semisimple nonzero real eigenvalue ω of W, tr(M_X G²(λ)) can be expressed as in the right hand side of (A.2), with f replaced by g. Since all derivatives g^(i)(χ) are bounded for λ ≠ χ⁻¹, lim_{λ→ω⁻¹} tr(M_X G²(λ)) = −∞ if tr(M_X Q_ω) < 0. The proof is completed on observing that, by Lemma A.3, lim_{λ→ω⁻¹} [tr(M_X G(λ))]² = +∞ if tr(M_X Q_ω) < 0.

Lemma A.6.
Suppose that, in a SAR model, col(X) is an invariant subspace of W. Then the adjusted log-likelihood function l_a(σ², λ) is the same as the quasi log-likelihood function for (σ², λ) based on Dy, for any full rank (n − k) × n matrix D such that DX = 0, and for any y ∈ R^n.

Proof.
We only provide a sketch of the proof here; full details are given in Section S.9 of the Supplement. If col(X) is an invariant subspace of W, there exists a unique k × k matrix A such that WX = XA, and hence S⁻¹(λ)X = X(I_k − λA)⁻¹, for any λ such that S(λ) is invertible. Thus, when col(X) is an invariant subspace of W, the SAR model y = S⁻¹(λ)Xβ + σS⁻¹(λ)ε can be written as y = X(I_k − λA)⁻¹β + σS⁻¹(λ)ε, which corresponds to the spatial error model (4.1) with β replaced by (I_k − λA)⁻¹β. Intuitively, the lemma then follows from the fact that, for a spatial error model, the adjusted (quasi) log-likelihood l_a(σ², λ) is the same as the restricted, or residual, quasi log-likelihood for (σ², λ) (see Section S.8.2.2 in the Supplement).

Appendix B Proofs
Proof of Proposition 3.1.
The adjusted likelihood l_a(σ², λ) is defined by the property that its gradient is the score s_a(σ², λ) given in equation (3.4), for any σ² > 0 and any λ ∈ Λ_u, where, recall, Λ_u := {λ ∈ R : det(S(λ)) ≠ 0}. Solving the two equations

∂l_a(σ², λ)/∂σ² = −(n − k)/(2σ²) + (1/(2σ⁴)) y′S′(λ)M_X S(λ)y,
∂l_a(σ², λ)/∂λ = (1/σ²) y′W′M_X S(λ)y − tr(M_X G(λ)),

gives, up to an additive constant,

l_a(σ², λ) = −((n − k)/2) log(σ²) − (1/(2σ²)) y′S′(λ)M_X S(λ)y − ∫ tr(M_X G(λ)) dλ.

The integral ∫ tr(M_X G(λ)) dλ can be expressed in terms of the matrix logarithm. For any λ such that |λ|ρ(W) < 1, where ρ(W) denotes the spectral radius of W, log S(λ) admits the convergent power series representation −Σ_{k=1}^∞ (1/k) λ^k W^k, from which it is immediately clear that (d/dλ) log S(λ) = −G(λ). This expression for (d/dλ) log S(λ) can be extended over the set Λ_u, by selecting a suitable branch of the matrix logarithm. Let Ξ := {1 − λω : λ ∈ R, ω ∈ Sp(W)}, where Sp(W) denotes the set of eigenvalues of W, be the subset of C formed by the eigenvalues of S(λ) as λ ranges over R. Since Ξ is formed by a finite number of lines in C going through 1, there must exist a half line l in C starting from the origin that does not intersect Ξ. In the rest of the proof, log denotes the matrix logarithm associated to the branch cut l. Then, log S(λ), and hence its first derivative, is holomorphic over C \ l. Since C \ l is a connected set and Λ_u ⊂ C \ l, the identity theorem for holomorphic functions implies that (d/dλ) log S(λ) = −G(λ) for any λ ∈ Λ_u. It follows that tr(M_X log S(λ)) is an antiderivative of −tr(M_X G(λ)), and hence that Re[tr(M_X log S(λ))] is a real antiderivative of −tr(M_X G(λ)), for any λ ∈ Λ_u, which completes the proof.

Proof of Theorem 1.
According to Assumption 1, M_X S(ω⁻¹) ≠ 0 for any nonzero real eigenvalue ω of W. Hence null(M_X S(ω⁻¹)) is a µ_{R^n}-null set, for any nonzero real eigenvalue ω of W. The result follows by Lemma A.4.

Proof of Lemma 3.1. (i) For any semisimple eigenvalue ω of W, Q_ω is a projector onto col(Q_ω) = null(W − ωI_n), so null(W − ωI_n) ⊆ col(X) if and only if M_X Q_ω = 0, which is obviously sufficient for tr(M_X Q_ω) = 0. (ii) Let H_ω be a matrix whose columns form an orthonormal basis for null(W − ωI_n). If W is symmetric, Q_ω = H_ω H_ω′ (the orthogonal projector onto null(W − ωI_n)), and hence tr(M_X Q_ω) = tr(H_ω′ M_X H_ω). Observe now that, since M_X is positive semidefinite, all diagonal entries of H_ω′ M_X H_ω are nonnegative. Thus, tr(M_X Q_ω) ≥ 0, and the desired conclusion follows from (i).
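Part (ii) of Lemma 3.1 is easy to verify numerically: for a symmetric W the eigenprojector is Q_ω = H_ω H_ω′, so tr(M_X Q_ω) = tr(H_ω′ M_X H_ω) ≥ 0. A quick sketch (our own illustration, not part of the proofs):

```python
import numpy as np

rng = np.random.default_rng(7)
n, k = 9, 3
B = rng.standard_normal((n, n))
W = (B + B.T) / 2                                    # symmetric weights-like matrix
X = rng.standard_normal((n, k))
M = np.eye(n) - X @ np.linalg.solve(X.T @ X, X.T)    # M_X

d, V = np.linalg.eigh(W)                             # eigenvalues a.s. simple here
for j in range(n):
    H = V[:, [j]]                                    # orthonormal basis of null(W - d[j] I)
    Q = H @ H.T                                      # orthogonal eigenprojector Q_omega
    t = np.trace(M @ Q)
    assert t >= -1e-12                               # Lemma 3.1(ii): tr(M_X Q_omega) >= 0
    assert np.isclose(t, float(H.T @ M @ H))         # tr(M_X Q_omega) = H' M_X H
```

The nonnegativity holds because H′M_X H = (M_X H)′(M_X H), M_X being a symmetric idempotent matrix.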
Proof of Proposition 3.2.
According to Lemma A.5, condition (3.10) implies that tr(M_X Q_ω) ≥ 0 for any nonzero real eigenvalue ω such that ω⁻¹ ∈ Λ_a. Hence, by Theorem 1, as λ approaches the reciprocals of such eigenvalues, l_a(λ) either a.s. diverges to −∞ or is a.s. bounded, provided that Assumption 1 holds. The former case is impossible by the definition of Λ_a, so l_a(λ) must be a.s. continuous on the whole of Λ_a (after extension of the domain of l_a(λ) to include any zeros of det(S(λ)) in Λ_a). By the definition of Λ_a we also have that l_a(λ) → −∞ a.s. at the extremes of Λ_a. We now show that the second derivative l̈_a(λ) of l_a(λ) is negative at any critical point of l_a(λ) in Λ_a. This implies that l_a(λ) a.s. has a single critical point in Λ_a, corresponding to a maximum (that is, it is single-peaked with no stationary inflection points). Write

l̈_a(λ) = (n − k) [ −(ac − b²)/(aλ² − 2bλ + c)² + (b − aλ)²/(aλ² − 2bλ + c)² ] − tr(M_X G²(λ)),

where a := y′W′M_X Wy, b := y′W′M_X y, c := y′M_X y. The first order condition s_a2(λ) = 0 implies

(b − aλ)²/(aλ² − 2bλ + c)² = [tr(M_X G(λ))]²/(n − k)²,

so that, at any critical point of l_a(λ),

l̈_a(λ) = −(n − k)(ac − b²)/(aλ² − 2bλ + c)² + (1/(n − k))[tr(M_X G(λ))]² − tr(M_X G²(λ)).  (B.1)

By the Cauchy–Schwarz inequality, the first of the three terms on the right hand side of (B.1) is nonpositive. Hence condition (3.10) is sufficient for l̈_a(λ) to be negative at any zero of s_a2(λ) in Λ_a, which completes the proof.

Proof of Proposition 3.3. If W_r ι_{m_r} = ι_{m_r}, for all r = 1, …, R, then col(⊕_{r=1}^R ι_{m_r}) is an invariant subspace of W = ⊕_{r=1}^R W_r. Since F_r ι_{m_r} = 0, for all r = 1, …, R, the desired claim follows by Lemma A.6.
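The antiderivative identity at the heart of the proof of Proposition 3.1 — that Re[tr(M_X log S(λ))] differentiates to −tr(M_X G(λ))] — can be checked numerically. A sketch (our own illustration), using a symmetric W so that S(λ) is positive definite for |λ| < 1 and the principal logarithm can be computed by eigendecomposition:

```python
import numpy as np

rng = np.random.default_rng(3)
n, k = 10, 3
B = rng.standard_normal((n, n))
B = (B + B.T) / 2
W = B / np.max(np.abs(np.linalg.eigvalsh(B)))        # symmetric, spectral radius 1
X = rng.standard_normal((n, k))
M = np.eye(n) - X @ np.linalg.solve(X.T @ X, X.T)    # M_X

def tr_M_logS(lam):
    # principal log of the positive definite S(lam) = I - lam W, |lam| < 1
    d, V = np.linalg.eigh(np.eye(n) - lam * W)
    return np.trace(M @ (V @ np.diag(np.log(d)) @ V.T))

lam, eps = 0.4, 1e-6
numeric = (tr_M_logS(lam + eps) - tr_M_logS(lam - eps)) / (2 * eps)
G = W @ np.linalg.inv(np.eye(n) - lam * W)           # G(lam) = W S(lam)^{-1}
assert abs(numeric + np.trace(M @ G)) < 1e-5         # d/dlam tr(M_X log S) = -tr(M_X G)
```

Here W commutes with S(λ), so (d/dλ) log S(λ) = −G(λ) holds without any ordering subtleties; the general case requires the branch-cut argument in the proof.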
Supplement to “Adjusted QMLE for the Spatial Autoregressive Parameter”

Federico Martellosio, University of Surrey, UK
Grant Hillier, CeMMAP and University of Southampton, UK

February 28, 2019

This Supplement contains technical material related to the paper “Adjusted QMLE for the spatial autoregressive parameter”. All numbers of equations, lemmas, etc., in the Supplement are preceded by ‘S.’.
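As a concrete preview of the identifiability analysis in Appendix S.1 below, Assumption 1 (that M_X(ωI_n − W) ≠ 0 for every real eigenvalue ω of W) can fail when col(X) absorbs too much of the eigenstructure of W. A numerical sketch of such a failure (our own illustration, following the construction suggested by part (v) of Lemma S.1.1): take X to span all eigenvectors of a diagonalizable W except those associated with one eigenvalue ω.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 6
B = rng.standard_normal((n, n))
W = (B + B.T) / 2                        # symmetric, hence diagonalizable
d, V = np.linalg.eigh(W)                 # W = V diag(d) V'

omega = d[0]                             # single out one eigenvalue
X = V[:, 1:]                             # col(X) spans all other eigenvectors, k = n - 1
M = np.eye(n) - X @ np.linalg.solve(X.T @ X, X.T)   # M_X projects onto span{v_omega}

# Assumption 1 fails for this (W, X): M_X (omega I - W) = 0
assert np.allclose(M @ (omega * np.eye(n) - W), 0)
```

Here k = n − g_ω, the minimal dimension allowed by part (iv) of Lemma S.1.1, and M_X is the rank-one projector onto the eigenvector of ω.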
Appendix S.1 Details on Assumption 1
The following lemma provides an analysis of Assumption 1.
Lemma S.1.1. (i) If M_X(zI_n − W) = 0 then z is a real eigenvalue of W.
(ii) For fixed W and X, the condition M_X(ωI_n − W) = 0 can be satisfied by at most one eigenvalue ω of W.
(iii) For any real eigenvalue ω of W, M_X(ωI_n − W) = 0 if and only if col⊥(X) ⊆ null(W′ − ωI_n).
(iv) For any real eigenvalue ω of W, if M_X(ωI_n − W) = 0 then k ≥ n − g_ω, where g_ω denotes the geometric multiplicity of ω.
(v) For any real eigenvalue ω of W, a necessary condition for M_X(ωI_n − W) = 0 is that all eigenvectors of W associated with eigenvalues other than ω are in col(X). The condition is also sufficient if W is diagonalizable.

[Footnote] When saying that an eigenvector of W is in col(X), we take it as given that the ambient vector space used to define col(X) is C^n, not R^n. If the ambient space were R^n, the condition should read “the real parts and imaginary parts of all eigenvectors of W associated with eigenvalues other than ω are in col(X)”.

Proof. (i) Clearly z must be real, because if M_X(zI_n − W) = 0 then (Re(z) + i Im(z))M_X = M_X W; since M_X and M_X W are real, this implies Im(z) = 0. If z is not an eigenvalue of W, then rank(M_X(zI_n − W)) = n − k, and hence M_X(zI_n − W) cannot be 0, since k < n. (ii) Straightforward, because M_X(ω₁I_n − W) = M_X(ω₂I_n − W) = 0 implies (ω₁ − ω₂)M_X = 0, and hence ω₁ = ω₂. (iii) Follows from col(ωI_n − W) = null⊥(ωI_n − W′), and the fact that A ⊆ B if and only if B⊥ ⊆ A⊥, for any two subspaces A and B. (iv) In order for col(ωI_n − W) to be a subset of col(X), the dimension of col(X) must not be smaller than that of col(ωI_n − W); that is, it must hold that k ≥ rank(ωI_n − W) = n − nullity(ωI_n − W) = n − g_ω (the nullity of a matrix being the dimension of its null space). (v) If M_X(ωI_n − W) = 0 then M_X(ωI_n − W)v = 0 for any v ∈ C^n, that is,

ωM_X v = M_X Wv.  (S.1.1)

Suppose now v is an eigenvector of W associated to the eigenvalue κ.
Then equation (S.1.1) gives ωM_X v = κM_X v, which implies that v must be in col(X) if κ ≠ ω. If W is diagonalizable, we can write W = HDH⁻¹, where D := diag(χI_{n_χ}, χ ∈ Sp(W)) is the diagonal matrix containing the eigenvalues (ordered in some arbitrary manner) of W, and H is the matrix of eigenvectors. Sufficiency of the condition follows, because M_X(ωI_n − W) = M_X H diag((ω − χ)I_{n_χ}, χ ∈ Sp(W)) H⁻¹.

In view of part (i) of Lemma S.1.1, Assumption 1 could be reformulated as M_X(zI_n − W) ≠ 0 for any z in R (or in C). Part (iii) shows that Assumption 1 is the same as the condition used in Preinerstorfer and Pötscher (2017) for identifiability of the spatial autoregressive parameter in a spatial error model (as it should be, because the SAR and spatial error models have the same profile log-likelihood l(λ) under Assumption 1; see Martellosio, 2018). Part (iv) indicates how large k must be in order for the identifiability issue related to Assumption 1 to arise.

Note that the set of (full rank) matrices X such that Assumption 1 is violated for a given W is a null set with respect to any absolutely continuous distribution of X on R^{n×k}. The condition that the distribution of X is absolutely continuous would of course not be satisfied if some of the columns of X were fixed (e.g., if there are group intercepts as in Example 2.1). In that case, Assumption 1 may be violated for any draw of the remaining columns.

The next result states some consequences of a violation of Assumption 1 for the profile log-likelihood l(λ); see also Martellosio (2018).

Lemma S.1.2.
Suppose Assumption 1 is violated for some eigenvalue ω of W. In both the SAR and the spatial error model:
(i) if ω ≠ 0, the profile log-likelihood l(λ) is a.s. unbounded from above in a neighborhood of ω⁻¹;
(ii) if ω = 0, then, up to additive constants and a.s., l(λ) = log|det(S(λ))|, for any λ such that det(S(λ)) ≠ 0.

Proof. (i) For any λ such that rank(S(λ)) = n, and for any y ∉ null(M_X S(λ)), the profile log-likelihood l(λ) for a SAR model is given by equation (2.4). Note that equation (2.4) holds a.s. for any fixed λ such that rank(S(λ)) = n, because null(M_X S(λ)) is a µ_{R^n}-null set when rank(S(λ)) = n (since k < n). If Assumption 1 is violated for an eigenvalue ω of W (which must be real and unique, by parts (i) and (ii) of Lemma S.1.1), then M_X(ωI_n − W) = 0, and hence M_X S(λ) = (1 − λω)M_X. Substituting this last equation into (2.4) gives, for any y ∉ col(X),

l(λ) = log|det(S(λ))| − n log|1 − λω| − (n/2) log(y′M_X y)  (S.1.2)
     = log( |∏_{κ∈Sp(W)\{ω}} (1 − λκ)^{n_κ}| / (y′M_X y)^{n/2} ) − (n − n_ω) log|1 − λω|,  (S.1.3)

where n_κ denotes the algebraic multiplicity of an eigenvalue κ. The first term in equation (S.1.3) is a.s. bounded as λ → ω⁻¹. The second term goes to +∞ as λ → ω⁻¹, because n_ω < n by the assumption that W has at least one positive and one negative eigenvalue. Thus, lim_{λ→ω⁻¹} l(λ) = +∞ a.s. (ii) If Assumption 1 is violated for ω = 0, equation (S.1.2) gives l(λ) = log|det(S(λ))| − (n/2) log(y′M_X y), for any y ∉ col(X).

Remark S.1.1.
More can be said about λ̂_ML in case (ii) of Lemma S.1.2. In that case, λ̂_ML is a zero of tr(G(λ)), because (d/dλ) log|det(S(λ))| = −tr(G(λ)) and log|det(S(λ))| → −∞ as λ approaches any real zero of det(S(λ)). For many weights matrices W, tr(G(λ)) has the same sign as λ on Λ, which implies that log|det(S(λ))| is single-peaked on Λ with peak at 0, and hence λ̂_ML = 0. For example, tr(G(λ)) has the same sign as λ on Λ whenever all eigenvalues of W are real and tr(W) = 0.

Appendix S.2 The profile log-likelihood l(λ)

This section establishes that the (quasi) profile log-likelihood function l(λ), given in equation (2.4), is a.s. well defined. Let us start by considering a fixed y ∈ R^n. For any λ such that S(λ) is nonsingular, β and σ² can be concentrated out of the log-likelihood (2.3) if and only if there is no β ∈ R^k such that S(λ)y − Xβ = 0 or, equivalently, if and only if M_X S(λ)y ≠ 0. If M_X S(λ)y = 0 for some λ such that S(λ) is nonsingular, then y is fitted perfectly (zero residuals) by the SAR model, and the profile log-likelihood is not defined for that value of λ. More precisely, if M_X S(λ)y = 0 is satisfied by a unique value λ̄ of λ, then the profile log-likelihood function approaches +∞ as λ approaches λ̄. If M_X S(λ)y = 0 for more than one value of λ, then M_X S(λ)y = 0 for any λ (see Lemma S.1.1 in the online supplement to Hillier and Martellosio, 2018a), and hence the whole profile log-likelihood function is undefined.

Let us now turn to the case of random y. Provided that the distribution of y is absolutely continuous with respect to µ_{R^n}, the set of y's such that M_X S(λ)y = 0, for a fixed λ such that S(λ) is nonsingular, is µ_{R^n}-null (because null(M_X S(λ)) has dimension k if rank(S(λ)) = n, and k < n by assumption). That is, the profile log-likelihood l(λ) is a.s. well defined at any fixed value of λ such that S(λ) is nonsingular.
This does not imply that the function λ ↦ l(λ) is a.s. well defined, because the fact that null(M_X S(λ)) is µ_{R^n}-null for any λ such that det(S(λ)) ≠ 0 does not guarantee that the uncountable union of null(M_X S(λ)) over the set {λ ∈ R : det(S(λ)) ≠ 0} is µ_{R^n}-null. That the function l(λ) is a.s. well defined on its domain is established by the following result.
In a SAR model, the profile log-likelihood function l(λ) is a.s. well defined for any λ such that S(λ) is nonsingular.

Proof. We need to show that the set A := {y ∈ R^n : M_X S(λ)y = 0 for some λ such that det(S(λ)) ≠ 0} is a µ_{R^n}-null set. Suppose first that Assumption 1 is violated for an eigenvalue ω of W (which must be real and unique, by parts (i) and (ii) of Lemma S.1.1). That is, M_X(ωI_n − W) = 0, which implies M_X S(λ) = (1 − λω)M_X, and hence A = col(X), which is clearly µ_{R^n}-null since k < n by assumption. For the rest of the proof, suppose Assumption 1 holds. We establish that A is a µ_{R^n}-null set by showing that the larger set B := {y ∈ R^n : M_X S(λ)y = 0 for some λ ∈ R} is a µ_{R^n}-null set. Note that B = {y ∈ R^n : rank(y, Wy, X) < k + 2}, and that rank(y, Wy, X) < k + 2 if and only if the determinants of all (k + 2) × (k + 2) submatrices of (y, Wy, X) are zero (or, equivalently, if and only if det((y, Wy, X)′(y, Wy, X)) = 0). Hence, B is an algebraic variety, and as such it is either a µ_{R^n}-null set or the whole of R^n. We now show that B cannot be R^n under Assumption 1. Noting that the problem is invariant to a change of basis for R^n, we can take X = (e_{n−k+1}, …, e_n) = (0, I_k)′, where e_i is the i-th standard unit vector. In order for B = R^n, the determinants of all (k + 2) × (k + 2) submatrices of (y, Wy, X) must be zero for all y ∈ R^n. Recalling that n ≥ k + 2 by assumption, this requires, in particular, that the matrix

[ y_i  (Wy)_i ]
[ y_j  (Wy)_j ]

is singular for any i, j ≤ n − k, and for any y ∈ R^n (note that if the assumption n ≥ k + 2 were violated, then B would be equal to R^n). Now, y_i(Wy)_j − y_j(Wy)_i = 0 for any y ∈ R^n implies (Wy)_i = αy_i and (Wy)_j = αy_j, for some scalar α. Hence, in order for B = R^n, there must be a scalar α such that y − αWy ∈ col(X) for any y, which contradicts Assumption 1. Thus B, and hence A, must be a µ_{R^n}-null set.

It is worth noting that, provided it is well defined, l(λ) is C^∞ between any two consecutive real zeros of det(S(λ)) (because the term log|det(S(λ))| in equation (2.4) is). Also, on comparing equation (2.4) for l(λ) and equation (3.8) for l_a(λ), it is clear that the statement of Lemma S.2.1 also applies to l_a(λ). Finally, recall from above that the set null(M_X S(z)) (the set of y's that are perfectly fitted by a SAR model with λ = z) is µ_{R^n}-null for any z such that S(z) is nonsingular. For the values of z such that S(z) is singular (i.e., for the values z = ω⁻¹, for the nonzero real eigenvalues ω of W), null(M_X S(z)) is a µ_{R^n}-null set if Assumption 1 holds, whereas null(M_X S(z)) is the whole of R^n if Assumption 1 is violated. In Section S.4 we will prove a result, Lemma S.4.2, that establishes the limiting behavior of l_a(λ) as λ → ω⁻¹ when y ∈ null(M_X S(ω⁻¹)).

Appendix S.3 Remarks on Proposition 3.1

Remark S.3.1.
From the proof of Proposition 3.1 it is clear that if S(λ) does not have any real nonpositive eigenvalues, then the principal matrix logarithm (that is, the logarithm associated with the branch cut (−∞, 0]) can be used to evaluate l_a(σ², λ); in particular, this is the case for λ ∈ Λ, because S(λ) is positive definite on Λ. A different branch cut is required to evaluate l_a(σ², λ) outside Λ.

Remark S.3.2.
Proposition 3.1 shows that the effect of the score adjustment on the profile log-likelihood (3.1) is to replace n with n − k and the term log|det(S(λ))| = Re[tr(log S(λ))] with Re[tr(M_X log S(λ))], for any λ such that S(λ) is invertible. If we restrict attention to λ ∈ Λ, then S(λ) is positive definite, and the adjustment amounts to replacing log(det(S(λ))) = tr(log S(λ)) with tr(M_X log S(λ)) (the matrix logarithm here can be taken to be the principal one, by the previous remark).

Appendix S.4 Further results related to Theorem 1
The following lemma complements Lemma 3.1 in the main text by giving a condition for tr(M_X Q_ω) = 0 in terms of right and left eigenvectors of W associated with ω.

Lemma S.4.1.
For any simple eigenvalue ω of W, tr(M_X Q_ω) = 0 if and only if l′M_X h = 0, for some (and hence all) h ∈ null(W − ωI_n) and for some (and hence all) l ∈ null(W′ − ωI_n).

Proof of Lemma S.4.1. If ω is simple, Q_ω = hl′/(l′h) for some (and hence all) h ∈ null(W − ωI_n) and for some (and hence all) l ∈ null(W′ − ωI_n). It follows that tr(M_X Q_ω) = 0 if and only if tr(l′M_X h) = 0, or equivalently l′M_X h = 0.

It is a well-known fact in linear algebra that left and right eigenvectors corresponding to the same simple eigenvalue cannot be orthogonal. Lemma S.4.1 establishes that tr(M_X Q_ω) = 0 if and only if left and right eigenvectors corresponding to ω are orthogonal after orthogonal projection onto col^⊥(X).

Next, recall that the conclusions in Theorem 1 require y ∉ null(M_X S(ω^{−1})), with null(M_X S(ω^{−1})) a μ_ℝⁿ-null set under Assumption 1. It is therefore of interest to elucidate what happens when y ∈ null(M_X S(ω^{−1})). Note that if y ∈ null(M_X S(ω^{−1})) and y ∈ null(M_X S(λ)) for some λ ∈ ℝ \ {ω^{−1}}, then, by Lemma S.1.1 in the online supplement to Hillier and Martellosio (2018a), y ∈ null(M_X S(λ)) for every λ, and hence l_a(λ) is undefined for every λ.

Lemma S.4.2.
Consider an arbitrary semisimple nonzero real eigenvalue ω of W. If y ∈ null(M_X S(ω^{−1})), and y ∉ null(M_X S(λ)) for any λ ∈ ℝ \ {ω^{−1}}, then lim_{λ→ω^{−1}} l_a(λ) is (i) −∞ if tr(M_X Q_ω) > n − k; (ii) bounded if tr(M_X Q_ω) = n − k; (iii) +∞ if tr(M_X Q_ω) < n − k.

Proof.
Assume y ∈ null(M_X S(ω^{−1})), and y ∉ null(M_X S(λ)) for any λ ∈ ℝ \ {ω^{−1}}. Then s_{a2}(λ) in equation (3.5) is well defined for every λ such that S(λ) is nonsingular. Since y ∈ null(M_X S(ω^{−1})), we have M_X Wy = ωM_X y and hence M_X S(λ)y = (1 − λω)M_X y. Plugging this last expression into equation (3.5), we obtain

    s_{a2}(λ) = (n − k) ω(1 − λω) y′M_X y / [(1 − λω)² y′M_X y] − tr(M_X G(λ)) = (n − k) ω/(1 − λω) − tr(M_X G(λ)).

Equation (A.2) now gives

    s_{a2}(λ) = [(n − k) − tr(M_X Q_ω)] ω/(1 − λω) − Σ_{χ ∈ Sp(W)\{ω}} Σ_{i=0}^{k_χ−1} [f^{(i)}(χ)/i!] tr( M_X (W − χI_n)^i T_χ ).

Since all derivatives f^{(i)}(χ) are bounded for λ ≠ χ^{−1}, it follows that: if tr(M_X Q_ω) < n − k, then lim_{λ↑ω^{−1}} s_{a2}(λ) = +∞ and lim_{λ↓ω^{−1}} s_{a2}(λ) = −∞, that is, lim_{λ→ω^{−1}} l_a(λ) = +∞; if tr(M_X Q_ω) > n − k, then lim_{λ→ω^{−1}} l_a(λ) = −∞; if tr(M_X Q_ω) = n − k, then lim_{λ→ω^{−1}} l_a(λ) is bounded.

Finally, we remark that, for a fixed eigenvalue ω, the statement of Theorem 1 does not require the full Assumption 1, but only that col(ωI_n − W) ⊄ col(X) for that specific ω. However, the result is trivial whenever Assumption 1 fails, because in that case l_a(λ) is flat (and hence tr(M_X Q_ω) = 0); see Martellosio (2018).

Appendix S.5 Additional simulation evidence
Appendix S.5.1 Relationship between the adjusted MLE and the unrestricted MLE
We report some simulation evidence on the relationship between the adjusted MLE ˆλ_aML and the unrestricted MLE ˆλ_uML. We employ the simulation design that is used in Section 5.1 for Table 1, with W row normalized, and λ = 0.5. In Table S.1, the columns headed by "ˆλ_uML > 1" and "ˆλ_aML > 1" report the percentage of Monte Carlo repetitions in which, respectively, ˆλ_uML > 1 and ˆλ_aML > 1; the columns headed by "ua" report the percentage of times that ˆλ_uML > 1 given that ˆλ_aML > 1; the columns headed by "au" report the percentage of times that ˆλ_aML > 1 given that ˆλ_uML > 1.

The main conclusions are as follows. Firstly, the probabilities that ˆλ_uML and ˆλ_aML are greater than 1 increase with the density of W and with the rewiring probability p. The table reports the estimates of these probabilities only for λ = 0.5; of course, the probabilities would be larger for larger values of λ. Secondly, when ˆλ_uML is greater than 1, ˆλ_aML is almost always greater than 1 too, and, on the other hand, the probability that ˆλ_uML is greater than 1 given that ˆλ_aML is greater than 1 is decreasing in ˜k and increasing in p.

Table S.1: Study of the probabilities that ˆλ_uML and ˆλ_aML are greater than 1, for a SAR model on a Watts–Strogatz network of size n = 200, when W is row normalized and λ = 0.5.
[Table S.1 body: for each combination of rewiring probability p and parameter h, and for ˜k = 1, 3, 5, the table reports the four percentages "ˆλ_uML > 1", "ˆλ_aML > 1", "ua", and "au"; the numeric entries are not reliably recoverable from the extracted text.]

Appendix S.5.2 Large R

Table S.2 extends the results reported in Table 3 in the paper to the case when R = 100 and W is row normalized. At the sample sizes considered in the table, the bias of both the LLL estimator and the adjusted QMLE is small, but, as expected, the adjusted QMLE still offers an improvement, especially when ˜k is large.

Table S.3 extends Table 4 in the paper to the case when R = 100, for various values of ˜k. Even at these sample sizes, the coverage of the Wald confidence intervals deteriorates significantly as ˜k increases (particularly for the case of the confidence intervals based on the QMLE), but that is not the case for the saddlepoint confidence intervals based on the adjusted QMLE.

Appendix S.5.3 Non-normality
Table S.4 extends Table 4 in the paper to other non-normal distributions. More specifically, the errors ε_{ri} are generated independently from either: (a) a Laplace(0, 2^{−1/2}) distribution; (b) a χ²₃ distribution, standardized so that its mean is 0 and its variance is 1; or (c) a gamma distribution with shape parameter 1/2 and scale parameter 1, standardized so that its mean is 0 and its variance is 1. Skewness and kurtosis are, respectively, 0 and 6 in case (a); √(8/3) ≈ 1.63 and 7 in case (b); 2√2 ≈ 2.83 and 15 in case (c).

Table S.2: Model (3.11) with R = 100, h = 5 and p = 0.2, when W is row normalized. [Table body: bias and standard deviation of ˆλ_LLL and ˆλ_aML for various ˜k and m, together with the percentage improvements ∆%|bias| and ∆%RMSE; the numeric entries are not reliably recoverable from the extracted text.]

Table S.3: Empirical coverages of 95% confidence intervals in model (3.11) with R = 100, λ = 0, h = 5, and p = 0.2. The error distribution is a standard normal, and W is row normalized. [Table body: two-sided and right-sided coverages of W_LLL, W_aML, and s_aML for various ˜k and m; the numeric entries are not reliably recoverable from the extracted text.]
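As a quick numerical check on these parameterizations, the sketch below draws standardized errors from the three distributions and compares the sample skewness and kurtosis with the values stated above. The sample size and seed are arbitrary, and the readings Laplace(0, 2^{−1/2}), χ²₃, and gamma(1/2, 1) are assumptions based on the stated moments.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000

def skew_kurt(x):
    """Sample skewness and (non-excess) kurtosis of a 1-d array."""
    z = (x - x.mean()) / x.std()
    return (z**3).mean(), (z**4).mean()

# (a) Laplace(0, 2**-0.5): variance 2*b**2 = 1, so no further standardization needed
a = rng.laplace(0.0, 2**-0.5, n)
# (b) chi-square with 3 degrees of freedom: mean 3, variance 6
b = (rng.chisquare(3, n) - 3.0) / np.sqrt(6.0)
# (c) gamma(shape 1/2, scale 1): mean 1/2, variance 1/2
c = (rng.gamma(0.5, 1.0, n) - 0.5) / np.sqrt(0.5)

# stated values: (0, 6) for (a); (sqrt(8/3), 7) for (b); (2*sqrt(2), 15) for (c)
for name, x in [("Laplace", a), ("chi2(3)", b), ("gamma(1/2,1)", c)]:
    s, k = skew_kurt(x)
    print(f"{name}: skewness {s:.2f}, kurtosis {k:.2f}")
```

The heavy right tail of the standardized gamma(1/2, 1) makes its sample kurtosis converge slowly, which is why a large sample is used.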
Appendix S.6 The Lugannani–Rice approximation
This section provides additional details about the approximation needed for the construction of the confidence intervals in Section 3.5 of the paper. The first-order Lugannani–Rice approximation to the cdf Pr(V ≤ v) of a random variable V having cumulant generating function (cgf) K_V(·) is

    P̃r(V ≤ v) := Φ(ŵ) + φ(ŵ)(1/ŵ − 1/û),  if v ≠ E(V),
    P̃r(V ≤ v) := 1/2 + K_V′′′(0) / [6√(2π) (K_V′′(0))^{3/2}],  if v = E(V),   (S.6.1)

where Φ(·) and φ(·) denote the cdf and pdf of the standard normal distribution, respectively, primes denote derivatives, and

    ŵ := sgn(ŝ)√(2(ŝv − K_V(ŝ))),  û := ŝ√(K_V′′(ŝ)),   (S.6.2)

where the saddlepoint value ŝ is determined by K_V′(ŝ) = v (see, e.g., Butler, 2007, p. 12). This is a normal-based approximation. For extensions to non-normal bases, see Wood et al. (1993).

Formula (S.6.1) can be used to derive an approximation to the cdf of ˆλ_aML based on the equality Pr(ˆλ_aML ≤ z; β, σ², λ) = Pr(y′S′(z)R(z)S(z)y ≤ 0) (equation (3.9) in the paper), which, remember, requires l_a(λ) to be single-peaked. More precisely, formula (S.6.1) can be applied with V = σ^{−2}y′S′(z)R(z)S(z)y and v = 0. The cgf of σ^{−2}y′S′(z)R(z)S(z)y when y ∼ N(S^{−1}(λ)Xβ, σ²[S′(λ)S(λ)]^{−1}) is

    K_V(s) = −(1/2) log(det(I_n − 2sB(z, λ))) − (1/(2σ²)) β′X′( I_n − (I_n − 2sB(z, λ))^{−1} )Xβ,   (S.6.3)

for s ∈ (1/(2b_min), 1/(2b_max)), where

    B(z, λ) := (1/2) [S(z)S^{−1}(λ)]′ [R(z) + R′(z)] [S(z)S^{−1}(λ)],

and b_min and b_max denote, respectively, the smallest and the largest eigenvalues of B(z, λ).
We define P̃r(ˆλ_aML ≤ z; β, σ², λ) := P̃r(y′S′(z)R(z)S(z)y ≤ 0). Some care is needed at z = λ. This is because E(y′S′(λ)R(λ)S(λ)y) = 0 (see Section 3.1 in the paper), and therefore the expression at the bottom of equation (S.6.1) applies when approximating Pr(y′S′(z)R(z)S(z)y ≤ 0) at z = λ. In particular, the approximation to the probability that ˆλ_aML underestimates λ has the closed form expression

    P̃r(ˆλ_aML ≤ λ; β, σ², λ) = 1/2 + K_V′′′(0) / [6√(2π) (K_V′′(0))^{3/2}].

It is easily verified that the cgf of y′Ay, for a symmetric A and when y ∼ N(μ, Σ), is −(1/2) log(det(I_n − 2sAΣ)) − (1/2) μ′(I_n − (I_n − 2sAΣ)^{−1})Σ^{−1}μ. Equation (S.6.3) obtains immediately from this expression on rewriting the quadratic form V = σ^{−2}y′S′(z)R(z)S(z)y as V = ỹ′B(z, λ)ỹ, with ỹ := σ^{−1}S(λ)y ∼ N(σ^{−1}Xβ, I_n).

Table S.4: Empirical coverages of 95% confidence intervals in model (3.11) with λ = 0, h = 5, and p = 0.2. The error distribution is either a Laplace(0, 2^{−1/2}), a χ²₃, or a gamma(1/2, 1) (with the χ² and the gamma distributions standardized to have mean zero and variance 1), and W is row normalized.
[Table S.4 body: for each error distribution (Laplace, chi-square, gamma), two-sided and right-sided coverages of W_LLL, W_aML, and s_aML for various ˜k, R, and m; the numeric entries are not reliably recoverable from the extracted text.]
946 0 . ppendix S.7 Recentering the score for λ alone The adjusted QMLE λ aML is obtained by recentering the profile score s ( σ , λ ); see Section3.1. It is natural to wonder what would happen if, instead, one were to treat σ as a nuisanceparameter, and recenter the profile score for λ alone, say s ( λ ) (i.e., the score associated tothe log-likelihood (2.4)). To investigate, first we recenter s ( λ ) under a stronger assumptionthan the assumption we used to recenter s ( σ , λ ), the latter assumption being E( ε ) = 0 andvar( ε ) = I n . Under the stronger assumption, recentering s ( λ ) produces the same estimatorfor λ as recentering s ( σ , λ ). But, of course, recentering s ( λ ) does not automatically deliveran adjusted estimator of σ , which is needed for inference on λ . Then, we show how a smallmodification of the procedure produces the same unbiased estimating equation for λ derivedin the paper (equation (3.6)), without the need for the stronger distributional assumption.The log-likelihood obtained after profiling out both β and σ from the (quasi) log-likelihood l ( β, σ , θ ) is given in equation (2.4), and the associated profile score is s ( λ ) := n y (cid:48) W (cid:48) M X S ( λ ) yy (cid:48) S (cid:48) ( λ ) M X S ( λ ) y − tr( G ( λ )) . (S.7.1)We now show that s ( λ ) can be recentered under the assumption that the distribution of y is ascale-mixture of the N( S − ( λ ) Xβ, σ ( S (cid:48) ( λ ) S ( λ )) − ) distribution. By Lemma S.7.1 given atthe end of this subsection, the expectation of s ( λ ) when y ∼ N( S − ( λ ) Xβ, σ ( S (cid:48) ( λ ) S ( λ )) − (or, equivalently, ε ∼ N(0 , I )) is given byE( s ( λ )) = nn − k tr( M X G ( λ )) − tr( G ( λ )) , (S.7.2)which is generally nonzero. 
The adjusted profile score for λ is therefore

    s_a(λ) := s(λ) − E(s(λ)) = n y′W′M_X S(λ)y / (y′S′(λ)M_X S(λ)y) − [n/(n − k)] tr(M_X G(λ)),   (S.7.3)

with associated profile likelihood l*_a(λ) := ∫ s_a(λ) dλ. In terms of the score s_{a2}(λ) in equation (3.5), s_a(λ) = [n/(n − k)] s_{a2}(λ), which implies that the associated estimating equations are identical. Of course, up to additive constants, we also have l*_a(λ) = [n/(n − k)] l_a(λ), l_a(λ) being the log-likelihood given in equation (3.8).

In fact, equation (S.7.2) is quite robust to deviations from normality. For one thing, it is evident that, as a function of y, s(λ) is scale-invariant, which implies that its properties under Gaussian assumptions will be retained under all scale mixtures of the N(S^{−1}(λ)Xβ, σ²(S′(λ)S(λ))^{−1}) distribution for y. This is a much broader class of distributions for y. Also, the result holds approximately in much greater generality. (That the assumption E(ε) = 0 and var(ε) = I_n is not sufficient for an exact recentering of s(λ) can also be seen by noticing that the assumption is not sufficient for the adjusted profile score s_{a2}(λ) given in equation (3.5) to be exactly unbiased, even though it is sufficient for the estimating equation (3.6) to be unbiased.)

To sum up, ˆλ_aML, which has been produced in the paper by recentering s(σ², λ), can also be obtained by recentering s(λ) under a stronger distributional assumption. There is an alternative way to produce ˆλ_aML starting from the score s(λ), without the stronger assumption. The denominator y′S′(λ)M_X S(λ)y in expression (S.7.1) for s(λ) is a.s. positive. Hence, the score equation s(λ) = 0 is a.s. equivalent to the equation

    y′W′M_X S(λ)y − [tr(G(λ))/n] y′S′(λ)M_X S(λ)y = 0.
(S.7.4)

This estimating equation can be recentered assuming only that E(S(λ)y) = Xβ and var(S(λ)y) = σ²I_n (or, equivalently, E(ε) = 0 and var(ε) = I_n). Under such assumptions,

    E( y′W′M_X S(λ)y − [tr(G(λ))/n] y′S′(λ)M_X S(λ)y ) = σ²( tr(M_X G(λ)) − [(n − k)/n] tr(G(λ)) ),

which is generally different from 0. The recentered version of equation (S.7.4) is therefore

    y′W′M_X S(λ)y − [tr(G(λ))/n] y′S′(λ)M_X S(λ)y − σ²( tr(M_X G(λ)) − [(n − k)/n] tr(G(λ)) ) = 0,

which now involves the unknown σ². But replacing σ² with its unbiased estimator given λ, namely [1/(n − k)] y′S′(λ)M_X S(λ)y, gives the equation

    y′W′M_X S(λ)y − [tr(M_X G(λ))/(n − k)] y′S′(λ)M_X S(λ)y = 0,

which is the same as the estimating equation (3.6) in the paper.

Finally, we observe that, depending on W and X, E(s(λ)) in equation (S.7.2) can be positive, zero, or negative for a given λ. Unreported simulations suggest that, in most cases of interest for applications, E(s(λ)) < 0 for any λ ∈ Λ, or at least in a neighborhood of ˆλ_ML. Note that if E(s(λ)) < 0 for any λ ∈ Λ and s(λ) has a unique zero on Λ (see Proposition 3.2), then ˆλ_aML > ˆλ_ML for any y.

The lemma required to obtain equation (S.7.2) under normality is:

Lemma S.7.1.
For any n × n nonrandom matrix A, and for z ∼ N(Xβ, σ²I_n),

    E( z′AM_X z / z′M_X z ) = tr(AM_X)/(n − k).

Proof.
Let C be a matrix such that CC′ = I_{n−k} and C′C = M_X. We can write, uniquely, z = Vz₁ + C′z₂, where V := X(X′X)^{−1/2}, z₁ := (X′X)^{−1/2}X′z, and z₂ := Cz. Thus

    z′AM_X z / z′M_X z = z₂′CAC′z₂ / z₂′z₂ + z₁′V′AC′z₂ / z₂′z₂.

Note that z₁ ∼ N((X′X)^{1/2}β, σ²I_k), z₂ ∼ N(0, σ²I_{n−k}), and z₁ and z₂ are independent. The expectation of z₁′V′AC′z₂/z₂′z₂ with respect to z₁ is β′(X′X)^{1/2}V′AC′z₂/z₂′z₂, and the expectation with respect to z₂ is then zero, because the distribution of z₂ is invariant under z₂ → −z₂. The proof is completed on noting that the ratio z₂′CAC′z₂/z₂′z₂ is independent of its denominator (Pitman, 1937), and hence has expected value equal to tr(AM_X)/(n − k).

Appendix S.8 Spatial error model
A spatial error model is a particular case of a regression model with correlated errors, which we write as y = Xβ + u, with E(u) = 0 and var(u) = σ²Σ(θ). We assume that X is fixed, but of course one could condition on a random X if X and u are assumed to be independent. The parameter θ belongs to some subset Θ ⊆ ℝ^p, and it is assumed that Σ(θ) is positive definite for any θ ∈ Θ and differentiable in θ. We also define a matrix A(θ) such that A′(θ)A(θ) = Σ^{−1}(θ). The model can then be written as y = Xβ + σA^{−1}(θ)ε, with E(ε) = 0 and var(ε) = I_n. In the spatial error model case, we may take A(θ) = S(λ).

In Section S.8.1 we derive expression (4.3) for the adjusted profile log-likelihood l_a(σ², θ), in the context of a general Σ(θ), by recentering the profile score for θ. Then, in Section S.8.2, we show that the corresponding adjusted likelihood is equivalent to the likelihood suggested by other inferential approaches.

Appendix S.8.1 The adjusted profile log-likelihood
The quasi log-likelihood prevailing under the assumption ε ∼ N(0, I_n) is

    l(β, σ², θ) := −(n/2) log(σ²) − (1/2) log(det(Σ(θ))) − (1/(2σ²)) (y − Xβ)′Σ^{−1}(θ)(y − Xβ),

where (β, σ², θ) ∈ ℝ^k × ℝ₊ × Θ, and additive constants are omitted.

Provided that y ∉ col(X), β can be concentrated out of l(β, σ², θ). (If y ∈ col(X), the model provides a perfect fit; of course, col(X) is a μ_ℝⁿ-null set, since k < n by assumption.) The MLE of β given θ is β̂_ML(θ) := (X′Σ^{−1}(θ)X)^{−1}X′Σ^{−1}(θ)y. Thus, the profile log-likelihood for (σ², θ) is

    l(σ², θ) := l(β̂_ML(θ), σ², θ) = −(n/2) log(σ²) − (1/2) log(det(Σ(θ))) − (1/(2σ²)) y′U(θ)y,   (S.8.1)

where

    U(θ) := ( I_n − X(X′Σ^{−1}(θ)X)^{−1}X′Σ^{−1}(θ) )′ A′(θ)A(θ) ( I_n − X(X′Σ^{−1}(θ)X)^{−1}X′Σ^{−1}(θ) ) = A′(θ) M_{A(θ)X} A(θ).

The score associated with the log-likelihood (S.8.1) is

    s(σ², θ) := [ ds(σ², θ)/dσ² ; ds(σ², θ)/dθ ]
             = [ −n/(2σ²) + (1/(2σ⁴)) y′U(θ)y ; −(1/2) tr(Σ^{−1}(θ) dΣ(θ)/dθ) + (1/(2σ²)) y′U(θ)(dΣ(θ)/dθ)U(θ)y ].   (S.8.2)

Due to the presence of the nuisance parameter β, the score s(σ², θ) is generally biased. In order to obtain an unbiased estimating equation, we recenter s(σ², θ). Using the fact that U(θ)X = 0, we can write

    s(σ², θ) = [ −n/(2σ²) + (1/(2σ⁴)) u′U(θ)u ; −(1/2) tr(Σ^{−1}(θ) dΣ(θ)/dθ) + (1/(2σ²)) u′U(θ)(dΣ(θ)/dθ)U(θ)u ]
             = [ −n/(2σ²) + (1/(2σ²)) ε′M_{A(θ)X}ε ; −(1/2) tr(Σ^{−1}(θ) dΣ(θ)/dθ) + (1/2) ε′M_{A(θ)X}A(θ)(dΣ(θ)/dθ)A′(θ)M_{A(θ)X}ε ].

Assuming that E(ε) = 0 and var(ε) = I_n, we have

    E(s(σ², θ)) = [ −n/(2σ²) + (n − k)/(2σ²) ; (1/2) tr( X(X′Σ^{−1}(θ)X)^{−1}X′ dΣ^{−1}(θ)/dθ ) ],   (S.8.3)

where we have used

    E( ε′M_{A(θ)X}A(θ)(dΣ(θ)/dθ)A′(θ)M_{A(θ)X}ε ) = tr( M_{A(θ)X}A(θ)(dΣ(θ)/dθ)A′(θ)M_{A(θ)X} )
      = tr( M_{A(θ)X}A(θ)(dΣ(θ)/dθ)A′(θ) )
      = tr( ( I_n − A(θ)X(X′Σ^{−1}(θ)X)^{−1}X′A′(θ) ) A(θ)(dΣ(θ)/dθ)A′(θ) )
      = tr( Σ^{−1}(θ) dΣ(θ)/dθ ) + tr( X(X′Σ^{−1}(θ)X)^{−1}X′ dΣ^{−1}(θ)/dθ ).   (S.8.4)

Hence, the adjusted score is

    s_a(σ², θ) := s(σ², θ) − E(s(σ², θ)) = [ s_{a1}(σ², θ) ; s_{a2}(σ², θ) ]
      = [ (1/(2σ⁴)) y′U(θ)y − (n − k)/(2σ²) ; −(1/2) tr(Σ^{−1}(θ) dΣ(θ)/dθ) + (1/(2σ²)) y′U(θ)(dΣ(θ)/dθ)U(θ)y − (1/2) tr( X(X′Σ^{−1}(θ)X)^{−1}X′ dΣ^{−1}(θ)/dθ ) ].

Note that setting s_{a1}(σ², θ) = 0 gives

    σ̂²(θ) = y′U(θ)y / (n − k),

the adjusted QMLE of σ² for given θ. The likelihood corresponding to s_a(σ², θ) is

    l_a(σ², θ) = −[(n − k)/2] log(σ²) − (1/(2σ²)) y′U(θ)y − (1/2) log(det(Σ(θ))) − (1/2) log(det(X′Σ^{−1}(θ)X)),   (S.8.5)

and hence the profile adjusted likelihood for θ only is

    l_a(θ) := l_a(σ̂²(θ), θ) = −[(n − k)/2] log(y′U(θ)y) − (1/2) log(det(Σ(θ))) − (1/2) log(det(X′Σ^{−1}(θ)X)).   (S.8.6)

In the same way as for the SAR model (see Section 3.1), an alternative to recentering the score s(σ², θ) would be to profile out both β and σ² from the (quasi) log-likelihood l(β, σ², θ) (provided that y ∉ col(X)), and then recenter the score s(θ). We now report the related calculations, for completeness. Note that recentering s(θ) produces the same likelihood as (S.8.6), but under a normality assumption, and does not produce a corrected estimator for σ². The MLEs of β and σ² given θ are β̂_ML(θ) := (X′Σ^{−1}(θ)X)^{−1}X′Σ^{−1}(θ)y and σ̂²_ML(θ) := y′U(θ)y/n. Thus the profile log-likelihood for θ only is

    l(θ) := l(β̂_ML(θ), σ̂²_ML(θ), θ) = −(n/2) log(y′U(θ)y) − (1/2) log(det(Σ(θ))).   (S.8.7)

The score associated with the log-likelihood (S.8.7) is

    s(θ) := dl(θ)/dθ = (n/2) [ y′U(θ)(dΣ(θ)/dθ)U(θ)y / y′U(θ)y ] − (1/2) tr(Σ^{−1}(θ) dΣ(θ)/dθ)
          = (n/2) [ u′U(θ)(dΣ(θ)/dθ)U(θ)u / u′U(θ)u ] − (1/2) tr(Σ^{−1}(θ) dΣ(θ)/dθ)
          = (n/2) [ ε′M_{A(θ)X}A(θ)(dΣ(θ)/dθ)A′(θ)M_{A(θ)X}ε / ε′M_{A(θ)X}ε ] − (1/2) tr(Σ^{−1}(θ) dΣ(θ)/dθ),   (S.8.8)

where in the second line we have used U(θ)X = 0, and in the third the definition U(θ) := A′(θ)M_{A(θ)X}A(θ) and the fact that ε = σ^{−1}A(θ)u.

The ratio of quadratic forms in (S.8.8) is independent of its denominator (Pitman, 1937). Hence, its expectation under the assumption underlying (S.8.7) (that is, y ∼ N(Xβ, σ²Σ(θ)), or equivalently ε ∼ N(0, σ²I_n)) is

    E( ε′M_{A(θ)X}A(θ)(dΣ(θ)/dθ)A′(θ)M_{A(θ)X}ε ) / E( ε′M_{A(θ)X}ε )
      = [1/(n − k)] { tr(Σ^{−1}(θ) dΣ(θ)/dθ) + tr( X(X′Σ^{−1}(θ)X)^{−1}X′ dΣ^{−1}(θ)/dθ ) },

where we have used expression (S.8.4). The adjusted score s_a(θ) := s(θ) − E(s(θ)) is therefore

    s_a(θ) = (n/2) { y′U(θ)(dΣ(θ)/dθ)U(θ)y / y′U(θ)y − [1/(n − k)] [ tr(Σ^{−1}(θ) dΣ(θ)/dθ) + tr( X(X′Σ^{−1}(θ)X)^{−1}X′ dΣ^{−1}(θ)/dθ ) ] }.

Integration of s_a(θ) gives

    l*_a(θ) := [n/(n − k)] { −[(n − k)/2] log(y′U(θ)y) − (1/2) log(det(Σ(θ))) − (1/2) log(det(X′Σ^{−1}(θ)X)) } = [n/(n − k)] l_a(θ).

Appendix S.8.2 Other interpretations of l_a(σ², θ) and l_a(θ)

Appendix S.8.2.1 G_X-invariance

The adjusted log-likelihoods l_a(σ², θ) and l_a(θ), given respectively in equations (S.8.5) and (S.8.6), are genuine likelihoods. The next lemma shows that l_a(θ) corresponds to the density of v := Cy/‖Cy‖, where C is an (n − k) × n matrix such that CC′ = I_{n−k} and C′C = M_X (that is, the rows of C form an orthonormal basis for col^⊥(X)), and with the convention that v := 0 if Cy = 0. The vector v is a maximal invariant under the group G_X defined in Section 4, so in the present context recentering the profile score for θ amounts to imposing G_X-invariance. The lemma is established under the assumption that u has an elliptically symmetric distribution with no atom at the origin. We note that a result similar to Lemma S.8.1 is stated in Rahman and King (1997).

Lemma S.8.1.
Assume that, in a spatial error model, u has an elliptically symmetric distribution with Pr(u = 0) = 0. Then l_a(θ) corresponds to the density of the maximal invariant v.

Proof of Lemma S.8.1. Under the stated assumption, the density function of the maximal invariant v with respect to the normalized invariant measure on the sphere {v ∈ ℝ^{n−k} : ‖v‖ = 1} is

    f(v; θ) = 2 det(CΣ(θ)C′)^{−1/2} ( v′(CΣ(θ)C′)^{−1}v )^{−(n−k)/2};   (S.8.9)

see, e.g., Kariya (1980), who considers the case when u has a density, and note that existence of the density of u is not required for (S.8.9), because the density of v is the same for any elliptically symmetric distribution of u. Now,

    det(CΣ(θ)C′) = det(Σ(θ)) det(X′Σ^{−1}(θ)X) (det(X′X))^{−1}

(e.g., Verbyla, 1990), and, since the rows of C span col^⊥(X), it is easily seen that C′(CΣ(θ)C′)^{−1}C = U(θ). It follows that the likelihood of θ based on v is proportional to

    (det(Σ(θ)))^{−1/2} (det(X′Σ^{−1}(θ)X))^{−1/2} ( y′U(θ)y / y′M_X y )^{−(n−k)/2},

which is equivalent to the log-likelihood (S.8.6).

Similarly, it is easily seen that, under the assumption in Lemma S.8.1, l_a(σ², θ) corresponds to the density of a maximal invariant under the group of transformations y → y + Xδ.

Appendix S.8.2.2 Restricted likelihood
The adjusted log-likelihood l_a(σ², θ) given in equation (S.8.5) is also equivalent to the restricted, or residual, log-likelihood for (σ², θ) (Thompson, 1962; Patterson and Thompson, 1971).

Let D be a full-rank (n − k) × n matrix, independent of β, and such that DX = 0. Then the distribution of Dy does not depend on β. The Gaussian restricted log-likelihood for (σ², θ) is the log-likelihood based on Dy. It can be immediately verified that this log-likelihood does not depend on the specific D that is chosen, and is equivalent to the adjusted log-likelihood l_a(σ², θ). The procedure can also be taken a step further, to produce a likelihood for θ only. The distribution of Dy/‖Dy‖ does not depend on β and σ², and it is easily seen that the corresponding log-likelihood is the same as l_a(θ) in (S.8.6). Contrary to the profile log-likelihood l(θ) in (S.8.7), the log-likelihoods (S.8.5) and (S.8.6) are genuine likelihoods (because they correspond to the densities of observable random variables, Dy and Dy/‖Dy‖, respectively). Hence, they provide unbiased and information-unbiased estimating equations. There is a large literature on estimating covariance parameters by maximizing the restricted likelihood, particularly for variance components models. Often the resulting estimator has a distribution that is closer to its asymptotic distribution, and has important robustness properties (e.g., Verbyla, 1993). Also, the estimator can be consistent even when k increases at the same rate as n (e.g., Smyth and Verbyla, 1999). Note that equations (S.8.5) and (S.8.6) do not depend on D. Taking D = C shows the equivalence to the maximal invariant approach described in Section S.8.2.1.

Finally, it is worth pointing out that the log-likelihood l_a(σ², θ) can also be interpreted as a Cox and Reid (1987) approximate conditional likelihood; see, e.g., Bellhouse (1990).

Appendix S.9 Proof of Lemma A.6
If col(X) is an invariant subspace of W, then M_X = M_{S(λ)X} for any λ such that S(λ) is invertible, and hence the profile quasi log-likelihood function for (σ², λ) implied by the SAR model, given in equation (3.1), is the same as that implied by the spatial error model, given by equation (4.2) with θ = λ and A(θ) = S(λ). It follows that, if col(X) is an invariant subspace of W, the score s(σ², λ) for a SAR model is given by equation (S.8.2). The adjusted log-likelihood function l_a(σ², λ) is obtained by centering s(σ², λ) under the assumptions that E(ε) = 0 and var(ε) = I_n. If col(X) is an invariant subspace of W, there exists a unique k × k matrix A such that WX = XA, and hence S^{−1}(λ)X = X(I_k − λA)^{−1}, for any λ such that S(λ) is invertible. Thus, when col(X) is an invariant subspace of W, the SAR model y = S^{−1}(λ)Xβ + σS^{−1}(λ)ε can be written as y = X(I_k − λA)^{−1}β + σS^{−1}(λ)ε. Using this representation, it is clear that E(s(σ², λ)) is the same as given in equation (S.8.3) for the spatial error model. Hence, l_a(σ², λ) is given by equation (S.8.5), i.e.,

    l_a(σ², λ) = −((n − k)/2) log(σ²) − (1/(2σ²)) y'U(λ)y − (1/2) log(det(Σ(λ))) − (1/2) log(det(X'Σ^{−1}(λ)X)),    (S.9.1)

where U(λ) := S'(λ) M_{S(λ)X} S(λ).

Consider now the quasi log-likelihood function l(σ², λ; Dy) for (σ², λ) based on Dy, for any full rank (n − k) × n matrix D such that DX = 0, when y = X(I_k − λA)^{−1}β + σS^{−1}(λ)ε and ε ∼ N(0, I_n). Since Dy ∼ N(0, σ²DΣ(λ)D'), we have

    l(σ², λ; Dy) = −((n − k)/2) log(σ²) − (1/2) log(det(DΣ(λ)D')) − (1/(2σ²)) y'D'(DΣ(λ)D')^{−1}Dy.    (S.9.2)

But the log-likelihood functions (S.9.2) and (S.9.1) are equivalent, because

    det(DΣ(λ)D') = det(DD') (det(X'X))^{−1} det(X'Σ^{−1}(λ)X) det(Σ(λ))

and

    D'(DΣ(λ)D')^{−1}D = Σ^{−1}(λ)(I_n − X(X'Σ^{−1}(λ)X)^{−1}X'Σ^{−1}(λ)) = U(λ).

References

Arellano, M., Hahn, J., 2007. Understanding bias in nonlinear panel models: some recent developments. In: Blundell, R., Newey, W., Persson, T. (Eds.), Advances in Economics and Econometrics, Ninth World Congress. Cambridge University Press, Cambridge, Ch. 12, pp. 381–409.
Bao, Y., 2013. Finite-sample bias of the QMLE in spatial autoregressive models. Econometric Theory 29 (1), 68–88.
Bao, Y., Ullah, A., 2007. Finite sample properties of maximum likelihood estimator in spatial models. Journal of Econometrics 137 (2), 396–413.
Bellhouse, D. R., 1990. On the equivalence of marginal and approximate conditional likelihoods for correlation parameters under a normal model. Biometrika 77 (4), 743.
Boucher, V., Bramoullé, Y., Djebbari, H., Fortin, B., 2014. Do peers affect student achievement? Evidence from Canada using group size variation. Journal of Applied Econometrics 29 (1), 91–109.
Bramoullé, Y., Djebbari, H., Fortin, B., 2009. Identification of peer effects through social networks. Journal of Econometrics 150 (1), 41–55.
Brouwer, A. E., Haemers, W. H., 2011. Spectra of graphs. Springer, New York.
Butler, R. W., 2007. Saddlepoint Approximations with Applications. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press.
Carrell, S. E., Sacerdote, B. I., West, J. E., 2013. From natural variation to optimal policy? The importance of endogenous peer group formation. Econometrica 81 (3), 855–882.
Conniffe, D., 1987. Expected maximum log likelihood estimation.
Journal of the Royal Statistical Society. Series D (The Statistician) 36 (4), 317–329.
Cox, D. R., Reid, N., 1987. Parameter orthogonality and approximate conditional inference. Journal of the Royal Statistical Society. Series B (Methodological) 49 (1), 1–39.
Daniels, H. E., 1983. Saddlepoint approximations for estimating equations. Biometrika 70 (1), 89–96.
Dhaene, G., Jochmans, K., 2015. Profile-score adjustments for incidental-parameter problems, manuscript.
Durban, M., Currie, I. D., 2000. Adjustment of the profile likelihood for a class of normal regression models. Scandinavian Journal of Statistics 27 (3), 535–542.
Erdős, P., Rényi, A., 1959. On random graphs I. Publicationes Mathematicae (Debrecen) 6, 290–297.
Fortin, B., Yazbeck, M., 2015. Peer effects, fast food consumption and adolescent weight gain. Journal of Health Economics 42, 125–138.
Gupta, A., Robinson, P. M., 2018. Pseudo maximum likelihood estimation of spatial autoregressive models with increasing dimension. Journal of Econometrics 202 (1), 92–107.
Hillier, G., Martellosio, F., 2018a. Exact and higher-order properties of the MLE in spatial autoregressive models, with applications to inference. Journal of Econometrics 205 (2), 402–422.
Hillier, G., Martellosio, F., 2018b. Exact likelihood inference in group interaction network models. Econometric Theory 34 (2), 383–415.
Horn, R. A., Johnson, C. R., 1985. Matrix Analysis. Cambridge University Press, Cambridge.
Jeganathan, P., Paige, R. L., Trindade, A. A., 2015. Saddlepoint-based bootstrap inference for the spatial dependence parameter in the lattice process. Spatial Statistics 12, 1–14.
Jiang, J., 1996. REML estimation: asymptotic behavior and related topics. The Annals of Statistics 24 (1), 255–286.
Kariya, T., 1980. Locally robust tests for serial correlation in least squares regression. The Annals of Statistics 8 (5), 1065–1070.
Kelejian, H. H., Prucha, I. R., 1998.
A generalized spatial two-stage least squares procedure for estimating a spatial autoregressive model with autoregressive disturbances. Journal of Real Estate Finance and Economics 17 (1), 99–121.
Kyriacou, M., Phillips, P. C. B., Rossi, F., 2017. Indirect inference in spatial autoregression. The Econometrics Journal 20 (2), 168–189.
Laskar, M., King, M., 2001. Modified likelihood and related methods for handling nuisance parameters in the linear regression model. In: Saleh, A. K. M. E. (Ed.), Data Analysis from Statistical Foundations. Nova Science Publishers Inc., New York, pp. 119–142.
Lee, L.-F., 2004a. Asymptotic distributions of quasi-maximum likelihood estimators for spatial autoregressive models. Econometrica 72 (6), 1899–1925.
Lee, L.-F., 2004b. Supplement to: Asymptotic distributions of quasi-maximum likelihood estimators for spatial autoregressive models. Department of Economics, The Ohio State University, Columbus, OH.
Lee, L.-F., 2007a. GMM and 2SLS estimation of mixed regressive, spatial autoregressive models. Journal of Econometrics 137 (2), 489–514.
Lee, L.-F., 2007b. Identification and estimation of econometric models with group interactions, contextual factors and fixed effects. Journal of Econometrics 140 (2), 333–374.
Lee, L.-F., Liu, X., Lin, X., 2010. Specification and estimation of social interaction models with network structures. Econometrics Journal 13 (2), 145–176.
Lee, L.-F., Yu, J., 2010. Estimation of spatial autoregressive panel data models with fixed effects. Journal of Econometrics 154 (2), 165–185.
Lehmann, E. L., Romano, J. P., 2005. Testing statistical hypotheses, 3rd Edition. Springer Texts in Statistics. Springer, New York.
Li, M., Yu, D., Bai, P., 2013. A note on the existence and uniqueness of quasi-maximum likelihood estimators for mixed regressive, spatial autoregression models. Statistics & Probability Letters.
Pitman, E. J. G., 1937. The "closest" estimates of statistical parameters.
Mathematical Proceedings of the Cambridge Philosophical Society 33, 212–222.
Preinerstorfer, D., Pötscher, B. M., 2017. On the power of invariant tests for hypotheses on a covariance matrix. Econometric Theory 33 (1), 1–68.
Rahman, S., King, M. L., 1997. Marginal-likelihood score-based tests of regression disturbances in the presence of nuisance parameters. Journal of Econometrics 82 (1), 81–106.
Robinson, P. M., Rossi, F., 2015. Refinements in maximum likelihood inference on spatial autocorrelation in panel data. Journal of Econometrics 189 (2), 447–456.
Sartori, N., 2003. Modified profile likelihoods in models with stratum nuisance parameters. Biometrika 90 (3), 533–549.
Smyth, G. K., Verbyla, A. P., 1999. Double generalized linear models: approximate REML and diagnostics. In: Friedl, H., Berghold, A., Kauermann, G. (Eds.), Proceedings of the 14th International Workshop on Statistical Modelling. Graz, pp. 66–80.
Thompson, W. A., Jr., 1962. The problem of negative estimates of variance components. The Annals of Mathematical Statistics 33 (1), 273–289.
Verbyla, A. P., 1990. A conditional derivation of residual maximum likelihood. Australian Journal of Statistics 32 (2), 227–230.
Verbyla, A. P., 1993. Modelling variance heterogeneity: Residual Maximum Likelihood and diagnostics. Journal of the Royal Statistical Society. Series B (Methodological) 55 (2), 493–508.
Watts, D. J., Strogatz, S., 1998. Collective dynamics of 'small-world' networks. Nature 393, 440–442.
Wood, A., Booth, J., Butler, R., 1993. Saddlepoint approximations to the cdf of some statistics with nonnormal limit distributions. Journal of the American Statistical Association 88, 680–686.
Yang, Z., 2015. A general method for third-order bias and variance corrections on a nonlinear estimator. Journal of Econometrics 186 (1), 178–200.
Yu, D., Bai, P., Ding, C., 2015. Adjusted quasi-maximum likelihood estimator for mixed regressive, spatial autoregressive model and its small sample bias.
Computational Statistics and Data Analysis 87, 116–135.