Verifying the existence of maximum likelihood estimates for generalized linear models
Sergio Correia
Federal Reserve Board
Paulo Guimarães
Banco de Portugal, CEFUP, and IZA
Thomas Zylkin †
University of Richmond
August 5, 2019
Abstract
A fundamental problem with nonlinear estimation models is that estimates are not guaranteed to exist. However, while non-existence is a well-studied issue for binary choice models, it presents significant challenges for other models as well and is not as well understood in more general settings. These challenges are only magnified for models that feature many fixed effects and other high-dimensional parameters. We address the current ambiguity surrounding this topic by studying the conditions that govern the existence of estimates for a wide class of generalized linear models (GLMs). We show that some, but not all, GLMs can still deliver consistent estimates of at least some of the linear parameters when these conditions fail to hold. We also demonstrate how to verify these conditions in the presence of high-dimensional fixed effects, as are often recommended in the international trade literature and in other common panel settings.
JEL Classification Codes:
C13, C18, C23, C25
Keywords:
Nonlinear models, Separation, Pseudo-maximum likelihood, Panel data

∗ We would like to thank Paul Kvam, João Santos Silva, Molin Zhong, and conference participants at the 2019 North American Summer Meeting of the Econometric Society and the 2019 Stata Conference. Thomas Zylkin is grateful for research support from the NUS Strategic Research Grant (WBS: R-109-000-183-646) awarded to the Global Production Networks Centre (GPN@NUS) for the project titled “Global Production Networks, Global Value Chains, and East Asian Development”.
† Corresponding author: Thomas Zylkin. Robins School of Business, University of Richmond, Richmond, VA, USA 23217. E-mail: [email protected].

1 Introduction
Count data models are widely used in applied economic research (Cameron and Trivedi, 2013; Winkelmann, 2008), and Poisson models, in particular, have recently exploded in popularity in more general applications since the publication of Santos Silva and Tenreyro (2006). Given this widespread and long-standing popularity, it is genuinely surprising that economists have only recently become aware that the existence of maximum likelihood (ML) estimates in count data models is not guaranteed. More precisely, as a follow-up paper by Santos Silva and Tenreyro (2010) documents, the first-order conditions that maximize the likelihood of Poisson models might not have a solution if regressors are perfectly collinear over the subsample of non-zero observations, but not over the subsample where the dependent variable is zero. Beyond this observation, however, Santos Silva and Tenreyro (2010) caution that “it is not possible to provide a sharp criterion determining the existence” of Poisson ML estimates. Moreover, although non-existence is a well-known issue in binary choice models, it seemingly remains unknown if similar issues could arise in other non-binary choice models besides Poisson.

In this paper, we resolve several key aspects of the current ambiguity regarding the non-existence of estimates in Poisson and other generalized linear models (GLMs). We do so in part by drawing on a largely uncredited contribution by Verbeek (1989), who derived necessary and sufficient conditions governing the existence of ML estimates for a broad class of GLMs, encompassing both Poisson and binary choice, as well as many other models for use with count data and other applications. Using Verbeek (1989)’s earlier results as our starting point, we show that, for many GLMs, at least some of the linear parameters can usually be consistently estimated, even when the ML estimates can nominally be said to “not exist”—a result that turns out to be especially helpful for Poisson and similar models. We also add new results for some models that only have become popular in the years following Verbeek (1989) and that turn out not to share these useful properties. For example, the log-link Gamma Pseudo-maximum Likelihood (PML) model sometimes recommended in fields such as international trade and health care economics has completely different conditions governing existence than Poisson models and cannot be sensibly estimated when these conditions fail to hold.

[Footnote: Poisson estimation has emerged as a workhorse model for studying health outcomes (Manning and Mullahy, 2001), patent citations (Figueiredo et al., 2015), trade data (Santos Silva and Tenreyro, 2006), economic history (de Bromhead et al., 2019), auctions (Bajari and Hortaçsu, 2003), as well as in many other economic applications where the dependent variable is discrete or when the data are assumed to be generated by a constant-elasticity model.]

[Footnote: As of this writing, both printings of Verbeek (1989, 1992) together have only seven citations listed on Google Scholar. As the editor’s note appended to Verbeek (1992) explains, Albert Verbeek’s motivation for writing this paper was identical to that of Santos Silva and Tenreyro (2010): frustration over standard algorithms either failing to converge or converging to nonsensical estimates.]
One common remedy is to drop the regressors implicated in separation from the model. This is also the only approach that has been discussed in the context of non-binary choice models (Santos Silva and Tenreyro, 2010; Larch et al., 2019). On the other hand, since dropping a regressor generally has implications for the identification of the other parameters (and since it is often not obvious which regressor is the “right” one to drop), the leading alternative recommended for binary choice settings is to assume the data have been drawn randomly from a known prior distribution (Heinze and Schemper, 2002; Gelman et al., 2008). These methods could be adapted to non-binary choice models as well. However, they still necessarily involve altering the model in a way that affects identification. Furthermore, they are not currently compatible with models that include high-dimensional fixed effects, which are increasingly common in applied work.

[Footnote: Our recommendation to drop separated observations is ostensibly similar to Allison (2008)’s suggestion to “do nothing”, since doing nothing can result in approximately valid estimates and inferences for at least some of the model parameters. However, in general, doing nothing is likely to result in lack of convergence, may overstate statistical significance, and always introduces at least some numerical error even in the “valid” point estimates. Some software packages drop separated observations by default (e.g., Stata’s probit command), but they generally are not adept at detecting these observations, nor do they usually provide theoretical justification for this practice in their documentation. Our companion website offers examples and discussion.]

Our own suggested remedy, which only involves dropping the separated observations, is generally very simple to implement through our algorithm and does not have any of these limitations. This is mainly because of an insight advanced independently by Verbeek (1989), Geyer (1990), and Clarkson and Jennrich (1991): any GLM suffering from separation can be nested within a “compactified” GLM where the conditional mean of each observation is allowed to go to its boundary values. The likelihood function always has a maximum somewhere in the compactified parameter space; thus, we can transform the problem of (non-)existence to one of possible corner solutions. More importantly, observations with a mean value at the boundary in the more general model are effectively perfectly predicted observations, which (as we will show) can be quickly identified even for very complex models and which offer no information about the parameters with interior solutions. Dropping these observations then results in a standard (non-compactified) version of the model that is assured to produce the same model fit, estimates, and inferences as the compactified version. We also show that the estimates are consistent and that correct inference requires only careful attention to which of the regressors are involved in separation. The resulting output on the whole is no different than what one would observe with a perfectly collinear regressor, and the problems of interpretation and inference turn out to be very similar as well.

Exactly who first discovered that separation could occur for Poisson regression and other non-binary choice GLMs is uncertain. The literature starts with Haberman (1973, 1974)’s derivation of a necessary and sufficient condition for the existence of estimates for log-linear frequency table models (including Poisson frequency tables).
However, it was known at the time that this condition was difficult to verify for higher-dimensional tables (see Albert and Anderson, 1984), a still-unsettled problem we indirectly solve in this paper. Soon thereafter, Wedderburn (1976) independently derived a sufficient (but not necessary) condition for the existence of estimates across a wide class of GLMs. His result can be shown to be equivalent to the later result from Santos Silva and Tenreyro (2010) for the Poisson model. Silvapulle (1981) and Albert and Anderson (1984) are then credited with demonstrating the concept of “separation” for binary choice models. The latter paper also conjectures, but does not prove, that their analysis may generalize to the class of log-linear frequency table models considered by Haberman (1973, 1974).

[Footnote: This popularity is only likely to increase in the near future thanks to a series of computational innovations that have made Poisson models with multiple levels of fixed effects more feasible to compute (see Figueiredo et al., 2015; Larch et al., 2019; Stammann, 2018), as well as a growing literature on bias corrections for incidental parameter bias in nonlinear panel models (see Arellano and Hahn, 2007; Fernández-Val and Weidner, 2016).]

Part of our motivation for this paper is to bring more attention to these existing results, which are continually being rediscovered and are often left out of textbook discussions. We also add to this earlier literature in three main ways. First, by considering an expanded set of models, we offer a more detailed treatment of how the separation problem varies across different GLMs. A significantly stricter set of conditions governs the existence of estimates for Gamma and Inverse Gaussian (PML) regressions than for Poisson, Logit, and Probit, for example—a result that has not been shown previously and which raises a new set of concerns about the application of such models to international trade data, health care cost data, and other settings where zero outcomes are common. Second, we clarify that at least some of the linear parameters can still often be consistently estimated in the presence of separation, as well as how to obtain valid asymptotic inferences—though, again, it is important to note these results do not extend to all GLMs. Finally, we introduce a simple-but-powerful method for detecting separation in models with a large number of fixed effects, a conceptually non-trivial task that would ordinarily require solving a high-dimensional linear programming problem. Because our algorithm relies on repeated iteration of a least-squares-with-equality-constraints regression, it can take advantage of the recent innovations of Correia (2017), who shows how to solve high-dimensional least-squares problems in nearly linear time. To our knowledge, the only other method that has been suggested for detecting separation in large ML models is that of Eck and Geyer (2018). Their algorithm works by iteratively solving for the null eigenvectors of the information matrix, whereas ours should be substantially more scalable because it avoids large matrix operations altogether. Our method can also be implemented in any standard statistical package.

[Footnote: The results by Geyer (1990) are applicable to the class of linear exponential families, while the work of Clarkson and Jennrich (1991) applies to linear parameter models (also known as “models with a linear part”; see Stirling, 1984).]
[Footnote: Manning and Mullahy (2001) leave aside the issue of zero outcomes in their paper, but they indicate that Gamma PML is generally a good model for health care cost data and also remark that “there is ostensibly nothing in the above analysis that would preclude applications to data where realizations of y are either positive or zero, as is common in many health economics applications.” Our own findings indicate that zeroes actually pose a distinct problem for Gamma PML models that must be carefully taken into account.]

[Footnote: Gourieroux et al. (1984, Appx 1.1) and Fahrmeir and Kaufmann (1985, Sec. 2.2) both assume in their proofs of consistency that the solutions for the linear parameters are interior. We present a proof that relies on a suitable re-parameterization of the separated model such that the results of Gourieroux et al. (1984) apply directly.]
2 Existence of estimates

The class of GLMs we consider is defined by the maximization of the following log-likelihood function:

l(β) = Σ_i l_i(β) = Σ_i { α_i y_i θ_i − α_i b(θ_i) + c_i(α_i, y_i) },   (1)

where y_i ≥ 0 is the dependent variable, x_i is a set of M regressors (x_1, x_2, ..., x_M), and β ∈ R^M is an M × 1 vector of parameters. α_i = w_i/φ is a positive weighting parameter, with known component w_i and a potentially unknown scaling factor φ. b_i and c_i are known real-valued functions that depend on the specific model. θ_i = θ(x_i β; ν) is the canonical parameter relating the linear predictor x_i β to the log-likelihood of a given observation l_i and its conditional mean µ_i ≡ E[y_i | x_i] = b′(θ_i). Note that θ(x_i β; ν) is continuous, strictly increasing, and twice differentiable in x_i β and that b(θ_i) is continuous, increasing, and convex in θ_i. Notably, these last few restrictions together ensure that the quantities θ_i, x_i β, and µ_i are each increasing with respect to one another and that l(β) is continuous in β. We further assume that lim_{x_i β → −∞} µ_i = 0. ν is then an additional dispersion parameter that allows us to also consider Negative Binomial models (see Table 1.)

The first-order conditions for each individual parameter β_m follow from the GLM score function:

s(β_m) = Σ_i s_i(β_m) = Σ_i α_i [ y_i − b′(θ_i) ] θ′(x_i β; ·) x_mi = 0   ∀ m.   (2)

[Footnote: φ is not associated with the problem of separation and will henceforth be treated as known. The results we document apply to models with unknown scaling factors without loss of generality. Table 1 gives examples. For more information on this class of models, see McCullagh and Nelder (1989), Section 2.2.2. Also note that the “linear predictor” term x_i β is often denoted as η_i. We keep it as x_i β to economize on notation.]

[Footnote: Note also that the score function in (2) does not depend on c(α_i, y_i). As such, we can consider the separation problem in models where y_i = 0 is observed even when the corresponding ML model would not admit zeroes, as with PML estimation.]

On top of these generalized restrictions, we use two further assumptions to derive a necessary and sufficient condition for existence that holds across most of these models. First, we assume that the matrix of regressors X = (x_1, x_2, ..., x_M) is of full column rank. This rank assumption allows us to set aside the more widely understood case of perfectly collinear regressors, although in Section 3, we will find it useful to draw a comparison between these two issues. Second, we assume for the moment that the individual log-likelihood contributions l_i(β) have a finite upper bound. Later on, we will consider two GLMs that are not guaranteed to have a finite upper bound, the Gamma and Inverse Gaussian PML models. We show that the relevant criteria governing existence are not the same as for models that obey this assumption.

To extend and generalize the earlier result from Santos Silva and Tenreyro (2010) for Poisson models, we are now ready to prove the following proposition:

Proposition 1 (Non-existence) Suppose that l(β) conforms to (1), the matrix of regressors X = (x_1, x_2, ..., x_M) is of full column rank, and the individual log-likelihood l_i(β) always has a finite upper bound. An ML solution for β will not exist if and only if there exists a linear combination of regressors z_i = x_i γ∗ such that

z_i = 0   ∀ i s.t. 0 < y_i < ȳ,   (3)
z_i ≥ 0   ∀ i s.t. y_i = ȳ,   (4)
z_i ≤ 0   ∀ i s.t. y_i = 0,   (5)

where γ∗ = (γ∗_1, γ∗_2, ..., γ∗_M) ∈ R^M is a non-zero vector of the same dimension as β and where ȳ is an upper bound on µ_i that equals 1 for binary choice models (∞ otherwise).

[Footnote: For more on the wide applicability of PML, see Gourieroux et al. (1984), Manning and Mullahy (2001), and Santos Silva and Tenreyro (2006).]

The proof of this proposition follows Verbeek (1989), but also draws on an earlier proof by Silvapulle (1981) specifically for binary choice models. In addition, the necessity of the condition on the boundedness of the l_i(·) function is due to Clarkson and Jennrich (1991); note that we will explore the implications of relaxing this assumption in Proposition 2 later in the paper.

The general idea is that we want to show that if a vector γ∗ satisfying (3)–(5) exists, the log-likelihood function l(β) is always increasing if we search for a maximum in the direction associated with γ∗. Otherwise, if no such γ∗ exists, we will show that searching in any direction from any starting point in R^M under the noted conditions always eventually causes the log-likelihood to decrease, such that the function must reach a maximum for some finite β_MLE ∈ R^M. To proceed, let γ = (γ_1, γ_2, ..., γ_M) ∈ R^M be an arbitrary non-zero vector of the same dimension as β and let k > 0 be a scalar. We can then define l(β + kγ), which allows us to consider how the log-likelihood function changes as we search in the same direction as γ from some initial point β. Differentiating l(β + kγ) with respect to k, we obtain

dl(β + kγ)/dk = Σ_i α_i [ y_i − b′(θ_i) ] θ′ x_i γ.

Suppose that there exists a γ∗ such that z_i = x_i γ∗ satisfies (3)–(5). In this case, the above expression becomes

dl(β + kγ∗)/dk = Σ_{y_i=0} α_i [ −b′(θ_i) ] θ′ z_i + Σ_{y_i=ȳ} α_i [ ȳ − b′(θ_i) ] θ′ z_i > 0,

with the inequality following because b′ and θ′ are both positive and because b′ = µ < ȳ. Notice also that the inequality is strict because we must have at least one observation for which z_i ≠ 0. Thus, l(β + kγ∗) > l(β) for any k > 0 and any β ∈ R^M, meaning that we cannot find a finite ML solution β_MLE ∈ R^M maximizing l(·). This is the case where estimates are said not to exist; intuitively, the log-likelihood is still increasing as it approaches the limit where x_i β → −∞ for at least one observation where y_i = 0 and/or x_i β → ∞ for at least one observation where y_i = ȳ.

[Footnote: In Verbeek (1989), the relevant theorems are Theorem 6, which establishes conditions under which the likelihood function has a local maximum that lies on the boundary of the parameter space, and Theorem 4, which establishes that any local maximum on the boundary is a global maximum if the likelihood function is concave. Note that we have relaxed the concavity assumption, since it is straightforward to show the weaker result that if there is a local maximum at the boundary induced by separation, the global maximum can only occur at the boundary.]

Alternatively, suppose that, for any γ, we always have that x_i γ ≠ 0 for at least one observation with 0 < y_i < ȳ. In this case, we have that lim_{k→∞} l_i(β + kγ) = −∞ for at least one observation. Since l_i(·) is continuous in β and (by assumption) has a finite upper bound, we can therefore always identify a finite scalar k̄ such that k > k̄ implies that l(β + kγ) = Σ_i l_i(β + kγ) < Σ_i l_i(β) = l(β), for any β, γ ∈ R^M. In other words, searching for an ML solution β_MLE in any direction from any starting point in R^M space will always eventually yield a decrease in the overall log-likelihood l(·). Because l(·) is continuous, this guarantees the existence of a finite β_MLE ∈ R^M maximizing l(·).

Next, note that for any y_i = 0 observation such that x_i γ > 0, l_i(β + kγ) is monotonic in k with lim_{k→∞} l_i(β + kγ) = −∞. Similarly, note that µ < ȳ ensures the same is true for any y_i = ȳ observation such that x_i γ < 0. Thus, we can again always find a sufficient k̄ such that k > k̄ implies l(β + kγ) < l(β) so long as we always have that either x_i γ > 0 for at least one observation with y_i = 0 or x_i γ < 0 for at least one observation with y_i = ȳ. Finally, note that we do not consider the case where there exists a vector γ such that x_i γ = 0 for all i, as this is the case where X is not of full rank. Therefore, the only possible scenario in which estimates do not exist is the one where we can find a linear combination of regressors z_i = x_i γ∗ satisfying (3)–(5). ∎

To tie in some standard terminology from the binary choice literature (cf., Albert and Anderson, 1984), we will say that the linear combination of regressors defined by z_i = x_i γ∗ in the case where estimates do not exist “separates” the observations for which z_i ≷ 0. We have “complete separation” when z_i < 0 for all y_i = 0 and z_i > 0 for all y_i = ȳ, since in these cases what value z_i takes perfectly predicts whether y_i is 0 or 1. Otherwise, we have only “quasi-complete separation”, where only some y_i outcomes are perfectly predicted. For non-binary choice models, however, note that, so long as y_i takes on at least two positive values, it will never be the case that z_i perfectly predicts y_i, regardless of whether z_i < 0 for all observations with y_i = 0.

[Footnote: To be clear, if ȳ = ∞, we never have that y_i = ȳ; only conditions (3) and (5) are salient. On the other end of the spectrum, models for “fractional” data such as Bernoulli PML (cf., Papke and Wooldridge 1996; Santos Silva et al. 2014) allow the dependent variable to vary continuously over [0, 1]. For these models, all three conditions stated in Proposition 1 are relevant.]

[Footnote: Readers should be wary of the weight carried by the word “always” here. It could be the case, for example, that x_i γ = 0 for all observations with y_i > 0 and x_i γ ≥ 0 for all observations with y_i = 0. This is still a case where estimates do not exist, since γ∗ = −γ would satisfy the needed conditions.]
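To make the conditions in Proposition 1 concrete, the following sketch checks whether a given candidate vector γ∗ satisfies them for a dataset (X, y). It is a minimal illustration of our own construction in Python/NumPy, assuming a model with only a lower bound at zero (ȳ = ∞, so only conditions (3) and (5) apply); the function name and tolerance are not from the paper.

```python
import numpy as np

def is_separating(X, y, gamma, tol=1e-10):
    """Check conditions (3) and (5) of Proposition 1 for a GLM with lower
    bound 0 and no upper bound (y-bar = infinity).

    Returns True if gamma is a separating vector: z_i = 0 whenever y_i > 0,
    z_i <= 0 whenever y_i = 0, and z_i != 0 for at least one observation.
    """
    z = X @ gamma
    interior_ok = np.all(np.abs(z[y > 0]) <= tol)   # condition (3)
    boundary_ok = np.all(z[y == 0] <= tol)          # condition (5)
    nontrivial = np.any(np.abs(z) > tol)            # z not identically zero
    return interior_ok and boundary_ok and nontrivial

# Toy example: the second regressor is zero whenever y > 0 and positive
# when y = 0, so gamma = (0, -1) separates the y = 0 observations.
X = np.array([[1.0, 0.0], [1.0, 0.0], [1.0, 2.0], [1.0, 1.0]])
y = np.array([3.0, 1.0, 0.0, 0.0])
print(is_separating(X, y, np.array([0.0, -1.0])))  # True
```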
Recall that the criterion described by Santos Silva and Tenreyro (2010) concerns whether the regressors are perfectly collinear over the subsample where y_i > 0. In our way of phrasing the issue, this criterion equates to saying there exists no linear combination of regressors satisfying (3). However, as Santos Silva and Tenreyro (2010) are careful to note, this criterion is only sufficient rather than necessary and sufficient. As the remaining elements of the preceding proof show, even if such a linear combination exists, separation is still avoided if z_i takes on both positive and negative values when y_i = 0, such that its maximum and minimum values over y_i = 0 “overlap” the value z_i = 0 that it takes over y_i > 0.

[Footnote: The question of when overlap occurs is precisely the point left ambiguous in Santos Silva and Tenreyro (2010). See page 311 of their paper.]

Interestingly, we do not know of a widely accepted label for what we have called the “linear combination of regressors that separates the data” (i.e., z_i = x_i γ∗). Obviously, z_i plays a central role in the analysis of separation and the literature could use a concise name for it. We propose “certificate of separation”. The idea is that we can “certify” whether any such z_i separates the data by regressing it on x_i over 0 < y < ȳ to see if the R² equals one, and by checking the values of z_i at the boundary. There could be multiple such certificates; thus, we will sometimes refer to an “overall certificate of separation” z that can be used to identify all separated observations. For any γ∗ associated with a certificate of separation, we will tend to use the term “separating vector” (although another name for it is the “direction of recession”; see Geyer, 2009). We will use γ to denote the separating vector associated with z.

[Footnote: Our proposed name borrows from optimization, where phrases such as “certificate of feasibility”, “certificate of convexity”, “certificate of non-negativity”, and so on are used with a similar purpose.]

Results for Gamma PML and Inverse Gaussian PML. Of course, one stipulation that sticks out in Proposition 1 is our requirement that the individual log-likelihood l_i(·) have a finite upper bound. To our knowledge, the implications of relaxing this assumption have not been touched upon in the prior literature. Rewinding some of the last few details behind the above proof, the specific role played by this restriction is that it ensures that, if lim_{k→∞} l_i(β + kγ) = −∞ for any i, the overall log-likelihood l(β + kγ) = Σ_i l_i(β + kγ) also heads toward −∞ for large k. However, if l_i(·) is not bounded from above, it turns out that matters can differ. In this case, even if the data exhibits “overlap” (as defined above), this alone will not be sufficient to ensure that l(·) has a finite maximum. The two leading examples of GLMs where l_i(·) does not necessarily have a finite upper bound are Gamma PML and Inverse Gaussian PML. As shown in Table 1, the form of the pseudo-log-likelihood function for Gamma PML regression is

l(β) = Σ_i l_i(β) = Σ_i { −α y_i exp(−x_i β) − α x_i β }   (6)

and the form for Inverse Gaussian PML is

l(β) = Σ_i l_i(β) = Σ_i { −(α/2) y_i exp(−2 x_i β) + α exp(−x_i β) }.   (7)

In either case, notice that the associated b_i function from (1)—x_i β for Gamma, −exp(−x_i β) for Inverse Gaussian—has a lower bound of −∞ (i.e., lim_{x_i β→−∞} b_i = −∞.) Thus, in either case, while l_i(·) continues to have a finite upper bound for observations where y_i > 0, if y_i = 0, we have that lim_{x_i β→−∞} l_i(·) = ∞. With this in mind, the following Proposition collects results that apply to either of these estimators:

Proposition 2 (Gamma PML and Inverse Gaussian PML) Suppose the matrix of regressors X = (x_1, x_2, ..., x_M) is of full column rank. Also let γ∗ = (γ∗_1, γ∗_2, ..., γ∗_M) ∈ R^M be a non-zero vector of the same dimension as β.

(a) If l(β) conforms to Gamma PML (i.e., (6)), PML estimation of β will have no solution if and only if there exists a linear combination of regressors z_i = x_i γ∗ such that

z_i ≥ 0   ∀ i s.t. y_i > 0   (8)

and either of the following two conditions holds:

Σ_i z_i < 0   or   Σ_i z_i = 0 with z_i > 0 for at least one observation with y_i > 0.   (9)

(b) If l(β) conforms to Inverse Gaussian PML (i.e., (7)), PML estimation of β will have no solution if and only if there exists a linear combination of regressors z_i = x_i γ∗ such that z_i satisfies (8) and at least one z_i is < 0 when y_i = 0.

[Footnote: Note that ML estimation of either a Gamma distribution or an Inverse Gaussian distribution will not admit y_i = 0 observations.]
0. By a continuity argument similar to the one usedabove, this implies that l ( β + kγ ) must eventually become decreasing in k for sufficiently large k .Next, consider what happens if there exists a linear combination of regressors z i = x i γ whichis always ≥ y i >
0. In this case, because lim k →∞ Í z i , − αy i exp (− x i β − kz i ) =
0, we havethat lim k →∞ l ( β + kγ ) = lim k →∞ Õ i − α ( x i β − kz i ) + Õ z i = − αy i exp (− x i β ) . There are four possibilities for the above limit. If Í i z i <
0, the Gamma pseudo-likelihood functionis always increasing in the direction associated with γ , such that finite estimates do not exist.Alternatively, if Í i z i >
0, the limit equals −∞ and we are again assured that this function musteventually decrease with k , such that estimates will exist.The remaining two possibilities occur when Í i z i =
0. In this case, the effect of an increase in k on the psuedo-loglikelihood function is always given by dl ( β + kγ ) dk = Õ z i > αy i exp (− x i β − kz i ) z i ≥ . Inspecting the above expression, dl ( β + kγ )/ dk > z i > y i >
0, ensuring again that finite estimates do not exist. The final possibilityis if z i = y i > l ( β + kγ ) = l ( β ) for any k >
0. In otherwords, regardless of which initial β we consider, the pseudo-likelihood will always be weaklyhigher when we increment β by some positive multiple of γ , implying either that a finite solutionfor β maximizing l (·) does not exist (if z i > y i >
0) or that any finite solutionwill be non-unique (if z i = y i > β will not have a finite solution if there exists a linear combination of regressorssatisfying (8) and (9) and may not necessarily have a unique solution even if these conditions arenot met.For proving part (b), which pertains instead to Inverse Gaussian PML, it is again convenientto work with the derivative of the l ( β + kγ ) function with respect to k . Continuing to let z i = x i γ ,11nd after dividing up terms appropriately, this derivative can be expressed as dl ( β + kγ ) dk = − α Õ y i = [ exp (− x i β − kz i )] z i + α Õ y i > y i exp (− x i β − kz i ) z i − α Õ y i > exp (− x i β − kz i ) z i . (10)Let us start with the conditions highlighted in part (b), where z i ≥ y i > z i < y i =
0. We can see that the second and third terms in(10) will go to 0 in the limit where k becomes infinitely large. The first term, meanwhile, headsto infinity. Thus, the pseudo-log-likelihood function increases asymptotically for large k and it isclear there is no finite solution for β .However, we still need to verify what happens if we cannot find a linear combination z i sat-isfying both of the conditions stated in part (b). This part requires slightly more work. If z i ≥ i , for example, all three terms in (10) go to zero for k → ∞ —a result that is not in itselfall that informative. Likewise, if we consider what happens when z i may be less than zero for y i >
0, the first and third terms could potentially head toward + ∞ while the second term headstoward −∞ . In all of these seemingly ambiguous scenarios, we can use L’Hôpital’s rule to clarifythat dl ( β + kγ )/ dk < k , indicating that the pseudo-log-likelihood functionalways eventually decreases in the direction associated with γ . (cid:4) To our knowledge, we are the first to study the general circumstances under which estimatesfrom Gamma PML and Inverse Gaussian PML exist. That these estimators have not been specif-ically looked at in this context is perhaps not all that surprising, since these models have nottraditionally been used with zeroes and since the increase in popularity of PML estimation inapplied work has only occurred relatively recently. Indeed, thanks to contributions such as Man-ning and Mullahy (2001), Santos Silva and Tenreyro (2006), and Head and Mayer (2014), the maincontext in which researchers will likely be familiar with Gamma PML is in settings where zeroesare common, such as data for international trade flows and healthcare costs. Inverse GaussianPML is also sometimes considered for these types of applications (see Egger and Staub, 2015) butis significantly less popular, likely because it is more difficult to work with numerically.In this light, the results contained in Proposition 2 can be read in one of two ways. On the onehand, we confirm Gamma PML and Inverse Gaussian PML can, in principle, be used with datasets that include observed zeroes, even though their ML equivalents cannot. Since the ability to Even Wedderburn (1976), in his original derivation of a sufficient condition for the existence of GLM estimates,specifically avoids commenting on what conditions would be needed for Gamma estimates to exist if the dependentvariable is allowed to be zero. On the other hand, we can see from acomparison of Propositions 1 and 2 that the criteria needed for Gamma PML and Inverse GaussianPML to have finite solutions are considerably more strict than the equivalent criteria needed formost other standard GLMs. Furthermore, as we will see in the next section, these fundamentaldifferences also imply that Gamma PML and Inverse Gaussian PML lack some appealing propertiesthat enable us to more easily remedy situations where estimates do not exist for other models.For these reasons, we would recommend researchers exercise extra caution when using either ofthese estimators with data sets that include zeroes.
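To illustrate how the Gamma PML criteria differ from those of Proposition 1, the sketch below extends the earlier is_separating check to conditions (8)–(9) from Proposition 2(a). Again a minimal NumPy sketch of our own construction; the function name and tolerance are not from the paper.

```python
import numpy as np

def gamma_pml_no_solution(X, y, gamma, tol=1e-10):
    """Check Proposition 2(a): Gamma PML estimates fail to exist if
    z_i = x_i'gamma satisfies (8): z_i >= 0 whenever y_i > 0, together
    with (9): sum(z) < 0, or sum(z) = 0 with z_i > 0 for some y_i > 0."""
    z = X @ gamma
    cond8 = np.all(z[y > 0] >= -tol)
    total = z.sum()
    cond9 = (total < -tol) or (abs(total) <= tol and np.any(z[y > 0] > tol))
    return cond8 and cond9

# Same toy data as before: gamma = (0, -1) separates the Poisson model,
# and here sum(z) = -3 < 0, so Gamma PML estimates do not exist either.
X = np.array([[1.0, 0.0], [1.0, 0.0], [1.0, 2.0], [1.0, 1.0]])
y = np.array([3.0, 1.0, 0.0, 0.0])
print(gamma_pml_no_solution(X, y, np.array([0.0, -1.0])))  # True
```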
3 Dealing with separation in practice

This section describes recommendations for dealing with separation in practice, including in high-dimensional environments with many fixed effects and other nuisance parameters. Before digging into these details, it is important to make two general points. First, as we have shown, the implications of separation differ across different models; thus, the most appropriate remedy should depend on the model being used. Second, the appeal of our own preferred alternative—dropping separated observations beforehand—is likely to depend on one’s comfort level with allowing the linear predictor x_i β to attain what would ordinarily be an inadmissible value (and the consequences for the likelihood function then follow). One method we do generally caution against is to simply remove one of the x_mi implicated in z_i from the model, since this obviously affects the identification of all remaining parameters to be estimated, with the effect differing depending on which regressor is dropped.

In the next subsection, we will first show that when x_i β is allowed to attain ±∞, separated observations often do not affect the score function for β under fairly general circumstances (though, as noted, it does depend on the model being used.) The main utility of this insight, which is especially useful for models with many fixed effects, is that it provides a theoretical justification for the practice of first dropping separated observations from the estimation. The remainder of this section then focuses on issues related to detection of separation problems, including in high-dimensional environments.

[Footnote: The other main reason is that the traditional practice of logging the dependent variable and estimating a linear model is now widely known to introduce a bias whenever the error term is heteroskedastic.]

[Footnote: Furthermore, in fixed effects models computed using dimensionality-reducing techniques (e.g., Figueiredo et al., 2015; Larch et al., 2019; Stammann, 2018), even identifying which combinations of the x_mi’s induce separation may be infeasible.]

3.1 Effects of dropping separated observations

We now turn to discussing how identification of at least some of the model parameters may be possible when separation occurs. We start with the concept established in Verbeek (1989) and Clarkson and Jennrich (1991) of a “compactified” (or “extended”) GLM where the parameter space is extended to admit boundary values. We can phrase this compactification in one of several equivalent ways. For example, we could express the domain for β as [−∞, +∞]^M, the compact closure of R^M. However, it is also often convenient to work with the linear predictor x_i β, which in turn also may vary over [−∞, +∞] for each i. In particular, note how the conditional mean µ_i behaves as x_i β attains either of its two limits: when x_i β → −∞, we have that µ_i → 0, whereas when x_i β → ∞ (a situation that is only relevant for binary choice models and fractional data models), we have that µ_i → ȳ. It is straightforward to show that estimates always exist when we “compactify” the model in this way.

[Footnote: As discussed in Verbeek (1989), one way to justify the inclusion of infinitely large values in the admissible parameter space is to observe that we could just as easily perform the maximization over a homeomorphic space where the parameters of interest are instead bounded by a finite interval (e.g., [−1, 1]^M instead of [−∞, ∞]^M). A version of this concept is also described in Haberman (1974). It is also sometimes referred to as the “Barndorff-Nielsen completion” (Barndorff-Nielsen, 1978).]

With this adjustment to the parameter space in mind, consider what happens to the score function s(β) and information matrix F(β) := E[∂s(β)/∂β] in the limit as k → ∞ in the case of separation outlined above. In other words, consider

lim_{k→∞} s(β + kγ∗) := lim_{k→∞} Σ_i s_i(β + kγ∗) := lim_{k→∞} Σ_i α_i [ y_i − µ_i(x_i β + k x_i γ∗) ] θ′(x_i β + k x_i γ∗; ·) x_i   (11)

and

lim_{k→∞} F(β + kγ∗) := lim_{k→∞} Σ_i F_i(β + kγ∗) := −lim_{k→∞} Σ_i α_i µ′_i(x_i β + k x_i γ∗) θ′(x_i β + k x_i γ∗; ·) x_i x_i^T   (12)

where we take γ∗ to be a vector satisfying the applicable conditions for non-existence. At this point, it will also be useful to state the following lemma:

Lemma 1 Suppose that l(β) conforms to (1). If the individual log-likelihood function l_i(β) has a finite upper bound, then:

(a) The respective limits of the i-specific score term s_i and i-specific information term F_i each go to 0 as the linear predictor x_i β goes to −∞ (i.e., lim_{x_iβ→−∞} s_i = 0 and lim_{x_iβ→−∞} F_i = 0).

(b) For models where lim_{x_iβ→∞} µ_i = ȳ < ∞ (e.g., binary choice models), the limits of s_i and F_i go to 0 as x_i β goes to ∞ as well (i.e., lim_{x_iβ→∞} s_i = 0 and lim_{x_iβ→∞} F_i = 0.)

The utility of this lemma (which we prove in our Appendix) is that, together with (11) and (12), it delivers the following proposition:
Proposition 3 (Effects of dropping separated observations) Suppose the assumptions stated in Proposition 1 continue to hold, except we now consider a “compactified” GLM where the domain for β is [−∞, +∞]^M. Also suppose the joint likelihood of any non-separated observations always satisfies the classical assumptions described in Gourieroux et al. (1984). If there exists a separating vector γ∗ ∈ R^M meeting the conditions described in Proposition 1, then:

(a) A solution for β ∈ [−∞, +∞]^M maximizing l(β) always exists.

(b) The ML estimates for the linear predictors (x_i β), canonical parameters (θ_i), and conditional means (µ_i) of any observations not separated by γ∗ (i.e., those with x_i γ∗ = 0) are unaffected by dropping any observations that are separated by γ∗ (i.e., those with x_i γ∗ ≠ 0).

(c) For any m with γ∗_m = 0, the associated individual parameter estimate β_m is unaffected by dropping any observations with x_i γ∗ ≠ 0.

(d) If l(β) is in the linear exponential family and if the relationship between the linear predictor x_i β and the conditional mean µ(x_i β) is correctly specified, then all finite elements of β are consistently estimated and their asymptotic confidence intervals can be inferred using the subsample of non-separated observations.

Part (a) follows from our proof of Proposition 1 (and is also a central result from Verbeek, 1989). After allowing β to take on either −∞ or +∞, we rule out the cases where estimates would otherwise be said not to exist. Parts (b) and (c) are analogous to the insights contained in Clarkson and Jennrich (1991)’s Theorem 2. After invoking Lemma 1, the score function in (11) can be rewritten as

lim_{k→∞} s(β + kγ∗) = Σ_{x_iγ∗=0} s_i(β) + lim_{k→∞} Σ_{x_iγ∗≠0} s_i(β + kγ∗) = Σ_{x_iγ∗=0} s_i(β).   (13)

The key insight presented in (13) is that the contribution of any observation with x_i γ∗ ≠ 0 vanishes from the score. Thus, any β that maximizes l(β) in the compactified model must also maximize Σ_{x_iγ∗=0} l_i(β) (i.e., the log-likelihood associated with only the observations that are not separated by x_i γ∗.) Otherwise, we would have that Σ_{x_iγ∗=0} s_i(β) ≠ 0, implying that the joint likelihood of the non-separated observations can be increased without affecting that of the separated observations.

Parts (b) and (c) then follow because, if β = β∗ maximizes the log-likelihood of the non-separated observations Σ_{x_iγ∗=0} l_i(β), any coefficient vector of the form β∗ + kγ∗ maximizes it as well. That is to say, the estimates for β will not in general be identical with or without the separated observations. But the quantities x_i β, θ_i, and µ_i will not be affected, as stated in part (b), since x_i(β∗ + kγ∗) = x_i β∗ over the subsample where x_i γ∗ = 0 (and θ_i and µ_i are functions of x_i β). Consequently, for any m such that γ∗_m = 0, the individual parameter estimate β∗_m + kγ∗_m = β∗_m is clearly the same in either case, as stated in part (c).

To prove part (d), we consider a suitable re-parameterization of the linear predictor x_i β that preserves the same information about any β_m’s associated with regressors that are not involved in separation. Let S < M be the number of regressors for which a separating vector γ∗ exists with γ∗_m ≠ 0. We need to allow for the possibility that there could be many such separating vectors affecting the data. Without loss of generality, we can make the following assumptions:

• z_i^(1) = Σ_m x_mi γ_m^(1), z_i^(2) = Σ_m x_mi γ_m^(2), ..., z_i^(J) = Σ_m x_mi γ_m^(J) are J ≤ S distinct, linearly independent certificates of separation.

• z_i := Σ_j z_i^(j) is the “overall” certificate that identifies all separated observations in the data, and γ := Σ_j γ^(j) is its associated separating vector.

• The x_i’s and z_i’s are such that γ_j^(j) = 1 and γ_{−j}^(j) = 0 for all j, −j ∈ 1...J with j ≠ −j. Furthermore, γ_m^(j) = 0 for all j ∈ 1...J and for all m > S.

The re-parameterized linear predictor x̃_i β̃ can be obtained by adding and subtracting Σ_{m=1}^J β_m z_i^(m):

x̃_i β̃ := x_i β + Σ_{m=1}^J β_m z_i^(m) − Σ_{m=1}^J β_m z_i^(m) = Σ_{m=1}^J β_m z_i^(m) + Σ_{m=J+1}^S x_mi β̃_m + Σ_{m=S+1}^M x_mi β_m,   (14)

where each β̃_m := β_m − Σ_{j=1}^J β_j γ_m^(j) must now be interpreted as a combination of multiple different parameters. The new set of regressors is x̃_i := (z_i^(1), ..., z_i^(J), x_{J+1,i}, ..., x_{Mi}). Under this re-parameterization, we know that β_m = ∞ for m ∈ 1...J and β_m = −∞ for m ∈ J+1...S. The combined β̃_m parameters will have finite estimates, however. What’s more, the first-order conditions used to identify estimates of β_{S+1}, ..., β_M are unaffected by this transformation.

To complete the proof, let β̃_F := (β̃_{J+1}, ..., β̃_S, β_{S+1}, ..., β_M) denote the vector of finite parameters and let β̂_F denote the corresponding vector of ML estimates. By parts (b) and (c), it is possible to construct a suitably modified (M − J) × 1 score vector s̃(β̂_F) := Σ_{z_i=0} s̃_i(β̂_F) that uniquely identifies the ML estimates for β̃_F over the subset of observations where z_i = 0. Letting N_{z_i=0} be the number of z_i = 0 observations, β̃_F will be consistently estimated if (1/N_{z_i=0}) Σ_{z_i=0} s̃_i(β̃_F) →_p 0 as N_{z_i=0} becomes large. Thus, consistency depends only on the joint likelihood of the observations for which z_i = 0, which is always well defined. As such, Gourieroux et al. (1984)’s proof of consistency for linear exponential families applies directly after restricting the sample to only the non-separated observations. Similarly, a standard asymptotic variance expansion gives us

(N_{z_i=0})^{1/2} (β̂_F − β̃_F) →_d N(0, F̃^{−1} B F̃^{−1}),

where F̃ := E[Σ_{z_i=0} ∂s̃_i(β̂_F)/∂β̂_F] is a reduced information matrix pertaining only to the finite parameters and B := E[Σ_{z_i=0} s̃_i(β̂_F) s̃_i(β̂_F)^T] captures the variance of the modified score. ∎

For researchers encountering separation problems, the key takeaways from Proposition 3 are likely to be parts (c) and (d): even if one or more of the elements of the ML estimate for β “does not exist” (i.e., is ±∞), it is still often the case that β has some finite elements that are identified by the model first-order conditions and that can be consistently estimated. Specifically, so long as separation is “quasi-complete” (meaning there are at least some observations with x_i γ = 0), the remaining finite parameters can be estimated using only the subsample of observations with x_i γ = 0.

[Footnote: For example, in a Poisson model where µ = exp[β_0 + β_1 x_1 + β_2 x_2 + β_3 x_3] and z_i = x_{1i} + x_{2i} is a linear combination of x_{1i} and x_{2i} that equals 0 for all y_i > 0 and is < 0 for at least some y_i = 0, then exp[β_0 + β_1 z_i + (β_2 − β_1) x_2 + β_3 x_3] is a re-parameterization of µ that presents the same information about β. Here, we know that β_1 and β_2 will both be ∞. The combined parameter β_2 − β_1 is finite, however, and the re-parameterized model allows us to take into account its covariance with β_3 in drawing inferences.]

[Footnote: Most GLMs typically used in applied economic research are from the linear exponential family (e.g., Poisson, Probit, Logit, and Negative Binomial). However, a similar result can be obtained for an even more general class of GLMs by extending the proofs of Fahrmeir and Kaufmann (1985).]

The comparison with perfect collinearity is instructive. Much as under separation, a perfectly collinear regressor offers no independent variation that could identify its own coefficient. Therefore, its estimated effect could theoretically take any value without affecting the score function or the estimates of other variables it is not collinear with. Separation is similar in that, because the regressors implicated in x_i γ are only identifiable from the observations where x_i γ ≠ 0, they therefore provide no information about the remaining parameters that are not involved in separation. The two issues are still fundamentally distinct, of course, since separation involves estimates of the problematic regressors becoming infinite rather than indeterminate. In either case, however, it is important that a researcher note that the choice of which regressor to drop from the estimation is often arbitrary and that the computed coefficients of some of the remaining regressors (i.e., those that are involved in either separation or perfect collinearity) may need to be interpreted as being relative to an omitted regressor or omitted regressors, as shown in (14).

[Footnote: More precisely, this is where x_i γ = 0 for at least some observations, with γ the overall separating vector.]

[Footnote: Another similarity between separation and perfect collinearity is that separation is neither strictly a “small sample” issue nor a “large sample” issue. It may be resolved by obtaining a larger sample if the underlying reason is that there is not enough variation in one or more of the regressors in the current sample. However, it may also occur in large samples either because of fundamental co-dependence between y_i and some of the regressors (an embargo may always predict zero exports, for example) or because the number of regressors increases with the sample size (as is typically the case with panel data models).]
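The vanishing-score logic behind Lemma 1 and (13) is easy to verify numerically. The sketch below, a minimal NumPy illustration of our own construction, evaluates the Poisson score contribution α_i (y_i − µ_i) x_i of a separated observation along the path β + kγ∗ and shows it shrinking toward zero as k grows.

```python
import numpy as np

# One separated observation: y = 0 and x'gamma* < 0, so x'(beta + k*gamma*)
# heads to -infinity and mu = exp(x'beta) heads to 0 as k grows.
x = np.array([1.0, 2.0])            # regressor values for the observation
y = 0.0
beta = np.array([0.5, 0.3])         # arbitrary starting point
gamma_star = np.array([0.0, -1.0])  # separating direction: x @ gamma_star = -2 < 0

for k in [0, 5, 10, 20]:
    mu = np.exp(x @ (beta + k * gamma_star))  # Poisson conditional mean
    score_i = (y - mu) * x                    # i-specific score term s_i
    print(k, score_i)
# The printed s_i terms decay toward (0, 0), matching Lemma 1(a):
# separated observations drop out of the score in the compactified model.
```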
3.2 Detecting separation

The discussion thus far has been strictly theoretical, but the practical aspects of the separation problem are also interesting. To date, most discussion of how to detect separation has focused on binary choice models, where the only relevant conditions governing separation are (4) and (5). In non-binary choice models, by contrast, there are generally also observations with 0 < y_i < ȳ, such that the third condition stated in (3) becomes very important. In some cases, this condition can greatly simplify the task of detection. For example, if we can determine that X is of full column rank over 0 < y_i < ȳ, then (3) cannot be satisfied and we know there is no separation. Likewise, if the rank of X over 0 < y_i < ȳ is M − 1, such that there is only one γ∗ that satisfies (3), it is generally easy to compute values for z_i = x_i γ∗ over the rest of the sample and check whether or not they satisfy the other conditions for separation.
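The full-column-rank shortcut just described translates directly into code. A minimal NumPy sketch of our own construction, again for a model with no upper bound (so the interior subsample is y > 0):

```python
import numpy as np

def rank_screen(X, y):
    """Quick screen based on condition (3): if X has full column rank over
    the interior subsample, no non-zero linear combination can satisfy (3)
    and separation is impossible."""
    X_interior = X[y > 0]
    full_rank = np.linalg.matrix_rank(X_interior) == X.shape[1]
    return "no separation possible" if full_rank else "further checks needed"

X = np.array([[1.0, 0.0], [1.0, 0.0], [1.0, 2.0], [1.0, 1.0]])
y = np.array([3.0, 1.0, 0.0, 0.0])
print(rank_screen(X, y))  # second regressor is constant when y > 0
                          # -> "further checks needed"
```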
However, even in non-binary choice models, things become much more complicated if there are multiple linear combinations of regressors satisfying (3) (i.e., if the column rank of X over 0 < y_i < ȳ is ≤ M − 2). Consider the example shown in Table 2, where two distinct linear combinations of regressors, z_{1i} and z_{2i}, are both always 0 over y_i > 0. The second- and third-to-last columns of Table 2 then show that both z_{1i} and z_{2i} exhibit overlap over y_i = 0, suggesting that estimates should exist. However, just by virtue of there being two such linear combinations of regressors satisfying (3), note that any other linear combination z_{3i} of the form z_{3i} = α z_{1i} + (1 − α) z_{2i} also satisfies (3), where α could be any real number. Thus, there are actually an infinite number of linear combinations of regressors one would need to check for overlap in this manner in order to determine existence. In this particular example, it is still possible to determine without too much effort that a particular combination of z_{1i} and z_{2i} separates the first observation. But for more general cases, a more rigorous approach would clearly be needed to take into account the many different ways the data could be separated.

In light of these complexities, the standard method for detecting separation in the literature is to set up some variation of the following linear programming problem:

max_{γ_S} Σ_{y_i=0} 1(x_i γ_S < 0) + Σ_{y_i=ȳ} 1(x_i γ_S > 0)
s.t. −x_i γ_S ≥ 0 ∀ i s.t. y_i = 0,
x_i γ_S ≥ 0 ∀ i s.t. y_i = ȳ,
x_i γ_S = 0 ∀ i s.t. 0 < y_i < ȳ,   (15)

where 1(x_i γ_S < 0) and 1(x_i γ_S > 0) respectively denote 0/1 indicators for observations with x_i γ_S < 0 and x_i γ_S > 0. If a non-zero vector γ_S can be found that solves the problem defined by (15), then the linear combination x_i γ_S clearly satisfies the conditions for separation described in Proposition 1. Furthermore, since γ_S must maximize the number of separated observations, it follows that γ_S = γ, the separating vector associated with the overall certificate of separation. A simplex solver or a variety of other similar linear programming methods may be used to solve for γ_S; see Konis (2007) for a thorough discussion.

[Footnote: Santos Silva and Tenreyro (2010) were the first to make this observation.]

A common weakness of linear programming methods in this context is that they suffer from a curse of dimensionality. Notice that the number of constraints associated with (15) is equal to the number of observations (i.e., N) and the number of γ-parameters that need to be solved for is equal to the number of regressors (i.e., M). While there are standard operations that may be used to reduce the size of the problem to one with only N − M constraints (cf., Konis, 2007, p. 64), an obvious problem nonetheless arises if either M or N − M is a large number, as is increasingly the case in applied economics research. In these cases, the standard approach just described necessitates solving a high-dimensional linear programming problem, which may be difficult to solve even using the most computationally efficient LP-solvers currently available. The following discussion therefore turns to the question of how to equip researchers to deal with the separation problem in models with many fixed effects and other nuisance parameters.
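For moderately sized problems, a version of (15) can be handed to an off-the-shelf LP solver. The sketch below is a simplified linear variant of our own construction (not the formulation of Konis, 2007): for a model with only a lower bound, it maximizes Σ(−x_i γ) over the y_i = 0 sample subject to the sign and equality constraints, with box bounds keeping the program bounded; any strictly negative fitted z_i then flags a separated observation.

```python
import numpy as np
from scipy.optimize import linprog

def detect_separation_lp(X, y, tol=1e-8):
    """LP check for separation in a model with only a lower bound at zero.

    Finds gamma with x_i'gamma = 0 whenever y_i > 0 and x_i'gamma <= 0
    whenever y_i = 0, maximizing sum(-x_i'gamma) over the y_i = 0 sample.
    """
    X0, X_pos = X[y == 0], X[y > 0]
    c = X0.sum(axis=0)  # maximize sum(-X0 @ g)  <=>  minimize sum(X0 @ g)
    res = linprog(
        c,
        A_ub=X0, b_ub=np.zeros(len(X0)),        # x_i'gamma <= 0 when y_i = 0
        A_eq=X_pos, b_eq=np.zeros(len(X_pos)),  # x_i'gamma  = 0 when y_i > 0
        bounds=[(-1, 1)] * X.shape[1],          # keep the LP bounded
        method="highs",
    )
    z = X @ res.x
    return z < -tol  # True for separated observations

X = np.array([[1.0, 0.0], [1.0, 0.0], [1.0, 2.0], [1.0, 1.0]])
y = np.array([3.0, 1.0, 0.0, 0.0])
print(detect_separation_lp(X, y))  # [False False  True  True]
```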
3.3 Detecting separation with high-dimensional fixed effects

To introduce a notion of high-dimensionality, we will now suppose the set of regressors can be partitioned into two distinct components: a set of P non-fixed effects regressors w_i = (w_{1i}, ..., w_{Pi}), which we will treat as countable in number, and a set of Q 0/1 fixed effect dummies d_i = (d_{1i}, ..., d_{Qi}), where Q is allowed to be a large number. The total number of regressors M = P + Q is therefore also a large number and the combined matrix of non-fixed effect and fixed effect regressors can be expressed as X = {w_i, d_i}. Note that this partition does not depend on the indexing of these fixed effects, but they could easily be sub-divided into multiple levels (e.g., “two-way” or “three-way” fixed effects specifications) depending on the application. The number of observations N is assumed to be > M, with N − M also generally treated as a large number.

[Footnote: As noted in the introduction, this popularity is largely driven by the wide adoption of fixed effects Poisson PML (FE-PPML) models for estimating gravity models. For example, Figueiredo et al. (2015) estimate a gravity model for patent citations with N ≈ 26 million, and Larch et al. (2019) estimate a similar model for international trade flows; in both applications, M runs into the thousands.]

[Footnote: Computationally efficient LP-solvers would typically involve inverting an M × M basis matrix (cf., Hall and Huangfu, 2011), a step we would prefer to avoid.]

[Footnote: In addition, note that the high-dimensional portion of the regressor set need not consist of only 0/1 dummies; d_i can contain linear time trends, fixed effects interacted with non-fixed effect variables, and so on without loss of generality.]

Before describing our preferred method for solving this problem, we first briefly discuss the alternatives. One option would be to search for separating vectors γ∗ that involve only the fixed effects. Alternatively, we could simply attempt to compute estimates without any precautions and consider any observation for which the conditional mean appears to be converging numerically to either 0 or ȳ to be separated. As discussed in Clarkson and Jennrich (1991) (p. 424), this latter method is generally not guaranteed to detect separation correctly.

[Footnote: We describe in our Appendix an iteratively reweighted least squares (IRLS) algorithm that can accommodate high-dimensional models in a computationally efficient way. The iterative output from this algorithm can in principle be used to detect observations whose µ’s are converging to inadmissible values. However, in practice, leaving these observations in the estimation tends to slow down convergence. Furthermore, implementing this method is harder when the true distribution of µ is very skewed, since it becomes very difficult to determine numerically what is a “zero” µ versus only a “small” µ.]

Our own preferred algorithm, which is based on weighted least squares, does not suffer from these types of issues. It can be applied to a very general set of models, is guaranteed to detect separation, and is both simple to understand and fast. Moreover, it can be implemented in any standard statistical package (without need for an LP solver), and our proof of its effectiveness relies only on textbook least-squares algebra.

We now turn to describing how the algorithm works for Poisson models and similar models with only a lower bound. We will then explain how it may be readily applied to binomial and multinomial models without loss of generality. To proceed, let u_i be an artificial regressand such that u_i ≤ 0 if y_i = 0 and u_i = 0 if y_i > 0. Also let ω_i be a set of regression weights, given by

ω_i = 1 if y_i = 0;  ω_i = K if y_i > 0,

with K an arbitrary positive integer. The purpose behind these definitions is that we can choose a sufficiently large K such that a weighted regression of u_i on x_i can be used to detect if the equality constraint in (3) can be satisfied by the data. This result is an application of what is sometimes called the “weighting method” (Stewart, 1997).

[Footnote: It is also similar to penalty methods and barrier methods, which have been used for decades as alternatives to simplex-based algorithms for solving linear programming problems (Forsgren et al., 2002). The value-added of our approach is that it also takes advantage of recent computational innovations that are specific to the estimation of least-squares regressions and that readily accommodate models with arbitrarily high dimensionality.]

We clarify how this technique works using the following lemma:
Lemma 2
For every ϵ > 0, there is an integer K > 0 such that the residual e_i from the weighted least squares regression of u_i on x_i, using ω_i as weights, is within ϵ of zero (|e_i| < ϵ) for the observations where y_i > 0.

To prove this statement, note first that the residual sum of squares (RSS) minimized by this regression will be at most u′u. It is then useful to let K equal the smallest integer that is > u′u/ϵ². If the weighted least squares residual e_i is greater than ϵ in absolute magnitude for an observation with y_i > 0, then that observation will contribute more than Kϵ² to the RSS and the RSS will be at least Kϵ². If K > u′u/ϵ², then RSS > u′u, which is a contradiction. ∎

Because we can force the predicted values of u_i from this regression to zero (up to an arbitrarily small tolerance) for observations with y_i > 0, the coefficients computed from this regression therefore satisfy (3). The only remaining step is to choose u_i so that all separated observations have predicted values less than zero and all non-separated observations have predicted values equal to zero. We achieve this goal via the following algorithm:

1. Given a certain ϵ > 0, define the working regressor u_i and regression weight ω_i as:

u_i = −1 if y_i = 0;  u_i = 0 if y_i > 0,
ω_i = 1 if y_i = 0;  ω_i = K if y_i > 0.

Observe that: (i) the regressand is either zero or negative; (ii) u′u is equal to the number of y_i = 0 observations (denoted N_(0)).

2. Iterate on these two steps until all residuals are smaller in absolute magnitude than ϵ (i.e., until all |e_i| < ϵ):

(a) Regress u_i against x_i using weights ω_i. Compute the predicted values û_i = x_i γ̂ and residuals e_i = u_i − û_i.

(b) For observations with y_i = 0, update u_i = min(û_i, 0), ensuring that the regressand remains ≤ 0.

3. Upon convergence, the unweighted R² of the last regression iteration is always equal to 1 (i.e., u_i = û_i for all i).

[Footnote: One could also update K and ϵ with each iteration as well. In theory, this would lead to exact convergence. In practice, we would typically need to insist ϵ be no smaller than 1e−16, which is the machine precision of most modern 64-bit CPUs.]

The following proposition establishes the convergence properties of this algorithm and its effectiveness at detecting separation:

Proposition 4 (Convergence to the correct solution) The above algorithm always converges. Furthermore, if all û_i = 0 upon convergence, there exists no non-zero vector γ∗ ∈ R^M that solves the system defined by (3) and (5) and there is no separation. Otherwise, the observations that are found to have û_i < 0 are separated and all the observations with û_i = 0 are not separated.

We provide a proof of this proposition in our Appendix. The main observation for our current purposes is that none of the above steps are significantly encumbered by the size of the data and/or the complexity of the model. Thanks to the recent innovations of Correia (2017), weighted linear regressions with many fixed effects can be computed in almost-linear time (as can more general high-dimensional models using time trends or individual-specific continuous regressors). The above method can therefore be applied to virtually any GLM for which (3) and (5) are necessary and sufficient conditions for existence, even when the model features many levels of fixed effects and other high-dimensional parameters. Notably, this includes frequency table models—the original object of interest in Haberman (1974)—which themselves may be thought of as multi-way fixed effects models without non-fixed effect regressors.

[Footnote: As discussed in Guimarães and Portugal (2010), this is because we can use the Frisch-Waugh-Lovell theorem to first “partial out” the fixed effects d_i from either side of the problem via a within-transformation operation and then regress the within-transformed residuals of u_i on those of the non-fixed effect regressors w_i to obtain e_i. Correia (2017) then shows how to solve the within-transformation sub-problem in nearly-linear time.]

The above algorithm still needs a name. Its defining features are that it iteratively uses weighted least squares in combination with a “linear rectifier” function to ensure u_i eventually converges to the overall certificate of separation that identifies all separated observations. However, “Iteratively Rectified Weighted Least Squares” would be confusing for obvious reasons. Instead, we have settled on the name “iterative rectifier” (or IR for short).

[Footnote: We borrow this term from the machine learning literature, where min(y, 0) and max(y, 0) are known as linear rectifiers or ReLUs (Rectified Linear Units). Despite their simplicity, ReLUs have played a significant role in increasing the accuracy and popularity of deep neural networks (Glorot et al., 2011).]
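For concreteness, here is a compact NumPy sketch of the iterative rectifier, assuming a model with only a lower bound at zero. It uses a dense weighted least-squares solve for illustration; a production implementation would instead rely on the nearly-linear-time high-dimensional solvers discussed above. The function name and defaults are not from the paper.

```python
import numpy as np

def iterative_rectifier(X, y, eps=1e-5):
    """Sketch of the iterative rectifier (IR) algorithm for GLMs with a
    lower bound at zero (e.g. Poisson). Returns a boolean mask of separated
    observations and the fitted overall certificate of separation z."""
    zero = (y == 0)
    N0 = int(zero.sum())              # u'u at the first iteration
    K = N0 / eps**2 + 1.0             # Lemma 2: K > u'u / eps^2
    u = np.where(zero, -1.0, 0.0)     # artificial regressand
    sqrt_w = np.sqrt(np.where(zero, 1.0, K))
    while True:
        # Weighted least squares of u on X (dense solve for illustration).
        gamma, *_ = np.linalg.lstsq(X * sqrt_w[:, None], u * sqrt_w, rcond=None)
        u_hat = X @ gamma
        if np.max(np.abs(u - u_hat)) < eps:
            break                      # all residuals below tolerance
        u = np.where(zero, np.minimum(u_hat, 0.0), 0.0)  # rectifier step
    return u_hat < -eps, u_hat

X = np.array([[1.0, 0.0], [1.0, 0.0], [1.0, 2.0], [1.0, 1.0]])
y = np.array([3.0, 1.0, 0.0, 0.0])
separated, z = iterative_rectifier(X, y)
print(separated)  # [False False  True  True]: both y = 0 rows are separated
```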
We would argue that the Poisson model is actually the easier of the two to work with for larger problems, since the algorithm we have just described cannot be applied directly to a binary choice model. For binary choice models, we can use this Logit-Poisson transformation to write down an equivalent Poisson model that is separated if and only if the original binary choice model is separated. The same is also true for fractional response models where y_i can vary continuously over [0, ȳ]. Our Appendix provides further discussion as well as a proof.

Concluding remarks

In this paper, we have provided an updated treatment of the concept of separation in generalized linear models. While the result that all GLMs with bounded individual likelihoods suffer from separation under similar circumstances has been shown before by several authors, these results arguably have not been given sufficient attention. Now that nonlinear estimation techniques have progressed to the point where nonlinear models are regularly estimated via Pseudo-maximum Likelihood with tens of thousands of fixed effects (and many zero observations), there is considerable ambiguity over whether the estimates produced by these models are likely to "exist", what it means when they do not "exist", and what can be done to ensure that the model can be successfully estimated.

We have brought more clarity to each of these topics. We have done so by building on the earlier work of Verbeek (1989) and Clarkson and Jennrich (1991), which we have extended to incorporate some models that have not previously been examined in this literature and which have their own, more idiosyncratic criteria governing existence. An important takeaway from this analysis is that some, but not all, GLMs can still deliver uniquely identified, consistent estimates of at least some of the model parameters even if other parameter estimates are technically infinite. We have also introduced a new method that can be used to detect separation in models with multiple levels of high-dimensional fixed effects, an otherwise tricky proposition that would ordinarily require solving a high-dimensional linear programming problem. As GLM estimation with high-dimensional fixed effects becomes ever faster and more appealing to researchers, methods that can detect and deal with separation in these models fill an important gap.

References
Abowd, J. M., R. H. Creecy, F. Kramarz, et al. (2002): "Computing Person and Firm Effects Using Linked Longitudinal Employer-Employee Data," Tech. rep., Center for Economic Studies, US Census Bureau.

Albert, A. and J. A. Anderson (1984): "On the Existence of Maximum Likelihood Estimates in Logistic Regression Models," Biometrika, 71, 1–10.

Allison, P. D. (2008): "Convergence Failures in Logistic Regression," in SAS Global Forum, vol. 360, 1–11.

Arellano, M. and J. Hahn (2007): "Understanding Bias in Nonlinear Panel Models: Some Recent Developments," Econometric Society Monographs, 43, 381.

Bajari, P. and A. Hortaçsu (2003): "The Winner's Curse, Reserve Prices, and Endogenous Entry: Empirical Insights from eBay Auctions," The RAND Journal of Economics, 34, 329–355.

Barndorff-Nielsen, O. (1978): Information and Exponential Families: In Statistical Theory, John Wiley & Sons.

Bergé, L. (2018): "Efficient Estimation of Maximum Likelihood Models with Multiple Fixed-Effects: The R Package FENmlm," unpublished manuscript, Center for Research in Economic Analysis, University of Luxembourg.

Cameron, A. C. and P. K. Trivedi (2013): Regression Analysis of Count Data, Cambridge University Press.

Carneiro, A., P. Guimarães, and P. Portugal (2012): "Real Wages and the Business Cycle: Accounting for Worker, Firm, and Job Title Heterogeneity," American Economic Journal: Macroeconomics, 4, 133–152.

Clarkson, D. B. and R. I. Jennrich (1991): "Computing Extended Maximum Likelihood Estimates for Linear Parameter Models," Journal of the Royal Statistical Society: Series B (Statistical Methodology), 53, 417–426.

Correia, S. (2015): "Singletons, Cluster-Robust Standard Errors and Fixed Effects: A Bad Mix," unpublished manuscript.

——— (2017): "Linear Models with High-Dimensional Fixed Effects: An Efficient and Feasible Estimator," unpublished manuscript.

Correia, S., P. Guimarães, and T. Zylkin (2019): "ppmlhdfe: Fast Poisson Estimation with High-Dimensional Data," arXiv preprint arXiv:1903.01690.

de Bromhead, A., A. Fernihough, M. Lampe, and K. H. O'Rourke (2019): "When Britain Turned Inward: The Impact of Interwar British Protection," American Economic Review, 109, 325–352.

Eck, D. J. and C. J. Geyer (2018): "Computationally Efficient Likelihood Inference in Exponential Families When the Maximum Likelihood Estimator Does Not Exist," arXiv preprint arXiv:1803.11240.

Egger, P. H. and K. E. Staub (2015): "GLM Estimation of Trade Gravity Models with Fixed Effects," Empirical Economics, 50, 137–175.

Fahrmeir, L. and H. Kaufmann (1985): "Consistency and Asymptotic Normality of the Maximum Likelihood Estimator in Generalized Linear Models," Annals of Statistics, 13, 342–368.

Fernández-Val, I. and M. Weidner (2016): "Individual and Time Effects in Nonlinear Panel Models with Large N, T," Journal of Econometrics, 192, 291–312.

Figueiredo, O., P. Guimarães, and D. Woodward (2015): "Industry Localization, Distance Decay, and Knowledge Spillovers: Following the Patent Paper Trail," Journal of Urban Economics, 89, 21–31.

Forsgren, A., P. E. Gill, and M. H. Wright (2002): "Interior Methods for Nonlinear Optimization," SIAM Review, 44, 525–597.

Gaure, S. (2013): "OLS with Multiple High Dimensional Category Variables," Computational Statistics & Data Analysis, 66, 8–18.

Gelman, A., A. Jakulin, M. G. Pittau, and Y.-S. Su (2008): "A Weakly Informative Default Prior Distribution for Logistic and Other Regression Models," Annals of Applied Statistics, 2, 1360–1383.

Geyer, C. J. (1990): "Likelihood and Exponential Families," Ph.D. thesis, University of Washington.

——— (2009): "Likelihood Inference in Exponential Families and Directions of Recession," Electronic Journal of Statistics, 3, 259–289.

Glorot, X., A. Bordes, and Y. Bengio (2011): "Deep Sparse Rectifier Neural Networks," in Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, 315–323.

Gourieroux, A. C., A. Monfort, and A. Trognon (1984): "Pseudo Maximum Likelihood Methods: Theory," Econometrica, 52, 681–700.

Guimarães, P. (2014): "POI2HDFE: Stata Module to Estimate a Poisson Regression with Two High-Dimensional Fixed Effects," Statistical Software Components, Boston College Department of Economics.

Guimarães, P. and P. Portugal (2010): "A Simple Feasible Procedure to Fit Models with High-Dimensional Fixed Effects," Stata Journal, 10, 628–649.

Haberman, S. J. (1973): "Log-Linear Models for Frequency Data: Sufficient Statistics and Likelihood Equations," Annals of Statistics, 1, 617–632.

——— (1974): The Analysis of Frequency Data, vol. 4, University of Chicago Press.

Hall, J. and Q. Huangfu (2011): "A High Performance Dual Revised Simplex Solver," in International Conference on Parallel Processing and Applied Mathematics, Springer, 143–151.

Head, K. and T. Mayer (2014): "Gravity Equations: Workhorse, Toolkit, and Cookbook," in Handbook of International Economics, ed. by G. Gopinath, E. Helpman, and K. Rogoff, North Holland, vol. 4, 131–195, 4 ed.

Heinze, G. and M. Schemper (2002): "A Solution to the Problem of Separation in Logistic Regression," Statistics in Medicine, 21, 2409–2419.

Konis, K. (2007): "Linear Programming Algorithms for Detecting Separated Data in Binary Logistic Regression Models," Ph.D. thesis, University of Oxford.

Larch, M., J. Wanner, Y. V. Yotov, and T. Zylkin (2019): "Currency Unions and Trade: A PPML Re-assessment with High-Dimensional Fixed Effects," Oxford Bulletin of Economics and Statistics, 81, 487–510.

Manning, W. G. and J. Mullahy (2001): "Estimating Log Models: To Transform or Not to Transform?" Journal of Health Economics, 20, 461–494.

McCullagh, P. and J. A. Nelder (1989): Generalized Linear Models, Chapman & Hall/CRC Monographs on Statistics & Applied Probability, Chapman and Hall/CRC, 2 ed.

Papke, L. E. and J. M. Wooldridge (1996): "Econometric Methods for Fractional Response Variables with an Application to 401(k) Plan Participation Rates," Journal of Applied Econometrics, 11, 619–632.

Rainey, C. (2016): "Dealing with Separation in Logistic Regression Models," Political Analysis, 24, 339–355.

Santos Silva, J. M. C. and S. Tenreyro (2006): "The Log of Gravity," Review of Economics and Statistics, 88, 641–658.

——— (2010): "On the Existence of the Maximum Likelihood Estimates in Poisson Regression," Economics Letters, 107, 310–312.

Santos Silva, J. M. C., S. Tenreyro, and K. Wei (2014): "Estimating the Extensive Margin of Trade," Journal of International Economics, 93, 67–75.

Silvapulle, M. J. (1981): "On the Existence of Maximum Likelihood Estimators for the Binomial Response Models," Journal of the Royal Statistical Society: Series B (Statistical Methodology), 43, 310–313.

Silvapulle, M. J. and J. Burridge (1986): "Existence of Maximum Likelihood Estimates in Regression Models for Grouped and Ungrouped Data," Journal of the Royal Statistical Society: Series B (Statistical Methodology), 48, 100–106.

Stammann, A. (2018): "Fast and Feasible Estimation of Generalized Linear Models with Many Two-Way Fixed Effects," arXiv preprint arXiv:1707.01815.

Stammann, A., F. Heiss, and D. McFadden (2016): "Estimating Fixed Effects Logit Models with Large Panel Data," unpublished manuscript.

Stewart, G. W. (1997): "On the Weighting Method for Least Squares Problems with Linear Equality Constraints," BIT Numerical Mathematics, 37, 961–967.

Stirling, W. D. (1984): "Iteratively Reweighted Least Squares for Models with a Linear Part," Applied Statistics, 7–17.

Verbeek, A. (1989): "The Compactification of Generalized Linear Models," in Statistical Modelling, Springer, 314–327.

——— (1992): "The Compactification of Generalized Linear Models," Statistica Neerlandica, 46, 107–142.

Wedderburn, R. W. M. (1976): "On the Existence and Uniqueness of the Maximum Likelihood Estimates for Certain Generalized Linear Models," Biometrika, 63, 27–32.

Weidner, M. and T. Zylkin (2017): "Bias and Consistency in Three-Way Gravity Models," unpublished manuscript.

Winkelmann, R. (2008): Econometric Analysis of Count Data, Springer Science & Business Media.

Yotov, Y. V., R. Piermartini, J.-A. Monteiro, and M. Larch (2016): An Advanced Guide to Trade Policy Analysis: The Structural Gravity Model, Geneva: World Trade Organization.

Zorn, C. (2005): "A Solution to Separation in Binary Response Models," Political Analysis, 13, 157–170.

Table 1: Mapping different regression models onto GLM
Each entry below lists, for a given model, the log-likelihood (l), the mapping θ(x_iβ; ν) from the linear predictor to the canonical parameter, the function b(θ_i), the conditional mean μ_i (= b′(θ_i)), and the first-order condition for β_m.

Probit: l = Σ_i [y_i log μ_i + (1 − y_i) log(1 − μ_i)] = Σ_i { y_i log[Φ(x_iβ)/(1 − Φ(x_iβ))] + log(1 − Φ(x_iβ)) }; θ_i = log[Φ(x_iβ)/(1 − Φ(x_iβ))]; b(θ_i) = log(1 + exp(θ_i)); μ_i = exp(θ_i)/(1 + exp(θ_i)) (= Φ(x_iβ)); FOC: Σ_i {φ(x_iβ)/(Φ(x_iβ)[1 − Φ(x_iβ)])} [y_i − μ_i] x_mi = 0.

Logit: l = Σ_i [y_i log μ_i + (1 − y_i) log(1 − μ_i)] = Σ_i [y_i x_iβ − log(1 + exp(x_iβ))]; θ_i = x_iβ; b(θ_i) = log(1 + exp(θ_i)); μ_i = exp(θ_i)/(1 + exp(θ_i)); FOC: Σ_i [y_i − μ_i] x_mi = 0.

Poisson: l = Σ_i [y_i x_iβ − exp(x_iβ) − ln y_i!]; θ_i = x_iβ; b(θ_i) = exp(θ_i); μ_i = exp(θ_i); FOC: Σ_i [y_i − μ_i] x_mi = 0.

Negative Binomial: l = Σ_i { y_i log[exp(x_iβ)/(ν + exp(x_iβ))] − ν log(ν + exp(x_iβ)) + c(ν, y_i) }; θ_i = log[exp(x_iβ)/(ν + exp(x_iβ))]; b(θ_i) = −ν log(1 − exp(θ_i)); μ_i = ν exp(θ_i)/(1 − exp(θ_i)) (= e^{x_iβ}); FOC: Σ_i [y_i − μ_i](1 + ν^{−1}μ_i)^{−1} x_mi = 0.

Gamma (PML): l = Σ_i [−α y_i exp(−x_iβ) − α x_iβ]; θ_i = −exp(−x_iβ); b(θ_i) = log(−1/θ_i); μ_i = −1/θ_i (= e^{x_iβ}); FOC: α Σ_i [y_i − μ_i] exp(−x_iβ) x_mi = 0.

Gaussian: l = Σ_i { −(1/(2σ²))[y_i − exp(x_iβ)]² − (1/2) log(2πσ²) } = Σ_i { (1/σ²)[y_i exp(x_iβ) − (1/2) exp(2x_iβ)] + c(σ², y_i) }; θ_i = exp(x_iβ); b(θ_i) = θ_i²/2; μ_i = θ_i; FOC: (1/σ²) Σ_i [y_i − μ_i] exp(x_iβ) x_mi = 0.

Inverse Gaussian (PML): l = Σ_i α [−(1/2) y_i exp(−2x_iβ) + exp(−x_iβ)]; θ_i = −(1/2) exp(−2x_iβ); b(θ_i) = −(−2θ_i)^{1/2}; μ_i = (−2θ_i)^{−1/2} (= e^{x_iβ}); FOC: α Σ_i [y_i − μ_i] exp(−2x_iβ) x_mi = 0.

Notes: Φ(·) is the cdf of a standard normal distribution; φ(·) is its pdf. α and σ² are dispersion/scaling factors to be estimated, which do not affect identification of β. ν, which does affect identification of β, is the dispersion parameter for Negative Binomial regression. Note that for Gamma and Inverse Gaussian, we consider only the Pseudo-maximum Likelihood (PML) versions of these estimators (the standard likelihood functions for these models do not admit y_i = 0).

Table 2: Separation may not be detected because of multiple redundant regressors over y_i > 0

(Columns: y_i, x_1i, x_2i, x_3i, x_4i, z_1i, z_2i, z_3i; the numerical entries of the original table are not reproduced here.)

In this example, a typical (iterative) check for perfect collinearity over y_i > 0 would find that z_1i = x_2i − x_3i is always 0 over y_i > 0, and likewise that z_2i = x_4i − x_3i is always 0 over y_i > 0. Since both z_1i and z_2i take on positive as well as negative values over y_i = 0, it would appear the model does not suffer from separation. However, the linear combination z_3i = 0.5 z_1i + 0.5 z_2i only takes on values ≤ 0 over y_i = 0, implying separation. x_1i is an explicit constant.
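To see this logic in executable form, the following sketch uses small hypothetical data of our own construction (the values are illustrative and are not the entries of the original table), with x_1 playing the role of the constant:

```python
import numpy as np

# Hypothetical data in the spirit of Table 2: over y > 0, the differences
# z1 = x2 - x3 and z2 = x4 - x3 are both exactly zero, yet each takes on
# both signs over y = 0, so neither signals separation on its own.
y  = np.array([2.0, 1.0, 3.0, 0.0, 0.0, 0.0])
x1 = np.ones(6)                                # explicit constant
x2 = np.array([1.0, 2.0, 3.0, 2.0, 0.0, 1.0])
x3 = np.array([1.0, 2.0, 3.0, 1.0, 3.0, 2.0])
x4 = np.array([1.0, 2.0, 3.0, 0.0, 4.0, 1.0])

z1, z2 = x2 - x3, x4 - x3
pos, zero = y > 0, y == 0

print(np.all(z1[pos] == 0), np.all(z2[pos] == 0))      # True True: both satisfy (3)
print(np.any(z1[zero] > 0) and np.any(z1[zero] < 0))   # True: z1 alone looks harmless
print(np.any(z2[zero] > 0) and np.any(z2[zero] < 0))   # True: z2 alone looks harmless

z3 = 0.5 * z1 + 0.5 * z2                               # the hidden certificate
print(np.all(z3[zero] <= 0) and np.any(z3[zero] < 0))  # True: separation after all
```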
Online-only Appendix for "Verifying the existence of maximum likelihood estimates for generalized linear models" by Sergio Correia, Paulo Guimarães, & Thomas Zylkin (not for publication)

Additional proofs
Proof of Lemma 1. Recall from our discussion of the GLM log-likelihood function in (1) that b(θ_i) is stipulated to be increasing and convex with respect to θ_i. Thus, the function l_i = y_iθ_i − b(θ_i) + c_i has a unique, finite maximum so long as y_i is positive. In turn, we need only concern ourselves with how l_i behaves when either y_i = 0 or y_i = ȳ. In the case where y_i = 0, observation i's contribution to the score function, s_i(β) = ∂l_i/∂β, is given by

    s_i(β) = −α_i b′(θ_i) θ′(x_iβ; ·) x_i = −α_i μ_i θ′(x_iβ; ·) x_i.    (16)

To complete the proof of part (a), first note that μ_i and θ′_i are both bounded from below by 0, so that μ_i × θ′_i cannot be < 0. Next, suppose that lim_{x_iβ→−∞} μ_i × θ′_i > 0. In this case, the sign of s_i(β) is equal to the sign of −x_i at the limit where x_iβ → −∞. As a result, the individual log-likelihood l_i(·) can be perpetually increased by increasing β in the direction opposite to x_i (i.e., by decreasing x_iβ). It therefore does not have a finite upper bound. We also need to show that lim_{k→∞} F_i(β) = 0. However, this follows directly from the fact that lim_{x_iβ→−∞} s_i(β) → 0. To see this, let F_i^{(m,n)} denote the (m,n)th element of F_i(β) and let s_i^{(m)} denote observation i's contribution to the mth element of the score vector. If observation i is separated at the lower bound by some vector γ*, each element of F_i(β) can be written as

    lim_{k→∞} F_i^{(m,n)} ≡ lim_{k→∞} E[ ∂s_i^{(m)}(β + kγ*)/∂β_n ] ≡ lim_{k→∞} E[ lim_{τ→0} ( s_i^{(m)}(β + kγ* + 1_n τ) − s_i^{(m)}(β + kγ*) ) / τ ] = 0,

where 1_n is a unit vector of length M with its nth element equal to 1. Finally, for part (b), we also need to consider the case where lim_{x_iβ→∞} μ_i → ȳ < ∞ (where we generally take ȳ to be 1, as is the case in binary choice models). A similar reasoning applies here as well: if lim_{x_iβ→∞} s_i(β) = lim_{x_iβ→∞} (ȳ − μ_i) θ′_i x_i > 0, it will always be possible to increase l_i(·) by increasing β in the same direction as x_i. And we can rule out lim_{x_iβ→∞} s_i(β) < 0, since θ′_i cannot be negative and since μ_i cannot exceed ȳ. The reasoning as to why lim_{k→∞} F_i(β) = 0 in this case is likewise similar. □

Proof of Proposition 4. This proof is split into two parts. First, we show that the algorithm described in Section 3.3 always converges. Then we show that if the algorithm converges, it always converges to the correct solution.
Proof of convergence. Let u_i^(k) denote the ith observation of the working dependent variable at iteration k and let û_i^(k) := x_iγ̂^(k) be its predicted value, with γ̂^(k) denoting the vector of weighted least squares coefficients estimated at iteration k. Also let u_i^2,(k) and û_i^2,(k) respectively denote the squares of u_i^(k) and û_i^(k). e_i := u_i − û_i will continue to denote a residual, with e_i^(k) denoting the residual from the kth iteration and e_i^2,(k) denoting its square. In addition, it will occasionally be convenient to let u^(k), û^(k), and e^(k) respectively denote the vector analogues of u_i^(k), û_i^(k), and e_i^(k).

When the algorithm converges, all of the residuals from the weighted least squares step converge to zero: u_i = û_i ≤ 0 ∀i ⟺ e_i = 0 ∀i. It would be cumbersome to show that all residuals indeed converge, so we instead take a simpler route and work with the sum of squared residuals (SSR). We can do this because |e_i| ≤ √(Σ_i e_i^2), which implies that if the SSR converges to zero, all residuals must converge to zero as well.

Let SSR^(k) denote the SSR from the kth iteration. We will prove that lim_{k→∞} SSR^(k) = 0 by showing that the infinite sum SSR^(1) + SSR^(2) + ... + SSR^(k) + ... converges to a finite number.

To see this, note that we have by the normal equations that Σ_i û_i e_i = 0. Thus, the SSR can be computed as Σ_{i=1}^n e_i^2 = Σ_{i=1}^n u_i^2 − Σ_{i=1}^n û_i^2 for all iterations, including k+1:

    SSR^(k+1) = ( Σ_{i=1}^n u_i^2 )^(k+1) − ( Σ_{i=1}^n û_i^2 )^(k+1).

By construction, u_i^(k+1) = min(û_i^(k), 0), and thus:

    ( Σ_{i=1}^n u_i^2 )^(k+1) = ( Σ_{û_i^(k)<0} û_i^2 )^(k).

We can also split Σ û_i^2,(k+1) based on the values of û^(k+1):

    ( Σ_{i=1}^n û_i^2 )^(k+1) = ( Σ_{û_i^(k+1)<0} û_i^2 )^(k+1) + ( Σ_{û_i^(k+1)≥0} û_i^2 )^(k+1).

Putting the last three equations together,

    SSR^(k+1) = ( Σ_{û_i^(k)<0} û_i^2 )^(k) − ( Σ_{û_i^(k+1)<0} û_i^2 )^(k+1) − ( Σ_{û_i^(k+1)≥0} û_i^2 )^(k+1).

If we move the last equation forward to k+2 and then add SSR^(k+1), we notice this is a telescoping series where one term cancels:

    SSR^(k+1) + SSR^(k+2) = ( Σ_{û_i^(k)<0} û_i^2 )^(k) − ( Σ_{û_i^(k+1)≥0} û_i^2 )^(k+1) − ( Σ_{û_i^(k+2)<0} û_i^2 )^(k+2) − ( Σ_{û_i^(k+2)≥0} û_i^2 )^(k+2).

More generally, the infinite sum of the sequence starting at k = 2 is

    Σ_{k=2}^∞ SSR^(k) = ( Σ_{û_i^(1)<0} û_i^2 )^(1) − Σ_{k=2}^∞ ( Σ_{û_i^(k)≥0} û_i^2 )^(k) − lim_{k→∞} ( Σ_{û_i^(k)<0} û_i^2 )^(k) ≤ ( Σ_{û_i^(1)<0} û_i^2 )^(1).

After adding SSR^(1) to both sides, and applying Σ_{i=1}^n u_i^2 = Σ_{i=1}^n û_i^2 + Σ_{i=1}^n e_i^2, we have that

    Σ_{k=1}^∞ SSR^(k) ≤ ( Σ_{û_i^(1)<0} û_i^2 )^(1) + SSR^(1) ≤ ( Σ_{i=1}^n û_i^2 )^(1) + SSR^(1) = ( Σ_{i=1}^n u_i^2 )^(1).

Therefore,

    Σ_{k=1}^∞ SSR^(k) ≤ ( Σ_{i=1}^n u_i^2 )^(1) = Σ_{y_i=0} 1,

where the last equality follows from how we initialize u_i, with u_i^(1) = −1 if y_i = 0 and u_i^(1) = 0 if y_i > 0; the bound is simply the number of observations with y_i = 0. Finally, suppose that the SSR does not converge to zero, i.e., that lim_{k→∞} SSR^(k) = c > 0. Given that Σ_{k=1}^∞ SSR^(k) ≤ C for some finite C, then, by iteration k* := ⌈C/c⌉, the sum of the sequence will have exceeded C, a contradiction. Therefore, the SSR converges to zero, with the same necessarily being true for all of the individual residuals. □

Proof of convergence to the correct solution. The above proof tells us that our iterative "rectifier" algorithm will eventually converge, but of course it doesn't tell us that it will converge to the correct solution. What we will prove now is exactly that:

    lim_{k→∞} û_i^(k) < 0 ⟺ observation i is separated.

As in the main text, z is the name we will give to the "certificates of separation" used to detect separated observations. By Proposition 1, any such z must be a linear combination of regressors: z = Xγ, with z_i = 0 if y_i > 0 and z_i ≤ 0 if y_i = 0; an observation is separated if z_i < 0 for at least one admissible z vector. We do not just want to find some of the z's that induce separation; rather, we want to identify a z that is as large as possible, in the sense of having the most non-zero rows.

Our proof that our algorithm accomplishes this task can be outlined in two steps. We first need to show that if lim_{k→∞} û_i^(k) < 0, then observation i is separated. This is very simple to show, since the above proof of convergence implies that lim_{k→∞} û^(k) = lim_{k→∞} u^(k) and since, by construction, u_i^(k) = 0 if y_i > 0 and u_i^(k) ≤ 0 if y_i = 0. Recalling that γ̂^(k) is the vector of coefficients computed from the weighted least squares regression in each iteration k, it is now obvious that lim_{k→∞} û^(k) = lim_{k→∞} Xγ̂^(k) is a linear combination of regressors that meets the criteria for separation described in Proposition 1 whenever there are any i such that lim_{k→∞} û_i^(k) < 0. The second step, showing that lim_{k→∞} û_i^(k) is necessarily < 0 for all separated observations, is more complicated. To prove this part, we will rely on the following lemma:
Lemma 3. For every possible z satisfying the criteria for separation described in Proposition 1, we must have that lim_{k→∞} u_i^(k) < 0 on at least one row where z_i < 0.

The clarification that Lemma 3 applies to "every possible" z is important. As we will soon see, the fact that the algorithm discovers at least one separated observation associated with every possible linear combination of regressors that induces separation will be sufficient to prove that lim_{k→∞} û_i^(k) < 0 for all separated observations.

Proof of Lemma 3. To prove Lemma 3, it will first be useful to document the following preliminaries. First, recall that each iteration k involves a regression of our working dependent variable u^(k) on our original regressors X that produces a set of residuals e^(k). Thus, the normal equations for each of these regressions imply that X′e^(k) = 0 ⟹ z′e^(k) = 0 ∀k and ∀z (since z′e^(k) is just a linear combination of X′e^(k), obtained by pre-multiplying X′e^(k) by γ′). Second, we can always decompose each vector of predicted values for our working dependent variable into its positive and negative components using the appropriate rectifier functions: û = û^(+) + û^(−), where û_i^(+) = max(û_i, 0) and û_i^(−) = min(û_i, 0). Third, using this notation, we also have that u_i^(k+1) = û_i^(k),(−) (i.e., the working dependent variable inherits the rectified predicted values from the prior iteration).

We start with the observation noted above that the normal equations imply z′e^(k) = Σ_i z_i e_i^(k) = 0 ∀k. (Note also that there must be negative elements of z, as otherwise z wouldn't be a valid certificate of separation.) Now let's focus on iterations k and k+1:

    Σ_i z_i e_i^(k) + Σ_i z_i e_i^(k+1) = 0.

After grouping terms and using the definition of e_i, we have

    Σ_i z_i [ u_i^(k) − û_i^(k) + u_i^(k+1) − û_i^(k+1) ] = 0.

Using our decomposition, û_i^(k) = û_i^(k),(+) + û_i^(k),(−) (and likewise for k+1):

    Σ_i z_i [ u_i^(k) − û_i^(k),(+) − û_i^(k),(−) + u_i^(k+1) − û_i^(k+1),(+) − û_i^(k+1),(−) ] = 0.

Also recall that u_i^(k+1) = û_i^(k),(−) (and likewise for k+2):

    Σ_i z_i [ u_i^(k) − û_i^(k),(+) − u_i^(k+1) + u_i^(k+1) − û_i^(k+1),(+) − u_i^(k+2) ] = 0.

After canceling out terms and re-arranging, we have

    Σ_i z_i [ u_i^(k+2) − u_i^(k) ] = − Σ_i z_i [ û_i^(k),(+) + û_i^(k+1),(+) ].

Notice that we can focus on the observations where z_i < 0, since the terms where z_i = 0 drop out. Thus,

    Σ_{z_i<0} z_i [ u_i^(k+2) − u_i^(k) ] = − Σ_{z_i<0} z_i [ û_i^(k),(+) + û_i^(k+1),(+) ].

If the algorithm has not converged by iteration k+2, then the righthand term is strictly positive. This is because z_i < 0, and û_i^(k),(+) is non-negative, with at least one strictly positive observation (otherwise, û^(k) would meet the stopping criteria because it would be non-positive everywhere, and the next iteration would return û_i^(k+1) = u_i^(k+1) = û_i^(k)). Thus,

    Σ_{z_i<0} z_i u_i^(k+2) > Σ_{z_i<0} z_i u_i^(k).

By itself, this statement is interesting, because it tells us that the weighted sum of u is increasing as we iterate. But we can get a useful bound if we recall that on the first iteration, u_i = −1 if y_i = 0, which implies Σ z_i u_i^(1) = −Σ z_i. Then, for every odd iteration with k ≥ 3, we know that

    Σ z_i u_i^(k) > −Σ z_i.

Denote the minimum (i.e., most negative) z_i as z_min. Then,

    z_min Σ_{z_i<0} u_i^(k) ≥ Σ z_i u_i^(k) > −Σ z_i    for k = 3, 5, 7, ...
    ⟹ Σ_{z_i<0} u_i^(k) < −(Σ z_i)/z_min    for k = 3, 5, 7, ...

Given that z_min and Σ z_i must both be < 0, the righthand bound is negative, meaning that Σ_{z_i<0} u_i^(k) must be negative on every odd iteration starting with k = 3. Obviously, this is not possible unless at least one u_i^(k) is negative for an observation where z_i < 0. The same must also hold in the limit as k → ∞, since u_i^(k) must eventually converge to the same result for both odd and even k. □

For the remainder of the proof of the overall proposition, let û_i^(∞) := lim_{k→∞} û_i^(k) denote the solution obtained by our algorithm. Thanks to the insights established by Lemma 3, we can now prove that û_i^(∞) < 0 for any observation i that is separated by some z. We can do so by considering two cases. First, note that if û_i^(∞) = 0 for all i, then Lemma 3 implies there cannot be any such z and the data are not separated. The more interesting case is if û_i^(∞) < 0 for at least some i. In that case, recall that û^(∞) is itself an admissible z (because it will have converged to a vector that is ≤ 0 over y_i = 0 and 0 otherwise). Now suppose there is some other certificate z* that is < 0 for at least one observation where û_i^(∞) = 0. To see why this cannot happen, let

    α* := sup_{û_i^(∞)<0} z*_i/û_i^(∞) > 0.

Then, given z* and our solution û^(∞), we can construct a third certificate z** that also separates the data:

    z**_i := z*_i − α* û_i^(∞) ≤ z*_i − z*_i = 0,

where the inequality follows from the definition of α*. By construction, z**_i < 0 for at least one observation where û_i^(∞) = 0, and z**_i = 0 for at least one observation where û_i^(∞) < 0. If z**_i = 0 for all observations where û_i^(∞) < 0, we have a contradiction, since Lemma 3 tells us at least one observation separated by z** must also be separated by our solution û^(∞). If not, we repeat: let α** := sup_{û_i^(∞)<0} z**_i/û_i^(∞) and z***_i := z**_i − α** û_i^(∞), which gives us yet another certificate of separation z*** that will equal 0 for at least one more observation where û_i^(∞) < 0 and z**_i < 0. We can repeat this process as many times as needed until we eventually obtain a z that does not separate any observations for which û_i^(∞) < 0. Lemma 3 again provides the needed contradiction indicating that this cannot happen. □
Example code. The number of steps needed in the above proof may suggest the iterative rectifier algorithm is rather complicated. However, in practice, it requires only a few lines of code to implement. Below, we provide some generic "pseudo code" that should be simple to program in virtually any statistical computing language (e.g., R, Stata, Matlab).

Pseudo code:

    Set u_i = −1 if y_i = 0; 0 otherwise
    Set ω_i = K if y_i > 0; 1 otherwise
    Begin loop:
        Regress u on X, weighting by ω (produces coefficients γ̂)
        Set û = Xγ̂
        Set û_i = 0 if |û_i| < ϵ
        Stop if û_i ≤ 0 ∀i (all separated observations have been identified)
        Replace u_i = min(û_i, 0)
    End loop.

For readers interested in more details, we have created a website that provides sample Stata code and data sets illustrating how all of the methods for detecting separation described in this paper can be implemented in practice. Also see our companion paper for the ppmlhdfe
Stata command (Correia et al., 2019), which provides further useful information related to technical implementation and testing.
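The same pseudo code translates directly into Python/NumPy. The following is a minimal sketch (the function name, its defaults, and the dense-matrix weighted regression are our own illustrative choices; a production implementation such as ppmlhdfe would instead use the HDFE-based weighted regressions discussed in the main text):

```python
import numpy as np

def iterative_rectifier(y, X, K=1e4, eps=1e-5, max_iter=10_000):
    """Flag separated observations via the iterative rectifier (IR) method.

    Follows the pseudo code above: u is the working dependent variable,
    K is the large regression weight given to observations with y > 0,
    and eps is the tolerance used to round small fitted values to zero.
    Returns a boolean array that is True for separated observations.
    """
    y = np.asarray(y, dtype=float)
    X = np.asarray(X, dtype=float)
    u = np.where(y == 0, -1.0, 0.0)        # Set u_i = -1 if y_i = 0; 0 otherwise
    sw = np.sqrt(np.where(y > 0, K, 1.0))  # sqrt of weights, for WLS via lstsq
    for _ in range(max_iter):
        gamma, *_ = np.linalg.lstsq(X * sw[:, None], u * sw, rcond=None)
        u_hat = X @ gamma
        u_hat[np.abs(u_hat) < eps] = 0.0   # rectify near-zero fitted values
        if np.all(u_hat <= 0):             # converged: u_hat is the certificate
            return u_hat < 0               # separated obs. have u_hat < 0
        u = np.minimum(u_hat, 0.0)         # linear rectifier update
    raise RuntimeError("IR did not converge within max_iter iterations")
```

When X contains high-dimensional fixed effects, the single `lstsq` call would be replaced by a within-transformed weighted regression in the manner of Correia (2017), leaving the rest of the loop unchanged.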
An alternative method using within-transformation and linear programming
Larch et al. (2019) have also recently proposed a method for detecting separation in Poisson models in the presence of high-dimensional fixed effects. In their paper, this is accomplished by first "within-transforming" all non-fixed effect regressors with respect to the fixed effects, then checking whether the within-transformed versions of these regressors satisfy conditions for separation. As they discuss (and as we will document here as well), any method based on this strategy is only able to detect instances of separation that involve at least one non-fixed effect regressor; it cannot be used to detect separation involving only the fixed effects. Another difference is that Larch et al. (2019) describe how to detect linear combinations of regressors that satisfy (3) only. Detecting linear combinations of regressors that satisfy both of the relevant conditions described in Proposition 1 (i.e., both (3) and (5)) requires an appropriate extension of their methods that incorporates the linear programming problem in (15).

The first step is to regress each non-fixed effect regressor w_pi on every other regressor (including the fixed effects) over 0 < y_i < ȳ. If we find that w_pi is perfectly predicted over 0 < y_i < ȳ, then we know there is a linear combination of regressors involving w_pi that satisfies (3), as shown by Larch et al. (2019). Larch et al. (2019) do not discuss how this step is applicable to the linear programming problem in (15), but focusing on these "candidate" linear combinations that we already know to satisfy (3) turns out to be an effective way of reducing the dimensionality of the problem (for non-binary choice models at least). More formally, we can determine candidate solutions for γ* by first computing the following linear regression for each w_pi:

    w_pi = w_{−pi}δ_p + d_iξ_p + r_pi    for 0 < y_i < ȳ,    (17)

where w_{−pi} is the set of other non-fixed effect regressors (i.e., excluding w_pi), and δ_p and ξ_p are the coefficient vectors to be estimated. Our focus is on the residual error r_pi obtained from each of these regressions. If r_pi is uniformly zero, then some combination of the fixed effects and the other non-fixed effect regressors perfectly predicts w_pi over 0 < y_i < ȳ. Or, to cement the connection with Proposition 1, we would have that r_pi = w_pi − w_{−pi}δ_p − d_iξ_p is a linear combination of regressors that satisfies condition (8).

Because the estimation expressed in (17) is a linear regression, it can generally be computed very quickly using the algorithm of Correia (2017), even for models with very large M. The main advantage of this first step is that it greatly reduces the dimension of the linear programming problem we need to solve. This is for two reasons. First, it allows us to effectively perform a change of variables from x_i (which is of dimension M) to the set of r_pi associated with any regressors that are perfectly predicted over 0 < y_i < ȳ (which will have a much smaller dimension ≤ P ≪ M). Second, since any linear combination of these r_pi's is assured to satisfy (3), we no longer need the third set of constraints stipulated in (15). Since we very often have that 0 < y_i < ȳ for a majority of the observations in non-binary choice models, changing variables in this way is likely to also greatly reduce the number of constraints. For this reason, this first step of regressing each regressor on every other regressor can be beneficial even in non-high-dimensional environments when the number of observations with 0 < y_i < ȳ is large. A similar first step also appears in Santos Silva and Tenreyro (2010) and Larch et al. (2019), but both of these papers stop short of verifying the "overlap" conditions described in (4)-(5); addressing the latter complication requires one of the methods described in this paper.

A suitable re-parameterization of our original linear programming problem in (15) helps to illustrate the idea behind this change of variables.
Let r*_i := {r_pi | r_pi = 0 ∀ 0 < y_i < ȳ}, i.e., a vector consisting of the predicted residuals from (17) associated with any w_pi that are perfectly predicted over 0 < y_i < ȳ. The modified linear programming problem based on r*_i instead of x_i is

    max_ϕ Σ_{y_i=0} 1(r*_iϕ < 0) + Σ_{y_i=ȳ} 1(r*_iϕ > 0)
    s.t. −r*_iϕ ≥ 0 if y_i = 0,    r*_iϕ ≥ 0 if y_i = ȳ,    (18)

where, as noted, the number of parameters we need to solve for (i.e., the length of the vector ϕ in this case) is only equal to the number of w_pi that we found to be perfectly predicted by other regressors in the first step. Furthermore, the number of constraints we need to take into account is only N_{y_i=0} + N_{y_i=ȳ} instead of N. To appreciate why this approach works, consider what happens when a non-zero vector ϕ* can be found solving (18). In that case, r*_iϕ* = Σ_{p|r_pi∈r*_i} ϕ*_p r_pi = Σ_{p|r_pi∈r*_i} ϕ*_p (w_pi − w_{−pi}δ_p − d_iξ_p) is a linear combination of regressors that satisfies (3)-(5), indicating separation.

However, while this approach is able to quickly identify separation involving complex combinations of fixed effect and non-fixed effect regressors, it cannot easily be used to identify separation involving the fixed effects only (at least not without estimating (17) M times in the first step, which is likely to be time consuming). For some standard fixed effect configurations, this latter problem is not so severe. For example, the trivial case where a fixed effect dummy is always equal to zero when 0 < y_i < ȳ is very easy to find. Models with only one level of fixed effects are thus easy to deal with in this regard. Similarly, for models with two levels of fixed effects (e.g., exporter and importer, firm and employee), the graph-theoretical approach of Abowd et al. (2002) can be applied to identify any combinations of fixed effects that are perfectly collinear over 0 < y_i < ȳ, which then can be added as needed to the linear programming step in (18).

However, for more general cases, such as non-binary choice models with more than two levels of high-dimensional fixed effects, it has been known since Haberman (1974, Appendix B) that separation involving only categorical dummies (i.e., fixed effects) can be difficult to verify (also see Albert and Anderson, 1984, p. 9). To our knowledge, this problem has remained unresolved in the literature, and Abowd et al. (2002)'s method cannot be used to solve the problem for general cases either. Thus, unless we have a non-binary choice model with either one or two levels of
high-dimensional fixed effects, separation involving only the fixed effects may go undetected by this approach. Noting that a Logit model can be transformed into a Poisson model by adding a fixed effect (as we discuss next), the same caveat applies to binary choice models with more than one level of fixed effects.

Three clarifying notes are in order. First, we can still detect cases in which a single fixed effect induces separation or in which there is separation involving only two levels of fixed effects; the case we cannot as easily detect is a linear combination involving three or more levels of fixed effects that does not involve any of the non-fixed effect regressors. Second, neither perfect collinearity between fixed effects nor separation involving only the fixed effects (by Proposition 3) poses an issue for identification of the non-fixed effect parameters; however, separation can affect an estimation algorithm's ability to reach convergence, the speed at which it converges, and even whether it converges to the correct estimate. Third, one possible alternative method is the one discussed in Clarkson and Jennrich (1991, p. 424), which allows estimation to proceed without precautions and iteratively drops any observations that appear to be converging to a boundary value. The algorithm we describe elsewhere in this Appendix could be used in conjunction with this approach. However, as Clarkson and Jennrich (1991) note, this method is not guaranteed to detect separation accurately. Furthermore, in our own implementations, we have noted that removing separated observations mid-estimation generally leads to slower convergence. Yet another problem arises if "cluster-robust" standard errors are used: in that case, the algorithm of Correia (2015) must also be repeatedly applied in order to determine the correct number of non-singleton clusters that remain as additional separated observations are removed. Otherwise, statistical significance will tend to be overstated.
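To fix ideas, here is a dense-matrix sketch of the first step in (17). The function name and its arguments are our own; the fixed effects enter as explicit dummy columns D purely for simplicity, whereas a real implementation would partial them out using the nearly-linear-time routine of Correia (2017):

```python
import numpy as np

def candidate_certificates(y, W, D, ybar=np.inf, tol=1e-8):
    """First-step check from eq. (17), sketched with dense matrices.

    For each non-fixed-effect regressor w_p, regress it on the remaining
    regressors (W_{-p} and the fixed-effect dummies D) over the subsample
    0 < y_i < ybar. If its residuals vanish on that subsample, the
    full-sample residual vector r_p is a candidate combination satisfying (3).
    """
    interior = (y > 0) & (y < ybar)
    candidates = []
    for p in range(W.shape[1]):
        others = np.column_stack([np.delete(W, p, axis=1), D])
        coef, *_ = np.linalg.lstsq(others[interior], W[interior, p], rcond=None)
        r = W[:, p] - others @ coef              # full-sample residual r_pi
        if np.max(np.abs(r[interior])) < tol:    # perfectly predicted over interior
            candidates.append(r)
    return np.column_stack(candidates) if candidates else None
```

The columns returned by this sketch would then serve as the r*_i inputs to the reduced linear programming problem in (18).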
Verifying separation in binary choice models using the Logit-Poisson transformation
While the discussion in Section 3.3 focuses on the case of a model with only a lower bound at zero, our methods can be applied to binary choice models without loss of generality. The only additional step required is to first transform the model by taking advantage of the following property:
Definition 1 (The Logit-equivalent Poisson model) Any Logit model with p(y_i = 1) = exp(x_iβ)/[1 + exp(x_iβ)] can be re-written as a Logit-equivalent Poisson model via the following steps:

1. Let each observation now be given by y_{i,a} and be indexed by i and a = 0, 1. A "y_{i,0}" will henceforth indicate an "original" observation from the original Logit model and a "y_{i,1}" will indicate an "artificial" observation. The construction of artificial observations is described in the next step.

2. For every original observation with y_{i,0} = 1, create an artificial observation with y_{i,1} = 0. For every original observation with y_{i,0} = 0, similarly create an artificial observation with y_{i,1} = 1. For all artificial observations, set all corresponding elements of x_{i,1} equal to 0. The number of observations should now be 2N, where N is the original sample size.

3. Add a set of i-specific fixed effects to the model, to be given by δ_i. These may be thought of as the coefficients of a set of dummy variables d_i, which equal 1 only for the two observations indexed by a particular i (one original observation and one artificial observation).

The resulting Logit-equivalent Poisson model is given by E[y_{i,a}] = exp[δ_i + x_{i,a}β] and is estimated using Poisson regression. After obtaining a Poisson model in this way, we have the following equivalence:
Proposition 5 (Logit-Poisson Equivalence) The Logit-equivalent Poisson model is equivalent to the original Logit model. In particular:

• The FOCs for β are the same.
• The parameter estimates for β and their associated asymptotic variances are the same.
• The conditional mean from the Poisson model equals the expected probability that y_i = 1 from the Logit model.

The properties described in Proposition 5 can be established using the Poisson FOCs for δ_i and β:

    Σ_{i=1}^N x_{i,0} ( y_{i,0} − e^{δ_i + x_{i,0}β} ) = 0,    ∀i: 1 − e^{δ_i}(1 + e^{x_{i,0}β}) = 0,    (19)

where we have used the fact that y_{i,0} + y_{i,1} = 1 and x_{i,1} = 0. It should be apparent that e^{δ_i} = 1/(1 + e^{x_{i,0}β}). After plugging the solution for δ_i into the FOC for β, we obtain

    Σ_{i=1}^N x_{i,0} ( y_{i,0} − e^{x_{i,0}β}/(1 + e^{x_{i,0}β}) ) = 0,

which is the same as the FOC for β from the original Logit model. The estimates for β therefore are the same across both models, as are the associated asymptotic variances. Furthermore, the Poisson conditional mean e^{δ_i + x_{i,0}β} = exp(x_{i,0}β)/[1 + exp(x_{i,0}β)] is the same as p(y_i = 1) from the Logit model. □
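To make Definition 1 and Proposition 5 concrete, the following sketch constructs the Logit-equivalent Poisson model on simulated data of our own making and verifies that the two sets of β estimates coincide (we use statsmodels purely for convenience; it is not the tooling used elsewhere in this paper):

```python
import numpy as np
import statsmodels.api as sm

# Hypothetical binary data: a constant plus one regressor, logistic errors.
rng = np.random.default_rng(0)
N = 200
X = sm.add_constant(rng.normal(size=(N, 1)))
y = (X @ [0.5, 1.0] + rng.logistic(size=N) > 0).astype(float)

# Logit-equivalent Poisson model per Definition 1: stack each original
# observation with an artificial one whose regressors are zero and whose
# outcome is 1 - y_i, then add one fixed-effect dummy per pair i.
y2 = np.concatenate([y, 1 - y])           # y_{i,0} and y_{i,1}
X2 = np.vstack([X, np.zeros_like(X)])     # x_{i,1} = 0
D = np.vstack([np.eye(N), np.eye(N)])     # pair dummies d_i (coefficients δ_i)

logit = sm.Logit(y, X).fit(disp=0)
poisson = sm.GLM(y2, np.hstack([X2, D]), family=sm.families.Poisson()).fit()

print(logit.params)        # β from the Logit model
print(poisson.params[:2])  # matching β from the Logit-equivalent Poisson model
```

Up to numerical convergence tolerances, the two printed vectors agree, as Proposition 5 requires.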
The most important implication of these results for our current purposes is the following:

Proposition 6 (Equivalence under separation) Suppose that l(β) conforms to (1), the matrix of regressors X = (x^1, x^2, ..., x^M) is of full column rank, and the individual log-likelihood l_i(β) always has a finite upper bound. Any binary choice model that satisfies these conditions is separated if and only if the Logit-equivalent Poisson model is separated.

Proof. Suppose first that the original binary choice model is separated by some γ* ∈ ℝ^M that satisfies (4) and (5). Then for any separated observation with y_i = 1, the FOC for δ_i in the Logit-equivalent Poisson model must satisfy e^{δ_i} = lim_{x_{i,0}β→∞} 1/(1 + e^{x_{i,0}β}) = 0, implying that the artificial observation i (y_{i,1}) has a conditional mean of μ_{i,1} = 0. Similarly, for any separated observation with y_i = 0, the conditional mean for y_{i,0}, μ_{i,0}, must be 0. This can only be true if y_{i,0} is separated.

If we instead consider separation in the Poisson model, we can simply focus on cases where either the original observation has a conditional mean of 0 or the artificially created observation has a conditional mean of 0. In the former case, it is obvious there is separation in either model. In the latter case, we must have that δ_i = −∞, which can only be true if l(β) is increasing as x_iβ → ∞, implying y_{i,0} is separated in the original Logit model.

Finally, the conditions for a binary choice model to be separated depend only on the configuration of the data and do not depend on the specific choice of model (e.g., Probit vs. Logit). Therefore, the Poisson model described above can be used to check for separation in any conceivable GLM binary choice model for which the individual likelihood function is bounded from above, not just the Logit model. □

An IRLS Algorithm for GLM estimation with high-dimensional fixed effects