Bias and Consistency in Three-way Gravity Models
Martin Weidner (University College London) · Thomas Zylkin (University of Richmond)

January 28, 2020
Abstract: We study the incidental parameter problem in “three-way” Poisson Pseudo-Maximum Likelihood (“PPML”) gravity models recently recommended for identifying the effects of trade policies and in other network panel data settings. Despite the number and variety of fixed effects this model entails, we confirm it is consistent for small T and we show it is in fact the only estimator among a wide range of PML gravity estimators that is generally consistent in this context when T is small. At the same time, asymptotic confidence intervals in fixed-T panels are not correctly centered at the true point estimates, and cluster-robust variance estimates used to construct standard errors are generally biased as well. We characterize each of these biases analytically and show both numerically and empirically that they are salient even for real-data settings with a large number of countries. We also offer practical remedies that can be used to obtain more reliable inferences of the effects of trade policies and other time-varying gravity variables.

JEL Classification Codes:
C13; C50; F10
Keywords:
Structural Gravity; Trade Agreements; Asymptotic Bias Correction

∗ Thomas Zylkin is grateful for support from NUS Strategic Research Grant WBS: R-109-000-183-646 awarded to the Global Production Networks Centre (GPN@NUS) for the project titled “Global Production Networks, Global Value Chains, and East Asian Development”. Martin Weidner acknowledges support from the Economic and Social Research Council through the ESRC Centre for Microdata Methods and Practice grant RES-589-28-0001 and from the European Research Council grants ERC-2014-CoG-646917-ROMIA and ERC-2018-CoG-819086-PANEDA. We also thank Valentina Corradi, Riccardo D’Adamo, Ivan Fernández-Val, Koen Jochmans, Michael Pfaffermayr, Amrei Stammann, and Yoto Yotov. An accompanying Stata package called ppml_fe_bias is now available online.

† Contact information: Weidner: Department of Economics, University College London, London WC1H 0AX. Email: [email protected]. Zylkin: Robins School of Business, University of Richmond, Richmond, VA, USA 23217. E-mail: [email protected].

1 Introduction
Despite intense and longstanding empirical interest, the effects of bilateral trade agreements on trade are still considered highly difficult to assess. As emphasized in a recent practitioner’s guide put out by the WTO (Yotov, Piermartini, Monteiro, and Larch, 2016), many current estimates in the literature suffer from easily identifiable sources of bias (or “estimation challenges”). This is not for a lack of awareness. Papers showing leading causes of bias in the gravity equation are often among the most widely celebrated and cited in the trade field, if not in all of Economics.¹ In particular, it is now generally accepted that trade flows across different partners are interdependent via the network structure of trade (the main contribution of Anderson and van Wincoop, 2003), that log-transforming the dependent variable is not innocuous (as argued by Santos Silva and Tenreyro, 2006), and—most relevant to the context of trade agreements—that earlier, puzzlingly small estimates of the effects of free trade agreements were almost certainly biased downwards by treating them as exogenous (Baier and Bergstrand, 2007).

As a consequence—and aided by some recent computational developments²—researchers seeking to identify the effects of trade agreements have naturally moved towards more advanced estimation strategies that take on board all of the above concerns. In particular, a “three-way” fixed effects Poisson Pseudo-Maximum Likelihood (“FE-PPML”) model with time-varying exporter and importer fixed effects to account for network dependence and time-invariant exporter-importer (“pair”) fixed effects to address endogeneity³ has recently emerged as a logical workhorse model for empirical trade policy analysis. It also raises new econometric questions, particularly in the empirically relevant “fixed-T” case where the number of time periods is small.

¹ For some context, if we start citation counts in 2003, Anderson and van Wincoop (2003) and Santos Silva and Tenreyro (2006) are, respectively, the most cited articles in the American Economic Review and the Review of Economics and Statistics. Paling only slightly in this exclusive company, Baier and Bergstrand (2007) is the 4th most-cited article in the Journal of International Economics, having gathered “only” 2,200 citations. Readers familiar with these other papers will also likely be familiar with Helpman, Melitz, and Rubinstein (2008)’s work on the selection process underlying zero trade flows, an issue we do not take up here.

² Larch, Wanner, Yotov, and Zylkin (2019), Correia, Guimarães, and Zylkin (2019), and Stammann (2018) describe algorithms that enable estimation of the three-way PPML models considered here.

³ Pair fixed effects are of course no substitute for good instruments. However, instruments for trade policy changes which are also exogenous to trade are understandably hard to come by. As discussed in Head and Mayer (2014)’s essential handbook chapter on gravity estimation, pair fixed effects have the advantage that the effects of trade agreements and other trade policies are identified from time-variation in trade within pairs. Causal interpretations follow if standard “parallel trend” assumptions are satisfied.

Even though FE-PPML models can be shown to be asymptotically unbiased with a single fixed effect (a well-known result) as well as in a two-way setting where both dimensions of the panel become large (Fernández-Val and Weidner, 2016), the latter result does not come strictly as a generalization of the former one, leaving it potentially unclear whether a three-way model with a fixed time dimension should be expected to inherit the nice asymptotic properties of these other models.

Accordingly, the question we investigate in this paper might simply be phrased as: “Do three-way FE-PPML gravity models suffer from an incidental parameter problem (IPP)?” As it turns out, there are two answers to this question: “no...
but also yes.” From a traditional (i.e., small-T inconsistency) perspective, there is no IPP: because the first-order conditions of FE-PPML allow us to “profile out” the pair fixed effect terms from the first-order conditions of the other parameters, we can re-express the model as a two-way profile likelihood that we can then deconstruct using the basic approach established by Fernández-Val and Weidner (2016) for two-way asymptotic analysis. The three-way model is therefore consistent in fixed-T settings for largely the same reasons the two-way models considered in Fernández-Val and Weidner (2016) are consistent, and we provide suitably modified versions of the regularity conditions and consistency results established by Fernández-Val and Weidner (2016) for the simpler two-way case. Importantly, this consistency property turns out to be very specific to the FE-PPML estimator. As we are able to show, FE-PPML is in fact the only estimator among a wide range of related FE-PML gravity estimators that is generally consistent in this context when T is small.

At the same time, it does not also follow that Fernández-Val and Weidner (2016)’s earlier results for the asymptotic unbiasedness of the two-way FE-PPML model similarly carry over to the three-way case. This is where the “...but also yes” part of our answer comes in. There is, in fact, a unique type of IPP in the three-way FE-PPML model that, to our knowledge, can only arise in models where there are different levels of fixed effects that grow large at different rates. Specifically, if N is the number of countries, profiling out the large (on the order of N²) number of pair fixed effects eliminates any “1/T” bias term that would normally be associated with a short time series.
Using the heuristic suggested by Fernández-Val and Weidner (2018), we would then expect an asymptotic bias with an order given by the ratio between the order of the number of remaining parameters (NT) and that of the number of observations (N²T), or 1/N. However, due to the special properties of FE-PPML, the asymptotic bias in our setting behaves more like a 1/(NT) bias as N and T grow large at the same rate. The bias thus vanishes at a rate of 1/N as N → ∞, ensuring consistency even for fixed T, and the estimator is actually unbiased as both N and T → ∞, exactly like in the two-way FE-PPML model.

What makes this bias a concern in fixed-T settings, then, is that the asymptotic standard deviation is of order 1/(N√T); thus, the asymptotic bias in point estimates will always be of comparable magnitude to their standard errors when T is fixed. Put another way, without a bias correction, asymptotic confidence intervals will be incorrectly centered and will therefore produce misleading inferences, even as N → ∞. This is effectively a version of the so-called “large-T” IPP, so named because this type of result typically only arises when taking asymptotics on the time dimension (e.g., Arellano and Hahn, 2007), usually for the purposes of deriving bias corrections for an estimator known to be inconsistent in short panels (e.g., Hahn and Newey, 2004). Unlike in most other settings explored in this literature, and even though the size of the time dimension does play a role in conditioning the bias, the panel estimator we consider is consistent regardless of T. Nonetheless, the leading remedies recommended by the “large-T” literature can still be adapted to reduce the bias and correct inferences.

Aside from the bias in point estimates, another (not unrelated) issue that affects the three-way model is a general downward bias in the cluster-robust sandwich estimator typically used to compute standard errors.
This latter bias is similar to one that has been found in the simpler two-way gravity model by several recent studies (Egger and Staub, 2015; Jochmans, 2016; Pfaffermayr, 2019) and arises for the same reason: because the origin-time and destination-time fixed effects in the model each converge to their true values at a 1/√N rate (not 1/N), the cluster-robust sandwich estimator for the variance has a leading bias of order 1/N (not 1/N²), and standard errors in turn have a bias of order 1/√N. This latter type of bias is related to the general result that standard “heteroskedasticity-robust” variance estimators are downward-biased in small samples (see, e.g., MacKinnon and White, 1985; Imbens and Kolesar, 2016), including for PML models (Kauermann and Carroll, 2001). The fact that the bias in the sandwich estimator converges at a slower rate due to the incidental parameters merits special consideration on top of these already-known issues.

⁴ A similar IPP can arise for certain other three-way PML estimators aside from three-way FE-PPML. However, because these other estimators are generally inconsistent for fixed T, they will typically have an additional bias term of order 1/T that only disappears if the model is correctly specified.

⁵ The new literature on “large-T” asymptotic bias in nonlinear FE models has emerged as a recent response to the well-known “small-T” consistency problem first described in Neyman and Scott (1948). Other examples include Phillips and Moon (1999), Hahn and Kuersteiner (2002), Lancaster (2002), Woutersen (2002), Alvarez and Arellano (2003), Carro (2007), Arellano and Bonhomme (2009), Fernández-Val and Vella (2011), and Kato, F. Galvao Jr., and Montes-Rojas (2012).
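To see the small-sample mechanics behind this kind of downward bias in miniature, consider the textbook linear-regression analogue of the MacKinnon and White (1985) result: under homoskedasticity, OLS residuals satisfy E(ê_i²) = σ²(1 − h_ii), so the expected HC0 “meat” understates the true error variance at every observation with positive leverage. The sketch below verifies this deterministically; it is a stylized illustration with made-up numbers, not the paper’s three-way PPML setting.

```python
import numpy as np

# Deterministic illustration: the expected HC0 sandwich variance understates
# the true Var(beta_hat) in small samples under homoskedastic errors.
rng = np.random.default_rng(4)
n = 20
X = np.column_stack([np.ones(n), rng.normal(size=n)])  # toy design matrix
sigma2 = 1.5                                           # true error variance
XtXi = np.linalg.inv(X.T @ X)
H = X @ XtXi @ X.T                                     # hat matrix
M = np.eye(n) - H                                      # residual-maker
true_V = sigma2 * XtXi                                 # true Var(beta_hat)
# E[e_hat e_hat'] = sigma2 * M, so E[e_hat_i^2] = sigma2 * (1 - h_ii):
exp_meat = sigma2 * np.diag(np.diag(M))
exp_V = XtXi @ X.T @ exp_meat @ X @ XtXi               # expected HC0 sandwich
bias = np.diag(exp_V) - np.diag(true_V)
assert (bias < 0).all()    # expected HC0 variance is too small, element-wise
```

The three-way gravity setting adds the complication described above: the incidental parameters make the analogous bias shrink at a slower rate than the usual 1/n small-sample effect.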
We should therefore be concerned that estimated confidence intervals may be too narrow in addition to being off-center.

Our analysis provides theoretical characterizations of both of these issues as well as a series of possible bias corrections, which we evaluate using simulations and a real-data application. For the bias in point estimates, we construct two-way analytical and jackknife bias corrections inspired by the corrections proposed in Fernández-Val and Weidner (2016; 2018). For the bias in standard errors, we show how Kauermann and Carroll (2001)’s method for correcting the PML sandwich estimator may be adapted to the case of a conditional estimator with multi-way fixed effects and cluster-robust standard errors. Our simulations confirm that these methods are usually effective at improving inferences. The jackknife correction reduces more of the bias in point estimates than the analytical correction in smaller samples, but the analytical correction does a better job at improving coverage, especially when also paired with corrected standard errors.

For our empirical application, we estimate the average effects of a free trade agreement (FTA) on trade for a range of different industries using what would typically be considered a large trade data set, with 169 countries and 5 time periods. The biases we uncover vary in size across the different industries, but are generally large enough to indicate that our bias corrections should be worthwhile in most three-way gravity settings.
For aggregate trade data (which yields results that are fairly representative), the estimated coefficient for FTA has an implied downward bias of about 15%-18% of the estimated standard error, and the implied downward bias in the standard error itself is about 10% of the original standard error.

The literature on large-T IPPs with more than one fixed effect is small but growing. Aside from Fernández-Val and Weidner (2016)’s work on bias corrections for two-way nonlinear models, Pesaran (2006), Bai (2009), Hahn and Moon (2006), and Moon and Weidner (2017) have each conducted similar analyses for two-way linear models with interacted individual and time fixed effects. Turning to three-way models, Hinz, Stammann, and Wanner (2019) have recently developed bias corrections for dynamic three-way probit and logit models based on asymptotics suggested by Fernández-Val and Weidner (2018) where all three panel dimensions grow at the same rate. Though widely applicable, this approach is not appropriate for our setting because of the different role played by the time dimension when the estimator is FE-PPML. In the network context, Graham (2017), Dzemski (2018), and Chen, Fernández-Val, and Weidner (2019) have studied large-T IPPs in dyadic models where the different nodes in the network are characterized by node-specific (possibly sender- and receiver-specific) fixed effects. The analysis of Chen, Fernández-Val, and Weidner (2019) bears some especial similarity to our own in that they allow these node-specific effects to be vectors rather than scalars, similar to the exporter-time and importer-time fixed effects that feature in gravity models. Our bias expansions mainly differ from those of Chen, Fernández-Val, and Weidner (2019) because the equivalent outcome variable in our setting (trade flows observed over time for a given pair) is also a vector rather than a scalar and because we work with a conditional moment model where the distribution of the outcome may be misspecified.
These distinctions are important because they together imply that the asymptotic bias is necessarily a function of the joint distribution of the outcome vector, a complication that does not arise in these other settings.

In what follows, Section 2 first provides a general overview of the no-IPP properties of the FE-PPML model (including the limits thereof). Section 3 then establishes bias and consistency results for the three-way gravity model specifically and discusses how to implement bias corrections. Sections 4 and 5 respectively present simulation evidence and an empirical application. Section 6 concludes, and an Appendix adds proofs and further simulation results.
2 No-IPP Properties of FE-PPML Models and Their Limits

In this section, we consider scenarios under which PPML models with various combinations of fixed effects may or may not suffer from an IPP. Our focus for now will be on conveying the main intuition; formal results for the three-way gravity model follow in Section 3.⁶

⁶ Also related are the GMM-based differencing strategies for two-way FE models proposed by Charbonneau (2017) and Jochmans (2016). These strategies rely on differencing the data in such a way that the resulting GMM moments do not depend on any of the incidental parameters. In principle, these methods could be extended to allow for differencing across a time dimension as well in a three-way panel.
The classic “one-way” FE setting is a natural way of demonstrating why FE-PPML models sometimes do not suffer from incidental parameter bias when other nonlinear FE models normally would. Consider a static panel data model with individuals i = 1, ..., N, time periods t = 1, ..., T, outcomes y_it, and strictly exogenous regressors x_it satisfying

E(y_it | x_it, α_i) = λ_it := exp(x_it β + α_i).   (1)

The FE-PPML estimator maximizes Σ_{i,t} (y_it log λ_it − λ_it) over β and α. The corresponding FOCs may be written as

Σ_{i=1}^{N} Σ_{t=1}^{T} x_it (y_it − λ̂_it) = 0,    ∀i: Σ_{t=1}^{T} (y_it − λ̂_it) = 0,   (2)

where λ̂_it := exp(x_it β̂ + α̂_i). Solving for α̂_i and plugging the expression back into the FOC for β̂, we find

Σ_{i=1}^{N} Σ_{t=1}^{T} x_it [ y_it − ( exp(x_it β̂) / Σ_{τ=1}^{T} exp(x_iτ β̂) ) Σ_{τ=1}^{T} y_iτ ] = 0,   (3)

which, as long as (1) holds, are valid (sample) moments to estimate β. Thus, under standard regularity conditions, we have that √N (β̂ − β) →_d N(0, V) as N → ∞, where V is the asymptotic variance. The FE-PPML estimator therefore does not suffer from an IPP: even though α̂_i is an inconsistent estimate of α_i, the FE-PPML score for β has zero mean when evaluated at the true parameter β, and β̂ therefore converges in probability to β without any asymptotic bias. This is a well-known result that can also be obtained in the Poisson-MLE case by conditioning on Σ_t y_it; see Cameron and Trivedi (2015).⁷

⁷ The earliest references to present versions of this result include Andersen (1970), Palmgren (1981), and Hausman, Hall, and Griliches (1984). Another important contribution is Wooldridge (1999), who shows that FE-PPML is consistent even when the assumed distribution of the data is misspecified. Our Lemma 2 in the Appendix clarifies that FE-PPML is relatively unique in this regard versus similar models.
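The profiling argument behind (3) can be checked numerically: maximizing the Poisson objective jointly over (β, α) and maximizing the profiled objective, in which each exp(α̂_i) is replaced by its closed form Σ_t y_it / Σ_t exp(x_it β), return the same β̂. The sketch below does this on a small simulated panel; the data-generating process and all variable names are illustrative only.

```python
import numpy as np
from scipy.optimize import minimize

# Toy one-way panel: N individuals, T periods, one regressor.
rng = np.random.default_rng(0)
N, T = 30, 4
x = rng.normal(size=(N, T))
alpha = rng.uniform(-0.5, 0.5, size=N)
lam_true = np.exp(x * 1.0 + alpha[:, None])
y = rng.gamma(2.0, lam_true / 2.0)   # positive outcomes with mean lam_true

def joint_nll(params):
    """Negative Poisson pseudo-likelihood in (beta, alpha_1..alpha_N)."""
    b, a = params[0], params[1:]
    lam = np.exp(x * b + a[:, None])
    return -(y * np.log(lam) - lam).sum()

def profile_nll(b):
    """Same objective with each alpha_i profiled out in closed form:
    exp(alpha_i) = sum_t y_it / sum_t exp(x_it * b)."""
    mu = np.exp(x * b)
    lam = mu * (y.sum(axis=1) / mu.sum(axis=1))[:, None]
    return -(y * np.log(lam) - lam).sum()

b_joint = minimize(joint_nll, np.zeros(N + 1), method="BFGS").x[0]
b_prof = minimize(lambda p: profile_nll(p[0]), [0.0], method="BFGS").x[0]
assert abs(b_joint - b_prof) < 1e-3   # same beta-hat either way
```

Note that the equivalence holds for any nonnegative outcome with positive within-i sums, not just Poisson draws, which is the sense in which PPML is a pseudo-likelihood.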
Of course, with a doubly-indexed panel indexed by individuals and time, a standard approach here would be to also include a time fixed effect for each period t. For small T, the addition of time fixed effects has little effect on the above example: the small number of time dummies needed for the fixed effects can be thought of as components of x_it without loss of generality and are therefore consistently estimated for the same reasons the other components of x_it are consistently estimated. A more interesting case is where T is large, such that we have a more complex, “two-way” estimator where both dimensions of the panel—individual and time—grow with the sample. As shown by Fernández-Val and Weidner (2016)—and as we ourselves will show shortly—a two-way FE-PPML model of this type is again consistent and exhibits no asymptotic bias. Thus, this series of results may create the impression that Poisson models are immune to IPPs, regardless of how many fixed effects are included or which dimensions of the panel grow with the sample. The following discussion makes it clear this is not generally the case.

In the above “classic” setting, every observation is affected by exactly one fixed effect. In current applied work, it is common to specify models with what we will call “overlapping” fixed effects, where each observation may be affected by more than one fixed effect. Some standard examples include the gravity model from international trade (which we discuss next) as well as other settings where researchers may wish to control for multiple sources of heterogeneity (e.g., firm and employee, teacher and student). Thus, it is important to clarify that the presence of overlapping fixed effects can easily lead to an IPP, even when the underlying estimator is Poisson or PPML. We give the following simple example:
Example 1. Consider a model with three time periods (T = 3) and two fixed effects, α_i and γ_i, for each individual:

t = 1:  E(y_i1 | x_i1, α_i, γ_i) = λ_i1 := exp(x_i1 β + α_i),
t = 2:  E(y_i2 | x_i2, α_i, γ_i) = λ_i2 := exp(x_i2 β + α_i + γ_i),
t = 3:  E(y_i3 | x_i3, α_i, γ_i) = λ_i3 := exp(x_i3 β + γ_i).

The FE-PPML estimator maximizes Σ_{i=1}^{N} Σ_{t=1}^{3} (y_it log λ_it − λ_it) over β, α, and γ. T = 3 is fixed as N → ∞.

In this example, because the fixed effects are overlapping, we have that α̂ enters into the FOC for γ̂, and vice versa. Therefore, when for a given value β̂ we want to solve the FOCs for α̂ and γ̂, we have to solve a system of equations, and the solutions become much more complicated functions of the outcome variable than in the one-way model. While having this type of co-dependence between the FOCs for the various fixed effects need not necessarily lead to an IPP (as our gravity examples will show), it does create one in models where more than one fixed effect dimension grows at the same rate as the panel size, as is the case with α and γ in Example 1.

The easiest way to demonstrate that this type of model suffers from an IPP is by way of simulations. The top-left panel of Fig. 1 presents simulated FE-PPML estimates of β based on Example 1 using panel sizes of N = 100, N = 1,000, and N = 10,000. y_it is assumed to be log-normal with variance equal to λ_it (as in a Poisson distribution), but we have found similar results for other data-generating assumptions such as those described in Section 4. The true value for β is 1, and values for x, α, and γ are constructed using the same methods as Fernández-Val and Weidner (2016). The results show that FE-PPML clearly suffers from an IPP in this example. Even for the largest panel size where N = 10,000, the average β̂ is about 1.15, and the estimates do not show signs of converging to the true value of β = 1 as the panel size increases.

Gravity models, by contrast, also feature multiple levels of overlapping fixed effects, but generally either the number of fixed effects grows at a slower rate than the size of the panel—as in the two-way gravity model—or there is only one fixed effect dimension that grows at the same rate as the panel size—as in the three-way gravity model usually recommended for trade policy analysis. Determining whether an IPP is present (and what type) for FE-PPML applied to gravity models therefore requires a closer examination of these models, which we now turn to.

We introduce the concept of a “gravity model” as follows. Countries are indexed by i, j ∈ 𝒩 := {1, ..., N}, with i ≠ j, and y_ij is the volume of trade between i and j. In general, we allow there to be
T ≥ 1 time periods, but for now we set T = 1.⁸ The exporter and importer fixed effects are α_i and γ_j. The model reads

E(y_ij | x_ij, α_i, γ_j) = λ_ij := exp(x_ij β + α_i + γ_j).

The FE-PPML estimator maximizes Σ_{i=1}^{N} Σ_{j≠i} (y_ij log λ_ij − λ_ij) over β, α, and γ, where x_ij would normally contain a set of exogenous bilateral regressors (e.g., the log of geographic distance, the sharing of a common border, and so on).

From Fernández-Val and Weidner (2016), we know that √(N(N−1)) (β̂ − β) →_d N(0, V) as N → ∞. That is, in contrast to what we found above for the model in Example 1, we have no IPP here (neither an inconsistency nor an asymptotic bias problem). The reasons behind this result are twofold. First, although we consider an asymptotic setting where both fixed effect dimensions (the number of exporters and the number of importers) grow with N, the sample size grows with N²; all α_i’s and γ_j’s are therefore consistently estimated as N → ∞, and β is in turn consistently estimated as well. Second, for the FE-PPML model specifically, we can either solve for α̂_i or solve for γ̂_j to obtain a profile score for the remaining parameters (including β̂) that is asymptotically unbiased as N → ∞. The simulations presented in the top-right panel of Fig. 1 provide a visual illustration of this property, confirming that estimates are correctly centered regardless of N.

These results might perhaps create the impression that FE-PPML gravity models generally inherit all the same no-IPP properties as the classic one-way panel data model. As we will now discuss in detail, the three-way FE-PPML gravity model only inherits some, not all, of these nice properties.

⁸ This panel structure can be easily relaxed to allow the number of exporters and the number of importers to be different; the real key here is that we assume both dimensions of the panel grow at the same rate asymptotically.
As we will also see, this impression is misleading for other reasons as well: even for the two-way model, while the α_i and γ_j parameters do not affect the score for β̂, they nonetheless have implications for the estimated variance of β̂ that are not innocuous; we thus will also devote some attention to whether the three-way model suffers from a similar issue.⁹ ¹⁰

⁹ Note that Theorem 4.1 in Fernández-Val and Weidner (2016) is written for the correctly specified case, where y_ij is actually Poisson distributed. However, Remark 3 in the paper gives the extension to conditional moment models, where for the FE-PPML case only E(y_ij | x_ij, α_i, γ_j) = exp(x_ij β + α_i + γ_j) needs to hold. That remark also states that the asymptotic bias of the FE-PPML estimator β̂ is zero; that is, no bias correction is necessary for valid asymptotic inference here. Their paper considers standard panel models, as opposed to trade models, but the only technical difference is that y_ij is often not observed for the trade model when i = j. This missing diagonal has no effect on any of the results we discuss.

¹⁰ Egger, Larch, Staub, and Winkelmann (2011) have previously observed that the two-way FE-PPML estimator is consistent in this setting, as is any two-way FE-PML estimator where both dimensions of the panel increase with the square root of the sample size. However, as shown by Fernández-Val and Weidner (2016), the no-bias result for FE-PPML does not extend to other similar estimators in this context.

3 Results for the Three-way Gravity Model
To recap the sequence of results just described, we know that FE-PPML estimates with one fixed effect do not suffer from an IPP. We also know that FE-PPML may have an IPP in models with more than one fixed effect, but it is both consistent and asymptotically unbiased in two-way gravity settings where neither fixed effect dimension grows at the same rate as the size of the panel. As we will now show, each of these earlier results will be useful for understanding the more complex case of a three-way gravity model where we add a time dimension and a third set of fixed effects to the above two-way model. We also describe a series of bias corrections for the three-way model, including for the possible downward bias of the estimated standard errors.
3.1 Consistency

To formally introduce the three-way model, we now add an explicit time subscript t ∈ {1, ..., T} to y_ij, x_ij, α_i, and γ_j in the prior model and also add a bilateral (or “country-pair”)-specific fixed effect η_ij. The model now reads as

E(y_ijt | x_ijt, α_it, γ_jt, η_ij) = λ_ijt := exp(x_ijt β + α_it + γ_jt + η_ij),   (4)

where the three fixed effects now respectively index exporter-time, importer-time, and country-pair.¹¹ To append an error term, we further assume y_ijt = λ_ijt ω_ijt, with ω_ijt ≥ 0 a disturbance satisfying E(ω_ijt | x_ijt, α_it, γ_jt, η_ij) = 1. We consider asymptotics with T fixed, while N → ∞. The FE-PPML estimator maximizes

L(β, α, γ, η) := Σ_{i=1}^{N} Σ_{j≠i} Σ_{t=1}^{T} (y_ijt log λ_ijt − λ_ijt)

over β, α, γ, and η.

With the added country-pair fixed effect η, notice that not all of the fixed effect dimensions grow at the same rate as N increases. The numbers of exporter-time and importer-time fixed effects each increase with N (as before), but the dimension of η increases with N², since adding another country to the data adds another 2(N−1) country pairs. Fortunately, we can profile out η̂ (as we did with α̂_i in (3)), so that we may deal with the remaining two fixed effects in turn.

¹¹ For discussion of this model, see Yotov, Piermartini, Monteiro, and Larch (2016) or Larch, Wanner, Yotov, and Zylkin (2019).

For given values of β, α, γ, the maximizer over η satisfies

exp[η̂_ij(β, α, γ)] = ( Σ_{t=1}^{T} y_ijt ) / ( Σ_{t=1}^{T} μ_ijt ),   μ_ijt := exp(x_ijt β + α_it + γ_jt).   (5)

We therefore have

L(β, α, γ) := max_η L(β, α, γ, η) = Σ_{i=1}^{N} Σ_{j≠i} ℓ_ij(β, α_i, γ_j),   (6)

with

ℓ_ij(β, α_i, γ_j) := Σ_{t=1}^{T} [ y_ijt log( μ_ijt / Σ_{s=1}^{T} μ_ijs ) − ( μ_ijt / Σ_{s=1}^{T} μ_ijs ) Σ_{s=1}^{T} y_ijs ] + Σ_{t=1}^{T} y_ijt log( Σ_{s=1}^{T} y_ijs )
= Σ_{t=1}^{T} y_ijt log( μ_ijt / Σ_{s=1}^{T} μ_ijs ) + terms not depending on any parameters.   (7)

Thus, after profiling out the η_ij parameters, we are left with the likelihood of a multinomial model where the only incidental parameters are α_it and γ_jt. The FE-PPML estimators for β, α_it, and γ_jt are given by

(β̂, α̂, γ̂) = argmax_{β,α,γ} L(β, α, γ).   (8)

Using (4), one can easily verify that

E[ ∂ℓ_ij(β, α_i, γ_j)/∂β ] = 0,   E[ ∂ℓ_ij(β, α_i, γ_j)/∂α_it ] = 0,   E[ ∂ℓ_ij(β, α_i, γ_j)/∂γ_jt ] = 0.   (9)

Thus, after profiling out η_ij, there is no bias in the score of the profile log-likelihood ℓ_ij(β, α_i, γ_j). The reason for this is exactly the same as for the no-IPP result in the classic panel setting above. Furthermore, note that the only fixed effects that need to be estimated in ℓ_ij(β, α_i, γ_j) are α_it and γ_jt, which only grow with the square root of the sample size as N → ∞, implying that they are consistently estimated. Thus, we can state the following result:

Proposition 1.
So long as the set of non-fixed effect regressors x_ijt is exogenous to the residual disturbance ω_ijt after conditioning on the fixed effects α_it, γ_jt, and η_ij, FE-PPML estimates of β from the three-way gravity model are consistent as N → ∞.¹²

Intuitively, profiling out the pair fixed effects η_ij eliminates any 1/T-bias, such that the earlier consistency result from Fernández-Val and Weidner (2016) for two-way estimators can again be applied. In other words, the three-way FE-PPML model is consistent as N → ∞ largely for the same reason two-way FE-PPML and other two-way nonlinear gravity estimators are generally consistent. However, in the context of three-way estimators, we can also state a stronger result that applies more narrowly to FE-PPML in particular:

¹² This consistency result can be seen as a corollary of the asymptotic normality result in Proposition 3 below, for which formal regularity conditions are stated in Assumption A of the Appendix.

Proposition 2.
Consider the class of “three-way” FE-PML gravity estimators with conditional means given by λ_ijt := exp(x_ijt β + α_it + γ_jt + η_ij) and FOCs given by

β̂:   Σ_{i=1}^{N} Σ_{j≠i} Σ_{t=1}^{T} x_ijt (y_ijt − λ̂_ijt) g(λ̂_ijt) = 0,
α̂_it:  Σ_{j=1}^{N} (y_ijt − λ̂_ijt) g(λ̂_ijt) = 0,
γ̂_jt:  Σ_{i=1}^{N} (y_ijt − λ̂_ijt) g(λ̂_ijt) = 0,
η̂_ij:  Σ_{t=1}^{T} (y_ijt − λ̂_ijt) g(λ̂_ijt) = 0,

where i, j = 1, ..., N, t = 1, ..., T, and g(λ̂_ijt) is an arbitrary function of λ̂_ijt. If T is small, then for β̂ to be consistent under general assumptions about Var(y | x, α, γ, η), we must have that g(λ_ijt) is constant over the range of λ’s that are realized in the data-generating process. That is, the estimator must be equivalent to FE-PPML.

The details behind this latter result are somewhat subtle. Clearly, for arbitrary g(λ̂_ijt), it is generally not possible to write down a closed-form solution η̂_ij = ln[ Σ_{t=1}^{T} y_ijt g(λ̂_ijt) ] − ln[ Σ_{t=1}^{T} μ_ijt g(λ̂_ijt) ] that would allow us to derive a two-way profile likelihood that does not depend on η̂_ij. However, as we discuss in the Appendix, it is still possible to obtain a two-way profile likelihood if g(λ̂_ijt) is of the form g(λ̂_ijt) = λ̂_ijt^q, where q can be any real number. Notably, this latter class of models not only includes FE-PPML (for which q = 0), but also includes other popular gravity estimators such as Gamma PML (q = −1) and Gaussian PML (q = 1). And yet, the existence of seemingly equivalent profile likelihood expressions for these other estimators does not guarantee that they are consistent. Actually, the three-way gravity estimators associated with g(λ̂_ijt) = λ̂_ijt^q can be shown to suffer from a 1/T-bias that only disappears if either q = 0 (in which case the estimator is FE-PPML) or if the conditional variance is proportional to λ_ijt^{−q} (in which case the estimator inherits the properties of the MLE estimator).

3.2 Asymptotic Bias

Because the three-way FE-PPML model inherits the consistency properties of the two-way estimator, one might expect that it also inherits its unbiasedness properties as well. However, this is where the limitations of PPML’s no-IPP properties become apparent. While the profile log-likelihood in (6) is now of a similar form to the two-way models considered in Fernández-Val and Weidner (2016), notice that it no longer resembles the original FE-PPML log-likelihood. The no-bias result for two-way FE-PPML from Fernández-Val and Weidner (2016) therefore does not carry over to the profile log-likelihood, and it is possible to show that FE-PPML has an asymptotic bias in this setting.
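The algebra behind the λ^q family’s profile likelihood is easy to verify numerically: for g(λ̂_ijt) = λ̂_ijt^q, the η̂_ij first-order condition has the closed-form solution exp(η̂_ij) = Σ_t y_ijt μ_ijt^q / Σ_t μ_ijt^{q+1}, because the exp(qη̂_ij) factors cancel on both sides. The sketch below checks this for q = −1, 0, 1 on made-up numbers for a single pair; note that, per Proposition 2, only q = 0 delivers a generally consistent estimator, so the check concerns only the profiling step, not consistency.

```python
import numpy as np

# Closed-form pair fixed effect for the g(lam) = lam**q family:
# exp(eta) = sum_t y_t * mu_t**q / sum_t mu_t**(q+1).
# q = 0 is PPML (eq. (5)); q = -1 Gamma PML; q = 1 Gaussian PML.
rng = np.random.default_rng(1)
T = 5
mu = np.exp(rng.normal(size=T))   # mu_t = exp(x_t*beta + alpha_t + gamma_t), held fixed
y = rng.gamma(2.0, mu / 2.0)      # any nonnegative outcomes with positive sum

focs = []
for q in (-1.0, 0.0, 1.0):
    eta = np.log((y * mu**q).sum()) - np.log((mu**(q + 1)).sum())
    lam = np.exp(eta) * mu
    focs.append(((y - lam) * lam**q).sum())   # eta first-order condition
assert max(abs(f) for f in focs) < 1e-8       # closed form zeroes the FOC
```

The same cancellation fails for a generic g, which is why no two-way profile likelihood exists outside the power family.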
Preliminaries
As with the models considered in Fernández-Val and Weidner (2016), the origins of this bias have to do with the rate at which the estimated incidental parameters $\hat\alpha_i$ and $\hat\gamma_j$ converge to their true values $\alpha_i$ and $\gamma_j$. As such, it will be useful to pause here to establish some additional notation, mostly to provide some shorthand for the higher-order partial derivatives of $\ell_{ij}$ with respect to $\hat\alpha_i$ and $\hat\gamma_j$. To this end, let $\ell_{ij}(\beta, \alpha_i, \gamma_j) =: \ell_{ij}(\beta, \pi_{ij})$, with

\[
\pi_{ij} = \begin{pmatrix} \pi_{ij1} \\ \vdots \\ \pi_{ijT} \end{pmatrix} := \begin{pmatrix} \alpha_{i1} + \gamma_{j1} \\ \vdots \\ \alpha_{iT} + \gamma_{jT} \end{pmatrix}.
\]

It will also be convenient to let $\vartheta_{ijt} := \lambda_{ijt} / \sum_\tau \lambda_{ij\tau}$. With $\ell_{ij}$ now expressed in similar form to the objective function considered in Fernández-Val and Weidner (2016), we can now define the following objects:

• $S_{ij} := \partial \ell_{ij}/\partial \pi_{ij}$ is a $T \times 1$ vector with elements $y_{ijt} - \vartheta_{ijt} \sum_\tau y_{ij\tau}$.

• $H_{ij} := -\partial^2 \ell_{ij}/\partial \pi_{ij}\, \partial \pi_{ij}'$ gives us a $T \times T$ matrix with diagonal elements $\vartheta_{ijt}(1-\vartheta_{ijt}) \sum_\tau y_{ij\tau}$ and off-diagonal ($s \neq t$) elements given by $-\vartheta_{ijs}\vartheta_{ijt} \sum_\tau y_{ij\tau}$.

• $G_{ij} := \partial^3 \ell_{ij}/\partial \pi_{ij}\, \partial \pi_{ij}\, \partial \pi_{ij}$ is a $T \times T \times T$ cubic tensor. The elements on the main diagonal of $G_{ij}$ ($r=s=t$) are given by $-\vartheta_{ijt}(1-\vartheta_{ijt})(1-2\vartheta_{ijt}) \sum_\tau y_{ij\tau}$. The elements of the planar diagonals with $r = s \neq t$ are given by $\vartheta_{ijs}(1-2\vartheta_{ijs})\,\vartheta_{ijt} \sum_\tau y_{ij\tau}$. All other elements, with $r \neq s \neq t$, are given by $-2\,\vartheta_{ijr}\vartheta_{ijs}\vartheta_{ijt} \sum_\tau y_{ij\tau}$.

These objects provide convenient shorthand for describing how the score for $\hat\beta$ depends on $\hat\alpha_i$ and $\hat\gamma_j$. For example, $S_{ij}$ not only doubles for both $\partial \ell_{ij}/\partial \alpha_i$ as well as for $\partial \ell_{ij}/\partial \gamma_j$, but also allows us to obtain $\partial \ell_{ij}/\partial \beta_k = x_{ij,k}' S_{ij}$. Likewise, we also have that $\partial^2 \ell_{ij}/\partial \alpha_i \partial \alpha_i' = \partial^2 \ell_{ij}/\partial \alpha_i \partial \gamma_j' = \partial^2 \ell_{ij}/\partial \gamma_j \partial \gamma_j' = -H_{ij}$, $\partial^2 \ell_{ij}/\partial \alpha_i \partial \beta_k = \partial^2 \ell_{ij}/\partial \gamma_j \partial \beta_k = -H_{ij}\, x_{ij,k}$, and that

\[
\frac{\partial^3 \ell_{ij}}{\partial \alpha_i\, \partial \alpha_i'\, \partial \beta_k} = \frac{\partial^3 \ell_{ij}}{\partial \alpha_i\, \partial \gamma_j'\, \partial \beta_k} = \frac{\partial^3 \ell_{ij}}{\partial \gamma_j\, \partial \alpha_i'\, \partial \beta_k} = \frac{\partial^3 \ell_{ij}}{\partial \gamma_j\, \partial \gamma_j'\, \partial \beta_k} = G_{ij}\, x_{ij,k},
\]

where it is important to note that the product $G_{ij}\, x_{ij,k}$ is a $T \times T$ matrix with individual elements $[G_{ij}\, x_{ij,k}]_{st} = \sum_r G_{ij,rst}\, x_{ijr,k}$.
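As a numerical sanity check on these formulas (our own illustration, not part of the paper), one can build $\ell_{ij}(\pi) = \sum_t y_t \log \vartheta_t(\pi)$ for a single pair and confirm by finite differences that $S$ and $H$ match the stated elements, that each row of $H$ sums to zero, and that contracting $G$ with the vector of ones also yields zero (both consequences of $\ell_{ij}$ being invariant to adding a constant to $\pi$):

```python
import numpy as np

def loglik(pi, y):
    """Profile Poisson pseudo-likelihood for one pair: sum_t y_t * log(theta_t),
    with theta_t = exp(pi_t) / sum_tau exp(pi_tau)."""
    theta = np.exp(pi) / np.exp(pi).sum()
    return np.sum(y * np.log(theta))

def S_H_G(pi, y):
    """Closed-form score, (negative) Hessian, and third-derivative tensor."""
    T = len(pi)
    theta = np.exp(pi) / np.exp(pi).sum()
    ysum = y.sum()
    S = y - theta * ysum
    H = ysum * (np.diag(theta) - np.outer(theta, theta))  # H := -d2l/(dpi dpi')
    G = np.empty((T, T, T))
    for r in range(T):
        for s in range(T):
            for t in range(T):
                if r == s == t:
                    G[r, s, t] = -theta[t] * (1 - theta[t]) * (1 - 2 * theta[t]) * ysum
                elif r == s:
                    G[r, s, t] = theta[s] * (1 - 2 * theta[s]) * theta[t] * ysum
                elif s == t:
                    G[r, s, t] = theta[t] * (1 - 2 * theta[t]) * theta[r] * ysum
                elif r == t:
                    G[r, s, t] = theta[r] * (1 - 2 * theta[r]) * theta[s] * ysum
                else:
                    G[r, s, t] = -2 * theta[r] * theta[s] * theta[t] * ysum
    return S, H, G

pi = np.array([0.2, -0.5, 1.1, 0.0])   # made-up values for one pair
y = np.array([3.0, 1.0, 4.0, 2.0])
S, H, G = S_H_G(pi, y)
T, E = len(pi), np.eye(len(pi))

# finite-difference checks of the score and Hessian
h = 1e-5
S_fd = np.array([(loglik(pi + h * E[t], y) - loglik(pi - h * E[t], y)) / (2 * h)
                 for t in range(T)])
h2 = 1e-4
H_fd = np.empty((T, T))
for s in range(T):
    for t in range(T):
        H_fd[s, t] = -(loglik(pi + h2 * (E[s] + E[t]), y) - loglik(pi + h2 * (E[s] - E[t]), y)
                       - loglik(pi - h2 * (E[s] - E[t]), y) + loglik(pi - h2 * (E[s] + E[t]), y)) / (4 * h2 * h2)
print(np.max(np.abs(S - S_fd)))        # ~0: score matches
print(np.max(np.abs(H - H_fd)))        # ~0: sign convention H = -Hessian matches
print(np.max(np.abs(H.sum(axis=1))))   # ~0: rows of H sum to zero
print(np.max(np.abs(G.sum(axis=0))))   # ~0: G contracted with iota vanishes
```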
In addition, we will for the most part assume that score vectors are conditionally independent of one another—i.e., $\mathrm{Cov}(S_{ij}, S_{i'j'} \,|\, x_{ij}) = 0$ if $(i,j) \neq (i',j')$—though this assumption can be relaxed, as we explain later on.

The remaining preliminaries then require that we also define the expected Hessian $\bar H_{ij} := E(H_{ij} \,|\, x_{ij})$. Because we have not chosen a normalization for $\alpha_i$ and $\gamma_j$, $\bar H_{ij}$ is only positive semi-definite (not positive definite). Therefore, we will use a Moore-Penrose pseudoinverse, to be denoted with a $\dagger$, whenever the analysis requires we work with an inverse of $\bar H_{ij}$ or similar matrices. We likewise find it useful to define $\bar G_{ij} := E(G_{ij})$. Finally, with $\bar H_{ij}$ in hand, we can define the within-transformed regressor matrix $\tilde x_{ij} := x_{ij} - \alpha^x_i - \gamma^x_j$, where $\alpha^x_i$ and $\gamma^x_j$ are $T \times K$ matrices that minimize

\[
\sum_{i=1}^N \sum_{j \in N \setminus \{i\}} \mathrm{Tr}\left[ \big(x_{ij} - \alpha^x_i - \gamma^x_j\big)'\, \bar H_{ij}\, \big(x_{ij} - \alpha^x_i - \gamma^x_j\big) \right], \qquad (10)
\]

subject to appropriate normalizations on $\alpha^x_i$ and $\gamma^x_j$ (e.g., $\iota_T'\, \alpha^x_i = \iota_T'\, \gamma^x_j = 0$, where $\iota_T = (1,\dots,1)'$ is a $T$-vector of ones). Each within-transformed regressor vector $\tilde x_{ij,k}$ can be interpreted as containing the residuals left after partialing out $x_{ij,k}$ with respect to any $i$- and $j$-specific components and weighting by $\bar H_{ij}$.

Footnote: Specifically, we have that $\bar H_{ij}\, \iota_T = 0$, where $\iota_T = (1,\dots,1)'$ is a $T$-vector of ones. Thus, $\bar H_{ij}$ is only of rank $T-1$. The Moore-Penrose inverse allows us to avoid the problem of choosing what normalizations to use for $\alpha_i$ and $\gamma_j$ while still leading to the same end results.
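The weighted two-way within-transformation in (10) can be computed by alternating projections. The sketch below is our own illustration (with $K = 1$ and made-up data, not code from the paper): each block update solves the first-order condition $\sum_j \bar H_{ij}(x_{ij} - \alpha^x_i - \gamma^x_j) = 0$ (and its $j$-counterpart) using a pseudoinverse, and the resulting residuals satisfy the defining property $\sum_j \bar H_{ij}\, \tilde x_{ij} = 0$ for every $i$ and $\sum_i \bar H_{ij}\, \tilde x_{ij} = 0$ for every $j$:

```python
import numpy as np

rng = np.random.default_rng(1)
N, T = 4, 3

# expected Hessians Hbar_ij = (diag(theta) - theta theta') * sum_t lambda_t,
# built from made-up conditional means lambda_ijt
lam = np.exp(rng.normal(size=(N, N, T)))
theta = lam / lam.sum(axis=2, keepdims=True)
Hbar = np.zeros((N, N, T, T))
for i in range(N):
    for j in range(N):
        if i != j:
            Hbar[i, j] = (np.diag(theta[i, j]) - np.outer(theta[i, j], theta[i, j])) * lam[i, j].sum()

x = rng.normal(size=(N, N, T))   # a single regressor (K = 1)
alpha = np.zeros((N, T))
gamma = np.zeros((N, T))

# block-coordinate minimization of the trace criterion in (10)
for _ in range(2000):
    for i in range(N):
        A = sum(Hbar[i, j] for j in range(N) if j != i)
        b = sum(Hbar[i, j] @ (x[i, j] - gamma[j]) for j in range(N) if j != i)
        alpha[i] = np.linalg.pinv(A) @ b
    for j in range(N):
        A = sum(Hbar[i, j] for i in range(N) if i != j)
        b = sum(Hbar[i, j] @ (x[i, j] - alpha[i]) for i in range(N) if i != j)
        gamma[j] = np.linalg.pinv(A) @ b

xt = np.array([[x[i, j] - alpha[i] - gamma[j] for j in range(N)] for i in range(N)])

# defining property of the weighted residuals
res_i = max(np.abs(sum(Hbar[i, j] @ xt[i, j] for j in range(N) if j != i)).max() for i in range(N))
res_j = max(np.abs(sum(Hbar[i, j] @ xt[i, j] for i in range(N) if i != j)).max() for j in range(N))
print(res_i, res_j)   # both ~0
```

Because $\bar H_{ij}\,\iota_T = 0$, the component of $\tilde x_{ij}$ along $\iota_T$ is never pinned down—exactly the normalization issue the pseudoinverse sidesteps.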
Footnote: While we present the computation of $\tilde x_{ij}$ as a two-way within-transformation to preserve the analogy with Fernández-Val and Weidner (2016), each individual element $\tilde x_{ijt,k}$ can also be shown to be equivalent (subject to a normalization) to a three-way within-transformation of $x_{ijt,k}$ with respect to $it$, $jt$, and $ij$, weighting by $\lambda_{ijt}$. Readers familiar with Larch, Wanner, Yotov, and Zylkin (2019) may find the latter presentation easier to digest.

Bias Expansion

As in Fernández-Val and Weidner (2016), we can characterize the asymptotic bias in $\hat\beta$ by examining how the estimated fixed effects $\hat\alpha_i$ and $\hat\gamma_j$ enter the score for $\hat\beta$. The full details behind this derivation are left for the Appendix, but the following second-order expansion provides a general basis. Let $\phi := \mathrm{vec}(\alpha, \gamma)$ be a vector that collects all of the two-way incidental parameters, such that we can again re-express $\ell_{ij}$ slightly as $\ell_{ij} = \ell_{ij}(\beta, \phi)$. We can then define the function

\[
\hat\phi(\beta) := \arg\max_\phi \frac{1}{N(N-1)} \sum_{i,j} \ell_{ij}(\beta, \phi),
\]

which allows us to succinctly characterize the estimated values for $\hat\alpha$ and $\hat\gamma$ as a function of $\beta$. Next, we construct a second-order expansion of the expected score for $\hat\beta$ around the true incidental parameter vector $\phi^0$ and evaluated at the true parameter $\beta^0$:

\[
E\left[\frac{\partial \ell_{ij}(\beta^0, \hat\phi(\beta^0))}{\partial \beta}\right] \approx
E\left[\frac{\partial \ell_{ij}(\beta^0, \phi^0)}{\partial \beta}\right]
+ E\left[\frac{\partial^2 \ell_{ij}(\beta^0, \phi^0)}{\partial \beta\, \partial \phi'}\,\big(\hat\phi(\beta^0) - \phi^0\big)\right]
+ \frac{1}{2}\sum_{f,g}^{\dim \phi} E\left[\frac{\partial^3 \ell_{ij}(\beta^0, \phi^0)}{\partial \beta\, \partial \phi_f\, \partial \phi_g}\,\big(\hat\phi_f(\beta^0) - \phi^0_f\big)\big(\hat\phi_g(\beta^0) - \phi^0_g\big)\right]. \qquad (11)
\]

This expression is near-identical to a similar expansion that appears in Fernández-Val and Weidner (2016)—differing mainly in that the incidental parameter index $\pi_{ij}$ is now a vector rather than a scalar—and communicates the same essential insights: because the latter two terms in (11) are generally $\neq 0$, the score for $\hat\beta$ is biased, with the bias depending on the interaction between the higher-order partial derivatives of $\ell_{ij}$ and the estimation error in the incidental parameters, as well as their variances and covariances.

After dropping terms that are asymptotically small and plugging in the just-defined expressions $S_{ij}$, $H_{ij}$, $G_{ij}$, and $\tilde x_{ij}$ where appropriate, we can use (11) to obtain a tractable expression for the bias that serves as the centerpiece of the following proposition.

Proposition 3.
Under appropriate regularity conditions (Assumption A in the Appendix), for $T$ fixed and $N \to \infty$ we have

\[
\sqrt{N(N-1)}\left(\hat\beta - \beta^0 - \frac{W_N^{-1}(B_N + D_N)}{N-1}\right) \to_d \mathcal{N}\big(0,\; W_N^{-1}\,\Omega_N\, W_N^{-1}\big),
\]

where $W_N$ and $\Omega_N$ are $K \times K$ matrices given by

\[
W_N = \frac{1}{N(N-1)} \sum_{i=1}^N \sum_{j \in N\setminus\{i\}} \tilde x_{ij}'\, \bar H_{ij}\, \tilde x_{ij}, \qquad
\Omega_N = \frac{1}{N(N-1)} \sum_{i=1}^N \sum_{j \in N\setminus\{i\}} \tilde x_{ij}'\,\big[\mathrm{Var}(S_{ij} \,|\, x_{ij})\big]\, \tilde x_{ij},
\]

and $B_N$ and $D_N$ are $K$-vectors with elements given by

\[
B_N^k = -\frac{1}{N}\sum_{i=1}^N \mathrm{Tr}\Bigg[\Big(\sum_{j \in N\setminus\{i\}} \bar H_{ij}\Big)^{\dagger} \sum_{j \in N\setminus\{i\}} E\big(H_{ij}\, \tilde x_{ij,k}\, S_{ij}' \,\big|\, x_{ij,k}\big)\Bigg]
+ \frac{1}{2N}\sum_{i=1}^N \mathrm{Tr}\Bigg[\Big(\sum_{j \in N\setminus\{i\}} \bar G_{ij}\, \tilde x_{ij,k}\Big)\Big(\sum_{j \in N\setminus\{i\}} \bar H_{ij}\Big)^{\dagger}\Big(\sum_{j \in N\setminus\{i\}} E\big(S_{ij} S_{ij}' \,\big|\, x_{ij,k}\big)\Big)\Big(\sum_{j \in N\setminus\{i\}} \bar H_{ij}\Big)^{\dagger}\Bigg],
\]
\[
D_N^k = -\frac{1}{N}\sum_{j=1}^N \mathrm{Tr}\Bigg[\Big(\sum_{i \in N\setminus\{j\}} \bar H_{ij}\Big)^{\dagger} \sum_{i \in N\setminus\{j\}} E\big(H_{ij}\, \tilde x_{ij,k}\, S_{ij}' \,\big|\, x_{ij,k}\big)\Bigg]
+ \frac{1}{2N}\sum_{j=1}^N \mathrm{Tr}\Bigg[\Big(\sum_{i \in N\setminus\{j\}} \bar G_{ij}\, \tilde x_{ij,k}\Big)\Big(\sum_{i \in N\setminus\{j\}} \bar H_{ij}\Big)^{\dagger}\Big(\sum_{i \in N\setminus\{j\}} E\big(S_{ij} S_{ij}' \,\big|\, x_{ij,k}\big)\Big)\Big(\sum_{i \in N\setminus\{j\}} \bar H_{ij}\Big)^{\dagger}\Bigg].
\]

Footnote: In particular, all elements of the cross-partial objects $E[\partial^2 \ell_{ij}/\partial \alpha_i\, \partial \gamma_j']$, $E[\partial^3 \ell_{ij}/\partial \alpha_i\, \partial \alpha_i'\, \partial \gamma_j]$, etc. can be shown to be asymptotically small. Thus, in what follows, $B_N$ reflects the contribution of the $\alpha_i$ parameters to the bias and $D_N$ reflects the contribution of the $\gamma_j$ parameters.

The above proposition establishes the asymptotic distribution of the three-way gravity estimator as $N \to \infty$, including the asymptotic bias $(N-1)^{-1}\, W_N^{-1}(B_N + D_N)$. Intuitively, this bias can be decomposed as the product of the inverse expected Hessian with respect to $\beta$ (i.e., $W_N^{-1}$) and the bias of the score in (11), which in turn is captured by the two-way bias terms $B_N$ and $D_N$ and the rate of asymptotic convergence (essentially $1/N$). In the two-way FE-PPML setting considered in Fernández-Val and Weidner (2016), we would have that $B_N = D_N = 0$, such that $\hat\beta$ is unbiased.
Importantly, and unlike in the two-way FE-PPML setting, the three-way model does not give us the no-bias result that $B_N = D_N = 0$, as the following discussion helps to illustrate.

Illustrating the Bias using the T = 2 Case
Admittedly, the complexity of the objects that appear in Proposition 3 may make it difficult to appreciate the general point that the three-way estimator is not unbiased. One way to make these details more transparent is to focus our attention on the simplest possible panel model, where $T = 2$. The convenient thing about this simplified setting is that the likelihood function $\ell_{ij}$ can be reduced to just a scalar: $\ell_{ij} = y_{ij2} \log \vartheta_{ij} + y_{ij1} \log(1 - \vartheta_{ij})$, where

\[
\vartheta_{ij} = \frac{\exp(\Delta x_{ij}'\beta + \Delta \pi_{ij})}{\exp(\Delta x_{ij}'\beta + \Delta \pi_{ij}) + 1}
\]

and where $\Delta x_{ij} = x_{ij2} - x_{ij1}$ and $\Delta \pi_{ij} = \pi_{ij2} - \pi_{ij1}$. Importantly, these normalizations allow us to express $\partial \ell_{ij}/\partial \Delta\pi_{ij}$, $\partial^2 \ell_{ij}/\partial \Delta\pi_{ij}^2$, etc. as also just scalars, and we can therefore easily derive the following result:

Remark 1.
For $T = 2$, we calculate

\[
S_{ij} = (1-\vartheta_{ij})\, y_{ij2} - \vartheta_{ij}\, y_{ij1}, \qquad
H_{ij} = \vartheta_{ij}(1-\vartheta_{ij})(y_{ij1} + y_{ij2}), \qquad
\bar H_{ij} = \vartheta_{ij}(1-\vartheta_{ij})(\lambda_{ij1} + \lambda_{ij2}),
\]
\[
G_{ij} = \vartheta_{ij}(1-\vartheta_{ij})(1 - 2\vartheta_{ij})(y_{ij1} + y_{ij2}), \qquad
\bar G_{ij} = \vartheta_{ij}(1-\vartheta_{ij})(1 - 2\vartheta_{ij})(\lambda_{ij1} + \lambda_{ij2}),
\]

and $\Delta\tilde x_{ij} = \tilde x_{ij2} - \tilde x_{ij1}$. The bias term $B_N^k$ in Proposition 3 can then be written as

\[
B_N^k = \operatorname{plim}_{N\to\infty}\;
-\frac{1}{N}\sum_{i=1}^N \frac{\sum_{j\neq i} \Delta\tilde x_{ij,k}\; \vartheta_{ij}(1-\vartheta_{ij})\big[(1-\vartheta_{ij})E(y_{ij2}^2) - \vartheta_{ij}E(y_{ij1}^2) + (1-2\vartheta_{ij})E(y_{ij1}y_{ij2})\big]}{\sum_{j\neq i} \vartheta_{ij}(1-\vartheta_{ij})(\lambda_{ij1}+\lambda_{ij2})}
\]
\[
+\;\frac{1}{2N}\sum_{i=1}^N \frac{\Big\{\sum_{j\neq i} \Delta\tilde x_{ij,k}\; \vartheta_{ij}(1-\vartheta_{ij})(1-2\vartheta_{ij})(\lambda_{ij1}+\lambda_{ij2})\Big\}\Big\{\sum_{j\neq i}\big[(1-\vartheta_{ij})^2 E(y_{ij2}^2) + \vartheta_{ij}^2 E(y_{ij1}^2) - 2\vartheta_{ij}(1-\vartheta_{ij})E(y_{ij1}y_{ij2})\big]\Big\}}{\Big[\sum_{j\neq i} \vartheta_{ij}(1-\vartheta_{ij})(\lambda_{ij1}+\lambda_{ij2})\Big]^2},
\]

with an analogous expression also following for $D_N^k$.

Two points then stand out based on the above expression. First, unlike in the two-way FE-PPML case, neither of the two terms in $B_N^k$ generally equals 0. Even in the correctly specified case (where $E(y_{ijt}^2) = \lambda_{ijt} + \lambda_{ijt}^2$ and $E(y_{ij1}y_{ij2}) = \lambda_{ij1}\lambda_{ij2}$), the first term can be shown to cancel, but the second term does not, because $\sum_j \bar G_{ij}\, \Delta\tilde x_{ij} \neq 0$. This is very different from the two-way case, where $\bar H_{ij} = -\bar G_{ij} = \lambda_{ij}$. In that case, both terms in $B_N^k$ and $D_N^k$ always cancel, regardless of whether the PPML model is correctly specified. The difference can be appreciated by comparing simulation results from the top-right panel of Fig. 1, which are based on the two-way model and are therefore unbiased, with those from the bottom-left panel, which are based on the three-way model with $T = 2$ and show a clear asymptotic bias.

Second, it is plain from Remark 1 that both terms in the bias generally depend on the expected second moments of $y_{ij}$ (e.g., $E(y_{ijt}^2)$, $E(y_{ij1}y_{ij2})$, etc.). This is again different from the models that were previously considered in Fernández-Val and Weidner (2016).
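In the correctly specified case just described, the bracket in the first term of $B_N^k$ in fact vanishes pair-by-pair, while the $E(S_{ij}^2)$ bracket in the second term collapses to $\bar H_{ij}$ (consistent with the information equality); the second term as a whole survives only because $\sum_j \bar G_{ij}\,\Delta\tilde x_{ij} \neq 0$. A quick check of the algebra, using made-up $\lambda$ values of our own choosing:

```python
# verify the two brackets from Remark 1 under correct specification (Poisson moments)
lam1, lam2 = 0.8, 2.5                      # made-up lambda_ij1, lambda_ij2
theta = lam2 / (lam1 + lam2)

Ey1sq = lam1 + lam1**2                     # E(y1^2) = lam1 + lam1^2
Ey2sq = lam2 + lam2**2
Ey1y2 = lam1 * lam2                        # E(y1 y2) = lam1 * lam2

# bracket from the first term of B_N^k: vanishes identically
term1 = (1 - theta) * Ey2sq - theta * Ey1sq + (1 - 2 * theta) * Ey1y2
print(term1)   # ~0

# E(S^2) bracket from the second term: equals Hbar_ij exactly
ES2 = (1 - theta)**2 * Ey2sq + theta**2 * Ey1sq - 2 * theta * (1 - theta) * Ey1y2
Hbar = theta * (1 - theta) * (lam1 + lam2)
print(ES2, Hbar)   # equal
```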
Among other things, the difficulty associated with estimating these second moments has practical implications for how bias corrections can be implemented, a point we return to below.

Footnote: The specific examples used in Fernández-Val and Weidner (2016) are the Poisson model, which is unbiased, and the probit model, which requires the distribution of $y_{ij}$ to be correctly specified. They also provide a bias expansion for "conditional moment" models that allow the distribution of $y_{ij}$ to be misspecified. Beyond this theoretical discussion, however, bias corrections for misspecified models have yet to receive much attention. As can be seen above, an important complication that arises for these models is that the bias depends on the distribution of the data, which is typically treated as unknown.

What if T is Large?

While Proposition 3 only focuses on asymptotics where $N \to \infty$, the three-way gravity panel also features a time dimension ($T$), and it is interesting to wonder how the above results may depend on changes in $T$. The following remark clarifies how the bias terms $B_N$ and $D_N$ can be re-written to illuminate the role of the time dimension.

Remark 2. Using generic definitions for $S_{ij}$, $H_{ij}$, $G_{ij}$, and $\tilde x_{ij}$ (e.g., $S_{ij} := \partial \ell_{ij}/\partial \pi_{ij}$, $H_{ij} := -\partial^2 \ell_{ij}/\partial \pi_{ij}\,\partial \pi_{ij}'$, etc.), the formulas for the asymptotic distribution in Proposition 3 apply generally to M-estimators of the form (8) based on concave objective functions $\ell_{ij}(\beta, \alpha_{it}, \gamma_{jt})$. Unlike with two-way FE-PPML models, these formulas do not reduce to zero when we further specialize them to the profiled Poisson pseudo-likelihood in (7), but we still find it instructive to do so (e.g., to discuss the large-$T$ limit below). For that purpose, we define the $T \times T$ matrix $M_{ij} := I_T - \vartheta_{ij}\, \iota_T'$. Furthermore, let $\Lambda_{ij}$ be the $T \times T$ diagonal matrix with diagonal elements $\lambda_{ijt}$, and for $i,j \in \{1,\dots,N\}$ define the $T \times T$ matrices

\[
Q_i := \Big(\tfrac{1}{N-1}\textstyle\sum_{j \in N\setminus\{i\}} M_{ij}\Lambda_{ij}M_{ij}'\Big)^{\dagger}\Big(\tfrac{1}{N-1}\textstyle\sum_{j \in N\setminus\{i\}} M_{ij}\, E(y_{ij}y_{ij}')\, M_{ij}'\Big)\Big(\tfrac{1}{N-1}\textstyle\sum_{j \in N\setminus\{i\}} M_{ij}\Lambda_{ij}M_{ij}'\Big)^{\dagger},
\]
\[
R_{ij} := E(y_{ij}y_{ij}')\, M_{ij}'\,\Big(\tfrac{1}{N-1}\textstyle\sum_{j' \in N\setminus\{i\}} M_{ij'}\Lambda_{ij'}M_{ij'}'\Big)^{\dagger}\, \Lambda_{ij} M_{ij}'.
\]

The bias term $B_N = (B_N^k)$ in Proposition 3 can then be expressed as

\[
B_N^k = \frac{1}{N(N-1)}\sum_{i=1}^N \sum_{j \in N\setminus\{i\}}\left[-\frac{\iota_T'\, R_{ij}\, \tilde x_{ij,k}}{\iota_T'\, \lambda_{ij}} + \frac{1}{2}\,\frac{\lambda_{ij}'\, Q_i\, \Lambda_{ij} M_{ij}'\, \tilde x_{ij,k}}{\iota_T'\, \lambda_{ij}}\right], \qquad (12)
\]

and an analogous formula for $D_N$ follows by interchanging $i$ and $j$ appropriately.
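The matrix $M_{ij}$ acts as a residual-maker: $S_{ij} = M_{ij}\, y_{ij}$, and $M_{ij}\Lambda_{ij}M_{ij}'$ reproduces the expected Hessian $\bar H_{ij}$ from the Preliminaries (diagonal $\vartheta_{ijt}(1-\vartheta_{ijt})\sum_\tau \lambda_{ij\tau}$, off-diagonal $-\vartheta_{ijs}\vartheta_{ijt}\sum_\tau \lambda_{ij\tau}$). A short check of both identities, using made-up values of our own:

```python
import numpy as np

lam = np.array([0.5, 1.0, 2.0, 1.5])   # made-up conditional means for one pair
y = np.array([1.0, 0.0, 3.0, 2.0])     # made-up flows
T = len(lam)
theta = lam / lam.sum()
iota = np.ones(T)

M = np.eye(T) - np.outer(theta, iota)  # M_ij = I_T - theta iota'

# S_ij = M y has elements y_t - theta_t * sum_tau(y_tau)
S = M @ y
gap_S = np.max(np.abs(S - (y - theta * y.sum())))
print(gap_S)    # ~0

# M Lambda M' equals Hbar_ij from the Preliminaries
Lam = np.diag(lam)
Hbar = lam.sum() * (np.diag(theta) - np.outer(theta, theta))
gap_H = np.max(np.abs(M @ Lam @ M.T - Hbar))
print(gap_H)    # ~0
```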
So long as there is only weak time dependence between observations belonging to the same pair (in the sense described by Hansen, 2007), the matrix objects $R_{ij}$ and $Q_i\, \Lambda_{ij} M_{ij}'$ in Remark 2 are both of order 1 as $T \to \infty$, such that both terms in brackets in (12) are likewise of order 1. We will henceforth assume any time dependence is weak. Remark 3 then describes some additional asymptotic results for when $T$ is large.

Remark 3.
Under asymptotics where $T \to \infty$, we have the following:

(i) If $N$ is fixed and $T \to \infty$, then $\hat\beta$ is generally inconsistent.

(ii) As $T \to \infty$, the combined bias term $W_N^{-1}(B_N + D_N)$ goes to zero at a rate of $1/T$. Therefore, because the standard error is of order $1/(N\sqrt{T})$, there is no bias in the asymptotic distribution of $\hat\beta$ as $N$ and $T$ both $\to \infty$.

To elaborate further, letting $T \to \infty$ is obviously not sufficient for either $\alpha$ or $\gamma$ to be consistently estimated and does not solve the IPP, as stated in part (i). However, as part (ii) tells us, $T$ still plays an interesting role in conditioning the bias when both $N$ and $T$ jointly become large. Intuitively, because $W_N^{-1}$ is of order $1/T$ as $T \to \infty$, whereas $B_N$ and $D_N$ are both of order 1, the bias in $\hat\beta$ effectively vanishes at a rate of $1/(NT)$ as both
$N, T \to \infty$, such that it increasingly shrinks in relation to the order-$1/(N\sqrt{T})$ standard error. This is what we mean when we say the IPP the three-way PPML model suffers from is rather unique: it can be resolved by a large enough $T$ (like most IPPs), yet a large $T$ is actually neither necessary nor sufficient to ensure consistency.

Allowing for Conditional Dependence across Pairs
The bias expansion in Proposition 3 allows for errors to be clustered within each pair $(i,j)$, but assumes conditional independence of $y_{ij}$ and $y_{i'j'}$ for all $(i,j) \neq (i',j')$. This assumption is consistent with the standard practice in the literature of assuming that errors are clustered within pairs when computing standard errors (see Yotov, Piermartini, Monteiro, and Larch, 2016). However, it is important to clarify that the results in Proposition 3 may change when other assumptions are used. For example, if we want to allow $y_{ij}$ and $y_{ji}$ (i.e., both directions of trade) to be correlated, then we would need to adjust $\Omega_N$ to allow for the additional clustering; namely, we would need

\[
\Omega_N = \frac{1}{N(N-1)}\sum_{i=1}^{N-1}\sum_{j=i+1}^{N} \mathrm{Var}\big(\tilde x_{ij}'\, S_{ij} + \tilde x_{ji}'\, S_{ji} \,\big|\, x\big)
\]
\[
= \frac{1}{N(N-1)}\sum_{i=1}^{N-1}\sum_{j=i+1}^{N}\Big\{ \tilde x_{ij}'\big[\mathrm{Var}(S_{ij}\,|\,x_{ij})\big]\tilde x_{ij} + \tilde x_{ji}'\big[\mathrm{Var}(S_{ji}\,|\,x_{ji})\big]\tilde x_{ji} + \tilde x_{ij}'\big[\mathrm{Cov}(S_{ij}, S_{ji}\,|\,x_{ij})\big]\tilde x_{ji} + \tilde x_{ji}'\big[\mathrm{Cov}(S_{ji}, S_{ij}\,|\,x_{ji})\big]\tilde x_{ij} \Big\}. \qquad (13)
\]

However, this is just one possibility. Similar adjustments could be made to allow for clustering by exporter or importer, for example, or even for multi-way clustering à la Cameron, Gelbach, and Miller (2011). In these cases, the bias would also need to be modified; specifically, one would have to modify the portions of $B_N^k$ and $D_N^k$ that depend on the variance of $S_{ij}$ to allow for correlations across $i$ and/or $j$.

Footnote: By "weak" time dependence, we mean that any such dependence dissipates as the temporal distance between observations increases. Alternatively, if observations are correlated regardless of how far apart they are in time, the standard error is always of order $1/N$ (see Hansen, 2007), and the same will also be true for the asymptotic bias. The latter is arguably a less natural assumption in this context, however.
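As a sketch of how (13) changes the plug-in variance in practice, the following is our own illustration (with simulated scores and regressors standing in for the estimated objects from a real fit). It builds $\Omega_N$ both with and without the two reciprocal-pair covariance terms:

```python
import numpy as np

rng = np.random.default_rng(4)
N, T, K = 6, 3, 2

# made-up within-transformed regressors; gravity covariates such as FTA
# dummies are typically symmetric, so we impose x~_ij = x~_ji
xt = rng.normal(size=(N, N, T, K))
xt = (xt + xt.transpose(1, 0, 2, 3)) / 2

# simulated scores in which S_ij and S_ji share a common pair-level shock
common = rng.normal(size=(N, N, T))
common = (common + common.transpose(1, 0, 2)) / np.sqrt(2)
S = 0.7 * rng.normal(size=(N, N, T)) + 0.7 * common

omega_nocross = np.zeros((K, K))   # Omega_N under pair-only clustering
omega_cross = np.zeros((K, K))     # Omega_N per (13), adding reciprocal pairs
for i in range(N):
    for j in range(i + 1, N):
        a = xt[i, j].T @ S[i, j]   # x~'_ij S_ij (a K-vector)
        b = xt[j, i].T @ S[j, i]   # x~'_ji S_ji
        omega_nocross += np.outer(a, a) + np.outer(b, b)
        omega_cross += np.outer(a + b, a + b)   # the two extra Cov terms in (13)
omega_nocross /= N * (N - 1)
omega_cross /= N * (N - 1)
print(np.diag(omega_nocross))
print(np.diag(omega_cross))   # differs once Cov(S_ij, S_ji | x) != 0
```

Both versions are symmetric and positive semi-definite by construction; they coincide only when the reciprocal scores are uncorrelated.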
Of course, even if the point estimates are correctly centered, inferences will still be unreliable if the estimates of the variance used to construct confidence intervals are not themselves unbiased. For PPML models, confidence intervals are typically obtained using a "sandwich" estimator for the variance that accounts for the possible misspecification of the model. However, as shown by Kauermann and Carroll (2001), the PPML sandwich estimator is generally downward-biased in finite samples. Furthermore, for gravity models (both two-way and three-way), the bias in the sandwich estimator can itself be formalized as a kind of IPP.

Footnote: This type of IPP has similar origins to the one described in Verdier (2018), who considers a dyadic linear model with two-way FEs and sparse matching between the two panel dimensions.

To illustrate the bias of the sandwich estimator in our three-way setting, recall that we can express the variance of $\hat\beta$ as $\mathrm{Var}(\hat\beta - \beta^0) = N^{-1}(N-1)^{-1}\, W_N^{-1}\,\Omega_N\, W_N^{-1}$. As is also true for the linear model (cf. MacKinnon and White, 1985; Imbens and Kolesár, 2016), the bias arises because plug-in estimates for $\Omega_N$ depend on the estimated variance $E(\hat S_{ij}\hat S_{ij}') = E[(y_{ij} - \hat\lambda_{ij})(y_{ij} - \hat\lambda_{ij})']$ rather than on the true variance $E(S_{ij}S_{ij}') = E[(y_{ij} - \lambda_{ij})(y_{ij} - \lambda_{ij})']$. Even though $E(\hat S_{ij}\hat S_{ij}')$ is a consistent estimate for $E(S_{ij}S_{ij}')$, it will generally be biased downward in finite samples.

To formalize this point, again let $\phi := \mathrm{vec}(\alpha, \gamma)$ and now let $d_{ij}$ be a $T \times \dim(\phi)$ matrix of dummies such that each row of $d_{ij}$ satisfies $d_{ijt}\,\phi = \alpha_{it} + \gamma_{jt}$. Using the same approach as Kauermann and Carroll (2001), we can then use the special case where $E(S_{ij}S_{ij}') = \kappa \bar H_{ij}$ (such that $\Omega_N = \kappa W_N$, meaning the model is correctly specified) to demonstrate that $E(\hat S_{ij}\hat S_{ij}')$ generally has a downward bias. Specifically, let the fitted score vector $\hat S_{ij}$ be approximated by the first-order expansion $\hat S_{ij} = S_{ij} - \bar H_{ij}\, x_{ij}(\hat\beta - \beta^0) - \bar H_{ij}\, d_{ij}(\hat\phi - \phi^0)$.
Also assume that $E(S_{ij}S_{ij}') = \kappa \bar H_{ij}$, such that the FE-PPML model is correctly specified. Then the expected outer product of the fitted score, $E(\hat S_{ij}\hat S_{ij}')$, has a first-order bias of

\[
E\big(\hat S_{ij}\hat S_{ij}' - S_{ij}S_{ij}'\big) \approx -\frac{\kappa}{N(N-1)}\, \bar H_{ij}\, \tilde x_{ij}\, W_N^{-1}\, \tilde x_{ij}'\, \bar H_{ij} \;-\; \frac{\kappa}{N(N-1)}\, \bar H_{ij}\, d_{ij}\, W_N^{(\phi)\dagger}\, d_{ij}'\, \bar H_{ij}, \qquad (14)
\]

where $W_N^{(\phi)} := E_N[-\partial^2 \ell_{ij}/\partial \phi\,\partial \phi'] = [N(N-1)]^{-1}\sum_{i,j} d_{ij}'\, \bar H_{ij}\, d_{ij}$ captures the expected Hessian of the concentrated likelihood with respect to $\phi$.

The two terms on the right-hand side of (14) are both negative definite, implying that the sandwich estimator is generally downward-biased—and definitively so if the model is correctly specified. The most meaningful difference with the earlier results of Kauermann and Carroll (2001) is how we can use these two terms to decompose the bias in $E(\hat S_{ij}\hat S_{ij}')$ into two distinct sources. The first term in (14), which depends on $[N(N-1)]^{-1}\, W_N^{-1}$, captures how the bias depends on the variance of $\hat\beta$. The second term, which depends on $[N(N-1)]^{-1}\, W_N^{(\phi)\dagger}$, captures how much of the bias is due to the variance in the estimated incidental parameter vector $\hat\phi$. The former term decreases with $N^2$, but the latter term only decreases with $N$, since increasing $N$ by 1 only adds 1 additional observation of each element of $\hat\phi$.

All together, this analysis implies that the estimated standard error for $\hat\beta$ will exhibit a bias that only disappears at the relatively slow rate of $1/\sqrt{N}$. We should therefore be concerned that asymptotic confidence intervals for $\hat\beta$ may exhibit inadequate coverage even in moderately large samples, similar to what has been found for two-way FE-PPML models in recent simulation studies by Egger and Staub (2015), Jochmans (2016), and Pfaffermayr (2019). Indeed, the bias approximation we have derived in (14) can be readily adapted to the two-way setting or even to more general settings with $k$-way fixed effects.

Footnote: A detailed derivation of (14) is provided in the Appendix.
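The mechanics of (14) mirror a familiar fact from the linear model: fitted residuals are "too small" because estimation projects out part of the error. A minimal analogy of our own (not from the paper): with an OLS hat matrix $P$, $E(\hat e \hat e') = \sigma^2(I - P)$, so each residual's variance understates $\sigma^2$ by exactly its leverage—the same role played here by the two trace terms in (14), with the fixed-effect dummies $d_{ij}$ contributing the part that shrinks only like $1/N$.

```python
import numpy as np

rng = np.random.default_rng(5)
n, k = 30, 3
X = rng.normal(size=(n, k))
P = X @ np.linalg.inv(X.T @ X) @ X.T   # hat (projection) matrix

# E(e_hat e_hat') = (I - P) E(ee') (I - P)' = sigma^2 (I - P) for iid errors,
# because I - P is symmetric and idempotent:
I = np.eye(n)
idem_gap = np.max(np.abs((I - P) @ (I - P).T - (I - P)))
print(idem_gap)   # ~0

# hence E(e_hat_i^2) = sigma^2 (1 - P_ii) < sigma^2: the naive plug-in
# variance is downward-biased by the leverage P_ii, and sum_i P_ii = k
lev = np.diag(P)
print(lev.min(), np.trace(P))
```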
Footnote: Pfaffermayr (2019) makes a similar point about the order of the bias of the standard errors for the two-way FE-PPML model, albeit using a slightly different analysis.

3.4 Bias Corrections for the Three-way Gravity Model

We now present two methods for correcting the bias in point estimates: a jackknife method based on the split-panel jackknife of Dhaene and Jochmans (2015) and an analytical correction based on the expansion shown in Proposition 3. We also provide an analytical correction for the downward bias in standard errors.
Jackknife Bias Correction
The advantage of the jackknife correction is that it does not require explicit estimation of the bias yet still has a simple and powerful applicability. To see this, note first that the asymptotic bias we characterize can be written as

\[
\frac{1}{N}\, B_\beta + o_p(N^{-1}),
\]

where $B_\beta$ is a combined term that captures any suspected asymptotic bias contributions of order $1/N$. The specific jackknife we will apply for our current purposes is a split-panel jackknife based on Dhaene and Jochmans (2015). As in Dhaene and Jochmans (2015), we want to divide the overall data set into subpanels of roughly even size and then estimate $\hat\beta_{(p)}$ for each subpanel $p$. Given the gravity structure of the model, we first divide the set of countries into evenly-sized groups $a$ and $b$. We then consider 4 subpanels of the form "$(a,b)$", where "$(a,b)$" denotes a subpanel where exporters from group $a$ are matched with importers from group $b$. The other three subpanels are $(a,a)$, $(b,a)$, and $(b,b)$. For randomly-generated data, we can define $a$ and $b$ based on their ordering in the data (i.e., $a := \{i : i \leq N/2\}$, $b := \{i : i > N/2\}$).

The split-panel jackknife estimator for $\beta$, $\tilde\beta_J$, is then defined as

\[
\tilde\beta_J := 2\hat\beta - \frac{1}{4}\sum_p \hat\beta_{(p)}. \qquad (15)
\]

This correction works to reduce the bias because, so long as the distribution of $y_{ij}$ and $x_{ij}$ is homogeneous across the different partitions of the data, each $\hat\beta_{(p)}$ has a leading bias term equal to $2B_\beta/N$. The average $\hat\beta_{(p)}$ across these four subpanels thus also has

Footnote: This is just one possible way to construct a jackknife correction for two-way panels. We have also experimented with splitting the panel one dimension at a time as in Fernández-Val and Weidner (2016), but we find the present method performs noticeably better at reducing the bias.
a leading bias of $2B_\beta/N$, and the terms depending on $B_\beta/N$ cancel out of (15). Thus, the bias-corrected estimate $\tilde\beta_J$ only has a bias of order $o_p(N^{-1})$, which is obtained by combining the second-order bias from $\hat\beta$ with that of the average subpanel estimate. This latter bias can be shown to be larger than the original second-order bias of $\hat\beta$, but the overall bias should still be smaller because of the elimination of the leading bias term.

Analytical Bias Correction
Our analytical correction for the bias is based on the bias expression in Proposition 3 and uses the plug-in objects $\hat{\tilde x}_{ij}$, $\hat S_{ij}$, $\hat H_{ij}$, $\hat{\bar H}_{ij}$, and $\hat G_{ij}$. For the most part, these objects are formed in the obvious way by replacing $\lambda_{ijt}$ with $\hat\lambda_{ijt}$ and $\vartheta_{ijt}$ with $\hat\vartheta_{ijt} := \hat\lambda_{ijt}/\sum_\tau \hat\lambda_{ij\tau}$ where needed. The resulting bias correction is given by $(N-1)^{-1}\, \hat W_N^{-1}(\hat B_N + \hat D_N)$, where $\hat B_N$ and $\hat D_N$ are $K$-vectors with elements given by

\[
\hat B_N^k = -\frac{1}{N-1}\sum_{i=1}^N \mathrm{Tr}\Bigg[\Big(\sum_{j \in N\setminus\{i\}} \hat{\bar H}_{ij}\Big)^{\dagger} \sum_{j \in N\setminus\{i\}} \hat H_{ij}\, \hat{\tilde x}_{ij,k}\, \hat S_{ij}'\Bigg]
+ \frac{1}{2(N-1)}\sum_{i=1}^N \mathrm{Tr}\Bigg[\Big(\sum_{j \in N\setminus\{i\}} \hat G_{ij}\, \hat{\tilde x}_{ij,k}\Big)\Big(\sum_{j \in N\setminus\{i\}} \hat{\bar H}_{ij}\Big)^{\dagger}\Big(\sum_{j \in N\setminus\{i\}} \hat S_{ij}\hat S_{ij}'\Big)\Big(\sum_{j \in N\setminus\{i\}} \hat{\bar H}_{ij}\Big)^{\dagger}\Bigg],
\]
\[
\hat D_N^k = -\frac{1}{N-1}\sum_{j=1}^N \mathrm{Tr}\Bigg[\Big(\sum_{i \in N\setminus\{j\}} \hat{\bar H}_{ij}\Big)^{\dagger} \sum_{i \in N\setminus\{j\}} \hat H_{ij}\, \hat{\tilde x}_{ij,k}\, \hat S_{ij}'\Bigg]
+ \frac{1}{2(N-1)}\sum_{j=1}^N \mathrm{Tr}\Bigg[\Big(\sum_{i \in N\setminus\{j\}} \hat G_{ij}\, \hat{\tilde x}_{ij,k}\Big)\Big(\sum_{i \in N\setminus\{j\}} \hat{\bar H}_{ij}\Big)^{\dagger}\Big(\sum_{i \in N\setminus\{j\}} \hat S_{ij}\hat S_{ij}'\Big)\Big(\sum_{i \in N\setminus\{j\}} \hat{\bar H}_{ij}\Big)^{\dagger}\Bigg],
\]

and where

\[
\hat W_N = \frac{1}{N(N-1)}\sum_{i=1}^N \sum_{j \in N\setminus\{i\}} \hat{\tilde x}_{ij}'\, \hat{\bar H}_{ij}\, \hat{\tilde x}_{ij}.
\]

As in Fernández-Val and Weidner (2016), it is possible to show that these plug-in corrections lead to estimates that are asymptotically unbiased as $N \to \infty$. Still, for finite samples, it is evident that the bias in some of these plug-in objects—the $\hat S_{ij}\hat S_{ij}'$ outer product terms, for example—could cause the analytical bias correction to itself exhibit

Footnote: The replacement of $N$ with $N-1$ in $\hat B_N^k$ and $\hat D_N^k$ stems from a degrees-of-freedom correction. This correction is needed because creating plug-in values for the $E(H_{ij}\,\tilde x_{ij,k}\, S_{ij}' \,|\, x_{ij,k})$ and $E(S_{ij}S_{ij}' \,|\, x_{ij,k})$ objects that appear in Proposition 3 requires computing terms of the form $E[y_{ijt}^2]$ and $E[y_{ijs}y_{ijt}]$, as illustrated in Remark 1.
One clear advantage the analyticalcorrection has over the jackknife is that it does not require the distribution of y ij and x ij to be homogeneous over the different partitions of the data in order to be valid. Bias-corrected Standard Errors
Under the assumption of clustered errors within pairs, a natural correction for the variance estimate is available based on (14). Specifically, let

\[
\hat\Omega_U := \frac{1}{N(N-1)}\sum_{i,j} \hat{\tilde x}_{ij}'\left[\, I_T - \frac{1}{N(N-1)}\Big(\hat{\bar H}_{ij}\, \hat{\tilde x}_{ij}\, \hat W_N^{-1}\, \hat{\tilde x}_{ij}' + \hat{\bar H}_{ij}\, d_{ij}\, \hat W_N^{(\phi)\dagger}\, d_{ij}'\Big)\right]^{-1} \hat S_{ij}\, \hat S_{ij}'\, \hat{\tilde x}_{ij},
\]

where $I_T$ is a $T \times T$ identity matrix and $\hat W_N^{(\phi)}$ is a plug-in estimate for $W_N^{(\phi)}$. The corrected variance estimate is then given by

\[
\hat V_U = \frac{1}{N(N-1)}\, \hat W_N^{-1}\, \hat\Omega_U\, \hat W_N^{-1}.
\]

The logic of this adjusted variance estimate follows directly from Kauermann and Carroll (2001): if the PPML estimator is correctly specified (such that $E(S_{ij}S_{ij}') = \kappa \bar H_{ij}$), then $\hat V_U$ can be shown to eliminate the first-order bias in $\widehat{\mathrm{Var}}(\hat\beta - \beta^0)$ shown in (14). It is not generally unbiased otherwise, but it is plausible that it should eliminate a significant portion of any downward bias under other variance assumptions as well.

For our simulation analysis, we assume the following: (i) The data generating process (DGP) for the dependent variable is of the form $y_{ijt} = \lambda_{ijt}\,\omega_{ijt}$, where $\omega_{ijt}$ is a log-normal disturbance with mean 1 and variance $\sigma_{ijt}^2$. (ii) $\beta = 1$. (iii) The model-relevant fixed effects $\alpha$, $\gamma$, and $\eta$ are each drawn as i.i.d. normals. (iv) $x_{ijt} = x_{ij,t-1}/2 + \alpha_{it} + \gamma_{jt} + \nu_{ijt}$, where $\nu_{ijt}$ is likewise i.i.d. normal. (v) Taking our cue from Santos Silva and Tenreyro (2006), we consider four different specifications for the conditional variance of $\omega_{ijt}$:

DGP I: $\sigma_{ijt}^2 = \lambda_{ijt}^{-2}$; $\mathrm{Var}(y_{ijt}\,|\,x_{it},\alpha,\gamma,\eta) = 1$.

DGP II: $\sigma_{ijt}^2 = \lambda_{ijt}^{-1}$; $\mathrm{Var}(y_{ijt}\,|\,x_{it},\alpha,\gamma,\eta) = \lambda_{ijt}$.

DGP III: $\sigma_{ijt}^2 = 1$; $\mathrm{Var}(y_{ijt}\,|\,x_{it},\alpha,\gamma,\eta) = \lambda_{ijt}^2$.

DGP IV: $\sigma_{ijt}^2 = 0.5\,\lambda_{ijt}^{-1} + 0.5\, e^{x_{ijt}}$; $\mathrm{Var}(y_{ijt}\,|\,x_{it},\alpha,\gamma,\eta) = 0.5\,\lambda_{ijt} + 0.5\, e^{x_{ijt}}\lambda_{ijt}^2$,

where we also allow for serial correlation within pairs by imposing

\[
\mathrm{Cov}[\omega_{ijs}, \omega_{ijt}] = \exp\Big[c^{|s-t|} \sqrt{\ln(1+\sigma_{ijs}^2)}\, \sqrt{\ln(1+\sigma_{ijt}^2)}\Big] - 1,
\]

for a decay constant $c \in (0,1)$, such that the degree of correlation weakens for observations further apart in time.

Footnote: These assumptions on $\alpha$, $\gamma$, $\eta$, $x_{ijt}$, and $\nu_{ijt}$ are taken from Fernández-Val and Weidner (2016). Notice that $x_{ijt}$ is strictly exogenous with respect to $\omega_{ijt}$ conditional on $\alpha$, $\gamma$, and $\eta$.
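The log-normal disturbances above can be generated from a Gaussian copula: if $\log\omega_t$ is normal with variance $v_t = \ln(1+\sigma_t^2)$ and mean $-v_t/2$, then $E(\omega_t) = 1$ and $\mathrm{Var}(\omega_t) = \sigma_t^2$, and giving $\log\omega_s$ and $\log\omega_t$ correlation $c^{|s-t|}$ delivers exactly the covariance formula above. A sketch of our own (the $\sigma_t^2$ values and the decay value $c = 0.3$ are assumptions made for illustration only):

```python
import numpy as np

def lognormal_omega(sigma2, c, n_draws, rng):
    """Draw omega_t (t = 1..T) with E(omega_t) = 1, Var(omega_t) = sigma2[t],
    and Cov(omega_s, omega_t) = exp(c**|s-t| * sqrt(v_s * v_t)) - 1,
    where v_t = ln(1 + sigma2[t])."""
    T = len(sigma2)
    v = np.log1p(np.asarray(sigma2))       # variances of log(omega)
    sd = np.sqrt(v)
    corr = np.array([[c ** abs(s - t) for t in range(T)] for s in range(T)])
    cov_log = corr * np.outer(sd, sd)      # covariance matrix of log(omega)
    mean_log = -0.5 * v                    # centers E(omega) at 1
    z = rng.multivariate_normal(mean_log, cov_log, size=n_draws)
    return np.exp(z)

rng = np.random.default_rng(7)
sigma2 = np.array([0.5, 1.0, 2.0])         # made-up sigma^2_t values
c = 0.3                                    # assumed decay constant
omega = lognormal_omega(sigma2, c, n_draws=200_000, rng=rng)

print(omega.mean(axis=0))                  # ~1 in each period
print(omega.var(axis=0))                   # ~sigma2
target_cov01 = np.exp(c * np.sqrt(np.log1p(0.5) * np.log1p(1.0))) - 1
print(np.cov(omega[:, 0], omega[:, 1])[0, 1], target_cov01)   # match
```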
The relevance of these various assumptions to commonly used error distributions is best described by considering the conditional variance $\mathrm{Var}(y_{ijt}\,|\,x_{it},\alpha,\gamma,\eta)$. For example, DGP I assumes that the conditional variance is constant, as in a Gaussian process with i.i.d. disturbances. In DGP II, the conditional variance equals the conditional mean, as in a Poisson distribution. DGP III—which we will also refer to as the case of "log-homoscedastic" data—is the unique case highlighted in Santos Silva and Tenreyro (2006) where the assumption that the conditional variance is proportional to the square of the conditional mean leads to a homoscedastic error when the model is estimated in logs using a linear model. Finally, DGP IV provides a "quadratic" error distribution that mixes DGP II and DGP III and also models a more complex dependence between $x_{ijt}$ and the variance of the error term.

Tables 1 and 2 present simulation evidence comparing the uncorrected three-way FE-PPML estimator with results computed using the analytical and jackknife corrections described in Section 3.4. As in the prior simulations, we again compute results for a variety of different panel sizes—in this case for $N = 20, 50,$
$100$ and $T = 2, 5, 10$. In order to validate our analytical predictions regarding these estimates, we compute the average bias of each estimator, the ratios of the average bias to the average standard error, and coverage ratios for the associated 95% confidence intervals, in each case judging the estimates against the true value $\beta = 1$. In particular, we expect that the bias in $\hat\beta$ should be decreasing in either $N$ or $T$ but should remain large relative to the estimated standard error and induce inadequate coverage for small $T$. We are also interested in whether the usual cluster-robust standard errors accurately reflect the true dispersion of estimates. Results for DGPs I and II are shown in Table 1, whereas Table 2 shows results for DGPs III and IV.

Footnote: Note that the trade literature currently recommends using wide intervals of 4-5 years between time periods so as to allow trade flows time to adjust to changes in trade costs (see Cheng and Wall, 2005). Thus, for practical purposes, $T = 10$ may be thought of as a relatively "long" panel in this context that might span 40+ years.

The results in both tables collectively confirm the presence of bias and the viability of the analytical and jackknife bias corrections. The average bias is generally larger for DGPs I and IV than II and III. As expected, it generally falls with both $N$ and $T$ across all the different DGPs, though only weakly so for DGP III (the log-homoscedastic case), which generally only has a small bias. To use DGP II—the Poisson case, where PPML should otherwise be an optimal estimator—as a representative example, we see that the average bias falls from roughly 3% of the true value when $N = 20$, $T = 2$ to a low of well under 1% when $N = 100$, $T = 10$. For DGP IV, the least favorable of these cases, the average bias is negative and remains sizable throughout. These results confirm that $\beta$ should be consistently estimated even for small $T$ but has an asymptotic bias that depends on the number of countries and on the number of time periods.

Interestingly, while the average bias almost always decreases with $T$, the ratio of the bias to standard error usually does not, seemingly contrary to the expectations laid out in Remark 3.
Evidently, when T is sufficiently small, the rate at which the bias decreases with T may be slower than 1/T. Researchers should thus be careful to note that the implications of Remark 3 do not necessarily apply to settings with small T or even moderately large T. Instead, it seems reasonable to expect that the bias will generally be non-negligible relative to the standard error except for very large T. (Numerically, what we have found is that the two terms that appear in both B and D in Proposition 3 tend to have opposite signs when the DGP is log-homoscedastic. Thus, they tend to mitigate one another, leading to a somewhat muted bias in this case. We have also simulated the bias for larger values of T beyond T = 10. What we find is that the bias decreases somewhat slowly with T for small values of T, consistent with the results in these tables, but does indeed start to decrease with 1/T as T becomes increasingly large.) Furthermore, the estimated cluster-robust standard errors themselves clearly exhibit a bias in all cases as well. Even when N = 100, SE/SD ratios are uniformly below 1; generally they are between 0.90 and 0.95, and for DGP IV, they are closer to 0.85 or even 0.8. Because of these biases, the simulated FE-PPML coverage ratios are unsurprisingly below the 0.95 target. The two bias corrections, by contrast, substantially reduce the average bias for all N and T—notice how, for the Poisson case, for example, the average bias left by the jackknife correction is never greater than 0.165 in absolute value in the units of Table 1. When N = 100, the analytical correction often dominates, especially when T is at least 5. All the same, both corrections generally have a positive effect, and the better across-the-board bias-reduction performance of the jackknife comes at the important cost of a relatively large increase in the variance. Thus, the analytical correction generally performs as well as or better than the jackknife in terms of improving coverage even in the smaller samples. Neither correction is sufficient to bring coverage ratios to the immediate vicinity of 0.95, however, though corrected Gaussian-DGP estimates and Poisson-DGP estimates both reach 0.94 using the analytical correction when N = 100, and coverage for the analytical-corrected Poisson-DGP estimates reaches 0.96 when N = 50.

Table 3 then evaluates the efficacy of our bias correction for the estimated variance. Keeping in mind that this correction is calibrated for the case of a correctly specified variance (which corresponds to DGP II), it is unsurprising that the effect of this correction varies depending on the conditional distribution of the data. The best results by far are for the Gaussian, Poisson, and Log-homoscedastic DGPs (DGPs I, II, and III, respectively), where combining the analytical bias correction for the point estimates with the correction for the variance yields coverage ratios that fall within an acceptable range, between roughly 0.93 and 0.962, when N is either 50 or 100 and are often close to the target value of 0.95 in these cases. These corrections lead to dramatic improvements in coverage for DGP IV as well, but there the remaining biases in both the point estimate and the standard error remain large even for N = 100 and T = 10.

Overall, these simulations suggest that combining an analytical bias correction for β̂ with a further correction for the variance based on (14) should be a reliable way of reducing bias and improving coverage. At the same time, it should be noted that neither offers a complete bias removal. For smaller samples, if reducing bias on average is heavily favored, and if the distribution of y_ij and x_ij can be reasonably assumed to be homogeneous, then the split-panel jackknife method might be preferable to the analytical correction method. We should also be careful to point out that the results produced here are based on the particular assumptions we have chosen to generate the data. To determine the practical implications of these corrections, a more meaningful test will be to apply them to estimates produced using real data.

For our empirical application, we estimate the average effects of an FTA using a panel with what would typically be considered a relatively large number of countries. Our trade data is from the BACI database of Gaulier and Zignago (2010), from which we extract data on trade flows between 169 countries for the years 1995, 2000, 2005, 2010, and 2015. Countries are chosen so that the same 169 countries always appear as both exporters and importers in every period; hence, the data readily maps to the setting just described with N = 169 and T = 5. We combine this trade data with data on FTAs from the NSF-Kellogg database maintained by Scott Baier and Jeff Bergstrand, which we crosscheck against data from the WTO in order to incorporate agreements from more recent years.
The specification we estimate is

y_ijt = exp[α_it + γ_jt + η_ij + β FTA_ijt] ω_ijt,   (16)

where y_ijt is trade flows (measured in current USD), FTA_ijt is a 0/1 dummy variable indicating whether i and j have an FTA at time t, and ω_ijt is an error term. As we have noted, estimation of specifications such as (16) via PPML has become an increasingly standard method for estimating the effects of trade agreements and other trade policies and is currently recommended as such by the WTO (see Yotov, Piermartini, Monteiro, and Larch, 2016). Table 4 presents results from FE-PPML estimation of (16), including results obtained using our bias corrections. Because biases may vary depending on the specific heteroscedasticity patterns native to each industry, we show results for industry-specific regressions at the 2-digit ISIC (rev. 3) industry level as well as for aggregate trade. The results for aggregate trade flows are shown in the bottom row of Table 4. (The NSF-Kellogg database is available for download on Jeff Bergstrand's website. The most recent version runs from 1950-2012. The additional data from the WTO is needed to capture agreements that entered into force between 2012 and 2015.)
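The mechanics of FE-PPML estimation of a specification like (16) can be illustrated with a self-contained sketch. This is our own toy example using plain IRLS in NumPy on a noiseless simulated panel (the sizes and DGP values are hypothetical); real applications should use dedicated implementations such as the ppmlhdfe and ppml_fe_bias packages mentioned elsewhere in the paper:

```python
import numpy as np

# Toy three-way gravity panel: N countries, T periods, directed pairs i != j.
rng = np.random.default_rng(42)
N, T = 8, 2
pairs = [(i, j) for i in range(N) for j in range(N) if i != j]
P = len(pairs)

alpha = rng.normal(0.0, 0.25, (N, T))   # exporter-time effects
gamma = rng.normal(0.0, 0.25, (N, T))   # importer-time effects
eta = rng.normal(0.0, 0.25, P)          # pair effects
beta_true = 1.0

K = 1 + 2 * N * T + P                   # columns: [FTA | exporter-time | importer-time | pair]
X = np.zeros((P * T, K))
y = np.zeros(P * T)
for p, (i, j) in enumerate(pairs):
    for t in range(T):
        r = p * T + t
        fta = float(rng.random() < 0.3)           # 0/1 "FTA" dummy
        X[r, 0] = fta
        X[r, 1 + i * T + t] = 1.0
        X[r, 1 + N * T + j * T + t] = 1.0
        X[r, 1 + 2 * N * T + p] = 1.0
        # noiseless outcome, so the pseudo-ML fit recovers beta exactly
        y[r] = np.exp(alpha[i, t] + gamma[j, t] + eta[p] + beta_true * fta)

# IRLS for the Poisson pseudo-likelihood with a log link; np.linalg.lstsq
# absorbs the collinearities among the dummy columns.
b = np.zeros(K)
for _ in range(100):
    lam = np.exp(X @ b)
    z = X @ b + (y - lam) / lam           # working response
    w = np.sqrt(lam)                      # working weights
    b_new, *_ = np.linalg.lstsq(w[:, None] * X, w * z, rcond=None)
    if np.max(np.abs(b_new - b)) < 1e-10:
        b = b_new
        break
    b = b_new
beta_hat = b[0]
```

Explicit dummy columns are only workable at toy scale; with N = 169 and T = 5 the fixed effects must be absorbed iteratively, which is exactly what high-dimensional FE implementations do.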
The coefficient on FTA_ijt for aggregate trade is initially estimated to be 0.082, implying a partial effect of e^0.082 − 1 ≈ 8.5% on trade. The estimated standard error is 0.027, such that the estimate is statistically significant at the p < 0.01 significance level. Our bias-corrected estimates do not paint an altogether different picture, but do highlight the potential for meaningful refinement. Both the analytical and jackknife bias corrections for β suggest a downward bias of 0.004-0.005, or about 15%-18% of the estimated standard error. As our bias-corrected standard errors show (in the last column of Table 4), the initially estimated standard error itself has an implied downward bias of 10% (i.e., 0.027 versus 0.030). (The term “partial effect” is conventionally used to distinguish this type of estimate from the “general equilibrium” effects of an FTA, which would typically be calculated by solving a general equilibrium trade model where prices, incomes, and output levels—which are otherwise absorbed by the α_it and γ_jt fixed effects—are allowed to evolve endogenously in response to the FTA. In the context of such models, β can usually be interpreted as capturing the average effect of an FTA on bilateral trade frictions specifically, holding fixed all other determinants of trade.)

Conclusion
Thanks to recent methodological and computational advances, nonlinear estimation with three-way fixed effects has become increasingly popular for investigating the effects of trade policies on trade flows. However, the asymptotic and finite-sample properties of such estimators have not been rigorously studied, especially with regard to potential IPPs. The performance of the FE-PPML estimator in particular is of natural interest in this context, both because FE-PPML is known to be relatively robust to IPPs and because it is likely to be a researcher's first choice for estimating three-way gravity models. Our results regarding the consistency of PPML in this setting reflect these unique properties of PPML and support its current status as a workhorse estimator for estimating the effects of trade policies.

Given the consistency of PPML in this setting, and given the nice IPP-robustness properties of PPML in general, it may come as a surprise that three-way PPML nonetheless suffers from an IPP bias. We show that the leading component of this bias is decreasing in the number of countries in the panel as well as in the number of time periods. Thus, the bias is likely to be of comparable magnitude to the standard error when the time dimension of the panel is small, even for large panels with many countries. Typical cluster-robust estimates of the standard error are also biased, implying asymptotic confidence intervals that are not only off-center but also too narrow.

These issues are not so severe that they leave researchers in the wilderness, but we do recommend taking advantage of the corrective measures described in the paper when estimating three-way gravity models. In particular, we find that analytical bias corrections based on Taylor expansions to both the point estimates and standard errors generally lead to improved inferences when applied simultaneously. These corrections are not a panacea, however, and several avenues remain open for future work.
For example, confidence interval estimates could be adjusted further to account for the uncertainty in the estimated variance; Kauermann and Carroll (2001) describe such a correction for the standard PPML model. A quasi-differencing approach similar to Jochmans (2016) could also provide another angle of attack. Turning to broader applications, the essential dyadic structure of our bias corrections could be easily adapted to network models that study changes in network behavior over time, especially settings that involve studying the number of interactions between network members.

References
Allen, T., C. d. C. Dobbin, and M. Morten (2018): “Border Walls,” Discussion paper, National Bureau of Economic Research.
Alvarez, J., and M. Arellano (2003): “The Time Series and Cross-Section Asymptotics of Dynamic Panel Data Estimators,” Econometrica, 71(4), 1121–1159.
Andersen, E. B. (1970): “Asymptotic properties of conditional maximum-likelihood estimators,” Journal of the Royal Statistical Society: Series B (Methodological), 32(2), 283–301.
Anderson, J. E., and E. van Wincoop (2003): “Gravity with Gravitas: A Solution to the Border Puzzle,” American Economic Review, 93(1), 170–192.
Arellano, M., and S. Bonhomme (2009): “Robust Priors in Nonlinear Panel Data Models,” Econometrica, 77(2), 489–536.
Arellano, M., and J. Hahn (2007): “Understanding Bias in Nonlinear Panel Models: Some Recent Developments,” Econometric Society Monographs, 43, 381.
Bai, J. (2009): “Panel Data Models With Interactive Fixed Effects,” Econometrica, 77(4), 1229–1279.
Baier, S. L., and J. H. Bergstrand (2007): “Do Free Trade Agreements Actually Increase Members’ International Trade?,” Journal of International Economics, 71(1), 72–95.
Beverelli, C., and G. Orefice (2019): “Migration Deflection: The Role of Preferential Trade Agreements,” Regional Science and Urban Economics, 79, 103469.
Bosquet, C., and H. Boulhol (2015): “What is really puzzling about the ‘distance puzzle’,” Review of World Economics, 151(1), 1–21.
Brinkman, J., and J. Lin (2019): “Freeway Revolts!,” Unpublished manuscript.
Cameron, A. C., J. B. Gelbach, and D. L. Miller (2011): “Robust Inference With Multiway Clustering,” Journal of Business & Economic Statistics, 29(2), 238–249.
Cameron, A. C., and P. K. Trivedi (2015): “Count Panel Data,” Oxford Handbook of Panel Data Econometrics (Oxford: Oxford University Press, 2013).
Carro, J. M. (2007): “Estimating Dynamic Panel Data Discrete Choice Models with Fixed Effects,” Journal of Econometrics, 140(2), 503–528.
Charbonneau, K. B. (2017): “Multiple fixed effects in binary response panel data models,” The Econometrics Journal, 20(3), S1–S13.
Chen, M., I. Fernández-Val, and M. Weidner (2019): “Nonlinear factor models for network and panel data,” arXiv preprint arXiv:1412.5647.
Cheng, I. H., and H. J. Wall (2005): “Controlling for Heterogeneity in Gravity Models of Trade,” Federal Reserve Bank of St. Louis Review, 87(1), 49–63.
Correia, S., P. Guimarães, and T. Zylkin (2019): “PPMLHDFE: Fast Poisson Estimation with High-dimensional Fixed Effects,” Unpublished manuscript.
Dhaene, G., and K. Jochmans (2015): “Split-panel Jackknife Estimation of Fixed-effect Models,” The Review of Economic Studies, 82(3), 991–1030.
Dzemski, A. (2018): “An empirical model of dyadic link formation in a network with unobserved heterogeneity,” Review of Economics and Statistics.
Egger, P., M. Larch, K. E. Staub, and R. Winkelmann (2011): “The Trade Effects of Endogenous Preferential Trade Agreements,” American Economic Journal: Economic Policy, 3(3), 113–143.
Egger, P. H., and K. E. Staub (2015): “GLM estimation of trade gravity models with fixed effects,” Empirical Economics, 50(1), 137–175.
Fernández-Val, I., and F. Vella (2011): “Bias Corrections for Two-step Fixed Effects Panel Data Estimators,” Journal of Econometrics, 163(2), 144–162.
Fernández-Val, I., and M. Weidner (2016): “Individual and Time Effects in Nonlinear Panel Models with Large N, T,” Journal of Econometrics, 192(1), 291–312.
Fernández-Val, I., and M. Weidner (2018): “Fixed Effect Estimation of Large T Panel Data Models,” Annual Review of Economics.
Gaduh, A., T. Gračner, and A. D. Rothenberg (2020): “Life in the Slow Lane: Unintended Consequences of Public Transit in Jakarta,” Unpublished manuscript.
Gaulier, G., and S. Zignago (2010): “BACI: International Trade Database at the Product-Level. The 1994-2007 Version,” Working Paper 2010-23, CEPII research center.
Graham, B. S. (2017): “An econometric model of network formation with degree heterogeneity,” Econometrica, 85(4), 1033–1063.
Greene, W. (2004): “Fixed Effects and Bias Due to the Incidental Parameters Problem in the Tobit Model,” Econometric Reviews, 23(2), 125–147.
Hahn, J., and G. Kuersteiner (2002): “Asymptotically Unbiased Inference for a Dynamic Panel Model with Fixed Effects when Both N and T Are Large,” Econometrica, 70(4), 1639–1657.
Hahn, J., and H. R. Moon (2006): “Reducing Bias of MLE in a Dynamic Panel Model,” Econometric Theory, 22(3), 499–512.
Hahn, J., and W. Newey (2004): “Jackknife and Analytical Bias Reduction for Nonlinear Panel Models,” Econometrica, 72(4), 1295–1319.
Hansen, C. B. (2007): “Asymptotic properties of a robust variance matrix estimator for panel data when T is large,” Journal of Econometrics, 141(2), 597–620.
Hausman, J., B. H. Hall, and Z. Griliches (1984): “Econometric Models for Count Data with an Application to the Patents-R&D Relationship,” Econometrica, 52(4), 909–938.
Head, K., and T. Mayer (2014): “Gravity Equations: Workhorse, Toolkit, and Cookbook,” Handbook of International Economics, 4, 131–196.
Helpman, E., M. Melitz, and Y. Rubinstein (2008): “Estimating Trade Flows: Trading Partners and Trading Volumes,” The Quarterly Journal of Economics, 123(2), 441–487.
Hinz, J., A. Stammann, and J. Wanner (2019): “Persistent Zeros: The Extensive Margin of Trade,” Unpublished manuscript.
Imbens, G. W., and M. Kolesar (2016): “Robust standard errors in small samples: Some practical advice,” Review of Economics and Statistics, 98(4), 701–712.
Jochmans, K. (2016): “Two-Way Models for Gravity,” Review of Economics and Statistics.
Jochmans, K., and M. Weidner (2019): “Fixed-Effect Regressions on Network Data,” Econometrica, 87(5), 1543–1560.
Kato, K., A. F. Galvao Jr., and G. V. Montes-Rojas (2012): “Asymptotics for Panel Quantile Regression Models with Individual Effects,” Journal of Econometrics, 170(1), 76–91.
Kauermann, G., and R. J. Carroll (2001): “A note on the efficiency of sandwich covariance matrix estimation,” Journal of the American Statistical Association, 96(456), 1387–1396.
Lancaster, T. (2002): “Orthogonal Parameters and Panel Data,” The Review of Economic Studies, 69(3), 647–666.
Larch, M., J. Wanner, Y. V. Yotov, and T. Zylkin (2019): “Currency Unions and Trade: A PPML Re-assessment with High-dimensional Fixed Effects,” Oxford Bulletin of Economics and Statistics, 81(3), 487–510.
MacKinnon, J. G., and H. White (1985): “Some heteroskedasticity-consistent covariance matrix estimators with improved finite sample properties,” Journal of Econometrics, 29(3), 305–325.
Moon, H. R., and M. Weidner (2017): “Dynamic Linear Panel Regression Models with Interactive Fixed Effects,” Econometric Theory, 33(1), 158–195.
Neyman, J., and E. L. Scott (1948): “Consistent Estimates Based on Partially Consistent Observations,” Econometrica, 16(1), 1–32.
Nickell, S. (1981): “Biases in dynamic models with fixed effects,” Econometrica, pp. 1417–1426.
Palmgren, J. (1981): “The Fisher Information Matrix for Log Linear Models Arguing Conditionally on Observed Explanatory Variables,” Biometrika, 68(2), 563–566.
Pesaran, M. H. (2006): “Estimation and Inference in Large Heterogeneous Panels with a Multifactor Error Structure,” Econometrica, 74(4), 967–1012.
Pfaffermayr, M. (2019): “Gravity models, PPML estimation and the bias of the robust standard errors,” Applied Economics Letters, pp. 1–5.
Phillips, P. C. B., and H. R. Moon (1999): “Linear Regression Limit Theory for Nonstationary Panel Data,” Econometrica, 67(5), 1057–1111.
Santos Silva, J. M. C., and S. Tenreyro (2006): “The Log of Gravity,” Review of Economics and Statistics, 88(4), 641–658.
Stammann, A. (2018): “Fast and feasible estimation of generalized linear models with high-dimensional k-way fixed effects,” arXiv preprint arXiv:1707.01815.
Verdier, V. (2018): “Estimation and inference for linear models with two-way fixed effects and sparsely matched data,” Review of Economics and Statistics, (forthcoming).
Wooldridge, J. M. (1999): “Distribution-free Estimation of Some Nonlinear Panel Data Models,” Journal of Econometrics, 90(1), 77–97.
Woutersen, T. (2002): “Robustness against Incidental Parameters,” Discussion paper, University of Western Ontario, Department of Economics.
Yotov, Y. V., R. Piermartini, J.-A. Monteiro, and M. Larch (2016): An Advanced Guide to Trade Policy Analysis: The Structural Gravity Model. World Trade Organization, Geneva.
[Figure 1 appears here: three kernel density panels over a horizontal axis from 0.8 to 1.2. Panel titles: “Example 1: FE-PPML w/ Overlapping Fixed Effects” (legend: N = 100, 1000, 10000); “FE-PPML Gravity Estimates, T = 1 (two-way model)” and “FE-PPML Gravity Estimates, T = 2 (three-way model)” (legend: N = 20, 50, 100). Overall title: “Simulation Results for Different FE-PPML Models”.]

Figure 1: Kernel density plots of FE-PPML estimates for 3 different models, using 500 replications. Clockwise from top left, the 3 models are: y_it = exp[α_i · 1(t ≤ 2) + γ_i · 1(t ≥ 2) + x_it β] ω_it, with the t dimension of the panel fixed at T = 3; a two-way gravity model with y_ij = exp[α_i + γ_j + x_ij β] ω_ij; a three-way gravity model with y_ijt = exp[α_it + γ_jt + η_ij + x_ijt β] ω_ijt and T = 2. The i and j dimensions of the panel both have size N in the latter two models. The true value of β is 1 (indicated by the vertical dotted lines) and the data is generated using Var(y|·) = E(y|·). See text for further details.

Table 1: Finite-sample Properties of the Three-way FE-PPML Gravity Model

                      N=20                      N=50                      N=100
             T=2     T=5     T=10      T=2     T=5     T=10      T=2     T=5     T=10
A. Gaussian DGP (“DGP I”)
Average bias (×100)
FE-PPML     6.588   3.645   2.297    2.659   1.446   0.904    1.376   0.702   0.410
Analytical  2.193   0.896   0.464    0.326   0.120   0.101    0.071  -0.009  -0.006
Jackknife   0.233  -0.043  -0.334    0.031  -0.038  -0.048   -0.012  -0.058  -0.043

Bias / SE ratio
FE-PPML     0.683   0.778   0.736    0.620   0.709   0.684    0.606   0.659   0.601
Analytical  0.227   0.191   0.149    0.076   0.059   0.076    0.031  -0.008  -0.009
Jackknife   0.024  -0.009  -0.107    0.007  -0.019  -0.036   -0.005  -0.054  -0.063

SE / SD ratio
FE-PPML     0.837   0.826   0.883    0.926   0.936   0.965    0.932   0.921   0.943
Analytical  0.795   0.805   0.863    0.882   0.905   0.954    0.899   0.901   0.936
Jackknife   0.716   0.749   0.802    0.844   0.870   0.923    0.892   0.907   0.937

Coverage probability (should be 0.95 for an unbiased estimator)
FE-PPML     0.828   0.798   0.818    0.878   0.856   0.884    0.888   0.870   0.884
Analytical  0.864   0.876   0.904    0.916   0.928   0.928    0.926   0.944   0.938
Jackknife   0.836   0.858   0.890    0.910   0.912   0.928    0.924   0.936   0.944
B. Poisson DGP (“DGP II”)
Average bias (×100)
FE-PPML     3.774   2.017   1.314    1.517   0.839   0.563    0.721   0.395   0.234
Analytical  1.121   0.459   0.355    0.143   0.079   0.117   -0.038  -0.015  -0.004
Jackknife  -0.090  -0.096  -0.165   -0.012  -0.010   0.024   -0.093  -0.050  -0.031

Bias / SE ratio
FE-PPML     0.683   0.778   0.736    0.620   0.709   0.684    0.606   0.659   0.601
Analytical  0.227   0.191   0.149    0.076   0.059   0.076    0.031  -0.008  -0.009
Jackknife   0.024  -0.009  -0.107    0.007  -0.019  -0.036   -0.005  -0.054  -0.063

SE / SD ratio
FE-PPML     0.875   0.828   0.918    0.959   0.977   0.988    0.962   0.950   0.930
Analytical  0.835   0.806   0.899    0.931   0.959   0.981    0.946   0.944   0.925
Jackknife   0.749   0.750   0.830    0.894   0.920   0.954    0.942   0.938   0.921

Coverage probability (should be 0.95 for an unbiased estimator)
FE-PPML     0.884   0.870   0.902    0.928   0.908   0.918    0.920   0.928   0.904
Analytical  0.884   0.880   0.922    0.940   0.940   0.956    0.936   0.944   0.930
Jackknife   0.856   0.868   0.892    0.922   0.926   0.944    0.934   0.936   0.932
Notes: Results computed using 500 replications. The model being estimated is y_ijt = λ_ijt ω_ijt, where λ_ijt = exp(α_it + γ_jt + η_ij + β x_ijt). The data is generated using α_it ∼ N(0, 1/16), γ_jt ∼ N(0, 1/16), η_ij ∼ N(0, 1/16) and β = 1. x_ijt = x_ij,t−1/2 + α_it + γ_jt + η_ij + ν_ijt, with x_ij0 = η_ij + ν_ij0 and ν_ijt ∼ N(0, 1/16), drawn independently of ω_ijt. The “Gaussian” DGP (panel A) assumes Var(ω_ijt) = λ_ijt^(−2). The “Poisson” DGP (panel B) assumes Var(ω_ijt) = λ_ijt^(−1). SE/SD refers to the ratio of the average standard error of β̂ relative to the standard deviation of β̂ across simulations. Coverage probability refers to the probability that β is covered in the 95% confidence interval for β̂. “Analytical” and “Jackknife” respectively indicate analytical and jackknife bias-corrected FE-PPML estimates. “FE-PPML” indicates uncorrected estimates. SEs allow for within-ij clustering.

Table 2: Finite-sample Properties of the Three-way FE-PPML Gravity Model, DGPs III and IV

                      N=20                      N=50                      N=100
             T=2     T=5     T=10      T=2     T=5     T=10      T=2     T=5     T=10
A. Log-homoscedastic DGP (“DGP III”)
Average bias (×100)
FE-PPML     0.223  -0.291  -0.292    0.161  -0.070  -0.048   -0.033  -0.083  -0.103
Analytical -0.264  -0.356  -0.126   -0.022  -0.049   0.068   -0.126  -0.057  -0.033
Jackknife  -0.580  -0.440  -0.320    0.016  -0.046   0.046   -0.146  -0.076  -0.044

Bias / SE ratio
FE-PPML     0.022  -0.058  -0.087    0.036  -0.032  -0.032   -0.014  -0.072  -0.133
Analytical -0.026  -0.071  -0.038   -0.005  -0.022   0.046   -0.054  -0.049  -0.043
Jackknife  -0.057  -0.088  -0.095    0.004  -0.021   0.031   -0.063  -0.066  -0.057

SE / SD ratio
FE-PPML     0.869   0.797   0.887    0.940   0.953   0.954    0.962   0.934   0.897
Analytical  0.816   0.756   0.837    0.902   0.915   0.920    0.936   0.913   0.872
Jackknife   0.731   0.705   0.778    0.870   0.881   0.903    0.935   0.895   0.862

Coverage probability (should be 0.95 for an unbiased estimator)
FE-PPML     0.902   0.886   0.920    0.948   0.934   0.938    0.942   0.934   0.926
Analytical  0.880   0.864   0.912    0.932   0.924   0.932    0.938   0.932   0.912
Jackknife   0.838   0.838   0.878    0.930   0.908   0.914    0.940   0.916   0.924
B. Quadratic DGP (“DGP IV”)
Average bias (×100)
FE-PPML    -6.544  -6.024  -5.210   -3.341  -3.275  -2.798   -2.305  -2.144  -1.899
Analytical -5.051  -4.412  -3.586   -1.810  -1.831  -1.448   -1.110  -1.025  -0.883
Jackknife  -4.441  -3.836  -3.228   -1.429  -1.574  -1.249   -1.042  -0.961  -0.812

Bias / SE ratio
FE-PPML    -0.544  -0.960  -1.203   -0.597  -1.089  -1.319   -0.750  -1.260  -1.562
Analytical -0.420  -0.703  -0.828   -0.324  -0.609  -0.682   -0.361  -0.603  -0.726
Jackknife  -0.369  -0.612  -0.746   -0.256  -0.523  -0.589   -0.339  -0.565  -0.668

SE / SD ratio
FE-PPML     0.817   0.734   0.787    0.855   0.845   0.842    0.898   0.845   0.805
Analytical  0.753   0.676   0.715    0.788   0.771   0.768    0.830   0.780   0.739
Jackknife   0.670   0.621   0.665    0.751   0.735   0.743    0.823   0.758   0.720

Coverage probability (should be 0.95 for an unbiased estimator)
FE-PPML     0.860   0.750   0.732    0.852   0.756   0.694    0.868   0.698   0.588
Analytical  0.834   0.770   0.786    0.880   0.808   0.810    0.888   0.818   0.782
Jackknife   0.784   0.742   0.744    0.850   0.816   0.820    0.894   0.820   0.770
Notes: Results computed using 500 replications. The model being estimated is y_ijt = λ_ijt ω_ijt, where λ_ijt = exp(α_it + γ_jt + η_ij + β x_ijt). The data is generated using α_it ∼ N(0, 1/16), γ_jt ∼ N(0, 1/16), η_ij ∼ N(0, 1/16) and β = 1. x_ijt = x_ij,t−1/2 + α_it + γ_jt + η_ij + ν_ijt, with x_ij0 = η_ij + ν_ij0 and ν_ijt ∼ N(0, 1/16), drawn independently of ω_ijt. The “Log-homoscedastic” DGP (panel A) assumes Var(ω_ijt) = 1. The “Quadratic” DGP (panel B) assumes ω_ijt is log-normal with variance equal to 0. λ_ijt^(−1) + 0. exp(x_ijt). SE/SD refers to the ratio of the average standard error of β̂ relative to the standard deviation of β̂ across simulations. Coverage probability refers to the probability that β is covered in the 95% confidence interval for β̂. “Analytical” and “Jackknife” respectively indicate analytical and jackknife bias-corrected FE-PPML estimates. “FE-PPML” indicates uncorrected estimates. SEs allow for within-ij clustering.

Table 3: Performance of the Bias-corrected Standard Errors

                      N=20                      N=50                      N=100
             T=2     T=5     T=10      T=2     T=5     T=10      T=2     T=5     T=10
A. Gaussian DGP (“DGP I”)
SE / SD ratio with corrected SEs
FE-PPML     0.938   0.907   0.962    0.980   0.978   1.001    0.963   0.944   0.961
Analytical  0.891   0.884   0.940    0.933   0.946   0.990    0.928   0.923   0.954
Jackknife   0.803   0.823   0.873    0.894   0.909   0.957    0.921   0.929   0.955

Coverage probability with corrected SEs (should be 0.95 for an unbiased estimator)
FE-PPML     0.864   0.830   0.848    0.900   0.876   0.888    0.896   0.876   0.890
Analytical  0.912   0.906   0.918    0.936   0.934   0.950    0.934   0.948   0.948
Jackknife   0.880   0.904   0.908    0.922   0.924   0.936    0.936   0.942   0.952
B. Poisson DGP (“DGP II”)
SE / SD ratio with corrected SEs
FE-PPML     0.977   0.910   1.003    1.009   1.018   1.025    0.988   0.971   0.949
Analytical  0.933   0.886   0.982    0.979   1.000   1.019    0.971   0.965   0.943
Jackknife   0.836   0.825   0.907    0.940   0.959   0.991    0.968   0.958   0.939

Coverage probability with corrected SEs (should be 0.95 for an unbiased estimator)
FE-PPML     0.922   0.892   0.930    0.946   0.922   0.924    0.926   0.940   0.908
Analytical  0.916   0.918   0.938    0.956   0.952   0.962    0.940   0.946   0.934
Jackknife   0.896   0.904   0.916    0.936   0.940   0.946    0.948   0.936   0.934
C. Log-homoscedastic DGP (“DGP III”)
SE / SD ratio with corrected SEs
FE-PPML     0.984   0.896   0.998    1.001   1.013   1.012    0.998   0.967   0.928
Analytical  0.924   0.850   0.940    0.960   0.972   0.977    0.970   0.945   0.902
Jackknife   0.827   0.793   0.875    0.926   0.936   0.959    0.969   0.927   0.891

Coverage probability with corrected SEs (should be 0.95 for an unbiased estimator)
FE-PPML     0.946   0.926   0.946    0.954   0.952   0.956    0.948   0.942   0.938
Analytical  0.920   0.908   0.938    0.948   0.942   0.946    0.944   0.934   0.932
Jackknife   0.896   0.878   0.914    0.950   0.930   0.942    0.948   0.932   0.926
D. Quadratic DGP (“DGP IV”)
SE / SD ratio with corrected SEs
FE-PPML     0.952   0.861   0.929    0.947   0.948   0.949    0.970   0.924   0.882
Analytical  0.877   0.793   0.845    0.873   0.865   0.865    0.897   0.853   0.810
Jackknife   0.781   0.729   0.786    0.833   0.824   0.837    0.889   0.828   0.789

Coverage probability with corrected SEs (should be 0.95 for an unbiased estimator)
FE-PPML     0.900   0.820   0.808    0.894   0.806   0.762    0.892   0.752   0.652
Analytical  0.894   0.830   0.838    0.908   0.864   0.862    0.918   0.856   0.818
Jackknife   0.844   0.808   0.814    0.894   0.860   0.866    0.908   0.848   0.822
Notes: Results computed using 500 replications. The model being estimated is y_ijt = λ_ijt ω_ijt, where λ_ijt = exp(α_it + γ_jt + η_ij + β x_ijt). The data is generated using α_it ∼ N(0, 1/16), γ_jt ∼ N(0, 1/16), η_ij ∼ N(0, 1/16) and β = 1. x_ijt = x_ij,t−1/2 + α_it + γ_jt + η_ij + ν_ijt, with x_ij0 = η_ij + ν_ij0 and ν_ijt ∼ N(0, 1/16), drawn independently of ω_ijt. The “Gaussian” DGP (panel A) assumes Var(ω_ijt) = λ_ijt^(−2). The “Poisson” DGP (panel B) assumes Var(ω_ijt) = λ_ijt^(−1). The “Log-homoscedastic” DGP (panel C) assumes Var(ω_ijt) = 1. The “Quadratic” DGP (panel D) assumes ω_ijt is log-normal with variance equal to 0. λ_ijt^(−1) + 0. exp(x_ijt). SE/SD refers to the ratio of the average standard error of β̂ relative to the standard deviation of β̂ across simulations. Coverage probability refers to the probability that β is covered in the 95% confidence interval for β̂. “Analytical” and “Jackknife” respectively indicate analytical and jackknife bias-corrected FE-PPML estimates. “FE-PPML” indicates uncorrected estimates. SEs allow for within-ij clustering. The corrected SEs correct for first-order finite-sample bias in the estimated variance.

Table 4: FE-PPML Estimates of the Average Effect of an FTA by Industry (N = 169)

                                                     Original estimates    Bias-corrected estimates
Industry                                      Code    β̂        SE          Analytical  Jackknife    SE
Agriculture                                     1    0.100   (0.046)       0.110       0.115      (0.051)
Forestry                                        2   -0.205   (0.125)      -0.199      -0.189      (0.155)
Fishing                                         5    0.128   (0.141)       0.140       0.182      (0.164)
Coal                                           10    0.025   (0.131)      -0.039      -0.063      (0.164)
Metal Ores                                     13    0.040   (0.100)       0.033      -0.025      (0.123)
Other Mining & Quarrying n.e.c.                14    0.048   (0.096)       0.079       0.097      (0.107)
Food & Beverages                               15    0.019   (0.043)       0.026       0.031      (0.048)
Tobacco                                        16    0.535   (0.139)       0.525       0.571      (0.162)
Textiles                                       17    0.228   (0.045)       0.226       0.234      (0.055)
Apparel                                        18    0.092   (0.092)       0.094       0.127      (0.122)
Leather Products                               19    0.224   (0.067)       0.220       0.240      (0.079)
Wood & Cork Products                           20    0.078   (0.109)       0.098       0.101      (0.127)
Paper & Paper Products                         21   -0.002   (0.062)      -0.004      -0.018      (0.071)
Printed & Recorded Media                       22   -0.115   (0.065)      -0.144      -0.180      (0.076)
Coke & Refined Petroleum                       23    0.256   (0.076)       0.291       0.319      (0.090)
Chemicals & Chemical Products                  24    0.073   (0.035)       0.073       0.078      (0.040)
Rubber & Plastic Products                      25    0.141   (0.030)       0.146       0.157      (0.035)
Non-metallic Mineral Products                  26    0.217   (0.049)       0.223       0.225      (0.058)
Basic Metal Products                           27    0.268   (0.100)       0.273       0.301      (0.115)
Fabricated Metal Products (excl. Machinery)    28    0.196   (0.036)       0.206       0.225      (0.041)
Machinery & Equipment n.e.c.                   29    0.049   (0.035)       0.052       0.056      (0.041)
Office, Accounting, and Computer Equipment     30   -0.036   (0.062)      -0.044      -0.045      (0.074)
Electrical Equipment                           31    0.213   (0.045)       0.225       0.240      (0.052)
Communications Equipment                       32   -0.127   (0.067)      -0.143      -0.173      (0.081)
Medical & Scientific Equipment                 33    0.063   (0.039)       0.069       0.082      (0.044)
Motor Vehicles, Trailers & Semi-trailers       34    0.157   (0.064)       0.169       0.194      (0.077)
Other Transport Equipment                      35    0.208   (0.124)       0.231       0.267      (0.137)
Furniture & Other Manufacturing n.e.c.         36    0.224   (0.073)       0.224       0.227      (0.082)
Total                                         All    0.082   (0.027)       0.086       0.087      (0.030)

Notes: These results are computed using ISIC Rev. 3 industry-level trade data for trade between 169 countries during years 1995, 2000, 2005, 2010, & 2015. The original data is from BACI. The model being estimated is y_ijt = exp(α_it + γ_jt + η_ij + β FTA_ijt) ω_ijt, where y_ijt is the trade volume and FTA_ijt is a dummy for the presence of an FTA. α_it, γ_jt, & η_ij respectively denote exporter-time, importer-time, & exporter-importer fixed effects. We estimate each industry separately. The jackknife corrections use the average of 200 randomly-assigned split-panel partitions. SEs are clustered by exporter-importer.

Appendix with proofs
In what follows, we find it convenient to first provide a proof of Proposition 3, which characterizes the asymptotic distribution of β̂ and its asymptotic bias. This proof naturally lends itself to further discussion of the “large T” results from Remarks 2 and 3 as well as the consistency result from Proposition 1, which itself follows as a by-product of Proposition 3. We then demonstrate the uniqueness of this latter result as stated in Proposition 2 and highlight the general inconsistency of other three-way gravity estimators. We also include more details behind the downward bias in the estimated variance.

A.1 Proof of Proposition 3
Known result for two-way fixed effect panel models
Our proof of Proposition 3 relies on results from Fernández-Val and Weidner (2016), denoted FW in the following. That paper considers a standard panel setting where individuals i are observed over time periods t, and mixing conditions (as opposed to conditional independence assumptions) are imposed across time periods. By contrast, we consider a pseudo-panel setting, where the two panel dimensions are labelled by exporters i and importers j, and we impose conditional independence assumptions across both i and j here (see also Dzemski, 2018, who employs those results in a directed network setting where outcomes are binary, and Graham, 2017, for the undirected network case). Given those differences, and before introducing any further complications, we briefly want to restate the main result in FW for the two-way pseudo-panel case. Outcomes Y_ij, i, j = 1, ..., N, conditional on all the strictly exogenous regressors X = (X_ij), fixed effect N-vectors α and γ, and common parameters β are assumed to be generated as

Y_ij | X, α, γ, β ∼ f_Y(· | X_ij, α_i, γ_j, β),

where the conditional distribution f_Y is known, up to the unknown parameters α_i, γ_j ∈ R and β ∈ R^K. It is furthermore assumed that α_i and γ_j enter the distribution function only through the single index π_ij = α_i + γ_j; that is, the log-likelihood can be defined by

ℓ_ij(β, π_ij) = log f_Y(Y_ij | X_ij, α_i, γ_j, β).

The maximum likelihood estimator for β is given by

β̂ = argmax_{β ∈ R^K} max_{α,γ ∈ R^N} L(β, α, γ),    L(β, α, γ) = Σ_{i,j} ℓ_ij(β, α_i + γ_j).

Define the K-vector Ξ_ij with components, k = 1, ..., K,

Ξ_ij,k = α*_{i,k} + γ*_{j,k},    (α*_k, γ*_k) = argmin_{α_{i,k}, γ_{j,k}} Σ_{i,j} E(−∂²_π ℓ_ij) [ E(∂_{β_k} ∂_{α_i} ℓ_ij) / E(∂²_{α_i} ℓ_ij) − α_{i,k} − γ_{j,k} ]²,

where here and in the following all expectations are conditional on the regressors X = (X_ij), and on the parameters α, γ, β.
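For the PPML case studied in the main text, the objects just defined take a simple closed form. The following display is standard Poisson pseudo-likelihood algebra, added here for illustration rather than quoted from the surrounding proof:

```latex
\ell_{ij}(\beta,\pi_{ij}) = y_{ij}\,(x_{ij}'\beta + \pi_{ij}) - \exp(x_{ij}'\beta + \pi_{ij}),
\qquad \lambda_{ij} \equiv \exp(x_{ij}'\beta + \pi_{ij}),

\partial_{\pi}\ell_{ij} = y_{ij} - \lambda_{ij}, \qquad
\partial_{\pi^2}\ell_{ij} = -\lambda_{ij}, \qquad
\partial_{\beta_k}\partial_{\pi}\ell_{ij} = -\lambda_{ij}\, x_{ij,k},

\frac{E(\partial_{\beta_k}\partial_{\alpha_i}\ell_{ij})}{E(\partial^2_{\alpha_i}\ell_{ij})} = x_{ij,k},
\qquad
(\alpha^*_k,\gamma^*_k) = \operatorname*{argmin}_{\alpha_{i,k},\,\gamma_{j,k}}
\sum_{i,j} \lambda_{ij}\,\bigl(x_{ij,k} - \alpha_{i,k} - \gamma_{j,k}\bigr)^2 .
```

That is, for PPML the projection defining Ξ_ij is simply the λ-weighted two-way within transformation of the regressors familiar from weighted least squares.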
For q ∈ {0, 1, 2}, the (within-transformation) differentiation operators D_{βα_i^q} and D_{βγ_j^q} are defined by

D_{βα_i^q} ℓ_ij = ∂_{βα_i^q} ℓ_ij − ∂_{α_i^{q+1}} ℓ_ij Ξ_ij,    D_{βγ_j^q} ℓ_ij = ∂_{βγ_j^q} ℓ_ij − ∂_{γ_j^{q+1}} ℓ_ij Ξ_ij.    (17)

Theorem 1.
Assume that:

(i) Conditional on X, α⁰, γ⁰, β⁰, the outcomes Y_ij are distributed independently across i and j with Y_ij | X, α⁰, γ⁰, β⁰ ∼ exp[ℓ_ij(β⁰, π⁰_ij)], where π⁰_ij = α⁰_i + γ⁰_j.

(ii) The map (β, π) ↦ ℓ_ij(β, π) is four times continuously differentiable, almost surely. All partial derivatives of ℓ_ij(β, π) up to fourth order are bounded in absolute value by a function m(Y_ij, X_ij) > 0, almost surely, uniformly over a convex compact set B ⊂ R^{dim β + 1}, which contains an ε-neighbourhood of (β⁰, π⁰_ij) for all i, j, N, and some ε > 0. Furthermore, max_{i,j} E[m(Y_ij, X_ij)]^ν is uniformly bounded over N, almost surely, for some ν > 0.

(iii) For all N, the function (β, α, γ) ↦ L(β, α, γ) is almost surely strictly concave over R^{K+2N}, apart from one “flat direction” described by the transformation α_i ↦ α_i + c, γ_j ↦ γ_j − c, which leaves L(β, α, γ) unchanged for all c ∈ R. Furthermore, there exist constants b_min and b_max such that for all (β, π) ∈ B,

0 < b_min ≤ −E[∂²_{α_i} ℓ_ij(β, π)] ≤ b_max,

almost surely, uniformly over i, j, N.

In addition, assume that the following limits exist:

B = −lim_{N→∞} (1/N) Σ_i [ Σ_j E( ∂_{α_i} ℓ_ij D_{βα_i} ℓ_ij + (1/2) D_{βα_i²} ℓ_ij ) ] / [ Σ_j E( ∂²_{α_i} ℓ_ij ) ],

D = −lim_{N→∞} (1/N) Σ_j [ Σ_i E( ∂_{γ_j} ℓ_ij D_{βγ_j} ℓ_ij + (1/2) D_{βγ_j²} ℓ_ij ) ] / [ Σ_i E( ∂²_{γ_j} ℓ_ij ) ],

W = −lim_{N→∞} (1/N²) Σ_{i,j} E( ∂_{ββ'} ℓ_ij − ∂²_{α_i} ℓ_ij Ξ_ij Ξ'_ij ),

where expectations are conditional on X, α⁰, γ⁰, β⁰. Finally, assume that W > 0. Then, as N → ∞, we have

√N (β̂ − β⁰) →_d W^{−1} N(B + D, W).

Remarks:

(a) This is just a reformulation of Theorem 4.1 in FW to the case of pseudo-panels, and the proof is provided in that paper. Since we consider only strictly exogenous regressors, all the analysis is conditional on X; and the bias term B simplifies here, since conditional on X (and the other parameters), we assume independence across both i and j. Thus, no Nickell-type bias (Nickell, 1981; Hahn and Kuersteiner, 2002) appears here, but we still have incidental parameter biases because the model is nonlinear (Neyman and Scott, 1948; Hahn and Newey, 2004).

(b) In the original version of this theorem, the sums in the definitions of L(β, α, γ), B, D, and W run over all possible pairs (i, j) ∈ {1, ..., N}². However, for the trade application in the current paper we assume we only have observations for i ≠ j; that is, those sums over i and j only run over the set {(i, j) ∈ {1, ..., N}² : i ≠ j} of N(N − 1) observed country pairs. The sum over j (in B) then also only runs over j ≠ i, and the sum over i (in D) only runs over i ≠ j. It turns out that those changes make no difference to the proof of the theorem, because the proportion of missing observations for each i and j is asymptotically vanishing. For that reason it also does not matter whether we change the 1/N² in W to 1/[N(N − 1)], or whether we replace √N (β̂ − β⁰) by √(N(N − 1)) (β̂ − β⁰). The same equivalence holds throughout our own results for applications in which researchers wish to use observations for which i = j (simply replace N − 1 by N where appropriate).

(c) The above theorem assumes that the log-likelihood ℓ_ij(β, α_i + γ_j) for Y_ij | X, α, γ, β is correctly specified. This is an unrealistic assumption for the PPML estimators in this paper, where we only want to assume that the score of the pseudo-log-likelihood has zero mean at the true parameters; that is, E[∂_β ℓ_ij(β⁰, α⁰_i + γ⁰_j) | X_ij, α⁰_i, γ⁰_j, β⁰] = 0, E[∂_{α_i} ℓ_ij(β⁰, α⁰_i + γ⁰_j) | X_ij, α⁰_i, γ⁰_j, β⁰] = 0 and E[∂_{γ_j} ℓ_ij(β⁰, α⁰_i + γ⁰_j) | X_ij, α⁰_i, γ⁰_j, β⁰] = 0. This extension to “conditional moment models” is discussed in Remark 3 of FW. The statement of the theorem then needs to be changed as follows:

√N (β̂ − β⁰) →_d W^{−1} N(B + D, Ω),    (18)

where the definition of W is unchanged, but the expressions for B = B₁ + B₂, D = D₁ + D₂ and Ω now read

B₁ = −lim_{N→∞} (1/N) Σ_i [ Σ_j E( ∂_{α_i} ℓ_ij D_{βα_i} ℓ_ij ) ] / [ Σ_j E( ∂²_{α_i} ℓ_ij ) ],

B₂ = lim_{N→∞} (1/2N) Σ_i [ Σ_j E( (∂_{α_i} ℓ_ij)² ) ] [ Σ_j E( D_{βα_i²} ℓ_ij ) ] / [ Σ_j E( ∂²_{α_i} ℓ_ij ) ]²,

D₁ = −lim_{N→∞} (1/N) Σ_j [ Σ_i E( ∂_{γ_j} ℓ_ij D_{βγ_j} ℓ_ij ) ] / [ Σ_i E( ∂²_{γ_j} ℓ_ij ) ],

D₂ = lim_{N→∞} (1/2N) Σ_j [ Σ_i E( (∂_{γ_j} ℓ_ij)² ) ] [ Σ_i E( D_{βγ_j²} ℓ_ij ) ] / [ Σ_i E( ∂²_{γ_j} ℓ_ij ) ]²,

Ω = lim_{N→∞} (1/N²) Σ_{i,j} E[ D_β ℓ_ij (D_β ℓ_ij)' ].    (19)

These are the formulas that we have to use as a starting point for the bias results derived in this paper. Our task in the following is to translate and generalize the conditions, statement, and proof of Theorem 1, as extended in (18) and (19), to the case of the three-way PPML estimator discussed in the main text.

Regularity conditions for Proposition 3
The following regularity conditions are required for the statement of Proposition 3 tohold.
Assumption A.
(i) Conditional on $x = (x_{ijt})$, $\alpha = (\alpha_{it})$, $\gamma = (\gamma_{jt})$, $\eta = (\eta_{ij})$ and $\beta$, the outcomes $y_{ij} = (y_{ij,1}, \ldots, y_{ij,T})'$ are distributed independently across $i$ and $j$, and the conditional mean of $y_{ijt}$ is given by equation (4) for all $i$, $j$, $t$.
(ii) The range of $x_{ijt}$, $\alpha_{it}$ and $\gamma_{jt}$ is uniformly bounded, and there exists $\nu > 0$ such that $E(y_{ijt}^{\nu} \mid x_{ijt}, \alpha_{it}, \gamma_{jt}, \eta_{ij})$ is uniformly bounded over $i$, $j$, $t$, $N$.
(iii) $\lim_{N\to\infty} W_N > 0$, with $W_N$ defined in Proposition 3.

Assumption A(i) only restricts the conditional mean and does not require the distribution of $y_{ijt}$ to be correctly specified, as already discussed in remark (c) above. Notice also that this assumption requires conditional independence across $i$ and $j$, but we do not have to restrict the dependence of $y_{ijt}$ over $t$ for our results.

We consider the Poisson log-likelihood in this paper, which after profiling out $\eta_{ij}$ gives the pseudo-log-likelihood function $\ell_{ij}(\beta, \alpha_i, \gamma_j)$ defined in equation (7). This log-likelihood is strictly concave and arbitrarily often differentiable in the parameters, so the corresponding assumptions in Theorem 1 are automatically satisfied. Assumption A(ii) is therefore already sufficient for the corresponding assumptions (ii) and (iii) in Theorem 1. Finally, Assumption A(iii) simply corresponds to the condition $W > 0$, which is just an appropriate non-collinearity condition on the regressors $x_{ijt}$.

Translation to our main text notation
The main difference between Theorem 1 in the Appendix and Proposition 3 in the main text is that Theorem 1 only covers the case where $\pi_{ij} = \alpha_i + \gamma_j$ is a scalar, while in our model in the main text $\alpha_i$, $\gamma_j$ and $\pi_{ij} = \alpha_i + \gamma_j$ are all $T$-vectors. We can impose additional normalizations on those $T$-vectors, because the profile likelihood $L(\beta, \alpha, \gamma)$ in (6) is invariant under parameter transformations $\alpha_i \mapsto \alpha_i + c_i\, \iota_T$ and $\gamma_j \mapsto \gamma_j + d_j\, \iota_T$ for arbitrary $c_i, d_j \in \mathbb{R}$, where $\iota_T = (1, \ldots, 1)'$ is the $T$-vector of ones. (Those invariances correspond to parameter transformations that in the original model could be absorbed by the parameters $\eta_{ij}$.) In the following we choose the normalizations $\iota_T'\alpha_i = 0$ and $\iota_T'\gamma_j = 0$, implying $\iota_T'\pi_{ij} = 0$ for all $i, j$. Accounting for this normalization, we effectively only have $(T-1)$-dimensional fixed effects $\alpha_i$ and $\gamma_j$ for each $i, j$ here. Theorem 1 is therefore directly applicable to the case $T = 2$, but for larger $T$ the bias formulas need to be extended to vector-valued incidental parameters; the corresponding generalization of the derivative operators $D_{\beta\alpha_i^q}$ and $D_{\beta\gamma_j^q}$ in (17) to the case of vector-valued $\alpha_i$ and $\gamma_j$ was described in Section 4.2 of Fernández-Val and Weidner (2018).

Remember the definition of $\ell_{ij}(\beta, \pi_{ij}) = \ell_{ij}(\beta, \alpha_i, \gamma_j)$ and $\tilde x_{ij} := x_{ij} - \alpha^x_i - \gamma^x_j$. Then, by reparameterizing the pseudo-log-likelihood $\ell_{ij}(\beta, \alpha_i, \gamma_j)$ as follows,
\[ \ell^*_{ij}(\beta, \alpha_i, \gamma_j) := \ell_{ij}\big(\beta,\; \pi_{ij} - (\alpha^x_i + \gamma^x_j)\,\beta\big) = \ell_{ij}\big(\beta,\; \alpha_i - \alpha^x_i\,\beta,\; \gamma_j - \gamma^x_j\,\beta\big), \] (20)
one achieves that the expected Hessian of $L^*(\beta, \alpha, \gamma) = \sum_{i,j} \ell^*_{ij}(\beta, \alpha_i, \gamma_j)$ is block-diagonal, in the sense that $E\,\partial_{\beta\alpha_i} L^*(\beta^0, \alpha^0, \gamma^0) = 0$ and $E\,\partial_{\beta\gamma_j} L^*(\beta^0, \alpha^0, \gamma^0) = 0$ — the definition of $\alpha^x_i$ and $\gamma^x_j$ by (10) in the main text exactly corresponds to those block-diagonality conditions. With those definitions, we then have that $D_{\beta\alpha_i^q}\, \ell_{ij} = \partial_{\beta\alpha_i^q}\, \ell^*_{ij} = \tilde x_{ij}\, \partial_{\alpha_i^{q+1}}\, \ell_{ij}$. In particular, we find that our definitions of
\[ W_N = \frac{1}{N(N-1)} \sum_{i=1}^N \sum_{j \in \mathcal{N}\setminus\{i\}} \tilde x_{ij}'\, \bar H_{ij}\, \tilde x_{ij}, \qquad \Omega_N = \frac{1}{N(N-1)} \sum_{i=1}^N \sum_{j \in \mathcal{N}\setminus\{i\}} \tilde x_{ij}'\, \big[\mathrm{Var}\big(S_{ij} \,\big|\, x_{ij}\big)\big]\, \tilde x_{ij}, \]
in Proposition 3 correspond to $-\frac{1}{N(N-1)} \sum_{i,j} E\big(\partial_{\beta\beta'}\ell_{ij} - \partial_{\alpha_i^2}\ell_{ij}\, \Xi_{ij}\Xi_{ij}'\big)$ and $\frac{1}{N(N-1)} \sum_{i,j} E\big[D_\beta\ell_{ij}\, (D_\beta\ell_{ij})'\big]$ in the notation of Theorem 1 and equation (19). Thus, the asymptotic variance in (18) indeed corresponds to the asymptotic variance formula in Proposition 3.

Inverse expected incidental parameter Hessian
The asymptotic bias results that follow require that we first derive some key properties of the expected Hessian with respect to the incidental parameters. Remember the definition of the $2NT$-vector $\phi = \mathrm{vec}(\alpha, \gamma)$ from the main text. The expected incidental parameter Hessian is the $2NT \times 2NT$ matrix given by
\[ \bar H := E\big[-\partial_{\phi\phi'}\, L(\beta^0, \phi^0)\big] = \begin{pmatrix} \bar H^{(\alpha\alpha)} & \bar H^{(\alpha\gamma)} \\ [\bar H^{(\alpha\gamma)}]' & \bar H^{(\gamma\gamma)} \end{pmatrix}, \]
where $L(\beta, \phi) = L(\beta, \alpha, \gamma)$ is defined in (6), and $\bar H^{(\alpha\alpha)}$, $\bar H^{(\alpha\gamma)}$ and $\bar H^{(\gamma\gamma)}$ are $NT \times NT$ submatrices. Here and in the following, all expectations are conditional on all the regressor realizations. The matrix $\bar H^{(\alpha\alpha)} = E[-\partial_{\alpha\alpha'} L(\beta^0, \phi^0)]$ is block-diagonal with $N$ non-zero diagonal $T \times T$ blocks given by $E[-\partial_{\alpha_i\alpha_i'} L(\beta^0, \phi^0)] = \sum_{j \in \mathcal{N}\setminus\{i\}} \bar H_{ij}$, because for $i \ne j$ we have $E[-\partial_{\alpha_i\alpha_j'} L(\beta^0, \phi^0)] = 0$, since the parameters $\alpha_i$ and $\alpha_j$ never enter into the same observation. Analogously, the matrix $\bar H^{(\gamma\gamma)} = E[-\partial_{\gamma\gamma'} L(\beta^0, \phi^0)]$ is block-diagonal with $N$ non-zero diagonal $T \times T$ blocks given by $\sum_{i \in \mathcal{N}\setminus\{j\}} \bar H_{ij}$. By contrast, the matrix $\bar H^{(\alpha\gamma)}$ consists of blocks $E[-\partial_{\alpha_i\gamma_j'} L(\beta^0, \phi^0)] = \bar H_{ij}$ that are non-zero for $i \ne j$, because any two parameters $\alpha_i$ and $\gamma_j$ jointly enter into $T$ observations. The incidental parameter Hessian matrix $\bar H$ therefore has strong diagonal $T \times T$ blocks of order $N$, but also many off-diagonal elements of order one.

The pseudoinverse of $\bar H$ crucially enters the stochastic expansion for $\hat\beta$ below. It is therefore necessary to understand the asymptotic properties of this pseudoinverse $\bar H^\dagger$. The following lemma shows that $\bar H^\dagger$ has a structure analogous to $\bar H$, namely, strong diagonal $T \times T$ blocks of order $1/N$, and much smaller off-diagonal elements of order $1/N^2$. We can write $\bar H = D + G$, where
\[ D := \begin{pmatrix} \bar H^{(\alpha\alpha)} & 0_{NT \times NT} \\ 0_{NT \times NT} & \bar H^{(\gamma\gamma)} \end{pmatrix}, \qquad G := \begin{pmatrix} 0_{NT \times NT} & \bar H^{(\alpha\gamma)} \\ [\bar H^{(\alpha\gamma)}]' & 0_{NT \times NT} \end{pmatrix}. \]
The matrix $D$ is block-diagonal, and its pseudoinverse $D^\dagger$ is therefore also block-diagonal, with $T \times T$ blocks on its diagonal given by $\big(\sum_{j \in \mathcal{N}\setminus\{i\}} \bar H_{ij}\big)^\dagger$, $i = 1, \ldots, N$, and $\big(\sum_{i \in \mathcal{N}\setminus\{j\}} \bar H_{ij}\big)^\dagger$, $j = 1, \ldots, N$.
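As a quick numerical illustration of this block structure (a sketch of ours with made-up $T \times T$ blocks and dimensions, not code from the paper), the pseudoinverse of one of the two block-diagonal halves of $D$ can be computed block by block, and it satisfies the projector identity $D^\dagger D = I \otimes M_{\iota_T}$ used in the proof below:

```python
import numpy as np

rng = np.random.default_rng(1)
T, N = 3, 6  # illustrative sizes

def Hbar_block():
    # Expected Hessian block of the profiled Poisson likelihood:
    # Hbar_ij = Lambda_ij - lambda_ij lambda_ij' / (iota' lambda_ij),
    # which is singular with null vector iota_T (illustrative lambda draws)
    lam = rng.uniform(0.5, 2.0, T)
    return np.diag(lam) - np.outer(lam, lam) / lam.sum()

# Diagonal blocks of (the alpha half of) D: sums over the N-1 partner blocks
diag_blocks = [sum(Hbar_block() for _ in range(N - 1)) for _ in range(N)]
D = np.zeros((N * T, N * T))
for i, B in enumerate(diag_blocks):
    D[i*T:(i+1)*T, i*T:(i+1)*T] = B

Ddag = np.linalg.pinv(D)

# pinv of a block-diagonal matrix is block-diagonal with blockwise pinvs
for i, B in enumerate(diag_blocks):
    assert np.allclose(Ddag[i*T:(i+1)*T, i*T:(i+1)*T], np.linalg.pinv(B))

# Each block's null space is span{iota_T}, so D^dag D = I_N kron M_iota
M_iota = np.eye(T) - np.ones((T, T)) / T
assert np.allclose(Ddag @ D, np.kron(np.eye(N), M_iota))
```

Since each diagonal block sums $N-1$ terms, the blocks of $D$ grow like $N$ and the blocks of $D^\dagger$ shrink like $1/N$, which is the behavior the lemma below exploits.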
Thus, $D^\dagger$ has diagonal elements of order $N^{-1}$. For any matrix $A$ we denote by $\|A\|_{\max}$ the maximum over the absolute values of all elements of $A$.

Lemma 1.
Under Assumption A we have, as $N \to \infty$,
\[ \big\|\bar H^\dagger - D^\dagger\big\|_{\max} = O_P\big(N^{-2}\big). \]

This result is crucial in order to derive the stochastic expansion of $\hat\beta$. Indeed, as we will see below, once Lemma 1 is available, the proof of Proposition 3 is a straightforward extension of the proof of Theorem 4.1 in FW. Lemma 1 is analogous to Lemma D.1 in FW, but our proof strategy for Lemma 1 is different here, because we need to account for the vector-valued nature of $\alpha_i$ and $\gamma_j$, which requires new arguments.

Proof of Lemma 1.

Expansion of $\bar H^\dagger$ in powers of $G$: The matrix $\bar H$ is (minus) the expected Hessian of the profile log-likelihood $L = \sum_{i,j} \ell_{ij}$. Because in that objective function we have already profiled out the fixed-effect parameters $\eta_{ij}$, we have $\bar H_{ij}\, \iota_T = 0$ for all $i, j$, where $\iota_T = (1, \ldots, 1)'$ is the $T$-vector of ones. This implies that
\[ \bar H\, (I_{2N} \otimes \iota_T) = 0. \] (21)
The last equation describes $2N$ zero-eigenvectors of $\bar H$ (i.e. the eigenvalue zero of $\bar H$ has multiplicity at least $2N$). Because the original log-likelihood function of the Poisson model was strictly concave in the single index $x_{ijt}'\beta + \alpha_{it} + \gamma_{jt} + \eta_{ij}$, it must be the case that any additional zero-eigenvalue of $\bar H$ is due to linear transformations of the parameters $\alpha$ and $\gamma$ that leave $\alpha_{it} + \gamma_{jt}$ unchanged for all $i, j, t$. (Notice that any collinearity problem in the likelihood involving the regression parameters $\beta$ is ruled out for sufficiently large sample sizes by our assumption that $\lim_{N\to\infty} W_N > 0$, which guarantees that the expected Hessian with respect to $\beta$ is positive definite asymptotically.) There is exactly one such transformation for each $t \in \{1, \ldots, T\}$, namely the likelihood is invariant under $\alpha_{it} \mapsto \alpha_{it} + c_t$ and $\gamma_{jt} \mapsto \gamma_{jt} - c_t$ for any $c_t \in \mathbb{R}$. The expected Hessian $\bar H$ therefore has additional zero-eigenvectors, which are given by the columns of the $2NT \times T$ matrix
\[ v := (\iota_N', -\iota_N')' \otimes M_{\iota_T}, \] (22)
where $M_{\iota_T} := I_T - P_{\iota_T}$ and $P_{\iota_T} := T^{-1}\, \iota_T\iota_T'$. In the last display we could have used the identity matrix $I_T$ instead of $M_{\iota_T}$, but we want the columns of $v$ to be orthogonal to the zero-eigenvectors already given by (21), which is achieved by using $M_{\iota_T}$. As a consequence of this, we have $\mathrm{rank}(v) = T - 1$; that is, since we already have (21), we only find $T - 1$ additional zero-eigenvectors here, and the total dimension of the null space of $\bar H$ (i.e. the multiplicity of the eigenvalue zero) is equal to $2N + T - 1$. It is easy to verify that indeed
\[ \bar H\, v = 0. \] (23)
Equations (21) and (23) describe all the zero-eigenvectors of $\bar H$. The projector onto the null space of $\bar H$ is therefore given by
\[ P_{\mathrm{null}} := I_{2N} \otimes P_{\iota_T} + P_v, \] (24)
where $P_v = v\, (v'v)^\dagger\, v'$. The Moore-Penrose pseudoinverse of $\bar H$ therefore satisfies
\[ \bar H\, \bar H^\dagger = \bar H^\dagger\, \bar H = I_{2NT} - P_{\mathrm{null}} = M_{(\iota_N', -\iota_N')'} \otimes M_{\iota_T}, \] (25)
where the right-hand side is the projector orthogonal to the null space of $\bar H$ (i.e. the projector onto the span of $\bar H$). The definition of the Moore-Penrose pseudoinverse guarantees that $\bar H^\dagger$ has the same zero-eigenvectors as $\bar H$; that is, we also have $\bar H^\dagger v = 0$ and $\bar H^\dagger (I_{2N} \otimes \iota_T) = 0$. The last equation together with the symmetry of $\bar H^\dagger$ implies that
\[ (I_{2N} \otimes P_{\iota_T})\, \bar H^\dagger = 0. \] (26)
Next, similar to the above argument for $\bar H$, we have that the only zero-eigenvector of the $T \times T$ matrices $\sum_{j \in \mathcal{N}\setminus\{i\}} \bar H_{ij}$ and $\sum_{i \in \mathcal{N}\setminus\{j\}} \bar H_{ij}$ is given by $\iota_T$, and therefore we have
\[ \Big(\sum_{j \in \mathcal{N}\setminus\{i\}} \bar H_{ij}\Big)\Big(\sum_{j \in \mathcal{N}\setminus\{i\}} \bar H_{ij}\Big)^\dagger = M_{\iota_T}, \qquad \Big(\sum_{i \in \mathcal{N}\setminus\{j\}} \bar H_{ij}\Big)\Big(\sum_{i \in \mathcal{N}\setminus\{j\}} \bar H_{ij}\Big)^\dagger = M_{\iota_T}, \]
which can equivalently be written as
\[ D^\dagger D = D\, D^\dagger = I_{2N} \otimes M_{\iota_T} = I_{2NT} - I_{2N} \otimes P_{\iota_T}. \] (27)
Now, using (25) and $\bar H = D + G$ we have
\[ \bar H^\dagger\, (D + G) = I_{2NT} - P_{\mathrm{null}}. \]
Multiplying this with $D^\dagger$ from the right, using (27) and (26), and bringing $\bar H^\dagger G D^\dagger$ to the right-hand side gives
\[ \bar H^\dagger = D^\dagger - P_{\mathrm{null}}\, D^\dagger - \bar H^\dagger\, G\, D^\dagger. \] (28)
By transposing this last equation we obtain
\[ \bar H^\dagger = D^\dagger - D^\dagger\, P_{\mathrm{null}} - D^\dagger\, G\, \bar H^\dagger, \] (29)
and now plugging (28) into the right-hand side of (29) gives
\[ \bar H^\dagger = D^\dagger - D^\dagger P_{\mathrm{null}} - D^\dagger G D^\dagger + D^\dagger G P_{\mathrm{null}} D^\dagger + D^\dagger G \bar H^\dagger G D^\dagger = D^\dagger - D^\dagger G D^\dagger - D^\dagger P_{\mathrm{null}} - P_{\mathrm{null}} D^\dagger + D^\dagger G \bar H^\dagger G D^\dagger, \]
where in the second step we used that $D^\dagger G P_{\mathrm{null}} D^\dagger = -P_{\mathrm{null}} D^\dagger$, which follows from $0 = \bar H P_{\mathrm{null}} = D P_{\mathrm{null}} + G P_{\mathrm{null}}$ by left-multiplication with $D^\dagger$, together with $D^\dagger D P_{\mathrm{null}} = P_v$ and $P_v D^\dagger = P_{\mathrm{null}} D^\dagger$. This expansion argument for $\bar H^\dagger$ so far has followed the proof of Theorem 2 in Jochmans and Weidner (2019).
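The null-space equations (21)-(25) are exact matrix identities and easy to check numerically. The following sketch (illustrative dimensions and $\lambda$ draws of ours, not code from the paper) assembles $\bar H$ for a small toy design and verifies (21), (23), the null-space dimension $2N + T - 1$, and the projector identity (25):

```python
import numpy as np

rng = np.random.default_rng(0)
N, T = 6, 3  # illustrative sizes

# Illustrative Hbar_ij blocks (Poisson profile form), for i != j
H = np.zeros((N, N, T, T))
for i in range(N):
    for j in range(N):
        if i != j:
            lam = rng.uniform(0.5, 2.0, T)
            H[i, j] = np.diag(lam) - np.outer(lam, lam) / lam.sum()

# Assemble Hbar = [[H_aa, H_ag], [H_ag', H_gg]]  (2NT x 2NT)
Haa = np.zeros((N*T, N*T)); Hgg = np.zeros((N*T, N*T)); Hag = np.zeros((N*T, N*T))
for i in range(N):
    for j in range(N):
        if i == j:
            continue
        Haa[i*T:(i+1)*T, i*T:(i+1)*T] += H[i, j]   # block-diagonal alpha part
        Hgg[j*T:(j+1)*T, j*T:(j+1)*T] += H[i, j]   # block-diagonal gamma part
        Hag[i*T:(i+1)*T, j*T:(j+1)*T] = H[i, j]    # off-diagonal alpha-gamma part
Hbar = np.block([[Haa, Hag], [Hag.T, Hgg]])

iota_T = np.ones((T, 1))
M_iota = np.eye(T) - np.ones((T, T)) / T

# (21): Hbar (I_{2N} kron iota_T) = 0
assert np.allclose(Hbar @ np.kron(np.eye(2*N), iota_T), 0)

# (22)-(23): Hbar v = 0 for v = (iota_N', -iota_N')' kron M_iota
v = np.kron(np.concatenate([np.ones((N, 1)), -np.ones((N, 1))]), M_iota)
assert np.allclose(Hbar @ v, 0)

# Null-space dimension is 2N + T - 1
assert np.linalg.matrix_rank(Hbar) == 2*N*T - (2*N + T - 1)

# (24)-(25): Hbar Hbar^dag = I - P_null
P_null = np.kron(np.eye(2*N), np.ones((T, T)) / T) + v @ np.linalg.pinv(v.T @ v) @ v.T
Hdag = np.linalg.pinv(Hbar)
assert np.allclose(Hbar @ Hdag, np.eye(2*N*T) - P_null)
```

The same toy $\bar H$ can also be used to inspect the conclusion of Lemma 1 directly: the elementwise gap between $\bar H^\dagger$ and the blockwise pseudoinverse of its diagonal part shrinks at the rate $1/N^2$ as $N$ grows.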
We furthermore have here that $D^\dagger (I_{2N} \otimes P_{\iota_T}) = 0$, because $\bar H_{ij}\, \iota_T = 0$, implying that $D^\dagger P_{\mathrm{null}} = D^\dagger P_v$. The expansion in the last display therefore becomes
\[ \bar H^\dagger - D^\dagger = -D^\dagger G D^\dagger - D^\dagger P_v - P_v D^\dagger + D^\dagger G \bar H^\dagger G D^\dagger, \] (30)
with the $2NT \times T$ matrix $v$ defined in (22). This expansion is the first key step in the proof of the lemma.

A preliminary bound on the spectral norm of $\bar H^\dagger$: The term $D^\dagger G \bar H^\dagger G D^\dagger$ in the expansion (30) still contains $\bar H^\dagger$ itself. In order to bound contributions from this term we therefore need a preliminary bound on the spectral norm of $\bar H^\dagger$.

The objective function $\ell_{ij}(\beta, \pi_{ij}) := \ell_{ij}(\beta, \alpha_i, \gamma_j)$ in (7) is strictly concave in $\pi_{ij}$, apart from the flat direction given by the invariance $\pi_{ij} \mapsto \pi_{ij} + c_{ij}\, \iota_T$, $c_{ij} \in \mathbb{R}$. This strict concavity together with our Assumption A(ii) that all regressors and parameters are uniformly bounded over $i, j, N, T$ implies that for the $T \times T$ expected Hessian $\bar H_{ij} := E\big[-\partial^2 \ell_{ij} / \partial\pi_{ij}\partial\pi_{ij}'\, (\beta^0, \alpha^0, \gamma^0)\big]$ there exists a constant $b > 0$, uniformly over $i, j, N$, such that
\[ \min_{\{v \in \mathbb{R}^T :\; \iota_T'v = 0,\; \|v\| = 1\}} v'\, \bar H_{ij}\, v \;\ge\; b > 0; \]
that is, $\bar H_{ij}$ is positive definite in all directions orthogonal to $\iota_T$. We can equivalently write this lower bound as
\[ \bar H_{ij} \ge b\, M_{\iota_T}, \] (31)
where $\ge$ means that the difference between the matrices is positive semi-definite.

Next, let $e_i = (0, \ldots, 0, 1, 0, \ldots, 0)'$ be the $i$'th standard unit vector of dimension $N$. For all $i, j \in \mathcal{N} := \{1, \ldots, N\}$ we then have
\[ \partial_\phi\, \pi_{ij}' = \begin{pmatrix} e_i \\ e_j \end{pmatrix} \otimes I_T, \]
which are $2NT \times T$ matrices. Because $L(\beta, \phi) = \sum_{i=1}^N \sum_{j \in \mathcal{N}\setminus\{i\}} \ell_{ij}(\beta, \pi_{ij})$, we thus find that
\[ \bar H = E[-\partial_{\phi\phi'} L] = \sum_{i=1}^N \sum_{j \in \mathcal{N}\setminus\{i\}} \big(\partial_\phi \pi_{ij}'\big)\, E\big[-\partial_{\pi_{ij}\pi_{ij}'}\, \ell_{ij}\big]\, \big(\partial_\phi \pi_{ij}'\big)' = \sum_{i=1}^N \sum_{j \in \mathcal{N}\setminus\{i\}} \bigg[\begin{pmatrix} e_i \\ e_j \end{pmatrix} \otimes I_T\bigg]\, \bar H_{ij}\, \bigg[\begin{pmatrix} e_i \\ e_j \end{pmatrix} \otimes I_T\bigg]' \]
\[ \ge b \sum_{i=1}^N \sum_{j \in \mathcal{N}\setminus\{i\}} \bigg[\begin{pmatrix} e_i \\ e_j \end{pmatrix}\begin{pmatrix} e_i \\ e_j \end{pmatrix}'\bigg] \otimes M_{\iota_T} = b\, \underbrace{\begin{pmatrix} (N-1)\, I_N & \iota_N\iota_N' - I_N \\ \iota_N\iota_N' - I_N & (N-1)\, I_N \end{pmatrix}}_{=:\, Q_N} \otimes\; M_{\iota_T}, \]
where we also used (31).
It is easy to show that for $N > 2$ the $2N \times 2N$ matrix $Q_N$ has an eigenvalue zero with multiplicity one, an eigenvalue $N - 2$ with multiplicity $N - 1$, an eigenvalue $N$ with multiplicity $N - 1$, and an eigenvalue $2(N-1)$ with multiplicity one. Thus, the smallest non-zero eigenvalue of $Q_N$ is $N - 2$. Also, the only zero-eigenvector of $Q_N$ is given by $v_* := (\iota_N', -\iota_N')'$, and therefore we have $Q_N \ge (N-2)\, M_{v_*}$, where $M_{v_*} = I_{2N} - (2N)^{-1} v_* v_*'$ is the projector orthogonal to $v_*$. We therefore have
\[ \bar H \ge b\, (N-2)\; M_{(\iota_N', -\iota_N')'} \otimes M_{\iota_T} = b\, (N-2)\, \big(I_{2NT} - P_{\mathrm{null}}\big), \]
where $P_{\mathrm{null}}$ is the projector onto the null space of $\bar H$, as already defined above. From this it follows that
\[ \bar H^\dagger \le \frac{1}{b\,(N-2)}\, \big(I_{2NT} - P_{\mathrm{null}}\big), \]
and therefore for the spectral norm
\[ \big\|\bar H^\dagger\big\| \le \frac{1}{b\,(N-2)} = O(1/N). \] (32)

Bounding $\|\bar H^\dagger - D^\dagger\|_{\max}$: Using (31) we find
\[ \max_{i \in \mathcal{N}} \Big\| \Big(N^{-1} \sum_{j \in \mathcal{N}\setminus\{i\}} \bar H_{ij}\Big)^\dagger \Big\| = O_P(1), \qquad \max_{j \in \mathcal{N}} \Big\| \Big(N^{-1} \sum_{i \in \mathcal{N}\setminus\{j\}} \bar H_{ij}\Big)^\dagger \Big\| = O_P(1). \]
This together with our boundedness Assumption A(ii) implies that
\[ \big\|D^\dagger\big\|_{\max} = O_P(1/N), \qquad \|G\|_{\max} = O_P(1). \] (33)
The definition of the $2NT \times T$ matrix $v$ in (22) implies that
\[ \|P_v\|_{\max} = \big\|P_{(\iota_N', -\iota_N')'} \otimes M_{\iota_T}\big\|_{\max} \le \big\|P_{(\iota_N', -\iota_N')'}\big\|_{\max} = (2N)^{-1}\, \big\|(\iota_N', -\iota_N')'\,(\iota_N', -\iota_N')\big\|_{\max} = (2N)^{-1} = O(1/N), \] (34)
where we also used that $\|M_{\iota_T}\|_{\max} \le 1$. In the following display, let $e_k = (0, \ldots, 0, 1, 0, \ldots, 0)'$ be the $k$'th standard unit vector of dimension $2NT$. We find that
\[ \big\|G \bar H^\dagger G\big\|_{\max} = \max_{k, \ell \in \{1, \ldots, 2NT\}} \big|e_k'\, G \bar H^\dagger G\, e_\ell\big| \le \Big(\max_{k \in \{1, \ldots, 2NT\}} \|G e_k\|\Big)^2\, \big\|\bar H^\dagger\big\| \le \Big(\sqrt{2NT}\; \|G\|_{\max}\Big)^2\, \big\|\bar H^\dagger\big\| = O_P(1), \] (35)
where the first step is just the definition of $\|\cdot\|_{\max}$, the second step uses the definition of the spectral norm $\|\bar H^\dagger\|$, the third step uses that the norm of the $2NT$-vector $G e_k$ can at most be $\sqrt{2NT}$ times the maximal absolute element of the vector, and the final step uses that $T$ is fixed in our asymptotics, that $\|G\|_{\max} = O_P(1)$, and (32).

Next, for general $2NT \times 2NT$ matrices $A$ and $B$ we have the bound (notice that $\|\cdot\|_{\max}$ is not a matrix norm)
\[ \|A B\|_{\max} \le 2NT\; \|A\|_{\max} \|B\|_{\max}, \]
but because $D^\dagger$ is block-diagonal (with non-zero $T \times T$ blocks on the diagonal) we have for any $2NT \times 2NT$ matrix $A$ the much improved bound
\[ \big\|D^\dagger A\big\|_{\max} \le T\; \big\|D^\dagger\big\|_{\max} \|A\|_{\max}. \]
Applying those inequalities to the expansion of $\bar H^\dagger - D^\dagger$ obtained from (30), and also using (33), (34) and (35), we find that
\[ \big\|\bar H^\dagger - D^\dagger\big\|_{\max} \le T^2\, \big\|D^\dagger\big\|_{\max}^2\, \|G\|_{\max} + 2\, T\, \big\|D^\dagger\big\|_{\max}\, \|P_v\|_{\max} + T^2\, \big\|D^\dagger\big\|_{\max}^2\, \big\|G \bar H^\dagger G\big\|_{\max} = O_P(1/N^2), \]
as $N \to \infty$ (remember that $T$ is fixed in our asymptotics). This is what we wanted to show. $\square$

Proof of Proposition 3
The pseudo-likelihood function of the Poisson model is strictly concave in the single index. Therefore, Assumption A together with Lemma 1 guarantees that the conditions of Theorem B.1 in Fernández-Val and Weidner (2016) are satisfied for the rescaled and penalized objective function
\[ \frac{1}{\sqrt{N(N-1)}}\, L(\beta, \phi) - \phi'\, P_{\mathrm{null}}\, \phi, \]
with $P_{\mathrm{null}}$ defined in (24). Here, the penalty term $\phi' P_{\mathrm{null}} \phi$ guarantees strict concavity in $(\beta, \phi)$. (Since we have a concave objective function, we can apply Theorem B.3 in FW to obtain preliminary convergence results for both $\hat\beta$ and $\hat\phi$. That theorem guarantees that the consistency condition on $\hat\phi(\beta)$ in Assumption (iii) of Theorem B.1 in FW is satisfied under our Assumption A, and it also guarantees $\|\hat\beta - \beta^0\| = O_P(N^{-1/2})$, which is important in order to apply Corollary B.2 in FW to obtain the expansion result in our equation (36).) However, in the following all derivatives of $L(\beta, \phi)$ are evaluated at the true parameters, and since we impose the normalization $P_{\mathrm{null}}\, \phi^0 = 0$, the only derivative of the objective function in $(\beta, \phi)$ where the penalty term gives a non-zero contribution is the incidental parameter Hessian matrix $\bar H = E[-\partial_{\phi\phi'} L(\beta^0, \phi^0)]$, for which the penalty term provides exactly the correct regularization. However, instead of that regularization, we can equivalently use the pseudoinverse; namely, we have
\[ \big(\bar H + c\, P_{\mathrm{null}}\big)^{-1} = \bar H^\dagger + \frac{1}{c}\, P_{\mathrm{null}}, \]
for any $c > 0$. In all expressions below where $\bar H^\dagger$ appears we could equivalently use this regularized inverse, but the additional contributions from the $P_{\mathrm{null}}$ term always vanish, because the gradient of $L(\beta, \phi)$ with respect to $\phi$ is orthogonal to the null space that $P_{\mathrm{null}}$ projects onto.

By applying Theorem B.1 and its Corollary B.2 in FW we thus obtain
\[ \sqrt{N(N-1)}\; \big(\hat\beta - \beta^0\big) = W_N^{-1}\, U_N + o_P(1), \] (36)
where
\[ W_N = -\frac{1}{N(N-1)} \Big(\partial_{\beta\beta'} \bar L + [\partial_{\beta\phi'} \bar L]\, \bar H^\dagger\, [\partial_{\phi\beta'} \bar L]\Big) = -\frac{1}{N(N-1)} \sum_{i=1}^N \sum_{j \in \mathcal{N}\setminus\{i\}} \partial_{\beta\beta'}\, \bar\ell^*_{ij} \]
was already defined in Proposition 3, and we have $U_N := U_N^{(0)} + U_N^{(1)}$, with
\[ U_N^{(0)} = \frac{1}{\sqrt{N(N-1)}} \Big[\partial_\beta L + [\partial_{\beta\phi'} \bar L]\, \bar H^\dagger\, \partial_\phi L\Big] = \frac{1}{\sqrt{N(N-1)}}\; \partial_\beta L^* = \frac{1}{\sqrt{N(N-1)}} \sum_{i=1}^N \sum_{j \in \mathcal{N}\setminus\{i\}} \partial_\beta\, \ell^*_{ij}, \]
\[ \sqrt{N(N-1)}\;\; U_N^{(1)} = [\partial_{\beta\phi'} L - \partial_{\beta\phi'} \bar L]\, \bar H^\dagger\, \partial_\phi L - [\partial_{\beta\phi'} \bar L]\, \bar H^\dagger\, \big[H - \bar H\big]\, \bar H^\dagger\, \partial_\phi L + \frac12 \sum_{g=1}^{\dim\phi} \Big(\partial_{\beta\phi'\phi_g} \bar L + [\partial_{\beta\phi'} \bar L]\, \bar H^\dagger\, [\partial_{\phi\phi'\phi_g} \bar L]\Big)\, \big[\bar H^\dagger \partial_\phi L\big]_g\; \bar H^\dagger\, \partial_\phi L \]
\[ = [\partial_{\beta\phi'} L^* - \partial_{\beta\phi'} \bar L^*]\, \bar H^\dagger\, \partial_\phi L + \frac12 \sum_{g=1}^{\dim\phi} \partial_{\beta\phi'\phi_g} \bar L^*\; \big[\bar H^\dagger \partial_\phi L\big]_g\; \bar H^\dagger\, \partial_\phi L. \]
Here, $\ell^*_{ij}$ was defined in (20), all "bars" denote expectations conditional on $X$ and $\phi$, and all the partial derivatives are evaluated at the true parameters. We also defined $L^*(\beta, \phi) := \sum_{i=1}^N \sum_{j \in \mathcal{N}\setminus\{i\}} \ell^*_{ij}(\beta, \alpha_i, \gamma_j)$. Remember that we use a different scaling of the (profile) likelihood function than FW; namely, in (6) we define $L(\beta, \phi) = \sum_{i=1}^N \sum_{j \in \mathcal{N}\setminus\{i\}} \ell_{ij}(\beta, \alpha_i, \gamma_j)$, while in FW this function would have an additional factor $1/\sqrt{N(N-1)}$. This explains the extra $\sqrt{N(N-1)}$ factors in $W_N$, $U_N^{(0)}$ and $U_N^{(1)}$ as compared to Theorem B.1 in FW.

The score term $\partial_\beta \ell^*_{ij} = \tilde x_{ij}'\, S_{ij}$ has zero mean and finite variance and is independent across $i$ and $j$, conditional on $X$ and $\phi$. By the central limit theorem we thus find
\[ U_N^{(0)} \Rightarrow \mathcal{N}(0, \Omega_N), \]
where
\[ \Omega_N = \frac{1}{N(N-1)} \sum_{i=1}^N \sum_{j \in \mathcal{N}\setminus\{i\}} \mathrm{Var}\big(\partial_\beta \ell^*_{ij} \,\big|\, x_{ij}\big) = \frac{1}{N(N-1)} \sum_{i=1}^N \sum_{j \in \mathcal{N}\setminus\{i\}} \tilde x_{ij}'\, \big[\mathrm{Var}\big(S_{ij} \,\big|\, x_{ij}\big)\big]\, \tilde x_{ij}. \]
Thus, the term $U_N^{(0)}$ only contributes variance to the asymptotic distribution of $\hat\beta$, but no asymptotic bias. By contrast, the term $U_N^{(1)}$ only contributes bias to the asymptotic distribution of $\hat\beta$, but no variance. Namely, one finds that
\[ U_N^{(1)} \to_p B_N + D_N, \] (37)
with $B_N$ and $D_N$ as given in the proposition. The proof of (37) is exactly analogous to the corresponding discussion of those terms in the proof of Theorem 4.1 in FW, which we restated above as Theorem 1 (remember that for $T = 2$ our result here is indeed just a special case of Theorem 4.1 in FW). Therefore, instead of repeating those derivations here, we provide in the following a slightly less rigorous, but much easier to follow, derivation of those bias terms.

Derivation of the asymptotic bias in Proposition 3
Remember that the main difference between Theorem 1 and our case here is that for us the incidental parameters $\alpha_i$ and $\gamma_j$ are $T$-vectors, while in Theorem 1 the index $\pi_{ij} = \alpha_i + \gamma_j$ is just a scalar. An easy way to generalize the asymptotic bias formulas in Theorem 1 and display (19) to vector-valued incidental parameters is to use a suitable parameterization for the incidental parameters $\alpha_i$ and $\gamma_j$. The formulas for $B_1$ and $D_1$ can most easily be generalized by parameterizing the incidental parameters as follows:
\[ \alpha_i = A_i\, \tilde\alpha_i, \qquad \gamma_j = C_j\, \tilde\gamma_j, \] (38)
where $\tilde\alpha_i$ and $\tilde\gamma_j$ are $(T-1)$-vectors, and $A_i$ and $C_j$ are $T \times (T-1)$ matrices that satisfy
\[ A_i A_i' = \Big(\sum_j \bar H_{ij}\Big)^\dagger, \qquad C_j C_j' = \Big(\sum_i \bar H_{ij}\Big)^\dagger. \] (39)
Let $\tilde L(\beta, \tilde\alpha, \tilde\gamma) = L\big(\beta, (A_i \tilde\alpha_i)_i, (C_j \tilde\gamma_j)_j\big)$. This reparameterization guarantees that
\[ -\frac{\partial^2\, E\,\tilde L(\beta^0, \tilde\alpha^0, \tilde\gamma^0)}{(\partial\tilde\alpha_i)(\partial\tilde\alpha_i')} = A_i' \Big(\sum_j \bar H_{ij}\Big) A_i = I_{T-1}, \qquad -\frac{\partial^2\, E\,\tilde L(\beta^0, \tilde\alpha^0, \tilde\gamma^0)}{(\partial\tilde\gamma_j)(\partial\tilde\gamma_j')} = C_j' \Big(\sum_i \bar H_{ij}\Big) C_j = I_{T-1}. \] (40)
That is, the expected Hessian matrix with respect to the incidental parameters $\tilde\alpha_i$ and $\tilde\gamma_j$ is normalized to be an identity matrix under that parameterization. It can be shown that this implies that the incidental parameter biases $B_1$ and $D_1$ "decouple" across the $T-1$ components of $\tilde\alpha_i$ and $\tilde\gamma_j$; that is, the total contribution to the incidental parameter bias of $\hat\beta$ just becomes a sum over $T-1$ terms of the form of $B_1$ and $D_1$ in (19). Thus, for $k \in \{1, \ldots, K\}$ we have
\[ B_{1,k} = \sum_{q=1}^{T-1} -\frac{1}{N} \sum_i \frac{\sum_j E\big(\partial_{\tilde\alpha_{i,q}}\ell_{ij}\; D_{\beta_k\tilde\alpha_{i,q}}\ell_{ij}\big)}{\sum_j E\big(\partial_{\tilde\alpha_{i,q}^2}\ell_{ij}\big)} = \sum_{q=1}^{T-1} \frac{1}{N} \sum_{i,j} E\big(\partial_{\tilde\alpha_{i,q}}\ell_{ij}\; D_{\beta_k\tilde\alpha_{i,q}}\ell_{ij}\big) = \frac{1}{N} \sum_{i,j} E\big[(\partial_{\tilde\alpha_i}\ell_{ij})'\, (D_{\beta_k\tilde\alpha_i}\ell_{ij})\big] \]
\[ = \frac{1}{N} \sum_{i,j} E\big[(\partial_{\alpha_i}\ell_{ij})'\, A_i A_i'\, (D_{\beta_k\alpha_i}\ell_{ij})\big] = -\frac{1}{N} \sum_{i,j} E\bigg[S_{ij}'\, \Big(\sum_j \bar H_{ij}\Big)^\dagger H_{ij}\; \tilde x_{ij,k}\bigg], \]
where in the second step we used that $\sum_j E\big(\partial_{\tilde\alpha_{i,q}^2}\ell_{ij}\big) = -1$ according to (40), in the third step we rewrote the sum over $q \in \{1, \ldots, T-1\}$ in terms of the inner product of the $(T-1)$-vectors $\partial_{\tilde\alpha_i}\ell_{ij}$ and $D_{\beta_k\tilde\alpha_i}\ell_{ij}$, in the fourth step we used that $\alpha_i = A_i\tilde\alpha_i$, and in the final step we used (39) and the definitions of $S_{ij}$, $H_{ij}$ and $\tilde x_{ij,k}$. All expectations here are conditional on $X$ (in the main text we always make that conditioning explicit), and $\bar H_{ij}$ and $\tilde x_{ij,k}$ are non-random conditional on $X$; that is, we can also write this last expression as
\[ B_{1,k} = -\frac{1}{N} \sum_i \mathrm{Tr}\bigg[\Big(\sum_j \bar H_{ij}\Big)^\dagger\, \sum_j E\big(H_{ij}\; \tilde x_{ij,k}\; S_{ij}'\big)\bigg]. \]
Analogously,
\[ D_{1,k} = -\frac{1}{N} \sum_{i,j} E\bigg[S_{ij}'\, \Big(\sum_i \bar H_{ij}\Big)^\dagger H_{ij}\; \tilde x_{ij,k}\bigg]. \]
Next, to generalize the incidental parameter biases $B_2$ and $D_2$ in (19) to vector-valued $\alpha_i$ and $\gamma_j$, we again make a transformation (38), but this time we choose
\[ A_i A_i' = \Big(\sum_j \bar H_{ij}\Big)^\dagger \bigg[\sum_j E\big(S_{ij} S_{ij}' \,\big|\, x_{ij}\big)\bigg] \Big(\sum_j \bar H_{ij}\Big)^\dagger, \qquad C_j C_j' = \Big(\sum_i \bar H_{ij}\Big)^\dagger \bigg[\sum_i E\big(S_{ij} S_{ij}' \,\big|\, x_{ij}\big)\bigg] \Big(\sum_i \bar H_{ij}\Big)^\dagger. \] (41)
Notice that for a correctly specified likelihood we have the Bartlett identity $\bar H_{ij} = E\big(S_{ij} S_{ij}' \mid x_{ij}\big)$, implying that (39) and (41) are identical for correctly specified likelihoods. In general, however, the transformation now is different. Instead of normalizing the Hessian matrices to be identities, as in (40), the new transformation defined by (41) guarantees that
\[ \mathrm{AsyVar}\big(\hat{\tilde\alpha}_i\big) = \bigg[-\frac{\partial^2\, E\,\tilde L}{(\partial\tilde\alpha_i)(\partial\tilde\alpha_i')}\bigg]^\dagger\, \mathrm{Var}\bigg(\frac{\partial \tilde L}{\partial\tilde\alpha_i} \,\bigg|\, X\bigg)\, \bigg[-\frac{\partial^2\, E\,\tilde L}{(\partial\tilde\alpha_i)(\partial\tilde\alpha_i')}\bigg]^\dagger = I_{T-1}, \qquad \mathrm{AsyVar}\big(\hat{\tilde\gamma}_j\big) = I_{T-1}, \] (42)
with all derivatives evaluated at $(\beta^0, \tilde\alpha^0, \tilde\gamma^0)$. Again, it can be shown that with this normalization the incidental parameter bias contributions $B_2$ and $D_2$ "decouple"; that is, each component of $\hat{\tilde\alpha}_i$ contributes an incidental parameter bias of the form $B_2$ in (19) to $\hat\beta$, and each component of $\hat{\tilde\gamma}_j$ contributes an incidental parameter bias of the form $D_2$ in (19) to $\hat\beta$. The total contribution thus reads, for $k \in \{1, \ldots, K\}$,
\[ B_{2,k} = \sum_{q=1}^{T-1} \frac{1}{2}\frac{1}{N} \sum_i \frac{\big[\sum_j E\big((\partial_{\tilde\alpha_{i,q}}\ell_{ij})^2\big)\big]\, \sum_j E\big(D_{\beta_k\tilde\alpha_{i,q}^2}\ell_{ij}\big)}{\big[\sum_j E\big(\partial_{\tilde\alpha_{i,q}^2}\ell_{ij}\big)\big]^2} = \sum_{q=1}^{T-1} \frac{1}{2}\frac{1}{N} \sum_{i,j} E\big(D_{\beta_k\tilde\alpha_{i,q}^2}\ell_{ij}\big) = \frac{1}{2}\frac{1}{N} \sum_{i,j} \mathrm{Tr}\big[E\big(D_{\beta_k\tilde\alpha_i\tilde\alpha_i'}\ell_{ij}\big)\big] \]
\[ = \frac{1}{2}\frac{1}{N} \sum_{i,j} \mathrm{Tr}\big[A_i'\, E\big(D_{\beta_k\alpha_i\alpha_i'}\ell_{ij}\big)\, A_i\big] = \frac{1}{2N} \sum_i \mathrm{Tr}\bigg[\Big(\sum_j \bar G_{ij}\, \tilde x_{ij,k}\Big) \Big(\sum_j \bar H_{ij}\Big)^\dagger \bigg[\sum_j E\big(S_{ij} S_{ij}' \,\big|\, x_{ij}\big)\bigg] \Big(\sum_j \bar H_{ij}\Big)^\dagger\bigg], \]
where in the second step we used that $\big[\sum_j E\big((\partial_{\tilde\alpha_{i,q}}\ell_{ij})^2\big)\big] \big/ \big[\sum_j E\big(\partial_{\tilde\alpha_{i,q}^2}\ell_{ij}\big)\big]^2 = 1$ according to (42), in the third step we rewrote the sum over $q \in \{1, \ldots, T-1\}$ as a trace over the $(T-1) \times (T-1)$ matrix of third-order partial derivatives $E\big(D_{\beta_k\tilde\alpha_i\tilde\alpha_i'}\ell_{ij}\big)$, in the fourth step we used that $\alpha_i = A_i\tilde\alpha_i$, and in the final step we used the cyclicity of the trace, (41), and the definitions of $\bar G_{ij}$, $\tilde x_{ij,k}$, and the tensor-vector product $\bar G_{ij}\, \tilde x_{ij,k}$ (which, recall, is a $T \times T$ matrix).

Analogously we find
\[ D_{2,k} = \sum_{q=1}^{T-1} \frac{1}{2}\frac{1}{N} \sum_j \frac{\big[\sum_i E\big((\partial_{\tilde\gamma_{j,q}}\ell_{ij})^2\big)\big]\, \sum_i E\big(D_{\beta_k\tilde\gamma_{j,q}^2}\ell_{ij}\big)}{\big[\sum_i E\big(\partial_{\tilde\gamma_{j,q}^2}\ell_{ij}\big)\big]^2} = \frac{1}{2N} \sum_j \mathrm{Tr}\bigg[\Big(\sum_i \bar G_{ij}\, \tilde x_{ij,k}\Big) \Big(\sum_i \bar H_{ij}\Big)^\dagger \bigg[\sum_i E\big(S_{ij} S_{ij}' \,\big|\, x_{ij}\big)\bigg] \Big(\sum_i \bar H_{ij}\Big)^\dagger\bigg]. \]
We have thus translated all the formulas in Theorem 1 and in display (19) to the case of vector-valued $\alpha_i$ and $\gamma_j$, and we find exactly the expressions for the asymptotic biases $B_N^k = B_{1,k} + B_{2,k}$ and $D_N^k = D_{1,k} + D_{2,k}$ in Proposition 3.

Rewriting the bias expressions as in Remarks 2 and 3
Remember that $E(y_{ijt} \mid x_{ijt}, \alpha_{it}, \gamma_{jt}) = \lambda_{ijt} := \exp(x_{ijt}'\beta + \alpha_{it} + \gamma_{jt})$ and $\vartheta_{ijt} := \lambda_{ijt} / \sum_\tau \lambda_{ij\tau}$, and denote the corresponding $T$-vectors by $y_{ij}$, $\lambda_{ij}$ and $\vartheta_{ij}$. It is convenient to define the $T \times T$ matrices
\[ \Lambda_{ij} := \mathrm{diag}(\lambda_{ij}), \qquad M_{ij} := I_T - \frac{\lambda_{ij}\, \iota_T'}{\iota_T'\, \lambda_{ij}} = I_T - \vartheta_{ij}\, \iota_T', \]
where $M_{ij}$ is the unique idempotent $T \times T$ matrix (i.e. $M_{ij} M_{ij} = M_{ij}$) that satisfies $\mathrm{rank}(M_{ij}) = T - 1$, $M_{ij}\, \lambda_{ij} = 0$, and $\iota_T'\, M_{ij} = 0$. Notice also that $\lambda_{ij} = \Lambda_{ij}\, \iota_T$, and therefore $M_{ij}\, \Lambda_{ij} = \Lambda_{ij}\, M_{ij}'$. We then have
\[ S_{ij} = M_{ij}\; y_{ij}, \qquad \bar H_{ij} = M_{ij}\, \Lambda_{ij}\, M_{ij}' = M_{ij}\, \Lambda_{ij} = \Lambda_{ij}\, M_{ij}' = \Lambda_{ij} - \frac{\lambda_{ij}\, \lambda_{ij}'}{\iota_T'\, \lambda_{ij}}, \]
\[ H_{ij} = \bar H_{ij}\; \bigg(\frac{\iota_T'\, y_{ij}}{\iota_T'\, \lambda_{ij}}\bigg), \qquad \bar G_{ij,tsr} = -\sum_{u=1}^T \lambda_{ij,u}\, M_{ij,tu}\, M_{ij,su}\, M_{ij,ru}, \]
where $t, s, r \in \{1, \ldots, T\}$.

Next, define $\tilde x^*_{ij,k} := M_{ij}'\, \tilde x_{ij,k}$. Noting that $\lambda_{ij}'\, \tilde x^*_{ij,k} = 0$, we find
\[ W_{N,k\ell} = \frac{1}{N(N-1)} \sum_{i,j} \tilde x^{*\prime}_{ij,k}\; \Lambda_{ij}\; \tilde x^*_{ij,\ell} = \frac{1}{N(N-1)} \sum_{i,j,t} \lambda_{ijt}\; \tilde x^*_{ijt,k}\; \tilde x^*_{ijt,\ell}. \]
This shows that $W_N$ has an additional sum over $t$, so $W_N$ increases linearly in $T$, and $W_N^{-1} = O(T^{-1})$ for $T \to \infty$.

Now, also define $D_{ij,k} := \mathrm{diag}\big[(\lambda_{ijt}\, \tilde x^*_{ijt,k})_{t=1,\ldots,T}\big]$, which is the diagonal $T \times T$ matrix with diagonal entries $\lambda_{ijt}\, \tilde x^*_{ijt,k}$. The first-order conditions of the optimization problem that defines $\tilde x_{ij,k}$ read
\[ \sum_i \bar H_{ij}\, \tilde x_{ij,k} = 0, \qquad \sum_j \bar H_{ij}\, \tilde x_{ij,k} = 0, \qquad \text{or equivalently} \qquad \sum_i \Lambda_{ij}\, \tilde x^*_{ij,k} = 0, \qquad \sum_j \Lambda_{ij}\, \tilde x^*_{ij,k} = 0, \]
which can also be written as
\[ \sum_i D_{ij,k} = 0, \qquad \sum_j D_{ij,k} = 0. \] (43)
These first-order conditions are only important to simplify the term $B_{2,k}$ in what follows.
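The matrix identities just listed are all exact and cheap to verify numerically. A short sketch (with an arbitrary made-up $\lambda_{ij}$ vector; the pair index is dropped):

```python
import numpy as np

rng = np.random.default_rng(2)
T = 4
lam = rng.uniform(0.5, 2.0, T)                 # illustrative lambda_ij vector
theta = lam / lam.sum()                        # vartheta_ij
M = np.eye(T) - np.outer(theta, np.ones(T))    # M_ij = I - vartheta iota'

# M_ij is idempotent with rank T-1, M_ij lambda_ij = 0 and iota' M_ij = 0
assert np.allclose(M @ M, M)
assert np.linalg.matrix_rank(M) == T - 1
assert np.allclose(M @ lam, 0)
assert np.allclose(np.ones(T) @ M, 0)

# Equivalent expressions for Hbar_ij
Hbar = np.diag(lam) - np.outer(lam, lam) / lam.sum()
assert np.allclose(M @ np.diag(lam), Hbar)            # M Lambda
assert np.allclose(np.diag(lam) @ M.T, Hbar)          # Lambda M'
assert np.allclose(M @ np.diag(lam) @ M.T, Hbar)      # M Lambda M'
assert np.allclose(Hbar @ np.ones(T), 0)              # Hbar iota = 0

# S_ij = M_ij y_ij has zero mean for any multiplicative pair effect eta_ij,
# since E[y_ij] is proportional to lambda_ij and M_ij lambda_ij = 0
eta = 0.7
assert np.allclose(M @ (np.exp(eta) * lam), 0)
```

Note that $M_{ij}$ is not symmetric, which is why $\bar H_{ij}$ equals $M_{ij}\Lambda_{ij}$ and $\Lambda_{ij}M_{ij}'$ but not $\Lambda_{ij}M_{ij}$.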
We have
\[ B_{1,k} = -\frac{1}{N} \sum_{i,j} \frac{E\big[(\iota_T'\, y_{ij})\, S_{ij}'\big]}{\iota_T'\, \lambda_{ij}} \Big(\sum_j \bar H_{ij}\Big)^\dagger \Lambda_{ij}\, \tilde x^*_{ij,k} = -\frac{1}{N(N-1)} \sum_{i,j} \frac{\iota_T'\, \mathrm{Var}(y_{ij})\, M_{ij}'}{\iota_T'\, \lambda_{ij}} \bigg[\frac{1}{N-1}\sum_j \bar H_{ij}\bigg]^\dagger \Lambda_{ij}\, M_{ij}'\, \tilde x_{ij,k}, \]
\[ B_{2,k} = -\frac{1}{2N} \sum_i \mathrm{Tr}\bigg[\Big(\sum_j M_{ij}\, D_{ij,k}\, M_{ij}'\Big) \Big(\sum_j \bar H_{ij}\Big)^\dagger \Big(\sum_j M_{ij}\, \mathrm{Var}(y_{ij})\, M_{ij}'\Big) \Big(\sum_j \bar H_{ij}\Big)^\dagger\bigg] \]
\[ = \frac{1}{N(N-1)} \sum_{i,j} \bigg[\frac{\lambda_{ij}'\, Q_i\, \Lambda_{ij}\, \tilde x^*_{ij,k}}{\iota_T'\, \lambda_{ij}} - \frac{\big(\lambda_{ij}'\, \tilde x^*_{ij,k}\big)\big(\lambda_{ij}'\, Q_i\, \lambda_{ij}\big)}{\big(\iota_T'\, \lambda_{ij}\big)^2}\bigg] = \frac{1}{N(N-1)} \sum_{i,j} \frac{\lambda_{ij}'\, Q_i\, \Lambda_{ij}\, M_{ij}'\, \tilde x_{ij,k}}{\iota_T'\, \lambda_{ij}}, \]
where in the second step we used the properties of $M_{ij}$, the first-order conditions (43), that $\iota_T'\, D_{ij,k}\, \iota_T = \lambda_{ij}'\, \tilde x^*_{ij,k}$, and that $D_{ij,k}\, \iota_T = \Lambda_{ij}\, \tilde x^*_{ij,k}$; and in the last step we used that $\Lambda_{ij}\, \tilde x^*_{ij,k} = \Lambda_{ij}\, M_{ij}'\, \tilde x_{ij,k}$ and $\lambda_{ij}'\, \tilde x^*_{ij,k} = 0$. We also used the definition of $Q_i$ given in Remark 2.

We then have for $B_N^k = B_{1,k} + B_{2,k}$ that
\[ B_N^k = -\frac{1}{N(N-1)} \sum_{i,j} \frac{T^{-1}\, \iota_T'\, R_{ij}\, \tilde x_{ij,k}}{T^{-1}\, \iota_T'\, \lambda_{ij}} + \frac{1}{N(N-1)} \sum_{i,j} \frac{T^{-1}\, \lambda_{ij}'\, Q_i\, \Lambda_{ij}\, M_{ij}'\, \tilde x_{ij,k}}{T^{-1}\, \iota_T'\, \lambda_{ij}}, \]
where we have now also used the definition of $R_{ij}$ from Remark 2 in order to simplify $B_{1,k}$. Under appropriate regularity conditions, the $T \times T$ matrices $Q_i$ and $R_{ij}$ each maintain diagonal elements of order one and off-diagonal elements of order $1/T$ through their dependence on $\mathrm{Var}(y_{ij})$. Therefore, all the numerators and denominators in the last expression for $B_N^k$ remain of order one as $T \to \infty$, such that $B_N^k = O(1)$ as $T \to \infty$, with an analogous result also following for $D_N^k$. Recalling that $W_N$ increases linearly with $T$, we thus conclude that the bias term $W_N^{-1}\big(B_N + D_N\big)\, [N(N-1)]^{-1/2}$ is of order $1/(NT)$ as both $N$ and $T$ grow large.

Comment on Proposition 1
We note that the consistency result from Proposition 1 also follows from the above proofof Proposition 3.
Remark 4. If the asymptotic bias in $\hat\beta$ is characterized by Proposition 3, then $\beta$ is consistently estimated as $N \to \infty$.

As we have noted in the text, for this consistency result to hold, we need for the two-way profile score in (9) to be unbiased at the true parameters $(\beta, \alpha, \gamma)$. In particular, we need for there to be no incidental parameter bias term of order $1/T$ associated with the pair fixed effect $\eta_{ij}$. As the following discussion clarifies, the FE-PPML model is quite special in this regard.

A.2 Proof of Proposition 2
To prove Proposition 2, it will first be useful to prove the following lemma:

Lemma 2. Consider the class of "one-way" FE-PML panel estimators with conditional means given by $\lambda_{it} := \exp(x_{it}'\beta + \alpha_i)$ and first-order conditions given by
\[ \hat\beta:\quad \sum_{i=1}^N \sum_{t=1}^T x_{it}\, \big(y_{it} - \hat\lambda_{it}\big)\, g\big(\hat\lambda_{it}\big) = 0, \qquad \hat\alpha_i:\quad \sum_{t=1}^T \big(y_{it} - \hat\lambda_{it}\big)\, g\big(\hat\lambda_{it}\big) = 0, \]
where $i = 1, \ldots, N$, $t = 1, \ldots, T$, and $g(\hat\lambda_{it})$ is an arbitrary positive function of $\hat\lambda_{it}$. If $T$ is small, $\hat\beta$ is only consistent under general assumptions about $\mathrm{Var}(y \mid x, \alpha)$ if $g(\lambda)$ is constant over the range of $\lambda$'s that are realized in the data-generating process.

Put simply, if Lemma 2 holds, then no other FE-PML estimator of the form described in Proposition 2 aside from FE-PPML can be consistent under general assumptions about the conditional variance $\mathrm{Var}(y \mid x, \alpha, \gamma, \eta)$. We have already shown that the three-way FE-PPML estimator is generally consistent regardless of the conditional variance. Thus, if we can prove Lemma 2, Proposition 2 follows directly.

Proof of Lemma 2.
Our strategy here will be to adopt a specific parameterization for the conditional variance $\mathrm{Var}(y|x,\alpha)$ and then examine the conditions under which $\hat\beta$ is sensitive to small changes in the conditional variance. If $\hat\beta$ depends on $\mathrm{Var}(y|x,\alpha)$ even for large $N$, then it is not possible for $\hat\beta$ to be consistent under general assumptions about $\mathrm{Var}(y|x,\alpha)$.

To proceed, let the true data-generating process be given by $y_{it} = \lambda_{it}\,\omega_{it}$, where $\lambda_{it}$ is the true conditional mean and

\[
\omega_{it} := \exp\left[-\tfrac{1}{2}\ln\big(1+\lambda_{it}^{\rho}\big) + \sqrt{\ln\big(1+\lambda_{it}^{\rho}\big)}\;z_{it}\right] \tag{44}
\]

with $z_{it}$ a randomly generated variable distributed $N(0,1)$. $\omega_{it}$ is therefore a heteroscedastic multiplicative disturbance that follows a log-normal distribution with $\mathbb{E}[\omega_{it}] = 1$ and $\mathrm{Var}(\omega_{it}) = \lambda_{it}^{\rho}$. The conditional mean of $y_{it}$ is in turn given by $\mathbb{E}[y_{it}|x,\alpha] = \lambda_{it}$ and the conditional variance is given by $\mathrm{Var}(y_{it}|x,\alpha) = \mathrm{Var}(y_{it}|\lambda_{it}) = \lambda_{it}^{2}\,\mathrm{Var}(\omega_{it}) = \lambda_{it}^{\rho+2}$. Our focus is the exponent $\rho$, which governs the nature of the heteroscedasticity and can be any real number. With this in mind, it is useful to document the following results:

\[
\mathbb{E}\!\left[\frac{\partial\omega_{it}}{\partial\rho}\right] = \frac{\partial\,\mathbb{E}[\omega_{it}]}{\partial\rho} = 0, \tag{45}
\]
\[
\mathbb{E}\!\left[\frac{\partial(\omega_{it}^{2})}{\partial\rho}\right] = 2\,\mathbb{E}\!\left[\omega_{it}\frac{\partial\omega_{it}}{\partial\rho}\right] = \frac{\partial\,\mathbb{E}[\omega_{it}^{2}]}{\partial\rho} = \frac{\partial\,\mathrm{Var}[\omega_{it}]}{\partial\rho} = \lambda_{it}^{\rho}\ln\lambda_{it} \neq 0. \tag{46}
\]

Put another way, the expected value of the change in $\omega_{it}$ with respect to $\rho$ must always be zero because $\mathbb{E}[\omega_{it}] = 1$ regardless of $\rho$. Similarly, the expected change in the second moment of $\omega_{it}$ must be $\lambda_{it}^{\rho}\ln\lambda_{it}$ because this gives the change in the variance of $\omega_{it}$.

To facilitate the rest of the proof, we invoke the following conceit: the random disturbance term $z_{it}$, once drawn from $N(0,1)$, is treated as known and fixed, such that each $\omega_{it}$ may be treated as a known transformation of the underlying value for $z_{it}$ given by (44). Among other things, this means we can always treat the partial derivatives $\partial\omega_{it}/\partial\rho$ and $\partial y_{it}/\partial\rho = \lambda_{it}\,\partial\omega_{it}/\partial\rho$ as well-defined; similarly, we can treat the estimated parameters $\hat\beta$ and $\hat\alpha_i$ as deterministic functions of the variance parameter $\rho$ with well-defined total derivatives $d\hat\beta/d\rho$ and $d\hat\alpha_i/d\rho$. That is, for a given draw of $z_{it}$'s, we can perturb how the corresponding $\omega_{it}$'s are generated and consider comparative statics for how estimates are affected. If $\hat\beta$ is consistent regardless of the variance assumption used to generate $\omega_{it}$, then small changes in $\rho$ should have no effect on $\hat\beta$ asymptotically.
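To make the construction in (44) concrete, here is a small Monte Carlo check (our own illustration, not part of the original analysis) confirming that the disturbance delivers $\mathbb{E}[\omega_{it}] = 1$ and $\mathrm{Var}(\omega_{it}) = \lambda_{it}^{\rho}$; the values $\lambda = 2$ and $\rho = 1$ are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
lam, rho = 2.0, 1.0            # arbitrary illustrative values
L = np.log(1.0 + lam ** rho)   # ln(1 + lambda^rho)

z = rng.standard_normal(1_000_000)
omega = np.exp(-0.5 * L + np.sqrt(L) * z)   # disturbance from (44)

print(omega.mean())            # should be close to E[omega] = 1
print(omega.var())             # should be close to Var(omega) = lambda^rho = 2

# Implied conditional variance of y = lambda * omega is lambda^2 * Var(omega)
# = lambda^(rho + 2), as stated in the text:
print(lam ** 2 * omega.var())
```

Differentiating the simulated second moment with respect to $\rho$ by finite differences reproduces (46) in the same way.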
Thus, our goal in the following is to determine if there are any estimators in this class other than FE-PPML under which $\lim_{N\to\infty} d\hat\beta/d\rho = 0$ in this experiment. (Note here that $\partial(\omega_{it}^{2})/\partial\rho = 2\,\omega_{it}\,\partial\omega_{it}/\partial\rho$.)

The next step is to totally differentiate the FOC's for $\hat\beta$ and $\hat\alpha_i$ with respect to a change in $\rho$. Let $\mathcal{L}$ denote the pseudo-likelihood function to be maximized. (The implied pseudo-likelihood function is given here by $\mathcal{L} := \sum_{i=1}^{N}\sum_{t=1}^{T} y_{it}\int \frac{g(\lambda_{it})}{\lambda_{it}}\,d\lambda_{it} - \sum_{i=1}^{N}\sum_{t=1}^{T}\int g(\lambda_{it})\,d\lambda_{it}$.) For notational convenience, we can express the scores for $\hat\beta$ and $\hat\alpha_i$ as $\mathcal{L}_{\beta}$ and $\mathcal{L}_{\alpha_i}$, such that their FOC's can respectively be written as $\mathcal{L}_{\beta} = 0$ and $\mathcal{L}_{\alpha_i} = 0$. Differentiating the FOC for $\hat\beta$, we obtain

\[
\frac{d\hat\beta}{d\rho} = -\mathcal{L}_{\beta\beta}^{-1}\mathcal{L}_{\beta\rho} - \mathcal{L}_{\beta\beta}^{-1}\sum_{i}\mathcal{L}_{\beta\alpha_i}\frac{d\hat\alpha_i}{d\rho}, \tag{47}
\]

where $\mathcal{L}_{\beta\beta}$ is the matrix obtained from partially differentiating the score for $\hat\beta$ with respect to $\hat\beta$, $\mathcal{L}_{\beta\rho}$ (a vector) is the partial derivative of $\mathcal{L}_{\beta}$ with respect to $\rho$, and $\mathcal{L}_{\beta\alpha_i}$ (also a vector) is the partial derivative of $\mathcal{L}_{\beta}$ with respect to $\hat\alpha_i$. Applying a similar set of operations to the FOC for $\hat\alpha_i$ then gives

\[
\frac{d\hat\alpha_i}{d\rho} = -\mathcal{L}_{\alpha_i\alpha_i}^{-1}\mathcal{L}_{\alpha_i\rho} - \mathcal{L}_{\alpha_i\alpha_i}^{-1}\mathcal{L}_{\beta\alpha_i}'\frac{d\hat\beta}{d\rho}, \tag{48}
\]

where $\mathcal{L}_{\alpha_i\alpha_i}$ and $\mathcal{L}_{\alpha_i\rho}$ are scalars that respectively contain the partial derivatives of $\mathcal{L}_{\alpha_i}$ with respect to $\hat\alpha_i$ and $\rho$. Plugging (48) into (47), we have

\[
\frac{d\hat\beta}{d\rho} = -\mathcal{L}_{\beta\beta}^{-1}\mathcal{L}_{\beta\rho} + \mathcal{L}_{\beta\beta}^{-1}\sum_{i=1}^{N}\mathcal{L}_{\alpha_i\alpha_i}^{-1}\mathcal{L}_{\beta\alpha_i}\mathcal{L}_{\alpha_i\rho} + \mathcal{L}_{\beta\beta}^{-1}\sum_{i=1}^{N}\mathcal{L}_{\alpha_i\alpha_i}^{-1}\mathcal{L}_{\beta\alpha_i}\mathcal{L}_{\beta\alpha_i}'\frac{d\hat\beta}{d\rho}
\]
\[
= -\left(I - \mathcal{L}_{\beta\beta}^{-1}\sum_{i=1}^{N}\mathcal{L}_{\alpha_i\alpha_i}^{-1}\mathcal{L}_{\beta\alpha_i}\mathcal{L}_{\beta\alpha_i}'\right)^{-1}\mathcal{L}_{\beta\beta}^{-1}\mathcal{L}_{\beta\rho} \tag{49}
\]
\[
+ \left(I - \mathcal{L}_{\beta\beta}^{-1}\sum_{i=1}^{N}\mathcal{L}_{\alpha_i\alpha_i}^{-1}\mathcal{L}_{\beta\alpha_i}\mathcal{L}_{\beta\alpha_i}'\right)^{-1}\mathcal{L}_{\beta\beta}^{-1}\sum_{i=1}^{N}\mathcal{L}_{\alpha_i\alpha_i}^{-1}\mathcal{L}_{\beta\alpha_i}\mathcal{L}_{\alpha_i\rho}, \tag{50}
\]

where $I$ is an identity matrix whose dimensions equal the size of $\beta$.

Let $P$ henceforth denote the combined matrix object $I - \mathcal{L}_{\beta\beta}^{-1}\sum_i \mathcal{L}_{\alpha_i\alpha_i}^{-1}\mathcal{L}_{\beta\alpha_i}\mathcal{L}_{\beta\alpha_i}'$. It is straightforward to show that the first term in (50), $-P^{-1}\mathcal{L}_{\beta\beta}^{-1}\mathcal{L}_{\beta\rho}$, converges in probability to a zero vector when $N \to \infty$.
To see this, note first that $P$ and $\mathcal{L}_{\beta\beta}$ must be non-singular and finite for $\hat\beta$ to be at a maximum point of $\mathcal{L}$ and for $d\hat\beta/d\rho$ to exist. Furthermore, $\lim_{N\to\infty} NT\,\mathcal{L}_{\beta\beta}^{-1} = -\mathbb{E}\big[x_{it}\,\hat\lambda_{it}\,g(\hat\lambda_{it})\,x_{it}'\big]^{-1}$ must also be non-singular and finite. Slutsky's theorem then implies $\lim_{N\to\infty} -P^{-1}\mathcal{L}_{\beta\beta}^{-1}\mathcal{L}_{\beta\rho} \to_p 0$ so long as $N^{-1}T^{-1}\mathcal{L}_{\beta\rho} \to_p 0$. Examining $\mathcal{L}_{\beta\rho}$ more closely, we have

\[
\mathcal{L}_{\beta\rho} = \sum_{i=1}^{N}\sum_{t=1}^{T} x_{it}\,\frac{\partial y_{it}}{\partial\rho}\,g(\hat\lambda_{it}) = \sum_{i=1}^{N}\sum_{t=1}^{T} x_{it}\,\lambda_{it}\,\frac{\partial\omega_{it}}{\partial\rho}\,g(\hat\lambda_{it}),
\]

and $\lim_{N\to\infty} N^{-1}T^{-1}\mathcal{L}_{\beta\rho} \to_p 0$ then follows because $\mathbb{E}\big[\partial\omega_{it}/\partial\rho\big] = 0$ (by (45)). We may therefore focus our attention on the second term on the RHS in (50), $P^{-1}\mathcal{L}_{\beta\beta}^{-1}\sum_i \mathcal{L}_{\alpha_i\alpha_i}^{-1}\mathcal{L}_{\beta\alpha_i}\mathcal{L}_{\alpha_i\rho}$. Noting that $\mathcal{L}_{\alpha_i\alpha_i}^{-1}$ must be $< 0$, in this case we consider the conditions under which $\lim_{N\to\infty} N^{-1}T^{-1}\sum_i \mathcal{L}_{\alpha_i\alpha_i}^{-1}\mathcal{L}_{\beta\alpha_i}\mathcal{L}_{\alpha_i\rho}$ similarly converges in probability to zero. The summation in this latter term may be expressed as

\[
\sum_{i=1}^{N}\mathcal{L}_{\alpha_i\alpha_i}^{-1}\mathcal{L}_{\beta\alpha_i}\mathcal{L}_{\alpha_i\rho} = \sum_{i=1}^{N}\mathcal{L}_{\alpha_i\alpha_i}^{-1}\left[\sum_{t=1}^{T} x_{it}\Big(\big(y_{it}-\hat\lambda_{it}\big)g'(\hat\lambda_{it})\hat\lambda_{it} - \hat\lambda_{it}\,g(\hat\lambda_{it})\Big)\right]\left[\sum_{t=1}^{T}\frac{\partial y_{it}}{\partial\rho}\,g(\hat\lambda_{it})\right].
\]

Re-arranging this expression, we have that

\[
\sum_{i=1}^{N}\mathcal{L}_{\alpha_i\alpha_i}^{-1}\mathcal{L}_{\beta\alpha_i}\mathcal{L}_{\alpha_i\rho} = \sum_{i=1}^{N}\sum_{t=1}^{T}\sum_{s=1}^{T}\mathcal{L}_{\alpha_i\alpha_i}^{-1}\,x_{it}\,y_{it}\,g'(\hat\lambda_{it})\hat\lambda_{it}\,g(\hat\lambda_{is})\frac{\partial y_{is}}{\partial\rho}
- \sum_{i=1}^{N}\sum_{t=1}^{T}\sum_{s=1}^{T}\mathcal{L}_{\alpha_i\alpha_i}^{-1}\,x_{it}\Big(\hat\lambda_{it}\,g'(\hat\lambda_{it}) + g(\hat\lambda_{it})\Big)\hat\lambda_{it}\,g(\hat\lambda_{is})\frac{\partial y_{is}}{\partial\rho}. \tag{51}
\]

Focusing first on the second of the two summation terms in (51), we again apply $y_{it} = \lambda_{it}\omega_{it}$, $\partial y_{is}/\partial\rho = \lambda_{is}\,\partial\omega_{is}/\partial\rho$, and $\mathbb{E}[\partial\omega_{it}/\partial\rho] = 0$. We have that

\[
\lim_{N\to\infty}\frac{1}{NT}\sum_{i=1}^{N}\sum_{t=1}^{T}\sum_{s=1}^{T}\mathcal{L}_{\alpha_i\alpha_i}^{-1}\,x_{it}\Big(\hat\lambda_{it}\,g'(\hat\lambda_{it}) + g(\hat\lambda_{it})\Big)\hat\lambda_{it}\,g(\hat\lambda_{is})\,\lambda_{is}\frac{\partial\omega_{is}}{\partial\rho} \to_p 0.
\]

This follows for the same reason that $\lim_{N\to\infty} N^{-1}T^{-1}\mathcal{L}_{\beta\rho} \to_p 0$, and it holds regardless of whether $g'(\hat\lambda_{it}) = 0$. To complete the proof, we just need to show that the first term in (51) does not reduce to 0 if $g'(\hat\lambda_{it}) \neq 0$. A final step gives us

\[
\lim_{N\to\infty}\frac{1}{NT}\sum_{i=1}^{N}\sum_{t=1}^{T}\sum_{s=1}^{T}\mathcal{L}_{\alpha_i\alpha_i}^{-1}\,x_{it}\,y_{it}\,g'(\hat\lambda_{it})\hat\lambda_{it}\,g(\hat\lambda_{is})\frac{\partial y_{is}}{\partial\rho}
= \lim_{N\to\infty}\frac{1}{NT}\sum_{i=1}^{N}\sum_{t=1}^{T}\mathcal{L}_{\alpha_i\alpha_i}^{-1}\,x_{it}\,g'(\hat\lambda_{it})\hat\lambda_{it}\,g(\hat\lambda_{it})\,y_{it}\frac{\partial y_{it}}{\partial\rho}
\]
\[
= \lim_{N\to\infty}\frac{1}{NT}\sum_{i=1}^{N}\sum_{t=1}^{T}\mathcal{L}_{\alpha_i\alpha_i}^{-1}\,x_{it}\,g'(\hat\lambda_{it})\hat\lambda_{it}\,g(\hat\lambda_{it})\,\lambda_{it}^{2}\,\omega_{it}\frac{\partial\omega_{it}}{\partial\rho} \neq 0.
\]

To elaborate, the terms where $s \neq t$ vanish as $N \to \infty$ because disturbances are assumed to be independently distributed ($\mathbb{E}[\omega_{it}\,\partial\omega_{is}/\partial\rho] = 0$ if $s \neq t$). The remaining details follow from (46). We have now shown $\lim_{N\to\infty} d\hat\beta/d\rho = 0$ if and only if $g'(\hat\lambda_{it}) = 0$. In other words, the estimator must be FE-PPML, which assumes $g(\hat\lambda_{it})$ is a constant.
For other FE-PML estimators, even if $\hat\beta$ is consistent for a particular $\rho$, it cannot be consistent for all $\rho$ because $\hat\beta$ does not converge to the same value for $N \to \infty$ when we vary $\rho$. As we discuss below, this is what happens for FE-Gamma PML (where $g(\hat\lambda_{it}) = \hat\lambda_{it}^{-1}$) and some other similar models. $\blacksquare$

To be clear, the robustness of the FE-PPML estimator to misspecification is a known result established by Wooldridge (1999). However, to our knowledge, it has not previously been shown that FE-PPML is the only estimator in the class we consider that has this robustness property.

Note that under FE-PPML, where $g'(\hat\lambda_{it}) = 0$, the estimator is consistent even if disturbances are correlated. This is yet another reason why FE-PPML is an especially robust estimator.

Notice that if $T \to \infty$ also, we have that $\lim_{T\to\infty} T\,\mathcal{L}_{\alpha_i\alpha_i}^{-1} = -\mathbb{E}\big[\hat\lambda_{it}\,g(\hat\lambda_{it})\big]^{-1}$ must be finite. We would therefore have

\[
\lim_{N,T\to\infty}\frac{1}{NT}\sum_{i=1}^{N}\sum_{t=1}^{T}\big[T\,\mathcal{L}_{\alpha_i\alpha_i}^{-1}\big]\,x_{it}\,g'(\hat\lambda_{it})\hat\lambda_{it}\,g(\hat\lambda_{it})\,\lambda_{it}^{2}\left[T^{-1}\,\omega_{it}\frac{\partial\omega_{it}}{\partial\rho}\right] = 0,
\]

ensuring that $\hat\beta$ does not depend on $\rho$ for the large-$N$, large-$T$ case. This follows because $\lim_{T\to\infty} T^{-1}V[\omega_{it}] = 0 \implies \lim_{T\to\infty} T^{-1}\mathbb{E}\big[\omega_{it}\,\partial\omega_{it}/\partial\rho\big] = 0$.

At the same time, it is worth clarifying that FE-PPML is not the only estimator that is capable of producing consistent estimates of three-way gravity models. Rather, it is the only estimator in the class we consider that only requires correct specification of the conditional mean and for the covariates to be conditionally exogenous in order to be consistent. The following discussion describes some known cases in which other estimators will be consistent.
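The comparative-statics experiment behind Lemma 2 can be imitated numerically. The sketch below is our own illustration (all function names and parameter values are ours, not from the paper): it fits one-way FE-PPML ($g \equiv 1$) and FE-Gamma PML ($g(\lambda) = \lambda^{-1}$) with $T = 2$, profiling out each $\hat\alpha_i$ in closed form, and re-estimates $\hat\beta$ after perturbing $\rho$ while holding the underlying $z_{it}$ draws fixed. Per Lemma 2, the FE-PPML estimates should stay near the truth for any $\rho$, whereas Gamma PML is anchored to its correctly specified case ($\rho = 0$, i.e. variance proportional to the squared mean).

```python
import numpy as np

def simulate(beta, rho, z, N, T):
    # Fixed covariates and effects; only the variance parameter rho changes.
    rng = np.random.default_rng(1)
    x = rng.normal(0.0, 0.3, (N, T))
    a = rng.normal(0.0, 0.3, N)[:, None]
    lam = np.exp(a + beta * x)
    L = np.log(1.0 + lam ** rho)
    omega = np.exp(-0.5 * L + np.sqrt(L) * z)   # disturbance from (44)
    return x, lam * omega

def score(beta, x, y, gamma_pml):
    m = np.exp(beta * x)                        # mean component net of alpha_i
    if gamma_pml:                               # g = 1/lam: exp(a-hat) = mean_t(y/m)
        ahat = (y / m).mean(axis=1, keepdims=True)
        return np.sum(x * (y / (ahat * m) - 1.0))
    ahat = y.sum(axis=1, keepdims=True) / m.sum(axis=1, keepdims=True)
    return np.sum(x * (y - ahat * m))           # g = 1: PPML

def fit(x, y, gamma_pml, lo=-2.0, hi=4.0):
    # The concentrated score is decreasing in beta (concave profile likelihood),
    # so simple bisection suffices.
    for _ in range(80):
        mid = 0.5 * (lo + hi)
        if score(mid, x, y, gamma_pml) > 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

N, T = 20000, 2
z = np.random.default_rng(2).standard_normal((N, T))
results = {}
for rho in (0.0, 1.0):
    x, y = simulate(1.0, rho, z, N, T)          # true beta = 1
    results[rho] = (fit(x, y, False), fit(x, y, True))
    print(rho, results[rho])                    # (PPML, Gamma PML) estimates
```

The design choice of profiling $\hat\alpha_i$ out in closed form mirrors the profiling argument used for $\hat\eta_{ij}$ in the three-way case and keeps the experiment fast even with many incidental parameters.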
A.3 Results for Other Three-way Estimators
Depending on the distribution of the data, there may be some other consistent estimator available aside from FE-PPML. In particular, if $g(\hat\lambda_{ijt})$ is of the form $g(\hat\lambda_{ijt}) = \hat\lambda_{ijt}^{q}$, with $q$ an arbitrary real number, the FOC for $\hat\eta_{ij}$ has a solution of the form $\hat\eta_{ij} = \big[\sum_{t=1}^{T}\hat\mu_{ijt}^{q+1}\big]^{-1}\sum_{t=1}^{T} y_{ijt}\,\hat\mu_{ijt}^{q}$. It is therefore possible to “profile out” $\hat\eta_{ij}$ from the FOC for $\hat\beta$, just as in the FE-PPML case. As such, it is possible for $\beta$ to be consistently estimated, but only if the conditional variance is correctly specified (more precisely, we must have $\mathrm{Var}(y|x,\alpha,\gamma,\eta) \propto \hat\lambda_{ijt}^{1-q}$, the equivalent of $\rho = -1-q$). In this case, the estimator is not only consistent, but should be more efficient as well.

An interesting example to consider in the gravity context is the Gamma PML (GPML) model, which imposes $g(\hat\lambda_{ijt}) = \hat\lambda_{ijt}^{-1}$. Generally speaking, GPML is considered the primary alternative to PPML and OLS as an estimator for use with gravity equations (see Head and Mayer, 2014; Bosquet and Boulhol, 2015). However, to our knowledge, no references to date on gravity estimation make it clear that, unlike in a two-way setting, the three-way FE-GPML estimator is only consistent when the conditional variance is correctly specified. Thus, it is possible that researchers could mistakenly infer that the appeal of FE-GPML as an alternative to FE-PPML in the two-way gravity setting carries over to the three-way setting. This is especially a concern now that recent computational advances have made three-way gravity models considerably easier to estimate.

Alternatively, it is possible to extend the above result to an even more general class of models by considering estimators that depend on $g(\hat\alpha_i)$ rather than $g(\hat\lambda_{it})$. The same type of proof may be used to show that $\hat\beta$ depends on the variance assumption if $g'(\hat\alpha_i) \neq 0$. Furthermore, the estimator can be shown to be consistent if $g'(\hat\alpha_i) = 0$.
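This profiling step is easy to verify numerically. The snippet below (our own check, using made-up values for a single pair) solves the $\hat\eta_{ij}$ first-order condition $\sum_t (y_t - \mu_t\eta)(\mu_t\eta)^{q} = 0$ by bisection and confirms it matches the closed-form solution above; the choices of $T$, $q$, and the simulated data are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(3)
T = 5
mu = np.exp(rng.normal(size=T))           # mu-hat_ijt for one illustrative pair
y = mu * rng.lognormal(sigma=0.5, size=T)

def closed_form(q):
    # eta-hat = [sum_t mu^(q+1)]^(-1) * sum_t y_t * mu^q
    return np.sum(y * mu ** q) / np.sum(mu ** (q + 1.0))

def bisect_foc(q, lo=1e-8, hi=1e8):
    # FOC for the pair effect: sum_t (y_t - mu_t*eta) * (mu_t*eta)^q = 0,
    # solved on a log scale since eta > 0 and the FOC is decreasing in eta.
    def foc(eta):
        lam = mu * eta
        return np.sum((y - lam) * lam ** q)
    for _ in range(200):
        mid = np.sqrt(lo * hi)
        if foc(mid) > 0:
            lo = mid
        else:
            hi = mid
    return np.sqrt(lo * hi)

for q in (-1.0, 0.5):                     # q = -1 corresponds to Gamma PML
    print(q, closed_form(q), bisect_foc(q))
```

Setting $q = 0$ recovers the familiar FE-PPML solution $\hat\eta_{ij} = \sum_t y_{ijt} / \sum_t \hat\mu_{ijt}$.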
As discussed in Greene (2004), the fixed effects Gamma model is generally known not to suffer from an incidental parameter problem, similar to FE-Poisson. However, the result stated in Greene (2004) is for the Gamma MLE estimator, which restricts the conditional variance to be equal to the square of the conditional mean. The FE-Gamma PML model is consistent under the slightly more general assumption that the conditional variance is proportional to the square of the conditional mean.

For example, Head and Mayer (2014), arguably the leading reference to date on gravity estimation, suggest comparing PPML estimates with GPML estimates to determine if the RHS of the model is potentially misspecified. Such a comparison is not straightforward in a three-way setting because the GPML estimator is likely to be inconsistent. Their other suggestion to compare GPML and OLS estimates still seems sensible, however. As we show below, both estimators give similar results when the Gamma variance assumption is satisfied and give different results otherwise.

The displayed kernel densities are computed using 500 replications of a three-way panel structure. The $i$ and $j$ dimensions of the panel both have size $N = 50$ and the size of the time dimension is $T = 5$. The fixed effects are generated according to the same procedures described in the text and we again model four different scenarios for the distribution of the error term (Gaussian, Poisson, Log-heteroscedastic, and Quadratic).

As we would expect based on Proposition 2, FE-PPML is relatively unbiased across all four different assumptions considered for the distribution of the error term. The general inconsistency of the three-way linear model—which is only unbiased for DGP III where the error term is log-homoscedastic—is also as expected. However, the reasons behind the bias in the OLS estimate are well documented (see Santos Silva and Tenreyro, 2006) and do not have to do with the incidental parameters included in the model. The three-way FE-GPML is also consistent under DGP III because it assumes the error term has a variance equal to the square of the conditional mean. Both OLS and GPML are also more efficient than PPML in this case. However, as the other three panels show, when this variance assumption is relaxed, the three-way FE-GPML model clearly suffers from an IPP, exhibiting an average bias equal to roughly half that of OLS in all three cases.

We have also performed some simulations with three-way FE-Gaussian PML, which imposes $g(\hat\lambda_{ijt}) = \hat\lambda_{ijt}$. We do not show results for this other estimator because the HDFE-IRLS algorithm we used to produce the FE-PPML and FE-Gamma PML estimates frequently did not converge for the FE-Gaussian PML model. However, the results we did obtain were in line with our results for FE-GPML and with our discussion of Proposition 2 above: the FE-Gaussian PML estimates were unbiased when the DGP for $\omega_{ijt}$ was itself consistent with the Gaussian variance assumption, but biased otherwise.

We were able to compute three-way FE-Gamma PML estimates using a modified version of the HDFE-IRLS algorithm used in Correia, Guimarães, and Zylkin (2019). To our knowledge, these are the first results presented anywhere documenting the inconsistency of the three-way Gamma PML estimator.

Simulations with larger $N$ are more narrowly distributed, but otherwise are very similar.

A.4 Showing Bias in the Cluster-robust Sandwich Estimator
For convenience, let $\check{x}_{ij} := (x_{ij}, d_{ij})$ be the matrix of covariates associated with pair $ij$, inclusive of the $it$- and $jt$-specific dummy variables needed to estimate $\alpha_{it}$ and $\gamma_{jt}$. Similarly, let $b := (\beta', \phi')'$ be the vector of coefficients to be estimated and let $\hat b$ be the vector of coefficient estimates. Note that we can write a first-order approximation for $\hat S_{ij}$ as

\[
\hat S_{ij} \approx S_{ij} - \bar H_{ij}\,\check{x}_{ij}\,(\hat b - b),
\]

which is consistent with the approximation provided in (14). We can then replace $\hat b - b$ with the standard first-order expansion $\hat b - b \approx -\bar{\mathcal{L}}_{bb}^{-1}\mathcal{L}_{b}$, where $\mathcal{L} = \sum_{i,j}\ell_{ij}$ is the profile likelihood. This expansion in turn can be written out as

\[
\hat b - b \approx -\bar{\mathcal{L}}_{bb}^{-1}\sum_{m,n}\check{x}_{mn}'\,S_{mn}.
\]

Now we turn our attention to the outer product $\hat S_{ij}\hat S_{ij}'$:

\[
\hat S_{ij}\hat S_{ij}' \approx S_{ij}S_{ij}' + \bar H_{ij}\,\check{x}_{ij}\,(\hat b - b)(\hat b - b)'\,\check{x}_{ij}'\,\bar H_{ij} - 2\,\bar H_{ij}\big[\check{x}_{ij}(\hat b - b)\big]S_{ij}'
\]
\[
= S_{ij}S_{ij}' + \bar H_{ij}\,\check{x}_{ij}\,(\hat b - b)(\hat b - b)'\,\check{x}_{ij}'\,\bar H_{ij} + 2\,\bar H_{ij}\,\check{x}_{ij}\,\bar{\mathcal{L}}_{bb}^{-1}\Big[\sum_{m,n}\check{x}_{mn}'\,S_{mn}\Big]S_{ij}'.
\]

Because we assume we are in the special case where FE-PPML is correctly specified, we have that $\mathbb{E}[(\hat b - b)(\hat b - b)'] = -\kappa\,\bar{\mathcal{L}}_{bb}^{-1}$, where $\bar{\mathcal{L}}_{bb} := \mathbb{E}[\mathcal{L}_{bb}]$. We also have that $\mathbb{E}[S_{ij}S_{ij}'] = \kappa\,\bar H_{ij}$. Therefore, after applying expectations where appropriate, we have that

\[
\mathbb{E}[\hat S_{ij}\hat S_{ij}'] \approx S_{ij}S_{ij}' + \kappa\,\bar H_{ij}\,\check{x}_{ij}\,\bar{\mathcal{L}}_{bb}^{-1}\,\check{x}_{ij}'\,\bar H_{ij},
\]

which can be seen as extending Kauermann and Carroll (2001)'s results to the case of a panel data pseudo-likelihood model with within-panel clustering. We are not done, however, as we have not yet isolated the influence of the incidental parameters. To complete the derivation of the bias, we must more carefully consider the full inverse Hessian term $\bar{\mathcal{L}}_{bb}^{-1}$.
Using standard matrix algebra, this inverse can be written as:

\[
\bar{\mathcal{L}}_{bb}^{-1} =
\begin{pmatrix}
\big(\bar{\mathcal{L}}_{\beta\beta} - \bar{\mathcal{L}}_{\beta\phi}\bar{\mathcal{L}}_{\phi\phi}^{-1}\bar{\mathcal{L}}_{\phi\beta}\big)^{-1} &
-\big(\bar{\mathcal{L}}_{\beta\beta} - \bar{\mathcal{L}}_{\beta\phi}\bar{\mathcal{L}}_{\phi\phi}^{-1}\bar{\mathcal{L}}_{\phi\beta}\big)^{-1}\bar{\mathcal{L}}_{\beta\phi}\bar{\mathcal{L}}_{\phi\phi}^{-1} \\[6pt]
-\bar{\mathcal{L}}_{\phi\phi}^{-1}\bar{\mathcal{L}}_{\phi\beta}\big(\bar{\mathcal{L}}_{\beta\beta} - \bar{\mathcal{L}}_{\beta\phi}\bar{\mathcal{L}}_{\phi\phi}^{-1}\bar{\mathcal{L}}_{\phi\beta}\big)^{-1} &
\bar{\mathcal{L}}_{\phi\phi}^{-1} + \bar{\mathcal{L}}_{\phi\phi}^{-1}\bar{\mathcal{L}}_{\phi\beta}\big(\bar{\mathcal{L}}_{\beta\beta} - \bar{\mathcal{L}}_{\beta\phi}\bar{\mathcal{L}}_{\phi\phi}^{-1}\bar{\mathcal{L}}_{\phi\beta}\big)^{-1}\bar{\mathcal{L}}_{\beta\phi}\bar{\mathcal{L}}_{\phi\phi}^{-1}
\end{pmatrix},
\]
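As an aside, the partitioned-inverse identity can be verified numerically. The sketch below (illustrative only; the dimensions and random matrix are arbitrary) builds a symmetric negative-definite matrix standing in for $\bar{\mathcal{L}}_{bb}$, applies the block formula, and compares each block against the direct inverse.

```python
import numpy as np

rng = np.random.default_rng(5)
k, p = 2, 6                          # dim(beta) = 2, dim(phi) = 6
A = rng.normal(size=(k + p, k + p))
Lbb = -(A @ A.T + np.eye(k + p))     # symmetric negative definite, like a Hessian

Lbb_inv = np.linalg.inv(Lbb)
Lbeta = Lbb[:k, :k]                  # L_beta-beta block
Lbp = Lbb[:k, k:]                    # L_beta-phi block (its transpose is L_phi-beta)
Lpp_inv = np.linalg.inv(Lbb[k:, k:]) # inverse of the L_phi-phi block

S = np.linalg.inv(Lbeta - Lbp @ Lpp_inv @ Lbp.T)     # Schur complement inverse
top_right = -S @ Lbp @ Lpp_inv
bottom_right = Lpp_inv + Lpp_inv @ Lbp.T @ S @ Lbp @ Lpp_inv

print(np.allclose(S, Lbb_inv[:k, :k]))             # True
print(np.allclose(top_right, Lbb_inv[:k, k:]))     # True
print(np.allclose(bottom_right, Lbb_inv[k:, k:]))  # True
```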
[Figure 2 here: “Comparing Three-way Gravity Estimators (N=50; T=5)”. Panels: DGP I (Gaussian), $V[y|x,\alpha,\gamma] = 1$; DGP II (Poisson), $V[y|x,\alpha,\gamma] = E[y|x,\alpha,\gamma]$; DGP III (Log-homoscedastic), $V[y|x,\alpha,\gamma] = E[y|x,\alpha,\gamma]^2$; DGP IV (Quadratic), $V[y|x,\alpha,\gamma] = 0.5\,E[y|x,\alpha,\gamma] + 0.5\,E[y|x,\alpha,\gamma]^2$. Each panel shows kernel densities for FE-PPML, FE-GPML, and FE-OLS.]
Figure 2: Kernel density plots of three-way gravity model estimates using different FE estimators, based on 500 replications. The model being estimated is $y_{ijt} = \exp[\alpha_{it} + \gamma_{jt} + \eta_{ij} + x_{ijt}\beta]\,\omega_{ijt}$, where the distribution of $\omega_{ijt}$ depends on the DGP and the true value of $\beta$ is 1 (indicated by the vertical dotted lines). The size of the $i$ and $j$ dimensions is given by $N = 50$ and the $t$ dimension has size $T = 5$. See text for further details.

In the above expression for $\bar{\mathcal{L}}_{bb}^{-1}$, we have used $\bar{\mathcal{L}}_{\phi\phi}$ in place of $\bar H$ in order to add clarity. Making use of some already-established definitions, we have that the top-left term satisfies $\big(\bar{\mathcal{L}}_{\beta\beta} - \bar{\mathcal{L}}_{\beta\phi}\bar{\mathcal{L}}_{\phi\phi}^{-1}\bar{\mathcal{L}}_{\phi\beta}\big)^{-1} = -[N(N-1)]^{-1}\,W_N^{-1}$ and, similarly, that $\bar{\mathcal{L}}_{\phi\phi}^{-1} = -[N(N-1)]^{-1}\,W_N^{(\phi)-1}$. If we again consider $\mathbb{E}[\hat S_{ij}\hat S_{ij}']$, we can now write

\[
\mathbb{E}[\hat S_{ij}\hat S_{ij}' - S_{ij}S_{ij}'] \approx -\frac{\kappa}{N(N-1)}\,\bar H_{ij}\,\big(x_{ij}\;\; d_{ij}\big)
\begin{pmatrix}
W_N^{-1} & -W_N^{-1}\,\bar{\mathcal{L}}_{\beta\phi}\,\bar{\mathcal{L}}_{\phi\phi}^{-1} \\[6pt]
-\bar{\mathcal{L}}_{\phi\phi}^{-1}\,\bar{\mathcal{L}}_{\phi\beta}\,W_N^{-1} & W_N^{(\phi)-1} + \bar{\mathcal{L}}_{\phi\phi}^{-1}\,\bar{\mathcal{L}}_{\phi\beta}\,W_N^{-1}\,\bar{\mathcal{L}}_{\beta\phi}\,\bar{\mathcal{L}}_{\phi\phi}^{-1}
\end{pmatrix}
\big(x_{ij}\;\; d_{ij}\big)'\,\bar H_{ij}.
\]
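The downward bias that this expansion implies for the clustered sandwich estimator can be reproduced in a stripped-down experiment with no incidental parameters at all, in the spirit of Kauermann and Carroll (2001). The simulation below is our own construction (the model, sample sizes, and seeds are arbitrary): it fits a correctly specified Poisson model by Newton's method and compares the average cluster-robust variance estimate for the slope with the Monte Carlo variance of the slope estimate across replications; the sandwich estimate comes out smaller on average.

```python
import numpy as np

rng = np.random.default_rng(4)
G, Tg = 10, 5                              # 10 clusters of 5 observations each
n = G * Tg
X = np.column_stack([np.ones(n), rng.uniform(-1.0, 1.0, n)])  # fixed design
groups = np.repeat(np.arange(G), Tg)
beta_true = np.array([0.0, 0.5])

def fit_and_sandwich(y):
    b = np.zeros(2)
    for _ in range(40):                    # Newton ascent for the Poisson MLE
        lam = np.exp(np.clip(X @ b, -30.0, 30.0))
        grad = X.T @ (y - lam)
        if np.abs(grad).max() < 1e-10:
            break
        b += np.linalg.solve((X * lam[:, None]).T @ X, grad)
    lam = np.exp(X @ b)
    bread = np.linalg.inv((X * lam[:, None]).T @ X)
    scores = X * (y - lam)[:, None]
    meat = np.zeros((2, 2))
    for g in range(G):                     # cluster-summed score outer products
        s = scores[groups == g].sum(axis=0)
        meat += np.outer(s, s)
    return b, bread @ meat @ bread         # CR0 cluster-robust sandwich

reps = 3000
bhat = np.empty(reps)
vhat = np.empty(reps)
for r in range(reps):
    y = rng.poisson(np.exp(X @ beta_true)).astype(float)
    b, V = fit_and_sandwich(y)
    bhat[r], vhat[r] = b[1], V[1, 1]

print(vhat.mean(), bhat.var())             # mean sandwich below the MC variance
```

With only $G = 10$ clusters, the shortfall is substantial; it shrinks as the number of clusters grows, mirroring the $[N(N-1)]^{-1}$ factor in the expression above.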