INFERENCE ON TWO-COMPONENT MIXTURES UNDER TAIL RESTRICTIONS*

KOEN JOCHMANS, Sciences Po, 28 rue des Saints Pères, 75007 Paris, France. E-mail: [email protected]
MARC HENRY, The Pennsylvania State University, University Park, PA 16801, U.S.A. E-mail: [email protected]
BERNARD SALANIÉ, Columbia University, 420 West 118th Street, New York, NY 10027, U.S.A. E-mail: [email protected]
Final version: February 15, 2021

Many econometric models can be analyzed as finite mixtures. We focus on two-component mixtures and we show that they are nonparametrically point identified by a combination of an exclusion restriction and tail restrictions. Our identification analysis suggests simple closed-form estimators of the component distributions and mixing proportions, as well as a specification test. We derive their asymptotic properties using results on tail empirical processes and we present a simulation study that documents their finite-sample performance.
Keywords: mixture model; nonparametric identification and estimation; tail empirical process
INTRODUCTION
The use of finite mixtures has a long history in applied econometrics. A non-exhaustive list of applications includes models with discrete unobserved heterogeneity, hidden Markov chains, and models with mismeasured discrete variables; see Henry et al. (2014) for a more extensive discussion of applications. Until recently, the literature on nonparametric identification of mixture models was sparse. Following the lead of Hall and Zhou (2003), several authors have analyzed multivariate mixtures; recent contributions are Kasahara and Shimotsu (2009), Allman et al. (2009), and Bonhomme et al. (2014, 2016). There are fewer identifying restrictions available when the model of interest is univariate. Bordes et al. (2006), for instance, provide such restrictions for location models with symmetric error distributions.

In this paper we give sufficient conditions that point-identify univariate component distributions and associated mixing proportions. The restrictions we rely on are most effective in two-component models; to simplify the analysis, we focus on this case, like Hall and Zhou (2003) and Bordes et al. (2006). We comment briefly on mixtures with more components at the end of the paper. Our arguments

* We are grateful to Peter Phillips, Arthur Lewbel, and three referees for comments and suggestions, and to Victor Chernozhukov and Yuichi Kitamura for fruitful discussions. Parts of this paper were written while Henry was visiting the University of Tokyo Graduate School of Economics and while Salanié was visiting the Toulouse School of Economics. The hospitality of both institutions is gratefully acknowledged. Jochmans’ research has received funding from the SAB grant “Nonparametric estimation of finite mixtures”. Henry’s research has received funding from the SSHRC Grants 410-2010-242 and 435-2013-0292, and NSERC Grant 356491-2013. Salanié thanks the Georges Meyer endowment. Some of the results presented here previously circulated as part of Henry et al.
(2010), whose published version (Henry et al. 2014) only contains results on partial identification.

Our arguments are constructive, and we propose closed-form estimators for both the component distributions and the mixing proportions. We derive their large-sample properties and we propose a specification test. Finally, we investigate the behavior of our inference tools in a simulation experiment.

The model we consider in this paper is characterized by an exclusion restriction and a tail-dominance assumption. Like Henry et al. (2014), we assume the existence of a source of variation that shifts the mixing proportions but leaves the component distributions unchanged. Such an assumption is natural in several important applications, such as measurement-error models (Mahajan 2006). In hidden Markov models, it follows directly from the model specification. The exclusion restriction is also implied by the conditional-independence restriction that underlies the results of Hall and Zhou (2003) and others on multivariate mixtures.

Henry et al. (2014) have shown that our exclusion restriction implies that both the mixing proportions and the component distributions lie in a non-trivial set. However, they only proved partial identification, and they did not discuss inference. Here we achieve point identification by complementing the exclusion restriction with a restriction on the relative tail behavior of the component distributions. This restriction is quite natural in location models, for instance, but it can be motivated more generally. Regime-switching models typically feature regimes with different tail behavior, for example.
Alternatively, theoretical models can imply the required tail behavior; an example is the search and matching model of Shimer and Smith (2000), as explained in D’Haultfœuille and Février (2015).

Our identification argument suggests plug-in estimators of the mixing proportions and the component distributions that are available in closed form. The estimators are based on ratios of intermediate quantiles, and their convergence rate is determined by the theory of tail empirical processes. As we rely on the tail behavior of the component distributions to infer the mixing proportions, our estimators converge more slowly than the parametric rate. If the mixing proportions were known, or could be estimated at the parametric rate, the tail restrictions could be dispensed with and the implied estimator of the component distributions would also converge at the parametric rate.

Our estimators are consistent under very weak tail-dominance assumptions. To control for asymptotic bias in their limit distribution we need to impose stronger requirements that prevent the tails of the components from vanishing too quickly. These assumptions rule out the Gaussian location model. Such thin-tailed distributions are known to be problematic for inference techniques that rely on tail behavior (Khan and Tamer 2010). However, we show that our assumptions apply to distributions with fatter tails, such as Pareto distributions.

Identification only requires that the variable subject to the exclusion restriction can take on two values. If it can take on more values, the model is overidentified and the specification can be tested.
1. MIXTURES WITH EXCLUSION AND TAIL RESTRICTIONS
Let (Y, X) ∈ ℝ × 𝒳 be random variables. We assume throughout that our mixtures satisfy the following simple exclusion restriction.

ASSUMPTION 1. F(y|x) ≡ P(Y ≤ y | X = x) decomposes as the two-component mixture

F(y|x) = G(y) λ(x) + H(y) (1 − λ(x))    (1.1)

for distribution functions G: ℝ ↦ [0, 1] and H: ℝ ↦ [0, 1] and a function λ: 𝒳 ↦ [0, 1] that maps values x into mixing proportions.

The assumption that the component distributions do not depend on X embodies our exclusion restriction; see also Henry et al. (2014). We complete the mixture model with the following assumption.

ASSUMPTION 2. The mixing proportion λ is non-constant on 𝒳 and is bounded away from zero and one on 𝒳.

Non-constancy of λ gives the variable X relevance. Bounding λ away from zero and one implies that the mixture is irreducible. Our first example has a long history in empirical work (Frisch 1934).
EXAMPLE 1.
Let T denote a binary treatment indicator. Suppose that T is subject to classification error: rather than observing T, we observe misclassified treatment X. The distribution of the outcome variable Y given X = x is

F(y|x) = P(Y ≤ y | T = 1, X = x) λ(x) + P(Y ≤ y | T = 0, X = x) (1 − λ(x)),

with λ(x) = P(T = 1 | X = x). The usual ignorability assumption states that X and Y are independent given T. That is, P(Y ≤ y | T = t, X = x) = P(Y ≤ y | T = t) for t ∈ {0, 1}, in which case the decomposition of F(y|x) reduces to the model in (1.1) with G(y) = P(Y ≤ y | T = 1) and H(y) = P(Y ≤ y | T = 0). Also note that λ is non-constant unless misclassification in T is completely random.

The identification of treatment effects when the treatment indicator is mismeasured has received considerable attention, especially in the context of regression models (Bollinger 1996; Mahajan 2006; Lewbel 2007). Here, the conditional ignorability assumption that validates our exclusion restriction relies on non-differential misclassification error. It has been routinely used elsewhere (Carroll et al. 2006).

Our second example deals with regime-switching models, also referred to as hidden Markov models. These models cover switching regressions, which have been used in a variety of settings (see, e.g., Heckman 1974, Hamilton 1989), as well as several versions of stochastic-volatility models (Ghysels et al. 1996).
EXAMPLE 2.
Let Y = (Y_1, ..., Y_T)′ be a time series of outcome variables. A hidden Markov model for the dependency structure in these data assumes that there is a discrete latent series of state variables S = (S_1, ..., S_T)′ having Markovian dependence, that the variables in Y are jointly independent given S, and that P(Y_t ≤ y_t | S = s) = P(Y_t ≤ y_t | S_t = s_t). To see that such a model fits (1.1), assume that there are two latent states 0 and 1 and (for notational simplicity) that S has first-order Markov dependence. Denote X = (Y_1, ..., Y_{t−1})′. Then

F(y_t | x) = P(Y_t ≤ y_t | S_t = 1) P(S_t = 1 | X = x) + P(Y_t ≤ y_t | S_t = 0) P(S_t = 0 | X = x),

which fits our setup. Moreover, λ(x) = P(S_t = 1 | X = x) does vary with x unless the outcomes are independent of the latent states.

In this example, the exclusion restriction follows directly from the Markovian structure of the regime-switching model. Gassiat and Rousseau (2016) obtained nonparametric identification in location models when the matrix of transition probabilities of the Markov chain has full rank. The approach presented here delivers nonparametric identification in a much broader range of models.

Our third example links (1.1) to the recent literature on multivariate mixtures that builds on Hall and Zhou (2003).
EXAMPLE 3.
Suppose Y and X are two measurements that are independent conditional on a latent binary factor T:

P(Y ≤ y, X ≤ x) = P(Y ≤ y | T = 1) P(X ≤ x | T = 1) P(T = 1) + P(Y ≤ y | T = 0) P(X ≤ x | T = 0) P(T = 0).

Then the conditional distribution of Y given X is

F(y|x) = P(Y ≤ y | T = 1) P(T = 1 | X = x) + P(Y ≤ y | T = 0) P(T = 0 | X = x).

This is of the form in (1.1) with G(y) = P(Y ≤ y | T = 1), H(y) = P(Y ≤ y | T = 0), and λ(x) = P(T = 1 | X = x). Note that the bivariate mixture model implies that the distribution of X given Y decomposes in the same way.

Hall and Zhou (2003) showed that multivariate two-component mixtures with conditional independence restrictions are nonparametrically identified from data on three or more measurements and are set identified from data on only two measurements. The results we derive below imply that two measurements can also yield point identification under tail restrictions.
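To make the mechanics of this example concrete, a short simulation can confirm that conditional independence of the two measurements produces the decomposition (1.1). The design below (normal outcome components, a binary second measurement, and all parameter values) is a hypothetical illustration, not taken from the paper:

```python
import random

random.seed(1)
n = 200_000
data = []
for _ in range(n):
    t = 1 if random.random() < 0.4 else 0                  # latent binary factor T
    y = random.gauss(1.0 if t else 0.0, 1.0)               # measurement Y given T
    x = 1 if random.random() < (0.8 if t else 0.3) else 0  # measurement X given T
    data.append((y, x, t))

yq = 0.5  # evaluation point for the distribution functions
nT1 = sum(t for _, _, t in data)
G = sum(y <= yq for y, _, t in data if t == 1) / nT1        # P(Y <= yq | T = 1)
H = sum(y <= yq for y, _, t in data if t == 0) / (n - nT1)  # P(Y <= yq | T = 0)

checks = {}
for xv in (0, 1):
    sub = [(y, t) for y, x, t in data if x == xv]
    F = sum(y <= yq for y, _ in sub) / len(sub)   # F_n(yq | X = xv)
    lam = sum(t for _, t in sub) / len(sub)       # lambda(xv) = P(T = 1 | X = xv)
    checks[xv] = (F, G * lam + H * (1 - lam))     # the two entries agree up to sampling error
    print(xv, checks[xv])
```

Because Y and X are drawn independently given T, the conditional CDF F(y|x) coincides with G(y)λ(x) + H(y)(1 − λ(x)) up to simulation noise.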
We show below that both the mixture components G, H and the mixing proportions λ are identified under the following dominance condition on the tails of the component distributions.

ASSUMPTION 3. (i) The left tail of G is thinner than the left tail of H, i.e.,

lim_{y↓−∞} G(y)/H(y) = 0.

(ii) The right tail of G is thicker than the right tail of H, i.e.,

lim_{y↑+∞} (1 − H(y))/(1 − G(y)) = 0.

Tail dominance is natural in location models.
EXAMPLE 4.
Suppose that Y = µ(T) + U, where T is a binary indicator and U ∼ F, independent of T. Then (1.1) yields

F(y|x) = F(y − µ(1)) P(T = 1 | X = x) + F(y − µ(0)) P(T = 0 | X = x).

Suppose that µ(0) < µ(1), that F is absolutely continuous, and that its hazard rate f(u)/(1 − F(u)) (resp. f(u)/F(u)) goes to +∞ as u ↑ +∞ (resp. u ↓ −∞). Then Assumption 3 holds with G(y) = F(y − µ(1)) and H(y) = F(y − µ(0)).

Proof.
Let us show that Assumption 3(ii) holds. Let φ(u) ≡ −ln(1 − F(u)) and note that φ′(u) = f(u)/(1 − F(u)). Then, by the mean-value theorem,

(1 − F(y − µ(0)))/(1 − F(y − µ(1))) = exp(φ(y − µ(1)) − φ(y − µ(0))) = exp(−φ′(y*)(µ(1) − µ(0)))

for some y* between y − µ(1) and y − µ(0). Since µ(1) > µ(0) and the hazard rate increases without bound as y ↑ +∞, the expression on the right-hand side tends to zero as y increases. Assumption 3(i) can be verified in the same way.

It is important to note that, aside from regularity conditions, we do not impose any shape restrictions on the mixture components outside of the tails. We now show that, combined, our exclusion restriction and tail-dominance assumption identify all elements of the mixture model.
THEOREM 1. Under Assumptions 1–3, G, H, and λ are identified.

Proof.
The proof is constructive. Fix x′ ∈ 𝒳 and choose x″ ∈ 𝒳 so that λ(x′) ≠ λ(x″). Then rearranging (1.1) gives

F(y|x′)/F(y|x″) = [1 + λ(x′)(G(y)/H(y) − 1)] / [1 + λ(x″)(G(y)/H(y) − 1)],

(1 − F(y|x′))/(1 − F(y|x″)) = [λ(x′) + ((1 − H(y))/(1 − G(y)))(1 − λ(x′))] / [λ(x″) + ((1 − H(y))/(1 − G(y)))(1 − λ(x″))].

By Assumption 3, taking limits in the tails yields

ζ⁻(x′, x″) ≡ lim_{y↓−∞} F(y|x′)/F(y|x″) = (1 − λ(x′))/(1 − λ(x″)),
ζ⁺(x′, x″) ≡ lim_{y↑+∞} (1 − F(y|x′))/(1 − F(y|x″)) = λ(x′)/λ(x″).    (1.2)

These two equations can be solved for the mixing proportion at x′, yielding

λ(x′) = [1 − ζ⁻(x″, x′)] / [ζ⁺(x″, x′) − ζ⁻(x″, x′)].    (1.3)

Since λ is non-constant, for any x′ ∈ 𝒳 there exists an x″ ∈ 𝒳 for which such a system of equations can be constructed. The function λ is therefore identified on its entire support. To establish identification of G and H, first note that

G(y) − H(y) = [F(y|x″) − F(y|x′)] / [λ(x″) − λ(x′)]    (1.4)

follows from (1.1). Then, evaluating (1.1) at x″ and rearranging the resulting expression for F(y|x″) gives

H(y) = F(y|x″) − (G(y) − H(y)) λ(x″) = F(y|x″) − [λ(x″)/(λ(x″) − λ(x′))] (F(y|x″) − F(y|x′)),

which is identified.
Furthermore, using (1.2) we can write

H(y) = F(y|x″) − [1/(1 − ζ⁺(x′, x″))] (F(y|x″) − F(y|x′)).    (1.5)

Plugging this expression for H(y) back into the mixture representation of F(y|x″) as in (1.1) further yields

G(y) = F(y|x″) − [1/(1 − ζ⁻(x′, x″))] (F(y|x″) − F(y|x′)),    (1.6)

again using (1.2). This shows that both component distributions are identified, concluding the proof.

If we only assume one-sided tail dominance, then either G or H remains identified.

COROLLARY 1.
Under Assumptions 1 and 2, G is identified if Assumption 3(i) holds and H is identified if Assumption 3(ii) holds.

Proof.
We consider identification of H. Let x′, x″ be as in the proof of Theorem 1. Under Assumption 3(ii) we can still determine ζ⁺(x′, x″) = λ(x′)/λ(x″), from which we can learn the ratio 1/(1 − ζ⁺(x′, x″)). Together with (1.5) this yields H. This concludes the proof of the corollary.
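Because the identification argument is constructive, its closed forms can be verified mechanically on a synthetic two-component model. In the sketch below, the logistic components and the mixing weights are arbitrary illustrative choices, and the tail limits are filled in from their population expressions in (1.2):

```python
import math

# Hypothetical component CDFs (any two distinct CDFs work for this algebra check).
G = lambda y: 1.0 / (1.0 + math.exp(-(y - 1.0)))
H = lambda y: 1.0 / (1.0 + math.exp(-y))

lam1, lam2 = 0.3, 0.7  # lambda(x') and lambda(x''), illustrative values
F = lambda y, lam: G(y) * lam + H(y) * (1.0 - lam)   # the mixture (1.1)

# Tail limits from (1.2), with arguments (x'', x') as used in (1.3).
zminus = (1.0 - lam2) / (1.0 - lam1)
zplus = lam2 / lam1

# (1.3): recover lambda(x') from the two tail limits.
lam1_rec = (1.0 - zminus) / (zplus - zminus)

# (1.5)-(1.6): recover H and G at a test point, using zeta(x', x'').
y = 0.5
zp12 = lam1 / lam2                   # zeta+(x', x'')
zm12 = (1.0 - lam1) / (1.0 - lam2)   # zeta-(x', x'')
H_rec = F(y, lam2) - (F(y, lam2) - F(y, lam1)) / (1.0 - zp12)
G_rec = F(y, lam2) - (F(y, lam2) - F(y, lam1)) / (1.0 - zm12)

print(lam1_rec, H_rec - H(y), G_rec - G(y))  # lambda recovered; residuals near zero
```

The recovered mixing weight equals λ(x′) and the recovered component values match G and H exactly, up to floating-point error, which is the algebraic content of Theorem 1.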
The following example illustrates the usefulness of Corollary 1.
EXAMPLE 5.
Consider a two-regime stochastic-volatility model, which is a special case of Example 2. Assume that the outcome variable Y has mean zero and conditional variance

T σ²_G + (1 − T) σ²_H

for positive constants σ²_G and σ²_H. Suppose that σ²_G > σ²_H. Then G is the distribution associated with a regime that is characterized by relatively higher volatility. In this case, both tails of G dominate those of H. Hence, in Assumption 3, Condition (ii) holds but Condition (i) fails. Nevertheless, the distribution H of the lower-volatility regime remains identified.

Our identification result suggests plug-in estimators of the mixing proportions and the component distributions. The proof of Theorem 1, Equations (1.5)–(1.6) in particular, further shows that our mixture model yields overidentifying restrictions as soon as the instrument can take on more than two values. We turn to estimation in the next section, where we also construct a statistic for a specification test that exploits the invariance of the formulae for G and H in Equations (1.5)–(1.6) to the values x′, x″.
2. ESTIMATION
To motivate the construction of our estimators, we first note that the structure of the model in (1.1) continues to hold when we aggregate across x. Extending our notation to

F(y|A) ≡ P(Y ≤ y | X ∈ A),    λ(A) ≡ Σ_{x∈A} λ(x) P(X = x | X ∈ A),

for any A ⊂ 𝒳, we have

F(y|A) = G(y) λ(A) + H(y)(1 − λ(A)),    (2.1)

which is of the same form as (1.1). Furthermore, the proof of Theorem 1 continues to go through for (2.1); replacing x′ with A and x″ with 𝒳 − A does not alter the argument.

We will assume from now on that X is discrete. As will become apparent, this only entails a loss of generality for the estimation of the function λ, as our estimator will only yield a discretized approximation to it. Extending our results to continuous X would complicate the exposition greatly and we feel that it would only distract from our main argument.

We will work under the following sampling condition.

ASSUMPTION 4. (Y_1, X_1), ..., (Y_n, X_n) is a random sample on (Y, X).

For each A ⊂ 𝒳, let

F_n(y|A) ≡ n_A⁻¹ Σ_{i=1}^{n} 1{Y_i ≤ y, X_i ∈ A},

where n_A ≡ Σ_{i=1}^{n} 1{X_i ∈ A}. For each pair of disjoint subsets A, B of 𝒳 we can generalize (1.2) to

ζ⁻(A, B) ≡ lim_{y↓−∞} F(y|A)/F(y|B) = (1 − λ(A))/(1 − λ(B)),
ζ⁺(A, B) ≡ lim_{y↑+∞} (1 − F(y|A))/(1 − F(y|B)) = λ(A)/λ(B).    (2.2)

For any subsample of size m and integers ι_m and κ_m, let ℓ_m and r_m denote the (ι_m + 1)th and (m − κ_m)th order statistics of Y in this subsample. We estimate the quantities in (2.2) by

ζ⁻_n(A, B) ≡ F_n(ℓ_{n_B}|A)/F_n(ℓ_{n_B}|B),    ζ⁺_n(A, B) ≡ (1 − F_n(r_{n_B}|A))/(1 − F_n(r_{n_B}|B)),    (2.3)

respectively.
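For concreteness, the estimators in (2.3) can be computed directly from the order statistics of the B subsample. The sketch below is a hypothetical illustration: the heavy-tailed sampler, the sample sizes, and the tuning choices ι = κ = 100 are our assumptions, not the paper's:

```python
import random

def zeta_hats(yA, yB, iota, kappa):
    """Plug-in estimates of (zeta^-(A,B), zeta^+(A,B)) as in (2.3).

    yA, yB: lists of Y draws with X in A and X in B, respectively;
    iota, kappa: tail truncation integers for the B subsample."""
    ys = sorted(yB)
    ell = ys[iota]                # the (iota + 1)-th order statistic of Y in B
    r = ys[len(ys) - 1 - kappa]   # the (n_B - kappa)-th order statistic of Y in B
    FA = lambda y: sum(v <= y for v in yA) / len(yA)   # F_n(y | A)
    FB = lambda y: sum(v <= y for v in yB) / len(yB)   # F_n(y | B)
    return FA(ell) / FB(ell), (1.0 - FA(r)) / (1.0 - FB(r))

# Hypothetical design: components with Pareto-type tails (G fat on the right,
# H fat on the left, so Assumption 3 holds), lambda(A) = 0.3, lambda(B) = 0.7.
random.seed(42)

def draw(lam):
    u = 1.0 - random.random()   # uniform on (0, 1]
    if random.random() < lam:   # component G: right tail index 1, left tail index 3
        return u ** -1.0 if random.random() < 0.5 else -(u ** (-1.0 / 3.0))
    return u ** (-1.0 / 3.0) if random.random() < 0.5 else -(u ** -1.0)

yA = [draw(0.3) for _ in range(20000)]
yB = [draw(0.7) for _ in range(20000)]
zm, zp = zeta_hats(yA, yB, 100, 100)
print(zm, zp)  # population targets: (1 - 0.3)/(1 - 0.7) = 7/3 and 0.3/0.7 = 3/7
```

Pushing ι and κ further into the sample reduces variance but increases the tail-approximation bias, which is the trade-off that Assumptions 5, 7, and 8 below are designed to control.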
In our asymptotic theory, we will choose ι_{n_B} and κ_{n_B} so that ℓ_{n_B} ↓ −∞ and r_{n_B} ↑ +∞ as n ↑ +∞ at an appropriate rate. Estimators of both the mixing proportions and the component distributions follow readily along the lines of the proof of Theorem 1; see below. Since their asymptotic distribution will be driven by the large-sample behavior of the estimators of the quantities in (2.3), we start by deriving the statistical properties of these estimators. Throughout this section we fix disjoint sets
A, B and consider the asymptotic behavior of the estimators in (2.3).

Consistency only requires the following rate conditions.
ASSUMPTION 5. ι_{n_B}/√(n_B ln ln n_B) ↑ +∞ and κ_{n_B}/√(n_B ln ln n_B) ↑ +∞ as n ↑ +∞.
THEOREM 2. If Assumptions 1–5 hold, then ζ⁻_n(A, B) →p ζ⁻(A, B) and ζ⁺_n(A, B) →p ζ⁺(A, B) as n ↑ +∞.
Proof.
We prove the theorem for ζ⁺_n; the proof for ζ⁻_n follows in a similar fashion. Write

ζ⁺_n − ζ⁺ = (ζ⁺_n − ζ_{κ_{n_B}}) + (ζ_{κ_{n_B}} − ζ⁺),    (2.4)

for ζ_{κ_{n_B}} ≡ (1 − F(r_{n_B}|A))/(1 − F(r_{n_B}|B)). For the second right-hand side term in (2.4) we have

ζ_{κ_{n_B}} − ζ⁺ = [λ(A) + ((1 − H(r_{n_B}))/(1 − G(r_{n_B})))(1 − λ(A))] / [λ(B) + ((1 − H(r_{n_B}))/(1 − G(r_{n_B})))(1 − λ(B))] − λ(A)/λ(B) = O_p((1 − H(r_{n_B}))/(1 − G(r_{n_B}))) = o_p(1),

by Assumptions 3(ii) and 5. To deal with the first right-hand side term in (2.4), recall that

ζ⁺_n − ζ_{κ_{n_B}} = (1 − F_n(r_{n_B}|A))/(1 − F_n(r_{n_B}|B)) − (1 − F(r_{n_B}|A))/(1 − F(r_{n_B}|B)).

Letting 𝔾_n(y|S) ≡ √n_S (F_n(y|S) − F(y|S)) for any S ⊂ 𝒳, we thus have that

ζ⁺_n − ζ_{κ_{n_B}} = [(1 − F(r_{n_B}|A)) 𝔾_n(r_{n_B}|B)/√n_B − (1 − F(r_{n_B}|B)) 𝔾_n(r_{n_B}|A)/√n_A] / [(1 − F_n(r_{n_B}|B))(1 − F(r_{n_B}|B))]
= (√n_B/κ_{n_B}) [ζ_{κ_{n_B}} 𝔾_n(r_{n_B}|B) − √(n_B/n_A) 𝔾_n(r_{n_B}|A)] = O_a.s.(√(n_B ln ln n_B)/κ_{n_B}),

where the second equality uses 1 − F_n(r_{n_B}|B) = κ_{n_B}/n_B and the last one follows by the law of the iterated logarithm for empirical processes. Thus, from Assumption 5 it follows that |ζ⁺_n − ζ_{κ_{n_B}}| = o_p(1). This completes the proof.

Deriving the limit distribution requires some more care, and three more assumptions. We first impose the following regularity condition on the component distributions.

ASSUMPTION 6. G and H are absolutely continuous on ℝ.

This assumption is very weak. Note that, as we do not require the existence of moments of the component distributions, our results also apply to heavy-tailed distributions such as Cauchy and Pareto distributions.

We will complement Assumption 5 with an additional rate condition.
ASSUMPTION 7. ι_{n_B}/n_B ↓ 0 and κ_{n_B}/n_B ↓ 0 as n ↑ +∞.

Where Assumption 5 required the order statistics to grow to ensure consistency, this assumption bounds this growth rate so that appropriately scaled versions of ζ⁺_n and ζ⁻_n have a limit distribution.

ASSUMPTION 8. (i) G(ℓ_{n_B})/H(ℓ_{n_B}) = o_p(1/√ι_{n_B}); and (ii) (1 − H(r_{n_B}))/(1 − G(r_{n_B})) = o_p(1/√κ_{n_B}).

Assumption 8 rules out distributions whose tails vanish too quickly and ensures that the limit distributions are free of asymptotic bias. We comment on Assumption 8 after we derive the limit distributions of our estimators.

Let ρ_{A,B} ≡ P(X ∈ B)/P(X ∈ A). Note that 0 < ρ_{A,B} < +∞ because of random sampling. Introduce

σ₋²(A, B) ≡ ζ⁻(A, B)² + ρ_{A,B} ζ⁻(A, B),    σ²(A, B) ≡ ζ⁺(A, B)² + ρ_{A,B} ζ⁺(A, B).

Theorem 3 provides the asymptotic properties of the estimators in (2.3) and is the main building block for our subsequent results.
THEOREM 3. If Assumptions 1–8 hold, then as n ↑ +∞,

√ι_{n_B} (ζ⁻_n(A, B) − ζ⁻(A, B)) →d N(0, σ₋²(A, B)),    √κ_{n_B} (ζ⁺_n(A, B) − ζ⁺(A, B)) →d N(0, σ²(A, B)),

and these two estimators are asymptotically independent.
Proof.
We focus on the limit behavior of √κ_{n_B}(ζ⁺_n − ζ⁺) here; the proof of the result for √ι_{n_B}(ζ⁻_n − ζ⁻) follows along similar lines. As in the proof of Theorem 2, write

√κ_{n_B}(ζ⁺_n − ζ⁺) = √κ_{n_B}(ζ⁺_n − ζ_{κ_{n_B}}) + √κ_{n_B}(ζ_{κ_{n_B}} − ζ⁺),    (2.5)

for ζ_{κ_{n_B}} ≡ (1 − F(r_{n_B}|A))/(1 − F(r_{n_B}|B)). Assumption 8 implies that

√κ_{n_B}(ζ_{κ_{n_B}} − ζ⁺) = √κ_{n_B} O_p((1 − H(r_{n_B}))/(1 − G(r_{n_B}))) = o_p(1).

Hence, the second right-hand side term in (2.5) is asymptotically negligible.
We now turn to the first term in (2.5). From the proof of Theorem 2 we have that

√κ_{n_B} (ζ⁺_n − ζ_{κ_{n_B}}) = √(n_B/κ_{n_B}) [ζ_{κ_{n_B}} 𝔾_n(r_{n_B}|B) − √(n_B/n_A) 𝔾_n(r_{n_B}|A)],

where 𝔾_n(y|S) ≡ √n_S (F_n(y|S) − F(y|S)) for any S ⊂ 𝒳. Let α_n(u) ≡ √n (U_n(u) − u) for U_n the empirical cumulative distribution of an i.i.d. sample of size n from a uniform distribution on [0, 1]. By Assumption 6, F(y|S) is continuous in y for all S ⊂ 𝒳. Therefore 𝔾_n(y|A) = α_{n_A}(1 − F(y|A)) and 𝔾_n(y|B) = α_{n_B}(1 − F(y|B)) by an application of the probability integral transform. Hence, we may write

√κ_{n_B} (ζ⁺_n − ζ_{κ_{n_B}}) = ζ_{κ_{n_B}} √(n_B/κ_{n_B}) α_{n_B}(1 − F(r_{n_B}|B)) − √(n_B/κ_{n_B}) √(n_B/n_A) α_{n_A}(1 − F(r_{n_B}|A)).    (2.6)

We study the asymptotic behavior of each of the right-hand side terms in turn.

Start with the first right-hand side term in (2.6). From the definition of the order statistic r_{n_B}, we find by adding and subtracting F_n(r_{n_B}|B) that

1 − F(r_{n_B}|B) = (κ_{n_B}/n_B) (1 + (√n_B/κ_{n_B}) 𝔾_n(r_{n_B}|B));

or, defining ε_n ≡ −(√n_B/κ_{n_B}) 𝔾_n(r_{n_B}|B),

1 − F(r_{n_B}|B) = (κ_{n_B}/n_B)(1 − ε_n).

Therefore we can write

ζ_{κ_{n_B}} √(n_B/κ_{n_B}) α_{n_B}(1 − F(r_{n_B}|B)) = ζ_{κ_{n_B}} √(n_B/κ_{n_B}) α_{n_B}((κ_{n_B}/n_B)(1 − ε_n)).    (2.7)

By the law of the iterated logarithm together with Assumption 5,

ε_n = −(√n_B/κ_{n_B}) O_a.s.(√(ln ln n_B)) = O_a.s.(√(n_B ln ln n_B)/κ_{n_B}) = o_a.s.(1).

Hence 1 − ε_n converges almost surely to 1, and 1 − ε_n ∈ (0, 2) for n large enough. We may then apply Theorem 2.1 in Einmahl (1992) to establish the convergence in distribution of √(n_B/κ_{n_B}) α_{n_B}((κ_{n_B}/n_B)(1 − ε_n)) to a normal random variable with mean zero and variance one. This, together with Equation (2.7) and an application of Slutsky's theorem, implies that

ζ_{κ_{n_B}} √(n_B/κ_{n_B}) α_{n_B}(1 − F(r_{n_B}|B)) →d ζ⁺ Z⁺_B,    (2.8)

where Z⁺_B is a standard normal random variable.

Now turn to the second right-hand side term in (2.6). First observe that

1 − F(r_{n_B}|A) = ζ_{κ_{n_B}} (1 − F(r_{n_B}|B)) = ζ_{κ_{n_B}} (κ_{n_B}/n_B)(1 − ε_n).

Using ρ_{A,B} = lim_{n↑+∞} n_B/n_A, this gives

√(n_B/κ_{n_B}) √(n_B/n_A) α_{n_A}(1 − F(r_{n_B}|A)) = √(ρ_{A,B} ζ⁺) √(n_A/κ̃_{n_A}) α_{n_A}((κ̃_{n_A}/n_A)(1 − ε_n)) + o_p(1),

where κ̃_{n_A} ≡ (κ_{n_B} ζ_{κ_{n_B}})/(n_B/n_A). As κ̃_{n_A} satisfies Assumption (1.5) of Theorem 2.1 in Einmahl (1992), we may apply his theorem again to obtain

√(n_B/κ_{n_B}) √(n_B/n_A) α_{n_A}(1 − F(r_{n_B}|A)) →d √(ρ_{A,B} ζ⁺) Z⁺_A,    (2.9)

where Z⁺_A is a standard-normal random variable which, because of random sampling, is independent of Z⁺_B. Combining (2.6) with (2.8) and (2.9) then gives

√κ_{n_B} (ζ⁺_n − ζ_{κ_{n_B}}) →d ζ⁺ Z⁺_B − √(ρ_{A,B} ζ⁺) Z⁺_A,

as claimed. This concludes the proof.

We finish this section with two examples that specialize Assumption 8 to densities with log-concave tails and Pareto tails, respectively. In both cases, Assumption 8 is implied by the rate conditions in Assumption 7.
EXAMPLE 6. Suppose that G and H have log-concave tails; and for notational simplicity, assume that

−ln(1 − G(y)) ∼ (y/σ⁺_G)^{α⁺_G},    −ln(1 − H(y)) ∼ (y/σ⁺_H)^{α⁺_H},    as y ↑ +∞,

for real numbers α⁺_G, α⁺_H > 0 and σ⁺_G, σ⁺_H > 0, and

−ln G(y) ∼ (−y/σ⁻_G)^{α⁻_G},    −ln H(y) ∼ (−y/σ⁻_H)^{α⁻_H},    as y ↓ −∞,

for real numbers α⁻_G, α⁻_H > 0 and σ⁻_G, σ⁻_H > 0. Then Assumption 7 implies Assumption 8 if both
(i) α⁺_G < α⁺_H, or α⁺_G = α⁺_H and σ⁺_G > σ⁺_H; and
(ii) α⁻_G > α⁻_H, or α⁻_G = α⁻_H and σ⁻_G < σ⁻_H
hold.

Proof.
We verify the second rate; the first follows similarly. Throughout, fix the set B. Assumptions 3(ii) and 7 imply that

1 − F(r_{n_B}|B) = (1 − G(r_{n_B})) λ(B) + (1 − H(r_{n_B}))(1 − λ(B)) = (1 − G(r_{n_B}))(λ(B) + o_p(1)).

Further, because κ_{n_B}/n_B = 1 − F_n(r_{n_B}|B), adding and subtracting F(r_{n_B}|B) gives

κ_{n_B}/n_B = (1 − F(r_{n_B}|B)) + (F_n(r_{n_B}|B) − F(r_{n_B}|B)) = (1 − F(r_{n_B}|B)) + O_a.s.(√((ln ln n_B)/n_B)).

Because (ln ln n_B)/n_B → 0, put together, we find

κ_{n_B}/n_B = C (1 − G(r_{n_B}))(1 + o_p(1))

for some constant C. Since G and H have log-concave tails, it follows from this expression that r_{n_B} behaves asymptotically like (ln n_B)^{1/α⁺_G}. And since

(1 − H(r_{n_B}))/(1 − G(r_{n_B})) ∼ exp((r_{n_B}/σ⁺_G)^{α⁺_G} − (r_{n_B}/σ⁺_H)^{α⁺_H}),

we have that

(1 − H(r_{n_B}))/(1 − G(r_{n_B})) = O_p(exp(−(ln n_B)^{α⁺_H/α⁺_G})) if α⁺_H > α⁺_G, and O_p(n_B^{−c}) for some c > 0 if α⁺_H = α⁺_G and σ⁺_H < σ⁺_G,

from which the conclusion follows.

Example 6 does not cover location models with log-concave distributions in the case when the α and σ parameters of H equal those of G. This includes the location model with Gaussian errors, for which α = 2 and σ is the common standard error. While our estimator remains consistent in such cases, we do not know of general results on tail empirical processes that would yield the asymptotic distribution of the estimator in this knife-edge case. To assess the extent to which the failure of Assumption 8 may play a role for inference, our simulation experiments in Section 3 include a Gaussian location model.
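To see numerically why the Gaussian location model is a knife-edge case for Assumption 8(ii): with r_{n_B} of order √(2 ln n_B), the tail ratio decays like exp(−Δ r_{n_B}), which goes to zero more slowly than any power of n_B. A back-of-the-envelope check (the locations 0 and 1 and the tuning choice κ_n = n^{1/2} are illustrative assumptions):

```python
import math

def gauss_sf(y):
    # Standard-normal survival function 1 - Phi(y) via the complementary error function.
    return 0.5 * math.erfc(y / math.sqrt(2.0))

products = []
for n in (10 ** 3, 10 ** 5, 10 ** 7):
    y = math.sqrt(2.0 * math.log(n))          # rough order of the statistic r_n
    ratio = gauss_sf(y) / gauss_sf(y - 1.0)   # (1 - H(r_n)) / (1 - G(r_n)), locations 0 and 1
    kappa = n ** 0.5                          # illustrative tuning choice kappa_n = n^(1/2)
    products.append(ratio * math.sqrt(kappa))
    print(n, ratio, products[-1])
```

The tail ratio itself shrinks, so consistency is unaffected, but √κ_n times the ratio creeps upward instead of vanishing: Assumption 8(ii) fails, exactly as discussed above.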
Let C denote a generic constant. Suppose that G and H have Pareto tails,i.e., (1 − G ( y )) ∼ C y − α + G , (1 − H ( y )) ∼ C y − α + H , as y ↑ + ∞ , for positive real numbers α + H > α + G and G ( y ) ∼ C ( − y ) − α − G , H ( y ) ∼ C ( − y ) − α − H , as y ↓ −∞ , for positive real numbers α − G < α − H . Then Assumption 7 implies Assumption 8. Proof.
The argument is very similar to the one that was used to verify Example 6. We focus on the right tail; the argument for the left tail is similar. We have

κ_{n_B}/n_B = (1 − G(r_{n_B}))(1 + o_p(1)) = C r_{n_B}^{−α⁺_G}(1 + o_p(1)).

Assumption 8 requires that (1 − H(r_{n_B}))/(1 − G(r_{n_B})) = o_p(1/√κ_{n_B}), that is, that r_{n_B}^{α⁺_G − α⁺_H} = o_p(1/√κ_{n_B}). This rate condition is satisfied when

(n_B/κ_{n_B})^{(α⁺_G − α⁺_H)/α⁺_G} = o_p(1/√κ_{n_B}),

which can be achieved by setting κ_{n_B} = o(n_B^{γ⁺}) for

γ⁺ ≡ (α⁺_H − α⁺_G)/(α⁺_H − α⁺_G/2).    (2.10)

This condition is weaker than Assumption 7 and is therefore implied by it.

Example 7 shows that our methods are well suited to deal with Pareto tails. Pareto tails show up in many economic applications. A time-honored example is income and wealth distributions (Atkinson et al. 2011), which are often modeled as log-normal for most quantiles, combined with a Pareto right tail. More generally, “power laws” have become a popular tool in finance, in studies of firm growth, and in urban economics (see Gabaix 2009 for a recent survey, and Acemoglu et al. 2012 for an application to business cycles). Many recent models of monopolistic competition, as used in international trade for instance, also assume that productivities are Pareto-distributed (Arkolakis et al. 2012).

Let us focus on the right-tail condition. Identification only requires that the tail index of H be larger than that of G, that is, α⁺_H > α⁺_G. Let c⁺ ≡ α⁺_H/α⁺_G > 1. Equation (2.10) then gives a convergence rate arbitrarily close to n^{−β⁺/2} for β⁺ = 2(c⁺ − 1)/(2c⁺ − 1). For example, if c⁺ = 2 then β⁺ = 2/3 and our estimators will converge slightly slower than n^{−1/3}. However, as c⁺ increases, β⁺ becomes closer to one and our estimators will converge at close to the n^{−1/2} parametric rate.

Fix x ∈ 𝒳 and consider estimating λ(x).
Set A = 𝒳 − {x} and B = {x} in (2.2) and solve for λ(x) to get

λ(x) = (1 − ζ⁻(A, x))/(ζ⁺(A, x) − ζ⁻(A, x)).

The mixing proportion λ need not be a strictly monotonic function. Estimating λ(x) by an average of plug-in estimates of (1.3) could therefore be problematic, as the denominator in (1.3) can be zero or be arbitrarily close to it for some pairs of values (x′, x″). We instead estimate the mixing proportion at X = x by a plug-in estimator based on (2.3), that is,

λ_n(x) ≡ (1 − ζ⁻_n(A, x))/(ζ⁺_n(A, x) − ζ⁻_n(A, x)).

This estimator uses observations with X_i ≠ x in a way that immunizes it against small or zero denominators. To present the asymptotic variance of this estimator we need to define

d⁻(x) ≡ (1 − ζ⁺(A, x))/(ζ⁺(A, x) − ζ⁻(A, x))²,    d⁺(x) ≡ (ζ⁻(A, x) − 1)/(ζ⁺(A, x) − ζ⁻(A, x))².    (2.11)

The speed of convergence and the asymptotic distribution of λ_n(x) depend on the ratio c_x ≡ lim_{n↑+∞} ι_{n_x}/κ_{n_x}.
Under the conditions of Theorem 2, |λ_n(x) − λ(x)| = o_p(1) as n ↑ +∞. Under the conditions of Theorem 3,

√ι_{n_x} (λ_n(x) − λ(x)) →d N(0, d−(x)² σ−²(A, x) + c_x d+(x)² σ+²(A, x))    if c_x < +∞,
√κ_{n_x} (λ_n(x) − λ(x)) →d N(0, c_x⁻¹ d−(x)² σ−²(A, x) + d+(x)² σ+²(A, x))    if c_x > 0,

as n ↑ +∞.

Proof.
The consistency claim follows directly from Theorem 2 by an application of the continuous mapping theorem. To establish the asymptotic distribution, note that Theorem 3 states that

√ι_{n_x} (ζ−_n(A, x) − ζ−(A, x)) →d N(0, σ−²(A, x)),    √κ_{n_x} (ζ+_n(A, x) − ζ+(A, x)) →d N(0, σ+²(A, x)),

and that ζ−_n(A, x) and ζ+_n(A, x) are asymptotically independent. An expansion around ζ−(A, x) and ζ+(A, x) then yields

√ι_{n_x} (λ_n(x) − λ(x)) = d−(x) √ι_{n_x} (ζ−_n(A, x) − ζ−(A, x)) + d+(x) √κ_{n_x} (ζ+_n(A, x) − ζ+(A, x)) √(ι_{n_x}/κ_{n_x}) + o_p(1),

which has the limit distribution stated in the theorem if c_x is finite. Also, by the same argument,

√κ_{n_x} (λ_n(x) − λ(x)) = d+(x) √κ_{n_x} (ζ+_n(A, x) − ζ+(A, x)) + d−(x) √ι_{n_x} (ζ−_n(A, x) − ζ−(A, x)) √(κ_{n_x}/ι_{n_x}) + o_p(1)

converges in distribution as stated in the theorem if c_x is non-zero. This verifies the claims and proves the theorem.

To estimate the component distributions, choose B = X \ A so that A and B partition X. Equations (1.5) and (1.6) then suggest the estimators

H_n(y; A, B) ≡ F_n(y|A) − (F_n(y|A) − F_n(y|B)) / (1 − ζ+_n(B, A)),
G_n(y; A, B) ≡ F_n(y|A) − (F_n(y|A) − F_n(y|B)) / (1 − ζ−_n(B, A)).    (2.12)

For notational simplicity we now drop A and B from the arguments: G_n(y) ≡ G_n(y; A, B) and H_n(y) ≡ H_n(y; A, B). To state their asymptotic behavior, let

d_G(A, B; y) ≡ (F(y|A) − F(y|B)) / (1 − ζ−(B, A))²,
d_H(A, B; y) ≡ (F(y|A) − F(y|B)) / (1 − ζ+(B, A))²,

and let ‖·‖_∞ denote the supremum norm.
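The plug-in formulas just introduced, λ_n(x) from (2.3) and the component-CDF estimators in (2.12), can be sketched in a few lines. The sketch below (function names are ours) takes the tail-ratio estimates ζ−_n and ζ+_n as given inputs, since their construction from intermediate order statistics is defined earlier in the paper; the closing check verifies both formulas against the population values of the tail ratios.

```python
import numpy as np

def lambda_hat(zeta_minus, zeta_plus):
    """Plug-in mixing proportion, equation (2.3): (1 - zeta^-)/(zeta^+ - zeta^-)."""
    return (1.0 - zeta_minus) / (zeta_plus - zeta_minus)

def component_cdfs(F_A, F_B, zeta_minus, zeta_plus):
    """Component-CDF estimators, equation (2.12)."""
    diff = np.asarray(F_A) - np.asarray(F_B)
    G_n = F_A - diff / (1.0 - zeta_minus)  # left-tail ratio pins down G
    H_n = F_A - diff / (1.0 - zeta_plus)   # right-tail ratio pins down H
    return G_n, H_n

# Population sanity check with toy component CDFs on [0, 1] (our choices):
# F(y|A) = lam_A*G + (1 - lam_A)*H, and the population tail ratios are
# zeta^+(B, A) = lam_B / lam_A and zeta^-(B, A) = (1 - lam_B)/(1 - lam_A).
y = np.linspace(0.0, 1.0, 11)
G, H = y, y ** 2
lam_A, lam_B = 0.25, 0.75
F_A = lam_A * G + (1 - lam_A) * H
F_B = lam_B * G + (1 - lam_B) * H
zm, zp = (1 - lam_B) / (1 - lam_A), lam_B / lam_A
G_hat, H_hat = component_cdfs(F_A, F_B, zm, zp)  # recover G and H exactly
lam_A_hat = lambda_hat(zm, zp)                   # zeta^±(B, A) inputs recover lam_A
```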
THEOREM 5. Under the conditions of Theorem 2,

‖G_n − G‖_∞ = o_p(1),    ‖H_n − H‖_∞ = o_p(1),

as n ↑ +∞. Under the conditions of Theorem 3,

√ι_{n_A} (G_n(y) − G(y)) →d N(0, d_G(A, B; y)² σ−²(B, A)),
√κ_{n_A} (H_n(y) − H(y)) →d N(0, d_H(A, B; y)² σ+²(B, A)),

as n ↑ +∞ for each y ∈ ℝ.

Proof.
Consistency follows by Theorem 2 and the Glivenko–Cantelli theorem. We establish the asymptotic distribution of G_n; the result for H_n follows by the same argument. First note that

√ι_{n_A} (G_n(y) − G(y)) = T₁ + T₂ + T₃

for

T₁ ≡ √ι_{n_A} (F_n(y|A) − F(y|A)),
T₂ ≡ −(1/(1 − ζ−(B, A))) √ι_{n_A} ({F_n(y|A) − F(y|A)} − {F_n(y|B) − F(y|B)}),
T₃ ≡ −(F_n(y|A) − F_n(y|B)) √ι_{n_A} (1/(1 − ζ−_n(B, A)) − 1/(1 − ζ−(B, A))).

By the Glivenko–Cantelli theorem, T₁ = o_p(1) and T₂ = o_p(1), while

T₃ = −(F(y|A) − F(y|B)) √ι_{n_A} (1/(1 − ζ−_n(B, A)) − 1/(1 − ζ−(B, A))) + o_p(1).

A linearization of this expression in ζ−_n(B, A) − ζ−(B, A), together with an application of Theorem 3 to the partition A, B, then yields the result.

When X can take on more than two values there are multiple ways of choosing the sets A and B. Inspection of the asymptotic variance does not give clear guidance on how to choose A and B in an optimal manner. An ad hoc way to proceed when the number of possible choices for A, B is small is to simply compute estimators for all possible choices. Alternatively, it would be possible to combine estimates based on multiple choices through a minimum-distance procedure. We leave a detailed analysis for future research.

An implication of our model restrictions is that the estimators of G and H in (2.12), when based on different subsets of X, should coincide with one another up to sampling error. This observation suggests the possibility of testing the specification when X can take on more than two values. Theorem 6 provides the relevant asymptotic distributional result to perform this test.
In it we use

Σ_G = d_G(A, C) {d_G(A, C) σ−²(C, A) − d_G(A, B) ζ−(C, A) ζ−(B, A)} + d_G(A, B) {d_G(A, B) σ−²(B, A) − d_G(A, C) ζ−(C, A) ζ−(B, A)}

and

Σ_H = d_H(A, C) {d_H(A, C) σ+²(C, A) − d_H(A, B) ζ+(C, A) ζ+(B, A)} + d_H(A, B) {d_H(A, B) σ+²(B, A) − d_H(A, C) ζ+(C, A) ζ+(B, A)},

where the triple A, B, C constitutes any partition of X and, for any A and B, we write

d_G(A, B) ≡ E[W(Y) d_G(A, B; Y)],    d_H(A, B) ≡ E[W(Y) d_H(A, B; Y)],

for a chosen weight function W that is bounded on ℝ. The choice of these weights should reflect the analyst’s concerns about potential violations of our assumptions in the application under study.

THEOREM 6.
Under the conditions of Theorem 3,

lim_{n↑+∞} P{ |n⁻¹ Σ_{i=1}^n W(Y_i) G_n(Y_i; A, B) − n⁻¹ Σ_{i=1}^n W(Y_i) G_n(Y_i; A, C)| / (√Σ_G / √ι_{n_A}) > z(τ/2) } = τ

and

lim_{n↑+∞} P{ |n⁻¹ Σ_{i=1}^n W(Y_i) H_n(Y_i; A, B) − n⁻¹ Σ_{i=1}^n W(Y_i) H_n(Y_i; A, C)| / (√Σ_H / √κ_{n_A}) > z(τ/2) } = τ,

where z(τ) is the 1 − τ quantile of the standard-normal distribution.

Proof.
We consider only the case of G. The difference G_n(y; A, B) − G_n(y; A, C) equals

(F_n(y|A) − F_n(y|C)) / (1 − ζ−_n(C, A)) − (F_n(y|A) − F_n(y|B)) / (1 − ζ−_n(B, A))

for any y. An expansion around ζ−(C, A) and ζ−(B, A) then shows that the scaled difference √ι_{n_A} (G_n(y; A, B) − G_n(y; A, C)) is asymptotically equivalent to

d_G(A, C; y) √ι_{n_A} (ζ−_n(C, A) − ζ−(C, A)) − d_G(A, B; y) √ι_{n_A} (ζ−_n(B, A) − ζ−(B, A)).
This holds for any y and, therefore, also for the weighted average over y. Together with Theorem 3, this result then readily yields the asymptotic distribution of the difference

n⁻¹ Σ_{i=1}^n W(Y_i) G_n(Y_i; A, B) − n⁻¹ Σ_{i=1}^n W(Y_i) G_n(Y_i; A, C)

and implies the claim of the theorem.

We leave a detailed analysis of the power properties of this specification test for future research. Here, we provide a consistency result against failure of Assumption 3.

EXAMPLE 8.
Suppose that H dominates G in both tails. Then H is no longer identified and

lim_{n↑+∞} P{ |n⁻¹ Σ_{i=1}^n W(Y_i) H_n(Y_i; A, B) − n⁻¹ Σ_{i=1}^n W(Y_i) H_n(Y_i; A, C)| / (√Σ_H / √κ_{n_A}) > z } = 1

for any z.

Proof.
When H dominates G in both tails, a small calculation reveals that ζ+_n(A, B) = ζ−(A, B) + o_p(1), and so √κ_{n_A} |ζ+_n(A, B) − ζ+(A, B)| grows without bound as n ↑ +∞. The conclusion then readily follows from the linearization in the proof of Theorem 6.
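The studentized contrast used in Theorem 6 and in the example above is straightforward to compute once G_n (or H_n) has been evaluated at the data for two partitions. A minimal sketch with Σ_G and ι_{n_A} taken as given (hypothetical inputs; in practice Σ_G would be replaced by a plug-in estimate):

```python
import math
import numpy as np

# Specification-test statistic of Theorem 6: weighted averages of G_n built from
# the partitions (A, B) and (A, C) should agree up to sampling error, so their
# scaled absolute difference is compared with a standard-normal critical value.

def spec_test_stat(W, G_AB, G_AC, Sigma_G, iota_nA):
    diff = np.mean(np.asarray(W) * (np.asarray(G_AB) - np.asarray(G_AC)))
    return abs(diff) / (math.sqrt(Sigma_G) / math.sqrt(iota_nA))

# Reject at level tau when the statistic exceeds z(tau/2), the 1 - tau/2 quantile
# of the standard normal (e.g. 1.96 for tau = 0.05). Under failures of the tail
# conditions, as in the example above, the statistic diverges and the test
# rejects with probability approaching one.
```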
3. SIMULATION EXPERIMENTS
In our numerical illustrations we work with the family of skew-normal distributions (Azzalini, 1985). The skew-normal distribution with location µ, positive scale σ, and skewness parameter β multiplies the density of N(µ, σ²) by a term that skews it to the right if β > 0 and to the left if β < 0:

f(x; µ, σ, β) ≡ (1/σ) φ((x − µ)/σ) × Φ(β (x − µ)/σ) / Φ(0).

Its mean and variance are µ + σδ √(2/π) and σ²(1 − 2δ²/π), respectively, where δ ≡ β/√(1 + β²). Clearly,

f(x; µ, σ, β) → (1/σ) φ((x − µ)/σ) as β → 0.

In our simulations we consider data-generating processes where the outcome is generated as

Y = T V_G + (1 − T) V_H,    (3.1)

where T is a latent binary variable, V_G ∼ G, and V_H ∼ H. Both error distributions G and H are skew-normal with parameters (µ_G, σ_G, β_G) and (µ_H, σ_H, β_H), respectively. From Capitanio (2010) it follows that Assumption 8 holds if G is right-skewed and H is left-skewed. We consider designs where β_G > 0 and β_H < 0 to verify our asymptotics.

When β_G = β_H = 0 and σ_G = σ_H = σ, (3.1) collapses to a standard location model with normal errors,

Y = µ_H + (µ_G − µ_H) T + V,    V ∼ N(0, σ²).    (3.2)

The identifying tail condition in Assumption 3 still holds if µ_G > µ_H, and our estimators remain consistent. However, Assumption 8 now fails, and so we may expect poor inference in this design.

In our experiments we generate a binary X with P(X = 1) = 1/2 and fix the conditional probabilities as

P(T = 0 | X = 0) = 3/4,    P(T = 1 | X = 0) = 1/4,    P(T = 0 | X = 1) = 1/4,    P(T = 1 | X = 1) = 3/4.

We present results for data-generating processes where µ_G = µ = −µ_H and β_G = β = −β_H. We use the designs with µ = 0 and β ∈ {2.5, 5} to evaluate the adequacy of our asymptotic arguments for small-sample inference. We also look at the performance of our estimators when µ ∈ {0.5, 1} and β = 0, which yields the Gaussian location model in (3.2). We fix σ_G = σ_H = 1 throughout.
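The data-generating process (3.1) can be simulated with a standard representation of the skew-normal: if Z₀ and Z₁ are independent standard normals and δ = β/√(1 + β²), then δ|Z₀| + √(1 − δ²) Z₁ has the standard skew-normal distribution with shape β. A sketch of the µ_G = −µ_H = µ, β_G = −β_H = β design (function names and defaults are ours):

```python
import numpy as np

# Simulating design (3.1): Y = T*V_G + (1 - T)*V_H with skew-normal components.
# Uses delta*|Z0| + sqrt(1 - delta^2)*Z1 ~ SN(beta), delta = beta/sqrt(1 + beta^2).

def rskewnorm(rng, n, mu, sigma, beta):
    delta = beta / np.sqrt(1.0 + beta ** 2)
    z0, z1 = rng.standard_normal(n), rng.standard_normal(n)
    return mu + sigma * (delta * np.abs(z0) + np.sqrt(1.0 - delta ** 2) * z1)

def simulate(rng, n, mu=0.0, beta=5.0):
    x = rng.integers(0, 2, size=n)            # P(X = 1) = 1/2
    p_t1 = np.where(x == 0, 0.25, 0.75)       # P(T = 1 | X)
    t = (rng.random(n) < p_t1).astype(float)
    v_g = rskewnorm(rng, n, mu, 1.0, beta)    # G: right-skewed (beta > 0)
    v_h = rskewnorm(rng, n, -mu, 1.0, -beta)  # H: left-skewed
    return x, t * v_g + (1.0 - t) * v_h
```

The sample mean of the skew-normal draws can be checked against the closed-form mean µ + σδ√(2/π) given above.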
For each of these designs we consider choices of the empirical quantiles

ι_{n_x} = C (n_x ln ln n_x)^{1/2},    κ_{n_x} = C (n_x ln ln n_x)^{1/2},

for several choices of the constant C. All of these choices are in line with our asymptotic arguments. The larger the constant C, the more conservative the choice of intermediate quantiles

q_ℓ ≡ ι_{n_x}/n_x,    q_r ≡ (n_x − κ_{n_x})/n_x,

for a given sample size.

We run experiments for a range of sample sizes n. We report (the average over the replications of) q_ℓ and q_r along with the estimation results, to give an idea of how far in the tails of the component distributions the results are obtained. A data-driven determination of the constant C is challenging and is left for future research. For space considerations we report only a subset of the results here; the full set of simulation results is available in the working-paper version of this paper (Jochmans et al., 2014).

Tables 1 and 2 report the results for the mixing proportions λ(0) and λ(1). Each table contains the bias, standard deviation (SD), the ratio of the (average over the replications of the) estimated standard error to the standard deviation (SE/SD), and the coverage of 95% confidence intervals (CI95). All these statistics were computed from the Monte Carlo replications. Table 1 reports results for the simulation design with µ = 0 and β = 5 for three values of C, so as to evaluate the impact of the choice of this tuning parameter on the results. This impact was similar in all other designs and so, for these designs, we present results for only one choice of C. The same constant C was used for all designs except the pure location model with µ = 0.5 and β = 0, where, for practical reasons, we use a slightly more conservative value. These results are bundled in Table 2.
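For a given cell size n_x and constant C, the tuning rule and its implied quantiles can be computed as follows (the rounding to an integer count is our own convention, not specified in the text):

```python
import math

# Intermediate-sequence rule from the simulations: iota_nx = kappa_nx grow like
# C * sqrt(n_x * ln ln n_x), with implied empirical quantiles
# q_l = iota_nx / n_x and q_r = (n_x - kappa_nx) / n_x.

def tail_quantiles(n_x, C):
    k = max(1, int(C * math.sqrt(n_x * math.log(math.log(n_x)))))
    return k / n_x, (n_x - k) / n_x  # (q_l, q_r)

# A larger C moves q_l up and q_r down, i.e. relies on less extreme sample
# quantiles and is in that sense more conservative.
```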
Table 1.
Mixing proportions
(Columns: n, q_ℓ, q_r; BIAS, SD, SE/SD, and CI95 for λ_n(0) and λ_n(1); one block for each of the three values of C.)

The results in Table 1 support our asymptotic theory. For all choices of the tuning parameter C, the bias and standard deviation shrink to zero as n ↑ +∞, and the bias is small relative to the standard error. Furthermore, SE/SD → 1 and the coverage rates of the confidence intervals are close to 0.95 in large samples. The variability of the point estimates is somewhat overestimated when n is very small and C is chosen conservatively. Together with the relatively small bias, this implies that confidence intervals are slightly conservative. For the least conservative choice of C, coverage rates are close to 0.95 even for the smallest samples considered, and for all C the coverage rates move fairly quickly toward 0.95 as n increases. The same conclusions hold for the design with µ = 0 and β = 2.5 (first block of Table 2).

Now turn to the results for the pure location model with Gaussian errors (β = 0) in Table 2, where the tail conditions of Assumption 8 fail. The difference between the two designs is the distance between the component distributions (governed by µ). When µ = 1, G is centered at 1 while H is centered at −1, so that µ_G − µ_H = 2. When µ = 1/2, G and H are closer to each other: µ_G − µ_H = 1. In the first of these designs the bias in the point estimates is somewhat larger than in the skewed designs. Nonetheless, the bias is still small relative to the standard deviation. Furthermore, the coverage of the

Table 2.
Mixing proportions (cont’d)
(Columns: n, q_ℓ, q_r; BIAS, SD, SE/SD, and CI95 for λ_n(0) and λ_n(1); blocks for the designs µ = 0, β = 2.5; µ = 1, β = 0; and µ = 0.5, β = 0.)

confidence intervals displays a similar pattern as before, and is excellent when n is not too small. When we move to the second design the bias increases further. The bias still shrinks to zero as n grows, confirming that our estimator remains consistent. However, the bias is not negligible relative to the standard deviation; the coverage of the confidence intervals deteriorates as n grows, and inference becomes unreliable.

We next turn to the results for the component distributions. For clarity we present the results by means of a series of plots, for a single sample size: the skewed designs µ = 0, β = 5 and µ = 0, β = 2.5 are in Figure 1, and the symmetric designs µ = 1, β = 0 and µ = 0.5, β = 0 are in Figure 2. Results for G_n are in the left-side plots; results for H_n are in the right-side plots. Each plot contains the mean of the point estimates (solid red lines) and the mean of 95% confidence bounds constructed around it using a plug-in estimator of the asymptotic variance in Theorem 5 (dashed blue lines). Each plot also contains the true component distribution (solid black lines, marked ×) and the mean of 95% confidence bounds constructed around the point estimator using the empirical standard deviation over the Monte Carlo replications (dashed green lines, upper band marked △, lower band marked ▽). We vary the range of the vertical axis across the plots in a given figure to enhance visibility.

The plots in Figure 1 again confirm our asymptotics. The bias in the point estimators is small across all plots.
The asymptotic theory mostly does a good job in capturing the small-sample variability of the point estimators although, when n is small, the standard errors are somewhat too small. In our designs, this underestimation is more severe for H_n than for G_n, as is apparent from inspection of the lower-right plot in the figure. Inspection of the full set of results (not reported here) shows that this underestimation vanishes as n grows, again confirming our asymptotic theory.
Figure 1. Simulation results for G_n (left) and H_n (right) for design µ = 0, β = 5 (top) and design µ = 0, β = 2.5 (bottom). Each plot contains the mean of the point estimator (solid red line) and the mean of the estimated confidence bands (dashed blue lines), along with the true curve (solid black line, marked ×) and confidence bands constructed using the Monte Carlo standard deviation (dashed green lines, upper band marked △ and lower band marked ▽).

Figure 2. Simulation results for G_n (left) and H_n (right) for design µ = 1, β = 0 (top) and design µ = 0.5, β = 0 (bottom). Each plot contains the mean of the point estimator (solid red line) and the mean of the estimated confidence bands (dashed blue lines), along with the true curve (solid black line, marked ×) and confidence bands constructed using the Monte Carlo standard deviation (dashed green lines, upper band marked △ and lower band marked ▽).
The results in Figure 2 for the Gaussian location model are in line with our findings concerning the mixing proportions. In the design where µ_G − µ_H = 2 (upper two plots) our estimators do well in spite of Assumption 8 not holding. When µ_G − µ_H = 1 (lower two plots), however, the asymptotic bias in G_n and H_n becomes visible. While the variability of the point estimates is correctly captured by our asymptotic-variance estimator, the confidence bounds settle around an incorrect curve.

4. CONCLUDING REMARKS
We conducted most of our analysis with a mixture of two components. However, some of our results extend to a version of (1.1) with a larger number of components. Suppose that the mixture has J irreducible components, as in

F(y|x) = Σ_{j=1}^{J} λ_j(x) G_j(y),

in obvious notation. Henry et al. (2014) showed that the mixture components and mixing proportions are then only identified up to J(J − 1) inequality-constrained real parameters in general.

Tail-dominance restrictions can still be quite powerful. Take J = 3, for instance, and assume that G₁ dominates in the left tail and G₃ dominates in the right tail. Then it is easy to adapt the proof of Theorem 1 to prove that the behavior of F(y|x) in the left tail identifies the function λ₁ up to a multiplicative constant, and that the behavior of F(y|x) in the right tail identifies the function λ₃ up to another multiplicative constant. Imposing the values of the mixing proportions at one particular value of x would be enough to point-identify all elements of the model, for instance; and it would be easy to adapt our estimators and tests to such a setting. Whether such additional restrictions are plausible is, of course, highly model-dependent.

Notes

We omit conditioning variables throughout. The identification analysis extends straightforwardly. In principle, the distribution theory could be extended by using local empirical-process results along the lines of Einmahl and Mason (1997). We postpone a detailed investigation of such an extension to future work.

Note that irreducibility rules out the possibility of achieving identification of G and H via an identification-at-infinity argument, as in Heckman (1990) and Andrews and Schafgans (1998), for instance.

The expression for λ(x′) in (1.3) also holds for any x″.
This invariance cannot fruitfully be exploited to test the tail restrictions of Assumption 3, however, as the right-hand-side expression in (1.3) is independent of the value x″ even when Assumption 3 fails.

In this design there is a small probability that either q_ℓ = 0 or q_r = 1 when C is small and n is small. This shows up in simulations with a large number of replications, as is the case here. The slightly more conservative choice of C avoids this issue.
REFERENCES
Acemoglu, D., V. Carvalho, A. Ozdaglar, and A. Tahbaz-Salehi (2012). The network origins of aggregate fluctuations. Econometrica 80(5), 1977–2016.
Allman, E. S., C. Matias, and J. A. Rhodes (2009). Identifiability of parameters in latent structure models with many observed variables. Annals of Statistics 37, 3099–3132.
Andrews, D. W. K. and M. M. A. Schafgans (1998). Semiparametric estimation of the intercept of a sample selection model. Review of Economic Studies 65, 497–517.
Arkolakis, C., A. Costinot, and A. Rodríguez-Clare (2012). New trade models, same old gains? American Economic Review 102, 94–130.
Atkinson, A. B., T. Piketty, and E. Saez (2011). Top incomes in the long run of history. Journal of Economic Literature 49, 3–71.
Azzalini, A. (1985). A class of distributions which includes the normal ones. Scandinavian Journal of Statistics 12, 171–178.
Bollinger, C. R. (1996). Bounding mean regressions when a binary regressor is mismeasured. Journal of Econometrics 73, 387–399.
Bonhomme, S., K. Jochmans, and J.-M. Robin (2014). Estimating multivariate latent-structure models. Annals of Statistics, forthcoming.
Bonhomme, S., K. Jochmans, and J.-M. Robin (2016). Nonparametric estimation of finite mixtures from repeated measurements. Journal of the Royal Statistical Society, Series B 78, 211–229.
Bordes, L., S. Mottelet, and P. Vandekerkhove (2006). Semiparametric estimation of a two-component mixture model. Annals of Statistics 34, 1204–1232.
Capitanio, A. (2010). On the approximation of the tail probability of the scalar skew-normal distribution. METRON 68, 299–308.
Carroll, R. J., D. Ruppert, L. A. Stefanski, and C. Crainiceanu (2006). Measurement Error in Nonlinear Models: A Modern Perspective. Chapman and Hall/CRC Press.
D’Haultfœuille, X. and P. Février (2015). Identification of mixture models using support variations. Journal of Econometrics 189, 70–82.
D’Haultfœuille, X. and A. Maurel (2013). Another look at identification at infinity of sample selection models. Econometric Theory 29, 213–224.
Einmahl, J. (1992). Limit theorems for tail processes with application to intermediate quantile estimation. Journal of Statistical Planning and Inference 32, 137–145.
Einmahl, U. and D. Mason (1997). Gaussian approximation of local empirical processes indexed by functions. Probability Theory and Related Fields 107, 283–311.
Frisch, R. (1934). Statistical Confluence Analysis by Means of Complete Regression Systems. Technical Report 5, University of Oslo, Economics Institute, Oslo, Norway.
Gabaix, X. (2009). Power laws in economics and finance. Annual Review of Economics 1, 255–294.
Gassiat, E. and J. Rousseau (2016). Nonparametric finite translation hidden Markov models and extensions. Bernoulli 22, 193–212.
Ghysels, E., A. Harvey, and E. Renault (1996). Stochastic volatility. In G. S. Maddala and C. R. Rao (Eds.), Handbook of Statistics, Volume 14: Statistical Methods in Finance. Elsevier.
Hall, P. and X.-H. Zhou (2003). Nonparametric estimation of component distributions in a multivariate mixture. Annals of Statistics 31, 201–224.
Hamilton, J. D. (1989). A new approach to the economic analysis of nonstationary time series and the business cycle. Econometrica 57, 357–384.
Heckman, J. J. (1974). Shadow prices, market wages, and labor supply. Econometrica 42, 679–694.
Heckman, J. J. (1990). Varieties of selection bias. American Economic Review 80, 313–318.
Henry, M., Y. Kitamura, and B. Salanié (2010). Identifying finite mixtures in econometric models. Cowles Foundation Discussion Paper 1767.
Henry, M., Y. Kitamura, and B. Salanié (2014). Partial identification of finite mixtures in econometric models. Quantitative Economics 5, 123–144.
Hu, Y. and S. M. Schennach (2008). Instrumental variable treatment of nonclassical measurement error models. Econometrica 76, 195–216.
Jochmans, K., M. Henry, and B. Salanié (2014). Inference on mixtures under tail restrictions. Discussion Paper No. 2014-01, Department of Economics, Sciences Po.
Kasahara, H. and K. Shimotsu (2009). Nonparametric identification of finite mixture models of dynamic discrete choices. Econometrica 77, 135–175.
Khan, S. and E. Tamer (2010). Irregular identification, support conditions and inverse weight estimation. Econometrica 78, 2021–2042.
Lewbel, A. (2007). Estimation of average treatment effects with misclassification. Econometrica 75, 537–551.
Mahajan, A. (2006). Identification and estimation of regression models with misclassification. Econometrica 74, 631–665.
Schwarz, M. and S. Van Bellegem (2010). Consistent density deconvolution under partially known error distribution. Statistics and Probability Letters 80, 236–241.
Shimer, R. and L. Smith (2000). Assortative matching and search.