Almost Similar Tests for Mediation Effects and other Hypotheses with Singularities
Kees Jan van Garderen
Amsterdam School of Economics, University of Amsterdam

Noud van Giersbergen
Amsterdam School of Economics, University of Amsterdam
First Version: February 2020
Abstract
Testing for mediation effects is empirically important and theoretically interesting. It is important in psychology, medicine, economics, accountancy, and marketing, for instance, generating over 90,000 citations to a single key paper in the field. It also leads to a statistically interesting and long-standing problem that this paper solves. The no-mediation hypothesis, expressed as H₀: θ₁θ₂ = 0, defines a manifold that is non-regular in the origin, where rejection probabilities of standard tests are extremely low. We propose a general method for obtaining near similar tests using a flexible g-function to bound the critical region. We prove that no similar test exists for mediation, but using our new varying g-method we obtain a test that is all but similar and easy to use in practice. We derive tight upper bounds to similar and nonsimilar power envelopes and derive an optimal test. We extend the test to higher dimensions and illustrate the results in a trade union sentiment application.

Keywords: Varying g-method, Mediation, Indirect Effect, Power Envelope, Similar Tests, Invariant Tests, Optimal Tests

∗ The authors thank Isaiah Andrews, Tim Armstrong, Peter Boswijk, Geert Dhaene, James Duffy, Jean-Marie Dufour, Patrik Guggenberger, Grant Hillier, Max King, Gael Martin, Sophocles Mavroeidis, Geert Mesters, Frank Kleibergen, Anna Mikusheva, Ulrich Müller, Bent Nielsen, Adam McCloskey, Peter Phillips, Mikkel Plagborg-Møller, Richard Smith, Frank Windmeijer, Tiemen Woutersen, and other participants at seminars at the University of Oxford, Monash University, KU Leuven, University of Amsterdam, and the conferences Advances in Econometrics 2018 in London, ANZESG 2019 in Wellington, and (EC)².

Introduction
Testing for mediation effects is empirically extremely important in various scientific disciplines. A key paper in psychology, Baron and Kenny (1986), has more than 90,000 citations and is used in many other fields. Mediation testing is important in accounting, e.g. Coletti et al. (2005), marketing, e.g. MacKenzie et al. (1986), sociology, e.g. Alwin and Hauser (1975), who used the term indirect effect, in epidemiology, e.g. Freedman and Schatzkin (1992), who coined the term intermediate endpoint effect, and in econometrics, e.g. Heckman and Pinto (2015a,b) on treatment effects and production technology. This minimal selection is hardly representative of the vast body of literature on mediation analysis. It only illustrates the breadth of its empirical relevance. Tests for mediation effects can have extremely low power, especially when the effect is small or estimated with large variance. The primary purpose of this paper is to provide a new and powerful test.

The aim of mediation testing is to discover if an independent variable (X) causes a dependent variable (Y) via an intervening or mediating variable (M). The mediating variable is exogenous in the common experimental settings in psychology and other fields, but is also considered exogenous in other settings where assignments are random or constitute a natural experiment. The basic model is:

    Y = τX + βM + u,    (1)
    M = αX + v,         (2)

where all variables are taken in deviation from their means or, more generally, after partialing out other exogenous effects. The disturbances u and v are assumed to be independent because of an experimental setup and, more generally, because no influence of Y on M is assumed in this type of model. This independence is a crucial identification condition. The parameter β cannot be estimated consistently if M is endogenous. We will further make the distributional assumption (uᵢ, vᵢ)′ ∼ IIN(0, diag(σ₁², σ₂²)), i = 1, ..., n, with n the number of observations.
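The model (1)–(2) can be simulated and estimated by OLS equation by equation. A minimal sketch follows; the parameter values (τ = 0.3, β = 0.7, α = 0.5) and the sample size are arbitrary illustrations:

```python
import numpy as np

# Simulate the basic mediation model and estimate it by OLS equation by
# equation. Parameter values and sample size are arbitrary illustrations.
rng = np.random.default_rng(0)
n = 200
x = rng.normal(size=n)
m = 0.5 * x + rng.normal(size=n)            # (2): M = alpha X + v
y = 0.3 * x + 0.7 * m + rng.normal(size=n)  # (1): Y = tau X + beta M + u

# OLS for (1): regress y on (x, m); OLS for (2): regress m on x
tau_hat, beta_hat = np.linalg.lstsq(np.column_stack([x, m]), y, rcond=None)[0]
alpha_hat = (x @ m) / (x @ x)

# OLS regression of y on x alone; the three sets of estimates satisfy the
# exact algebraic identity tau*_hat = tau_hat + alpha_hat * beta_hat
tau_star_hat = (x @ y) / (x @ x)
print(abs(tau_star_hat - (tau_hat + alpha_hat * beta_hat)) < 1e-10)  # True
```

The final check is the standard omitted-variable identity for least squares, which holds exactly in any sample, not just asymptotically.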
This facilitates a likelihood analysis, but has no consequences for the asymptotic normality of the t-statistics that will be used.

MacKinnon et al. (2002) give a literature review and compare 14 different methods for testing the effects of a mediation variable. These methods are based on standardized measures of the product of two coefficients (αβ) or on the difference of two related coefficients (τ* − τ) in equations (1) and (3):

    Y = τ*X + w.    (3)

If there is a mediation effect, then X influences M, such that α ≠ 0, and M influences Y, such that β ≠ 0. If there is no mediation by M, then the effect of X on Y is not altered by the inclusion of M, such that τ* − τ = 0. Since model (3) is a restricted version of (1) with β = 0, it is straightforward to show that the OLS estimates for the three models satisfy τ̂* = τ̂ + α̂β̂, and the relation τ* − τ = αβ also holds in model interpretation terms; see Appendix A. (Baron and Kenny (1986) was cited 90,147 times on 15 January 2020 and 79,205 times on 22 October 2018. A simple extension is to add other explanatory variables Σₖ γₖZₖ to the model; if the covariates Zₖ are added to all three models, then the degrees of freedom of the relevant t-tests are reduced by K.)

The best-known test is the Sobel test based on the ratio α̂β̂/σ̂_αβ, with σ̂_αβ an estimate of the standard error of the product α̂β̂. It is available in standard statistical packages such as SPSS, SAS, and R. It has good properties when either α or β is large and the standard errors of α̂ and β̂ are small, but if the two t-statistics for testing α = 0 and β = 0 tend to be small, properties deteriorate. For parameter values under the null, the Null Rejection Probability (NRP) can be very close to zero and, under the alternative, power can fall far below the size (highest NRP) of 5% that we use throughout. Other tests considered in MacKinnon et al. (2002) suffer from the same problems.

The origin is exceptional even under the null.
The null hypothesis αβ = 0 defines a manifold that is almost everywhere continuously differentiable, with the exception of the origin, which is a singular point. The problematic behavior of the Wald test under the null with singularities is well known and as yet unresolved. Dufour et al. (2017) provide an extensive characterization of the asymptotic null distribution of Wald-type statistics for testing restrictions given by polynomial functions with local singularities. We refer to their comprehensive review of the literature on the problems of Wald tests with singularities. In the case of a single restriction, such as mediation testing, they provide limit distributions and bounds. This only shows the extent of the problems, but does not solve them or show how the Wald test can be salvaged. We will construct a new test that has good power properties uniformly, even in a neighborhood of the singularity.

The distributions of all test statistics considered in the literature depend on the value of the parameters under the null. As a consequence, none of these tests is similar, meaning that its rejection probability is not constant on the boundary of the null hypothesis. In fact, rejection probabilities under alternatives close to the origin can be much lower than the size. These tests are therefore seriously biased, since power can be much less than the NRPs for certain parameter values. The Wald-type (Sobel) test's dependence on the parameter value is extreme in the sense that the asymptotic critical value when both α = 0 and β = 0 is χ₁²(0.95)/4 = 0.96, while for any other value under the null it equals χ₁²(0.95) = 3.84.

This paper proposes a new general method for constructing near similar tests: the varying g-method. It varies a function g that defines the boundary of the critical region to obtain a test that has NRPs as close as possible to 5%. This new method is not limited to the mediation hypothesis, but can be applied to many other testing problems with nuisance parameters to obtain near similar tests more generally.

For the mediation hypothesis we construct this boundary in the space of the two common t-statistics for testing α = 0 or β = 0. We develop a numerical method that does not use simulations to determine this critical region, which has NRPs that are extremely close to 5% for all values of α and β under the null. This requires some computing effort on our part initially, but once completed, our results can be easily implemented in practice using the table or the computer code provided. This test has much better power properties than the Wald and LR tests, but a natural question is whether one can do even better. Determining the quality of the test in absolute terms requires an appropriate power envelope. The power envelope cannot be constructed by point optimal invariant tests based on a simple application of the Neyman-Pearson lemma because the null and alternative are composite. Andrews and Ploberger (1994) address this issue by optimizing weighted power, and recent econometric contributions, including Andrews et al. (2006, 2008), Elliott et al. (2015), and Guggenberger et al. (2019), to name but a few, have considered null and/or alternative weighted mixture distributions such that the Neyman-Pearson lemma can be applied to the resulting point null and point alternative distributions. Within the specified class of mixture distributions, the least favorable distribution is then constructed and a critical value calculated. Any other test in this class has power no higher than the test constructed, resulting in an optimality property.
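The weighted-mixture construction can be sketched as follows. Under the normal approximation T ∼ N(µ, I₂), the composite null is replaced by a weighted mixture over points on the axes, and the Neyman-Pearson lemma is applied to the resulting pair of simple hypotheses. The null grid, the equal weights, and the point alternative below are hypothetical illustrations, not the choices made in any of the cited papers:

```python
import numpy as np
from scipy.stats import norm

# Neyman-Pearson with a weighted mixture replacing the composite null.
# All grid points, weights, and the alternative are hypothetical choices.
rng = np.random.default_rng(1)
mu_alt = np.array([2.0, 2.0])
null_mus = np.array([[0.0, 0.0], [0.0, 1.0], [0.0, 2.0], [0.0, 4.0]])
weights = np.full(len(null_mus), 0.25)

def density(t, mu):
    # N(mu, I2) density evaluated at the rows of t
    return norm.pdf(t[:, 0] - mu[0]) * norm.pdf(t[:, 1] - mu[1])

def lik_ratio(t):
    f0 = sum(w * density(t, mu) for w, mu in zip(weights, null_mus))
    return density(t, mu_alt) / f0

# Critical value: 95% quantile of the ratio under the mixture null
idx = rng.choice(len(null_mus), size=200_000, p=weights)
t0 = null_mus[idx] + rng.normal(size=(200_000, 2))
cv = np.quantile(lik_ratio(t0), 0.95)

# The rejection rate is 5% on average over the mixture, but the NRP at an
# individual null point need not equal 5%: the test is not similar.
nrp_origin = np.mean(lik_ratio(rng.normal(size=(200_000, 2))) >= cv)
```

Computing `nrp_origin` at other null points shows how the rejection probability varies along the null, which is the non-similarity discussed next.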
There is no guarantee, however, that the resulting test is similar, and it can still be seriously biased, as we have confirmed (but not reported) in the mediation context for a variety of mixtures.

We take a more direct approach to constructing the power envelope and maximize power for a grid of points in the alternative. We introduce a class of near similar invariant tests Γ_ε with 0.05 − ε ≤ NRP ≤ 0.05 for a grid of points under H₀ and ε small. The algorithm generates a different test for each grid point in the alternative that is an approximately similar invariant test maximizing power for that point alternative. This provides a power envelope (upper bound) for similar tests. Using the same algorithm we can further determine a power envelope for nonsimilar tests by discarding the near similarity restriction 0.05 − ε ≤ NRP.

We use the near similar power envelope to construct an optimal test within Γ_ε that minimizes the total power difference from the envelope on a grid of points. Its power deviates from the envelope by only a small margin, showing that the potential power loss due to the similarity requirement is small.

Andrews (2012) shows that an exact similar test exists in the related one-sided testing problem H₀: α ≥ 0, β ≥ 0, but that test is randomized and has very low power. Perlman and Wu (1999) coined the term "Emperor's New Tests" for similar tests with very poor properties. Insistence on similarity can render Likelihood Ratio (LR) tests α-inadmissible, cf. Lehmann and Romano (2005, Section 6.7), but Perlman and Wu (1999) give examples where similar tests have extremely undesirable properties yet inadmissible LR tests still provide reasonable answers. In the mediation setting the LR test is much better than the Wald test, as we will show, but still suffers from poor power properties close to the origin and is inadmissible. Our test is non-randomized, has good power properties uniformly superior to the Wald, LR, and LM tests considered here, and would please even statistically erudite emperors.

Moreira and Mourão (2016) consider random critical values. Such an interpretation can be given quite generally to any critical region in a higher-dimensional space, since the boundary of the critical region for one statistic can be expressed as a function of the remaining statistics. Our solution can be framed in terms of a random critical value for the minimum of the absolute t-values. This critical value is a function of the maximum of the absolute t-values.
The critical region that we construct is fixed, however, and not at all random. Our approach appears to lend itself better to multivariate extensions.

Our empirical illustration requires such an extension to three dimensions, and we consider general hypotheses of the form H₀: θ₁ · · · θ_K = 0. In order to derive the critical region for dimensions three and higher, we exploit the symmetries of the testing problem further. The testing problem is invariant to orderings (permutations) of parameters and statistics and to sign changes (reflections), giving rise to a finite group with eight transformations on the parameter and sample space in two dimensions and K!2^K elements in K dimensions. We give the relevant distribution of the maximal invariant and use it to derive a critical region explicitly in two and three dimensions. For three dimensions we use a method to obtain the solution in dimension K from the preceding solution in dimension K − 1. These solutions are dimensionally coherent in the sense that, for extremely large values of a number, k say, of t-statistics, the solution reduces to the (K − k)-dimensional solution, since in such cases it is essentially known that k parameters are non-zero and rejection of the null depends on the remaining (K − k) t-statistics.

An empirical illustration on union sentiment among southern nonunion textile workers in Section 6 shows the practical implementation in two and three dimensions and leads to different conclusions than standard tests. For practitioners the major advantage of our test is that there is a better chance of formally showing that there is a mediation effect. Our test has better power, especially when the two channeling effects are small or less accurately estimated. Given the enormous interest in testing for mediation and the fact that our test can have close to 5% more power than existing tests, many unpublished examples will exist where it can now be concluded that there is a statistically significant mediation effect.

The joint density of (
Y, M) given X can be written as f(Y, M | X; λ) = f(Y | M, X; λ₁) f(M | X; λ₂) with λ₁ = (τ, β, σ₁²)′ and λ₂ = (α, σ₂²)′. The parameters λ₁ and λ₂ vary freely as a result of the triangular structure of the model. The mediation variable is the endogenous variable in (2) but is strongly exogenous for β in (1), since Y is not causal for M. For a sample of n independent observations the loglikelihood equals the sum of the two normal loglikelihoods corresponding to (1) and (2):

    ℓ(λ) ∝ − (1/(2σ₁²)) Σᵢ₌₁ⁿ (yᵢ − τxᵢ − βmᵢ)² − (n/2) log(σ₁²) − (1/(2σ₂²)) Σᵢ₌₁ⁿ (mᵢ − αxᵢ)² − (n/2) log(σ₂²).    (4)

(This can easily be extended to include more regressors/covariates. Instrumental variables can also be used, but note that X and M appear in both equations and that in the standard setup u and v are independent because of the experimental interpretation of M.)
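The loglikelihood (4) separates into the two regressions and can be evaluated directly. A minimal sketch (simulated data, arbitrary parameter values; the variance estimates divide by n, as the ML estimators do):

```python
import numpy as np

def loglik(y, m, x, tau, beta, alpha, s1sq, s2sq):
    # Loglikelihood (4): sum of the normal loglikelihoods of equations
    # (1) and (2), up to an additive constant.
    n = len(y)
    r1 = y - tau * x - beta * m
    r2 = m - alpha * x
    return (-0.5 * (r1 @ r1) / s1sq - 0.5 * n * np.log(s1sq)
            - 0.5 * (r2 @ r2) / s2sq - 0.5 * n * np.log(s2sq))

rng = np.random.default_rng(2)
n = 200
x = rng.normal(size=n)
m = 0.5 * x + rng.normal(size=n)
y = 0.3 * x + 0.7 * m + rng.normal(size=n)

# ML estimates, obtained equation by equation as OLS
th = np.linalg.lstsq(np.column_stack([x, m]), y, rcond=None)[0]
a = (x @ m) / (x @ x)
s1 = ((y - np.column_stack([x, m]) @ th) ** 2).mean()
s2 = ((m - a * x) ** 2).mean()

at_mle = loglik(y, m, x, th[0], th[1], a, s1, s2)
perturbed = loglik(y, m, x, th[0] + 0.2, th[1], a, s1, s2)
print(at_mle > perturbed)  # True: the MLE attains a higher loglikelihood
```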
As a consequence, the Maximum Likelihood Estimators (MLEs) for α and β are the usual OLS estimators for the two equations separately. Furthermore, both the observed and expected Fisher information matrices will be block diagonal in terms of λ₁ and λ₂, as well as in (τ, β)′, σ₁², α, and σ₂². As a result, the standard t-statistics T₁ and T₂ for α and β respectively are asymptotically independent and normally distributed with means µ₁ ≡ α₀/σ_α and µ₂ ≡ β₀/σ_β, where α₀, β₀ denote the true parameter values and σ_α, σ_β the standard deviations of the OLS estimators: (T − µ) →d N(0, I₂). Throughout the rest of the paper we will use the normal distribution for the t-statistics,

    (T − µ) ≡ (T₁ − µ₁, T₂ − µ₂)′ ∼ N(0, I₂),

with the understanding that this is an asymptotic approximation, but exploited as if it were the exact distribution. This is analogous to the assumption in the weak instruments literature that the covariance matrix is known (e.g. Andrews et al. (2006)). The finite-sample distribution involves t-distributions with different degrees of freedom and is complicated by the fact that σ_β depends on M. A strong justification for restricting attention to T, even in finite samples, is the exact result that T is maximal invariant with respect to an appropriate group of (location-scale) transformations that leave the testing problem invariant. Appendix C proves this result and provides further distributional details relevant for the model.

Standard test statistics used in practice have distributions that depend on the parameter values under the null. The rejection probabilities are therefore not constant, and the tests are biased, with power dropping below the size of the test, especially in a neighborhood of the origin.
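The approximate independence and standard-normal behavior of T₁ and T₂ under α = β = 0 can be illustrated by a small Monte Carlo. This is a rough sketch; the sample size, replication count, and the fixed design below are arbitrary choices:

```python
import numpy as np

# Monte Carlo under alpha = beta = 0: the t-statistics T1 (for alpha) and
# T2 (for beta) are approximately independent standard normals.
rng = np.random.default_rng(3)
n, reps = 100, 2000
x = rng.normal(size=n)
t1 = np.empty(reps)
t2 = np.empty(reps)
for r in range(reps):
    m = rng.normal(size=n)             # (2) with alpha = 0
    y = 0.4 * x + rng.normal(size=n)   # (1) with beta = 0, tau = 0.4
    # t-statistic for alpha in the regression of m on x
    a = (x @ m) / (x @ x)
    s2v = ((m - a * x) ** 2).sum() / (n - 1)
    t1[r] = a / np.sqrt(s2v / (x @ x))
    # t-statistic for beta in the regression of y on (x, m)
    Z = np.column_stack([x, m])
    G = np.linalg.inv(Z.T @ Z)
    b = G @ Z.T @ y
    s2u = ((y - Z @ b) ** 2).sum() / (n - 2)
    t2[r] = b[1] / np.sqrt(s2u * G[1, 1])

print(round(abs(float(np.corrcoef(t1, t2)[0, 1])), 2))  # close to 0
```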
We illustrate the issue for the classic trinity of Wald, LR, and LM methods for constructing test statistics.

The Wald test for testing H₀: αβ = 0, together with its asymptotic distribution, is given by Glonek (1993) and further analyzed in Drton and Xiao (2016):

    W = T₁²T₂² / (T₁² + T₂²)  →d  χ₁²    if α = 0 or β = 0, but not both,
                                  χ₁²/4  if α = β = 0.    (5)

The widely used Sobel (1982) test equals √W. The discrete jump in the asymptotic distribution from the origin to any other fixed parameter value is remarkable and shows explicitly that the distribution depends heavily on the parameter values under the null. The critical value is the usual 3.84 in all cases other than the origin. For an NRP of 5% at the origin the critical value should be 0.96, but this would lead to over-rejection for other values under the null and the test would be oversized (size > 5%).

The LR test statistic is

    LR = min{|T₁|, |T₂|},    (6)

and rejects when both H_α: α = 0 and H_β: β = 0 are rejected. In MacKinnon et al. (2002) this is referred to as the test for joint significance, but not identified as the LR test. The rejection probability is P[LR ≥ cv] = P[|T₁| ≥ cv ∩ |T₂| ≥ cv] = P[|T₁| ≥ cv] · P[|T₂| ≥ cv] by independence of T₁ and T₂. These rejection probabilities are monotonically increasing in the absolute values of α and β. Correct size is therefore obtained by choosing the critical value of the test by letting α → ∞ when β = 0, or β → ∞ if α = 0, to guarantee that the rejection probability under the null is always smaller than or equal to the nominal size. The asymptotic 5% critical value is therefore the usual 1.96. The NRP will depend on the values of α and β and vary between the following two extremes:

    P[LR ≥ z_{0.025}] = 0.05            if α → ∞ ∧ β = 0, or β → ∞ ∧ α = 0,
                        0.05² = 0.0025  if α = 0 ∧ β = 0,

where z_{0.025} is the upper 2.5% percentile of the standard normal distribution. For an NRP of 5% at the origin (α, β) = (0, 0), the critical value should equal cv_LR = 1.22, but the test would then be oversized (NRP > 5%) for other parameter values under the null.

The LM test is not uniquely defined, since the null can be specified as H_{ᾱβ}: α = 0 ∧ β ≠ 0, H_{αβ̄}: β = 0 ∧ α ≠ 0, or H_{ᾱβ̄}: α = 0 ∧ β = 0. Explicit expressions for the three LM tests are given in Appendix A. These LM tests are essentially squared t-tests with restricted variance estimates. The three versions can be combined into a single statistic, but its distribution will depend on the true parameter values under the null.

All these classic tests are functions of the two t-statistics, and their distributions, as well as their NRPs, clearly depend on the parameter values under the null; the tests are not similar. A test is called similar on the boundary of H₀ if the probability of rejection of the null is constant for all parameter values on the boundary of H₀ and H₁:

Definition 1
(Similar test.) Let ω ⊂ Θ be the boundary between H₀: θ ∈ Θ₀ and H₁: θ ∈ Θ \ Θ₀. A test is similar on the boundary ω if the null rejection probability does not depend on θ ∈ ω.

For the null hypothesis of no mediation, the boundary consists of the horizontal and vertical axes of the (α, β) space and is equal to H₀ itself. None of the classic tests is similar, and in a neighborhood of the origin the NRPs are close to zero. As a result, the power in a neighborhood of the origin is also close to zero and far below the size of the test, and the tests are biased, since there are parameter values with probability of rejection under the alternative lower than under the null.

Critical Regions

The behavior and construction of the classic test statistics is problematic. Given that no satisfactory adjustments of classic test statistics have been found, despite considerable efforts over recent decades, a different approach is required.

In order to derive an alternative test procedure we shift the focus from the test statistic to the critical region. A critical region defines a test statistic of course, but choosing a class of tests, such as Wald, LR, or LM tests, restricts the shape of the critical region. For the same reason, the tests focusing on improving the standard error of α̂β̂ or (τ̂* − τ̂) analyzed in MacKinnon et al. (2002) limit the shape.

We construct a new test procedure by constructing the critical region directly in the two-dimensional sample space of the t-statistics used in the construction of the tests. We consider critical regions that are bounded by a measurable function g(·) and give the following definition.

Definition 2
(Boundary function of the critical region.) A function g: R₊ → R₊ defines the:

    Critical Region:   CR_g = {(T₁, T₂) ∈ R² : |T₂| ≥ g(|T₁|) ∩ |T₁| ≥ g(|T₂|)},
    Acceptance Region: AR_g = {(T₁, T₂) ∈ R² : |T₂| < g(|T₁|) ∪ |T₁| < g(|T₂|)}.

The justification for considering the t-statistics is threefold. First, the MLE λ̂ = (τ̂, β̂, σ̂₁², α̂, σ̂₂²)′ is a complete minimal sufficient statistic, because the model constitutes a full exponential model given the dimensional equality of the minimal sufficient statistic and the parameter space; see van Garderen (1997). Second, T₁ and T₂ have distributions under the null that are independent of the nuisance parameters τ, σ₁², and σ₂². Finally, T = (T₁, T₂)′ is a maximal invariant under an appropriate group of transformations generalizing the scale invariance of the t-statistics.

In the mediation problem there are further symmetries and invariances. The null hypothesis is not changed if α and β are permuted or their signs changed. Consequently, we can permute the t-statistics and change their signs without affecting the problem. As a consequence, only 1/8th of the two-dimensional sample space of T needs consideration, and we define the critical region in the first octant (east to northeast). The other seven parts follow by symmetry. The test defined by CR_g is indeed invariant to permutations, reflections, and scale transformations. The domain of g(·) can therefore be restricted to the non-negative real line and bounded by the 45° line: g(x) ≤ x.
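For a nondecreasing boundary with g(x) ≤ x, Definition 2 reduces to the simple rule "reject when min(|T₁|, |T₂|) ≥ g(max(|T₁|, |T₂|))". A minimal sketch; the constant boundary used as an example corresponds to the LR (joint significance) test, and the numerical inputs are illustrative:

```python
# Membership in CR_g of Definition 2, and the equivalent min/max rule that
# holds when g is nondecreasing with g(x) <= x.
def in_critical_region(t1, t2, g):
    a1, a2 = abs(t1), abs(t2)
    return a2 >= g(a1) and a1 >= g(a2)

def in_cr_minmax(t1, t2, g):
    lo, hi = sorted((abs(t1), abs(t2)))
    return lo >= g(hi)

# Illustrative boundary: g(x) = min(1.96, x) reproduces the LR test,
# which rejects when both |t|-values exceed 1.96.
g_lr = lambda x: min(1.96, x)

print(in_critical_region(2.5, 2.1, g_lr))  # True: both |t| exceed 1.96
print(in_critical_region(2.5, 1.5, g_lr))  # False: min |t| below 1.96
print(all(in_critical_region(a, b, g_lr) == in_cr_minmax(a, b, g_lr)
          for a in (-3.0, -1.0, 0.5, 2.0, 4.0)
          for b in (-3.0, 0.0, 1.9, 2.5)))  # True: the two rules agree
```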
In Section 5 we show that the ordered absolute t-statistic is a maximal invariant not only for this testing problem but also for its generalization to higher dimensions.

We can put the general definition of a similar test in terms of the boundary function g(·), noting that H₀ is itself the boundary ω of H₀ and H₁:

Definition 3 g(·) is said to be a similar boundary function if the probability of the critical region CR_g defined by g is constant under H₀:

    P[T ∈ CR_g | H₀] = constant  ∀(α, β) ∈ R² with αβ = 0.

Appendix C shows that this also holds exactly in finite samples.

Figure 1 shows the critical regions in terms of (T₁, T₂) for the Wald and LR tests. The LM test is not illustrated because it is not properly and uniquely defined. We show the boundaries for two critical values: one such that for large |α| or |β| the NRP is 5% asymptotically. This value is the usual 3.84 for the Wald test and 1.96 for the LR test. The second, smaller critical value is such that the NRP is 5% when α = β = 0. This value is 0.96 for the Wald test and 1.22 for the LR test. The rejection probabilities are shown as a function of the noncentrality parameter µ₁ = α/σ_α for given µ₂ = 0, such that H₀ holds. For the LR test with critical value 1.96 the NRP goes to 0.05² = 0.0025 when α = β = 0. In the second case, with the smaller critical value 1.22, the NRP is 5% by construction when α = β = 0, but for other values the NRPs are much higher than the nominal size of the test and hence they are not valid 5% tests. The same situation will occur when constructing point optimal invariant tests. The Wald test is considerably worse, with lower NRP over a wider range of µ₁; see Figures 1c and 1d.

The trinity of classic tests is clearly nonsimilar. The question is whether we can do much better. Does there exist a similar test, or is this a problem that is intrinsically unsolvable? Our main theoretical contribution, Theorem 4, states that no similar test exists. The practical answer, however, is that we can get very close to similarity and can do much better in terms of power than existing tests.

Theorem 4
No similar boundary function g(·) exists for testing H₀: αβ = 0.

The proof of this main theorem is given in Appendix B and exploits the symmetries of the problem and the completeness of the normal distribution. We use 5% significance throughout, but it is immediate from the proof that there is no significance level for which there exists a similar boundary function, apart from two trivial exceptions. A size of 0% would yield g(t) = t and AR_g = R², such that the test would never reject. The other trivial solution is g(t) = 0, defining g⁻¹(0) = ∞ accordingly. This test always rejects, leading to an NRP of 100% for all parameter values.

Andrews (2012) proves constructively that a similar test exists for the "one-sided" testing problem H₀: µ ≥ 0 against H₁: µ ≱ 0, but his test is randomized and he makes the point that it has very low power. Andrews shows that on the negative parts of the axes the NRP is 5%, which is (trivial) power in his setting and correct size in ours. One-sided alternatives destroy the symmetry of the problem that we exploit and use to prove the non-existence. We do not consider randomized tests, and our new test below has very good power properties close to the power envelope.

Despite the negative non-existence result of Theorem 4, we construct a critical region that is all but similar, with NRPs that do not differ from 5% in practical terms for any parameter value under the null, and that has good power. This new test is easy to implement using Table 1. We also provide R-code in Appendix E.

The new test is obtained in two steps. We propose a new general method for constructing near similar tests. In the first step we use this method to derive a near similar test for the mediation hypothesis. We derive the power envelope for near similar tests, which can be used to show that the test has good power properties. In the second step, however, we use this power envelope to optimize the test and maximize power within the class of near similar tests.

[Figure 1: Critical regions for Wald (Sobel) and LR tests and their rejection probabilities. Panels: (a) CR Wald (Sobel) test; (b) CR LR test; (c) NRP Wald (Sobel) test; (d) NRP LR test. White areas are the critical regions for the valid 5% tests.]

The Varying g-Method

The new general method for the construction of near similar tests is easily described by three generic steps:

1. Define a flexible boundary g for the critical region in the relevant sample space.
2. Define a criterion function Q(g) that penalizes the deviation of the NRP from 5% for a grid of parameter values under the null (and possibly restrictions on g and other aspects deemed relevant).
3.
Systematically vary and determine g such that it minimizes the criterion function and is therefore as close to similarity as possible in the metric defined by Q.

The relevant sample space is determined by the particular testing problem at hand and may have been reduced by sufficiency, invariance, or other principles, to dimension k, say. The boundary g of the critical and acceptance region is then of dimension (k − 1). There are many ways to parameterize g flexibly, but we will use splines. The criterion function may include aspects other than similarity, for instance smoothness and monotonicity of g, convexity of the critical or acceptance regions, or even rejection probabilities under alternatives. Consequently, Step 3 will generally be a constrained optimization problem. The systematic variation of g is intended to be in line with the optimization routine used to minimize Q as in, e.g., a Newton-Raphson-type procedure.

An explicit implementation of the varying g-method for the mediation problem is given next.

The Basic g-Test

The initial step in the varying g-method is to determine the relevant sample space for the testing problem. The mediation model allows a reduction of the sample space by sufficiency to the MLE of the five parameters. A further reduction to (T₁, T₂) in two dimensions follows from location-scale invariance, as proved in Appendix C. Permutation and reflection symmetries reduce the sample space to one octant, because the absolute order statistic defined as

    (|T|_(1), |T|_(2)) = (min(|T₁|, |T₂|), max(|T₁|, |T₂|))

is a maximal invariant. It has a distribution that depends only on the ordered absolute noncentrality parameter, which is the corresponding maximal invariant in the parameter space:

    (|µ|_(1), |µ|_(2)) = (min(|µ₁|, |µ₂|), max(|µ₁|, |µ₂|)),  with (µ₁, µ₂) = (α/σ_α, β/σ_β).
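The three generic steps of the varying g-method, combined with a spline boundary, can be sketched in a simulation-based toy version. The paper's actual algorithm is numerical and simulation-free; the knot placement, the null grid, the Monte Carlo evaluation of the NRP, and the Nelder-Mead optimizer below are all simplifying assumptions:

```python
import numpy as np
from scipy.optimize import minimize

# Toy varying g-method: linear-spline boundary g on fixed knots, fitted so
# that NRPs on a grid of null points are as close to 5% as possible.
rng = np.random.default_rng(4)
draws = rng.normal(size=(20_000, 2))        # (T1, T2) noise, N(0, I2)
knots = np.array([0.0, 1.0, 1.5, 2.0, 2.5, 3.0, 6.0])
null_grid = [0.0, 0.5, 1.0, 2.0, 3.0, 5.0]  # values of mu2 on the null axis

def nrp(vals, mu2):
    # Reject iff min(|T1|,|T2|) >= g(max(|T1|,|T2|)), g piecewise linear
    t = np.abs(draws + np.array([0.0, mu2]))
    lo, hi = t.min(axis=1), t.max(axis=1)
    return np.mean(lo >= np.interp(hi, knots, vals))

def Q(vals):
    # Step 2: penalize deviations of the NRP from 5% on the null grid
    return sum((nrp(vals, mu2) - 0.05) ** 2 for mu2 in null_grid)

v0 = np.minimum(knots, 1.96)                # start from the LR boundary
res = minimize(Q, v0, method="Nelder-Mead",  # Step 3: vary g to minimize Q
               options={"maxiter": 400})
print(res.fun <= Q(v0))                     # True: criterion not worsened
```

Side conditions such as monotonicity of g and g(x) ≤ x are omitted here for brevity; in a constrained implementation they would enter Step 3 as restrictions.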
[Figure 2: Construction of the basic g-function with J = 6, so 8 knots in all, and the resulting CR boundary in (T₁, T₂) space.]

The g-boundary is generally determined by an algorithm. Appendix D shows the basic implementation of the varying g-method using linear splines with J + 2 knots, with the first and last knots fixed. In spite of its simplicity, it leads to big improvements even for small values of J. Figure 2 illustrates the construction of the g-function for a fixed number of grid points J = 6 and the resulting CR_g in the sample space of (T₁, T₂).

Figure 3 shows the NRPs of the test in comparison to the LR and Wald (Sobel) tests. There is a remarkable gain in the lowest NRP, and therefore local power, from 0.25% to more than 4% already for J = 2. We started with J = 2; after J = 16 the improvements were very small, and with J = 32 there was essentially no improvement. The figure shows a slight over-rejection for some parameter values under H₀, and hence the test is not a valid 5% test. We will correct for this subsequently by imposing strict side conditions on the NRP, because simply increasing penalties on over-rejection will not solve the issue.

Insistence on similarity can have negative consequences for power in general, but not here. Even the basic test with J = 32 has good power, especially in comparison with the Sobel and LR tests. It is uniformly better for all values of the noncentrality parameter µ₁, and in a neighborhood of the origin with µ₂ = 0 it is essentially 5 percentage points higher.

[Figure 3: NRPs of basic g-tests for several values of J, compared to the LR and Wald (Sobel) tests.]

[Figure 4: Power comparison of basic g-tests, LR and Wald (Sobel) tests along µ₁ = µ₂ (= µ).]

Denote the rejection probability of the g-test as a function of the noncentrality parameters (µ₁, µ₂) = (α/σ_α, β/σ_β) by:

    π_g(µ₁, µ₂) = P[T ∈ CR_g | (µ₁, µ₂)].    (7)

If µ₁ and/or µ₂ equal 0, then the null hypothesis is true and π_g is the NRP. When both are non-zero, H₀ is false and π_g is the power of the test defined by CR_g. Figure 4 illustrates the power in the 45° direction µ₁ = µ₂, but in other directions power is also superior to the Wald (Sobel) and LR tests.

There is a straightforward explanation for the additional power. The Wald and LR tests both reject much less than 5% near the origin. The critical region can therefore be extended, and the power increased, without failing the size condition. In the origin the NRPs are close to 0% for the LR and Wald (Sobel) tests. By extending the critical region we can therefore gain almost 5 percentage points of power without violating the size condition.
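The rejection probability of the LR (joint significance) test, and hence the 0.25% figure at the origin, follows in closed form from the independence of T₁ and T₂ under the normal approximation. A small sketch:

```python
from scipy.stats import norm

def tail(mu, cv):
    # P[|T| >= cv] for T ~ N(mu, 1)
    return norm.sf(cv - mu) + norm.cdf(-cv - mu)

def lr_reject(mu1, mu2, cv=1.96):
    # P[min(|T1|, |T2|) >= cv] = P[|T1| >= cv] * P[|T2| >= cv]
    return tail(mu1, cv) * tail(mu2, cv)

print(round(lr_reject(0.0, 0.0), 4))  # 0.0025: the NRP at the origin
print(round(lr_reject(0.0, 5.0), 3))  # 0.05: the NRP far out on an axis
```

The gap between 0.0025 and the nominal 0.05 at the origin is exactly the slack that the extended critical region of the g-test converts into power.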
Nevertheless, the LR test has some attractive features, including that rejection for a particular (t₁, t₂) implies rejection for larger values of t₁ and/or t₂, which is intuitive since the evidence against the null is increasing. A disadvantage, however, is that one never rejects when either t₁ or t₂ is smaller than 1.96, and this causes a conservativeness that can be resolved by adding area to the critical region. We elaborate on this after deriving the optimal g-test.

Comparison to the Sobel and LR tests is of limited value because they are very poor for small values of µ. The absolute quality, or even near optimality, of the new g-test can only be assessed by comparing the power surface of the test to the power envelope, or a tight upper bound thereof, for a class of tests that satisfy appropriate invariance, size, and almost-similarity restrictions. Since no exact similar invariant test exists, we introduce a class Γ_ε of near similar tests with NRPs that deviate less than ε from the 5% level, and an operational (super)class Γ_ε^M ⊇ Γ_ε, as follows.

Definition 5
The class Γ_ε of near similar boundary functions with ε > 0 is defined by:

Γ_ε = { g : R → R | sup_{µ2 ≥ 0} P[CR_g | (0, µ2)] ≤ 0.05 and inf_{µ2 ≥ 0} P[CR_g | (0, µ2)] ≥ 0.05 − ε }.

The class Γ_ε^M, with M = {(0, µ2^(ι))}_{ι=1,...,Υ} a set containing Υ points under H0, is defined by:

Γ_ε^M = { g : R → R | sup_{(0,µ2) ∈ M} P[CR_g | (0, µ2)] ≤ 0.05 and inf_{(0,µ2) ∈ M} P[CR_g | (0, µ2)] ≥ 0.05 − ε }.

For ε = 0 the boundary functions in Γ_0 would be similar and, since no such boundary exists by Theorem 1, Γ_0 would be empty. For ε = 0.05, on the other hand, Γ_0.05 contains all tests that satisfy the size condition. For ε close to 0, Γ_ε contains boundaries that are almost similar. The minimum value of ε for which Γ_ε is not empty depends in general on the testing problem.

The class Γ_ε^M can be thought of as a discretization of Γ_ε in the sense that only a grid of points under the null is considered. It imposes fewer restrictions, enforcing the near-similarity conditions at a finite number of points only. As a consequence it may contain boundaries that do not satisfy the size condition at points that are not in M. Obviously Γ_ε ⊆ Γ_ε^M, since for elements of Γ_ε the size and NRP conditions in particular hold at the points in M.

Within the class Γ_ε there is no unique solution. As a consequence one is forced to choose a boundary function from Γ_ε, or in practice from Γ_ε^M, to obtain an operational test. For the construction of the power envelope we can select the test that maximizes the power against a particular point (µ1, µ2) in the alternative. This test is a Point Optimal Invariant Near Similar (POINS) test. Its critical region varies with (µ1, µ2) and no uniformly most powerful test exists within the class Γ_ε. It can be used, however, to construct an upper bound for the power envelope.

Definition 6
The power envelope of a near similar invariant test with ε > 0 is defined as:

π*(µ1, µ2) = max_{g ∈ Γ_ε} P[CR_g | (µ1, µ2)].

For a given set of points M = {(0, µ2^(ι))}_{ι=1,...,Υ} define an upper bound to the power envelope by:

π̄(µ1, µ2) = max_{g ∈ Γ_ε^M} P[CR_g | (µ1, µ2)].

For notational simplicity we have suppressed the dependence on ε and M. Since Γ_ε ⊆ Γ_ε^M and elements of Γ_ε^M do not necessarily satisfy the size condition for all parameter values, it follows that π̄(µ1, µ2) ≥ π*(µ1, µ2), because fewer conditions are imposed. Choosing a finer grid for M will force π̄(µ1, µ2) closer to π*(µ1, µ2), at least in the additional points in M where the size condition is now required to hold. Also note that the "point" optimal g that maximizes power at the point (µ1, µ2) may have undesirable features, such as including parts of the axes in the critical region, even though such observations are perfectly in line with the null hypothesis.

We determine π̄(µ1, µ2) numerically by maximizing the power directly: we select critical region points in the sample space that maximize the probability of rejection when the true density has parameter (µ1, µ2), under the side conditions that the NRP ∈ [0.05 − ε, 0.05] for all parameters (0, µ2) ∈ M. The sample space is decomposed into 285,150 squares, and for each square it is determined whether it should be included in the critical or acceptance region in order to maximize the power while at the same time satisfying the approximate similarity condition. This is repeated for a grid of (µ1, µ2) points. So for each point on the grid the POINS critical region is determined and the power recorded. Appendix D gives details of the algorithm and the optimization routine, which can deal with a large number of variables and side conditions.

By dropping the near similarity restriction 0.05 − ε ≤ NRP in the same algorithm, we can construct a power envelope for nonsimilar tests. The maximal difference from the (higher) nonsimilar power surface is 2% points when power is around 40%, showing that the power loss due to the similarity requirement is small.

The power surface of the basic test based on 32 knots, g_{J=32}, can now be compared to the power envelope upper bound. They are very close over the whole of the parameter space. The test g_{J=32} is oversized, however, and even though the over-rejection seems practically irrelevant, theoretically the test is not a valid 5% test. The power envelope enables us to construct a correctly sized optimal test, derived next, and will show that the upper bound is tight.

The optimal g-test

Having determined an upper bound to the power envelope, we can determine a g-boundary function with a power surface as close as possible to this upper bound. This optimal test is found using the algorithm given in Appendix D. We parsimoniously simplify the g-function to just three clamped splines joined by three linear parts. This function is given in Appendix E and R-code is also provided there. For ease of implementation we give values of g(t) in Table 1. Figure 5 shows the optimal g-boundary test for the mediation problem. The optimal CR_g includes a narrow region close to the 45° line where both t-statistics are of the same magnitude.
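The square-by-square optimization described above can be illustrated with a drastically simplified greedy sketch. This is not the paper's Appendix D routine: the coarse grid, the single alternative point (2, 2), the tiny stand-in set of null points, and the power-per-null-mass greedy rule are all our own simplifications, and only the one-sided size constraint NRP ≤ 0.05 is imposed.

```python
import math

def phi(z):
    # Standard normal density
    return math.exp(-0.5 * z * z) / math.sqrt(2 * math.pi)

def cell_prob(x, y, h, mu1, mu2):
    # Probability of the square [x, x+h) x [y, y+h) under T ~ N((mu1, mu2), I2),
    # approximated by midpoint density times area (crude, but fine for a sketch).
    return phi(x + h / 2 - mu1) * phi(y + h / 2 - mu2) * h * h

h = 0.25
grid = [(i * h, j * h) for i in range(-24, 24) for j in range(-24, 24)]
target = (2.0, 2.0)                               # alternative whose power is maximized
nulls = [(0.0, m) for m in (0.0, 1.0, 2.0, 4.0)]  # small stand-in for the set M

# Rank cells by power gain per unit of worst-case null rejection mass.
cells = []
for (x, y) in grid:
    p_alt = cell_prob(x, y, h, *target)
    p_null = max(cell_prob(x, y, h, m1, m2) for (m1, m2) in nulls)
    cells.append((p_alt / (p_null + 1e-300), x, y))
cells.sort(reverse=True)

# Greedily add cells to the critical region while every NRP stays below 5%.
nrp = [0.0] * len(nulls)
power = 0.0
for _, x, y in cells:
    add = [cell_prob(x, y, h, m1, m2) for (m1, m2) in nulls]
    if all(n + a <= 0.05 for n, a in zip(nrp, add)):
        nrp = [n + a for n, a in zip(nrp, add)]
        power += cell_prob(x, y, h, *target)
print(round(power, 3), [round(n, 3) for n in nrp])
```

The full problem additionally imposes the lower bound 0.05 − ε on each NRP and solves the selection jointly rather than greedily, which is what requires the large-scale routine of Appendix D.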
This is expedient because mediation requires both α and β to be non-zero.

Figure 5: Optimal g-boundary function. The dashed line is the LR boundary.

Figure 6: NRP as a function of the noncentrality parameter µ2. The solid line is the optimal test and is strictly between 0.04999 and 0.05. The dotted horizontal line is the uniform 5% of the unattainable similar test. The dashed wave is the NRP of the basic g_{J=32}, which is marginally over-sized.

Table 1: g-function: table entries are g(t) values for the t-value given by the first-column value plus the first-row value. Compare the smallest absolute t-statistic with g(largest absolute t-statistic). Linear interpolation results in an NRP that deviates negligibly from 0.05.

First, the additional power is concentrated around the 45° line, as illustrated by the power surface in Figure 7 showing highest power on the diagonal. Second, the near similarity condition requires additional critical region area in the left corner of the octant, because NRPs are particularly low for small parameter values. The increased power is naturally linked to the increase in Type I error, but correct size of a test by definition merely requires that this is not larger than 5%. Nevertheless, a size (NRP)/power trade-off exists, as well as other compromises that can be assessed using critical region analysis. For instance, it may seem counterintuitive that rejection is not monotonic in t1 and t2, since an increase in both t1 and t2 represents increased evidence against the null. The LR and Wald tests are monotonic in this sense, but this leads to a reduction in power to nearly zero for small parameter values. No observed value t of T will ever lie exactly on the horizontal or vertical axis, and any observed t is therefore more likely under some alternative parameter value than under a value from the null. It is therefore desirable to add area to the LR critical region, even if this results in a non-convex critical region or acceptance region. One could debate whether the acceptance region should not continue along the very narrow region close to the diagonal until e.g. 1.2, but the new g-boundary is the optimal solution to a well-defined problem.

The narrow region of the optimal CR_g is a strict extension of CR_LR, which itself is strictly larger than the Sobel (Wald) critical region CR_W. Since the new test is constructed to satisfy the size condition, we have the following:

Theorem 7
The Sobel/Wald test and the LR test are inadmissible.
Proof. CR_W ⊂ CR_LR ⊂ CR_g, hence P[CR_W] < P[CR_LR] < P[CR_g] ≤ 0.05. The optimal g-test has uniformly higher power and is correctly sized by construction. □

The NRP as a function of the noncentrality parameter µ2 is shown in Figure 6. The difference from 5% is less than 10⁻⁵ and so small that the scale had to be magnified greatly, to an extent that prevents comparison with the LR and Sobel tests in the same graph. We include the NRPs for the basic g_{J=32}, which shows its slight over-rejection.

The power surface of the optimal g-test is very close to the power envelope (upper bound). The maximal difference is so small that the upper bound and the power surface of the optimal g-test look almost identical when graphed. Figure 7 therefore shows only the power surface of the optimal g-test. Finally, the new g-test is optimal for all intents and purposes in a larger class of tests. It is optimal by construction within the class of near similar tests Γ_ε^M, but given the closeness of its power surface to the (non)similar power envelope, there cannot exist a near similar test with materially more power: the new g-test has good properties for all parameter values.

The power surface in Figure 7 shows only the first quadrant of the parameter space of (µ1, µ2). The other quadrants follow by simple permutations and reflections of the parameters.

Figure 7: Power surface of the optimal g-test.

If mediation runs through a chain of effects X → M(0) → ··· → M(K−2) → Y, then K parameters are required to be non-zero for this channel to operate. The empirical example in the next section requires an extension to three dimensions, but there are many other problems in econometrics that involve restrictions that at least one parameter is zero, see e.g. Dufour et al. (2017).
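The higher-dimensional construction below exploits the symmetry group of coordinate permutations and sign changes. A small sketch (helper names are ours) enumerates this group, checks the K!·2^K count, and verifies that the absolute order statistic is invariant under every group element.

```python
import itertools
import math

def group_elements(K):
    # All compositions of a coordinate permutation and coordinate sign changes.
    for perm in itertools.permutations(range(K)):
        for signs in itertools.product((1, -1), repeat=K):
            yield perm, signs

def act(elem, t):
    # Apply a group element h = (perm, signs) to the statistic t.
    perm, signs = elem
    return tuple(s * t[p] for s, p in zip(signs, perm))

def abs_order(t):
    # Absolute order statistic: sorted absolute values.
    return tuple(sorted(abs(x) for x in t))

K = 3
G = list(group_elements(K))
assert len(G) == math.factorial(K) * 2 ** K   # K! * 2^K = 48 for K = 3

# The absolute order statistic is unchanged by every group element:
t = (1.5, -0.3, 2.2)
assert all(abs_order(act(h, t)) == abs_order(t) for h in G)
print(len(G))
```

This only checks invariance; maximality of the invariant (that any two points with the same absolute order statistic are connected by a group element) is the content of Lemma 8 below.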
In K dimensions the null hypothesis is that at least one parameter is zero, and the alternative that all K parameters are non-zero:

H0 : θ1 θ2 ··· θK = 0    H1 : θ1 θ2 ··· θK ≠ 0.

As before we assume that the estimator θ̂ is normally distributed with known covariance matrix Ω = diag(σ1², ..., σK²), such that Ω^(−1/2)(θ̂ − θ) = (T − µ) ∼ N(0, I_K), with elements T_k = θ̂_k/σ_k and noncentrality parameters µ_k = θ_k/σ_k.

In higher dimensions it is even more important to exploit the invariance and symmetry properties of the problem, because they reduce the domain of integration by a factor K!·2^K. The testing problem is invariant to reordering the parameters (permutations) and to sign changes (reflections) of the K parameters {θ_i}_{i=1,...,K}. There is an associated group of transformations on T that leaves the distribution invariant. It can be decomposed into two proper subgroups: the group of permutations, G1 say, with K! elements, and the group of sign changes, G2 say, with 2^K elements (two possible signs for each element). The groups G1 and G2 have only the identity element in common, but are otherwise non-overlapping. The full group G generated by G1 and G2 therefore has K!·2^K elements. In two dimensions this equals eight, in three dimensions 48, in four dimensions 384, etc., with a multiplicative factor 2K to obtain dimension K from the one before.

The density after a sign change in T_k is obtained by a corresponding sign change in µ_k, and for a permutation of T, µ permutes accordingly. Hence for any element h ∈ G we have h·T ∼ N(h·µ, I_K), or P_{hµ}[hT ∈ A] = P_µ[T ∈ A], so the distribution is invariant; see Lehmann and Romano (2005).

Define the absolute order statistic {|T|(1), ..., |T|(K)} as the reordered absolute values of the t-statistics, such that |T|(1) < |T|(2) < ··· < |T|(K), and the absolute order parameter {|µ|(1), ..., |µ|(K)} as the reordered absolute values of the parameters µ_k.

Lemma 8 If T ∼ N(µ, I_K), then the absolute order statistic {|T|(1), ..., |T|(K)} is a maximal invariant statistic and the absolute order parameter {|µ|(1), ..., |µ|(K)} is a maximal invariant parameter under the group of transformations G = G1 × G2. The distribution of {|T|(1), ..., |T|(K)} depends only on {|µ|(1), ..., |µ|(K)}.

Lemma 9
The probability density function of the absolute order statistic is given by:

f(|t|(1), ..., |t|(K)) = perm [ χ(|t|(i), |µ|(j)) ]_{i,j=1,...,K},  (8)

with perm(A) the permanent of the square matrix A, here with (i, j) element χ(|t|(i), |µ|(j)), and χ(x, λ) the noncentral chi distribution with one degree of freedom and noncentrality parameter λ.

Note that the null hypothesis implies |µ|(1) = ··· = |µ|(k) = 0 for some k ≥ 1. We will use density (8) on the relevant domain 0 ≤ |t|(1) ≤ ··· ≤ |t|(K) < ∞ to calculate rejection probabilities based on the critical region defined by the boundary function g(|t|(2), ..., |t|(K)).

Dimensional Coherency.
If it is known that θK ≠ 0, then the null hypothesis reduces to H0 : θ1 θ2 ··· θ(K−1) = 0. This implies that the critical region for the remaining K − 1 t-statistics must reduce to the solution found for K − 1 dimensions when |µK| is large. For large values of |TK| (p-values very small) it is essentially known that µK and θK are non-zero, and the probability of rejection will effectively depend only on the K − 1 remaining t-values. In two dimensions this means that as t2 → ∞ the boundary function g(t2) → 1.96, which is the one-dimensional solution for testing H0 : θ1 = 0. In three dimensions it means that the solution must reduce to the g-test derived in Section 4. For three dimensions we have used a multivariate spline generalization, using barycentric coordinates, of the basic spline version of the varying g-method in two dimensions, and imposed dimensional coherency. This resulted in a maximum of 0.2% points difference from 5%, and hence not as close to similarity as in two dimensions.

The permanent is defined as perm(A) = Σ_{σ ∈ S_n} Π_{i=1}^n a_{i,σ(i)}, with the sum over all permutations σ of the numbers 1, ..., n, akin to the determinant but without the ± signature of the permutation.

The noncentral chi distribution with k degrees of freedom and noncentrality parameter λ has density f(x, k, λ) = e^{−(x²+λ²)/2} x^{k/2} λ^{(2−k)/2} I_{k/2−1}(λx), λ > 0, x > 0, with I_{k/2−1}(·) the modified Bessel function of the first kind.

Figure 8: The g-boundary in 3D. The critical region is furthest removed from the origin and includes e.g. (4, 4, 4). The edges show the 2D solution, since one t-statistic is very large; if two t-statistics are very large, it reduces to 1.96, the 1D solution.

With increasing K, the dimension of the integral and the number of knots needed to define the function g increase. The problem becomes progressively more involved and suffers from the curse of dimensionality. We can determine the solution in dimension K using the solution in dimension K − 1 and dimensional coherency. For K = 3 the solution is given in Figure 8. It uses the solution in two dimensions and an optimized weight function, which leads to a maximum of 0.13% points difference from 5%. In four dimensions we also determined a solution using this method, based on optimized weights and dimensional coherency, but do not report it. Employing this method one could, in principle, recursively determine the (K + 1)-dimensional solution on the basis of the K-dimensional solution, but this is left for future research.

For a numerical illustration, we consider the recursive model of union sentiment among southern nonunion textile workers used by Bollen and Stine (1990). The model is a simplified version of McDonald and Clelland (1984), discussed in some detail by Bollen (1989, p. 82–93):

y  = τ x1 + β1 m1 + β2 m2 + u,
m1 = α1 x2 + v1,  (9)
m2 = β3 m1 + α2 x2 + v2.

It analyses the direct and indirect effects of tenure and age on union sentiment via deference and/or labor activism. Tenure x1 is measured in log years working in a particular textile mill and age x2 is measured in years. The variables sentiment towards unions y, deference (submissiveness) to managers m1, and support for labor activism m2 are measures based on 7, 4, and 9 survey questions respectively. The disturbances (u, v1, v2) are assumed to be uncorrelated across equations and individuals. When they are normally distributed, ML estimation of the system reduces to OLS applied to each equation separately, due to the recursive structure.

Figure 9: Union sentiment mediation graph (tenure and age as causes; deference and activism as mediators).

Table 2: OLS estimates and t-statistics for the union sentiment model (N = 100)

Parameter      α1       α2       β3       β2       β1       τ
Estimate     −0.050    0.057   −0.283    0.987   −0.215    0.720
t-statistic  −1.902    2.709   −3.582    7.120   −1.838    1.777

We use a selection of 100 observations out of the original 173 and focus on three alternative theories of the indirect effects from age to union sentiment: two competing parallel effects, the first being that the age effect is mediated by increased deference, in which case i1 = α1β1 quantifies the indirect effect.
The alternative mediation channel is that activism mediates the effect, such that i2 = α2β2 is the indirect effect. The third channel is a serial effect: age affects deference, which in turn affects activism, which in turn affects union sentiment, such that i3 = α1β3β2 measures the indirect effect. Figure 9 illustrates the three mediation channels. The OLS estimates of the coefficients of the structural equations and their t-statistics are shown in Table 2.

The point estimates of the indirect effects and their t-statistics based on the delta method are shown in Table 3. For the g-test we need the absolute order statistics, evaluate g, and compare. For H0 : i1 = 0 we observe |t(β̂1)| = 1.838 > 0.774 = g(1.902) = g(|t(α̂1)|), and hence reject. For H0 : i2 = 0 we have |t(α̂2)| = 2.709 > 0.960 = g(7.120) = g(|t(β̂2)|) and also reject. Testing the last null hypothesis H0 : i3 = 0 requires the three-dimensional solution given in Figure 8. We have |t(α̂1)| = 1.902 < 1.96 = g(3.582, 7.120) = g(|t(β̂3)|, |t(β̂2)|) and do not reject.

Table 3: Indirect effects with Sobel t-statistics and g-test. * indicates significance at 5%

                Estimate   Sobel t-statistic   g-test
i1 = α1β1        0.011        1.322            1.838* > 0.774 = g(|−1.902|)
i2 = α2β2        0.056        2.532*           2.709* > 0.960 = g(7.120)
i3 = α1β3β2      0.014        1.635            1.902 ≯ 1.96 = g(3.582, 7.120)

The Sobel test with critical value 1.96 concludes that i2 is significant, but does not find enough evidence for the i1 mediation channel. The new g-test, in contrast, concludes that i1 is also significant. Both t-values in this case are smaller than 1.96, so the LR test would not reject either; but the two t-values are of comparable magnitude and the g-test finds a significant mediation effect. For implementation of the g-test only the relevant t-statistics are required. The absolute values are ordered and the smallest value is compared with the value of the g-function evaluated at the largest absolute t-value. This can be looked up in Table 1 (possibly using linear interpolation), or one can use the spline function detailed in Table 4 and coded in R in Appendix E.

For i3 both tests draw the same conclusion. The three t-values involved are not of comparable magnitude, and the t-statistics for β3 and β2 are so large that rejecting the null H0 : α1β3β2 = 0 essentially depends on whether α1 is zero. The corresponding absolute t-value of 1.90 is too small to warrant such a conclusion.
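The application's numbers can be reproduced from the t-statistics alone. The sketch below uses the delta-method identity for the t-statistic of a product of coefficients and the g-boundary values read off above (0.774, 0.960 and the three-dimensional 1.96); the helper names are ours.

```python
import math

# t-statistics from Table 2: age->deference, age->activism,
# deference->activism, activism->sentiment, deference->sentiment
t_a1, t_a2, t_b3, t_b2, t_b1 = -1.902, 2.709, -3.582, 7.120, -1.838

def sobel_t(*ts):
    # Magnitude of the delta-method (Sobel-type) t-statistic of a product of
    # coefficients, expressed through the component t-statistics:
    # 1 / sqrt(sum_i 1/t_i^2); for two terms this equals |t1 t2|/sqrt(t1^2+t2^2).
    return 1.0 / math.sqrt(sum(1.0 / (t * t) for t in ts))

print(round(sobel_t(t_a1, t_b1), 3))         # i1 = alpha1*beta1
print(round(sobel_t(t_a2, t_b2), 3))         # i2 = alpha2*beta2
print(round(sobel_t(t_a1, t_b3, t_b2), 3))   # i3 = alpha1*beta3*beta2

def g_test_reject(ts, g):
    # Reject H0 when the smallest absolute t-statistic exceeds the g-boundary
    # evaluated at the remaining (larger) absolute t-statistics.
    s = sorted(abs(t) for t in ts)
    return s[0] > g(*s[1:])

# Boundary values taken from the tables above: g(1.902) = 0.774, g(7.120) = 0.960.
assert g_test_reject((t_a1, t_b1), lambda t: 0.774)    # 1.838 > 0.774: reject
assert g_test_reject((t_a2, t_b2), lambda t: 0.960)    # 2.709 > 0.960: reject
# Three-dimensional case: g(3.582, 7.120) = 1.96 and 1.902 < 1.96: do not reject
assert not g_test_reject((t_a1, t_b3, t_b2), lambda t1, t2: 1.96)
```

In practice the boundary would of course be evaluated from Table 1 or the spline in Appendix E rather than hard-coded; the lambdas above simply freeze the values relevant to this data set.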
This paper has addressed the mediation problem, which is empirically extremely important, with thousands of applications per year in many different fields including economics, business, marketing, and accounting, and over 90,000 citations to a key reference. Theoretically it is an interesting statistical problem that has generated results dating back to Craig (1936) and continues today, with contributions on the poor performance of the Wald statistic and the construction of similar tests, involving many different hypotheses with singularities in econometrics and elsewhere.

We have proposed a new general method for constructing tests that are as near as possible to similarity. This varying-g method proposes a flexible critical region boundary and minimizes the difference from 5% of the rejection probabilities at a number of points on the boundary of the null hypothesis. Conceptually and practically this is very simple and straightforward to implement. It does not require a choice of mixture distribution, nor the construction of least favorable distributions, which may lead to nonsimilar solutions. Numerically it is also attractive in terms of convergence properties.

(The bootstrap is a popular alternative for testing mediation. Because of the asymmetry of the distribution involved, this is carried out through alternative confidence intervals of the indirect effect; see e.g. MacKinnon et al. (2004) and Preacher and Hayes (2008). It is well known, however, that the bootstrap is not valid here. Simulations we carried out showed that bootstrap tests for mediation based on the generally preferred BCa confidence intervals can have sizes of 8% when n = 100, and higher for smaller n.)

The optimal g-test satisfies the size condition. Its critical region is strictly larger than the LR and Wald critical regions, and the test is therefore strictly and uniformly more powerful: the classic tests are not admissible. For large values of the coefficients the power difference becomes negligible, but when mediation effects are small or have relatively big standard errors, the power can be close to 5% points higher than that of these classic tests. This has important consequences for empirical work: it enables researchers to establish mediation effects in circumstances where one could not show mediation before, due to the extreme conservativeness of standard tests near the origin.

Appendix A Theory
A.1 Elementary Relation
Let y = (y1, ..., yn)′, m = (m1, ..., mn)′, x = (x1, ..., xn)′ be vectors of observables in deviation from their means, such that ȳ = 0, m̄ = 0, x̄ = 0, and let u = (u1, ..., un)′, v = (v1, ..., vn)′ be disturbance vectors. The model is then:

y_i = τ x_i + β m_i + u_i,  (10)
m_i = α x_i + v_i,  (11)

and the restricted version of equation (10) with β = 0 equals:

y_i = τ* x_i + w_i.  (12)

The claim τ̂* = τ̂ + α̂β̂ follows from a standard exercise relating restricted and unrestricted OLS estimators:

τ̂* = (x′x)⁻¹x′y = (x′x)⁻¹x′(x τ̂ + m β̂ + û) = τ̂ + (x′x)⁻¹x′m β̂ + (x′x)⁻¹x′û,

where α̂ = (x′x)⁻¹x′m is the OLS estimator in equation (11) and x′û = 0, since û are the OLS residuals from equation (10) and orthogonal to x.

τ* − τ = αβ follows by substituting (11) in (10):

y_i = τ x_i + β m_i + u_i = (τ + βα) x_i + (β v_i + u_i) = τ* x_i + w_i.

It follows that H0 : αβ = 0 ⇔ H0 : τ* = τ.

A.2 Likelihood
The joint density of (y, m) given x is f(y, m | x) = f(y | m, x) f(m | x), and according to the model:

y_i | m_i, x_i ∼ N(τ x_i + β m_i, σ1²),
m_i | x_i ∼ N(α x_i, σ2²).

Hence the log-likelihood

ℓ = ℓ(τ, β, σ1², α, σ2²) = log f(y, m | x; τ, β, σ1², α, σ2²) = log f(y | m, x; τ, β, σ1²) + log f(m | x; α, σ2²),

for n independent observations equals equation (4), which can be written as:

ℓ ∝ −(1/(2σ1²)) y′y + (τ/σ1²) y′x + (β/σ1²) y′m − (τβ/σ1² − α/σ2²) x′m − ½(β²/σ1² + 1/σ2²) m′m − ½(τ²/σ1² + α²/σ2²) x′x − (n/2) log(σ1²σ2²)
  = η′r − κ(η),

with:

η = ( −1/(2σ1²), τ/σ1², β/σ1², −(τβ/σ1² − α/σ2²), −½(β²/σ1² + 1/σ2²) )′,
r = ( y′y, y′x, y′m, x′m, m′m )′,

and κ some function of η and x′x, which is fixed. Since dim(η) = dim(r), the model is a full exponential model of dimension five by the Koopman–Fisher–Darmois theorem (see van Garderen (1997)), and r is a complete sufficient statistic. The score s(τ, β, σ1², α, σ2²) = s = (s1′, s2′)′ is analogous to the scores of the two separate regression models, since (τ, β, σ1²) appears in the first equation only and (α, σ2²) appears in the second equation only. So:

s = ( (y − τx − βm)′x/σ1², (y − τx − βm)′m/σ1², (y − τx − βm)′(y − τx − βm)/(2σ1⁴) − n/(2σ1²), (m − αx)′x/σ2², (m − αx)′(m − αx)/(2σ2⁴) − n/(2σ2²) )′,

and the Maximum Likelihood Estimator (MLE) equals the MLE for the two equations separately:

(τ̂, β̂)′ = ((x : m)′(x : m))⁻¹ (x : m)′y;  σ̂1² = (1/n) y′M_X y;
α̂ = (x′x)⁻¹ x′m;  σ̂2² = (1/n) m′M_x m,

with M_A = I − A(A′A)⁻¹A′ and X = [x : m] an n × 2 matrix; r is a minimal sufficient and complete statistic.
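The elementary relation τ̂* = τ̂ + α̂β̂ of A.1 and the equation-by-equation OLS form of the MLE can be checked numerically. A minimal sketch with simulated data follows; the true values τ = 0.3, β = 0.7, α = 0.5 are arbitrary, and the identity holds exactly for any sample.

```python
import random

random.seed(0)
n = 200
x = [random.gauss(0, 1) for _ in range(n)]
m = [0.5 * xi + random.gauss(0, 1) for xi in x]                          # m = alpha*x + v
y = [0.3 * xi + 0.7 * mi + random.gauss(0, 1) for xi, mi in zip(x, m)]   # y = tau*x + beta*m + u

def demean(v):
    mu = sum(v) / len(v)
    return [vi - mu for vi in v]

x, m, y = demean(x), demean(m), demean(y)
dot = lambda a, b: sum(ai * bi for ai, bi in zip(a, b))

alpha_hat = dot(x, m) / dot(x, x)          # OLS of m on x (equation (11))
# OLS of y on (x, m) via the 2x2 normal equations (equation (10))
sxx, sxm, smm = dot(x, x), dot(x, m), dot(m, m)
sxy, smy = dot(x, y), dot(m, y)
det = sxx * smm - sxm * sxm
tau_hat = (smm * sxy - sxm * smy) / det
beta_hat = (sxx * smy - sxm * sxy) / det
tau_star = dot(x, y) / dot(x, x)           # restricted OLS of y on x alone (12)

# Elementary relation: restricted slope = direct + indirect effect, exactly
assert abs(tau_star - (tau_hat + alpha_hat * beta_hat)) < 1e-10
print(tau_hat, alpha_hat * beta_hat, tau_star)
```

The assertion holds to floating-point precision because τ̂* = τ̂ + α̂β̂ is an algebraic identity of the OLS estimators, not a large-sample result.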
A.3 Classic Tests

Wald test.
Under H0 : r(α, β) = αβ = 0. Then R(α, β) = ∂r(α, β)/∂(α, β)′ = (β, α)′, and evaluated at the (unrestricted) MLE this equals R(α̂, β̂) = (β̂, α̂)′. The Wald test therefore becomes:

W = α̂β̂ [ (β̂, α̂) diag(σα², σβ²) (β̂, α̂)′ ]⁻¹ α̂β̂ = α̂²β̂²/(α̂²σβ² + β̂²σα²) = T1²T2²/(T1² + T2²).

The Sobel test equals √W and is usually expressed as the square root of the middle expression: α̂β̂/√(α̂²σβ² + β̂²σα²).

LR test.
The maximum value of the log-likelihood can be expressed in terms of the OLS residual sums of squares of the first and second equations, RSS1 and RSS2 respectively:

ℓ(τ̂, β̂, σ̂1², α̂, σ̂2²) ∝ −n/2 − (n/2) log(RSS1/n) − n/2 − (n/2) log(RSS2/n).

Denote the restricted residual sums of squares by RSS̃1 when β = 0 and RSS̃2 when α = 0, and the restricted maximized log-likelihoods by:

ℓ_{α=0}(τ̂, β̂, σ̂1², 0, σ̃2²) ∝ −n/2 − (n/2) log(RSS1/n) − n/2 − (n/2) log(RSS̃2/n),
ℓ_{β=0}(τ̃, 0, σ̃1², α̂, σ̂2²) ∝ −n/2 − (n/2) log(RSS̃1/n) − n/2 − (n/2) log(RSS2/n).

The LR test of the full model with five parameters against the model with the single restriction β = 0 equals:

LR_{β=0} = 2( −(n/2) log(RSS1/n) + (n/2) log(RSS̃1/n) ) = n log( 1 + T2²/n ),

since RSS̃1 = RSS1 + β̂² m′M_x m and T2² = β̂² m′M_x m/σ̂1² = (RSS̃1 − RSS1)/(RSS1/n). Similarly, the LR test for the single restriction α = 0 equals:

LR_{α=0} = n log( 1 + T1²/n ).

The likelihood ratio test for H0 : α = 0 and/or β = 0 uses the maximized log-likelihood under the alternative (the same in both cases) and under the null, which means minimizing over LR_{α=0} and LR_{β=0}, hence:

LR = min{ LR_{α=0}, LR_{β=0} },

which is equivalent to rejecting for large values of min{T1², T2²}, or min{|T1|, |T2|}.

LM tests.
The score version of the LM statistic,

LM = s(λ̃)′ I_λ⁻¹ s(λ̃),

requires the score vector evaluated under the null, and there are three cases: (i) α = 0 ∧ β ≠ 0, (ii) β = 0 ∧ α ≠ 0, (iii) α = 0 ∧ β = 0. The non-zero score components are:

s(λ̃_{α=0}) : ṽ′x/(ṽ′ṽ/n);  s(λ̃_{β=0}) : ũ′M_x m/(ũ′ũ/n);  s(λ̃_{α=0 ∧ β=0}) : both of the above,

with ũ and ṽ the residuals under the respective null. The inverse information matrix is block diagonal, with blocks corresponding to the two regression equations:

I_λ⁻¹ = diag( σ1²(X′X)⁻¹, 2σ1⁴/n, σ2²(x′x)⁻¹, 2σ2⁴/n ).

Hence the three score versions of the LM test equal:

LM_{α=0} = n (x′m)² / (x′x · m′m),
LM_{β=0} = n (ũ′M_x m)² / (ũ′ũ · v̂′v̂),
LM_{α=0 ∧ β=0} = n (x′m)² / (x′x · m′m) + n (ũ′M_x m)² / (ũ′ũ · ṽ′ṽ),

where v̂ = M_x m, so that v̂′v̂ = m′M_x m.

Appendix B Proof of Theorem 1

The proof is slightly more transparent in terms of the acceptance region. The probability of not rejecting H0 should equal 0.95 uniformly over H0:

P[AR_g | ∀α ∈ R ∧ β = 0] = 0.95 = P[AR_g | ∀β ∈ R ∧ α = 0].

Without loss of generality we set β = 0. Then

P[AR_g | α] = P[ |T2| < g(|T1|) ∪ |T1| < g(|T2|) | β = 0 ∧ α ∈ R ].

Under H0 : β = 0 and α ∈ R, T1 ∼ N(µ, 1) with µ = α/σα and T2 ∼ N(0, 1). By independence and symmetry we have, with φ(·) denoting the standard normal density (see also Figure 2 for the areas of integration):

P[AR_g | µ] = 2 ∫_{−∞}^{+∞} φ(t1 − µ) [ ∫_0^{g(t1)} φ(t2) dt2 + ∫_{g⁻¹(t1)}^{+∞} φ(t2) dt2 ] dt1
  = 2 ∫_{−∞}^{+∞} φ(t1 − µ) [ Φ(g(t1)) − Φ(g⁻¹(t1)) + ½ ] dt1 = 0.95.

This implies restrictions on g(·). Define F(t1) = 2·[ Φ(g(t1)) − Φ(g⁻¹(t1)) + ½ − 0.95/2 ]; then the restrictions become:

0 = ∫_{−∞}^{+∞} φ(t1 − µ) F(t1) dt1  ∀ µ ∈ R.

The normal distribution N(µ, 1) is a one-parameter full exponential family and therefore complete. Hence F(T1) ≡ 0 is the only function with expectation 0 for all values of µ. Consequently g(t) must satisfy:

Φ(g(t)) − Φ(g⁻¹(t)) = −0.025  ∀ t ∈ R.

But g(0) = 0 implies g⁻¹(0) = 0 and hence Φ(g(0)) − Φ(g⁻¹(0)) = 0 ≠ −0.025, and no similar boundary exists.
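The conclusion of the theorem can be illustrated numerically. For the LR-type rule that rejects when min(|T1|, |T2|) > 1.96 (equivalent to the LR test, as shown in A.3), the NRP along the null varies from about 0.25% at the origin to almost 5% far away, so no single boundary of this kind is similar. The sketch below uses the closed form of the NRP rather than the integral above.

```python
import math

def Phi(z):
    # Standard normal CDF via the error function
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2)))

def nrp_minabs(mu, c=1.96):
    # NRP of the test rejecting when min(|T1|, |T2|) > c, under H0 with
    # T1 ~ N(mu, 1) and T2 ~ N(0, 1); by independence it factorizes:
    p1 = 1.0 - (Phi(c - mu) - Phi(-c - mu))   # P[|T1| > c]
    p2 = 1.0 - (Phi(c) - Phi(-c))             # P[|T2| > c], about 0.05
    return p1 * p2

print(round(nrp_minabs(0.0), 4))   # ~0.0025 at the origin
print(round(nrp_minabs(5.0), 4))   # ~0.05 far from the origin
```

The wide range of null rejection probabilities for a fixed boundary is precisely the non-similarity that the varying g-method is designed to reduce.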
Notes

1. The proof can be extended explicitly to a function g(t) that is defined for t ≥ L only, and undefined on [0, L). Implicitly this is already covered by defining g(t) = 0 ∀t ∈ [0, L), since this horizontal line segment contributes no probability to the NRP.

2. We use 5% significance throughout, but it is immediate from the proof that there is no significance level for which a similar boundary function exists, apart from two trivial exceptions. A size of 0% would yield g(t) = t and AR = R², such that the test would never reject. The other trivial solution is g(t) = 0, with g⁻¹(0) = ∞ defined accordingly, which is a test that always rejects, leading to an NRP of 100% for all parameter values.

Appendix C Invariance

When testing the no-mediation hypothesis H0 : αβ = 0, the parameters τ, σ1², σ2² are nuisance parameters. Their values have no influence on whether the null is true or not, and we therefore want a test that is invariant with respect to an appropriate group of transformations that leaves the relevant distributions and hypotheses invariant. All distributions are conditional on x, since it is strictly exogenous, but m is a random variable that depends on x, and y depends on both x and m. In the notation, conditioning on x is implicit, but conditioning on m is explicit. So conditional on x, and under the normality assumption on (u, v), we have the common OLS results, with all variables in deviation from their means:

(τ̂, β̂)′ | m ∼ N( (τ, β)′, σ1² (X′X)⁻¹ ), and s1²/σ1² ∼ χ²,
α̂ = (x′x)⁻¹x′m ∼ N( α, σ2² (x′x)⁻¹ ), and s2²/σ2² ∼ χ²,

with X = [x : m], s1² = y′M_X y = nσ̂1², s2² = m′M_x m = nσ̂2², and the χ² distributions having the appropriate degrees of freedom.
The conditional variance of $(\hat\tau, \hat\beta)'$ depends on $m$ only through $\hat\alpha$ and $s_1^2$ because
$$\left([x:m]'[x:m]\right)^{-1} = \frac{1}{s_1^2\,(x'x)} \begin{bmatrix} m'm & -x'm \\ -m'x & x'x \end{bmatrix} = \begin{bmatrix} (x'x)^{-1} + \hat\alpha^2/s_1^2 & -\hat\alpha/s_1^2 \\ -\hat\alpha/s_1^2 & 1/s_1^2 \end{bmatrix},$$
using:
$$|X'X| = x'x\, m'm - x'm\, m'x = x'x\left(m'm - m'x (x'x)^{-1} x'm\right) = x'x\, m'M_x m = x'x \cdot s_1^2,$$
$$m'm = s_1^2 + m'x (x'x)^{-1} x'm = s_1^2 + x'x\left((x'x)^{-1}x'm\right)^2 = s_1^2 + x'x\, \hat\alpha^2, \qquad x'm = x'x\, \hat\alpha.$$
This implies that $s_2^2$ depends on $m$ only through $\hat\alpha$ and $s_1^2$ since
$$s_2^2 = y'M_X y = y'y - \begin{pmatrix} \hat\tau \\ \hat\beta \end{pmatrix}' \begin{pmatrix} x'x & x'm \\ m'x & m'm \end{pmatrix} \begin{pmatrix} \hat\tau \\ \hat\beta \end{pmatrix}.$$
Hence $s_2^2/\sigma_2^2 \,|\, m \equiv s_2^2/\sigma_2^2 \,|\, \hat\alpha, s_1^2, x \sim \chi^2_{n-3}$. Conditional on $m$, $s_2^2$ is also distributed independently of $(\hat\tau, \hat\beta)'$. Further note that conditional on $m$, or $(\hat\alpha, s_1^2)$, the distribution of $\hat\beta \,|\, \hat\alpha, s_1^2 \sim N(\beta, \sigma_2^2/s_1^2)$ and $\hat\tau \,|\, \hat\alpha, s_1^2, \hat\beta \sim N\!\left(\tau - \hat\alpha(\hat\beta - \beta),\ \sigma_2^2 (x'x)^{-1}\right)$.

Writing the joint density of the sufficient statistics as the product of conditional and marginal distributions we obtain the representation
$$f(\hat\tau, \hat\beta, s_2^2, \hat\alpha, s_1^2) = N\!\left(\tau - \hat\alpha(\hat\beta - \beta),\ \sigma_2^2 (x'x)^{-1}\right) \times N(\beta, \sigma_2^2/s_1^2) \times \left(\sigma_2^2 \chi^2_{n-3}\right) \times N\!\left(\alpha, \sigma_1^2 (x'x)^{-1}\right) \times \left(\sigma_1^2 \chi^2_{n-2}\right),$$
which is equivalent to the likelihood, given in logs in equation (4).

The transformations $s_1^2 \to a_1 s_1^2$ and $s_2^2 \to a_2 s_2^2$ with $a_1, a_2 > 0$ keep $(s_1^2, s_2^2)$ in the same family with $(\sigma_1^2, \sigma_2^2)$ replaced by $(a_1\sigma_1^2, a_2\sigma_2^2)$ and have no (Footnote: Most invariance results in this section are in collaboration with, and thanks to, Hillier (2019).)
effect on the hypotheses. But $\sigma_1^2$ and $\sigma_2^2$ are present in the other components, and we need to transform the remaining variables accordingly as $\hat\alpha \to \sqrt{a_1}\,\hat\alpha$ and $\hat\beta \to \sqrt{a_2/a_1}\,\hat\beta$, with densities
$$\sqrt{a_1}\,\hat\alpha \sim N\!\left(\sqrt{a_1}\,\alpha,\ a_1\sigma_1^2 (x'x)^{-1}\right) \quad \text{and} \quad \sqrt{a_2/a_1}\,\hat\beta \,\big|\, a_1 s_1^2 \sim N\!\left(\sqrt{a_2/a_1}\,\beta,\ a_2\sigma_2^2/(a_1 s_1^2)\right)$$
respectively. Finally, since $\tau$ is not involved in the inference problem, we may transform $\hat\tau \to \sqrt{a_2}(\hat\tau + a_0)$ so that:
$$\sqrt{a_2}(\hat\tau + a_0) \,\big|\, \sqrt{a_1}\,\hat\alpha,\ \sqrt{a_2/a_1}\,\hat\beta,\ a_1 s_1^2 \sim N\!\left(\sqrt{a_2}(\tau + a_0) - \sqrt{a_1}\,\hat\alpha\,\sqrt{a_2/a_1}\left(\hat\beta - \beta\right),\ a_2\sigma_2^2 (x'x)^{-1}\right),$$
which has the same form as before the transformation. These transformations preserve the family of distributions for the sufficient statistics (and MLEs), and transform the mediation effect as $\alpha\beta \to \sqrt{a_2}\,\alpha\beta$. They do not change whether the null hypothesis is true or false, i.e. $H_0$ is true before the transformation iff it is true after. We may therefore state:

Proposition 10
The testing problem is invariant under the group $K$ of transformations acting on $(\hat\tau, \hat\beta, s_2^2, \hat\alpha, s_1^2)$ defined by
$$(\hat\tau, \hat\beta, s_2^2, \hat\alpha, s_1^2) \to \left(\sqrt{a_2}(\hat\tau + a_0),\ \sqrt{a_2/a_1}\,\hat\beta,\ a_2 s_2^2,\ \sqrt{a_1}\,\hat\alpha,\ a_1 s_1^2\right), \qquad a_0 \in \mathbb{R},\ a_1, a_2 \in \mathbb{R}_+ .$$
The induced group of transformations $\bar{K}$ acting on the parameter space is
$$(\tau, \beta, \sigma_2^2, \alpha, \sigma_1^2) \to \left(\sqrt{a_2}(\tau + a_0),\ \sqrt{a_2/a_1}\,\beta,\ a_2\sigma_2^2,\ \sqrt{a_1}\,\alpha,\ a_1\sigma_1^2\right).$$

Proposition 11
A maximal invariant statistic under the group of transformations $K$ is the vector of $t$-statistics:
$$T = (T_1, T_2)' = \left( \hat\alpha \Big/ \sqrt{\tfrac{s_1^2}{n-2}(x'x)^{-1}},\ \ \hat\beta \Big/ \sqrt{\tfrac{s_2^2/s_1^2}{n-3}} \right)' .$$
A parameter-space maximal invariant under the induced group $\bar{K}$ is
$$\mu = (\mu_1, \mu_2)' = \left( \alpha \Big/ \sqrt{\sigma_1^2 (x'x)^{-1}},\ \ \beta \Big/ \sqrt{\sigma_2^2/(n\sigma_1^2)} \right)' .$$
The distribution of $(T_1, T_2)'$ depends only on $(\mu_1, \mu_2)'$.

Proof.
The transformations on $\hat\tau$ are transitive, so no invariant test can depend on $\hat\tau$. We will therefore restrict further analysis to the four remaining statistics $(\hat\beta, s_2^2, \hat\alpha, s_1^2)$ and use $k$ and $\bar{k}$ to denote the transformations restricted to these four statistics. Invariance of $(T_1, T_2)$ follows immediately upon substitution. Now $T = (T_1, T_2)$ is a maximal invariant if $T_1(\hat\beta, \hat\alpha, s_1^2, s_2^2) = T_1(\tilde\beta, \tilde\alpha, \tilde{s}_1^2, \tilde{s}_2^2)$ and $T_2(\hat\beta, \hat\alpha, s_1^2, s_2^2) = T_2(\tilde\beta, \tilde\alpha, \tilde{s}_1^2, \tilde{s}_2^2)$ imply that there exists a group element $k$ such that $(\tilde\beta, \tilde\alpha, \tilde{s}_1^2, \tilde{s}_2^2) = k(\hat\beta, \hat\alpha, s_1^2, s_2^2)$:
$$T_1:\ \frac{\hat\alpha}{\sqrt{\tfrac{s_1^2}{n-2}(x'x)^{-1}}} = \frac{\tilde\alpha}{\sqrt{\tfrac{\tilde{s}_1^2}{n-2}(x'x)^{-1}}} \ \Rightarrow\ \tilde\alpha = \frac{\sqrt{\tilde{s}_1^2}}{\sqrt{s_1^2}}\,\hat\alpha = \sqrt{a_1}\,\hat\alpha,$$
$$T_2:\ \frac{\hat\beta}{\sqrt{\tfrac{s_2^2/s_1^2}{n-3}}} = \frac{\tilde\beta}{\sqrt{\tfrac{\tilde{s}_2^2/\tilde{s}_1^2}{n-3}}} \ \Rightarrow\ \tilde\beta = \sqrt{\frac{\tilde{s}_2^2/\tilde{s}_1^2}{s_2^2/s_1^2}}\,\hat\beta = \sqrt{a_2/a_1}\,\hat\beta,$$
and therefore $a_1 = \tilde{s}_1^2/s_1^2$ and $a_2 = \tilde{s}_2^2/s_2^2$. The two values $a_1$ and $a_2$ give the correct transformation for $\hat\alpha$ and $\hat\beta$, and the same holds for $s_1^2$ and $s_2^2$: $\tilde{s}_1^2 = a_1 s_1^2$ and $\tilde{s}_2^2 = a_2 s_2^2$. So there is indeed a group element $k$ such that $(\tilde\beta, \tilde\alpha, \tilde{s}_1^2, \tilde{s}_2^2) = k(\hat\beta, \hat\alpha, s_1^2, s_2^2)$. The same argument applies to the parameter space. The last statement is a well-known property of maximal invariants. □

Note that $(T_1, T_2)$ are the basic $t$-statistics for testing $\alpha = 0$ and $\beta = 0$ when treating the two equations separately and estimating by OLS. The estimated standard error for $\hat\alpha$ is the standard formula $\sqrt{\tfrac{s_1^2}{n-2}(x'x)^{-1}}$ and, using the Frisch-Waugh theorem, the estimated standard error for $\hat\beta$ conditional on $m$ and $x$ is $\sqrt{\tfrac{s_2^2}{n-3}(m'M_x m)^{-1}} = \sqrt{\tfrac{s_2^2/s_1^2}{n-3}}$. These exact invariance results provide a strong justification for restricting attention to the two $t$-statistics for any sample size, finite or asymptotic, since it is natural to restrict the problem to procedures that are scale invariant and do not depend on $\tau$. The testing problem has further symmetries.
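To make the statistics concrete, the following sketch (our illustration, with simulated data and unit error variances; the degrees of freedom $n-2$ and $n-3$ follow the demeaned two-equation model above) computes $\hat\alpha$, $\hat\beta$, $s_1^2$, $s_2^2$ and the two $t$-statistics:

```python
import numpy as np

rng = np.random.default_rng(1)
n, alpha, beta, tau = 200, 0.5, 0.3, 0.2  # illustrative parameter values

# simulate the mediation model: m-equation and y-equation
x = rng.standard_normal(n)
m = alpha * x + rng.standard_normal(n)            # error variance sigma1^2 = 1
y = tau * x + beta * m + rng.standard_normal(n)   # error variance sigma2^2 = 1

# all variables in deviation from their means, as in the text
x, m, y = x - x.mean(), m - m.mean(), y - y.mean()

xx = x @ x
alpha_hat = (x @ m) / xx
s1sq = m @ m - (x @ m) ** 2 / xx                  # s1^2 = m' M_x m

X = np.column_stack([x, m])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)      # (tau_hat, beta_hat)
beta_hat = coef[1]
s2sq = y @ y - y @ X @ np.linalg.solve(X.T @ X, X.T @ y)  # s2^2 = y' M_X y

# the two t-statistics (the maximal invariant)
T1 = alpha_hat / np.sqrt(s1sq / (n - 2) / xx)
T2 = beta_hat / np.sqrt(s2sq / (n - 3) / s1sq)
print(T1, T2)
```

These match the usual OLS $t$-ratios from estimating the two equations separately, which is the point of Proposition 11.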
Lemma 8

The problem is invariant to changing the signs (reflections) of $T_1$ and $T_2$, or permuting them. This leads to maximal invariants with a sample and parameter space that is only part of $\mathbb{R}^K$.

Proof. (of Lemma 8) $\{|T|_{(1)}, \ldots, |T|_{(K)}\}$ is obviously invariant to changes in sign and permutation, as a consequence of the absolute values and subsequent sorting. It is a maximal invariant because $\{|T|_{(1)}, \ldots, |T|_{(K)}\} = \{|\tilde{T}|_{(1)}, \ldots, |\tilde{T}|_{(K)}\}$ can only hold for two vectors $T$ and $\tilde{T}$ if $\tilde{T}$ is a permutation of $T$ with a number of sign changes. Hence there will exist a transformation $h = h_1 \cdot h_2 \in G_1 \times G_2$ such that $\tilde{T} = h \cdot T$. The same argument holds for $\{|\mu|_{(1)}, \ldots, |\mu|_{(K)}\}$, since the group of transformations on the parameter space is the same as on the sample space. That the distribution of $\{|T|_{(1)}, \ldots, |T|_{(K)}\}$ depends only on $\{|\mu|_{(1)}, \ldots, |\mu|_{(K)}\}$ is again a property of maximal invariants. Lemma 9 gives an explicit expression that further shows that the distribution is invariant under $G_1 \times G_2$. □

Proof. (of Lemma 9) The absolute value of the normal variate $T_k$ with mean $\mu_k$ and variance 1 follows a noncentral chi-distribution with one degree of freedom. The $K$ distributions $\chi_1\!\left(|t|_{(k)}, |\mu|_{(k)}\right)$ are independent. The result is then a direct application of Vaughan and Venables (1972, eq. 6). □

Appendix D Algorithms

The construction of the optimal $g$-test is in two steps. The first step is a basic implementation of the general varying-$g$ method. This generates a near similar test that deviates less than 0.01 percentage points from 5%. We use this $\epsilon$ as a starting value for determining an upper bound to the power envelope. The second step uses this upper bound to derive an optimal $g$-test that minimizes the distance between the power surface and the power envelope for tests in $\Gamma^M_\epsilon$.

Implementation of the varying-$g$ method: Basic $g$-function algorithm
1. Define $g(\cdot)$ nonparametrically as a linear spline defined by $J + 2$ knots $\{(t^{(j)}, g^{(j)})\}_{j=0}^{J+1}$, i.e. by $J + 2$ values $g^{(j)}$ on a regular grid of points $t^{(j)}$. The first and last knots are fixed at $(0, 0)$ and $(2.5, z_{0.025})$ respectively, so there are $J$ knots to be chosen. For points $t$ not on the grid, $g(t)$ is obtained by linear interpolation, and $g(t) = z_{0.025} \approx 1.96$ for $t > 2.5$.
2. The criterion function $Q(g)$ is the accumulated NRP deviation from 5%, as measured by the asymmetric loss function $q$ over a grid of points $\{\mu_0^{(\iota)}\}_{\iota=1}^{\Upsilon}$ with $\Upsilon > J$ and $\mu_0^{(1)} = 0$:
$$Q(g) = \sum_{\iota=1}^{\Upsilon} q\!\left( NRP_g\!\left(\mu_0^{(\iota)}\right) - 0.05 \right), \qquad q(x) = \begin{cases} -x & : x \leq 0 \\ w\,x & : x > 0 \end{cases}$$
with $w$ a large penalty weight, and where $NRP_g(\mu_0) = P\left[ T \in CR_g \,\big|\, \mu_1 = 0, \mu_2 = \mu_0 \right]$.

3. Minimize $Q(g)$ by varying $g(\cdot)$:

(a) Initialize $g(\cdot)$ with four knots corresponding to the LR boundary. The first and last knots are fixed at $(0, 0)$ and $(2.5, z_{0.025})$, and the middle two are varied when optimizing $Q(g)$.

(b) For given $g(\cdot)$, calculate the NRPs by numerical integration for the grid of $\Upsilon$ noncentrality parameter points $\{(0, \mu_2^{(\iota)})\}_{\iota=1}^{\Upsilon}$ under the null, with $\Upsilon \geq J$, and calculate $Q(g)$.

(c) Vary $g(\cdot)$ by changing the $J$ knots and minimize the criterion function $Q(g)$, subject to:
i. $0 \leq g^{(j+1)} - g^{(j)} < \delta$: monotonicity and limited increase;
ii. $g(t) \leq t$: logical restriction since the maximal invariant is an absolute order statistic;
iii. $g^{(J+1)} = z_{0.025}$: dimensional coherence requires reduction to a one-dimensional solution (see Section 5).

(d) Increase the number of knots $J$ and iterate until convergence.

Comments
1. The grid points $\{t^{(j)}\}_{j=1}^{J}$ are chosen equally spaced between 0 and $t^{(J)}$. The first and last knots, $(0, 0)$ and $(t^{(J+1)}, g^{(J+1)} = z_{0.025})$, remain fixed. For the illustration we have chosen $t^{(J)} = 1.96$ and $t^{(J+1)} = 2.5$. For large enough $|T|_{(2)}$ it is essentially known that $\beta \neq 0$, and rejection depends only on whether $\alpha = 0$ is rejected. The corresponding 5% critical value for $|T|_{(1)}$ based on the normal distribution is the usual $z_{0.025} \approx 1.96$ as $|T|_{(2)} \to \infty$.
2. For $J$ small there are big gains in reducing the deviation from 5% by varying the knots $\{(t^{(j)}, g^{(j)})\}_{j=1}^{J}$, and also by increasing $J$; see Figure 3.

3. The number $\Upsilon$ of $\mu_0^{(\iota)}$ points used to check similarity was chosen to be $\Upsilon = 76 > J$: 60 points equally spaced between 0 and 6, and 16 points equally spaced between 6 and 20. This imposes 152 side conditions. Step 3(c) imposes approximately a further $3J$ restrictions for every choice of $J$, about 100 when $J = 32$.
4. The loss function $q$ was chosen such that it puts a large penalty on positive values of $(NRP - 0.05)$ that violate the size condition. Even extreme penalties still lead to NRPs that are over 5% for some values of $\mu_2$, and this therefore renders an invalid (oversized) test. Even though these NRP transgressions are very minor, we address this issue in the optimal test in Section 4 and use the following algorithm.

Optimal $g$-function

In order to find the optimal $g$, we minimize the sum of differences between $g$'s power surface and the power envelope on a grid of points, subject to the size and $\epsilon$-similarity conditions. The criterion function further includes a roughness penalty on $g$ based on numerical second derivatives $\Delta^2 g(t^{(i)})$, and we impose monotonicity $g(t^{(i+1)}) \geq g(t^{(i)})$ and, since by definition of the absolute order statistic $|T|_{(1)} \leq |T|_{(2)}$, we logically restrict $g$ to $0 \leq g(t) \leq t$.

Optimal $g$-function algorithm
1. Define $g(\cdot)$ nonparametrically as the linear spline defined above.

2. Define the criterion function $Q^*_\epsilon(g)$ as the accumulated power difference over the triangular grid of points $M_1 = \{(\mu_1^{(\gamma,\kappa)}, \mu_2^{(\gamma,\kappa)})\}_{1 \leq \gamma < \kappa \leq \Upsilon_1}$:
$$Q^*_\epsilon(g) = \sum_{\kappa=1}^{\Upsilon_1} \sum_{\gamma \leq \kappa} \left( \bar\pi\!\left(\mu_1^{(\gamma,\kappa)}, \mu_2^{(\gamma,\kappa)}\right) - P\!\left[ CR_g \,\big|\, \left(\mu_1^{(\gamma,\kappa)}, \mu_2^{(\gamma,\kappa)}\right) \right] \right) + \lambda \sum_{i=1}^{J} \left(\Delta^2 g\!\left(t^{(i)}\right)\right)^2 .$$
3. Minimize $Q^*_\epsilon(g)$ by varying $g(\cdot)$:

(a) Start with $g(\cdot)$ equal to the previously determined basic $g$-function.

(b) For given $g(\cdot)$, calculate $Q^*_\epsilon(g)$ by numerical integration.

(c) Vary $g(\cdot)$ by changing the $J$ knots and minimize the criterion function $Q^*_\epsilon(g)$, subject to:
i. $0.05 - \epsilon \leq P\!\left[CR_g \,\big|\, (0, \mu_0^{(\iota)})\right] \leq 0.05$, $\forall \iota = 1, \cdots, \Upsilon$: near similarity and size restrictions;
ii. $0 = g(0) \leq g(t^{(j)}) \leq g(t^{(j+1)}) \leq t^{(j+1)}$: monotonicity;
iii. $g^{(j+1)} - g^{(j)} < t^{(j+1)} - t^{(j)}$: limited increase and derivative;
iv. $g(t) \leq t$: logical restriction since the argument is an absolute order statistic;
v. $g^{(J+1)} = z_{0.025}$: dimensional coherence.

(d) Increase the number of knots $J$ and iterate until convergence.

The regularization parameter $\lambda$ was set to a small fixed value. The basic implementation algorithm solved for the optimal $g$-boundary by minimizing $\epsilon$. Once $\epsilon$ is determined, the current algorithm is akin to solving a dual problem that uses $\epsilon$ for the inequality restrictions and maximizes power. It minimizes the total difference from the power envelope.
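The core computation in both algorithms is the NRP of the region $CR_g = \{|T|_{(1)} > g(|T|_{(2)})\}$ by numerical integration, together with the asymmetric criterion $Q(g)$. The following sketch (our illustration: the two-knot boundary and the penalty weight $w$ are arbitrary choices, not the paper's optimal $g$) evaluates both by two-dimensional quadrature:

```python
import numpy as np
from scipy import stats
from scipy.integrate import trapezoid

z = 1.959963984540054  # z_{0.025}

def g(t, knots_t, knots_g):
    # piecewise-linear boundary; np.interp holds the last knot value beyond it
    return np.interp(t, knots_t, knots_g)

def nrp(mu2, knots_t, knots_g, grid=2001, tmax=12.0):
    """P[ |T|_(1) > g(|T|_(2)) ] at (mu1, mu2) = (0, mu2) by 2-D quadrature."""
    t1 = np.linspace(-tmax, tmax, grid)
    t2 = np.linspace(mu2 - tmax, mu2 + tmax, grid)
    f1 = stats.norm.pdf(t1)
    f2 = stats.norm.pdf(t2, loc=mu2)
    a1, a2 = np.meshgrid(np.abs(t1), np.abs(t2), indexing="ij")
    lo, hi = np.minimum(a1, a2), np.maximum(a1, a2)
    reject = lo > g(hi, knots_t, knots_g)
    inner = trapezoid(reject * f2, t2, axis=1)   # integrate out t2
    return trapezoid(inner * f1, t1)             # then t1

def Q(knots_t, knots_g, mu_grid, w=100.0):
    # asymmetric loss: undershoot counts as -x, overshoot penalized by w*x
    dev = np.array([nrp(m, knots_t, knots_g) for m in mu_grid]) - 0.05
    return np.sum(np.where(dev > 0, w * dev, -dev))

kt, kg = [0.0, 2.5], [0.0, z]   # straight line to (2.5, z), constant z after
for m2 in (0.0, 2.0, 6.0):
    print(m2, round(nrp(m2, kt, kg), 4))
print("Q =", Q(kt, kg, np.linspace(0.0, 8.0, 9)))
```

An optimizer would now vary the interior knot values to drive $Q$ down, exactly as in step 3(c); this naive straight-line boundary is badly oversized near the origin, which is why the knots must be optimized.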
Power Envelopes
We calculate two power envelopes: one for near similar tests in $\Gamma_\epsilon$ and a second for nonsimilar tests. The algorithm for calculating the power envelope is related to Chiburis (2009) and implemented in Julia, see Bezanson et al. (2017), using Gurobi, an optimization package that can handle many side restrictions; see Gurobi Optimization (2019). We maximize power subject to size and near similarity restrictions on a grid of $\Upsilon_0$ parameter points under the null: $0.05 - \epsilon \leq NRP(\mu_0^{(\iota)}) \leq 0.05$ for $\iota = 1, \ldots, \Upsilon_0$. The upper bounds ensure correct size, at least for the points considered. The lower bounds constitute the near similarity restriction. The power envelope is obtained by repeating this maximization on a grid of points $(\mu_1, \mu_2)$ under the alternative.

For the nonsimilar power envelope we can discard the lower bound restrictions $0.05 - \epsilon \leq NRP(\mu_0^{(\iota)})$. The power can only increase (or remain the same), and the difference between the two power envelopes is the power loss one suffers from insisting on similarity. This turns out to be less than 2 percentage points, and it should be stressed that this overstates the loss since no single test achieves the power envelope.

Denote the parameter space for the ordered absolute noncentrality parameters as
$$\Xi = \left\{ (\mu_1, \mu_2) \in \mathbb{R}_+ \times \mathbb{R}_+ \mid 0 \leq \mu_1 \leq \mu_2 \right\}.$$
We will use a bounded (triangular) subset of this octant $\Xi$ defined as
$$\Xi_B = \left\{ (\mu_1, \mu_2) \in \mathbb{R}_+ \times \mathbb{R}_+ \mid 0 \leq \mu_1 \leq \mu_2 \leq \mu_{\max} \right\}$$
and partition it into a null and an alternative parameter set,
$$\Xi_0 = \left\{ (\mu_1, \mu_2) \in \mathbb{R}_+ \times \mathbb{R}_+ \mid \mu_1 = 0 \leq \mu_2 \leq \mu_{\max} \right\} \quad \text{and} \quad \Xi_1 = \left\{ (\mu_1, \mu_2) \in \mathbb{R}_+ \times \mathbb{R}_+ \mid 0 < \mu_1 \leq \mu_2 \leq \mu_{\max} \right\}$$
respectively. Analogously, define the sample space of the maximal invariant/absolute order statistic as $\mathcal{T} = \{ (t_1, t_2) \in \mathbb{R}_+ \times \mathbb{R}_+ \mid t_1 \leq t_2 \}$. Very large values of $t_1$ and $t_2$ are of limited interest, and for computational purposes we can restrict ourselves to a bounded triangular subset of the sample space:
$$\mathcal{T}_B = \left\{ (t_1, t_2) \in \mathbb{R}_+ \times \mathbb{R}_+ \mid t_1 \leq t_2 \leq t_{\max} \right\}.$$

Power Envelope Algorithm
1. Discretize $\Xi_0$ into $\Upsilon_0$ points under $H_0$: $M_0 = \{(0, \mu_0^{(\iota)})\}_{\iota=1}^{\Upsilon_0}$.
2. Discretize $\Xi_1$ by choosing a triangular array of $\tfrac{1}{2}\Upsilon_1(1 + \Upsilon_1)$ points under $H_1$: $M_1 = \{(\mu_1^{(\gamma,\kappa)}, \mu_2^{(\gamma,\kappa)})\}_{1 \leq \gamma \leq \kappa \leq \Upsilon_1}$.
3. Partition $\mathcal{T}_B$ into squares $s_{ij}$ with $1 \leq i \leq j \leq J$ such that $\bigcup_{1 \leq i \leq j \leq J} s_{ij} = \mathcal{T}_B$ and $s_{ij} \cap s_{kl} = \emptyset$ for all $(i, j) \neq (k, l)$.
4. Under $H_0$, for $\iota = 1, \cdots, \Upsilon_0$ calculate $p_{ij}^{\iota} = P\left[ (|T|_{(1)}, |T|_{(2)}) \in s_{ij} \,\big|\, (0, \mu_0^{(\iota)}) \in \Xi_0 \right]$.
5. For each $1 \leq \gamma \leq \kappa \leq \Upsilon_1$ choose $\mu^{(\gamma,\kappa)} = (\mu_1^{(\gamma,\kappa)}, \mu_2^{(\gamma,\kappa)}) \in M_1$ under the alternative. For this $\mu$:

(a) Calculate $p_{ij}^{\gamma\kappa} = P\left[ (|T|_{(1)}, |T|_{(2)}) \in s_{ij} \,\big|\, \mu^{(\gamma,\kappa)} = (\mu_1^{(\gamma,\kappa)}, \mu_2^{(\gamma,\kappa)}) \right]$ for each $s_{ij} \in \mathcal{T}_B$.

(b) Determine the critical region to maximize the power
$$\max_{\{\phi_{ij}^{\gamma\kappa},\ 1 \leq i \leq j \leq J\}} \ \sum_{1 \leq i \leq j \leq J} p_{ij}^{\gamma\kappa}\, \phi_{ij}^{\gamma\kappa}$$
by selecting indicators $\phi_{ij}^{\gamma\kappa} = CR(s_{ij})$, equal to 1 if $s_{ij}$ is part of the critical region and 0 if part of the acceptance region, subject to the near similarity and size restrictions on the NRPs:
$$0.05 - \epsilon \ \leq \sum_{1 \leq i \leq j \leq J} p_{ij}^{\iota}\, \phi_{ij}^{\gamma\kappa} \ \leq\ 0.05 \quad \text{for } \iota = 1, \ldots, \Upsilon_0 .$$

Comments. Optimizer: Gurobi; $t_{\max} = 11$ and the squares $s_{ij}$ have equal side lengths, giving cardinality $|\mathcal{T}_B| = 285150$. For power calculations we use for $M_1$ a regular grid of points $(\mu_1, \mu_2)$ with $\mu_1 \leq \mu_2$. For the size and near similarity restrictions we use a regular grid of $\mu_0$ values, and for near similarity $\epsilon = 10^{-4}$.

Appendix E g-Function R Code

R Code

g <- function(y) {
  SplinePredict <- function(x, BP, coef) {
    if (x >= BP[2]) { idx <- 2 } else { idx <- 1 }
    h <- x - BP[idx]
    return(coef[idx, 1] + coef[idx, 2]*h + coef[idx, 3]*h^2 + coef[idx, 4]*h^3)
  }
  z0025 <- 1.9599639845400542355
  BP <- c(0.05, 0.16, 0.2, 1.2, 1.3, 1.5, 2.0, 2.1, 2.2)
  w <- c(0.05, 0.12375880123418455, 0.15698848058671158,
         1.1569884851127599, 1.247794136737431, 1.3715940041517778,
         1.8715940058845608, 1.9440175462611924, z0025)
  coef1 <- rbind(c(0.05, 1.0, -6.0947848621912755, 28.178586075656632),
                 c(0.12375880123418455, 0.6820300048642552,
                   3.204148542775413, 12.841273273689932))
  coef2 <- rbind(c(1.1569884851127599, 1.0, 0.06613363957554252, -9.85568477108435),
                 c(1.247794136737431, 0.7175561847825775,
                   -2.8905717917497653, 11.988937765977742))
  coef3 <- rbind(c(1.8715940058845608, 1.0, -2.4006862861725504, -3.5695967616429662),
                 c(1.9440175462611924, 0.41277483991620023,
                   -3.4715653146654413, 9.38460743389627))
  x <- abs(y)
  if (x <= BP[1]) {
    return(x)
  } else if (x < BP[3]) {
    return(SplinePredict(x, BP[1:3], coef1))
  } else if (x <= BP[4]) {
    return(w[3] + (x - BP[3])/(BP[4] - BP[3])*(w[4] - w[3]))
  } else if (x < BP[6]) {
    return(SplinePredict(x, BP[4:6], coef2))
  } else if (x <= BP[7]) {
    return(w[6] + (x - BP[6])/(BP[7] - BP[6])*(w[7] - w[6]))
  } else if (x < BP[9]) {
    return(SplinePredict(x, BP[7:9], coef3))
  } else {
    return(z0025)
  }
}

t^(i-1)  t^(i)  piece     a_i       b_i        c_i        d_i
0        0.05   l(t)      0.000000  1.000000
0.05     0.16   s(t)      0.050000  1.000000  -6.094785  28.178586
0.16     0.20   s(t)      0.123759  0.682030   3.204149  12.841273
0.20     1.20   l(t)      0.156988  1.000000
1.20     1.30   s(t)      1.156988  1.000000   0.066134  -9.855685
1.30     1.50   s(t)      1.247794  0.717556  -2.890572  11.988938
1.50     2.00   l(t)      1.371590  1.000000
2.00     2.10   s(t)      1.871594  1.000000  -2.400686  -3.569597
2.10     2.20   s(t)      1.944018  0.412775  -3.471565   9.384607
2.20     ∞      constant  z_{0.025}

Table 4: The optimal $g$ function in spline representation. Coefficients of the linear splines $l_i(t)$ and clamped cubic splines $s_i(t)$: $l_i(t) = a_i + b_i(t - t^{(i-1)})$ and $s_i(t) = a_i + b_i(t - t^{(i-1)}) + c_i(t - t^{(i-1)})^2 + d_i(t - t^{(i-1)})^3$.

References
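The cell-selection step 5(b) of the Power Envelope Algorithm in Appendix D becomes a linear program once the indicators are relaxed to $\phi_{ij} \in [0, 1]$ (randomized tests). The following minimal sketch (ours, not the paper's implementation) uses scipy.optimize.linprog in place of Gurobi, with a much coarser grid and a looser $\epsilon$ than the paper's $10^{-4}$:

```python
import numpy as np
from scipy import stats
from scipy.optimize import linprog

eps, size = 1e-3, 0.05             # eps loosened for this coarse illustrative grid
edges = np.linspace(0.0, 8.0, 81)  # cell edges on [0, t_max] with t_max = 8 here

def cell_probs(mu1, mu2):
    """P[(|T|_(1), |T|_(2)) in s_ij] for all cells i <= j, as a flat vector."""
    def p(mu, a, b):  # P(a <= |N(mu,1)| <= b): folded-normal interval probability
        return (stats.norm.cdf(b - mu) - stats.norm.cdf(a - mu)
                + stats.norm.cdf(b + mu) - stats.norm.cdf(a + mu))
    p1 = p(mu1, edges[:-1], edges[1:])
    p2 = p(mu2, edges[:-1], edges[1:])
    J = np.outer(p1, p2)              # P(|T1| in bin i, |T2| in bin j)
    P = np.triu(J.T) + np.triu(J, 1)  # fold both orderings onto cells i <= j
    return P[np.triu_indices_from(P)]

def envelope_power(mu_alt, mu0_grid):
    """Maximal power at mu_alt over randomized tests phi_ij in [0, 1],
    subject to 0.05 - eps <= NRP <= 0.05 on the null grid."""
    c = -cell_probs(*mu_alt)          # maximize power <=> minimize -power
    A_ub, b_ub = [], []
    for m2 in mu0_grid:
        q = cell_probs(0.0, m2)
        A_ub.append(q);  b_ub.append(size)        # NRP <= 0.05
        A_ub.append(-q); b_ub.append(eps - size)  # NRP >= 0.05 - eps
    res = linprog(c, A_ub=np.array(A_ub), b_ub=np.array(b_ub),
                  bounds=(0.0, 1.0), method="highs")
    if res.status != 0:
        raise RuntimeError(res.message)
    return -res.fun

print(envelope_power((2.0, 3.0), np.linspace(0.0, 6.0, 13)))
```

Repeating the maximization over a grid of alternatives $(\mu_1, \mu_2)$ traces out the (near similar) power envelope; dropping the lower-bound rows yields the nonsimilar envelope, as described in Appendix D.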
Alwin, D. F. and R. M. Hauser (1975): “The decomposition of effects in path analysis,” American Sociological Review, 37–47.

Andrews, D. W. K. (2012): “Similar-on-the-boundary tests for moment inequalities exist, but have poor power,” Tech. rep., Cowles Foundation Discussion Paper.

Andrews, D. W. K., M. J. Moreira, and J. H. Stock (2006): “Optimal two-sided invariant similar tests for instrumental variables regression,” Econometrica, 74, 715–752.

——— (2008): “Efficient two-sided nonsimilar invariant tests in IV regression with weak instruments,” Journal of Econometrics, 146, 241–254.

Andrews, D. W. K. and W. Ploberger (1994): “Optimal tests when a nuisance parameter is present only under the alternative,” Econometrica, 1383–1414.
Baron, R. M. and D. A. Kenny (1986): “The moderator–mediator variable distinction in social psychological research: Conceptual, strategic, and statistical considerations,” Journal of Personality and Social Psychology, 51, 1173.

Bezanson, J., A. Edelman, S. Karpinski, and V. B. Shah (2017): “Julia: A fresh approach to numerical computing,” SIAM Review, 59, 65–98.

Bollen, K. and R. Stine (1990): “Direct and indirect effects: Classical and bootstrap estimates of variability,” Sociological Methodology, 115–140.

Bollen, K. A. (1989): Structural Equations with Latent Variables, John Wiley & Sons.
Chiburis, R. C. (2009): “Approximately most powerful tests for moment inequalities,” in Essays on Treatment Effects and Moment Inequalities, Ph.D. thesis, Department of Economics, Princeton University, chap. 3.

Coletti, A. L., K. L. Sedatole, and K. L. Towry (2005): “The effect of control systems on trust and cooperation in collaborative environments,” The Accounting Review, 80, 477–500.

Craig, C. C. (1936): “On the frequency function of xy,” The Annals of Mathematical Statistics, 7, 1–15.

Drton, M. and H. Xiao (2016): “Wald tests of singular hypotheses,” Bernoulli, 22, 38–59.

Dufour, J.-M., E. Renault, and V. Zinde-Walsh (2017): “Wald tests when restrictions are locally singular,” Tech. rep., arxiv.org/abs/1312.0569v1.
Elliott, G., U. K. Müller, and M. W. Watson (2015): “Nearly optimal tests when a nuisance parameter is present under the null hypothesis,” Econometrica, 83, 771–811.

Freedman, L. S. and A. Schatzkin (1992): “Sample size for studying intermediate endpoints within intervention trials or observational studies,” American Journal of Epidemiology, 136, 1148–1159.

Glonek, G. F. V. (1993): “On the behaviour of Wald statistics for the disjunction of two regular hypotheses,” Journal of the Royal Statistical Society: Series B (Methodological), 55, 749–755.

Guggenberger, P., F. Kleibergen, and S. Mavroeidis (2019): “A more powerful subvector Anderson Rubin test in linear instrumental variables regression,” Quantitative Economics, 10, 487–526.

Gurobi Optimization, L. (2019): “Gurobi Optimizer Reference Manual.”
Heckman, J. and R. Pinto (2015a): “Causal analysis after Haavelmo,” Econometric Theory, 31, 115–151.

——— (2015b): “Econometric mediation analyses: Identifying the sources of treatment effects from experimentally estimated production technologies with unmeasured and mismeasured inputs,” Econometric Reviews, 34, 6–31.

Hillier, G. H. (2019): Personal communication.

Lehmann, E. L. and J. P. Romano (2005): Testing Statistical Hypotheses, Springer Science & Business Media.

MacKenzie, S. B., R. J. Lutz, and G. E. Belch (1986): “The role of attitude toward the ad as a mediator of advertising effectiveness: A test of competing explanations,” Journal of Marketing Research, 23, 130–143.
MacKinnon, D. P., C. M. Lockwood, J. M. Hoffman, S. G. West, and V. Sheets (2002): “A comparison of methods to test mediation and other intervening variable effects,” Psychological Methods, 7, 83.

MacKinnon, D. P., C. M. Lockwood, and J. Williams (2004): “Confidence limits for the indirect effect: Distribution of the product and resampling methods,” Multivariate Behavioral Research, 39, 99–128.

McDonald, J. A. and D. A. Clelland (1984): “Textile workers and union sentiment,” Social Forces, 63, 502–521.

Moreira, M. J. and R. Mourão (2016): “A critical value function approach, with an application to persistent time-series,” arXiv preprint arXiv:1606.03496.

Perlman, M. D. and L. Wu (1999): “The emperor's new tests,” Statistical Science, 14, 355–369.
Preacher, K. J. and A. F. Hayes (2008): “Asymptotic and resampling strategies for assessing and comparing indirect effects in multiple mediator models,” Behavior Research Methods, 40, 879–891.

Sobel, M. E. (1982): “Asymptotic confidence intervals for indirect effects in structural equation models,” Sociological Methodology, 13, 290–312.

van Garderen, K. J. (1997): “Curved exponential models in econometrics,” Econometric Theory, 13, 771–790.

van Giersbergen, N. P. A. (2014): “Inference about the indirect effect: A likelihood approach,” Tech. rep., Universiteit van Amsterdam, UvA-Econometrics Discussion Papers 2014/10.

Vaughan, R. J. and W. N. Venables (1972): “Permanent expressions for order statistic densities,”