Discrete convolution statistic for hypothesis testing
Giulio Prevedello, [email protected]
Ken R. Duffy, [email protected]
Hamilton Institute, Maynooth University, Maynooth, Ireland

Key Words: discrete convolution; sum of discrete random variables; statistical hypothesis testing; nonparametric maximum-likelihood estimation; sub-independence

Mathematics Subject Classification: 62G05; 62G10; 62G20; 62P10; 62P20

ABSTRACT

The question of testing for equality in distribution between two linear models, each consisting of sums of distinct discrete independent random variables with unequal numbers of observations, has emerged from biological research. In this case, the computation of classical $\chi^2$ statistics, which would not include all observations, results in loss of power, especially when sample sizes are small. Here, as an alternative that uses all data, the nonparametric maximum likelihood estimator for the distribution of a sum of discrete and independent random variables, which we call the convolution statistic, is proposed and its limiting normal covariance matrix determined. To challenge null hypotheses about the distribution of this sum, the generalized Wald's method is applied to define a testing statistic whose distribution is asymptotically $\chi^2$ with as many degrees of freedom as the rank of such covariance matrix. Rank analysis also reveals a connection with the roots of the probability generating functions associated with the addend variables of the linear models. A simulation study is performed to compare the convolution test with Pearson's $\chi^2$, and to provide usage guidelines.

1. INTRODUCTION

In this paper we examine the problem of testing the null hypothesis of equality in distribution, denoted $\sim$, for two linear models with distinct observables, that is

$H_0 : a_0 + \sum_{i=1}^{k} a_i A_i \sim b_0 + \sum_{i=1}^{h} b_i B_i.$   (1)

We assume that the random variables $A_1, \dots, A_k, B_1, \dots, B_h$ are bounded, independent, of possibly different distribution, and all take values in a real lattice $\Lambda(\zeta) = \{\zeta u : u \in \mathbb{Z}\}$ for some $\zeta \in \mathbb{R}$, and that $a_0, b_0 \in \Lambda(\zeta)$ and $a_1, \dots, a_k, b_1, \dots, b_h \in \mathbb{Z}$, the set of integers.

Equality in distribution between random variables can be tested using statistics such as Pearson's $\chi^2$ (Pearson, 1900) or the more general power-divergence family (Cressie and Read, 1984). The computation of these statistics, however, assumes that the numbers of observations of each of $A_1, \dots, A_k$ (and $B_1, \dots, B_h$) are equal. Otherwise, it would seem that the data sets must be truncated for application of those methods, which could prove wasteful if samples come in unequal counts and their collection is costly or laborious.

For example, consider a problem in meta-analysis, where two studies are described by linear models with distinct independent variables, and we wish to test for equality in distribution between these models as in (1). In the simplest case, for $k = 2$, $h = 1$ with $H_0 : A_1 + A_2 \sim B_1$, the independent variables observed are $n_1$ distributed as $A_1$, $n_2$ as $A_2$ and $n_3$ as $B_1$, respectively denoted $\{A_{11}, \dots, A_{1n_1}\}$, $\{A_{21}, \dots, A_{2n_2}\}$ and $\{B_{11}, \dots, B_{1n_3}\}$, with $n_1, n_2, n_3 \in \mathbb{N}$. This scenario may arise because the independent variables are grouped differently in the studies (e.g., $A_1$ occurrences of event $E_1$, $A_2$ occurrences of event $E_2$, $B_1$ occurrences of any event $E_1$ or $E_2$) or because the model choice is different (e.g., model one is $A_1 + A_2$ and model two is $B_1 \sim f(A_1, A_2)$ for a given function $f$). Then, to test $H_0$, Pearson's $\chi^2$ could be computed using $B_{11}, \dots, B_{1n_3}$ and the data from $A_1$ and $A_2$ paired as, for example, $\{A_{11} + A_{21}, \dots, A_{1m} + A_{2m}\}$ with $m = n_1 = n_2$, so that these $m$ variables are independent and identically distributed as $A_1 + A_2$, to comply with the assumptions of Pearson's statistic. If observation sizes are unequal, e.g. $n_1 > n_2$, then $n_1 - n_2$ of the $\{A_{11}, \dots, A_{1n_1}\}$ could be excluded from the calculation of $\{A_{11} + A_{21}, \dots, A_{1m} + A_{2m}\}$, now with $m = \min(n_1, n_2)$. But pairing and variable exclusion are two choices that, whether arbitrarily or randomly determined, influence the outcome of the test.

We were personally motivated to address this question during a study on how stimulatory signals are integrated by the immune system (Marchingo et al., 2016). We wished to statistically test the hypothesis that the expansion impetus of two stimuli was integrated independently by cells when the signals were provided together, which had been hypothesized in a previously published study (Marchingo et al., 2014). The experimental data obtained for that work was costly to produce, both in terms of manpower and reagents, and inherently came with distinct numbers of observations of all variables. Thus we sought to develop a statistical procedure that utilized all available data. The resulting test may prove useful in other fields, such as medicine for the efficacy evaluation of combination therapies (e.g. Wolchok et al., 2013), which is a topic of growing interest (Editorial, 2017).

We first show that the null hypothesis in (1) is equivalent to one without the scalar multipliers, $\sum_{i=1}^{k} X_i \sim \sum_{i=1}^{h} Y_i$, which simplifies notation (Lemma 1). To obtain a test statistic that utilizes all data, and therefore outperforms methods that require equal sized data sets, in Section 2 we study the maximum likelihood estimator (MLE) for the probability mass vector (PMV) of $\sum_{i=1}^{k} X_i$. This transpires to be the discrete convolution of the empirical probability mass vector (EPMV) of each variable $X_1, \dots, X_k$, and so we refer to it as the "convolution statistic" (Proposition 1).

We then derive the asymptotic distribution of the convolution statistic and build a testing procedure in both goodness-of-fit and equality in distribution versions (Proposition 2), leveraging the generalized Wald's method.
This technique was introduced in Moore's work (Mihalko and Moore, 1980; Moore, 1977, 1982; Moore and Spruill, 1975), as an extension of Wald's method (Wald, 1943), to build $\chi^2$ tests for statistics that are asymptotically normally distributed with a singular covariance matrix. It was subsequently adjusted in Hadi and Wells (1990), whose version we employ. Such methodology has found applications in the fields of econometrics (Andrews, 1987, 1988; Vuong, 1987; Wilson and Koehler, 1991), biology (Marchingo et al., 2016; Zhang, 1999), and statistical theory (Drost, 1989; Tyler, 1981; Voinov et al., 2008).

In Section 3, we investigate the asymptotic rank of the covariance matrix of the convolution statistic (Theorem 1, Corollary 1 and Corollary 2), which is the central problem in the derivation of a testing procedure through the generalized Wald's framework. Interestingly, this rank is related to the roots of the probability generating functions of $X_1, \dots, X_k$ and $Y_1, \dots, Y_h$ (Lemma 2). Finally, we show how a test for sub-independence (Hamedani, 2013; Schennach, 2019) can be built from the results achieved above (Corollary 3).

To conclude, in Section 4 we provide a simulated performance analysis of the convolution statistic against Pearson's $\chi^2$, and in Section 5 we discuss guidelines for its application. We remark that all the necessary proofs are reported in the Appendix.

2. CONVOLUTION STATISTIC

To derive a statistic for the testing of $H_0 : a_0 + \sum_{i=1}^{k} a_i A_i \sim b_0 + \sum_{i=1}^{h} b_i B_i$, we begin by showing that this null hypothesis is equivalent to another in which the variables have finite, positive integer support and no parameters $a_0, \dots, a_k, b_0, \dots, b_h$ are present. As a consequence, we will work in this new setting, as it facilitates the definition of the convolution statistic for the testing of $H_0$, especially in regard to a simpler notation.

Lemma 1 (Null hypothesis simplification). Let $A_1, \dots, A_k, B_1, \dots, B_h$ be a sequence of finite and independent random variables that map the probability space $(\Omega, \mathcal{F}, P)$ into a lattice $\Lambda(\zeta)$ with $\zeta \in \mathbb{R}$. Given the null hypothesis

$H_0 : a_0 + \sum_{i=1}^{k} a_i A_i \sim b_0 + \sum_{i=1}^{h} b_i B_i,$   (2)

with $a_0, b_0 \in \Lambda(\zeta)$ and $a_1, \dots, a_k, b_1, \dots, b_h \in \mathbb{Z}$, there exists a sequence of positive, finite and independent random variables $X_1, \dots, X_k, Y_1, \dots, Y_h$ from the probability space $(\Omega, \mathcal{F}, P)$ into $\{0, \dots, r_l\}$, respectively, with $r_l \in \mathbb{N}$ for $l = 1, \dots, k+h$, $P(X_i = 0) > 0$ for $i = 1, \dots, k$, and $P(Y_j = 0) > 0$ for $j = 1, \dots, h$, such that

$H_0 : \sum_{i=1}^{k} X_i \sim \sum_{i=1}^{h} Y_i$   (3)

is equivalent to (2).
As a result of Lemma 1, we need only consider $H_0$ as stated in equation (3). Thus, given a sequence of $k \geq 2$ random variables $X_1, \dots, X_k$, we write, for fixed $i \in \{1, \dots, k\}$, that $X_i \sim x_i \in \Delta^{r_i}$ with

$\Delta^{r_i} = \{ v = (v_0, \dots, v_{r_i}) \in \mathbb{R}^{r_i+1} : v_0, v_{r_i} \in (0,1);\ v_j \geq 0,\ j = 1, \dots, r_i - 1;\ \sum_{j=0}^{r_i} v_j = 1 \}$

to indicate that $X_i$ takes values in $\{0, \dots, r_i\} \subseteq \mathbb{N} \cup \{0\}$, with $r_i > 0$, and is distributed with PMV $x_i = (x_{i0}, \dots, x_{ir_i})$, that is $P(X_i = j) = x_{ij}$ for $j = 0, \dots, r_i$.

We remark that $x_{i0}, x_{ir_i} \in (0,1)$, $i = 1, \dots, k$, are assumed to avoid degenerate cases, without loss of generality. In fact, $x_{i0} > 0$ holds by Lemma 1 and, if there were $t \leq k$ such that $x_{tr_t} = P(X_t = r_t) = 0$, there would exist $\tau = \max\{j : x_{tj} > 0\} < r_t$, so that $X_t$ could be replaced by $\tilde{X}_t \sim \tilde{x}_t = (\tilde{x}_{t0}, \dots, \tilde{x}_{t\tau}) \in \Delta^{\tau}$, with $\tilde{x}_{tj} = x_{tj}$ for $j = 0, \dots, \tau$. Lastly, the constraint $x_{i0}, x_{ir_i} < 1$, $i = 1, \dots, k$, ensures that $X_i$ is not a constant value.

From now on, with these assumptions and notation, for the null hypothesis of the goodness-of-fit test we consider

$H_0 : \sum_{i=1}^{k} X_i \sim z$   (4)

with $s = \sum_{i=1}^{k} r_i$ and $z \in \Delta^{s}$. By independence, the sum of $X_1, \dots, X_k$ is distributed as the discrete convolution, denoted $*$, of their PMVs, that is

$\sum_{i=1}^{k} X_i \sim x_1 * \dots * x_k,$

where, for any two vectors $v = (v_0, \dots, v_a) \in \mathbb{R}^{a+1}$ and $w = (w_0, \dots, w_b) \in \mathbb{R}^{b+1}$ with $a, b > 0$, $v * w \in \mathbb{R}^{a+b+1}$ and $(v * w)_i = \sum_{j=0}^{a} \sum_{l=0}^{b} v_j w_l \delta_{j+l,i}$, with $\delta_{i,j} = 1$ if $i = j$ and null otherwise. Hence, the null hypothesis for the goodness-of-fit test (4) is equivalent to $H_0 : x_1 * \dots * x_k = z$.

Our first goal is to determine a statistic to test (4), which will subsequently be extended to assess the equality in distribution between $\sum_{i=1}^{k} X_i$ and $\sum_{i=1}^{h} Y_i$, where $Y_1 \sim y_1 \in \Delta^{r_{k+1}}, \dots, Y_h \sim y_h \in \Delta^{r_{k+h}}$ are other $h \geq 1$ independent random variables with $s = \sum_{i=1}^{k} r_i = \sum_{i=1}^{h} r_{k+i}$, namely

$H_0 : \sum_{i=1}^{k} X_i \sim \sum_{i=1}^{h} Y_i,$   (5)

or, equivalently, $H_0 : x_1 * \dots * x_k = y_1 * \dots * y_h$.

When the PMVs $x_i$ for $i = 1, \dots, k$ and $y_j$ for $j = 1, \dots, h$ are unknown, care must be taken to define the test statistics for (4) and (5) based only on available information. In this regard, the data consist of the observation of $n_i$ independent random variables $\{X_{i1}, \dots, X_{in_i}\}$ identically distributed as $X_i$ for $i = 1, \dots, k$, and $n_{k+i}$ independent random variables $\{Y_{i1}, \dots, Y_{in_{k+i}}\}$ identically distributed as $Y_i$ for $i = 1, \dots, h$. Following a nonparametric approach, we fix $i \in \{1, \dots, k\}$ and define $\hat{x}_{in_i} \in \Delta^{r_i}$ as the MLE of $x_i$, that is

$(\hat{x}_{in_i})_u = \frac{1}{n_i} \sum_{j=1}^{n_i} \mathbb{1}\{X_{ij} = u\}$

for $u = 0, \dots, r_i$, with $\mathbb{1}_A$ being the indicator function of the event $A$. In particular, by the multivariate central limit theorem (Serfling, 1980), $\sqrt{n_i}(\hat{x}_{in_i} - x_i)$ is asymptotically distributed as a centered normal random variable with covariance $\Sigma(x_i)$ for $n_i$ large, namely $\sqrt{n_i}(\hat{x}_{in_i} - x_i) \sim_{n_i \to \infty} N(\Sigma(x_i))$, where, for any PMV $v = (v_0, \dots, v_a) \in \Delta^{a}$ with $a \geq 0$, we define $\Sigma(v) \in \mathbb{R}^{(a+1) \times (a+1)}$ such that $(\Sigma(v))_{ij} = v_i \delta_{i,j} - v_i v_j$ for $i, j = 0, \dots, a$. With this notation, we derive the MLE for the distribution of $\sum_{i=1}^{k} X_i$.

Proposition 1 (MLE for a sum of independent random variables). Given $k \geq 2$, let $\{X_{i1}, \dots, X_{in_i}\}$ be $n_i \in \mathbb{N}$ random variables independent and identically distributed as $X_i \sim x_i \in \Delta^{r_i}$ with $r_i \in \mathbb{N}$, for $i = 1, \dots, k$. Set $s = \sum_{i=1}^{k} r_i$. The MLE for the PMV $x_1 * \dots * x_k \in \Delta^{s}$ of $\sum_{i=1}^{k} X_i$ is $\hat{x}_{1n_1} * \dots * \hat{x}_{kn_k} \in \Delta^{s}$, defined as

$(\hat{x}_{1n_1} * \dots * \hat{x}_{kn_k})_u = \Big( \prod_{j=1}^{k} n_j \Big)^{-1} \sum_{i_1=1}^{n_1} \cdots \sum_{i_k=1}^{n_k} \mathbb{1}\Big\{\sum_{j=1}^{k} X_{ji_j} = u\Big\},$

for every $u = 0, \dots, s$.

The MLE $\hat{x}_{1n_1} * \dots * \hat{x}_{kn_k}$ is calculated using all observables $X_{ij}$ for $j = 1, \dots, n_i$ and $i = 1, \dots, k$, which may not be the case for Pearson's statistic, as explained in Section 1, if at least one of the sample sizes $n_1, \dots, n_k$ differs from another.

We introduce additional notation for what follows: $A^+$, $A'$, $\mathrm{Ker}(A)$, $\mathrm{nul}(A)$, $\mathrm{rk}(A)$ for the Moore-Penrose inverse, transpose, kernel, nullity and rank of a matrix $A$, respectively (Hagen et al., 2000; Horn and Johnson, 1986); $\chi^2(s)$ to indicate the $\chi^2$ distribution with $s > 0$ degrees of freedom; and $T_{b+1}(v) \in \mathbb{R}^{(a+b+1) \times (b+1)}$ for the matrix of the discrete convolution between $v \in \mathbb{R}^{a+1}$ and any $(b+1)$-dimensional vector, i.e. $T_{b+1}(v)w = v * w \in \mathbb{R}^{a+b+1}$ with $w \in \mathbb{R}^{b+1}$, given $a, b \geq 0$. We write $T(v)$ without explicit domain dimension if this is clear from the context.

We are now ready to determine the asymptotic behavior of $\hat{x}_{1n_1} * \dots * \hat{x}_{kn_k}$, which follows from an application of the delta method as well as properties of quadratic transformations of asymptotically multivariate normal vectors (Serfling, 1980). In order for the MLE $\hat{x}_{1n_1} * \dots * \hat{x}_{kn_k}$ to converge to $x_1 * \dots * x_k$ it is necessary that the sample sizes $n_1, \dots, n_k$ grow at proportional rates. For this reason, from now on, we set $m = \min(n_1, \dots, n_{k+h})$ and assume that $c_i = \lim_{m \to \infty} m/n_i$ is finite and positive for every $i = 1, \dots, k+h$.

Proposition 2 (Asymptotic normality of convolutions). Under the null hypothesis (4), that $x_1 * \dots * x_k = z$, it holds that

$V_m = \sqrt{m}\,(\hat{x}_{1n_1} * \dots * \hat{x}_{kn_k} - z) \sim_{m \to \infty} N(\Psi)$   (6)

and

$V_m' \Psi^+ V_m \sim_{m \to \infty} \chi^2(\mathrm{rk}(\Psi)),$   (7)

where $\Psi = \sum_{i=1}^{k} c_i T(x^{(i)}) \Sigma(x_i) T(x^{(i)})'$ and $x^{(i)} = x_1 * \dots * x_{i-1} * x_{i+1} * \dots * x_k$ for $i = 1, \dots, k$. Alternatively, under the null hypothesis (5), that $x_1 * \dots * x_k = y_1 * \dots * y_h$, it holds that

$W_m = \sqrt{m}\,\big(\hat{x}_{1n_1} * \dots * \hat{x}_{kn_k} - \hat{y}_{1n_{k+1}} * \dots * \hat{y}_{hn_{k+h}}\big) \sim_{m \to \infty} N(\Psi + \Xi)$   (8)

and

$W_m' (\Psi + \Xi)^+ W_m \sim_{m \to \infty} \chi^2(\mathrm{rk}(\Psi + \Xi)),$   (9)

where $\Xi = \sum_{i=1}^{h} c_{k+i} T(y^{(i)}) \Sigma(y_i) T(y^{(i)})'$ and $y^{(i)} = y_1 * \dots * y_{i-1} * y_{i+1} * \dots * y_h$ for $i = 1, \dots, h$.

We remark that expressions (7) and (9) require knowledge of $\Psi$ and $\Psi + \Xi$, but these may, in general, be unknown. Thus we take advantage of the generalized Wald's method (Moore, 1977), which shows how to construct $\chi^2$ tests from consistent estimators of covariance matrices such as $\Psi$ and $\Psi + \Xi$. We recall here Moore (1977, Theorem 2), which will serve as the backbone for the subsequent results.

Proposition 3 (Generalized Wald's method; Moore (1977), Theorem 2).
Suppose a sequence of estimators $\{\hat{\theta}_m\}_{m \geq 1}$ of a parameter $\theta \in \mathbb{R}^d$, with $d > 0$, is such that $\sqrt{m}(\hat{\theta}_m - \theta) \sim_{m \to \infty} N(\Sigma)$ with $\mathrm{rk}(\Sigma) \leq d$. Denoting by $\{B_m\}_{m \geq 1}$ a sequence of $d$-dimensional square matrices such that $B_m \sim_{m \to \infty} B$ with $B$ a generalized inverse of $\Sigma$, then

$m\,(\hat{\theta}_m - \theta)' B_m (\hat{\theta}_m - \theta) \sim_{m \to \infty} \chi^2(\mathrm{rk}(\Sigma)).$

Since the entries of

$\hat{\Psi}_m = \sum_{i=1}^{k} c_i T(\hat{x}^{(i)}_{n_i}) \Sigma(\hat{x}_{in_i}) T(\hat{x}^{(i)}_{n_i})'$

are continuous functions of $x_1, \dots, x_k$, whose consistent estimators are $\hat{x}_{1n_1}, \dots, \hat{x}_{kn_k}$ respectively, $\hat{\Psi}_m$ is a consistent estimator of $\Psi$, and similarly

$\hat{\Xi}_m = \sum_{i=1}^{h} c_{k+i} T(\hat{y}^{(i)}_{n_{k+i}}) \Sigma(\hat{y}_{in_{k+i}}) T(\hat{y}^{(i)}_{n_{k+i}})'$

for $\Xi$.

Note that Proposition 3 cannot be directly applied to (6) by setting $B_m = \hat{\Psi}_m^+$, as $\hat{\Psi}_m^+$ may not be a consistent estimator of $\Psi^+$. Given a sequence of consistent estimators $\{A_m\}_{m \geq 1}$ for a matrix $A$ of finite dimensions, $\{A_m^+\}_{m \geq 1}$ is a sequence of consistent estimators for $A^+$ if and only if $\mathrm{rk}(A_m) = \mathrm{rk}(A)$ for $m$ large (Nashed, 1976). In particular, as the rank is a lower-semicontinuous operator on the space of finite dimensional matrices, only $\mathrm{rk}(\hat{\Psi}_m) \geq \mathrm{rk}(\Psi)$ is guaranteed as $m$ tends to infinity.

If the limiting rank is known, consistency is ensured by the Eckart-Young-Mirsky theorem (Eckart and Young, 1936), which is the solution to the basic low rank approximation of a finite dimensional matrix (Markovsky, 2012).
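The plug-in estimator $\hat{\Psi}_m$ can be assembled directly from the empirical PMVs. A minimal sketch in NumPy, under illustrative choices (the helper names, the Bernoulli and binomial samples, and the finite-sample substitution $c_i \approx m/n_i$ are ours, not the paper's):

```python
import numpy as np

def epmv(sample, r):
    """Empirical PMV of an integer sample supported on {0, ..., r}."""
    return np.bincount(sample, minlength=r + 1) / len(sample)

def conv_matrix(v, ncols):
    """T(v): matrix such that conv_matrix(v, len(w)) @ w == np.convolve(v, w)."""
    a = len(v) - 1
    T = np.zeros((a + ncols, ncols))
    for j in range(ncols):
        T[j:j + a + 1, j] = v      # each column is a shifted copy of v
    return T

def cov_pmv(v):
    """Sigma(v) = diag(v) - v v', the multinomial covariance of a PMV."""
    return np.diag(v) - np.outer(v, v)

def psi_hat(samples, rs):
    """Plug-in estimate of Psi = sum_i c_i T(xhat^(i)) Sigma(xhat_i) T(xhat^(i))'."""
    xhats = [epmv(s, r) for s, r in zip(samples, rs)]
    m = min(len(s) for s in samples)
    s_total = sum(rs)
    Psi = np.zeros((s_total + 1, s_total + 1))
    for i, (xi, si) in enumerate(zip(xhats, samples)):
        # xhat^(i): convolution of all EPMVs except the i-th
        others = [x for j, x in enumerate(xhats) if j != i]
        xo = others[0]
        for x in others[1:]:
            xo = np.convolve(xo, x)
        T = conv_matrix(xo, len(xi))
        Psi += (m / len(si)) * T @ cov_pmv(xi) @ T.T
    return Psi

rng = np.random.default_rng(0)
s1 = rng.binomial(1, 0.4, size=50)   # X1 ~ Bernoulli(0.4), r1 = 1
s2 = rng.binomial(2, 0.6, size=80)   # X2 ~ Binomial(2, 0.6), r2 = 2
mle = np.convolve(epmv(s1, 1), epmv(s2, 2))  # convolution MLE of the PMV of X1 + X2
Psi_hat = psi_hat([s1, s2], [1, 2])
```

As a consistency check with the theory, $\hat{\Psi}_m$ built this way is symmetric and annihilates the all-ones vector, reflecting the singularity that motivates the generalized Wald approach.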
To this end, given any $d$-dimensional symmetric matrix $A \in \mathbb{R}^{d \times d}$ with $d \in \mathbb{N}$ and its eigendecomposition $A = P' \Lambda P$, with $0 < \mathrm{rk}(A) \leq d$, $\Lambda$ the diagonal matrix of the decreasing eigenvalues and $P$ an orthogonal matrix, then for any $0 < r \leq \mathrm{rk}(A)$ we define a rank-$r$ matrix that approximates $A$ (in light of the Eckart-Young-Mirsky theorem) as $A^r = (D_r P)' \Lambda_r D_r P \in \mathbb{R}^{d \times d}$, where $D_r$ is an $r \times d$ matrix with 1 on the diagonal and 0 elsewhere, and $\Lambda_r$ is the $r \times r$ diagonal matrix of the largest $r$ eigenvalues of $A$. In particular, $A^r$ may not be unique, as in the case when the $r$-th and $(r+1)$-th eigenvalues are equal. The following result is found in Hadi and Wells (1990, Theorem 2.3) for generalized inverses, and we report it here for the case of Moore-Penrose inverses.

Proposition 4 (Rank approximation; Hadi and Wells (1990), Theorem 2.3). Suppose a sequence of centered random variables $\{U_m\}_{m \geq 1} \in \mathbb{R}^d$ is asymptotically distributed as $N(\Sigma)$ for $m$ large, with $0 < \mathrm{rk}(\Sigma) \leq d$ where $d > 0$. Let $\{\hat{\Sigma}_m\}_{m \geq 1}$ be a sequence of square matrices that are consistent estimators of $\Sigma$. Then for every $0 < r \leq \mathrm{rk}(\Sigma)$,

$U_m' (\hat{\Sigma}^r_m)^+ U_m \sim_{m \to \infty} \chi^2(r),$

where $\hat{\Sigma}^r_m$ is a rank-$r$ approximation of $\hat{\Sigma}_m$.

Proposition 4 highlights the central role of the rank of $\Sigma$, which will be derived in the next section for $\Sigma = \Psi$ and $\Sigma = \Psi + \Xi$. In general, the determination of $\mathrm{rk}(\Sigma)$ may be a difficult problem that depends on the structure of the $\Sigma$ under consideration, and this limitation may explain why an otherwise flexible tool such as the generalized Wald's method from Proposition 3 is not more widely employed. But, if the rank is known, Proposition 4 provides a method for the statistical testing of null hypotheses, such as $H_0 : x_1 * \dots * x_k = z$ or $H_0 : x_1 * \dots * x_k = y_1 * \dots * y_h$, under which $\Sigma$ is not invertible.
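In practice, Proposition 4 translates into a short recipe: eigendecompose the consistent covariance estimate, retain the $r$ largest eigenvalues, invert on their span, and form the quadratic statistic. A sketch under the assumption of a symmetric positive semidefinite input (function names are illustrative):

```python
import numpy as np

def rank_r_pinv(A, r):
    """(A^r)^+: Moore-Penrose inverse of the rank-r approximation of a
    symmetric PSD matrix A, retaining only the r largest eigenvalues."""
    eigval, eigvec = np.linalg.eigh(A)      # eigenvalues in ascending order
    top = np.argsort(eigval)[::-1][:r]      # indices of the r largest
    return eigvec[:, top] @ np.diag(1.0 / eigval[top]) @ eigvec[:, top].T

def wald_statistic(U, Sigma_hat, r):
    """Generalized Wald statistic U' (Sigma_hat^r)^+ U, asymptotically chi^2(r)."""
    return float(U @ rank_r_pinv(Sigma_hat, r) @ U)

# Example: Sigma(v) = diag(v) - v v' is singular (its kernel contains the
# all-ones vector), so the plain matrix inverse does not exist
v = np.array([0.25, 0.25, 0.5])
S = np.diag(v) - np.outer(v, v)
S2_pinv = rank_r_pinv(S, 2)
```

For a strictly positive PMV $v$ of length 3, $\Sigma(v)$ has rank exactly 2, so here the rank-2 truncation coincides with the full Moore-Penrose inverse; truncating further (to $r = 1$) trades degrees of freedom for the numerical stability discussed below.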
This result also assures a solution if only a lower bound on the rank is given, at the cost of statistical power. Furthermore, the exclusion of smaller eigenvalues may still be necessary to achieve numerical stability when calculating the pseudo-inverse of $\hat{\Sigma}_m$; due to Proposition 4, the effect of that truncation can be accounted for in the statistic formulation.

3. DETERMINING THE COVARIANCE MATRIX RANK

In this section, we investigate the rank of $\Psi$ and $\Psi + \Xi$, the covariance matrices from (6) and (8) of Proposition 2, in order to derive the number of degrees of freedom of the limiting statistics for the goodness-of-fit (7) and equality in distribution (9) tests. Focusing on $\Psi$, we begin by showing that $c_i T(x^{(i)}) \Sigma(x_i) T(x^{(i)})'$ is a positive semidefinite matrix for any fixed $i \in \{1, \dots, k\}$. In fact, since $\Sigma(x_i)$ is positive semidefinite, for every $v \in \mathbb{R}^{s+1}$,

$c_i v' T(x^{(i)}) \Sigma(x_i) T(x^{(i)})' v = c_i w' \Sigma(x_i) w \geq 0,$   (10)

where $w = T(x^{(i)})' v$. Additionally, we deduce from Horn and Johnson (1986, Observation 7.1.3) that, given $A$ and $B$ two positive semidefinite matrices with the same dimensions,

$\mathrm{Ker}(A + B) = \mathrm{Ker}(A) \cap \mathrm{Ker}(B).$   (11)

Taken together, (10) and (11) imply

$\mathrm{Ker}(\Psi) = \bigcap_{i=1}^{k} \mathrm{Ker}\big(T(x^{(i)}) \Sigma(x_i) T(x^{(i)})'\big).$   (12)

Using kernel properties, we write

$\mathrm{Ker}\big(T(x^{(i)}) \Sigma(x_i) T(x^{(i)})'\big) = \mathrm{Ker}\big(T(x^{(i)})'\big) \oplus \{v \in \mathbb{R}^{s+1} : T(x^{(i)})' v \in \mathrm{Ker}(\Sigma(x_i))\},$   (13)

where $\oplus$ represents the direct sum operation.

Let $i \in \{1, \dots, k\}$ be fixed and let $L_i = \{l : x_{il} = 0\}$ be the set of indexes of the null entries of $x_i \in \Delta^{r_i}$. In general, the kernel of $\Sigma(x_i)$ is generated by the $(r_i+1)$-dimensional all-ones vector $\mathbb{1}_{r_i}$ and the canonical vectors $e_{il} = (e_{il0}, \dots, e_{ilr_i}) \in \mathbb{R}^{r_i+1}$ for every $l \in L_i$, where $e_{ilu} = \delta_{l,u}$ for $u = 0, \dots, r_i$; that is, $\mathrm{Ker}(\Sigma(x_i)) = \langle \mathbb{1}_{r_i} \rangle \oplus E_i$, where $E_i = \langle \{e_{il} : l \in L_i\} \rangle$.

Denoting $r^{(i)} = \sum_{j \neq i} r_j$, we can expand $T(x^{(i)})$ into the banded $(s+1) \times (r_i+1)$ matrix whose columns are shifted copies of $x^{(i)}$,

$T(x^{(i)}) = \begin{pmatrix} x^{(i)}_0 & 0 & \cdots & 0 \\ x^{(i)}_1 & x^{(i)}_0 & \cdots & \vdots \\ \vdots & \vdots & \ddots & 0 \\ x^{(i)}_{r^{(i)}} & x^{(i)}_{r^{(i)}-1} & \cdots & x^{(i)}_0 \\ 0 & x^{(i)}_{r^{(i)}} & \cdots & \vdots \\ \vdots & \ddots & \ddots & \vdots \\ 0 & \cdots & 0 & x^{(i)}_{r^{(i)}} \end{pmatrix},$

from which we deduce $T(x^{(i)})' \mathbb{1}_s = \mathbb{1}_{r_i}$ for every $x^{(i)} \in \Delta^{r^{(i)}}$, and in particular $\mathbb{1}_s \notin \mathrm{Ker}\big(T(x^{(i)})'\big)$.

To achieve an explicit formulation for the rank of $\Psi$ (and analogously for $\Psi + \Xi$), in Theorem 1 we will assume that $x_i \in \Delta^{r_i}_{\mathrm{Int}} \subset \Delta^{r_i}$, defined as

$\Delta^{r_i}_{\mathrm{Int}} = \{ v = (v_0, \dots, v_{r_i}) \in \mathbb{R}^{r_i+1} : \sum_{j=0}^{r_i} v_j = 1;\ 0 < v_j < 1,\ j = 0, \dots, r_i \}$

for every $i = 1, \dots, k$. This ensures that $E_i = \emptyset$, so that $\mathrm{Ker}(\Sigma(x_i)) = \langle \mathbb{1}_{r_i} \rangle$. Under this hypothesis, from (12) and (13) we deduce that

$\mathrm{Ker}(\Psi) = \bigcap_{i=1}^{k} \mathrm{Ker}\big(T(x^{(i)}) \Sigma(x_i) T(x^{(i)})'\big) = \langle \mathbb{1}_s \rangle \oplus \bigcap_{i=1}^{k} \mathrm{Ker}\big(T(x^{(i)})'\big),$   (14)

and, with the same reasoning applied to $\Psi + \Xi$, it follows that

$\mathrm{Ker}(\Psi + \Xi) = \langle \mathbb{1}_s \rangle \oplus \Big( \big( \bigcap_{i=1}^{k} \mathrm{Ker}(T(x^{(i)})') \big) \cap \big( \bigcap_{j=1}^{h} \mathrm{Ker}(T(y^{(j)})') \big) \Big).$   (15)

In the following lemma we show how

$\bigcap_{i=1}^{k} \mathrm{Ker}\big(T(x^{(i)})'\big)$   (16)

and

$\Big( \bigcap_{i=1}^{k} \mathrm{Ker}(T(x^{(i)})') \Big) \cap \Big( \bigcap_{j=1}^{h} \mathrm{Ker}(T(y^{(j)})') \Big)$   (17)

depend on the roots in common between the probability generating functions of the random variables $X_1, \dots, X_k, Y_1, \dots, Y_h$. This result will be achieved in full generality, without restrictions on the PMVs, that is with $x_1 \in \Delta^{r_1}, \dots, x_k \in \Delta^{r_k}, y_1 \in \Delta^{r_{k+1}}, \dots, y_h \in \Delta^{r_{k+h}}$.

We first provide some insight into this connection by considering the case $k = 2$:

$\mathrm{Ker}\big(T(x^{(1)})'\big) \cap \mathrm{Ker}\big(T(x^{(2)})'\big) = \mathrm{Ker}\Big( \big[\, T(x^{(1)})\ \ T(x^{(2)}) \,\big]' \Big).$

In fact, $\big[\, T(x^{(1)})\ \ T(x^{(2)}) \,\big] = \big[\, T(x_2)\ \ T(x_1) \,\big] \in \mathbb{R}^{(r_1+r_2+1) \times (r_1+r_2+2)}$ has the same structure, up to dimensions, as a Sylvester matrix (Markovsky, 2012), whose nullity is the degree of the polynomial given by the greatest common divisor of the probability generating functions associated with $x_1$ and $x_2$.

To formalize the connection between PMVs and polynomials, we introduce the bijection $\varphi : \bigcup_{a \geq 0} \{u = (u_0, \dots, u_a) \in \mathbb{R}^{a+1} : \sum_{i=0}^{a} u_i = 1;\ u_a \neq 0\} \to \{u(t) \in \mathbb{R}[t] : u(1) = 1\}$ that, for any $a \geq 0$, maps a vector $v = (v_0, \dots, v_a) \in \{u = (u_0, \dots, u_a) \in \mathbb{R}^{a+1} : \sum_{i=0}^{a} u_i = 1;\ u_a \neq 0\}$ to the polynomial $\varphi(v)(t) = \sum_{i=0}^{a} v_i t^i \in \mathbb{R}[t]$ of degree $\deg \varphi(v) = a$ with coefficients $v$. In particular, this map transforms the convolution of vectors into the product of polynomials: given $w \in \{u = (u_0, \dots, u_b) \in \mathbb{R}^{b+1} : \sum_{i=0}^{b} u_i = 1;\ u_b \neq 0\}$, for any $b \geq 0$, we have

$\varphi(v * w)(t) = \sum_{i=0}^{a+b} (v * w)_i t^i = \varphi(v)(t)\,\varphi(w)(t).$

The map $\varphi$ allows the extension of the notion of greatest common divisor between any two polynomials, $\gcd(\varphi(v), \varphi(w))$, to their related vectors $v, w$. This is achieved by establishing $\gcd(u_1(t), u_2(t)) \in \{u(t) \in \mathbb{R}[t] : u(1) = 1\}$ for any $u_1(t), u_2(t) \in \{u(t) \in \mathbb{R}[t] : u(1) = 1\}$, so that the greatest common divisor is uniquely defined, and by setting, for any $v, w \in \bigcup_{a \geq 0} \{u = (u_0, \dots, u_a) \in \mathbb{R}^{a+1} : \sum_{i=0}^{a} u_i = 1;\ u_a \neq 0\}$,

$\gcd(v, w) = \varphi^{-1}\big(\gcd(\varphi(v)(t), \varphi(w)(t))\big) \in \{u = (u_0, \dots, u_{r_g}) \in \mathbb{R}^{r_g+1} : \sum_{i=0}^{r_g} u_i = 1;\ u_{r_g} \neq 0\}$

with $r_g = \deg \gcd(\varphi(v)(t), \varphi(w)(t)) \geq 0$. In particular, we say the vectors $v, w$ are coprime if and only if $r_g = 0$. Following the same logic, we import the concept of least common multiple between $v$ and $w$, denoted $\mathrm{lcm}(v, w)$, and the property of divisibility between vectors. Of note, with the notation above, the probability generating functions of $x_1, \dots, x_k, y_1, \dots, y_h$ are, respectively, $\varphi(x_1), \dots, \varphi(x_k), \varphi(y_1), \dots, \varphi(y_h)$.

We now establish the relation between the kernels of (16), (17) and the greatest common divisors $g_k = \gcd(x^{(1)}, \dots, x^{(k)})$ and $\bar{g}_h = \gcd(y^{(1)}, \dots, y^{(h)})$.

Lemma 2 (Kernels from gcd of PMVs). Let $k \geq 2$ and $x_1 \in \Delta^{r_1}, \dots, x_k \in \Delta^{r_k}$. Given $g_k = \gcd(x^{(1)}, \dots, x^{(k)}) \in \mathbb{R}^{r_{g_k}+1}$, it holds that $T(g_k)' \in \mathbb{R}^{(\sum_{i=1}^{k} r_i - r_{g_k} + 1) \times (\sum_{i=1}^{k} r_i + 1)}$ and

$\bigcap_{i=1}^{k} \mathrm{Ker}\big(T(x^{(i)})'\big) = \mathrm{Ker}(T(g_k)').$   (18)

Additionally, let $h \geq 1$ and $y_1 \in \Delta^{r_{k+1}}, \dots, y_h \in \Delta^{r_{k+h}}$. Given $\bar{g}_h = \gcd(y^{(1)}, \dots, y^{(h)}) \in \mathbb{R}^{r_{\bar{g}_h}+1}$ and $\tilde{g} = \gcd(g_k, \bar{g}_h) \in \mathbb{R}^{r_{\tilde{g}}+1}$, it holds that $T(\bar{g}_h)' \in \mathbb{R}^{(\sum_{i=1}^{k} r_i - r_{\bar{g}_h} + 1) \times (\sum_{i=1}^{k} r_i + 1)}$, $T(\tilde{g})' \in \mathbb{R}^{(\sum_{i=1}^{k} r_i - r_{\tilde{g}} + 1) \times (\sum_{i=1}^{k} r_i + 1)}$ and

$\bigcap_{i=1}^{k} \mathrm{Ker}\big(T(x^{(i)})'\big) \cap \bigcap_{j=1}^{h} \mathrm{Ker}\big(T(y^{(j)})'\big) = \mathrm{Ker}\begin{pmatrix} T(g_k)' \\ T(\bar{g}_h)' \end{pmatrix} = \mathrm{Ker}(T(\tilde{g})').$   (19)

As a consequence of Lemma 2, we can finally determine the rank of the covariance matrices $\Psi$ and $\Psi + \Xi$ assuming $x_1 \in \Delta^{r_1}_{\mathrm{Int}}, \dots, x_k \in \Delta^{r_k}_{\mathrm{Int}}, y_1 \in \Delta^{r_{k+1}}_{\mathrm{Int}}, \dots, y_h \in \Delta^{r_{k+h}}_{\mathrm{Int}}$, in order to calculate the number of degrees of freedom of the limiting $\chi^2$ distribution in (7) and (9) from Proposition 2 for this case.

Theorem 1 (Covariance matrix rank). Under the assumptions of Proposition 2 and the notations of Lemma 2, and given $x_1 \in \Delta^{r_1}_{\mathrm{Int}}, \dots, x_k \in \Delta^{r_k}_{\mathrm{Int}}, y_1 \in \Delta^{r_{k+1}}_{\mathrm{Int}}, \dots, y_h \in \Delta^{r_{k+h}}_{\mathrm{Int}}$, it follows that
$\mathrm{Ker}(\Psi) = \langle \mathbb{1}_s \rangle \oplus \mathrm{Ker}(T(g_k)')$   (20)

and

$\mathrm{Ker}(\Psi + \Xi) = \langle \mathbb{1}_s \rangle \oplus \mathrm{Ker}(T(\tilde{g})').$   (21)

In particular, with $s$ defined immediately prior to equation (5),

$\mathrm{rk}(\Psi) = s - r_{g_k}$   (22)

and

$\mathrm{rk}(\Psi + \Xi) = s - r_{\tilde{g}}.$   (23)

In the case where $x_1, \dots, x_k, y_1, \dots, y_h$ are coprime, that is, when their probability generating functions $\varphi(x_1), \dots, \varphi(x_k), \varphi(y_1), \dots, \varphi(y_h)$ have no root in common, we can simplify Theorem 1 as follows.

Corollary 1 (Rank for the coprime case). Under the assumptions of Theorem 1, if $x_1, \dots, x_k, y_1, \dots, y_h$ are coprime vectors, then $\mathrm{rk}(\Psi) = \mathrm{rk}(\Psi + \Xi) = s$, where $s$ is defined immediately prior to equation (5).

A statistic for the goodness-of-fit (4) and equality in distribution (5) tests can also be built for the general case where $x_i \in \Delta^{r_i}$ and $y_i \in \Delta^{r_{i+k}}$, leveraging Proposition 4 and the following lower bound for the covariance matrix rank.

Corollary 2 (Rank lower bound for the general case). Under the assumptions of Proposition 2 and the notations of Lemma 2, and given $x_1 \in \Delta^{r_1}, \dots, x_k \in \Delta^{r_k}, y_1 \in \Delta^{r_{k+1}}, \dots, y_h \in \Delta^{r_{k+h}}$, it follows that

$\mathrm{rk}(\Psi) \geq s - r_{g_k} - \sum_{i=1}^{k} |L_i|$   (24)

and

$\mathrm{rk}(\Psi + \Xi) \geq s - r_{\tilde{g}} - \sum_{i=1}^{k+h} |L_i|,$   (25)

where $L_i = \{l : x_{il} = 0\}$ for $i = 1, \dots, k$ and $L_{i+k} = \{l : y_{il} = 0\}$ for $i = 1, \dots, h$, with $|L_i|, |L_{i+k}|$ indicating their respective cardinalities, and $s$ is defined immediately prior to equation (5).

Taken together, Lemma 2 and Theorem 1 serve as an example of how to determine the covariance matrix rank upon application of the generalized Wald's framework. For example, assuming the variables $\{X_{1j}, \dots, X_{kj}\}$ for $j = 1, \dots, m$ are grouped in $m$ $k$-tuples (thus with $m = n_1 = \dots = n_k$), each drawn independently from $(X_1, \dots, X_k)$, a statistic can be built to test the null hypothesis that the random variables $X_1, \dots, X_k$ are sub-independent (Hamedani, 2013), that is

$H_0 : \psi_{\sum_{i=1}^{k} X_i}(t) = \prod_{i=1}^{k} \psi_{X_i}(t), \quad \text{for all } t \in \mathbb{R},$   (26)

where $\psi_X$ represents the characteristic function of the random variable $X$. Sub-independence is a property less stringent than independence, and both these properties imply uncorrelatedness. Moreover, the assumption of sub-independence can replace that of independence in several limit theorems (Hamedani, 2013; Schennach, 2019).

Corollary 3 (Test for sub-independence). Under the assumptions of Theorem 1 and the null hypothesis of sub-independence (26), such that $x_1 * \dots * x_k = z$, and given $\hat{z}_m$ the MLE for the PMV $z \in \Delta^{s}_{\mathrm{Int}}$, such that $(\hat{z}_m)_u = \frac{1}{m} \sum_{j=1}^{m} \mathbb{1}\{\sum_{i=1}^{k} X_{ij} = u\}$ for $u = 0, \dots, s$, it holds that

$S_m = \sqrt{m}\,(\hat{x}_{1m} * \dots * \hat{x}_{km} - \hat{z}_m) \sim_{m \to \infty} N(\Upsilon),$   (27)

where $\Upsilon = \Sigma(z) - \Psi$ and $\mathrm{rk}(\Upsilon) = s$. In particular, given $\hat{\Upsilon}_m = \Sigma(\hat{z}_m) - \hat{\Psi}_m$, then

$S_m' (\hat{\Upsilon}^s_m)^+ S_m \sim_{m \to \infty} \chi^2(s).$   (28)

As the independence of random variables implies their sub-independence, Corollary 3 also provides a routine to test for complete independence in $k$-way contingency tables (Andersen, 1974; Bishop et al., 2007) (one dimension per random variable $X_i$, $i = 1, \dots, k$), as an alternative to the power divergence family of statistics (Cressie and Read, 1984). In this instance, the null hypothesis of sub-independence, weaker than complete independence, leads to a reduction of the free parameters to be estimated, from $\prod_{i=1}^{k} r_i$ (in the model from power divergence statistics) to $\sum_{i=1}^{k} r_i$ (in the framework of the convolution statistics), which is desirable when $k$ is large.

4. POWER COMPARISON

We evaluate the performance of the convolution test in terms of type I error and power (1 minus type II error), which are the proportions of rejections with significance level $\alpha = 0.05$ under the null and alternative hypothesis respectively, setting Pearson's $\chi^2$ test as the benchmark. To do so, we simulate the smallest parametrized model that enables the investigation of how sample size, degrees of freedom reduction, and the distribution of the observables affect the convolution test, using different parameter choices. It also allows the transition from the null to alternative hypotheses by modulating a single parameter.

We consider $k = 2$, $h = 1$, and $X_1, X_2$ two Bernoulli random variables with parameters $p, q \in (0,1)$, so that $x_1 = (1-p, p)$, $x_2 = (1-q, q)$ and $x_1, x_2 \in \Delta^1$. Then we define, for $\rho \in [0,1]$,

$z(\rho) = (1-\rho)\, x_1 * x_2 + \rho\, (1-a, 0, a),$   (29)

where $a = pq + \sqrt{pq(1-p)(1-q)}$ is defined so that $z(\rho) = (z(\rho)_0, z(\rho)_1, z(\rho)_2) \in \Delta^2$ is the PMV for the distribution of $Z_1 + Z_2$, where $Z_1$ and $Z_2$ are two Bernoulli random variables with parameters $p$ and $q$, respectively, and $\rho$ is their correlation. The null hypothesis for the goodness-of-fit (GF) test is $H_0 : X_1 + X_2 \sim z(0)$, and we set a family of alternative hypotheses parametrized over $\rho \in (0,1]$ as $H_\rho : X_1 + X_2 \nsim z(\rho)$. Similarly, the null and alternative hypotheses of the test for equality in distribution (ED) are defined as $H_0 : X_1 + X_2 \sim Y_1$ with $Y_1 \sim z(0)$, and $H_\rho : X_1 + X_2 \nsim Y_1$ with $Y_1 \sim z(\rho)$, $\rho \in (0,1]$. We sample $n_1$, $n_2$ and $n_3$ observations for $X_1$, $X_2$ and $Y_1$, respectively, so that $n_1, n_2 \leq n_3$ and $m = \min(n_1, n_2, n_3) = \min(n_1, n_2)$. Since we are interested in the comparison between the convolution and Pearson's $\chi^2$ statistics, we need to calculate the latter even in the case of unequal sample sizes, i.e. when $n_1 \neq n_2$. Thus, we define

$P^{GF}_m = \sum_{j=0}^{2} \frac{\big( \sum_{i=1}^{m} \mathbb{1}\{X_{1i} + X_{2i} = j\} - m z(\rho)_j \big)^2}{m z(\rho)_j}$

and

$P^{ED}_m = \sum_{j=0}^{2} \Bigg( \frac{\big( \frac{n_3}{m+n_3} \sum_{i=1}^{m} \mathbb{1}\{X_{1i} + X_{2i} = j\} - \frac{m}{m+n_3} \sum_{i=1}^{n_3} \mathbb{1}\{Y_{1i} = j\} \big)^2}{\frac{m}{m+n_3} \big( \sum_{i=1}^{m} \mathbb{1}\{X_{1i} + X_{2i} = j\} + \sum_{i=1}^{n_3} \mathbb{1}\{Y_{1i} = j\} \big)} + \frac{\big( \frac{m}{m+n_3} \sum_{i=1}^{n_3} \mathbb{1}\{Y_{1i} = j\} - \frac{n_3}{m+n_3} \sum_{i=1}^{m} \mathbb{1}\{X_{1i} + X_{2i} = j\} \big)^2}{\frac{n_3}{m+n_3} \big( \sum_{i=1}^{m} \mathbb{1}\{X_{1i} + X_{2i} = j\} + \sum_{i=1}^{n_3} \mathbb{1}\{Y_{1i} = j\} \big)} \Bigg)$

for Pearson's goodness-of-fit and equality in distribution testing statistics, respectively. In particular, $n_1 + n_2 - 2m$ observations will not be used in the computation of $P^{GF}_m$ and $P^{ED}_m$.

We define the convolution statistics with fixed rank $r = 1, 2$ as $V_m'(\hat{\Psi}^r_m)^+ V_m$ and $W_m'\big((\hat{\Psi}_m + \hat{\Xi}_m)^r\big)^+ W_m$. In the case where the $n_1$ and $n_2$ observations from the random variables $X_1$ and $X_2$, respectively, are all equal, the sample covariance matrix is null, i.e. $\hat{\Psi}_m = 0$, and $(\hat{\Psi}^r_m)^+$ is not well defined. In this scenario, Pearson's $P^{GF}_m$ can still be calculated. As we aim to compare the power gain over Pearson's procedures, we calculate the convolution statistic for the goodness-of-fit test as $V_m'(\hat{\Psi}^r_m)^+ V_m$, where well defined, and otherwise we set it to $P^{GF}_m$.
With the same reasoning for the equality in distribution case, for the following simulations we define the convolution statistics as
\[ C^{GF}_{r,m} = V'_m (\hat{\Psi}^r_m)^+ V_m \left(1 - \mathbb{1}\{\hat{\Psi}_m = 0\}\right) + P^{GF}_m \, \mathbb{1}\{\hat{\Psi}_m = 0\} \]
and
\[ C^{ED}_{r,m} = W'_m ((\hat{\Psi}_m + \hat{\Xi}_m)^r)^+ W_m \left(1 - \mathbb{1}\{\hat{\Psi}_m + \hat{\Xi}_m = 0\}\right) + P^{ED}_m \, \mathbb{1}\{\hat{\Psi}_m + \hat{\Xi}_m = 0\}, \]
for the goodness-of-fit and equality in distribution tests, respectively. Note that $\lim_{m \to \infty} C^{GF}_{r,m} \sim \chi^2(r)$, since $\hat{\Psi}_m = 0$ if and only if $\hat{x}_{1n_1}, \hat{x}_{2n_2} \in \{(1,0), (0,1)\}$, but, for $j = 1, 2$, $\lim_{m \to \infty} P\left(\hat{x}_{jn_j} \in \{(1,0), (0,1)\}\right) = 0$. Analogously, $\lim_{m \to \infty} C^{ED}_{r,m} \sim \chi^2(r)$ holds true. Moreover, in order not to confound the comparative analysis, we do not reduce the limiting $\chi^2$ degrees of freedom in the case where a positive eigenvalue of $\hat{\Psi}_m$, or $\hat{\Psi}_m + \hat{\Xi}_m$, is set to 0 for being smaller than $10^{-\epsilon}$ (here $\epsilon = 15$ is the machine precision of Python's floating point numbers in Numpy 1.13.1).

Finally, to assess whether deviations from the limit of the convolution statistic are due to the estimation of the covariance matrix pseudo-inverse, we introduce $Z^{GF}_{r,m} = V'_m (\Psi^r)^+ V_m$ and $Z^{ED}_{r,m} = W'_m ((\Psi + \Xi)^r)^+ W_m$ for $r = 1, 2$, which are the convolution statistics calculated with the true covariance matrices.

For all these statistics, we evaluate the proportion of hypothesis rejection for the significance level α = 0.
05 by Monte Carlo approximation over $L = 100{,}000$ independent instances of the data. That is, the proportion of rejections for a statistic $U$ (one among $P^{GF}_m$, $C^{GF}_{r,m}$, $Z^{GF}_{r,m}$, $P^{ED}_m$, $C^{ED}_{r,m}$, $Z^{ED}_{r,m}$) is calculated as
\[ \frac{1}{L} \sum_{l=1}^{L} \mathbb{1}\{S_r(U_l) < \alpha\}, \tag{30} \]
given that $U_1, \ldots, U_L$ are independent and identically distributed as $U$, and $S_r(t) = P(\chi^2(r) \ge t)$ is the survival function of a $\chi^2$ distribution with $r$ degrees of freedom.

In Fig. 1, we report the statistical power under $H_0$ and a range of alternative hypotheses $H_\rho$. We implement these comparisons for small and large samples with respect to $m = \min(n_1, n_2, n_3)$, and with equal and unequal sizes. To comply with the rule-of-thumb recommendation for the application of Pearson's $\chi^2$ statistic (Cressie and Read, 1984), the requirement on the expected frequencies $m (x_1 * x_2)_u \ge 1$, $u = 0, 1,$
2, must be met when $m$ observations are sampled from the distribution of $X_1 + X_2$. Thus, we select two cases for the parameters $(p, q)$: a first choice, with $q = 0.8$, for which, under small samples $m = 10$, all three constraints of the rule-of-thumb are satisfied, and a second choice for which the constraint fails in at least one cell. This allows us to compare the statistics with Pearson's $\chi^2$ in cases favorable to $P^{GF}_m$ and $P^{ED}_m$, when $q = 0.8$ and the sample sizes are equal, or unfavorable, when $n_1 + n_2 - 2m > 0$ and the rule-of-thumb is violated.

For small samples under the first choice, $C^{GF}_{2,m}$ provides better power than $P^{GF}_m$, and the latter better power than $C^{GF}_{1,m}$, but $C^{GF}_{2,m}$ shows a proportion of rejections above $\alpha$ under $H_0$ (Fig. 1a, top left and middle left panels). Under the second choice of $(p, q)$, $C^{GF}_{1,m}$ and $C^{GF}_{2,m}$ behave similarly and both outperform $P^{GF}_m$ (Fig. 1a, top right and middle right panels). Equivalent conclusions are inferred for the equality in distribution counterparts (Fig. 1b, top and middle panels). For large samples, under $H_0$, the proportion of rejections becomes closer to $\alpha$ for $C^{GF}_{2,m}$ (Fig. 1a, bottom panels) and coincides with it for $C^{ED}_{2,m}$ (Fig. 1b, bottom panels); in terms of power, the convolution statistics outperform Pearson's $\chi^2$, with the exception of $C^{GF}_{1,m}$ and $C^{ED}_{1,m}$ under the first choice of $(p, q)$ (Fig. 1a,b, bottom left panels).

In Fig. 2 we illustrate, as $m$ grows large, the convergence of the rejection rate to the significance level $\alpha$ under $H_0$, and the convergence of the power under a fixed alternative hypothesis $H_\rho$. When testing for goodness-of-fit, $C^{GF}_{2,m}$ shows the highest rejection proportion, which leads to good power under the alternative hypothesis (Fig. 2b), but a slow convergence to $\alpha$ under $H_0$, with rejection peaks up to $2\alpha$ (Fig. 2a). $C^{GF}_{1,m}$ and $P^{GF}_m$, instead, behave similarly and better than $C^{GF}_{2,m}$ under $H_0$, with $P^{GF}_m$ outperforming $C^{GF}_{1,m}$ in its most favorable case, the first choice of $(p, q)$.

Figure 1: Proportions of rejections for the convolution and Pearson's $\chi^2$ statistics, under hypotheses $H_0$ (for $\rho = 0$) and $H_\rho$ (for $\rho >$
0) when testing for goodness-of-fit (a) and equality in distribution (b). A full horizontal line depicts the nominal rejection level $\alpha = 0.05$.

$C^{ED}_{1,m}$ and $P^{ED}_m$ are similar (Fig. 2c,d), while $C^{ED}_{2,m}$ presents a much faster convergence under $H_0$ than its goodness-of-fit counterpart, as it approaches $\alpha$ already at $m = 100$ (Fig. 2c). Together, Figs. 1 and 2 suggest a tendency of the convolution statistics toward a more anti-conservative behavior (type I error higher than $\alpha$), while Pearson's $\chi^2$ statistic tends to be more conservative (type I error lower than $\alpha$).

Lastly, in Fig. 3 we analyze the convolution statistic when the covariance matrix is near a reduced-rank form, that is, when its smallest positive eigenvalue approaches zero. Equivalently, this situation occurs when the roots of the PMVs of $X_1$ and $X_2$ are close, i.e. when $(p-1)/p - (q-1)/q$ becomes null. To this end, we fix $q \in (0, 1)$ and vary $p$, so that $\mathrm{rk}(\Psi) = 1$ if $p = q$ (but still $\mathrm{rk}(\Psi + \Xi) = 2$), and we compare the proportions of rejections when the roots are well separated or close to each other, under both $H_0$ and a fixed alternative $H_\rho$.

Figure 2: Comparison of the convergence speed of the convolution statistics against those calculated with the true covariance matrix and Pearson's $\chi^2$ statistics, as the sample size becomes large with $m$. A full horizontal line depicts the nominal rejection level $\alpha = 0.05$.

For small samples, $C^{GF}_{1,m}$ and $P^{GF}_m$ present similar performance under $H_0$ (Fig. 3a, left panels), while $P^{ED}_m$ and $C^{ED}_{1,m}$ are, respectively, conservative and anti-conservative (Fig. 3b, left panels). The power under $H_\rho$ favors the use of $C^{GF}_{1,m}$ over $P^{GF}_m$ for small samples but, for large samples, $P^{GF}_m$ outperforms $C^{GF}_{1,m}$ when $p$ is near $q$ (Fig. 3a, right panels). As the spread between $p$ and $q$ widens, eventually $C^{GF}_{1,m}$ performs better than $P^{GF}_m$. This commentary is also true for the equality in distribution counterparts (Fig. 3b, right panels). As predicted from the analyses of Figs. 1 and 2, $C^{GF}_{2,m}$ shows a higher proportion of rejections.
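The reduced-rank pseudo-inversion behind $C^{GF}_{r,m}$ and $C^{ED}_{r,m}$, including the $10^{-15}$ eigenvalue cutoff used above, can be sketched as follows (`rank_r_pinv` is a hypothetical helper, not the authors' code):

```python
import numpy as np

def rank_r_pinv(S, r, tol=1e-15):
    """Pseudo-inverse of the best rank-r approximation of a symmetric
    PSD matrix S; eigenvalues at or below tol are treated as zero."""
    w, V = np.linalg.eigh(S)
    order = np.argsort(w)[::-1]        # sort eigenvalues, largest first
    w, V = w[order], V[:, order]
    keep = (np.arange(len(w)) < r) & (w > tol)
    inv = np.zeros_like(w)
    inv[keep] = 1.0 / w[keep]
    return (V * inv) @ V.T             # V @ diag(inv) @ V.T

S = np.diag([4.0, 1.0, 0.0])          # a rank-2, covariance-like matrix
P1 = rank_r_pinv(S, r=1)              # keeps only the leading eigenvalue
P2 = rank_r_pinv(S, r=2)
```

A quadratic form such as $V'_m (\hat{\Psi}^r_m)^+ V_m$ would then be computed as `v @ rank_r_pinv(Psi_hat, r) @ v`; when every eigenvalue falls below the cutoff the routine returns the zero matrix, the degenerate scenario in which the fallback to Pearson's statistic applies.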
Furthermore, due to $\mathrm{rk}(\Psi) = 1$, as $p$ approaches $q$, $C^{GF}_{2,m}$ undergoes a dramatic deviation from the nominal rejection proportion $\alpha$, indicative of the failure of consistency of $(\hat{\Psi}_m)^+$ as an estimator of $\Psi^+$ when $p = q$ (Fig. 3a, bottom left panels). For large samples, we see that $C^{ED}_{2,m}$ is more powerful than $C^{ED}_{1,m}$, since $\mathrm{rk}(\Psi + \Xi) = 2$, and also more powerful than $P^{ED}_m$.

Figure 3: Proportions of rejections as the smallest positive eigenvalue of $\Psi$ tends to zero (when $p$ approaches $q = 0.$
8, under the model with $X_1 \sim (1-p, p)$, $X_2 \sim (1-q, q)$) and when the rule-of-thumb for Pearson's $\chi^2$ statistic is violated (when $p$ approaches 0). The convolution statistics are compared against those calculated with the true covariance matrix and against Pearson's $\chi^2$ statistics. A full horizontal line depicts the nominal rejection level $\alpha = 0.05$.

The behaviors of $C^{GF}_{2,m}$ and $C^{ED}_{2,m}$ highlight the merit of the rank analysis of Section 3, and show the danger of setting the same degrees of freedom for the convolution statistics as for Pearson's $\chi^2$ without careful consideration.

To summarize the evidence above, we propose the following general usage guidelines: when the number of samples is large or unbalanced and the roots of the PMVs are distinct, the convolution statistic with the maximum possible rank provides the best power; as the roots of the PMVs become closer, or a low sample size hinders the estimation of the covariance matrix pseudo-inverse, the convolution statistic with reduced rank should be employed, ensuring a good compromise between the control of type I and type II errors; Pearson's $\chi^2$ is still recommended over the convolution statistics, but only for small-sample data, only if the rule-of-thumb is not violated, and only when the type I error must be conservatively controlled while the type II error is of secondary importance. Ultimately, it is advisable to run comparative simulations over any specific scenario of interest that differs largely from the ones considered above.

5. DISCUSSION

In this work, we have shown how to test hypotheses about sums of discrete random variables using the operation of discrete convolution, from possibly unbalanced datasets and without restriction to specific distribution classes. The asymptotic properties of the convolution were combined with the generalized Wald's method to solve the testing problems of goodness-of-fit, equality in distribution and sub-independence.
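The Monte Carlo machinery used throughout the benchmarks, estimating the rejection proportion (30), can be set up as below. This is a sketch under our own naming, using the closed-form $\chi^2$ survival functions for $r = 1, 2$ to avoid external dependencies:

```python
import math
import numpy as np

def chi2_sf(t, r):
    """Survival function S_r(t) = P(chi2(r) >= t), for r = 1, 2 only."""
    if r == 1:
        return math.erfc(math.sqrt(t / 2.0))
    if r == 2:
        return math.exp(-t / 2.0)
    raise ValueError("closed forms provided only for r = 1, 2")

def rejection_proportion(statistic, r, L=4000, alpha=0.05, seed=0):
    """Proportion of Monte Carlo instances with S_r(U_l) < alpha, as in (30)."""
    rng = np.random.default_rng(seed)
    rejections = sum(chi2_sf(statistic(rng), r) < alpha for _ in range(L))
    return rejections / L

# Sanity check: a statistic that is exactly chi2(2)-distributed under the
# null should reject at a rate close to alpha.
null_statistic = lambda rng: 2.0 * rng.exponential()
prop = rejection_proportion(null_statistic, r=2)
```

Replacing `null_statistic` with any of the statistics defined above (computed on freshly drawn data at each call) yields the rejection curves compared in Figs. 1–3.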
These results were achieved by studying the rank of the limiting covariance matrix arising from the convolution of PMVs. Of note, this rank was determined as a function of the number of roots shared between the probability generating functions of the random variables considered. Benchmarking simulations were conducted on a simple parametric model covering several situations of interest and, in most of them, the convolution statistic performed better than Pearson's. Further studies are still required to complete the description of the convolution statistics, including the cases where the support of the PMVs is not connected, i.e. PMVs in $\Delta_r \setminus \Delta_r^{\mathrm{Int}}$ for $r > 1$.
References
Andersen, A. H. (1974). Multidimensional contingency tables. Scandinavian Journal of Statistics, 1(3):115–127.

Andrews, D. W. K. (1987). Asymptotic results for generalized Wald tests. Econometric Theory, 3(3):348–358.

Andrews, D. W. K. (1988). Chi-square diagnostic tests for econometric models: Introduction and applications. Journal of Econometrics, 37(1):135–156.

Bishop, Y. M., Fienberg, S. E., and Holland, P. W. (2007). Discrete Multivariate Analysis: Theory and Practice. Springer, New York, NY, USA, 1 edition.

Cressie, N. and Read, T. R. C. (1984). Multinomial goodness-of-fit tests. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 46(3):440–464.

Drost, F. C. (1989). Generalized chi-square goodness-of-fit tests for location-scale models when the number of classes tends to infinity. Annals of Statistics, 17(3):1285–1300.

Eckart, C. and Young, G. (1936). The approximation of one matrix by another of lower rank. Psychometrika, 1(3):211–218.

Editorial (2017). Rationalizing combination therapies. Nature Medicine, 23:1113.

Hadi, A. S. and Wells, M. T. (1990). A note on generalized Wald's method. Metrika, 37(1):309–315.

Hagen, R., Roch, S., and Silbermann, B. (2000). C*-Algebras and Numerical Analysis. CRC Press, New York, NY, USA.

Hamedani, G. G. (2013). Sub-independence: An expository perspective. Communications in Statistics – Theory and Methods, 42(3):3615–3638.

Horn, R. A. and Johnson, C. R. (1986). Matrix Analysis. Cambridge University Press, New York, NY, USA, 1 edition.

Marchingo, J. M., Kan, A., Sutherland, R. M., Duffy, K. R., Wellard, C. J., Belz, G. T., Lew, A. M., Dowling, M. R., Heinzel, S., and Hodgkin, P. D. (2014). Antigen affinity, costimulation, and cytokine inputs sum linearly to amplify T cell expansion. Science, 346:1123–1127.

Marchingo, J. M., Prevedello, G., Kan, A., Heinzel, S., Hodgkin, P. D., and Duffy, K. R. (2016). T-cell stimuli independently sum to regulate an inherited clonal division fate. Nature Communications, 7:13540.

Markovsky, I. (2012). Low Rank Approximation: Algorithms, Implementation, Applications. Springer, Cambridge, UK, 1 edition.

Mihalko, D. P. and Moore, D. S. (1980). Chi-square tests of fit for type II censored data. Annals of Statistics, 8(3):625–644.

Moore, D. S. (1977). Generalized inverses, Wald's method, and the construction of chi-squared tests of fit. Journal of the American Statistical Association, 72(357):131–137.

Moore, D. S. (1982). The effect of dependence on chi squared tests of fit. Annals of Statistics, 10(4):1163–1171.

Moore, D. S. and Spruill, M. C. (1975). Unified large-sample theory of general chi-squared statistics for tests of fit. Annals of Statistics, 3(3):599–616.

Nashed, M. Z. (1976). Perturbations and approximations for generalized inverses and linear operator equations. In Nashed, M. Z., editor, Generalized Inverses and Applications, pages 325–396. Academic Press, New York, NY, USA, 1 edition.

Pearson, K. (1900). On the criterion that a given system of deviations from the probable in the case of a correlated system of variables is such that it can be reasonably supposed to have arisen from random sampling. The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science, 50(302):157–175.

Schennach, S. M. (2019). Convolution without independence. Journal of Econometrics, 211(1):308–318.

Serfling, R. J. (1980). Approximation Theorems of Mathematical Statistics. John Wiley & Sons, New York, NY, USA, 1 edition.

Tyler, D. E. (1981). Asymptotic inference for eigenvectors. Annals of Statistics, 9(4):725–736.

Voinov, V., Roza, A., and Pya, N. (2008). Recent achievements in modified chi-squared goodness-of-fit testing. In Vonta, F., Nikulin, M., Limnios, N., and Huber-Carol, C., editors, Statistical Models and Methods for Biomedical and Technical Systems.

Vuong, Q. H. (1987). Generalized inverses and asymptotic properties of Wald tests. Economics Letters, 24(4):343–347.

Wald, A. (1943). Tests of statistical hypotheses concerning several parameters when the number of observations is large. Transactions of the American Mathematical Society, 54(3):426–482.

Wilson, J. R. and Koehler, K. J. (1991). Hierarchical models for cross-classified overdispersed multinomial data. Journal of Business & Economic Statistics, 9(1):103–110.

Wolchok, J. D., Kluger, H., Callahan, M. K., Postow, M. A., Rizvi, N. A., Lesokhin, A. M., Segal, N. H., Ariyan, C. E., Gordon, R., Reed, K., Burke, M. M., Caldwell, A., Kronenberg, S. A., Agunwamba, B. U., Zhang, X., Lowy, I., Inzunza, H. D., Feely, W., Horak, C. E., Hong, Q., Korman, A. J., Wigginton, J. M., Gupta, A., and Sznol, M. (2013). Nivolumab plus Ipilimumab in advanced melanoma. New England Journal of Medicine, 369(2):122–133.

Zhang, B. (1999). A chi-squared goodness-of-fit test for logistic regression models based on case-control data. Biometrika, 86(3):531–539.

APPENDIX
Proof of Lemma 1.
Without loss of generality, it is possible to shift from the lattice $\Lambda(\zeta)$ to the set of integers $\mathbb{Z}$ through the natural isomorphism $\phi : \Lambda(\zeta) \to \mathbb{Z}$, $\phi(\zeta u) = u$ for every $u \in \mathbb{Z}$. Using this function, we define $A'_i = a_i \phi(A_i)$ and $B'_j = b_j \phi(B_j)$ for $i = 1, \ldots, k$, $j = 1, \ldots, h$, so as to account for the multiplicative constants $a_1, \ldots, a_k, b_1, \ldots, b_h$ in the variables $A'_1, \ldots, A'_k, B'_1, \ldots, B'_h$ mapping $\Omega$ to $\mathbb{N} \cup \{0\}$, and reduce (2) to
\[ H_0 : \phi(a_0) + \sum_{i=1}^{k} A'_i \sim \phi(b_0) + \sum_{i=1}^{h} B'_i. \tag{31} \]
Given $\tau_i = \min\{j : P(A'_i = j) > 0\}$ for $i = 1, \ldots, k$ and $\tau_{k+i} = \min\{j : P(B'_i = j) > 0\}$ for $i = 1, \ldots, h$, which are well defined since the variables $A'_1, \ldots, A'_k, B'_1, \ldots, B'_h$ are assumed finite, we rewrite (31) as
\[ H_0 : \phi(a_0) + \sum_{i=1}^{k} \tau_i + \sum_{i=1}^{k} (A'_i - \tau_i) \sim \phi(b_0) + \sum_{i=1}^{h} \tau_{k+i} + \sum_{i=1}^{h} (B'_i - \tau_{k+i}). \tag{32} \]
For the null hypothesis (32) to be true, $\phi(a_0) + \sum_{i=1}^{k} \tau_i = \phi(b_0) + \sum_{i=1}^{h} \tau_{k+i}$ must hold. Otherwise, for example, if $\phi(a_0) + \sum_{i=1}^{k} \tau_i < \phi(b_0) + \sum_{i=1}^{h} \tau_{k+i}$, by definition of $\tau_1, \ldots, \tau_{k+h}$ we would have
\[ 0 = P\left( \phi(b_0) + \sum_{i=1}^{h} \tau_{k+i} + \sum_{i=1}^{h} (B'_i - \tau_{k+i}) = \phi(a_0) + \sum_{i=1}^{k} \tau_i \right) = P\left( \phi(a_0) + \sum_{i=1}^{k} \tau_i + \sum_{i=1}^{k} (A'_i - \tau_i) = \phi(a_0) + \sum_{i=1}^{k} \tau_i \right) \ge \prod_{i=1}^{k} P(A'_i = \tau_i) > 0, \]
which is impossible. Therefore (32) is equivalent to
\[ H_0 : \sum_{i=1}^{k} (A'_i - \tau_i) \sim \sum_{i=1}^{h} (B'_i - \tau_{k+i}), \]
which, in turn, can be reduced to the form (3) by defining $X_i = A'_i - \tau_i$ for $i = 1, \ldots, k$ and $Y_i = B'_i - \tau_{k+i}$ for $i = 1, \ldots, h$, thus accounting for the subtraction of the constants in the distributions of $X_i$ and $Y_i$. As a consequence, the support of $X_i$ is $\{0, \ldots, r_i\}$ for some positive integer $r_i \in \mathbb{N}$ and $P(X_i = 0) > 0$ for $i = 1, \ldots, k$. The same applies to $Y_1, \ldots, Y_h$.

Proof of Proposition 1. Given $a_{ij} \in \{0, \ldots, r_i\}$, the independent sample from $X_{ij}$ for $i = 1, \ldots, k$ and $j = 1, \ldots, n_i$, the MLE of $x_1 * \ldots * x_k$ is the element $\theta \in \Delta_s$ that maximizes
\[ P(X_{11} = a_{11}, \ldots, X_{1n_1} = a_{1n_1}, \ldots, X_{k1} = a_{k1}, \ldots, X_{kn_k} = a_{kn_k} \mid \theta) = \prod_{i=1}^{k} P(X_{i1} = a_{i1}, \ldots, X_{in_i} = a_{in_i} \mid \theta). \]
On the right-hand side, for fixed $i \in \{1, \ldots, k\}$, $P(X_{i1} = a_{i1}, \ldots, X_{in_i} = a_{in_i} \mid \theta)$ achieves its maximum value for any $\theta = \theta_1 * \ldots * \theta_k$ such that $\theta_i = \hat{x}_{in_i}$, with $\theta_j \in \Delta_{r_j}$ for any $j$. In particular, $\theta = \hat{x}_{1n_1} * \ldots * \hat{x}_{kn_k}$ maximizes all the factors, and hence the whole product.

Proof of Lemma 2.
We first prove (18) by induction on $k$. For $k = 2$ we have to show that
\[ \mathrm{Ker}\left(T(x^{(1)})'\right) \cap \mathrm{Ker}\left(T(x^{(2)})'\right) = \mathrm{Ker}\begin{bmatrix} T(x^{(1)})' \\ T(x^{(2)})' \end{bmatrix} = \mathrm{Ker}(T(g)'), \tag{34} \]
where $g = \gcd(x^{(1)}, x^{(2)}) = \gcd(x_1, x_2) \in \mathbb{R}^{r_g+1}$ and $r_g \ge 0$. By definition of $g$, there exist two coprime vectors $z_1 \in \mathbb{R}^{r_2 - r_g + 1}$, $z_2 \in \mathbb{R}^{r_1 - r_g + 1}$ such that $x^{(1)} = z_1 * g$ and $x^{(2)} = z_2 * g$, so that, by composition of discrete convolutions, we can write
\[ \begin{bmatrix} T(x^{(1)})' \\ T(x^{(2)})' \end{bmatrix} = \begin{bmatrix} T(z_1)' \\ T(z_2)' \end{bmatrix} T(g)', \]
where $T(g)' \in \mathbb{R}^{(r_1+r_2-r_g+1) \times (r_1+r_2+1)}$, $T(z_1)' \in \mathbb{R}^{(r_1+1) \times (r_1+r_2-r_g+1)}$ and $T(z_2)' \in \mathbb{R}^{(r_2+1) \times (r_1+r_2-r_g+1)}$. As the number of columns of $[T(z_1)\ T(z_2)]'$ is smaller than the number of rows, since $r_1+r_2-r_g+1 \le r_1+r_2+2 \iff r_g+1 \ge 0$, to prove (34) it suffices to show that $[T(z_1)\ T(z_2)]'$ has full rank $r_1+r_2-r_g+1$. By the rank-nullity theorem, this is the case if and only if $\mathrm{nul}([T(z_1)\ T(z_2)]) = r_g+1$. The latter is true by the coprimeness of $z_1$ and $z_2$ and by matrix dimensionality, since the solutions $(a, b)$, with $a \in \mathbb{R}^{r_1+1}$, $b \in \mathbb{R}^{r_2+1}$, of the homogeneous system of equations $T(z_1)a + T(z_2)b = z_1 * a + z_2 * b = 0$ are characterized by $a = z_2 * u$, $b = -z_1 * u$ with $u \in \mathbb{R}^{r_g+1}$ a vector of free parameters.

Assuming (18) for $k$, we now prove the case $k+1$ to conclude. By the inductive step and the associative property of convolution, we can write
\[ \bigcap_{i=1}^{k+1} \mathrm{Ker}\left(T(x^{(i)})'\right) = \mathrm{Ker}\begin{bmatrix} T(g_k)' T(x_{k+1})' \\ T(x^{(k+1)})' \end{bmatrix} = \mathrm{Ker}\begin{bmatrix} T(g_k * x_{k+1})' \\ T(x^{(k+1)})' \end{bmatrix}. \tag{35} \]
Thus, by the definition of $g_{k+1} = \gcd(x^{(1)}, \ldots, x^{(k+1)}) \in \mathbb{R}^{r_{g_{k+1}}+1}$, the properties of the gcd lead to $g_{k+1} = \gcd(g_k * x_{k+1}, x_1 * \ldots * x_k) = g_k * \gcd(x_{k+1}, u_{k+1})$, where $u_{k+1} \in \mathbb{R}^{\sum_{i=1}^{k} r_i - r_{g_k} + 1}$ is such that $u_{k+1} * g_k = x_1 * \ldots * x_k$. In particular, we deduce
\[ r_{g_{k+1}} \ge r_{g_k}. \tag{36} \]
As in the previous step, we introduce coprime vectors $z_1 \in \mathbb{R}^{r_{g_k} + r_{k+1} - r_{g_{k+1}} + 1}$ and $z_2 \in \mathbb{R}^{\sum_{i=1}^{k} r_i - r_{g_{k+1}} + 1}$ such that $z_1 * g_{k+1} = g_k * x_{k+1}$ and $z_2 * g_{k+1} = x_1 * \ldots * x_k$. Thus, resuming (35),
\[ \bigcap_{i=1}^{k+1} \mathrm{Ker}\left(T(x^{(i)})'\right) = \mathrm{Ker}\begin{bmatrix} T(z_1)' \\ T(z_2)' \end{bmatrix} T(g_{k+1})' = \mathrm{Ker}(T(g_{k+1})'), \]
where the last equality is analogous to the case $k = 2$, given that the number of columns of $[T(z_1)\ T(z_2)]' \in \mathbb{R}^{(\sum_{i=1}^{k+1} r_i - r_{g_k} + 2) \times (\sum_{i=1}^{k+1} r_i - r_{g_{k+1}} + 1)}$ is smaller than the number of rows, that is,
\[ \sum_{i=1}^{k+1} r_i - r_{g_{k+1}} + 1 \le \sum_{i=1}^{k+1} r_i - r_{g_k} + 2 \iff r_{g_{k+1}} + 1 \ge r_{g_k}, \]
which holds true by (36).

To prove (19), we note that the first equality therein is established by (18), so only the second equivalence needs to be proved. Once more, we define $z_1$ and $z_2$ such that
\[ \begin{bmatrix} T(z_1)' \\ T(z_2)' \end{bmatrix} T(\tilde{g})' = \begin{bmatrix} T(g_k)' \\ T(\bar{g}_h)' \end{bmatrix}, \]
with $T(z_1)' \in \mathbb{R}^{(\sum_{i=1}^{k} r_i - r_{g_k} + 1) \times (\sum_{i=1}^{k} r_i - r_{\tilde{g}} + 1)}$ and $T(z_2)' \in \mathbb{R}^{(\sum_{i=1}^{k} r_i - r_{\bar{g}_h} + 1) \times (\sum_{i=1}^{k} r_i - r_{\tilde{g}} + 1)}$. Again, the conclusion is verified by checking that, in the matrix $[T(z_1)\ T(z_2)]'$, there are fewer columns than rows, namely
\[ \sum_{i=1}^{k} r_i + r_{\tilde{g}} + 1 \ge r_{g_k} + r_{\bar{g}_h}, \]
which holds true since $g_k * \bar{g}_h = \gcd(g_k, \bar{g}_h) * \mathrm{lcm}(g_k, \bar{g}_h) = \tilde{g} * \mathrm{lcm}(g_k, \bar{g}_h)$ and $\mathrm{lcm}(g_k, \bar{g}_h)$ is a divisor of $z = x_1 * \ldots * x_k = y_1 * \ldots * y_h$ (as polynomials), so $\deg \mathrm{lcm}(g_k, \bar{g}_h) \le \sum_{i=1}^{k} r_i$.

Proof of Theorem 1. The relations (20) and (21) derive as applications of Lemma 2 to (14) and (15), respectively. Lastly, (22) and (23) follow from (20) and (21), respectively, through the rank-nullity properties of linear transformations from a finite-dimensional domain.
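The kernel computation at the heart of Lemma 2 can be verified numerically: the nullity of $[T(z_1)\ T(z_2)]$ equals the dimension of the free parameter $u$, i.e. one plus the degree of $\gcd(z_1, z_2)$. The sketch below uses hypothetical helper names, representing a vector $(x_0, \ldots, x_r)$ as the polynomial $x_0 + x_1 t + \cdots + x_r t^r$:

```python
import numpy as np

def conv_matrix(x, n_cols):
    """Matrix T(x) with T(x) @ v == np.convolve(x, v) for v of length n_cols."""
    T = np.zeros((len(x) + n_cols - 1, n_cols))
    for j in range(n_cols):
        T[j:j + len(x), j] = x
    return T

def nullity_of_pair(z1, z2, e):
    """Nullity of [T(z1) T(z2)] acting on pairs (a, b) with
    deg(a) = deg(z2) + e and deg(b) = deg(z1) + e."""
    M = np.hstack([conv_matrix(z1, len(z2) + e),
                   conv_matrix(z2, len(z1) + e)])
    return M.shape[1] - np.linalg.matrix_rank(M)

e = 2
# (1 + t)(1 + 2t) and (1 + 3t) share no root: gcd has degree 0.
coprime = nullity_of_pair(np.convolve([1, 1], [1, 2]), [1, 3], e)
# (1 + t)(1 + 2t) and (1 + t)(1 + 3t) share the factor (1 + t): gcd degree 1.
shared = nullity_of_pair(np.convolve([1, 1], [1, 2]),
                         np.convolve([1, 1], [1, 3]), e)
```

In the coprime case the nullity is $e + 1$, while the shared factor of degree 1 raises it to $e + 2$, matching the parametrization $a = z_2 * u$, $b = -z_1 * u$ of the proof.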
Proof of Corollary 1.
This follows from Theorem 1, since $\mathrm{Ker}(T(g_k)') = \mathrm{Ker}(T(\tilde{g})') = \{(0, \ldots, 0)\} \subseteq \mathbb{R}^{s+1}$ by coprimeness.

Proof of Corollary 2. It suffices to show (24), since (25) follows from the same reasoning. Relation (13), Lemma 2 and Theorem 1 imply
\[ \mathrm{Ker}(\Psi) = \mathrm{Ker}(T(g_k)') \oplus \left( \bigcap_{i=1}^{k} \left\{ v \in \mathbb{R}^{s+1} : T(g_k)' v \in \mathrm{Ker}(\Sigma(x_i)) \right\} \right) \subseteq \langle 1_s \rangle \oplus \mathrm{Ker}(T(g_k)') \oplus \bigcap_{i=1}^{k} \left\{ v \in \mathbb{R}^{s+1} : T(g_k)' v \in E_i \right\}, \]
since $\mathrm{Ker}(\Sigma(x_i)) = \langle 1_{r_i} \rangle \oplus E_i$, given $E_i = \langle \{ e_{il} : l \in L_i \} \rangle$ with $e_{il,u} = \delta_{l,u}$ for $u = 0, \ldots, r_i$ and $i = 1, \ldots, k$. As the dimension of $E_i$ equals the cardinality of $L_i$, (24) follows from
\[ \mathrm{nul}(\Psi) \le r_{g_k} + \sum_{i=1}^{k} |L_i|. \]

Proof of Corollary 3.
To prove (27) we write the relation
\[ \sqrt{m} \begin{pmatrix} \hat{x}_{1m} - x_1 \\ \vdots \\ \hat{x}_{km} - x_k \\ \hat{z}_m - z \end{pmatrix} \underset{m \to \infty}{\sim} \mathcal{N}\left( 0, \begin{pmatrix} \Sigma(x_1) & \cdots & 0 & A_1 \\ \vdots & \ddots & \vdots & \vdots \\ 0 & \cdots & \Sigma(x_k) & A_k \\ A_1' & \cdots & A_k' & \Sigma(z) \end{pmatrix} \right), \tag{37} \]
where the zero off-diagonal entries follow from the uncorrelatedness implied by sub-independence, and $A_i = D(x_i)\left(T(x^{(i)})' - (z \cdots z)'\right)$, given $D(x_i)$ the matrix with $x_i$ on the diagonal and 0 elsewhere. Then, (27) is derived by applying the delta method to (37), with
\[ \Upsilon = \Psi - \sum_{i=1}^{k} A_i' T(x^{(i)})' - \sum_{i=1}^{k} T(x^{(i)}) A_i + \Sigma(z) = \Sigma(z) - \Psi, \]
since
\[ \sum_{i=1}^{k} T(x^{(i)}) A_i = \sum_{i=1}^{k} T(x^{(i)}) \left( D(x_i) - x_i x_i' \right) T(x^{(i)})' = \sum_{i=1}^{k} T(x^{(i)}) \Sigma(x_i) T(x^{(i)})' = \Psi. \]
To obtain $\mathrm{rk}(\Upsilon) = s$, $\Upsilon$ is rewritten as
\[ \Upsilon = \Sigma(z) - \Psi = \begin{bmatrix} I & T(x^{(1)}) & \cdots & T(x^{(k)}) \end{bmatrix} \begin{pmatrix} \Sigma(z) & 0 & \cdots & 0 \\ 0 & \Sigma(x_1) & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \Sigma(x_k) \end{pmatrix} \begin{pmatrix} I \\ -T(x^{(1)})' \\ \vdots \\ -T(x^{(k)})' \end{pmatrix} = M_1 M_2 M_3, \]
where $I$ is the identity matrix of dimension $(s+1) \times (s+1)$. In fact, following the rationale of Theorem 1 and (13), $\mathrm{Ker}(M_2 M_3) = \langle 1_s \rangle$, hence
\[ \mathrm{Ker}(\Upsilon) = \langle 1_s \rangle \oplus \left\{ v \in \mathbb{R}^{s+1} : M_2 M_3 v \in \mathrm{Ker}(M_1) \right\}. \]
Since $\mathrm{Ker}(M_1)$ is a space orthogonal to $\mathrm{Im}(M_1')$, then $\mathrm{Ker}(M_1) \subseteq \{ (0, w) \in \mathbb{R}^{2s+k+1} : w \in \mathbb{R}^{s+k} \}$, with $0 \in \mathbb{R}^{s+1}$, and therefore $\{ v \in \mathbb{R}^{s+1} : M_2 M_3 v \in \mathrm{Ker}(M_1) \} = \langle 1_s \rangle$, which leads to $\mathrm{rk}(\Upsilon) = s$.