Analysis of Deviance for Hypothesis Testing in Generalized Partially Linear Models
Wolfgang Karl Härdle
Center for Applied Statistics and Economics, Humboldt University, 10099 Berlin, GERMANY ([email protected])
and Li-Shan Huang
Institute of Statistics, National Tsing Hua University, 30013, TAIWAN ([email protected])
September 10, 2020
Abstract
In this study, we develop nonparametric analysis of deviance tools for generalized partially linear models based on local polynomial fitting. Assuming a canonical link, we propose expressions for both local and global analysis of deviance, which admit an additivity property that reduces to analysis of variance decompositions in the Gaussian case. Chi-square tests based on integrated likelihood functions are proposed to formally test whether the nonparametric term is significant. Simulation results are shown to illustrate the proposed chi-square tests and to compare them with an existing procedure based on penalized splines. The methodology is applied to German Bundesbank Federal Reserve data.
KEY WORDS: ANOVA decomposition; Integrated likelihood; Local polynomial regression.
1 Introduction

Generalized linear models (McCullagh and Nelder 1989) are a large class of statistical models for relating a response variable to linear combinations of predictor variables. The models allow the response variable to follow probability distributions in the exponential family, such as the Binomial and Poisson, generalizing the Gaussian distribution in linear models, though a major limitation is the prespecified linear form of the predictors. Generalized partially linear models (Green and Silverman 1994; Carroll et al. 1997; Härdle et al. 2004) allow for a nonparametric component for a continuous covariate while retaining the ease of linear relationships for the remaining variables. They are more flexible than the conventional linear approach and are a special case of generalized additive models (Hastie and Tibshirani 1990; Wood 2006), which include multiple nonparametric components. Härdle et al. (1998) applied the generalized partially linear model to 1991 East-West German migration data to model the probability of migration with a nonlinear relationship to household income and linear relationships to other covariates such as age, gender, and employment status. Wood (2006, p. 248) gave an example of modeling the daily total deaths in Chicago in the period 1987-2000 as a Poisson distribution with a nonlinear trend of time and linear effects of daily temperature and daily air-pollution levels of ozone, sulfur dioxide, and PM10. An illustrating finance example in Section 6 of this paper is on bankruptcy prediction for firms, known as rating or scoring, from a set of financial ratio variables.
The logistic partially linear model is used to model the probability of default with a nonlinear relationship to the accounts payable turnover ratio, which is a short-term liquidity measure, and linear relationships to some selected financial ratios.

In applying generalized partially linear models to data, inference tools to examine whether the nonparametric term is significant are of interest. For example, in Härdle et al. (1998), the nonlinear estimated function of household income showed a saturation in the intention to migrate for higher-income households, and the question was whether the overall income effect was statistically significant. As analysis of deviance was developed for generalized linear models (McCullagh and Nelder 1989), it is natural to ask whether one can extend it to generalized partially linear models. Though Hastie and Tibshirani (1990) briefly discussed analysis of deviance for generalized additive models, they noted that "the distribution theory, however, is undeveloped" and "informal deviance tests with some heuristic justification" were adopted. The present paper fills the gap by establishing local and global analysis of deviance expressions for generalized partially linear models and developing associated tests for checking whether the nonparametric term is significant. Li and Liang (2008) addressed assessing the significance of the nonparametric term in the local polynomial setting by extending generalized likelihood ratio tests (Fan et al. 2001), which have asymptotic chi-square distributions. Wood (2013) discusses approximate p-values for testing the significance of smooth components of semiparametric generalized additive models by Wald-type tests based on penalized splines. We remark that testing in generalized partially linear models is relatively less developed as compared to the special case of partially linear models under the Gaussian distribution (Härdle et al. 2004).
Hence, there is a need for developing analysis of deviance tools parallel to those in generalized linear models for applications of generalized partially linear models.

Based on the local polynomial approach (Fan and Gijbels 1996) and assuming a canonical link in generalized partially linear models, we propose local and global expressions for analysis of deviance, with the latter obtained by integrating the corresponding local likelihood quantities. This mimics the "integrated likelihood" approach discussed by Lehmann (2006) and Severini (2007). Though the idea of local likelihood has been around for some time (Hastie and Tibshirani 1987; Loader 1999), we are not aware of any use of the integrated likelihood approach to combine the information of local likelihood in the smoothing literature. Then integrated likelihood ratio tests with asymptotic chi-square distributions are proposed to check whether the nonparametric term is significant. Our work extends the classic analysis of deviance to generalized partially linear models with theoretical justifications, and generalizes the work of Huang and Chen (2008) and Huang and Davidson (2010) in the special case of the Gaussian distribution to distributions in the canonical exponential family.

The organization of this paper is as follows. Section 2 outlines the analysis of deviance for nested hypotheses in parametric generalized linear models by Simon (1973). In Section 3, we propose local and global analysis of deviance for nonparametric models in Theorem 1 for the simpler case with the nonparametric term as the only predictor. By combining local likelihood through integration, as a by-product, new estimators for the canonical parameter and response mean are given in equation (12), and Theorem 2 shows that the integrated likelihood quantities are asymptotically global likelihood quantities with the new estimators.
Theorem 3 proposes integrated likelihood ratio tests with asymptotic chi-square distributions for testing whether the nonparametric term is significant. Section 4 presents an extension of Theorems 1-3 to generalized partially linear models as Theorems 4 and 5. In Section 5, we illustrate the potential usefulness of the new tests with simulated data and compare with tests by Wood (2013) in the R package mgcv. Section 6 applies the methodology to 2002 German Bundesbank Federal Reserve data, and Section 7 gives some concluding remarks and directions for future research.

2 Analysis of Deviance for Generalized Linear Models

We first describe generalized linear models based on McCullagh and Nelder (1989). Let $(x_1, y_1), \ldots, (x_n, y_n)$ be independent data pairs with the conditional density of $Y$ given covariate $X = x$ from a one-parameter exponential family:
$$L(y; \theta(x)) = \exp\left[ \frac{y\,\theta(x) - b\{\theta(x)\}}{a(\phi)} + c(y, \phi) \right], \qquad (1)$$
where $a(\cdot) > 0$, $b(\cdot)$ and $c(\cdot)$ are known functions, $\phi$ is known or a nuisance parameter, and $\theta$ is the canonical parameter with the conditional mean of the response $E(Y \mid X = x) = \mu(x) = b'\{\theta(x)\}$. A transformation of the mean $G\{\mu(x)\}$ may be modelled linearly by $G\{\mu(x)\} = b_0 + b_1 x$, where $G(\cdot)$ is called the "link" function and estimates of $b_0$ and $b_1$ are obtained by maximum likelihood. If $G(\cdot) = (b')^{-1}(\cdot)$, then $G$ is the canonical link function that links $\theta$ to the linear predictor. For simplicity, $G$ is the canonical link function throughout the paper, and the dependence of $\theta$ on covariates is often suppressed if no ambiguities result.

Let $\ell(y; \theta) = \log L(y; \theta)$, let $\hat\theta = G(\hat\mu)$ denote the fitted value of $\theta$ with corresponding $\hat\mu$, and let $\tilde\theta = G(y)$ when the fitted value equals the observed $y$. The deviance $D$ (McCullagh and Nelder 1989), measuring the discrepancy between data $\mathbf{y} = (y_1, \ldots, y_n)^\top$ and fitted $\hat{\boldsymbol\mu} = (\hat\mu_1, \ldots, \hat\mu_n)^\top$, is
$$D(\mathbf{y}, \hat{\boldsymbol\mu}) = 2 \sum_i \{ y_i(\tilde\theta_i - \hat\theta_i) - b(\tilde\theta_i) + b(\hat\theta_i) \}. \qquad (2)$$
In the Gaussian case, $G$ is the identity link and $D = \sum_i (y_i - \hat\mu_i)^2$, which is the residual sum of squares in linear models. Let us now turn to testing hypotheses about $\boldsymbol\theta = (\theta_1, \ldots, \theta_n)^\top$. Assume that $D_j = \inf_{\boldsymbol\theta \in \Omega_j} D$, $j = 1, 2$, with $\Omega_2 \subseteq \Omega_1$. The analysis of deviance usually refers to comparing two nested parametric models, and inference may be based on the difference $D_2 - D_1$, which is simply the log likelihood ratio statistic with an asymptotic $\chi^2$ distribution. The conventional analysis of deviance is generally not parallel to the analysis of variance in linear models, in the sense that the former does not have all the sum-of-squares quantities.

An attempt to mimic analysis of variance for (1) can be based on the Kullback-Leibler (KL) divergence of two probability distributions with means $\mu_1$ and $\mu_2$:
$$KL(\mu_1, \mu_2) = 2\, E_{\mu_1}\left[ \ell\{y; G(\mu_1)\} - \ell\{y; G(\mu_2)\} \right],$$
where $\mu_1$ and $\mu_2$ are treated as fixed values and $E_{\mu_1}$ is the conditional expectation with respect to $y$ with mean $\mu_1$. Simon (1973) showed that for nested hypotheses $\Omega_2 \subset \Omega_1 \subset R^n$, with $R^n$ corresponding to the parameter space for an exact fit of $\tilde{\boldsymbol\theta}$ and $\boldsymbol\theta$ parameterized linearly in $\Omega_1$ and $\Omega_2$,
$$KL(\mathbf{y}, \boldsymbol\mu_2) = KL(\mathbf{y}, \boldsymbol\mu_1) + KL(\boldsymbol\mu_1, \boldsymbol\mu_2) \qquad (3)$$
when plugging in the values of the maximum likelihood estimates for $\boldsymbol\mu_1$ and $\boldsymbol\mu_2$. In other words, (3) shows that the KL divergence exhibits the Pythagorean property. For the Gaussian distribution, (3) reduces to the analysis of variance decomposition in linear models when $\boldsymbol\mu_1$ and $\boldsymbol\mu_2$ correspond to the linear fit and the intercept-only model respectively, and the terms in (3) become the total, residual, and regression sums of squares respectively.

A linear form of $x$ may be restrictive, and one may consider a nonparametric approach:
$$G\{\mu(x)\} = m(x). \qquad (4)$$
Fan et al.
(1995) discussed estimating $m(\cdot)$ by maximizing a locally weighted likelihood with a local polynomial approximation. Based on a Taylor expansion at $x$, $\theta_i \approx \beta_0 + \beta_1(x_i - x) + \cdots + \beta_p(x_i - x)^p \equiv \theta_i(x)$. This approximation is plugged into the locally weighted log-likelihood at $x$,
$$\ell_x(\mathbf{y}; \boldsymbol\theta_x) \equiv \sum_i \ell\{y_i; \theta_i(x)\}\, K_h(x_i - x), \qquad (5)$$
where $\boldsymbol\theta_x = (\theta_1(x), \ldots, \theta_n(x))^\top$, $K(\cdot)$ is usually a density function symmetric about $0$, $h$ is the bandwidth determining the neighborhood size, and $K_h(\cdot) = K(\cdot/h)/h$. Then $\hat{\boldsymbol\beta} = (\hat\beta_0, \ldots, \hat\beta_p)^\top$ maximizing $\ell_x(\mathbf{y}; \boldsymbol\theta_x)$ is solved, and $j!\,\hat\beta_j$ estimates $m^{(j)}(x)$, $j = 0, \ldots, p$, which is $\theta^{(j)}(x)$ with the canonical link. Fan et al. (1995) derived asymptotic properties of the $\hat\beta_j(x)$'s and adopted $G^{-1}\{\hat\beta_0(x)\}$ as an estimate for $\mu(x)$. A further extension to generalized partially linear models is
$$G\{\mu(\mathbf{z}, x)\} = \mathbf{z}^\top \boldsymbol\alpha + m(x), \qquad (6)$$
where $\mathbf{z}$ is a $K$-dimensional covariate vector. Without loss of generality, the intercept in (6) is embedded in $m(\cdot)$. When $\boldsymbol\alpha$ is unknown, estimation of $\boldsymbol\alpha$ can be done via a two-step maximum likelihood procedure that updates the linear and nonparametric estimates iteratively, as discussed in Carroll et al. (1997), p. 479.

3 Nonparametric Analysis of Deviance
This section focuses on (4). We start by deriving a local analysis of deviance expression for model (4) by adapting (3) for the locally weighted likelihood. Let $\hat\theta_i(x) = \hat\beta_0 + \cdots + \hat\beta_p(x_i - x)^p$ be the resulting local polynomial estimate of $\theta_i$ at $x$, and let $\hat{\boldsymbol\theta}_x = (\hat\theta_1(x), \ldots, \hat\theta_n(x))^\top$, $\hat\mu_x(x_i) = G^{-1}\{\hat\theta_i(x)\}$, and $\hat{\boldsymbol\mu}_x = (\hat\mu_x(x_1), \ldots, \hat\mu_x(x_n))^\top$. As the $\hat\beta_j$'s maximize $\ell_x(\mathbf{y}; \boldsymbol\theta_x)$, the following equations hold:
$$\sum_i y_i (x_i - x)^j K_h(x_i - x) = \sum_i \hat\mu_x(x_i)(x_i - x)^j K_h(x_i - x), \quad j = 0, \ldots, p,$$
$$\sum_i y_i\, \hat\theta_i(x) K_h(x_i - x) = \sum_i \hat\mu_x(x_i)\, \hat\theta_i(x) K_h(x_i - x). \qquad (7)$$
The last equation indicates that $(\mathbf{y} - \hat{\boldsymbol\mu}_x)$ is orthogonal to $\hat{\boldsymbol\theta}_x$ in the locally weighted inner product space with weights $K_h(x_i - x)$. Hence the fact that residuals are orthogonal to fitted values in ordinary linear models now becomes the fact that local residuals $(\mathbf{y} - \hat{\boldsymbol\mu}_x)$ are orthogonal to locally fitted canonical parameters $\hat{\boldsymbol\theta}_x$ in a kernel-weighted space. For $\ell_x(\mathbf{y}; \hat{\boldsymbol\theta}_x)$, an expression mimicking (2) for the local deviance at $x$ is therefore
$$d_x(\mathbf{y}, \hat{\boldsymbol\mu}_x) = 2\{\ell_x(\mathbf{y}; \tilde{\boldsymbol\theta}) - \ell_x(\mathbf{y}; \hat{\boldsymbol\theta}_x)\} = 2 \sum_i \left[ y_i\{\tilde\theta_i - \hat\theta_i(x)\} - b(\tilde\theta_i) + b(\hat\theta_i(x)) \right] K_h(x_i - x), \qquad (8)$$
where $\tilde{\boldsymbol\theta} = (\tilde\theta_1, \ldots, \tilde\theta_n)^\top$ with $\tilde\theta_i = G(y_i)$, the same as those defined around (2). Though (8) is a natural definition for the local deviance, we are not aware of a similar quantity to (8) in the literature. Published work focuses on the global deviance obtained by taking (2) with $G^{-1}\{\hat\beta_0(x_i)\}$ as the estimate, and strictly speaking, the resulting deviance expression is not based on maximized likelihood, as $\hat\beta_1, \ldots, \hat\beta_p$ are ignored. In comparison, the deviance (8) makes use of all coefficients $\hat\beta_0, \ldots, \hat\beta_p$ from maximizing the local likelihood.
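To make (8) concrete, a small numerical sketch of the local deviance for the Poisson family (canonical log link, $b(\theta) = e^\theta$, saturated value $\tilde\theta_i = \log y_i$) is given below. The function names and kernel choice are ours, and we assume $y_i > 0$ so that $\log y_i$ is finite.

```python
import numpy as np

def epanechnikov(u):
    # Epanechnikov kernel with support [-1, 1]
    return np.where(np.abs(u) <= 1.0, 0.75 * (1.0 - u**2), 0.0)

def local_deviance_poisson(x0, x, y, theta_hat, h):
    """Local deviance d_x(y, mu_hat_x) of (8) for the Poisson family:
    b(theta) = exp(theta), theta_tilde_i = log(y_i).  theta_hat[i] holds the
    local-polynomial value theta_hat_i(x0).  Sketch only; assumes y_i > 0."""
    w = epanechnikov((x - x0) / h) / h          # K_h(x_i - x0)
    theta_tilde = np.log(y)                     # G(y_i) under the canonical log link
    terms = y * (theta_tilde - theta_hat) - np.exp(theta_tilde) + np.exp(theta_hat)
    return 2.0 * np.sum(terms * w)
```

By construction the local deviance vanishes at the saturated fit $\hat\theta_i(x) = \log y_i$ and is nonnegative otherwise, since each summand is a Poisson deviance term minimized at the saturated value.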
Then (3) is adapted to form a local analysis of deviance expression, and a global expression may be obtained by integrating local quantities, as given in the following theorem.

Theorem 1.
Suppose that conditions (A1) and (A2) in the Appendix hold. Under model (4), the following results hold when using local polynomial approximations of $p$-th order.
(a) For a grid point $x$ in the support of the covariate $X$, a local analysis of deviance expression is
$$d_x(\mathbf{y}, \bar y) = d_x(\mathbf{y}, \hat{\boldsymbol\mu}_x) + d_x(\hat{\boldsymbol\mu}_x, \bar y), \qquad (9)$$
where $\bar y$ is the sample mean of $\mathbf{y}$, $d_x(\mathbf{y}, \bar y)$ is (8) with $\hat{\boldsymbol\mu}_x$ and $\hat{\boldsymbol\theta}_x$ replaced by $\bar y$ and $G(\bar y)$ respectively, and
$$d_x(\hat{\boldsymbol\mu}_x, \bar y) \equiv 2\, E_{\hat{\boldsymbol\mu}_x}\left[ \ell_x(\mathbf{y}; \hat{\boldsymbol\theta}_x) - \ell_x\{\mathbf{y}; G(\bar y)\} \right] = 2\left[ \ell_x(\mathbf{y}; \hat{\boldsymbol\theta}_x) - \ell_x\{\mathbf{y}; G(\bar y)\} \right]. \qquad (10)$$
(b) A global analysis of deviance expression is obtained by integrating the local quantities in (9) over the support of the covariate $X$:
$$\int d_x(\mathbf{y}, \bar y)\, dx = \int d_x(\mathbf{y}, \hat{\boldsymbol\mu}_x)\, dx + \int d_x(\hat{\boldsymbol\mu}_x, \bar y)\, dx, \qquad (11)$$
where $\int d_x(\mathbf{y}, \bar y)\, dx = KL(\mathbf{y}, \bar y) = D(\mathbf{y}, \bar y)$ under a boundary condition in (A1) that the weights satisfy $\int K_h(x_i - x)\, dx = 1$, $i = 1, \ldots, n$.

Theorem 1 provides elegant local and global analysis of deviance expressions that mimic the classic case (3) (McCullagh and Nelder 1989; Simon 1973) and shows that the Pythagorean property of the KL divergence holds under model (4) with local polynomial fitting. It is straightforward to show (9) based on (7) and (10), and hence the proof is omitted. Alternatively, the proof in Simon (1973) for (3) can be adapted with kernel weights to show (9). The local expression (9) has the interpretation that the null deviance at point $x$, $d_x(\mathbf{y}, \bar y)$, can be decomposed into two parts: the residual deviance after fitting a locally weighted polynomial at $x$, $d_x(\mathbf{y}, \hat{\boldsymbol\mu}_x)$, and the model deviance at $x$, $d_x(\hat{\boldsymbol\mu}_x, \bar y)$. Equality (9) holds in finite-sample cases, similar to (3).
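In the Gaussian case with the identity link, the three terms in (9) are kernel-weighted sums of squares, and the decomposition can be verified numerically to machine precision; a minimal sketch with simulated data (local linear fit, Epanechnikov kernel; the data-generating choices are ours):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
x = rng.uniform(0.0, 1.0, n)
y = np.sin(2.0 * np.pi * x) + rng.normal(0.0, 0.3, n)

x0, h = 0.5, 0.2
u = (x - x0) / h
w = np.where(np.abs(u) <= 1.0, 0.75 * (1.0 - u**2), 0.0) / h  # K_h(x_i - x0)

# local linear weighted least squares fit at x0 (Gaussian case, identity link)
X = np.column_stack([np.ones(n), x - x0])
beta = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * y))
mu_x = X @ beta                            # locally fitted values mu_hat_x(x_i)
ybar = y.mean()

d_null = np.sum(w * (y - ybar) ** 2)       # d_x(y, ybar)
d_resid = np.sum(w * (y - mu_x) ** 2)      # d_x(y, mu_hat_x)
d_model = np.sum(w * (mu_x - ybar) ** 2)   # d_x(mu_hat_x, ybar)
# (9): d_null == d_resid + d_model exactly, by the normal equations (7)
```

The cross term vanishes exactly because the weighted normal equations make the local residuals orthogonal to both the constant and the fitted values in the kernel-weighted inner product.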
The global analysis of deviance (11) extends the above interpretation to a curve fitted by local polynomials: the residual deviance $\int d_x(\mathbf{y}, \hat{\boldsymbol\mu}_x)\, dx$ is a measure of the lack of fit of fitting (4), whereas the null deviance $\int d_x(\mathbf{y}, \bar y)\, dx$ is such a measure for a reduced model that only includes the intercept. The quantities in (11) are weighted integrals (see (5)), which may be approximated by Riemann sums in practice, and an analysis of deviance table based on (11) is formed, similar to the parametric framework. For the special case of the Normal distribution with an identity link, (9) and (11) become the local and global analysis of variance decompositions respectively in Huang and Chen (2008). For the boundary condition in Theorem 1(b), if $K(\cdot)$ has support $[-1, 1]$ and $\{x_i, i = 1, \ldots, n\}$ has a range of $[a, b]$, then a boundary-corrected kernel $[\int K_h(x_i - x)\, dx]^{-1} K_h(x_i - x)$ may be used for $x_i$ in $[a, a + h)$ and $(b - h, b]$ to ensure that the integrated kernel weights equal 1.

As a by-product, the above derivations give rise to new "global" estimators for the $\theta_i$'s and $\mu_i$'s:
$$\theta_i^* = \int \hat\theta_i(x) K_h(x_i - x)\, dx \quad \text{and} \quad \mu_i^* = G^{-1}(\theta_i^*). \qquad (12)$$
They are different from the local estimates at $x_i$: $\hat\beta_0(x_i)$ and $G^{-1}\{\hat\beta_0(x_i)\}$. The asymptotic properties of $\theta_i^*$ and $\mu_i^*$ for "interior" points $x_i$ with $p = 1$ and $3$ are discussed in the Proposition below. The reason for considering $p = 1$ and $3$ only is that the asymptotic bias expressions of $\hat\beta_0(x)$ are simpler than those for $p = 0$ and $2$; see Theorems 1a and 1b in Fan et al. (1995). The "interior" region is defined as follows. For a kernel function with support $[-1, 1]$, if the convex support of the $x_i$'s is $[a, b]$, then define the interior region as $[a + 2h, b - 2h]$.
This definition is narrower than the conventional $[a + h, b - h]$, since for $x_i$ in $[a + h, a + 2h) \cup (b - 2h, b - h]$, the corresponding $\theta_i^*$ and $\mu_i^*$ in (12) involve $\hat\beta_j(x)$ with $x$ in $[a, a + h) \cup (b - h, b]$, $j = 0, \ldots, p$.

Proposition.
Suppose that conditions (A1)-(A5) in the Appendix hold. Assume that $h \to 0$ and $nh \to \infty$ as $n \to \infty$. Then for interior points $x_i$ with $p = 1$ and $3$,
(a) the order of the asymptotic bias of $\theta_i^*$ is smaller than the conventional order $h^{p+1}$; i.e., the $h^{p+1}$ term of the bias of $\theta_i^*$ is zero;
(b) the asymptotic variance of $\theta_i^*$ is of order $n^{-1} h^{-1}$;
(c) similarly, the order of the bias of $\mu_i^*$ is smaller than the conventional order $h^{p+1}$, and the asymptotic variance of $\mu_i^*$ is of order $n^{-1} h^{-1}$.

The proof of the Proposition is given in the Appendix. There has been some research aimed at finding new ways of reducing the bias of basic kernel smoothers, e.g., Kosmidis and Firth (2009). In the Gaussian case with an identity link, Huang and Chan (2014) show that the bias of $\theta_i^*$ for interior points is of a smaller order than $h^{p+1}$ for $p = 0, 1, 2, 3$, which is consistent with the intuition that the higher the $p$, the smaller the order of the bias. The derivation of explicit bias expressions of $\theta_i^*$ in the exponential family is technically challenging, since the second-order expansions of the bias of $\hat\beta_j(x)$ for (1) with (4) have not been addressed in the literature. We thus focus on analysis of deviance, while the issue of bias reduction may be studied in a future paper.

Theorem 1 involves integrating local likelihood quantities to form a global analysis of deviance expression, and hence it is of interest to explore how the integrated local likelihood $\int \ell_x(\mathbf{y}; \hat{\boldsymbol\theta}_x)\, dx$ behaves as a global likelihood function. The following theorem shows that the integrated local likelihood is asymptotically a global likelihood $\ell(\mathbf{y}; \boldsymbol\theta^*)$ with estimate $\boldsymbol\theta^* = (\theta_1^*, \ldots, \theta_n^*)^\top$, and that the integrated deviance quantities $\int d_x(\mathbf{y}, \hat{\boldsymbol\mu}_x)\, dx$ and $\int d_x(\hat{\boldsymbol\mu}_x, \bar y)\, dx$ are asymptotically KL divergence measures with estimate $\boldsymbol\mu^* = (\mu_1^*, \ldots, \mu_n^*)^\top$.

Theorem 2.
Under model (4), assume that conditions (A1)-(A5) in the Appendix hold, and that $h \to 0$, $nh \to \infty$ as $n \to \infty$. For $p = 1$ and $3$,
(a) the integrated likelihood function satisfies
$$\int \ell_x(\mathbf{y}; \hat{\boldsymbol\theta}_x)\, dx = \ell(\mathbf{y}; \boldsymbol\theta^*) + O(h^{p+1}), \qquad (13)$$
where the elements of $\boldsymbol\theta^*$ are defined in (12);
(b) the integrated deviance quantities satisfy
$$\int d_x(\mathbf{y}, \hat{\boldsymbol\mu}_x)\, dx = KL(\mathbf{y}, \boldsymbol\mu^*) + O(h^{p+1}) \quad \text{and} \quad \int d_x(\hat{\boldsymbol\mu}_x, \bar y)\, dx = KL(\boldsymbol\mu^*, \bar y) + O(h^{p+1}), \qquad (14)$$
where the elements of $\boldsymbol\mu^*$ are defined in (12);
(c) from (11) and (14),
$$KL(\mathbf{y}, \bar y) = KL(\mathbf{y}, \boldsymbol\mu^*) + KL(\boldsymbol\mu^*, \bar y) + O(h^{p+1}), \qquad (15)$$
which shows that the classic analysis of deviance holds asymptotically with $\boldsymbol\mu^*$.

The proof of Theorem 2 is given in the Appendix, and it utilizes some results stated in the Proposition for $p = 1$ and $3$. Hence Theorem 2 is limited to $p = 1$ and $3$ for the same reason described before the Proposition. The integrated local likelihood $\int \ell_x(\mathbf{y}; \hat{\boldsymbol\theta}_x)\, dx$ in (13) is a weighted integral of local likelihoods with fitted local polynomials. In the literature, the idea of integrated likelihood was mentioned in Lehmann (2006), and Severini (2007) discussed integrated likelihood functions to eliminate nuisance parameters in parametric settings. To our knowledge, (13) and (15) have never been raised in the nonparametric regression literature. The convention was to plug $\hat\beta_0(x_i)$ into (1) for $\theta_i$; as the $\hat\beta_0(x_i)$'s are not maximum likelihood estimates globally, the KL-type additivity (3) would not hold. In contrast, (15) shows that the classic analysis of deviance holds asymptotically by utilizing the local additivity in (9). Two topics for further investigation are to apply Theorems 1 and 2 to develop bandwidth selection and residual diagnostic procedures. For example, bandwidth selection may be based on cross-validating the deviance or minimizing the corrected Akaike information criterion (AICc, Hurvich et al.
1998), both with a close connection to KL divergence. In Section 5, we explore adapting the AICc criterion with the integrated deviance for bandwidth selection empirically.

Based on the integrated local likelihood, we next develop an integrated likelihood ratio test for examining the significance of a nonparametric fit, parallel to chi-square tests in parametric settings (McCullagh and Nelder 1989). Under model (4), the intercept term is embedded in $m(\cdot)$, and hence testing the significance of $m(\cdot)$ becomes testing whether $m(\cdot)$ equals a constant.

Theorem 3.
Under the conditions of Theorem 2, for testing $H_0: m(x) = a$ with $a$ a constant versus $H_a: m(x)$ is not a constant function, when estimating $m(\cdot)$ by $p$-th order local polynomials with $p \geq 1$, the test statistic
$$2\left\{ \int \ell_x(\mathbf{y}; \hat{\boldsymbol\theta}_x)\, dx - \ell(\mathbf{y}; \hat a) \right\} \qquad (16)$$
is asymptotically distributed according to a $\chi^2$-distribution with degrees of freedom (df) $tr(H_p^*) - 1$, where $\hat a$ is the maximum likelihood estimate under $H_0$ and $H_p^*$ is the smoothing matrix for local $p$-th order polynomial regression defined in Huang and Chen (2008) in the case of the Normal distribution. More explicitly, $H_p^*$, depending on the $x_i$'s, the bandwidth $h$, and the kernel function $K(\cdot)$, is
$$H_p^* = \int W X_p (X_p^\top W X_p)^{-1} X_p^\top W\, dx, \qquad (17)$$
where $W$ is an $n$-dimensional diagonal matrix with $K_h(x_i - x)$ as its diagonal elements, and $X_p$ is the $n \times (p + 1)$ design matrix with $(j + 1)$-th column $((x_1 - x)^j, \ldots, (x_n - x)^j)^\top$, $j = 0, \ldots, p$. The dependence of $W$ and $X_p$ on $x$ is suppressed, and the integration in (17) is performed element by element in the resulting matrix product.

In Theorem 3, the $\chi^2$-distribution is allowed to have a non-integer degree of freedom, since the $\chi^2$-distribution is a special case of the gamma distribution. The asymptotic order of $tr(H_1^*)$ in the case of local linear regression ($p = 1$) is $h^{-1}$ (Huang and Chen 2008, p. 2093). We name the $\chi^2$-test in Theorem 3 an integrated likelihood ratio test since the test statistic can be expressed as an integrated likelihood ratio:
$$\int \ell_x(\mathbf{y}; \hat{\boldsymbol\theta}_x)\, dx - \ell(\mathbf{y}; \hat a) = \int \sum_i \left[ \ell\{y_i; \hat\theta_i(x)\} - \ell(y_i; \hat a) \right] K_h(x_i - x)\, dx$$
under the boundary condition in (A1).
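The degrees of freedom $tr(H_p^*)$ in (17) can be approximated by a Riemann sum over a grid of $x$ values; by the cyclic property of the trace, each integrand reduces to $tr\{(X_p^\top W X_p)^{-1} X_p^\top W^2 X_p\}$, avoiding any $n \times n$ matrix. A sketch (the function name and grid size are our choices):

```python
import numpy as np

def trace_H_star(x, h, p=1, ngrid=401):
    """Riemann-sum approximation to tr(H*_p) in (17), Epanechnikov kernel.
    Uses tr(W Xp A^{-1} Xp' W) = tr(A^{-1} Xp' W^2 Xp) with A = Xp' W Xp."""
    grid = np.linspace(x.min(), x.max(), ngrid)
    dx = grid[1] - grid[0]
    total = 0.0
    for x0 in grid:
        u = (x - x0) / h
        w = np.where(np.abs(u) <= 1.0, 0.75 * (1.0 - u**2), 0.0) / h
        Xp = np.vander(x - x0, p + 1, increasing=True)  # columns 1, (x_i-x0), ...
        A = Xp.T @ (w[:, None] * Xp)                    # Xp' W Xp
        B = Xp.T @ ((w**2)[:, None] * Xp)               # Xp' W^2 Xp
        total += np.trace(np.linalg.solve(A, B)) * dx
    return total
```

For local linear regression ($p = 1$) the result grows like $h^{-1}$ as the bandwidth shrinks, consistent with the order cited above.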
In other words, under model (4), the test statistic (16) integrates the differences in local deviances between a nonparametric fit (8) and an intercept-only reduced model, and it is distributed asymptotically as a chi-squared distribution with the difference in degrees of freedom of the two models. This interpretation makes (16) more compelling than the generalized likelihood ratio test in Li and Liang (2008), since their work does not have a connection to deviance.

4 Analysis of Deviance for Partially Linear Models
We extend the results in Section 3 to generalized partially linear models (6). Denote $\breve\mu_x(x_i) = G^{-1}\{\breve\theta_i(x)\}$, where $\breve\theta_i(x) = \mathbf{z}_i^\top \breve{\boldsymbol\alpha} + \breve\beta_0 + \cdots + \breve\beta_p(x_i - x)^p$ with estimates $\breve{\boldsymbol\alpha}$ and the $\breve\beta_j$'s under (6). To avoid confusion with the notation in Section 3, from now on $\breve\mu_x(x_i)$, $\breve\theta_i(x)$, $\breve{\boldsymbol\theta}_x$, $\breve{\boldsymbol\mu}_x$, $\breve{\boldsymbol\alpha}$, $\boldsymbol\mu^{**}$, and $\boldsymbol\theta^{**}$ denote the estimates under (6). Since the $\breve\beta_j$'s maximize the local likelihood, the equations in (7) continue to hold with $\breve{\boldsymbol\theta}_x$ and $\breve{\boldsymbol\mu}_x$ under (6). The interpretation that $(\mathbf{y} - \breve{\boldsymbol\mu}_x)$ is orthogonal to $\breve{\boldsymbol\theta}_x$ in the locally weighted inner product space with weights $K_h(x_i - x)$ also continues to hold under (6). An additional equation from estimating $\boldsymbol\alpha$ by maximum likelihood is
$$\sum_i y_i z_{ik} = \sum_i z_{ik} \int \breve\mu_x(x_i) K_h(x_i - x)\, dx, \quad k = 1, \ldots, K, \qquad (18)$$
where $z_{ik}$ denotes the value of the $k$-th covariate for the $i$-th observation. From (18), we observe that the column vector with entries $(y_i - \int \breve\mu_x(x_i) K_h(x_i - x)\, dx)$, $i = 1, \ldots, n$, is orthogonal to the column space spanned by $\mathbf{z}$. Moreover, it can be shown that $\int \breve\mu_x(x_i) K_h(x_i - x)\, dx = \mu_i^{**} + O(h^{p+1} + n^{-1/2})$, and hence $(\mathbf{y} - \boldsymbol\mu^{**})$ is asymptotically orthogonal to the column space spanned by $\mathbf{z}$.

Theorems 1 and 2 are extended to generalized partially linear models (6) in the following as Theorem 4(a) and 4(b) respectively, when $\boldsymbol\alpha$ is estimated by maximum likelihood. We develop local and global analysis of deviance expressions for (6) in Theorem 4(a), and Theorem 4(b) shows that the integrated likelihood quantities are asymptotically global likelihood quantities with $\boldsymbol\theta^{**}$ and $\boldsymbol\mu^{**}$.

Theorem 4.
For model (6), assume that Conditions (A) in the Appendix hold, and that $h \to 0$, $nh \to \infty$ as $n \to \infty$.
(a) The local and global analysis of deviance expressions (9) and (11) respectively hold with $\breve\mu_x(x_i)$ and $\breve{\boldsymbol\mu}_x$ when $\boldsymbol\alpha$ is estimated by maximum likelihood.
(b) Assume that $\boldsymbol\alpha$ is estimated with a root-$n$ rate. For $p = 1$ or $3$, the expression in (13) holds with $\breve{\boldsymbol\theta}_x$ and $\boldsymbol\theta^{**}$, except the $O(h^{p+1})$ term is replaced by $O(h^{p+1} + n^{-1/2})$. Similarly, (14) and (15) hold with $\breve{\boldsymbol\mu}_x$ and $\boldsymbol\mu^{**}$ and the $O(h^{p+1})$ terms replaced by $O(h^{p+1} + n^{-1/2})$.
(c) When the same kernel function and bandwidth are used in (4) and (6), the nonparametric model (4) is nested in (6). Then the difference in local residual deviance from fitting (6) relative to (4) can be expressed as
$$d_x(\mathbf{y}, \hat{\boldsymbol\mu}_x) - d_x(\mathbf{y}, \breve{\boldsymbol\mu}_x) = d_x(\breve{\boldsymbol\mu}_x, \hat{\boldsymbol\mu}_x). \qquad (19)$$

The proofs of Theorem 4(a) and 4(b) are analogous to those of Theorems 1 and 2 respectively and are thus omitted. We briefly outline the proof of Theorem 4(c). Based on (7) under (6), we have $\sum_i y_i (x_i - x)^j K_h(x_i - x) = \sum_i \breve\mu_x(x_i)(x_i - x)^j K_h(x_i - x)$, $j = 0, \ldots, p$. Then multiplying the $j$-th equation by $\hat\beta_j$ and summing, $\sum_i \{y_i - \breve\mu_x(x_i)\}\, \hat\theta_i(x) K_h(x_i - x) = 0$ is obtained and (19) is proved.

In the special case of the Gaussian distribution with an identity link, Theorem 4(a) becomes the local and global analysis of variance for partially linear models, which was discussed in Huang and Davidson (2010, section 3.1). Theorem 4(c) implies that the local residual deviance for fitting (6) is the local residual deviance for fitting (4) minus a term due to the parametric component. That is, the difference of local residual deviances between (6) and (4) is a KL divergence measure $d_x(\breve{\boldsymbol\mu}_x, \hat{\boldsymbol\mu}_x)$, and the local KL divergence is additive between the nested models (6) and (4).
A similar interpretation holds at a global scale after integrating the local contributions of (19):
$$\int d_x(\mathbf{y}, \breve{\boldsymbol\mu}_x)\, dx = \int d_x(\mathbf{y}, \hat{\boldsymbol\mu}_x)\, dx - \int d_x(\breve{\boldsymbol\mu}_x, \hat{\boldsymbol\mu}_x)\, dx.$$
Analogous to Theorem 3, testing whether $m(\cdot)$ is significant under model (6) becomes testing whether $m(\cdot)$ is significantly different from a constant, and an integrated likelihood ratio test is proposed in the following theorem.

Theorem 5.
For model (6), assume that Conditions (A) in the Appendix hold and that the data matrix $Z$ for the covariates $\mathbf{z}$ is orthogonal to $\mathbf{x} = (x_1, \ldots, x_n)^\top$ and the intercept column. For testing $H_0: m(x) = a$ with $a$ a constant versus $H_a: m(x)$ is not a constant function, when estimating $m(\cdot)$ by $p$-th order local polynomials with $p \geq 1$, the test statistic
$$2\left\{ \int \ell_x(\mathbf{y}; \breve{\boldsymbol\theta}_x)\, dx - \ell(\mathbf{y}; \hat{\boldsymbol\alpha}) \right\} \qquad (20)$$
is asymptotically distributed according to a $\chi^2$-distribution with df $tr(H_p^*) - 1$, where $\hat{\boldsymbol\alpha}$ is the maximum likelihood estimate under the parametric $H_0$.

The assumption that $Z$ is orthogonal to $\mathbf{x}$ in Theorem 5 is required for mathematical convenience, in the sense that the corresponding off-diagonal elements of the local Fisher information are 0, for ease of deriving the asymptotic $\chi^2$-distribution of the test statistic (20). It will be seen in the simulations that the performance of the proposed tests remains reasonable when this assumption is violated. The integrated likelihood ratio tests in Theorems 3 and 5 depend on the bandwidth $h$, like other nonparametric tests. Analogous to Theorem 3, the test statistic (20) has the interpretation of integrating the differences in local deviances between a fitted generalized partially linear model and an intercept-only reduced model. We remark that the proposed tests are different from those in Hastie and Tibshirani (1990) and Li and Liang (2008). The proposed test statistics utilize integrated likelihood that combines all maximized local likelihoods from fitting local polynomials. Some existing methods use only the $G^{-1}(\hat\beta_0(x_i))$'s, and strictly speaking, the resulting expression is not based on maximizing the likelihood, as $\hat\beta_1, \ldots, \hat\beta_p$ are ignored; this fact was mentioned before Theorem 1 as well.

Some work in the literature, e.g., Härdle et al. (1998), has considered testing whether $m$ in (6) is significantly different from a linear trend, $G(\mu) = \mathbf{z}^\top \boldsymbol\alpha + a_0 + a_1 x$.
The extension of Theorem 5 to testing a linear trend is non-trivial and will be pursued in future work, since the variance function of $y$ is allowed to be a function of the mean of $y$ in the exponential family (1). In the special case of the Gaussian distribution with a constant variance, analysis-of-variance $F$-type tests for checking linear trends are derived in Huang and Davidson (2010).

5 Simulations

We examine the empirical type-I errors and power for the proposed tests in Theorems 3 and 5. Local linear smoothing ($p = 1$) with the Epanechnikov kernel is used throughout this section. We first describe the algorithm for calculating the test statistic (20) in Theorem 5 for testing $H_0: m(x) = a$ under model (6); that for (16) under model (4) is similar. The algorithm, adapted from Carroll et al. (1997), is given as follows:

Step 0 (initialization). Fit a parametric generalized linear model to obtain initial values $\breve{\boldsymbol\alpha}^{(0)}$.
Step 1. For a set of grid points on the data range of the $x_i$'s, given a value of $h$, maximize the local likelihood with $\breve{\boldsymbol\alpha}^{(r)}$ to obtain $\breve\beta_0^{(r)}, \ldots, \breve\beta_p^{(r)}$ for each grid point. Then with the $\breve\theta_i^{(r)}(x)$'s, calculate a locally weighted average as in (12) to obtain $\breve\theta_i^{**(r)}$, $i = 1, \ldots, n$.
Step 2. Maximize the global likelihood with $\boldsymbol\theta^{**(r)} = (\breve\theta_1^{**(r)}, \ldots, \breve\theta_n^{**(r)})^\top$ to update $\breve{\boldsymbol\alpha}^{(r+1)}$.
Step 3. Continue Steps 1 and 2 until convergence. The test statistic (20) is calculated by integrating the final local likelihoods and taking the difference to the global likelihood under $H_0$.

The simulation study focuses on logistic regression in Examples 1-4, as we wish to evaluate the proposed methods in order to analyze the German Bundesbank data in Section 6, while Example 5 is on Poisson regression. The integrated likelihoods in the test statistics (16) and (20) are approximated discretely by taking 201 equally-spaced points on $[0, 1]$ in Examples 1 and 2, and 301 equally-spaced points on $[-., ]$ in Examples 3-5.
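A minimal numerical sketch of Steps 0-3 for a logistic partially linear model follows (local linear $p = 1$, Epanechnikov kernel, Riemann-sum integration with a crude boundary normalization; the simulated data and all names are ours, not the authors' code):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 300
x = rng.uniform(0.0, 1.0, n)
z = rng.normal(0.0, 1.0, n)
m_true = lambda t: np.sin(2.0 * np.pi * t)      # intercept embedded in m
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-(1.0 * z + m_true(x))))).astype(float)

h = 0.15
grid = np.linspace(0.0, 1.0, 101)
dx = grid[1] - grid[0]
kern = lambda u: np.where(np.abs(u) <= 1.0, 0.75 * (1.0 - u**2), 0.0)

alpha = 0.0                                     # Step 0: crude initial value
for _ in range(5):                              # Step 3: iterate Steps 1 and 2
    # Step 1: local linear logistic fit at each grid point, z*alpha as offset
    b0 = np.zeros(grid.size)
    b1 = np.zeros(grid.size)
    for g, x0 in enumerate(grid):
        w = kern((x - x0) / h) / h
        X = np.column_stack([np.ones(n), x - x0])
        beta = np.zeros(2)
        for _ in range(8):                      # Newton steps on local likelihood
            mu = 1.0 / (1.0 + np.exp(-(alpha * z + X @ beta)))
            v = w * mu * (1.0 - mu)
            H = X.T @ (v[:, None] * X)
            beta += np.linalg.solve(H + 1e-8 * np.eye(2), X.T @ (w * (y - mu)))
        b0[g], b1[g] = beta
    # locally weighted average as in (12): integrate theta_i(x) K_h(x_i - x) dx
    Wg = kern((x[:, None] - grid[None, :]) / h) / h
    Wg = Wg / (Wg.sum(axis=1, keepdims=True) * dx)   # crude boundary normalization
    m_star = ((b0 + b1 * (x[:, None] - grid)) * Wg).sum(axis=1) * dx
    # Step 2: update alpha by global Newton steps with m_star as offset
    for _ in range(8):
        mu = 1.0 / (1.0 + np.exp(-(alpha * z + m_star)))
        alpha += (z @ (y - mu)) / (z**2 @ (mu * (1.0 - mu)))
```

The sketch integrates only the $m$-part of $\breve\theta_i(x)$ in (12), which is equivalent to integrating $\breve\theta_i(x)$ itself since the normalized kernel weights integrate to one and the $\mathbf{z}^\top \boldsymbol\alpha$ term does not depend on $x$.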
For the $x_i$'s that fall in the conventional boundary regions, analogous approximations are used for calculating $\int K_h(x_i - x)\,dx$ for the boundary correction in condition (A1). In addition to implementing the proposed tests with a fixed $h$, we also try selecting the bandwidth by AICc (Hurvich et al., 1998) and by the idea of Horowitz and Spokoiny (2001) (HS). The AICc criterion is adapted with the integrated deviance:
$$\mathrm{AICc}(h) = \log(D^*/n) + \frac{2\{\mathrm{tr}(H^*) + 1\}}{n - \mathrm{tr}(H^*) - 2},$$
where $D^*$ denotes the integrated deviance $\int d_x(y, \hat\mu_x)\,dx$ in (14) under model (4), or $\int d_x(y, \breve\mu_x)\,dx$ under model (6). The HS idea is to select the bandwidth that maximizes the test statistic. Critical values for the proposed tests are taken from the $\chi^2$-distribution at the 5% significance level, and 5000 simulated data sets are generated. The gam function in the mgcv R package (Wood, 2013) provides a chi-square test of zero effect of a smooth term, and we include it for comparison, with the default 10 spline basis functions and the penalty estimated by REML.
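As a concrete illustration, the adapted AICc above can be computed and minimized over candidate bandwidths as follows (a minimal sketch; the integrated deviance $D^*$ and $\mathrm{tr}(H^*)$ are assumed to have been obtained from the fit at each candidate $h$, and the numbers below are hypothetical):

```python
import numpy as np

def aicc(D_star, tr_H, n):
    """Adapted AICc: log(D*/n) + 2(tr(H*) + 1)/(n - tr(H*) - 2)."""
    return np.log(D_star / n) + 2.0 * (tr_H + 1.0) / (n - tr_H - 2.0)

# hypothetical (h, D*, tr(H*)) triples from fits at three candidate bandwidths
fits = [(0.10, 95.2, 10.3), (0.15, 101.7, 6.8), (0.20, 108.9, 5.1)]
n = 100
h_aicc = min(fits, key=lambda t: aicc(t[1], t[2], n))[0]
```

The criterion trades off the integrated deviance (fit) against the smoother trace (effective model size), exactly as AIC does for parametric models.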
Example 1: $\operatorname{logit}(p) = -a\cos(2\pi x)$, where $a = 0$ corresponds to $H_0$ and three increasing nonzero values of $a$ up to 1 are used to assess power. We first check the $\chi^2$-approximation under $H_0$ when $a = 0$ for both a fixed design, $x$ equally-spaced on $[0, 1]$, and a random design, $x \sim U(0, 1)$, with sample sizes $n = 50$, 100, and 200. Rejection rates for fixed bandwidths $h = 0.1$, 0.15, 0.2, and 0.25, and for the bandwidths selected by AICc and HS, are given in Table 1. It appears that when $n = 50$, the $\chi^2$-approximation for (16) under $H_0$ is not good, as the empirical type-I errors are all above 0.05 for either a fixed or a random design. For this reason, we do not consider the case of $n = 50$ further. As suggested by two reviewers, a bootstrap alternative for calculating the sample critical values for $n = 50$ may be considered in future research.

Table 1: Percent of rejection under $H_0$ in Example 1 with $a = 0$. The empirical type-I errors of the test statistic (16) are close to 0.05 for $n = 100$ with a fixed $h \geq 0.15$. The performance of $h_{AICc}$ when $n = 200$ is closer to $H_0$ than that of $n = 100$, and $h_{HS}$ has larger type-I errors as it attempts to optimize the power. The gam function performs consistently around 0.05 under $H_0$ regardless of the sample sizes.

                         h=0.1   h=0.15  h=0.2   h=0.25  h_AICc  h_HS    gam
n = 50,  fixed design    12.80    8.74    6.70    6.02   10.02   15.86   3.22
n = 50,  random design   14.10    9.34    7.20    6.24   10.38   17.26   3.06
n = 100, fixed design     7.38    5.30    4.80    4.52    7.18    9.96   3.99
n = 100, random design    7.94    5.84    4.70    4.36    7.28   10.48   4.08
n = 200, fixed design     4.96    4.20    4.18    3.96    5.56    7.36   4.62
n = 200, random design    5.22    4.38    4.42    4.20    6.00    7.76   5.04

When $a = 0$ and $n = 100$, the empirical type-I errors are mostly reasonable except for the small $h = 0.1$, $h_{AICc}$, and $h_{HS}$, with rates ranging roughly from 0.07 to 0.10. In this case with a random design, $h_{AICc}$ tends to select the largest bandwidth 0.3 (in 92.12% of the 5000 simulations), since the true model under $H_0$ is a constant, and when AICc happens to select a small bandwidth such as 0.1, it often leads to rejecting $H_0$. $h_{HS}$ behaves differently since it attempts to optimize the power; when $a = 0$ and $n = 100$, the empirical proportions of $h_{HS}$ over the 7 candidate values $h = 0.1, \ldots, 0.3$ are 44.58%, 5.8%, 4.96%, 4.08%, 5.04%, 4.84%, and 30.70%. The inflated type-I errors of $h_{HS}$ are therefore somewhat expected. In Horowitz and Spokoiny (2001), the critical values were based on resampling from the finite-sample null distribution, while we use the asymptotic $\chi^2$-distribution. When $n = 200$, the empirical type-I errors for the proposed tests are around 0.05 with a fixed bandwidth, and slightly above 0.05 for $h_{AICc}$ and $h_{HS}$.
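The notion of checking an empirical type-I error against an asymptotic chi-square critical value can be illustrated in miniature with an ordinary parametric likelihood ratio test (a self-contained sketch of the methodology, not the integrated-likelihood statistic itself): simulate logistic data under a null of constant success probability, fit null and alternative models by Newton scoring, and count rejections at the $\chi^2_1$ 5% critical value.

```python
import numpy as np

CHI2_1_95 = 3.8415  # 95th percentile of the chi-square distribution with 1 df

def logistic_ll(eta, y):
    """Bernoulli log-likelihood, numerically stable via logaddexp."""
    return np.sum(y * eta - np.logaddexp(0.0, eta))

def fit_logistic(X, y, iters=25):
    """Newton scoring for a logistic GLM."""
    b = np.zeros(X.shape[1])
    for _ in range(iters):
        eta = np.clip(X @ b, -30, 30)
        p = 1.0 / (1.0 + np.exp(-eta))
        W = p * (1.0 - p)
        H = X.T @ (W[:, None] * X) + 1e-10 * np.eye(X.shape[1])
        b = b + np.linalg.solve(H, X.T @ (y - p))
    return b

def type1_error(n=100, nsim=2000, seed=1):
    """Empirical rejection rate of the LR test of slope = 0 under the null."""
    rng = np.random.default_rng(seed)
    x = np.linspace(0.0, 1.0, n)              # fixed design
    X = np.column_stack([np.ones(n), x])
    rejections = 0
    for _ in range(nsim):
        y = rng.binomial(1, 0.5, n)           # null model: constant p = 0.5
        ll_full = logistic_ll(X @ fit_logistic(X, y), y)
        p0 = y.mean()
        ll_null = logistic_ll(np.full(n, np.log(p0 / (1.0 - p0))), y)
        if 2.0 * (ll_full - ll_null) > CHI2_1_95:
            rejections += 1
    return rejections / nsim
```

For moderate $n$ the empirical rate should sit near the nominal 0.05, which is the same diagnostic reported in Table 1 for the integrated likelihood ratio tests.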
When $n = 100$ with a fixed design, the df ($\mathrm{tr}(H^*_1) - 1$) are 10.29, 6.84, 5.11, and 4.07, respectively, for $h = 0.1$, 0.15, 0.2, and 0.25, while the average edf for gam is 1.33. Quantile-quantile plots (qqplots) of the 5000 test statistics (16) for $n = 100$ with a fixed design against the $\chi^2$-quantiles with the corresponding df are shown in Figure 1 for these four bandwidths; they are reasonably close to the $\chi^2$-distribution. The qqplots for $n = 200$ with a fixed bandwidth (not shown) are similar to those for $n = 100$.

Figure 1: Quantile-quantile plots of 5000 integrated likelihood ratio test statistics (16) under $H_0$ in Example 1 for $n = 100$ with $h = 0.1$, 0.15, 0.2, and 0.25, against the $\chi^2$-distribution with df 10.29, 6.84, 5.11, and 4.07, respectively.

Table 2 reports the empirical power under the alternatives. The tests with $h_{AICc}$ and $h_{HS}$ are more powerful than those with a fixed $h$, and under the alternatives the proposed tests are more powerful than gam except the case with $n = 200$ and the smallest bandwidth $h = 0.1$. The df of the tests with a fixed $h$ are the same as under $a = 0$, since $H^*$ in (17) does not involve the response $y$. When $n = 100$, the average edf of gam increases as $a$ increases (1.80, 2.30, and 2.64 for the three nonzero values of $a$), and the pattern for $n = 200$ is similar to that of $n = 100$.

Table 2: Percent of rejection under $H_a$ in Example 1. The rejection rate for the test statistic (16) increases as the bandwidth increases for nonzero $a$; $h_{AICc}$ and $h_{HS}$ are more powerful than a fixed $h$; and (16) is more powerful than gam except the case with $n = 200$ and $h = 0.1$. Rejection rates are reported for $n = 100$ and 200 and the three nonzero values of $a$.

Example 2: $\operatorname{logit}(p) = -f_k(x)$, $k = 0, 1, 2$, where $x \sim U(0, 1)$ and the functions $f_0(x) = 8x(1 - x)$, $f_1(x) = \exp(2x)$, and $f_2(x) = 2 \times 10^5 x^{11}(1 - x)^6 + 10^4 x^3(1 - x)^{10}$ are taken from Wood (2013). Wood (2013) considered an additive model combining these functions of three covariates, while we use the functions separately in the univariate case. The results are shown in Table 3, with the proportions of rejection nearly 100% for both tests in the cases of $f_1$ and $f_2$: for $n = 100$ and $f_2$, all rates are 100 except 99.74 for gam, and for $n = 200$ and $f_1$ or $f_2$, all rates are 100. For $f_0$, a quadratic trend, our test is more powerful than gam except the case with $n = 200$ and the smallest bandwidth.

Table 3: Percent of rejection for Example 2. The rejection rates are nearly 100% for (16) and gam in the cases of $f_1$ and $f_2$. For $f_0$, a quadratic trend, (16) is more powerful than gam except the case with $n = 200$ and the smallest bandwidth.

Example 3: $\operatorname{logit}(p) = b_1 z_1 + b_2 z_2 + a\exp(-x)$, $a = 0, 1, 2, 3$, where $z_1$ is first generated as a binary variable, and $z_2$ and $x$ are first generated from a bivariate normal distribution with mean 0, variances 0.5 and 1, respectively, and correlation 0.3. Then $x$ is transformed to have a uniform distribution on $(-0.5, 0.5)$. To satisfy the conditions in Theorem 5, $z_1$ and $z_2$ are then made orthogonal to $x$ and the intercept vector. After the orthogonalized $z_1$ and $z_2$ are obtained, the coefficients are fixed ($b_1$ positive and $b_2$ negative). To understand how restrictive the orthogonality assumption in Theorem 5 is, we also examine the performance of (20) with the original non-orthogonalized values of $z_1$ and $z_2$ and the same values of $b_1$ and $b_2$.

Table 4: Percent of rejection for Example 3. When $a = 0$ with a fixed $h \geq 0.2$, the empirical type-I errors of (20) are close to 0.05. When $a = 1$ and 2, (20) is more powerful than the gam test, while for $a = 3$ the performance of the two tests is close; for $n = 200$ and $a = 3$, all rates are 100 except 99.96 for gam, with the bracketed non-orthogonalized rates essentially identical. Under the alternatives, $h_{HS}$ is the most powerful, while $h_{AICc}$ also performs well. The rejection rates for (20) are quite close whether $Z$ and $x$ are orthogonal or not (the non-orthogonalized version in brackets).

The values of the bandwidth are 0.15, 0.2, 0.25, 0.3, and 0.4, so that they are roughly equally-spaced on a logarithmic scale. The percent of rejection is given in Table 4 for $h = 0.2$, 0.25, 0.3, and 0.4, $h_{AICc}$, and $h_{HS}$, with the non-orthogonalized version in brackets. The case of $h = 0.15$ is not presented due to its inflated type-I errors: when $a = 0$, $n = 100$, and $h = 0.15$, the percent of rejection is 8.72 and 8.28 for the orthogonalized and non-orthogonalized versions, respectively. From Table 4, we observe that when $a = 0$, the empirical type-I errors are reasonable except for $h_{AICc}$ and $h_{HS}$. Together with the observations in Example 1 under $H_0$, this suggests that optimizing the bandwidth by some criterion may lead to inflated type-I errors for our test in the case of logistic regression. When $a = 1$ and 2, our test is more powerful than the gam test, while for $a = 3$ the performance of the two tests is close. Under the alternatives, $h_{HS}$ is the most powerful, while $h_{AICc}$ also performs well, supporting
AICc as a bandwidth-selection criterion. The rejection rates for (20) are quite close whether $Z$ and $x$ are orthogonal or not, suggesting that this assumption may be relaxed in practice. When $n = 200$ and $a = 0$, the average df ($\mathrm{tr}(H^*_1) - 1$) corresponding to $h = 0.2, \ldots, 0.4$ decrease as $h$ increases and do not change with $a$. When $n = 200$, the average edf for gam is 1.34, 2.22, 3.89, and 4.73 for $a = 0$, 1, 2, 3, respectively. The df and edf for $n = 100$ are similar to those for $n = 200$.

Example 4: $\operatorname{logit}(p) = b_1 z_1 + b_2 z_2 + a\cos(2\pi x)$, $a = 0.5, 1, 1.5$, where the data generation scheme for $z_1$, $z_2$, and $x$ is identical to Example 3, and $b_1$ and $b_2$ are the same as in Example 3. This example adopts a nonlinear function of $x$ similar to that of Example 1, with a different range of $x$. The same values of $h$ as in Example 3 are used, and hence the dfs are analogous to those of Example 3 and omitted for brevity. Table 5 shows that (20) is more powerful than gam when $a = 0.5$ and 1.0, and our test with $h_{HS}$ and $h_{AICc}$ continues to perform well in this example. Again, we observe that for the proposed tests the rejection rates are quite close whether $Z$ and $x$ are orthogonal or not.

Example 5:
$$\log(\mu) = b_1 z_1 + b_2 z_2 + a\exp(-x), \quad a = 0, 1, 2, \qquad (21)$$
and
$$\log(\mu) = b_1 z_1 + b_2 z_2 + a\cos(2\pi x), \quad a = 0.5, 1.5. \qquad (22)$$
This example is for Poisson regression with the canonical log link, while the functional form for $\theta = \log(\mu)$ and the data generation scheme follow those of Examples 3 and 4. From Table 6, we observe that when $a = 0$ in (21) with $n = 100$ and $h = 0.15$, the type-I error is reasonable, in contrast to the logistic regression case. The performance of $h = 0.4$ is close to that of $h = 0.3$ and is therefore not presented in Table 6. We observe that our test using a fixed $h$ is more powerful with a larger bandwidth when $a = 1$ in (21) and $a = 0.5$ in (22), and the empirical power is comparable between the (20) and gam tests.

Table 5: Percent of rejection for Example 4. The test statistic (20) is more powerful than gam under the alternatives, and the rejection rates are quite close whether $Z$ and $x$ are orthogonal or not (the non-orthogonalized version in brackets). For $n = 200$ and $a = 1.5$, all rates are 100 except 99.86 for gam (99.88 in the non-orthogonalized version).
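The orthogonalization used in Examples 3 and 4 amounts to residualizing each column of $Z$ on the span of the intercept and $x$ by least squares. A minimal sketch (our own helper function, not code from the paper):

```python
import numpy as np

def orthogonalize(Z, x):
    """Residualize each column of Z on span{1, x}; the residual matrix is
    orthogonal to both the intercept vector and x."""
    B = np.column_stack([np.ones(len(x)), x])
    coef, *_ = np.linalg.lstsq(B, Z, rcond=None)
    return Z - B @ coef
```

After this step the inner products of the columns of $Z$ with $x$ and with the intercept vanish up to numerical error, which is the situation assumed in Theorem 5.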
Table 6: Percent of rejection for Example 5. When $a = 0$, the type-I errors of (20) are reasonable for this Poisson regression example. The test statistic (20) using a fixed $h$ is more powerful with a larger bandwidth when $a = 1$ in (21) and $a = 0.5$ in (22), and the empirical power is comparable between the (20) and gam tests. The non-orthogonalized version is in brackets.

                           h=0.15    h=0.2     h=0.25    h=0.3     h_AICc    h_HS      gam
n = 100, a = 0   in (21)    5.36      4.80      4.18      3.94      7.06      7.62      4.88
                           [ 5.48]   [ 4.64]   [ 4.40]   [ 4.26]   [ 7.14]   [ 7.68]   [ 4.86]
n = 100, a = 1   in (21)   41.26     44.48     47.24     48.42     51.98     54.48     37.58
                           [41.36]   [44.52]   [47.12]   [48.90]   [52.60]   [54.98]   [37.24]
n = 100, a = 2   in (21)   99.98     99.98     99.98    100        99.98    100        99.94
                           [99.98]   [99.98]   [99.98]  [100]      [99.98]  [100]      [99.96]
n = 200, a = 0   in (21)    4.98      4.56      4.48      4.32      6.92      7.38      5.14
                           [ 5.02]   [ 4.66]   [ 4.74]   [ 4.60]   [ 7.06]   [ 7.62]   [ 5.32]
n = 200, a = 1   in (21)   74.10     78.64     81.48     82.88     85.02     85.86     75.04
                           [73.62]   [78.46]   [80.84]   [82.04]   [84.12]   [84.92]   [74.14]
n = 200, a = 2   in (21)  100       100       100       100       100       100       100
                          [100]     [100]     [100]     [100]     [100]     [100]     [100]
n = 100, a = 0.5 in (22)   26.20     28.36     30.02     30.90     34.72     36.58     17.32
                           [25.72]   [27.56]   [29.40]   [30.18]   [34.06]   [35.66]   [16.96]
n = 100, a = 1.5 in (22)   99.92    100       100       100       100       100        99.98
                           [99.94]   [99.94]   [99.98]  [100]     [100]     [100]      [99.94]
n = 200, a = 0.5 in (22)   51.86     56.74     60.22     62.24     64.90     65.88     43.94
                           [51.18]   [56.22]   [60.18]   [61.92]   [64.44]   [65.38]   [43.90]
n = 200, a = 1.5 in (22)  100       100       100       100       100       100       100
                          [100]     [100]     [100]     [100]     [100]     [100]     [100]

Banking throughout the world is based on credit, that is, on trust in the debtor's ability to fulfill his or her debt obligations. Facing increasing pressure from markets and regulators, however, banks have increasingly based their risk analysis on statistical techniques to judge or predict corporate bankruptcy. This is known as rating or scoring. Its main purpose is to assess the financial status of a company and, if possible, to estimate the probability that a company defaults on its debt obligations within a certain period. Logistic regression is probably the most commonly used technique for modeling the probability of default, and logistic partially linear models may also be advantageous because of their flexibility in allowing for a possibly nonlinear effect of one continuous covariate. We apply the methodology to the German Bundesbank data for the year 2002. The data, provided by CRC 649, Humboldt-Universität zu Berlin, contain 6123 companies, of which 186 were insolvent. Each firm is described by 28 financial ratio variables, $x_1, \ldots, x_{28}$, and the ratios of insolvent firms were collected two years prior to insolvency. To ensure that the variables appearing in denominators are nonzero when calculating the ratios, 2079 firms were retained, 92 of them insolvent. Though removing almost two thirds of the sample may seem excessive, we did not intend to analyze the majority of firms in the database. The focus was to investigate (i) differences between the financial ratios of the solvent and insolvent firms, and (ii) how nonlinear effects improve parametric logistic fitting.

Based on support vector machines and a much larger data sample spanning 1996 through 2002, Chen et al. (2011) selected accounts payable/sales ($x_{24}$), measuring accounts payable turnover, as the best predictor, and subsequently selected operating income/total assets (measuring profitability), (cash and cash equivalents)/total assets (measuring liquidity), total liabilities/total assets (measuring leverage), increase (decrease) in inventories/inventories (measuring the percentage of incremental inventories), inventories/sales (measuring inventory turnover), (earnings before interest and tax)/total assets, and net income/sales (measuring net profit margin).
For the year 2002 data, we found that two of these ratios have a large sample correlation coefficient of 0.95, and one of the pair is therefore removed from our analysis; we further include the logarithm of total assets, measuring firm size, as it has been shown in the literature to be an important variable for predicting the probability of bankruptcy (see, e.g., Lopez 2004). In summary, there are 8 predictors and a binary response. See Chen et al. (2011) for detailed descriptions of the data.

Since $x_{24}$ was selected as the most important predictor by Chen et al. (2011), we model its effect nonparametrically, while retaining linear trends for the remaining predictors in a logit model. The variable $x_{24}$, measuring accounts payable turnover, is a short-term liquidity measure quantifying the rate at which a firm pays off its suppliers. Generally speaking, "the firms with higher account payable turnover will have less ability to convert their accounts into sales, have lower revenues, and go bankrupt more readily" (Chen et al. 2011). However, this measure is specific to different industries; every industry has a slightly different standard. Further examination of $x_{24}$ indicates that most values lie in $[0, 0.5]$, with only 15 observations beyond that range. If those 15 observations are excluded, the sample size becomes 2064, of which 91 are insolvent. An alternative approach, suggested by a reviewer, is taking the logarithm of $(x_{24} + 0.001)$ (0.001 is added since $x_{24}$ includes 0's) and retaining the full sample size $n = 2079$.

Local linear smoothing with the Epanechnikov kernel is used. The values of the bandwidth for $x_{24}$ (0.125, 0.1, and 0.08) are equally-spaced on a logarithmic scale, corresponding to df 4.94, 5.92, and 7.17, respectively. The bandwidth that minimizes AICc is 0.125. The curves for $m(x_{24})$, with pointwise confidence intervals based on empirical Fisher information matrices, are shown in Figure 2 for $h = 0.125$ and 0.1, and the proposed test of $H_0$: $m(x_{24})$ is a constant gives highly significant $p$-values for all 3 values of the bandwidth, indicating the significance of $m(x_{24})$ in predicting the probability of default. A linear logistic model gives a positive slope of 10.37 for $x_{24}$, also with a highly significant $p$-value. Since Chen et al. (2011) interpreted the linear trend as higher default probability with higher turnover, we attempt to interpret the seemingly nonlinear curves in Figure 2 as follows. Taking the curve with $h = 0.125$ in Figure 2, when $x_{24}$ increases from 0.1 to 0.3, the estimate increases by about 1.895, which means that the odds of insolvency for a firm with $x_{24} = 0.3$ are $\exp(1.895) \approx 6.65$ times those of a firm with $x_{24} = 0.1$. On the other hand, between $x_{24} = 0.3$ and 0.5 the estimate decreases, implying that the odds ratio for a firm with $x_{24} = 0.5$ to become insolvent relative to a firm with $x_{24} = 0.3$ is below 1. Thus our analysis gains new insight, suggesting that a German firm is more likely to go bankrupt when it has a higher turnover for roughly 97.5% of firms (0.3 is approximately the 97.5-percentile of $x_{24}$), but for those firms with $0.3 < x_{24} < 0.5$ (approximately the 97.5- to 99-percentiles), the default probability decreases as $x_{24}$ increases.

Figure 2: Plot of the nonlinear trend of $x_{24}$ in predicting the probability of bankruptcy, using bandwidths $h = 0.125$ (solid line) and $h = 0.1$ (dashed line), with 95% pointwise confidence intervals, for the 2002 German Bundesbank data.

For smoothing on $\log(x_{24} + 0.001)$, denoted $\log x_{24}$, with $n = 2079$, several values of $h$ are used, with $h_{HS} = 6$. The curves for $\log x_{24}$ shown in Figure 3 have a linear tendency for smaller values of $\log x_{24}$, and the confidence intervals corresponding to the smaller bandwidth imply some uncertainty near the right-hand endpoint. The tests of $H_0$: $m(\log x_{24})$ is a constant give highly significant $p$-values. If using a linear trend for $\log x_{24}$, the slope is 0.927, with a significant $p$-value. The analysis using $\log x_{24}$ implies that a German firm is more likely to go bankrupt when it has a high turnover on the log scale, but the linear trend is uncertain for firms with large $x_{24}$. Hence the interpretation in Chen et al. (2011) that a German firm is likely to go bankrupt when it has a high turnover may not be entirely correct; the effect of $x_{24}$ on the probability of bankruptcy may be nonlinear for firms with large turnovers, as shown in Figures 2 and 3.

Figure 3: Plot of the trend of $\log(x_{24} + 0.001)$ in predicting the probability of bankruptcy, using bandwidth $h = 6$ (solid line) and a smaller bandwidth (dashed line), with 95% pointwise confidence intervals, for the 2002 German Bundesbank data.

We develop local and global analysis of deviance expressions and associated integrated likelihood ratio tests for generalized partially linear models with canonical links, based on fitting local $p$-th order polynomials. Though the idea of nonparametric analysis of deviance is not new (Hastie and Tibshirani, 1990), the work in this paper provides theoretical justifications that connect it to the classic framework. Theorems 2 and 4(b) are restricted to $p = 1$ and 3 only, while Theorems 1, 3, 4(a)(c), and 5 hold for any nonnegative integer $p$. As a by-product, new estimators for the canonical parameter and the response mean are proposed, and Theorems 2 and 4(b) show that the integrated likelihood quantities are asymptotically global likelihood quantities with the new estimators.
The new estimator $\hat\theta^*_i$ or $\breve\theta^*_i$ of the canonical parameter is formed by combining the locally fitted $\hat\theta_i(x)$ or $\breve\theta_i(x)$ through weighted integration, and it thus utilizes all locally fitted parameters, in contrast to the conventional approach of focusing on $\hat\beta_0$. The integrated likelihood approach of combining local likelihoods appears to be new in the smoothing literature, though it was discussed by Severini (2007) and Lehmann (2006) in different settings. The numerical results for $n = 100$ and 200 show that the test statistics under the null hypothesis follow the asymptotic $\chi^2$-distribution reasonably well, and that under alternative hypotheses their performance is sometimes more powerful than Wood (2013) in the R package mgcv. It has been suggested by a reviewer to investigate the asymptotic power of the proposed tests. Since there is no simple explicit expression for the Fisher information in generalized linear models (1), we conjecture that the study of power may be focused on the special cases of logistic and Poisson models, which will be explored in future research. For a smaller sample size such as $n = 50$, two reviewers have suggested developing a bootstrap procedure for calculating the sample critical values, which merits further investigation.

The local analysis of deviance in (9) and in Theorem 4(a) is derived assuming a fixed value of the bandwidth. It is straightforward to obtain local analysis of deviance expressions with varying values of the bandwidth at different $x$, but how to combine them to form a global analysis of deviance is an interesting open problem. Like all smoothing-based tests, the $p$-values of the integrated likelihood ratio tests depend on the value of the smoothing parameter. We recommend plotting the "significance trace" (Bowman and Azzalini 1997) to assess the evidence across a wide range of values of $h$ and looking for overall trends. For fitting generalized partially linear models, a practical problem is how to choose the predictor to be modelled nonparametrically.
One approach may be based on selecting the most significant predictor, namely the one with the smallest $p$-value of the integrated likelihood ratio tests when using approximately the same degrees of freedom for smoothing. This idea, and the related model selection problem with a diverging number of linear covariates (Wang et al. 2014), may be explored in future research. Another topic for further investigation is bandwidth selection for models (4) and (6), based on cross-validating the deviance or minimizing the Akaike information criterion. Further extensions of the analysis of deviance remain to be investigated: to generalized partially linear models with non-canonical links, to multiplicative bias reduction methods (Kosmidis and Firth 2009), to hazard estimation (Nielsen and Tanggaard 2001), given the connection between proportional hazards models and Poisson regression, and to generalized additive models with multiple nonparametric functions.

Acknowledgement
We thank the editor, the associate editor, and two anonymous referees for their constructive comments and suggestions. The work was conceived during a visit of the second author to the Humboldt-Universität zu Berlin supported by CRC 649 "Economic Risk"; the support is greatly appreciated. The first author was partially supported by the SKBI School of Business, Singapore Management University; the second author (corresponding author) was partially supported by the Ministry of Science and Technology grant NSC 101-2118-M-007-002-MY2 in Taiwan; and both authors were partially supported by CRC 649 "Economic Risk." The authors also wish to thank Mr. Leslie Udvarhelyi, who assisted in the proof-reading of the manuscript.
Appendix

CONDITIONS (A).

(A1). The kernel $K(\cdot)$ is a Lipschitz continuous, bounded, and symmetric probability density function with support on a compact interval, say $[-1, 1]$. For the $x_i$'s that fall in the conventional boundary region, a boundary-corrected kernel is used to ensure $\int K_h(x_i - x)\,dx = 1$.

(A2). The covariate $X$ is assumed to have a bounded support $\mathcal{X}$.

(A3). The function $(\partial^2/\partial\mu^2)\,\ell\{y; G(\mu)\} < 0$ for $\mu \in \mathbb{R}$ and $y$ in the range of the response variable.

(A4). The functions $L'$, $\theta^{(p+2)}$, $b''(\theta(\cdot)) \equiv V(\cdot)$, $V''$, and $G^{(3)}$ are continuous with respect to $x$.

(A5). For each $x$ in $\mathcal{X}$, $V(x)$ and $G'(\mu(x))$ are non-zero.

(A6). For (6), the covariate vector $z$ is assumed to have a bounded support.

Conditions (A1)-(A5) are similar to those in Fan et al. (1995). Without loss of generality, Condition (A2) is satisfied in practice by strictly increasing transformations of $X$ when the support of $X$ before transformation is unbounded; a similar remark applies to Condition (A6). Condition (A3) ensures that the local polynomial estimate $\hat\beta$ lies in a compact set. Conditions (A3) and (A5) imply that $V(x) > 0$ for $x \in \mathcal{X}$. Condition (A4) implies that all third derivatives of $\ell\{y; G(\mu)\}$ with respect to $\mu$ are continuous, and that $V'$ and $\mu'$ are continuous with respect to $x$.

Proof of Proposition
When $p = 1$, the bias of $\theta^*_i$ is expressed as follows:
$$E\left\{\int (\hat\beta_0(x) + (x_i - x)\hat\beta_1(x))K_h(x_i - x)\,dx\right\} - \theta_i = E\Big[\int \{(\hat\beta_0(x) - \beta_0(x)) + (x_i - x)(\hat\beta_1(x) - \beta_1(x))\}K_h(x_i - x)\,dx - \int \{\beta_2(x)(x_i - x)^2 + r(x, x_i)\}K_h(x_i - x)\,dx\Big], \qquad (23)$$
where $r(x, x_i)$ denotes the remainder terms. Plugging the first-order term of the asymptotic bias of $\hat\beta_0(x)$ (Fan et al. 1995; Aerts and Claeskens 1997) into (23) leads to cancellation with the $\beta_2(x)$-term in (23). The remaining term $\int (x_i - x)(\hat\beta_1(x) - \beta_1(x))K_h(x_i - x)\,dx$ is of order $h^4$. Thus the $h^2$-order term in (23) is zero. Similar arguments can be given for $p = 3$.

Fan et al. (1995) and Aerts and Claeskens (1997) showed that the variance of $\hat\beta_j(x)$ is of order $n^{-1}h^{-2j-1}$ when $p$ is odd. Then the variance of $\{\hat\beta_0(x) + (x_i - x)\hat\beta_1(x)\}$ is of order $n^{-1}h^{-1}$, and hence the variance of $\theta^*_i$ is of order $n^{-1}h^{-1}$. Finally, it is straightforward to show (c) based on (a) and (b), since $\mu^*_i = G^{-1}(\theta^*_i)$.

Proof of Theorem 2
We only need to show Theorem 2(a), as Theorem 2(b) follows directly from Theorem 2(a). For the left-hand side of (13), ignoring the $c(y, \phi)$ and $a(\phi)$ terms in (1), which are unrelated to $x$, the integrated likelihood is
$$\int \ell_x(y; \hat\theta_x)\,dx = \sum_i \left\{y_i \int \hat\theta_i(x)K_h(x_i - x)\,dx - \int b(\hat\theta_i(x))K_h(x_i - x)\,dx\right\}.$$
By Taylor's expansion,
$$b(\hat\theta_i(x)) = b(\theta^*_i) + b'(\theta^*_i)\{\hat\theta_i(x) - \theta^*_i\} + b''(\theta^*_i)\{\hat\theta_i(x) - \theta^*_i\}^2/2 + r_i(x),$$
where $r_i(x)$ denotes the remainder terms. For the linear term,
$$\int b'(\theta^*_i)(\hat\theta_i(x) - \theta^*_i)K_h(x_i - x)\,dx = b'(\theta^*_i)(\theta^*_i - \theta^*_i) = 0.$$
The quadratic term satisfies $\int \{\hat\theta_i(x) - \theta^*_i\}^2 K_h(x_i - x)\,dx = \int \hat\theta_i(x)\{\hat\theta_i(x) - \theta^*_i\}K_h(x_i - x)\,dx$. For $\hat\theta_i(x) - \theta^*_i = (\hat\theta_i(x) - \theta_i) - (\theta^*_i - \theta_i)$, the first term is $\hat\theta_i(x) - \theta_i = (\hat\beta_0 - \beta_0) + (\hat\beta_1 - \beta_1)(x_i - x) + \cdots + (\hat\beta_p - \beta_p)(x_i - x)^p + r'_i(x)$ by Taylor's expansion of $\theta_i$, where $r'_i(x)$ denotes the remainder terms. Then, based on Theorem 1(a) of Fan et al. (1995), $\hat\theta_i(x) - \theta_i$ is of order $O(h^{p+1})$. The second term, $\theta^*_i - \theta_i$, is of order $O(h^{2(p+1)})$ by the Proposition. Thus $\int b(\hat\theta_i(x))K_h(x_i - x)\,dx = b(\theta^*_i) + O(h^{2(p+1)})$, and (13) is proved.

Proof of Theorem 3
Let $\ell(y_i; a_0)$ be the likelihood corresponding to $y_i$ with $\theta_i = a_0$. Define $\ell_x(y; a_0) = \sum_i \ell(y_i; a_0)K_h(x_i - x)$; it is clear that $\int \ell_x(y; a_0)\,dx = \ell(y; a_0)$ under (A1). Recall that $\hat\beta = (\hat\beta_0, \ldots, \hat\beta_p)^\top$ maximizes the local likelihood at $x$, $\ell_x(y; \theta_x)$, with the local polynomial approximation. Expanding $\ell_x(y; \hat\theta_x)$, which is a function of $\hat\beta$, around the $(p + 1)$-length vector $a = (a_0, 0, \ldots, 0)^\top$,
$$\ell_x(y; \hat\theta_x) - \ell_x(y; a) = \left\{\frac{\partial\ell_x}{\partial\beta}(y; a)\right\}^\top (\hat\beta - a) + \frac{1}{2}(\hat\beta - a)^\top \frac{\partial^2\ell_x}{\partial\beta^2}(y; a)(\hat\beta - a) + O_p(\sqrt{n^{-1}h^{-1}}), \qquad (24)$$
where $\partial\ell_x(y; a)/\partial\beta = \partial\ell_x(y; \theta_x)/\partial\beta\,\big|_a$ and similarly for $\partial^2\ell_x(y; a)/\partial\beta^2$. Substituting the expansion
$$\hat\beta - a = i_x(a)^{-1}\frac{\partial\ell_x}{\partial\beta}(y; a) + O_p(\sqrt{n^{-1}h^{-1}}),$$
where $i_x(a) = E\{-\partial^2\ell_x(y; \theta_x)/\partial\beta^2\}(a)$, we have for (24)
$$\left\{\frac{\partial\ell_x}{\partial\beta}(y; a)\right\}^\top i_x(a)^{-1}\frac{\partial\ell_x}{\partial\beta}(y; a) + \frac{1}{2}\left\{\frac{\partial\ell_x}{\partial\beta}(y; a)\right\}^\top i_x(a)^{-1}\frac{\partial^2\ell_x}{\partial\beta^2}(y; a)\,i_x(a)^{-1}\frac{\partial\ell_x}{\partial\beta}(y; a) + O_p(\sqrt{n^{-1}h^{-1}}).$$
Since $\partial^2\ell_x(y; a)/\partial\beta^2 = -i_x(a) + O_p(\sqrt{nh})$, it follows that $2\int \{\ell_x(y; \hat\theta_x) - \ell_x(y; a)\}\,dx$ is asymptotically
$$\int \left\{\frac{\partial\ell_x}{\partial\beta}(y; a)\right\}^\top i_x(a)^{-1}\frac{\partial\ell_x}{\partial\beta}(y; a)\,dx = \left\{i(a_0)^{-1/2}\frac{\partial\ell}{\partial\theta}(y; a_0)\right\}^\top H^*_p \left\{i(a_0)^{-1/2}\frac{\partial\ell}{\partial\theta}(y; a_0)\right\},$$
where the last expression is obtained by plugging in the explicit expressions of $\partial\ell_x(y; a)/\partial\beta$ and $i_x(a)^{-1}$, and $i(a_0)$ is the information matrix under $H_0$.

By standard ML theory, $2\{\ell(y; \hat a_0) - \ell(y; a_0)\}$ is asymptotically distributed according to a $\chi^2$-distribution with one degree of freedom, and $i(a_0)^{-1/2}\,\partial\ell(y; a_0)/\partial\theta$ is asymptotically normally distributed with mean vector 0 and identity covariance matrix. Then the test statistic (16) becomes
$$\left\{i(a_0)^{-1/2}\frac{\partial\ell}{\partial\theta}(y; a_0)\right\}^\top \{H^*_p - P\} \left\{i(a_0)^{-1/2}\frac{\partial\ell}{\partial\theta}(y; a_0)\right\},$$
where $P$ is an $n \times n$ matrix with $1/n$ in all entries. From Huang and Chen (2008), $H^*_p - P$ is symmetric and asymptotically idempotent. Thus the test statistic (16) has an asymptotic $\chi^2$-distribution with df $\mathrm{tr}(H^*_p) - 1$.

Proof of Theorem 5
The proof is an extension of that of Theorem 3. Let $\alpha_0 = (a_0, \alpha^\top)^\top$ denote the parameter vector under $H_0$ and $\ell(y; \alpha_0)$ the corresponding likelihood. Define the local likelihood at $x$ under $H_0$ by $\ell_x(y; \alpha_0) = \sum_i \ell(y_i; \alpha_0)K_h(x_i - x)$, and hence $\int \ell_x(y; \alpha_0)\,dx = \ell(y; \alpha_0)$. We consider $2\{\int \ell_x(y; \breve\theta_x)\,dx - \ell(y; \alpha_0)\}$ and $2\{\ell(y; \hat\alpha_0) - \ell(y; \alpha_0)\}$ separately; their difference is the test statistic (20).

Let $b = (\beta^\top, \alpha^\top)^\top$ be the parameter vector under $H_a$. Expanding $\ell_x(y; \breve\theta_x)$, which is a function of $\breve b = (\breve\beta^\top, \breve\alpha^\top)^\top$, around the $(p + K + 1)$-length vector $a = (a_0, 0, \ldots, 0, \alpha^\top)^\top$,
$$\ell_x(y; \breve\theta_x) - \ell_x(y; a) = \left\{\frac{\partial\ell_x}{\partial b}(y; a)\right\}^\top (\breve b - a) + \frac{1}{2}(\breve b - a)^\top \frac{\partial^2\ell_x}{\partial b^2}(y; a)(\breve b - a) + O_p(\sqrt{n^{-1}h^{-1}}). \qquad (25)$$
Substituting the expansion under $H_0$,
$$\breve b - a = i_x(a)^{-1}\frac{\partial\ell_x}{\partial b}(y; a) + O_p(\sqrt{n^{-1}h^{-1}}),$$
where $i_x(a) = E\{-\partial^2\ell_x(y; \theta)/\partial b^2\}(a)$, we have for (25)
$$\left\{\frac{\partial\ell_x}{\partial b}(y; a)\right\}^\top i_x(a)^{-1}\frac{\partial\ell_x}{\partial b}(y; a) + \frac{1}{2}\left\{\frac{\partial\ell_x}{\partial b}(y; a)\right\}^\top i_x(a)^{-1}\frac{\partial^2\ell_x}{\partial b^2}(y; a)\,i_x(a)^{-1}\frac{\partial\ell_x}{\partial b}(y; a) + O_p(\sqrt{n^{-1}h^{-1}}).$$
Since $\partial^2\ell_x(y; a)/\partial b^2 = -i_x(a) + O_p(\sqrt{nh})$ under $H_0$, it follows that $2\int \{\ell_x(y; \breve\theta_x) - \ell_x(y; \alpha_0)\}\,dx$ is asymptotically
$$\left\{i(\alpha_0)^{-1/2}\frac{\partial\ell}{\partial\theta}(y; \alpha_0)\right\}^\top \{H^*_p + P_z\} \left\{i(\alpha_0)^{-1/2}\frac{\partial\ell}{\partial\theta}(y; \alpha_0)\right\},$$
where $P_z$ is the projection matrix for $Z$ and $i(\alpha_0)$ is the information matrix under $H_0$.
The last expression is obtained by plugging in the explicit expressions of $\partial \ell_x(y; a)/\partial b$ and $i_x(a)^{-1}$ and using the assumption that $x$ and $Z$ are orthogonal. For the other term, $2\{\ell(y; \hat\alpha) - \ell(y; \alpha)\}$ is asymptotically distributed according to a $\chi^2$-distribution with $(K + 1)$ degrees of freedom under $H_0$ by standard ML theory. Hence the test statistic (16) becomes
\[
\Big\{ i(\alpha)^{-1/2} \frac{\partial \ell}{\partial \theta}(y; \alpha) \Big\}^{\top} \{ H^*_p - P_1 \} \Big\{ i(\alpha)^{-1/2} \frac{\partial \ell}{\partial \theta}(y; \alpha) \Big\},
\]
where $P_1$ is the same as in the proof of Theorem 3. Also $i(\alpha)^{-1/2}\, \partial \ell(y; \alpha)/\partial \theta$ is asymptotically normally distributed with mean vector 0 and identity covariance matrix. Hence the test statistic (16) has an asymptotic chi-square distribution with df $\mathrm{tr}(H^*_p) - 1$.

References

Aerts, M., and Claeskens, G. (1997), “Local Polynomial Estimation in Multiparameter Likelihood Models,”
Journal of the American Statistical Association, 92, 1536-1545.

Bowman, A. W., and Azzalini, A. (1997), Applied Smoothing Techniques for Data Analysis, Oxford: Oxford University Press.

Carroll, R. J., Fan, J., Gijbels, I., and Wand, M. P. (1997), “Generalized Partially Linear Single-index Models,” Journal of the American Statistical Association, 92, 477-489.

Chen, S., Härdle, W., and Moro, R. (2011), “Modeling Default Risk with Support Vector Machines,” Quantitative Finance, 11, 135-154.

Fan, J., and Gijbels, I. (1996), Local Polynomial Modelling and Its Applications, London: Chapman and Hall.

Fan, J., Heckman, N. E., and Wand, M. P. (1995), “Local Polynomial Kernel Regression for Generalized Linear Models and Quasi-Likelihood Functions,” Journal of the American Statistical Association, 90, 141-150.

Fan, J., Zhang, C., and Zhang, J. (2001), “Generalized Likelihood Ratio Statistics and Wilks Phenomenon,” Annals of Statistics, 29, 153-193.

Green, P. J., and Silverman, B. W. (1994), Nonparametric Regression and Generalized Linear Models: A Roughness Penalty Approach, London: Chapman and Hall.

Härdle, W., Müller, M., and Mammen, E. (1998), “Testing Parametric Versus Semiparametric Modeling in Generalized Linear Models,” Journal of the American Statistical Association, 93, 1461-1474.

Härdle, W. K., Müller, M., Sperlich, S., and Werwatz, A. (2004), Nonparametric and Semiparametric Models, Berlin: Springer.

Hastie, T. J., and Tibshirani, R. J. (1987), “Local Likelihood Estimation,” Journal of the American Statistical Association, 82, 559-567.

Hastie, T. J., and Tibshirani, R. J. (1990), Generalized Additive Models, London: Chapman and Hall.

Horowitz, J. L., and Spokoiny, V. G. (2001), “An Adaptive, Rate-optimal Test of a Parametric Mean-Regression Model Against a Nonparametric Alternative,” Econometrica, 69, 599-631.

Huang, L.-S., and Chan, K.-S. (2014), “Local Polynomial and Penalized Trigonometric Series Regression,” Statistica Sinica, 24, 1215-1238.

Huang, L.-S., and Chen, J. (2008), “Analysis of Variance, Coefficient of Determination, and F-test for Local Polynomial Regression,” Annals of Statistics, 36, 2085-2109.

Huang, L.-S., and Davidson, P. W. (2010), “Analysis of Variance and F-tests for Partial Linear Models with Applications to Environmental Health Data,” Journal of the American Statistical Association, 105, 991-1004.

Hurvich, C. M., Simonoff, J. S., and Tsai, C.-L. (1998), “Smoothing Parameter Selection in Nonparametric Regression Using an Improved Akaike Information Criterion,” Journal of the Royal Statistical Society, Series B, 60, 271-293.

Kosmidis, I., and Firth, D. (2009), “Bias Reduction in Exponential Family Nonlinear Models,” Biometrika, 96, 793-804.

Lehmann, E. L. (2006), “On Likelihood Ratio Tests,” in Optimality: The Second Erich L. Lehmann Symposium, Institute of Mathematical Statistics Lecture Notes - Monograph Series vol. 49, ed. J. Rojo, Beachwood, OH: Institute of Mathematical Statistics, pp. 1-8.

Li, R., and Liang, H. (2008), “Variable Selection in Semiparametric Regression Modeling,” Annals of Statistics, 36, 261-286.

Loader, C. (1999), Local Regression and Likelihood, New York: Springer.

Lopez, J. A. (2004), “The Empirical Relationship Between Average Asset Correlation, Firm Probability of Default, and Asset Size,” Journal of Financial Intermediation, 13, 265-283.

McCullagh, P., and Nelder, J. A. (1989), Generalized Linear Models (2nd ed.), London: Chapman and Hall.

Nielsen, J. P., and Tanggaard, C. (2001), “Boundary and Bias Correction in Kernel Hazard Estimation,” Scandinavian Journal of Statistics, 28, 675-698.

Severini, T. A. (2007), “Integrated Likelihood Functions for Non-Bayesian Inference,” Biometrika, 94, 529-542.

Simon, G. (1973), “Additivity of Information in Exponential Family Probability Laws,” Journal of the American Statistical Association, 68, 478-482.

Wang, L., Xue, L., Qu, A., and Liang, H. (2014), “Estimation and Model Selection in Generalized Additive Partial Linear Models for Correlated Data with Diverging Number of Covariates,” Annals of Statistics, 42, 592-624.

Wood, S. N. (2006),