Infinitely imbalanced binomial regression and deformed exponential families
T. Sei

April 23, 2013
Abstract
The logistic regression model is known to converge to a Poisson point process model if the binary response tends to be infinitely imbalanced. In this paper, it is shown that this phenomenon is universal in a wide class of link functions for binomial regression. The proof relies on extreme value theory. For the logit, probit and complementary log-log link functions, the intensity measure of the point process becomes an exponential family. For some other link functions, deformed exponential families appear. A penalized maximum likelihood estimator for the Poisson point process model is suggested.

Keywords: binomial regression; extreme value theory; imbalanced data; Poisson point process; q-exponential family.

1 Introduction

Let $\{(X_i, Y_i)\}_{i=1}^m$ be $m$ independently and identically distributed observable data on $\mathbb{R}^p \times \{0, 1\}$. The conditional distribution of $Y_i$ given $X_i$ is assumed to be
$$P(Y_i = 1 \mid X_i, a, b) = G(a + b^T X_i), \quad a \in \mathbb{R},\ b \in \mathbb{R}^p, \qquad (1)$$
where $G(\cdot)$ is a one-dimensional cumulative distribution function. The inverse function $G^{-1}(p) = \sup\{z : G(z) \le p\}$ is the link function in terms of generalized linear models. Denote the marginal distribution of $X_i$ by $F(dX_i)$. The distribution function $G$ is typically the logistic, standard normal or Gumbel distribution; the corresponding link functions are the logit, probit and complementary log-log functions, respectively. For these three examples, the log-likelihood function of (1) is concave; see Wedderburn (1976).

Our interest is in the situation where the data is highly imbalanced, that is, the probability of success is almost zero. Examples of such cases arise in fraud detection, medical diagnosis, political analysis and so forth; see e.g. Bolton & Hand (2002), Chawla et al. (2004), Jin et al. (2005), and King & Zeng (2001).
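For concreteness, the three classical choices of $G$ and the corresponding links can be written down directly. The following is an illustrative sketch, not code from the paper; the function names are ours:

```python
# Illustrative sketch of the three classical distribution functions G in
# model (1) and their inverse links (logit and cloglog shown explicitly).
import math

def logistic(z):      # G for the logit link
    return 1.0 / (1.0 + math.exp(-z))

def std_normal(z):    # G for the probit link
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def gumbel_min(z):    # G(z) = 1 - exp(-e^z), giving the cloglog link
    return 1.0 - math.exp(-math.exp(z))

def logit(p):         # inverse of the logistic distribution function
    return math.log(p / (1.0 - p))

def cloglog(p):       # inverse of the Gumbel (minimum) distribution function
    return math.log(-math.log(1.0 - p))

# Each link is the inverse of its G, so G(G^{-1}(p)) = p.
for p in (0.01, 0.5, 0.9):
    print(logistic(logit(p)), gumbel_min(cloglog(p)))
```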
For data without covariates, Poisson's law of rare events is well known: if $P(Y_i = 1) = \lambda/m + o(m^{-1})$, then the probability distribution of $\sum_{i=1}^m Y_i$ converges to the Poisson distribution with mean parameter $\lambda$. From this observation, for highly imbalanced data, it is natural to consider that the true parameter $(a, b)$ in (1) depends on $m$, say $(a_m, b_m)$, and $G(a_m) \to 0$ as $m \to \infty$.

Owen (2007) showed that the maximum likelihood estimator of the logistic regression model converges to that of an exponential family if $\sum_{i=1}^m Y_i$ is fixed and $m$ goes to infinity. This result is roughly derived as follows. Consider the model (1) with the logistic distribution $G(z) = e^z/(1+e^z)$. Take $a_m(\alpha) = -\log m + \alpha$ and $b_m(\beta) = \beta$ for any fixed $\alpha$ and $\beta$. Then we obtain
$$P(Y_i = 1 \mid X_i, a_m(\alpha), b_m(\beta)) = \frac{e^{-\log m + \alpha + \beta^T X_i}}{1 + e^{-\log m + \alpha + \beta^T X_i}} = \frac{e^{\alpha + \beta^T X_i}}{m} + o(m^{-1}) \qquad (2)$$
as $m \to \infty$. By Bayes' theorem, the conditional density of $X_i$ given $Y_i = 1$ with respect to the distribution $F(dX_i)$ is, at least formally,
$$\frac{e^{\beta^T X_i}}{\int e^{\beta^T x}\,F(dx)} + o(1). \qquad (3)$$
This is an exponential family with the sufficient statistic $x_i$, and Owen's result follows.

Remark 1.
To be precise, Owen (2007) proved the convergence result under a different setting from ours. He assumed that the true conditional distribution of $X_i$ given $Y_i = j$, $j \in \{0, 1\}$, is some distribution $F_j$. In our setting, $F_0$ is asymptotically equal to $F$, and the density of $F_1$ with respect to $F_0$ should satisfy (3). In other words, our setting becomes misspecified unless this equality is satisfied. We discuss this point again in Section 5.

Warton & Shepherd (2010) pointed out that the likelihood of logistic regression converges to a Poisson point process model with a specific form of intensity. Indeed, by (2), the probability $P(Y_i = 1, X_i \in A)$ is approximately $m^{-1}\int_A e^{\alpha+\beta^T x}\,F(dx)$ for any compact subset $A$ of $\mathbb{R}^p$. Therefore, by Poisson's law of rare events, the number of observations $X_i$ for which $X_i \in A$ and $Y_i = 1$ is approximately distributed according to the Poisson distribution with mean $\int_A e^{\alpha+\beta^T x}\,F(dx)$. This is the Poisson point process with the intensity measure $e^{\alpha+\beta^T x}\,F(dx)$.

In this paper, we consider the limit of various binomial regression models other than the logistic model. As expected from the result on logistic regression, the limit becomes a Poisson point process. A remarkable fact we prove is that the intensity measure of the point process should be a $q$-exponential family for some real number $q$. The $q$-exponential family, also called the deformed exponential family or $\alpha$-family, has recently been much investigated in the literature of statistical physics and information geometry; see e.g. Amari (1985), Amari & Nagaoka (2000), Amari & Ohara (2011), Naudts (2002), Naudts (2010), and Tsallis (1988). The precise definition is given in Section 2. The proof relies on the theory of extreme values. For example, for the probit or complementary log-log link functions, the limit of binomial regression is the usual exponential family, as with the logit link.
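The logistic limit (2) and the Poisson approximation above are easy to check by simulation. The following is a minimal Monte Carlo sketch under assumed values of $\alpha$, $\beta$ and $F = \mathrm{Uniform}(0,1)$; it is not code from the paper:

```python
# Monte Carlo check of the limit (2): with a_m = -log m + alpha and
# b_m = beta, the number of successes sum_i Y_i should be approximately
# Poisson with mean Lambda = e^alpha * int_0^1 e^{beta x} dx.
import math, random

random.seed(0)
alpha, beta = 0.5, 1.0
m, reps = 10_000, 200           # sample size and Monte Carlo replications

def G(z):                       # logistic distribution function
    return 1.0 / (1.0 + math.exp(-z))

a_m = -math.log(m) + alpha      # intercept drifting to -infinity with m

counts = []
for _ in range(reps):
    n = sum(1 for _ in range(m)
            if random.random() < G(a_m + beta * random.uniform(0.0, 1.0)))
    counts.append(n)

# Limiting Poisson mean for F = Uniform(0, 1):
lam = math.exp(alpha) * (math.exp(beta) - 1.0) / beta
print(sum(counts) / reps, lam)  # empirical mean vs. limiting intensity
```

The empirical mean of the success counts should be close to the limiting intensity, and the counts themselves close to Poisson distributed.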
On the other hand, if $G$ is the Cauchy distribution, then the limit becomes a $q$-exponential family with $q = 2$; if the uniform distribution is used, $q = 0$.

As related work, Ding et al. (2011) introduced the $t$-logistic regression, which uses the $q$-exponential family for the binary response, where $q = t$. In Section 3, we show that the $t$-logistic regression converges to the $q$-exponential family with $q = \max(t, 0)$. In Section 4, we propose a penalized maximum likelihood estimator for the $q$-exponential family of intensity measures. For some special cases, the estimator reduces to a known admissible estimator for the Poisson mean parameter; see Ghosh & Yang (1988). Some related problems are discussed in Section 5.

2 The limit theorem

For each real number $q$, define the $q$-exponential function by
$$\exp_q(z) = \begin{cases} e^z, & \text{if } q = 1, \\ [1 + (1-q)z]_+^{1/(1-q)}, & \text{if } q \ne 1, \end{cases} \qquad (4)$$
where $[z]_+ = \max(z, 0)$ and $[0]^c = \infty$ for $c < 0$. This is the inverse of the Box-Cox transformation. Note that $\exp_q(z) = \infty$ for $z \ge -1/(1-q)$ if $q > 1$. The function $\exp_q(z)$ is convex if and only if $q \ge 0$. We put the following assumption on $G$.

Assumption 1. There exist a real number $q$ and sequences $c_m \in \mathbb{R}$ and $d_m > 0$ such that
$$G(c_m + d_m z) = \frac{1}{m}\exp_q(z) + o(m^{-1}) \qquad (5)$$
as $m \to \infty$ for each $z \in \mathbb{R}$.

In extreme value theory, it is known that there is no asymptotic form other than (5) as long as one exists; see e.g. de Haan & Ferreira (2006, Theorems 1.1.2 and 1.1.3). The number $q$ controls the lower tail structure of $G$. For example, the logistic distribution satisfies Assumption 1 with $q = 1$, $c_m = -\log m$ and $d_m = 1$. Other examples, including the normal and Cauchy distributions, are considered in Section 3.

We define
$$a_m(\alpha) = c_m + d_m \alpha \quad \text{and} \quad b_m(\beta) = d_m \beta \qquad (6)$$
for $(\alpha, \beta) \in \mathbb{R} \times \mathbb{R}^p$ by using the sequences $c_m$ and $d_m$ that satisfy (5). Denote the probability law of $\{(X_i, Y_i)\}_{i=1}^m$ under the true parameter $(a_m(\alpha), b_m(\beta))$ by $P_{m,\alpha,\beta}$.

Now an asymptotic form like (2) follows from the assumption. Indeed,
$$P_{m,\alpha,\beta}(Y_i = 1 \mid X_i) = G(a_m(\alpha) + b_m(\beta)^T X_i) = G(c_m + d_m(\alpha + \beta^T X_i)) = \frac{1}{m}\exp_q(\alpha + \beta^T X_i) + o(m^{-1}).$$
Therefore, as in logistic regression, we expect that the binomial regression model with $G$ converges to the Poisson point process under Assumption 1. We give a lemma before the main result.

Lemma 1.
Let $(\alpha, \beta) \in \mathbb{R} \times \mathbb{R}^p$, and let $A$ be any compact subset of $\mathbb{R}^p$ such that the function $\exp_q(\alpha + \beta^T x)$ is finite over $x \in A$. Then the following equation holds:
$$P_{m,\alpha,\beta}(Y_i = 1, X_i \in A) = \frac{\lambda(A)}{m} + o(m^{-1}), \qquad (7)$$
where $\lambda(A) = \int_A \exp_q(\alpha + \beta^T x)\,F(dx)$.

The proof of Lemma 1 is given in the Appendix.

Theorem 1.
Denote the observations $X_i$ for which $Y_i = 1$ by $\{x_i\}_{i=1}^n$. Then, under $P_{m,\alpha,\beta}$, the set $\{x_i\}_{i=1}^n$ converges in law to the Poisson point process with the intensity measure
$$\lambda(dx) = \exp_q(\alpha + \beta^T x)\,F(dx) \qquad (8)$$
as $m \to \infty$. More precisely, we have
$$\lim_{m\to\infty} P_{m,\alpha,\beta}\bigl(\#\{i \mid x_i \in A_j\} = n_j,\ j = 1, \dots, J\bigr) = \prod_{j=1}^J \frac{\lambda(A_j)^{n_j}\, e^{-\lambda(A_j)}}{n_j!} \qquad (9)$$
for any positive integer $J$, non-negative integers $n_j$ and mutually disjoint compact subsets $A_j$ of $\mathbb{R}^p$ such that $\exp_q(\alpha + \beta^T x)$ is finite over $x \in A_j$.

Equation (9) is consistent with the definition of weak convergence of point processes; see Embrechts et al. (1997).

Proof of Theorem 1.
Define
$$x(A_j) = \#\{i \in \{1,\dots,n\} \mid x_i \in A_j\} = \#\{i \in \{1,\dots,m\} \mid (X_i, Y_i) \in A_j \times \{1\}\}.$$
Since $\{(X_i, Y_i)\}_{i=1}^m$ is an independent and identically distributed sequence, the random vector $(x(A_1), \dots, x(A_J))$ for the disjoint compact subsets $\{A_j\}_{j=1}^J$ is distributed as a multinomial distribution. Then, by Lemma 1 and Poisson's law of rare events, $(x(A_1), \dots, x(A_J))$ converges to independent Poisson random variables with intensities $(\lambda(A_1), \dots, \lambda(A_J))$. The proof is completed.

By Theorem 1, the logistic regression model converges to the Poisson point process model with intensity $\exp(\alpha + \beta^T x)\,F(dx)$, as Warton & Shepherd (2010) showed.

Definition 1.
For each $q \in \mathbb{R}$, we call the set of intensity measures (8) the $q$-exponential family of intensity measures. Denote the law of the process $\{x_i\}_{i=1}^n$ with respect to (8) by $P^{(q)}_{\alpha,\beta}$.

The $q$-exponential family of intensity measures is closely related to the $q$-exponential family of probability measures as follows. Denote the total intensity by
$$\Lambda_q(\alpha,\beta) = \int_{\mathbb{R}^p} \exp_q(\alpha + \beta^T x)\,F(dx). \qquad (10)$$
Assume $\Lambda_q(\alpha,\beta) < \infty$. Then the likelihood of $P^{(q)}_{\alpha,\beta}$ is
$$\frac{e^{-\Lambda_q(\alpha,\beta)}}{n!} \prod_{i=1}^n \exp_q(\alpha + \beta^T x_i), \qquad (11)$$
where the base measure of $n$ is the counting measure on $\{0, 1, 2, \dots\}$ and the base measure of $x_i$ for each $i$ is the distribution $F(dx_i)$. In (11), the number $n$ of observed points is marginally distributed according to the Poisson distribution with intensity $\Lambda_q(\alpha,\beta)$. Each point $x_i$ is independently distributed according to the $q$-exponential family defined by the probability density function
$$\frac{\exp_q(\alpha + \beta^T x_i)}{\Lambda_q(\alpha,\beta)} \qquad (12)$$
with respect to $F(dx)$. The $q$-exponential family is also called the deformed exponential family or the $\alpha$-family; see Amari & Nagaoka (2000) for the $\alpha$-family, where $\alpha = 2q - 1$. It is known that the density (12) can also be written as $\exp_q(\theta^T x_i - \psi_q(\theta))$ with appropriate $\theta$ and $\psi_q(\theta)$; see e.g. Amari & Ohara (2011). However, we do not use this parametrization, since the quantity $\Lambda_q(\alpha,\beta)$ remains in the whole likelihood (11).

We conjecture that the maximum likelihood estimator of the binomial regression model $P_{m,\alpha,\beta}$ converges to that of the Poisson process model $P^{(q)}_{\alpha,\beta}$ under mild conditions. However, we only give experimental results in Section 3. Instead, we study the estimation problem of the limit model $P^{(q)}_{\alpha,\beta}$ in Section 4. See also Section 5 for further discussion.
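The quantities (4), (10) and (12) are straightforward to compute when the covariate distribution has finite support. The following sketch is illustrative rather than code from the paper; the three-point distribution is a hypothetical example:

```python
# Illustrative implementation of the q-exponential (4), the total
# intensity (10), and the normalized density (12) for a covariate
# distribution F supported on finitely many points.
import math

def exp_q(z, q):
    """q-exponential: e^z for q = 1, [1 + (1-q) z]_+^{1/(1-q)} otherwise."""
    if q == 1.0:
        return math.exp(z)
    base = 1.0 + (1.0 - q) * z
    if base > 0.0:
        return base ** (1.0 / (1.0 - q))
    return 0.0 if q < 1.0 else math.inf  # [0]^c = infinity for c < 0

# Hypothetical F: uniform weights on a three-point support
support = [0.0, 0.5, 1.0]
weights = [1 / 3, 1 / 3, 1 / 3]

def total_intensity(alpha, beta, q):
    """Lambda_q(alpha, beta) = sum_x exp_q(alpha + beta x) F(x), cf. (10)."""
    return sum(w * exp_q(alpha + beta * x, q) for x, w in zip(support, weights))

def density(x, alpha, beta, q):
    """Density (12) of a single point with respect to F."""
    return exp_q(alpha + beta * x, q) / total_intensity(alpha, beta, q)

# q = 1 is the ordinary exponential family; q = 2 and q = 0 appear as the
# Cauchy and uniform cases in Section 3.
for q in (0.0, 1.0, 2.0):
    print(q, total_intensity(0.1, 0.3, q))
```

By construction, the density (12) integrates to one against $F$ for any $q$ for which the intensity is finite on the support.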
3 Examples

In this section, we give some examples of distributions $G$ satisfying Assumption 1, together with experimental results on maximum likelihood estimation.

Even if $G$ satisfies Assumption 1, the sequences $c_m$ and $d_m$ are not uniquely determined. A unified choice is known (see Galambos (1987, Theorems 2.1.4-2.1.6)). However, in the following examples, one of the possible pairs $(c_m, d_m)$ is given explicitly for each case.

For the logistic distribution and the Gumbel distribution $G(z) = 1 - \exp(-e^z)$ on minimum values, we have
$$q = 1, \quad c_m = -\log m, \quad d_m = 1. \qquad (13)$$
For the standard normal distribution, we have
$$q = 1, \quad c_m = -(2\log m)^{1/2} + \frac{\log(\log m) + \log(4\pi)}{2(2\log m)^{1/2}}, \quad d_m = (2\log m)^{-1/2}. \qquad (14)$$
See e.g. Galambos (1987, Section 2.3.2). For the Cauchy distribution, we have
$$q = 2, \quad c_m = -m/\pi, \quad d_m = m/\pi. \qquad (15)$$
For other examples, such as the $t$- and Pareto distributions, refer to Galambos (1987) and Embrechts et al. (1997).

We briefly study the $t$-logistic regression proposed by Ding et al. (2011). For each real number $t$, let $G_t(z) = \exp_t(z - \gamma_t(z))$, where $\exp_t$ denotes the $q$-exponential function with $q = t$ and $\gamma_t(z)$ is uniquely determined by
$$\exp_t(z - \gamma_t(z)) + \exp_t(-\gamma_t(z)) = 1. \qquad (16)$$
We call $G_t(z)$ the $t$-logistic distribution. Uniqueness of $\gamma_t(z)$ follows from the strict monotonicity of the $q$-exponential function. The distribution $G_t(z)$ is symmetric in the sense that $G_t(-z) = 1 - G_t(z)$, since $\gamma_t(-z) = -z + \gamma_t(z)$ by (16). We obtain the following theorem. The proof is given in the Appendix.

Theorem 2. The $t$-logistic distribution $G_t$ satisfies Assumption 1 with $q = \max(t, 0)$.

We compare the maximum likelihood estimates of the binomial regression models with those of the limiting Poisson point process model. The sample is
$$(X_i, Y_i) = \begin{cases} (s_i,\ 1) & \text{if } i \in \{1, \dots, n\}, \\ \bigl((i-n-1)/(m-n-1),\ 0\bigr) & \text{if } i \in \{n+1, \dots, m\}, \end{cases} \qquad (17)$$
where $s_1, \dots, s_n$ are equally spaced points, for $n = 10$ and various $m$'s. For the binomial regression models, the estimated regression coefficient $(\hat a, \hat b)$ is normalized by (6). From Table 1, the convergence for the probit link is very slow, or the estimates may not converge at all. For the others, the rate is satisfactory.

Table 1: Comparison of the maximum likelihood estimates of the Poisson point process model with $q = 1$ and of the binomial regression models. The logit, probit and cloglog (complementary log-log) link functions are used. The sample is (17) and $n$ is fixed to 10. The normalizing sequences $(c_m, d_m)$ are (13) and (14). Columns: $m$; $(\hat\alpha, \hat\beta)$ for the Poisson process and for each link. [Numerical entries omitted.]

4 Estimation for the q-exponential family of intensity measures

We deal with the estimation problem for the $q$-exponential family of intensity measures (8). The maximum likelihood estimator is likely to fail to exist for small sample size $n$, so we propose a penalized maximum likelihood estimator. We put the following assumption for simplicity.

Assumption 2.
The covariate distribution $F(dx)$ is known. The support of $F$, denoted by $S(F)$, is finite and is not included in any hyperplane in $\mathbb{R}^p$. The observed data $\{x_i\}_{i=1}^n$ belong to $S(F)$.

In practice, $F(dx)$ may be replaced with the empirical, or estimated, distribution based on the covariate sample $\{X_i\}_{i=1}^m$ of the original regression problem.

The parameter space is
$$\Theta = \{(\alpha, \beta) \mid 1 + (1-q)(\alpha + \beta^T x) > 0 \ \text{for all } x \in S(F)\}. \qquad (18)$$
The set $\Theta$ is convex and unbounded, since it is an intersection of half spaces and includes the set $\{(\alpha, 0) \mid 1 + (1-q)\alpha > 0\}$. Furthermore, $\Theta$ is open since $S(F)$ is compact. In terms of convex analysis, $\Theta$ corresponds to the polar set of $S(F)$; see Barvinok (2002).

Table 2: Comparison of the maximum likelihood estimates of the Poisson point process model with $q = 2$ and of the binomial regression model with the cauchit (inverse Cauchy) link function. The sample is (17) and $n$ is fixed to 10. The normalizing sequence $(c_m, d_m)$ is (15). Columns: $m$; $(\hat\alpha, \hat\beta)$ for each model. [Numerical entries omitted.]

We propose the penalized log-likelihood function
$$-\Lambda_q(\alpha,\beta) + \sum_{i=1}^n \log \exp_q(\alpha + \beta^T x_i) + \kappa \int \log \exp_q(\alpha + \beta^T x)\,F(dx), \qquad (19)$$
where $\kappa$ is a non-negative regularization parameter. If $\kappa = 0$, (19) is the log-likelihood function; see (11). The penalty term represents pseudo-data of size $\kappa$ distributed according to $F$. The function (19) is concave with respect to $(\alpha,\beta)$ if $0 \le q \le 1$. Indeed, one can directly confirm that $-\exp_q(z)$ is concave if $q \ge 0$ and that $\log \exp_q(z)$ is concave if $q \le 1$.

Definition 2. We call the maximizer of (19) the additive-smoothing estimator.

This estimator has a desirable property, as shown in the following example, even if $q \ne 1$.

Example 1.
Let $F$ be a two-point distribution on $\mathbb{R}$ defined by $F(x = 0) = p_0$ and $F(x = 1) = p_1$, where $p_0, p_1 > 0$ and $p_0 + p_1 = 1$. Denote the intensities at $x = 0$ and $x = 1$ by $\lambda_0 = p_0 \exp_q(\alpha)$ and $\lambda_1 = p_1 \exp_q(\alpha + \beta)$, respectively. It is not difficult to show that $(\alpha, \beta) \in \Theta$ corresponds one-to-one with $(\lambda_0, \lambda_1) \in \mathbb{R}_+^2$, where $\mathbb{R}_+$ is the set of positive numbers. Hence the model is equivalent to the independent Poisson observation model with intensities $(\lambda_0, \lambda_1)$, regardless of $q$. The penalized log-likelihood (19) then becomes
$$-\lambda_0 - \lambda_1 + n_0 \log \lambda_0 + n_1 \log \lambda_1 + \kappa\Bigl(p_0 \log \frac{\lambda_0}{p_0} + p_1 \log \frac{\lambda_1}{p_1}\Bigr),$$
where $n_j$ denotes the number of observations with $x_i = j$, $j \in \{0, 1\}$. The additive-smoothing estimator is $\hat\lambda_j = n_j + \kappa p_j$, $j \in \{0, 1\}$. If $\kappa > 0$, then $(\hat\lambda_0, \hat\lambda_1) \in \mathbb{R}_+^2$ and the estimator $(\hat\alpha, \hat\beta)$ always exists. Furthermore, if $0 < \kappa \le 1$, this estimator is known to be admissible with respect to the Kullback-Leibler loss function; see Ghosh & Yang (1988, Theorem 1). For the same reason, if $S(F)$ has only $p + 1$ points in $\mathbb{R}^p$, then the additive-smoothing estimator is admissible as long as $0 < \kappa \le 1$.

Let $q = 1$ and let $F$ be any distribution satisfying Assumption 2. Then, since the model (11) is an exponential family, the pair $(n, \bar x_n)$ is a sufficient statistic, where $\bar x_n = n^{-1}\sum_{i=1}^n x_i$ is the sample mean. Indeed, the additive-smoothing estimator should satisfy
$$\Lambda_1(\hat\alpha, \hat\beta) = n + \kappa \quad \text{and} \quad \frac{\int x\, e^{\hat\beta^T x}\,F(dx)}{\int e^{\hat\beta^T x}\,F(dx)} = \frac{n \bar x_n + \kappa \int x\,F(dx)}{n + \kappa}. \qquad (20)$$
For the maximum likelihood estimator, meaning $\kappa = 0$, the second equation of (20) is consistent with the result of Owen (2007). From the theory of exponential families, the solution to (20) always exists if $\kappa > 0$, since $\int x\,F(dx)$ belongs to the interior of the convex hull of $S(F)$; see Barndorff-Nielsen (1978, Corollary 9.6). On the other hand, the maximum likelihood estimator fails to exist if $\bar x_n$ is a boundary point.

For $q \ne 1$, we provide a similar result on existence; the pair $(n, \bar x_n)$ is no longer a sufficient statistic. First consider the following example.

Example 2.
Let $q = 0$ and let $F$ be a three-point distribution on $\mathbb{R}$ defined by $F(x = j) = 1/3$, $j \in \{0, 1, 2\}$. Denote the number of observations with $x_i = j$ by $n_j$. We use $\theta = 1 + \alpha$ and $\varphi = 1 + \alpha + 2\beta$ as a new parameter. Then the parameter space is $\theta > 0$, $\varphi > 0$. The penalized log-likelihood is
$$-\frac{\theta + \varphi}{2} + n_0^* \log\theta + n_1^* \log\frac{\theta + \varphi}{2} + n_2^* \log\varphi, \qquad (21)$$
where $n_j^* = n_j + \kappa/3$. The maximizer $(\hat\theta, \hat\varphi)$ of (21) is
$$\hat\theta = \frac{2 n_0^* (n_0^* + n_1^* + n_2^*)}{n_0^* + n_2^*} \quad \text{and} \quad \hat\varphi = \frac{2 n_2^* (n_0^* + n_1^* + n_2^*)}{n_0^* + n_2^*}.$$
This always belongs to the parameter space if $\kappa > 0$. On the other hand, the maximum likelihood estimator fails to exist if $n_0 = 0$ or $n_2 = 0$.

In general, the following theorem holds. The proof is given in the Appendix.

Theorem 3.
Let $q$ be any real number and $\kappa > 0$. If Assumption 2 is satisfied, then the additive-smoothing estimator exists almost surely. It is unique if $0 \le q \le 1$.

5 Discussion

5.1 Multinomial regression

So far we have studied binomial regression. There are variants of multinomial regression models. The multinomial $t$-logistic regression proposed by Ding et al. (2011) can be proved to have a limit under imbalanced asymptotics in the same manner as Theorem 2. The author is not aware of more general results; the problem is left as future work.

5.2 Convergence of estimators

We did not study convergence properties of estimators such as the maximum likelihood estimator. Instead we considered the additive-smoothing estimator for the $q$-exponential family of intensity measures in Section 4.

Owen (2007) showed that the maximum likelihood estimator of logistic regression converges to that of the exponential family under imbalanced asymptotics. A natural conjecture is then that the maximum likelihood estimator of the binomial regression model, which is the maximizer of
$$\sum_{i=1}^m \bigl[Y_i \log G(a + b^T X_i) + (1 - Y_i)\log\{1 - G(a + b^T X_i)\}\bigr],$$
converges to that of the $q$-exponential family. Note that estimation of $(a, b)$ is equivalent to that of $(\alpha, \beta)$ via the formula (6). It will also be meaningful to study convergence of statistical experiments; see van der Vaart (1998) for the terminology.

An estimator corresponding to the additive-smoothing estimator of Definition 2 is the maximizer of
$$\sum_{i=1}^m \bigl[Y_i \log G(a + b^T X_i) + (1 - Y_i)\log\{1 - G(a + b^T X_i)\}\bigr] + \frac{\kappa}{m}\sum_{i=1}^m \log\{m\,G(a + b^T X_i)\},$$
since the additional term converges to $\kappa \int \log \exp_q(\alpha + \beta^T x)\,F(dx)$ after the normalization (6). This estimator is expected to converge as well.

5.3 Misspecification

We studied asymptotic properties of the binomial regression model under the assumption that the model (1) is true. On the other hand, Owen (2007) put a different assumption, in that the true conditional distribution of the covariate $X_i$ given $Y_i = j$, $j \in \{0, 1\}$, is fixed to some distribution $F_j$. Under this assumption, our setting is asymptotically described as $F_0(dx) = F(dx)$ and $F_1(dx) = \{\exp_q(\alpha + \beta^T x)/\Lambda_q(\alpha,\beta)\}\,F(dx)$ by (11). In other words, if the true distributions $F_j$ do not satisfy this relation, the model is misspecified. It is important to consider robustness of estimators under the misspecified assumption. The problem is not so serious if the support of $F_1$ is included in that of $F$, since then $F_1$ is absolutely continuous with respect to the estimated intensity measure $\exp_q(\hat\alpha + \hat\beta^T x)\,F(dx)$ whenever $(\hat\alpha, \hat\beta)$ belongs to the parameter space (18). Otherwise, however, $F_1$ is not absolutely continuous; in other words, the estimated intensity measure excludes regions into which future data $x_{n+1}$ may fall. In particular, if the support of $F_1$ is not known a priori, there is a risk of such a contradiction.

One may consider taking a distribution $F$ with full support $\mathbb{R}^p$ in order to contain the support of $F_1$. However, if $q \ne 1$, we cannot assume such a distribution $F$, since the parameter space (18) then reduces to $\{(\alpha, 0) \mid 1 + (1-q)\alpha > 0\}$. A solution to this problem will be to use a parametric family of $F$ together with a Bayesian prior distribution. For example, let $F(dx) = F(dx \mid \theta)$ be the uniform distribution on the hypercube $[-\theta, \theta]^p$, and assume a prior density on $\theta > 0$. As long as the true $F_1(dx)$ has compact support, we have a chance to detect it, since there is a sufficiently large $\theta$ such that the support of $F_1$ is included in that of $F(\cdot \mid \theta)$.

5.4 Bayesian prediction

In the preceding subsection, we considered the Bayesian approach for treating the misspecified case. Even if the model is correctly specified, the approach will be fruitful.

In Section 4, we considered the additive-smoothing estimator of $(\alpha, \beta)$. It can be considered a maximum-a-posteriori estimator if the prior density
$$\pi(\alpha, \beta) = \exp\Bigl(\kappa \int \log \exp_q(\alpha + \beta^T x)\,F(dx)\Bigr)$$
is adopted. Then an additive-smoothing Bayesian prediction can also be defined by the same prior.

In Example 1, we noted that, for special cases of $F$ and $\kappa$, the additive-smoothing estimator becomes an admissible estimator with respect to the Kullback-Leibler divergence, as shown by Ghosh & Yang (1988). For the prediction problem, a class of admissible predictive densities is investigated by Komaki (2004). Together with the additive-smoothing estimator, decision-theoretic properties of additive-smoothing prediction are of interest.

Acknowledgement
The author thanks Saki Saito for helpful discussions in the exploratory stage.
A Appendix
A.1 Proof of Lemma 1
Denote the induced probability distribution of $t = \alpha + \beta^T X_i$ by $F^*(dt)$, and let $A^* = \{\alpha + \beta^T x \mid x \in A\}$. Then $A^*$ is compact since $A$ is. We have
$$P_{m,\alpha,\beta}(Y_i = 1, X_i \in A) = \int_A G(a_m(\alpha) + b_m(\beta)^T x)\,F(dx) = \int_A G(c_m + d_m(\alpha + \beta^T x))\,F(dx) = \int_{A^*} G(c_m + d_m t)\,F^*(dt).$$
To prove (7), it is enough to show that
$$\int_{A^*} G(c_m + d_m t)\,F^*(dt) = \frac{1}{m}\int_{A^*} \exp_q(t)\,F^*(dt) + o(m^{-1}).$$
By Assumption 1, we know $m G(c_m + d_m t) = \exp_q(t) + o(1)$ for each $t \in A^*$. Hence it is enough to show that $m G(c_m + d_m t)$ converges to $\exp_q(t)$ uniformly in $t \in A^*$. However, since $m G(c_m + d_m t)$ is monotone in $t$ and $\exp_q(t)$ is continuous in $t \in A^*$, uniform convergence follows from a general argument; see e.g. Galambos (1987, Lemma 2.10.1).

A.2 Proof of Theorem 2
For each real number $q$, denote by $D_q$ the set of distributions that satisfy Assumption 1.

For $t = 1$, an elementary calculation shows that $G_1(z) = e^z/(1 + e^z)$. This is the logistic distribution and belongs to $D_1$.

For $t = 0$, we have $G_0(z) = (1 + z)/2$ for $-1 < z < 1$. This is the uniform distribution on $[-1, 1]$ and belongs to $D_0$.

Let $t > 1$. It suffices to show that
$$G_t(z) = [(1-t)z]^{1/(1-t)} + o((-z)^{1/(1-t)}), \quad z \to -\infty.$$
Indeed, by the condition (16), if $z \to -\infty$, then $\gamma_t(z) \to 0$. Thus
$$\exp_t(z - \gamma_t(z)) = \exp_t(z + o(z)) = [1 + (1-t)(z + o(z))]^{1/(1-t)} = [(1-t)z]^{1/(1-t)} + o((-z)^{1/(1-t)}).$$
Hence $G_t$ belongs to $D_t$.

For $t < 1$, we first show that the support of $G_t$ has the infimum $z^* = -1/(1-t)$ and that $\gamma_t(z)$ tends to $0$ as $z \to z^* + 0$. Note that the $t$-exponential function $\exp_t(z)$ is continuous in $z \in \mathbb{R}$, strictly increasing over $z > z^*$, and remains $0$ over $z \le z^*$. Since $\exp_t(z) > 1$ for $z > 0$, it must be that $\gamma_t(z) \ge 0$ for all $z \in \mathbb{R}$ by (16). Then $\exp_t(z - \gamma_t(z)) = 0$ for $z \le z^*$. Conversely, if $z > z^*$, it must be that $\exp_t(z - \gamma_t(z)) > 0$. Indeed, if $\exp_t(z - \gamma_t(z)) = 0$, then $\gamma_t(z) = 0$ by (16), but this contradicts $z > z^*$. To prove $\gamma_t(z) \to 0$ as $z \to z^* + 0$, due to (16), it is sufficient to show that $\exp_t(z - \gamma_t(z)) \to 0$ as $z \to z^* + 0$. This is shown as
$$0 \le \exp_t(z - \gamma_t(z)) \le \exp_t(z) \to 0, \quad z \to z^* + 0.$$
Let $0 < t < 1$, so that $z^* = -1/(1-t)$. It suffices to show that
$$G_t(z) = [(1-t)(z - z^*)]^{1/(1-t)} + o((z - z^*)^{1/(1-t)}), \quad z \to z^* + 0. \qquad (22)$$
By the definition of $z^*$, we have
$$\exp_t(z - \gamma_t(z)) = [1 + (1-t)(z - \gamma_t(z))]^{1/(1-t)} = [(1-t)(z - z^* - \gamma_t(z))]^{1/(1-t)}. \qquad (23)$$
On the other hand, since $\gamma_t(z) \to 0$ as $z \to z^* + 0$, we obtain
$$\exp_t(-\gamma_t(z)) = 1 - \gamma_t(z) + o(\gamma_t(z)). \qquad (24)$$
By substituting the two equations into (16), we obtain $\gamma_t(z) = O((z - z^*)^{1/(1-t)}) = o(z - z^*)$. Then (23) implies (22). Hence $G_t$ belongs to $D_t$.

Finally, let $t < 0$, so that $z^* = -1/(1-t)$. We show that $G_t$ belongs to $D_0$, not $D_t$. It suffices to show that
$$G_t(z) = (z - z^*) + o(z - z^*), \quad z \to z^* + 0. \qquad (25)$$
For the same reason as in the case $0 < t < 1$, we have the two equations (23) and (24). By substituting them into (16), we obtain
$$\gamma_t(z) = (z - z^*) - \frac{(z - z^*)^{1-t}}{1-t} + o((z - z^*)^{1-t}).$$
Then (23) implies (25). Hence $G_t$ belongs to $D_0$.

A.3 Proof of Theorem 3
Uniqueness follows from concavity of (19) for $0 \le q \le 1$. We prove the existence result. Since the case $q = 1$ was treated in (20), we assume $q \ne 1$. In the following, we prove the theorem only for the case $n = 0$, that is, when no data point is observed; the case $n \ge 1$ is proved in the same way, since the data $\{x_i\}_{i=1}^n$ is contained in the convex hull of the support of $F$.

Let $F$ be a discrete distribution with support $\{\xi_j\}_{j=1}^J \subset \mathbb{R}^p$ and put $p_j = F(x = \xi_j) > 0$ for $j \in \{1, \dots, J\}$. By assumption, $\{\xi_j\}_{j=1}^J$ is not included in any hyperplane of $\mathbb{R}^p$. The parameter space (18) is written as
$$\Theta = \{(\alpha, \beta) \mid 1 + (1-q)(\alpha + \beta^T \xi_j) > 0, \ j \in \{1, \dots, J\}\}.$$
Note that $(\alpha, \beta) = (0, 0)$ always belongs to $\Theta$. The penalized log-likelihood is, since $n = 0$,
$$L(\alpha, \beta) = \sum_{j=1}^J p_j \bigl\{-\exp_q(\alpha + \beta^T \xi_j) + \kappa \log \exp_q(\alpha + \beta^T \xi_j)\bigr\}. \qquad (26)$$
By continuity of $L(\alpha, \beta)$ over $\Theta$, it is sufficient to show that $L(\alpha, \beta) \to -\infty$ if $(\alpha, \beta)$ tends to a boundary point of $\Theta$ or diverges. Note that if $(\alpha_0, \beta_0)$ is a boundary point of $\Theta$, then $(t\alpha_0, t\beta_0)$ belongs to $\Theta$ for any $0 \le t < 1$. We treat the cases $q < 1$ and $q > 1$ separately.

First let $q < 1$ and fix any boundary point $(\alpha_0, \beta_0)$ of $\Theta$. Then there is at least one $\xi_j$ such that $\exp_q(\alpha_0 + \beta_0^T \xi_j) = 0$. For such $\xi_j$'s, $\exp_q(t(\alpha_0 + \beta_0^T \xi_j)) \to +0$ as $t \to 1 - 0$. For the other $\xi_j$'s, $\exp_q(t(\alpha_0 + \beta_0^T \xi_j))$ is bounded as $t \to 1 - 0$. Therefore, by (26), $L(t\alpha_0, t\beta_0)$ tends to $-\infty$ as $t \to 1 - 0$. Next, still with $q < 1$, take any $(\alpha_0, \beta_0) \in \Theta \setminus \{(0, 0)\}$ such that $(t\alpha_0, t\beta_0) \in \Theta$ for any $t > 0$. Then it is necessary that $\alpha_0 + \beta_0^T \xi_j \ge 0$ for all $j$. Since $\{\xi_j\}$ is not contained in a hyperplane, there is at least one $\xi_j$ such that $\alpha_0 + \beta_0^T \xi_j > 0$. For such $\xi_j$'s, $\exp_q(t\alpha_0 + t\beta_0^T \xi_j) \to \infty$ as $t \to \infty$. For the other $\xi_j$'s, $\exp_q(t\alpha_0 + t\beta_0^T \xi_j) = \exp_q(0) = 1$. Therefore, by (26), the function $L(t\alpha_0, t\beta_0)$ tends to $-\infty$ as $t \to \infty$, and the case $q < 1$ is completed.

Now let $q > 1$ and fix any boundary point $(\alpha_0, \beta_0)$ of $\Theta$. Then there is at least one $\xi_j$ such that $\exp_q(\alpha_0 + \beta_0^T \xi_j) = \infty$. For such $\xi_j$'s, $\exp_q(t(\alpha_0 + \beta_0^T \xi_j)) \to \infty$ as $t \to 1 - 0$. For the other $\xi_j$'s, $\exp_q(t(\alpha_0 + \beta_0^T \xi_j))$ is bounded as $t \to 1 - 0$. Therefore, by (26), $L(t\alpha_0, t\beta_0)$ tends to $-\infty$ as $t \to 1 - 0$. Finally, with $q > 1$, take any $(\alpha_0, \beta_0) \in \Theta \setminus \{(0, 0)\}$ such that $(t\alpha_0, t\beta_0) \in \Theta$ for any $t > 0$. Then it is necessary that $\alpha_0 + \beta_0^T \xi_j \le 0$ for all $j$. Since $\{\xi_j\}$ is not contained in a hyperplane, there is at least one $\xi_j$ such that $\alpha_0 + \beta_0^T \xi_j < 0$. For such $\xi_j$'s, $\exp_q(t\alpha_0 + t\beta_0^T \xi_j) \to +0$ as $t \to \infty$. For the other $\xi_j$'s, $\exp_q(t\alpha_0 + t\beta_0^T \xi_j) = \exp_q(0) = 1$. Therefore, by (26), the function $L(t\alpha_0, t\beta_0)$ tends to $-\infty$ as $t \to \infty$, and the case $q > 1$ is completed.

References
Amari, S. (1985). Differential-geometrical methods in statistics. Berlin: Springer.

Amari, S. & Nagaoka, H. (2000). Methods of information geometry (Translations of Mathematical Monographs). Oxford University Press.

Amari, S. & Ohara, A. (2011). Geometry of q-exponential family of probability distributions. Entropy, 1170-1185.

Barndorff-Nielsen, O. (1978). Information and exponential families. Chichester: John Wiley & Sons.

Barvinok, A. (2002). A course in convexity. American Mathematical Society.

Bolton, R. J. & Hand, D. J. (2002). Statistical fraud detection: a review. Statist. Sci., 235-249.

Chawla, N. V., Japkowicz, N. & Kolcz, A. (2004). Editorial: special issue on learning from imbalanced data sets. ACM SIGKDD Explorations Newsletter, 1-6.

de Haan, L. & Ferreira, A. (2006). Extreme value theory, an introduction. New York: Springer.

Ding, N., Vishwanathan, S. V. N., Warmuth, M. & Denchev, V. (2011). t-logistic regression. J. Mach. Learn. Res., 1-55.

Embrechts, P., Klüppelberg, C. & Mikosch, T. (1997). Modelling extremal events. Berlin: Springer.

Galambos, J. (1987). The asymptotic theory of extreme order statistics. Malabar: Robert E. Krieger Publishing Company.

Ghosh, M. & Yang, M.-C. (1988). Simultaneous estimation of Poisson means under entropy loss. Ann. Statist., 278-291.

Jin, Y., Rejesus, R. M. & Little, B. B. (2005). Binary choice models for rare events data: a crop insurance fraud application. Applied Economics, 841-848.

King, G. & Zeng, L. (2001). Logistic regression in rare events data. Political Analysis, 137-163.

Komaki, F. (2004). Prediction of independent Poisson observables. Ann. Statist., 1744-1769.

Naudts, J. (2002). Deformed exponentials and logarithms in generalized thermostatistics. Physica A, 323-334.

Naudts, J. (2010). The q-exponential family in statistical physics. J. Phys.: Conf. Ser., 012003.

Owen, A. B. (2007). Infinitely imbalanced logistic regression. J. Mach. Learn. Res., 761-773.

Tsallis, C. (1988). Possible generalization of Boltzmann-Gibbs statistics. J. Statist. Phys., 479-487.

van der Vaart, A. W. (1998). Asymptotic statistics. Cambridge University Press.

Warton, D. I. & Shepherd, L. C. (2010). Poisson point process models solve the "pseudo-absence problem" for presence-only data in ecology. Ann. Applied Statist., 1383-1402.

Wedderburn, R. W. M. (1976). On the existence and uniqueness of the maximum likelihood estimates for certain generalized linear models. Biometrika, 63.