Estimation of a Likelihood Ratio Ordered Family of Distributions -- with a Connection to Total Positivity

Alexandre Mösching and Lutz Dümbgen
University of Bern

July 23, 2020
Abstract
We consider bivariate observations $(X_1,Y_1),\ldots,(X_n,Y_n) \in \mathcal{X}\times\mathbb{R}$ with a real set $\mathcal{X}$ such that, conditional on the $X_i$, the $Y_i$ are independent random variables with distribution $P_{X_i}$, where $(P_x)_{x\in\mathcal{X}}$ is unknown. Using an empirical likelihood approach, we devise an algorithm to estimate the unknown family of distributions $(P_x)_{x\in\mathcal{X}}$ under the sole assumption that this family is increasing with respect to likelihood ratio order. We review the latter concept and realize that no further assumption such as all distributions $P_x$ having densities or having a common countable support is needed. The benefit of the stronger regularization imposed by likelihood ratio ordering over the usual stochastic ordering is evaluated in terms of estimation and predictive performances on simulated as well as real data.

Keywords: Likelihood ratio order, stochastic order, empirical likelihood, quasi-Newton method.
AMS 2000 subject classifications:
Acknowledgements.
This work was supported by the Swiss National Science Foundation. The authors are grateful to Johanna Ziegel and Alexander Jordan for stimulating discussions.
1 Introduction

Suppose that $Y_1$ and $Y_2$ are random variables with densities $f_1$ and $f_2$ with respect to some dominating measure. We then say that $Y_1$ is smaller than $Y_2$ in the likelihood ratio order if the density ratio $f_2/f_1$ is increasing (i.e. non-decreasing) on $\{f_1 + f_2 > 0\}$. This notion of order is stronger than the usual stochastic order, according to which $Y_1$ is stochastically smaller than $Y_2$ if $\mathrm{IP}(Y_1 > y) \le \mathrm{IP}(Y_2 > y)$ for all $y$. However, it is still a plausible order on distributions which appears natural on many occasions.

In discriminant analysis, for instance, one assigns an observation $y\in\mathbb{R}$ to population $f_2$ if the ratio $f_2(y)/f_1(y)$ exceeds a certain prefixed threshold. If the density ratio $f_2/f_1$ is increasing, this amounts to separating $\mathbb{R}$ into two half-lines that are then used as classification regions.

In a similar fashion, if $(f_\theta)_{\theta\in\Theta}$ is a one-parameter family of strictly positive densities with $\Theta\subset\mathbb{R}$, an extension of the Neyman–Pearson lemma states the existence of uniformly most powerful tests of $\theta\le\theta_o$ against $\theta>\theta_o$ if the ratio $f_{\theta_2}(y)/f_{\theta_1}(y)$ is increasing in $y$ for all $\theta_1\le\theta_2$, see Karlin and Rubin (1956); Lehmann and Rojo (1992). Such a one-parameter family of distributions could for example be an exponential family whose densities are of the form
$$f_\theta(y) = h(y)\exp\bigl(r(\theta)\,y - s(\theta)\bigr),$$
for some measurable function $h$, a strictly increasing function $r\colon\Theta\to\mathbb{R}$ and normalizing constants $s(\theta)$. Then the ratio $f_{\theta_2}(y)/f_{\theta_1}(y)$ is increasing in $y$, provided that $\theta_1\le\theta_2$. Furthermore, likelihood ratio ordering is also a frequent assumption or implication of models in mathematical finance, see Beare and Moon (2015); Jewitt (1991).

Thus far, estimation of distributions under a likelihood ratio ordering constraint was limited to the two-population setting. First, Dykstra et al. (1995) estimated the parameters of two multinomial distributions that are likelihood ratio ordered via a restricted maximum likelihood approach. After reparametrization, they found that the maximization problem at hand reduced to a specific bioassay problem treated by Robertson et al. (1988) which makes use of the theory of isotonic regression. They suggested that their approach generalizes to any two distributions that are absolutely continuous with respect to some dominating measure. Later, Carolan and Tebbs (2005) focused on testing procedures for the equality of two continuous distribution functions $F_1$ and $F_2$ versus the hypothesis of their likelihood ratio ordering. To this end, they made use of the equivalence between likelihood ratio ordering and the convexity of the ordinal dominance curve $\alpha\mapsto F_2\bigl(F_1^{-1}(\alpha)\bigr)$, $\alpha\in[0,1]$, when $F_2$ is absolutely continuous with respect to $F_1$. The convexity of the ordinal dominance curve was also exploited by Westling et al. (2019) to provide nonparametric maximum likelihood estimators of $F_1$ and $F_2$ under likelihood ratio order for discrete, continuous, as well as mixed continuous-discrete distributions. Their approach builds on the idea of Carolan and Tebbs (2005), where estimators are produced from the greatest convex minorant of the empirical ordinal dominance curve. However, this method still necessitates the restrictive assumption that $F_2$ is absolutely continuous with respect to $F_1$. Other relevant references on the topic are Yu et al. (2017), who treat the estimation problem with a maximum smoothed likelihood approach, requiring the choice of a kernel and bandwidth parameters, as well as Roosen and Hennessy (2004), who test for the monotone likelihood ratio assumption.

In the present paper, we consider a whole family of distributions $(P_x)_{x\in\mathcal{X}}$ on $\mathbb{R}$ indexed by a real set $\mathcal{X}$ and assume that this family is increasing with respect to likelihood ratio order as defined in Appendix A. In particular, no assumption of absolute continuity is required for our model. In Section 2, we formulate an empirical likelihood approach (Owen, 1988, 2001) to estimate the family $(P_x)_{x\in\mathcal{X}}$. We then find that the problem of maximizing the (empirical) likelihood under the likelihood ratio ordering constraint yields a finite-dimensional constrained optimization problem. To approach the unique solution, we devise an algorithm in Section 3 which adapts and extends ideas from Jongbloed (1998) and Dümbgen et al. (2006) to the present, more complex setting. It makes use of a quasi-Newton approach, and new search directions are obtained via multiple isotonic weighted least squares regressions. The proof of convergence of our algorithm is given in Section 4. Finally, in Section 5 we perform simulations to illustrate the benefits of the new estimation paradigm compared to the usual stochastic order constraint as discussed by Mösching and Dümbgen (2020) and Henzi et al. (2019).

The notion of likelihood ratio order between two random variables can be extended to a partial order on the set of all probability measures on the real line. This is explained in detail in Appendix A. That material generalizes definitions and results in Shaked and Shanthikumar (2007). We also explain the connection between two distributions being likelihood ratio ordered and the convexity of their ordinal dominance curve under weaker assumptions than in previous work, e.g. by Westling et al. (2019). Most proofs and technical details are deferred to Appendix B.
2 Modelling

For a set $\mathcal{X}\subset\mathbb{R}$, we consider a family of distributions $(P_x)_{x\in\mathcal{X}}$ on $\mathbb{R}$ and $N\ge 1$ observations $(X_1,Y_1),(X_2,Y_2),\ldots,(X_N,Y_N)\in\mathcal{X}\times\mathbb{R}$ such that, conditional on $(X_j)_{j=1}^N$, the random variables $Y_i$ are independent with
$$Y_i \mid (X_j)_{j=1}^N \sim P_{X_i}, \quad \text{for } 1\le i\le N.$$
Let $x_1<\cdots<x_n$ and $y_1<\cdots<y_m$ be the unique and sorted values of the covariates and responses, respectively. That means, $n,m\le N$ and
$$\{x_1,\ldots,x_n\} = \{X_1,X_2,\ldots,X_N\}, \qquad \{y_1,\ldots,y_m\} = \{Y_1,Y_2,\ldots,Y_N\}.$$
The absolute frequency of $(x_i,y_j)$ in the sample is given by the number
$$w_{ij} := \#\{k : (X_k,Y_k) = (x_i,y_j)\}.$$
Now we model the unknown family of distributions $(P_{x_i})_{i=1}^n$ by $(\widehat{P}_{x_i})_{i=1}^n$, where
$$\widehat{P}_{x_i} := \sum_{j=1}^m p_{ij}\,\delta_{y_j}. \qquad (2.1)$$
Here $\delta_y$ denotes Dirac measure at $y$, and $p := (p_{ij})\in[0,1]^{n\times m}$ is a parameter satisfying
$$\sum_{j=1}^m p_{ij} = 1 \quad \text{for } 1\le i\le n. \qquad (2.2)$$
To obtain an estimator of $(P_{x_i})_{i=1}^n$, one could then consider maximizing the empirical likelihood
$$L(p) := \prod_{i=1}^n\prod_{j=1}^m p_{ij}^{w_{ij}}, \qquad (2.3)$$
subject to constraint (2.2). In this simple unconstrained case, the maximum is known to be attained by $p_{ij} = w_{ij}/\sum_{j=1}^m w_{ij}$, that is, $\widehat{P}_{x_i}$ is the empirical distribution of the sample $\{Y_k : X_k = x_i\}$.

If the setting allows it, a natural constraint to set on $(P_x)_{x\in\mathcal{X}}$ is the usual stochastic order on distributions, also known as first order stochastic dominance: For $x_1,x_2\in\mathcal{X}$ such that $x_1\le x_2$, we say that $P_{x_1}$ is stochastically smaller than $P_{x_2}$ if $P_{x_1}\bigl((-\infty,y]\bigr)\ge P_{x_2}\bigl((-\infty,y]\bigr)$ for all $y\in\mathbb{R}$. As discussed by Mösching and Dümbgen (2020) and El Barmi and Mukerjee (2005), least squares estimation of $(P_{x_i})_{i=1}^n$ under the usual stochastic ordering constraint yields an estimator of the form (2.1), with probability weights $p_{ij}$ given by $q_{ij} - q_{i,j-1}$, where
$$q_{ij} := \min_{r\le i}\,\max_{s\ge i}\,\frac{\#\{k : x_r\le X_k\le x_s \text{ and } Y_k\le y_j\}}{\#\{k : x_r\le X_k\le x_s\}}, \qquad q_{i0} := 0.$$
These coefficients are computed via the pool-adjacent-violators algorithm, see Barlow et al. (1972); Robertson et al. (1988), or more efficiently via a sequential version of this procedure, see Henzi et al. (2020).

[Figure 1: Illustration of a data example with $x_i = i$ for $1\le i\le n = 6$ and $y_j = j$ for $1\le j\le m = 7$, together with the indices $m_i$, $M_i$, $n_j$, $N_j$ and $d_j$ introduced below.]

Throughout the remainder of this article, we assume that the family $(P_x)_{x\in\mathcal{X}}$ is increasing with respect to likelihood ratio order. If we use the model of $(P_{x_i})_{i=1}^n$ described in (2.1) with a parameter $p\in[0,1]^{n\times m}$ satisfying constraint (2.2), then likelihood ratio ordering of $(\widehat{P}_{x_i})_{i=1}^n$ is characterized in terms of $p$ as
$$p_{i_1 j_2}\cdot p_{i_2 j_1} \le p_{i_1 j_1}\cdot p_{i_2 j_2}, \quad \text{for all } 1\le i_1\le i_2\le n \text{ and } 1\le j_1\le j_2\le m. \qquad (2.4)$$
To estimate $(P_{x_i})_{i=1}^n$, we opt for a maximizer of the likelihood $L$ under constraints (2.2, 2.4). We refer the reader to Appendix A for a thorough discussion of likelihood ratio ordering and to Example A.10 to see how property (2.4) derives from the general definition.
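To make these objects concrete, the following short Python sketch (entirely illustrative; the toy data and the names `w` and `p_hat` are ours, not the paper's) builds the frequency matrix $w$, computes the unconstrained empirical likelihood maximizer $p_{ij} = w_{ij}/\sum_j w_{ij}$, and checks the likelihood ratio ordering property (2.4) through its two-by-two minors.

```python
import numpy as np

# Toy sample (X_k, Y_k); in the paper these are arbitrary real pairs.
X = np.array([1, 1, 2, 2, 2, 3, 3])
Y = np.array([1.0, 2.0, 1.0, 2.0, 3.0, 2.0, 3.0])

xs, x_idx = np.unique(X, return_inverse=True)   # x_1 < ... < x_n
ys, y_idx = np.unique(Y, return_inverse=True)   # y_1 < ... < y_m
n, m = len(xs), len(ys)

# Frequency matrix w_ij = #{k : (X_k, Y_k) = (x_i, y_j)}.
w = np.zeros((n, m))
np.add.at(w, (x_idx, y_idx), 1)

# Unconstrained empirical likelihood maximizer: row-wise relative frequencies.
p_hat = w / w.sum(axis=1, keepdims=True)

def is_lr_ordered(p, tol=1e-12):
    """Check (2.4): p[i1,j2]*p[i2,j1] <= p[i1,j1]*p[i2,j2] for i1<=i2, j1<=j2."""
    n, m = p.shape
    for i1 in range(n):
        for i2 in range(i1, n):
            for j1 in range(m):
                for j2 in range(j1, m):
                    if p[i1, j2] * p[i2, j1] > p[i1, j1] * p[i2, j2] + tol:
                        return False
    return True

print(p_hat)
print("LR ordered:", is_lr_ordered(p_hat))
```

In general the row-wise empirical distributions violate (2.4), which is precisely why the constrained maximizer developed next is needed.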
The problem of finding a maximizer of $L$ under constraints (2.2, 2.4) can be greatly simplified thanks to the structure of the data and the likelihood ratio order constraint itself. These aspects allow us to formulate an equivalent though much simpler problem involving fewer parameters to estimate. We start with some notation, exemplified in Figure 1. For $i\in\mathcal{I} := \{1,\ldots,n\}$ we define indices $1\le m_i\le M_i\le m$ via
$$y_{m_i} = \min\{Y_k : X_k\ge x_i\}, \qquad y_{M_i} = \max\{Y_k : X_k\le x_i\}.$$
The inequality $m_i\le M_i$ follows from the fact that $X_k = x_i$ for at least one index $k$. In particular, if $w_{ij} > 0$, then $m_i\le j\le M_i$. Consequently, with $\mathcal{J}_i := \{j : m_i\le j\le M_i\}$, we may write
$$w_{i+} := \#\{k : X_k = x_i\} = \sum_{j=1}^m w_{ij} = \sum_{j\in\mathcal{J}_i} w_{ij}.$$
Note also that $m_1\le\cdots\le m_n$ and $M_1\le\cdots\le M_n$. With the set
$$\mathcal{P} := \bigl\{(i,j) : i\in\mathcal{I},\ j\in\mathcal{J}_i\bigr\}$$
of relevant pairs of indices for $p$, the likelihood function becomes
$$L(p) = \prod_{(i,j)\in\mathcal{P}} p_{ij}^{w_{ij}}, \qquad (2.5)$$
and a maximizer $p$ of $L$ under constraints (2.2, 2.4) must satisfy
$$p_{i,j+1}\cdot p_{i+1,j} \le p_{ij}\cdot p_{i+1,j+1}, \quad \text{for all } m_{i+1}\le j < M_i \text{ and } 1\le i < n, \qquad (2.6)$$
and
$$p_{ij} \begin{cases} > 0 & \text{if } (i,j)\in\mathcal{P}, \\ = 0 & \text{otherwise}. \end{cases} \qquad (2.7)$$
Likewise, a maximizer of $L$ under constraints (2.2, 2.6–2.7) will satisfy constraint (2.4). Hence, it is equivalent to search for $p\in[0,1]^{n\times m}$ that satisfies (2.2, 2.6–2.7) and maximizes $L$. For rigorous proofs of these two claims, see Lemmata B.6 and B.8.

Negative log-probabilities and consecutive differences

In view of the likelihood (2.5) and constraint (2.7) on the parameter $p$, it is natural to reformulate the maximization problem at hand in terms of negative log-likelihood and log-probabilities. In addition, taking differences of log-parameters allows us to rephrase constraint (2.6) in terms of vectors with increasing components. We clarify this transition in the present section.

Let $\theta = (\theta_{ij})_{(i,j)\in\mathcal{P}}\in\mathbb{R}^{\mathcal{P}}$ be defined by
$$\theta_{ij} := \begin{cases} \log(p_{i m_i}) & \text{if } j = m_i, \\ \log(p_{ij}) - \log(p_{i,j-1}) & \text{if } m_i < j\le M_i, \end{cases}$$
so that
$$p_{ij} = \begin{cases} \exp\Bigl(\sum_{s=m_i}^{j}\theta_{is}\Bigr) & \text{if } (i,j)\in\mathcal{P}, \\ 0 & \text{otherwise}. \end{cases}$$
Then, constraint (2.6) is equivalent to $\theta_{ij}\le\theta_{i+1,j}$ for all $m_{i+1} < j\le M_i$ and $1\le i < n$, whereas constraint (2.2) reads
$$\sum_{j\in\mathcal{J}_i}\exp\Bigl(\sum_{s=m_i}^{j}\theta_{is}\Bigr) = 1, \quad \text{for all } i\in\mathcal{I}. \qquad (2.8)$$
Next we consider an arbitrary index $j\in\{1,\ldots,m\}$ and define indices $1\le n_j\le N_j\le n$ via
$$x_{n_j} := \min\{X_k : Y_k\ge y_j\}, \qquad x_{N_j} := \max\{X_k : Y_k\le y_j\}.$$
Then $m_{i+1} < j\le M_i$ if and only if $n_j\le i < N_j - 1$, for arbitrary indices $1\le i < n$ and $1 < j\le m$, see Lemma B.9. Furthermore, we have
$$\mathcal{P} = \bigl\{(i,m_i) : i\in\mathcal{I}\bigr\}\cup\Bigl(\bigcup_{j=2}^{m}\bigl\{(i,j) : n_j\le i\le N_j - 1\bigr\}\Bigr).$$
Consequently, constraint (2.6) is equivalently formulated as
$$(\theta_{ij})_{i=n_j}^{N_j-1}\in\mathbb{R}^{d_j}_{\uparrow}, \quad \text{whenever } 1 < j\le m \text{ and } d_j := N_j - n_j > 1,$$
where $\mathbb{R}^d_{\uparrow} := \{v\in\mathbb{R}^d : v_1\le v_2\le\cdots\le v_d\}$. The numbers $n_j$, $N_j$ and $d_j$ are exemplified in Figure 1. Finally, we define cumulative weights
$$\bar w_{ij} := \sum_{s=j}^{M_i} w_{is}, \quad \text{for all } (i,j)\in\mathcal{P}.$$
Then, the problem of maximizing $L$ under constraints (2.2, 2.6–2.7) becomes equivalent to minimizing the strictly convex functional
$$f(\theta) := \sum_{(i,j)\in\mathcal{P}}\Biggl(-\bar w_{ij}\theta_{ij} + w_{i+}\exp\Bigl(\sum_{s=m_i}^{j}\theta_{is}\Bigr)\Biggr) \qquad (2.9)$$
over the closed convex cone
$$\Theta := \bigl\{\theta\in\mathbb{R}^{\mathcal{P}} : (\theta_{ij})_{i=n_j}^{N_j-1}\in\mathbb{R}^{d_j}_{\uparrow} \text{ for all } 1 < j\le m\bigr\}.$$
The exponential terms in (2.9) are Lagrange-type terms accounting for the normalization of $\theta$ in (2.8), see Lemma B.11. In the sequel, we denote the unique minimizer of $f$ in $\Theta$ by
$$\theta^* := \arg\min_{\theta\in\Theta} f(\theta).$$
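To see the pieces of (2.9) in one place, here is a small Python sketch (our own, with illustrative names) evaluating the contribution of a single covariate row $i$ to $f$, anticipating the gradient and curvature quantities $g_{ij} = -\bar w_{ij} + H_{ij}$ and $H_{ij}$ used by the algorithm in Section 3.

```python
import numpy as np

# Contribution of one row i to the objective (2.9). theta_i collects
# (theta_{i,m_i}, ..., theta_{i,M_i}) and w_i the counts w_{i,m_i}, ..., w_{i,M_i}.
def row_objective(theta_i, w_i):
    w_plus = w_i.sum()                      # w_{i+} = #{k : X_k = x_i}
    wbar = np.cumsum(w_i[::-1])[::-1]       # cumulative weights wbar_{ij}
    cum = np.cumsum(theta_i)                # partial sums, i.e. log p_{ij}
    e = w_plus * np.exp(cum)                # Lagrange-type exponential terms
    f = np.sum(-wbar * theta_i + e)
    H = np.cumsum(e[::-1])[::-1]            # H_{ij} = w_{i+} sum_{r>=j} exp(...)
    g = -wbar + H                           # gradient entries of f
    return f, g, H

# theta encoding the empirical frequencies (0.3, 0.5, 0.2) of counts (3, 5, 2):
theta = np.array([np.log(0.3), np.log(0.5 / 0.3), np.log(0.2 / 0.5)])
f, g, H = row_objective(theta, np.array([3.0, 5.0, 2.0]))
print(f, g, H)
```

The full objective is the sum of such row contributions over $i = 1,\ldots,n$. For the toy values above the printed gradient is zero, reflecting that without the order constraint the minimizer encodes the empirical distribution of each row.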
3 The algorithm

3.1 Second order approximation

For $\eta\in\mathbb{R}^{\mathcal{P}}$ relatively close to a feasible parameter $\theta\in\Theta$, one can determine a second order approximation of $f$ around $\theta$ using its gradient
$$g(\theta) := \Bigl(\frac{\partial f}{\partial\theta_{ij}}(\theta)\Bigr)_{(i,j)\in\mathcal{P}} = \bigl(-\bar w_{ij} + H_{ij}(\theta)\bigr)_{(i,j)\in\mathcal{P}},$$
as well as the non-trivial entries of its Hessian matrix
$$\frac{\partial^2 f}{\partial\theta_{ij}\,\partial\theta_{k\ell}}(\theta) = \begin{cases} H_{i,\max(j,\ell)}(\theta) & \text{if } i = k\in\mathcal{I} \text{ and } j,\ell\in\mathcal{J}_i, \\ 0 & \text{otherwise}, \end{cases}$$
where $H(\theta) := \bigl(H_{ij}(\theta)\bigr)_{(i,j)\in\mathcal{P}}$ is defined by
$$H_{ij}(\theta) := \frac{\partial^2 f}{\partial\theta_{ij}^2}(\theta) = w_{i+}\sum_{r=j}^{M_i}\exp\Bigl(\sum_{s=m_i}^{r}\theta_{is}\Bigr).$$
A second order approximation of $f$ around $\theta$, with the Hessian replaced by its diagonal, therefore reads
$$f_\theta(\eta) := f(\theta) + \sum_{(i,j)\in\mathcal{P}}\Bigl(g_{ij}(\theta)(\eta_{ij} - \theta_{ij}) + \frac{1}{2}H_{ij}(\theta)(\eta_{ij} - \theta_{ij})^2\Bigr).$$
Since $H_{ij}(\theta) > 0$ for all $(i,j)\in\mathcal{P}$, the function $f_\theta$ is strictly convex. Furthermore, its summands can be rearranged to yield
$$f_\theta(\eta) = c_\theta + \frac{1}{2}\sum_{i\in\mathcal{I}} H_{i m_i}(\theta)\bigl(\eta_{i m_i} - \tilde\theta_{i m_i}(\theta)\bigr)^2 + \frac{1}{2}\sum_{j=2}^{m}\sum_{i=n_j}^{N_j-1} H_{ij}(\theta)\bigl(\eta_{ij} - \tilde\theta_{ij}(\theta)\bigr)^2,$$
with $c_\theta$ independent of $\eta$ and
$$\tilde\theta_{ij}(\theta) := \theta_{ij} - \frac{g_{ij}(\theta)}{H_{ij}(\theta)} = \theta_{ij} + \frac{\bar w_{ij}}{H_{ij}(\theta)} - 1.$$
It therefore possesses a unique minimizer under the constraint $\eta\in\Theta$, which can be characterized by means of isotonic regression, as indicated by the next lemma.

3.2 New search direction

Lemma 3.1. The vector $\eta^*(\theta) := \arg\min_{\eta\in\Theta} f_\theta(\eta)$ is characterized by
$$\eta^*_{i m_i}(\theta) = \tilde\theta_{i m_i}(\theta), \quad \text{for all } i\in\mathcal{I},$$
and
$$\bigl(\eta^*_{ij}(\theta)\bigr)_{i=n_j}^{N_j-1} = \mathrm{Iso}\Bigl(\bigl(\tilde\theta_{ij}(\theta)\bigr)_{i=n_j}^{N_j-1},\ \bigl(H_{ij}(\theta)\bigr)_{i=n_j}^{N_j-1}\Bigr), \quad \text{for all } 1 < j\le m,$$
where $\mathrm{Iso}(\xi, b)$ stands for the isotonic regression of $\xi\in\mathbb{R}^d$ with weights $b\in\mathbb{R}^d$, $b > 0$; that means, $\mathrm{Iso}(\xi, b)$ is the unique vector that minimizes $\sum_{k=1}^d b_k(z_k - \xi_k)^2$ over all $z\in\mathbb{R}^d_{\uparrow}$, see Barlow et al. (1972); Robertson et al. (1988). Additionally, the mapping $\theta\mapsto\eta^*(\theta)$ is continuous on $\Theta$.

In consequence, if $\eta^*(\theta)\ne\theta$, then $f_\theta\bigl(\eta^*(\theta)\bigr) < f_\theta(\theta) = f(\theta)$, and we define $v(\theta) := \eta^*(\theta) - \theta$. Because the gradients of $f_\theta$ and $f$ at $\theta$ are equal, we have that
$$Df\bigl(\theta, v(\theta)\bigr) = Df_\theta\bigl(\theta, v(\theta)\bigr) = 2\Bigl(f_\theta\bigl(\theta + v(\theta)\bigr) - f(\theta)\Bigr) < 0,$$
where the justification of the second equality is given in Lemma B.13. Thus, we find that $\theta\ne\theta^*$, and a strict improvement of $f$ can be found in the direction $v(\theta)$.
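Lemma 3.1 reduces each quasi-Newton step to weighted isotonic regressions, one per response level $j$. For completeness, here is a standard pool-adjacent-violators implementation of $\mathrm{Iso}(\xi, b)$; this is textbook code in the spirit of Barlow et al. (1972), not the authors' own implementation.

```python
import numpy as np

def iso(xi, b):
    """Weighted isotonic regression Iso(xi, b): minimize sum_k b_k (z_k - xi_k)^2
    over non-decreasing z, via the pool-adjacent-violators algorithm."""
    # Each block stores (weighted mean, total weight, block length).
    blocks = []
    for x, w in zip(xi, b):
        mean, weight, size = float(x), float(w), 1
        # Pool with previous blocks while monotonicity is violated.
        while blocks and blocks[-1][0] > mean:
            m0, w0, s0 = blocks.pop()
            mean = (w0 * m0 + weight * mean) / (w0 + weight)
            weight += w0
            size += s0
        blocks.append((mean, weight, size))
    return np.concatenate([np.full(s, m) for m, _, s in blocks])

print(iso(np.array([3.0, 1.0, 2.0, 4.0]), np.array([1.0, 2.0, 1.0, 1.0])))
# -> [1.666..., 1.666..., 2.0, 4.0]
```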
3.3 Step size correction

Suppose that $\theta\in\Theta$ is such that $Df_\theta\bigl(\theta, v(\theta)\bigr) < 0$, so $\theta\ne\theta^*$ and $f_\theta\bigl(\theta + v(\theta)\bigr) < f(\theta)$. In particular, there exists some number $t\in(0,1]$ such that
$$\rho_\theta(t) := f\bigl(\theta + t\,v(\theta)\bigr) - f(\theta) < 0.$$
We aim for a step size realizing a sufficiently large decrease of $f$. That means, we look for $t_*\in(0,1]$ such that $\rho_\theta(t_*)$ is as close as possible to $\min_{t\in(0,1]}\rho_\theta(t)$. For that, we proceed similarly as in Dümbgen et al. (2006). First, let $t_o := 2^{-n_o}$ with $n_o$ the smallest integer such that $\rho_\theta(2^{-n_o}) < 0$. We then define a Hermite interpolation of $\rho_\theta$:
$$\tilde\rho_\theta(t) := \rho_\theta'(0)\cdot t + t_o^{-1}\bigl(t_o^{-1}\rho_\theta(t_o) - \rho_\theta'(0)\bigr)\cdot t^2.$$
This new function is such that $\tilde\rho_\theta(0) = \rho_\theta(0) = 0$, $\tilde\rho_\theta(t_o) = \rho_\theta(t_o) < 0$ and
$$\tilde\rho_\theta'(0) = \rho_\theta'(0) = Df\bigl(\theta, v(\theta)\bigr) < 0.$$
The minimum of $\tilde\rho_\theta$ in $(0,1]$ is therefore attained at
$$t_* := \min\biggl(\frac{-\rho_\theta'(0)}{2\bigl(t_o^{-1}\rho_\theta(t_o) - \rho_\theta'(0)\bigr)},\ 1\biggr)\cdot t_o.$$
We then update $\theta$ with $\theta + t_*v(\theta)$. As shown in Lemma 1 of Dümbgen et al. (2006), this procedure ensures that the decrease $f(\theta) - f\bigl(\theta + t_*v(\theta)\bigr)$ is at least a universal positive fraction of $\max_{t\in(0,1]}\bigl(f(\theta) - f(\theta + t\,v(\theta))\bigr) > 0$.

3.4 Normalization

Let $\theta\in\Theta$ be a feasible parameter and $c(\theta) := \bigl(c_{ij}(\theta)\bigr)_{(i,j)\in\mathcal{P}}$ be defined by
$$c_{ij}(\theta) := \begin{cases} \log\Bigl(\sum_{r=m_i}^{M_i}\exp\bigl(\sum_{s=m_i}^{r}\theta_{is}\bigr)\Bigr) & \text{if } j = m_i, \\ 0 & \text{otherwise}. \end{cases}$$
Then the parameter $\theta' := \theta - c(\theta)$ is such that $\theta'\in\Theta$ and $f(\theta')\le f(\theta)$, with equality if and only if $c(\theta) = 0$, see Lemma B.10. In other words, replacing $\theta$ with $\theta - c(\theta)$ has the effect of decreasing the value of $f$, unless $\theta$ is normalized already. Since the vector $0$ is feasible, one can simply take its normalized version $\theta := -c(0)$ as a starting point.

3.5 Complete algorithm

Let $\varepsilon > 0$ be a given small threshold. The complete algorithm to minimize $f$ over $\Theta$ then reads:

    $\theta \leftarrow 0 - c(0)$
    $v(\theta) \leftarrow \bigl(\arg\min_{\eta\in\Theta} f_\theta(\eta)\bigr) - \theta$
    while $f_\theta\bigl(\theta + v(\theta)\bigr) - f(\theta) < -\varepsilon$ do
        $t_o \leftarrow 1$
        while $f\bigl(\theta + t_o\,v(\theta)\bigr) > f(\theta)$ do
            $t_o \leftarrow t_o/2$
        end while
        $t_* \leftarrow t_o\cdot\min\bigl(-\rho_\theta'(0)\,2^{-1}\bigl(t_o^{-1}\rho_\theta(t_o) - \rho_\theta'(0)\bigr)^{-1},\ 1\bigr)$
        $\theta \leftarrow \theta + t_*\,v(\theta)$
        $\theta \leftarrow \theta - c(\theta)$
        $v(\theta) \leftarrow \bigl(\arg\min_{\eta\in\Theta} f_\theta(\eta)\bigr) - \theta$
    end while
    return $\theta$
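The pseudocode translates almost line by line into code. Below is a structural sketch in Python; the helpers `f`, `f_quad` (for $f_\theta$), `search_direction` (the isotonic step of Lemma 3.1) and `normalize` (for $\theta\mapsto\theta - c(\theta)$) are hypothetical names standing for the quantities defined above, so this is an outline under those assumptions rather than the authors' implementation.

```python
def minimize_f(f, f_quad, search_direction, normalize, theta0, eps=1e-10):
    """Quasi-Newton scheme of Section 3: isotonic search direction,
    Hermite-interpolation step size, then normalization."""
    theta = normalize(theta0)
    v = search_direction(theta)                          # eta*(theta) - theta
    while f_quad(theta, theta + v) - f(theta) < -eps:
        rho0 = 2.0 * (f_quad(theta, theta + v) - f(theta))   # rho'_theta(0)
        t_o = 1.0
        while f(theta + t_o * v) > f(theta):             # find t_o with rho(t_o) < 0
            t_o /= 2.0
        rho_to = f(theta + t_o * v) - f(theta)
        t_star = t_o * min(-rho0 / (2.0 * (rho_to / t_o - rho0)), 1.0)
        theta = normalize(theta + t_star * v)
        v = search_direction(theta)
    return theta
```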
4 Convergence of the algorithm

The algorithmic mapping $\psi\colon\Theta\to\Theta$ constructed in Section 3 consists of the two steps
$$\theta \leftarrow \theta + t_*\,v(\theta), \qquad \theta \leftarrow \theta - c(\theta),$$
which have the effect of strictly decreasing the value of $f$, unless $\theta = \theta^*$.

Theorem 4.1. Let $\theta_0\in\Theta$ be any feasible starting point. Then, the sequence $(\theta_n)_{n\ge 0}$ in $\Theta$ resulting from the algorithmic mapping $\psi$,
$$\theta_{n+1} := \psi(\theta_n), \quad n\ge 0,$$
is such that $\lim_{n\to\infty} f(\theta_n) = \min_{\theta\in\Theta} f(\theta)$. In other words, $\lim_{n\to\infty}\theta_n = \theta^*$.

5 Simulations

In this section, we compare estimation and prediction performances of the likelihood ratio order constrained estimator presented in this article with the estimator under usual stochastic order obtained via isotonic distributional regression. The latter estimator was mentioned briefly in the introduction. It is extensively discussed in Henzi et al. (2019) and Mösching and Dümbgen (2020).
A Gamma model. We choose a parametric family of distributions from which we draw observations. We then use these data to provide distribution estimates which we compare with the truth. The specific model we have in mind is a family $(P_x)_{x\in\mathcal{X}}$ of Gamma distributions
$$\frac{\mathrm{d}P_x}{\mathrm{d}y}(y) = f_x(y) := \frac{b(x)^{-a(x)}}{\Gamma\bigl(a(x)\bigr)}\,y^{a(x)-1}\exp\bigl(-y/b(x)\bigr), \quad y > 0,$$
with shape function $a\colon\mathcal{X}\to(0,\infty)$ and scale function $b\colon\mathcal{X}\to(0,\infty)$. Then $(P_x)_{x\in\mathcal{X}}$ is increasing with respect to likelihood ratio ordering if, and only if, both $a$ and $b$ are non-decreasing. Recall that since the family is increasing in likelihood ratio order, it is also increasing with respect to the usual stochastic order. Figure 2 shows the true conditional density for the specific parameters $a$ and $b$ selected for this study.

[Figure 2: Conditional Gamma density with shape $a(x) := 2 + (x+1)^2$ and scale $b(x) := 1 - 1/\exp(10x)$ for $x$ in the interval $\mathcal{X} := [1, 4]$.]
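A short Python sketch of this sampling model follows, taking the shape and scale functions as read off the Figure 2 caption and an illustrative choice of $n_o$ and $N$ (both are assumptions for the example, not prescriptions from the paper).

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Shape and scale functions of the Gamma model (illustrative assumptions).
a = lambda x: 2.0 + (x + 1.0) ** 2
b = lambda x: 1.0 - 1.0 / np.exp(10.0 * x)

# Covariate grid X_o = {1, 1 + 3/n_o, ..., 4} and a sample of size N.
n_o, N = 16, 200
X_o = 1.0 + 3.0 / n_o * np.arange(n_o + 1)
X = rng.choice(X_o, size=N)                          # uniform draws from X_o
Y = stats.gamma.rvs(a(X), scale=b(X), random_state=rng)

# True conditional cdf F_x(y), used below to score the estimators.
F = lambda x, y: stats.gamma.cdf(y, a(x), scale=b(x))
```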
Sampling method. Let $n_o\in\mathbb{N}$ be a predefined number and let
$$\mathcal{X}_o := 1 + \frac{3}{n_o}\cdot\{0, 1, \ldots, n_o\} \subset \mathcal{X} := [1, 4].$$
For a given sample size $N\in\mathbb{N}$, the sample $(X_1,Y_1),(X_2,Y_2),\ldots,(X_N,Y_N)$ is obtained as follows: Draw $X_1,X_2,\ldots,X_N$ uniformly from $\mathcal{X}_o$ and sample each $Y_k$ independently from $P_{X_k}$. This yields unique covariates $x_1 < x_2 < \cdots < x_n$, as well as unique responses $y_1 < y_2 < \cdots < y_m$, for some $1\le n,m\le N$.

For each such sample, we compute estimators of $(P_{x_i})_{i=1}^n$ under the likelihood ratio ordering constraint, as well as under the usual stochastic ordering constraint. Using linear interpolation, we complete both families of estimates with covariates originally in $\{x_i\}_{i=1}^n$ to families of estimates with covariates in the full set $\mathcal{X}_o$. Simple calculations show that this extension preserves the original stochastic ordering of the estimated distributions. We therefore obtain estimates $(\widehat{P}_x)_{x\in\mathcal{X}_o}$ and $(\check{P}_x)_{x\in\mathcal{X}_o}$ under the likelihood ratio ordering and the usual stochastic ordering constraint, respectively. The corresponding families of cdf's are then written $\widehat{F} := (\widehat{F}_x)_{x\in\mathcal{X}_o}$ and $\check{F} := (\check{F}_x)_{x\in\mathcal{X}_o}$, whereas the truth is denoted by $F := (F_x)_{x\in\mathcal{X}_o}$. Although the performance of the empirical distribution is far worse than those of the two order constrained estimators, it is still useful to study its behavior, for example to better understand boundary effects. The family of empirical cdf's is written $\widehat{F}^{\mathrm{emp}} := (\widehat{F}^{\mathrm{emp}}_x)_{x\in\mathcal{X}_o}$.
A simple score. To verify the ability of the estimators $\widehat{F}$ and $\check{F}$ to retrieve the truth given by $F$, we compute Monte Carlo estimates of the mean and selected quantiles of each of the scores $R_x(\widehat{F}, F)$, $R_x(\check{F}, F)$ and $R_x(\widehat{F}^{\mathrm{emp}}, F)$ for each $x\in\mathcal{X}_o$, where
$$R_x(G, F) := \int\bigl|G_x(y) - F_x(y)\bigr|\,\mathrm{d}P_x(y),$$
with $G$ being either $\widehat{F}$, $\check{F}$ or $\widehat{F}^{\mathrm{emp}}$, and the integral being computed numerically. We also compute the relative change in score
$$100\cdot\frac{R_x(\widehat{F}, F) - R_x(\check{F}, F)}{R_x(\check{F}, F)}.$$
The results of the simulations are displayed in Figure 3. A first observation is that the performance of all three estimators deteriorates close to the boundary points of $\mathcal{X}$, and this is more pronounced for the two order constrained estimators. This is a known phenomenon from shape constrained inference. However, in the interior of $\mathcal{X}$, taking the stochastic ordering into account pays off.

The second column of plots in Figure 3 shows the relative change in score when estimating the family of distributions with a likelihood ratio ordering constraint instead of the usual stochastic ordering constraint. It is observed that the improvement in score becomes larger and occurs on a wider sub-interval of $\mathcal{X}$ as $n_o$ and $N$ increase. Only towards the boundary does the usual stochastic order seem to perform better.

[Figure 3: Monte Carlo simulations to evaluate estimation performances with the simple score, for the configurations of $(n_o, N)$ considered. Left column: simple scores $R_x(G, F)$ with $G$ being either $\widehat{F}$ (green), $\check{F}$ (blue) or $\widehat{F}^{\mathrm{emp}}$ (grey). Right column: relative change of the score when enforcing a likelihood ratio ordering constraint over the usual stochastic ordering constraint. The thicker line is the mean variation, whereas the thin lines are the 25%- and 75%-quantiles. Negative values represent an improvement in score.]
Theoretical predictive performances. Using the same Gamma model, we evaluate predictive performances of both estimators using the continuous ranked probability score
$$\mathrm{CRPS}(G_x, y) := \int\bigl(G_x(z) - 1[y\le z]\bigr)^2\,\mathrm{d}z.$$
The CRPS is a strictly proper scoring rule which allows for comparisons of probabilistic forecasts, see Gneiting and Raftery (2007) and Jordan et al. (2019). It can be seen as an extension of the mean absolute error to probabilistic forecasts. The CRPS is therefore interpreted in the same unit of measurement as the true distribution or data.

Because in our case the true underlying distribution is known, we study the expected CRPS given by
$$S_x(G, F) := \int\mathrm{CRPS}(G_x, y)\,\mathrm{d}P_x(y) = \sum_{j=0}^{m}\int_{[y_j, y_{j+1})}\bigl(G_x(y_j) - F_x(z)\bigr)^2\,\mathrm{d}z + \frac{b(x)}{B\bigl(1/2, a(x)\bigr)},$$
for $G\in\{\widehat{F}, \check{F}, \widehat{F}^{\mathrm{emp}}\}$, where $y_0 := -\infty$, $y_{m+1} := +\infty$, $B(\cdot,\cdot)$ is the beta function, and the latter integrals are computed via numerical integration.

Consequently, we compute Monte Carlo estimates of the mean and selected quantiles of $S_x(\widehat{F}, F)$, $S_x(\check{F}, F)$ and $S_x(\widehat{F}^{\mathrm{emp}}, F)$, as well as the relative change in score
$$100\cdot\frac{S_x(\widehat{F}, F) - S_x(\check{F}, F)}{S_x(\check{F}, F)}$$
for each $x\in\mathcal{X}_o$.

Figure 4 outlines the results of the simulations. Similar boundary effects as for the simple score are observed. On the interior of $\mathcal{X}$, the usual stochastic order improves on the naive empirical estimator, and the likelihood ratio order yields the best results.

[Figure 4: Monte Carlo simulations to evaluate prediction performances using a CRPS-type score, arranged as in Figure 3 with $S_x(G, F)$ in place of $R_x(G, F)$.]
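Since all estimators considered here are step functions supported on the observed $y_j$, the CRPS integral can be evaluated exactly. The following helper is our own sketch of such a computation, not the authors' code; the function name and interface are illustrative.

```python
import numpy as np

def crps_step(points, cdf, y):
    """Exact CRPS(G, y) = integral of (G(z) - 1[y <= z])^2 dz for a step-function
    predictive cdf G that jumps at the sorted `points` to the values `cdf`
    (non-decreasing, ending at 1); G is 0 below points[0] and 1 above points[-1]."""
    grid = np.sort(np.unique(np.append(points, y)))
    # Value of G on each interval [grid[k], grid[k+1]) is its value at grid[k].
    G_vals = np.append(0.0, cdf)[np.searchsorted(points, grid, side="right")]
    ind = (grid >= y).astype(float)          # indicator 1[y <= z] on each interval
    widths = np.diff(grid)
    return float(np.sum((G_vals[:-1] - ind[:-1]) ** 2 * widths))

# Example: a three-point forecast scored against the observation y = 2.5.
print(crps_step(np.array([1.0, 2.0, 3.0]), np.array([0.2, 0.7, 1.0]), 2.5))
```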
In terms of relative change in score, it appears that imposing a likelihood ratio ordering constraint to estimate the family of distributions yields an average score reduction of about 2% in comparison with the usual stochastic ordering estimator for a sample of $N = 50$. For larger values of $N$, this improvement occurs on a wider subinterval of $\mathcal{X}$ and more frequently, as shown by the 75% mark.

Note further that the expected CRPS increases on the interior of $\mathcal{X}$. This is due to the fact that the CRPS has the same unit of measurement as the response variable. Since the scale of the response, characterized by $b$, increases with $x$, so does the corresponding score.

[Figure 5: Subsample of the weight for age data (age in years versus weight in kg). A logarithmic scale was used for the weight variable.]
Empirical predictive performances. We use the weight for age dataset already studied in Mösching and Dümbgen (2020). It comprises the age and weight of $N := 16\,344$ girls with an age in $\mathcal{X} := [2, 16]$ years, of which we present a subsample in Figure 5. The dataset was publicly released as part of the National Health and Nutrition Examination Survey conducted in the US between 1963 and 1991 and was analyzed by Kuczmarski et al. (2002) with parametric models to produce smooth quantile curves.

Although the likelihood ratio ordering constraint is much harder to justify than the very natural stochastic ordering constraint, we are interested in the effect of the stronger regularization imposed by the former constraint.

The forecast evaluation is performed using a leave-$N_{\mathrm{train}}$-out cross-validation scheme. More precisely, we choose random subsets $\mathcal{D}_{\mathrm{train}}$ of $N_{\mathrm{train}}$ observations which we use to train our estimators $\widehat{F}$ and $\check{F}$. Using the rest of the $N_{\mathrm{test}} := N - N_{\mathrm{train}}$ data pairs in $\mathcal{D}_{\mathrm{test}}$, we evaluate predictive performances by computing the sample mean and selected quantiles of $\widehat{S}_x(\widehat{F}, F)$, $\widehat{S}_x(\check{F}, F)$, $\widehat{S}_x(\widehat{F}^{\mathrm{emp}}, F)$ and
$$100\cdot\frac{\widehat{S}_x(\widehat{F}, F) - \widehat{S}_x(\check{F}, F)}{\widehat{S}_x(\check{F}, F)}$$
for each $x\in\mathcal{X}_o$, where
$$\widehat{S}_x(G, F) := \frac{\sum_{(X,Y)\in\mathcal{D}_{\mathrm{test}} : X = x}\mathrm{CRPS}(G_x, Y)}{\#\{(X,Y)\in\mathcal{D}_{\mathrm{test}} : X = x\}}, \quad \text{for } G\in\{\widehat{F}, \check{F}, \widehat{F}^{\mathrm{emp}}\}.$$
Figure 6 shows the forecast evaluation results. As expected, the empirical CRPS increases with age, since the spread of the weight increases with age. As to the relative change in score, improvements of about 2% on average can be seen for the smaller training sizes, and of about 1% for $N_{\mathrm{train}} = 1000$. The region of $\mathcal{X}$ where likelihood ratio ordering has an advantage over the usual stochastic ordering is at its largest in case of $N_{\mathrm{train}} = 1000$, showing the benefit of a stronger regularization.

[Figure 6: Monte Carlo simulations to evaluate prediction performances using the empirical CRPS. Left column: empirical CRPS scores $\widehat{S}_x(G, F)$ with $G$ being either $\widehat{F}$ (green), $\check{F}$ (blue) or $\widehat{F}^{\mathrm{emp}}$ (grey). Right column: relative change of the score when enforcing a likelihood ratio ordering constraint over the usual stochastic ordering constraint. The thicker line is the mean variation, whereas the thin lines are the 25%- and 75%-quantiles. Negative values represent an improvement in score.]
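The evaluation scheme itself is a plain loop; the sketch below outlines it, reusing `crps_step` from the CRPS sketch above and an abstract fitting routine `fit` (a hypothetical stand-in for either order constrained estimator, returning a map from covariate to a step-function cdf).

```python
import numpy as np

def cv_scores(X, Y, fit, n_train, rng, n_reps=10):
    """Leave-N_train-out evaluation: fit on a random subset, score the rest with
    the empirical CRPS. `fit(X, Y)` must return G with G(x) = (points, cdf_values)."""
    N = len(X)
    means = []
    for _ in range(n_reps):
        train = rng.choice(N, size=n_train, replace=False)
        mask = np.zeros(N, dtype=bool)
        mask[train] = True
        G = fit(X[mask], Y[mask])
        scores = [crps_step(*G(x), y) for x, y in zip(X[~mask], Y[~mask])]
        means.append(np.mean(scores))
    return np.array(means)
```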
References

Barlow, R. E., Bartholomew, D. J., Bremner, J. M. and Brunk, H. D. (1972). Statistical Inference under Order Restrictions. The Theory and Application of Isotonic Regression. Wiley Series in Probability and Mathematical Statistics, John Wiley & Sons, London-New York-Sydney.

Beare, B. K. and Moon, J.-M. (2015). Nonparametric tests of density ratio ordering. Econometric Theory. https://doi.org/10.1017/S0266466614000401

Carolan, C. A. and Tebbs, J. M. (2005). Nonparametric tests for and against likelihood ratio ordering in the two-sample problem. Biometrika. https://doi.org/10.1093/biomet/92.1.159

Doob, J. L. (1994). Measure Theory, vol. 143 of Graduate Texts in Mathematics. Springer-Verlag, New York. https://doi.org/10.1007/978-1-4612-0877-8

Dümbgen, L., Freitag-Wolf, S. and Jongbloed, G. (2006). Estimating a unimodal distribution from interval-censored data. J. Amer. Statist. Assoc. https://doi.org/10.1198/016214506000000032

Dykstra, R., Kochar, S. and Robertson, T. (1995). Inference for likelihood ratio ordering in the two-sample problem. J. Amer. Statist. Assoc. https://doi.org/10.2307/2291340

El Barmi, H. and Mukerjee, H. (2005). Inferences under a stochastic ordering constraint: the k-sample case. J. Amer. Statist. Assoc. https://doi.org/10.1198/016214504000000764

Gneiting, T. and Raftery, A. E. (2007). Strictly proper scoring rules, prediction, and estimation. J. Amer. Statist. Assoc. https://doi.org/10.1198/016214506000001437

Henzi, A., Mösching, A. and Dümbgen, L. (2020). Accelerating the pool-adjacent-violators algorithm for isotonic distributional regression. Preprint, arXiv:2006.05527.

Henzi, A., Ziegel, J. and Gneiting, T. (2019). Isotonic distributional regression. Preprint, arXiv:1909.03725.

Jewitt, I. (1991). Applications of likelihood ratio orderings in economics. In Stochastic Orders and Decision under Risk (Hamburg, 1989), vol. 19 of IMS Lecture Notes Monogr. Ser., Inst. Math. Statist., Hayward, CA, 174–189. https://doi.org/10.1214/lnms/1215459856

Jongbloed, G. (1998). The iterative convex minorant algorithm for nonparametric estimation. J. Comput. Graph. Statist. https://doi.org/10.2307/1390706

Jordan, A., Krüger, F. and Lerch, S. (2019). Evaluating probabilistic forecasts with scoringRules. Journal of Statistical Software.

Karlin, S. and Rubin, H. (1956). The theory of decision procedures for distributions with monotone likelihood ratio. Ann. Math. Statist. https://doi.org/10.1214/aoms/1177728259

Kuczmarski, R. J., Ogden, C. L., Guo, S. S., Grummer-Strawn, L. M., Flegal, K. M., Mei, Z., Wei, R., Curtin, L. R., Roche, A. F. and Johnson, C. L. (2002). CDC growth charts for the United States: methods and development. Vital Health Stat.

Lehmann, E. L. and Rojo, J. (1992). Invariant directional orderings. Ann. Statist. https://doi.org/10.1214/aos/1176348905

Mösching, A. and Dümbgen, L. (2020). Monotone least squares and isotonic quantiles. Electron. J. Stat. https://doi.org/10.1214/19-EJS1659

Owen, A. B. (1988). Empirical likelihood ratio confidence intervals for a single functional. Biometrika. https://doi.org/10.1093/biomet/75.2.237

Owen, A. B. (2001). Empirical Likelihood. No. 92 in Monographs on Statistics and Applied Probability, Chapman and Hall/CRC.

Robertson, T., Wright, F. T. and Dykstra, R. L. (1988). Order Restricted Statistical Inference. Wiley Series in Probability and Mathematical Statistics, John Wiley & Sons, Ltd., Chichester.

Roosen, J. and Hennessy, D. A. (2004). Testing for the monotone likelihood ratio assumption. J. Bus. Econom. Statist. https://doi.org/10.1198/073500104000000235

Shaked, M. and Shanthikumar, J. G. (2007). Stochastic Orders. Springer Series in Statistics, Springer, New York. https://doi.org/10.1007/978-0-387-34675-5

Shorack, G. R. and Wellner, J. A. (1986). Empirical Processes with Applications to Statistics. Wiley Series in Probability and Mathematical Statistics, John Wiley & Sons, Inc., New York.

Westling, T., Downes, K. J. and Small, D. S. (2019). Nonparametric maximum likelihood estimation under a likelihood ratio order. Preprint, arXiv:1904.12321.

Yu, T., Li, P. and Qin, J. (2017). Density estimation in the two-sample problem with likelihood ratio ordering. Biometrika. https://doi.org/10.1093/biomet/asw069
A Likelihood ratio order
A.1 Two distributions
At first, let us recall some well-known facts from measure theory. Let $\mu$ and $\nu$ be $\sigma$-finite measures on a measurable space $(\mathcal{X}, \mathcal{B})$. The measure $\mu$ dominates $\nu$, and one writes $\nu\ll\mu$, if $\nu(B) = 0$ for any set $B\in\mathcal{B}$ with $\mu(B) = 0$. The well-known Radon–Nikodym theorem states that $\nu\ll\mu$ if and only if $\nu$ has a density $f$ with respect to $\mu$. That means, $f\colon\mathcal{X}\to[0,\infty)$ is measurable, and $\nu(B) = \int_B f\,\mathrm{d}\mu$ for any $B\in\mathcal{B}$. This is abbreviated as $f = \mathrm{d}\nu/\mathrm{d}\mu$. If $f^*$ is another such function, then $\mu(\{f^*\ne f\}) = 0$.

Now we consider two probability measures $Q$ and $P$ on $(\mathbb{R}, \mathcal{B})$, where $\mathcal{B}$ is the Borel $\sigma$-field on $\mathbb{R}$. Let $\mu$ be a $\sigma$-finite measure such that $Q\ll\mu$ and $P\ll\mu$; for instance, let $\mu = P + Q$. If we choose explicit densities $f = \mathrm{d}P/\mathrm{d}\mu$ and $g = \mathrm{d}Q/\mathrm{d}\mu$, then $(P+Q)(\{f+g=0\}) = 0$, and on $\mathbb{R}\setminus\{f+g=0\}$, the density ratio $g/f$ is well-defined in $[0,\infty]$.

Lemma A.1. The following properties of $Q$ and $P$ are equivalent:

(D1) There exists an isotonic version $\rho\colon\mathbb{R}\to[0,1]$ of $\mathrm{d}Q/\mathrm{d}(P+Q)$.

(D2) For any dominating measure $\mu$ of $P$ and $Q$ and explicit versions $f = \mathrm{d}P/\mathrm{d}\mu$, $g = \mathrm{d}Q/\mathrm{d}\mu$, there exists a set $N\supset\{f+g=0\}$ such that $(P+Q)(N) = 0$ and $g/f$ is isotonic on $\mathbb{R}\setminus N$.

(D3) For any dominating measure $\mu$ of $P$ and $Q$, there exist versions $f = \mathrm{d}P/\mathrm{d}\mu$ and $g = \mathrm{d}Q/\mathrm{d}\mu$ such that $g/f$ is isotonic on $\mathbb{R}\setminus\{f+g=0\}$.

(D4) For any dominating measure $\mu$ of $P$ and $Q$, there exist versions $f = \mathrm{d}P/\mathrm{d}\mu$ and $g = \mathrm{d}Q/\mathrm{d}\mu$ such that, for all $x, y\in\mathbb{R}$ with $x\le y$,
$$f(y)g(x) \le f(x)g(y).$$

(D5) For all $A, B\in\mathcal{B}$ such that $A\le B$ (element-wise),
$$P(B)Q(A) \le P(A)Q(B).$$

(D6) For all intervals $A = (w, x]$ and $B = (y, z]$ with $x\le y$,
$$P(B)Q(A) \le P(A)Q(B).$$
We now state the definition of likelihood ratio order on distributions.

Definition A.2 (Likelihood ratio order). We say that $P$ is smaller than $Q$ with respect to likelihood ratio order, and we write $P\le_{\mathrm{lr}} Q$, if any, and thus all, of the properties (D1–6) are satisfied.

Remark A.3. Shaked and Shanthikumar (2007) restrict their attention to probability measures which are either discrete or dominated by Lebesgue measure. Their definition of likelihood ratio order corresponds to properties (D3–4) with $\mu$ being counting measure or Lebesgue measure on the real line.

Remark A.4 (Usual stochastic order). Recall that the distribution $P$ is smaller than $Q$ with respect to the usual stochastic order, and we write $P\le_{\mathrm{st}} Q$, if $P\bigl((-\infty,x]\bigr)\ge Q\bigl((-\infty,x]\bigr)$ for all $x\in\mathbb{R}$. The likelihood ratio order is stronger than the usual stochastic order in the sense that $P\le_{\mathrm{st}} Q$ if $P\le_{\mathrm{lr}} Q$. This follows immediately from property (D5) applied to $A := (-\infty,x]$ and $B := (x,\infty)$ for arbitrary $x\in\mathbb{R}$, because $P(B)Q(A) = Q(A) - P(A)Q(A)$ and $P(A)Q(B) = P(A) - P(A)Q(A)$.

The reverse statement is false in general, but the likelihood ratio order of two distributions is tightly connected to the stochastic order of domain-conditional distributions.

Lemma A.5. We have that $P\le_{\mathrm{lr}} Q$ if and only if $P(\cdot\,|\,C)\le_{\mathrm{st}} Q(\cdot\,|\,C)$ for all $C\in\mathcal{B}$ such that $P(C), Q(C) > 0$.

Remark A.6 (Ordered ranges). Recall that the support of a probability measure $P$ on $\mathbb{R}$, denoted by $\mathrm{supp}(P)$, is the set
$$\mathrm{supp}(P) := \{y\in\mathbb{R} : P(U) > 0 \text{ for every open } U\subset\mathbb{R} \text{ s.t. } y\in U\}.$$
In particular, if $F$ denotes the distribution function of $P$, then the endpoints $\inf(\mathrm{supp}(P))$ and $\sup(\mathrm{supp}(P))$ are given by $\inf\{x\in\mathbb{R} : F(x) > 0\}$ and $\sup\{x\in\mathbb{R} : F(x) < 1\}$, respectively. Consequently, if $P\le_{\mathrm{st}} Q$, then $\inf(\mathrm{supp}(P))\le\inf(\mathrm{supp}(Q))$ and $\sup(\mathrm{supp}(P))\le\sup(\mathrm{supp}(Q))$.
A.2 Ordinal dominance curves

In case $Q\ll P$, another equivalent definition of likelihood ratio order involving the ordinal dominance curve can be stated. To define that property, we first need to set some notation. Let $F$ and $G$ be the distribution functions of $P$ and $Q$, respectively. In the sequel, we use the standard conventions $F(-\infty) := \lim_{x\to-\infty} F(x) = 0$ and $F(\infty) := \lim_{x\to\infty} F(x) = 1$. We also use the following definitions: The image of $F$ is the set $\mathrm{Im}(F) := \{F(x) : x\in\mathbb{R}\}$. For any $\alpha\in[0,1]$, the $\alpha$-quantile of $F$ is the number $F^{-1}(\alpha)$ defined by
$$F^{-1}(\alpha) := \begin{cases} -\infty & \text{if } \alpha = 0, \\ \min\{x\in\mathbb{R} : F(x)\ge\alpha\} & \text{if } \alpha\in(0,1), \\ \sup\{x\in\mathbb{R} : F(x) < 1\} & \text{if } \alpha = 1. \end{cases}$$
The ordinal dominance curve is the function $H = H_{F,G}\colon[0,1]\to[0,1]$ defined for each $\alpha\in[0,1]$ by
$$H(\alpha) := G\bigl(F^{-1}(\alpha)\bigr).$$
Let $V\subset\mathbb{R}$. We say that a function $h\colon V\to\mathbb{R}$ is convex on $V$ if, for all $x, y, z\in V$ with $x < y < z$,
$$\frac{h(y) - h(x)}{y - x} \le \frac{h(z) - h(y)}{z - y}.$$
The next result has been shown in the special case of $F$ and $G$ being continuous and strictly increasing by Lehmann and Rojo (1992), and in the general case but with additional assumptions and weaker conclusions by Westling et al. (2019).

Theorem A.7. If $Q\ll P$, then the following two claims are equivalent:

(i) $P\le_{\mathrm{lr}} Q$.

(ii) The ordinal dominance curve $H$ is convex on $\mathrm{Im}(F)$.
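As a quick numerical illustration of Theorem A.7 (our own sketch, using two Gamma distributions that are likelihood ratio ordered and satisfy $Q\ll P$), one can evaluate the ordinal dominance curve on a grid and check that its chord slopes increase:

```python
import numpy as np
from scipy import stats

# P = Gamma(2), Q = Gamma(3): the density ratio is proportional to y,
# hence increasing, so P <=_lr Q and the ODC should be convex.
F = stats.gamma(2.0)
G = stats.gamma(3.0)

alpha = np.linspace(0.01, 0.99, 99)
H = G.cdf(F.ppf(alpha))                 # ordinal dominance curve H(alpha)

slopes = np.diff(H) / np.diff(alpha)    # chord slopes of H
print("convex on the grid:", bool(np.all(np.diff(slopes) >= -1e-9)))
```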
A.3 Families of distributions

The next lemma shows that the relation $\le_{\mathrm{lr}}$ indeed defines a partial order.

Lemma A.8. The relation $\le_{\mathrm{lr}}$ defines a partial order on the set of all distributions on the real line. That means, for arbitrary probability measures $P$, $Q$ and $R$ on $\mathbb{R}$,

• $P\le_{\mathrm{lr}} P$ (reflexivity);

• $P\le_{\mathrm{lr}} Q$ and $Q\le_{\mathrm{lr}} P$ implies that $P = Q$ (antisymmetry);

• $P\le_{\mathrm{lr}} Q$ and $Q\le_{\mathrm{lr}} R$ implies that $P\le_{\mathrm{lr}} R$ (transitivity).

We now define likelihood ratio ordering for a family of distributions.

Definition A.9. Let $(\mathcal{X}, \preceq)$ be a partially ordered set and $(P_x)_{x\in\mathcal{X}}$ be a family of distributions on $(\mathbb{R}, \mathcal{B})$. We say that $(P_x)_{x\in\mathcal{X}}$ is increasing in $x$ with respect to likelihood ratio order if $P_{x_1}\le_{\mathrm{lr}} P_{x_2}$ for all $x_1, x_2\in\mathcal{X}$ such that $x_1\preceq x_2$.

Example A.10 (Total order on finite sets). For integers $n, m\ge 2$, let $\mathcal{X} = \{x_1,\ldots,x_n\}$ and $\mathcal{Y} = \{y_1,\ldots,y_m\}$ with real numbers $x_1<\cdots<x_n$ and $y_1<\cdots<y_m$. For $1\le i\le n$, let $P_{x_i}$ be a probability distribution on $\mathcal{Y}$ with masses $p_{ij} := P_{x_i}(\{y_j\})$ for $1\le j\le m$. According to properties (D3–5) of Lemma A.1, the family $(P_x)_{x\in\mathcal{X}}$ is increasing with respect to likelihood ratio order if and only if
$$p_{i_1 j_2}\cdot p_{i_2 j_1} \le p_{i_1 j_1}\cdot p_{i_2 j_2}$$
whenever $1\le i_1 < i_2\le n$ and $1\le j_1 < j_2\le m$.

B Proofs and technical details
B.1 Local averages and isotonicity
Let $\mu$ be a locally finite measure on the real line, that means, $\mu(B) < \infty$ for bounded sets $B\subset\mathbb{R}$. As usual, $L^1(\mu)$ denotes the set of measurable functions $f$ such that $\|f\| := \int|f|\,\mathrm{d}\mu$ is finite, and $L^1_{\mathrm{loc}}(\mu)$ denotes the set of locally integrable functions, that is, measurable functions $f$ such that $\|f 1_B\| < \infty$ for all bounded sets $B\subset\mathbb{R}$.

For all $f\in L^1_{\mathrm{loc}}(\mu)$ and bounded intervals $B$, define the $\mu$-average of $f$ on $B$ as
$$D_\mu f(B) := \mu(B)^{-1}\int_B f\,\mathrm{d}\mu$$
with the convention $0/0 := 0$. For integers $n\ge 0$, we partition $\mathbb{R}$ into the half-open intervals
$$I_{nj} := \bigl(2^{-n}(j-1),\ 2^{-n}j\bigr], \quad j\in\mathbb{Z}.$$
Now, for $f\in L^1_{\mathrm{loc}}(\mu)$ define $\Pi_n f\colon\mathbb{R}\to\mathbb{R}$ via $\Pi_n f(x) := D_\mu f(I_{nj})$ for $j\in\mathbb{Z}$ and $x\in I_{nj}$.

Lemma B.1. For all $n\ge 0$, the operator $\Pi_n\colon L^1_{\mathrm{loc}}(\mu)\to L^1_{\mathrm{loc}}(\mu)$ is linear. For any $f\in L^1(\mu)$, it satisfies the inequalities $\|(\Pi_n f)1_{I_{nj}}\|\le\|f 1_{I_{nj}}\|$ for all $j\in\mathbb{Z}$. In particular, $\|\Pi_n f\|\le\|f\|$. Moreover, $\Pi_n f\to f$ $\mu$-almost everywhere.

Proof of Lemma B.1. Linearity of $\Pi_n$ follows from linearity of integrals. The contraction property follows from the fact that for any integer $j$,
$$\|(\Pi_n f)1_{I_{nj}}\| = \mu(I_{nj})\bigl|D_\mu f(I_{nj})\bigr| = \Bigl|\int_{I_{nj}} f\,\mathrm{d}\mu\Bigr| \le \int_{I_{nj}} |f|\,\mathrm{d}\mu = \|f 1_{I_{nj}}\|.$$
Since $\|h\| = \sum_{j\in\mathbb{Z}}\|h 1_{I_{nj}}\|$ for $h = f, \Pi_n f$, this implies the inequality $\|\Pi_n f\|\le\|f\|$.

Now pick numbers $c_j > 0$, $j\in\mathbb{Z}$, such that $\sum_{j\in\mathbb{Z}} c_j\int_{I_{0j}}|f|\,\mathrm{d}\mu < \infty$ and $\mathrm{IP}(B) := \sum_{j\in\mathbb{Z}} c_j\,\mu(B\cap I_{0j})$ defines a probability measure on $(\mathbb{R}, \mathcal{B})$. One can easily verify that $\Pi_n f$ is the conditional expectation $\mathrm{IE}(f\,|\,\mathcal{B}_n)$ of $f\in L^1(\mathrm{IP})$ with respect to the $\sigma$-field $\mathcal{B}_n$ generated by the intervals $I_{nj}$, $j\in\mathbb{Z}$. Moreover, since $\mathcal{B}_0\subset\mathcal{B}_1\subset\mathcal{B}_2\subset\cdots$, and since $\mathcal{B}$ is generated by $\bigcup_{n\ge 0}\mathcal{B}_n$, the sequence $(\Pi_n f)_{n\ge 0}$ is a martingale, and by the martingale convergence theorem (Doob, 1994), it converges almost surely to $\mathrm{IE}(f\,|\,\mathcal{B}) = f$. Now the assertion follows from the fact that $\mathrm{IP}(B) = 0$ if and only if $\mu(B) = 0$.

Let $\mathbb{Z}_n := \{j\in\mathbb{Z} : \mu(I_{nj}) > 0\}$ and $i_n := \sup(\mathbb{Z}_n)$. For $x\in\mathbb{R}$ let
$$j_n(x) := \min\bigl(\{j\in\mathbb{Z}_n : x\le 2^{-n}j\}\cup\{i_n\}\bigr).$$
Let $\Pi^*_n$ be the operator on $L^1_{\mathrm{loc}}(\mu)$ defined for all $f\in L^1_{\mathrm{loc}}(\mu)$ and $x\in\mathbb{R}$ by
$$\Pi^*_n f(x) := D_\mu f\bigl(I_{n,j_n(x)}\bigr).$$
In particular, $\Pi^*_n f\equiv D_\mu f(I_{nj})$ on $I_{nj}$ whenever $j\in\mathbb{Z}$ is such that $\mu(I_{nj}) > 0$. That means, $\Pi_n f$ and $\Pi^*_n f$ coincide $\mu$-almost everywhere. However, the slightly more complicated definition of $\Pi^*_n$ ensures that it preserves isotonicity.

Lemma B.2. For all $n\ge 0$, the operator $\Pi^*_n\colon L^1_{\mathrm{loc}}(\mu)\to L^1_{\mathrm{loc}}(\mu)$ is linear. For any $f\in L^1_{\mathrm{loc}}(\mu)$, the following two conclusions hold true: If $f$ is isotonic on $\mathbb{R}$, then $D_\mu f(I_{nj})$ is isotonic in $j\in\mathbb{Z}_n$. If $D_\mu f(I_{nj})$ is isotonic in $j\in\mathbb{Z}_n$, then $\Pi^*_n f$ is isotonic on $\mathbb{R}$.

Proof of Lemma B.2. Linearity of $\Pi^*_n$ is obvious. If $f$ is isotonic, then $D_\mu f(I_{nj})$ is isotonic in $j\in\mathbb{Z}_n$, because for numbers $j < k$ in $\mathbb{Z}_n$,
$$D_\mu f(I_{nj}) \le f(2^{-n}j) \le f\bigl(2^{-n}(k-1)\bigr) \le D_\mu f(I_{nk}).$$
Since $j_n(x)\in\mathbb{Z}_n$ for all $x\in\mathbb{R}$, and since $j_n(x)\le j_n(y)$ whenever $x\le y$, we find that $\Pi^*_n f$ is isotonic on $\mathbb{R}$ whenever $D_\mu f(I_{nj})$ is isotonic in $j\in\mathbb{Z}_n$.

Lemma B.3. For any function $f\in L^1_{\mathrm{loc}}(\mu)$, the following two claims are equivalent:

(i) There exists an isotonic function $f^*\colon\mathbb{R}\to[-\infty,\infty]$ such that $f^* = f$ $\mu$-almost everywhere.

(ii) $D_\mu f(A)\le D_\mu f(B)$ for all $A = (w,x]$, $B = (y,z]$ such that $x\le y$ and $\mu(A), \mu(B) > 0$.

Proof of Lemma B.3. Let $\mu(\{f^*\ne f\}) = 0$ for an isotonic function $f^*\colon\mathbb{R}\to[-\infty,\infty]$. For intervals $A, B$ as in (ii), this implies that $D_\mu f(A)\le f^*(x)\le f^*(y)\le D_\mu f(B)$. Consequently, (i) implies (ii).

Suppose now that $f\in L^1_{\mathrm{loc}}(\mu)$ satisfies (ii). With $f_n := \Pi_n f$ as in Lemma B.1, the latter lemma implies $\lim_{n\to\infty} f_n(x) = f(x)$ for all $x\in\mathbb{R}\setminus N$ for some Borel set $N$ with $\mu(N) = 0$. Next let $f^*_n := \Pi^*_n f$. Then it follows from (ii) and Lemma B.2 that $f^*_n$ is an isotonic function. Moreover, $f^*_n = f_n$ on $\mathbb{R}\setminus N_n$ for some Borel set $N_n$ with $\mu(N_n) = 0$. Consequently, $\lim_{n\to\infty} f^*_n(x) = f(x)$ for all $x\in\mathbb{R}\setminus N^*$, where $N^* := N\cup\bigcup_{n\ge 0} N_n$ satisfies $\mu(N^*) = 0$. Since all functions $f^*_n$ are isotonic on $\mathbb{R}$, this implies that $f^*(x) := \liminf_{n\to\infty} f^*_n(x)$ defines an isotonic function $f^*\colon\mathbb{R}\to[-\infty,\infty]$, and $f^* = f$ on $\mathbb{R}\setminus N^*$. Hence, (i) is satisfied as well.

B.2 Proofs for Appendix A

Proof of Lemma A.1. We prove the result via a chain of implications.
(D1) $\Rightarrow$ (D2): Let $\mu$, $f$ and $g$ be as in (D2), and let $\rho\colon\mathbb{R}\to[0,1]$ be an isotonic version of $\mathrm{d}Q/\mathrm{d}(P+Q)$. Since $f + g = \mathrm{d}(P+Q)/\mathrm{d}\mu$, for any Borel set $B$,
$$Q(B) = \int_B\rho\,\mathrm{d}(P+Q) = \int_B\rho(f+g)\,\mathrm{d}\mu,$$
so $\rho(f+g) = \mathrm{d}Q/\mathrm{d}\mu$. Moreover,
$$P(B) = (P+Q)(B) - Q(B) = \int_B(1-\rho)(f+g)\,\mathrm{d}\mu,$$
whence $(1-\rho)(f+g) = \mathrm{d}P/\mathrm{d}\mu$. That means, $N_1 := \{f\ne(1-\rho)(f+g)\}\cup\{g\ne\rho(f+g)\}$ satisfies $\mu(N_1) = 0$. With $N_2 := \{f+g=0\}$, we obtain a set $N := N_1\cup N_2$ such that $(P+Q)(N) = 0$, and on $\mathbb{R}\setminus N$, $g/f = \rho/(1-\rho)$ is isotonic, because $[0,1]\ni r\mapsto r/(1-r)\in[0,\infty]$ is strictly increasing. Hence, (D1) implies (D2).

(D2) $\Rightarrow$ (D3): Let $\mu$ dominate $P$ and $Q$ with explicit versions $f = \mathrm{d}P/\mathrm{d}\mu$ and $g = \mathrm{d}Q/\mathrm{d}\mu$. Suppose that $g/f$ is isotonic on $\mathbb{R}\setminus N$, where $N\supset\{f+g=0\}$ satisfies $(P+Q)(N) = 0$. Then we may replace $f$ and $g$ with $\tilde f := f 1_{\mathbb{R}\setminus N}$ and $\tilde g := g 1_{\mathbb{R}\setminus N}$, respectively. Then $N = \{\tilde f + \tilde g = 0\}$, and on $\mathbb{R}\setminus N$, $\tilde g/\tilde f = g/f$ is isotonic. Hence, (D2) implies (D3).

(D3) $\Rightarrow$ (D4): Let $\mu$ dominate $P$ and $Q$, and suppose that there exist versions $f = \mathrm{d}P/\mathrm{d}\mu$ and $g = \mathrm{d}Q/\mathrm{d}\mu$ such that $g/f$ is isotonic on $\mathbb{R}\setminus\{f+g=0\}$. For points $x\le y$ in $\mathbb{R}\setminus\{f+g=0\}$, the inequality $(g/f)(x)\le(g/f)(y)$ is equivalent to $f(y)g(x)\le f(x)g(y)$, and the latter inequality is always correct if $x\in\{f+g=0\}$ or $y\in\{f+g=0\}$. Consequently,
$$f(y)g(x)\le f(x)g(y) \quad \text{if } x, y\in\mathbb{R},\ x\le y.$$
Hence (D3) implies (D4).

(D4) $\Rightarrow$ (D5) $\Rightarrow$ (D6): Let $\mu$, $f$ and $g$ be as in (D4) and $A$ and $B$ be sets as in (D5). Integrating $f(x)g(y)$ and $g(x)f(y)$ on $A\times B$ with respect to the product measure $\mu\otimes\mu$ yields (D5). Condition (D6) is obviously a special case of (D5).

(D6) $\Rightarrow$ (D1): Suppose that $P$ and $Q$ satisfy (D6). The inequality $P(B)Q(A)\le P(A)Q(B)$ is easily seen to be equivalent to $\mu(B)Q(A)\le\mu(A)Q(B)$ with $\mu := P+Q$. Since $Q\le\mu$, the latter inequality is trivial if $\mu(A) = 0$ or $\mu(B) = 0$. Thus (D6) is equivalent to

(D6') $(Q/\mu)(A)\le(Q/\mu)(B)$ for $A = (w,x]$ and $B = (y,z]$ with $x\le y$ and $\mu(A),\mu(B) > 0$.

The relation $Q\le\mu$ also implies that there exists a density $\rho = \mathrm{d}Q/\mathrm{d}\mu$ with values in $[0,1]$, and (D6') reads
$$D_\mu\rho(A)\le D_\mu\rho(B) \quad \text{for } A = (w,x],\ B = (y,z] \text{ with } x\le y \text{ and } \mu(A),\mu(B) > 0,$$
where $D_\mu\rho(C) := \mu(C)^{-1}\int_C\rho\,\mathrm{d}\mu$ with $0/0 := 0$. By Lemma B.3, $\rho$ may be chosen to be isotonic. Thus condition (D6) implies (D1).

Proof of Lemma A.5. For notational convenience, we write $P_C := P(\cdot\,|\,C)$ and $Q_C := Q(\cdot\,|\,C)$. Suppose first that $P\le_{\mathrm{lr}} Q$, and let $C$ be a Borel set with $P(C), Q(C) > 0$. Since $P(B)Q(A)\le P(A)Q(B)$ for all Borel sets $A, B$ such that $A\le B$, and since $A\cap C\le B\cap C$, we find that
$$P_C(B)Q_C(A) = \frac{P(B\cap C)}{P(C)}\cdot\frac{Q(A\cap C)}{Q(C)} \le \frac{P(A\cap C)}{P(C)}\cdot\frac{Q(B\cap C)}{Q(C)} = P_C(A)Q_C(B).$$
Hence $P_C\le_{\mathrm{lr}} Q_C$, and this implies that $P_C\le_{\mathrm{st}} Q_C$, see Remark A.4.

Suppose now that $P_C\le_{\mathrm{st}} Q_C$ for all Borel sets $C$ such that $P(C), Q(C) > 0$. To verify that $P\le_{\mathrm{lr}} Q$, it suffices to show that
$$P(A)Q(B) - P(B)Q(A) \ge 0 \qquad (\mathrm{B.1})$$
for $A = (w,x]$ and $B = (y,z]$ with $x\le y$. With $C := A\cup B$, it suffices to consider the case $P(C), Q(C) > 0$, because otherwise (B.1) is trivial. But then (B.1) is equivalent to
$$P_C(A)Q_C(B) - P_C(B)Q_C(A) \ge 0,$$
and since $P_C(B) = 1 - P_C(A)$ and $Q_C(B) = 1 - Q_C(A)$, the latter inequality is equivalent to
$$P_C(A) \ge Q_C(A).$$
But this is a consequence of $P_C\le_{\mathrm{st}} Q_C$, because $P_C(A) = P_C\bigl((-\infty,x]\bigr)$ and $Q_C(A) = Q_C\bigl((-\infty,x]\bigr)$.

Proof of Lemma A.8. Reflexivity of the likelihood ratio order is obvious. To show antisymmetry, note that $P\le_{\mathrm{lr}} Q$ and $Q\le_{\mathrm{lr}} P$ implies that $P\le_{\mathrm{st}} Q$ and $Q\le_{\mathrm{st}} P$. But the latter two inequalities mean that the distribution functions of $P$ and $Q$ coincide, whence $P = Q$.

It remains to prove transitivity. Suppose that $P\le_{\mathrm{lr}} Q\le_{\mathrm{lr}} R$, and let $\mu := P+Q+R$. Then there exist versions $f = \mathrm{d}P/\mathrm{d}\mu$, $g = \mathrm{d}Q/\mathrm{d}\mu$ and $h = \mathrm{d}R/\mathrm{d}\mu$ such that $g/f$ is isotonic on $\{f+g > 0\}$ and $h/g$ is isotonic on $\{g+h > 0\}$. In particular, $\{f+g > 0\} = A\cup B$ with
$$A := \{f > 0 = g\} < B := \{g > 0\},$$
and $\{g+h > 0\} = B\cup C$ with
$$B = \{g > 0\} < C := \{g = 0 < h\}.$$
The relation "$<$" for sets is meant element-wise. In particular, $\{f > 0\}\subset A\cup B$, whence $f = 0$ on $C$, and $\{h > 0\}\subset B\cup C$, whence $h = 0$ on $A$. Consequently, $\{f+h > 0\} = A\cup B'\cup C$ with
$$B' := \{f+h > 0\}\cap B.$$
Since $h/f = 0$ on $A$ and $h/f = \infty$ on $C$, it suffices to show that $h/f$ is isotonic on $B'$. But since $g > 0$ on $B'$, the ratio $h/f = (h/g)(g/f)$ is the product of two isotonic functions from $B'$ to $[0,\infty]$, whence it is isotonic on $B'$, too.

The proof of Theorem A.7 uses elementary inequalities for distribution and quantile functions; see for instance Chapter 1 of Shorack and Wellner (1986):

Lemma B.4. For all $\alpha\in[0,1]$, we have $F\bigl(F^{-1}(\alpha)\bigr)\ge\alpha$, with equality if, and only if, $\alpha\in\mathrm{Im}(F)$. For all $x\in\mathbb{R}$, $F^{-1}\bigl(F(x)\bigr)\le x$, with equality if, and only if, $F(x_o) < F(x)$ for all $x_o < x$.

Proof of Theorem A.7. Suppose that (i) holds and let $r, s, t\in\mathrm{Im}(F)$ with $r < s < t$. Lemma B.4 yields $F^{-1}(r) < F^{-1}(s) < F^{-1}(t)$, so $A := (F^{-1}(r), F^{-1}(s)]$ and $B := (F^{-1}(s), F^{-1}(t)]$ are such that
$$P(A) = F\bigl(F^{-1}(s)\bigr) - F\bigl(F^{-1}(r)\bigr) = s - r > 0, \qquad P(B) = F\bigl(F^{-1}(t)\bigr) - F\bigl(F^{-1}(s)\bigr) = t - s > 0.$$
Then property (D6) yields
$$(t-s)\bigl(H(s) - H(r)\bigr) = P(B)Q(A) \le P(A)Q(B) = (s-r)\bigl(H(t) - H(s)\bigr),$$
and dividing both sides by $(s-r)(t-s)$ shows that $H$ is convex on $\mathrm{Im}(F)$.

Suppose now that (ii) holds. To verify (i), we have to show that
$$P\bigl((x_1,x_2]\bigr)Q\bigl((x_3,x_4]\bigr) - P\bigl((x_3,x_4]\bigr)Q\bigl((x_1,x_2]\bigr) \ge 0 \qquad (\mathrm{B.2})$$
for $x_1 < x_2\le x_3 < x_4$. If $P\bigl((x_i,x_j]\bigr) = 0$, then $Q\bigl((x_i,x_j]\bigr) = 0$, because $Q\ll P$. Hence it suffices to verify (B.2) in case of $P\bigl((x_1,x_2]\bigr) = F(x_2) - F(x_1) > 0$ and $P\bigl((x_3,x_4]\bigr) = F(x_4) - F(x_3) > 0$. For $1\le i\le 4$, let $\tilde x_i := F^{-1}\bigl(F(x_i)\bigr)$. Then $\tilde x_i\le x_i$ and $F(\tilde x_i) = F(x_i) =: t_i$, whence $P\bigl((\tilde x_i, x_i]\bigr) = 0$ and $Q\bigl((\tilde x_i, x_i]\bigr) = 0$. Consequently, the left hand side of (B.2) equals
$$(t_2 - t_1)\,Q\bigl((\tilde x_3, \tilde x_4]\bigr) - (t_4 - t_3)\,Q\bigl((\tilde x_1, \tilde x_2]\bigr) = (t_2 - t_1)\bigl(H(t_4) - H(t_3)\bigr) - (t_4 - t_3)\bigl(H(t_2) - H(t_1)\bigr)$$
$$= (t_2 - t_1)(t_4 - t_3)\Bigl(\frac{H(t_4) - H(t_3)}{t_4 - t_3} - \frac{H(t_2) - H(t_1)}{t_2 - t_1}\Bigr) \ge 0,$$
by convexity of $H$ on $\mathrm{Im}(F)$.
B.3 Proofs for Section 2

Lemma B.5. The numbers $m_i$ and $M_i$ have the following properties: (a) $m_i = M_i = j$ if, and only if, $X_k \le x_i$ implies $Y_k \le y_j$ and $X_k \ge x_i$ implies $Y_k \ge y_j$; (b) the numbers $r_i := \min\{l : M_l = M_i\}$ and $s_i := \max\{l : m_l = m_i\}$ are such that $r_i \le i \le s_i$ and $w_{r_i M_i}, w_{s_i m_i} > 0$.

Proof of Lemma B.5. Part (a) follows immediately from the definition of $m_i$ and $M_i$. Indeed, $m_i = M_i = j$ if, and only if, $y_j = \min\{Y_k : X_k \ge x_i\} = \max\{Y_k : X_k \le x_i\}$.

As to part (b), it follows from $M_1 \le \dots \le M_n$ that $r_i \le i \le s_i$. By definition of $M_{r_i}$, $Y_k = y_{M_i}$ for some $k$ such that $X_k \le x_{r_i}$. We even have $X_k = x_{r_i}$, because otherwise $r_i > 1$ and $M_{r_i - 1} \ge M_i$, a contradiction to the minimality of $r_i$. Hence, $w_{r_i M_i} \ge 1$. Likewise, the definition of $m_{s_i}$ implies that $Y_k = y_{m_i}$ for some $k$ such that $X_k \ge x_{s_i}$. Here $X_k = x_{s_i}$, because otherwise $s_i < n$ and $m_{s_i + 1} \le m_i$, a contradiction to the definition of $s_i$. Hence, $w_{s_i m_i} \ge 1$.

Lemma B.6. A matrix $p \in [0,1]^{n \times m}$ that satisfies (2.6)--(2.7) must satisfy (2.4).

Proof of Lemma B.6. Recall that we have to show
\[ p_{i_1 j_2} \cdot p_{i_2 j_1} \;\le\; p_{i_1 j_1} \cdot p_{i_2 j_2} \tag{B.3} \]
for all $1 \le i_1 \le i_2 \le n$ and $1 \le j_1 \le j_2 \le m$. We distinguish several cases:

Case $j_1 = j_2$: Here (B.3) is trivially verified.

Case $m_{i_2} \le j_1 < j_2 \le M_{i_1}$: We have that $m_i \le j_1 < j_2 \le M_i$ for all $i_1 \le i \le i_2$, so $p_{ij} > 0$ for all $j_1 \le j \le j_2$ and $i_1 \le i \le i_2$, by (2.7). We further have that $m_{i+1} < M_i$ for all $i_1 \le i < i_2$, so we can apply (2.6) to all $j_1 \le j < j_2$ and $i_1 \le i < i_2$. This yields
\[ \frac{p_{i_1, j_2}}{p_{i_1, j_1}} = \frac{p_{i_1, j_2}}{p_{i_1, j_2 - 1}} \cdot \frac{p_{i_1, j_2 - 1}}{p_{i_1, j_2 - 2}} \cdots \frac{p_{i_1, j_1 + 1}}{p_{i_1, j_1}} \;\le\; \frac{p_{i_1 + 1, j_2}}{p_{i_1 + 1, j_2 - 1}} \cdot \frac{p_{i_1 + 1, j_2 - 1}}{p_{i_1 + 1, j_2 - 2}} \cdots \frac{p_{i_1 + 1, j_1 + 1}}{p_{i_1 + 1, j_1}} \;\le\; \dots \;\le\; \frac{p_{i_2, j_2}}{p_{i_2, j_2 - 1}} \cdot \frac{p_{i_2, j_2 - 1}}{p_{i_2, j_2 - 2}} \cdots \frac{p_{i_2, j_1 + 1}}{p_{i_2, j_1}} = \frac{p_{i_2, j_2}}{p_{i_2, j_1}}, \]
which is equivalent to (B.3).

Case $j_1 < m_{i_2}$, $M_{i_1} < j_2$ or $M_{i_1} \le m_{i_2}$: From (2.7), in each case at least one of $p_{i_1 j_2}$ and $p_{i_2 j_1}$ is zero, so (B.3) is trivial.
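The telescoping step in the proof of Lemma B.6 can also be probed numerically: once the adjacent $2 \times 2$ inequalities (2.6) hold, all $2 \times 2$ inequalities (B.3) follow. A minimal Python sketch, assuming for simplicity a strictly positive matrix so that the support restrictions in (2.7) play no role:

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)
n, m = 5, 4

# Construct log-probabilities whose row increments increase in i; this is
# exactly the adjacent 2x2 condition p[i,j+1]*p[i+1,j] <= p[i,j]*p[i+1,j+1].
theta = np.sort(rng.normal(size=(n, m - 1)), axis=0)  # column-wise increasing
logp = np.hstack([np.zeros((n, 1)), np.cumsum(theta, axis=1)])
p = np.exp(logp)
p /= p.sum(axis=1, keepdims=True)       # row normalization keeps the minors

# Global TP2 inequality (B.3) for arbitrary index pairs:
ok = all(p[i1, j2] * p[i2, j1] <= p[i1, j1] * p[i2, j2] + 1e-12
         for i1 in range(n) for i2 in range(i1, n)
         for j1 in range(m) for j2 in range(j1, m))
print(ok)                               # True
\end{verbatim}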
Lemma B.7. For all $(i,j) \in \mathcal{P}$, either $w_{ij} > 0$, or there exist $1 \le i_1 < i_2 \le n$ and $m_{i_2} \le j_1 < j_2 \le M_{i_1}$ such that $i_1 \le i \le i_2$, $j \in \{j_1, j_2\}$ and $w_{i_1 j_2}, w_{i_2 j_1} > 0$. In other words, for all $(i,j) \in \mathcal{P}$, either $w_{ij} > 0$, or $(i,j)$ lies on the upper or lower edge of a rectangle whose upper-left and lower-right corners have strictly positive weight.

Proof of Lemma B.7. Let $(i,j) \in \mathcal{P}$ such that $w_{ij} = 0$. Since $w_i = \sum_{r = m_i}^{M_i} w_{ir} > 0$, it has to be that $m_i < M_i$; otherwise $w_i = w_{ij} = 0$. We distinguish several cases:

Case $j = M_i$: Define $i_1 := \min\{k : M_k = M_i\}$ and $j_2 := j = M_i = M_{i_1}$. By Lemma B.5 part (b), we know that $w_{i_1 j_2} > 0$. Now $w_{ij} = 0$ implies that $i_1 < i$. Letting $i_2 := \max\{k : m_k = m_i\}$ and $j_1 := m_i = m_{i_2} < M_i = j_2$, another application of Lemma B.5 part (b) shows that $w_{i_2 j_1} > 0$.

Case $j = m_i$: With a similar reasoning as in the previous case, we find that the numbers $i_1 := \min\{k : M_k = M_i\}$, $j_2 := M_i = M_{i_1}$, $i_2 := \max\{k : m_k = m_i\}$ and $j_1 := j = m_i = m_{i_2} < M_i = j_2$ fulfill the required properties.

Case $m_i < j < M_i$: By definition of $y_j$, the set $\{k : w_{kj} > 0\}$ is non-empty and does not include $i$. Let $i_o$ be one of its elements. In case $i_o > i$, we define $i_1 := \min\{k : M_k = M_i\}$, $i_2 := i_o$, $j_1 := j$ and $j_2 := M_i$. These numbers are such that $i_1 \le i < i_o = i_2$, $j_1 = j < M_i = j_2$ and $w_{i_1 j_2}, w_{i_2 j_1} > 0$, by Lemma B.5 part (b). Likewise, in case $i_o < i$, the numbers $i_1 := i_o$, $i_2 := \max\{k : m_k = m_i\}$, $j_1 := m_i$ and $j_2 := j$ fulfill the required properties.

Lemma B.8. A maximizer $p \in [0,1]^{n \times m}$ of $L$ satisfying (2.2) and (2.4) must satisfy (2.6)--(2.7). Additionally, we have that $L(p) > 0$.

Proof of Lemma B.8. Let $p \in [0,1]^{n \times m}$ be a maximizer of $L$ satisfying (2.2) and (2.4). Property (2.6) is a direct consequence of (2.4).

Next, we show that a maximizer $p$ of $L$ must be such that $L(p) > 0$. For that, let $r := [r_1, r_2, \dots, r_n]^\top \in [0,1]^{n \times m}$, where $r_i := (r_{ij})_{j=1}^m$ is defined by $r_{ij} := 1/(M_i - m_i + 1)$ if $j \in \mathcal{J}_i$, and $r_{ij} := 0$ otherwise. Since $r$ satisfies (2.6)--(2.7), we obtain by means of Lemma B.6 that (2.4) holds. Because (2.2) is satisfied as well, the matrix $r$ is a feasible parameter such that
\[ L(p) \;\ge\; L(r) \;=\; \prod_{(i,j) \in \mathcal{P}} r_{ij}^{w_{ij}} \;\ge\; \Bigl(\min_{(i,j) \in \mathcal{P}} r_{ij}\Bigr)^{nm} \;>\; 0. \]
As to (2.7), we first prove that $v_i := \sum_{j \in \mathcal{J}_i} p_{ij} = 1$ for all $i \in \mathcal{I}$; together with (2.2), this yields $p_{ij} = 0$ if $j \notin \mathcal{J}_i$. For that, we verify first that $v_i > 0$ for all $i \in \mathcal{I}$. Ab absurdo, suppose that there exists $i_o \in \mathcal{I}$ such that $p_{i_o j} = 0$ for all $j \in \mathcal{J}_{i_o}$. Since $w_{i_o} > 0$, there is at least one $j_o \in \mathcal{J}_{i_o}$ such that $w_{i_o j_o} > 0$, which yields the contradiction $L(p) = 0$.

Thus, we suppose that $0 < v_i \le 1$ for all $i \in \mathcal{I}$. Let $q = [q_1, q_2, \dots, q_n]^\top$ be the $n \times m$ matrix with each $q_i := (q_{ij})_{j=1}^m$ defined as follows:
\[ q_{ij} := \begin{cases} p_{ij}/v_i & \text{if } (i,j) \in \mathcal{P}, \\ 0 & \text{otherwise.} \end{cases} \]
These coefficients are such that $q_{ij} \ge p_{ij}$ if $j \in \mathcal{J}_i$, and $\sum_{j=1}^m q_{ij} = 1$ for all $i \in \mathcal{I}$, so $q \in [0,1]^{n \times m}$ and it satisfies (2.2). Furthermore, since $p$ satisfies (2.4), it holds that $q_{i,j+1} \cdot q_{i+1,j} \le q_{ij} \cdot q_{i+1,j+1}$ for all $m_{i+1} \le j < M_i$ and $1 \le i < n$, that means, $q$ satisfies (2.6)--(2.7). Lemma B.6 then implies that $q$ satisfies (2.4) as well. Therefore, $q$ is a feasible parameter such that
\[ L(p) \;=\; \prod_{(i,j) \in \mathcal{P}} p_{ij}^{w_{ij}} \;=\; \prod_{i \in \mathcal{I}} v_i^{w_i} \prod_{j \in \mathcal{J}_i} q_{ij}^{w_{ij}} \;\le\; L(q) \;\le\; L(p), \]
where the first inequality is due to the fact that $0 < v_i^{w_i} \le 1$, and the second to the fact that $p$ is a maximizer of $L$. But then all the inequalities are equalities, which forces $v_i = 1$ for all $i \in \mathcal{I}$.

Finally, let us show that $p_{ij} > 0$ for all $(i,j) \in \mathcal{P}$. Notice first that $p_{ij} > 0$ whenever $w_{ij} > 0$, since otherwise $L(p) = 0 < L(r)$, as seen previously. Hence we treat the case of $w_{ij} = 0$ for some $(i,j) \in \mathcal{P}$. From Lemma B.7, there exist $1 \le i_1 < i_2 \le n$ and $m_{i_2} \le j_1 < j_2 \le M_{i_1}$ with $i_1 \le i \le i_2$, $j \in \{j_1, j_2\}$ and $w_{i_1 j_2}, w_{i_2 j_1} > 0$, so $p_{i_1 j_2}, p_{i_2 j_1} > 0$. Suppose first that $j = j_2$, so $i_1 < i \le i_2$, and let us show that $p_{ij} > 0$. Ab absurdo, if $p_{ij} = 0$, then, since $p_{i_1 j} = p_{i_1 j_2} > 0$, the likelihood ratio ordering constraint (2.4) yields
\[ p_{i_1 j} \cdot p_{iu} \;\le\; p_{i_1 u} \cdot p_{ij} \;=\; 0 \quad\text{for all } u \le j, \]
so $p_{iu} = 0$ for all $u \le j$. In case $i = i_2$, we would get a contradiction with $p_{i_2 j_1} > 0$, and if $i_1 < i < i_2$, we would find that
\[ \inf \mathrm{supp}\bigl(\widehat P_{x_i}\bigr) \;>\; y_j = y_{j_2} \;>\; y_{j_1} \;\ge\; \inf \mathrm{supp}\bigl(\widehat P_{x_{i_2}}\bigr) \]
and contradict Remark A.6. In case $j = j_1$, claiming that $p_{ij} = 0$ would lead us to a similar contradiction.

Lemma B.9. Let $1 \le i \le n$ and $1 \le j \le m$ be arbitrary indices. Then (1) $j \le M_i$ if, and only if, $n_j \le i$; (2) $m_i \le j$ if, and only if, $i \le N_j$.

Proof of Lemma B.9. To show (1), consider the following chain of equivalences:
\[ j > M_i \;\Longleftrightarrow\; (X_k \le x_i \Rightarrow Y_k < y_j) \;\Longleftrightarrow\; (Y_k \ge y_j \Rightarrow X_k > x_i) \;\Longleftrightarrow\; i < n_j. \]
Whence $j \le M_i$ if, and only if, $i \ge n_j$. Statement (2) is proved similarly.

Lemma B.10. Let $c(\theta) := \bigl(c_{ij}(\theta)\bigr)_{(i,j) \in \mathcal{P}}$ be defined for all $\theta \in \Theta$ by
\[ c_{ij}(\theta) := \begin{cases} \log\Bigl(\sum_{r \in \mathcal{J}_i} \exp\bigl(\sum_{s = m_i}^r \theta_{is}\bigr)\Bigr) & \text{if } j = m_i, \\ 0 & \text{otherwise.} \end{cases} \]
Then the parameter $\theta' := \theta - c(\theta)$ is such that $\theta' \in \Theta$ and $f(\theta') \le f(\theta)$, with equality if, and only if, $c(\theta) = 0$.

Proof of Lemma B.10. We prove the result with slightly different notations. Let $\lambda = (\lambda_{ij})_{(i,j) \in \mathcal{P}} \in \mathbb{R}^{\mathcal{P}}$ be defined by $\lambda_{ij} := \sum_{s = m_i}^j \theta_{is}$. Then $\theta \in \Theta$ if, and only if,
\[ \lambda \in \Lambda := \bigl\{\lambda \in \mathbb{R}^{\mathcal{P}} : \lambda_{i,j+1} + \lambda_{i+1,j} \le \lambda_{ij} + \lambda_{i+1,j+1} \text{ for } m_{i+1} \le j < M_i,\ 1 \le i < n\bigr\}, \]
and $\theta$ satisfies constraint (2.8) if, and only if, $\lambda$ is such that
\[ \sum_{j \in \mathcal{J}_i} \exp(\lambda_{ij}) = 1 \quad\text{for all } i \in \mathcal{I}. \tag{B.4} \]
Finally, the functional
\[ \ell(\lambda) := \sum_{(i,j) \in \mathcal{P}} \bigl(w_{ij} \lambda_{ij} - w_i \exp(\lambda_{ij})\bigr) \]
is such that $\ell(\lambda) = -f(\theta)$.

Let us now define
\[ c_i := \log \sum_{r \in \mathcal{J}_i} \exp(\lambda_{ir}) \quad\text{for all } 1 \le i \le n, \]
and the parameter $\lambda' \in \mathbb{R}^{\mathcal{P}}$ with $\lambda'_{ij} := \lambda_{ij} - c_i$. Then $\lambda' \in \Lambda$ and it satisfies (B.4). In particular, since $\exp(c) \ge 1 + c$ for all $c \in \mathbb{R}$, with equality if, and only if, $c = 0$, we have
\begin{align*}
\ell(\lambda') &= \sum_{(i,j) \in \mathcal{P}} w_{ij} (\lambda_{ij} - c_i) - \sum_{i \in \mathcal{I}} w_i \sum_{j \in \mathcal{J}_i} \exp(\lambda_{ij} - c_i) \\
&= \sum_{(i,j) \in \mathcal{P}} w_{ij} \lambda_{ij} - \sum_{i \in \mathcal{I}} w_i (1 + c_i) \\
&\ge \sum_{(i,j) \in \mathcal{P}} w_{ij} \lambda_{ij} - \sum_{i \in \mathcal{I}} w_i \exp(c_i) \;=\; \ell(\lambda),
\end{align*}
with equality if, and only if, $c_i = 0$ for all $i \in \mathcal{I}$. In consequence, replacing $\lambda$ with $\lambda'$ has the effect of strictly increasing the log-likelihood score, unless $\lambda$ is normalized already. Since $\theta'_{ij} := \lambda'_{i,j} - \lambda'_{i,j-1} = \lambda_{i,j} - \lambda_{i,j-1} = \theta_{ij}$ for all $m_i < j \le M_i$ and $i \in \mathcal{I}$, the corresponding update on $\theta$ is given by the parameter $\theta' := \theta - c(\theta)$.
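In algorithmic terms, Lemma B.10 states that subtracting the row-wise log-normalization constants $c_i$ can only improve the surrogate criterion $\ell$. The following Python sketch illustrates this update under the simplifying assumption that every row has full support, i.e. $\mathcal{J}_i = \{1, \dots, m\}$; the weights are arbitrary illustrative values.

\begin{verbatim}
import numpy as np
from scipy.special import logsumexp

rng = np.random.default_rng(1)
n, m = 4, 6
lam = rng.normal(size=(n, m))     # generic parameter, lam[i, j] = lambda_ij
w = rng.random(size=(n, m))       # illustrative weights w_ij
wi = w.sum(axis=1)                # row weights w_i

def ell(lam):
    # l(lambda) = sum_ij ( w_ij lambda_ij - w_i exp(lambda_ij) )
    return np.sum(w * lam - wi[:, None] * np.exp(lam))

c = logsumexp(lam, axis=1)        # c_i = log sum_j exp(lambda_ij)
lam_new = lam - c[:, None]        # normalized parameter lambda'

print(np.allclose(np.exp(lam_new).sum(axis=1), 1.0))  # (B.4) holds
print(ell(lam_new) >= ell(lam))                       # True
\end{verbatim}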
Lemma B.11. An element $\theta \in \Theta$ minimizes $f$ if, and only if, it minimizes
\[ f_o(\theta) := -\sum_{(i,j) \in \mathcal{P}} w_{ij} \theta_{ij} \]
under constraint (2.8).

Proof of Lemma B.11. We use the same notations as in the proof of Lemma B.10. The equivalent claim therefore reads: an element $\lambda \in \Lambda$ maximizes $\ell$ if, and only if, it maximizes
\[ \ell_o(\lambda) := \sum_{(i,j) \in \mathcal{P}} w_{ij} \lambda_{ij} \]
under constraint (B.4).

Suppose first that $\lambda$ maximizes $\ell$. Then the directional derivative of $\ell$ at $\lambda$ in the direction of $(1_{[r=i]})_{(r,s) \in \mathcal{P}} \in \Lambda$ is given by
\[ \frac{\partial}{\partial c}\, \ell\bigl(\lambda + c\,(1_{[r=i]})_{(r,s) \in \mathcal{P}}\bigr)\Big|_{c=0} \;=\; w_i \Bigl(1 - \sum_{j \in \mathcal{J}_i} \exp(\lambda_{ij})\Bigr), \]
and it is equal to $0$ by optimality of $\lambda$. In consequence, $\lambda$ satisfies constraint (B.4). It also maximizes $\ell_o$, because $\ell = \ell_o - \sum_{i \in \mathcal{I}} w_i$ on the set of parameters satisfying (B.4): if $\tilde\lambda$ satisfies constraint (B.4) but is such that $\ell_o(\tilde\lambda) > \ell_o(\lambda)$, then we also have $\ell(\tilde\lambda) > \ell(\lambda)$, which is a contradiction with the optimality of $\lambda$.

Suppose now that $\tilde\lambda$ maximizes $\ell_o$ under constraint (B.4), and suppose that $\lambda$ is such that $\ell(\lambda) > \ell(\tilde\lambda)$. As explained in the proof of Lemma B.10, the parameter $\lambda'$ defined by $\lambda'_{ij} := \lambda_{ij} - c_i$ is such that $\ell(\lambda') \ge \ell(\lambda)$, and it satisfies (B.4). In particular, $\ell_o(\lambda') > \ell_o(\tilde\lambda)$, which contradicts the optimality of $\tilde\lambda$.
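Before turning to Section 3, the index duality of Lemma B.9 can be checked on simulated data. The Python sketch below uses the definitions of $m_i$, $M_i$, $n_j$ and $N_j$ as we read them off the proofs of Lemmas B.5 and B.9 (with 0-based indices), so it illustrates this reconstruction rather than any code of the authors.

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(3)
X = rng.integers(0, 6, size=30)          # illustrative data (X_k, Y_k)
Y = rng.integers(0, 6, size=30)
x, y = np.unique(X), np.unique(Y)        # grids x_1 < ... < x_n, y_1 < ... < y_m
n, m = len(x), len(y)

# Definitions as read off the proofs of Lemmas B.5 and B.9 (0-based):
M = [y.searchsorted(Y[X <= x[i]].max()) for i in range(n)]
mm = [y.searchsorted(Y[X >= x[i]].min()) for i in range(n)]
nn = [x.searchsorted(X[Y >= y[j]].min()) for j in range(m)]
N = [x.searchsorted(X[Y <= y[j]].max()) for j in range(m)]

# Duality of Lemma B.9: j <= M_i iff n_j <= i, and m_i <= j iff i <= N_j.
print(all((j <= M[i]) == (nn[j] <= i) and (mm[i] <= j) == (i <= N[j])
          for i in range(n) for j in range(m)))   # True
\end{verbatim}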
B.4 Proofs for Section 3

Lemma B.12. For $\boldsymbol w, \boldsymbol y \in \mathbb{R}^d$ with $\boldsymbol w > 0$, the mapping $(\boldsymbol w, \boldsymbol y) \mapsto \mathrm{E}_{\boldsymbol w}\bigl(\boldsymbol y \,|\, \mathbb{R}^d_\uparrow\bigr)$ is continuous in both arguments.

Proof of Lemma B.12. For $\boldsymbol w, \boldsymbol y \in \mathbb{R}^d$ with $\boldsymbol w > 0$, it is well known that
\[ \mathrm{Iso}(\boldsymbol y, \boldsymbol w) \;=\; \Bigl(\min_{s \ge i} \max_{r \le i} A_{rs}(\boldsymbol w, \boldsymbol y)\Bigr)_{i=1}^d, \]
where each function
\[ A_{rs}(\boldsymbol w, \boldsymbol y) \;:=\; \mathop{\arg\min}_{x \in \mathbb{R}} \sum_{i=r}^s w_i (y_i - x)^2 \;=\; \frac{\sum_{i=r}^s w_i y_i}{\sum_{i=r}^s w_i}, \quad 1 \le r \le s \le d, \]
is continuous in both its arguments. But since the functions $x \mapsto \max\bigl(\gamma_1(x), \gamma_2(x)\bigr)$ and $x \mapsto \min\bigl(\gamma_1(x), \gamma_2(x)\bigr)$ are continuous whenever $\gamma_1$ and $\gamma_2$ are continuous, we find that $(\boldsymbol w, \boldsymbol y) \mapsto \min_{s \ge i} \max_{r \le i} A_{rs}(\boldsymbol w, \boldsymbol y)$ is continuous in both its arguments.

Proof of Lemma 3.1. In view of the structure of the constrained set $\Theta$, one can decompose the minimization of $f_\theta$ under constraint into a sum of independent minimization problems:
\[ \min_{\eta \in \Theta} f_\theta(\eta) \;=\; c_\theta + \sum_{i \in \mathcal{I}} A_i + \sum_{j=2}^m B_j, \]
with
\begin{align*}
A_i &:= \min_{\eta \in \mathbb{R}} H_{i m_i}(\theta) \bigl(\eta - \tilde\theta_{i m_i}\bigr)^2 \;=\; 0, \\
B_j &:= \min_{(\eta_{ij})_{i=n_j}^{N_j - 1} \in \mathbb{R}^{d_j}_\uparrow} \sum_{i=n_j}^{N_j - 1} H_{ij}(\theta) \bigl(\eta_{ij} - \tilde\theta_{ij}\bigr)^2.
\end{align*}
The minimization problem of each $A_i$ is obvious and yields the unique solution $\eta^*_{i m_i}(\theta) := \tilde\theta_{i m_i}(\theta)$ for each $i \in \mathcal{I}$. As to the minimization problem of each $B_j$, $1 < j \le m$, it is treated with the theory of isotonic regression (Barlow et al., 1972; Robertson et al., 1988), which aims at determining the unique element $\boldsymbol z \in \mathbb{R}^d_\uparrow$ that minimizes
\[ \sum_{k=1}^d b_k (z_k - \xi_k)^2 \]
for given dimension $d \ge 1$, positive weights $\boldsymbol b := (b_k)_{k=1}^d$ and vector $\boldsymbol\xi := (\xi_k)_{k=1}^d$. For such generic problems, the solution is given by the weighted least squares projection of $\boldsymbol\xi$ onto $\mathbb{R}^d_\uparrow$, which we denote by $\mathrm{Iso}(\boldsymbol\xi, \boldsymbol b)$, and it is efficiently computed via the pool-adjacent-violators algorithm. We therefore define
\[ \bigl(\eta^*_{ij}(\theta)\bigr)_{i=n_j}^{N_j - 1} \;=\; \mathrm{Iso}\Bigl(\bigl(\tilde\theta_{ij}(\theta)\bigr)_{i=n_j}^{N_j - 1},\ \bigl(H_{ij}(\theta)\bigr)_{i=n_j}^{N_j - 1}\Bigr) \]
for each $1 < j \le m$. To see that the mapping $\theta \mapsto \eta^*(\theta)$ is continuous, recall that $f$ is infinitely differentiable on $\Theta$, so $\theta \mapsto H_{ij}(\theta)$ and therefore $\theta \mapsto \tilde\theta_{ij}(\theta)$ are continuous for all $(i,j) \in \mathcal{P}$. Lemma B.12 allows us to conclude.
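Since each subproblem $B_j$ is a weighted isotonic regression, a compact implementation of $\mathrm{Iso}(\boldsymbol\xi, \boldsymbol b)$ may help the reader; in practice one would rather rely on an existing routine such as scikit-learn's IsotonicRegression. The following Python sketch implements the pool-adjacent-violators algorithm and cross-checks it against the min--max formula from the proof of Lemma B.12; it is an illustration only.

\begin{verbatim}
import numpy as np

def iso(xi, b):
    # Iso(xi, b): minimize sum_k b_k (z_k - xi_k)^2 over increasing z,
    # via the pool-adjacent-violators algorithm. Blocks store
    # (weighted mean, total weight, length).
    blocks = []
    for x, wgt in zip(xi, b):
        blocks.append((x, wgt, 1))
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            m2, w2, l2 = blocks.pop()
            m1, w1, l1 = blocks.pop()
            blocks.append(((w1 * m1 + w2 * m2) / (w1 + w2), w1 + w2, l1 + l2))
    return np.concatenate([np.full(l, mu) for mu, _, l in blocks])

def iso_minmax(xi, b):
    # Min-max characterization from the proof of Lemma B.12.
    d = len(xi)
    A = lambda r, s: np.dot(b[r:s + 1], xi[r:s + 1]) / b[r:s + 1].sum()
    return np.array([min(max(A(r, s) for r in range(i + 1))
                         for s in range(i, d)) for i in range(d)])

rng = np.random.default_rng(2)
xi, b = rng.normal(size=8), rng.random(8) + 0.1
print(np.allclose(iso(xi, b), iso_minmax(xi, b)))   # True
\end{verbatim}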
Lemma B.13. In the context of Section 3, we have that
\[ Df_\theta\bigl(\theta, v(\theta)\bigr) \;=\; 2\Bigl(f_\theta\bigl(\theta + v(\theta)\bigr) - f(\theta)\Bigr). \]

Proof of Lemma B.13. Notice first that the nature of $f_\theta$ yields that
\[ g : [0, \infty) \to \mathbb{R}, \quad t \mapsto f_\theta\bigl(\theta + t\,v(\theta)\bigr) - f(\theta), \]
is a strictly convex and quadratic function such that $g(0) = 0$ and $g(1) = f_\theta\bigl(\theta + v(\theta)\bigr) - f(\theta)$. Furthermore, since $\theta + v(\theta) = \eta^*$ is the unique minimizer of $f_\theta$ in $\Theta$, the function $g$ attains its minimum precisely at $t = 1$, so $g'(1) = 0$. In consequence,
\[ g(t) \;=\; 2\Bigl(f_\theta\bigl(\theta + v(\theta)\bigr) - f(\theta)\Bigr) t - \Bigl(f_\theta\bigl(\theta + v(\theta)\bigr) - f(\theta)\Bigr) t^2. \]
Hence, we find that
\[ Df_\theta\bigl(\theta, v(\theta)\bigr) \;=\; \lim_{t \downarrow 0} \frac{f_\theta\bigl(\theta + t\,v(\theta)\bigr) - f(\theta)}{t} \;=\; \lim_{t \downarrow 0} \frac{g(t)}{t} \;=\; 2\Bigl(f_\theta\bigl(\theta + v(\theta)\bigr) - f(\theta)\Bigr). \]

B.5 Proofs for Section 4

Lemma B.14. For any compact subset $\tilde\Theta \subset \Theta$, there exists a constant $D = D(\tilde\Theta) > 0$ such that
\[ \max_{t \in (0,1]} \Bigl(f(\theta) - f\bigl(\theta + t\,v(\theta)\bigr)\Bigr) \;\ge\; \frac{\Bigl(\min\bigl(-Df\bigl(\theta, v(\theta)\bigr),\, D\bigr)\Bigr)^2}{2D} \quad\text{for all } \theta \in \tilde\Theta. \]

Proof of Lemma B.14. In order to ease the readability of the proof, we assume that the parameter $\theta$ and all the relevant objects have been vectorized beforehand. That means, every pair of indices $(i,j) \in \mathcal{P}$ is enumerated, resulting in a problem with $K := \#\mathcal{P}$ dimensions.

The compactness of $\tilde\Theta$ and the continuity of $v$ (see Lemma 3.1) and $H$ yield the existence of constants $D_1, D_2 > 0$ such that
\[ \sup_{\theta \in \tilde\Theta} \|v(\theta)\| \;\le\; D_1, \qquad \sup_{s \in [0,1]}\, \sup_{\theta \in \tilde\Theta}\, \lambda_{\max}\Bigl(H\bigl(\theta + s\,v(\theta)\bigr)\Bigr) \;\le\; D_2. \]
We therefore obtain the following Taylor approximation of $f$ around $\theta \in \tilde\Theta$:
\[ f\bigl(\theta + t\,v(\theta)\bigr) \;=\; f(\theta) + t \cdot Df\bigl(\theta, v(\theta)\bigr) + t^2 \cdot v(\theta)^\top R\bigl(\theta + t\,v(\theta)\bigr)\, v(\theta), \]
where $R(x) = \bigl(R_{kl}(x)\bigr)_{k,l=1}^K$ is defined by
\[ R_{kl}(x) \;:=\; \int_0^1 (1-s)\, H_{kl}\bigl(\theta + s(x - \theta)\bigr)\, ds \]
and is such that
\[ \Bigl| R_{kl}\bigl(\theta + t\,v(\theta)\bigr) \Bigr| \;\le\; \int_0^1 (1-s) \Bigl| H_{kl}\bigl(\theta + st\,v(\theta)\bigr) \Bigr|\, ds \;\le\; \int_0^1 (1-s)\, \lambda_{\max}\Bigl(H\bigl(\theta + st\,v(\theta)\bigr)\Bigr)\, ds \;\le\; D_2/2, \]
since $\max_{k,l} |A_{kl}| \le \lambda_{\max}(A)$ for all positive definite matrices $A \in \mathbb{R}^{K \times K}$. Therefore, we find that
\[ \Bigl| v(\theta)^\top R\bigl(\theta + t\,v(\theta)\bigr)\, v(\theta) \Bigr| \;\le\; \frac{D_2}{2}\, K\, \|v(\theta)\|^2 \;\le\; \frac{K D_1^2 D_2}{2}, \]
where the first inequality results from Cauchy--Schwarz and the equivalence of norms. Consequently, with $D := K D_1^2 D_2$ we have that
\begin{align*}
\max_{t \in (0,1]} \Bigl(f(\theta) - f\bigl(\theta + t\,v(\theta)\bigr)\Bigr)
&= \max_{t \in (0,1]} \Bigl(-t \cdot Df\bigl(\theta, v(\theta)\bigr) - t^2 \cdot v(\theta)^\top R\bigl(\theta + t\,v(\theta)\bigr)\, v(\theta)\Bigr) \\
&\ge \max_{t \in (0,1]} \Bigl(-t \cdot Df\bigl(\theta, v(\theta)\bigr) - t^2 D/2\Bigr) \\
&= \begin{cases} Df\bigl(\theta, v(\theta)\bigr)^2 / (2D) & \text{if } -Df\bigl(\theta, v(\theta)\bigr)/D \in (0,1], \\[1mm] -Df\bigl(\theta, v(\theta)\bigr) - D/2 & \text{if } -Df\bigl(\theta, v(\theta)\bigr)/D > 1, \end{cases} \\
&\ge \frac{\Bigl(\min\bigl(-Df\bigl(\theta, v(\theta)\bigr), D\bigr)\Bigr)^2}{2D}
\end{align*}
for all $\theta \in \tilde\Theta$.

Lemma B.15. For $\theta_0 \in \Theta$, the sequence $(\theta_n)_{n \ge 0}$ defined by $\theta_{n+1} := \psi(\theta_n)$ for all $n \ge 0$ is such that $\bigl(f(\theta_n)\bigr)_{n \ge 0}$ is convergent and $\lim_{n \to \infty} Df\bigl(\theta_n, v(\theta_n)\bigr) = 0$.

Proof of Lemma B.15. By construction of $\psi$, the sequence $\bigl(f(\theta_n)\bigr)_{n \ge 0}$ is strictly decreasing and bounded from below by $f(\theta^*)$. It is therefore convergent. In particular, we have that
\[ \lim_{n \to \infty}\, f(\theta_n) - f(\theta_{n+1}) \;=\; 0. \tag{B.5} \]
Suppose now ab absurdo that the desired claim does not hold. That means, there exist $\delta > 0$ and a subsequence $\bigl(n(k)\bigr)_{k \ge 1}$ of $(n)_{n \ge 0}$ such that
\[ -Df\bigl(\theta_{n(k)}, v(\theta_{n(k)})\bigr) \;\ge\; \delta \quad\text{for all } k \ge 1. \]
By convexity of $f$, the set
\[ \tilde\Theta \;:=\; \bigl\{\theta \in \Theta : f(\theta) \le f(\theta_0)\bigr\} \]
is a compact subset of $\Theta$ which contains $\bigl(\theta_{n(k)}\bigr)_{k \ge 1}$, so by Lemma B.14, there exists a constant $D > 0$ such that
\[ f(\theta_{n(k)}) - f\bigl(\theta_{n(k)+1}\bigr) \;\ge\; \frac{1}{2} \max_{t \in (0,1]} \Bigl(f(\theta_{n(k)}) - f\bigl(\theta_{n(k)} + t\,v(\theta_{n(k)})\bigr)\Bigr) \;\ge\; \frac{\Bigl(\min\bigl(-Df\bigl(\theta_{n(k)}, v(\theta_{n(k)})\bigr), D\bigr)\Bigr)^2}{4D} \;\ge\; \frac{\bigl(\min(\delta, D)\bigr)^2}{4D} \;>\; 0. \]
This contradicts property (B.5) and concludes the proof.
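The mechanism behind Lemmas B.14 and B.15 is the standard one for descent methods with a line search: a uniformly bounded second-order term converts each directional derivative into a guaranteed decrease, and the convergence of $\bigl(f(\theta_n)\bigr)_{n \ge 0}$ then forces the directional derivatives to vanish. The following toy Python sketch illustrates this on a smooth convex function; the negative gradient stands in for the direction $v(\theta)$ and an Armijo-type step-halving rule stands in for $\psi$, neither of which is the actual object of Section 4.

\begin{verbatim}
import numpy as np

# Smooth, strictly convex test function with minimum log(3) at the origin.
def f(th):
    a = np.array([th[0] + th[1], -th[0], -th[1]])
    return np.log(np.exp(a).sum())

def grad(th):
    a = np.array([th[0] + th[1], -th[0], -th[1]])
    p = np.exp(a - a.max()); p /= p.sum()
    return np.array([p[0] - p[1], p[0] - p[2]])

def psi(th):
    v = -grad(th)                 # stand-in for the direction v(theta)
    Df = -(v @ v)                 # directional derivative Df(theta, v) <= 0
    t = 1.0                       # step-halving (Armijo-type) line search
    while t > 1e-12 and f(th + t * v) - f(th) > 0.5 * t * Df:
        t /= 2
    return th + t * v, Df

th = np.array([2.0, -1.5])
for _ in range(60):
    th, Df = psi(th)
print(f(th) - np.log(3.0), Df)    # both tend to 0, as in Lemma B.15
\end{verbatim}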
Proof of Theorem 4.1. From Lemma B.15, we know already that $\bigl(f(\theta_n)\bigr)_{n \ge 0}$ is convergent. Ab absurdo, suppose that
\[ \lim_{n \to \infty} f(\theta_n) \;=\; f_\infty \;>\; f(\theta^*). \]
Because $(\theta_n)_{n \ge 0}$ is a sequence in the compact set
\[ \tilde\Theta \;:=\; \bigl\{\theta \in \Theta : f(\theta) \le f(\theta_0)\bigr\}, \]
there exist a subsequence $\bigl(n(k)\bigr)_{k \ge 1}$ of $(n)_{n \ge 0}$ and an element $\tilde\theta \in \tilde\Theta$ such that $\lim_{k \to \infty} \theta_{n(k)} = \tilde\theta$. The continuity of $f$ yields
\[ f(\tilde\theta) \;=\; \lim_{k \to \infty} f(\theta_{n(k)}) \;=\; f_\infty \;>\; f(\theta^*), \]
so $\tilde\theta \ne \theta^*$, but Lemma B.15 and the continuity of $\theta \mapsto Df\bigl(\theta, v(\theta)\bigr) = \nabla f(\theta)^\top v(\theta)$ give
\[ 0 \;=\; \lim_{n \to \infty} Df\bigl(\theta_n, v(\theta_n)\bigr) \;=\; \lim_{k \to \infty} Df\bigl(\theta_{n(k)}, v(\theta_{n(k)})\bigr) \;=\; Df\bigl(\tilde\theta, v(\tilde\theta)\bigr) \;<\; 0, \]
which is a contradiction. Thus, $\lim_{n \to \infty} f(\theta_n) = f(\theta^*)$.