Estimation of a Likelihood Ratio Ordered Family of Distributions -- with a Connection to Total Positivity

Alexandre Mösching and Lutz Dümbgen
University of Bern

July 23, 2020
Abstract
We consider bivariate observations $(X_1,Y_1),\ldots,(X_n,Y_n) \in \mathcal{X}\times\mathbb{R}$ with a real set $\mathcal{X}$ such that, conditional on the $X_i$, the $Y_i$ are independent random variables with distribution $P_{X_i}$, where $(P_x)_{x\in\mathcal{X}}$ is unknown. Using an empirical likelihood approach, we devise an algorithm to estimate the unknown family of distributions $(P_x)_{x\in\mathcal{X}}$ under the sole assumption that this family is increasing with respect to likelihood ratio order. We review the latter concept and realize that no further assumption such as all distributions $P_x$ having densities or having a common countable support is needed. The benefit of the stronger regularization imposed by likelihood ratio ordering over the usual stochastic ordering is evaluated in terms of estimation and predictive performances on simulated as well as real data.

Keywords: Likelihood ratio order, stochastic order, empirical likelihood, quasi-Newton method.
AMS 2000 subject classifications:
Acknowledgements.
This work was supported by the Swiss National Science Foundation. The authors are grateful to Johanna Ziegel and Alexander Jordan for stimulating discussions.
1 Introduction

Suppose that $Y_1$ and $Y_2$ are random variables with densities $f_1$ and $f_2$ with respect to some dominating measure. We then say that $Y_1$ is smaller than $Y_2$ in the likelihood ratio order if the density ratio $f_2/f_1$ is increasing (i.e. non-decreasing) on $\{f_1 + f_2 > 0\}$. This notion of order is stronger than the usual stochastic order, according to which $Y_1$ is stochastically smaller than $Y_2$ if $\mathrm{IP}(Y_1 > y) \le \mathrm{IP}(Y_2 > y)$ for all $y$. However, it is still a plausible order on distributions which appears natural on many occasions.

In discriminant analysis, for instance, one assigns an observation $y\in\mathbb{R}$ to population $f_2$ if the ratio $f_2(y)/f_1(y)$ exceeds a certain prefixed threshold. If the density ratio $f_2/f_1$ is increasing, this amounts to separating $\mathbb{R}$ into two half-lines that are then used as classification regions.

In a similar fashion, if $(f_\theta)_{\theta\in\Theta}$ is a one-parameter family of strictly positive densities with $\Theta\subset\mathbb{R}$, an extension of the Neyman–Pearson lemma states the existence of uniformly most powerful tests of $\theta\le\theta_o$ against $\theta>\theta_o$ if the ratio $f_{\theta_2}(y)/f_{\theta_1}(y)$ is increasing in $y$ for all $\theta_1\le\theta_2$, see Karlin and Rubin (1956); Lehmann and Rojo (1992). Such a one-parameter family of distributions could for example be an exponential family whose densities are of the form
$$f_\theta(y) = h(y)\exp\bigl(r(\theta)\,y - s(\theta)\bigr),$$
for some measurable function $h$, a strictly increasing function $r\colon\Theta\to\mathbb{R}$ and normalizing constants $s(\theta)$. Then the ratio $f_{\theta_2}(y)/f_{\theta_1}(y)$ is increasing in $y$, provided that $\theta_1\le\theta_2$. Furthermore, likelihood ratio ordering is also a frequent assumption or implication of models in mathematical finance, see Beare and Moon (2015); Jewitt (1991).

Thus far, estimation of distributions under a likelihood ratio ordering constraint was limited to the two-population setting. First, Dykstra et al. (1995) estimated the parameters of two multinomial distributions that are likelihood ratio ordered via a restricted maximum likelihood approach. After reparametrization, they found that the maximization problem at hand reduced to a specific bioassay problem treated by Robertson et al. (1988) which makes use of the theory of isotonic regression. They suggested that their approach generalizes to any two distributions that are absolutely continuous with respect to some dominating measure. Later, Carolan and Tebbs (2005) focused on testing procedures for the equality of two continuous distribution functions $F_1$ and $F_2$ versus the hypothesis of their likelihood ratio ordering. To this end, they made use of the equivalence between likelihood ratio ordering and the convexity of the ordinal dominance curve $\alpha\mapsto F_2\bigl(F_1^{-1}(\alpha)\bigr)$, $\alpha\in[0,1]$, when $F_2$ is absolutely continuous with respect to $F_1$. The convexity of the ordinal dominance curve was also exploited by Westling et al. (2019) to provide nonparametric maximum likelihood estimators of $F_1$ and $F_2$ under likelihood ratio order for discrete, continuous, as well as mixed continuous-discrete distributions. Their approach builds on the idea of Carolan and Tebbs (2005), where estimators are produced from the greatest convex minorant of the empirical ordinal dominance curve. However, this method still necessitates the restrictive assumption that $F_2$ is absolutely continuous with respect to $F_1$. Other relevant references on the topic are Yu et al. (2017), who treat the estimation problem with a maximum smoothed likelihood approach, requiring the choice of a kernel and bandwidth parameters, as well as Roosen and Hennessy (2004), who test for the monotone likelihood ratio assumption.

In the present paper, we consider a whole family of distributions $(P_x)_{x\in\mathcal{X}}$ on $\mathbb{R}$ indexed by a real set $\mathcal{X}$ and assume that this family is increasing with respect to likelihood ratio order as defined in Appendix A. In particular, no assumption of absolute continuity is required for our model. In Section 2, we formulate an empirical likelihood approach (Owen, 1988, 2001) to estimate the family $(P_x)_{x\in\mathcal{X}}$. We then find that the problem of maximizing the (empirical) likelihood under the likelihood ratio ordering constraint yields a finite-dimensional constrained optimization problem. To approach the unique solution, we devise an algorithm in Section 3 which adapts and extends ideas from Jongbloed (1998) and Dümbgen et al. (2006) to the present, more complex setting. It makes use of a quasi-Newton approach, and new search directions are obtained via multiple isotonic weighted least squares regressions. The proof of convergence of our algorithm is given in Section 4. Finally, in Section 5 we perform simulations to illustrate the benefits of the new estimation paradigm compared to the usual stochastic order constraint as discussed by Mösching and Dümbgen (2020) and Henzi et al. (2019).

The notion of likelihood ratio order between two random variables can be extended to a partial order on the set of all probability measures on the real line. This is explained in detail in Appendix A. That material generalizes definitions and results in Shaked and Shanthikumar (2007). We also explain the connection between two distributions being likelihood ratio ordered and the convexity of their ordinal dominance curve under weaker assumptions than in previous work, e.g. by Westling et al. (2019). Most proofs and technical details are deferred to Appendix B.
2 Modelling

For a set $\mathcal{X}\subset\mathbb{R}$, we consider a family of distributions $(P_x)_{x\in\mathcal{X}}$ on $\mathbb{R}$ and $N\ge 1$ observations $(X_1,Y_1),(X_2,Y_2),\ldots,(X_N,Y_N)\in\mathcal{X}\times\mathbb{R}$ such that, conditional on $(X_j)_{j=1}^N$, the random variables $Y_i$ are independent with
$$Y_i \mid (X_j)_{j=1}^N \sim P_{X_i}, \quad \text{for } 1\le i\le N.$$
Let $x_1<\cdots<x_n$ and $y_1<\cdots<y_m$ be the unique and sorted values of the covariates and responses, respectively. That means, $n,m\le N$ and
$$\{x_1,\ldots,x_n\} = \{X_1,X_2,\ldots,X_N\}, \qquad \{y_1,\ldots,y_m\} = \{Y_1,Y_2,\ldots,Y_N\}.$$
The absolute frequency of $(x_i,y_j)$ in the sample is given by the number
$$w_{ij} := \#\{k : (X_k,Y_k) = (x_i,y_j)\}.$$
Now we model the unknown family of distributions $(P_{x_i})_{i=1}^n$ by $(\widehat{P}_{x_i})_{i=1}^n$, where
$$\widehat{P}_{x_i} := \sum_{j=1}^m p_{ij}\,\delta_{y_j}. \qquad (2.1)$$
Here $\delta_y$ denotes Dirac measure at $y$, and $p := (p_{ij})\in[0,1]^{n\times m}$ is a parameter satisfying
$$\sum_{j=1}^m p_{ij} = 1 \quad \text{for } 1\le i\le n. \qquad (2.2)$$
To obtain an estimator of $(P_{x_i})_{i=1}^n$, one could then consider maximizing the empirical likelihood
$$L(p) := \prod_{i=1}^n\prod_{j=1}^m p_{ij}^{w_{ij}}, \qquad (2.3)$$
subject to constraint (2.2). In this simple unconstrained case, the maximum is known to be attained by $p_{ij} = w_{ij}/\sum_{j=1}^m w_{ij}$, that is, $\widehat{P}_{x_i}$ is the empirical distribution of the sample $\{Y_k : X_k = x_i\}$.

If the setting allows it, a natural constraint to set on $(P_x)_{x\in\mathcal{X}}$ is the usual stochastic order on distributions, also known as first order stochastic dominance: For $x_1,x_2\in\mathcal{X}$ such that $x_1\le x_2$, we say that $P_{x_1}$ is stochastically smaller than $P_{x_2}$ if $P_{x_1}\bigl((-\infty,y]\bigr)\ge P_{x_2}\bigl((-\infty,y]\bigr)$ for all $y\in\mathbb{R}$. As discussed by Mösching and Dümbgen (2020) and El Barmi and Mukerjee (2005), least squares estimation of $(P_{x_i})_{i=1}^n$ under the usual stochastic ordering constraint yields an estimator of the form (2.1), with probability weights $p_{ij}$ given by $q_{ij} - q_{i,j-1}$, where
$$q_{ij} := \min_{r\le i}\,\max_{s\ge i}\,\frac{\#\{k : x_r\le X_k\le x_s \text{ and } Y_k\le y_j\}}{\#\{k : x_r\le X_k\le x_s\}}, \qquad q_{i0} := 0.$$
These coefficients are computed via the pool-adjacent-violators algorithm, see Barlow et al. (1972); Robertson et al. (1988), or more efficiently via a sequential version of this procedure, see Henzi et al. (2020).

[Figure 1: Illustration of a data example with $x_i = i$ for $1\le i\le n = 6$ and $y_j = j$ for $1\le j\le m = 7$, together with the indices $m_i$, $M_i$, $n_j$, $N_j$ and $d_j$ introduced below.]

Throughout the remainder of this article, we assume that the family $(P_x)_{x\in\mathcal{X}}$ is increasing with respect to likelihood ratio order. If we use the model of $(P_{x_i})_{i=1}^n$ described in (2.1) with a parameter $p\in[0,1]^{n\times m}$ satisfying constraint (2.2), then likelihood ratio ordering of $(\widehat{P}_{x_i})_{i=1}^n$ is characterized in terms of $p$ as
$$p_{i_1 j_2}\cdot p_{i_2 j_1} \le p_{i_1 j_1}\cdot p_{i_2 j_2}, \quad \text{for all } 1\le i_1\le i_2\le n \text{ and } 1\le j_1\le j_2\le m. \qquad (2.4)$$
To estimate $(P_{x_i})_{i=1}^n$, we opt for a maximizer of the likelihood $L$ under constraints (2.2, 2.4). We refer the reader to Appendix A for a thorough discussion of likelihood ratio ordering and to Example A.10 to see how property (2.4) derives from the general definition.
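To make these objects concrete, the following short Python sketch (entirely illustrative; the toy data and the names `w` and `p_hat` are ours, not the paper's) builds the frequency matrix $w$, computes the unconstrained empirical likelihood maximizer $p_{ij} = w_{ij}/\sum_j w_{ij}$, and checks the likelihood ratio ordering property (2.4) through its two-by-two minors.

```python
import numpy as np

# Toy sample (X_k, Y_k); in the paper these are arbitrary real pairs.
X = np.array([1, 1, 2, 2, 2, 3, 3])
Y = np.array([1.0, 2.0, 1.0, 2.0, 3.0, 2.0, 3.0])

xs, x_idx = np.unique(X, return_inverse=True)   # x_1 < ... < x_n
ys, y_idx = np.unique(Y, return_inverse=True)   # y_1 < ... < y_m
n, m = len(xs), len(ys)

# Frequency matrix w_ij = #{k : (X_k, Y_k) = (x_i, y_j)}.
w = np.zeros((n, m))
np.add.at(w, (x_idx, y_idx), 1)

# Unconstrained empirical likelihood maximizer: row-wise relative frequencies.
p_hat = w / w.sum(axis=1, keepdims=True)

def is_lr_ordered(p, tol=1e-12):
    """Check (2.4): p[i1,j2]*p[i2,j1] <= p[i1,j1]*p[i2,j2] for i1<=i2, j1<=j2."""
    n, m = p.shape
    for i1 in range(n):
        for i2 in range(i1, n):
            for j1 in range(m):
                for j2 in range(j1, m):
                    if p[i1, j2] * p[i2, j1] > p[i1, j1] * p[i2, j2] + tol:
                        return False
    return True

print(p_hat)
print("LR ordered:", is_lr_ordered(p_hat))
```

In general the row-wise empirical distributions violate (2.4), which is precisely why the constrained maximizer developed next is needed.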
The problem of finding a maximizer of $L$ under constraints (2.2, 2.4) can be greatly simplified thanks to the structure of the data and the likelihood ratio order constraint itself. These aspects allow us to formulate an equivalent though much simpler problem involving fewer parameters to estimate. We start with some notation, exemplified in Figure 1. For $i\in\mathcal{I} := \{1,\ldots,n\}$ we define indices $1\le m_i\le M_i\le m$ via
$$y_{m_i} = \min\{Y_k : X_k\ge x_i\}, \qquad y_{M_i} = \max\{Y_k : X_k\le x_i\}.$$
The inequality $m_i\le M_i$ follows from the fact that $X_k = x_i$ for at least one index $k$. In particular, if $w_{ij} > 0$, then $m_i\le j\le M_i$. Consequently, with $\mathcal{J}_i := \{j : m_i\le j\le M_i\}$, we may write
$$w_{i+} := \#\{k : X_k = x_i\} = \sum_{j=1}^m w_{ij} = \sum_{j\in\mathcal{J}_i} w_{ij}.$$
Note also that $m_1\le\cdots\le m_n$ and $M_1\le\cdots\le M_n$. With the set
$$\mathcal{P} := \bigl\{(i,j) : i\in\mathcal{I},\ j\in\mathcal{J}_i\bigr\}$$
of relevant pairs of indices for $p$, the likelihood function becomes
$$L(p) = \prod_{(i,j)\in\mathcal{P}} p_{ij}^{w_{ij}}, \qquad (2.5)$$
and a maximizer $p$ of $L$ under constraints (2.2, 2.4) must satisfy
$$p_{i,j+1}\cdot p_{i+1,j} \le p_{ij}\cdot p_{i+1,j+1}, \quad \text{for all } m_{i+1}\le j < M_i \text{ and } 1\le i < n, \qquad (2.6)$$
and
$$p_{ij} \begin{cases} > 0 & \text{if } (i,j)\in\mathcal{P}, \\ = 0 & \text{otherwise}. \end{cases} \qquad (2.7)$$
Likewise, a maximizer of $L$ under constraints (2.2, 2.6–2.7) will satisfy constraint (2.4). Hence, it is equivalent to search for $p\in[0,1]^{n\times m}$ that satisfies (2.2, 2.6–2.7) and maximizes $L$. For rigorous proofs of these two claims, see Lemmata B.6 and B.8.

Negative log-probabilities and consecutive differences

In view of the likelihood (2.5) and constraint (2.7) on the parameter $p$, it is natural to reformulate the maximization problem at hand in terms of negative log-likelihood and log-probabilities. In addition, taking differences of log-parameters allows us to rephrase constraint (2.6) in terms of vectors with increasing components. We clarify this transition in the present section.

Let $\theta = (\theta_{ij})_{(i,j)\in\mathcal{P}}\in\mathbb{R}^{\mathcal{P}}$ be defined by
$$\theta_{ij} := \begin{cases} \log(p_{i m_i}) & \text{if } j = m_i, \\ \log(p_{ij}) - \log(p_{i,j-1}) & \text{if } m_i < j\le M_i, \end{cases}$$
so that
$$p_{ij} = \begin{cases} \exp\Bigl(\sum_{s=m_i}^{j}\theta_{is}\Bigr) & \text{if } (i,j)\in\mathcal{P}, \\ 0 & \text{otherwise}. \end{cases}$$
Then, constraint (2.6) is equivalent to $\theta_{ij}\le\theta_{i+1,j}$ for all $m_{i+1} < j\le M_i$ and $1\le i < n$, whereas constraint (2.2) reads
$$\sum_{j\in\mathcal{J}_i}\exp\Bigl(\sum_{s=m_i}^{j}\theta_{is}\Bigr) = 1, \quad \text{for all } i\in\mathcal{I}. \qquad (2.8)$$
Next we consider an arbitrary index $j\in\{1,\ldots,m\}$ and define indices $1\le n_j\le N_j\le n$ via
$$x_{n_j} := \min\{X_k : Y_k\ge y_j\}, \qquad x_{N_j} := \max\{X_k : Y_k\le y_j\}.$$
Then $m_{i+1} < j\le M_i$ if and only if $n_j\le i < N_j - 1$, for arbitrary indices $1\le i < n$ and $1 < j\le m$, see Lemma B.9. Furthermore, we have
$$\mathcal{P} = \bigl\{(i,m_i) : i\in\mathcal{I}\bigr\}\cup\Bigl(\bigcup_{j=2}^{m}\bigl\{(i,j) : n_j\le i\le N_j - 1\bigr\}\Bigr).$$
Consequently, constraint (2.6) is equivalently formulated as
$$(\theta_{ij})_{i=n_j}^{N_j-1}\in\mathbb{R}^{d_j}_{\uparrow}, \quad \text{whenever } 1 < j\le m \text{ and } d_j := N_j - n_j > 1,$$
where $\mathbb{R}^d_{\uparrow} := \{v\in\mathbb{R}^d : v_1\le v_2\le\cdots\le v_d\}$. The numbers $n_j$, $N_j$ and $d_j$ are exemplified in Figure 1. Finally, we define cumulative weights
$$\bar w_{ij} := \sum_{s=j}^{M_i} w_{is}, \quad \text{for all } (i,j)\in\mathcal{P}.$$
Then, the problem of maximizing $L$ under constraints (2.2, 2.6–2.7) becomes equivalent to minimizing the strictly convex functional
$$f(\theta) := \sum_{(i,j)\in\mathcal{P}}\Biggl(-\bar w_{ij}\theta_{ij} + w_{i+}\exp\Bigl(\sum_{s=m_i}^{j}\theta_{is}\Bigr)\Biggr) \qquad (2.9)$$
over the closed convex cone
$$\Theta := \bigl\{\theta\in\mathbb{R}^{\mathcal{P}} : (\theta_{ij})_{i=n_j}^{N_j-1}\in\mathbb{R}^{d_j}_{\uparrow} \text{ for all } 1 < j\le m\bigr\}.$$
The exponential terms in (2.9) are Lagrange-type terms accounting for the normalization of $\theta$ in (2.8), see Lemma B.11. In the sequel, we denote the unique minimizer of $f$ in $\Theta$ by
$$\theta^* := \arg\min_{\theta\in\Theta} f(\theta).$$
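To see the pieces of (2.9) in one place, here is a small Python sketch (our own, with illustrative names) evaluating the contribution of a single covariate row $i$ to $f$, anticipating the gradient and curvature quantities $g_{ij} = -\bar w_{ij} + H_{ij}$ and $H_{ij}$ used by the algorithm in Section 3.

```python
import numpy as np

# Contribution of one row i to the objective (2.9). theta_i collects
# (theta_{i,m_i}, ..., theta_{i,M_i}) and w_i the counts w_{i,m_i}, ..., w_{i,M_i}.
def row_objective(theta_i, w_i):
    w_plus = w_i.sum()                      # w_{i+} = #{k : X_k = x_i}
    wbar = np.cumsum(w_i[::-1])[::-1]       # cumulative weights wbar_{ij}
    cum = np.cumsum(theta_i)                # partial sums, i.e. log p_{ij}
    e = w_plus * np.exp(cum)                # Lagrange-type exponential terms
    f = np.sum(-wbar * theta_i + e)
    H = np.cumsum(e[::-1])[::-1]            # H_{ij} = w_{i+} sum_{r>=j} exp(...)
    g = -wbar + H                           # gradient entries of f
    return f, g, H

# theta encoding the empirical frequencies (0.3, 0.5, 0.2) of counts (3, 5, 2):
theta = np.array([np.log(0.3), np.log(0.5 / 0.3), np.log(0.2 / 0.5)])
f, g, H = row_objective(theta, np.array([3.0, 5.0, 2.0]))
print(f, g, H)
```

The full objective is the sum of such row contributions over $i = 1,\ldots,n$. For the toy values above the printed gradient is zero, reflecting that without the order constraint the minimizer encodes the empirical distribution of each row.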
3 The algorithm

3.1 Second order approximation

For $\eta\in\mathbb{R}^{\mathcal{P}}$ relatively close to a feasible parameter $\theta\in\Theta$, one can determine a second order approximation of $f$ around $\theta$ using its gradient
$$g(\theta) := \Bigl(\frac{\partial f}{\partial\theta_{ij}}(\theta)\Bigr)_{(i,j)\in\mathcal{P}} = \bigl(-\bar w_{ij} + H_{ij}(\theta)\bigr)_{(i,j)\in\mathcal{P}},$$
as well as the non-trivial entries of its Hessian matrix
$$\frac{\partial^2 f}{\partial\theta_{ij}\,\partial\theta_{k\ell}}(\theta) = \begin{cases} H_{i,\max(j,\ell)}(\theta) & \text{if } i = k\in\mathcal{I} \text{ and } j,\ell\in\mathcal{J}_i, \\ 0 & \text{otherwise}, \end{cases}$$
where $H(\theta) := \bigl(H_{ij}(\theta)\bigr)_{(i,j)\in\mathcal{P}}$ is defined by
$$H_{ij}(\theta) := \frac{\partial^2 f}{\partial\theta_{ij}^2}(\theta) = w_{i+}\sum_{r=j}^{M_i}\exp\Bigl(\sum_{s=m_i}^{r}\theta_{is}\Bigr).$$
A second order approximation of $f$ around $\theta$, with the Hessian replaced by its diagonal, therefore reads
$$f_\theta(\eta) := f(\theta) + \sum_{(i,j)\in\mathcal{P}}\Bigl(g_{ij}(\theta)(\eta_{ij} - \theta_{ij}) + \frac{1}{2}H_{ij}(\theta)(\eta_{ij} - \theta_{ij})^2\Bigr).$$
Since $H_{ij}(\theta) > 0$ for all $(i,j)\in\mathcal{P}$, the function $f_\theta$ is strictly convex. Furthermore, its summands can be rearranged to yield
$$f_\theta(\eta) = c_\theta + \frac{1}{2}\sum_{i\in\mathcal{I}} H_{i m_i}(\theta)\bigl(\eta_{i m_i} - \tilde\theta_{i m_i}(\theta)\bigr)^2 + \frac{1}{2}\sum_{j=2}^{m}\sum_{i=n_j}^{N_j-1} H_{ij}(\theta)\bigl(\eta_{ij} - \tilde\theta_{ij}(\theta)\bigr)^2,$$
with $c_\theta$ independent of $\eta$ and
$$\tilde\theta_{ij}(\theta) := \theta_{ij} - \frac{g_{ij}(\theta)}{H_{ij}(\theta)} = \theta_{ij} + \frac{\bar w_{ij}}{H_{ij}(\theta)} - 1.$$
It therefore possesses a unique minimizer under the constraint $\eta\in\Theta$, which can be characterized by means of isotonic regression, as indicated by the next lemma.

3.2 New search direction

Lemma 3.1. The vector $\eta^*(\theta) := \arg\min_{\eta\in\Theta} f_\theta(\eta)$ is characterized by
$$\eta^*_{i m_i}(\theta) = \tilde\theta_{i m_i}(\theta), \quad \text{for all } i\in\mathcal{I},$$
and
$$\bigl(\eta^*_{ij}(\theta)\bigr)_{i=n_j}^{N_j-1} = \mathrm{Iso}\Bigl(\bigl(\tilde\theta_{ij}(\theta)\bigr)_{i=n_j}^{N_j-1},\ \bigl(H_{ij}(\theta)\bigr)_{i=n_j}^{N_j-1}\Bigr), \quad \text{for all } 1 < j\le m,$$
where $\mathrm{Iso}(\xi, b)$ stands for the isotonic regression of $\xi\in\mathbb{R}^d$ with weights $b\in\mathbb{R}^d$, $b > 0$; that means, $\mathrm{Iso}(\xi, b)$ is the unique vector that minimizes $\sum_{k=1}^d b_k(z_k - \xi_k)^2$ over all $z\in\mathbb{R}^d_{\uparrow}$, see Barlow et al. (1972); Robertson et al. (1988). Additionally, the mapping $\theta\mapsto\eta^*(\theta)$ is continuous on $\Theta$.

In consequence, if $\eta^*(\theta)\ne\theta$, then $f_\theta\bigl(\eta^*(\theta)\bigr) < f_\theta(\theta) = f(\theta)$, and we define $v(\theta) := \eta^*(\theta) - \theta$. Because the gradients of $f_\theta$ and $f$ at $\theta$ are equal, we have that
$$Df\bigl(\theta, v(\theta)\bigr) = Df_\theta\bigl(\theta, v(\theta)\bigr) = 2\Bigl(f_\theta\bigl(\theta + v(\theta)\bigr) - f(\theta)\Bigr) < 0,$$
where the justification of the second equality is given in Lemma B.13. Thus, we find that $\theta\ne\theta^*$, and a strict improvement of $f$ can be found in the direction $v(\theta)$.
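Lemma 3.1 reduces each quasi-Newton step to weighted isotonic regressions, one per response level $j$. For completeness, here is a standard pool-adjacent-violators implementation of $\mathrm{Iso}(\xi, b)$; this is textbook code in the spirit of Barlow et al. (1972), not the authors' own implementation.

```python
import numpy as np

def iso(xi, b):
    """Weighted isotonic regression Iso(xi, b): minimize sum_k b_k (z_k - xi_k)^2
    over non-decreasing z, via the pool-adjacent-violators algorithm."""
    # Each block stores (weighted mean, total weight, block length).
    blocks = []
    for x, w in zip(xi, b):
        mean, weight, size = float(x), float(w), 1
        # Pool with previous blocks while monotonicity is violated.
        while blocks and blocks[-1][0] > mean:
            m0, w0, s0 = blocks.pop()
            mean = (w0 * m0 + weight * mean) / (w0 + weight)
            weight += w0
            size += s0
        blocks.append((mean, weight, size))
    return np.concatenate([np.full(s, m) for m, _, s in blocks])

print(iso(np.array([3.0, 1.0, 2.0, 4.0]), np.array([1.0, 2.0, 1.0, 1.0])))
# -> [1.666..., 1.666..., 2.0, 4.0]
```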
3.3 Step size correction

Suppose that $\theta\in\Theta$ is such that $Df_\theta\bigl(\theta, v(\theta)\bigr) < 0$, so $\theta\ne\theta^*$ and $f_\theta\bigl(\theta + v(\theta)\bigr) < f(\theta)$. In particular, there exists some number $t\in(0,1]$ such that
$$\rho_\theta(t) := f\bigl(\theta + t\,v(\theta)\bigr) - f(\theta) < 0.$$
We aim for a step size realizing a sufficiently large decrease of $f$. That means, we look for $t_*\in(0,1]$ such that $\rho_\theta(t_*)$ is as close as possible to $\min_{t\in(0,1]}\rho_\theta(t)$. For that, we proceed similarly as in Dümbgen et al. (2006). First, let $t_o := 2^{-n_o}$ with $n_o$ the smallest integer such that $\rho_\theta(2^{-n_o}) < 0$. We then define a Hermite interpolation of $\rho_\theta$:
$$\tilde\rho_\theta(t) := \rho_\theta'(0)\cdot t + t_o^{-1}\bigl(t_o^{-1}\rho_\theta(t_o) - \rho_\theta'(0)\bigr)\cdot t^2.$$
This new function is such that $\tilde\rho_\theta(0) = \rho_\theta(0) = 0$, $\tilde\rho_\theta(t_o) = \rho_\theta(t_o) < 0$ and
$$\tilde\rho_\theta'(0) = \rho_\theta'(0) = Df\bigl(\theta, v(\theta)\bigr) < 0.$$
The minimum of $\tilde\rho_\theta$ in $(0,1]$ is therefore attained at
$$t_* := \min\biggl(\frac{-\rho_\theta'(0)}{2\bigl(t_o^{-1}\rho_\theta(t_o) - \rho_\theta'(0)\bigr)},\ 1\biggr)\cdot t_o.$$
We then update $\theta$ with $\theta + t_*v(\theta)$. As shown in Lemma 1 of Dümbgen et al. (2006), this procedure ensures that the decrease $f(\theta) - f\bigl(\theta + t_*v(\theta)\bigr)$ is at least a universal positive fraction of $\max_{t\in(0,1]}\bigl(f(\theta) - f(\theta + t\,v(\theta))\bigr) > 0$.

3.4 Normalization

Let $\theta\in\Theta$ be a feasible parameter and $c(\theta) := \bigl(c_{ij}(\theta)\bigr)_{(i,j)\in\mathcal{P}}$ be defined by
$$c_{ij}(\theta) := \begin{cases} \log\Bigl(\sum_{r=m_i}^{M_i}\exp\bigl(\sum_{s=m_i}^{r}\theta_{is}\bigr)\Bigr) & \text{if } j = m_i, \\ 0 & \text{otherwise}. \end{cases}$$
Then the parameter $\theta' := \theta - c(\theta)$ is such that $\theta'\in\Theta$ and $f(\theta')\le f(\theta)$, with equality if and only if $c(\theta) = 0$, see Lemma B.10. In other words, replacing $\theta$ with $\theta - c(\theta)$ has the effect of decreasing the value of $f$, unless $\theta$ is normalized already. Since the vector $0$ is feasible, one can simply take its normalized version $\theta := -c(0)$ as a starting point.

3.5 Complete algorithm

Let $\varepsilon > 0$ be a given small threshold. The complete algorithm to minimize $f$ over $\Theta$ then reads:

    $\theta \leftarrow 0 - c(0)$
    $v(\theta) \leftarrow \bigl(\arg\min_{\eta\in\Theta} f_\theta(\eta)\bigr) - \theta$
    while $f_\theta\bigl(\theta + v(\theta)\bigr) - f(\theta) < -\varepsilon$ do
        $t_o \leftarrow 1$
        while $f\bigl(\theta + t_o\,v(\theta)\bigr) > f(\theta)$ do
            $t_o \leftarrow t_o/2$
        end while
        $t_* \leftarrow t_o\cdot\min\bigl(-\rho_\theta'(0)\,2^{-1}\bigl(t_o^{-1}\rho_\theta(t_o) - \rho_\theta'(0)\bigr)^{-1},\ 1\bigr)$
        $\theta \leftarrow \theta + t_*\,v(\theta)$
        $\theta \leftarrow \theta - c(\theta)$
        $v(\theta) \leftarrow \bigl(\arg\min_{\eta\in\Theta} f_\theta(\eta)\bigr) - \theta$
    end while
    return $\theta$
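The pseudocode translates almost line by line into code. Below is a structural sketch in Python; the helpers `f`, `f_quad` (for $f_\theta$), `search_direction` (the isotonic step of Lemma 3.1) and `normalize` (for $\theta\mapsto\theta - c(\theta)$) are hypothetical names standing for the quantities defined above, so this is an outline under those assumptions rather than the authors' implementation.

```python
def minimize_f(f, f_quad, search_direction, normalize, theta0, eps=1e-10):
    """Quasi-Newton scheme of Section 3: isotonic search direction,
    Hermite-interpolation step size, then normalization."""
    theta = normalize(theta0)
    v = search_direction(theta)                          # eta*(theta) - theta
    while f_quad(theta, theta + v) - f(theta) < -eps:
        rho0 = 2.0 * (f_quad(theta, theta + v) - f(theta))   # rho'_theta(0)
        t_o = 1.0
        while f(theta + t_o * v) > f(theta):             # find t_o with rho(t_o) < 0
            t_o /= 2.0
        rho_to = f(theta + t_o * v) - f(theta)
        t_star = t_o * min(-rho0 / (2.0 * (rho_to / t_o - rho0)), 1.0)
        theta = normalize(theta + t_star * v)
        v = search_direction(theta)
    return theta
```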
4 Convergence of the algorithm

The algorithmic mapping $\psi\colon\Theta\to\Theta$ constructed in Section 3 consists of the two steps
$$\theta \leftarrow \theta + t_*\,v(\theta), \qquad \theta \leftarrow \theta - c(\theta),$$
which have the effect of strictly decreasing the value of $f$, unless $\theta = \theta^*$.

Theorem 4.1. Let $\theta_0\in\Theta$ be any feasible starting point. Then, the sequence $(\theta_n)_{n\ge 0}$ in $\Theta$ resulting from the algorithmic mapping $\psi$,
$$\theta_{n+1} := \psi(\theta_n), \quad n\ge 0,$$
is such that $\lim_{n\to\infty} f(\theta_n) = \min_{\theta\in\Theta} f(\theta)$. In other words, $\lim_{n\to\infty}\theta_n = \theta^*$.

5 Simulations

In this section, we compare estimation and prediction performances of the likelihood ratio order constrained estimator presented in this article with the estimator under usual stochastic order obtained via isotonic distributional regression. The latter estimator was mentioned briefly in the introduction. It is extensively discussed in Henzi et al. (2019) and Mösching and Dümbgen (2020).
A Gamma model. We choose a parametric family of distributions from which we draw observations. We then use these data to provide distribution estimates which we compare with the truth. The specific model we have in mind is a family $(P_x)_{x\in\mathcal{X}}$ of Gamma distributions
$$\frac{\mathrm{d}P_x}{\mathrm{d}y}(y) = f_x(y) := \frac{b(x)^{-a(x)}}{\Gamma\bigl(a(x)\bigr)}\,y^{a(x)-1}\exp\bigl(-y/b(x)\bigr), \quad y > 0,$$
with shape function $a\colon\mathcal{X}\to(0,\infty)$ and scale function $b\colon\mathcal{X}\to(0,\infty)$. Then $(P_x)_{x\in\mathcal{X}}$ is increasing with respect to likelihood ratio ordering if, and only if, both $a$ and $b$ are non-decreasing. Recall that since the family is increasing in likelihood ratio order, it is also increasing with respect to the usual stochastic order. Figure 2 shows the true conditional density for the specific parameters $a$ and $b$ selected for this study.

[Figure 2: Conditional Gamma density with shape $a(x) := 2 + (x+1)^2$ and scale $b(x) := 1 - 1/\exp(10x)$ for $x$ in the interval $\mathcal{X} := [1, 4]$.]
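A short Python sketch of this sampling model follows, taking the shape and scale functions as read off the Figure 2 caption and an illustrative choice of $n_o$ and $N$ (both are assumptions for the example, not prescriptions from the paper).

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Shape and scale functions of the Gamma model (illustrative assumptions).
a = lambda x: 2.0 + (x + 1.0) ** 2
b = lambda x: 1.0 - 1.0 / np.exp(10.0 * x)

# Covariate grid X_o = {1, 1 + 3/n_o, ..., 4} and a sample of size N.
n_o, N = 16, 200
X_o = 1.0 + 3.0 / n_o * np.arange(n_o + 1)
X = rng.choice(X_o, size=N)                          # uniform draws from X_o
Y = stats.gamma.rvs(a(X), scale=b(X), random_state=rng)

# True conditional cdf F_x(y), used below to score the estimators.
F = lambda x, y: stats.gamma.cdf(y, a(x), scale=b(x))
```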
Sampling method. Let $n_o\in\mathbb{N}$ be a predefined number and let
$$\mathcal{X}_o := 1 + \frac{3}{n_o}\cdot\{0, 1, \ldots, n_o\} \subset \mathcal{X} := [1, 4].$$
For a given sample size $N\in\mathbb{N}$, the sample $(X_1,Y_1),(X_2,Y_2),\ldots,(X_N,Y_N)$ is obtained as follows: Draw $X_1,X_2,\ldots,X_N$ uniformly from $\mathcal{X}_o$ and sample each $Y_k$ independently from $P_{X_k}$. This yields unique covariates $x_1 < x_2 < \cdots < x_n$, as well as unique responses $y_1 < y_2 < \cdots < y_m$, for some $1\le n,m\le N$.

For each such sample, we compute estimators of $(P_{x_i})_{i=1}^n$ under the likelihood ratio ordering constraint, as well as under the usual stochastic ordering constraint. Using linear interpolation, we complete both families of estimates with covariates originally in $\{x_i\}_{i=1}^n$ to families of estimates with covariates in the full set $\mathcal{X}_o$. Simple calculations show that this extension preserves the original stochastic ordering of the estimated distributions. We therefore obtain estimates $(\widehat{P}_x)_{x\in\mathcal{X}_o}$ and $(\check{P}_x)_{x\in\mathcal{X}_o}$ under the likelihood ratio ordering and the usual stochastic ordering constraint, respectively. The corresponding families of cdf's are then written $\widehat{F} := (\widehat{F}_x)_{x\in\mathcal{X}_o}$ and $\check{F} := (\check{F}_x)_{x\in\mathcal{X}_o}$, whereas the truth is denoted by $F := (F_x)_{x\in\mathcal{X}_o}$. Although the performance of the empirical distribution is far worse than those of the two order constrained estimators, it is still useful to study its behavior, for example to better understand boundary effects. The family of empirical cdf's is written $\widehat{F}^{\mathrm{emp}} := (\widehat{F}^{\mathrm{emp}}_x)_{x\in\mathcal{X}_o}$.
A simple score. To verify the ability of the estimators $\widehat{F}$ and $\check{F}$ to retrieve the truth given by $F$, we compute Monte Carlo estimates of the mean and selected quantiles of each of the scores $R_x(\widehat{F}, F)$, $R_x(\check{F}, F)$ and $R_x(\widehat{F}^{\mathrm{emp}}, F)$ for each $x\in\mathcal{X}_o$, where
$$R_x(G, F) := \int\bigl|G_x(y) - F_x(y)\bigr|\,\mathrm{d}P_x(y),$$
with $G$ being either $\widehat{F}$, $\check{F}$ or $\widehat{F}^{\mathrm{emp}}$, and the integral being computed numerically. We also compute the relative change in score
$$100\cdot\frac{R_x(\widehat{F}, F) - R_x(\check{F}, F)}{R_x(\check{F}, F)}.$$
The results of the simulations are displayed in Figure 3. A first observation is that the performance of all three estimators deteriorates close to the boundary points of $\mathcal{X}$, and this is more pronounced for the two order constrained estimators. This is a known phenomenon from shape constrained inference. However, in the interior of $\mathcal{X}$, taking the stochastic ordering into account pays off.

The second column of plots in Figure 3 shows the relative change in score when estimating the family of distributions with a likelihood ratio ordering constraint instead of the usual stochastic ordering constraint. It is observed that the improvement in score becomes larger and occurs on a wider sub-interval of $\mathcal{X}$ as $n_o$ and $N$ increase. Only towards the boundary does the usual stochastic order seem to perform better.

[Figure 3: Monte Carlo simulations to evaluate estimation performances with the simple score, for the configurations of $(n_o, N)$ considered. Left column: simple scores $R_x(G, F)$ with $G$ being either $\widehat{F}$ (green), $\check{F}$ (blue) or $\widehat{F}^{\mathrm{emp}}$ (grey). Right column: relative change of the score when enforcing a likelihood ratio ordering constraint over the usual stochastic ordering constraint. The thicker line is the mean variation, whereas the thin lines are the 25%- and 75%-quantiles. Negative values represent an improvement in score.]
Theoretical predictive performances. Using the same Gamma model, we evaluate predictive performances of both estimators using the continuous ranked probability score
$$\mathrm{CRPS}(G_x, y) := \int\bigl(G_x(z) - 1[y\le z]\bigr)^2\,\mathrm{d}z.$$
The CRPS is a strictly proper scoring rule which allows for comparisons of probabilistic forecasts, see Gneiting and Raftery (2007) and Jordan et al. (2019). It can be seen as an extension of the mean absolute error to probabilistic forecasts. The CRPS is therefore interpreted in the same unit of measurement as the true distribution or data.

Because in our case the true underlying distribution is known, we study the expected CRPS given by
$$S_x(G, F) := \int\mathrm{CRPS}(G_x, y)\,\mathrm{d}P_x(y) = \sum_{j=0}^{m}\int_{[y_j, y_{j+1})}\bigl(G_x(y_j) - F_x(z)\bigr)^2\,\mathrm{d}z + \frac{b(x)}{B\bigl(1/2, a(x)\bigr)},$$
for $G\in\{\widehat{F}, \check{F}, \widehat{F}^{\mathrm{emp}}\}$, where $y_0 := -\infty$, $y_{m+1} := +\infty$, $B(\cdot,\cdot)$ is the beta function, and the latter integrals are computed via numerical integration.

Consequently, we compute Monte Carlo estimates of the mean and selected quantiles of $S_x(\widehat{F}, F)$, $S_x(\check{F}, F)$ and $S_x(\widehat{F}^{\mathrm{emp}}, F)$, as well as the relative change in score
$$100\cdot\frac{S_x(\widehat{F}, F) - S_x(\check{F}, F)}{S_x(\check{F}, F)}$$
for each $x\in\mathcal{X}_o$.

Figure 4 outlines the results of the simulations. Similar boundary effects as for the simple score are observed. On the interior of $\mathcal{X}$, the usual stochastic order improves on the naive empirical estimator, and the likelihood ratio order yields the best results.

[Figure 4: Monte Carlo simulations to evaluate prediction performances using a CRPS-type score, arranged as in Figure 3 with $S_x(G, F)$ in place of $R_x(G, F)$.]
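Since all estimators considered here are step functions supported on the observed $y_j$, the CRPS integral can be evaluated exactly. The following helper is our own sketch of such a computation, not the authors' code; the function name and interface are illustrative.

```python
import numpy as np

def crps_step(points, cdf, y):
    """Exact CRPS(G, y) = integral of (G(z) - 1[y <= z])^2 dz for a step-function
    predictive cdf G that jumps at the sorted `points` to the values `cdf`
    (non-decreasing, ending at 1); G is 0 below points[0] and 1 above points[-1]."""
    grid = np.sort(np.unique(np.append(points, y)))
    # Value of G on each interval [grid[k], grid[k+1]) is its value at grid[k].
    G_vals = np.append(0.0, cdf)[np.searchsorted(points, grid, side="right")]
    ind = (grid >= y).astype(float)          # indicator 1[y <= z] on each interval
    widths = np.diff(grid)
    return float(np.sum((G_vals[:-1] - ind[:-1]) ** 2 * widths))

# Example: a three-point forecast scored against the observation y = 2.5.
print(crps_step(np.array([1.0, 2.0, 3.0]), np.array([0.2, 0.7, 1.0]), 2.5))
```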
In terms of relative change in score, it appears that imposing a likelihood ratio ordering constraint to estimate the family of distributions yields an average score reduction of about 2% in comparison with the usual stochastic ordering estimator for a sample of $N = 50$. For larger values of $N$, this improvement occurs on a wider subinterval of $\mathcal{X}$ and more frequently, as shown by the 75% mark.

Note further that the expected CRPS increases on the interior of $\mathcal{X}$. This is due to the fact that the CRPS has the same unit of measurement as the response variable. Since the scale of the response, characterized by $b$, increases with $x$, so does the corresponding score.

[Figure 5: Subsample of the weight for age data (age in years versus weight in kg). A logarithmic scale was used for the weight variable.]
Empirical predictive performances. We use the weight for age dataset already studied in Mösching and Dümbgen (2020). It comprises the age and weight of $N := 16\,344$ girls with an age in $\mathcal{X} := [2, 16]$ years, of which we present a subsample in Figure 5. The dataset was publicly released as part of the National Health and Nutrition Examination Survey conducted in the US between 1963 and 1991 and was analyzed by Kuczmarski et al. (2002) with parametric models to produce smooth quantile curves.

Although the likelihood ratio ordering constraint is much harder to justify than the very natural stochastic ordering constraint, we are interested in the effect of the stronger regularization imposed by the former constraint.

The forecast evaluation is performed using a leave-$N_{\mathrm{train}}$-out cross-validation scheme. More precisely, we choose random subsets $\mathcal{D}_{\mathrm{train}}$ of $N_{\mathrm{train}}$ observations which we use to train our estimators $\widehat{F}$ and $\check{F}$. Using the rest of the $N_{\mathrm{test}} := N - N_{\mathrm{train}}$ data pairs in $\mathcal{D}_{\mathrm{test}}$, we evaluate predictive performances by computing the sample mean and selected quantiles of $\widehat{S}_x(\widehat{F}, F)$, $\widehat{S}_x(\check{F}, F)$, $\widehat{S}_x(\widehat{F}^{\mathrm{emp}}, F)$ and
$$100\cdot\frac{\widehat{S}_x(\widehat{F}, F) - \widehat{S}_x(\check{F}, F)}{\widehat{S}_x(\check{F}, F)}$$
for each $x\in\mathcal{X}_o$, where
$$\widehat{S}_x(G, F) := \frac{\sum_{(X,Y)\in\mathcal{D}_{\mathrm{test}} : X = x}\mathrm{CRPS}(G_x, Y)}{\#\{(X,Y)\in\mathcal{D}_{\mathrm{test}} : X = x\}}, \quad \text{for } G\in\{\widehat{F}, \check{F}, \widehat{F}^{\mathrm{emp}}\}.$$
Figure 6 shows the forecast evaluation results. As expected, the empirical CRPS increases with age, since the spread of the weight increases with age. As to the relative change in score, improvements of about 2% on average can be seen for the smaller training sizes, and of about 1% for $N_{\mathrm{train}} = 1000$. The region of $\mathcal{X}$ where likelihood ratio ordering has an advantage over the usual stochastic ordering is at its largest in case of $N_{\mathrm{train}} = 1000$, showing the benefit of a stronger regularization.

[Figure 6: Monte Carlo simulations to evaluate prediction performances using the empirical CRPS. Left column: empirical CRPS scores $\widehat{S}_x(G, F)$ with $G$ being either $\widehat{F}$ (green), $\check{F}$ (blue) or $\widehat{F}^{\mathrm{emp}}$ (grey). Right column: relative change of the score when enforcing a likelihood ratio ordering constraint over the usual stochastic ordering constraint. The thicker line is the mean variation, whereas the thin lines are the 25%- and 75%-quantiles. Negative values represent an improvement in score.]
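The evaluation scheme itself is a plain loop; the sketch below outlines it, reusing `crps_step` from the CRPS sketch above and an abstract fitting routine `fit` (a hypothetical stand-in for either order constrained estimator, returning a map from covariate to a step-function cdf).

```python
import numpy as np

def cv_scores(X, Y, fit, n_train, rng, n_reps=10):
    """Leave-N_train-out evaluation: fit on a random subset, score the rest with
    the empirical CRPS. `fit(X, Y)` must return G with G(x) = (points, cdf_values)."""
    N = len(X)
    means = []
    for _ in range(n_reps):
        train = rng.choice(N, size=n_train, replace=False)
        mask = np.zeros(N, dtype=bool)
        mask[train] = True
        G = fit(X[mask], Y[mask])
        scores = [crps_step(*G(x), y) for x, y in zip(X[~mask], Y[~mask])]
        means.append(np.mean(scores))
    return np.array(means)
```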
References

Barlow, R. E., Bartholomew, D. J., Bremner, J. M. and Brunk, H. D. (1972). Statistical Inference under Order Restrictions. The Theory and Application of Isotonic Regression. Wiley Series in Probability and Mathematical Statistics, John Wiley & Sons, London-New York-Sydney.

Beare, B. K. and Moon, J.-M. (2015). Nonparametric tests of density ratio ordering. Econometric Theory. https://doi.org/10.1017/S0266466614000401

Carolan, C. A. and Tebbs, J. M. (2005). Nonparametric tests for and against likelihood ratio ordering in the two-sample problem. Biometrika. https://doi.org/10.1093/biomet/92.1.159

Doob, J. L. (1994). Measure Theory, vol. 143 of Graduate Texts in Mathematics. Springer-Verlag, New York. https://doi.org/10.1007/978-1-4612-0877-8

Dümbgen, L., Freitag-Wolf, S. and Jongbloed, G. (2006). Estimating a unimodal distribution from interval-censored data. J. Amer. Statist. Assoc. https://doi.org/10.1198/016214506000000032

Dykstra, R., Kochar, S. and Robertson, T. (1995). Inference for likelihood ratio ordering in the two-sample problem. J. Amer. Statist. Assoc. https://doi.org/10.2307/2291340

El Barmi, H. and Mukerjee, H. (2005). Inferences under a stochastic ordering constraint: the k-sample case. J. Amer. Statist. Assoc. https://doi.org/10.1198/016214504000000764

Gneiting, T. and Raftery, A. E. (2007). Strictly proper scoring rules, prediction, and estimation. J. Amer. Statist. Assoc. https://doi.org/10.1198/016214506000001437

Henzi, A., Mösching, A. and Dümbgen, L. (2020). Accelerating the pool-adjacent-violators algorithm for isotonic distributional regression. Preprint, arXiv:2006.05527.

Henzi, A., Ziegel, J. and Gneiting, T. (2019). Isotonic distributional regression. Preprint, arXiv:1909.03725.

Jewitt, I. (1991). Applications of likelihood ratio orderings in economics. In Stochastic Orders and Decision under Risk (Hamburg, 1989), vol. 19 of IMS Lecture Notes Monogr. Ser., Inst. Math. Statist., Hayward, CA, 174–189. https://doi.org/10.1214/lnms/1215459856

Jongbloed, G. (1998). The iterative convex minorant algorithm for nonparametric estimation. J. Comput. Graph. Statist. https://doi.org/10.2307/1390706

Jordan, A., Krüger, F. and Lerch, S. (2019). Evaluating probabilistic forecasts with scoringRules. Journal of Statistical Software.

Karlin, S. and Rubin, H. (1956). The theory of decision procedures for distributions with monotone likelihood ratio. Ann. Math. Statist. https://doi.org/10.1214/aoms/1177728259

Kuczmarski, R. J., Ogden, C. L., Guo, S. S., Grummer-Strawn, L. M., Flegal, K. M., Mei, Z., Wei, R., Curtin, L. R., Roche, A. F. and Johnson, C. L. (2002). CDC growth charts for the United States: methods and development. Vital Health Stat.

Lehmann, E. L. and Rojo, J. (1992). Invariant directional orderings. Ann. Statist. https://doi.org/10.1214/aos/1176348905

Mösching, A. and Dümbgen, L. (2020). Monotone least squares and isotonic quantiles. Electron. J. Stat. https://doi.org/10.1214/19-EJS1659

Owen, A. B. (1988). Empirical likelihood ratio confidence intervals for a single functional. Biometrika. https://doi.org/10.1093/biomet/75.2.237

Owen, A. B. (2001). Empirical Likelihood. No. 92 in Monographs on Statistics and Applied Probability, Chapman and Hall/CRC.

Robertson, T., Wright, F. T. and Dykstra, R. L. (1988). Order Restricted Statistical Inference. Wiley Series in Probability and Mathematical Statistics, John Wiley & Sons, Ltd., Chichester.

Roosen, J. and Hennessy, D. A. (2004). Testing for the monotone likelihood ratio assumption. J. Bus. Econom. Statist. https://doi.org/10.1198/073500104000000235

Shaked, M. and Shanthikumar, J. G. (2007). Stochastic Orders. Springer Series in Statistics, Springer, New York. https://doi.org/10.1007/978-0-387-34675-5

Shorack, G. R. and Wellner, J. A. (1986). Empirical Processes with Applications to Statistics. Wiley Series in Probability and Mathematical Statistics, John Wiley & Sons, Inc., New York.

Westling, T., Downes, K. J. and Small, D. S. (2019). Nonparametric maximum likelihood estimation under a likelihood ratio order. Preprint, arXiv:1904.12321.

Yu, T., Li, P. and Qin, J. (2017). Density estimation in the two-sample problem with likelihood ratio ordering. Biometrika. https://doi.org/10.1093/biomet/asw069
A Likelihood ratio order
A.1 Two distributions
At first, let us recall some well-known facts from measure theory. Let $\mu$ and $\nu$ be $\sigma$-finite measures on a measurable space $(\mathcal{X}, \mathcal{B})$. The measure $\mu$ dominates $\nu$, and one writes $\nu\ll\mu$, if $\nu(B) = 0$ for any set $B\in\mathcal{B}$ with $\mu(B) = 0$. The well-known Radon–Nikodym theorem states that $\nu\ll\mu$ if and only if $\nu$ has a density $f$ with respect to $\mu$. That means, $f\colon\mathcal{X}\to[0,\infty)$ is measurable, and $\nu(B) = \int_B f\,\mathrm{d}\mu$ for any $B\in\mathcal{B}$. This is abbreviated as $f = \mathrm{d}\nu/\mathrm{d}\mu$. If $f^*$ is another such function, then $\mu(\{f^*\ne f\}) = 0$.

Now we consider two probability measures $Q$ and $P$ on $(\mathbb{R}, \mathcal{B})$, where $\mathcal{B}$ is the Borel $\sigma$-field on $\mathbb{R}$. Let $\mu$ be a $\sigma$-finite measure such that $Q\ll\mu$ and $P\ll\mu$; for instance, let $\mu = P + Q$. If we choose explicit densities $f = \mathrm{d}P/\mathrm{d}\mu$ and $g = \mathrm{d}Q/\mathrm{d}\mu$, then $(P+Q)(\{f+g=0\}) = 0$, and on $\mathbb{R}\setminus\{f+g=0\}$, the density ratio $g/f$ is well-defined in $[0,\infty]$.

Lemma A.1. The following properties of $Q$ and $P$ are equivalent:

(D1) There exists an isotonic version $\rho\colon\mathbb{R}\to[0,1]$ of $\mathrm{d}Q/\mathrm{d}(P+Q)$.

(D2) For any dominating measure $\mu$ of $P$ and $Q$ and explicit versions $f = \mathrm{d}P/\mathrm{d}\mu$, $g = \mathrm{d}Q/\mathrm{d}\mu$, there exists a set $N\supset\{f+g=0\}$ such that $(P+Q)(N) = 0$ and $g/f$ is isotonic on $\mathbb{R}\setminus N$.

(D3) For any dominating measure $\mu$ of $P$ and $Q$, there exist versions $f = \mathrm{d}P/\mathrm{d}\mu$ and $g = \mathrm{d}Q/\mathrm{d}\mu$ such that $g/f$ is isotonic on $\mathbb{R}\setminus\{f+g=0\}$.

(D4) For any dominating measure $\mu$ of $P$ and $Q$, there exist versions $f = \mathrm{d}P/\mathrm{d}\mu$ and $g = \mathrm{d}Q/\mathrm{d}\mu$ such that, for all $x, y\in\mathbb{R}$ with $x\le y$,
$$f(y)g(x) \le f(x)g(y).$$

(D5) For all $A, B\in\mathcal{B}$ such that $A\le B$ (element-wise),
$$P(B)Q(A) \le P(A)Q(B).$$

(D6) For all intervals $A = (w, x]$ and $B = (y, z]$ with $x\le y$,
$$P(B)Q(A) \le P(A)Q(B).$$
We now state the definition of likelihood ratio order on distributions.

Definition A.2 (Likelihood ratio order). We say that $P$ is smaller than $Q$ with respect to likelihood ratio order, and we write $P\le_{\mathrm{lr}} Q$, if any, and thus all, of the properties (D1–6) are satisfied.

Remark A.3. Shaked and Shanthikumar (2007) restrict their attention to probability measures which are either discrete or dominated by Lebesgue measure. Their definition of likelihood ratio order corresponds to properties (D3–4) with $\mu$ being counting measure or Lebesgue measure on the real line.

Remark A.4 (Usual stochastic order). Recall that the distribution $P$ is smaller than $Q$ with respect to the usual stochastic order, and we write $P\le_{\mathrm{st}} Q$, if $P\bigl((-\infty,x]\bigr)\ge Q\bigl((-\infty,x]\bigr)$ for all $x\in\mathbb{R}$. The likelihood ratio order is stronger than the usual stochastic order in the sense that $P\le_{\mathrm{st}} Q$ if $P\le_{\mathrm{lr}} Q$. This follows immediately from property (D5) applied to $A := (-\infty,x]$ and $B := (x,\infty)$ for arbitrary $x\in\mathbb{R}$, because $P(B)Q(A) = Q(A) - P(A)Q(A)$ and $P(A)Q(B) = P(A) - P(A)Q(A)$.

The reverse statement is false in general, but the likelihood ratio order of two distributions is tightly connected to the stochastic order of domain-conditional distributions.

Lemma A.5. We have that $P\le_{\mathrm{lr}} Q$ if and only if $P(\cdot\,|\,C)\le_{\mathrm{st}} Q(\cdot\,|\,C)$ for all $C\in\mathcal{B}$ such that $P(C), Q(C) > 0$.

Remark A.6 (Ordered ranges). Recall that the support of a probability measure $P$ on $\mathbb{R}$, denoted by $\mathrm{supp}(P)$, is the set
$$\mathrm{supp}(P) := \{y\in\mathbb{R} : P(U) > 0 \text{ for every open } U\subset\mathbb{R} \text{ s.t. } y\in U\}.$$
In particular, if $F$ denotes the distribution function of $P$, then the endpoints $\inf(\mathrm{supp}(P))$ and $\sup(\mathrm{supp}(P))$ are given by $\inf\{x\in\mathbb{R} : F(x) > 0\}$ and $\sup\{x\in\mathbb{R} : F(x) < 1\}$, respectively. Consequently, if $P\le_{\mathrm{st}} Q$, then $\inf(\mathrm{supp}(P))\le\inf(\mathrm{supp}(Q))$ and $\sup(\mathrm{supp}(P))\le\sup(\mathrm{supp}(Q))$.
A.2 Ordinal dominance curves

In case $Q\ll P$, another equivalent definition of likelihood ratio order involving the ordinal dominance curve can be stated. To define that property, we first need to set some notation. Let $F$ and $G$ be the distribution functions of $P$ and $Q$, respectively. In the sequel, we use the standard conventions $F(-\infty) := \lim_{x\to-\infty} F(x) = 0$ and $F(\infty) := \lim_{x\to\infty} F(x) = 1$. We also use the following definitions: The image of $F$ is the set $\mathrm{Im}(F) := \{F(x) : x\in\mathbb{R}\}$. For any $\alpha\in[0,1]$, the $\alpha$-quantile of $F$ is the number $F^{-1}(\alpha)$ defined by
$$F^{-1}(\alpha) := \begin{cases} -\infty & \text{if } \alpha = 0, \\ \min\{x\in\mathbb{R} : F(x)\ge\alpha\} & \text{if } \alpha\in(0,1), \\ \sup\{x\in\mathbb{R} : F(x) < 1\} & \text{if } \alpha = 1. \end{cases}$$
The ordinal dominance curve is the function $H = H_{F,G}\colon[0,1]\to[0,1]$ defined for each $\alpha\in[0,1]$ by
$$H(\alpha) := G\bigl(F^{-1}(\alpha)\bigr).$$
Let $V\subset\mathbb{R}$. We say that a function $h\colon V\to\mathbb{R}$ is convex on $V$ if, for all $x, y, z\in V$ with $x < y < z$,
$$\frac{h(y) - h(x)}{y - x} \le \frac{h(z) - h(y)}{z - y}.$$
The next result has been shown in the special case of $F$ and $G$ being continuous and strictly increasing by Lehmann and Rojo (1992), and in the general case but with additional assumptions and weaker conclusions by Westling et al. (2019).

Theorem A.7. If $Q\ll P$, then the following two claims are equivalent:

(i) $P\le_{\mathrm{lr}} Q$.

(ii) The ordinal dominance curve $H$ is convex on $\mathrm{Im}(F)$.
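As a quick numerical illustration of Theorem A.7 (our own sketch, using two Gamma distributions that are likelihood ratio ordered and satisfy $Q\ll P$), one can evaluate the ordinal dominance curve on a grid and check that its chord slopes increase:

```python
import numpy as np
from scipy import stats

# P = Gamma(2), Q = Gamma(3): the density ratio is proportional to y,
# hence increasing, so P <=_lr Q and the ODC should be convex.
F = stats.gamma(2.0)
G = stats.gamma(3.0)

alpha = np.linspace(0.01, 0.99, 99)
H = G.cdf(F.ppf(alpha))                 # ordinal dominance curve H(alpha)

slopes = np.diff(H) / np.diff(alpha)    # chord slopes of H
print("convex on the grid:", bool(np.all(np.diff(slopes) >= -1e-9)))
```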
A.3 Families of distributions

The next lemma shows that the relation $\le_{\mathrm{lr}}$ indeed defines a partial order.

Lemma A.8. The relation $\le_{\mathrm{lr}}$ defines a partial order on the set of all distributions on the real line. That means, for arbitrary probability measures $P$, $Q$ and $R$ on $\mathbb{R}$,

• $P\le_{\mathrm{lr}} P$ (reflexivity);

• $P\le_{\mathrm{lr}} Q$ and $Q\le_{\mathrm{lr}} P$ implies that $P = Q$ (antisymmetry);

• $P\le_{\mathrm{lr}} Q$ and $Q\le_{\mathrm{lr}} R$ implies that $P\le_{\mathrm{lr}} R$ (transitivity).

We now define likelihood ratio ordering for a family of distributions.

Definition A.9. Let $(\mathcal{X}, \preceq)$ be a partially ordered set and $(P_x)_{x\in\mathcal{X}}$ be a family of distributions on $(\mathbb{R}, \mathcal{B})$. We say that $(P_x)_{x\in\mathcal{X}}$ is increasing in $x$ with respect to likelihood ratio order if $P_{x_1}\le_{\mathrm{lr}} P_{x_2}$ for all $x_1, x_2\in\mathcal{X}$ such that $x_1\preceq x_2$.

Example A.10 (Total order on finite sets). For integers $n, m\ge 2$, let $\mathcal{X} = \{x_1,\ldots,x_n\}$ and $\mathcal{Y} = \{y_1,\ldots,y_m\}$ with real numbers $x_1<\cdots<x_n$ and $y_1<\cdots<y_m$. For $1\le i\le n$, let $P_{x_i}$ be a probability distribution on $\mathcal{Y}$ with masses $p_{ij} := P_{x_i}(\{y_j\})$ for $1\le j\le m$. According to properties (D3–5) of Lemma A.1, the family $(P_x)_{x\in\mathcal{X}}$ is increasing with respect to likelihood ratio order if and only if
$$p_{i_1 j_2}\cdot p_{i_2 j_1} \le p_{i_1 j_1}\cdot p_{i_2 j_2}$$
whenever $1\le i_1 < i_2\le n$ and $1\le j_1 < j_2\le m$.

B Proofs and technical details
B.1 Local averages and isotonicity
Let $\mu$ be a locally finite measure on the real line, that means, $\mu(B) < \infty$ for bounded sets $B\subset\mathbb{R}$. As usual, $L^1(\mu)$ denotes the set of measurable functions $f$ such that $\|f\| := \int|f|\,\mathrm{d}\mu$ is finite, and $L^1_{\mathrm{loc}}(\mu)$ denotes the set of locally integrable functions, that is, measurable functions $f$ such that $\|f 1_B\| < \infty$ for all bounded sets $B\subset\mathbb{R}$.

For all $f\in L^1_{\mathrm{loc}}(\mu)$ and bounded intervals $B$, define the $\mu$-average of $f$ on $B$ as
$$D_\mu f(B) := \mu(B)^{-1}\int_B f\,\mathrm{d}\mu$$
with the convention $0/0 := 0$. For integers $n\ge 0$, we partition $\mathbb{R}$ into the half-open intervals
$$I_{nj} := \bigl(2^{-n}(j-1),\ 2^{-n}j\bigr], \quad j\in\mathbb{Z}.$$
Now, for $f\in L^1_{\mathrm{loc}}(\mu)$ define $\Pi_n f\colon\mathbb{R}\to\mathbb{R}$ via $\Pi_n f(x) := D_\mu f(I_{nj})$ for $j\in\mathbb{Z}$ and $x\in I_{nj}$.

Lemma B.1. For all $n\ge 0$, the operator $\Pi_n\colon L^1_{\mathrm{loc}}(\mu)\to L^1_{\mathrm{loc}}(\mu)$ is linear. For any $f\in L^1(\mu)$, it satisfies the inequalities $\|(\Pi_n f)1_{I_{nj}}\|\le\|f 1_{I_{nj}}\|$ for all $j\in\mathbb{Z}$. In particular, $\|\Pi_n f\|\le\|f\|$. Moreover, $\Pi_n f\to f$ $\mu$-almost everywhere.

Proof of Lemma B.1. Linearity of $\Pi_n$ follows from linearity of integrals. The contraction property follows from the fact that for any integer $j$,
$$\|(\Pi_n f)1_{I_{nj}}\| = \mu(I_{nj})\bigl|D_\mu f(I_{nj})\bigr| = \Bigl|\int_{I_{nj}} f\,\mathrm{d}\mu\Bigr| \le \int_{I_{nj}} |f|\,\mathrm{d}\mu = \|f 1_{I_{nj}}\|.$$
Since $\|h\| = \sum_{j\in\mathbb{Z}}\|h 1_{I_{nj}}\|$ for $h = f, \Pi_n f$, this implies the inequality $\|\Pi_n f\|\le\|f\|$.

Now pick numbers $c_j > 0$, $j\in\mathbb{Z}$, such that $\sum_{j\in\mathbb{Z}} c_j\int_{I_{0j}}|f|\,\mathrm{d}\mu < \infty$ and $\mathrm{IP}(B) := \sum_{j\in\mathbb{Z}} c_j\,\mu(B\cap I_{0j})$ defines a probability measure on $(\mathbb{R}, \mathcal{B})$. One can easily verify that $\Pi_n f$ is the conditional expectation $\mathrm{IE}(f\,|\,\mathcal{B}_n)$ of $f\in L^1(\mathrm{IP})$ with respect to the $\sigma$-field $\mathcal{B}_n$ generated by the intervals $I_{nj}$, $j\in\mathbb{Z}$. Moreover, since $\mathcal{B}_0\subset\mathcal{B}_1\subset\mathcal{B}_2\subset\cdots$, and since $\mathcal{B}$ is generated by $\bigcup_{n\ge 0}\mathcal{B}_n$, the sequence $(\Pi_n f)_{n\ge 0}$ is a martingale, and by the martingale convergence theorem (Doob, 1994), it converges almost surely to $\mathrm{IE}(f\,|\,\mathcal{B}) = f$. Now the assertion follows from the fact that $\mathrm{IP}(B) = 0$ if and only if $\mu(B) = 0$.

Let $\mathbb{Z}_n := \{j\in\mathbb{Z} : \mu(I_{nj}) > 0\}$ and $i_n := \sup(\mathbb{Z}_n)$. For $x\in\mathbb{R}$ let
$$j_n(x) := \min\bigl(\{j\in\mathbb{Z}_n : x\le 2^{-n}j\}\cup\{i_n\}\bigr).$$
Let $\Pi^*_n$ be the operator on $L^1_{\mathrm{loc}}(\mu)$ defined for all $f\in L^1_{\mathrm{loc}}(\mu)$ and $x\in\mathbb{R}$ by
$$\Pi^*_n f(x) := D_\mu f\bigl(I_{n,j_n(x)}\bigr).$$
In particular, $\Pi^*_n f\equiv D_\mu f(I_{nj})$ on $I_{nj}$ whenever $j\in\mathbb{Z}$ is such that $\mu(I_{nj}) > 0$. That means, $\Pi_n f$ and $\Pi^*_n f$ coincide $\mu$-almost everywhere. However, the slightly more complicated definition of $\Pi^*_n$ ensures that it preserves isotonicity.

Lemma B.2. For all $n\ge 0$, the operator $\Pi^*_n\colon L^1_{\mathrm{loc}}(\mu)\to L^1_{\mathrm{loc}}(\mu)$ is linear. For any $f\in L^1_{\mathrm{loc}}(\mu)$, the following two conclusions hold true: If $f$ is isotonic on $\mathbb{R}$, then $D_\mu f(I_{nj})$ is isotonic in $j\in\mathbb{Z}_n$. If $D_\mu f(I_{nj})$ is isotonic in $j\in\mathbb{Z}_n$, then $\Pi^*_n f$ is isotonic on $\mathbb{R}$.

Proof of Lemma B.2. Linearity of $\Pi^*_n$ is obvious. If $f$ is isotonic, then $D_\mu f(I_{nj})$ is isotonic in $j\in\mathbb{Z}_n$, because for numbers $j < k$ in $\mathbb{Z}_n$,
$$D_\mu f(I_{nj}) \le f(2^{-n}j) \le f\bigl(2^{-n}(k-1)\bigr) \le D_\mu f(I_{nk}).$$
Since $j_n(x)\in\mathbb{Z}_n$ for all $x\in\mathbb{R}$, and since $j_n(x)\le j_n(y)$ whenever $x\le y$, we find that $\Pi^*_n f$ is isotonic on $\mathbb{R}$ whenever $D_\mu f(I_{nj})$ is isotonic in $j\in\mathbb{Z}_n$.

Lemma B.3. For any function $f\in L^1_{\mathrm{loc}}(\mu)$, the following two claims are equivalent:

(i) There exists an isotonic function $f^*\colon\mathbb{R}\to[-\infty,\infty]$ such that $f^* = f$ $\mu$-almost everywhere.

(ii) $D_\mu f(A)\le D_\mu f(B)$ for all $A = (w,x]$, $B = (y,z]$ such that $x\le y$ and $\mu(A), \mu(B) > 0$.

Proof of Lemma B.3. Let $\mu(\{f^*\ne f\}) = 0$ for an isotonic function $f^*\colon\mathbb{R}\to[-\infty,\infty]$. For intervals $A, B$ as in (ii), this implies that $D_\mu f(A)\le f^*(x)\le f^*(y)\le D_\mu f(B)$. Consequently, (i) implies (ii).

Suppose now that $f\in L^1_{\mathrm{loc}}(\mu)$ satisfies (ii). With $f_n := \Pi_n f$ as in Lemma B.1, the latter lemma implies $\lim_{n\to\infty} f_n(x) = f(x)$ for all $x\in\mathbb{R}\setminus N$ for some Borel set $N$ with $\mu(N) = 0$. Next let $f^*_n := \Pi^*_n f$. Then it follows from (ii) and Lemma B.2 that $f^*_n$ is an isotonic function. Moreover, $f^*_n = f_n$ on $\mathbb{R}\setminus N_n$ for some Borel set $N_n$ with $\mu(N_n) = 0$. Consequently, $\lim_{n\to\infty} f^*_n(x) = f(x)$ for all $x\in\mathbb{R}\setminus N^*$, where $N^* := N\cup\bigcup_{n\ge 0} N_n$ satisfies $\mu(N^*) = 0$. Since all functions $f^*_n$ are isotonic on $\mathbb{R}$, this implies that $f^*(x) := \liminf_{n\to\infty} f^*_n(x)$ defines an isotonic function $f^*\colon\mathbb{R}\to[-\infty,\infty]$, and $f^* = f$ on $\mathbb{R}\setminus N^*$. Hence, (i) is satisfied as well.

B.2 Proofs for Appendix A

Proof of Lemma A.1. We prove the result via a chain of implications.
(D1) $\Rightarrow$ (D2): Let $\mu$, $f$ and $g$ be as in (D2), and let $\rho\colon\mathbb{R}\to[0,1]$ be an isotonic version of $\mathrm{d}Q/\mathrm{d}(P+Q)$. Since $f + g = \mathrm{d}(P+Q)/\mathrm{d}\mu$, for any Borel set $B$,
$$Q(B) = \int_B\rho\,\mathrm{d}(P+Q) = \int_B\rho(f+g)\,\mathrm{d}\mu,$$
so $\rho(f+g) = \mathrm{d}Q/\mathrm{d}\mu$. Moreover,
$$P(B) = (P+Q)(B) - Q(B) = \int_B(1-\rho)(f+g)\,\mathrm{d}\mu,$$
whence $(1-\rho)(f+g) = \mathrm{d}P/\mathrm{d}\mu$. That means, $N_1 := \{f\ne(1-\rho)(f+g)\}\cup\{g\ne\rho(f+g)\}$ satisfies $\mu(N_1) = 0$. With $N_2 := \{f+g=0\}$, we obtain a set $N := N_1\cup N_2$ such that $(P+Q)(N) = 0$, and on $\mathbb{R}\setminus N$, $g/f = \rho/(1-\rho)$ is isotonic, because $[0,1]\ni r\mapsto r/(1-r)\in[0,\infty]$ is strictly increasing. Hence, (D1) implies (D2).

(D2) $\Rightarrow$ (D3): Let $\mu$ dominate $P$ and $Q$ with explicit versions $f = \mathrm{d}P/\mathrm{d}\mu$ and $g = \mathrm{d}Q/\mathrm{d}\mu$. Suppose that $g/f$ is isotonic on $\mathbb{R}\setminus N$, where $N\supset\{f+g=0\}$ satisfies $(P+Q)(N) = 0$. Then we may replace $f$ and $g$ with $\tilde f := f 1_{\mathbb{R}\setminus N}$ and $\tilde g := g 1_{\mathbb{R}\setminus N}$, respectively. Then $N = \{\tilde f + \tilde g = 0\}$, and on $\mathbb{R}\setminus N$, $\tilde g/\tilde f = g/f$ is isotonic. Hence, (D2) implies (D3).

(D3) $\Rightarrow$ (D4): Let $\mu$ dominate $P$ and $Q$, and suppose that there exist versions $f = \mathrm{d}P/\mathrm{d}\mu$ and $g = \mathrm{d}Q/\mathrm{d}\mu$ such that $g/f$ is isotonic on $\mathbb{R}\setminus\{f+g=0\}$. For points $x\le y$ in $\mathbb{R}\setminus\{f+g=0\}$, the inequality $(g/f)(x)\le(g/f)(y)$ is equivalent to $f(y)g(x)\le f(x)g(y)$, and the latter inequality is always correct if $x\in\{f+g=0\}$ or $y\in\{f+g=0\}$. Consequently,
$$f(y)g(x)\le f(x)g(y) \quad \text{if } x, y\in\mathbb{R},\ x\le y.$$
Hence (D3) implies (D4).

(D4) $\Rightarrow$ (D5) $\Rightarrow$ (D6): Let $\mu$, $f$ and $g$ be as in (D4) and $A$ and $B$ be sets as in (D5). Integrating $f(x)g(y)$ and $g(x)f(y)$ on $A\times B$ with respect to the product measure $\mu\otimes\mu$ yields (D5). Condition (D6) is obviously a special case of (D5).

(D6) $\Rightarrow$ (D1): Suppose that $P$ and $Q$ satisfy (D6). The inequality $P(B)Q(A)\le P(A)Q(B)$ is easily seen to be equivalent to $\mu(B)Q(A)\le\mu(A)Q(B)$ with $\mu := P+Q$. Since $Q\le\mu$, the latter inequality is trivial if $\mu(A) = 0$ or $\mu(B) = 0$. Thus (D6) is equivalent to

(D6') $(Q/\mu)(A)\le(Q/\mu)(B)$ for $A = (w,x]$ and $B = (y,z]$ with $x\le y$ and $\mu(A),\mu(B) > 0$.

The relation $Q\le\mu$ also implies that there exists a density $\rho = \mathrm{d}Q/\mathrm{d}\mu$ with values in $[0,1]$, and (D6') reads
$$D_\mu\rho(A)\le D_\mu\rho(B) \quad \text{for } A = (w,x],\ B = (y,z] \text{ with } x\le y \text{ and } \mu(A),\mu(B) > 0,$$
where $D_\mu\rho(C) := \mu(C)^{-1}\int_C\rho\,\mathrm{d}\mu$ with $0/0 := 0$. By Lemma B.3, $\rho$ may be chosen to be isotonic. Thus condition (D6) implies (D1).

Proof of Lemma A.5. For notational convenience, we write $P_C := P(\cdot\,|\,C)$ and $Q_C := Q(\cdot\,|\,C)$. Suppose first that $P\le_{\mathrm{lr}} Q$, and let $C$ be a Borel set with $P(C), Q(C) > 0$. Since $P(B)Q(A)\le P(A)Q(B)$ for all Borel sets $A, B$ such that $A\le B$, and since $A\cap C\le B\cap C$, we find that
$$P_C(B)Q_C(A) = \frac{P(B\cap C)}{P(C)}\cdot\frac{Q(A\cap C)}{Q(C)} \le \frac{P(A\cap C)}{P(C)}\cdot\frac{Q(B\cap C)}{Q(C)} = P_C(A)Q_C(B).$$
Hence $P_C\le_{\mathrm{lr}} Q_C$, and this implies that $P_C\le_{\mathrm{st}} Q_C$, see Remark A.4.

Suppose now that $P_C\le_{\mathrm{st}} Q_C$ for all Borel sets $C$ such that $P(C), Q(C) > 0$. To verify that $P\le_{\mathrm{lr}} Q$, it suffices to show that
$$P(A)Q(B) - P(B)Q(A) \ge 0 \qquad (\mathrm{B.1})$$
for $A = (w,x]$ and $B = (y,z]$ with $x\le y$. With $C := A\cup B$, it suffices to consider the case $P(C), Q(C) > 0$, because otherwise (B.1) is trivial. But then (B.1) is equivalent to
$$P_C(A)Q_C(B) - P_C(B)Q_C(A) \ge 0,$$
and since $P_C(B) = 1 - P_C(A)$ and $Q_C(B) = 1 - Q_C(A)$, the latter inequality is equivalent to
$$P_C(A) \ge Q_C(A).$$
But this is a consequence of $P_C\le_{\mathrm{st}} Q_C$, because $P_C(A) = P_C\bigl((-\infty,x]\bigr)$ and $Q_C(A) = Q_C\bigl((-\infty,x]\bigr)$.

Proof of Lemma A.8. Reflexivity of the likelihood ratio order is obvious. To show antisymmetry, note that $P\le_{\mathrm{lr}} Q$ and $Q\le_{\mathrm{lr}} P$ implies that $P\le_{\mathrm{st}} Q$ and $Q\le_{\mathrm{st}} P$. But the latter two inequalities mean that the distribution functions of $P$ and $Q$ coincide, whence $P = Q$.

It remains to prove transitivity. Suppose that $P\le_{\mathrm{lr}} Q\le_{\mathrm{lr}} R$, and let $\mu := P+Q+R$. Then there exist versions $f = \mathrm{d}P/\mathrm{d}\mu$, $g = \mathrm{d}Q/\mathrm{d}\mu$ and $h = \mathrm{d}R/\mathrm{d}\mu$ such that $g/f$ is isotonic on $\{f+g > 0\}$ and $h/g$ is isotonic on $\{g+h > 0\}$. In particular, $\{f+g > 0\} = A\cup B$ with
$$A := \{f > 0 = g\} < B := \{g > 0\},$$
and $\{g+h > 0\} = B\cup C$ with
$$B = \{g > 0\} < C := \{g = 0 < h\}.$$
The relation "$<$" for sets is meant element-wise. In particular, $\{f > 0\}\subset A\cup B$, whence $f = 0$ on $C$, and $\{h > 0\}\subset B\cup C$, whence $h = 0$ on $A$. Consequently, $\{f+h > 0\} = A\cup B'\cup C$ with
$$B' := \{f+h > 0\}\cap B.$$
Since $h/f = 0$ on $A$ and $h/f = \infty$ on $C$, it suffices to show that $h/f$ is isotonic on $B'$. But since $g > 0$ on $B'$, the ratio $h/f = (h/g)(g/f)$ is the product of two isotonic functions from $B'$ to $[0,\infty]$, whence it is isotonic on $B'$, too.

The proof of Theorem A.7 uses elementary inequalities for distribution and quantile functions; see for instance Chapter 1 of Shorack and Wellner (1986):

Lemma B.4. For all $\alpha\in[0,1]$, we have $F\bigl(F^{-1}(\alpha)\bigr)\ge\alpha$, with equality if, and only if, $\alpha\in\mathrm{Im}(F)$. For all $x\in\mathbb{R}$, $F^{-1}\bigl(F(x)\bigr)\le x$, with equality if, and only if, $F(x_o) < F(x)$ for all $x_o < x$.

Proof of Theorem A.7. Suppose that (i) holds and let $r, s, t\in\mathrm{Im}(F)$ with $r < s < t$. Lemma B.4 yields $F^{-1}(r) < F^{-1}(s) < F^{-1}(t)$, so $A := (F^{-1}(r), F^{-1}(s)]$ and $B := (F^{-1}(s), F^{-1}(t)]$ are such that
$$P(A) = F\bigl(F^{-1}(s)\bigr) - F\bigl(F^{-1}(r)\bigr) = s - r > 0, \qquad P(B) = F\bigl(F^{-1}(t)\bigr) - F\bigl(F^{-1}(s)\bigr) = t - s > 0.$$
Then property (D6) yields
$$(t-s)\bigl(H(s) - H(r)\bigr) = P(B)Q(A) \le P(A)Q(B) = (s-r)\bigl(H(t) - H(s)\bigr),$$
and dividing both sides by $(s-r)(t-s)$ shows that $H$ is convex on $\mathrm{Im}(F)$.

Suppose now that (ii) holds. To verify (i), we have to show that
$$P\bigl((x_1,x_2]\bigr)Q\bigl((x_3,x_4]\bigr) - P\bigl((x_3,x_4]\bigr)Q\bigl((x_1,x_2]\bigr) \ge 0 \qquad (\mathrm{B.2})$$
for $x_1 < x_2\le x_3 < x_4$. If $P\bigl((x_i,x_j]\bigr) = 0$, then $Q\bigl((x_i,x_j]\bigr) = 0$, because $Q\ll P$. Hence it suffices to verify (B.2) in case of $P\bigl((x_1,x_2]\bigr) = F(x_2) - F(x_1) > 0$ and $P\bigl((x_3,x_4]\bigr) = F(x_4) - F(x_3) > 0$. For $1\le i\le 4$, let $\tilde x_i := F^{-1}\bigl(F(x_i)\bigr)$. Then $\tilde x_i\le x_i$ and $F(\tilde x_i) = F(x_i) =: t_i$, whence $P\bigl((\tilde x_i, x_i]\bigr) = 0$ and $Q\bigl((\tilde x_i, x_i]\bigr) = 0$. Consequently, the left hand side of (B.2) equals
$$(t_2 - t_1)\,Q\bigl((\tilde x_3, \tilde x_4]\bigr) - (t_4 - t_3)\,Q\bigl((\tilde x_1, \tilde x_2]\bigr) = (t_2 - t_1)\bigl(H(t_4) - H(t_3)\bigr) - (t_4 - t_3)\bigl(H(t_2) - H(t_1)\bigr)$$
$$= (t_2 - t_1)(t_4 - t_3)\Bigl(\frac{H(t_4) - H(t_3)}{t_4 - t_3} - \frac{H(t_2) - H(t_1)}{t_2 - t_1}\Bigr) \ge 0,$$
by convexity of $H$ on $\mathrm{Im}(F)$.
B.3 Proofs for Section 2

Lemma B.5. The numbers $m_i$ and $M_i$ have the following properties: (a) $m_i = M_i = j$ if, and only if, $X_k \le x_i$ implies $Y_k \le y_j$ and $X_k \ge x_i$ implies $Y_k \ge y_j$; (b) the numbers $r_i := \min\{l : M_l = M_i\}$ and $s_i := \max\{l : m_l = m_i\}$ are such that $r_i \le i \le s_i$ and $w_{r_i M_i}, w_{s_i m_i} > 0$.

Proof of Lemma B.5. Part (a) follows immediately from the definition of $m_i$ and $M_i$. Indeed, $m_i = M_i = j$ if, and only if, $y_j = \min\{Y_k : X_k \ge x_i\} = \max\{Y_k : X_k \le x_i\}$.

As to part (b), it follows from $M_1 \le \dots \le M_n$ that $r_i \le i \le s_i$. By definition of $M_{r_i}$, $Y_k = y_{M_i}$ for some $k$ such that $X_k \le x_{r_i}$. We even have $X_k = x_{r_i}$, because otherwise $r_i > 1$ and $M_{r_i - 1} \ge M_i$, a contradiction to the minimality of $r_i$. Hence, $w_{r_i M_i} \ge 1$. Likewise, the definition of $m_{s_i}$ implies that $Y_k = y_{m_i}$ for some $k$ such that $X_k \ge x_{s_i}$. Here $X_k = x_{s_i}$, because otherwise $s_i < n$ and $m_{s_i + 1} \le m_i$, a contradiction to the definition of $s_i$. Hence, $w_{s_i m_i} \ge 1$.

Lemma B.6. A matrix $p \in [0,1]^{n \times m}$ that satisfies (2.6)--(2.7) must satisfy (2.4).

Proof of Lemma B.6. Recall that we have to show
\[ p_{i_1 j_2} \cdot p_{i_2 j_1} \;\le\; p_{i_1 j_1} \cdot p_{i_2 j_2} \tag{B.3} \]
for all $1 \le i_1 \le i_2 \le n$ and $1 \le j_1 \le j_2 \le m$. We distinguish several cases:

Case $j_1 = j_2$: Here (B.3) is trivially verified.

Case $m_{i_2} \le j_1 < j_2 \le M_{i_1}$: We have that $m_i \le j_1 < j_2 \le M_i$ for all $i_1 \le i \le i_2$, so $p_{ij} > 0$ for all $j_1 \le j \le j_2$ and $i_1 \le i \le i_2$, by (2.7). We further have that $m_{i+1} < M_i$ for all $i_1 \le i < i_2$, so we can apply (2.6) to all $j_1 \le j < j_2$ and $i_1 \le i < i_2$. This yields
\[ \frac{p_{i_1, j_2}}{p_{i_1, j_1}} = \frac{p_{i_1, j_2}}{p_{i_1, j_2 - 1}} \cdot \frac{p_{i_1, j_2 - 1}}{p_{i_1, j_2 - 2}} \cdots \frac{p_{i_1, j_1 + 1}}{p_{i_1, j_1}} \;\le\; \frac{p_{i_1 + 1, j_2}}{p_{i_1 + 1, j_2 - 1}} \cdot \frac{p_{i_1 + 1, j_2 - 1}}{p_{i_1 + 1, j_2 - 2}} \cdots \frac{p_{i_1 + 1, j_1 + 1}}{p_{i_1 + 1, j_1}} \;\le\; \dots \;\le\; \frac{p_{i_2, j_2}}{p_{i_2, j_2 - 1}} \cdot \frac{p_{i_2, j_2 - 1}}{p_{i_2, j_2 - 2}} \cdots \frac{p_{i_2, j_1 + 1}}{p_{i_2, j_1}} = \frac{p_{i_2, j_2}}{p_{i_2, j_1}}, \]
which is equivalent to (B.3).

Case $j_1 < m_{i_2}$, $M_{i_1} < j_2$ or $M_{i_1} \le m_{i_2}$: From (2.7), in each case at least one of $p_{i_1 j_2}$ and $p_{i_2 j_1}$ is zero, so (B.3) is trivial.
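The telescoping step in the proof of Lemma B.6 can also be probed numerically: once the adjacent $2 \times 2$ inequalities (2.6) hold, all $2 \times 2$ inequalities (B.3) follow. A minimal Python sketch, assuming for simplicity a strictly positive matrix so that the support restrictions in (2.7) play no role:

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)
n, m = 5, 4

# Construct log-probabilities whose row increments increase in i; this is
# exactly the adjacent 2x2 condition p[i,j+1]*p[i+1,j] <= p[i,j]*p[i+1,j+1].
theta = np.sort(rng.normal(size=(n, m - 1)), axis=0)  # column-wise increasing
logp = np.hstack([np.zeros((n, 1)), np.cumsum(theta, axis=1)])
p = np.exp(logp)
p /= p.sum(axis=1, keepdims=True)       # row normalization keeps the minors

# Global TP2 inequality (B.3) for arbitrary index pairs:
ok = all(p[i1, j2] * p[i2, j1] <= p[i1, j1] * p[i2, j2] + 1e-12
         for i1 in range(n) for i2 in range(i1, n)
         for j1 in range(m) for j2 in range(j1, m))
print(ok)                               # True
\end{verbatim}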
Lemma B.7. For all $(i,j) \in \mathcal{P}$, either $w_{ij} > 0$, or there exist $1 \le i_1 < i_2 \le n$ and $m_{i_2} \le j_1 < j_2 \le M_{i_1}$ such that $i_1 \le i \le i_2$, $j \in \{j_1, j_2\}$ and $w_{i_1 j_2}, w_{i_2 j_1} > 0$. In other words, for all $(i,j) \in \mathcal{P}$, either $w_{ij} > 0$, or $(i,j)$ lies on the upper or lower edge of a rectangle whose upper-left and lower-right corners have strictly positive weight.

Proof of Lemma B.7. Let $(i,j) \in \mathcal{P}$ such that $w_{ij} = 0$. Since $w_i = \sum_{r = m_i}^{M_i} w_{ir} > 0$, it has to be that $m_i < M_i$; otherwise $w_i = w_{ij} = 0$. We distinguish several cases:

Case $j = M_i$: Define $i_1 := \min\{k : M_k = M_i\}$ and $j_2 := j = M_i = M_{i_1}$. By Lemma B.5 part (b), we know that $w_{i_1 j_2} > 0$. Now $w_{ij} = 0$ implies that $i_1 < i$. Letting $i_2 := \max\{k : m_k = m_i\}$ and $j_1 := m_i = m_{i_2} < M_i = j_2$, another application of Lemma B.5 part (b) shows that $w_{i_2 j_1} > 0$.

Case $j = m_i$: With a similar reasoning as in the previous case, we find that the numbers $i_1 := \min\{k : M_k = M_i\}$, $j_2 := M_i = M_{i_1}$, $i_2 := \max\{k : m_k = m_i\}$ and $j_1 := j = m_i = m_{i_2} < M_i = j_2$ fulfill the required properties.

Case $m_i < j < M_i$: By definition of $y_j$, the set $\{k : w_{kj} > 0\}$ is non-empty and does not include $i$. Let $i_o$ be one of its elements. In case $i_o > i$, we define $i_1 := \min\{k : M_k = M_i\}$, $i_2 := i_o$, $j_1 := j$ and $j_2 := M_i$. These numbers are such that $i_1 \le i < i_o = i_2$, $j_1 = j < M_i = j_2$ and $w_{i_1 j_2}, w_{i_2 j_1} > 0$, by Lemma B.5 part (b). Likewise, in case $i_o < i$, the numbers $i_1 := i_o$, $i_2 := \max\{k : m_k = m_i\}$, $j_1 := m_i$ and $j_2 := j$ fulfill the required properties.

Lemma B.8. A maximizer $p \in [0,1]^{n \times m}$ of $L$ satisfying (2.2) and (2.4) must satisfy (2.6)--(2.7). Additionally, we have that $L(p) > 0$.

Proof of Lemma B.8. Let $p \in [0,1]^{n \times m}$ be a maximizer of $L$ satisfying (2.2) and (2.4). Property (2.6) is a direct consequence of (2.4).

Next, we show that a maximizer $p$ of $L$ must be such that $L(p) > 0$. For that, let $r := [r_1, r_2, \dots, r_n]^\top \in [0,1]^{n \times m}$, where $r_i := (r_{ij})_{j=1}^m$ is defined by $r_{ij} := 1/(M_i - m_i + 1)$ if $j \in \mathcal{J}_i$, and $r_{ij} := 0$ otherwise. Since $r$ satisfies (2.6)--(2.7), we obtain by means of Lemma B.6 that (2.4) holds. Because (2.2) is satisfied as well, the matrix $r$ is a feasible parameter such that
\[ L(p) \;\ge\; L(r) \;=\; \prod_{(i,j) \in \mathcal{P}} r_{ij}^{w_{ij}} \;\ge\; \Bigl(\min_{(i,j) \in \mathcal{P}} r_{ij}\Bigr)^{nm} \;>\; 0. \]
As to (2.7), we first prove that $v_i := \sum_{j \in \mathcal{J}_i} p_{ij} = 1$ for all $i \in \mathcal{I}$; together with (2.2), this yields $p_{ij} = 0$ if $j \notin \mathcal{J}_i$. For that, we verify first that $v_i > 0$ for all $i \in \mathcal{I}$. Ab absurdo, suppose that there exists $i_o \in \mathcal{I}$ such that $p_{i_o j} = 0$ for all $j \in \mathcal{J}_{i_o}$. Since $w_{i_o} > 0$, there is at least one $j_o \in \mathcal{J}_{i_o}$ such that $w_{i_o j_o} > 0$, which yields the contradiction $L(p) = 0$.

Thus, we suppose that $0 < v_i \le 1$ for all $i \in \mathcal{I}$. Let $q = [q_1, q_2, \dots, q_n]^\top$ be the $n \times m$ matrix with each $q_i := (q_{ij})_{j=1}^m$ defined as follows:
\[ q_{ij} := \begin{cases} p_{ij}/v_i & \text{if } (i,j) \in \mathcal{P}, \\ 0 & \text{otherwise.} \end{cases} \]
These coefficients are such that $q_{ij} \ge p_{ij}$ if $j \in \mathcal{J}_i$, and $\sum_{j=1}^m q_{ij} = 1$ for all $i \in \mathcal{I}$, so $q \in [0,1]^{n \times m}$ and it satisfies (2.2). Furthermore, since $p$ satisfies (2.4), it holds that $q_{i,j+1} \cdot q_{i+1,j} \le q_{ij} \cdot q_{i+1,j+1}$ for all $m_{i+1} \le j < M_i$ and $1 \le i < n$, that means, $q$ satisfies (2.6)--(2.7). Lemma B.6 then implies that $q$ satisfies (2.4) as well. Therefore, $q$ is a feasible parameter such that
\[ L(p) \;=\; \prod_{(i,j) \in \mathcal{P}} p_{ij}^{w_{ij}} \;=\; \prod_{i \in \mathcal{I}} v_i^{w_i} \prod_{j \in \mathcal{J}_i} q_{ij}^{w_{ij}} \;\le\; L(q) \;\le\; L(p), \]
where the first inequality is due to the fact that $0 < v_i^{w_i} \le 1$, and the second to the fact that $p$ is a maximizer of $L$. But then all the inequalities are equalities, which forces $v_i = 1$ for all $i \in \mathcal{I}$.

Finally, let us show that $p_{ij} > 0$ for all $(i,j) \in \mathcal{P}$. Notice first that $p_{ij} > 0$ whenever $w_{ij} > 0$, since otherwise $L(p) = 0 < L(r)$, as seen previously. Hence we treat the case of $w_{ij} = 0$ for some $(i,j) \in \mathcal{P}$. From Lemma B.7, there exist $1 \le i_1 < i_2 \le n$ and $m_{i_2} \le j_1 < j_2 \le M_{i_1}$ with $i_1 \le i \le i_2$, $j \in \{j_1, j_2\}$ and $w_{i_1 j_2}, w_{i_2 j_1} > 0$, so $p_{i_1 j_2}, p_{i_2 j_1} > 0$. Suppose first that $j = j_2$, so $i_1 < i \le i_2$, and let us show that $p_{ij} > 0$. Ab absurdo, if $p_{ij} = 0$, then, since $p_{i_1 j} = p_{i_1 j_2} > 0$, the likelihood ratio ordering constraint (2.4) yields
\[ p_{i_1 j} \cdot p_{iu} \;\le\; p_{i_1 u} \cdot p_{ij} \;=\; 0 \quad\text{for all } u \le j, \]
so $p_{iu} = 0$ for all $u \le j$. In case $i = i_2$, we would get a contradiction with $p_{i_2 j_1} > 0$, and if $i_1 < i < i_2$, we would find that
\[ \inf \mathrm{supp}\bigl(\widehat P_{x_i}\bigr) \;>\; y_j = y_{j_2} \;>\; y_{j_1} \;\ge\; \inf \mathrm{supp}\bigl(\widehat P_{x_{i_2}}\bigr) \]
and contradict Remark A.6. In case $j = j_1$, claiming that $p_{ij} = 0$ would lead us to a similar contradiction.

Lemma B.9. Let $1 \le i \le n$ and $1 \le j \le m$ be arbitrary indices. Then (1) $j \le M_i$ if, and only if, $n_j \le i$; (2) $m_i \le j$ if, and only if, $i \le N_j$.

Proof of Lemma B.9. To show (1), consider the following chain of equivalences:
\[ j > M_i \;\Longleftrightarrow\; (X_k \le x_i \Rightarrow Y_k < y_j) \;\Longleftrightarrow\; (Y_k \ge y_j \Rightarrow X_k > x_i) \;\Longleftrightarrow\; i < n_j. \]
Whence $j \le M_i$ if, and only if, $i \ge n_j$. Statement (2) is proved similarly.

Lemma B.10. Let $c(\theta) := \bigl(c_{ij}(\theta)\bigr)_{(i,j) \in \mathcal{P}}$ be defined for all $\theta \in \Theta$ by
\[ c_{ij}(\theta) := \begin{cases} \log\Bigl(\sum_{r \in \mathcal{J}_i} \exp\bigl(\sum_{s = m_i}^r \theta_{is}\bigr)\Bigr) & \text{if } j = m_i, \\ 0 & \text{otherwise.} \end{cases} \]
Then the parameter $\theta' := \theta - c(\theta)$ is such that $\theta' \in \Theta$ and $f(\theta') \le f(\theta)$, with equality if, and only if, $c(\theta) = 0$.

Proof of Lemma B.10. We prove the result with slightly different notations. Let $\lambda = (\lambda_{ij})_{(i,j) \in \mathcal{P}} \in \mathbb{R}^{\mathcal{P}}$ be defined by $\lambda_{ij} := \sum_{s = m_i}^j \theta_{is}$. Then $\theta \in \Theta$ if, and only if,
\[ \lambda \in \Lambda := \bigl\{\lambda \in \mathbb{R}^{\mathcal{P}} : \lambda_{i,j+1} + \lambda_{i+1,j} \le \lambda_{ij} + \lambda_{i+1,j+1} \text{ for } m_{i+1} \le j < M_i,\ 1 \le i < n\bigr\}, \]
and $\theta$ satisfies constraint (2.8) if, and only if, $\lambda$ is such that
\[ \sum_{j \in \mathcal{J}_i} \exp(\lambda_{ij}) = 1 \quad\text{for all } i \in \mathcal{I}. \tag{B.4} \]
Finally, the functional
\[ \ell(\lambda) := \sum_{(i,j) \in \mathcal{P}} \bigl(w_{ij} \lambda_{ij} - w_i \exp(\lambda_{ij})\bigr) \]
is such that $\ell(\lambda) = -f(\theta)$.

Let us now define
\[ c_i := \log \sum_{r \in \mathcal{J}_i} \exp(\lambda_{ir}) \quad\text{for all } 1 \le i \le n, \]
and the parameter $\lambda' \in \mathbb{R}^{\mathcal{P}}$ with $\lambda'_{ij} := \lambda_{ij} - c_i$. Then $\lambda' \in \Lambda$ and it satisfies (B.4). In particular, since $\exp(c) \ge 1 + c$ for all $c \in \mathbb{R}$, with equality if, and only if, $c = 0$, we have
\begin{align*}
\ell(\lambda') &= \sum_{(i,j) \in \mathcal{P}} w_{ij} (\lambda_{ij} - c_i) - \sum_{i \in \mathcal{I}} w_i \sum_{j \in \mathcal{J}_i} \exp(\lambda_{ij} - c_i) \\
&= \sum_{(i,j) \in \mathcal{P}} w_{ij} \lambda_{ij} - \sum_{i \in \mathcal{I}} w_i (1 + c_i) \\
&\ge \sum_{(i,j) \in \mathcal{P}} w_{ij} \lambda_{ij} - \sum_{i \in \mathcal{I}} w_i \exp(c_i) \;=\; \ell(\lambda),
\end{align*}
with equality if, and only if, $c_i = 0$ for all $i \in \mathcal{I}$. In consequence, replacing $\lambda$ with $\lambda'$ has the effect of strictly increasing the log-likelihood score, unless $\lambda$ is normalized already. Since $\theta'_{ij} := \lambda'_{i,j} - \lambda'_{i,j-1} = \lambda_{i,j} - \lambda_{i,j-1} = \theta_{ij}$ for all $m_i < j \le M_i$ and $i \in \mathcal{I}$, the corresponding update on $\theta$ is given by the parameter $\theta' := \theta - c(\theta)$.
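In algorithmic terms, Lemma B.10 states that subtracting the row-wise log-normalization constants $c_i$ can only improve the surrogate criterion $\ell$. The following Python sketch illustrates this update under the simplifying assumption that every row has full support, i.e. $\mathcal{J}_i = \{1, \dots, m\}$; the weights are arbitrary illustrative values.

\begin{verbatim}
import numpy as np
from scipy.special import logsumexp

rng = np.random.default_rng(1)
n, m = 4, 6
lam = rng.normal(size=(n, m))     # generic parameter, lam[i, j] = lambda_ij
w = rng.random(size=(n, m))       # illustrative weights w_ij
wi = w.sum(axis=1)                # row weights w_i

def ell(lam):
    # l(lambda) = sum_ij ( w_ij lambda_ij - w_i exp(lambda_ij) )
    return np.sum(w * lam - wi[:, None] * np.exp(lam))

c = logsumexp(lam, axis=1)        # c_i = log sum_j exp(lambda_ij)
lam_new = lam - c[:, None]        # normalized parameter lambda'

print(np.allclose(np.exp(lam_new).sum(axis=1), 1.0))  # (B.4) holds
print(ell(lam_new) >= ell(lam))                       # True
\end{verbatim}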
Lemma B.11. An element $\theta \in \Theta$ minimizes $f$ if, and only if, it minimizes
\[ f_o(\theta) := -\sum_{(i,j) \in \mathcal{P}} w_{ij} \theta_{ij} \]
under constraint (2.8).

Proof of Lemma B.11. We use the same notations as in the proof of Lemma B.10. The equivalent claim therefore reads: an element $\lambda \in \Lambda$ maximizes $\ell$ if, and only if, it maximizes
\[ \ell_o(\lambda) := \sum_{(i,j) \in \mathcal{P}} w_{ij} \lambda_{ij} \]
under constraint (B.4).

Suppose first that $\lambda$ maximizes $\ell$. Then the directional derivative of $\ell$ at $\lambda$ in the direction of $(1_{[r=i]})_{(r,s) \in \mathcal{P}} \in \Lambda$ is given by
\[ \frac{\partial}{\partial c}\, \ell\bigl(\lambda + c\,(1_{[r=i]})_{(r,s) \in \mathcal{P}}\bigr)\Big|_{c=0} \;=\; w_i \Bigl(1 - \sum_{j \in \mathcal{J}_i} \exp(\lambda_{ij})\Bigr), \]
and it is equal to $0$ by optimality of $\lambda$. In consequence, $\lambda$ satisfies constraint (B.4). It also maximizes $\ell_o$, because $\ell = \ell_o - \sum_{i \in \mathcal{I}} w_i$ on the set of parameters satisfying (B.4): if $\tilde\lambda$ satisfies constraint (B.4) but is such that $\ell_o(\tilde\lambda) > \ell_o(\lambda)$, then we also have $\ell(\tilde\lambda) > \ell(\lambda)$, which is a contradiction with the optimality of $\lambda$.

Suppose now that $\tilde\lambda$ maximizes $\ell_o$ under constraint (B.4), and suppose that $\lambda$ is such that $\ell(\lambda) > \ell(\tilde\lambda)$. As explained in the proof of Lemma B.10, the parameter $\lambda'$ defined by $\lambda'_{ij} := \lambda_{ij} - c_i$ is such that $\ell(\lambda') \ge \ell(\lambda)$, and it satisfies (B.4). In particular, $\ell_o(\lambda') > \ell_o(\tilde\lambda)$, which contradicts the optimality of $\tilde\lambda$.
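Before turning to Section 3, the index duality of Lemma B.9 can be checked on simulated data. The Python sketch below uses the definitions of $m_i$, $M_i$, $n_j$ and $N_j$ as we read them off the proofs of Lemmas B.5 and B.9 (with 0-based indices), so it illustrates this reconstruction rather than any code of the authors.

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(3)
X = rng.integers(0, 6, size=30)          # illustrative data (X_k, Y_k)
Y = rng.integers(0, 6, size=30)
x, y = np.unique(X), np.unique(Y)        # grids x_1 < ... < x_n, y_1 < ... < y_m
n, m = len(x), len(y)

# Definitions as read off the proofs of Lemmas B.5 and B.9 (0-based):
M = [y.searchsorted(Y[X <= x[i]].max()) for i in range(n)]
mm = [y.searchsorted(Y[X >= x[i]].min()) for i in range(n)]
nn = [x.searchsorted(X[Y >= y[j]].min()) for j in range(m)]
N = [x.searchsorted(X[Y <= y[j]].max()) for j in range(m)]

# Duality of Lemma B.9: j <= M_i iff n_j <= i, and m_i <= j iff i <= N_j.
print(all((j <= M[i]) == (nn[j] <= i) and (mm[i] <= j) == (i <= N[j])
          for i in range(n) for j in range(m)))   # True
\end{verbatim}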
B.4 Proofs for Section 3

Lemma B.12. For $\boldsymbol w, \boldsymbol y \in \mathbb{R}^d$ with $\boldsymbol w > 0$, the mapping $(\boldsymbol w, \boldsymbol y) \mapsto \mathrm{E}_{\boldsymbol w}\bigl(\boldsymbol y \,|\, \mathbb{R}^d_\uparrow\bigr)$ is continuous in both arguments.

Proof of Lemma B.12. For $\boldsymbol w, \boldsymbol y \in \mathbb{R}^d$ with $\boldsymbol w > 0$, it is well known that
\[ \mathrm{Iso}(\boldsymbol y, \boldsymbol w) \;=\; \Bigl(\min_{s \ge i} \max_{r \le i} A_{rs}(\boldsymbol w, \boldsymbol y)\Bigr)_{i=1}^d, \]
where each function
\[ A_{rs}(\boldsymbol w, \boldsymbol y) \;:=\; \mathop{\arg\min}_{x \in \mathbb{R}} \sum_{i=r}^s w_i (y_i - x)^2 \;=\; \frac{\sum_{i=r}^s w_i y_i}{\sum_{i=r}^s w_i}, \quad 1 \le r \le s \le d, \]
is continuous in both its arguments. But since the functions $x \mapsto \max\bigl(\gamma_1(x), \gamma_2(x)\bigr)$ and $x \mapsto \min\bigl(\gamma_1(x), \gamma_2(x)\bigr)$ are continuous whenever $\gamma_1$ and $\gamma_2$ are continuous, we find that $(\boldsymbol w, \boldsymbol y) \mapsto \min_{s \ge i} \max_{r \le i} A_{rs}(\boldsymbol w, \boldsymbol y)$ is continuous in both its arguments.

Proof of Lemma 3.1. In view of the structure of the constrained set $\Theta$, one can decompose the minimization of $f_\theta$ under constraint into a sum of independent minimization problems:
\[ \min_{\eta \in \Theta} f_\theta(\eta) \;=\; c_\theta + \sum_{i \in \mathcal{I}} A_i + \sum_{j=2}^m B_j, \]
with
\begin{align*}
A_i &:= \min_{\eta \in \mathbb{R}} H_{i m_i}(\theta) \bigl(\eta - \tilde\theta_{i m_i}\bigr)^2 \;=\; 0, \\
B_j &:= \min_{(\eta_{ij})_{i=n_j}^{N_j - 1} \in \mathbb{R}^{d_j}_\uparrow} \sum_{i=n_j}^{N_j - 1} H_{ij}(\theta) \bigl(\eta_{ij} - \tilde\theta_{ij}\bigr)^2.
\end{align*}
The minimization problem of each $A_i$ is obvious and yields the unique solution $\eta^*_{i m_i}(\theta) := \tilde\theta_{i m_i}(\theta)$ for each $i \in \mathcal{I}$. As to the minimization problem of each $B_j$, $1 < j \le m$, it is treated with the theory of isotonic regression (Barlow et al., 1972; Robertson et al., 1988), which aims at determining the unique element $\boldsymbol z \in \mathbb{R}^d_\uparrow$ that minimizes
\[ \sum_{k=1}^d b_k (z_k - \xi_k)^2 \]
for given dimension $d \ge 1$, positive weights $\boldsymbol b := (b_k)_{k=1}^d$ and vector $\boldsymbol\xi := (\xi_k)_{k=1}^d$. For such generic problems, the solution is given by the weighted least squares projection of $\boldsymbol\xi$ onto $\mathbb{R}^d_\uparrow$, which we denote by $\mathrm{Iso}(\boldsymbol\xi, \boldsymbol b)$, and it is efficiently computed via the pool-adjacent-violators algorithm. We therefore define
\[ \bigl(\eta^*_{ij}(\theta)\bigr)_{i=n_j}^{N_j - 1} \;=\; \mathrm{Iso}\Bigl(\bigl(\tilde\theta_{ij}(\theta)\bigr)_{i=n_j}^{N_j - 1},\ \bigl(H_{ij}(\theta)\bigr)_{i=n_j}^{N_j - 1}\Bigr) \]
for each $1 < j \le m$. To see that the mapping $\theta \mapsto \eta^*(\theta)$ is continuous, recall that $f$ is infinitely differentiable on $\Theta$, so $\theta \mapsto H_{ij}(\theta)$ and therefore $\theta \mapsto \tilde\theta_{ij}(\theta)$ are continuous for all $(i,j) \in \mathcal{P}$. Lemma B.12 allows us to conclude.
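Since each subproblem $B_j$ is a weighted isotonic regression, a compact implementation of $\mathrm{Iso}(\boldsymbol\xi, \boldsymbol b)$ may help the reader; in practice one would rather rely on an existing routine such as scikit-learn's IsotonicRegression. The following Python sketch implements the pool-adjacent-violators algorithm and cross-checks it against the min--max formula from the proof of Lemma B.12; it is an illustration only.

\begin{verbatim}
import numpy as np

def iso(xi, b):
    # Iso(xi, b): minimize sum_k b_k (z_k - xi_k)^2 over increasing z,
    # via the pool-adjacent-violators algorithm. Blocks store
    # (weighted mean, total weight, length).
    blocks = []
    for x, wgt in zip(xi, b):
        blocks.append((x, wgt, 1))
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            m2, w2, l2 = blocks.pop()
            m1, w1, l1 = blocks.pop()
            blocks.append(((w1 * m1 + w2 * m2) / (w1 + w2), w1 + w2, l1 + l2))
    return np.concatenate([np.full(l, mu) for mu, _, l in blocks])

def iso_minmax(xi, b):
    # Min-max characterization from the proof of Lemma B.12.
    d = len(xi)
    A = lambda r, s: np.dot(b[r:s + 1], xi[r:s + 1]) / b[r:s + 1].sum()
    return np.array([min(max(A(r, s) for r in range(i + 1))
                         for s in range(i, d)) for i in range(d)])

rng = np.random.default_rng(2)
xi, b = rng.normal(size=8), rng.random(8) + 0.1
print(np.allclose(iso(xi, b), iso_minmax(xi, b)))   # True
\end{verbatim}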
Lemma B.13. In the context of Section 3, we have that
\[ Df_\theta\bigl(\theta, v(\theta)\bigr) \;=\; 2\Bigl(f_\theta\bigl(\theta + v(\theta)\bigr) - f(\theta)\Bigr). \]

Proof of Lemma B.13. Notice first that the nature of $f_\theta$ yields that
\[ g : [0, \infty) \to \mathbb{R}, \quad t \mapsto f_\theta\bigl(\theta + t\,v(\theta)\bigr) - f(\theta), \]
is a strictly convex and quadratic function such that $g(0) = 0$ and $g(1) = f_\theta\bigl(\theta + v(\theta)\bigr) - f(\theta)$. Furthermore, since $\theta + v(\theta) = \eta^*$ is the unique minimizer of $f_\theta$ in $\Theta$, the function $g$ attains its minimum precisely at $t = 1$, so $g'(1) = 0$. In consequence,
\[ g(t) \;=\; 2\Bigl(f_\theta\bigl(\theta + v(\theta)\bigr) - f(\theta)\Bigr) t - \Bigl(f_\theta\bigl(\theta + v(\theta)\bigr) - f(\theta)\Bigr) t^2. \]
Hence, we find that
\[ Df_\theta\bigl(\theta, v(\theta)\bigr) \;=\; \lim_{t \downarrow 0} \frac{f_\theta\bigl(\theta + t\,v(\theta)\bigr) - f(\theta)}{t} \;=\; \lim_{t \downarrow 0} \frac{g(t)}{t} \;=\; 2\Bigl(f_\theta\bigl(\theta + v(\theta)\bigr) - f(\theta)\Bigr). \]

B.5 Proofs for Section 4

Lemma B.14. For any compact subset $\tilde\Theta \subset \Theta$, there exists a constant $D = D(\tilde\Theta) > 0$ such that
\[ \max_{t \in (0,1]} \Bigl(f(\theta) - f\bigl(\theta + t\,v(\theta)\bigr)\Bigr) \;\ge\; \frac{\Bigl(\min\bigl(-Df\bigl(\theta, v(\theta)\bigr),\, D\bigr)\Bigr)^2}{2D} \quad\text{for all } \theta \in \tilde\Theta. \]

Proof of Lemma B.14. In order to ease the readability of the proof, we assume that the parameter $\theta$ and all the relevant objects have been vectorized beforehand. That means, every pair of indices $(i,j) \in \mathcal{P}$ is enumerated, resulting in a problem with $K := \#\mathcal{P}$ dimensions.

The compactness of $\tilde\Theta$ and the continuity of $v$ (see Lemma 3.1) and $H$ yield the existence of constants $D_1, D_2 > 0$ such that
\[ \sup_{\theta \in \tilde\Theta} \|v(\theta)\| \;\le\; D_1, \qquad \sup_{s \in [0,1]}\, \sup_{\theta \in \tilde\Theta}\, \lambda_{\max}\Bigl(H\bigl(\theta + s\,v(\theta)\bigr)\Bigr) \;\le\; D_2. \]
We therefore obtain the following Taylor approximation of $f$ around $\theta \in \tilde\Theta$:
\[ f\bigl(\theta + t\,v(\theta)\bigr) \;=\; f(\theta) + t \cdot Df\bigl(\theta, v(\theta)\bigr) + t^2 \cdot v(\theta)^\top R\bigl(\theta + t\,v(\theta)\bigr)\, v(\theta), \]
where $R(x) = \bigl(R_{kl}(x)\bigr)_{k,l=1}^K$ is defined by
\[ R_{kl}(x) \;:=\; \int_0^1 (1-s)\, H_{kl}\bigl(\theta + s(x - \theta)\bigr)\, ds \]
and is such that
\[ \Bigl| R_{kl}\bigl(\theta + t\,v(\theta)\bigr) \Bigr| \;\le\; \int_0^1 (1-s) \Bigl| H_{kl}\bigl(\theta + st\,v(\theta)\bigr) \Bigr|\, ds \;\le\; \int_0^1 (1-s)\, \lambda_{\max}\Bigl(H\bigl(\theta + st\,v(\theta)\bigr)\Bigr)\, ds \;\le\; D_2/2, \]
since $\max_{k,l} |A_{kl}| \le \lambda_{\max}(A)$ for all positive definite matrices $A \in \mathbb{R}^{K \times K}$. Therefore, we find that
\[ \Bigl| v(\theta)^\top R\bigl(\theta + t\,v(\theta)\bigr)\, v(\theta) \Bigr| \;\le\; \frac{D_2}{2}\, K\, \|v(\theta)\|^2 \;\le\; \frac{K D_1^2 D_2}{2}, \]
where the first inequality results from Cauchy--Schwarz and the equivalence of norms. Consequently, with $D := K D_1^2 D_2$ we have that
\begin{align*}
\max_{t \in (0,1]} \Bigl(f(\theta) - f\bigl(\theta + t\,v(\theta)\bigr)\Bigr)
&= \max_{t \in (0,1]} \Bigl(-t \cdot Df\bigl(\theta, v(\theta)\bigr) - t^2 \cdot v(\theta)^\top R\bigl(\theta + t\,v(\theta)\bigr)\, v(\theta)\Bigr) \\
&\ge \max_{t \in (0,1]} \Bigl(-t \cdot Df\bigl(\theta, v(\theta)\bigr) - t^2 D/2\Bigr) \\
&= \begin{cases} Df\bigl(\theta, v(\theta)\bigr)^2 / (2D) & \text{if } -Df\bigl(\theta, v(\theta)\bigr)/D \in (0,1], \\[1mm] -Df\bigl(\theta, v(\theta)\bigr) - D/2 & \text{if } -Df\bigl(\theta, v(\theta)\bigr)/D > 1, \end{cases} \\
&\ge \frac{\Bigl(\min\bigl(-Df\bigl(\theta, v(\theta)\bigr), D\bigr)\Bigr)^2}{2D}
\end{align*}
for all $\theta \in \tilde\Theta$.

Lemma B.15. For $\theta_0 \in \Theta$, the sequence $(\theta_n)_{n \ge 0}$ defined by $\theta_{n+1} := \psi(\theta_n)$ for all $n \ge 0$ is such that $\bigl(f(\theta_n)\bigr)_{n \ge 0}$ is convergent and $\lim_{n \to \infty} Df\bigl(\theta_n, v(\theta_n)\bigr) = 0$.

Proof of Lemma B.15. By construction of $\psi$, the sequence $\bigl(f(\theta_n)\bigr)_{n \ge 0}$ is strictly decreasing and bounded from below by $f(\theta^*)$. It is therefore convergent. In particular, we have that
\[ \lim_{n \to \infty}\, f(\theta_n) - f(\theta_{n+1}) \;=\; 0. \tag{B.5} \]
Suppose now ab absurdo that the desired claim does not hold. That means, there exist $\delta > 0$ and a subsequence $\bigl(n(k)\bigr)_{k \ge 1}$ of $(n)_{n \ge 0}$ such that
\[ -Df\bigl(\theta_{n(k)}, v(\theta_{n(k)})\bigr) \;\ge\; \delta \quad\text{for all } k \ge 1. \]
By convexity of $f$, the set
\[ \tilde\Theta \;:=\; \bigl\{\theta \in \Theta : f(\theta) \le f(\theta_0)\bigr\} \]
is a compact subset of $\Theta$ which contains $\bigl(\theta_{n(k)}\bigr)_{k \ge 1}$, so by Lemma B.14, there exists a constant $D > 0$ such that
\[ f(\theta_{n(k)}) - f\bigl(\theta_{n(k)+1}\bigr) \;\ge\; \frac{1}{2} \max_{t \in (0,1]} \Bigl(f(\theta_{n(k)}) - f\bigl(\theta_{n(k)} + t\,v(\theta_{n(k)})\bigr)\Bigr) \;\ge\; \frac{\Bigl(\min\bigl(-Df\bigl(\theta_{n(k)}, v(\theta_{n(k)})\bigr), D\bigr)\Bigr)^2}{4D} \;\ge\; \frac{\bigl(\min(\delta, D)\bigr)^2}{4D} \;>\; 0. \]
This contradicts property (B.5) and concludes the proof.
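The mechanism behind Lemmas B.14 and B.15 is the standard one for descent methods with a line search: a uniformly bounded second-order term converts each directional derivative into a guaranteed decrease, and the convergence of $\bigl(f(\theta_n)\bigr)_{n \ge 0}$ then forces the directional derivatives to vanish. The following toy Python sketch illustrates this on a smooth convex function; the negative gradient stands in for the direction $v(\theta)$ and an Armijo-type step-halving rule stands in for $\psi$, neither of which is the actual object of Section 4.

\begin{verbatim}
import numpy as np

# Smooth, strictly convex test function with minimum log(3) at the origin.
def f(th):
    a = np.array([th[0] + th[1], -th[0], -th[1]])
    return np.log(np.exp(a).sum())

def grad(th):
    a = np.array([th[0] + th[1], -th[0], -th[1]])
    p = np.exp(a - a.max()); p /= p.sum()
    return np.array([p[0] - p[1], p[0] - p[2]])

def psi(th):
    v = -grad(th)                 # stand-in for the direction v(theta)
    Df = -(v @ v)                 # directional derivative Df(theta, v) <= 0
    t = 1.0                       # step-halving (Armijo-type) line search
    while t > 1e-12 and f(th + t * v) - f(th) > 0.5 * t * Df:
        t /= 2
    return th + t * v, Df

th = np.array([2.0, -1.5])
for _ in range(60):
    th, Df = psi(th)
print(f(th) - np.log(3.0), Df)    # both tend to 0, as in Lemma B.15
\end{verbatim}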
Proof of Theorem 4.1. From Lemma B.15, we know already that $\bigl(f(\theta_n)\bigr)_{n \ge 0}$ is convergent. Ab absurdo, suppose that
\[ \lim_{n \to \infty} f(\theta_n) \;=\; f_\infty \;>\; f(\theta^*). \]
Because $(\theta_n)_{n \ge 0}$ is a sequence in the compact set
\[ \tilde\Theta \;:=\; \bigl\{\theta \in \Theta : f(\theta) \le f(\theta_0)\bigr\}, \]
there exist a subsequence $\bigl(n(k)\bigr)_{k \ge 1}$ of $(n)_{n \ge 0}$ and an element $\tilde\theta \in \tilde\Theta$ such that $\lim_{k \to \infty} \theta_{n(k)} = \tilde\theta$. The continuity of $f$ yields
\[ f(\tilde\theta) \;=\; \lim_{k \to \infty} f(\theta_{n(k)}) \;=\; f_\infty \;>\; f(\theta^*), \]
so $\tilde\theta \ne \theta^*$, but Lemma B.15 and the continuity of $\theta \mapsto Df\bigl(\theta, v(\theta)\bigr) = \nabla f(\theta)^\top v(\theta)$ give
\[ 0 \;=\; \lim_{n \to \infty} Df\bigl(\theta_n, v(\theta_n)\bigr) \;=\; \lim_{k \to \infty} Df\bigl(\theta_{n(k)}, v(\theta_{n(k)})\bigr) \;=\; Df\bigl(\tilde\theta, v(\tilde\theta)\bigr) \;<\; 0, \]
which is a contradiction. Thus, $\lim_{n \to \infty} f(\theta_n) = f(\theta^*)$.