Unifying Lower Bounds on Prediction Dimension of Consistent Convex Surrogates
Proceedings of Machine Learning Research vol 134:1–23, 2021
Jessie Finocchiaro [email protected]
Rafael Frongillo [email protected]
Bo Waggoner [email protected]
CU Boulder
Abstract
Given a prediction task, understanding when one can and cannot design a consistent convex surrogate loss, particularly a low-dimensional one, is an important and active area of machine learning research. The prediction task may be given as a target loss, as in classification and structured prediction, or simply as a (conditional) statistic of the data, as in risk measure estimation. These two scenarios typically involve different techniques for designing and analyzing surrogate losses. We unify these settings using tools from property elicitation, and give a general lower bound on prediction dimension. Our lower bound tightens existing results in the case of discrete predictions, showing that previous calibration-based bounds can largely be recovered via property elicitation. For continuous estimation, our lower bound resolves an open problem on estimating measures of risk and uncertainty.
1. Introduction
A surrogate loss function is an error measure that is related but not identical to one's target problem of interest. Selecting a hypothesis by minimizing surrogate risk is one of the most widespread techniques in supervised machine learning. There are two main reasons why a surrogate loss is necessary: (1) the target loss does not satisfy some desiderata, such as convexity, or (2) the goal is to estimate some target statistic and there is no target loss, as in many continuous estimation problems. In both settings, a key criterion for choosing a surrogate loss is consistency, a precursor to excess risk bounds and convergence rates. Roughly speaking, consistency means that minimizing surrogate risk corresponds to solving the target problem of interest, i.e. in (1) the target risk is also minimized, or in (2) the continuous prediction approaches the true conditional statistic.

Despite the ubiquity of surrogate losses, we lack general frameworks to design and analyze consistent surrogates. This state of affairs is especially dire when one seeks low prediction dimension, the dimension of the surrogate prediction domain. For example, in multiclass classification with n labels, the prediction domain might be R^n. In many type (1) settings, such as structured prediction and extreme classification, the prediction dimension can easily become intractably large, forcing one to sacrifice consistency for computational efficiency. To understand whether this sacrifice is necessary, recent work developed tools like the feasible subspace dimension to lower bound the prediction dimension of any consistent convex surrogate (Ramaswamy and Agarwal, 2016). Challenges of type (2) include risk measures such as conditional value at risk (CVaR), with applications in financial regulation, robust engineering design, and algorithmic fairness. Risk measures provably cannot be specified via a target loss, and thus we seek a surrogate loss of low (or at least finite) prediction dimension. Recent work (Fissler et al., 2016; Frongillo and Kash, 2020) gives prediction dimension bounds for some of these risk measures, but without the requirement that the surrogate be convex: bounds for convex surrogates are left as a major open question.

We present a unification of existing techniques to bound the prediction dimension of consistent convex surrogates in both settings above. Applied to settings of type (1), we recover the feasible subspace dimension result of Ramaswamy and Agarwal (2016), and give an example where our bound is even tighter. For type (2), we give the first prediction dimension bounds for risk measures with respect to convex surrogates, addressing the open question above. Our framework rests on property elicitation, a weaker condition than calibration, as a tool to understand consistency across a wide variety of domains.

The "four quadrants" of problem types
Above, we discuss a significant divergence in previous frameworks: constructing a surrogate given a target loss versus a target statistic. In addition to the two possible targets, we may have one of two domains: a discrete (i.e. finite) target prediction space, like a classification problem, or a continuous one, like a regression or estimation problem. We informally refer to the four resulting cases (target loss vs. target statistic, and discrete vs. continuous predictions) as the "four quadrants" of supervised learning problems, shown in Table 1. For further examples, see Appendix E.
Literature on consistency and calibration
We focus on the construction of consistent surrogate losses L : R^d × Y → R, roughly meaning that minimizing L-loss corresponds to solving the target problem of interest. When given a target loss ℓ, we roughly define L to be consistent if minimizing L, and applying a link function, minimizes ℓ (Definition 5) (Zhang, 2004; Bartlett et al., 2006; Tewari and Bartlett, 2007; Steinwart, 2007; Ramaswamy and Agarwal, 2016). When given a target statistic such as the conditional quantile or variance, but no target loss, we introduce a notion of consistency in line with classical statistics (Definition 6) (Györfi et al., 2006; Fan and Yao, 1998; Ruppert et al., 1997). Here we will define L to be consistent if minimizing L and applying a link function yields estimates converging to the correct value.

A priori, it is not clear that compatible definitions of consistency could be given for both target statistics and target losses. In fact, we observe that consistency for target losses is a special case of consistency for target statistics (§ 3). This observation suggests property elicitation (see § 2.1) as a useful tool to study general lower bounds.

As definitions of consistency are relatively intractable to apply directly, the literature often focuses on a weaker condition called calibration, which only applies when given a target loss, e.g. Quadrants 1 and 3. In particular, Zhang (2004); Lin (2004); Bartlett et al. (2006); Tewari and Bartlett (2007); Ramaswamy and Agarwal (2016) show the equivalence of consistency and calibration in Quadrant 1, where one is given a target loss and a discrete prediction set. We discuss the additional relationship of elicitation and calibration in Appendix A, and derive Theorem 8 via calibration.

Contributions
First, we formalize a notion of consistency with respect to a target statistic (Definition 6) and show its relationship to consistency with respect to a target loss (Lemma 7). We then show indirect elicitation is a necessary condition for consistency (Theorem 8). With these tools in hand, we present a new framework for deriving lower bounds on the prediction dimension of consistent convex surrogates (Corollaries 12 and 13) via indirect elicitation. These bounds are the first to our knowledge that can be applied in all four quadrants. Moreover, our framework can also give tighter bounds than previously existed in the literature. We illustrate this sharpness with new bounds for well-studied problems such as the abstain loss (§ 5) and variance, CVaR, and other measures of risk and uncertainty (§ 6). See Figure 1 for a roadmap of our main results.

               Target loss                        Target statistic
Discrete       Q1: Classification                 Q2: Risk-averse classification (Appendix E)
Continuous     Q3: Least-squares regression       Q4: Variance estimation

Table 1: The four quadrants of problem types, with an example of interest for each.

Figure 1: Flow and implications of our results. Compared to calibration, we suggest indirect elicitation as a simpler but almost-as-powerful necessary condition for consistency. In particular, we obtain a testable necessary condition, based on d-flats, for whether there exists a d-dimensional consistent convex surrogate. This condition recovers and strengthens existing calibration-based results.
2. Setting
We consider supervised learning problems in the space
X × Y, for some feature space X and a label space Y, with data drawn from a distribution D over X × Y. The task is to produce a hypothesis f : X → R, for some prediction space R, which may be different from Y. For example, in ranking problems, R may be all |Y|! permutations over the |Y| labels forming Y. As we focus on conditional distributions p := D_x = Pr[Y | X = x] over Y given some x ∈ X, we often abstract away x, working directly with a convex set of distributions over outcomes P ⊆ ∆_Y. We then write e.g. E_p L(·, Y) to mean the expectation when Y ∼ p.

If given, we use ℓ : R × Y → R to denote a target loss, with predictions r ∈ R. Similarly, L : R^d × Y → R will typically denote a surrogate loss, with surrogate predictions u ∈ R^d. We write L_d for the set of B(R^d) ⊗ Y-measurable and lower semi-continuous surrogates L : R^d × Y → R such that E_{Y∼p} L(u, Y) < ∞ for all u ∈ R^d, p ∈ P, that are minimizable in that argmin_u E_p L(u, Y) is nonempty for all p ∈ P. Moreover, L^cvx_d ⊆ L_d is the set of convex (in R^d for every y ∈ Y) losses in L_d. Set L = ∪_{d∈N} L_d, and L^cvx = ∪_{d∈N} L^cvx_d. A loss ℓ : R × Y → R is discrete if R is a finite set. For a given p ∈ P, the (conditional) regret, or excess risk, of a loss L is given by R_L(u, p) := E_p L(u, Y) − inf_{u*} E_p L(u*, Y). Typically, we notate finite report sets R.

2.1. Property elicitation

Arising from the statistics and economics literature, property elicitation is similar to calibration, but only characterizes exact minimizers of a surrogate (Savage, 1971; Osband and Reichelstein, 1985; Lambert et al., 2008; Lambert and Shoham, 2009; Lambert, 2018; Frongillo and Kash, 2015, 2014). Specifically, given a statistic or property Γ of interest, which maps a distribution p ∈ P ⊆ ∆_Y to the set of desired or correct predictions, the minimizers of L should precisely coincide with Γ. For example, squared loss L(r, y) = (r − y)² elicits the mean Γ(p) = E_p Y. For intuition, to relate to consistency, one can think of p = Pr[Y | X = x] as a conditional distribution, though the definition is also applied to point prediction settings.
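As a quick sanity check of this running example, the following toy snippet (ours, not from the paper; the outcome set and distribution are arbitrary illustrative choices) verifies numerically that the report minimizing expected squared loss coincides with the mean E_p Y.

```python
import numpy as np

# Minimal numerical illustration (hypothetical toy code, not from the paper): for squared
# loss L(r, y) = (r - y)^2 over a finite outcome set, the report minimizing the expected
# loss under p coincides with the mean E_p[Y], i.e. squared loss elicits the mean.

rng = np.random.default_rng(0)
Y = np.array([0.0, 1.0, 2.0])           # finite outcome set
p = rng.dirichlet(np.ones(len(Y)))      # a random distribution p in the simplex

grid = np.linspace(-1.0, 3.0, 4001)     # candidate reports r
expected_loss = np.array([p @ (r - Y) ** 2 for r in grid])
r_star = grid[expected_loss.argmin()]

print(f"argmin of E_p[(r - Y)^2] ~= {r_star:.3f},  mean E_p[Y] = {p @ Y:.3f}")
# The two agree up to grid resolution.
```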
Definition 1 (Property, elicits) A property is a set-valued function Γ : P → 2^R \ {∅}, which we denote Γ : P ⇒ R. A loss L : R × Y → R elicits the property Γ if for all p ∈ P,

    Γ(p) = argmin_{u ∈ R} E_p L(u, Y). (1)

An example is the mean, Γ(p) = {E_p Y}. The level set of Γ at value r ∈ R is Γ_r := {p ∈ P : r ∈ Γ(p)}. We call a property Γ : P ⇒ R discrete if R is a finite set, as in Quadrants 1 and 2. A property is single-valued if |Γ(p)| = 1 for all p ∈ P, in which case we may write Γ : P → R and Γ(p) ∈ R. The mean is single-valued. We define the range of a property by range Γ = ∪_{p∈P} Γ(p) ⊆ R. When L ∈ L, we use Γ := prop_P[L] to denote the unique property elicited by L (for distributions in P) from eq. (1). Typically, we denote the target property by γ, and the surrogate by Γ.

To relate property elicitation to consistency, we need to allow for a link function, which gives rise to the notion of indirect elicitation. For single-valued properties, this definition reduces to the natural requirement γ = ψ ∘ Γ.

Definition 2 (Indirect Elicitation)
A surrogate loss and link (L, ψ) indirectly elicit a property γ : P ⇒ R if L elicits a property Γ : P ⇒ R^d such that for all u ∈ R^d, we have Γ_u ⊆ γ_{ψ(u)}. We say L indirectly elicits γ if such a link ψ exists.

An important caveat to the above definitions is that, since Γ = prop_P[L] is nonempty everywhere, we must have L ∈ L, meaning that E_p L(·, Y) always achieves a minimum. This restriction is also implicit in e.g. (Agarwal and Agarwal, 2015). While some popular surrogates such as logistic and exponential loss are not minimizable, these losses are still covered in Corollary 13 and Theorem 17, as Γ(p) ≠ ∅ when p ∈ P := relint(∆_Y); moreover, by thresholding L̃(u, y) = max(L(u, y), ε) for sufficiently small ε > 0, we have L̃ ∈ L for both. We expect that a generalization of property elicitation which allows for "infinite" predictions (e.g., along a prescribed ray), thereby ensuring a minimum is always achieved for convex losses, would allow us to lift the minimizable restriction entirely.

Various works have studied the minimum prediction dimension d needed in order to construct a consistent surrogate loss L : R^d × Y → R, typically through proxies such as calibration (Steinwart and Christmann, 2008; Agarwal and Agarwal, 2015; Ramaswamy and Agarwal, 2016) and property elicitation (Frongillo and Kash, 2015; Fissler et al., 2016; Frongillo and Kash, 2020). In Quadrant 1, Ramaswamy and Agarwal (2016) introduce a special case of convex consistency dimension (Definition 3), which led to consistent convex surrogates for discrete prediction problems such as hierarchical classification (Ramaswamy et al., 2015) and classification with an abstain option (Ramaswamy et al., 2018).

Definition 3 (Convex Consistency Dimension)
Given a target loss ℓ : R × Y → R or property γ : P ⇒ R, its convex consistency dimension cons_cvx(·) is the minimum dimension d such that there exist L ∈ L^cvx_d and link ψ such that (L, ψ) is consistent with respect to ℓ or γ. Consistency is defined for a target loss in Definition 5 and for a target property in Definition 6.

In the case of a target property γ, i.e. a statistic, Lambert et al. (2008) similarly introduce the notion of elicitation complexity, later generalized by Frongillo and Kash (2020), which captures the lowest prediction dimension of a surrogate which indirectly elicits γ. This notion is quite general as it includes continuous estimation settings and does not inherently depend on a target loss being given.

Definition 4 (Convex Elicitation Complexity)
Given a target property γ, the convex elicitation complexity elic_cvx(γ) is the minimum dimension d such that there is an L ∈ L^cvx_d indirectly eliciting γ.

Agarwal and Agarwal (2015, Corollary 10) provide a necessary condition for the direct convex elicitation of single-valued properties, yielding bounds on the dimensionality of level sets. Moreover, Finocchiaro et al. (2019) study surrogate losses which embed a discrete loss, which is a special case of indirect elicitation. Finocchiaro et al. (2020) further introduce the notion of embedding dimension, which is a lower bound on both convex elicitation complexity of discrete properties and convex consistency dimension of discrete losses and finite statistics.
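Before turning to lower bounds, the following illustrative snippet (ours, not from the paper) makes Definitions 1 and 2 concrete on the smallest possible example: the hinge surrogate with the sign link indirectly elicits the binary 0-1 target property. The choice of surrogate, link, grid, and sampled distributions are our own illustrative assumptions.

```python
import numpy as np

# Hedged numerical check of Definition 2 (toy code, not from the paper): the hinge loss
# L(u, y) = max(0, 1 - y*u) on Y = {-1, +1} with the sign link psi indirectly elicits the
# 0-1 target property gamma(p) = argmin_r Pr(Y != r), i.e. the most likely label.

rng = np.random.default_rng(1)
u_grid = np.linspace(-3.0, 3.0, 601)                  # surrogate reports u in R

def hinge_risk(u, p):                                 # E_p[max(0, 1 - Y u)], p = Pr(Y = +1)
    return p * max(0.0, 1.0 - u) + (1.0 - p) * max(0.0, 1.0 + u)

psi = lambda u: 1 if u >= 0 else -1                   # link from surrogate report to label

all_ok = True
for p in rng.uniform(0.01, 0.99, size=200):
    risks = np.array([hinge_risk(u, p) for u in u_grid])
    minimizers = u_grid[risks <= risks.min() + 1e-9]  # Gamma(p), up to grid resolution
    gamma_p = {1} if p > 0.5 else {-1}                # 0-1 optimal label (ties not sampled)
    all_ok &= all(psi(u) in gamma_p for u in minimizers)

print("Gamma_u contained in gamma_{psi(u)} for all sampled p:", all_ok)
```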
3. Consistency implies indirect elicitation
In this section, we connect consistency of any surrogate to an indirect elicitation requirement. This will allow us to show indirect elicitation gives state-of-the-art lower bounds on the prediction dimension of consistent convex surrogates.

We start by formalizing consistency in two ways that generalize across our four quadrants. First, given a target loss ℓ, we say L is consistent if optimizing L and applying a link ψ optimizes ℓ (Definition 5). Second, given a target property γ, such as the α-quantile, we say L is consistent if optimizing L implies approaching, in some sense, the correct statistic γ(D_x) of the conditional distributions D_x = Pr[Y | X = x] (Definition 6). We then observe that Definition 5 is subsumed by Definition 6, and use this to show consistency implies L indirectly elicits prop_P[ℓ] or γ respectively.

Definition 5 (Consistent: loss)
A loss L ∈ L and link (L, ψ) are D-consistent for a set D of distributions over X × Y with respect to a target loss ℓ if, for all D ∈ D and all sequences of measurable hypothesis functions {f_m : X → R^d},

    E_D L(f_m(X), Y) → inf_f E_D L(f(X), Y)  ⟹  E_D ℓ((ψ∘f_m)(X), Y) → inf_f E_D ℓ((ψ∘f)(X), Y).

For a given convex set P ⊆ ∆_Y, we simply say (L, ψ) is consistent if it is D-consistent for some D satisfying the following: for all p ∈ P, there exists D ∈ D and x ∈ X such that D has a point mass on x and p = D_x.

Instead of a target loss ℓ, one may want to learn a target property, i.e. a conditional statistic such as the expected value, variance, or entropy. In this case, following the tradition in the statistics literature on conditional estimation (Györfi et al., 2006; Fan and Yao, 1998; Ruppert et al., 1997), we formalize consistency as converging to the correct conditional estimates of the property. Convergence is measured by functions µ(r, p) that formalize how close r is to "correct" for conditional distribution p. In particular we should have µ(r, p) = 0 ⟺ r ∈ γ(p).

Definition 6 (Consistent: property)
Suppose we are given a loss L ∈ L, link function ψ : R^d → R, and property γ : P ⇒ R. Moreover, let µ : R × P → R_+ be any function satisfying µ(r, p) = 0 ⟺ r ∈ γ(p). We say (L, ψ) is (µ, D)-consistent with respect to γ if, for all D ∈ D and sequences of measurable functions {f_m : X → R^d},

    E_D L(f_m(X), Y) → inf_f E_D L(f(X), Y)  ⟹  E_X µ(ψ∘f_m(X), D_X) → 0. (2)

We simply say (L, ψ) is µ-consistent if it is (µ, D)-consistent for some D satisfying the following: for all p ∈ P, there exists D ∈ D and x ∈ X such that D has a point mass on x and p = D_x. Additionally, we say (L, ψ) is consistent if there is a µ such that (L, ψ) is µ-consistent.

Typical definitions of consistency require D to be the set of all distributions over X × Y, while our conditions are much weaker. As the main focus of this paper is lower bounds on the prediction dimension, i.e., showing that surrogates of a certain prediction dimension cannot exist, these weaker conditions translate to stronger impossibility statements.

Given a target loss ℓ, we can define a statistic γ, the property it elicits. Intuitively, consistency of a surrogate L with respect to ℓ and γ are equivalent, i.e. in both cases estimates should converge to values that minimize ℓ-loss. We formalize this by letting µ be the ℓ-regret, yielding Lemma 7, proven in Appendix D.

Lemma 7
Let a convex P ⊆ ∆_Y be given. Given a surrogate loss L ∈ L, link ψ, and target loss ℓ, set µ(r, p) := R_ℓ(r, p). Then there is a D such that (L, ψ) is D-consistent with respect to ℓ if and only if (L, ψ) is (µ, D)-consistent with respect to γ := prop_P[ℓ].

Because each target loss in L elicits some property, but not all target properties can be elicited by a loss (e.g. the variance), consistency with respect to a property is the strictly broader notion. This points to indirect elicitation as a natural necessary condition for consistency, as formalized in Theorem 8.

Theorem 8
For a surrogate L ∈ L, if the pair (L, ψ) is consistent with respect to a property γ : P ⇒ R or a loss ℓ eliciting γ, then (L, ψ) indirectly elicits γ.

Proof
By Lemma 7, it suffices to show the result for consistency with respect to a property γ, setting γ := prop_P[ℓ] if ℓ is given instead. We show the contrapositive; suppose (L, ψ) does not indirectly elicit γ, meaning we have some p ∈ P and u ∈ Γ(p) with ψ(u) ∉ γ(p), where Γ := prop_P[L]. (Here we use the fact that Γ(p) ≠ ∅.) By definition, if we had consistency, there must be some distribution D on X × Y with a point mass on some x ∈ X and D_x = p. Consider a constant sequence {f_m} with f_m = f such that f(x) = u, so that E_D L(f_m(X), Y) = E_{D_x} L(f_m(x), Y) = E_p L(u, Y). Since u ∈ Γ(p), we have E_p L(u, Y) = inf_f E_{D_x} L(f(x), Y) = inf_f E_D L(f(X), Y). In particular, we have E_D L(f_m(X), Y) → inf_f E_D L(f(X), Y). However, we have E_X µ(ψ∘f_m(X), D_X) = µ(ψ(f_m(x)), p) = µ(ψ(u), p) ≠ 0, since ψ(u) ∉ γ(p). Therefore (L, ψ) is not consistent with respect to γ (Definition 6).

This result allows us to state elicitation complexity as a lower bound for convex consistency dimension.

Corollary 9
Given a property γ : P ⇒ R or loss ℓ : R × Y → R eliciting γ, we have elic_cvx(γ) ≤ cons_cvx(γ) = cons_cvx(ℓ).
4. Prediction Dimension of Consistent Convex Surrogates
We now turn to the question of bounding the prediction dimension of a consistent convex surrogate. From Theorem 8, given a target property γ or loss ℓ with γ = prop_P[ℓ], this task reduces to lower bounding the prediction dimension of a convex surrogate indirectly eliciting γ. We now explore two tools, Corollaries 12 and 13, for proving such convex elicitation lower bounds. The key idea, crystallized from the proofs of Ramaswamy and Agarwal (2016, Theorem 16) and Agarwal and Agarwal (2015, Theorem 9), is to consider a particular distribution p and surrogate prediction u ∈ R^d which is optimal for p. Lemma 11 will show that if d is small, then the level set {p ∈ P : u ∈ argmin_{u′} E_p L(u′, Y)} must be large; in fact, it must roughly contain a high-dimensional flat. By definition of indirect elicitation, there is some level set γ_r (where u is linked to r) containing this flat as well. The use of this result is to leverage the contrapositive: if γ has a level set intricate enough to not contain any high-dimensional flats, then γ cannot have a low-dimensional consistent surrogate.

Definition 10 (Flat)
For d ∈ N, a d-flat, or simply flat, is a nonempty set F = ker_P W := {q ∈ P : E_q W = 0} for some measurable W : Y → R^d.

We state our elicitation lower bounds in Corollaries 12 and 13, which, when combined with Theorem 8, yield consistency bounds. A similar result is Agarwal and Agarwal (2015, Theorem 9), which bounds the dimension of level sets of a single-valued prop_P[L]. Corollaries 12 and 13 instead bound the dimension of flats contained in the level sets, an additional power which we leverage in our examples.

Lemma 11 Let Γ : P ⇒ R^d be (directly) elicited by L ∈ L^cvx_d for some d ∈ N. Let Y be either a finite set, or Y = R, in which case we assume each p ∈ P admits a Lebesgue density supported on the same set for all p ∈ P.¹ For all u ∈ range Γ and p ∈ Γ_u, there is some V_{u,p} : Y → R^d such that p ∈ ker_P V_{u,p} ⊆ Γ_u.
1. This assumption is largely for technical convenience, to ensure that V_{u,p} does not depend on p. Any such assumption would suffice, and we suspect even that condition can be relaxed.

Proof As L is convex and elicits Γ, we have u ∈ Γ(p) ⟺ 0 ∈ ∂ E_p L(u, Y).² We proceed in two cases, depending on |Y|.

Finite Y: If Y is finite, this is additionally equivalent to 0 ∈ ⊕_y p_y ∂L(u, y), where ⊕ denotes the Minkowski sum (Hiriart-Urruty and Lemaréchal, 2012, Theorem 4.1.1). Expanding, we have ⊕_y p_y ∂L(u, y) = {Σ_{y∈Y} p_y x_y | x_y ∈ ∂L(u, y) ∀y ∈ Y}, and thus Wp = Σ_y p_y x_y = 0, where W = [x_1, ..., x_n] ∈ R^{d×n}; cf. (Ramaswamy and Agarwal, 2016, A_m in Theorem 16). Let V_{u,p} : Y → R^d, y ↦ W_y, be the function encoding the columns of W. Observe that E_p V_{u,p} = 0.

Y = R: Any L ∈ L^cvx_d satisfies the assumptions of Ioffe and Tikhomirov (1969), so we may interchange subdifferentiation and expectation. Specifically, letting 𝒱_{u,p} = {V : Y → R^d | V measurable, V(y) ∈ ∂L(u, y) p-a.s.}, we have ∂ E_p L(u, Y) = {∫ V(y) dp(y) | V ∈ 𝒱_{u,p}}. As 0 ∈ ∂ E_p L(u, Y), in particular, there is some V_{u,p} ∈ 𝒱_{u,p} such that E_p V_{u,p} = 0. For any q ∈ P, as by assumption q is supported on the same set as p, we have V_{u,p}(y) ∈ ∂L(u, y) q-a.s., so that V_{u,p} ∈ 𝒱_{u,q}. Thus, E_q V_{u,p} = 0 implies 0 ∈ ∂ E_q L(u, Y) by the above.

In both cases, we take the flat F := ker_P V_{u,p}, and have p ∈ F by construction. To see F ⊆ Γ_u, from the chain of equivalences above, we have for any q ∈ P that q ∈ ker_P V_{u,p} ⟹ 0 ∈ ∂ E_q L(u, Y) ⟹ u ∈ Γ(q) ⟹ q ∈ Γ_u.
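The finite-Y construction above can be traced numerically. The sketch below (ours, not from the paper) uses squared loss on a three-outcome set: it reads off the subgradients at an optimal report u, forms V_{u,p}, and checks that distributions on the resulting flat ker_P V_{u,p} keep u optimal. The particular loss, outcome set, and distributions are illustrative assumptions.

```python
import numpy as np

# Toy sketch of the finite-Y case of Lemma 11 (assumptions ours): for L(u, y) = (u - y)^2
# on Y = {1, 2, 3} and u optimal for p, the subgradients x_y = 2(u - y) give V(y) = x_y with
# E_p[V] = 0, and every q with E_q[V] = 0 also has u as a minimizer, i.e. the flat ker_P V
# sits inside the level set Gamma_u.

Y = np.array([1.0, 2.0, 3.0])
p = np.array([0.2, 0.5, 0.3])
u = p @ Y                                     # minimizer of E_p[(u - Y)^2]
V = 2.0 * (u - Y)                             # V(y) = dL/du(u, y)
assert abs(p @ V) < 1e-10                     # p lies in ker_P V

direction = np.array([1.0, -2.0, 1.0])        # moving this way leaves E[V] unchanged
grid = np.linspace(0.0, 4.0, 4001)
for t in (-0.1, -0.05, 0.05, 0.1):
    q = p + t * direction                     # another distribution with E_q[V] = 0
    assert q.min() >= 0 and abs(q @ V) < 1e-10
    best = grid[np.argmin([q @ (r - Y) ** 2 for r in grid])]
    print(f"t = {t:+.2f}: argmin E_q[L] = {best:.3f}   (u = {u:.3f})")
```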
Knowing indirect elicitation implies the existence of such a flat, we now apply Theorem 8 and Lemma 11 to construct lower bounds on convex consistency dimension.

Corollary 12 Let target property γ : P ⇒ R and d ∈ N be given. Let Y be either a finite set, or Y = R, in which case we assume each p ∈ P admits a Lebesgue density supported on the same set for all p ∈ P. Let p ∈ P with |γ(p)| = 1, and take γ(p) = {r}. If there is no d-flat F with p ∈ F ⊆ γ_r, then cons_cvx(γ) ≥ elic_cvx(γ) ≥ d + 1.

Proof
Let (L, ψ) indirectly elicit γ, where L ∈ L^cvx_d, and let Γ = prop_P[L]. As Γ is non-empty, there is some u ∈ Γ(p). Since γ is single-valued at p, we have r = ψ(u); by Lemma 11, we know there is a d-flat F = ker_P V_{u,p} so that p ∈ F ⊆ Γ_u. By definition of indirect elicitation, we additionally have Γ_u ⊆ γ_r. Thus, we have p ∈ F ⊆ γ_r. If no flat F satisfies the above conditions, then no L ∈ L^cvx_d indirectly elicits γ, so elic_cvx(γ) ≥ d + 1, and recall cons_cvx(γ) ≥ elic_cvx(γ) by Corollary 9.

Corollary 13
Let an elicitable target property γ : P ⇒ R be given, where P ⊆ ∆_Y is defined over a finite set of outcomes Y, and let d ∈ N. Let p ∈ relint(P) and r ∈ γ(p). If there is no d-flat F with p ∈ F ⊆ γ_r, then cons_cvx(γ) ≥ elic_cvx(γ) ≥ d + 1.

Proof
Let (L, ψ) indirectly elicit γ, and let the convex function L elicit Γ. As Γ is non-empty, there is some u ∈ Γ(p); suppose r′ = ψ(u). Take F ⊆ Γ_u to be the flat that exists by Lemma 11. If r′ = r, then p ∈ F ⊆ Γ_u ⊆ γ_r by indirect elicitation. Otherwise, by Lemma 39, for elicitable properties with p ∈ γ_r ∩ γ_{r′}, we observe p ∈ F ⊆ γ_{r′} ⟺ p ∈ F ⊆ γ_r. As above, if no flat F satisfies the above conditions, then no L ∈ L^cvx_d indirectly elicits γ, so cons_cvx(γ) ≥ elic_cvx(γ) ≥ d + 1, recalling Corollary 9 for the first inequality.

2. Here ∂ denotes the subdifferential, ∂f(x₀) = {z : f(x) − f(x₀) ≥ ⟨z, x − x₀⟩ ∀x}.
5. Discrete-valued predictions
The main known technique for lower bounds on surrogate prediction dimension is given by Ramaswamy and Agarwal (2016) for Quadrant 1 (target loss and discrete predictions). The proof builds heavily on the "limits of sequences" in the definition of calibration. By restricting slightly to the broad class of minimizable losses L^cvx, we show their bound follows relatively directly from Corollary 13. (We conjecture that the minimizability restriction to L^cvx can be lifted; see § 7.) Ramaswamy and Agarwal (2016) construct what they call the subspace of feasible directions and give bounds in terms of its dimension.

Definition 14 (Subspace of feasible directions)
The subspace of feasible directions S_C(p) of a convex set C ⊆ R^n at p ∈ C is S_C(p) = {v ∈ R^n : ∃ ε₀ > 0 such that p + εv ∈ C ∀ε ∈ (−ε₀, ε₀)}.

Ramaswamy and Agarwal (2016) give a lower bound on the dimensionality of all consistent convex surrogates, namely cons_cvx(ℓ) ≥ ‖p‖₀ − dim(S_{γ_r}(p)) − 1 for all p and r ∈ γ(p), in the setting where one is given a discrete prediction problem and target loss over finite outcomes. It turns out that the subspace of feasible directions is essentially a special case of a flat described by Lemma 11. So, by making a slight restriction to the class of minimizable convex surrogates L^cvx, we can derive this lower bound from our general technique in a way that we find shorter and simpler.

Corollary 15 (Ramaswamy and Agarwal (2016) Theorem 18)
Let ℓ : R × Y → R be a discrete loss eliciting γ : ∆_Y ⇒ R with Y finite. Then for all p ∈ ∆_Y and r ∈ γ(p),

    cons_cvx(γ) ≥ ‖p‖₀ − dim(S_{γ_r}(p)) − 1. (3)

Proof [Sketch] If cons_cvx(γ) ≤ d, then there is an L ∈ L^cvx_d so that L is consistent with respect to γ, and in turn, indirectly elicits γ. Lemma 11 says that there is some d-flat F = ker_P V such that p ∈ F ⊆ γ_r. In particular, if p ∈ relint(∆_Y), we can see dim(F) ≤ dim(S_{γ_r}(p)). Since affhull(∆_Y) has dimension |Y| − 1 = ‖p‖₀ − 1, by rank-nullity and rank(V) ≤ d (more precisely, of the corresponding linear map q ↦ E_q V), we have d ≥ ‖p‖₀ − 1 − dim(S_{γ_r}(p)). When p ∉ relint(∆_Y), we can project down to the subsimplex on the support of p, again of dimension ‖p‖₀ − 1, and modify L and ℓ accordingly. Now p is in the relative interior of this subsimplex, so the above gives cons_cvx(γ) ≥ ‖p‖₀ − 1 − dim(S_{γ_r}(p)), where now S is relative to R^{supp(p)}. Finally, the feasible subspace dimension in the projected space is the same as in the original space because of p's location on a face of ∆_Y.

There are some cases where the bound provided by Corollaries 12 and 13 is strictly tighter than the bound provided by feasible subspace dimension in Corollary 15. For an example of how Corollary 12 applies to a discrete property for which there is no target loss, i.e. a non-elicitable property (Quadrant 2), which is not considered by Ramaswamy et al. (2018), we refer the reader to Appendix E.

Example: High-confidence classification.
Given the target loss ℓ_abs(r, y) := I{r ∉ {y, ⊥}} + (1/2) I{r = ⊥}, we can consider the abstain property it elicits, where one predicts the most likely outcome y if Pr[Y = y | x] ≥ 1/2 and ⊥ otherwise. Ramaswamy and Agarwal (2016) present a convex surrogate for the abstain loss that takes as input a prediction whose dimension is logarithmic in the number of outcomes, yielding new upper bounds on cons_cvx(ℓ_abs) which are an exponential improvement over previous results, e.g., Crammer and Singer (2001).

To lower bound the dimension of convex surrogates, we can consider two different distributions; in the first, our bound yields a strict gap over the feasible subspace dimension bound, and in the second, the bounds are equal. First, we choose p = • to be the uniform distribution (see Figure 2). In this case, the bound by feasible subspace dimension yields cons_cvx(ℓ_abs) ≥ 3 − dim(S_{γ_⊥}(•)) − 1 = 0, since dim(S_{γ_⊥}(•)) = 2. When intersected with the simplex, one can see that any line (a 1-flat, since • ∈ relint(∆_Y)) in the simplex through • also leaves the cell γ_⊥, which contains •. See Figure 2 (R) for intuition; a 1-flat through p ∈ relint(∆_Y) would be a line in such a figure. Therefore, we have no 1-flat containing • staying in γ_⊥, so we obtain a better lower bound, cons_cvx(ℓ_abs) ≥ 2. Combining this with the upper bounds given by Ramaswamy et al. (2018), we observe the bound cons_cvx(ℓ_abs) = 2 is tight in this case with |Y| = 3.

Our bounds sometimes match those of Ramaswamy and Agarwal (2016); consider the distribution ⋆ = (1/4, 1/4, 1/2). Here the feasible subspace dimension of γ_⊥ and γ_3 at ⋆ is 1, since one only moves toward the distributions (0, 1/2, 1/2) and (1/2, 0, 1/2) without leaving the level sets, and the three points are collinear in affhull(∆_Y), suggesting dim(S_{γ_⊥}(⋆)) = 1. This yields cons_cvx(ℓ_abs) ≥ 3 − 1 − 1 = 1. Since the 1-flat through these three points is itself contained in γ_⊥ and γ_3, Corollary 13 also yields only cons_cvx(ℓ_abs) ≥ 1 at ⋆, matching the feasible subspace bound. Generally, d-flats appear to work well at distributions where previous bounds via feasible subspace dimension would have been vacuous. In essence, flats allow us a "global" view of the property we are eliciting, while the feasible subspace method only permits a "local" look at the property, so we find our method works better for distributions in relint(∆_Y).

Figure 2: (Left) Feasible subspace dimensions dim(S_{γ_⊥}(•)) = 2 and dim(S_{γ_⊥}(⋆)) = 1, giving the bounds cons_cvx(ℓ_abs) ≥ 0 and cons_cvx(ℓ_abs) ≥ 1, respectively. (Right) No 1-flat F through • (a line, since • ∈ relint(∆_Y)) stays fully contained in γ_⊥, so cons_cvx(ℓ_abs) ≥ 2.
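The flat-based argument at the uniform distribution can also be checked numerically. The sketch below (ours, not from the paper; |Y| = 3 and the tolerances are illustrative choices) samples tangent directions at the uniform distribution: every direction is locally feasible for γ_⊥ (so the feasible-subspace bound is vacuous there), yet no full line through the uniform distribution stays inside γ_⊥, in line with the lower bound of 2.

```python
import numpy as np

# Hedged numerical companion to the abstain example (toy code, not from the paper), with
# Y = {1, 2, 3} and the abstain cell gamma_bot = {p : max_y p_y <= 1/2}.

rng = np.random.default_rng(3)
center = np.ones(3) / 3.0                            # the uniform distribution
in_gamma_bot = lambda p: p.max() <= 0.5 + 1e-12      # no label has probability above 1/2

locally_feasible, lines_inside = 0, 0
ts = np.linspace(-1.0, 1.0, 2001)[:, None]           # parametrizes a line through `center`
for _ in range(500):
    v = rng.normal(size=3)
    v -= v.mean()                                    # tangent to the simplex (sums to zero)
    v /= np.linalg.norm(v)
    # (i) small two-sided steps stay in gamma_bot, so the direction is locally feasible
    locally_feasible += in_gamma_bot(center + 1e-3 * v) and in_gamma_bot(center - 1e-3 * v)
    # (ii) restrict the full line to the simplex and ask whether it stays in gamma_bot
    pts = center + ts * v
    pts = pts[pts.min(axis=1) >= -1e-12]             # keep only points inside the simplex
    lines_inside += bool(np.all(pts.max(axis=1) <= 0.5 + 1e-12))

print(f"locally feasible directions at the uniform point: {locally_feasible} / 500")
print(f"full lines through the uniform point contained in gamma_bot: {lines_inside} / 500")
```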
6. Continuous-valued predictions
In continuous estimation problems, often one is not given a target loss, but instead a target (conditional) statistic of the data one wishes to estimate. Examples include estimating the mean or variance of y conditioned on a given x. In this setting, Lemma 11 gives lower bounds on the prediction dimension of convex losses with a link to the desired conditional statistic, i.e., the convex elicitation complexity. In particular, Theorem 17 below yields new bounds on the convex elicitation complexity of statistics which quantify risk or uncertainty, such as variance, entropy, or financial risk measures.

These bounds address an open question of Frongillo and Kash (2020), that of developing a theory of elicitation complexity with respect to convex-elicitable properties. The lower bounds of previous work are essentially all with respect to identifiable properties; a property is d-identifiable if its level sets are all d-flats. Frongillo and Kash (2020) rely on finding a dimension d such that the level sets of certain risk measures γ have too much curvature to contain any d-flat. Thus, the elicitation complexity with respect to identifiable properties is greater than d.

In contrast, properties elicited by non-smooth convex losses are generally not identifiable. For example, the properties elicited by hinge loss and the abstain surrogate are not identifiable, as their level sets are not flats (see Figure 2). It therefore might appear that entirely new ideas are needed. Our framework is closely related to identifiability, however; Lemma 11 states that the level sets of d-dimensional convex-elicitable properties, if not d-flats themselves, are unions of d-flats. Thus, the general logic of Frongillo and Kash (2020) can still apply. In particular, we recover their main lower bound for the large class of Bayes risks.
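As a concrete instance of a Bayes risk (Definition 16 below), the following toy snippet (ours, not from the paper; the distribution is an arbitrary illustrative choice) checks numerically that the Bayes risk of squared loss is the variance, the fact driving the variance example later in this section.

```python
import numpy as np

# Toy check (not from the paper): the Bayes risk of squared loss, inf_r E_p[(r - Y)^2],
# equals Var(p), since the infimum is attained at the mean.

Y = np.array([0.0, 1.0, 3.0, 6.0])
p = np.array([0.1, 0.4, 0.3, 0.2])

grid = np.linspace(-2.0, 8.0, 10001)
bayes_risk = min(p @ (r - Y) ** 2 for r in grid)   # inf_r E_p[(r - Y)^2] over a grid
variance = p @ Y ** 2 - (p @ Y) ** 2

print(f"Bayes risk of squared loss ~= {bayes_risk:.4f}")
print(f"Var(p)                      = {variance:.4f}")
```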
Definition 16 Given a loss function L : R × Y → R for some report set R, the Bayes risk of L is defined as L̲(p) := inf_{r∈R} E_p L(r, Y).

Condition 1 For some r₀ ∈ range Γ, the level set Γ_{r₀} = ker_P V is a d-flat presented by some V : Y → R^d such that 0 ∈ int{E_p V : p ∈ P}.

Theorem 17
Let P be a set of Lebesgue densities supported on the same set for all p ∈ P. Let Γ : P → R^d satisfy Condition 1 for some r₀ ∈ R^d. Let L ∈ L^cvx elicit Γ such that its Bayes risk L̲ is non-constant on Γ_{r₀}. Then cons_cvx(L̲) ≥ elic_cvx(L̲) ≥ d + 1.

We now illustrate the theorem with two important examples: variance and conditional value at risk. Several other applications from Frongillo and Kash (2020), such as spectral risk measures, entropy, and norms, follow similarly.
Example: Variance.
As a warm-up, let us see how to show elic_cvx(Var) = 2, meaning the lowest dimension of a convex loss to estimate conditional variance is 2. This lower bound will follow from Theorem 17 using the fact that variance is the Bayes risk of squared loss L(r, y) = (r − y)², which elicits the mean Γ(p) = E_p Y. Interestingly, while perhaps intuitively obvious, even this simple result is novel. In particular, the well-known fact that the variance is not elicitable does not yield a lower bound of 2, as it does not rule out the variance being a link of a real-valued convex-elicitable property; cf. Frongillo and Kash (2020, Remark 1).

Corollary 18
Let P be a set of continuous Lebesgue densities on Y = R with all p ∈ P having the same support. If there exist p, q₁, q₂ ∈ P with E_p Y = E_{q₁} Y ≠ E_{q₂} Y and Var(p) ≠ Var(q₁), then cons_cvx(Var) = elic_cvx(Var) = 2.
Proof For the upper bound, we may elicit the first two moments via the convex loss L(r, y) = (r₁ − y)² + (r₂ − y²)², and recover the variance via ψ(r) = r₂ − r₁², giving elic_cvx(Var) ≤ 2. For the lower bound, suppose without loss of generality that E_{q₁} Y < E_{q₂} Y. Let r₀ = (E_{q₁} Y + E_{q₂} Y)/2, and define V : Y → R, y ↦ y − r₀. Then ker_P V = {p′ ∈ P | E_{p′} Y = r₀} = Γ_{r₀}, where Γ : p′ ↦ E_{p′} Y is the mean. As E_{q₁} Y < r₀ < E_{q₂} Y, we conclude E_{q₁} V < 0 < E_{q₂} V. We have now satisfied Condition 1 for d = 1. To apply Theorem 17, it remains to show that Var is non-constant on Γ_{r₀}. By our assumptions and the definition of Var, we have E_p Y² ≠ E_{q₁} Y². Letting p₁ = (q₁ + q₂)/2 and p₂ = (p + q₂)/2, we have E_{p_i} Y = r₀ for i ∈ {1, 2}, but E_{p₁} Y² = (E_{q₁} Y² + E_{q₂} Y²)/2 ≠ (E_p Y² + E_{q₂} Y²)/2 = E_{p₂} Y². As p₁, p₂ have the same mean but different second moments, we conclude Var(p₁) ≠ Var(p₂).
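The upper-bound construction in the proof is easy to check empirically. The following sketch (ours, not from the paper; the sampling distribution and grids are illustrative choices) minimizes the two-dimensional convex surrogate over grids on simulated data and applies the link ψ(r) = r₂ − r₁².

```python
import numpy as np

# Hedged numerical companion to the upper bound in Corollary 18 (toy code, not from the
# paper): the convex loss L(r, y) = (r1 - y)^2 + (r2 - y^2)^2 elicits the first two moments,
# and the link psi(r) = r2 - r1^2 recovers the variance, so elic_cvx(Var) <= 2.

rng = np.random.default_rng(4)
y = rng.gamma(shape=2.0, scale=1.5, size=20_000)       # samples standing in for p

# The loss is separable in (r1, r2), so minimizing each coordinate over its grid
# minimizes the joint empirical surrogate risk.
r1_grid = np.linspace(0.0, 10.0, 801)
r2_grid = np.linspace(0.0, 40.0, 801)
r1 = r1_grid[np.argmin([np.mean((r - y) ** 2) for r in r1_grid])]
r2 = r2_grid[np.argmin([np.mean((r - y ** 2) ** 2) for r in r2_grid])]

print("psi(r) = r2 - r1^2 ~=", r2 - r1 ** 2)
print("sample variance     =", y.var())
```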
Example: Conditional Value at Risk.
Frongillo and Kash (2020) observe that one of the most prominent financial risk measures, the conditional value at risk (CVaR), can be expressed as a Bayes risk. In particular, for 0 < α < 1, we may define

    CVaR_α(p) = inf_{r∈R} E_p[ (1/α)(r − Y)·1{r ≥ Y} − r ], (4)

which is the Bayes risk of the transformed pinball loss L_α(r, y) = (1/α)(r − y)·1{r ≥ y} − r. In turn, L_α elicits the α-quantile, the quantity q_α(p) such that Pr_p[Y ≤ q_α(p)] = α. Following Frongillo and Kash (2020), we will restrict to the set P_q of probability measures over R with connected support and whose CDFs are strictly increasing on their support, so that q_α is single-valued. Under mild assumptions, we find that there is no consistent real-valued convex surrogate for CVaR_α.
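The Bayes-risk representation in eq. (4) can likewise be verified numerically. The sketch below (ours, not from the paper; the sampling distribution, level α, and grid are illustrative choices) minimizes the empirical pinball objective and compares the attained minimum with the direct lower-tail expression implied by the display.

```python
import numpy as np

# Numerical companion to eq. (4) (toy code; conventions follow the display above, which
# treats CVaR_alpha as the Bayes risk of the transformed pinball loss): the minimizer is
# the alpha-quantile, and the attained minimum matches -E[Y | Y <= q_alpha].

rng = np.random.default_rng(5)
alpha = 0.1
y = rng.normal(loc=0.0, scale=1.0, size=50_000)          # samples standing in for p

def objective(r):                                         # E_p[(1/alpha)(r - Y)1{r >= Y} - r]
    return np.mean((r - y) * (r >= y) / alpha - r)

r_grid = np.linspace(-4.0, 4.0, 1601)
vals = np.array([objective(r) for r in r_grid])
r_star = r_grid[vals.argmin()]

q_alpha = np.quantile(y, alpha)                           # Pr[Y <= q_alpha] = alpha
cvar_direct = -y[y <= q_alpha].mean()                     # -E[Y | Y <= q_alpha]
print(f"argmin r      = {r_star:.3f}   alpha-quantile     = {q_alpha:.3f}")
print(f"min objective = {vals.min():.3f}   -E[Y | Y <= q_a] = {cvar_direct:.3f}")
```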
Corollary 19 Let P be a set of continuous Lebesgue densities on Y = R with all p ∈ P having support on the same interval. If we have p₁, p₂, p₃, p₄ ∈ P with q_α(p₁) < q_α(p₂), ..., then cons_cvx(CVaR_α) ≥ elic_cvx(CVaR_α) ≥ 2.

We conjecture that in fact elic_cvx(CVaR_α) ≥ 3, which if true would constitute an interesting gap between elicitation complexity for identifiable and convex-elicitable properties.

7. Conclusions and future work

In this work, we show that indirect property elicitation can be a powerful necessary condition for the existence of a consistent surrogate loss (Theorem 8). Furthermore, we introduce a new lower bound (Corollaries 12 and 13) on convex consistency dimension that is generally applicable and extends previous results from both the discrete (Corollary 15) and continuous (Corollaries 18 and 19) estimation settings.

Several important questions remain open. Particularly for the discrete settings, we would like to know whether one can lift the restriction that surrogates always achieve a minimum; we conjecture positively. Of course, we would like to characterize cons_cvx and elic_cvx and develop a general framework for constructing surrogates achieving the best possible prediction dimension. Moreover, the practical reason why consistency is desired is to ensure the guarantee of empirical risk minimization (ERM) rates; however, the relationship between ERM rates and property elicitation has not been studied.

References

Arpit Agarwal and Shivani Agarwal. On consistent surrogate risk minimization and property elicitation. In JMLR Workshop and Conference Proceedings, volume 40, pages 1–19, 2015.

Peter L. Bartlett, Michael I. Jordan, and Jon D. McAuliffe. Convexity, classification, and risk bounds. Journal of the American Statistical Association, 101(473):138–156, 2006. URL http://amstat.tandfonline.com/doi/abs/10.1198/016214505000000907.

Koby Crammer and Yoram Singer. On the algorithmic implementation of multiclass kernel-based vector machines. Journal of Machine Learning Research, 2(Dec):265–292, 2001.

Jianqing Fan and Qiwei Yao. Efficient estimation of conditional variance functions in stochastic regression. Biometrika, 85(3):645–660, 1998. doi: 10.1093/biomet/85.3.645.

Jessie Finocchiaro, Rafael Frongillo, and Bo Waggoner. An embedding framework for consistent polyhedral surrogates. In Advances in Neural Information Processing Systems, 2019.

Jessie Finocchiaro, Rafael Frongillo, and Bo Waggoner. Embedding dimension of polyhedral losses. In Conference on Learning Theory, 2020.

Tobias Fissler. On higher order elicitability and some limit theorems on the Poisson and Wiener space. PhD thesis, 2017.

Tobias Fissler, Johanna F. Ziegel, and others. Higher order elicitability and Osband's principle. The Annals of Statistics, 44(4):1680–1707, 2016.

Gerald B. Folland. Real analysis: modern techniques and their applications, volume 40. John Wiley & Sons, 1999.

Rafael Frongillo and Ian Kash. General truthfulness characterizations via convex analysis. In Web and Internet Economics, pages 354–370. Springer, 2014.

Rafael Frongillo and Ian Kash. Vector-valued property elicitation. In Proceedings of the 28th Conference on Learning Theory, pages 1–18, 2015.

Rafael Frongillo and Ian A. Kash. Elicitation complexity of statistical properties. Biometrika, 2020. doi: 10.1093/biomet/asaa093.

László Györfi, Michael Kohler, Adam Krzyzak, and Harro Walk. A distribution-free theory of nonparametric regression. Springer Science & Business Media, 2006.
Jean-Baptiste Hiriart-Urruty and Claude Lemaréchal. Fundamentals of convex analysis. Springer Science & Business Media, 2012.

Aleksandr Davidovich Ioffe and Vladimir Mikhailovich Tikhomirov. On minimization of integral functionals. Functional Analysis and Its Applications, 3(3):218–227, 1969.

Nicolas S. Lambert. Elicitation and evaluation of statistical forecasts. 2018. URL https://web.stanford.edu/~nlambert/papers/elicitability.pdf.

Nicolas S. Lambert and Yoav Shoham. Eliciting truthful answers to multiple-choice questions. In Proceedings of the 10th ACM Conference on Electronic Commerce, pages 109–118, 2009.

Nicolas S. Lambert, David M. Pennock, and Yoav Shoham. Eliciting properties of probability distributions. In Proceedings of the 9th ACM Conference on Electronic Commerce, pages 129–138, 2008.

Yi Lin. A note on margin-based loss functions in classification. Statistics & Probability Letters, 68(1):73–82, 2004.

Kent Osband and Stefan Reichelstein. Information-eliciting compensation schemes. Journal of Public Economics, 27(1):107–115, 1985. doi: 10.1016/0047-2727(85)90031-3.

Kent Harold Osband. Providing Incentives for Better Cost Forecasting. University of California, Berkeley, 1985.

Harish Ramaswamy, Ambuj Tewari, and Shivani Agarwal. Convex calibrated surrogates for hierarchical classification. In International Conference on Machine Learning, pages 1852–1860, 2015.

Harish G. Ramaswamy and Shivani Agarwal. Convex calibration dimension for multiclass loss matrices. The Journal of Machine Learning Research, 17(1):397–441, 2016.

Harish G. Ramaswamy, Ambuj Tewari, Shivani Agarwal, et al. Consistent algorithms for multiclass classification with an abstain option. Electronic Journal of Statistics, 12(1):530–554, 2018.

David Ruppert, M. P. Wand, Ulla Holst, and Ola Hösjer. Local polynomial variance-function estimation. Technometrics, 39(3):262–273, 1997. doi: 10.1080/00401706.1997.10485117.

L. J. Savage. Elicitation of personal probabilities and expectations. Journal of the American Statistical Association, pages 783–801, 1971.

Ingo Steinwart. How to compare different loss functions and their risks. Constructive Approximation, 26(2):225–287, 2007.

Ingo Steinwart and Andreas Christmann. Support Vector Machines. Springer Science & Business Media, 2008.

Ambuj Tewari and Peter L. Bartlett. On the consistency of multiclass classification methods. The Journal of Machine Learning Research, 8:1007–1025, 2007. URL http://dl.acm.org/citation.cfm?id=1390325.

Tong Zhang. Statistical behavior and consistency of classification methods based on convex risk minimization. Annals of Statistics, pages 56–85, 2004.

Appendix A. Notes on calibration

When given a discrete target loss, such as for classification-like problems, direct empirical risk minimization is typically NP-hard, forcing one to find a more tractable surrogate. To ensure consistency, the literature has embraced the notion of calibration from Steinwart and Christmann (2008, Chapter 3), which aligns with the definition in Tewari and Bartlett (2007) for multiclass classification, and its generalizations to arbitrary discrete target losses (Agarwal and Agarwal, 2015; Ramaswamy and Agarwal, 2016).
Calibration is more tractable and weaker than consistency, yet the two are equivalent under suitable assumptions (Tewari and Bartlett, 2007; Ramaswamy and Agarwal, 2016), notably in Quadrant 1. Intuitively, calibration says one cannot achieve the optimal surrogate loss while linking to a suboptimal target prediction.

Definition 20 (Calibrated: Quadrant 1) Let ℓ : R × Y → R be a discrete target loss. A surrogate loss L : R^d × Y → R and link ψ : R^d → R pair (L, ψ) is P-calibrated with respect to ℓ if for all p ∈ P:

    inf_{u ∈ R^d : ψ(u) ∉ argmin_r E_p ℓ(r,Y)} E_p L(u, Y)  >  inf_{u ∈ R^d} E_p L(u, Y). (5)

We simply say L is calibrated if P = ∆_Y.

Many works characterize calibrated surrogates for specific discrete target losses (Zhang, 2004; Lin, 2004; Bartlett et al., 2006; Tewari and Bartlett, 2007), including the canonical 0-1 loss for binary and multiclass classification. We give another definition of calibration, which is a special case of calibration via Steinwart and Christmann (2008), and show it is equivalent to Definition 20 in discrete prediction settings, but can be applied in continuous estimation settings as well. We use this more general definition of calibration when proving statements about the relationship between consistency, calibration, and indirect elicitation.

The close connection between indirect elicitation and consistency was first explored by Agarwal and Agarwal (2015). In particular, calibration of L ∈ L with respect to ℓ implies indirect elicitation quite directly: take u ∈ R^d and p ∈ Γ_u, implying u ∈ Γ(p). From eq. (1), E_p L(u, Y) = inf_{u′ ∈ R^d} E_p L(u′, Y), so we must have ψ(u) ∈ γ(p) from eq. (5), as desired.

Definition 21 (Calibrated: Quadrants 1 and 3) A loss L : R^d × Y → R is P-calibrated with respect to a loss ℓ : R × Y → R if there is a link ψ : R^d → R such that, for all distributions p ∈ P, there exists a function ζ : R_+ → R_+ with ζ continuous at 0^+ and ζ(0) = 0 such that for all u ∈ R^d, we have

    ℓ(ψ(u); p) − ℓ̲(p) ≤ ζ( E_p L(u, Y) − L̲(p) ). (6)

If P = ∆_Y, we simply say (L, ψ) is calibrated.

Consider the following four conditions, where we are given ζ : R_+ → R_+.

A. ζ satisfies ζ(0) = 0 and ζ is continuous at 0.
B. For all sequences {ε_m}, ε_m → 0 ⟹ ζ(ε_m) → 0.
C. For all u ∈ R^d, R_ℓ(ψ(u); p) ≤ ζ(R_L(u; p)).
D. For all p ∈ P and sequences {u_m} so that R_L(u_m; p) → 0, we have R_ℓ(ψ(u_m); p) → 0.

Note that (A ∧ C) defines calibration as in Definition 21, and we show A ⟺ B in Lemma 23. Lemma 24 shows calibration holds if and only if D, which yields a condition equivalent to calibration without dependence on the function ζ.

Proposition 22 When R and Y are finite, a continuous loss and link (L, ψ) are P-calibrated with respect to a target loss ℓ via Definition 21 if and only if they are P-calibrated via Definition 20.

Proof (⟹) We prove the contrapositive; if (L, ψ) is not calibrated with respect to ℓ by Definition 20, then it is not calibrated via Definition 21 either. If (L, ψ) are not calibrated with respect to ℓ by Definition 20, then there is a p ∈ P so that inf_{u : ψ(u) ∉ γ(p)} E_p L(u, Y) = inf_u E_p L(u, Y). Thus there is a sequence {u_m} so that ψ(u_m) ∉ γ(p) for each m and E_p L(u_m, Y) → L̲(p). Now we have R_L(u_m; p) → 0 but R_ℓ(ψ(u_m); p) ↛ 0, so by Lemma 24, we contradict calibration by Definition 21.

(⟸) Suppose there is a function ζ satisfying the bound in eq. (6) for a fixed distribution p ∈ P.
Observe that the bound in eq. (5) can be written as R_L(u, p) > 0 for all p ∈ ∆_Y and u such that ψ(u) ∉ γ(p). By eq. (6), for any sequence {u_m} so that ψ(u_m) ∉ γ(p), we must have R_ℓ(ψ(u_m), p) ≤ ζ(R_L(u_m, p)). As R and Y are finite, R_ℓ(ψ(u_m), p) is bounded away from 0 whenever ψ(u_m) ∉ γ(p), so ζ(R_L(u_m, p)) ↛ 0 and hence R_L(u_m, p) ↛ 0; thus, the strict inequality holds.

The following lemma shows that conditions A and B are equivalent, so that we can use condition B in lieu of condition A in the proof of Lemma 24.

Lemma 23 A function ζ : R_+ → R_+ is continuous at 0 with ζ(0) = 0 if and only if every sequence {u_m} → 0 satisfies ζ(u_m) → 0.

Proof (⟹) Suppose we have a sequence {u_m} → 0. By continuity, we have lim_{u_m→0} ζ(u_m) = ζ(0) = 0, so ζ(u_m) → 0.

(⟸) Suppose ζ(0) ≠ 0 or ζ is not continuous at 0. If ζ(0) ≠ 0, the constant sequence {u_m} = 0 converges to 0 but ζ(u_m) = ζ(0) ↛ 0. If ζ(0) = 0 but ζ is not continuous at 0, there must be a sequence {u_m} → 0 with lim_{m→∞} ζ(u_m) ≠ ζ(0) = 0, so ζ(u_m) ↛ 0.

We now prove the equivalence of calibration and condition D with Lemma 23 in mind.

Lemma 24 A continuous surrogate and link (L, ψ) are P-calibrated (via Definition 21) with respect to ℓ if and only if, for all p ∈ P and sequences {u_m} so that R_L(u_m; p) → 0, we have R_ℓ(ψ(u_m); p) → 0.

Proof (⟹) Take a sequence {u_m} so that R_L(u_m; p) → 0. Since ζ(0) = 0 and ζ is continuous at 0, we have ζ(R_L(u_m; p)) → 0. As the bound from Equation (6) is satisfied for all u ∈ R^d by assumption, we observe

    0 ≤ R_ℓ(ψ(u_m); p) ≤ ζ(R_L(u_m; p)) for all m
    ⟹ 0 ≤ lim_{m→∞} R_ℓ(ψ(u_m); p) ≤ lim_{m→∞} ζ(R_L(u_m; p)) = 0
    ⟹ lim_{m→∞} R_ℓ(ψ(u_m); p) = 0.

(⟸) Fix p ∈ P, and consider ζ(c) := sup_{u : R_L(u,p) ≤ c} R_ℓ(ψ(u); p). We will show that R_L(u_m; p) → 0 ⟹ R_ℓ(ψ(u_m); p) → 0 implies conditions B and C with the ζ constructed above. With ζ as constructed, we observe that the bound in equation (6) is satisfied for all u ∈ R^d, and apply Lemma 23 to observe that if there is a sequence {ε_m} → 0 with ζ(ε_m) ↛ 0, it is because R_L(u_m, p) → 0 does not imply R_ℓ(ψ(u_m), p) → 0.

First, the bound in equation (6) is satisfied for all u ∈ R^d by construction of ζ. Let S(v) := {u ∈ R^d : R_L(u; p) ≤ R_L(v, p)}. Showing R_ℓ(ψ(u); p) ≤ sup_{u′ ∈ S(u)} R_ℓ(ψ(u′); p) for all u ∈ R^d gives condition C. As u is in the set over which the supremum is taken (since R_L(u; p) ≤ R_L(u; p)), we then have the bound by definition of the supremum.

Now suppose there exists a sequence {ε_m} → 0 with ζ(ε_m) ↛ 0. Consider S(ε) = {u ∈ R^d : R_L(u, p) ≤ ε}. Then

    ε₁ ≤ ε₂ ⟹ S(ε₁) ⊆ S(ε₂) ⟹ ζ(ε₁) ≤ ζ(ε₂).

Now suppose there exists a sequence {u_m} so that R_L(u_m, p) → 0. Then for all ε > 0, there exists m₀ ∈ N so that R_L(u_m, p) < ε for all m ≥ m₀. Since this is true for all ε, we have S(ε) nonempty for all ε > 0, and therefore ζ(c) is well defined for all c > 0. Now if ζ(ε_m) ↛ 0, it must be because R_ℓ(ψ(u_m), p) ↛ 0 for some sequence {u_m} with R_L(u_m, p) → 0, contradicting the assumption. Such a sequence {u_m} with converging surrogate regret always exists by continuity and boundedness from below of the surrogate loss, since we can take the constant sequence at the (attained) infimum.

A.1. Relating calibration, consistency, and indirect elicitation

Even with the more general notion of calibration that extends beyond discrete predictions, we still have consistency implying calibration.
Proposition 25 If a loss and link (L, ψ) are consistent with respect to a loss ℓ, then they are calibrated with respect to ℓ.

Proof We show the contrapositive. If (L, ψ) are not calibrated with respect to ℓ, then there is a sequence {u_m} such that R_L(u_m; p) → 0 but R_ℓ(ψ(u_m); p) ↛ 0. Consider a distribution D over X × Y that has only one x ∈ X with Pr_D(X = x) > 0, so that p := D_x and E_D f(X, Y) = E_p f(x, Y). Consider any sequence of functions {f_m} with f_m(x) = u_m for all m. Now we have E_D L(f_m(X), Y) → inf_f E_D L(f(X), Y), but E_D ℓ(ψ∘f_m(X), Y) ↛ inf_f E_D ℓ(ψ∘f(X), Y), and therefore (L, ψ) is not consistent with respect to ℓ.

Moreover, we have calibration implying indirect elicitation.

Lemma 26 If a surrogate and link (L, ψ) with L ∈ L are calibrated with respect to a loss ℓ : R × Y → R, then L indirectly elicits the property γ := prop_P[ℓ].

Proof Let Γ be the unique property directly elicited by L, and fix p ∈ ∆_Y and u such that p ∈ Γ_u. We know such a u exists since Γ(p) ≠ ∅. As p ∈ Γ_u, then ζ(E_p L(u, Y) − L̲(p)) = ζ(0) = 0, and we observe the bound ℓ(ψ(u); p) ≤ ℓ̲(p). We also have ℓ(ψ(u); p) ≥ ℓ̲(p) by definition of ℓ̲, so we must have ℓ(ψ(u); p) = ℓ̲(p) = ℓ(γ(p); p), and therefore p ∈ γ_{ψ(u)}. Thus, we have Γ_u ⊆ γ_{ψ(u)}, so L indirectly elicits γ.

Combining the two results, we can observe the result of Theorem 8 another way: through calibration.

Appendix B. Reconstructing Ramaswamy and Agarwal (2016, Thm. 16)

Lemma 27 Let the d-flat F ⊆ P (defined over finite Y) contain some p ∈ relint(P). Then (i) p ∈ relint(F); (ii) dim(S_F(p)) ≥ dim(affhull(P)) − d.

Proof As F is a d-flat, we have some W : Y → R^d such that F = ker_P W. Throughout, given a point (typically a distribution) p and convex set P, we define P_p := P − {p}. Define T_W : span(P_p) → R^d, v ↦ E_v W.

(i) Since p ∈ relint(P), for all q ∈ P, there is some small enough ε₀ > 0 such that for α ∈ (−ε₀, ε₀), the point q_α := p − α(q − p) is still in P. In particular, for q ∈ F, we claim q_α ∈ F. As p, q ∈ F, we have E_p W = E_q W = 0. By linearity of expectation, we then have E_{q_α} W = 0. This implies q_α ∈ F, and therefore p ∈ relint(F).

(ii) We first show span(F_p) = S_F(p). First, take v ∈ S_F(p), and take ε₀ as in the definition. For ε = ε₀/2, we then have p + εv ∈ F ⟹ εv ∈ F_p, and therefore v ∈ span(F_p). Now take v ∈ span(F_p). Since p ∈ relint(F) by (i), we have 0 ∈ relint(F_p). Therefore there is an ε₀ > 0 so that εv ∈ F_p for all ε ∈ (−ε₀, ε₀) by convexity of F. Therefore, v ∈ S_F(p), and we observe S_F(p) = span(F_p).

We now show S_F(p) = ker(T_W). Observe that S_F(p) ⊆ ker(T_W) follows trivially from the definitions of the two sets. Now let v ∈ ker(T_W); we wish to show some nonzero multiple of v lies in F_p. We have E_v W = 0, so it suffices to show v′ = cv ∈ F_p for some c ≠ 0, thus showing v ∈ S_F(p). Since p ∈ relint(P), we must have 0 ∈ relint(F_p), so we know there is some small enough ε > 0 so that −αv ∈ F_p for α ∈ (−ε, ε). Take c = −α, and we conclude v ∈ S_F(p). Therefore, ker(T_W) = S_F(p).

We finally want to show dim(affhull(P)) = dim(span(P_p)). Consider that any q ∈ span(P_p) can be written as a scalar multiple of an element of P_p, which can be written as a convex combination of elements of the minimal basis of P_p.
In particular, since 0 ∈ P_p, it can be written as an affine combination of elements of the basis, so dim(affhull(P)) ≥ dim(span(P_p)). We also have affhull(P) − {p} ⊆ span(P_p), so dim(affhull(P)) = dim(affhull(P) − {p}) ≤ dim(span(P_p)). Therefore, dim(affhull(P)) = dim(span(P_p)).

As Y is a finite set, span(P_p) is a finite-dimensional vector space. The rank-nullity theorem states dim(im(T_W)) + dim(ker(T_W)) = dim(span(P_p)) = dim(affhull(P)). As dim(im(T_W)) ≤ d, and we have shown above that S_F(p) = span(F_p) = ker(T_W), the conclusion follows.

Corollary 28 (Ramaswamy and Agarwal (2016) Theorem 18) Let ℓ : R × Y → R be a discrete loss eliciting γ : ∆_Y ⇒ R with Y finite. Then for all p ∈ ∆_Y and r ∈ γ(p),

    cons_cvx(γ) ≥ ‖p‖₀ − dim(S_{γ_r}(p)) − 1. (3)

Proof Let L ∈ L^cvx_d be a calibrated surrogate for ℓ, and let Γ := prop_{∆_Y}[L]. Consider Y′ := {y ∈ Y : p_y > 0} and p′ = (p_y)_{y∈Y′} ∈ ∆_{Y′}. Take L′ := L|_{Y′} and ℓ′ := ℓ|_{Y′}. Define h : R^{Y′} → R^Y such that h(q′) = q with q_y = q′_y for y ∈ Y′ and q_y = 0 otherwise. Take Γ′ = Γ ∘ h, γ′ = γ ∘ h.

We first wish to show L′ indirectly elicits γ′. Since L indirectly elicits γ, we have a link ψ such that for all u ∈ R^d, Γ_u ⊆ γ_{ψ(u)}. As Γ′(q′) = Γ(h(q′)) and γ′(q′) = γ(h(q′)), we have q′ ∈ Γ′_u ⟺ h(q′) ∈ Γ_u ⟹ h(q′) ∈ γ_{ψ(u)} ⟺ q′ ∈ γ′_{ψ(u)}, and therefore L′ indirectly elicits γ′ via the link ψ.

We aim to show dim(S_{γ_r}(p)) ≥ dim(S_{γ′_r}(p′)). We do this by showing that h(S_{γ′_r}(p′)) ⊆ S_{γ_r}(p), and the result holds as h is linear and injective. Suppose v ∈ h(S_{γ′_r}(p′)); then there exists v′ so that v = h(v′) and an ε₀ > 0 such that εv′ + p′ ∈ γ′_r for all ε ∈ (−ε₀, ε₀). Since h is linear and h(γ′_r) ⊆ γ_r, this implies εv + p ∈ γ_r for all ε ∈ (−ε₀, ε₀). Therefore v ∈ S_{γ_r}(p), and the result follows.

As L′ indirectly elicits γ′, by Corollary 13, we know there exists a d-flat F with p′ ∈ F ⊆ γ′_r. Taking P = ∆_{Y′}, we know p′ ∈ relint(∆_{Y′}) by construction, so we can apply Lemma 27(ii), which gives dim(S_F(p′)) ≥ dim(affhull(∆_{Y′})) − d = ‖p‖₀ − 1 − d.³ Additionally, S_F(p′) ⊆ S_{γ′_r}(p′) by subset inclusion of the sets themselves. Chaining these results, we obtain dim(S_{γ_r}(p)) ≥ dim(S_{γ′_r}(p′)) ≥ dim(S_F(p′)) ≥ ‖p‖₀ − 1 − d.

3. To reason that dim(affhull(∆_{Y′})) = ‖p‖₀ − 1, observe that the uniform distribution on ∆_{Y′} has full support, so ∆_{Y′} spans an affine space of dimension ‖p‖₀ − 1.

Appendix C. Proof of Theorem 17

C.1. General setting of elicitation complexity

We briefly introduce the general notion of elicitation complexity, of which Definition 4 is a special case, as some statements are more naturally made in this general setting.

Definition 29 Γ′ refines Γ if for all r′ ∈ range Γ′ there exists r ∈ range Γ with Γ′_{r′} ⊆ Γ_r. Equivalently, Γ′ refines Γ if there is a link function ψ : range Γ′ → range Γ such that Γ′_{r′} ⊆ Γ_{ψ(r′)} for all r′ ∈ range Γ′.

Definition 30 For k ∈ N ∪ {∞}, let E_k(P) denote the class of all elicitable properties Γ : P → R^k, and E(P) := ∪_{k∈N∪{∞}} E_k(P). When P is implicit we simply write E.

Definition 31 Let C be a class of properties.
Appendix C. Proof of Theorem 17

C.1. General setting of elicitation complexity

We briefly introduce the general notion of elicitation complexity, of which Definition 4 is a special case, as some statements are more naturally made in this general setting.

Definition 29 $\Gamma'$ refines $\Gamma$ if for all $r' \in \mathrm{range}\,\Gamma'$ there exists $r \in \mathrm{range}\,\Gamma$ with $\Gamma'_{r'} \subseteq \Gamma_r$. Equivalently, $\Gamma'$ refines $\Gamma$ if there is a link function $\psi : \mathrm{range}\,\Gamma' \to \mathrm{range}\,\Gamma$ such that $\Gamma'_{r'} \subseteq \Gamma_{\psi(r')}$ for all $r' \in \mathrm{range}\,\Gamma'$.

Definition 30 For $k \in \mathbb{N} \cup \{\infty\}$, let $\mathcal{E}_k(\mathcal{P})$ denote the class of all elicitable properties $\Gamma : \mathcal{P} \to \mathbb{R}^k$, and $\mathcal{E}(\mathcal{P}) := \bigcup_{k \in \mathbb{N} \cup \{\infty\}} \mathcal{E}_k(\mathcal{P})$. When $\mathcal{P}$ is implicit we simply write $\mathcal{E}$.

Definition 31 Let $\mathcal{C}$ be a class of properties. The elicitation complexity of a property $\Gamma$ with respect to $\mathcal{C}$, denoted $\mathrm{elic}_{\mathcal{C}}(\Gamma)$, is the minimum value of $k \in \mathbb{N} \cup \{\infty\}$ such that there exists $\hat\Gamma \in \mathcal{C} \cap \mathcal{E}_k(\mathcal{P})$ that refines $\Gamma$.

C.2. Supporting statements

Proposition 32 (Osband (1985)) Let $\Gamma$ be elicitable. Then $\Gamma_r$ is convex for all $r \in \mathrm{range}\,\Gamma$.

Lemma 33 (Set-valued extension of Frongillo and Kash (2020, Lemma 4)) If $\Gamma'$ refines $\Gamma$, then $\mathrm{elic}_{\mathcal{C}}(\Gamma') \ge \mathrm{elic}_{\mathcal{C}}(\Gamma)$.

Proof As $\Gamma'$ refines $\Gamma$, we have some $\psi : \mathrm{range}\,\Gamma' \to \mathrm{range}\,\Gamma$ such that for all $r' \in \mathrm{range}\,\Gamma'$ we have $\Gamma'_{r'} \subseteq \Gamma_{\psi(r')}$. Suppose we have $\hat\Gamma \in \mathcal{C}$ and $\varphi : \mathrm{range}\,\hat\Gamma \to \mathrm{range}\,\Gamma'$ such that for all $u \in \mathrm{range}\,\hat\Gamma$ we have $\hat\Gamma_u \subseteq \Gamma'_{\varphi(u)}$. Then for all $u \in \mathrm{range}\,\hat\Gamma$ we have $\hat\Gamma_u \subseteq \Gamma'_{\varphi(u)} \subseteq \Gamma_{(\psi \circ \varphi)(u)}$. In particular, if $\mathrm{elic}_{\mathcal{C}}(\Gamma') = m$, then we have such a $\hat\Gamma : \mathcal{P} \rightrightarrows \mathbb{R}^m$, and hence $\mathrm{elic}_{\mathcal{C}}(\Gamma) \le m$.

Lemma 34 (Frongillo and Kash (2020, Lemma 8)) Suppose $L \in \mathcal{L}$ elicits $\Gamma : \mathcal{P} \to \mathbb{R}$ and has Bayes risk $\underline{L}$. Then for any $p, p' \in \mathcal{P}$ with $\Gamma(p) \neq \Gamma(p')$, we have $\underline{L}(\lambda p + (1-\lambda) p') > \lambda \underline{L}(p) + (1-\lambda) \underline{L}(p')$ for all $\lambda \in (0, 1)$.

Lemma 35 (Adapted from Frongillo and Kash (2020, Theorem 4)) If $L$ elicits a single-valued $\Gamma$, and $\hat\Gamma$ refines $\underline{L}$, then $\hat\Gamma$ refines $\Gamma$.

Proof Suppose for a contradiction that $\hat\Gamma$ does not refine $\Gamma$. Then we have some $u \in \mathrm{range}\,\hat\Gamma$ such that for all $r \in \mathrm{range}\,\Gamma$ we have $\hat\Gamma_u \not\subseteq \Gamma_r$. In particular, recalling that $\Gamma$ is single-valued, we must have $p, p' \in \hat\Gamma_u$ such that $\Gamma(p) \neq \Gamma(p')$. Moreover, as $\hat\Gamma$ refines $\underline{L}$, we also have $\underline{L}(p) = \underline{L}(p')$. From Lemma 34 with $\lambda = 1/2$, we have $\underline{L}(q) > \tfrac{1}{2}\underline{L}(p) + \tfrac{1}{2}\underline{L}(p') = \underline{L}(p)$, where $q = \tfrac{1}{2}p + \tfrac{1}{2}p'$. As the level set $\hat\Gamma_u$ is convex by Proposition 32, we also have $q \in \hat\Gamma_u$, and hence $\underline{L}(q) = \underline{L}(p)$, a contradiction.
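To make Lemma 34 concrete, here is a worked instance (our illustration, not part of the appendix): take $\mathcal{Y} \subseteq \mathbb{R}$ and the squared loss $L(u, y) = (u - y)^2$, which elicits the mean $\Gamma(p) = \mathbb{E}_p Y$ and has Bayes risk $\underline{L}(p) = \mathrm{Var}_p(Y)$. Writing $\mu = \mathbb{E}_p Y$ and $\mu' = \mathbb{E}_{p'} Y$, a direct computation gives, for the mixture $q = \lambda p + (1-\lambda) p'$,
$$\underline{L}(q) - \lambda \underline{L}(p) - (1-\lambda)\,\underline{L}(p') = \lambda(1-\lambda)(\mu - \mu')^2,$$
which is strictly positive exactly when $\Gamma(p) \neq \Gamma(p')$, as the lemma asserts in this case.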
Lemma 36 (Minor modifications from Frongillo and Kash (2020)) Let $V$ be a real vector space. Let $f : V \to \mathbb{R}^k$ be linear and $C \subseteq V$ convex with $\mathrm{span}\, C = V$, and let $m \in \mathbb{N}$. Suppose that $0 \in \mathrm{int}\, f(C)$, and for all $v \in S := C \cap \ker f$, there exists a linear $\hat f_v : V \to \mathbb{R}^m$ with $v \in C \cap \ker \hat f_v \subseteq S$. Then $m \ge k$. If $m = k$, we additionally have $0 \in \mathrm{int}\, \hat f_v(C)$ for some $v \in S$.

Proof The condition $0 \in \mathrm{int}\, f(C)$ is equivalent to the existence of some $v_1, \ldots, v_{k+1} \in C$ such that $0 \in \mathrm{int\,conv}\{f(v_i) : i \in \{1, \ldots, k+1\}\}$. Let $\alpha_1, \ldots, \alpha_{k+1} > 0$ with $\sum_{i=1}^{k+1} \alpha_i = 1$ be such that $\sum_{i=1}^{k+1} \alpha_i f(v_i) = 0$. As these are barycentric coordinates, this choice of $\alpha_i$ is unique, a fact which will be important later. We will take $v = \sum_{i=1}^{k+1} \alpha_i v_i$, an element of $C$ by convexity, and thus an element of $S$ as $f(v) = 0$.

Let $\hat f_v : V \to \mathbb{R}^m$ be linear with $v \in \hat S := C \cap \ker \hat f_v \subseteq S$. Let $\beta_1, \ldots, \beta_{k+1} \in \mathbb{R}$ with $\sum_{i=1}^{k+1} \beta_i = 0$ be such that $\sum_{i=1}^{k+1} \beta_i \hat f_v(v_i) = 0$. We will show that the $\beta_i$ must be identically zero, i.e. that $\{\hat f_v(v_i) : i \in \{1, \ldots, k+1\}\}$ are affinely independent. By construction, $v' := \sum_{i=1}^{k+1} \beta_i v_i \in \ker \hat f_v$, and as $v \in \ker \hat f_v$, for all $\lambda > 0$ we have $v_\lambda := v + \lambda v' = \sum_{i=1}^{k+1} (\alpha_i + \lambda \beta_i) v_i \in \ker \hat f_v$. Taking $\lambda$ sufficiently small, we have $\gamma_i := \alpha_i + \lambda \beta_i > 0$ for all $i$, and $\sum_{i=1}^{k+1} \gamma_i = \sum_{i=1}^{k+1} \alpha_i + \lambda \sum_{i=1}^{k+1} \beta_i = 1$. By convexity of $C$, we have $v_\lambda \in C$. Now $v_\lambda \in C \cap \ker \hat f_v \subseteq S = C \cap \ker f$, and in particular $v_\lambda \in \ker f$. Thus, $f(v_\lambda) = \sum_{i=1}^{k+1} \gamma_i f(v_i) = 0$. By the uniqueness of barycentric coordinates, for all $i \in \{1, \ldots, k+1\}$ we must have $\gamma_i = \alpha_i$, and thus $\beta_i = 0$, as desired. As $\hat f_v(C)$ contains $k+1$ affinely independent points, we have $m \ge \dim \mathrm{im}\, \hat f_v \ge k$.

When $m = k$, by affine independence, the set $\mathrm{conv}\{\hat f_v(v_i) : i \in \{1, \ldots, k+1\}\}$ has dimension $k$ in $\mathbb{R}^k$. As $0 = \hat f_v(v) = \sum_{i=1}^{k+1} \alpha_i \hat f_v(v_i)$ and $\alpha_i > 0$ for all $i$, we conclude $0 \in \mathrm{int\,conv}\{\hat f_v(v_i) : i \in \{1, \ldots, k+1\}\} \subseteq \mathrm{int}\, \hat f_v(C)$.

Lemma 37 (Frongillo and Kash (2020, Lemma 14)) Let $V$ be a real vector space. Let $f : V \to \mathbb{R}^k$ be linear, $C \subseteq V$ convex with $\mathrm{span}\, C = V$, and let $S = C \cap \ker f$. If $0 \in \mathrm{int}\, f(C)$ then $\mathrm{span}\, S = \ker f$.

C.3. Proving the lower bound for spectral risks

Let $\mathcal{C}^*_d$ be the class of properties $\Gamma$ which are elicited by a convex loss $L \in \mathcal{L}^{\mathrm{cvx}}_d$ for some $d \in \mathbb{N}$, and let $\mathcal{C}^* := \bigcup_{d \in \mathbb{N}} \mathcal{C}^*_d$. Then for all properties $\gamma$, if $\mathrm{elic}_{\mathcal{C}^*}(\gamma) < \infty$, we have $\mathrm{elic}_{\mathcal{C}^*}(\gamma) = \mathrm{elic}_{\mathrm{cvx}}(\gamma)$, a fact we use tacitly in the proof.

Theorem 17 Let $\mathcal{P}$ be a set of Lebesgue densities supported on the same set for all $p \in \mathcal{P}$. Let $\Gamma : \mathcal{P} \to \mathbb{R}^d$ satisfy Condition 1 for some $r \in \mathbb{R}^d$. Let $L \in \mathcal{L}^{\mathrm{cvx}}$ elicit $\Gamma$ such that $\underline{L}$ is non-constant on $\Gamma_r$. Then $\mathrm{cons}_{\mathrm{cvx}}(\underline{L}) \ge \mathrm{elic}_{\mathrm{cvx}}(\underline{L}) \ge d + 1$.

Proof Let $V : \mathcal{Y} \to \mathbb{R}^d$ and $r$ be given by the statement of the theorem and from Condition 1. Let $m = \mathrm{elic}_{\mathcal{C}^*}(\underline{L})$, so that we have $\hat\Gamma \in \mathcal{C}^*_m$ which refines $\underline{L}$. By Lemma 35, $\hat\Gamma$ refines $\Gamma$. We now establish the conditions of Lemma 36 for $C = \mathcal{P}$. Let $f : \mathrm{span}\,\mathcal{P} \to \mathbb{R}^d$, $p \mapsto \mathbb{E}_p V$. From Condition 1, we have $0 \in \mathrm{int}\, f(\mathcal{P})$ and $\ker f \cap \mathcal{P} = \ker_{\mathcal{P}} V = \Gamma_r$. Now let $p \in \Gamma_r$ be arbitrary, and take any $u \in \hat\Gamma(p)$. As $\Gamma$ is single-valued, $r \in \mathrm{range}\,\Gamma$ is the unique value with $p \in \Gamma_r$. As $\hat\Gamma$ refines $\Gamma$, there exists $r' \in \mathrm{range}\,\Gamma$ with $\hat\Gamma_u \subseteq \Gamma_{r'}$, and since $p \in \hat\Gamma_u$, we conclude $r' = r$ from the above. From Lemma 11, we have some $\hat V_{u,p}$ with $p \in \ker_{\mathcal{P}} \hat V_{u,p} \subseteq \hat\Gamma_u \subseteq \Gamma_r = \ker_{\mathcal{P}} V$. Letting $\hat f_p : \mathrm{span}\,\mathcal{P} \to \mathbb{R}^m$, $q \mapsto \mathbb{E}_q \hat V_{u,p}$, we have now satisfied the conditions of Lemma 36. We conclude $m \ge d$; moreover, if $m = d$, then there exists some $q \in \Gamma_r$ such that $0 \in \mathrm{int}\, \hat f_q(\mathcal{P})$.

Now suppose $m = d$ for a contradiction. Let $\hat S := \ker \hat f_q \cap \mathcal{P}$. Applying Lemma 37 to the functions $f$ and $\hat f_q$, we have $\ker f = \mathrm{span}\,\Gamma_r$ and $\ker \hat f_q = \mathrm{span}\,\hat S$. As $\hat S \subseteq \Gamma_r$, we have $\ker \hat f_q = \mathrm{span}\,\hat S \subseteq \mathrm{span}\,\Gamma_r = \ker f$. By the first isomorphism theorem, we also have $\mathrm{codim}\,\ker \hat f_q = \mathrm{codim}\,\ker f = d$, as the images of these linear maps span all of $\mathbb{R}^d$. By the third isomorphism theorem we conclude $\ker \hat f_q = \ker f$, and hence $\hat S = \mathcal{P} \cap \ker \hat f_q = \mathcal{P} \cap \ker f = \Gamma_r$. Moreover, as $\hat S \subseteq \hat\Gamma_u \subseteq \Gamma_r$, we have $\hat S = \hat\Gamma_u = \Gamma_r$.

We now see that $\underline{L}$ is constant on $\Gamma_r$: since $\hat\Gamma$ refines $\underline{L}$, there is some link function $\psi : \mathbb{R}^m \to \mathbb{R}$ such that $\Gamma_r = \hat\Gamma_u \subseteq \underline{L}_{\psi(u)}$, meaning $\underline{L}(p) = \psi(u)$ for all $p \in \Gamma_r$. This statement contradicts the assumption that $\underline{L}$ is non-constant on $\Gamma_r$.
Appendix D. Miscellaneous omitted proofs

Lemma 7 Let a convex $\mathcal{P} \subseteq \Delta_{\mathcal{Y}}$ be given. Given a surrogate loss $L \in \mathcal{L}$, link $\psi$, and target loss $\ell$, set $\mu(r, p) := R_\ell(r; p) - \underline{R}_\ell(p)$. Then there is a $\mathcal{D}$ such that $(L, \psi)$ is $\mathcal{D}$-consistent with respect to $\ell$ if and only if $(L, \psi)$ is $(\mu, \mathcal{D})$-consistent with respect to $\gamma := \mathrm{prop}_{\mathcal{P}}[\ell]$.

Proof First, observe that $\mu(r, p) = 0 \iff \mathbb{E}_p \ell(r, Y) = \inf_{r' \in \mathcal{R}} \mathbb{E}_p \ell(r', Y) \iff r \in \gamma(p)$. Now suppose $(L, \psi)$ are consistent with respect to $\ell$, and take any sequence $\{f_m\}$ of measurable hypotheses. Rewriting the right-hand side of Definition 5,
$$\mathbb{E}_{\mathcal{D}}\, \ell(\psi \circ f_m(X), Y) \to \inf_f \mathbb{E}_{\mathcal{D}}\, \ell(\psi \circ f(X), Y) \qquad (7)$$
$$\iff \mathbb{E}_X\, R_\ell(\psi \circ f_m(X); \mathcal{D}_X) \to \mathbb{E}_X\, \underline{R}_\ell(\mathcal{D}_X) \iff \mathbb{E}_X\, \mu(\psi \circ f_m(X); \mathcal{D}_X) \to 0. \qquad (8)$$
Therefore, $\mathbb{E}_{\mathcal{D}} L(f_m(X), Y) \to \inf_f \mathbb{E}_{\mathcal{D}} L(f(X), Y)$ implies (7) if and only if it implies (8). Observe that the assumptions on $L$ allow us to apply the Fubini–Tonelli Theorem (Folland, 1999, Theorem 2.37), which yields the equivalence of eq. (7) to the next line.

A hyperplane weakly separates two sets if its two closed halfspaces respectively contain the two sets.

Lemma 38 If $\gamma : \mathcal{P} \rightrightarrows \mathcal{R}$ is an elicitable property, then for any pair of predictions $r, r' \in \mathcal{R}$ where $\gamma_r \neq \gamma_{r'}$, there is a hyperplane $H = \{x \in \mathbb{R}^{\mathcal{Y}} : v \cdot x = 0\}$, for some $v \in \mathbb{R}^{\mathcal{Y}}$, that weakly separates $\gamma_r$ and $\gamma_{r'}$ and has $\gamma_r \cap H = \gamma_{r'} \cap H = \gamma_r \cap \gamma_{r'}$.

Proof Let $\ell$ elicit $\gamma$. Let $v = \ell(r, \cdot) - \ell(r', \cdot)$, interpreted as a nonzero vector in $\mathbb{R}^{\mathcal{Y}}$. Let $H = \{q : v \cdot q = 0\}$. If $v \cdot q < 0$, then $r'$ cannot be optimal, so $q \notin \gamma_{r'}$. So $\gamma_{r'} \subseteq \{q : v \cdot q \ge 0\}$. Symmetrically, $\gamma_r \subseteq \{q : v \cdot q \le 0\}$. This is weak separation, and it immediately implies that $\gamma_r \cap \gamma_{r'} \subseteq H$. Finally, $v \cdot q = 0$, i.e. $q \in H$, holds if and only if the expected losses of the two reports are equal under $q$, in which case $r$ is optimal for $q$ exactly when $r'$ is. So $q \in \gamma_r \cap H \iff q \in \gamma_{r'} \cap H$. This gives $\gamma_r \cap H = \gamma_{r'} \cap H = \gamma_r \cap \gamma_{r'} \cap H = \gamma_r \cap \gamma_{r'}$.

Lemma 39 Suppose we are given an elicitable property $\gamma : \mathcal{P} \rightrightarrows \mathcal{R}$, where $\mathcal{Y}$ is finite, and a distribution $p \in \mathrm{relint}(\mathcal{P})$ such that $p \in \gamma_r \cap \gamma_{r'}$ for $r, r' \in \mathcal{R}$. Then for any flat $F$ containing $p$, $F \subseteq \gamma_r \iff F \subseteq \gamma_{r'}$.

Proof If $\gamma_r = \gamma_{r'}$, we are done. Otherwise, Lemma 38 gives a hyperplane $H = \{x \in \mathbb{R}^{\mathcal{Y}} : v \cdot x = 0\}$ and a guarantee that $\gamma_r \subseteq \{q \in \Delta_{\mathcal{Y}} : v \cdot q \le 0\}$, while $\gamma_{r'} \subseteq \{q \in \Delta_{\mathcal{Y}} : v \cdot q \ge 0\}$, and finally $\gamma_r \cap \gamma_{r'} \subseteq H$.

Suppose $F \subseteq \gamma_r$; we wish to show $F \subseteq \gamma_{r'}$. Let $q \in F$. By Lemma 27(i), we have $p \in \mathrm{relint}(F)$, so there exists $\epsilon > 0$ such that $q' := p - \epsilon(q - p) \in F$. Now, suppose for contradiction that $q \notin \gamma_{r'}$. Then $v \cdot q < 0$: containment in $\gamma_r$ gives $v \cdot q \le 0$, and if $v \cdot q = 0$ then $q \in \gamma_r \cap H \implies q \in \gamma_{r'}$, a contradiction. But, noting that $p \in H$, we have $v \cdot q' = -\epsilon(v \cdot q) > 0$, so $q'$ is not in $\gamma_r$. This contradicts the assumption $F \subseteq \gamma_r$. Therefore, we must have $q \in \gamma_{r'}$, so we have shown $F \subseteq \gamma_{r'}$. Because $r$ and $r'$ were completely symmetric, this completes the proof.

Appendix E. Omitted Examples

Discrete problem with no target loss (Quadrant 2). Consider the following scenario, where someone is deciding how to dress for the weather based on a meteorologist's forecast. Consider the three outcomes $\mathcal{Y} = \{\text{rainy}, \text{sunny}, \text{snowy}\}$, and suppose we want some bias towards health and safety, so the meteorologist should only predict sunny weather if $\Pr[\text{sunny} \mid \text{weather data}] \ge 3/4$. Otherwise, they should predict whichever is more likely given the weather data: rain or snow.

We can now model this problem by a property with reports $\mathcal{R} = \mathcal{Y}$,
$$\gamma(p) = \begin{cases} \text{sunny} & p_{\text{sunny}} \ge 3/4 \\ \text{rainy} & p_{\text{sunny}} \le 3/4 \ \wedge\ p_{\text{rainy}} \ge p_{\text{snowy}} \\ \text{snowy} & p_{\text{sunny}} \le 3/4 \ \wedge\ p_{\text{snowy}} \ge p_{\text{rainy}}, \end{cases}$$
shown in Figure 3. Since the cells of elicitable properties in the simplex form a power diagram (Lambert and Shoham, 2009), we know that there is actually no target loss that directly elicits this problem. Constructing a consistent surrogate for this task is ill-defined without Definition 6. The function $\mu(r, p) = \mathbb{I}\{r \notin \gamma(p)\}$ now allows us to use Definition 6 to think about consistent surrogates for this task.

Intuitively, the feasible subspace dimension bound would be lowest at the distribution $p = (1/8, 3/4, 1/8)$ (ordering the coordinates rainy, sunny, snowy), so one would like to apply Corollary 12 or the bound of Ramaswamy and Agarwal (2016) at $p$. However, we cannot apply either at $p$, since $\gamma(p) = \{\text{rainy}, \text{snowy}, \text{sunny}\}$ but the property is not elicitable.
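The following sketch (ours, with hypothetical function names) evaluates the property $\gamma$ above as a set-valued map, confirming that all three reports are optimal at $p = (1/8, 3/4, 1/8)$, while $\gamma(q) = \{\text{snowy}\}$ at $q = (1/3, 1/3 - \epsilon, 1/3 + \epsilon)$.

```python
def gamma(p_rainy, p_sunny, p_snowy, tol=1e-12):
    """Set-valued weather property: sunny requires probability >= 3/4;
    otherwise report the more likely of rain vs. snow (ties allow both)."""
    reports = set()
    if p_sunny >= 3/4 - tol:
        reports.add("sunny")
    if p_sunny <= 3/4 + tol and p_rainy >= p_snowy - tol:
        reports.add("rainy")
    if p_sunny <= 3/4 + tol and p_snowy >= p_rainy - tol:
        reports.add("snowy")
    return reports

eps = 0.01
print(gamma(1/8, 3/4, 1/8))              # {'rainy', 'sunny', 'snowy'}
print(gamma(1/3, 1/3 - eps, 1/3 + eps))  # {'snowy'}
```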
Ramaswamy and Agarwal (2016, Theorem 16) cannot draw any conclusions about this property, for two reasons: first, we are given a target property instead of a target loss. Second, since the property is not elicitable (hence why there can be no target loss), we observe $\dim(S_{\gamma_{\text{rainy}}}(p)) \neq \dim(S_{\gamma_{\text{sunny}}}(p))$, contradicting the requirements of Ramaswamy and Agarwal (2016, Lemma 23).

However, our bound from Corollary 12 applies on the distribution $q = (1/3, 1/3 - \epsilon, 1/3 + \epsilon)$ for small enough $\epsilon > 0$, since $\gamma(q) = \{\text{snowy}\}$, and yields $\mathrm{elic}_{\mathrm{cvx}}(\gamma) \ge 2$ for the convex elicitation complexity: there is no way to draw a 1-flat (a line, since $q \in \mathrm{relint}(\Delta_{\mathcal{Y}})$) through $q$ while staying in just one level set on the simplex (see the numerical sketch at the end of this appendix).

This example also extends to other decision-tree-like properties that do not have an explicit or easily constructed target loss.

[Figure 3: The cells $\gamma_{\text{rainy}}$, $\gamma_{\text{sunny}}$, and $\gamma_{\text{snowy}}$ of the property $\gamma$ on the simplex over $\{\text{rainy}, \text{sunny}, \text{snowy}\}$, with two distributions of interest marked.]
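To illustrate the geometric claim numerically, the following sketch (our own check, not from the paper) samples sum-zero directions through $q$, extends each to the full chord of the simplex, and tests whether the chord stays inside the single cell $\gamma_{\text{snowy}}$; with coordinates ordered (rainy, sunny, snowy) and a small $\epsilon$, no sampled direction succeeds, consistent with the claim above.

```python
import numpy as np

rng = np.random.default_rng(0)
eps = 0.01
q = np.array([1/3, 1/3 - eps, 1/3 + eps])       # (rainy, sunny, snowy)

def in_snowy_cell(x):
    rainy, sunny, snowy = x
    return sunny <= 3/4 + 1e-12 and snowy >= rainy - 1e-12

def chord_stays_in_cell(q, v, n_grid=200):
    """Extend the line q + t*v to the largest t-interval keeping all
    coordinates nonnegative (the full chord of the simplex), then check
    membership in gamma_snowy along a grid of points on the chord."""
    with np.errstate(divide="ignore"):
        ts = -q / v                              # where each coordinate hits zero
    t_hi = min(t for t in ts if t > 0)
    t_lo = max(t for t in ts if t < 0)
    return all(in_snowy_cell(q + t * v) for t in np.linspace(t_lo, t_hi, n_grid))

hits = 0
for _ in range(5000):
    v = rng.normal(size=3)
    v -= v.mean()                                # sum-zero, so the line stays in affhull(simplex)
    hits += chord_stays_in_cell(q, v)

print(hits)   # 0: no full chord (1-flat) through q stays inside gamma_snowy
```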