Unifying Lower Bounds on Prediction Dimension of Consistent Convex Surrogates
Proceedings of Machine Learning Research vol 134:1–23, 2021
Jessie Finocchiaro [email protected]
Rafael Frongillo [email protected]
Bo Waggoner [email protected]
CU Boulder
Abstract
Given a prediction task, understanding when one can and cannot design a consistent convex surrogate loss, particularly a low-dimensional one, is an important and active area of machine learning research. The prediction task may be given as a target loss, as in classification and structured prediction, or simply as a (conditional) statistic of the data, as in risk measure estimation. These two scenarios typically involve different techniques for designing and analyzing surrogate losses. We unify these settings using tools from property elicitation, and give a general lower bound on prediction dimension. Our lower bound tightens existing results in the case of discrete predictions, showing that previous calibration-based bounds can largely be recovered via property elicitation. For continuous estimation, our lower bound resolves an open problem on estimating measures of risk and uncertainty.
1. Introduction
A surrogate loss function is an error measure that is related but not identical to one's target problem of interest. Selecting a hypothesis by minimizing surrogate risk is one of the most widespread techniques in supervised machine learning. There are two main reasons why a surrogate loss is necessary: (1) the target loss does not satisfy some desiderata, such as convexity, or (2) the goal is to estimate some target statistic and there is no target loss, as in many continuous estimation problems. In both settings, a key criterion for choosing a surrogate loss is consistency, a precursor to excess risk bounds and convergence rates. Roughly speaking, consistency means that minimizing surrogate risk corresponds to solving the target problem of interest, i.e. in (1) the target risk is also minimized, or in (2) the continuous prediction approaches the true conditional statistic.

Despite the ubiquity of surrogate losses, we lack general frameworks to design and analyze consistent surrogates. This state of affairs is especially dire when one seeks low prediction dimension, the dimension of the surrogate prediction domain. For example, in multiclass classification with n labels, the prediction domain might be R^n. In many type (1) settings, such as structured prediction and extreme classification, the prediction dimension can easily become intractably large, forcing one to sacrifice consistency for computational efficiency. To understand whether this sacrifice is necessary, recent work developed tools like the feasible subspace dimension to lower bound the prediction dimension of any consistent convex surrogate (Ramaswamy and Agarwal, 2016). Challenges of type (2) include risk measures such as conditional value at risk (CVaR), with applications in financial regulation, robust engineering design, and algorithmic fairness. Risk measures provably cannot be specified via a target loss, and thus we seek a surrogate loss of low (or at least finite) prediction dimension. Recent work (Fissler et al., 2016; Frongillo and Kash, 2020) gives prediction dimension bounds for some of these risk measures, but without the requirement that the surrogate be convex: bounds for convex surrogates are left as a major open question.

We present a unification of existing techniques to bound the prediction dimension of consistent convex surrogates in both settings above. Applied to settings of type (1), we recover the feasible subspace dimension result of Ramaswamy and Agarwal (2016), and give an example where our bound is even tighter. For type (2), we give the first prediction dimension bounds for risk measures with respect to convex surrogates, addressing the open question above. Our framework rests on property elicitation, a weaker condition than calibration, as a tool to understand consistency across a wide variety of domains.

The "four quadrants" of problem types
Above, we discuss a significant divergence in previous frameworks: constructing a surrogate given a target loss versus a target statistic. In addition to the two possible targets, we may have one of two domains: a discrete (i.e. finite) target prediction space, like a classification problem, or a continuous one, like a regression or estimation problem. We informally refer to the four resulting cases (target loss vs. target statistic, and discrete vs. continuous predictions) as the "four quadrants" of supervised learning problems, shown in Table 1. For further examples, see Appendix E.
Literature on consistency and calibration
We focus on the construction of consistent surrogate losses L : R^d × Y → R, roughly meaning that minimizing L-loss corresponds to solving the target problem of interest. When given a target loss ℓ, we roughly define L to be consistent if minimizing L, and applying a link function, minimizes ℓ (Definition 5) (Zhang, 2004; Bartlett et al., 2006; Tewari and Bartlett, 2007; Steinwart, 2007; Ramaswamy and Agarwal, 2016). When given a target statistic such as the conditional quantile or variance, but no target loss, we introduce a notion of consistency in line with classical statistics (Definition 6) (Györfi et al., 2006; Fan and Yao, 1998; Ruppert et al., 1997). Here we will define L to be consistent if minimizing L and applying a link function yields estimates converging to the correct value.

A priori, it is not clear that compatible definitions of consistency could be given for both target statistics and target losses. In fact, we observe that consistency for target losses is a special case of consistency for target statistics (§ 3). This observation suggests property elicitation (see § 2.1) as a useful tool to study general lower bounds.

As definitions of consistency are relatively intractable to apply directly, the literature often focuses on a weaker condition called calibration, which only applies when given a target loss, e.g. Quadrants 1 and 3. In particular, Zhang (2004); Lin (2004); Bartlett et al. (2006); Tewari and Bartlett (2007); Ramaswamy and Agarwal (2016) show the equivalence of consistency and calibration in Quadrant 1, where one is given a target loss and a discrete prediction set. We discuss the additional relationship of elicitation and calibration in Appendix A, and derive Theorem 8 via calibration.

Contributions
First, we formalize a notion of consistency with respect to a target statistic (Definition 6) and show its relationship to consistency with respect to a target loss (Lemma 7). We then show indirect elicitation is a necessary condition for consistency (Theorem 8). With these tools in hand, we present a new framework for deriving lower bounds on the prediction dimension of consistent convex surrogates (Corollaries 12 and 13) via indirect elicitation. These bounds are the first to our knowledge that can be applied in all four quadrants. Moreover, our framework can also give tighter bounds than previously existed in the literature. We illustrate this sharpness with new bounds for well-studied problems such as the abstain loss (§ 5) and variance, CVaR, and other measures of risk and uncertainty (§ 6). See Figure 1 for a roadmap of our main results.

               Target loss                        Target statistic
Discrete       Q1: Classification                 Q2: Risk-averse classification (Appendix E)
Continuous     Q3: Least-squares regression       Q4: Variance estimation

Table 1: The four quadrants of problem types, with an example of interest for each.

Figure 1: Flow and implications of our results. Compared to calibration, we suggest indirect elicitation as a simpler but almost-as-powerful necessary condition for consistency. In particular, we obtain a testable necessary condition, based on d-flats, for whether there exists a d-dimensional consistent convex surrogate. This condition recovers and strengthens existing calibration-based results.
2. Setting
We consider supervised learning problems in the space
X × Y, for some feature space X and a label space Y, with data drawn from a distribution D over X × Y. The task is to produce a hypothesis f : X → R, for some prediction space R, which may be different from Y. For example, in ranking problems, R may be all |Y|! permutations over the |Y| labels forming Y. As we focus on conditional distributions p := D_x = Pr[Y | X = x] over Y given some x ∈ X, we often abstract away x, working directly with a convex set of distributions over outcomes P ⊆ ∆_Y. We then write e.g. E_p L(·, Y) to mean the expectation when Y ∼ p.

If given, we use ℓ : R × Y → R to denote a target loss, with predictions r ∈ R. Similarly, L : R^d × Y → R will typically denote a surrogate loss, with surrogate predictions u ∈ R^d. We write L_d for the set of B(R^d) ⊗ Y-measurable and lower semi-continuous surrogates L : R^d × Y → R such that E_{Y∼p} L(u, Y) < ∞ for all u ∈ R^d, p ∈ P, that are minimizable in that argmin_u E_p L(u, Y) is nonempty for all p ∈ P. Moreover, L^cvx_d ⊆ L_d is the set of convex (in R^d for every y ∈ Y) losses in L_d. Set L = ∪_{d∈N} L_d, and L^cvx = ∪_{d∈N} L^cvx_d. A loss ℓ : R × Y → R is discrete if R is a finite set. For a given p ∈ P, the (conditional) regret, or excess risk, of a loss L is given by R_L(u, p) := E_p L(u, Y) − inf_{u*} E_p L(u*, Y). Typically, we notate finite report sets R.

2.1. Property elicitation

Arising from the statistics and economics literature, property elicitation is similar to calibration, but only characterizes exact minimizers of a surrogate (Savage, 1971; Osband and Reichelstein, 1985; Lambert et al., 2008; Lambert and Shoham, 2009; Lambert, 2018; Frongillo and Kash, 2015, 2014). Specifically, given a statistic or property Γ of interest, which maps a distribution p ∈ P ⊆ ∆_Y to the set of desired or correct predictions, the minimizers of L should precisely coincide with Γ. For example, squared loss L(r, y) = (r − y)² elicits the mean Γ(p) = E_p Y. For intuition, to relate to consistency, one can think of p = Pr[Y | X = x] as a conditional distribution, though the definition is also applied to point prediction settings.
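As a quick sanity check of this running example, the following toy snippet (ours, not from the paper; the outcome set and distribution are arbitrary illustrative choices) verifies numerically that the report minimizing expected squared loss coincides with the mean E_p Y.

```python
import numpy as np

# Minimal numerical illustration (hypothetical toy code, not from the paper): for squared
# loss L(r, y) = (r - y)^2 over a finite outcome set, the report minimizing the expected
# loss under p coincides with the mean E_p[Y], i.e. squared loss elicits the mean.

rng = np.random.default_rng(0)
Y = np.array([0.0, 1.0, 2.0])           # finite outcome set
p = rng.dirichlet(np.ones(len(Y)))      # a random distribution p in the simplex

grid = np.linspace(-1.0, 3.0, 4001)     # candidate reports r
expected_loss = np.array([p @ (r - Y) ** 2 for r in grid])
r_star = grid[expected_loss.argmin()]

print(f"argmin of E_p[(r - Y)^2] ~= {r_star:.3f},  mean E_p[Y] = {p @ Y:.3f}")
# The two agree up to grid resolution.
```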
Definition 1 (Property, elicits) A property is a set-valued function Γ : P → 2^R \ {∅}, which we denote Γ : P ⇒ R. A loss L : R × Y → R elicits the property Γ if for all p ∈ P,

    Γ(p) = argmin_{u ∈ R} E_p L(u, Y). (1)

An example is the mean, Γ(p) = {E_p Y}. The level set of Γ at value r ∈ R is Γ_r := {p ∈ P : r ∈ Γ(p)}. We call a property Γ : P ⇒ R discrete if R is a finite set, as in Quadrants 1 and 2. A property is single-valued if |Γ(p)| = 1 for all p ∈ P, in which case we may write Γ : P → R and Γ(p) ∈ R. The mean is single-valued. We define the range of a property by range Γ = ∪_{p∈P} Γ(p) ⊆ R. When L ∈ L, we use Γ := prop_P[L] to denote the unique property elicited by L (for distributions in P) from eq. (1). Typically, we denote the target property by γ, and the surrogate by Γ.

To relate property elicitation to consistency, we need to allow for a link function, which gives rise to the notion of indirect elicitation. For single-valued properties, this definition reduces to the natural requirement γ = ψ ∘ Γ.

Definition 2 (Indirect Elicitation)
A surrogate loss and link (L, ψ) indirectly elicit a property γ : P ⇒ R if L elicits a property Γ : P ⇒ R^d such that for all u ∈ R^d, we have Γ_u ⊆ γ_{ψ(u)}. We say L indirectly elicits γ if such a link ψ exists.

An important caveat to the above definitions is that, since Γ = prop_P[L] is nonempty everywhere, we must have L ∈ L, meaning that E_p L(·, Y) always achieves a minimum. This restriction is also implicit in e.g. (Agarwal and Agarwal, 2015). While some popular surrogates such as logistic and exponential loss are not minimizable, these losses are still covered in Corollary 13 and Theorem 17, as Γ(p) ≠ ∅ when p ∈ P := relint(∆_Y); moreover, by thresholding L̃(u, y) = max(L(u, y), ε) for sufficiently small ε > 0, we have L̃ ∈ L for both. We expect that a generalization of property elicitation which allows for "infinite" predictions (e.g., along a prescribed ray), thereby ensuring a minimum is always achieved for convex losses, would allow us to lift the minimizable restriction entirely.

Various works have studied the minimum prediction dimension d needed in order to construct a consistent surrogate loss L : R^d × Y → R, typically through proxies such as calibration (Steinwart and Christmann, 2008; Agarwal and Agarwal, 2015; Ramaswamy and Agarwal, 2016) and property elicitation (Frongillo and Kash, 2015; Fissler et al., 2016; Frongillo and Kash, 2020). In Quadrant 1, Ramaswamy and Agarwal (2016) introduce a special case of convex consistency dimension (Definition 3), which led to consistent convex surrogates for discrete prediction problems such as hierarchical classification (Ramaswamy et al., 2015) and classification with an abstain option (Ramaswamy et al., 2018).

Definition 3 (Convex Consistency Dimension)
Given a target loss ℓ : R × Y → R or property γ : P ⇒ R, its convex consistency dimension cons_cvx(·) is the minimum dimension d such that there exist L ∈ L^cvx_d and link ψ such that (L, ψ) is consistent with respect to ℓ or γ. Consistency is defined for a target loss in Definition 5 and for a target property in Definition 6.

In the case of a target property γ, i.e. a statistic, Lambert et al. (2008) similarly introduce the notion of elicitation complexity, later generalized by Frongillo and Kash (2020), which captures the lowest prediction dimension of a surrogate which indirectly elicits γ. This notion is quite general as it includes continuous estimation settings and does not inherently depend on a target loss being given.

Definition 4 (Convex Elicitation Complexity)
Given a target property γ, the convex elicitation complexity elic_cvx(γ) is the minimum dimension d such that there is an L ∈ L^cvx_d indirectly eliciting γ.

Agarwal and Agarwal (2015, Corollary 10) provide a necessary condition for the direct convex elicitation of single-valued properties, yielding bounds on the dimensionality of level sets. Moreover, Finocchiaro et al. (2019) study surrogate losses which embed a discrete loss, which is a special case of indirect elicitation. Finocchiaro et al. (2020) further introduce the notion of embedding dimension, which is a lower bound on both convex elicitation complexity of discrete properties and convex consistency dimension of discrete losses and finite statistics.
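Before turning to lower bounds, the following illustrative snippet (ours, not from the paper) makes Definitions 1 and 2 concrete on the smallest possible example: the hinge surrogate with the sign link indirectly elicits the binary 0-1 target property. The choice of surrogate, link, grid, and sampled distributions are our own illustrative assumptions.

```python
import numpy as np

# Hedged numerical check of Definition 2 (toy code, not from the paper): the hinge loss
# L(u, y) = max(0, 1 - y*u) on Y = {-1, +1} with the sign link psi indirectly elicits the
# 0-1 target property gamma(p) = argmin_r Pr(Y != r), i.e. the most likely label.

rng = np.random.default_rng(1)
u_grid = np.linspace(-3.0, 3.0, 601)                  # surrogate reports u in R

def hinge_risk(u, p):                                 # E_p[max(0, 1 - Y u)], p = Pr(Y = +1)
    return p * max(0.0, 1.0 - u) + (1.0 - p) * max(0.0, 1.0 + u)

psi = lambda u: 1 if u >= 0 else -1                   # link from surrogate report to label

all_ok = True
for p in rng.uniform(0.01, 0.99, size=200):
    risks = np.array([hinge_risk(u, p) for u in u_grid])
    minimizers = u_grid[risks <= risks.min() + 1e-9]  # Gamma(p), up to grid resolution
    gamma_p = {1} if p > 0.5 else {-1}                # 0-1 optimal label (ties not sampled)
    all_ok &= all(psi(u) in gamma_p for u in minimizers)

print("Gamma_u contained in gamma_{psi(u)} for all sampled p:", all_ok)
```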
3. Consistency implies indirect elicitation
In this section, we connect consistency of any surrogate to an indirect elicitation requirement. This will allow us to show indirect elicitation gives state-of-the-art lower bounds on the prediction dimension of consistent convex surrogates.

We start by formalizing consistency in two ways that generalize across our four quadrants. First, given a target loss ℓ, we say L is consistent if optimizing L and applying a link ψ optimizes ℓ (Definition 5). Second, given a target property γ, such as the α-quantile, we say L is consistent if optimizing L implies approaching, in some sense, the correct statistic γ(D_x) of the conditional distributions D_x = Pr[Y | X = x] (Definition 6). We then observe that Definition 5 is subsumed by Definition 6, and use this to show consistency implies L indirectly elicits prop_P[ℓ] or γ respectively.

Definition 5 (Consistent: loss)
A loss L ∈ L and link (L, ψ) are D-consistent for a set D of distributions over X × Y with respect to a target loss ℓ if, for all D ∈ D and all sequences of measurable hypothesis functions {f_m : X → R^d},

    E_D L(f_m(X), Y) → inf_f E_D L(f(X), Y)  ⟹  E_D ℓ((ψ∘f_m)(X), Y) → inf_f E_D ℓ((ψ∘f)(X), Y).

For a given convex set P ⊆ ∆_Y, we simply say (L, ψ) is consistent if it is D-consistent for some D satisfying the following: for all p ∈ P, there exists D ∈ D and x ∈ X such that D has a point mass on x and p = D_x.

Instead of a target loss ℓ, one may want to learn a target property, i.e. a conditional statistic such as the expected value, variance, or entropy. In this case, following the tradition in the statistics literature on conditional estimation (Györfi et al., 2006; Fan and Yao, 1998; Ruppert et al., 1997), we formalize consistency as converging to the correct conditional estimates of the property. Convergence is measured by functions µ(r, p) that formalize how close r is to "correct" for conditional distribution p. In particular we should have µ(r, p) = 0 ⟺ r ∈ γ(p).

Definition 6 (Consistent: property)
Suppose we are given a loss L ∈ L, link function ψ : R^d → R, and property γ : P ⇒ R. Moreover, let µ : R × P → R_+ be any function satisfying µ(r, p) = 0 ⟺ r ∈ γ(p). We say (L, ψ) is (µ, D)-consistent with respect to γ if, for all D ∈ D and sequences of measurable functions {f_m : X → R^d},

    E_D L(f_m(X), Y) → inf_f E_D L(f(X), Y)  ⟹  E_X µ(ψ∘f_m(X), D_X) → 0. (2)

We simply say (L, ψ) is µ-consistent if it is (µ, D)-consistent for some D satisfying the following: for all p ∈ P, there exists D ∈ D and x ∈ X such that D has a point mass on x and p = D_x. Additionally, we say (L, ψ) is consistent if there is a µ such that (L, ψ) is µ-consistent.

Typical definitions of consistency require D to be the set of all distributions over X × Y, while our conditions are much weaker. As the main focus of this paper is lower bounds on the prediction dimension, i.e., showing that surrogates of a certain prediction dimension cannot exist, these weaker conditions translate to stronger impossibility statements.

Given a target loss ℓ, we can define a statistic γ, the property it elicits. Intuitively, consistency of a surrogate L with respect to ℓ and γ are equivalent, i.e. in both cases estimates should converge to values that minimize ℓ-loss. We formalize this by letting µ be the ℓ-regret, yielding Lemma 7, proven in Appendix D.

Lemma 7
Let a convex P ⊆ ∆_Y be given. Given a surrogate loss L ∈ L, link ψ, and target loss ℓ, set µ(r, p) := R_ℓ(r, p). Then there is a D such that (L, ψ) is D-consistent with respect to ℓ if and only if (L, ψ) is (µ, D)-consistent with respect to γ := prop_P[ℓ].

Because each target loss in L elicits some property, but not all target properties can be elicited by a loss (e.g. the variance), consistency with respect to a property is the strictly broader notion. This points to indirect elicitation as a natural necessary condition for consistency, as formalized in Theorem 8.

Theorem 8
For a surrogate L ∈ L, if the pair (L, ψ) is consistent with respect to a property γ : P ⇒ R or a loss ℓ eliciting γ, then (L, ψ) indirectly elicits γ.

Proof
By Lemma 7, it suffices to show the result for consistency with respect to a property γ, setting γ := prop_P[ℓ] if ℓ is given instead. We show the contrapositive; suppose (L, ψ) does not indirectly elicit γ, meaning we have some p ∈ P and u ∈ Γ(p) with ψ(u) ∉ γ(p), where Γ := prop_P[L]. (Here we use the fact that Γ(p) ≠ ∅.) By definition, if we had consistency, there must be some distribution D on X × Y with a point mass on some x ∈ X and D_x = p. Consider a constant sequence {f_m} with f_m = f such that f(x) = u, so that E_D L(f_m(X), Y) = E_{D_x} L(f_m(x), Y) = E_p L(u, Y). Since u ∈ Γ(p), we have E_p L(u, Y) = inf_f E_{D_x} L(f(x), Y) = inf_f E_D L(f(X), Y). In particular, we have E_D L(f_m(X), Y) → inf_f E_D L(f(X), Y). However, we have E_X µ(ψ∘f_m(X), D_X) = µ(ψ(f_m(x)), p) = µ(ψ(u), p) ≠ 0, since ψ(u) ∉ γ(p). Therefore (L, ψ) is not consistent with respect to γ (Definition 6).

This result allows us to state elicitation complexity as a lower bound for convex consistency dimension.

Corollary 9
Given a property γ : P ⇒ R or loss ℓ : R × Y → R eliciting γ, we have elic_cvx(γ) ≤ cons_cvx(γ) = cons_cvx(ℓ).
4. Prediction Dimension of Consistent Convex Surrogates
We now turn to the question of bounding the prediction dimension of a consistent convex surrogate. From Theorem 8, given a target property γ or loss ℓ with γ = prop_P[ℓ], this task reduces to lower bounding the prediction dimension of a convex surrogate indirectly eliciting γ. We now explore two tools, Corollaries 12 and 13, for proving such convex elicitation lower bounds. The key idea, crystallized from the proofs of Ramaswamy and Agarwal (2016, Theorem 16) and Agarwal and Agarwal (2015, Theorem 9), is to consider a particular distribution p and surrogate prediction u ∈ R^d which is optimal for p. Lemma 11 will show that if d is small, then the level set {p ∈ P : u ∈ argmin_{u′} E_p L(u′, Y)} must be large; in fact, it must roughly contain a high-dimensional flat. By definition of indirect elicitation, there is some level set γ_r (where u is linked to r) containing this flat as well. The use of this result is to leverage the contrapositive: if γ has a level set intricate enough to not contain any high-dimensional flats, then γ cannot have a low-dimensional consistent surrogate.

Definition 10 (Flat)
For d ∈ N, a d-flat, or simply flat, is a nonempty set F = ker_P W := {q ∈ P : E_q W = 0} for some measurable W : Y → R^d.

We state our elicitation lower bounds in Corollaries 12 and 13, which, when combined with Theorem 8, yield consistency bounds. A similar result is Agarwal and Agarwal (2015, Theorem 9), which bounds the dimension of level sets of a single-valued prop_P[L]. Corollaries 12 and 13 instead bound the dimension of flats contained in the level sets, an additional power which we leverage in our examples.

Lemma 11 Let Γ : P ⇒ R^d be (directly) elicited by L ∈ L^cvx_d for some d ∈ N. Let Y be either a finite set, or Y = R, in which case we assume each p ∈ P admits a Lebesgue density supported on the same set for all p ∈ P.¹ For all u ∈ range Γ and p ∈ Γ_u, there is some V_{u,p} : Y → R^d such that p ∈ ker_P V_{u,p} ⊆ Γ_u.
1. This assumption is largely for technical convenience, to ensure that V_{u,p} does not depend on p. Any such assumption would suffice, and we suspect even that condition can be relaxed.

Proof As L is convex and elicits Γ, we have u ∈ Γ(p) ⟺ 0 ∈ ∂ E_p L(u, Y).² We proceed in two cases, depending on |Y|.

Finite Y: If Y is finite, this is additionally equivalent to 0 ∈ ⊕_y p_y ∂L(u, y), where ⊕ denotes the Minkowski sum (Hiriart-Urruty and Lemaréchal, 2012, Theorem 4.1.1). Expanding, we have ⊕_y p_y ∂L(u, y) = {Σ_{y∈Y} p_y x_y | x_y ∈ ∂L(u, y) ∀y ∈ Y}, and thus Wp = Σ_y p_y x_y = 0, where W = [x_1, ..., x_n] ∈ R^{d×n}; cf. (Ramaswamy and Agarwal, 2016, A_m in Theorem 16). Let V_{u,p} : Y → R^d, y ↦ W_y, be the function encoding the columns of W. Observe that E_p V_{u,p} = 0.

Y = R: Any L ∈ L^cvx_d satisfies the assumptions of Ioffe and Tikhomirov (1969), so we may interchange subdifferentiation and expectation. Specifically, letting 𝒱_{u,p} = {V : Y → R^d | V measurable, V(y) ∈ ∂L(u, y) p-a.s.}, we have ∂ E_p L(u, Y) = {∫ V(y) dp(y) | V ∈ 𝒱_{u,p}}. As 0 ∈ ∂ E_p L(u, Y), in particular, there is some V_{u,p} ∈ 𝒱_{u,p} such that E_p V_{u,p} = 0. For any q ∈ P, as by assumption q is supported on the same set as p, we have V_{u,p}(y) ∈ ∂L(u, y) q-a.s., so that V_{u,p} ∈ 𝒱_{u,q}. Thus, E_q V_{u,p} = 0 implies 0 ∈ ∂ E_q L(u, Y) by the above.

In both cases, we take the flat F := ker_P V_{u,p}, and have p ∈ F by construction. To see F ⊆ Γ_u, from the chain of equivalences above, we have for any q ∈ P that q ∈ ker_P V_{u,p} ⟹ 0 ∈ ∂ E_q L(u, Y) ⟹ u ∈ Γ(q) ⟹ q ∈ Γ_u.
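The finite-Y construction above can be traced numerically. The sketch below (ours, not from the paper) uses squared loss on a three-outcome set: it reads off the subgradients at an optimal report u, forms V_{u,p}, and checks that distributions on the resulting flat ker_P V_{u,p} keep u optimal. The particular loss, outcome set, and distributions are illustrative assumptions.

```python
import numpy as np

# Toy sketch of the finite-Y case of Lemma 11 (assumptions ours): for L(u, y) = (u - y)^2
# on Y = {1, 2, 3} and u optimal for p, the subgradients x_y = 2(u - y) give V(y) = x_y with
# E_p[V] = 0, and every q with E_q[V] = 0 also has u as a minimizer, i.e. the flat ker_P V
# sits inside the level set Gamma_u.

Y = np.array([1.0, 2.0, 3.0])
p = np.array([0.2, 0.5, 0.3])
u = p @ Y                                     # minimizer of E_p[(u - Y)^2]
V = 2.0 * (u - Y)                             # V(y) = dL/du(u, y)
assert abs(p @ V) < 1e-10                     # p lies in ker_P V

direction = np.array([1.0, -2.0, 1.0])        # moving this way leaves E[V] unchanged
grid = np.linspace(0.0, 4.0, 4001)
for t in (-0.1, -0.05, 0.05, 0.1):
    q = p + t * direction                     # another distribution with E_q[V] = 0
    assert q.min() >= 0 and abs(q @ V) < 1e-10
    best = grid[np.argmin([q @ (r - Y) ** 2 for r in grid])]
    print(f"t = {t:+.2f}: argmin E_q[L] = {best:.3f}   (u = {u:.3f})")
```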
Knowing indirect elicitation implies the existence of such a flat, we now apply Theorem 8 and Lemma 11 to construct lower bounds on convex consistency dimension.

Corollary 12 Let target property γ : P ⇒ R and d ∈ N be given. Let Y be either a finite set, or Y = R, in which case we assume each p ∈ P admits a Lebesgue density supported on the same set for all p ∈ P. Let p ∈ P with |γ(p)| = 1, and take γ(p) = {r}. If there is no d-flat F with p ∈ F ⊆ γ_r, then cons_cvx(γ) ≥ elic_cvx(γ) ≥ d + 1.

Proof
Let (L, ψ) indirectly elicit γ, where L ∈ L^cvx_d, and let Γ = prop_P[L]. As Γ is non-empty, there is some u ∈ Γ(p). Since γ is single-valued at p, we have r = ψ(u); by Lemma 11, we know there is a d-flat F = ker_P V_{u,p} so that p ∈ F ⊆ Γ_u. By definition of indirect elicitation, we additionally have Γ_u ⊆ γ_r. Thus, we have p ∈ F ⊆ γ_r. If no flat F satisfies the above conditions, then no L ∈ L^cvx_d indirectly elicits γ, so elic_cvx(γ) ≥ d + 1, and recall cons_cvx(γ) ≥ elic_cvx(γ) by Corollary 9.

Corollary 13
Let an elicitable target property γ : P ⇒ R be given, where P ⊆ ∆_Y is defined over a finite set of outcomes Y, and let d ∈ N. Let p ∈ relint(P) and r ∈ γ(p). If there is no d-flat F with p ∈ F ⊆ γ_r, then cons_cvx(γ) ≥ elic_cvx(γ) ≥ d + 1.

Proof
Let (L, ψ) indirectly elicit γ, and let the convex function L elicit Γ. As Γ is non-empty, there is some u ∈ Γ(p); suppose r′ = ψ(u). Take F ⊆ Γ_u to be the flat that exists by Lemma 11. If r′ = r, then p ∈ F ⊆ Γ_u ⊆ γ_r by indirect elicitation. Otherwise, by Lemma 39, for elicitable properties with p ∈ γ_r ∩ γ_{r′}, we observe p ∈ F ⊆ γ_{r′} ⟺ p ∈ F ⊆ γ_r. As above, if no flat F satisfies the above conditions, then no L ∈ L^cvx_d indirectly elicits γ, so cons_cvx(γ) ≥ elic_cvx(γ) ≥ d + 1, recalling Corollary 9 for the first inequality.

2. Here ∂ denotes the subdifferential, ∂f(x₀) = {z : f(x) − f(x₀) ≥ ⟨z, x − x₀⟩ ∀x}.
5. Discrete-valued predictions
The main known technique for lower bounds on surrogate prediction dimension is given by Ramaswamy and Agarwal (2016) for Quadrant 1 (target loss and discrete predictions). The proof builds heavily on the "limits of sequences" in the definition of calibration. By restricting slightly to the broad class of minimizable losses L^cvx, we show their bound follows relatively directly from Corollary 13. (We conjecture that the minimizability restriction to L^cvx can be lifted; see § 7.) Ramaswamy and Agarwal (2016) construct what they call the subspace of feasible directions and give bounds in terms of its dimension.

Definition 14 (Subspace of feasible directions)
The subspace of feasible directions S_C(p) of a convex set C ⊆ R^n at p ∈ C is S_C(p) = {v ∈ R^n : ∃ ε₀ > 0 such that p + εv ∈ C ∀ε ∈ (−ε₀, ε₀)}.

Ramaswamy and Agarwal (2016) give a lower bound on the dimensionality of all consistent convex surrogates, namely cons_cvx(ℓ) ≥ ‖p‖₀ − dim(S_{γ_r}(p)) − 1 for all p and r ∈ γ(p), in the setting where one is given a discrete prediction problem and target loss over finite outcomes. It turns out that the subspace of feasible directions is essentially a special case of a flat described by Lemma 11. So, by making a slight restriction to the class of minimizable convex surrogates L^cvx, we can derive this lower bound from our general technique in a way that we find shorter and simpler.

Corollary 15 (Ramaswamy and Agarwal (2016) Theorem 18)
Let ℓ : R × Y → R be a discrete loss eliciting γ : ∆_Y ⇒ R with Y finite. Then for all p ∈ ∆_Y and r ∈ γ(p),

    cons_cvx(γ) ≥ ‖p‖₀ − dim(S_{γ_r}(p)) − 1. (3)

Proof [Sketch] If cons_cvx(γ) ≤ d, then there is an L ∈ L^cvx_d so that L is consistent with respect to γ, and in turn, indirectly elicits γ. Lemma 11 says that there is some d-flat F = ker_P V such that p ∈ F ⊆ γ_r. In particular, if p ∈ relint(∆_Y), we can see dim(F) ≤ dim(S_{γ_r}(p)). Since affhull(∆_Y) has dimension |Y| − 1 = ‖p‖₀ − 1, by rank-nullity and rank(V) ≤ d (more precisely, of the corresponding linear map q ↦ E_q V), we have d ≥ ‖p‖₀ − 1 − dim(S_{γ_r}(p)). When p ∉ relint(∆_Y), we can project down to the subsimplex on the support of p, again of dimension ‖p‖₀ − 1, and modify L and ℓ accordingly. Now p is in the relative interior of this subsimplex, so the above gives cons_cvx(γ) ≥ ‖p‖₀ − 1 − dim(S_{γ_r}(p)), where now S is relative to R^{supp(p)}. Finally, the feasible subspace dimension in the projected space is the same as in the original space because of p's location on a face of ∆_Y.

There are some cases where the bound provided by Corollaries 12 and 13 is strictly tighter than the bound provided by feasible subspace dimension in Corollary 15. For an example of how Corollary 12 applies to a discrete property for which there is no target loss, i.e. a non-elicitable property (Quadrant 2), which is not considered by Ramaswamy et al. (2018), we refer the reader to Appendix E.

Example: High-confidence classification.
Given the target loss ℓ_abs(r, y) := I{r ∉ {y, ⊥}} + (1/2) I{r = ⊥}, we can consider the abstain property it elicits, where one predicts the most likely outcome y if Pr[Y = y | x] ≥ 1/2 and ⊥ otherwise. Ramaswamy and Agarwal (2016) present a convex surrogate for the abstain loss that takes as input a prediction whose dimension is logarithmic in the number of outcomes, yielding new upper bounds on cons_cvx(ℓ_abs) which are an exponential improvement over previous results, e.g., Crammer and Singer (2001).

To lower bound the dimension of convex surrogates, we can consider two different distributions; in the first, our bound yields a strict gap over the feasible subspace dimension bound, and in the second, the bounds are equal. First, we choose p = • to be the uniform distribution (see Figure 2). In this case, the bound by feasible subspace dimension yields cons_cvx(ℓ_abs) ≥ 3 − dim(S_{γ_⊥}(•)) − 1 = 0, since dim(S_{γ_⊥}(•)) = 2. When intersected with the simplex, one can see that any line (a 1-flat, since • ∈ relint(∆_Y)) in the simplex through • also leaves the cell γ_⊥, which contains •. See Figure 2 (R) for intuition; a 1-flat through p ∈ relint(∆_Y) would be a line in such a figure. Therefore, we have no 1-flat containing • staying in γ_⊥, so we obtain a better lower bound, cons_cvx(ℓ_abs) ≥ 2. Combining this with the upper bounds given by Ramaswamy et al. (2018), we observe the bound cons_cvx(ℓ_abs) = 2 is tight in this case with |Y| = 3.

Our bounds sometimes match those of Ramaswamy and Agarwal (2016); consider the distribution ⋆ = (1/4, 1/4, 1/2). Here the feasible subspace dimension of γ_⊥ and γ_3 at ⋆ is 1, since one only moves toward the distributions (0, 1/2, 1/2) and (1/2, 0, 1/2) without leaving the level sets, and the three points are collinear in affhull(∆_Y), suggesting dim(S_{γ_⊥}(⋆)) = 1. This yields cons_cvx(ℓ_abs) ≥ 3 − 1 − 1 = 1. Since the 1-flat through these three points is itself contained in γ_⊥ and γ_3, Corollary 13 also yields only cons_cvx(ℓ_abs) ≥ 1 at ⋆, matching the feasible subspace bound. Generally, d-flats appear to work well at distributions where previous bounds via feasible subspace dimension would have been vacuous. In essence, flats allow us a "global" view of the property we are eliciting, while the feasible subspace method only permits a "local" look at the property, so we find our method works better for distributions in relint(∆_Y).

Figure 2: (Left) Feasible subspace dimensions dim(S_{γ_⊥}(•)) = 2 and dim(S_{γ_⊥}(⋆)) = 1, giving the bounds cons_cvx(ℓ_abs) ≥ 0 and cons_cvx(ℓ_abs) ≥ 1, respectively. (Right) No 1-flat F through • (a line, since • ∈ relint(∆_Y)) stays fully contained in γ_⊥, so cons_cvx(ℓ_abs) ≥ 2.
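The flat-based argument at the uniform distribution can also be checked numerically. The sketch below (ours, not from the paper; |Y| = 3 and the tolerances are illustrative choices) samples tangent directions at the uniform distribution: every direction is locally feasible for γ_⊥ (so the feasible-subspace bound is vacuous there), yet no full line through the uniform distribution stays inside γ_⊥, in line with the lower bound of 2.

```python
import numpy as np

# Hedged numerical companion to the abstain example (toy code, not from the paper), with
# Y = {1, 2, 3} and the abstain cell gamma_bot = {p : max_y p_y <= 1/2}.

rng = np.random.default_rng(3)
center = np.ones(3) / 3.0                            # the uniform distribution
in_gamma_bot = lambda p: p.max() <= 0.5 + 1e-12      # no label has probability above 1/2

locally_feasible, lines_inside = 0, 0
ts = np.linspace(-1.0, 1.0, 2001)[:, None]           # parametrizes a line through `center`
for _ in range(500):
    v = rng.normal(size=3)
    v -= v.mean()                                    # tangent to the simplex (sums to zero)
    v /= np.linalg.norm(v)
    # (i) small two-sided steps stay in gamma_bot, so the direction is locally feasible
    locally_feasible += in_gamma_bot(center + 1e-3 * v) and in_gamma_bot(center - 1e-3 * v)
    # (ii) restrict the full line to the simplex and ask whether it stays in gamma_bot
    pts = center + ts * v
    pts = pts[pts.min(axis=1) >= -1e-12]             # keep only points inside the simplex
    lines_inside += bool(np.all(pts.max(axis=1) <= 0.5 + 1e-12))

print(f"locally feasible directions at the uniform point: {locally_feasible} / 500")
print(f"full lines through the uniform point contained in gamma_bot: {lines_inside} / 500")
```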
6. Continuous-valued predictions
In continuous estimation problems, often one is not given a target loss, but instead a target (conditional) statistic of the data one wishes to estimate. Examples include estimating the mean or variance of y conditioned on a given x. In this setting, Lemma 11 gives lower bounds on the prediction dimension of convex losses with a link to the desired conditional statistic, i.e., the convex elicitation complexity. In particular, Theorem 17 below yields new bounds on the convex elicitation complexity of statistics which quantify risk or uncertainty, such as variance, entropy, or financial risk measures.

These bounds address an open question of Frongillo and Kash (2020), that of developing a theory of elicitation complexity with respect to convex-elicitable properties. The lower bounds of previous work are essentially all with respect to identifiable properties; a property is d-identifiable if its level sets are all d-flats. Frongillo and Kash (2020) rely on finding a dimension d such that the level sets of certain risk measures γ have too much curvature to contain any d-flat. Thus, the elicitation complexity with respect to identifiable properties is greater than d.

In contrast, properties elicited by non-smooth convex losses are generally not identifiable. For example, the properties elicited by hinge loss and the abstain surrogate are not identifiable, as their level sets are not flats (see Figure 2). It therefore might appear that entirely new ideas are needed. Our framework is closely related to identifiability, however; Lemma 11 states that the level sets of d-dimensional convex-elicitable properties, if not d-flats themselves, are unions of d-flats. Thus, the general logic of Frongillo and Kash (2020) can still apply. In particular, we recover their main lower bound for the large class of Bayes risks.
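As a concrete instance of a Bayes risk (Definition 16 below), the following toy snippet (ours, not from the paper; the distribution is an arbitrary illustrative choice) checks numerically that the Bayes risk of squared loss is the variance, the fact driving the variance example later in this section.

```python
import numpy as np

# Toy check (not from the paper): the Bayes risk of squared loss, inf_r E_p[(r - Y)^2],
# equals Var(p), since the infimum is attained at the mean.

Y = np.array([0.0, 1.0, 3.0, 6.0])
p = np.array([0.1, 0.4, 0.3, 0.2])

grid = np.linspace(-2.0, 8.0, 10001)
bayes_risk = min(p @ (r - Y) ** 2 for r in grid)   # inf_r E_p[(r - Y)^2] over a grid
variance = p @ Y ** 2 - (p @ Y) ** 2

print(f"Bayes risk of squared loss ~= {bayes_risk:.4f}")
print(f"Var(p)                      = {variance:.4f}")
```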
Definition 16 Given a loss function L : R × Y → R for some report set R, the Bayes risk of L is defined as L̲(p) := inf_{r∈R} E_p L(r, Y).

Condition 1 For some r₀ ∈ range Γ, the level set Γ_{r₀} = ker_P V is a d-flat presented by some V : Y → R^d such that 0 ∈ int{E_p V : p ∈ P}.

Theorem 17
Let P be a set of Lebesgue densities supported on the same set for all p ∈ P. Let Γ : P → R^d satisfy Condition 1 for some r₀ ∈ R^d. Let L ∈ L^cvx elicit Γ such that its Bayes risk L̲ is non-constant on Γ_{r₀}. Then cons_cvx(L̲) ≥ elic_cvx(L̲) ≥ d + 1.

We now illustrate the theorem with two important examples: variance and conditional value at risk. Several other applications from Frongillo and Kash (2020), such as spectral risk measures, entropy, and norms, follow similarly.
Example: Variance.
As a warm-up, let us see how to show elic_cvx(Var) = 2, meaning the lowest dimension of a convex loss to estimate conditional variance is 2. This lower bound will follow from Theorem 17 using the fact that variance is the Bayes risk of squared loss L(r, y) = (r − y)², which elicits the mean Γ(p) = E_p Y. Interestingly, while perhaps intuitively obvious, even this simple result is novel. In particular, the well-known fact that the variance is not elicitable does not yield a lower bound of 2, as it does not rule out the variance being a link of a real-valued convex-elicitable property; cf. Frongillo and Kash (2020, Remark 1).

Corollary 18
Let P be a set of continuous Lebesgue densities on Y = R with all p ∈ P having the same support. If there exist p, q₁, q₂ ∈ P with E_p Y = E_{q₁} Y ≠ E_{q₂} Y and Var(p) ≠ Var(q₁), then cons_cvx(Var) = elic_cvx(Var) = 2.
Proof For the upper bound, we may elicit the first two moments via the convex loss L(r, y) = (r₁ − y)² + (r₂ − y²)², and recover the variance via ψ(r) = r₂ − r₁², giving elic_cvx(Var) ≤ 2. For the lower bound, suppose without loss of generality that E_{q₁} Y < E_{q₂} Y. Let r₀ = (E_{q₁} Y + E_{q₂} Y)/2, and define V : Y → R, y ↦ y − r₀. Then ker_P V = {p′ ∈ P | E_{p′} Y = r₀} = Γ_{r₀}, where Γ : p′ ↦ E_{p′} Y is the mean. As E_{q₁} Y < r₀ < E_{q₂} Y, we conclude E_{q₁} V < 0 < E_{q₂} V. We have now satisfied Condition 1 for d = 1. To apply Theorem 17, it remains to show that Var is non-constant on Γ_{r₀}. By our assumptions and the definition of Var, we have E_p Y² ≠ E_{q₁} Y². Letting p₁ = (q₁ + q₂)/2 and p₂ = (p + q₂)/2, we have E_{p_i} Y = r₀ for i ∈ {1, 2}, but E_{p₁} Y² = (E_{q₁} Y² + E_{q₂} Y²)/2 ≠ (E_p Y² + E_{q₂} Y²)/2 = E_{p₂} Y². As p₁, p₂ have the same mean but different second moments, we conclude Var(p₁) ≠ Var(p₂).
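The upper-bound construction in the proof is easy to check empirically. The following sketch (ours, not from the paper; the sampling distribution and grids are illustrative choices) minimizes the two-dimensional convex surrogate over grids on simulated data and applies the link ψ(r) = r₂ − r₁².

```python
import numpy as np

# Hedged numerical companion to the upper bound in Corollary 18 (toy code, not from the
# paper): the convex loss L(r, y) = (r1 - y)^2 + (r2 - y^2)^2 elicits the first two moments,
# and the link psi(r) = r2 - r1^2 recovers the variance, so elic_cvx(Var) <= 2.

rng = np.random.default_rng(4)
y = rng.gamma(shape=2.0, scale=1.5, size=20_000)       # samples standing in for p

# The loss is separable in (r1, r2), so minimizing each coordinate over its grid
# minimizes the joint empirical surrogate risk.
r1_grid = np.linspace(0.0, 10.0, 801)
r2_grid = np.linspace(0.0, 40.0, 801)
r1 = r1_grid[np.argmin([np.mean((r - y) ** 2) for r in r1_grid])]
r2 = r2_grid[np.argmin([np.mean((r - y ** 2) ** 2) for r in r2_grid])]

print("psi(r) = r2 - r1^2 ~=", r2 - r1 ** 2)
print("sample variance     =", y.var())
```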
Example: Conditional Value at Risk.
Frongillo and Kash (2020) observe that one of the most prominent financial risk measures, the conditional value at risk (CVaR), can be expressed as a Bayes risk. In particular, for 0 < α < 1, we may define

    CVaR_α(p) = inf_{r∈R} E_p[ (1/α)(r − Y)·1{r ≥ Y} − r ], (4)

which is the Bayes risk of the transformed pinball loss L_α(r, y) = (1/α)(r − y)·1{r ≥ y} − r. In turn, L_α elicits the α-quantile, the quantity q_α(p) such that Pr_p[Y ≤ q_α(p)] = α. Following Frongillo and Kash (2020), we will restrict to the set P_q of probability measures over R with connected support and whose CDFs are strictly increasing on their support, so that q_α is single-valued. Under mild assumptions, we find that there is no consistent real-valued convex surrogate for CVaR_α.
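The Bayes-risk representation in eq. (4) can likewise be verified numerically. The sketch below (ours, not from the paper; the sampling distribution, level α, and grid are illustrative choices) minimizes the empirical pinball objective and compares the attained minimum with the direct lower-tail expression implied by the display.

```python
import numpy as np

# Numerical companion to eq. (4) (toy code; conventions follow the display above, which
# treats CVaR_alpha as the Bayes risk of the transformed pinball loss): the minimizer is
# the alpha-quantile, and the attained minimum matches -E[Y | Y <= q_alpha].

rng = np.random.default_rng(5)
alpha = 0.1
y = rng.normal(loc=0.0, scale=1.0, size=50_000)          # samples standing in for p

def objective(r):                                         # E_p[(1/alpha)(r - Y)1{r >= Y} - r]
    return np.mean((r - y) * (r >= y) / alpha - r)

r_grid = np.linspace(-4.0, 4.0, 1601)
vals = np.array([objective(r) for r in r_grid])
r_star = r_grid[vals.argmin()]

q_alpha = np.quantile(y, alpha)                           # Pr[Y <= q_alpha] = alpha
cvar_direct = -y[y <= q_alpha].mean()                     # -E[Y | Y <= q_alpha]
print(f"argmin r      = {r_star:.3f}   alpha-quantile     = {q_alpha:.3f}")
print(f"min objective = {vals.min():.3f}   -E[Y | Y <= q_a] = {cvar_direct:.3f}")
```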
Corollary 19 Let P be a set of continuous Lebesgue densities on Y = R with all p ∈ P having support on the same interval. If we have p₁, p₂, p₃, p₄ ∈ P with q_α(p₁) < q_α(p₂), ..., then cons_cvx(CVaR_α) ≥ elic_cvx(CVaR_α) ≥ 2.

We conjecture that in fact elic_cvx(CVaR_α) ≥ 3, which if true would constitute an interesting gap between elicitation complexity for identifiable and convex-elicitable properties.

7. Conclusions and future work

In this work, we show that indirect property elicitation can be a powerful necessary condition for the existence of a consistent surrogate loss (Theorem 8). Furthermore, we introduce a new lower bound (Corollaries 12 and 13) on convex consistency dimension that is generally applicable and extends previous results from both the discrete (Corollary 15) and continuous (Corollaries 18 and 19) estimation settings.

Several important questions remain open. Particularly for the discrete settings, we would like to know whether one can lift the restriction that surrogates always achieve a minimum; we conjecture positively. Of course, we would like to characterize cons_cvx and elic_cvx and develop a general framework for constructing surrogates achieving the best possible prediction dimension. Moreover, the practical reason why consistency is desired is to ensure the guarantee of empirical risk minimization (ERM) rates; however, the relationship between ERM rates and property elicitation has not been studied.

References

Arpit Agarwal and Shivani Agarwal. On consistent surrogate risk minimization and property elicitation. In JMLR Workshop and Conference Proceedings, volume 40, pages 1–19, 2015.

Peter L. Bartlett, Michael I. Jordan, and Jon D. McAuliffe. Convexity, classification, and risk bounds. Journal of the American Statistical Association, 101(473):138–156, 2006. URL http://amstat.tandfonline.com/doi/abs/10.1198/016214505000000907.

Koby Crammer and Yoram Singer. On the algorithmic implementation of multiclass kernel-based vector machines. Journal of Machine Learning Research, 2(Dec):265–292, 2001.

Jianqing Fan and Qiwei Yao. Efficient estimation of conditional variance functions in stochastic regression. Biometrika, 85(3):645–660, 1998. doi: 10.1093/biomet/85.3.645.

Jessie Finocchiaro, Rafael Frongillo, and Bo Waggoner. An embedding framework for consistent polyhedral surrogates. In Advances in Neural Information Processing Systems, 2019.

Jessie Finocchiaro, Rafael Frongillo, and Bo Waggoner. Embedding dimension of polyhedral losses. In Conference on Learning Theory, 2020.

Tobias Fissler. On higher order elicitability and some limit theorems on the Poisson and Wiener space. PhD thesis, 2017.

Tobias Fissler, Johanna F. Ziegel, and others. Higher order elicitability and Osband's principle. The Annals of Statistics, 44(4):1680–1707, 2016.

Gerald B. Folland. Real analysis: modern techniques and their applications, volume 40. John Wiley & Sons, 1999.

Rafael Frongillo and Ian Kash. General truthfulness characterizations via convex analysis. In Web and Internet Economics, pages 354–370. Springer, 2014.

Rafael Frongillo and Ian Kash. Vector-valued property elicitation. In Proceedings of the 28th Conference on Learning Theory, pages 1–18, 2015.

Rafael Frongillo and Ian A. Kash. Elicitation complexity of statistical properties. Biometrika, 2020. doi: 10.1093/biomet/asaa093.

László Györfi, Michael Kohler, Adam Krzyzak, and Harro Walk. A distribution-free theory of nonparametric regression. Springer Science & Business Media, 2006.
Jean-Baptiste Hiriart-Urruty and Claude Lemaréchal. Fundamentals of convex analysis. Springer Science & Business Media, 2012.

Aleksandr Davidovich Ioffe and Vladimir Mikhailovich Tikhomirov. On minimization of integral functionals. Functional Analysis and Its Applications, 3(3):218–227, 1969.

Nicolas S. Lambert. Elicitation and evaluation of statistical forecasts. 2018. URL https://web.stanford.edu/~nlambert/papers/elicitability.pdf.

Nicolas S. Lambert and Yoav Shoham. Eliciting truthful answers to multiple-choice questions. In Proceedings of the 10th ACM Conference on Electronic Commerce, pages 109–118, 2009.

Nicolas S. Lambert, David M. Pennock, and Yoav Shoham. Eliciting properties of probability distributions. In Proceedings of the 9th ACM Conference on Electronic Commerce, pages 129–138, 2008.

Yi Lin. A note on margin-based loss functions in classification. Statistics & Probability Letters, 68(1):73–82, 2004.

Kent Osband and Stefan Reichelstein. Information-eliciting compensation schemes. Journal of Public Economics, 27(1):107–115, 1985. doi: 10.1016/0047-2727(85)90031-3.

Kent Harold Osband. Providing Incentives for Better Cost Forecasting. University of California, Berkeley, 1985.

Harish Ramaswamy, Ambuj Tewari, and Shivani Agarwal. Convex calibrated surrogates for hierarchical classification. In International Conference on Machine Learning, pages 1852–1860, 2015.

Harish G. Ramaswamy and Shivani Agarwal. Convex calibration dimension for multiclass loss matrices. The Journal of Machine Learning Research, 17(1):397–441, 2016.

Harish G. Ramaswamy, Ambuj Tewari, Shivani Agarwal, et al. Consistent algorithms for multiclass classification with an abstain option. Electronic Journal of Statistics, 12(1):530–554, 2018.

David Ruppert, M. P. Wand, Ulla Holst, and Ola Hösjer. Local polynomial variance-function estimation. Technometrics, 39(3):262–273, 1997. doi: 10.1080/00401706.1997.10485117.

L. J. Savage. Elicitation of personal probabilities and expectations. Journal of the American Statistical Association, pages 783–801, 1971.

Ingo Steinwart. How to compare different loss functions and their risks. Constructive Approximation, 26(2):225–287, 2007.

Ingo Steinwart and Andreas Christmann. Support Vector Machines. Springer Science & Business Media, 2008.

Ambuj Tewari and Peter L. Bartlett. On the consistency of multiclass classification methods. The Journal of Machine Learning Research, 8:1007–1025, 2007. URL http://dl.acm.org/citation.cfm?id=1390325.

Tong Zhang. Statistical behavior and consistency of classification methods based on convex risk minimization. Annals of Statistics, pages 56–85, 2004.

Appendix A. Notes on calibration

When given a discrete target loss, such as for classification-like problems, direct empirical risk minimization is typically NP-hard, forcing one to find a more tractable surrogate. To ensure consistency, the literature has embraced the notion of calibration from Steinwart and Christmann (2008, Chapter 3), which aligns with the definition in Tewari and Bartlett (2007) for multiclass classification, and its generalizations to arbitrary discrete target losses (Agarwal and Agarwal, 2015; Ramaswamy and Agarwal, 2016).
Calibration is more tractable and weaker than consistency, yet the two are equivalent under suitable assumptions (Tewari and Bartlett, 2007; Ramaswamy and Agarwal, 2016), notably in Quadrant 1. Intuitively, calibration says one cannot achieve the optimal surrogate loss while linking to a suboptimal target prediction.

Definition 20 (Calibrated: Quadrant 1) Let ℓ : R × Y → R be a discrete target loss. A surrogate loss L : R^d × Y → R and link ψ : R^d → R pair (L, ψ) is P-calibrated with respect to ℓ if for all p ∈ P:

    inf_{u ∈ R^d : ψ(u) ∉ argmin_r E_p ℓ(r,Y)} E_p L(u, Y)  >  inf_{u ∈ R^d} E_p L(u, Y). (5)

We simply say L is calibrated if P = ∆_Y.

Many works characterize calibrated surrogates for specific discrete target losses (Zhang, 2004; Lin, 2004; Bartlett et al., 2006; Tewari and Bartlett, 2007), including the canonical 0-1 loss for binary and multiclass classification. We give another definition of calibration, which is a special case of calibration via Steinwart and Christmann (2008), and show it is equivalent to Definition 20 in discrete prediction settings, but can be applied in continuous estimation settings as well. We use this more general definition of calibration when proving statements about the relationship between consistency, calibration, and indirect elicitation.

The close connection between indirect elicitation and consistency was first explored by Agarwal and Agarwal (2015). In particular, calibration of L ∈ L with respect to ℓ implies indirect elicitation quite directly: take u ∈ R^d and p ∈ Γ_u, implying u ∈ Γ(p). From eq. (1), E_p L(u, Y) = inf_{u′ ∈ R^d} E_p L(u′, Y), so we must have ψ(u) ∈ γ(p) from eq. (5), as desired.

Definition 21 (Calibrated: Quadrants 1 and 3) A loss L : R^d × Y → R is P-calibrated with respect to a loss ℓ : R × Y → R if there is a link ψ : R^d → R such that, for all distributions p ∈ P, there exists a function ζ : R_+ → R_+ with ζ continuous at 0^+ and ζ(0) = 0 such that for all u ∈ R^d, we have

    ℓ(ψ(u); p) − ℓ̲(p) ≤ ζ( E_p L(u, Y) − L̲(p) ). (6)

If P = ∆_Y, we simply say (L, ψ) is calibrated.

Consider the following four conditions, where we are given ζ : R_+ → R_+.

A. ζ satisfies ζ(0) = 0 and ζ is continuous at 0.
B. For all sequences {ε_m}, ε_m → 0 ⟹ ζ(ε_m) → 0.
C. For all u ∈ R^d, R_ℓ(ψ(u); p) ≤ ζ(R_L(u; p)).
D. For all p ∈ P and sequences {u_m} so that R_L(u_m; p) → 0, we have R_ℓ(ψ(u_m); p) → 0.

Note that (A ∧ C) defines calibration as in Definition 21, and we show A ⟺ B in Lemma 23. Lemma 24 shows calibration holds if and only if D, which yields a condition equivalent to calibration without dependence on the function ζ.

Proposition 22 When R and Y are finite, a continuous loss and link (L, ψ) are P-calibrated with respect to a target loss ℓ via Definition 21 if and only if they are P-calibrated via Definition 20.

Proof (⟹) We prove the contrapositive; if (L, ψ) is not calibrated with respect to ℓ by Definition 20, then it is not calibrated via Definition 21 either. If (L, ψ) are not calibrated with respect to ℓ by Definition 20, then there is a p ∈ P so that inf_{u : ψ(u) ∉ γ(p)} E_p L(u, Y) = inf_u E_p L(u, Y). Thus there is a sequence {u_m} so that ψ(u_m) ∉ γ(p) for each m and E_p L(u_m, Y) → L̲(p). Now we have R_L(u_m; p) → 0 but R_ℓ(ψ(u_m); p) ↛ 0, so by Lemma 24, we contradict calibration by Definition 21.

(⟸) Suppose there is a function ζ satisfying the bound in eq. (6) for a fixed distribution p ∈ P.
Observe that the bound in eq. (5) can be written as R_L(u, p) > 0 for all p ∈ ∆_Y and u such that ψ(u) ∉ γ(p). By eq. (6), for any sequence {u_m} so that ψ(u_m) ∉ γ(p), we must have R_ℓ(ψ(u_m), p) ≤ ζ(R_L(u_m, p)). As R and Y are finite, R_ℓ(ψ(u_m), p) is bounded away from 0 whenever ψ(u_m) ∉ γ(p), so ζ(R_L(u_m, p)) ↛ 0 and hence R_L(u_m, p) ↛ 0; thus, the strict inequality holds.

The following lemma shows that conditions A and B are equivalent, so that we can use condition B in lieu of condition A in the proof of Lemma 24.

Lemma 23 A function ζ : R_+ → R_+ is continuous at 0 with ζ(0) = 0 if and only if every sequence {u_m} → 0 satisfies ζ(u_m) → 0.

Proof (⟹) Suppose we have a sequence {u_m} → 0. By continuity, we have lim_{u_m→0} ζ(u_m) = ζ(0) = 0, so ζ(u_m) → 0.

(⟸) Suppose ζ(0) ≠ 0 or ζ is not continuous at 0. If ζ(0) ≠ 0, the constant sequence {u_m} = 0 converges to 0 but ζ(u_m) = ζ(0) ↛ 0. If ζ(0) = 0 but ζ is not continuous at 0, there must be a sequence {u_m} → 0 with lim_{m→∞} ζ(u_m) ≠ ζ(0) = 0, so ζ(u_m) ↛ 0.

We now prove the equivalence of calibration and condition D with Lemma 23 in mind.

Lemma 24 A continuous surrogate and link (L, ψ) are P-calibrated (via Definition 21) with respect to ℓ if and only if, for all p ∈ P and sequences {u_m} so that R_L(u_m; p) → 0, we have R_ℓ(ψ(u_m); p) → 0.

Proof (⟹) Take a sequence {u_m} so that R_L(u_m; p) → 0. Since ζ(0) = 0 and ζ is continuous at 0, we have ζ(R_L(u_m; p)) → 0. As the bound from Equation (6) is satisfied for all u ∈ R^d by assumption, we observe

    0 ≤ R_ℓ(ψ(u_m); p) ≤ ζ(R_L(u_m; p)) for all m
    ⟹ 0 ≤ lim_{m→∞} R_ℓ(ψ(u_m); p) ≤ lim_{m→∞} ζ(R_L(u_m; p)) = 0
    ⟹ lim_{m→∞} R_ℓ(ψ(u_m); p) = 0.

(⟸) Fix p ∈ P, and consider ζ(c) := sup_{u : R_L(u,p) ≤ c} R_ℓ(ψ(u); p). We will show that R_L(u_m; p) → 0 ⟹ R_ℓ(ψ(u_m); p) → 0 implies conditions B and C with the ζ constructed above. With ζ as constructed, we observe that the bound in equation (6) is satisfied for all u ∈ R^d, and apply Lemma 23 to observe that if there is a sequence {ε_m} → 0 with ζ(ε_m) ↛ 0, it is because R_L(u_m, p) → 0 does not imply R_ℓ(ψ(u_m), p) → 0.

First, the bound in equation (6) is satisfied for all u ∈ R^d by construction of ζ. Let S(v) := {u ∈ R^d : R_L(u; p) ≤ R_L(v, p)}. Showing R_ℓ(ψ(u); p) ≤ sup_{u′ ∈ S(u)} R_ℓ(ψ(u′); p) for all u ∈ R^d gives condition C. As u is in the set over which the supremum is taken (since R_L(u; p) ≤ R_L(u; p)), we then have the bound by definition of the supremum.

Now suppose there exists a sequence {ε_m} → 0 with ζ(ε_m) ↛ 0. Consider S(ε) = {u ∈ R^d : R_L(u, p) ≤ ε}. Then

    ε₁ ≤ ε₂ ⟹ S(ε₁) ⊆ S(ε₂) ⟹ ζ(ε₁) ≤ ζ(ε₂).

Now suppose there exists a sequence {u_m} so that R_L(u_m, p) → 0. Then for all ε > 0, there exists m₀ ∈ N so that R_L(u_m, p) < ε for all m ≥ m₀. Since this is true for all ε, we have S(ε) nonempty for all ε > 0, and therefore ζ(c) is well defined for all c > 0. Now if ζ(ε_m) ↛ 0, it must be because R_ℓ(ψ(u_m), p) ↛ 0 for some sequence {u_m} with R_L(u_m, p) → 0, contradicting the assumption. Such a sequence {u_m} with converging surrogate regret always exists by continuity and boundedness from below of the surrogate loss, since we can take the constant sequence at the (attained) infimum.

A.1. Relating calibration, consistency, and indirect elicitation

Even with the more general notion of calibration that extends beyond discrete predictions, we still have consistency implying calibration.
Proposition 25 If a loss and link (L, ψ) are consistent with respect to a loss ℓ, then they are calibrated with respect to ℓ.

Proof We show the contrapositive. If (L, ψ) are not calibrated with respect to ℓ, then there is a sequence {u_m} such that R_L(u_m; p) → 0 but R_ℓ(ψ(u_m); p) ↛ 0. Consider a distribution D over X × Y that has only one x ∈ X with Pr_D(X = x) > 0, so that p := D_x and E_D f(X, Y) = E_p f(x, Y). Consider any sequence of functions {f_m} with f_m(x) = u_m for all m. Now we have E_D L(f_m(X), Y) → inf_f E_D L(f(X), Y), but E_D ℓ(ψ∘f_m(X), Y) ↛ inf_f E_D ℓ(ψ∘f(X), Y), and therefore (L, ψ) is not consistent with respect to ℓ.

Moreover, we have calibration implying indirect elicitation.

Lemma 26 If a surrogate and link (L, ψ) with L ∈ L are calibrated with respect to a loss ℓ : R × Y → R, then L indirectly elicits the property γ := prop_P[ℓ].

Proof Let Γ be the unique property directly elicited by L, and fix p ∈ ∆_Y and u such that p ∈ Γ_u. We know such a u exists since Γ(p) ≠ ∅. As p ∈ Γ_u, then ζ(E_p L(u, Y) − L̲(p)) = ζ(0) = 0, and we observe the bound ℓ(ψ(u); p) ≤ ℓ̲(p). We also have ℓ(ψ(u); p) ≥ ℓ̲(p) by definition of ℓ̲, so we must have ℓ(ψ(u); p) = ℓ̲(p) = ℓ(γ(p); p), and therefore p ∈ γ_{ψ(u)}. Thus, we have Γ_u ⊆ γ_{ψ(u)}, so L indirectly elicits γ.

Combining the two results, we can observe the result of Theorem 8 another way: through calibration.

Appendix B. Reconstructing Ramaswamy and Agarwal (2016, Thm. 16)

Lemma 27 Let the d-flat F ⊆ P (defined over finite Y) contain some p ∈ relint(P). Then (i) p ∈ relint(F); (ii) dim(S_F(p)) ≥ dim(affhull(P)) − d.

Proof As F is a d-flat, we have some W : Y → R^d such that F = ker_P W. Throughout, given a point (typically a distribution) p and convex set P, we define P_p := P − {p}. Define T_W : span(P_p) → R^d, v ↦ E_v W.

(i) Since p ∈ relint(P), for all q ∈ P, there is some small enough ε₀ > 0 such that for α ∈ (−ε₀, ε₀), the point q_α := p − α(q − p) is still in P. In particular, for q ∈ F, we claim q_α ∈ F. As p, q ∈ F, we have E_p W = E_q W = 0. By linearity of expectation, we then have E_{q_α} W = 0. This implies q_α ∈ F, and therefore p ∈ relint(F).

(ii) We first show span(F_p) = S_F(p). First, take v ∈ S_F(p), and take ε₀ as in the definition. For ε = ε₀/2, we then have p + εv ∈ F ⟹ εv ∈ F_p, and therefore v ∈ span(F_p). Now take v ∈ span(F_p). Since p ∈ relint(F) by (i), we have 0 ∈ relint(F_p). Therefore there is an ε₀ > 0 so that εv ∈ F_p for all ε ∈ (−ε₀, ε₀) by convexity of F. Therefore, v ∈ S_F(p), and we observe S_F(p) = span(F_p).

We now show S_F(p) = ker(T_W). Observe that S_F(p) ⊆ ker(T_W) follows trivially from the definitions of the two sets. Now let v ∈ ker(T_W); we wish to show some nonzero multiple of v lies in F_p. We have E_v W = 0, so it suffices to show v′ = cv ∈ F_p for some c ≠ 0, thus showing v ∈ S_F(p). Since p ∈ relint(P), we must have 0 ∈ relint(F_p), so we know there is some small enough ε > 0 so that −αv ∈ F_p for α ∈ (−ε, ε). Take c = −α, and we conclude v ∈ S_F(p). Therefore, ker(T_W) = S_F(p).

We finally want to show dim(affhull(P)) = dim(span(P_p)). Consider that any q ∈ span(P_p) can be written as a scalar multiple of an element of P_p, which can be written as a convex combination of elements of the minimal basis of P_p.
In particular, since 0 ∈ P_p, it can be written as an affine combination of elements of the basis, so dim(affhull(P)) ≥ dim(span(P_p)). We also have affhull(P) − {p} ⊆ span(P_p), so dim(affhull(P)) = dim(affhull(P) − {p}) ≤ dim(span(P_p)). Therefore, dim(affhull(P)) = dim(span(P_p)).

As Y is a finite set, span(P_p) is a finite-dimensional vector space. The rank-nullity theorem states dim(im(T_W)) + dim(ker(T_W)) = dim(span(P_p)) = dim(affhull(P)). As dim(im(T_W)) ≤ d, and we have shown above that S_F(p) = span(F_p) = ker(T_W), the conclusion follows.

Corollary 28 (Ramaswamy and Agarwal (2016) Theorem 18) Let ℓ : R × Y → R be a discrete loss eliciting γ : ∆_Y ⇒ R with Y finite. Then for all p ∈ ∆_Y and r ∈ γ(p),

    cons_cvx(γ) ≥ ‖p‖₀ − dim(S_{γ_r}(p)) − 1. (3)

Proof Let L ∈ L^cvx_d be a calibrated surrogate for ℓ, and let Γ := prop_{∆_Y}[L]. Consider Y′ := {y ∈ Y : p_y > 0} and p′ = (p_y)_{y∈Y′} ∈ ∆_{Y′}. Take L′ := L|_{Y′} and ℓ′ := ℓ|_{Y′}. Define h : R^{Y′} → R^Y such that h(q′) = q with q_y = q′_y for y ∈ Y′ and q_y = 0 otherwise. Take Γ′ = Γ ∘ h, γ′ = γ ∘ h.

We first wish to show L′ indirectly elicits γ′. Since L indirectly elicits γ, we have a link ψ such that for all u ∈ R^d, Γ_u ⊆ γ_{ψ(u)}. As Γ′(q′) = Γ(h(q′)) and γ′(q′) = γ(h(q′)), we have q′ ∈ Γ′_u ⟺ h(q′) ∈ Γ_u ⟹ h(q′) ∈ γ_{ψ(u)} ⟺ q′ ∈ γ′_{ψ(u)}, and therefore L′ indirectly elicits γ′ via the link ψ.

We aim to show dim(S_{γ_r}(p)) ≥ dim(S_{γ′_r}(p′)). We do this by showing that h(S_{γ′_r}(p′)) ⊆ S_{γ_r}(p), and the result holds as h is linear and injective. Suppose v ∈ h(S_{γ′_r}(p′)); then there exists v′ so that v = h(v′) and an ε₀ > 0 such that εv′ + p′ ∈ γ′_r for all ε ∈ (−ε₀, ε₀). Since h is linear and h(γ′_r) ⊆ γ_r, this implies εv + p ∈ γ_r for all ε ∈ (−ε₀, ε₀). Therefore v ∈ S_{γ_r}(p), and the result follows.

As L′ indirectly elicits γ′, by Corollary 13, we know there exists a d-flat F with p′ ∈ F ⊆ γ′_r. Taking P = ∆_{Y′}, we know p′ ∈ relint(∆_{Y′}) by construction, so we can apply Lemma 27(ii), which gives dim(S_F(p′)) ≥ dim(affhull(∆_{Y′})) − d = ‖p‖₀ − 1 − d.³ Additionally, S_F(p′) ⊆ S_{γ′_r}(p′) by subset inclusion of the sets themselves. Chaining these results, we obtain dim(S_{γ_r}(p)) ≥ dim(S_{γ′_r}(p′)) ≥ dim(S_F(p′)) ≥ ‖p‖₀ − 1 − d.

3. To reason that dim(affhull(∆_{Y′})) = ‖p‖₀ − 1, observe that the uniform distribution on ∆_{Y′} has full support, so ∆_{Y′} spans an affine space of dimension ‖p‖₀ − 1.

Appendix C. Proof of Theorem 17

C.1. General setting of elicitation complexity

We briefly introduce the general notion of elicitation complexity, of which Definition 4 is a special case, as some statements are more naturally made in this general setting.

Definition 29 Γ′ refines Γ if for all r′ ∈ range Γ′ there exists r ∈ range Γ with Γ′_{r′} ⊆ Γ_r. Equivalently, Γ′ refines Γ if there is a link function ψ : range Γ′ → range Γ such that Γ′_{r′} ⊆ Γ_{ψ(r′)} for all r′ ∈ range Γ′.

Definition 30 For k ∈ N ∪ {∞}, let E_k(P) denote the class of all elicitable properties Γ : P → R^k, and E(P) := ∪_{k∈N∪{∞}} E_k(P). When P is implicit we simply write E.

Definition 31 Let C be a class of properties.
Appendix C. Proof of Theorem 17

C.1. General setting of elicitation complexity

We briefly introduce the general notion of elicitation complexity, of which Definition 4 is a special case, as some statements are more naturally made in this general setting.

Definition 29 $\Gamma'$ refines $\Gamma$ if for all $r' \in \mathrm{range}\,\Gamma'$ there exists $r \in \mathrm{range}\,\Gamma$ with $\Gamma'_{r'} \subseteq \Gamma_r$. Equivalently, $\Gamma'$ refines $\Gamma$ if there is a link function $\psi : \mathrm{range}\,\Gamma' \to \mathrm{range}\,\Gamma$ such that $\Gamma'_{r'} \subseteq \Gamma_{\psi(r')}$ for all $r' \in \mathrm{range}\,\Gamma'$.

Definition 30 For $k \in \mathbb{N} \cup \{\infty\}$, let $\mathcal{E}_k(\mathcal{P})$ denote the class of all elicitable properties $\Gamma : \mathcal{P} \to \mathbb{R}^k$, and $\mathcal{E}(\mathcal{P}) := \bigcup_{k \in \mathbb{N} \cup \{\infty\}} \mathcal{E}_k(\mathcal{P})$. When $\mathcal{P}$ is implicit we simply write $\mathcal{E}$.

Definition 31 Let $\mathcal{C}$ be a class of properties. The elicitation complexity of a property $\Gamma$ with respect to $\mathcal{C}$, denoted $\mathrm{elic}_{\mathcal{C}}(\Gamma)$, is the minimum value of $k \in \mathbb{N} \cup \{\infty\}$ such that there exists $\hat\Gamma \in \mathcal{C} \cap \mathcal{E}_k(\mathcal{P})$ that refines $\Gamma$.

C.2. Supporting statements

Proposition 32 (Osband (1985)) Let $\Gamma$ be elicitable. Then $\Gamma_r$ is convex for all $r \in \mathrm{range}\,\Gamma$.

Lemma 33 (Set-valued extension of Frongillo and Kash (2020, Lemma 4)) If $\Gamma'$ refines $\Gamma$, then $\mathrm{elic}_{\mathcal{C}}(\Gamma') \ge \mathrm{elic}_{\mathcal{C}}(\Gamma)$.

Proof As $\Gamma'$ refines $\Gamma$, we have some $\psi : \mathrm{range}\,\Gamma' \to \mathrm{range}\,\Gamma$ such that for all $r' \in \mathrm{range}\,\Gamma'$ we have $\Gamma'_{r'} \subseteq \Gamma_{\psi(r')}$. Suppose we have $\hat\Gamma \in \mathcal{C}$ and $\varphi : \mathrm{range}\,\hat\Gamma \to \mathrm{range}\,\Gamma'$ such that for all $u \in \mathrm{range}\,\hat\Gamma$ we have $\hat\Gamma_u \subseteq \Gamma'_{\varphi(u)}$. Then for all $u \in \mathrm{range}\,\hat\Gamma$ we have $\hat\Gamma_u \subseteq \Gamma'_{\varphi(u)} \subseteq \Gamma_{(\psi \circ \varphi)(u)}$. In particular, if $\mathrm{elic}_{\mathcal{C}}(\Gamma') = m$, then we have such a $\hat\Gamma : \mathcal{P} \rightrightarrows \mathbb{R}^m$, and hence $\mathrm{elic}_{\mathcal{C}}(\Gamma) \le m$.

Lemma 34 (Frongillo and Kash (2020, Lemma 8)) Suppose $L \in \mathcal{L}$ elicits $\Gamma : \mathcal{P} \to \mathbb{R}$ and has Bayes risk $\underline{L}$. Then for any $p, p' \in \mathcal{P}$ with $\Gamma(p) \neq \Gamma(p')$, we have $\underline{L}(\lambda p + (1-\lambda) p') > \lambda \underline{L}(p) + (1-\lambda) \underline{L}(p')$ for all $\lambda \in (0, 1)$.

Lemma 35 (Adapted from Frongillo and Kash (2020, Theorem 4)) If $L$ elicits a single-valued $\Gamma$, and $\hat\Gamma$ refines $\underline{L}$, then $\hat\Gamma$ refines $\Gamma$.

Proof Suppose for a contradiction that $\hat\Gamma$ does not refine $\Gamma$. Then we have some $u \in \mathrm{range}\,\hat\Gamma$ such that for all $r \in \mathrm{range}\,\Gamma$ we have $\hat\Gamma_u \not\subseteq \Gamma_r$. In particular, recalling that $\Gamma$ is single-valued, we must have $p, p' \in \hat\Gamma_u$ such that $\Gamma(p) \neq \Gamma(p')$. Moreover, as $\hat\Gamma$ refines $\underline{L}$, we also have $\underline{L}(p) = \underline{L}(p')$. From Lemma 34 with $\lambda = 1/2$, we have $\underline{L}(q) > \tfrac{1}{2}\underline{L}(p) + \tfrac{1}{2}\underline{L}(p') = \underline{L}(p)$, where $q = \tfrac{1}{2}p + \tfrac{1}{2}p'$. As the level set $\hat\Gamma_u$ is convex by Proposition 32, we also have $q \in \hat\Gamma_u$, and hence $\underline{L}(q) = \underline{L}(p)$, a contradiction.
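To make Lemma 34 concrete, here is a worked instance (our illustration, not part of the appendix): take $\mathcal{Y} \subseteq \mathbb{R}$ and the squared loss $L(u, y) = (u - y)^2$, which elicits the mean $\Gamma(p) = \mathbb{E}_p Y$ and has Bayes risk $\underline{L}(p) = \mathrm{Var}_p(Y)$. Writing $\mu = \mathbb{E}_p Y$ and $\mu' = \mathbb{E}_{p'} Y$, a direct computation gives, for the mixture $q = \lambda p + (1-\lambda) p'$,
$$\underline{L}(q) - \lambda \underline{L}(p) - (1-\lambda)\,\underline{L}(p') = \lambda(1-\lambda)(\mu - \mu')^2,$$
which is strictly positive exactly when $\Gamma(p) \neq \Gamma(p')$, as the lemma asserts in this case.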
Lemma 36 (Minor modifications from Frongillo and Kash (2020)) Let $V$ be a real vector space. Let $f : V \to \mathbb{R}^k$ be linear and $C \subseteq V$ convex with $\mathrm{span}\, C = V$, and let $m \in \mathbb{N}$. Suppose that $0 \in \mathrm{int}\, f(C)$, and for all $v \in S := C \cap \ker f$, there exists a linear $\hat f_v : V \to \mathbb{R}^m$ with $v \in C \cap \ker \hat f_v \subseteq S$. Then $m \ge k$. If $m = k$, we additionally have $0 \in \mathrm{int}\, \hat f_v(C)$ for some $v \in S$.

Proof The condition $0 \in \mathrm{int}\, f(C)$ is equivalent to the existence of some $v_1, \ldots, v_{k+1} \in C$ such that $0 \in \mathrm{int\,conv}\{f(v_i) : i \in \{1, \ldots, k+1\}\}$. Let $\alpha_1, \ldots, \alpha_{k+1} > 0$ with $\sum_{i=1}^{k+1} \alpha_i = 1$ be such that $\sum_{i=1}^{k+1} \alpha_i f(v_i) = 0$. As these are barycentric coordinates, this choice of $\alpha_i$ is unique, a fact which will be important later. We will take $v = \sum_{i=1}^{k+1} \alpha_i v_i$, an element of $C$ by convexity, and thus an element of $S$ as $f(v) = 0$.

Let $\hat f_v : V \to \mathbb{R}^m$ be linear with $v \in \hat S := C \cap \ker \hat f_v \subseteq S$. Let $\beta_1, \ldots, \beta_{k+1} \in \mathbb{R}$ with $\sum_{i=1}^{k+1} \beta_i = 0$ be such that $\sum_{i=1}^{k+1} \beta_i \hat f_v(v_i) = 0$. We will show that the $\beta_i$ must be identically zero, i.e. that $\{\hat f_v(v_i) : i \in \{1, \ldots, k+1\}\}$ are affinely independent. By construction, $v' := \sum_{i=1}^{k+1} \beta_i v_i \in \ker \hat f_v$, and as $v \in \ker \hat f_v$, for all $\lambda > 0$ we have $v_\lambda := v + \lambda v' = \sum_{i=1}^{k+1} (\alpha_i + \lambda \beta_i) v_i \in \ker \hat f_v$. Taking $\lambda$ sufficiently small, we have $\gamma_i := \alpha_i + \lambda \beta_i > 0$ for all $i$, and $\sum_{i=1}^{k+1} \gamma_i = \sum_{i=1}^{k+1} \alpha_i + \lambda \sum_{i=1}^{k+1} \beta_i = 1$. By convexity of $C$, we have $v_\lambda \in C$. Now $v_\lambda \in C \cap \ker \hat f_v \subseteq S = C \cap \ker f$, and in particular $v_\lambda \in \ker f$. Thus, $f(v_\lambda) = \sum_{i=1}^{k+1} \gamma_i f(v_i) = 0$. By the uniqueness of barycentric coordinates, for all $i \in \{1, \ldots, k+1\}$ we must have $\gamma_i = \alpha_i$, and thus $\beta_i = 0$, as desired. As $\hat f_v(C)$ contains $k+1$ affinely independent points, we have $m \ge \dim \mathrm{im}\, \hat f_v \ge k$.

When $m = k$, by affine independence, the set $\mathrm{conv}\{\hat f_v(v_i) : i \in \{1, \ldots, k+1\}\}$ has dimension $k$ in $\mathbb{R}^k$. As $0 = \hat f_v(v) = \sum_{i=1}^{k+1} \alpha_i \hat f_v(v_i)$ and $\alpha_i > 0$ for all $i$, we conclude $0 \in \mathrm{int\,conv}\{\hat f_v(v_i) : i \in \{1, \ldots, k+1\}\} \subseteq \mathrm{int}\, \hat f_v(C)$.

Lemma 37 (Frongillo and Kash (2020, Lemma 14)) Let $V$ be a real vector space. Let $f : V \to \mathbb{R}^k$ be linear, $C \subseteq V$ convex with $\mathrm{span}\, C = V$, and let $S = C \cap \ker f$. If $0 \in \mathrm{int}\, f(C)$ then $\mathrm{span}\, S = \ker f$.

C.3. Proving the lower bound for spectral risks

Let $\mathcal{C}^*_d$ be the class of properties $\Gamma$ which are elicited by a convex loss $L \in \mathcal{L}^{\mathrm{cvx}}_d$ for some $d \in \mathbb{N}$, and let $\mathcal{C}^* := \bigcup_{d \in \mathbb{N}} \mathcal{C}^*_d$. Then for all properties $\gamma$, if $\mathrm{elic}_{\mathcal{C}^*}(\gamma) < \infty$, we have $\mathrm{elic}_{\mathcal{C}^*}(\gamma) = \mathrm{elic}_{\mathrm{cvx}}(\gamma)$, a fact we use tacitly in the proof.

Theorem 17 Let $\mathcal{P}$ be a set of Lebesgue densities supported on the same set for all $p \in \mathcal{P}$. Let $\Gamma : \mathcal{P} \to \mathbb{R}^d$ satisfy Condition 1 for some $r \in \mathbb{R}^d$. Let $L \in \mathcal{L}^{\mathrm{cvx}}$ elicit $\Gamma$ such that $\underline{L}$ is non-constant on $\Gamma_r$. Then $\mathrm{cons}_{\mathrm{cvx}}(\underline{L}) \ge \mathrm{elic}_{\mathrm{cvx}}(\underline{L}) \ge d + 1$.

Proof Let $V : \mathcal{Y} \to \mathbb{R}^d$ and $r$ be given by the statement of the theorem and from Condition 1. Let $m = \mathrm{elic}_{\mathcal{C}^*}(\underline{L})$, so that we have $\hat\Gamma \in \mathcal{C}^*_m$ which refines $\underline{L}$. By Lemma 35, $\hat\Gamma$ refines $\Gamma$. We now establish the conditions of Lemma 36 for $C = \mathcal{P}$. Let $f : \mathrm{span}\,\mathcal{P} \to \mathbb{R}^d$, $p \mapsto \mathbb{E}_p V$. From Condition 1, we have $0 \in \mathrm{int}\, f(\mathcal{P})$ and $\ker f \cap \mathcal{P} = \ker_{\mathcal{P}} V = \Gamma_r$. Now let $p \in \Gamma_r$ be arbitrary, and take any $u \in \hat\Gamma(p)$. As $\Gamma$ is single-valued, $r \in \mathrm{range}\,\Gamma$ is the unique value with $p \in \Gamma_r$. As $\hat\Gamma$ refines $\Gamma$, there exists $r' \in \mathrm{range}\,\Gamma$ with $\hat\Gamma_u \subseteq \Gamma_{r'}$, and since $p \in \hat\Gamma_u$, we conclude $r' = r$ from the above. From Lemma 11, we have some $\hat V_{u,p}$ with $p \in \ker_{\mathcal{P}} \hat V_{u,p} \subseteq \hat\Gamma_u \subseteq \Gamma_r = \ker_{\mathcal{P}} V$. Letting $\hat f_p : \mathrm{span}\,\mathcal{P} \to \mathbb{R}^m$, $q \mapsto \mathbb{E}_q \hat V_{u,p}$, we have now satisfied the conditions of Lemma 36. We conclude $m \ge d$; moreover, if $m = d$, then there exists some $q \in \Gamma_r$ such that $0 \in \mathrm{int}\, \hat f_q(\mathcal{P})$.

Now suppose $m = d$ for a contradiction. Let $\hat S := \ker \hat f_q \cap \mathcal{P}$. Applying Lemma 37 to the functions $f$ and $\hat f_q$, we have $\ker f = \mathrm{span}\,\Gamma_r$ and $\ker \hat f_q = \mathrm{span}\,\hat S$. As $\hat S \subseteq \Gamma_r$, we have $\ker \hat f_q = \mathrm{span}\,\hat S \subseteq \mathrm{span}\,\Gamma_r = \ker f$. By the first isomorphism theorem, we also have $\mathrm{codim}\,\ker \hat f_q = \mathrm{codim}\,\ker f = d$, as the images of these linear maps span all of $\mathbb{R}^d$. By the third isomorphism theorem we conclude $\ker \hat f_q = \ker f$, and hence $\hat S = \mathcal{P} \cap \ker \hat f_q = \mathcal{P} \cap \ker f = \Gamma_r$. Moreover, as $\hat S \subseteq \hat\Gamma_u \subseteq \Gamma_r$, we have $\hat S = \hat\Gamma_u = \Gamma_r$.

We now see that $\underline{L}$ is constant on $\Gamma_r$: since $\hat\Gamma$ refines $\underline{L}$, there is some link function $\psi : \mathbb{R}^m \to \mathbb{R}$ such that $\Gamma_r = \hat\Gamma_u \subseteq \underline{L}_{\psi(u)}$, meaning $\underline{L}(p) = \psi(u)$ for all $p \in \Gamma_r$. This statement contradicts the assumption that $\underline{L}$ is non-constant on $\Gamma_r$.
Appendix D. Miscellaneous omitted proofs

Lemma 7 Let a convex $\mathcal{P} \subseteq \Delta_{\mathcal{Y}}$ be given. Given a surrogate loss $L \in \mathcal{L}$, link $\psi$, and target loss $\ell$, set $\mu(r, p) := R_\ell(r; p) - \underline{R}_\ell(p)$. Then there is a $\mathcal{D}$ such that $(L, \psi)$ is $\mathcal{D}$-consistent with respect to $\ell$ if and only if $(L, \psi)$ is $(\mu, \mathcal{D})$-consistent with respect to $\gamma := \mathrm{prop}_{\mathcal{P}}[\ell]$.

Proof First, observe that $\mu(r, p) = 0 \iff \mathbb{E}_p \ell(r, Y) = \inf_{r' \in \mathcal{R}} \mathbb{E}_p \ell(r', Y) \iff r \in \gamma(p)$. Now suppose $(L, \psi)$ are consistent with respect to $\ell$, and take any sequence $\{f_m\}$ of measurable hypotheses. Rewriting the right-hand side of Definition 5,
$$\mathbb{E}_{\mathcal{D}}\, \ell(\psi \circ f_m(X), Y) \to \inf_f \mathbb{E}_{\mathcal{D}}\, \ell(\psi \circ f(X), Y) \qquad (7)$$
$$\iff \mathbb{E}_X\, R_\ell(\psi \circ f_m(X); \mathcal{D}_X) \to \mathbb{E}_X\, \underline{R}_\ell(\mathcal{D}_X) \iff \mathbb{E}_X\, \mu(\psi \circ f_m(X); \mathcal{D}_X) \to 0. \qquad (8)$$
Therefore, $\mathbb{E}_{\mathcal{D}} L(f_m(X), Y) \to \inf_f \mathbb{E}_{\mathcal{D}} L(f(X), Y)$ implies (7) if and only if it implies (8). Observe that the assumptions on $L$ allow us to apply the Fubini–Tonelli Theorem (Folland, 1999, Theorem 2.37), which yields the equivalence of eq. (7) to the next line.

A hyperplane weakly separates two sets if its two closed halfspaces respectively contain the two sets.

Lemma 38 If $\gamma : \mathcal{P} \rightrightarrows \mathcal{R}$ is an elicitable property, then for any pair of predictions $r, r' \in \mathcal{R}$ where $\gamma_r \neq \gamma_{r'}$, there is a hyperplane $H = \{x \in \mathbb{R}^{\mathcal{Y}} : v \cdot x = 0\}$, for some $v \in \mathbb{R}^{\mathcal{Y}}$, that weakly separates $\gamma_r$ and $\gamma_{r'}$ and has $\gamma_r \cap H = \gamma_{r'} \cap H = \gamma_r \cap \gamma_{r'}$.

Proof Let $\ell$ elicit $\gamma$. Let $v = \ell(r, \cdot) - \ell(r', \cdot)$, interpreted as a nonzero vector in $\mathbb{R}^{\mathcal{Y}}$. Let $H = \{q : v \cdot q = 0\}$. If $v \cdot q < 0$, then $r'$ cannot be optimal, so $q \notin \gamma_{r'}$. So $\gamma_{r'} \subseteq \{q : v \cdot q \ge 0\}$. Symmetrically, $\gamma_r \subseteq \{q : v \cdot q \le 0\}$. This is weak separation, and it immediately implies that $\gamma_r \cap \gamma_{r'} \subseteq H$. Finally, $v \cdot q = 0$, i.e. $q \in H$, holds if and only if the expected losses of the two reports are equal under $q$, in which case $r$ is optimal for $q$ exactly when $r'$ is. So $q \in \gamma_r \cap H \iff q \in \gamma_{r'} \cap H$. This gives $\gamma_r \cap H = \gamma_{r'} \cap H = \gamma_r \cap \gamma_{r'} \cap H = \gamma_r \cap \gamma_{r'}$.

Lemma 39 Suppose we are given an elicitable property $\gamma : \mathcal{P} \rightrightarrows \mathcal{R}$, where $\mathcal{Y}$ is finite, and a distribution $p \in \mathrm{relint}(\mathcal{P})$ such that $p \in \gamma_r \cap \gamma_{r'}$ for $r, r' \in \mathcal{R}$. Then for any flat $F$ containing $p$, $F \subseteq \gamma_r \iff F \subseteq \gamma_{r'}$.

Proof If $\gamma_r = \gamma_{r'}$, we are done. Otherwise, Lemma 38 gives a hyperplane $H = \{x \in \mathbb{R}^{\mathcal{Y}} : v \cdot x = 0\}$ and a guarantee that $\gamma_r \subseteq \{q \in \Delta_{\mathcal{Y}} : v \cdot q \le 0\}$, while $\gamma_{r'} \subseteq \{q \in \Delta_{\mathcal{Y}} : v \cdot q \ge 0\}$, and finally $\gamma_r \cap \gamma_{r'} \subseteq H$.

Suppose $F \subseteq \gamma_r$; we wish to show $F \subseteq \gamma_{r'}$. Let $q \in F$. By Lemma 27(i), we have $p \in \mathrm{relint}(F)$, so there exists $\epsilon > 0$ such that $q' := p - \epsilon(q - p) \in F$. Now, suppose for contradiction that $q \notin \gamma_{r'}$. Then $v \cdot q < 0$: containment in $\gamma_r$ gives $v \cdot q \le 0$, and if $v \cdot q = 0$ then $q \in \gamma_r \cap H \implies q \in \gamma_{r'}$, a contradiction. But, noting that $p \in H$, we have $v \cdot q' = -\epsilon(v \cdot q) > 0$, so $q'$ is not in $\gamma_r$. This contradicts the assumption $F \subseteq \gamma_r$. Therefore, we must have $q \in \gamma_{r'}$, so we have shown $F \subseteq \gamma_{r'}$. Because $r$ and $r'$ were completely symmetric, this completes the proof.

Appendix E. Omitted Examples

Discrete problem with no target loss (Quadrant 2). Consider the following scenario, where someone is deciding how to dress for the weather based on a meteorologist's forecast. Consider the three outcomes $\mathcal{Y} = \{\text{rainy}, \text{sunny}, \text{snowy}\}$, and suppose we want some bias towards health and safety, so the meteorologist should only predict sunny weather if $\Pr[\text{sunny} \mid \text{weather data}] \ge 3/4$. Otherwise, they should predict whichever is more likely given the weather data: rain or snow.

We can now model this problem by a property with reports $\mathcal{R} = \mathcal{Y}$,
$$\gamma(p) = \begin{cases} \text{sunny} & p_{\text{sunny}} \ge 3/4 \\ \text{rainy} & p_{\text{sunny}} \le 3/4 \ \wedge\ p_{\text{rainy}} \ge p_{\text{snowy}} \\ \text{snowy} & p_{\text{sunny}} \le 3/4 \ \wedge\ p_{\text{snowy}} \ge p_{\text{rainy}}, \end{cases}$$
shown in Figure 3. Since the cells of elicitable properties in the simplex form a power diagram (Lambert and Shoham, 2009), we know that there is actually no target loss that directly elicits this problem. Constructing a consistent surrogate for this task is ill-defined without Definition 6. The function $\mu(r, p) = \mathbb{I}\{r \notin \gamma(p)\}$ now allows us to use Definition 6 to think about consistent surrogates for this task.

Intuitively, the feasible subspace dimension bound would be lowest at the distribution $p = (1/8, 3/4, 1/8)$ (ordering the coordinates rainy, sunny, snowy), so one would like to apply Corollary 12 or the bound of Ramaswamy and Agarwal (2016) at $p$. However, we cannot apply either at $p$, since $\gamma(p) = \{\text{rainy}, \text{snowy}, \text{sunny}\}$ but the property is not elicitable.
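The following sketch (ours, with hypothetical function names) evaluates the property $\gamma$ above as a set-valued map, confirming that all three reports are optimal at $p = (1/8, 3/4, 1/8)$, while $\gamma(q) = \{\text{snowy}\}$ at $q = (1/3, 1/3 - \epsilon, 1/3 + \epsilon)$.

```python
def gamma(p_rainy, p_sunny, p_snowy, tol=1e-12):
    """Set-valued weather property: sunny requires probability >= 3/4;
    otherwise report the more likely of rain vs. snow (ties allow both)."""
    reports = set()
    if p_sunny >= 3/4 - tol:
        reports.add("sunny")
    if p_sunny <= 3/4 + tol and p_rainy >= p_snowy - tol:
        reports.add("rainy")
    if p_sunny <= 3/4 + tol and p_snowy >= p_rainy - tol:
        reports.add("snowy")
    return reports

eps = 0.01
print(gamma(1/8, 3/4, 1/8))              # {'rainy', 'sunny', 'snowy'}
print(gamma(1/3, 1/3 - eps, 1/3 + eps))  # {'snowy'}
```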
Ramaswamy and Agarwal (2016, Theorem 16) cannot draw any conclusions about this property, for two reasons: first, we are given a target property instead of a target loss. Second, since the property is not elicitable (hence why there can be no target loss), we observe $\dim(S_{\gamma_{\text{rainy}}}(p)) \neq \dim(S_{\gamma_{\text{sunny}}}(p))$, contradicting the requirements of Ramaswamy and Agarwal (2016, Lemma 23).

However, our bound from Corollary 12 applies on the distribution $q = (1/3, 1/3 - \epsilon, 1/3 + \epsilon)$ for small enough $\epsilon > 0$, since $\gamma(q) = \{\text{snowy}\}$, and yields $\mathrm{elic}_{\mathrm{cvx}}(\gamma) \ge 2$ for the convex elicitation complexity: there is no way to draw a 1-flat (a line, since $q \in \mathrm{relint}(\Delta_{\mathcal{Y}})$) through $q$ while staying in just one level set on the simplex (see the numerical sketch at the end of this appendix).

This example also extends to other decision-tree-like properties that do not have an explicit or easily constructed target loss.

[Figure 3: The cells $\gamma_{\text{rainy}}$, $\gamma_{\text{sunny}}$, and $\gamma_{\text{snowy}}$ of the property $\gamma$ on the simplex over $\{\text{rainy}, \text{sunny}, \text{snowy}\}$, with two distributions of interest marked.]
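To illustrate the geometric claim numerically, the following sketch (our own check, not from the paper) samples sum-zero directions through $q$, extends each to the full chord of the simplex, and tests whether the chord stays inside the single cell $\gamma_{\text{snowy}}$; with coordinates ordered (rainy, sunny, snowy) and a small $\epsilon$, no sampled direction succeeds, consistent with the claim above.

```python
import numpy as np

rng = np.random.default_rng(0)
eps = 0.01
q = np.array([1/3, 1/3 - eps, 1/3 + eps])       # (rainy, sunny, snowy)

def in_snowy_cell(x):
    rainy, sunny, snowy = x
    return sunny <= 3/4 + 1e-12 and snowy >= rainy - 1e-12

def chord_stays_in_cell(q, v, n_grid=200):
    """Extend the line q + t*v to the largest t-interval keeping all
    coordinates nonnegative (the full chord of the simplex), then check
    membership in gamma_snowy along a grid of points on the chord."""
    with np.errstate(divide="ignore"):
        ts = -q / v                              # where each coordinate hits zero
    t_hi = min(t for t in ts if t > 0)
    t_lo = max(t for t in ts if t < 0)
    return all(in_snowy_cell(q + t * v) for t in np.linspace(t_lo, t_hi, n_grid))

hits = 0
for _ in range(5000):
    v = rng.normal(size=3)
    v -= v.mean()                                # sum-zero, so the line stays in affhull(simplex)
    hits += chord_stays_in_cell(q, v)

print(hits)   # 0: no full chord (1-flat) through q stays inside gamma_snowy
```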