Distribution-Free Conditional Median Inference
Dhruv Medarametla * Emmanuel Candès † February 15, 2021
Abstract
We consider the problem of constructing confidence intervals for the median of a response Y ∈ R conditional on features X = x ∈ R^d in a situation where we are not willing to make any assumption whatsoever on the underlying distribution of the data (X, Y). We propose a method based upon ideas from conformal prediction and establish a theoretical guarantee of coverage while also going over particular distributions where its performance is sharp. Further, we provide a lower bound on the length of any possible conditional median confidence interval. This lower bound is independent of sample size and holds for all distributions with no point masses.

1 Introduction

Consider a dataset (X_1, Y_1), . . . , (X_n, Y_n) ⊆ R^d × R and a test point (X_{n+1}, Y_{n+1}), with all datapoints being drawn i.i.d. from the same distribution P. Given our training data, can we provide a confidence interval for the regression function μ(x) = E[Y_{n+1} | X_{n+1} = x]?

Methods for inferring the conditional mean of a distribution are certainly not in short supply. To the best of our knowledge, however, each approach relies on some assumptions in order to guarantee coverage. For instance, classical linear inference is often used but is only accurate if Y | X is normal with mean μ(x) = E[Y | X = x] affine in x and standard deviation independent of x. Nonparametric regressions cannot estimate the conditional mean without imposing smoothness conditions and assuming that the conditional distribution is sufficiently light tailed. Since reliable conditional mean inference is a very common problem, these methods are nevertheless used all the time, e.g. in predicting disease survival times, classifying spam emails, pricing financial assets, and more. The issue is that the assumptions these methods make rarely hold in practice.
Thus, the question remains: is it possible to estimate the conditional mean in a distribution-free setting, with no assumptions on P?

It turns out that it is not only impossible to get a nontrivial confidence interval for the conditional mean E[Y | X = x], but it is actually impossible to get a confidence interval for E[Y] itself. This result originates in Bahadur and Savage [1956], where the authors show that any parameter sensitive to the tails of a distribution cannot be estimated when no restrictions exist on the distribution class; an example of a distribution with a non-estimable mean is given in Appendix B.1.

Thus, within the distribution-free setting, making progress on inferring the conditional mean requires a modification to the problem statement. One strategy is to restrict the range of Y. An example of this is in Barber [2020], which introduces an algorithm that calculates a confidence interval for the conditional mean in the case where Y ∈ {0, 1}. However, even with this restriction, Barber shows that there exists a fundamental bound limiting how small any such confidence interval can be.

The other strategy is to modify the measure of central tendency that we study. Bahadur and Savage's result suggests that the best parameters to study are robust to distribution outliers; this observation motivates our investigation of the conditional median.

* Department of Statistics, Stanford University
† Department of Statistics and Mathematics, Stanford University

In a nutshell, the conditional median is possible to infer because of the strong and quantifiable relationship between any particular sampled datapoint (x, y) and Median(Y | X = x). Its robustness to outliers means that even within the distribution-free setting, there is no need to worry about 'hidden' parts of the distribution. Moreover, there already exists a well-known algorithm for estimating Median(Y) given a finite number n of i.i.d.
samples. Explored in Noether [1972] and covered in Appendix B.2, this algorithm produces intervals with widths going to zero as the sample size goes to infinity, suggesting that an algorithm for the conditional median might also perform well.

Our goal in this paper is to combine ideas from regular median inference with procedures from distribution-free inference in order to understand how well an algorithm can cover the conditional median and, more generally, conditional quantiles. In particular, we want to see if the properties of the median and quantiles lead to a valid inference method while also examining the limits of this inference.

The methods used in this paper are similar to those from distribution-free predictive inference, which focuses on predicting Y_{n+1} from a finite training set. The field of conformal predictive inference began with Vovk et al. [2005] and was built up by works such as Shafer and Vovk [2008] and Vovk et al. [2009]; it has been generating interest recently due to its versatility and lack of assumptions. Applications of conformal predictive inference range from criminal justice to drug discovery, as seen in Romano et al. [2019a] and Cortés-Ciriano and Bender [2019], respectively.

While this paper relies on techniques from predictive inference, our focus is on parameter inference, which is quite different from prediction. Whereas predictive inference exploits the fact that (X_{n+1}, Y_{n+1}) is exchangeable with the sample datapoints, parameter inference requires another layer of analysis, as Y_{n+1} is no longer the object of inference. This additional complexity demands modifying approaches from predictive inference to produce valid parameter inference.

1.1 Setup

We begin by setting up definitions to formalize the concepts above.
Throughout this paper, we assume that any distribution (X, Y) ∼ P is over R^d × R unless explicitly stated otherwise.

Given a feature vector X_{n+1} = x, we let Ĉ_n(x) ⊆ R denote a confidence interval for some functional of the conditional distribution Y_{n+1} | X_{n+1} = x. Note that we use the phrase confidence interval for convenience; in its most general form, Ĉ_n(x) is a subset of R. This interval is a function of the point x at which we seek inference as well as of our training data D = {(X_1, Y_1), . . . , (X_n, Y_n)}. We write Ĉ_n to refer to the general algorithm that maps D to the resulting confidence intervals Ĉ_n(x) for each x ∈ R^d.

In order for Ĉ_n to be useful, we want it to capture, or contain, the parameter we care about with high probability. We formalize this as follows:

Definition 1.
We say that Ĉ_n satisfies distribution-free median coverage at level 1 − α, denoted by (1 − α)-Median, if

P{ Median(Y_{n+1} | X_{n+1}) ∈ Ĉ_n(X_{n+1}) } ≥ 1 − α

for all distributions P on (X, Y) ∈ R^d × R.

Definition 2.
For 0 < q < 1, let Quantile_q(Y_{n+1} | X_{n+1}) refer to the q-th quantile of the conditional distribution Y_{n+1} | X_{n+1}. We say that Ĉ_n satisfies distribution-free quantile coverage for the q-th quantile at level 1 − α, denoted by (1 − α, q)-Quantile, if

P{ Quantile_q(Y_{n+1} | X_{n+1}) ∈ Ĉ_n(X_{n+1}) } ≥ 1 − α

for all distributions P on (X, Y) ∈ R^d × R.

The probabilities in Definitions 1 and 2 are both taken over the training data D = {(X_1, Y_1), . . . , (X_n, Y_n)} and test point X_{n+1}. Thus, satisfying (1 − α)-Median is equivalent to satisfying (1 − α, 0.5)-Quantile. These concepts are similar to predictive coverage, with the key difference being that our goal is now to predict a function of X_{n+1} rather than a new datapoint Y_{n+1}.

Definition 3.
We say that Ĉ_n satisfies distribution-free predictive coverage at level 1 − α, denoted by (1 − α)-Predictive, if

P{ Y_{n+1} ∈ Ĉ_n(X_{n+1}) } ≥ 1 − α

for all distributions P on (X, Y) ∈ R^d × R, where this probability is taken over the training data D = {(X_1, Y_1), . . . , (X_n, Y_n)} and test point (X_{n+1}, Y_{n+1}).

Finally, we define the type of conformity scores that we will use in our general quantile inference algorithm.

Definition 4.
We say that a function f : (R^d, R) → R is a locally nondecreasing conformity score if, for all x ∈ R^d and y, y′ ∈ R with y ≤ y′, we have f(x, y) ≤ f(x, y′).

1.2 Summary of Results

We find that there exists a distribution-free predictive inference algorithm that satisfies both (1 − α/2)-Predictive and (1 − α)-Median. Moreover, an improved version of this algorithm also satisfies (1 − α, q)-Quantile. Together, these prove that there exist nontrivial algorithms Ĉ_n that satisfy (1 − α)-Median and (1 − α, q)-Quantile for all 0 < q < 1.

We also find that there exist inherent limitations on how well these algorithms can ever perform. Specifically, we show that any algorithm that contains Median(Y_{n+1} | X_{n+1}) with probability 1 − α must also contain Y_{n+1} with probability at least 1 − α.

Taken together, these results give us somewhat conflicting perspectives. On the one hand, there exist distribution-free algorithms that capture the conditional median and conditional quantile with high likelihood; on the other hand, any such algorithm will also capture a large proportion of the distribution itself, putting a hard limit on how well such algorithms can ever perform.

This section proves the existence of algorithms obeying distribution-free median and quantile coverage. We then focus on situations where these algorithms are sharp.
Algorithm 1 below operates by taking the training dataset and separating it into two halves of sizes n_1 + n_2 = n. Next, a regression algorithm μ̂ is trained on D_1 = {(X_1, Y_1), . . . , (X_{n_1}, Y_{n_1})}. The residuals Y_i − μ̂(X_i) are calculated for n_1 < i ≤ n, and the 1 − α/2 quantile of the absolute values of these residuals is then used to create a confidence band around the prediction μ̂(X_{n+1}). The expert will recognize that this is identical to a well-known algorithm from predictive inference, as explained later.

Algorithm 1:
Confidence Interval for Median(Y_{n+1} | X_{n+1}) with Coverage 1 − α

Input:
Number of i.i.d. datapoints n ∈ N.
Split sizes n_1 + n_2 = n.
Datapoints (X_1, Y_1), . . . , (X_n, Y_n) ∼ P ⊆ (R^d, R).
Test point X_{n+1} ∼ P.
Regression algorithm μ̂.
Coverage level 1 − α ∈ (0, 1).

Process:
Randomly split {1, . . . , n} into disjoint I_1 and I_2 with |I_1| = n_1 and |I_2| = n_2.
Fit regression function μ̂ : R^d → R on {(X_i, Y_i) : i ∈ I_1}.
For i ∈ I_2 set E_i = |Y_i − μ̂(X_i)|.
Compute Q_{1−α/2}(E), the (1 − α/2)(1 + 1/n_2)-th empirical quantile of {E_i : i ∈ I_2}.

Output:
Confidence interval Ĉ_n(X_{n+1}) = [μ̂(X_{n+1}) − Q_{1−α/2}(E), μ̂(X_{n+1}) + Q_{1−α/2}(E)] for Median(Y_{n+1} | X_{n+1}).

Theorem 1.
For all distributions P and all regression algorithms μ̂, the output of Algorithm 1 contains Median(Y_{n+1} | X_{n+1}) with probability at least 1 − α. That is, the algorithm satisfies (1 − α)-Median.

In the interest of space, we only provide a sketch of the proof here; the full proof of Theorem 1 is included in Appendix A.1.
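To make the structure of Algorithm 1 concrete, here is a minimal Python sketch. It is illustrative only: the interface `fit_regressor` and the constant-median regressor used in the demo are our own stand-ins, not part of the paper.

```python
import numpy as np

def algorithm1_median_interval(X, Y, x_new, fit_regressor, alpha=0.1, seed=0):
    """Split-conformal interval for Median(Y_{n+1} | X_{n+1} = x_new),
    following the structure of Algorithm 1 (illustrative sketch)."""
    n = len(Y)
    rng = np.random.default_rng(seed)
    perm = rng.permutation(n)
    I1, I2 = perm[: n // 2], perm[n // 2 :]        # split n = n1 + n2

    mu_hat = fit_regressor(X[I1], Y[I1])           # fit on the first fold only
    E = np.sort(np.abs(Y[I2] - mu_hat(X[I2])))     # scores E_i = |Y_i - mu_hat(X_i)|

    # (1 - alpha/2)(1 + 1/n2)-th empirical quantile of the scores, i.e. the
    # ceil((1 - alpha/2)(n2 + 1))-th smallest score
    n2 = len(I2)
    k = int(np.ceil((1 - alpha / 2) * (n2 + 1)))
    Q = E[min(k, n2) - 1]

    m = mu_hat(np.atleast_2d(x_new))[0]
    return m - Q, m + Q

# Demo with a (hypothetical) constant-median regressor.
def fit_regressor(X_train, Y_train):
    m = np.median(Y_train)
    return lambda Z: np.full(len(Z), m)

rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 1))
Y = X[:, 0] + rng.normal(size=1000)               # Median(Y | X = x) = x
lo, hi = algorithm1_median_interval(X, Y, np.array([0.0]), fit_regressor)
```

Any regression procedure can be substituted for `fit_regressor`; the coverage guarantee does not depend on its quality, only the interval width does.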
Proof Sketch.
The main idea revolves around bounding the number of scores E_i = |Y_i − μ̂(X_i)| that are at least |Median(Y_{n+1} | X_{n+1}) − μ̂(X_{n+1})|. If it can be shown that at least ≈ n_2 · α/2 of the E_i's are greater than or equal to |Median(Y_{n+1} | X_{n+1}) − μ̂(X_{n+1})| with probability 1 − α, then Median(Y_{n+1} | X_{n+1}) is contained in Ĉ_n(X_{n+1}) with the desired probability.

Showing this result breaks down into two parts. The first compares R_{n+1} := |Median(Y_{n+1} | X_{n+1}) − μ̂(X_{n+1})| to R_i := |Median(Y_i | X_i) − μ̂(X_i)| for i ∈ I_2. Because our data is i.i.d. and μ̂ is only based upon {(X_i, Y_i) : i ∈ I_1}, we can use the exchangeability of the data to give a strong lower bound on the probability that R_{n+1} is not an extreme value relative to the R_i's, i ∈ I_2.

The second part compares the error term E_i to R_i. Using properties of the median, it can be shown that E_i ≥ R_i with probability at least 1/2. This gives us a bound on the probability that the individual error terms that we are directly observing are greater than the error terms that we want to understand.

In sum, the first part ties R_{n+1} to the R_i's and shows that they share the same distribution; the second part shows that we can use our observed data to bound R_i with a particular probability. By combining these two parts, we can compare R_{n+1} to the observed E_i's. The α/2 term here shows up due to the second part, as we must halve the bounding quantile of the E_i's to account for the fact that only about half of the E_i's are greater than the median error terms.

Remark 2.1. Algorithm 1 works independently of how μ̂ is trained; this means that any regression function may be used, from simple linear regression to more complicated machine learning algorithms.

We return at last to the connection with predictive inference. Introduced in Papadopoulos et al. [2002] and Vovk et al. [2005] and studied in Lei et al. [2013], Barber et al. [2019a], Romano et al.
[2019a], and several other papers, the split conformal method was initially created to achieve distribution-free predictive coverage guarantees. In particular, Vovk et al. [2005] shows that Algorithm 1 satisfies (1 − α/2)-Predictive, implying that in order to capture Median(Y_{n+1} | X_{n+1}) with probability 1 − α, our algorithm produces a wider confidence interval than an algorithm trying to capture Y_{n+1} with the same probability.

Algorithm 1 is a good first step towards a usable method for conditional median inference; however, it may be too rudimentary to be used in practice. Algorithm 2 is a more general version of Algorithm 1 that yields conditional quantile coverage and better empirical performance. This provides a better understanding of how diverse parameter inference algorithms can be.
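The α/2 factor traces back to a simple fact used in the proof sketch: for any fixed prediction value a, a draw Y exceeds its median residual, i.e. |Y − a| ≥ |Median(Y) − a|, with probability at least 1/2. The following is our own illustrative numerical check of this inequality, not part of the paper's formal argument:

```python
import numpy as np

# For any prediction a and any distribution of Y with median m,
# P(|Y - a| >= |m - a|) >= 1/2: if a <= m, every draw with Y >= m
# (an event of probability at least 1/2) satisfies |Y - a| >= |m - a|;
# the case a >= m is symmetric.
rng = np.random.default_rng(0)
samples = {
    "exponential": rng.exponential(1.0, 200_000),
    "heavy-tailed t": rng.standard_t(2, 200_000),
    "uniform": rng.uniform(-3.0, 7.0, 200_000),
}
for name, y in samples.items():
    m = np.median(y)
    for a in (-2.0, 0.0, 5.0):                   # arbitrary "predictions"
        frac = np.mean(np.abs(y - a) >= np.abs(m - a))
        assert frac >= 0.5 - 0.005, (name, a, frac)
```

Note that the bound is tight: for several of the (distribution, a) pairs above, the observed fraction sits essentially at 1/2, which is exactly why the calibration quantile must be halved.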
Algorithm 2:
Confidence Interval for Quantile_q(Y_{n+1} | X_{n+1}) with Coverage 1 − α

Input:
Number of i.i.d. datapoints n ∈ N.
Split sizes n_1 + n_2 = n.
Datapoints (X_1, Y_1), . . . , (X_n, Y_n) ∼ P ⊆ (R^d, R).
Test point X_{n+1} ∼ P.
Locally nondecreasing conformity score algorithms f_lo and f_hi.
Quantile level q ∈ (0, 1).
Coverage level 1 − α ∈ (0, 1).
Split probabilities r + s = α.

Process:
Randomly split {1, . . . , n} into disjoint I_1 and I_2 with |I_1| = n_1 and |I_2| = n_2.
Fit conformity scores f_lo, f_hi : (R^d, R) → R on {(X_i, Y_i) : i ∈ I_1}.
For i ∈ I_2 set E^lo_i = f_lo(X_i, Y_i) and E^hi_i = f_hi(X_i, Y_i).
Compute Q^lo_{rq}(E), the rq(1 + 1/n_2) − 1/n_2 empirical quantile of {E^lo_i : i ∈ I_2}, and Q^hi_{1−s(1−q)}(E), the (1 − s(1 − q))(1 + 1/n_2) empirical quantile of {E^hi_i : i ∈ I_2}.

Output:
Confidence interval Ĉ_n(X_{n+1}) = { y : Q^lo_{rq}(E) ≤ f_lo(X_{n+1}, y), f_hi(X_{n+1}, y) ≤ Q^hi_{1−s(1−q)}(E) } for Quantile_q(Y_{n+1} | X_{n+1}).

Algorithm 2 differs from Algorithm 1 in two ways. First, we use the rq quantile of the lower scores to create the confidence interval's lower bound, and the 1 − s(1 − q) quantile of the upper scores (corresponding to the top s(1 − q) of the score distribution) for the upper bound. Second, the functions we fit are no longer regression functions, but instead locally nondecreasing conformity scores. These scores are described in Definition 4; see Remark 2.3 for examples.

Theorem 2.
For all distributions P, all locally nondecreasing conformity scores f_lo and f_hi, and all 0 < q < 1, the output of Algorithm 2 contains Quantile_q(Y_{n+1} | X_{n+1}) with probability at least 1 − α. That is, Algorithm 2 satisfies (1 − α, q)-Quantile.

The proof of Theorem 2 is covered in Appendix A.2 and is similar to that of Theorem 1; the main modifications come from the changes described above. Regarding the first change, the asymmetrical quantiles on the lower and upper ends of the E_i's balance the fact that datapoints have asymmetrical probabilities of being on either side of the conditional quantile. Regarding the second change, because the conformity scores E_i still preserve relative ordering, they do not affect the relationship between datapoints and the conditional quantile.

Remark 2.2. One possible choice for r and s is r = s = α/2. This is motivated by the logic that r and s decide the probabilities of failure on the lower bound and the upper bound, respectively; if we want the bound to be equally accurate on both ends, it makes sense to set r and s equal. Another choice is r = (1 − q)α and s = qα; this results in the quantiles for Q^lo_{rq} and Q^hi_{1−s(1−q)} being approximately equal, with the algorithm taking the q(1 − q)α quantile of the scores on both the lower and upper ends.

Remark 2.3. The versatility of the conformity scores f_lo and f_hi is what differentiates Algorithm 2 from Algorithm 1 and makes it a viable option for conditional quantile inference. Below are a few examples of possible scores and the style of intervals they produce.

- f_lo(X_i, Y_i) = f_hi(X_i, Y_i) = Y_i − μ̂(X_i), where μ̂ : R^d → R is trained on {(X_i, Y_i) : i ∈ I_1} as a central tendency estimator. This is the conformity score used in Algorithm 1 and Vovk et al. [2005], resulting in a confidence interval of the form [μ̂(X_{n+1}) + c_lo, μ̂(X_{n+1}) + c_hi] for some c_lo, c_hi ∈ R.
This score is best when the conditional distribution Y | X = x is similar for all x and either the mean or median can be estimated with reasonable accuracy. Note that if the conditional distribution of Y − E[Y | X] is independent of X, Algorithm 2 will output the same confidence interval for μ̂(x) = E[Y | X = x] and μ̂(x) = Median(Y | X = x).

- f_lo(X_i, Y_i) = f_hi(X_i, Y_i) = (Y_i − μ̂(X_i)) / σ̂(X_i), where μ̂ : R^d → R and σ̂ : R^d → R_+ are trained on {(X_i, Y_i) : i ∈ I_1} as a central tendency estimator and conditional absolute deviation estimator, respectively. This score results in a confidence interval of the form [μ̂(X_{n+1}) + c_lo σ̂(X_{n+1}), μ̂(X_{n+1}) + c_hi σ̂(X_{n+1})] for some c_lo, c_hi ∈ R. Unlike the previous example, this score no longer results in a fixed-length confidence interval; it is best used when there is high heteroskedasticity in the underlying distribution. This is the conformity score used to create adaptive predictive intervals in Lei et al. [2018]. Note that a normalization constant γ > 0 can be added to the denominator σ̂(X_i) to create stable confidence intervals.

- f_lo(X_i, Y_i) = Y_i − Q̂_lo(X_i) and f_hi(X_i, Y_i) = Y_i − Q̂_hi(X_i), where Q̂_lo, Q̂_hi : R^d → R are trained on {(X_i, Y_i) : i ∈ I_1} to estimate the rq quantile and the 1 − s(1 − q) quantile of the conditional distribution, respectively. This choice results in a confidence interval of the form [Q̂_lo(X_{n+1}) + c_lo, Q̂_hi(X_{n+1}) + c_hi] for some c_lo, c_hi ∈ R. These scores are best when one can estimate the conditional quantiles reasonably well and the conditional distribution Y | X = x is heteroskedastic. Note that if Q̂_lo and Q̂_hi are trained well, then the resulting confidence interval will be approximately [Q̂_lo(X_{n+1}), Q̂_hi(X_{n+1})]. These are the scores used to create the predictive intervals seen in Romano et al.
[2019b] and Sesia and Candès [2020].

- f_lo(X_i, Y_i) = f_hi(X_i, Y_i) = F̂_{Y|X=X_i}(Y_i), where F̂_{Y|X=x} : R^d × R → [0, 1] is trained on {(X_i, Y_i) : i ∈ I_1} to be the estimated cumulative distribution function of the conditional distribution Y | X. Using this score will result in a confidence interval [F̂^{-1}_{Y|X=X_{n+1}}(c_lo), F̂^{-1}_{Y|X=X_{n+1}}(c_hi)] for some c_lo, c_hi ∈ [0, 1], similar to the predictive intervals in Chernozhukov et al. [2019] and Kivaranovic et al. [2019]. This can be a good approach when the conditional distribution Y | X is particularly complex.

- f_lo(X_i, Y_i) = f_hi(X_i, Y_i) = log Y_i − μ̂(X_i), where μ̂ : R^d → R is trained on {(X_i, Y_i) : i ∈ I_1} as a log central tendency estimator. This results in a confidence interval of the form [c_lo exp(μ̂(X_{n+1})), c_hi exp(μ̂(X_{n+1}))] for some c_lo, c_hi ∈ R_+. This score works well when Y is known to be positive and one wants to minimize the approximation ratio; it is equivalent to taking a log transformation of the data.

In general, a good choice for f_lo(X_i, Y_i) and f_hi(X_i, Y_i) depends on one's underlying belief about the distribution as well as on the sample size n, though some scores perform better in practice. Sesia and Candès [2020] contains more information on the effect of the conformity score on the size of predictive intervals; we also simulate the impact of different scores on conditional quantile intervals in Section 4.

Now that we have seen that Algorithms 1 and 2 achieve coverage, an important question to ask is whether or not the terms for the error quantile can be improved. Do our methods consistently overcover the conditional median, and if so, is it possible to take a lower quantile of the error terms and still have Theorems 1 and 2 hold? In this section, we prove that this is impossible by going over a particular distribution P_δ for which the 1 − α/2 term in Algorithm 1 is necessary.
Additionally, we go over a choice μ̂_c with the property that Algorithm 1 always results in 1 − α/2 coverage when run with input μ̂_c; this implies that there does not exist a distribution with the property that Algorithm 1 will always provide a sharp confidence interval for the conditional median regardless of the regression algorithm.

For each δ > 0, consider (X, Y) ∼ P_δ over R × R, where P_δ has X ∼ Unif[−0.5, 0.5] and Y | X =_d X · B; here B ∈ {0, 1} is an independent Bernoulli variable with P(B = 1) = 0.5 + δ. That is, 0.5 + δ of the distribution is on the line segment Y = X from (−0.5, −0.5) to (0.5, 0.5), and 0.5 − δ of the distribution is on the line segment Y = 0 from (−0.5, 0) to (0.5, 0). Thus, Median(Y | X = x) = x. A visualization of P_δ is shown in Figure 1.
Figure 1: A distribution for which it is difficult to estimate the conditional median.

We know that Algorithm 1 is accurate for all distributions P and all algorithms μ̂. Consider the regression algorithm μ̂ : R → R such that μ̂(x) = 0 for all x ∈ R; in other words, μ̂ predicts Y_i = 0 for all X_i. We show that Algorithm 1 returns a coverage almost exactly equal to 1 − α.

Theorem 3.
For all ε > 0, there exist N and δ > 0 such that if we sample n > N datapoints from the distribution P_δ and use Algorithm 1 with μ̂ = 0 as defined above and n_1 = n_2 = n/2 to get a confidence interval for the conditional median,

P{ Median(Y_{n+1} | X_{n+1}) ∈ Ĉ_n(X_{n+1}) } ≤ 1 − α + ε.

The proof is in Appendix A.3. Theorem 3 does not directly prove that the 1 − α/2 term in Algorithm 1 is sharp. However, we can see that if Algorithm 1 used the 1 − α′/2 quantile of the residuals with α′ > α, then by Theorem 3 there would exist a choice for δ and n where the probability of conditional median coverage would be less than 1 − α. Therefore, the 1 − α/2 term is required for the probability of coverage to always be at least 1 − α.

Remark 2.4. It is possible to generalize Theorem 3 to Algorithm 2 as well; we can change P_δ to have Y | X ∼ X · B with B ∼ Bernoulli(q + 1{X ≥ 0}(1 − 2q) + δ). This results in Quantile_q(Y | X = x) = x for all x ∈ [−0.5, 0.5]. Then, if we consider the conformity scores f_lo(X_i, Y_i) = f_hi(X_i, Y_i) = Y_i, it can be shown for large n and small δ that Algorithm 2 returns a confidence interval that has a conditional quantile coverage of at most 1 − α + ε, meaning that the rq and 1 − s(1 − q) terms in the quantiles for the error scores are sharp.

These results may seem somewhat pedantic because we are restricting μ̂(x) to be the zero function and f_lo(X_i, Y_i) and f_hi(X_i, Y_i) to be Y_i; this simplification is done to better illustrate our point. Even when f_lo(X_i, Y_i) and f_hi(X_i, Y_i) are trained using more complicated approaches, there still exist distributions that result in only 1 − α coverage for Algorithm 2. For an example of a distribution where Algorithm 2 only achieves 1 − α coverage for standard conformity scores f_lo(X_i, Y_i) and f_hi(X_i, Y_i), refer to P_3 in Section 4.
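Theorem 3's regime is easy to observe numerically. The following simulation is our own illustration (it assumes the Bernoulli parameter P(B = 1) = 0.5 + δ in the definition of P_δ and uses the zero regressor μ̂ = 0): the realized coverage of the conditional median sits near 1 − α, not the nominal calibration level 1 − α/2.

```python
import numpy as np

rng = np.random.default_rng(0)
alpha, delta = 0.1, 0.01

def sample_P_delta(n):
    # X ~ Unif[-0.5, 0.5]; Y = X * B with B ~ Bernoulli(0.5 + delta),
    # so Median(Y | X = x) = x.
    X = rng.uniform(-0.5, 0.5, n)
    B = rng.random(n) < 0.5 + delta
    return X, X * B

# Calibration fold with mu_hat == 0, so the scores are E_i = |Y_i|.
n2 = 50_000
_, Y_cal = sample_P_delta(n2)
E = np.sort(np.abs(Y_cal))
k = int(np.ceil((1 - alpha / 2) * (n2 + 1)))
Q = E[min(k, n2) - 1]

# Median(Y | X = x) = x lies in [-Q, Q] exactly when |x| <= Q.
X_test, _ = sample_P_delta(200_000)
coverage = np.mean(np.abs(X_test) <= Q)
# close to 1 - alpha = 0.9, well below 1 - alpha/2 = 0.95
assert 0.88 < coverage < 0.93
```

Intuitively, roughly half the calibration scores are exact zeros (the Y = 0 segment), so the 1 − α/2 score quantile translates into only 1 − α coverage of the median line.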
The existence of P_δ and similarly 'confusing' distributions helps to show why capturing the conditional median can be tricky in a distribution-free setting.

At the same time, there exist conformity scores for which Algorithms 1 and 2 have rates of coverage that are always near 1 − α/2. For c > 0, define the randomized regression function μ̂_c as follows: set M = max_{i ∈ I_1} |Y_i| and μ̂_c(x) = A_x for all x ∈ R^d, where A_x i.i.d. ∼ N(0, (cM)²). We prove the following:

Theorem 4.
For all ε > 0, there exist c and N such that for all n > N, there is a split n_1 + n_2 = n such that when Algorithm 1 is run using the regression function μ̂_c on n datapoints with I_1 of size n_1 and I_2 of size n_2, the resulting interval will be finite and will satisfy

P{ Median(Y_{n+1} | X_{n+1}) ∈ Ĉ_n(X_{n+1}) } ≥ 1 − α/2 − ε

for any distribution P. The proof for this theorem is in Appendix A.4.
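The mechanism behind Theorem 4 can likewise be illustrated by simulation (our own sketch, with hypothetical parameter choices c = 100 and a Unif[0, 1] response): once the randomized predictions A_x ∼ N(0, (cM)²) dwarf the data, both the calibration scores and the test-time residual are dominated by the same Gaussian noise, so coverage concentrates near 1 − α/2 while the interval width dwarfs the data range.

```python
import numpy as np

rng = np.random.default_rng(0)
alpha, c, n2 = 0.1, 100.0, 20_000

# Data with bounded response; Median(Y | X) = 0.5 regardless of X.
Y_cal = rng.uniform(0.0, 1.0, n2)
M = np.max(np.abs(Y_cal))

# mu_hat_c assigns each point an independent N(0, (cM)^2) prediction.
A_cal = rng.normal(0.0, c * M, n2)
E = np.sort(np.abs(Y_cal - A_cal))
k = int(np.ceil((1 - alpha / 2) * (n2 + 1)))
Q = E[min(k, n2) - 1]

# Coverage of Median(Y | X) = 0.5 under a fresh randomized prediction:
A_new = rng.normal(0.0, c * M, 200_000)
coverage = np.mean(np.abs(0.5 - A_new) <= Q)
assert coverage > 1 - alpha / 2 - 0.01   # near 0.95
assert Q > 50                            # interval width dwarfs the data range
```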
Remark 2.5. Theorem 4 can be extended to Algorithm 2 by taking f_lo(X_i, Y_i) = f_hi(X_i, Y_i) = Y_i − μ̂_c(X_i). The corresponding result shows that given a large enough number of datapoints and a particular data split, there exist conformity scores that result in Algorithm 2 capturing the conditional quantile nontrivially with probability at least 1 − rq − s(1 − q) for all distributions P.

Due to the definition of μ̂_c, the resulting confidence intervals will be near-useless; the predictions will be so far off that the intervals will have width several times the range of the (slightly clipped) marginal distribution P_Y. However, they will still be finite, and will still achieve predictive inference at a rate roughly equal to 1 − α/2. The existence of scores that always result in higher-than-needed rates of coverage means that any result like Theorem 3 that provides a nontrivial upper bound for either Algorithm 1's or Algorithm 2's coverage of the conditional median on a specific distribution will have to restrict the class of regression functions and/or conformity scores.

Up until this point, we have looked at the existence and accuracy of algorithms for estimating the conditional median. This section focuses on the limitations of all such algorithms by proving a strong lower bound on the size of any confidence interval.
Theorem 5.
Let Ĉ_n be any algorithm that satisfies (1 − α)-Median. Then, for any nonatomic distribution P on R^d × R, we have that

P{ Y_{n+1} ∈ Ĉ_n(X_{n+1}) } ≥ 1 − α.

That is, Ĉ_n satisfies (1 − α)-Predictive for all nonatomic distributions P.

Proof. The proof uses the same approach as the proof of Theorem 1 in Barber [2020]. Consider an arbitrary Ĉ_n that satisfies (1 − α)-Median, and let P be any distribution over R^d × R for which P_X is nonatomic. Pick some M ≥ n + 1, and sample L = {(X̃_j, Ỹ_j) : 1 ≤ j ≤ M} i.i.d. from P. We define two different ways of sampling our data from L.

Fix L and pick (X_1, Y_1), . . . , (X_{n+1}, Y_{n+1}) without replacement from L. Call this method of sampling Q_1. It is clear that after marginalizing over L, the (X_i, Y_i)'s are effectively drawn i.i.d. from P; thus, we have that

P_P{ Y_{n+1} ∈ Ĉ_n(X_{n+1}) } = E_L[ P_{Q_1}{ Y_{n+1} ∈ Ĉ_n(X_{n+1}) | L } ].

Now, pick (X_1, Y_1), . . . , (X_{n+1}, Y_{n+1}) with replacement from L, and call this method of sampling Q_2. Note that because P_X is nonatomic, the X̃_j's are distinct with probability 1, which means that Median_{Q_2}(Y | X = X̃_j) = Ỹ_j. Then, as Ĉ_n applies to all distributions, it applies to our point distribution over L; thus, we have that for all L,

P_{Q_2}{ Y_{n+1} ∈ Ĉ_n(X_{n+1}) | L } = P_{Q_2}{ Median_{Q_2}(Y | X = X_{n+1}) ∈ Ĉ_n(X_{n+1}) | L } ≥ 1 − α.

Now note that under Q_2, for 1 ≤ a < b ≤ n + 1, the probability of (X_a, Y_a) and (X_b, Y_b) being equal is 1/M; thus, by the union bound, the probability of the event {(X_a, Y_a) = (X_b, Y_b) for any a ≠ b} is bounded by n²/M. Therefore, for any fixed L, the total variation distance between the distributions Q_1 and Q_2 is at most n²/M, implying that

P_{Q_1}{ Y_{n+1} ∈ Ĉ_n(X_{n+1}) | L } ≥ P_{Q_2}{ Y_{n+1} ∈ Ĉ_n(X_{n+1}) | L } − n²/M ≥ 1 − α − n²/M,

which means that

P_P{ Y_{n+1} ∈ Ĉ_n(X_{n+1}) } = E_L[ P_{Q_1}{ Y_{n+1} ∈ Ĉ_n(X_{n+1}) | L } ] ≥ 1 − α − n²/M.
Taking the limit as M goes to infinity gives the result.

Remark 3.1. Theorem 5 also applies to all algorithms that satisfy (1 − α, q)-Quantile; for our uniform distribution over L, Quantile_q(Y | X = X̃_j) = Ỹ_j, so the proof translates exactly. As a result, this means that all algorithms that satisfy (1 − α, q)-Quantile also satisfy (1 − α)-Predictive for all nonatomic distributions P.

Remark 3.2. The approach taken in the proof of Theorem 5 is similar to those used to show the limits of distribution-free inference in other settings. As mentioned earlier, Barber [2020] shows that in the setting of distributions P over R^d × {0, 1}, any confidence interval Ĉ_n(X_{n+1}) for E[Y | X = X_{n+1}] with coverage 1 − α must contain Y_{n+1} with probability 1 − α for all nonatomic distributions P, and goes on to provide a lower bound for the length of the confidence interval. Additionally, Barber et al. [2019a] proves a similar theorem about predictive algorithms Ĉ_n(X_{n+1}) for Y_{n+1} that are required to have a weak form of conditional coverage. The proof of the result from Barber [2020] involves the same idea of marginalizing over a large finite sampled subset L in order to apply Ĉ_n to the distribution over L; the proof of the result from Barber et al. [2019a] focuses on sampling a large number of datapoints conditioned on whether or not they belong to a specific subset B ⊆ R^d × R. In both cases, studying two sampling distributions and measuring the total variation distance between them was crucial. Thus, it seems that this strategy may have further use in the future when studying confidence intervals for other parameters or data in a distribution-free setting.

Theorem 5 tells us that conditional median inference is at least as imprecise as predictive inference.
As a result, because all predictive intervals have nonvanishing widths (assuming nonzero conditional variance) no matter the sample size n, it is not possible to write down conditional median algorithms with widths converging to 0. Thus, it may be better to study different distribution parameters if we are looking for better empirical performance. For discussion on additional distribution parameters that may be worth studying, refer to Section 5.2.

Lastly, we know that Algorithm 1 captures Y_{n+1} with probability 1 − α/2. Therefore, there may exist a better conditional median algorithm that only captures Y_{n+1} with probability 1 − α. Based on our result from Section 2.3, any such algorithm will likely follow a format different from the split conformal approach. Studying this problem in more detail, particularly on difficult distributions P, might lead to more accurate conditional median algorithms.

In this section, we analyze the impact of different conformity scores on the outcome of Algorithm 2. Specifically, we look at the four following conformity scores:

Score 1: f_lo(X_i, Y_i) = f_hi(X_i, Y_i) = Y_i − μ̂(X_i). We train μ̂ to predict the conditional mean using quantile regression forests on the dataset {(X_i, Y_i) : i ∈ I_1}.

Score 2: f_lo(X_i, Y_i) = f_hi(X_i, Y_i) = (Y_i − μ̂(X_i)) / σ̂(X_i). We train μ̂ and σ̂ jointly using random forests on the dataset {(X_i, Y_i) : i ∈ I_1}.

Score 3: f_lo(X_i, Y_i) = Y_i − Q̂_lo(X_i) and f_hi(X_i, Y_i) = Y_i − Q̂_hi(X_i). We train Q̂_lo and Q̂_hi to predict the conditional α/2 quantile and 1 − α/2 quantile, respectively, using quantile regression forests on the dataset {(X_i, Y_i) : i ∈ I_1}.

Score 4: f_lo(X_i, Y_i) = f_hi(X_i, Y_i) = F̂_{Y|X=X_i}(Y_i). We create F̂_{Y|X=X_i}(Y_i) by using quantile regression forests trained on {(X_i, Y_i) : i ∈ I_1} to estimate the conditional q-th quantile for q ∈ {0, 0.01, . . . , 0.99
, } and use linear interpolation between quantiles to estimate the conditional CDF. Our training method ensuresthat quantile predictions will never cross.In order to test the conditional median coverage rate, we must look at distributions for which the conditionalmedian is known and, therefore, focus on simulated datasets. We consider the performance of Algorithm 2 onthese three distributions:Distribution 1: We draw ( X, Y ) ∼ P from R d × R , where d = 10 . Here, X = ( X , . . . , X d ) is an equicorrelatedmultivariate Gaussian vector with mean zero and Var ( X i ) = 1 , Cov ( X i , X j ) = 0 . for i (cid:54) = j . We set Y = ( X + X ) − X + σ ( X ) (cid:15) , where (cid:15) ∼ N (0 , is independent of X and σ ( x ) = 0 . . (cid:107) x (cid:107) forall x ∈ R d .Distribution 2: We draw ( X, Y ) ∼ P from R × R . Draw X ∼ Unif [ − π, π ] and Y = U / f ( X ) , where U ∼ Unif [0 , is independent of X and f ( x ) = 1 + | x | sin ( x ) for all x ∈ R .Distribution 3: We draw ( X, Y ) ∼ P from R × R . Draw X ∼ Unif [ − , and set Y = B · f ( X ) , where B ∼ Bernoulli (0 . δ ) is independent of X and f ( x ) = γ { M x } − γ − ( − (cid:98) Mx (cid:99) (1 − γ ) for all x ∈ R .Note that { r } is the fractional part of r . We set δ = 0 . and M = 1 /γ = 25 .Distributions 2 and 3 are shown in Figure 2.
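Distribution 2 above can be sketched in code. This is a hedged sketch: the exponent on U and the exact form of f were partially lost in extraction, so we assume U ∼ Unif[0, 1] and take f(x) = 1 + |x| sin²(x) (which keeps f ≥ 1) purely for illustration.

```python
import numpy as np

def f(x):
    # Assumed form of the oscillating scale function (sin squared keeps f >= 1).
    return 1.0 + np.abs(x) * np.sin(x) ** 2

def sample_distribution_2(n, rng):
    # X uniform on [-pi, pi]; Y = U / f(X) with U ~ Unif[0, 1] independent of X.
    x = rng.uniform(-np.pi, np.pi, size=n)
    u = rng.uniform(0.0, 1.0, size=n)
    return x, u / f(x)

def conditional_median(x):
    # Median(U) = 1/2, so Median(Y | X = x) = 0.5 / f(x).
    return 0.5 / f(x)
```

Since f ≥ 1, every response lies in [0, 1], and the conditional median is known in closed form, which is exactly what the coverage experiments require.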
Figure 2: Plots of datapoints from the two distributions P_2 and P_3, overlaid with their conditional medians. The left panel is a case where the conditional distribution Y | X has high heteroskedasticity. The right panel is a case where it is nearly impossible to tell the location of the conditional median.

For each distribution, we run Algorithm 2 using each conformity score to get a confidence interval for the conditional median. In each trial, we split the data as n_1 = n_2 = n/2 and take r = s = α/2, then test coverage on held-out datapoints. The average coverage rate, average interval width, and other statistics for each distribution and conformity score are shown in Figure 3. The confidence intervals from a single trial on Distribution 2 are displayed in Figure 4.

[Figure 3 table: the numeric entries (AC, SDAC, MCC, AW, and SDAW for each distribution–score pair) could not be recovered from the source.]

Figure 3: For each distribution and conformity score, we calculate: average coverage (AC), an estimate of P{Median(Y_{n+1} | X_{n+1}) ∈ Ĉ_n(X_{n+1})}; standard deviation of average coverage (SDAC), an estimate of Var(P{Median(Y_{n+1} | X_{n+1}) ∈ Ĉ_n(X_{n+1}) | D})^{1/2}, where D = {(X_1, Y_1), . . . , (X_n, Y_n)}; minimum conditional coverage (MCC), an estimate of min_x P{Median(Y | X = x) ∈ Ĉ_n(x)}; average width (AW), an estimate of E[len(Ĉ_n(X_{n+1}))]; and standard deviation of average width (SDAW), an estimate of Var(E[len(Ĉ_n(X_{n+1})) | D])^{1/2}. Estimates are averaged over all trials, with the same target level 1 − α throughout.

(a) f_lo(X_i, Y_i) = f_hi(X_i, Y_i) = Y_i − μ̂(X_i).
(b) f_lo(X_i, Y_i) = f_hi(X_i, Y_i) = (Y_i − μ̂(X_i))/σ̂(X_i).

(c) f_lo(X_i, Y_i) = Y_i − Q̂_lo(X_i); f_hi(X_i, Y_i) = Y_i − Q̂_hi(X_i).

(d) f_lo(X_i, Y_i) = f_hi(X_i, Y_i) = F̂_{Y|X=X_i}(Y_i).
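The interval construction behind these panels, as analyzed in Appendix A.2, can be sketched as follows. This is a hedged illustration specialized to Score 1 (f_lo = f_hi = y − μ̂(x)) and the median (q = 1/2), with a placeholder constant fit standing in for the regression step; the function names are ours, not the paper's.

```python
import numpy as np

def algorithm2_interval(y_fit, y_cal, x_test, alpha, q=0.5):
    r = s = alpha / 2                       # split the error budget, r + s = alpha
    mu_hat = lambda x: np.mean(y_fit)       # placeholder constant "fit"
    e = y_cal - mu_hat(x_test)              # Score 1 conformity scores on the ranking half
    n2 = len(e)
    # Cutoffs from Appendix A.2: the (rq - (1 - rq)/n2) quantile from below
    # and the (1 - s(1 - q))(1 + 1/n2) quantile from above.
    lo_level = max(r * q - (1 - r * q) / n2, 0.0)
    hi_level = min((1 - s * (1 - q)) * (1 + 1 / n2), 1.0)
    q_lo = np.quantile(e, lo_level, method="lower")
    q_hi = np.quantile(e, hi_level, method="higher")
    return mu_hat(x_test) + q_lo, mu_hat(x_test) + q_hi
```

Swapping in a genuine regression fit and the other scores' f_lo/f_hi recovers the four variants compared above.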
Figure 4: Confidence intervals (pink regions) from one trial for each conformity score on Distribution 2. Note that all scores result in coverage well over 1 − α.

Looking first at rates of coverage, we see that all scores have coverage much greater than 1 − α on Distributions 1 and 2. However, Distribution 3 is a case where all scores have near-identical rates of coverage just above 1 − α. Further investigation into the confidence intervals produced for Distribution 3 suggests that the algorithms often fail to capture the conditional median when its absolute value sits almost exactly at one particular level.

The minimum conditional coverage on Distributions 1 and 3 is near 0 for each conformity score. Interestingly, the scores with the worst minimum conditional coverage on Distribution 1 have the relatively best minimum conditional coverage on Distribution 3, and vice versa. Scores 3 and 4 have a minimum conditional coverage greater than 1 − α on Distribution 2, implying that these scores achieve pointwise conditional coverage there.

Regarding interval width, Score 1 performs significantly worse on Distributions 1 and 2 than all other scores, while the three other scores have roughly equal average widths. On Distribution 3, Scores 3 and 4 produce significantly narrower intervals than Scores 1 and 2.

Overall, we see that Score 1 is significantly worse than the other scores on distributions with a wide range of conditional variance; Scores 3 and 4 behave very similarly on all distributions and perform slightly better than Score 2 on Distributions 2 and 3.

This paper introduced two algorithms for capturing the conditional median of a distribution within the distribution-free setting, as well as a particular distribution where the performance of one of these algorithms is sharp.
Our lower bounds prove that in the distribution-free setting, conditional median inference is fundamentally as difficult as prediction itself, thereby setting a concrete limit on how well any median inference algorithm can ever perform.
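The nonvanishing-width phenomenon behind this limit is easy to see numerically. The simulation below is a toy setup of our own (Y | X ∼ N(0, 1), μ̂ ≡ 0, absolute-residual score), not the paper's hard construction: the width of the split conformal median interval, 2·Q_{1−α/2}(|Y − μ̂(X)|), stabilizes near 2·z_{0.975} ≈ 3.9 rather than shrinking as n grows.

```python
import numpy as np

rng = np.random.default_rng(0)
alpha = 0.1
widths = []
for n in (100, 1000, 10000):
    # Conformity scores |Y - mu_hat(X)| with mu_hat = 0 and Y ~ N(0, 1).
    residuals = np.abs(rng.standard_normal(n))
    # Quantile level (1 - alpha/2)(1 + 1/n), capped at 1.
    level = min((1 - alpha / 2) * (1 + 1 / n), 1.0)
    widths.append(2 * np.quantile(residuals, level, method="higher"))
```

All three widths land near 4 regardless of n: more data sharpens the quantile estimate but cannot shrink the interval below the spread of Y | X.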
A few observations may prove useful. For one, distributions such as P_δ from Section 2.3 and Distribution 3 from Section 4 will likely show up again. Because each distribution is a mixture of two disjoint distributions with roughly equal weights, it is hard to identify which half contains the median, and similar distributions are likely to appear as the performance-limiting cases for distribution-free parameter inference. Further, the proof technique of sampling a large finite number of datapoints and then marginalizing (Section 3) is similar to those in Barber [2020] and Barber et al. [2019a], pointing to possible future use. Lastly, our results and those of Barber [2020] indicate that the values of conditional parameters cannot be known with higher accuracy than the values of future samples.

We hope that this paper motivates further work on conditional parameter inference. We see three immediate potential avenues:

• One direction is to get tighter bands by imposing mild shape constraints on the conditional median function. For instance, if we know that µ(x) := Median(Y | X = x) is convex, then the results from Section 3 no longer apply. Similarly, assuming that µ(x) is decreasing in x or Lipschitz would yield intervals with vanishing widths in the limit of large samples. For instance, when predicting economic damages caused by tornadoes using wind speed as a covariate, one may assume that the median damage is nondecreasing in wind speed.

• Another direction is to extend our methods to study other conditional parameters. For example, the conditional interquartile range can be studied using similar ideas, as can other robust measures of scale. Similarly, the truncated mean, smoothed median, and other measures of central tendency may be amenable to model-free inference and analysis.

• A third subject of study is creating full conformal inference methods based on our split conformal algorithms.
Unlike split conformal inference, the full conformal method does not rely on splitting the dataset into a fitting half and a ranking half; instead, it calculates the conformity of a potential datapoint (X_{n+1}, y) to the full dataset D and includes y in its confidence region only if (X_{n+1}, y) is similar enough to the observed datapoints. The study of full conformal inference has grown alongside that of split conformal inference; the method can be seen in Vovk et al. [2005], Shafer and Vovk [2008], and Lei et al. [2018]. Standard full conformal algorithms do not guarantee coverage of the conditional median; however, there may exist modifications, similar to locally nondecreasing conformity scores, that result in a full conformal algorithm that captures the conditional median.

Acknowledgements
D. M. thanks Stanford University for supporting this research as part of the Masters program in Statistics. E. C. was supported by Office of Naval Research grant N00014-20-12157, by National Science Foundation grant OAC 1934578, and by the Army Research Office (ARO) under grant W911NF-17-1-0304. We thank Lihua Lei for advice on different approaches and recommended resources on this topic, as well as the Stanford Statistics department for listening to a preliminary version of this work.

References
Raghu R. Bahadur and Leonard J. Savage. The nonexistence of certain statistical procedures in nonparametric problems. The Annals of Mathematical Statistics, 27(4):1115–1122, 1956.

Rina Foygel Barber. Is distribution-free inference possible for binary regression? arXiv preprint, 2020.

Rina Foygel Barber, Emmanuel J. Candès, Aaditya Ramdas, and Ryan J. Tibshirani. The limits of distribution-free conditional predictive inference. arXiv preprint arXiv:1903.04684, 2019a.

Rina Foygel Barber, Emmanuel J. Candès, Aaditya Ramdas, and Ryan J. Tibshirani. Predictive inference with the jackknife+. arXiv preprint arXiv:1905.02928, 2019b.

Wenyu Chen, Kelli-Jean Chun, and Rina Foygel Barber. Discretized conformal prediction for efficient distribution-free inference. Stat, 7(1):e173, 2018.

Victor Chernozhukov, Kaspar Wüthrich, and Yinchu Zhu. Distributional conformal prediction. arXiv preprint arXiv:1909.07889, 2019.

Gerda Claeskens and Ingrid Van Keilegom. Bootstrap confidence bands for regression curves and their derivatives. The Annals of Statistics, 31(6):1852–1884, 2003.

Isidro Cortés-Ciriano and Andreas Bender. Concepts and applications of conformal prediction in computational drug discovery. arXiv preprint arXiv:1908.03569, 2019.

Francisco Cribari-Neto and Maria da Glória A. Lima. Heteroskedasticity-consistent interval estimators. Journal of Statistical Computation and Simulation, 79(6):787–803, 2009.

Leying Guan. Conformal prediction with localization. arXiv preprint arXiv:1908.08558, 2019.

Ulf Johansson, Henrik Boström, and Tuve Löfström. Conformal prediction using decision trees. In , pages 330–339. IEEE, 2013.

Danijel Kivaranovic, Kory D. Johnson, and Hannes Leeb. Adaptive, distribution-free prediction intervals for deep neural networks. arXiv preprint arXiv:1905.10634, 2019.

Jing Lei, James Robins, and Larry Wasserman. Distribution-free prediction sets. Journal of the American Statistical Association, 108(501):278–287, 2013.

Jing Lei, Max G'Sell, Alessandro Rinaldo, Ryan J. Tibshirani, and Larry Wasserman. Distribution-free predictive inference for regression. Journal of the American Statistical Association, 113(523):1094–1111, 2018.

Lihua Lei and Emmanuel J. Candès. Conformal inference of counterfactuals and individual treatment effects. arXiv preprint arXiv:2006.06138, 2020.

Philip J. McCarthy. Stratified sampling and distribution-free confidence intervals for a median. Journal of the American Statistical Association, 60(311):772–783, 1965.

Gottfried E. Noether. Distribution-free confidence intervals. The American Statistician, 26(1):39–41, 1972.

Harris Papadopoulos, Kostas Proedrou, Volodya Vovk, and Alex Gammerman. Inductive confidence machines for regression. In European Conference on Machine Learning, pages 345–356. Springer, 2002.

Harris Papadopoulos, Vladimir Vovk, and Alexander Gammerman. Regression conformal prediction with nearest neighbours. Journal of Artificial Intelligence Research, 40:815–840, 2011.

Yaniv Romano, Rina Foygel Barber, Chiara Sabatti, and Emmanuel J. Candès. With malice towards none: Assessing uncertainty via equalized coverage. arXiv preprint arXiv:1908.05428, 2019a.

Yaniv Romano, Evan Patterson, and Emmanuel Candès. Conformalized quantile regression. In Advances in Neural Information Processing Systems, pages 3543–3553, 2019b.

Matteo Sesia and Emmanuel J. Candès. A comparison of some conformal quantile regression methods. Stat, 9(1):e261, 2020.

Glenn Shafer and Vladimir Vovk. A tutorial on conformal prediction. Journal of Machine Learning Research, 9(Mar):371–421, 2008.

Ryan J. Tibshirani, Rina Foygel Barber, Emmanuel Candès, and Aaditya Ramdas. Conformal prediction under covariate shift. In Advances in Neural Information Processing Systems, pages 2530–2540, 2019.

Vladimir Vovk, Alex Gammerman, and Glenn Shafer. Algorithmic Learning in a Random World. Springer Science & Business Media, 2005.

Vladimir Vovk, Ilia Nouretdinov, and Alex Gammerman. On-line predictive linear regression. The Annals of Statistics, 37(3):1566–1590, 2009.
A Theorem Proofs
A.1 Proof of Theorem 1
The proof of the theorem relies on two lemmas: the first establishes a connection between Median(Y_{n+1} | X_{n+1}) and the Median(Y | X = X_i)'s using exchangeability, and the second gives a relationship between the Median(Y | X = X_i)'s and the Y_i's.

We begin with some notation. For all x ∈ R^d in the support of P, set m(x) = Median(Y | X = x). For 1 ≤ i ≤ n + 1, let R(X_i) = |m(X_i) − μ̂(X_i)|, and for i ∈ I let E(X_i) = |Y_i − μ̂(X_i)|. Finally, put M_R = |{i ∈ I : R(X_i) ≥ R(X_{n+1})}|, the number of i ∈ I for which R(X_i) ≥ R(X_{n+1}), and M_E = |{i ∈ I : E(X_i) ≥ R(X_{n+1})}|, the number of i ∈ I for which E(X_i) ≥ R(X_{n+1}).

Note that Median(Y_{n+1} | X_{n+1}) ∈ Ĉ_n(X_{n+1}) = [μ̂(X_{n+1}) − Q_{1−α/2}(E), μ̂(X_{n+1}) + Q_{1−α/2}(E)] if and only if |Median(Y_{n+1} | X_{n+1}) − μ̂(X_{n+1})| = R(X_{n+1}) is at most Q_{1−α/2}(E). Thus, we must study the value of R(X_{n+1}) in relation to the elements of {E(X_i) : i ∈ I}.

The first lemma relates R(X_{n+1}) to the other R(X_i)'s.

Lemma A.1.
For all 0 ≤ m ≤ n, P{M_R ≥ m} ≥ 1 − m/(n + 1).

Proof.
Our statement follows from the fact that our samples are i.i.d. Because μ̂ is independent of (X_i, Y_i) for i ∈ I and independent of X_{n+1}, the values {R(X_i) : i ∈ I} ∪ {R(X_{n+1})} are i.i.d. as well. Then, the probability that fewer than m of the values in {R(X_i) : i ∈ I} are at least R(X_{n+1}) is bounded above by m/(|I| + 1), as the ordering of these values is uniformly random. Taking the complement of both sides gives the result.

The second lemma gives a direct relationship between E(X_i) and R(X_i). (Note that the events {E(X_i) ≥ R(X_i)} below are mutually independent.)

Lemma A.2.
For all i ∈ I, P{E(X_i) ≥ R(X_i)} ≥ 1/2.

Proof.
Conditioning on X_i, we see that P{Y_i ≥ m(X_i)} ≥ 1/2 and P{Y_i ≤ m(X_i)} ≥ 1/2 by the definition of the conditional median. Furthermore, each event {Y_i ≥ m(X_i)} and {Y_i ≤ m(X_i)} is independent of both μ̂ and X_i, as μ̂ is fit on the other half of the data and the datapoints are i.i.d. Then, if m(X_i) ≥ μ̂(X_i), with probability at least 1/2 we have m(X_i) ≤ Y_i, in which case |m(X_i) − μ̂(X_i)| = m(X_i) − μ̂(X_i) ≤ Y_i − μ̂(X_i) = |Y_i − μ̂(X_i)|. Similarly, if m(X_i) < μ̂(X_i), with probability at least 1/2 we have m(X_i) ≥ Y_i, in which case |m(X_i) − μ̂(X_i)| = μ̂(X_i) − m(X_i) ≤ μ̂(X_i) − Y_i = |Y_i − μ̂(X_i)|. The conclusion holds in both cases.

We now study the number of datapoints obeying E(X_i) ≥ R(X_{n+1}) by combining these lemmas. Consider any 0 ≤ m ≤ n. By conditioning on M_R, and writing C(j, k) for the binomial coefficient,

P{M_E ≥ m} = Σ_{j=0}^{n} P{M_R = j} P{M_E ≥ m | M_R = j}
 ≥ Σ_{j=0}^{n} P{M_R = j} Σ_{k=m}^{j} C(j, k) 2^{−j}
 ≥ Σ_{j=0}^{n} 1/(n+1) Σ_{k=m}^{j} C(j, k) 2^{−j}.

The first inequality holds by Lemma A.2, which implies that P{M_E ≥ m | M_R = j} is bounded below by the probability that M ≥ m for M ∼ Binom(j, 1/2). The second inequality is due to Lemma A.1: Σ_{k=m}^{j} C(j, k) 2^{−j} is a nondecreasing function of j, and by Lemma A.1 the CDF of M_R is lower bounded by the CDF of the uniform distribution over {0, 1, . . . , n}, meaning that

E_{M_R}[ Σ_{k=m}^{M_R} C(M_R, k) 2^{−M_R} ] ≥ 1/(n+1) Σ_{j=0}^{n} Σ_{k=m}^{j} C(j, k) 2^{−j}.
We now evaluate the sum:

Σ_{j=0}^{n} 1/(n+1) Σ_{k=m}^{j} C(j, k) 2^{−j}
 = Σ_{j=0}^{n} 1/(n+1) (1 − Σ_{k=0}^{m−1} C(j, k) 2^{−j})
 = 1 − 1/(n+1) Σ_{j=0}^{n} Σ_{k=0}^{m−1} C(j, k) 2^{−j}
 = 1 − 1/(n+1) Σ_{k=0}^{m−1} Σ_{j=0}^{n} C(j, k) 2^{−j}
 ≥ 1 − 1/(n+1) Σ_{k=0}^{m−1} Σ_{j=0}^{∞} C(j, k) 2^{−j}
 = 1 − 1/(n+1) Σ_{k=0}^{m−1} 2
 = 1 − 2m/(n+1).

The second-to-last equality comes from evaluating the generating function G(C(t, k); z) = Σ_{t=0}^{∞} C(t, k) z^t = z^k/(1 − z)^{k+1} at z = 1/2.

Putting it together, in Algorithm 1, Q_{1−α/2}(E) is set to be the (1 − α/2)(1 + 1/n)-th quantile of {E_i : i ∈ I}. This is equal to the ⌈n(1 − α/2)(1 + 1/n)⌉ = ⌈(1 − α/2)(n + 1)⌉ smallest value of {E_i : i ∈ I}. This means that if M_E ≥ n − ⌈(1 − α/2)(n + 1)⌉ + 1, then R(X_{n+1}) is at most the (n − ⌈(1 − α/2)(n + 1)⌉ + 1)-th largest value of {E_i : i ∈ I}, which is equal to the ⌈(1 − α/2)(n + 1)⌉-th smallest value, i.e., Q_{1−α/2}(E).

However, we have calculated a lower bound for the inverse CDF of M_E above. Substituting it in, we get

P{R(X_{n+1}) ≤ Q_{1−α/2}(E)} ≥ P{M_E ≥ n − ⌈(1 − α/2)(n + 1)⌉ + 1}
 ≥ P{M_E ≥ n + 1 − (1 − α/2)(n + 1)}
 = P{M_E ≥ (α/2)(n + 1)}
 ≥ 1 − α

by our previous calculation, completing our proof.

A.2 Proof of Theorem 2
Our approach is similar to that in Appendix A.1. The main difference arises from the fact that Algorithm 2 no longer uses absolute values and instead uses two separately fitted functions, so we must bound the probability that the confidence interval covers the desired value from each side.

We begin with some definitions. For all x ∈ R^d in the support of P, let q(x) denote Quantile_q(Y | X = x). For all 1 ≤ i ≤ n + 1, define:

• R_lo(X_i) = f_lo(X_i, q(X_i)) and R_hi(X_i) = f_hi(X_i, q(X_i)).
• E_lo(X_i) = f_lo(X_i, Y_i) and E_hi(X_i) = f_hi(X_i, Y_i).
• M_R^lo = |{i ∈ I : R_lo(X_i) ≤ R_lo(X_{n+1})}| and M_R^hi = |{i ∈ I : R_hi(X_i) ≥ R_hi(X_{n+1})}|, the number of i ∈ I for which R_lo(X_i) ≤ R_lo(X_{n+1}) and R_hi(X_i) ≥ R_hi(X_{n+1}), respectively.
• M_E^lo = |{i ∈ I : E_lo(X_i) ≤ R_lo(X_{n+1})}| and M_E^hi = |{i ∈ I : E_hi(X_i) ≥ R_hi(X_{n+1})}|, defined analogously.

Now, note that Quantile_q(Y | X = X_{n+1}) ∈ Ĉ_n(X_{n+1}) precisely when f_lo(X_{n+1}, Quantile_q(Y | X = X_{n+1})) = R_lo(X_{n+1}) ≥ Q_lo_{rq} and f_hi(X_{n+1}, Quantile_q(Y | X = X_{n+1})) = R_hi(X_{n+1}) ≤ Q_hi_{1−s(1−q)}. As such, we develop two lemmas which extend those from Appendix A.1: the first helps us understand the distributions of M_R^hi and M_R^lo, and the second studies the individual relationships between E_lo(X_i) and R_lo(X_i) and between E_hi(X_i) and R_hi(X_i). With both lemmas, we can bound the probability of each event {R_lo(X_{n+1}) ≥ Q_lo_{rq}} and {R_hi(X_{n+1}) ≤ Q_hi_{1−s(1−q)}}.

Lemma A.3.
For all 0 ≤ m ≤ n, P{M_R^lo ≥ m} ≥ 1 − m/(n + 1) and P{M_R^hi ≥ m} ≥ 1 − m/(n + 1).

Proof.
We prove the result for M_R^hi; the same approach holds for M_R^lo. Because f_hi is independent of (X_i, Y_i) for i ∈ I and independent of X_{n+1}, the values in {R_hi(X_i) : i ∈ I} ∪ {R_hi(X_{n+1})} are i.i.d. Thus, the probability that fewer than m of the values in {R_hi(X_i) : i ∈ I} are at least R_hi(X_{n+1}) is bounded above by m/(|I| + 1), as the ordering of these values is uniformly random. Taking the complement of both sides establishes the claim.

Lemma A.4.
For all i ∈ I, P{E_lo(X_i) ≤ R_lo(X_i)} ≥ q and P{E_hi(X_i) ≥ R_hi(X_i)} ≥ 1 − q.

Proof.
We see that P{Y_i ≤ q(X_i)} ≥ q and P{Y_i ≥ q(X_i)} ≥ 1 − q by the definition of the conditional quantile. Furthermore, each event {Y_i ≤ q(X_i)} and {Y_i ≥ q(X_i)} is independent of f_lo, f_hi, and X_i, as f_lo and f_hi are fit on the other half of the data and the datapoints are i.i.d. Then, with probability at least q we have E_lo(X_i) = f_lo(X_i, Y_i) ≤ f_lo(X_i, q(X_i)) = R_lo(X_i) by the definition of a locally nondecreasing conformity score. Similarly, with probability at least 1 − q we have E_hi(X_i) = f_hi(X_i, Y_i) ≥ f_hi(X_i, q(X_i)) = R_hi(X_i), thereby concluding the proof.

We now study the number of i ∈ I such that E_hi(X_i) ≥ R_hi(X_{n+1}) by combining these two lemmas. Consider any 0 ≤ m ≤ n, and note that by conditioning on M_R^hi,

P{M_E^hi ≥ m} = Σ_{j=0}^{n} P{M_R^hi = j} P{M_E^hi ≥ m | M_R^hi = j}
 ≥ Σ_{j=0}^{n} P{M_R^hi = j} Σ_{k=m}^{j} C(j, k) (1−q)^k q^{j−k}
 ≥ Σ_{j=0}^{n} 1/(n+1) Σ_{k=m}^{j} C(j, k) (1−q)^k q^{j−k}.

The first inequality holds because P{M_E^hi ≥ m | M_R^hi = j} is bounded below by the probability that M ≥ m for M ∼ Binom(j, 1 − q), by Lemma A.4. The second inequality holds since the CDF of M_R^hi is greater than or equal to the CDF of the uniform distribution over {0, 1, . . . , n}, by Lemma A.3. Then, as Σ_{k=m}^{j} C(j, k)(1−q)^k q^{j−k} is a nondecreasing function of j, we have

E_{M_R^hi}[ Σ_{k=m}^{M_R^hi} C(M_R^hi, k) (1−q)^k q^{M_R^hi−k} ] ≥ 1/(n+1) Σ_{j=0}^{n} Σ_{k=m}^{j} C(j, k) (1−q)^k q^{j−k}.
We now solve the summation:

Σ_{j=0}^{n} 1/(n+1) Σ_{k=m}^{j} C(j, k)(1−q)^k q^{j−k}
 = Σ_{j=0}^{n} 1/(n+1) (1 − Σ_{k=0}^{m−1} C(j, k)(1−q)^k q^{j−k})
 = 1 − 1/(n+1) Σ_{j=0}^{n} Σ_{k=0}^{m−1} C(j, k)(1−q)^k q^{j−k}
 = 1 − 1/(n+1) Σ_{k=0}^{m−1} Σ_{j=0}^{n} C(j, k)(1−q)^k q^{j−k}
 ≥ 1 − 1/(n+1) Σ_{k=0}^{m−1} ((1−q)/q)^k Σ_{j=0}^{∞} C(j, k) q^j
 = 1 − 1/(n+1) Σ_{k=0}^{m−1} 1/(1−q)
 = 1 − (1/(1−q)) · m/(n+1),

where the second-to-last equality is from evaluating the generating function G(C(t, k); z) = Σ_{t=0}^{∞} C(t, k) z^t = z^k/(1 − z)^{k+1} at z = q.

Note that this same calculation works for counting the i ∈ I with E_lo(X_i) ≤ R_lo(X_{n+1}), substituting M_E^lo for M_E^hi, M_R^lo for M_R^hi, and q for 1 − q. This gives

P{M_E^lo ≥ m} ≥ 1 − (1/q) · m/(n+1).

Now, in Algorithm 2, Q_hi_{1−s(1−q)}(E) is defined as the (1 − s(1−q))(1 + 1/n)-th quantile of {E_i^hi : i ∈ I}. This is equal to the ⌈n(1 − s(1−q))(1 + 1/n)⌉ = ⌈(1 − s(1−q))(n + 1)⌉ smallest value of {E_i^hi : i ∈ I}. Thus, if M_E^hi ≥ n − ⌈(1 − s(1−q))(n + 1)⌉ + 1, then R_hi(X_{n+1}) is at most the (n − ⌈(1 − s(1−q))(n + 1)⌉ + 1)-th largest value of {E_i^hi : i ∈ I}, which is equal to the ⌈(1 − s(1−q))(n + 1)⌉-th smallest value, i.e., Q_hi_{1−s(1−q)}(E). Then, using our earlier lower bound for the inverse CDF of M_E^hi, we get

P{R_hi(X_{n+1}) ≤ Q_hi_{1−s(1−q)}(E)} ≥ P{M_E^hi ≥ n − ⌈(1 − s(1−q))(n + 1)⌉ + 1}
 ≥ P{M_E^hi ≥ n + 1 − (1 − s(1−q))(n + 1)}
 = P{M_E^hi ≥ s(1−q)(n + 1)}
 ≥ 1 − s.

Similarly, Q_lo_{rq}(E) is defined as the (rq − (1 − rq)/n)-th quantile of {E_i^lo : i ∈ I}.
This is equal to the ⌈n(rq − (1 − rq)/n)⌉ = ⌈(n + 1)rq − 1⌉ smallest value of {E_i^lo : i ∈ I}. Thus, if M_E^lo ≥ ⌈(n + 1)rq − 1⌉, then R_lo(X_{n+1}) is at least Q_lo_{rq}(E). Our lower bound for the inverse CDF of M_E^lo then tells us that

P{R_lo(X_{n+1}) ≥ Q_lo_{rq}(E)} ≥ P{M_E^lo ≥ ⌈(n + 1)rq − 1⌉}
 ≥ P{M_E^lo ≥ (n + 1)rq}
 ≥ 1 − r.

Finally, by the union bound,

P{Q_lo_{rq}(E) ≤ R_lo(X_{n+1}) and R_hi(X_{n+1}) ≤ Q_hi_{1−s(1−q)}(E)} ≥ 1 − r − s = 1 − α,

completing our proof.

A.3 Proof of Theorem 3

We show that given ε, there exist δ and N such that for all n > N, running Algorithm 1 on P_δ with our chosen μ̂ results in a confidence interval that contains the conditional median with probability at most 1 − α + ε. Our approach is similar to that in Appendix A.1; however, we apply the inequalities in the opposite direction and use some analysis in order to obtain an upper bound instead of a lower bound.

First, note that the fitting half of the dataset is irrelevant to our algorithm, as μ̂ is set to be the zero function. Then, for each i ∈ I, E_i = |Y_i|. Thus, Q_{1−α/2}(E) is the (1 − α/2)(1 + 1/n)-th empirical quantile of {|Y_i| : i ∈ I}, and our confidence interval is Ĉ_n(X_{n+1}) = [−Q_{1−α/2}(E), Q_{1−α/2}(E)]. Because the parameter we want to cover is Median(Y | X = X_{n+1}) = X_{n+1},

Median(Y_{n+1} | X_{n+1}) ∈ Ĉ_n(X_{n+1}) if and only if |X_{n+1}| ≤ Q_{1−α/2}({|Y_i| : i ∈ I}).

Define M_R = |{i ∈ I : |X_i| ≥ |X_{n+1}|}| and M_E = |{i ∈ I : |Y_i| ≥ |X_{n+1}|}|. Now, note that Q_{1−α/2}({|Y_i| : i ∈ I}) is the ⌈n(1 − α/2)(1 + 1/n)⌉ = ⌈(1 − α/2)(n + 1)⌉ smallest value of {|Y_i| : i ∈ I}, which equals the (n + 1 − ⌈(1 − α/2)(n + 1)⌉)-th largest value.
Letting m = n + 1 − ⌈(1 − α/2)(n + 1)⌉, we have |X_{n+1}| ≤ Q_{1−α/2}({|Y_i| : i ∈ I}) if and only if M_E ≥ m. We now build up the following two lemmas.
Lemma A.5.
For all 0 ≤ M ≤ n, P{M_R = M} = 1/(n + 1).

Proof.
This holds because the values {|X_i| : i ∈ I} ∪ {|X_{n+1}|} are i.i.d. draws from a distribution with no point masses. As a result, M_R is uniformly distributed over {0, 1, . . . , n}.

Lemma A.6. M_E | M_R ∼ Binom(M_R, 1/2 + δ).

Proof. First, note that for all i ∈ I, if |X_i| < |X_{n+1}|, then P{|Y_i| ≥ |X_{n+1}|} = 0, because |X_i| < |X_{n+1}| implies |Y_i| ≤ |X_i| < |X_{n+1}|. Additionally, if |X_i| ≥ |X_{n+1}|, then P{|Y_i| ≥ |X_{n+1}|} = 1/2 + δ: in that case, |Y_i| = 0 with probability 1/2 − δ and |Y_i| = |X_i| with probability 1/2 + δ, and with probability 1 we have |X_{n+1}| > 0, so |Y_i| ≥ |X_{n+1}| if and only if |Y_i| = |X_i|.

Furthermore, the events {|Y_i| = |X_i|} are mutually independent for all i ∈ I (the pairs (X_i, Y_i) are i.i.d.). Then, since

M_E = Σ_{i∈I} 1[|Y_i| ≥ |X_{n+1}|] = Σ_{i∈I} 1[|X_i| ≥ |X_{n+1}|] · 1[|Y_i| = |X_i|],

each term 1[|Y_i| = |X_i|] is i.i.d. Bernoulli with probability 1/2 + δ, and M_R = Σ_{i∈I} 1[|X_i| ≥ |X_{n+1}|], the result follows.

We now apply our two lemmas:

P{M_E ≥ m} = Σ_{j=0}^{n} P{M_R = j} P{M_E ≥ m | M_R = j}
 = 1/(n+1) Σ_{j=0}^{n} Σ_{k=m}^{j} C(j, k) (1/2 + δ)^k (1/2 − δ)^{j−k}
 = 1/(n+1) Σ_{j=0}^{n} (1 − Σ_{k=0}^{m−1} C(j, k)(1/2 + δ)^k(1/2 − δ)^{j−k})
 = 1 − 1/(n+1) Σ_{j=0}^{n} Σ_{k=0}^{m−1} C(j, k)(1/2 + δ)^k(1/2 − δ)^{j−k}
 = 1 − 1/(n+1) Σ_{k=0}^{m−1} ((1/2 + δ)/(1/2 − δ))^k Σ_{j=0}^{n} C(j, k)(1/2 − δ)^j
 = 1 − 1/(n+1) Σ_{k=0}^{m−1} ((1/2 + δ)/(1/2 − δ))^k ( (1/2 − δ)^k/(1/2 + δ)^{k+1} − Σ_{j=n+1}^{∞} C(j, k)(1/2 − δ)^j ),
where the last equality is from evaluating the generating function G(C(t, k); z) = Σ_{t=0}^{∞} C(t, k) z^t = z^k/(1 − z)^{k+1} at z = 1/2 − δ.

Finally, in order to bring this into a coherent bound, we expand the expression to bring out the 1 − α term and isolate a remainder, which we then show vanishes:

P{M_E ≥ m} = 1 − 1/(n+1) Σ_{k=0}^{m−1} ( 1/(1/2 + δ) − ((1/2 + δ)/(1/2 − δ))^k Σ_{j=n+1}^{∞} C(j, k)(1/2 − δ)^j )
 = 1 − m/((n+1)(1/2 + δ)) + 1/(n+1) Σ_{k=0}^{m−1} ((1/2 + δ)/(1/2 − δ))^k Σ_{j=n+1}^{∞} C(j, k)(1/2 − δ)^j
 ≤ 1 − m/((n+1)(1/2 + δ)) + Σ_{k=0}^{m−1} ((1/2 + δ)/(1/2 − δ))^k · C(n+1, k)(1/2 − δ)^{n+1} / (1 − (n+1)(1/2 − δ)/(n+1−k)),

where the inequality arises from upper bounding the tail Σ_{j=n+1}^{∞} C(j, k)(1/2 − δ)^j by

C(n+1, k)(1/2 − δ)^{n+1} Σ_{j=0}^{∞} ( (n+1)/(n+1−k) · (1/2 − δ) )^j,

using the maximum ratio of consecutive terms. Applying m ≤ (n + 1)α/2 twice gives

P{M_E ≥ m} ≤ 1 − α + 2αδ/(1 + 2δ) + c_α ((1/2 + δ)/(1/2 − δ))^{(n+1)α/2} Σ_{k=0}^{⌊(n+1)α/2⌋} C(n+1, k) 2^{−(n+1)},

for a constant c_α depending only on α. Because α < 1, Σ_{k=0}^{⌊(n+1)α/2⌋} C(n+1, k) 2^{−(n+1)} → 0 as n → ∞; this follows since Σ_{k=0}^{n+1} C(n+1, k) 2^{−(n+1)} = 1 and the mass of Binom(n+1, 1/2) concentrates in a band of width O(√n) around its mean. Furthermore, as Binom(n, 1/2) approaches a normal distribution as n → ∞ and Φ(−c√n) is O(d^{−n}) for some d > 1, for small enough δ the quantity
((1/2 + δ)/(1/2 − δ))^{(n+1)α/2} Σ_{k=0}^{⌊(n+1)α/2⌋} C(n+1, k) 2^{−(n+1)} → 0 as n → ∞.

Thus, we can pick D and N such that for all δ < D and n ≥ N, the entire remainder term above is at most ε/2, noting that n_2 = n/2. Then, setting δ = min{ε/(4α − 2ε), D}, so that 2αδ/(1 + 2δ) ≤ ε/2, yields

P{M_E ≥ m} ≤ 1 − α + ε/2 + ε/2 = 1 − α + ε.

This says that the probability P{M_E ≥ m} of the confidence interval containing the conditional median is at most 1 − α + ε.

A.4 Proof of Theorem 4
We show that given ε, there exist c, N, and split sizes n_1 + n_2 = n for all n > N such that running Algorithm 1 on n > N datapoints from an arbitrary distribution P with regression function μ̂_c and split sizes n_1 + n_2 = n results in a finite confidence interval that contains the conditional median with probability at least 1 − α/2 − ε.

For each x in the support of P, define m(x) = Median(Y | X = x), and recall that M = max_{i∈I_1} |Y_i|. We begin with two lemmas.

Lemma A.7.
For all i ∈ I_2, P{|Y_i| ≤ M} ≥ 1 − 1/(n_1 + 1).

Proof. This results from the fact that |Y_i| is exchangeable with |Y_j| for all j ∈ I_1; thus, the probability that |Y_i| is the unique maximum of the set {|Y_j| : j ∈ I_1 ∪ {i}} is bounded above by 1/(n_1 + 1). Taking the complement yields the desired result.

Lemma A.8.
For all $i \in \mathcal{I}_2 \cup \{n+1\}$, $\mathbb{P}\{|m(X_i)| \le M\} \ge 1 - \frac{2}{n_1+1}$.

Proof. Note that $|m(X_i)|$ is exchangeable with $|m(X_j)|$ for all $j \in \mathcal{I}_1$. Letting $M_R = |\{j \in \mathcal{I}_1 : |m(X_j)| \ge |m(X_i)|\}|$, exchangeability gives that $M_R$ stochastically dominates the uniform distribution over $\{0, 1, \ldots, n_1\}$. For each $j \in \mathcal{I}_1$, the event $\{|Y_j| \ge |m(X_j)|\}$ occurs with probability at least $1/2$ by definition of the median; moreover, these events are mutually independent. Therefore, if we condition on $M_R$, we have that
$$\mathbb{P}\Big\{|m(X_i)| > \max_{j \in \mathcal{I}_1}|Y_j| \;\Big|\; M_R = k\Big\} \le \prod_{\substack{j \in \mathcal{I}_1 \\ |m(X_j)| \ge |m(X_i)|}} \mathbb{P}\{|Y_j| < |m(X_j)|\} \le 2^{-k}.$$
Putting this together, we see that
$$\mathbb{P}\Big\{|m(X_i)| > \max_{j \in \mathcal{I}_1}|Y_j|\Big\} = \sum_{k=0}^{n_1}\mathbb{P}\{M_R = k\}\,\mathbb{P}\Big\{|m(X_i)| > \max_{j \in \mathcal{I}_1}|Y_j| \;\Big|\; M_R = k\Big\} \le \sum_{k=0}^{n_1}\mathbb{P}\{M_R = k\}\,2^{-k} \le \frac{1}{n_1+1}\sum_{k=0}^{n_1}2^{-k} \le \frac{2}{n_1+1}.$$

Let $A$ be the event $\{|Y_i| \le M$ for all $i \in \mathcal{I}_2$ and $|m(X_i)| \le M$ for all $i \in \mathcal{I}_2 \cup \{n+1\}\}$. By Lemmas A.7 and A.8,
$$\mathbb{P}\{A\} \ge 1 - \frac{3n_2+2}{n_1+1}.$$
Select $N = \left\lfloor\frac{12/\alpha + 10}{\epsilon}\right\rfloor + \lfloor 2/\alpha \rfloor + 1$, and for all $n > N$, set $n_2 = \lfloor 2/\alpha \rfloor + 1$ and $n_1 = n - n_2$. As a result, we have that $1/n_2 < \alpha/2$ and $\frac{3n_2+2}{n_1+1} < \epsilon/2$, so $\mathbb{P}\{A\} \ge 1 - \epsilon/2$.

Next, let $B$ be the event $\{|\hat\mu_c(X_i) - \hat\mu_c(X_{i'})| > 2M$ for all $i \ne i' \in \mathcal{I}_2 \cup \{n+1\}\}$. Note that $\lim_{c\to\infty}\mathbb{P}\{B\} = 1$ by definition of $\hat\mu_c$. Select $c$ such that $\mathbb{P}\{B\} \ge 1 - \epsilon/2$. By the union bound, $\mathbb{P}\{A \cap B\} \ge 1 - \epsilon$.

Lemma A.9.
On the event $A \cap B$, for all $i \in \mathcal{I}_2$, $|m(X_i) - \hat\mu_c(X_i)| \ge |m(X_{n+1}) - \hat\mu_c(X_{n+1})|$ if and only if $|Y_i - \hat\mu_c(X_i)| \ge |m(X_{n+1}) - \hat\mu_c(X_{n+1})|$.

Proof.
Notice that on the event $A \cap B$, $|m(X_i) - \hat\mu_c(X_i)|, |Y_i - \hat\mu_c(X_i)| \in [\,|\hat\mu_c(X_i)| - M, |\hat\mu_c(X_i)| + M\,]$. This holds because $|m(X_i)|, |Y_i| \le M$ on the event $A$. Similarly, $|m(X_{n+1}) - \hat\mu_c(X_{n+1})| \in [\,|\hat\mu_c(X_{n+1})| - M, |\hat\mu_c(X_{n+1})| + M\,]$. These two intervals both have length $2M$, but their centers are at a distance greater than $2M$ on the event $B$, meaning that the intervals are disjoint. Therefore, $|m(X_i) - \hat\mu_c(X_i)| \ge |m(X_{n+1}) - \hat\mu_c(X_{n+1})|$ implies that all elements of the first interval are greater than all elements of the second, so $|Y_i - \hat\mu_c(X_i)| \ge |m(X_{n+1}) - \hat\mu_c(X_{n+1})|$; similarly, $|Y_i - \hat\mu_c(X_i)| \ge |m(X_{n+1}) - \hat\mu_c(X_{n+1})|$ also implies that all elements of the first interval are greater than all elements of the second, so $|m(X_i) - \hat\mu_c(X_i)| \ge |m(X_{n+1}) - \hat\mu_c(X_{n+1})|$.

Looking at Algorithm 1, we have that $m(X_{n+1}) \in \hat C_n(X_{n+1})$ if $|m(X_{n+1}) - \hat\mu_c(X_{n+1})| \le Q_{1-\alpha/2}(E)$, where $E_i = |Y_i - \hat\mu_c(X_i)|$ for all $i \in \mathcal{I}_2$. Because $1/n_2 < \alpha/2$, $Q_{1-\alpha/2}(E)$ is finite and thus the confidence interval is bounded. By Lemma A.9, on the event $A \cap B$, $|m(X_{n+1}) - \hat\mu_c(X_{n+1})| \le Q_{1-\alpha/2}(E)$ if and only if $|m(X_{n+1}) - \hat\mu_c(X_{n+1})| \le Q_{1-\alpha/2}(F)$, where $F_i = |m(X_i) - \hat\mu_c(X_i)|$ for all $i \in \mathcal{I}_2$.

Define $C$ to be the event $\{|m(X_{n+1}) - \hat\mu_c(X_{n+1})| \le Q_{1-\alpha/2}(F)\}$. We have just shown that on the event $A \cap B \cap C$, we have $m(X_{n+1}) \in \hat C_n(X_{n+1})$. Additionally, because the elements of $\{|m(X_i) - \hat\mu_c(X_i)| : i \in \mathcal{I}_2 \cup \{n+1\}\}$ are exchangeable, we have that $\mathbb{P}\{C\} \ge 1 - \alpha/2$. Then, by the union bound,
$$\mathbb{P}\{m(X_{n+1}) \in \hat C_n(X_{n+1})\} \ge \mathbb{P}\{A \cap B \cap C\} \ge 1 - \alpha/2 - \epsilon,$$
proving the desired result.

B Additional Results
B.1 Impossibility of Capturing the Distribution Mean
Instead of proving the impossibility of capturing the conditional mean of a distribution, we prove a more general result: we show that there does not exist an algorithm to capture the mean of a distribution $Y \sim P$ given no assumptions about $P$. This is a more general form of our result because if we set $X \perp\!\!\!\perp Y$ in $(X, Y) \sim P$, then $\mathbb{E}[Y \mid X] = \mathbb{E}[Y]$, meaning that the impossibility of capturing the mean makes the conditional mean impossible to capture as well.

Consider an algorithm $\hat C_n$ that, given i.i.d. samples $Y_1, \ldots, Y_n \sim P$, returns a (possibly randomized) confidence interval $\hat C_n(D)$, $D = \{Y_i, 1 \le i \le n\}$, with length bounded by some function of $P$, that captures $\mathbb{E}[Y]$ with probability at least $1 - \alpha$, i.e., $\mathbb{P}\{\mathbb{E}[Y] \in \hat C_n(D)\} \ge 1 - \alpha$. Pick $a > \alpha$, with $a < 1$. Consider a distribution $P$ where for $Y \sim P$, $\mathbb{P}\{Y = 0\} = a^{1/n}$ and $\mathbb{P}\{Y = u\} = 1 - a^{1/n}$ for some $u$. Then for $Y_1, \ldots, Y_n \sim P$, $\mathbb{P}\{Y_1 = \cdots = Y_n = 0\} \ge a$. Consider $\hat C_n(\{0, \ldots, 0\})$; by our assumption on $\hat C_n$, there must exist some $m \in \mathbb{R}$ for which $\mathbb{P}\{m \in \hat C_n(\{0, \ldots, 0\})\} < 1 - \alpha/a$. Then, setting $u = \frac{m}{1 - a^{1/n}}$ yields $\mathbb{E}[Y] = m$. With probability $a$, $\hat C_n(D) = \hat C_n(\{0, \ldots, 0\})$, so
$$\mathbb{P}\{\mathbb{E}[Y] = m \not\in \hat C_n(D)\} > a \cdot \frac{\alpha}{a} = \alpha.$$
This implies that $\mathbb{P}\{\mathbb{E}[Y] \in \hat C_n(D)\} < 1 - \alpha$ as desired, completing the proof.
Algorithm 3:
Confidence Interval for $\mathrm{Median}(Y)$ with Coverage $1 - \alpha$

Input:
Number of i.i.d. datapoints $n \in \mathbb{N}$. Datapoints $Y_1, \ldots, Y_n \sim P \subseteq \mathbb{R}$. Coverage level $1 - \alpha \in (0, 1)$.

Process:
Order the $Y_i$ as $Y_{(1)} \le \cdots \le Y_{(n)}$. Calculate the largest $k \ge 0$ such that for $X \sim \mathrm{Binom}(n, 0.5)$, we have $\mathbb{P}\{X < k\} \le \alpha/2$.

Output:
Confidence interval $\hat C_n = [Y_{(k)}, Y_{(n+1-k)}]$ for $\mathrm{Median}(Y)$. (Note that $Y_{(0)} = -\infty$ and $Y_{(n+1)} = \infty$.)

We now show that Algorithm 3 captures the median of $P$ with probability at least $1 - \alpha$. Let $m = \mathrm{Median}(P)$, let $M_{\mathrm{lo}} = |\{i : Y_i \le m, 1 \le i \le n\}|$ be the number of $Y_i$ at most $m$, and let $M_{\mathrm{hi}} = |\{i : Y_i \ge m, 1 \le i \le n\}|$ be the number of $Y_i$ at least $m$. Note that by the definition of $m$, we have that for all $i$, $\mathbb{P}\{Y_i \le m\} \ge 0.5$, and $\mathbb{P}\{Y_i \ge m\} \ge 0.5$ as well. Additionally, the events $\{Y_i \le m\}$ are mutually independent for all $i$, as are the events $\{Y_i \ge m\}$. This implies that both $M_{\mathrm{lo}}$ and $M_{\mathrm{hi}}$ stochastically dominate a $\mathrm{Binom}(n, 0.5)$ distribution.

Since $\hat C_n = [Y_{(k)}, Y_{(n+1-k)}]$, we have that $m \in \hat C_n$ if and only if $M_{\mathrm{lo}} \ge k$ and $M_{\mathrm{hi}} \ge k$. Then,
$$\mathbb{P}\{m \in \hat C_n\} = \mathbb{P}\{M_{\mathrm{lo}} \ge k \text{ and } M_{\mathrm{hi}} \ge k\} = 1 - \mathbb{P}\{M_{\mathrm{lo}} < k \text{ or } M_{\mathrm{hi}} < k\} \ge 1 - \big(\mathbb{P}\{M_{\mathrm{lo}} < k\} + \mathbb{P}\{M_{\mathrm{hi}} < k\}\big) \ge 1 - (\alpha/2 + \alpha/2) = 1 - \alpha.$$
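Algorithm 3 is straightforward to implement. Below is a minimal Python sketch; the function name `median_interval` and the incremental computation of the binomial CDF are ours, not from the paper.

```python
from math import comb

def median_interval(ys, alpha):
    """Algorithm 3: distribution-free confidence interval for Median(Y).

    Finds the largest k >= 0 such that P{Binom(n, 0.5) < k} <= alpha/2,
    then returns [Y_(k), Y_(n+1-k)], with Y_(0) = -inf and Y_(n+1) = +inf.
    """
    n = len(ys)
    # Pad the order statistics with Y_(0) = -inf and Y_(n+1) = +inf.
    ordered = [float("-inf")] + sorted(ys) + [float("inf")]
    k, cdf = 0, 0.0  # cdf tracks P{Binom(n, 0.5) < k}
    while k < n and cdf + comb(n, k) * 0.5**n <= alpha / 2:
        cdf += comb(n, k) * 0.5**n
        k += 1
    return ordered[k], ordered[n + 1 - k]

# With 100 datapoints, the 90% interval brackets the sample median.
lo, hi = median_interval(list(range(1, 101)), alpha=0.1)
assert lo <= 50.5 <= hi
```

By the argument above, the returned interval contains $\mathrm{Median}(P)$ with probability at least $1 - \alpha$ for any distribution $P$; for very small $n$ no valid $k > 0$ exists and the procedure correctly returns $(-\infty, \infty)$.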