Regularization Strategies for Quantile Regression
Taman Narayan, Serena Wang, Kevin Canini, Maya R. Gupta
Google Research, Mountain View, California, USA. Correspondence to: Taman Narayan <[email protected]>, Serena Wang <[email protected]>.

Abstract
We investigate different methods for regularizing quantile regression when predicting either a subset of quantiles or the full inverse CDF. We show that minimizing an expected pinball loss over a continuous distribution of quantiles is a good regularizer even when only predicting a specific quantile. For predicting multiple quantiles, we propose achieving the classic goal of non-crossing quantiles by using deep lattice networks that treat the quantile as a monotonic input feature, and we discuss why monotonicity on other features is an apt regularizer for quantile regression. We show that lattice models enable regularizing the predicted distribution to a location-scale family. Lastly, we propose applying rate constraints to improve the calibration of the quantile predictions on specific subsets of interest and improve fairness metrics. We demonstrate our contributions on simulations, benchmark datasets, and real quantile regression problems.
1. Introduction
Real world users often seek estimates of the quantiles of a random variable. For example, a delivery service may want the estimate that 90% of the time (τ = 0.9), a delivery will arrive within 38 minutes. For a random variable Y ∈ R and a desired quantile τ ∈ (0, 1), the τ-quantile of Y is defined as q_τ = inf{q : P(Y ≤ q) ≥ τ}. For a pair of random variables (X, Y) ∈ R^D × R, the conditional τ-quantile of Y for feature vector X is defined as q_τ(x) = inf{q : P(Y ≤ q | X = x) ≥ τ}. For example, conditioned on it being 5pm, the model may estimate that 90% of the time the delivery will arrive within 53 minutes. Quantile regression takes a set of pairs (x, y) ∈ R^D × R from a joint distribution over (X, Y) as training data and seeks to estimate some or all of the conditional quantiles of Y for any value of X.

Estimating multiple quantiles τ can produce the classic problem of quantile crossing (Koenker & Bassett, 1978), where the estimated quantiles violate the commonsense requirement that q_τ(x) ≥ q_τ'(x) for τ ≥ τ' at every x. For example, if an AI tells a customer there is some chance their delivery will arrive within 38 minutes, but also a smaller chance that it will arrive within 40 minutes, then the customer may think the AI is broken or buggy. In fact, this kind of embarrassing mistake happens easily (He, 1997; Koenker & Bassett, 1978); see Table 1 for data. In this paper, we simultaneously estimate the desired quantiles (or all quantiles) with explicit monotonicity constraints on τ to guarantee non-crossing quantiles, and explore other regularization strategies for quantile regression.

Our main contributions are: (1) showing that training with an expected pinball loss can regularize individual quantile predictions, (2) proposing training arbitrarily flexible models that guarantee the classic interpretability goal of non-crossing quantiles by treating the quantile as a monotonic feature in a deep lattice network (DLN), (3) showing the importance of monotonicity regularizers on input features for quantile regression, (4) showing that the DLN function class enables regularization of the estimated distribution to a location-scale family, and (5) showing that rate constraints can be applied to promote well-calibrated quantile estimates for relevant subsets of the data and to improve fairness metrics. We use a broad set of simulated and real data to illustrate the significance of these regularizers.
2. Quantile Regression Training Objective
First, we give a comprehensive training objective for quantile regression that brings together and modernizes many great insights from the machine learning and statistics communities. Details of the different aspects of this optimization problem, our proposals, and the related work will follow.

Given a training set {(x_i, y_i)}_{i=1}^N, where each x_i ∈ R^D is a feature vector and y_i ∈ R is a corresponding label, we add the quantile of interest τ ∈ (0, 1) as an auxiliary feature and model the conditional quantile q_τ(x) by f(x, τ; θ), where our model f : R^{D+1} → R is parameterized by θ ∈ R^p. To fit f, we propose minimizing the pinball loss (Koenker & Bassett, 1978),

L_τ(y, ŷ) = max( τ(y − ŷ), (τ − 1)(y − ŷ) ),

in expectation with respect to a distribution P_T, where T ∈ (0, 1) is a random quantile, subject to constraints that quantile estimates may not cross. We also propose optionally adding rate constraints (Goh et al., 2016; Cotter et al., 2019d) to ensure the quantile estimates are well-calibrated on specified data subsets. The resulting problem is a constrained optimization:

min_θ E_T [ Σ_{i=1}^N L_T(y_i, f(x_i, T; θ)) ]   (1)

s.t. f(x, τ⁺; θ) ≥ f(x, τ⁻; θ) for all x ∈ R^D and any τ⁺, τ⁻ ∈ (0, 1) with τ⁺ ≥ τ⁻,   (2)

s.t. (τ_s − ε_s⁻) ≤ (1 / |D_s|) Σ_{(x_j, y_j) ∈ D_s} I[y_j ≤ f(x_j, τ_s; θ)] ≤ (τ_s + ε_s⁺) for s ∈ {1, ..., S}.   (3)

Each of the S rate constraints in (3) is specified by a quantile τ_s, a dataset D_s of interest, which may be a subset of the training data or an auxiliary dataset, and allowed slacks ε_s⁻, ε_s⁺ ∈ [0, 1). I is the usual indicator.
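As a concrete illustration of the objective in (1), the following minimal NumPy sketch (the function names are ours, not from the paper's released code) computes the pinball loss and a Monte Carlo estimate of the expected pinball loss under a uniform P_T:

```python
import numpy as np

def pinball_loss(y, y_hat, tau):
    """Pinball loss L_tau(y, y_hat) = max(tau * (y - y_hat), (tau - 1) * (y - y_hat))."""
    diff = y - y_hat
    return np.maximum(tau * diff, (tau - 1.0) * diff)

def expected_pinball_loss(y, x, predict_fn, rng, n_samples=64):
    """Monte Carlo estimate of the expected pinball loss with T ~ Uniform(0, 1).

    predict_fn(x, tau) should return the model's tau-quantile estimates for the rows of x.
    """
    taus = rng.uniform(0.0, 1.0, size=n_samples)
    return np.mean([np.mean(pinball_loss(y, predict_fn(x, t), t)) for t in taus])
```

In practice one would draw a fresh τ per example per batch and backpropagate through the sampled losses; the sketch above only shows the quantity being minimized, not a training loop.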
3. Minimizing the Expected Pinball Loss
The idea in (1) and (2) of simultaneous estimation of quantiles by minimizing the sum of pinball losses over a pre-specified, discrete set of quantiles with non-crossing constraints dates to at least Takeuchi et al. (2006) and is generally regarded as useful (Bondell et al., 2010; Liu & Wu, 2011; Bang et al., 2016; Cannon, 2018). Tagasovska & Lopez-Paz (2019) proposed training for every τ equally by minimizing the expected loss over all quantiles, drawing a separate τ uniformly at random between zero and one for each example in each batch or epoch of training. Tagasovska & Lopez-Paz (2019) show that minimizing a uniform-P_T expected pinball loss induces smoothing across τ that likely reduces crossing, but we show in Section 6 that crossing can still occur quite frequently with DNNs.

We note that these prior works, and the case of single quantile regression, can all be understood as special cases of choosing a distribution P_T to sample the pinball loss from and then optimizing (1). In fact, we will show that using a broad P_T regularizes the estimates of specific τ's, particularly closer to the median. Thus, even if only a single τ is of interest, we propose using a beta distribution for P_T centered on that τ as a regularization strategy. Likewise, if one is only interested in a few discrete quantiles, we will show that training with a uniform P_T can be a good regularizer.

This proposal extends what statisticians have long known to be true for unconditional quantile estimation: quantile estimators that smooth across multiple quantiles can be more efficient than simply taking the desired sample quantile (Harrell & Davis, 1982; Kaigh & Lachenbruch, 1982; David & Steinberg, 1986). For example, for a uniform [a, b] distribution, the average of the sample min and sample max is the minimum variance unbiased estimator (MVUE) for the median (this follows from the Lehmann-Scheffé theorem). The popular Harrell-Davis quantile estimator is a weighted average of all the sample order statistics, and is asymptotically equivalent to computing the mean of bootstrapped medians (Harrell & Davis, 1982).
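A small sketch of how such a Beta P_T might be parameterized follows. The mode/concentration convention is an assumption on our part, chosen because it matches the description in Table 3, where concentration C = 2 recovers the uniform distribution:

```python
import numpy as np

def beta_params_from_mode(mode, concentration):
    """Beta(alpha, beta) with the given mode and concentration C = alpha + beta.

    Since mode = (alpha - 1) / (alpha + beta - 2), setting C = 2 gives
    alpha = beta = 1, i.e., the uniform distribution on (0, 1)."""
    alpha = mode * (concentration - 2.0) + 1.0
    beta = (1.0 - mode) * (concentration - 2.0) + 1.0
    return alpha, beta

rng = np.random.default_rng(0)
alpha, beta = beta_params_from_mode(mode=0.9, concentration=100.0)
taus = rng.beta(alpha, beta, size=1024)  # quantile levels to plug into the pinball loss
```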
4. Why Use Deep Lattice Networks?
We propose using deep lattice networks (DLNs) (You et al., 2017) for quantile regression because they can be made arbitrarily flexible and, as we show in this section, DLNs efficiently enable three key regularization strategies: non-crossing constraints, monotonic features, and restricting the learned distribution to location-scale families.
A lattice is a nonlinear function formed by interpolating a multi-dimensional look-up table (Garcia & Gupta, 2009). Lattices are naturally smooth and can approximate any continuous bounded function by adding more keypoints to the multi-dimensional look-up table. Because the look-up table parameters form a regular grid of function values, many shape constraints such as monotonicity can be imposed efficiently through linear inequality constraints on the parameters (Gupta et al., 2016; Canini et al., 2016; You et al., 2017; Gupta et al., 2018; Cotter et al., 2019b). DLNs are state-of-the-art universal approximators for bounded partially monotonic functions (You et al., 2017), and are made up of lattice layers, linear layers, and calibrator layers (which are a set of one-dimensional piecewise-linear functions). See the Supplemental for a quick review of DLNs.
The non-crossing constraints of (2) encode a common-sense expectation that aids model interpretability and trustworthiness, as well as serving as a semantically-meaningful, tuning-free regularizer. We propose using DLNs to achieve non-crossing quantiles. First, add τ as an input feature to the DLN as done in (1), so that the resulting quantile regression function f(x, τ) can represent arbitrary bounded quantile functions, given enough model parameters. Second, impose monotonicity on the τ input in the DLN to satisfy (2), while still achieving full flexibility for other features. Our implementation uses the open-source TensorFlow Lattice library (TensorFlow Blog Post, 2020). More generally, one could use unrestricted ReLU or embedding layers for the first few model layers on x, then fuse in τ later with τ-monotonic DLN layers.

Prior work has used a similar mechanism to impose the non-crossing constraint (2) for a discrete number of quantiles and for more-restrictive function classes, such as linear models (Bondell et al., 2010) and two-layer monotonic neural networks (Cannon, 2018), which are known to have limited flexibility (Minin et al., 2010; Daniels & Velikova, 2010). Models with more layers (Minin et al., 2010; Lang, 2005) or min-max layers (Sill, 1998) can provide universal approximation. Monotonic neural nets have also been proposed for estimating a CDF (Chilinski & Silva, 2018).

DLNs can efficiently impose monotonicity on selected input features. Monotonicity constraints are particularly useful for quantile regression because real-world quantile regression problems often use features that are past measurements (or strong correlates) to predict the future distribution of measurements. For example, when predicting the quantiles of the time a bus route takes given D = 8 features, its past 7 travel times and the month, if any of the past travel time features were increased, the model should predict longer future travel times, never shorter ones. This type of domain knowledge can be captured as a tuning-free, semantically-meaningful regularizer that also aids model interpretability by constraining the model's predictions to be monotonically increasing in each of those input features.

Using a DLN for f, training (1) with a uniform-P_T expected pinball loss, and imposing non-crossing constraints as per (2), our method will estimate a complete and well-behaved inverse CDF, unlike much of the prior empirical risk minimization work in quantile regression (Cannon, 2018; Tagasovska & Lopez-Paz, 2019; Bondell et al., 2010; Takeuchi et al., 2006; Liu & Wu, 2011; Bang et al., 2016). There are two main prior approaches to estimating a complete inverse CDF. The first relies on nonparametric strategies; K-nearest neighbor methods, for example, can be extended to predict quantiles by taking the quantiles rather than the mean from within a neighborhood (Bhattacharya & Gangopadhyay, 1990). Quantile regression forests (Meinshausen, 2006) use co-location in random-forest leaf nodes to generate a local distribution estimate.

The second strategy is to fit a parametric distribution to the data. Traditionally these methods have been fairly rigid, such as assuming Gaussian noise. He (1997) developed a method to fit a shared but learned location-scale family across x. Yan et al. (2018) found success with a modified 4-parameter Gaussian whose skew and variance were dependent on x. Recently, Gasthaus et al. (2019) proposed spline quantile function DNN models whose outputs are the parameters for a piecewise-linear quantile function, which can fit any continuous bounded distribution, given sufficient parameters. They only discuss recurrent neural networks, but their framework is applicable to the generic quantile regression setting we treat as well.

One advantage of our approach, with DLNs and an explicit τ feature, is that it maintains the possibility of full flexibility, as do the nonparametric methods and Gasthaus et al. (2019), but enables regularizing the distribution in very natural ways by making certain architecture choices, similar to the more rigid distributional approaches. As a simple example, by constraining the DLN architecture to not learn interactions between τ and any of the x features, one learns a regression model with homoskedastic errors. In fact, a basic two-layer DLN called a calibrated lattice model (Gupta et al., 2016) can be constrained to learn distributions across x that come from a shared, learned, location-scale family:

Lemma:
Let f(x, τ) be a calibrated lattice model (Gupta et al., 2016) with a piecewise-linear calibrator c(τ) : [0, 1] → [0, 1] for τ and only 2 look-up table parameters for τ in the lattice layer, and suppose the look-up table is interpolated with multilinear interpolation to form the lattice. Then f(x, τ) represents an inverse CDF F⁻¹(y | x) where the estimated distribution for every x is from the same location-scale family as the calibrator c(τ).

Proof: If a random variable Y conditioned on X belongs to a location-scale family, then for τ ∈ (0, 1) and some a ∈ R and b > 0, it must hold that the conditional inverse CDF satisfies F⁻¹_{Y|X=z}(τ) = a + b F⁻¹_{Y|X=x}(τ). Note that for τ ∈ (0, 1), interpolating a lattice with two look-up table parameters in the τ dimension yields the estimate

F̂⁻¹_{Y|X=z}(τ) = f(z, τ) = f(z, 0) + c(τ)( f(z, 1) − f(z, 0) ).

Thus, mapping to the location-scale property, a = f(z, 0), b = f(z, 1) − f(z, 0), and F̂⁻¹_{Y|X=x}(τ) = c(τ). Thus every estimated conditional inverse CDF F̂⁻¹_{Y|X=z}(τ) is a translation and scaling of the piecewise-linear function c(τ).

The number of keypoints in c(τ) controls the complexity of the learned base distribution, allowing the model to approximate location-scale families like the Gaussian, gamma, or Pareto distributions. The number of lattice knots in the τ feature, meanwhile, naturally controls how much the distribution should be allowed to vary across x. Two knots, as noted above, limit us to a shared location-scale family, while three knots give an extra degree of freedom to shrink or stretch one side of the distribution differently across x. Ensembling, layer depth, and further lattice vertices in τ steadily move one towards full generality.
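The lemma can be made concrete with a small hand-built example (the numbers and function names below are illustrative, not learned): with only two look-up table values along τ, the model at every x is forced to be an affine transform of the calibrator c(τ).

```python
import numpy as np

# Calibrator c: [0, 1] -> [0, 1], a monotone piecewise-linear function (illustrative values).
tau_keypoints = np.linspace(0.0, 1.0, 6)
c_values = np.array([0.0, 0.1, 0.3, 0.5, 0.8, 1.0])

def calibrator(tau):
    return np.interp(tau, tau_keypoints, c_values)

def inverse_cdf(x, tau, f_low, f_high):
    """f(x, tau) = f(x, 0) + c(tau) * (f(x, 1) - f(x, 0)).

    f_low and f_high give the lattice values at the two tau knots; monotonicity in tau
    requires f_high(x) >= f_low(x), so the scale b(x) is nonnegative."""
    return f_low(x) + calibrator(tau) * (f_high(x) - f_low(x))

# Every x yields a shift/scale of the same base distribution whose inverse CDF is c(tau).
f_low = lambda x: 1.0 + 0.5 * x    # location a(x) = f(x, 0)
f_high = lambda x: 3.0 + 2.0 * x   # f(x, 1) = a(x) + b(x)
print(inverse_cdf(0.5, np.array([0.1, 0.5, 0.9]), f_low, f_high))
```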
5. Rate Constraints and Quantile Property
Quantile regression models would ideally satisfy the quantile property (Takeuchi et al., 2006), meaning that the proportion of observed outcomes less than the model prediction is τ, for any subset of the data. This subset accuracy issue is also a primary concern for fairness in machine learning: for example, one may wish to ensure quantile estimates achieve some mandated level of quantile accuracy for each of a set of socioeconomic groups.

Prior work (Takeuchi et al., 2006; Sangnier et al., 2016) has shown that the pinball loss can fall short of achieving the quantile property on the entire dataset in the presence of regularization, suggesting the use of additional unconstrained constant terms to maintain guarantees for a discrete number of quantiles τ. For subsets, work in the fairness literature (Yang et al., 2019) demonstrated how the quantile property can suffer over certain subsets of the population if those subsets are not known to the model, with the authors recommending learning per-group post-shifts to correct these shortcomings. In fact, in the presence of parameter sharing, simultaneous quantile learning, and monotonicity, our proposed model structure may not be able to satisfy the quantile property everywhere, even if it has access to Boolean features defining the subsets of interest.

We propose the use of rate constraints to help the model achieve the quantile property for specified subsets and quantiles. Rate constraints are data-dependent constraints on metrics like accuracy or recall in an empirical risk minimization framework (Goh et al., 2016; Cotter et al., 2019d;c;a; Narasimhan et al., 2020). We set up our rate constraints as in (3) to impose that the quantile property hold over selected subsets of the training data, with some slack ε. The slack values are a hyperparameter of the training, and can be chosen by validation or set by the model maker based on what they find is feasible to achieve. This use of rate constraints may worsen the training loss, but it can regularize the model to work well on the subsets of interest and avoid overfitting noisy training examples.

Rate constraints are non-differentiable and data-dependent, and so take some care to impose: we use the open-source TensorFlow Constrained Optimization library to impose (3), using its best iterate (Cotter et al., 2019d) as an approximate solution. Note that unlike the non-crossing constraints in (2), which are constraints purely on the model parameters and thus can be guaranteed no matter how the model is used, rate constraints are defined on a dataset, so even if the constraints are perfectly satisfied on the training set, they might not hold on an IID test set. Using additional validation sets (which we did not do) can improve the generalization of the constraint satisfaction (Cotter et al., 2019d).
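For evaluation, the quantities in (3) reduce to an empirical coverage rate per subset. A minimal sketch (ours, not the TensorFlow Constrained Optimization training path the paper uses) of the coverage check and of the max-quantile-violation metric reported later in Section 6:

```python
import numpy as np

def coverage(y, y_hat_tau):
    """Fraction of examples with y <= f(x, tau): the rate inside constraint (3)."""
    return np.mean(y <= y_hat_tau)

def rate_constraint_holds(y, y_hat_tau, tau, eps_minus, eps_plus):
    """Check (tau - eps_minus) <= coverage <= (tau + eps_plus) on one subset D_s."""
    c = coverage(y, y_hat_tau)
    return (tau - eps_minus) <= c <= (tau + eps_plus)

def max_quantile_violation(subsets, taus, predict_fn):
    """Max over (subset, tau) pairs of |tau - coverage|; predict_fn(X, tau) is the model."""
    return max(abs(t - coverage(y_s, predict_fn(X_s, t)))
               for (X_s, y_s) in subsets for t in taus)
```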
6. Experiments
We first show the value of using DLNs, then use DLNs to show the value of the P_T and rate-constraint regularizers, though those contributions are function-class agnostic. Bolded table results indicate that the metric is not statistically significantly different from the best metric among the models being compared, using an unpaired t-test.

All hyperparameters were optimized on validation sets. We used Keras models in TensorFlow 2.2 for the unrestricted DNN comparisons that optimize (1) (Tagasovska & Lopez-Paz, 2019), as well as for the spline quantile function (SQF) DNNs of Gasthaus et al. (2019), which optimize the same objective in a different manner while also guaranteeing non-crossing quantile estimates. For DLNs, we used the TensorFlow Lattice library (TensorFlow Blog Post, 2020). To train models with rate constraints, we used the custom training losses from the TensorFlow Constrained Optimization library and resolved the stochasticity of the classifier (Narasimhan et al., 2019) by taking the best iterate (Google AI Blog Post, 2020; Cotter et al., 2019d). For all DNN and DLN experiments, we use the Adam optimizer (Kingma & Ba, 2015) with its default learning rate of 0.001, except where noted. For DNN models, we optimized over the number of hidden layers and the hidden dimension, as well as the number of distribution keypoints for SQF-DNNs in particular. For DLN models, we optimized over the number of calibration keypoints, lattice vertices, and, in cases with ensembles of lattices, the number and dimensionality of base models. For both DLNs and DNNs, we additionally optimized over the number of training epochs. Training these different models took roughly equally long. Quantile regression forests (QRF) were trained with the quantregForest R package (Meinshausen, 2017) and validated over the number of trees and the minimum node size. For rate constraint experiments, the slack on the constraints was also validated for the lowest quantile property violation.
Air Quality: The Beijing Multi-Site Air-Quality dataset from UCI (Dheeru & Karra Taniskidou, 2017; Zhang et al., 2017) contains hourly air quality data from 12 monitoring regions around Beijing. We trained models to predict the quantiles of the PM2.5 concentration from D = 7 features: temperature, pressure, dew point, rain, wind speed, region, and wind direction. The DLN model is an ensemble of 2-layer calibrated lattice models. We split the data by time (not IID), with earlier examples forming a training set of size 252,481, later examples a validation set of size 84,145, and the most recent examples a test set of size 84,145.

Traffic:
This is a proprietary dataset for estimating travel time on a driving route. The DLN is a 2-layer calibrated lattice model whose inputs are 1 categorical and 3 continuous features. We used 1,000 IID examples each for training, validation, and testing, with the training examples occurring earlier in time than the validation and test examples. For all Traffic models, we optimized the Adam algorithm's learning rate and batch size.
Table 1. Simulation results: quantile MSE and crossing violations computed for τ ∈ {0.01, 0.02, ..., 0.99}.

Model           Sine-skew (1,7)      Griewank             Michalewicz          Ackley
                MSE        Viol.     MSE        Viol.     MSE        Viol.     MSE        Viol.
DLN mono                   0                    0                    0                    0
DLN non-mono               8%                   11%                  20%       455
DNN non-mono               38%                  5%                   12%       237
SQF-DNN mono               0                    0                    0                    0
QRF mono                   0                    0                    0         343        0
Table 2. Real data experiments: each column is the pinball loss on the test set, averaged over τ ∈ {0.01, 0.02, ..., 0.99}, for the Air Quality, Traffic, Wine, and Puzzles datasets. Rows: DLN mono on τ only, DLN mono on τ & features, DLN non-mono, DNN non-mono, SQF-DNN mono, QRF mono.
Wine:
We used the Wine Reviews dataset from Kaggle (Bahri, 2018). We predict the quantiles of quality on a 100-point scale. We used D = 42 features, including price, country of origin, and 40 Boolean features indicating descriptive terms such as "complex" or "oak". The DLN model is an ensemble of 2-layer calibrated lattice models, each containing a subset of the features (the exact size of the ensemble and dimension of each model chosen by validation set performance). Using the DLN architecture, we further constrain the model output to be monotonically increasing in the price feature. The data was split IID with 84,641 examples for training, 12,091 for validation, and 24,184 for testing.

Puzzles:
The Hoefnagel Puzzle Club uses quantile estimates of how long a member will hold a puzzle borrowed from their library before returning it. The dataset we use is publicly available on their website. Each example has five past hold-times, and a sixth feature denotes whether a user belongs to one of three subsets based on their past activity: {active users, high-variance users, new users}. The DLN model is an ensemble of 2-layer calibrated lattice models. For the DLN, we also constrain the model output to be monotonically increasing in the most recent past hold-time feature. The 936 train and 235 validation examples are IID from past data, while the 210 test samples are the most recent samples (not IID with the train and validation data).

We test our proposal to use monotonic DLNs trained with a uniform-P_T expected pinball loss to predict all quantiles simultaneously.

Simulations:
We start with a selection of simulations from a recent quantile regression survey paper (Torossian et al., 2020) based on the sine, Griewank, Michalewicz, and Ackley functions, with carefully designed noise distributions that represent a range of variances and skews across the respective input domains. We used 250 training examples for the 1-D sine and Michalewicz functions, 1,000 points for the 2-D Griewank function, and 10,000 points for the 9-D Ackley function. Our metric (MSE) is the average squared difference between the estimated and true quantile curves, averaged over values of x across the domain and over 100 repeats. We also compute the fraction of test points for which at least two of their 99 quantiles crossed.

The results in Table 1 demonstrate that across these four disparate simulations, the proposed monotonic DLNs are the best or statistically tied for the best, and the monotonicity on τ consistently improves the performance over the non-monotonic DLN. The non-monotonic DNNs were trained with the same sampled expected pinball loss, and are sometimes close in performance but suffer substantially from crossing quantiles, despite the hypothesis in recent work that just minimizing the expectation over T would reduce quantile crossing (Tagasovska & Lopez-Paz, 2019). For example, on the sine-skew task, crossing was observed between at least two quantiles on 37.9% of test x values! Spline quantile functions (Gasthaus et al., 2019) and quantile regression forests (Meinshausen, 2006) avoid crossing by construction, but performed inconsistently on the simulations.

Real Data:
Table 2 compares these models on the four real datasets. The DLN constrained to be monotonic on τ performed the best or statistically similar to the best on three of the four problems, and was statistically significantly better than the (non-monotonic) DNNs trained with the same expected pinball loss (Tagasovska & Lopez-Paz, 2019) in each case. The monotonicity constraint on the DLN slightly improved its test performance over the unconstrained DLN for all four problems. SQF-DNNs tied the DLN with monotonicity on two of the four datasets. The QRFs proved more effective on the real datasets than in the simulations, doing particularly well on the Wine dataset, which we hypothesize is an artifact of there being only 20 possible training labels and QRFs predicting sample quantiles from subsets of the training labels.

Feature monotonicity regularizers:
Table 2 also shows that adding monotonicity constraints on relevant input features further improves the test pinball loss. For Wine, the price input was constrained to have a monotonic effect on the predicted wine quality quantiles (Gupta et al., 2018), and for Puzzles, each of the past hold-times was constrained to have a monotonic effect on the predicted future hold-time quantiles. For Air Quality and Traffic, there were no input features that we thought should be monotonic, so those results are the same as for monotonic on τ only. Constraining input features to be monotonic also helps explain what the model is doing (Gupta et al., 2016).

We test our proposal that minimizing an expected pinball loss with P_T provides useful regularization.

Unconditioned:
We start with the classic problem of predicting the quantiles of an exponential distribution with λ = 1, without any features X to condition on. We set the DLN to simply be a linear model on τ ∈ [0, 1], trained to minimize the expected pinball loss over a Beta P_T with mode set at the desired quantile. We compare to the sample quantile, which minimizes the pinball loss for its τ, and to the Harrell-Davis estimator (Harrell & Davis, 1982).

Figure 1 shows that for high Beta concentrations (producing a spiky P_T on the quantile of interest), the DLN performs similarly to the sample quantile, as expected. For a middle range of Beta concentrations, minimizing the expected pinball loss with P_T beats the Harrell-Davis estimator.
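For reference, here is a sketch of the two baselines in Figure 1 (the helper names are ours); the Beta-smoothed linear DLN fit is omitted:

```python
import numpy as np
from scipy.stats import beta

def sample_quantile(x, tau):
    """Plain sample quantile: minimizes the pinball loss for tau on the sample."""
    return np.quantile(x, tau)

def harrell_davis(x, tau):
    """Harrell-Davis estimator: Beta-CDF-weighted average of all order statistics."""
    x = np.sort(np.asarray(x, dtype=float))
    n = x.size
    a, b = (n + 1) * tau, (n + 1) * (1.0 - tau)
    edges = beta.cdf(np.arange(n + 1) / n, a, b)  # I_{i/n}(a, b) for i = 0, ..., n
    weights = np.diff(edges)                      # weight on the i-th order statistic
    return float(np.dot(weights, x))

rng = np.random.default_rng(0)
draws = rng.exponential(scale=1.0, size=51)       # exp(lambda = 1) sample, as in Figure 1
print(sample_quantile(draws, 0.5), harrell_davis(draws, 0.5))
```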
Single Feature: We ran simulations on the 1D sine-skew function of Torossian et al. (2020) plotted in Figure 2, with noise parameters (a, b) = (1, 1) for symmetric low noise, (7, 7) for symmetric high noise, and (1, 7) for a sharply asymmetric high-noise function.

Figure 1. Unconditional quantile estimation of exp(λ = 1), for (a) N = 51 and (b) N = 505. Sample is the sample quantile, HD is the Harrell-Davis estimator, and DLN uses a linear model f(τ) on τ ∈ [0, 1] fit to minimize the expected pinball loss with respect to a Beta P_T. Results were averaged over 1,000 random draws of N samples, with 95% confidence intervals depicted by shading and error bars.

Table 3 shows that in the (1, 1) low-noise case the high-smoothing options, with the Beta P_T concentration hyperparameter at C = 2 (uniform) and C = 10, work best. In the (7, 7) case of high heteroskedasticity, we are best off with moderate amounts of smoothing: the quantiles nearby to τ = 0.5 resemble the well-behaved (1, 1) case. In the (1, 7) case, the erratic asymmetric tails resist smoothing; they introduce enough model error that we are best off essentially training for the median alone. In general, more training data improves the relative performance of low-smoothing models.

Real Data:
Table 4 compares the accuracy of using different distributions P_T to fit three target quantiles on each dataset. First, we compare two methods of training a single model to fit all three quantiles: τ ∼ Unif(0, 1), as in Tagasovska & Lopez-Paz (2019), vs. τ ∼ Discrete on just the three target quantiles. These are competitive with each other, on average.
Figure 2.
True quantiles of the sine-skew distribution with different noise parameters (Torossian et al., 2020): (a) sine-skew (1,1), (b) sine-skew (7,7), (c) sine-skew (1,7). The colored lines show the quantiles for five values of τ.

Table 3. Sine-skew experiment with different sine-skew noise choices (1,1), (7,7), and (1,7): MSE between the true median and the estimated median, averaged over x drawn uniformly from the input domain. Models were trained with the expected pinball loss with a Beta P_T with mode τ = 0.5 and varying concentrations C ∈ {2, 10, 100, 1,000, 10,000}; note that C = 2 is the uniform distribution, while C = 10,000 is close to single-τ sampling. Rows: (a, b) ∈ {(1,1), (7,7), (1,7)} crossed with training set sizes N ∈ {100, 1,000}.
Table 4. Effect of training with different sampling distributions for τ ∼ P_T. The key comparisons are: (1) the single models trained with Unif vs. Discrete, and (2) the three separate models trained with Beta or a single τ. All models were DLNs and monotonic on τ. For each of the Air Quality, Traffic, Wine, and Puzzles datasets, the table reports the test pinball loss at each of the three target quantiles for models trained with τ ∼ Unif(0, 1), τ ∼ Discrete, τ ∼ Beta, and a single τ ∈ T.
Next, we compare training three separate models to fit the three quantiles either (1) using a Beta distribution centered on the target quantile with a concentration hyperparameter validated within [10, 1000], or (2) using only the single target quantile itself. The Beta-smoothed models performed as well or better than the single-quantile models in all cases. At the extreme, training with a uniform P_T works well, sometimes better and sometimes worse than training for a single τ, and has the added advantage of providing a complete inverse CDF in a single model without quantile crossing.

We use rate constraints to ensure the quantile property roughly holds across subsets of interest on the training data. For the Air Quality and Traffic problems, we apply constraints on a set T of three quantiles emphasizing accurate estimates of the upper quantiles of air pollution and traffic, for the 12 regions in the Air Quality dataset and the 10 countries in the Traffic dataset. For the Wine problem, we apply constraints over three quantiles highlighting both the upper and lower quantiles of wine quality, for 20 countries plus an "Other country" category aggregating the remaining small countries. For the Puzzles problem, we apply constraints over three quantiles and enforce constraints over three subsets of users: {active users, high-variance users, new users}.

For each problem, we included a rate constraint for each combination of the subsets of interest and the quantiles of interest (for example, for 10 countries and 3 quantiles of interest, there are 30 rate constraints). Hyperparameters were chosen from the validation performance according to the heuristic from Cotter et al. (2019d) that considers both constraint violation and objective. We compare against unconstrained models that are trained and validated to optimize pinball loss. The max quantile violation metric takes the max of the absolute quantile error over all constrained subsets D_s and quantiles τ_s:

max_{s ∈ {1, ..., S}} | τ_s − (1 / |D_s|) Σ_{(x_j, y_j) ∈ D_s} I[y_j ≤ f(x_j, τ_s; θ)] |.

Table 5 reports the max quantile violation over the subsets on the quantiles T on the test set, and the pinball loss averaged over the quantiles T on the test set. The rate constraints significantly improved the test maximum quantile violation for Air Quality, Wine, and Puzzles, and were statistically tied for Traffic. These subset wins caused mixed results on the overall test pinball loss: statistically significantly hurting it for Air Quality, but statistically significantly improving it on Puzzles, which is the most non-IID of the four datasets, with the test set known to have more examples from the hard new-users subset, which may have benefited from its rate constraint at training time.
Table 5. Effect of rate constraints. Unconstr is a DLN with no rate constraints; Constr is a DLN with rate constraints. For each of the Air Quality, Traffic, Wine, and Puzzles datasets, the table reports the test max quantile violation and the test pinball loss for both models.
7. Conclusions
We investigated different regularization strategies for quantile regression. First, we built on classic statistics about sample quantile estimation to propose training with a smoothed expected pinball loss. We showed that a uniform P_T yields performance similar to that of training on discrete quantiles, plus a more flexible model that can predict any τ. We demonstrated that smoothing with a Beta P_T can be more accurate than training for a single τ of interest.

We then attacked the classic goal of non-crossing quantiles with DLNs and showed that DLNs with τ and feature monotonicity work well for quantile regression, and do so in an empirical risk minimization (ERM) framework, with all the flexibility and computational efficiency the ERM framework brings. Not only do they provide a way to guarantee non-crossing for multi-quantile regression without limiting flexibility, they also proved effective at predicting the full conditional distribution of y given x across a wide variety of problems.

Lastly, we showed that rate constraints on subsets of data can improve test performance on those subsets, and may help or hurt the aggregate loss. Here, we used rate constraints to buoy the worst case of the subsets, a common fairness notion. As these strategies attack different aspects of the problem, they can be used separately or work together.

References
Bahri, D. Wine reviews. Kaggle, 2018.
Bang, S., Cho, H., and Jhun, M. Simultaneous estimation for non-crossing multiple quantile regression with right censored data. Statistics and Computing, 2016.
Bhattacharya, P. K. and Gangopadhyay, A. K. Kernel and nearest-neighbor estimation of a conditional quantile. The Annals of Statistics, 1990.
Bondell, H. D., Reich, B. J., and Wang, H. Non-crossing quantile regression curve estimation. Biometrika, 2010.
Canini, K., Cotter, A., Fard, M. M., Gupta, M. R., and Pfeifer, J. Fast and flexible monotonic functions with ensembles of lattices. Advances in Neural Information Processing Systems (NeurIPS), 2016.
Cannon, A. J. Non-crossing nonlinear regression quantiles. Stochastic Environmental Research and Risk Assessment, 32:3207–3225, 2018.
Chilinski, P. and Silva, R. Neural likelihoods via cumulative distribution functions. arXiv, 2018.
Cotter, A., Gupta, M., Jiang, H., Srebro, N., Sridharan, K., Wang, S., Woodworth, B., and You, S. Two player games for efficient non-convex constrained optimization. ICML, 2019a.
Cotter, A., Gupta, M. R., Jiang, H., Louidor, E., Muller, J., Narayan, T., Wang, S., and Zhu, T. Shape constraints for set functions. ICML, 2019b.
Cotter, A., Jiang, H., and Sridharan, K. Two player games for efficient non-convex constrained optimization. ALT, 2019c.
Cotter, A., Jiang, H., Wang, S., Narayan, T., Gupta, M. R., You, S., and Sridharan, K. Optimization with non-differentiable constraints with applications to fairness, recall, churn, and other goals. JMLR, 2019d.
Daniels, H. and Velikova, M. Monotone and partially monotone neural networks. IEEE Trans. Neural Networks, 21(6):906–917, 2010.
David, C. E. and Steinberg, S. M. Quantile estimation. In Encyclopedia of Statistical Sciences, volume 7. Wiley, New York, 1986.
Dheeru, D. and Karra Taniskidou, E. UCI machine learning repository, 2017. URL http://archive.ics.uci.edu/ml.
Garcia, E. K. and Gupta, M. R. Lattice regression. NeurIPS, 2009.
Gasthaus, J., Benidis, K., Wang, Y., Rangapuram, S., Salinas, D., Flunkert, V., and Januschowski, T. Probabilistic forecasting with spline quantile function RNNs. AIStats, 2019.
Goh, G., Cotter, A., Gupta, M., and Friedlander, M. P. Satisfying real-world goals with dataset constraints. In Advances in Neural Information Processing Systems, pp. 2415–2423, 2016.
Google AI Blog Post. Setting fairness goals with the TensorFlow Constrained Optimization Library, 2020. URL https://ai.googleblog.com/2020/02/setting-fairness-goals-with-tensorflow.html.
Gupta, M. R., Cotter, A., Pfeifer, J., Voevodski, K., Canini, K., Mangylov, A., Moczydlowski, W., and Esbroeck, A. V. Monotonic calibrated interpolated look-up tables. Journal of Machine Learning Research (JMLR), 17(109):1–47, 2016. URL http://jmlr.org/papers/v17/15-243.html.
Gupta, M. R., Bahri, D., Cotter, A., and Canini, K. Diminishing returns shape constraints for interpretability and regularization. Advances in Neural Information Processing Systems (NeurIPS), 2018.
Harrell, F. E. and Davis, C. E. A new distribution-free quantile estimator. Biometrika, 1982.
He, X. Quantile curves without crossing. The American Statistician, 51(2):186–192, 1997.
Kaigh, W. D. and Lachenbruch, P. A. A generalized quantile estimator. Communications in Statistics - Theory and Methods, 1982.
Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. In International Conference on Learning Representations, 2015.
Koenker, R. and Bassett, G. Regression quantiles. Econometrica, 1978.
Lang, B. Monotonic multi-layer perceptron networks as universal approximators. Artificial Neural Networks: Formal Models and Their Applications - ICANN, 2005.
Liu, Y. and Wu, Y. Simultaneous multiple non-crossing quantile regression estimation using kernel constraints. Journal of Nonparametric Statistics, 23(2):415–437, 2011.
Meinshausen, N. Quantile regression forests. Journal of Machine Learning Research (JMLR), 2006.
Meinshausen, N. quantregForest: Quantile Regression Forests, 2017.
Minin, A., Velikova, M., Lang, B., and Daniels, H. Comparison of universal approximators incorporating partial monotonicity by structure. Neural Networks, 23(4):471–475, 2010.
Narasimhan, H., Cotter, A., and Gupta, M. R. On making stochastic classifiers deterministic. Advances in Neural Information Processing Systems, 2019.
Narasimhan, H., Cotter, A., Gupta, M. R., and Wang, S. Pairwise fairness for ranking and regression. AAAI, 2020.
Sangnier, M., Fercoq, O., and d'Alche-Buc, F. Joint quantile regression in vector-valued RKHSs. NeurIPS, 2016.
Sill, J. Monotonic networks. Advances in Neural Information Processing Systems (NeurIPS), 1998.
Tagasovska, N. and Lopez-Paz, D. Single-model uncertainties for deep learning. NeurIPS, 2019.
Takeuchi, I., Le, Q. V., Sears, T. D., and Smola, A. J. Nonparametric quantile estimation. Journal of Machine Learning Research (JMLR), 2006.
TensorFlow Blog Post. TensorFlow Lattice: flexible, controlled, and interpretable ML, 2020. URL https://blog.tensorflow.org/2020/02/tensorflow-lattice-flexible-controlled-and-interpretable-ML.html.
Torossian, L., Picheny, V., Faivre, R., and Garivier, A. A review on quantile regression for stochastic computer experiments. Reliability Engineering & System Safety, 2020.
Yan, X., Zhang, W., Ma, L., Liu, W., and Wu, Q. Parsimonious quantile regression of financial asset tail dynamics via sequential learning. NeurIPS, 2018.
Yang, D., Lafferty, J., and Pollard, D. Fair quantile regression. arXiv:1907.08646, 2019.
You, S., Canini, K., Ding, D., Pfeifer, J., and Gupta, M. R. Deep lattice networks and partial monotonic functions. Advances in Neural Information Processing Systems (NeurIPS), 2017.
Zhang, S., Guo, B., Dong, A., He, J., Xu, Z., and Chen, S. Cautionary tales on air-quality improvement in Beijing. Proceedings of the Royal Society A, 473, 2017.
A. Background on Deep Lattice Networks
We provide more background on Deep Lattice Networks (DLNs) (You et al., 2017). DLNs are an arbitrarily flexible function class that can efficiently impose monotonicity on any subset of input features, without restricting the flexibility on other features. They are constructed by composing together layers that individually preserve monotonicity, the most notable of which is the lattice layer.

A lattice (Garcia & Gupta, 2009) is a function parameterized by the values θ it takes at knots arrayed in a regular grid throughout the input domain, which is assumed to be bounded. The function value for all other inputs is generated by linearly interpolating from the surrounding knots.

The simplest example of a lattice is a 1-dimensional lattice, equivalent to a piecewise-linear function (PLF). PLFs play an important role in DLNs; they are often used to individually transform inputs before they are fed into more complex lattices.

Given a D-dimensional input with L_d knots for the d-th input, we therefore have θ ∈ R^L, where L = ∏_{d=1}^D L_d. Given an interpolation kernel Φ : R^D → R^L, the lattice function f can be expressed as a kernel function:

f(x) = Σ_{i=1}^L Φ(x)_i θ_i.

Our interpolation strategy is the multilinear method discussed in Gupta et al. (2016). At a high level, this means that the weight on a knot is the product of the scalar weights we'd place on that knot in each dimension. In particular, let v_d be the vector of knot positions in the d-th dimension and v_d[l] and v_d[r] be the knots on either side of x_d. Our right-weight is w_d[r] = (x_d − v_d[l]) / (v_d[r] − v_d[l]) and our left-weight is w_d[l] = 1 − w_d[r]. Our final interpolation weight on a particular knot θ_i is either 0 (if it is not in the surrounding hypercube) or ∏_{d=1}^D w_d[s], where s corresponds to the left- or right-weight depending on whether the knot is to the left or right of the input x in that dimension.

We show some examples of 2-dimensional lattices with 2 × 2 knots, for a total of four parameters, in Figure 3. You can see how lattices are capable of learning interactions between features, and that, given sufficient knots over the bounded input domain, lattice models can approximate any continuous bounded input-output relationship.

A few key ideas make lattices generally useful and easy to use. The first is that they are differentiable in their parameters and so can be learned with any standard gradient-based approach in an empirical risk minimization framework. The second is that their parameterization makes it straightforward to impose constraints such as monotonicity, by imposing that any two neighboring parameters in the selected direction of the look-up table obey the monotonicity constraint. In Figure 3, that would entail constraining each parameter on the increasing side of the monotonic dimension to be at least as large as its neighbor on the other side.

Figure 3. Example 2-d lattices with four parameters, each of which represents the value the function takes at one of the four corners of the input domain. Intermediate values are computed via multilinear interpolation.
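A minimal NumPy sketch of multilinear interpolation for the 2 × 2 lattices of Figure 3 (the names are ours; knots are placed at the corners of [0, 1]^2):

```python
import numpy as np

def lattice_2x2(theta, x):
    """Multilinear interpolation of a 2-D lattice with knots at the corners of [0, 1]^2.

    theta[i][j] is the function value at the corner (x1 = i, x2 = j)."""
    w1_r, w2_r = x[0], x[1]               # right-weights: fractional position past the left knot
    w1_l, w2_l = 1.0 - w1_r, 1.0 - w2_r   # left-weights
    return (w1_l * w2_l * theta[0][0] + w1_l * w2_r * theta[0][1] +
            w1_r * w2_l * theta[1][0] + w1_r * w2_r * theta[1][1])

# Monotonic in the first input if theta[1][j] >= theta[0][j] for each j.
theta = [[0.0, 1.0], [0.5, 2.0]]
print(lattice_2x2(theta, [0.25, 0.75]))
```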