Estimation and Applications of Quantiles in Deep Binary Classification
Anuj Tambwekar
Department of CS&E, PES University, Bengaluru, Karnataka, India [email protected]
Anirudh Maiya
Department of CS&E, PES University, Bengaluru, Karnataka, India [email protected]
Soma Dhavala
Founder, MLSquare, Bengaluru, India [email protected]
Snehanshu Saha
Department of CSIS and APPCAIR, Birla Institute of Technology and Science, Goa, India [email protected]

Abstract
Quantile regression, based on check loss, is a widely used inferential paradigm in Econometrics and Statistics. The conditional quantiles provide a robust alternative to classical conditional means, and also allow uncertainty quantification of the predictions, while making very few distributional assumptions. We consider the analogue of check loss in the binary classification setting. We assume that the conditional quantiles are smooth functions that can be learnt by Deep Neural Networks (DNNs). Subsequently, we compute the Lipschitz constant of the proposed loss, and also show that its curvature is bounded, under some regularity conditions. Consequently, recent results on the error rates and DNN architecture complexity become directly applicable. We quantify the uncertainty of the class probabilities in terms of prediction intervals, and develop individualized confidence scores that can be used to decide whether a prediction is reliable or not at scoring time. By aggregating the confidence scores at the dataset level, we provide two additional metrics, model confidence and retention rate, to complement the widely used classifier summaries. The robustness of the proposed non-parametric binary quantile classification framework is also studied, and we demonstrate how to obtain several univariate summary statistics of the conditional distributions, in particular conditional means, using smoothed conditional quantiles, allowing the use of explanation techniques like Shapley to explain the mean predictions. Finally, we demonstrate an efficient training regime for this loss based on Stochastic Gradient Descent with Lipschitz Adaptive Learning Rates (LALR).
Deep Learning has seen tremendous success over the last few years in the fields of Computer Vision, Speech and Natural Language Processing [26]. As it is making its way into numerous real-world applications, focus is shifting from achieving state-of-the-art performance to questions about explainability, robustness, trustworthiness, fairness and training efficiency, among others. Uncertainty Quantification (UQ) is also witnessing a renewed interest within the Deep Learning community. All of the aforementioned aspects need to be tackled in a holistic manner to democratize AI and make AI equitable for all sections of society at large [1]. In a seminal paper, Parzen provides a foundation for exploratory and confirmatory data analysis using quantiles [36], and later argues for unification of the theory and practice of statistical methods with them [37]. In this work, we take these ideas forward and show how some of the problems mentioned before can be solved, in the Deep Learning context, using a quantile-centric approach.

Quantile Regression (QR) generalizes traditional mean regression to model the relationship between the quantiles of the response and the dependent variables, median regression being the special case [22]. QR inherits many desirable properties of quantiles: they are robust to noise in the response variable, have a clear probabilistic interpretation, and are equivariant under monotonic transformations. Besides, they also have appealing asymptotic properties under mild assumptions, both in the parametric and the non-parametric settings [38, 9]. QR has found many successful applications in Econometrics and Statistics, such as modeling growth curves and extreme events, and in robust regression contexts [23, 6, 32, 7]. Its introduction to the Machine Learning community is relatively recent, where [19] showed the relationship between ν-Support Vector Machines and QR, for example.
[48] applied QR for modeling aleatoric uncertainty in deep learning via prediction intervals (PIs). Unlike previous works, we study QR in the binary classification setting since: 1) much of the earlier work on QR can be extended to the deep learning context with very minimal effort, which is not the case with classification tasks; 2) despite the dominance of classification tasks in the DL space, reliance on the popular but problematic binary cross entropy is still prevalent, and there is a need to find viable alternatives; 3) we also want to study several problems together, as mentioned before, with binary classification as a test bed. We hope that our findings can be extended to multi-class settings in the future.

In the rest of this work, we first set up the problem, along with notation, and derive the Binary Quantile Regression (BQR) loss. We derive some properties of the loss function and provide the learning rates under the regularity assumptions. Later, for each of the sub-problems, namely UQ, explainability, robustness, and adaptive learning rates, we provide the necessary background, develop the idea, and provide the results. Finally, we discuss our findings, scope for improvements and new opportunities.
For any real-valued random variable Z with distribution function F(z) = P(Z ≤ z), the quantile function Q(τ) is given as Q(τ) = F⁻¹(τ) = inf{ r : F(r) ≥ τ }, for any 0 < τ < 1.

Assumption 2.1.
We collect n i.i.d. samples {x_i, y_i}_{i=1}^n, where x ∈ [−1, 1]^d is continuously distributed and represents the d-dimensional input features, and y ∈ {0, 1} is the class label. For an absolute constant M > 0, assume ‖f*‖_∞ ≤ M.

Assumption 2.2.
Assume f* lies in the Sobolev ball W^{β,∞}([−1, 1]^d), with smoothness β ∈ N⁺:

f*(x) ∈ W^{β,∞}([−1, 1]^d) := { f : max_{α, |α| ≤ β} sup_{x ∈ [−1,1]^d} |D^α f(x)| ≤ 1 },

where α = (α₁, α₂, ..., α_d), |α| = α₁ + α₂ + ... + α_d, and D^α f is the weak derivative.

Assumption 2.3.
Let f* lie in a class F. For the feedforward network class F_DNN, let the approximation error ε_{f*} be

ε_{f*} := sup_{f* ∈ F} inf_{f ∈ F_DNN, ‖f‖_∞ ≤ M} ‖f − f*‖_∞.

Given the input features x, we aim to learn a classifier that maps the inputs to the class labels. Let Q_x(τ) = f_τ(x), τ ∈ (0, 1), be a continuous, smooth, conditional (on x) quantile learnt by the DNN. We consider an architecture of the form Q_x(τ) = g_τ(g_c(x)), where g_τ is a quantile-specific network and g_c is a layer shared by all quantiles. [50] showed that sharing parameters across quantiles generally leads to better statistical efficiency. It is akin to multi-task learning, where each quantile estimation is a task, and our architecture is inspired by this observation.

[30] considered median regression for thresholded binary response models of the form Z = xβ + U, Y = I(Z ≥ 0), where Z is the latent response, β is a d × 1 vector of unknowns, U ∼ F(·) are i.i.d. errors from a continuous distribution, and I(·) is an indicator function. Later, in [31], he proved the consistency and asymptotic properties of the Maximum Score Estimator and also noted that it can be extended to model other quantiles, as a solution to the optimization problem: arg min_{β, |β|=1} Σ_{i=1}^n ρ_τ(y_i − I(xβ ≥ 0)). Here, ρ_τ is the check loss or pinball loss, defined as ρ_τ(e) = (τ − I(e < 0)) e [22]. It is well known that the check loss is a generalized version of the Mean Absolute Error (MAE), often used in robust regression settings, and that quantiles minimize the check loss. [17, 24] provided efficient estimators by replacing the indicator function with smooth kernels. [5] considered the Bayesian counterpart by noting that the check loss is the kernel of the Asymmetric Laplace Density (ALD).
Below, we extend this to the non-parametric setting and derive the loss function suitable for DNNs.

2.3 BQR Loss

Let us reconsider the thresholded binary response model y = I(z ≥ 0), z = Q_x(τ) = f_τ(x) + ε, where ε ∼ ALD(0, 1, τ) and ALD(y; μ, σ, τ) ≡ τ(1 − τ)σ⁻¹ exp(−ρ_τ((y − μ)σ⁻¹)). It can be shown that

P(y = 1 | f_τ(x)) = 1 − τ exp((τ − 1) f_τ(x))   if 0 < f_τ(x),
P(y = 1 | f_τ(x)) = (1 − τ) exp(τ f_τ(x))        if 0 ≥ f_τ(x).

The empirical loss, under the settings defined earlier, can now be defined as the negative of the log-likelihood function, given as:

L_BQR(y; f_τ(x)) = Σ_{i=1}^n [ y_i log( P(y = 1 | f_τ(x_i))⁻¹ ) + (1 − y_i) log( (1 − P(y = 1 | f_τ(x_i)))⁻¹ ) ].

It is to be noted that we can recover the logistic and probit models when the error distributions are the logistic and normal distributions, respectively. Next, we analyze the learnability of the latent functions. Before we do that, we provide two lemmas (see Appendix A for proofs).
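The class-probability map and the resulting negative log-likelihood translate directly into code. Below is a minimal NumPy sketch (the function names are ours, not from the paper's implementation):

```python
import numpy as np

def p_y1(f, tau):
    """P(y = 1 | f_tau(x)) implied by ALD(0, 1, tau) latent errors."""
    f = np.asarray(f, dtype=float)
    return np.where(f > 0,
                    1.0 - tau * np.exp((tau - 1.0) * f),  # 0 < f_tau(x)
                    (1.0 - tau) * np.exp(tau * f))        # f_tau(x) <= 0

def bqr_loss(y, f, tau, eps=1e-12):
    """Empirical BQR loss: negative log-likelihood of the observed labels."""
    p = np.clip(p_y1(f, tau), eps, 1.0 - eps)  # clip for numerical stability
    return float(np.sum(y * np.log(1.0 / p) + (1 - y) * np.log(1.0 / (1.0 - p))))
```

Note that both branches agree at f_τ(x) = 0 (each gives 1 − τ), so the probability is continuous in the latent value; at τ = 0.5 this reduces to a symmetric (median) classifier.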
Lemma 2.1.
The Lipschitz constant of the BQR loss is max(τ, 1 − τ).

The implications of this important result are manifested in Section 6. We set the learning rate accordingly and accomplish significantly faster convergence in binary classification tasks. It may also be useful in studying the robustness of BQR against adversarial attacks [42].
Lemma 2.2.
BQR also admits a bound in terms of the curvature of the function f*. That is,

c₁ E((f − f*)²) ≤ E(L(y, f) − L(y, f*)) ≤ c₂ E((f − f*)²),

where c₁ and c₂ are constants bounded away from 0.

Due to Lemmas 2.1 and 2.2, the BQR loss satisfies Eqn. (2.1) of [12]. Consequently, all the results of their paper are directly applicable, under suitable conditions. In particular, we restate their major result, Theorem 2:
Theorem 2.3.
Suppose Assumptions 2.1–2.3 hold. Let f be the deep ReLU network with W parameters. Under BQR, with probability at least 1 − e^{−γ}, for large enough n and some C > 0,

‖f − f*‖²_{L₂(x)} = E((f − f*)²) ≤ B, for B = C( (W log W / n) log n + (log log n + γ)/n + ε²_{f*} ).

The above non-asymptotic error bounds can be used to tune and optimize the architectures as a function of the DNN architecture complexity W, sample size n, confidence γ and approximation error ε_{f*}. Most refreshingly, it allows us to look at the DNN as a decompressor: given quantized outputs y and inputs x, we can train a DNN to decompress the signal f(x). In Fig. 1, we show the estimated conditional quantiles of the latent response. It is fascinating to see that the original signal is recovered, despite observing only quantized labels at the time of training. In addition, it is worth noting that multiple quantiles are simultaneously estimated. In order to prevent quantile crossing, we add a regularization term,

L_BQC = Σ_{i=1}^n Σ_{p=1}^{m−1} max(0, Q_{x_i}(τ_p) − Q_{x_i}(τ_{p+1})),

giving us the regularized BQR loss: L_BQ = L_BQR + λ L_BQC. In this work, we find that for most cases λ = 1 was sufficient to prevent crossing. Next, we quantify how well the latent functions are learnt in terms of coverage, where coverage is an estimate of P_{x,y}(z < Q_x(τ)) that should be close to the nominal value τ.

Figure 1: f_τ(x) for D1

To verify that the quantiles we obtain possess the coverage property, we scale the true latent distribution and the obtained quantiles. The threshold value is first subtracted from the true latent, and the resulting distribution is then normalized to have zero mean and unit standard deviation. For the quantiles, the mean and standard deviation of the median are obtained, and all the quantiles are normalized using these terms.
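The crossing penalty L_BQC is a sum of the positive parts of adjacent quantile gaps, and is straightforward to vectorize; a NumPy sketch (an illustration of the term, not the paper's training code):

```python
import numpy as np

def crossing_penalty(q):
    """L_BQC: sum over samples and adjacent quantile pairs of
    max(0, Q_x(tau_p) - Q_x(tau_{p+1})).

    q : array of shape (n, m); column p holds Q_{x_i}(tau_p),
        with the taus in increasing order.
    """
    gaps = q[:, :-1] - q[:, 1:]  # positive exactly where quantiles cross
    return float(np.maximum(gaps, 0.0).sum())
```

The regularized objective is then L_BQR plus λ times this penalty, with λ = 1 sufficing in most of the cases reported here.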
This normalization step is done solely for comparing against the latent.

We created a collection of datasets as per the distributions listed below, sampling X from U(−1, 1), and classified the points as class 0 if y_i ≤ μ. We computed the coverage for the generated quantiles. The datasets used are formulated as follows, with D5 and D6 being variants of the dataset proposed in [2]. The results can be seen in Table 1.

• D1: y_i = 5 sin(8x_i) + ζ_i, where ζ_i ∼ N(0, 1)
• D2: y_i = (4x_i)²/10 + ζ_i, where ζ_i ∼ N(0, 0.5)
• D3: y_i = √((4x_i)² + 5) − 0.5ζ_i, where ζ_i ∼ U(−0.5, 0.5)
• D4: y_i = ζ_i + x_i sin(1/x_i) if x ≠ 0, and y_i = ζ_i if x = 0, where ζ_i ∼ N(0, 0.1)
• D5: y_i = 2((1 − 3x_i + 2(3x_i)²) exp{−0.5(3x_i)²} − 0.5) + ζ_i, where ζ_i ∼ N(0, 0.1)
• D6: y_i = 2((1 − 3x_i + 2(3x_i)²) exp{−0.5(3x_i)²} − 0.5) + ζ_i/2, where ζ_i ∼ χ²(2)

Dataset   0.1   0.2   0.3   0.4   0.5   0.6   0.7   0.8   0.9
D1        0.10  0.18  0.28  0.40  0.51  0.59  0.69  0.81  0.92
D2        0.14  0.23  0.33  0.45  0.51  0.60  0.69  0.79  0.84
D3        0.10  0.19  0.28  0.39  0.51  0.58  0.69  0.81  0.86
D4        0.04  0.09  0.22  0.37  0.50  0.67  0.75  0.82  0.91
D5        0.08  0.20  0.36  0.46  0.53  0.61  0.70  0.82  0.89
D6        0.05  0.18  0.32  0.40  0.49  0.55  0.69  0.81  0.90
Table 1: Coverage values (per nominal τ) for simulated datasets

We use regression datasets taken from the UCI Machine Learning Repository [10], and convert them into classification tasks by thresholding the target and converting it into a binary label. We use two different thresholds, one to simulate a balanced classification task and the other to simulate an imbalanced problem. The results can be seen in Table 2. The scaling methodology is the same as the method described for the simulated datasets. For both simulated and real-world datasets, we observe that the reported coverages are very close to their nominal values around the median, and the precision decreases as the nominal quantile moves away from the median. While we do not know the distribution of the estimators, from the classical QR perspective it suggests that the precision at the lower quantiles is dominated more by the density terms than by the τ(1 − τ) factor [23].
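Coverage, as reported in Tables 1 and 2, is the empirical analogue of P(z < Q_x(τ)); a small sketch of the estimate (our helper, operating on the scaled latents and quantiles described above):

```python
import numpy as np

def coverage(z, q_tau):
    """Fraction of (scaled) latent responses falling below their predicted
    tau-th conditional quantile; should be close to the nominal tau."""
    return float(np.mean(np.asarray(z) < np.asarray(q_tau)))
```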
Coverage for τ
Dataset     t      Acc.  RMSE  0.1   0.2   0.3   0.4   0.5   0.6   0.7   0.8   0.9
Abalone     9      0.81  0.83  0.08  0.19  0.27  0.35  0.55  0.68  0.79  0.88  0.97
            7      0.89  0.89  0.08  0.19  0.31  0.43  0.54  0.65  0.75  0.84  0.96
Boston      22     0.88  0.74  0.17  0.22  0.30  0.39  0.50  0.62  0.71  0.80  0.90
            18     0.90  0.82  0.09  0.20  0.34  0.42  0.53  0.67  0.74  0.80  0.86
California  180K   0.80  0.77  0.05  0.15  0.26  0.36  0.51  0.62  0.75  0.84  0.93
            200K   0.79  0.83  0.10  0.25  0.37  0.48  0.57  0.66  0.73  0.79  0.87
Concrete    35     0.88  0.62  0.09  0.17  0.26  0.36  0.50  0.64  0.76  0.88  0.94
            50     0.91  0.66  0.11  0.18  0.28  0.41  0.50  0.65  0.79  0.83  0.87
Energy      20     0.99  0.40  0.13  0.18  0.27  0.39  0.50  0.66  0.79  0.85  0.91
            15     0.94  0.52  0.07  0.16  0.29  0.37  0.51  0.65  0.74  0.81  0.90
Protein     5      0.82  0.82  0.10  0.22  0.34  0.44  0.53  0.63  0.73  0.83  0.93
            9      0.81  0.84  0.09  0.16  0.30  0.42  0.53  0.62  0.72  0.82  0.92
Redshift    0.65   0.91  0.83  0.09  0.18  0.26  0.37  0.48  0.61  0.81  0.86  0.92
            0.9    0.92  0.88  0.07  0.10  0.15  0.32  0.45  0.70  0.77  0.88  0.96
Wine        5      0.82  0.82  0.08  0.18  0.27  0.38  0.49  0.61  0.72  0.83  0.92
            6      0.93  0.93  0.03  0.12  0.24  0.37  0.51  0.64  0.73  0.80  0.86
Yacht       2      0.98  0.63  0.17  0.27  0.35  0.43  0.49  0.55  0.64  0.81  0.89
            7.5    0.98  0.60  0.19  0.34  0.41  0.45  0.51  0.69  0.84  0.91  0.98
Table 2: Coverage results for binary classification using thresholded UCI regression datasets

[34] studied how Deep Learning models can be fooled easily, despite the high confidence in the predictions. [13, 33, 47] discuss why the widely used class probabilities, estimated with the logistic loss, cannot be used as confidence measures, as they can overestimate the confidence and are not consistent. One way to approach the problem is via Uncertainty Quantification (UQ). [25] proposed Deep Ensembles, with a Bayesian justification, to report Monte Carlo estimates of the prediction variance. [27] posed UQ as a min-max problem, where a single model, instead of an ensemble, is input-distance aware. [48] proposed using quantiles to report the PIs in the regression setting.
It is straightforward to establish PIs using the conditional quantiles even in the binary classification setting, since P_{x,y}(z < Q_x(τ)) = τ. It follows then that [Q_x(0.5τ), Q_x(1 − 0.5τ)] is a 100(1 − τ)% PI at x. Any monotonic transformation, such as a sigmoid function or an indicator function, can be used to produce PIs in the class-probability space or the label space. Along with measuring the precision, it is sometimes helpful to know when to withhold from making a prediction. Recently, [20] proposed the Trust Score, based on how close a sample is to a set of highly trustworthy samples, to that effect. In addition to reporting the precision (via PIs), we can also measure the confidence via a confidence score δ, defined as follows:

Definition 3.1.
Confidence Score (δ), defined for a sample x as

δ = inf_{d ∈ (0, 0.5)} { d : Q_x(0.5 − d) ≤ 0 ≤ Q_x(0.5 + d) }.

The Confidence Score δ is thus a metric of how close the latent function for x_i is to the decision boundary. As δ increases, the likelihood of the point being misclassified reduces, as the quantiles for the latent response move further away from the decision boundary. The relationship between the misclassification rate and confidence can be explicitly stated as follows:

Theorem 3.1.
An instance with confidence score δ has a misclassification rate of 0.5 − δ.

Proof.
Let μ be the median of the latent response z, i.e., μ = Q_x(0.5), and suppose μ ≥ 0. Note that

P(z ≤ μ) = P(z ≤ 0) + P(0 < z ≤ μ).

By definition, P(z ≤ μ) = 0.5, δ = P(0 < z ≤ μ), and P(z ≤ 0) is the misclassification rate. Using the same reasoning, we can show that, when μ < 0, the misclassification rate P(z > 0) is 0.5 − δ. Hence, the misclassification rate is 0.5 − δ.

To verify the relationship between misclassification and confidence, we use the same datasets and thresholds used in our coverage computation tests, and compute the goodness-of-fit (R²) score of the expected misclassification rate vs. δ curve on the observed misclassification rates per δ. The results can be seen in Table 3. In addition, by omitting samples whose δ-score is below a certain threshold, more confident predictions can be obtained, as per the theorem described above. We term this threshold the model confidence. However, it is important to keep in mind that as this tolerance for a certain δ becomes more rigid, the number of acceptable decisions will also reduce. We define the retention rate for a given confidence threshold as the ratio of the number of points having a δ-score of at least that confidence score to the number of samples available for decision. Table 4 shows the retention (r_r) and misclassification (m_r) rates for some standard binary classification datasets [10, 11, 21, 29].

In addition, we can use the tried and tested classifier metrics, namely AUC, ROC and Precision-Recall curves, in order to evaluate the classifier per confidence-score level. We simply compute the TPR-FPR and Precision-Recall curves for the classifier, considering only points that have a specific confidence level. Figure 2 shows an example of the same. As one can note, the classifier performance improves when low-confidence labels are withheld.
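Definition 3.1 can be evaluated directly on the discrete quantile grid the network outputs; a NumPy sketch (the grid handling is our implementation choice, not from the paper):

```python
import numpy as np

def delta_score(q, taus):
    """Confidence score: the smallest grid distance d from the median such
    that the interval [Q_x(0.5 - d), Q_x(0.5 + d)] contains 0.

    q    : predicted quantiles Q_x(tau) for one sample
    taus : the matching, increasing tau grid containing 0.5
    """
    q = np.asarray(q, dtype=float)
    taus = np.asarray(taus, dtype=float)
    for d in np.unique(np.round(np.abs(taus - 0.5), 6)):  # 0.0, 0.1, ...
        lo = q[np.argmin(np.abs(taus - (0.5 - d)))]
        hi = q[np.argmin(np.abs(taus - (0.5 + d)))]
        if lo <= 0.0 <= hi:
            return float(d)
    return 0.5  # 0 lies outside every predicted quantile: maximal confidence
```

By Theorem 3.1, the expected misclassification rate of a sample is then 0.5 minus this score.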
The per-δ-score performance can be evaluated and used as another metric when deciding whether or not a prediction should be rejected.

Figure 2: Per δ-score AUC-ROC curves

Note: The R² score for Yacht is correct. The value is computed using scikit-learn's r2_score, which ranges from −∞ to 1.

Table 4: Retention (r_r) and misclassification (m_r) rates per δ-score threshold, with R² goodness of fit

δ-Score in Artificially Created Classification Tasks

To compare our δ_x with TS, we compute both of them for all the samples in a dataset. Following this, we rank the samples based on TS and create 10 equi-distributed bins, one bin per decile. We then compute the average δ_x and TS of all points in each bin. As per [20], low-ranking points are the ones likely to be misclassified. As seen in Table 5, as the TS bin decile increases, the average δ_x score also increases as expected, indicating that our method captures the expected trend.

Misclassification (m_r) and retention (r_r) rates per confidence score (δ) for the image datasets

These datasets report high δ_x values. This makes perfect sense, as these are easy-to-classify datasets. However, by construction, TS involves ranking the samples, and as a result it is possible for some samples with low confidence to get a high TS and vice versa, which is undesirable. Unlike TS, δ_x is well calibrated due to Theorem 3.1, and requires no other models to be run on top, thanks to the conditional quantiles.

As noted previously in Section 3, BCE can be overconfident or can be fooled easily in the presence of noise. While confidence scores may help detect such spurious cases, is it possible to reduce them in the first place? That brings us to developing robust estimation techniques. It is widely established that quantiles are robust estimators against noise in the response variable (vertical noise).
In the classification setting, [14] showcase the ability of the MAE of the class probabilities to be more robust to label noise than Categorical Cross Entropy, but note that training under MAE can be slow. In our case, having shown the promise of BQR in UQ, we are interested in seeing whether the median class probabilities are robust to label noise. To study this, we use the same networks as before, with one trained on BCE and the other on BQR. For each dataset, we vary the percentage of wrongly labelled samples in the training set, and compare the accuracy of the model on the entire real dataset. The results can be seen in Table 6. At no noise, the BQR-based classifier is usually equivalent to, or slightly less accurate than, the BCE one; however, as noise increases (20% and above), the BQR classifier eventually begins to outperform the BCE one, though it is to be noted that these noise levels are extremely high when this occurs.

Table 5: δ-score and Trustscore values per Trustscore bin

                    % of Flipped Labels
Dataset     Loss  0%     10%    20%    30%    40%
Banknote    BCE   1.000  1.000  0.998  0.986  0.925
            BQR   1.000  0.999  0.997  0.989  0.939
Haberman    BCE   0.767  0.766  0.744  0.733  0.642
            BQR   0.764  0.763  0.749  0.735  0.688
Heart       BCE   0.919  0.881  0.822  0.723  0.645
            BQR   0.899  0.859  0.836  0.785  0.700
Ionosphere  BCE   0.962  0.923  0.881  0.799  0.666
            BQR   0.950  0.916  0.887  0.841  0.706
Pima        BCE   0.817  0.803  0.776  0.714  0.618
            BQR   0.802  0.792  0.776  0.735  0.683
Sonar       BCE   0.957  0.880  0.786  0.700  0.586
            BQR   0.946  0.875  0.801  0.718  0.607
Titanic     BCE   0.874  0.868  0.858  0.828  0.756
            BQR   0.872  0.866  0.859  0.845  0.805
WBC         BCE   0.978  0.970  0.964  0.930  0.841
            BQR   0.975  0.970  0.967  0.951  0.917
Table 6: BCE vs BQR loss on label noise

A strong criticism of Deep Learning models is that they are black boxes in nature [40].
A wide variety of techniques that attempt to explain model predictions in terms of activations, saliency maps, and counterfactuals based on gradient propagation have been proposed and are actively being developed [43, 44, 46]. In the majority of cases, the same techniques can be applied to DNNs fit with BQR as well. There are additional classes of explanations for mean predictions, like Shapley values [28] and LIME [39]. Below, we show how the conditional quantiles can be used to estimate conditional effects, as well as to report conditional means.

Recall that Q_x(τ) is the conditional quantile at the covariate x, where τ is chosen over a set of discrete values T ∈ {τ₀, τ₁, ..., τ_m, τ_{m+1}}, with τ₀ = 0, τ_{m+1} = 1, and there are m outputs available from the neural net corresponding to the remaining values of τ. We can get a smoothed version of the conditional quantiles by

Q^s_x(τ), τ ∈ (0, 1) = Σ_{i=0}^m Q_x(τ_i) ∫_{τ_i}^{τ_{i+1}} (1/h) K((τ − p)/h) dp,

where K(·) is a suitable kernel with bandwidth parameter h [36]. In our examples, we used a Gaussian kernel with the bandwidth set to 0.1. Now, one can immediately compute any univariate statistic. In particular, the mean response can be computed as E(f(x)) = ∫₀¹ Q^s_x(τ) dτ. Likewise, Var(f(x)) can also be computed. In fact, any quantity of interest can be computed simply by post-processing the smoothed full distribution [37].

Figure 3: Heart disease latent vs. max heart rate for an average patient (interpolated)

Figure 3 showcases how quantiles can be used to aid explainable predictions. In these graphs, the average metrics of a patient in the heart disease dataset were computed, and the quantiles were predicted using these average parameters while varying the maximum heart rate from the minimum recorded value to the maximum, in steps of 1. The figure graphically showcases the region of uncertainty, something which cannot be obtained from conventional binary classifiers, as they provide only a single threshold value. Figure 4 showcases the Shapley summary statistics of the mean response of the latent on the test data of the heart disease dataset via quantile interpolation.

The smoothed quantiles also show how it is possible to obtain more fine-grained values of the confidence metric δ, while keeping the number of prediction quantile outputs manageable.
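The kernel-smoothing step and the mean-recovery integral can be written compactly; a NumPy sketch using a Gaussian kernel, with midpoint interval edges as our implementation choice (the paper leaves the interval construction to the reader):

```python
import numpy as np
from math import erf, sqrt

def _Phi(u):
    """Standard normal CDF, vectorised over a numpy array."""
    return 0.5 * (1.0 + np.frompyfunc(erf, 1, 1)(np.asarray(u) / sqrt(2.0)).astype(float))

def smoothed_quantile(tau, q, taus, h=0.1):
    """Q^s_x(tau): each discrete quantile Q_x(tau_i) is weighted by the mass a
    Gaussian kernel of bandwidth h places on its tau-interval."""
    q, taus = np.asarray(q, float), np.asarray(taus, float)
    edges = np.concatenate(([0.0], (taus[1:] + taus[:-1]) / 2.0, [1.0]))
    w = _Phi((edges[1:] - tau) / h) - _Phi((edges[:-1] - tau) / h)
    return float(np.dot(q, w / w.sum()))  # renormalise the truncated mass

def conditional_mean(q, taus, h=0.1):
    """E[f(x)]: integral of Q^s_x(tau) over (0, 1), on a fine tau grid."""
    grid = np.linspace(0.005, 0.995, 199)
    return float(np.mean([smoothed_quantile(t, q, taus, h) for t in grid]))
```

Any other univariate summary (variance, trimmed means, and so on) follows the same pattern of post-processing the smoothed quantile function.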
Figure 4: An example of the obtained Shapley statistics of the mean response of the heart disease dataset

A critical parameter in training DNNs via Stochastic Gradient Descent is the learning rate. One of the early approaches to adapt the learning rate is to recognize the inverse relationship between the step size of the gradient descent update and the Lipschitz constant of the function being optimized [4]. Since the Lipschitz constant is generally unknown a priori, [3, 35] estimate a local approximation during training of feedforward networks. Recent works have derived adaptive learning rates by exploiting the gradient properties of deep ReLU networks [49, 41], and successfully applied them to train models on large datasets [45]. We summarize their results in the following proposition:
Proposition 6.1.
In a deep ReLU network, let the constant k_z be the supremum of the gradients w.r.t. the function, and let L be the Lipschitz constant of the loss. Then, the adaptive learning rate η is η = (k_z L)⁻¹, where the weight update rule is: w_t = w_{t−1} − η ∇L(f(x)).

This particular choice of LALR, under the assumption that gradients cannot change arbitrarily fast, ensures a convex quadratic upper bound, minimized by the descent step.

To show the efficacy of the Lipschitz-constant-based adaptive learning rate, we compared the performance of the adaptive learning rate versus fixed learning rates of 0.01 and 0.1, and tested how quickly they were able to reach a specified target accuracy in terms of the number of epochs. The results can be found in Table 7. N/A indicates that the classifier was unable to reach the accuracy threshold within 5000 epochs (500 for IMDB); if this occurs, the maximum accuracy reached is provided as well. For IMDB, we use an embedding dimension of size 100, and a 2-layer LSTM of dimension 256 which feeds a linear layer.

Dataset     Accuracy  N₀.₀₁        N₀.₁  N_LALR
Banknote    0.99      945          104   14
Haberman    0.80      N/A (0.775)  773   78
Heart       0.85      221          161   4
ILP         0.75      N/A (0.733)  317   28
IMDB        0.90      N/A (0.716)  106   27
Ionosphere  0.90      1841         104   6
Pima        0.80      4099         417   53
Sonar       0.97      2199         320   40
Titanic     0.87      982          152   17
Wisconsin   0.97      1577         101   8
Table 7: Convergence comparison between the adaptive and fixed learning rates for SGD

For our image datasets, we compared the efficacy of the LALR on various deep architectures. The ResNet [16] implementations are PyTorch's default ResNet18 and ResNet50 implementations of 18 and 50 layers each, while the DenseNet [18] implementation is PyTorch's DenseNet121 architecture, which contains 4 Dense blocks, for a total of 121 layers. For all models, the optimizer was SGD, and each test was run for 20 epochs.
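Combining Proposition 6.1 with Lemma 2.1 (L = max(τ, 1 − τ) for the BQR loss) gives a one-line learning-rate rule; a minimal sketch (ours), with k_z assumed to be estimated from the network's gradients:

```python
import numpy as np

def lalr(tau, k_z):
    """Adaptive learning rate eta = 1 / (k_z * L), where L = max(tau, 1 - tau)
    is the Lipschitz constant of the BQR loss (Lemma 2.1)."""
    return 1.0 / (k_z * max(tau, 1.0 - tau))

def sgd_step(w, grad, tau, k_z):
    """One update w_t = w_{t-1} - eta * grad (Proposition 6.1)."""
    return np.asarray(w) - lalr(tau, k_z) * np.asarray(grad)
```

Note that as τ moves away from 0.5 the loss becomes steeper and the rule automatically shrinks the step size.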
We obtained the best validation accuracy at LR = 0.01, and found the number of epochs needed for the other learning rates to achieve both a training and validation accuracy equal to or greater than that of LR = 0.01. The results can be seen in Table 8. For all our tests, the LALR-based models converged faster, barring Asirra on DenseNet.

Dataset    Arch.     Target Acc.  LR            N_E  T_E (min)
Asirra     Resnet18  0.76         Fixed (0.01)  16   2.5
                                  Fixed (0.1)   15   2.5
                                  Adaptive      6    4.2
           Resnet50  0.70         Fixed (0.01)  20   4.7
                                  Fixed (0.1)   16   4.7
                                  Adaptive      5    7.0
           Densenet  0.86         Fixed (0.01)  20   4.9
                                  Fixed (0.1)   8    4.9
                                  Adaptive      9    12.1
Pneumonia  Resnet18  0.83         Fixed (0.01)  18   1.2
                                  Fixed (0.1)   7    1.2
                                  Adaptive      6    2.3
           Resnet50  0.82         Fixed (0.01)  20   1.9
                                  Fixed (0.1)   15   1.9
                                  Adaptive      9    3.2
           Densenet  0.81         Fixed (0.01)  20   1.9
                                  Fixed (0.1)   10   1.9
                                  Adaptive      3    3.2
Table 8: Adaptive LR performance in deep binary image classification

To summarize, in this work we put forth the Binary Quantile Regression loss function: a loss function that allows DNNs for binary classification to learn the quantiles of the latent function learnt by the network. By estimating these quantiles, we are able to gain additional insight into the uncertainty of the predictions of the network in real time. To further this, we also extend this uncertainty quantification technique to the sample confidence score we term δ_x. We show that δ_x is an accurate measure of uncertainty that provides a mathematical likelihood of misclassification, as per the function learnt by the model. Following this, we explore how quantiles provide solutions to current open problems in the deep learning space, by being more robust to extreme amounts of label noise and by allowing standard function-explanation techniques to be applied to DNN outputs. Finally, we show how recent advances in adaptive learning rates can be applied to BQR as well.

To conclude, BQR allows us to enhance binary classification networks by providing additional information at prediction time, with no impact on performance. The quantile outputs obtained have a variety of use cases, the most potent of which is the ability to provide the uncertainty metric we describe.

One glaring limitation, however, is the fact that BQR applies only to binary classification tasks, and cannot be used directly in a multi-class setting. One way of overcoming this drawback is to use a one-vs-all approach, akin to multiclass classification with SVMs. Alternatively, when the classes are ordinal, we can extend the binary thresholded model to include more cut-points. Otherwise, it is not trivial to extend BQR to the multi-class setting, because there is no unique way to define multivariate quantiles. However, recent research on depth quantiles [8, 15] could allow BQR to be extended in this direction.

Acknowledgement
The authors would like to thank the Science and Engineering Research Board (SERB), Department of Science and Technology, Government of India, for supporting our research by providing us with resources to conduct our experiments. The project reference number is EMR/2016/005687. The authors are indebted to Prof. Probal Choudhury, ISI Kolkata, for suggestions and critical insights which helped the manuscript immensely. Anuj and Anirudh would like to thank Inumella Sricharan for his help, and Dr. K S Srinivas of the Department of CS&E, PES University, for his help, advice and encouragement.

References

[1] Shakkeel Ahmed, Ravi S. Mula, and Soma S. Dhavala. A framework for democratizing AI.
ArXiv, abs/2001.00818, 2020.
[2] Pritam Anand, Reshma Rastogi, and Suresh Chandra. A new asymmetric ε-insensitive pinball loss function based support vector quantile regression model, 2019.
[3] George S. Androulakis, Michael N. Vrahatis, and George D. Magoulas. Effective backpropagation training with variable stepsize. Neural Networks, 10(1):69–82, January 1997.
[4] Larry Armijo. Minimization of functions having Lipschitz continuous first partial derivatives. Pacific J. Math., 16(1):1–3, 1966.
[5] Dries F. Benoit and Dirk Van den Poel. Binary quantile regression: a Bayesian approach based on the asymmetric Laplace distribution. Journal of Applied Econometrics, 27(7):1174–1188, 2012.
[6] A. Colin Cameron and Pravin K. Trivedi. Microeconometrics Using Stata, Revised Edition. Stata Press, 2nd edition, 2010.
[7] Probal Chaudhuri. Generalized regression quantiles: forming a useful toolkit for robust linear regression. In L1 Statistical Analysis and Related Methods – Proceedings of the Second International Conference on L1 Norm and Related Methods, pages 169–185. North Holland: Amsterdam, 1992.
[8] Probal Chaudhuri. On a geometric notion of quantiles for multivariate data. Journal of the American Statistical Association, 91:862–872, 1996.
[9] Probal Chaudhuri, K. Doksum, and A. Samarov. On average derivative quantile regression. Ann. Statist., 25(2):715–744, 1997.
[10] Dheeru Dua and Casey Graff. UCI machine learning repository, 2017.
[11] Jeremy Elson, John (JD) Douceur, Jon Howell, and Jared Saul. Asirra: A CAPTCHA that exploits interest-aligned manual image categorization. In Proceedings of the 14th ACM Conference on Computer and Communications Security (CCS), October 2007.
[12] Max H. Farrell, Tengyuan Liang, and Sanjog Misra. Deep neural networks for estimation and inference: Application to causal effects and other semiparametric estimands. ArXiv, 2018.
[13] Yarin Gal and Zoubin Ghahramani. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In Proceedings of the 33rd International Conference on Machine Learning (ICML-16), 2016.
[14] Aritra Ghosh, Himanshu Kumar, and P. S. Sastry. Robust loss functions under label noise for deep neural networks. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, AAAI'17, pages 1919–1925, 2017.
[15] Marc Hallin, Davy Paindaveine, and Miroslav Šiman. Multivariate quantiles and multiple-output regression quantiles: From L1 optimization to halfspace depth. Ann. Statist., 38(2):635–669, 2010.
[16] Kaiming He, X. Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778, 2016.
[17] J. L. Horowitz. A smoothed maximum score estimator for the binary response model. Econometrica, 60:505–531, 1992.
[18] Gao Huang, Zhuang Liu, and Kilian Q. Weinberger. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2261–2269, 2017.
[19] Ichiro Takeuchi, Quoc V. Le, Timothy D. Sears, and Alexander J. Smola. Nonparametric quantile estimation. Journal of Machine Learning Research, 7:1231–1264, 2006.
[20] H. Jiang, B. Kim, M. Guan, and M. Gupta. To trust or not to trust a classifier. Advances in Neural Information Processing Systems 31, 2018.
[21] Daniel S. Kermany, Michael H. Goldbaum, Wenjia Cai, Carolina Carvalho Soares Valentim, and Kang Zhang. Identifying medical diagnoses and treatable diseases by image-based deep learning. Cell, 172:1122–1131.e9, 2018.
[22] R. Koenker and G. B. Bassett. Regression quantiles. Econometrica, 46:33–50, 1978.
[23] Roger Koenker. Quantile Regression. Econometric Society Monographs. Cambridge University Press, 2005.
[24] G. Kordas. Smoothed binary regression quantiles. Journal of Applied Econometrics, 21:387–407, 2006.
[25] Balaji Lakshminarayanan, Alexander Pritzel, and Charles Blundell. Simple and scalable predictive uncertainty estimation using deep ensembles. In Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS'17, pages 6405–6416, Red Hook, NY, USA, 2017. Curran Associates Inc.
[26] Yann LeCun, Y. Bengio, and Geoffrey Hinton. Deep learning. Nature, 521:436–444, 2015.
[27] Jeremiah Zhe Liu, Zi Lin, Shreyas Padhy, Dustin Tran, Tania Bedrax-Weiss, and Balaji Lakshminarayanan. Simple and principled uncertainty estimation with deterministic deep learning via distance awareness, 2020.
[28] Scott M. Lundberg, Gabriel Erion, Hugh Chen, Alex DeGrave, Jordan M. Prutkin, Bala Nair, Ronit Katz, Jonathan Himmelfarb, Nisha Bansal, and Su-In Lee. From local explanations to global understanding with explainable AI for trees. Nature Machine Intelligence, 2(1):56–67, January 2020.
[29] Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts. Learning word vectors for sentiment analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 142–150, Portland, Oregon, USA, June 2011. Association for Computational Linguistics.
[30] C. F. Manski. Maximum score estimation of the stochastic utility model of choice. Journal of Econometrics, 3:205–228, 1975.
[31] C. F. Manski. Semiparametric analysis of discrete response: Asymptotic properties of the maximum score estimator. Journal of Econometrics, 27:313–333, 1985.
[32] Ricardo Maronna, Douglas Martin, and Victor Yohai. Robust Statistics: Theory and Methods. Wiley, 2006.
[33] Kamil Nar, Orhan Ocal, S. Shankar Sastry, and Kannan Ramchandran. Cross-entropy loss and low-rank features have responsibility for adversarial examples. CoRR, abs/1901.08360, 2019.
[34] Anh Nguyen, Jason Yosinski, and Jeff Clune. Deep neural networks are easily fooled: High confidence predictions for unrecognizable images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
[35] V. P. Plagianakos, Michael Vrahatis, and George Magoulas. Nonmonotone methods for backpropagation training with adaptive learning rate. International Joint Conference on Neural Networks, 3:1762–1767, 1999.
[36] Emanuel Parzen. Nonparametric statistical data modeling. Journal of the American Statistical Association, 74(365):105–121, 1979.
[37] Emanuel Parzen. Quantile probability and statistical data modeling. Statistical Science, 19, 2004.
[38] Stephen Portnoy and Roger Koenker. Adaptive L-estimation for linear models. Ann. Statist., 17(1):362–381, 1989.
[39] Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. "Why should I trust you?": Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '16, pages 1135–1144, New York, NY, USA, 2016. Association for Computing Machinery.
[40] Cynthia Rudin. Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead. Nature Machine Intelligence, 1(5):206–215, May 2019.
[41] S. Saha, Tejas Prashanth, Suraj Aralihalli, Sumedh Basarkod, T. S. B. Sudarshan, and Soma S. Dhavala. LALR: Theoretical and experimental validation of Lipschitz adaptive learning rate in regression and neural networks. International Joint Conference on Neural Networks, abs/2006.13307, 2020.
[42] Kevin Scaman and Aladin Virmaux. Lipschitz regularity of deep neural networks: Analysis and efficient estimation. In Advances in Neural Information Processing Systems 32, pages 3839–3848, 2018.
[43] Avanti Shrikumar, Peyton Greenside, and Anshul Kundaje. Learning important features through propagating activation differences. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, ICML'17, pages 3145–3153. JMLR.org, 2017.
[44] Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. Deep inside convolutional networks: Visualising image classification models and saliency maps, 2014.
[45] Shailesh Sridhar, Snehanshu Saha, A. Shaikh, Rahul Yedida, and Sriparna Saha. Parsimonious computing: A minority training regime for effective prediction in large microarray expression data sets. International Joint Conference on Neural Networks, arxiv.org/abs/2005.08442, 2020.
[46] Mukund Sundararajan, Ankur Taly, and Qiqi Yan. Gradients of counterfactuals. ArXiv, abs/1611.02639, 2016.
[47] Pragya Sur and Emmanuel J. Candès. A modern maximum-likelihood theory for high-dimensional logistic regression. Proceedings of the National Academy of Sciences, 116(29):14516–14525, 2019.
[48] Natasa Tagasovska and David Lopez-Paz. Single-model uncertainties for deep learning. In Advances in Neural Information Processing Systems, pages 6417–6428, 2019.
[49] Rahul Yedida and Snehanshu Saha. LipschitzLR: Using theoretically computed adaptive learning rates for fast convergence. Applied Intelligence, arxiv.org/abs/1902.07399, 2020.
[50] Hui Zou and Ming Yuan. Composite quantile regression and the oracle model selection theory. Annals of Statistics, 36(3):1108–1126, 2008.

Appendix A Proofs
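As a numerical sanity check of Lemma 2.1 below, the Lipschitz bound can be verified on a grid. The sketch below is our own illustrative code (the name `bqr_loss` and the grid choices are ours, not the paper's); it implements the BQR loss $L(y,z)$ with the piecewise link $p_z$ used in the proofs, and confirms by finite differences that the slope never exceeds $\max(\tau, 1-\tau)$.

```python
import numpy as np

def bqr_loss(y, z, tau):
    """BQR loss L(y, z) = -(y log p_z + (1 - y) log(1 - p_z)),
    with p_z = 1 - tau * exp(-(1 - tau) z) for z >= 0
    and  p_z = (1 - tau) * exp(tau z)      for z <  0."""
    z = np.asarray(z, dtype=float)
    p = np.where(z >= 0,
                 1.0 - tau * np.exp(-(1.0 - tau) * z),
                 (1.0 - tau) * np.exp(tau * z))
    return -(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))

# Finite-difference slopes on a grid; Lemma 2.1 says they are
# bounded by max(tau, 1 - tau) for both labels y = 0 and y = 1.
tau = 0.7
z = np.linspace(-5.0, 5.0, 2001)
for y in (0.0, 1.0):
    slopes = np.abs(np.diff(bqr_loss(y, z, tau)) / np.diff(z))
    print(y, slopes.max() <= max(tau, 1.0 - tau) + 1e-9)  # both print True
```

With $\tau = 0.7$, the loss is steeper on the $y=1$ side (slope $\tau$ for $z<0$) than on the $y=0$ side (slope $1-\tau$ for $z>0$), mirroring the asymmetry of the check loss in quantile regression.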
Lemma 2.1

Lemma. The Lipschitz constant of the BQR loss is $\max(\tau, 1-\tau)$.

Proof. Recall that the empirical risk under the BQR loss is
$$L(y, z) = -\big( y \log p_z + (1-y) \log (1 - p_z) \big), \qquad p_z \equiv \begin{cases} 1 - \tau \exp(-(1-\tau)z) & z \ge 0, \\ (1-\tau) \exp(\tau z) & z < 0. \end{cases}$$
Let us consider the following cases.

Case-1a: $0 < z_1 < z_2$, $y = 1$.
$$\frac{|L(1,z_1) - L(1,z_2)|}{|z_1 - z_2|} = \frac{\big|\log\big(1 - \tau e^{-(1-\tau)z_1}\big) - \log\big(1 - \tau e^{-(1-\tau)z_2}\big)\big|}{|z_1 - z_2|}.$$
The RHS approaches its maximum as $z_1, z_2 \to 0$. Taking the limit w.r.t. $z_2$ first, we get
$$\lim_{z_2 \to 0} \frac{|L(1,z_1) - L(1,z_2)|}{|z_1 - z_2|} = \frac{\big|\log\big(1 - \tau e^{-(1-\tau)z_1}\big) - \log(1-\tau)\big|}{z_1},$$
and then taking the limit w.r.t. $z_1$, we get
$$\lim_{z_1, z_2 \to 0} \frac{|L(1,z_1) - L(1,z_2)|}{|z_1 - z_2|} = \tau.$$
Therefore, $\dfrac{|L(1,z_1) - L(1,z_2)|}{|z_1 - z_2|} \le \tau$.

Case-1b: $0 < z_1 < z_2$, $y = 0$.
$$\frac{|L(0,z_1) - L(0,z_2)|}{|z_1 - z_2|} = \frac{\big|\log\big(\tau e^{-(1-\tau)z_1}\big) - \log\big(\tau e^{-(1-\tau)z_2}\big)\big|}{|z_1 - z_2|}.$$
In this case, the RHS simplifies to
$$\frac{|L(0,z_1) - L(0,z_2)|}{|z_1 - z_2|} = \frac{(z_2 - z_1)(1-\tau)}{z_2 - z_1} = 1 - \tau.$$
Therefore, $\dfrac{|L(0,z_1) - L(0,z_2)|}{|z_1 - z_2|} \le 1 - \tau$.

Case-2a: $z_1 < 0 < z_2$, $y = 1$.
$$\frac{|L(1,z_1) - L(1,z_2)|}{|z_1 - z_2|} = \frac{\big|\log\big(1 - \tau e^{-(1-\tau)z_2}\big) - \log\big((1-\tau) e^{\tau z_1}\big)\big|}{|z_1 - z_2|}.$$
The RHS approaches its maximum as $z_1, z_2 \to 0$. Taking the limit w.r.t. $z_1$ first, we get
$$\lim_{z_1 \to 0} \frac{|L(1,z_1) - L(1,z_2)|}{|z_1 - z_2|} = \frac{\big|\log\big(1 - \tau e^{-(1-\tau)z_2}\big) - \log(1-\tau)\big|}{z_2},$$
and then taking the limit w.r.t. $z_2$, we get
$$\lim_{z_1, z_2 \to 0} \frac{|L(1,z_1) - L(1,z_2)|}{|z_1 - z_2|} = \tau.$$
Therefore, $\dfrac{|L(1,z_1) - L(1,z_2)|}{|z_1 - z_2|} \le \tau$.

Case-2b: $z_1 < 0 < z_2$, $y = 0$.
$$\frac{|L(0,z_1) - L(0,z_2)|}{|z_1 - z_2|} = \frac{\big|\log\big(\tau e^{-(1-\tau)z_2}\big) - \log\big(1 - (1-\tau) e^{\tau z_1}\big)\big|}{|z_1 - z_2|}.$$
The RHS approaches its maximum as $z_1, z_2 \to 0$. Taking the limit w.r.t. $z_2$ first, we get
$$\lim_{z_2 \to 0} \frac{|L(0,z_1) - L(0,z_2)|}{|z_1 - z_2|} = \frac{\big|\log(\tau) - \log\big(1 - (1-\tau) e^{\tau z_1}\big)\big|}{|z_1|},$$
and then taking the limit w.r.t. $z_1$, we get
$$\lim_{z_1, z_2 \to 0} \frac{|L(0,z_1) - L(0,z_2)|}{|z_1 - z_2|} = 1 - \tau.$$
Therefore, $\dfrac{|L(0,z_1) - L(0,z_2)|}{|z_1 - z_2|} \le 1 - \tau$.

Case-3a: $z_1 < z_2 < 0$, $y = 1$.
$$\frac{|L(1,z_1) - L(1,z_2)|}{|z_1 - z_2|} = \frac{\big|\log\big((1-\tau) e^{\tau z_1}\big) - \log\big((1-\tau) e^{\tau z_2}\big)\big|}{|z_1 - z_2|}.$$
The RHS simplifies to
$$\frac{|L(1,z_1) - L(1,z_2)|}{|z_1 - z_2|} = \frac{\tau (z_2 - z_1)}{z_2 - z_1} = \tau.$$
Therefore, $\dfrac{|L(1,z_1) - L(1,z_2)|}{|z_1 - z_2|} \le \tau$.

Case-3b: $z_1 < z_2 < 0$, $y = 0$.
$$\frac{|L(0,z_1) - L(0,z_2)|}{|z_1 - z_2|} = \frac{\big|\log\big(1 - (1-\tau) e^{\tau z_1}\big) - \log\big(1 - (1-\tau) e^{\tau z_2}\big)\big|}{|z_1 - z_2|}.$$
The RHS approaches its maximum as $z_1, z_2 \to 0$. Taking the limit w.r.t. $z_2$ first, we get
$$\lim_{z_2 \to 0} \frac{|L(0,z_1) - L(0,z_2)|}{|z_1 - z_2|} = \frac{\big|\log\big(1 - (1-\tau) e^{\tau z_1}\big) - \log(\tau)\big|}{|z_1|},$$
and then taking the limit w.r.t. $z_1$, we get
$$\lim_{z_1, z_2 \to 0} \frac{|L(0,z_1) - L(0,z_2)|}{|z_1 - z_2|} = 1 - \tau.$$
Therefore, $\dfrac{|L(0,z_1) - L(0,z_2)|}{|z_1 - z_2|} \le 1 - \tau$.

Hence, for all $z_1, z_2 \in \mathbb{R}$ and $y \in \{0, 1\}$,
$$\frac{|L(y,z_1) - L(y,z_2)|}{|z_1 - z_2|} \le \max(\tau, 1-\tau).$$

Lemma 2.2

Lemma.
BQR also admits a bound in terms of the curvature of the function $f^*$. That is,
$$c_1 \, E\big((f - f^*)^2\big) \;\le\; E\big(L(y, f) - L(y, f^*)\big) \;\le\; c_2 \, E\big((f - f^*)^2\big),$$
where $c_1$ and $c_2$ are constants, bounded away from 0.

Proof. Recall that the empirical risk under the BQR loss is
$$L(y, z) = -\big( y \log p_z + (1-y) \log (1 - p_z) \big), \qquad p_z \equiv \begin{cases} 1 - \tau \exp(-(1-\tau)z) & z \ge 0, \\ (1-\tau) \exp(\tau z) & z < 0. \end{cases}$$
Using the Taylor series expansion of $h_a(b) = L(b, y) - L(a, y)$, with $a = f$, $b = f^*$, we can write
$$h_a(b) = h_a(a) + h_a'(a)(b - a) + \tfrac{1}{2} h_a''(a)(b - a)^2.$$
We will be looking at $h_a''$ to determine the bounds for the curvature of the loss function. Writing $t \equiv \tau$, and letting $g(a)$ collect the terms that do not depend on $b$, let us consider the following cases.

Case-1: $b \ge 0$, $a \ge 0$.
$$h_a(b) = -\big(1 - t e^{-(1-t)a}\big)\log\big(1 - t e^{-(1-t)b}\big) - t e^{-(1-t)a}\log\big(t e^{-(1-t)b}\big) + g(a)$$
$$= -\big(1 - t e^{-(1-t)a}\big)\log\big(1 - t e^{-(1-t)b}\big) - t e^{-(1-t)a}\big(\log t - (1-t)b\big) + g(a),$$
$$h_a'(b) = -\big(1 - t e^{-(1-t)a}\big)\frac{t(1-t)e^{-(1-t)b}}{1 - t e^{-(1-t)b}} + t(1-t)e^{-(1-t)a},$$
$$h_a''(b) = \big(1 - t e^{-(1-t)a}\big)\frac{t(1-t)^2 e^{-(1-t)b}}{\big(1 - t e^{-(1-t)b}\big)^2}.$$
$h_a''(b)$ is maximum at $a = 0, b = 0$, and minimum at $a = M, b = M$; therefore
$$A_1 \equiv \frac{t(1-t)^2 e^{-(1-t)M}}{1 - t e^{-(1-t)M}} \le h_a''(b), \qquad h_a''(b) \le t(1-t).$$

Case-2: $b \le 0$, $a \le 0$.
$$h_a(b) = -(1-t)e^{ta}\log\big((1-t)e^{tb}\big) - \big(1 - (1-t)e^{ta}\big)\log\big(1 - (1-t)e^{tb}\big) + g(a)$$
$$= -(1-t)e^{ta}\big(\log(1-t) + tb\big) - \big(1 - (1-t)e^{ta}\big)\log\big(1 - (1-t)e^{tb}\big) + g(a),$$
$$h_a'(b) = -t(1-t)e^{ta} + \big(1 - (1-t)e^{ta}\big)\frac{t(1-t)e^{tb}}{1 - (1-t)e^{tb}},$$
$$h_a''(b) = \big(1 - (1-t)e^{ta}\big)\frac{t^2(1-t)e^{tb}}{\big(1 - (1-t)e^{tb}\big)^2}.$$
$h_a''(b)$ is maximum at $a = 0, b = 0$, and minimum at $a = -M, b = -M$; therefore
$$A_2 \equiv \frac{t^2(1-t)e^{-tM}}{1 - (1-t)e^{-tM}} \le h_a''(b), \qquad h_a''(b) \le t(1-t).$$

Case-3: $b \ge 0$, $a \le 0$.
$$h_a(b) = -(1-t)e^{ta}\log\big(1 - t e^{-(1-t)b}\big) - \big(1 - (1-t)e^{ta}\big)\log\big(t e^{-(1-t)b}\big) + g(a)$$
$$= -(1-t)e^{ta}\log\big(1 - t e^{-(1-t)b}\big) - \big(1 - (1-t)e^{ta}\big)\big(\log t - (1-t)b\big) + g(a),$$
$$h_a'(b) = -(1-t)e^{ta}\frac{t(1-t)e^{-(1-t)b}}{1 - t e^{-(1-t)b}} + \big(1 - (1-t)e^{ta}\big)(1-t),$$
$$h_a''(b) = \frac{t(1-t)^3 e^{ta} e^{-(1-t)b}}{\big(1 - t e^{-(1-t)b}\big)^2}.$$
$h_a''(b)$ is maximum at $a = 0, b = 0$, and minimum at $a = -M, b = M$; therefore
$$A_3 \equiv \frac{t(1-t)^3 e^{-M}}{\big(1 - t e^{-(1-t)M}\big)^2} \le h_a''(b), \qquad h_a''(b) \le t(1-t).$$

Case-4: $b \le 0$, $a \ge 0$.
$$h_a(b) = -\big(1 - t e^{-(1-t)a}\big)\log\big((1-t)e^{tb}\big) - t e^{-(1-t)a}\log\big(1 - (1-t)e^{tb}\big) + g(a)$$
$$= -\big(1 - t e^{-(1-t)a}\big)\big(\log(1-t) + tb\big) - t e^{-(1-t)a}\log\big(1 - (1-t)e^{tb}\big) + g(a),$$
$$h_a'(b) = -t\big(1 - t e^{-(1-t)a}\big) + t e^{-(1-t)a}\frac{t(1-t)e^{tb}}{1 - (1-t)e^{tb}},$$
$$h_a''(b) = \frac{t^3(1-t) e^{-(1-t)a} e^{tb}}{\big(1 - (1-t)e^{tb}\big)^2}.$$
$h_a''(b)$ is maximum at $a = 0, b = 0$, and minimum at $a = M, b = -M$; therefore
$$A_4 \equiv \frac{t^3(1-t) e^{-M}}{\big(1 - (1-t)e^{-M}\big)^2} \le h_a''(b), \qquad h_a''(b) \le t(1-t).$$

Therefore,
$$c_1 = 0.5\,\min(A_1, A_2, A_3, A_4), \qquad c_2 = 0.5\, t(1-t).$$

Theorem 2.3

Theorem.