Estimation and Applications of Quantiles in Deep Binary Classification
Anuj Tambwekar
Department of CS&E, PES University, Bengaluru, Karnataka, India [email protected]
Anirudh Maiya
Department of CS&E, PES University, Bengaluru, Karnataka, India [email protected]
Soma Dhavala
Founder, MLSquare, Bengaluru, India [email protected]
Snehanshu Saha
Department of CSIS and APPCAIR, Birla Institute of Technology and Science, Goa, India [email protected]

Abstract
Quantile regression, based on check loss, is a widely used inferential paradigm in Econometrics and Statistics. The conditional quantiles provide a robust alternative to classical conditional means, and also allow uncertainty quantification of the predictions, while making very few distributional assumptions. We consider the analogue of check loss in the binary classification setting. We assume that the conditional quantiles are smooth functions that can be learnt by Deep Neural Networks (DNNs). Subsequently, we compute the Lipschitz constant of the proposed loss, and also show that its curvature is bounded, under some regularity conditions. Consequently, recent results on the error rates and DNN architecture complexity become directly applicable. We quantify the uncertainty of the class probabilities in terms of prediction intervals, and develop individualized confidence scores that can be used to decide whether a prediction is reliable or not at scoring time. By aggregating the confidence scores at the dataset level, we provide two additional metrics, model confidence and retention rate, to complement the widely used classifier summaries. The robustness of the proposed non-parametric binary quantile classification framework is also studied, and we demonstrate how to obtain several univariate summary statistics of the conditional distributions, in particular conditional means, using smoothed conditional quantiles, allowing the use of explanation techniques like Shapley to explain the mean predictions. Finally, we demonstrate an efficient training regime for this loss based on Stochastic Gradient Descent with Lipschitz Adaptive Learning Rates (LALR).
Deep Learning has seen tremendous success over the last few years in the fields of Computer Vision, Speech and Natural Language Processing [26]. As it is making its way into numerous real-world applications, focus is shifting from achieving state-of-the-art performance to questions about explainability, robustness, trustworthiness, fairness and training efficiency, among others. Uncertainty Quantification (UQ) is also witnessing a renewed interest within the Deep Learning community. All of the aforementioned aspects need to be tackled in a holistic manner to democratize AI and make AI equitable for all sections of society at large [1]. In a seminal paper, Parzen provides a foundation for exploratory and confirmatory data analysis using quantiles [36], and later argues for unification of the theory and practice of statistical methods with them [37]. In this work, we take these ideas forward and show how some of the problems mentioned before can be solved, in the Deep Learning context, using a quantile-centric approach.

Quantile Regression (QR) generalizes traditional mean regression to model the relationship between the quantiles of the response and the dependent variables, median regression being the special case [22]. QR inherits many desirable properties of quantiles: they are robust to noise in the response variable, have a clear probabilistic interpretation, and are equivariant under monotonic transformations. Besides, they also have appealing asymptotic properties under mild assumptions, both in the parametric and the non-parametric settings [38, 9]. QR has found many successful applications in Econometrics and Statistics, such as modeling growth curves and extreme events, and in robust regression contexts [23, 6, 32, 7]. Its introduction to the Machine Learning community is relatively recent, where [19] showed the relationship between ν-Support Vector Machines and QR, for example.
[48] applied QR for modeling aleatoric uncertainty in deep learning via prediction intervals (PIs). Unlike previous works, we study QR in the binary classification setting since: 1) much of the earlier work on QR can be extended to the deep learning context with very minimal effort, which is not the case with classification tasks; 2) despite the dominance of classification tasks in the DL space, reliance on the popular but problematic binary cross entropy is still prevalent, and there is a need to find viable alternatives; 3) we also want to study several problems together, as mentioned before, with binary classification as a test bed. We hope that our findings can be extended to multi-class settings in the future.

In the rest of this work, we first set up the problem, along with notation, and derive the Binary Quantile Regression (BQR) loss. We derive some properties of the loss function and provide the learning rates under the regularity assumptions. Later, for each of the sub-problems, namely UQ, explainability, robustness, and adaptive learning rates, we provide the necessary background, develop the idea, and provide the results. Finally, we discuss our findings, scope for improvements and new opportunities.
For any real-valued random variable Z with distribution function F(z) = P(Z ≤ z), the quantile function Q(τ) is given as Q(τ) = F⁻¹(τ) = inf{ r : F(r) ≥ τ }, for any 0 < τ < 1.

Assumption 2.1.
We collect n i.i.d. samples {x_i, y_i}_{i=1}^n, where x ∈ [−1, 1]^d is continuously distributed and represents the d-dimensional input features, and y ∈ {0, 1} is the class label. For an absolute constant M > 0, assume ‖f*‖_∞ ≤ M.

Assumption 2.2.
Assume f* lies in the Sobolev ball W^{β,∞}([−1, 1]^d), with smoothness β ∈ N⁺:

f*(x) ∈ W^{β,∞}([−1, 1]^d) := { f : max_{α, |α| ≤ β} sup_{x ∈ [−1,1]^d} |D^α f(x)| ≤ 1 },

where α = (α₁, α₂, ..., α_d), |α| = α₁ + α₂ + ... + α_d, and D^α f is the weak derivative.

Assumption 2.3.
Let f* lie in a class F. For the feedforward network class F_DNN, let the approximation error ε_{f*} be

ε_{f*} := sup_{f* ∈ F} inf_{f ∈ F_DNN, ‖f‖_∞ ≤ M} ‖f − f*‖_∞.

Given the input features x, we aim to learn a classifier that maps the inputs to the class labels. Let Q_x(τ) = f_τ(x), τ ∈ (0, 1), be a continuous, smooth, conditional (on x) quantile learnt by the DNN. We consider an architecture of the form Q_x(τ) = g_τ(g_c(x)), where g_τ is a quantile-specific network and g_c is a layer shared by all quantiles. [50] showed that sharing parameters across quantiles generally leads to better statistical efficiency. It is akin to multi-task learning, where each quantile estimation is a task, and our architecture is inspired by this observation.

[30] considered median regression for thresholded binary response models of the form Z = xβ + U, Y = I(Z ≥ 0), where Z is the latent response, β is a d × 1 vector of unknowns, U ∼ F(·) are i.i.d. errors from a continuous distribution, and I(·) is an indicator function. Later, in [31], he proved the consistency and asymptotic properties of the Maximum Score Estimator and also noted that it can be extended to model other quantiles, as a solution to the optimization problem: arg min_{β, |β|=1} Σ_{i=1}^n ρ_τ(y_i − I(xβ ≥ 0)). Here, ρ_τ is the check loss or pinball loss, defined as ρ_τ(e) = (τ − I(e < 0)) e [22]. It is well known that the check loss is a generalized version of the Mean Absolute Error (MAE), often used in robust regression settings, and that quantiles minimize the check loss. [17, 24] provided efficient estimators by replacing the indicator function with smooth kernels. [5] considered the Bayesian counterpart by noting that the check loss is the kernel of the Asymmetric Laplace Density (ALD).
Below, we extend this to the non-parametric setting and derive the loss function suitable for DNNs.

2.3 BQR Loss

Let us reconsider the thresholded binary response model y = I(z ≥ 0), z = Q_x(τ) = f_τ(x) + ε, where ε ∼ ALD(0, 1, τ) and ALD(y; μ, σ, τ) ≡ τ(1 − τ)σ⁻¹ exp(−ρ_τ((y − μ)σ⁻¹)). It can be shown that

P(y = 1 | f_τ(x)) = 1 − τ exp((τ − 1) f_τ(x))   if 0 < f_τ(x),
P(y = 1 | f_τ(x)) = (1 − τ) exp(τ f_τ(x))        if 0 ≥ f_τ(x).

The empirical loss, under the settings defined earlier, can now be defined as the negative of the log-likelihood function, given as:

L_BQR(y; f_τ(x)) = Σ_{i=1}^n [ y_i log( P(y = 1 | f_τ(x_i))⁻¹ ) + (1 − y_i) log( (1 − P(y = 1 | f_τ(x_i)))⁻¹ ) ].

It is to be noted that we can recover the logistic and probit models when the error distributions are the logistic and normal distributions, respectively. Next, we analyze the learnability of the latent functions. Before we do that, we provide two lemmas (see Appendix A for proofs).
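The class-probability map and the resulting negative log-likelihood translate directly into code. Below is a minimal NumPy sketch (the function names are ours, not from the paper's implementation):

```python
import numpy as np

def p_y1(f, tau):
    """P(y = 1 | f_tau(x)) implied by ALD(0, 1, tau) latent errors."""
    f = np.asarray(f, dtype=float)
    return np.where(f > 0,
                    1.0 - tau * np.exp((tau - 1.0) * f),  # 0 < f_tau(x)
                    (1.0 - tau) * np.exp(tau * f))        # f_tau(x) <= 0

def bqr_loss(y, f, tau, eps=1e-12):
    """Empirical BQR loss: negative log-likelihood of the observed labels."""
    p = np.clip(p_y1(f, tau), eps, 1.0 - eps)  # clip for numerical stability
    return float(np.sum(y * np.log(1.0 / p) + (1 - y) * np.log(1.0 / (1.0 - p))))
```

Note that both branches agree at f_τ(x) = 0 (each gives 1 − τ), so the probability is continuous in the latent value; at τ = 0.5 this reduces to a symmetric (median) classifier.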
Lemma 2.1.
The Lipschitz constant of the BQR loss is max(τ, 1 − τ).

The implications of this important result are manifested in Section 6. We set the learning rate accordingly and accomplish significantly faster convergence in binary classification tasks. It may also be useful in studying the robustness of BQR against adversarial attacks [42].
Lemma 2.2.
BQR also admits a bound in terms of the curvature of the function f*. That is,

c₁ E((f − f*)²) ≤ E(L(y, f) − L(y, f*)) ≤ c₂ E((f − f*)²),

where c₁ and c₂ are constants bounded away from 0.

Due to Lemmas 2.1 and 2.2, the BQR loss satisfies Eqn. (2.1) of [12]. Consequently, all the results of their paper are directly applicable, under suitable conditions. In particular, we restate their major result, Theorem 2:
Theorem 2.3.
Suppose Assumptions 2.1–2.3 hold. Let f be the deep ReLU network with W parameters. Under BQR, with probability at least 1 − e^{−γ}, for large enough n and some C > 0,

‖f − f*‖²_{L₂(x)} = E((f − f*)²) ≤ B, for B = C( (W log W / n) log n + (log log n + γ)/n + ε²_{f*} ).

The above non-asymptotic error bounds can be used to tune and optimize the architectures as a function of the DNN architecture complexity W, sample size n, confidence γ and approximation error ε_{f*}. Most refreshingly, it allows us to look at the DNN as a decompressor: given quantized outputs y and inputs x, we can train a DNN to decompress the signal f(x). In Fig. 1, we show the estimated conditional quantiles of the latent response. It is fascinating to see that the original signal is recovered, despite observing only quantized labels at the time of training. In addition, it is worth noting that multiple quantiles are simultaneously estimated. In order to prevent quantile crossing, we add a regularization term,

L_BQC = Σ_{i=1}^n Σ_{p=1}^{m−1} max(0, Q_{x_i}(τ_p) − Q_{x_i}(τ_{p+1})),

giving us the regularized BQR loss: L_BQ = L_BQR + λ L_BQC. In this work, we find that for most cases λ = 1 was sufficient to prevent crossing. Next, we quantify how well the latent functions are learnt in terms of coverage, where coverage is an estimate of P_{x,y}(z < Q_x(τ)) that should be close to the nominal value τ.

Figure 1: f_τ(x) for D1

To verify that the quantiles we obtain possess the coverage property, we scale the true latent distribution and the obtained quantiles. The threshold value is first subtracted from the true latent, and the resulting distribution is then normalized to have zero mean and unit standard deviation. For the quantiles, the mean and standard deviation of the median are obtained, and all the quantiles are normalized using these terms.
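The crossing penalty L_BQC is a sum of the positive parts of adjacent quantile gaps, and is straightforward to vectorize; a NumPy sketch (an illustration of the term, not the paper's training code):

```python
import numpy as np

def crossing_penalty(q):
    """L_BQC: sum over samples and adjacent quantile pairs of
    max(0, Q_x(tau_p) - Q_x(tau_{p+1})).

    q : array of shape (n, m); column p holds Q_{x_i}(tau_p),
        with the taus in increasing order.
    """
    gaps = q[:, :-1] - q[:, 1:]  # positive exactly where quantiles cross
    return float(np.maximum(gaps, 0.0).sum())
```

The regularized objective is then L_BQR plus λ times this penalty, with λ = 1 sufficing in most of the cases reported here.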
This normalization step is done solely for comparing against the latent.

We created a collection of datasets as per the distributions listed below, sampling X from U(−1, 1), and classified the points as class 0 if y_i ≤ μ. We computed the coverage for the generated quantiles. The datasets used are formulated as follows, with D5 and D6 being variants of the dataset proposed in [2]. The results can be seen in Table 1.

• D1: y_i = 5 sin(8x_i) + ζ_i, where ζ_i ∼ N(0, 1)
• D2: y_i = (4x_i)²/10 + ζ_i, where ζ_i ∼ N(0, 0.5)
• D3: y_i = √((4x_i)² + 5) − 0.5ζ_i, where ζ_i ∼ U(−0.5, 0.5)
• D4: y_i = ζ_i + x_i sin(1/x_i) if x ≠ 0, and y_i = ζ_i if x = 0, where ζ_i ∼ N(0, 0.1)
• D5: y_i = 2((1 − 3x_i + 2(3x_i)²) exp{−0.5(3x_i)²} − 0.5) + ζ_i, where ζ_i ∼ N(0, 0.1)
• D6: y_i = 2((1 − 3x_i + 2(3x_i)²) exp{−0.5(3x_i)²} − 0.5) + ζ_i/2, where ζ_i ∼ χ²(2)

Dataset   0.1   0.2   0.3   0.4   0.5   0.6   0.7   0.8   0.9
D1        0.10  0.18  0.28  0.40  0.51  0.59  0.69  0.81  0.92
D2        0.14  0.23  0.33  0.45  0.51  0.60  0.69  0.79  0.84
D3        0.10  0.19  0.28  0.39  0.51  0.58  0.69  0.81  0.86
D4        0.04  0.09  0.22  0.37  0.50  0.67  0.75  0.82  0.91
D5        0.08  0.20  0.36  0.46  0.53  0.61  0.70  0.82  0.89
D6        0.05  0.18  0.32  0.40  0.49  0.55  0.69  0.81  0.90
Table 1: Coverage values (per nominal τ) for simulated datasets

We use regression datasets taken from the UCI Machine Learning Repository [10], and convert them into classification tasks by thresholding the target and converting it into a binary label. We use two different thresholds, one to simulate a balanced classification task and the other to simulate an imbalanced problem. The results can be seen in Table 2. The scaling methodology is the same as the method described for the simulated datasets. For both simulated and real-world datasets, we observe that the reported coverages are very close to their nominal values around the median, and the precision decreases as the nominal quantile moves away from the median. While we do not know the distribution of the estimators, from the classical QR perspective it suggests that the precision at the lower quantiles is dominated more by the density terms than by the τ(1 − τ) factor [23].
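Coverage, as reported in Tables 1 and 2, is the empirical analogue of P(z < Q_x(τ)); a small sketch of the estimate (our helper, operating on the scaled latents and quantiles described above):

```python
import numpy as np

def coverage(z, q_tau):
    """Fraction of (scaled) latent responses falling below their predicted
    tau-th conditional quantile; should be close to the nominal tau."""
    return float(np.mean(np.asarray(z) < np.asarray(q_tau)))
```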
Coverage for τ
Dataset     t      Acc.  RMSE  0.1   0.2   0.3   0.4   0.5   0.6   0.7   0.8   0.9
Abalone     9      0.81  0.83  0.08  0.19  0.27  0.35  0.55  0.68  0.79  0.88  0.97
            7      0.89  0.89  0.08  0.19  0.31  0.43  0.54  0.65  0.75  0.84  0.96
Boston      22     0.88  0.74  0.17  0.22  0.30  0.39  0.50  0.62  0.71  0.80  0.90
            18     0.90  0.82  0.09  0.20  0.34  0.42  0.53  0.67  0.74  0.80  0.86
California  180K   0.80  0.77  0.05  0.15  0.26  0.36  0.51  0.62  0.75  0.84  0.93
            200K   0.79  0.83  0.10  0.25  0.37  0.48  0.57  0.66  0.73  0.79  0.87
Concrete    35     0.88  0.62  0.09  0.17  0.26  0.36  0.50  0.64  0.76  0.88  0.94
            50     0.91  0.66  0.11  0.18  0.28  0.41  0.50  0.65  0.79  0.83  0.87
Energy      20     0.99  0.40  0.13  0.18  0.27  0.39  0.50  0.66  0.79  0.85  0.91
            15     0.94  0.52  0.07  0.16  0.29  0.37  0.51  0.65  0.74  0.81  0.90
Protein     5      0.82  0.82  0.10  0.22  0.34  0.44  0.53  0.63  0.73  0.83  0.93
            9      0.81  0.84  0.09  0.16  0.30  0.42  0.53  0.62  0.72  0.82  0.92
Redshift    0.65   0.91  0.83  0.09  0.18  0.26  0.37  0.48  0.61  0.81  0.86  0.92
            0.9    0.92  0.88  0.07  0.10  0.15  0.32  0.45  0.70  0.77  0.88  0.96
Wine        5      0.82  0.82  0.08  0.18  0.27  0.38  0.49  0.61  0.72  0.83  0.92
            6      0.93  0.93  0.03  0.12  0.24  0.37  0.51  0.64  0.73  0.80  0.86
Yacht       2      0.98  0.63  0.17  0.27  0.35  0.43  0.49  0.55  0.64  0.81  0.89
            7.5    0.98  0.60  0.19  0.34  0.41  0.45  0.51  0.69  0.84  0.91  0.98
Table 2: Coverage results for binary classification using thresholded UCI regression datasets

[34] studied how Deep Learning models can be fooled easily, despite the high confidence in the predictions. [13, 33, 47] discuss why the widely used class probabilities, estimated with the logistic loss, cannot be used as confidence measures, as they can overestimate the confidence and are not consistent. One way to approach the problem is via Uncertainty Quantification (UQ). [25] proposed Deep Ensembles, with a Bayesian justification, to report Monte Carlo estimates of the prediction variance. [27] posed UQ as a min-max problem, where a single model, instead of an ensemble, is input-distance aware. [48] proposed using quantiles to report the PIs in the regression setting.
It is straightforward to establish PIs using the conditional quantiles even in the binary classification setting, since P_{x,y}(z < Q_x(τ)) = τ. It follows then that [Q_x(0.5τ), Q_x(1 − 0.5τ)] is a 100(1 − τ)% PI at x. Any monotonic transformation, such as a sigmoid function or an indicator function, can be used to produce PIs in the class-probability space or the label space. Along with measuring the precision, it is sometimes helpful to know when to withhold from making a prediction. Recently, [20] proposed the Trust Score, based on how close a sample is to a set of highly trustworthy samples, to that effect. In addition to reporting the precision (via PIs), we can also measure the confidence via a confidence score δ, defined as follows:

Definition 3.1.
Confidence Score (δ), defined for a sample x as

δ = inf_{d ∈ (0, 0.5)} { d : Q_x(0.5 − d) ≤ 0 ≤ Q_x(0.5 + d) }.

The Confidence Score δ is thus a metric of how close the latent function for x_i is to the decision boundary. As δ increases, the likelihood of the point being misclassified reduces, as the quantiles for the latent response move further away from the decision boundary. The relationship between the misclassification rate and confidence can be explicitly stated as follows:

Theorem 3.1.
An instance with confidence score δ has a misclassification rate of 0.5 − δ.

Proof.
Let μ be the median of the latent response z, i.e., μ = Q_x(0.5), and suppose μ ≥ 0. Note that

P(z ≤ μ) = P(z ≤ 0) + P(0 < z ≤ μ).

By definition, P(z ≤ μ) = 0.5, δ = P(0 < z ≤ μ), and P(z ≤ 0) is the misclassification rate. Using the same reasoning, we can show that, when μ < 0, the misclassification rate P(z > 0) is 0.5 − δ. Hence, the misclassification rate is 0.5 − δ.

To verify the relationship between misclassification and confidence, we use the same datasets and thresholds used in our coverage computation tests, and compute the goodness-of-fit (R²) score of the expected misclassification rate vs. δ curve on the observed misclassification rates per δ. The results can be seen in Table 3. In addition, by omitting samples whose δ-score is below a certain threshold, more confident predictions can be obtained, as per the theorem described above. We term this threshold the model confidence. However, it is important to keep in mind that as this tolerance for a certain δ becomes more rigid, the number of acceptable decisions will also reduce. We define the retention rate for a given confidence threshold as the ratio of the number of points having a δ-score of at least that confidence score to the number of samples available for decision. Table 4 shows the retention (r_r) and misclassification (m_r) rates for some standard binary classification datasets [10, 11, 21, 29].

In addition, we can use the tried and tested classifier metrics, namely AUC, ROC and Precision-Recall curves, in order to evaluate the classifier per confidence-score level. We simply compute the TPR-FPR and Precision-Recall curves for the classifier, considering only points that have a specific confidence level. Figure 2 shows an example of the same. As one can note, the classifier performance improves when low-confidence labels are withheld.
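Definition 3.1 can be evaluated directly on the discrete quantile grid the network outputs; a NumPy sketch (the grid handling is our implementation choice, not from the paper):

```python
import numpy as np

def delta_score(q, taus):
    """Confidence score: the smallest grid distance d from the median such
    that the interval [Q_x(0.5 - d), Q_x(0.5 + d)] contains 0.

    q    : predicted quantiles Q_x(tau) for one sample
    taus : the matching, increasing tau grid containing 0.5
    """
    q = np.asarray(q, dtype=float)
    taus = np.asarray(taus, dtype=float)
    for d in np.unique(np.round(np.abs(taus - 0.5), 6)):  # 0.0, 0.1, ...
        lo = q[np.argmin(np.abs(taus - (0.5 - d)))]
        hi = q[np.argmin(np.abs(taus - (0.5 + d)))]
        if lo <= 0.0 <= hi:
            return float(d)
    return 0.5  # 0 lies outside every predicted quantile: maximal confidence
```

By Theorem 3.1, the expected misclassification rate of a sample is then 0.5 minus this score.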
The per-δ-score performance can be evaluated and used as another metric when deciding whether or not a prediction should be rejected.

Figure 2: Per δ-score AUC-ROC curves

Note: The R² score for Yacht is correct. The value is computed using scikit-learn's r2_score, which ranges from −∞ to 1.

Table 4: Retention (r_r) and misclassification (m_r) rates per δ-score threshold, with R² goodness of fit

δ-Score in Artificially Created Classification Tasks

To compare our δ_x with TS, we compute both of them for all the samples in a dataset. Following this, we rank the samples based on TS and create 10 equi-distributed bins, one bin per decile. We then compute the average δ_x and TS of all points in each bin. As per [20], low-ranking points are the ones likely to be misclassified. As seen in Table 5, as the TS bin decile increases, the average δ_x score also increases as expected, indicating that our method captures the expected trend.

Misclassification (m_r) and retention (r_r) rates per confidence score (δ) for the image datasets

These datasets report high δ_x values. This makes perfect sense, as these are easy-to-classify datasets. However, by construction, TS involves ranking the samples, and as a result it is possible for some samples with low confidence to get a high TS and vice versa, which is undesirable. Unlike TS, δ_x is well calibrated due to Theorem 3.1, and requires no other models to be run on top, thanks to the conditional quantiles.

As noted previously in Section 3, BCE can be overconfident or can be fooled easily in the presence of noise. While confidence scores may help detect such spurious cases, is it possible to reduce them in the first place? That brings us to developing robust estimation techniques. It is widely established that quantiles are robust estimators against noise in the response variable (vertical noise).
In the classification setting, [14] showcase the ability of the MAE of the class probabilities to be more robust to label noise than Categorical Cross Entropy, but note that training under MAE can be slow. In our case, having shown the promise of BQR in UQ, we are interested in seeing whether the median class probabilities are robust to label noise. To study this, we use the same networks as before, with one trained on BCE and the other on BQR. For each dataset, we vary the percentage of wrongly labelled samples in the training set, and compare the accuracy of the model on the entire real dataset. The results can be seen in Table 6. At no noise, the BQR-based classifier is usually equivalent to, or slightly less accurate than, the BCE one; however, as noise increases (20% and above), the BQR classifier eventually begins to outperform the BCE one, though it is to be noted that these noise levels are extremely high when this occurs.

Table 5: δ-score and Trustscore values per Trustscore bin

                    % of Flipped Labels
Dataset     Loss  0%     10%    20%    30%    40%
Banknote    BCE   1.000  1.000  0.998  0.986  0.925
            BQR   1.000  0.999  0.997  0.989  0.939
Haberman    BCE   0.767  0.766  0.744  0.733  0.642
            BQR   0.764  0.763  0.749  0.735  0.688
Heart       BCE   0.919  0.881  0.822  0.723  0.645
            BQR   0.899  0.859  0.836  0.785  0.700
Ionosphere  BCE   0.962  0.923  0.881  0.799  0.666
            BQR   0.950  0.916  0.887  0.841  0.706
Pima        BCE   0.817  0.803  0.776  0.714  0.618
            BQR   0.802  0.792  0.776  0.735  0.683
Sonar       BCE   0.957  0.880  0.786  0.700  0.586
            BQR   0.946  0.875  0.801  0.718  0.607
Titanic     BCE   0.874  0.868  0.858  0.828  0.756
            BQR   0.872  0.866  0.859  0.845  0.805
WBC         BCE   0.978  0.970  0.964  0.930  0.841
            BQR   0.975  0.970  0.967  0.951  0.917
Table 6: BCE vs BQR loss on label noise

A strong criticism of Deep Learning models is that they are black boxes in nature [40].
A wide variety of techniques that attempt to explain model predictions in terms of activations, saliency maps, and counterfactuals based on gradient propagation have been proposed and are actively being developed [43, 44, 46]. In the majority of cases, the same techniques can be applied to DNNs fit with BQR as well. There are additional classes of explanations for mean predictions, like Shapley values [28] and LIME [39]. Below, we show how the conditional quantiles can be used to estimate conditional effects, as well as to report conditional means.

Recall that Q_x(τ) is the conditional quantile at the covariate x, where τ is chosen over a set of discrete values T ∈ {τ₀, τ₁, ..., τ_m, τ_{m+1}}, with τ₀ = 0, τ_{m+1} = 1, and there are m outputs available from the neural net corresponding to the remaining values of τ. We can get a smoothed version of the conditional quantiles by

Q^s_x(τ), τ ∈ (0, 1) = Σ_{i=0}^m Q_x(τ_i) ∫_{τ_i}^{τ_{i+1}} (1/h) K((τ − p)/h) dp,

where K(·) is a suitable kernel with bandwidth parameter h [36]. In our examples, we used a Gaussian kernel with the bandwidth set to 0.1. Now, one can immediately compute any univariate statistic. In particular, the mean response can be computed as E(f(x)) = ∫₀¹ Q^s_x(τ) dτ. Likewise, Var(f(x)) can also be computed. In fact, any quantity of interest can be computed simply by post-processing the smoothed full distribution [37].

Figure 3: Heart disease latent vs. max heart rate for an average patient (interpolated)

Figure 3 showcases how quantiles can be used to aid explainable predictions. In these graphs, the average metrics of a patient in the heart disease dataset were computed, and the quantiles were predicted using these average parameters while varying the maximum heart rate from the minimum recorded value to the maximum, in steps of 1. The figure graphically showcases the region of uncertainty, something which cannot be obtained from conventional binary classifiers, as they provide only a single threshold value. Figure 4 showcases the Shapley summary statistics of the mean response of the latent on the test data of the heart disease dataset via quantile interpolation.

The smoothed quantiles also show how it is possible to obtain more fine-grained values of the confidence metric δ, while keeping the number of prediction quantile outputs manageable.
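The kernel-smoothing step and the mean-recovery integral can be written compactly; a NumPy sketch using a Gaussian kernel, with midpoint interval edges as our implementation choice (the paper leaves the interval construction to the reader):

```python
import numpy as np
from math import erf, sqrt

def _Phi(u):
    """Standard normal CDF, vectorised over a numpy array."""
    return 0.5 * (1.0 + np.frompyfunc(erf, 1, 1)(np.asarray(u) / sqrt(2.0)).astype(float))

def smoothed_quantile(tau, q, taus, h=0.1):
    """Q^s_x(tau): each discrete quantile Q_x(tau_i) is weighted by the mass a
    Gaussian kernel of bandwidth h places on its tau-interval."""
    q, taus = np.asarray(q, float), np.asarray(taus, float)
    edges = np.concatenate(([0.0], (taus[1:] + taus[:-1]) / 2.0, [1.0]))
    w = _Phi((edges[1:] - tau) / h) - _Phi((edges[:-1] - tau) / h)
    return float(np.dot(q, w / w.sum()))  # renormalise the truncated mass

def conditional_mean(q, taus, h=0.1):
    """E[f(x)]: integral of Q^s_x(tau) over (0, 1), on a fine tau grid."""
    grid = np.linspace(0.005, 0.995, 199)
    return float(np.mean([smoothed_quantile(t, q, taus, h) for t in grid]))
```

Any other univariate summary (variance, trimmed means, and so on) follows the same pattern of post-processing the smoothed quantile function.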
Figure 4: An example of the obtained Shapley statistics of the mean response of the heart disease dataset

A critical parameter in training DNNs via Stochastic Gradient Descent is the learning rate. One of the early approaches to adapt the learning rate is to recognize the inverse relationship between the step size of the gradient descent update and the Lipschitz constant of the function being optimized [4]. Since the Lipschitz constant is generally unknown a priori, [3, 35] estimate a local approximation during training of feedforward networks. Recent works have derived adaptive learning rates by exploiting the gradient properties of deep ReLU networks [49, 41], and successfully applied them to train models on large datasets [45]. We summarize their results in the following proposition:
Proposition 6.1.
In a deep ReLU network, let the constant k_z be the supremum of the gradients w.r.t. the function, and let L be the Lipschitz constant of the loss. Then, the adaptive learning rate η is η = (k_z L)⁻¹, where the weight update rule is: w_t = w_{t−1} − η ∇L(f(x)).

This particular choice of LALR, under the assumption that gradients cannot change arbitrarily fast, ensures a convex quadratic upper bound, minimized by the descent step.

To show the efficacy of the Lipschitz-constant-based adaptive learning rate, we compared the performance of the adaptive learning rate versus fixed learning rates of 0.01 and 0.1, and tested how quickly they were able to reach a specified target accuracy in terms of the number of epochs. The results can be found in Table 7. N/A indicates that the classifier was unable to reach the accuracy threshold within 5000 epochs (500 for IMDB); if this occurs, the maximum accuracy reached is provided as well. For IMDB, we use an embedding dimension of size 100, and a 2-layer LSTM of dimension 256 which feeds a linear layer.

Dataset     Accuracy  N₀.₀₁        N₀.₁  N_LALR
Banknote    0.99      945          104   14
Haberman    0.80      N/A (0.775)  773   78
Heart       0.85      221          161   4
ILP         0.75      N/A (0.733)  317   28
IMDB        0.90      N/A (0.716)  106   27
Ionosphere  0.90      1841         104   6
Pima        0.80      4099         417   53
Sonar       0.97      2199         320   40
Titanic     0.87      982          152   17
Wisconsin   0.97      1577         101   8
Table 7: Convergence comparison between the adaptive and fixed learning rates for SGD

For our image datasets, we compared the efficacy of the LALR on various deep architectures. The ResNet [16] implementations are PyTorch's default ResNet18 and ResNet50 implementations of 18 and 50 layers each, while the DenseNet [18] implementation is PyTorch's DenseNet121 architecture, which contains 4 Dense blocks, for a total of 121 layers. For all models, the optimizer was SGD, and each test was run for 20 epochs.
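Combining Proposition 6.1 with Lemma 2.1 (L = max(τ, 1 − τ) for the BQR loss) gives a one-line learning-rate rule; a minimal sketch (ours), with k_z assumed to be estimated from the network's gradients:

```python
import numpy as np

def lalr(tau, k_z):
    """Adaptive learning rate eta = 1 / (k_z * L), where L = max(tau, 1 - tau)
    is the Lipschitz constant of the BQR loss (Lemma 2.1)."""
    return 1.0 / (k_z * max(tau, 1.0 - tau))

def sgd_step(w, grad, tau, k_z):
    """One update w_t = w_{t-1} - eta * grad (Proposition 6.1)."""
    return np.asarray(w) - lalr(tau, k_z) * np.asarray(grad)
```

Note that as τ moves away from 0.5 the loss becomes steeper and the rule automatically shrinks the step size.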
We obtained the best validation accuracy at LR = 0.01, and found the number of epochs needed for the other learning rates to achieve both a training and validation accuracy equal to or greater than that of LR = 0.01. The results can be seen in Table 8. For all our tests, the LALR-based models converged faster, barring Asirra on DenseNet.

Dataset    Arch.     Target Acc.  LR            N_E  T_E (min)
Asirra     Resnet18  0.76         Fixed (0.01)  16   2.5
                                  Fixed (0.1)   15   2.5
                                  Adaptive      6    4.2
           Resnet50  0.70         Fixed (0.01)  20   4.7
                                  Fixed (0.1)   16   4.7
                                  Adaptive      5    7.0
           Densenet  0.86         Fixed (0.01)  20   4.9
                                  Fixed (0.1)   8    4.9
                                  Adaptive      9    12.1
Pneumonia  Resnet18  0.83         Fixed (0.01)  18   1.2
                                  Fixed (0.1)   7    1.2
                                  Adaptive      6    2.3
           Resnet50  0.82         Fixed (0.01)  20   1.9
                                  Fixed (0.1)   15   1.9
                                  Adaptive      9    3.2
           Densenet  0.81         Fixed (0.01)  20   1.9
                                  Fixed (0.1)   10   1.9
                                  Adaptive      3    3.2
Table 8: Adaptive LR performance in deep binary image classification

To summarize, in this work we put forth the Binary Quantile Regression loss function: a loss function that allows DNNs for binary classification to learn the quantiles of the latent function learnt by the network. By estimating these quantiles, we are able to gain additional insight into the uncertainty of the predictions of the network in real time. To further this, we also extend this uncertainty quantification technique to the sample confidence score we term δ_x. We show that δ_x is an accurate measure of uncertainty that provides a mathematical likelihood of misclassification, as per the function learnt by the model. Following this, we explore how quantiles provide solutions to current open problems in the deep learning space, by being more robust to extreme amounts of label noise and by allowing standard function-explanation techniques to be applied to DNN outputs. Finally, we show how recent advances in adaptive learning rates can be applied to BQR as well.

To conclude, BQR allows us to enhance binary classification networks by providing additional information at prediction time, with no impact on performance. The quantile outputs obtained have a variety of use cases, the most potent of which is the ability to provide the uncertainty metric we describe.

One glaring limitation, however, is the fact that BQR applies only to binary classification tasks, and cannot be used directly in a multi-class setting. One way of overcoming this drawback is to use a one-vs-all approach, akin to multiclass classification with SVMs. Alternatively, when the classes are ordinal, we can extend the binary thresholded model to include more cut-points. Otherwise, it is not trivial to extend BQR to the multi-class setting, because there is no unique way to define multivariate quantiles. However, recent research on depth quantiles [8, 15] could allow BQR to be extended in this direction.

Acknowledgement
The authors would like to thank the Science and Engineering Research Board (SERB), Department of Science and Technology, Government of India, for supporting our research by providing us with resources to conduct our experiments. The project reference number is EMR/2016/005687. The authors are indebted to Prof. Probal Choudhury, ISI Kolkata, for suggestions and critical insights which helped the manuscript immensely. Anuj and Anirudh would like to thank Inumella Sricharan for his help, and Dr. K S Srinivas of the Department of CS&E, PES University, for his help, advice and encouragement.

References

[1] Shakkeel Ahmed, Ravi S. Mula, and Soma S. Dhavala. A framework for democratizing AI.
ArXiv, abs/2001.00818, 2020.
[2] Pritam Anand, Reshma Rastogi, and Suresh Chandra. A new asymmetric ε-insensitive pinball loss function based support vector quantile regression model, 2019.
[3] George S. Androulakis, Michael N. Vrahatis, and George D. Magoulas. Effective backpropagation training with variable stepsize. Neural Networks, 10(1):69–82, January 1997.
[4] Larry Armijo. Minimization of functions having Lipschitz continuous first partial derivatives. Pacific J. Math., 16(1):1–3, 1966.
[5] Dries F. Benoit and Dirk Van den Poel. Binary quantile regression: a Bayesian approach based on the asymmetric Laplace distribution. Journal of Applied Econometrics, 27(7):1174–1188, 2012.
[6] A. Colin Cameron and Pravin K. Trivedi. Microeconometrics Using Stata, Revised Edition. Stata Press, 2nd edition, 2010.
[7] Probal Chaudhuri. Generalized regression quantiles: forming a useful toolkit for robust linear regression. In L1 Statistical Analysis and Related Methods – Proceedings of the Second International Conference on L1 Norm and Related Methods, pages 169–185. North Holland: Amsterdam, 1992.
[8] Probal Chaudhuri. On a geometric notion of quantiles for multivariate data. Journal of the American Statistical Association, 91:862–872, 1996.
[9] Probal Chaudhuri, K. Doksum, and A. Samarov. On average derivative quantile regression. Ann. Statist., 25(2):715–744, 1997.
[10] Dheeru Dua and Casey Graff. UCI machine learning repository, 2017.
[11] Jeremy Elson, John (JD) Douceur, Jon Howell, and Jared Saul. Asirra: A CAPTCHA that exploits interest-aligned manual image categorization. In Proceedings of the 14th ACM Conference on Computer and Communications Security (CCS), October 2007.
[12] Max H. Farrell, Tengyuan Liang, and Sanjog Misra. Deep neural networks for estimation and inference: Application to causal effects and other semiparametric estimands. ArXiv, 2018.
[13] Yarin Gal and Zoubin Ghahramani. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In Proceedings of the 33rd International Conference on Machine Learning (ICML-16), 2016.
[14] Aritra Ghosh, Himanshu Kumar, and P. S. Sastry. Robust loss functions under label noise for deep neural networks. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, AAAI'17, pages 1919–1925, 2017.
[15] Marc Hallin, Davy Paindaveine, and Miroslav Šiman. Multivariate quantiles and multiple-output regression quantiles: From L1 optimization to halfspace depth. Ann. Statist., 38(2):635–669, 2010.
[16] Kaiming He, X. Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778, 2016.
[17] J. L. Horowitz. A smoothed maximum score estimator for the binary response model. Econometrica, 60:505–531, 1992.
[18] Gao Huang, Zhuang Liu, and Kilian Q. Weinberger. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2261–2269, 2017.
[19] Ichiro Takeuchi, Quoc V. Le, Timothy D. Sears, and Alexander J. Smola. Nonparametric quantile estimation. Journal of Machine Learning Research, 7:1231–1264, 2006.
[20] H. Jiang, B. Kim, M. Guan, and M. Gupta. To trust or not to trust a classifier. Advances in Neural Information Processing Systems 31, 2018.
[21] Daniel S. Kermany, Michael H. Goldbaum, Wenjia Cai, Carolina Carvalho Soares Valentim, and Kang Zhang. Identifying medical diagnoses and treatable diseases by image-based deep learning. Cell, 172:1122–1131.e9, 2018.
[22] R. Koenker and G. B. Bassett. Regression quantiles. Econometrica, 46:33–50, 1978.
[23] Roger Koenker. Quantile Regression. Econometric Society Monographs. Cambridge University Press, 2005.
[24] G. Kordas. Smoothed binary regression quantiles. Journal of Applied Econometrics, 21:387–407, 2006.
[25] Balaji Lakshminarayanan, Alexander Pritzel, and Charles Blundell. Simple and scalable predictive uncertainty estimation using deep ensembles. In Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS'17, pages 6405–6416, Red Hook, NY, USA, 2017. Curran Associates Inc.
[26] Yann LeCun, Y. Bengio, and Geoffrey Hinton. Deep learning. Nature, 521:436–444, 2015.
[27] Jeremiah Zhe Liu, Zi Lin, Shreyas Padhy, Dustin Tran, Tania Bedrax-Weiss, and Balaji Lakshminarayanan. Simple and principled uncertainty estimation with deterministic deep learning via distance awareness, 2020.
[28] Scott M. Lundberg, Gabriel Erion, Hugh Chen, Alex DeGrave, Jordan M. Prutkin, Bala Nair, Ronit Katz, Jonathan Himmelfarb, Nisha Bansal, and Su-In Lee. From local explanations to global understanding with explainable AI for trees. Nature Machine Intelligence, 2(1):56–67, January 2020.
[29] Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts. Learning word vectors for sentiment analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 142–150, Portland, Oregon, USA, June 2011. Association for Computational Linguistics.
[30] C. F. Manski. Maximum score estimation of the stochastic utility model of choice. Journal of Econometrics, 3:205–228, 1975.
[31] C. F. Manski. Semiparametric analysis of discrete response: Asymptotic properties of the maximum score estimator. Journal of Econometrics, 27:313–333, 1985.
[32] Ricardo Maronna, Douglas Martin, and Victor Yohai. Robust Statistics: Theory and Methods. Wiley, 2006.
[33] Kamil Nar, Orhan Ocal, S. Shankar Sastry, and Kannan Ramchandran. Cross-entropy loss and low-rank features have responsibility for adversarial examples. CoRR, abs/1901.08360, 2019.
[34] Anh Nguyen, Jason Yosinski, and Jeff Clune. Deep neural networks are easily fooled: High confidence predictions for unrecognizable images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
[35] V. P. Plagianakos, Michael Vrahatis, and George Magoulas. Nonmonotone methods for backpropagation training with adaptive learning rate. International Joint Conference on Neural Networks, 3:1762–1767, 1999.
[36] Emanuel Parzen. Nonparametric statistical data modeling. Journal of the American Statistical Association, 74(365):105–121, 1979.
[37] Emanuel Parzen. Quantile probability and statistical data modeling. Statistical Science, 19, 2004.
[38] Stephen Portnoy and Roger Koenker. Adaptive L-estimation for linear models. Ann. Statist., 17(1):362–381, 1989.
[39] Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. "Why should I trust you?": Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '16, pages 1135–1144, New York, NY, USA, 2016. Association for Computing Machinery.
[40] Cynthia Rudin. Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead. Nature Machine Intelligence, 1(5):206–215, May 2019.
[41] S. Saha, Tejas Prashanth, Suraj Aralihalli, Sumedh Basarkod, T. S. B. Sudarshan, and Soma S. Dhavala. LALR: Theoretical and experimental validation of Lipschitz adaptive learning rate in regression and neural networks. International Joint Conference on Neural Networks, abs/2006.13307, 2020.
[42] Kevin Scaman and Aladin Virmaux. Lipschitz regularity of deep neural networks: Analysis and efficient estimation. In Advances in Neural Information Processing Systems 32, pages 3839–3848, 2018.
[43] Avanti Shrikumar, Peyton Greenside, and Anshul Kundaje. Learning important features through propagating activation differences. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, ICML'17, pages 3145–3153. JMLR.org, 2017.
[44] Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. Deep inside convolutional networks: Visualising image classification models and saliency maps, 2014.
[45] Shailesh Sridhar, Snehanshu Saha, A. Shaikh, Rahul Yedida, and Sriparna Saha. Parsimonious computing: A minority training regime for effective prediction in large microarray expression data sets. International Joint Conference on Neural Networks, arxiv.org/abs/2005.08442, 2020.
[46] Mukund Sundararajan, Ankur Taly, and Qiqi Yan. Gradients of counterfactuals. ArXiv, abs/1611.02639, 2016.
[47] Pragya Sur and Emmanuel J. Candès. A modern maximum-likelihood theory for high-dimensional logistic regression. Proceedings of the National Academy of Sciences, 116(29):14516–14525, 2019.
[48] Natasa Tagasovska and David Lopez-Paz. Single-model uncertainties for deep learning. In Advances in Neural Information Processing Systems, pages 6417–6428, 2019.
[49] Rahul Yedida and Snehanshu Saha. LipschitzLR: Using theoretically computed adaptive learning rates for fast convergence. Applied Intelligence, arxiv.org/abs/1902.07399, 2020.
[50] Hui Zou and Ming Yuan. Composite quantile regression and the oracle model selection theory. Annals of Statistics, 36(3):1108–1126, 2008.

Appendix A Proofs
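As a numerical sanity check of Lemma 2.1 below, the Lipschitz bound can be verified on a grid. The sketch below is our own illustrative code (the name `bqr_loss` and the grid choices are ours, not the paper's); it implements the BQR loss $L(y,z)$ with the piecewise link $p_z$ used in the proofs, and confirms by finite differences that the slope never exceeds $\max(\tau, 1-\tau)$.

```python
import numpy as np

def bqr_loss(y, z, tau):
    """BQR loss L(y, z) = -(y log p_z + (1 - y) log(1 - p_z)),
    with p_z = 1 - tau * exp(-(1 - tau) z) for z >= 0
    and  p_z = (1 - tau) * exp(tau z)      for z <  0."""
    z = np.asarray(z, dtype=float)
    p = np.where(z >= 0,
                 1.0 - tau * np.exp(-(1.0 - tau) * z),
                 (1.0 - tau) * np.exp(tau * z))
    return -(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))

# Finite-difference slopes on a grid; Lemma 2.1 says they are
# bounded by max(tau, 1 - tau) for both labels y = 0 and y = 1.
tau = 0.7
z = np.linspace(-5.0, 5.0, 2001)
for y in (0.0, 1.0):
    slopes = np.abs(np.diff(bqr_loss(y, z, tau)) / np.diff(z))
    print(y, slopes.max() <= max(tau, 1.0 - tau) + 1e-9)  # both print True
```

With $\tau = 0.7$, the loss is steeper on the $y=1$ side (slope $\tau$ for $z<0$) than on the $y=0$ side (slope $1-\tau$ for $z>0$), mirroring the asymmetry of the check loss in quantile regression.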
Lemma 2.1

Lemma. The Lipschitz constant of the BQR loss is $\max(\tau, 1-\tau)$.

Proof. Recall that the empirical risk under the BQR loss is
$$L(y, z) = -\big( y \log p_z + (1-y) \log (1 - p_z) \big), \qquad p_z \equiv \begin{cases} 1 - \tau \exp(-(1-\tau)z) & z \ge 0, \\ (1-\tau) \exp(\tau z) & z < 0. \end{cases}$$
Let us consider the following cases.

Case-1a: $0 < z_1 < z_2$, $y = 1$.
$$\frac{|L(1,z_1) - L(1,z_2)|}{|z_1 - z_2|} = \frac{\big|\log\big(1 - \tau e^{-(1-\tau)z_1}\big) - \log\big(1 - \tau e^{-(1-\tau)z_2}\big)\big|}{|z_1 - z_2|}.$$
The RHS approaches its maximum as $z_1, z_2 \to 0$. Taking the limit w.r.t. $z_2$ first, we get
$$\lim_{z_2 \to 0} \frac{|L(1,z_1) - L(1,z_2)|}{|z_1 - z_2|} = \frac{\big|\log\big(1 - \tau e^{-(1-\tau)z_1}\big) - \log(1-\tau)\big|}{z_1},$$
and then taking the limit w.r.t. $z_1$, we get
$$\lim_{z_1, z_2 \to 0} \frac{|L(1,z_1) - L(1,z_2)|}{|z_1 - z_2|} = \tau.$$
Therefore, $\dfrac{|L(1,z_1) - L(1,z_2)|}{|z_1 - z_2|} \le \tau$.

Case-1b: $0 < z_1 < z_2$, $y = 0$.
$$\frac{|L(0,z_1) - L(0,z_2)|}{|z_1 - z_2|} = \frac{\big|\log\big(\tau e^{-(1-\tau)z_1}\big) - \log\big(\tau e^{-(1-\tau)z_2}\big)\big|}{|z_1 - z_2|}.$$
In this case, the RHS simplifies to
$$\frac{|L(0,z_1) - L(0,z_2)|}{|z_1 - z_2|} = \frac{(z_2 - z_1)(1-\tau)}{z_2 - z_1} = 1 - \tau.$$
Therefore, $\dfrac{|L(0,z_1) - L(0,z_2)|}{|z_1 - z_2|} \le 1 - \tau$.

Case-2a: $z_1 < 0 < z_2$, $y = 1$.
$$\frac{|L(1,z_1) - L(1,z_2)|}{|z_1 - z_2|} = \frac{\big|\log\big(1 - \tau e^{-(1-\tau)z_2}\big) - \log\big((1-\tau) e^{\tau z_1}\big)\big|}{|z_1 - z_2|}.$$
The RHS approaches its maximum as $z_1, z_2 \to 0$. Taking the limit w.r.t. $z_1$ first, we get
$$\lim_{z_1 \to 0} \frac{|L(1,z_1) - L(1,z_2)|}{|z_1 - z_2|} = \frac{\big|\log\big(1 - \tau e^{-(1-\tau)z_2}\big) - \log(1-\tau)\big|}{z_2},$$
and then taking the limit w.r.t. $z_2$, we get
$$\lim_{z_1, z_2 \to 0} \frac{|L(1,z_1) - L(1,z_2)|}{|z_1 - z_2|} = \tau.$$
Therefore, $\dfrac{|L(1,z_1) - L(1,z_2)|}{|z_1 - z_2|} \le \tau$.

Case-2b: $z_1 < 0 < z_2$, $y = 0$.
$$\frac{|L(0,z_1) - L(0,z_2)|}{|z_1 - z_2|} = \frac{\big|\log\big(\tau e^{-(1-\tau)z_2}\big) - \log\big(1 - (1-\tau) e^{\tau z_1}\big)\big|}{|z_1 - z_2|}.$$
The RHS approaches its maximum as $z_1, z_2 \to 0$. Taking the limit w.r.t. $z_2$ first, we get
$$\lim_{z_2 \to 0} \frac{|L(0,z_1) - L(0,z_2)|}{|z_1 - z_2|} = \frac{\big|\log(\tau) - \log\big(1 - (1-\tau) e^{\tau z_1}\big)\big|}{|z_1|},$$
and then taking the limit w.r.t. $z_1$, we get
$$\lim_{z_1, z_2 \to 0} \frac{|L(0,z_1) - L(0,z_2)|}{|z_1 - z_2|} = 1 - \tau.$$
Therefore, $\dfrac{|L(0,z_1) - L(0,z_2)|}{|z_1 - z_2|} \le 1 - \tau$.

Case-3a: $z_1 < z_2 < 0$, $y = 1$.
$$\frac{|L(1,z_1) - L(1,z_2)|}{|z_1 - z_2|} = \frac{\big|\log\big((1-\tau) e^{\tau z_1}\big) - \log\big((1-\tau) e^{\tau z_2}\big)\big|}{|z_1 - z_2|}.$$
The RHS simplifies to
$$\frac{|L(1,z_1) - L(1,z_2)|}{|z_1 - z_2|} = \frac{\tau (z_2 - z_1)}{z_2 - z_1} = \tau.$$
Therefore, $\dfrac{|L(1,z_1) - L(1,z_2)|}{|z_1 - z_2|} \le \tau$.

Case-3b: $z_1 < z_2 < 0$, $y = 0$.
$$\frac{|L(0,z_1) - L(0,z_2)|}{|z_1 - z_2|} = \frac{\big|\log\big(1 - (1-\tau) e^{\tau z_1}\big) - \log\big(1 - (1-\tau) e^{\tau z_2}\big)\big|}{|z_1 - z_2|}.$$
The RHS approaches its maximum as $z_1, z_2 \to 0$. Taking the limit w.r.t. $z_2$ first, we get
$$\lim_{z_2 \to 0} \frac{|L(0,z_1) - L(0,z_2)|}{|z_1 - z_2|} = \frac{\big|\log\big(1 - (1-\tau) e^{\tau z_1}\big) - \log(\tau)\big|}{|z_1|},$$
and then taking the limit w.r.t. $z_1$, we get
$$\lim_{z_1, z_2 \to 0} \frac{|L(0,z_1) - L(0,z_2)|}{|z_1 - z_2|} = 1 - \tau.$$
Therefore, $\dfrac{|L(0,z_1) - L(0,z_2)|}{|z_1 - z_2|} \le 1 - \tau$.

Hence, for all $z_1, z_2 \in \mathbb{R}$ and $y \in \{0, 1\}$,
$$\frac{|L(y,z_1) - L(y,z_2)|}{|z_1 - z_2|} \le \max(\tau, 1-\tau).$$

Lemma 2.2

Lemma.
BQR also admits a bound in terms of the curvature of the function $f^*$. That is,
$$c_1 \, E\big((f - f^*)^2\big) \;\le\; E\big(L(y, f) - L(y, f^*)\big) \;\le\; c_2 \, E\big((f - f^*)^2\big),$$
where $c_1$ and $c_2$ are constants, bounded away from 0.

Proof. Recall that the empirical risk under the BQR loss is
$$L(y, z) = -\big( y \log p_z + (1-y) \log (1 - p_z) \big), \qquad p_z \equiv \begin{cases} 1 - \tau \exp(-(1-\tau)z) & z \ge 0, \\ (1-\tau) \exp(\tau z) & z < 0. \end{cases}$$
Using the Taylor series expansion of $h_a(b) = L(b, y) - L(a, y)$, with $a = f$, $b = f^*$, we can write
$$h_a(b) = h_a(a) + h_a'(a)(b - a) + \tfrac{1}{2} h_a''(a)(b - a)^2.$$
We will be looking at $h_a''$ to determine the bounds for the curvature of the loss function. Writing $t \equiv \tau$, and letting $g(a)$ collect the terms that do not depend on $b$, let us consider the following cases.

Case-1: $b \ge 0$, $a \ge 0$.
$$h_a(b) = -\big(1 - t e^{-(1-t)a}\big)\log\big(1 - t e^{-(1-t)b}\big) - t e^{-(1-t)a}\log\big(t e^{-(1-t)b}\big) + g(a)$$
$$= -\big(1 - t e^{-(1-t)a}\big)\log\big(1 - t e^{-(1-t)b}\big) - t e^{-(1-t)a}\big(\log t - (1-t)b\big) + g(a),$$
$$h_a'(b) = -\big(1 - t e^{-(1-t)a}\big)\frac{t(1-t)e^{-(1-t)b}}{1 - t e^{-(1-t)b}} + t(1-t)e^{-(1-t)a},$$
$$h_a''(b) = \big(1 - t e^{-(1-t)a}\big)\frac{t(1-t)^2 e^{-(1-t)b}}{\big(1 - t e^{-(1-t)b}\big)^2}.$$
$h_a''(b)$ is maximum at $a = 0, b = 0$, and minimum at $a = M, b = M$; therefore
$$A_1 \equiv \frac{t(1-t)^2 e^{-(1-t)M}}{1 - t e^{-(1-t)M}} \le h_a''(b), \qquad h_a''(b) \le t(1-t).$$

Case-2: $b \le 0$, $a \le 0$.
$$h_a(b) = -(1-t)e^{ta}\log\big((1-t)e^{tb}\big) - \big(1 - (1-t)e^{ta}\big)\log\big(1 - (1-t)e^{tb}\big) + g(a)$$
$$= -(1-t)e^{ta}\big(\log(1-t) + tb\big) - \big(1 - (1-t)e^{ta}\big)\log\big(1 - (1-t)e^{tb}\big) + g(a),$$
$$h_a'(b) = -t(1-t)e^{ta} + \big(1 - (1-t)e^{ta}\big)\frac{t(1-t)e^{tb}}{1 - (1-t)e^{tb}},$$
$$h_a''(b) = \big(1 - (1-t)e^{ta}\big)\frac{t^2(1-t)e^{tb}}{\big(1 - (1-t)e^{tb}\big)^2}.$$
$h_a''(b)$ is maximum at $a = 0, b = 0$, and minimum at $a = -M, b = -M$; therefore
$$A_2 \equiv \frac{t^2(1-t)e^{-tM}}{1 - (1-t)e^{-tM}} \le h_a''(b), \qquad h_a''(b) \le t(1-t).$$

Case-3: $b \ge 0$, $a \le 0$.
$$h_a(b) = -(1-t)e^{ta}\log\big(1 - t e^{-(1-t)b}\big) - \big(1 - (1-t)e^{ta}\big)\log\big(t e^{-(1-t)b}\big) + g(a)$$
$$= -(1-t)e^{ta}\log\big(1 - t e^{-(1-t)b}\big) - \big(1 - (1-t)e^{ta}\big)\big(\log t - (1-t)b\big) + g(a),$$
$$h_a'(b) = -(1-t)e^{ta}\frac{t(1-t)e^{-(1-t)b}}{1 - t e^{-(1-t)b}} + \big(1 - (1-t)e^{ta}\big)(1-t),$$
$$h_a''(b) = \frac{t(1-t)^3 e^{ta} e^{-(1-t)b}}{\big(1 - t e^{-(1-t)b}\big)^2}.$$
$h_a''(b)$ is maximum at $a = 0, b = 0$, and minimum at $a = -M, b = M$; therefore
$$A_3 \equiv \frac{t(1-t)^3 e^{-M}}{\big(1 - t e^{-(1-t)M}\big)^2} \le h_a''(b), \qquad h_a''(b) \le t(1-t).$$

Case-4: $b \le 0$, $a \ge 0$.
$$h_a(b) = -\big(1 - t e^{-(1-t)a}\big)\log\big((1-t)e^{tb}\big) - t e^{-(1-t)a}\log\big(1 - (1-t)e^{tb}\big) + g(a)$$
$$= -\big(1 - t e^{-(1-t)a}\big)\big(\log(1-t) + tb\big) - t e^{-(1-t)a}\log\big(1 - (1-t)e^{tb}\big) + g(a),$$
$$h_a'(b) = -t\big(1 - t e^{-(1-t)a}\big) + t e^{-(1-t)a}\frac{t(1-t)e^{tb}}{1 - (1-t)e^{tb}},$$
$$h_a''(b) = \frac{t^3(1-t) e^{-(1-t)a} e^{tb}}{\big(1 - (1-t)e^{tb}\big)^2}.$$
$h_a''(b)$ is maximum at $a = 0, b = 0$, and minimum at $a = M, b = -M$; therefore
$$A_4 \equiv \frac{t^3(1-t) e^{-M}}{\big(1 - (1-t)e^{-M}\big)^2} \le h_a''(b), \qquad h_a''(b) \le t(1-t).$$

Therefore,
$$c_1 = 0.5\,\min(A_1, A_2, A_3, A_4), \qquad c_2 = 0.5\, t(1-t).$$

Theorem 2.3

Theorem.