High-Quality Prediction Intervals for Deep Learning: A Distribution-Free, Ensembled Approach
Tim Pearce
Mohamed Zaki    Alexandra Brintrup    Andy Neely

Department of Engineering, University of Cambridge, UK; Alan Turing Institute, UK. Correspondence to: Tim Pearce <[email protected]>.

Abstract
This paper considers the generation of prediction intervals (PIs) by neural networks for quantifying uncertainty in regression tasks. It is axiomatic that high-quality PIs should be as narrow as possible, whilst capturing a specified portion of data. We derive a loss function directly from this axiom that requires no distributional assumption. We show how its form derives from a likelihood principle, that it can be used with gradient descent, and that model uncertainty is accounted for in ensembled form. Benchmark experiments show the method outperforms current state-of-the-art uncertainty quantification methods, reducing average PI width by over 10%.
1. Introduction
Deep neural networks (NNs) have achieved impressive performance in a wide variety of tasks in recent years; however, success is generally measured in terms of aggregated accuracy metrics. For many real-world applications, it is not enough that on average a model performs well; rather, the uncertainty of each prediction must also be quantified. This can be particularly important where there is a large downside to an incorrect prediction: examples can be found in prognostics, manufacturing, finance, weather, traffic and energy networks. There is therefore interest in how NNs can be modified to meet this requirement (Krzywinski & Altman, 2013; Gal, 2016).

In this work the output of prediction intervals (PIs) in regression tasks is considered. Whilst NNs by default output point estimates, PIs directly communicate uncertainty, offering a lower and upper bound for a prediction and assurance that, with some high probability (e.g. 95% or 99%), the realised data point will fall between these bounds. Having this information allows for better-informed decisions.
As an example, a point estimate stating that a machine will fail in 60 days may not be sufficient to schedule a repair; however, given a PI of 45-65 days with 99% probability, the timing of a repair is easily scheduled.

A diverse set of approaches has been developed to quantify NN uncertainty, ranging from fully Bayesian NNs (BNNs) (MacKay, 1992), to interpreting dropout as performing variational inference (Gal & Ghahramani, 2015). These require either high computational demands or strong assumptions. In this work we formulate PI output as a constrained optimisation problem. It is self-evident that high-quality PIs should be as narrow as possible, whilst capturing some specified proportion of data points (hereafter referred to as the HQ principle). Indeed it is through these metrics that PI quality is often assessed (Papadopoulos et al., 2000; Khosravi et al., 2011b; Galván et al., 2017). We show how a loss function can be derived directly from this HQ principle, and used in an ensemble to produce PIs accounting for both model uncertainty and data noise variance. The key advantages of the method are its intuitive objective, low computational demand, robustness to outliers, and lack of distributional assumption.

Notably, we build on the work of Khosravi et al. (2011a), who developed the Lower Upper Bound Estimation (LUBE) method, incorporating the HQ principle directly into the NN loss function for the first time. LUBE is gaining popularity in several communities, for example in the forecasting of energy demand and wind speed (section 2). However, we have identified several limitations of its current form.

• Gradient Descent - It was stated that the method was incompatible with gradient descent (GD), a belief carried forward, unchallenged, in all subsequent work (section 2). Implementations therefore require non-gradient-based methods for training, such as Simulated Annealing (SA) and Particle Swarm Optimisation (PSO). This is inconvenient since GD has become the standard training method for NNs (Goodfellow et al., 2016), used by all modern NN APIs.

• Loss Form - Its current form suffers from several problems. The function is at a global minimum when all PIs are reduced to zero. It was also designed through qualitative assessment of the desired behaviour rather than on a statistical basis.

• Model Uncertainty - LUBE accounts only for data noise variance and not model uncertainty (section 2.1). This is an oversimplification (Heskes, 1996), implicitly assuming that training data fully populates the input space, which is seldom the case.

In this work we develop a model addressing each of these issues - henceforth referred to as the quality-driven PI method (QD), and QD-Ens when explicitly referring to the ensembled form.

We link early literature on PIs for NNs (Tibshirani, 1996; Heskes, 1996; Papadopoulos et al., 2000; Khosravi et al., 2011a) with recent work on uncertainty in deep learning (Hernández-Lobato & Adams, 2015; Gal & Ghahramani, 2015; Lakshminarayanan et al., 2017) - areas which have remained surprisingly distinct. We achieve this by following the same experimental procedure of recent work, assessing performance across ten benchmark regression datasets. We compare QD's performance with the current best-performing model, originally named Deep Ensembles (Lakshminarayanan et al., 2017), here referred to as MVE-Ens. We show that QD outperforms in PI quality metrics, achieving closer to the desired coverage proportion, and reducing average PI width by around 10%.
2. Related Work
In this section we consider methods to quantify uncertainty in regression with NNs. Three review papers catalogued early work (Tibshirani, 1996; Papadopoulos et al., 2000; Khosravi et al., 2011b), the latter two specifically considering PIs. Three primary methods were presented:

• The Delta method adopts theory for building confidence intervals (CIs) used by general non-linear regression models, estimating model uncertainty. It is computationally demanding as it requires use of the Hessian matrix.

• Mean Variance Estimation (MVE) (Nix & Weigend, 1994) uses a NN with two output nodes - one representing the mean and the other the variance of a normal distribution - allowing estimation of data noise variance. The loss function used is the Negative Log Likelihood (NLL) of the predicted distribution given the data.

• The Bootstrap (Heskes, 1996) estimates model uncertainty. It trains multiple NNs with different parameter initialisations on different resampled versions of the training dataset. It is easily combined with MVE to estimate total variance.

In addition, BNNs treat model parameters as distributions rather than point estimates (MacKay, 1992), and hence can predict distributions rather than point estimates. Their drawback is that the computational cost of running MCMC algorithms can be prohibitive. Recent work has focused on addressing this (Graves, 2011; Hernández-Lobato & Adams, 2015; Blundell et al., 2015); notably, NNs with dropout may be interpreted as performing variational inference (Gal & Ghahramani, 2015).

Lakshminarayanan et al. (2017) produced a modernisation of Heskes' work (1996), ensembling individual MVE NNs (without resampling the dataset - section 4), and including adversarial training examples. We henceforth refer to this as MVE Ensemble (MVE-Ens). Another MVE extension encourages high uncertainty in data regions not observed by augmenting training data with synthetic 'out-of-distribution' samples of high variance (Malinin et al., 2017).

Many of these modern works complied with an experimental protocol laid out by Hernández-Lobato & Adams (2015), assessing NLL & RMSE across ten benchmark regression datasets, with MVE-Ens the current best performer. By contrast, the PI literature reports metrics around coverage proportion and PI width.

LUBE (Khosravi et al., 2011a) was developed on the HQ principle. Originally it was proposed with SA as the training method, and much effort has gone toward trialling it with various non-gradient-based training methods including Genetic Algorithms (Ak et al., 2013b), Gravitational Search Algorithms (Lian et al., 2016), PSO (Galván et al., 2017; Wang et al., 2017), Extreme Learning Machines (Sun et al., 2017), and Artificial Bee Colony Algorithms (Shen et al., 2018). Multi-objective optimisation has been found useful in considering the trade-off between PI width and coverage (Galván et al., 2017; Shen et al., 2018).

LUBE has been used in a plethora of application-focused work: particularly in energy load (Pinson & Kariniotakis, 2013; Quan et al., 2014) and wind speed forecasting (Wang et al., 2017; Ak et al., 2013b), but also prediction of landslide displacement (Lian et al., 2016), gas flow (Sun et al., 2017), solar energy (Galván et al., 2017), condition-based maintenance (Ak et al., 2013a), and others. All work has used LUBE as a single NN, making no attempt to account for model uncertainty (section 2.1).
2.1. Uncertainty in Regression

This section describes uncertainty in regression; it is an agglomeration of several prominent works (Tibshirani, 1996; Heskes, 1996; Papadopoulos et al., 2000; Shafer & Vovk, 2008; Mazloumi et al., 2011; Khosravi et al., 2011b; Lakshminarayanan et al., 2017), each of whom presented similar concepts but under different guises and terminology. We attempt to reconcile them here.

The philosophy behind regression is that some data-generating function, f(x), exists, which combines with additive noise to produce observable target values y,

    y = f(x) + ε.   (1)

The ε component is termed irreducible noise or data noise. It may exist due to exclusion of (minor) explanatory variables in x, or due to an inherently stochastic process. Some models, for example the Delta method, assume ε is constant across the input space (homoskedastic); others allow for it to vary (heteroskedastic), for example MVE.

Generally the goal of regression is to produce an estimate f̂(x), which allows prediction of point estimates (ε is assumed to have mean zero). However, when estimating the uncertainty of y, additional terms must be estimated. Given that both terms of eq. (1) have associated sources of uncertainty, and assuming they are independent, the total variance of observations is given by,

    σ²_y = σ²_model + σ²_noise,   (2)

with σ²_model termed model uncertainty or epistemic uncertainty - uncertainty in f̂(x) - and σ²_noise termed irreducible variance, data noise variance, or aleatoric uncertainty.

It is worth here distinguishing CIs from PIs. CIs consider the distribution Pr(f(x) | f̂(x)), and hence only require estimation of σ²_model, whilst PIs consider Pr(y | f̂(x)) and must also consider σ²_noise. PIs are necessarily wider than CIs.

Model uncertainty can be attributed to several factors.

• Model misspecification or bias - how closely f̂(x) is able to approximate f(x), assuming ideal parameters and plentiful training data.

• Training data uncertainty or variance - training data is a sample from an input distribution. There is uncertainty over how representative the sample is, and how sensitive the model is to other samples.

• Parameter uncertainty - uncertainty exists around the optimum parameters of the model, increasing in regions sparsely represented in the training data.

Different model types have different weightings for each of these factors (bias-variance trade-off). Provided the number of hidden neurons is large relative to the complexity of f(x), NNs are considered to have low bias and high variance. Work on uncertainty in NNs therefore generally ignores model misspecification, and only estimates training data uncertainty and parameter uncertainty (Heskes, 1996).

To construct PIs, σ²_y must be estimated at each prediction point. In regions of the input space with more data, σ²_model decreases, and σ²_noise may become the larger component (at the same time, σ²_noise may be estimated with more certainty, although the uncertainty of this estimate itself is not generally considered; conformal prediction provides a framework to assess this). In regions of the input space with little data, σ²_model grows. Lakshminarayanan et al. (2017) recognise this in more intuitive terms - that two sources of uncertainty exist.

1. Calibration - data noise variance in regions which are well represented by the training data.

2. Out-of-distribution - uniqueness of an input; inputs less similar to training data should lead to less certain estimates.
3. A Quality-Driven, Distribution-Free Loss Function
We now derive a loss function based on the HQ principle. Let the set of input covariates and target observations be X and y, for n data points, and with x_i ∈ R^D denoting the i-th D-dimensional input corresponding to y_i, for 1 ≤ i ≤ n. The predicted lower and upper PI bounds are ŷ_L, ŷ_U. A PI should capture some desired proportion of the observations, (1 − α), common choices of α being 0.01 or 0.05,

    Pr(ŷ_Li ≤ y_i ≤ ŷ_Ui) ≥ (1 − α).   (3)

A vector, k, of length n represents whether each data point has been captured by the estimated PIs, with each element k_i ∈ {0, 1} given by,

    k_i = 1 if ŷ_Li ≤ y_i ≤ ŷ_Ui, and 0 otherwise.   (4)

We define the total number of data points captured as c,

    c := Σ_{i=1..n} k_i.   (5)

Let Prediction Interval Coverage Probability (PICP) and Mean Prediction Interval Width (MPIW) be defined as,

    PICP := c / n,   (6)

    MPIW := (1/n) Σ_{i=1..n} (ŷ_Ui − ŷ_Li).   (7)
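To make these definitions concrete, a minimal NumPy sketch of eqs. (4)-(7) is given below; the function name and the toy arrays are illustrative additions of ours, not part of the original implementation.

    import numpy as np

    def pi_metrics(y, y_l, y_u):
        # capture vector k, eq. (4)
        k = ((y >= y_l) & (y <= y_u)).astype(float)
        picp = k.mean()               # eq. (6): c / n
        mpiw = (y_u - y_l).mean()     # eq. (7)
        return picp, mpiw

    y   = np.array([1.0, 2.0, 3.0, 4.0])
    y_l = np.array([0.5, 1.8, 2.0, 4.2])
    y_u = np.array([1.5, 2.6, 3.5, 5.0])
    print(pi_metrics(y, y_l, y_u))    # PICP = 0.75, MPIW = 1.025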
According to the HQ principle, PIs should minimise MPIW subject to PICP ≥ (1 − α). To minimise MPIW, eq. (7) could simply be included in the loss function; however, PIs that fail to capture their data point should not be encouraged to shrink further. We therefore introduce captured MPIW as the MPIW of only those points for which ŷ_L ≤ y ≤ ŷ_U holds,

    MPIW_capt. := (1/c) Σ_{i=1..n} (ŷ_Ui − ŷ_Li) · k_i.   (8)

Regarding PICP, we take a likelihood-based approach, seeking NN parameters, θ, that maximise,

    L_θ := L(θ | k, α).   (9)

Recognising that each element, k_i, is a binary variable taking the value 1 with probability (1 − α), we represent it as a Bernoulli random variable (one per prediction), k_i ∼ Bernoulli(1 − α). We further assume that each k_i is iid. This independence assumption may not hold for data points clustered close together; however, we believe it holds sufficiently for a randomly sampled subset of all data points, as used in mini-batches for GD. This iid assumption allows the total number of captured points, c, to be represented by a binomial distribution, c ∼ Binomial(n, (1 − α)). Substituting in the pmf,

    L_θ = (n choose c) (1 − α)^c α^(n − c).   (10)

The factorials in the binomial coefficient make computation inconvenient. However, using the central limit theorem (specifically the de Moivre-Laplace theorem), it can further be approximated by a normal distribution. For large n,

    Binomial(n, (1 − α)) ≈ N(n(1 − α), nα(1 − α)),   (11)

so that

    L_θ ≈ 1 / sqrt(2π n α(1 − α)) · exp( −(c − n(1 − α))² / (2nα(1 − α)) ).   (12)

We consider this a mild assumption provided a mini-batch of reasonable size is used.

It is common to minimise the NLL rather than maximise the likelihood; this simplifies eq. (12) to,

    −log L_θ ∝ (n(1 − α) − c)² / (nα(1 − α))   (13)

    = n / (α(1 − α)) · ((1 − α) − PICP)².   (14)

Remembering that a penalty should only occur in the case where PICP < (1 − α) results in a one-sided loss. Combining with eq. (8) and adding a Lagrangian, λ, controlling the importance of width vs. coverage gives a new loss,

    Loss_QD = MPIW_capt. + λ · n / (α(1 − α)) · max(0, (1 − α) − PICP)².   (15)
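The step from the Binomial pmf in eq. (10) to the Gaussian of eq. (12) can be sanity-checked numerically; the short SciPy script below is our own illustration (not from the paper) and compares the exact and approximate log-likelihoods for capture counts near the mode of a 100-point mini-batch, where they agree closely.

    import numpy as np
    from scipy.stats import binom, norm

    alpha, n = 0.05, 100                       # 95% target coverage, mini-batch of 100
    mu, var = n * (1 - alpha), n * alpha * (1 - alpha)
    for c in (92, 95, 98):                     # captured counts near n * (1 - alpha)
        exact = binom.logpmf(c, n, 1 - alpha)                 # eq. (10)
        approx = norm.logpdf(c, loc=mu, scale=np.sqrt(var))   # eq. (12)
        print(c, round(exact, 3), round(approx, 3))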
The derived loss function in eq. (15) may be compared to the LUBE loss (Khosravi et al., 2011a),

    Loss_LUBE = (MPIW / r) · exp(λ · max(0, (1 − α) − PICP)),   (16)

where r = max(y) − min(y) is the range of the target variable.

Whilst still recognisable as having the same objective, the differences are significant, and are summarised as follows.

• The inclusion of n intuitively makes sense since a larger sample size provides more confidence in the value of PICP, and hence a larger loss should be incurred. Similar arguments follow for α. They remove the need to adjust λ based on batch size and target coverage.

• A squared term has replaced the exponential. Whilst the RHS for both is minimised when PICP ≥ (1 − α), the squared term was derived based on likelihood whilst the exponential term was selected qualitatively.

• MPIW now has an additive rather than multiplicative effect. Multiplying has the attractive property of ensuring both terms are of the same magnitude. However, it also means that a global minimum is found when all PIs are of zero width. We found in practice that NNs occasionally did produce this undesirable solution.

• MPIW is no longer normalised by the range of y. Data for a NN should already be normalised during preprocessing. Further normalisation is therefore redundant. It also merely scales the loss by a constant, having no overall effect on the optimisation.

• MPIW_capt. is used rather than MPIW. As discussed in section 3.1 this avoids the NN benefiting from further reduction of PI widths for missed data.
Figure 1. Set-up for the toy problem: left is the input data (10 data points linearly spaced at the fixed input x = 1.0), right is the NN used (one trainable weight).

It was originally believed that the LUBE loss function was "nonlinear, complex, and non-differentiable... gradient descent-based algorithms cannot be applied for its minimization" (Khosravi et al., 2011a). This belief has been carried forward, unchallenged, in all subsequent work - see section 2 for numerous examples. It is inconvenient since GD is the standard method for training NNs, so implementations require extra coding effort.

Regarding the quoted justification, most standard loss functions are nonlinear - e.g. L2 errors - and whilst the LUBE loss function is complex, this does not affect its compatibility with GD: modern NN APIs generally handle gradient computation automatically, through application of the chain rule to predefined operations, so provided functions within the API library are used, gradient calculations are handled automatically. The non-differentiability comment is partially valid. Because the loss function requires the use of step functions, it is not differentiable everywhere. But this is not an insurmountable problem: ReLUs are a common choice of activation function in modern NNs, despite not being differentiable when the input is exactly zero (software implementations return one of the derivatives either side of zero when the input corresponds to the undefined point, rather than raising an error (Goodfellow et al., 2016)).

3.3.1. GD Toy Example
Loss_QD can be directly implemented as shown in Algorithm 1 (Loss_H); however, it fails to converge to a minimum. We demonstrate why this is the case, and how it can be remedied, through a toy example.

Consider a NN as in figure 1 with one input and two output neurons, linear activations and no bias. For purposes of clarity, one weight, w_1, is fixed, to create a one-dimensional problem with a single trainable weight, w_2. Given 10 data points evenly spaced at x = 1.0, the optimal value for ŷ_U (and therefore w_2) is the smallest that still satisfies PICP ≥ (1 − α), since this gives the lowest MPIW.

Figure 2. Error surface of Loss_QD and Loss_QD-soft on the toy problem (loss value against the trainable weight w_2), with the effect of the hyperparameters λ and s (curves shown for Loss_QD with λ = 15 and λ = 5, and Loss_QD-soft with λ = 15, s = 50 and s = 150; the optimum is marked).

Loss_QD is plotted in figure 2 (black line). Whilst the global minimum occurs at the desired point, this solution is not found through GD. Given the steepest descent weight update rule with some learning rate τ,

    w_2,t+1 = w_2,t − τ · ∂Loss_QD / ∂w_2,t,   (17)

the weight shrinks without converging. This is because the gradient, ∂Loss/∂w_2, is positive at every point except the discontinuities, which are never realised.

To remedy this, we introduce an approximation of the step function. The sigmoid function has been used in the past as a differentiable alternative (Yan et al., 2004). In eq. (4), k, the captured vector, was defined. We redefine this as k_hard and introduce a relaxed version as follows,

    k_soft = σ(s(y − ŷ_L)) ⊙ σ(s(ŷ_U − y)),   (18)

where σ is the sigmoid function, ⊙ denotes the element-wise product, and s > 0 is some softening factor. We further define PICP_soft and
Loss_QD-soft by replacing k_hard with k_soft in equations (6) & (15) respectively - see also Loss_S in Algorithm 1.

Figure 2 shows the result of using Loss_QD-soft (red & blue lines). By choosing an appropriate value for s, following the steepest gradient does lead to a minimum, making GD a viable method. Setting s = 160 worked well in experiments in section 6, requiring no alteration across datasets.

Algorithm 1. Construction of the loss function using basic operations.
Input: target values, y, predictions of lower and upper bound, ŷ_L, ŷ_U, desired coverage, (1 − α), and sigmoid softening factor, s; ⊙ denotes the element-wise product.
    k_HU = max(0, sign(ŷ_U − y))
    k_HL = max(0, sign(y − ŷ_L))
    k_H  = k_HU ⊙ k_HL
    k_SU = sigmoid((ŷ_U − y) · s)
    k_SL = sigmoid((y − ŷ_L) · s)
    k_S  = k_SU ⊙ k_SL
    MPIW_capt = reduce_sum((ŷ_U − ŷ_L) ⊙ k_H) / reduce_sum(k_H)
    PICP_H = reduce_mean(k_H)
    PICP_S = reduce_mean(k_S)
    Loss_H = MPIW_capt + λ · n/(α(1 − α)) · max(0, (1 − α) − PICP_H)²
    Loss_S = MPIW_capt + λ · n/(α(1 − α)) · max(0, (1 − α) − PICP_S)²
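As a runnable counterpart to Algorithm 1, a minimal TensorFlow sketch of the softened loss (Loss_S) follows; the function name, default argument values, and the small constant guarding against division by zero are our own choices rather than the authors' released implementation.

    import tensorflow as tf

    def qd_soft_loss(y_true, y_l, y_u, alpha=0.05, soften=160.0, lamb=15.0):
        n = tf.cast(tf.size(y_true), tf.float32)
        # hard capture vector k_H, used only for the captured MPIW term
        k_hard = tf.maximum(0., tf.sign(y_u - y_true)) * tf.maximum(0., tf.sign(y_true - y_l))
        # soft capture vector k_S, used for the differentiable coverage penalty
        k_soft = tf.sigmoid(soften * (y_u - y_true)) * tf.sigmoid(soften * (y_true - y_l))
        # captured MPIW, eq. (8)
        mpiw_capt = tf.reduce_sum((y_u - y_l) * k_hard) / (tf.reduce_sum(k_hard) + 1e-6)
        # one-sided squared coverage penalty, eq. (15), with PICP_soft
        picp_soft = tf.reduce_mean(k_soft)
        penalty = lamb * n / (alpha * (1. - alpha)) * tf.square(tf.maximum(0., (1. - alpha) - picp_soft))
        return mpiw_capt + penalty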
The original LUBE loss function, eq. (16), has been implemented with various evolutionary training schemes that do not require derivatives of the loss function. In order to test the efficacy of Loss_QD-soft with GD, we compared it to an evolutionary-based training method (section 5.1). PSO (Kennedy & Eberhart, 1995) was chosen due to its use in recent work with LUBE (Galván et al., 2017; Wang et al., 2017). We make the assumption that other evolutionary methods would offer similar performance (Jones, 2005). See Kennedy & Eberhart (2001) for an introduction to PSO.
4. Ensembles to Estimate Model Uncertainty
In section 2.1, two components of uncertainty were defined: model uncertainty and data noise variance. It appears that previous work assumed both were accounted for (section 2). In fact, LUBE & QD only estimate data noise variance, and there is a need to consider the uncertainty of these estimates themselves. This becomes particularly important when new data is encountered. Consider a NN trained for the example in figure 1: despite being capable of estimating the data noise variance at x = 1.0, if shown new data at x = 2.0 it would still output a PI, despite having little basis for it.

Ensembling models provides a conceptually simple way to deal with this. Recall from section 2.1 that three sources of model uncertainty exist, and that the first, model misspecification, is assumed zero for NNs. Parameter uncertainty can be measured by training multiple NNs with different parameter initialisations (parameter resampling). Training data uncertainty can be handled similarly: sub-sampling from the training set, and fitting a NN to each subset (bootstrap resampling). The resulting ensemble of NNs contains some diversity, and the variance of their predictions can be used as an estimate of model uncertainty.

Recent work reported that parameter resampling offered superior performance to both bootstrap resampling and a combination of the two (Lee et al., 2015; Lakshminarayanan et al., 2017). No robust justification has been given for this.

Given an ensemble of m NNs trained with Loss_QD-soft, let ỹ_U, ỹ_L represent the ensemble's upper and lower estimates of the PI. We calculate model uncertainty, and hence the ensemble's PIs, as follows,

    ȳ_Ui = (1/m) Σ_{j=1..m} ŷ_Uij,   (19)

    σ̂²_model = σ²_Ui = 1/(m − 1) Σ_{j=1..m} (ŷ_Uij − ȳ_Ui)²,   (20)

    ỹ_Ui = ȳ_Ui + 1.96 σ_Ui,   (21)

where ŷ_Uij represents the upper bound of the PI for data point i, for NN j. A similar procedure is followed for ỹ_Li, subtracting rather than adding 1.96 σ_Li.
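A NumPy sketch of the aggregation in eqs. (19)-(21) is given below; the array shapes, names, and the default z = 1.96 (for 95% PIs) are our own framing of the procedure described above.

    import numpy as np

    def ensemble_pi(y_l_all, y_u_all, z=1.96):
        # y_l_all, y_u_all: arrays of shape [m, n] - bounds from m NNs for n points
        y_u_bar = y_u_all.mean(axis=0)            # eq. (19)
        y_l_bar = y_l_all.mean(axis=0)
        sigma_u = y_u_all.std(axis=0, ddof=1)     # square root of eq. (20)
        sigma_l = y_l_all.std(axis=0, ddof=1)
        # eq. (21) for the upper bound, and its lower-bound analogue
        return y_l_bar - z * sigma_l, y_u_bar + z * sigma_u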
5. Qualitative Experiments
In this section the behaviour of QD is qualitatively assessed on synthetic data. Firstly, the GD method of training explained in section 3.3 is compared to PSO. Next, the advantage of QD over MVE (section 2) on data with non-normally distributed noise is shown. Finally, the effectiveness of the ensembled QD approach at estimating model uncertainty is demonstrated. The appendix contains experimental details as well as numerical results for this synthetic data, covering a full permutation of loss functions and training methods [LUBE, QD, MVE] × [GD, PSO].
5.1. Training Method: PSO vs. GD

Comparison of evolutionary methods vs. GD for NN training is its own research topic, and as such analysis here is limited. Preliminary experiments showed that GD performed slightly better than PSO in terms of PICP and MPIW, producing smoother, tighter boundaries, both more consistently and with lower computational effort. See figure 3 for a graphical comparison of typical PI boundaries. Data was generated from a fixed function of x with additive zero-mean Gaussian noise whose spread grows with the magnitude of x.

Figure 3. Comparison of PI boundaries for GD vs. PSO training methods. Shading is the predicted 95% PI. Ground truth is given by blue lines - the ideal 95% boundaries and mode.

5.2. Loss Function: QD vs. MVE

Here, the advantage of a distribution-free loss function is demonstrated by comparing MVE, which assumes Gaussian data noise, to QD, which makes no such assumption, on two synthetic datasets. The first was generated as in 5.1 with normal noise, the second with exponential noise, ε ∼ exp(1/x).

Figure 4 shows, unsurprisingly, that MVE outputs PIs very close to the ideal for normal noise, but struggles with exponential noise. QD approximates both reasonably, though it does not learn the boundaries well where data is sparse. Though it would be possible to alter MVE to assume an exponential distribution, this would require significant work. With real data, the distribution would be unknown, and likely irregular, putting QD at an advantage.

5.3. Model Uncertainty Estimation: Ensembles

This experiment demonstrated the ability of ensembling to estimate model uncertainty. Data was generated from a linear function of x with additive Gaussian noise of constant variance. Figure 5 shows ten individual QD PIs as well as the ensembled PIs. The estimated model uncertainty, σ̂_model, calculated from eq. (20) (with the uncertainty of the upper and lower bounds averaged), is overlaid. Whilst it is difficult to reason about the correctness of the absolute value, its behaviour agrees with the intuition that uncertainty should increase in regions of the input space that are not represented in the training set - here, the gap between the two training clusters and the regions beyond them.
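For reference, data of the kind used in this section can be generated as below; the functional form, coefficients, and input range are placeholders of our own choosing, since the paper's exact generating equations are not recoverable from this copy - only the heteroskedastic Gaussian and exponential noise structure is taken from the text.

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.uniform(-2.0, 2.0, size=200)
    # Gaussian noise whose spread grows with |x| (illustrative coefficients)
    y_gauss = 0.5 * np.sin(np.pi * x) + rng.normal(0.0, 0.2 * np.abs(x) + 0.05)
    # skewed, exponentially distributed noise for the second dataset
    y_exp = 0.5 * np.sin(np.pi * x) + rng.exponential(0.2 * np.abs(x) + 0.05)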
6. Benchmarking Experiments
To compare QD to recent work on uncertainty in deep learning, we adopted their shared experimental procedure (Hernández-Lobato & Adams, 2015; Gal & Ghahramani, 2015; Lakshminarayanan et al., 2017). Experiments were run across ten open-access datasets. Models were asked to output 95% PIs and used five NNs per ensemble. See the appendix for full experimental details. Code is made available online at https://github.com/TeaPearce.

Previous work reported NLL & RMSE metrics. However, the important metrics for PI quality are MPIW and PICP (section 1). This meant that we had to reimplement a competing method. We chose to compare QD-Ens to MVE-Ens, since it had reported the best NLL results to date (Lakshminarayanan et al., 2017). We did not include LUBE since ensembling and GD had already been justified in section 5.

QD-Ens and MVE-Ens output fundamentally different things: MVE-Ens a distribution, and QD-Ens upper and lower estimates of the PI. To compute NLL & RMSE for QD-Ens is possible only by imposing a distribution on the PI. This is not particularly fair since the attraction of the method is its lack of distributional assumption. Purely for comparison purposes we did this in the appendix.

A fairer comparison is to convert the MVE-Ens output distributions to PIs, and compute PI quality metrics. This was done by trimming the tails of the MVE-Ens output normal distributions by the appropriate amount, which allowed extraction of MPIW, PICP, and Loss_QD-soft. In our experiments we ensured MVE-Ens achieved NLL & RMSE scores at least as good as those reported in the original work, before PI metrics were calculated.
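Converting an MVE-Ens Gaussian into a 95% PI amounts to trimming 2.5% from each tail; a small sketch of this conversion (ours, using the standard z ≈ 1.96 for 95% coverage) is:

    from scipy.stats import norm

    def gaussian_to_pi(mu, sigma, alpha=0.05):
        z = norm.ppf(1 - alpha / 2)        # ~1.96 for alpha = 0.05
        return mu - z * sigma, mu + z * sigma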
Figure 4. Comparison of PI boundaries for two loss functions, QD vs. MVE, given data noise drawn from different distributions (panels: MVE and QD under normal noise, MVE and QD under exponential noise). Legend as for figure 3.
Table 1. PI quality metrics on ten benchmarking regression datasets; mean ± one standard error, best result in bold. Best was assessed according to the following criteria: if PICP ≥ 0.95 for both methods, both were best for PICP, and best MPIW was given to the smallest MPIW; if PICP ≥ 0.95 for neither or for only one, the largest PICP was best, and MPIW was only assessed if the method with the larger PICP also had the smallest MPIW.

[Table 1 reports Loss_QD-soft, PICP, and MPIW for MVE-Ens and QD-Ens, plus the MPIW improvement, on the Boston, Concrete, Energy, Kin8nm, Naval, Power Plant, Protein, Wine, Yacht, and Song Year datasets; the numerical entries are not recoverable from this copy.]
Figure 5. Estimation of model uncertainty with a QD ensemble: individual NN boundaries, the ensemble boundary, the true data function, and the estimated model uncertainty are shown.
Full results of PI quality metrics are given in table 1; NLL & RMSE results are included in the appendix. Given that Loss_QD-soft is representative of PI quality, QD-Ens outperformed MVE-Ens on all but one dataset. PICP was generally closer to the 95% target, and MPIW was on average 11.6% narrower. The exception to this was the Kin8nm dataset. In fact, this dataset was synthetic (Danafar et al., 2010), and we suspect that Gaussian noise may have been used in its simulation, which would explain the superior performance of MVE-Ens.

One drawback of QD-Ens was the fragility of the training process. Compared to MVE-Ens it required a lower learning rate, was more sensitive to decay rate, and hence needed from two to ten times more training epochs.

Other comments are as follows. We found λ a convenient lever providing some control over PICP. Bootstrap resampling gave worse performance than parameter resampling, which agrees with the work discussed in section 4 - we suspect it would work given a much larger ensemble size. We tried to establish a relationship between the normality of residual errors and the improvement of QD-Ens over MVE-Ens, but due to the variable power of normality tests the analysis was unreliable.
7. Conclusions and Future Work
In this paper we derived a loss function for the output of PIs based on the assumption that high-quality PIs should be as narrow as possible subject to a given coverage proportion. We contrasted it with previous work, justifying the differences, and showed that it can be used successfully with GD with only slight modification. We described why a single NN using the derived loss function underestimates uncertainty, and how this can be addressed by using the model in an ensemble. On ten benchmark regression datasets, the new model reduced PI widths by over 10%.

Several areas are worth further investigation: why parameter resampling provides better performance than bootstrap resampling, how model uncertainty could be estimated through dropout or conformal prediction rather than ensembling, and the role that NN architecture plays in estimates of model uncertainty.
Acknowledgements
The authors thank EPSRC for funding (EP/N509620/1), the Alan Turing Institute for accommodating the lead author during his work (TU/D/000016), and Microsoft for Azure credits. Personal thanks to Mr Ayman Boustati, Mr Henry-Louis de Kergorlay, and Mr Nicolas Anastassacos.
References
Ak, R., Li, Y., Vitelli, V., Zio, E., López Droguett, E., and Magno Couto Jacinto, C. NSGA-II-trained neural network approach to the estimation of prediction intervals of scale deposition rate in oil & gas equipment. Expert Systems with Applications, 40(4):1205-1212, 2013a.

Ak, R., Li, Y.-F., Vitelli, V., and Zio, E. Multi-objective genetic algorithm optimization of a neural network for estimating wind speed prediction intervals. 2013b. URL https://hal.archives-ouvertes.fr/hal-00864850/document.

Blundell, C., Cornebise, J., Kavukcuoglu, K., and Wierstra, D. Weight uncertainty in neural networks. In Proceedings of the 32nd International Conference on Machine Learning, volume 37, 2015.

Danafar, S., Gretton, A., and Schmidhuber, J. Characteristic kernels on structured domains excel in robotics and human action recognition. Lecture Notes in Computer Science, 6321 LNAI(Part 1):264-279, 2010.

Gal, Y. Uncertainty in Deep Learning. PhD thesis, 2016.

Gal, Y. and Ghahramani, Z. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In Proceedings of the 33rd International Conference on Machine Learning, 2015. URL http://arxiv.org/abs/1506.02142.

Galván, I. M., Valls, J. M., Cervantes, A., and Aler, R. Multi-objective evolutionary optimization of prediction intervals for solar energy forecasting with neural networks. Information Sciences, 2017.

Goodfellow, I., Bengio, Y., and Courville, A. Deep Learning. 2016. URL http://www.deeplearningbook.org.

Graves, A. Practical variational inference for neural networks. In Advances in Neural Information Processing Systems, pp. 1-9, 2011.

Hernández-Lobato, J. M. and Adams, R. P. Probabilistic backpropagation for scalable learning of Bayesian neural networks. In Proceedings of the 32nd International Conference on Machine Learning, 2015.

Heskes, T. Practical confidence and prediction intervals. In Advances in Neural Information Processing Systems 9, 1996.

Jones, K. O. Comparison of genetic algorithm and particle swarm optimisation. In International Conference on Computer Systems and Technologies - CompSysTech'2005, pp. 1-6, 2005.

Kennedy, J. and Eberhart, R. Particle swarm optimization. In Proceedings of the IEEE International Conference on Neural Networks, 4:1942-1948, 1995.

Kennedy, J. and Eberhart, R. Swarm Intelligence. 2001.

Khosravi, A., Nahavandi, S., Creighton, D., and Atiya, A. F. Lower upper bound estimation method for construction of neural network-based prediction intervals. IEEE Transactions on Neural Networks, 22(3):337-346, 2011a.

Khosravi, A., Nahavandi, S., Creighton, D., and Atiya, A. F. Comprehensive review of neural network-based prediction intervals and new advances. IEEE Transactions on Neural Networks, 22(9):1341-1356, 2011b.

Krzywinski, M. and Altman, N. Points of Significance. Nature Methods, 10, 2013.

Lakshminarayanan, B., Pritzel, A., and Blundell, C. Simple and scalable predictive uncertainty estimation using deep ensembles. In Advances in Neural Information Processing Systems, 2017.

Lee, S., Purushwalkam, S., Cogswell, M., Crandall, D., and Batra, D. Why M heads are better than one: Training a diverse ensemble of deep networks. 2015.

Lian, C., Zeng, Z., Yao, W., Tang, H., and Chen, C. L. P. Landslide displacement prediction with uncertainty based on neural networks with random hidden weights. IEEE Transactions on Neural Networks and Learning Systems, 27(12):1-13, 2016.

MacKay, D. J. C. A practical Bayesian framework for backpropagation networks. Neural Computation, 4(3):448-472, 1992.

Malinin, A., Ragni, A., Knill, K. M., and Gales, M. J. F. Incorporating uncertainty into deep learning for spoken language assessment. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Long Papers), 2:45-50, 2017.

Mazloumi, E., et al. Engineering Applications of Artificial Intelligence, 2011.

Nix, D. A. and Weigend, A. S. Estimating the mean and variance of the target probability distribution. In Proceedings of the 1994 IEEE International Conference on Neural Networks (ICNN'94), pp. 55-60, 1994.

Papadopoulos, G., Edwards, P. J., and Murray, A. F. Confidence estimation methods for neural networks: A practical comparison. In European Symposium on Artificial Neural Networks, 2000.

Pinson, P. and Kariniotakis, G. Optimal prediction intervals of wind power generation. IEEE Transactions on Power Systems, 25:1845-1856, 2013.

Quan, H., Srinivasan, D., and Khosravi, A. Uncertainty handling using neural network-based prediction intervals for electrical load forecasting. Energy, 73:916-925, 2014.

Shafer, G. and Vovk, V. A tutorial on conformal prediction. Journal of Machine Learning Research, 9:371-421, 2008.

Shen, Y., Wang, X., and Chen, J. Wind power forecasting using multi-objective evolutionary algorithms for wavelet neural network-optimized prediction intervals. Applied Sciences, 8(2):185, 2018.

Sun, X., Wang, Z., and Hu, J. Prediction interval construction for byproduct gas flow forecasting using optimized twin extreme learning machine. Mathematical Problems in Engineering, 2017, 2017.

Thomas, P., Mansot, J., Delbe, K., Sauldubois, A., and Bilas, P. Standard Particle Swarm Optimisation, 2012.

Tibshirani, R. A comparison of some error estimates for neural network models. Neural Computation, 8:152-163, 1996.

Wang, J., Fang, K., Pang, W., and Sun, J. Wind power interval prediction based on improved PSO and BP neural network. Journal of Electrical Engineering and Technology, 12(3):989-995, 2017.

Yan, L., et al. In Proceedings of the 2004 ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '04), Seattle, Washington, 2004.

A. Experimental details
In this section we give full experimental details of the work described in the main paper. Code is made available online at https://github.com/TeaPearce. Note that whilst single-layer NNs were used to be consistent with previous works, the developed methods may be applied without modification to deeper architectures.

A.1. Qualitative Experiments
A.1.1. Training Method: PSO vs. GD

For the qualitative training method comparison, PSO vs. GD (section 5.1), NNs used ReLU activations and 50 nodes in one hidden layer. GD was trained using Loss_QD-soft and run for 2,000 epochs; PSO was trained using Loss_QD and run with 50 particles over 2,000 iterations, following the parameters given in SPSO 2011 (Thomas et al., 2012). Data consisted of 200 points sampled uniformly from a fixed interval of x.

A.1.2. Loss Function: QD vs. MVE

For the loss function comparison, QD vs. MVE (section 5.2), NNs used Tanh activations and 50 nodes in one hidden layer. Both methods were trained with GD and results are for an individual NN (not ensembled). Data consisted of 200 points sampled uniformly from the same interval.

A.1.3. Model Uncertainty Estimation: Ensembles

For evaluation of ensembling (section 5.3), we sampled 50 points uniformly from one interval of the input space and another 50 from a second, disjoint interval. An ensemble of ten QD NNs using ReLU activations and 50 nodes in one hidden layer was trained with GD, using parameter resampling.

A.1.4. Training Method / Loss Function Permutation

We provide quantitative results in table 2 for the two synthetic datasets described in sections 5.1 & 5.2. These results cover permutations of loss function and training method [LUBE, QD, MVE] × [GD, PSO], using individual NNs (not ensembled).

Again, 200 training data points were sampled uniformly; validation was on 2,000 data points sampled from the same interval. Experiments were repeated ten times. PIs targeted 95% coverage.

LUBE relates to the 'softened' version of eq. (16), and QD to Loss_QD-soft. Softened versions were used for both GD and PSO (note that 'soft' and 'hard' versions were contrasted in section 5.1).

For GD, 2,000 epochs were used; for PSO, 10 particles were run over 2,000 epochs. This meant the computational effort for PSO was five times that required by GD - the computational equivalent of 2,000 forward and backward passes for GD is 2 particles at 2,000 epochs for PSO.

NLL & RMSE metrics were computed, but should be viewed with caution for LUBE and QD - see section A.2.2. All training methods and loss functions slightly overfitted the training data, producing PICPs lower than 95%. Generally, GD-trained NNs outperformed their PSO counterparts in terms of the primary metric of their loss function (MPIW for LUBE and QD, NLL for MVE), although the quality of other metrics was similar. QD produced comparable results to LUBE - we note that our contributions to the loss function were from a theoretical and usability perspective, and we did not expect a large impact on performance. MVE produced PIs of comparable width to LUBE and QD in the case of normal noise, but PIs were significantly wider in the exponential noise case.
A.2. Benchmarking Experiments
A.2.1. Set Up and Hyperparameters
For the benchmarking section, experiments were run across ten open-access datasets; train/test folds were randomly split 90%/10%, with experiments repeated 20 times, and input and target variables were normalised to zero mean and unit variance. NNs had 50 neurons in one hidden layer with ReLU activations. The exceptions to this were the experiments with the two largest datasets, Protein and Song Year, where NNs had 100 neurons in one hidden layer, and which were repeated five times and one time respectively.

The softening factor was constant for all datasets, s = 160.0. For the majority of the datasets λ = 15.0, but it was adjusted for naval, protein, wine, and yacht. The Adam optimiser was used with batch sizes of 100. Five NNs were used in each ensemble, using parameter resampling.

Hyperparameters requiring tuning were learning rate, decay rate, λ, initialising variance, and number of training epochs. Tuning was done on a single 80%/20% train/validation split using random search.
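The shared settings just described can be summarised in a small configuration dictionary; the key names are illustrative choices of ours, and the per-dataset λ values elided above are not reproduced.

    benchmark_config = {
        "hidden_units": 50,       # 100 for Protein and Song Year
        "activation": "relu",
        "alpha": 0.05,            # 95% PIs
        "soften_s": 160.0,
        "lambda_default": 15.0,   # adjusted for naval, protein, wine, yacht
        "ensemble_size": 5,       # parameter resampling
        "optimizer": "adam",
        "batch_size": 100,
        "train_fraction": 0.9,    # 20 repeats (5 for Protein, 1 for Song Year)
    }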
A.2.2. NLL & RMSE Results

In table 3 we report NLL & RMSE in unnormalised form to be consistent with previous works. Note that in the main results (section 6) we found it more meaningful to leave MPIW in normalised form so that comparisons could be made across datasets.

To compute NLL & RMSE for QD-Ens, we used the midpoint of the PIs as the point estimate to calculate RMSE.
Table 2. Full permutation results of training method and loss function on synthetic data with differing noise distributions; mean ± one standard error.

[Table 2 reports PICP, MPIW, NLL, and RMSE for the LUBE, MVE, and QD losses, each trained with GD and PSO, under normal and exponential noise; the numerical entries are not recoverable from this copy.]

We computed the equivalent Gaussian distribution of the PIs by centering it around this midpoint and using a standard deviation of (ŷ_Ui − ŷ_Li)/3.92 (since the PI represented 95% coverage), which enabled NLL to be computed. We emphasise that by doing this, we break the distribution-free assumption of the PIs, and include these results purely for consistency with previous work. Unsurprisingly, NLL & RMSE metrics for QD-Ens are poor. MVE-Ens results are in line with previously reported work (Lakshminarayanan et al., 2017).
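Concretely, the NLL & RMSE reported for QD-Ens can be recovered from the PIs as in the following sketch; it assumes the elided divisor is 3.92, i.e. the width of a 95% Gaussian interval in standard deviations, as the surrounding sentence suggests.

    import numpy as np

    def pi_to_gaussian_scores(y, y_l, y_u):
        mu = 0.5 * (y_l + y_u)            # PI midpoint as the point estimate
        sigma = (y_u - y_l) / 3.92        # a 95% Gaussian interval spans ~3.92 sigma
        rmse = np.sqrt(np.mean((y - mu) ** 2))
        nll = np.mean(0.5 * np.log(2 * np.pi * sigma ** 2) + (y - mu) ** 2 / (2 * sigma ** 2))
        return rmse, nll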
Table 3. RMSE and NLL on ten benchmarking regression datasets; mean ± one standard error, best result in bold.

[Table 3 lists, for each dataset, the number of data points n and input dimensionality D, together with RMSE and NLL for MVE-Ens and QD-Ens; most numerical entries are not recoverable from this copy (among the legible values: Boston has n = 506, D = 13; Energy n = 768, D = 8; Yacht n = 308, D = 6).]