SENTINEL: Taming Uncertainty with Ensemble-based Distributional Reinforcement Learning
Hannes Eriksson [email protected]
Zenseact AB, Gothenburg, Sweden; Chalmers University of Technology, Gothenburg, Sweden
Debabrota Basu
Scool, INRIA Lille - Nord Europe, Lille, France; CRIStAL, CNRS, Lille, France
Mina Alibeigi
Zenseact AB, Gothenburg, Sweden
Christos Dimitrakakis
University of Oslo, Oslo, Norway
Abstract
In this paper, we consider risk-sensitive sequential decision-making in model-based reinforcement learning (RL). We introduce a novel quantification of risk, namely composite risk, which takes into account both aleatory and epistemic risk during the learning process. Previous works have considered aleatory or epistemic risk individually, or an additive combination of the two. We demonstrate that the additive formulation is a particular case of the composite risk, and that it underestimates the actual CVaR risk even while learning a mixture of Gaussians. In contrast, the composite risk provides a more accurate estimate. We propose to use a bootstrapping method, SENTINEL-K, for distributional RL. SENTINEL-K uses an ensemble of K learners to estimate the return distribution, and additionally uses Follow The Regularised Leader (FTRL) from the bandit literature to provide a better estimate of the risk on the return distribution. Finally, we experimentally verify that SENTINEL-K estimates the return distribution better and, when used with the composite risk estimate, demonstrates better risk-sensitive performance than competing RL algorithms.
1. Introduction
Reinforcement Learning (RL) algorithms, with their recent success in games and simulated environments (Mnih et al., 2015), have drawn interest for real-world and industrial applications (Pan et al., 2017; Mahmood et al., 2018). Two aspects of RL algorithms constrain their applicability. Firstly, model-free RL algorithms generally require a large amount of data. Secondly, since in RL the environment is by definition unknown to the agent, exploring it so as to improve performance and eventually obtain the optimal policy entails risks. Though risk is not an issue in simulation, it is important to consider risks when interacting in the real world (Pinto et al., 2017; García and Fernández, 2015; Prashanth and Fu, 2018). In this paper, we employ a model-based approach that enables us both to be efficient in terms of the amount of data needed, and to be flexible with respect to the risk metric the agent should consider when making decisions.

Figure 1: SENTINEL-K with FTRL-driven composite risk estimator and K CDQNs as distribution estimators.

Risk sensitivity in reinforcement learning and Markov decision processes has sometimes been considered under a minimax formulation over plausible MDPs (Satia, 1973; Heger, 1994; Tamar et al., 2014). Alternative approaches include maximising a risk-sensitive statistic instead of the expected return (Chow and Ghavamzadeh, 2014; Tamar et al., 2015; Clements et al., 2019). In this paper, we focus on the second approach due to its flexibility. Either approach requires estimating the uncertainty associated with the decision-making procedure. This uncertainty includes both the inherent randomness in the model and the uncertainty due to imperfect information about the true model. These two types of uncertainty are called aleatory and epistemic uncertainty respectively (Der Kiureghian and Ditlevsen, 2009).

In this work, we propose a composite risk formulation in order to capture the combined effect of aleatory and epistemic uncertainty for decision-making in RL (Section 4). In recent literature, researchers have either quantified epistemic and aleatory risks separately (Mihatsch and Neuneier, 2002; Eriksson and Dimitrakakis, 2019) or considered an additive risk formulation where their weighted sum is minimized by an RL algorithm (Clements et al., 2019). In a reductive experiment (Figure 2), we show that using an additive risk, which is the sum of separately computed epistemic and aleatory CVaR, strictly underestimates the total CVaR (Rockafellar et al., 2000), and the deviation grows as CVaR focuses more on less probable events. In contrast, the composite risk takes into consideration the combined effect of the two types of uncertainty and better reflects the underlying risk. Finally, we show that additive risk is essentially a special case of composite risk.

1. CVaR_α captures the expected value of the α% of events in the left tail.
We then incorporate this composite risk measure within the Distributional RL (DRL) framework (Dabney et al., 2018b; Tang and Agrawal, 2018; Rowland et al., 2019). The DRL framework aims to model the distribution of returns of a policy for a given environment (Section 3.3). This highly expressive distributional representation allows us both to estimate appropriate risk measures and to incorporate them in the final decision making. However, DRL approaches are typically limited to modelling aleatory uncertainty; the epistemic uncertainty due to partial information is not explicitly modelled in terms of the return distribution. In this paper, we propose a bootstrapping-based (Efron and Tibshirani, 1985) framework to estimate the return distribution.

As we explain in Section 5, we use an ensemble of K distribution estimators, such as CDQNs (Dabney et al., 2018b), obtained through bootstrapping, to learn the return distribution. We use these return distributions to estimate the aleatory and composite risks for the corresponding RL method (Section 5.1). In order to perform the estimation accurately and efficiently, we adapt the Follow The Regularised Leader (FTRL) algorithm (Cesa-Bianchi and Lugosi, 2006) to weigh the estimators in our ensemble, as we describe in Section 5.2.

Our framework, which we call SENTINEL-K, is illustrated in Figure 1. We instantiate SENTINEL-K to perform risk-sensitive model-based distributional RL by incorporating the composite CVaR estimate into an FTRL-driven bootstrapped CDQN algorithm (Dabney et al., 2018b). We experimentally show in Section 6 that the FTRL-driven bootstrapping method of SENTINEL-K generates accurate estimates of true return distributions even for suboptimal actions and multimodal return distributions, where the vanilla distributional RL algorithm fails to do so. The estimates of SENTINEL-K converge faster even without a risk-sensitive objective. We also show that our FTRL-based approach is more accurate than uniform or greedy aggregation of the K approximations of the return distribution. Finally, we verify the risk-sensitive performance of SENTINEL-K with the composite CVaR metric on the highway environment with 10 cars. Experimental results show that our approach leads to a higher estimate of the underlying risk and thus fewer crashes than the competing distributional algorithms, which are VDQN (Tang and Agrawal, 2018), CDQN, and SENTINEL-K with an additive CVaR estimate.

Before proceeding to the details of our contributions, we position our work in the existing literature in Section 2. Following that, we provide a primer on risk measures, Markov decision processes, and DRL in Section 3 to elucidate our contributions.
2. Related Work
For RL applications in the real world, such as autonomous driving and robotics, risk-sensitive RL approaches can avoid the negative consequences of excessive exploration. This has initiated a spate of research efforts (Howard and Matheson, 1972; Satia, 1973; Coraluppi and Marcus, 1999; Marcus et al., 1997; Mihatsch and Neuneier, 2002; Prashanth and Fu, 2018) spanning five decades. But the majority of these works focus only on discrete state-space MDPs. We are interested in designing a general framework applicable to both discrete and continuous state spaces. Thus, we adopt the framework of distributional RL, specifically CDQN, which incorporates highly expressive approximators to model continuous and multimodal return distributions.

Both aleatory and epistemic uncertainty are important for risk-sensitive RL (Der Kiureghian and Ditlevsen, 2009): the former expresses the randomness inherent to the problem, and the latter the uncertainty about the problem itself. A common approach to making an algorithm risk-sensitive (García and Fernández, 2015) is to use a utility function that is nonlinear with respect to the return, or the expected return. For example, Mihatsch and Neuneier (2002) consider aleatory risk-sensitive RL by transforming the return. Follow-up works (Chow and Ghavamzadeh, 2014; C. et al., 2015) focus on scaling up these approaches. There have been recent works considering epistemic risk (Eriksson and Dimitrakakis, 2019), wherein problem uncertainty is expressed in a Bayesian framework as a distribution over MDPs. Depeweg et al. (2018) and Clements et al. (2019) intuitively incorporate both of these risks in decision making. Depeweg et al. (2018) consider the risk in the individual costs in RL. Clements et al. (2019) consider the additive formulation of epistemic and aleatory risks. They use variance as the risk measure, which is not a coherent measure (Artzner et al., 1999). In order to rectify such varied choices, we define a composite risk that considers and quantifies the entangled effect of epistemic and aleatory uncertainties. We also show that for any coherent risk measure, such as CVaR, the composite risk retains coherence.

Ensemble-based RL has been used previously with great success (Wiering and Van Hasselt, 2008; Faußer and Schwenker, 2015; Osband et al., 2016; Pacchiano et al., 2020). This typically involves creating an ensemble of well-known RL agents, such as Deep Q-Networks (DQN) (Mnih et al., 2015), where each estimator has its own dataset and the final decision maker takes the joint prediction of the ensemble into account. Typically, the final estimate averages the individual estimators. In particular, adding estimators to form an ensemble not only improves performance for risk-neutral decision-making but also allows considering the distribution over estimators. This enables epistemic risk-sensitive decision-making. We incorporate a bootstrapping approach to ensemble K different estimates of the return distribution, and introduce the FTRL algorithm to estimate the return distribution accurately and efficiently.
3. Background
In this section, we introduce the notion of risk measures, the risk-sensitive Markov decision process formulation, and the distributional RL framework.
3.1 Risk Measures

The idea of quantifying risk in decision making is long-studied in decision theory and has found multiple applications in finance and actuarial science. Researchers have proposed multiple measures of risk, such as variance, Value at Risk (VaR), and Conditional Value at Risk (CVaR), to quantify the probability of occurrence of an event away from the expectation of the corresponding distribution (Szegö, 2002). Artzner et al. (1999) have established a basic set of axioms to be satisfied by a coherent risk measure: normalization, monotonicity, sub-additivity, homogeneity, and translation invariance. For example, CVaR is a coherent risk measure whereas variance and VaR are not. Thus, in this work, we choose CVaR (Rockafellar et al., 2000) as the risk measure of interest.
2. Here, we use return to mean the total discounted reward.
CVaR_α quantifies the expectation of the worst α% of a probability distribution. For a random variable Z and α ∈ [0, 1]:

    CVaR_α(Z) ≜ E[Z | Z ≥ ν_α ∧ Pr(Z > ν_α) = α]    (1)

CVaR is widely used in risk-sensitive RL (Chow and Ghavamzadeh, 2014; Tamar et al., 2015; Chow et al., 2015) as it is coherent, applies to general L_p spaces, and captures the heaviness of the tail of a distribution. For α = 0, CVaR reduces to the expected value, and thus the corresponding risk-sensitive RL algorithm behaves analogously to a risk-neutral one. Kolla et al. (2019) show that the CVaR of a distribution can be accurately estimated using i.i.d. samples.

3.2 Risk-sensitive Markov Decision Processes

In this work, we consider decision-making problems that can be modelled by a Markov Decision Process (MDP) (Sutton and Barto, 2018). An MDP is a tuple µ ≜ (S, A, R, T, γ). S ∈ R^d is a state representation of dimension d. A is the set of admissible actions. T is a transition kernel that determines the probability of successor states s' given the present state s and action a. The reward function R quantifies the goodness of taking action a in state s. The goal of the agent is to find a policy π : S → A that maximises the expected value of a utility function U (Friedman and Savage, 1948) computed over a reward sequence given a time horizon T:

    U^π(s, a) = E[ U( Σ_{t=0}^T γ^t R(s_t, a_t) ) ].

Here, s_t ∼ T(·|s_{t−1}, a_{t−1}), a_t = π(s_t), s_0 = s, and a_0 = a. When the utility function U is the identity function, U^π(s, a) reduces to the Q-function, which is the expected long-term discounted reward. If the utility function U is a coherent risk measure, such as CVaR, it leads to a risk-sensitive formulation of the MDP (Mihatsch and Neuneier, 2002; Prashanth and Fu, 2018).

3.3 Distributional RL

The variable at the core of both risk-neutral and risk-sensitive RL is usually the accumulated discounted reward Z^π(s, a) ≜ Σ_{t=0}^T γ^t R(s_t, a_t). Z^π(s, a) is called the return of policy π. In distributional RL, the goal is to learn the return distribution Z^π(s, a) obtained by following policy π from state s and action a under the given MDP.

Different methods have been proposed to parametrize the return distribution. Bellemare et al. (2017) propose CDQN, a categorical distribution with N atoms and support in [V_MIN, V_MAX]. The mass of atom z_i is then given by e^{θ_i(s,a)} / Σ_j e^{θ_j(s,a)}. Tang and Agrawal (2018), Dabney et al. (2018a), and Rowland et al. (2019) use unimodal Gaussians, quantiles, and expectiles to model the return distribution respectively. In this work, we choose to extend CDQN, as it permits richer representations of distributions and the flexibility to compute different statistics.

The intuition for using this distributional framework for risk-sensitive RL is its flexibility to model multimodal and asymmetrical distributions, which is important for an accurate estimate of risk.
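To make Eq. (1) and the categorical parametrization concrete, the following is a minimal Python sketch of CVaR_α for a CDQN-style categorical return distribution, interpreting "worst" as the left tail per footnote 1. The atom grid, the random logits, and α = 0.25 are hypothetical stand-ins, not the paper's actual network outputs.

```python
import numpy as np

def categorical_pmf(logits):
    """Softmax over atom logits theta_i(s, a), as in the CDQN parametrization."""
    e = np.exp(logits - logits.max())
    return e / e.sum()

def cvar_categorical(atoms, pmf, alpha):
    """CVaR_alpha of a categorical distribution: the expected value of the
    worst (lowest-return) alpha-fraction of probability mass."""
    order = np.argsort(atoms)                  # worst atoms first
    atoms, pmf = atoms[order], pmf[order]
    cum = np.cumsum(pmf)
    # mass each atom contributes to the alpha-tail, clipping the atom
    # that straddles the alpha boundary
    tail = np.clip(alpha - (cum - pmf), 0.0, pmf)
    return float((atoms * tail).sum() / alpha)

# Hypothetical setup: N = 51 atoms on [V_MIN, V_MAX] and random logits.
atoms = np.linspace(0.0, 10.0, 51)
pmf = categorical_pmf(np.random.default_rng(0).normal(size=51))
print(cvar_categorical(atoms, pmf, alpha=0.25))  # mean of worst 25% of returns
```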
4. Quantifying Composite Risk
In risk-sensitive RL, we encounter two types of uncertainty: aleatory and epistemic. Aleatory uncertainty is engendered by the stochasticity of the MDP model µ and the policy π. Epistemic uncertainty exists because the MDP model µ is unknown. In the Bayesian setting, this is expressed as a belief distribution β over a set of plausible MDPs Θ. Hence, risk measures can also be defined with respect to the MDP distribution. Consequently, as an agent learns more about the underlying MDP, the epistemic risk vanishes. The aleatory risk is inherent to the MDP model µ and policy π, and thus persists even after correctly estimating the model µ. Let us first define risk measures for aleatory and epistemic uncertainty separately. We then combine them into a composite risk measure.

Aleatory Risk.
Given a coherent risk measure U_A, the aleatory risk is quantified as the deviation of the total risk of the individual models from the risk of the average model:

    A(U_A, β) ≜ E_β[ E_{Pr(·|θ)}[U_A(Z)] − U_A(E_{Pr(·|θ)}[Z]) ] = ∫_Θ ∫_Z (U_A(z) − U_A(µ_z)) dPr(z|θ) dβ(θ),

where U(µ_z) ≜ U( ∫_Θ P(z|θ) dβ(θ) ) is the utility of the average model given a belief distribution β over the plausible set of models Θ. The centered definition of aleatory risk is necessary for the additive formulation to be a special case of the composite formulation.

Epistemic Risk.
Given a coherent risk measure U_E, the epistemic risk quantifies the uncertainty invoked by not knowing the plausible models. Thus, the risk can be computed over any statistic of the models, such as the expectation:

    E(U_E, β) ≜ E_β[ U_E(E_{Pr(·|θ)}[Z]) ] = ∫_Θ U_E( ∫_Z z dPr(z|θ) ) dβ(θ).

Composite Risk under Model and Inherent Uncertainty.
In typical risk-sensitive RL settings, the true MDP model is unknown, and MDPs are inherently stochastic. Thus, the total uncertainty to be considered is a composition of aleatory and epistemic uncertainties. In order to quantify the total uncertainty under consideration, we propose the composite risk.

Definition 1 (Composite Risk)
For two coherent risk measures U_A and U_E, a belief distribution β on model parameters θ, and a random variable Z, the composite risk of epistemic and aleatory uncertainties is defined as

    F_C(U_A, U_E, β) ≜ ∫_Θ U_E( ∫_Z U_A(z) dPr(z|θ) ) dβ(θ).

The composite risk is flexible enough to use two different risk measures for quantifying epistemic and aleatory uncertainties.
Lemma 2 (Coherence) If U_A and U_E are two coherent risk measures, the composite risk measure F_C(U_A, U_E, β) is also coherent.

Figure 2: Estimation of total CVaR_α from a mixture of 100 Gaussians sampled from a posterior distribution. Total CVaR_α[Data] is based on the marginal distribution of r as given in Example 1. We compare this with the composite and additive estimates and illustrate results over 100 runs.

Additive Risk Measure. If U_E is the identity function, the composite risk reduces to an additive risk measure:

    F_A(U_A, β) ≜ ∫_Θ ∫_Z U_A(z) dPr(z|θ) dβ(θ) = A(U_A, β) + E(U_A, β).

Often the additive risk measure, or a weighted sum of the epistemic and aleatory uncertainties, is used in the risk-sensitive RL literature (Clements et al., 2019). However, the additive formulation strictly underestimates the composite effect of epistemic risk. Consequently, we observe that additive risk leads to worse risk-sensitive performance than composite risk in RL problems (Table 1). In order to compare the risk estimation using the additive and composite formulations, we consider an example of estimating CVaR over a Gaussian mixture.

Example 1
We consider a mixture of 100 Gaussians: p(r) = Σ_{i=1}^{100} φ_i N(µ_i, σ_i), where Φ ∼ Dir([0.·, ...]), µ_i ∼ N(0, ·), and σ_i ∼ Γ^{−1}(2, ·). We compute CVaR_α[r] from the data generated from such a mixture over 100 runs. We further estimate the composite risk with U_E, U_A = CVaR_α and the additive risk with U_A = CVaR_α. The results illustrated in Figure 2 show that the additive CVaR risk strictly underestimates the total CVaR risk computed from the data, whereas the composite risk is closer to the one computed from data. Specifically, for lower values of α, i.e. towards the extreme end of the left tail where events occur with low probability, the additive CVaR risk deviates significantly from the data whereas the composite measure yields a closer estimate. Such values of α are typically the interesting ones for risk-sensitive applications.
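A minimal sketch of the estimation in Example 1, under assumed hyperparameter values for the Dirichlet, Gaussian, and inverse-Gamma priors (the exact values are not legible in this copy). It contrasts the total CVaR computed directly from mixture samples with the composite estimate of Definition 1 (U_A = U_E = CVaR_α) and the additive estimate F_A, here read as the β-weighted average of per-model aleatory CVaRs.

```python
import numpy as np

def cvar(values, alpha, weights=None):
    """CVaR_alpha of a discrete distribution: mean of the lowest-value
    alpha-fraction of probability mass (left tail, cf. footnote 1)."""
    values = np.asarray(values, dtype=float)
    w = (np.full(values.size, 1.0 / values.size) if weights is None
         else np.asarray(weights, dtype=float))
    order = np.argsort(values)
    v, w = values[order], w[order]
    tail = np.clip(alpha - (np.cumsum(w) - w), 0.0, w)  # mass in the tail
    return float((v * tail).sum() / alpha)

rng = np.random.default_rng(0)
n_models, n_samples, alpha = 100, 2000, 0.25

# Mixture of 100 Gaussians; the prior hyperparameters are placeholders.
phi = rng.dirichlet(np.full(n_models, 0.5))        # Phi ~ Dir(.)
mu = rng.normal(0.0, 1.0, n_models)                # mu_i ~ N(0, .)
sigma = 1.0 / rng.gamma(2.0, 1.0, n_models)        # sigma_i ~ Gamma^{-1}(2, .)

# Samples z ~ Pr(. | theta_i) for every model theta_i.
z = rng.normal(mu[:, None], sigma[:, None], size=(n_models, n_samples))

# Total CVaR of the marginal distribution of r (the "Data" curve in Fig. 2).
total = cvar(z.ravel(), alpha, np.repeat(phi / n_samples, n_samples))

# Composite risk (Definition 1): U_E = CVaR over per-model aleatory CVaRs.
aleatory = np.array([cvar(z[i], alpha) for i in range(n_models)])
composite = cvar(aleatory, alpha, phi)

# Additive risk F_A: the beta-weighted average of per-model aleatory CVaRs.
additive = float(phi @ aleatory)

print(f"data: {total:.3f}  composite: {composite:.3f}  additive: {additive:.3f}")
```

Running this sketch reproduces the qualitative effect illustrated in Figure 2: the composite estimate tracks the total CVaR from data more closely than the additive one.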
5. Algorithm: SENTINEL-K
In this section, we outline the algorithmic details of SENTINEL-K as an ensemble of K distributional RL estimators, such as CDQN (Bellemare et al., 2017), along with an adaptation of FTRL for estimator selection. We further evaluate the composite risk using the return distribution estimated by SENTINEL-K for decision making.
Algorithm 1 SENTINEL-K with Composite Risk

Input: initial state s_0, action set A, risk measures U_A, U_E, hyperparameter λ, target networks [θ_1^−, ..., θ_K^−], value networks [θ_1, ..., θ_K], update schedules Γ_1, Γ_2.
1:  for t = 1, 2, ... do
2:    //* Update the K value and target networks for estimating
3:    return distributions *//
4:    for t' ∈ Γ_1 ∪ Γ_2 do
5:      Generate {D_1, ..., D_K} ← DataMask(D_{t'})
6:      for i = 1, ..., K do
7:        Sample mini-batch τ ∼ D_i
8:        Estimate F_C(Z(s_t, a) | U_A, U_E, β) using τ and the K target networks {θ_i^−}_{i=1}^K
9:        Get a* = arg max_a F_C(Z(s_t, a) | U_A, U_E, β)
10:       Update value network θ_i using τ, a*
11:       Update target network θ_i^− using τ, a* if t' ∈ Γ_1
12:     end for
13:   end for
14:   //* Estimate the composite risk of each action using the estimated return distributions *//
15:   for a ∈ A do
16:     Compute weights w = (w_1, ..., w_K) from Eq. (2)
17:     for i = 1, ..., K do
18:       Compute aleatory risk Q_A^i(s_t, a) from ∫_Z U_A(z) dP(z|θ_i)
19:     end for
20:     Compute composite risk over the weighted aleatory estimates: Q_C(s_t, a) = U_E(w · Q_A(s_t, a))
21:   end for
22:   //* Action selection *//
23:   Take action a_t = arg max_a Q_C(s_t, a)
24:   Observe s_t and update the dataset: D_t ← D_{t−1} ∪ {s_t, a_{t−1}, s_{t−1}, r_{t−1}}
25: end for

Sketch of the Algorithm. Pseudocode of SENTINEL-K with composite risk is given in Algorithm 1. Algorithm 1 has two main functional blocks: producing K estimates of the return distribution within the distributional RL framework (Lines 4-13), and using these K estimates to compute the composite risk of each action (Lines 15-21). Finally, following the mechanism of Q-learning, it chooses the action with the maximal composite risk in the decision-making step (Line 23).

In the first functional block, we specifically use an ensemble of K CDQNs. Each CDQN uses a target and a value network for estimating the return distribution. We set a schedule Γ_1 for updating the target networks and a more frequent schedule Γ_1 ∪ Γ_2 for updating the value networks. The details of this procedure are elaborated in Section 5.1.

The second functional block is used for decision-making and is iterated at every time step. It adapts the FTRL algorithm (Section 5.2) to aggregate the K estimated return distributions, composing the aleatory risks Q_A^i(s_t, a) of the individual estimators into a final estimate of the composite risk Q_C(s_t, a) for each action.

5.1 Bootstrapped Estimation of Return Distributions

The ensemble of SENTINEL-K consists of K distribution estimators. Each estimator gets its own dataset {D_i}_{i=1}^K ⊆ D, value network {θ_i}_{i=1}^K, and target network {θ_i^−}_{i=1}^K. The K datasets are created from the original dataset D by data masking (Line 5). For each transition (s_t, a_t, r_t, s_{t+1}), a fixed weight vector u_t ∈ [0, 1]^K is generated such that u_t^j ∼ Ber(p) for a fixed p ∈ (0, 1). Thus, each estimator has access, on average, to a p-fraction of the whole dataset.
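A minimal sketch of this data-masking step (Line 5 of Algorithm 1) is given below. The Bernoulli parameter p and the transition tuples are hypothetical; the paper's exact value of p is not legible in this copy.

```python
import numpy as np

def data_mask(dataset, k, p, rng):
    """Line 5 of Algorithm 1: split replay data into K bootstrapped views.

    A fixed Bernoulli(p) mask u_t^i is drawn once per (transition, estimator)
    pair, so estimator i trains on transition t iff u_t^i = 1 and sees,
    on average, a p-fraction of the whole dataset."""
    masks = rng.binomial(1, p, size=(len(dataset), k)).astype(bool)
    return [[tr for tr, keep in zip(dataset, masks[:, i]) if keep]
            for i in range(k)]

# Hypothetical usage with (s, a, r, s') transition tuples and p = 0.5.
rng = np.random.default_rng(0)
replay = [(0, 1, 0.5, 1), (1, 0, -0.2, 2), (2, 1, 1.0, 0), (0, 0, 0.1, 1)]
d1, d2, d3, d4 = data_mask(replay, k=4, p=0.5, rng=rng)
```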
After preparing the datasets for the estimators, the target and value networks of the CDQNs have to be updated and optimized. For the i-th estimator, this begins with sampling mini-batches of data τ from the respective dataset D_i (Line 7). Then, this data is used to compute the composite risk for all actions a ∈ A and to obtain a* (Lines 8-9). Obtaining the composite risk first involves estimating the aleatory risk Q_A^i(s_t, a) = ∫_Z U_A(z) dP(z|θ_i) for a particular estimator i. This quantity can be obtained by considering each of the estimators separately; however, when we turn to computing the epistemic risk, the estimators contribute to this risk jointly. Then, we compose the aleatory risks of all the estimators to compute Q_C(s_t, a) = Σ_i U_E(Q_A^i(s_t, a)). Finally, the optimal action a* = arg max_a Q_C(s_t, a) and the risk estimates Q_C(s_t, a) are used to update the value and target network parameters {θ_i}_{i=1}^K and {θ_i^−}_{i=1}^K (Lines 10-11) by minimising the cross-entropy loss between the current parameters and the projected Bellman update, as described in Bellemare et al. (2017).

Ensembles of estimators have been shown to outperform individual estimators (Wiering and Van Hasselt, 2008; Faußer and Schwenker, 2015; Osband et al., 2016; Pacchiano et al., 2020). Further, incorporating multiple estimators induces an uncertainty over the estimators. Because they have separate datasets, the estimators learn different parts of the MDP. Thus, the uncertainty over estimators acts as a quantifier of the model uncertainty. In Section 6, we show that this ensemble-based approach leads SENTINEL-K to superior performance.

Figure 3: Return distributions of a_0 and a_1 after n = 0, 1000, 5000, and 10000 samples respectively. The blue dashed line is the categorical approximation of Z(s_0, a_0) and Z(s_0, a_1) respectively (Oracle). The thick orange line is the marginal posterior ∫_Θ P(z|θ) dβ(θ) with SENTINEL-4 (risk-neutral). The thin lines are the posteriors of the individual estimators.

5.2 Aggregating Estimators with FTRL

The question now is how to aggregate the K estimated return distributions into one, such that the final estimate is as accurate as possible while the individual estimators may vary in learning progress and accuracy. Pacchiano et al. (2020) show that model selection can boost performance over model averaging. The rationale is that some estimators may be overly optimistic or pessimistic; by weighting such outliers less, we effectively obtain a more robust ensemble.

We adapt the Follow The Regularised Leader (FTRL) algorithm (Cesa-Bianchi and Lugosi, 2006), studied in bandits and online learning, for selecting among the estimators. FTRL puts exponentially more weight on an estimator depending on its accuracy in estimating the return distribution. Since we do not know the 'true' return distribution, we use the KL-divergence from the posterior of a single estimator i, P(z|θ_i), to the posterior marginalized over β(θ), i.e.

    l(θ_i, β) ≜ D_KL( ∫_Θ P(z|θ) dβ(θ) ‖ P(z|θ_i) ),

as a proxy for the estimation loss of estimator i. FTRL selects estimator i with weight

    w_i = exp(−λ l(θ_i, β)) / Σ_{j=1}^K exp(−λ l(θ_j, β)),    (2)

where λ ∈ [0, ∞) is a regularising parameter that determines to what extent estimators far away from the marginal estimator should be penalised.
If λ → 0, we obtain standard model averaging. If λ → ∞, it reduces to greedy selection. Having computed the weights w (Line 16), we compute the weighted composite risk measure by first computing the aleatory risk of each estimator, Q_A^i(s_t, a) = ∫_Z U_A(z) dP(z|θ_i) (Line 18), and then computing the composite risk Q_C(s_t, a) = U_E(w · Q_A(s_t, a)) (Line 20). Here, · : R^K × R^K → R^K is the pointwise product. We experimentally show that performing FTRL with a reasonable λ value, namely 1, leads to better performance.

SENTINEL-K reduces to a risk-neutral algorithm if we choose both U_A and U_E to be identity functions, and to an additive risk-sensitive algorithm if we choose U_E to be the identity. Designing it to accommodate composite risk provides this flexibility. We use risk-neutral SENTINEL-K to validate its efficiency in estimating return distributions, and SENTINEL-K with composite CVaR risk to perform risk-sensitive RL tasks.
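The sketch below illustrates the FTRL weighting of Eq. (2) and the aggregation of Lines 16-20, assuming the loss l(θ_i, β) is the KL divergence from the marginal categorical estimate to estimator i's estimate, and a sign convention where larger losses receive exponentially smaller weight. The posteriors, the atom grid, and the identity/sum utilities standing in for U_A and U_E are toy placeholders, not trained CDQN outputs.

```python
import numpy as np

def kl(p, q, eps=1e-12):
    """D_KL(p || q) between two categorical distributions on a shared grid."""
    p, q = np.clip(p, eps, 1.0), np.clip(q, eps, 1.0)
    return float((p * np.log(p / q)).sum())

def ftrl_weights(posteriors, lam):
    """Eq. (2): exponentially down-weight estimators whose posterior is far
    (in KL) from the marginal; lam -> 0 gives uniform averaging, and
    lam -> inf approaches greedy selection."""
    marginal = posteriors.mean(axis=0)          # beta uniform over estimators
    losses = np.array([kl(marginal, p) for p in posteriors])
    w = np.exp(-lam * losses)
    return w / w.sum()

def composite_q(posteriors, atoms, u_aleatory, u_epistemic, lam):
    """Q_C(s, a) = U_E(w . Q_A(s, a)), cf. Lines 16-20 of Algorithm 1."""
    q_a = np.array([u_aleatory(atoms, p) for p in posteriors])
    return u_epistemic(ftrl_weights(posteriors, lam) * q_a)

# Toy usage: K = 4 categorical return estimates over a shared atom grid.
rng = np.random.default_rng(0)
atoms = np.linspace(0.0, 10.0, 51)
posteriors = rng.dirichlet(np.ones(51), size=4)
mean_utility = lambda atoms, p: float(atoms @ p)   # identity-utility stand-in
print(composite_q(posteriors, atoms, mean_utility, np.sum, lam=1.0))
```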
6. Experimental Evaluation
In this section, we experimentally validate the performance of risk-neutral SENTINEL-K in terms of estimating the return distribution of different actions, and the improvement of FTRL over model averaging and greedy model selection. We also test the risk-sensitive performance of SENTINEL-K with composite CVaR risk in a reasonably large environment with a continuous state space. The settings and results for each of these three experiments are elaborated in the corresponding subsections. In all the experiments, we use 4 CDQNs in the ensemble and call the resulting agent SENTINEL-4.
Return Distribution Estimation.
In order to demonstrate the uncertainty estimation and convergence in distribution of the SENTINEL-K framework, we test SENTINEL-4 on an MDP environment with a known multimodal return distribution. The MDP contains three states and two actions such that the return distribution of a_0 from state s_0 is a mixture of Gaussians, Z(s_0, a_0) ∼ Σ_{i=0}^N Φ_i N(µ_i, σ_i), and the return distribution of action a_1 is Z(s_0, a_1) ∼ N(µ, σ). Here, Φ = [0.·, ·.·], µ = [1.·, ·.·], and σ = [0.·, ·.·].

Figure 4: Convergence in distribution of SENTINEL-4 (risk-neutral) and VDQN, measured by the Wasserstein distance between the categorical approximations of Z(s_0, a_0) and Z(s_0, a_1) and the distributions estimated by the two agents, for each action.
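The convergence metric of Figure 4 can be computed as in the following sketch, using scipy's 1-Wasserstein distance between weighted empirical distributions. The oracle and estimated categorical distributions below are hypothetical Gaussian-shaped histograms, not the actual agents' outputs.

```python
import numpy as np
from scipy.stats import wasserstein_distance

# Hypothetical categorical distributions on a shared atom grid: the oracle
# discretisation of Z(s0, a0) and an agent's current estimate of it.
atoms = np.linspace(0.0, 3.0, 51)

oracle = np.exp(-0.5 * ((atoms - 1.0) / 0.3) ** 2)
oracle /= oracle.sum()

estimate = np.exp(-0.5 * ((atoms - 1.2) / 0.4) ** 2)
estimate /= estimate.sum()

# 1-Wasserstein distance between the two weighted categorical distributions.
print(wasserstein_distance(atoms, atoms, oracle, estimate))
```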
FTRL vs. Average vs. Greedy. In order to demonstrate the performance of the model selection algorithm, we evaluate SENTINEL-4 in the CartPole-v0 environment (Brockman et al., 2016). This environment is a common testbed for continuous state-space RL tasks. In the environment, a reward of 1 is attained for every time step the pole is kept upright. If the pole falls to either side or if the number of time steps reaches 200, the episode is terminated.
Table 1: Performance of risk-neutral (VDQN, CDQN, SENTINEL-4) and risk-sensitive (SENTINEL-4 with additive and composite CVaRs) agents on highway-v1 with 10 vehicles. Results are reported over 20 runs. SENTINEL-4 with composite CVaR performs better.

Agent                      | Value ± σ    | Aleatory metric ± σ | Crashes ± σ
VDQN risk-neutral          | 23.· ± ·.59  | 14.· ± ·.60         | 1252.· ± ·.·
CDQN risk-neutral          |  ·.· ± ·.27  | 19.· ± ·.43         |  839.· ± ·.·
SENTINEL-4 risk-neutral    |  ·.· ± ·.45  | 20.· ± ·.58         |  617.· ± ·.·
SENTINEL-4 additive CVaR   |  ·.· ± ·.87  | 21.· ± ·.24         |  645.· ± ·.·
SENTINEL-4 composite CVaR  |  ·.· ± ·.60  | 24.· ± ·.40         |  341.· ± ·.·
This means that the undiscounted return attained per episode is within [0, 200]. We use V_min = 0 and V_max = (1 − γ^200)/(1 − γ) as the histogram support of CDQN. We choose [0.01, 0.1, 1.0, 4.6] as the different values of the regularising hyperparameter λ. As λ → 0, we are essentially doing standard model averaging. We expect this to have average performance, since all estimators are weighted equally; this means that it might be overly sensitive to estimator outliers. As λ → ∞, model selection becomes greedily biased towards the best average estimator. In fact, we expect performance to be poor when λ is too high, since almost all weight is put on one single estimator while the other estimators are not given a chance to improve. A sound value of λ is one that excludes outlier estimators while still involving most of the other estimators. We run each of the experiments for 10^· steps and average the results over 20 runs. Figure 5 shows performance in terms of the cumulative number of falls for the different λ values, with α = 0.25.

Figure 5: Performance and convergence of SENTINEL-4 (risk-neutral) for different values of the parameter λ ∈ {0.01, 0.1, 1.0, 4.6}. Shown is the number of falls (in log scale) in the CartPole environment. Experimental results are computed over 20 runs with different initialisations and the shaded region represents µ_t ± σ_t.

We observe that FTRL with a reasonable λ = 1.0 performs better than greedy selection with λ = 4.6 and near-uniform averaging with λ's 0.01 and 0.1. We also observe that for λ = 1 the variance of the total number of falls is significantly lower than for the other values. This indicates stability of performance.

Risk-sensitive Performance.
In order to demonstrate performance in a larger domain, we evaluate SENTINEL-4 in the highway environment (Leurent, 2018). Highway is an environment developed to test RL for autonomous driving. We use a version of the highway-v1 domain with five lanes, and ten vehicles in addition to the ego vehicle. In this environment, the episode is terminated if any of the vehicles crash or if the time elapsed is greater than 40 time steps. The reward function is a combination of multiple factors, including staying in the right lane, the ego-vehicle speed, and the speed of the other vehicles.

We test the risk-neutral CDQN and VDQN algorithms along with SENTINEL-4 with both additive and composite CVaRs. The typical performance metric for this scenario is the expected discounted return E^π_µ[R]. In order to test the risk-sensitive performance, we use two metrics. To measure the aleatory risk U_A[R | π, µ], we use CVaR as U_A with threshold α = 0.25. The CVaR metric is a statistic of the left tail of the return distribution, and higher values mean better performance in the 25% worst cases. Finally, as a proxy for the epistemic risk, we use the number of crashes (lower is better).

Experimental results are illustrated in Table 1 and Figure 6. From Table 1, we observe that our algorithm with composite risk achieves a higher value, a higher estimate of aleatory risk, and fewer crashes. Thus, SENTINEL-4 with composite CVaR outperforms the competing algorithms in all three metrics of risk-sensitive and risk-neutral performance.

Figure 6: The total number of crashes in the highway environment with 10 vehicles over 20 runs and horizon 10^·. Lower is better. The shaded region represents µ_t ± σ_t.

Additionally, we observe that the variance of the performance metrics over 20 runs is the lowest for our algorithm with the composite CVaR measure. This shows the stability of our algorithm, which is another demonstration of good risk-sensitive performance. Figure 6 resonates with these observations in terms of the total number of crashes.

Summary of Results.
Figure 4 shows that the SENTINEL-K framework estimates even multimodal return distributions more efficiently than classical distributional RL algorithms, such as VDQN. Figure 5 demonstrates that selecting λ is important in bootstrapped RL. We observe that FTRL yields better performance than model averaging (λ → 0) and greedy selection (λ → ∞). Figure 6 shows the risk-sensitive performance of VDQN, CDQN, and SENTINEL-4 with risk-neutral, additive, and composite CVaR risks on a large continuous-state environment. SENTINEL-4 with composite risk outperforms the competing algorithms in terms of the achieved value function and the estimated aleatory risk. It also causes the fewest crashes among the competing algorithms.
7. Discussion
In this paper, we study the problem of risk-sensitive RL and propose two main contributions. The first is the composite risk formulation, which quantifies the holistic effect of the aleatory and epistemic risks involved in the learning process. With a reductive experiment, we show that composite risk estimates the total risk involved in a problem more accurately than the additive formulation. The second is SENTINEL-K, which ensembles K distributional RL estimators, namely CDQNs, to provide an accurate estimate of the return distribution. We also reintroduce FTRL from the bandit literature as a means of model selection. FTRL weighs each estimator differently depending on how far it is from the average estimator. This leads to a better estimate of the composite risk over returns, and to better experimental performance than greedy selection and model averaging. Experiments also show that SENTINEL-K, even in a risk-neutral setting, estimates the return distributions of all the actions better, and achieves superior risk-sensitive performance when used with the composite CVaR estimate.

Motivated by the experimental performance of SENTINEL-K, we aim to investigate the theoretical properties of FTRL-driven bootstrapped distributional RL with and without composite risk estimates.
Acknowledgments
We would like to thank Dapeng Liu for fruitful discussions at the beginning of the project. This work was partially supported by the Wallenberg AI, Autonomous Systems and Software Program (WASP), funded by the Knut and Alice Wallenberg Foundation. The computations were performed on resources at the Chalmers Centre for Computational Science and Engineering (C3SE), provided by the Swedish National Infrastructure for Computing (SNIC).
References
Philippe Artzner, Freddy Delbaen, Jean-Marc Eber, and David Heath. Coherent measures of risk. Mathematical Finance, 9(3):203–228, 1999.

Marc G Bellemare, Will Dabney, and Rémi Munos. A distributional perspective on reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pages 449–458. JMLR.org, 2017.

Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. OpenAI Gym, 2016.

Yinlam C., Mohammad G., Lucas J., and Marco P. Risk-constrained reinforcement learning with percentile risk criteria, 2015.

Nicolo Cesa-Bianchi and Gábor Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006.

Y. Chow and M. Ghavamzadeh. Algorithms for CVaR optimization in MDPs. In Advances in Neural Information Processing Systems, pages 3509–3517, 2014.

Y. Chow, A. Tamar, S. Mannor, and M. Pavone. Risk-sensitive and robust decision-making: a CVaR optimization approach. In Advances in Neural Information Processing Systems, pages 1522–1530, 2015.

William R Clements, Benoît-Marie Robaglia, Bastien Van Delft, Reda Bahi Slaoui, and Sébastien Toth. Estimating risk and uncertainty in deep reinforcement learning. arXiv preprint arXiv:1905.09638, 2019.

Stefano P Coraluppi and Steven I Marcus. Risk-sensitive and minimax control of discrete-time, finite-state Markov decision processes. Automatica, 35(2):301–309, 1999.

W. Dabney, M. Rowland, Marc G. Bellemare, and R. Munos. Distributional reinforcement learning with quantile regression. In AAAI, 2018a.

Will Dabney, Mark Rowland, Marc G Bellemare, and Rémi Munos. Distributional reinforcement learning with quantile regression. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018b.

Stefan Depeweg, Jose-Miguel Hernandez-Lobato, Finale Doshi-Velez, and Steffen Udluft. Decomposition of uncertainty in Bayesian deep learning for efficient and risk-sensitive learning. In International Conference on Machine Learning, pages 1192–1201, 2018.

Armen Der Kiureghian and Ove Ditlevsen. Aleatory or epistemic? Does it matter? Structural Safety, 31(2):105–112, 2009.

Bradley Efron and Robert Tibshirani. The bootstrap method for assessing statistical accuracy. Behaviormetrika, 12(17):1–35, 1985.

Hannes Eriksson and Christos Dimitrakakis. Epistemic risk-sensitive reinforcement learning. arXiv preprint arXiv:1906.06273, 2019.

Stefan Faußer and Friedhelm Schwenker. Neural network ensembles in reinforcement learning. Neural Processing Letters, 41(1):55–69, 2015.

M. Friedman and L. J. Savage. The utility analysis of choices involving risk. The Journal of Political Economy, 56(4):279, 1948.

Javier García and Fernando Fernández. A comprehensive survey on safe reinforcement learning. Journal of Machine Learning Research, 16(1):1437–1480, 2015.

Matthias Heger. Consideration of risk in reinforcement learning. In William W. Cohen and Haym Hirsh, editors, Machine Learning Proceedings 1994, pages 105–111. Morgan Kaufmann, San Francisco (CA), 1994.

Ronald A Howard and James E Matheson. Risk-sensitive Markov decision processes. Management Science, 18(7):356–369, 1972.

Ravi Kumar Kolla, Prashanth L. A., Sanjay P. Bhat, and Krishna P. Jagannathan. Concentration bounds for empirical conditional value-at-risk: The unbounded case. Operations Research Letters, 47(1):16–20, 2019.

Edouard Leurent. An environment for autonomous driving decision-making. https://github.com/eleurent/highway-env, 2018.

A Rupam Mahmood, Dmytro Korenkevych, Gautham Vasan, William Ma, and James Bergstra. Benchmarking reinforcement learning algorithms on real-world robots. In Conference on Robot Learning, pages 561–591. PMLR, 2018.

Steven I Marcus, Emmanual Fernández-Gaucherand, Daniel Hernández-Hernandez, Stefano Coraluppi, and Pedram Fard. Risk sensitive Markov decision processes. In Systems and Control in the Twenty-First Century, pages 263–279. Springer, 1997.

O. Mihatsch and R. Neuneier. Risk-sensitive reinforcement learning. Machine Learning, 49(2-3):267–290, 2002.

Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.

Ian Osband, Charles Blundell, Alexander Pritzel, and Benjamin Van Roy. Deep exploration via bootstrapped DQN. In Advances in Neural Information Processing Systems, pages 4026–4034, 2016.

Aldo Pacchiano, Philip Ball, Jack Parker-Holder, Krzysztof Choromanski, and Stephen Roberts. On optimism in model-based reinforcement learning. arXiv preprint arXiv:2006.11911, 2020.

Xinlei Pan, Yurong You, Ziyan Wang, and Cewu Lu. Virtual to real reinforcement learning for autonomous driving. arXiv preprint arXiv:1704.03952, 2017.

Lerrel Pinto, James Davidson, Rahul Sukthankar, and Abhinav Gupta. Robust adversarial reinforcement learning. In Doina Precup and Yee Whye Teh, editors, Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 2817–2826, International Convention Centre, Sydney, Australia, 06–11 Aug 2017. PMLR.

L. A. Prashanth and Michael C. Fu. Risk-sensitive reinforcement learning: A constrained optimization viewpoint. arXiv, 2018.

R Tyrrell Rockafellar, Stanislav Uryasev, et al. Optimization of conditional value-at-risk. Journal of Risk, 2:21–42, 2000.

Mark Rowland, Robert Dadashi, Saurabh Kumar, Rémi Munos, Marc G Bellemare, and Will Dabney. Statistics and samples in distributional reinforcement learning. arXiv preprint arXiv:1902.08102, 2019.

Jay K. Satia and Roy E. Lave. Markovian decision processes with uncertain transition probabilities. Operations Research, 21(3):728–740, 1973.

Richard S Sutton and Andrew G Barto. Reinforcement Learning: An Introduction. MIT Press, 2018.

Giorgio Szegö. Measures of risk. Journal of Banking & Finance, 26(7):1253–1272, 2002.

A. Tamar, S. Mannor, and H. Xu. Scaling up robust MDPs using function approximation. In International Conference on Machine Learning, pages 181–189, 2014.

Aviv Tamar, Yonatan Glassner, and Shie Mannor. Optimizing the CVaR via sampling. In Twenty-Ninth AAAI Conference on Artificial Intelligence, 2015.

Yunhao Tang and Shipra Agrawal. Exploration by distributional reinforcement learning. arXiv preprint arXiv:1805.01907, 2018.

Marco A Wiering and Hado Van Hasselt. Ensemble algorithms in reinforcement learning. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), 38(4):930–936, 2008.