DRIBO: Robust Deep Reinforcement Learning via Multi-View Information Bottleneck
Jiameng Fan, Wenchao Li

Department of Electrical and Computer Engineering, Boston University, Boston, MA 02215, USA. Correspondence to: Jiameng Fan <[email protected]>, Wenchao Li <[email protected]>. Preprint.

Abstract
Deep reinforcement learning (DRL) agents are often sensitive to visual changes that were unseen in their training environments. To address this problem, we introduce a robust representation learning approach for RL. It relies on an auxiliary objective based on the multi-view information bottleneck (MIB) principle, which encourages learning representations that are both predictive of the future and less sensitive to task-irrelevant distractions. This enables us to train high-performance policies that are robust to visual distractions and can generalize to unseen environments. We demonstrate that our approach can achieve SOTA performance on challenging visual control tasks, even when the background is replaced with natural videos. In addition, we show that our approach outperforms well-established baselines on generalization to unseen environments using the large-scale Procgen benchmark.
1. Introduction
In reinforcement learning (RL), learning control from raw images in an end-to-end fashion is important for many applications. While deep reinforcement learning can train agents to control effectively from image inputs, it suffers from problems of overfitting to training environments (Zhang et al., 2018b;a; Yu et al., 2019). In particular, it has been observed that DRL agents perform poorly in environments different from those where the agents were trained, even when they are semantically equivalent to the training environment (Farebrother et al., 2018; Cobbe et al., 2019). By contrast, humans are able to adapt to new, unseen environments with similar underlying dynamics. For example, though visual observations can be drastically different when driving in different cities, human drivers quickly adjust to driving in a new city which they have never visited. We argue that humans can adapt to new scenarios because their driving skills are invariant to predominantly visual details that are not relevant to driving. Conversely, DRL agents without this ability are hindered from understanding the underlying task-relevant dynamics and thus can be distracted by task-irrelevant visual details (Zhang et al., 2021).

Figure 1: Robust Deep Reinforcement Learning via Multi-View Information BOttleneck (DRIBO) incorporates the inherent sequential structure of reinforcement learning and the multi-view information bottleneck principle into robust representation learning in RL. We consider sequential multi-view observations, $o^{(1)}_{1:T}$ and $o^{(2)}_{1:T}$, of the original sequential observation $o_{1:T}$, sharing the same task-relevant information, while any information not shared by them is task-irrelevant. DRIBO uses a multi-view information bottleneck loss to ensure that $s^{(1)}_{1:T}$ and $s^{(2)}_{1:T}$, the representations of the multi-view observations, share maximal task-relevant information while eliminating the task-irrelevant information. DRIBO trains the RL policy and (or) value function on top of the encoder.

Viewing from a representation learning perspective, a desired representation for RL should encode only task-relevant information in the environment, such as lane markings on the road for driving, while discarding excessive, task-irrelevant information, such as the shape of clouds in the sky. An RL agent that learns from such representations has the advantage of being more robust to visual changes. In addition, the resulting policy is more likely to generalize to unseen environments if the task-relevant information in the new environment remains similar to that in the training environments. Prior works (Hafner et al., 2019; Lee et al., 2020) that encode images into a low-dimensional latent space for RL typically rely on a reconstruction loss to learn representations that are sufficient to reconstruct the input images. While these approaches can learn representations that retain information in the visual observations, they do nothing to discard the irrelevant information.

We tackle this problem by learning robust representations for RL based on the multi-view information bottleneck (MIB) principle (Tishby et al., 2000; Federici et al., 2020). In the multi-view setting, we assume each view provides the same task-relevant information while all the information not shared by them is task-irrelevant (Zhang et al., 2018b). Data augmentation can easily be leveraged to generate such multi-view observations without requiring additional new data. Incorporating data augmentation into RL has shown promising results for visual control tasks (Laskin et al., 2020; Lange et al., 2012). However, these methods rarely exploit the sequential aspect of RL, which requires the learned representations to be predictive of the future. Instead of learning representations for each individual visual observation, we propose to learn a mapping from a sequence of observations to a sequence of representations given actions. Our approach exploits the fact that a robust RL agent, when operating under different views of the same environment, should exhibit similar behaviors. To enforce this similarity, the agent is optimized to learn robust representations that contain maximal task-relevant information and minimal task-irrelevant information.
Concretely, we introduce a new MIB objective that maximizes the mutual information between sequences of observations and representations while reducing the task-irrelevant information identified through the multi-view observations. We incorporate this MIB objective into RL by optimizing RL objectives on top of the learned encoder. We illustrate our proposed approach in Figure 1. Our contributions are summarized below.

• We propose DRIBO, a novel technique to learn robust representations in RL by identifying and discarding task-irrelevant information in the representations based on the multi-view information bottleneck principle.

• We leverage the sequential aspect of RL and define a new MIB objective that maximizes mutual information between sequences of representations and observations while disregarding task-irrelevant information, without requiring reconstruction.

• Empirically, we show that our approach can (i) lead to better robustness against task-irrelevant distractors on the DeepMind Control Suite and (ii) significantly improve generalization on the Procgen benchmark compared to the current state of the art.
2. Related Work
Reconstruction-based Representation Learning.
Early works first trained autoencoders to learn representations that are sufficient to reconstruct raw observations, and then trained the RL agent on the learned representations (Lange & Riedmiller, 2010; Lange et al., 2012). However, there is no guarantee that the agent will capture useful information for control. To address this problem, learning the encoder and dynamics jointly has proved effective in learning task-oriented representations (Wahlström et al., 2015; Watter et al., 2015). More recently, Hafner et al. (2019; 2020) and Lee et al. (2020) learn a latent dynamics model and train RL agents with predictive latent representations. However, these approaches suffer from embedding all details into the representations, even when they are task-irrelevant, because improving reconstruction quality from representations to visual observations forces the representations to retain more details. Despite success on many benchmarks, task-irrelevant visual changes can affect performance significantly (Zhang et al., 2018a). Experimentally, we show that our non-reconstructive approach, DRIBO, is substantially more robust against this type of visual change than prior works. We also compare DRIBO with the recently introduced DBC (Zhang et al., 2021), which uses bisimulation metrics to learn representations in RL that contain only task-relevant information without requiring reconstruction.
Contrastive Representation Learning.
Contrastive representation learning methods train an encoder that obeys similarity constraints in a dataset typically organized into similar and dissimilar pairs. The similar examples are typically obtained from nearby image patches (Oord et al., 2018; Hénaff et al., 2020) or through data augmentation (Chen et al., 2020). Contrastive models encourage similarity between features in representations using a variety of objectives. A scoring function that lower-bounds mutual information is one of the typical objectives to be maximized (Belghazi et al., 2018; Oord et al., 2018; Hjelm et al., 2019; Poole et al., 2019). A number of works have applied the above ideas to RL settings. EMI (Kim et al., 2019) applies a Jensen-Shannon divergence-based lower bound on mutual information across subsequent frames as an exploration bonus. DRIML (Mazoure et al., 2020) uses an auxiliary contrastive objective that maximizes concordance between representations, conditioned on actions, to increase the predictive properties of the representations. However, maximizing a lower bound on mutual information retains all the information, including the task-irrelevant information (Federici et al., 2020).
Multi-View Information Bottleneck (MIB).
The multi-view setting relies on a basic assumption that each view provides the same task-relevant information while all the information not shared by the views is task-irrelevant (Zhao et al., 2017). In classification, Federici et al. (2020) use MIB to maximize the mutual information between the representations of two views while eliminating the label-irrelevant information. However, MIB cannot be directly used in RL settings due to the sequential nature of these decision-making problems. Task-relevant information in RL is relevant because it influences not only the current control and reward but also states and rewards in the future, which requires representations to be predictive of future representations. Our work, DRIBO, learns robust representations with a predictive model that maximizes the mutual information between sequences of representations and observations, while eliminating task-irrelevant information based on the information bottleneck principle. Learning a predictive model also provides richer learning signals than those given by an individual observation and reward alone, which helps to reduce sample complexity. In addition to representation learning, MVRL (Li et al., 2019) uses the multi-view setting to form a generalization of the partially observable Markov decision process, which substantially reduces sample complexity when RL agents are trained on it.
3. Preliminaries
We denote a Markov Decision Process (MDP) as $\mathcal{M}$, with state $s$, action $a$, and reward $r$. We denote a policy on $\mathcal{M}$ as $\pi$. The agent's goal is to learn a policy $\pi$ that maximizes the cumulative rewards.

We consider sufficiency of representations from two perspectives. The first is the ability to derive optimal actions from the representations. The second is the ability to be predictive of future representations. We consider ideal latent representations as the states of some underlying MDP that models only task-relevant dynamics. DRL agents learn from visual observations by treating them as states. However, they rely on the heuristic of using consecutive observations to implicitly capture the predictive property. Besides, the visual observations contain far more excessive detail than the underlying states.

Thus, instead of mapping a single-step observation to a state, we consider encoding a sequence of observations into a sequence of states. This also relaxes the requirement of using consecutive visual observations, since the history of observations is considered. We define $\mathcal{S} \subseteq \mathbb{R}^d$ as the state-representation space. The visual observations are $o \in \mathcal{O}$. Let $a^*_{1:T}$ be the optimal action sequence for a sequence of observations $o_{1:T}$, where $T$ is the length. We assume that $o_{1:T}$ contains enough information to obtain $a^*_{1:T}$, which maximizes the cumulative rewards.

With the above assumption, we define a piece of information as task-relevant if it is minimally sufficient to derive $a^*_{1:T}$. In contrast, task-irrelevant information does not contribute to the choice of $a^*_{1:T}$. We first consider sufficient representations that are discriminative enough to obtain $a^*_{1:T}$. This property can be quantified by the mutual information between $o_{1:T}$ and $a^*_{1:T}$ and the mutual information between $s_{1:T}$ and $a^*_{1:T}$.

Definition 1.
A sequence of representations $s_{1:T}$ of $o_{1:T}$ is sufficient for RL iff $I(o_{1:T}; a^*_{1:T}) = I(s_{1:T}; a^*_{1:T})$.

RL agents that have access to a sufficient representation $s_t$ at timestep $t$ must be able to generate $a^*_t$ as if they had access to the original observations. This can be better understood by subdividing $I(o_{1:T}; s_{1:T})$ into two components using the chain rule of mutual information:
$$I(o_{1:T}; s_{1:T}) = I(o_{1:T}; s_{1:T} \mid a^*_{1:T}) + I(s_{1:T}; a^*_{1:T}) \quad (1)$$
The conditional mutual information $I(o_{1:T}; s_{1:T} \mid a^*_{1:T})$ quantifies the information in $s_{1:T}$ that is task-irrelevant. $I(s_{1:T}; a^*_{1:T})$ quantifies the task-relevant information that is accessible from the representation. Note that the last term is independent of the representation as long as $s_t$ is sufficient for $a^*_t$ (see Definition 1). Thus, a representation contains minimal task-irrelevant information whenever $I(o_{1:T}; s_{1:T} \mid a^*_{1:T})$ is minimized. To obtain sufficiency, we can maximize the mutual information $I(o_{1:T}; s_{1:T})$.

With the information bottleneck principle, we can construct an objective that maximizes $I(o_{1:T}; s_{1:T})$ while minimizing $I(o_{1:T}; s_{1:T} \mid a^*_{1:T})$ to reduce task-irrelevant information. However, minimizing $I(o_{1:T}; s_{1:T} \mid a^*_{1:T})$ can be done directly only in supervised settings where $a^*_{1:T}$ is observed. In addition, the mutual information between sequences poses challenges for estimation. While MIB can reduce task-irrelevant information in the representations in an unsupervised setting (Federici et al., 2020), that strategy only considers a single observation and its representation. MIB does not guarantee that the learned representations retain the important sequential structure of RL. In the next section, we describe how we extend MIB to RL settings.
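For concreteness, the objective implied by Equation 1 can be written as a relaxed Lagrangian over the encoder parameters $\theta$. This is only a restatement of the argument above in its supervised form, which cannot be optimized directly since $a^*_{1:T}$ is unobserved; the multiplier $\lambda$ is a generic trade-off coefficient and not a quantity defined in this paper, and the unsupervised surrogate actually optimized by DRIBO is derived in Section 4:
$$\min_{\theta} \; I_\theta\big(o_{1:T};\, s_{1:T} \mid a^*_{1:T}\big) \;-\; \lambda \, I_\theta\big(s_{1:T};\, a^*_{1:T}\big), \qquad \lambda > 0$$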
4. DRIBO
DRIBO learns robust representations that are predictive of future representations while discarding task-irrelevant information for control. To learn such representations, we construct a new MIB objective that (i) maximizes the mutual information between sequences of observations and representations, $I(s_{1:T}; o_{1:T} \mid a_{1:T})$, and (ii) quantifies and reduces task-irrelevant information in the representations based on the multi-view setting.

To generalize the mutual information between sequences of observations and representations given any action sequence, we consider maximizing the conditional mutual information $I(s_{1:T}; o_{1:T} \mid a_{1:T})$. The observations temporally evolve in the environment by executing the conditioned actions. This conditional mutual information not only estimates the sufficiency of the representations but also maintains the sequential structure of RL problems.

However, the large dimension of the sequential data makes it challenging to estimate the mutual information. We first factorize the mutual information between the two sequences into mutual information at each timestep.

Theorem 1.
Let $o_{1:T}$ be the observation sequence obtained by executing action sequence $a_{1:T}$. Let $s_{1:T}$ be a sequence of sufficient representations for $o_{1:T}$. Then
$$I(s_{1:T}; o_{1:T} \mid a_{1:T}) \;\geq\; \sum_{t=1}^{T} I(s_t; o_t \mid s_{t-1}, a_{t-1}) \quad (2)$$

Proof.
Let $H(\cdot)$ denote the entropy of a random variable, and let $X$ and $Y$ be two random variables. The mutual information between them can be expressed as $I(X;Y) = H(X) - H(X \mid Y)$. The proof applies the chain rule for entropy, $H(X_1, X_2, \ldots, X_n) = \sum_{i=1}^{n} H(X_i \mid X_{i-1}, \ldots, X_1)$, and the nonnegativity of mutual information. The last step uses the Markov property of state transitions.
$$\begin{aligned}
I(s_{1:T}; o_{1:T} \mid a_{1:T}) &= H(s_{1:T} \mid a_{1:T}) - H(s_{1:T} \mid o_{1:T}, a_{1:T}) \\
&= \sum_t \big( H(s_t \mid a_{1:T}, s_{1:t-1}) - H(s_t \mid a_{1:T}, o_{1:T}, s_{1:t-1}) \big) \\
&= \sum_t I(s_t; o_{1:T} \mid a_{1:T}, s_{1:t-1}) \\
&= \sum_t \big( H(o_{1:T} \mid a_{1:T}, s_{1:t-1}) - H(o_{1:T} \mid s_t, a_{1:T}, s_{1:t-1}) \big) \\
&= \sum_t \sum_\tau \big( H(o_\tau \mid a_{1:T}, s_{1:t-1}, o_{1:\tau-1}) - H(o_\tau \mid s_t, a_{1:T}, s_{1:t-1}, o_{1:\tau-1}) \big) \\
&= \sum_t \sum_\tau I(s_t; o_\tau \mid a_{1:T}, s_{1:t-1}, o_{1:\tau-1}) \\
&\geq \sum_t I(s_t; o_t \mid a_{1:T}, s_{1:t-1}, o_{1:t-1}) \\
&= \sum_t I(s_t; o_t \mid s_{t-1}, a_{t-1})
\end{aligned}$$

With Theorem 1, we show that the sum of mutual information terms $I(s_t; o_t \mid s_{t-1}, a_{t-1})$ over timesteps is a lower bound on the mutual information $I(s_{1:T}; o_{1:T} \mid a_{1:T})$. Even when the representations $s_{1:T}$ are not sufficient, maximizing $I(s_t; o_t \mid s_{t-1}, a_{t-1})$ encodes more details into $s_t$, which will make it sufficient and satisfy Equation 2. The factorized mutual information is conditioned on the representation and the action at $t-1$, which explicitly retains the predictive information for future representations.

To learn sufficient representations with minimal task-irrelevant information, we consider a multi-view setting to identify the task-irrelevant information without supervision. Consider $o^{(1)}_t$ and $o^{(2)}_t$ to be two visual images of the control scenario from different viewpoints. Assume that the optimal action $a^*_t$ can be derived from both $o^{(1)}_t$ and $o^{(2)}_t$ conditioned on the representation and action at $t-1$. Then, any representation $s_t$ that contains all information accessible from both views and is predictive of future representations contains sufficient task-relevant information. Furthermore, if $s_t$ captures only the details that are visible from both observations, it eliminates the view-specific details and reduces the sensitivity of the representation to view changes.

A sufficient representation in RL maintains all information that is shared by mutually redundant observations $o^{(1)}_t$ and $o^{(2)}_t$. We refer to Appendix A for the sufficiency condition on representations and the mutual redundancy condition between $o^{(1)}_t$ and $o^{(2)}_t$. Intuitively, under the mutual redundancy condition, any representation that contains all information shared by both views is as task-relevant as the joint observation. By factorizing the mutual information between $s^{(1)}_t$ and $o^{(1)}_t$ as in Equation 1, we can identify two components:
$$I(s^{(1)}_t; o^{(1)}_t \mid s^{(1)}_{t-1}, a_{t-1}) = I(s^{(1)}_t; o^{(1)}_t \mid s^{(1)}_{t-1}, a_{t-1}, o^{(2)}_t) + I(o^{(2)}_t; s^{(1)}_t \mid s^{(1)}_{t-1}, a_{t-1}) \quad (3)$$
Here, $s^{(1)}_{t-1}$ is a representation of visual observation $o^{(1)}_{t-1}$. Since we assume mutual redundancy between the two views, the information shared between $o^{(1)}_t$ and $s^{(1)}_t$ conditioned on $o^{(2)}_t$ must be irrelevant to the task, which is quantified by $I(s^{(1)}_t; o^{(1)}_t \mid s^{(1)}_{t-1}, a_{t-1}, o^{(2)}_t)$ (first term in Equation 3). Then, $I(o^{(2)}_t; s^{(1)}_t \mid s^{(1)}_{t-1}, a_{t-1})$ has to be maximal if the representation is sufficient.
The formal description of the above statement can be found in Appendix A.

The less the two views have in common, the less task-irrelevant information can be encoded into the representations without violating sufficiency, and consequently, the less sensitive the resulting representation is to task-irrelevant nuisances. In the extreme, we can show that $s^{(1)}_t$ recovers the underlying state of the MDP if $o^{(1)}_t$ and $o^{(2)}_t$ share only task-relevant information. With Equations 2 and 3, we obtain the multi-view loss $\mathcal{L}_{\mathrm{MV}}$, which maintains the temporally evolving information of the underlying dynamics:
$$\mathcal{L}_{\mathrm{MV}} = -\sum_t \big( I(s^{(1)}_t; o^{(1)}_t \mid s^{(1)}_{t-1}, a_{t-1}, o^{(2)}_t) + I(o^{(2)}_t; s^{(1)}_t \mid s^{(1)}_{t-1}, a_{t-1}) \big)$$
The above loss extends MIB to RL; minimizing it (equivalently, maximizing the summed mutual information terms) learns representations that are sufficient and predictive of future representations. The multi-view observations can be trivially obtained with random data augmentation techniques so that each view is augmented differently.
In Section 4.2, we show how to factorize the mutual information between sequences of observations and representations into mutual information at each timestep. This enables us to obtain sufficient representations by maximizing the mutual information between $s^{(1)}_t$ and $o^{(1)}_t$ while discarding task-irrelevant information by reducing $I(s^{(1)}_t; o^{(1)}_t \mid s^{(1)}_{t-1}, a_{t-1}, o^{(2)}_t)$. The predictive properties are ensured by conditioning the mutual information on the previous timestep's representation and action. With the information bottleneck principle, we can construct a relaxed Lagrangian loss to obtain a sufficient representation $s^{(1)}_t$ for $o^{(1)}_t$ with minimal task-irrelevant information:
$$\mathcal{L}_1(\theta; \lambda_1) = I_\theta(s^{(1)}_t; o^{(1)}_t \mid s^{(1)}_{t-1}, a_{t-1}, o^{(2)}_t) - \lambda_1 I_\theta(o^{(2)}_t; s^{(1)}_t \mid s^{(1)}_{t-1}, a_{t-1}) \quad (4)$$
where $\theta$ denotes the parameters of the encoder $p_\theta(s^{(1)}_t \mid o^{(1)}_t, s^{(1)}_{t-1}, a_{t-1})$, and $\lambda_1$ is the Lagrangian multiplier. Symmetrically, we define a loss $\mathcal{L}_2$ to learn a sufficient representation $s^{(2)}_t$ for $o^{(2)}_t$ with minimal task-irrelevant information:
$$\mathcal{L}_2(\theta; \lambda_2) = I_\theta(s^{(2)}_t; o^{(2)}_t \mid s^{(2)}_{t-1}, a_{t-1}, o^{(1)}_t) - \lambda_2 I_\theta(o^{(1)}_t; s^{(2)}_t \mid s^{(2)}_{t-1}, a_{t-1}) \quad (5)$$
By re-parameterizing the Lagrangian multipliers, the average of the two loss functions $\mathcal{L}_1$ and $\mathcal{L}_2$ from the two views at timestep $t$ can be upper bounded as follows:
$$\mathcal{L}_t(\theta; \beta) = -I_\theta(s^{(1)}_t; s^{(2)}_t \mid s_{t-1}, a_{t-1}) + \beta\, D_{\mathrm{SKL}}\big( p_\theta(s^{(1)}_t \mid o^{(1)}_t, s^{(1)}_{t-1}, a_{t-1}) \,\|\, p_\theta(s^{(2)}_t \mid o^{(2)}_t, s^{(2)}_{t-1}, a_{t-1}) \big) \quad (6)$$
where $s_{t-1}$ is a sufficient representation, $D_{\mathrm{SKL}}$ denotes the symmetrized KL divergence obtained by averaging the expected values of $D_{\mathrm{KL}}\big( p_\theta(s^{(1)}_t \mid o^{(1)}_t, s^{(1)}_{t-1}, a_{t-1}) \,\|\, p_\theta(s^{(2)}_t \mid o^{(2)}_t, s^{(2)}_{t-1}, a_{t-1}) \big)$ and $D_{\mathrm{KL}}\big( p_\theta(s^{(2)}_t \mid o^{(2)}_t, s^{(2)}_{t-1}, a_{t-1}) \,\|\, p_\theta(s^{(1)}_t \mid o^{(1)}_t, s^{(1)}_{t-1}, a_{t-1}) \big)$, and the coefficient $\beta$ represents the trade-off between sufficiency and sensitivity to task-irrelevant information. $\beta$ is a hyper-parameter in this work.

To generalize the above loss to sequential data in RL, we apply Theorem 1 to obtain the DRIBO loss:
$$\mathcal{L}_{\mathrm{DRIBO}} = \frac{1}{T} \sum_{t=1}^{T} \mathcal{L}_t(\theta; \beta) \quad (7)$$
Algorithm 1 DRIBO Loss
input: Batch $B$ sampled from the replay buffer storing $N$ sequential observations and actions of length $T$.
Apply random augmentation transformations on $B$ to obtain multi-view batches $B^{(1)}$ and $B^{(2)}$.
for $i, (o^{(1)}_{1:T}, o^{(2)}_{1:T}, a_{1:T})$ in enumerate($B^{(1)}, B^{(2)}$) do
  for $t = 1$ to $T$ do
    We substitute $s_{t-1}$ with $s^{(1)}_{t-1}$ and $s^{(2)}_{t-1}$ given the multi-view assumption. {Analysis in Appendix B}
    $s^{(1)}_t \sim p_\theta(s^{(1)}_t \mid o^{(1)}_t, s^{(1)}_{t-1}, a_{t-1})$
    $s^{(2)}_t \sim p_\theta(s^{(2)}_t \mid o^{(2)}_t, s^{(2)}_{t-1}, a_{t-1})$
    $(s^{(1)}_{t + T(i-1)}, s^{(2)}_{t + T(i-1)}) \leftarrow (s^{(1)}_t, s^{(2)}_t)$
  end for
  $\mathcal{L}^{i}_{\mathrm{SKL}} = \frac{1}{T}\sum_{t=1}^{T} D_{\mathrm{SKL}}\big(p_\theta(s^{(1)}_t) \,\|\, p_\theta(s^{(2)}_t)\big)$
end for
return $-\hat{I}_\psi\big(\{(s^{(1)}_i, s^{(2)}_i)\}_{i=1}^{T \cdot N}\big) + \frac{\beta}{N}\sum_{i=1}^{N} \mathcal{L}^{i}_{\mathrm{SKL}}$
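To make the per-batch computation concrete, the following is a minimal PyTorch-style sketch of the quantities Algorithm 1 combines, assuming the recurrent encoder has already produced diagonal-Gaussian posterior parameters (means and standard deviations) for both views at every timestep, flattened here into a single batch dimension of size $T \cdot N$. The cosine-similarity InfoNCE critic and the names `gaussian_kl`, `info_nce`, `dribo_loss`, and `beta` are illustrative choices, not the paper's implementation of $\hat{I}_\psi$.

```python
import torch
import torch.nn.functional as F

def gaussian_kl(mu_p, std_p, mu_q, std_q):
    """KL( N(mu_p, diag(std_p^2)) || N(mu_q, diag(std_q^2)) ), summed over latent dims."""
    var_p, var_q = std_p.pow(2), std_q.pow(2)
    kl = 0.5 * ((var_p + (mu_p - mu_q).pow(2)) / var_q - 1.0
                + 2.0 * (std_q.log() - std_p.log()))
    return kl.sum(-1)

def info_nce(z1, z2, temperature=0.1):
    """InfoNCE lower bound on I(z1; z2): matched rows are positives, the rest negatives."""
    z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
    logits = z1 @ z2.t() / temperature                 # (B, B) similarity matrix
    labels = torch.arange(z1.size(0), device=z1.device)
    return -F.cross_entropy(logits, labels)            # higher = tighter bound

def dribo_loss(mu1, std1, mu2, std2, beta=1e-3):
    """DRIBO loss for one batch of paired multi-view posteriors, shape (T * N, d)."""
    z1 = mu1 + std1 * torch.randn_like(std1)           # reparameterized samples s^(1)_t
    z2 = mu2 + std2 * torch.randn_like(std2)           # reparameterized samples s^(2)_t
    skl = 0.5 * (gaussian_kl(mu1, std1, mu2, std2)
                 + gaussian_kl(mu2, std2, mu1, std1)).mean()
    return -info_nce(z1, z2) + beta * skl
```

Minimizing the returned value maximizes the InfoNCE lower bound on $I_\theta(s^{(1)}_t; s^{(2)}_t \mid s_{t-1}, a_{t-1})$ while penalizing disagreement between the two posteriors, mirroring Equation 6.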
We summarize the batch-based computation of the loss function in Algorithm 1. We sample $s^{(1)}_t$ and $s^{(2)}_t$ from $p_\theta(s^{(1)}_t \mid o^{(1)}_t, s^{(1)}_{t-1}, a_{t-1})$ and $p_\theta(s^{(2)}_t \mid o^{(2)}_t, s^{(2)}_{t-1}, a_{t-1})$, respectively. Although the first term in Equation 6 is conditioned on $s_{t-1}$, we prove that this sampling process does not affect its effectiveness based on the multi-view assumption in Appendix B. The symmetrized KL divergence term can be computed from the probability densities of $s^{(1)}_t$ and $s^{(2)}_t$ estimated by the encoder. The mutual information between the two representations, $I_\theta(s^{(1)}_t; s^{(2)}_t \mid s_{t-1}, a_{t-1})$, can be maximized by using any sample-based differentiable mutual information lower bound $\hat{I}_\psi(s^{(1)}_t, s^{(2)}_t)$, where $\psi$ denotes the learnable parameters. We use InfoNCE (Oord et al., 2018) to estimate mutual information since the multi-view setting provides a large number of negative examples. The positive pairs are the representations $(s^{(1)}_t, s^{(2)}_t)$ of the multi-view observations generated from the same observation. The remaining pairs of representations within the same batch are used as negative pairs. The full derivation of the DRIBO loss function can be found in Appendix B.

The encoder $p_\theta(s_t \mid o_t, s_{t-1}, a_{t-1})$ approximates representation posteriors from the current observation and the previous timestep's representation and action. The posteriors can also be seen as a reparameterization of $p_\theta(s_{1:T} \mid o_{1:T}, a_{1:T}) = \prod_t p_\theta(s_t \mid o_t, s_{t-1}, a_{t-1})$, which explicitly maintains the inherent sequential structure of RL.

We implement the encoder as a recurrent state space model (RSSM (Hafner et al., 2019)) with a convolutional neural network (CNN) applied to the visual observations. RSSM is a latent dynamics model with an expressive recurrent neural network that performs accurate long-term prediction. We split the representation $s_t$ into a stochastic part $z_t$ and a deterministic part $h_t$, where $s_t = (z_t, h_t)$. The generative and inference models of RSSM are defined as:
$$\begin{aligned}
\text{Deterministic state transition:} \quad & h_t = f(h_{t-1}, z_{t-1}, a_{t-1}) \\
\text{Stochastic state transition:} \quad & z_t \sim p(z_t \mid h_t) \\
\text{Observation model:} \quad & o_t \sim p(o_t \mid h_t, z_t)
\end{aligned}$$
where $f(h_{t-1}, z_{t-1}, a_{t-1})$ is implemented as a recurrent neural network (RNN) that carries the dependency on the stochastic and deterministic parts at the previous timestep. We then obtain the representation with the encoder $p_\theta(s_{1:T} \mid o_{1:T}, a_{1:T}) = \prod_t p_\theta(s_t \mid o_t, h_t)$, where $h_t$ retains information from $s_{t-1} = (z_{t-1}, h_{t-1})$ and $a_{t-1}$. The encoder architecture based on the RSSM model encourages the representations to be predictive of future states, which aligns with the key property of DRIBO.

We simultaneously train our representation learning model with the RL agent by adding $\mathcal{L}_{\mathrm{DRIBO}}$ (Algorithm 1) as an auxiliary objective during training. The multi-view observations required by DRIBO can be trivially obtained from the same experience replay of the RL agent with data augmentation. The policy and (or) value function in RL directly takes the representation $s_t$ of the original visual observation $o_t$, allowing the RL models to backpropagate into our encoder. This further improves performance by adapting the learned representations to the policy/value function.
As a result, DRIBO encourages the agent to learn the underlying task-relevant dynamics of the environments while being robust against visual changes that are task-irrelevant.

We demonstrate the effectiveness of DRIBO by building the agents on top of SAC (Haarnoja et al., 2018) and PPO (Schulman et al., 2017) in Section 5.1 and Section 5.2, respectively. More details can be found in Appendix C.
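To make the recurrent encoder described above concrete, here is a minimal PyTorch-style sketch of a single posterior step, assuming a GRU cell for the deterministic path and a feature vector already extracted from the observation by a CNN. The module layout and activation choices are illustrative; only the default sizes (a 200-dimensional deterministic state and a 30-dimensional diagonal-Gaussian stochastic state) follow the dimensions reported in Appendix C.

```python
import torch
import torch.nn as nn

class RecurrentEncoder(nn.Module):
    """One-step posterior p_theta(s_t | o_t, s_{t-1}, a_{t-1}) with s_t = (z_t, h_t)."""

    def __init__(self, obs_feat_dim, action_dim, stoch_dim=30, deter_dim=200, hidden=200):
        super().__init__()
        self.pre_rnn = nn.Sequential(nn.Linear(stoch_dim + action_dim, hidden), nn.ELU())
        self.rnn = nn.GRUCell(hidden, deter_dim)                      # deterministic path h_t
        self.posterior = nn.Sequential(                               # q(z_t | h_t, o_t)
            nn.Linear(deter_dim + obs_feat_dim, hidden), nn.ELU(),
            nn.Linear(hidden, 2 * stoch_dim))

    def forward(self, obs_feat, action, z_prev, h_prev):
        # h_t = f(h_{t-1}, z_{t-1}, a_{t-1})
        h = self.rnn(self.pre_rnn(torch.cat([z_prev, action], -1)), h_prev)
        # Diagonal-Gaussian posterior over the stochastic part z_t
        mean, std = self.posterior(torch.cat([h, obs_feat], -1)).chunk(2, -1)
        std = nn.functional.softplus(std) + 0.1                       # keep std positive
        z = mean + std * torch.randn_like(std)                        # reparameterized sample
        return z, h, mean, std                                        # s_t = (z_t, h_t)
```

In DRIBO the same encoder parameters $\theta$ are applied to both augmented views at every timestep, so the symmetrized KL term in Equation 6 simply compares the two (mean, std) pairs this module returns.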
5. Experiment
We experimentally evaluate DRIBO on a variety of visual control tasks. We designed the experiments to compare DRIBO to the current best methods in the literature on: (i) the effectiveness of solving visual control tasks, (ii) robustness against task-irrelevant distractors, and (iii) the ability to generalize to unseen environments.

For effectiveness, we demonstrate performance on the DeepMind Control Suite (DMC) (Tassa et al., 2018) with no distractors. The DMC suite provides qualitatively different visual control challenges. For robustness, we investigate whether our DRIBO agent can ignore high-dimensional visual distractors that are task-irrelevant in the DMC environments when the backgrounds are replaced with natural videos from the Kinetics dataset (Kay et al., 2017). For generalization, we present results on Procgen (Cobbe et al., 2020), which provides different levels of the same game to test how well agents generalize to unseen levels. Since DRIBO does not assume that the observation at each timestep provides full observability of the underlying dynamics, we use single-step observations to train the DRIBO representations. By contrast, current SOTA approaches require the use of consecutive observations to capture the predictive property implicitly. For the DMC suite, all agents are built on top of SAC, an off-policy RL algorithm. For the Procgen suite, we augment PPO, an on-policy RL baseline on Procgen, with DRIBO. Implementation details are given in Appendix C.

Figure 2: The left images are observations in DMC in the clean setting. The center images are observations in DMC using natural videos as background. The right images are the spatial attention maps of the encoder for the center images.

First, we focus on studying the effectiveness and robustness of DRIBO-trained agents. We evaluate our approach on DMC under 'clean' settings (without distractors), as well as much more difficult settings with distractors.

We compare DRIBO against several baselines. The first is RAD (Laskin et al., 2020), a recent method that uses augmented data to train pixel-based policies and achieved state-of-the-art performance on DMC. The second is SLAC (Lee et al., 2020), a SOTA representation learning method for RL that learns a dynamics model with a reconstruction loss. Finally, we compare with DBC (Zhang et al., 2021), which is the most similar work to ours. DBC learns an invariant representation based on bisimulation metrics without requiring reconstruction. For RAD and DRIBO, we apply crop + random grayscale to obtain augmented data and multi-view observations across different settings. All methods compared in our experiments stack 3 consecutive observations, while DRIBO does not use frame stacking.
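As an illustration of how the two views are generated, the sketch below applies an independent random crop and random grayscale to the same frame, matching the augmentation types named above for DMC; the crop size and grayscale probability are illustrative values, not settings reported in the paper.

```python
import torch
import torchvision.transforms as T

# Independent random crop + random grayscale per view (the DMC augmentations named above).
# The output size and probability are illustrative; the paper does not specify them here.
augment = T.Compose([
    T.RandomCrop(84),             # spatially crop the frame
    T.RandomGrayscale(p=0.3),     # randomly drop color information
])

def make_views(obs):
    """obs: (C, H, W) float tensor in [0, 1]; returns two independently augmented views."""
    return augment(obs), augment(obs)

obs = torch.rand(3, 100, 100)     # stand-in for a rendered DMC frame
o1, o2 = make_views(obs)          # o^(1)_t and o^(2)_t share the task-relevant content
```

Because the two calls draw independent augmentation parameters, $o^{(1)}_t$ and $o^{(2)}_t$ differ only in view-specific details while sharing the task-relevant content.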
Figure 3: Average returns on DMC tasks (walker/stand, walker/walk, walker/run, finger/spin, cheetah/run, reacher/easy) over 5 seeds with mean and one standard error shaded in the clean setting. Compared methods: DRIBO (ours), RAD, SLAC, DBC.

Figure 4: Average returns on the same DMC tasks over 5 seeds with mean and one standard error shaded in the natural video setting. Compared methods: DRIBO (ours), RAD, SLAC, DBC.
Clean Setting.
In the clean setting, the pixel observations have simple backgrounds, as shown in Figure 2 (left column). Figure 3 shows that RAD and SLAC generally perform the best, whereas DRIBO outperforms DBC and matches SOTA in some of the environments. However, since the testing and training environments are identical, the RL agents may overfit to the training environments.
Natural Video Setting.
Next, we introduce high-dimensional visual distractors by using natural videos from the Kinetics dataset (Kay et al., 2017) as new backgrounds (Zhang et al., 2018a) (Figure 2, middle column). To avoid the issue of overfitting, we use different natural videos to replace the background during training and testing. In Figure 2, spatial attention maps (Zagoruyko & Komodakis, 2017) of the trained DRIBO encoder demonstrate that DRIBO trains agents to focus on the robot body while ignoring irrelevant scene details in the background. Figure 4 shows that DRIBO performs substantially better than RAD and SLAC, which do not discard task-irrelevant information explicitly. Compared with DBC, a recent state-of-the-art method for learning representations that are invariant to task-irrelevant information, DRIBO either outperforms or matches its performance.
Visualizing Learned Representations.
We visualize the representations learned with the DRIBO loss function in Algorithm 1 using t-SNE (Van der Maaten & Hinton, 2008). Figure 5 shows that even when the background looks drastically different, DRIBO learns to disregard irrelevant information and maps observations with similar robot configurations to the neighborhoods of one another. The color code represents the reward value of each representation. We observe that neighboring representations share close reward values. The rewards can be viewed as task-relevant signals provided by the environments.

Figure 5: t-SNE of latent spaces learned with DRIBO. We color-code the embedded points with reward values (high value yellow, lower value green). DRIBO learns representations that are neighboring in the embedding space with similar reward values, which are the direct task-relevant signals from the environments. This property holds even if the backgrounds are drastically different (see the right visual images). The solid lines connect each observation to its corresponding embedded point.
Though the natural video setting of DMC is suitable for benchmarking robustness to high-dimensional visual distractors, the task-relevant information and the task difficulties are unchanged. For this reason, we use the Procgen suite (Cobbe et al., 2020) to investigate the generalization capabilities of DRIBO. Our training setting consists of fixing the first 200 levels of a given Procgen game to train agents and then using the remaining levels as unseen levels to evaluate generalization performance. Unseen levels typically have different backgrounds or different layouts, which are easy for humans to adapt to but challenging for RL agents.

We compare DRIBO with recent methods that incorporate data augmentation. All approaches are implemented based on PPO. RAD (Laskin et al., 2020) feeds augmented observations directly into the RL policy and value function to enrich the diversity of training samples. DrAC (Raileanu et al., 2020) applies two regularization terms for the policy and value function using augmented data. UCB-DrAC is built on top of DrAC and automatically selects the best type of data augmentation for DrAC. For RAD and DrAC, we use the best reported augmentation types for the different environments. DRIBO selects the same augmentation types except for a few games. Details about the data augmentation types used in the Procgen environments can be found in Appendix C.

Results in Table 1 show that DRIBO attains higher averaged testing returns compared to the PPO baseline and the augmentation-based RL baselines. The few environments in which our approach does not outperform the others share the commonality that task-relevant layouts remain static throughout the same run of the game. Since the current version of DRIBO only considers the mutual information between the complete input and the encoder output (global MI (Hjelm et al., 2019)), it may fail to capture local features. The representations of a sequence of observations within the same run of the game are globally negative pairs but locally positive pairs. Thus, DRIBO's performance could be further improved by considering local features (positions of the layouts) shared between representations as positive pairs in mutual information maximization.

Table 1: Procgen returns on test levels after training for 25M environment steps. The mean and standard deviation are computed over 10 runs.
Env          PPO     RAD     DrAC    UCB-DrAC   DRIBO
BigFish      4.0
StarPilot    24.7
FruitBot     26.7
BossFight    7.7
Ninja        5.9
Plunder      5.0
CoinRun      8.5
Jumper       5.8
Chaser       5.0
DodgeBall
Leaper       4.9
Maze         5.7
Miner        8.5
Norm. score  1.0     1.1     1.1     1.1
6. Conclusion
In this paper, we introduce a novel robust representation learning approach for RL based on the multi-view information bottleneck principle. Visual observations are encoded into representations that are robust against task-irrelevant details and predictive of the future, a property central to the sequential aspect of RL. Our experimental results show that 1) DRIBO learns representations that are robust against task-irrelevant distractions and boosts training performance when complex visual distractors are introduced, 2) exploiting the sequential aspect of RL helps to learn more effective representations, and 3) DRIBO improves generalization performance compared to well-established baselines on the large-scale Procgen benchmark.
Future Work.
We plan to explore the direction of incorporating knowledge about locality in the observations into DRIBO. In addition, our latent dynamics RSSM model was only used for training our encoder. We plan to augment model-based RL algorithms with the DRIBO-learned RSSM model to train RL agents in the future.
References
Belghazi, M. I., Baratin, A., Rajeshwar, S., Ozair, S., Bengio, Y., Courville, A., and Hjelm, D. Mutual information neural estimation. In International Conference on Machine Learning, pp. 531–540. PMLR, 2018.

Chen, T., Kornblith, S., Norouzi, M., and Hinton, G. A simple framework for contrastive learning of visual representations. In International Conference on Machine Learning, pp. 1597–1607. PMLR, 2020.

Cobbe, K., Klimov, O., Hesse, C., Kim, T., and Schulman, J. Quantifying generalization in reinforcement learning. In International Conference on Machine Learning, pp. 1282–1289. PMLR, 2019.

Cobbe, K., Hesse, C., Hilton, J., and Schulman, J. Leveraging procedural generation to benchmark reinforcement learning. In International Conference on Machine Learning. PMLR, 2020.

Farebrother, J., Machado, M. C., and Bowling, M. Generalization and regularization in DQN. arXiv preprint arXiv:1810.00123, 2018.

Federici, M., Dutta, A., Forré, P., Kushman, N., and Akata, Z. Learning robust representations via multi-view information bottleneck. International Conference on Learning Representations, 2020.

Haarnoja, T., Zhou, A., Abbeel, P., and Levine, S. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In International Conference on Machine Learning, pp. 1861–1870. PMLR, 2018.

Hafner, D., Lillicrap, T., Fischer, I., Villegas, R., Ha, D., Lee, H., and Davidson, J. Learning latent dynamics for planning from pixels. In International Conference on Machine Learning, pp. 2555–2565. PMLR, 2019.

Hafner, D., Lillicrap, T., Ba, J., and Norouzi, M. Dream to control: Learning behaviors by latent imagination. International Conference on Learning Representations, 2020.

Hénaff, O. J., Srinivas, A., De Fauw, J., Razavi, A., Doersch, C., Eslami, S., and van den Oord, A. Data-efficient image recognition with contrastive predictive coding. In International Conference on Machine Learning, pp. 4182–4192. PMLR, 2020.

Hjelm, R. D., Fedorov, A., Lavoie-Marchildon, S., Grewal, K., Bachman, P., Trischler, A., and Bengio, Y. Learning deep representations by mutual information estimation and maximization. International Conference on Learning Representations, 2019.

Kay, W., Carreira, J., Simonyan, K., Zhang, B., Hillier, C., Vijayanarasimhan, S., Viola, F., Green, T., Back, T., Natsev, P., et al. The Kinetics human action video dataset. arXiv preprint arXiv:1705.06950, 2017.

Kim, H., Kim, J., Jeong, Y., Levine, S., and Song, H. O. EMI: Exploration with mutual information. In International Conference on Machine Learning, pp. 3360–3369, 2019.

Lange, S. and Riedmiller, M. Deep auto-encoder neural networks in reinforcement learning. In The 2010 International Joint Conference on Neural Networks (IJCNN), pp. 1–8. IEEE, 2010.

Lange, S., Riedmiller, M., and Voigtländer, A. Autonomous reinforcement learning on raw visual input data in a real world application. In The 2012 International Joint Conference on Neural Networks (IJCNN), pp. 1–8. IEEE, 2012.

Laskin, M., Lee, K., Stooke, A., Pinto, L., Abbeel, P., and Srinivas, A. Reinforcement learning with augmented data. Advances in Neural Information Processing Systems, 33, 2020.

Lee, A., Nagabandi, A., Abbeel, P., and Levine, S. Stochastic latent actor-critic: Deep reinforcement learning with a latent variable model. Advances in Neural Information Processing Systems, 33, 2020.

Li, M., Wu, L., Jun, W., and Ammar, H. B. Multi-view reinforcement learning. In Advances in Neural Information Processing Systems, pp. 1420–1431, 2019.

Mazoure, B., Tachet des Combes, R., Doan, T. L., Bachman, P., and Hjelm, R. D. Deep reinforcement and infomax learning. Advances in Neural Information Processing Systems, 33, 2020.

Oord, A. v. d., Li, Y., and Vinyals, O. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748, 2018.

Poole, B., Ozair, S., Van Den Oord, A., Alemi, A., and Tucker, G. On variational bounds of mutual information. In International Conference on Machine Learning, pp. 5171–5180. PMLR, 2019.

Raileanu, R., Goldstein, M., Yarats, D., Kostrikov, I., and Fergus, R. Automatic data augmentation for generalization in deep reinforcement learning. arXiv preprint arXiv:2006.12862, 2020.

Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.

Tassa, Y., Doron, Y., Muldal, A., Erez, T., Li, Y., Casas, D. d. L., Budden, D., Abdolmaleki, A., Merel, J., Lefrancq, A., et al. DeepMind control suite. arXiv preprint arXiv:1801.00690, 2018.

Tishby, N., Pereira, F. C., and Bialek, W. The information bottleneck method. arXiv preprint physics/0004057, 2000.

Van der Maaten, L. and Hinton, G. Visualizing data using t-SNE. Journal of Machine Learning Research, 9(11), 2008.

Wahlström, N., Schön, T. B., and Deisenroth, M. P. From pixels to torques: Policy learning with deep dynamical models. In Deep Learning Workshop at the 32nd International Conference on Machine Learning (ICML 2015), July 10–11, Lille, France, 2015.

Watter, M., Springenberg, J. T., Boedecker, J., and Riedmiller, M. Embed to control: A locally linear latent dynamics model for control from raw images. In Proceedings of the 28th International Conference on Neural Information Processing Systems, Volume 2, pp. 2746–2754, 2015.

Yu, W., Liu, C. K., and Turk, G. Policy transfer with strategy optimization. In International Conference on Learning Representations, 2019.

Zagoruyko, S. and Komodakis, N. Paying more attention to attention: Improving the performance of convolutional neural networks via attention transfer. International Conference on Learning Representations, 2017.

Zhang, A., Wu, Y., and Pineau, J. Natural environment benchmarks for reinforcement learning. arXiv preprint arXiv:1811.06032, 2018a.

Zhang, A., McAllister, R., Calandra, R., Gal, Y., and Levine, S. Learning invariant representations for reinforcement learning without reconstruction. International Conference on Learning Representations, 2021.

Zhang, C., Vinyals, O., Munos, R., and Bengio, S. A study on overfitting in deep reinforcement learning. arXiv preprint arXiv:1804.06893, 2018b.

Zhao, J., Xie, X., Xu, X., and Sun, S. Multi-view learning overview: Recent progress and new challenges. Information Fusion, 38:43–54, 2017.
Appendix

A. Theorems and Proofs
In this section, we first list the properties of mutual information used in our proofs. For any random variables $X$, $Y$, and $Z$:

(P.1) Positivity: $I(X;Y) \geq 0$, $I(X;Y \mid Z) \geq 0$
(P.2) Chain rule: $I(XY;Z) = I(Y;Z) + I(X;Z \mid Y)$
(P.3) Chain rule (multivariate mutual information): $I(X;Y;Z) = I(Y;Z) - I(Y;Z \mid X)$
(P.4) Entropy and mutual information: $I(X;Y) = H(X) - H(X \mid Y)$
(P.5) Chain rule for entropy: $H(X_1, X_2, \ldots, X_n) = \sum_{i=1}^{n} H(X_i \mid X_{i-1}, \ldots, X_1)$

A.1. Theorem 1
Here, we relax the sufficiency condition in Theorem 1 and generalize the theorem to representations sampled from an encoder $p_\theta(s_{1:T} \mid o_{1:T}, a_{1:T})$.

Theorem A.1.
Let $o_{1:T}$ be the observation sequence obtained by executing action sequence $a_{1:T}$. Let $s_{1:T}$ be a sequence of representations for $o_{1:T}$ sampled from an encoder with a specific architecture, $p_\theta(s_{1:T} \mid o_{1:T}, a_{1:T})$. Then
$$I(s_{1:T}; o_{1:T} \mid a_{1:T}) \;\geq\; \sum_{t=1}^{T} I(s_t; o_t \mid s_{t-1}, a_{t-1}) \quad (8)$$

Proof.
We specify the property used in each step of the derivation. The last equality holds since $s_{t-1} \sim p_\theta(s_{t-1} \mid o_{1:t-1}, a_{1:t-1})$: all information contained in $o_{1:t-1}$ is observed by $s_{t-1}$ through the factorized probability $p_\theta(s_{t-1} \mid o_{t-1}, s_{t-2}, a_{t-2})$, so $s_{t-1}$ contains all the information in $o_{1:t-1}$, $s_{1:t-1}$, and $a_{1:t-1}$.
$$\begin{aligned}
I(s_{1:T}; o_{1:T} \mid a_{1:T}) &\overset{\text{(P.4)}}{=} H(s_{1:T} \mid a_{1:T}) - H(s_{1:T} \mid o_{1:T}, a_{1:T}) \\
&\overset{\text{(P.5)}}{=} \sum_t \big( H(s_t \mid a_{1:T}, s_{1:t-1}) - H(s_t \mid a_{1:T}, o_{1:T}, s_{1:t-1}) \big) \\
&\overset{\text{(P.4)}}{=} \sum_t I(s_t; o_{1:T} \mid a_{1:T}, s_{1:t-1}) \\
&\overset{\text{(P.4)}}{=} \sum_t \big( H(o_{1:T} \mid a_{1:T}, s_{1:t-1}) - H(o_{1:T} \mid s_t, a_{1:T}, s_{1:t-1}) \big) \\
&\overset{\text{(P.5)}}{=} \sum_t \sum_\tau \big( H(o_\tau \mid a_{1:T}, s_{1:t-1}, o_{1:\tau-1}) - H(o_\tau \mid s_t, a_{1:T}, s_{1:t-1}, o_{1:\tau-1}) \big) \\
&\overset{\text{(P.4)}}{=} \sum_t \sum_\tau I(s_t; o_\tau \mid a_{1:T}, s_{1:t-1}, o_{1:\tau-1}) \\
&\overset{\text{(P.1)}}{\geq} \sum_t I(s_t; o_t \mid a_{1:T}, s_{1:t-1}, o_{1:t-1}) \\
&= \sum_t I(s_t; o_t \mid s_{t-1}, a_{t-1})
\end{aligned}$$
With the above generalization, this lower bound holds for any representations sampled from $p_\theta(s_{1:T} \mid o_{1:T}, a_{1:T})$. As a result, $-\mathcal{L}_{\mathrm{MV}}$ is a lower bound on $I(s^{(1)}_{1:T}; o^{(1)}_{1:T} \mid a_{1:T})$:
$$I(s^{(1)}_{1:T}; o^{(1)}_{1:T} \mid a_{1:T}) \;\geq\; \sum_t \big( I(s^{(1)}_t; o^{(1)}_t \mid s^{(1)}_{t-1}, a_{t-1}, o^{(2)}_t) + I(o^{(2)}_t; s^{(1)}_t \mid s^{(1)}_{t-1}, a_{t-1}) \big)$$

A.2. Sufficient Representations in RL
In this section, we first present the sufficiency condition for sequential data. Then, we prove that if the sufficiency condition on the sequential data holds, the sufficiency condition on each corresponding individual representation and observation holds as well.
Theorem A.2.
Let $o_{1:T}$ and $a^*_{1:T}$ be random variables with joint distribution $p(o_{1:T}, a^*_{1:T})$. Let $s_{1:T}$ be the representation of $o_{1:T}$; then $s_{1:T}$ is sufficient for $a^*_{1:T}$ if and only if $I(o_{1:T}; a^*_{1:T}) = I(s_{1:T}; a^*_{1:T})$. Also, $s_t$ is a sufficient representation of $o_t$, since $I(o_t; a^*_t \mid s_t, s_{t-1}, a_{t-1}) = 0$.

Hypothesis:
(H.1) $s_{1:T}$ is a sequence of sufficient representations for $o_{1:T}$: $I(o_{1:T}; a^*_{1:T} \mid s_{1:T}) = 0$

Proof.
$$\begin{aligned}
I(o_{1:T}; a^*_{1:T} \mid s_{1:T}) &\overset{\text{(P.3)}}{=} I(o_{1:T}; a^*_{1:T}) - I(o_{1:T}; a^*_{1:T}; s_{1:T}) \\
&\overset{\text{(P.3)}}{=} I(o_{1:T}; a^*_{1:T}) - I(a^*_{1:T}; s_{1:T}) + I(a^*_{1:T}; s_{1:T} \mid o_{1:T})
\end{aligned}$$
With $s_{1:T}$ being a representation of $o_{1:T}$, we have $I(s_{1:T}; a^*_{1:T} \mid o_{1:T}) = 0$. The reason is that $o_{1:T}$ contains at least the same level of information as $a^*_{1:T}$ and $s_{1:T}$. Then,
$$I(o_{1:T}; a^*_{1:T} \mid s_{1:T}) = I(o_{1:T}; a^*_{1:T}) - I(a^*_{1:T}; s_{1:T}) \quad (9)$$
So the sufficiency condition $I(o_{1:T}; a^*_{1:T} \mid s_{1:T}) = 0$ holds if and only if $I(o_{1:T}; a^*_{1:T}) = I(a^*_{1:T}; s_{1:T})$.

We factorize the mutual information between sequential observations and optimal actions:
$$\begin{aligned}
I(o_{1:t}; a^*_t) &\overset{\text{(P.2)}}{=} I(o_t; a^*_t \mid o_{1:t-1}) + I(o_{1:t-1}; a^*_t) \overset{\text{(H.1)}}{=} I(o_t; a^*_t \mid o_{1:t-1}) + I(s_{1:t-1}; a^*_t) \\
I(s_{1:t}; a^*_t) &\overset{\text{(P.2)}}{=} I(s_t; a^*_t \mid s_{1:t-1}) + I(s_{1:t-1}; a^*_t) \overset{\text{(H.1)}}{=} I(s_t; a^*_t \mid s_{1:t-1}) + I(o_{1:t-1}; a^*_t)
\end{aligned}$$
Then we obtain the following relation:
$$I(o_t; a^*_t \mid o_{1:t-1}) = I(s_t; a^*_t \mid s_{1:t-1}) \quad (10)$$
We also have
$$\begin{aligned}
I(o_t; a^*_t \mid o_{1:t-1}) &\overset{\text{(P.2)}}{=} I(o_{1:t}; a^*_t) - I(o_{1:t-1}; a^*_t) \\
&\overset{\text{(P.2)}}{=} I(o_{1:t-1}; a^*_t \mid o_t) + I(o_t; a^*_t) - I(o_{1:t-1}; a^*_t) \\
&\overset{\text{(H.1)}}{=} I(s_{1:t-1}; a^*_t \mid o_t) + I(o_t; a^*_t) - I(s_{1:t-1}; a^*_t) \\
&\overset{\text{(P.2)}}{=} I(o_t s_{1:t-1}; a^*_t) - I(s_{1:t-1}; a^*_t) \\
&\overset{\text{(P.2)}}{=} I(o_t; a^*_t \mid s_{1:t-1})
\end{aligned}$$
Combining this with Equation 10:
$$\begin{aligned}
I(o_t; a^*_t \mid s_{1:t-1}) = I(s_t; a^*_t \mid s_{1:t-1}) &\overset{\text{(P.2)}}{\iff} I(o_t; a^*_t \mid a^*_{t-1}, s_{1:t-1}) + I(o_t; a^*_{t-1} \mid s_{1:t-1}) = I(s_t; a^*_t \mid a^*_{t-1}, s_{1:t-1}) + I(s_t; a^*_{t-1} \mid s_{1:t-1}) \\
&\overset{\text{Eq. 10}}{\iff} I(o_t; a^*_t \mid a^*_{t-1}, s_{1:t-1}) = I(s_t; a^*_t \mid a^*_{t-1}, s_{1:t-1}) \\
&\overset{\text{Eq. 9}}{\iff} I(o_t; a^*_t \mid s_t, s_{1:t-1}, a^*_{t-1}) = 0
\end{aligned}$$
With the above derivation and the Markov property, we have $I(o_t; a^*_t \mid s_t, s_{t-1}, a_{t-1}) = 0$. We can generalize $a^*_{t-1}$ to any $a_{t-1}$ by taking $a^*_t$ to be the optimal action for the state $s_t$ whose previous state-action pair is $(s_{t-1}, a_{t-1})$. Thus, $s_t$ is a sufficient representation for $o_t$ if and only if $s_{1:T}$ is a sufficient representation of $o_{1:T}$.

A.3. Multi-View Redundancy and Sufficiency

Proposition A.1. $o^{(1)}_{1:T}$ is a redundant view with respect to $o^{(2)}_{1:T}$ for obtaining $a^*_{1:T}$ if and only if $I(o^{(1)}_{1:T}; a^*_{1:T} \mid o^{(2)}_{1:T}) = 0$. Any representation $s^{(1)}_{1:T}$ of $o^{(1)}_{1:T}$ that is sufficient for $o^{(2)}_{1:T}$ is also sufficient for $a^*_{1:T}$.

Proof.
See the proof of Proposition B.3 in the MIB paper (Federici et al., 2020).
Corollary A.1.
Let $o^{(1)}_{1:T}$ and $o^{(2)}_{1:T}$ be two mutually redundant views for $a^*_{1:T}$. Let $s^{(1)}_{1:T}$ be a representation of $o^{(1)}_{1:T}$. If $s^{(1)}_{1:T}$ is sufficient for $o^{(2)}_{1:T}$, then $s^{(1)}_t$ can derive $a^*_t$ as well as the joint observation of the two views, i.e., $I(o^{(1)}_t o^{(2)}_t; a^*_t \mid s_{t-1}, a_{t-1}) = I(s^{(1)}_t; a^*_t \mid s_{t-1}, a_{t-1})$, where $s_{t-1}$ is any sufficient representation at timestep $t-1$.

Proof.
For the sequential data, see the proof of Corollary B.2.1 in the MIB paper (Federici et al., 2020) to prove
$$I(o^{(1)}_{1:T} o^{(2)}_{1:T}; a^*_{1:T}) = I(s^{(1)}_{1:T}; a^*_{1:T})$$
According to Theorem A.2, if $s^{(1)}_{1:T}$ is a sufficient representation of $o^{(2)}_{1:T}$, then $s^{(1)}_t$ is a sufficient representation of $o^{(2)}_t$. Similar to the proof on sequential data, we can use Corollary B.2.1 in the MIB paper (Federici et al., 2020) to show that
$$I(o^{(1)}_t o^{(2)}_t; a^*_t \mid s_{t-1}, a_{t-1}) = I(s^{(1)}_t; a^*_t \mid s_{t-1}, a_{t-1})$$

Theorem A.3.
Let the two views $o^{(1)}_{1:T}$ and $o^{(2)}_{1:T}$ of observation $o_{1:T}$ be obtained by data augmentation transformation sequences $t^{(1)}_{1:T}$ and $t^{(2)}_{1:T}$, respectively ($o^{(1)}_{1:T} = t^{(1)}_{1:T}(o_{1:T})$ and $o^{(2)}_{1:T} = t^{(2)}_{1:T}(o_{1:T})$). Whenever $I(t^{(1)}_{1:T}(o_{1:T}); a^*_{1:T}) = I(t^{(2)}_{1:T}(o_{1:T}); a^*_{1:T}) = I(o_{1:T}; a^*_{1:T})$, the two views $o^{(1)}_{1:T}$ and $o^{(2)}_{1:T}$ must be mutually redundant for $a^*_{1:T}$. Moreover, the two views $o^{(1)}_t$ and $o^{(2)}_t$ must be mutually redundant for $a^*_t$.

Proof. Let $s_{1:T}$ be a sufficient representation for both the original and the multi-view observations. We first factorize the mutual information; A.2 refers to Theorem A.2.
$$\begin{aligned}
I(t^{(1)}_{1:t}(o_{1:t}); a^*_t) = I(o^{(1)}_{1:t}; a^*_t) &\overset{\text{(P.2)}}{=} I(o^{(1)}_t; a^*_t \mid o^{(1)}_{1:t-1}) + I(o^{(1)}_{1:t-1}; a^*_t) \overset{\text{A.2}}{=} I(o^{(1)}_t; a^*_t \mid s_{t-1}) + I(s_{t-1}; a^*_t) \\
I(t^{(2)}_{1:t}(o_{1:t}); a^*_t) = I(o^{(2)}_{1:t}; a^*_t) &\overset{\text{(P.2)}}{=} I(o^{(2)}_t; a^*_t \mid o^{(2)}_{1:t-1}) + I(o^{(2)}_{1:t-1}; a^*_t) \overset{\text{A.2}}{=} I(o^{(2)}_t; a^*_t \mid s_{t-1}) + I(s_{t-1}; a^*_t) \\
I(o_{1:t}; a^*_t) &\overset{\text{(P.2)}}{=} I(o_t; a^*_t \mid o_{1:t-1}) + I(o_{1:t-1}; a^*_t) \overset{\text{A.2}}{=} I(o_t; a^*_t \mid s_{t-1}) + I(s_{t-1}; a^*_t)
\end{aligned}$$
Then, we have the following equality:
$$I(o^{(1)}_t; a^*_t \mid s_{t-1}) = I(o^{(2)}_t; a^*_t \mid s_{t-1}) = I(o_t; a^*_t \mid s_{t-1})$$
Similar to the derivation in Theorem A.2,
$$I(o^{(1)}_t; a^*_t \mid s_{t-1}) \overset{\text{(P.2)}}{=} I(o^{(1)}_t; a^*_t \mid a^*_{t-1}, s_{t-1}) + I(o^{(1)}_t; a^*_{t-1} \mid s_{t-1}) \overset{\text{Eq. 10}}{=} I(o^{(1)}_t; a^*_t \mid a^*_{t-1}, s_{t-1}) + I(s_t; a^*_{t-1} \mid s_{t-1})$$
Applying the same derivation to $o^{(2)}$ and $o$, and using the Markov property, we obtain
$$I(t^{(1)}_t(o_t); a^*_t \mid s_{t-1}, a_{t-1}) = I(t^{(2)}_t(o_t); a^*_t \mid s_{t-1}, a_{t-1}) = I(o_t; a^*_t \mid s_{t-1}, a_{t-1})$$
This shows that the condition on sequential data can be expressed at each timestep in a similar form. See the proof of Proposition B.4 in the MIB paper (Federici et al., 2020) for mutual redundancy between sequential views and individual pairs of views.
Theorem A.4.
Suppose the mutual redundancy condition holds, i.e., $I(t^{(1)}_{1:T}(o_{1:T}); a^*_{1:T}) = I(t^{(2)}_{1:T}(o_{1:T}); a^*_{1:T}) = I(o_{1:T}; a^*_{1:T})$. If $s^{(1)}_{1:T}$ is a sufficient representation for $t^{(2)}_{1:T}(o_{1:T})$, then $I(o_t; a^*_t \mid s_{t-1}, a_{t-1}) = I(s^{(1)}_t; a^*_t \mid s_{t-1}, a_{t-1})$.

Proof. Since $t^{(1)}_{1:T}(o_{1:T})$ is redundant for $t^{(2)}_{1:T}(o_{1:T})$ (Theorem A.3), any representation $s^{(1)}_t$ of $t^{(1)}_{1:T}(o_{1:T})$ that is sufficient for $t^{(2)}_{1:T}(o_{1:T})$ must also be sufficient for $a^*_t$ (Theorem A.2 and Proposition A.1). Using Theorem A.2, we have $I(s^{(1)}_t; a^*_t \mid s_{t-1}, a_{t-1}) = I(t^{(1)}_t(o_t); a^*_t \mid s_{t-1}, a_{t-1})$. With $I(t^{(1)}_t(o_t); a^*_t \mid s_{t-1}, a_{t-1}) = I(o_t; a^*_t \mid s_{t-1}, a_{t-1})$, we conclude $I(o_t; a^*_t \mid s_{t-1}, a_{t-1}) = I(s^{(1)}_t; a^*_t \mid s_{t-1}, a_{t-1})$.

We finally show the proposition for the multi-view information bottleneck principle in RL, using the generalization of the sufficiency and mutual redundancy conditions from sequential data to each individual pair of data.

Proposition A.2.
Let $o^{(1)}_t$ and $o^{(2)}_t$ be mutually redundant views for $a^*_t$ that share only optimal-action information. Then a representation $s^{(1)}_t$ of $o^{(1)}_t$ that is sufficient for $o^{(2)}_t$ and minimal for $o^{(2)}_t$ is also a minimal representation for $a^*_t$.

Proof.
See the proof of Proposition E.1 in the MIB paper (Federici et al., 2020).
B. DRIBO Loss Computation
We consider the average of the information bottleneck losses from the two views:
$$\mathcal{L} = \frac{\mathcal{L}_1 + \mathcal{L}_2}{2} \quad (11)$$
$$\mathcal{L} = \frac{I(s^{(1)}_t; o^{(1)}_t \mid s^{(1)}_{t-1}, a_{t-1}, o^{(2)}_t) + I(s^{(2)}_t; o^{(2)}_t \mid s^{(2)}_{t-1}, a_{t-1}, o^{(1)}_t)}{2} - \frac{\lambda_1 I(s^{(1)}_t; o^{(2)}_t \mid s^{(1)}_{t-1}, a_{t-1}) + \lambda_2 I(s^{(2)}_t; o^{(1)}_t \mid s^{(2)}_{t-1}, a_{t-1})}{2} \quad (12)$$
Considering $s^{(1)}_t$ and $s^{(2)}_t$ on the same domain $\mathcal{S}$, $I(s^{(1)}_t; o^{(1)}_t \mid s^{(1)}_{t-1}, a_{t-1}, o^{(2)}_t)$ can be expressed as:
$$\begin{aligned}
I(s^{(1)}_t; o^{(1)}_t \mid s^{(1)}_{t-1}, a_{t-1}, o^{(2)}_t) &= \mathbb{E}\left[ \log \frac{p_\theta(s^{(1)}_t \mid o^{(1)}_t, s^{(1)}_{t-1}, a_{t-1})}{p_\theta(s^{(1)}_t \mid o^{(2)}_t, s^{(1)}_{t-1}, a_{t-1})} \right] \\
&= \mathbb{E}\left[ \log \frac{p_\theta(s^{(1)}_t \mid o^{(1)}_t, s^{(1)}_{t-1}, a_{t-1})}{p_\theta(s^{(2)}_t \mid o^{(2)}_t, s^{(2)}_{t-1}, a_{t-1})} \cdot \frac{p_\theta(s^{(2)}_t \mid o^{(2)}_t, s^{(2)}_{t-1}, a_{t-1})}{p_\theta(s^{(1)}_t \mid o^{(2)}_t, s^{(1)}_{t-1}, a_{t-1})} \right] \\
&= D_{\mathrm{KL}}\big(p_\theta(s^{(1)}_t \mid o^{(1)}_t, s^{(1)}_{t-1}, a_{t-1}) \,\|\, p_\theta(s^{(2)}_t \mid o^{(2)}_t, s^{(2)}_{t-1}, a_{t-1})\big) \\
&\quad - D_{\mathrm{KL}}\big(p_\theta(s^{(1)}_t \mid o^{(2)}_t, s^{(1)}_{t-1}, a_{t-1}) \,\|\, p_\theta(s^{(2)}_t \mid o^{(2)}_t, s^{(2)}_{t-1}, a_{t-1})\big) \\
&\leq D_{\mathrm{KL}}\big(p_\theta(s^{(1)}_t \mid o^{(1)}_t, s^{(1)}_{t-1}, a_{t-1}) \,\|\, p_\theta(s^{(2)}_t \mid o^{(2)}_t, s^{(2)}_{t-1}, a_{t-1})\big) \quad (13)
\end{aligned}$$
Note that equality holds if the two distributions coincide. Analogously, $I(s^{(2)}_t; o^{(2)}_t \mid s^{(2)}_{t-1}, a_{t-1}, o^{(1)}_t)$ is upper bounded by $D_{\mathrm{KL}}\big(p_\theta(s^{(2)}_t \mid o^{(2)}_t, s^{(2)}_{t-1}, a_{t-1}) \,\|\, p_\theta(s^{(1)}_t \mid o^{(1)}_t, s^{(1)}_{t-1}, a_{t-1})\big)$.

Assume $s_{t-1}$ is a sufficient representation of $o_{1:t-1}$. Then, $s^{(1)}_{t-1}$ provides no more task-relevant information than the sufficient representation $s_{t-1}$. $I(s^{(1)}_t; o^{(2)}_t \mid s^{(1)}_{t-1}, a_{t-1})$ can thus be re-expressed as:
$$\begin{aligned}
I(s^{(1)}_t; o^{(2)}_t \mid s^{(1)}_{t-1}, a_{t-1}) &\geq I(s^{(1)}_t; o^{(2)}_t \mid s_{t-1}, a_{t-1}) \\
&\overset{\text{(P.2)}}{=} I(s^{(1)}_t; s^{(2)}_t o^{(2)}_t \mid s_{t-1}, a_{t-1}) - I(s^{(1)}_t; s^{(2)}_t \mid o^{(2)}_t, s_{t-1}, a_{t-1}) \\
&\overset{*}{=} I(s^{(1)}_t; s^{(2)}_t o^{(2)}_t \mid s_{t-1}, a_{t-1}) \\
&= I(s^{(1)}_t; s^{(2)}_t \mid s_{t-1}, a_{t-1}) + I(s^{(1)}_t; o^{(2)}_t \mid s^{(2)}_t, s_{t-1}, a_{t-1}) \\
&\geq I(s^{(1)}_t; s^{(2)}_t \mid s_{t-1}, a_{t-1}) \quad (14)
\end{aligned}$$
where $*$ follows from $s^{(2)}_t$ being the representation of $o^{(2)}_t$. The bound is tight whenever $s^{(2)}_t$ is sufficient for $s^{(1)}_t$ ($I(s^{(1)}_t; o^{(1)}_t \mid s_{t-1}, a_{t-1}, o^{(2)}_t) = 0$). This happens whenever $s^{(2)}_t$ contains all the information regarding $s^{(1)}_t$. Once again, we have $I(s^{(2)}_t; o^{(1)}_t \mid s^{(2)}_{t-1}, a_{t-1}) \geq I(s^{(1)}_t; s^{(2)}_t \mid s_{t-1}, a_{t-1})$. Therefore, the averaged loss function can be upper bounded by
$$\mathcal{L} \leq -\frac{\lambda_1 + \lambda_2}{2}\, I(s^{(1)}_t; s^{(2)}_t \mid s_{t-1}, a_{t-1}) + D_{\mathrm{SKL}}\big(p_\theta(s^{(1)}_t \mid o^{(1)}_t, s^{(1)}_{t-1}, a_{t-1}) \,\|\, p_\theta(s^{(2)}_t \mid o^{(2)}_t, s^{(2)}_{t-1}, a_{t-1})\big) \quad (15)$$
Lastly, by re-parameterizing the objective, we obtain:
$$\mathcal{L}(\theta; \beta) = -I_\theta(s^{(1)}_t; s^{(2)}_t \mid s_{t-1}, a_{t-1}) + \beta\, D_{\mathrm{SKL}}\big(p_\theta(s^{(1)}_t \mid o^{(1)}_t, s^{(1)}_{t-1}, a_{t-1}) \,\|\, p_\theta(s^{(2)}_t \mid o^{(2)}_t, s^{(2)}_{t-1}, a_{t-1})\big) \quad (16)$$

In Algorithm 1, we use $s^{(1)}_t \sim p_\theta(s^{(1)}_t \mid o^{(1)}_t, s^{(1)}_{t-1}, a_{t-1})$ and $s^{(2)}_t \sim p_\theta(s^{(2)}_t \mid o^{(2)}_t, s^{(2)}_{t-1}, a_{t-1})$ to obtain representations for the multi-view observations. We argue that this substitution does not affect the effectiveness of the averaged objective. With the multi-view assumption, the representations $s^{(1)}_{t-1}$ and $s^{(2)}_{t-1}$ do not share any task-irrelevant information.
So, the representations at timestep $t$ conditioned on them do not share any task-irrelevant information. Maximizing the mutual information between $s^{(1)}_t$ and $s^{(2)}_t$ (the first term in Equation 16) encourages the representations to share maximal task-relevant information. A similar argument also holds for the second term in Equation 16: since $s^{(1)}_{t-1}$ and $s^{(2)}_{t-1}$ do not share any task-irrelevant information, any task-irrelevant information introduced by the conditional probability will also be identified as task-irrelevant by the KL divergence, and will be reduced by minimizing the DRIBO loss.

C. Implementation Details
C. Implementation Details

C.1. DRIBO + SAC
We first show how we train the SAC agent given the representations from DRIBO. Let φ(o) = s ∼ p_θ(s | o, s′, a′) denote the encoder, where s′ and a′ are the representation and action at the previous timestep.
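Before the pseudocode in Algorithm 2, here is a minimal PyTorch-style sketch of the same updates. The module names (policy.sample, the double Q-networks, log_alpha, target_entropy) and the choice to detach the representation in the actor and temperature losses are illustrative assumptions, and the recurrent conditioning of the encoder on (s′, a′) is assumed to be folded into the precomputed representations.

```python
import torch

def sac_losses(s, a, r, s_next, q1, q2, q1_targ, q2_targ, policy, log_alpha,
               target_entropy, gamma=0.99):
    """Sketch of the per-batch SAC losses computed on DRIBO representations.

    s, a, r, s_next come from the RL batch B_RL of (representation, action,
    reward, next-representation) tuples; all module names are illustrative.
    """
    # Get value: V = min_i Q_hat_i(s', a') - alpha * log pi(a' | s'), with a' ~ pi(. | s').
    with torch.no_grad():
        a_next, logp_next = policy.sample(s_next)
        v_next = torch.min(q1_targ(s_next, a_next), q2_targ(s_next, a_next)) \
                 - log_alpha.exp() * logp_next
        target = r + gamma * v_next

    # Train critics (gradients also flow into the encoder through s).
    critic_loss = ((q1(s, a) - target) ** 2 + (q2(s, a) - target) ** 2).mean()

    # Train actor on detached representations (a common choice in pixel-based SAC variants).
    a_pi, logp_pi = policy.sample(s.detach())
    actor_loss = (log_alpha.exp().detach() * logp_pi
                  - torch.min(q1(s.detach(), a_pi), q2(s.detach(), a_pi))).mean()

    # Train alpha towards the target entropy.
    alpha_loss = (-log_alpha.exp() * (logp_pi.detach() + target_entropy)).mean()
    return critic_loss, actor_loss, alpha_loss
```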
Algorithm 2: SAC + DRIBO Encoder
input: RL batch B_RL = {(φ(o_i), a_i, r_i, φ(o′_i))}_{i=1}^{(T−1)·N} with (T−1)·N tuples of representation, action, reward and next representation.
  Get value: V = min_{i=1,2} Q̂_i(φ̂(o′), a′) − α log π(a′ | φ̂(o′)), with a′ ∼ π(· | φ̂(o′))
  Train critics: J(Q_i, φ) = (Q_i(φ(o), a) − (r + γV))²
  Train actor: J(π) = α log π(a | φ(o)) − min_{i=1,2} Q_i(φ(o), a), with a ∼ π(· | φ(o))
  Train α: J(α) = −α log π(a | φ(o)) − α H̄, where H̄ is the target entropy
  Update target critics: Q̂_i ← τ_Q Q_i + (1 − τ_Q) Q̂_i
  Update target encoder: φ̂ ← τ_φ φ + (1 − τ_φ) φ̂

We then incorporate the above SAC update into the minimization of the DRIBO loss as follows:
Algorithm 3: DRIBO + SAC
input: Replay buffer D storing sequential observations and actions of length T; batch size N; number of training steps per episode K; number of total episodes E.
for e = 1, . . . , E do
  Sample sequential observations and actions from the environment and append the new samples to D.
  for each step k = 1, . . . , K do
    Sample a sequential batch B ∼ D.
    Compute the representation batch B_RL of shape (T, N) using the encoder p_θ(s_{1:T} | o_{1:T}, a_{1:T}).
    Train the SAC agent: E_{B_RL}[J(π, Q, φ)]   {Algorithm 2}
    Update θ and ψ to minimize L_DRIBO using B   {Algorithm 1}
  end for
end for
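For orientation, a compact sketch of this training loop follows; every component name (buffer, agent, encoder.unroll, dribo_loss_fn, augment) is an assumption rather than the released code.

```python
def train_dribo_sac(env, buffer, encoder, agent, dribo_loss_fn, augment,
                    dribo_optimizer, num_episodes, steps_per_episode,
                    batch_size, T, beta_schedule):
    """Sketch of the DRIBO + SAC loop in Algorithm 3; all component names are assumptions."""
    for episode in range(num_episodes):
        # Collect a length-T trajectory with the current policy and store it as a sequence.
        buffer.add(agent.collect_trajectory(env, encoder, length=T))
        beta = beta_schedule(episode)

        for _ in range(steps_per_episode):
            # Sample a sequential batch: observations/actions/rewards of shape [T, N, ...].
            obs_seq, act_seq, rew_seq = buffer.sample_sequences(batch_size)

            # Unroll the encoder over time to obtain the representation batch of shape (T, N, d).
            states = encoder.unroll(obs_seq, act_seq)

            # SAC update on (s_t, a_t, r_t, s_{t+1}) tuples from the unrolled states (Algorithm 2).
            agent.update(states[:-1], act_seq[:-1], rew_seq[:-1], states[1:])

            # Two augmented views of the same observation sequence; minimize the DRIBO loss (Algorithm 1).
            loss = dribo_loss_fn(encoder, augment(obs_seq), augment(obs_seq), act_seq, beta)
            dribo_optimizer.zero_grad()
            loss.backward()
            dribo_optimizer.step()
```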
C.2. DRIBO + PPO

The main difference between SAC and PPO is that PPO is an on-policy RL algorithm while SAC is off-policy. As the encoder is updated, representations may become inconsistent within a training step, which breaks the on-policy sampling assumption. To address this issue, instead of obtaining s_t by propagating from the initial observation of the observation sequence, we store the representations as s^{old}_t while sampling the on-policy batch. We then use ϕ(o) = s ∼ p_θ(s | o, s^{old}, a′) to denote the representation from the encoder, where s^{old} and a′ are the stored representation and action at the previous timestep. By treating the encoding process as a part of the policy and value function, the on-policy requirement is satisfied since the new action/value at timestep t depends only on (o_t, s^{old}_{t−1}, a_{t−1}), as illustrated by the sketch below.
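The following sketch illustrates this representation caching; the rollout storage, encoder.step, policy, and ppo_loss names are hypothetical.

```python
def rollout_with_cached_states(env, encoder, policy, rollout, T):
    """Store the previous representation/action alongside each transition (illustrative names)."""
    obs = env.reset()
    s_prev, a_prev = encoder.initial_state(), None
    for t in range(T):
        s_t = encoder.step(obs, s_prev, a_prev)        # s_t ~ p_theta(. | o_t, s_{t-1}, a_{t-1})
        a_t = policy.act(s_t)
        rollout.store(obs, a_t, s_old=s_prev, a_old=a_prev)
        obs, _, _, _ = env.step(a_t)
        s_prev, a_prev = s_t, a_t

def ppo_epoch(encoder, policy, value, ppo_minibatches, ppo_loss):
    """Re-encode each observation from the cached (s_old, a_old), so the fresh action/value
    at timestep t depends only on (o_t, s_old_{t-1}, a_{t-1})."""
    for obs_t, act_t, ret_t, adv_t, s_old, a_old in ppo_minibatches:
        s_t = encoder.step(obs_t, s_old, a_old)
        loss = ppo_loss(policy, value, s_t, act_t, ret_t, adv_t)
        loss.backward()
```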
Algorithm 4: DRIBO + PPO
input: Replay buffer D and on-policy replay buffer D_PPO storing sequential observations and actions of length T; batch size N; PPO minibatch size M; number of total episodes E.
for e = 1, . . . , E do
  Sample sequential observations and actions from the environment, {(o_{1:T}, a_{1:T}, r_{1:T}, s^{old}_{1:T})}_{i=1}^{N}.
  Append the new samples to D and update the on-policy replay buffer D_PPO.
  for j = 1, . . . , M do
    Sample {(ϕ(o_i), a_i, r_i)}_{i=1}^{⌊T·N/M⌋} ∼ D_PPO.
    Optimize the PPO policy, value function and encoder using each sample (ϕ(o_i), a_i, r_i) in the minibatch.
    Sample a sequential batch B ∼ D.
    Update θ and ψ to minimize L_DRIBO using B   {Algorithm 1}.
  end for
end for

C.3. DMC

We use the same encoder architecture as the encoder in the RSSM paper (Hafner et al., 2019). The deterministic part of the representation is a 200-dimensional vector, and the stochastic part is a 30-dimensional diagonal Gaussian with predicted mean and standard deviation; the representation is thus a 230-dimensional vector. We implement the Q-networks and the policy in SAC as MLPs with two fully connected layers of size 1024 and ReLU activations. The mutual information (MI) estimator I_ψ(s^{(1)}, s^{(2)}) is an MLP with two fully connected layers of size 500 and ReLU activations.
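As a sketch of how such a representation can be assembled (the recurrent, RSSM-style computation of the deterministic path is omitted; class and variable names are assumptions):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class StateHead(nn.Module):
    """Sketch of the 230-d representation: 200 deterministic plus 30 stochastic
    (diagonal Gaussian) dimensions, combined by concatenation."""
    def __init__(self, feat_dim, det_dim=200, stoch_dim=30):
        super().__init__()
        self.det = nn.Linear(feat_dim, det_dim)
        self.stoch_params = nn.Linear(feat_dim, 2 * stoch_dim)   # predicted mean and std

    def forward(self, feat):
        det = self.det(feat)
        mean, std_raw = self.stoch_params(feat).chunk(2, dim=-1)
        std = F.softplus(std_raw) + 1e-4                          # keep the std positive
        stoch = mean + std * torch.randn_like(std)                # reparameterized sample
        return torch.cat([det, stoch], dim=-1)                    # 230-dimensional state
```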
Augmentations of Visual Observations. For our approach DRIBO and for RAD, we use crop + random grayscale to generate multi-view observations and augmented data, using the RAD implementation of the augmentations. For crop, a random patch is extracted from the original observation: in DMC, we render observations at a higher resolution, crop a random patch, and then resize the cropped patch back to the encoder input resolution. For random grayscale, RGB images are converted to grayscale with a fixed probability p.
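A rough sketch of this augmentation pipeline using torchvision-style transforms is given below; note that the paper uses the RAD implementation, and the sizes and grayscale probability here are placeholders rather than the exact values.

```python
import torchvision.transforms as T

# Placeholder sizes/probability; the exact values are not reproduced here.
CROP_SIZE, OUT_SIZE, GRAY_P = 84, 100, 0.3

def multi_view(obs):
    """Generate one augmented view of an observation via crop + random grayscale."""
    aug = T.Compose([
        T.RandomCrop(CROP_SIZE),       # extract a random patch from the rendered observation
        T.Resize(OUT_SIZE),            # resize back to the encoder input size
        T.RandomGrayscale(p=GRAY_P),   # convert RGB to grayscale with probability p
    ])
    return aug(obs)

# Two independent passes over the same observation give the two views o^(1) and o^(2):
# view1, view2 = multi_view(obs), multi_view(obs)
```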
Hyperparameters. To facilitate optimization, the hyperparameter β in the DRIBO loss (Algorithm 1) is slowly increased during training: β starts from a small value and is increased to its final value with an exponential scheduler. The same procedure is also used in the MIB paper (Federici et al., 2020). We show the other hyperparameters for the DMC experiments in Table 2.
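A sketch of such an exponential scheduler follows; the start and end episodes (10 and 110) match Table 2, while beta_start and beta_end are placeholder values.

```python
import math

def beta_schedule(episode, beta_start=1e-4, beta_end=1e-3,
                  start_episode=10, end_episode=110):
    """Exponentially interpolate beta between beta_start and beta_end (placeholder values)."""
    if episode <= start_episode:
        return beta_start
    if episode >= end_episode:
        return beta_end
    frac = (episode - start_episode) / (end_episode - start_episode)
    # Interpolate in log-space so beta grows exponentially over the schedule window.
    return math.exp(math.log(beta_start) + frac * (math.log(beta_end) - math.log(beta_start)))
```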
Table 2: Hyperparameters used for DMC experiments.

  Hyperparameter                        Value
  Observation size                      (100 × 100)
  Replay buffer size                    1000000
  Initial steps                         1000
  Stacked frames                        No
  Action repeat                         2 (finger-spin; walker-stand, walk, run); 4 otherwise
  Evaluation episodes                   8
  Optimizer                             Adam
  Learning rates                        encoder: 1e-4; MI estimator: 1e-4; policy/Q-network: 1e-3; α: 1e-4
  Batch size                            N × T, where T = 50
  Target update τ
  Discount γ                            0.99
  Initial temperature                   0.1
  Num. of steps per episode             1000
  Num. of training steps per episode    500
  β scheduler start episode             10
  β scheduler end episode               110

C.4. Procgen
For the Procgen suite, the implementation of DRIBO is almost the same as in the DMC experiments; better design choices could likely be found via further validation. We use the same encoder architecture as in the DMC experiments, except for the observation embedder, for which we use the network from the IMPALA paper to process the visual observations. In addition, since the actions in the Procgen suite are discrete, we use an action embedder to map discrete actions into a continuous space. The action embedder is implemented as a simple residual block with one hidden layer of 64 neurons; it maps a one-hot action vector to a 4-dimensional vector. The policy and value function share one hidden layer with 1024 neurons. The policy uses an additional fully connected layer to produce a categorical distribution over the discrete actions, and the value function uses an additional fully connected layer to produce the value of an input representation. All activation functions are ReLU activations.
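A sketch of the Procgen-specific modules described above (the action-embedder residual block and the shared policy/value head); the exact residual wiring and the class names are assumptions.

```python
import torch
import torch.nn as nn

class ActionEmbedder(nn.Module):
    """One-hidden-layer residual block mapping a one-hot action to a 4-d vector (sketch)."""
    def __init__(self, num_actions, hidden=64, out_dim=4):
        super().__init__()
        self.proj = nn.Linear(num_actions, out_dim)
        self.block = nn.Sequential(nn.Linear(out_dim, hidden), nn.ReLU(), nn.Linear(hidden, out_dim))

    def forward(self, one_hot_action):
        x = self.proj(one_hot_action)
        return x + self.block(x)                                  # residual connection

class PolicyValueHead(nn.Module):
    """Shared 1024-unit hidden layer with separate policy and value output layers (sketch)."""
    def __init__(self, state_dim, num_actions, hidden=1024):
        super().__init__()
        self.shared = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU())
        self.policy_logits = nn.Linear(hidden, num_actions)      # categorical distribution over actions
        self.value = nn.Linear(hidden, 1)                         # value of the input representation

    def forward(self, s):
        h = self.shared(s)
        return torch.distributions.Categorical(logits=self.policy_logits(h)), self.value(h)
```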
Augmentation of Visual Observations.
We select augmentation types based on the best augmentation types reported for each environment. DrAC (Raileanu et al., 2020) reports the best augmentation types for RAD and DrAC in Tables 4 and 5 of the DrAC paper. We list the augmentation types used by DRIBO in Tables 3 and 4, using the same settings for each augmentation type as DrAC. Note that, due to time constraints, we only performed limited experiments to select the augmentations reported in these tables, so they do not necessarily show the best augmentation type for DRIBO in each environment.

Table 3: Augmentation type used for each game.
  Env           BigFish   StarPilot   FruitBot   BossFight   Ninja         Plunder   CaveFlyer     CoinRun
  Augmentation  crop      cutout      cutout     cutout      random-conv   crop      random-conv   random-conv
Table 4: Augmentation type used for each game.
  Env           Jumper        Chaser   Climber       DodgeBall   Heist   Leaper   Maze   Miner
  Augmentation  random-conv   crop     random-conv   cutout      crop    crop     crop   flip
Hyperparameters.
We use the same β scheduler as in the DMC experiments, with the same starting and final β values. We show the other hyperparameters for the Procgen environments in Table 5.

Table 5: Hyperparameters used for Procgen experiments.

  Hyperparameter                        Value