How Do You Act? An Empirical Study to Understand Behavior of Deep Reinforcement Learning Agents
Richard Meyes, Moritz Schneider, and Tobias Meisen
Chair of Technologies and Management of the Digital Transformation, University of Wuppertal, Rainer-Gruenter-Straße 21, 42119 Wuppertal, Germany
{meyes, m.schneider-hk, meisen}@uni-wuppertal.de

Abstract.
The demand for more transparency of decision-making processes of deep reinforcement learning agents is greater than ever, due to their increased use in safety-critical and ethically challenging domains such as autonomous driving. In this empirical study, we address this lack of transparency following an idea that is inspired by research in the field of neuroscience. We characterize the learned representations of an agent's policy network through its activation space and perform partial network ablations to compare the representations of the healthy and the intentionally damaged networks. We show that the healthy agent's behavior is characterized by a distinct correlation pattern between the network's layer activations and the performed actions during an episode, and that network ablations which cause a strong change of this pattern lead to the agent failing its trained control task. Furthermore, the learned representation of the healthy agent is characterized by a distinct pattern in its activation space reflecting its different behavioral stages during an episode, which again, when distorted by network ablations, leads to the agent failing its trained control task. In conclusion, we argue in favor of a new perspective on artificial neural networks as objects of empirical investigation, just like biological neural systems in neuroscientific studies, paving the way towards a new standard of scientific falsifiability with respect to research on transparency and interpretability of artificial neural networks.
Keywords:
Transparency, Interpretability, Explainability, Deep Reinforcement Learning, Neuroscience
1 Introduction

Recent research on general-purpose artificial intelligence (AI) has seen some major breakthroughs in the past few years, spurred by the advances of deep reinforcement learning (DRL) algorithms utilized in environments with sparse rewards and complete information [1,2] or in complex multi-agent environments with incomplete information [3,4,5]. However, the research path leading up to today's pinnacle of these applications is marked by a crisis of reproducibility and required intense manual trial-and-error efforts, such as finding a good network initialization and subsequent hyper-parameter tuning, which can make all the difference between a working and a failing solution [6]. What complicates the problem even more is that many working solutions are interspersed with unwanted behavioral artifacts that manifest in the learned policy of agents, if the environment allows for such manifestation, e.g. in the domain of learning locomotion [7]. Such artifacts are commonly caused by incentivizing an agent to solely maximize a possibly richly shaped reward without any constraints on its policy. The usual approach of training agents to maximize their cumulative reward and quantitatively evaluating them solely based on this reward or any other performance metric, such as the Elo rating in chess, raises a key question:
How can we trust an agent, if we do not understand how its behavior emerges from its internal processes and the complex interplay of its individual functional components?
In this paper, we aim to contribute towards answering this question by following a research paradigm from the field of neuroscience based on empirical studies of large and complex neural systems. Such systems have been the objects of investigation for decades, starting with the influential work of Hubel and Wiesel in the 1950s [8], aiming to make them transparent and interpretable with respect to how their inner processes contribute to abstract concepts like consciousness and decision-making. Specifically, we investigate the behavior of DRL agents in three different classic control environments based on the learned representations of their policy networks, aiming to find a link between these representations and different behavioral stages during the execution of the trained policy. We characterize the actor's learned representations based on its layer activations during the execution of the policy and use network ablations (cf. section 3.2) to intentionally damage agents, evoking malfunctioning behavior, in order to compare the representations of the fully intact and damaged networks to each other.

First, we investigate the impact of network ablations of different sizes in different layers on the agent's capability to solve its trained control task and show that the agent exhibits a task-specific robustness to these ablations depending on their size and location. We further investigate how the activations of single units contribute to solving the control task, uncovering specific correlation patterns between these activations and the executed actions during an episode. Finally, we investigate patterns in the temporal evolution of the actor's layer activations and find that the healthy agent's learned representation contains distinct activation states that can be directly linked to the different behavioral stages of the policy that successfully solves the control task, ultimately providing a link between the agent's behavior and its internal processes.
2 Related Work

Most of the recent work on transparency, interpretability and explainability of AI comes from the field of computer vision (CV), where the main focus is commonly placed on investigations of convolutional neural networks (CNNs) and the importance of specific input variables for a network's output [9,10,11,12,13]. Similar efforts are made in the field of natural language processing (NLP), where recurrent neural networks (RNNs) are investigated for their representations of linguistic properties, contextual understanding or sentiment [14,15,16,17]. Typically, learned network representations are characterized via embedding methods like t-SNE [18] or UMAP [19], visualizing the high-dimensional activation space of neural networks to identify the role of specific network components in solving a given task [20,21,22,23,24]. To this end, network ablations were used to study the impact of single units on a network's performance [25], aiming to decide which units can be pruned without affecting a network's discriminative power [26,27,28]. Subsequently, network ablations revealed that a single unit's importance can be characterized by the magnitude of its weights [29] and the extent to which the distribution of its incoming weights changes during training [30,31]. Additionally, it was shown that units which are easily interpretable are not necessarily more important than units with a less accessible interpretability [32]. Recently, conflicting insights on methods for evaluating the similarity of learned network representations have been reported, demonstrating the early stage of current knowledge and thus the importance of, and the need for, more research on the topic [33,34]. In general, the extensive efforts of recent research aimed to map the classification result of a supervised-trained network to humanly interpretable explanations. We aim to extend these efforts towards the DRL domain, where, despite some work on understanding Deep Q-Networks and interpreting their learned policies in environments with a discrete action space [35], to the best of our knowledge, work on facilitating transparency and interpretability of learned representations by means of network ablations has not been conducted yet. However, in view of the fact that robust DRL and its application in real-world scenarios is still a matter of current research [6,36], we argue that a better interpretability of DRL agents is of utmost importance.
3 Experimental Setup

3.1 Environments and Agent Training

In this empirical study, we trained a DRL agent in three different classic control environments, namely the cart-pole swing-up (CPSU) environment [37], the pendulum swing-up (PSU) environment [38] and the cart-pole balance (CPB) environment [37] (cf. Figure 1). Although each environment poses an individual challenge, they share the partial objectives of controlling a cart on a rail or balancing a pendulum/pole in an upright position, providing some degree of comparability of the observed agent's behavior across tasks. We refrain from a more detailed explanation of the intricacies of these environments regarding their state space, action space and reward functions at this point, as they are well-known benchmark environments for DRL research and have been extensively explained elsewhere [37,38].

Fig. 1: Three exemplary rendered images of the respective control environments.
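As a hedged illustration, the three environments can be instantiated via OpenAI Gym and Roboschool [37,38]; the exact environment IDs below are our assumptions and may differ across installed versions:

```python
import gym
import roboschool  # importing registers the Roboschool environments with Gym

# Environment IDs are assumptions based on the cited benchmarks [37,38];
# they may differ depending on the installed Gym/Roboschool versions.
envs = {
    "CPSU": gym.make("RoboschoolInvertedPendulumSwingup-v1"),  # cart-pole swing-up [37]
    "PSU": gym.make("Pendulum-v0"),                            # pendulum swing-up [38]
    "CPB": gym.make("RoboschoolInvertedPendulum-v1"),          # cart-pole balance [37]
}

for name, env in envs.items():
    obs = env.reset()
    print(name, env.observation_space.shape, env.action_space.shape)
```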
As the object of investigation, we trained an actor-critic agent in the three described environments with the deep deterministic policy gradient algorithm as outlined in [39]. Both the actor and the critic network consist of two hidden layers with 400 units in the first layer and 300 units in the second layer, both layers using ReLU activations and layer normalization [40]. The critic is supplied with the actor's chosen actions, which are superimposed with an Ornstein–Uhlenbeck noise process [41], only in its second hidden layer. Each agent was trained for 800,000 time steps and optimized via Adam [42], with all other hyper-parameters being the same as in [39]. All computations were performed on a single machine containing two Intel Xeon Platinum 8168 processors with a total of 48 physical cores and 8 NVIDIA Tesla V100 32GB GPUs.
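The following is a minimal sketch of the actor network described above, assuming a PyTorch implementation; the class name, the placement of layer normalization before the ReLU, and the tanh output squashing are our assumptions, as the original implementation is not specified beyond the details given:

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Actor network as described above: 400/300 hidden units with ReLU
    activations and layer normalization [40]; the tanh output squashing is
    an assumption typical for DDPG with bounded continuous actions."""

    def __init__(self, obs_dim: int, act_dim: int):
        super().__init__()
        self.fc1 = nn.Linear(obs_dim, 400)
        self.ln1 = nn.LayerNorm(400)
        self.fc2 = nn.Linear(400, 300)
        self.ln2 = nn.LayerNorm(300)
        self.out = nn.Linear(300, act_dim)

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        h1 = torch.relu(self.ln1(self.fc1(obs)))  # first hidden layer (400 units)
        h2 = torch.relu(self.ln2(self.fc2(h1)))   # second hidden layer (300 units)
        return torch.tanh(self.out(h2))           # continuous action in [-1, 1]
```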
3.2 Study Design

We characterize the actor's learned representations based on its layer activations during policy execution. We use network ablations to intentionally damage the actor, evoking malfunctioning agent behavior, in order to compare the representations of the fully intact and damaged networks. To this end, we record the activation of each single unit within the fully intact actor and its predicted actions for each time step of an episode, in addition to the cumulative episodic reward, to establish a baseline recording. Additionally, we record the same data for each individual ablation case to compare it to the baseline recording.
Network Ablations.
We perform partial network ablations in a single layer with varying proportions of ablated units by manually clamping their activations to zero, effectively preventing any flow of information through the ablated units. We select the proportion of ablated units in a range from 5% to 30% in steps of 5% and then from 30% to 90% in steps of 10%. In addition, we deviate from this pattern once by ablating 33.33% of the units within a layer. The ablated units are selected in a sliding-window manner, with the window shifted across the layer, similar to sliding a kernel over an image in a CNN, while the window position is frozen during an episode. Note that the total number of ablations with the same proportion varies, as it depends on the size of the layer, the size of the window and the stride of the window. For instance, in a layer with 300 units, a chosen window size of 5% and a stride of 10 units, 15 units are ablated at once, resulting in 29 different network ablations in total. For all ablations, we chose a constant stride of 10 units to gather sufficient activation recordings for statistical analysis while keeping the computational effort manageable.
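The window enumeration and the zero-clamping can be sketched as follows; this is a minimal illustration assuming a PyTorch forward hook on the actor sketched above (the helper names and the hook mechanism are our assumptions, not the authors' implementation):

```python
import torch

def sliding_windows(layer_size: int, window_frac: float, stride: int = 10):
    """Start/stop indices of all ablation windows for one layer, mirroring
    the sliding-window scheme described above (e.g. 29 windows for 300
    units, a 5% window and stride 10)."""
    width = int(round(layer_size * window_frac))
    return [(start, start + width)
            for start in range(0, layer_size - width + 1, stride)]

def make_ablation_hook(start: int, stop: int):
    """Forward hook that clamps the output of the ablated units to zero.
    Here the hook sits on the layer-norm output; since ReLU(0) = 0, the
    ablated units' activations are exactly zero downstream."""
    def hook(module, inputs, output):
        output = output.clone()
        output[..., start:stop] = 0.0  # no information flows through these units
        return output
    return hook

# Hypothetical usage: ablate units 100 to 119 in the actor's first hidden
# layer (`actor.ln1` refers to the sketch above); removing the handle
# restores the healthy network.
# handle = actor.ln1.register_forward_hook(make_ablation_hook(100, 120))
```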
Extraction of Activation Patterns.
To determine how single units contribute to the control task, we calculate, for each unit, the Pearson correlation coefficient between its set of activations A_{i,j} = {a_t | t ∈ [0, T]} and the outputs of the actor network U = {u_t | t ∈ [0, T]}, where t denotes the time step within the episode, T denotes the total number of time steps per episode, i denotes the i-th layer and j the j-th unit within that layer.

Furthermore, to characterize the learned representations within a layer of the actor, we store the activations of each single unit in that specific layer for each time step of an episode in a matrix M ∈ ℝ^{T×N}, where T denotes the number of time steps per episode and N denotes the number of units per layer. We visualize the evolution of the actor's activations during an episode using an open-source Python implementation of UMAP [19] to embed the stored activations into a two-dimensional space, i.e. M′ ∈ ℝ^{T×2}. Thus, each point in the embedded space represents the activation of a specific layer of the actor network at a single time step of an episode. We chose the default parameters for the UMAP embeddings after an initial attempt at finding better values for the number of nearest neighbours or the minimum distance between data points yielded no significant visual improvement of the embeddings.
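A minimal sketch of both analysis steps, assuming NumPy arrays holding the recordings described above and the open-source umap-learn package (function and variable names are ours, not the authors'):

```python
import numpy as np
import umap  # umap-learn

def correlation_pattern(M: np.ndarray, U: np.ndarray) -> np.ndarray:
    """Pearson correlation between each unit's activations (columns of the
    T x N matrix M) and the actor's chosen actions U of shape (T,). Ablated
    units have constant zero activation and thus yield NaN (no coefficient
    is calculated for them, as noted in the text)."""
    M_c = M - M.mean(axis=0)
    U_c = U - U.mean()
    denom = M_c.std(axis=0) * U_c.std() * len(U)
    return (M_c * U_c[:, None]).sum(axis=0) / denom  # shape (N,)

def embed_activations(M: np.ndarray) -> np.ndarray:
    """Embed the T x N activation matrix into two dimensions with default
    UMAP parameters, as in the study design above."""
    return umap.UMAP().fit_transform(M)  # shape (T, 2)
```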
4 Results

4.1 Impact of Network Ablations

To establish a baseline evaluation, we train the healthy agent to achieve near state-of-the-art results in all three environments, i.e. a maximum total episodic reward of 886.… for the CPSU task, −….87 for the PSU task and 1000 for the CPB task. For reasons of performance comparability across the three environments, the absolute return is normalized so that the minimum return value in each environment is 0 and the respective baseline return value is 1.
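Written out, this normalization maps a raw return R to a normalized return R̃; the symbols R_min and R_base are our labels for the environment's minimum and baseline return:

```latex
\tilde{R} = \frac{R - R_{\min}}{R_{\mathrm{base}} - R_{\min}},
\qquad \tilde{R}_{\min} = 0, \quad \tilde{R}_{\mathrm{base}} = 1 .
```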
Fig. 2: Comparison of the normalized returns achieved as a result of ablations of 30% of the units (red bars) in comparison to the respective baselines (blue bars).

Figure 2 shows the normalized return for the baseline in comparison to all 29 network ablations in the first and second layer with a window size of 30% (120 units) for the three control tasks. For both swing-up tasks, most ablations in the first layer have a negative impact on the agent's capability to solve the tasks. Interestingly, there are some ablations that have little to no impact or even a positive impact, thus increasing the return. In case of the CPB task, ablating 30% of the units in the first layer does not affect the agent's capability to solve the task at all. Contrary to the first layer, all ablations in the second layer have a strong negative impact for the CPSU task and the CPB task (except for two cases); however, only a few ablations have a comparably negative impact for the PSU task, where many ablations have little to no impact or even a positive impact.

The negligible impact of ablations suggests that either the capacity of the network has not been exploited to its fullest extent, so that some units do not contribute to solving the task and could be pruned, or that the information represented by the ablated units is redundantly represented by other units, making the agent robust against network ablations. The positive impact of ablations suggests that some units may play competing roles in the learned representation and that resolving this competition by targeted ablations improves the agent's capability to solve a task. Both observations are consistent with previously reported findings on the impact of ablations in supervised-trained neural networks on image recognition tasks [30,31].

Fig. 3: Distributions of the normalized returns for all ablations performed in the first layer (left side) and second layer (right side).

Figure 3 shows the distributions of the normalized returns resulting from the different network ablations in the first layer and second layer for the three control tasks. On average, the return decreases proportionally to the amount of ablated units. Comparing the impacts in the first layer across the three tasks shows a similar trend for the CPSU and the PSU task, i.e. a slow but steady decrease of the achieved return with increasing sizes of ablations, but a much more robust behavior for the CPB task, where ablations of up to 50% generally do not affect the agent's capability to solve the task. Further, comparing the impacts in the second layer shows a similar trend for the CPSU and the CPB task, i.e. a strong and sudden decrease in the achieved return for small ablation sizes, but a much more robust behavior for the PSU task, where ablations of up to 33.33% only marginally affect the agent's capability to solve the task.

Interestingly, connecting the similarity of the ablation impacts with the similarity of the different tasks suggests that the first layer holds a representation of how to swing up the pole/pendulum, while the second layer holds a representation of how to control the moving cart. More precisely, ablations in the first layer impact the agent in both tasks in which a pole has to be swung up, while the representation for the task that merely requires balancing the pole is very robust against ablations in this layer. Analogously, ablations in the second layer strongly impact the agent in both tasks in which a cart has to be controlled, while the representation for the task without a cart is fairly robust against ablations in this layer. These results suggest that interlinked learning objectives for solving the task, such as controlling the cart, swinging up the pendulum and subsequently balancing it, are represented in different locations of the network. These observations are consistent with previously reported findings on the localized representations of specific classes in supervised-trained neural networks on image classification tasks [43,44,45].
4.2 Correlation Patterns of Single Unit Activations

Following the observations described above, we wonder what role the precise interplay of single unit activations (SUA) plays with respect to the agent's executed policy. More specifically, we ask whether the contribution of SUA to the executed actions during an episode shows a distinct pattern for the healthy agent and to what extent this pattern is distorted in case of ablations with a negative impact on the achieved return. To this end, we characterize this pattern via the set of Pearson correlation coefficients calculated for the activations of single units within a layer and the outputs of the actor network for each time step within an episode (cf. 3.2).

Fig. 4: Correlation patterns between single unit activations and the actor's chosen actions (positive correlation ≈ 1, negative correlation ≈ −1) for the baseline and four exemplary ablations, together with their differences to the baseline pattern and the corresponding returns (−0.49, −816.51, −906.35, −0.38).

Figure 4 shows this pattern for the baseline and four exemplary ablations of 5% of the units in the first layer in the CPSU task. Each row contains 400 entries corresponding to the 400 units in the first layer. Each entry contains the correlation value and shows how the unit's activation correlates with the actor's chosen action. The empty spaces in the rows mark the ablated units, for which no correlation coefficient is calculated. The top row shows the baseline correlation pattern in comparison to the following four rows, which show the correlation patterns corresponding to the four exemplary ablations. The bottom four rows show to what extent the patterns resulting from the ablations change compared to the baseline, specified by the difference between the baseline pattern and the ablation patterns.

The ablations of units 100 to 119 and 270 to 289, resulting in the agent's failure to solve the task, show a general increase in correlation between the SUA and the chosen actions and the strongest difference to the baseline pattern. A high correlation value indicates a unit's exclusive contribution to a specific control direction, i.e. whenever the cart is moved to either side, specific units are selectively active and contribute to the control in that direction. However, such distinct contributions of single units do not seem to constitute a robust representation, as we find that patterns with less distinct correlations between single unit activations and the chosen actions generally lead to higher returns. This observation shows some similarity to previously reported findings about the importance of single units in supervised-trained networks for image classification tasks. Specifically, networks that memorize well instead of generalizing are more reliant on units that show a high selectivity in their activation for specific classes, indicating that units which
selectively get activated for specific classes do not contribute as much to a robust and generalized representation as units with a less selective activation [32].

Fig. 5: Scatter plot of the mean and the variance of the correlation patterns for the baseline and all 29 ablations of size 5%, together with their corresponding returns, in the CPSU task.

In order to further solidify that notion, we compared the mean and the variance of the correlation patterns of all ablations with the mean and the variance of the baseline pattern, hypothesizing that high values for the mean and the variance, corresponding to strong and distinct correlations, result in a low return. Figure 5 shows a scatter plot of the mean and the variance of the correlation patterns for the baseline and all 29 ablations of size 5% and their corresponding returns. Confirming the hypothesis, ablations of units resulting in large values for the mean and the variance, e.g. units 100 to 119 (marked in the top right corner of the scatter plot), lead to low returns. Almost all other ablations with mean and variance values close to the baseline (points within the red ellipse) do not result in task failures but achieve returns comparable to the baseline. Interestingly, the ablation of units 270 to 289, which results in small values for the mean and the variance, also leads to a low return, suggesting that our hypothesis can be extended towards small values for the mean and the variance, corresponding to no clear contribution of most of the single units to the control task.

To further test the validity of the hypothesis across different sizes of ablations and across the three tasks, Figure 6 summarizes the effects of all ablations (5% to 90%) on the return and their dependency on the characteristics of the correlation patterns.
Fig. 6: Scatter plots showing the mean (x-axis) and variance (y-axis) of the correlation coefficients for all ablations of the specified layer: (a) first layer, CPSU; (b) first layer, PSU; (c) second layer, CPB. Each panel marks the baseline and the regime of high returns.

Analogously to Figure 5, the x- and y-axis show the mean and the variance of the correlation patterns. For the CPSU task, the highest return is generally achieved for patterns with a low variance, as ablations leading to larger variances show a decreased return. This suggests that the CPSU task requires single units to be generically involved in the control task and not to specialize too strongly on specific controls. On the contrary, for the PSU task, higher returns are generally achieved for patterns with a high mean and a high variance, suggesting a further refinement of our hypothesis with respect to task-specific characteristics. Interestingly, ablations that increase both values beyond the baseline lead to even higher returns, while patterns with low values lead to low returns. This suggests that the ability to swing up the pendulum requires the units to contribute to the control in a very specific rather than generic way. Consistently, a very clear picture emerges for the CPB task, where no swing-up is required and only patterns with low values for mean and variance result in high returns, verifying our initial hypothesis. In combination with the CPSU task, this suggests that the ability to control the moving cart requires a generic involvement of single units in the control task rather than specific roles.
4.3 Temporal Evolution of Layer Activations

Although the correlation patterns provide some insights on how the agent acts, they do not capture the temporal evolution of the learned representations and do not answer questions with respect to this evolution, e.g.: At what point during the episode does the agent fail? When does it diverge from the baseline behavior, and in what way? Does the agent go through different behavioral stages during an episode, and can these stages be linked to specific patterns in the learned representation? In order to answer these questions, we characterize the learned representations by embedding the layer activations recorded during an episode (cf. 3.2) and compare the representations of the baseline to the representations resulting from the ablations.

Fig. 7: Comparison of the temporal evolution of layer activations between the baseline and three exemplary ablation cases for the CPSU task: (a) ablation of units 20 to 39 (5%) in the first layer; (b) units 110 to 149 (10%) in the first layer; (c) units 260 to 289 (10%) in the second layer. Annotated stages in the plots include the start of the swing-up, the rail border, the start of the stabilization and balancing stages, the union of the two balancing paths, and the divergence of the ablated agent.
Figure 7 shows this comparison for three exemplary ablation cases in the CPSU task. Each scatter plot contains 1000 blue and 1000 red points corresponding to the layer activations at each time step during an episode for the baseline and the ablation case, respectively. Note that even though the baselines in (a) and (b) show the exact same values, they are embedded slightly differently, as the embeddings were calculated separately for each case. The three cases correspond to ablations which had no effect on the agent's capability to solve the task (Figure 7a) or which led to only half the return of the baseline (Figures 7b and c).

Figure 7a shows the evolution of the layer activations during an episode for the healthy and the damaged agent and how the different behavioral stages of the episode are linked to different sections of this evolution. Both the healthy and the damaged agent start by moving the cart to the side, accelerating the pendulum to swing it up. After the initial swing-up (upon reaching the rail border), the agent is required to compensate for the excess momentum of the pole via corresponding cart movement to stabilize its upright position. This change in behavior results in a jump in the activation space from the initial activation path, which corresponds to the initial swing-up behavior, to another path, which corresponds to the stabilization behavior. The difference in activations is likely due to the movement of the cart in the opposite direction upon reaching the rail border. Following the successful stabilization, the agent is required to balance the pole by rapidly switching the direction of the cart to maintain an upright pole position. Interestingly, this behavior is represented in the activation space by two paths, along which the layer activations progress as the agent acts throughout the episode. The layer activations repeatedly switch between these two paths, suggesting that the network constantly changes between two distinct activation states corresponding to the balancing act of the pole. At some point during the episode, these two paths merge together (union), as the balancing act leads to an almost static position of the cart and the pole. However, from a mechanical perspective, this constitutes an unstable equilibrium point for the pole, where small perturbations of the pole's angular position result in its downfall, triggering a renewed balancing act that is reflected by a renewed separation of the merged paths. This observation suggests that the convergence of the actor's activations towards a single final activation state is not sufficient to solve the task. Rather, a stable and continuous transition between two distinct activation states is necessary to sufficiently represent the balancing act. This observation seems somewhat surprising considering the weak correlations of SUA with the actor's chosen actions throughout an episode (cf. 4.2). Although the SUA do not correlate strongly with the network's executed actions, their combined activations lead to two distinct activation states of the network, each of them corresponding to the movement of the cart in one of the two possible directions during the balancing act.
This suggests that single units do not contribute individually to the control task, but rather as part of a larger conglomerate of units that constitutes the two different activation states.

Figure 7b shows an ablation case in which the agent fails to balance the pole continuously after the initial swing-up and drops it after a short period of holding it in the upright position, reattempting the swing-up and the balancing act. The layer activations diverge slightly from the baseline right from the start of the swing-up and diverge completely after a short period of the stabilization phase. Consequently, due to this divergence, the layer activations of the damaged agent do not show the emergence of two distinct paths connected to the balancing act, as the agent never succeeds in stabilizing the pole by compensating its excess momentum after the initial swing-up.

Interestingly, the existence of two distinct activation states is not exclusive to the actor's first layer but is also apparent in its second layer. Figure 7c shows an ablation case in layer two, in which the failure of the agent is caused by a drop of the pole after the initial swing-up and a short period of balancing, causing the pole to rotate at high speed until the end of the episode. The blue points resemble a similar pattern of the second layer's activations compared to the first layer, including the divergence of the activations along two distinct paths, the attempt to merge these paths and the renewed separation. The failure of the agent, i.e. the continuous rotation of the pole at high speed, is visible in the activation space by the circularly arranged red points, from which the agent is not able to recover back onto the stabilization path and the two connected paths corresponding to the balancing act.
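A sketch of how such a per-case comparison could be produced, building on the activation matrices from above; fitting UMAP on the concatenation of both recordings per case is our assumption, consistent with the note that the baseline is embedded slightly differently in each panel:

```python
import numpy as np
import umap
import matplotlib.pyplot as plt

def compare_embeddings(M_base: np.ndarray, M_abl: np.ndarray):
    """Embed baseline and ablated layer activations (each of shape (T, N))
    jointly, so both trajectories share one 2-D space per ablation case;
    variable names are ours, not the authors'."""
    T = M_base.shape[0]
    emb = umap.UMAP().fit_transform(np.vstack([M_base, M_abl]))
    plt.scatter(emb[:T, 0], emb[:T, 1], s=4, c="blue", label="Baseline")
    plt.scatter(emb[T:, 0], emb[T:, 1], s=4, c="red", label="Ablated")
    plt.legend()
    plt.show()
```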
5 Conclusion

In this paper, we conducted an empirical study to understand how a DRL agent acts, based on characterizing the learned representations of its policy network. We shed some light on the role of single units for the control task and found that, despite the absence of a strong correlation between their activations and the actor's chosen actions throughout an episode, agents that solve their tasks successfully show task-specific patterns of weakly correlated SUA, which get distorted by network ablations leading to low returns. The importance of these patterns for a successful solution of the control task suggests that the careful interplay between single units with respect to the executed policy is essential, rather than their sole and isolated behavior. However, we have only scratched the surface of how such patterns of joint activations can be characterized. In our future work, we plan to systematically investigate the role of functional neuron populations and their involvement in solving a given control task. Specifically, we plan to investigate the activations of sub-populations of neurons, aiming to uncover whether there is a link between their activations and the emergent agent behavior.

We further investigated the temporal evolution of the actor's layer activations during an episode and showed that, in case of the CPSU task, the consecutive steps executed during the episode to solve the task are precisely represented by the policy network and mapped onto its layer activations. We further showed that this mapping is essential for solving the task, as its distortion as a result of network ablations leads to low returns and failed attempts to solve the task. The arrangement of the consecutive points in the embedded activation space revealed that the agent runs along specific paths in its activation space and that diverging from these paths is fatal for its task performance. The most striking observation regarding these paths is that the actor's layer activations can be very different for very similar states. We naively expected that the layer activations would converge to a single specific activation vector, just as the consecutive states processed by the network become more and more similar to each other as the pole is balanced. However, we found that this is not the case, suggesting that the learned representations may contain some information that is encoded in the temporal dimension along which the states are ordered, i.e. that the same state evokes a different activation of the network depending on when it is presented to the network. In our future work, we plan to investigate how these distinct activation patterns evolve during training, aiming to answer the question whether the different behaviors are learned hierarchically, i.e. in a specific order, or whether they emerge collectively.

Considering that our study was limited to a single agent solving three distinct control tasks, the universality of our results is strongly limited and their implications for other networks and tasks are not clear. We plan to address this issue by transferring our study design to a larger number of different networks and control tasks, aiming to establish a scientific standard for the falsifiability of empirical studies conducted in the field of artificial neural networks. Ultimately, we aim to pave the way towards a new perspective of neuroscience-inspired empirical studies on artificial neural networks, exploiting them as a test bed for neuroscientific research.
Uncovering parallels between the structure and organization of represented knowledge in artificial and biological systems opens up measures and possibilities for initial large-scale studies in artificial systems before transferring them to biological systems. Specifically, this addresses the issue of reproducibility, which, despite modern experimental methods, is one of the most critical issues in modern neuroscience, stemming from the large differences between brains and the commonly small sample sizes in neuroscientific studies.
References
1. D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van Den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, et al., "Mastering the game of Go with deep neural networks and tree search," Nature, vol. 529, no. 7587, pp. 484–489, 2016.
2. D. Silver, T. Hubert, J. Schrittwieser, I. Antonoglou, M. Lai, A. Guez, M. Lanctot, L. Sifre, D. Kumaran, T. Graepel, et al., "Mastering chess and shogi by self-play with a general reinforcement learning algorithm," arXiv preprint arXiv:1712.01815, 2017.
3. OpenAI, "OpenAI Five," https://blog.openai.com/openai-five/, 2018.
4. B. Baker, I. Kanitscheider, T. Markov, Y. Wu, G. Powell, B. McGrew, and I. Mordatch, "Emergent tool use from multi-agent autocurricula," arXiv preprint arXiv:1909.07528, 2019.
5. M. Jaderberg, W. M. Czarnecki, I. Dunning, L. Marris, G. Lever, A. G. Castaneda, C. Beattie, N. C. Rabinowitz, A. S. Morcos, A. Ruderman, et al., "Human-level performance in 3D multiplayer games with population-based reinforcement learning," Science, vol. 364, no. 6443, pp. 859–865, 2019.
6. A. Irpan, "Deep reinforcement learning doesn't work yet," 2018.
7. I. Popov, N. Heess, T. Lillicrap, R. Hafner, G. Barth-Maron, M. Vecerik, T. Lampe, Y. Tassa, T. Erez, and M. Riedmiller, "Data-efficient deep reinforcement learning for dexterous manipulation," arXiv preprint arXiv:1704.03073, 2017.
8. D. H. Hubel and T. N. Wiesel, "Receptive fields of single neurones in the cat's striate cortex," The Journal of Physiology, vol. 148, no. 3, pp. 574–591, 1959.
9. N. Papernot, P. McDaniel, I. Goodfellow, S. Jha, Z. B. Celik, and A. Swami, "Practical black-box attacks against machine learning," in Proceedings of the 2017 ACM on Asia Conference on Computer and Communications Security, pp. 506–519, ACM, 2017.
10. R. C. Fong and A. Vedaldi, "Interpretable explanations of black boxes by meaningful perturbation," in Proceedings of the IEEE International Conference on Computer Vision, pp. 3429–3437, 2017.
11. K. Faust, Q. Xie, D. Han, K. Goyle, Z. Volynskaya, U. Djuric, and P. Diamandis, "Visualizing histopathologic deep learning classification and anomaly detection using nonlinear feature space dimensionality reduction," BMC Bioinformatics, vol. 19, no. 1, p. 173, 2018.
12. J. Su, D. V. Vargas, and K. Sakurai, "One pixel attack for fooling deep neural networks," IEEE Transactions on Evolutionary Computation, 2019.
13. R. Fong, M. Patrick, and A. Vedaldi, "Understanding deep networks via extremal perturbations and smooth masks," 2019.
14. A. Karpathy, J. Johnson, and L. Fei-Fei, "Visualizing and understanding recurrent networks," arXiv preprint arXiv:1506.02078, 2015.
15. A. Radford, R. Jozefowicz, and I. Sutskever, "Learning to generate reviews and discovering sentiment," arXiv preprint arXiv:1704.01444, 2017.
16. A. Bau, Y. Belinkov, H. Sajjad, N. Durrani, F. Dalvi, and J. Glass, "Identifying and controlling important neurons in neural machine translation," arXiv preprint arXiv:1811.01157, 2018.
17. A. Madsen, "Visualizing memorization in RNNs," Distill, 2019. https://distill.pub/2019/memorization-in-rnns.
18. L. van der Maaten and G. Hinton, "Visualizing data using t-SNE," Journal of Machine Learning Research, vol. 9, no. Nov, pp. 2579–2605, 2008.
19. L. McInnes, J. Healy, and J. Melville, "UMAP: Uniform manifold approximation and projection for dimension reduction," arXiv preprint arXiv:1802.03426, 2018.
20. M. Liu, J. Shi, Z. Li, C. Li, J. Zhu, and S. Liu, "Towards better analysis of deep convolutional neural networks," IEEE Transactions on Visualization and Computer Graphics, vol. 23, no. 1, pp. 91–100, 2016.
21. P. E. Rauber, S. G. Fadel, A. X. Falcao, and A. C. Telea, "Visualizing the hidden activity of artificial neural networks," IEEE Transactions on Visualization and Computer Graphics, vol. 23, no. 1, pp. 101–110, 2016.
22. Z. Elloumi, L. Besacier, O. Galibert, and B. Lecouteux, "Analyzing learned representations of a deep ASR performance prediction model," arXiv preprint arXiv:1808.08573, 2018.
23. D. V., "ConvNet playground," https://convnetplayground.fastforwardlabs.com, 2019.
24. S. Carter, Z. Armstrong, L. Schubert, I. Johnson, and C. Olah, "Activation atlas," Distill, vol. 4, no. 3, p. e15, 2019.
25. F. Dalvi, A. Nortonsmith, A. Bau, Y. Belinkov, H. Sajjad, N. Durrani, and J. Glass, "NeuroX: A toolkit for analyzing individual neurons in neural networks," in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 9851–9852, 2019.
26. P. Molchanov, S. Tyree, T. Karras, T. Aila, and J. Kautz, "Pruning convolutional neural networks for resource efficient inference," arXiv preprint arXiv:1611.06440, 2016.
27. H. Li, A. Kadav, I. Durdanovic, H. Samet, and H. P. Graf, "Pruning filters for efficient convnets," arXiv preprint arXiv:1608.08710, 2016.
28. N. Cheney, M. Schrimpf, and G. Kreiman, "On the robustness of convolutional neural networks to internal architecture and weight perturbations," arXiv preprint arXiv:1703.08245, 2017.
29. F. Dalvi, N. Durrani, H. Sajjad, Y. Belinkov, D. A. Bau, and J. Glass, "What is one grain of sand in the desert? Analyzing individual neurons in deep NLP models," in Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), 2019.
30. R. Meyes, M. Lu, C. W. de Puiseau, and T. Meisen, "Ablation studies in artificial neural networks," 2019.
31. R. Meyes, M. Lu, C. W. de Puiseau, and T. Meisen, "Ablation studies to uncover structure of learned representations in artificial neural networks," in Int'l Conf. Artificial Intelligence, 2019.
32. A. S. Morcos, D. G. Barrett, N. C. Rabinowitz, and M. Botvinick, "On the importance of single directions for generalization," arXiv preprint arXiv:1803.06959, 2018.
33. A. Morcos, M. Raghu, and S. Bengio, "Insights on representational similarity in neural networks with canonical correlation," in Advances in Neural Information Processing Systems, pp. 5727–5736, 2018.
34. S. Kornblith, M. Norouzi, H. Lee, and G. Hinton, "Similarity of neural network representations revisited," arXiv preprint arXiv:1905.00414, 2019.
35. T. Zahavy, N. Ben-Zrihem, and S. Mannor, "Graying the black box: Understanding DQNs," in International Conference on Machine Learning, pp. 1899–1908, 2016.
36. G. Dulac-Arnold, D. Mankowitz, and T. Hester, "Challenges of real-world reinforcement learning," arXiv preprint arXiv:1904.12901, 2019.
37. OpenAI, "OpenAI Roboschool," 2017.
38. G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba, "OpenAI Gym," 2016.
39. T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra, "Continuous control with deep reinforcement learning," arXiv preprint arXiv:1509.02971, 2015.
40. J. L. Ba, J. R. Kiros, and G. E. Hinton, "Layer normalization," arXiv preprint arXiv:1607.06450, 2016.
41. G. E. Uhlenbeck and L. S. Ornstein, "On the theory of the Brownian motion," Physical Review, vol. 36, no. 5, p. 823, 1930.
42. D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980, 2014.
43. A. Veit, M. J. Wilber, and S. Belongie, "Residual networks behave like ensembles of relatively shallow networks," in Advances in Neural Information Processing Systems, pp. 550–558, 2016.
44. C. Olah, A. Satyanarayan, I. Johnson, S. Carter, L. Schubert, K. Ye, and A. Mordvintsev, "The building blocks of interpretability," Distill, 2018. https://distill.pub/2018/building-blocks.
45. I. Rafegas, M. Vanrell, L. A. Alexandre, and G. Arias, "Understanding trained CNNs by indexing neuron selectivity."