Uncertainty Maximization in Partially Observable Domains: A Cognitive Perspective
Mirza Ramicic [email protected]
Artificial Intelligence Center, Faculty of Electrical Engineering, Czech Technical University in Prague, 12135 Prague, Czech Republic
Andrea Bonarini [email protected]
Artificial Intelligence and Robotics Lab, Dipartimento di Elettronica, Informazione e Bioingegneria, Politecnico di Milano, 20133 Milan, Italy
Abstract
Faced with an ever-increasing complexity of their domains of application, artificial learning agents are now able to scale up in their ability to process an overwhelming amount of information coming from their interaction with an environment. However, this process of scaling comes with the cost of encoding and processing an increasing amount of redundant information that is not necessarily beneficial to the learning process itself. This work exploits the properties of learning systems defined over partially observable domains by selectively focusing on the specific type of information that is more likely to express the causal interaction among the transitioning states of the environment. Adaptive masking of the observation space based on the temporal difference displacement criterion enabled a significant improvement in the convergence of temporal difference algorithms defined over a partially observable Markov process.
Keywords: partially observable Markov decision process, cognitive modelling, entropy, convolutional neural networks, reinforcement learning, temporal-difference learning, attention mechanisms and development, artificial neural networks, dynamics in neural systems, neural networks for development
1. Introduction
Recent rapid developments in reinforcement learning (RL) rely on the ability of agents to perceive and process a great surge of information collected through the interaction with their real or simulated environments. With the evolution of sophisticated artificial sensory apparatus began the collective quest to improve the predictability of the surrounding world dynamics by increasing the sheer amount of data collected from it. The data-greedy approach worked and consequently gave rise to significant breakthroughs and applicability of deep reinforcement learning (DRL). More complex architectures of neural network function approximators coupled with the increase of computational power allowed temporal-difference RL algorithms to achieve super-human level control in problems that were designed for the complexity and scale of human cognition such as Atari games Mnih et al. (2013, 2015), complex board games such as Go Silver et al. (2017, 2018); Schrittwieser et al. (2020) and modern strategy games like Starcraft II Vinyals et al. (2019a).

The aforementioned breakthrough approaches worked in part because both artificial and biological learning systems rely on the premise that their environment will provide them with enough unpredictability or informational entropy in order for them to perform their predominant function: adaptation. This Darwinian attribute of learning is evident in biologically inspired machine learning mechanisms such as RL in the way that artificial agents adapt to their environment by creating and updating a policy π that would ultimately select the actions according to the maximization of the expected reward in the long run Sutton and Barto (2018). The adaptation of a RL agent by learning can be seen as a process of reducing the inherent unpredictability or entropy of the constantly changing environment: as the agent learns, it becomes better at predicting its rewards. In this adaptive view of the learning process an artificial agent is reducing its "surprise" or entropy in its perception of the environment according to the free energy principle Friston (2010). Artificial RL systems faced with a zero-entropy state space and a zero-entropy reinforcement function would make learning obsolete: no potential uncertainty to reduce means no learning could take place in the system.

More data collected from the environment by DRL approaches meant that the learning agent's state space encompassed more of the external world's unpredictability, providing the learning algorithms with more entropy "fuel" for learning. However, in most real-world cases the amount of data collected from the environment is not linearly proportional to the overall entropy it yields: an increase in the dimension of the perceived data also increases the chance of encoding highly predictable and redundant environment data into the agent's state representation.

To mitigate the effects of encoding a great amount of low-entropy data that is not supportive of the learning process itself, recent approaches prioritized the agent's experiences that carried more entropy Ramicic and Bonarini (2017) or used an array of unsupervised learning techniques in order to compress the world representations into vectors with high entropy Vinyals et al. (2019a).
Both approaches effectively conveyed more of the environment's uncertainty to the learning agent itself.

The approach presented in this work addresses the issue of optimizing a limited-bandwidth communication channel between the agent's perception of the environment and its learning algorithm, asserting the importance of looking at the learning problem (artificial and biological) as essentially uncertainty greedy. This proposal is based on exploiting the inherent informational dependence that is a characteristic of all learning processes. Instead of increasing the channel's bandwidth in our quest to better describe the environment (i.e., increasing the state-space dimension), the goal of the proposed approach is to utilize the channel in a way that maximizes its ability to efficiently transfer the uncertainty or entropy of the perceived environment. It relies on a simple, yet effective concept of temporal difference displacement (TDD), defined over a single agent transition and indicating the portion of the sensed state that was affected by the transitioning process.
The TDD criterion allows for the implementation of the main functional component of the approach: active state space masking based on the specific transition's temporal difference displacement function (TDDM).
The main effect of the TDDM-based masking is the isolation of a subset of the observations that includes the information responsible for representing distinctions among world states, while suppressing the unchanging information, so as to reduce overloading the approximations of the learning algorithm. The proposed selective attentive focus thus enables learning algorithms dealing with partially observable spaces to improve their essential function of discriminating among the world states based on their temporal relationships, which, in turn, can support better decision policies by inducing determinism into the system.

The experimental results shown in Section 6.1 show that active state space masking can significantly improve the convergence of TD learning algorithms defined over a partially observable Markov process in a variety of complex and sensory demanding environments such as Atari games. The article is structured in an incremental way, with the first two sections providing the general context of looking at the problem in a specific way, therefore building up a foundation for the approach.
2. The big picture: getting the right context
Looking at the nature of things through the lens of the free energy principle
Friston (2010) imposes a duality: on one side we have a tendency of the universe, i.e. our environment, to achieve the state of least energy expenditure, which is a high entropy one (in both the informational and the thermodynamic sense), and, on the other side, learning adaptive systems, both biological and artificial, resisting this natural tendency to disorder. This fundamental disposition for learning for adaptation has been observed in low complexity biological forms such as worms Rankin (2004) and even in organisms with no nervous system Boisseau et al. (2016). Evolving from the simpler forms, the majority of biological systems have ever since improved their sensory apparatus and started maximizing its potential by developing mechanisms that enabled them to better cope with the abundance of surrounding entropy. The solution was simple: focusing on a finite subset of the environment and further evolving the techniques of better processing it in order to acquire the full potential of that particular perception.

For example, in the animalia kingdom organisms have evolved a strong preference for detecting electromagnetic waves, as they proved beneficial in reducing the uncertainty about their immediate environment, which, in turn, provided them with the possibility of better adaptation. Focusing on a specific range of the electromagnetic spectrum allowed the formation of a structure that we now refer to as the eye. Over time, biological systems evolved many types of sensory apparatus, but none of them conveyed as much entropy as visual information: most of physical reality does not necessarily make disturbances in the air we could detect, or emit chemical compounds, but most of it reflects electromagnetic waves and, more importantly, in a variety of different ways. Sounds and smells just did not have the ability to differentiate the properties of the environment enough to provide a high entropy sensory input; visuals did. This surge of entropy acquired by the newly founded ability to extract information from the visible light spectrum made a huge evolutionary leap in the Upper Paleolithic era Csikszentmihalyi (1992): the search to expand the domain of perception quickly became a search to improve its processing. Perception moved from simple reactive collections of neurons, existing even before the early Cambrian era Nilsson (1996), to the highly complex processing of visual information that now happens in a human brain. Certainly, the human sensory apparatus also improved during evolution, but the evolution of mechanisms that process the data it can produce had a major role in the rise to the Upper Paleolithic evolutionary boom Csikszentmihalyi (1992). The crucial ingredient was there; making sense of it was another issue.
Looking at the human genome, which is encoded using a colossal 3.2 billion base pairs over 24 distinct chromosomes, it is evident that its informational capacity is overwhelming. Even if we take the rather uneasy fact that most of the DNA does not encode information in proteins (so-called "junk DNA") and that only 2.3% of it is effectively capable of preserving the information organized in roughly 25,000 genes Penke et al. (2007), we are still faced with an enormous entropy reduction system. The evolutionary process encodes the contributions of individual learning (through a process of mutation) into a collective species-wide genome which carries a majority of the species' certainty about the environment: on average, random sampling of two same-sex humans from the entire population yields a 99.9% identical genetic sequence. This preservation of useful traits relates to the building-block hypothesis, at the basis of the schema theorem Goldberg; Sampson (1976) that guarantees convergence of Genetic Algorithms. The preserved traits allowed for the growth of meta-learning structures crucial for the naturally evolved human disposition to learn from the sources that are higher in entropy, and even to create new ones in the form of a language. Human beings became a highly complex biological system capable of adaptation which is beyond pure reactivity, an adaptation that was able to process the multitude of sensory information through hugely complex mechanisms in order to induce a second-order or even n-th order one. The information has been stored (in short term and long term memory), channeled and manipulated in a variety of ways, and a capacity for active imagination emerged.

However, this does not mean that this rapid human development owes everything to the functional modules preserved in the gene pool: the genome can never be viewed separately from the environment that gave rise to it. Genes co-evolve with the environment and depend on it in order to articulate themselves into functional ontogenetic mechanisms Barrett (2006). These same genes' ability to "realize" themselves through individual learning and adaptation relies on the exploitation of the abundance of uncertainty or informational entropy that the immediate environment provides throughout their lifespan Barrett (2006). Thus, the adaptive evolutionary process is not characterized by a mere instantiation of phenotypes based on the information contained in the genome blueprint, but can be seen as a developmental and computational process that is a function of the unpredictability of the environment Marcus (2004). A human brain is able to instantiate itself as a structure made of roughly 20 billion neurons which are integrated using trillions of connections. The "gene shortage" hypothesis Ehrlich (2000) argues that such complexity could not possibly arise from the limited informational potential that the aforementioned 25,000 "useful" genes provide. This small part of the encoded genetic material, although being necessary for the construction of a phenotype, does not seem to carry implicit information about the world features that are developmentally relevant to the individual. Instead, this encoded data can only become "useful" information for adaptation in the context of the dynamics of the phenotype's interaction with its environment Oyama (2000).

In other words, organisms can only exploit their ontogenetic heritage if they interact with the unpredictability and necessary "chaos" of the environment. In general, an environment with less thermodynamic energy, i.e. entropy, and less disposition to heterogeneity carries less potential for manipulation and discrimination, and produces an organism with reduced abilities to discriminate, to come up with broad and reliable strategies of adaptation, to engage in exploratory behavior and, on a more human level, to draw inferences about the environment Bruner (1959).
However, from a cognitive perspective, which this work adopts, the efficacy of learning depends not only on the level of environment energy but also on the amount of that energy that can be perceived or channeled to the learning system itself. Energy does not have learning potential unless and until the variations in energy have different effects on perception; these should be viewed in the context of the machinery that processes them Gibson and Gibson (1955).

Life's quest for a reduction of uncertainty (as far as we know) did not appear in high energy environments such as the gas giants of our Solar system, nor did it sustain in low energy ones such as the Earth's Moon or Mars, for example. This biological process, however, has some prerequisites in terms of the entropy level of its environment: it needed to be just enough to enable the adaptive learning mechanism to predict the patterns of the causal relationships between the changing environmental states according to the integrated information principle
Tononi et al. (2016). The integrated information
Φ represents the information that is irreducible to its non-interdependent subsets, which in our case represent the environmental states or their representations. Instead, this type of information explains the relationships between them, aiding the integration of a set of phenomenal distinctions into a unitary experience Tononi et al. (2016). The ability of a learning system to extract the integrated information from its environment depends on the environment's ability to convey the causal distinctions between its states. In the case of high energy systems the difference between the states is so vast that no reasonable correlation is possible: for example, high-energy planets of our Solar system such as Venus incorporate such a level of thermodynamic energy in the movement of their molecules that the system exhibits an unpredictable chaos. On the contrary, low energy systems just do not have enough entropy in their states for the life-supporting uncertainty to emerge.

The well known
Goldilocks thermodynamic property of habitable planets Rampino (1994) can be extended, in Shannon's sense, as the optimal informational saturation condition of all learning systems, biological and artificial.
The breakthroughs in artificial learning algorithms mentioned in Section 1 have dealt with the uncertainty of the world by focusing on the part of it that was deterministic in nature and defining it as a
Markov Decision Process (MDP) Sutton and Barto (2018). This represented a sort of a leap of faith, as most real-world problems are inherently non-Markovian: the world itself is highly non-Markovian, and complex biological learning systems like humans have long since benefited from this fact, as suggested in Section 2.1. Even so, since the mid-1960s the artificial intelligence community has developed methods that could represent and reason with uncertainty, originating from the control engineering perspective of Karl Johan Åström Åström (1965). The majority of TD methods have relied on this deterministic safe haven of MDPs. This tendency can be partially attributed to the fact that the proofs of convergence of TD algorithms assumed the agent's perceived state space to be Markovian and ergodic in nature Watkins and Dayan (1992); Tsitsiklis (1994). Despite the convergence issues, the artificial intelligence community adopted the method as a (more or less) natural extension of the MDP under the name of partially observable Markov decision process, or POMDP Monahan (1982); Lovejoy (1991); Cassandra et al. (1994).

A Markov Decision Process is fully defined by the tuple $\langle S, A, T, R \rangle$, which includes: a finite set of environment representations $S$ that can be reliably encoded by the agent, a finite number of actions $A$ that an agent is allowed to perform in that environment, a transitional model of the environment $T$ providing a functional mapping of $S \times A$ to discrete probabilities defined over $S$, and a reward function $R(s, a)$ which maps the state and action pairs from $S$ and $A$ to a scalar indicating the immediate reward feedback the agent receives from being in a specific state $s$ and taking a specific action $a$ Sutton and Barto (2018).
In POMDPs the algorithm does not have the benefit of performing the mappings of $S \times A$ over a set of deterministic states $S$, but rather over a set of possible partial observations $O$ of the states Cassandra et al. (1994). In other words, additional modelling of the concept of observation is required. The solution to this problem came in the form of a belief state: an internal representation which maps the environment states to the probability that the environment is actually in that state. The belief state, denoted by $B$, is simply a probability distribution that can be represented by a vector of probabilities, one for each possible state of the environment, summing to 1 Cassandra et al. (1994). This articulates the problem of learning in a partially observable environment as a problem of estimating the "true" state of the world based on the belief state derived from the agent's partial observations.

The POMDP agent improves its model of the environment by updating its state estimate $\tau(b, a, o)(s')$ about the state $s'$ based on the previous belief state $b$, along with the most recent action $a$ and the most recent partial observation $o$, by applying Bayes' rule according to Equation 1. The transitional probabilities $\tau(s, a, s')$ in Equation 1 are given as in vanilla MDPs and $b(s)$ represents the actual probability assigned to the state $s$ when the agent is in a specific belief state $b$.

\[ \tau(b,a,o)(s') = P(s' \mid a, o, b) = \frac{P(o \mid s', a, b)\, P(s' \mid a, b)}{P(o \mid a, b)} = \frac{O(s', a, o) \sum_{s \in S} \tau(s, a, s')\, b(s)}{P(o \mid a, b)} \tag{1} \]
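To make the belief update concrete, the following is a minimal sketch (not the authors' implementation) of Equation 1 for a small discrete POMDP; the transition matrix T, the observation matrix Z, the toy probabilities and the helper name update_belief are illustrative assumptions.

```python
import numpy as np

def update_belief(b, a, o, T, Z):
    """Bayesian belief update of Equation 1 for a discrete POMDP.

    b : (|S|,) current belief over states, sums to 1
    a : int, index of the executed action
    o : int, index of the received observation
    T : (|A|, |S|, |S|) transition probabilities T[a, s, s']
    Z : (|A|, |S|, |O|) observation probabilities Z[a, s', o]
    """
    # P(s' | a, b) = sum_s T(s, a, s') b(s)
    predicted = b @ T[a]
    # numerator of Equation 1: O(s', a, o) * sum_s T(s, a, s') b(s)
    unnormalized = Z[a, :, o] * predicted
    # the denominator P(o | a, b) simply renormalizes the belief
    return unnormalized / unnormalized.sum()

# toy example: two states, two actions, two observations
T = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.5, 0.5]]])
Z = np.array([[[0.8, 0.2], [0.3, 0.7]],
              [[0.8, 0.2], [0.3, 0.7]]])
b = np.array([0.5, 0.5])
print(update_belief(b, a=0, o=1, T=T, Z=Z))
```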
Regardless of their differences, solving problems defined over MDPs and POMDPs comes down to finding a policy π that will maximize the future expected reward Sutton and Barto (2018). While in the case of an MDP this policy represents a mapping of deterministic states $S$ to actions, in a POMDP the actions are chosen on the basis of the agent's current belief state $b$. Along with the iterative update of the agent's belief state using Equation 1, the agent's first step towards learning a policy π is the iterative update of the value functions $V$ for each of its belief states using dynamic programming methods Bellman (1966) such as value iteration, outlined in Equation 2 (this is where the inherent complexity of the POMDP approach becomes apparent). The updated value function $V_{n+1}$ in Equation 2 is calculated on the basis of the previous value function $V_n$ defined over the current state estimate given by Equation 1 and the immediate expected reward $r(b, a)$ of executing action $a$ in belief $b$. The expectation of this scalar reward $r(b, a)$ is taken over the whole state space under the current belief $b(s)$, as defined in Equation 3.

\[ V_{n+1}(b) = \max_a \left[ r(b,a) + \gamma \sum_{o \in O} P(o \mid b,a)\, V_n(\tau(b,a,o)) \right] \quad \forall b \in B \tag{2} \]

\[ r(b,a) = \sum_{s \in S} b(s)\, r(s,a) \tag{3} \]

For an arbitrary value function $V$ updated by Equation 2, a policy π is said to be improving on $V$ if it satisfies Equation 4. The convergence of the policy π to the optimal policy π* is the result of the convergence of the value function $V$ to $V^*$ as the number of iterations $n$ goes to infinity.

\[ \pi(b) = \operatorname*{argmax}_a \left[ r(b,a) + \gamma \sum_{o \in O} P(o \mid b,a)\, V(\tau(b,a,o)) \right] \quad \forall b \in B \tag{4} \]

If we simply omit the $\max_a$ operator from Equation 2 we get a representation of the value of executing a specific action $a$ in a current belief state $b$. This representation, also known as the Q-value, is given in Equation 5 and it has been widely used in temporal-difference learning since Watkins' paper Watkins and Dayan (1992).

\[ Q_{n+1}(b,a) = r(b,a) + \gamma \sum_{o \in O} P(o \mid b,a)\, V_n(\tau(b,a,o)) \quad \forall b \in B \tag{5} \]

However, solving a problem defined over a POMDP proved not to be such an easy task due to the complexity Ross et al. (2008); Lee et al. (2008) imposed by the creation of the belief state $B$, which in most cases has the same dimension as $|S|$: the dimension of the belief space thus grows exponentially with $|S|$. A certain revival for POMDPs, though, came with the introduction of more complex function approximators: for most non-trivial cases keeping track of the values $V$ given by Equation 2 for each of the observations was computationally unfeasible because of their sheer numbers, and for this reason the value functions have been approximated by ANNs, ranging back to Lin (1991). The approximation is done by nudging the parameters or weights Θ of an ANN by a small learning rate α at each learning step so that the current estimate $Q(b, a; \Theta)$ will be closer to the target Q-value given by Equation 5.
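As an illustration only, the backup of Equation 5 can be sketched as follows for an exactly enumerable belief space (rather than the approximated one used later in the paper); tau and P_obs are assumed helper callables corresponding to Equation 1 and to P(o | b, a).

```python
import numpy as np

def q_backup(b, a, r, V_n, tau, P_obs, observations, gamma=0.99):
    """One application of Equation 5 for a single belief b and action a.

    b            : (|S|,) belief vector
    r            : (|S|, |A|) immediate reward table r(s, a)
    V_n          : callable mapping a belief vector to its current value
    tau          : callable (b, a, o) -> updated belief (Equation 1)
    P_obs        : callable (o, b, a) -> probability of observing o
    observations : iterable of observation indices
    """
    # Equation 3: expected immediate reward under the belief
    r_ba = float(b @ r[:, a])
    # Equation 5: expected discounted value over the possible observations
    future = sum(P_obs(o, b, a) * V_n(tau(b, a, o)) for o in observations)
    return r_ba + gamma * future
```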
The weights are updated by minimizing a loss function $L(\Theta)$ representing the difference between the previous estimate and the expectation target, performing stochastic gradient descent on the weights Θ to achieve $Q(b, a; \Theta) \approx Q^*(b, a)$ according to Equation 6:

\[ \nabla_{\Theta_i} L_i(\Theta_i) = \left( y_i - Q(b, a; \Theta_i) \right) \nabla_{\Theta_i} Q(b, a; \Theta_i), \tag{6} \]

where $y_i$ represents the target Q-value obtained by calculating the Bellman optimality under the newly observed transition parameters using Equation 5.

The later introduction of many-layered deep ANNs capable of scaling up to the ever-increasing sensory demand of modern RL applications Mnih et al. (2013, 2015); Silver et al. (2017, 2018); Vinyals et al. (2019a,b); Schrittwieser et al. (2020) inspired deep learning POMDP approaches by Hausknecht and Stone (2015) and, more recently, Le et al. (2018). Their deep recurrent Q-network (DRQN) achieved a better adaptation of agents under circumstances where the quality of observation changes over time, compared to a vanilla DQN.
Because of its approximation power and scalability over the POMDP domain, the DRQN approach is used as the basis for the proposed TDDM filtering architecture, further elaborated in Section 4.
3. POMDP as a Perception Mechanism
Why link perception with partially observable Markov decision processes? The nature of perception itself lies in selectively filtering, processing, redefining and, in most cases, interpreting the raw information received through an agent's sensory apparatus. In this process, an agent, biological or artificial, is not acting upon the idealized full potential of informational content present in its immediate environment but on a small subset of highly processed inputs, which are often moved into latent spaces. Despite this limitation, biological agents can act optimally in a partially observable world by building something that in an artificial sense could be seen as POMDP belief states.
One of the first computational perspectives on vision in general was given in the late 1970s by David Marr Marr (1976, 1982). The works postulate a theory of early visual computational processing that has inspired some of the pioneering works Agre and Chapman (1987); Agre (1988) that dealt with the problem of computational perception as an important component of artificial learning agents. Marr's work sets a theoretical base for the principle of what Agre and Chapman called deictic representation by postulating that the first operation on a perceived raw image is to transform it into a simpler, but entropy rich, description of the way its intensities change over the visual field, as opposed to a description of the intensities themselves Marr (1976, 1982). This primal sketch, as he coined it, provides a description of significantly reduced size that is still able to preserve the important information required for image analysis. The importance of the Agre and Chapman deictic approach Agre and Chapman (1987); Agre (1988) from the perspective of the work proposed here lies in its architecture: the crisp distinction between the visual perception system and the central system (i.e. the learning algorithm). The visual system thus takes the deictic burden: at any given moment, the agent's representation should actively register only the features or information that are relevant to the goal and ignore the rest. This architectural modularity allows the central system in Agre and Chapman (1987); Agre (1988) to be implemented in a rather simple way, without the complexity of a pattern matcher or similar computationally demanding processes, as the deictic process permits it to generalize over functionally and indexically identical states of the environment by simply not bringing in the redundant distinctions among them.

The later work of Ballard et al. Ballard et al. (1997); Hayhoe et al. (1997) put the deictic principle of Agre and Chapman (1987); Agre (1988) into the broad context of visual processing of biological systems, suggesting that human visual representations are limited and task dependent. Ballard et al. (1997); Hayhoe et al. (1997) further postulate that the superior human performance in visual perception can be attributed to a sequence of constraining deictic processes based on a limited amount of primitive operations, supporting the notion that human working memory is limited in its capacity and computational processing ability Broadbent (1958); Baddeley (1992); Salway and Logie (1995).

A more complex extension of Agre and Chapman (1987); Agre (1988) is given by Chapman Chapman (1992) as the
SIVS architecture, capable of selective deictic visual processing of subsets of an image by identifying the regions that are "task dependent". An interesting part of the SIVS approach is that it implements, amongst others, a concept of visual routines inspired by Ullman (1987), which actively process the visual information within the time domain, allowing for the detection and abstraction of changes in the visual field Chapman (1992). Chapman's application of the visual routines is very much in line with the temporal context retaining properties of the POMDP-based learning algorithms presented in this work.
The deictic way of looking at a machine learning problem seemed very promising because of its ability to represent as equivalent the world states that require the same action according to the agent's current policy: more abstracted, more compact representations reduce the burden on learning mechanisms. As researchers eagerly exploited the possibilities of modeling artificial perception under the deictic principle, a concern arose whether this selective, compact and task-dependent world representation can be acted upon deterministically with respect to the Markov property in order for an agent to achieve an optimal policy Whitehead and Ballard (1991); Chrisman et al. (1991). The integration of adaptive control methods such as active perception with the (at the time) widely used machine learning algorithms Watkins and Dayan (1992) may lead to a phenomenon of perceptual aliasing Whitehead and Ballard (1991), as it can produce internal representations that are not consistent with each other. Lack of consistency among states can be very detrimental to TD algorithms Watkins and Dayan (1992), as their underlying principle of Bellman's optimality Bellman (1966); Sutton and Barto (2018) relies on this property: inconsistent states can destabilize the learning algorithm by introducing unfounded maximums in the value function Whitehead and Ballard (1991) which, in turn, can make the agent diverge from its optimal policy. Furthermore, perceptual aliasing can lead to distinct world states that may call for distinct actions according to the optimal policy being represented by the same deictic representation.

A partial solution was readily proposed by the work that introduced the problem of perceptual aliasing Whitehead and Ballard (1991) in the first place, and it was based on detecting and suppressing the representations that are less correlated. As the
MDP assumption still relied on deterministic principles, the correlation between the states in Whitehead and Ballard (1991) seemed to be the part of the system that was the source of certainty.

The work presented here extends that notion towards the work by Chrisman Chrisman (1992) that made use of the memory of the previous states in order to detect the essential information inducing correlations. The previously experienced correlations among the states are, in this case, used to build a probabilistic model that is able to predict the current world state. Although probabilistic models had been used in reinforcement learning as a form of experience replay Sutton (1991); Lin (1991), the so-called predictive distinctions approach of Chrisman Chrisman (1992) used them to drop the deterministic assumptions about the agent's representations by implementing a Hidden Markov Model (HMM) Rabiner and Juang (1986). Proposing the powerful, yet (at the time) untapped predictive ability of RNNs Jordan and Rumelhart (1992) to extend Chrisman's approach led to artificial agents with a better grasp of uncertainty which, in turn, led to the definition of the POMDP. McCallum McCallum (1993) used a similar HMM approach, the so-called utile distinction memory, introducing the possibility of discriminating states based on their utility: the world states represented by identical observations could be distinguished based on their prior assignments of rewards, thus reducing the system uncertainty further. The utile distinction memory McCallum (1993) approach raised the possibility of further optimization of the memory process itself, as seen in the later work by Wierstra and Wiering Wierstra and Wiering (2004) on utile distinction hidden Markov models, or UDHMM. We can relate the UDHMM approach to the work presented here as it too optimizes the learning process by limiting the amount of informational entropy being channeled to the learning algorithm, by creating its memory in such a way that it represents the distinctions of the world state only when needed. While the UDHMM does that by adjusting the number of steps it looks back in order to create the utility distinctions, the approach proposed in this work rather focuses on isolating a subset of the observations that induces the distinction-relevant information while ignoring the extraneous part. This observation partitioning principle has been successfully implemented in a class of POMDPs called mixed observability Markov decision processes, or MOMDPs, introduced by Ong et al. Ong et al. (2009). MOMDPs exploit the fact that, although the agent perceives limited representations of the world, some subset of its observations can be deterministic in the sense that they possess a fully observable property. In the MOMDP approach the agent state is split into the fully observable component x and the partially observable one y, which leads to the computational benefits of maintaining and updating a belief state b_y about the y component only.

The aforementioned advances bring focus to the problem of artificial perception Weyns et al. (2004); Spaan (2008) as a way for an agent to intrinsically and dynamically learn what to perceive in the first place. Popular approaches to active perception defined over POMDPs include designing a reinforcement function in such a way that it would minimize the sensing cost Boutilier (2002), minimize the agent's belief state uncertainty based on its current measure of entropy Araya et al. (2010), or credit the belief level achieved by the specific sensed state Spaan et al. (2015). More recent work Zhu et al. (2017) relates observations with the agent's actions by encoding them together in such a way that the LSTM layer can propagate the additional context of actions through the history of observation-action pairs. In de Castro et al. (2019), ANN function approximators are used to split the sensory input into the partially observable subset that is included in the POMDP's history of past states and the fully observable subset that is treated as Markovian. Artificial attention has also been explored recently in the context of standard MDP-based reinforcement learning problems through evolutionary techniques, taking biological inspirations such as intentional blanking, in the approach by Tang et al. Tang et al. (2020).
4. Model Architecture and Theoretical Background
This work introduces a novel method of improving the propagation of an environment's inherent uncertainty or entropy to a temporal-difference reinforcement learning algorithm defined over a partially observable domain, by introducing a perceptual filtering of the agent's state space based on the concept of a temporal difference displacement criterion, or TDD. The TDD property represents a simple yet effective way to maximize the amount of entropy that is dedicated to the representation of the causal relationship between the environment states or state representations, according to the integrated information principle Tononi et al. (2016).

The proposed TDDM criterion exploits the following properties of these types of learning algorithms:

• With each successive TD transition the algorithm takes advantage of the temporal relations between the transitioning states in order to improve its belief state about the environment: the policies that an agent forms are not a product of deterministic states but of the history of (possibly) all previous observations and their underlying relationships.

• The POMDP agent still updates its policy based on a single transition from state s to state s′ by performing an action a: this one-step information, along with a reward scalar, constitutes everything the algorithm needs in order to perform a learning update Watkins and Dayan (1992).

Moreover, it postulates the following:

• Each of the two subsequent states (s and s′) in a single learning transition can have either a positive or a negative effect on the belief state's uncertainty reduction, based on their temporal relationship Cassandra et al. (1994).

• The changes in the observation states (s and s′) that a specific transition has induced are more relevant in reducing uncertainty about the agent's belief state, as their informational content helps in distinguishing states from each other.

• The changes in the observation states (s and s′) channel more of the environment uncertainty or entropy required for creating a more accurate observation model on which the agent acts: they carry the highly valuable information about the transitional relationships.

• For an intuition regarding the previous point, let us imagine a case in which all of the agent's observations were exactly the same but the transitions yielded different rewards. The POMDP agent's learning algorithm would try to attribute the reward differences to the states in the form of value functions, but there would be no learning because the states would be indistinguishable from each other (s = s′).

The full potential of the TDDM perspective on learning is realized through an active state space masking, or selective filtering, of the agent's observations, based on the temporal difference displacement or TDD between the initial perceived state s and its successor s′. Figure 1 details the applied TDD transformations for an Atari game learning problem example.
The TDD criterion is estimated with a computationally inexpensive two-frame motion estimation technique based on polynomial expansion Farnebäck (2003), capable of producing a dense optical flow vector field from two successive video frames, which, in the case of
TDD, are the observation states of the agent's atomic transition (s and s′), as detailed in Figure 1.

After the initial problem-specific preprocessing of the perceived visual information, the two successive frames, namely S and S′, are used as input for the motion field estimation function f_MFE in Figure 1 a). In order to perform the motion estimation, the f_MFE function analyses the displacement of the intensities (dx, dy) between the starting image I(x, y, t) at time t and the image I(x + dx, y + dy, t + dt) obtained after a temporal displacement t + dt. Based on the initial displacement analysis, the method of Farnebäck (2003) produces a dense motion field estimate window, or MFE in Figure 1, commonly depicted using oriented Cartesian vectors representing the intensity and direction of the detected temporal displacements. The obtained dense motion field is then transformed into a binary threshold mask, or BM in Figure 1, by applying a simple adaptive high-pass filter on the vector magnitude component. Each transition generates its own unique binary mask BM, which is multiplied element-wise with each state input in Figure 1 c), effectively performing the TDD masking TDDM(S) on the input prior to its integration into the main TD learning algorithm.

Figure 3 outlines the final component of the main learning part of the algorithm: Q-value approximation using three layers of convolutions Hausknecht and Stone (2015), together with the proposed TDDM component processing the input.
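A minimal sketch of this masking step is given below (an illustrative reimplementation, not the authors' code); it assumes two already preprocessed 84x84 grayscale frames and a hypothetical fixed magnitude threshold in place of the adaptive filter, and uses OpenCV's Farnebäck optical flow.

```python
import cv2
import numpy as np

def tddm_mask(frame_s, frame_s_next, threshold=0.5):
    """Build the binary TDD mask BM from two successive grayscale frames.

    frame_s, frame_s_next : uint8 arrays of identical shape (e.g. 84x84)
    threshold             : assumed magnitude cutoff for the high-pass filter
    """
    # dense optical flow by polynomial expansion (Farneback, 2003)
    flow = cv2.calcOpticalFlowFarneback(
        frame_s, frame_s_next, None,
        pyr_scale=0.5, levels=3, winsize=15,
        iterations=3, poly_n=5, poly_sigma=1.2, flags=0)
    # keep only the magnitude of the displacement vectors
    magnitude = np.linalg.norm(flow, axis=2)
    # thresholding the magnitudes yields the binary mask BM
    return (magnitude > threshold).astype(frame_s.dtype)

def apply_tddm(state, mask):
    """TDDM(S): element-wise multiplication of the observation with BM."""
    return state * mask
```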
Figure 1: The process of active state masking based on TDD, decomposed with regard to its functional transformations f, represented by black circles; a) motion estimation based on polynomial expansion Farnebäck (2003); b) binary threshold mask generated from the motion-field vector magnitudes obtained in a); c) element-wise matrix multiplication of the original frame S with the binary mask obtained in b).

Figure 2: Simplified process of motion estimation based on the amount of displacement (dx, dy) detected between two transitional states of an Atari game example. The common Atari preprocessing includes resizing the input to an 84x84 matrix and reducing the color channels to a single grayscale one; a) image intensity I(x, y, t) at time t, or S_t; b) image intensity I(x + dx, y + dy, t + dt) after an amount of time dt has passed, or S_{t+dt}; c) during the dt time-window the autonomous agent has performed a transition defined over an MDP by taking an action (A), obtaining an immediate reward (R) and observing the state S_{t+dt} at the final time t + dt.

The recurrent property of the LSTM component Hochreiter and Schmidhuber (1997), applied just before the output layer in Figure 3, is responsible for processing activations through time, allowing the ANN to infer the transitional information from the past states. This context of previous states and actions is crucial in learning algorithms defined over a POMDP, as it provides a way to disambiguate the states of the environment. In order to achieve this, the LSTM layer shown in Figure 3 recurrently connects with the previous n LSTM layers processing the n previous agent states. Although the architecture of an LSTM is out of the scope of this work, the basic working principle behind it is the recurrent neural network propagation of a hidden state H through the layers. To appreciate the contribution of the LSTM recurrence to the overall architecture, we observe a less complex RNN architecture showcased in Figure 4: let us say we want to base the agent's decision making (in our case the approximated Q-value) not only on the current perceived state, but on the n previous ones, S_t being the current one and S_{t-n} the oldest one in our horizon. From Figure 4 it is clear that n layers would be implemented, each receiving its respective temporal input (x_n to x), but at the same time each of them generating an internal hidden state H at the output, which becomes a part of the next layer's input, thus propagating the context of the n states. By viewing the main architecture in this recurrent perspective, it is clear that Figure 3 shows only the last network out of n identical ones, each being interconnected with the others through their LSTM layers for the essential recurrency property. The outlined last layer is used to approximate the final Q-values from the outputs of the last LSTM layer, but its approximations are a product of recurrent context transfer among the previous n − 1 networks.
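For illustration, the following is a rough Keras sketch of a DRQN-style approximator of this shape (per-frame convolutions followed by an LSTM carrying the temporal context and a Q-value head); the layer sizes and the 10-step horizon are assumptions rather than the exact configuration used in the experiments.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_drqn(seq_len, height=84, width=84, channels=1, num_actions=4):
    """Sketch of a recurrent Q-network: the same convolutional stack is
    applied to every frame of a sequence, the resulting features are fed
    through an LSTM that propagates the hidden state H across the previous
    states, and a dense layer produces one Q-value per action."""
    inputs = layers.Input(shape=(seq_len, height, width, channels))
    conv = models.Sequential([
        layers.Conv2D(32, 8, strides=4, activation="relu"),
        layers.Conv2D(64, 4, strides=2, activation="relu"),
        layers.Conv2D(64, 3, strides=1, activation="relu"),
        layers.Flatten(),
    ])
    features = layers.TimeDistributed(conv)(inputs)
    context = layers.LSTM(512)(features)      # recurrent temporal context
    q_values = layers.Dense(num_actions)(context)
    return models.Model(inputs, q_values)

model = build_drqn(seq_len=10)
model.summary()
```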
5. Experimental Setup
The evaluations were performed on a variety of
Atari game environments on a Python-based platform mainly supported by the Tensorflow Abadi et al. (2016) and OpenAI Gym Brockman et al. (2016) frameworks, with all aspects of the architecture and setup based on the vanilla DRQN approach originally presented by Hausknecht and Stone (2015). The purpose of the evaluations was to compare the learning performance of the baseline DRQN approach Hausknecht and Stone (2015) with DRQN-TDDM: an implementation that extended the baseline to include the proposed active state masking based on the TDDM criterion. The DRQN and DRQN-TDDM implementations shared the same architecture and meta-parameters; their approximator weights and biases were randomly initialized. DRQN-TDDM only differed in its implementation of perceptual filtering based on a sparse
TDDM mask that was multiplied element-wise with the corresponding observations before forwarding them as input to the learning algorithm.

Figure 3: The final Q-value approximating component combining active state masking with three convolutional layers, an LSTM layer connected with the n − 1 previous ones, and an output layer approximating the Q-values.

Figure 4: The hidden state propagation of a basic recurrent neural network architecture. The inputs of the gray layers are denoted as X, the agent's states are denoted by S and the hidden states by H. The big black circle denotes the activation function applied to the layer's output while the small one represents the concatenation operator.

The agents' policies were evaluated by performing 5 independent learning trials for each of the two implementations (DRQN and DRQN-TDDM) and averaging their achieved scores. An ANN function approximator shown in Figure 3 was trained on each trial for a total of 7 million iterations with a root mean square propagation (RMSProp) optimizer, with a learning rate α = 0. . . and a decay of 0.97. The RMSProp optimizer also implemented a momentum of 0.95 and additional gradient clipping. At each iteration the ANN's training data included a mini-batch of 64 transitions uniformly sampled from a sliding-window replay memory of size 800. Exploration followed an ε-greedy approach; the starting ε = 1.0 was decayed to a final ε = 0.01. The decay process started after the first million steps and proceeded linearly afterwards. The discount factor γ, a parameter of the Bellman optimality of Equation 5, was set to a high value of 0. . . .
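As a small illustration of the exploration schedule described above (a sketch: the function name and the 6-million-step decay window are assumptions, since the text only states that the decay starts after the first million steps and proceeds linearly):

```python
def epsilon_at(step, eps_start=1.0, eps_end=0.01,
               decay_start=1_000_000, decay_steps=6_000_000):
    """Linear epsilon-greedy schedule: epsilon stays at eps_start for the
    first million steps, then decays linearly towards eps_end."""
    if step <= decay_start:
        return eps_start
    progress = min(1.0, (step - decay_start) / decay_steps)
    return eps_start + progress * (eps_end - eps_start)
```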
6. Experimental Results
The results for the Evaluation phase compared the learned Q-network parameters obtained in the Training phase under identical configurations. During this stage the network parameters obtained under the baseline and under TDDM filtering were both evaluated with the original, unfiltered game input, providing a robust benchmark of TDDM's advantage.

The evaluation benchmark performed a reproducible batch of 10 independent act-only trials for each of the ANN models obtained during the training phase. Each of the independent evaluation trials was performed for a total of 100 000 steps on the original unfiltered Atari input. To ensure the reproducibility of the evaluation results, a vector of 10 random scalars was generated a priori, specific to each environment/game. The unique scalars have been used to seed the pseudo-random number generators of all of the relevant frameworks governing the behaviour of the Atari emulator, making it deterministic relative to the scalar used.
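A minimal sketch of this kind of seeding is shown below; the seed list and the environment name are placeholders, and the snippet assumes the classic gym env.seed interface.

```python
import random
import numpy as np
import gym
import tensorflow as tf

SEEDS = [11, 23, 37, 41, 53, 61, 71, 83, 97, 101]   # placeholder scalars

def make_deterministic_env(env_name, seed):
    """Seed every framework that influences the Atari emulator so that an
    act-only evaluation trial is reproducible for a given scalar."""
    random.seed(seed)
    np.random.seed(seed)
    tf.random.set_seed(seed)
    env = gym.make(env_name)
    env.seed(seed)
    env.action_space.seed(seed)
    return env

env = make_deterministic_env("Breakout-v0", SEEDS[0])
```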
Faced with identical and reproducible conditions, the ANN models trained under TDD filtering outperformed the baseline ones in 20 of a total of 32 Atari game environments evaluated under the benchmark. The general performance measure is defined as the average return or reward that an agent received during its 10 independent batch trials; this measure is the quantity reported in the A.R. or Average Reward column of Table 1. Table 1 outlines the summary of the performed benchmark, with the best performing values for each Atari environment highlighted in bold. Each row of Table 1 represents an independent benchmark batch. Each game environment is represented by a total of two trial batches: a TDD one and a baseline one. The batches performed with the TDD models have been highlighted with a light gray background.

Along with the standard performance measure of average return, Table 1 provides additional benchmark metrics whose descriptiveness can contribute to the justification of the evident differences in performance yields between the models trained with TDDM and the baseline ones. The additional reported metrics and their potential descriptive significance are given in Table 2. For the purpose of identifying the influence which the different trained models (TDDM and Baseline) exert on the dynamics of evaluation, regardless of the presented environment/game, some of the more descriptive metrics are visualized based on their density distributions in Figure 5.

Although no TDD masking was performed during the evaluation benchmark, the binary masks BM were generated for analytical purposes using the identical TDD process defined in Figure 1. The obtained TDD masking ratios indicated in Figure 5 a) show a very strong preference of the TDDM-trained models (in orange hue) for states that would be masked to a higher degree, compared to the baseline (blue hue) in which this bias is not present. The TDDM models' bias towards states with a higher discriminatory potential, as quantified by the amount of masking, can be seen as temporal-information greediness, an indirectly generated artificial attention capability.

This temporal-information-greedy behavior can be observed even more clearly in the informational content of the LSTM states that propagate context-creating temporal information through the LSTM's sequential process, as depicted in Figure 4. Figure 5 b) shows the informational density of that temporal propagation, measured in Shannon entropy bits of the LSTM's hidden states. From Figure 5 b) it is evident that, regardless of the type of Atari game, the models trained using TDDM propagate a higher amount of temporally specific information through the model's LSTM layer, which, in turn, supports the main TDD hypothesis postulates outlined in Section 4.

While plots a) and b) of Figure 5 are mostly descriptive of the differences in information processing dynamics, the second row, consisting of plots c) and d), is concerned with the changes in the exploration/exploitation dispositions of the agents using models trained with the TDDM approach contrary to the baseline ones. From Figure 5 c) it is evident that the TDDM models have produced agent policies that in general allow for longer Atari game episodes, which in most of the game variations accounts for a higher exploration rate of the game's state space. On the other hand, the plot in Figure 5 d) indicates more consistent reward returns for the TDDM models, as quantified by their variances. While plot c) seems to suggest a more efficient exploration of the game's state space, plot d) also accounts for the TDDM models' ability to exploit their certainty of Q-value predictions in such a way as to be able to produce a more consistent return than their baseline counterparts.
[Table 1 rows: per-environment values of A.R., N.P.E., H.S.A.E., S.A.E., H.S.S., M.A., S.S., ST.D.R, ST.D.A and ST.D.M for the TDDM and baseline batches; the tabulated numbers could not be recovered from this version of the text.]
Table 1: Results of the Evaluation Benchmark performed under identical reproducible setups. Best performing batches are outlined in bold for each of the game environments and results obtained with models trained under TDDM are highlighted with a light-gray background; the two rows at the bottom represent a summary of the best performing values for each of the columns, obtained with TDDM and with the baseline respectively.

Abbreviation (Full Name): Description
Environment (Environment): Specific Atari game used in the benchmark batch.
A.R. (Average Return): The immediate rewards that the agents received, averaged over all of the 10 trials that formed a single benchmark batch.
N.P.E. (Number of Played Episodes): Average total number of played episodes in a single trial.
H.S.A.E. (Hidden States Activation Entropy): Average Shannon entropy in bits of the model's hidden states H_n, indicative of the amount of information being effectively propagated through their activations in the LSTM part of the main ANN model detailed in Section 4.
S.A.E. (States Activation Entropy): Average Shannon entropy in bits of the model's input states X_n, indicative of the amount of information being effectively propagated through their activations in the LSTM part of the main ANN model detailed in Section 4.
H.S.S. (Hidden States Sparsity): Average percentage of non-zero hidden state activations. A higher percentage indicates more activity in RNN hidden state propagation.
M.A. (Masking Amount): Percentage of the input state's pixels masked or blanked with TDDM.
S.S. (States Sparsity): Percentage of non-zero input state activations X_n. A higher percentage indicates more activity in RNN input state propagation.
ST.D.R (Standard Deviation of Returns): Depending on a specific environment's reinforcement function, the variance of the returns can be indicative of an agent's preference of exploration over exploitation.
ST.D.A (Standard Deviation of Selected Actions): Depending on a specific environment configuration, the variance of the actions taken can be indicative of an agent's preference of exploration over exploitation.
ST.D.M (Standard Deviation of Masking Percentages): The variance in masking amounts of single frames can indicate the level of adaptability of the motion detection technique shown in Figure 1 to a specific environment.

Table 2: Detailed description of the type of data represented in the Table 1 columns.
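As an illustration of how an activation-entropy metric of this kind can be computed (a sketch; the binning scheme is an assumption, since continuous activations must be discretized before Shannon entropy in bits can be measured):

```python
import numpy as np

def activation_entropy_bits(activations, bins=64):
    """Shannon entropy (in bits) of a vector of activations, estimated by
    histogramming the values into a fixed number of bins."""
    counts, _ = np.histogram(activations, bins=bins)
    p = counts / counts.sum()
    p = p[p > 0]                      # ignore empty bins
    return float(-(p * np.log2(p)).sum())

# e.g. entropy of an LSTM hidden state vector H_n
h_n = np.random.randn(512)
print(activation_entropy_bits(h_n))
```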
7. Concluding Remarks
The abundance of inherent information-generating uncertainty in our world pushed human evolution into a momentum of producing information-processing mechanisms of increasing complexity that would, in turn, be able to reduce this uncertainty at a variety of levels or abstractions, including crafting our immediate environment by creating patterns of predictability; be it in the form of ubiquitous technical systems (which include the artificial learning ones presented in this work) or the more abstract social structures.
Interaction and causal relationships with the perceived environment are emerging as the focus of the perception information-gathering process, rather than expanding the perception domain itself. For example, we would not benefit from a hypothetical super-perception that would allow us to perceive the incomprehensible amount of information contained in the spin direction of the electrons in each of the atoms comprising a typical physical object; if a state of spin can be either spin-up or spin-down with equal probability, that would give us an entropy of 1 bit per electron. This information, though, will not be helpful in determining the interaction that the physical object has with its surroundings and with ourselves as a part of them; its movement or position, among other attributes, could. Interacting with a more predictable environment reduces the overall information that needs to be processed, but at the same time generates more of the information that explains the causal relationships within. This causal subset of the perceived information can be seen as an information gain of interaction (or of the transition from s to s′ in our case) that is irreducible to its composing parts (s and s′) according to the integrated information principle, or Φ, proposed by Tononi et al. Tononi et al. (2016).

Figure 5: Visualization of distribution densities for four characterizing variables (ordinates) selected from Table 1 across the average return range (abscissa). The plotted areas represent counts of ordinate-variable observations falling within each discrete bin, with higher frequencies corresponding to a higher saturation value. Area hues are indicative of the trained model used in the evaluation trial: benchmark results obtained using models trained under the TDDM approach have their frequencies or counts represented with orange, while the benchmark results obtained using baseline models are indicated in blue.

Thus, the intention of the work presented here is to illustrate the significance of the qualitative representation of information in the general context of machine learning problems. Depending on the actual machine learning approach, representing the environment-related information in different categorical contexts allows the algorithm itself to leverage the same information in a way that better supports the efficacy of its conversion into higher order representations, such as Q-values, that eventually lead to better agent policies. The main exploration of the presented work is the effect of selective focus on the subset of the state information that is more likely to induce this specific type of information, as we move from the determinism of Atari's pixels or the contents of its allocated memory Brockman et al. (2016) to problems closer to real world dynamics such as Starcraft II Vinyals et al. (2019a).

Appendix A.

In this appendix we present the variation of a total of three variables characterizing the actual learning during the agents' training phase, namely, the total cumulative return, the average Q-value obtained and the number of played episodes:
Environment Mask Mean Maximum Minimum Median Mode
Alien-v0 0 0.7126 1.352 0.232 0.714 0.714
Alien-v0 1 0.801 1.536 0.258 0.814 0.75
Asterix-v0 0 1.182
Asterix-v0 1 1.654
Asteroids-v0 0 0.781
Asteroids-v0 1 0.9593
Atlantis-v0 1 10.7 65.78 5.6 10.58 11.42
BattleZone-v0 0 0.3694 12.4
BeamRider-v0 0 0.4709
Berzerk-v0 0 0.8657
Berzerk-v0 1 0.8321
Bowling-v0 0 0.01267
Bowling-v0 1 0.009806
Boxing-v0 1 0.01691 0.0458 -0.0058 0.0176 0.0134
Breakout-v0 0 0.01451 0.0356 0.0022 0.0152 0.0158
Breakout-v0 1 0.02939 0.0604 0.0032 0.0312 0.032
ChopperCommand-v0 0 1.322
ChopperCommand-v0 1 1.796
CrazyClimber-v0 0 1.789 13.18
CrazyClimber-v0 1 3.055 13.7
DemonAttack-v0 1 0.1906 0.827
Enduro-v0 0 0.02432 0.0672 -0.0024 0.0246 0
Enduro-v0 1 0.02033 0.065 -0.0032 0.02 -0.241 -0.0246 -0.0176
FishingDerby-v0 1 -0.03838 -0.0184 -0.2332 -0.0376 -0.0344Freeway-v0 0 0 0 Frostbite-v0 1 0.2978 0.844 -0.0036 -0.004
IceHockey-v0 1 -0.002848 0.0004 -0.0172 -0.0028 -0.0028Jamesbond-v0 0 0.1498 0.31 Jamesbond-v0 1 0.1491 0.29 Kangaroo-v0 1 0.7284 1.48 0 0.76 0.72Krull-v0 0 1.174
Krull-v0 1 1.122
KungFuMaster-v0 1 1.911 4.88 0.36 1.88 1.92Pitfall-v0 0 -0.02355 0 -0.4292 0 0
Pitfall-v0 1 -0.04863 -0.4698 -0.0366 Pong-v0 1 -0.01017 -0.0008 -0.1034 -0.0094 -0.0056Qbert-v0 0 0.7171 2.15 0.2 0.71 0.68
Qbert-v0 1 0.8616 2.185 0.255 0.865 0.845
Riverraid-v0 0 3.017 10.46
Riverraid-v0 1 3.033 10.79
Seaquest-v0 0 0.2961 0.768 0.104 0.3 0.3
Seaquest-v0 1 0.3279 0.872 0.12 0.332 0.344SpaceInvaders-v0 0 0.3921
SpaceInvaders-v0 1 0.3533
StarGunner-v0 1 0.5605 3.06
Table 3: Total cumulative return received during the training phase for each of the combi-nations of environment/masking. Best values are highlighted in bold.22 nvironment Mask Mean Maximum Minimum Median ModAlien-v0 0 232.6 363.8
Environment  Mask  Mean  Maximum  Minimum  Median  Mode
Alien-v0  0  232.6  363.8  –  –  –
Alien-v0  1  419.8  577  –  –  –
Asterix-v0  0  210.6  336.6  –  –  –
Atlantis-v0  1  1906  2639  0.08319  2342  2476
BattleZone-v0  0  0.6521  1.45  –  –  –
Berzerk-v0  0  206.8  314.1  –  –  –
Berzerk-v0  1  168.6  233.6  –  –  –
Breakout-v0  1  27.91  38.4  –  –  –
ChopperCommand-v0  0  185.3  503.7  0.02374  66.08  0.02374
ChopperCommand-v0  1  333.4  559.6  0.02393  428.5  0.02393
CrazyClimber-v0  0  124  195.7  0.1077  109.8  0.1077
CrazyClimber-v0  1  506.3  716  0.1121  636.8  0.1121
DemonAttack-v0  0  136.7  261.4  0.06215  198.9  0.06215
DemonAttack-v0  1  111  195.7  0.06059  154.6  0.06059
Enduro-v0  0  73.34  108.6  0.002759  96.77  0.002759
Enduro-v0  1  88.7  149.7  0.003457  107.8  146.9
FishingDerby-v0  0  12.47  22.56  -5.809  18.55  -0.1605
FishingDerby-v0  1  20.02  26.59  -0.8865  22.89  -0.148
Freeway-v0  0  0.0616  0.07526  0.005634  0.06213  0.005634
Freeway-v0  1  4.473  5.675  0.009225  5.266  5.293
Frostbite-v0  0  230.9  372.4  0.0794  216.8  –
IceHockey-v0  0  0.5194  1.625  -0.4062  0.5266  1.218
IceHockey-v0  1  6.73  8.181  -0.0109  7.421  7.618
Jamesbond-v0  0  116.9  229.5  –  –  –
Kangaroo-v0  1  515.7  835.8  –  –  –
Krull-v0  1  953.6  1332  -0.01823  –  –
Pong-v0  0  5.076  9.564  -4.207  –  –
Pong-v0  1  4.766  7.895  -1.264  –  –
Riverraid-v0  0  1115  1673  0.1119  1249  –
Seaquest-v0  0  163.2  316.3  –  –  –
Seaquest-v0  1  255.5  386.2  –  –  –
SpaceInvaders-v0  0  119.2  174.9  –  –  –
SpaceInvaders-v0  1  220.2  347.5  –  –  –
StarGunner-v0  0  2.04  3.595  0.0143  1.988  0.0143
StarGunner-v0  1  5.32  30.72  0.01667  2.126  0.01667
TimePilot-v0  0  52.56  272  0.03401  –  –

Table 4: Average Q-value reached during the training phase for each of the combinations of environment/masking. Best values are highlighted in bold.
Environment  Mask  Mean  Maximum  Minimum  Median  Mode
Alien-v0  0  23.56  114  –  23  23
Alien-v0  1  24.05  116  16  24  24
Asterix-v0  0  48  –  33  48  49
Asterix-v0  1  45.2  –  28  44  42
Asteroids-v0  0  36.28  –  10  37  38
Asteroids-v0  1  23.68  144  18  31  30
Atlantis-v0  1  35.33  139  16  35  32
BattleZone-v0  0  4347  5000  0  5000  5000
BattleZone-v0  1  4182  –  –  –  –
BeamRider-v0  0  10.47  53  –  –  –
Berzerk-v0  0  60.6  298  35  61  61
Berzerk-v0  1  64.69  350  45  65  65
Bowling-v0  0  2.183  11  1  2  2
Bowling-v0  1  2.236  11  1  2  2
Boxing-v0  0  2.877  14  2  3  3
Boxing-v0  1  3.41  14  2  3  3
Breakout-v0  0  68.69  676  38  64  59
Breakout-v0  1  34.99  662  17  27  27
ChopperCommand-v0  0  25  –  –  –  –
ChopperCommand-v0  1  32.29  –  –  –  –
CrazyClimber-v0  0  7.255  44  –  –  –
CrazyClimber-v0  1  8.35  45  2  8  8
DemonAttack-v0  0  22.41  121  2  22  24
DemonAttack-v0  1  21.89  115  –  –  –
Enduro-v0  0  1.482  –  –  –  –
Enduro-v0  1  1.503  7  0  2  2
FishingDerby-v0  0  2.654  13  2  3  3
FishingDerby-v0  1  2.681  –  –  –  –
Frostbite-v0  0  43.81  254  33  43  42
Frostbite-v0  1  44.87  266  35  44  44
IceHockey-v0  0  1.448  7  1  1  1
IceHockey-v0  1  1.418  –  –  –  –
Jamesbond-v0  0  40.69  –  21  38  31
Jamesbond-v0  1  44.6  –  23  44  42
Kangaroo-v0  0  28.49  161  20  27  25
Kangaroo-v0  1  25.59  150  19  25  24
Krull-v0  0  10.37  257  –  –  –
KungFuMaster-v0  0  27.24  –  15  28  28
KungFuMaster-v0  1  18.71  –  11  19  19
Pitfall-v0  0  3478  5000  0  5000  5000
Pitfall-v0  1  1827  –  –  –  –
Pong-v0  0  2.919  26  1  –  –
Pong-v0  1  3.134  26  1  3  3
Qbert-v0  0  50.08  –  35  49  49
Qbert-v0  1  47.74  –  31  47  47
Riverraid-v0  0  34.71  137  21  33  32
Riverraid-v0  1  38.37  138  24  37  36
Seaquest-v0  0  27.13  193  17  27  –
Seaquest-v0  1  27.75  183  16  27  27
SpaceInvaders-v0  0  26.23  142  16  26  26
SpaceInvaders-v0  1  29.88  147  17  30  31
StarGunner-v0  0  24.98  115  –  25  24
StarGunner-v0  1  26.26  119  –  26  26
TimePilot-v0  0  19.24  73  –  –  –

Table 5: Average number of episodes played during the training phase for each of the combinations of environment/masking. Best values are highlighted in bold. Higher values are indicative of an exploratory strategy.

References
Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. Tensorflow: A system for large-scale machine learning. In USENIX Symposium on Operating Systems Design and Implementation (OSDI), pages 265–283, 2016.

Philip E Agre. The dynamic structure of everyday life. Technical report, Massachusetts Inst Of Tech Cambridge Artificial Intelligence Lab, 1988.

Philip E Agre and David Chapman. Pengi: An implementation of a theory of activity. In AAAI, volume 87, pages 268–272, 1987.

Mauricio Araya, Olivier Buffet, Vincent Thomas, and François Charpillet. A POMDP extension with belief-dependent rewards. In Advances in Neural Information Processing Systems, pages 64–72, 2010.

Karl Johan Åström. Optimal control of Markov processes with incomplete state information. Journal of Mathematical Analysis and Applications, 10(1):174–205, 1965.

Alan Baddeley. Working memory. Science, 255(5044):556–559, 1992.

Dana H Ballard, Mary M Hayhoe, Polly K Pook, and Rajesh PN Rao. Deictic codes for the embodiment of cognition. Behavioral and Brain Sciences, 20(4):723–742, 1997.

H Clark Barrett. Modularity and design reincarnation. The Innate Mind: Culture and Cognition, ed. P. Carruthers, S. Laurence & S. Stich, pages 199–217, 2006.

Richard Bellman. Dynamic programming. Science, 153(3731):34–37, 1966.

Romain P Boisseau, David Vogel, and Audrey Dussutour. Habituation in non-neural organisms: evidence from slime moulds. Proceedings of the Royal Society B: Biological Sciences, 283(1829):20160446, 2016.

Craig Boutilier. A POMDP formulation of preference elicitation problems. In AAAI/IAAI, pages 239–246, 2002.

Donald E Broadbent. Perception and Communication, 1958.

Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. OpenAI Gym, 2016.

Jerome S Bruner. The cognitive consequences of early sensory deprivation. Psychosomatic Medicine, 21(2):89–95, 1959.

Anthony R Cassandra, Leslie Pack Kaelbling, and Michael L Littman. Acting optimally in partially observable stochastic domains. In AAAI, volume 94, pages 1023–1028, 1994.

David Chapman. Intermediate vision: Architecture, implementation, and use. Cognitive Science, 16(4):491–537, 1992.

Lonnie Chrisman. Reinforcement learning with perceptual aliasing: The perceptual distinctions approach. In AAAI, volume 1992, pages 183–188. Citeseer, 1992.

Lonnie Chrisman, Rich Caruana, and Wayne Carriker. Intelligent agent design issues: Internal agent state and incomplete perception. In Proceedings of the AAAI Fall Symposium on Sensory Aspects of Robotic Intelligence. AAAI Press/MIT Press. Citeseer, 1991.

Mihaly Csikszentmihalyi. Imagining the self: An evolutionary excursion. Poetics, 21(3):153–167, 1992.

Miguel Suau de Castro, Elena Congeduti, Rolf Starre, Aleksander Czechowski, and Frans Oliehoek. Influence-aware memory for deep reinforcement learning. arXiv preprint arXiv:1911.07643, 2019.

Paul R Ehrlich. Human Natures: Genes, Cultures, and the Human Prospect. Island Press, 2000.

Gunnar Farnebäck. Two-frame motion estimation based on polynomial expansion. In Scandinavian Conference on Image Analysis, pages 363–370. Springer, 2003.

Karl Friston. The free-energy principle: a unified brain theory? Nature Reviews Neuroscience, 11(2):127–138, 2010.

James J Gibson and Eleanor J Gibson. Perceptual learning: Differentiation or enrichment? Psychological Review, 62(1):32, 1955.

David E Goldberg. Genetic algorithms in search, optimisation and machine learning. Addison-Wesley, Reading, 1989.

Matthew Hausknecht and Peter Stone. Deep recurrent Q-learning for partially observable MDPs, 2015.

Mary Hayhoe, David Bensinger, and Dana Ballard. Task constraints in visual working memory, 1997.

Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.

Michael I Jordan and David E Rumelhart. Forward models: Supervised learning with a distal teacher. Cognitive Science, 16(3):307–354, 1992.

Eric S Lander, Lauren M Linton, Bruce Birren, Chad Nusbaum, Michael C Zody, Jennifer Baldwin, Keri Devon, Ken Dewar, Michael Doyle, William FitzHugh, et al. Initial sequencing and analysis of the human genome, 2001.

Tuyen P Le, Ngo Anh Vien, and TaeChoong Chung. A deep hierarchical reinforcement learning algorithm in partially observable Markov decision processes. IEEE Access, 6:49089–49102, 2018.

Wee S Lee, Nan Rong, and David Hsu. What makes some POMDP problems easy to approximate? In Advances in Neural Information Processing Systems, pages 689–696, 2008.

Long-Ji Lin. Programming robots using reinforcement learning and teaching. In Proceedings of the Ninth National Conference on Artificial Intelligence - Volume 2, pages 781–786, 1991.

William S Lovejoy. A survey of algorithmic methods for partially observed Markov decision processes. Annals of Operations Research, 28(1):47–65, 1991.

Gary Fred Marcus. The Birth of the Mind: How a Tiny Number of Genes Creates the Complexities of Human Thought. Basic Civitas Books, 2004.

David Marr. Early processing of visual information. Philosophical Transactions of the Royal Society of London. B, Biological Sciences, 275(942):483–519, 1976.

David Marr. Vision: A computational investigation into the human representation and processing of visual information. Henry Holt and Co., Inc., New York, NY, 1982.

R Andrew McCallum. Overcoming incomplete perception with utile distinction memory. In Proceedings of the Tenth International Conference on Machine Learning, pages 190–196, 1993.

Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.

Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.

George E Monahan. State of the art - a survey of partially observable Markov decision processes: theory, models, and algorithms. Management Science, 28(1):1–16, 1982.

Dan-E Nilsson. Eye ancestry: old genes for new eyes. Current Biology, 6(1):39–42, 1996.

Sylvie CW Ong, Shao Wei Png, David Hsu, and Wee Sun Lee. POMDPs for robotic tasks with mixed observability. In Robotics: Science and Systems, volume 5, page 4, 2009.

Susan Oyama. The Ontogeny of Information: Developmental Systems and Evolution. Duke University Press, 2000.

Lars Penke, Jaap JA Denissen, and Geoffrey F Miller. The evolutionary genetics of personality. European Journal of Personality: Published for the European Association of Personality Psychology, 21(5):549–587, 2007.

Lawrence Rabiner and B Juang. An introduction to hidden Markov models. IEEE ASSP Magazine, 3(1):4–16, 1986.

Mirza Ramicic and Andrea Bonarini. Entropy-based prioritized sampling in deep Q-learning, pages 1068–1072. IEEE, 2017.

Catharine H Rankin. Invertebrate learning: what can't a worm learn? Current Biology, 14(15):R617–R618, 2004.

Stéphane Ross, Joelle Pineau, Sébastien Paquet, and Brahim Chaib-Draa. Online planning algorithms for POMDPs. Journal of Artificial Intelligence Research, 32:663–704, 2008.

Alice FS Salway and Robert H Logie. Visuospatial working memory, movement control and executive demands. British Journal of Psychology, 86(2):253–269, 1995.

Jeffrey R Sampson. Adaptation in natural and artificial systems (John H. Holland), 1976.

Julian Schrittwieser, Ioannis Antonoglou, Thomas Hubert, Karen Simonyan, Laurent Sifre, Simon Schmitt, Arthur Guez, Edward Lockhart, Demis Hassabis, Thore Graepel, et al. Mastering Atari, Go, chess and shogi by planning with a learned model. Nature, 588(7839):604–609, 2020.

David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, et al. Mastering the game of Go without human knowledge. Nature, 550(7676):354–359, 2017.

David Silver, Thomas Hubert, Julian Schrittwieser, Ioannis Antonoglou, Matthew Lai, Arthur Guez, Marc Lanctot, Laurent Sifre, Dharshan Kumaran, Thore Graepel, et al. A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play. Science, 362(6419):1140–1144, 2018.

Matthijs TJ Spaan. Cooperative active perception using POMDPs. In AAAI 2008 Workshop on Advancements in POMDP Solvers, 2008.

Matthijs TJ Spaan, Tiago S Veiga, and Pedro U Lima. Decision-theoretic planning under uncertainty with information rewards for active cooperative perception. Autonomous Agents and Multi-Agent Systems, 29(6):1157–1185, 2015.

Richard S Sutton. Dyna, an integrated architecture for learning, planning, and reacting. ACM SIGART Bulletin, 2(4):160–163, 1991.

Richard S Sutton and Andrew G Barto. Reinforcement Learning: An Introduction. MIT Press, 2018.

Yujin Tang, Duong Nguyen, and David Ha. Neuroevolution of self-interpretable agents. In Proceedings of the 2020 Genetic and Evolutionary Computation Conference, pages 414–424, 2020.

Giulio Tononi, Melanie Boly, Marcello Massimini, and Christof Koch. Integrated information theory: from consciousness to its physical substrate. Nature Reviews Neuroscience, 17(7):450–461, 2016.

John N Tsitsiklis. Asynchronous stochastic approximation and Q-learning. Machine Learning, 16(3):185–202, 1994.

Shimon Ullman. Visual routines. In Readings in Computer Vision, pages 298–328. Elsevier, 1987.

Oriol Vinyals, Igor Babuschkin, Junyoung Chung, Michael Mathieu, Max Jaderberg, Wojciech M Czarnecki, Andrew Dudzik, Aja Huang, Petko Georgiev, Richard Powell, et al. AlphaStar: Mastering the real-time strategy game StarCraft II. DeepMind Blog, page 2, 2019a.

Oriol Vinyals, Igor Babuschkin, Wojciech M Czarnecki, Michaël Mathieu, Andrew Dudzik, Junyoung Chung, David H Choi, Richard Powell, Timo Ewalds, Petko Georgiev, et al. Grandmaster level in StarCraft II using multi-agent reinforcement learning. Nature, 575(7782):350–354, 2019b.

Christopher JCH Watkins and Peter Dayan. Q-learning. Machine Learning, 8(3-4):279–292, 1992.

Danny Weyns, Elke Steegmans, and Tom Holvoet. Towards active perception in situated multi-agent systems. Applied Artificial Intelligence, 18(9-10):867–883, 2004.

Steven D Whitehead and Dana H Ballard. Learning to perceive and act by trial and error. Machine Learning, 7(1):45–83, 1991.

Daan Wierstra and Marco Wiering. Utile distinction hidden Markov models. In Proceedings of the Twenty-First International Conference on Machine Learning, page 108, 2004.

Pengfei Zhu, Xin Li, Pascal Poupart, and Guanghui Miao. On improving deep reinforcement learning for POMDPs. arXiv preprint arXiv:1704.07978, 2017.