Projective simulation with generalization
Alexey A. Melnikov, Adi Makmal, Vedran Dunjko, Hans J. Briegel
Institute for Theoretical Physics, University of Innsbruck, Technikerstraße 21a, 6020 Innsbruck, Austria
Institute for Quantum Optics and Quantum Information, Austrian Academy of Sciences, Technikerstraße 21a, 6020 Innsbruck, Austria
Department of Philosophy, University of Konstanz, Fach 17, 78457 Konstanz, Germany
∗ Correspondence to: [email protected]
The ability to generalize is an important feature of any intelligent agent. Not only because it may allow the agent to cope with large amounts of data, but also because in some environments, an agent with no generalization capabilities cannot learn. In this work we outline several criteria for generalization, and present a dynamic and autonomous machinery that enables projective simulation agents to meaningfully generalize. Projective simulation, a novel, physical approach to artificial intelligence, was recently shown to perform well in standard reinforcement learning problems, with applications in advanced robotics as well as quantum experiments. Both the basic projective simulation model and the presented generalization machinery are based on very simple principles. This allows us to provide a full analytical analysis of the agent's performance and to illustrate the benefit the agent gains by generalizing. Specifically, we show that already in basic (but extreme) environments, learning without generalization may be impossible, and demonstrate how the presented generalization machinery enables the projective simulation agent to learn.
INTRODUCTION
The ability to act upon a new stimulus, based on previous experience with similar, but distinct, stimuli, sometimes denoted as generalization, is used extensively in our daily life. As a simple example, consider a driver's response to traffic lights: the driver need not recognize the details of a particular traffic light in order to respond to it correctly, even though traffic lights may appear different from one another. The only property that matters is the color, whereas neither shape nor size should play any role in the driver's reaction. Learning how to react to traffic lights thus involves an aspect of generalization.

A learning agent capable of a meaningful and useful generalization is expected to have the following characteristics: (a) an ability for categorization (recognizing that all red signals have a common property, which we can refer to as redness); (b) an ability to classify (a new red object is to be related to the group of objects with the redness property); (c) ideally, only generalizations that are relevant for the success of the agent should be learned (red signals should be treated the same, whereas square-shaped signals should not, as they share no property that is of relevance in this context); (d) correct actions should be associated with relevant generalized properties (the driver should stop whenever a red signal is shown); and (e) the generalization mechanism should be flexible.

To illustrate what we mean by "flexible generalization", let us go back to our driver. After learning how to handle traffic lights correctly, the driver tries to follow arrow signs to, say, a nearby airport. Clearly, it is now the shape category of the signal that should guide the driver, rather than the color category. The situation would be even more confusing if the traffic signalization would suddenly be based on the shape category alone: square lights mean "stop" whereas circle lights mean "drive". To adapt to such environmental changes the driver has to give up the old color-based generalization and build up a new, shape-based, generalization. Generalizations must therefore be flexible.

Refs. [1, 2] provide a broad account of generalization in artificial intelligence. In reinforcement learning (RL), where an agent learns via interaction with a rewarding environment [3–5], generalization is often used as a technique to reduce the size of the percept space, which is potentially very large. Two useful recent summaries can be found in Refs. [6, 7]. For example, in the Q-learning [8] and SARSA [9] algorithms, it is common to use function approximation methods [3–5, 10], realized by e.g. tile coding (CMAC) [11, 12], neural networks [13–15], decision trees [16, 17], constructive function approximation [18], or support vector machines [4, 19, 20], to implement a generalization mechanism. Alternatively, in learning classifier systems (LCS), generalization is facilitated by using the wildcard symbol [21–23]. […]

The projective simulation (PS) model is based on a specific type of memory, denoted as episodic & compositional memory (ECM), which is structured as a directed, weighted network of clips, where each clip represents a remembered percept, action, or sequences thereof.
Once a percept is observed, the network is activated, invoking a random walk between the clips, until an action clip is hit and couples out as a real action of the agent.

The generalization process within PS is achieved in this work by, roughly speaking, the generation of abstracted clips, which represent commonalities between percepts, or more precisely, subsets of the percept space. These, in turn, influence behavior in new situations based on similar previous experiences. The method we introduce is a step toward more advanced approaches to generalization within PS and is suitable for medium-scale task environments. In more complicated environments, the large number of possible abstractions may harm the performance of the agent. We address the question of how this could be combated in the discussion section.

The PS approach to artificial intelligence arguably stands out as promising from different perspectives. First, random walks, which constitute the basic internal dynamics of the model, have been well studied in the context of randomized algorithm theory [34] and probability theory, thus providing an extensive theoretical toolbox for analyzing related models. Second, the PS model, by design, represents a stochastic physical system, which points to possible physical (rather than computational) realizations; this relates PS to the framework of embodied artificial agents [35]. Last, the physics aspects of the model offer a route toward research into quantum-enhanced variants of PS: the underlying random walk was already shown to naturally extend to a quantum many-body master equation [33]. Related to this, the fact that the deliberation of the agent centers around a random walk process opens up a route for advances by using quantum random walks instead. In quantum random walks, roughly speaking, the probabilistic repositioning of the walker is replaced by a quantum superposition of moves, by switching from a stochastic to a coherent quantum dynamics. This allows one to exploit quintessential quantum phenomena, including quantum interference and quantum parallelism. Quantum walks have been increasingly employed in recent times as a framework for the development of new quantum algorithms. Over the course of the last decade, polynomial and exponential improvements in computational complexity have been reported, over classical counterparts [36–38]. Utilizing this methodology in the context of reinforcement learning, it was recently shown, by some of the authors and collaborators, that a quantum variant of the PS agent exhibits a quadratic speed-up in deliberation time over its classical analogue, which leads to a similar speed-up of learning time in active learning scenarios [39–42]. This quantum advantage in the decision-making process of the quantum PS agent was recently experimentally demonstrated using a small-scale quantum information processor based on trapped ions [43].

In the PS model, learning is realized by internal modification of the clip network, both in terms of its structure and the weights of its edges. Through interactions with a rewarding environment, the clip network adjusts itself dynamically, so as to increase the probability of performing better in subsequent time steps (see below for a more detailed description of the model). Learning is thus based on a "trial and error" approach, making the PS model especially suitable for solving RL tasks.
Indeed, recent studies showed that the PS agent can perform very well in comparison to standard models, in both basic RL problems [44] and in standard benchmark tasks, such as the "grid world" and the continuous-domain "mountain car" problem [45]. Due to the flexibility of the PS framework it can also be used in contexts beyond textbook RL. Recent applications are, for instance, in the problem of learning complex haptic manipulation skills [46] and in the problem of learning to design complex quantum experiments [47].

Here we present a simple dynamical mechanism which allows the PS network to evolve, through experience, to a network that represents and exploits similarities in the perceived percepts, i.e. to a network that can generalize. Using such a network we address the problem of RL in task environments which require some aspects of generalization (function approximation) to be solved. Standard approaches to such task environments rely on external machinery, a function approximator, which then has to be combined with otherwise "raw", tabular RL machinery, e.g. temporal-difference learning model-free methods such as Q-learning and SARSA. In contrast, to achieve the same goal, here we use the more elaborate structure of the PS model, which is represented not by a table but by a directed graph, rather than external machinery, as is done e.g. in the case of Q-learning [10, 15]. Naturally, our proposed machinery also internally realizes a function approximator, but its structure and updates arise from the very basic learning rules of the PS model. The generalization mechanism, which is inspired by the wildcard notion of LCS [21–23], is based on a process of abstraction which is systematic, autonomous, and, most importantly, requires no explicit prior knowledge of the agent. This is in contrast with common RL models with a function approximator, which often require additional prior knowledge in terms of an additional input [7]. Moreover, we show that once the PS agent is provided with this machinery, which allows it to both categorize and classify, the rest of the expected characteristics we listed above follow directly from the basic learning rules of the PS agent. In particular, we show that relevant generalizations are learned, that the agent associates correct actions to generalized properties, and that the entire generalization scheme is flexible, as required.

PS with generalization, in comparison with other solutions, does not rely on external machinery. Instead, it is a homogeneous approach where the generalization mechanism is based on the basic PS principles, in this case specifically the dynamic growth of the clip network. One can see several advantages to our approach. First, it allows for a relatively straightforward theoretical treatment of the performance, including analytic formulas characterizing the performance of generalization and learning. Second, we do not rely on powerful classifying machinery which can significantly increase the model complexity of the agent (in particular if neural networks are utilized), which may be undesirable. Finally, and for our agenda very relevant, sticking to just the basic random walk mechanism offers a natural and systematic route to quantization of the overall dynamics. As we have mentioned earlier, this can lead to improvements in computational complexity and, in principle, also in space complexity of the model. In contrast, for heterogeneous approaches, e.g.
Q-learning combined with a neural network, there exist no clear routes for useful quantization, and no firm results proving improvements have been established for the quantization of either, let alone for a combination.

While in most RL literature elements of generalization are considered as means of tackling the "curse of dimensionality" [5], as coined by Bellman [48] and discussed above, they are also strictly necessary for an agent to learn in certain environments [3]. Here we consider a type of environment where, irrespective of its available resources, an agent with no generalization ability cannot learn, i.e. it performs no better than a fully random agent. Following this, we show that the PS model, when enhanced with the generalization mechanism, is capable of learning in such an environment. Alongside numerical illustrations we provide a detailed analytical description of the agent's performance, with respect to its success and learning rates (defined below). Such an analysis is feasible due to the simplicity of the PS model, both in terms of the number of its free parameters and its underlying equations (see also Ref. [44]), a property we extensively exploit. The main contribution of this paper is thus to demonstrate how the inherent features of the PS model can be used to solve RL problems that require nontrivial notions of generalization, importantly without relying on external classifier machinery. While it is also possible to sacrifice homogeneity and combine PS with external machinery (in which case a direct comparison between the PS and other RL models, both enhanced by external machinery, would be warranted), in this work we strive to develop the theory of the PS model on its own.

RESULTS
The remainder of this paper is structured as follows. We first begin, for completeness, with a description of the PS model and a formal comparison of the PS model to standard RL techniques. We then present the proposed generalization mechanism, examine its performance in a simple case and illustrate how it gives rise to a meaningful generalization, as defined above. Next, we study the central scenario of this paper, in which generalization is an absolute condition for learning. After describing the scenario and showing that the PS agent can cope with it, we analyze its performance analytically. Finally, we study this scenario for an arbitrary number of categories, and observe that the more there is to categorize, the more beneficial the proposed mechanism is.
The PS model
In what follows we shortly summarize the basic principles of the PS model; for more detailed descriptions we refer the reader to Refs. [33, 44, 45, 49].

The central component of the PS agent is the so-called clip network, which can, abstractly, be represented as a directed graph, where each node is a clip, and directed edges represent allowed transitions, as depicted in Fig. 1. Whenever the PS agent perceives an input, the corresponding percept clip is excited (e.g. Clip 1 in Fig. 1). This excitation marks the beginning of a random walk between the clips until an action clip is hit (e.g. Clip 6 in Fig. 1), and the corresponding action is performed. The random walk is carried out according to time-dependent probabilities $p_{ij}$ to hop from one node to another.

Formally, percept clips are defined as $K$-tuples $s = (s_1, s_2, \ldots, s_K) \in \mathcal{S} \equiv S_1 \times S_2 \times \cdots \times S_K$, $s_i \in \{1, \ldots, |S_i|\}$, where $|\mathcal{S}| = |S_1| \cdots |S_K|$ is the number of possible percepts. Each dimension may account for a different type of perceptual input such as audio, visual, or sensational, where the exact specification (number of dimensions $K$ and the perceptual type of each dimension) and resolution (the size $|S_i|$ of each dimension) depend on the physical realization of the agent. In what follows, we regard each of the $K$ dimensions as a different category. Action clips are similarly given as $M$-tuples: $a = (a_1, a_2, \ldots, a_M) \in \mathcal{A} \equiv A_1 \times A_2 \times \cdots \times A_M$, $a_i \in \{1, \ldots, |A_i|\}$, where $|\mathcal{A}| = |A_1| \cdots |A_M|$ is the number of possible actions. Once again, each of the $M$ dimensions provides a different aspect of an action, e.g. walking, jumping, picking-up, etc. Here, however, we restrict our analysis to the case of $M = 1$ and varying $|A_1|$.

Figure 1. The PS clip network. Percept clips (input) are connected, possibly via intermediate clips, to action clips (output); directed edges carry the hopping probabilities $p_{ij}$.
Each directed edge from clip $c_i$ to clip $c_j$ has a time-dependent weight $h^{(t)}(c_i, c_j)$, which we call the $h$-value. The $h$-values define the conditional probabilities of hopping from clip $c_i$ to clip $c_j$ according to

$$p^{(t)}(c_j|c_i) = \frac{h^{(t)}(c_i, c_j)}{\sum_k h^{(t)}(c_i, c_k)}. \qquad (1)$$

At the beginning, all $h$-values are initialized to the same fixed value $h_0 = 1$. This ensures that, initially, the probability to hop from any clip to any of its neighbors is completely uniform. The conditional probabilities defined by Eq. (1) will be used throughout the paper unless stated otherwise. One can also define a different function for the conditional probabilities, known as the softmax function,

$$p^{(t)}(c_j|c_i) = \frac{e^{\beta h^{(t)}(c_i, c_j)}}{\sum_k e^{\beta h^{(t)}(c_i, c_k)}}, \qquad (2)$$

where $\beta$ is an inverse temperature parameter: the lower the temperature, the higher the chance to traverse the edge with the largest $h$-value.

Learning takes place by the dynamical strengthening and weakening of the internal $h$-values, in correspondence to an external feedback, i.e. a reward $\lambda$, coming from the environment. Specifically, the update of the $h$-values is done according to the following update rule:

$$h^{(t+1)}(c_i, c_j) = h^{(t)}(c_i, c_j) - \gamma\,\big(h^{(t)}(c_i, c_j) - 1\big) + \sum_l \delta(c_i, c_{k_l})\,\delta(c_j, c_{m_l})\,\lambda^{(t+1)}, \qquad (3)$$

where the reward $\lambda$ is non-negative ($\lambda = 0$ implies no reward), and is added only to the $h$-values of the edges $(c_{k_l}, c_{m_l})$ that were traversed in the last random walk. The update rule can also handle negative rewards, given that the probability function is defined by Eq. (2), so that the transition probabilities $p(c_j|c_i)$ are guaranteed to remain non-negative. The damping parameter $0 \le \gamma \le 1$ weakens the $h$-values of all edges toward their initial value and allows the agent to forget its previous experience, an important feature in changing environments [33, 44, 49]. The damping term in Eq. (3) with a nonzero $\gamma$ is, however, not needed in stationary environments, such as the contextual bandit task [50]. In the contextual bandit task, the PS agent with the update rule of Eq. (3) and $\gamma = 0$ realizes the optimal policy in the limit. Note that although the update rule of Eq. (3) does not have an explicit tunable learning rate, the effective separation between different $h$-values is tunable by changing the inverse temperature parameter $\beta$ in the probability function.

In order to cope with more general environments, which may include temporal correlations, for instance through delayed rewards, the PS model utilizes the so-called glow mechanism [44]. In this mechanism, an additional variable $g(c_i, c_j) \in [0, 1]$ is assigned to each edge. The edge $g$-value is set to 1 whenever an edge is traversed, and for each time step in which the edge was not used, it dissipates according to the rule

$$g^{(t+1)}(c_i, c_j) = (1 - \eta)\, g^{(t)}(c_i, c_j), \qquad (4)$$

where the rate $\eta \in [0, 1]$ is a parameter of the model.
The $h$-value update rule for the non-traversed edges, with glow, assumes the form

$$h^{(t+1)}(c_i, c_j) = h^{(t)}(c_i, c_j) - \gamma\,\big(h^{(t)}(c_i, c_j) - 1\big) + g^{(t+1)}(c_i, c_j)\,\lambda^{(t+1)}. \qquad (5)$$

To clarify, with glow, edges whose traversal did not immediately result in a reward still obtain a fraction of a later issued reward, proportional to their current glow value. The latter corresponds to how far in the past, relative to the later rewarded time step, the particular edge was used. The glow mechanism thus allows a future reward to propagate back to previously used edges, and enables the agent to perform well also in settings where the reward is delayed (e.g. in the grid-world and mountain-car tasks [45]) and/or is contingent on more than just the immediate history of agent-environment interaction, such as in the n-ship game presented in Ref. [44]. The learning mechanisms specified by Eqs. (4)-(5) fully define the basic PS agent with fixed learning parameters $\gamma$ and $\eta$. The values of these learning parameters, or meta-parameters, have to be set properly such that the agent performs optimally in a certain task. However, as shown in Ref. [49], the PS model can naturally be extended to account for self-monitoring of its own meta-parameters, so that one and the same agent can reach near-optimal to optimal success rates in different kinds of reinforcement learning problems.

In this work, we present extensions which make use of the capacity of the PS model to generate new clips dynamically, for the purpose of generalization. To focus our study of the performance of such a PS agent, we introduce the simplest environmental scenarios which highlight the critical aspects of generalization, and analyze how the PS with generalization performs. While these simple settings all fit into the contextual bandit framework [50] and hence do not require glow, the same mechanism is of course readily applied to more complex task environments (e.g. with temporal correlations), and analogous generalization results will hold. In the next section we briefly reflect on the relation of the basic PS model (without generalization) to the more standard reinforcement learning machinery, which will further put this work into context.
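For concreteness, the following minimal Python sketch implements a basic two-layered PS agent along the lines of Eqs. (1) and (4)-(5). It is our own illustrative reconstruction under those equations, not the authors' implementation; the class, method names, and the toy task at the end are our own assumptions.

```python
import random
from collections import defaultdict

class BasicPS:
    """Minimal two-layered PS agent: percept clips connected directly to
    action clips.  Hopping probabilities follow Eq. (1); glow dissipation
    and the h-value update follow Eqs. (4)-(5)."""

    def __init__(self, actions, gamma=0.0, eta=1.0):
        self.actions = list(actions)
        self.gamma = gamma                  # damping (forgetting) parameter
        self.eta = eta                      # glow dissipation rate; eta = 1 disables glow
        self.h = defaultdict(lambda: 1.0)   # h-values, initialized to h_0 = 1
        self.g = defaultdict(float)         # glow values, initialized to 0

    def act(self, percept):
        """Sample an action from the excited percept clip according to Eq. (1)."""
        weights = [self.h[(percept, a)] for a in self.actions]
        action = random.choices(self.actions, weights=weights)[0]
        self.g[(percept, action)] = 1.0     # the traversed edge gets full glow
        return action

    def learn(self, reward):
        """Update every edge according to Eqs. (4)-(5) once the reward arrives."""
        for edge in set(self.h) | set(self.g):
            h, g = self.h[edge], self.g[edge]
            self.h[edge] = h - self.gamma * (h - 1.0) + g * reward
            self.g[edge] = (1.0 - self.eta) * g

# Toy usage (immediate rewards, i.e. a contextual-bandit-like task):
# reward 1 for stopping ("-") at red and driving ("+") at green.
agent = BasicPS(actions=["+", "-"], gamma=0.0, eta=1.0)
for t in range(2000):
    percept = random.choice([("red", "left"), ("green", "right")])
    action = agent.act(percept)
    correct = "-" if percept[0] == "red" else "+"
    agent.learn(1.0 if action == correct else 0.0)
```

With γ = 0 and η = 1 this reduces to the immediate-reward update of Eq. (6) below; nonzero η spreads a delayed reward over previously traversed edges.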
Projective simulation and reinforcement learning

The basic PS model can be viewed as an explicitly reward-driven RL model. Unlike most standard RL models, it does not include an explicit approximation of the state-value or the action-value function. As a consequence, PS is also simpler in the sense that the value function and the policy are not optimized over separately; rather, the optimization occurs concurrently. Despite these structural differences, the PS model can be related to other standard RL approaches. Quantitatively, the PS model was shown to perform similarly to standard RL models in benchmark tasks [44, 45, 51]. In addition to this quantitative relationship, a relationship between the $h$-value matrix and the action-value Q-matrix can be derived on a formal level, from which many fundamental properties of RL algorithms are qualitatively recovered in PS as well. For instance, in the setting of stationary environments with immediate rewards, the update rule of the basic two-layered PS model, given percept $s$ and action $a$, reads

$$h^{(t+1)}(s, a) = h^{(t)}(s, a) + \lambda^{(t+1)}. \qquad (6)$$

It is clear that the $h$-values, when normalized by the number of realized transitions over the history, converge to the (immediate) value of this transition. Moreover, given the policy-generating rule of Eq. (1), the probability of outputting the action of the most rewarded transition converges to unity. This is equivalent to employing a greedy policy over a converged action-value function in standard RL, and hence also implies that this basic PS handles stationary contextual bandit problems. Nonetheless, the $h$-values in PS are not meant to represent action values, but actually stem from descriptions of physical dynamics (as coupling coefficients).

Going a few steps further, in the setting of more general MDP environments (with delayed rewards) we can contrast Eq. (5) of the basic PS with the standard SARSA Q-matrix update [3, 9]. For the latter we have the expression

$$Q^{(t+1)}(s, a) = Q^{(t)}(s, a) + \alpha \big[ \lambda^{(t+1)} + \gamma_{RL}\, Q^{(t)}(s', a') - Q^{(t)}(s, a) \big], \qquad (7)$$

where the discounted value of the next state $s'$ and action $a'$, $\gamma_{RL}\, Q(s', a')$ (note that this $\gamma_{RL}$ does not correspond to the $\gamma$ term in the basic PS, but constitutes the discount factor), ensures that the current action value obtains a fraction of the value of the subsequently realized step.

In PS, the update rule is given by Eqs. (4) and (5). Note that an edge's $h$-value is updated regardless of whether this edge was traversed, that is, in each time step. On the other hand, this $h$-value plays an active role in the outputs of the PS agent only in some subsequent time step $t''$ when a transition from clip $c_i$ is required. In this interval, from $t'$ when the edge $(c_i, c_j)$ was traversed last, until $t''$ when the clip $c_i$ is encountered again, this edge accumulates discounted rewards. At time step $t''$, when this particular edge may be used again, the relevant $h$-value reads

$$h^{(t'')}(c_i, c_j) = h^{(t')}(c_i, c_j) + \lambda^{(t'+1)} + \sum_{k=t'+2}^{t''} (1 - \eta)^{t''-k}\, \lambda^{(k)}, \qquad (8)$$

where we have set the damping term $\gamma$ to zero to simplify the expression. The term $\sum_{k=t'+2}^{t''} (1 - \eta)^{t''-k}\, \lambda^{(k)}$ accounts for all the future rewards which followed the $(c_i, c_j)$ transition, which occurred at time step $t'$.
The term $\sum_{k=t'+2}^{t''} (1 - \eta)^{t''-k}\, \lambda^{(k)}$ is closely related to $\gamma_{RL}\, Q(s', a')$: the first term captures the discounted future rewards experienced by this particular agent in its future realized steps, starting from the next step. The term $\gamma_{RL}\, Q(s', a')$ captures the current approximation of what the future rewards will be (under the current policy), also starting from the next step, discounted by $\gamma_{RL}$. In other words, $\gamma_{RL}$ in SARSA plays the same functional role as the $g$-value decay rate $(1 - \eta)$ in PS. To further clarify, note that $\gamma_{RL}\, Q(s', a')$ corresponds to the $\sum_{k=t'+2}^{t''} (1 - \eta)^{t''-k}\, \lambda^{(k)}$ term, but computed for an averaged agent, averaged over the sequences of subsequent moves, given the agent's policy. Note, however, that in the case of PS, the edge $(c_i, c_j)$ will be traversed many times over the course of learning, leading to an effective averaging of the future-reward term. In the case of non-zero damping, all the rewards in the sum would also undergo proportional damping, but this yields a complicated expression which obfuscates the general trends in behavior.

We note that this heuristic analysis certainly does not constitute any formal statement about the relationship of the basic PS with other reinforcement learning models. While a full analysis goes beyond the scope of this paper, already this heuristic suggests that the expected performance of PS should not differ, qualitatively, from other models. This has been confirmed empirically through benchmarking against other models [44, 45, 51]: while on some occasions PS outperformed Q-learning or Dyna-planning, and on some it underperformed, the global trends were comparable.

Having discussed the similarities between the basic PS model and other standard RL algorithms, we now turn our attention to the differences, which is the topic of this paper. Unlike in the mentioned RL models, where learning revolves around the estimation of value functions, in PS learning is embodied in the re-configuration of the clip network. This includes the update of transition probabilities but also the dynamical network restructuring (via e.g. clip creation [33], applied here for generalization). The latter has no analog in the standard RL approaches we discussed previously, and only makes sense since the clip network is manifestly not a representation of value functions, but conceptually a different object. In this work we further explore this capacity of the PS model, by showing how it can be utilized to handle problems of generalization. Other possibilities, related to action fine-graining, were previously studied in Refs. [44, 52]. Here, the PS approach diverges from standard methodology, which, to our knowledge without exception, tackles this problem by using external machinery like classifiers (developed in the context of e.g. supervised learning), or, more generally, function approximators. We reiterate that such additional machinery could also be used with PS, but this comes at a cost which we elaborated on previously. In the next section we focus on how simple tasks which require generalization can be resolved using PS with dynamic clip generation.

Figure 2. (a) The basic PS network as it is built up for the driver scenario. Four percept clips (arrow, color) in the first row are connected to two action clips (+/−) in the second row. Each percept-action connection is learned independently. (b) The enhanced PS network as it is built up for the driver scenario, during the first four time steps. The following sequence of signals is shown: left-green (t = 1), right-green (t = 2), right-red (t = 3), and left-red (t = 4). Four percept clips (arrow, color) in the first row are connected to two layers of wildcard clips (first layer with a single wildcard and second layer with two) and to two action clips (+/−) in the fourth row. Newly created edges are solid, whereas existing edges are dashed (relative weights of the h-values are not represented).

Generalization within PS
Generalization is usually applicable when the perceptual input is composed of more than a single category. In the framework of the PS model, this translates to the case of $K > 1$. The generalization mechanism presented here enhances the clip network with abstracted clips that we call wildcard clips. Whenever the agent encounters a new stimulus, the corresponding new percept clip is created and compared pairwise to all existing clips. For each pair of clips whose $1 \le l \le K$ categories carry different values, a new wildcard clip is created (if it does not already exist), with all the $l$ differing values replaced by the wildcard symbol $\#$, while the remaining $K - l$ common categories keep their values.

A wildcard clip with $l$ wildcard symbols is placed in the $l$-th layer of the clip network (we consider the percept clip layer as the zeroth layer). In general, there can be up to $K$ layers between the layer of percept clips and the layer of action clips, with $\binom{K}{l}$ wildcard clips in layer $l$ for a particular percept. From each percept- and wildcard-clip there are direct edges to all action clips and to all matching higher-level wildcard clips. By matching higher-level wildcard clips, we mean wildcard clips with more wildcard symbols, whose explicit category values match those of the lower-level wildcard clip.

To illustrate the mechanism, we return to the driver scenario. The agent can perform one of two actions ($|A_1| = 2$): continue driving (+) or stop the car (−). The percepts that the agent perceives are composed of two categories ($K = 2$): color and direction. Each category has two possible values ($|S_1| = |S_2| = 2$): red and green for the color, and left and right for the direction. At each time step $t$ the agent thus perceives one of four possible combinations of colors and arrows, randomly chosen by the environment, and chooses one of the two possible actions. In such a setup, the basic PS agent, described in the previous section, would have a two-layered network of clips, composed of four percept clips and two action clips, as shown in Fig. 2(a). It would then try to associate the correct action with each of the four percepts separately. The PS with generalization, on the other hand, has a much richer playground: it can, in addition, connect percept clips to intermediate wildcard clips, and associate wildcard clips with action clips, as we elaborate below.

The development of the enhanced PS network is shown step by step in Fig. 2(b) for the first four time steps of the driver scenario (a hypothetical order of percepts is considered for illustration). When a left-green signal is perceived at time $t = 1$, the corresponding percept clip is created and connected to the two possible actions (+/−) with an initial weight $h_0$. In the second time step $t = 2$, a right-green signal is shown. This time, in addition to the creation of the corresponding percept clip, the wildcard clip ($\#$, green) is created, since the two percepts seen so far differ only in their direction category. At $t = 3$, a right-red signal is presented. This leads to the creation of the ($\Rightarrow$, $\#$) wildcard clip and, since the new percept differs from the first percept in both categories, also of the full wildcard clip ($\#$, $\#$). Finally, at $t = 4$, a left-red signal is shown. This causes the creation of the ($\Leftarrow$, $\#$) and ($\#$, red) wildcard clips; all newly created edges are assigned the initial weight $h_0$ and subsequently evolve through the usual update of the $h$-values.

The mechanism we have described so far realizes, by construction, the first two characteristics of meaningful generalization: categorization and classification.
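As an illustration of the clip-creation step just described, the following short Python sketch derives the wildcard clips generated by pairwise comparison of percepts. It is our own minimal reconstruction of the stated rule (function names and the "#" placeholder string are ours), simplified to percept-percept comparisons only, and is not code from the paper.

```python
from itertools import combinations

WILDCARD = "#"

def wildcard_from_pair(clip_a, clip_b):
    """Compare two clips category-wise; differing categories become wildcards."""
    return tuple(a if a == b else WILDCARD for a, b in zip(clip_a, clip_b))

def create_wildcard_clips(percepts):
    """Collect the wildcard clips produced by pairwise comparison of percepts."""
    clips = set()
    for p1, p2 in combinations(percepts, 2):
        candidate = wildcard_from_pair(p1, p2)
        if WILDCARD in candidate:   # at least one category differed
            clips.add(candidate)
    return clips

# The first four driver-scenario percepts of Fig. 2(b):
seen = [("left", "green"), ("right", "green"), ("right", "red"), ("left", "red")]
print(create_wildcard_clips(seen))
# five wildcard clips: (#, green), (right, #), (#, #), (left, #), (#, red)
# (set ordering may vary)
```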
In particular, categorization, the ability to recognize common properties, is achieved by composing the wildcard clips according to similarities in the incoming input. For example, it is natural to think of the ($\#$, red) wildcard clip as representing the common property of redness. In that spirit, one could interpret the full wildcard clip ($\#$, $\#$) as an abstraction of all percepts. Classification, in turn, is realized by the edges that connect each newly created percept clip to its matching wildcard clips, which are created with initial $h$-values.

To illustrate this on a simple domain, we next confront the agent with four different environmental scenarios, one after the other. Each scenario lasts 1000 time steps, followed by a sudden change of the rewarding scheme, to which the agent has to adapt. The different scenarios are listed below: (i) at the beginning ($1 \le t \le 1000$), […]; (ii) next ($1000 < t \le 2000$), […]; (iii) then ($2000 < t \le 3000$), […]; and (iv) finally ($3000 < t \le 4000$), the same action is rewarded irrespective of the presented percept.

Figure 4. The average reward obtained by the PS agent with generalization as simulated for the four different phases of the driver scenario (see text). At the beginning of each phase, the agent has to adapt to the new rules of the environment. The average reward drops and revives again, thereby exhibiting the mechanism's correctness and flexibility. A damping parameter of γ = 0.005 was used, and the average was taken over 10 agents.

Figure 3. The enhanced PS network configurations (idealized) as built up for each of the four phases of the driver scenario (see text). Only indirect strong edges are shown. Different wildcard clips allow the network to realize different generalizations. Categorization and classification are realized by the structure of the network, whereas relevance, correctness and flexibility come about through the update rule of Eq. (3).

Figure 3 sketches four different network configurations that typically develop during the above phases. Only strong edges, of relatively large $h$-values, are depicted, and we ignore direct edges from percepts to actions, for clarity, as explained later. At each stage a different configuration develops, demonstrating how the relevant wildcard clips play an important role, via strong connections to action clips. Moreover, those wildcard clips are connected strongly to the correct action clips. The relevant and correct edges are built through the update rule of Eq. (3), which only strengthens edges that, after having been traversed, lead to a rewarded action. Finally, the presented flexibility in the network's configuration, which reflects a flexible generalization ability, is due to: (a) the existence of all possible wildcard clips in the network; and (b) the update rule of Eq. (3), which allows the network, through a non-zero damping parameter $\gamma$, to adapt fast to changes in the environment. We note that Fig. 3 only displays idealized network configurations. In practice, other strong edges may exist, e.g. direct edges from percepts to actions, which may be rewarded as well. In the next section we address such alternative configurations and analyze their influence on the agent's success rate.

Figure 4 shows the PS agent's performance, that is, the average reward obtained by the agent, in the driver scenario, as a function of time, averaged over 10 agents. A reward of $\lambda = 1$ is given for correct actions and a damping parameter of $\gamma = 0.005$ is used. In the PS model there is a trade-off between adaptation time and the maximally achievable average reward: a high damping parameter $\gamma$ leads to faster relearning, but to lower averaged asymptotic success rates, see also Ref. [44]. Here we chose $\gamma = 0.005$ to allow the network to adapt within 1000 time steps. It is shown that on average the agent managed to adapt to each of the phases imposed by the environment, and to learn the correct actions. We can also see that the asymptotic performance of the agent is slightly better in the last phase, where the correct action is independent of the input. To understand this, note that: (a) the relevant edge can be rewarded at each time step and is thus less affected by the non-zero damping parameter; and (b) each wildcard clip necessarily leads to the correct action. This observation indicates that the agent's performance improves when the stimuli can be generalized to a greater extent. We will encounter this feature once more in the next section, where it is analytically verified.

The driver's scenario is given here to explain and demonstrate the underlying principles of the proposed generalization mechanism within PS. The problem itself can, of course, be solved by the basic PS alone, as well as by other methods, without a generalization capability. In what follows, however, we consider scenarios in which such an ability to generalize is indeed crucial for the agent's success.
Experimental and analytic results
A simple example
Sometimes it is not only helpful but a plain necessity for the agent to have a mechanism of generalization, as otherwise it has no chance to learn. Consider, for example, a situation in which the agent perceives a new stimulus every time step. What option does it have, other than trying to find some similarities among those stimuli, upon which it can act? In this section we consider such a scenario and analyze it in detail. Specifically, the environment presents one of $n$ different arrows, but at each time step the background color is different. The agent can only move into one of the $n > 1$ possible directions and is rewarded whenever it follows the direction of the shown arrow; we refer to this setup as the neverending-color scenario. The basic PS network that is built up in this scenario is shown in Fig. 5(a): each newly perceived (arrow, color) percept clip is connected to the $n$ possible action clips the agent can perform. The problem is that even if the agent takes the correct direction, the rewarded edge will never take part in later time steps, as no symbol is shown twice. The basic PS agent has thus no other option but to choose an action at random, which will be correct only with probability $1/n$, even after infinitely many time steps.

Figure 5. The basic (a) and the enhanced (b) PS networks as they are built up in the neverending-color scenario. (a) Each percept clip at the first row is independently connected to all n action clips at the second row. (b) Each percept- and wildcard-clip is connected to higher-level matching wildcard clips and to all n action clips. For clarity, only one-level edges to and from wildcard clips are solid, while other edges are semitransparent. The thickness of the edges does not reflect their weights.

In contrast to the basic PS, the PS with generalization does show learning behavior. The full network is shown in Fig. 5(b). Percept clips and wildcard clips are connected to matching wildcard clips and to all actions. Note that the wildcard clips always carry the wildcard symbol in their color category, since no color is ever shown twice, so that only the (arrow, $\#$) clips and the full wildcard clip ($\#$, $\#$) are created.
Figure 6. (a) The enhanced PS network as it is built up for the neverending-color scenario with K = 2 categories. Only the subnetwork corresponding to the left-arrow is shown. The weight of the edge from the wildcard clip (⇐, #) to the (←) action clip goes to h = ∞ with time. Hopping to the (⇐, #) clip thus leads to the correct action, whereas hopping to any other clip yields the correct action only with probability 1/n. (b) The enhanced PS network as it is built up for the neverending-color scenario with K = 3 categories (direction, color, shape). Only the subnetwork corresponding to the down-arrow is shown. The weights of the edges from the wildcard clips that contain the down-arrow (⇓) to the (↓) action clip go to h = ∞ with time. (a) and (b): Edges that are relevant for the analysis are solid, whereas other edges are semitransparent. The thickness of the edges does not reflect their weights.

A new percept clip on its own thus still leads to the correct action only with probability $1/n$. To see that the PS agent with generalization can do better, we take a closer look at the (arrow, $\#$) wildcard clips and their edges to the action clips, illustrated in Fig. 6(a). These edges can be rewarded whenever the corresponding arrow is shown; for the case of zero damping ($\gamma = 0$) we consider here, the $h$-values of these edges will tend to infinity with time, implying that once an (arrow, $\#$) clip is reached, which happens with probability $p = 1/(n+2)$, the excitation will hop to the correct action with unit probability and the agent will be rewarded. In the second case, no action is preferred over the others and the correct action will be reached with probability $1/n$. It is possible that an edge from the full wildcard clip ($\#$, $\#$) to some action clip is also rewarded and strengthened, but since percepts with all arrows lead to this clip, it yields the correct action only with an average probability of $1/n$. Overall, the performance of the PS agent with generalization is thus given by

$$E_\infty(n) = p + (1 - p)\,\frac{1}{n} = \frac{1 + 2n}{n\,(n + 2)} > \frac{1}{n}, \qquad p = \frac{1}{n+2}, \qquad (9)$$

which is independent of the precise value of the reward $\lambda$ (as long as it is a positive constant). Recall that $n > 1$. The resulting learning curves are shown in Fig. 7 for several values of $n$. Initially, the average performance is $1/n$, i.e. completely random (which is the best performance of the basic agent). It then grows, indicating that the agent begins to learn how to respond correctly, until it reaches its asymptotic value, as given in Eq. (9) and marked in the figure with a dashed blue line. It is seen that in these cases the asymptotic performance is achieved already after tens to hundreds of time steps (see the next section for an analytical expression of the learning rate). The simulations were carried out with 10 agents and a zero damping parameter ($\gamma = 0$). Since the asymptotic performance of Eq. (9) is independent of the reward $\lambda$, and to ease the following analytical analysis, we chose a high value of $\lambda = 1000$. Setting a smaller reward would only amount to a slower learning curve, but with no qualitative difference, as one can see in Fig. 7(b) for a reward of $\lambda = 1$. In Fig. 7(a), and whenever the reward $\lambda$ is not 1, we normalize the average reward obtained by the PS agent in order to compare it to the probability of obtaining the maximum reward given by Eq. (9).

We have therefore shown that the generalization mechanism leads to a clear qualitative advantage in this scenario: without it the agent cannot learn, whereas with it, it can. As for the quantitative difference, both Eq. (9) and Fig. 7(a) indicate that the gap in performance is not large. Nonetheless, any such gap can be further amplified. The idea is that for any network configuration in which the probability to take the correct action is larger than the probability to take any other action, the success probability can be amplified to unity by "majority voting", i.e.
by performing the random walk several times and choosing the action clip that occurs most frequently. Such amplification rapidly increases the agent's performance whenever the gap over the fully random agent is not negligible, given that the full wildcard clip ($\#$, $\#$) […].

Figure 7. Learning curves of the PS agent with generalization in the neverending-color scenario for n = 2, 3, and 5. Averaged simulation curves are shown in red, asymptotic average reward lines $E_\infty(n)$ (Eq. (9)) are shown in dashed blue, and the corresponding analytical approximation curves (Eq. (10)) are shown in dotted black. (a) A high reward of λ = 1000 was used. (b) A reward of λ = 1 was used.

Standard tabular RL models, in turn, cannot learn anything in this task environment without an external classifying machinery (decision trees, neural networks or SVM). On the flip side, the capacity of a combination of basic RL models with classifier machinery may beat the performance of our model. But again, this is arguably not an instructive comparison, as we do not combine the PS with external classifier machinery. The latter could in principle be done, but all such solutions come at the price of loss of homogeneity and a (sometimes dramatic) increase in model complexity, which we avoid in this work for the reasons we elaborated on earlier.
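To illustrate the "majority voting" amplification mentioned above, the following self-contained Python sketch (our own illustration, not code from the paper) estimates how repeating the random walk and taking the most frequent action boosts the success probability of an agent whose single-walk success probability equals $E_\infty(n)$ of Eq. (9); wrong outcomes are assumed uniform over the remaining actions, as in the analysis above.

```python
import random
from collections import Counter

def single_walk_action(n, p_correct):
    """One random walk: the correct action (index 0) is returned with
    probability p_correct; otherwise a wrong action is chosen uniformly."""
    if random.random() < p_correct:
        return 0
    return random.randint(1, n - 1)

def majority_vote_success(n, p_correct, walks=100, trials=10_000):
    """Estimate the success probability when the output action is the most
    frequent outcome of several independent random walks."""
    hits = 0
    for _ in range(trials):
        votes = Counter(single_walk_action(n, p_correct) for _ in range(walks))
        if votes.most_common(1)[0][0] == 0:
            hits += 1
    return hits / trials

n = 5
single = (1 + 2 * n) / (n * (n + 2))              # Eq. (9): about 0.314 for n = 5
print(single, majority_vote_success(n, single))   # the vote pushes this toward 1
```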
Analytical analysis: learning curve

To analyze the PS learning curves and to predict the agent's behavior for arbitrary $n$, we make the following simplifying assumptions: First, we assume that all possible wildcard clips are present in the PS network from the very beginning; second, for technical reasons, we assume that edges from the partial wildcard clips (arrow, $\#$) to the action clips are rewarded with $\lambda = \infty$ (or, equivalently, one can use the softmax function of Eq. (2) with $\beta = \infty$ instead of Eq. (1)). As shown below, under these assumptions, the analysis results in a good approximation of the actual performance of the agent.

While Eq. (9) provides an expression for $E_\infty(n)$, the expected average reward of the agent at infinity, here we look for the expected average reward at any time step $t$, i.e. we look for $E_t(n)$. Taking the above assumptions into account and following the same arguments that led to Eq. (9), we note that at any time $t$ at which the arrow $a$ is shown, there are only two possible network configurations that are conceptually different: either the edge from the ($a$, $\#$) wildcard clip to the ($a$) action clip was already rewarded and has an infinite weight, or not. Note that while this edge must eventually be rewarded, at any finite $t$ this is not promised. Let $p_{\mathrm{learn}}(t)$ be the probability that the edge from the wildcard clip ($a$, $\#$) to the action clip ($a$) has an infinite $h$-value at time $t$, i.e. the probability that the correct association was learned; then the expected performance at time $t$ is given by

$$E_t(n) = p_{\mathrm{learn}}(t)\, E_\infty(n) + \big(1 - p_{\mathrm{learn}}(t)\big)\, \frac{1}{n}. \qquad (10)$$

The probability $p_{\mathrm{learn}}(t)$ can be written as

$$p_{\mathrm{learn}}(t) = 1 - \left(1 - \frac{1}{n(n+1)(n+2)}\right)^{t-1}, \qquad (11)$$

where the term $1/\big(n(n+1)(n+2)\big)$ corresponds to the probability of finding the rewarded path (labeled as "$\infty$" in Fig. 6(a)): $1/n$ is the probability that the environment presents the arrow ($x$), $1/(n+2)$ is the probability to hop from the percept clip ($x$, color) to the wildcard clip ($x$, $\#$), and $1/(n+1)$ is the probability to hop from the wildcard clip ($x$, $\#$) to the ($x$) action clip. Finally, we take into account the fact that, before time $t$, the agent had $(t-1)$ opportunities to take this path and be rewarded.
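As a quick numerical check of Eqs. (10)-(11) (our own illustration, not code from the paper), the expected normalized reward approaches the asymptote $E_\infty(n)$ of Eq. (9) on a time scale set by $n(n+1)(n+2)$:

```python
def p_learn(t, n):
    """Eq. (11): probability that the rewarded path was found before time t."""
    return 1.0 - (1.0 - 1.0 / (n * (n + 1) * (n + 2))) ** (t - 1)

def expected_reward(t, n):
    """Eq. (10): expected (normalized) reward at time step t."""
    e_inf = (1 + 2 * n) / (n * (n + 2))     # asymptote of Eq. (9)
    return p_learn(t, n) * e_inf + (1.0 - p_learn(t, n)) / n

for n in (2, 3, 5):
    print(n, [round(expected_reward(t, n), 3) for t in (1, 50, 200, 1000)])
```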
The analytical approximation of the time-dependent performance of PS, given in Eq. (10), is plotted on top of Fig. 7(a) in dotted black, where it is shown to match the simulated curves (in red) well. The difference, which one can see in detail in Fig. 9(a), at the very beginning is caused by the assumption that all wildcard clips are present in the network from the very beginning, whereas the real agent needs several time steps to create them, thus reducing its initial success probability. Nonetheless, after a certain number of time steps the simulated PS agent starts outperforming the prediction given by the analytic approximation, because the agent can also be rewarded for transitions from the wildcard clip (arrow, $\#$) […].

Figure 8. Learning curves of the PS agent with generalization augmented with majority voting in the neverending-color scenario for n = 2, 3, and 5. For each agent, majority voting consists of 100 random walks. In order to achieve shorter learning times, the creation of the […]. (a) A high reward of λ = 1000 was used. (b) A reward of λ = 1 was used.
Analytical analysis: learning time

The best performance a single PS agent can achieve is given by the asymptotic performance $E_\infty(n)$ (Eq. (9)). For each individual agent there is a certain finite time $\tau$ at which this performance is achieved, whereas on average the asymptotic performance is reached only at $t = \infty$. $p_{\mathrm{learn}}(t)$, as defined before, is the probability that the edge from the relevant wildcard clip to the correct action clip was rewarded up to time $t$, and can therefore be expressed as a cumulative distribution function, $P(\tau \le t) = p_{\mathrm{learn}}(t+1)$, so that $P(\tau = t) = P(\tau \le t) - P(\tau \le t-1) = p_{\mathrm{learn}}(t+1) - p_{\mathrm{learn}}(t)$.

The expected value of $\tau$ can be thought of as the learning time of the agent and can be expressed as a power series

$$\mathrm{E}[\tau] = \sum_{t=1}^{\infty} t\, P(\tau = t) = \sum_{t=1}^{\infty} t \left[ \left(1 - \frac{1}{n(n+1)(n+2)}\right)^{t-1} - \left(1 - \frac{1}{n(n+1)(n+2)}\right)^{t} \right] = n(n+1)(n+2). \qquad (12)$$

Note that the "not learning" probability $1 - p_{\mathrm{learn}}(t)$ decays exponentially, with the decay rate reflected in the learning time $\mathrm{E}[\tau]$ (see Eq. (11)).
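To make the evaluation of the sum in Eq. (12) explicit (an added intermediate step, not present in the original text): $\tau$ follows a geometric distribution, so that

$$
P(\tau = t) = q\,(1-q)^{t-1}, \qquad q = \frac{1}{n(n+1)(n+2)}, \qquad
\mathrm{E}[\tau] = \sum_{t=1}^{\infty} t\, q\,(1-q)^{t-1} = \frac{1}{q} = n(n+1)(n+2).
$$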
More than two categories

We next generalize the neverending-color task to the case of an arbitrary number of categories $K$. The color category may take infinitely many values, whereas any other category can only take finitely many values, and the number of possible actions is given by $n > 1$. As before, only one category is important, namely the arrow direction, and the agent is rewarded for following it, irrespective of all other input. With more irrelevant categories the environment thus overloads the agent with more unnecessary information. Would this affect the agent's performance?

To answer this question, we look for the corresponding averaged asymptotic performance. As before, in the limit of $t \to \infty$, the wildcard clips which contain the arrows lead to a correct action clip with unit probability (for any finite reward $\lambda$ and zero damping parameter $\gamma = 0$), as illustrated in Fig. 6(b). On the other hand, choosing any other clip (including action clips) results in the correct action with an averaged probability of only $1/n$. Accordingly, either the random walk led to a wildcard clip with an arrow, or it did not. The averaged asymptotic performance for $K$ categories and $n$ actions can hence be written as

$$E_\infty(n, K) = p + (1-p)\,\frac{1}{n} = \frac{n + (1+n)\,2^{K-2}}{n\,(n + 2^{K-1})}, \qquad (13)$$

where $p = 2^{K-2}/(n + 2^{K-1})$ is the probability to hit a wildcard clip with an arrow, given by the ratio between the number of wildcard clips with an arrow ($2^{K-2}$) and the total number of clips that are reachable from a percept clip ($n + 2^{K-1}$). In this scenario, where no color is shown twice, all wildcard clips have their color category fixed to a wildcard symbol, so that there are $2^{K-1}$ wildcard clips connected to each percept clip, half of which, i.e. $2^{K-2}$, contain an arrow. Note that for two categories Eq. (13) correctly reduces to Eq. (9).

We can now see the effect of having a large number $K$ of irrelevant categories on the asymptotic performance of the agent. First, it is easy to show that for a fixed $n$, $E_\infty(n, K)$ increases monotonically with $K$, as also illustrated in Fig. 9(b). This means that although the categories provided by the environment are irrelevant, the generalization machinery can exploit them to make a larger number of relevant generalizations, and thereby increase the agent's performance. Moreover, for large $K$, and, more explicitly, for $K \gg \log n$, the averaged asymptotic performance tends to $(1 + 1/n)/2$. Consequently, when the number of possible actions $n$ is also large, in which case the performance of the basic agent would drop to 0, the PS agent with generalization would succeed with a probability that tends to $1/2$, as shown in Fig. 9(b) for $n = 2$[…].

Figure 9. (a) A part of Fig. 7(a): the difference between the PS agent's performance approximation, which is given by Eq. (10), and numerical simulations for n = 2. (b) The asymptotic average reward $E_\infty(n, K)$ for the neverending-color scenario (see Eq. (13)), as a function of K, the number of categories, for n = 2, […].

Similarly, we note that when none of the categories is relevant, i.e. when the environment is such that the agent is expected to take the same action irrespective of the stimulus it receives, the agent performs even better, with an average asymptotic performance of $E^*_\infty(n, K) = \big(2^{K-1} + 1\big)/\big(n + 2^{K-1}\big)$. This is because in such a scenario, every wildcard clip eventually connects to the rewarded action. Accordingly, since each percept clip leads to wildcard clips with high probability, the correct action clip is likely to be reached. In fact, in the case of $K \gg \log n$ the asymptotic performance of the PS with generalization actually tends to 1.
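A small numerical check of Eq. (13) (our own illustration, not code from the paper): for fixed $n$, the asymptotic reward grows monotonically with $K$ and approaches $(1 + 1/n)/2$.

```python
def asymptotic_reward(n, K):
    """Eq. (13): asymptotic average reward for n actions and K categories."""
    p = 2 ** (K - 2) / (n + 2 ** (K - 1))   # probability to hit a wildcard clip with an arrow
    return p + (1 - p) / n

for K in (2, 3, 5, 10, 20):
    print(K, round(asymptotic_reward(2, K), 4))   # approaches (1 + 1/2)/2 = 0.75
```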
DISCUSSION

When the environment confronts an agent with a new stimulus in each and every time step, the agent has no chance of coping, unless the presented stimuli have some common features that the agent can grasp. The recognition of these common features, i.e. categorization, and classifying new stimuli accordingly, are the first steps toward a meaningful generalization as characterized at the beginning of this paper. We presented a simple dynamical machinery that enables the PS model to realize those abilities and showed how the remaining requirements of meaningful generalization, namely that relevant generalizations are learned, that correct actions are associated with the relevant properties, and that the generalization mechanism is flexible, follow naturally from the PS model itself. Through numerical and analytical analysis, also for an arbitrary number of categories, we showed that the PS agent can then learn even in extreme scenarios where each percept is presented only once.

The generalization machinery introduced in this paper enriches the basic PS model, not only in the practical sense that it can handle a larger range of scenarios, but also in a more conceptual sense. In particular, the enhanced PS network allows for the emergence of clips that represent abstractions or abstract properties, like the redness property, rather than merely remembered percepts or actions. Moreover, the enhanced PS network is multilayered and allows for more involved dynamics of the random walk, which, as we have shown, gives rise to a more sophisticated behavior of the agent. Yet, although the clip network may evolve to more complicated structures than before, the overall model preserves its inherent simplicity, which enables an analytical characterization of its performance.

Our approach assumes that the percept space has an underlying Cartesian product structure, that is, there are established categories. This is a natural assumption in many settings, for instance where the categories stem from distinct sensory devices, in the case of embodied agents. For the generalization procedure to produce functional generalizations, however, it is also vital that the reward function of the environment indeed reflects this structure. This implies that the status of some categories matters, whereas other categories may not matter. Nonetheless, in settings where the underlying similarity structure on the percept space (e.g. a metric which establishes which subsets of the percept space require similar responses) does not follow the Cartesian feature axes, the agent will still learn, provided the percept space is not infinite; in this case, however, no improvements will come about from the generalization procedure. While restricted, the provided notion of generalization still captures a collection of subsets of the percept space that is exponentially large in the number of categories.

Regarding the computational complexity of the model, we can identify two main contributions. The first is the deliberation time of the model, that is, the number of transitions within the ECM which have to occur at any time step of the interaction. This is efficient, as the number of steps is upper bounded by the number of layers, which scales linearly with the number of categories. The second contribution stems from the updates of the network, i.e. the number of wildcard clips which have to be generated, which also has an immediate impact on the expected learning time of the agent.
In principle, the total number of wildcard clips can grow exponentially with the number of categories (that is, typically as a low-order polynomial in the size of the percept space). Note that this is a necessity, as there are exponentially many subspaces of the percept space which the agent must explore to find the relevant generalizations. Confining this space to a smaller subspace would necessarily restrict the generality of the generalization procedure. There is thus an unavoidable trade-off between how general the generalization procedure is and how large the generalization space becomes, and one has to balance the two.

In the approach to generalization we have presented, we have built upon the capacity of the PS model itself to dynamically generate novel clips, and the restriction we have mentioned (combinatorial space complexity due to the overall clip number) could be mitigated by employing external classification machinery. This latter approach is the norm in other RL approaches, but as we have clarified earlier, it comes with other types of issues we have avoided: the use of external machinery increases the complexity of the model and makes it inhomogeneous. In contrast, the homogeneity of the PS approach allows for high interpretability of results, including analytic analyses, and advantages with respect to embodied implementations. Moreover, our approach has a clear route for quantization of the model, which may become increasingly more relevant as quantum technologies further develop. Thus, it is of great interest to develop a method that can deal with the issue of combinatorial space complexity while maintaining homogeneity.

In particular, a clip-deletion mechanism can be used within PS to deal with the combinatorial growth of the number of clips. This mechanism deletes clips, thereby maintaining a stable population of a controlled size and prioritizing the deletion of less used, hence less useful, clips. The size of the population constitutes a sparsity parameter of the model, and makes the combinatorial explosion in the clip number controllable, while still allowing the agent to explore the complete space of axes-specified generalizations. A proof of principle of this approach was given in Ref. [53]; a more detailed analysis of the deletion mechanism is ongoing work.
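As a rough illustration of such a deletion mechanism, the following sketch is our own hypothetical stand-in (the actual mechanism of Ref. [53] may differ): it keeps the wildcard-clip population bounded by evicting the least recently used clips once a size limit is exceeded.

```python
from collections import OrderedDict

class ClipPool:
    """Keep at most `max_clips` wildcard clips, evicting the least recently
    used ones first (a simple stand-in for usage-based clip deletion)."""

    def __init__(self, max_clips):
        self.max_clips = max_clips
        self.clips = OrderedDict()   # clip -> associated data (e.g. its h-values)

    def touch(self, clip):
        """Register that `clip` was created or visited in the current random walk."""
        self.clips[clip] = self.clips.pop(clip, {})   # move clip to the recently-used end
        while len(self.clips) > self.max_clips:
            self.clips.popitem(last=False)            # evict the least recently used clip
```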
ACKNOWLEDGEMENTS

We wish to thank Markus Tiersch, Dan Browne and Elham Kashefi for helpful discussions. This work was supported in part by the Austrian Science Fund (FWF) through Grant No. SFB FoQuS F4012, and by the Templeton World Charity Foundation (TWCF) through Grant No. TWCF0078/AB46.

[1] Holland, J. H., Holyoak, K. J., Nisbett, R. E. & Thagard, P. Induction: Processes of Inference, Learning, and Discovery. Computational Models of Cognition and Perception (MIT Press, Cambridge, MA, USA, 1986).
[2] Saitta, L. & Zucker, J.-D. Abstraction in Artificial Intelligence and Complex Systems (Springer, New York, NY, USA, 2013).
[3] Sutton, R. S. & Barto, A. G. Reinforcement Learning: An Introduction (MIT Press, Cambridge, MA, USA, 1998).
[4] Russell, S. & Norvig, P. Artificial Intelligence: A Modern Approach (Prentice Hall, Englewood Cliffs, NJ, USA, 2010), third edn.
[5] Wiering, M. & van Otterlo, M. (eds.) Reinforcement Learning: State-of-the-Art, vol. 12 of Adaptation, Learning, and Optimization (Springer, Berlin, Germany, 2012).
[6] van Otterlo, M. The logic of adaptive behavior: knowledge representation and algorithms for the Markov decision process framework in first-order domains. Ph.D. thesis, Univ. Twente, Enschede, Netherlands (2008).
[7] Ponsen, M., Taylor, M. E. & Tuyls, K. Abstraction and generalization in reinforcement learning: A summary and framework. In Taylor, M. E. & Tuyls, K. (eds.) Adaptive and Learning Agents, vol. 5924 of Lecture Notes in Computer Science, chap. 1, 1–32 (Springer, Berlin, Germany, 2010).
[8] Watkins, C. J. C. H. Learning from delayed rewards. Ph.D. thesis, Univ. Cambridge, Cambridge, U.K. (1989).
[9] Rummery, G. A. & Niranjan, M. On-line Q-learning using connectionist systems. Tech. Rep. CUED/F-INFENG/TR 166, Univ. Cambridge, Cambridge, U.K. (1994).
[10] Melo, F. S., Meyn, S. P. & Ribeiro, M. I. An analysis of reinforcement learning with function approximation. In Proc. 25th Int. Conf. Mach. Learn., 664–671 (2008).
[11] Albus, J. S. A new approach to manipulator control: The cerebellar model articulation controller (CMAC). J. Dyn. Sys., Meas., Control, 220–227 (1975).
[12] Sutton, R. S. Generalization in reinforcement learning: Successful examples using sparse coarse coding. In Adv. Neural Inf. Process. Syst., vol. 8, 1038–1044 (MIT Press, 1996).
[13] Boyan, J. A. & Moore, A. W. Generalization in reinforcement learning: Safely approximating the value function. In Adv. Neural Inf. Process. Syst., vol. 7, 369–376 (MIT Press, 1995).
[14] Whiteson, S. & Stone, P. Evolutionary function approximation for reinforcement learning. J. Mach. Learn. Res., 877–917 (2006).
[15] Mnih, V. et al. Human-level control through deep reinforcement learning. Nature, 529–533 (2015).
[16] Pyeatt, L. D. & Howe, A. E. Decision tree function approximation in reinforcement learning. In Proc. 3rd Int. Symposium on Adaptive Systems: Evolutionary Computation and Probabilistic Graphical Models, 70–77 (2001).
[17] Ernst, D., Geurts, P. & Wehenkel, L. Tree-based batch mode reinforcement learning. J. Mach. Learn. Res.
[18] In Feature Extraction, Construction and Selection, vol. 453 of The Springer International Series in Engineering and Computer Science, 219–235 (Springer, New York, NY, USA, 1998).
[19] Cortes, C. & Vapnik, V. Support-vector networks. Mach. Learn., 273–297 (1995).
[20] Laumonier, J. Reinforcement using supervised learning for policy generalization. In Proc. 22nd National Conference on Artificial Intelligence, vol. 2, 1882–1883 (AAAI Press, 2007).
[21] Holland, J. H. Adaptation. In Rosen, R. J. & Snell, F. M. (eds.) Progress in Theoretical Biology, vol. 4, 263–293 (1976).
[22] Holland, J. H. Escaping brittleness: The possibilities of general-purpose learning algorithms applied to parallel rule-based systems. In Michalski, R. S., Carbonell, J. G. & Mitchell, T. M. (eds.) Machine Learning: An Artificial Intelligence Approach, vol. 2 (Morgan Kaufmann, 1986).
[23] Urbanowicz, R. J. & Moore, J. H. Learning classifier systems: A complete introduction, review, and roadmap. Journal of Artificial Evolution and Applications, 1–25 (2009).
[24] Jong, N. K. State abstraction discovery from irrelevant state variables. In Proc. 19th International Joint Conference on Artificial Intelligence, 752–757 (2005).
[25] Li, L., Walsh, T. J. & Littman, M. L. Towards a unified theory of state abstraction for MDPs. In Proc. 9th International Symposium on Artificial Intelligence and Mathematics, 531–539 (2006).
[26] Cobo, L. C., Zang, P., Isbell, C. L. & Thomaz, A. L. Automatic state abstraction from demonstration. In Proc. 22nd International Joint Conference on Artificial Intelligence (2011).
[27] Sutton, R. S., Precup, D. & Singh, S. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artif. Intell., 181–211 (1999).
[28] Botvinick, M. M. Hierarchical reinforcement learning and decision making. Curr. Opin. Neurobiol., 956–962 (2012).
[29] Tadepalli, P., Givan, R. & Driessens, K. Relational reinforcement learning: An overview. In Proc. Int. Conf. Mach. Learn. Workshop on Relational Reinforcement Learning (2004).
[30] Hutter, M. Feature reinforcement learning: Part I. Unstructured MDPs. Journal of Artificial General Intelligence, 3–24 (2009).
[31] Nguyen, P., Sunehag, P. & Hutter, M. Feature reinforcement learning in practice. In Sanner, S. & Hutter, M. (eds.) Recent Advances in Reinforcement Learning, vol. 7188 of Lecture Notes in Computer Science, 66–77 (Springer, Berlin, Germany, 2012).
[32] Daswani, M., Sunehag, P. & Hutter, M. Feature reinforcement learning: State of the art. In Proc. 28th AAAI Conf. Artif. Intell.: Sequential Decision Making with Big Data, 2–5 (2014).
[33] Briegel, H. J. & De las Cuevas, G. Projective simulation for artificial intelligence. Sci. Rep., 400 (2012).
[34] Motwani, R. & Raghavan, P. Randomized Algorithms, chap. 6 (Cambridge University Press, New York, USA, 1995).
[35] Pfeiffer, R. & Scheier, C. Understanding Intelligence (MIT Press, Cambridge, MA, USA, 1999), first edn.
[36] Childs, A. M. et al. Exponential algorithmic speedup by a quantum walk. In Proc. 35th Annu. ACM Symp. Theory Comput. (STOC), 59–68 (ACM, New York, NY, USA, 2003).
[37] Kempe, J. Discrete quantum walks hit exponentially faster. Probab. Theory Relat. Fields, 215–235 (2005).
[38] Krovi, H., Magniez, F., Ozols, M. & Roland, J. Quantum walks can find a marked element on any graph. Algorithmica.
[39] Phys. Rev. X, 031002 (2014).
[40] Dunjko, V., Friis, N. & Briegel, H. J. Quantum-enhanced deliberation of learning agents using trapped ions. New J. Phys., 023006 (2015).
[41] Friis, N., Melnikov, A. A., Kirchmair, G. & Briegel, H. J. Coherent controlization using superconducting qubits. Sci. Rep., 18036 (2015).
[42] Dunjko, V., Taylor, J. M. & Briegel, H. J. Quantum-enhanced machine learning. Phys. Rev. Lett., 130501 (2016).
[43] Sriarunothai, T., Wölk, S., Giri, G. S., Friis, N., Dunjko, V., Briegel, H. J. & Wunderlich, C. Speeding-up the decision making of a learning agent using an ion trap quantum processor. arXiv:1709.01366 (2017).
[44] Mautner, J., Makmal, A., Manzano, D., Tiersch, M. & Briegel, H. J. Projective simulation for classical learning agents: a comprehensive investigation. New Gener. Comput., 69–114 (2015).
[45] Melnikov, A. A., Makmal, A. & Briegel, H. J. Projective simulation applied to the grid-world and the mountain-car problem. arXiv:1405.5459 (2014).
[46] Hangl, S., Ugur, E., Szedmak, S. & Piater, J. Robotic playing for hierarchical complex skill learning. In Proc. IEEE/RSJ Int. Conf. Intell. Robots Syst., 2799–2804 (2016).
[47] Melnikov, A. A., Poulsen Nautrup, H., Krenn, M., Dunjko, V., Tiersch, M., Zeilinger, A. & Briegel, H. J. Active learning machine learns to create new quantum experiments. arXiv:1706.00868 (2017).
[48] Bellman, R. E. Dynamic Programming (Princeton University Press, Princeton, NJ, USA, 1957).
[49] Makmal, A., Melnikov, A. A., Dunjko, V. & Briegel, H. J. Meta-learning within projective simulation. IEEE Access, 2110–2122 (2016).
[50] Wang, C.-C., Kulkarni, S. R. & Poor, H. V. Bandit problems with side observations. IEEE Trans. Autom. Control, 338–355 (2005).
[51] Bjerland, Ø. F. Projective simulation compared to reinforcement learning. Master's thesis, Dept. Comput. Sci., Univ. Bergen, Bergen, Norway (2015).
[52] Tiersch, M., Ganahl, E. J. & Briegel, H. J. Adaptive quantum computation in changing environments using projective simulation.