Robby is Not a Robber (anymore): On the Use of Institutions for Learning Normative Behavior
Stevan Tomic, Federico Pecora and Alessandro Saffiotti
Abstract — Future robots should follow human social norms in order to be useful and accepted in human society. In this paper, we leverage already existing social knowledge in human societies by capturing it in our framework through the notion of social norms. We show how norms can be used to guide a reinforcement learning agent towards achieving normative behavior and apply the same set of norms over different domains. Thus, we are able to: (1) provide a way to intuitively encode social knowledge (through norms); (2) guide learning towards normative behaviors (through an automatic norm reward system); and (3) achieve a transfer of learning by abstracting policies. Finally, (4) the method is not dependent on a particular RL algorithm. We show how our approach can be seen as a means to achieve abstract representation and learn procedural knowledge based on the declarative semantics of norms, and discuss possible implications of this in some areas of cognitive science.
Index Terms — Norms, Institutions, Automatic Reward Shaping, Transfer of Learning, Abstract Policies, Abstraction, State-Space Selection, Schema
I. INTRODUCTION
In order to be accepted in human society, robots need to comply with human social norms. The main goal of our research is to make robots capable of behaving in a socially acceptable way, i.e., to adhere to social norms. In this paper, we do this by applying social norms in reinforcement learning (RL) settings.

In recent years, reinforcement learning has achieved notable successes in simulated environments like classical Atari [1] or board games, e.g., go [2], chess [3], and other computer games [4], [5]. Additionally, significant successes have been achieved in real physical robotic control, after first learning policies in a simulated environment [6]. Despite such impressive developments, RL is still mainly relegated to relatively tightly controlled settings. Using RL in real-world settings poses significant challenges, most notably the credit assignment problem, which concerns how an action taken earlier influences the final reward, the size of the state space, and the amount of training data. Also, the transfer of learned policies to novel domains is not trivial to achieve. These challenges are complicated by the fact that robots have to act in the social space shared with humans. For example, in a learning process driven by exploration, an agent is not necessarily acting towards the achievement of a goal, which may result in possibly dangerous, unexpected behaviors. Furthermore, learned behaviors and interactions with other agents or objects may not be easily understandable or may seem odd to humans, thus making it difficult to realize human-level social interactions.
Policies obtained via RL lead an agent to act in a way that maximizes the reward function, meaning that an agent may learn to achieve its goal in a way which is not socially acceptable. This somewhat Machiavellian characteristic of RL agents indicates that policies that are to be used in social environments should be subject to, or at least biased by, social norms.

To make robotic RL agents behave socially, we aim to create policies that produce normative behavior depending on the social situation. In doing so, we also wish to avoid hand-crafting policies (or the decision-theoretic planning [7] models), that is, we wish to provide agents with principles that can be applied across domains and boundary conditions. We combine the specification and learning of social norms, so that these social norms can be applied in a general and reusable way to agents that learn, resulting in learned policies that generate normative behaviors. Towards this end, we propose the use of institutions (e.g., [8], [9], [10]). Institutions give a social dimension to the actions of agents and provide regulative mechanisms to guide agents in their interactions towards normative behaviors. The advantage of using institutions is twofold: (A) they provide social knowledge by capturing the set of norms that should be followed given the current situation; (B) norms in the institution are usually abstracted from the domain in which they can be applied, thus the same set of norms can be applied in different domains.

In this paper, we leverage the fact that our institutional model can distinguish between socially acceptable and not acceptable behaviors by formally verifying them against norm semantics and by exploiting the facts (A) and (B). As we will see, (A) can be used to automatically create a reward system that can guide RL towards policies that adhere to norms, addressing the well-known credit assignment problem, while (B) will allow us to create abstracted normative policies, addressing state-space reduction with automatic state-space/feature selection, which can be applied in different domains without re-learning. Finally, the framework is agnostic to the particular learning / RL approach.

We start with a brief look at the related work, followed by a description of the basics of our institutional framework. Then we cover the background of RL approaches and describe how we used our framework in RL settings. A set of experiments is then presented as proofs of concept that: (1) normative policies can indeed be learned; (2) an automatic reward system based on norms can be created that helps robotic agents learn in the right direction; (3) policies can be abstracted and applied to different domains; and (4) our approach can be used in multi-agent settings. Finally, we discuss our achievements with a focus on abstract knowledge representation and its possible further implications.

All authors are with the Center for Applied Autonomous Sensor Systems, School of Science and Technology, Örebro University, 70187 Örebro, Sweden. Contact authors' e-mail: [email protected]; [email protected]
II. RELATED WORK
Classical Problems in RL.
One of the well-known problems in RL, the credit assignment problem [11], decreases learning ability for long temporal horizons with sparse rewards.
The curse of dimensionality [12] is another problem, where the exploration of states has to increase exponentially with the increase of the state space. Social robots face these problems since they should act in a physical world with a complex state-space and long temporal horizons. Finally, the transfer of learning, i.e., how learned knowledge can be reused in novel domains, is notoriously difficult. Several ways to address these problems have been reported in the literature. RL agents can receive intermediate rewards which serve as 'progress estimators' (indicators), which reduces exploration and directs an agent's behavior towards the goal. Matarić [13] shows how this can significantly reduce learning time. Ng et al. [14] analyzed 'reward shaping' mechanisms and a potential-shaping function that can speed up convergence without changing the previous optimal policy. The major shortcoming of this approach is that such intermediate feedback signals require significant engineering effort. Hierarchical RL (HRL) provides a different approach to address long temporal horizons. In such methods, the domain (or the temporal aspect of it) is represented at another level of abstraction [15]. HRL methods mitigate the temporal assignment problem: by shortening the temporal horizon, better (structural) exploration is achieved, while sub-policies enable transfer learning. A popular way to address the 'curse of dimensionality' problem, especially in robotics, is to reduce a state space to a smaller one. Behavior-based systems (BBS), as in Matarić [16], encapsulate behaviors as higher-level structures combining sensing and action, which helps in reducing the underlying state-space. Robots are then trained over such a higher-level state space. The state-space can be reduced, for example, by defining a measure of importance of a sensor [17] or by learning connections between large perceptual state spaces and hidden states [18]. Similarly, the number of agent actions can also be reduced, where the most popular approach is to do so with the 'option' framework of Sutton et al. [15]. Options can be understood as 'closed-loop' sub-policies with a certain duration depending on terminating conditions, and they can combine more primitive agent actions. In general, state abstraction and action abstraction are usually studied separately.

Both reward shaping and state-space/action reduction are crucial ideas that are utilized in our work. The way we define the semantics of norms and the way they are associated with social structures allows us to naturally abstract both the state-space and the action space of learning agents. Such abstraction reduces the dimensionality problem while at the same time addressing the transfer learning problem, so that we can apply the same norms over novel domains. Furthermore, by utilizing the process of formal verification, we can assign normative states, such as fulfillment or violation, to agent behaviors. Consequently, behavior can then be evaluated during execution to define a normative shaping function and to mitigate the credit assignment problem.
Norms.
In the context of RL, learning human normative behavior is usually achieved through inverse reinforcement learning (IRL) algorithms [19]. The motivation behind this area of research is to learn (infer) a reward distribution from the exemplary behavior of expert humans. The obtained reward distribution is used in ordinary RL settings to train artificial agents, where the resulting policies lead to the desired behavior. While this approach does not learn norms directly, an additional proposal explicitly represents norms, which are then inferred from demonstrated behaviors [20]. Additionally, agents can also be trained to learn behavior based on human feedback [21].

In our approach, we leverage given social knowledge captured in the form of social institutions. Our framework enables us to formally represent this knowledge and use it for learning normative behavior in RL settings, addressing some of the classical RL problems described above.
Institutions.
The concept of institution stems from the economic and social sciences [8]. The concept is loosely defined across different fields. According to North [9], institutions are, simply defined, "the rules of the game in a society", while Ostrom [10] defines them as "the prescriptions that humans use to organize all forms of repetitive and structured interaction...". In multi-agent systems (MAS), institutions are usually seen as a set of norms. The institution and its norms are used to coordinate organizations of agents, thus they are sometimes called "coordination artifacts" [22]. One of the advantages of using institutions to organize agents is that they are typically abstracted from the particular agent domain. Thus, ideally, the same institution should be able to organize different heterogeneous agents in various environments.

In our work, we use the concept of institution to formalize the normative aspects of a social situation (context) by encapsulating a set of abstract norms. The core of learning normative behaviors is concerned with the particular relations of state values in an agent's execution trace (norm semantic functions), which 'count as' [8] institutional norms. The count-as principle is explored in MAS [23]. We employ a mapping between institutional concepts and a domain, called grounding. This mapping realizes the count-as concept, lifting an agent's "raw" execution traces to their institutional (social) meaning and binding them to the normative dimension through obligations, permissions or prohibitions.

III. THE FORMAL MODEL OF INSTITUTION
We start by introducing our formal institution framework. The framework is presented in full detail in [24], while in this section we only present the parts of that framework that are relevant to this paper.
A. Institutions
Institutions encapsulate a collection of norms together with the roles, actions and artifacts that these norms refer to. For example, the desired behaviors of agents participating in a store (marketplace) can be modeled, where its norms include relations between sellers, customers, and goods. A norm in this institution could state, for instance, that a buyer should pay for certain goods to a certain seller. We define the following sets:

$Roles = \{role_1, role_2, \ldots, role_m\}$
$Arts = \{art_1, art_2, \ldots, art_a\}$
$Acts = \{act_1, act_2, \ldots, act_k\}$

Examples of
Roles are a ‘customer’ in a store or a ‘goal-keeper’ in a football game.
Arts are the artifacts of the institution, e.g., 'goods' or a 'cash register'.
Acts are actions that can be performed, e.g., 'exit' a location or 'pay' for a carton of milk. We define norms to be predications over statements:
Definition 1. A norm is a statement of the form $q(trp^*)$, where $q$ is a qualifier and $trp$ is a triple of the form:

$trp \in Roles \times Acts \times (Arts \cup Roles)$

Qualifiers can be unary relations like must or must-not, or n-ary ones like inside or before. For example, must can be used to represent the obligation 'A buyer must pay with cash':

must((Buyer, Pays, Cash)).

A qualifier such as in front of can specify a spatial constraint on the location of an action, as in 'Paying should be performed in front of a cash register':

in front of((Buyer, Pays, CashRegister)).

Binary qualifiers can express relations between statements, e.g., temporal relations such as before or during, for instance

before((Buyer, GetGoods, Goods), (Buyer, Pays, Cash))

indicating that 'A buyer should get the goods before paying'. Institutions put all the above elements together:
Definition 2. An institution is a tuple $\mathcal{I} = \langle Arts, Roles, Acts, Norms \rangle$.

This definition embodies a pragmatic view of institutions, which is in accordance with North's view that institutions constitute "the human-devised constraints that shape social interaction" [9].
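As an illustration of Definitions 1 and 2, the sketch below encodes an institution and its norms as plain data structures. The class and field names (Norm, Institution, qualifier, triples) are our own illustrative choices, not part of the formal model or of any particular software library.

from dataclasses import dataclass
from typing import Tuple

# A triple trp in Roles x Acts x (Arts u Roles), represented with plain strings.
Triple = Tuple[str, str, str]

@dataclass(frozen=True)
class Norm:
    """A norm q(trp*): a qualifier applied to one or more triples."""
    qualifier: str                 # e.g. "mustUse", "mustAt", "before"
    triples: Tuple[Triple, ...]

@dataclass(frozen=True)
class Institution:
    """An institution <Arts, Roles, Acts, Norms> (Definition 2)."""
    roles: frozenset
    acts: frozenset
    arts: frozenset
    norms: Tuple[Norm, ...]

# Example: a fragment of the 'Store' institution used later in the paper.
store = Institution(
    roles=frozenset({"Buyer"}),
    acts=frozenset({"Pay", "GetGoods"}),
    arts=frozenset({"Goods", "PayPlace"}),
    norms=(
        Norm("mustUse", (("Buyer", "GetGoods", "Goods"),)),
        Norm("mustAt", (("Buyer", "Pay", "PayPlace"),)),
        Norm("before", (("Buyer", "GetGoods", "Goods"),
                        ("Buyer", "Pay", "PayPlace"))),
    ),
)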
B. Domains
An institution is an abstraction, which can be instantiated in concrete systems that are physically different but are described by the same structure. For instance, the same 'store' institution can be used to describe or regulate behaviors of agents in different markets, irrespective of these agents being humans, robots, or a combination of both. We call such a concrete system a domain.

The domain (D) is characterized by: the set of agents A, which can include humans (e.g., john), robots (e.g., robby), or both; the collection B of all behaviors that these humans or robots can perform, like pick or speak; and a set O of objects in the domain, like door or battery. In addition, a domain includes a set $R = \{\rho_1, \ldots, \rho_n\}$ of state variables. Each state variable $\rho_i$ defines a property pertaining to an entity (agent, behavior, or object) in the domain. For instance, pos(robby) is a state variable that indicates the position of the agent robby, while active(pick, robby) indicates the activation of robby's behavior pick. We denote with $vals(\rho)$ the set of possible values of state variable $\rho$, e.g., $vals(active(pick, robby)) = \{\top, \bot\}$.

Definition 3.
The state space of D is $S = \prod_{\rho \in R} vals(\rho)$, where $vals(\rho)$ are the possible values of state-variable $\rho$. We call any element $s \in S$ a state. The value of $\rho$ in state $s$ is denoted $\rho(s)$.

In a dynamic environment, the values of most properties change over time. In our formalization, we represent time points by natural numbers in $\mathbb{N}$, and time intervals by pairs $I = [t_1, t_2]$ such that $t_1, t_2 \in \mathbb{N}$ and $t_1 \le t_2$. We denote by $\mathcal{I}$ the set of all such time intervals.

Definition 4. A trajectory is a pair $(I, \tau)$, where $I \in \mathcal{I}$ is a time interval and $\tau : I \to S$ maps time to states.

In simple words, a trajectory represents how the domain evolves over a given time interval.
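As a minimal illustration of Definitions 3 and 4, a state can be encoded as an assignment of values to named state variables and a trajectory as a time-indexed list of such states. The dictionary-based encoding below is a simplification of our own, not the interval-based representation used in [24].

from typing import Any, Dict, List

# A state assigns a value to every state variable, keyed here by name,
# e.g. "pos(robby)" -> (3, 1) or "active(pick,robby)" -> True.
State = Dict[str, Any]

# A trajectory (I, tau) over the interval [0, len(states) - 1]:
# tau(t) is simply states[t].
Trajectory = List[State]

trajectory: Trajectory = [
    {"pos(robby)": (0, 0), "active(pick,robby)": False},
    {"pos(robby)": (1, 0), "active(pick,robby)": False},
    {"pos(robby)": (1, 0), "active(pick,robby)": True},
]

def value_at(traj: Trajectory, rho: str, t: int) -> Any:
    """Return rho(tau(t)), the value of state variable rho at time t."""
    return traj[t][rho]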
C. Grounding

Definition 5.
Given an institution I and a domain D, a grounding of I into D is a tuple $G = \langle \mathcal{G}_A, \mathcal{G}_B, \mathcal{G}_O \rangle$, where:
• $\mathcal{G}_A \subseteq Roles \times A$ is a role grounding,
• $\mathcal{G}_B \subseteq Acts \times B$ is an action grounding,
• $\mathcal{G}_O \subseteq Arts \times O$ is an artifact grounding.

Grounding plays an important role in our framework, by establishing the relation between an abstract institution and a specific domain. Grounding provides the key to reuse the same abstract institution to describe or regulate different systems. For example, the 'store' institution can be grounded in analogous environments, e.g., different physical marketplaces, using different G's. Also, different institutions can be grounded to the same domain.

Grounding directly depends on object affordances. For example, if a grounding maps an agent a to a given role, and this role involves the execution of a certain action, then a must be able to perform a behavior that grounds that action. For more details please refer to [24]. In this paper we assume that all groundings are admissible.
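Continuing the illustrative encoding from above (again with names of our own choosing), a grounding can be stored as three relations mapping institutional roles, actions, and artifacts to concrete agents, behaviors, and objects:

from dataclasses import dataclass
from typing import FrozenSet, Tuple

@dataclass(frozen=True)
class Grounding:
    """A grounding G = <G_A, G_B, G_O> (Definition 5), stored as sets of pairs."""
    roles_to_agents: FrozenSet[Tuple[str, str]]    # G_A, subset of Roles x A
    acts_to_behaviors: FrozenSet[Tuple[str, str]]  # G_B, subset of Acts x B
    arts_to_objects: FrozenSet[Tuple[str, str]]    # G_O, subset of Arts x O

    def agents_of(self, role: str) -> FrozenSet[str]:
        """All agents the given role is grounded to (A_role in the paper)."""
        return frozenset(a for (r, a) in self.roles_to_agents if r == role)

    def behaviors_of(self, act: str) -> FrozenSet[str]:
        """All behaviors the given action is grounded to (B_act in the paper)."""
        return frozenset(b for (x, b) in self.acts_to_behaviors if x == act)

# The 'Store' institution grounded to a domain with the robot 'robby':
g_store = Grounding(
    roles_to_agents=frozenset({("Buyer", "robby")}),
    acts_to_behaviors=frozenset({("GetGoods", "pick"), ("Pay", "transfer")}),
    arts_to_objects=frozenset({("Goods", "battery"), ("PayPlace", "cash_register")}),
)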
D. Norms Semantics and States

By grounding an institution I into a domain D, we impose that the norms stated for abstract entities in I must hold for the corresponding concrete entities in D. But what does "hold" mean here? To answer this question, we give norms semantics in terms of executions in a physical domain: intuitively, the semantics of a norm is the set of all trajectories that comply with the condition that the norm is meant to express. For example, the semantics of the norm must((buyer, getGoods, goods)) is given by all trajectories where the behavior b, to which the action getGoods is grounded, is executed (at least once) by any agent a, to which the role buyer is grounded.

Formally, let $T$ be the set of all possible trajectories $(I, \tau)$ over the state variables of domain D. Given a norm $q(trp^*)$, we define its semantics $\llbracket q(trp^*) \rrbracket$ in D as:

$\llbracket q(trp^*) \rrbracket \subseteq T$

We differentiate between the following types of norm semantics: fulfillment semantics, $\llbracket q(trp^*) \rrbracket_F$, that defines the set of trajectories fulfilling the norm; and violation semantics, $\llbracket q(trp^*) \rrbracket_V$, expressing the set of trajectories violating the norm. Naturally, a trajectory cannot be an element of both the fulfillment and violation semantics for a given norm, that is: $\llbracket q(trp^*) \rrbracket_F \cap \llbracket q(trp^*) \rrbracket_V = \emptyset$. Accordingly, we define a function $ns(q(trp^*), (I, \tau)) \in \{f, v, n\}$ that indicates the state of a norm given a trajectory, where f stands for fulfilled, v for violated and n for a neutral state, neither fulfilled nor violated:

$ns := f$ iff $(I, \tau) \in \llbracket q(trp^*) \rrbracket_F$; $ns := v$ iff $(I, \tau) \in \llbracket q(trp^*) \rrbracket_V$; $ns := n$ otherwise.

a) Examples of Semantics: A variety of norm semantics can be expressed as constraints on trajectories, i.e., as constraints on the possible values of state variables over time. For example, the fulfillment semantics of the must norm can be expressed by requiring that the state variable active of the relevant behavior is true at least once. Formally:

$\llbracket must((role, act, art)) \rrbracket_F \equiv \{(I, \tau) \mid \forall a \in A_{role}.\ \exists (b, t) \in B_{act} \times I : \mathrm{active}(b, a)(\tau(t)) = \top\}$,

where $A_{role}$ is the set of all agents role is grounded to, and $B_{act}$ is the set of all behaviors act is grounded to. Intuitively, these semantics select all trajectories where any agent taking role activates at least once a behavior that performs act. Variants of this semantics are possible, for example, stating that the behavior should be enacted at all times (not just once in the trajectory). In more demanding scenarios, a state variable indicating 'successful execution' can be used instead of the simpler 'activation'. Notice that art in the norm is ignored, since it is not of interest for the semantic definition of this norm. However, it may be relevant in other norms, as in the following example:

$\llbracket use((role, act, art)) \rrbracket_F \equiv \{(I, \tau) \mid \forall (b, a, t) \in B_{act} \times A_{role} \times I.\ \exists o \in O_{art} : (\mathrm{active}(b, a)(\tau(t)) = \top \implies \mathrm{usedObj}(b, a) = o)\}$,

where $O_{art}$ is the set of all objects. The semantics states that all agents in the role grounding with behaviors in the action grounding that are active should use the same object grounded to artifacts. Sometimes it is useful to join the semantics of the norm must and the norm use:

$\llbracket mustUse((role, act, art)) \rrbracket_F \equiv \{(I, \tau) \mid \forall a \in A_{role}.\ \exists (b, o, t) \in B_{act} \times O_{art} \times I : \mathrm{active}(b, a)(\tau(t)) = \top \land (\mathrm{active}(b, a)(\tau(t)) = \top \implies \mathrm{usedObj}(b, a)(\tau(t)) = o)\}$

The semantics selects all trajectories in which grounded agents execute the corresponding grounded behaviors, and such execution implies that grounded objects are used. Similarly, spatial and temporal semantics can be defined. The next example shows the obligatory at semantics:

$\llbracket mustAt((role, act, art)) \rrbracket_F \equiv \{(I, \tau) \mid \forall a \in A_{role}.\ \exists (b, o, t) \in B_{act} \times O_{art} \times I : \mathrm{active}(b, a)(\tau(t)) = \top \land (\mathrm{active}(b, a)(\tau(t)) = \top \implies \mathrm{position}(b, a)(\tau(t)) = \mathrm{position}(o)(\tau(t)))\}$

An example of temporal semantics is the following:

$\llbracket before((role_1, act_1, art_1), (role_2, act_2, art_2)) \rrbracket_F \equiv \{(I, \tau) \mid \forall (a_1, a_2) \in A_{role_1} \times A_{role_2}, (b_1, b_2) \in B_{act_1} \times B_{act_2}, \forall (t_1, t_2) \in I \times I : (\mathrm{active}(b_1, a_1)(\tau(t_1)) = \top \land \mathrm{active}(b_2, a_2)(\tau(t_2)) = \top \implies t_1 < t_2) \land (\forall (a_1, a_2) \in A_{role_1} \times A_{role_2}, (b_1, b_2) \in B_{act_1} \times B_{act_2}, \exists (t_1, t_2) \in I \times I : \mathrm{active}(b_1, a_1)(\tau(t_1)) = \top \land \mathrm{active}(b_2, a_2)(\tau(t_2)) = \top)\}$

Simply said, the 'before' fulfillment semantics declares that, for all grounded agents, all grounded behaviors have to be in a certain order: the first should precede the second, and both have to be active at a certain time point in a trajectory. While for spatial semantics the neutral state n and the violated state v are practically the same in most cases, for temporal semantics it is useful to explicitly define violation semantics, since such norms can often be in the n state, thus neither violated nor fulfilled. One simple way to define violation semantics for the before norm, $\llbracket before((\cdots)) \rrbracket_V$, would be to switch the inequality between the time points $t_1$ and $t_2$ to $t_1 \geq t_2$, and to ensure the activation of the 'second' behavior:

$\llbracket before((role_1, act_1, art_1), (role_2, act_2, art_2)) \rrbracket_V \equiv \{(I, \tau) \mid \forall (a_1, a_2, b_1, b_2) \in A_{role_1} \times A_{role_2} \times B_{act_1} \times B_{act_2}, \forall (t_1, t_2) \in I \times I : (\mathrm{active}(b_1, a_1)(\tau(t_1)) = \top \land \mathrm{active}(b_2, a_2)(\tau(t_2)) = \top \implies t_1 \geq t_2) \land \exists t' \in I : \mathrm{active}(b_2, a_2)(\tau(t')) = \top \land t_2 = t'\}$

Note that the semantics does not ensure the existence of $\mathrm{active}(b_1, a_1)(\tau(t'))$, since if $b_2$ is active at $t_2$ but $b_1$ is not active at any $t_1 < t_2$, the norm is already violated. When a trajectory is neither fulfilled nor violated, the trajectory is in a neutral state (n). As we will see, this helps us shape the reward function and provide informative feedback immediately when the norm state becomes either fulfilled or violated.

Other types of norms and their semantics can be defined depending on the application. The expressiveness of norms depends on two factors, namely on the complexity of the relations in norm semantics and on the availability of appropriate state variables R. It is important to stress that the semantics is given in terms of execution and is abstracted, since it is defined through grounding to abstract institutional concepts.

An important property of our semantics, which is exploited in this paper, is that the semantics is defined over state-space variables belonging to sub-sets of equivalence classes in the state-space.
For example, the semantics of the norm 'must' is defined over elements in the equivalence class of the state variables indicating the activation of an agent's behavior:

$[\mathrm{active}] = \{\mathrm{active}(a, b) \in R \mid a \in A, b \in B\}$

Then, the particular state-variable used in a norm's semantics is decided by grounding.
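To make the semantics concrete, the following sketch checks the fulfillment semantics of must and the resulting norm state ns on a recorded trajectory, assuming the simple per-step dictionary encoding of states introduced earlier; the actual verification machinery of [24] is interval-based and more general.

from typing import Any, Dict, Iterable, List

State = Dict[str, Any]       # maps state-variable names to values
Trajectory = List[State]     # tau(t) == trajectory[t]

def active(state: State, behavior: str, agent: str) -> bool:
    """Value of the state variable active(b, a) in the given state."""
    return bool(state.get(f"active({behavior},{agent})", False))

def must_fulfilled(traj: Trajectory,
                   agents: Iterable[str],
                   behaviors: Iterable[str]) -> bool:
    """Fulfillment semantics of must((role, act, art)): every grounded agent
    activates some grounded behavior at least once in the trajectory."""
    return all(
        any(active(s, b, a) for s in traj for b in behaviors)
        for a in agents
    )

def norm_state_must(traj: Trajectory, agents, behaviors) -> str:
    """The norm-state function ns for 'must': there is no violation condition
    here, so the result is either fulfilled ('f') or neutral ('n')."""
    return "f" if must_fulfilled(traj, agents, behaviors) else "n"

# Example: Robby (grounded to Buyer) activates pick at time step 2.
traj = [
    {"active(pick,robby)": False},
    {"active(pick,robby)": False},
    {"active(pick,robby)": True},
]
print(norm_state_must(traj, agents=["robby"], behaviors=["pick"]))  # -> 'f'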
E. Adherence
We are now in a position to define what it means for a given physical system to behave in compliance with an institution. Consider an institution I, and suppose that I has been grounded on a given domain D through some grounding G. Further suppose that all the norms in I are given fulfillment semantics in D through the function $\llbracket \cdot \rrbracket_F$.

The following definition tells us whether or not a specific, concrete execution in D adheres to the abstract institution I given the above grounding and semantics.

Definition 6.
A trajectory $(I, \tau)$ adheres to an institution I, with admissible grounding G and semantics function $\llbracket \cdot \rrbracket_F$, if $(I, \tau) \in \llbracket norm \rrbracket_F$ for all $norm \in Norms$.

With the adherence property, we can distinguish between trajectories (executions) which are normative (socially acceptable) and others which are not. To do that, we rely on the formal verification mechanisms described in our previous work [24]. With verification, we can evaluate a trajectory and say whether it is adherent (to an institution) or not. Moreover, we can also evaluate the status of each norm separately during agent execution. Such states are then used to automatically create a reward function.
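Under Definition 6, adherence is simply the conjunction of fulfillment over all norms. A sketch of such a check, assuming hypothetical per-norm evaluators like the one above, could look as follows:

from typing import Callable, Dict, List

# A per-norm evaluator returns 'f', 'v' or 'n' for a given trajectory.
NormEvaluator = Callable[[List[dict]], str]

def adheres(trajectory: List[dict], evaluators: Dict[str, NormEvaluator]) -> bool:
    """A trajectory adheres to the institution iff every norm is fulfilled."""
    return all(evaluate(trajectory) == "f" for evaluate in evaluators.values())

def norm_states(trajectory: List[dict], evaluators: Dict[str, NormEvaluator]) -> Dict[str, str]:
    """Per-norm states, used later to build the normative reward signal."""
    return {name: evaluate(trajectory) for name, evaluate in evaluators.items()}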
F. From Declarative to Procedural Knowledge
The ontology in our framework is represented with several layers. The institutional layer specifies abstract norms, through relations between social-level categories (roles, actions, artifacts), providing information about 'what' ought to be normative behavior. By associating an institution (the abstract layer) with the domain (concrete) layer, we give social meaning to agents' behaviors and interactions through declarative norm semantics in terms of execution. This association is realized in the middle layer with the notion of grounding. It binds institution-level norms with the concrete domain elements (agents, behaviors, objects).

The next section considers answering the question 'how': in which way and when exactly should agents act so that their behavior is socially acceptable in the context of a given institution and grounding. While in our previous research we have achieved synthesizing adherent trajectories with model-based planners [25], [24], [26], in this paper we focus our attention on answering this question with reinforcement learning algorithms.
IV. APPLYING INSTITUTION MODELS TO RL AGENTS
Markov Decision Processes (MDPs) are an example of how reinforcement learning can be used to direct the actions of agents. An MDP is defined as a tuple $\langle S, A, P, R, \gamma \rangle$, where $S$ is the state space, $A$ is the finite set of actions, $P$ is the model given as transition probabilities between states depending on actions, and $R$ is a set of rewards, a scalar feedback signal which agents get as they change their states. $\gamma \in (0, 1]$ is a discount factor indicating the importance of future rewards. The goal of an RL agent (the RL problem) is to find a mapping between states and actions which maximizes the amount of reward the agent receives, known as the cumulative reward or return. Such a mapping represents a policy $\pi(a_t \mid s_t)$, which gives the probability of taking action $a_t \in A$ given the state $s_t \in S$. In deterministic policies, this probability is equal to 1. RL algorithms use the notion of value functions: a value function $V^\pi(s)$ provides the expected return of rewards starting from state $s$ and acting according to policy $\pi$; and/or an action-value function (Q-function), $Q^\pi(s, a)$, which estimates the value of a state given the action. Knowing the optimal Q-function ($Q^*$) gives us an optimal policy for an agent for any state $s$, that is, $\arg\max_{a \in A} Q^*(s, a)$. RL algorithms update the value function either from sampled data, if the model $P$ is not known (model-free RL), or by leveraging the provided model, where it is possible to use, e.g., dynamic programming (model-based RL). Value updates propagate the truth about the rewards until they converge to optimal values. For example, a standard idea in RL to learn Q-values is known as temporal difference learning:

$Q(s, a) \leftarrow Q(s, a) + \alpha (r + \gamma \max_{a'} Q(s', a') - Q(s, a))$,

where $\alpha$ is a learning rate, and $s'$ and $a'$ are the state and action subsequent to $s$ and $a$. For a variety of real-world problems the number of states may be too large to be stored (in a tabular representation), and calculating values for each state may not be feasible. Value function approximation methods address this problem. The idea is to generalize over (semantically) similar states across a distributed representation of a state-space with fewer parameters. In such an approach, state-space variables are represented as a feature vector, where each feature is a number describing some property in the state-space, e.g., the position of an agent along the x coordinate, the activation of an agent's behavior, etc. The policy ($\pi_\theta$) is a computable function (like a neural network) that depends on a set of parameters $\theta$. Converging to the optimal policy is not guaranteed; however, usually, policies converge to close-to-optimal solutions.
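For reference, the tabular temporal-difference update quoted above can be written in a few lines; this is a generic textbook sketch and not the learner used in our experiments, which rely on PPO with function approximation.

import random
from collections import defaultdict

class TabularQLearner:
    """Plain tabular Q-learning with the update
    Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""

    def __init__(self, actions, alpha=0.1, gamma=0.99, epsilon=0.1):
        self.q = defaultdict(float)          # (state, action) -> value
        self.actions = list(actions)
        self.alpha, self.gamma, self.epsilon = alpha, gamma, epsilon

    def act(self, state):
        """Epsilon-greedy action selection."""
        if random.random() < self.epsilon:
            return random.choice(self.actions)
        return max(self.actions, key=lambda a: self.q[(state, a)])

    def update(self, state, action, reward, next_state):
        """One temporal-difference update step."""
        best_next = max(self.q[(next_state, a)] for a in self.actions)
        td_target = reward + self.gamma * best_next
        self.q[(state, action)] += self.alpha * (td_target - self.q[(state, action)])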
A. From Norms to Rewards

The normative semantics in our model is defined using the notion of trajectory (see Definition 4). A trajectory is constructed from the values of the state-space variables, which change as the agents interact with the environment. This allows us to use norm semantics to evaluate the states of the norms by calculating whether they are in a fulfilled, neutral, or violated state. This, in turn, can be used to provide feedback signals regarding the agents' adherence to norms. The schematic view of this approach is shown in Figure 1. The 'institutional model' box takes only those states of the environment which are used in the norm semantics and evaluates norm states. Naturally, to calculate norm states which have a temporal dimension, a history of states has to be stored. However, state variables are usually shared between various norm semantics, thus an increase in the number of norms usually doesn't mean a proportional increase in the number of stored states. Additionally, as done in our previous research [24], instead of storing values of state variables at each time-step, the contiguous interval for which state variables have constant values can be stored, which can significantly speed up evaluation of norm states. This is a common approach used in temporal reasoning, planning and scheduling (e.g., [27], [28], [29]). Note that the 'institutional model' is separated from the concrete RL algorithm and is used only to provide a feedback signal to RL agents. The state space in the RL agents complies with the Markov property (agent action selection depends only on the current state and not on the history of states).

Fig. 1: RL agents receive feedback signals from the institution. The signals steer agents' learning towards normative behavior.

After the states of the norms are evaluated to either fulfilled, violated or neutral, the norm reward function takes the states of norms as an input and returns a feedback signal:

$F_{norm} : (NS \times NS) \to \mathbb{R}$

There are nine possible transitions of norm states which can be used to signal positive or negative feedback, namely: $(f,f)$, $(f,n)$, $(f,v)$, $(n,f)$, $(n,n)$, $(n,v)$, $(v,f)$, $(v,n)$, $(v,v)$. The $F_{norm}$ function can be realized in different ways and, while its optimal implementation may depend on the underlying RL algorithm, we explore the following more general ideas:

a) Feedback from Full Adherence: The adherence reward is the main reward or the final goal for learning normative behavior. Agents receive this reward only when their trajectory is adherent to the institution, i.e., to all institutional norms (see Definition 6). This happens when all transitions are updated to a fulfilled state ($ns' = f$). Algorithm 1 realizes $F_{norm}$, where the procedure F_norms is called at each training step or at predefined intervals. The algorithm retrieves, for each norm, its latest state with the method
GetNS. If all norms are in the fulfilled state, the adherence reward is returned. However, this means that an agent has to fulfill all the norms before it receives the 'full adherence' reward, and this scheme does not provide any feedback on norm states separately. Thus this reward is sparse and does not provide helpful intermediate feedback for the credit assignment problem.
Algorithm 1: Implementation of $F_{norm}$ for the 'Full Adherence Feedback Reward' approach.
Input: the current trajectory $([0, step_t], \tau)$; institution $\mathcal{I}$; grounding G; norm semantics $\llbracket \cdot \rrbracket_F$
Output: reward

Procedure F_norms$(([0, step_t], \tau), \mathcal{I}.Norms, G, \llbracket \cdot \rrbracket_F)$:
    foreach $norm \in \mathcal{I}.Norms$ do
        $(ns, ns') \leftarrow$ GetNS$(([0, step_t], \tau), G, \llbracket norm \rrbracket)$
        if $ns' \neq f$ then
            return 0
    return the adherence reward $r_{adh}$

b) Feedback from Norm States: This approach provides feedback to the agents as they change the states of norms, and this is used as a mechanism for reward shaping which indicates partial achievement of the final goal. The norm reward function assigns fixed numerical values to norm-state transitions. For example, transitions from a not-fulfilled to a fulfilled norm state $(n, f)$ may be rewarded, while transitions to violated states (e.g., $(n, v)$) may be penalized. This is useful for temporal norms, since violations of temporal requirements can provide an immediate negative signal. However, this may not be the best approach, since it can lead to local optima: e.g., if, among other norms, two temporal norms are present, an agent may break one temporal norm but get the reward from the second one and all other subsequent norms. This is not particularly helpful for the temporal assignment problem and agents can get stuck in local optima.

c) Dynamic Feedback: Similarly to the previous case, feedback is returned from norm-state transitions. However, instead of constant values for norm transitions, they are dependent on each other or controlled by an additional (customizable) function, e.g., "stop all future rewards (assign a value of 0) if a violation has happened". Here, one doesn't want to continue to reward agents' trajectories after a violation. This enables agents to learn only from temporally consistent (ordered) normative behaviors.

While feedback from norms can be used for reward shaping, and consequently speed up learning, we still cannot apply the same abstract norms to novel domains with different agents, behaviors, and objects. For example, let's say that n is an obligation norm mustUse(Buyer, Get, Goods) and the grounding is $\mathcal{G}_A = \{(Buyer, robot)\}$, $\mathcal{G}_B = \{(Get, pick)\}$ and $\mathcal{G}_O = \{(Goods, battery)\}$. The policy will learn normative behavior, declared in the semantics of that norm, i.e., that agent robot should execute the particular behavior pick and take the particular object battery. The policy learns on the exact elements represented in the domain state-space. If we change the grounding (using the same norm), e.g., $\mathcal{G}_O = \{(Goods, box)\}$, the agent will not know that it has to pick a box now. This means that applying the same norm with a different grounding will require retraining. This limitation is addressed below.
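The feedback schemes (a)-(c) can be summarized in a single small reward function over norm-state transitions. In the sketch below the numeric values are placeholders chosen for illustration only, and the stop-after-violation flag corresponds to the dynamic-feedback variant:

from typing import Dict, Tuple

# Norm states: 'f' fulfilled, 'n' neutral, 'v' violated.
Transition = Tuple[str, str]

# Placeholder shaping values for norm-state transitions (our own choice).
TRANSITION_REWARD: Dict[Transition, float] = {
    ("n", "f"): 0.25,    # progress: a norm became fulfilled
    ("n", "v"): -0.25,   # a norm became violated
    ("f", "v"): -0.25,
}

class NormRewardFunction:
    """F_norm: maps per-norm state transitions to a scalar feedback signal."""

    def __init__(self, adherence_reward: float = 1.0, stop_after_violation: bool = True):
        self.adherence_reward = adherence_reward
        self.stop_after_violation = stop_after_violation
        self.violated = False

    def reward(self, transitions: Dict[str, Transition]) -> float:
        """transitions maps each norm name to its (previous, current) state."""
        if self.stop_after_violation and self.violated:
            return 0.0  # dynamic feedback: no further rewards after a violation
        r = sum(TRANSITION_REWARD.get(t, 0.0) for t in transitions.values())
        if any(current == "v" for (_, current) in transitions.values()):
            self.violated = True
        if all(current == "f" for (_, current) in transitions.values()):
            r += self.adherence_reward  # full adherence: all norms fulfilled
        return r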
B. Abstraction Through Grounding

An interesting distinction that is often made in Multi-Agent Systems is whether or not agents know about the organization in which they take part [30]. The approach described so far is an example of agents not knowing about the institutional structure, thus agents are not able to consider normative changes when generating trajectories. Agents can be made aware of the abstract institutional context by explicitly representing norms and grounding as an additional input vector to learning. In such a scenario, agents would have enough information to associate various grounded elements to their normative requirements. The main disadvantage is that this would additionally increase the size of the agents' inputs, hence making learning more difficult. Another, more practical approach would be to abstract domain elements to the institutional level.

An abstraction of a domain element implies that its meaning is defined in higher, more general categories. Roles, institution actions and artifacts represent such categories, meaningful in social interaction. Our institutional model provides a method where an agent's observation and action spaces can both be abstracted. Equivalence classes such as [active], [position], [color], [detected], etc., each represent a type of state-variable characterized by sharing the same attribute or feature: 'activation', 'position', 'color', 'detection', etc. This is important since it allows us to create an abstracted, institutional-level state-space based on such equivalence classes. Changing the grounding of roles, actions or artifacts may change the particular set of agents, behaviors, and objects, but the equivalence classes will stay the same and will now describe the state of the newly grounded domain elements. Therefore, we can create a stable feature vector based on equivalence classes. Figure 2 illustrates this idea. (A) is the full state-space containing all state-variables describing all agents, behaviors, and objects. (B) is the institution-level state-space, where states are partitioned by the equivalence classes characterized by the same state-space features. Grounding generates boundary conditions by selecting only state-variables describing the state of grounded domain elements. Thus, the state-space (B) represents states of institution categories (Roles, Acts, Arts). Similarly, outputs of an abstract policy represent institutional actions and are related to concrete behaviors via grounding.

Fig. 2: State-space selection for state variables describing institution categories. (A) is the full state space which is usually encoded as an input in the standard policy. (B) represents the state-space of abstract institution categories (Roles, Acts, Arts). Grounding (full red line) generates boundary conditions selecting only state-variables describing the state of grounded domain elements.

Fig. 3: (a) Observations: The full state space (A) is filtered through (B) to a sub-set of states (C); (b) Actions: The full set of behaviors (A) is reduced (B) to grounded actions (C).

A policy learned in this way would, therefore, be at a higher level of abstraction. We henceforth call such policies "abstract policies". Procedural knowledge (knowing how), stored in abstract policies, is learned based on declarative norm semantics defined as relations between (institution-level) categories (roles, actions, artifacts). Grounding selects domain elements that fit into these categories, thus the pattern of activation generated by the abstract policy stays the same but is produced with different (domain-level) agents, behaviors, and objects. This means that, in principle, an abstract policy may be applied in every domain where the same (or similar) categories can be identified, since they will share the same or similar features over which the original policy was learned. Thus, the advantage of abstraction is twofold. First, it can automatically and significantly reduce the size of the observed state-space, addressing the curse of dimensionality problem. Second, an abstract policy can be applied (via grounding) to different agents, behaviors, and objects, and more broadly to different domains, thus achieving a transfer of learning. Figure 3 shows a schematic view of an abstract policy's inputs and outputs. More about the nature of abstract representation of procedural knowledge and its possible implications for Cognitive Science is discussed in Section VI-A.

Among other things, the ability to re-ground norms to other domains provides a workflow where one can train abstract institutional policies in simulation (adapted to robotics) and then re-ground them to real robotic systems. The policy trained in simulation can then be updated by learning subtle details starting from an already trained (hence, 'safe' from the normative point of view) policy.
V. EXPERIMENTS
A. Scenario
The scenario follows the normative behavior specified by the 'Store' institution. Robby is a robot whose goal is to obtain a pack of batteries. Instead of taking the batteries and immediately exiting the store, Robby is aware of the human institution providing social knowledge of how to behave in buying/selling scenarios. Namely, Robby is aware of the norms: an agent that wants to buy something should pick the desired object and not other objects in the store, and it should pay for the object before leaving the store with it. The institution is specified as follows (we increase its complexity as we progress through the experiments):
Store $= \langle Arts_S, Roles_S, Acts_S, Norms_S \rangle$, where

$Roles_S$ = {Buyer}
$Acts_S$ = {Pay, GetGoods}
$Arts_S$ = {Goods, PayPlace}
$Norms_S$ = { MustUse((Buyer, GetGoods, Goods)),
              MustAt((Buyer, Pay, PayPlace)),
              Before((Buyer, GetGoods, Goods), (Buyer, Pay, PayPlace)) }

The example domain includes Robby and the set of its behaviors. Robby knows how to take items, pay via wireless money transfer, and open doors. Robby is also capable of moving one step forward and backward, and rotating one step left or right. The domain includes 13 items on sale in the store: battery, drill, ax, screwdrivers, etc., and a cash register. The experiments focus only on learning normative policies, and not on particular behaviors (such as grasping an item); thus all robot behaviors are atomic, that is, they are done in one step. In the real case, these could be either other (sub-)policies or implemented behaviors. All norm semantics are given previously (see Section III-D).

Fig. 4: Simulated Store Environment
B. Experiments Setup

a) Software:
Model-free RL requires a substantial amount of training data, which we generate via simulation. In this paper we used a game engine [31] to simulate a simplified store domain (see Figure 4). The state-of-the-art algorithm Proximal Policy Optimization (PPO) [32] is used to train simulated agents. This is done with the help of the Machine Learning toolkit [33] (v0.6), which includes a TensorFlow implementation of PPO and a bridge between simulation data and the RL algorithm. For all reported experiments we used a maximum of 2000 steps per episode. The environment and agents restart if full adherence is achieved or if the simulation step reaches the maximum count. In addition to the rewards discussed earlier, the agent receives a small negative feedback at each simulation step so as to make agents finish the episode as soon as possible. We used the same hyper-parameters in all experiments and some of them are shown in Table I, where the newly introduced beta is the entropy regularization parameter (increasing it will influence the policy to take more random actions). The time horizon is the length of the generated trajectory used for a gradient update, and there are 3 layers of 256 training units. Parameters were chosen as a result of pilot tests (not reported here).

beta | gamma | lambd | learning rate | num layers | hidden units | time horizon | batch size
6e-3 | 0.99  | 0.95  | 3e-4          | 3          | 256          | 1024         | 1024

TABLE I: Hyper-parameters used for training in the experiments

Robots observe the state of the environment through a sensor consisting of 7 rays positioned at the center of the agent, pointing in different directions. Each ray returns a vector which encodes information about detected objects and their normalized distance (see Figure 5a). Additional inputs are: the velocity vector of the agent, information about whether behaviors have already been executed, and the state variables 'near', indicating whether an object is near the agent, and 'has', indicating whether the agent holds an object or not.
Fig. 5: Vector input from one ray in the robot sensor in the institution-focus approach.

The input vector size is 162 elements. However, the abstract policy approach encodes only states regarding grounded objects, while states regarding other domain elements are encoded in only one vector element, merely as an indication of their existence, reducing the input vector to 39 elements (see Figure 5b).

b) Hardware:
The simulations are executed on a desktop computer with the following configuration: Intel Core i7-4790K CPU @ 4GHz (x64), 16GB DDR3 RAM, Nvidia GTX 970 (4GB GDDR5).
C. Experiment 1: Proof of Concept
The goal of this experiment is to learn the normative behavior specified by the Store institution. The experiment is executed in a simplified Store environment with 3 items, using only the adherence reward based on Algorithm 1. Groundings are given as follows: $\mathcal{G}_A = \{(Buyer, Robby)\}$, $\mathcal{G}_B = \{(GetGoods, pick), (Pay, transfer)\}$ and $\mathcal{G}_O = \{(Goods, battery)\}$.

Results are shown in Figure 6. The x axis shows the number of steps, while the y axis shows the amount of accumulated reward. The 'Mean' line represents the learning curve averaged over 10 independent training trials, where for each training trial data is collected over 16 parallel simulations of the environment. Figure 6 also shows the best learning curve (out of 10 trials), which reached a near-optimal policy in the minimum number of steps.

Fig. 6: Experiment 1: Robby learning using only the adherence reward in the simplified environment

D. Experiment 2: Learning Abstract Policies
In this experiment, we turn our attention to abstract policy learning. We increase the complexity of the Store domain by adding different store items. For training, we use only the adherence reward, as described in Section IV-A, in two different settings: (A) standard learning and (B) abstract learning. The hypothesis is that the learning will be significantly faster in setting (B) since the agent state-space is significantly reduced. Results confirm our hypothesis. Figure 7a shows two learning curves for settings (A) and (B), each of them representing the mean of 10 different execution trials. In setting (A) the RL agent didn't manage to learn a normative policy in almost all the trials, whereas in setting (B) almost all training trials managed to converge to near-optimum behavior. The best and only successful trial in standard learning (A) is compared to the best trial in abstract learning (B) (see Figure 7b). Note that the abstract policy is applicable to a whole category of inputs, a subset of equivalence classes of state-variables, while the standard policy is applicable only to one particular (fixed) grounding. In that sense, these policies are not the same, and the only reason we can compare them is that we have used the same grounding for both of them.

The following experiment is concerned with demonstrating and understanding the quality of an abstract policy when used with a grounding other than that used for learning.

Fig. 7: Experiment 2: Comparison between standard learning and abstract learning: (a) means over 10 trials; (b) best results

E. Experiment 3: Transfer of Learning
In Section IV-B, we hypothesized that a policy learned at the level of abstraction of the institution can be used across boundary conditions and domains. Thus, the goal of this experiment is to apply the same set of norms in a novel domain. In scenario (A) we test a new grounding in the same domain: $\mathcal{G}_O = \{(Goods, drill)\}$. We expect that the agent will know to buy the drill instead of the battery, since now the drill is abstracted to 'Goods'. It is important to stress that while humans may assign different names to roles, actions or artifacts over different domains, the pattern of agents' behaviors may still have the same semantics over such domains. Scenario (B) demonstrates this concept, where we use a factory-yard domain (see Figure 8). A robot named Forky is required to sort out different items by locating them on a conveyor belt, lifting them and bringing them to a container's hatch for disposal. We ground the Store institution to the new domain as follows: $\mathcal{G}_A = \{(Buyer, Forky)\}$, $\mathcal{G}_B = \{(GetGoods, lift), (Pay, leave)\}$ and $\mathcal{G}_O = \{(Goods, box)\}$. Forky is twice as slow as Robby, the size of the environment is increased, and the spatial layout is somewhat different. Additionally, the items that Forky is required to lift are moving, and sometimes they are not in the environment at all (they appear on the conveyor belt), which can additionally confuse the agent. In the trials, we distinguish between: (B1) directly applying the abstract policy learned in the Store scenario; (B2) where we continue training of the transferred policy in the new domain for an additional 200k steps; and (B3) where we train the agent from scratch.

Fig. 8: Factory-yard environment

Fig. 9: Experiment 3: Results averaged over 10 different trials of learning from the abstract Store policy (B2) compared to learning from scratch (B3).

Results show that in trial (A), Robby successfully locates the drill (instead of the battery), picks it up and goes to the cash register to pay. In scenario (B), with the original store policy (case B1), Forky manages to search for and navigate to the grounded factory item (box), lift it, and in most of the episodes manages to reach the hatch to dispose of the item. Figure 9 shows that continuing to train the policy (B2) did not improve agent behavior, while learning from scratch (B3) did not manage to achieve any significant results within 4M steps. Results are averaged over 10 different trials. Interestingly, the cumulative reward in B2 started to decline, which shows that learning in the factory-yard is very difficult (given its dynamics), whereas the desired behavior is easily achieved via a transfer of policies learned in a simpler environment. Simple environments allow agents to base their learning on exploiting salient features in the environment. In the factory-yard domain, there are disturbing, dynamic elements that have nothing to do with norms, and it is much harder to capture the salient features leading to norm adherence. Hence, the experiment demonstrates that it is much easier to capture such features in an abstract policy learned in a simpler environment and then apply it in a "normatively-equivalent" dynamic environment. The ability to transfer knowledge is also important since it provides a way of learning abstract policies that can capture the salient features responsible for norm adherence in simulation and then ground them to real-world domains with complex dynamics. Real robotic agents may continue to learn (subtle) differences between simulation and the real world in a safer and faster way than starting from scratch.
F. Experiment 4: Normative Feedback
Figure 7 shows that the policy was not able to converge (in almost all of the trials) in a complex state-space without intermediate feedback. However, the information about norm states can help the learning agent. The goal of this experiment is to test a reward based on feedback from individual norm states and to assess whether it can more effectively/quickly guide an agent towards a normative policy. The temporal norms are the hardest to learn since they directly introduce the (temporal) credit assignment problem. We make the problem even more challenging (compared to Experiment 2) by adding an additional temporal norm where now the buyer, after paying, is expected to exit the store through the door. The institution is extended with another action
Exit and artifact
ExitPlace and the following norms:
MustAt((Buyer, Exit, ExitPlace))
Before((Buyer, Pay, PayPlace), (Buyer, Exit, ExitPlace))
The hypothesis here is that Robby should learn the policy in a more complex environment even without institution abstraction; however, the abstraction will lead to even faster learning. We test two similar approaches designed to mitigate the temporal assignment problem:

(A) Stop rewards after a violation. Assign the same amount of reward to each norm (a fixed total divided by the number of norms) for the transition (n, f). However, stop rewards if any of the norms are violated. This avoids local optima. The idea here is to steer the agents towards temporally-consistent behaviors. Note that the adherence reward is still given if all norms are satisfied, making the maximum possible reward a little less than the sum of the shaping and adherence rewards (taking into account the small negative feedback from the environment at every step).

(B) Restart after a violation. The same shaping function as (A), with the difference that an episode is finished (restarted) immediately after the violation. Agents do receive a negative feedback equal to the sum of the environmental negative feedback from each step, which an agent would get anyway before the end of an episode. The idea is to avoid any future (ill-informative) updates of the policy, since the norm reward function will not give any feedback even when other norms are achieved separately (as in case A). In addition, this may shorten overall training time, since when a norm is violated the agent doesn't have to wait until the end of the episode.

Fig. 10: Experiment 4: Reward shaping. Two implementations (A) and (B) of a reward function applied for (a) standard and (b) abstract learning

Figure 10a shows means over 10 training trials for both shaping approaches (A) and (B) applied to standard (full state-space) learning and abstract learning. Both approaches perform similarly, with slightly better results achieved with approach (B). While Experiment 2 (see Figure 7) shows that standard learning without shaping is unsuccessful in 4M episodes, in this experiment with additional temporal requirements, results are significantly improved and the learning curve is constantly increasing. Still, 4M steps are not sufficient for all trials to reach their near-optimal policies. Figure 10b shows the best results in all tested approaches.
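For clarity, the two shaping variants can be sketched as a single step-reward function that also signals whether the episode should be terminated; the helper below is illustrative and does not reproduce the exact reward values used in the experiments.

from typing import Dict, Tuple

def shaped_step_reward(transitions: Dict[str, Tuple[str, str]],
                       per_norm_reward: float,
                       variant: str = "B") -> Tuple[float, bool]:
    """Shaping variants (A) and (B) for Experiment 4.

    transitions maps each norm name to its (previous, current) state, with
    states in {'f', 'n', 'v'}; per_norm_reward is the fixed value given for
    each (n, f) transition.  Returns (reward, end_episode)."""
    violated = any(cur == "v" for (_, cur) in transitions.values())
    if violated:
        # Variant A: stop all rewards; variant B: additionally restart the episode.
        return 0.0, (variant == "B")
    reward = per_norm_reward * sum(
        1 for (prev, cur) in transitions.values() if (prev, cur) == ("n", "f"))
    return reward, False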
G. Experiment 5: Multiple Agents
No specific changes have to be introduced in order to train multiple agents. While it is possible to learn one abstract policy to guide all the agents in the institution, in this experiment each agent learns its own abstract policy. The rewards are distributed in the following way. The acting of all agents together creates a single domain trajectory which can be either adherent to the institution or not, thus the final reward from full adherence is given to each agent that is grounded by the institutional roles. The norms of particular agents depend on their role in the grounding, thus each agent gets feedback from the norm shaping function for fulfilling the norms that are relevant to it. Some norms are defined over more than one role: in such cases, agents have to cooperate to receive a feedback reward.

In this experiment we use an additional robotic agent named 'Kobby', who is capable only of navigating, receiving, and accepting/declining payments from the nearby agents. The institution now includes another role
Seller, and action
ReceivePayment, with additional norms indicating that the seller has to receive the payment at the place of payment, and a temporal norm 'Equals' making sure that the agents will synchronize their behaviors, since its semantics specifies that paying and receiving payment have to happen at the same time:

MustAt((Seller, ReceivePayment, PayPlace))
Equals((Buyer, Pay, PayPlace), (Seller, ReceivePayment, PayPlace))
Similarly to the temporal semantics of 'before', the fulfillment semantics of 'equals' can be defined by changing the relation between $t_1$ and $t_2$ to $t_1 = t_2$. Note that the norm is defined between triples involving different roles. This means that agents grounded to the roles of 'Seller' and 'Buyer' have to cooperate to achieve adherence, i.e., the 'Paying' of one agent should happen at the same time as the 'ReceivingPayment' of the other agent. Figure 11 shows that both agents have learned their normative policies. The 'Buyer' has to fulfill more norms than the 'Seller', which results in the difference in cumulative reward. Both learning trials converge to near-optimal policies.

Fig. 11: Experiment 5: Two policies (buyer and seller) are trained in parallel: (a) mean over 10 trials; (b) best out of 10 trials

H. Summary of Results
The overview of the experiments and the main results is shown in Table II. Results reveal that learning normative behavior using only the adherence feedback at the end of the episode is possible (Experiment 1), but it is not convenient in more complex state-spaces (Experiment 2). Much better results are achieved when the state-space is simplified with respect to the grounding when learning abstract policies (Experiment 2). Abstraction also enables applying such policies to novel domains, even when the policy in a novel domain is very hard to learn, or cannot be learned, from scratch (Experiment 3). Furthermore, results show that normative reward feedback is highly useful to steer the learning algorithm towards successful policies, even in complex environments (Experiment 4). However, the best results are achieved when combining both approaches together: intermediate normative feedback and (institution) abstraction (Experiment 4). Finally, it is shown that, in the same manner, it is possible to learn policies for coordinating multiple agents (Experiment 5).

TABLE II: Summary of the Results

                            | Exp1    | Exp2    | Exp3                                   | Exp4                      | Exp5
State-Space Complexity      | Low     | High    | High                                   | High                      | High
Temporal Complexity         | Low     | Low     | Low                                    | High                      | High
Standard Learning           | Success | Failed  | -                                      | -                         | -
Standard Learning (Shaping) | -       | -       | -                                      | Success (needs more data) | -
Abstract Learning           | -       | Success | Failed (learning) / Success (transfer) | -                         | -
Abstract Learning (Shaping) | -       | -       | -                                      | Success (best)            | Success (multiple agents)

VI. DISCUSSION AND FUTURE WORK
Our current framework provides a way to model a set of norms relevant to a given social context. Agents can learn to behave and cooperate with respect to their assigned roles in a way that is predictable by humans. Still, focusing on only one institution at a time can be limiting. In real-world settings, humans often follow the norms of multiple institutions concurrently. For example, normative behavior in a store also includes adhering to other generally accepted behavioral norms, such as not bumping into others and not making excessive noise. Also, knowing that an institution, e.g., a gas station, is a specialization of a more general paying/trading institution, an agent can assume that it has to pay for the gas without this being explicitly specified in the gas-station institution. Norms from different institutions, at different hierarchical levels, should be combined to produce the full set of desired normative behaviors. How different institutions can be related to each other, and how to exploit these relations in verification, planning, and learning, should be investigated further. Naturally, norms also concern robot safety, and while we plan to test the transfer of normative (safe) policies to the real world via grounding, safety issues go deeper than that [34]. One important open question is whether humans can identify the 'critical mass' of norms that would ensure the expected social behavior in all circumstances.

This paper focuses on demonstrating how our normative framework can be used to create policies that generate normative behaviors. Still, the described method suggests possibilities beyond those we have described. For example, imagine an agent exploring an environment. Such an agent could be rewarded for any trajectory for which an admissible grounding can be found and which is adherent to the institutional norms. Since a much larger subset of trajectories can then be used for learning, such an algorithm should be more data-efficient and learn from fewer examples. Another unexplored possibility would be to realize curriculum learning [35] by gradually increasing the difficulty of the problem through changes in the grounding. Similarly, instead of randomizing items in our simulated environment for each training episode, we could have randomized the grounding. An interesting direction is to investigate the nature of the abstract knowledge representation in the broader context of AI and cognitive science. Some brief discussions are given below.
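As a purely speculative sketch of the exploration idea mentioned above (rewarding any trajectory that some admissible grounding makes adherent), consider the following; the function names, the toy adherence test, and the exhaustive enumeration of groundings are our own illustrative assumptions, not part of the implemented framework.

from itertools import permutations

def adherence_reward(trajectory, roles, agents, is_adherent):
    # Search candidate groundings (role -> agent assignments) and return a
    # reward of 1.0 if at least one grounding makes the trajectory adherent
    # to the institutional norms, together with that grounding.
    for assignment in permutations(agents, len(roles)):
        grounding = dict(zip(roles, assignment))
        if is_adherent(trajectory, grounding):
            return 1.0, grounding
    return 0.0, None

# Toy adherence test: the trajectory is adherent if the agent grounded to
# 'Buyer' visited the pay place at some point (purely illustrative).
def toy_is_adherent(trajectory, grounding):
    return ("PayPlace", grounding["Buyer"]) in trajectory

trajectory = {("PayPlace", "Robby"), ("Shelf", "Kobby")}
reward, grounding = adherence_reward(trajectory, ["Buyer", "Seller"],
                                     ["Robby", "Kobby"], toy_is_adherent)
print(reward, grounding)  # 1.0 {'Buyer': 'Robby', 'Seller': 'Kobby'}

In practice, the adherence test would be the institutional verification itself, and exhaustively enumerating groundings is only feasible for small sets of roles and agents; the sketch is meant only to show why many more exploratory trajectories could then contribute a learning signal.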
A. Connections to Cognitive Science
A policy is a sensory-motor (perception-action) map. Standard policies are defined over raw sensory inputs and action outputs, while abstract policies have abstract inputs and outputs, that is, inputs and outputs related to categories. Abstract policies store procedural knowledge for achieving declarative semantics specified at an abstract (categorical) level. In psychology and cognitive science, a pattern of behavior that organizes categories of information and the relationships between them is known as a 'schema' (e.g., [36]). The term 'schema' was introduced in psychology by Bartlett [37], who proposed that human knowledge is stored in underlying mental structures that represent generic knowledge about the world. Since then, many other terms have been used to describe schemata, such as 'frame' or 'script'; they are also referred to as 'mental models', 'representations', 'concepts', etc.

As described by Hampson and Morris [38], schemata (plural of 'schema') "store both declarative ('what') and procedural ('how') information, where declarative knowledge is knowing facts, knowing that something is the case, while procedural knowledge is knowing how to do something". This is precisely the ontology modeled in our framework (see Section III-F). Furthermore, Alba and Hasher [39] describe four important processes relevant to schemata: (1) choosing incoming stimuli, which can guide attention and focus only on relevant stimuli; (2) abstraction, which stores the meaning without the details (its original content); (3) interpretation of new information by association to previously stored knowledge; and (4) integration of those processes into a memory. It can be argued that we have made a step towards realizing these processes in our computational framework, thus creating a schematic knowledge representation. Choosing incoming stimuli is done via grounding. The knowledge stored in abstract policies does not contain information about a particular domain; rather, it stores information about the relations between (social) categories. As such, it can be used to interpret information about new domain elements, depending on previously stored relations, by fitting them into the corresponding (social) categories. By exploiting properties (2) and (3) we were able to achieve the transfer of learning (see Section V-E).

Early work in computer science focusing on creating 'frames' in machines was conducted by Minsky [40]. He argued that humans use stored knowledge about the world to accomplish many of the processes that his framework was attempting to emulate. Based on this work, new theories about mental representation emerged in cognitive psychology (e.g., [41]). The importance of schematic knowledge representation is explained in the book by Thagard [42], which states that people have "mental processes that operate by means of mental representations for the implementation of thinking and action". Indeed, schemata are the foundation of numerous theories regarding the modeling of human cognition, concept formation, language development and comprehension, culture, representational theories of mind, etc. Being able to realize explicit schematic (conceptual) knowledge through our computational model may have further influence on the development of such theories.
While this paper focuses on representing social normative knowledge, knowledge about other properties of the world is represented with other types of schemata, for instance, the self-schema (knowledge about oneself based on past experiences and grounded in present ones) and object schemata (knowledge about different categories of objects, their function, and their structure). How to design and combine different types of schemata, and how to use them for robots, is an interesting and promising area for future investigation.

The area of 'grounded cognition' [43] states that "conceptual representation underlying knowledge is grounded in sensory and motor systems" and that perception-action links are used "as a common base of simple behavior as well as complex cognitive and social skills". One of the main questions to answer in this field is how abstract concepts such as 'democracy' [44] can be formed from sensory-motor experience. A possible solution to this problem is proposed by Barsalou [45], where such abstract concepts can be created through mental simulations and conceptual combinations. Similarly, Svensson and Ziemke [46] argue, based on evidence from a range of disciplines, that higher-level concepts and cognitive processes are based on simulations of sensory-motor processes. Robots are often used for the demonstration of such theories [47]. A review of computational modeling approaches concerned with learning abstract concepts in embodied agents (robots) is given by Cangelosi and Stramandinoli [44]. Our model allows us to represent some of the abstract concepts that can be specified as an institution. For instance, it would be possible to specify an 'election' institution, including roles such as leader and voter, and actions that count-as voting (inserting a paper into a box). However, defining more complex concepts (e.g., 'democracy') would probably require relating multiple institutions.

At the current stage of our work, we do not combine abstract knowledge into more complex structures. This seems like an important direction for future study. The question of how conceptual knowledge may be combined via a combination of abstract policies is discussed below (see Section VI-C).

B. Connections to Cognitive Linguistics
Cognitive and linguistic theories are linked together in the multidisciplinary area of cognitive linguistics. Knowledge representation is often necessary for explaining the acquisition, comprehension, and use of language, and its relation to meaning. Jean Piaget's theories [49] suggest two processes that are central for child development: accommodation and assimilation. The former describes creating new schemata, or changing old ones, when a situation does not make sense; the latter is understanding the world through already existing schemata. Piaget argues that children first have to develop a mental representation of the world (schemata) and then base their language on such representations. Our framework is not focused on linguistic representation; language has a deeper and more complex structure than our current framework can express. Still, some interesting connections exist and deserve a brief discussion.

Norms are part of human language, and our work on how to make robots 'understand' norms in terms of execution may prove to be similar to how language is understood in general (or at least some parts of it). The fact that language is a social phenomenon, that its meaning depends on context, and that both role-action-artifact triples and subject-predicate-object triples abstract the interactions of agents, indicates that a computational model based on related principles of categorical abstraction may be created. For instance, a simple sentence like "Bring me a glass of water" can be grounded by the same abstract knowledge representation used in the 'Store' institution (see Section V), where a robot should now go and pick up (GetGoods) the glass of water (Goods) and bring it to a human (Register). Similarly, the same knowledge can be grounded to a sentence with the same meaning in another language. Furthermore, a simple sentence such as "John opens the door" has a certain underlying (abstract) semantics, which can be captured with our model. Humans know how to ground abstract knowledge to language symbols in the syntactically right way. A wrong grounding would produce a word order leading to ill-formed sentences or categorical errors, which cannot make sense in terms of execution. For example, switching the grounding of the agent with the grounding of the object in the same sentence would result in a meaning of the form "Door opened John". Moreover, a sentence using words such as 'who' or 'what' indicates a lack of grounding; for instance, "Who opens the door?" lacks the subject (agent) grounding. Finally, the grounding depends on object affordances (see [24]), thus a syntactically correct sentence with a non-admissible grounding will not make sense, e.g., "John jumps the door". The same abstract knowledge may be grounded to many different symbols, thus forming many different sentences with the same meaning. Naturally, the question arises: is such, or a similar, abstract knowledge representation a step towards realizing a universal property of all languages (language universals)?

Finally, 'metaphors' are often used in linguistic theories: "The essence of metaphor is understanding one kind of thing in terms of another" [50]. According to this definition, metaphors could be explained by the 'interpretation' property of abstract knowledge (schema) (see Section VI-A or Alba and Hasher [39]). One of the key questions is "whether these metaphors simply reflect linguistic convention or whether they actually represent how people think" [51].

C. Future Work
We prioritize investigating what can be achieved with abstract policies in computer simulation, while applying and testing the obtained results on real robots.
Exploring Grounding.
While in this paper we have demonstrated the learning of abstract policies based on all of the institution's norms, it is also possible to learn each norm independently and store it in a separate 'primitive' abstract policy. Such policies could then be combined to achieve the normative behavior specified by the institution, by finding the grounding that produces an adherent trajectory. Furthermore, primitive abstract policies could be used outside the scope of their normative meaning. Each of them would contain the procedural knowledge of how to achieve one particular relation described by the norm qualifier (e.g., prepositions), whether temporal (after, before, during, overlaps, etc.), spatial (at, near, on, below, etc.), or any other relation of interest. An agent grounded by a primitive policy would act towards 'achieving' it. For instance, if a policy that captures the concept 'at' is grounded to an agent and to an object representing a football field, the agent will act towards being at that football field. Changing the grounded object would make the agent move to another area, and activating another primitive policy would make the agent act towards 'achieving' the new policy. Thus, it would be possible to explore the world by exploring the grounding of such policies, which could be further used for learning or planning. This possibility seems interesting to explore and would require learning a set of primitive abstract policies over diverse examples. Then, another policy would output a vector encoding possible groundings and possible primitive policies to activate. This would allow learning the groundings of primitive policies in order to generate agent behaviors that maximize a given reward. Also, such a design may make learning at this level of granularity more meaningful from the human perspective.
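As a purely hypothetical sketch of what such a 'primitive' abstract policy could look like, the fragment below hard-codes an 'at' policy as a simple controller; the agent and object names and the whole interface are illustrative assumptions, and in practice the policy itself would be learned.

from typing import Callable, Dict, Tuple

State = Dict[str, Tuple[float, float]]   # entity name -> 2D position

def at_policy(agent: str, target: str) -> Callable[[State], Tuple[float, float]]:
    # Primitive policy for the 'at' relation: returns an action (a unit step)
    # that moves the grounded agent towards the grounded target object.
    def act(state: State) -> Tuple[float, float]:
        ax, ay = state[agent]
        tx, ty = state[target]
        dx, dy = tx - ax, ty - ay
        norm = max((dx ** 2 + dy ** 2) ** 0.5, 1e-6)
        return (dx / norm, dy / norm)
    return act

# Re-grounding the same primitive policy to a different object yields a new
# behavior without re-learning; here the agent heads for the football field.
policy = at_policy(agent="Robby", target="FootballField")
state = {"Robby": (0.0, 0.0), "FootballField": (3.0, 4.0)}
print(policy(state))   # (0.6, 0.8): a unit step towards the football field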
Hierarchical/Recursive Grounding.
The described hypothetical process of learning the grounding of 'primitive' abstract policies and combining them into more complex representations would still have a 'flat' structure. A complementary idea in this direction of research is to investigate the grounding of 'primitive' policies on other such policies. In an early investigation, we were able to define norms in terms of other norms, where we defined a 'before' norm over two other 'must' norms. The agent successfully learned to fulfill both 'must' norms in the correct sequential order. Such hierarchical grounding was implicit, and future research effort should make it explicit. Being able to hierarchically ground abstract policies would mean that simple concepts could be combined into hierarchical structures, which should enable agents to understand more complex forms of norms and to explore the world at different levels of granularity, simply by exploring the grounding.
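As a hedged illustration of the kind of nesting we have in mind (and not the representation actually used in the early investigation mentioned above), a 'before' norm could be composed from two 'must' norms roughly as follows; all names and the trajectory format are illustrative assumptions.

from typing import Callable, Optional

def must(condition: Callable[[dict], bool]) -> Callable[[list], Optional[int]]:
    # A 'must' norm is fulfilled at the first time step where its condition holds.
    def fulfilled_at(trajectory: list) -> Optional[int]:
        for t, state in enumerate(trajectory):
            if condition(state):
                return t
        return None
    return fulfilled_at

def before(norm_a, norm_b) -> Callable[[list], bool]:
    # 'Before' over two 'must' norms: both are fulfilled, the first one earlier.
    def fulfilled(trajectory: list) -> bool:
        ta, tb = norm_a(trajectory), norm_b(trajectory)
        return ta is not None and tb is not None and ta < tb
    return fulfilled

# Example: the agent must be at the shelf before being at the register.
at_shelf = must(lambda s: s["at"] == "Shelf")
at_register = must(lambda s: s["at"] == "Register")
shelf_before_register = before(at_shelf, at_register)
trajectory = [{"at": "Entrance"}, {"at": "Shelf"}, {"at": "Register"}]
print(shelf_before_register(trajectory))  # True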
Grounding and Transfer of Knowledge.
At the current level of development of our model, whether or not an abstract policy can be applied in a given domain is decided by a human knowledge engineer. However, an interesting question that needs answering is: how can a computational system know how to automatically ground an abstract policy in a novel domain? In Section IV-B we explain that an abstract policy can, in principle, be grounded in domains that share the same features as the categories over which the abstract policy was originally learned. However, in practice the success of a transfer may also depend on other morphological/isomorphic (syntactic/semantic) similarities. Related ideas are suggested in the area of analogical reasoning [52], [53], and they may serve as a source of inspiration for future research. Researchers in this area have been able to explain how mapping one domain to other domains may result in creativity and some forms of reasoning [54].

In this section we have briefly discussed the nature of the abstract knowledge representation in our framework, some ideas for future work, and its possible implications for other areas. Interestingly, in this view, all of the discussed areas may have something in common: the grounding of an abstract knowledge representation. A multi-disciplinary research effort may be needed for a unifying cognitive model in which the grounding of abstract knowledge plays a central role.

VII. CONCLUSIONS
Future robots will need to follow human social norms in order to be useful and accepted in human society. At the same time, recent successes in reinforcement learning techniques may further speed up the development of robotics. In this paper, we propose a method to bring social norms and reinforcement learning algorithms together through our institutional framework. We are able to: (1) provide a way to intuitively encode social knowledge (through norms); (2) guide learning towards normative behaviors (through an automatic norm reward system); (3) achieve a transfer of learning by abstracting policies; and, finally, (4) the method is not dependent on a particular RL algorithm (although it was tested with a particular one). We show how our approach can be seen as a means to achieve an abstract procedural knowledge representation, and we discuss its implications for cognitive science.

REFERENCES

[1] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller, "Playing Atari with deep reinforcement learning," arXiv preprint arXiv:1312.5602, 2013.
[2] D. Silver, J. Schrittwieser, K. Simonyan, I. Antonoglou, A. Huang, A. Guez, T. Hubert, L. Baker, M. Lai, A. Bolton et al., "Mastering the game of Go without human knowledge," Nature, vol. 550, no. 7676, p. 354, 2017.
[3] D. Silver, T. Hubert, J. Schrittwieser, I. Antonoglou, M. Lai, A. Guez et al., "Mastering chess and shogi by self-play with a general reinforcement learning algorithm," arXiv preprint arXiv:1712.01815, 2017.
[4] O. Vinyals, T. Ewalds, S. Bartunov, P. Georgiev, A. S. Vezhnevets, M. Yeo, A. Makhzani, H. Küttler, J. Agapiou, J. Schrittwieser et al., "StarCraft II: A new challenge for reinforcement learning," arXiv preprint arXiv:1708.04782, 2017.
[5] OpenAI, "OpenAI Five," https://blog.openai.com/openai-five/, 2018.
[6] M. Andrychowicz, B. Baker, M. Chociej, R. Jozefowicz, B. McGrew, J. Pachocki, A. Petron, M. Plappert, G. Powell, A. Ray et al., "Learning dexterous in-hand manipulation," arXiv preprint arXiv:1808.00177, 2018.
[7] C. Boutilier, T. Dean, and S. Hanks, "Decision-theoretic planning: Structural assumptions and computational leverage," J. Artif. Int. Res., vol. 11, no. 1, pp. 1–94, Jul. 1999. [Online]. Available: http://dl.acm.org/citation.cfm?id=3013545.3013546
[8] J. R. Searle, "What is an institution," Journal of Institutional Economics, vol. 1, no. 1, pp. 1–22, 2005.
[9] D. C. North, Institutions, Institutional Change and Economic Performance. Cambridge University Press, 1990.
[10] E. Ostrom, Understanding Institutional Diversity. Princeton University Press, 2009.
[11] M. Minsky, E. Feigenbaum, and J. Feldman, "Computers and thought," 1963.
[12] R. Bellman, Adaptive Control Processes. Princeton University Press, 1961.
[13] M. J. Matarić, "Reinforcement learning in the multi-robot domain," in Robot Colonies. Springer, 1997, pp. 73–83.
[14] A. Y. Ng, D. Harada, and S. Russell, "Policy invariance under reward transformations: Theory and application to reward shaping," in ICML, vol. 99, 1999, pp. 278–287.
[15] R. S. Sutton, D. Precup, and S. Singh, "Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning," Artificial Intelligence, vol. 112, no. 1–2, pp. 181–211, 1999.
[16] M. J. Matarić, "Learning in behavior-based multi-robot systems: Policies, models, and other agents," Cognitive Systems Research, vol. 2, no. 1, pp. 81–93, 2001.
[17] Y. Kishima, K. Kurashige, and T. Kimura, "Decision making in reinforcement learning using a modified learning space based on the importance of sensors," Journal of Sensors, vol. 2013, 2013.
[18] A. K. McCallum et al., "Learning to use selective attention and short-term memory in sequential tasks," in From Animals to Animats 4: Proceedings of the Fourth International Conference on Simulation of Adaptive Behavior, vol. 4. MIT Press, 1996, p. 315.
[19] A. Y. Ng, S. J. Russell et al., "Algorithms for inverse reinforcement learning," in ICML, vol. 1, 2000, p. 2.
[20] D. Kasenberg, T. Arnold, and M. Scheutz, "Norms, rewards, and the intentional stance: Comparing machine learning approaches to ethical training," in Proceedings of the 2018 AAAI/ACM Conference on AI, Ethics, and Society. ACM, 2018, pp. 184–190.
[21] P. F. Christiano, J. Leike, T. Brown, M. Martic, S. Legg, and D. Amodei, "Deep reinforcement learning from human preferences," in Advances in Neural Information Processing Systems, 2017, pp. 4299–4307.
[22] P. Silva, R. Ventura, and P. U. Lima, "Institutional environments," in Proc. of AAMAS Workshop: From Agent Theory to Agent Implementation, 2008, pp. 157–164.
[23] H. Aldewereld, S. Álvarez-Napagao, F. Dignum, and J. Vázquez-Salceda, "Making norms concrete," in Proceedings of the 9th International Conference on Autonomous Agents and Multiagent Systems (AAMAS), vol. 1, 2010, pp. 807–814.
[24] S. Tomic, F. Pecora, and A. Saffiotti, "Norms, institutions, and robots," arXiv preprint arXiv:1807.11456, 2018.
[25] A. Wasik, S. Tomic, A. Saffiotti, F. Pecora, A. Martinoli, and P. U. Lima, "Towards norm realization in institutions mediating human-robot societies," IEEE, 2018, pp. 297–304.
[26] S. Tomic, A. B. Wasik, P. U. Lima, A. Martinoli, F. Pecora, and A. Saffiotti, "Towards institutions for mixed human-robot societies," in Proc. of the 17th International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2018), 2018, pp. 2216–2217.
[27] R. Dechter, I. Meiri, and J. Pearl, "Temporal constraint networks," Artificial Intelligence, vol. 49, no. 1, pp. 61–95, 1991.
[28] J. Frank and A. Jónsson, "Constraint-based attribute and interval planning," Constraints, vol. 8, no. 4, pp. 339–364, 2003.
[29] S. Fratini, F. Pecora, and A. Cesta, "Unifying planning and scheduling as timelines in a component-based perspective," Archives of Control Science, vol. 18, no. 2, pp. 231–271, 2008.
[30] O. Boissier, J. F. Hübner, and J. S. Sichman, "Organization oriented programming: From closed to open organizations," in Engineering Societies in the Agents World. Springer, 2007, pp. 86–105.
[31] Unity Technologies, https://unity.com/, 2019.
[32] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, "Proximal policy optimization algorithms," arXiv preprint arXiv:1707.06347, 2017.
[33] A. Juliani, V.-P. Berges, E. Vckay, Y. Gao, H. Henry, M. Mattar, and D. Lange, "Unity: A general platform for intelligent agents," arXiv preprint arXiv:1809.02627, 2018.
[34] D. Amodei, C. Olah, J. Steinhardt, P. Christiano, J. Schulman, and D. Mané, "Concrete problems in AI safety," arXiv preprint arXiv:1606.06565, 2016.
[35] Y. Bengio, J. Louradour, R. Collobert, and J. Weston, "Curriculum learning," in Proceedings of the 26th Annual International Conference on Machine Learning. ACM, 2009, pp. 41–48.
[36] P. DiMaggio, "Culture and cognition," Annual Review of Sociology, vol. 23, no. 1, pp. 263–287, 1997.
[37] F. C. Bartlett, Remembering: A Study in Experimental and Social Psychology, 1932.
[38] P. J. Hampson and P. E. Morris, Understanding Cognition. Wiley-Blackwell, 1996.
[39] J. W. Alba and L. Hasher, "Is memory schematic?" Psychological Bulletin, vol. 93, no. 2, p. 203, 1983.
[40] M. Minsky, "A framework for representing knowledge," in The Psychology of Computer Vision, 1975.
[41] D. E. Rumelhart, "Schemata: The building blocks of cognition," in Theoretical Issues in Reading Comprehension, 1980.
[42] P. Thagard, Mind: Introduction to Cognitive Science, vol. 17. MIT Press, Cambridge, MA, 2005.
[43] L. W. Barsalou, "Grounded cognition," Annu. Rev. Psychol., vol. 59, pp. 617–645, 2008.
[44] A. Cangelosi and F. Stramandinoli, "A review of abstract concept learning in embodied agents and robots," Philosophical Transactions of the Royal Society B: Biological Sciences, vol. 373, no. 1752, p. 20170131, 2018.
[45] L. W. Barsalou, "Perceptions of perceptual symbols," Behavioral and Brain Sciences, vol. 22, no. 4, pp. 637–660, 1999.
[46] H. Svensson and T. Ziemke, "Making sense of embodiment: Simulation theories and the sharing of neural circuitry between sensorimotor and cognitive processes," in Proceedings of the Annual Meeting of the Cognitive Science Society, vol. 26, 2004.
[47] G. Pezzulo, L. W. Barsalou, A. Cangelosi, M. H. Fischer, K. McRae, and M. Spivey, "Computational grounded cognition: A new alliance between grounded cognition and computational modeling," Frontiers in Psychology, vol. 3, p. 612, 2013.
[48] J. A. Fodor, The Language of Thought, vol. 5. Harvard University Press, 1975.
[49] S. McLeod, "Jean Piaget: Cognitive theory," Simply Psychology, simplypsychology.org, 2009.
[50] M. Johnson and G. Lakoff, Metaphors We Live By. University of Chicago Press, Chicago, 2003.
[51] G. L. Murphy, "Reasons to doubt the present evidence for metaphoric representation," Cognition, vol. 62, no. 1, pp. 99–108, 1997.
[52] P. Bartha, "Analogy and analogical reasoning," in The Stanford Encyclopedia of Philosophy, Spring 2019 ed., E. N. Zalta, Ed. Metaphysics Research Lab, Stanford University, 2019.
[53] D. Gentner, "Structure-mapping: A theoretical framework for analogy,"