Understanding User Instructions by Utilizing Open Knowledge for Service Robots
Dongcai Lu, Feng Wu∗, and Xiaoping Chen

Abstract—Understanding user instructions in natural language is an active research topic in AI and robotics. Typically, natural user instructions are high-level and can be reduced into low-level tasks expressed in common verbs (e.g., 'take', 'get', 'put'). For robots understanding such instructions, one of the key challenges is to process high-level user instructions and achieve the specified tasks with the robots' primitive actions. To address this, we propose novel algorithms that utilize the semantic roles of common verbs defined in semantic dictionaries and integrate multiple sources of open knowledge to generate task plans. Specifically, we present a new method for matching and recovering the semantics of user instructions and a novel task planner that exploits functional knowledge of the robot's action model. To verify and evaluate our approach, we implemented a prototype system using knowledge from several open resources. Experiments on our system confirmed the correctness and efficiency of our algorithms. Notably, our system has been deployed in the KeJia robot, which participated in the annual RoboCup@Home competitions in the past three years and achieved encouragingly high scores in the benchmark tests.
Index Terms—Service Robots, Human-Robot Interaction, Natural Language Understanding, Task Planning.
I. INTRODUCTION

Nowadays, service robots can do more and more work in our daily life, such as moving around in a house, fetching drinks or medicine for elderly people, or preparing food for a family. They are smart and can do many complex tasks autonomously. Nevertheless, when robots encounter user requests or tasks in an open-ended form (e.g., through dialogs in natural language), they often fail to respond properly, not only due to possible language processing failures but also due to the challenges of task planning with incomplete knowledge. For example, as illustrated in Figure 1, the daily instruction "clean up toys" is challenging for a robot to process because the action "clean up" is under-specified, and "have a headache" is also nontrivial for a robot to offer help with, without grounding the helping verb (i.e., knowing how to help). These are common tasks in domestic scenarios, and therefore it is desirable for service robots to be able to complete such tasks given user instructions in natural language.

Fig. 1. Examples of robot tasks for user instructions in natural language. (a) "Clean up toys": the task "clean up toys" with the steps "pick up toys from floor" and "put toys in the toybox". (b) "Have a headache": the desire "have a headache" with the actions "give him an aspirin" and "with pain medication".

Typically, user instructions are action-directed in the sense that the fundamental purpose of an instruction is to specify what users want a robot to do for them. This indicates a connection between robot understanding (i.e., knowing what the users said) and acting (i.e., doing what the users asked). In other words, understanding an instruction means that the robot is able to generate a plan (i.e., a sequence of actions) for the tasks specified in the instruction [1], [2], [3], [4], [5], [6]. Therefore, it is crucial for the robot to have knowledge about the tasks and actions in order to do planning. However, some knowledge may be missing in the instruction (e.g., "have a headache" does not directly indicate that the robot should give the user an aspirin). Consequently, the robot does not know how to act when such instructions are presented.

Fortunately, more and more common knowledge is available in open resources, such as the
Open Mind Indoor Common Sense (OMICS) database [7], wikihow, WordNet, and many other digital dictionaries. In these dictionaries, actions are often hierarchical, where a high-level action is composed of several lower-level actions. Similarly, user instructions are often specified hierarchically, in which an action is referred to by an action verb or verb phrase. For instance, "clean up a house" may indicate a series of subtasks such as "clean the table", "clean the floor", etc. Therefore, commonsense knowledge about the hierarchical relations between tasks and subtasks is useful for instruction understanding.

In our previous studies [8], we found that a user instruction representing a high-level task can usually be reduced into a sequence of low-level subtasks, using hierarchical knowledge in open resources. Furthermore, we observed that this reduction procedure often ends up at so-called primitive tasks (i.e., low-level subtasks expressed in common verbs [9]). For instance, in OMICS, "serve a drink from fridge" is reduced into a sequence of low-level subtasks expressed in common verbs, such as "go to fridge", "open the fridge door", and "take the drink", where 'go', 'open', and 'take' are common verbs. Ideally, if all of the primitive tasks in the reduction can be directly mapped into the robot's actions, the robot can simply complete the task by executing those actions.

However, it is generally nontrivial to map primitive tasks to a robot's actions. One of the key challenges is that there is little knowledge in most open resources about common verbs, and furthermore about how they can be executed by a robot with its actions. To avoid this challenge, most of the existing approaches [3], [4], [10] manually create a small set of hand-coded robot actions for primitive tasks, though their scalability (i.e., they only work for small problems) and generality (i.e., they only work for specific domains) are limited. To build a general-purpose system for handling large-scale user instructions, we directly tackle this challenge and consider the following problems: 1) how to define the semantics (meanings) of common verbs, and how to match and recover such semantics in user instructions, and 2) how to handle a large number of instructions and generate plans in realtime using open knowledge resources.

To address these problems, we propose a novel system for service robots to 1) process user instructions based on the semantic roles of common verbs defined in semantic dictionaries, and 2) then generate plans for the corresponding tasks of the user instructions. The semantic roles suggest possible entities in the knowledge representation that may be missing from or omitted in natural instructions. In more detail, we introduce a heuristic method to match and recover missing semantic roles from the context of user instructions. Then, we use a planner based on Answer Set Programming (ASP) [11] that exploits the definitions of common verbs in terms of semantic roles and generates a plan for the task specified in the user instruction. By putting them together, we built a general-purpose system for service robots that can handle large-scale user instructions using open commonsense knowledge.

To evaluate our approach, we conducted a corpus-based experiment on two test sets with 11,885 user tasks and 467 user desires collected from OMICS. We also developed a prototype system and ran a case study on a service robot in two typical domestic scenarios. Our experimental results show a substantial improvement in performance on user instruction understanding.
It is worth pointing out that the proposed system has been successfully deployed in our KeJia robot, which participated annually in the RoboCup@Home competition (http://ai.ustc.edu.cn/en/robocup/atHome/index.php) and won first place once and second place twice in the past three years. During the benchmark tests of the RoboCup@Home competitions, our system was used by our robot for understanding the instructions in English given by the referees and completing the corresponding tasks. This confirms the usefulness of our system in practice.

The remainder of this article is organized as follows. Section II introduces our problem and Section III presents an overview of our system. Then, Section IV proposes our main algorithms, followed by Sections V and VI describing the two key techniques used in our algorithms. Next, Section VII reports our experimental results. Finally, Section VIII briefly reviews the related work and Section IX concludes.

II. PROBLEM STATEMENT
We aim to build a general-purpose system so that the robot can understand user instructions and provide service for the user. To this end, we must solve the problem of generating a sequence of primitive actions, which can be directly executed by a robot, given user instructions in natural language. For example, when a user says "please serve a meal for me", the robot will take the meal, put it on a plate, and place the plate on a table; when a user says "I am thirsty", the robot will take a drink from the fridge and deliver it to the user. To achieve this, our system must be able to extract a task from a user instruction in natural language (i.e., knowing what the user said) and generate an executable plan for the task (i.e., doing what the user asked). In other words, natural language understanding and task planning must be combined systematically in order to solve our problem.

In the next section, we give an overview of our system for instruction understanding and task planning, which is built by integrating different modules.

III. SYSTEM OVERVIEW

Fig. 2. System architecture.
The overall architecture of our system is shown in Figure 2. As we can see, the human-robot dialog system transcribes spoken utterances into text sentences and manages the dialog with users. Each sentence in the dialog is then transferred to the Processing module, which generates a sequence of primitive actions for the task expressed in natural language. After that, a sequence of commands corresponding to each primitive action is computed by the Motion Planning module. Finally, the commands are executed by the Robot Control module.

Here, we focus on the Processing module, which takes a text sentence as its input and outputs a sequence of primitive actions that are executable by the robot. The main components of our Processing module are described in detail as follows.
A. Open Knowledge
As shown in Figure 2, we use open knowledge both for Natural Language Processing (NLP) and task planning. The open knowledge considered in our system includes OMICS, FrameNet (https://framenet.icsi.berkeley.edu/fndrupal/), and Re-FrameNet (http://ai.ustc.edu.cn/en/research/reframenet.php), as introduced below.

OMICS [7] is an extensive collection of knowledge for indoor service robots gathered from internet users. Currently, it contains 48 tables capturing different sorts of knowledge, among which the Help and Tasks/Steps tables are most useful for our system. Each tuple of the Help table maps a user desire to a task that may meet the desire (e.g., ⟨"feel thirsty", "by offering drink"⟩). Each tuple of the Tasks/Steps table decomposes a task into several steps (e.g., ⟨"serve a drink", 0. "get a glass", 1. "get a bottle", 2. "fill glass from bottle", 3. "give glass to person"⟩). Given this, OMICS offers useful knowledge about the hierarchism of naturalistic instructions, where a high-level user request (e.g., "serve a drink") can be reduced to lower-level tasks (e.g., "get a glass", ...). Another feature of OMICS is that the elements of any tuple in an OMICS table are semantically related according to a predefined template. This facilitates the semantic interpretation of the OMICS tuples.

FrameNet is a digital dictionary providing rich semantic information for action verbs. It groups action verbs into Frames and specifies word definitions in terms of semantic roles called Frame Elements (FEs) for each Frame [12]. Although the connections between an action verb and its semantic roles are useful for resolving the under-specification of naturalistic instructions, this knowledge cannot be directly used by robots since it is not formalized in FrameNet. To overcome this difficulty, we developed Re-FrameNet, a formalized version of FrameNet created by rewriting part of the FrameNet knowledge in a formal meta-language.

Specifically, in Re-FrameNet, a Frame of FrameNet is formalized as a meta-task and re-defined by a set of preconditions, postconditions, invariants, and/or steps over the semantic roles of the meta-task. In the definition, FEs (i.e., semantic roles) such as Theme, Source, and Goal of the Frame are taken as meta-variables. Therefore, the definition of a meta-task specifies the common semantic structure of the action verbs in the corresponding Frame. For example, the meta-task put-Placing is defined as:

    (define (meta-task put-Placing
        (:parameters ?Agent ?Theme ?Source ?Goal))
      (:precondition ...)
      (:postcondition ...)
      (:invariant ...))

where all action verbs in the Frame Placing (e.g., lay, heap, deposit) share the same definition. When a robot tries to plan with put-Placing as its action verb (verb sense) for an instruction, our NLP components will try to extract appropriate entities for every semantic role specified in the definition of the meta-task put-Placing (see Section V for more detail).

It is worth noting that common verbs are normally not explained in the aforementioned open resources because most of them belong to the so-called General Service List (GSL), a list of roughly 2000 of the most frequent English words [9]. The GSL is taken as the defining vocabulary of dictionaries such as the Longman Dictionary of Contemporary English, based on the notion that words should be defined using "terms less abstruse than the word that is to be explained" [13]. As a result, there are few definitions of the GSL verbs in OMICS or other digital dictionaries.

B. NLP Module
This module maps a user instruction in natural language I to the OMICS tables, which contain tuples ⟨task, steps⟩ for task-oriented instructions or tuples ⟨desire, task⟩ for desire-oriented instructions (see Section IV for more detail). The output is a logical form L passed to the Planning module, containing a frame-semantic representation such as:

    (meta-task take-Taking (:parameters food fridge))

Specifically, interpreting I into L is done in three steps: 1) dependency parsing, which analyzes the dependencies of each word in a sentence; 2) frame-semantic parsing, which identifies the verb's frame; and 3) semantic matching and recovering, which fills the semantic roles for a given frame. Each step is described in detail in Section V.
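To make the three-step interpretation concrete, the following minimal Python sketch stubs out the pipeline on the running example. The helper names and the toy dependency list are illustrative only, not the system's actual API:

    def dependency_parse(sentence):
        # Step 1 (stub): typed dependencies as (relation, governor, dependent).
        return [("dobj", "take", "food"), ("prep_out_of", "take", "fridge")]

    def identify_frame(verb, deps):
        # Step 2 (stub): map the verb to a unique Frame (Section V-B).
        return {"take": "take-Taking"}.get(verb)

    def match_roles(frame, deps):
        # Step 3 (stub): fill semantic roles from the dependencies (Section V-C).
        rules = {("take-Taking", "dobj"): "Theme",
                 ("take-Taking", "prep_out_of"): "Source"}
        return {rules[(frame, rel)]: dep
                for rel, gov, dep in deps if (frame, rel) in rules}

    def interpret(sentence):
        deps = dependency_parse(sentence)   # step 1: dependency parsing
        verb = deps[0][1]                   # the frame-evoking verb
        frame = identify_frame(verb, deps)  # step 2: frame-semantic parsing
        roles = match_roles(frame, deps)    # step 3: matching and recovering
        return (frame, roles)               # the logical form L

    print(interpret("take food out of fridge"))
    # ('take-Taking', {'Theme': 'food', 'Source': 'fridge'})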
C. Planning Module

The Planning module takes as input the logical form L of the user instruction, the online knowledge base (e.g., Re-FrameNet, WordNet, FrameNet), domain knowledge, and the robot's skills. The output of the Planning module is a high-level plan for the Motion Planning module.

We employ both a global and a local planner in the Planning module. The global planner searches through the whole task-decomposition knowledge in OMICS to generate a plan. However, most of the tasks in OMICS cannot be decomposed into the robot's primitive actions because many steps in OMICS are referred to by common verbs, for which OMICS does not contain decomposition knowledge. For example, verbs such as take, place, put, get, and turn frequently occur in task steps, but there is no knowledge in OMICS about how the robot can execute them. Therefore, a local planner based on ASP is used for planning based on merely the instruction itself.

Note that the local planner is incapable of generating a plan for under-specified terms in an instruction. Therefore, the common verbs referred to by the instruction must be specified first in order to generate a plan. Fortunately, semantic dictionaries such as FrameNet provide rich knowledge about common verbs. In Re-FrameNet, we reorganize the definition of an action verb as a set of preconditions, postconditions, and invariants over the semantic roles of the action (a.k.a. the functional definition of the action). Given this, a planner based on ASP can plan actions for the instruction using the formalized functional definition of an action. Section VI will give more detail about our planning method.
D. Skills and Action Model
For a robot, we define an Action Model to specify its skills. Specifically, an Action Model consists of several primitive actions. Each primitive action a is defined by a set of preconditions, postconditions, and invariants, similar to the definition of a common verb in Re-FrameNet. In other words, they specify, respectively, the conditions under which a can be executed, the conditions that hold when a finishes, and the conditions that must be satisfied during the execution of a. Indeed, a primitive action is the formal specification of a robot skill. As we will show in later sections, the Action Model is useful for our system to generate a plan that is executable by the robot.
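For illustration, a primitive action of the Action Model could be represented as follows. This is a minimal sketch with hypothetical field and predicate names, not the system's actual data structure:

    from dataclasses import dataclass
    from typing import FrozenSet

    @dataclass(frozen=True)
    class PrimitiveAction:
        name: str
        precondition: FrozenSet[str]             # must hold when the action starts
        postcondition: FrozenSet[str]            # holds when the action finishes
        invariant: FrozenSet[str] = frozenset()  # must hold throughout execution

    pick_up = PrimitiveAction(
        name="pick_up",
        precondition=frozenset({"near(robot, obj)", "believe_location(robot, obj)"}),
        postcondition=frozenset({"grasping(robot, obj)"}),
    )
    print(pick_up.name, sorted(pick_up.precondition))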
Algorithm 1 SolveTask(task t, ActionModel AM)

    global gSeen := ∅  /* prevents an infinite recursive loop when the search explores a task it is already expanding */
    initialize worldmodel and plans
    if t ∈ gSeen then return null end if
    gSeen := gSeen ∪ {t}
    subTasks := FindSubTasks(t)  /* find subtasks of task t from the Tasks/Steps table in OMICS */
    for each task s in subTasks do
        if GeneratePlans(s, AM) = null then
            FoundEqualTask := False
            while there is a new t' from the Tasks/Steps table semantically equivalent to s do
                if SolveTask(t', AM) ≠ null then
                    FoundEqualTask := True
                    plans.append(SolveTask(t', AM))
                    worldmodel := simulator(worldmodel, plans)
                    break
                end if
            end while
            if FoundEqualTask = False then return null end if
        else
            plans.append(GeneratePlans(s, AM))
            worldmodel := simulator(worldmodel, plans)
        end if
    end for
    return plans  /* all steps have been successfully planned */
E. Learning Module

In this module, methods such as log-linear models, Conditional Random Fields (CRFs), and Learning from Demonstration (LfD) are used to learn the robot's low-level skills. Intuitively, the more skills a robot possesses, the more capable it is. For example, unless a robot knows how to pour water into a cup, it cannot finish a high-level task such as "make a coffee" (with the task-step tuple ⟨"make a coffee", 0. "put hot water in a cup", 1. "pour the coffee"⟩). In this paper, we assume that our robot has all the necessary low-level skills to complete a task specified by user instructions, though most of the skills must be learned one by one in practice. The learning methods for robot skills are interesting but beyond the scope of this article.

After introducing our system as a whole, we describe our main algorithms for instruction understanding next.

IV. UNDERSTANDING USER INSTRUCTIONS
There are two types of user instructions that we consider in this article: 1) task-oriented instructions (e.g., "serve a meal") and 2) desire-oriented instructions (e.g., "I am thirsty"). In OMICS, a task-oriented instruction is represented as a tuple ⟨t, s⟩, where s = ⟨s_1, s_2, ..., s_n⟩ is the sequence of n steps to complete the task t. For example, given the task t = "serve a meal", a sequence of steps may be s = ⟨s_1: "take the meal", s_2: "put it on a plate", s_3: "place the plate on a table"⟩. Similarly, a desire-oriented instruction is represented as a tuple ⟨d, t⟩, where t is the task corresponding to the user desire d. For instance, given the user desire d = "I am thirsty", the task for a robot may be t = "serve a drink". Indeed, in most domestic scenarios, a user instruction is usually either task-oriented or desire-oriented. Now, we turn to our algorithms for generating a plan for each of these two types of user instructions.

Algorithm 2 SolveHelp(desire t, ActionModel AM)

    AllHelps := FindHelpsMaptoDesire(t)  /* find all help tasks mapped to desire t */
    for each help task s in AllHelps do
        if GeneratePlans(s, AM) = null then
            for each task gs in the Tasks/Steps table do
                if gs is semantically equivalent to s then
                    return SolveTask(gs, AM)
                end if
            end for
        else
            return GeneratePlans(s, AM)
        end if
    end for
    return null

Algorithm 3 GeneratePlans(task t, ActionModel AM)

    /* generate a plan for low-level task t */
    sem := SemanticMatchAndRecover(t)
    if sem.frame = null then return null end if
    if sem.frame ∈ AM then
        return sem.frame(sem.parameters)
    else
        gRFN := FindRFNBySem(sem.frame)  /* find the definition of sem.frame in Re-FrameNet */
        Res := solver(gRFN, sem, AM)  /* compute a plan from the rules of gRFN, sem, and AM */
        if Res ≠ null then return Res else return null end if
    end if

Algorithm 1 is used to process task-oriented instructions by utilizing the Tasks/Steps table in OMICS. The input is a naturally expressed task t and the robot's action model AM, and the output is a sequence of primitive actions plans. Specifically, it first finds all subtasks of task t from the Tasks/Steps table of OMICS. Then, it tries to generate a plan (i.e., a sequence of primitive actions) for each subtask. If a plan is successfully generated, the plan is added to the plan list plans and the simulator advances to the next subtask. Otherwise, it searches the Tasks/Steps table of OMICS again for all Semantically Equivalent (SE) tasks of that subtask until one of the SE tasks is successfully planned. (For example, the tasks "give someone an object" and "take an object to someone" are semantically equivalent.) If there is no SE task or none of the SE tasks can be successfully planned, null is returned to indicate the failure of task planning. After all subtasks are successfully planned, plans is returned and executed by the robot.

Algorithm 2 is used to process desire-oriented instructions by utilizing the Help table in OMICS. Similarly, the input is a desire and an action model and the output is a plan. Specifically, it first finds a list of help tasks offering the corresponding help when given a desire. Then, it tries to plan for each of the help tasks by checking whether the help task can be successfully planned with a sequence of primitive actions. If so, the resulting plan is returned. Otherwise, it searches the Tasks/Steps table in OMICS for an SE task of the help task and calls Algorithm 1 to generate the plan.

Notice that both Algorithms 1 and 2 depend on Algorithm 3 to generate a plan for a low-level task t. Algorithm 3 first performs semantic role matching and recovering for task t and outputs a frame and its roles. If no verb frame is identified, the process terminates with null, as no plan can be generated. If the frame is a primitive action, the frame plus its roles are returned. Otherwise, the frame is evoked by a common verb. In this case, Algorithm 3 first finds the definition of sem.frame in Re-FrameNet and translates it to a set of rules. After that, it computes a plan based on the rules of gRFN, the frame sem, and the action model AM (see the sketch below).

Now, the key procedures in Algorithm 3 are: 1) how to do semantic role matching and recovering given a task expressed in natural language, and 2) how to compute a plan given a set of rules, a frame, and an action model. The details of these two procedures are described in Sections V and VI respectively.
V. SEMANTIC MATCHING AND RECOVERING

We propose a three-phase procedure to translate a user instruction expressed in natural language into an internal representation that can be handled by our planner. Firstly, a probabilistic syntactic parser is used to retrieve the dependencies of the instruction. Secondly, the frame of the sentence's verb is identified by frame-semantic parsing. Here, without loss of generality, we assume that each instruction represents just a single task (verb). Thirdly, the semantic roles of the frame are recovered and filled as much as possible with the matched entities appearing in the instruction or its sentential context, and the result is represented as a meta-task in Re-FrameNet. More details about our three-phase procedure are described below.
A. Dependency Parsing
We use the Stanford parser [14] in the first phase, which produces the Stanford typed dependencies between the words in a sentence. These dependencies indicate the grammatical relations between words in terms of the name of the relation, the governor, and the dependent [15]. Figure 3 illustrates the parsing of the sentence "take food out of refrigerator". The edge of type dobj denotes that the noun "food" is the direct object of the verb "take". The verb "take" also governs the noun "refrigerator" via the typed dependency prep_out_of. Since the typed dependency between a verb and a noun reveals their semantic-role relation, the syntactic structure of an instruction is used for our semantic role matching and recovering.
B. Frame Semantic Parsing
Fig. 3. Stanford typed dependencies of "take food out of fridge".

Given that a verb varies in its senses, an instruction may represent different meanings and therefore can be mapped to different frames in FrameNet. For instance, the verb "take" can represent the Frame Bring or Removing under different contexts. The Stanford parser does not disambiguate verb senses. Therefore, we propose a Frame Semantic Parsing method to map a verb to a unique Frame. Specifically, we define a frame identification model and train the model with data from FrameNet and OMICS as described below.
1) Model: Given a sentence x = ⟨x_1, . . . , x_n⟩ with a frame-evoking verb v, we seek the most likely Frame f* in the frame identification stage. Let F be the set of candidate Frames for v, L the set of verbs found in the FrameNet annotations, and L_f ⊆ L the subset of verbs annotated as evoking the Frame f. Frame identification can be formalized by the following prediction rule:

    f^* = \arg\max_{f \in \mathcal{F}} \sum_{l \in L_f} p(f, l \mid v, x)

For f ∈ F and l ∈ L_f, a conditional log-linear model is used to model the probability p(f, l | v, x; θ):

    p(f, l \mid v, x; \theta) = \frac{\exp[\theta \cdot \Phi(f, l, v, x)]}{\sum_{f' \in \mathcal{F}} \sum_{l' \in L_{f'}} \exp[\theta \cdot \Phi(f', l', v, x)]}

where θ · Φ(f, l, v, x) is the inner product \sum_{i=1}^{M} \theta_i \Phi_i(f, l, v, x) and θ is the parameter vector over the feature function Φ with M dimensions.

Generally, the feature function allows for a variety of (possibly overlapping) features. A feature Φ_i may relate a frame f to a verb v, representing a lexical-semantic relationship.
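As a minimal illustration of this prediction rule, the following Python sketch scores every (frame, lexical unit) pair with a toy log-linear model, marginalizes over the lexical units, and returns the most probable Frame. The feature templates and weights are invented for the example:

    import math
    from collections import defaultdict

    def phi(frame, lu, verb, sentence):
        # Sparse binary features; a real model would use many more.
        return {f"frame={frame}&verb={verb}": 1.0,
                f"frame={frame}&lu={lu}": 1.0,
                f"frame={frame}&ctx={'of' in sentence}": 1.0}

    def identify_frame(theta, verb, sentence, candidates):
        # candidates: {frame f: lexical units L_f}. Returns argmax_f sum_l p(f, l | v, x).
        scores = {(f, l): math.exp(sum(theta[k] * v for k, v in
                                       phi(f, l, verb, sentence).items()))
                  for f, lus in candidates.items() for l in lus}
        z = sum(scores.values())
        frame_prob = defaultdict(float)
        for (f, l), s in scores.items():
            frame_prob[f] += s / z          # marginalize over lexical units
        return max(frame_prob, key=frame_prob.get)

    theta = defaultdict(float, {"frame=Removing&verb=take": 1.2,
                                "frame=Removing&ctx=True": 0.8,
                                "frame=Bringing&verb=take": 0.9})
    print(identify_frame(theta, "take",
                         ["take", "food", "out", "of", "fridge"],
                         {"Removing": ["take.v", "remove.v"],
                          "Bringing": ["take.v", "bring.v"]}))
    # Removing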
2) Data: Our training and test sets come from the FrameNet lexicon and OMICS. The FrameNet lexicon is a taxonomy of manually identified general-purpose Frames in English. Listed in the lexicon with each Frame are several lemmas (with part of speech) that can denote the Frame or some aspect of it; these are often called Lexical Units (LUs). Table I shows some examples of our training and test sets.
TABLE I
DATA COLLECTED FROM FRAMENET AND ANNOTATED FROM OMICS

    Data      Size    Example                                                       Verb      LU          Frame
    FrameNet  191740  i want to bring your daughter up to the prison               bring     bring.v     Bringing
                      i was visited by one of the king's most important officials  visited   visit.v     Arriving
                      cutting his wrist and jumping from a third-floor window      cutting   cut.v       Cause_harm
    OMICS     1100    remove objects from surface                                  remove    remove.v    Removing
                      complete the dance together                                  complete  complete.v  Activity_finish

TABLE II
HEURISTIC RULES FOR SEMANTIC ROLE FILLING WITHIN A SENTENCE

    Meta-task            Dependency Type  Semantic Role
    put-Placing          dobj             Theme
    put-Placing          prep_in          Goal
    take-Removing        dobj             Theme
    take-Removing        prep_from        Source
    dry-Cause_to_be_dry  dobj             Dryee
    deliver-Delivery     prep_to          Recipient
    · · ·

3) Training: Given the training subset of the data in the form ⟨x_j, v_j, f_j, s_j⟩, j = 1, . . . , N, where N is the number of sentences, we discriminatively train the frame identification model by maximizing the following log-likelihood function:

    \max_\theta \sum_{j=1}^{N} \log \sum_{l \in L_f^j} p(f_j, l \mid v_j, x_j)

Specifically, we optimize it using a distributed version of the gradient ascent algorithm with initial value θ_0:

    for k = 0 .. D − 1:
        for i = 1 .. M:
            \theta_i := \theta_i + \alpha \, \frac{\partial}{\partial \theta_i} \sum_{j=1}^{N} \log \sum_{l \in L_f^j} p(f_j, l \mid v_j, x_j)

where D is a parameter that controls the number of passes over the training data, M is the number of features, and N is the total size of our training set.

Note that the computational complexity of the algorithm above is O(D × M × N). When the number of features is large, it is costly to train our model sequentially. In order to update the parameter of a feature f faster, we consider only the N_f training examples that contain f instead of all N. Hence, the computational complexity becomes O(D × M × N_f), where N_f is usually much smaller than N.
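The following runnable Python sketch illustrates the gradient computation behind the update rule above on a toy example: the gradient of log Σ_l p(f_j, l | v_j, x_j) is the feature expectation under the gold frame minus the expectation under all candidate (frame, LU) pairs. All data and features here are invented for illustration:

    import math
    from collections import defaultdict

    def scores(theta, cands, phi):
        return {(f, l): math.exp(sum(theta[k] * v for k, v in phi(f, l).items()))
                for f, lus in cands.items() for l in lus}

    def grad_step(theta, gold_frame, cands, phi, alpha=0.5):
        s = scores(theta, cands, phi)
        z = sum(s.values())
        zg = sum(v for (f, _), v in s.items() if f == gold_frame)
        grad = defaultdict(float)
        for (f, l), v in s.items():
            # E_gold[phi] - E_model[phi], accumulated feature by feature
            w = (v / zg if f == gold_frame else 0.0) - v / z
            for k, x in phi(f, l).items():
                grad[k] += w * x
        for k, g in grad.items():
            theta[k] += alpha * g

    # Toy example: the verb "take" with gold frame "Removing".
    cands = {"Removing": ["take.v"], "Bringing": ["take.v"]}
    phi = lambda f, l: {f"frame={f}": 1.0}
    theta = defaultdict(float)
    for _ in range(20):                     # D passes over one example
        grad_step(theta, "Removing", cands, phi)
    print(theta["frame=Removing"] > theta["frame=Bringing"])  # True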
C. Roles Matching and Recovering

After the Frame for the meta-task obtained from Re-FrameNet is identified, the semantic roles of the meta-task must be filled with the corresponding entities (expressed by nouns) in the sentence or from its sentential context. As shown in Figure 4, given steps s = ⟨s_1, ..., s_n⟩ and the Frame of each step f = ⟨f_1, ..., f_n⟩, we match and recover the missing semantic roles of each Frame r = ⟨r_1, ..., r_n⟩, where r_i = ⟨r_{i1}, ..., r_{ik_i}⟩:

    s_1: frame(f_1), role(r_{11}), role(r_{12}), ..., role(r_{1k_1})
    s_2: frame(f_2), role(r_{21}), role(r_{22}), ..., role(r_{2k_2})
    · · ·
    s_n: frame(f_n), role(r_{n1}), role(r_{n2}), ..., role(r_{nk_n})

Fig. 4. Formal description of an instruction flow.

Take the flow of instructions ⟨step 1: "go to fridge"; step 2: "open the fridge door"; step 3: "take the beer"; step 4: "close the fridge door"⟩ as an example. The third instruction (i.e., step 3) is identified as the meta-task take-Taking, whose semantic roles in Re-FrameNet include Agent, Theme, and Source. However, this instruction only explicitly specifies the role Theme (the beer), while the others are missing from it. Note that the semantic role Source can be recovered and matched with the entity fridge in the sentential context of this instruction. Therefore, the challenge of our third phase lies in the recovery of missing semantic roles. To address this challenge, we borrow ideas from the "last objects" method [16] and propose the following method:
1) For any semantic role r that is defined in Re-FrameNet but missing from a sentence s, an entity e that matches r according to the definition and has the least sentential distance from s is preferred as the value of r. Here, the sentential distance between e and r is defined as (n − m) if e and r appear in the m-th and n-th sentences of the same sentence flow respectively, with m ≤ n. For 1 ≤ k ≤ n, it is formalized as:

    r_{ki} = \arg\min_{e \in r_l} (k - l), if r_{ki} is missing and e matches r_{ki}

2) If a semantic role r cannot be recovered through 1), it is assumed that (the value of) r is unspecified, in the sense that any entity satisfying the Re-FrameNet definition of r is a default value of r under the given context. For instance, the Source role of the single sentence "put beverage in the fridge" is unspecified, and thus any entity in the class beverage can be taken as the value of Source under the context of this sentence. Obviously, all missing semantic roles of the first sentence in a flow of instructions are unspecified. In fact, given a context, not all of the semantic roles specified in FrameNet or Re-FrameNet are necessary for naturalistic language instruction understanding and task planning. (Some unspecified roles should be identified by grounding [17], [6], [18], [19], which is beyond the scope of this article.)

In general, we divide semantic matching and recovering into two cases. The first case is for zero sentential distance, i.e., recovering semantic roles based on the instruction itself. Table II shows some heuristic rules for this case, each assigning a noun of the designated dependency type to a semantic role of a meta-task. For example, according to the first rule in Table II, beverage is assigned to the semantic role Theme of the meta-task put-Placing. Similarly, fridge is assigned to the Goal of the same meta-task according to the second rule. After matching, the single instruction "put beverage in the fridge" is interpreted as an instantiated meta-task of put-Placing as follows:

    (define (meta-task put-Placing
        (:parameters robot beverage null fridge))
      ...)

In the case where a semantic role of a sentence cannot be identified within the sentence, semantic matching is conducted based on a taxonomical hierarchy, which specifies what sorts of entities can be taken as values by a semantic role. For example, the Theme role of the meta-task put-Placing should take an object that is holdable by the robot. Table III shows a part of the hierarchy for the meta-task take-Taking. Moreover, the hierarchy is extended by class-subclass relationships, as exemplified in Table IV. Consider the example sentence "take the beer" in Figure 2. The entities appearing in the context are fridge and fridge-door. In our taxonomical hierarchy, fridge-door is an instance of door, which is neither supportable nor containable. Therefore, only fridge can be a value of the Source role of take-Taking. In the case of multiple candidates for a semantic role, the nearest entity is selected. The high-level part of our hierarchy is similar to that of AfNet [18]. This is beneficial for integrating a grounding mechanism into our prototype system.

TABLE III
PART OF THE HIERARCHY FOR take-Taking

    Semantic Role  Class
    Theme          Holdable Obj
    Source         Supportable Obj ⊔ Containable Obj

TABLE IV
PART OF THE HIERARCHY FOR CLASSES

    Class   Subclass         Subsubclass
    Object  Containable Obj  fridge
    Object  Holdable Obj     beer, beverage
    Object  Supportable Obj  table
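A minimal Python sketch of the two-stage role filling just described, with abridged stand-ins for Tables II and III (the rule and class tables here are illustrative excerpts, not the full knowledge base):

    RULES = {  # (meta-task, dependency type) -> semantic role, cf. Table II
        ("put-Placing", "dobj"): "Theme",
        ("put-Placing", "prep_in"): "Goal",
        ("take-Taking", "dobj"): "Theme",
    }
    ROLE_CLASS = {  # role -> admissible classes, cf. Table III
        ("take-Taking", "Source"): {"Supportable_Obj", "Containable_Obj"},
    }
    ENTITY_CLASS = {"fridge": "Containable_Obj", "beer": "Holdable_Obj",
                    "fridge_door": "Door", "table": "Supportable_Obj"}

    def fill_roles(meta_task, deps, context):
        # deps: [(dep_type, noun)] of the current sentence; context: entities
        # of earlier sentences, nearest (smallest sentential distance) first.
        roles = {}
        for dep_type, noun in deps:                    # stage 1: Table II rules
            role = RULES.get((meta_task, dep_type))
            if role:
                roles[role] = noun
        for (mt, role), classes in ROLE_CLASS.items(): # stage 2: context recovery
            if mt == meta_task and role not in roles:
                for noun in context:                   # nearest entity first
                    if ENTITY_CLASS.get(noun) in classes:
                        roles[role] = noun
                        break
        return roles

    # "take the beer" after "go to fridge; open the fridge door":
    print(fill_roles("take-Taking", [("dobj", "beer")], ["fridge_door", "fridge"]))
    # {'Theme': 'beer', 'Source': 'fridge'}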
VI. TASK PLANNING WITH ASP

Given the meta-task semantic representation of a sentence, we generate an action sequence using OMICS and the functional-definition knowledge of common verbs (e.g., Re-FrameNet). In our previous work, we proposed the OK-planner [8] based on ASP. In that approach, all types of knowledge are converted into ASP and then an ASP solver is applied to generate an action sequence. However, that work does not consider common verbs for handling complex tasks.

In this article, we built our planner upon our previous work but additionally consider the following challenges: 1) how to define the functional knowledge of primitive actions in the Action Model and 2) how to convert the Re-FrameNet definitions of common verbs into ASP.
A. Planning with Action Model
As aforementioned, we specify the robot skills in our system by an action model, i.e., a set of primitive actions that are executable by the robot. Table V shows some basic definitions of the primitive actions for a typical service robot, though different types of robots may have different action models. Formally, each primitive action a is defined as a pair ⟨pre(a), eff(a)⟩, where pre(a) and eff(a) are the preconditions and effects of a respectively. For instance, moveto(obj) is a primitive action that tells the robot to move close to the specified object obj. The pre and eff of moveto(obj) state whether the robot is near the specified obj before and after the moveto action respectively.

Given an initial state s_0 and a possible plan a_1, . . . , a_n, the action model determines a predicted trajectory τ* = ⟨s_0, a_1, s_1, . . . , a_n, s_n⟩ by inferring all the states s_1, . . . , s_n along the execution of the action sequence during planning. For instance, given the instruction "get food from fridge", we need to generate a plan for the robot as:

    moveto(fridge, 1), open(fridge, 2), find(food, 3), pick_up(food, 4), close(fridge, 5)

TABLE V
LIST OF PRIMITIVE ACTIONS THAT CAN BE EXECUTED BY THE ROBOT

    moveto(obj, t): move to obj by using the motion planner at time t.
        pre(a): not near(robot, obj, t−1);  eff(a): near(robot, obj, t)
    find(obj, t): find obj in the environment by using vision at time t.
        pre(a): near(robot, obj, t−1);  eff(a): believe_location(robot, obj, t)
    pick_up(obj, t): pick up obj by using the robotic arm at time t.
        pre(a): near(robot, obj, t−1), believe_location(robot, obj, t−1);  eff(a): grasping(robot, obj, t)
    put_down(obj, t): put down obj on a plane in front of the robot at time t.
        pre(a): grasping(robot, obj, t−1);  eff(a): not grasping(robot, obj, t)
    open(obj, t): open obj at time t.
        pre(a): closed(obj, t−1);  eff(a): opened(obj, t)
    close(obj, t): close obj at time t.
        pre(a): opened(obj, t−1);  eff(a): closed(obj, t)

Note that the semantic representation of a user instruction can be easily converted into an ASP form [8]. All we have to do is to provide sufficient knowledge for the ASP planner. Using our Re-FrameNet definitions, an action verb is reorganized as a set of preconditions, postconditions, and invariants over the semantic roles of the action. Therefore, the remaining problem for our approach is how to convert the functional definitions of common verbs into ASP.
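As a sketch of how the action model of Table V supports trajectory prediction, the following Python code applies each primitive action's preconditions and effects to a set-of-literals state and checks the "get food from fridge" plan. Tracking near() at the location granularity (via the hypothetical LOC map) is an assumption of this sketch, not something specified in the paper:

    LOC = {"food": "fridge"}  # assumed object locations (hypothetical)

    def step(state, act, obj):
        # Apply one primitive action of Table V; return the successor state,
        # or None if a precondition fails.
        loc = LOC.get(obj, obj)
        if act == "moveto":
            state = {l for l in state if not l.startswith("near(")}
            return state | {f"near({loc})"}
        if act == "open":
            if f"closed({obj})" not in state: return None
            return (state - {f"closed({obj})"}) | {f"opened({obj})"}
        if act == "close":
            if f"opened({obj})" not in state: return None
            return (state - {f"opened({obj})"}) | {f"closed({obj})"}
        if act == "find":
            if f"near({loc})" not in state: return None
            return state | {f"believe_location({obj})"}
        if act == "pick_up":
            if not {f"near({loc})", f"believe_location({obj})"} <= state: return None
            return state | {f"grasping({obj})"}
        if act == "put_down":
            if f"grasping({obj})" not in state: return None
            return state - {f"grasping({obj})"}
        return None

    state = {"closed(fridge)"}
    plan = [("moveto", "fridge"), ("open", "fridge"), ("find", "food"),
            ("pick_up", "food"), ("close", "fridge")]
    for act, obj in plan:          # replay the predicted trajectory tau*
        state = step(state, act, obj)
        assert state is not None, (act, obj)
    print(sorted(state))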
B. Conversion of Functional Knowledge
Let α be a common verb (word sense). The set of linguistic variables of α's frame is denoted by Θ(α), and the set of properties and relations over Θ(α) occurring in the functional definitions of the verbs belonging to α's Frame is denoted by Σ(α). Given a task task_α based on the common verb α:

    (:meta-task α (:parameters (p_1 X_1) · · · (p_h X_h)))

where X_1, . . . , X_h ∈ Θ(α) and p_1, . . . , p_h are predicates over a set X of variables, each constraint of the common verb α can be converted to a set of ASP rules w.r.t. the task task_α as follows.

1. A precondition

    (:precond α (conj (disj l_1 · · · l_n) · · · (disj l'_1 · · · l'_m)))

is converted to the following ASP rules, one constraint per disjunct, forbidding the case where all literals of that disjunct are false at the start time t:

    ← process(task_α, t, t'), not true(l_1, t), . . . , not true(l_n, t), t < t', p_1(X_1), . . . , p_h(X_h)
    · · ·
    ← process(task_α, t, t'), not true(l'_1, t), . . . , not true(l'_m, t), t < t', p_1(X_1), . . . , p_h(X_h)

2. A postcondition (:postcond α (conj (disj l_1 · · · l_n) · · · (disj l'_1 · · · l'_m))) is converted to the analogous rules with the literals evaluated at the end time t':

    ← process(task_α, t, t'), not true(l_1, t'), . . . , not true(l_n, t'), t < t', p_1(X_1), . . . , p_h(X_h)
    · · ·
    ← process(task_α, t, t'), not true(l'_1, t'), . . . , not true(l'_m, t'), t < t', p_1(X_1), . . . , p_h(X_h)

3. An invariant (:invariant α (conj (disj l_1 · · · l_n) · · · (disj l'_1 · · · l'_m))) is converted to rules over every intermediate time t'' with t ≤ t'' ≤ t':

    ← process(task_α, t, t'), not true(l_1, t''), . . . , not true(l_n, t''), t < t', t ≤ t'', t'' ≤ t', p_1(X_1), . . . , p_h(X_h)
    · · ·
    ← process(task_α, t, t'), not true(l'_1, t''), . . . , not true(l'_m, t''), t < t', t ≤ t'', t'' ≤ t', p_1(X_1), . . . , p_h(X_h)

4. A disjunction of invariants

    (disj (:invariant α (conj (disj l_1 · · · l_n) · · · (disj l'_1 · · · l'_m)))
          (:invariant α (conj (disj l*_1 · · · l*_n) · · · (disj l'*_1 · · · l'*_m))))

is converted by deriving an auxiliary atom f (respectively f*) whenever the first (respectively second) invariant is violated, and then forbidding that both are violated:

    f ← process(task_α, t, t'), not true(l_1, t''), . . . , not true(l_n, t''), t < t', t ≤ t'', t'' ≤ t', p_1(X_1), . . . , p_h(X_h)
    · · ·
    f ← process(task_α, t, t'), not true(l'_1, t''), . . . , not true(l'_m, t''), t < t', t ≤ t'', t'' ≤ t', p_1(X_1), . . . , p_h(X_h)
    f* ← process(task_α, t, t'), not true(l*_1, t''), . . . , not true(l*_n, t''), t < t', t ≤ t'', t'' ≤ t', p_1(X_1), . . . , p_h(X_h)
    · · ·
    f* ← process(task_α, t, t'), not true(l'*_1, t''), . . . , not true(l'*_m, t''), t < t', t ≤ t'', t'' ≤ t', p_1(X_1), . . . , p_h(X_h)
    ← f, f*

After all pieces of knowledge have been converted into ASP rules, the ASP solver iclingo [20], a combination of Gringo and clasp for incremental grounding and solving, is used to incrementally ground the rules above and search for answer sets, from which a plan can be computed [8].
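To illustrate conversion pattern 1, here is a minimal Python sketch that emits one ASP constraint string per disjunct of a precondition. The task label, literals, and guard predicates are illustrative, and a real encoding would also handle ASP variable conventions and grounding:

    def precond_to_asp(task, disjuncts, params):
        # Each disjunct is a list of literals; emit one ASP constraint per
        # disjunct forbidding the case where all of its literals are false.
        rules = []
        guard = ", ".join(f"{p}({x})" for p, x in params)
        for literals in disjuncts:
            body = ", ".join(f"not true({l}, t)" for l in literals)
            rules.append(f":- process({task}, t, t1), {body}, t < t1, {guard}.")
        return rules

    for r in precond_to_asp(
            "task_put_placing",
            [["at(Theme, Source)"], ["portable(Theme)"], ["object(Theme)"]],
            [("holdable_obj", "Theme")]):
        print(r)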
VII. EXPERIMENTS

We empirically evaluate our system with three experiments. The first experiment was devised to investigate the performance of our SMR (i.e., Semantic Matching and Recovering) method. The second experiment aimed to test the performance of the whole system when different open knowledge bases were used; we also analyzed the main factors that may affect the performance. Finally, we demonstrate how our approach can be deployed on our KeJia robot to solve instruction understanding problems in two domestic scenarios. Additionally, we present our long-term effort on applying the proposed techniques in the RoboCup@Home competitions.
TABLE VI
RESULTS OF TRANSLATION OVER THE TWO TEST SETS FROM FRAMENET AND OMICS

    Syntactic       Data      P      R      F
    Verb            OMICS     97.61  81.83  89.03
    Entities        OMICS     80.32  67.33  73.25

    Identification  Data      P      R      F
    Frame           OMICS     84.31  61.43  71.07
    Frame           FrameNet  80.98  79.05  80.00
    Semantic Roles  OMICS     78.00  53.71  63.62
A. Experiments with SMR
To test our SMR method, we collected 191,740 examples annotated with frame-semantic structures for the frame identification model from the FrameNet lexicon and 470 examples from OMICS. Then, we parsed each sentence with the Stanford parser. Finally, we only selected those examples whose LU is a verb or a verb phrase. As a result, the training data contains 70,149 examples, and the test data contains 18,183 examples from FrameNet and 630 examples from OMICS. In our experiments, the frame identification model instantiates 76,289 binary features.

Table VI shows the results for each part of the translation of hierarchical instructions. The performance is evaluated by Precision (P), Recall (R), and F1 (F), defined as Precision = TP/(TP + FP), Recall = TP/T, and F1 = 2 × Precision × Recall/(Precision + Recall), where TP stands for the number of sentences parsed correctly, FP is the number of sentences parsed wrongly, and T is the total size of the dataset.

As we can see from the results, the syntactic results have very high precision and F values, which benefits the meta-task identification phase. However, syntactic parsing does not disambiguate the meaning of a verb (e.g., the verb "get" has two meanings: "Getting: get the food" and "Motion: get to the room"). The meta-task identification obtains an F value of 80.00 on the FrameNet data and 71.07 on the OMICS data. Moreover, the overall translation system maintains quite high precision but relatively low recall, due to data sparseness and the one-meta-task assumption.
B. Experiments on OMICS
The experiments on OMICS were divided into two tests. Test 1 was conducted on 11,885 user tasks from the Tasks/Steps table and Test 2 on 467 user desires from the Help table.

Test 1 consisted of four rounds. In the first round, only the definitions of the 11,885 tasks from the Tasks/Steps table and a small action model AM representing the basic perception and manipulation skills of a robot were used. Specifically, AM contained only 6 primitive actions: move, find, pick_up, put_down, open, and close. Synonymy knowledge from FrameNet was used in the second to fourth rounds of Test 1. In the third and fourth rounds, rewritten knowledge from Re-FrameNet was considered with our SMR technique. However, in the third round, missing roles were not recovered from the context.
TABLE VII
EXPERIMENTAL RESULTS OVER THE 11,885 USER TASKS

    Test 1          AM    FN    SMR 0  SMR 1
    Tasksteps       134   150   —      618
    Tasksteps+      157   174   —      756
    Percent(%)      1.32  1.46  —      6.36
    GroundTruth(%)  *     *     63.75  64

Fig. 5. Influences of the Frames in Re-FrameNet in Test 1 (panels (a) and (b); vertical axis: the number of tasks).

Table VII shows the experimental results of Test 1. The second row shows the numbers of tasks that were successfully planned by the global planner with tasks/steps in the four rounds. The third row shows the total numbers of tasks that were successfully planned in the four rounds. The fourth row shows the percentages of successfully planned tasks with respect to the total number of tested tasks. Since there are no ground truth data for OMICS, we randomly drew 80 and 100 samples from the last two rounds respectively and verified them manually. It turned out that 51 and 64 of them, respectively, were correct. As shown in the fifth row of Table VII, the correctness percentage decreased when Re-FrameNet was used, but the number of correctly planned tasks still increased remarkably. Moreover, we can see that the overall performance improved when the semantic roles of common verbs were used, much better than the state-of-the-art solution [8].

As shown in Figure 5, the number of successfully planned tasks gradually increased as more frames were added to the algorithm. It also shows that some frames cannot be mapped into the robot's actions (e.g., Mass_motion and Waiting). The main reason is the limit of the robot's primitive actions.
Table IX reports the main types of failures that we observed in Test 1. Specifically, a Parsed Failure occurred for 3027 tasks because the semantic matching and recovering procedure failed to retrieve any frame from Re-FrameNet (RFN) for the task. An RFN Failure occurred for 4394 tasks, due to the fact that Re-FrameNet contains only 43 frames; together with the parsed failures, 7421 tasks could not be used to generate a plan by the robot. A Global Planning Failure occurs when a task/step t cannot be planned and none of the following conditions hold: t is a primitive action, t is semantically equivalent to a meta-task in Re-FrameNet, or t is semantically equivalent to another task in the Tasks/Steps table. In total, 3527 tasks failed in this category. A Local Planning Failure occurs when the solver (in Algorithm 3) is launched but fails to generate any plan. Further study reveals that these two sorts of planning failures are mainly due to a lack of knowledge/skills.
Test 2 was conducted on the 467 user desires from the Help table of OMICS. The experimental results are shown in Table VIII. As we can see, the success rates were higher than in the corresponding rounds of Test 1. In particular, the success rate is as high as 81% in the last round. This is because a desire can be met by various tasks, which can be different from one another. Therefore, the knowledge used in the rounds of Test 2 was much richer than that in Test 1.

TABLE VIII
EXPERIMENTAL RESULTS OVER THE 467 USER DESIRES

    Test 2          AM    FN    SMR 0  SMR 1
    Help            244   247   299    —
    Help+Tasksteps  254   261   358    —
    Percent(%)      54.4  55.9  76.7   81

TABLE IX
INFLUENCES OF THE MAIN FACTORS OF FAILURE IN TEST 1

    Failure                  Number  Percent (%)
    Parsed Failure           3027    26.7
    RFN Failure              4394    38.8
    Global Planning Failure  3527    31.2
    Local Planning Failure   378     3.3

Notice that the overall performance increased about 5 times in Test 1 and about 50% in Test 2 when the semantic roles of common verbs and Re-FrameNet were used. There are two main reasons for this improvement. Firstly, the rewritten knowledge of common verbs in Re-FrameNet fills knowledge gaps caused by the lack of definitions of these verbs in OMICS. Secondly, Re-FrameNet and SMR made about 76% and 24% contributions, respectively, to the improvement of the success rate in task planning.
C. Case Study on KeJia Robot
We conducted a case study of our system with the KeJia robot. As shown in Figures 6 and 7, our KeJia robot is based on a two-wheel driving chassis of 62cm × ×.
1) Scenario 1: As shown in Figure 6, a toy and a toy box were placed on the floor. Our KeJia robot was asked by a user to "clean up toys". Note that, with only this instruction, the robot is unable to complete the task because the action "clean up" is under-specified. In our system, the robot first extracted the subtasks of the task "clean up toys" based on the knowledge in OMICS. By doing so, the tuple ⟨task: "clean up toys"; step 1: "pick up toys from floor"; step 2: "put toys in toybox"⟩ was generated. Then, our SMR method matched and recovered the semantic roles of each step in the tuple as:

    (define (task clean_up (toys))
      (:subtasks pick_up-Pick_up (:parameters toys floor))
      (:subtasks put-Placing (:parameters toys floor toybox)))

After that, our planner sequentially processed each subtask. In this phase, since pick_up is a primitive action, the first subtask can be directly executed by our robot. For the second subtask, we generated a plan given the definition of the meta-task put-Placing:

    (define (meta-task put-Placing
        (:parameters ?Agent ?Theme ?Source ?Goal))
      (:precondition (at Theme Source))
      (:precondition (conj (portable Theme) (object Theme)))
      (:postcondition (at Theme Goal)))

In this scenario, the plan generated by the planner for this subtask is shown in Figures 6(c) and 6(d). At this point, the task "clean up toys" is solved by our system, and finally the entire plan is executed by the robot to complete the task.

Fig. 6. Execution of the task "clean up toys": (a) (move(loc(floor)), 1); (b) (pick_up(toy), 2); (c) (move(loc(toybox)), 3); (d) (put_down(toy), 4). Subfigures (a) and (b) are the plan for "pick up toys from floor"; (c) and (d) for "put toys in toybox".
2) Scenario 2: As shown in Figure 7, a user told the robot that he "had a headache". This was identified as a user desire. Similar to the previous scenario, our system first extracted a series of help tasks for the user desire, such as "with pain medication", "give them an aspirin", etc. Then, our SMR method matched and recovered the semantic roles of each help task. In this scenario, our planner failed to plan for the task "with pain medication" but successfully recovered the Source element and generated a plan for the task "give them an aspirin". The list of actions for the plan of this task is illustrated in Figure 7.

Fig. 7. Execution of "give them an aspirin" for the desire "have a headache": (a) (move(loc(aspirin)), 1); (b) (pick_up(aspirin), 2); (c) (move(loc(them)), 3); (d) (put_down(aspirin), 4).

TABLE X
SCORES OF ALL ROBOCUP@HOME BENCHMARK TESTS

    Competition   top 1  top 2   top 3   top 4  top 5
    RoboCup 2013  4767   4645    3622    3155   3066
    Team Name     WE     NimbRo  TU/e    Homer  BORG
    RoboCup 2014  9305   5701    5656    4842   3417
    Team Name     WE     TU/e    NimbRo  Tobi   Pumas
    RoboCup 2015  750    651     647     562    359
    Team Name     WE     Homer   TU/e    Tobi   Pumas
A video demo of the two scenarios above with our KeJia robot is available at: https://youtu.be/A4GBXHG0l74
3) RoboCup@Home: This is an international annual competition for domestic service robots and is part of the RoboCup event. In this competition, a set of benchmark tests is used to evaluate the robots' abilities and performance in a realistic, non-standardized home environment. The benchmark test most related to this article is the General Purpose Service Robot (GPSR) test, which requires a robot to solve tasks, requested in natural language and randomly generated by the referees, during the competition.

In the RoboCup@Home competitions of the past three years, our team, WrightEagle (WE) [21], got the 1st place once and the 2nd place twice. Table X shows the total scores of the top 5 teams in the benchmark tests (without the final stage). It can be seen from the results that our team (i.e., WE) performed very well in the competitions. In particular, in the GPSR tests, the performance of our system was competitive compared to the other top teams, as shown in Table XI.

Although there are generally many factors contributing to success in the RoboCup@Home competitions, our robot did benefit substantially from the proposed system described in this article for processing user instructions and generating plans. The competitions motivated us to develop a general-purpose system for understanding user instructions in natural language and also provide a good testbed for such systems.

TABLE XI
SCORES OF THE GPSR BENCHMARK TESTS

    GPSR Test     top 1   top 2   top 3  top 4  top 5
    RoboCup 2013  900     500     450    250    250
    Team Name     NimbRo  Pumas   WE     TU/e   Tobi
    RoboCup 2014  750     700     500    0      0
    Team Name     WE      NimbRo  TU/e   Tobi   Pumas
    RoboCup 2015  105     60      30     30     20
    Team Name     Tobi    WE      TU/e   Homer  Skuba
VIII. RELATED WORK

To date, many approaches to instruction understanding and task planning for service robots have been proposed in the literature. For instance, several integrated systems [2], [16], [22] for natural language understanding have been introduced to enable robots to complete tasks given instructions in natural language. However, they all assume that instructions are fully specified for their domains and do not consider semantic disambiguation of verbs and their roles. Approaches have also been proposed to manually create environment-driven instructions for grounding user instructions in natural language to robots' actions [10], [23]. However, these methods cannot scale to a large number of tasks because each task needs to be manually specified in an environment, and they are not suitable for different types of robots (e.g., robots with different arm configurations).

To improve generality and scalability, researchers have tried to exploit online knowledge and learn large-scale knowledge representations to build general-purpose systems for instruction understanding. For example, Lemaignan et al. [24], [25] have tried to understand and reason about knowledge around an action model using online knowledge for robots. It is worth pointing out that we previously proposed an integrated system [8] for our KeJia robot consisting of multi-mode NLP, integrated decision-making, and open knowledge searching.

The approaches most related to ours are the ones using OMICS for robots to complete household tasks. The first attempt to utilize OMICS to accomplish a household task is [26], which proposed a generative model based on Markov chain techniques. Later on, [27], [28], [29] presented a system called KNOWROB for processing knowledge in order to achieve more flexible and general behavior. Most recently, we proposed a formal description of the knowledge gaps between user instructions and the local knowledge of a robotic system for instruction understanding [30], [8], [31], [32]. However, in these efforts using OMICS for robot task planning with user instructions, common verbs are normally not defined in the knowledge base, which limits their performance in utilizing existing open knowledge. Thus, our work is proposed to address this weakness of state-of-the-art methods.
IX. CONCLUSIONS

This article proposed a general-purpose system for service robots handling large-scale user instructions in natural language. The key problem that we addressed is how to map primitive tasks into robot actions using the semantic roles of common verbs provided by semantic dictionaries, a common resource of open knowledge in linguistics. To solve this problem, we proposed a novel approach for semantic matching and recovering. Furthermore, we utilized the semantic roles of common verbs defined in semantic dictionaries to handle the under-specification of naturalistic language instructions in task planning. Empirical evaluation and analysis show good performance on two test sets consisting of 11,885 user tasks and 467 user desires collected from OMICS. Moreover, we developed a prototype system deployed on our KeJia robot and demonstrated our techniques in two typical scenarios. Notably, our system has been used in the RoboCup@Home competitions and has shown good performance in the benchmark tests over the past three years.

Here, we conclude with the following findings:
1) The overall performance of our system improved when Re-FrameNet was used. As shown by our experimental results, both the knowledge in Re-FrameNet and the SMR technique contributed to the improvement, indicating that rewritten knowledge of common verbs and recovering semantic roles from context are useful for naturalistic instruction understanding and planning.
2) The computational efficiency of our system can be improved using the hierarchism of user instructions and knowledge. As shown by our case study, instruction understanding and task planning can be done in realtime on our robot, given that task-decomposition knowledge such as OMICS is used for efficient global planning and costly local planning is limited to a small number of low-level tasks defined in Re-FrameNet.

In the future, we plan to develop techniques to learn extra knowledge unavailable from user input, such as knowledge about robot manipulation, action configurations at a finer granularity than semantic roles, and, most importantly, grounding. Moreover, we will investigate methods to automatically generate a large set of Re-FrameNet definitions for robot tasks.
REFERENCES

[1] X. Chen, J. Ji, J. Jiang, G. Jin, F. Wang, and J. Xie, "Developing high-level cognitive functions for service robots," in Proceedings of the 9th International Conference on Autonomous Agents and Multiagent Systems, 2010.
[2] J. Dzifcak, M. Scheutz, C. Baral, and P. Schermerhorn, "What to do and how to do it: Translating natural language directives into temporal and dynamic logic representation for goal management and action execution," in IEEE International Conference on Robotics and Automation (ICRA), 2009, pp. 4163–4168.
[3] T. Kollar, S. Tellex, D. Roy, and N. Roy, "Toward understanding natural language directions," in Proceedings of the 5th ACM/IEEE International Conference on Human-Robot Interaction, 2010.
[4] D. Nyga and M. Beetz, "Everything robots always wanted to know about housework (but were afraid to ask)," in IEEE/RSJ International Conference on Intelligent Robots and Systems, 2012.
[5] A. Saxena, A. Jain, O. Sener, A. Jami, D. K. Misra, and H. S. Koppula, "RoboBrain: Large-scale knowledge engine for robots," in International Symposium of Robotics Research, 2014.
[6] S. Tellex, T. Kollar, S. Dickerson, M. Walter, A. Banerjee, S. Teller, and N. Roy, "Understanding natural language commands for robotic navigation and mobile manipulation," in Proceedings of the National Conference on Artificial Intelligence, 2011.
[7] R. Gupta and M. Kochenderfer, "Common sense data acquisition for indoor mobile robots," in Proceedings of the 19th National Conference on Artificial Intelligence, San Jose, California, USA, 2004, pp. 605–610.
[8] X. Chen, J. Ji, Z. Sui, and J. Xie, "Handling open knowledge for service robots," in Proceedings of the Twenty-Third International Joint Conference on Artificial Intelligence, 2013.
[9] M. P. West, A General Service List of English Words: with Semantic Frequencies and a Supplementary Word-List for the Writing of Popular Science and Technology. Longmans, Green, 1953.
[10] D. Misra, J. Sung, K. Lee, and A. Saxena, "Tell me Dave: Context-sensitive grounding of natural language to manipulation instructions," The International Journal of Robotics Research, 2014.
[11] M. Gelfond and V. Lifschitz, "The stable model semantics for logic programming," in Proceedings of the 5th International Conference on Logic Programming (ICLP), 1988, pp. 1070–1080.
[12] C. F. Baker, C. J. Fillmore, and J. B. Lowe, "The Berkeley FrameNet project," in Proceedings of the 17th International Conference on Computational Linguistics. Association for Computational Linguistics, 1998, pp. 86–90.
[13] P. Bogaards, "Dictionaries for learners of English," International Journal of Lexicography, vol. 9, no. 4, pp. 277–320, 1996.
[14] M.-C. de Marneffe, B. MacCartney, and C. D. Manning, "Generating typed dependency parses from phrase structure parses," in Proceedings of the 5th International Conference on Language Resources and Evaluation (LREC-06). Genoa, Italy: ELRA/ELDA Paris, 2006, pp. 449–454.
[15] M.-C. de Marneffe and C. D. Manning, "The Stanford typed dependencies representation," in Proceedings of the COLING 2008 Workshop on Cross-framework and Cross-domain Parser Evaluation. Manchester, UK: ACL, 2008, pp. 1–8.
[16] R. Cantrell, M. Scheutz, P. Schermerhorn, and X. Wu, "Robust spoken instruction understanding for HRI," in Proceedings of the 5th ACM/IEEE International Conference on Human-Robot Interaction, 2010.
[17] T. Kollar, V. Perera, D. Nardi, and M. Veloso, "Learning environmental knowledge from task-based human-robot dialog," in Proc. of the IEEE International Conference on Robotics and Automation, 2013.
[18] K. M. Varadarajan and M. Vincze, "AfRob: The affordance network ontology for robots," in IEEE/RSJ International Conference on Intelligent Robots and Systems, 2012.
[19] T. Williams, R. Cantrell, G. Briggs, P. Schermerhorn, and M. Scheutz, "Grounding natural language references to unvisited and hypothetical locations," in Proceedings of the Twenty-Seventh AAAI Conference on Artificial Intelligence, Bellevue, Washington, USA, 2013.
[20] M. Gebser, R. Kaminski, B. Kaufmann, M. Ostrowski, T. Schaub, and S. Thiele, "Engineering an incremental ASP solver," in Logic Programming. Springer, 2008, pp. 190–205.
[21] A. Bai, F. Wu, and X. Chen, "Towards a principled solution to simulated robot soccer," in Proceedings of the Robot Soccer World Cup XVI Symposium (RoboCup), Mexico City, Mexico, 2012, pp. 141–153.
[22] H. Kress-Gazit, G. E. Fainekos, and G. J. Pappas, "Temporal-logic-based reactive mission and motion planning," IEEE Transactions on Robotics, vol. 25, no. 6, pp. 1370–1381, 2009.
[23] S. Hemachandra, M. Walter, S. Tellex, and S. Teller, "Learning spatial-semantic representations from natural language descriptions and scene classifications," in IEEE International Conference on Robotics and Automation, 2014, pp. 2623–2630.
[24] S. Lemaignan, "Grounding the interaction: Knowledge management for interactive robots," KI – Künstliche Intelligenz, pp. 1–3, 2012.
[25] S. Lemaignan, R. Ros, E. Sisbot, R. Alami, and M. Beetz, "Grounding the interaction: Anchoring situated discourse in everyday human-robot interaction," International Journal of Social Robotics, vol. 4, no. 2, pp. 181–199, 2012.
[26] C. Shah and R. Gupta, "Building plans for household tasks from distributed knowledge," in Proceedings of the 19th International Joint Conference on Artificial Intelligence (IJCAI 2005) Workshop on Modeling Natural Action Selection, 2005.
[27] M. Tenorth, L. Kunze, D. Jain, and M. Beetz, "KnowRob-Map: Knowledge-linked semantic object maps," in 10th IEEE-RAS International Conference on Humanoid Robots (Humanoids), 2010, pp. 430–435.
[28] L. Kunze, M. Tenorth, and M. Beetz, "Putting people's common sense into knowledge bases of household robots," in KI 2010: Advances in Artificial Intelligence. Springer, 2010, pp. 151–159.
[29] M. Tenorth and M. Beetz, "KnowRob: A knowledge processing infrastructure for cognition-enabled robots," The International Journal of Robotics Research, vol. 32, no. 5, pp. 566–590, 2013.
[30] X. Chen, J. Xie, J. Ji, and Z. Sui, "Toward open knowledge enabling for human-robot interaction," Journal of Human-Robot Interaction, vol. 1, no. 2, pp. 100–117, 2012.
[31] J. Xie and X. Chen, "Understanding instructions on large scale for human-robot interaction," in Proceedings of the 2014 IEEE/WIC/ACM International Joint Conferences on Web Intelligence (WI) and Intelligent Agent Technologies (IAT), Volume 03. IEEE Computer Society, 2014, pp. 175–182.
[32] J. Xie, X. Chen, and J. Ji, "Multi-mode natural language processing for human-robot interaction," in