Understanding User Instructions by Utilizing Open Knowledge for Service Robots
Dongcai Lu, Feng Wu∗, and Xiaoping Chen

Abstract—Understanding user instructions in natural language is an active research topic in AI and robotics. Typically, natural user instructions are high-level and can be reduced into low-level tasks expressed in common verbs (e.g., 'take', 'get', 'put'). For robots understanding such instructions, one of the key challenges is to process high-level user instructions and achieve the specified tasks with the robots' primitive actions. To address this, we propose novel algorithms that utilize the semantic roles of common verbs defined in semantic dictionaries and integrate multiple sources of open knowledge to generate task plans. Specifically, we present a new method for matching and recovering the semantics of user instructions and a novel task planner that exploits functional knowledge of the robot's action model. To verify and evaluate our approach, we implemented a prototype system using knowledge from several open resources. Experiments on our system confirmed the correctness and efficiency of our algorithms. Notably, our system has been deployed in the KeJia robot, which participated in the annual RoboCup@Home competitions in the past three years and achieved encouragingly high scores in the benchmark tests.
Index Terms—Service Robots, Human-Robot Interaction, Natural Language Understanding, Task Planning.
I. INTRODUCTION

Nowadays, service robots can do more and more work in our daily life, such as moving around in a house, fetching drinks or medicine for elderly people, or preparing food for a family. They are smart and can do many complex tasks autonomously. Nevertheless, when robots encounter user requests or tasks in an open-ended form (e.g., through dialogs in natural language), they often fail to respond properly, not only due to possible language processing failures but also due to the challenges of task planning with incomplete knowledge. For example, as illustrated in Figure 1, the daily instruction "clean up toys" is challenging for a robot to process because the action "clean up" is under-specified, and "have a headache" is also nontrivial for a robot to offer help with, without grounding the helping verb (i.e., knowing how to help). These are common tasks in domestic scenarios, and therefore it is desirable for service robots to be able to complete such tasks given user instructions in natural language.

Fig. 1. Examples of robot tasks for user instructions in natural language. (a) "Clean up toys": the task "clean up toys" with the steps "pick up toys from floor" and "put toys in the toybox". (b) "Have a headache": the desire "have a headache" with the actions "give him an aspirin" and "with pain medication".

Typically, user instructions are action-directed in the sense that the fundamental purpose of an instruction is to specify what users want a robot to do for them. This indicates a connection between robot understanding (i.e., knowing what the users said) and acting (i.e., doing what the users asked). In other words, understanding an instruction means that the robot is able to generate a plan (i.e., a sequence of actions) for the tasks specified in the instruction [1], [2], [3], [4], [5], [6]. Therefore, it is crucial for the robot to have knowledge about the tasks and actions in order to do planning. However, some knowledge may be missing in the instruction (e.g., "have a headache" does not directly indicate that the robot should give the user an aspirin). Consequently, the robot does not know how to act when such instructions are presented.

Fortunately, more and more common knowledge is available in open resources, such as the
Open Mind Indoor Common Sense (OMICS) database [7], wikihow, WordNet, and many other digital dictionaries. In these dictionaries, actions are often hierarchical, where a high-level action is composed of several lower-level actions. Similarly, user instructions are often specified hierarchically, in which an action is referred to by an action verb or verb phrase. For instance, "clean up a house" may indicate a series of subtasks such as "clean the table", "clean the floor", etc. Therefore, commonsense knowledge about the hierarchical relations between tasks and subtasks is useful for instruction understanding.

In our previous studies [8], we found that a user instruction representing a high-level task can usually be reduced into a sequence of low-level subtasks, using hierarchical knowledge in open resources. Furthermore, we observed that this reduction procedure often ends up at so-called primitive tasks (i.e., low-level subtasks expressed in common verbs [9]). For instance, in OMICS, "serve a drink from fridge" is reduced into a sequence of low-level subtasks expressed in common verbs, such as "go to fridge", "open the fridge door", and "take the drink", where 'go', 'open', and 'take' are common verbs. Ideally, if all of the primitive tasks in the reduction can be directly mapped into the robot's actions, the robot can simply complete the task by executing those actions.

However, it is generally nontrivial to map primitive tasks to a robot's actions. One of the key challenges is that there is little knowledge in most open resources about common verbs, and furthermore about how they can be executed by a robot with its actions. To avoid this challenge, most of the existing approaches [3], [4], [10] manually create a small set of hand-coded robot actions for primitive tasks, though their scalability (i.e., they only work for small problems) and generality (i.e., they only work for specific domains) are limited. To build a general-purpose system for handling large-scale user instructions, we directly tackle this challenge and consider the following problems: 1) how to define the semantics (meanings) of common verbs, and how to match and recover such semantics in user instructions, and 2) how to handle a large number of instructions and generate plans in realtime using open knowledge resources.

To address these problems, we propose a novel system for service robots to 1) process user instructions based on the semantic roles of common verbs defined in semantic dictionaries, and 2) then generate plans for the corresponding tasks of the user instructions. The semantic roles suggest possible entities in the knowledge representation that may be missing from or omitted in natural instructions. In more detail, we introduce a heuristic method to match and recover missing semantic roles from the context of user instructions. Then, we use a planner based on Answer Set Programming (ASP) [11] that exploits the definitions of common verbs in terms of semantic roles and generates a plan for the task specified in the user instruction. By putting them together, we built a general-purpose system for service robots that can handle large-scale user instructions using open commonsense knowledge.

To evaluate our approach, we conducted a corpus-based experiment on two test sets with 11,885 user tasks and 467 user desires collected from OMICS. We also developed a prototype system and ran a case study on a service robot in two typical domestic scenarios. Our experimental results show a substantial improvement in performance on user instruction understanding.
It is worth pointing out that the proposed system has been successfully deployed in our KeJia robot, which participated annually in the RoboCup@Home competition (http://ai.ustc.edu.cn/en/robocup/atHome/index.php) and won first place once and second place twice in the past three years. During the benchmark tests of the RoboCup@Home competitions, our system was used by our robot for understanding the instructions in English given by the referees and completing the corresponding tasks. This confirms the usefulness of our system in practice.

The remainder of this article is organized as follows. Section II introduces our problem and Section III presents an overview of our system. Then, Section IV proposes our main algorithms, followed by Sections V and VI describing the two key techniques used in our algorithms. Next, Section VII reports our experimental results. Finally, Section VIII briefly reviews the related work and Section IX concludes.

II. PROBLEM STATEMENT
We aim to build a general-purpose system so that the robot can understand user instructions and provide service for the user. To this end, we must solve the problem of generating a sequence of primitive actions, which can be directly executed by a robot, given user instructions in natural language. For example, when a user says "please serve a meal for me", the robot will take the meal, put it on a plate, and place the plate on a table; when a user says "I am thirsty", the robot will take a drink from the fridge and deliver it to the user. To achieve this, our system must be able to extract a task from a user instruction in natural language (i.e., knowing what the user said) and generate an executable plan for the task (i.e., doing what the user asked). In other words, natural language understanding and task planning must be combined systematically in order to solve our problem.

In the next section, we give an overview of our system for instruction understanding and task planning, which is built by integrating different modules.

III. SYSTEM OVERVIEW

Fig. 2. System architecture.
The overall architecture of our system is shown in Figure 2. As we can see, the human-robot dialog system transcribes spoken utterances into text sentences and manages the dialog with users. Each sentence in the dialog is then transferred to the Processing module, which generates a sequence of primitive actions for the task expressed in natural language. After that, a sequence of commands corresponding to each primitive action is computed by the Motion Planning module. Finally, the commands are executed by the Robot Control module.

Here, we focus on the Processing module, which takes a text sentence as its input and outputs a sequence of primitive actions that are executable by the robot. The main components of our Processing module are described in detail as follows.
A. Open Knowledge
As shown in Figure 2, we use open knowledge both for Natural Language Processing (NLP) and task planning. The open knowledge considered in our system includes OMICS, FrameNet (https://framenet.icsi.berkeley.edu/fndrupal/), and Re-FrameNet (http://ai.ustc.edu.cn/en/research/reframenet.php), as introduced below.

OMICS [7] is an extensive collection of knowledge for indoor service robots gathered from internet users. Currently, it contains 48 tables capturing different sorts of knowledge, among which the Help and Tasks/Steps tables are most useful for our system. Each tuple of the Help table maps a user desire to a task that may meet the desire (e.g., ⟨"feel thirsty", "by offering drink"⟩). Each tuple of the Tasks/Steps table decomposes a task into several steps (e.g., ⟨"serve a drink", 0. "get a glass", 1. "get a bottle", 2. "fill glass from bottle", 3. "give glass to person"⟩). Given this, OMICS offers useful knowledge about the hierarchism of naturalistic instructions, where a high-level user request (e.g., "serve a drink") can be reduced to lower-level tasks (e.g., "get a glass", ...). Another feature of OMICS is that the elements of any tuple in an OMICS table are semantically related according to a predefined template. This facilitates the semantic interpretation of the OMICS tuples.

FrameNet is a digital dictionary providing rich semantic information for action verbs. It groups action verbs into Frames and specifies word definitions in terms of semantic roles called Frame Elements (FEs) for each Frame [12]. Although the connections between an action verb and its semantic roles are useful for resolving the under-specification of naturalistic instructions, this knowledge cannot be directly used by robots since it is not formalized in FrameNet. To overcome this difficulty, we developed Re-FrameNet, a formalized version of FrameNet created by rewriting part of the FrameNet knowledge in a formal meta-language.

Specifically, in Re-FrameNet, a Frame of FrameNet is formalized as a meta-task and re-defined by a set of preconditions, postconditions, invariants, and/or steps over the semantic roles of the meta-task. In the definition, FEs (i.e., semantic roles) such as Theme, Source, and Goal of the Frame are taken as meta-variables. Therefore, the definition of a meta-task specifies the common semantic structure of the action verbs in the corresponding Frame. For example, the meta-task put-Placing is defined as:

    (define (meta-task put-Placing
        (:parameters ?Agent ?Theme ?Source ?Goal))
      (:precondition ...)
      (:postcondition ...)
      (:invariant ...))

where all action verbs in the Frame Placing (e.g., lay, heap, deposit) share the same definition. When a robot tries to plan with put-Placing as its action verb (verb sense) for an instruction, our NLP components will try to extract appropriate entities for every semantic role specified in the definition of the meta-task put-Placing (see Section V for more detail).

It is worth noting that common verbs are normally not explained in the aforementioned open resources because most of them belong to the so-called General Service List (GSL), a list of roughly 2000 of the most frequent English words [9]. The GSL is taken as the defining vocabulary of dictionaries such as the Longman Dictionary of Contemporary English, based on the notion that words should be defined using "terms less abstruse than the word that is to be explained" [13]. As a result, there are few definitions of the GSL verbs in OMICS or other digital dictionaries.

B. NLP Module
This module maps a user instruction in natural language I to the OMICS tables, which contain tuples ⟨task, steps⟩ for task-oriented instructions or tuples ⟨desire, task⟩ for desire-oriented instructions (see Section IV for more detail). The output is a logical form L passed to the Planning module, containing a frame-semantic representation such as:

    (meta-task take-Taking (:parameters food fridge))

Specifically, interpreting I into L is done in three steps: 1) dependency parsing, which analyzes the dependencies of each word in a sentence; 2) frame-semantic parsing, which identifies the verb's frame; and 3) semantic matching and recovering, which fills the semantic roles for a given frame. Each step is described in detail in Section V.
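To make the three-step interpretation concrete, the following minimal Python sketch stubs out the pipeline on the running example. The helper names and the toy dependency list are illustrative only, not the system's actual API:

    def dependency_parse(sentence):
        # Step 1 (stub): typed dependencies as (relation, governor, dependent).
        return [("dobj", "take", "food"), ("prep_out_of", "take", "fridge")]

    def identify_frame(verb, deps):
        # Step 2 (stub): map the verb to a unique Frame (Section V-B).
        return {"take": "take-Taking"}.get(verb)

    def match_roles(frame, deps):
        # Step 3 (stub): fill semantic roles from the dependencies (Section V-C).
        rules = {("take-Taking", "dobj"): "Theme",
                 ("take-Taking", "prep_out_of"): "Source"}
        return {rules[(frame, rel)]: dep
                for rel, gov, dep in deps if (frame, rel) in rules}

    def interpret(sentence):
        deps = dependency_parse(sentence)   # step 1: dependency parsing
        verb = deps[0][1]                   # the frame-evoking verb
        frame = identify_frame(verb, deps)  # step 2: frame-semantic parsing
        roles = match_roles(frame, deps)    # step 3: matching and recovering
        return (frame, roles)               # the logical form L

    print(interpret("take food out of fridge"))
    # ('take-Taking', {'Theme': 'food', 'Source': 'fridge'})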
C. Planning Module

The Planning module takes as input the logical form L of the user instruction, the online knowledge base (e.g., Re-FrameNet, WordNet, FrameNet), domain knowledge, and the robot's skills. The output of the Planning module is a high-level plan for the Motion Planning module.

We employ both a global and a local planner in the Planning module. The global planner searches through the whole task-decomposition knowledge in OMICS to generate a plan. However, most of the tasks in OMICS cannot be decomposed into the robot's primitive actions because many steps in OMICS are referred to by common verbs, for which OMICS does not contain decomposition knowledge. For example, verbs such as take, place, put, get, and turn frequently occur in task steps, but there is no knowledge in OMICS about how the robot can execute them. Therefore, a local planner based on ASP is used for planning based on merely the instruction itself.

Note that the local planner is incapable of generating a plan for under-specified terms in an instruction. Therefore, the common verbs referred to by the instruction must be specified first in order to generate a plan. Fortunately, semantic dictionaries such as FrameNet provide rich knowledge about common verbs. In Re-FrameNet, we reorganize the definition of an action verb as a set of preconditions, postconditions, and invariants over the semantic roles of the action (a.k.a. the functional definition of the action). Given this, a planner based on ASP can plan actions for the instruction using the formalized functional definition of an action. Section VI will give more detail about our planning method.
D. Skills and Action Model
For a robot, we define an Action Model to specify its skills. Specifically, an Action Model consists of several primitive actions. Each primitive action a is defined by a set of preconditions, postconditions, and invariants, similar to the definition of a common verb in Re-FrameNet. In other words, they specify, respectively, the conditions under which a can be executed, the conditions that hold when a finishes, and the conditions that must be satisfied during the execution of a. Indeed, a primitive action is the formal specification of a robot skill. As we will show in later sections, the Action Model is useful for our system to generate a plan that is executable by the robot.
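For illustration, a primitive action of the Action Model could be represented as follows. This is a minimal sketch with hypothetical field and predicate names, not the system's actual data structure:

    from dataclasses import dataclass
    from typing import FrozenSet

    @dataclass(frozen=True)
    class PrimitiveAction:
        name: str
        precondition: FrozenSet[str]             # must hold when the action starts
        postcondition: FrozenSet[str]            # holds when the action finishes
        invariant: FrozenSet[str] = frozenset()  # must hold throughout execution

    pick_up = PrimitiveAction(
        name="pick_up",
        precondition=frozenset({"near(robot, obj)", "believe_location(robot, obj)"}),
        postcondition=frozenset({"grasping(robot, obj)"}),
    )
    print(pick_up.name, sorted(pick_up.precondition))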
Algorithm 1 SolveTask(task t, ActionModel AM)

    global gSeen := ∅  /* prevents an infinite recursive loop when the search explores a task it is already expanding */
    initialize worldmodel and plans
    if t ∈ gSeen then return null end if
    gSeen := gSeen ∪ {t}
    subTasks := FindSubTasks(t)  /* find subtasks of task t from the Tasks/Steps table in OMICS */
    for each task s in subTasks do
        if GeneratePlans(s, AM) = null then
            FoundEqualTask := False
            while there is a new t' from the Tasks/Steps table semantically equivalent to s do
                if SolveTask(t', AM) ≠ null then
                    FoundEqualTask := True
                    plans.append(SolveTask(t', AM))
                    worldmodel := simulator(worldmodel, plans)
                    break
                end if
            end while
            if FoundEqualTask = False then return null end if
        else
            plans.append(GeneratePlans(s, AM))
            worldmodel := simulator(worldmodel, plans)
        end if
    end for
    return plans  /* all steps have been successfully planned */
E. Learning Module

In this module, methods such as log-linear models, Conditional Random Fields (CRFs), and Learning from Demonstration (LfD) are used to learn the robot's low-level skills. Intuitively, the more skills a robot possesses, the more capable it is. For example, unless a robot knows how to pour water into a cup, it cannot finish a high-level task such as "make a coffee" (with the task-step tuple ⟨"make a coffee", 0. "put hot water in a cup", 1. "pour the coffee"⟩). In this paper, we assume that our robot has all the necessary low-level skills to complete a task specified by user instructions, though most of the skills must be learned one by one in practice. The learning methods for robot skills are interesting but beyond the scope of this article.

After introducing our system as a whole, we describe our main algorithms for instruction understanding next.

IV. UNDERSTANDING USER INSTRUCTIONS
There are two types of user instructions that we consider in this article: 1) task-oriented instructions (e.g., "serve a meal") and 2) desire-oriented instructions (e.g., "I am thirsty"). In OMICS, a task-oriented instruction is represented as a tuple ⟨t, s⟩, where s = ⟨s_1, s_2, ..., s_n⟩ is the sequence of n steps to complete the task t. For example, given the task t = "serve a meal", a sequence of steps may be s = ⟨s_1: "take the meal", s_2: "put it on a plate", s_3: "place the plate on a table"⟩. Similarly, a desire-oriented instruction is represented as a tuple ⟨d, t⟩, where t is the task corresponding to the user desire d. For instance, given the user desire d = "I am thirsty", the task for a robot may be t = "serve a drink". Indeed, in most domestic scenarios, a user instruction is usually either task-oriented or desire-oriented. Now, we turn to our algorithms for generating a plan for each of these two types of user instructions.

Algorithm 2 SolveHelp(desire t, ActionModel AM)

    AllHelps := FindHelpsMaptoDesire(t)  /* find all help tasks mapped to desire t */
    for each help task s in AllHelps do
        if GeneratePlans(s, AM) = null then
            for each task gs in the Tasks/Steps table do
                if gs is semantically equivalent to s then
                    return SolveTask(gs, AM)
                end if
            end for
        else
            return GeneratePlans(s, AM)
        end if
    end for
    return null

Algorithm 3 GeneratePlans(task t, ActionModel AM)

    /* generate a plan for low-level task t */
    sem := SemanticMatchAndRecover(t)
    if sem.frame = null then return null end if
    if sem.frame ∈ AM then
        return sem.frame(sem.parameters)
    else
        gRFN := FindRFNBySem(sem.frame)  /* find the definition of sem.frame in Re-FrameNet */
        Res := solver(gRFN, sem, AM)  /* compute a plan from the rules of gRFN, sem, and AM */
        if Res ≠ null then return Res else return null end if
    end if

Algorithm 1 is used to process task-oriented instructions by utilizing the Tasks/Steps table in OMICS. The input is a naturally expressed task t and the robot's action model AM, and the output is a sequence of primitive actions plans. Specifically, it first finds all subtasks of task t from the Tasks/Steps table of OMICS. Then, it tries to generate a plan (i.e., a sequence of primitive actions) for each subtask. If a plan is successfully generated, the plan is added to the plan list plans and the simulator advances to the next subtask. Otherwise, it searches the Tasks/Steps table of OMICS again for all Semantically Equivalent (SE) tasks of that subtask until one of the SE tasks is successfully planned. (For example, the tasks "give someone an object" and "take an object to someone" are semantically equivalent.) If there is no SE task or none of the SE tasks can be successfully planned, null is returned to indicate the failure of task planning. After all subtasks are successfully planned, plans is returned and executed by the robot.

Algorithm 2 is used to process desire-oriented instructions by utilizing the Help table in OMICS. Similarly, the input is a desire and an action model and the output is a plan. Specifically, it first finds a list of help tasks offering the corresponding help when given a desire. Then, it tries to plan for each of the help tasks by checking whether the help task can be successfully planned with a sequence of primitive actions. If so, the resulting plan is returned. Otherwise, it searches the Tasks/Steps table in OMICS for an SE task of the help task and calls Algorithm 1 to generate the plan.

Notice that both Algorithms 1 and 2 depend on Algorithm 3 to generate a plan for a low-level task t. Algorithm 3 first performs semantic role matching and recovering for task t and outputs a frame and its roles. If no verb frame is identified, the process terminates with null, as no plan can be generated. If the frame is a primitive action, the frame plus its roles are returned. Otherwise, the frame is evoked by a common verb. In this case, Algorithm 3 first finds the definition of sem.frame in Re-FrameNet and translates it to a set of rules. After that, it computes a plan based on the rules of gRFN, the frame sem, and the action model AM (see the sketch below).

Now, the key procedures in Algorithm 3 are: 1) how to do semantic role matching and recovering given a task expressed in natural language, and 2) how to compute a plan given a set of rules, a frame, and an action model. The details of these two procedures are described in Sections V and VI respectively.
V. SEMANTIC MATCHING AND RECOVERING

We propose a three-phase procedure to translate a user instruction expressed in natural language into an internal representation that can be handled by our planner. Firstly, a probabilistic syntactic parser is used to retrieve the dependencies of the instruction. Secondly, the frame of the sentence's verb is identified by frame-semantic parsing. Here, without loss of generality, we assume that each instruction represents just a single task (verb). Thirdly, the semantic roles of the frame are recovered and filled as much as possible with the matched entities appearing in the instruction or its sentential context, and the result is represented as a meta-task in Re-FrameNet. More details about our three-phase procedure are described below.
A. Dependency Parsing
We use the Stanford parser [14] in the first phase, which produces the Stanford typed dependencies between the words in a sentence. These dependencies indicate the grammatical relations between words in terms of the name of the relation, the governor, and the dependent [15]. Figure 3 illustrates the parsing of the sentence "take food out of refrigerator". The edge of type dobj denotes that the noun "food" is the direct object of the verb "take". The verb "take" also governs the noun "refrigerator" via the typed dependency prep_out_of. Since the typed dependency between a verb and a noun reveals their semantic-role relation, the syntactic structure of an instruction is used for our semantic role matching and recovering.
B. Frame Semantic Parsing
Fig. 3. Stanford typed dependencies of "take food out of fridge".

Given that a verb varies in its senses, an instruction may represent different meanings and therefore can be mapped to different frames in FrameNet. For instance, the verb "take" can represent the Frame Bring or Removing under different contexts. The Stanford parser does not disambiguate verb senses. Therefore, we propose a Frame Semantic Parsing method to map a verb to a unique Frame. Specifically, we define a frame identification model and train the model with data from FrameNet and OMICS as described below.
1) Model: Given a sentence x = ⟨x_1, . . . , x_n⟩ with a frame-evoking verb v, we seek the most likely Frame f* in the frame identification stage. Let F be the set of candidate Frames for v, L the set of verbs found in the FrameNet annotations, and L_f ⊆ L the subset of verbs annotated as evoking the Frame f. Frame identification can be formalized by the following prediction rule:

    f^* = \arg\max_{f \in \mathcal{F}} \sum_{l \in L_f} p(f, l \mid v, x)

For f ∈ F and l ∈ L_f, a conditional log-linear model is used to model the probability p(f, l | v, x; θ):

    p(f, l \mid v, x; \theta) = \frac{\exp[\theta \cdot \Phi(f, l, v, x)]}{\sum_{f' \in \mathcal{F}} \sum_{l' \in L_{f'}} \exp[\theta \cdot \Phi(f', l', v, x)]}

where θ · Φ(f, l, v, x) is the inner product \sum_{i=1}^{M} \theta_i \Phi_i(f, l, v, x) and θ is the parameter vector over the feature function Φ with M dimensions.

Generally, the feature function allows for a variety of (possibly overlapping) features. A feature Φ_i may relate a frame f to a verb v, representing a lexical-semantic relationship.
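As a minimal illustration of this prediction rule, the following Python sketch scores every (frame, lexical unit) pair with a toy log-linear model, marginalizes over the lexical units, and returns the most probable Frame. The feature templates and weights are invented for the example:

    import math
    from collections import defaultdict

    def phi(frame, lu, verb, sentence):
        # Sparse binary features; a real model would use many more.
        return {f"frame={frame}&verb={verb}": 1.0,
                f"frame={frame}&lu={lu}": 1.0,
                f"frame={frame}&ctx={'of' in sentence}": 1.0}

    def identify_frame(theta, verb, sentence, candidates):
        # candidates: {frame f: lexical units L_f}. Returns argmax_f sum_l p(f, l | v, x).
        scores = {(f, l): math.exp(sum(theta[k] * v for k, v in
                                       phi(f, l, verb, sentence).items()))
                  for f, lus in candidates.items() for l in lus}
        z = sum(scores.values())
        frame_prob = defaultdict(float)
        for (f, l), s in scores.items():
            frame_prob[f] += s / z          # marginalize over lexical units
        return max(frame_prob, key=frame_prob.get)

    theta = defaultdict(float, {"frame=Removing&verb=take": 1.2,
                                "frame=Removing&ctx=True": 0.8,
                                "frame=Bringing&verb=take": 0.9})
    print(identify_frame(theta, "take",
                         ["take", "food", "out", "of", "fridge"],
                         {"Removing": ["take.v", "remove.v"],
                          "Bringing": ["take.v", "bring.v"]}))
    # Removing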
2) Data: Our training and test sets come from the FrameNet lexicon and OMICS. The FrameNet lexicon is a taxonomy of manually identified general-purpose Frames in English. Listed in the lexicon with each Frame are several lemmas (with part of speech) that can denote the Frame or some aspect of it; these are often called Lexical Units (LUs). Table I shows some examples of our training and test sets.
TABLE I
DATA COLLECTED FROM FRAMENET AND ANNOTATED FROM OMICS

    Data      Size    Example                                                       Verb      LU          Frame
    FrameNet  191740  i want to bring your daughter up to the prison               bring     bring.v     Bringing
                      i was visited by one of the king's most important officials  visited   visit.v     Arriving
                      cutting his wrist and jumping from a third-floor window      cutting   cut.v       Cause_harm
    OMICS     1100    remove objects from surface                                  remove    remove.v    Removing
                      complete the dance together                                  complete  complete.v  Activity_finish

TABLE II
HEURISTIC RULES FOR SEMANTIC ROLE FILLING WITHIN A SENTENCE

    Meta-task            Dependency Type  Semantic Role
    put-Placing          dobj             Theme
    put-Placing          prep_in          Goal
    take-Removing        dobj             Theme
    take-Removing        prep_from        Source
    dry-Cause_to_be_dry  dobj             Dryee
    deliver-Delivery     prep_to          Recipient
    · · ·

3) Training: Given the training subset of the data in the form ⟨x_j, v_j, f_j, s_j⟩, j = 1, . . . , N, where N is the number of sentences, we discriminatively train the frame identification model by maximizing the following log-likelihood function:

    \max_\theta \sum_{j=1}^{N} \log \sum_{l \in L_f^j} p(f_j, l \mid v_j, x_j)

Specifically, we optimize it using a distributed version of the gradient ascent algorithm with initial value θ_0:

    for k = 0 .. D − 1:
        for i = 1 .. M:
            \theta_i := \theta_i + \alpha \, \frac{\partial}{\partial \theta_i} \sum_{j=1}^{N} \log \sum_{l \in L_f^j} p(f_j, l \mid v_j, x_j)

where D is a parameter that controls the number of passes over the training data, M is the number of features, and N is the total size of our training set.

Note that the computational complexity of the algorithm above is O(D × M × N). When the number of features is large, it is costly to train our model sequentially. In order to update the parameter of a feature f faster, we consider only the N_f training examples that contain f instead of all N. Hence, the computational complexity becomes O(D × M × N_f), where N_f is usually much smaller than N.
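The following runnable Python sketch illustrates the gradient computation behind the update rule above on a toy example: the gradient of log Σ_l p(f_j, l | v_j, x_j) is the feature expectation under the gold frame minus the expectation under all candidate (frame, LU) pairs. All data and features here are invented for illustration:

    import math
    from collections import defaultdict

    def scores(theta, cands, phi):
        return {(f, l): math.exp(sum(theta[k] * v for k, v in phi(f, l).items()))
                for f, lus in cands.items() for l in lus}

    def grad_step(theta, gold_frame, cands, phi, alpha=0.5):
        s = scores(theta, cands, phi)
        z = sum(s.values())
        zg = sum(v for (f, _), v in s.items() if f == gold_frame)
        grad = defaultdict(float)
        for (f, l), v in s.items():
            # E_gold[phi] - E_model[phi], accumulated feature by feature
            w = (v / zg if f == gold_frame else 0.0) - v / z
            for k, x in phi(f, l).items():
                grad[k] += w * x
        for k, g in grad.items():
            theta[k] += alpha * g

    # Toy example: the verb "take" with gold frame "Removing".
    cands = {"Removing": ["take.v"], "Bringing": ["take.v"]}
    phi = lambda f, l: {f"frame={f}": 1.0}
    theta = defaultdict(float)
    for _ in range(20):                     # D passes over one example
        grad_step(theta, "Removing", cands, phi)
    print(theta["frame=Removing"] > theta["frame=Bringing"])  # True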
C. Roles Matching and Recovering

After the Frame for the meta-task obtained from Re-FrameNet is identified, the semantic roles of the meta-task must be filled with the corresponding entities (expressed by nouns) in the sentence or from its sentential context. As shown in Figure 4, given steps s = ⟨s_1, ..., s_n⟩ and the Frame of each step f = ⟨f_1, ..., f_n⟩, we match and recover the missing semantic roles of each Frame r = ⟨r_1, ..., r_n⟩, where r_i = ⟨r_{i1}, ..., r_{ik_i}⟩:

    s_1: frame(f_1), role(r_{11}), role(r_{12}), ..., role(r_{1k_1})
    s_2: frame(f_2), role(r_{21}), role(r_{22}), ..., role(r_{2k_2})
    · · ·
    s_n: frame(f_n), role(r_{n1}), role(r_{n2}), ..., role(r_{nk_n})

Fig. 4. Formal description of an instruction flow.

Take the flow of instructions ⟨step 1: "go to fridge"; step 2: "open the fridge door"; step 3: "take the beer"; step 4: "close the fridge door"⟩ as an example. The third instruction (i.e., step 3) is identified as the meta-task take-Taking, whose semantic roles in Re-FrameNet include Agent, Theme, and Source. However, this instruction only explicitly specifies the role Theme (the beer), while the others are missing from it. Note that the semantic role Source can be recovered and matched with the entity fridge in the sentential context of this instruction. Therefore, the challenge of our third phase lies in the recovery of missing semantic roles. To address this challenge, we borrow ideas from the "last objects" method [16] and propose the following method:
1) For any semantic role r that is defined in Re-FrameNet but missing from a sentence s, an entity e that matches r according to the definition and has the least sentential distance from s is preferred as the value of r. Here, the sentential distance between e and r is defined as (n − m) if e and r appear in the m-th and n-th sentences of the same sentence flow respectively, with m ≤ n. For 1 ≤ k ≤ n, it is formalized as:

    r_{ki} = \arg\min_{e \in r_l} (k - l), if r_{ki} is missing and e matches r_{ki}

2) If a semantic role r cannot be recovered through 1), it is assumed that (the value of) r is unspecified, in the sense that any entity satisfying the Re-FrameNet definition of r is a default value of r under the given context. For instance, the Source role of the single sentence "put beverage in the fridge" is unspecified, and thus any entity in the class beverage can be taken as the value of Source under the context of this sentence. Obviously, all missing semantic roles of the first sentence in a flow of instructions are unspecified. In fact, given a context, not all of the semantic roles specified in FrameNet or Re-FrameNet are necessary for naturalistic language instruction understanding and task planning. (Some unspecified roles should be identified by grounding [17], [6], [18], [19], which is beyond the scope of this article.)

In general, we divide semantic matching and recovering into two cases. The first case is for zero sentential distance, i.e., recovering semantic roles based on the instruction itself. Table II shows some heuristic rules for this case, each assigning a noun of the designated dependency type to a semantic role of a meta-task. For example, according to the first rule in Table II, beverage is assigned to the semantic role Theme of the meta-task put-Placing. Similarly, fridge is assigned to the Goal of the same meta-task according to the second rule. After matching, the single instruction "put beverage in the fridge" is interpreted as an instantiated meta-task of put-Placing as follows:

    (define (meta-task put-Placing
        (:parameters robot beverage null fridge))
      ...)

In the case where a semantic role of a sentence cannot be identified within the sentence, semantic matching is conducted based on a taxonomical hierarchy, which specifies what sorts of entities can be taken as values by a semantic role. For example, the Theme role of the meta-task put-Placing should take an object that is holdable by the robot. Table III shows a part of the hierarchy for the meta-task take-Taking. Moreover, the hierarchy is extended by class-subclass relationships, as exemplified in Table IV. Consider the example sentence "take the beer" in Figure 2. The entities appearing in the context are fridge and fridge-door. In our taxonomical hierarchy, fridge-door is an instance of door, which is neither supportable nor containable. Therefore, only fridge can be a value of the Source role of take-Taking. In the case of multiple candidates for a semantic role, the nearest entity is selected. The high-level part of our hierarchy is similar to that of AfNet [18]. This is beneficial for integrating a grounding mechanism into our prototype system.

TABLE III
PART OF THE HIERARCHY FOR take-Taking

    Semantic Role  Class
    Theme          Holdable Obj
    Source         Supportable Obj ⊔ Containable Obj

TABLE IV
PART OF THE HIERARCHY FOR CLASSES

    Class   Subclass         Subsubclass
    Object  Containable Obj  fridge
    Object  Holdable Obj     beer, beverage
    Object  Supportable Obj  table
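A minimal Python sketch of the two-stage role filling just described, with abridged stand-ins for Tables II and III (the rule and class tables here are illustrative excerpts, not the full knowledge base):

    RULES = {  # (meta-task, dependency type) -> semantic role, cf. Table II
        ("put-Placing", "dobj"): "Theme",
        ("put-Placing", "prep_in"): "Goal",
        ("take-Taking", "dobj"): "Theme",
    }
    ROLE_CLASS = {  # role -> admissible classes, cf. Table III
        ("take-Taking", "Source"): {"Supportable_Obj", "Containable_Obj"},
    }
    ENTITY_CLASS = {"fridge": "Containable_Obj", "beer": "Holdable_Obj",
                    "fridge_door": "Door", "table": "Supportable_Obj"}

    def fill_roles(meta_task, deps, context):
        # deps: [(dep_type, noun)] of the current sentence; context: entities
        # of earlier sentences, nearest (smallest sentential distance) first.
        roles = {}
        for dep_type, noun in deps:                    # stage 1: Table II rules
            role = RULES.get((meta_task, dep_type))
            if role:
                roles[role] = noun
        for (mt, role), classes in ROLE_CLASS.items(): # stage 2: context recovery
            if mt == meta_task and role not in roles:
                for noun in context:                   # nearest entity first
                    if ENTITY_CLASS.get(noun) in classes:
                        roles[role] = noun
                        break
        return roles

    # "take the beer" after "go to fridge; open the fridge door":
    print(fill_roles("take-Taking", [("dobj", "beer")], ["fridge_door", "fridge"]))
    # {'Theme': 'beer', 'Source': 'fridge'}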
VI. TASK PLANNING WITH ASP

Given the meta-task semantic representation of a sentence, we generate an action sequence using OMICS and the functional-definition knowledge of common verbs (e.g., Re-FrameNet). In our previous work, we proposed the OK-planner [8] based on ASP. In that approach, all types of knowledge are converted into ASP and then an ASP solver is applied to generate an action sequence. However, that work does not consider common verbs for handling complex tasks.

In this article, we built our planner upon our previous work but additionally consider the following challenges: 1) how to define the functional knowledge of primitive actions in the Action Model and 2) how to convert the Re-FrameNet definitions of common verbs into ASP.
A. Planning with Action Model
As aforementioned, we specify the robot skills in our system by an action model, i.e., a set of primitive actions that are executable by the robot. Table V shows some basic definitions of the primitive actions for a typical service robot, though different types of robots may have different action models. Formally, each primitive action a is defined as a pair ⟨pre(a), eff(a)⟩, where pre(a) and eff(a) are the preconditions and effects of a respectively. For instance, moveto(obj) is a primitive action that tells the robot to move close to the specified object obj. The pre and eff of moveto(obj) state whether the robot is near the specified obj before and after the moveto action respectively.

Given an initial state s_0 and a possible plan a_1, . . . , a_n, the action model determines a predicted trajectory τ* = ⟨s_0, a_1, s_1, . . . , a_n, s_n⟩ by inferring all the states s_1, . . . , s_n along the execution of the action sequence during planning. For instance, given the instruction "get food from fridge", we need to generate a plan for the robot as:

    moveto(fridge, 1), open(fridge, 2), find(food, 3), pick_up(food, 4), close(fridge, 5)

TABLE V
LIST OF PRIMITIVE ACTIONS THAT CAN BE EXECUTED BY THE ROBOT

    moveto(obj, t): move to obj by using the motion planner at time t.
        pre(a): not near(robot, obj, t−1);  eff(a): near(robot, obj, t)
    find(obj, t): find obj in the environment by using vision at time t.
        pre(a): near(robot, obj, t−1);  eff(a): believe_location(robot, obj, t)
    pick_up(obj, t): pick up obj by using the robotic arm at time t.
        pre(a): near(robot, obj, t−1), believe_location(robot, obj, t−1);  eff(a): grasping(robot, obj, t)
    put_down(obj, t): put down obj on a plane in front of the robot at time t.
        pre(a): grasping(robot, obj, t−1);  eff(a): not grasping(robot, obj, t)
    open(obj, t): open obj at time t.
        pre(a): closed(obj, t−1);  eff(a): opened(obj, t)
    close(obj, t): close obj at time t.
        pre(a): opened(obj, t−1);  eff(a): closed(obj, t)

Note that the semantic representation of a user instruction can be easily converted into an ASP form [8]. All we have to do is to provide sufficient knowledge for the ASP planner. Using our Re-FrameNet definitions, an action verb is reorganized as a set of preconditions, postconditions, and invariants over the semantic roles of the action. Therefore, the remaining problem for our approach is how to convert the functional definitions of common verbs into ASP.
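As a sketch of how the action model of Table V supports trajectory prediction, the following Python code applies each primitive action's preconditions and effects to a set-of-literals state and checks the "get food from fridge" plan. Tracking near() at the location granularity (via the hypothetical LOC map) is an assumption of this sketch, not something specified in the paper:

    LOC = {"food": "fridge"}  # assumed object locations (hypothetical)

    def step(state, act, obj):
        # Apply one primitive action of Table V; return the successor state,
        # or None if a precondition fails.
        loc = LOC.get(obj, obj)
        if act == "moveto":
            state = {l for l in state if not l.startswith("near(")}
            return state | {f"near({loc})"}
        if act == "open":
            if f"closed({obj})" not in state: return None
            return (state - {f"closed({obj})"}) | {f"opened({obj})"}
        if act == "close":
            if f"opened({obj})" not in state: return None
            return (state - {f"opened({obj})"}) | {f"closed({obj})"}
        if act == "find":
            if f"near({loc})" not in state: return None
            return state | {f"believe_location({obj})"}
        if act == "pick_up":
            if not {f"near({loc})", f"believe_location({obj})"} <= state: return None
            return state | {f"grasping({obj})"}
        if act == "put_down":
            if f"grasping({obj})" not in state: return None
            return state - {f"grasping({obj})"}
        return None

    state = {"closed(fridge)"}
    plan = [("moveto", "fridge"), ("open", "fridge"), ("find", "food"),
            ("pick_up", "food"), ("close", "fridge")]
    for act, obj in plan:          # replay the predicted trajectory tau*
        state = step(state, act, obj)
        assert state is not None, (act, obj)
    print(sorted(state))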
B. Conversion of Functional Knowledge
Let α be a common verb (word sense). The set of linguistic variables of α's frame is denoted by Θ(α), and the set of properties and relations over Θ(α) occurring in the functional definitions of the verbs belonging to α's Frame is denoted by Σ(α). Given a task task_α based on the common verb α:

    (:meta-task α (:parameters (p_1 X_1) · · · (p_h X_h)))

where X_1, . . . , X_h ∈ Θ(α) and p_1, . . . , p_h are predicates over a set X of variables, each constraint of the common verb α can be converted to a set of ASP rules w.r.t. the task task_α as follows.

1. A precondition

    (:precond α (conj (disj l_1 · · · l_n) · · · (disj l'_1 · · · l'_m)))

is converted to the following ASP rules, one constraint per disjunct, forbidding the case where all literals of that disjunct are false at the start time t:

    ← process(task_α, t, t'), not true(l_1, t), . . . , not true(l_n, t), t < t', p_1(X_1), . . . , p_h(X_h)
    · · ·
    ← process(task_α, t, t'), not true(l'_1, t), . . . , not true(l'_m, t), t < t', p_1(X_1), . . . , p_h(X_h)

2. A postcondition (:postcond α (conj (disj l_1 · · · l_n) · · · (disj l'_1 · · · l'_m))) is converted to the analogous rules with the literals evaluated at the end time t':

    ← process(task_α, t, t'), not true(l_1, t'), . . . , not true(l_n, t'), t < t', p_1(X_1), . . . , p_h(X_h)
    · · ·
    ← process(task_α, t, t'), not true(l'_1, t'), . . . , not true(l'_m, t'), t < t', p_1(X_1), . . . , p_h(X_h)

3. An invariant (:invariant α (conj (disj l_1 · · · l_n) · · · (disj l'_1 · · · l'_m))) is converted to rules over every intermediate time t'' with t ≤ t'' ≤ t':

    ← process(task_α, t, t'), not true(l_1, t''), . . . , not true(l_n, t''), t < t', t ≤ t'', t'' ≤ t', p_1(X_1), . . . , p_h(X_h)
    · · ·
    ← process(task_α, t, t'), not true(l'_1, t''), . . . , not true(l'_m, t''), t < t', t ≤ t'', t'' ≤ t', p_1(X_1), . . . , p_h(X_h)

4. A disjunction of invariants

    (disj (:invariant α (conj (disj l_1 · · · l_n) · · · (disj l'_1 · · · l'_m)))
          (:invariant α (conj (disj l*_1 · · · l*_n) · · · (disj l'*_1 · · · l'*_m))))

is converted by deriving an auxiliary atom f (respectively f*) whenever the first (respectively second) invariant is violated, and then forbidding that both are violated:

    f ← process(task_α, t, t'), not true(l_1, t''), . . . , not true(l_n, t''), t < t', t ≤ t'', t'' ≤ t', p_1(X_1), . . . , p_h(X_h)
    · · ·
    f ← process(task_α, t, t'), not true(l'_1, t''), . . . , not true(l'_m, t''), t < t', t ≤ t'', t'' ≤ t', p_1(X_1), . . . , p_h(X_h)
    f* ← process(task_α, t, t'), not true(l*_1, t''), . . . , not true(l*_n, t''), t < t', t ≤ t'', t'' ≤ t', p_1(X_1), . . . , p_h(X_h)
    · · ·
    f* ← process(task_α, t, t'), not true(l'*_1, t''), . . . , not true(l'*_m, t''), t < t', t ≤ t'', t'' ≤ t', p_1(X_1), . . . , p_h(X_h)
    ← f, f*

After all pieces of knowledge have been converted into ASP rules, the ASP solver iclingo [20], a combination of Gringo and clasp for incremental grounding and solving, is used to incrementally ground the rules above and search for answer sets, from which a plan can be computed [8].
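To illustrate conversion pattern 1, here is a minimal Python sketch that emits one ASP constraint string per disjunct of a precondition. The task label, literals, and guard predicates are illustrative, and a real encoding would also handle ASP variable conventions and grounding:

    def precond_to_asp(task, disjuncts, params):
        # Each disjunct is a list of literals; emit one ASP constraint per
        # disjunct forbidding the case where all of its literals are false.
        rules = []
        guard = ", ".join(f"{p}({x})" for p, x in params)
        for literals in disjuncts:
            body = ", ".join(f"not true({l}, t)" for l in literals)
            rules.append(f":- process({task}, t, t1), {body}, t < t1, {guard}.")
        return rules

    for r in precond_to_asp(
            "task_put_placing",
            [["at(Theme, Source)"], ["portable(Theme)"], ["object(Theme)"]],
            [("holdable_obj", "Theme")]):
        print(r)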
VII. EXPERIMENTS

We empirically evaluate our system with three experiments. The first experiment was devised to investigate the performance of our SMR (i.e., Semantic Matching and Recovering) method. The second experiment aimed to test the performance of the whole system when different open knowledge bases were used; we also analyzed the main factors that may affect the performance. Finally, we demonstrate how our approach can be deployed on our KeJia robot to solve instruction understanding problems in two domestic scenarios. Additionally, we present our long-term effort on applying the proposed techniques in the RoboCup@Home competitions.
TABLE VI
RESULTS OF TRANSLATION OVER THE TWO TEST SETS FROM FRAMENET AND OMICS

    Syntactic       Data      P      R      F
    Verb            OMICS     97.61  81.83  89.03
    Entities        OMICS     80.32  67.33  73.25

    Identification  Data      P      R      F
    Frame           OMICS     84.31  61.43  71.07
    Frame           FrameNet  80.98  79.05  80.00
    Semantic Roles  OMICS     78.00  53.71  63.62
A. Experiments with SMR
To test our SMR method, we collected 191,740 examples annotated with frame-semantic structures for the frame identification model from the FrameNet lexicon and 470 examples from OMICS. Then, we parsed each sentence with the Stanford parser. Finally, we only selected those examples whose LU is a verb or a verb phrase. As a result, the training data contains 70,149 examples, and the test data contains 18,183 examples from FrameNet and 630 examples from OMICS. In our experiments, the frame identification model instantiates 76,289 binary features.

Table VI shows the results for each part of the translation of hierarchical instructions. The performance is evaluated by Precision (P), Recall (R), and F1 (F), defined as Precision = TP/(TP + FP), Recall = TP/T, and F1 = 2 × Precision × Recall/(Precision + Recall), where TP stands for the number of sentences parsed correctly, FP is the number of sentences parsed wrongly, and T is the total size of the dataset.

As we can see from the results, the syntactic results have very high precision and F values, which benefits the meta-task identification phase. However, syntactic parsing does not disambiguate the meaning of a verb (e.g., the verb "get" has two meanings: "Getting: get the food" and "Motion: get to the room"). The meta-task identification obtains an F value of 80.00 on the FrameNet data and 71.07 on the OMICS data. Moreover, the overall translation system maintains quite high precision but relatively low recall, due to data sparseness and the one-meta-task assumption.
B. Experiments on OMICS
The experiments on OMICS were divided into two tests. Test 1 was conducted on 11,885 user tasks from the Tasks/Steps table and Test 2 on 467 user desires from the Help table.

Test 1 consisted of four rounds. In the first round, only the definitions of the 11,885 tasks from the Tasks/Steps table and a small action model AM representing the basic perception and manipulation skills of a robot were used. Specifically, AM contained only 6 primitive actions: move, find, pick_up, put_down, open, and close. Synonymy knowledge from FrameNet was used in the second to fourth rounds of Test 1. In the third and fourth rounds, rewritten knowledge from Re-FrameNet was considered with our SMR technique. However, in the third round, missing roles were not recovered from the context.
TABLE VII
EXPERIMENTAL RESULTS OVER THE 11,885 USER TASKS

    Test 1          AM    FN    SMR 0  SMR 1
    Tasksteps       134   150   —      618
    Tasksteps+      157   174   —      756
    Percent(%)      1.32  1.46  —      6.36
    GroundTruth(%)  *     *     63.75  64

Fig. 5. Influences of the Frames in Re-FrameNet in Test 1 (panels (a) and (b); vertical axis: the number of tasks).

Table VII shows the experimental results of Test 1. The second row shows the numbers of tasks that were successfully planned by the global planner with tasks/steps in the four rounds. The third row shows the total numbers of tasks that were successfully planned in the four rounds. The fourth row shows the percentages of successfully planned tasks with respect to the total number of tested tasks. Since there are no ground truth data for OMICS, we randomly drew 80 and 100 samples from the last two rounds respectively and verified them manually. It turned out that 51 and 64 of them, respectively, were correct. As shown in the fifth row of Table VII, the correctness percentage decreased when Re-FrameNet was used, but the number of correctly planned tasks still increased remarkably. Moreover, we can see that the overall performance improved when the semantic roles of common verbs were used, much better than the state-of-the-art solution [8].

As shown in Figure 5, the number of successfully planned tasks gradually increased as more frames were added to the algorithm. It also shows that some frames cannot be mapped into the robot's actions (e.g., Mass_motion and Waiting). The main reason is the limit of the robot's primitive actions.
Table IX reports the main types of failures that we observed in Test 1. Specifically, a Parsed Failure occurred for 3027 tasks because the semantic matching and recovering procedure failed to retrieve any frame from Re-FrameNet (RFN) for the task. An RFN Failure occurred for 4394 tasks, due to the fact that Re-FrameNet contains only 43 frames; together with the parsed failures, 7421 tasks could not be used to generate a plan by the robot. A Global Planning Failure occurs when a task/step t cannot be planned and none of the following conditions hold: t is a primitive action, t is semantically equivalent to a meta-task in Re-FrameNet, or t is semantically equivalent to another task in the Tasks/Steps table. In total, 3527 tasks failed in this category. A Local Planning Failure occurs when the solver (in Algorithm 3) is launched but fails to generate any plan. Further study reveals that these two sorts of planning failures are mainly due to a lack of knowledge/skills.
Test 2 was conducted on the 467 user desires from the Help table of OMICS. The experimental results are shown in Table VIII. As we can see, the success rates were higher than in the corresponding rounds of Test 1. In particular, the success rate is as high as 81% in the last round. This is because a desire can be met by various tasks, which can be different from one another. Therefore, the knowledge used in the rounds of Test 2 was much richer than that in Test 1.

TABLE VIII
EXPERIMENTAL RESULTS OVER THE 467 USER DESIRES

    Test 2          AM    FN    SMR 0  SMR 1
    Help            244   247   299    —
    Help+Tasksteps  254   261   358    —
    Percent(%)      54.4  55.9  76.7   81

TABLE IX
INFLUENCES OF THE MAIN FACTORS OF FAILURE IN TEST 1

    Failure                  Number  Percent (%)
    Parsed Failure           3027    26.7
    RFN Failure              4394    38.8
    Global Planning Failure  3527    31.2
    Local Planning Failure   378     3.3

Notice that the overall performance increased about 5 times in Test 1 and about 50% in Test 2 when the semantic roles of common verbs and Re-FrameNet were used. There are two main reasons for this improvement. Firstly, the rewritten knowledge of common verbs in Re-FrameNet fills knowledge gaps caused by the lack of definitions of these verbs in OMICS. Secondly, Re-FrameNet and SMR made about 76% and 24% contributions, respectively, to the improvement of the success rate in task planning.
C. Case Study on KeJia Robot
We conducted a case study of our system with the KeJia robot. As shown in Figures 6 and 7, our KeJia robot is based on a two-wheel driving chassis of 62cm × ×.
1) Scenario 1: As shown in Figure 6, a toy and a toy box were placed on the floor. Our KeJia robot was asked by a user to "clean up toys". Note that, with only this instruction, the robot is unable to complete the task because the action "clean up" is under-specified. In our system, the robot first extracted the subtasks of the task "clean up toys" based on the knowledge in OMICS. By doing so, the tuple ⟨task: "clean up toys"; step 1: "pick up toys from floor"; step 2: "put toys in toybox"⟩ was generated. Then, our SMR method matched and recovered the semantic roles of each step in the tuple as:

    (define (task clean_up (toys))
      (:subtasks pick_up-Pick_up (:parameters toys floor))
      (:subtasks put-Placing (:parameters toys floor toybox)))

After that, our planner sequentially processed each subtask. In this phase, since pick_up is a primitive action, the first subtask can be directly executed by our robot. For the second subtask, we generated a plan given the definition of the meta-task put-Placing:

    (define (meta-task put-Placing
        (:parameters ?Agent ?Theme ?Source ?Goal))
      (:precondition (at Theme Source))
      (:precondition (conj (portable Theme) (object Theme)))
      (:postcondition (at Theme Goal)))

In this scenario, the plan generated by the planner for this subtask is shown in Figures 6(c) and 6(d). At this point, the task "clean up toys" is solved by our system, and finally the entire plan is executed by the robot to complete the task.

Fig. 6. Execution of the task "clean up toys": (a) (move(loc(floor)), 1); (b) (pick_up(toy), 2); (c) (move(loc(toybox)), 3); (d) (put_down(toy), 4). Subfigures (a) and (b) are the plan for "pick up toys from floor"; (c) and (d) for "put toys in toybox".
2) Scenario 2: As shown in Figure 7, a user told the robot that he "had a headache". This was identified as a user desire. Similar to the previous scenario, our system first extracted a series of help tasks for the user desire, such as "with pain medication", "give them an aspirin", etc. Then, our SMR method matched and recovered the semantic roles of each help task. In this scenario, our planner failed to plan for the task "with pain medication" but successfully recovered the Source element and generated a plan for the task "give them an aspirin". The list of actions for the plan of this task is illustrated in Figure 7.

Fig. 7. Execution of "give them an aspirin" for the desire "have a headache": (a) (move(loc(aspirin)), 1); (b) (pick_up(aspirin), 2); (c) (move(loc(them)), 3); (d) (put_down(aspirin), 4).

TABLE X
SCORES OF ALL ROBOCUP@HOME BENCHMARK TESTS

    Competition   top 1  top 2   top 3   top 4  top 5
    RoboCup 2013  4767   4645    3622    3155   3066
    Team Name     WE     NimbRo  TU/e    Homer  BORG
    RoboCup 2014  9305   5701    5656    4842   3417
    Team Name     WE     TU/e    NimbRo  Tobi   Pumas
    RoboCup 2015  750    651     647     562    359
    Team Name     WE     Homer   TU/e    Tobi   Pumas
A video demo of the two scenarios above with our KeJia robot is available at: https://youtu.be/A4GBXHG0l74
3) RoboCup@Home: This is an international annual competition for domestic service robots and is part of the RoboCup event. In this competition, a set of benchmark tests is used to evaluate the robots' abilities and performance in a realistic, non-standardized home environment. The benchmark test most related to this article is the General Purpose Service Robot (GPSR) test, which requires a robot to solve tasks, requested in natural language and randomly generated by the referees, during the competition.

In the RoboCup@Home competitions of the past three years, our team, WrightEagle (WE) [21], got the 1st place once and the 2nd place twice. Table X shows the total scores of the top 5 teams in the benchmark tests (without the final stage). It can be seen from the results that our team (i.e., WE) performed very well in the competitions. In particular, in the GPSR tests, the performance of our system was competitive compared to the other top teams, as shown in Table XI.

Although there are generally many factors contributing to success in the RoboCup@Home competitions, our robot did benefit substantially from the proposed system described in this article for processing user instructions and generating plans. The competitions motivated us to develop a general-purpose system for understanding user instructions in natural language and also provide a good testbed for such systems.

TABLE XI
SCORES OF THE GPSR BENCHMARK TESTS

    GPSR Test     top 1   top 2   top 3  top 4  top 5
    RoboCup 2013  900     500     450    250    250
    Team Name     NimbRo  Pumas   WE     TU/e   Tobi
    RoboCup 2014  750     700     500    0      0
    Team Name     WE      NimbRo  TU/e   Tobi   Pumas
    RoboCup 2015  105     60      30     30     20
    Team Name     Tobi    WE      TU/e   Homer  Skuba
VIII. RELATED WORK

To date, many approaches to instruction understanding and task planning for service robots have been proposed in the literature. For instance, several integrated systems [2], [16], [22] for natural language understanding have been introduced to enable robots to complete tasks given instructions in natural language. However, they all assume that instructions are fully specified for their domains and do not consider semantic disambiguation of verbs and their roles. Approaches have also been proposed to manually create environment-driven instructions for grounding user instructions in natural language to robots' actions [10], [23]. However, these methods cannot scale to a large number of tasks because each task needs to be manually specified in an environment, and they are not suitable for different types of robots (e.g., robots with different arm configurations).

To improve generality and scalability, researchers have tried to exploit online knowledge and learn large-scale knowledge representations to build general-purpose systems for instruction understanding. For example, Lemaignan et al. [24], [25] have tried to understand and reason about knowledge around an action model using online knowledge for robots. It is worth pointing out that we previously proposed an integrated system [8] for our KeJia robot consisting of multi-mode NLP, integrated decision-making, and open knowledge searching.

The approaches most related to ours are the ones using OMICS for robots to complete household tasks. The first attempt to utilize OMICS to accomplish a household task is [26], which proposed a generative model based on Markov chain techniques. Later on, [27], [28], [29] presented a system called KNOWROB for processing knowledge in order to achieve more flexible and general behavior. Most recently, we proposed a formal description of the knowledge gaps between user instructions and the local knowledge of a robotic system for instruction understanding [30], [8], [31], [32]. However, in these efforts using OMICS for robot task planning with user instructions, common verbs are normally not defined in the knowledge base, which limits their performance in utilizing existing open knowledge. Thus, our work is proposed to address this weakness of state-of-the-art methods.
IX. CONCLUSIONS

This article proposed a general-purpose system for service robots handling large-scale user instructions in natural language. The key problem that we addressed is how to map primitive tasks into robot actions using the semantic roles of common verbs provided by semantic dictionaries, a common resource of open knowledge in linguistics. To solve this problem, we proposed a novel approach for semantic matching and recovering. Furthermore, we utilized the semantic roles of common verbs defined in semantic dictionaries to handle the under-specification of naturalistic language instructions in task planning. Empirical evaluation and analysis show good performance on two test sets consisting of 11,885 user tasks and 467 user desires collected from OMICS. Moreover, we developed a prototype system deployed on our KeJia robot and demonstrated our techniques in two typical scenarios. Notably, our system has been used in the RoboCup@Home competitions and has shown good performance in the benchmark tests over the past three years.

Here, we conclude with the following findings:
1) The overall performance of our system improved when Re-FrameNet was used. As shown by our experimental results, both the knowledge in Re-FrameNet and the SMR technique contributed to the improvement, indicating that rewritten knowledge of common verbs and recovering semantic roles from context are useful for naturalistic instruction understanding and planning.
2) The computational efficiency of our system can be improved using the hierarchism of user instructions and knowledge. As shown by our case study, instruction understanding and task planning can be done in realtime on our robot, given that task-decomposition knowledge such as OMICS is used for efficient global planning and costly local planning is limited to a small number of low-level tasks defined in Re-FrameNet.

In the future, we plan to develop techniques to learn extra knowledge unavailable from user input, such as knowledge about robot manipulation, action configurations at a finer granularity than semantic roles, and, most importantly, grounding. Moreover, we will investigate methods to automatically generate a large set of Re-FrameNet definitions for robot tasks.
REFERENCES

[1] X. Chen, J. Ji, J. Jiang, G. Jin, F. Wang, and J. Xie, "Developing high-level cognitive functions for service robots," in Proceedings of the 9th International Conference on Autonomous Agents and Multiagent Systems, 2010.
[2] J. Dzifcak, M. Scheutz, C. Baral, and P. Schermerhorn, "What to do and how to do it: Translating natural language directives into temporal and dynamic logic representation for goal management and action execution," in IEEE International Conference on Robotics and Automation (ICRA), 2009, pp. 4163–4168.
[3] T. Kollar, S. Tellex, D. Roy, and N. Roy, "Toward understanding natural language directions," in Proceedings of the 5th ACM/IEEE International Conference on Human-Robot Interaction, 2010.
[4] D. Nyga and M. Beetz, "Everything robots always wanted to know about housework (but were afraid to ask)," in IEEE/RSJ International Conference on Intelligent Robots and Systems, 2012.
[5] A. Saxena, A. Jain, O. Sener, A. Jami, D. K. Misra, and H. S. Koppula, "RoboBrain: Large-scale knowledge engine for robots," in International Symposium of Robotics Research, 2014.
[6] S. Tellex, T. Kollar, S. Dickerson, M. Walter, A. Banerjee, S. Teller, and N. Roy, "Understanding natural language commands for robotic navigation and mobile manipulation," in Proceedings of the National Conference on Artificial Intelligence, 2011.
[7] R. Gupta and M. Kochenderfer, "Common sense data acquisition for indoor mobile robots," in Proceedings of the 19th National Conference on Artificial Intelligence, San Jose, California, USA, 2004, pp. 605–610.
[8] X. Chen, J. Ji, Z. Sui, and J. Xie, "Handling open knowledge for service robots," in Proceedings of the Twenty-Third International Joint Conference on Artificial Intelligence, 2013.
[9] M. P. West, A General Service List of English Words: with Semantic Frequencies and a Supplementary Word-List for the Writing of Popular Science and Technology. Longmans, Green, 1953.
[10] D. Misra, J. Sung, K. Lee, and A. Saxena, "Tell me Dave: Context-sensitive grounding of natural language to manipulation instructions," The International Journal of Robotics Research, 2014.
[11] M. Gelfond and V. Lifschitz, "The stable model semantics for logic programming," in Proceedings of the 5th International Conference on Logic Programming (ICLP), 1988, pp. 1070–1080.
[12] C. F. Baker, C. J. Fillmore, and J. B. Lowe, "The Berkeley FrameNet project," in Proceedings of the 17th International Conference on Computational Linguistics. Association for Computational Linguistics, 1998, pp. 86–90.
[13] P. Bogaards, "Dictionaries for learners of English," International Journal of Lexicography, vol. 9, no. 4, pp. 277–320, 1996.
[14] M.-C. de Marneffe, B. MacCartney, and C. D. Manning, "Generating typed dependency parses from phrase structure parses," in Proceedings of the 5th International Conference on Language Resources and Evaluation (LREC-06). Genoa, Italy: ELRA/ELDA Paris, 2006, pp. 449–454.
[15] M.-C. de Marneffe and C. D. Manning, "The Stanford typed dependencies representation," in Proceedings of the COLING 2008 Workshop on Cross-framework and Cross-domain Parser Evaluation. Manchester, UK: ACL, 2008, pp. 1–8.
[16] R. Cantrell, M. Scheutz, P. Schermerhorn, and X. Wu, "Robust spoken instruction understanding for HRI," in Proceedings of the 5th ACM/IEEE International Conference on Human-Robot Interaction, 2010.
[17] T. Kollar, V. Perera, D. Nardi, and M. Veloso, "Learning environmental knowledge from task-based human-robot dialog," in Proc. of the IEEE International Conference on Robotics and Automation, 2013.
[18] K. M. Varadarajan and M. Vincze, "AfRob: The affordance network ontology for robots," in IEEE/RSJ International Conference on Intelligent Robots and Systems, 2012.
[19] T. Williams, R. Cantrell, G. Briggs, P. Schermerhorn, and M. Scheutz, "Grounding natural language references to unvisited and hypothetical locations," in Proceedings of the Twenty-Seventh AAAI Conference on Artificial Intelligence, Bellevue, Washington, USA, 2013.
[20] M. Gebser, R. Kaminski, B. Kaufmann, M. Ostrowski, T. Schaub, and S. Thiele, "Engineering an incremental ASP solver," in Logic Programming. Springer, 2008, pp. 190–205.
[21] A. Bai, F. Wu, and X. Chen, "Towards a principled solution to simulated robot soccer," in Proceedings of the Robot Soccer World Cup XVI Symposium (RoboCup), Mexico City, Mexico, 2012, pp. 141–153.
[22] H. Kress-Gazit, G. E. Fainekos, and G. J. Pappas, "Temporal-logic-based reactive mission and motion planning," IEEE Transactions on Robotics, vol. 25, no. 6, pp. 1370–1381, 2009.
[23] S. Hemachandra, M. Walter, S. Tellex, and S. Teller, "Learning spatial-semantic representations from natural language descriptions and scene classifications," in IEEE International Conference on Robotics and Automation, 2014, pp. 2623–2630.
[24] S. Lemaignan, "Grounding the interaction: Knowledge management for interactive robots," KI – Künstliche Intelligenz, pp. 1–3, 2012.
[25] S. Lemaignan, R. Ros, E. Sisbot, R. Alami, and M. Beetz, "Grounding the interaction: Anchoring situated discourse in everyday human-robot interaction," International Journal of Social Robotics, vol. 4, no. 2, pp. 181–199, 2012.
[26] C. Shah and R. Gupta, "Building plans for household tasks from distributed knowledge," in Proceedings of the 19th International Joint Conference on Artificial Intelligence (IJCAI 2005) Workshop on Modeling Natural Action Selection, 2005.
[27] M. Tenorth, L. Kunze, D. Jain, and M. Beetz, "KnowRob-Map: Knowledge-linked semantic object maps," in 10th IEEE-RAS International Conference on Humanoid Robots (Humanoids), 2010, pp. 430–435.
[28] L. Kunze, M. Tenorth, and M. Beetz, "Putting people's common sense into knowledge bases of household robots," in KI 2010: Advances in Artificial Intelligence. Springer, 2010, pp. 151–159.
[29] M. Tenorth and M. Beetz, "KnowRob: A knowledge processing infrastructure for cognition-enabled robots," The International Journal of Robotics Research, vol. 32, no. 5, pp. 566–590, 2013.
[30] X. Chen, J. Xie, J. Ji, and Z. Sui, "Toward open knowledge enabling for human-robot interaction," Journal of Human-Robot Interaction, vol. 1, no. 2, pp. 100–117, 2012.
[31] J. Xie and X. Chen, "Understanding instructions on large scale for human-robot interaction," in Proceedings of the 2014 IEEE/WIC/ACM International Joint Conferences on Web Intelligence (WI) and Intelligent Agent Technologies (IAT), Volume 03. IEEE Computer Society, 2014, pp. 175–182.
[32] J. Xie, X. Chen, and J. Ji, "Multi-mode natural language processing for human-robot interaction," in