Video Object Segmentation-based Visual Servo Control and Object Depth Estimation on a Mobile Robot
Brent A. Griffin    Victoria Florence    Jason J. Corso
University of Michigan
{griffb, vflorenc, jjcorso}@umich.edu

Abstract
To be useful in everyday environments, robots must be able to identify and locate real-world objects. In recent years, video object segmentation has made significant progress on densely separating such objects from background in real and challenging videos. Building off of this progress, this paper addresses the problem of identifying generic objects and locating them in 3D using a mobile robot with an RGB camera. We achieve this by, first, introducing a video object segmentation-based approach to visual servo control and active perception and, second, developing a new Hadamard-Broyden update formulation. Our segmentation-based methods are simple but effective, and our update formulation lets a robot quickly learn the relationship between actuators and visual features without any camera calibration. We validate our approach in experiments by learning a variety of actuator-camera configurations on a mobile HSR robot, which subsequently identifies, locates, and grasps objects from the YCB dataset and tracks people and other dynamic articulated objects in real-time.
1. Introduction
Visual servo control (VS), using visual data in the servo loop to control a robot, is a well-established field [11, 28]. Using features from RGB images, VS has been used for positioning UAVs [26, 43] and wheeled robots [35, 42], manipulating objects [29, 54], and even laparoscopic surgery [56]. While this prior work attests to the applicability of VS, generating robust visual features for VS in unstructured environments with generic objects (e.g., without fiducial markers) remains an open problem.

On the other hand, video object segmentation (VOS), the dense separation of objects in video from background, has made recent progress on real, unstructured videos. This progress is due in part to the introduction of multiple benchmark datasets [47, 49, 57], which evaluate VOS methods across many challenging categories, including moving cameras, occlusions, objects leaving view, scale variation, appearance change, edge ambiguity, multiple interacting objects, and dynamic background; these challenges frequently occur simultaneously. However, despite all of VOS's contributions to video understanding, we are unaware of any work that utilizes VOS for control.

To this end, this paper develops a VOS-based framework to address the problem of visual servo control in unstructured environments. We also use VOS to estimate depth without a 3D sensor (e.g., an RGBD camera in Figure 1 and [20, 53]).
Figure 1. RGBD View of Cluttered Scene. Using an RGB image (top left), HSR identifies and segments five target objects (top right). However, the associated depth image is unreliable (bottom left) and provides depth data for only one target (bottom right).

Figure 2.
Finding Objects in RGB. With our approach, HSR segments, locates, and grasps objects using a single RGB camera.
Developing VOS-based features for control and depth estimation has many advantages. First, VOS methods are robust across a variety of unstructured objects and backgrounds, making our framework general to many settings. Second, many VOS methods operate on streaming images, making them ideal for tracking objects from a moving robot. Third, ongoing work in active and interactive perception enables robots to automatically generate object-specific training data for VOS methods [6, 21, 32, 41]. Finally, VOS remains a hotly studied area of video understanding, and future improvements in the accuracy and robustness of state-of-the-art segmentation methods will similarly improve our method.

The primary contribution of our paper is the development and experimental evaluation of video object segmentation-based visual servo control (VOS-VS). We demonstrate the utility of VOS-VS on a mobile robot equipped with an RGB camera to identify and position itself relative to many challenging objects from HSR challenges and the YCB object dataset [9]. To the best of our knowledge, this work is the first use of video object segmentation for control.

A second contribution is our new Hadamard-Broyden update formulation, which outperforms the original Broyden update in experiments and enables a robot to learn the relationship between actuators and VOS-VS features online without any camera calibration. Using our update, our robot learns to servo with seven unique configurations across seven actuators and two cameras. To the best of our knowledge, this work is the first use of a Broyden update to directly estimate the pseudoinverse feature Jacobian for visual servo control on a robot.

A final contribution is introducing two more VOS-based methods, VOS-DE and VOS-Grasp. VOS-DE combines segmentation features with Galileo's Square-cube law and active perception to estimate an object's depth, which, with VOS-VS, provides an object's 3D location. VOS-Grasp uses segmentation features for grasping and grasp-error detection. Thus, using our approach, robots can find and grasp objects using a single RGB camera (see Figure 2).

We provide source code and annotated YCB object training data at https://github.com/griffbr/VOSVS.
2. Related Work
Video object segmentation methods can be categorized as unsupervised, which usually rely on object motion [19, 24, 33, 46, 55], or semi-supervised, which segment objects specified in user-annotated examples [5, 13, 23, 36, 45, 60]. Of particular interest to the current work, semi-supervised methods learn the visual characteristics of a target object, which enables them to reliably segment dynamic or static objects. To generate our VOS-based features, we segment objects using One-Shot Video Object Segmentation (OSVOS) [8], which is state-of-the-art in VOS and has influenced other leading semi-supervised methods [40, 51].

In addition to the visual servo literature cited in Section 1, this paper builds off of other methods for control design and feature selection. For control design, a technique using a hybrid input of 3D Cartesian space and 2D image space is developed in [39], with depth estimation provided externally. As a step toward more natural image features, Canny edge detection-based planar contours of objects are used in [14]. When designing features, work in [38] shows that z-axis features should scale proportional to the optical depth of observed targets. Finally, work in [15] controls z-axis motions using the longest line connecting two feature points for rotation and the square root of the collective feature-point-polygon area for depth; this approach addresses the Chaumette Conundrum presented in [10] but also requires that all feature points remain in the image. Notably, early VS methods require structured visual features (e.g., fiducial markers), while recent learning-based methods require manipulators with a fixed workspace [1, 30, 62]. Taking advantage of recent progress in computer vision, this paper introduces robust segmentation-based image features for visual servoing that are generated from ordinary, real-world objects. Furthermore, our features are rotation invariant, work when parts of an object are out of view or occluded, and do not require any particular object viewpoint or marking, making this work applicable to articulated and deformable objects (e.g., the yellow chain in Figures 1-2). Finally, our method enables visual servo control on a mobile manipulation platform, on which we also use segmentation-based features for depth estimation and grasping.

A critical asset for robot perception is taking actions to improve sensing and understanding of the environment, i.e., Active Perception (AP) [3, 4]. Compared to structure from motion [2, 31, 34], which requires feature matching or scene flow to relate images, AP exploits knowledge of a robot's relative position to relate images and improve 3D reconstruction. Furthermore, AP methods select new view locations explicitly to improve perception performance [17, 50, 61]. In this work, we use active perception with VOS-based features to estimate an object's depth. We complete our estimate during our robot's approach to an object, and, by tracking the estimate's convergence, we can collect more data if necessary. Essentially, by using an RGB camera and kinematic information that is already available, we estimate the 3D position of objects without any 3D sensors, including: LIDAR, which is cost prohibitive and color blind; RGBD cameras, which do not work in ambient sunlight among other conditions (see Figure 1); and stereo cameras, which require calibration and feature matching. Even when 3D sensors are available, RGB-based methods provide an indispensable backup for perception [44, 52].
Figure 3. HSR Control Model.
3. Robot Model and Perception Hardware
For our robot experiments, we use a Toyota Human Support Robot (HSR), which has a 4-DOF manipulator arm mounted on a torso with prismatic and revolute joints and a differential drive base [58, 59]. Using the revolute joint atop its differential drive base, we effectively control HSR as an omnidirectional robot. For visual servo control, we use the actuators shown in Figure 3 as the joint space q ∈ R^n,

$$q = \begin{bmatrix} q_{\text{head tilt}}, \; q_{\text{head pan}}, \; \cdots, \; q_{\text{base roll}} \end{bmatrix}^\top. \tag{1}$$

In addition to q, HSR's end effector has a parallel gripper with series elastic fingertips for grasping objects; the fingertips have a 135 mm maximum width.

For perception, we use HSR's base-mounted UST-20LX 2D scanning laser for obstacle avoidance and the head-mounted Xtion PRO LIVE RGBD camera and end effector-mounted wide-angle grasp camera for segmentation. The head tilt and pan joints act as a 2-DOF gimbal for the head camera, and the grasp camera moves with the arm and wrist joints; both cameras stream
640 × 480 RGB images.

A significant component of HSR's manipulation DOF comes from its mobile base. While many planning algorithms work well on high-DOF arms with a stationary base, the odometry errors of HSR compound during trajectory execution and cause missed grasps. Thus, VS is well-suited for HSR and other mobile robots, providing visual feedback on an object's relative position during mobile manipulation.
4. Segmentation-based Visual Servo Control
Assume we are given an RGB image I containing an object of interest. Using VOS, we generate a binary mask

$$M = \text{vos}(I, W), \tag{2}$$

where M consists of pixel-level labels ℓ_p ∈ {0, 1}, ℓ_p = 1 indicates pixel p corresponds to the segmented object, and W are learned VOS parameters (details in Section 7.2). Using M, we define the following VOS-based features

$$s_A := \sum_{\ell_p \in M} \ell_p \tag{3}$$

$$s_x := \frac{\sum_{\ell_p \in M,\, \ell_p = 1} p_x}{s_A} \tag{4}$$

$$s_y := \frac{\sum_{\ell_p \in M,\, \ell_p = 1} p_y}{s_A}, \tag{5}$$

where s_A is a measure of segmentation area by the number of labeled pixels, s_x is the x-centroid of the segmented object using x-axis label positions p_x, and s_y is the equivalent y-centroid. In addition to (3)-(5), we introduce more VOS features for depth estimation and grasping in Sections 5-6.

Using VOS-based features for our visual servo control scheme, we define image feature error

$$e := s(I, W) - s^*, \tag{6}$$

where s ∈ R^k is the vector of visual features found in image I using learned VOS parameters W and s* ∈ R^k is the vector of desired feature values. In contrast to many VS control schemes, e in (6) has no dependence on time, previous observations, or additional system parameters (e.g., camera parameters or 3D object models).

Typical VS approaches relate camera motion to s using

$$\dot{s} = L_s v_c, \tag{7}$$

where L_s ∈ R^{k×6} is a feature Jacobian relating the three linear and three angular camera velocities v_c ∈ R^6 to ṡ. From (6)-(7), assuming ṡ* = 0 ⟹ ė = ṡ = L_s v_c, we find the VS control velocities v_c to minimize e as

$$v_c = -\lambda \widehat{L}^+_s e, \tag{8}$$

where L̂⁺_s is the estimated pseudoinverse of L_s and λ ensures an exponential decoupled decrease of e [11]. Notably, VS control using (8) requires continuous, six degree of freedom (DOF) control of camera velocity.

To make (8) more general for discrete motion planning and fewer required control inputs, we modify (7)-(8) to

$$\Delta s = J_s \Delta q \tag{9}$$

$$\Delta q = -\lambda \widehat{J}^+_s e, \tag{10}$$

where Δq is the change of q ∈ R^n actuated joints, J_s ∈ R^{k×n} is the feature Jacobian relating Δq to Δs, and Ĵ⁺_s is the estimated pseudoinverse of J_s. We command Δq directly to the robot joint space as our VOS-VS controller to minimize e and reach the desired feature values s* in (6).

4.3. Hadamard-Broyden Update Formulation

In real visual servo systems, it is impossible to know the exact feature Jacobian (J_s) relating control actuators to image features [11]. Instead, some VS methods estimate J_s directly from observations [12]; among these, a few use the Broyden update rule [27, 29, 48], which iteratively updates online. In contrast to previous VS work, Broyden's original paper provides a formulation to estimate the pseudoinverse feature Jacobian (Ĵ⁺_s) [7, (4.5)]. However, we found it necessary to augment Broyden's formulation with a logical matrix H, and define our new Hadamard-Broyden update

$$\widehat{J}^+_{s_{t+1}} := \widehat{J}^+_{s_t} + \alpha \left( \frac{\left(\Delta q - \widehat{J}^+_{s_t} \Delta e\right) \Delta q^\top \widehat{J}^+_{s_t}}{\Delta q^\top \widehat{J}^+_{s_t} \Delta e} \right) \circ H, \tag{11}$$

where α determines the update speed, Δq = q_t − q_{t−1} and Δe = e_t − e_{t−1} are the changes in joint space and feature errors since the last update, and H ∈ R^{n×k} is a logical matrix coupling actuators to image features. In experiments, we initialize (11) using a small positive α and Ĵ⁺_{s,t=0} = εH for a small ε > 0. The Hadamard product with H prevents undesired coupling between certain actuator and image feature pairs.
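To make the update concrete, the following is a minimal NumPy sketch of (11); it is our own transcription rather than the released implementation, and the guard against a near-zero denominator is an added assumption that the derivation above does not address.

```python
import numpy as np

def hadamard_broyden_update(J_pinv, dq, de, H, alpha):
    """Hadamard-Broyden update (11) of the pseudoinverse feature Jacobian.

    J_pinv: (n, k) current estimate of J_s^+.
    dq, de: changes in joint values (n,) and feature error (k,) since the
            last update, i.e., q_t - q_{t-1} and e_t - e_{t-1}.
    H:      (n, k) logical matrix; H[i, j] = 1 couples joint q_i to s_j.
    alpha:  update speed.
    """
    dq = dq.reshape(-1, 1)                # column vectors
    de = de.reshape(-1, 1)
    denom = float(dq.T @ J_pinv @ de)     # scalar dq^T J^+ de
    if abs(denom) < 1e-9:                 # assumed guard; skip degenerate steps
        return J_pinv
    step = (dq - J_pinv @ de) @ (dq.T @ J_pinv) / denom
    return J_pinv + alpha * step * H      # elementwise (Hadamard) mask
```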
In practice, we find that using the original Broyden update results in unpredictable convergence and learning gains for actuator-image feature pairs that are, in fact, unrelated. Fortunately, we find that using H in (11) enables real-time convergence without any calibration on the robot for all of the experiment configurations in Section 7.3.

We learn seven unique VOS-VS configurations using our HB update. Using s_x (4) and s_y (5) in e (6), we define error

$$e_{x,y} := s_{x,y}(M(I, W)) - s^* = \begin{bmatrix} s_x \\ s_y \end{bmatrix} - s^*. \tag{12}$$

Using e_{x,y} and HSR joints q (1), we choose Ĵ⁺_s in (11) as

$$\widehat{J}^+_s \approx \frac{\partial q}{\partial e_{x,y}} = \frac{\partial q}{\partial s_{x,y}} = \begin{bmatrix} \frac{\partial q_{\text{head tilt}}}{\partial s_x} & \frac{\partial q_{\text{head tilt}}}{\partial s_y} \\ \frac{\partial q_{\text{head pan}}}{\partial s_x} & \frac{\partial q_{\text{head pan}}}{\partial s_y} \\ \vdots & \vdots \\ \frac{\partial q_{\text{base roll}}}{\partial s_x} & \frac{\partial q_{\text{base roll}}}{\partial s_y} \end{bmatrix}, \tag{13}$$

where Ĵ⁺_s ∈ R^{n×2}. Note that in our Hadamard-Broyden update (11), each element ∂q_i/∂s_j in Ĵ⁺_s is multiplied by element H_{i,j} in the Hadamard product. Thus, we configure the logical coupling matrix H by setting H_{i,j} = 1 if coupling actuated joint q_i with image feature s_j is desired. Using our update formulation (11), we learn Ĵ⁺_s on HSR for the seven H configurations listed in Table 1 and provide experimental results for each configuration in Section 7.3.
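A minimal sketch of the feature computation (3)-(5) and the discrete control step (10) on the error (12); the function names and the λ default are ours, assuming the mask from (2) arrives as a binary NumPy array.

```python
import numpy as np

def vos_features(mask):
    """VOS-VS features (3)-(5) from a binary mask M (HxW, 1 = object)."""
    ys, xs = np.nonzero(mask)            # pixel positions p with label 1
    s_A = xs.size                        # (3): segmentation area in pixels
    if s_A == 0:
        return 0, None, None             # object not visible in this frame
    return s_A, xs.mean(), ys.mean()     # (4)-(5): x- and y-centroids

def vos_vs_step(J_pinv, s_xy, s_star, lam=1.0):
    """Discrete VOS-VS control step (10) on the error e_{x,y} from (12)."""
    e = np.asarray(s_xy, dtype=float) - np.asarray(s_star, dtype=float)
    return -lam * (J_pinv @ e)           # joint-space command delta q
```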
Table 1. VOS-VS Hadamard-Broyden Update Configurations. Ĵ⁺_s values are learned online using our HB update formulation (11), enabling HSR to automatically learn the relationship between actuators and visual features without any camera calibration.

Config.        Camera  Learned ∂q_i/∂s_j in Ĵ⁺_s (13)
                       s_x                         s_y
H_head         Head    q_head pan                  q_head tilt
H_arm lift     Grasp   q_arm lift     -0.00157     q_arm roll
H_arm wrist    Grasp   q_wrist flex   -0.00221     q_arm roll
                       q_arm lift     -0.00036
H_arm both     Grasp   q_wrist flex   -0.00392     q_arm roll
H_base         Grasp   q_base forward -0.00179     q_base lateral
H_base grasp   Grasp   q_base forward -0.00040     q_base lateral
Figure 4. VOS-based Visual Servo and Depth Estimation. HSR first aligns an object with the camera's optical axis, then estimates the object's depth as the camera approaches. Using Galileo's Square-cube law (15), we estimate the object's depth using changes in relative camera position and segmentation area.
5. Segmentation-based Depth Estimation
By combining VOS-based features with active perception, we are able to estimate the depth of segmented objects and approximate their 3D position. As shown in Figure 4, we initiate our depth estimation framework (VOS-DE) by centering the optical axis of our camera with a segmented object using the H_base VOS-VS controller. This alignment minimizes lens distortion, which facilitates the use of an ideal camera model. Using the pinhole camera model [22], projections of objects onto the image plane scale inversely with their distance on the optical axis from the camera. Thus, with the object centered on the optical axis, we can relate projection scale and object distance using

$$\ell_1 d_1 = \ell_2 d_2 \implies \frac{\ell_2}{\ell_1} = \frac{d_1}{d_2}, \tag{14}$$

where ℓ_1 is the projected length of an object measurement orthogonal to the optical axis, d_1 is the distance along the optical axis of the object away from the camera, and ℓ_2 is the projected measurement length at a new distance d_2. Combining Galileo Galilei's Square-cube law with (14),

$$A_2 = A_1 \left(\frac{\ell_2}{\ell_1}\right)^2 \implies A_2 = A_1 \left(\frac{d_1}{d_2}\right)^2, \tag{15}$$

where A_1 is the projected object area corresponding to ℓ_1 and d_1 (see Figure 4). As the camera advances on the optical axis, we modify (15) to relate collected images using

$$d_1 \sqrt{A_1} = d_2 \sqrt{A_2} = c_{\text{object}}, \tag{16}$$

where c_object is a constant proportional to the orthogonal surface area of the segmented object. Also, using a coordinate frame with the z axis aligned with the optical axis,

$$d = z_{\text{camera}} - z_{\text{object}}, \tag{17}$$

where z_camera and z_object are the z-axis coordinates of the camera and object. Because the camera and object are both centered on the z axis, x_camera = x_object = 0 and y_camera = y_object = 0. Using (17) and s_A (3), we update (16) as

$$(z_{\text{camera},1} - z_{\text{object}}) \sqrt{s_{A,1}} = (z_{\text{camera},2} - z_{\text{object}}) \sqrt{s_{A,2}} = c_{\text{object}}, \tag{18}$$

where the object is assumed stationary between images (i.e., ż_object = 0) and the z_camera position is known from the robot's kinematics. Note that z_camera provides relative depth for VOS-DE and (18) identifies a key linear relationship between √s_A and the distance between the object and camera.

Finally, after collecting a series of m measurements, we estimate the depth of the segmented object. From (18),

$$z_{\text{object}} \sqrt{s_{A,i}} + c_{\text{object}} = z_{\text{camera},i} \sqrt{s_{A,i}}, \tag{19}$$

which over the m measurements in Ax = b form yields

$$\begin{bmatrix} \sqrt{s_{A,1}} & 1 \\ \sqrt{s_{A,2}} & 1 \\ \vdots & \vdots \\ \sqrt{s_{A,m}} & 1 \end{bmatrix} \begin{bmatrix} \hat{z}_{\text{object}} \\ \hat{c}_{\text{object}} \end{bmatrix} = \begin{bmatrix} z_{\text{camera},1} \sqrt{s_{A,1}} \\ z_{\text{camera},2} \sqrt{s_{A,2}} \\ \vdots \\ z_{\text{camera},m} \sqrt{s_{A,m}} \end{bmatrix}. \tag{20}$$

By solving (20) for ẑ_object and ĉ_object, we estimate the distance d in (17), and, thus, the 3D location of the object. In Section 7.4, we show that our combined VOS-VS and VOS-DE framework is sufficient for locating, approaching, and estimating the depth of a variety of unstructured objects.

Figure 5. VOS-based Grasping. VOS-based visual servo control (columns 1 to 2), active depth estimation (2-4), and mobile robot grasping (5-6). Using our combined framework with a single RGB camera, HSR identifies the sugar box, locates it in 3D, and picks it up in real-time.

Figure 6.
Depth Estimate of Sugar Box.
Data collected and processed in real-time during the initial approach in Figure 5.
Remark: There are many methods to find approximate solutions to (20). In practice, we find that a least squares solution provides robustness to outliers caused by segmentation errors (see visual and quantitative example in Figures 5-6).
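As a sketch, (20) is a two-column linear system, so the least-squares solve is a one-liner; this is our own NumPy rendering, assuming the camera z positions and segmentation areas have been logged as arrays during the approach.

```python
import numpy as np

def estimate_object_depth(z_camera, s_A):
    """VOS-DE: least-squares solution of (20) from m logged measurements.

    z_camera: (m,) camera z positions, known from the robot's kinematics.
    s_A:      (m,) segmentation areas (pixels) observed at those positions.
    Returns (z_object_hat, c_object_hat) as in (19)-(20).
    """
    root_area = np.sqrt(np.asarray(s_A, dtype=float))
    A = np.stack([root_area, np.ones_like(root_area)], axis=1)  # [sqrt(s_A), 1]
    b = np.asarray(z_camera, dtype=float) * root_area           # z_cam * sqrt(s_A)
    (z_object_hat, c_object_hat), *_ = np.linalg.lstsq(A, b, rcond=None)
    return z_object_hat, c_object_hat
```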
6. Segmentation-based Grasping
We develop a VOS-based method of grasping and grasp-error detection (VOS-Grasp). Assuming an object is centered and has estimated depth ẑ_object, we move z_camera to

$$z_{\text{camera,grasp}} = \hat{z}_{\text{object}} + z_{\text{gripper}}, \tag{21}$$

where z_gripper is the known z-axis offset between z_camera and the center of HSR's closed fingertips. Thus, when z_camera is at z_camera,grasp, HSR can reach the object at depth ẑ_object. After moving to z_camera,grasp, we center the object directly underneath HSR's antipodal gripper using H_base grasp VOS-VS control.
Figure 7. Experiment Objects from YCB Dataset.
Object categories are (from left to right) Food, Kitchen, Tool, and Shape. Spanning from 470 mm long to 4 mm thick, we intentionally select many of the challenge objects to break our framework.
To find a suitable grasp location, we project and rotate a mask of the gripper, M_grasp, into the camera as shown in column 5 of Figure 5 and solve

$$\underset{q_{\text{wrist roll}}}{\arg\min}\; J(q_{\text{wrist roll}}) = \frac{M \cap M_{\text{grasp}}(q_{\text{wrist roll}})}{M \cup M_{\text{grasp}}(q_{\text{wrist roll}})}, \tag{22}$$

where J is the intersection over union (or Jaccard index [18]) of M_grasp and object segmentation mask M, and M_grasp(q_wrist roll) is the projection of M_grasp corresponding to HSR wrist rotation q_wrist roll. Thus, we grasp the object using the wrist rotation with the least intersection between the object and the gripper, which is then less likely to collide with the object before achieving a parallel grasp.

After the object is grasped, we lift HSR's arm to perform a visual grasp check. We consider a grasp complete if

$$s_{A,\text{raised}} > c\, s_{A,\text{grasp}}, \tag{23}$$

where c is a fixed threshold, s_A,grasp is the object segmentation size s_A (3) during the initial grasp, and s_A,raised is the corresponding s_A after lifting the arm. If s_A decreases when lifting the arm, the object is further from the camera and not securely grasped. Thus, we quickly identify if a grasp is missed and regrasp as necessary. Note that this VOS-based grasp check can also work with other grasping methods [25, 37]. A complete demonstration of our VOS-based visual servo control, depth estimation, and grasping framework is shown in Figure 5.
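A sketch of the wrist-roll search (22) and the lift check (23): grasp_mask_at stands in for the gripper-mask projection, which we do not reproduce here, and the threshold in (23) is kept as the parameter c from above because its value is truncated in our copy of the text.

```python
import numpy as np

def select_wrist_roll(object_mask, grasp_mask_at, angles):
    """Grasp-angle search (22): return the wrist roll minimizing the IoU
    (Jaccard index) between object mask M and gripper mask M_grasp(angle)."""
    def iou(angle):
        g = grasp_mask_at(angle)                   # projected gripper mask
        inter = np.logical_and(object_mask, g).sum()
        union = np.logical_or(object_mask, g).sum()
        return inter / union if union > 0 else 0.0
    return min(angles, key=iou)                    # least overlap wins

def grasp_succeeded(s_A_raised, s_A_grasp, c):
    """Visual grasp check (23): a held object should not shrink in the
    grasp camera when the arm is lifted; c is the threshold from (23)."""
    return s_A_raised > c * s_A_grasp
```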
7. Robot Experiments
For most of our experiments, we use the objects from the YCB object dataset [9] shown in Figure 7. We use six objects from each of the food, kitchen, tool, and shape categories and purposefully choose some of the most difficult objects. To name only a few of the challenges for the selected objects: dimensions span from the 470 mm long pan to the 4 mm thick washer, most of the contours change with pose, and over a third of the objects exhibit specular reflection of overhead lights. To learn object recognition, we annotate ten training images of each object using HSR's grasp camera with various object poses, backgrounds, and distances from the camera (see example image in Figure 2).
Figure 8. Learning Ĵ⁺_s for H_base. Visual servo trajectory of the target object in image space (right) using the original Broyden update (red) and our Hadamard-Broyden update (11) (blue). Starting with the same Ĵ⁺_{s,t=0} and offset target location (yellow chain, left), the original update leads HSR into the wall while our update learns the correct visual servoing parameters to center HSR on the target.

Figure 9.
Learning Ĵ⁺_s Parameters for H_base. This plot corresponds to the fourteen Hadamard-Broyden updates used to learn visual servoing parameters in Figure 8. ∂q_base forward/∂s_x initializes with the incorrect sign but still converges using our update formulation.

We segment objects using OSVOS [8]. OSVOS uses a base network trained on ImageNet [16] to recognize image features, re-trains a parent network on DAVIS [47] to learn general video object segmentation, and then fine-tunes for each of our experiment objects (i.e., each object has unique learned parameters W in (2)). After learning W, our VOS framework segments HSR's
640 × 480 RGB images at 29.6 Hz using a single GPU (GTX 1080 Ti).
Hadamard-Broyden Update
We learn all of the VOS-VS configurations in Table 1 on HSR using the Hadamard-Broyden update formulation in (11). We initialize each configuration using Ĵ⁺_{s,t=0} = εH, a small positive α, and a target object in view to elicit a step response from the VOS-VS controller (see Figure 8). Each configuration starts at a specific pose (e.g., H_base uses the leftmost pose in Figures 4-5), and configurations use s* = [320, ·]⊤ in (12), except for H_base grasp, which uses s* = [220, ·]⊤ to position grasps.

When initializing each configuration, after a few iterations of control inputs from (10) and updates from (11), the learned Ĵ⁺_s matrix generally shows convergence for any H_{i,j} component that is initialized with the correct sign (e.g., five updates for ∂q_base lateral/∂s_y in Figure 9). Components initialized with an incorrect sign generally require more updates to change directions and jump through zero during one of the discrete updates (e.g., ∂q_base forward/∂s_x in Figure 9). If an object goes out of view from an incorrectly signed component, we reset HSR's pose and restart the update from the most recent Ĵ⁺_{s,t}. Once s* is reached, the object can be moved to elicit a few more step responses for fine tuning. Table 1 shows the learned parameters for each configuration. In the remaining experiments, we set α = 0 in (11) to reduce variability.
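For illustration, the coupling that Figure 9 implies for H_base (base forward with s_x, base lateral with s_y) can be written directly; the initial scale ε = 0.001 and α = 0.1 below are assumed stand-ins for the truncated constants, and hadamard_broyden_update is the sketch from Section 4.3.

```python
import numpy as np

# Rows index joints (q_base_forward, q_base_lateral); columns index (s_x, s_y).
H_base = np.array([[1.0, 0.0],    # q_base_forward corrects s_x only
                   [0.0, 1.0]])   # q_base_lateral corrects s_y only
J_pinv = 1e-3 * H_base            # initial estimate eps * H (assumed scale)

# One illustrative update after a commanded step dq changed the error by de:
dq = np.array([0.05, -0.02])      # motion of the two base joints
de = np.array([30.0, -12.0])      # measured change in (e_x, e_y), in pixels
J_pinv = hadamard_broyden_update(J_pinv, dq, de, H_base, alpha=0.1)
```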
H_base Results

We show the step response of all Ĵ⁺_s configurations in Table 1 by performing experiments centering the camera on objects placed at various viewpoints within each configuration's starting pose. In Figure 10, both H_base and H_base grasp exhibit a stable response. Our motivation to learn two base configurations is the increase in s_{x,y} sensitivity to base motion as an object's depth decreases. H_base operates with the camera raised high above objects, while H_base grasp operates with the camera directly above objects to position for grasping. Thus, H_base requires more movement than H_base grasp for the same changes in s_{x,y}. This difference is apparent in Table 1 from H_base learning greater ∂q_base/∂s values and in Figure 10 from H_base's smaller s_{x,y} distribution for identical object distances.
H_arm Results

We show the step response of all arm-based VOS-VS configurations in Figure 11. Each configuration uses the same objects and starting pose. Although each configuration segments the pan and baseball, s* is not reachable for these objects within any of the configured actuator spaces; H_arm wrist is the only configuration to center on all four of the other objects. The overactuated H_arm both has the most overshoot, while H_arm lift has the most limited range of camera positions but essentially deadbeat control.
H_head Results

Finally, we show the step response of H_head in Figure 12. H_head is the only configuration that uses HSR's 2-DOF head gimbal and camera, and it exhibits a smooth step response over the entire image. Remarkably, even though H_head uses the head camera, it still uses the same OSVOS parameters W that are learned on grasp-camera images; this further demonstrates the general applicability of VOS-VS in regards to needing no camera calibration.

We perform an experiment consisting of a consecutive set of mobile trials that simultaneously test VOS-VS and VOS-DE. Each trial consists of three unique YCB objects placed at different heights: one on the blue bin 0.25 m above the ground, one on the green bin 0.125 m above the ground, and one directly on the ground (see bin configuration in Figure 2).
Figure 10. Visual Servoing using Learned Parameters. Initial view with segmented objects (left) and visual servo trajectories centering on each object (right). While objects are identically placed for the H_base (top) and H_base grasp (bottom) experiments, each configuration has learned the correct scale of actuation to center on objects from its own visual perspective. Note that in the H_base view, the wood block starts very close to s* (green dot).

The trial configurations and corresponding results are provided in Table 2. VOS-VS is considered a success ("X") if HSR locates and centers on the object for depth estimation. VOS-DE is considered a success if HSR achieves z_camera,grasp (21) such that HSR can close its grippers on the object without hitting the underlying surface and z_camera does not move past the top surface of the object.

Across all 24 objects, VOS-VS has an 83% success rate. VOS-DE, which is only applicable when VOS-VS succeeds, has a 50% success rate. By category, food objects have the highest success (100% VOS-VS, 83% VOS-DE) and kitchen objects have the lowest (50% VOS-VS, 66% VOS-DE). Failures are caused by segmentation errors. Although VOS-VS can center on a poorly segmented object, VOS-DE fails if there are erratic changes in segmentation area (we provide examples in the Appendix). Additionally, VOS-DE's margin for success varies between objects (e.g., the smallest margin is the 4 mm thick washer).
Pick-and-place Challenges

We perform additional experiments for our VOS-based methods, including our work in the TRI-sponsored HSR challenges. These challenges consist of timed trials for pick-and-place tasks with randomly scattered, non-YCB objects (e.g., the banana peel in Figure 13). These challenges are a particularly good demonstration of VOS-VS and VOS-Grasp. We provide additional figures for these experiments in the Appendix.

Figure 11. Initial view of objects and visual servo trajectories using H_arm lift (center left), H_arm wrist (center right), and H_arm both (right).

Figure 12. Initial view and visual servo trajectories using H_head.

Table 2. Consecutive Mobile Robot Trial Results.
All results are from a single consecutive set of mobile HSR trials. Across all of the challenge objects, VOS-VS has an 83% success rate. Except for one VOS-DE trial, the food objects were a complete success.
Item              Category  Support Height (m)  VS  DE
Chips Can         Food      0.25                X   X
Potted Meat       Food      0.125               X   X
Plastic Banana    Food      Ground              X   X
Box of Sugar      Food      0.25                X   X
Tuna              Food      0.125               X
Gelatin           Food      Ground              X   X
Mug               Kitchen   0.25                X   X
Softscrub         Kitchen   0.125                   N/A
Skillet with Lid  Kitchen   Ground                  N/A
Plate             Kitchen   0.25                X   X
Spatula           Kitchen   0.125                   N/A
Knife             Kitchen   Ground              X
Power Drill       Tool      0.25                X   X
Marker            Tool      0.125               X
Padlock           Tool      Ground              X
Wood              Tool      0.25                X
Spring Clamp      Tool      0.125               X
Screwdriver       Tool      Ground              X
Baseball          Shape     0.25                X
Plastic Chain     Shape     0.125               X
Washer            Shape     Ground              X
Stacking Cup      Shape     0.25                X   X
Dice              Shape     0.125                   N/A
Foam Brick        Shape     Ground              X   X
Dynamic Articulated Objects
Finally, we perform additional VOS-VS experiments with dynamic articulated objects. Using H_base, HSR tracks a plastic chain across the room in real-time as we kick it and throw it in a variety of unstructured poses; we can even pick up the chain and use it to guide HSR's movements from the grasp camera.
Figure 13. Additional Experiments.
Using VOS-VS, HSR is able to track dynamic objects like people in real-time, making VOS-VS a useful tool for human-robot cooperation (left). HSR taking a banana peel to the garbage for a pick-and-place challenge (right).
In addition, by training OSVOS to recognize an article of clothing, HSR reliably tracks a person moving throughout the room using H_head (see Figure 13). Experiment videos are available at: https://youtu.be/hlog5FV9RLs.
8. Conclusions and Future Work
We develop a video object segmentation-based approach to visual servo control, depth estimation, and grasping. Visual servo control is a useful framework for controlling a physical robot system from RGB images, and video object segmentation has seen rampant advances within the computer vision community for densely segmenting unstructured objects in challenging videos. The success of our segmentation-based approach to visual servo control in mobile robot experiments with real-world objects is a tribute to both of these communities and the initiation of a bridge between them. Future developments in video object segmentation will improve the robustness of our method and, we expect, lead to other innovations in robotics.

A significant benefit of our segmentation-based framework is that it only requires an RGB camera combined with robot actuation. For future work, we are improving RGB-based depth estimation and grasping by comparing images collected from more robot poses, thereby leveraging more information and making our 3D understanding of the target object more complete.
Acknowledgment
Toyota Research Institute ("TRI") provided funds to assist the authors with their research but this article solely reflects the opinions and conclusions of its authors and not TRI or any other Toyota entity.
Figure 14. Plate Segmentations used for Depth Estimation. The plate is well-segmented from the higher camera position (top), but has greater specular reflection as the camera approaches (bottom).

Figure 15.
Drill Segmentations used for Depth Estimation. Portions of the drill become unsegmented at the closer view (bottom).
Appendix
Segmentation Errors
Densely segmenting unstructured objects is a challenging problem, and, despite using state-of-the-art video object segmentation, we have some segmentation errors during our experiments. Figures 14-17 show segmentations used for depth estimation during the consecutive mobile robot trials in Section 7.4. Even with some segmentation errors, VOS-VS centers on all four objects from the high-camera position and VOS-DE successfully estimates the depth of the plate and drill.
Figure 16. Marker Segmentations used for Depth Estimation. Reflective areas of the background are included as part of the marker segmentation at the higher view (top), then portions of the marker become unsegmented at the closer view (bottom).

Figure 17. Padlock Segmentations used for Depth Estimation. The padlock segmentation goes from including small portions of the background (top) to leaving out large portions of the lock (bottom) as the camera approaches. Segmenting the padlock is difficult due to its small size and specular reflection, and depth estimation of the padlock is difficult due to erroneous changes in segmentation area.

Robot's Perspective when Learning VOS-VS

Figure 18 shows the step-to-step visual servo transitions from the robot's perspective as it is learning Ĵ⁺_s for H_base (corresponding to Figures 8-9).

Figure 18. Robot Perspective while Learning Ĵ⁺_s for H_base. Starting with Ĵ⁺_{s,t=0} and offset target location (the yellow chain segmentation), our Hadamard-Broyden update learns the correct visual servoing parameters to center the robot on the target in real-time. The target is centered vertically after five updates (t = 5) and horizontally after fourteen (t = 14). We show the complete visual servo trajectory of the target object through image space on the bottom right. This figure corresponds with the experiment shown in Figures 8-9.
Figures for Pick-and-place Experiments
Figure 19 shows a fully-automated pick-and-place task. Figure 20 shows a pick-and-place task with VOS-VS-based human collaboration.

Figure 19. HSR using VOS-VS and VOS-Grasp for Pick-and-place. After a set of HSR challenge objects are randomly poured onto the metal tray, HSR identifies the initial object locations using the downward-facing grasp camera (top row). HSR identifies the banana peel as the first target, then centers on the peel amongst the cluttered objects using VOS-VS and then grasps the peel using VOS-Grasp (middle row). Finally, HSR performs a visual grasp check away from the other objects and then places the peel in the garbage bin (bottom row). All pick-and-place experiment and trial videos are available at: .

Figure 20. Using VOS-VS for Human-Robot Collaboration. HSR is asked to perform a pick-and-place task with the paper towel roll, but has no idea where it is. Using H_head VOS-VS, HSR tracks the person so that he can show HSR where to find the roll (top row). Using H_head VOS-VS again, HSR centers its gaze on the roll to locate it, then uses H_base grasp to position itself for VOS-Grasp (middle row). Finally, HSR grasps the paper towel roll, verifies the grasp using our visual check, and then places the roll in the yellow bin (bottom row). All pick-and-place experiment and trial videos are available at: .
References

[1] P. Abolghasemi, A. Mazaheri, M. Shah, and L. Boloni. Pay attention! - Robustifying a deep visuomotor policy through task-focused visual attention. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.
[2] J. K. Aggarwal and N. Nandhakumar. On the computation of motion from sequences of images - a review. Proceedings of the IEEE, 76(8):917-935, Aug 1988.
[3] R. Bajcsy. Active perception. Proceedings of the IEEE (Invited Paper), 76(8):966-1005, Aug 1988.
[4] R. Bajcsy, Y. Aloimonos, and J. K. Tsotsos. Revisiting active perception. Autonomous Robots, 42(2):177-196, Feb 2018.
[5] L. Bao, B. Wu, and W. Liu. CNN in MRF: Video object segmentation via inference in a CNN-based higher-order spatio-temporal MRF. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
[6] B. Browatzki, V. Tikhanoff, G. Metta, H. H. Bulthoff, and C. Wallraven. Active in-hand object recognition on a humanoid robot. IEEE Transactions on Robotics, 30(5):1260-1269, Oct 2014.
[7] C. G. Broyden. A class of methods for solving nonlinear simultaneous equations. Mathematics of Computation, 19(92):577-593, 1965.
[8] S. Caelles, K.-K. Maninis, J. Pont-Tuset, L. Leal-Taixé, D. Cremers, and L. Van Gool. One-shot video object segmentation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[9] B. Calli, A. Walsman, A. Singh, S. Srinivasa, P. Abbeel, and A. M. Dollar. Benchmarking in manipulation research: Using the Yale-CMU-Berkeley object and model set. IEEE Robotics Automation Magazine, 22(3):36-52, Sep. 2015.
[10] F. Chaumette. Potential problems of stability and convergence in image-based and position-based visual servoing. In D. J. Kriegman, G. D. Hager, and A. S. Morse, editors, The Confluence of Vision and Control, pages 66-78, London, 1998. Springer London.
[11] F. Chaumette and S. Hutchinson. Visual servo control. I. Basic approaches. IEEE Robotics Automation Magazine, 13(4):82-90, Dec 2006.
[12] F. Chaumette and S. Hutchinson. Visual servo control. II. Advanced approaches [tutorial]. IEEE Robotics Automation Magazine, 14(1):109-118, March 2007.
[13] Y. Chen, J. Pont-Tuset, A. Montes, and L. Van Gool. Blazingly fast video object segmentation with pixel-wise metric learning. In Computer Vision and Pattern Recognition (CVPR), 2018.
[14] G. Chesi, E. Malis, and R. Cipolla. Automatic segmentation and matching of planar contours for visual servoing. In Proceedings 2000 ICRA. Millennium Conference. IEEE International Conference on Robotics and Automation. Symposia Proceedings (Cat. No.00CH37065), volume 3, pages 2753-2758, April 2000.
[15] P. I. Corke and S. A. Hutchinson. A new partitioned approach to image-based visual servo control. IEEE Transactions on Robotics and Automation, 17(4):507-515, Aug 2001.
[16] J. Deng, W. Dong, R. Socher, L. J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2009.
[17] R. Eidenberger and J. Scharinger. Active perception and scene modeling by planning with probabilistic 6D object poses. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 1036-1043, Oct 2010.
[18] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman. The PASCAL visual object classes (VOC) challenge. International Journal of Computer Vision, 88(2):303-338, 2010.
[19] A. Faktor and M. Irani. Video segmentation by non-local consensus voting. In British Machine Vision Conference (BMVC), 2014.
[20] M. Ferguson and K. Law. A 2D-3D object detection system for updating building information models with mobile robots. In IEEE Winter Conference on Applications of Computer Vision (WACV), pages 1357-1365, Jan 2019.
[21] V. Florence, J. J. Corso, and B. Griffin. Self-supervised robot in-hand object learning. CoRR, abs/1904.00952, 2019.
[22] D. A. Forsyth and J. Ponce. Computer Vision: A Modern Approach. Prentice Hall Professional Technical Reference, 2002.
[23] B. A. Griffin and J. J. Corso. BubbleNets: Learning to select the guidance frame in video object segmentation by deep sorting frames. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.
[24] B. A. Griffin and J. J. Corso. Tukey-inspired video object segmentation. In IEEE Winter Conference on Applications of Computer Vision (WACV), 2019.
[25] M. Gualtieri, A. ten Pas, K. Saenko, and R. Platt. High precision grasp pose detection in dense clutter. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 598-605, Oct 2016.
[26] N. Guenard, T. Hamel, and R. Mahony. A practical visual servo control for an unmanned aerial vehicle. IEEE Transactions on Robotics, 24(2):331-340, April 2008.
[27] K. Hosoda and M. Asada. Versatile visual servoing without knowledge of true Jacobian. In Proceedings of IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), volume 1, pages 186-193, Sep. 1994.
[28] S. Hutchinson, G. D. Hager, and P. I. Corke. A tutorial on visual servo control. IEEE Transactions on Robotics and Automation, 12(5):651-670, Oct 1996.
[29] M. Jagersand, O. Fuentes, and R. Nelson. Experimental evaluation of uncalibrated visual servoing for precision manipulation. In Proceedings of International Conference on Robotics and Automation (ICRA), volume 4, pages 2874-2880, April 1997.
[30] S. James, P. Wohlhart, M. Kalakrishnan, D. Kalashnikov, A. Irpan, J. Ibarz, S. Levine, R. Hadsell, and K. Bousmalis. Sim-to-real via sim-to-sim: Data-efficient robotic grasping via randomized-to-canonical adaptation networks. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.
[31] Y. Kasten, M. Galun, and R. Basri. Resultant based incremental recovery of camera pose from pairwise matches. In IEEE Winter Conference on Applications of Computer Vision (WACV), pages 1080-1088, Jan 2019.
[32] M. Krainin, P. Henry, X. Ren, and D. Fox. Manipulator and object tracking for in-hand 3D object modeling. The International Journal of Robotics Research, 30(11):1311-1327, 2011.
[33] Y. J. Lee, J. Kim, and K. Grauman. Key-segments for video object segmentation. In IEEE International Conference on Computer Vision (ICCV), 2011.
[34] H. C. Longuet-Higgins. A computer algorithm for reconstructing a scene from two projections. In Readings in Computer Vision: Issues, Problems, Principles, and Paradigms, pages 61-62. 1987.
[35] A. D. Luca, G. Oriolo, and P. R. Giordano. Feature depth observation for image-based visual servoing: Theory and experiments. The International Journal of Robotics Research, 27(10):1093-1116, 2008.
[36] J. Luiten, P. Voigtlaender, and B. Leibe. PReMVOS: Proposal-generation, refinement and merging for video object segmentation. In Asian Conference on Computer Vision (ACCV), 2018.
[37] J. Mahler, J. Liang, S. Niyaz, M. Laskey, R. Doan, X. Liu, J. A. Ojea, and K. Goldberg. Dex-Net 2.0: Deep learning to plan robust grasps with synthetic point clouds and analytic grasp metrics. CoRR, abs/1703.09312, 2017.
[38] R. Mahony, P. Corke, and F. Chaumette. Choice of image features for depth-axis control in image based visual servo control. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), volume 1, pages 390-395, Sept 2002.
[39] E. Malis, F. Chaumette, and S. Boudet. 2 1/2 D visual servoing. IEEE Transactions on Robotics and Automation, 15(2):238-250, April 1999.
[40] K. Maninis, S. Caelles, Y. Chen, J. Pont-Tuset, L. Leal-Taixé, D. Cremers, and L. V. Gool. Video object segmentation without temporal information. IEEE Transactions on Pattern Analysis and Machine Intelligence, pages 1-1, 2018.
[41] P. Marion, P. R. Florence, L. Manuelli, and R. Tedrake. LabelFusion: A pipeline for generating ground truth labels for real RGBD data of cluttered scenes. In IEEE International Conference on Robotics and Automation (ICRA), pages 1-8, May 2018.
[42] G. L. Mariottini, G. Oriolo, and D. Prattichizzo. Image-based visual servoing for nonholonomic mobile robots using epipolar geometry. IEEE Transactions on Robotics, 23(1):87-100, Feb 2007.
[43] A. McFadyen, M. Jabeur, and P. Corke. Image-based visual servoing with unknown point feature correspondence. IEEE Robotics and Automation Letters, 2(2):601-607, April 2017.
[44] A. Milan, T. Pham, K. Vijay, D. Morrison, A. W. Tow, L. Liu, J. Erskine, R. Grinover, A. Gurman, T. Hunn, N. Kelly-Boxall, D. Lee, M. McTaggart, G. Rallos, A. Razjigaev, T. Rowntree, T. Shen, R. Smith, S. Wade-McCue, Z. Zhuang, C. Lehnert, G. Lin, I. Reid, P. Corke, and J. Leitner. Semantic segmentation from limited training data. In IEEE International Conference on Robotics and Automation (ICRA), pages 1908-1915, May 2018.
[45] S. W. Oh, J.-Y. Lee, K. Sunkavalli, and S. J. Kim. Fast video object segmentation by reference-guided mask propagation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
[46] A. Papazoglou and V. Ferrari. Fast object segmentation in unconstrained video. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2013.
[47] F. Perazzi, J. Pont-Tuset, B. McWilliams, L. Van Gool, M. Gross, and A. Sorkine-Hornung. A benchmark dataset and evaluation methodology for video object segmentation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[48] J. A. Piepmeier, G. V. McMurray, and H. Lipkin. Uncalibrated dynamic visual servoing. IEEE Transactions on Robotics and Automation, 20(1):143-147, Feb 2004.
[49] J. Pont-Tuset, F. Perazzi, S. Caelles, P. Arbelaez, A. Sorkine-Hornung, and L. V. Gool. The 2017 DAVIS challenge on video object segmentation. CoRR, abs/1704.00675, 2017.
[50] R. Spica, P. R. Giordano, and F. Chaumette. Coupling active depth estimation and visual servoing via a large projection operator. The International Journal of Robotics Research, 36(11):1177-1194, 2017.
[51] P. Voigtlaender and B. Leibe. Online adaptation of convolutional neural networks for video object segmentation. In British Machine Vision Conference (BMVC), 2017.
[52] S. Wade-McCue, N. Kelly-Boxall, M. McTaggart, D. Morrison, A. W. Tow, J. Erskine, R. Grinover, A. Gurman, T. Hunn, D. Lee, A. Milan, T. Pham, G. Rallos, A. Razjigaev, T. Rowntree, R. Smith, K. Vijay, Z. Zhuang, C. F. Lehnert, I. D. Reid, P. I. Corke, and J. Leitner. Design of a multi-modal end-effector and grasping system: How integrated design helped win the Amazon Robotics Challenge. CoRR, abs/1710.01439, 2017.
[53] C. Wang, D. Xu, Y. Zhu, R. Martin-Martin, C. Lu, L. Fei-Fei, and S. Savarese. DenseFusion: 6D object pose estimation by iterative dense fusion. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.
[54] Y. Wang, H. Lang, and C. W. de Silva. A hybrid visual servo controller for robust grasping by wheeled mobile robots. IEEE/ASME Transactions on Mechatronics, 15(5):757-769, Oct 2010.
[55] S. Wehrwein and R. Szeliski. Video segmentation with background motion models. In British Machine Vision Conference (BMVC), 2017.
[56] G. Wei, K. Arbter, and G. Hirzinger. Real-time visual servoing for laparoscopic surgery. Controlling robot motion with color image segmentation. IEEE Engineering in Medicine and Biology Magazine, 16(1):40-45, Jan 1997.
[57] N. Xu, L. Yang, Y. Fan, D. Yue, Y. Liang, J. Yang, and T. S. Huang. YouTube-VOS: A large-scale video object segmentation benchmark. CoRR, abs/1809.03327, 2018.
[58] U. Yamaguchi, F. Saito, K. Ikeda, and T. Yamamoto. HSR, human support robot as research and development platform. The Abstracts of the International Conference on Advanced Mechatronics: Toward Evolutionary Fusion of IT and Mechatronics (ICAM), 2015.6:39-40, 2015.
[59] T. Yamamoto, K. Terada, A. Ochiai, F. Saito, Y. Asahara, and K. Murase. Development of human support robot as the research platform of a domestic mobile manipulator. ROBOMECH Journal, 6(1):4, Apr 2019.
[60] L. Yang, Y. Wang, X. Xiong, J. Yang, and A. K. Katsaggelos. Efficient video object segmentation via network modulation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
[61] A. Zeng, S. Song, K.-T. Yu, E. Donlon, F. R. Hogan, M. Bauza, D. Ma, O. Taylor, M. Liu, E. Romo, N. Fazeli, F. Alet, N. C. Dafle, R. Holladay, I. Morona, P. Q. Nair, D. Green, I. Taylor, W. Liu, T. Funkhouser, and A. Rodriguez. Robotic pick-and-place of novel objects in clutter with multi-affordance grasping and cross-domain image matching. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), 2018.
[62] Y. Zuo, W. Qiu, L. Xie, F. Zhong, Y. Wang, and A. L. Yuille. CRAVES: Controlling robotic arm with a vision-based economic system. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.