MobilePose: Real-Time Pose Estimation for Unseen Objects with Weak Shape Supervision
Tingbo Hou, Adel Ahmadyan, Liangkai Zhang, Jianing Wei, Matthias Grundmann
Google Research, 1600 Amphitheatre Pkwy, Mountain View, CA 94043
{tingbo,ahmadyan,liangkai,jianingwei,grundman}@google.com

Abstract.
In this paper, we address the problem of detecting unseen objects from RGB images and estimating their poses in 3D. We propose two mobile-friendly networks: MobilePose-Base and MobilePose-Shape. The former is used when there is only pose supervision, and the latter is for the case when shape supervision is available, even a weak one. We revisit shape features used in previous methods, including segmentation and coordinate map, and explain when and why pixel-level shape supervision can improve pose estimation. Consequently, we add shape prediction as an intermediate layer in MobilePose-Shape and let the network learn pose from shape. Our models are trained on mixed real and synthetic data, with weak and noisy shape supervision. They are ultra lightweight and run in real-time on modern mobile devices (e.g., 36 FPS on a Galaxy S20). Compared with previous single-shot solutions, our method achieves higher accuracy while using a significantly smaller model (2∼3% of their model size or number of parameters).
Keywords: Pose estimation, 3D detection, shape, segmentation, mobile
1 Introduction

Detecting unseen objects from images and estimating their poses in 3D remains a challenge in computer vision that has not been fully explored in previous work. Solving this problem enables many applications across computer vision, augmented reality (AR), autonomous driving, and robotics. Furthermore, mobile-friendly solutions add their own layers of complexity: first, they should run in real-time with limited model size; second, unlike self-driving cars where cameras are fixed [18,2], on-device models have to cope with various device rotations.

Facing these challenges, existing methods often simplify the problem by assuming objects are known. Most methods in the literature are instance-aware [9,31,27,26,8]: they are trained on a set of specific objects and expected to work on the same instances. Object-specific features, including appearance and geometry, can be learned to determine poses, and hence applications are mostly limited to grabbing known objects in robotics. Recent progress in pose estimation has been made by leveraging 2D-3D correspondence at inference time [14,32,20,15,19]. These methods require CAD models of the objects, which are not applicable to unseen ones. Recently, there have been a few attempts to remove the requirement of known objects. As an example, [30] uses depth images to align unseen objects at inference time. While relieving prior knowledge of objects, it relies on depth sensors, which require extra hardware that is not available on general mobile devices.

For unseen objects, we want the model to learn intra-category features, e.g., the common shape or geometry of a category. We categorize geometry-related representations and name them shape features, which can be mapped to images with pixel-level signals. Shape features have been previously used in pose estimation, e.g., segmentation [21,8], parameterization map [32], and coordinate map [30,15]. These methods train Convolutional Neural Networks (CNNs) to infer shape features, which are then used outside of the networks. They rely on highly accurate predictions of shape features, preferably at high resolutions, to align with objects' CAD models. However, estimating the shape of an unseen object is as hard as, or perhaps even harder than, estimating its pose. Instead of post-processing, we use shape prediction as an intermediate layer and have the network learn pose from shape. Besides, pixel-level shape labels are expensive to annotate, which brings another challenge to the problem. This motivates us to look for weakly supervised solutions.

Although there are methods claiming real-time performance [27,15], none of them have been deployed to mobile devices. To run on mobile, the CNN model needs to be ultra lightweight, e.g., MobileNet [7] and MobileNetV2 [24], which only have a few million parameters. Another requirement is post-processing, whose runtime also counts. This is often overlooked in previous methods [20,32,15,30], where expensive algorithms are widely used, e.g., RANdom SAmple Consensus (RANSAC), Perspective-n-Point (PnP), and Iterative Closest Point (ICP).

In this paper, we address the aforementioned challenges and limitations by proposing two mobile-friendly networks: MobilePose-Base and MobilePose-Shape. MobilePose-Base is our baseline network with minimal model size, which detects unseen objects and estimates their poses in a single shot. The detection is anchor-free, following a rising trend in 2D object detection [11,4,33].
It regresses the projected vertices of a 3D bounding box to estimate the object's pose. Since pose estimation is in a low-dimensional space, high-resolution features need supervision in order to make a positive contribution. Therefore, we also propose MobilePose-Shape, which predicts shape features in an intermediate layer. Running on mobile devices, the two networks only require cheap operations in post-processing to fully recover the object's rotation, translation, and size up to a scale. In this work, we are particularly interested in shoes. Following fashion trends, shoes have changing appearances and shapes with a number of sub-categories, e.g., sneakers, boots, flip-flops, high heels, etc.

To summarize, our contributions in this paper are:

– We propose two novel MobilePose networks for detecting unseen objects from RGB images and estimating their poses. Unlike previous methods, we do not require any prior knowledge of the objects or additional hardware at inference time.
– We revisit shape supervision by exploring when and why it can improve pose estimation. In contrast to previous methods that run shape prediction in parallel to other streams and use it in post-processing, we insert it as an intermediate layer and let the network learn pose from shape.
– We train models with weak shape supervision, which transfers shape learning from synthetic data to real data. With the lightweight models, we develop end-to-end applications that run on mobile devices in real-time.
2 Related Work

Given the large literature on pose estimation, we categorize recent work by the prior knowledge each method requires at inference time.
Instance-aware methods learn poses from a set of known objects and are expected to work on the same instances. SSD-6D [9] extends a classic 2D object detection architecture to 6DoF pose estimation, predicting the 2D bounding box, discretized viewpoint, and in-plane rotation in a single shot. PoseCNN [31] estimates the 3D translation of an object as an image location and depth while regressing the 3D rotation. [28] predicts the image locations of the 3D bounding box's vertices using heatmaps, then recovers the orientation and translation using PnP given the object size. In [26], an Augmented Autoencoder (AAE) is used to separate orientation from other factors, e.g., translation and scale. A codebook is created for each object, which contains AAE embeddings of all orientations. YOLO-6D [27] predicts the image locations of projected box vertices and recovers the 6DoF pose using PnP. BB8 [21] uses coarse segmentation to roughly locate objects, subsequently estimating the corners of a 3D bounding box. The method in [8] runs segmentation and bounding box estimation as two parallel branches of the network; pose estimation is improved by letting the network learn the entire shape of an object. [25] utilizes a hybrid intermediate representation to provide more supervision during training.
Model-aware methods require 3D CAD models of objects in post-processing. This is a much stronger prior, which leverages 2D-3D correspondences and yields higher accuracy. PVNet [20] finds 2D-3D correspondences of object features and formulates pose estimation as a PnP problem. By assuming known 3D models of target objects, an iterative mechanism [14] has been utilized to refine the estimated pose by comparing rendered images with inputs. Another recent method [32] computes the UV map of the object from a single RGB image and uses PnP + RANSAC to estimate the 6DoF object pose. The UV map is a parameterization of the 3D model, which also provides 2D-3D correspondence. [15] estimates rotation and translation separately, where rotation is computed, again, by RANSAC + PnP from a coordinate map. Similarly, Pix2Pose [19] predicts the 3D coordinates of objects from images and uses RANSAC + PnP to recover poses. For better predictions, it adopts a Generative Adversarial Network (GAN) during training to discriminate between predicted and rendered coordinates.
Depth-aware methods require depth images in addition to RGB images for pose estimation. In [10], pose is estimated by searching over the nearest neighbors in a codebook of encoded RGB-D patches. DenseFusion [29] processes the RGB image and the depth image individually and uses a dense fusion network to extract pixel-wise dense features; since it has 3D coordinates, it directly predicts translation and rotation. In [13], a multi-view framework was proposed using viewpoint alignment and pose voting, adopting the depth image as an optional input. A recent work [30] predicts the object's normalized coordinates and aligns them with a depth image to compute the pose. The normalized coordinates compose yet another 2D-3D correspondence mapping.
Detection-aware methods rely on existing 2D detectors to find a bounding box or ROI of an object. [18] estimates 3D bounding boxes of vehicles by applying geometric constraints on 2D bounding boxes. SSD-6D also detects 2D bounding boxes using an SSD-style network. [30] employs Mask R-CNN [5] to detect objects and find their 2D locations. [15] adopts YOLOv3 as a 2D detector to crop images around objects during inference. Pix2Pose [19] also assumes cropped images of objects, which are obtained by a modified Faster R-CNN [23] and RetinaNet [16].
Shape features have been employed in estimating poses, e.g., segmentation [21,8], 3D features [20], parameterization map [32], coordinate map [30,15,19], etc. These methods use shape prediction in post-processing instead of in the network: they first infer shape features with a CNN, and then use PnP or ICP to align them with the 3D models. On the contrary, we adopt shape prediction as an intermediate layer and let the CNN learn 3D bounding boxes from it. This generalizes our method to unseen objects.
Real-time solutions have been proposed to push the technique closer to applications. The models need to be lightweight in order to run in real-time, preferably in a single shot. [9] uses InceptionV4 as the backbone of an SSD-style architecture. [27,8] both adopt YOLOv2 [22] as their backbone. In a recent work [15], YOLOv3 is used to detect objects in the first stage, while pose is estimated in the second stage. Given their model sizes and runtimes, none of them has been deployed to mobile devices to run in real-time there.
3 MobilePose-Base

In this section, we propose MobilePose-Base as our baseline network, which detects unseen objects without anchors and estimates their poses in a single shot.
3.1 Backbone

We devise the backbone as a popular encoder-decoder architecture. To build an ultra lightweight model, we select MobileNetV2 [24] for our encoder, which is proven to run in real-time on mobile and outperforms YOLOv2 [22]. MobileNetV2 is built upon inverted residual blocks, where shortcut connections are between thin bottleneck layers, and an expansion-and-squeeze scheme is used in the blocks. To make it even lighter, we remove some blocks with large channels at the bottleneck, reducing the parameters by half.
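To make the overall design concrete, the Keras sketch below wires up an encoder-decoder of this shape. It is a minimal illustration under our assumptions, not the released model: an off-the-shelf MobileNetV2 with a width multiplier stands in for the paper's block removal, and the decoder width, head activations, and layer names are ours.

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_base(input_shape=(480, 640, 3)):
    inp = tf.keras.Input(shape=input_shape)
    # Encoder: a slimmed MobileNetV2 (width multiplier as a stand-in for
    # removing the large-channel bottleneck blocks described in the paper).
    backbone = tf.keras.applications.MobileNetV2(
        input_tensor=inp, include_top=False, weights=None, alpha=0.5)
    x = backbone.output                                   # stride 32: 15 x 20
    # Decoder: one deconvolutional block back to the 40 x 30 output scale.
    x = layers.Conv2DTranspose(64, 3, strides=2, padding='same',
                               activation='relu')(x)      # 30 x 40 features
    heat = layers.Conv2D(1, 3, padding='same', activation='sigmoid',
                         name='detection')(x)             # 30 x 40 x 1 heatmap
    disp = layers.Conv2D(16, 3, padding='same',
                         name='regression')(x)            # 30 x 40 x 16 displacements
    return tf.keras.Model(inp, [heat, disp])
```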
Fig. 1. MobilePose-Base network. The blue and purple boxes are convolutional and deconvolutional blocks, respectively. Orange boxes represent inverted residual blocks from MobileNetV2 [24].

As shown in Fig. 1, the blue and purple boxes are convolutional and deconvolutional blocks, respectively, and an orange box represents an inverted residual block. The numbers of blocks and their dimensions shown in the figure are exactly the same as in our implementation. The input is an image of size 640 × 480 × 3.

3.2 Detection and Regression Heads

We attach two heads after the backbone: detection and regression. The detection head is inspired by the anchor-free methods [11,33] in 2D object detection. We model objects as distributions around their centers. The detection head outputs a 40 × 30 × 1 heatmap. For an image I with pixels {p}, the heatmap is computed as a bivariate normal distribution [3]:

H(p) = max_{i∈O} N(p − µ_i, σ_i),   (1)

where O is the set of all object instances in the image, µ_i is the centroid location of object i, and σ_i is a kernel size proportional to the object size. We keep the fractions when computing the projections µ_i to preserve accuracy. For multiple objects in an image, we take the maximum heat at each pixel, as in the examples shown in Fig. 1 and Fig. 2. By modeling objects as distributions, we end up using a simple L2 loss (mean squared error) for this head.
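To make Eq. (1) concrete, here is a minimal numpy sketch of how such a detection target could be rendered. We simplify the bivariate normal to an isotropic, unnormalized Gaussian with peak 1, and all names are illustrative rather than from the paper's code.

```python
import numpy as np

def render_heatmap(centers, sigmas, width=40, height=30):
    """H(p) = max_i N(p - mu_i, sigma_i), rendered as an isotropic,
    unnormalized Gaussian with peak 1 per object instance.
    centers: projected centroids on the output grid (fractions kept);
    sigmas: per-object kernel sizes, proportional to object size."""
    ys, xs = np.mgrid[0:height, 0:width].astype(np.float32)
    heatmap = np.zeros((height, width), dtype=np.float32)
    for (cx, cy), sigma in zip(centers, sigmas):
        d2 = (xs - cx) ** 2 + (ys - cy) ** 2
        # Max over instances, per pixel, as in Eq. (1).
        heatmap = np.maximum(heatmap, np.exp(-d2 / (2.0 * sigma ** 2)))
    return heatmap

# Example: two shoes in one image, with the max taken at every pixel.
target = render_heatmap([(10.3, 7.8), (25.0, 20.5)], [2.0, 3.5])
```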
Fig. 2. Detection methods (from left to right): anchor, segmentation, and distribution.

In Fig. 2, we compare different detection methods used in single-shot pose estimation. Anchor-based methods (e.g., [27]) set anchors at grid cells and regress bounding boxes at positive anchors (marked as green dots); they handle multiple objects in the same cell by assigning a number of anchors ad hoc. Segmentation methods (e.g., [8]) find objects by segmented instances; for multiple objects from the same category, they need instance segmentation to distinguish the objects. We model objects as Gaussian distributions and detect them by finding peaks. We use a high resolution in the figure for better illustration, while the actual resolution in our model is 40 × 30.

For each 3D bounding box vertex X_i, let x_i denote its projection on the image plane. We compute the displacement vector at pixel p as

D_i(p) = x_i − p.   (2)

Displacement fields of multiple objects in an image are merged according to their heats, as shown in Fig. 2. The regression head outputs a 40 × 30 × 16 tensor, where each box vertex contributes two channels of displacements. In the figure, we only show two displacement fields for illustration. To tolerate errors in peak extraction, we regress displacements at all pixels with significant heat. We use an L1 loss (mean absolute error) for this head, which is more robust to outliers. With the predicted D_i(p), the loss is computed as

L_reg = mean_{H(p) > ε} ( ‖D_i(p) + p − x_i‖ ),   (3)

where ‖·‖ denotes the L1-norm and ε is a threshold (0.2 in our implementation).
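A sketch of this masked L1 loss in TensorFlow (the paper's training framework); the function name and tensor layout are our assumptions:

```python
import tensorflow as tf

def regression_loss(pred_disp, gt_disp, heatmap, eps=0.2):
    """pred_disp, gt_disp: [B, 30, 40, 16] displacement fields, where
    gt_disp[p] stacks x_i - p for the eight box vertices;
    heatmap: [B, 30, 40] ground-truth heat H(p)."""
    mask = tf.cast(heatmap > eps, tf.float32)[..., tf.newaxis]  # H(p) > eps
    abs_err = tf.abs(pred_disp - gt_disp) * mask                # masked L1
    # Mean over the selected pixels and the 16 displacement channels.
    return tf.reduce_sum(abs_err) / (16.0 * tf.reduce_sum(mask) + 1e-6)
```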
4 MobilePose-Shape

When shape supervision is available, even a weak one, we provide MobilePose-Shape, which predicts shape features at an intermediate layer. The motivation is to guide the network to learn high-resolution shape features that are related to pose estimation. We found that simply introducing high-resolution features without supervision does not improve our pose estimation. This is because the regression of bounding box vertices is in a low-dimensional space; without supervision, the network may overfit on object-specific features at small scales. This is not a problem for instance-aware pose estimation, but it is for our case of unseen objects.

Fig. 3. MobilePose-Shape network with shape prediction at an intermediate layer.

Similar to previous methods [8,30,15,19], we select coordinate map and segmentation as our intra-category shape features, as shown in Fig. 3. The coordinate map has three channels, corresponding to the axes of the 3D coordinates. If we have the CAD model of an object in the training data, we can render its coordinate map using normalized coordinates as colors. The coordinate map is a strong feature with pixel-level signals. However, it requires the object's CAD model and pose, which are difficult to acquire. Therefore, we add segmentation as another shape feature. For simplicity, we use semantic segmentation, resulting in one additional channel in our shape supervision. Segmentation is a weak feature for pose estimation: given the segmentation of an unseen object, it is not sufficient to determine its pose. Yet, it needs neither the object's CAD model nor its pose, and is easier to acquire.
With the shape features, we modify the network with high-resolution layers in the decoder and a shape prediction layer. Previous work [8,30,15,19] adds a branch for shape prediction in parallel to the other streams, and the predicted shape is used outside the network for building 2D-3D correspondence. By contrast, we add an intermediate layer for shape prediction, whose output is further used in the network. Meanwhile, shape prediction is useful in many applications, making our network a joint learner of multiple tasks: object detection, pose estimation, and shape prediction.

As shown in Fig. 3, we combine multi-scale features in the decoder. A shape layer is added at the end of the decoder, predicting shape features. It is then concatenated with the decoder to connect the pose heads after downsampling. Specifically, we utilize four inverted residual blocks to reduce the resolution, and finally attach the detection head and regression head described in Section 3.2. The shape head (160 × 120 × 4) has four channels with an L2 loss (mean squared error). Training examples without shape labels are skipped when computing this loss. Through experiments, we show that even with weak supervision, pose estimation is improved by introducing high-resolution shape prediction.
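The per-example skipping can be sketched as follows; the batch flag and names are our assumptions:

```python
import tensorflow as tf

def shape_loss(pred_shape, gt_shape, has_shape_label):
    """pred_shape, gt_shape: [B, 120, 160, 4] (3 coordinate-map channels +
    1 segmentation channel); has_shape_label: [B] in {0, 1}, 0 for real
    images that carry only 3D bounding-box labels."""
    per_example = tf.reduce_mean(tf.square(pred_shape - gt_shape), axis=[1, 2, 3])
    weights = tf.cast(has_shape_label, tf.float32)
    # Average over labeled examples only; unlabeled ones contribute nothing.
    return tf.reduce_sum(per_example * weights) / (tf.reduce_sum(weights) + 1e-6)
```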
5 Post-Processing

Despite the lightweight model, post-processing is another component that is critical to mobile applications. Expensive algorithms, e.g., RANSAC, large-sized PnP, and ICP, are not under our consideration. As a result, we simplify the post-processing to only two cheap operations: peak extraction and EPnP [12].

To compute the projected vertices of a 3D bounding box, we extract peaks from the detection output, a 40 × 30 heatmap. For a peak pixel p, which is not necessarily the center pixel, the eight vertices {x_i} of the projected bounding box are simply computed as

x_i = p + D_i(p),   (4)

where D_i(p) is the displacement field of vertex x_i defined in Eq. 2.
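A small numpy/scipy sketch of these two steps; the local-maximum test and threshold are our assumptions:

```python
import numpy as np
from scipy.ndimage import maximum_filter

def extract_peaks(heatmap, threshold=0.5):
    """Return (x, y) pixels that are 3x3 local maxima above a heat threshold."""
    local_max = maximum_filter(heatmap, size=3) == heatmap
    ys, xs = np.where(local_max & (heatmap > threshold))
    return list(zip(xs.tolist(), ys.tolist()))

def decode_vertices(peak, disp):
    """Eq. (4): x_i = p + D_i(p).
    disp: [30, 40, 16] displacement field, two channels per box vertex."""
    px, py = peak
    d = disp[py, px].reshape(8, 2)                    # 8 vertices x (dx, dy)
    return d + np.array([px, py], dtype=np.float32)   # 8x2 projected vertices
```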
Given the projected 2D box vertices and the camera intrinsics, we employ the EPnP [12] algorithm to recover a 3D bounding box up to scale. The algorithm has constant complexity, solving the eigen-decomposition of a 12 × 12 matrix, and it does not require a known object size. We choose four control points {C_j}: the origin of the object's coordinate system and three points along its coordinate axes. These control points form an orthogonal basis of the object frame. The eight vertices of a 3D bounding box can be represented by these four control points,

X_i = Σ_j α_ij C_j,   (5)

where {α_ij} are coefficients that are preserved under rigid transformations.

From camera projection, we obtain a linear system of 16 equations, with each box vertex contributing two equations for u_i and v_i. By rewriting the control points in the camera frame as a 12-vector C^c, this linear system can be formulated as

M · C^c = 0,   (6)

where M is a 16 × 12 matrix composed of the 2D vertices x_i, the camera intrinsics, and the coefficients α_ij. For details, please refer to our supplementary material. The solution to this linear system is the null eigenvector of the matrix M^T M. With this solution, we can recover the 3D bounding box in the camera frame by Eq. 5, and further estimate the object's pose and size up to a scale.
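For illustration, a compact numpy sketch of this EPnP step under our assumptions: a unit box in the object frame, control points at the origin and unit axis points, and the sign fixed by positive depth.

```python
import numpy as np

def epnp_up_to_scale(vertices_2d, K):
    """vertices_2d: 8x2 projected box vertices (pixels); K: 3x3 intrinsics.
    Returns the eight box vertices in the camera frame, up to scale."""
    # Unit box in the object frame; with C0 = origin and C1..C3 = unit axis
    # points, the barycentric coefficients alpha_ij follow directly.
    X = np.array([[x, y, z] for x in (-.5, .5) for y in (-.5, .5) for z in (-.5, .5)])
    alpha = np.hstack([1.0 - X.sum(axis=1, keepdims=True), X])      # 8x4

    fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]
    M = np.zeros((16, 12))                       # two rows per vertex: Eq. (6)
    for i, (u, v) in enumerate(vertices_2d):
        for j in range(4):
            M[2 * i, 3 * j:3 * j + 3] = alpha[i, j] * np.array([fx, 0.0, cx - u])
            M[2 * i + 1, 3 * j:3 * j + 3] = alpha[i, j] * np.array([0.0, fy, cy - v])

    # Control points in the camera frame: the null eigenvector of M^T M.
    _, V = np.linalg.eigh(M.T @ M)
    C_cam = V[:, 0].reshape(4, 3)                # smallest-eigenvalue vector
    if C_cam[:, 2].mean() < 0:                   # fix the sign: positive depth
        C_cam = -C_cam
    return alpha @ C_cam                         # Eq. (5), up to scale
```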
6 Datasets

Since there is no existing large dataset of shoes with annotated poses, we build our dataset using on-device AR techniques. We annotate real data with 3D bounding boxes, and rely on synthetic data to provide weak supervision for shape.

Table 1. Datasets with their labels and sizes.

Dataset        Pose  Segmentation  Coordinate Map  Size
Real           yes   –             –               100K
Synthetic-3D   yes   yes           yes             80K
Synthetic-2D   –     yes (noisy)   –               –
6.1 Real Data

The lack of training data is a remaining challenge in 6DoF pose estimation. The majority of previous methods are instance-aware, with full supervision from a small dataset. To solve the problem for unseen objects, we develop a pipeline to collect and annotate video clips recorded by mobile devices equipped with AR. Cutting-edge AR solutions (e.g., ARKit and ARCore) can estimate camera poses and sparse 3D features on the fly using Visual-Inertial Odometry (VIO). This on-device technology enables an affordable and scalable way to generate 3D training data.

The key to our data pipeline is efficient and accurate 3D bounding box annotation. We built a tool to visualize both 2D and 3D views of recorded sequences. Annotators draw 3D bounding boxes in the 3D view and verify them in multiple 2D views across the sequence. The drawn bounding boxes are automatically populated to all frames in the video sequence using the estimated camera poses from AR (see the sketch at the end of this subsection).

As a result, we annotated 1800 video clips of shoes. Clips are several seconds long, covering different shoes in various environments. We only accepted one clip per shoe (or pair of shoes), and hence the objects are completely different from clip to clip. Among the clips, 1500 were randomly selected for training, and the remaining 300 were reserved for evaluation. Finally, considering that adjacent frames from the same clip are very similar, we randomly selected 100K images for training and 1K images for evaluation. As shown in Table 1, our real data only has 3D bounding box labels, because annotating pixel-level shape labels frame by frame is much more expensive.
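A sketch of this label propagation, with the matrix conventions (world-to-camera 4 × 4 pose, pinhole intrinsics) as our assumptions:

```python
import numpy as np

def project_box(box_world, T_world_to_cam, K):
    """box_world: 8x3 annotated vertices in the world frame;
    T_world_to_cam: 4x4 camera pose from the AR session (VIO);
    K: 3x3 intrinsics. Returns 8x2 pixel coordinates for one frame."""
    pts_h = np.hstack([box_world, np.ones((8, 1))])   # homogeneous coordinates
    pts_cam = (pts_h @ T_world_to_cam.T)[:, :3]       # world -> camera frame
    uv = pts_cam @ K.T
    return uv[:, :2] / uv[:, 2:3]                     # perspective divide

# One drawn box becomes per-frame labels across the whole clip:
# labels = [project_box(box_world, pose, K) for pose in clip_camera_poses]
```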
6.2 Synthetic Data

To provide shape supervision and enrich the real dataset, we generate two sets of synthetic data. The first one, synthetic-3D, has 3D labels. We collect AR video clips of background scenes and place virtual objects into them. Specifically, we render virtual objects with random poses on a detected plane in the scene, e.g., a table or a floor, and reuse the lighting estimated in the AR sessions. Measurements in AR session data are in metric scale, so the virtual objects are rendered coherently with the surrounding geometry. We collected 100 video clips of common scenes: home, office, and outdoor. For each scene, we generated 100 sequences by rendering 50 scanned shoes with random poses, each sequence containing a number of shoes. From the generated images, we randomly selected 80K for training.
Fig. 4. Examples of our synthetic-3D data (left three columns), synthetic-2D data (right three columns), and their shape labels (bottom row). The last two examples exhibit mild and severe label errors, respectively.
As the examples in Fig. 4 show, the synthetic-3D data has 3D bounding box, segmentation, and coordinate map labels.

Although the synthetic-3D data has accurate labels, the numbers of objects and backgrounds are still limited. Therefore, we build a synthetic-2D dataset from images crawled from the internet. We crawled 75K images of shoes with transparent backgrounds and 40K images of backgrounds (e.g., office and home). Shoe images with trivial errors (e.g., no transparent background) were filtered out. We segmented the shoes by the alpha channel and randomly pasted them onto the background images, as sketched below. As shown in Fig. 4, the generated images are not realistic, and the labels are noisy. We roughly estimate that about 20% of the images have mild label errors (e.g., small missing parts and shadows), and about 10% have severe label errors (e.g., non-shoe objects and large extra background).

We summarize our datasets in Table 1. The three datasets are complementary: the real data provides real images collected at different places, the synthetic-3D data has accurate and complete labels, and the synthetic-2D data covers a large number of objects. At a low cost, we demonstrate a way to prepare training data with 2D and 3D labels that can also be used in other computer vision tasks.
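A minimal sketch of this compositing step, assuming PIL-style RGBA crops and a paste offset that keeps the shoe inside the background:

```python
import numpy as np
from PIL import Image

def composite(shoe_rgba, background, xy):
    """Paste an RGBA shoe crop onto an RGB background at offset xy and
    derive the segmentation label from its alpha channel. Assumes the
    pasted shoe lies fully inside the background image."""
    out = background.copy()
    out.paste(shoe_rgba, xy, mask=shoe_rgba)        # alpha-composited paste
    seg = np.zeros((background.height, background.width), dtype=np.uint8)
    alpha = np.array(shoe_rgba)[:, :, 3] > 127      # binarized alpha mask
    x, y = xy
    seg[y:y + shoe_rgba.height, x:x + shoe_rgba.width] = alpha
    return out, seg
```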
7 Experiments

Our training pipeline is implemented in TensorFlow. We train our networks using the Adam optimizer with batch size 128 on 8 GPUs. The initial learning rate is 0.01 and gradually decays to 0.001 over 30 epochs. To deploy the trained models on mobile devices, we convert them to TFLite models. The conversion removes some layers, such as batch normalization, that are not necessary during inference. Based on the MediaPipe framework [17], we build an application that runs on various mobile devices.

Table 2. Study of decoder scales (20 × 15, 40 × 30, 80 × 60, 160 × 120) on AP at 0.5 3D IoU.

Table 3. Study of datasets and network configurations (Base, Shape-No, Shape-CM, Shape-Full) on AP at 0.5 3D IoU.

For the evaluation metric, we adopt the average precision (AP) of the 3D Intersection-over-Union (IoU). In previous work, the computation of 3D IoU is overly simplified. For example, [30] assumes the two 3D boxes are axis-aligned and rotates one of them to get the best IoU; [27] considered computing the convex hull tedious and used the 2D IoU of projections instead. These simplifications do not hold for general 3D oriented boxes and often over-estimate the 3D IoU metric. On the contrary, we compute the exact 3D IoU by finding the intersecting points of the two oriented boxes and computing the volume of the intersection as a convex hull. Recall that our post-processing recovers 3D bounding boxes up to a scale. Although the scale is not necessary in our applications, it is needed in evaluation; we reuse the detected planes in the AR session data to determine the metric scale of our estimations. Please refer to the supplementary material for details.
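A sketch of this exact 3D IoU using scipy: each oriented box contributes six half-spaces, an interior point of the intersection is found as a Chebyshev center via linear programming, and the intersection volume is that of the resulting convex hull. The box parameterization and function names are our assumptions.

```python
import numpy as np
from scipy.optimize import linprog
from scipy.spatial import ConvexHull, HalfspaceIntersection

def box_halfspaces(center, R, size):
    """Six half-spaces [A | b] with A x + b <= 0 for an oriented box.
    center: (3,), R: (3, 3) rotation, size: (3,) full edge lengths."""
    A = np.vstack([R.T, -R.T])                  # outward face normals
    b = -(A @ center) - np.hstack([size, size]) / 2.0
    return np.hstack([A, b[:, None]])

def iou_3d(box_a, box_b):
    """Exact 3D IoU of two oriented boxes, each given as (center, R, size)."""
    hs = np.vstack([box_halfspaces(*box_a), box_halfspaces(*box_b)])
    # Chebyshev center: maximize r s.t. A x + |a_i| r <= -b (strictly inside).
    norms = np.linalg.norm(hs[:, :3], axis=1, keepdims=True)
    res = linprog(c=[0.0, 0.0, 0.0, -1.0],
                  A_ub=np.hstack([hs[:, :3], norms]), b_ub=-hs[:, 3],
                  bounds=[(None, None)] * 3 + [(0, None)])
    if not res.success or res.x[3] <= 1e-9:
        return 0.0                              # boxes do not intersect
    inter = ConvexHull(HalfspaceIntersection(hs, res.x[:3]).intersections).volume
    union = np.prod(box_a[2]) + np.prod(box_b[2]) - inter
    return inter / union
```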
We conduct ablation studies on decoder scale, shape supervision, and datasets. Table 2 shows the study on the decoder scale of MobilePose-Base. To compare different scales, we build the decoder as shown in Fig. 3 and output at different levels; the detection and regression heads are also placed at the experimented scale. The result of this study shows that MobilePose-Base achieves the best accuracy at scale 40 × 30, in terms of AP at 0.5 3D IoU. As the model gains more layers with high-resolution features, its accuracy drops. This motivates us to look for shape supervision with pixel-level signals.

To study the contributions of our datasets and network components, we compare different configurations and document the results in Table 3. In this experiment, we compare four network configurations: MobilePose-Base (Base), MobilePose-Shape without shape supervision (Shape-No), MobilePose-Shape with coordinate map supervision (Shape-CM), and MobilePose-Shape with full shape supervision (Shape-Full). The configurations are evaluated by AP at 0.5 3D IoU after training on three data combinations.
Fig. 5. Example results on our real evaluation data. We show the reprojected bounding boxes of detected shoes, as well as the segmentation masks and coordinate maps learned from weak supervision.
On the real dataset, we observe that MobilePose-Shape is no better than MobilePose-Base. This is consistent with our experiment on decoder scales: high-resolution features without supervision are not helpful for pose estimation. Trained on the real and synthetic-3D datasets, MobilePose-Shape with coordinate map supervision performs better than the other configurations, showing that shape supervision can improve pose estimation. We also notice that with full (segmentation + coordinate map) supervision, the accuracy goes down. This indicates that segmentation is a weak feature, which is redundant when coupled with the coordinate map.

Finally, trained on all three datasets, MobilePose-Shape performs best with full supervision. Although the synthetic-2D dataset only has noisy segmentation labels, the network is able to improve with this weak and noisy supervision: segmentation makes a positive contribution once there is data with even noisy supervision.
Fig. 6. Comparison of state-of-the-art single-shot methods: AP vs. 3D IoU (left), and FPS vs. model size (right).
Qualitative Results. In Fig. 5, we show some results from our real evaluation dataset. To visualize the 3D bounding boxes, we reproject them onto the 2D image plane. We also show the segmentation and coordinate map predictions of our model. We remind readers that both segmentation and coordinate map are learned purely from synthetic data: the coordinate map only has 50 scanned shoes for supervision, while the segmentation has very noisy labels. Recall our argument that learning accurate shape is more difficult than learning pose; we show that our model can infer accurate poses from weakly learned, coarse shape features. Meanwhile, our model also predicts reasonably good segmentation masks by transfer learning from synthetic data.

Comparisons. We compare our method with three other methods on our shoe dataset. Two of them are state-of-the-art single-shot solutions: YOLO-6D [27] and YOLO-Seg [8], both of which use YOLOv2 [22] as their backbone. YOLO-6D predicts the object's class and a confidence value at each grid cell, which are used for detecting the object. YOLO-Seg uses a semantic segmentation branch for detection, parallel to its regression branch. Besides the two previous methods, we also compare with MobileNetV2 [24] attached to our decoder and heads; recall that our encoder removes several blocks with a large number of channels, nearly halving the parameters of MobileNetV2, and this comparison verifies the effect of that optimization.

For all methods, we follow the implementation details and uploaded code where available. As shown in Fig. 6, our MobilePose-Shape has the best AP at 0.5 3D IoU. We also compare model sizes and speeds on a smartphone (Galaxy S20 with Adreno 650 GPU), benchmarking models by running inference on the mobile GPU using TFLite with the GPU delegate. Our MobilePose-Base runs at 36 FPS with the smallest model size (16MB), and MobilePose-Shape runs at 26.5 FPS with a slightly larger model (18MB) and 10% higher AP. Interestingly, using only half of the model size, our MobilePose-Shape is comparable with the MobileNetV2 plus our decoder and heads. This indicates that by introducing high-resolution features with even weak supervision, low-resolution features can be compressed with a shallower and thinner network. Finally, compared with the two previous models, ours are about 2∼3% of theirs in model size or number of parameters, and 3∼12 times faster in FPS.

Table 4. Comparison with YOLO-6D [27] on the Linemod dataset (∗ marks symmetric objects).

Metric    Method      Ape    Can    Cat    Driller  Duck   Eggbox∗  Glue∗  Holep.
REP-5px   YOLO-6D     92.10  –      –      –        –      –        –      –
REP-5px   MobilePose  –      –      –      –        –      –        –      –
ADD-0.1d  YOLO-6D     21.62  68.80  41.82  63.51    27.23  69.58    80.02  42.63
ADD-0.1d  MobilePose  –      –      –      –        –      –        –      –

Table 5. Comparison with YOLO-Seg [8] on the Occlusion dataset (∗ marks symmetric objects).

Metric    Method      Ape   Can   Cat   Driller  Duck  Eggbox∗  Glue∗  Holep.
REP-5px   YOLO-Seg    59.1  59.8  46.9  59.0     42.6  11.9     16.5   63.6
REP-5px   MobilePose  –     –     –     –        –     –        –      –
ADD-0.1d  YOLO-Seg    12.1  39.9  8.2   45.2     17.2  22.1     35.8   36.0
ADD-0.1d  MobilePose  –     –     –     –        –     –        –      –
Public Datasets. We also compare the methods on two popular public datasets: Linemod [6] and Occlusion [1]. We adopt the standard metrics of reprojection error (REP-5px) and average distance (ADD-0.1d), same as [27,8]. The results are shown in Tables 4 and 5, where the superscript ∗ indicates a symmetric object. We use MobilePose-Base for these two experiments because the objects in the two datasets are relatively small: without increasing the model size for higher resolution, or leveraging detection-aware cropping, shape features do not provide sufficient supervision. Since our model is designed for unseen objects from a single category, we trained a model for each object category. Compared with YOLO-Seg [8], which uses a multi-object model for Occlusion, ours has much higher accuracy, and the total model size is still smaller.

8 Conclusion

In this paper, we address the problem of pose estimation from two different angles that had not been explored before. First, we do not assume any prior knowledge of the unseen objects. Second, our MobilePose models are ultra lightweight and run on mobile devices in real-time. Additionally, we reveal that pixel-level shape supervision can guide the network to learn poses from high-resolution features. We demonstrate our models on various unseen shoes through a mobile application, and the proposed method can easily be extended to other object categories.
References
1. Brachmann, E., Krull, A., Michel, F., Gumhold, S., Shotton, J., Rother, C.: Learning 6D object pose estimation using 3D object coordinates. In: European Conference on Computer Vision (ECCV) (2014)
2. van Dijk, T., de Croon, G.: How do neural networks see depth in single images? In: IEEE International Conference on Computer Vision (ICCV) (2019)
3. Ding, L., Fridman, L.: Object as distribution. In: Conference on Neural Information Processing Systems (NeurIPS) (2019)
4. Duan, K., Bai, S., Xie, L., Qi, H., Huang, Q., Tian, Q.: CenterNet: Keypoint triplets for object detection. In: IEEE International Conference on Computer Vision (ICCV) (2019)
5. He, K., Gkioxari, G., Dollár, P., Girshick, R.: Mask R-CNN. In: IEEE International Conference on Computer Vision (ICCV) (2017)
6. Hinterstoisser, S., Lepetit, V., Ilic, S., Holzer, S., Bradski, G., Konolige, K., Navab, N.: Model based training, detection and pose estimation of texture-less 3D objects in heavily cluttered scenes. In: Asian Conference on Computer Vision (ACCV) (2012)
7. Howard, A.G., Zhu, M., Chen, B., Kalenichenko, D., Wang, W., Weyand, T., Andreetto, M., Adam, H.: MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv:1704.04861 (2017)
8. Hu, Y., Hugonot, J., Fua, P., Salzmann, M.: Segmentation-driven 6D object pose estimation. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2019)
9. Kehl, W., Manhardt, F., Tombari, F., Ilic, S., Navab, N.: SSD-6D: Making RGB-based 3D detection and 6D pose estimation great again. In: IEEE International Conference on Computer Vision (ICCV) (2017)
10. Kehl, W., Milletari, F., Tombari, F., Ilic, S., Navab, N.: Deep learning of local RGB-D patches for 3D object detection and 6D pose estimation. In: European Conference on Computer Vision (ECCV) (2016)
11. Law, H., Deng, J.: CornerNet: Detecting objects as paired keypoints. In: European Conference on Computer Vision (ECCV) (2018)
12. Lepetit, V., Moreno-Noguer, F., Fua, P.: EPnP: An accurate O(n) solution to the PnP problem. International Journal of Computer Vision (IJCV) 81(2), 155–166 (2009)
13. Li, C., Bai, J., Hager, G.D.: A unified framework for multi-view multi-class object pose estimation. In: European Conference on Computer Vision (ECCV) (2018)
14. Li, Y., Wang, G., Ji, X., Xiang, Y., Fox, D.: DeepIM: Deep iterative matching for 6D pose estimation. In: European Conference on Computer Vision (ECCV) (2018)
15. Li, Z., Wang, G., Ji, X.: CDPN: Coordinates-based disentangled pose network for real-time RGB-based 6-DoF object pose estimation. In: IEEE International Conference on Computer Vision (ICCV) (2019)
16. Lin, T.Y., Goyal, P., Girshick, R., He, K., Dollár, P.: Focal loss for dense object detection. In: IEEE International Conference on Computer Vision (ICCV) (2017)
17. Lugaresi, C., Tang, J., Nash, H., McClanahan, C., Uboweja, E., Hays, M., Zhang, F., Chang, C., Yong, M.G., Lee, J., Chang, W., Hua, W., Georg, M., Grundmann, M.: MediaPipe: A framework for building perception pipelines. CoRR abs/1906.08172 (2019)
18. Mousavian, A., Anguelov, D., Flynn, J., Kosecka, J.: 3D bounding box estimation using deep learning and geometry. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2017)
19. Park, K., Patten, T., Vincze, M.: Pix2Pose: Pixel-wise coordinate regression of objects for 6D pose estimation. In: IEEE International Conference on Computer Vision (ICCV) (2019)
20. Peng, S., Liu, Y., Huang, Q., Bao, H., Zhou, X.: PVNet: Pixel-wise voting network for 6DoF pose estimation. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2019)
21. Rad, M., Lepetit, V.: BB8: A scalable, accurate, robust to partial occlusion method for predicting the 3D poses of challenging objects without using depth. In: IEEE International Conference on Computer Vision (ICCV) (2017)
22. Redmon, J., Farhadi, A.: YOLO9000: Better, faster, stronger. arXiv:1612.08242 (2016)
23. Ren, S., He, K., Girshick, R., Sun, J.: Faster R-CNN: Towards real-time object detection with region proposal networks. In: Conference on Neural Information Processing Systems (NeurIPS) (2015)
24. Sandler, M., Howard, A., Zhu, M., Zhmoginov, A., Chen, L.C.: MobileNetV2: Inverted residuals and linear bottlenecks. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2018)
25. Song, C., Song, J., Huang, Q.: HybridPose: 6D object pose estimation under hybrid representations. arXiv:2001.01869 (2020)
26. Sundermeyer, M., Marton, Z.C., Durner, M., Brucker, M., Triebel, R.: Implicit 3D orientation learning for 6D object detection from RGB images. In: European Conference on Computer Vision (ECCV) (2018)
27. Tekin, B., Sinha, S.N., Fua, P.: Real-time seamless single shot 6D object pose prediction. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2018)
28. Tremblay, J., To, T., Sundaralingam, B., Xiang, Y., Fox, D., Birchfield, S.: Deep object pose estimation for semantic robotic grasping of household objects. In: Conference on Robot Learning (CoRL) (2018)
29. Wang, C., Xu, D., Zhu, Y., Martín-Martín, R., Lu, C., Fei-Fei, L., Savarese, S.: DenseFusion: 6D object pose estimation by iterative dense fusion. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2019)
30. Wang, H., Sridhar, S., Huang, J., Valentin, J., Song, S., Guibas, L.J.: Normalized object coordinate space for category-level 6D object pose and size estimation. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2019)
31. Xiang, Y., Schmidt, T., Narayanan, V., Fox, D.: PoseCNN: A convolutional neural network for 6D object pose estimation in cluttered scenes. In: Robotics: Science and Systems (RSS) (2018)
32. Zakharov, S., Shugurov, I., Ilic, S.: DPOD: 6D pose object detector and refiner. In: IEEE International Conference on Computer Vision (ICCV) (2019)
33. Zhou, X., Wang, D., Krähenbühl, P.: Objects as points. CoRR abs/1904.07850 (2019)