Improving Annotation for 3D Pose Dataset of Fine-Grained Object Categories
Yaming Wang [email protected]
Xiao Tan [email protected]
Yi Yang [email protected]
Ziyu Li [email protected]
Xiao Liu [email protected]
Feng Zhou [email protected]
Larry S. Davis [email protected]
Abstract
Existing 3D pose datasets of object categories are limited to generic object types and lack fine-grained information. In this work, we introduce a new large-scale dataset that consists of 409 fine-grained categories and 31,881 images with accurate 3D pose annotation. Specifically, we augment three existing fine-grained object recognition datasets (StanfordCars, CompCars and FGVC-Aircraft) by finding a specific 3D model for each sub-category from ShapeNet and manually annotating each 2D image by adjusting a full set of 7 continuous perspective parameters. Since the fine-grained shapes allow the 3D models to better fit the images, we further improve the annotation quality by initializing from the human annotation and conducting a local search of the pose parameters with the objective of maximizing the IoU between the projected mask and a segmentation reference estimated by state-of-the-art deep Convolutional Neural Networks (CNNs). We provide full statistics of the annotations with qualitative and quantitative comparisons, suggesting that our dataset can be a complementary source for studying 3D pose estimation. The dataset can be downloaded at http://users.umiacs.umd.edu/~wym/3dpose.html.
1. Introduction
In the past few years, the fast-paced progress of generic image recognition on ImageNet [13] has drawn increasing attention to classifying fine-grained object categories [11, 25], e.g., bird species [27] and car makes and models [12]. However, simply recognizing object labels is still far from solving many industrial problems where we need a deeper understanding of other attributes of the objects [15]. On the other hand, estimating 3D object pose from a single 2D image is an indispensable step in various practical applications, such as vehicle damage detection [9], novel view synthesis [32, 21], grasp planning [26] and autonomous driving [4]. In this work, we introduce the problem of estimating 3D pose for fine-grained objects from monocular images.
Figure 1. While both Pascal3D+ and ObjectNet3D contain more complicated scenarios with more generic categories for 3D pose estimation, we provide more accurate pose annotations on a large set of fine-grained object classes as a complementary source for studying 3D pose estimation.

We believe this will become an important component in broader tasks, contributing to both fine-grained object recognition and 3D object pose estimation. To address this task, collecting suitable data is of vital importance. However, due to the expensive annotation cost, most existing 3D pose datasets only provide accurate ground truth annotations for a few object classes, and the number of instances associated with each category is quite small [20]. Although there are two large-scale pose datasets, Pascal3D+ [30] and ObjectNet3D [29], both of them are collected for generic object types, and there is still no large-scale 3D pose dataset for fine-grained object categories. Moreover, these datasets lack accurate pose information, since different objects in one hyper class (e.g., cars) are only matched with a few generic 3D shapes, leading to a high projection error that prevents human annotators from finding the accurate pose, as demonstrated in Figure 1.

In this work, we introduce a new benchmark pose estimation dataset for fine-grained object categories. Specifically, we augment three existing fine-grained recognition datasets, StanfordCars [12], CompCars [31] and FGVC-Aircraft [18], with two types of useful 3D information: (1) for each object in the image, we manually annotate the full perspective projection represented by 7 continuous pose parameters; (2) we provide an accurate match of the computer aided design (CAD) model for each fine-grained object category. The resulting augmented dataset consists of more than 30,000 images for over 400 fine-grained object categories. Table 1 shows the general statistics of our dataset. To the best of our knowledge, our dataset is the first one to employ fine-grained, category-aware 3D models in pose annotation. To fully utilize the valuable fine-grained information, we further develop an automatic pose refinement mechanism to improve over the human annotations. Thanks to the fine-grained shapes, an accurate pose also leads to the optimal segmentation overlap between the projected 2D mask from the 3D model and the ground truth segmentation of the target object. We hence conduct a local greedy search over the 7 full perspective pose parameters, initialized from the human annotation, to maximize the segmentation overlap objective. To avoid extra effort on segmentation annotation, we utilize state-of-the-art image segmentation models, including Mask R-CNN [8] and DeepLab v3+ [2], to obtain an as-accurate-as-possible segmentation reference. This process significantly improves our annotation quality. Figure 2 illustrates this process.

In summary, our contribution is three-fold. (1) We collect a new large-scale 3D pose dataset for fine-grained objects with more accurate annotations, which can be viewed as a complementary source to the existing pose datasets. (2) Our pose annotation contains the full set of perspective model parameters including the camera focal length, making it a more challenging benchmark for developing algorithms beyond only estimating viewpoint angles (azimuth) [7] or recovering the rotation matrices [17].
(3) We propose a simple but effective way to automatically refine the pose annotation based on segmentation cues. With the corresponding fine-grained 3D model, this method can automatically refine the object pose while significantly reducing the human labeling effort.
2. Related Work
3D Pose Estimation Dataset.
Due to the 3D ambiguity of 2D images and the heavy annotation cost, earlier object pose datasets are limited not only in their scale but also in the types of annotation they cover. Table 1 provides a quantitative comparison between our dataset and previous ones. For example, the 3D Object dataset [22] only provides viewpoint annotations for 10 object classes with 10 instances per class. The EPFL Car dataset [20] consists of 2,299 images of 20 car instances captured at multiple azimuth angles; moreover, the other parameters, including elevation and distance, are kept almost the same for all instances in order to simplify the problem [20]. Pascal3D+ [30] is perhaps the first large-scale 3D pose dataset for generic object categories, with 30,899 images from 12 classes of the Pascal VOC dataset [6].
Dataset | #classes | #images | Pose annotation | Fine-grained
3D Object [22] | 10 | - | viewpoint | ✗
EPFL Car [20] | 1 | 2,299 | continuous view | ✗
IKEA [15] | 11 | 759 | 2d-3d alignment | ✗
Pascal3D+ [30] | 12 | 30,899 | 2d-3d alignment | ✗
ObjectNet3D [29] | 100 | 90,127 | 2d-3d alignment | ✗
StanfordCars3D | 196 | 16,185 | 2d-3d alignment | ✓
CompCars3D | 113 | 5,696 | 2d-3d alignment | ✓
FGVC-Aircraft3D | 100 | 10,000 | 2d-3d alignment | ✓
Total (Ours) | 409 | 31,881 | 2d-3d alignment | ✓

Table 1. Comparison between our 3D pose estimation dataset (StanfordCars3D + CompCars3D + FGVC-Aircraft3D) and other benchmark datasets. Our dataset can be viewed as a complementary source to the existing large-scale 3D pose datasets (Pascal3D+ and ObjectNet3D), with a different focus on more intra-class categories and fine-grained details.
Figure 2. For an image with a fine-grained category label (top left), we first find its corresponding fine-grained 3D model (top middle) and manually annotate its rough pose (top right). Since the problem is to estimate the object pose such that the projection of the 3D model aligns with the image as well as possible, we further optimize the segmentation overlap between the projected 2D mask (bottom left) and the "ground truth" mask (bottom middle) estimated from state-of-the-art CNN models to obtain the final 3D pose (bottom right).

Recently, ObjectNet3D [29] further extends the dataset scale to 90,127 images of 100 categories. Both Pascal3D+ and ObjectNet3D assume a camera model with 6 parameters to annotate. However, different images in one hyper class (e.g., cars) are usually matched with only a few coarse 3D CAD models, so the projection error can be large due to the lack of accurate CAD models. Aware of these problems, we instead project fine-grained CAD models to match the objects in the 2D images. In addition, our dataset surpasses most previous ones in both the number of images and the number of classes.
Fine-Grained Recognition Dataset.
Fine-grained recognition refers to the task of distinguishing sub-ordinate categories [27, 12, 25]. In earlier works, 3D information was a common source for improving recognition performance [33, 28, 19, 24].
Figure 3. An overview of our whole annotation framework, which includes two parts: (1) human initial pose annotation, and (2) segmentation-based pose refinement. The human annotation provides a strong initialization for the second-stage pose refinement, hence we only need to conduct a local search to adjust the pose.

As deep learning prevails and fine-grained datasets become larger, the effect of 3D information on recognition diminishes [16, 11]. Recently, [24] incorporated 3D bounding boxes into a deep framework for images of cars taken from a fixed camera. On the other hand, almost all existing fine-grained datasets lack 3D pose labels or 3D shape information [12], and pose estimation for fine-grained object categories is not well studied. Our work fills this gap by annotating poses and matching CAD models on three existing popular fine-grained recognition datasets.
3D Model Dataset.
Similar to [29], we adopt the 2d-3d alignment method to annotate object poses. Annotating in this way requires a source of accurate 3D models of objects. Fortunately, there has been substantial growth in the number of 3D models available online over the last decade [3, 5, 10, 14], with well-known repositories such as the Princeton Shape Benchmark [23], which contains around 1,800 3D models grouped into 90 categories. In this work, we use ShapeNet [1], so far the largest 3D CAD model database, which has indexed more than 3,000,000 models, 220,000 of which are classified into 3,135 categories covering various object types such as cars, airplanes, bicycles, etc. This large number of 3D models allows us to find an exact model for many of the objects in natural images; for example, ShapeNet provides 183,533 models for the car category and 114,045 models for the airplane category. Note that although we only annotate three fine-grained datasets, our annotation framework can continue to be applied to build more 3D pose datasets, thanks to even larger-scale datasets like ShapeNet [1] and iNaturalist [25].
3. Dataset Construction
Building our 3D pose dataset involves two main processes: (1) human pose annotation, and (2) segmentation-based pose refinement. Figure 3 illustrates the whole process. Our human pose annotation process is similar to ObjectNet3D [29] but requires more effort on selecting finer 3D models. We first select the most appropriate 3D model from ShapeNet [1] for each object category in the fine-grained image dataset. We then obtain the pose parameters by asking the annotators to align the projection of the 3D model to the corresponding image using our designed interface. Although humans can initialize the pose annotation with reasonably high efficiency and accuracy, we find it hard for them to adjust the fine detailed poses. Our second-stage segmentation-based pose refinement further adjusts the pose parameters by performing a local greedy search initialized from the human annotation. We discuss the details of each process in the next subsections.
Figure 4. An overview of our annotation interface. The tool renders the projected 2D mask onto the image in real time, so that annotators can adjust the pose parameters while watching the 3D-2D alignment: azimuth (a) and elevation (e) by dragging, distance (d) by scrolling, plus the in-plane rotation, the principal point (u, v) and the focal length (f).
We build three fine-grained 3D pose datasets. Each dataset consists of two parts: 2D images and 3D models. The 2D images are collected from StanfordCars [12], CompCars [31] and FGVC-Aircraft [18], respectively. Unlike Pascal3D+ [30] and ObjectNet3D [29], the target objects in most images are non-occluded and easy to identify. In order to distinguish between fine-grained categories, we adopt a distinct 3D model for each category. Thanks to ShapeNet [1], a large number of 3D models for fine-grained objects are available with make/model names in their meta data, which we use to find the corresponding 3D model given an image category name (a minimal matching sketch is given below). If there is no exact match between a category name and the meta data, we manually select a visually similar 3D model for that category. For StanfordCars, we annotate images for all 196 categories, of which 148 have exactly matched 3D models. For CompCars, we only include the 113 categories with matched 3D models in ShapeNet. For FGVC-Aircraft, we annotate images for all 100 categories, with more than 70 exactly matched models. To the best of our knowledge, our dataset is the first one to employ fine-grained, category-aware 3D models in 3D pose estimation.
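The name-matching step can be illustrated with a short Python sketch. This is only a hedged illustration of the idea: the helper name match_category_to_models and the assumption that the ShapeNet meta data has been parsed into a plain list of make/model strings are ours, and any category without a sufficiently close match is verified and, if needed, re-assigned manually as described above.

import difflib

def match_category_to_models(category_name, model_names, cutoff=0.6):
    """Return ShapeNet model names whose make/model string is closest to a
    fine-grained category label, e.g. 'BMW 3 Series Sedan 2012'.
    `model_names` is assumed to be a list of name strings extracted from the
    ShapeNet meta data; an empty result means manual selection is required."""
    query = category_name.lower()
    names = [n.lower() for n in model_names]
    # fuzzy string matching; cutoff controls how strict the match must be
    hits = difflib.get_close_matches(query, names, n=5, cutoff=cutoff)
    return [model_names[names.index(h)] for h in hits]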
The world coordinate system is defined in accordance with the 3D model coordinate system. A point X on a 3D model (in homogeneous coordinates) is projected onto a point x on the 2D image as

x = P X, (1)

via a perspective projection matrix

P = K [R | T], (2)

where K denotes the intrinsic parameter matrix

K = \begin{bmatrix} f & 0 & u \\ 0 & f & v \\ 0 & 0 & 1 \end{bmatrix}, (3)

and R is the 3 x 3 rotation matrix between the world and camera coordinate systems, parameterized by three angles: elevation e, azimuth a and in-plane rotation θ. We assume that the camera always faces the origin of the 3D model. Hence the translation T = [0, 0, d]^T is defined only up to the model depth d, the distance between the origins of the two coordinate systems, and the principal point (u, v) is the projection of the origin of the world coordinate system onto the image. As a result, our model has 7 continuous parameters in total: camera focal length f, principal point location (u, v), azimuth a, elevation e, in-plane rotation θ and model depth d. Note that, since the images are collected online, the annotated intrinsic parameters (u, v and f) are approximations. Compared to previous datasets [30, 29] with 6 parameters (f fixed), our camera model considers both the camera focal length f and the object depth d in a full perspective projection for finer 2D-3D alignment, which allows for a more flexible pose adjustment and a better shape matching.

We annotate the 3D pose information for all 2D images through crowd-sourcing. To facilitate the annotation process, we develop the annotation tool illustrated in Figure 4. For each image, we choose the 3D model according to the fine-grained label given beforehand. We then ask the annotators to adjust the 7 parameters so that the projected 3D model is aligned with the target object in the 2D image. This process can be roughly summarized as follows: (1) shift the 3D model such that the center of the model (the origin of the world coordinate system) is roughly aligned with the center of the target object in the 2D image; (2) rotate the model to the same orientation as the target object; (3) adjust the model depth d and camera focal length f to match the size of the target object. Some finer adjustment might be applied after the three main steps. In this way we annotate all 7 parameters across the whole dataset. On average, each image takes approximately 1 minute to annotate by an experienced annotator. To ensure quality, after one round of annotation across the whole dataset, we perform a quality check and ask the annotators to do a second round of revision for the unqualified examples.
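To make the 7-parameter camera model concrete, below is a minimal Python/NumPy sketch of the projection in Eqs. (1)-(3). The function names and the exact rotation-axis convention are illustrative assumptions on our part and may differ from the conventions used for the released annotations.

import numpy as np

def rotation_matrix(azimuth, elevation, theta):
    """Build the world-to-camera rotation from azimuth a, elevation e and
    in-plane rotation theta (radians). The axis convention below is one
    common Pascal3D+-style choice, not necessarily the dataset's own."""
    Ra = np.array([[np.cos(azimuth), -np.sin(azimuth), 0],
                   [np.sin(azimuth),  np.cos(azimuth), 0],
                   [0, 0, 1]])
    Re = np.array([[1, 0, 0],
                   [0, np.cos(elevation), -np.sin(elevation)],
                   [0, np.sin(elevation),  np.cos(elevation)]])
    Rt = np.array([[np.cos(theta), -np.sin(theta), 0],
                   [np.sin(theta),  np.cos(theta), 0],
                   [0, 0, 1]])
    return Rt @ Re @ Ra

def project_points(X, pose):
    """Project Nx3 model points X onto the image with the 7 pose parameters
    p = (a, e, theta, d, f, u, v), following Eqs. (1)-(3)."""
    R = rotation_matrix(pose['a'], pose['e'], pose['theta'])
    T = np.array([0.0, 0.0, pose['d']])           # camera faces the model origin
    K = np.array([[pose['f'], 0.0, pose['u']],
                  [0.0, pose['f'], pose['v']],
                  [0.0, 0.0, 1.0]])                # intrinsics, Eq. (3)
    Xc = X @ R.T + T                               # world -> camera coordinates
    x = Xc @ K.T                                   # camera -> homogeneous pixels
    return x[:, :2] / x[:, 2:3]                    # perspective division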
Algorithm 1: Iterative local pose search.
Input: 3D model M, human pose annotation p0, segmentation reference s*, 2D mask generator S(p, M), segmentation evaluation function IoU(s, s'), pose parameter update units ε, update step size α.
Output: optimized pose parameters p*.
for each image with segmentation reference s* do
    Initialize pose parameters p = p0.
    Initialize 2D mask s = S(p, M).
    Initialize iou = IoU(s, s*).
    repeat
        Update iou_last = iou.
        for each dimension i in p do
            Update p+_i = p_i + α ε_i and p-_i = p_i - α ε_i.
            Render new 2D masks s+ = S(p+, M) and s- = S(p-, M).
            Update iou+ = IoU(s+, s*) and iou- = IoU(s-, s*).
            Update iou = max(iou, iou+, iou-) and p = arg max(iou, iou+, iou-).
        end for
        if iou == iou_last then
            Update α = α / 2; if α <= threshold, declare convergence.
        end if
    until convergence
end for
Although human annotators already provide reasonably accurate annotations in the first stage, we notice that there is still room to further improve the annotation quality.
Figure 5. Iterative local greedy search for the fine detailed pose, initialized from the human annotation (initial pose by human, iteration 1, iteration 2, final pose). The green highlights are the 2D masks projected by the 3D model during pose optimization.

This is because humans are good at providing a strong initial pose estimate, but fine-tuning the detailed pose parameters is tedious for them. Realizing that ultimately the problem is to estimate the object pose such that the projection of the 3D model aligns with the image as well as possible, we design a simple but effective iterative local greedy search algorithm that automatically adjusts the pose parameters by maximizing

max_p J(p) = IoU(S(p, M), s*), (4)

where s* is the 2D object segmentation reference and S(p, M) maps a 3D model M to a 2D mask according to the pose parameters p = (a, e, θ, d, f, u, v).

The algorithm aims to fine-tune the 7 pose parameters to maximize the segmentation overlap between the projected 2D mask from the 3D model and the segmentation reference, using the standard intersection over union (IoU) as the overlap criterion. The algorithm greedily updates the pose parameters; it is hence a local search algorithm guaranteed to converge to a local optimum. During the local search process, we observe that it converges in 3-10 iterations, taking about 1 minute per image on average. Algorithm 1 shows the overall procedure and Figure 5 illustrates the local search; a minimal code sketch is also given below.

To conduct the local greedy search, ideally we need the ground truth segmentation of the target object. Although we could set up another segmentation annotation interface for all 2D images in the three datasets through crowd-sourcing, we find that existing state-of-the-art image segmentation models such as Mask R-CNN [8] and DeepLab v3+ [2] already provide satisfying segmentation results.
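The following is a minimal Python sketch of the coordinate-wise greedy search in Algorithm 1. It assumes a user-supplied render_mask(pose, model) function that rasterizes the projected 3D model into a binary mask (e.g., with OpenGL, as in Figure 6); the helper names, the per-parameter update units eps and the convergence threshold are illustrative, not the exact values used to build the dataset.

import numpy as np

PARAM_NAMES = ['a', 'e', 'theta', 'd', 'f', 'u', 'v']

def mask_iou(s, s_ref):
    """Intersection over union between two boolean masks (Eq. 4)."""
    inter = np.logical_and(s, s_ref).sum()
    union = np.logical_or(s, s_ref).sum()
    return inter / union if union > 0 else 0.0

def refine_pose(pose, model, s_ref, render_mask, eps, alpha=1.0, min_alpha=1e-3):
    """Coordinate-wise greedy search over the 7 pose parameters, initialized
    from the human annotation, maximizing IoU with the segmentation reference.
    `eps` is a dict of per-parameter update units; `alpha` is the step size."""
    pose = dict(pose)
    iou = mask_iou(render_mask(pose, model), s_ref)
    while alpha > min_alpha:
        improved = False
        for name in PARAM_NAMES:
            base = dict(pose)
            candidates = []
            for sign in (+1, -1):        # try a step in both directions
                cand = dict(base)
                cand[name] = base[name] + sign * alpha * eps[name]
                candidates.append((mask_iou(render_mask(cand, model), s_ref), cand))
            best_iou, best_cand = max(candidates, key=lambda t: t[0])
            if best_iou > iou:           # keep the move only if IoU improves
                pose, iou = best_cand, best_iou
                improved = True
        if not improved:
            alpha /= 2.0                 # no move helped: shrink the step size
    return pose, iou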
Figure 6. An illustration of our reference segmentation extraction process. Ideally, we could ask human annotators to annotate the ground truth segment for the target object in a 2D image. However, we find that CNNs such as Mask R-CNN and DeepLab already provide sufficiently accurate segmentation predictions for the pose refinement.

Figure 7. The polar histograms of the three rotation parameters (azimuth a, elevation e, in-plane rotation θ) as well as the histograms of the other four parameters (focal length f, model depth d, principal point u and v) for StanfordCars3D, CompCars3D and FGVC-Aircraft3D.
For example, on the Pascal VOC 2012 segmentation benchmark, DeepLab v3+ reaches average IoUs of 93.2% on the "car" class and 97.0% on the "aeroplane" class. Mask R-CNN, although its semantic segmentation is not as accurate, produces instance-level segmentation, which is particularly useful for images containing more than one instance of the same class. In the end, we use a combination of both models to find the most appropriate segmentation reference, as sketched below. Figure 6 illustrates the process.
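As an illustration of this combination step, the sketch below picks the reference mask by comparing each CNN-predicted candidate against the mask rendered from the initial human pose and keeping the candidate with the largest overlap, following the "find max overlap" step in Figure 6. The helper name and the min_iou fallback threshold are our own assumptions.

import numpy as np

def pick_segmentation_reference(candidate_masks, init_mask, min_iou=0.5):
    """Choose, among CNN-predicted binary masks (instance masks from Mask R-CNN
    plus the class mask from DeepLab v3+), the one overlapping most with the
    mask rendered from the initial human pose annotation."""
    def iou(a, b):
        union = np.logical_or(a, b).sum()
        return np.logical_and(a, b).sum() / union if union > 0 else 0.0

    best, best_iou = None, 0.0
    for mask in candidate_masks:
        score = iou(mask, init_mask)
        if score > best_iou:
            best, best_iou = mask, score
    # if no candidate overlaps enough, fall back to the human annotation only
    return best if best_iou >= min_iou else None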
We plot the distributions of the 7 parameters in Figure 7 for StanfordCars3D, CompCars3D and FGVC-Aircraft3D, respectively. Due to the nature of the original fine-grained datasets, the parameters are not uniformly distributed. Unsurprisingly, the most challenging parameter across the three datasets is azimuth (a), which varies over the full 360° range, while elevation (e) and in-plane rotation (θ) are concentrated in a small range around 0°, since images of cars and airplanes are usually taken from the ground view. The distributions of focal length (f) and model depth (d) are also not widespread, because objects in these fine-grained images are generally normalized and cropped to a standard size. Although the parameter distribution may raise concerns about learning trivial solutions, we believe that our first attempt still provides reasonable diversity in pose annotation. For example, the distributions of azimuth (a) are quite different across the three datasets and complementary to each other, which could encourage building a more generalized pose estimation model.

3.7. Dataset Split

We split the three datasets as follows. For StanfordCars3D, since we have annotated all the images, we follow the standard train/test split provided by the original dataset [12], with 8,144 training examples and 8,041 testing examples. For CompCars3D, we randomly sample 2/3 of our annotated data as the training set and keep the remaining 1/3 as the testing set, resulting in 3,798 training and 1,898 testing examples; we provide the train/test split information in the dataset release. For FGVC-Aircraft3D, we follow the standard train/test split provided by the original dataset [18], with 6,667 training examples and 3,333 testing examples.
4. Dataset Comparison
We compare our annotation quality with two existing large-scale 3D pose datasets, PASCAL3D+ [30] and ObjectNet3D [29]. It is worth noting that we are not aiming to show the superiority of our dataset, since both previous datasets consider more general scenarios with multiple objects and challenging occlusions in an image. Rather, we hope that comparing to them demonstrates that our fine-grained pose dataset can become a complementary resource for studying 3D pose estimation in monocular images. Figure 8 and Figure 9 show the qualitative comparison on the "car" class and the "aeroplane" class, respectively. Overall, we find our annotation more satisfying when visually comparing the overlay images that map the 3D model onto the 2D image. To further conduct a quantitative comparison, we use the segmentation overlap between the projected 2D mask and the ground truth object mask as the evaluation measure. We randomly select 50 "car" images and 50 "aeroplane" images from each of PASCAL3D+ and ObjectNet3D, and 50 images each from StanfordCars3D and FGVC-Aircraft3D; in total, we randomly select 300 images and annotate them with ground truth segmentation. Since both PASCAL3D+ and ObjectNet3D consider more complicated scenarios such as multiple objects with cluttered background, we filter out the images containing more than one object of reasonably large size for a fair comparison; hence the average IoUs can be an optimistic estimate for both baseline datasets. Even so, our annotation shows a clear improvement in average segmentation IoU on both "car" and "aeroplane", as demonstrated in Table 2. In particular, both the mean and the standard deviation of the segmentation IoUs improve significantly, indicating that our annotations are not only more accurate but also more stable.
car | PASCAL3D+ [30] | ObjectNet3D [29] | StanfordCars3D
average IoU | 78.5% | - | -
aeroplane | PASCAL3D+ [30] | ObjectNet3D [29] | FGVC-Aircraft3D
average IoU | 62.7% | - | -

Table 2. Comparison of the average IoUs (with standard deviations) on the "car" category and the "aeroplane" category. Note that in this evaluation, we manually annotate around 50 ground truth segmentation masks for each dataset.

Average IoU | Human Annotation | Refined Annotation
StanfordCars3D | 84.1% | 90.4%
FGVC-Aircraft3D | 65.3% | 78.9%

Table 3. Segmentation evaluation of the initial human annotation and the annotation after iterative pose refinement on the two datasets. Note that in this evaluation, we manually annotate around 50 ground truth segmentation masks for each dataset.

Dataset | Worse | Equal | Better
StanfordCars3D | 13.0% | 28.3% | 58.7%
FGVC-Aircraft3D | 12.8% | 40.4% | 46.8%

Table 4. Human satisfaction rates comparing the original human annotation with the refined pose. "Worse" means the refined pose is worse than the initial pose, "Better" means it is better, and "Equal" means the two annotations are roughly the same. From the table, we can see that humans are much more satisfied with the refined pose annotation.

We also analyze how much we gain by conducting the segmentation-based pose refinement. To understand this, we utilize the manually annotated ground truth 2D segmentations on the 100 randomly selected images from StanfordCars and FGVC-Aircraft, and compare the average IoUs of the human-annotated pose and the refined pose. Table 3 shows the improvement of the segmentation overlap on the two datasets. On StanfordCars3D, for example, our second-stage refinement improves the average IoU from 84.1% to 90.4%, which is significant. On FGVC-Aircraft3D, the improvement is even larger, from 65.3% to 78.9%. Figure 10 and Figure 11 illustrate the pose improvement qualitatively.

Considering that segmentation overlap may not be the only appropriate quantitative measure, we further conduct a human study to compare the pose annotation quality. To do this, we hire 5 professional annotators, show them the 2D-3D alignments of the same image from the two stages simultaneously, and let them rate the relative quality for the 50 selected images in each dataset. The relative comparison consists of "Worse", "Equal" and "Better", indicating that the second-stage pose is significantly worse, roughly equal, or significantly better than the first-stage human annotation from a subjective point of view. Table 4 shows the result. Most of the time, the second-stage refined pose is either roughly equal or significantly better than the initial human annotation, suggesting the benefit of utilizing segmentation cues to facilitate the pose search.

Figure 8. Qualitative comparison of ground-truth pose annotations between our StanfordCars3D and two existing large-scale 3D pose datasets (Pascal3D+ and ObjectNet3D). We randomly select 5 car images from each dataset. While both Pascal3D+ and ObjectNet3D provide more complicated scenarios with more generic categories for 3D pose estimation, our pose annotations look more accurate thanks to the fine-grained shape matching.

Figure 9. Qualitative comparison of ground-truth pose annotations between our FGVC-Aircraft3D and two existing large-scale 3D pose datasets (Pascal3D+ and ObjectNet3D). We randomly select 5 aircraft images from each dataset.
5. Discussions
In summary, we introduce the new problem of 3D pose estimation for fine-grained object categories from a monocular image. We annotate three popular fine-grained recognition datasets with 3D shapes and poses, resulting in a total of 31,881 images across 409 classes. By utilizing image segmentation as an intermediate cue, we further improve the pose annotation quality. It is worth noting that humans may ultimately produce better annotations given unlimited time, but the segmentation-based pose refinement provides a better trade-off between cost and accuracy. There is still a need for future work to continue the improvement. First, the set of super-categories should continue to be enlarged with more fine-grained datasets. Second, the current fine-grained datasets are less challenging in terms of background clutter and object size. Third, while all existing large-scale pose datasets are limited to rigid objects, it is still necessary to develop methods for non-rigid objects. Finally, it is also possible to develop a neural network architecture to replace the segmentation-based pose refinement and combine it with the human annotation interface. We leave these as future work.
Figure 10. Selected examples illustrating the second-stage automatic pose refinement improving the initial human pose annotation on the StanfordCars3D dataset (top: initial pose, bottom: final pose).

Figure 11. Selected examples illustrating the second-stage automatic pose refinement improving the initial human pose annotation on the FGVC-Aircraft3D dataset (top: initial pose, bottom: final pose).

References
[1] A. X. Chang, T. Funkhouser, L. Guibas, P. Hanrahan, Q. Huang, Z. Li, S. Savarese, M. Savva, S. Song, H. Su, et al. ShapeNet: An information-rich 3D model repository. arXiv preprint arXiv:1512.03012, 2015.
[2] L.-C. Chen, Y. Zhu, G. Papandreou, F. Schroff, and H. Adam. Encoder-decoder with atrous separable convolution for semantic image segmentation. In ECCV, 2018.
[3] X. Chen, A. Golovinskiy, and T. Funkhouser. A benchmark for 3D mesh segmentation. In ACM Transactions on Graphics (TOG), volume 28, page 73. ACM, 2009.
[4] X. Chen, K. Kundu, Z. Zhang, H. Ma, S. Fidler, and R. Urtasun. Monocular 3D object detection for autonomous driving. In CVPR, 2016.
[5] X. Chen, A. Saparov, B. Pang, and T. Funkhouser. Schelling points on 3D surface meshes. ACM Transactions on Graphics (TOG), 31(4):29, 2012.
[6] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman. The PASCAL visual object classes (VOC) challenge. International Journal of Computer Vision, 88(2):303–338, 2010.
[7] A. Ghodrati, M. Pedersoli, and T. Tuytelaars. Is 2D information enough for viewpoint estimation? In BMVC, 2014.
[8] K. He, G. Gkioxari, P. Dollár, and R. Girshick. Mask R-CNN. In ICCV, 2017.
[9] S. Jayawardena et al. Image based automatic vehicle damage detection. PhD thesis, The Australian National University, 2013.
[10] V. G. Kim, W. Li, N. J. Mitra, S. Chaudhuri, S. DiVerdi, and T. Funkhouser. Learning part-based templates from large collections of 3D shapes. ACM Transactions on Graphics (TOG), 32(4):70, 2013.
[11] J. Krause, B. Sapp, A. Howard, H. Zhou, A. Toshev, T. Duerig, J. Philbin, and L. Fei-Fei. The unreasonable effectiveness of noisy data for fine-grained recognition. In European Conference on Computer Vision, pages 301–320. Springer, 2016.
[12] J. Krause, M. Stark, J. Deng, and L. Fei-Fei. 3D object representations for fine-grained categorization. In ICCV Workshops on 3D Representation and Recognition, 2013.
[13] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, 2012.
[14] B. Li, Y. Lu, C. Li, A. Godil, T. Schreck, M. Aono, M. Burtscher, H. Fu, T. Furuya, H. Johan, et al. SHREC'14 track: Extended large scale sketch-based 3D shape retrieval. In Eurographics Workshop on 3D Object Retrieval, volume 2014, 2014.
[15] J. J. Lim, H. Pirsiavash, and A. Torralba. Parsing IKEA objects: Fine pose estimation. In ICCV, 2013.
[16] T.-Y. Lin, A. RoyChowdhury, and S. Maji. Bilinear CNN models for fine-grained visual recognition. In ICCV, 2015.
[17] S. Mahendran, H. Ali, and R. Vidal. 3D pose regression using convolutional neural networks. In IEEE International Conference on Computer Vision, volume 1, page 4, 2017.
[18] S. Maji, E. Rahtu, J. Kannala, M. Blaschko, and A. Vedaldi. Fine-grained visual classification of aircraft. arXiv preprint arXiv:1306.5151, 2013.
[19] R. Mottaghi, Y. Xiang, and S. Savarese. A coarse-to-fine model for 3D pose estimation and sub-category recognition. In CVPR, 2015.
[20] M. Ozuysal, V. Lepetit, and P. Fua. Pose estimation for category specific multiview object localization. In CVPR, 2009.
[21] E. Park, J. Yang, E. Yumer, D. Ceylan, and A. C. Berg. Transformation-grounded image generation network for novel 3D view synthesis. In CVPR, 2017.
[22] S. Savarese and L. Fei-Fei. 3D generic object categorization, localization and pose estimation. In ICCV, 2007.
[23] P. Shilane, P. Min, M. Kazhdan, and T. Funkhouser. The Princeton Shape Benchmark. In Shape Modeling Applications, 2004. Proceedings, pages 167–178. IEEE, 2004.
[24] J. Sochor, A. Herout, and J. Havel. BoxCars: 3D boxes as CNN input for improved fine-grained vehicle recognition. In CVPR, 2016.
[25] G. Van Horn, O. Mac Aodha, Y. Song, A. Shepard, H. Adam, P. Perona, and S. Belongie. The iNaturalist challenge 2017 dataset. arXiv preprint arXiv:1707.06642, 2017.
[26] J. Varley, C. DeChant, A. Richardson, J. Ruales, and P. Allen. Shape completion enabled robotic grasping. In IROS, 2017.
[27] C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie. The Caltech-UCSD Birds-200-2011 dataset. Technical Report CNS-TR-2011-001, California Institute of Technology, 2011.
[28] Y. Xiang, W. Choi, Y. Lin, and S. Savarese. Data-driven 3D voxel patterns for object category recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1903–1911, 2015.
[29] Y. Xiang, W. Kim, W. Chen, J. Ji, C. Choy, H. Su, R. Mottaghi, L. Guibas, and S. Savarese. ObjectNet3D: A large scale database for 3D object recognition. In ECCV, 2016.
[30] Y. Xiang, R. Mottaghi, and S. Savarese. Beyond PASCAL: A benchmark for 3D object detection in the wild. In WACV, 2014.
[31] L. Yang, P. Luo, C. Change Loy, and X. Tang. A large-scale car dataset for fine-grained categorization and verification. In CVPR, 2015.
[32] T. Zhou, S. Tulsiani, W. Sun, J. Malik, and A. A. Efros. View synthesis by appearance flow. In ECCV, 2016.
[33] M. Z. Zia, M. Stark, B. Schiele, and K. Schindler. Detailed 3D representations for object recognition and modeling. PAMI, 35(11):2608–2623, 2013.