What can we learn about CNNs from a large scale controlled object dataset?
Ali Borji, UCF, [email protected]
Saeed Izadi, AUT, [email protected]
Laurent Itti, USC, [email protected]
Abstract
Tolerance to image variations (e.g., translation, scale, pose, illumination) is an important desired property of any object recognition system, be it human or machine. Moving towards increasingly bigger datasets has been the trend in computer vision, especially with the emergence of highly popular deep learning models. While very useful for learning invariance to object inter- and intra-class shape variability, these large-scale wild datasets are not very useful for learning invariance to other parameters, forcing researchers to resort to other tricks for training a model. In this work, we introduce a large-scale synthetic dataset, which is freely and publicly available, and use it to answer several fundamental questions regarding invariance and selectivity properties of convolutional neural networks. Our dataset contains two parts: a) objects shot on a turntable: 16 categories, 8 rotation angles, 11 cameras on a semi-circular arch, 5 lighting conditions, 3 focus levels, and a variety of backgrounds (23.4 per instance on average), generating 1,320 images per instance per background (over 20 million images in total); and b) scenes: in which a robot arm takes pictures of objects on a 1:160 scale scene. We study: 1) invariance and selectivity of different CNN layers, 2) knowledge transfer from one object category to another, 3) systematic vs. random sampling of images to build a training set, 4) domain adaptation from synthetic to natural scenes, and 5) the order of knowledge delivery to CNNs. We also explore how our analyses can lead the field to develop more efficient CNNs.
1. Introduction
Object and scene recognition is arguably the most important problem in computer vision, and while humans do it fast and almost effortlessly, machines still lag behind. In some cases where variability is relatively low (e.g., fingerprint or frontal face recognition), machines outperform humans, but they do not perform quite as well when variability is high. Hence, the crux of the object recognition problem is tolerance to intra- and inter-class variability, lighting, scale, in-plane and in-depth rotation, background clutter, etc. [6].

Thanks to deep neural networks, computer vision has enjoyed rapid progress over the last couple of years, witnessed by high accuracies on the ImageNet dataset (top-5 error rates of about 5-10% over 1,000 object categories). These models (e.g., VGG [37], Alexnet [20], Overfeat [33], and GoogLeNet [40]) have surpassed previous scores in several applications and benchmarks such as generic object and scene recognition [20, 37], object detection [33, 13], semantic scene segmentation [3, 13], face detection and recognition [47], texture recognition [4], fine-grained recognition [24], multi-view 3D shape recognition [39], activity recognition and classification [36, 18], and saliency detection [21].

One big concern regarding wild large-scale benchmarks and datasets, however, is the lack of control over data collection procedures and the lack of a deep comprehension of stimulus variety. While existing large-scale datasets are very rich in terms of inter- and intra-class variability, they fail to probe the ability of a model to solve the general invariance problem. In other words, natural image datasets (e.g., ImageNet [5], SUN [45], PASCAL VOC [9], LabelMe [31], Tiny [42]) are inherently biased in the sense that they do not offer all object variations. To remedy this, some works (e.g., [29]) have resorted to synthetic datasets in which several object parameters can be controlled.

Ideally, we want models to be tolerant to identity-preserving image variation (e.g., variation in position, scale, pose, illumination, occlusion). To probe this, some researchers have used synthetic home-brewed datasets, either by taking pictures of objects on a turntable (e.g., NORB [23], COIL [25], SOIL-47 [19], ALOI [12], GRAZ [26], BigBIRD [38]) or by constructing 3D graphics models and rendering textures onto them (e.g., Pinto [29], Saenko [27]). While very beneficial in the past, these datasets are too small for training deep neural networks with millions of parameters. Further, they usually have a small number of classes, instances per class, background variations, in-plane and in-depth rotations, illuminations, scales, and total images. Here, to remedy these shortcomings, we introduce a large-scale controlled object dataset with rich variety and a larger set of images.

Our main contributions in this work are two-fold: 1) we introduce a large-scale controlled dataset of objects shot in isolation and placed in scenes (together with other objects), and 2) we conduct several analyses of CNNs addressing fundamental questions and propose new pathways to build more efficient deep learning models in the future.
2. Related work
Several controlled datasets for recognition tasks have been introduced in the past and have dramatically helped progress in computer vision. Some famous examples are the FERET face [28] and MNIST digit [22] datasets. Nowadays, we have systems that perform at or above the level of humans on these tasks (though perhaps not as robustly to variations and noise). Similar datasets are available for generic object recognition but lack the characteristics of a large-scale representative dataset covering many sorts of invariance (e.g., background clutter, illumination, shape, occlusion, size). For example, the COIL dataset [25], which also used a turntable to film 100 objects under various lightings and poses, only contains one object instance per category (e.g., one telephone, one mug). Further, objects were shot only on black backgrounds. As another example, the larger ALOI dataset [12] contains 1,000 objects but few instances per category. The NORB dataset [23] has 50 small toy objects (10 instances in each of 5 categories); however, all objects were painted uniformly and shot in greyscale on blank backgrounds. Almost all available turntable datasets are small-scale and not very rich in terms of variations. Existing natural scene datasets such as ImageNet [5], SUN [45], Caltech256 [15], and Tiny [42] are very rich at the instance level but lack variety in terms of other parameters (e.g., many instances of an object such as car, but each only from a random viewpoint).

Most previous research using controlled datasets, such as turntable images, has focused on inspecting models or on brewing concepts and ideas. Some recent works have attempted to show that there is a real benefit to these datasets and that results achieved on them may generalize or transfer to large-scale natural scene datasets. This has been studied under the names of domain adaptation or knowledge transfer. The idea here is that knowledge gained from a controlled dataset, created in one of the two ways mentioned above, can be transferred to real-world naturalistic datasets which may even have different statistics. For example, Peng et al. [27] trained a model from synthetically generated images (using a 3D graphics object model) and, by augmenting their data with images from ImageNet and PASCAL, reported an improvement in accuracy over the latter datasets. They, however, did not probe whether what they learned was due to better invariance or richness at the instance level. Some other works have advocated and pursued this direction under different terminology [14, 32, 7, 10].

Another drive for using controlled datasets comes from the neuroscience and cognitive vision literature. While CNNs were inspired by the hierarchical structure of the human visual ventral stream [11], they were later used to explain some physiological and behavioral data of humans and monkeys (e.g., [30, 35, 46, 34]). It has also been asserted that humans learn invariance from few presentations of an object, a.k.a. zero- or one-shot learning. This is the opposite of the way that CNNs learn recognition: these models need an enormous amount of labeled data. In this work, we explore how a rich controlled dataset, containing a lot of information regarding various object parameters, can be utilized to improve object recognition performance. It is worth noting that being aware of human performance is important; otherwise progress could get trapped in a local minimum. Just recently, He et al. [16] reported a top-5 error of 4.9% on ImageNet, which is lower than the 5.1% human error rate.
This raises some questions, such as: Have models surpassed humans? Is it theoretically possible to achieve better-than-human performance on this problem?

Another area related to our work, which naturally fits turntable datasets well, is the manifold embedding and dimensionality reduction literature. These techniques try to preserve and leverage the underlying low-dimensional manifold in a supervised or unsupervised manner (e.g., [48, 41]). For instance, Weston et al. [44] introduced an embedding-based regularizer that imposes the same labels on neighboring training samples, thus benefiting from the structure/manifold in the data. They used gradient descent to optimize the regularizer and adopted it for CNNs. Another classic example is Siamese Networks [2], which are two identical copies of the same network, with the same weights, fed into a 'distance measuring' layer that computes whether two examples are similar or not, given labeled data which encourages similar examples to be close and dissimilar ones to be at least a certain distance from each other. While these techniques have been applied to controlled datasets, it remains to be explored how useful they are on large-scale datasets. Our proposed dataset can be helpful in this direction as it combines the best of both worlds: the instance-level variety of large-scale datasets and rich, parametrized, controlled synthetic images. These two, we believe, could be precious for enhancing the capability of CNNs.
3. Turntable object dataset
Our dataset contains 16 categories of objects (Micro Machines toys produced by the Galoob corporation) which differ in shape, texture, color, etc. It has 25-160 instances per category, shot on a variety of real-world backgrounds (printed satellite images at a scale of 1:160). Background scenes were 125 satellite images randomly taken from the Internet, plus an additional 7 plain backgrounds (white, red, blue, yellow, etc.); every object was photographed on at least 20 different, contextually related backgrounds (e.g., boats on water, cars on roads), and cropped versions at 256 × 256 pixels are also available. The whole dataset contains more than 20 million images and occupies about 17.65 TB. We describe the photo shooting in the following.

Each object instance was placed on a 14-inch diameter circular plate, shown in Fig. 1.a. The turntable rotated 45 degrees per move, generating 8 images (azimuth angles); this is referred to as the 'rotation' parameter in the rest of the paper. Eleven cameras (Logitech C910 webcams) were mounted on a semi-circular arch, capturing 11 in-depth rotation images (elevation angles, referred to as the 'camera' parameter). We had four light sources (LED light bulbs by Ecosmart ECS) placed on the four corners of the table, generating 4 lighting conditions (plus an additional fifth case where all lights were on). We also had 3 scale/focus conditions (-3, 0, and +3 from the default focus value of each camera). This setting resulted in a total of 1,320 images for one object instance on each background (11 cameras × 8 rotations × 5 lighting conditions × 3 focus values). Images are 960 × 720 pixels and are stored in the lossless PNG format (about 1 MB each). Sample images of the dataset are shown in Fig. 1.c (rotation and camera images of an instance from the car category), Fig. 1.d (an instance of a boat shot under 5 different illuminations), and Fig. 1.e (samples of each object at rotation 0, lighting 0, focus 0, on a randomly chosen background). Statistics of the dataset are summarized in Table 1.

Figure 1. a) Turntable object photo-shooting setup: a turntable with 8 rotation angles, 11 cameras on an arch, 4 lighting sources (generating 5 lighting conditions), 3 focus values, and random backgrounds (overall, 1,320 images for each object instance per background); recording resolution is 960 × 720. Panels b)-e) show the robotics-assisted scene setup and sample images described in the text.

As part of our dataset, we also shot objects in scenes using two robotics-assisted arms (Fig. 1.b). Several objects were placed manually on a congruent background (1:160 satellite maps corresponding to a 195 m × ... real-world area), and images were captured at 960 × 720 pixels. While the turntable images support learning object recognition, the robotics-assisted scenes offer a platform for training object detectors and for scene understanding. Turntable images are taken at predefined parameter settings, while images from the robotic workspace contain higher variety in terms of parameters (e.g., random viewpoints or scales). Together, these two types of images can be very useful for training and testing object detection and recognition models in a way that resembles natural settings.
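To make the acquisition grid concrete, the following Python sketch enumerates the parameter combinations behind the 1,320 images captured per instance and background. The directory and file-naming scheme shown here is hypothetical, purely for illustration, and is not the dataset's actual layout.

```python
from itertools import product

# Acquisition parameters described above (per object instance, per background).
CAMERAS   = range(11)           # 11 elevation angles along the semi-circular arch
ROTATIONS = range(0, 360, 45)   # 8 turntable azimuth angles, 45 degrees apart
LIGHTINGS = range(5)            # 4 single light sources + one "all lights on" case
FOCUSES   = (-3, 0, +3)         # offsets from each camera's default focus value

def shots_per_instance_per_background():
    """Enumerate every (camera, rotation, lighting, focus) combination."""
    return list(product(CAMERAS, ROTATIONS, LIGHTINGS, FOCUSES))

shots = shots_per_instance_per_background()
assert len(shots) == 11 * 8 * 5 * 3 == 1320  # matches the 1,320 images per instance/background

# Hypothetical naming scheme used only for this illustration:
def image_name(category, instance, background, camera, rotation, lighting, focus):
    return (f"{category}/inst{instance:03d}/bg{background:03d}/"
            f"cam{camera:02d}_rot{rotation:03d}_light{lighting}_focus{focus:+d}.png")

print(image_name("boat", 1, 12, camera=3, rotation=90, lighting=0, focus=0))
```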
4. Results
To start exercising the dataset, we tested it on small subsets of the available data. To understand generalization across image variations (object shape, object viewpoint, lighting, etc.), CNNs are evaluated on slices of the dataset. We utilize a deep CNN pre-trained on ImageNet and fine-tune it on our dataset. The behavior of off-the-shelf features is investigated in our analyses as well. We use 7 object categories (out of 16) and avoid data augmentation since we already have flipped versions of the objects from the turntable. Since Alexnet has achieved great success on object and scene classification benchmarks, we choose it as the representative of CNNs in our analyses.

Category              Objects  Bg mean  Bg std  Bg min-max  Total images  Size (GB)  Used here
boat                  27       20       0.0     20-20       713K          551        yes
bus                   25       21.3     1.5     20-23       704K          545        yes
calibration           13       1        0.0     1-1         17K           11         -
car                   160      26.1     1.3     24-28       5517K         4300       -
equipment             64       21.6     1.3     20-23       1822K         1500       -
lightweight military  54       18.5     0.9     18-20       2611K         2100       -
tank                  31       30.3     7.8     20-36       1432K         1200       yes
train wagon           25       37       0.0     37-37       462K          363        yes
ufo                   40       29       4.4     26-37       739K          565        yes
van                   29       29.4     0.9     28-30       933K          724        yes
semi truck            33       23.1     5.0     17-27       1112K         874        -
airplane              85       18.4     3.3     17-26       1907K         1400       -
pickup truck          40       30.1     4.9     25-35       1505K         1200       -
helicopter            25       23.2     10.6    14-35       660K          495        -
f1-car                40       14       0.0     14-14       950K          722        yes
monster truck         40       21.5     4.8     14-25       1425K         1100       -

Table 1. Summary statistics of our dataset. There are 22,510,168 images in total from 16 categories (one used for calibration purposes only), with 25 to 160 instances per category. The five parameters are: 11 cameras on an arch, 4 lighting sources on 4 corners (5 conditions), 8 horizontal turntable rotations, 132 backgrounds (7 solid color), and 3 focus values. The average number of backgrounds per object instance is 23.39. There are 46 unique backgrounds per category (avg bg per object 145.76 with std = 162.62; min = 25, max = 731). The total size of the dataset at resolution 960 × 720 is 17.65 TB. A cropped version of these images (256 × 256 pixels) is also available, with a size of 2.2 TB. Total numbers of images per category are rounded to save space.

The Alexnet architecture is basically a linear feed-forward cascade of convolution and pooling layers: the first two layers are composed of four sublayers (convolution, local response normalization, ReLU, and max-pooling). Layers 3 and 4 include convolution and ReLU, followed by layer 5, which consists of convolution, ReLU, and max-pooling. Two fully connected layers (fc6 and fc7) are then appended on top of the pool5 layer. Finally, fc8 is the label layer. We refer the reader to the original Alexnet paper [20] for more details on model parameters (e.g., data augmentation, RGB jitter, etc.). Depending on the analysis, the label layer here may contain 2, 4, or 7 units (for object categories), or a variable number of units depending on the parameter prediction task. We report average accuracies and standard deviations wherever there is randomness in the experimental procedure. Experiments are performed using the publicly available Caffe toolkit [17], run on an Nvidia Titan X GPU and Ubuntu 14.04.

We aim to answer these questions systematically: Can a pre-trained CNN model predict the setting parameters, say lighting source, degree of azimuthal rotation, or degree of camera elevation, and transfer the learned knowledge from one object category to another? Which parameters are more important in the transfer? How much knowledge can a model transfer from our dataset to the ImageNet dataset? What is a good strategy for building an object dataset: random or systematic image harvesting? And finally, how does the order of learning parameter invariances influence overall network parameter tolerance?
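Our pipeline is built on Caffe [17]; purely as an illustrative, non-authoritative sketch of how the pool5 and fc7 activations used throughout the following analyses can be read out of an ImageNet-pretrained AlexNet, here is a PyTorch/torchvision version. The mapping of "pool5" and "fc7" onto torchvision's layer indices is our own assumption, not the Caffe reference model used in the paper.

```python
import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image

# ImageNet-pretrained AlexNet as a stand-in for the Caffe reference model.
model = models.alexnet(weights="IMAGENET1K_V1").eval()  # eval(): dropout disabled

preprocess = T.Compose([
    T.Resize(256), T.CenterCrop(224), T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def extract_pool5_fc7(image_path):
    """Return (pool5, fc7) activation vectors for a single image."""
    x = preprocess(Image.open(image_path).convert("RGB")).unsqueeze(0)
    pool5 = model.avgpool(model.features(x)).flatten(1)   # 256*6*6 = 9216-d
    fc7 = model.classifier[:6](pool5)                      # up to the second ReLU: 4096-d
    return pool5.squeeze(0).numpy(), fc7.squeeze(0).numpy()
```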
Humans are very good at predicting the category of an object and can also tell about its setting parameters. This makes them both selective (to parameters, including object category) and invariant to variations. In this experiment, we aim to systematically investigate this competition for two layers of the Alexnet: pool5 and fc7. We probe the expressive power of these layers for object and parameter prediction.

Four categories from our dataset (out of 16) were chosen for this analysis: boat, bus, tank, and ufo. Images were lumped together to train an SVM classifier. All features were normalized to have zero mean before being fed to the classifier. The dimensionality was reduced to N dimensions using SVD, where N refers to the number of instances in the training set. The reported results are average accuracies over random 5-fold cross-validation test sets, each of size 2K. We trained two SVMs, one for category prediction and one for parameter prediction. Results are shown in Fig. 2.

As expected, fc7 features result in high classification accuracy; however, the surprising salient result is the shoulder-to-shoulder performance of pool5 and fc7. Relying on this outcome, it is clear that both fc7 and pool5 representations convey useful discriminative statistics for object recognition. Comparing the performance on parameter prediction, one notices the superiority of the pool5 layer over fc7. This is consistent with the work of Bakry et al. [1], who analytically find that fully connected layers strive to collapse the low-dimensional intrinsic parameter manifolds to achieve invariant representations. However, in Bakry et al.'s work only the view manifold was taken into consideration, while here, thanks to our dataset, we can analyze the behavior of more common real-world parameters.
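A minimal sketch of the readout described above (zero-mean features, SVD to N dimensions, linear SVM, 5-fold cross-validation), assuming the pool5 or fc7 features have already been extracted into a NumPy array; the exact SVM hyperparameters are not specified in the paper, so sklearn defaults are used here.

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import TruncatedSVD
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score

def readout_accuracy(features, labels, n_components):
    """Zero-mean the features, reduce with SVD, and evaluate a linear SVM with 5-fold CV."""
    clf = make_pipeline(
        StandardScaler(with_std=False),           # zero-mean only, as described in the text
        TruncatedSVD(n_components=n_components),  # n_components must be < feature dimension
        LinearSVC(),
    )
    scores = cross_val_score(clf, features, labels, cv=5)
    return scores.mean(), scores.std()

# One readout per target, e.g. object category, lighting, rotation, or camera labels:
# acc, std = readout_accuracy(pool5_features, rotation_labels, n_components=100)
```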
Figure 2. Analysis of selectivity vs. invariance (expressive power) of the pool5 and fc7 layers of Alexnet for category and parameter prediction over a four-class problem.
Figure 3. t-SNE representation of Alexnet. The fc7 representation works remarkably well at recognizing object categories, as they are mutually linearly separable after fine-tuning. Further, the pool5 representation does not contain discriminative information compared to fc7. This figure also demonstrates the effect of fine-tuning: the distributions of samples for different categories become very compact after fine-tuning. Notice that fine-tuning does not add more discriminative power to the pool5 representation.

In brief, it is clear that the pool5 feature space contains much more knowledge than fc7 for parameter prediction. At the same time, this very representation keeps different categories highly separable from each other (i.e., it preserves the structure of the manifolds as linearly separable as possible across categories). The fc7 representation sensibly throws away the parameter information to become invariant, while keeping the categories as separable as possible. We observe that the layer just before the fully connected ones provides a better compromise between categorization and parameter estimation.

Parameter prediction accuracies for lighting, rotation, and camera view are 100%, 77%, and 62%, respectively. This demonstrates that camera view has the most complex structure for parameter prediction, whereas lighting has the simplest. This is plausible, since changing the camera view leads to geometric variations in the shape of the object and turns the prediction task into a much more difficult problem. In contrast, lighting variations do not alter the shape of the object and are thus easier to capture.

We use the t-SNE dimensionality reduction method [43] to visualize the learned representations over seven categories from our dataset along with the variation parameters (see Fig. 8). Please also see the supplement for details.
Humans are very efficient at estimating and transferring parameters of a seen object to another object under many complicated scenarios. For example, they can reliably estimate the lighting source of an object and tell whether another object has been shot under nearly the same source direction.
Figure 4. Knowledge transfer over different object categories with one parameter changing, for Alexnet trained over four classes and tested on the same classes (but different instances) and on f1car.

Complementary to our previous analysis, in this experiment we aim to assess the power of CNNs in transferring a parameter learned on one object category to another. We focus on the pool5 layer here since, as discussed, fc7 is invariant to parameters and not useful for discriminating between different values of a parameter.

All parameters are fixed except one (i.e., slicing the dataset along only one parameter). We include instances from four categories (boat, bus, tank, ufo) in the training set, and test the learned knowledge on instances from an unseen category (f1car, red bar) as well as on the four seen categories (blue bar). We utilize the pool5 representation and reduce the dimensionality to N, where N refers to the number of samples. The 5-fold cross-validation average accuracy for parameter prediction is shown in Fig. 4.

Results show a decent amount of knowledge transfer. It is observable from Fig. 4 that the lighting parameter carries the simplest knowledge to transfer to unseen categories, as it has head-to-head accuracy across seen and unseen categories. On the other hand, knowledge transfer for rotation and camera view is accompanied by a noticeable degradation in performance. To sum up, the knowledge is promisingly transferable across seen and unseen categories, while the degradation in rotation and camera prediction is intuitively justifiable, as the learned statistics for rotation and camera-view prediction depend on the 3D properties of the object shape.
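The seen/unseen transfer test can be sketched as follows, assuming pool5 feature matrices and parameter labels (e.g., the lighting index) are available for the seen categories and for f1car; the split itself, not the classifier, is the point of interest here.

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import TruncatedSVD
from sklearn.svm import LinearSVC

def transfer_accuracy(X_seen, y_seen, X_unseen, y_unseen, n_components):
    """Train a parameter readout on pool5 features of seen categories
    (boat, bus, tank, ufo) and test it on an unseen category (f1car)."""
    clf = make_pipeline(StandardScaler(with_std=False),
                        TruncatedSVD(n_components=n_components),
                        LinearSVC())
    clf.fit(X_seen, y_seen)              # y_* are parameter labels, e.g. lighting index 0-4
    return clf.score(X_unseen, y_unseen)
```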
Currently, large-scale datasets are constructed by harvesting images randomly from the web. The main reason for this is to include as much variability as possible in the dataset (mainly sampled along the intra- and inter-class variation). While reasonable, it has not been systematically studied whether this is a good strategy compared to the controlled procedures used in turntable datasets.
Figure 5. Analysis of sampling strategies over a 4-class problem (boat, bus, tank, ufo). Left: category prediction accuracy using fc7 features. Right: parameter prediction accuracy.

In this analysis, we consider two strategies to find the answer (we keep a fixed test set from our dataset and ask which sampling strategy works better on it): 1) a random strategy, in which we choose n random samples (across all parameters and instances) and train an SVM to predict the object category, and 2) a systematic (or exhaustive) strategy, in which we choose an object instance randomly and add its images to our training pool, scanning all parameters, until we reach n samples. The assumption is that a fixed budget (time or cost) for processing only n images is available.

We addressed a 4-class problem (boat, bus, tank, ufo) by increasing n from 12 up to 10,000 samples. In each experiment, n/4 samples were chosen randomly from all 4 categories across all parameters and fed into the AlexNet to get the fc7 (or pool5) representation. Then we trained a linear SVM classifier on these data. A fixed test set of size 500 was randomly selected from all categories with all parameters and was kept fixed during the analysis. We measure category prediction at fc7 and parameter prediction at pool5, reducing the dimensionality to 2500 for all values of n in the latter. Results are shown in Fig. 5.

We observe that the random sampling strategy performs better in category prediction. This makes sense, since randomly choosing images offers more instance-level variety (better than the systematic strategy), leading to better recognition. Interestingly, and counter-intuitively, we see that the random strategy works better in parameter prediction as well. We believe that parameter prediction is somewhat dependent on the 3D properties of object shape, and since in the systematic strategy the learner is not exposed to sufficiently many instances, it fails to predict parameters compared to the random strategy. Overall, what we learn is that instance-level variation is of high importance in both category and parameter prediction, and this is perhaps why the systematic sampling strategy is hindered. Thus, in dataset creation, it is vitally advantageous to have as much instance-level variation as possible.
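A rough sketch of the two sampling strategies follows, assuming `images_by_instance` maps each object instance to the list of its images (all parameter slices); this is a simplified illustration of the budgeted selection, not the exact experimental code.

```python
import random

def random_sample(images, n):
    """Random strategy: draw n images uniformly across all instances and parameters."""
    return random.sample(images, n)

def systematic_sample(images_by_instance, n):
    """Systematic strategy: pick a random instance, exhaust all of its parameter
    combinations, then move to the next instance until the budget n is reached."""
    pool, instances = [], list(images_by_instance)
    random.shuffle(instances)
    for inst in instances:
        for img in images_by_instance[inst]:  # scan all parameter slices of this instance
            pool.append(img)
            if len(pool) == n:
                return pool
    return pool
```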
Train \ Test    Without fine tuning        With fine tuning
                Natural    ourDB           Natural    ourDB
Natural         95         75              93 ↓       ↓
ourDB           78         97              70 ↓       ↑

Table 2. Domain adaptation with boat vs. tank classification (rows: training set; columns: test set).

Train \ Test                    Without fine tuning          With fine tuning
                                Natural       ourDB          Natural       ourDB
Natural [2000]                  96.48 (0.5)   55.6 (2.7)     95.56 (0.6)   68.06 (2.0)
ourDB [2000]                    66.92 (3.2)   96.90 (0.2)    65.22 (1.4)   99.72 (0.1)
ourDB [1000] + Natural [1000]   94.42 (0.8)   93.94 (0.4)    92.52 (0.2)   98.70 (0.2)

Table 3. Domain adaptation over a 4-class problem (boat, tank, bus, and train); rows are training sets, columns are test sets. Numbers in parentheses are standard deviations.
Currently, there is a gap in the literature connecting results learned on synthetic datasets with results on large-scale datasets. One way that we pursue here is training models on our dataset (source) and seeing how much knowledge those models can transfer to large-scale wild datasets (target). This way, we discover along which dimension(s) a wild dataset varies the most and whether the target dataset offers sufficient variability for learning invariance. In other words, we can somewhat indirectly measure dataset bias. Ultimately, we would like to generalize what we learn from synthetic datasets to natural large-scale scene datasets.

We consider two scenarios here: a) a binary classification problem, boat vs. tank, and b) a four-class problem including boat, tank, bus, and train. In each scenario, we train an SVM (using the fc7 representation) on either natural scenes (selected from ImageNet) or ourDB and apply it to the other dataset. We also combine images from the two datasets and measure the accuracy on each individual dataset. We consider both the off-the-shelf features of Alexnet (pre-trained over ImageNet) and features fine-tuned on our dataset.
Augmenting data along all parameters:
Here we choose images along all parameters. Results in Table 2 show that training on each type of image, as expected, works best on the same type of test image (95% from ImageNet to ImageNet and 97% from ourDB to ourDB). Cross-application results in lower accuracy, but still above the 50% chance level. We find that fine-tuning the Alexnet on our dataset boosts the performance on ourDB to 100%, at the cost of lowering the accuracy over ImageNet. Doing so lessens the other accuracies, since the CNN features become tailored (and hence selective) to our images. The reason why performance is low when applying a model trained on our dataset to ImageNet is mainly that objects in these two datasets have different textures and statistics.
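The cross-domain evaluation can be sketched as follows, assuming fc7 feature matrices and labels for the two domains are available; this is a simplified stand-in for the experiments behind Tables 2 and 3, not the exact code.

```python
from sklearn.svm import LinearSVC

def cross_domain_scores(X_ourdb, y_ourdb, X_natural, y_natural):
    """Train a linear SVM on fc7 features from one domain and test on both,
    mirroring the synthetic-to-natural transfer measurement."""
    results = {}
    for name, (X_tr, y_tr) in {"ourDB": (X_ourdb, y_ourdb),
                               "Natural": (X_natural, y_natural)}.items():
        clf = LinearSVC().fit(X_tr, y_tr)
        results[name] = {"Natural": clf.score(X_natural, y_natural),
                         "ourDB": clf.score(X_ourdb, y_ourdb)}
    return results  # results[train_domain][test_domain] = accuracy
```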
Figure 6. Confusion matrices of Alexnet over 7 classes of our dataset without (left, error = 0.076) and with (right, error = 0.001) fine tuning.
Table 3 shows the domain adaptation results over 4 classes. The results confirm what we learned over 2 classes, although accuracies are lower here. We also found that, similar to fine tuning, combining images from the two datasets hinders performance on each individual dataset due to contamination. Performances over the 2-class and 4-class problems were very high here (above 95%). To further investigate the accuracy of the Alexnet, we increase the number of classes to 7. As seen in the confusion matrices in Fig. 6, fine tuning the network increases the accuracy from 92.5% to 99.9%, with only two mistakes (please see the supplementary material for t-SNE visualizations [43] of fc7 and pool5 features with and without fine tuning).
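For reference, the overall error summarized in Fig. 6 can be computed from predicted labels as follows; this is a generic sklearn sketch, not the authors' exact evaluation code.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

CLASSES = ["boat", "bus", "f1car", "tank", "train", "ufo", "van"]

def confusion_and_error(y_true, y_pred):
    """Build the 7-class confusion matrix and the overall error rate, as in Fig. 6."""
    cm = confusion_matrix(y_true, y_pred, labels=range(len(CLASSES)))
    error = 1.0 - np.trace(cm) / cm.sum()
    return cm, error
```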
Augmenting data along single parameters:
Here, we aim to see which parameter has the greatest effect on domain adaptation (from synthetic images to natural images). We take two categories, boat and tank, as both synthetic and natural images are available for them. While keeping all other parameters fixed, we vary only one parameter to form a training set. Thus, we end up with a customized training set in which only one dominant parameter varies. Then, fc7 features are extracted for the training set, and a linear SVM classifier is trained on these samples. The same features are extracted for the natural images, and the model learned on synthetic samples is tested on them. For each parameter, we had 275 synthetic images for training and a fixed set of 3,000 images from ImageNet. To verify our findings, another experiment was designed in which all parameters were allowed to vary except a target one; 2,000 samples satisfying these constraints were randomly selected and a linear SVM was trained (using fc7). The parameter whose absence drops the accuracy the most is considered more dominant in natural images. 5-fold cross-validation accuracies are reported in Fig. 7.

As shown in Fig. 7 (bars for which a single parameter changes), camera view is of the highest importance, as it yields the highest accuracy on the fixed natural test set. That makes sense: in real-world images, objects are expected to appear at different degrees of elevation, and this is the dominant varying parameter in the wild. Rotation is the next most important parameter, as it attains the next highest accuracy on natural images.
Figure 7. Domain adaptation when a single parameter can change (left bars: the named parameter changes; right bars: the named parameter is fixed while the others, including background, change).

Surprisingly, the lighting source ranked as the least effective parameter in our analysis. The right-hand bars in Fig. 7 verify our findings: the absence of camera view drops recognition accuracy more than the other two parameters.
In this section, we analyze how the order of knowledge delivery to CNNs influences parameter prediction. To do so, we first prepare a 40K training set and a 10K validation set, annotated with rotation labels from four categories (boat, bus, tank, and train). The AlexNet baseline is fine-tuned on this training set, hoping to start from a proper initialization point for the optimization procedure. We set the learning rate for all weights to 0.001, except for those of fc8, which are set to 0.01, and leave all other parameters at their default values. Afterward, we prepare a new training set consisting of 40K images from the same four categories as in the previous setting, except that they are annotated with camera-view labels; a 10K validation set is prepared in the same way. The weights obtained from the previous step are loaded into the network and treated as a promising initialization point for another round of fine-tuning on the new data. The learning rate is again set to 0.001 for all weights except fc8, where it is set to 0.01; all other parameters are left at their default values.

Next, we evaluate the performance on camera view and rotation prediction using the pool5 layer representation. As fine-tuning with a low learning rate only slightly changes the weights within the network, we are interested to see which order of changes in the weights (before the fully connected layers) gives superior performance on our desired task. To find out, the order of the prepared datasets is reversed and delivered to the network the opposite way, i.e., first camera and then rotation. We denote these orderings as 1) rotation-camera and 2) camera-rotation for simpler reference. In the evaluation phase, 2,000 samples are randomly selected from the four categories and their pool5 features are extracted. After mean subtraction and dimensionality reduction, 5-fold cross-validation accuracies are reported (see Table 4).

Task        1 [rotation-camera]   2 [camera-rotation]
Camera      89.20% (1.47)         77.05% (1.18)
Rotation    93.75% (1.66)         95.30% (1.00)
Table 4. Influence of data delivery order on parameter prediction.
Counter-intuitively, we found that the order of data delivery matters considerably: when the network sees the samples with rotation labels prior to the camera labels, it performs better overall in parameter prediction. From the results in Table 4, we can see that when the network is first fine-tuned on rotation, the second stage, i.e., fine-tuning on camera labels, does not damage the weights for rotation prediction. In contrast, when the camera labels are seen by the network before the rotation labels, the performance of rotation prediction expectedly becomes better than in the previous ordering; however, this boost causes a dramatic degradation in camera prediction.

As we found in our previous experiments, camera view is a more ill-structured parameter to predict. When the network sees the camera labels in the second stage, the adapted weights are more biased towards learning this parameter, and the encouraging point is that this bias still largely preserves the previously acquired knowledge for rotation. Hence, we conclude that when there is the option for stage-wise training, it is better to sort the parameters according to their complexities and feed them to the network in a simple-to-complex order. This way, the last stages are devoted to managing the difficulties of the complex parameters, while imposing less damage to the weights adapted for the simpler parameters.
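A hedged sketch of this stage-wise fine-tuning in PyTorch follows; our experiments use Caffe, so this is only an illustration of the procedure, and `rotation_loader`, `camera_loader`, and the epoch count are hypothetical placeholders.

```python
import torch
import torch.nn as nn
import torchvision.models as models

def staged_finetune(model, loaders_in_order, num_labels_per_stage, epochs=5):
    """Fine-tune on one parameter's labels, then reuse the weights as the
    starting point for the next parameter (e.g., rotation first, then camera)."""
    for loader, num_labels in zip(loaders_in_order, num_labels_per_stage):
        # Fresh label layer for this stage (fc8 analogue in torchvision's AlexNet).
        model.classifier[6] = nn.Linear(4096, num_labels)
        # Lower learning rate for pretrained weights, higher for the new label layer.
        base_params = [p for name, p in model.named_parameters()
                       if "classifier.6" not in name]
        optimizer = torch.optim.SGD([
            {"params": base_params, "lr": 1e-3},
            {"params": model.classifier[6].parameters(), "lr": 1e-2},
        ], momentum=0.9)
        criterion = nn.CrossEntropyLoss()
        model.train()
        for _ in range(epochs):
            for images, labels in loader:
                optimizer.zero_grad()
                criterion(model(images), labels).backward()
                optimizer.step()
    return model

# e.g., the rotation-then-camera ordering:
# model = models.alexnet(weights="IMAGENET1K_V1")
# model = staged_finetune(model, [rotation_loader, camera_loader], [8, 11])
```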
5. Discussion and conclusion
We challenged the use of uncontrolled natural images for guiding object recognition progress and introduced a large-scale controlled object dataset of over 20M images with a rich variety of parameters that can be useful to the field. By choosing slices through our dataset, we were able to systematically study the invariance properties and generalization power of CNNs by independently varying the choice of object instances, viewpoints, lighting conditions, or backgrounds between training and test sets. Progressively extending these results to increasingly larger subsets of our dataset may help gain new insights into how the algorithms can be modified to show greater invariance and generalization capabilities. In what follows, we summarize the lessons we learned from our empirical investigation of the Alexnet baseline on synthetic and natural images.

i) The representation learned in the pool5 layer is selective to parameters (it is possible to read out parameters), while the fc7 layer is not. Both of these layers contain object category information (fc7 is more selective). It would be interesting to explore how the selectivity of fc7 to both objects and parameters can be increased simultaneously.

ii) The knowledge obtained for some parameters is easier to transfer to unseen object categories. In particular, we saw that illumination carries the simplest knowledge, whereas the rotation and camera-view parameters are more difficult. We also found that the 3D properties of the object shape play a critical role in knowledge transfer: the higher the variability in shape, the less knowledge transfers to unseen categories.

iii) The results of our sampling strategy analysis revealed the importance of instance-level variety compared to parameter-level variety. In particular, we found that a random sampling strategy leads to better generalization, since more instance-level variation can be included in the dataset.

iv) The results of data augmentation show that simply combining instances from the two datasets does not improve accuracy, mainly because objects in these two datasets have different textures and statistics. However, we found that there is generalization from one dataset to the other, as cross-application of one dataset to the other results in above-chance accuracy. It would be interesting to learn functions for domain adaptation from our images to natural real-world scenes such as those in the ImageNet dataset.

v) A large-scale synthetic object database, such as the one presented here, could be used as a diagnostic tool to infer along which dimensions a large-scale wild dataset varies the most, and how much information wild datasets offer regarding invariance to parameters.

vi) Last but not least, we found that when there is the option to perform stage-wise training, it is advantageous to feed the network with data that has been sorted according to the complexities of the different dimensions. This may lead us to train CNNs layer-wise to learn different invariances in different layers.

Currently, deep learning models sacrifice invariance in favor of higher object category prediction accuracy. It would be best if we could achieve both at the same time (as may be needed in some applications). It might be possible to organize the feature manifolds in the early layers in such a way as to preserve information about object parameters as well as category information. Two ways to explore this include feature embedding through loss regularization, or adding camera parameters to the categorization loss; the idea is that knowing the camera parameters may help object categorization. A recent study [8] has investigated this idea by proposing a convolutional network for joint prediction of object category and pose information.

In summary, we answered some questions regarding CNNs and datasets, and discussed future large-scale applications of our dataset, which is freely shared and available.
Acknowledgements:
We wish to thank NVIDIA for their generous donation of the GPUs used in this study.

References

[1] A. Bakry, M. Elhoseiny, T. El-Gaaly, and A. Elgammal. Digging deep into the layers of CNNs: In search of how CNNs achieve view invariance. arXiv preprint arXiv:1508.01983, 2015.
[2] J. Bromley, J. W. Bentz, L. Bottou, I. Guyon, Y. LeCun, C. Moore, E. Säckinger, and R. Shah. Signature verification using a Siamese time delay neural network. International Journal of Pattern Recognition and Artificial Intelligence, 7(04):669–688, 1993.
[3] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. Semantic image segmentation with deep convolutional nets and fully connected CRFs. arXiv preprint arXiv:1412.7062, 2014.
[4] M. Cimpoi, S. Maji, and A. Vedaldi. Deep filter banks for texture recognition and segmentation. In CVPR, pages 3828–3836, 2015.
[5] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, pages 248–255. IEEE, 2009.
[6] J. J. DiCarlo, D. Zoccolan, and N. C. Rust. How does the brain solve visual object recognition? Neuron, 73(3):415–434, 2012.
[7] B. A. Draper, U. Ahlrichs, and D. Paulus. Adapting object recognition across domains: A demonstration. In Computer Vision Systems, pages 256–267. Springer, 2001.
[8] M. Elhoseiny, T. El-Gaaly, A. Bakry, and A. Elgammal. Convolutional models for joint object categorization and pose estimation. arXiv preprint arXiv:1511.05175, 2015.
[9] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman. The PASCAL visual object classes (VOC) challenge. International Journal of Computer Vision, 88(2):303–338, 2010.
[10] B. Fernando, A. Habrard, M. Sebban, and T. Tuytelaars. Unsupervised visual domain adaptation using subspace alignment. In ICCV, pages 2960–2967. IEEE, 2013.
[11] K. Fukushima. Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position. Biological Cybernetics, 36(4):193–202, 1980.
[12] J. M. Geusebroek, G. J. Burghouts, and A. W. M. Smeulders. The Amsterdam library of object images. IJCV, 61(1):103–112, 2005.
[13] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR, pages 580–587. IEEE, 2014.
[14] R. Gopalan, R. Li, and R. Chellappa. Domain adaptation for object recognition: An unsupervised approach. In ICCV, pages 999–1006. IEEE, 2011.
[15] G. Griffin, A. Holub, and P. Perona. The Caltech-256. Caltech Technical Report.
[16] K. He, X. Zhang, S. Ren, and J. Sun. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. arXiv preprint arXiv:1502.01852, 2015.
[17] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. In Proceedings of the ACM International Conference on Multimedia, pages 675–678. ACM, 2014.
[18] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei. Large-scale video classification with convolutional neural networks. In CVPR, pages 1725–1732. IEEE, 2014.
[19] D. Koubaroulis, J. Matas, and J. Kittler. Evaluating colour object recognition algorithms using the SOIL-47 database. In ACCV, 2002.
[20] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.
[21] S. S. Kruthiventi, K. Ayush, and R. V. Babu. DeepFix: A fully convolutional neural network for predicting human eye fixations. arXiv preprint arXiv:1510.02927, 2015.
[22] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
[23] Y. LeCun, F. Huang, and L. Bottou. Learning methods for generic object recognition with invariance to pose and lighting. 2004.
[24] T. Lin, A. RoyChowdhury, and S. Maji. Bilinear CNN models for fine-grained visual recognition. ICCV, 2015.
[25] S. A. Nene, S. K. Nayar, and H. Murase. Columbia object image library (COIL-100). Technical Report CUCS-006-96, 1996.
[26] A. Opelt, M. Fussenegger, A. Pinz, and P. Auer. Generic object recognition with boosting. Technical Report TR-EMT-2004-01, EMT, TU Graz, Austria, 2004.
[27] X. Peng, B. Sun, K. Ali, and K. Saenko. Exploring invariances in deep convolutional neural networks using synthetic images. arXiv preprint arXiv:1412.7122, 2014.
[28] P. J. Phillips, H. Moon, P. J. Rauss, and S. Rizvi. The FERET evaluation methodology for face recognition algorithms. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22:1090–1104, 2000.
[29] N. Pinto, Y. Barhomi, D. Cox, and J. J. DiCarlo. Comparing state-of-the-art visual features on invariant object recognition tasks. In IEEE Workshop on Applications of Computer Vision, 2011.
[30] M. Riesenhuber and T. Poggio. Hierarchical models of object recognition in cortex. Nature Neuroscience, 2:1019–1025, 1999.
[31] B. C. Russell, A. Torralba, K. P. Murphy, and W. T. Freeman. LabelMe: A database and web-based tool for image annotation. International Journal of Computer Vision, 77(1-3):157–173, 2008.
[32] K. Saenko, B. Kulis, M. Fritz, and T. Darrell. Adapting visual category models to new domains. In ECCV, pages 213–226. Springer, 2010.
[33] P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. LeCun. OverFeat: Integrated recognition, localization and detection using convolutional networks. arXiv preprint arXiv:1312.6229, 2013.
[34] T. Serre, A. Oliva, and T. Poggio. A feedforward architecture accounts for rapid categorization. Proceedings of the National Academy of Sciences, 104(15):6424–6429, 2007.
[35] T. Serre, L. Wolf, S. Bileschi, M. Riesenhuber, and T. Poggio. Object recognition with cortex-like mechanisms. IEEE PAMI, 2007.
[36] K. Simonyan and A. Zisserman. Two-stream convolutional networks for action recognition in videos. In Advances in Neural Information Processing Systems, pages 568–576, 2014.
[37] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556, 2014.
[38] A. Singh, J. Sha, K. S. Narayan, T. Achim, and P. Abbeel. BigBIRD: A large-scale 3D database of object instances. In ICRA, pages 509–516. IEEE, 2014.
[39] H. Su, S. Maji, E. Kalogerakis, and E. G. Learned-Miller. Multi-view convolutional neural networks for 3D shape recognition. ICCV, 2015.
[40] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. arXiv preprint arXiv:1409.4842, 2014.
[41] V. S. Tomar and R. C. Rose. Manifold regularized deep neural networks. 2014.
[42] A. Torralba, R. Fergus, and W. T. Freeman. 80 million tiny images: A large dataset for non-parametric object and scene recognition. IEEE PAMI, 30(11):1958–1970, 2008.
[43] L. van der Maaten and G. Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9:2579–2605, 2008.
[44] J. Weston, F. Ratle, H. Mobahi, and R. Collobert. Deep learning via semi-supervised embedding. In Neural Networks: Tricks of the Trade, pages 639–655. Springer, 2012.
[45] J. Xiao, J. Hays, K. Ehinger, A. Oliva, A. Torralba, et al. SUN database: Large-scale scene recognition from abbey to zoo. In CVPR, pages 3485–3492. IEEE, 2010.
[46] D. L. Yamins, H. Hong, C. F. Cadieu, E. A. Solomon, D. Seibert, and J. J. DiCarlo. Performance-optimized hierarchical models predict neural responses in higher visual cortex. Proceedings of the National Academy of Sciences, 111(23):8619–8624, 2014.
[47] S. Yang, P. Luo, C. C. Loy, and X. Tang. From facial parts responses to face detection: A deep learning approach. ICCV, 2015.
[48] Y. Yuan, L. Mou, and X. Lu. Scene recognition by manifold regularized deep learning architecture. 2015.

Appendix
We use the t-SNE dimensionality reduction method [43] to visualize the learned representations.
Experiment I: category prediction
In this experiment, we randomly select 2K samples from 7 categories (boat, bus, f1car, tank, train, ufo, and van) and feed them to a pre-trained CNN model, specifically Alexnet. Having the fc7 and pool5 representations of the selected samples ready, we use the t-SNE algorithm to reduce their dimensionality to 2D. In addition, 20K images are randomly selected from all 7 categories and the network is fine-tuned on this data for object categorization. The same procedure is carried out on the fine-tuned (FT) network. Fig. 8 depicts the results.

Our results in Fig. 8 show that the fc7 representation works remarkably well at recognizing object-level categories, as they are mutually linearly separable after fine-tuning the network. Furthermore, the pool5 representation does not contain discriminative information between object categories compared to fc7. This result is in alignment with Bakry et al. [1]. Fig. 8 also demonstrates the effect of fine-tuning on the feature spaces: the distributions of samples for different categories tend to become very compact and concentrated after fine-tuning. Notice that fine-tuning does not add more discriminative power to the pool5 representation.
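The visualization procedure can be sketched as follows, assuming the fc7 or pool5 features and labels are available as NumPy arrays; the plotting details are our choice, not taken from the paper.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_tsne(features, labels, title):
    """Project fc7 or pool5 features to 2D with t-SNE and colour points by label."""
    labels = np.asarray(labels)
    emb = TSNE(n_components=2, init="pca", random_state=0).fit_transform(features)
    for lab in np.unique(labels):
        pts = emb[labels == lab]
        plt.scatter(pts[:, 0], pts[:, 1], s=5, label=str(lab))
    plt.title(title)
    plt.legend(markerscale=3, fontsize=6)
    plt.show()

# plot_tsne(fc7_features, category_labels, "fc7 (7 categories)")
```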
Experiment II: rotation prediction
This experiment aims to highlight the power of the pool5 layer in representing image variations and discriminating among them. As discussed in the main paper, our analyses show that the pool5 representation gives superior performance for parameter prediction. To confirm this statement, we select 200 samples from the boat category (instance number 01) while the rotation, camera, and lighting parameters are changing. We then label the samples with their rotation values and feed them to the pre-trained Alexnet model. The dimensionality of the fc7 and pool5 representations is reduced to 2D using t-SNE. The same procedure is carried out using the fine-tuned network to obtain the fc7 and pool5 representations. Results are illustrated in Fig. 9.

It can be seen that the fc7 representation is not (fully) capable of discriminating the rotation values, both with and without fine-tuning. The representation of the pool5 layer, in contrast, confirms our finding that pool5 contains information selective to parameters: samples from the 8 different rotation values are perfectly and mutually linearly separable from each other. Fine-tuning tries to improve this discriminability through some sort of transformation.
Experiment III: camera prediction
With our success in visualizing the power of the pool5 layer in capturing rotation variations, in this experiment we aim to see whether the same judgment is valid for camera prediction. As in the previous experiment, we select 200 samples from the boat category (instance number 01) and label them according to their camera parameter value. The 2D feature spaces derived from the fc7 and pool5 representations using the pre-trained and fine-tuned Alexnet are depicted in Fig. 10.
Figure 8. t-SNE representation for category prediction using fc7 and pool5 layers with and without fine-tuning.
Figure 9. t-SNE representation for rotation prediction using fc7 and pool5 layers with and without fine-tuning.

As before, the fc7 representation does not offer useful information for separating samples with different camera parameters, in both the pre-trained and fine-tuned cases. We observe quite the opposite for the pool5 layer representation: without fine-tuning the network, we can observe 8 clusters in Fig. 10 (see the upper-right panel), each corresponding to one rotation. Within each rotation angle, the representation is surprisingly capable of discriminating the different values of the camera parameter into five classes (we only use five values of the camera parameter here).
Experiment IV: lighting prediction
Scrutinizing the behavior of the fc7 and pool5 layers is also interesting for lighting prediction. We therefore follow the previous experiments, except that here the samples are labeled according to the lighting parameter values. Fig. 11 shows the results for the four different cases.

Setting aside the poor representation of the fc7 layer, the pool5 layer again generates a reasonable representation which is able to discriminate between different lighting conditions. Eight clusters are observable, each corresponding to one rotation angle. Within each cluster, samples with different lighting parameters are discriminable, which again supports our previous statement regarding the capability of the pool5 layer for parameter prediction.
Experiment V: instance prediction
In the last experiment, we aim to inspect the capacity of the fc7 and pool5 layers of CNNs for instance prediction. We randomly choose 2K samples from the boat category. The samples are passed through the network up to the pool5 and fc7 layers. The obtained representations are visualized after dimensionality reduction using t-SNE. The same procedure is repeated with the fine-tuned network. Fig. 12 shows the results.

The fc7 representation, without fine-tuning, is remarkably capable of separating samples from different instances. Fine-tuning the network dramatically boosts this discrimination power by making the clusters more compact. A representation is invariant to varying parameters if it ignores variations and treats samples with different parameters equally, i.e., it makes the representations of similar samples as close as possible in the feature space. This is exactly what we see in the representation space provided by fc7.

Despite its reasonable parameter separability, the pool5 layer does not force different instances to be clustered. This is where the difference between the pool5 and fc7 layers can be seen in practice. This result indicates that the fc7 layer seeks to produce invariant representations (by collapsing manifolds), while the pool5 layer tries to preserve the manifolds as much as possible.
Figure 10. t-SNE representation for camera prediction using fc7 and pool5 layers with and without fine-tuning.
Figure 11. t-SNE representation for lighting prediction using fc7 and pool5 layers with and without fine-tuning.

Figure 12. t-SNE representation for instance prediction (boat category) using fc7 and pool5 layers with and without fine-tuning.