Beam Search for Learning a Deep Convolutional Neural Network of 3D Shapes
Xu Xu and Sinisa Todorovic
School of Electrical Engineering and Computer Science, Oregon State University, Corvallis, Oregon 97330
Email: [email protected], [email protected]
Abstract—This paper addresses 3D shape recognition. Recent work typically represents a 3D shape as a set of binary variables corresponding to 3D voxels of a uniform 3D grid centered on the shape, and resorts to deep convolutional neural networks (CNNs) for modeling these binary variables. Robust learning of such CNNs is currently limited by the small datasets of 3D shapes available – an order of magnitude smaller than other common datasets in computer vision. Related work typically deals with the small training datasets using a number of ad hoc, hand-tuning strategies. To address this issue, we formulate CNN learning as a beam search aimed at identifying an optimal CNN architecture – namely, the number of layers, nodes, and their connectivity in the network – as well as estimating parameters of such an optimal CNN. Each state of the beam search corresponds to a candidate CNN. Two types of actions are defined to add new convolutional filters or new convolutional layers to a parent CNN, and thus transition to children states. The utility function of each action is efficiently computed by transferring parameter values of the parent CNN to its children, thereby enabling an efficient beam search. Our experimental evaluation on the 3D ModelNet dataset demonstrates that our model pursuit using the beam search yields a CNN whose performance on 3D shape classification is superior to the state of the art.
I. INTRODUCTION
This paper addresses the problem of 3D shape classification. Our goal is to predict the object class of a given 3D shape, represented as a set of binary presence-absence indicators associated with 3D voxels of a uniform 3D grid centered on the shape. This is one of the basic problems in computer vision, as 3D shapes are important visual cues for image understanding. It is a challenging problem, because an object's shape may vary significantly due to changes in the object's pose and articulation, and may appear quite similar to shapes of other objects.

There is a host of literature on reasoning about 3D object shapes [1], [2]. Traditional approaches typically extract feature points from a 3D shape [3]–[5], and find correspondences between these feature points for shape recognition and retrieval [6]–[8]. However, these methods tend to be sensitive to long-range non-rigid and non-isometric shape deformations, because, in part, the feature points capture only local shape properties. In addition, finding optimal feature correspondences is often formulated as an NP-hard non-convex Quadratic Assignment Problem (QAP), whose efficient solutions come at the price of compromised accuracy.

Recently, the state-of-the-art performance in 3D shape classification has been achieved using deep 3D Convolutional Neural Networks (CNNs), called 3D ShapeNets [9]. 3D ShapeNets extends the well-known deep architecture called AlexNet [10], which is widely used in image classification; specifically, the extension replaces the 2D convolutions computed in AlexNet with 3D convolutions. However, the 3D ShapeNets architecture consists of 3 convolutional layers and 2 fully-connected layers, with a total of 12M parameters. This in turn means that robust learning of the 3D ShapeNets parameters requires large training datasets. But in comparison with common datasets used in other domains, currently available 3D shape datasets are smaller by at least an order of magnitude. For example, the benchmark SHREC'14 dataset [11] has only 8K shapes, and the ModelNet dataset [9] has 150K shapes, whereas the well-known ImageNet [12] has 1.5M images.

Faced with small training datasets, existing deep-learning approaches to 3D shape classification typically resort to a number of ad hoc, hand-tuning strategies for robust learning. Rarely are these justified by extensive empirical evaluation, as it would take a prohibitively long time, but rather by past findings of other related work (e.g., in image classification). Hence, their particular design choices about the architecture and learning – e.g., the number of convolutional and fully-connected layers used, or the specification of the learning rate in backpropagation – may be suboptimal.

Motivated by the state-of-the-art performance of 3D ShapeNets [9], we here adopt this framework, and focus on addressing the aforementioned issues in a principled manner. Specifically, we formulate a model pursuit for robust learning of a 3D CNN. This learning is specified as a beam search aimed at identifying an optimal CNN architecture – namely, the number of layers, nodes, and their connectivity in the network – as well as estimating parameters of such an optimal CNN. Each state of the beam search corresponds to a candidate CNN. Two types of actions are defined to add either new convolutional filters or a new convolutional layer to a parent CNN, and thus transition to children states. The utility function of each action is efficiently estimated as the training accuracy of the resulting CNN.
The efficiency is achieved by transferring parameter values of the parent CNN to its children. Starting from the root "shallow and narrow" CNN, our beam search is guided by the utility function toward generating more complex CNN candidates with increasingly larger classification accuracy on 3D shape training data, until the training accuracy stops increasing. The CNN candidate with the highest training accuracy is finally taken as our 3D shape model, and used in testing.

In our experimental evaluation on the 3D ModelNet dataset [9], our beam search yields a 3D CNN with about 150 times fewer parameters than 3D ShapeNets [9]. The results demonstrate that our 3D CNN outperforms 3D ShapeNets by 3% on 40 shape classes. This suggests that a model pursuit using beam search is a viable alternative to the currently heuristic practice in designing deep CNNs.

In the following, Sec. II gives an overview of our 3D shape classification using a 3D CNN; Sec. III formulates our beam search in terms of the state-space, successor function, heuristic function, and lookahead and backtracking strategy; Sec. IV specifies our efficient transfer of parameters from a parent model to its children in the beam search; and Sec. V presents our results.

II. 3D SHAPE CLASSIFICATION USING 3D CNN
For 3D shape classification we use a 3D CNN. Given a binary volumetric representation of a 3D shape as input, our CNN predicts the object class of the shape. Below, we first describe the shape representation, and then explain the architecture of our 3D CNN.

In this paper, we adopt the binary volumetric representation of 3D shapes presented in [9]. Specifically, each shape is represented as a set of binary indicators corresponding to 3D voxels of a uniform 3D grid centered on the shape. The indicators take value 1 if the corresponding 3D voxels are occupied by the 3D shape, and 0 otherwise. Hence, each 3D shape is represented by a binary three-dimensional tensor. The grid size is set to 30 × 30 × 30 voxels. The shape size is normalized such that a cube of 24 × 24 × 24 voxels fully contains the shape, and the remaining empty voxels serve for padding in all directions around the shape. Each shape is also labeled with a corresponding object class. A sketch of this voxelization is given at the end of this section.

As mentioned in Sec. I, the architecture of our 3D CNN is similar to that of 3D ShapeNets [9], with the important distinction that we greatly reduce the total number of parameters. As we will discuss in greater detail in the results section, the beam search that we use for the model pursuit yields a 3D CNN with 3 convolutional layers and 1 fully connected layer, totaling 80K parameters. The top layer of our model is the standard soft-max layer for classifying the input shapes into one of C possible object classes.

In the following section, we specify our model pursuit and the initial root CNN from which the beam search starts exploring candidate, more complex models, until the training error cannot be further reduced.
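To make the representation concrete, below is a minimal voxelization sketch under stated assumptions: the shape is given as a 3D point sample, and the grid sizes are those given above. The function and argument names are ours, not from [9].

```python
import numpy as np

def voxelize(points, grid_size=30, shape_size=24):
    """Map a 3D point sample of a shape to a binary occupancy grid (a sketch
    of the representation in [9]; this is an illustration, not their code)."""
    pts = np.asarray(points, dtype=float)
    # Normalize the shape so it fits a shape_size^3 cube, preserving aspect ratio.
    mins = pts.min(axis=0)
    extent = (pts.max(axis=0) - mins).max()
    pts = (pts - mins) * (shape_size - 1) / extent
    # Center the occupied cube in the grid; the surrounding empty voxels
    # act as padding in all directions around the shape.
    pad = (grid_size - shape_size) // 2
    idx = np.floor(pts).astype(int) + pad
    vol = np.zeros((grid_size,) * 3, dtype=np.uint8)
    vol[idx[:, 0], idx[:, 1], idx[:, 2]] = 1   # 1 = voxel occupied by the shape
    return vol
```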
III. BEAM SEARCH

Search-based approaches have a long track record of successfully solving computer vision problems, including structured prediction for scene labeling [13], [14], object localization [15], and boundary detection [16]. Unlike the above related work, search in this paper is not used for inference, but for identifying an optimal CNN architecture and estimating CNN parameters. For efficiency of learning, we consider a beam search which limits the exploration of the state space to a few top candidates. Our beam search is defined by the following:
• States correspond to CNN candidates;
• The initial state represents a small CNN;
• The successor function generates new states based on actions taken in parent states;
• The heuristic function evaluates the utility of the actions, and thus guides the beam search;
• A lookahead and backtracking strategy.
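To make these components concrete, the following is a minimal sketch of the overall search loop. The helper names (train_initial_cnn, successors, train_accuracy) are placeholders for the procedures specified in the remainder of this section, not the authors' code.

```python
def beam_search(train_data, K=3):
    best = train_initial_cnn(train_data)     # initial state: the small CNN of Fig. 1
    beam = [best]
    while beam:
        # Successor function: expand every state in the beam with both actions.
        children = [c for s in beam for c in successors(s, train_data)]
        # Heuristic: rank children by training accuracy and keep the top K
        # that still improve on the best model found so far.
        children.sort(key=lambda c: train_accuracy(c, train_data), reverse=True)
        beam = [c for c in children[:K]
                if train_accuracy(c, train_data) > train_accuracy(best, train_data)]
        if beam:
            best = beam[0]
    return best   # the candidate with the highest training accuracy
```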
Fig. 1. Architecture of our initial CNN model.
State-space: The state-space is defined as Ω = {s}, where state s represents a network configuration (also called an architecture). A CNN's network configuration specifies the number of convolutional and fully-connected layers, the number of hidden units or 3D convolutional filters used in each layer, and which layers have max-pooling. In this paper, we constrain the beam search such that the size of the fully-connected layer remains the same as in the initial CNN, because we have empirically found that only extending the convolutional layers maximally increases the network's classification accuracy (as also reported in [17], [18]).
Initial State: Our model pursuit starts from a relatively simple initial CNN, illustrated in Figure 1, and the goal of the beam search is to extend the initial model by adding either new convolutional filters to existing layers, or new convolutional layers. The initial model consists of only two convolutional layers and one fully-connected layer. The first convolutional layer has 16 filters of size 6 and stride 2. The second convolutional layer has 32 filters of size 5 and stride 2. Finally, the fully-connected layer has 400 hidden units.

The parameters of the initial CNN are trained as follows. We first generatively pre-train the model in a layer-wise fashion, and then use a discriminative fine-tuning procedure. The standard Contrastive Divergence [19] is used to pre-train the two convolutional layers, whereas the top fully-connected layer is trained using Fast Persistent Contrastive Divergence [20]. Once one layer is learned, its weights are fixed and the hidden activations are fed into the next layer as input. After this pre-training, we continue to discriminatively fine-tune the pre-trained model. We first replace the topmost layer with a new, randomly initialized fully-connected layer, and then add a standard softmax layer on top of the network to output class probabilities. The standard cross-entropy loss is computed using ground-truth class labels, and used in backpropagation to update the weights in all layers.

Given this simple initial CNN, the beam search gradually builds a search tree with new states s. Exploration of the state space consists of generating successor states from a few selected parent states. The selection is based on ranking the parent states by a heuristic function, as further explained below.
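For concreteness, here is a minimal sketch of this initial architecture written in PyTorch; this is our illustration only, since the authors' implementation is in MATLAB and the model is pre-trained generatively rather than trained end-to-end. With 30 × 30 × 30 binary input volumes, the spatial sizes work out to 30 → 13 → 5, so the flattened feature size is 32 · 5³ = 4000.

```python
import torch.nn as nn

def initial_cnn(num_classes):
    return nn.Sequential(
        nn.Conv3d(1, 16, kernel_size=6, stride=2), nn.Sigmoid(),   # conv1: 16 filters, size 6, stride 2
        nn.Conv3d(16, 32, kernel_size=5, stride=2), nn.Sigmoid(),  # conv2: 32 filters, size 5, stride 2
        nn.Flatten(),
        nn.Linear(32 * 5 ** 3, 400), nn.Sigmoid(),                 # fully-connected: 400 hidden units
        nn.Linear(400, num_classes),                               # class scores; softmax applied in the loss
    )
```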
Successor function: Γ : s → s′ generates new states s′ from s by applying an action a ∈ A from a set of possible actions A. In this paper, we specify A as consisting of two types of actions: 1) add a new convolutional layer on top of all existing convolutional layers, where the newly added layer has the same number of filters, filter size, and stride as the top convolutional layer; 2) double the number of filters in the top convolutional layer. Other alternative definitions of A are also possible. In particular, we have also considered an action which adds max-pooling to convolutional layers; however, such an extended A has not produced better performance on test data, relative to the above case when only two types of actions are considered.

As one of our technical contributions, we specify an efficient successor function for enabling an efficient beam search. Specifically, we apply a knowledge transfer procedure, following the approach of [21], which efficiently copies parameter values of the previous state s to the values of the newly added parameters in the generated state s′. After this knowledge transfer, the new CNN s′ is fine-tuned using only a few iterations (in our experiments, 10), for robustness. Note that a significantly larger number of iterations would have been necessary for this fine-tuning, had we randomly initialized the newly added parameters of s′ (as is common in practice), instead of using knowledge transfer. In this way, we achieve efficiency. In the following section, we explain our knowledge transfer procedure in more detail.

Heuristic function: H(s, s′) ranks new states s′ given their parent states s. H(s, s′) is used to guide the beam search, which selects the top K successor states, where K is the beam width. H(s, s′) is defined as the difference in classification accuracy on training data between s′ and its parent s.
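A minimal sketch of the successor function and heuristic follows, assuming hypothetical helpers net2deeper, net2wider, finetune, and train_accuracy that stand for the procedures of Sec. IV and standard training code.

```python
def successors(s, train_data, iters=10):
    """Γ: generate children of state s by the two action types, with
    knowledge transfer (Sec. IV) followed by a few fine-tuning iterations."""
    children = []
    for action in (net2deeper, net2wider):   # the two action types in A
        child = action(s)                    # child starts from the parent's parameters
        finetune(child, train_data, iters)   # cheap, thanks to the warm start
        children.append(child)
    return children

def heuristic(s, s_child, train_data):
    """H(s, s'): gain in training accuracy of the child over its parent."""
    return train_accuracy(s_child, train_data) - train_accuracy(s, train_data)
```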
Lookahead and backtracking strategy: For robustness, we specify a lookahead and backtracking strategy for selecting the top K successor states. We first explore the state space by applying the successor function several times from the parent states s, until the resulting search tree reaches a depth limit D. Then, among the leaf states s′ at tree depth D, we select the top K leaves as evaluated with H(s, s′). From these top K leaf states, we backtrack to the unique K children of the parent states s, which are taken as the valid new candidate CNNs.
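The following sketch illustrates one way to implement this strategy; it is our reconstruction, with hypothetical helper names, of the depth-D expansion, top-K leaf selection, and backtracking to the unique depth-1 ancestors described above.

```python
def lookahead_select(parents, train_data, D, K):
    # Expand the tree to depth D, remembering each leaf's depth-1 ancestor.
    frontier = [(None, p) for p in parents]
    for depth in range(D):
        frontier = [(c if depth == 0 else anc, c)
                    for anc, s in frontier
                    for c in successors(s, train_data)]
    # Rank the depth-D leaves by the heuristic (training-accuracy gain).
    frontier.sort(key=lambda t: train_accuracy(t[1], train_data), reverse=True)
    # Backtrack: the unique depth-1 children on paths to the top-K leaves
    # become the new candidate CNNs.
    chosen = []
    for anc, _leaf in frontier[:K]:
        if anc not in chosen:
            chosen.append(anc)
    return chosen
```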
IV. KNOWLEDGE TRANSFER

When generating new candidate CNNs, we make our beam search efficient by appropriately transferring parameter values from parent CNNs to their descendants. In the sequel, we specify this knowledge transfer for the two types of search actions considered in this paper.
A. Net2WiderNet
A new state can be generated by doubling the number of filters in the top convolutional layer of a parent CNN. This action effectively renders the new candidate CNN "wider" than its parent model, and hence we call this action Net2WiderNet.

We estimate the parameters of the "wider" CNN as follows. The key idea is to estimate the newly added parameters such that the parent CNN and its "wider" child CNN give the same outputs for the same inputs. This knowledge-transfer strategy ensures that the newly generated model is not worse than the previously considered model. After this knowledge transfer, the parameters of the "wider" child CNN s′ are fine-tuned to verify whether the action resulted in a better model than the parent CNN s, as evaluated with the heuristic function H(s, s′).

In order to widen a convolutional layer i, we need to update both sets of model parameters W^(i) ∈ R^(m×n) and W^(i+1) ∈ R^(n×p) at layers i and i+1, respectively, where layer i has m inputs and n outputs, and layer i+1 has p outputs. When the action Net2WiderNet extends layer i so that it has q > n outputs, we define the random mapping function g as

    g(j) = { j,                                  0 < j ≤ n
           { random sample from {1, 2, ..., n},  n < j ≤ q     (1)

Then, the new sets of parameters U^(i) and U^(i+1) can be computed from W^(i) and W^(i+1) as

    U^(i)_{k,j} = W^(i)_{k,g(j)},                                   (2)

    U^(i+1)_{j,h} = (1 / |{x | g(x) = g(j)}|) W^(i+1)_{g(j),h},     (3)

where k = 1, ..., m, j = 1, ..., q, and h = 1, ..., p.

From (2), the first n columns of W^(i) are simply copied directly into U^(i). Columns n+1 through q of U^(i) are created by randomly choosing columns of W^(i), as defined by g. The random selection is performed with replacement, so each column of W^(i) may be copied many times to columns n+1 through q of U^(i).

From (3), we similarly have that the first n rows of W^(i+1) are copied directly into U^(i+1), and rows n+1 through q of U^(i+1) are created by randomly choosing rows of W^(i+1), as defined by g. In addition, the new parameters in U^(i+1) are normalized so as to account for the random replication of rows in U^(i+1). The normalization is computed by dividing the new parameters by the replication factor, given by |{x | g(x) = g(j)}|.

It is straightforward to prove that the resulting extended network with new parameters U^(i) and U^(i+1) produces the same outputs as the original network with parameters W^(i) and W^(i+1), for the same inputs.

An example of this procedure is illustrated in Figure 2. In this example, we increase the size of hidden layer h^(i) by adding one additional unit, while keeping the activations propagated to hidden layer h^(i+1) unchanged. Assume that we randomly pick hidden unit h^(i)_2 to replicate; then we copy its incoming weights W^(i)_{1,2} and W^(i)_{2,2} to the new unit h^(i)_3. The weight W^(i+1)_{2,1}, going out of h^(i)_2, must be copied to also go out of h^(i)_3. This outgoing weight must also be divided by 2 to compensate for the replication of h^(i)_2.

Fig. 2. An example of the Net2WiderNet action.
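A minimal sketch of Eqs. (1)–(3) for fully-connected weight matrices follows (the convolutional case is analogous along the filter dimension); indices are 0-based here, unlike the 1-based notation above, and the function name is ours.

```python
import numpy as np

def net2wider(W_i, W_next, q):
    """Widen layer i from n to q > n outputs, preserving the network's
    outputs exactly (a sketch of Eqs. (1)-(3), following [21])."""
    m, n = W_i.shape                      # layer i: m inputs, n outputs
    assert q > n
    # Eq. (1): identity on the first n units, then random replication
    # (with replacement) for the q - n new units.
    g = np.concatenate([np.arange(n), np.random.randint(0, n, size=q - n)])
    # Eq. (2): copy (and replicate) columns of W_i into U_i.
    U_i = W_i[:, g]
    # Eq. (3): copy rows of W_next, divided by the replication factor
    # |{x : g(x) = g(j)}| so the widened network computes the same function.
    counts = np.bincount(g, minlength=n)
    U_next = W_next[g, :] / counts[g][:, None]
    return U_i, U_next
```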
B. Net2DeeperNet

The second type of action that we consider is termed Net2DeeperNet, since it adds a new convolutional layer to a parent CNN, thereby producing a deeper child CNN. Specifically, Net2DeeperNet replaces a layer h^(i) = φ(W^(i)⊤ h^(i−1)) with two layers h^(i) = φ(U^(i)⊤ φ(W^(i)⊤ h^(i−1))), where φ denotes the activation function. The new parameter matrix U^(i) is specified as the identity matrix.

Figure 3 shows an illustration of Net2DeeperNet. When we apply this action, we add a new convolutional layer and simply set the new convolution filters to be identity functions. Zero padding is also added to keep the size of the activations unchanged.
It is worth noting that Net2DeeperNet does not guarantee that the resulting deeper network gives the same outputs as the original one, for the same inputs, when the activation function used is the sigmoid. The guarantee does hold when the activation function is the rectified linear unit (ReLU). However, in our experiments, we have not found that using the sigmoid hurts the specified knowledge transfer of Net2DeeperNet or the efficiency of the beam search.
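A minimal sketch of the identity-filter initialization for a 3D convolutional layer follows, under these assumptions (all ours): an odd kernel size, stride 1, zero padding of k // 2, and a (out, in, depth, height, width) weight layout.

```python
import numpy as np

def identity_conv3d_filters(channels, k):
    """Filters for the layer inserted by Net2DeeperNet: each output channel
    simply passes through the corresponding input channel, so the deeper
    network initially computes the same mapping (exactly so with ReLU)."""
    U = np.zeros((channels, channels, k, k, k))
    c = k // 2                               # center tap of the kernel
    for ch in range(channels):
        U[ch, ch, c, c, c] = 1.0             # pass channel ch through unchanged
    return U
```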
Fig. 3. An example of the Net2DeeperNet action.
V. EXPERIMENTAL RESULTS
A. Dataset
For evaluation, we use the ModelNet dataset [9], and the same experimental setup as 3D ShapeNets [9], for a fair comparison. ModelNet consists of 40 object classes, such as chairs, tables, toilets, and sofas. Each class has 100 unique CAD models, representing the most common 3D shapes of the class, totaling 151,128 voxelized 3D models in the entire dataset. We conduct 3D classification on both the 10-class subset ModelNet10 and the full 40-class dataset, as in [9]. We use the provided voxelizations and train/test splits for evaluation. Specifically, for each class, 960 instances are used for training and 240 instances are used for testing.

We have implemented our beam search in MATLAB, on top of a GPU-accelerated software library of 3D ShapeNets [9]. Experiments are run on a machine with an NVIDIA Tesla K80 GPU accelerator.
B. 3D shape classification accuracy
Our classification accuracy is averaged over all classes on test data, and used for comparison with 3D ShapeNets [9]. In addition, we average our classification accuracy over five runs of the beam search from five different initial CNNs, all of which have the same architecture, but differently initialized parameters.

We test how our performance varies for different depth limits D and beam widths K. The training and testing accuracies, as well as the total beam-search runtime, are presented in Figures 4–9.

Fig. 4. 3D shape classification accuracy on testing data of 10 classes at different beam widths K.

Fig. 5. 3D shape classification accuracy on training data of 10 classes at different beam widths K.

Fig. 6. Total search time for the 10-class dataset at different beam widths K.

Fig. 7. 3D shape classification accuracy on testing data of 40 classes at different beam widths K.

Fig. 8. 3D shape classification accuracy on training data of 40 classes at different beam widths K.

Fig. 9. Total search time for the 40-class dataset at different beam widths K.

In the experiments, we observe that when a third type of action, which adds a new max-pooling layer, is considered, this action is never selected by the beam search. This is in part due to the fact that adding a pooling layer results in re-initializing the subsequent fully-connected layer, which in turn reduces the effectiveness of the already learned parameters. Because of this, we do not consider the action of adding a pooling layer in our specification of the beam search.

We compare our approach with two other approaches, 3D ShapeNets [9] and DeepPano [22], in Table I. As can be seen, our accuracy is 3.63% better than DeepPano in the 40-class experiment, and 2.55% better in the 10-class experiment.

TABLE I
COMPARISON OF OUR CLASSIFICATION ACCURACY (%) WITH THE STATE OF THE ART ON THE MODELNET40 AND MODELNET10 DATASETS

Algorithm          ModelNet40    ModelNet10
Ours               81.26%        88.00%
DeepPano [22]      77.63%        85.45%
3D ShapeNets [9]   77%           83.5%

We observe that the model produced by our beam search has far fewer parameters than the network used in [9]. Their model consists of three convolutional and two fully-connected learned layers: their first layer has 48 filters of size 6; the second layer has 160 filters of size 5; the third layer has 512 filters of size 4; the fourth layer is a fully-connected RBM with 1200 hidden units; and the fifth and final layer is a fully-connected layer of size C, which is the number of classes. Our best found model consists of three convolutional layers and one fully-connected layer: our first layer has 16 filters of size 6 and stride 2; the second layer has 64 filters of size 5 and stride 2; the third layer has 64 filters of size 5 and stride 2; and the last fully-connected layer has C hidden units. Our model thus has about 80K/12M ≈ 0.6% of their parameters.

The recent literature also presents two other works on 3D shape classification, VoxNet [23] and MVCNN [24], which obtain higher classification accuracies on ModelNet40 than ours (83% and 90.1%, respectively). However, a direct comparison with these approaches is not suitable. VoxNet uses a training process that takes around 12 hours, while the training time for our best found model is less than 5 hours. MVCNN is based on 2D information rendered from different viewing angles around the 3D shape, so it is inherently a 2D CNN approach rather than a 3D CNN. In addition, MVCNN uses a large collection of 2D images from ImageNet, containing millions of images belonging to the same set of classes as the object categories of ModelNet40, to help its training process, whereas our only training dataset is ModelNet40. For these reasons, we believe a direct comparison of our experimental results with theirs is not appropriate.

VI. CONCLUSION
We have presented a new deep model pursuit approach for 3D shape classification. Our learning uses a beam search, which explores the search space of candidate CNN architectures toward achieving maximal classification accuracy. The search tree is efficiently built using a heuristic function based on training classification accuracy, as well as knowledge transfer to efficiently estimate the parameters of new candidate models. Our experiments demonstrate that our approach outperforms the state of the art on the popular ModelNet10 and ModelNet40 3D shape datasets by 3%. Our approach also successfully reduces the total number of parameters by 99.4%. Moreover, our approach could easily be applied to other problems requiring robust deep learning on small training datasets.

ACKNOWLEDGMENT
This research has been supported in part by the National Science Foundation under grants IIS-1302700 and IOS-1340112.

REFERENCES
[1] P. J. Besl and N. D. McKay, "A method for registration of 3-D shapes," IEEE Trans. Pattern Anal. Mach. Intell., vol. 14, no. 2, Feb. 1992.
[2] J. W. Tangelder and R. C. Veltkamp, "A survey of content based 3D shape retrieval methods," Multimedia Tools Appl., vol. 39, no. 3, Sep. 2008. [Online]. Available: http://dx.doi.org/10.1007/s11042-007-0181-0
[3] M. Aubry, U. Schlickewei, and D. Cremers, "The wave kernel signature: A quantum mechanical approach to shape analysis," in CVS, Nov. 2011.
[4] J. Sun, M. Ovsjanikov, and L. Guibas, "A concise and provably informative multi-scale signature based on heat diffusion," in SGP, 2009.
[5] T. Gatzke, C. Grimm, M. Garland, and S. Zelinka, "Curvature maps for local shape comparison," in SMA, June 2005.
[6] E. Rodola, S. Rota Bulo, T. Windheuser, M. Vestner, and D. Cremers, "Dense non-rigid shape correspondence using random forests," in CVPR, June 2014.
[7] M. Leordeanu and M. Hebert, "A spectral technique for correspondence problems using pairwise constraints," in ICCV, vol. 2, Oct. 2005.
[8] F. Zhou and F. De la Torre, "Factorized graph matching," in CVPR, June 2012.
[9] Z. Wu, S. Song, A. Khosla, F. Yu, L. Zhang, X. Tang, and J. Xiao, "3D ShapeNets: A deep representation for volumetric shapes," in CVPR, 2015.
[10] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in NIPS, 2012.
[11] B. Li et al., "A comparison of 3D shape retrieval methods based on a large-scale benchmark supporting multimodal queries," Computer Vision and Image Understanding, vol. 131, pp. 1–27, 2015.
[12] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein et al., "ImageNet large scale visual recognition challenge," IJCV, vol. 115, no. 3, 2015.
[13] M. Lam, J. Rao Doppa, S. Todorovic, and T. G. Dietterich, "HC-Search for structured prediction in computer vision," in CVPR, June 2015.
[14] A. Roy and S. Todorovic, "Scene labeling using beam search under mutex constraints," in CVPR, June 2014.
[15] C. H. Lampert, M. B. Blaschko, and T. Hofmann, "Efficient subwindow search: A branch and bound framework for object localization," PAMI, vol. 31, no. 12, 2009.
[16] N. Payet and S. Todorovic, "SLEDGE: Sequential labeling of image edges for boundary detection," IJCV, vol. 104, no. 1, 2013.
[17] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," arXiv, 2015.
[18] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," arXiv, 2014.
[19] G. E. Hinton, "Training products of experts by minimizing contrastive divergence," Neural Computation, vol. 14, no. 8, 2002.
[20] T. Tieleman and G. Hinton, "Using fast weights to improve persistent contrastive divergence," in ICML, 2009.
[21] T. Chen, I. Goodfellow, and J. Shlens, "Net2Net: Accelerating learning via knowledge transfer," arXiv, 2015.
[22] B. Shi, S. Bai, Z. Zhou, and X. Bai, "DeepPano: Deep panoramic representation for 3-D shape recognition," SPL, vol. 22, no. 12, 2015.
[23] D. Maturana and S. Scherer, "VoxNet: A 3D convolutional neural network for real-time object recognition," in IROS, 2015.
[24] H. Su, S. Maji, E. Kalogerakis, and E. Learned-Miller, "Multi-view convolutional neural networks for 3D shape recognition," in ICCV, 2015.