Beam Search for Learning a Deep Convolutional Neural Network of 3D Shapes
Xu Xu and Sinisa Todorovic
School of Electrical Engineering and Computer Science, Oregon State University, Corvallis, Oregon 97330
Email: [email protected], [email protected]
Abstract—This paper addresses 3D shape recognition. Recent work typically represents a 3D shape as a set of binary variables corresponding to 3D voxels of a uniform 3D grid centered on the shape, and resorts to deep convolutional neural networks (CNNs) for modeling these binary variables. Robust learning of such CNNs is currently limited by the small datasets of 3D shapes available – an order of magnitude smaller than other common datasets in computer vision. Related work typically deals with the small training datasets using a number of ad hoc, hand-tuning strategies. To address this issue, we formulate CNN learning as a beam search aimed at identifying an optimal CNN architecture – namely, the number of layers, nodes, and their connectivity in the network – as well as estimating parameters of such an optimal CNN. Each state of the beam search corresponds to a candidate CNN. Two types of actions are defined to add new convolutional filters or new convolutional layers to a parent CNN, and thus transition to children states. The utility function of each action is efficiently computed by transferring parameter values of the parent CNN to its children, thereby enabling an efficient beam search. Our experimental evaluation on the 3D ModelNet dataset demonstrates that our model pursuit using the beam search yields a CNN whose performance on 3D shape classification is superior to the state of the art.
I. INTRODUCTION
This paper addresses the problem of 3D shape classification. Our goal is to predict the object class of a given 3D shape, represented as a set of binary presence-absence indicators associated with 3D voxels of a uniform 3D grid centered on the shape. This is one of the basic problems in computer vision, as 3D shapes are important visual cues for image understanding. It is a challenging problem, because an object's shape may vary significantly due to changes in the object's pose and articulation, and may appear quite similar to shapes of other objects.

There is a host of literature on reasoning about 3D object shapes [1], [2]. Traditional approaches typically extract feature points from a 3D shape [3]–[5], and find correspondences between these feature points for shape recognition and retrieval [6]–[8]. However, these methods tend to be sensitive to long-range non-rigid and non-isometric shape deformations, because, in part, the feature points capture only local shape properties. In addition, finding optimal feature correspondences is often formulated as an NP-hard non-convex Quadratic Assignment Problem (QAP), whose efficient solutions come at the price of compromised accuracy.

Recently, the state-of-the-art performance in 3D shape classification has been achieved using deep 3D Convolutional Neural Networks (CNNs), called 3D ShapeNets [9]. 3D ShapeNets extends the well-known deep architecture called AlexNet [10], which is widely used in image classification; specifically, the extension replaces the 2D convolutions computed in AlexNet with 3D convolutions. However, the 3D ShapeNets architecture consists of 3 convolutional layers and 2 fully-connected layers, with a total of 12M parameters. This in turn means that robust learning of the 3D ShapeNets parameters requires large training datasets. But in comparison with common datasets used in other domains, currently available 3D shape datasets are smaller by at least an order of magnitude. For example, the benchmark SHREC'14 dataset [11] has only 8K shapes, and the ModelNet dataset [9] has 150K shapes, whereas the well-known ImageNet [12] has 1.5M images.

Faced with small training datasets, existing deep-learning approaches to 3D shape classification typically resort to a number of ad hoc, hand-tuning strategies for robust learning. Rarely are these justified by extensive empirical evaluation, as it would take a prohibitively long time, but rather by past findings of other related work (e.g., in image classification). Hence, their particular design choices about the architecture and learning – e.g., the number of convolutional and fully-connected layers used, or the specification of the learning rate in backpropagation – may be suboptimal.

Motivated by the state-of-the-art performance of 3D ShapeNets [9], we here adopt this framework, and focus on addressing the aforementioned issues in a principled manner. Specifically, we formulate a model pursuit for robust learning of a 3D CNN. This learning is specified as a beam search aimed at identifying an optimal CNN architecture – namely, the number of layers, nodes, and their connectivity in the network – as well as estimating parameters of such an optimal CNN. Each state of the beam search corresponds to a candidate CNN. Two types of actions are defined to add either new convolutional filters or a new convolutional layer to a parent CNN, and thus transition to children states. The utility function of each action is efficiently estimated as the training accuracy of the resulting CNN.
The efficiency is achieved by transferring parameter values of the parent CNN to its children. Starting from the root "shallow and narrow" CNN, our beam search is guided by the utility function toward generating more complex CNN candidates with increasingly larger classification accuracy on 3D shape training data, until the training accuracy stops increasing. The CNN candidate with the highest training accuracy is finally taken as our 3D shape model, and used in testing.

In our experimental evaluation on the 3D ModelNet dataset [9], our beam search yields a 3D CNN with about 150 times fewer parameters than 3D ShapeNets [9]. The results demonstrate that our 3D CNN outperforms 3D ShapeNets by 3% on 40 shape classes. This suggests that a model pursuit using beam search is a viable alternative to the currently heuristic practice in designing deep CNNs.

In the following, Sec. II gives an overview of our 3D shape classification using a 3D CNN; Sec. III formulates our beam search in terms of the state-space, successor function, heuristic function, and lookahead and backtracking strategy; Sec. IV specifies our efficient transfer of parameters from a parent model to its children in the beam search; and Sec. V presents our results.

II. 3D SHAPE CLASSIFICATION USING 3D CNN
For 3D shape classification we use a 3D CNN. Given a binary volumetric representation of a 3D shape as input, our CNN predicts the object class of the shape. Below, we first describe the shape representation, and then explain the architecture of our 3D CNN.

In this paper, we adopt the binary volumetric representation of 3D shapes presented in [9]. Specifically, each shape is represented as a set of binary indicators corresponding to 3D voxels of a uniform 3D grid centered on the shape. The indicators take value 1 if the corresponding 3D voxels are occupied by the 3D shape, and 0 otherwise. Hence, each 3D shape is represented by a binary three-dimensional tensor. The grid size is set to 30 × 30 × 30 voxels. The shape size is normalized such that a cube of 24 × 24 × 24 voxels fully contains the shape, and the remaining empty voxels serve for padding in all directions around the shape. Each shape is also labeled with a corresponding object class. A sketch of this voxelization is given at the end of this section.

As mentioned in Sec. I, the architecture of our 3D CNN is similar to that of 3D ShapeNets [9], with the important distinction that we greatly reduce the total number of parameters. As we will discuss in greater detail in the results section, the beam search that we use for the model pursuit yields a 3D CNN with 3 convolutional layers and 1 fully connected layer, totaling 80K parameters. The top layer of our model is the standard soft-max layer for classifying the input shapes into one of C possible object classes.

In the following section, we specify our model pursuit and the initial root CNN from which the beam search starts exploring candidate, more complex models, until the training error cannot be further reduced.
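To make the representation concrete, below is a minimal voxelization sketch under stated assumptions: the shape is given as a 3D point sample, and the grid sizes are those given above. The function and argument names are ours, not from [9].

```python
import numpy as np

def voxelize(points, grid_size=30, shape_size=24):
    """Map a 3D point sample of a shape to a binary occupancy grid (a sketch
    of the representation in [9]; this is an illustration, not their code)."""
    pts = np.asarray(points, dtype=float)
    # Normalize the shape so it fits a shape_size^3 cube, preserving aspect ratio.
    mins = pts.min(axis=0)
    extent = (pts.max(axis=0) - mins).max()
    pts = (pts - mins) * (shape_size - 1) / extent
    # Center the occupied cube in the grid; the surrounding empty voxels
    # act as padding in all directions around the shape.
    pad = (grid_size - shape_size) // 2
    idx = np.floor(pts).astype(int) + pad
    vol = np.zeros((grid_size,) * 3, dtype=np.uint8)
    vol[idx[:, 0], idx[:, 1], idx[:, 2]] = 1   # 1 = voxel occupied by the shape
    return vol
```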
III. BEAM SEARCH

Search-based approaches have a long track record of successfully solving computer vision problems, including structured prediction for scene labeling [13], [14], object localization [15], and boundary detection [16]. Unlike the above related work, search in this paper is not used for inference, but for identifying an optimal CNN architecture and estimating CNN parameters. For efficiency of learning, we consider a beam search which limits the exploration of the state space to a few top candidates. Our beam search is defined by the following:
• States correspond to CNN candidates;
• The initial state represents a small CNN;
• The successor function generates new states based on actions taken in parent states;
• The heuristic function evaluates the utility of the actions, and thus guides the beam search;
• A lookahead and backtracking strategy.
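To make these components concrete, the following is a minimal sketch of the overall search loop. The helper names (train_initial_cnn, successors, train_accuracy) are placeholders for the procedures specified in the remainder of this section, not the authors' code.

```python
def beam_search(train_data, K=3):
    best = train_initial_cnn(train_data)     # initial state: the small CNN of Fig. 1
    beam = [best]
    while beam:
        # Successor function: expand every state in the beam with both actions.
        children = [c for s in beam for c in successors(s, train_data)]
        # Heuristic: rank children by training accuracy and keep the top K
        # that still improve on the best model found so far.
        children.sort(key=lambda c: train_accuracy(c, train_data), reverse=True)
        beam = [c for c in children[:K]
                if train_accuracy(c, train_data) > train_accuracy(best, train_data)]
        if beam:
            best = beam[0]
    return best   # the candidate with the highest training accuracy
```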
Fig. 1. Architecture of our initial CNN model.
State-space: The state-space is defined as Ω = {s}, where state s represents a network configuration (also called an architecture). A CNN's network configuration specifies the number of convolutional and fully-connected layers, the number of hidden units or 3D convolutional filters used in each layer, and which layers have max-pooling. In this paper, we constrain the beam search such that the size of the fully-connected layer remains the same as in the initial CNN, because we have empirically found that only extending the convolutional layers maximally increases the network's classification accuracy (as also reported in [17], [18]).
Initial State: Our model pursuit starts from a relatively simple initial CNN, illustrated in Figure 1, and the goal of the beam search is to extend the initial model by adding either new convolutional filters to existing layers, or new convolutional layers. The initial model consists of only two convolutional layers and one fully-connected layer. The first convolutional layer has 16 filters of size 6 and stride 2. The second convolutional layer has 32 filters of size 5 and stride 2. Finally, the fully-connected layer has 400 hidden units.

The parameters of the initial CNN are trained as follows. We first generatively pre-train the model in a layer-wise fashion, and then use a discriminative fine-tuning procedure. The standard Contrastive Divergence [19] is used to pre-train the two convolutional layers, whereas the top fully-connected layer is trained using Fast Persistent Contrastive Divergence [20]. Once one layer is learned, its weights are fixed and the hidden activations are fed into the next layer as input. After this pre-training, we continue to discriminatively fine-tune the pre-trained model. We first replace the topmost layer with a new, randomly initialized fully-connected layer, and then add a standard softmax layer on top of the network to output class probabilities. The standard cross-entropy loss is computed using ground-truth class labels, and used in backpropagation to update the weights in all layers.

Given this simple initial CNN, the beam search gradually builds a search tree with new states s. Exploration of the state space consists of generating successor states from a few selected parent states. The selection is based on ranking the parent states by a heuristic function, as further explained below.
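For concreteness, here is a minimal sketch of this initial architecture written in PyTorch; this is our illustration only, since the authors' implementation is in MATLAB and the model is pre-trained generatively rather than trained end-to-end. With 30 × 30 × 30 binary input volumes, the spatial sizes work out to 30 → 13 → 5, so the flattened feature size is 32 · 5³ = 4000.

```python
import torch.nn as nn

def initial_cnn(num_classes):
    return nn.Sequential(
        nn.Conv3d(1, 16, kernel_size=6, stride=2), nn.Sigmoid(),   # conv1: 16 filters, size 6, stride 2
        nn.Conv3d(16, 32, kernel_size=5, stride=2), nn.Sigmoid(),  # conv2: 32 filters, size 5, stride 2
        nn.Flatten(),
        nn.Linear(32 * 5 ** 3, 400), nn.Sigmoid(),                 # fully-connected: 400 hidden units
        nn.Linear(400, num_classes),                               # class scores; softmax applied in the loss
    )
```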
Successor function: Γ : s → s′ generates new states s′ from s by applying an action a ∈ A from a set of possible actions A. In this paper, we specify A as consisting of two types of actions: 1) add a new convolutional layer on top of all existing convolutional layers, where the newly added layer has the same number of filters, filter size, and stride as the top convolutional layer; 2) double the number of filters in the top convolutional layer. Other alternative definitions of A are also possible. In particular, we have also considered an action which adds max-pooling to convolutional layers; however, such an extended A has not produced better performance on test data, relative to the above case when only two types of actions are considered.

As one of our technical contributions, we specify an efficient successor function for enabling an efficient beam search. Specifically, we apply a knowledge transfer procedure, following the approach of [21], which efficiently copies parameter values of the previous state s to the values of the newly added parameters in the generated state s′. After this knowledge transfer, the new CNN s′ is fine-tuned using only a few iterations (in our experiments, 10), for robustness. Note that a significantly larger number of iterations would have been necessary for this fine-tuning, had we randomly initialized the newly added parameters of s′ (as is common in practice), instead of using knowledge transfer. In this way, we achieve efficiency. In the following section, we explain our knowledge transfer procedure in more detail.

Heuristic function: H(s, s′) ranks new states s′ given their parent states s. H(s, s′) is used to guide the beam search, which selects the top K successor states, where K is the beam width. H(s, s′) is defined as the difference in classification accuracy on training data between s′ and its parent s.
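A minimal sketch of the successor function and heuristic follows, assuming hypothetical helpers net2deeper, net2wider, finetune, and train_accuracy that stand for the procedures of Sec. IV and standard training code.

```python
def successors(s, train_data, iters=10):
    """Γ: generate children of state s by the two action types, with
    knowledge transfer (Sec. IV) followed by a few fine-tuning iterations."""
    children = []
    for action in (net2deeper, net2wider):   # the two action types in A
        child = action(s)                    # child starts from the parent's parameters
        finetune(child, train_data, iters)   # cheap, thanks to the warm start
        children.append(child)
    return children

def heuristic(s, s_child, train_data):
    """H(s, s'): gain in training accuracy of the child over its parent."""
    return train_accuracy(s_child, train_data) - train_accuracy(s, train_data)
```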
Lookahead and backtracking strategy: For robustness, we specify a lookahead and backtracking strategy for selecting the top K successor states. We first explore the state space by applying the successor function several times from the parent states s, until the resulting search tree reaches a depth limit D. Then, among the leaf states s′ at tree depth D, we select the top K leaves as evaluated with H(s, s′). From these top K leaf states, we backtrack to the unique K children of the parent states s, which are taken as the valid new candidate CNNs.
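The following sketch illustrates one way to implement this strategy; it is our reconstruction, with hypothetical helper names, of the depth-D expansion, top-K leaf selection, and backtracking to the unique depth-1 ancestors described above.

```python
def lookahead_select(parents, train_data, D, K):
    # Expand the tree to depth D, remembering each leaf's depth-1 ancestor.
    frontier = [(None, p) for p in parents]
    for depth in range(D):
        frontier = [(c if depth == 0 else anc, c)
                    for anc, s in frontier
                    for c in successors(s, train_data)]
    # Rank the depth-D leaves by the heuristic (training-accuracy gain).
    frontier.sort(key=lambda t: train_accuracy(t[1], train_data), reverse=True)
    # Backtrack: the unique depth-1 children on paths to the top-K leaves
    # become the new candidate CNNs.
    chosen = []
    for anc, _leaf in frontier[:K]:
        if anc not in chosen:
            chosen.append(anc)
    return chosen
```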
IV. KNOWLEDGE TRANSFER

When generating new candidate CNNs, we make our beam search efficient by appropriately transferring parameter values from parent CNNs to their descendants. In the sequel, we specify this knowledge transfer for the two types of search actions considered in this paper.
A. Net2WiderNet
A new state can be generated by doubling the number of filters in the top convolutional layer of a parent CNN. This action effectively renders the new candidate CNN "wider" than its parent model, and hence we call this action Net2WiderNet.

We estimate the parameters of the "wider" CNN as follows. The key idea is to estimate the newly added parameters such that the parent CNN and its "wider" child CNN give the same outputs for the same inputs. This knowledge-transfer strategy ensures that the newly generated model is not worse than the previously considered model. After this knowledge transfer, the parameters of the "wider" child CNN s′ are fine-tuned to verify whether the action resulted in a better model than the parent CNN s, as evaluated with the heuristic function H(s, s′).

In order to widen a convolutional layer i, we need to update both sets of model parameters W^(i) ∈ R^(m×n) and W^(i+1) ∈ R^(n×p) at layers i and i+1, respectively, where layer i has m inputs and n outputs, and layer i+1 has p outputs. When the action Net2WiderNet extends layer i so that it has q > n outputs, we define the random mapping function g as

    g(j) = { j,                                  0 < j ≤ n
           { random sample from {1, 2, ..., n},  n < j ≤ q     (1)

Then, the new sets of parameters U^(i) and U^(i+1) can be computed from W^(i) and W^(i+1) as

    U^(i)_{k,j} = W^(i)_{k,g(j)},                                   (2)

    U^(i+1)_{j,h} = (1 / |{x | g(x) = g(j)}|) W^(i+1)_{g(j),h},     (3)

where k = 1, ..., m, j = 1, ..., q, and h = 1, ..., p.

From (2), the first n columns of W^(i) are simply copied directly into U^(i). Columns n+1 through q of U^(i) are created by randomly choosing columns of W^(i), as defined by g. The random selection is performed with replacement, so each column of W^(i) may be copied many times to columns n+1 through q of U^(i).

From (3), we similarly have that the first n rows of W^(i+1) are copied directly into U^(i+1), and rows n+1 through q of U^(i+1) are created by randomly choosing rows of W^(i+1), as defined by g. In addition, the new parameters in U^(i+1) are normalized so as to account for the random replication of rows in U^(i+1). The normalization is computed by dividing the new parameters by the replication factor, given by |{x | g(x) = g(j)}|.

It is straightforward to prove that the resulting extended network with new parameters U^(i) and U^(i+1) produces the same outputs as the original network with parameters W^(i) and W^(i+1), for the same inputs.

An example of this procedure is illustrated in Figure 2. In this example, we increase the size of hidden layer h^(i) by adding one additional unit, while keeping the activations propagated to hidden layer h^(i+1) unchanged. Assume that we randomly pick hidden unit h^(i)_2 to replicate; then we copy its incoming weights W^(i)_{1,2} and W^(i)_{2,2} to the new unit h^(i)_3. The weight W^(i+1)_{2,1}, going out of h^(i)_2, must be copied to also go out of h^(i)_3. This outgoing weight must also be divided by 2 to compensate for the replication of h^(i)_2.

Fig. 2. An example of the Net2WiderNet action.
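A minimal sketch of Eqs. (1)–(3) for fully-connected weight matrices follows (the convolutional case is analogous along the filter dimension); indices are 0-based here, unlike the 1-based notation above, and the function name is ours.

```python
import numpy as np

def net2wider(W_i, W_next, q):
    """Widen layer i from n to q > n outputs, preserving the network's
    outputs exactly (a sketch of Eqs. (1)-(3), following [21])."""
    m, n = W_i.shape                      # layer i: m inputs, n outputs
    assert q > n
    # Eq. (1): identity on the first n units, then random replication
    # (with replacement) for the q - n new units.
    g = np.concatenate([np.arange(n), np.random.randint(0, n, size=q - n)])
    # Eq. (2): copy (and replicate) columns of W_i into U_i.
    U_i = W_i[:, g]
    # Eq. (3): copy rows of W_next, divided by the replication factor
    # |{x : g(x) = g(j)}| so the widened network computes the same function.
    counts = np.bincount(g, minlength=n)
    U_next = W_next[g, :] / counts[g][:, None]
    return U_i, U_next
```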
B. Net2DeeperNet

The second type of action that we consider is termed Net2DeeperNet, since it adds a new convolutional layer to a parent CNN, thereby producing a deeper child CNN. Specifically, Net2DeeperNet replaces a layer h^(i) = φ(W^(i)⊤ h^(i−1)) with two layers h^(i) = φ(U^(i)⊤ φ(W^(i)⊤ h^(i−1))), where φ denotes the activation function. The new parameter matrix U^(i) is specified as the identity matrix.

Figure 3 shows an illustration of Net2DeeperNet. When we apply this action, we add a new convolutional layer and simply set the new convolution filters to be identity functions. Zero padding is also added to keep the size of the activations unchanged.
It is worth noting that Net2DeeperNet does not guarantee that the resulting deeper network gives the same outputs as the original one, for the same inputs, when the activation function used is the sigmoid. The guarantee does hold when the activation function is the rectified linear unit (ReLU). However, in our experiments, we have not found that using the sigmoid hurts the specified knowledge transfer of Net2DeeperNet or the efficiency of the beam search.
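A minimal sketch of the identity-filter initialization for a 3D convolutional layer follows, under these assumptions (all ours): an odd kernel size, stride 1, zero padding of k // 2, and a (out, in, depth, height, width) weight layout.

```python
import numpy as np

def identity_conv3d_filters(channels, k):
    """Filters for the layer inserted by Net2DeeperNet: each output channel
    simply passes through the corresponding input channel, so the deeper
    network initially computes the same mapping (exactly so with ReLU)."""
    U = np.zeros((channels, channels, k, k, k))
    c = k // 2                               # center tap of the kernel
    for ch in range(channels):
        U[ch, ch, c, c, c] = 1.0             # pass channel ch through unchanged
    return U
```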
Fig. 3. An example of the Net2DeeperNet action.
V. EXPERIMENTAL RESULTS
A. Dataset
For evaluation, we use the ModelNet dataset [9], and the same experimental setup as 3D ShapeNets [9], for a fair comparison. ModelNet consists of 40 object classes, such as chairs, tables, toilets, and sofas. Each class has 100 unique CAD models, representing the most common 3D shapes of the class, totaling 151,128 voxelized 3D models in the entire dataset. We conduct 3D classification on both the 10-class subset ModelNet10 and the full 40-class dataset, as in [9]. We use the provided voxelizations and train/test splits for evaluation. Specifically, for each class, 960 instances are used for training and 240 instances are used for testing.

We have implemented our beam search in MATLAB, on top of a GPU-accelerated software library of 3D ShapeNets [9]. Experiments are run on a machine with an NVIDIA Tesla K80 GPU accelerator.
B. 3D shape classification accuracy
Our classification accuracy is averaged over all classes on test data, and used for comparison with 3D ShapeNets [9]. In addition, we average our classification accuracy over five runs of the beam search from five different initial CNNs, all of which have the same architecture, but differently initialized parameters.

We test how our performance varies for different depth limits D and beam widths K. The training and testing accuracies, as well as the total beam-search runtime, are presented in Figures 4–9.

Fig. 4. 3D shape classification accuracy on testing data of 10 classes at different beam widths K.

Fig. 5. 3D shape classification accuracy on training data of 10 classes at different beam widths K.

Fig. 6. Total search time for the 10-class dataset at different beam widths K.

Fig. 7. 3D shape classification accuracy on testing data of 40 classes at different beam widths K.

Fig. 8. 3D shape classification accuracy on training data of 40 classes at different beam widths K.

Fig. 9. Total search time for the 40-class dataset at different beam widths K.

In the experiments, we observe that when a third type of action, which adds a new max-pooling layer, is considered, this action is never selected by the beam search. This is in part due to the fact that adding a pooling layer results in re-initializing the subsequent fully-connected layer, which in turn reduces the effectiveness of the already learned parameters. Because of this, we do not consider the action of adding a pooling layer in our specification of the beam search.

We compare our approach with two other approaches, 3D ShapeNets [9] and DeepPano [22], in Table I. As can be seen, our accuracy is 3.63% better than DeepPano in the 40-class experiment, and 2.55% better in the 10-class experiment.

TABLE I
COMPARISON OF OUR CLASSIFICATION ACCURACY (%) WITH THE STATE OF THE ART ON THE MODELNET40 AND MODELNET10 DATASETS

Algorithm          ModelNet40    ModelNet10
Ours               81.26%        88.00%
DeepPano [22]      77.63%        85.45%
3D ShapeNets [9]   77%           83.5%

We observe that the model produced by our beam search has far fewer parameters than the network used in [9]. Their model consists of three convolutional and two fully-connected learned layers: their first layer has 48 filters of size 6; the second layer has 160 filters of size 5; the third layer has 512 filters of size 4; the fourth layer is a fully-connected RBM with 1200 hidden units; and the fifth and final layer is a fully-connected layer of size C, which is the number of classes. Our best found model consists of three convolutional layers and one fully-connected layer: our first layer has 16 filters of size 6 and stride 2; the second layer has 64 filters of size 5 and stride 2; the third layer has 64 filters of size 5 and stride 2; and the last fully-connected layer has C hidden units. Our model thus has about 80K/12M ≈ 0.6% of their parameters.

The recent literature also presents two other works on 3D shape classification, VoxNet [23] and MVCNN [24], which obtain higher classification accuracies on ModelNet40 than ours (83% and 90.1%, respectively). However, a direct comparison with these approaches is not suitable. VoxNet uses a training process that takes around 12 hours, while the training time for our best found model is less than 5 hours. MVCNN is based on 2D information rendered from different viewing angles around the 3D shape, so it is inherently a 2D CNN approach rather than a 3D CNN. In addition, MVCNN uses a large collection of 2D images from ImageNet, containing millions of images belonging to the same set of classes as the object categories of ModelNet40, to help its training process, whereas our only training dataset is ModelNet40. For these reasons, we believe a direct comparison of our experimental results with theirs is not appropriate.

VI. CONCLUSION
We have presented a new deep model pursuit approach for 3D shape classification. Our learning uses a beam search, which explores the search space of candidate CNN architectures toward achieving maximal classification accuracy. The search tree is efficiently built using a heuristic function based on training classification accuracy, as well as knowledge transfer to efficiently estimate the parameters of new candidate models. Our experiments demonstrate that our approach outperforms the state of the art on the popular ModelNet10 and ModelNet40 3D shape datasets by 3%. Our approach also successfully reduces the total number of parameters by 99.4%. Moreover, our approach could easily be applied to other problems requiring robust deep learning on small training datasets.

ACKNOWLEDGMENT
This research has been supported in part by the National Science Foundation under grants IIS-1302700 and IOS-1340112.

REFERENCES
[1] P. J. Besl and N. D. McKay, "A method for registration of 3-D shapes," IEEE Trans. Pattern Anal. Mach. Intell., vol. 14, no. 2, Feb. 1992.
[2] J. W. Tangelder and R. C. Veltkamp, "A survey of content based 3D shape retrieval methods," Multimedia Tools Appl., vol. 39, no. 3, Sep. 2008. [Online]. Available: http://dx.doi.org/10.1007/s11042-007-0181-0
[3] M. Aubry, U. Schlickewei, and D. Cremers, "The wave kernel signature: A quantum mechanical approach to shape analysis," in CVS, Nov. 2011.
[4] J. Sun, M. Ovsjanikov, and L. Guibas, "A concise and provably informative multi-scale signature based on heat diffusion," in SGP, 2009.
[5] T. Gatzke, C. Grimm, M. Garland, and S. Zelinka, "Curvature maps for local shape comparison," in SMA, June 2005.
[6] E. Rodola, S. Rota Bulo, T. Windheuser, M. Vestner, and D. Cremers, "Dense non-rigid shape correspondence using random forests," in CVPR, June 2014.
[7] M. Leordeanu and M. Hebert, "A spectral technique for correspondence problems using pairwise constraints," in ICCV, vol. 2, Oct. 2005.
[8] F. Zhou and F. De la Torre, "Factorized graph matching," in CVPR, June 2012.
[9] Z. Wu, S. Song, A. Khosla, F. Yu, L. Zhang, X. Tang, and J. Xiao, "3D ShapeNets: A deep representation for volumetric shapes," in CVPR, 2015.
[10] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in NIPS, 2012.
[11] B. Li et al., "A comparison of 3D shape retrieval methods based on a large-scale benchmark supporting multimodal queries," Computer Vision and Image Understanding, vol. 131, pp. 1–27, 2015.
[12] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein et al., "ImageNet large scale visual recognition challenge," IJCV, vol. 115, no. 3, 2015.
[13] M. Lam, J. Rao Doppa, S. Todorovic, and T. G. Dietterich, "HC-Search for structured prediction in computer vision," in CVPR, June 2015.
[14] A. Roy and S. Todorovic, "Scene labeling using beam search under mutex constraints," in CVPR, June 2014.
[15] C. H. Lampert, M. B. Blaschko, and T. Hofmann, "Efficient subwindow search: A branch and bound framework for object localization," PAMI, vol. 31, no. 12, 2009.
[16] N. Payet and S. Todorovic, "SLEDGE: Sequential labeling of image edges for boundary detection," IJCV, vol. 104, no. 1, 2013.
[17] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," arXiv, 2015.
[18] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," arXiv, 2014.
[19] G. E. Hinton, "Training products of experts by minimizing contrastive divergence," Neural Computation, vol. 14, no. 8, 2002.
[20] T. Tieleman and G. Hinton, "Using fast weights to improve persistent contrastive divergence," in ICML, 2009.
[21] T. Chen, I. Goodfellow, and J. Shlens, "Net2Net: Accelerating learning via knowledge transfer," arXiv, 2015.
[22] B. Shi, S. Bai, Z. Zhou, and X. Bai, "DeepPano: Deep panoramic representation for 3-D shape recognition," SPL, vol. 22, no. 12, 2015.
[23] D. Maturana and S. Scherer, "VoxNet: A 3D convolutional neural network for real-time object recognition," in IROS, 2015.
[24] H. Su, S. Maji, E. Kalogerakis, and E. Learned-Miller, "Multi-view convolutional neural networks for 3D shape recognition," in ICCV, 2015.