Enabling Viewpoint Learning through Dynamic Label Generation
Michael Schelling, Pedro Hermosilla, Pere-Pau Vázquez, Timo Ropinski
Ulm University, Germany; Universitat Politècnica de Catalunya, Spain
[Figure 1 overview: an input 3D model is sampled into a point cloud and processed by an MCConv network; dynamic label generation adapts the target prediction across training iterations n-1, n, n+1; predictions are evaluated for Viewpoint Entropy, Viewpoint Kullback-Leibler divergence, Visibility Ratio, and Viewpoint Mutual Information. Properties: fast, robust to meshing, separated from rendering during evaluation.]
Figure 1: We propose a new learning-based algorithm which is able to predict high quality viewpoints directly on 3D models. The key to learning viewpoints is a novel approach to resolve label ambiguities, in the form of dynamic label generation, which adapts the network target during training, and enables our network to learn viewpoints for various viewpoint quality measures. By learning solely on unstructured 3D point information, our approach is robust under mesh quality changes, and the viewpoint prediction is separated from the rendering process during evaluation.
Abstract
Optimal viewpoint prediction is an essential task in many computer graphics applications. Unfortunately, common viewpoint qualities suffer from two major drawbacks: dependency on clean surface meshes, which are not always available, and the lack of closed-form expressions, which requires a costly search involving rendering. To overcome these limitations we propose to separate viewpoint selection from rendering through an end-to-end learning approach, whereby we reduce the influence of the mesh quality by predicting viewpoints from unstructured point clouds instead of polygonal meshes. While this makes our approach insensitive to the mesh discretization during evaluation, it only becomes possible when resolving label ambiguities that arise in this context. Therefore, we additionally propose to incorporate the label generation into the training procedure, making the label decision adaptive to the current network predictions. We show how our proposed approach allows for learning viewpoint predictions for models from different object categories and for different viewpoint qualities. Additionally, we show that prediction times are reduced from several minutes to a fraction of a second, as compared to state-of-the-art (SOTA) viewpoint quality evaluation. We will further release the code and training data, which will, to our knowledge, be the biggest viewpoint quality dataset available.
Keywords:
Viewpoint selection, deep learning, label ambiguity
1. Introduction
3D models play an essential role in all areas of computer graphics, such as games, animated movies or virtual reality. To effectively showcase these models or to assess their quality, not only model parameters are important, such as geometry and material, but also the selection of optimal views is crucial. Optimal views should ensure that the model complexity is appropriately communicated, and relevant structures are visible. Many quality measures have been developed to aid in the automatic selection of optimal viewpoints on 3D models. The applications range from obtaining vantage points for capturing stills in architecture [HWZ∗17] to protein visualization [HVH∗16], mesh saliency [SLMR14], and perceptual models of viewpoint preference [SLF∗11].

Most viewpoint quality measures aim to measure the information content of rendered 2D images of 3D models. The information content is usually derived from the visibility of the model geometry, making it sensitive to mesh quality and, in some cases, discretization. For this reason view quality measures generally assume the geometry to be a clean watertight surface mesh, which is not always available in real world applications. Faulty meshing on the other hand, such as holes in the geometry or self-intersecting triangles, distorts the resulting viewpoint quality [BFS∗18].
Prior work on ambiguous labels is restricted to specific sources of ambiguity, such as label distributions [GXX∗17] or symmetric ambiguity [LGS19]. Thus we propose a more general approach, dynamic label generation, which integrates the label decision into the training process. This allows the network to dynamically adjust the labels during training, which results in a harmonized label decision over the dataset, effectively reducing the influence of contradicting label decisions, and thus gradients, and enabling learning for this more general type of ambiguity.

Within this paper we make the following contributions:
• We present the first learning-based approach that directly predicts optimal viewpoints on 3D models, while being robust to the input mesh quality.
• We introduce a novel dynamic label generation method, incorporating the label decision into the training to resolve label ambiguity.
• We release viewpoint quality annotations for a subset of ModelNet40, which makes it the largest available viewpoint quality dataset, by a large margin.
2. Related Work
The search for a good viewpoint of a 3D object is a problem that can be dated back to ancient societies such as the Greeks and Romans. Several rules, such as the golden ratio or the rule of thirds, have been proposed to estimate beauty or proportion. More recently, the search for preferred views has also been addressed, especially in computer vision tasks (e.g., for object recognition), and researchers have wondered what parameters constitute a good view [PPB∗05]. Blanz et al. studied which object attributes determine canonical views [BTB99]. They also found that in some cases, these correspond to three-quarter views (also with notable exceptions, such as in the case of vehicles). Secord et al. also analyzed viewpoint preferences in a large scale user study, and derived a combination of existing techniques [SLF∗11].
Viewpoint selection. The automatic selection of viewpoints for 3D scenes has many applications, such as helping observers gain understanding of a certain scene [AVF04, LDW14, FWK17], object recognition [DDND06, DDND09], and assisting in robotic tasks [SLM∗17]. Viewpoint quality measures have further been evaluated and compared in user studies and surveys [SLF∗11, FWBK15, BFS∗18]. In contrast to such brute-force evaluation, our approach benefits from its learning-based design, as one forward pass through the network enables viewpoint prediction in milliseconds, rather than the minutes required by the brute-force approaches.

Label ambiguity.
Ambiguous labels are present in many tasks, such as image classification, image segmentation, pose estimation or age estimation [GXX∗17], and can hurt the performance of a learner if not considered [RLDB17]. There are different sources for these ambiguities: some tasks naturally allow multiple correct labels, e.g., in image classification an image can contain multiple objects; for other tasks it is difficult to provide a definitive label, e.g., it is hard to determine the exact pose of a partially occluded person. While classification tasks can resolve label ambiguity to some degree by design, regression tasks often struggle with ambiguous label information. While restating a regression as a classification is possible [ST19], it limits the possible performance by discretizing the output space. In cases where ambiguity exclusively stems from symmetry, partial restatement can be a trade-off, e.g., to resolve axial symmetry [LGS19] or rotational symmetry [CKF18]. The problem of label ambiguity can also be viewed as a problem of contradicting gradients. While the influence of such gradients can be reduced using mixtures of experts [JJNH91, JJ94], where multiple experts are trained together with a gating network to divide the problem space into disjoint regions, each having its own expert, this method is not applicable to the problem of label ambiguity, which is not separable in the input space, e.g., the same data point could be present twice in the dataset with different labels.

In contrast to these approaches, we present a novel dynamic label generation, which integrates the label generation into the training stage and harmonizes the label decision without further assumptions or restrictions.
3. Viewpoint Quality Measures
To demonstrate the proposed deep learning technique, we have considered four different viewpoint quality measures, which we selected based on their effectiveness in previous studies and their popularity: Viewpoint Entropy (VE) [VFSH01], Visibility Ratio (VR), also referred to as surface area [PB96], Viewpoint Kullback-Leibler divergence (VKL) [SPFG05], and Viewpoint Mutual Information (VMI) [FSG09], which are defined as:

VE = -\sum_{z \in Z} \frac{a_z(v)}{a_t(v)} \log \frac{a_z(v)}{a_t(v)},   (1)

VR = \sum_{z \in Z} vis_z(v) \frac{A_z}{A_t},   (2)

VKL = \sum_{z \in Z} \frac{a_z(v)}{a_t(v)} \log \frac{a_z(v) \, A_t}{a_t(v) \, A_z},   (3)

VMI = \sum_{z \in Z} p(z|v) \log \frac{p(z|v)}{p(z)},   (4)

where:
z: polygon
Z: set of polygons
vis_z(v): visibility of polygon z from viewpoint v (0 or 1)
a_z(v): projected area of polygon z from viewpoint v
a_t(v): projected area of the model from viewpoint v
A_z: area of polygon z
A_t: total area of the model
p(z|v) = a_z(v)/a_t(v): conditional probability of z given v
p(z): probability of z (average projected area of z)

Table 1: Viewpoint measure properties. Two main properties, the correlation to user preference and the sensitivity to mesh discretization, for the considered viewpoint measures [BFS∗18].
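To make Eqs. (1)-(4) concrete, the following is a minimal NumPy sketch that evaluates all four measures from precomputed per-view projected polygon areas (e.g., obtained from an item-buffer rendering). The function and variable names are ours, not the authors':

```python
import numpy as np

def viewpoint_measures(proj_area, face_area, eps=1e-12):
    """Evaluate Eqs. (1)-(4) for all sampled viewpoints at once.

    proj_area: (num_views, num_faces) array of a_z(v), the projected area
               of polygon z seen from viewpoint v (0 where hidden).
    face_area: (num_faces,) array of A_z, the 3D area of each polygon.
    """
    a_t = proj_area.sum(axis=1, keepdims=True)        # a_t(v), per view
    A_t = face_area.sum()                             # total model area
    p_zv = proj_area / np.maximum(a_t, eps)           # p(z|v) = a_z(v)/a_t(v)
    log_p = np.log(np.maximum(p_zv, eps))             # safe log; 0*log(eps) = 0

    ve = -np.sum(p_zv * log_p, axis=1)                                   # Eq. (1)
    vr = (proj_area > 0).astype(float) @ face_area / A_t                 # Eq. (2)
    vkl = np.sum(p_zv * (log_p + np.log(A_t / np.maximum(face_area, eps))),
                 axis=1)                                                 # Eq. (3)
    p_z = p_zv.mean(axis=0)                           # p(z): avg projected area
    vmi = np.sum(p_zv * (log_p - np.log(np.maximum(p_z, eps))), axis=1)  # Eq. (4)
    return ve, vr, vkl, vmi
```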
According to Bonaventura et al. [BFS∗18], the considered measures differ in these two properties, while the user study of Secord et al. [SLF∗11] ranked VR and VE as the two most preferred. Finally, we added VKL as it is partially sensitive to the model's discretization, and the other ones were at the extremes (non-sensitive/highly sensitive). Other measures could also be considered; however, we restricted ourselves to these four measures as they represent the range of the two main properties, see Table 1.

The best viewpoints for VR and VE correspond to the highest viewpoint quality values, and for VKL and VMI to the lowest viewpoint quality values. These viewpoint quality measures are defined for polygonal models and are thus, in contrast to our approach, dependent on the actual meshing to various degrees. While VR and VMI are insensitive to the discretization of the model, and VKL is near insensitive, they all still assume clean surface meshes: for example, self-intersections of polygons change A_t and A_z, and thus also VR and VKL, without necessarily altering the visible surface. This underlying assumption makes it harder to compare good viewpoints for models under different meshing qualities or resolutions, which is a problem if we want to extract model-spanning features of good viewpoints and bias the network towards good viewpoints of clean surface meshes. We reduce these influences with a mesh cleaning pipeline (see Section 4.2), in order to ensure that the viewpoint quality measures work as expected for different meshes.

To compute the best viewpoints for a given model we sample the unit sphere S with 1k viewpoints V ⊂ S ⊂ R³ on a Fibonacci sphere [Gon10], generating almost equidistantly distributed viewpoints, on which we evaluate the four viewpoint quality measures.
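A minimal sketch of the Fibonacci sphere construction [Gon10], one of several equivalent formulations:

```python
import numpy as np

def fibonacci_sphere(n=1000):
    """Almost equidistant viewpoints on the unit sphere [Gon10]."""
    i = np.arange(n)
    golden_angle = np.pi * (3.0 - np.sqrt(5.0))
    z = 1.0 - 2.0 * (i + 0.5) / n          # uniform strip heights in z
    r = np.sqrt(1.0 - z * z)               # radius of each latitude circle
    phi = golden_angle * i                 # golden-angle longitude steps
    return np.stack([r * np.cos(phi), r * np.sin(phi), z], axis=1)
```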
Figure 2: Influence of image resolution. Projection of VE on the viewpoint sphere from the +y-axis for model airplane_0275 from ModelNet40, computed at resolutions 256×256 (138 s), 512×512 (143 s), 1024×1024, and 2048×2048, together with the time needed to sample the 1k viewpoints V, averaged over 10 runs. We choose a resolution of 1024×1024 as a trade-off between accuracy and speed. Note that the locations of the maxima (yellow) are stable at higher resolutions.

Compared to other work with, e.g., 240 viewpoints [SLF∗11, BFS∗18], we achieve a denser sampling of the view sphere. The viewpoint quality measures are computed on rendered 2D images, and are thus influenced by the image resolution. We compared different image resolutions, see Fig. 2, and chose to render the 3D models at 1024×1024 pixels. The resulting viewpoint qualities are normalized to [0, 1], where 0 and 1 refer to the viewpoint quality of the worst and best viewpoint, respectively:

VQ^*(v) = \frac{VQ(v) - VQ(v^-)}{VQ(v^+) - VQ(v^-)},   (5)

where v^+ ∈ V is a viewpoint with the best and v^- ∈ V is one with the worst viewpoint quality of the sampled views V. In the following we will always refer to these normalized versions of the viewpoint quality measures.
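Eq. (5) is a per-model min-max normalization; a short sketch (the flag for VKL/VMI, whose best viewpoints have the lowest raw values, is our addition):

```python
import numpy as np

def normalize_vq(vq, lower_is_better=False):
    """Eq. (5): map the worst sampled viewpoint to 0 and the best to 1.

    Set lower_is_better=True for VKL and VMI, whose best viewpoints
    have the lowest raw values.
    """
    vq = np.asarray(vq, dtype=float)
    v_best = vq.min() if lower_is_better else vq.max()
    v_worst = vq.max() if lower_is_better else vq.min()
    return (vq - v_worst) / (v_best - v_worst)
```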
4. High Quality Viewpoint Prediction
Predicting good viewpoints with neural networks confronts us with two major challenges: the non-uniqueness of the best viewpoint, and the mesh dependency of the viewpoint qualities. In the following sections we describe how we address these challenges.
4.1. Dynamic Label Generation

Optimal viewpoints are not necessarily unique, e.g., due to model symmetry, which means that instead of a definite optimal viewpoint v^+ ∈ V we typically find a set of viewpoints V^+ ⊂ V which maximize the viewpoint quality measure. This phenomenon is referred to as label ambiguity.

In the general setting of label ambiguity, a set of labels Y is given together with a quality measure p: Y → [0, 1], which in our case is given by the normalized viewpoint qualities p(v) = VQ^*(v), p: V → [0, 1].

The naïve label decision would be to ignore label ambiguity, choose one label y^+ ∈ Y as the Single Label (SL) for each model prior to training, and train to minimize the loss ℓ(ŷ, y^+) between the prediction ŷ and the chosen label. For viewpoints the natural choice is the cosine distance

\ell(\hat{v}, v^+) = 1 - \frac{\hat{v} \cdot v^+}{\|\hat{v}\| \, \|v^+\|}   (6)
                  = 1 - \hat{v} \cdot v^+,   (7)

between the prediction v̂ and one viewpoint v^+ with a viewpoint quality of 1. (Note: the ℓ2 norms are 1 as we evaluate on the unit sphere.) However, if this decision is not consistent over the entire dataset, the network is unable to resolve the label ambiguity during training: e.g., if two similar models with similar viewpoint quality distributions are labeled differently, the network receives contradicting gradients, impacting the learning capability, as illustrated in Fig. 3 (top).

We aim to resolve this problem by moving the label decision from a preprocessing step into the training process, making it dependent on the current network prediction. This way the label decision is implicitly learned by the network, and can change dynamically during training to harmonize the label decisions over the dataset. In the following we propose two techniques for dynamic label generation.
Multiple Labels (ML). We choose a subset of high quality labels

Y^+ := \{ y \in Y \mid p(y) \geq \alpha \},   (8)

with a quality threshold 0 ≤ α ≤ 1. During training, the loss between the current prediction ŷ and the closest label in Y^+ is minimized,

\ell_{ML}(\hat{y}) = \ell\big(\hat{y}, \arg\min_{y \in Y^+} \|y - \hat{y}\|\big).   (9)

In our setting of viewpoint prediction this simplifies to

\ell_{ML}(\hat{v}) = \min_{v \in V^+} (1 - \hat{v} \cdot v),   (10)

where we select the labels V^+ with a quality threshold α.

V^+ often consists of clusters covering areas of good viewpoint quality values, which are similar for similar input models, causing the gradients to reinforce each other. However, as the network only optimizes towards the closest label, we observe it stopping at the boundary of one of these clusters, rather than moving towards its center (see Fig. 3), which results in non-optimal values. To further improve the performance we propose a second approach, which considers the quality measure and not just a quality threshold.
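A minimal TensorFlow sketch of the ML loss, Eqs. (8)-(10); the threshold value α = 0.9 is a placeholder of ours, since the exact value used in the paper is not recoverable here:

```python
import tensorflow as tf

def ml_loss(v_pred, views, vq, alpha=0.9):
    """Multiple Labels loss, Eqs. (8)-(10); alpha is an assumed value.

    v_pred: (B, 3) unit-length predictions, views: (N, 3) candidate
    viewpoints, vq: (B, N) normalized viewpoint qualities VQ*.
    """
    cos = tf.matmul(v_pred, views, transpose_b=True)      # (B, N) dot products
    dist = 1.0 - cos                                      # cosine distance, Eq. (7)
    # V+ = {v | VQ*(v) >= alpha}, Eq. (8): push non-labels out of the min
    masked = tf.where(vq >= alpha, dist, tf.fill(tf.shape(dist), 1e9))
    return tf.reduce_mean(tf.reduce_min(masked, axis=1))  # Eq. (10)
```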
Gaussian Labels (GL). We propose to select labels with a high quality value p in the proximity of the current network prediction.
Figure 3: Dynamic label generation. Illustration of the proposed dynamic label generation technique for best viewpoint prediction; we use Mercator projections of the viewpoint sphere, as indicated on the right. Top: Best viewpoints are not necessarily unique, thus randomly choosing a maximum as Single Label (SL) can create different labels for similar input models, which the network is unable to resolve. Bottom: To harmonize the label decision, we propose our two stage dynamic label generation. We first provide the network with Multiple Labels (ML) of high viewpoint quality, used for the first 1500 epochs, and optimize towards the closest one. The labels typically form clusters in high quality areas, in which case the optimization tends to converge towards the boundary. To refine the predictions, we generate the label dynamically in a second stage (Gaussian Label, GL), from epoch 1500 to the end of training. The viewpoint quality distribution is weighted with a Gaussian centered at the current prediction, and the maximum of the result is used as a label, which is typically a close local maximum, i.e., the maximum of the closest cluster. Both stages, ML and GL, provide more similar labels for similar input.

We incorporate this through a locality constraint, by weighting the label distribution with a shifted Gaussian function

p_g(y, \hat{y}) = p(y) \cdot \big( \exp(-\|y - \hat{y}\|^2 / \sigma^2) + s \big),   (11)

and then optimize towards a label which maximizes this measure,

y^+_g(\hat{y}) = \arg\max_{y \in Y} p_g(y, \hat{y}).   (12)

The additive term s ensures that distant high quality labels are not dismissed, which keeps the network from getting stuck in larger regions with low p values.
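For illustration, Eqs. (11)-(12) amount to the following label look-up (a NumPy sketch; the default σ is a placeholder of ours, as the paper's exact value is not recoverable here, while s = 1 follows the text):

```python
import numpy as np

def gaussian_label(views, vq, v_pred, sigma=1.0, s=1.0):
    """Dynamic Gaussian Label, Eqs. (11)-(12).

    views: (N, 3) candidate viewpoints, vq: (N,) their VQ* values,
    v_pred: (3,) current network prediction (unit length).
    """
    d2 = np.sum((views - v_pred) ** 2, axis=1)     # squared distances to v_hat
    p_g = vq * (np.exp(-d2 / sigma ** 2) + s)      # Eq. (11)
    return views[np.argmax(p_g)]                   # Eq. (12)
```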
For our experiments we fix σ and set s = 1, which leads to

VQ_g(v, \hat{v}) = VQ^*(v) \cdot \big( \exp(-\|v - \hat{v}\|^2 / \sigma^2) + 1 \big),   (13)

v^+_g(\hat{v}) = \arg\max_{v \in V} VQ_g(v, \hat{v}),   (14)

\ell_{GL}(\hat{v}) = 1 - \hat{v} \cdot v^+_g(\hat{v}).   (15)

We observe that this approach keeps optimizing towards a local maximum of VQ_g, whereby the value of this local maximum can in some cases be sub-optimal, e.g., if the initial guess of the network is in a bad region. For best results we use ML for initialization, to first optimize towards the closest high quality viewpoint, followed by GL to refine the predictions inside a promising region, see Fig. 3.

4.2. Mesh Cleaning

As mentioned above, some viewpoint quality measures are sensitive to the meshing of the models, and bad mesh quality can lead to distortions in the viewpoint quality computation. These inaccuracies reduce the comparability between different models, which makes it hard for a network to determine the important features. Providing clean and comparably meshed models, on the other hand, biases the network to implicitly extract features of a clean surface solely from point information. To minimize these influences, we pass all meshes through a mesh cleaning pipeline, which resolves mesh intersections and regularizes the meshing. For details on the mesh cleaning and its influence on the viewpoint quality measures we refer the reader to the supplementary material. We note that our pipeline does not remove all artifacts, but the achieved mesh quality proved sufficient for our experiments.
4.3. Network Architecture

We deliberately chose point clouds as input to achieve robustness to mesh polygonization and discretization, in contrast to neural networks which operate directly on meshes. Further, we chose Monte-Carlo Convolutional Neural Networks (MCCNN) [HRV∗18] over other point cloud architectures because of their robustness to the input sampling, achieved by considering the point cloud density.

Our feature extraction network consists of four convolutional layers with increasing convolutional radii, the largest being √3, relative to the bounding box of the model. Each convolutional layer operates on a different resolution, computed using Poisson disk sampling with radii increasing from 0 up to √3, again relative to the bounding box. The feature dimension increases from 3 up to a 2048-dimensional latent geometry representation (see Fig. 4, top). The learned latent representation is processed by four parallel Multi Layer Perceptrons (MLPs) with three layers, the first of size 1024 and the last of size 3, each outputting a viewpoint v̂ ∈ R³ for one of the four viewpoint quality measures (see Fig. 4, bottom). We found that training one feature extractor for all four viewpoint qualities improves the performance as compared to training four separate networks, an effect we attribute to the different losses improving the feature extractor, similar to auxiliary losses [SWY∗15].

Figure 4: Network architecture. Top: We use an MCCNN feature extractor; each layer performs a spatial convolution with increasing radius, which increases the feature dimension and reduces the spatial resolution, resulting in a 2048-dimensional latent geometry representation. Bottom: The latent representation is processed by parallel MLPs, each predicting the viewpoint for a different viewpoint quality measure.
We use dropout [SHK∗14] in the MLP layers with a dropout rate of 0.5, Adam optimization [KB14] with a batch size of 8, and a learning rate decay with an initial learning rate of 0.001, which is multiplied by 0.75 every 200 epochs. We train for a total of 3000 epochs and switch from ML to GL after 1500.
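The following sketch mirrors this training schedule in TensorFlow/Keras. It is illustrative, not the authors' code: steps_per_epoch, model and dataset are hypothetical placeholders, ml_loss is the sketch from Section 4.1, and gl_loss denotes an analogous TensorFlow implementation of the Gaussian label loss:

```python
import tensorflow as tf

# Hyperparameters as stated above; steps_per_epoch is dataset-dependent.
steps_per_epoch = 100
lr_schedule = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=0.001,          # initial learning rate
    decay_steps=200 * steps_per_epoch,    # every 200 epochs ...
    decay_rate=0.75,                      # ... multiply by 0.75
    staircase=True)
optimizer = tf.keras.optimizers.Adam(learning_rate=lr_schedule)

for epoch in range(3000):
    use_gl = epoch >= 1500                # two-stage schedule: ML, then GL
    for batch in dataset:                 # batches of size 8
        with tf.GradientTape() as tape:
            v_pred = model(batch["points"], training=True)
            loss = (gl_loss if use_gl else ml_loss)(
                v_pred, batch["views"], batch["vq"])
        grads = tape.gradient(loss, model.trainable_variables)
        optimizer.apply_gradients(zip(grads, model.trainable_variables))
```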
5. Experiments
To validate our viewpoint learning approach, which is enabled by dynamic label generation, we conducted three experiments. First, we trained a neural network to predict good viewpoints on point clouds of arbitrarily oriented 3D models, comparing our label generation method to existing techniques. Next, we inspected the robustness of our method towards different meshings and samplings of the input models. Lastly, we provide timings for both the sampling algorithm and our network.
All experiments were conducted on a subset of ModelNet40 [WSK∗15] with the categories airplane, bench, bottle, car, chair, sofa, table and toilet, which we split into 80% training, 10% validation and 10% test data. All models were preprocessed as described in Section 4.2. In order to sample the viewpoint quality measures in reasonable time we only use models with at most 10k faces. All meshes are converted into point clouds by sampling 20k random uniform points on the faces. We use a rather dense input of 4096 points per model to capture fine geometric detail; for comparison, common object classification networks typically use only 1024 points [HRV∗18, QYSG17].
Data Augmentation. Neural networks working with three-dimensional input data usually require a large database to achieve noteworthy performance. This is due to the high dimensionality of the input space, as well as to the complexity of the tasks. As the available sources for 3D data are rather limited, compared for example to image data, the use of data augmentation, which virtually increases the dataset, is crucial for our experiments. Therefore, we use the following two data augmentation strategies, random sampling and rotations (a sketch follows the list):

1. The input point clouds are generated by selecting 1024 points using farthest point sampling [ELPZ97], and selecting an additional 3072 random points.
2. The input point clouds are augmented with rotations from SO(3), whereby the three angles are chosen from a random uniform distribution on [0, 2π].
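A minimal sketch of these two strategies, assuming the stated point counts; the helper names are ours:

```python
import numpy as np
from scipy.spatial.transform import Rotation

def farthest_point_sampling(points, k):
    """Greedy farthest point sampling [ELPZ97]."""
    idx = [np.random.randint(len(points))]
    dist = np.linalg.norm(points - points[idx[0]], axis=1)
    for _ in range(k - 1):
        idx.append(int(np.argmax(dist)))                 # farthest from the set
        dist = np.minimum(dist, np.linalg.norm(points - points[idx[-1]], axis=1))
    return np.array(idx)

def augment(points, n_fps=1024, n_rand=3072):
    """1024 FPS points plus 3072 random points, then a random rotation."""
    fps_idx = farthest_point_sampling(points, n_fps)
    rand_idx = np.random.choice(len(points), n_rand, replace=False)
    sample = points[np.concatenate([fps_idx, rand_idx])]
    # Three Euler angles drawn uniformly from [0, 2*pi), as stated in the text.
    angles = np.random.uniform(0.0, 2.0 * np.pi, size=3)
    return sample @ Rotation.from_euler('xyz', angles).as_matrix().T
```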
Figure 5: Viewpoint prediction results for different viewpoint qualities (VE, VR, VKL, VMI; ground truth vs. ours). Viewpoints predicted by our network for unseen models, and ground truth obtained from sampling the viewpoint sphere. We also show the corresponding viewpoint quality spheres centered at the displayed viewpoint. The network successfully predicts high quality viewpoints, indicated by the yellow areas in the viewpoint spheres.
We demonstrate the effectiveness of our two stage dynamic label generation (ML+GL) by comparing it against the single label cosine distance (SL) and existing work on resolving label ambiguity: Deep Label Distribution Learning (DLDL) [GXX∗17] and Spherical Regression (SR) [LGS19], which splits the viewpoint prediction into a regression of the absolute values |v| and a classification task for the signs.

We train a network as described in Section 4.3 on each category to predict viewpoints for the four different viewpoint quality measures: the Viewpoint Entropy (VE), the Visibility Ratio (VR), the Viewpoint Kullback-Leibler divergence (VKL) and the Viewpoint Mutual Information (VMI). For SR and DLDL we have to adapt the architecture and loss as follows:

SR: The loss function ℓ for SR consists of the cosine distance to |v^+| and the cross-entropy loss of the sign prediction. Furthermore, we use two MLPs per output to predict the absolute values and the sign categories.

DLDL: The loss function for DLDL is the per-pixel ℓ2 distance. The MLPs are replaced by 2D decoder networks consisting of 2D deconvolutions and residual blocks [HZRS16], predicting the viewpoint quality distributions; for more details on the architecture we refer to the supplementary material. For a fair comparison we predict at a resolution of 32 ×.

Table 2: Viewpoint prediction results. Top: Mean viewpoint quality in % of the predicted viewpoints using the different labeling techniques on the test set. Using dynamically generated labels (ML, GL, ML+GL) improves over one stage methods (ML, GL) and existing methods (SL, DLDL, SR), where our proposed two stage dynamic label generation method ML+GL yields the best results for all four viewpoint quality measures. Bottom: Mean viewpoint quality in % of the ML+GL approach for the different categories. The performance is consistent over all categories.

labels            categories  VE    VR    VKL   VMI
SL                mean        62.4  71.0  80.7  83.0
SR                mean        63.1  69.8  80.6  80.1
DLDL              mean        58.7  66.6  77.9  77.9
Ours (ML+GL)      mean        79.3  78.2  91.2  92.5
Abl. 1 (ML only)  mean        70.1  72.1  82.6  82.1
Abl. 2 (GL only)  mean        74.2  75.1  89.3  87.7
Ours (ML+GL)      airplane    79.1  74.8  95.2  96.6
Ours (ML+GL)      bench       67.7  72.8  85.5  87.3
Ours (ML+GL)      bottle      75.3  78.0  94.9  94.1
Ours (ML+GL)      car         84.0  80.3  89.7  92.2
Ours (ML+GL)      chair       73.0  77.9  90.8  93.0
Ours (ML+GL)      sofa        88.8  75.7  92.2  93.5
Ours (ML+GL)      table       83.0  82.0  91.6  90.1
Ours (ML+GL)      toilet      83.8  84.3  89.8  93.4
Figure 6: Spherical regression. SR struggles with resolving label ambiguity when the ambiguity is not axis-symmetric, leading to predictions with a flipped sign decision (yellow), although the absolute value might be correct (blue). (Legend: maximum, prediction, prediction with inverted sign.)

SR achieved good regression results for |v|, interpolating good viewpoints, see Fig. 6. We theorize that this is because an underlying assumption for SR is that |v| is the same for all labels; but as in our case the label ambiguity does not solely stem from model symmetry, and the input is not necessarily aligned with the 3D axes, this assumption does not hold.

Predicting the viewpoint quality distribution (DLDL) resulted in the worst results. By analyzing the network predictions we found that the predicted distributions are much smoother than the ground truth distributions; for details we refer the reader to the supplementary material. We hypothesize that the network is unable to capture the geometric details which create the high frequency properties of the viewpoint quality distribution, and as a result predicts an averaged distribution for similar models. We account this to two main factors. First, the task itself is harder than only predicting the optimal viewpoint, as it demands the extraction of geometric features at a finer scale. The extraction of such details would however require a denser input sampling and a wider and deeper feature extractor, and in consequence also a larger dataset for training.
Figure 7: Robustness to mesh polygonization. We show predictions using VE for different subdivisions of the seating surface. As VE favors small triangles, the bias towards views from the top increases with higher mesh density (red). Our network-based approach remains stable independent of the meshing (yellow).

Second, the influence of mesh quality on the viewpoint quality distribution is naturally higher than on the position of the optimal viewpoint. Thus our preprocessing pipeline might be insufficient and create distortions that the network is unable to resolve.

The results of our method are stable for all examined categories, as can be seen in the bottom half of Table 2, showing that no additional tuning of the hyperparameters is necessary to learn various categories or viewpoint quality measures; detailed results can be found in the supplementary material. Viewpoints predicted on the test set, i.e., unseen models, by our network trained with ML+GL labels can be seen in Fig. 5. We stress that due to label ambiguity the network is not optimized towards reproducing the same viewpoint as the sampling method, but towards predicting a viewpoint with high viewpoint quality. This potentially leads to different views, e.g., for the toilet in Fig. 5, where both views have a high quality, as can be seen in the viewpoint quality spheres in the figure.
We use unstructured 3D convolutions, and hence the input to the network are point clouds consisting only of coordinate information. As these points carry no additional information about the polygonization of the underlying mesh, we expect our approach to be insensitive to the discretization of the mesh at test time. Furthermore, the use of MCCNNs, which consider an estimate of the point density, should result in robustness to point sampling strategies.

To confirm this we performed two different experiments. The first one is the application to a toy example, in which we subdivide a part of the chair_0047 mesh into smaller polygons. On the original model VE prefers views from the bottom, showing more geometric details in the form of the legs, while after subdividing the seating surface VE mistakes the small faces for surface details, emphasizing the visibility of this area, see Fig. 7, which results in a viewpoint far from the optimal views of the original mesh. Our approach, on the other hand, predicts viewpoints in an optimal area of the original mesh, independent of the meshing.

In a second experiment we show the robustness of our approach in practice, on input that differs from the clean data provided during training. First, to investigate the robustness to mesh quality, we tested our network on the raw ModelNet40 models, which contain self-intersections, non-surface faces and non-uniform discretization.
Table 3: Robustness to input sampling. Comparison of the network performance at test time for different input data. preprocessed: by our mesh cleaning pipeline; raw: from unaltered ModelNet40; surface: point sampling of Qi et al. [QYSG17]. The network achieves comparable results under different input meshing and point sampling methods, without further training.

source               VE    VR    VKL   VMI
preprocessed meshes  79.3  78.2  91.2  92.5
raw meshes           78.9  76.2  88.6  90.4
surface sampling     79.1  74.6  88.9  88.4
We compared the time needed to estimate high quality viewpointsusing the sampling approach described in Section 3, and the timeneeded to predict high quality views using our neural networkmodel, as described in Section 4.3. The timings were measuredon a system with an Intel Core i7-8700K CPU @ 3.70GHz and aNVIDIA GeForce GTX 1080 GPU. While the sampling approachwas implemented using Python and OpenGL, our network ap-proach was realized through Python and TensorFlow. To make themeasurements comparable, we employed the following two condi-tions. First, we neglected initialization times, which include loadingthe meshes, preprocessing the meshes for the sampling method andsampling points and loading the weights for the network. Second,we sampled the viewpoint quality measures in one procedure, com-puting shared values only once. For the evaluation we chose mod-els of different sizes, ranging from 10k faces to 1M faces, wherebywe processed all these models 10 times with both methods and re-ported the averaged times in Table 4.While the elapsed time of the sampling approach is approxi-mately linear in the number of candidate views and the numberof faces the network only requires one execution. This execution’stime is independent of the model size, outperforming the othermethod in orders of magnitude. While we see some variation in theexecution time of the network, which we account to varying num-bers of points in the 3D convolutions and point hierarchy levels, thetimings are comparable for all inspected models.
6. Limitations and Future Work
To achieve the reported results, we trained category-specific instances of our network in a divide-and-conquer scheme, which is common for similar deep learning tasks such as viewpoint estimation [ST19] or upright prediction [LZL16]. This prevents the proposed network from generalizing to unseen categories; however, we see no theoretical limitation of our method and expect such generalization to be possible in the future by i) expanding the learning capabilities, e.g., using mixtures of experts as was shown for viewpoint estimation [LGS19], and ii) an increased amount of training data, a key ingredient in order to generalize to unseen categories.

While our network can predict multiple viewpoints at once, the views are independent, as it predicts one viewpoint per measure. We see potential for predicting multiple viewpoints that complement one another. However, this leads to the problem of defining a good second view: is it one that best covers the unseen parts of the model, or a second view with a high quality value?
Table 4: Time comparison. Elapsed time of the sampling based method and ours for different model sizes; all timings are averaged over 10 executions. We measure the brute-force sampling method using 250, 500 and 1k candidate views, and measure our model when batch processing 1, 64 and 256 models at the same time. Our network approach is faster by orders of magnitude and is independent of the model size, as it uses a point cloud of fixed size. We report N/A where the execution did not finish after 12 h.
7. Conclusion
The proposed learned viewpoint prediction provides a way to predict high quality viewpoints for different viewpoint quality measures and model categories. By separating viewpoint selection and rendering, our approach performs faster than existing techniques by several orders of magnitude. This makes our method suitable for applications which benefit from speed and parallelizability, such as automatic thumbnail generation for 3D data sets or initial camera placement for user interaction. The prediction of viewpoints directly from unstructured 3D point data proved to make the prediction robust to meshing properties, which makes us believe that the network has learned an internal representation of a clean mesh, as intended. The proposed dynamic label generation method is essential to resolve label ambiguity during training, outperforming existing methods, and is designed to be transferable to other learning tasks that involve label ambiguity.

On top of the contributions made in this article, we provide a dataset, which will be, to our knowledge, the first large scale viewpoint quality dataset, containing more than 16k models in total; more details can be found in the supplementary material.
References

[AVF04] Andújar C., Vázquez P., Fairén M.: Way-Finder: guided tours through complex walkthrough models. Computer Graphics Forum 23, 3 (2004), 499-508. doi:10.1111/j.1467-8659.2004.00781.x
[BFS∗18] Bonaventura X., Feixas M., Sbert M., Chuang L., Wallraven C.: A survey of viewpoint selection methods for polygonal models. Entropy 20, 5 (2018). doi:10.3390/e20050370
[BS05] Bordoloi U. D., Shen H.-W.: View selection for volume rendering. In VIS 05. IEEE Visualization, 2005 (Oct 2005), pp. 487-494. doi:10.1109/VISUAL.2005.1532833
[BTB99] Blanz V., Tarr M. J., Bülthoff H. H.: What object attributes determine canonical views? Perception 28, 5 (1999), 575-599.
[CKF18] Corona E., Kundu K., Fidler S.: Pose estimation for objects with rotational symmetry. In (Oct 2018), pp. 7215-7222.
[DCG10] Dutagaci H., Cheung C. P., Godil A.: A benchmark for best view selection of 3D objects. In Proceedings of the ACM Workshop on 3D Object Retrieval (2010), pp. 45-50.
[DDND06] Deinzer F., Derichs C., Niemann H., Denzler J.: Integrated viewpoint fusion and viewpoint selection for optimal object recognition. In BMVC (2006), pp. 287-296.
[DDND09] Deinzer F., Derichs C., Niemann H., Denzler J.: A framework for actively selecting viewpoints in object recognition. International Journal of Pattern Recognition and Artificial Intelligence 23, 04 (2009), 765-799. doi:10.1142/S0218001409007351
[ELPZ97] Eldar Y., Lindenbaum M., Porat M., Zeevi Y. Y.: The farthest point strategy for progressive image sampling. IEEE Transactions on Image Processing 6, 9 (Sep. 1997), 1305-1315. doi:10.1109/83.623193
[FSG09] Feixas M., Sbert M., González F.: A unified information-theoretic framework for viewpoint selection and mesh saliency. ACM Transactions on Applied Perception (TAP) 6, 1 (2009), 1.
[FWBK15] Freitag S., Weyers B., Bönsch A., Kuhlen T. W.: Comparison and evaluation of viewpoint quality estimation algorithms for immersive virtual environments. ICAT-EGVE 15 (2015), 53-60.
[FWK17] Freitag S., Weyers B., Kuhlen T. W.: Assisted travel based on common visibility and navigation meshes. In (2017), IEEE, pp. 369-370.
[Gon10] González Á.: Measurement of areas on a sphere using Fibonacci and latitude-longitude lattices. Mathematical Geosciences 42, 1 (2010), 49.
[Gum02] Gumhold S.: Maximum entropy light source placement. In Proceedings of IEEE Visualization Conference (Oct 2002), pp. 275-282. doi:10.1109/VISUAL.2002.1183785
[GXX∗17] Gao B.-B., Xing C., Xie C.-W., Wu J., Geng X.: Deep label distribution learning with label ambiguity. IEEE Transactions on Image Processing 26, 6 (2017), 2825-2838.
[HRV∗18] Hermosilla P., Ritschel T., Vázquez P.-P., Vinacua À., Ropinski T.: Monte Carlo convolution for learning on non-uniformly sampled point clouds. In SIGGRAPH Asia 2018 Technical Papers (2018), ACM, p. 235.
[HVH∗16] Heinrich J., Vuong J., Hammang C. J., Wu A., Rittenbruch M., Hogan J., Brereton M., O'Donoghue S. I.: Evaluating viewpoint entropy for ribbon representation of protein structure. In Computer Graphics Forum (2016), vol. 35, Wiley Online Library, pp. 181-190.
[HWZ∗17] He J., Wang L., Zhou W., Zhang H., Cui X., Guo Y.: Viewpoint selection for photographing architectures, 2017. arXiv:1703.01702
[HZRS16] He K., Zhang X., Ren S., Sun J.: Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2016), pp. 770-778.
[JJ94] Jordan M. I., Jacobs R. A.: Hierarchical mixtures of experts and the EM algorithm. Neural Computation 6, 2 (1994), 181-214.
[JJNH91] Jacobs R. A., Jordan M. I., Nowlan S. J., Hinton G. E.: Adaptive mixtures of local experts. Neural Computation 3, 1 (1991), 79-87.
[KB14] Kingma D., Ba J.: Adam: A method for stochastic optimization. International Conference on Learning Representations (12 2014).
[KTL∗17] Kim S.-H., Tai Y.-W., Lee J.-Y., Park J., Kweon I. S.: Category-specific salient view selection via deep convolutional neural networks. In Computer Graphics Forum (2017), vol. 36, Wiley Online Library, pp. 313-328.
[LC15] Lino C., Christie M.: Intuitive and efficient camera control with the toric space. ACM Transactions on Graphics (TOG) 34, 4 (2015), 1-12.
[LDW14] Liu C.-A., Dong R.-F., Wu H.: Flying robot based viewpoint selection for the electricity transmission equipment inspection. Mathematical Problems in Engineering 2014 (2014). doi:10.1155/2014/783810
[LGS19] Liao S., Gavves E., Snoek C. G.: Spherical regression: Learning viewpoints, surface normals and 3D rotations on n-spheres. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2019), pp. 9759-9767.
[LVJ05] Lee C. H., Varshney A., Jacobs D. W.: Mesh saliency. ACM Transactions on Graphics (TOG) 24, 3 (2005), 659-666.
[LZL16] Liu Z., Zhang J., Liu L.: Upright orientation of 3D shapes with convolutional networks. Graphical Models 85 (2016), 22-29. CVM 2016 selected papers. doi:10.1016/j.gmod.2016.03.001
[MC99] Marchand E., Chaumette F.: Active vision for complete scene reconstruction and exploration. IEEE Transactions on Pattern Analysis and Machine Intelligence 21, 1 (1999), 65-72.
[MEB∗17] Meuschke M., Engelke W., Beuing O., Preim B., Lawonn K.: Automatic viewpoint selection for exploration of time-dependent cerebral aneurysm data. In Bildverarbeitung für die Medizin 2017. Springer, 2017, pp. 352-357.
[MNTP07] Mühler K., Neugebauer M., Tietjen C., Preim B.: Viewpoint selection for intervention planning. In EuroVis (2007), pp. 267-274.
[MVN12] Monclús E., Vázquez P.-P., Navazo I.: Efficient selection of representative views and navigation paths for volume data exploration. In Visualization in Medicine and Life Sciences II. Springer, 2012, pp. 133-151.
[NDVZJ19] Nimier-David M., Vicini D., Zeltner T., Jakob W.: Mitsuba 2: A retargetable forward and inverse renderer. ACM Transactions on Graphics (TOG) 38, 6 (2019), 1-17.
[NPLBY18] Nguyen-Phuoc T. H., Li C., Balaban S., Yang Y.: RenderNet: A deep convolutional network for differentiable rendering from 3D shapes. Advances in Neural Information Processing Systems 31 (2018), 7891-7901.
[PB96] Plemenos D., Benayada M.: Intelligent display in scene modeling. New techniques to automatically compute good views. In International Conference GraphiCon (1996), vol. 96, pp. 1-5.
[PPB∗05] Polonsky O., Patane G., Biasotti S., Gotsman C., Spagnuolo M.: What's in an image? Towards the computation of the "best" view of an object. The Visual Computer 21 (08 2005), 840-847. doi:10.1007/s00371-005-0326-y
[QYSG17] Qi C. R., Yi L., Su H., Guibas L. J.: PointNet++: Deep hierarchical feature learning on point sets in a metric space, 2017. arXiv:1706.02413
[RLDB17] Rupprecht C., Laina I., Di Pietro R., Baust M.: Learning in an uncertain world: Representing ambiguity through multiple hypotheses. In (Oct 2017), pp. 3611-3620.
[SHK∗14] Srivastava N., Hinton G., Krizhevsky A., Sutskever I., Salakhutdinov R.: Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15 (2014), 1929-1958.
[SLF∗11] Secord A., Lu J., Finkelstein A., Singh M., Nealen A.: Perceptual models of viewpoint preference. ACM Transactions on Graphics (TOG) 30, 5 (2011), 109.
[SLM∗17] Saran A., Lakic B., Majumdar S., Hess J., Niekum S.: Viewpoint selection for visual failure detection. In (Sep. 2017), pp. 5437-5444. doi:10.1109/IROS.2017.8206439
[SLMR14] Song R., Liu Y., Martin R. R., Rosin P. L.: Mesh saliency via spectral processing. ACM Transactions on Graphics (TOG) 33, 1 (2014), 1-17.
[SMGH18] Smith N., Moehrle N., Goesele M., Heidrich W.: Aerial path planning for urban scene reconstruction: A continuous optimization method and benchmark. In SIGGRAPH Asia 2018 Technical Papers (2018), ACM, p. 183.
[SPFG05] Sbert M., Plemenos D., Feixas M., González F.: Viewpoint quality: Measures and applications. In Computational Aesthetics in Graphics, Visualization and Imaging (01 2005), pp. 185-192. doi:10.2312/COMPAESTH/COMPAESTH05/185-192
[ST19] Shi N., Tao Y.: CNNs based viewpoint estimation for volume visualization. ACM Transactions on Intelligent Systems and Technology (TIST) 10, 3 (2019), 27.
[SWY∗15] Szegedy C., Wei Liu, Yangqing Jia, Sermanet P., Reed S., Anguelov D., Erhan D., Vanhoucke V., Rabinovich A.: Going deeper with convolutions. In (June 2015), pp. 1-9. doi:10.1109/CVPR.2015.7298594
[TLB∗09] Tao Y., Lin H., Bao H., Dong F., Clapworthy G.: Structure-aware viewpoint selection for volume visualization. In (2009), IEEE, pp. 193-200.
[VFSG06] Viola I., Feixas M., Sbert M., Groller M. E.: Importance-driven focus of attention. IEEE Transactions on Visualization and Computer Graphics 12, 5 (2006), 933-940.
[VFSH01] Vázquez P.-P., Feixas M., Sbert M., Heidrich W.: Viewpoint selection using viewpoint entropy. In VMV (2001), vol. 1, pp. 273-280.
[VFSL02] Vázquez P.-P., Feixas M., Sbert M., Llobet A.: Viewpoint entropy: a new tool for obtaining good views of molecules. In ACM International Conference Proceeding Series (2002), vol. 22, pp. 183-188.
[VGHN08] Vázquez P.-P., Götzelmann T., Hartmann K., Nürnberger A.: An interactive 3D framework for anatomical education. International Journal of Computer Assisted Radiology and Surgery 3, 6 (2008), 511-524.
[WSK∗15] Wu Z., Song S., Khosla A., Yu F., Zhang L., Tang X., Xiao J.: 3D ShapeNets: A deep representation for volumetric shapes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2015), pp. 1912-1920.
[Yao08] Yao W. Y. Z.: Intelligent volume visualization through transfer function and viewpoint selection. Journal of Computer-Aided Design & Computer Graphics 5 (2008).
[YLLY19] Yang C., Li Y., Liu C., Yuan X.: Deep learning-based viewpoint recommendation in volume visualization. Journal of Visualization 22, 5 (Oct 2019), 991-1003. doi:10.1007/s12650-019-00583-4
[ZFY20] Zhang Y., Fei G., Yang G.: 3D viewpoint estimation based on aesthetics. IEEE Access 8 (2020), 108602-108621.
Appendix A: Mesh Cleaning Pipeline

Our mesh cleaning pipeline consists of three steps. First, we resolve self-intersections of the mesh using PyMesh. These are particularly bad, as in this case the area of a face does not correspond to its potentially visible area: parts of a face can be hidden inside the model, changing the values of A_z and A_t. As a second step, we remove non-surface polygons by computing the visibility of the faces from 1000 views and dropping all non-visible faces of the model using MeshLab. This is primarily done to create cleaner surface meshes by removing unwanted parts of the model, e.g., passenger seats inside planes. This results in A_t being closer to the actual surface area of the model, while also speeding up the downstream tasks by reducing the number of polygons per model. As the first step introduces artifacts in the form of small and irregular meshing where self-intersections were resolved, we add a third and last step, where we regularize the surface meshes by performing an edge-collapse reduction algorithm, again using MeshLab. Furthermore, this last step also removes unwanted structures in the meshes, e.g., polygons referring to different textures, which are not relevant for shape information but can influence the viewpoint quality. Fig. 8 shows details of the model airplane_0004 from ModelNet40, which contains self-intersections (top left) and unnecessary polygons (top right). The proposed mesh processing resolves self-intersections in the first step and cleans the meshing in the third step (bottom images). We note that this mesh cleaning pipeline does not create perfect watertight surface meshes, but is merely a trade-off between computation time and achieved mesh quality. The resulting quality turned out to be adequate for our experiments. Providing an algorithm for high quality remeshing is beyond the scope of this paper, and an active field of research on its own.
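A rough sketch of such a three-step pipeline using the PyMesh and pymeshlab Python bindings (filter names vary between pymeshlab versions; the visibility-based step 2 is only indicated as a comment, and the file paths and target face count are our placeholders):

```python
import pymesh      # step 1: geometry repair
import pymeshlab   # step 3: remeshing

def clean_mesh(in_path, out_path, target_faces=10000):
    # Step 1: resolve self-intersections, so that face areas again
    # correspond to potentially visible areas (A_z, A_t).
    mesh = pymesh.load_mesh(in_path)
    mesh = pymesh.resolve_self_intersection(mesh)
    pymesh.save_mesh("step1.obj", mesh)

    # Step 2 (not shown): render the mesh from 1000 views and drop all
    # faces that are never visible, e.g., geometry inside the model.

    # Step 3: regularize the meshing with quadric edge-collapse decimation.
    ms = pymeshlab.MeshSet()
    ms.load_new_mesh("step1.obj")
    ms.meshing_decimation_quadric_edge_collapse(targetfacenum=target_faces)
    ms.save_current_mesh(out_path)
```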
Figure 8: Mesh cleaning. Results of the different steps to clean the meshes on airplane_0004. The original mesh contains self-intersections (left) and non-uniform meshing artifacts (right), which are resolved in the first and third step of our mesh cleaning pipeline, respectively.

Appendix B: Distribution Learning

Network Architecture

For predicting the viewpoint quality distribution we use the same feature encoder as for the other tasks, see Section 4.3, and replace the prediction MLPs with 2D decoder networks. These decoder networks consist of deconvolution layers to increase the spatial dimensions, interweaved with residual blocks to increase the decoder capacity. The whole decoder architecture is as follows:
• 2D deconvolution with filter size 4 × 4
• residual blocks
• 2D deconvolution with filter size 4 × 4
• residual blocks
• 2D deconvolution layer with filter size 2 × 2
Predicted Distributions
Examples for predicted viewpoint distributions can be seen in Fig. 9.
Appendix C: Viewpoint Prediction

Table 5 shows the test results on the different categories for the experiment from Section 5.2. Our combined ML+GL method achieves the best performance on almost all categories and viewpoint quality measures. The results are comparable on all categories for all viewpoint quality measures.

Table 5: Detailed results. Breakdown from Table 2 for each category (cells marked "-" could not be recovered from the source).

           airplane bench bottle car   chair sofa  table toilet mean
VE   SL     60.7    50.9  64.8  58.0  63.8  88.7  38.3  74.4   62.4
     SR     49.4    65.1  53.3  64.0  66.3  63.4  73.5  70.0   63.1
     DLDL   47.0    55.4  53.4  58.7  62.9  63.4  56.5  72.4   58.7
     ML     55.4    62.1  50.4  79.8  -     -     -     -      -
VR   SL     71.4    71.7  69.9  65.1  72.4  -     -     -      -
VKL  SL     89.2    76.2  74.9  83.7  83.5  86.3  86.0  65.6   80.7
     SR     86.2    84.4  88.9  74.0  72.1  79.1  89.0  71.5   80.6
     DLDL   86.7    83.3  88.9  73.0  72.0  73.0  78.8  67.8   77.9
     ML     79.7    79.8  92.7  80.4  86.2  75.5  90.2  76.5   82.6
     GL     91.8    -     -     -     -     -     -     -      -
VMI  SL     90.6    79.2  80.4  84.0  88.3  92.6  84.3  64.4   83.0
     SR     85.0    86.2  86.5  81.6  70.6  75.7  -     -      -
Figure 9: Predicted viewpoint quality distributions. Results of the DLDL approach on the category airplane, for one example from the training set and one from the test set (prediction vs. ground truth).
Appendix D: Dataset

We release our training data, which contains dense viewpoint quality values for 1k viewpoints on a Fibonacci sphere for VE, VR, VKL and VMI, for ~12k models from ModelNet40. This includes a subset of ~4k models from the categories airplane, bench, bottle, car, chair, sofa, table and toilet, which were used in our experiments.