Neural 3D Mesh Renderer
Hiroharu Kato, Yoshitaka Ushiku, and Tatsuya Harada
The University of Tokyo, RIKEN
{kato,ushiku,harada}@mi.t.u-tokyo.ac.jp

Abstract
For modeling the 3D world behind 2D images, which 3D representation is most appropriate? A polygon mesh is a promising candidate for its compactness and geometric properties. However, it is not straightforward to model a polygon mesh from 2D images using neural networks, because the conversion from a mesh to an image, or rendering, involves a discrete operation called rasterization, which prevents back-propagation. Therefore, in this work, we propose an approximate gradient for rasterization that enables the integration of rendering into neural networks. Using this renderer, we perform single-image 3D mesh reconstruction with silhouette image supervision, and our system outperforms the existing voxel-based approach. Additionally, we perform gradient-based 3D mesh editing operations, such as 2D-to-3D style transfer and 3D DeepDream, with 2D supervision for the first time. These applications demonstrate the potential of the integration of a mesh renderer into neural networks and the effectiveness of our proposed renderer.
1. Introduction
Understanding the 3D world from 2D images is one of the fundamental problems in computer vision. Humans model the 3D world in their brains using the images on their retinas, and live their daily lives using the constructed model. Machines, too, can act more intelligently by explicitly modeling the 3D world behind 2D images.

The process of generating an image from the 3D world is called rendering. Because rendering lies on the border between the 3D world and 2D images, it is crucially important in computer vision. In recent years, convolutional neural networks (CNNs) have achieved considerable success in 2D image understanding [7, 13]. Therefore, incorporating rendering into neural networks has high potential for 3D understanding.

What type of 3D representation is most appropriate for modeling the 3D world? Commonly used 3D formats are voxels, point clouds, and polygon meshes.

Figure 1. Pipelines for single-image 3D mesh reconstruction (upper) and 2D-to-3D style transfer (lower).

Voxels, which are 3D extensions of pixels, are the most widely used format in machine learning because they can be processed by CNNs [2, 17, 20, 24, 30, 31, 34, 35, 36]. However, it is difficult to process high-resolution voxels because they are regularly sampled from 3D space and their memory efficiency is poor. The scalability of point clouds, which are sets of 3D points, is relatively high because point clouds are based on irregular sampling. However, textures and lighting are difficult to apply because point clouds do not have surfaces. Polygon meshes, which consist of sets of vertices and surfaces, are promising because they are scalable and have surfaces. Therefore, in this work, we use the polygon mesh as our 3D format.

One advantage of polygon meshes over other representations in 3D understanding is their compactness. For example, to represent a large triangle, a polygon mesh requires only three vertices and one face, whereas voxels and point clouds require many sampling points over the surface. Because polygon meshes represent 3D shapes with a small number of parameters, the model size and dataset size for 3D understanding can be made smaller.

Another advantage is their suitability for geometric transformations: the rotation, translation, and scaling of objects are represented by simple operations on the vertices. This property also facilitates training 3D understanding models.

Can we train a system that includes rendering as a neural network? This is a challenging problem. Rendering consists of projecting the vertices of a mesh onto the screen coordinate system and generating an image through regular grid sampling [16].
Although the former is a differentiable operation, the latter, referred to as rasterization, is difficult to integrate because back-propagation is prevented by the discrete operation.

Therefore, to enable back-propagation with rendering, we propose an approximate gradient for rendering peculiar to neural networks, which facilitates end-to-end training of a system including rendering. Our proposed renderer can flow gradients into textures, lighting, and cameras as well as object shapes. Therefore, it is applicable to a wide range of problems. We name our renderer Neural Renderer.

In the generative approach in computer vision and machine learning, problems are solved by modeling and inverting the process of data generation. Images are generated via rendering from the 3D world, and a polygon mesh is an efficient, rich, and intuitive 3D representation. Therefore, the "backward pass" of mesh renderers is extremely important.

In this work, we propose the two applications illustrated in Figure 1. The first is single-image 3D mesh reconstruction with silhouette image supervision. Although 3D reconstruction is one of the main problems in computer vision, there are few studies that reconstruct meshes from single images despite the potential of this approach. The other application is gradient-based 3D mesh editing with 2D supervision. This includes a 3D version of style transfer [6] and DeepDream [18]. This task cannot be realized without a differentiable mesh renderer because voxels and point clouds have no smooth surfaces.

Our major contributions can be summarized as follows.
• We propose an approximate gradient for rendering of a mesh, which enables the integration of rendering into neural networks.
• We perform 3D mesh reconstruction from single images without 3D supervision and demonstrate our system's advantages over the voxel-based approach.
• We perform gradient-based 3D mesh editing operations, such as 2D-to-3D style transfer and 3D DeepDream, with 2D supervision for the first time.
• We will release the code for Neural Renderer.
2. Related work
In this section, we briefly describe how 3D representations have been integrated into neural networks. We also summarize works related to our two applications.
3D representations are categorized into rasterized and geometric forms. Rasterized forms include voxels and multi-view RGB(D) images. Geometric forms include point clouds, polygon meshes, and sets of primitives.

Rasterized forms are widely used because they can be processed by CNNs. Voxels, which are 3D extensions of pixels, are used for classification [17, 20, 24, 34, 35] and for 3D reconstruction and generation [2, 30, 31, 34, 36]. Because the memory efficiency of voxels is poor, some recent works have incorporated more efficient representations [24, 30, 32]. Multi-view RGB(D) images, which represent a 3D scene through a set of images, are used for recognition [20, 27] and view synthesis [29].

Geometric forms require some modifications to be integrated into neural networks. For example, systems that handle point clouds must be invariant to the order of points. Point clouds have been used for both recognition [12, 19, 21] and reconstruction [5]. Primitive-based representations, which represent 3D objects using sets of primitives such as cuboids, have also been investigated [14, 39].

A polygon mesh represents a 3D object as a set of vertices and surfaces. Because it is memory efficient, suitable for geometric transformations, and has surfaces, it is the de facto standard form in computer graphics (CG) and computer-aided design (CAD). However, because the data structure of a polygon mesh is a complicated graph, it is difficult to integrate into neural networks. Although recognition and segmentation have been investigated [10, 38], generative tasks are much more difficult. Rezende et al. [23] incorporated the OpenGL renderer into a neural network for 3D mesh reconstruction; gradients of the black-box renderer were estimated using REINFORCE [33]. In contrast, the gradients in our renderer are geometry-grounded and presumably more accurate. OpenDR [15] is a differentiable renderer. Unlike this general-purpose renderer, our proposed gradients are designed specifically for neural networks.
The estimation of 3D structures from images is a traditional problem in computer vision. Following the recent progress in machine learning algorithms, 3D reconstruction from a single image has become an active research topic. Most methods learn a 2D-to-3D mapping function using ground-truth 3D models. While some works reconstruct 3D structures via depth prediction [4, 25], others directly predict 3D shapes [2, 5, 30, 31, 34].

Single-image 3D reconstruction can also be realized without 3D supervision. Perspective transformer nets (PTN) [36] learn 3D structures using silhouette images from multiple viewpoints. Our 3D reconstruction method is also based on silhouette images. However, we use polygon meshes whereas they used voxels.

Figure 2. Illustration of our method. v_i = (x_i, y_i) is one vertex of the face. I_j is the color of pixel P_j. The current position of x_i is x_0. x_1 is the location of x_i where an edge of the face collides with the center of P_j when x_i moves to the right. I_j becomes I_ij when x_i = x_1. (a) Example of mesh and pixels; (b) standard rasterization; (c) derivative of (b); (d) modification of (b); (e) derivative of (d).

Using a differentiable feature extractor and loss function, an image that minimizes the loss can be generated via back-propagation and gradient descent. DeepDream [18] is an early example of such a system. An initial image is repeatedly updated so that the magnitude of its image feature becomes larger. Through this procedure, objects such as dogs and cars gradually appear in the image.

Image style transfer [6] is likely the most familiar and practical example. Given a content image and a style image, an image with the specified content and style is generated. Our renderer provides gradients of an image with respect to the vertices and textures of a mesh.
Therefore, DeepDream and style transfer of a mesh can be realized by using loss functions on 2D images.
3. Approximate gradient for rendering
In this section, we describe Neural Renderer, a 3D mesh renderer with gradient flow.
A 3D mesh consists of a set of vertices {v_1^o, v_2^o, ..., v_{N_v}^o} and faces {f_1, f_2, ..., f_{N_f}}, where the object has N_v vertices and N_f faces. v_i^o ∈ R^3 represents the position of the i-th vertex in the 3D object space, and f_j ∈ N^3 represents the indices of the three vertices corresponding to the j-th triangular face. To render this object, the vertices {v_i^o} in the object space are transformed into vertices {v_i^s}, v_i^s ∈ R^2, in the screen space. This transformation is represented by a combination of differentiable transformations [16]. An image is generated from {v_i^s} and {f_j} via sampling. This process is called rasterization.

Figure 3. Illustration of our method in the case where P_j is inside the face. I_j changes when x_i moves to the right or to the left. (a) Example of mesh and pixels; (b) standard rasterization; (c) derivative of (b); (d) modification of (b); (e) derivative of (d).

Figure 2 (a) illustrates rasterization in the case of a single triangle. If the center of a pixel P_j is inside the face, the color I_j of the pixel P_j becomes the color of the overlapping face, I_ij. Because this is a discrete operation, assuming that I_ij is independent of v_i, the gradient ∂I_j/∂v_i is zero almost everywhere, as shown in Figure 2 (b–c). This means that the error signal back-propagated from a loss function to pixel P_j does not flow into the vertex v_i.

For ease of explanation, we describe our method using the x-coordinate x_i of a single vertex v_i = v_i^s in the screen space and a single gray-scale pixel P_j. We consider the color of P_j to be a function I_j(x_i) of x_i and freeze all variables other than x_i.

First, we assume that P_j is outside the face, as shown in Figure 2 (a). The color of P_j is I(x_0) when x_i is at the current position x_0.
If x_i moves to the right and reaches the point x_1, where an edge of the face collides with the center of P_j, then I_j(x_i) suddenly turns into the color of the hitting point, I_ij. Let δ_i^x be the distance traveled by x_i, δ_i^x = x_1 − x_0, and let δ_j^I represent the change in the color, δ_j^I = I(x_1) − I(x_0). The partial derivative ∂I_j(x_i)/∂x_i is zero almost everywhere, as illustrated in Figure 2 (b–c).

Because the gradient is zero, the information that I_j(x_0) can be changed by δ_j^I if x_i moves by δ_i^x to the right is not transmitted to x_i. This is because I_j(x_i) changes suddenly. Therefore, we replace the sudden change with a gradual change between x_0 and x_1 using linear interpolation. Then, ∂I_j/∂x_i becomes δ_j^I / δ_i^x between x_0 and x_1, as shown in Figure 2 (d–e).

The derivative of I_j(x_i) is different on the right and left sides of x_0. How should one define a derivative at x_i = x_0? We propose switching the values using the error signal δ_j^P back-propagated to P_j. The sign of δ_j^P indicates whether P_j should be brighter or darker. To minimize the loss, if δ_j^P > 0, then P_j must be darker. On the other hand, the sign of δ_j^I indicates whether P_j can be brighter or darker. If δ_j^I > 0, P_j becomes brighter by pulling in x_i, but P_j cannot become darker by moving x_i. Therefore, a gradient should not flow if δ_j^P > 0 and δ_j^I > 0. From this viewpoint, we define ∂I_j(x_i)/∂x_i at x_i = x_0 as follows:

∂I_j(x_i)/∂x_i |_{x_i = x_0} = { δ_j^I / δ_i^x,  if δ_j^P δ_j^I < 0;  0,  if δ_j^P δ_j^I ≥ 0. }   (1)

Sometimes, the face does not overlap P_j regardless of where x_i moves. This means that x_1 does not exist. In this case, we define ∂I_j(x_i)/∂x_i |_{x_i = x_0} = 0.

We use Figure 2 (b) for the forward pass, because if we used Figure 2 (d), the color of a face would leak outside of the face.
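As a minimal sketch of the switching rule in Equation (1), the per-pixel gradient can be written in a few lines; the function and argument names here are our own, not the paper's code:

```python
def approx_grad_x(delta_I, delta_x, delta_P):
    """Approximate derivative dI_j/dx_i at x_i = x_0 (Equation 1).

    delta_I: color change I(x_1) - I(x_0) if the edge reached the pixel center
    delta_x: travel distance x_1 - x_0 (zero if x_1 does not exist)
    delta_P: error signal back-propagated to pixel P_j
    """
    if delta_x == 0:
        # The face can never overlap P_j, so no gradient flows.
        return 0.0
    if delta_P * delta_I < 0:
        # Moving the vertex can change the pixel in the direction
        # that reduces the loss: use the interpolated slope.
        return delta_I / delta_x
    # Changing the pixel this way cannot reduce the loss: suppress it.
    return 0.0

# The pixel should become darker (delta_P > 0) and pulling the vertex in
# would darken it (delta_I < 0), so a gradient flows:
print(approx_grad_x(delta_I=-0.8, delta_x=2.0, delta_P=1.0))  # -0.4
# The face could only brighten the pixel, so no gradient flows:
print(approx_grad_x(delta_I=0.8, delta_x=2.0, delta_P=1.0))   # 0.0
```

Note that this rule affects only the backward pass; the forward pass still produces the standard rasterized image.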
Therefore, our rasterizer produces the same images as the standard rasterizer, but it has non-zero gradients.

The derivative with respect to y_i can be obtained by swapping the x-axis and y-axis in the above discussion.

Next, we consider the case where P_j is inside the face, as shown in Figure 3 (a). In this case, I_j(x_i) changes when x_i moves to the right or to the left. Standard rasterization, its derivative, an interpolated function, and its derivative are shown in Figure 3 (b–e). We first compute the derivatives on the left and right sides of x_0 and let their sum be the gradient at x_0. Specifically, using the notation in Figure 3, with δ_j^{Ia} = I(x_a) − I(x_0), δ_j^{Ib} = I(x_b) − I(x_0), δ_x^a = x_a − x_0, and δ_x^b = x_b − x_0, we define the derivative as follows:

∂I_j(x_i)/∂x_i |_{x_i = x_0} = ∂I_j(x_i)/∂x_i |^a_{x_i = x_0} + ∂I_j(x_i)/∂x_i |^b_{x_i = x_0}.   (2)

∂I_j(x_i)/∂x_i |^a_{x_i = x_0} = { δ_j^{Ia} / δ_x^a,  if δ_j^P δ_j^{Ia} < 0;  0,  if δ_j^P δ_j^{Ia} ≥ 0. }   (3)

∂I_j(x_i)/∂x_i |^b_{x_i = x_0} = { δ_j^{Ib} / δ_x^b,  if δ_j^P δ_j^{Ib} < 0;  0,  if δ_j^P δ_j^{Ib} ≥ 0. }   (4)

If there are multiple faces, our rasterizer draws only the frontmost face at each pixel, as in the standard method [16]. During the backward pass, we first check whether or not the cross points I_ij, I_ij^a, and I_ij^b are drawn, and we do not flow gradients if they are occluded by surfaces not including v_i.

Textures can be mapped onto faces. In our implementation, each face has its own texture image of size s_t × s_t × s_t. We determine the coordinates in the texture space corresponding to a position p on a triangle {v_1, v_2, v_3} using the centroid coordinate system. In other words, if p is expressed as p = w_1 v_1 + w_2 v_2 + w_3 v_3, then (w_1, w_2, w_3) are the corresponding coordinates in the texture space. Bilinear interpolation is used for sampling from a texture image.
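The centroid-coordinate texture lookup can be sketched as follows. The function name and the linear blending over the s_t × s_t × s_t texture volume are our own illustrative choices; the paper only states that the barycentric weights index the texture space and that (bi)linear interpolation is used, so the actual implementation may differ in detail:

```python
import numpy as np

def sample_face_texture(texture, w):
    """Sample a per-face texture volume at barycentric coordinates w.

    texture: array of shape (s_t, s_t, s_t, 3), one small texture per face
    w:       (w1, w2, w3) with w1 + w2 + w3 = 1, the weights of the point
             p = w1*v1 + w2*v2 + w3*v3 on the triangle

    The weights are used directly as coordinates in the texture volume,
    and neighbouring texels are blended linearly, so the sampled color is
    a differentiable function of w (and hence of p).
    """
    s = texture.shape[0]
    coords = np.clip(np.asarray(w, dtype=float) * (s - 1), 0, s - 1)
    lo = np.floor(coords).astype(int)
    hi = np.minimum(lo + 1, s - 1)
    frac = coords - lo
    color = np.zeros(3)
    for corner in range(8):  # blend the 8 neighbouring texels
        bits = [(corner >> k) & 1 for k in range(3)]
        idx = tuple(hi[k] if bits[k] else lo[k] for k in range(3))
        weight = 1.0
        for k in range(3):
            weight *= frac[k] if bits[k] else 1 - frac[k]
        color += weight * texture[idx]
    return color
```

Because the blending weights vary smoothly with (w_1, w_2, w_3), gradients can flow both into the texels and into the vertex positions that determine p.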
Lighting can be applied directly to a mesh, unlike voxels and point clouds. In this work, we use a simple ambient light and a directional light without shading. Let l^a and l^d be the intensities of the ambient light and directional light, respectively, n^d be a unit vector indicating the direction of the directional light, and n_j be the normal vector of a surface. We then define the modified color I_j^l of a pixel on the surface as I_j^l = (l^a + (n^d · n_j) l^d) I_j.

In this formulation, gradients also flow into the intensities l^a and l^d, as well as the direction n^d of the directional light. Therefore, light sources can also be included as an optimization target.
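A minimal sketch of this lighting model (the function name is ours):

```python
import numpy as np

def apply_lighting(I_j, n_j, l_a, l_d, n_d):
    """Modified pixel color under ambient + directional light (no shading):

        I^l_j = (l^a + (n^d . n_j) l^d) * I_j

    I_j : RGB color of the pixel before lighting
    n_j : unit normal of the surface drawn at this pixel
    l_a : ambient light intensity
    l_d : directional light intensity
    n_d : unit direction of the directional light

    Every factor enters through sums and products, so gradients can flow
    into l_a, l_d, and n_d as well as into the mesh itself.
    """
    return (l_a + float(np.dot(n_d, n_j)) * l_d) * np.asarray(I_j, dtype=float)

# A surface facing the light receives the full directional contribution:
print(apply_lighting([0.5, 0.5, 0.5], n_j=[0, 0, 1],
                     l_a=0.5, l_d=0.5, n_d=[0, 0, 1]))  # [0.5 0.5 0.5]
```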
4. Applications of Neural Renderer
We apply our proposed renderer to (a) single-image 3D reconstruction with silhouette image supervision and (b) gradient-based 3D mesh editing, including a 3D version of style transfer [6] and DeepDream [18]. An image of a mesh m rendered from a viewpoint φ_i is denoted R(m, φ_i).

Yan et al. [36] demonstrated that single-image 3D reconstruction can be realized without 3D training data. In their setting, a 3D generation function G(x) on an image x is trained such that the silhouettes of the predicted 3D shape {ŝ_i = R(G(x), φ_i)} match the ground-truth silhouettes {s_i}, assuming that the viewpoints {φ_i} are known. This pipeline is illustrated in Figure 1. While Yan et al. [36] generated voxels, we generate a mesh.

Although voxels can be generated by extending existing image generators [8, 22] to 3D space, mesh generation is not so straightforward. In this work, instead of generating a mesh from scratch, we deform a predefined mesh to generate a new mesh. Specifically, we use an isotropic sphere with 642 vertices and move each vertex v_i to v_i + b_i + c using a local bias vector b_i and a global bias vector c. Additionally, we restrict the movable range of each vertex to the same quadrant as its position on the original sphere. The faces {f_i} are unchanged. Therefore, the intermediate outputs of G(x) are b ∈ R^{642×3} and c ∈ R^{1×3}. The mesh we use is thus specified by far fewer parameters than a typical voxel representation. This low dimensionality is presumably beneficial for shape estimation.

The generation function G(x) is trained using a silhouette loss L_sl and a smoothness loss L_sm. The silhouette loss represents how much the reconstructed silhouettes {ŝ_i} differ from the correct silhouettes {s_i}.
The smoothness loss represents how smooth the surfaces of a mesh are and acts as a regularizer. The objective function is a weighted sum of the two loss functions: L = λ_sl L_sl + λ_sm L_sm.

Let {s_i} and {ŝ_i} be binary masks, θ_i be the angle between the two faces sharing the i-th edge in G(x), E be the set of all edges in G(x), and ⊙ be an element-wise product. We define the loss functions as:

L_sl(x | φ_i, s_i) = − |ŝ_i ⊙ s_i|_1 / |ŝ_i + s_i − ŝ_i ⊙ s_i|_1.   (5)

L_sm(x) = Σ_{θ_i ∈ E} (cos θ_i + 1)^2.   (6)

L_sl corresponds to the negative intersection over union (IoU) between the true and reconstructed silhouettes. L_sm encourages the intersection angles of all faces to be close to 180 degrees.

We assume that the object region in an image is segmented via preprocessing, in common with existing works [5, 31, 36]. We input the mask of the object region into the generator as an additional channel of the RGB image.

Gradient-based image editing techniques [6, 18] generate an image by minimizing a loss function L(x) on a 2D image x via gradient descent. In this work, instead of generating an image, we optimize a 3D mesh m consisting of vertices {v_i}, faces {f_i}, and textures {t_i} based on its rendered image R(m, φ_i).

In this section, we propose a method to transfer the style of an image x_s onto a mesh m_c. For 2D images, style transfer is achieved by minimizing a content loss and a style loss simultaneously [6]. Specifically, the content loss is defined using a feature extractor f_c(x) and a content image x_c as L_c(x | x_c) = |f_c(x) − f_c(x_c)|_2^2. The style loss is defined using another feature extractor f_s(x) and a style image x_s as L_s(x | x_s) = |M(f_s(x)) − M(f_s(x_s))|_F^2, where M(x) transforms a vector into a Gram matrix.

In 2D-to-3D style transfer, the content is specified as a 3D mesh m_c.
To make the shape of the generated mesh similar to that of m_c, assuming that the vertices-to-faces relationships {f_i} are the same for both meshes, we redefine the content loss as L_c(m | m_c) = Σ_{(v_i, v_i^c) ∈ (m, m_c)} |v_i − v_i^c|_2^2. We use the same style loss as in the 2D application; specifically, L_s(m | x_s, φ) = |M(f_s(R(m, φ))) − M(f_s(x_s))|_F^2.

We also use a regularizer for noise reduction. Let P denote the set of colors of all pairs of adjacent pixels in an image R(m, φ). We define this loss as L_t(m | φ) = Σ_{(p_a, p_b) ∈ P} |p_a − p_b|_2^2. The objective function is L = λ_c L_c + λ_s L_s + λ_t L_t. We set the initial solution of m to m_c and minimize L with respect to {v_i} and {t_i}.

Let f(x) be a function that outputs a feature map of an image x. For 2D images, a DeepDream of an image x_0 is achieved by minimizing −|f(x)|_F^2 via gradient descent starting from x = x_0. Optimization is halted after a few iterations. Following a similar process, we minimize −|f(R(m, φ))|_F^2 with respect to {v_i} and {t_i}.
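The style loss and the adjacent-pixel regularizer above can be sketched as follows, assuming pre-extracted feature maps rather than an actual VGG network; all names are illustrative:

```python
import numpy as np

def gram_matrix(features):
    """Gram matrix of a feature map of shape (channels, height, width)."""
    c, h, w = features.shape
    f = features.reshape(c, h * w)
    return f @ f.T / (c * h * w)

def style_loss(feat_render, feat_style):
    """Squared Frobenius distance between Gram matrices, i.e. a sketch of
    |M(f_s(R(m, phi))) - M(f_s(x_s))|_F^2 given the two feature maps."""
    d = gram_matrix(feat_render) - gram_matrix(feat_style)
    return float(np.sum(d ** 2))

def pixel_pair_regularizer(image):
    """Sum of squared differences between adjacent pixels of an
    (height, width, 3) rendered image, used for noise reduction."""
    dh = image[1:, :, :] - image[:-1, :, :]
    dw = image[:, 1:, :] - image[:, :-1, :]
    return float(np.sum(dh ** 2) + np.sum(dw ** 2))

# A constant image has no adjacent-pixel variation:
print(pixel_pair_regularizer(np.full((4, 4, 3), 0.5)))  # 0.0
```

In the full pipeline, `feat_render` would come from feeding the rendered image R(m, φ) through the style feature extractor, so minimizing these losses moves the gradients through the renderer into {v_i} and {t_i}.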
5. Experiments
In this section, we evaluate the effectiveness of our renderer through the two applications.
To compare our mesh-based method with the voxel-based approach of Yan et al. [36], we used nearly the same dataset as they did. (The dataset was not exactly the same as that used in [36]: the rendering parameters for the input images were slightly different, and while our silhouette images were rendered by Blender from the meshes in the ShapeNetCore dataset, theirs were rendered by their PTNs using voxelized data.) We used 3D objects from 13 categories in the ShapeNetCore [1] dataset. Images were rendered from 24 azimuth angles with a fixed elevation angle, under the same camera setup and lighting setup, using Blender. The render size was 64 × 64 pixels. We used the same training, validation, and test sets as those used in [36].

We compared reconstruction accuracy against the voxel-based and retrieval-based approaches of [36]. In the voxel-based approach, G(x) is composed of a convolutional encoder and a deconvolutional decoder. While their encoder was pre-trained using the method of Yang et al. [37], our network works well without any pre-training. In the retrieval-based approach, the nearest training image is retrieved using the fc6 feature of a pre-trained VGG network [26], and the corresponding voxels are regarded as the predicted shape. Note that the retrieval-based approach uses ground-truth voxels for supervision.

To evaluate reconstruction performance quantitatively, we voxelized both the ground-truth meshes and the generated meshes and computed the intersection over union (IoU) between the voxels. The voxel size was set to 32^3. For each object in the test set, we performed 3D reconstruction using the images from the 24 viewpoints, calculated the IoU scores, and reported the average score.

We used an encoder-decoder architecture for the generator G(x). Our encoder is nearly identical to that of [36], which encodes an input image into a feature vector, and our decoder is composed of three fully-connected layers. Our renderer renders images at a higher resolution and downsamples them to 64 × 64. We rendered only the silhouettes of objects, without using textures or lighting. We set λ_sl = 1 and λ_sm = 0.001 in Section 5.1.2, and λ_sm = 0 in Section 5.1.3.

We trained our generator using the Adam optimizer [11] with β_1 = 0.9 and β_2 = 0.999. In each minibatch, we included silhouettes from two viewpoints per input image.

Figure 4. 3D mesh reconstruction from a single image. Results are rendered from three viewpoints. First column: input images. Second through fourth columns: mesh reconstruction (proposed method). Fifth through seventh columns: voxel reconstruction [36].

Table 1. Reconstruction accuracy measured by voxel IoU. Higher is better. Rows: Retrieval [36], Voxel-based [36], Mesh-based (ours). Columns: airplane, bench, dresser, car, chair, display, lamp, loudspeaker, rifle, sofa, table, telephone, vessel, mean. [Numeric entries omitted.] Our mesh-based approach outperforms the voxel-based approach [36] in 10 out of 13 categories.

Figure 5. Generation of the back side of a CRT monitor with/without the smoothness regularizer. Left: input image. Center: prediction without the regularizer. Right: prediction with the regularizer.

We trained models using the images of each class. Figure 4 presents a part of the results on the test set for our mesh-based method and the voxel-based method [36]. (The generators of [36] were trained using the code from the authors on our dataset.) Additional results are presented in the supplementary materials. These results demonstrate that a mesh can be correctly reconstructed from a single image using our method.

Compared to the voxel-based approach, the shapes reconstructed by our method are more visually appealing in two respects. First, a mesh can represent small parts, such as airplane wings, at high resolution. Second, a mesh has no cubic artifacts. Although low resolution and artifacts may not be a problem in tasks such as robotic picking, they are disadvantageous for computer graphics, computational photography, and data augmentation.

Without the smoothness loss, our model sometimes produces very rough surfaces, because the smoothness of a surface has little effect on its silhouette. With the smoothness regularizer, the surfaces become smoother and look more natural; Figure 5 illustrates its effectiveness. However, when the regularizer is used, the voxel IoU over the entire dataset becomes slightly lower.
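The two training losses of Equations (5) and (6) can be sketched as follows; the names are ours, and the real system evaluates them on rendered silhouettes and mesh edge angles:

```python
import numpy as np

def silhouette_loss(s_hat, s):
    """Negative IoU between predicted and true binary silhouettes (Eq. 5):
    L_sl = -|s_hat ⊙ s|_1 / |s_hat + s - s_hat ⊙ s|_1."""
    inter = np.sum(s_hat * s)
    union = np.sum(s_hat + s - s_hat * s)
    return float(-inter / union)

def smoothness_loss(edge_angles):
    """Sum of (cos θ_i + 1)^2 over all edges (Eq. 6); minimized when every
    pair of adjacent faces meets at a flat 180-degree angle."""
    return float(np.sum((np.cos(edge_angles) + 1.0) ** 2))

# A perfect silhouette match gives an IoU of 1, i.e. a loss of -1:
s = np.array([[0, 1], [1, 1]], dtype=float)
print(silhouette_loss(s, s))                      # -1.0
# Perfectly flat edges (θ = π) contribute nothing to the smoothness loss:
print(smoothness_loss(np.array([np.pi, np.pi])))  # ~0.0
```

The trade-off discussed above is visible in this form: `smoothness_loss` penalizes sharp creases regardless of their effect on the silhouette, which is why enabling it can slightly reduce the silhouette-driven IoU.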
We trained a single model using images from all 13 classes. The reconstruction accuracy is shown in Table 1. Our mesh-based approach outperforms the voxel-based approach [36] for 10 out of 13 categories. Our results are significantly better for the airplane, chair, display, loudspeaker, and sofa categories. The basic shapes of the loudspeaker and display categories are simple; however, their size and position vary from object to object. The fact that meshes are well suited to scaling and translation presumably contributes to the performance improvements in these categories. The variations in shape within the airplane, chair, and sofa categories are also relatively small.

Our approach did not perform very well for the car, lamp, and table categories. The shapes of the objects in these categories are relatively complicated, and they are difficult to reconstruct by deforming a sphere.

Although our reconstruction method already surpasses the voxel-based method in terms of visual appeal and voxel IoU, it has a clear limitation: it cannot generate objects with varying topologies. To overcome this limitation, the faces-to-vertices relationship {f_i} would have to be generated dynamically. This is beyond the scope of this study, but it is an interesting direction for future research.

We applied 2D-to-3D style transfer and 3D DeepDream to the objects shown in Figure 6. Optimization was conducted using the Adam optimizer [11] with β_1 = 0.9 and β_2 = 0.999. Images were rendered at a higher resolution and downsampled to eliminate aliasing. During optimization, images were rendered at random elevation and azimuth angles. The texture size was set to s_t = 4.

For style transfer, the style images we used were selected from [3, 9]. λ_c, λ_s, and λ_t were manually tuned for each input. The feature extractors f_s for the style loss were conv1_2, conv2_3, conv3_3, and conv4_3 of the VGG-16 network [26]. The intensities of the lights l^a and l^d were fixed to constant values, and the direction of the light was randomly set during optimization. Different α values of Adam were used for {v_i} and {t_i}, and optimization was run for a fixed number of parameter updates.

For DeepDream, images were rendered without lighting. The feature extractor was the inception_4c layer of GoogLeNet [28]. Again, different α values of Adam were used for {v_i} and {t_i}, and optimization was stopped after a fixed number of iterations.

Figure 7 presents the results of 2D-to-3D style transfer. Additional results are shown in the supplementary materials. The styles of the paintings were accurately transferred to both the textures and the shapes. From the outline of the bunny and the lid of the teapot, we can see the straight style of Coupland and Gris. The wavy style of Munch was also transferred to the side of the teapot. Interestingly, the texture of the Tower of Babel was transferred only to the side, not to the top, of the bunny.

The proposed method provides a way to edit 3D models intuitively and quickly. This can be useful for rapid prototyping in product design as well as for art production.
Figure 8 presents the results of DeepDream. A nose and eyes emerged on the face of the bunny. The spout of the teapot expanded and became the face of a bird, while its body came to resemble a bus. These transformations match the 3D shape of each object.
6. Conclusion
In this paper, we enabled the integration of rendering of a 3D mesh into neural networks by proposing an approximate gradient for rendering. Using this renderer, we proposed a method to reconstruct a 3D mesh from a single image whose performance is superior to the existing voxel-based approach [36] in terms of visual appeal and the voxel IoU metric. We also proposed a method to edit the vertices and textures of a 3D mesh according to its 3D shape using a loss function on images and gradient descent. These applications demonstrate the potential of integrating mesh renderers into neural networks and the effectiveness of the proposed renderer.

The applications of our renderer are not limited to those presented in this paper. Other problems may be solved by incorporating our module into other systems.

Figure 6. Initial state of the meshes in style transfer and DeepDream, rendered from six viewpoints.

Figure 7. 2D-to-3D style transfer. The leftmost images represent the styles. The style images are
Thomson No. 5 (Yellow Sunset) (D. Coupland, 2011),
The Tower of Babel (P. Bruegel the Elder, 1563),
The Scream (E. Munch, 1910), and
Portrait of Pablo Picasso (J. Gris, 1912).

Figure 8. DeepDream of 3D meshes.

Acknowledgment
This work was partially funded by the ImPACT Program of the Council for Science, Technology and Innovation (Cabinet Office, Government of Japan) and partially supported by JST CREST Grant Number JPMJCR1403, Japan.
References

[1] A. X. Chang, T. Funkhouser, L. Guibas, P. Hanrahan, Q. Huang, Z. Li, S. Savarese, M. Savva, S. Song, H. Su, et al. ShapeNet: An information-rich 3D model repository. arXiv, 2015. 5
[2] C. B. Choy, D. Xu, J. Gwak, K. Chen, and S. Savarese. 3D-R2N2: A unified approach for single and multi-view 3D object reconstruction. In ECCV, 2016. 1, 2
[3] V. Dumoulin, J. Shlens, and M. Kudlur. A learned representation for artistic style. In ICLR, 2017. 7
[4] D. Eigen, C. Puhrsch, and R. Fergus. Depth map prediction from a single image using a multi-scale deep network. In NIPS, 2014. 2
[5] H. Fan, H. Su, and L. Guibas. A point set generation network for 3D object reconstruction from a single image. In CVPR, 2017. 2, 5
[6] L. A. Gatys, A. S. Ecker, and M. Bethge. Image style transfer using convolutional neural networks. In CVPR, 2016. 2, 3, 4, 5
[7] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR, 2014. 1
[8] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In NIPS, 2014. 4
[9] J. Johnson, A. Alahi, and L. Fei-Fei. Perceptual losses for real-time style transfer and super-resolution. In ECCV, 2016. 7, 15, 17
[10] E. Kalogerakis, M. Averkiou, S. Maji, and S. Chaudhuri. 3D shape segmentation with projective convolutional networks. In CVPR, 2017. 2
[11] D. Kingma and J. Ba. Adam: A method for stochastic optimization. In ICLR, 2015. 6, 7
[12] R. Klokov and V. Lempitsky. Escape from cells: Deep kd-networks for the recognition of 3D point cloud models. In ICCV, 2017. 2
[13] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, 2012. 1
[14] J. Li, K. Xu, S. Chaudhuri, E. Yumer, H. Zhang, and L. Guibas. GRASS: Generative recursive autoencoders for shape structures. In SIGGRAPH, 2017. 2
[15] M. M. Loper and M. J. Black. OpenDR: An approximate differentiable renderer. In ECCV, 2014. 2
[16] S. Marschner and P. Shirley. Fundamentals of Computer Graphics. CRC Press, 2015. 2, 3, 4
[17] D. Maturana and S. Scherer. VoxNet: A 3D convolutional neural network for real-time object recognition. In IROS, 2015. 1, 2
[18] A. Mordvintsev, C. Olah, and M. Tyka. Inceptionism: Going deeper into neural networks. Google Research Blog, 2015. 2, 3, 4, 5
[19] C. R. Qi, H. Su, K. Mo, and L. J. Guibas. PointNet: Deep learning on point sets for 3D classification and segmentation. In CVPR, 2017. 2
[20] C. R. Qi, H. Su, M. Nießner, A. Dai, M. Yan, and L. Guibas. Volumetric and multi-view CNNs for object classification on 3D data. In CVPR, 2016. 1, 2
[21] C. R. Qi, L. Yi, H. Su, and L. J. Guibas. PointNet++: Deep hierarchical feature learning on point sets in a metric space. In NIPS, 2017. 2
[22] A. Radford, L. Metz, and S. Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. In ICLR, 2016. 4
[23] D. J. Rezende, S. A. Eslami, S. Mohamed, P. Battaglia, M. Jaderberg, and N. Heess. Unsupervised learning of 3D structure from images. In NIPS, 2016. 2
[24] G. Riegler, A. O. Ulusoys, and A. Geiger. OctNet: Learning deep 3D representations at high resolutions. In CVPR, 2017. 1, 2
[25] A. Saxena, S. H. Chung, and A. Y. Ng. 3-D depth reconstruction from a single still image. IJCV, 76(1):53–69, 2008. 2
[26] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015. 5, 7
[27] H. Su, S. Maji, E. Kalogerakis, and E. Learned-Miller. Multi-view convolutional neural networks for 3D shape recognition. In ICCV, 2015. 2
[28] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In CVPR, 2015. 7
[29] M. Tatarchenko, A. Dosovitskiy, and T. Brox. Multi-view 3D models from single images with a convolutional network. In ECCV, 2016. 2
[30] M. Tatarchenko, A. Dosovitskiy, and T. Brox. Octree generating networks: Efficient convolutional architectures for high-resolution 3D outputs. In ICCV, 2017. 1, 2
[31] S. Tulsiani, T. Zhou, A. A. Efros, and J. Malik. Multi-view supervision for single-view reconstruction via differentiable ray consistency. In CVPR, 2017. 1, 2, 5
[32] P.-S. Wang, Y. Liu, Y.-X. Guo, C.-Y. Sun, and X. Tong. O-CNN: Octree-based convolutional neural networks for 3D shape analysis. In SIGGRAPH, 2017. 2
[33] R. J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3-4):229–256, 1992. 2
[34] J. Wu, C. Zhang, T. Xue, B. Freeman, and J. Tenenbaum. Learning a probabilistic latent space of object shapes via 3D generative-adversarial modeling. In NIPS, 2016. 1, 2
[35] Z. Wu, S. Song, A. Khosla, F. Yu, L. Zhang, X. Tang, and J. Xiao. 3D ShapeNets: A deep representation for volumetric shapes. In CVPR, 2015. 1, 2
[36] X. Yan, J. Yang, E. Yumer, Y. Guo, and H. Lee. Perspective transformer nets: Learning single-view 3D object reconstruction without 3D supervision. In NIPS, 2016. 1, 2, 4, 5, 6, 7
[37] J. Yang, S. E. Reed, M.-H. Yang, and H. Lee. Weakly-supervised disentangling with recurrent transformations for 3D view synthesis. In NIPS, 2015. 5
[38] L. Yi, H. Su, X. Guo, and L. Guibas. SyncSpecCNN: Synchronized spectral CNN for 3D shape segmentation. In CVPR, 2017. 2
[39] C. Zou, E. Yumer, J. Yang, D. Ceylan, and D. Hoiem. 3D-PRNN: Generating shape primitives with recurrent neural networks. In ICCV, 2017. 2

Appendix A. Additional results
Figure 9 and Figure 10 show additional results of 3D reconstruction. Figure 11, Figure 12, Figure 13, and Figure 14 show additional results of style transfer.

Figure 9. 3D mesh reconstruction from a single image. The leftmost images are the inputs. Results are rendered from six viewpoints.

Figure 10. 3D mesh reconstruction from a single image. The leftmost images are the inputs. Results are rendered from six viewpoints.

Figure 11. Additional results of style transfer. The style images are
Self-Portrait (A. Bailly, 1917),
Thomson No. 5 (Yellow Sunset) (D. Coupland, 2011),
The Tower of Babel (P. Bruegel the Elder, 1563),
Jesuits III (L. Feininger, 1915),
Ritmo plastico del 14 luglio (S. Gino, 1913),
The Starry Night (V. van Gogh, 1889), and
Portrait of Pablo Picasso (J. Gris, 1912).

Figure 12. Additional results of style transfer. The style images are
The Great Wave off Kanagawa (Hokusai, 1829-1832),
The Trial (W. Lettl, 1981),
Bicentennial Print (R. Lichtenstein, 1975),
Portrait of a Friend (M. H. Maxy, 1926),
The Scream (E. Munch, 1910),
Femme nue assise (P. Picasso, 1909), and
Sketch [9].

Figure 13. Additional results of style transfer. The style images are
Self-Portrait (A. Bailly, 1917),
Thomson No. 5 (Yellow Sunset) (D. Coupland, 2011),
The Tower of Babel (P. Bruegel the Elder, 1563),
Jesuits III (L. Feininger, 1915),
Ritmo plastico del 14 luglio (S. Gino, 1913),
The Starry Night (V. van Gogh, 1889), and
Portrait of Pablo Picasso (J. Gris, 1912).

Figure 14. Additional results of style transfer. The style images are
The Great Wave off Kanagawa (Hokusai, 1829-1832),
The Trial (W. Lettl, 1981),
Bicentennial Print (R. Lichtenstein, 1975),
Portrait of a Friend (M. H. Maxy, 1926),
The Scream (E. Munch, 1910),
Femme nue assise (P. Picasso, 1909), and