3D Morphable Models as Spatial Transformer Networks
Anil Bas*, Patrik Huber†, William A. P. Smith*, Muhammad Awais†, Josef Kittler†
*Department of Computer Science, University of York, UK
†Centre for Vision, Speech and Signal Processing, University of Surrey, UK
{ab1792,william.smith}@york.ac.uk, {p.huber,m.a.rana,j.kittler}@surrey.ac.uk

Abstract
In this paper, we show how a 3D Morphable Model (i.e. a statistical model of the 3D shape of a class of objects such as faces) can be used to spatially transform input data as a module (a 3DMM-STN) within a convolutional neural network. This is an extension of the original spatial transformer network in that we are able to interpret and normalise 3D pose changes and self-occlusions. The trained localisation part of the network is independently useful since it learns to fit a 3D morphable model to a single image. We show that the localiser can be trained using only simple geometric loss functions on a relatively small dataset, yet is able to perform robust normalisation on highly uncontrolled images including occlusion, self-occlusion and large pose changes.
1. Introduction
Convolutional neural networks (CNNs) are usually trained with such large amounts of data that they can learn invariance to scale, translation, in-plane rotation and, to a certain degree, out-of-plane rotations, without using any explicit geometric transformation model. However, most networks do require a rough bounding box estimate as input and do not work for larger variations. Recently, Jaderberg et al. [14] proposed the Spatial Transformer Network (STN), a module that can be incorporated into a neural network architecture, giving the network the ability to explicitly account for the effects of pose and nonrigid deformations (which we refer to simply as "pose"). An STN explicitly estimates pose and then resamples a specific part of the input image to a fixed-size output image. It is thus able to work on inputs with larger translation and pose variation in general, since it can explicitly compensate for it, and feed a transformed region of interest to the subsequent neural network layers. By exploiting and "hard-coding" knowledge of geometric transformation, the amount of training data and the required complexity of the network can be vastly reduced.

In this paper, we show how to use a 3D morphable model as a spatial transformer network (we refer to this as a 3DMM-STN). In this setting, the locations in the input image that are resampled are determined by the 2D projection of a 3D deformable mesh. Hence, our 3DMM-STN estimates both 3D shape and pose. This allows us to explicitly estimate and account for 3D rotations as well as self-occlusions. The output of our 3DMM-STN is a resampled image in a flattened 2D texture space in which the images are in dense, pixel-wise correspondence. Hence, this output can be fed to subsequent CNN layers for further processing. We focus on face images and use a 3D morphable face model [4, 19], though our idea is general and could be applied to any object for which a statistical 3D shape model is available (though note that the loss functions proposed in Sections 3.1 and 3.2 do assume that the object is bilaterally symmetric). We release source code for our 3DMM-STN in the form of new layers for the MatConvNet toolbox [30]; the source code is available at https://github.com/anilbas/3DMMasSTN.

In many applications, the processes of pose normalisation and object recognition are disjoint. For example, in the breakthrough deep learning face recognition paper DeepFace, Taigman et al. [27] use a 3D mean face as preprocessing, before feeding the pose-normalised image to a CNN.
Spatial transformers
The original STN [14] aimed to combine these two processes into a single network that is trainable end to end. The localiser network estimated a 2D affine transformation that was applied to the regular output grid, meaning the network could only learn a fairly restricted space of transformations. Jaderberg et al. [14] also proposed the concept of a 3D transformer, which takes 3D voxel data as input, applies 3D rotation and translation, and outputs a 2D projection of the transformed data. Working with 3D (volumetric) data removes the need to model occlusion or camera projection parameters. In contrast, we work with regular 2D input and output images but transform them via a 3D model.

A number of subsequent works were inspired by the original STN. Yan et al. [31] use an encoder-decoder architecture in which the encoder estimates a 3D volumetric shape from an image and is trained by combining with a decoder which uses a perspective transformer network to compute a 2D silhouette loss. Handa et al. [11] present the gvnn (Geometric Vision with Neural Networks) toolbox that, like in this paper, has layers that explicitly implement 3D geometric transformations. However, their goal is very different to ours. Rather than learning to fit a statistical shape model, they seek to use 3D transformations in low level vision tasks such as relative pose estimation. Chen et al. [6] use a spatial transformer that applies a 2D similarity transform as part of an end to end network for face detection. Henriques and Vedaldi [12] apply a spatial warp prior to convolutions such that the convolution result is invariant to a class of two-parameter spatial transformations. Like us, Yu et al. [32] incorporate a parametric shape model, though their basis is 2D (and trainable), models only sparse shape and combines pose and shape into a single basis. They use a second network to locally refine position estimates and train end to end to perform landmark localisation. Bhagavatula et al. [3] fit a generic 3D face model and estimate face landmarks, before warping the projected face model to better fit the landmarks. They estimate 2D landmarks in a 3D-aware fashion, though they require known landmarks for training.

Analysis-by-synthesis
Our localiser learns to fit a 3DMM to a single image. This task has traditionally been posed as a problem of analysis-by-synthesis and solved by optimisation. The original method [4] used stochastic gradient descent to minimise an appearance error, regularised by statistical priors. Subsequent work used a more complex feature-based objective function [23] and the state-of-the-art method uses Markov Chain Monte Carlo for probabilistic image interpretation [24].
Supervised CNN regression
Analysis-by-synthesis approaches are computationally expensive, prone to convergence on local minima and fragile when applied to in-the-wild images. For this reason, there has been considerable recent interest in using CNNs to directly regress 3DMM parameters from images. The majority of such work is based on supervised learning. Jourabloo and Liu [15] fit a 3DMM to detected landmarks and then train a CNN to directly regress the fitted pose and shape parameters. Trần et al. [29] use a recent multi-image 3DMM fitting algorithm [20] to obtain pooled 3DMM shape and texture parameters (i.e. the same parameters for all images of the same subject). They then train a CNN to directly regress these parameters from a single image. They do not estimate pose and hence do not compute an explicit correspondence between the model and image. Kim et al. [16] go further by also regressing illumination parameters (effectively performing inverse rendering), though they train on synthetic, rendered images (using a breeding process to increase diversity). They estimate a 3D rotation but rely on precisely cropped input images such that scale and translation are implicit. Richardson et al. [21] also train on synthetic data, though they use an iteratively applied network architecture and a shape-from-shading refinement step to improve the geometry. Jackson et al. [13] regress shape directly using a volumetric representation.

The DenseReg [10] approach uses fully convolutional networks to directly compute dense correspondence between a 3D model and a 2D image. The network does not explicitly estimate or model 3D pose or shape (though these are implied by the correspondence) and is trained by using manually annotated 2D landmarks to warp a 3D template onto the training images (providing the supervision). Sela et al. [25] also use a fully convolutional network to predict correspondence and also depth. They then merge the model-based and data-driven geometries for improved quality.

The weakness of all of these supervised approaches is that they require labelled training data (i.e. images with fitted morphable model parameters). If the images are real world images then the parameters must come from an existing fitting algorithm, in which case the best the CNN can do is learn to replicate the performance of an existing algorithm. If the images are synthetic with known ground truth parameters then the performance of the CNN on real world input is limited by the realism and variability present in the synthetic images. Alternatively, we must rely on 3D supervision provided by multiview or RGBD images, in which case the available training data is vastly reduced.
Unsupervised CNN regression
Richardson et al. [22] take a step towards removing the need for labels by presenting a semi-supervised approach. They still rely on supervised training for learning 3DMM parameter regression but then refine the coarse 3DMM geometry using a second network that is trained in an unsupervised manner. Very recently, Tewari et al. [28] presented MoFA, a completely unsupervised approach for training a CNN to regress 3DMM parameters, pose and illumination using an autoencoder architecture. The regression is done by the encoder CNN. The decoder then uses a hand-crafted differentiable renderer to synthesise an image. The unsupervised loss is the error between the rendered image and the input, with convergence aided by losses for priors and landmarks. Note that the decoder is exactly equivalent to the differentiable cost function used in classical analysis-by-synthesis approaches. Presumably, the issues caused by the non-convexity of this cost function are reduced in a CNN setting since the gradient is averaged over many images.

Figure 1. Overview of the 3DMM-STN. The localiser predicts 3DMM shape parameters and pose. The grid generator projects the 3D geometry to 2D. The bilinear sampler resamples the input image to a regular output grid which is then masked by an occlusion mask computed from the estimated 3D geometry.

While the ability of [28] to learn from unlabelled data is impressive, there are a number of limitations. The complexity required to enable the hand-crafted decoder to produce photorealistic images of any face under arbitrary real world illumination, captured by a camera with arbitrary geometric and photometric properties, is huge. Arguably, this has not yet been achieved in computer graphics. Moreover, the 3DMM texture should only capture intrinsic appearance parameters such as diffuse and specular albedo (or even spectral quantities to ensure independence from the camera and lighting). Such a model is not currently available.
In this paper we propose a purely geometric approach in which only the shape component of a 3DMM is used to geometrically normalise an image. Unlike [10, 13, 15, 16, 21, 25, 29], our method can be trained in an unsupervised fashion, and thus does not depend on synthetic training data or the fitting results of an existing algorithm. In contrast to [28], we avoid the complexity and potential fragility of having to model illumination and reflectance parameters. Moreover, our 3DMM-STN can form part of a larger network that performs a face processing task and is trained end to end. Finally, in contrast to all previous 3DMM fitting networks, the output of our 3DMM-STN is a 2D resampling of the original image which contains all of the high frequency, discriminating detail in a face, rather than a model-based reconstruction which only captures the gross, low frequency aspects of appearance that can be explained by a 3DMM.
2. 3DMM-STN
Our proposed 3DMM-STN has the same components as a conventional STN; however, each component must be modified to incorporate the statistical shape model, 3D transformation and projection, and self-occlusion. In this section we describe each component of a 3DMM-STN and the layers that are required to construct it. We show an overview of our architecture in Figure 1.
The localiser network is a CNN that takes an image as input and regresses the pose and shape parameters, $\theta$, of the face in the image. Specifically, we predict the following vector of parameters:

$$\theta = (\underbrace{\mathbf{r}, \mathbf{t}, \log s}_{\text{pose}},\ \underbrace{\boldsymbol{\alpha}}_{\text{shape}}). \quad (1)$$

Here, $\mathbf{t} \in \mathbb{R}^2$ is a 2D translation and $\mathbf{r} \in \mathbb{R}^3$ is an axis-angle representation of a 3D rotation with rotation angle $\|\mathbf{r}\|$ and axis $\mathbf{r}/\|\mathbf{r}\|$. Since scale must be positive, we estimate log scale and later pass this through an exponentiation layer, ensuring that the estimated scale is positive. The shape parameters $\boldsymbol{\alpha} \in \mathbb{R}^D$ are the principal component weights used to reconstruct the shape.

For our localiser network, we use the pretrained VGG-Faces [18] architecture, delete the classification layer and add a new fully connected layer with $6 + D$ outputs. The weights for the new layer are randomly initialised but scaled so that the elements of the axis-angle vector are in the range $[-\pi, \pi]$ for typical inputs. The whole localiser is then fine-tuned as part of the subsequent training.

In contrast to a conventional STN, the warped sampling grid is not obtained by applying a global transformation to the regular output grid. Instead, we apply a 3D transformation and projection to a 3D mesh that comes from the morphable model. The intensities sampled from the source image are then assigned to the corresponding points in a flattened 2D grid. For this reason, the grid generator network in a 3DMM-STN is more complex than in a conventional STN, although we emphasise that it remains differentiable and hence suitable for use in end to end training. The sample points in our grid generator are determined by the transformation parameters $\theta$ estimated by the localiser network. Our grid generator combines a linear statistical model with a scaled orthographic projection, as shown in Figure 2. Note that we could alternatively use a perspective projection (modifying the localiser to predict a 3D translation as well as camera parameters such as focal length). However, recent results show that interpreting face shape under perspective is ambiguous [2, 26] and so we use the more restrictive orthographic model here. We now describe the transformation applied by each layer in the grid generator and provide derivatives.

Figure 2. The grid generator network within a 3DMM-STN.
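Before detailing each layer, the following NumPy sketch chains the five grid-generator stages in a single forward pass. It is a minimal illustration under our own assumptions (the function names, the ordering of $\theta$ from Eq. (1), and the $(x_1, y_1, z_1, x_2, \ldots)$ stacking of the model vector are ours; $P$ and $\boldsymbol{\mu}$ would come from the morphable model):

```python
import numpy as np

def rodrigues(r):
    """Axis-angle vector r (3,) to a rotation matrix via the Rodrigues formula."""
    theta = np.linalg.norm(r)
    if theta < 1e-12:
        return np.eye(3)
    rb = r / theta                          # unit rotation axis
    K = np.array([[0, -rb[2], rb[1]],
                  [rb[2], 0, -rb[0]],
                  [-rb[1], rb[0], 0]])      # cross-product matrix [r]_x
    return np.cos(theta) * np.eye(3) + np.sin(theta) * K \
        + (1 - np.cos(theta)) * np.outer(rb, rb)

def grid_generator(theta, P, mu):
    """Map localiser output theta = (r, t, log s, alpha) to 2D sample points (2 x N)."""
    r, t, log_s, alpha = theta[:3], theta[3:5], theta[5], theta[6:]
    x = P @ alpha + mu                      # 3DMM layer: linear model, x in R^{3N}
    X = x.reshape(3, -1, order='F')         # 3 x N vertex matrix
    Xp = rodrigues(r) @ X                   # 3D rotation layer
    Y = Xp[:2, :]                           # orthographic projection along z
    return np.exp(log_s) * Y + t[:, None]   # scale (exp keeps s > 0) and translate
```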
3D morphable model layer
The 3D morphable model layer generates a shape $X \in \mathbb{R}^{3 \times N}$ comprising $N$ 3D vertices by taking a linear combination of $D$ basis shapes (principal components) stored in the matrix $P \in \mathbb{R}^{3N \times D}$ and the mean shape $\boldsymbol{\mu} \in \mathbb{R}^{3N}$ according to shape parameters $\boldsymbol{\alpha} \in \mathbb{R}^D$:

$$X(\boldsymbol{\alpha})_{i,j} = x(\boldsymbol{\alpha})_{3(j-1)+i}, \quad i \in [1, 3], \; j \in [1, N],$$

where $x(\boldsymbol{\alpha}) = P\boldsymbol{\alpha} + \boldsymbol{\mu}$ and the derivatives are given by:

$$\frac{\partial x}{\partial \boldsymbol{\alpha}} = P, \quad \frac{\partial X_{i,j}}{\partial \alpha_k} = P_{3(j-1)+i,k}.$$

Note that such a linear model is exactly equivalent to a fully connected layer (and hence a special case of a convolutional layer) with fixed weights and biases. This is not at all surprising since a linear model is exactly what is implemented by a single layer linear decoder. In this interpretation, the shape parameters play the role of the input map, the principal components the role of weights and the mean shape the role of biases. This means that this layer can be implemented using an existing implementation of a convolution layer and also, following our later suggestion for future work, that the model could itself be made trainable simply by having a non-zero learning rate for the convolution layer.

In our network, we use some of the principal components to represent shape variation due to identity and the remainder to represent deformation due to expression. We assume that expressions are additive and we can thus combine the two into a single linear model. Note that the shape parameters relating to identity may contain information that is useful for recognition, so these could be incorporated into a descriptor in a recognition network after the STN.
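To make the fully-connected-layer equivalence concrete, here is a minimal PyTorch sketch (PyTorch rather than the paper's MatConvNet; all sizes and tensor values are illustrative stand-ins, not the real model):

```python
import torch
import torch.nn as nn

D, N = 10, 1000                    # illustrative basis size and vertex count
P = torch.randn(3 * N, D)          # stand-in for the (3N x D) principal components
mu = torch.randn(3 * N)            # stand-in for the (3N,) mean shape

mm_layer = nn.Linear(D, 3 * N)     # x = W alpha + b
mm_layer.weight.data = P           # principal components act as fixed weights
mm_layer.bias.data = mu            # mean shape acts as the bias
for p in mm_layer.parameters():    # freeze: the model is fixed in our architecture;
    p.requires_grad_(False)        # a non-zero learning rate would make it trainable

alpha = torch.zeros(D)
X = mm_layer(alpha).reshape(-1, 3).T   # (3 x N) vertices; alpha = 0 gives the mean shape
```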
Axis-angle to rotation matrix layer
This layer converts an axis-angle representation of a rotation, $\mathbf{r} \in \mathbb{R}^3$, into a rotation matrix:

$$R(\mathbf{r}) = \cos\theta\, I + \sin\theta\, [\bar{\mathbf{r}}]_\times + (1 - \cos\theta)\, \bar{\mathbf{r}}\bar{\mathbf{r}}^T,$$

where $\theta = \|\mathbf{r}\|$, $\bar{\mathbf{r}} = \mathbf{r}/\|\mathbf{r}\|$ and

$$[\mathbf{a}]_\times = \begin{bmatrix} 0 & -a_3 & a_2 \\ a_3 & 0 & -a_1 \\ -a_2 & a_1 & 0 \end{bmatrix}$$

is the cross product matrix. The derivatives are given by [8]:

$$\frac{\partial R}{\partial r_i} = \begin{cases} [\mathbf{e}_i]_\times & \text{if } \mathbf{r} = \mathbf{0}, \\ \dfrac{r_i [\mathbf{r}]_\times + \left[\mathbf{r} \times (I - R(\mathbf{r}))\mathbf{e}_i\right]_\times}{\|\mathbf{r}\|^2}\, R(\mathbf{r}) & \text{otherwise}, \end{cases}$$

where $\mathbf{e}_i$ is the $i$th vector of the standard basis in $\mathbb{R}^3$.
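A direct NumPy transcription of this derivative formula from [8] may be useful when implementing the backward pass; the helper names are ours and the forward rotation reuses the rodrigues sketch above:

```python
import numpy as np

def cross_matrix(a):
    """[a]_x such that cross_matrix(a) @ b == np.cross(a, b)."""
    return np.array([[0, -a[2], a[1]],
                     [a[2], 0, -a[0]],
                     [-a[1], a[0], 0]])

def dR_dri(r, R, i):
    """Derivative of R(r) w.r.t. the i-th axis-angle component (Gallego & Yezzi [8])."""
    e_i = np.eye(3)[:, i]
    n2 = np.dot(r, r)
    if n2 < 1e-24:                       # limit at r = 0
        return cross_matrix(e_i)
    num = r[i] * cross_matrix(r) + cross_matrix(np.cross(r, (np.eye(3) - R) @ e_i))
    return (num / n2) @ R
```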
3D rotation layer
The rotation layer takes as input a rotation matrix $R$ and $N$ 3D points $X \in \mathbb{R}^{3 \times N}$ and applies the rotation:

$$X'(R, X) = RX,$$

$$\frac{\partial X'_{i,j}}{\partial R_{i,k}} = X_{k,j}, \quad \frac{\partial X'_{i,j}}{\partial X_{k,j}} = R_{i,k}, \quad i, k \in [1, 3], \; j \in [1, N].$$

Orthographic projection layer
The orthographic projection layer takes as input a set of $N$ 3D points $X' \in \mathbb{R}^{3 \times N}$ and outputs $N$ 2D points $Y \in \mathbb{R}^{2 \times N}$ by applying an orthographic projection along the $z$ axis:

$$Y(X') = PX', \quad P = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \end{bmatrix}, \quad \frac{\partial Y_{i,j}}{\partial X'_{i,j}} = 1, \quad i \in [1, 2], \; j \in [1, N].$$

Scaling
The log scale estimated by the localiser is first transformed to scale by an exponentiation layer:

$$s(\log s) = \exp(\log s), \quad \frac{\partial s}{\partial \log s} = \exp(\log s).$$

Then, the 2D points $Y \in \mathbb{R}^{2 \times N}$ are scaled:

$$Y'(s, Y) = sY, \quad \frac{\partial Y'_{i,j}}{\partial s} = Y_{i,j}, \quad \frac{\partial Y'_{i,j}}{\partial Y_{i,j}} = s.$$

Translation
Finally, the 2D sample points are generated by adding a 2D translation $\mathbf{t} \in \mathbb{R}^2$ to each of the scaled points:

$$Y''(\mathbf{t}, Y') = Y' + \mathbf{1}_N \otimes \mathbf{t}, \quad \frac{\partial Y''_{i,j}}{\partial t_i} = 1, \quad \frac{\partial Y''_{i,j}}{\partial Y'_{i,j}} = 1,$$

where $\mathbf{1}_N$ is the row vector of length $N$ containing ones and $\otimes$ is the Kronecker product.

Figure 3. The output grid of our 3DMM-STN: a Tutte embedding of the mean shape of the Basel Face Model. On the left we show a visualisation using the mean texture (though note that our 3DMM-STN does not use a texture model). On the right we show the mean shape as a geometry image [9].

In the original STN, the sampler component used bilinear sampling to sample values from the input image and transform them to an output grid. We make a number of modifications. First, the output grid is a texture space flattening of the 3DMM mesh. Second, the bilinear sampler layer will incorrectly sample parts of the face onto vertices that are self-occluded, so we introduce additional layers that calculate which vertices are occluded and mask the sampled image appropriately.
Output grid
The purpose of an STN is to transform an input image into a canonical, pose-normalised view. In the context of a 3D model, one could imagine a number of analogous ways that an input image could be normalised. For example, the output of the STN could be a rendering of the mean face shape in a frontal pose with the sampled texture on the mesh. Instead, we choose to output sampled textures in a 2D embedding obtained by flattening the mean shape of the 3DMM. This ensures that the output image is approximately area uniform with respect to the mean shape and also that the whole output image contains face information.

Specifically, we compute a Tutte embedding [7] using conformal Laplacian weights and with the mesh boundary mapped to a square. To ensure a symmetric embedding we map the symmetry line to the symmetry line of the square, flatten only one side of the mesh and obtain the flattening of the other half by reflection. We show a visualisation of our embedding using the mean texture in Figure 3. In order that the output warped image produces a regularly sampled image, we regularly re-sample (i.e. re-mesh) the 3DMM (mean and principal components) over a uniform grid of size $H' \times W'$ in this flattened space. This effectively makes our 3DMM a deformable geometry image [9]. The re-sampled 3DMM that we use in our STN therefore has $N = H'W'$ vertices and each vertex $i$ has an associated UV coordinate $(x_i^t, y_i^t)$. The corresponding sample coordinate produced by the grid generator is given by $(x_i^s, y_i^s) = (Y''_{1,i}, Y''_{2,i})$.
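A small sketch of the geometry-image indexing may clarify the vertex-to-UV convention; the row-major enumeration and 1-indexing here are our assumptions:

```python
import numpy as np

def uv_grid(Hp, Wp):
    """UV (target) coordinates for an H' x W' geometry-image re-sampling of the 3DMM.
    Vertex i is the row-major index into the grid; returns (2 x N), N = H'*W'."""
    ys, xs = np.meshgrid(np.arange(1, Hp + 1), np.arange(1, Wp + 1), indexing='ij')
    return np.stack([xs.ravel(), ys.ravel()])   # (x_i^t, y_i^t) for each vertex i
```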
Bilinear sampling

We use bilinear sampling, exactly as in the original STN, such that the re-sampled image $V_i^c$ at location $(x_i^t, y_i^t)$ in colour channel $c$ is given by:

$$V_i^c = \sum_{j=1}^{H} \sum_{k=1}^{W} I_{jk}^c \max(0, 1 - |x_i^s - k|) \max(0, 1 - |y_i^s - j|),$$

where $I_{jk}^c$ is the value in the input image at pixel $(j, k)$ in colour channel $c$, and $I$ has height $H$ and width $W$. This bilinear sampling is differentiable (see [14] for derivatives) and so the loss can be backpropagated through the sampler and back into the grid generator.
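A minimal NumPy sketch of this sum, exploiting the fact that only the four neighbouring pixels have non-zero weight (the function name and 1-indexed pixel convention are ours):

```python
import numpy as np

def bilinear_sample(I, xs, ys):
    """Sample image I (H x W x C) at continuous 1-indexed locations (xs, ys);
    each weight term is max(0, 1 - |.|), as in the equation above."""
    H, W, C = I.shape
    x0, y0 = np.floor(xs).astype(int), np.floor(ys).astype(int)
    V = np.zeros((len(xs), C))
    for dj in (0, 1):                      # the two rows with non-zero weight
        for dk in (0, 1):                  # the two columns with non-zero weight
            j, k = y0 + dj, x0 + dk
            w = np.maximum(0, 1 - np.abs(xs - k)) * np.maximum(0, 1 - np.abs(ys - j))
            valid = (j >= 1) & (j <= H) & (k >= 1) & (k <= W)
            V[valid] += w[valid, None] * I[j[valid] - 1, k[valid] - 1, :]
    return V
```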
Self-occlusions

Since the 3DMM produces a 3D mesh, parts of the mesh may be self-occluded. The occluded vertices can be computed exactly using ray-tracing or z-buffering, or they can be precomputed and stored in a lookup table. For efficiency, we approximate occlusion by only computing which vertices have backward facing normals. This approximation would be exact for any object that is globally convex. For objects with concavities, the approximation will underestimate the set of occluded vertices. Faces are typically concave around the eyes, the nose boundary and the mouth interior, but we find that typically only around 5% of vertices are mislabelled and the accuracy is sufficient for our purposes.

This layer takes as input the rotation matrix $R$ and the shape parameters $\boldsymbol{\alpha}$ and outputs a binary occlusion mask $M \in \{0, 1\}^{H' \times W'}$. The occlusion function is binary and hence not differentiable at points where the visibility of a vertex changes; everywhere else the gradient is zero. Hence, we simply pass back zero gradients:

$$\frac{\partial M}{\partial \boldsymbol{\alpha}} = \mathbf{0}, \quad \frac{\partial M}{\partial R} = \mathbf{0}.$$

Note that this means that the network is not able to learn how changes in occlusion help to reduce the loss. Occlusions are applied in a forward pass but changes in occlusion do not backpropagate.
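A sketch of the backface-culling approximation, assuming per-vertex normals are available in model coordinates (the camera looking down $-z$ is our sign convention, not stated in the paper):

```python
import numpy as np

def visibility_mask(R, vnormals, Hp, Wp):
    """Backface-culling approximation to self-occlusion.
    vnormals: (3 x N) per-vertex normals in model coordinates, N = H'*W'.
    A vertex is marked visible if its rotated normal faces the camera."""
    n_rot = R @ vnormals                   # rotate normals with the estimated pose
    visible = n_rot[2, :] < 0              # backward facing normals are occluded
    return visible.reshape(Hp, Wp).astype(np.uint8)   # binary mask in {0,1}^{H' x W'}
```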
Masking layer
The final layer in the sampler combines the sampled image and the visibility map via pixel-wise products:

$$W_i^c = V_i^c M_{x_i^t, y_i^t}, \quad \frac{\partial W_i^c}{\partial V_i^c} = M_{x_i^t, y_i^t}, \quad \frac{\partial W_i^c}{\partial M_{x_i^t, y_i^t}} = V_i^c.$$
3. Geometric losses for localiser training
An STN is usually inserted into a network as a preprocessor of input images and its output is then passed to a classification or regression CNN. Hence, the pose normalisation that is learnt by the STN is the one that produces optimal performance on the subsequent task. In the context of a 3D morphable face model, an obvious task would be face recognition. While this is certainly worth pursuing, we have observed that the optimal normalisation for recognition may not correspond to the correct model-image correspondence one would expect. For example, if context provided by hair and clothing helps with recognition, then the 3DMM-STN may learn to sample this. Instead, we show that it is possible to train an STN to perform accurate localisation using only some simple geometric priors, without even requiring identity labels for the images. We describe these geometric loss functions in the following sections.

Figure 4. Siamese multiview loss. An image and its horizontal reflection yield two sampled images. We penalise differences in these two images.

3.1. Bilateral symmetry loss
Faces are approximately bilaterally symmetric. Ignoring the effects of illumination, this means that we expect sampled face textures to be approximately bilaterally symmetric. We can define a loss that measures asymmetry of the sampled texture over visible pixels:

$$\ell_{\text{sym}} = \sum_{i=1}^{N} \sum_{c=1}^{3} M_{x_i^t, y_i^t} M_{x_{\text{sym}(i)}^t, y_{\text{sym}(i)}^t} \left( V_i^c - V_{\text{sym}(i)}^c \right)^2, \quad (2)$$

where $V_{\text{sym}(i)}^c$ is the value in the resampled image at location $(W' + 1 - x_i^t, y_i^t)$.
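Because the output grid is regular, Eq. (2) reduces to a masked comparison with a horizontally flipped copy of the sampled texture. A minimal NumPy sketch under that assumption:

```python
import numpy as np

def symmetry_loss(V, M):
    """Bilateral symmetry loss on the sampled texture V (H' x W' x 3) with
    binary visibility mask M (H' x W'); sym(i) reflects the grid horizontally."""
    V_sym = V[:, ::-1, :]                  # reflect columns: x -> W' + 1 - x
    M_sym = M[:, ::-1]
    return np.sum(M[..., None] * M_sym[..., None] * (V - V_sym) ** 2)
```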
Figure 5. Landmark loss. Left: the diagram shows the implementation of the regression layer that computes the Euclidean distance between selected 2D points and ground truth positions. Right: predicted positions are in red and landmark positions are in green.

Figure 6. Overview of the 3DMM-STN. From left to right: input image; rendering of estimated shape in estimated pose; sampled image; occlusion mask; final output of 3DMM-STN.
3.2. Siamese multi-view fitting loss

If we have multiple images of the same face in different poses (or equivalently from different viewpoints), then we expect that the sampled textures will be equal (again, neglecting the effects of illumination). If we had such multiview images, this would allow us to perform Siamese training where a pair of images in different poses were sampled into images $V_i^c$ and $W_i^c$ with visibility masks $M$ and $N$, giving a loss:

$$\ell_{\text{multiview}} = \sum_{i=1}^{N} \sum_{c=1}^{3} M_{x_i^t, y_i^t} N_{x_i^t, y_i^t} \left( V_i^c - W_i^c \right)^2. \quad (3)$$

Ideally, this loss would be used with a multiview face database or even a face recognition database where images of the same person in different in-the-wild conditions are present. We use an even simpler variant which does not require multiview images, again based on the bilateral symmetry assumption. A horizontal reflection of a face image approximates what that face would look like in a reflected pose. Hence, we perform Siamese training on an input image and its horizontal reflection (see Figure 4). This is different to the bilateral symmetry loss and is effectively encouraging the localiser to behave symmetrically.

3.3. Landmark loss

As has been observed elsewhere [28], convergence of the training can be speeded up by introducing surrogate loss functions that provide supervision in the form of landmark locations. It is straightforward to add a landmark loss to our network. First, we define a selection layer that selects $L < N$ landmarks from the $N$ 2D points output by the grid generator:

$$\mathbf{L} = Y''S, \quad (4)$$

where $S \in \{0, 1\}^{N \times L}$ is a selection matrix with $S^TS = I_L$. Given $L$ landmark locations $\mathbf{l}_1, \ldots, \mathbf{l}_L$ and associated detection confidence values $c_1, \ldots, c_L$, we compute a weighted Euclidean loss:

$$\ell_{\text{landmark}} = \sum_{i=1}^{L} c_i \| \mathbf{L}_i - \mathbf{l}_i \|^2. \quad (5)$$

Landmarks that are not visible (i.e. were not hand-labelled or detected) are simply assigned zero confidence.

Figure 7. 3DMM-STN output for multiple images of the same person in different poses.

3.4. Statistical prior loss

The statistical shape model provides a prior. We scale the shape basis vectors such that the shape parameters follow a standard multivariate normal distribution: $\boldsymbol{\alpha} \sim \mathcal{N}(\mathbf{0}, I_D)$. Hence, the statistical prior can be encoded by the following loss function:

$$\ell_{\text{prior}} = \| \boldsymbol{\alpha} \|^2. \quad (6)$$
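The landmark and prior losses transcribe directly into code; a minimal NumPy sketch (function names are ours):

```python
import numpy as np

def landmark_loss(Ypp, S, landmarks, conf):
    """Weighted Euclidean landmark loss, Eqs. (4)-(5). Ypp: (2 x N) grid-generator
    output, S: (N x L) binary selection matrix, landmarks: (2 x L) detections,
    conf: (L,) confidences, zero for invisible landmarks."""
    L = Ypp @ S                                      # select the L model landmarks
    return np.sum(conf * np.sum((L - landmarks) ** 2, axis=0))

def prior_loss(alpha):
    """Statistical prior, Eq. (6): alpha is scaled to be standard normal,
    so we penalise its squared norm."""
    return np.sum(alpha ** 2)
```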
4. Experiments
For our statistical shape model, we use $D = 10$ dimensions, of which five are the first five (identity) principal components from the Basel Face Model [19]. The other five are expression components which come from FaceWarehouse [5] using the correspondence to the Basel Model provided by [33]. We re-mesh the Basel Model over a uniform grid. We trained our 3DMM-STN with the four loss functions described in Section 3 using the AFLW database [17]. This provides up to 21 landmarks per subject for over 25k in-the-wild images. This is a relatively small dataset for training a deep network, so we perform 'fine-tuning' by setting the learning rate on the last layer of the localiser to four times that of the rest of the network. Figure 6 shows the pipeline of an image passing through a 3DMM-STN. A by-product of the trained 3DMM-STN is that it can also act as a 2D landmark localiser. After training, the localiser achieves an average landmarking error of 2.35 pixels over the 21 landmarks on the part of AFLW used as a validation set, showing that, overall, the training converges well.

We begin by demonstrating that our 3DMM-STN learns to predict consistent correspondence between model and image. In Figure 7 we show 3DMM-STN output for multiple images of the same person. Note that the features are consistently mapped to the same location in the transformed output. In Figure 8 we go further by applying the 3DMM-STN to multiple images of the same person and then averaging the resulting transformed images. We show results for 10 subjects from the UMDFaces [1] dataset. The number of images for each subject is shown in parentheses. The averages have well-defined features despite being computed from images with large pose variation.

Figure 8. Mean flattened images per subject, computed from real images from the UMDFaces dataset: Elon Musk (34), Christian Bale (51), Elisha Cuthbert (53), Clint Eastwood (62), Emma Watson (73), Chuck Palahniuk (48), Nelson Mandela (52), Kim Jong-un (60), Ben Affleck (66), Courteney Cox (127). The number of images used for averaging is stated next to each subject's name.
In Figure 9 we provide a qualitative comparison to [29]. This is the only previous work on 3DMM fitting using a CNN for which the trained network is made publicly available. In columns one and five, we show input images from UMDFaces [1]. In columns two and six, we show the reconstruction provided by [29]. While the reconstruction captures the rough appearance of the input face, it lacks the discriminating detail of the original image. This method regresses shape and texture directly, but not illumination or pose. Hence, we cannot directly compare the model-image correspondence provided by this method. To overcome this, we use the landmark detector used by [29] during training and compute the optimal pose to align their reconstruction to these landmarks. We replace their cropped model by the original BFM shape model and sample the image. This allows us to create the flattened images in columns three and seven. The output of our proposed 3DMM-STN is shown in columns four and eight. We note that our approach less frequently samples background and yields a more consistent correspondence of the resampled faces. In the bottom row of the figure we show challenging examples where [29] did not produce any output because the landmark detector failed. Despite occlusions and large out of plane rotations, the 3DMM-STN still does a good job of producing a normalised output image.

Figure 9. Qualitative comparison to [29]. The bottom row shows examples for which [29] failed to fit due to failure of the landmark detector.
5. Conclusions
In this paper we have shown how to use a 3D morphable model as a spatial transformer within a CNN. Our proposed architecture has a number of interesting properties. First, the network (specifically, the localiser part of the network) learns to fit a 3D morphable model to a single 2D image without needing labelled examples of fitted models. Since the problem of fitting a morphable model to an image is an unsolved problem (and therefore no existing algorithm could be assumed to provide reliable ground truth fits), this kind of unsupervised learning is desirable. Second, the morphable model itself is fixed in our current architecture. However, there is no reason that this could not also be learnt. In this way, it may be possible to learn a 3D deformable model for an object class simply from a collection of images that are labelled appropriately for the chosen proxy task.

There are many ways that this work can be extended. First, we would like to investigate training our 3DMM-STN in an end to end recognition network. We would hope that the normalisation means that a recognition network could be trained on less data and with less complexity than existing networks that must learn pose invariance implicitly. Second, the shape parameters estimated by the localiser may contain discriminative information and so these could be combined into subsequent descriptors for recognition. Third, we would like to further explore the multiview fitting loss. Using a multiview face database or video would provide a rich source of data for learning accurate localisation. Finally, the possibility of learning the shape model during training is exciting and we would like to explore other object classes besides faces for which 3DMMs do not currently exist.
Acknowledgements
We gratefully acknowledge the support of NVIDIA Corporation with the donation of the Titan X Pascal GPU used for this research.
References

[1] A. Bansal, A. Nanduri, C. D. Castillo, R. Ranjan, and R. Chellappa. UMDFaces: An annotated face dataset for training deep networks. arXiv preprint arXiv:1611.01484v2, 2016.
[2] A. Bas and W. A. P. Smith. What does 2D geometric information really tell us about 3D face shape? arXiv preprint arXiv:1708.06703, 2017.
[3] C. Bhagavatula, C. Zhu, K. Luu, and M. Savvides. Faster than real-time facial alignment: A 3D spatial transformer network approach in unconstrained poses. In Proc. ICCV, 2017.
[4] V. Blanz and T. Vetter. A morphable model for the synthesis of 3D faces. In Proc. SIGGRAPH, pages 187–194, 1999.
[5] C. Cao, Y. Weng, S. Zhou, Y. Tong, and K. Zhou. FaceWarehouse: A 3D facial expression database for visual computing. IEEE Trans. Vis. Comp. Gr., 20(3):413–425, 2014.
[6] D. Chen, G. Hua, F. Wen, and J. Sun. Supervised transformer network for efficient face detection. In Proc. ECCV, pages 122–138, 2016.
[7] M. S. Floater. Parametrization and smooth approximation of surface triangulations. Comput. Aided Geom. Des., 14(3):231–250, 1997.
[8] G. Gallego and A. Yezzi. A compact formula for the derivative of a 3-D rotation in exponential coordinates. J. Math. Imaging Vis., 51(3):378–384, 2015.
[9] X. Gu, S. J. Gortler, and H. Hoppe. Geometry images. ACM Trans. Graphic., 21(3):355–361, 2002.
[10] R. A. Güler, G. Trigeorgis, E. Antonakos, P. Snape, S. Zafeiriou, and I. Kokkinos. DenseReg: Fully convolutional dense shape regression in-the-wild. In Proc. CVPR, 2017.
[11] A. Handa, M. Bloesch, V. Pătrăucean, S. Stent, J. McCormac, and A. Davison. gvnn: Neural network library for geometric computer vision. In Proc. ECCV Workshop on Geometry Meets Deep Learning, 2016.
[12] J. F. Henriques and A. Vedaldi. Warped convolutions: Efficient invariance to spatial transformations. CoRR, abs/1609.04382, 2016.
[13] A. S. Jackson, A. Bulat, V. Argyriou, and G. Tzimiropoulos. Large pose 3D face reconstruction from a single image via direct volumetric CNN regression. arXiv preprint arXiv:1703.07834, 2017.
[14] M. Jaderberg, K. Simonyan, A. Zisserman, and K. Kavukcuoglu. Spatial transformer networks. In Proc. NIPS, pages 2017–2025, 2015.
[15] A. Jourabloo and X. Liu. Large-pose face alignment via CNN-based dense 3D model fitting. In Proc. CVPR, 2016.
[16] H. Kim, M. Zollhöfer, A. Tewari, J. Thies, C. Richardt, and C. Theobalt. InverseFaceNet: Deep single-shot inverse face rendering from a single image. arXiv preprint arXiv:1703.10956, 2017.
[17] M. Koestinger, P. Wohlhart, P. M. Roth, and H. Bischof. Annotated facial landmarks in the wild: A large-scale, real-world database for facial landmark localization. In Proc. First IEEE International Workshop on Benchmarking Facial Image Analysis Technologies, 2011.
[18] O. M. Parkhi, A. Vedaldi, and A. Zisserman. Deep face recognition. In Proc. BMVC, 2015.
[19] P. Paysan, R. Knothe, B. Amberg, S. Romdhani, and T. Vetter. A 3D face model for pose and illumination invariant face recognition. In Proc. AVSS, pages 296–301, 2009.
[20] M. Piotraschke and V. Blanz. Automated 3D face reconstruction from multiple images using quality measures. In Proc. CVPR, pages 3418–3427, 2016.
[21] E. Richardson, M. Sela, and R. Kimmel. 3D face reconstruction by learning from synthetic data. In Proc. 3DV, pages 460–469, 2016.
[22] E. Richardson, M. Sela, R. Or-El, and R. Kimmel. Learning detailed face reconstruction from a single image. In Proc. CVPR, 2017.
[23] S. Romdhani and T. Vetter. Estimating 3D shape and texture using pixel intensity, edges, specular highlights, texture constraints and a prior. In Proc. CVPR, volume 2, pages 986–993, 2005.
[24] S. Schönborn, B. Egger, A. Morel-Forster, and T. Vetter. Markov chain Monte Carlo for automated face image analysis. International Journal of Computer Vision, 123(2):160–183, 2017.
[25] M. Sela, E. Richardson, and R. Kimmel. Unrestricted facial geometry reconstruction using image-to-image translation. arXiv preprint arXiv:1703.10131, 2017.
[26] W. A. P. Smith. The perspective face shape ambiguity. In Perspectives in Shape Analysis, pages 299–319. Springer, 2016.
[27] Y. Taigman, M. Yang, M. Ranzato, and L. Wolf. DeepFace: Closing the gap to human-level performance in face verification. In Proc. CVPR, pages 1701–1708, 2014.
[28] A. Tewari, M. Zollhöfer, H. Kim, P. Garrido, F. Bernard, P. Pérez, and C. Theobalt. MoFA: Model-based deep convolutional face autoencoder for unsupervised monocular reconstruction. arXiv preprint arXiv:1703.10580, 2017.
[29] A. T. Trần, T. Hassner, I. Masi, and G. Medioni. Regressing robust and discriminative 3D morphable models with a very deep neural network. In Proc. CVPR, 2017.
[30] A. Vedaldi and K. Lenc. MatConvNet: Convolutional neural networks for MATLAB. In Proceedings of the 23rd Annual ACM Conference on Multimedia, pages 689–692. ACM, 2015.
[31] X. Yan, J. Yang, E. Yumer, Y. Guo, and H. Lee. Perspective transformer nets: Learning single-view 3D object reconstruction without 3D supervision. In Proc. NIPS, pages 1696–1704, 2016.
[32] X. Yu, F. Zhou, and M. Chandraker. Deep deformation network for object landmark localization. In Proc. ECCV, pages 52–70, 2016.
[33] X. Zhu, Z. Lei, X. Liu, H. Shi, and S. Z. Li. Face alignment across large poses: A 3D solution. In Proc. CVPR, pages 146–155, 2016.