Regression-Based Image Alignment for General Object Categories
Hilton Bristow and Simon Lucey
Queensland University of Technology (QUT), Brisbane QLD 4000, Australia
Carnegie Mellon University (CMU), Pittsburgh PA 15289, USA
Abstract.
Gradient-descent methods have exhibited fast and reliable performance for image alignment in the facial domain, but have largely been ignored by the broader vision community. They require the image function be smooth and (numerically) differentiable – properties that hold for pixel-based representations obeying natural image statistics, but not for more general classes of non-linear feature transforms. We show that transforms such as Dense SIFT can be incorporated into a Lucas Kanade alignment framework by predicting descent directions via regression. This enables robust matching of instances from general object categories whilst maintaining desirable properties of Lucas Kanade such as the capacity to handle high-dimensional warp parametrizations and a fast rate of convergence. We present alignment results on a number of objects from ImageNet, and an extension of the method to unsupervised joint alignment of objects from a corpus of images.
Keywords:
Lucas Kanade, alignment, regression, Dense SIFT
Traditionally, detectors used in general object detection have been applied in a discrete multi-scale sliding-window manner. This enables global search of the optimal warp parameters (object scale and position within the source image), at the expense of only being able to handle these simple transformations. Gradient-based approaches such as Lucas Kanade (LK) [2], on the other hand, can entertain more complex warp parametrizations such as rotations and changes in aspect ratio, but impose the constraint that the image function be smooth and differentiable (analytically or efficiently numerically).

This constraint is generally satisfied for pixel-based representations that follow natural image statistics [18], especially on constrained domains such as faces, which are known to exhibit low-frequency gradients [3]. For broader object categories that exhibit large intra-class variation and discriminative gradient information in the higher frequencies (i.e. the interaction of the object with the background), however, non-linear feature transforms that introduce tolerance to contrast and geometry are required. These transforms violate the smoothness requirement of gradient-based methods.
As a result, the huge wealth of research into gradient-based methods for facial image alignment has largely been ignored by the broader vision community. In this paper, we show that the LK objective can be modified to handle non-linear feature transforms. Specifically, we show,

– descent directions on feature images can be computed via linear regression to avoid any assumptions about their statistics,
– for least-squares regression, the formulation can be interpreted as an efficient convolution operation,
– localization results on images from ImageNet using higher-order warp parametrizations than scale and translation,
– an extension to unsupervised joint alignment of a corpus of images.

By showing that gradient-based methods can be applied to non-linear image transforms more generally, the huge body of research in image alignment can be leveraged for general object alignment.

Image alignment is the problem of registering two images, or parts of images, so that their appearance similarity is maximized. It is a difficult problem in general, because (i) the deformation model used to parametrize the alignment can be high-dimensional, (ii) the appearance variation between instances of the object category can be large due to differences in lighting, pose, non-rigid geometry and background material, and (iii) the search space is highly non-convex.
For localization of general object categories, the solution has largely been to parametrize the warp by a low-dimensional set of parameters – x, y-translation and scale – and exhaustively search across the support of the image for the best set of parameters using a classifier trained to tolerate lighting variation and changes in pose and geometry. Though not usually framed in these terms, this is exactly the role of multi-scale sliding-window detection.

Higher-dimensional warps have typically not been used, due to the exponential explosion in the size of the search space. This is evident in graphical models, where it is only possible to entertain a restrictive set of higher-dimensional warps: those that are amenable to optimization by dynamic programming [7]. A consequence of this limitation is that sometimes underlying physical constraints cannot be well modelled: [21] use a tree to model parts of a face, resulting in floating branches and leaf nodes that do not respect or approximate the elastic relationship of muscles.

A related limitation of global search is the speed with which warp parametrizations can be explored. Searching over translation can be computed efficiently via convolution, however there is no equivalent operator for searching affine warps or projections onto linear subspaces.

[11] introduced a global method for gaining correspondence between images from general object categories – evaluated on Pascal VOC – based on homography consensus of local non-linear feature descriptors. They claim performance improvements over state-of-the-art congealing methods, but their only qualitative assessment is on rigid objects, so it is difficult to gauge how well their method generalizes to non-rigid object classes.

A related problem is that of co-segmentation [5], which aims to learn coherent segmentations across a corpus of images by exploiting similarities between the foreground and background regions in these images.
Such global methods are slow, but could be used as an effective initializer for local image alignment (in the same way that face detection is almost universally used to initialize facial landmark localization).
Local search methods perform alignment by taking constrained steps on the image function directly. The family of Lucas Kanade algorithms consider a first-order Taylor series approximation to the image function and locally approximate its curvature with a quadratic. Convergence to a minimum follows if the Jacobian of the linearization is well-conditioned and the function is smooth and differentiable. Popular non-linear features such as Dense SIFT [12], HOG [6] and LBP [13] are non-differentiable image operators. Unlike pixel representations, whose frequency spectra relate the domain of the optimization basin to the amount of blur introduced, these non-linear operators do not have well-understood statistical structure.

Current state-of-the-art local search methods that employ non-linear features for face alignment instead use a cascade of regression functions, in a similar manner to Iterative Error Bound Minimization (IEBM) [17]. A common theme of these methods [10,15,19] is that they directly regress to positional updates. This sidesteps issues with differentiating image functions, or inverting Hessians. The drawback, however, is that they require vast amounts of training data to produce well-conditioned regressors.
This approach is feasible for facial domain data that can be synthesized and trained offline in batch to produce fast runtime performance, but becomes impractical when performing alignment on arbitrary object classes, which have traditionally only had weakly labelled data. The least squares congealing alignment algorithm [4], for example, has no prior knowledge of image landmarks, and learning positional update regressors for each pixel in each image is not only costly, their performance is poor when using only the surrounding image context as training data.

[8] first proposed the use of non-linear transforms (SIFT descriptors in their case) for the congealing image alignment problem, noting like us, that pixel-based representations do not work on sets of images that exhibit high contrast variance. Their entropy-based algorithm treats SIFT descriptors as stemming from a multi-modal Gaussian distribution, and clusters the regions, at each iteration finding the transform that minimizes the cluster entropy. As [4] pointed out, however, employing entropy for congealing is problematic due to its poor optimization characteristics. As a result, the method of [8] is slow to converge.

The related field of medical imaging has a large focus on image registration for measuring brain development, maturation and ageing, amongst others. [9,20] present methods for improving the robustness of unsupervised alignment by embedding the dataset in a graph, with edges representing similarity of images. Registration then proceeds by minimizing the total edge length of the graph. This improves the capture of images which are far from the dataset mean, but which can be found by traversing through intermediate images. Their application domain – brain scans – is still highly constrained, permitting the estimation of geodesic distances between images in pixel space.
Nonetheless, this type of embedding is beyond what generic congealing algorithms have achieved. For general image categories, we instead propose to compute descent directions via appearance regression. The advantage of this approach is that the size of the regression formulation is independent of the dimensionality of the feature transform, so can be inverted with a small amount of training data.
The Inverse Compositional Lucas Kanade problem can be formulated as,

  arg min_{∆p} ||T(W(x; p)) − I(W(x; ∆p))||²    (1)

where T is the reference template image, I is the image we wish to warp to the template and W is the warp parametrization that depends on the image coordinates x and the warp parameters p. This is a nonlinear least squares (NLS) problem since the image function is highly non-convex. To solve it, the role of the template and the image is reversed and the expression is linearized by taking a first-order Taylor expansion about T(W(x; p)) to yield,

  arg min_{∆p} ||T(W(x; p)) + ∇T (∂W/∂p) ∆p − I(x)||²    (2)

∇T = (∂T/∂x, ∂T/∂y) is the gradient of the template evaluated at W(x; p). ∂W/∂p is the Jacobian of the warp. The update ∆p describes the optimal alignment of T to I. The inverse of ∆p is then composed with the current estimate of the parameters,

  p_{k+1} = p_k ∘ ∆p⁻¹    (3)

and applied to I.

The implication is that we always linearize the expression about the template T, but apply the (inverse of the) motion update ∆p to the image I. The consequence of this subtle detail is that T is always fixed, and thus the gradient operator ∇T only ever needs to be computed once [1]. This property extends to our regression framework, where the potentially expensive regressor training step can also happen just once, before alignment.

For non-linear multi-channel image operators, we can replace the gradient operator ∇T with a general matrix R,

  arg min_{∆p} ||T(W(x; p)) + R (∂W/∂p) ∆p − I(x)||²    (4)

The role of this matrix is to predict a descent direction for each pixel given context from other pixels and channels. The structure of the matrix determines the types of interactions that are exploited to compute the descent directions.
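To make the iteration of Eqns. 1–3 concrete, the following is a minimal sketch of inverse-compositional alignment for the simplest case: a pure translation warp (where ∂W/∂p is the identity) on raw pixel intensities. The function name and the translation-only simplification are ours, not from the paper.

```python
import numpy as np
from scipy.ndimage import map_coordinates, sobel

def ic_lk_translation(T, I, n_iters=50, tol=1e-4):
    """Inverse-compositional LK for a pure translation warp W(x; p) = x + p.

    T, I are 2-D float arrays. Returns p = (px, py) such that I(x + p) ~= T(x).
    """
    p = np.zeros(2)
    # Template gradients and Hessian are computed ONCE -- the key property of
    # the inverse-compositional formulation (Eqns. 2-3).
    gx = sobel(T, axis=1, mode='nearest') / 8.0   # dT/dx
    gy = sobel(T, axis=0, mode='nearest') / 8.0   # dT/dy
    G = np.stack([gx.ravel(), gy.ravel()], axis=1)  # N x 2 steepest-descent images
    H_inv = np.linalg.inv(G.T @ G)                  # 2 x 2 Hessian, also fixed
    ys, xs = np.mgrid[0:T.shape[0], 0:T.shape[1]]

    for _ in range(n_iters):
        # Warp I by the current parameters: Iw(x) = I(x + p).
        Iw = map_coordinates(I, [ys + p[1], xs + p[0]], order=1, mode='nearest')
        # Solve the linearized problem, then compose the inverted update.
        dp = H_inv @ (G.T @ (Iw - T).ravel())
        p -= dp
        if np.linalg.norm(dp) < tol:
            break
    return p
```

Note that only the interpolation (warp) and one small matrix-vector product are repeated per iteration; everything derived from the template is hoisted out of the loop.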
If the Jacobian is constant across all iterates – as is the case with affine transforms – it can be pre-multiplied with the regressor so that solving each linearization involves only a single matrix multiplication.

We now discuss a simple least squares strategy for learning R. If we consider only a translational warp, the expression of Eqn. 4 reduces to,

  arg min_{∆x} ||T(x) + R ∆x − I(x)||²    (5)

where ∆x = ∆p = (∆x, ∆y). That is, we want to find the step size along the descent direction that minimizes the appearance difference between the template and the image. If we instead fix the ∆x, we can solve for the R that minimizes the appearance difference,

  arg min_R Σ_{∆x ∈ D} ||T(x) + R ∆x − T(x + ∆x)||²    (6)

Here we have replaced I(x) with the template at the known displacement, T(x + ∆x). The domain of displacements D that we draw from for training balances small-displacement accuracy and large-displacement stability. Of course, least-squares regression is not the only possible approach. One could, for example, use support vector regression (SVR) when outliers are particularly problematic, with a commensurate increase in computational complexity.

Each regressor involves solving the system of equations:

  arg min_{R_i} Σ_{∆x ∈ D_i} ||T(x_i) + R_i ∆x − T(x_i + ∆x)||²    (7)

where i represents the i-th pixel location in the image. If the same domain of displacements is used for each pixel, the solution to this objective can be computed in closed form as,

  R*_i = (∆x ∆xᵀ + ρI)⁻¹ (∆xᵀ [T(x_i + ∆x) − T(x_i)])    (8)

The first thing to note is that (∆xᵀ∆x + ρI)⁻¹ is a 2 × 2 matrix that is shared across all pixels.
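Eqns. 7–8 can be implemented directly by writing out the sums over the displacement domain. Below is a sketch for a single pixel of a multi-channel feature image; the helper name, the square displacement grid and the ridge value are our own illustrative choices.

```python
import numpy as np

def learn_pixel_regressor(F, i, j, n=3, ridge=1e-3):
    """Least-squares descent regressor for pixel (i, j) of an H x W x K
    feature image F, over a (2n+1) x (2n+1) domain of displacements (Eqn. 7).

    Returns the K x 2 matrix R_i minimizing
        sum_{dx in D} || F[i, j] + R_i dx - F[(i, j) + dx] ||^2 .
    Assumes (i, j) lies at least n pixels from the image border.
    """
    D, Y = [], []
    for dy in range(-n, n + 1):
        for dx in range(-n, n + 1):
            if dx == 0 and dy == 0:
                continue
            D.append([dx, dy])                      # displacement sample
            Y.append(F[i + dy, j + dx] - F[i, j])   # appearance difference
    D = np.asarray(D, dtype=float)                  # M x 2
    Y = np.asarray(Y, dtype=float)                  # M x K
    # Closed form of Eqn. 8 (written K x 2): R = Y^T D (D^T D + rho I)^{-1}.
    return Y.T @ D @ np.linalg.inv(D.T @ D + ridge * np.eye(2))
```

On a symmetric grid D^T D is diagonal, so the 2 × 2 inverse is trivial and, as noted above, identical for every pixel.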
The ∆xᵀ [T(x_i + ∆x) − T(x_i)] term within the expression is just a sum of weighted differences between a displaced pixel and the reference pixel, i.e.,

  [ Σ_{∆x} Σ_{∆y} ∆x (T(x + ∆x, y + ∆y) − T(x, y)) ]
  [ Σ_{∆x} Σ_{∆y} ∆y (T(x + ∆x, y + ∆y) − T(x, y)) ]    (9)

Other regression-based methods of alignment such as [19] leverage tens of thousands of warped training examples during offline batch learning to produce fast runtime performance on a single object category (faces). We cannot afford such complexity if we're going to perform regression and alignment on arbitrary object categories without a dedicated training time.

If we sample ∆x on a regular grid that coincides with pixel locations, then Eqn. 9 can be cast as two filters – one each for horizontal weights ∆x, and vertical weights ∆y,

  f_x = [ x_{−n} … x_n ]        f_y = [ y_{−n} … y_{−n} ]
        [   ⋮         ⋮ ]              [   ⋮          ⋮  ]
        [ x_{−n} … x_n ]              [  y_n   …  y_n  ]    (10)

If the x and y domains are both equal and odd, the contribution of T(x, y) is cancelled out. This is clearly a generalization of the central difference operator, which considers a domain of [−1, 1],

  f_x = [−1 0 1]        f_y = [−1 0 1]ᵀ    (11)

Thus, an efficient realization for learning a regressor at every pixel in the image is,

  R = (∆xᵀ∆x + ρI)⁻¹ [f_x ∗ T(x)  f_y ∗ T(x)]    (12)

where ∗ is the convolution operator. For an image with N pixels, K channels and a warp with P motion parameters, the complexity of our image alignment can be stated as a single O(KN log KN + KNP) pre-computation of the regressor, followed by an O(KNP) matrix-vector multiply and image warp per iteration, with an overall linear rate of convergence.
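The filter form of Eqns. 10–12 makes the per-pixel regressors cheap to compute for a whole single-channel image at once. A sketch under our own naming, using correlation rather than convolution to avoid an explicit kernel flip:

```python
import numpy as np
from scipy.ndimage import correlate

def regressors_via_filtering(T, n=1, ridge=1e-3):
    """Per-pixel least-squares descent regressors for a single-channel image
    T via the filter form of Eqn. 12, over a (2n+1) x (2n+1) displacement grid.

    Returns (Rx, Ry): the horizontal and vertical regression coefficients
    at every pixel.
    """
    coords = np.arange(-n, n + 1, dtype=float)
    fx = np.tile(coords, (2 * n + 1, 1))   # horizontal weights dx (Eqn. 10)
    fy = fx.T                              # vertical weights dy
    # sum_dx dx * (T(x+dx) - T(x)): the -T(x) term cancels since sum_dx dx = 0.
    num_x = correlate(T, fx, mode='nearest')
    num_y = correlate(T, fy, mode='nearest')
    # On a symmetric grid the 2x2 normalizer is diagonal with a constant entry.
    s = (2 * n + 1) * np.sum(coords ** 2) + ridge
    return num_x / s, num_y / s
```

With n = 1 and the ridge term neglected, this reduces to a row-averaged central difference, matching Eqn. 11; larger n trades locality for stability over bigger displacements.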
Dense non-linear feature transforms can be viewed as mapping each scalar pixel in a (grayscale) image to a vector, ℝ → ℝᴷ. The added redundancy is required to decorrelate the various lighting transforms affecting the appearance of objects. Some feature transforms such as HOG [6] also introduce a degree of spatial insensitivity for matching dis-similar objects, though we find in practice that alignment performance is more sensitive to lighting than geometric effects (Fig. 1).

During alignment, spatial operations are applied across each channel independently. In particular, our regression formulation does not consider correlations between channels, so separate regressors can be learned on each feature plane of the image, then concatenated. This admits a highly efficient representation in the Fourier domain – the filters f_x and f_y only need to be transformed to the Fourier domain once per image, rather than once per channel.

To illustrate the benefit of applying non-linear transforms, we performed an alignment experiment between pairs of images with ground-truth registration, and progressively increased the initialization error, measuring the overall number of trials that converged back to ground-truth (within ε tolerance). Faces with labelled landmarks constitute a poor evaluation metric because of the proven capacity for pixel representations to perform well. Instead, we adopted the following strategy for defining ground-truth image pairs for general object classes: we manually sampled similar images from ImageNet and visually aligned them w.r.t. an affine warp, then ran both LK and SIFT Flow at the "ground-truth" and asserted they did not diverge from the initialization (refining the estimate and iterating where necessary).

For each value of the initialization error, we ran 1000 trials. Fig. 1 presents the results, with a representative pair of ground-truth images.
There is a progressive degradation in performance from SVR to least-squares regression to central differences on all of the Dense SIFT trials. Importantly, the pixel-based trials fail to converge even close to the ground-truth – the background distractors and differences between the zebras dominate the appearance, which results in incoherent descent predictions. At the other end of the spectrum, SVR consistently outperforms least-squares regression by a large margin, indicative that a large number of sample outliers exist over both small and large domain sizes. This highlights the benefit of treating alignment as a regression problem rather than computing numeric approximations to the gradient (i.e. central differences), and suggests that excellent performance can be achieved with a commensurate increase in computational complexity.

In all of our alignment experiments, we extract densely sampled SIFT features [12] on a regular grid with a stride of 1 pixel. We cross-validated the spatial aggregation (cell) size, and found 4 × 4 cells to perform best.

Fig. 1.
Pairwise (LK) alignment performance of different methods for increasing initialization error. The number after Dense SIFT indicates the spatial aggregation (cell) size of each SIFT descriptor. The domain is the limit of displacement magnitude from which training examples are gathered for the regressors, or the blur kernel size in the case of central differences. There is a progressive degradation in performance from SVR to least-squares regression to central differences on Dense SIFT. The pixel-based methods fail to converge even when close to the ground truth on challenging images such as the zebra.
Fig. 2.
Representative pairwise alignments. Column-wise from left to right: (i) The template region of interest. (ii) The image we wish to align to the template. The bounding box initialization covers ≈ 50% of the image area, to reflect the fact that objects of interest rarely fill the entire area of an image. (iii) The predicted region that best aligns the image to the template. The four examples exhibit robustness to changes in pose, rotation, scale and translation, respectively.
We test the performance of our algorithm on a range of animal object categories drawn from ImageNet. In Fig. 2, the first column is the template image. If no bounding box is shown, the whole image is used as the template. The second column shows the image we wish to align, with the initialization bounding the middle 50% of pixels – owing to the fact that photographers rarely frame the object of interest to consume the entire image area. The third column shows the converged solutions. In all of the cases shown, pixel-based representations failed to converge.
The task of unsupervised localization is to discover the bounding boxes of objects of interest in a corpus of images with only their object class labelled. In approaches such as Object Centric Pooling [16], a detector is optimized jointly with the estimated locations of bounding boxes. Importantly, bounding box candidates are sampled in a multi-scale sliding-window manner, perhaps across a fixed number of aspect ratios. Exhaustive search cannot handle more complex search spaces, such as rotations.

Gradient-based methods derived from the Lucas Kanade algorithm such as least squares congealing [4] and RASL [14] have performed well on constrained domains (e.g. faces, digits, building façades), but not on general object categories. Here we show that our feature regression framework can be applied to perform unsupervised localization.

The RASL algorithm performs alignment by attempting to minimize the rank of the overall stack. This only applies to linearly correlated images, however. General object categories that exhibit large appearance variation and articulated deformations are unlikely to form a low-rank basis even when aligned. The introduction of feature transforms also explodes the dimensionality of the problem, making SVD computation infeasible. Finally, RASL has a narrow basin of convergence, requiring that the misalignment can be modelled by the error term so that the low rank term is not simply an average of images in the stack (which is known to result in poor convergence properties [4]).

We therefore present results using the least squares congealing algorithm. It scales to large numbers of feature images, shares the same favourable inverse compositional properties as Lucas Kanade, and is robust to changes in illumination via dense SIFT features.

Fig. 3 shows the results of aligning a set of elephants. Recall that there is no oracle or ground truth – the elephants are "discovered" merely as the region that aligns most consistently across the entire image stack. Fig. 4 illustrates the stack mean before and after congealing. Even though individual elephants appear in different poses, the aligned mean clearly elicits an elephant silhouette.
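To convey the flavour of the congealing loop, here is a heavily simplified sketch: translation-only warps, raw intensities in place of Dense SIFT channels, and one Gauss-Newton step per image per sweep, each image being pulled toward the leave-one-out mean of the stack. All names and simplifications are ours; the experiments above use the full least squares congealing algorithm of [4] with affine warps and SIFT features.

```python
import numpy as np
from scipy.ndimage import shift as nd_shift, sobel

def congeal_translations(images, n_sweeps=20):
    """Minimal least-squares congealing: jointly estimate one translation per
    image so the stack agrees in a least-squares sense. Returns per-image
    (dy, dx) shifts with zero mean (the global translation is ambiguous).
    """
    ps = np.zeros((len(images), 2))
    for _ in range(n_sweeps):
        warped = [nd_shift(im, p, order=1, mode='nearest')
                  for im, p in zip(images, ps)]
        total = np.sum(warped, axis=0)
        for k, wk in enumerate(warped):
            ref = (total - wk) / (len(images) - 1)   # leave-one-out mean
            gy = sobel(ref, axis=0, mode='nearest') / 8.0
            gx = sobel(ref, axis=1, mode='nearest') / 8.0
            G = np.stack([gy.ravel(), gx.ravel()], axis=1)
            err = (wk - ref).ravel()
            # One Gauss-Newton step pulling image k toward the ensemble mean.
            dp = np.linalg.solve(G.T @ G + 1e-6 * np.eye(2), G.T @ err)
            ps[k] += dp
    return ps - ps.mean(axis=0)
```

Removing the mean shift at the end pins down the trivial degree of freedom where the whole stack drifts together; in the full algorithm the same ambiguity is handled for richer warps.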
Fig. 3.
The results of unsupervised ensemble alignment (congealing) on a set of 170 elephants taken from ImageNet. The objective is to jointly minimize the appearance difference between all of the images in a least-squares sense – no prior information or training is required. The first 6 rows present exemplar images from the set that converged. The final row presents a number of failure cases.
Fig. 4.
The mean image (i) before alignment and, (ii) after alignment w.r.t. an affine warp. Although individual elephants undergo different non-rigid deformations, one can make out an elephant silhouette in the aligned mean.
Image alignment is a fundamental problem for many computer vision tasks, however a large portion of the research that has focussed on alignment in the facial domain has not generalized well to broader image categories. As a result, exhaustive search strategies have dominated general image alignment. In this paper, we showed that regression over image features could be used within a Lucas Kanade framework to robustly align instances of objects differing in pose, illumination, size and position, and presented a range of results from ImageNet categories. We also demonstrated an example of unsupervised image alignment, whereby the appearance of an elephant was automatically discovered in a large number of images. Our future work aims to parametrize more complex warps so that objects can be matched across greater pose and viewpoint variation.
References
1. S. Baker and I. Matthews. Equivalence and Efficiency of Image Alignment Algorithms. International Conference of Computer Vision and Pattern Recognition (CVPR), pages 1090–1097, 2001.
2. S. Baker and I. Matthews. Lucas-Kanade 20 Years On: A Unifying Framework. International Journal of Computer Vision (IJCV), 56(3):221–255, Feb. 2004.
3. T. Cootes and C. Taylor. Statistical models of appearance for computer vision. 2004.
4. M. Cox, S. Sridharan, and S. Lucey. Least-squares congealing for large numbers of images. International Conference on Computer Vision (ICCV), 2009.
5. J. Dai, Y. Wu, J. Zhou, and S. Zhu. Cosegmentation and cosketch by unsupervised learning. International Conference on Computer Vision (ICCV), pages 1305–1312, Dec. 2013.
6. N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. International Conference of Computer Vision and Pattern Recognition (CVPR), pages 886–893, 2005.
7. P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan. Object detection with discriminatively trained part-based models. Pattern Analysis and Machine Intelligence (PAMI), 32(9):1627–45, Sept. 2010.
8. G. B. Huang, V. Jain, and E. Learned-Miller. Unsupervised Joint Alignment of Complex Images. International Conference on Computer Vision (ICCV), pages 1–8, 2007.
9. H. Jia, G. Wu, Q. Wang, and D. Shen. ABSORB: Atlas building by self-organized registration and bundling. International Conference of Computer Vision and Pattern Recognition (CVPR), pages 2785–2790, 2010.
10. V. Kazemi and S. Josephine. One Millisecond Face Alignment with an Ensemble of Regression Trees. International Conference of Computer Vision and Pattern Recognition (CVPR), 2014.
11. J. Lankinen and J. Kamarainen. Local Feature Based Unsupervised Alignment of Object Class Images. British Machine Vision Conference (BMVC), pages 107.1–107.11, 2011.
12. C. Liu, J. Yuen, and A. Torralba. SIFT flow: dense correspondence across scenes and its applications. Pattern Analysis and Machine Intelligence (PAMI), 33(5):978–94, May 2011.
13. T. Ojala. Multiresolution gray-scale and rotation invariant texture classification with local binary patterns. Pattern Analysis and Machine Intelligence (PAMI), pages 1–35, 2002.
14. Y. Peng, A. Ganesh, J. Wright, W. Xu, and Y. Ma. RASL: robust alignment by sparse and low-rank decomposition for linearly correlated images. Pattern Analysis and Machine Intelligence (PAMI), 34(11):2233–46, Nov. 2012.
15. S. Ren, X. Cao, Y. Wei, and J. Sun. Face Alignment at 3000 FPS via Regressing Local Binary Features. International Conference of Computer Vision and Pattern Recognition (CVPR), 1(1):1–8, 2014.
16. O. Russakovsky, Y. Lin, K. Yu, and L. Fei-Fei. Object-centric spatial pooling for image classification. European Conference on Computer Vision (ECCV), 2012.
17. J. Saragih and R. Goecke. Iterative error bound minimisation for AAM alignment. International Conference on Pattern Recognition (ICPR), pages 20–23, 2006.
18. E. Simoncelli and B. Olshausen. Natural Image Statistics and Neural Representation. Annual Review of Neuroscience, 2001.
19. X. Xiong and F. De la Torre. Supervised Descent Method and Its Applications to Face Alignment. International Conference of Computer Vision and Pattern Recognition (CVPR), pages 532–539, June 2013.
20. S. Ying, G. Wu, Q. Wang, and D. Shen. Hierarchical unbiased graph shrinkage (HUGS): a novel groupwise registration for large data set. NeuroImage, 84:626–38, Jan. 2014.
21. X. Zhu and D. Ramanan. Face detection, pose estimation, and landmark localization in the wild. International Conference of Computer Vision and Pattern Recognition (CVPR), 2012.