Regression-Based Image Alignment for General Object Categories
Hilton Bristow and Simon Lucey
Queensland University of Technology (QUT), Brisbane QLD 4000, Australia
Carnegie Mellon University (CMU), Pittsburgh PA 15289, USA
Abstract.
Gradient-descent methods have exhibited fast and reliable performance for image alignment in the facial domain, but have largely been ignored by the broader vision community. They require the image function be smooth and (numerically) differentiable – properties that hold for pixel-based representations obeying natural image statistics, but not for more general classes of non-linear feature transforms. We show that transforms such as Dense SIFT can be incorporated into a Lucas Kanade alignment framework by predicting descent directions via regression. This enables robust matching of instances from general object categories whilst maintaining desirable properties of Lucas Kanade such as the capacity to handle high-dimensional warp parametrizations and a fast rate of convergence. We present alignment results on a number of objects from ImageNet, and an extension of the method to unsupervised joint alignment of objects from a corpus of images.
Keywords:
Lucas Kanade, alignment, regression, Dense SIFT
Traditionally, detectors used in general object detection have been applied in a discrete multi-scale sliding-window manner. This enables global search of the optimal warp parameters (object scale and position within the source image), at the expense of only being able to handle these simple transformations. Gradient-based approaches such as Lucas Kanade (LK) [2], on the other hand, can entertain more complex warp parametrizations such as rotations and changes in aspect ratio, but impose the constraint that the image function be smooth and differentiable (analytically or efficiently numerically).

This constraint is generally satisfied for pixel-based representations that follow natural image statistics [18], especially on constrained domains such as faces, which are known to exhibit low-frequency gradients [3]. For broader object categories that exhibit large intra-class variation and discriminative gradient information in the higher frequencies (i.e. the interaction of the object with the background), however, non-linear feature transforms that introduce tolerance to contrast and geometry are required. These transforms violate the smoothness requirement of gradient-based methods.
As a result, the huge wealth of research into gradient-based methods for facial image alignment has largely been ignored by the broader vision community. In this paper, we show that the LK objective can be modified to handle non-linear feature transforms. Specifically, we show,

– descent directions on feature images can be computed via linear regression to avoid any assumptions about their statistics,
– for least-squares regression, the formulation can be interpreted as an efficient convolution operation,
– localization results on images from ImageNet using higher-order warp parametrizations than scale and translation,
– an extension to unsupervised joint alignment of a corpus of images.

By showing that gradient-based methods can be applied to non-linear image transforms more generally, the huge body of research in image alignment can be leveraged for general object alignment.

Image alignment is the problem of registering two images, or parts of images, so that their appearance similarity is maximized. It is a difficult problem in general, because (i) the deformation model used to parametrize the alignment can be high-dimensional, (ii) the appearance variation between instances of the object category can be large due to differences in lighting, pose, non-rigid geometry and background material, and (iii) the search space is highly non-convex.
For localization of general object categories, the solution has largely been to parametrize the warp by a low-dimensional set of parameters – x, y-translation and scale – and exhaustively search across the support of the image for the best set of parameters using a classifier trained to tolerate lighting variation and changes in pose and geometry. Though not usually framed in these terms, this is exactly the role of multi-scale sliding-window detection.

Higher-dimensional warps have typically not been used, due to the exponential explosion in the size of the search space. This is evident in graphical models, where it is only possible to entertain a restrictive set of higher-dimensional warps: those that are amenable to optimization by dynamic programming [7]. A consequence of this limitation is that sometimes underlying physical constraints cannot be well modelled: [21] use a tree to model parts of a face, resulting in floating branches and leaf nodes that do not respect or approximate the elastic relationship of muscles.

A related limitation of global search is the speed with which warp parametrizations can be explored. Searching over translation can be computed efficiently via convolution, however there is no equivalent operator for searching affine warps or projections onto linear subspaces.

[11] introduced a global method for gaining correspondence between images from general object categories – evaluated on Pascal VOC – based on homography consensus of local non-linear feature descriptors. They claim performance improvements over state-of-the-art congealing methods, but their only qualitative assessment is on rigid objects, so it is difficult to gauge how well their method generalizes to non-rigid object classes.

A related problem is that of co-segmentation [5], which aims to learn coherent segmentations across a corpus of images by exploiting similarities between the foreground and background regions in these images.
Such global methods are slow, but could be used as an effective initializer for local image alignment (in the same way that face detection is almost universally used to initialize facial landmark localization).
Local search methods perform alignment by taking constrained steps on the image function directly. The family of Lucas Kanade algorithms consider a first-order Taylor series approximation to the image function and locally approximate its curvature with a quadratic. Convergence to a minimum follows if the Jacobian of the linearization is well-conditioned and the function is smooth and differentiable. Popular non-linear features such as Dense SIFT [12], HOG [6] and LBP [13] are non-differentiable image operators. Unlike pixel representations, whose frequency spectra relate the domain of the optimization basin to the amount of blur introduced, these non-linear operators do not have well-understood statistical structure.

Current state-of-the-art local search methods that employ non-linear features for face alignment instead use a cascade of regression functions, in a similar manner to Iterative Error Bound Minimization (IEBM) [17]. A common theme of these methods [10,15,19] is that they directly regress to positional updates. This sidesteps issues with differentiating image functions, or inverting Hessians. The drawback, however, is that they require vast amounts of training data to produce well-conditioned regressors.
This approach is feasible for facial domain data that can be synthesized and trained offline in batch to produce fast runtime performance, but becomes impractical when performing alignment on arbitrary object classes, which have traditionally only had weakly labelled data. The least squares congealing alignment algorithm [4], for example, has no prior knowledge of image landmarks, and learning positional update regressors for each pixel in each image is not only costly, their performance is poor when using only the surrounding image context as training data.

[8] first proposed the use of non-linear transforms (SIFT descriptors in their case) for the congealing image alignment problem, noting like us, that pixel-based representations do not work on sets of images that exhibit high contrast variance. Their entropy-based algorithm treats SIFT descriptors as stemming from a multi-modal Gaussian distribution, and clusters the regions, at each iteration finding the transform that minimizes the cluster entropy. As [4] pointed out, however, employing entropy for congealing is problematic due to its poor optimization characteristics. As a result, the method of [8] is slow to converge.

The related field of medical imaging has a large focus on image registration for measuring brain development, maturation and ageing, amongst others. [9,20] present methods for improving the robustness of unsupervised alignment by embedding the dataset in a graph, with edges representing similarity of images. Registration then proceeds by minimizing the total edge length of the graph. This improves the capture of images which are far from the dataset mean, but which can be found by traversing through intermediate images. Their application domain – brain scans – is still highly constrained, permitting the estimation of geodesic distances between images in pixel space.
Nonetheless, this type of embedding is beyond what generic congealing algorithms have achieved. For general image categories, we instead propose to compute descent directions via appearance regression. The advantage of this approach is that the size of the regression formulation is independent of the dimensionality of the feature transform, so can be inverted with a small amount of training data.
The Inverse Compositional Lucas Kanade problem can be formulated as,

  arg min_{∆p} ||T(W(x; p)) − I(W(x; ∆p))||²    (1)

where T is the reference template image, I is the image we wish to warp to the template and W is the warp parametrization that depends on the image coordinates x and the warp parameters p. This is a nonlinear least squares (NLS) problem since the image function is highly non-convex. To solve it, the role of the template and the image is reversed and the expression is linearized by taking a first-order Taylor expansion about T(W(x; p)) to yield,

  arg min_{∆p} ||T(W(x; p)) + ∇T (∂W/∂p) ∆p − I(x)||²    (2)

∇T = (∂T/∂x, ∂T/∂y) is the gradient of the template evaluated at W(x; p). ∂W/∂p is the Jacobian of the warp. The update ∆p describes the optimal alignment of T to I. The inverse of ∆p is then composed with the current estimate of the parameters,

  p_{k+1} = p_k ∘ ∆p⁻¹    (3)

and applied to I.

The implication is that we always linearize the expression about the template T, but apply the (inverse of the) motion update ∆p to the image I. The consequence of this subtle detail is that T is always fixed, and thus the gradient operator ∇T only ever needs to be computed once [1]. This property extends to our regression framework, where the potentially expensive regressor training step can also happen just once, before alignment.

For non-linear multi-channel image operators, we can replace the gradient operator ∇T with a general matrix R,

  arg min_{∆p} ||T(W(x; p)) + R (∂W/∂p) ∆p − I(x)||²    (4)

The role of this matrix is to predict a descent direction for each pixel given context from other pixels and channels. The structure of the matrix determines the types of interactions that are exploited to compute the descent directions.
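To make the iteration of Eqns. 1–3 concrete, the following is a minimal sketch of inverse-compositional alignment for the simplest case: a pure translation warp (where ∂W/∂p is the identity) on raw pixel intensities. The function name and the translation-only simplification are ours, not from the paper.

```python
import numpy as np
from scipy.ndimage import map_coordinates, sobel

def ic_lk_translation(T, I, n_iters=50, tol=1e-4):
    """Inverse-compositional LK for a pure translation warp W(x; p) = x + p.

    T, I are 2-D float arrays. Returns p = (px, py) such that I(x + p) ~= T(x).
    """
    p = np.zeros(2)
    # Template gradients and Hessian are computed ONCE -- the key property of
    # the inverse-compositional formulation (Eqns. 2-3).
    gx = sobel(T, axis=1, mode='nearest') / 8.0   # dT/dx
    gy = sobel(T, axis=0, mode='nearest') / 8.0   # dT/dy
    G = np.stack([gx.ravel(), gy.ravel()], axis=1)  # N x 2 steepest-descent images
    H_inv = np.linalg.inv(G.T @ G)                  # 2 x 2 Hessian, also fixed
    ys, xs = np.mgrid[0:T.shape[0], 0:T.shape[1]]

    for _ in range(n_iters):
        # Warp I by the current parameters: Iw(x) = I(x + p).
        Iw = map_coordinates(I, [ys + p[1], xs + p[0]], order=1, mode='nearest')
        # Solve the linearized problem, then compose the inverted update.
        dp = H_inv @ (G.T @ (Iw - T).ravel())
        p -= dp
        if np.linalg.norm(dp) < tol:
            break
    return p
```

Note that only the interpolation (warp) and one small matrix-vector product are repeated per iteration; everything derived from the template is hoisted out of the loop.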
If the Jacobian is constant across all iterates – as is the case with affine transforms – it can be pre-multiplied with the regressor so that solving each linearization involves only a single matrix multiplication.

We now discuss a simple least squares strategy for learning R. If we consider only a translational warp, the expression of Eqn. 4 reduces to,

  arg min_{∆x} ||T(x) + R ∆x − I(x)||²    (5)

where ∆x = ∆p = (∆x, ∆y). That is, we want to find the step size along the descent direction that minimizes the appearance difference between the template and the image. If we instead fix the ∆x, we can solve for the R that minimizes the appearance difference,

  arg min_R Σ_{∆x ∈ D} ||T(x) + R ∆x − T(x + ∆x)||²    (6)

Here we have replaced I(x) with the template at the known displacement, T(x + ∆x). The domain of displacements D that we draw from for training balances small-displacement accuracy and large-displacement stability. Of course, least-squares regression is not the only possible approach. One could, for example, use support vector regression (SVR) when outliers are particularly problematic, with a commensurate increase in computational complexity.

Each regressor involves solving the system of equations:

  arg min_{R_i} Σ_{∆x ∈ D_i} ||T(x_i) + R_i ∆x − T(x_i + ∆x)||²    (7)

where i represents the i-th pixel location in the image. If the same domain of displacements is used for each pixel, the solution to this objective can be computed in closed form as,

  R*_i = (∆x ∆xᵀ + ρI)⁻¹ (∆xᵀ [T(x_i + ∆x) − T(x_i)])    (8)

The first thing to note is that (∆xᵀ∆x + ρI)⁻¹ is a 2 × 2 matrix that is shared across all pixels.
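Eqns. 7–8 can be implemented directly by writing out the sums over the displacement domain. Below is a sketch for a single pixel of a multi-channel feature image; the helper name, the square displacement grid and the ridge value are our own illustrative choices.

```python
import numpy as np

def learn_pixel_regressor(F, i, j, n=3, ridge=1e-3):
    """Least-squares descent regressor for pixel (i, j) of an H x W x K
    feature image F, over a (2n+1) x (2n+1) domain of displacements (Eqn. 7).

    Returns the K x 2 matrix R_i minimizing
        sum_{dx in D} || F[i, j] + R_i dx - F[(i, j) + dx] ||^2 .
    Assumes (i, j) lies at least n pixels from the image border.
    """
    D, Y = [], []
    for dy in range(-n, n + 1):
        for dx in range(-n, n + 1):
            if dx == 0 and dy == 0:
                continue
            D.append([dx, dy])                      # displacement sample
            Y.append(F[i + dy, j + dx] - F[i, j])   # appearance difference
    D = np.asarray(D, dtype=float)                  # M x 2
    Y = np.asarray(Y, dtype=float)                  # M x K
    # Closed form of Eqn. 8 (written K x 2): R = Y^T D (D^T D + rho I)^{-1}.
    return Y.T @ D @ np.linalg.inv(D.T @ D + ridge * np.eye(2))
```

On a symmetric grid D^T D is diagonal, so the 2 × 2 inverse is trivial and, as noted above, identical for every pixel.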
The ∆xᵀ [T(x_i + ∆x) − T(x_i)] term within the expression is just a sum of weighted differences between a displaced pixel and the reference pixel, i.e.,

  [ Σ_{∆x} Σ_{∆y} ∆x (T(x + ∆x, y + ∆y) − T(x, y)) ]
  [ Σ_{∆x} Σ_{∆y} ∆y (T(x + ∆x, y + ∆y) − T(x, y)) ]    (9)

Other regression-based methods of alignment such as [19] leverage tens of thousands of warped training examples during offline batch learning to produce fast runtime performance on a single object category (faces). We cannot afford such complexity if we're going to perform regression and alignment on arbitrary object categories without a dedicated training time.

If we sample ∆x on a regular grid that coincides with pixel locations, then Eqn. 9 can be cast as two filters – one each for horizontal weights ∆x, and vertical weights ∆y,

  f_x = [ x_{−n} … x_n ]        f_y = [ y_{−n} … y_{−n} ]
        [   ⋮         ⋮ ]              [   ⋮          ⋮  ]
        [ x_{−n} … x_n ]              [  y_n   …  y_n  ]    (10)

If the x and y domains are both equal and odd, the contribution of T(x, y) is cancelled out. This is clearly a generalization of the central difference operator, which considers a domain of [−1, 1],

  f_x = [−1 0 1]        f_y = [−1 0 1]ᵀ    (11)

Thus, an efficient realization for learning a regressor at every pixel in the image is,

  R = (∆xᵀ∆x + ρI)⁻¹ [f_x ∗ T(x)  f_y ∗ T(x)]    (12)

where ∗ is the convolution operator. For an image with N pixels, K channels and a warp with P motion parameters, the complexity of our image alignment can be stated as a single O(KN log KN + KNP) pre-computation of the regressor, followed by an O(KNP) matrix-vector multiply and image warp per iteration, with an overall linear rate of convergence.
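The filter form of Eqns. 10–12 makes the per-pixel regressors cheap to compute for a whole single-channel image at once. A sketch under our own naming, using correlation rather than convolution to avoid an explicit kernel flip:

```python
import numpy as np
from scipy.ndimage import correlate

def regressors_via_filtering(T, n=1, ridge=1e-3):
    """Per-pixel least-squares descent regressors for a single-channel image
    T via the filter form of Eqn. 12, over a (2n+1) x (2n+1) displacement grid.

    Returns (Rx, Ry): the horizontal and vertical regression coefficients
    at every pixel.
    """
    coords = np.arange(-n, n + 1, dtype=float)
    fx = np.tile(coords, (2 * n + 1, 1))   # horizontal weights dx (Eqn. 10)
    fy = fx.T                              # vertical weights dy
    # sum_dx dx * (T(x+dx) - T(x)): the -T(x) term cancels since sum_dx dx = 0.
    num_x = correlate(T, fx, mode='nearest')
    num_y = correlate(T, fy, mode='nearest')
    # On a symmetric grid the 2x2 normalizer is diagonal with a constant entry.
    s = (2 * n + 1) * np.sum(coords ** 2) + ridge
    return num_x / s, num_y / s
```

With n = 1 and the ridge term neglected, this reduces to a row-averaged central difference, matching Eqn. 11; larger n trades locality for stability over bigger displacements.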
Dense non-linear feature transforms can be viewed as mapping each scalar pixel in a (grayscale) image to a vector, ℝ → ℝᴷ. The added redundancy is required to decorrelate the various lighting transforms affecting the appearance of objects. Some feature transforms such as HOG [6] also introduce a degree of spatial insensitivity for matching dis-similar objects, though we find in practice that alignment performance is more sensitive to lighting than geometric effects (Fig. 1).

During alignment, spatial operations are applied across each channel independently. In particular, our regression formulation does not consider correlations between channels, so separate regressors can be learned on each feature plane of the image, then concatenated. This admits a highly efficient representation in the Fourier domain – the filters f_x and f_y only need to be transformed to the Fourier domain once per image, rather than once per channel.

To illustrate the benefit of applying non-linear transforms, we performed an alignment experiment between pairs of images with ground-truth registration, and progressively increased the initialization error, measuring the overall number of trials that converged back to ground-truth (within ε tolerance). Faces with labelled landmarks constitute a poor evaluation metric because of the proven capacity for pixel representations to perform well. Instead, we adopted the following strategy for defining ground-truth image pairs for general object classes: we manually sampled similar images from ImageNet and visually aligned them w.r.t. an affine warp, then ran both LK and SIFT Flow at the "ground-truth" and asserted they did not diverge from the initialization (refining the estimate and iterating where necessary).

For each value of the initialization error, we ran 1000 trials. Fig. 1 presents the results, with a representative pair of ground-truth images.
There is a progressive degradation in performance from SVR to least-squares regression to central differences on all of the Dense SIFT trials. Importantly, the pixel-based trials fail to converge even close to the ground-truth – the background distractors and differences between the zebras dominate the appearance, which results in incoherent descent predictions. At the other end of the spectrum, SVR consistently outperforms least-squares regression by a large margin, indicative that a large number of sample outliers exist over both small and large domain sizes. This highlights the benefit of treating alignment as a regression problem rather than computing numeric approximations to the gradient (i.e. central differences), and suggests that excellent performance can be achieved with a commensurate increase in computational complexity.

In all of our alignment experiments, we extract densely sampled SIFT features [12] on a regular grid with a stride of 1 pixel. We cross-validated the spatial aggregation (cell) size, and found 4 × 4 cells to perform best.

Fig. 1.
Pairwise (LK) alignment performance of different methods for increasing initialization error. The number after Dense SIFT indicates the spatial aggregation (cell) size of each SIFT descriptor. The domain is the limit of displacement magnitude from which training examples are gathered for the regressors, or the blur kernel size in the case of central differences. There is a progressive degradation in performance from SVR to least-squares regression to central differences on Dense SIFT. The pixel-based methods fail to converge even when close to the ground truth on challenging images such as the zebra.
Fig. 2.
Representative pairwise alignments. Column-wise from left to right: (i) The template region of interest. (ii) The image we wish to align to the template. The bounding box initialization covers ≈ 50% of the image area, to reflect the fact that objects of interest rarely fill the entire area of an image. (iii) The predicted region that best aligns the image to the template. The four examples exhibit robustness to changes in pose, rotation, scale and translation, respectively.
We test the performance of our algorithm on a range of animal object categories drawn from ImageNet. In Fig. 2, the first column is the template image. If no bounding box is shown, the whole image is used as the template. The second column shows the image we wish to align, with the initialization bounding the middle 50% of pixels – owing to the fact that photographers rarely frame the object of interest to consume the entire image area. The third column shows the converged solutions. In all of the cases shown, pixel-based representations failed to converge.
The task of unsupervised localization is to discover the bounding boxes of objects of interest in a corpus of images with only their object class labelled. In approaches such as Object Centric Pooling [16], a detector is optimized jointly with the estimated locations of bounding boxes. Importantly, bounding box candidates are sampled in a multi-scale sliding-window manner, perhaps across a fixed number of aspect ratios. Exhaustive search cannot handle more complex search spaces, such as rotations.

Gradient-based methods derived from the Lucas Kanade algorithm such as least squares congealing [4] and RASL [14] have performed well on constrained domains (e.g. faces, digits, building façades), but not on general object categories. Here we show that our feature regression framework can be applied to perform unsupervised localization.

The RASL algorithm performs alignment by attempting to minimize the rank of the overall stack. This only applies to linearly correlated images, however. General object categories that exhibit large appearance variation and articulated deformations are unlikely to form a low-rank basis even when aligned. The introduction of feature transforms also explodes the dimensionality of the problem, making SVD computation infeasible. Finally, RASL has a narrow basin of convergence, requiring that the misalignment can be modelled by the error term so that the low rank term is not simply an average of images in the stack (which is known to result in poor convergence properties [4]).

We therefore present results using the least squares congealing algorithm. It scales to large numbers of feature images, shares the same favourable inverse compositional properties as Lucas Kanade, and is robust to changes in illumination via dense SIFT features.

Fig. 3 shows the results of aligning a set of elephants. Recall that there is no oracle or ground truth – the elephants are "discovered" merely as the region that aligns most consistently across the entire image stack. Fig. 4 illustrates the stack mean before and after congealing. Even though individual elephants appear in different poses, the aligned mean clearly elicits an elephant silhouette.
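To convey the flavour of the congealing loop, here is a heavily simplified sketch: translation-only warps, raw intensities in place of Dense SIFT channels, and one Gauss-Newton step per image per sweep, each image being pulled toward the leave-one-out mean of the stack. All names and simplifications are ours; the experiments above use the full least squares congealing algorithm of [4] with affine warps and SIFT features.

```python
import numpy as np
from scipy.ndimage import shift as nd_shift, sobel

def congeal_translations(images, n_sweeps=20):
    """Minimal least-squares congealing: jointly estimate one translation per
    image so the stack agrees in a least-squares sense. Returns per-image
    (dy, dx) shifts with zero mean (the global translation is ambiguous).
    """
    ps = np.zeros((len(images), 2))
    for _ in range(n_sweeps):
        warped = [nd_shift(im, p, order=1, mode='nearest')
                  for im, p in zip(images, ps)]
        total = np.sum(warped, axis=0)
        for k, wk in enumerate(warped):
            ref = (total - wk) / (len(images) - 1)   # leave-one-out mean
            gy = sobel(ref, axis=0, mode='nearest') / 8.0
            gx = sobel(ref, axis=1, mode='nearest') / 8.0
            G = np.stack([gy.ravel(), gx.ravel()], axis=1)
            err = (wk - ref).ravel()
            # One Gauss-Newton step pulling image k toward the ensemble mean.
            dp = np.linalg.solve(G.T @ G + 1e-6 * np.eye(2), G.T @ err)
            ps[k] += dp
    return ps - ps.mean(axis=0)
```

Removing the mean shift at the end pins down the trivial degree of freedom where the whole stack drifts together; in the full algorithm the same ambiguity is handled for richer warps.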
Fig. 3.
The results of unsupervised ensemble alignment (congealing) on a set of 170 elephants taken from ImageNet. The objective is to jointly minimize the appearance difference between all of the images in a least-squares sense – no prior information or training is required. The first 6 rows present exemplar images from the set that converged. The final row presents a number of failure cases.
Fig. 4.
The mean image (i) before alignment and, (ii) after alignment w.r.t. an affine warp. Although individual elephants undergo different non-rigid deformations, one can make out an elephant silhouette in the aligned mean.
Image alignment is a fundamental problem for many computer vision tasks, however a large portion of the research that has focussed on alignment in the facial domain has not generalized well to broader image categories. As a result, exhaustive search strategies have dominated general image alignment. In this paper, we showed that regression over image features could be used within a Lucas Kanade framework to robustly align instances of objects differing in pose, illumination, size and position, and presented a range of results from ImageNet categories. We also demonstrated an example of unsupervised image alignment, whereby the appearance of an elephant was automatically discovered in a large number of images. Our future work aims to parametrize more complex warps so that objects can be matched across greater pose and viewpoint variation.
References
1. S. Baker and I. Matthews. Equivalence and Efficiency of Image Alignment Algorithms. International Conference of Computer Vision and Pattern Recognition (CVPR), pages 1090–1097, 2001.
2. S. Baker and I. Matthews. Lucas-Kanade 20 Years On: A Unifying Framework. International Journal of Computer Vision (IJCV), 56(3):221–255, Feb. 2004.
3. T. Cootes and C. Taylor. Statistical models of appearance for computer vision. 2004.
4. M. Cox, S. Sridharan, and S. Lucey. Least-squares congealing for large numbers of images. International Conference on Computer Vision (ICCV), 2009.
5. J. Dai, Y. Wu, J. Zhou, and S. Zhu. Cosegmentation and cosketch by unsupervised learning. International Conference on Computer Vision (ICCV), pages 1305–1312, Dec. 2013.
6. N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. International Conference of Computer Vision and Pattern Recognition (CVPR), pages 886–893, 2005.
7. P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan. Object detection with discriminatively trained part-based models. Pattern Analysis and Machine Intelligence (PAMI), 32(9):1627–45, Sept. 2010.
8. G. B. Huang, V. Jain, and E. Learned-Miller. Unsupervised Joint Alignment of Complex Images. International Conference on Computer Vision (ICCV), pages 1–8, 2007.
9. H. Jia, G. Wu, Q. Wang, and D. Shen. ABSORB: Atlas building by self-organized registration and bundling. International Conference of Computer Vision and Pattern Recognition (CVPR), pages 2785–2790, 2010.
10. V. Kazemi and S. Josephine. One Millisecond Face Alignment with an Ensemble of Regression Trees. International Conference of Computer Vision and Pattern Recognition (CVPR), 2014.
11. J. Lankinen and J. Kamarainen. Local Feature Based Unsupervised Alignment of Object Class Images. British Machine Vision Conference (BMVC), pages 107.1–107.11, 2011.
12. C. Liu, J. Yuen, and A. Torralba. SIFT flow: dense correspondence across scenes and its applications. Pattern Analysis and Machine Intelligence (PAMI), 33(5):978–94, May 2011.
13. T. Ojala. Multiresolution gray-scale and rotation invariant texture classification with local binary patterns. Pattern Analysis and Machine Intelligence (PAMI), pages 1–35, 2002.
14. Y. Peng, A. Ganesh, J. Wright, W. Xu, and Y. Ma. RASL: robust alignment by sparse and low-rank decomposition for linearly correlated images. Pattern Analysis and Machine Intelligence (PAMI), 34(11):2233–46, Nov. 2012.
15. S. Ren, X. Cao, Y. Wei, and J. Sun. Face Alignment at 3000 FPS via Regressing Local Binary Features. International Conference of Computer Vision and Pattern Recognition (CVPR), 1(1):1–8, 2014.
16. O. Russakovsky, Y. Lin, K. Yu, and L. Fei-Fei. Object-centric spatial pooling for image classification. European Conference on Computer Vision (ECCV), 2012.
17. J. Saragih and R. Goecke. Iterative error bound minimisation for AAM alignment. International Conference on Pattern Recognition (ICPR), pages 20–23, 2006.
18. E. Simoncelli and B. Olshausen. Natural Image Statistics and Neural Representation. Annual Review of Neuroscience, 2001.
19. X. Xiong and F. De la Torre. Supervised Descent Method and Its Applications to Face Alignment. International Conference of Computer Vision and Pattern Recognition (CVPR), pages 532–539, June 2013.
20. S. Ying, G. Wu, Q. Wang, and D. Shen. Hierarchical unbiased graph shrinkage (HUGS): a novel groupwise registration for large data set. NeuroImage, 84:626–38, Jan. 2014.
21. X. Zhu and D. Ramanan. Face detection, pose estimation, and landmark localization in the wild. International Conference of Computer Vision and Pattern Recognition (CVPR), 2012.