Unsupervised Feature Learning for Dense Correspondences across Scenes
Chao Zhang · Chunhua Shen · Tingzhi Shen

v1 July 2014; v2 December 2014; v3 April 2015
Abstract
We propose a fast, accurate matching method for estimating dense pixel correspondences across scenes. It is a challenging problem to estimate dense pixel correspondences between images depicting different scenes or instances of the same object category. While most such matching methods rely on hand-crafted features such as SIFT, we learn features from a large amount of unlabeled image patches using unsupervised learning. Pixel-layer features are obtained by encoding over the dictionary, followed by spatial pooling to obtain patch-layer features. The learned features are then seamlessly embedded into a multi-layer matching framework. We experimentally demonstrate that the learned features, together with our matching model, outperform state-of-the-art methods such as the SIFT flow [1], coherency sensitive hashing [2] and the recent deformable spatial pyramid matching [3] methods both in terms of accuracy and computation efficiency. Furthermore, we evaluate the performance of a few different dictionary learning and feature encoding methods in the proposed pixel correspondence estimation framework, and analyze the impact of dictionary learning and feature encoding with respect to the final matching performance.
C. Zhang
Beijing Institute of Technology, Beijing 100081, China
The University of Adelaide, SA 5005, Australia

C. Shen (✉)
The University of Adelaide, SA 5005, Australia
Australian Centre for Robotic Vision
E-mail: [email protected]
T. ShenBeijing Institute of Technology, Beijing 100081, China
1 Introduction

Estimating the dense correspondence between two images across scenes is an important task, which has many applications in computer vision and computational photography. Yet, it is a challenging problem due to large variations exhibited in the matching images. Conventional dense matching methods developed for optical flow and stereo usually only work well for the cases in which the two input images contain different views of the same object. Here we are interested in dense matching of images with different objects or scenes. This requires the matching algorithms to be highly robust to different object appearances and backgrounds, illumination changes, large displacements and viewpoint changes. For the task of matching objects in a specific category, the intra-class variability can be larger than the inter-class differences.

Recently a few methods were proposed to address these challenges, including hierarchical matching [1], fast patch matching [2, 4], sparse-to-dense matching [5] and most recently spatial pyramid matching [3]. Current matching approaches typically rely on either raw image patches or hand-designed image features (e.g., SIFT features [6]). Raw pixels or patches often lack the robustness to cope with those challenging appearance variations. Given a particular task, in order to model complex real-world data, robust and distinctive feature descriptors that can capture relevant information are needed. Hand-crafted features like SIFT have achieved great success in many vision tasks such as image classification [7], retrieval, and image matching. As SIFT features have passed the test of time, SIFT, first introduced more than a decade ago [6], is considered one of the milestone results in computer vision. Despite the remarkable success in a number of applications, SIFT is criticized for drawbacks such as its large computational burden and its inability to accommodate affine viewpoint transformations well. Researchers have been seeking improved feature descriptors. However, manually designing features for each data set and task can be very expensive, time-consuming, and typically requires domain knowledge of the data. In recent years, researchers observed that, instead of manually designing features using heuristics, learning features from a large amount of unlabeled data with unsupervised machine learning approaches achieves tremendous success in various applications. For example, in visual recognition the unsupervised feature learning pipeline has now become the common approach [8, 9]. Feature learning is attractive as it exploits the availability of data and avoids the need for feature engineering [7]. The main advantage of unsupervised feature learning is that unlabeled domain-specific data are usually abundant and very cheap to obtain. Inspired by the success of [3, 7, 9, 10], we propose unsupervised feature learning for dense pixel correspondence estimation within a multi-layer matching framework. The outline of our multi-layer model is illustrated in Figure 1.

In our framework, features at the bottom layer (namely, the pixel layer) are extracted from raw image patches using unsupervised feature learning methods.
We then obtain more compact representations of larger-size nodes at higher-level layers, which achieve better robustness to noise and clutter and thus deal better with severe variations in object or scene appearances. Larger spatial nodes with more compact features provide better geometric regularization when the matching objects undergo large appearance variations, while smaller spatial nodes with more detailed features obtain finer correspondence. Our matching starts from the top layer (i.e., the grid-cell layer). The matching solution of a higher layer provides reliable initial correspondences to the lower layer.

We apply several well-known unsupervised feature learning algorithms to extract pixel layer features. Then we present a detailed analysis of the impact of various parameters and configurations of our framework—the matching model as well as the unsupervised feature learning techniques. Despite the simplicity of our system, our framework outperforms all previously published matching accuracy on the Caltech-101 dataset, the LMO dataset [11], and a subset of the Pascal dataset [12]. Our results demonstrate that it is possible to achieve state-of-the-art performance by using a tailored matching framework, even with simple unsupervised feature learning techniques.

Our main contributions are thus as follows.

– We apply unsupervised feature learning to the problem of dense pixel correspondence estimation, rather than using hand-designed features. Experimental results show that our method outperforms recent state-of-the-art methods [1–3] in terms of both accuracy and running time. Our experiments demonstrate that the learned features can well handle variations of different factors.
– Inspired by the recent development in multi-layer networks and deep learning methods, we perform matching at several levels of the image representations (grid-cell layer, patch layer, pixel layer). Our multi-layer matching model, designed for fast and accurate matching, is suitable for the multi-layer unsupervised feature learning pipeline.
– We use the patch-layer feature as the basic unit to estimate correspondences in the patch-layer matching, such that the computation is considerably faster (due to less time spent on feature extraction and fewer variables to optimize) while still keeping the desirable power of learned features. Matching results at the patch layer already outperform state-of-the-art methods in the literature in terms of both matching accuracy and efficiency.
– We evaluate the performance of a few dictionary learning and feature encoding methods in the proposed pixel correspondence estimation framework. Moreover, we study the effect of parameter choices on the features learned by several feature learning methods. Several important conclusions are drawn, which differ from the case of unsupervised feature learning for image classification [8].

1.1 Related work

We briefly review some relevant work in dense matching and unsupervised feature learning. Estimation of dense correspondences between images is essential for many computer vision tasks such as image registration, segmentation [13], stereo matching and object recognition [14]. It is challenging to estimate dense correspondences between images that contain different scenes. Graph matching algorithms [14–16] were introduced to find the dense correspondence. Typically these methods use sparse features and rely on geometric relationships between nodes.
Optical flow methods have been used to estimate the motion field and dense correspondence in the literature. Recently, SIFT Flow [1] adopted the computational framework of optical flow and produces dense pixel-level correspondences by matching SIFT descriptors instead of raw pixel intensities. A coarse-to-fine matching scheme is used in their method to speed up the matching procedure.
Fig. 1  Illustration of our multi-layer matching and unsupervised feature learning model. The first column shows the feature extraction process of each layer. The second column shows the node structure of each layer. The third column outlines the matching pipeline. The learned features at the pixel-level layer within a patch are spatially pooled to form a patch-level feature. Here the grid-cell feature is the concatenation of patch-level features within a cell. The matching result from the grid-cell layer guides the matching at the patch-level layer, and the result at the patch-level layer guides the matching at the pixel-level layer. In our experiments, the matching accuracy obtained by the patch-level layer is already very high; pixel-layer matching can further improve the matching accuracy.

Kim et al. [3] proposed a deformable spatial pyramid (DSP) model for fast dense matching. Their model regularizes matching consistency through a pyramid graph. The matching cost in DSP is defined using multiple SIFT descriptors. PatchMatch [4] and the more recent coherency sensitive hashing (CSH) [2] are much faster in finding the matching patches between two images, but abandon explicit geometric smoothness regularization for speed, which may lead to noisy matching results due to the neglect of pixels' geometric relations. Leordeanu et al. [5] proposed to extend sparse matching to dense matching by introducing local constraints.

Image matching in general consists of two components: local image feature extraction and feature matching. First, one must define the image features based upon which the correspondence between a pair of images can be established. An ideal image descriptor should be robust so that it does not change from one image to another. Many methods use SIFT features as local descriptors because of their robustness to scale and illumination changes, etc. Recent work showed that, to some extent, carefully designed descriptors may improve the matching results [17, 18]. In [18], it is shown that SIFT features extracted at multiple scales lead to much better matches than single-scale features. All these features were manually designed. Instead, our work here is inspired by the feature learning approaches that first appeared in image classification.

In recent years, a large body of work on generic image classification/categorization has focused on learning features in an unsupervised fashion [8, 9]. Unsupervised feature learning (or deep learning by stacking unsupervised feature learning) has emerged as a promising technique for designing task-specific features by exploiting a large amount of unlabeled data [19]. The main purpose of unsupervised feature learning is to design low-dimensional features that capture some structure underlying the high-dimensional input data.
Typical unsupervised feature learning methods include independent component analysis [20], auto-encoders [21], sparse coding [22, 23], (nonnegative) matrix factorization [23, 24], and a few clustering methods [25]. In terms of large-scale sparse coding and matrix factorization based feature learning, an online optimization algorithm based on stochastic approximations was introduced in [23].

Low-level image alignment tasks such as dense stereo matching, which share similarity with the matching task that we are concerned with here, often use hand-crafted local image descriptors [26]. Traditional local feature descriptors like SIFT have shown their value for dense wide-baseline matching, but with limited success, mainly because of their high computational cost and sensitivity to occlusions. The SURF feature [27] tries to speed up the computation of local features. In [28], Tola et al. designed the DAISY feature for fast and accurate wide-baseline stereo matching; the DAISY feature attempts to solve both the computation and occlusion problems in stereo matching. Another computationally cheap local feature descriptor is a modified version of the local binary pattern (LBP) feature [29]. In sparse matching experiments, Heikkilä et al. have shown that the LBP descriptor performs favorably compared to SIFT [29]. Estimating the dense correspondence between images depicting different scenes, which we are concerned with here, is a much more challenging problem than dense stereo matching. To our knowledge, for dense correspondence estimation across scenes, the SIFT feature is still the standard to date due to its very good performance.

Our method is closely related to the approach of [3], which applies only two layers of matching, namely the grid-cell layer and the pixel layer. There, both layers are represented by the same type of features. They utilize sparse sampling to reduce the complexity and expense of large-node representations, which may cause loss of discriminative information. Instead of using SIFT features as the descriptor as in [3], we learn features from a large amount of small patches, which are randomly extracted from natural images. In our method, dense matching is performed at several levels of the image representations (grid-cell layer, patch layer, pixel layer). For each layer, we obtain suitable features to represent image nodes. Compared to the bottom layer, features in higher layers are extracted to achieve more robustness to noise and clutter and a more compact representation. By using the max-pooling operation, we obtain more compact representations of larger image nodes while removing irrelevant details. We demonstrate the efficiency and effectiveness of the learned features over hand-crafted features for the dense matching task.

The second approach that has inspired our work is [10]. Its main idea is to combine unsupervised joint alignment with unsupervised feature learning. Huang et al. used unsupervised feature learning, in particular deep belief networks (DBNs), to obtain features that can represent an image at different resolutions based on network depth [10]. There are major differences between our work and [10], although both use unsupervised feature learning. Huang et al. considered the problem of congealing alignment of images, which is to estimate a parametric image transform. One only needs to optimize for a small number of continuous variables (typically the rotation matrix and translation), so the number of variables is independent of the image size [10].
In Huang et al. [10], gradient descent is employed for this purpose. In contrast, we estimate nonparametric correspondences at the pixel level. The optimization problem involved in our task is a much more challenging discrete combinatorial problem: it involves thousands of discrete variables. Thus we use belief propagation to obtain a locally optimal solution. It is unclear how the method of [10] could be applied to dense correspondences.
2 Our approach

In this section, we first describe how to extract features for each level in Section 2.1. Then, we present our framework in detail in Section 2.2.

2.1 Multi-layer image representations

In this work, we follow the standard unsupervised feature learning framework in [8, 9], which has been successfully applied to generic image classification. A common feature learning framework performs the following steps to obtain feature representations:

1. The dictionary learning step learns a dictionary using unsupervised learning algorithms; the dictionary is used to map input patch vectors to new feature vectors.
2. The feature encoding step extracts features from image patches centered at each pixel using the dictionary obtained in the first step.
3. The pooling step spatially pools features together over local regions of the images to obtain more compact feature representations.

For classification tasks, the learned features are then used to train a classifier for predicting labels. In our case, we estimate dense correspondences in a multi-layer matching framework using the learned multi-level feature representations.
Next, we briefly review the pipeline of the feature learning framework. As mentioned above, there are three key components: dictionary learning, feature encoding and the pooling operation.
To learn a dictionary, we first extract $N$ random patches $\mathbf{x}_i$ from a collection of natural images as training data, $\mathbf{x}_i \in \mathbb{R}^n$ ($i = 1, 2, \ldots, N$), and then pre-process these patches as described in [9]. Every patch is normalized by subtracting the mean and dividing by the standard deviation of its elements, which is equivalent to local brightness and contrast normalization. We also apply the whitening process as done in [9]. Then we use an unsupervised learning algorithm to construct the dictionary $\mathbf{D} = [\mathbf{d}_1, \mathbf{d}_2, \ldots, \mathbf{d}_M] \in \mathbb{R}^{n \times M}$. Here $M$ is the dictionary size, and each column $\mathbf{d}_j$ is a codeword.

We consider the following methods for learning the dictionary $\mathbf{D}$:

(a) K-means clustering: We learn $M$ centroids $\{\mathbf{d}_j\}$, $j = 1, 2, \ldots, M$, from the sampled patches. K-means has been widely adopted in computer vision for building codebooks but less widely used in 'deep feature learning' [9]. This may be due to the fact that K-means is less effective when the input vectors are of high dimension.

(b) K-SVD [30]: The dictionary is trained by solving the following optimization problem using alternating minimization:

  $\min_{\mathbf{D}, \mathbf{S}} \sum_{i=1}^{N} \|\mathbf{x}_i - \mathbf{D}\mathbf{s}_i\|_2^2, \quad \text{s.t.}\ \|\mathbf{s}_i\|_0 \le k,\ \forall i,$   (1)

where $\mathbf{S} = [\mathbf{s}_1, \mathbf{s}_2, \ldots, \mathbf{s}_N] \in \mathbb{R}^{M \times N}$ are the sparse codes. $\|\mathbf{s}_i\|_0$ is the number of non-zero elements in $\mathbf{s}_i$, which enforces the code's sparsity. Here $\|\cdot\|_0$ and $\|\cdot\|_2$ are the $L_0$ and $L_2$ norms respectively. Note that to solve for $\mathbf{S}$, one usually seeks an approximate solution because the optimization problem is NP-hard.

(c) Random sampling (RA): $M$ patches are randomly picked from the $N$ patches to form a dictionary. Therefore no learning/optimization is performed in this case.

Later we test these three dictionary learning methods on the problem of dense matching.

After obtaining the dictionary $\mathbf{D}$, we extract patches centered at each pixel of the pair of matching images after applying pre-processing. The patch vector $\mathbf{x}_i$ is encoded to generate the feature vector $\mathbf{s}_i$ at the pixel layer. We consider the following coding methods in this work:

(a) K-means triangle (KT) encoding: This can be viewed as a 'soft' encoding method that still keeps the sparsity of codes. With the $M$ basis vectors $\{\mathbf{d}_j\}$ learned in the first stage, KT encodes the patch $\mathbf{x}_i$ as

  $s_{ij} = \max\{0,\ \mu(\mathbf{z}) - z_j\},$

where $s_{ij}$ is the $j$-th component of the feature vector $\mathbf{s}_i$, $z_j = \|\mathbf{x}_i - \mathbf{d}_j\|_2$ and $\mu(\mathbf{z})$ is the mean of $\mathbf{z}$. With this encoder, roughly half of the features are set to 0.

(b) Soft-assignment (SA) encoding:

  $s_{ij} = \dfrac{\exp(-\beta \|\mathbf{x}_i - \mathbf{d}_j\|_2^2)}{\sum_{l=1}^{M} \exp(-\beta \|\mathbf{x}_i - \mathbf{d}_l\|_2^2)}.$   (2)

Here $\beta$ is the smoothing factor controlling the softness of the assignment.

(c) Orthogonal matching pursuit (OMP-k) encoding: Given the patch $\mathbf{x}_i$ and dictionary $\mathbf{D}$, we use OMP-k [8] to obtain the feature $\mathbf{s}_i$, which has at most $k$ non-zero elements:

  $\min_{\mathbf{S}} \sum_{i=1}^{N} \|\mathbf{x}_i - \mathbf{D}\mathbf{s}_i\|_2^2, \quad \text{s.t.}\ \|\mathbf{s}_i\|_0 \le k,\ \forall i,$   (3)

where $\|\mathbf{s}_i\|_0$ is the number of non-zero elements in $\mathbf{s}_i$. This explicitly enforces the code's sparsity.

We mainly use K-means for dictionary learning and K-means triangle (KT) for encoding in our experiments. In the last part of Section 3.3, we evaluate the performance of the different learning and encoding methods mentioned above in dense correspondence estimation.
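To make the pipeline concrete, the following is a minimal NumPy/scikit-learn sketch of the pre-processing, K-means dictionary learning, and the KT and SA encoders described above. The function names, the normalization and whitening constants, and the use of scikit-learn's KMeans are illustrative assumptions on our part, not the authors' implementation; OMP-k encoding is omitted for brevity.

```python
import numpy as np
from sklearn.cluster import KMeans

def preprocess(X, eps_norm=10.0, eps_zca=0.1):
    # Per-patch brightness/contrast normalization, then ZCA whitening [9].
    # X: (N, n) matrix of flattened patches; the eps values are illustrative.
    X = X - X.mean(axis=1, keepdims=True)
    X = X / np.sqrt(X.var(axis=1, keepdims=True) + eps_norm)
    cov = np.cov(X, rowvar=False)
    U, S, _ = np.linalg.svd(cov)
    W_zca = U @ np.diag(1.0 / np.sqrt(S + eps_zca)) @ U.T
    return X @ W_zca, W_zca

def kmeans_dictionary(X, M=100, seed=0):
    # Dictionary atoms = the M K-means centroids of the pre-processed patches.
    km = KMeans(n_clusters=M, n_init=3, random_state=seed).fit(X)
    return km.cluster_centers_            # D: (M, n), one codeword per row

def kt_encode(X, D):
    # K-means triangle encoding: s_ij = max(0, mu(z) - z_j), with
    # z_j = ||x_i - d_j||_2; roughly half of the entries become zero.
    z = np.linalg.norm(X[:, None, :] - D[None, :, :], axis=2)   # (N, M)
    return np.maximum(0.0, z.mean(axis=1, keepdims=True) - z)

def sa_encode(X, D, beta=1.0):
    # Soft-assignment encoding (eq. (2)): dense, non-sparse codes.
    d2 = ((X[:, None, :] - D[None, :, :]) ** 2).sum(axis=2)
    E = np.exp(-beta * d2)
    return E / E.sum(axis=1, keepdims=True)
```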
The general objective of pooling is to transform the joint feature representation into a new, more usable one that preserves important information while discarding irrelevant details [31]. Spatially pooling features over a local neighbourhood to create invariance to small transformations of the input is employed in a large number of models of visual recognition. The pooling operation is typically a sum, an average, a max, or more rarely some other commutative combination rule. In this paper, we apply the max-pooling operation to obtain the patch-layer features.

The pixel feature $\mathbf{s}_i \in \mathbb{R}^M$ is the code of the patch centered at pixel $i$, which is obtained at the feature encoding step. At the patch layer, the image is partitioned using a uniform grid into non-overlapping square patches. Each patch feature $\mathbf{f} = [f_1, \cdots, f_j, \cdots, f_M] \in \mathbb{R}^M$ is obtained by max-pooling all pixel features within that patch, which is simply the component-wise maximum over all pixel features within a patch $P$:

  $f_j = \max_{i \in P} s_{ij},$   (4)

where $i$ ranges over all pixels in the image patch $P$. Thus each patch feature has the same dimension as the pixel features. Note that the max-pooling operation is non-linear. It captures the main character of the pixels in the patch while maintaining the feature length and reducing the number of features. Detailed discussions on the impact of feature learning methods are presented in Section 3.3.

The grid-cell layer is built on the patch-level layer. The structure of the grid-cell layer is a spatial pyramid, as shown in Figure 2, so it can contain multiple levels. Each level contains a number of cells at different resolutions. The cell size starts from the whole image (the top level in Figure 2) down to a certain spatial size according to the number of pyramid levels. A cell node is much larger than an image patch and may offer greater regularization when appearance matches are ambiguous. Each cell is represented by a grid-cell feature, called a cell node. Grid-cell features are formed by concatenating the patch layer features within a cell. For all the experiments in Section 3, we use the 3-level pyramid as shown in Figure 2. At the top level of the pyramid, there is a single cell node which is the whole image. The middle level contains 4 equal-size non-overlapping cell nodes; and the bottom level has 16 cell nodes.

To demonstrate the great potential of unsupervised feature learning techniques in the dense matching task, which generally requires features to preserve visual details, we decide not to use complex learning algorithms and models. Instead, we employ simple learning algorithms and design a tailored matching model for the learned features to estimate pixel-level correspondences. The experiments in Section 3 demonstrate that it is possible to achieve state-of-the-art performance even with simple unsupervised feature learning algorithms.
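To illustrate how the pixel-layer codes and the pooling of eq. (4) fit together, here is a hedged sketch that reuses kt_encode and the whitening transform from the previous listing. Grayscale input, edge padding, and the 11 × 11 patch size are assumptions based on the experimental settings reported later, not a transcription of the authors' code.

```python
import numpy as np

def pixel_feature_map(gray, D, W_zca, p=11):
    # Encode the p x p patch around every pixel of a grayscale image,
    # giving an (H, W, M) map of pixel-layer codes. W_zca is the whitening
    # transform learned on the training patches (see preprocess above).
    H, W = gray.shape
    r = p // 2
    pad = np.pad(gray, r, mode='edge')
    X = np.stack([pad[y:y + p, x:x + p].ravel()
                  for y in range(H) for x in range(W)]).astype(float)
    X = X - X.mean(axis=1, keepdims=True)
    X = X / np.sqrt(X.var(axis=1, keepdims=True) + 10.0)
    S = kt_encode(X @ W_zca, D)                     # (H * W, M)
    return S.reshape(H, W, -1)

def max_pool_patches(S, p):
    # Eq. (4): component-wise max of pixel codes over non-overlapping
    # p x p patches. S: (H, W, M) -> patch features of shape (H//p, W//p, M).
    H, W, M = S.shape
    Hp, Wp = H // p, W // p
    S = S[:Hp * p, :Wp * p].reshape(Hp, p, Wp, p, M)
    return S.max(axis=(1, 3))
```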
2.2 The proposed matching framework

Our matching model consists of three layers: the grid-cell layer, the patch-level layer and the pixel-level layer.

(a) Model structure:
The grid-cell layer is the top layer, which is a conventional spatial pyramid (we use a 3-level pyramid for all the experiments in this work). The cell size starts from the whole image down to the pre-defined cell size. Grid-cell node features are formed by concatenating the patch-level features within a cell (see the code sketch below). The patch-level layer lies underneath the grid-cell layer. The bottom layer is the pixel-level layer.

(b) Node definition:
To be clear, in our model, at the grid-cell layer each cell can be seen as a node. At the patch-level layer, each patch represents a node. At the pixel-level layer, a single pixel is a node.

(c) Node linkage:
In the pyramid of the grid-cell layer, each node links to the neighboring nodes within the same pyramid level as well as to the parent and child nodes in the adjacent pyramid levels, as shown in Figure 2. We define the node at the higher-level layer as the parent of the nodes within its spatial extent at the lower layer. For the bottom two layers, namely the patch-level layer and the pixel-level layer, each node is only linked to its parent node.

Figure 1 shows our matching pipeline. Our matching process starts from the grid-cell layer matching. At this layer, the matching cost and geometric regularization are considered for pyramid nodes of different spatial extents. Matching results of the grid-cell layer guide the patch-level matching. In other words, results of the grid-cell layer offer reliable initial correspondences for the patch-level matching, as larger spatial supports generally provide better robustness to image variations. At the patch-level layer, we estimate the correspondences between the patch nodes of the image pair. Guided by the grid-cell matching results, the patch-level matching in our framework can already achieve high matching accuracy and efficiency. In [1, 3] the authors sub-sample pixels to reduce the computation cost, which may lead to suboptimal solutions. In contrast, we do not need to sub-sample pixels because the patch layer matching in our framework is extremely fast, which is one of the major advantages of our method. At the pixel-level layer, the pixel matching refines the results of the patch-layer matching. Figure 3 shows a comparison of matching results from the pixel layer and the patch layer. We can see that the pixel layer matching provides finer visual results at a heavier computation cost.
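The following sketch shows one plausible way to form the grid-cell node features from the patch-feature map produced by max_pool_patches above. The non-overlapping cell layout and the row-major concatenation order are our own assumptions for illustration.

```python
def grid_cell_features(F, c):
    # Concatenate the patch features inside each non-overlapping c x c cell
    # of the patch grid. F: (Hp, Wp, M) -> (Hp//c, Wp//c, c*c*M) cell nodes.
    Hp, Wp, M = F.shape
    Hc, Wc = Hp // c, Wp // c
    F = F[:Hc * c, :Wc * c].reshape(Hc, c, Wc, c, M)
    return F.transpose(0, 2, 1, 3, 4).reshape(Hc, Wc, c * c * M)
```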
For the grid-cell layer, $\mathbf{q}_i$ denotes the center coordinates of the patches within cell node $i$. Let $\mathbf{t}_i = (u_i, v_i)$ be the translation of node $i$ from the test image to the exemplar image. By minimizing the energy function (5), we obtain the optimal translation of each node in the grid-cell layer.
Fig. 2  Illustration of the pyramid structure and node connections of the grid-cell layer. Note that the grid-cell layer can contain multiple pyramid levels (here we have 3 levels).
Fig. 3  Patch-level matching results vs. pixel-level matching results of our method. (a) and (b) are the test image and exemplar image respectively. Images (c) and (d) are the patch-level result and the pixel-level matching result of our method, respectively. We can see that the pixel-level matching provides refined visual results.

The objective function is defined as:

  $E(\mathbf{t}) = \sum_i D_i(\mathbf{t}_i) + \alpha \sum_{i,j \in \mathcal{N}} V_{ij}(\mathbf{t}_i, \mathbf{t}_j),$   (5)

where $D_i$ is the data term, $V_{ij}$ is the smoothness term, and $\alpha$ is a constant weight. $\mathcal{N}$ represents the node pairs linked by graph edges between neighboring nodes and between parent-child nodes in the spatial pyramid. In the above equation, the data term $D_i(\mathbf{t}_i)$ is defined as:

  $D_i(\mathbf{t}_i) = \frac{1}{z} \min\big(\|\mathbf{f}^t(\mathbf{q}_i) - \mathbf{f}^e(\mathbf{q}_i + \mathbf{t}_i)\|_1,\ \lambda\big),$   (6)

where $z$ is the total number of patches included in the grid-cell node $i$, $\mathbf{f}^t(\mathbf{q}_i)$ is the cell node feature of the test image centered at coordinate $\mathbf{q}_i$, and $\mathbf{f}^e(\mathbf{q}_i + \mathbf{t}_i)$ is the cell node feature of the exemplar image under the translation $\mathbf{t}_i$. $D_i(\mathbf{t}_i)$ measures the similarity between node $i$ and the corresponding node in the exemplar image according to the translation $\mathbf{t}_i$. Here $\lambda$ is a truncation threshold on the feature distance; we set it to the mean distance of pairwise pixels between the image pair. $\|\cdot\|_1$ is the $L_1$ norm.

Second, the smoothness term is defined as:

  $V_{ij}(\mathbf{t}_i, \mathbf{t}_j) = \min(\|\mathbf{t}_i - \mathbf{t}_j\|_1,\ \gamma),$   (7)

which penalizes the matching location discrepancies among neighboring nodes. We use loopy belief propagation (BP) to find the optimal correspondence of each node. Although BP is not guaranteed to converge, it has been applied with much experimental success [32]. In our objective function, truncated $L_1$ norms are used for both the data term and the smoothness term; the truncated smoothness term accounts for matching outliers. As in [3, 33], for BP we use a generalized distance transform technique such that the computational cost of message passing between nodes is reduced.

For the patch-level layer, each patch links to its parent node in the grid-cell layer. $\mathbf{q}_i$ denotes the center coordinate of each patch in the patch-level layer. The patch's optimal translation can be obtained by:

  $D_i(\mathbf{t}) = \min\big(\|\mathbf{f}^t(\mathbf{q}_i) - \mathbf{f}^e(\mathbf{q}_i + \mathbf{t})\|_1,\ \lambda\big),$
  $\mathbf{t}_i = \operatorname{argmin}_{\mathbf{t}} \big( D_i(\mathbf{t}) + \alpha V_{ip}(\mathbf{t}, \mathbf{t}_p) \big).$   (8)

Here $\mathbf{t}_i$ and $\mathbf{t}_p$ are patch $i$'s optimal translation and its parent cell node's optimal translation, respectively. $\mathbf{f}^t$ and $\mathbf{f}^e$ denote the patch features in the test image and the exemplar image, respectively. Note that the results from the grid-cell layer provide reliable initial correspondences for the patches at the patch layer.

For the pixel-level layer, each pixel links to its parent patch node. Guided by the patch layer solution, pixel layer correspondences can be estimated efficiently and accurately. $\mathbf{q}_i$ denotes pixel $i$'s coordinate in the pixel-level layer. The term $D_i(\mathbf{t})$ in pixel $i$'s optimal translation is defined as

  $D_i(\mathbf{t}) = \min\big(\|\mathbf{s}^t(\mathbf{q}_i) - \mathbf{s}^e(\mathbf{q}_i + \mathbf{t})\|_1,\ \lambda\big),$
  $\mathbf{t}_i = \operatorname{argmin}_{\mathbf{t}} \big( D_i(\mathbf{t}) + \alpha V_{ip}(\mathbf{t}, \mathbf{t}_p) \big).$   (9)

Here $\mathbf{t}_i$ and $\mathbf{t}_p$ are pixel $i$'s optimal translation and its parent patch node's optimal translation, respectively. $\mathbf{s}^t$ and $\mathbf{s}^e$ denote the pixel features in the test image and exemplar image, respectively.

Our method improves the matching accuracy and substantially reduces the computation time compared to the recent state-of-the-art method [3].
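To make eq. (8) concrete, here is a hedged brute-force sketch of the guided patch-level search. The paper optimizes with loopy BP and generalized distance transforms, so this exhaustive window search (and its radius parameter) is an illustrative simplification under our own assumptions, not the authors' solver.

```python
import numpy as np

def patch_translation(f_t, f_e, parent_t, alpha=1.0, lam=1.0, gamma=1.0, radius=2):
    # Guided patch matching (eq. (8)): for each patch node, search translations
    # around its parent's translation, minimizing a truncated-L1 data cost plus
    # a truncated-L1 smoothness cost to the parent translation.
    # f_t, f_e: (Hp, Wp, M) patch-feature maps of the test / exemplar image;
    # parent_t: (Hp, Wp, 2) integer parent translations (from the grid-cell layer).
    Hp, Wp, _ = f_t.shape
    best = np.zeros((Hp, Wp, 2), dtype=int)
    for i in range(Hp):
        for j in range(Wp):
            pu, pv = parent_t[i, j]
            best_cost = np.inf
            for du in range(pu - radius, pu + radius + 1):
                for dv in range(pv - radius, pv + radius + 1):
                    u, v = i + du, j + dv
                    if not (0 <= u < Hp and 0 <= v < Wp):
                        continue                       # translation leaves the image
                    data = min(np.abs(f_t[i, j] - f_e[u, v]).sum(), lam)
                    smooth = min(abs(du - pu) + abs(dv - pv), gamma)
                    cost = data + alpha * smooth
                    if cost < best_cost:
                        best_cost = cost
                        best[i, j] = (du, dv)
    return best
```

The pixel-level refinement of eq. (9) has the same structure, with pixel codes in place of patch features and the patch translations acting as parents.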
In [3], sparse descriptor sampling is used to reduce the computational time, which may cause loss of key characteristics in the nodes of the matching graph. The grid-cell layer of our matching model is built upon the patch layer features, which cover the whole image area without any sparse sampling process. The pixel-layer features obtained by an unsupervised learning algorithm appear to be discriminative enough to resolve matching ambiguity between classes. By using a pooling operation on the pixel features, we significantly reduce the number of patch-layer features and the possible translations, while enhancing the robustness to severe visual variations.

Matching at the patch-level layer is the core of our algorithm. At the patch layer, image pixels within the same patch share the same optimal translation. Our experimental results show that the patch-layer matching results outperform the state-of-the-art methods at a much faster computation speed. The pixel-level matching further refines the matching accuracy; thus the pixel-level matching procedure can be considered optional. If the test speed is on a budget, one does not have to perform the pixel-level matching.

As can be seen from Figure 3, the pixel-level matching output (Figure 3(d)) retains finer details at object edges, compared to the patch-level matching result (Figure 3(c)). Our pixel and patch feature encoding schemes allow us to reduce the computation while improving the matching results. Experimental results in the next section demonstrate the advantages of our method.
3 Experiments

We conduct experiments to evaluate the matching quality (Section 3.2) and to analyse the performance impact of several different elements in the feature learning framework (Section 3.3).

In Section 3.2, we test our method on three benchmark vision datasets: the Caltech-101 dataset, the LabelMe Outdoor (LMO) dataset [11], and a subset of the Pascal dataset. We also apply our method to semantic segmentation. We compare our method with state-of-the-art dense pixel matching methods, namely, the deformable spatial pyramid (DSP) approach (single-scale) [3], SIFT Flow (SF) [1], and coherency sensitive hashing (CSH) [2]. Note that DSP has achieved the previously best results on dense pixel matching. For the DSP, SF and CSH methods, we use the code provided by their authors.

In Section 3.3, we present a detailed analysis of the impact of parameter settings in feature learning, including: (a) the choice of the unsupervised feature learning algorithm, (b) the impact of the dictionary size, (c) the impact of the training data size, and (d) different configurations in the patch feature extraction process.

We set the parameters of the compared methods to the values suggested in the original papers. In all of our experiments, we use a 3-level pyramid in the grid-cell layer. The parameters $\alpha$ and $\gamma$ of our method are fixed to the same values for all experiments.

3.1 Setup and evaluation criteria

A universal dictionary is learned from image patches extracted from 200 Background Google class images in the Caltech-101 dataset. Note that 'Background Google' contains mainly natural images which are irrelevant to the test images in our experiments. The dictionary is learned before the matching process; once learned, it is used to encode all the test images. We use K-means dictionary learning and K-means triangle (KT) encoding for our method. The dictionary size is set to 100. Clearly, the length of the feature vectors at the pixel-level layer and the patch-level layer is equal to the dictionary size.

Then pixel-layer features are computed at each pixel of the test images. More specifically, we perform encoding on an image region around each pixel centroid using the learned dictionary to form the feature vector of that pixel, so each pixel feature is extracted from an 11 × 11 patch.

For each image pair (a test image and an exemplar image), we find the pixel correspondences between them using a matching algorithm and transfer the annotated class labels of the exemplar image pixels to the test image pixels.

The LT-ACC criterion measures how many pixels in the test image have been correctly labelled by the matching algorithm. On the Caltech-101 dataset, each image is divided into foreground and background pixels. The IOU metric reflects the matching quality for separating the foreground pixels from the background. As for Caltech-101, since ground-truth pixel correspondence information is not available, we use the LOC-ERR metric to evaluate the distortion of corresponding pixel locations with respect to the object bounding box.

Mathematically, the LT-ACC metric is computed as

  $r = \frac{1}{\sum_i m_i} \sum_i \sum_{p \in \Lambda_i} \mathbb{1}\big(o(p) = a(p),\ a(p) > 0\big),$   (10)

where for pixel $p$ in image $i$, the ground-truth annotation is $a(p)$ and the matching output is $o(p)$; for unlabeled pixels $a(p) = 0$. The notation $\Lambda_i$ is the image lattice of image $i$, and $m_i = \sum_{p \in \Lambda_i} \mathbb{1}(a(p) > 0)$ is the number of labeled pixels in image $i$ [11]. Here $\mathbb{1}(\cdot)$ outputs 1 if the condition is true, and 0 otherwise.

To define the LOC-ERR of corresponding pixel positions, we first designate each image's pixel coordinates using its ground-truth object bounding box. Pixel coordinates are normalized with respect to the box's position and size. Then, the localization error of two matched pixels is defined as $e = 0.5(|x_1 - x_2| + |y_1 - y_2|)$, where $(x_1, y_1)$ is the pixel coordinate in the first image and $(x_2, y_2)$ is its corresponding location in the second image [3].

The intersection over union (IOU) segmentation measure is used to assess per-class accuracy on the intersection of the predicted segmentation and the ground truth, normalized by the union. Formally, it is defined as

  $u = \frac{\text{true pos.}}{\text{true pos.} + \text{false pos.} + \text{false neg.}}$   (11)

IOU is now the standard metric for segmentation [12].
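As a quick reference, here is a small sketch of these three criteria. It assumes integer label maps where 0 means 'unlabeled', boolean foreground masks, and bounding-box-normalized coordinates; this is our reading of eqs. (10) and (11), not the authors' evaluation code.

```python
import numpy as np

def lt_acc(pred, gt):
    # LT-ACC, eq. (10): fraction of labeled ground-truth pixels whose
    # transferred label matches the annotation (label 0 = unlabeled).
    labeled = gt > 0
    return (pred[labeled] == gt[labeled]).mean()

def loc_err(p1, p2):
    # LOC-ERR for one matched pixel pair, with coordinates already
    # normalized by each image's ground-truth object bounding box.
    (x1, y1), (x2, y2) = p1, p2
    return 0.5 * (abs(x1 - x2) + abs(y1 - y2))

def iou(pred_mask, gt_mask):
    # IOU, eq. (11): true pos. / (true pos. + false pos. + false neg.).
    inter = np.logical_and(pred_mask, gt_mask).sum()
    union = np.logical_or(pred_mask, gt_mask).sum()
    return inter / union
```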
3.2 Comparison with state-of-the-art dense matching methods

In this section, we compare our method against state-of-the-art dense matching methods to examine the matching quality on object matching and scene segmentation. Details of our method are described in Section 2.1.

For this experiment, we evaluate the matching performance using different features in different matching frameworks. 100 test image pairs are randomly picked from the Caltech-101 dataset. Each pair of images is chosen from the same class. The result is shown in Table 1. It shows that SIFT features in our multi-layer matching model are not able to achieve the same accuracy level as the learned features. This result shows the advantage of learned features over hand-crafted features such as SIFT.

Meanwhile, we use the learned features to replace the SIFT features in DSP [3]'s framework, with the same test images. The results show that the SIFT features obtain better matching accuracy in DSP [3]'s framework. This shows that the DSP method [3] is not able to take advantage of learned features, while our matching framework is tailored to the unsupervised feature learning technique.

The third observation concerns the computation speed of the compared methods. The CPU time of our method (at the patch level) is about 8 times faster than that of DSP [3] and 50 times faster than SIFT Flow of Liu et al. [1]. For the patch-level matching, our method outperforms CSH by about 13 points in LT-ACC, yet is twice as fast as CSH. Note that CSH is a notably fast method, which exploits hashing to quickly find matching patches between two images. Our pixel-level matching further improves the patch-layer matching accuracy and provides better visual matching quality, which is hard to measure by LT-ACC. Examples are shown in Figure 3.

By using multi-level representations, our proposed matching method enables the learned features (obtained by unsupervised learning methods) to outperform hand-crafted features (e.g., SIFT features) in dense matching tasks. A generic matching framework may not help features achieve their best performance; a suitable matching framework improves the feature performance in the matching task. We conclude that our matching framework and the unsupervised feature learning pipeline are tightly coupled to achieve the best performance.
Now we conduct extensive experiments on the Caltech-101 dataset. We randomly pick 20 image pairs for each class (2020 image pairs in total). Each pair of images is chosen from the same class. The ground-truth annotation for each image is available, indicating the foreground object and the background.
Table 1  Comparison of object matching performance of different methods on 100 pairs of images from the Caltech-101 dataset in terms of matching accuracy and speed. The best results are shown in bold. (framework: Ours — DSP [3] — SIFT Flow [1] — CSH [2]; feature: learned feature — SIFT — learned feature — SIFT; matching level: patch level — pixel level — patch level — pixel level; LT-ACC: 0.801 — …)
Table 2  Intra-class image matching performance on the Caltech-101 dataset. The best results are in bold. (method: Ours (patch layer) — DSP — SIFT Flow — CSH; row: LT-ACC — …)
Fig. 4  (a) The percentage of each method achieving the best performance over all of the 101 classes. Our method achieves the best matching accuracy in 55 classes (54% of the 101 classes). (b) The histogram of each method's achievements in matching accuracy (LT-ACC) over all classes.

Table 2 shows the matching accuracy and CPU time of our method against those of the state-of-the-art methods. As can be seen, our method achieves the highest label transfer accuracy and is more than 10 times faster than DSP in the matching process. In this experiment, there are only two labels for each test image and the intra-class variability is very large, so an improvement of 0.1 is hard to achieve.

We find that the intra-class variability differs by class. In the 'Face' class, for instance, objects in the images are similar; the highest matching accuracy for the 'Face' class reaches 0.952 (obtained by the SIFT Flow method). However, images in the 'Beaver' class vary much more than in the 'Face' class and are hard to match. Therefore, the best accuracy among the four compared methods on the 'Beaver' class is lower than on the 'Face' class, as expected. On the 'Beaver' class, our method outperforms the other methods and obtains a matching accuracy of 0.687.

As the intra-class variability is hard to measure, we use the best matching accuracy to reflect the variability of each class: higher matching accuracy means smaller intra-class variability. Figure 4(b) shows the histogram of the best matching accuracy over all classes; for each bin, we group the data by matching method.

Figure 4(b) shows that our method outperforms the other methods in most cases. SIFT-based methods achieve better results in the 'easy' high-accuracy classes such as 'Face'; they handle well the matching of objects with similar appearance, which is consistent with the case of sparse point matching applications. Our method achieves better matching results for classes with large intra-class variability, and is thus more suitable for handling matching problems with large intra-class variability than the compared methods.

The pie chart in Figure 4(a) shows the percentage of each method achieving the best accuracy over the 101 classes. Our method achieves the best matching accuracy in 55 out of 101 classes, outperforming the compared methods by a large margin.

Figure 5 shows some qualitative results of the compared methods. The results show that our method is more robust than the other methods under large object appearance variations (e.g., the second example) and cluttered backgrounds (e.g., the first and third examples).

The DSP and SIFT Flow methods are based on SIFT features, which are robust to local geometric and local affine distortions. The SIFT Flow method enforces matching smoothness between each pixel and its neighboring pixels. Due to this enforcement of pixel connections, the SIFT Flow matching results may exhibit large distortions.
Fig. 5  Qualitative comparison. We show some example results of the compared methods. The first column stacks the matching image pairs. The second column shows the ground-truth labels of the input images. We establish pixel correspondences for the top image of each image pair. Columns 3–6 show the warping results from the exemplar image (the bottom one in the pair) to the test image (top one) via pixel correspondences by our method, DSP, SIFT Flow and CSH, respectively. Here black and white indicate the background and target labels.

The DSP method takes advantage of pyramid graph matching and focuses on pixel-level optimization efficiency. The results show that the DSP method achieves better results than SIFT Flow, which is consistent with the results presented in [3]. Since DSP removes the neighboring-pixel connections in the pixel-level matching optimization, pixels tend to match dispersedly beyond the object boundary.

The CSH method finds patch matches mainly by patch appearance similarity and does not rely on the image coherence assumption. Its results are visually more pleasant (the warping result is highly similar to the test image). However, object pixels in a test image can easily be matched, incorrectly, to background pixels in the exemplar image, which causes the low label-transfer accuracy.

Our method takes advantage of all three methods. The grid-cell layer matching in our method considers the matching cost and geometric regularization in the pyramid of cells of different sizes, and the cell matching guides the patch-level matching. Then the pixel matching refines the results of the patch-level matching. Figure 5 shows that our method not only achieves accurate deformable matching but also keeps the object's main shape. The matching accuracy of our patch-level matching is better than that of DSP, while being much faster. As shown above, our patch-level results already outperform the state-of-the-art results. Our pixel-level matching in general provides even better matching accuracy at a higher computational cost.
In this scene matching and segmentation experiment, we report results on the LMO dataset [11]. Most of the LMO images are outdoor scenes including streets, beaches, mountains and buildings. The dataset names the top 33 object categories with the most labeled pixels; all other object categories are grouped into a 34th category, 'unlabeled' [11]. This experiment is more complex than the experiment on the Caltech-101 dataset, which contains only two labels (foreground or background).
Fig. 6  Example scene matching results of the compared methods. This plot is displayed in accordance with Figure 5; the only difference here is that the scenes have multiple labels. Different pixel colors encode the 33 classes contained in the LMO dataset (e.g., tree, building, car, sky).
For each test image, we select the 9 most similar exemplar images by Euclidean distance of GIST descriptors, as in [3, 11]. Through the matching process, we obtain dense correspondences between the test image and the selected exemplar images. As in the Caltech-101 matching experiment, we transfer the pixel labels to the test image pixels from the corresponding exemplar pixels.

For the scene matching, some example results are shown in Figure 6. Again, our method is more robust to image variations (scene appearance changes and structure changes), not only in scene warping but also in label transfer. Our labeling results appear more similar to the ground-truth labels of the test images.

For the scene segmentation, we follow the method described in [11]. After the matching and warping process, each pixel in the test image may have multiple labels, obtained by matching different exemplars. To obtain the final image segmentation, we reconcile the multiple labels and impose spatial smoothness under a Markov random field (MRF) model. The label likelihood is defined by the feature distance between a test image pixel and its corresponding exemplar image pixel. In this experiment, we randomly pick 40 images as test images. We report the patch-level results of our method on this dataset.
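A hedged sketch of the basic label-transfer step (warping exemplar labels through the estimated correspondences) is given below. The integer flow layout and the clamping at image borders are our own assumptions, and the MRF reconciliation across the 9 exemplars is omitted.

```python
import numpy as np

def transfer_labels(flow, exemplar_labels):
    # Warp exemplar labels onto the test image through per-pixel
    # correspondences. flow: (H, W, 2) integer offsets (dy, dx) mapping each
    # test pixel to its match in the exemplar; offsets are clamped to bounds.
    H, W = exemplar_labels.shape
    out = np.zeros_like(exemplar_labels)
    for y in range(H):
        for x in range(W):
            yy = min(max(y + flow[y, x, 0], 0), H - 1)
            xx = min(max(x + flow[y, x, 1], 0), W - 1)
            out[y, x] = exemplar_labels[yy, xx]
    return out
```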
Table 3  Scene segmentation performance of different methods on the LMO dataset in matching accuracy. For our method, the dictionary is learned using K-means and the encoding scheme is K-means triangle; the dictionary size is set to 100. The best results are boldfaced. (method: Ours — DSP — SIFT Flow — CSH; row: LT-ACC — …)
Table 3 shows the segmentation accuracy of our method compared with the state-of-the-art methods. Our method outperforms them in segmentation accuracy. In this experiment, we notice that SIFT Flow outperforms the DSP method given GIST neighbors, which is consistent with the results in [3]; the segmentation accuracy of DSP relies on the exemplar list [3]. Our method does not have this problem. In this multi-class pixel labelling experiment, our method outperforms the compared methods by 0.02 in matching accuracy, and our IOU score is also better. The experimental results show that our method provides higher matching accuracy. The reason is two-fold: (a) the learned features provide higher discriminability between classes, and (b) our matching model is more suitable for the learned features to carry out dense matching tasks.
This part of the experiments is carried out on the Pascal Visual Object Classes (VOC) 2012 dataset [12]. There are 2913 images in the segmentation task, which have ground-truth annotations for each image pixel. There are 20 specified object classes in the Pascal 2012 dataset; objects which do not belong to one of these classes are given a 'background' label.

In our experiment, we randomly choose 30 image pairs for each class (600 image pairs in total) from the images which contain only one object class. Each pair of images comes from the same class. We consider the objects as 'foreground' and everything else as 'background'. The parameter settings are the same as in our experiments on the Caltech-101 dataset, and we use the same dictionary to obtain our pixel features as in the other experiments.

Table 4 shows the matching accuracy and CPU time of our method as well as the compared methods. Again, our method achieves the highest label transfer accuracy and is more than 8 times faster than the DSP method. The pie chart in Figure 7(a) shows the percentage of each method achieving the best accuracy over the 20 classes; the proposed method achieves the best matching accuracy in 10 classes, outperforming the compared methods. Figure 7(b) shows the histogram of the best matching accuracy over all classes.

The results on the Pascal dataset are slightly different from those on the Caltech-101 dataset. As we can see from Figure 7(b), most of the classes in the Pascal dataset are low-accuracy classes, meaning that objects in the same class vary more than those in the Caltech-101 dataset. In this experiment, our method still outperforms the other methods in most cases. This confirms that our method achieves better matching results under large object appearance variations.
Table 4  Intra-class image matching performance on the Pascal dataset. The best results are in bold. (method: Ours (patch layer) — DSP — SIFT Flow — CSH; row: LT-ACC — …)
Fig. 7  (a) The percentage of each method achieving the best accuracy on the Pascal dataset. Our method achieves the best matching accuracy in 10 classes out of 20. (b) The histogram of each method's achievements in matching accuracy (LT-ACC) over all classes.

It is further demonstrated that our proposed method is more suitable for handling matching problems with large intra-class variability than the compared methods. Figure 8 shows some example results of the compared methods. The results show that our method is more robust than the other methods under large object appearance variations (e.g., the first example) and cluttered backgrounds (e.g., the third and fourth examples).

3.3 Analysis of feature learning

In this section, we examine several factors that may affect the performance of our proposed matching algorithm. We randomly pick 20 pairs of images for each object class on the Caltech-101 dataset (thus 2020 pairs of images in total). The parameters are set as follows unless otherwise specified. We use K-means dictionary learning and K-means triangle encoding for our method.
Fig. 8  Qualitative comparison. We show some example results of the compared methods on the Pascal dataset. The first column stacks the matching image pairs. The second column shows the ground-truth labels of the input images. Columns 3–6 show the warping results from the exemplar image to the test image via pixel correspondences by our method, DSP, SIFT Flow and CSH, respectively.

The dictionary is learned from image patches extracted from 200 Background Google class images in Caltech-101. The dictionary size is set to 100. The patch size for extracting pixel features is 11 × 11 pixels.

Here we examine the importance of dictionary learning and feature encoding methods with respect to the final dense matching accuracy. First, we compare the K-means dictionary learning method, which has been used in our experiments in the previous section, with two other dictionary learning algorithms, namely K-SVD [30] and randomly sampled patches (RA) [8].

As we can see from Table 5, different dictionary learning methods do not have a significant impact on the final matching results. Even using randomly sampled patches as the dictionary achieves encouraging matching performance; different learning methods lead to similar matching accuracies. As concluded in [8], the main value of the dictionary is to provide a basis, and how the dictionary is constructed is less critical than the choice of encoding. For the application of pixel matching, we show that this conclusion holds too.
Table 5  Object matching performance using different dictionary learning and encoding methods on Caltech-101. Definition of the acronyms: KT (K-means triangle), OMP-k (orthogonal matching pursuit), SA (soft assignment), RA (random sampling).

  Encoder     KT      KT      KT        OMP-k     SA
  Dictionary  K-SVD   RA      K-means   K-means   K-means
  LT-ACC      0.789   0.792   0.803     0.755     0.667
  IOU         0.467   0.481   0.505     0.386     0.014
  LOC-ERR     0.354   0.336   0.324     0.497     1.621

Then, we compare three encoding schemes, K-means triangle (KT) [8], OMP-k [8] and soft assignment (SA) [34], to evaluate the impact of different encoding schemes. We apply OMP encoding with a sparsity level of k = 10. According to Table 5, KT encoding achieves the best matching result, marginally outperforming the other encoding methods. This shows that the encoding scheme has a larger impact on feature performance than dictionary learning.

In our model, there are two properties that pixel features should have. The first is sparsity. Since we use max-pooling to form our patch features, some degree of sparsity contributes to the improvement; a lack of sparsity in pixel features may decrease the power of the patch features. With the KT method, roughly one half of the feature entries are set to 0. SA results in dense features and performs very poorly in our framework. This is very different from image classification applications [34].

The other property is smoothness. As mentioned in [10], the congealing method used there to align face images reduces entropy by performing local hill-climbing in the transformation parameters, and the smoothness of that optimization landscape is a key factor in its successful alignment. In our method, we face a similar situation: to find the dense correspondence, we optimize the objective function via belief propagation (BP), and without smoothness the algorithm can easily get stuck at a poor local minimum. OMP-k features are sparser than KT features, but they are not sufficiently smooth to perform well in our framework.

The results in Table 5 show that KT encoding performs better than the other two methods in our framework. This observation deviates from the case of generic image classification, in which many encoding methods (KT, SA, sparse coding, soft thresholding, etc.) perform similarly [8, 34].
In this subsection, we evaluate the impact of the dictionary size on the performance of dense matching. As the dictionary size equals the feature dimension, a larger dictionary implies more patch information in each feature, which may lead to better matching performance. At the same time, the trade-off is that a larger feature dimension requires more computation and slows down the matching procedure.

We evaluate six dictionary sizes (64, 100, 144, 196, 289, 400) on object matching performance, while keeping the other parameters fixed. As shown in Table 6, a dictionary of size 196 considerably outperforms the matching performance of size 64; beyond this point, we observe slightly decreased accuracies. However, the CPU time grows substantially, from 0.04 seconds per image matching with a dictionary size of 64 to 0.12 seconds with a dictionary of size 196, as shown in Table 6. Even with a dictionary size of 64 we can achieve high accuracy. This experiment shows that a bigger dictionary size (longer feature length) leads to slightly better matching accuracy over a wide range. Based on these observations, we have chosen the dictionary size to be 100 in our object matching and scene segmentation experiments as a balance between accuracy and CPU time.

Also note that with this choice, our feature dimension is actually smaller than SIFT's dimension of 128. This fact contributes to the faster speed of our method compared to methods like SIFT Flow.
This experiment considers the effect of the patch size used for extracting pixel features; we evaluate 12 patch sizes, from 5 × 5 to 27 × 27 pixels. For each image pixel, we extract a patch of the given size around that pixel and obtain the pixel feature by KT encoding. Larger patch regions allow us to extract more complex features, as they contain more information; on the other hand, they increase the dimensionality of the space that the algorithm must cover. The results are shown in Figures 9(a) and 9(c). Overall, larger patches for extracting pixel features lead to better matching accuracy. From 5 × 5 to 17 × 17 pixels, the matching performance increases significantly, while beyond 17 × 17 pixels the accuracy improvement becomes negligible. From Figure 9(c), we can see that the patch size used for extracting pixel features has a much greater impact than the choice of dictionary size. Figure 9(c) also shows that the best-performing dictionary size grows with the patch size: with 11 × 11 patches, the best performance point shifts to a dictionary size of 196. This suggests that one should choose a larger dictionary when larger patches are used.
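A sketch of the per-pixel extraction just described, reusing kt_encode from the earlier sketch: a patch_size × patch_size window around each valid pixel is flattened and encoded over the dictionary. This is written for clarity, not memory efficiency, and the names are again ours.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def pixel_features(image, centroids, patch_size=17):
    """Encode the patch around every pixel where the full window fits.
    image     : (H, W) grayscale image
    centroids : (K, patch_size**2) dictionary
    returns   : (H - p + 1, W - p + 1, K) pixel-layer features
    """
    windows = sliding_window_view(image, (patch_size, patch_size))
    h, w = windows.shape[:2]
    flat = windows.reshape(h * w, -1).astype(np.float64)
    # Note: kt_encode broadcasts an (n, K, d) array; for a full image
    # one would encode in chunks rather than all pixels at once.
    codes = kt_encode(flat, centroids)
    return codes.reshape(h, w, -1)
```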
We examine the impact of the pooling size used for obtaining the patch-layer features. As described in Section 2.1, an image is divided into non-overlapping pooling regions, and each pooling region (patch) is represented by a patch feature obtained by max-pooling all pixel features within that patch. As a result, larger pooling regions (patches) result in fewer pooled features. The experimental results are shown in Figure 9(b). We consider four pooling sizes: 3 × 3, 7 × 7, 11 × 11 and 15 × 15.
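The pooling step admits an equally short sketch (again with hypothetical names): pixel features are grouped into non-overlapping pool × pool regions and reduced by a channel-wise maximum, so larger pooling regions yield fewer patch-layer features.

```python
import numpy as np

def max_pool_patches(pixel_feats, pool=7):
    """Max-pool (h, w, K) pixel features over non-overlapping
    pool x pool regions, giving one K-dim patch-layer feature
    per region."""
    h, w, K = pixel_feats.shape
    H, W = h // pool, w // pool                  # number of pooling regions
    cropped = pixel_feats[:H * pool, :W * pool]  # drop the ragged border
    blocks = cropped.reshape(H, pool, W, pool, K)
    return blocks.max(axis=(1, 3))               # (H, W, K) patch features
```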
Table 6
Evaluation of different dictionary sizes w.r.t. the final matching accuracy. With a dictionary size of 100, our method already outperforms DSP in both accuracy and CPU time.
Fig. 9
The impact of changes in the setup of the feature learning process. (a) Matching accuracy for different patch sizes used to extract pixel-layer features. (b) The impact of the max-pooling size used to obtain the patch-layer features. (c) Performance for different patch sizes and dictionary sizes. The impact of the patch size is greater than that of the dictionary size; beyond a patch size of 17 × 17, the gain from further increasing the patch size becomes less noticeable.
Fig. 10
Matching accuracies with varying amounts of training data and varying dictionary sizes.
In all the previous experiments, the dictionary is learned from 10 patches extracted from 200 Background Google class images in the Caltech-101 dataset. In this experiment, we evaluate the impact of the training data size: multiple dictionaries are learned from different numbers of patches sampled from the Background Google class of Caltech-101. As expected, the matching accuracy increases only very slightly with more sampled training patches.
Conclusion
We have proposed to learn features for pixel correspondence estimation in an unsupervised manner. A new multi-layer matching algorithm is designed, which naturally aligns with the unsupervised feature learning pipeline. For the first time, we show that learned features can outperform widely-used hand-crafted features such as SIFT on the problem of dense pixel correspondence estimation.
We empirically demonstrate that the proposed algorithm can robustly match different objects or scenes exhibiting large appearance differences, and achieves state-of-the-art performance in terms of both matching accuracy and running time. A limitation of the proposed framework is that the current system is not very robust to rotation and scale variations; we plan to pursue this issue in future work.
We have made the code available online at https://bitbucket.org/chhshen/ufl.
Acknowledgments
This work is in part supported by ARC grant FT120100969. C. Zhang's contribution was made when she was a visiting student at The University of Adelaide.
References
1. C. Liu, J. Yuen, and A. Torralba, "SIFT flow: Dense correspondence across scenes and its applications," IEEE Trans. Pattern Anal. Mach. Intell., vol. 33, no. 5, pp. 978–994, 2011.
2. S. Korman and S. Avidan, "Coherency sensitive hashing," in Proc. IEEE Int. Conf. Comp. Vis., 2011, pp. 1607–1614.
3. J. Kim, C. Liu, F. Sha, and K. Grauman, "Deformable spatial pyramid matching for fast dense correspondences," in Proc. IEEE Conf. Comp. Vis. Patt. Recogn., 2013, pp. 2307–2314.
4. C. Barnes, E. Shechtman, D. Goldman, and A. Finkelstein, "The generalized PatchMatch correspondence algorithm," in Proc. Eur. Conf. Comp. Vis., 2010.
5. M. Leordeanu, A. Zanfir, and C. Sminchisescu, "Locally affine sparse-to-dense matching for motion and occlusion estimation," in Proc. IEEE Int. Conf. Comp. Vis., 2013.
6. D. G. Lowe, "Object recognition from local scale-invariant features," in Proc. IEEE Int. Conf. Comp. Vis., 1999.
7. L. Bo, X. Ren, and D. Fox, "Multipath sparse coding using hierarchical matching pursuit," in Proc. IEEE Conf. Comp. Vis. Patt. Recogn., 2013, pp. 660–667.
8. A. Coates and A. Ng, "The importance of encoding versus training with sparse coding and vector quantization," in Proc. Int. Conf. Mach. Learn., 2011, pp. 921–928.
9. A. Coates, A. Y. Ng, and H. Lee, "An analysis of single-layer networks in unsupervised feature learning," in Int. Conf. Artificial Intell. & Stat., 2011, pp. 215–223.
10. G. B. Huang, M. Mattar, H. Lee, and E. G. Learned-Miller, "Learning to align from scratch," in Proc. Adv. Neural Inf. Process. Syst., 2012.
11. C. Liu, J. Yuen, and A. Torralba, "Nonparametric scene parsing: Label transfer via dense scene alignment," in Proc. IEEE Conf. Comp. Vis. Patt. Recogn., 2009, pp. 1972–1979.
12. M. Everingham, S. Eslami, L. Van Gool, C. Williams, J. Winn, and A. Zisserman, "The Pascal visual object classes challenge: A retrospective," Int. J. Comp. Vis., 2014. [Online]. Available: http://dx.doi.org/10.1007/s11263-014-0733-5
13. M. Rubinstein, A. Joulin, J. Kopf, and C. Liu, "Unsupervised joint object discovery and segmentation in internet images," in Proc. IEEE Conf. Comp. Vis. Patt. Recogn., 2013, pp. 1939–1946.
14. O. Duchenne, A. Joulin, and J. Ponce, "A graph-matching kernel for object categorization," in Proc. IEEE Int. Conf. Comp. Vis., 2011, pp. 1792–1799.
15. A. C. Berg, T. Berg, and J. Malik, "Shape matching and object recognition using low distortion correspondences," in Proc. IEEE Conf. Comp. Vis. Patt. Recogn., vol. 1, 2005, pp. 26–33.
16. M. Leordeanu and M. Hebert, "A spectral technique for correspondence problems using pairwise constraints," in Proc. IEEE Int. Conf. Comp. Vis., vol. 2, 2005.
17. E. Tola, V. Lepetit, and P. Fua, "A fast local descriptor for dense matching," in Proc. IEEE Conf. Comp. Vis. Patt. Recogn., 2008.
18. L. Zelnik-Manor, "On SIFTs and their scales," in Proc. IEEE Conf. Comp. Vis. Patt. Recogn., 2012, pp. 1522–1528.
19. Q. V. Le, M. Ranzato, R. Monga, M. Devin, G. Corrado, K. Chen, J. Dean, and A. Y. Ng, "Building high-level features using large scale unsupervised learning," in Proc. Int. Conf. Mach. Learn., 2012.
20. Q. V. Le, A. Karpenko, J. Ngiam, and A. Y. Ng, "ICA with reconstruction cost for efficient overcomplete feature learning," in Proc. Adv. Neural Inf. Process. Syst., 2011, pp. 1017–1025.
21. G. Zhou, K. Sohn, and H. Lee, "Online incremental feature learning with denoising autoencoders," in Proc. Int. Conf. Artificial Intell. & Stat., vol. 22, 2012, pp. 1453–1461.
22. J. Wright, Y. Ma, J. Mairal, G. Sapiro, T. S. Huang, and S. Yan, "Sparse representation for computer vision and pattern recognition," Proceedings of the IEEE, vol. 98, no. 6, pp. 1031–1044, 2010.
23. J. Mairal, F. Bach, J. Ponce, and G. Sapiro, "Online learning for matrix factorization and sparse coding," J. Mach. Learn. Res., vol. 11, pp. 19–60, 2010.
24. D. D. Lee and H. S. Seung, "Learning the parts of objects by non-negative matrix factorization," Nature, vol. 401, no. 6755, pp. 788–791, 1999.
25. A. Coates and A. Y. Ng, "Selecting receptive fields in deep networks," in Proc. Adv. Neural Inf. Process. Syst., 2011, pp. 2528–2536.
26. T. Tuytelaars and L. Van Gool, "Wide baseline stereo matching based on local, affinely invariant regions," in Proc. British Machine Vis. Conf., 2000, pp. 412–425.
27. H. Bay, A. Ess, T. Tuytelaars, and L. Van Gool, "Speeded-up robust features (SURF)," Comput. Vis. Image Underst., vol. 110, no. 3, pp. 346–359, 2008.
28. E. Tola, V. Lepetit, and P. Fua, "DAISY: An efficient dense descriptor applied to wide baseline stereo," IEEE Trans. Pattern Anal. Mach. Intell., vol. 32, no. 5, pp. 815–830, May 2010.
29. M. Heikkilä, M. Pietikäinen, and C. Schmid, "Description of interest regions with local binary patterns," Pattern Recogn., vol. 42, no. 3, pp. 425–436, 2009.
30. M. Aharon, M. Elad, and A. Bruckstein, "K-SVD: An algorithm for designing overcomplete dictionaries for sparse representation," IEEE Trans. Signal Process., vol. 54, no. 11, pp. 4311–4322, 2006.
31. Y.-L. Boureau, J. Ponce, and Y. LeCun, "A theoretical analysis of feature pooling in visual recognition," in Proc. Int. Conf. Mach. Learn., 2010, pp. 111–118.
32. A. T. Ihler, J. W. Fisher III, and A. S. Willsky, "Loopy belief propagation: Convergence and effects of message errors," J. Mach. Learn. Res., vol. 6, pp. 905–936, 2005.
33. P. Felzenszwalb and D. Huttenlocher, "Efficient belief propagation for early vision," Int. J. Comp. Vis., vol. 70, no. 1, pp. 41–54, 2006. [Online]. Available: http://dx.doi.org/10.1007/s11263-006-7899-4
34. L. Liu, L. Wang, and X. Liu, "In defense of soft-assignment coding," in Proc. IEEE Int. Conf. Comp. Vis., 2011.