Mirror, mirror on the wall, tell me, is the error small?
Heng Yang, Queen Mary University of London, [email protected]
Ioannis Patras, Queen Mary University of London, [email protected]
Abstract
Do object part localization methods produce bilaterally symmetric results on mirror images? Surprisingly not, even though state of the art methods augment the training set with mirrored images. In this paper we take a closer look into this issue. We first introduce the concept of mirrorability as the ability of a model to produce symmetric results on mirrored images, and introduce a corresponding measure, namely the mirror error, defined as the difference between the detection result on an image and the mirror of the detection result on its mirror image. We evaluate the mirrorability of several state of the art algorithms in two of the most intensively studied problems, namely human pose estimation and face alignment. Our experiments lead to several interesting findings: 1) Surprisingly, most state of the art methods struggle to preserve mirror symmetry, despite the fact that they have very similar overall performance on the original and mirror images; 2) the low mirrorability is not caused by training or testing sample bias - all algorithms are trained on both the original images and their mirrored versions; 3) the mirror error is strongly correlated to the localization/alignment error (with correlation coefficients around 0.7). Since the mirror error is calculated without knowledge of the ground truth, we show two interesting applications - in the first it is used to guide the selection of difficult samples and in the second to give feedback in a popular Cascaded Pose Regression method for face alignment.
1. Introduction
The evolution of mirror (bilateral) symmetry has profoundly impacted animal evolution [7]. As a consequence, the overwhelming majority of modern animals display bilateral symmetry. In computer vision, several methods have reported close-to-human performance. This includes localization of objects in images (e.g. pedestrian or face detection) and fine-grained localization of object parts (e.g. face parts, body parts, or bird parts localization). Most of those methods augment the training set by mirroring the positive training samples. However, are these models able to give symmetric results on a mirror image during testing?

In order to answer this question we first introduce the concept of mirrorability, i.e., the ability of an algorithm to give bilaterally symmetric results on a mirror image, and a quantitative measure called the mirror error. The latter is defined as the difference between the detection result on an image and the mirror of the detection result on its mirror image. We evaluate the mirrorability of several state of the art algorithms in two representative problems (face alignment and human pose estimation) on several datasets. One would expect a model that has been trained on a dataset augmented with mirror images to give similar results on an image and its mirrored version. However, as can be seen in the first column of Fig. 1, several state of the art methods sometimes struggle to give symmetric results on mirror images, and for some samples the mirror error is quite large.

Figure 1: Example pairs of localization results on original (left) and mirror (right) images. First row: Human Pose Estimation [24]; second row: Face Alignment by RCPR [4]. The first column (a and c) shows large mirror error (0.2 and 0.6, respectively) and the second column (b and d) small mirror error (0.02 in both cases). Can we evaluate the performance without knowing the ground truth?
By looking at the mirrorability of different approaches in human pose estimation and face alignment, we arrive at three interesting findings. First, most of the models struggle to preserve mirrorability - the mirror error is present and sometimes significant. Second, the low mirrorability is not likely to be caused by training or testing sample bias - the training sets are augmented with mirrored images. Third, the mirror error of the samples is highly correlated with the corresponding ground truth error.

This last finding is significant since one of the nice properties of the proposed mirror error is that it is calculated 'blindly', i.e., without using the ground truth. We rely on this property in order to show two examples of how it could be used in practice. In the first, the mirror error is used as a guide for difficult sample selection in unlabelled data, and in the second it is used to provide feedback in a cascaded pose regression method for face alignment. In the former application, the samples selected based on the mirror error show high consistency across different methods, and high consistency with the difficult samples selected based on the ground truth alignment error. In the latter application, the feedback mechanism is used in a multiple initializations scheme in order to detect failures - this leads to large improvements and state of the art results in face alignment.

To summarize, in this paper we make the following contributions:

• To the best of our knowledge, we are the first to look into the mirror symmetric performance of object part localization models.

• We introduce the concept of mirrorability and show how the corresponding measure that we propose, called the mirror error, can be used in evaluating general object part localization methods.

• We evaluate the mirrorability of several algorithms in two domains (i.e. face alignment and body part localization) and report several interesting findings on their mirrorability.
• We show two applications of mirrorability in the domain of face alignment.
2. Mirrorability in Object Part Localization
We define mirrorability as the ability of a model/algorithm to preserve mirror symmetry when applied on an image and its mirror image. In order to quantify it we introduce a measure called the mirror error, defined as the difference between a detection result on an image and the mirror of the result on its mirror image. Specifically, let us denote the shape of an object, for example a human or a face, by a set of $K$ points, $X = \{x_k\}_{k=1}^{K}$, where $x_k$ are the coordinates of the $k$-th point/part. The detection result on the original image is denoted by ${}^{q}X = \{{}^{q}x_k\}_{k=1}^{K}$ and the detection result on the mirror image is denoted by ${}^{p}X = \{{}^{p}x_k\}_{k=1}^{K}$. The mirror transformation of ${}^{p}X$ to the original image is denoted by ${}^{p \to q}X = \{{}^{p \to q}x_k\}_{k=1}^{K}$, where ${}^{p \to q}x_k$ denotes the mirrored result of the $k$-th part on the original image. Generally, a different index $k'$ is used on the mirror image (e.g. a left eye in an image becomes a right eye in the mirror image). Therefore, the transformation consists of an image coordinate transform and a part index mirror transform ($k' \to k$). The image coordinate transform is applied on the horizontal coordinate, that is, ${}^{p}x_k = w_I - {}^{q}x_k$, where $w_I$ is the width of the image $I$ and ${}^{p}x_k$ is the $x$ coordinate of the $k$-th point in the mirror image. The index re-assignment is based on the mirror symmetric structure of the specific object, with a one-to-one mapping list where, for example, the left eye index is mapped to the right eye index.
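The transformation and the error it induces can be sketched in a few lines (a minimal NumPy sketch; the function names and the index-mapping list are illustrative assumptions, not part of the paper):

```python
import numpy as np

def mirror_to_original(mirror_pts, image_width, mirror_map):
    """Map detections from the mirror image back onto the original image.

    mirror_pts  : (K, 2) array of (x, y) detections on the mirror image.
    image_width : width w_I of the image.
    mirror_map  : length-K index list; mirror_map[k] is the part index k'
                  on the mirror image corresponding to part k on the
                  original (e.g. left eye <-> right eye).
    """
    pts = np.asarray(mirror_pts, dtype=float)[mirror_map]
    pts[:, 0] = image_width - pts[:, 0]  # flip only the horizontal coordinate
    return pts

def mirror_error(orig_pts, mirror_pts, image_width, mirror_map, size):
    """Sample-wise mirror error: mean point distance, normalized by object size."""
    mapped = mirror_to_original(mirror_pts, image_width, mirror_map)
    dists = np.linalg.norm(np.asarray(orig_pts, dtype=float) - mapped, axis=1)
    return float(dists.mean() / size)
```

For a perfectly mirror-symmetric pair of detections the error is zero; any asymmetry between the two results raises it.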
Formally, the mirror error of the $k$-th landmark (body joint or facial point) is defined as $\|{}^{q}x_k - {}^{p \to q}x_k\|$, and the sample-wise mirror error as:

$$e_m = \frac{1}{K}\sum_{k=1}^{K} \|{}^{q}x_k - {}^{p \to q}x_k\| \quad (1)$$

The mirror error defined in the above equation has the following properties. First, a high mirror error reflects low mirrorability and vice versa. Second, it is symmetric, i.e., given a pair of mirror images it makes no difference which one is considered to be the original. Third, and importantly, calculating the mirror error does not require ground truth information.

In a similar way we calculate the ground truth localization error ${}^{q}e_a$ as the difference between the detected locations and the ground truth locations of the facial landmarks or the human body joints. In order to be consistent and to distinguish it from the mirror error, we call it the alignment error. Formally,

$$ {}^{q}e_a = \frac{1}{K}\sum_{k=1}^{K} \|{}^{q}x_k - {}^{gt}x_k\| \quad (2)$$

where ${}^{gt}x_k$ is the ground truth location of the $k$-th point. In a similar way, we define the alignment error ${}^{p}e_a$ on the mirror image of the test sample. For simplicity, in what follows, when we use the term alignment error $e_a$ we mean the alignment error on the original image.

Both Eq. 1 and Eq. 2 are absolute errors. In order to keep our analysis invariant to the size of the object in each image, we normalize them by the object size $s$, i.e., the size of the body or the face. The sizes of the human body and of the face are calculated in different ways and are described where they are used.

Experiment setting
In order to evaluate the mirrorability of algorithms for human pose estimation, we focus on two representative methods, namely the Flexible Mixtures of Parts (FMP) method by Yang and Ramanan [24] and the Latent Tree Models (LTM) by Wang and Li [20]. FMP is generally regarded as a benchmark method for human pose estimation, and most recent methods are improved versions or variants of it. The method of Wang and Li [20] introduced latent variables in tree model learning, which led to improvements. Both have provided source code, which we used in our evaluation. Since it is not our main focus to improve the performance in a specific domain, we use popular state of the art approaches and evaluate them on standard datasets. We use three widely used datasets, namely the Leeds Sport Dataset (LSP), the Image Parse dataset [13] and the Buffy Stickmen dataset [6]. We use the default training/test split of the datasets. The number of test images on LSP, Parse and Buffy is 1000, 276 and 205 respectively. We trained both FMP and LTM models on LSP and only the FMP model on Parse and Buffy. We emphasize that the training dataset is augmented with mirror images - this eliminates the training sample bias.
Overall performance difference
We first compare the overall performance on the original test set and on the mirror set. We use the evaluation criterion proposed in [24] and also recommended in [1], namely the Percentage of Correct Keypoints (PCK). In order to calculate the PCK, for each person a tightly-cropped bounding box is generated as the tightest box around the person in question that contains all of the ground truth keypoints. The size of the person is calculated as $s = \max(h, w)$, where $h$ and $w$ are the height and width of the bounding box. This is used to normalize the absolute mirror error in Eq. 1 and the alignment error in Eq. 2. The results on Buffy, Parse and LSP are shown in Table 1, Table 2 and Table 3 respectively. As can be seen, there is no significant overall difference between the detection results on the original images and on their mirror images. The maximum difference of the different methods on the different datasets is around 1%, while the average difference is less than 1%.

Points     Head  Shou  Elbo  Wri   Hip   Avg
Original   96.9  97.3  91.1  80.8  79.6  89.1
Mirror     97.1  98.4  91.8  81.9  80.4  89.9

Table 1: PCK of FMP [24] on Buffy. A point is correct if its error is less than a fixed fraction of max(h, w).

Points     Head  Shou  Elbo  Wris  Hip   Knee  Ankle  Avg
Original   90.0  85.6  68.3  47.3  77.3  75.6  67.3   73.1
Mirror     90.0  86.1  67.6  46.3  76.8  74.6  68.5   72.8

Table 2: PCK of FMP [24] on Parse. A point is correct if its error is less than a fixed fraction of max(h, w).

Points         Head  Shou  Elbo  Wris  Hip   Knee  Ankle  Avg
FMP Original   81.2  61.1  45.5  33.4  63.0  55.6  49.5   55.6
FMP Mirror     82.2  61.0  44.9  33.8  63.7  56.1  50.5   56.0
LTM Original   88.5  66.0  51.3  41.1  69.7  59.2  55.6   61.6
LTM Mirror     88.7  65.8  51.4  40.7  70.2  58.0  55.0   61.4

Table 3: PCK of FMP [24] and LTM [20] on LSP. A point is correct if its error is less than a fixed fraction of max(h, w).

Mirrorability

The fact that the average performance on mirror images is similar to the average performance on the originals might be the root of the common belief that models produce more or less bilaterally symmetric results. A closer inspection however reveals that this is not true. Let us first visualize the mirror error of individual body joints, i.e., $\|{}^{q}x_k - {}^{p \to q}x_k\|$, of both FMP and LTM on the LSP dataset. In Fig. 2 we plot the mirror error (normalized by the body size in the example image) of the 1000 test images on each individual joint. As can be seen, there is a difference, which in some cases is quite large, for example on the elbows, the feet and especially on the wrists, for both FMP and LTM. This result directly challenges the perception that the models give mirror symmetric results. We reiterate that this is despite the fact that the overall performance is similar on the original and the mirror images, and despite the fact that we have augmented the training set with the mirror images. This leads us to the conclusion that the low mirrorability (i.e. large mirror error) is not the result of sample bias.

Figure 2: Visualization of the mirror error (upper numbers) and the alignment error (lower values) of body joints for (a) Yang and Ramanan [24] and (b) Wang and Li [20]. The values are percentages of the body size. The radius of each ellipse represents one standard deviation of the mirror error on the corresponding body joint.

Figure 3: Mirror error and alignment error on LSP of LTM [20]. The x axis is the image index after sorting the alignment error in ascending order. Two example images and their mirror images are shown, one with small mirror error and the other with large mirror error.

It is interesting to observe in Fig. 2 that the joints with large average mirror error are usually the most challenging to localize, that is, they are the ones with the higher alignment error. This seems to indicate that there is a correlation between the mirror error and the alignment error. In Fig. 3, as an example, we show the mirror error vs.
the sorted sample-wise alignment error of LTM on the LSP dataset. It is clear that the mirror error tends to increase as the image alignment error increases. Two example pairs of images are shown in Fig. 3, and the correlation between the sample-wise mirror error and the alignment error is shown in Fig. 4. On all three datasets the mirror error shows a strong correlation to the alignment error. For the smaller datasets, Buffy and Parse, the correlation coefficient is around 0.6. On the larger LSP dataset, the correlation coefficient of both LTM and FMP is around 0.7. We can conclude that although the mirror error is calculated without knowledge of the ground truth, it is informative of the real alignment error in each sample.

Face alignment has been intensively studied and most of the recent methods have reported close-to-human performance on face images "in the wild". Here, we look into the mirrorability of face alignment methods and how their error is correlated to the mirror error.
Experiment setting
For our analysis we focus on one of the most challenging datasets collected in the wild, namely 300W. It was created for the Automatic Facial Landmark Detection in-the-Wild Challenge [15]. To this end, several popular data sets, including LFPW [3], AFW [27] and HELEN [10], were re-annotated with a 68-point mark-up, and a new data set, called iBug, was added. We perform our analysis on a test set that comprises the test images from HELEN (330 images), LFPW (224 images) and the images in the iBug subset (135 images), that is, 689 images in total. The images in the iBug subset are extremely challenging due to the large head pose variations, faces that are partially outside the image, and heavy occlusions. The test images are flipped horizontally to get the mirror images. We evaluate the performance of several recent state of the art methods, namely the Supervised Descent Method (SDM) [22], the Robust Cascaded Pose Regression (RCPR) [4], the Incremental Face Alignment (IFA) [2] and the Gauss-Newton Deformable Part Model (GN-DPM) [19]. For SDM, IFA and GN-DPM, only the trained models and the code for testing are available - we use those to directly apply them on the test images. As stated in the corresponding papers, IFA and GN-DPM were trained on the 300W dataset, and the SDM model was trained using a much larger dataset. SDM, IFA and GN-DPM only detect the 49 inner facial points - our analysis of those methods is therefore based on those points only. For RCPR, for which the code for training is available, we retrain the model on the training images of 300W for the full 68-point facial mark-up. All these methods build on the result of a face detector - since most of them are sensitive to initialization, we carefully choose the right face detector for each one to get the best performance. More specifically, for IFA and GN-DPM we use the 300W face bounding boxes, and for SDM and RCPR we use the Viola-Jones bounding boxes, that is, for each method we used the detector that it used during training. For the methods that use the Viola-Jones bounding boxes, we checked manually to verify that the detection is correct - for those face images on which the Viola-Jones face detector fails, we adjust the 300W bounding box to roughly approximate the Viola-Jones bounding box.

Figure 4: Correlation between the alignment error and the mirror error for human pose estimation. The correlation coefficients are: (a) Yang and Ramanan [24] on Buffy, 0.57192; (b) Yang and Ramanan [24] on Parse, 0.61833; (c) Yang and Ramanan [24] on LSP, 0.68328; (d) Wang and Li [20] on LSP, 0.71082.

Figure 5: Mirror error and alignment error of RCPR [4] on the 300W test images. Results are calculated over the 68 facial points.

Figure 6: Mirror error and alignment error of GN-DPM [19] on the 300W test images. Results are calculated over the 49 inner facial points.
Mirrorability
We calculated the mirror error and the alignment error for each of the 689 test samples of 300W for SDM, IFA, GN-DPM and RCPR. In Fig. 6 and Fig. 5 we show the errors for two of the algorithms, i.e., GN-DPM and RCPR. The former is a representative local-based method and the latter a representative holistic-based method. Similar results were obtained for SDM and IFA. In each figure, two pairs of example images are shown - one with low mirror error (lower left corner) and one with large mirror error (upper right corner). We sort the sample-wise alignment error in ascending order and plot it together with the corresponding sample mirror error. It is clear that although GN-DPM and RCPR work in very different ways, for both the mirror error tends to increase as the alignment error increases. There are a few impulses in the lower range of the red curve, i.e., low ${}^{q}e_a$ and high $e_m$. This means that although the algorithm has a small alignment error on the original sample, it has a large error on the mirror image, i.e., ${}^{p}e_a$ is high. There are three cases that result in a high mirror error: 1) low ${}^{q}e_a$ and high ${}^{p}e_a$; 2) high ${}^{q}e_a$ and low ${}^{p}e_a$ (shown in Fig. 5, upper right corner); 3) high ${}^{q}e_a$ and high ${}^{p}e_a$ (shown in Fig. 6, upper right corner). Finally, in order to quantify this insight, we present the correlation between the mirror error and the alignment error in Fig. 7. For all four methods there is a strong correlation between the mirror error and the alignment error, with correlation coefficients ranging from 0.64 to 0.74 - these are very high.

Figure 7: Correlation between the alignment error and the mirror error of various state of the art face alignment methods. The correlation coefficients are: (a) SDM [22], 49 points, 0.65888; (b) RCPR [4], 68 points, 0.73687; (c) IFA [2], 49 points, 0.74086; (d) GN-DPM [19], 49 points, 0.64161.
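The reported coefficients are ordinary Pearson correlations between the two per-sample error vectors; a minimal sketch (the numbers in the usage are synthetic, not the paper's data):

```python
import numpy as np

def error_correlation(mirror_errors, alignment_errors):
    """Pearson correlation between per-sample mirror and alignment errors."""
    return float(np.corrcoef(mirror_errors, alignment_errors)[0, 1])
```

A coefficient near 0.7, as observed above, means the blindly computed $e_m$ ranks samples similarly to the ground-truth $e_a$; e.g. `error_correlation([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])` is 1.0 for perfectly proportional errors.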
3. Mirrorability Applications
In the previous sections we have shown that one of the nice properties of the mirror error is that it is strongly correlated with the object alignment error, that is, with the ground truth error. In this section we show how it can be used in two practical applications, namely for selecting difficult samples and for providing feedback in a cascaded face alignment method.
For any computer vision task, including face alignment, it is generally accepted that some samples are relatively more difficult than others, that is, the error of the algorithm on them is higher. However, it is very difficult to estimate a measure of how well the algorithm has performed on a given sample without knowledge of the ground truth. Such a measure would be very useful, for example in order to select a proper alignment model for a given dataset, or to select which samples to annotate in an Active Learning scheme. Here, we show how the mirror error can be used for selecting difficult samples in the problem of face alignment. In order to do so, we apply several methods (IFA, SDM, GN-DPM, RCPR) on the test images of 300W and get the detection results. Then we sort the normalized mirror error $e_m$ in descending order and select the first $M$ samples as being the most difficult ones. We denote this set as $S_{e_m}$.

In order to evaluate whether the samples that we have selected in this way are truly 'difficult', we measure the similarity between the set containing those $M$ selected samples and the set $S_{e_a}$ that contains the $M$ samples that have the largest alignment error $e_a$ for each method. We use a measure that we call consistency, which we define as the fraction of common samples between the two sets, that is,

$$\rho = \frac{|S_{e_m} \cap S_{e_a}|}{M} \quad (3)$$

where $|S_{e_m} \cap S_{e_a}|$ is the size of the intersection of the two sets. For each method $i$, we calculate two sets, each containing $M$ samples, i.e., $S^i_{e_m}$ and $S^i_{e_a}$. We set the value of $M$ to 150. The chance rate is $M/N$, where $M$ is the number of selected samples and $N$ is the size of the dataset - in our case approximately 0.22.

Figure 8: Consistency measure of 'difficult' sample detection, with M = 150.

(a) $\rho$ of $S_{e_a} \Leftrightarrow S_{e_m}$:
          RCPR  IFA   GN-DPM  SDM
RCPR      0.81  0.68  0.63    0.66
IFA       0.66  0.79  0.62    0.66
GN-DPM    0.61  0.60  0.77    0.61
SDM       0.61  0.63  0.56    0.70

(b) $\rho$ of $S_{e_m} \Leftrightarrow S_{e_m}$:
          RCPR  IFA   GN-DPM  SDM
RCPR      1.00  0.68  0.61    0.55
IFA       0.68  1.00  0.54    0.58
GN-DPM    0.61  0.54  1.00    0.53
SDM       0.55  0.58  0.53    1.00

(c) $\rho$ of $S_{e_a} \Leftrightarrow S_{e_a}$:
          RCPR  IFA   GN-DPM  SDM
RCPR      1.00  0.72  0.60    0.74
IFA       0.72  1.00  0.64    0.73
GN-DPM    0.60  0.64  1.00    0.62
SDM       0.74  0.73  0.62    1.00

The pairwise consistency rate matrix of $S^i_{e_m}$ and $S^i_{e_a}$ is shown in Fig. 8a, where each row shows the consistency between the $S^i_{e_m}$ of one method and the $S^i_{e_a}$ of all methods, including the method itself. Note that the diagonal does not contain ones, since $S^i_{e_m}$ are the $M$ samples with the highest mirror error and $S^i_{e_a}$ the $M$ samples with the highest alignment error. As can be seen, the consistency between the two sets of samples for a specific method (i.e., the diagonal values) is in all cases above 0.7 - the highest is 0.81 for RCPR. More interestingly, the consistency across different methods, i.e., between the $M$ samples selected according to $e_m$ for the method in a certain row and the $M$ samples selected according to $e_a$ for the method in a certain column, is high, with values ranging from 0.56 to 0.68. This shows that the samples that we have selected are truly 'difficult', not only for the method employed in the selection process but also for the other face alignment methods. In other words, the methods that we have examined have difficulties with the same images.

Second, we evaluate the consistency across different approaches, i.e., the consistency of the 'difficult' samples found by different approaches. Thus, we calculate the pairwise consistency of the sets $S^i_{e_m}$ of those methods, as shown in Fig. 8b. The resulting values are clearly much higher than the chance value of 0.22. In Fig. 8c we depict the 'optimal' case where the ground truth, that is, the alignment error itself, is used to calculate the pairwise consistency. We observe that the consistency calculated by our selection process is very close to the one calculated based on the ground truth. We can further conclude that:

• the difficulty of samples is shared by the different methods that we have examined.

• the difficult samples selected by the mirror error show high consistency across different approaches.

In recent years, cascaded methods like SDM [22], IFA [2], CFAN [25] and RCPR [4] have shown promising results in face alignment.
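The selection procedure and the consistency measure of Eq. 3 amount to comparing top-M index sets; a sketch (the toy error values are illustrative, not from the experiments):

```python
import numpy as np

def top_m(errors, m):
    """Indices of the m samples with the largest error - the 'difficult' set."""
    return set(np.argsort(np.asarray(errors))[::-1][:m].tolist())

def consistency(errors_a, errors_b, m):
    """rho of Eq. 3: fraction of samples shared by the two top-m sets."""
    return len(top_m(errors_a, m) & top_m(errors_b, m)) / m
```

With `errors_a` the mirror errors and `errors_b` the alignment errors of one method (or of two different methods), this reproduces each cell of the matrices in Fig. 8.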
Although they differ in terms of the regressor and the features that they use in each iteration, they all follow the same strategy. These methods start from one or several initializations of the face shape, often calculated from the face bounding box, and then iteratively refine the estimate of the face shape by applying at each iteration a regressor that estimates the update of the shape. These methods are intrinsically sensitive to the initialization [4, 25]. As stated in [23], only initializations that are within a range of the optimal shape can converge to the correct solution. To address this problem, [5] proposed to use several random initializations and give the final estimate as the median of the solutions to which they converge. However, having several randomly generated initializations does not guarantee that the correct solution is reached. The 'smart restart' proposed in [4] has improved the results to a certain degree. The scheme starts from different initializations and applies only 10% of the cascade. Then, the variance between the predictions is checked. If the variance is below a certain threshold, the remaining 90% of the cascade is applied as usual. Otherwise, the process is restarted with a different set of initializations.

Here, we propose to use the mirror error as feedback to close this open cascaded system. More specifically, for a given test image we first create its mirror image. Then we apply the RCPR model on the original test image and on the mirror image and calculate the mirror error. If the mirror error is above a threshold, we restart the process using different initializations; otherwise, we keep the detection results. This procedure can be applied until the mirror error is below the threshold, or until a maximum number of iterations $M$ is reached. In contrast to the original RCPR method, which keeps only the results from the last set of initializations, we keep the one that has the smallest mirror error.
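The feedback loop described above can be sketched as follows (a minimal sketch; `align`, `make_mirror` and `mirror_error` are hypothetical stand-ins for a cascaded regressor such as RCPR and for the measures of Sec. 2, and the default threshold is an assumption, not a value from the paper):

```python
def align_with_mirror_feedback(image, align, make_mirror, mirror_error,
                               threshold=0.05, max_restarts=4):
    """Restart a cascaded aligner until the mirror error is low enough.

    align(image)        -> shape estimate from a fresh random initialization
    make_mirror(image)  -> horizontally flipped image
    mirror_error(s, sm) -> e_m between a shape and the mirrored mirror-shape

    Keeps the estimate with the smallest mirror error seen so far, rather
    than the one from the last restart.
    """
    mirror_image = make_mirror(image)
    best_shape, best_err = None, float("inf")
    for _ in range(max_restarts + 1):
        shape = align(image)                 # run the cascade on the original
        shape_m = align(mirror_image)        # and on the mirror image
        e_m = mirror_error(shape, shape_m)
        if e_m < best_err:
            best_shape, best_err = shape, e_m
        if best_err <= threshold:            # feedback: good enough, stop
            break
    return best_shape, best_err
```

Keeping the best-so-far estimate is the key difference from a plain restart scheme: a later random initialization may be worse, but it can never degrade the returned result.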
This makes sense, since new random initializations do not necessarily lead to better results than past initializations.

First, we evaluate the effectiveness of our feedback scheme. Ideally, a restart will be initiated only when the current initialization is unable to lead to a good solution. Treating this as a two-class classification problem, we report results using a precision-recall based evaluation. A face alignment is considered to belong to the 'good' class if the mean alignment error is below a given fraction of the inter-ocular distance; otherwise, it is considered to belong to the 'bad' class - in the latter case a restart is needed. The precision is the number of samples classified correctly as belonging to the 'bad' (positive) class divided by the total number of samples that are classified as belonging to the 'bad' class. Recall in this context is defined as the number of true positives divided by the total number of samples that belong to the 'bad' class. For a fair comparison, we adjust our threshold on the mirror error (i.e. the threshold above which we restart the cascade with a different initialization) to get a similar recall as the RCPR with smart restart [4] gets using its default parameters. We note that our parameter could also be optimized by cross validation for better performance. As can be seen in Fig. 9, at a similar recall level, our proposed scheme has significantly higher precision (0.65 vs. 0.25) than that of the RCPR 'smart restart'; this verifies that our method is more effective in selecting the samples for which restarting the initializations is needed.

Figure 9: Restart scheme of our method vs. RCPR [4] (best viewed in color). (a) Original RCPR restart scheme: precision = 0.25, recall = 0.63. (b) Our restart scheme: precision = 0.65, recall = 0.63.

Second, we evaluate the improvement in face alignment that we obtain using our proposed feedback scheme. We compare to 1) RCPR without restart (RCPR-O), 2) RCPR with the smart restart of [4] (RCPR-S) and 3) other state of the art methods. We create two versions of our method. The first version, RCPR-F1, uses 5 initializations and at most two restarts - this allows direct comparison to the baseline method, which uses the same number of initializations and restarts. The second version, RCPR-F2, uses 10 initializations and at most 4 restarts - this version produces better results and still has good runtime performance. We compare to SDM [22], IFA [2], GN-DPM [19] and CFAN [25] - all of those have publicly available software and report good results. The results of the comparison are shown in Table 4. We compare the normalized alignment error of the common 49 inner facial landmarks for all of these methods, and the 68 facial landmarks whenever this is possible. On the challenging 300W test set, with our proposed feedback scheme, the RCPR method has the best performance compared not only to the original version of RCPR but also to all the other methods. Although good performance is obtained on the face alignment problem, we emphasize that the main focus of this work is to bring attention to the mirrorability of object localization models.

Table 4: Mean error comparison on the 49/68 facial landmarks for RCPR-F2, RCPR-F1, RCPR-S, RCPR-O, SDM, IFA, GN-DPM and CFAN.
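The precision/recall evaluation of the restart decisions reduces to counting decisions against ground-truth failure labels; a sketch (the labels in the usage are hypothetical, not experimental data):

```python
def restart_precision_recall(restart_flags, is_bad):
    """Precision/recall of restart decisions against 'bad' alignment labels.

    restart_flags[i]: True if the scheme decided to restart on sample i
                      (e.g. because e_m exceeded the threshold).
    is_bad[i]:        True if the alignment error exceeds the failure
                      threshold on the inter-ocular distance.
    """
    tp = sum(r and b for r, b in zip(restart_flags, is_bad))  # true positives
    predicted_bad = sum(restart_flags)
    actual_bad = sum(is_bad)
    precision = tp / predicted_bad if predicted_bad else 0.0
    recall = tp / actual_bad if actual_bad else 0.0
    return precision, recall
```

Matching the recall of two schemes (as done above, 0.63 for both) makes their precisions directly comparable.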
4. Related Work
As a method that estimates the quality of the output of a vision system, our method is related to works like meta-recognition [16], face recognition score analysis [21] and the recent failure alert [26] for failure prediction. Our method differs from those works in two prominent aspects: (1) we focus on the fine-grained object part localization problem, while they focus on instance level recognition or detection; (2) we do not train any additional models for evaluation, while all those methods rely on meta-systems. In the specific application of evaluating the performance of Human Pose Estimation, [9] proposed an evaluation algorithm; however, again, such an evaluation requires a meta model and it only works for that specific application.

Our method is also very different from object/feature detection methods that exploit mirror symmetry as a constraint in model building [18, 12]. We note that our model does not assume that the detected object or shape appears symmetrically in an image - such an assumption clearly does not hold true for the articulated (human body) and deformable (face) objects that we are dealing with. None of the methods that we have exploited in this paper explicitly used appearance symmetry in model learning. Our method only utilizes the mirror symmetry property to map the object parts between the original and mirror images.

Developing transformation invariant vision systems has drawn much attention in the last decades. Examples are the rotation invariant face detection method [14] and the scale invariant feature transform (SIFT) [11], which handle several transformations efficiently, including the mirror transformation. Recently, Gens and Domingos proposed Deep Symmetry Networks [8], which use symmetry groups to represent variations - it is unclear though how the proposed method could be applied to object part localization. Szegedy et al. [17] studied some intriguing properties of neural networks when dealing with certain artificial perturbations.
Our method focuses on examining the performanceof object part localization methods on one of the simplesttransforms, i.e. mirror transformation, and drawing usefulconclusions.
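The mapping of object parts between an image and its mirror can be sketched in a few lines. The following is an illustrative sketch, not code from the paper: the `swap_pairs` list (which left/right part indices exchange labels under mirroring) and the normalising length `norm` are assumptions that depend on the annotation scheme of the particular task (e.g. inter-ocular distance for faces, torso size for bodies).

```python
import numpy as np

def mirror_points(points, image_width, swap_pairs):
    """Map part locations detected on a horizontally mirrored image
    back to the original image frame: flip the x coordinate and swap
    the labels of left/right part pairs."""
    mirrored = points.copy()
    mirrored[:, 0] = (image_width - 1) - mirrored[:, 0]
    for i, j in swap_pairs:  # e.g. (left-eye index, right-eye index)
        mirrored[[i, j]] = mirrored[[j, i]]
    return mirrored

def mirror_error(pts_original, pts_on_mirror, image_width, swap_pairs, norm):
    """Mean point-to-point distance between the detection on an image
    and the back-mapped detection on its mirror image, divided by a
    normalising length (an assumption; e.g. inter-ocular distance)."""
    mapped = mirror_points(pts_on_mirror, image_width, swap_pairs)
    return np.mean(np.linalg.norm(pts_original - mapped, axis=1)) / norm
```

A perfectly mirrorable detector would yield a mirror error of zero; the mapping is an involution, so applying it twice recovers the original points.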
5. Conclusion and Discussion
In this work, we have investigated how state of the art object part localization methods behave on mirror images in comparison to how they behave on the original ones. Surprisingly, all of the methods that we have evaluated on two representative problems struggle to produce mirror-symmetric results, despite the fact that they were trained on datasets that were augmented with the mirror images.

In order to quantitatively analyze their behavior, we introduced the concept of mirrorability and defined a measure called the mirror error. Our analysis led to some interesting findings on mirrorability, among which a high correlation between the mirror error and the ground truth error. Further, since the ground truth is not needed to calculate the mirror error, we showed two applications, namely difficult sample selection and feedback in cascaded face alignment that aids a re-initialization scheme. We believe there are many other potential applications, in particular in Active Learning.

The findings of this paper raise several interesting questions. Why do some methods show better performance in terms of absolute mirror error - for example, why is it smaller for SDM and larger for RCPR? Can the design of algorithms with low mirror error lead to algorithms with good overall performance? We believe these are all interesting research problems for future work.
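The difficult-sample selection application mentioned above amounts to ranking samples by their mirror error, which requires no ground truth. A minimal sketch, assuming per-sample mirror errors have already been computed (the function name and the default selection fraction are illustrative, not from the paper):

```python
import numpy as np

def select_difficult_samples(mirror_errors, fraction=0.1):
    """Rank samples by mirror error (computable without ground truth)
    and return the indices of the most difficult fraction."""
    order = np.argsort(mirror_errors)[::-1]  # largest mirror error first
    k = max(1, int(len(mirror_errors) * fraction))
    return order[:k].tolist()
```

Because the mirror error correlates strongly with the localization error, the returned indices are likely to be the samples on which the detector actually fails, making them natural candidates for inspection or re-annotation.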
References

[1] M. Andriluka, L. Pishchulin, P. Gehler, and B. Schiele. 2D human pose estimation: New benchmark and state of the art analysis. In CVPR, 2014.
[2] A. Asthana, S. Zafeiriou, S. Cheng, and M. Pantic. Incremental face alignment in the wild. In CVPR, 2014.
[3] P. N. Belhumeur, D. W. Jacobs, D. J. Kriegman, and N. Kumar. Localizing parts of faces using a consensus of exemplars. In CVPR, 2011.
[4] X. P. Burgos-Artizzu, P. Perona, and P. Dollár. Robust face landmark estimation under occlusion. In ICCV, 2013.
[5] X. Cao, Y. Wei, F. Wen, and J. Sun. Face alignment by explicit shape regression. In CVPR, 2012.
[6] M. Eichner, M. Marin-Jimenez, A. Zisserman, and V. Ferrari. 2D articulated human pose estimation and retrieval in (almost) unconstrained still images. IJCV, 99(2):190–214, 2012.
[7] J. R. Finnerty. Did internal transport, rather than directed locomotion, favor the evolution of bilateral symmetry in animals? BioEssays, 27(11):1174–1180, 2005.
[8] R. Gens and P. Domingos. Deep symmetry networks. In NIPS, 2014.
[9] N. Jammalamadaka, A. Zisserman, M. Eichner, V. Ferrari, and C. Jawahar. Has my algorithm succeeded? An evaluator for human pose estimators. In ECCV, 2012.
[10] V. Le, J. Brandt, Z. Lin, L. Bourdev, and T. S. Huang. Interactive facial feature localization. In ECCV, 2012.
[11] D. G. Lowe. Object recognition from local scale-invariant features. In ICCV, 1999.
[12] G. Loy and J.-O. Eklundh. Detecting symmetry and symmetric constellations of features. In ECCV, 2006.
[13] D. Ramanan. Learning to parse images of articulated bodies. In NIPS, 2006.
[14] H. A. Rowley, S. Baluja, and T. Kanade. Rotation invariant neural network-based face detection. In CVPR, 1998.
[15] C. Sagonas, G. Tzimiropoulos, S. Zafeiriou, and M. Pantic. 300 faces in-the-wild challenge: The first facial landmark localization challenge. In ICCV, 2013.
[16] W. J. Scheirer, A. Rocha, R. J. Micheals, and T. E. Boult. Meta-recognition: The theory and practice of recognition score analysis. T-PAMI, 33(8):1689–1695, 2011.
[17] C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199, 2013.
[18] S. Tsogkas and I. Kokkinos. Learning-based symmetry detection in natural images. In ECCV, pages 41–54, 2012.
[19] G. Tzimiropoulos and M. Pantic. Gauss-Newton deformable part models for face alignment in-the-wild. In CVPR, 2014.
[20] F. Wang and Y. Li. Beyond physical connections: Tree models in human pose estimation. In CVPR, 2013.
[21] P. Wang, Q. Ji, and J. L. Wayman. Modeling and predicting face recognition system performance based on analysis of similarity scores. T-PAMI, 29(4):665–670, 2007.
[22] X. Xiong and F. De la Torre. Supervised descent method and its applications to face alignment. In CVPR, 2013.
[23] X. Xiong and F. De la Torre. Supervised descent method for solving nonlinear least squares problems in computer vision. arXiv preprint arXiv:1405.0601, 2014.
[24] Y. Yang and D. Ramanan. Articulated human detection with flexible mixtures of parts. T-PAMI, 35(12):2878–2890, 2013.
[25] J. Zhang, S. Shan, M. Kan, and X. Chen. Coarse-to-fine auto-encoder networks (CFAN) for real-time face alignment. In ECCV, 2014.
[26] P. Zhang, J. Wang, A. Farhadi, M. Hebert, and D. Parikh. Predicting failures of vision systems. In CVPR, 2014.
[27] X. Zhu and D. Ramanan. Face detection, pose estimation and landmark localization in the wild. In