Semi-supervised learning of deep metrics for stereo reconstruction
Stepan Tulyakov*, Anton Ivanov, and Francois Fleuret
eSpace Center, Swiss Federal Institute of Technology in Lausanne
Computer Vision and Learning Group, Idiap Research Institute and Swiss Federal Institute of Technology in Lausanne
*stepan.tulyakov@epfl.ch
Abstract
Deep-learning metrics have recently demonstrated extremely good performance for matching image patches in stereo reconstruction. However, training such metrics requires large amounts of labeled stereo images, which can be difficult or costly to collect for certain applications.

The main contribution of our work is a new semi-supervised method for learning deep metrics from unlabeled stereo images, given coarse information about the scenes and the optical system. Our method alternates between optimizing the metric with standard stochastic gradient descent, and applying stereo constraints to regularize its predictions.

Experiments on reference data-sets show that, for a given network architecture, training with this new method without ground truth produces a metric with performance as good as state-of-the-art baselines trained with the said ground truth.

This work has three practical implications. First, it helps to overcome limitations of training sets, in particular noisy ground truth. Second, it allows much more training data to be used during learning. Third, it allows a deep metric to be tuned for a particular stereo system, even if ground truth is not available.
The stereo reconstruction problem consists in estimating a depth map from two images taken from different viewpoints. The problem has many practical applications in robotics [34], remote sensing [43], and 3D graphics [47]. It has been heavily investigated for several decades [40], and recent developments have focused on designing high-order, region-based and object-specific priors [60, 10, 55, 17, 24, 29, 52, 51], and on improving the efficiency of large-scale stereo [36, 25, 16, 7]. Perhaps the most significant recent breakthrough was the use of deep metrics [12, 58], which led to considerable gains in processing speed and reconstruction accuracy (see Tables 4, 5, and 6). Our work improves upon this line of research.

Stereo reconstruction algorithms rely on epipolar geometry [18], according to which every non-occluded point in one stereo view corresponds to a point in the other view lying on a line that does not depend on the scene, but only on the optical system. This line is called an epipolar line, and for a calibrated stereo system it is known for every image point. Furthermore, for a pinhole camera, all the points lying on a given epipolar line in the second view correspond to points lying on a common epipolar line in the first view. Two such epipolar lines are called conjugate.

It is a standard procedure to warp stereo views in order to make conjugate epipolar lines in these views horizontal and vertically aligned. This is called stereo rectification, and in a rectified stereo pair, every point from the first view corresponds to a point shifted horizontally in the second view. The extent of this shift, also known as the disparity, allows the distance to the corresponding 3D point to be computed, which is the ultimate goal of stereo reconstruction.

So at the core of the stereo reconstruction process lies the matching of similar patches in the two images along epipolar lines and the estimation of the disparity. This is not a trivial task, since the local appearance of a physical point in the two views might differ due to radiometric and geometric distortions. The patch matching is usually performed using invariant similarity measures and descriptors, also known as features. Historically, the former were more popular for stereo reconstruction, while the latter were used for matching sparse points of interest.
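For a rectified pinhole stereo pair, the link between the disparity and the depth of the corresponding 3D point is the standard triangulation relation (a well-known identity we state for completeness, not a result of this paper):

$$Z = \frac{f \, B}{d},$$

where $Z$ is the depth, $f$ the focal length in pixels, $B$ the baseline between the two cameras, and $d$ the disparity. In particular, the maximum disparity $d_{\max}$ used later corresponds to the closest distance the scenes are assumed to contain.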
Invariant similarity measures [21, 19] are popular for stereo reconstruction, probably due to their low computational complexity. The simplest similarity measures are the sum of absolute differences (SAD) and the sum of squared differences (SSD). Zero-mean variants of these measures (ZSAD, ZSSD), as well as the sum of absolute gradient differences (GSAD), are invariant to local brightness changes, which can also be achieved by combining SAD and SSD with background subtraction by mean, Laplacian of Gaussian (LoG) [20] or bilateral filters [4]. Non-parametric similarity measures, such as Rank and Census [56], are invariant to arbitrary order-preserving local intensity transformations, and measures such as Mutual Information (MI) [23] explicitly model the joint intensity distribution of the two images, and are invariant to arbitrary intensity transformations. All these methods are invariant to radiometric distortions only.
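To make these hand-crafted costs concrete, here is a minimal NumPy sketch of SAD, its zero-mean variant, and a Census-based cost (our own illustration; the function names are ours, not from the paper):

```python
import numpy as np

def sad(p, q):
    """Sum of absolute differences between two same-sized patches."""
    return np.abs(p - q).sum()

def zsad(p, q):
    """Zero-mean SAD: invariant to local additive brightness changes."""
    return np.abs((p - p.mean()) - (q - q.mean())).sum()

def census(p):
    """Census transform: bit string of comparisons against the center pixel.
    Invariant to any order-preserving local intensity transformation."""
    center = p[p.shape[0] // 2, p.shape[1] // 2]
    return p.ravel() > center

def census_cost(p, q):
    """Hamming distance between the Census bit strings of two patches."""
    return int(np.count_nonzero(census(p) != census(q)))
```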
Invariant descriptors are popular for sparse point matching, and are designed to be invariant to both radiometric and geometric distortions. They are typically either local histograms of oriented image gradients, such as SIFT [30], or binary strings of local pairwise pixel comparisons, such as BRIEF [9]. Although descriptors are rarely used for stereo, there are some exceptions, such as DAISY [48], which can be computed densely and efficiently.

Recently, the community has moved from these fully hand-crafted descriptors to data-driven ones, incorporating machine-learning approaches. Most such descriptors perform discriminative dimensionality reduction, either by feature selection, as in VGG [45], linear feature extraction, as in LDAHash [46], or boosting, as in BinBoost [50].
As in other application domains of machine learning, the current trend is to move beyond "shallow" models, where the learned quantities interact linearly with hand-designed non-linearities, but are not involved in further re-combinations. The resulting "deep metrics" demonstrate extremely good performance compared to other similarity measures and descriptors, both for sparse point matching [22, 14, 44, 57, 54] and for stereo reconstruction [58, 12].

Standard deep metric networks have a Siamese architecture, introduced in [8]. They consist of two "embedding" sub-networks with complete weight sharing that join into a common "head". Each embedding sub-network is convolutional: it takes an image patch as input and outputs the patch's descriptor. The "head" is usually fully connected: it takes the two descriptors as input and outputs a similarity measure. The Siamese architecture was first used for image patch matching in its classic form in [22]. It was later shown that the "head" network may be replaced by a fixed similarity such as L2 [44] or cosine [58], that the embedding sub-networks need not share weights [57], and, finally, that the explicit notion of a descriptor might not be necessary [57].

Existing methods for training a Siamese network for patch matching are supervised, using a training set composed of positive and negative examples. Each positive (respectively negative) example is a pair composed of a reference patch and a matching patch (respectively a non-matching one) from another image.

Training either takes one example at a time, positive or negative, and adapts the similarity [44, 12, 22, 57, 54], or takes at each step both a positive and a negative example, and maximizes the difference between the similarities, hence aiming at making the two patches from the positive pair "more similar" than the two patches from the negative pair [58, 26, 6]. The latter scheme is known as "triplet contrastive learning".

Although the supervised learning of deep metrics works very well, the complexity of the models requires very large labeled training sets, which are hard to collect for real applications. Besides, even when such large sets are available, the ground truth is produced automatically from sensors and is thus usually noisy and/or may suffer from gross errors. This can be mitigated by augmenting the training set with random perturbations [58] or synthetic training data [14, 33]. However, synthesis procedures are hand-crafted and do not account for the regularities specific to the stereo system and target scene at hand.
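As an illustration of such an architecture, here is a minimal PyTorch sketch of a Siamese network with a fixed cosine "head" (our own sketch under assumptions loosely matching the KITTI column of Table 1; the paper's implementation is in Torch/Lua, and hyper-parameters here are illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Embedding(nn.Module):
    """Convolutional embedding sub-network (one 'arm' of the Siamese net):
    4 conv layers, 64 features, 3x3 receptive fields, ReLU between layers."""
    def __init__(self, n_layers=4, n_features=64):
        super().__init__()
        layers, in_ch = [], 1                 # gray-scale input patch
        for i in range(n_layers):
            layers.append(nn.Conv2d(in_ch, n_features, kernel_size=3))
            if i < n_layers - 1:              # no non-linearity after the last layer
                layers.append(nn.ReLU())
            in_ch = n_features
        self.net = nn.Sequential(*layers)

    def forward(self, patch):                 # patch: (B, 1, 9, 9)
        return self.net(patch).flatten(1)     # descriptor: (B, 64)

def cosine_head(desc_a, desc_b):
    """Fixed 'head': cosine similarity between the two descriptors."""
    return F.cosine_similarity(desc_a, desc_b, dim=1)

# Complete weight sharing: the same Embedding instance processes both patches.
embed = Embedding()
left, right = torch.randn(8, 1, 9, 9), torch.randn(8, 1, 9, 9)
similarity = cosine_head(embed(left), embed(right))
```

Note that four 3x3 convolutions without padding reduce a 9x9 patch to a single spatial position, consistent with the "equivalent patch size" of Table 1.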
2.5 Semi-supervised learning

Our work is inspired by Multi-Instance Learning (MIL) [5] and self-training [49]. The main idea behind MIL is to use "coarsely" labeled data, where one label indicates whether a group of samples contains at least one positive sample. This allows dealing with low geometrical accuracy, or even with the absence of geometrical information and a labeling at the scene level. It has been applied with success to deep learning [53].

Another strategy to relax the requirement for detailed labeling is self-training, where the training set is enriched with unlabeled data. As in transductive learning, self-training works by leveraging the information that the unlabeled data carries about the structure of the data population [11, 37].

Our most efficient method uses dynamic programming (DP) to regularize the noisy prediction of the metric as it is currently trained. A similar idea appeared in [27], in a different context, to train a deep network to recognize handwritten characters, using word-wise labels to infer character-wise labels. It has also been used to automatically segment sequences of action demonstrations into macro-actions to deal with non-Markovian decision processes [28], and the k-shortest-paths algorithm, a generalization of dynamic programming to multiple paths, was used to train a person detector from videos with time-sparse ground truth [3].

Problem formulation

We are provided with a semi-supervised training set $Tr = \{(e^r, e^+, e^-)_n\}_{n=1:N}$. Each training example is a triplet of series of $s \times s$ gray-scale patches:

• reference patches $e^r = (p^r_1, p^r_2, \dots, p^r_W)$ extracted from a horizontal line of a left rectified stereo image,
• positive patches $e^+ = (p^+_1, p^+_2, \dots, p^+_W)$ extracted from the corresponding horizontal line of the right rectified stereo image, and
• negative patches $e^- = (p^-_1, p^-_2, \dots, p^-_W)$ extracted from another horizontal line of a right rectified stereo image,

where W is the number of patches per line, and N is the number of training examples. In addition to the training set, we are provided with the maximum possible disparity $d_{\max}$, which depends on the optical system and on prior knowledge about the scene.

Our goal is to learn a deep metric $S(x, y)$ such that, for any series of reference patches $e^r$ and positive patches $e^+$, the row-wise maxima of the similarity matrix $S^{r+}_{ij} = S(p^r_i, p^+_j)$ correspond to the true matches. Note that, in contrast to [22, 57, 44, 14, 54, 12, 58], in our case each training example is not a pair of patches, but a triplet of series of patches, each taken on a horizontal line of a rectified stereo image, so that we can utilize constraints and loss functions defined jointly on such families of patches. Additionally, processing lines as a whole significantly speeds up training by allowing shared computations to be reused.

Stereo constraints

The stereo matching problem satisfies the following constraints:
(E) Epipolar constraint. Every non-occluded reference patch has a matching positive patch [18, pp. 239-241].

(D) Disparity range constraint. The offset of the reference patch index with respect to the matching positive patch index is bounded by a maximum disparity $d_{\max}$. This comes from the stereo system parameters (focal length, pixel size, baseline) and the distance range of the scenes.

(U) Uniqueness constraint. The matching positive patch is unique [32].

(C) Continuity constraint. The offsets of the reference patch indices with respect to the matching positive patch indices are similar for nearby reference patches everywhere except at depth discontinuities [32].

(O) Ordering constraint. The reference patches are ordered on their line as the matching positive patches are on theirs.

These constraints result in a particular shape of the positive similarity matrix, as pictured in Figure 1.

Figure 1: Positive similarity matrix. The bold line corresponds to the optimal matches that satisfy the stereo constraints. Elements within the disparity range are shown in gray. Note that there are no matches for some points on the reference and positive epipolar lines.
Semi-supervised training methods

We developed several semi-supervised methods that use different subsets of the stereo constraints during training. All methods alternate between two steps: (1) improving the metric, given the current estimate of the matches for the positive examples, and (2) re-computing these matches under the constraints, given the current estimate of the metric. They can be used in combination with any deep metric architecture and any gradient-based optimization method.

To each of our methods corresponds a loss function optimized in each of the two steps mentioned above. It takes as input either $S^{r+}$, or the three matrices $S^{r+}$, $S^{r-}$ and $S^{-+}$, defined respectively as

$$S^{r+}_{ij} = \begin{cases} S(p^r_i, p^+_j) & 0 \le i - j \le d_{\max} \\ -\infty & \text{otherwise} \end{cases} \quad (1)$$

$$S^{r-}_{ij} = \begin{cases} S(p^r_i, p^-_j) & 0 \le i - j \le d_{\max} \\ -\infty & \text{otherwise} \end{cases} \quad (2)$$

$$S^{-+}_{ij} = \begin{cases} S(p^-_i, p^+_j) & 0 \le i - j \le d_{\max} \\ -\infty & \text{otherwise} \end{cases} \quad (3)$$

In the next sections we describe each method in detail.

MIL method

This method is inspired by the Multi-Instance Learning (MIL) paradigm [5] and uses only the epipolar and the disparity range constraints (E) and (D) listed above. Its loss is

$$\frac{1}{|rows|} \sum_{i \in rows} \max\left(0, -\max_j S^{r+}_{ij} + \max_j S^{r-}_{ij} + \mu\right) + \frac{1}{|cols|} \sum_{j \in cols} \max\left(0, -\max_i S^{r+}_{ij} + \max_i S^{-+}_{ij} + \mu\right), \quad (4)$$

where $rows = \{d_{\max}+1, \dots, W\}$ is the set of rows of the similarity matrix that are guaranteed to have correct matches (see Figure 1), $cols = \{1, \dots, W - d_{\max}\}$ is the set of valid columns of the similarity matrix that are guaranteed to have correct matches, W is the number of patches in a horizontal line of a rectified image, and $\mu$ is a loss margin. Note that the disparity range constraint is taken into account automatically if we use the similarity matrices as defined in equations (1)-(3).

CONTRASTIVE method

This method uses the epipolar, the disparity range, and the uniqueness constraints (E), (D), and (U) listed above. Its loss is

$$\frac{1}{|rows|} \sum_{i \in rows} \max\left(0, -\max_j S^{r+}_{ij} + \max_j \hat{S}^{r+}_{ij} + \mu\right) + \frac{1}{|cols|} \sum_{j \in cols} \max\left(0, -\max_i S^{r+}_{ij} + \max_i \check{S}^{r+}_{ij} + \mu\right), \quad (5)$$

where $\hat{S}$ is the similarity matrix with its row-wise maxima masked out, and $\check{S}$ is the similarity matrix with its column-wise maxima masked out. To mask out elements of the similarity matrix, we simply substitute them with $-\infty$.

Experiments show that this method suffers from a problem opposite to the one exhibited by the MIL method: it produces an over-sharpened metric, sensitive even to small shifts from the exact match. This is also detrimental to the performance, since our goal is to find a metric invariant to small geometric transformations, such as shifts. We solved the problem by masking out all spatial neighbors within a radius $t_{sup}$ of the maxima in $\hat{S}$ and in $\check{S}$. See the supplementary materials for details.

MIL-CONTRASTIVE method

As shown in the previous sections, the CONTRASTIVE and the MIL methods have complementary properties and use the stereo constraints in orthogonal ways. We can therefore combine them into a new method that we call MIL-CONTRASTIVE.
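As an illustration of equations (1) and (4), here is a minimal PyTorch sketch of the banded similarity matrix and the MIL loss (our own sketch; the tensor shapes, the margin value, and the function names are assumptions, not the paper's Torch/Lua code):

```python
import torch

def banded_similarity(desc_ref, desc_other, d_max):
    """Eq. (1)-(3): similarities S[i, j] = <desc_ref[i], desc_other[j]>, with
    entries outside the disparity band 0 <= i - j <= d_max set to -inf.
    desc_ref, desc_other: (W, F) L2-normalized descriptors of the W patches
    of one epipolar line each (the (W, F) layout is our assumption)."""
    W = desc_ref.shape[0]
    S = desc_ref @ desc_other.t()                 # (W, W) cosine similarities
    i = torch.arange(W).view(-1, 1)
    j = torch.arange(W).view(1, -1)
    in_band = (i - j >= 0) & (i - j <= d_max)
    return S.masked_fill(~in_band, float('-inf'))

def mil_loss(S_rp, S_rn, S_np, d_max, margin=0.2):    # margin value is ours
    """Eq. (4): hinge losses pushing the best in-band similarity on the
    positive line above the best one involving the negative line. Only
    rows/columns guaranteed to contain a true match contribute."""
    W = S_rp.shape[0]
    rows = slice(d_max, W)      # 0-based analogue of {d_max+1, ..., W}
    cols = slice(0, W - d_max)  # 0-based analogue of {1, ..., W - d_max}
    row_term = torch.clamp(S_rn.max(dim=1).values[rows]
                           - S_rp.max(dim=1).values[rows] + margin,
                           min=0).mean()
    col_term = torch.clamp(S_np.max(dim=0).values[cols]
                           - S_rp.max(dim=0).values[cols] + margin,
                           min=0).mean()
    return row_term + col_term
```

The CONTRASTIVE loss of equation (5) differs only in that the competing term of each hinge is the row- or column-wise maximum of a masked copy of $S^{r+}$ itself, rather than of the negative-line matrices.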
CONTRASTIVE-DP method

This method uses all the constraints listed above. It relies on dynamic programming to find the best match-path

$$p^* = \arg\max_{p \in \mathcal{P}} \frac{1}{|p|} \sum_{(i,j) \in p} S^{r+}_{ij}, \quad (6)$$

where $\mathcal{P}$ is the set of paths $\{(i_n, j_n)\}_{n=1:M}$ which are continuous in the following sense: $\forall n > 1, (i_n, j_n) - (i_{n-1}, j_{n-1}) \in \{(0,1), (1,0), (1,1)\}$, and $(i_1, j_1) \in \{1\} \times [1, d_{\max}]$. This means that only down, right and diagonal steps are allowed, which enforces the continuity and ordering constraints (C) and (O) in the solution (a minimal sketch of this dynamic program is given below). Notice also that we search for a path with maximum average energy, rather than maximum total energy, to prevent a bias toward longer paths and consequently smaller disparities.

Given the best match-path $p^*$ found by dynamic programming, we define our loss function as

$$\frac{1}{|p^*|} \sum_{(i,j) \in p^*} \max\left(0, -S^{r+}_{ij} + \max_k \tilde{S}^{r+}_{ik} + \mu\right) + \frac{1}{|p^*|} \sum_{(i,j) \in p^*} \max\left(0, -S^{r+}_{ij} + \max_l \tilde{S}^{r+}_{lj} + \mu\right), \quad (7)$$

where $\tilde{S}$ is the similarity matrix in which all neighbors within a radius $t_{sup}$ of the elements belonging to $p^*$ are masked out by setting their values to $-\infty$.

The best match-path computed by dynamic programming might contain vertical and horizontal segments. These segments correspond to patches that are occluded by foreground objects in one of the views, and thus do not have correct matches. Therefore, in our experiments we ignore vertical and horizontal segments longer than $t_{occ}$ during learning. For more details, please refer to the supplementary materials.

Experimental setup

Our experiments were done in the Torch framework [13]. Optimization was performed with the ADAM method with standard settings, using mini-batches of size equal to the height of the training images, and no data augmentation of any sort. The weights and biases of our deep metric network were initialized in the standard way, by random sampling from a zero-mean uniform distribution.

We guarantee reproducibility of all experiments in this section by using only available data-sets, and by making our code available online under an open-source license after publication.
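Before turning to the data-sets, here is the minimal sketch of the CONTRASTIVE-DP matching step promised above (our own NumPy illustration; for simplicity it maximizes total rather than average path energy, so it has the length bias equation (6) avoids, lets paths start anywhere in-band on the first row, and omits the $t_{occ}$ occlusion filtering):

```python
import numpy as np

def best_match_path(S):
    """Dynamic program over a banded similarity matrix S (entries outside
    the disparity band are -inf), allowing only down (1,0), right (0,1)
    and diagonal (1,1) steps, which enforces the continuity (C) and
    ordering (O) constraints."""
    H, W = S.shape
    D = np.full((H, W), -np.inf)      # best energy of a path ending at (i, j)
    parent = np.zeros((H, W, 2), dtype=int)
    D[0, :] = S[0, :]                 # simplified start condition: first row
    for i in range(1, H):
        for j in range(W):
            preds = [(i - 1, j), (i, j - 1), (i - 1, j - 1)]
            best_e, best_p = max((D[pi, pj], (pi, pj))
                                 for pi, pj in preds if pi >= 0 and pj >= 0)
            D[i, j] = best_e + S[i, j]
            parent[i, j] = best_p
    i, j = H - 1, int(np.argmax(D[H - 1]))   # backtrack from the best end cell
    path = [(i, j)]
    while i > 0:
        i, j = parent[i, j]
        path.append((i, j))
    return path[::-1]
```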
Data-sets

In our experiments we use three popular benchmark data-sets: KITTI'12 [15], KITTI'15 [34] and Middlebury (MB) [40, 41, 39, 21, 38]. These data-sets have online scoreboards [1, 2] showing the comparative performance of all participating stereo methods.

The KITTI'12 and KITTI'15 data-sets each consist of 200 training and 200 test rectified stereo pairs of resolution 1226 × 370, acquired from cars moving around a city. About 30% of the pixels in the training set are supplied with a ground truth disparity, acquired by a laser altimeter, with an error of less than 3 pixels. The disparity range is about 230 pixels. Each data-set is supplied with an extension (respectively KITTI'12-EXT and KITTI'15-EXT) that contains 19 additional stereo pairs for each scene, without ground truth disparity. This allows us to use 40× more training data for the semi-supervised learning than for the supervised one (actually even more, considering that only about 30% of the pixels in the training set have labels).

The Middlebury data-set (MB) consists of 60 training and 30 test rectified stereo pairs. The images are acquired by different stereo systems and contain different artificial scenes. Their resolutions vary, from 380 × 430 up to about 3000 pixels on the longer side.

Evaluation measures

To estimate the performance of deep metrics we compute a prediction error rate, defined as the proportion of non-occluded patches for which the predicted disparity is off by more than 3 pixels.

The motivation behind this work is to improve the metric as a means to match patches in a stand-alone manner, without taking into account the interplay with the additional post-processing that may be applied in a complete stereo pipeline. Performance regarding this main objective is measured by picking the patch with the largest similarity among the patches that belong to the valid disparity range on the epipolar line. We call this the winner-take-all (WTA) error rate.

A second measure is the error rate of a complete stereo pipeline with the deep metric plugged in. This is a performance measure of direct practical interest, although not the objective we optimize during training.
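As an illustration, the WTA error rate can be computed as follows (our own minimal sketch; the array layout and names are assumptions, and the similarity volume would come from evaluating the metric at every candidate disparity):

```python
import numpy as np

def wta_error_rate(similarity, gt_disparity, valid, threshold=3):
    """Winner-take-all error rate.

    similarity   : (H, W, d_max + 1) similarity of every left-image pixel
                   to the right-image pixel at each candidate disparity
    gt_disparity : (H, W) ground truth disparity
    valid        : (H, W) True for non-occluded pixels with ground truth
    """
    predicted = similarity.argmax(axis=2)        # WTA disparity per pixel
    wrong = np.abs(predicted - gt_disparity) > threshold
    return float(wrong[valid].mean())
```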
Network architecture

The main contribution of this work is a new semi-supervised training method, not a deep metric architecture. We therefore simply adopt the overall architecture of the well-performing MC-CNN fst network from [58], shown in Table 1, and substitute our learning method for theirs.
Parameter                      KITTI'12,'15   MB
Number of CNN layers           4              5
Number of features per layer   64             64
Receptive field                3x3x64         3x3x64
Activation function            ReLU           ReLU
Equivalent patch size          9x9            11x11
Similarity metric              Cosine         Cosine

Table 1: Network architectures for the deep metric from [58] that we use in our experiments.
Comparison of the proposed methods

In this experiment we compare the performance of the proposed semi-supervised methods. The comparison was performed on the KITTI'12 data-set, using the winner-take-all (WTA) error defined above. The results are shown in Table 2.

Method            WTA error [%]   Time [hr]
MIL               18.45           45
CONTRASTIVE       17.63           30
MIL-CONTRASTIVE   16.12           65
CONTRASTIVE-DP    14.61           68

Table 2: Comparison of the proposed semi-supervised learning methods on the KITTI'12 set. All methods are used to train the same network architecture. The CONTRASTIVE-DP method, shown in bold, uses all the constraints during learning and achieves the smallest WTA error. Notice that, in general, increasing the number of constraints increases performance.

The main conclusion is that semi-supervised methods that use more stereo constraints during learning perform better. For example, MIL, which uses only the epipolar and disparity range constraints, has the largest WTA error, whereas CONTRASTIVE-DP, which uses the epipolar, disparity range, continuity, uniqueness and ordering constraints, has the smallest WTA error.

In all the following sections, we use the best-performing CONTRASTIVE-DP method only, and refer to it as MC-CNN-SS, where SS stands for semi-supervised.
Comparison with the supervised baseline

In this section, we compare the proposed semi-supervised method with our reference fully supervised deep-metric baseline [58] on the three different sets, using the winner-take-all (WTA) error defined above. The results are shown in Table 3. On the KITTI'12 and KITTI'15 sets, our method benefits from 40× more unlabeled training data than labeled training data. In the case of the MB data-set, our method does not have this advantage over the supervised method: the set has only 30% more unlabeled training data than labeled training data. This is probably the reason why our method shows slightly worse performance on this data-set than the supervised method.
Method                  WTA error [%]
                        KITTI'12   KITTI'15   MB
MC-CNN fst [58]         15.44      15.38
MC-CNN-SS fst (ours)
Table 3: Comparison of our semi-supervised learning method with the fully supervised baseline using the same network architecture [58]. The smallest WTA errors are shown in bold. Our semi-supervised method outperforms the baseline in terms of WTA error on two sets, and does virtually as well on the third. This is remarkable since, in contrast to the supervised method, ours does not use ground truth disparity during learning. For reference, the two bottom rows show the performance of two standard similarity measures and descriptors. Note that, following the setup of [58], the patches used as input to the deep-learning methods are of size 9 × 9 for KITTI'12 and '15, and 11 × 11 for MB.

Evaluation within a complete stereo pipeline

In this section we investigate how well our semi-supervised deep metric performs when combined with a complete stereo pipeline. For that, we plug it into the stereo pipeline from [58], and tune the parameters of the pipeline using a simple coordinate descent method, starting from the default values of [58]. Note that we used specific metric and pipeline parameters for each data-set.

We then computed disparity maps for the test sets, whose ground truth is withheld, and uploaded the results to the evaluation web sites of the respective data-sets [1, 2]. The obtained evaluation results are shown in Tables 5, 6 and 4. As we can see, the results with our metric, trained without ground truth, are very close to the results of the fully supervised method across all benchmarks.

These are very encouraging results, given in particular that we did not optimize the deep metric and the pipeline parameters together, and considering the performance in the winner-take-all setup reported above.

Table 4: MB benchmark [2] snapshot from 14/11/2016 with published methods (default view). Methods ranked 1, 2, 3, 4 and 5 use deep metrics for stereo matching. Note that our semi-supervised method MC-CNN-SS, shown in bold, which does not use ground truth data during training, has an error rate very similar to that of the supervised MC-CNN fst method, also shown in bold, trained with ground truth data.
Date   Algorithm   Pipeline   Err [%]   Time [s]

Table 5: KITTI'12 benchmark [1] snapshot from 14/11/2016 with published methods (default view). Methods ranked 1, 2, 3, 4, 6, and 7 use deep metrics for stereo matching. Note that our semi-supervised method MC-CNN-SS, shown in bold, which does not use ground truth data during training, has an error rate very similar to that of the supervised MC-CNN fst method, also shown in bold, trained with ground truth data. Since the MC-CNN fst method does not appear on the KITTI'12 evaluation table, due to restrictions on the number of results for a single paper, we borrowed its result from [59].

Table 6: KITTI'15 benchmark [1] snapshot from 14/11/2016 with published methods (default view). Methods ranked 1, 2, 3, 5, 6 and 7 use deep metrics for stereo matching. Note that our semi-supervised method MC-CNN-SS, shown in bold, which does not use ground truth data during training, has an error rate very similar to that of the supervised MC-CNN fst method, also shown in bold, trained with ground truth data. Since the MC-CNN fst method does not appear on the KITTI'15 evaluation table, due to restrictions on the number of results for a single paper, we borrowed its result from [59].

Qualitative evaluation

In Figure 2 we show positive similarity matrices before and after training with MC-CNN-SS on the KITTI'12 data-set. While one cannot visually distinguish the best matches in the similarity matrices before training, they become clearly visible after. This suggests that training improves the discriminative ability of the deep metric.

Figure 2: Diagonal part of the similarity matrix before and after training with MC-CNN-SS on the KITTI'12 data-set. The top figure shows one of the stereo images with two highlighted epipolar lines. The pictures below show the positive similarity matrices for these epipolar lines. Dark elements in the similarity matrices correspond to higher similarities. The WTA error is 42.01% before training and 14.61% after. Note that before training we cannot visually distinguish the best matches in the similarity matrices, while after learning they are clearly visible.

In Figure 3 we show failure cases of the learned deep metric. Most of the failures happen when the ground truth match is visually indistinguishable from the incorrect match picked by the deep metric. This happens if the reference patch is from a flat image area, an area with a repetitive texture, or an area with a horizontal edge.

Notably, some failures are triggered by probable errors in the ground truth. Such errors might worsen the outcome of supervised learning, but they do not affect the outcome of our semi-supervised learning, since it does not use the ground truth.
Generalization across data-sets

In this experiment, we study how a deep metric trained with our semi-supervised method on one data-set performs on the other data-sets, in terms of WTA error. From Table 7 it appears that a metric always performs better when the training and test populations come from the same data-set. This confirms that our semi-supervised method has great practical value: it allows the descriptor to be tuned for the particular stereo system at hand, even if no data-set with ground truth is available.
Figure 3: Failure cases of the deep metric trained with our MC-CNN-SS method on the KITTI data-set. For each example, the three patches displayed correspond to (from top to bottom): the reference patch, the predicted match and the ground-truth match. Note that, as expected, the ground truth and the predicted matches are often visually indistinguishable. This happens if the reference patch is from an area with almost horizontal edges (3, 6, 13), a flat image area (4, 5, 10), or an area with repetitive texture. Some failures are triggered by likely errors in the ground truth labeling (2, 12, 14, 16).

Training set   WTA error [%]
               KITTI'12   KITTI'15   MB
KITTI'12

Table 7: Generalization error across data-sets. The smallest WTA errors, shown in bold, correspond to the cases where the training and test populations come from the same data-set. This confirms that our semi-supervised metric has great practical value: it allows tuning the descriptors for a particular stereo system, even if no ground truth is available.

Conclusion

We proposed novel semi-supervised techniques for training patch similarity measures for stereo reconstruction. These techniques allow training with data-sets for which ground truth is not available, by relying on simple constraints coming from properties of the optical sensor, and from rough knowledge about the scenes to process.

We applied this framework to the training of a "deep metric", that is, a deep Siamese neural network that takes two patches as input and predicts a similarity measure. Benchmarking on standard data-sets shows that the resulting performance is as good as or better than published results with the same network trained on the same, but fully labeled, data-sets (see Table 3).

This very good performance can be explained by the strong redundancy of a fully labeled data-set, due to the continuity of surfaces, coupled with inevitable labeling errors. The latter can degrade the performance resulting from a fully supervised training process, and could only be mitigated by using prior knowledge about the regularity of the labeling, similar to the constraints we use.

The techniques we propose open the way, first, to using stereo reconstruction based on deep metrics for data-sets for which no ground truth exists, such as planetary measurements. Second, they will allow the training of larger neural networks with very large unlabeled data-sets. Our experiments show that the network we use does benefit from one order of magnitude more training samples than are available to the supervised method, as shown in Table 3. We expect this effect to be even more significant when our training method is used with larger networks that would over-fit existing labeled training sets.
References

[1] KITTI 2012, 2015 stereo scoreboards. Accessed: 2016-11-14.
[2] Middlebury scoreboard. http://vision.middlebury.edu/stereo/. Accessed: 2016-11-14.
[3] K. Ali, D. Hasler, and F. Fleuret. FlowBoost - Appearance learning from sparsely annotated video. In CVPR, pages 1433-1440, 2011.
[4] A. Ansar, A. Castano, and L. Matthies. Enhanced real-time stereo using bilateral filtering. 2004.
[5] B. Babenko. Multiple instance learning: algorithms and applications. 2008.
[6] V. Balntas, E. Johns, L. Tang, and K. Mikolajczyk. PN-Net: Conjoined triple deep network for learning local image descriptors. CoRR, 2016.
[7] J. T. Barron and B. Poole. The fast bilateral solver. In ECCV, 2016.
[8] J. Bromley, I. Guyon, Y. LeCun, E. Säckinger, and R. Shah. Signature verification using a "siamese" time delay neural network. In NIPS, 1994.
[9] M. Calonder, V. Lepetit, M. Özuysal, T. Trzcinski, C. Strecha, and P. Fua. BRIEF: Computing a local binary descriptor very fast. PAMI, 2012.
[10] A. Chakrabarti, Y. Xiong, S. J. Gortler, and T. Zickler. Low-level vision by consensus in a spatial hierarchy of regions. In CVPR, 2015.
[11] X. Chen, A. Shrivastava, and A. Gupta. NEIL: Extracting visual knowledge from web data. In ICCV, 2013.
[12] Z. Chen, X. Sun, and L. Wang. A deep visual correspondence embedding model for stereo matching costs. In ICCV, 2015.
[13] R. Collobert, K. Kavukcuoglu, and C. Farabet. Torch7: A Matlab-like environment for machine learning. In BigLearn, NIPS Workshop, 2011.
[14] P. Fischer, A. Dosovitskiy, and T. Brox. Descriptor matching with convolutional neural networks: a comparison to SIFT. arXiv, 2014.
[15] A. Geiger, P. Lenz, and R. Urtasun. Are we ready for autonomous driving? The KITTI vision benchmark suite. In CVPR, 2012.
[16] A. Geiger, M. Roser, and R. Urtasun. Efficient large-scale stereo matching. In ACCV, 2010.
[17] F. Güney and A. Geiger. Displets: Resolving stereo ambiguities using object knowledge. In CVPR, 2015.
[18] R. Hartley and A. Zisserman. Multiple View Geometry in Computer Vision. Cambridge University Press, 2003.
[19] H. Hirschmüller. Evaluation of stereo matching costs on images with radiometric differences. PAMI, 2008.
[20] H. Hirschmüller, P. R. Innocent, and J. Garibaldi. Real-time correlation-based stereo vision with reduced border errors. IJCV, 47, 2002.
[21] H. Hirschmüller and D. Scharstein. Evaluation of cost functions for stereo matching. In CVPR, pages 1-8, 2007.
[22] M. Jahrer, M. Grabner, and H. Bischof. Learned local descriptors for recognition and matching. In Computer Vision Winter Workshop, 2008.
[23] J. Kim, V. Kolmogorov, and R. Zabih. Visual correspondence using energy minimization and mutual information. In ICCV, 2003.
[24] K. R. Kim and C. S. Kim. Adaptive smoothness constraints for efficient stereo matching using texture and edge information. In ICIP, 2016.
[25] J. Kowalczuk, E. T. Psota, and L. C. Pérez. Real-time stereo matching on CUDA using an iterative refinement method for adaptive support-weight correspondences. Transactions on Circuits and Systems for Video Technology, 2012.
[26] B. G. V. Kumar, G. Carneiro, and I. Reid. Learning local image descriptors with deep Siamese and triplet convolutional networks by minimising global loss functions. In CVPR, 2016.
[27] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278-2323, 1998.
[28] L. Lefakis and F. Fleuret. Dynamic programming boosting for discriminative macro-action discovery. In ICML, 32:1548-1556, 2014.
[29] A. Li, D. Chen, Y. Liu, and Z. Yuan. Coordinating multiple disparity proposals for stereo computation. In CVPR, 2016.
[30] D. G. Lowe. Distinctive image features from scale-invariant keypoints. IJCV, 2004.
[31] W. Luo, A. G. Schwing, and R. Urtasun. Efficient deep learning for stereo matching. In CVPR, 2016.
[32] D. Marr and T. Poggio. A computational theory of human stereo vision. Biological Sciences, 204(1156):301-328, 1979.
[33] N. Mayer, E. Ilg, P. Häusser, P. Fischer, D. Cremers, A. Dosovitskiy, and T. Brox. A large dataset to train convolutional networks for disparity, optical flow, and scene flow estimation. In CVPR, 2016.
[34] M. Menze and A. Geiger. Object scene flow for autonomous vehicles. In CVPR, 2015.
[35] V. Ntouskos and F. Pirri. Confidence driven TGV fusion. arXiv preprint arXiv:1603.09302, 2016.
[36] E. T. Psota, J. Kowalczuk, M. Mittek, and L. C. Pérez. MAP disparity estimation using hidden Markov trees. In ICCV, 2015.
[37] S. E. Reed and H. Lee. Training deep neural networks on noisy labels with bootstrapping. In ICLR, 2015.
[38] D. Scharstein, H. Hirschmüller, Y. Kitajima, G. Krathwohl, N. Nešić, X. Wang, and P. Westling. High-resolution stereo datasets with subpixel-accurate ground truth. Lecture Notes in Computer Science, 2014.
[39] D. Scharstein and C. Pal. Learning conditional random fields for stereo. In CVPR, 2007.
[40] D. Scharstein and R. Szeliski. A taxonomy and evaluation of dense two-frame stereo correspondence algorithms. IJCV, 2001.
[41] D. Scharstein and R. Szeliski. High-accuracy stereo depth maps using structured light. In CVPR, 2003.
[42] A. Seki and M. Pollefeys. Patch based confidence prediction for dense disparity map. In BMVC, 2016.
[43] D. E. Shean, O. Alexandrov, Z. M. Moratto, B. E. Smith, I. R. Joughin, C. Porter, and P. Morin. An automated, open-source pipeline for mass production of digital elevation models (DEMs) from very-high-resolution commercial stereo satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing, 2016.
[44] E. Simo-Serra, E. Trulls, L. Ferraz, I. Kokkinos, P. Fua, and F. Moreno-Noguer. Discriminative learning of deep convolutional feature point descriptors. In ICCV, 2015.
[45] K. Simonyan, A. Vedaldi, and A. Zisserman. Descriptor learning using convex optimization. PAMI, 2013.
[46] C. Strecha, A. M. Bronstein, M. M. Bronstein, and P. Fua. LDAHash: Improved matching with smaller descriptors. PAMI, 2012.
[47] C. Strecha, T. Pylvänäinen, and P. Fua. Dynamic and scalable large scale image reconstruction. In CVPR, 2010.
[48] E. Tola. DAISY: A fast descriptor for dense wide baseline stereo and multiview reconstruction. PAMI, 2010.
[49] I. Triguero, S. García, and F. Herrera. Self-labeled techniques for semi-supervised learning: taxonomy, software and empirical study. Knowledge and Information Systems, 2013.
[50] T. Trzcinski, M. Christoudias, V. Lepetit, and P. Fua. Learning image descriptors with the boosting-trick. In NIPS, 2012.
[51] C. Vogel, S. Roth, and K. Schindler. View-consistent 3D scene flow estimation over multiple frames. In ECCV, 2014.
[52] C. Vogel, K. Schindler, and S. Roth. 3D scene flow estimation with a piecewise rigid scene model. IJCV, 2015.
[53] J. Wu, Y. Yu, C. Huang, and K. Yu. Deep multiple instance learning for image classification and auto-annotation. In CVPR, 2015.
[54] X. Han, T. Leung, Y. Jia, R. Sukthankar, and A. C. Berg. MatchNet: Unifying feature and metric learning for patch-based matching. In CVPR, 2015.
[55] K. Yamaguchi, D. McAllester, and R. Urtasun. Efficient joint segmentation, occlusion labeling, stereo and flow estimation. In ECCV, 2014.
[56] R. Zabih and J. Woodfill. Non-parametric local transforms for computing visual correspondence. In ECCV, 1994.
[57] S. Zagoruyko and N. Komodakis. Learning to compare image patches via convolutional neural networks. In CVPR, 2015.
[58] J. Žbontar and Y. LeCun. Computing the stereo matching cost with a convolutional neural network. In CVPR, 2015.
[59] J. Žbontar and Y. LeCun. Stereo matching by training a convolutional neural network to compare image patches. JMLR, 2016.
[60] C. Zhang and Z. Li. MeshStereo: A global stereo model with mesh alignment regularization for view interpolation. In ICCV, 2015.