Practical Deep Stereo (PDS): Toward applications-friendly deep stereo matching
Stepan Tulyakov
Space Engineering Center at École Polytechnique Fédérale de Lausanne
Anton Ivanov
Space Engineering Center at École Polytechnique Fédérale de Lausanne
Francois Fleuret
École Polytechnique Fédérale de Lausanne and Idiap Research Institute
Abstract
End-to-end deep-learning networks recently demonstrated extremely good performance for stereo matching. However, existing networks are difficult to use for practical applications since (1) they are memory-hungry and unable to process even modest-size images, and (2) they have to be trained for a given disparity range.

The Practical Deep Stereo (PDS) network that we propose addresses both issues. First, its architecture relies on novel bottleneck modules that drastically reduce the memory footprint in inference, and additional design choices allow it to handle greater image sizes during training. This results in a model that leverages large image context to resolve matching ambiguities. Second, a novel sub-pixel cross-entropy loss combined with a MAP estimator makes this network less sensitive to ambiguous matches, and applicable to any disparity range without re-training. We compare PDS to state-of-the-art methods published over the recent months, and demonstrate its superior performance on the FlyingThings3D and KITTI sets.
Stereo matching consists in matching every point from an image taken from one viewpoint to its physically corresponding one in the image taken from another viewpoint. The problem has applications in robotics Menze and Geiger [2015], medical imaging Nam et al. [2012], remote sensing Shean et al. [2016], virtual reality and 3D graphics, and computational photography Wang et al. [2016], Barron et al. [2015].

Recent developments in the field have been focused on stereo for hard / uncontrolled environments (wide-baseline, low-lighting, complex lighting, blurry, foggy, non-Lambertian) Verleysen and De Vleeschouwer [2016], Jeon et al. [2016], Chen et al. [2015], Galun et al. [2015], Quéau et al. [2017], usage of high-order priors and cues Hadfield and Bowden [2015], Güney and Geiger [2015], Kim and Kim [2016], Li et al. [2016], Ulusoy et al. [2017], and data-driven, and in particular, deep neural network based, methods Park and Yoon [2015], Chen et al. [2015], Žbontar and LeCun [2015], Zbontar and LeCun [2016], Luo et al. [2016], Tulyakov et al. [2017], Seki and Pollefeys [2017], Knöbelreiter et al. [2017], Shaked and Wolf [2017], Gidaris and Komodakis [2017], Kendall et al. [2017], Mayer et al. [2016], Pang et al. [2017], Chang and Chen [2018], Liang et al. [2018], Zhong et al. [2017]. This work improves on this latter line of research.
Preprint. Work in progress.

Table 1: Number of parameters, inference memory footprint, 3-pixels-error (3PE) and mean-absolute-error (MAE) on FlyingThings3D ( × with disparities). DispNetCorr1D Mayer et al. [2016], CRL Pang et al. [2017], iResNet-i2 Liang et al. [2018] and LRCR Jie et al. [2018] predict disparities as classes and are consequently over-parameterized. GC Kendall et al. [2017] omits an explicit correlation step, which results in a large memory usage during inference. Our PDS has a small number of parameters and memory footprint, the smallest 3PE and second smallest MAE, and it is the only method able to handle different disparity ranges without re-training. Note that we follow the protocol of PSM Chang and Chen [2018], and calculate the errors only for ground truth pixels with disparity < . Inference memory footprints are our theoretical estimates based on network structures and do not include memory required for storing networks' parameters (the real memory footprint will depend on implementation). Error rates and numbers of parameters are taken from the respective publications.

| Method | Params [M] | Memory [GB] | 3PE [%] | MAE [px] | Modify. Disp. |
|---|---|---|---|---|---|
| PDS (proposed) | 2.2 | 0.4 | 3.38 | 1.12 | ✓ |
| PSM Chang and Chen [2018] | 5.2 | 0.6 | n/a | 1.09 | ✗ |
| CRL Pang et al. [2017] | 78 | 0.2 | 6.20 | 1.32 | ✗ |
| iResNet-i2 Liang et al. [2018] | 43 | 0.2 | 4.57 | 1.40 | ✗ |
| DispNetCorr1D Mayer et al. [2016] | 42 | 0.1 | n/a | 1.68 | ✗ |
| LRCR Jie et al. [2018] | 30 | 9.0 | 8.67 | 2.02 | ✗ |
| GC Kendall et al. [2017] | 3.5 | 4.5 | 9.34 | 2.02 | ✗ |
The first successes of neural networks for stereo matching were achieved by substituting hand-crafted similarity measures with deep metrics Chen et al. [2015], Žbontar and LeCun [2015], Zbontar and LeCun [2016], Luo et al. [2016], Tulyakov et al. [2017] inside a legacy stereo pipeline for the post-processing (often Mei et al. [2011]). Besides deep metrics, neural networks were also used in other subtasks such as predicting a smoothness penalty in a CRF model from a local intensity pattern Seki and Pollefeys [2017], Knöbelreiter et al. [2017]. In Shaked and Wolf [2017] a “global disparity” network smooths the matching cost volume and predicts matching confidences, and in Gidaris and Komodakis [2017] a network detects and fixes incorrect disparities.
End-to-end deep stereo. Recent works attempt to solve stereo matching using a neural network trained end-to-end without post-processing Dosovitskiy et al. [2015], Mayer et al. [2016], Kendall et al. [2017], Zhong et al. [2017], Pang et al. [2017], Jie et al. [2018], Liang et al. [2018], Chang and Chen [2018]. Such a network is typically a pipeline composed of embedding, matching, regularization and refinement modules.

The embedding module produces image descriptors for left and right images, and the (non-parametric) matching module performs an explicit correlation between shifted descriptors to compute a cost volume for every disparity Dosovitskiy et al. [2015], Mayer et al. [2016], Pang et al. [2017], Jie et al. [2018], Liang et al. [2018]. This matching module may be absent, and concatenated left-right descriptors directly fed to the regularization module Kendall et al. [2017], Chang and Chen [2018], Zhong et al. [2017]. This strategy uses more context, but the deep network implementing such a module has a larger memory footprint, as shown in Table 1. In this work we reduce memory use without sacrificing accuracy by introducing a matching module that compresses concatenated left-right image descriptors into compact matching signatures.

The regularization module takes the cost volume, or the concatenation of descriptors, regularizes it, and outputs either disparities Mayer et al. [2016], Dosovitskiy et al. [2015], Pang et al. [2017], Liang et al. [2018] or a distribution over disparities Kendall et al. [2017], Zhong et al. [2017], Jie et al. [2018], Chang and Chen [2018]. In the latter case, sub-pixel disparities can be computed as a weighted average with SoftArgmin, which is sensitive to erroneous minor modes in the inferred distribution.

This regularization module is usually implemented as an hourglass deep network with shortcut connections between the contracting and the expanding parts Mayer et al. [2016], Dosovitskiy et al. [2015], Pang et al. [2017], Kendall et al. [2017], Zhong et al. [2017], Chang and Chen [2018], Liang et al. [2018]. In some models it is composed of 2D convolutions and does not treat all disparities symmetrically Mayer et al. [2016], Dosovitskiy et al. [2015], Pang et al. [2017], Liang et al. [2018], which makes the network over-parametrized and prohibits changing the disparity range without modifying its structure and re-training. Alternatively, it can use 3D convolutions that treat all disparities symmetrically Kendall et al. [2017], Zhong et al. [2017], Jie et al. [2018], Chang and Chen [2018]. As a consequence these networks have fewer parameters, but their disparity range is still non-adjustable without re-training due to SoftArgmin, as we show in § 3.3. In this work, we propose to use a novel sub-pixel MAP approximation for inference, which computes a weighted mean around the disparity with maximum posterior probability. It is more robust to erroneous modes in the distribution and allows modifying the disparity range without re-training.

Finally, some methods Pang et al. [2017], Liang et al. [2018], Jie et al. [2018] also have a refinement module that refines the initial low-resolution disparity relying on an attention map computed as the left-right warping error. The training of end-to-end networks is usually performed in a fully supervised manner (except for Zhong et al. [2017]).

All described methods Dosovitskiy et al. [2015], Mayer et al. [2016], Kendall et al. [2017], Zhong et al. [2017], Pang et al. [2017], Jie et al. [2018], Liang et al. [2018], Chang and Chen [2018] use modest-size image patches during training. In this work, we show that training on full-size images boosts the network's ability to utilize large context and improves its accuracy. Also, the methods, even the ones producing a disparity distribution, rely on the L1 loss, since it allows training the network to produce sub-pixel disparities. We instead propose to use a more “natural” sub-pixel cross-entropy loss that ensures faster convergence and better accuracy.
Our contributions can be summarized as follows:

1. We decrease the memory footprint by introducing a novel bottleneck matching module. It compresses the concatenated left-right image descriptors into compact matching signatures, which are then concatenated and fed to the hourglass network we use as regularization module, instead of the concatenated descriptors themselves as in Kendall et al. [2017], Chang and Chen [2018]. The reduced memory footprint allows processing larger images and training on full-size images, which boosts the network's ability to utilize large context.

2. Instead of computing the posterior mean of the disparity and training with a vanilla L1 penalty Chang and Chen [2018], Jie et al. [2018], Zhong et al. [2017], Kendall et al. [2017], we propose for inference a sub-pixel MAP approximation that computes a weighted mean around the disparity with maximum posterior probability, which is robust to erroneous modes in the disparity distribution and allows modifying the disparity range without re-training. For training we similarly introduce a sub-pixel criterion by combining the standard cross-entropy with a kernel interpolation, which provides faster convergence rates and higher accuracy.

In the experimental section, we validate our contributions. In § 3.2 we show how the reduced memory footprint allows training on full-size images and leveraging large image contexts to improve performance. In § 3.3 we demonstrate that, thanks to the proposed sub-pixel MAP and cross-entropy, we are able to modify the disparity range without re-training, and to improve the matching accuracy. Then, in § 3.4 we compare our method to state-of-the-art baselines and show that it has the smallest 3-pixels error (3PE) and second smallest mean absolute error (MAE) on the FlyingThings3D set and ranks third and fourth on the KITTI’15 and KITTI’12 sets respectively.
Our network takes as input the left and right color images $\{x^L, x^R\}$ of size $W \times H$ and produces a “cost tensor” $C = \mathrm{Net}(x^L, x^R \mid \Theta, D)$ of size $D \times W \times H$, where $\Theta$ are the model's parameters, and $D \in \mathbb{N}$ is the maximum disparity.

The computed cost tensor is such that $C_{k,i,j}$ is the cost of matching the pixel $x^L_{i,j}$ in the left image to the pixel $x^R_{i-k,j}$ in the right image, which is equivalent to assigning the disparity $d_{i,j} = 2k$ to the left image pixel.

This cost tensor $C$ can then be converted into a posterior probability tensor as

$$P(d \mid x^L, x^R) = \operatorname{softmax}(-C).$$
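To make the conversion from costs to posterior probabilities concrete, the following minimal sketch applies softmax(−C) along the disparity axis for a single pixel. It is only an illustration in plain Python: the function name `posterior_from_costs` and the cost values are hypothetical, and the actual network operates on whole tensors.

```python
import math

def posterior_from_costs(costs):
    """Turn the matching costs of one pixel into softmax(-C) probabilities."""
    # Shift by the smallest cost so the exponentials stay numerically stable.
    c_min = min(costs)
    weights = [math.exp(-(c - c_min)) for c in costs]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical costs C[k] for one pixel; index k stands for disparity d = 2k.
costs = [5.0, 1.0, 0.2, 1.5, 6.0]
probs = posterior_from_costs(costs)

# The lowest cost receives the highest posterior probability.
best_k = max(range(len(probs)), key=probs.__getitem__)
best_disparity = 2 * best_k
```

Because the softmax is taken over the negated costs, a lower matching cost always maps to a higher probability, and the probabilities over all candidate disparities of a pixel sum to one.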
Figure 1: Network structure and processing flow during training and inference. Input / output quantities are outlined with thin lines, while processing modules are drawn with thick ones. Following the vocabulary introduced in § 1, the yellow shapes are embedding modules, the red rectangle the matching module and the blue shape the regularization module. The matching module is a contribution of our work, as in previous methods Kendall et al. [2017], Chang and Chen [2018] left and shifted right descriptors are directly fed to the regularization module (hourglass network). Note that the concatenated compact matching signature tensor is a 4D tensor represented here as 3D by combining the feature indexes and disparities along the vertical axis.

The overall structure of the network and processing flow during training and inference are shown in Figure 1, and we can summarize for clarity the input/output to and from each of the modules:

• The embedding module takes as input a color image × W × H, and computes an image descriptor × W × H.
• The matching module takes as input, for each disparity d, a left and a (shifted) right image descriptor both × W × H, and computes a compact matching signature × W × H. This module is unique to our network and described in detail in § 2.2.
• The regularization module is an hourglass 3D convolution neural network with shortcut connections between the contracting and the expanding parts. It takes a tensor composed of concatenated compact matching signatures for all disparities of size × D × W × H, and computes a matching cost tensor C of size D × W × H.

Additional information such as convolution filter size or channel numbers is provided in the Supplementary materials.

According to the taxonomy in Scharstein and Szeliski [2001] all traditional stereo matching methods consist of (1) matching cost computation, (2) cost aggregation, (3) optimization, and (4) disparity refinement steps. In the proposed network, the embedding and the matching modules are roughly responsible for step (1) and the regularization module for steps (2-4).

Besides the matching module, there are several other design choices that reduce the test and training memory footprint of our network. In contrast to Kendall et al. [2017] we use aggressive four-times sub-sampling in the embedding module, and the hourglass DNN we use as regularization module produces probabilities only for even disparities. Also, after each convolution and transposed convolution in our network we place Instance Normalization (IN) Ulyanov et al. [2016] instead of Batch Normalization (BN), as shown in the Supplementary materials, since we use individual full-size images during training.

The core of state-of-the-art methods Kendall et al. [2017], Zhong et al. [2017], Jie et al. [2018], Chang and Chen [2018] is the 3D-convolution hourglass network used as regularization module, which takes as input a tensor composed of concatenated left-right image descriptors for all possible disparity values. The size of this tensor gives such networks a huge memory footprint during inference. We decrease the memory usage by implementing a novel matching module as a DNN with a “bottleneck” architecture. This module compresses the concatenated left-right image descriptors into a compact matching signature for each disparity, and the results are then concatenated and fed to the hourglass module. This contrasts with existing methods, which directly feed the concatenated descriptors Kendall et al. [2017], Zhong et al. [2017], Jie et al. [2018], Chang and Chen [2018].

Figure 2: Comparison of the proposed sub-pixel MAP with the standard SoftArgmin: (a) in the presence of a multi-modal distribution, SoftArgmin blends all the modes and produces an incorrect disparity estimate; (b) when the disparity range is extended (blue area), the SoftArgmin estimate may degrade due to additional modes.

Figure 3: The target distribution of the sub-pixel cross-entropy is a discretized Laplace distribution centered at the sub-pixel ground-truth disparity.

This module is inspired by CRL Pang et al. [2017] and DispNetCorr1D Pang et al. [2017], Mayer et al. [2016], which control the memory footprint (as shown in Table 1) by feeding correlation results instead of concatenated embeddings to the hourglass network, and by Zagoruyko and Komodakis [2015], who show superior performance of joint left-right image embedding. We also borrowed some ideas from the bottleneck module in ResNet He et al. [2016], since it also encourages compressed intermediate representations.
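The memory argument behind the matching module can be illustrated with a back-of-the-envelope calculation. The sketch below compares the size of the tensor fed to the regularization network when descriptors are concatenated directly against the size when they are first compressed into signatures. Every number in it (channel counts, number of matched shifts, image size) is an illustrative assumption rather than a value from the paper, and it covers only this one input tensor, not the whole inference footprint.

```python
def tensor_megabytes(channels, n_disparities, width, height, bytes_per_value=4):
    """Size in MB of a dense float32 tensor: channels x disparities x height x width."""
    return channels * n_disparities * width * height * bytes_per_value / 2**20

# All values below are illustrative assumptions, not taken from the paper.
width, height = 960 // 4, 540 // 4   # feature maps after four-times sub-sampling
n_disparities = 64                   # hypothetical number of matched shifts
descriptor_channels = 32             # hypothetical per-image descriptor size
signature_channels = 8               # hypothetical compact signature size

# Direct concatenation stacks the left and right descriptors for every shift.
concat_mb = tensor_megabytes(2 * descriptor_channels, n_disparities, width, height)
# The bottleneck matching module keeps only a small signature per shift.
signature_mb = tensor_megabytes(signature_channels, n_disparities, width, height)

reduction = concat_mb / signature_mb
```

Under these assumptions the input to the regularization network shrinks by the ratio of channel counts, which is why compressing each left-right pair before the 3D convolutions pays off.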
In state-of-the-art methods, a network produces a posterior disparity distribution and then uses a SoftArgmin module Kendall et al. [2017], Zhong et al. [2017], Jie et al. [2018], Chang and Chen [2018], introduced in Kendall et al. [2017], to compute the predicted sub-pixel disparity as an expectation of this distribution:

$$\hat{d} = \sum_{d} d \cdot P(\mathbf{d} = d \mid x^L, x^R).$$

This SoftArgmin approximates a sub-pixel maximum a posteriori (MAP) solution when the distribution is unimodal and symmetric. However, as illustrated in Figure 2, this strategy suffers from two key weaknesses. First, when these assumptions are not fulfilled, for instance if the posterior is multi-modal, this averaging blends the modes and produces a disparity estimate far from all of them. Second, if we want to apply the model to a greater disparity range without re-training, the estimate may degrade even more due to additional modes.

The authors of Kendall et al. [2017] argue that when the network is trained with the SoftArgmin, it adapts to it during learning by rescaling its output values to make the distribution unimodal. However, the network learns this rescaling only for the disparity range used during training. If we decide to change the disparity range at test time, we will have to re-train the network.

To address both of these drawbacks, we propose to use for inference a sub-pixel MAP approximation that computes a weighted mean around the disparity with maximum posterior probability as

$$\tilde{d} = \sum_{d \,:\, |\hat{d} - d| \le \delta} d \cdot P(\mathbf{d} = d \mid x^L, x^R), \quad \text{where } \hat{d} = \arg\max_{0 \le d \le D} P(\mathbf{d} = d \mid x^L, x^R), \tag{1}$$

with $\delta$ a meta-parameter (in our experiments we choose $\delta = 4$ based on a small-scale grid search experiment on the validation set). The approximation works under the assumption that the distribution is symmetric in the vicinity of a major mode.

In contrast to the SoftArgmin, the proposed sub-pixel MAP is used only for inference.
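The difference between the two estimators is easiest to see on a bimodal posterior. The following minimal sketch, in plain Python for a single pixel, contrasts SoftArgmin with the sub-pixel MAP of Eq. (1). Two points are assumptions on our side: the probabilities inside the window are renormalized so that the result is a proper weighted mean, and the toy distribution is invented for illustration.

```python
def soft_argmin(probs, disparities):
    """Expectation of the disparity under the full posterior."""
    return sum(d * p for d, p in zip(disparities, probs))

def subpixel_map(probs, disparities, delta=4):
    """Weighted mean restricted to a window around the most probable disparity."""
    k_hat = max(range(len(probs)), key=probs.__getitem__)
    d_hat = disparities[k_hat]
    window = [(d, p) for d, p in zip(disparities, probs) if abs(d - d_hat) <= delta]
    # Assumption: renormalize the window so that the weights sum to one.
    mass = sum(p for _, p in window)
    return sum(d * p for d, p in window) / mass

# Toy bimodal posterior over even disparities: a major mode at 10, a minor one at 50.
disparities = list(range(0, 64, 2))
probs = [0.0] * len(disparities)
probs[disparities.index(10)] = 0.6
probs[disparities.index(50)] = 0.4

blended = soft_argmin(probs, disparities)   # pulled between the two modes
robust = subpixel_map(probs, disparities)   # stays at the major mode
```

SoftArgmin lands between the two modes, far from both, while the windowed estimate ignores the minor mode entirely. The same mechanism explains why extending the disparity range, which can only add mass outside the window, leaves the sub-pixel MAP estimate unchanged.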
During training we use the posterior disparity distribution and the sub-pixel cross-entropy loss discussed in the next section.

Many methods use the L1 loss Chang and Chen [2018], Jie et al. [2018], Zhong et al. [2017], Kendall et al. [2017], even though the “natural” choice for a network that produces a posterior distribution is a cross-entropy. The L1 loss is often selected because it empirically Kendall et al. [2017] performs better than cross-entropy, and because when it is combined with SoftArgmin, it allows training a network with sub-pixel ground truth.

In this work, we propose a novel sub-pixel cross-entropy that provides faster convergence and better accuracy. The target distribution of our cross-entropy loss is a discretized Laplace distribution centered at the ground-truth disparity $d_{gt}$, shown in Figure 3 and computed as

$$Q_{gt}(d) = \frac{1}{N} \exp\left(-\frac{|d - d_{gt}|}{b}\right), \quad \text{where } N = \sum_{i} \exp\left(-\frac{|i - d_{gt}|}{b}\right),$$

where $b$ is the diversity of the Laplace distribution (in our experiments we set $b = 2$, reasoning that the distribution should reasonably cover at least several discrete disparities). With this target distribution we compute the cross-entropy as usual:

$$L(\Theta) = \sum_{d} Q_{gt}(d) \cdot \log P(\mathbf{d} = d \mid x^L, x^R, \Theta). \tag{2}$$

The proposed sub-pixel cross-entropy is different from the soft cross-entropy of Luo et al. [2016], since in our case the probability at each discrete location of the target distribution is a smooth function of the distance to the sub-pixel ground truth. This allows training the network to produce a distribution from which we can compute sub-pixel disparities using our sub-pixel MAP.

Our experiments are done with the PyTorch framework PyTorch. We initialize the weights and biases of the network using the default PyTorch initialization and train the network as shown in Table 2. During training we normalize training patches to zero mean and unit variance.
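The sub-pixel cross-entropy of Eq. (2) can be sketched for a single pixel as follows. This is a plain-Python illustration rather than the training code: the helper names are hypothetical, the candidate disparities and predictions are invented, and the loss carries the conventional leading minus sign so that it is minimized, which is our assumption since Eq. (2) as printed has none.

```python
import math

def laplace_target(d_gt, disparities, b=2.0):
    """Discretized Laplace distribution centered at the sub-pixel ground truth."""
    weights = [math.exp(-abs(d - d_gt) / b) for d in disparities]
    norm = sum(weights)
    return [w / norm for w in weights]

def subpixel_cross_entropy(log_probs, d_gt, disparities, b=2.0):
    """Cross-entropy between the Laplace target and the predicted log-posterior."""
    target = laplace_target(d_gt, disparities, b)
    # Assumption: leading minus sign, so that lower values mean a better fit.
    return -sum(q * lp for q, lp in zip(target, log_probs))

def log_softmax(scores):
    """Numerically stable log of softmax(scores)."""
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_z for s in scores]

disparities = list(range(0, 32, 2))   # hypothetical even candidate disparities
d_gt = 10.7                           # sub-pixel ground truth
target = laplace_target(d_gt, disparities)

# Two hypothetical predictions: one peaked near the ground truth, one far away.
near = log_softmax([-abs(d - 10.7) for d in disparities])
far = log_softmax([-abs(d - 24.0) for d in disparities])
loss_near = subpixel_cross_entropy(near, d_gt, disparities)
loss_far = subpixel_cross_entropy(far, d_gt, disparities)
```

Since the target spreads its mass over the neighbors of the ground truth according to their distance from it, the network is rewarded for placing probability on both sides of a sub-pixel disparity, which is what the windowed mean of the sub-pixel MAP relies on at inference.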
The optimization is performed with the RMSprop method with standard settings.

Table 2: Summary of training settings for every dataset.

| | FlyingThings3D | KITTI |
|---|---|---|
| Mode | from scratch | fine-tune |
| Lr. schedule | . for 120k, half every 20k | . for 50k, half every 20k |
| Iter. | | |
| Tr. image size | × full-size | × |
| Max disparity | 255 | 255 |
| Augmentation | not used | mixUp Zhang et al. [2018], anisotropic zoom, random crop |

We guarantee reproducibility of all experiments in this section by using only available datasets, and by making our code available online under an open-source license after publication.
We used three datasets for our experiments: KITTI’12 Geiger et al. [2012] and KITTI’15 Menze and Geiger [2015], which we combined into a KITTI set, and FlyingThings3D Mayer et al. [2016], summarized in Table 3. The KITTI’12 and KITTI’15 sets have online scoreboards KITTY.

The FlyingThings3D set suffers from two problems: (1) as noticed in Pang et al. [2017], Zhang et al. [2018], some images have very large (up to ) or negative disparities; (2) some images are rendered with black dots artifacts. For the training we use only images without artifacts and with disparities ∈ [0 , .

We noticed that this is dealt with in some previous publications by processing the test set using the ground truth for benchmarking, without mentioning it. Such pre-processing may consist of ignoring pixels with disparity > Chang and Chen [2018], or discarding images with more than of pixels with disparity > Pang et al. [2017], Liang et al. [2018]. Although this is not commendable, for the sake of comparison we followed the same protocol as Chang and Chen [2018], which is the method closest to ours in terms of performance. In all other experiments we use the unaltered test set.

We make validation sets by withholding 500 images from the FlyingThings3D training set, and 58 from the KITTI training set, respectively.

Table 3: Datasets used for experiments. During benchmarking, we follow previous works and use a maximum disparity that is different from the absolute maximum for the datasets, provided between parentheses.

| Dataset | Test | Train | Size | Max disp. | Ground truth | Web score |
|---|---|---|---|---|---|---|
| KITTI | 395 | 395 | × | 192 (230) | sparse, ≤ px. | ✓ |
| FlyingThings3D | 4370 | 25756 | × | 192 (6773) | dense, unknown | ✗ |
We measure the performance of the network using two standard measures: (1) 3-pixels-error (3PE), which is the percentage of pixels for which the predicted disparity is off by more than 3 pixels, and (2) mean-absolute-error (MAE), the average absolute difference between the predicted disparity and the ground truth. Note that 3PE and MAE are complementary, since 3PE characterizes error robust to outliers, while MAE accounts for sub-pixel error.

In this section we show the effectiveness of training on full-size images. For that we train our network till convergence on the FlyingThings3D dataset with the L1 loss and SoftArgmin twice: the first time we use 512 × 256 training patches randomly cropped from the training images as in Kendall et al. [2017], Chang and Chen [2018], and the second time we use full-size 960 × 540 training images. Note that the latter is possible thanks to the small memory footprint of our network.

As seen in Table 4, the network trained on small patches performs better on larger than on smaller test images. This suggests that even the network that has not seen full-size images during training can utilize a larger context. As expected, the network trained on full-size images makes better use of the said context, and performs significantly better.

Figure 4: Example of disparity estimation errors with the SoftArgmin and sub-pixel MAP on the FlyingThings3D set. The first column shows the image, the second the ground truth disparity, the third the SoftArgmin estimate and the fourth the sub-pixel MAP estimate. Note that the SoftArgmin estimate, though completely wrong, is closer to the ground truth than the sub-pixel MAP estimate. This can explain the larger MAE of the sub-pixel MAP estimate.
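The two measures used throughout this section, 3PE and MAE, can be sketched in a few lines. The example below works on flat lists of per-pixel disparities; the function names and the sample values are hypothetical, and a real evaluation would also mask out pixels without ground truth.

```python
def three_pixel_error(predicted, ground_truth, threshold=3.0):
    """Percentage of pixels whose disparity error exceeds the threshold."""
    errors = [abs(p - g) for p, g in zip(predicted, ground_truth)]
    return 100.0 * sum(e > threshold for e in errors) / len(errors)

def mean_absolute_error(predicted, ground_truth):
    """Average absolute disparity error in pixels."""
    errors = [abs(p - g) for p, g in zip(predicted, ground_truth)]
    return sum(errors) / len(errors)

# Hypothetical disparities for four pixels; the last one is a gross outlier.
predicted = [10.2, 35.0, 7.5, 50.0]
ground_truth = [10.0, 34.5, 7.0, 20.0]

pe3 = three_pixel_error(predicted, ground_truth)
mae = mean_absolute_error(predicted, ground_truth)
```

The single outlier counts as just one bad pixel for 3PE regardless of how large its error is, whereas it dominates the MAE. This is the sense in which the two measures complement each other.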
Figure 5: Comparison of the convergence speed on the FlyingThings3D set with the sub-pixel cross-entropy and the L1 loss. Note that with the proposed sub-pixel cross-entropy loss (blue) the network converges faster.

Table 4: Error of the proposed PDS network on the FlyingThings3D set as a function of training patch size. The network trained on full-size images (highlighted) outperforms the network trained on small image patches. Note that in this experiment we used SoftArgmin with the L1 loss during training.

| Train size | Test size | 3PE [%] | MAE [px] |
|---|---|---|---|
| 512 × 256 | 512 × 256 | | |
| 512 × 256 | 960 × 540 | | |
| 960 × 540 | 960 × 540 | | |

We train the network with SoftArgmin, the L1 loss and full-size training images, and then test it twice: the first time with SoftArgmin for inference, and the second time with our sub-pixel MAP for inference instead.

As shown in Table 5, the substitution leads to a reduction of the 3PE and a slight increase of the MAE. The latter probably happens because in the erroneous areas the SoftArgmin estimates are completely wrong, but nevertheless closer to the ground truth since they blend all distribution modes, as shown in Figure 4.

Table 5: Performance of the sub-pixel MAP estimator and cross-entropy loss on the FlyingThings3D set. Note that: (1) if we substitute SoftArgmin with sub-pixel MAP during the test we get a lower 3PE and a similar MAE; (2) if we double the disparity range, the MAE and 3PE of the network with sub-pixel MAP almost do not change, while the errors of the network with SoftArgmin increase; (3) if we train the network with the sub-pixel cross-entropy it has a much lower 3PE and only a slightly worse MAE.

| Loss | Estimator | 3PE [%] | MAE [px] |
|---|---|---|---|
| Standard disparity range ∈ [0, 255] | | | |
| L1 + SoftArgmin | SoftArgmin | 4.50 | 3.40 |
| L1 + SoftArgmin | Sub-pixel MAP | 4.22 | 3.42 |
| Sub-pixel cross-entropy | Sub-pixel MAP | 3.80 | 3.63 |
| Increased disparity range ∈ [0, 511] | | | |
| L1 + SoftArgmin | SoftArgmin | 5.20 | 3.81 |
| L1 + SoftArgmin | Sub-pixel MAP | 4.27 | 3.53 |

When we test the same network with the disparity range increased from 255 to 511 pixels, the performance of the network with the SoftArgmin plummets, while the performance of the network with sub-pixel MAP remains almost the same, as shown in Table 5. This shows that with sub-pixel MAP we can modify the disparity range of the network on-the-fly, without re-training.

Next, we train the network with the sub-pixel cross-entropy loss and compare it to the network trained with SoftArgmin and the L1 loss. As shown in Table 5, the former network has a much smaller 3PE and only a slightly larger MAE. The convergence speed with sub-pixel cross-entropy is also much faster than with the L1 loss, as shown in Figure 5. Interestingly, Kendall et al. [2017] also report faster convergence with one-hot cross-entropy than with the L1 loss, but contrary to our results, they found that L1 provided a smaller 3PE.

In this section we show the effectiveness of our method, compared to the state-of-the-art methods. For KITTI, we computed disparity maps for the test sets with withheld ground truth, and uploaded the results to the evaluation web site. For the FlyingThings3D set we evaluated performance on the test set ourselves, following the protocol of Chang and Chen [2018] as explained in § 3.1.
FlyingThings3D set benchmarking results are shown in Table 1. Notably, the method we propose has the lowest 3PE and the second lowest MAE. Moreover, in contrast to other methods, our method has a small memory footprint and number of parameters, and it allows changing the disparity range without re-training.
KITTI’12, KITTI’15 benchmarking results are shown in Table 6. The method we propose ranks third on the KITTI’15 set and fourth on the KITTI’12 set, taking into account state-of-the-art results published a few months ago or not officially published yet: the iResNet-i2 Liang et al. [2018], PSMNet Chang and Chen [2018] and LRCR Jie et al. [2018] methods.
In this work we addressed two issues precluding the use of deep networks for stereo matching in many practical situations in spite of their excellent accuracy: their large memory footprint, and the inability to adjust to a different disparity range without complete re-training.

Table 6: KITTI’15 (top) and KITTI’12 (bottom) snapshots from 15/05/2018 with top-10 methods, including those published in recent months or not officially published yet: iResNet-i2 Liang et al. [2018], PSMNet Chang and Chen [2018] and LRCR Jie et al. [2018]. Our method (highlighted) is 3rd in the KITTI’15 and 4th in the KITTI’12 leaderboards.
References
Jonathan T Barron, Andrew Adams, YiChang Shih, and Carlos Hernández. Fast bilateral-space stereo for synthetic defocus. In CVPR, 2015.

Jia-Ren Chang and Yong-Sheng Chen. Pyramid stereo matching network. CoRR, 2018.

Zhuoyuan Chen, Xun Sun, and Liang Wang. A Deep Visual Correspondence Embedding Model for Stereo Matching Costs. ICCV, 2015.

Alexey Dosovitskiy, Philipp Fischer, Eddy Ilg, Philip Hausser, Caner Hazirbas, Vladimir Golkov, Patrick van der Smagt, Daniel Cremers, and Thomas Brox. Flownet: Learning optical flow with convolutional networks. In CVPR, 2015.

Meirav Galun, Tal Amir, Tal Hassner, Ronen Basri, and Yaron Lipman. Wide baseline stereo matching with convex bounded distortion constraints. In ICCV, 2015.

Andreas Geiger, Philip Lenz, and Raquel Urtasun. Are we ready for autonomous driving? the kitti vision benchmark suite. In CVPR, 2012.

Spyros Gidaris and Nikos Komodakis. Detect, replace, refine: Deep structured prediction for pixel wise labeling. CVPR, 2017.

Fatma Güney and Andreas Geiger. Displets: Resolving Stereo Ambiguities using Object Knowledge. CVPR, 2015.

Simon Hadfield and Richard Bowden. Exploiting high level scene cues in stereo reconstruction. In ICCV, 2015.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.

Hae-Gon Jeon, Joon-Young Lee, Sunghoon Im, Hyowon Ha, and In So Kweon. Stereo matching with color and monochrome cameras in low-light conditions. In CVPR, 2016.

Zequn Jie, Pengfei Wang, Yonggen Ling, Bo Zhao, Yunchao Wei, Jiashi Feng, and Wei Liu. Left-right comparative recurrent model for stereo matching. In CVPR, 2018.

Alex Kendall, Hayk Martirosyan, Saumitro Dasgupta, Peter Henry, Ryan Kennedy, Abraham Bachrach, and Adam Bry. End-to-end learning of geometry and context for deep stereo regression. ICCV, 2017.

K. R. Kim and C. S. Kim. Adaptive smoothness constraints for efficient stereo matching using texture and edge information. In ICIP, 2016.

KITTY. Kitti stereo scoreboards. Accessed: 05 May 2018.

Patrick Knöbelreiter, Christian Reinbacher, Alexander Shekhovtsov, and Thomas Pock. End-to-end training of hybrid cnn-crf models for stereo. CVPR, 2017.

Ang Li, Dapeng Chen, Yuanliu Liu, and Zejian Yuan. Coordinating multiple disparity proposals for stereo computation. In CVPR, 2016.

Zhengfa Liang, Yiliu Feng, Yulan Guo, Hengzhu Liu, Wei Chen, Linbo Qiao, Li Zhou, and Jianfeng Zhang. Learning for disparity estimation through feature constancy. CoRR, 2018.

Wenjie Luo, Alexander G Schwing, and Raquel Urtasun. Efficient deep learning for stereo matching. In CVPR, 2016.

Nikolaus Mayer, Eddy Ilg, Philip Hausser, Philipp Fischer, Daniel Cremers, Alexey Dosovitskiy, and Thomas Brox. A large dataset to train convolutional networks for disparity, optical flow, and scene flow estimation. In CVPR, 2016.

Xing Mei, Xun Sun, Mingcai Zhou, Shaohui Jiao, Haitao Wang, and Xiaopeng Zhang. On building an accurate stereo matching system on graphics hardware. In ICCV Workshops, 2011.

Moritz Menze and Andreas Geiger. Object scene flow for autonomous vehicles. In CVPR, 2015.

Kyoung Won Nam, Jeongyun Park, In Young Kim, and Kwang Gi Kim. Application of stereo-imaging technology to medical field. Healthcare informatics research, 2012.

Jiahao Pang, Wenxiu Sun, JS Ren, Chengxi Yang, and Qiong Yan. Cascade residual learning: A two-stage convolutional neural network for stereo matching. In ICCVW, 2017.

Min-Gyu Park and Kuk-Jin Yoon. Leveraging stereo matching with learning-based confidence measures. In CVPR, 2015.

PyTorch. Pytorch web site. http://pytorch.org/ Accessed: 05 May 2018.

Yvain Quéau, Tao Wu, François Lauze, Jean-Denis Durou, and Daniel Cremers. A non-convex variational approach to photometric stereo under inaccurate lighting. In CVPR, 2017.

Daniel Scharstein and Richard Szeliski. A Taxonomy and Evaluation of Dense Two-Frame Stereo Correspondence Algorithms. IJCV, 2001.

Akihito Seki and Marc Pollefeys. Patch based confidence prediction for dense disparity map. In BMVC, 2016.

Akihito Seki and Marc Pollefeys. Sgm-nets: Semi-global matching with neural networks. 2017.

Amit Shaked and Lior Wolf. Improved stereo matching with constant highway networks and reflective confidence learning. CVPR, 2017.

David E Shean, Oleg Alexandrov, Zachary M Moratto, Benjamin E Smith, Ian R Joughin, Claire Porter, and Paul Morin. An automated, open-source pipeline for mass production of digital elevation models (DEMs) from very-high-resolution commercial stereo satellite imagery. ISPRS, 2016.

S. Tulyakov, A. Ivanov, and F. Fleuret. Weakly supervised learning of deep metrics for stereo reconstruction. In ICCV, 2017.

Ali Osman Ulusoy, Michael J Black, and Andreas Geiger. Semantic multi-view stereo: Jointly estimating objects and voxels. In CVPR, 2017.

Dmitry Ulyanov, Andrea Vedaldi, and Victor S. Lempitsky. Instance normalization: The missing ingredient for fast stylization. CoRR, 2016.

Cedric Verleysen and Christophe De Vleeschouwer. Piecewise-planar 3d approximation from wide-baseline stereo. In CVPR, 2016.

Ting-Chun Wang, Manohar Srikanth, and Ravi Ramamoorthi. Depth from semi-calibrated stereo and defocus. In CVPR, 2016.

Sergey Zagoruyko and Nikos Komodakis. Learning to compare image patches via convolutional neural networks. 2015.

Jure Žbontar and Yann LeCun. Computing the Stereo Matching Cost With a Convolutional Neural Network. CVPR, 2015.

Jure Zbontar and Yann LeCun. Stereo matching by training a convolutional neural network to compare image patches. JMLR, 2016.

F. Zhang and B. W. Wah. Fundamental principles on learning new features for effective dense matching. IEEE Transactions on Image Processing, 27(2):822–836, Feb 2018. ISSN 1057-7149. doi: 10.1109/TIP.2017.2752370.

Hongyi Zhang, Moustapha Cisse, Yann N. Dauphin, and David Lopez-Paz. mixup: Beyond empirical risk minimization. In International Conference on Learning Representations, 2018.

Yiran Zhong, Yuchao Dai, and Hongdong Li. Self-supervised learning for stereo matching with self-improving ability.