Semi-synthesis: A fast way to produce effective datasets for stereo matching

Ju He∗ Enyu Zhou∗ Liusheng Sun Fei Lei Chenyang Liu Wenxiu Sun
Johns Hopkins University  SenseTime Research
∗: equal contribution
Abstract
Stereo matching is an important problem in computer vision which has drawn tremendous research attention for decades. In recent years, data-driven methods with convolutional neural networks (CNNs) have continuously pushed stereo matching to new heights. However, data-driven methods require a large amount of training data, which is hard to obtain for real stereo pairs due to the difficulty of annotating per-pixel ground-truth disparity. Though synthetic datasets have been proposed to fill the gap of large data demand, fine-tuning on real datasets is still needed due to the domain variance between synthetic and real data. In this paper, we find that in synthetic datasets, close-to-real-scene texture rendering is a key factor for boosting stereo matching performance, while close-to-real-scene 3D modeling is less important. We then propose semi-synthesis, an effective and fast way to synthesize a large amount of data with close-to-real-scene textures to minimize the gap between synthetic and real data. Extensive experiments demonstrate that models trained with our proposed semi-synthetic datasets achieve significantly better performance than with general synthetic datasets, especially on real benchmarks with limited training data. With further fine-tuning on real datasets, we also achieve state-of-the-art performance on Middlebury and competitive results on the KITTI and ETH3D datasets.
1. Introduction
Stereo matching is one of the most fundamental problems in computer vision. It is widely used in applications such as reconstruction [8, 5], robot navigation [18, 19], and autonomous driving [4, 11]. Traditional methods solve this task with carefully hand-crafted image priors and energy functions. Recently, with the resurgence of deep learning, we have witnessed significant progress in this field. Deep learning methods define loss functions and learn complex priors, but they are data-hungry, requiring a large amount of training data to reach good performance.
Figure 1. A sample pair from our semi-synthetic datasets and stereo matching results on a real stereo pair from the Middlebury dataset. (a) Left view of a semi-synthetic image pair. (b) Ground-truth disparity map. (c) Stereo matching result trained on the SceneFlow dataset. (d) Stereo matching result trained on semi-synthetic datasets.
There exist two kinds of datasets for stereo matching. One is real datasets such as Middlebury [20], KITTI [17], and ETH3D [23]; the other is synthetic datasets such as Sintel [1], Sun3D [27], and SceneFlow [15]. Though they prove their effectiveness in their corresponding tasks, they all suffer from problems that harm their practical use. Real datasets are usually small in scale due to the amount of human labor involved in constructing the scenes and annotating ground-truth information, leading to inaccurate ground truth and monotonous scenes. On the other side, synthetic datasets lack real textures, and they usually take a long time to produce because of scene construction and rendering.

To overcome these shortcomings, the two kinds of datasets are now usually combined to train models: models are first pre-trained on large general synthetic datasets and then fine-tuned on the corresponding real datasets. However, this training strategy struggles when it is difficult to collect sufficient high-quality real data. Besides, since key factors of synthetic datasets such as textures, illumination, and disparity distribution might be totally different from the real scenes, this kind of pre-training might hurt the model's performance.

Due to the limitations of synthetic datasets, recent work has started to research how to tackle domain gaps between different datasets, either by transferring one dataset to others [25, 12] or by normalizing each dataset separately [31]. However, it is difficult to transfer certain key factors such as disparity distribution and textures into a uniform space. Their usage is also limited to existing datasets and cannot be adapted to unseen scenes.

In this paper, we propose a novel and fast data synthesis method, semi-synthesis, to produce large-scale on-demand stereo datasets. It is called semi-synthesis because we extract real image patches from the corresponding scenes and then texture them onto generated geometric shapes. With this simple method, we can easily control the key factors that affect training performance, such as disparity distribution, textures, and geometric shapes. Also, since generating an image pair is fast, we can produce a large amount of data in a short time, which greatly facilitates the training of deep models. Extensive experiments have been conducted to prove the effectiveness of our semi-synthetic datasets. Besides the simplicity of constructing large on-demand datasets compared to traditional synthetic datasets, our semi-synthetic datasets also alleviate the problem of domain gaps by not only sampling textures from the corresponding scenes but also mimicking the desired environment factors. After training only on our semi-synthetic datasets, models outperform those trained on SceneFlow on all real benchmarks, and on Middlebury they are even quite close to those fine-tuned on real data.

Our contributions are as follows:
1. We propose a novel and fast method to produce large on-demand semi-synthetic datasets for stereo matching. This method can also be extended to other fields, such as optical flow.
2. We analyze the impact of the textures and scene geometry of semi-synthetic datasets on the final performance.
3. We achieve significantly better performance on stereo matching benchmarks with our semi-synthetic datasets than with general synthetic datasets.
2. Related Work
In this section, we briefly analyze existing stereo datasets and some recent progress on stereo matching models.
Stereo datasets can be roughly classified into two categories: 1) real datasets and 2) synthetic datasets. Real datasets are constructed by using either Time-of-Flight (ToF) or structured light. ToF is a more convenient way to collect real datasets than structured light. However, it struggles with ground-truth precision, since objects which are far away or opaque cannot reflect any light. Another problem of ToF is that the resolution of the captured images is usually small, which means the details of tiny objects cannot be well learned by the network. By contrast, datasets captured with structured light are much more accurate, but they are limited to indoor scenes and their data collection is laborious, so they are usually of very small size. Synthetic datasets are usually made by first constructing background scenes, then adding some foreground objects, and finally setting up a stereo camera to capture the images.

We introduce four widely used datasets in detail here.
The Middlebury dataset [21] was one of the first datasets for stereo matching; it contains 38 real indoor scenes captured via a structured light scanner. A new version of the Middlebury dataset [20] was proposed to give more accurate annotations at a resolution of 6 megapixels, and it contains 33 novel indoor scenes. However, due to the difficulty and high cost of constructing such accurate dense stereo datasets, they are relatively small in size, which also yields the problem of low variability. The scenes are limited to indoor environments with controlled lighting conditions.
The KITTI dataset was first produced in 2012 and extended in 2015. It was recorded using a laser scanner mounted on a car. While it is also a real dataset, the recorded scenes are limited to streets with a fixed height and width. Moreover, the benchmark images are of low resolution, and the laser only provides sparse ground-truth information.
The ETH3D dataset was proposed by Schöps et al. [23] in 2017 and contains real images. Its ground truth is acquired with a laser scanner. Instead of carefully constructing scenes in a controlled laboratory environment as in Middlebury, ETH3D provides the full range of challenges of real-world photogrammetric measurements. However, it still suffers from a lack of data samples and variability.

The SceneFlow dataset is a combination of three large synthetic stereo video datasets proposed by Mayer et al. [15]. As a synthetic dataset, it has accurate dense disparity maps and variability in scenes. However, it still suffers from problems such as a fixed disparity distribution, limiting its usage for different baselines. Besides, the cost of extending it to a specific domain cannot be ignored due to its limited scene environments.
As introduced in the survey of Scharstein [22], a typical stereo matching algorithm takes four steps: matching cost calculation, matching cost aggregation, disparity calculation, and disparity refinement. Traditional methods either focus on aggregating local costs according to neighborhood information [16, 32] or construct a global energy function and minimize it [30, 7, 10].

Figure 2. Diagram of the pipeline for generating semi-synthetic datasets.
Recently, with the appearance of large-scale datasets, deep neural networks tuned for stereo matching have produced SOTA performance on several stereo benchmarks. Prior work [13, 24, 33] mainly focused on producing better features for traditional stereo matching algorithms with a convolutional neural network. The first end-to-end network, DispNet, together with the large synthetic dataset SceneFlow, was introduced in [15]. Then the widely used 3D cost volume was first proposed in GCNet [9], which uses regression to find the optimal matching results. PSMNet [2] further introduced pyramid spatial pooling and 3D hourglass structures to enlarge the network's receptive field and achieve better results.

While these methods get good results on stereo matching, they consume considerable GPU memory and time, which limits their practical use. So recent research pays more attention to fast and light methods for high-resolution images. HSM [29] built a light model with a hierarchical design. AANet [28] replaced the costly 3D convolutions with aggregation modules and achieves comparable accuracy. Cascade Cost Volume [6] further extends the hierarchical approach by narrowing the search range of deeper stages based on the output of prior stages.

These models can achieve outstanding performance with a large number of training samples but struggle when data are insufficient. Besides, a common problem is that models trained on one specific domain cannot generalize well to new domains. In this paper, we aim to solve these two problems simultaneously by generating large-scale semi-synthetic datasets, as described in the next section.
3. Method
We introduce our method of producing semi-synthetic datasets step by step in Section 3.1, followed by an analysis of our main differences from other synthetic datasets in Section 3.2.
The open-source 3D software Blender 2.83a is used to generate the required stereo data, including left images, right images, and dense ground-truth disparity. As illustrated in Figure 2, our method mainly contains the following six steps.
Preparing background and stereo camera.
For indoor scenes, we usually choose a cuboid as the background to simulate the walls, floors, and ceilings of the scene. For outdoor scenes, we generally choose a large faraway plane as the background, whose disparity value is close to zero. The example scene shown in Figure 2 is used to generate the semi-synthetic training datasets for Middlebury. Besides, we add the stereo camera, which is a built-in feature of Blender. As shown by the experiments in [14], the disparity distribution of the training set affects the results of the network.
Figure 3. Examples demonstrating the variety of our semi-synthetic datasets. By manually setting the stereo camera's hyper-parameters and selecting 3D models and image sequences, our datasets show flexibility in terms of disparity distribution, objects, and textures.
So we set the focal length, sensor size, and baseline parameters of the scene according to the testing environments to simulate the maximum disparity of the testing scenes.
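To make this step concrete, the sketch below shows how such a stereo rig could be configured through Blender's Python API; it is a minimal sketch, and the specific values (focal length, sensor size, baseline, closest expected depth) are illustrative assumptions rather than the exact settings used for our datasets.

```python
import bpy

scene = bpy.context.scene

# Add a camera and turn on Blender's built-in stereo (multi-view) support.
bpy.ops.object.camera_add(location=(0.0, -8.0, 1.5))
cam = bpy.context.object
cam.data.lens = 35.0          # focal length in mm (illustrative value)
cam.data.sensor_width = 32.0  # sensor size in mm (illustrative value)
cam.data.stereo.interocular_distance = 0.2  # baseline B in meters

scene.camera = cam
scene.render.use_multiview = True      # render left and right views
scene.render.views_format = 'STEREO_3D'

# Pinhole stereo geometry: disparity(px) = f_px * B / Z, where
# f_px = lens_mm / sensor_width_mm * image_width_px.
w = scene.render.resolution_x
f_px = cam.data.lens / cam.data.sensor_width * w
z_min = 2.0  # closest expected object depth in meters (assumption)
max_disparity = f_px * cam.data.stereo.interocular_distance / z_min
print(f"expected max disparity: {max_disparity:.1f} px")
```

Tuning the baseline and focal length against the closest object depth is what lets the generated data match the maximum disparity of a target benchmark.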
Looking for proper 3D models with UV maps.
The desired 3D models can be divided into simple geometric primitives such as cubes and real object models such as tables, which can be downloaded from the Blender kit. After acquiring these models, we delete their original materials and keep only their UV maps. Random scaling and rotation augmentation are applied for data diversity.
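As an illustration, this augmentation could be scripted as follows; the scale range and the uniform per-axis rotation are assumptions made for this sketch, not the exact ranges used in our pipeline.

```python
import math
import random
import bpy

def augment_model(obj: bpy.types.Object,
                  scale_range=(0.5, 2.0)) -> None:
    """Apply random scaling and rotation to one 3D model."""
    s = random.uniform(*scale_range)          # isotropic random scale
    obj.scale = (s, s, s)
    obj.rotation_euler = [random.uniform(0.0, 2.0 * math.pi)
                          for _ in range(3)]  # random orientation

# e.g. augment every imported mesh before rendering
for obj in bpy.data.objects:
    if obj.type == 'MESH':
        augment_model(obj)
```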
Adding 3D models to the particle system.
After choosing the needed 3D models, we use Blender's particle system to make these models move in space, avoiding manually setting motion tracks for objects. An emission source is placed in the space to emit particles (the 3D models); in our Figure 2 it is a plane above the whole space. We can simulate the disparity distribution of the test scene by controlling the particle quantity at different depth levels. In this process, a sequence of frames is captured by the stereo camera, and each frame serves as an image in the final dataset.
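A minimal sketch of such a particle setup in Blender's Python API is given below; the particle count, frame range, and the model name "model_0" are hypothetical placeholders chosen for illustration.

```python
import bpy

# Emitter: a plane above the scene that emits the 3D models as particles.
bpy.ops.mesh.primitive_plane_add(size=10.0, location=(0.0, 0.0, 6.0))
emitter = bpy.context.object

mod = emitter.modifiers.new(name="emit_models", type='PARTICLE_SYSTEM')
ps = mod.particle_system.settings
ps.count = 200             # particle quantity shapes the disparity distribution
ps.frame_start = 1
ps.frame_end = 100         # emit over the whole captured frame sequence
ps.lifetime = 200          # keep particles alive while frames are captured
ps.render_type = 'OBJECT'  # render each particle as an instanced 3D model
ps.instance_object = bpy.data.objects["model_0"]  # hypothetical model name
ps.particle_size = 1.0
```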
Collecting textures from images.
With all settings of the 3D structure finished, we collect some pictures from the testing scenes as the textures of the 3D objects. Note that we need only a small set of monocular images, without any other types of data such as ground-truth disparity. The collected images are further cropped into a series of squares to prevent them from being stretched. Putting these cropped images together yields an image sequence to be exploited later.
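The cropping step is straightforward; a possible implementation with Pillow is sketched below, where the tile size of 512 and the folder names are illustrative assumptions.

```python
from pathlib import Path
from PIL import Image

def crop_to_squares(image_path: Path, out_dir: Path, tile: int = 512) -> None:
    """Cut one collected photo into non-overlapping square patches.

    Square crops avoid stretching when a patch is later mapped
    onto an object's UV coordinates.
    """
    img = Image.open(image_path)
    w, h = img.size
    out_dir.mkdir(parents=True, exist_ok=True)
    for i, top in enumerate(range(0, h - tile + 1, tile)):
        for j, left in enumerate(range(0, w - tile + 1, tile)):
            patch = img.crop((left, top, left + tile, top + tile))
            patch.save(out_dir / f"{image_path.stem}_{i}_{j}.png")

# hypothetical folder of monocular photos collected from the target scene
for p in Path("scene_photos").glob("*.jpg"):
    crop_to_squares(p, Path("texture_sequence"))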
Texturing collected image patches to 3D models.
Each object is randomly textured with an image sampled from the sequence through the built-in features of Blender, which induces more variety in the data. Unlike other synthetic datasets, no shading mechanism is applied in our pipeline; the images are output directly as textures to the material. In this way, we ensure that the imported image textures are not disturbed by lighting, so that the network can extract the real images' features. The Mapping Scale parameter in Blender is adjusted according to the image patches to avoid over-scaling.
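One way to realize such a lighting-independent material in Blender's node system is sketched below: the image texture feeds an Emission shader directly, so no scene light affects the rendered color. The object and file names are hypothetical.

```python
import bpy

def make_unlit_material(image_path: str) -> bpy.types.Material:
    """Build a node material that emits the image texture directly,
    so scene lighting cannot disturb the real-image appearance."""
    mat = bpy.data.materials.new(name="unlit_texture")
    mat.use_nodes = True
    nodes, links = mat.node_tree.nodes, mat.node_tree.links
    nodes.clear()

    tex = nodes.new('ShaderNodeTexImage')
    tex.image = bpy.data.images.load(image_path)

    emit = nodes.new('ShaderNodeEmission')
    out = nodes.new('ShaderNodeOutputMaterial')
    links.new(tex.outputs['Color'], emit.inputs['Color'])
    links.new(emit.outputs['Emission'], out.inputs['Surface'])
    # A ShaderNodeMapping node could be inserted before the texture
    # to adjust the mapping scale and avoid over-scaling.
    return mat

obj = bpy.data.objects["model_0"]  # hypothetical object name
obj.data.materials.append(make_unlit_material("texture_sequence/patch_0_0.png"))
```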
Rendering the whole scene.
Finally, the Eevee engine is employed for rendering, and the depth maps of the left and right cameras are output from Blender's Compositing module. As we do not add a shader to the pipeline, our method is very fast: using one RTX 2080Ti, we can generate an image pair and the dense disparity ground truth at a resolution of 1500 ×.
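Given the rendered depth maps, disparity ground truth follows from the pinhole stereo relation d = f_px · B / Z. A minimal conversion sketch is shown below; the focal length in pixels, the baseline, and the depth loader are assumptions for illustration.

```python
import numpy as np

def depth_to_disparity(depth: np.ndarray,
                       focal_px: float,
                       baseline_m: float) -> np.ndarray:
    """Convert a rendered depth map (meters) to a disparity map (pixels).

    Pinhole stereo: d = f_px * B / Z. The faraway background plane has
    a large Z, so its disparity is close to zero, as described above.
    """
    return focal_px * baseline_m / np.maximum(depth, 1e-6)

# depth = load_exr("left_depth.exr")  # hypothetical loader for the Z pass
# disparity = depth_to_disparity(depth, focal_px=1120.0, baseline_m=0.2)
```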
Compared with previous synthetic datasets, two major differences exist in our semi-synthetic datasets, which contribute to our superior performance; we give a detailed analysis below.

Texture.
Compared with SceneFlow, our dataset shares a similar 3D scene design of adding foreground objects to a background, but differs in texture. In SceneFlow, object textures were chosen from a combination of procedural images, fixed real images, and texture-style photographs. We argue that procedural and texture-style textures do not contribute much to model generalization on real datasets due to their different distributions. By contrast, we sample our textures directly from images of the corresponding testing scenes to mimic the testing situations. Also, since we abandon high-level information by texturing arbitrary image patches onto objects without restriction to the objects themselves, the diversity of textures becomes much larger, which benefits the training of models.
Diversity of geometry primitives.
Previously, Watson et al. [26] also tried to use real images for making stereo matching datasets, and compared different data generation methods, such as affine warps [3] and random pasted shapes [14], in their paper. But none of the results improved significantly over SceneFlow, especially on Middlebury. We reason that this is because they lack diversity of geometry primitives. Therefore, we add various 3D models to the generation process by texturing real images onto 3D models. This operation intuitively provides much more information for the network to capture.

Figure 4. Visualization of outputs of CasStereo trained on different datasets on four Middlebury images at half resolution. Semi-Synthetic-M stands for semi-synthetic datasets with Middlebury textures. Top row: four images from the Middlebury datasets. Second row: output of CasStereo trained on SceneFlow. Third row: output of CasStereo trained on Semi-Synthetic-M. Fourth row: output of CasStereo trained on SceneFlow and Middlebury. Last row: output of CasStereo trained on Semi-Synthetic-M and Middlebury.
4. Experiments
We conduct extensive experiments to illustrate the effectiveness of our semi-synthetic datasets. We first describe our detailed setup in Section 4.1, including datasets, model structures, and metrics. Then we evaluate our models on three public benchmarks with a comparison to SceneFlow and give a detailed analysis in Section 4.2, followed by ablation studies in Section 4.3 to demonstrate the effects of different settings of several key factors in our semi-synthetic datasets.
Datasets
We use four publicly available datasets, including Middlebury-v3 [20], KITTI 2015 [17], ETH3D [23], and SceneFlow [15], plus our own semi-synthetic datasets. Middlebury-v3 contains 10 high-resolution training image pairs and 13 additional image pairs with ground truth. KITTI 2015 contains 200 low-resolution image pairs collected on streets. ETH3D consists of 27 low-resolution image pairs of both indoor and outdoor scenes. SceneFlow contains around 35k synthetic image pairs with dense ground truth.

Table 1. Results on Middlebury-v3 additional images where all pixels are evaluated. Test size stands for the test resolution of images, where Q means quarter, H means half, and F stands for full. Best results for each method are bolded, second best results are underlined.

Method | Test Size | Training Datasets | avgerr | rms | bad-1.0 | bad-2.0 | bad-4.0
PSMNet [2] | Q | SceneFlow | 13.026 | 36.281 | 61.361 | 42.810 | 26.718
 | | Semi-Synthetic-M | 4.269 | | | |
HSM [29] | F | SceneFlow | 8.806 | 26.272 | 55.267 | 36.913 | 23.122
 | | Semi-Synthetic-M | 6.402 | 22.346 | 39.487 | 24.177 | 14.497
 | | SceneFlow + Middlebury | 6.368 | 23.830 | 38.579 | 22.331 | 13.849
 | | Semi-Synthetic-M + Middlebury | | | | |
CasStereo [6] | H | SceneFlow | 17.960 | 44.488 | 50.971 | 37.948 | 28.170
 | | Semi-Synthetic-M | | | | |
 | | SceneFlow + Middlebury | 5.570 | 20.752 | 27.961 | 17.188 | 11.399
 | | Semi-Synthetic-M + Middlebury | 5.419 | 21.528 | 25.169 | 14.526 | 9.331
Implementation
We conduct our experiments based on three network structures: PSMNet [2], HSM [29], and CasStereo [6]. PSMNet is a classic work that first introduced the pyramid structure into stereo matching. HSM is a light network structure proposed to produce fast on-demand results under realistic conditions, and it can be applied to high-resolution input pairs. CasStereo is a recent work that dynamically adjusts the search range of later stages according to the results of early stages. Because the different network designs are limited by GPU memory, the sizes used for the Middlebury datasets during training and testing differ for each network. We follow the strategies implemented by the authors while making some adaptations to train these networks. To be specific:

For PSMNet, we perform color normalization on all data. During training, images are randomly cropped to size H = 256 and W = 512. The max disparity is set to for KITTI, for Middlebury, and for ETH3D.

For HSM, we apply all augmentation strategies proposed by the authors to the four public datasets, excluding our semi-synthetic datasets. During training, all images are resized to H = 512 and W = 768. The max disparity is set to for KITTI, for Middlebury, and for ETH3D.

For CasStereo, we adopt a three-stage cascade cost volume. The max disparity is set to for KITTI, for Middlebury, and for ETH3D. The corresponding disparity hypotheses and intervals are also adapted according to the max disparity.

Metrics

Our evaluation is done on the Middlebury-v3, KITTI 2015, and ETH3D benchmarks, with different metrics for each. For Middlebury-v3, we adopt the official metrics, which contain bad-4.0 (percentage of "bad" pixels whose error is greater than 4.0), bad-2.0, and bad-1.0 for evaluating the result under different accuracy requirements; avgerr (average absolute error in pixels) and rms (root-mean-square disparity error in pixels) account for subpixel-level absolute errors. These metrics are also used for evaluating ETH3D. For KITTI 2015, we adopt D1-all, measuring the percentage of outliers over all pixels.
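For reference, these metrics are simple to reproduce. The sketch below is our own illustrative re-implementation of the definitions above, not official benchmark code; the D1-all threshold (error above 3 px and above 5% of the true disparity) follows the standard KITTI definition.

```python
import numpy as np

def stereo_metrics(pred: np.ndarray, gt: np.ndarray) -> dict:
    """Compute avgerr, rms, bad-t, and D1-all for one disparity map.

    avgerr / rms follow the Middlebury definitions; bad-t is the
    percentage of pixels whose absolute error exceeds t pixels;
    D1-all counts pixels with error > 3 px and > 5% of the ground truth.
    """
    valid = np.isfinite(gt) & (gt > 0)   # skip pixels without ground truth
    err = np.abs(pred - gt)[valid]
    g = gt[valid]
    return {
        "avgerr": err.mean(),
        "rms": np.sqrt((err ** 2).mean()),
        **{f"bad-{t}": 100.0 * (err > t).mean() for t in (1.0, 2.0, 4.0)},
        "D1-all": 100.0 * ((err > 3.0) & (err > 0.05 * g)).mean(),
    }
```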
In this section, we discuss several strategies for pre-training and fine-tuning these networks. Since the benchmarks only allow one submission on the official test splits, we evaluate our experiments on validation sets of each dataset. To be specific, for Middlebury we adopt the 13 additional pairs for validation; for KITTI 2015, we follow the split protocol in HSM [29]; for ETH3D, we manually sample 13 pairs from the training sets as a validation set. We mainly conduct four experiments for each network: 1) train on SceneFlow; 2) train on semi-synthetic datasets; 3) train on SceneFlow and fine-tune on the corresponding real dataset; 4) train on semi-synthetic datasets and fine-tune on the corresponding real dataset. For each validation dataset, we re-generate the semi-synthetic datasets with the corresponding real images as textures. We denote them with a suffix, e.g., Semi-Synthetic-M means that the textures of the generated semi-synthetic datasets come from Middlebury; K stands for KITTI and E stands for ETH3D. We limit the size of the semi-synthetic datasets to be comparable with SceneFlow for a fair comparison. We pre-train our models for 10 epochs, and for the fine-tuning stage we set the number of epochs to 30.
Results on Middlebury.
Deep-learning-based methods have struggled on Middlebury due to the insufficiency of training data and the complexity of the scenes. Table 1 summarizes the results on the Middlebury dataset. All three networks achieve much better performance when trained on semi-synthetic datasets compared to those trained on SceneFlow: the average error drops by 8% on average, and the bad-2.0 rate drops by 50%. For PSMNet and CasStereo, the results are even better than those pre-trained on SceneFlow and then fine-tuned on Middlebury. Networks pre-trained on Semi-Synthetic-M and further fine-tuned on Middlebury get little performance gain, and even a negative gain in the CasStereo case. This indicates that our datasets already contain sufficient information to learn the complex Middlebury scenes.

Figure 5. Visualization of outputs of CasStereo trained on different datasets on three KITTI images. Semi-Synthetic-K stands for semi-synthetic datasets with KITTI textures. Top row: three images from the KITTI datasets. Second row: output of CasStereo trained on SceneFlow. Third row: output of CasStereo trained on Semi-Synthetic-K. Fourth row: output of CasStereo trained on SceneFlow and KITTI. Last row: output of CasStereo trained on Semi-Synthetic-K and KITTI.
Results on KITTI.
KITTI is a relatively easier dataset for stereo matching, since the scenes are comparably simple and consistent and there are adequate frames. Deep models can get very good results on it after fine-tuning for many epochs; e.g., the final submission model of PSMNet on the KITTI benchmark was fine-tuned for 1000 epochs. However, in this way the models overfit to the KITTI dataset, and their performance decreases a lot on other datasets. We only fine-tune the models on the KITTI 2015 training set for 30 epochs and compare the results. As shown in Table 2, networks pre-trained on Semi-Synthetic-K improve by a large margin compared to those pre-trained on SceneFlow. The performance is still better than the SceneFlow counterparts after fine-tuning, which demonstrates the superiority of our semi-synthetic datasets.
Results on ETH3D.
ETH3D has sparse ground truth collected by a laser scanner: some regions have no ground truth and thus cannot be learned sufficiently, and the details of object edges are blurred. This phenomenon can be alleviated by our semi-synthetic datasets with dense ground truth and finer details. Table 3 summarizes the results on the ETH3D dataset. Models trained on semi-synthetic datasets outperform those trained on SceneFlow, especially when not further fine-tuned on ETH3D.

Table 2. Results on KITTI 2015 validation sets where all pixels are evaluated. Best results for each method are bolded.

Method | Datasets | D1-all
PSMNet [2] | SceneFlow | 33.586
 | Semi-Synthetic-K | 9.510
 | SceneFlow + KITTI | 2.695
 | Semi-Synthetic-K + KITTI |
HSM [29] | SceneFlow | 8.75
 | Semi-Synthetic-K | 6.73
 | SceneFlow + KITTI | 3.75
 | Semi-Synthetic-K + KITTI |
CasStereo [6] | SceneFlow | 18.570
 | Semi-Synthetic-K | 4.924
 | SceneFlow + KITTI | 2.013
 | Semi-Synthetic-K + KITTI |

Table 3. Results on ETH3D validation sets where all pixels are evaluated. Semi-Synthetic-E stands for semi-synthetic datasets with ETH3D textures. Best results for each method are bolded.

Method | Datasets | avgerr | rms | bad-1.0 | bad-2.0 | bad-4.0
PSMNet [2] | SceneFlow | 7.042 | 15.506 | 16.460 | 9.534 | 6.878
 | Semi-Synthetic-E | 1.792 | 4.247 | 12.907 | 8.301 | 6.210
 | SceneFlow + ETH3D | 0.327 | 0.685 | 5.479 | 1.671 | 0.335
 | Semi-Synthetic-E + ETH3D | | | | |
HSM [29] | SceneFlow | 3.282 | 8.420 | 25.020 | 11.594 | 6.564
 | Semi-Synthetic-E | 1.294 | 3.178 | 10.205 | 6.241 | 4.205
 | SceneFlow + ETH3D | 0.411 | 0.812 | 5.913 | 1.892 | 0.496
 | Semi-Synthetic-E + ETH3D | | | | |
CasStereo [6] | SceneFlow | 0.521 | 1.541 | 9.033 | 3.830 | 1.836
 | Semi-Synthetic-E | 0.429 | 1.283 | 6.388 | 3.211 | 1.565
 | SceneFlow + ETH3D | 0.281 | 0.563 | 3.988 | 0.985 | 0.295
 | Semi-Synthetic-E + ETH3D | | | | |
Table 4. Ablation study of texture on Middlebury.

Method | Test Size | Texture | avgerr | rms | bad-1.0 | bad-2.0 | bad-4.0
CasStereo [6] | H | SceneFlow | 17.960 | 44.488 | 50.971 | 37.948 | 28.170
 | | Photos | 4.892 | 20.060 | 28.564 | 17.175 | 11.107
 | | Middlebury | | | | |
Table 5. Ablation study of geometry shape complexity on Middlebury.

Method | Test Size | Datasets | avgerr | rms | bad-1.0 | bad-2.0 | bad-4.0
CasStereo [6] | H | Simple | 5.090 | 20.419 | 28.768 | 16.208 | 9.794
 | | Complex | | | | |
In this section, we conduct several ablation studies on our semi-synthetic datasets to analyze the impact of several factors, e.g., the source of textures. The experiments in this section are based on the CasStereo network, evaluated on the Middlebury additional images at half resolution. Models are trained only on semi-synthetic datasets, with the corresponding factor varying and all other settings fixed.
Texture.
Table 4 summarizes the results on the Middlebury dataset with three different texture sources: SceneFlow represents traditional synthetic textures; Photos refers to randomly downloaded images; Middlebury stands for the training set of the Middlebury dataset. The texture similarity between training and test data is Middlebury > Photos > SceneFlow. The results indicate that performance improves as the similarity between training and testing textures increases, especially when the source of textures changes from synthetic to real.
Geometry shape diversity.
Table 5 summarizes Middlebury performance with varying diversity of geometry shapes. Two typical settings are compared: Simple refers to a set of simple geometric bodies such as cuboids, cones, and cylinders, while Complex denotes more complex 3D models downloaded from the Internet, as shown in Figure 2 Step 2, which are more diverse than the former. The results show that a more diverse scene helps to train the network.
5. Conclusion
In this work, we mainly study the problem of generating effective datasets for stereo matching. Existing real datasets are usually small in size, which hinders the training of deep models. On the other hand, general synthetic datasets lack real textures, and they do not coincide with real testing scenes in several environmental factors. We propose to solve this with a novel and fast method to produce large-scale on-demand semi-synthetic datasets. Our extensive experiments demonstrate the effectiveness of semi-synthetic datasets on three widely used stereo benchmarks of real scenes. For future work, we aim to continue analyzing the key factors of a good semi-synthetic dataset and to extend this method to other related fields such as optical flow.

References

[1] D. J. Butler, J. Wulff, G. B. Stanley, and M. J. Black. A naturalistic open source movie for optical flow evaluation. In A. Fitzgibbon et al., editors, European Conf. on Computer Vision (ECCV), Part IV, LNCS 7577, pages 611-625. Springer-Verlag, Oct. 2012.
[2] Jia-Ren Chang and Yong-Sheng Chen. Pyramid stereo matching network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5410-5418, 2018.
[3] Daniel DeTone, Tomasz Malisiewicz, and Andrew Rabinovich. SuperPoint: Self-supervised interest point detection and description. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 224-236, 2018.
[4] Andreas Geiger, Philip Lenz, and Raquel Urtasun. Are we ready for autonomous driving? The KITTI vision benchmark suite. In CVPR, pages 3354-3361. IEEE, 2012.
[5] Andreas Geiger, Julius Ziegler, and Christoph Stiller. StereoScan: Dense 3D reconstruction in real-time. In Intelligent Vehicles Symposium (IV), pages 963-968. IEEE, 2011.
[6] Xiaodong Gu, Zhiwen Fan, Siyu Zhu, Zuozhuo Dai, Feitong Tan, and Ping Tan. Cascade cost volume for high-resolution multi-view stereo and stereo matching. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2495-2504, 2020.
[7] H. Hirschmuller. Accurate and efficient stereo processing by semi-global matching and mutual information. In CVPR, volume 2, pages 807-814, 2005.
[8] Shahram Izadi, David Kim, Otmar Hilliges, David Molyneaux, Richard Newcombe, Pushmeet Kohli, Jamie Shotton, Steve Hodges, Dustin Freeman, Andrew Davison, et al. KinectFusion: Real-time 3D reconstruction and interaction using a moving depth camera. In Proceedings of the 24th Annual ACM Symposium on User Interface Software and Technology, pages 559-568, 2011.
[9] Alex Kendall, Hayk Martirosyan, Saumitro Dasgupta, Peter Henry, Ryan Kennedy, Abraham Bachrach, and Adam Bry. End-to-end learning of geometry and context for deep stereo regression. In Proceedings of the IEEE International Conference on Computer Vision, pages 66-75, 2017.
[10] Andreas Klaus, Mario Sormann, and Konrad Karner. Segment-based stereo matching using belief propagation and a self-adapting dissimilarity measure. In Proceedings of the 18th International Conference on Pattern Recognition (ICPR '06), pages 15-18, USA, 2006. IEEE Computer Society.
[11] Peiliang Li, Xiaozhi Chen, and Shaojie Shen. Stereo R-CNN based 3D object detection for autonomous driving. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7644-7652, 2019.
[12] Rui Liu, Chengxi Yang, Wenxiu Sun, Xiaogang Wang, and Hongsheng Li. StereoGAN: Bridging synthetic-to-real domain gap by joint optimization of domain translation and stereo matching. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12757-12766, 2020.
[13] W. Luo, A. G. Schwing, and R. Urtasun. Efficient deep learning for stereo matching. In CVPR, pages 5695-5703, 2016.
[14] Nikolaus Mayer, Eddy Ilg, Philipp Fischer, Caner Hazirbas, Daniel Cremers, Alexey Dosovitskiy, and Thomas Brox. What makes good synthetic training data for learning disparity and optical flow estimation? International Journal of Computer Vision, 126(9):942-960, 2018.
[15] Nikolaus Mayer, Eddy Ilg, Philip Hausser, Philipp Fischer, Daniel Cremers, Alexey Dosovitskiy, and Thomas Brox. A large dataset to train convolutional networks for disparity, optical flow, and scene flow estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4040-4048, 2016.
[16] X. Mei, X. Sun, W. Dong, H. Wang, and X. Zhang. Segment-tree based cost aggregation for stereo matching. In CVPR, pages 313-320, 2013.
[17] Moritz Menze and Andreas Geiger. Object scene flow for autonomous vehicles. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3061-3070, 2015.
[18] Don Murray and James J. Little. Using real-time stereo vision for mobile robot navigation. Autonomous Robots, 8(2):161-171, 2000.
[19] Javier Salmerón-García, Pablo Íñigo-Blasco, Fernando Díaz-del-Río, Daniel Cagigas-Muñiz, et al. A tradeoff analysis of a cloud-based robot navigation assistant using stereo image processing. IEEE Transactions on Automation Science and Engineering, 12(2):444-454, 2015.
[20] Daniel Scharstein, Heiko Hirschmüller, York Kitajima, Greg Krathwohl, Nera Nešić, Xi Wang, and Porter Westling. High-resolution stereo datasets with subpixel-accurate ground truth. In Xiaoyi Jiang, Joachim Hornegger, and Reinhard Koch, editors, Pattern Recognition, pages 31-42, Cham, 2014. Springer International Publishing.
[21] Daniel Scharstein and Richard Szeliski. A taxonomy and evaluation of dense two-frame stereo correspondence algorithms. International Journal of Computer Vision, 47(1-3):7-42, 2002.
[22] D. Scharstein, R. Szeliski, and R. Zabih. A taxonomy and evaluation of dense two-frame stereo correspondence algorithms. In Proceedings IEEE Workshop on Stereo and Multi-Baseline Vision (SMBV 2001), pages 131-140, 2001.
[23] Thomas Schöps, Johannes L. Schönberger, Silvano Galliani, Torsten Sattler, Konrad Schindler, Marc Pollefeys, and Andreas Geiger. A multi-view stereo benchmark with high-resolution images and multi-camera videos. In Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[24] A. Shaked and L. Wolf. Improved stereo matching with constant highway networks and reflective confidence learning. In CVPR, pages 6901-6910, 2017.
[25] Xiao Song, Guorun Yang, Xinge Zhu, Hui Zhou, Zhe Wang, and Jianping Shi. AdaStereo: A simple and efficient approach for adaptive stereo matching. arXiv preprint arXiv:2004.04627, 2020.
[26] Jamie Watson, Oisin Mac Aodha, Daniyar Turmukhambetov, Gabriel J. Brostow, and Michael Firman. Learning stereo from single images. arXiv preprint arXiv:2008.01484, 2020.
[27] Jianxiong Xiao, Andrew Owens, and Antonio Torralba. SUN3D: A database of big spaces reconstructed using SfM and object labels. In Proceedings of the IEEE International Conference on Computer Vision, pages 1625-1632, 2013.
[28] Haofei Xu and Juyong Zhang. AANet: Adaptive aggregation network for efficient stereo matching. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1959-1968, 2020.
[29] Gengshan Yang, Joshua Manela, Michael Happold, and Deva Ramanan. Hierarchical deep stereo matching on high-resolution images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5515-5524, 2019.
[30] Q. Yang. A non-local cost aggregation method for stereo matching. In CVPR, pages 1402-1409, 2012.
[31] Feihu Zhang, Xiaojuan Qi, Ruigang Yang, Victor Prisacariu, Benjamin Wah, and Philip Torr. Domain-invariant stereo matching networks. arXiv preprint arXiv:1911.13287, 2019.
[32] K. Zhang, J. Lu, and G. Lafruit. Cross-based local stereo matching using orthogonal integral images. IEEE Transactions on Circuits and Systems for Video Technology, 19(7):1073-1079, 2009.
[33] J. Žbontar and Y. LeCun. Computing the stereo matching cost with a convolutional neural network. In CVPR, pages 1592-1599, 2015.