Guided Optical Flow Learning
Yi Zhu    Zhenzhong Lan    Shawn Newsam    Alexander G. Hauptmann
University of California, Merced        Carnegie Mellon University
{yzhu25,snewsam}@ucmerced.edu        {lanzhzh,alex}@cs.cmu.edu

Abstract
We study the unsupervised learning of CNNs for optical flow estimation using proxy ground truth data. Supervised CNNs, due to their immense learning capacity, have shown superior performance on a range of computer vision problems including optical flow prediction. They however require the ground truth flow, which is usually not accessible except on limited synthetic data. Without the guidance of ground truth optical flow, unsupervised CNNs often perform worse as they are naturally ill-conditioned. We therefore propose a novel framework in which proxy ground truth data generated from classical approaches is used to guide the CNN learning. The models are further refined in an unsupervised fashion using an image reconstruction loss. Our guided learning approach is competitive with or superior to state-of-the-art approaches on three standard benchmarks, yet is completely unsupervised and can run in real time.
1. Introduction
Optical flow contains valuable information for general image sequence analysis due to its capability to represent motion. It is widely used in vision tasks such as human action recognition [18, 22, 21], semantic segmentation [8], video frame prediction [15], video object tracking, etc. Classical approaches for estimating optical flow are often based on a variational model and solved as an energy minimization process [11, 4, 5]. They remain top performers on a number of evaluation benchmarks; however, most of them are too slow to be used in real-time applications. Due to the great success of Convolutional Neural Networks (CNNs), several works [7, 16] have proposed using CNNs to estimate the motion between image pairs and have achieved promising results. Although they are much more efficient than classical approaches, these methods require supervision and cannot be applied to real-world data where the ground truth is not easily accessible. Thus, some recent works [1, 20, 23] have investigated unsupervised learning through novel loss functions, but they often perform worse than supervised ones.
Figure 1. An overview of our proposed guided learning framework. ⊕ denotes computing the per-pixel endpoint error with respect to the proxy ground truth flow; the warping operator performs inverse warping and computes the unsupervised reconstruction loss with respect to the input image pairs.

To improve the accuracy of unsupervised CNNs for optical flow estimation, we propose to use the results of classical methods as guidance for our unsupervised learning process. We refer to this novel scheme as guided optical flow learning, as shown in Fig. 1. Specifically, there are two stages: (i) we generate proxy ground truth flow using classical approaches and then train a supervised CNN with it; (ii) we fine-tune the learned model by minimizing an image reconstruction loss. By training the CNNs using proxy ground truth, we hope to provide a good initialization point for subsequent network learning. By fine-tuning the models on target datasets, we hope to overcome the risk that the CNN might have learned the failure cases of the classical approaches. The entire learning framework is thus unsupervised.

Our contributions are two-fold. First, we demonstrate that supervised CNNs can learn to estimate optical flow well even when only guided using noisy proxy ground truth data generated from classical methods. Second, we show that fine-tuning the learned models for target datasets by minimizing a reconstruction loss further improves performance. Our proposed guided learning is completely unsupervised and achieves competitive or superior performance to state-of-the-art real-time approaches on standard benchmarks.
2. Method
Given an adjacent frame pair I_1 and I_2, our goal is to learn a model that can estimate the per-pixel motion field (U, V) between the two images accurately and efficiently. U and V are the horizontal and vertical displacements, respectively. We describe our proxy ground truth guided framework in Section 2.1 and the unsupervised fine-tuning strategy in Section 2.2.

2.1. Proxy Ground Truth Guidance

Current approaches to the supervised training of CNNs for estimating optical flow use synthetic ground truth datasets. These synthetic motions/scenes are quite different from real ones, which limits the generalizability of the learned models. And even constructing a synthetic dataset requires a lot of manual effort [6]. The current largest synthetic datasets with dense ground truth optical flow, FlyingChairs [7] and FlyingThings3D [16], consist of only about 22k image pairs each, which is not ideal for deep learning, especially for such an ill-conditioned problem as motion estimation. In order for CNN-based optical flow estimation to reach its full potential, a learning framework is needed that can scale the size of the training data. Unsupervised learning is one ideal way to achieve this scaling because it does not require ground truth flow.

Classical approaches to optical flow estimation are unsupervised in that there is no learning process involved [11, 4, 5, 2, 12]. They only require the image pairs as input, along with some extra assumptions (like image brightness constancy, gradient constancy, smoothness) and information (like motion boundaries, dense image matching). These non-CNN-based classical methods currently achieve the best performance on standard benchmarks and are thus considered the state-of-the-art. Inspired by their good performance, we conjecture that these approaches can be used to generate proxy ground truth data for training CNN-based optical flow estimators.

In this work, we choose FlowFields [2] as our classical optical flow estimator.
To our knowledge, it is one of the most accurate flow estimators among published work. We hope that by using FlowFields to generate proxy ground truth, we can learn to estimate motion between image pairs as effectively as using the true ground truth.

For fair comparison, we use the "FlowNet Simple" network as described in [7] as our supervised CNN architecture. This allows us to compare our guided learning approach to using the true ground truth, particularly with respect to how well the learned models generalize to other datasets. We use the endpoint error (EPE) as our guided loss since it is the standard error measure for optical flow evaluation:

    L_{epe} = \frac{1}{N} \sum \sqrt{(U - U')^2 + (V - V')^2},    (1)

where N denotes the total number of pixels in I_1, U and V are the proxy ground truth flow fields, and U' and V' are the flow estimates from the CNN.

2.2. Unsupervised Fine-Tuning

As stated in Section 1, a potential drawback to using classical approaches to create training data is that the quality of this data will necessarily be limited by the accuracy of the estimator. If a classical approach fails to detect certain motion patterns, a network trained on the proxy ground truth is also likely to miss these patterns. This leads us to ask whether there is other unsupervised guidance that can improve the network training.

The unsupervised approach of [20] treats optical flow estimation as an image reconstruction problem, based on the intuition that if the estimated flow and the next frame can be used to reconstruct the current frame, then the network has learned useful representations of the underlying motions. During training, the loss is computed as the photometric error between the true current frame I_1 and the inverse-warped next frame I_2':

    L_{reconst} = \frac{1}{N} \sum_{i,j} \rho\big(I_1(i,j) - I_2'(i,j)\big),    (2)

where I_2'(i,j) = I_2(i + U_{i,j}, j + V_{i,j}). The inverse warp is performed using a spatial transformer module [13] inside the CNN.
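As a concrete sketch, the two guidance signals can be written in a few lines of NumPy. This is illustrative code, not the authors' implementation; the Charbonnier constants eps and alpha here are placeholder values, and rho is the generalized Charbonnier penalty discussed in the next paragraph.

```python
import numpy as np

def epe_loss(flow, flow_proxy):
    """Eq. (1): mean endpoint error between the predicted flow and the
    proxy ground truth. Both arrays have shape (H, W, 2), holding the
    horizontal (U) and vertical (V) displacement per pixel."""
    return np.sqrt(((flow - flow_proxy) ** 2).sum(axis=-1)).mean()

def reconstruction_loss(frame1, frame2_warped, eps=1e-3, alpha=0.25):
    """Eq. (2): photometric error between the current frame and the
    inverse-warped next frame, penalized with a generalized Charbonnier
    rho(x) = (x^2 + eps^2)^alpha (eps, alpha are illustrative values)."""
    diff = frame1 - frame2_warped
    return ((diff ** 2 + eps ** 2) ** alpha).mean()
```

Note that epe_loss needs the proxy flow as a target, while reconstruction_loss only needs the image pair (plus a differentiable warp), which is what makes the second stage fully unsupervised.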
We use a robust convex error function, the generalized Charbonnier penalty \rho(x) = (x^2 + \epsilon^2)^\alpha, to reduce the influence of outliers. This reconstruction loss is similar to the brightness constancy objective in classical variational formulations but is quite different from the EPE loss in the proxy ground truth guided learning. We thus propose fine-tuning our model using this reconstruction loss as an additional unsupervised guide.

During fine-tuning, the total energy we aim to minimize is a simple weighted sum of the EPE loss and the image reconstruction loss:

    L(U, V; I_1, I_2) = L_{epe} + \lambda \cdot L_{reconst},    (3)

where \lambda controls the level of reconstruction guidance. Note that we could add additional unsupervised guides, like a gradient constancy assumption or an edge-aware weighted smoothness loss [10], to further fine-tune our models.

An overview of our guided learning framework with both the proxy ground truth guidance and the unsupervised fine-tuning is illustrated in Fig. 1.
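The fine-tuning objective in Eq. (3) can be sketched end to end as follows. This is a minimal hypothetical example for grayscale frames: the inverse warp uses bilinear sampling via SciPy's map_coordinates (the paper uses a spatial transformer module inside the CNN instead), the axis convention for (U, V) is assumed, and lam, eps, and alpha are illustrative values.

```python
import numpy as np
from scipy.ndimage import map_coordinates

def inverse_warp(frame2, flow):
    """Reconstruct frame 1 by sampling frame 2 at the flow-displaced
    positions with bilinear interpolation. Rows shift by the vertical
    flow V = flow[..., 1], columns by the horizontal U = flow[..., 0]
    (axis convention assumed for this sketch)."""
    h, w = frame2.shape
    rows, cols = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    coords = [rows + flow[..., 1], cols + flow[..., 0]]
    return map_coordinates(frame2, coords, order=1, mode="nearest")

def total_loss(flow, frame1, frame2, flow_proxy, lam=0.1,
               eps=1e-3, alpha=0.25):
    """Eq. (3): EPE guidance plus lambda-weighted reconstruction loss."""
    epe = np.sqrt(((flow - flow_proxy) ** 2).sum(axis=-1)).mean()
    warped = inverse_warp(frame2, flow)
    recon = (((frame1 - warped) ** 2 + eps ** 2) ** alpha).mean()
    return epe + lam * recon
```

In the actual network the warp must be differentiable with respect to the flow so gradients can flow back into the CNN, which is exactly what the spatial transformer module provides.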
3. Experiments
Flying Chairs [7] is a synthetic dataset designed specifically for training CNNs to estimate optical flow. It is created by applying affine transformations to real images and synthetically rendered chairs. The dataset contains 22,872 image pairs: 22,232 training and 640 test samples according to the standard evaluation split.
MPI Sintel [6] is also a synthetic dataset, derived from a short open-source animated 3D movie. There are 1,628 frames, 1,064 for training and 564 for testing. It is the most widely adopted benchmark for comparing optical flow estimators. In this work, we only report performance on its final pass because it contains sufficiently realistic scenes, including natural image degradations.
KITTI Optical Flow 2012 [9] is a real-world dataset collected from a driving platform. It consists of 194 training image pairs and 195 test pairs with sparse ground truth flow. We report the average EPE over the whole test set.

We consider guided learning with and without fine-tuning. In the no-fine-tuning regime, the model is trained using the proxy ground truth produced by a classical estimator. In the fine-tuning regime, the model is first trained using the proxy ground truth and then fine-tuned using both the proxy ground truth and the reconstruction guide. The Sintel and KITTI datasets are too small to produce enough proxy ground truth to train our model from scratch, so the models evaluated on these datasets are first pretrained on the Chairs dataset. These models are then either applied to the Sintel and KITTI datasets without fine-tuning or are fine-tuned using the target dataset (proxy ground truth).
As shown in Fig. 1, our architecture consists of contractive and expanding parts. In the no-fine-tuning learning regime, we calculate the per-pixel EPE loss for each expansion. There are 5 expansions, resulting in 5 losses. We use the same loss weights as in [7]. The models are trained using Adam optimization with the default parameter values β1 = 0.9 and β2 = 0.999. The initial learning rate is set to 10^-4 and divided by half every 100k iterations after the first 300k. We end our training at 600k iterations.

In the fine-tuning learning regime, we calculate both the EPE and reconstruction loss for each expansion, so there are a total of 10 losses. The generalized Charbonnier parameter α is set to 0.25 in the reconstruction loss, and λ is 0.1. We use the default Adam optimization with a fixed learning rate of 10^-6, and training is stopped at 10k iterations.

We apply the same intensive data augmentation as in [7] to prevent over-fitting in both learning regimes. The proxy ground truth is computed using the FlowFields binary kindly provided by the authors of [2]. We make three observations given the results in Table 1.
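The stepwise schedule for the first training regime can be sketched as a small helper. This is a hypothetical function, assuming a FlowNet-style rule (keep the base rate for an initial plateau, then halve it at fixed intervals); the exact cut-over points are assumptions, not values confirmed by the paper.

```python
def learning_rate(step, base_lr=1e-4, first=300_000, every=100_000):
    """Stepwise learning-rate schedule: keep base_lr for the first
    `first` iterations, then halve it every `every` iterations.
    The defaults mirror a FlowNet-style schedule and are assumptions."""
    if step < first:
        return base_lr
    # number of halvings that have occurred by this step
    return base_lr * 0.5 ** ((step - first) // every + 1)
```

With the defaults, the rate stays at 1e-4 until iteration 300k, drops to 5e-5, then 2.5e-5 at 400k, and so on until training ends.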
Observation 1: We can use proxy ground truth generated by state-of-the-art classical flow estimators to train CNNs for optical flow prediction. A model trained using the FlowFields proxy ground truth achieves an average EPE of 3.34 on Chairs, which is comparable to the 2.71 achieved by the model trained using the true ground truth. Note that the proxy ground truth is still quite noisy, with an average EPE of 2.45 away from the true ground truth.

Method                          Chairs   Sintel   KITTI
FlowFields [2]                   2.45     5.81     3.5
FlowNetS (Ground Truth) [7]      2.71     8.43     9.1
UnsupFlowNet [20]                5.30    11.19    11.3
FlowNetS (FlowFields)            3.34     8.05     9.7
FlowNetS (FlowFields) + Unsup    3.01     7.96     9.5

Table 1. Results reported using average EPE; lower is better. The bottom section shows our guided learning results; the models are trained using the FlowFields proxy ground truth. The last row includes fine-tuning.

The model trained using the FlowFields proxy ground truth (EPE 3.34) performs worse than the FlowFields estimator itself (EPE 2.45), which is expected. This is because FlowFields adopts a hierarchical approach which is non-local in the image space. It also uses dense correspondence to capture image details. Thus, FlowFields can output crisp motion boundaries and accurate flow. However, unlike the CNN model, it cannot run in real time.

Observation 2: Sometimes, training using proxy ground truth can generalize better than training using the true ground truth.
The model trained using the Chairs proxy ground truth (computed with FlowFields) performs better on Sintel (EPE 8.05) than the model trained using the Chairs true ground truth (EPE 8.43). We make similar observations for KITTI. This improved generalization might result from over-fitting when training with the true ground truth, since the three datasets are quite different with respect to object and motion types. The proxy is noisier, which could serve as a form of data augmentation for unseen motion types.

In addition, we experiment with directly training a Sintel model from scratch without using the pretrained Chairs model, with the same implementation details. The performance is about one and a half pixels worse in terms of EPE than using the pretrained model. Therefore, pretraining CNNs on a large dataset (with either true or proxy ground truth data) is important for optical flow estimation.

Observation 3: Our proposed fine-tuning regime improves performance on all three datasets.
Fine-tuning results in an average EPE decrease from 3.34 to 3.01 for Chairs, 8.05 to 7.96 for Sintel, and 9.7 to 9.5 for KITTI. (Note that FlowNetS's performance on KITTI (EPE 9.1) is fine-tuned.) An average EPE of 3.01 for Chairs is very close to the performance of the supervised model FlowNetS (EPE 2.71). This demonstrates that the image reconstruction loss is effective as an additional unsupervised guide for motion learning. It can act like fine-tuning without requiring ground truth flow for the target dataset.

Figure 2. Visual examples of predicted optical flow from different methods. The columns show, from left to right, the input images, ground truth, UnsupFlowNet, FlowNetS, and ours. The top two rows are from Sintel, and the bottom two from KITTI.

We also investigate training a network from scratch using a joint training regime, that is, using both L_{epe} and L_{reconst} from the start instead of only adding L_{reconst} in the fine-tuning stage. The performance was worse on all three benchmarks. The reason might be that pretraining using just the proxy ground truth prevents the model from becoming trapped in local minima; it can thus provide a good initialization for further network learning, while a joint training regime using both losses may hurt the network's convergence in the beginning.

However, we expect unsupervised learning to bring more complementarity. The image reconstruction loss may not be the most appropriate guidance for learning optical flow prediction. We will explore how to best incorporate additional unsupervised objectives in future work.
We compare our proposed method to recent state-of-the-art approaches. We only consider approaches that are fast, because optical flow is often used in time-sensitive applications. We evaluated all CNN-based approaches on a workstation with an Intel Core i7 at 4.00GHz and an Nvidia Titan X GPU. For classical approaches, we use their reported runtimes. As shown in Table 2, our method performs the best on Sintel even though it does not require the true ground truth for training. On Chairs, we achieve on-par performance with [7]. On KITTI, we perform worse than [19]. This is likely because the flow in KITTI is caused purely by the motion of the car, so the segmentation into layers performed in [19] helps in capturing motion boundaries. Our approach outperforms the state-of-the-art unsupervised approaches [1, 20] by a large margin, demonstrating the effectiveness of our proposed guided learning using proxy ground truth and image reconstruction. Visual comparisons on Sintel and KITTI are shown in Fig. 2. We can see that UnsupFlowNet [20] is able to produce reasonable flow field estimates, but they are quite noisy, and it does not perform well in highly saturated and very dark regions. Our results are much more detailed and smooth due to the proxy guidance and unsupervised fine-tuning.
Method              Chairs   Sintel   KITTI   Runtime
EPPM [3]               −       8.38     9.2     0.25
PCA-Flow [19]          −       8.65     6.2     0.19∗
DIS-Fast [14]          −      10.13    14.4     0.02∗
FlowNetS [7]          2.71     8.43     9.1     0.06
UnsupFlowNet [20]     5.30    11.19    11.3     0.06
USCNN [1]              −       8.88      −       −
Ours                  3.01     7.96     9.5     0.06

Table 2. State-of-the-art comparison; runtime is reported in seconds per frame. Top: classical approaches. Middle: CNN-based approaches. Bottom: ours. ∗ indicates the algorithm is evaluated on CPU, while the rest are on GPU.
4. Conclusion
We propose a guided optical flow learning framework which is unsupervised and results in an estimator that can run in real time. We show that proxy ground truth data produced using state-of-the-art classical estimators can be used to train CNNs. This allows the training sets to scale, which is important for deep learning. We also show that training using proxy ground truth can result in better generalization than training using the true ground truth. And, finally, we show that an unsupervised image reconstruction loss can provide further learning guidance.

More broadly, we introduce a paradigm which can be integrated into future state-of-the-art motion estimation networks [17] to improve performance. In future work, we plan to experiment with large-scale video corpora to learn non-rigid real-world motion patterns rather than just the limited motions found in synthetic datasets.
Acknowledgements
This work was funded in part by a National Science Foundation CAREER grant, IIS-1150115. We gratefully acknowledge NVIDIA Corporation for the donation of the Titan X GPU used in this work.

References

[1] A. Ahmadi and I. Patras. Unsupervised Convolutional Neural Networks for Motion Estimation. In ICIP, 2016.
[2] C. Bailer, B. Taetz, and D. Stricker. Flow Fields: Dense Correspondence Fields for Highly Accurate Large Displacement Optical Flow Estimation. In ICCV, 2015.
[3] L. Bao, Q. Yang, and H. Jin. Fast Edge-Preserving PatchMatch for Large Displacement Optical Flow. In CVPR, 2014.
[4] T. Brox, A. Bruhn, N. Papenberg, and J. Weickert. High Accuracy Optical Flow Estimation Based on a Theory for Warping. In ECCV, 2004.
[5] T. Brox and J. Malik. Large Displacement Optical Flow: Descriptor Matching in Variational Motion Estimation. PAMI, 33:500–513, March 2011.
[6] D. J. Butler, J. Wulff, G. B. Stanley, and M. J. Black. A Naturalistic Open Source Movie for Optical Flow Evaluation. In ECCV, 2012.
[7] A. Dosovitskiy, P. Fischer, E. Ilg, P. Hausser, C. Hazirbas, V. Golkov, P. v.d. Smagt, D. Cremers, and T. Brox. FlowNet: Learning Optical Flow with Convolutional Networks. In ICCV, 2015.
[8] K. Fragkiadaki, P. Arbelaez, P. Felsen, and J. Malik. Learning to Segment Moving Objects in Videos. In CVPR, 2015.
[9] A. Geiger, P. Lenz, and R. Urtasun. Are we ready for Autonomous Driving? The KITTI Vision Benchmark Suite. In CVPR, 2012.
[10] C. Godard, O. Mac Aodha, and G. J. Brostow. Unsupervised Monocular Depth Estimation with Left-Right Consistency. arXiv preprint arXiv:1609.03677, 2016.
[11] B. K. Horn and B. G. Schunck. Determining Optical Flow. Artificial Intelligence, 17:185–203, 1981.
[12] Y. Hu, R. Song, and Y. Li. Efficient Coarse-to-Fine PatchMatch for Large Displacement Optical Flow. In CVPR, 2016.
[13] M. Jaderberg, K. Simonyan, A. Zisserman, and K. Kavukcuoglu. Spatial Transformer Networks. In NIPS, 2015.
[14] T. Kroeger, R. Timofte, D. Dai, and L. Van Gool. Fast Optical Flow using Dense Inverse Search. In ECCV, 2016.
[15] M. Mathieu, C. Couprie, and Y. LeCun. Deep Multi-Scale Video Prediction beyond Mean Square Error. In ICLR, 2016.
[16] N. Mayer, E. Ilg, P. Hausser, P. Fischer, D. Cremers, A. Dosovitskiy, and T. Brox. A Large Dataset to Train Convolutional Networks for Disparity, Optical Flow, and Scene Flow Estimation. In CVPR, 2016.
[17] A. Ranjan and M. J. Black. Optical Flow Estimation using a Spatial Pyramid Network. arXiv preprint arXiv:1611.00850, 2016.
[18] K. Simonyan and A. Zisserman. Two-Stream Convolutional Networks for Action Recognition in Videos. In NIPS, 2014.
[19] J. Wulff and M. J. Black. Efficient Sparse-to-Dense Optical Flow Estimation using a Learned Basis and Layers. In CVPR, 2015.
[20] J. J. Yu, A. W. Harley, and K. G. Derpanis. Back to Basics: Unsupervised Learning of Optical Flow via Brightness Constancy and Motion Smoothness. In ECCV Workshops, 2016.
[21] Y. Zhu, Z. Lan, S. Newsam, and A. G. Hauptmann. Hidden Two-Stream Convolutional Networks for Action Recognition. arXiv preprint arXiv:1704.00389, 2017.
[22] Y. Zhu and S. Newsam. Depth2Action: Exploring Embedded Depth for Large-Scale Action Recognition. In ECCV Workshops, 2016.
[23] Y. Zhu and S. Newsam. DenseNet for Dense Flow. In ICIP, 2017.