Guided Optical Flow Learning
GGuided Optical Flow Learning
Yi Zhu Zhenzhong Lan Shawn Newsam Alexander G. Hauptmann University of California, Merced Carnegie Mellon University { yzhu25,snewsam } @ucmerced.edu { lanzhzh,alex } @cs.cmu.edu Abstract
We study the unsupervised learning of CNNs for opti-cal flow estimation using proxy ground truth data. Super-vised CNNs, due to their immense learning capacity, haveshown superior performance on a range of computer visionproblems including optical flow prediction. They howeverrequire the ground truth flow which is usually not accessi-ble except on limited synthetic data. Without the guidanceof ground truth optical flow, unsupervised CNNs often per-form worse as they are naturally ill-conditioned. We there-fore propose a novel framework in which proxy ground truthdata generated from classical approaches is used to guidethe CNN learning. The models are further refined in an un-supervised fashion using an image reconstruction loss. Ourguided learning approach is competitive with or superior tostate-of-the-art approaches on three standard benchmarksyet is completely unsupervised and can run in real time.
1. Introduction
Optical flow contains valuable information for generalimage sequence analysis due to its capability to representmotion. It is widely used in vision tasks such as humanaction recognition [18, 22, 21], semantic segmentation [8],video frame prediction [15], video object tracking etc.Classical approaches for estimating optical flow are of-ten based on a variational model and solved as an energyminimization process [11, 4, 5]. They remain top perform-ers on a number of evaluation benchmarks; however, mostof them are too slow to be used in real time applications.Due to the great success of Convolutional Neural Network(CNN), several works [7, 16] have proposed using CNNs toestimate the motion between image pairs and have achievedpromising results. Although they are much more efficientthan classical approaches, these methods require supervi-sion and cannot apply to real world data where the groundtruth is not easily accessible. Thus, some recent works[1, 20, 23] have investigated unsupervised learning throughnovel loss functions but they often perform worse than su-pervised ones. (ii) Unsupervised Reconstruction Loss(i)Average End-point Error
Figure 1. An overview of our proposed guided learning frame-work. ⊕ denotes computing the per-pixel endpoint error with re-spect to the proxy ground truth flow. (cid:46)(cid:47) represents the inversewarping and unsupervised reconstruction loss with respect to theinput image pairs. To improve the accuracy of unsupervised CNNs for opti-cal flow estimation, we propose to use the results of classi-cal methods as guidance for our unsupervised learning pro-cess. We refer to this as novel guided optical flow learn-ing as shown in Fig. 1. Specifically, there are two stages.(i) We generate proxy ground truth flow using classical ap-proaches, and then train a supervised CNN with them. (ii)We fine tune the learned models by minimizing an image re-construction loss. By training the CNNs using proxy groundtruth, we hope to provide a good initialization point for sub-sequent network learning. By fine tuning the models on tar-get datasets, we hope to overcome the risk that CNN mighthave learned the failure cases of the classical approaches.The entire learning framework is thus unsupervised.Our contributions are two-fold. First, we demonstratethat supervised CNNs can learn to estimate optical flow welleven when only guided using noisy proxy ground truth datagenerated from classical methods. Second, we show thatfine tuning the learned models for target datasets by mini-mizing a reconstruction loss further improves performance.Our proposed guided learning is completely unsupervisedand achieves competitive or superior performance to state-of-the-art, real time approaches on standard benchmarks.
2. Method
Given an adjacent frame pair I and I , our goal is tolearn a model that can estimate the per-pixel motion field ( U, V ) between the two images accurately and efficiently. a r X i v : . [ c s . C V ] J u l and V are the horizontal and vertical displacements, re-spectively. We describe our proxy ground truth guidedframework in Section 2.1, and the unsupervised fine tuningstrategy in Section 2.2. Current approaches to the supervised training of CNNsfor estimating optical flow use synthetic ground truthdatasets. These synthetic motions/scenes are quite differ-ent from real ones which limits the generalizability of thelearned models. And, even constructing synthetic datasetrequires a lot of manual effort [6]. The current largest syn-thetic datasets with dense ground truth optical flow, FlyingChairs [7] and FlyingThings3D [16], consist of only kimage pairs which is not ideal for deep learning especiallyfor such an ill-conditioned problem as motion estimation.In order for CNN-based optical flow estimation to reach itsfull potential, a learning framework is needed that can scalethe size of the training data. Unsupervised learning is oneideal way to achieve this scaling because it does not requireground truth flow.Classical approaches to optical flow estimation are un-supervised in that there is no learning process involved[11, 4, 5, 2, 12]. They only require the image pairs as in-put, with some extra assumptions (like image brightnessconstancy, gradient constancy, smoothness) and informa-tion (like motion boundaries, dense image matching). Thesenon-CNN based classical methods currently achieve thebest performance on standard benchmarks and are thus con-sidered the state-of-the-art. Inspired by their good perfor-mance, we conjecture that these approaches can be used togenerate proxy ground truth data for training CNN-basedoptical flow estimators .In this work, we choose FlowFields [2] as our classicaloptical flow estimator. To our knowledge, it is one of themost accurate flow estimators among the published work.We hope that by using FlowFields to generate proxy groundtruth, we can learn to estimate motion between image pairsas effectively as using the true ground truth.For fair comparison, we use the “FlowNet Simple” net-work as descried in [7] as our supervised CNN architecture.This allows us to compare our guided learning approach tousing the true ground truth, particularly with respect to howwell the learned models generalize to other datasets. Weuse endpoint error (EPE) as our guided loss since it is thestandard error measure for optical flow evaluation L epe = 1 N (cid:88) (cid:112) ( U − U (cid:48) ) + ( V − V (cid:48) ) , (1)where N denotes the total number of pixels in I . U and V are the proxy ground truth flow fields while U (cid:48) and V (cid:48) arethe flow estimates from the CNN. As stated in Section 1, a potential drawback to usingclassical approaches to create training data is that the qual-ity of this data will necessarily be limited by the accuracyof the estimator. If a classical approach fails to detect cer-tain motion patterns, a network trained on the proxy groundtruth is also likely to miss these patterns. This leads us toask if there is other unsupervised guidance that can improvethe network training?The unsupervised approach of [20] treats optical flow es-timation as an image reconstruction problem based on theintuition that if the estimated flow and the next frame canbe used to reconstruct the current frame then the networkhas learned useful representations of the underlying mo-tions. During training, the loss is computed as the pho-tometric error between the true current frame I and theinverse-warped next frame I (cid:48) L reconst = 1 N N (cid:88) i,j ρ ( I ( i, j ) − I (cid:48) ( i, j )) , (2)where I (cid:48) ( i, j ) = I ( i + U i,j , j + V i,j ) . The inverse warpis performed using a spatial transformer module [13] insidethe CNN. We use a robust convex error function, the gen-eralized Charbonnier penalty ρ ( x ) = ( x + (cid:15) ) α , to reducethe influence of outliers. This reconstruction loss is similarto the brightness constancy objective in classical variationalformulations but is quite different from the EPE loss in theproxy ground truth guided learning. We thus propose finetuning our model using this reconstruction loss as an addi-tional unsupervised guide.During fine tuning, the total energy we aim to minimizeis a simple weighted sum of the EPE loss and the imagereconstruction loss L ( U, V ; I , I ) = L epe + λ · L reconst , (3)where λ controls the level of reconstruction guidance. Notethat we could add additional unsupervised guides like agradient constancy assumption or an edge-aware weightedsmoothness loss [10] to further fine tune our models.An overview of our guided learning framework with boththe proxy ground truth guidance and the unsupervised finetuning is illustrated in Fig. 1.
3. Experiments
Flying Chairs [7] is a synthetic dataset designed specif-ically for training CNNs to estimate optical flow. It is cre-ated by applying affine transformations to real images andsynthetically rendered chairs. The dataset contains 22,872image pairs: 22,232 training and 640 test samples accordingto the standard evaluation split.
PI Sintel [6] is also a synthetic dataset derived from ashort open source animated 3D movie. There are 1,628frames, 1,064 for training and 564 for testing. It is the mostwidely adopted benchmark to compare optical flow estima-tors. In this work, we only report performance on its finalpass because it contains sufficiently realistic scenes includ-ing natural image degradations.
KITTI Optical Flow 2012 [9] is a real world dataset col-lected from a driving platform. It consists of 194 trainingimage pairs and 195 test pairs with sparse ground truth flow.We report the average EPE in total for the test set.We consider guided learning with and without fine tun-ing. In the no fine tuning regime, the model is trained usingthe proxy ground truth produced using a classical estimator.In the fine tuning regime, the model is first trained usingthe proxy ground truth and then fine tuned using both theproxy ground truth and the reconstruction guide. The Sinteland KITTI datasets are too small to produce enough proxyground truth to train our model from scratch so the modelsevaluated on these datasets are first pretrained on the Chairsdataset. These models are then either applied to the Sin-tel and KITTI datasets without fine tuning or are fine tunedusing the target dataset (proxy ground truth).
As shown in Fig. 1, our architecture consists of con-tractive and expanding parts. In the no fine tuning learningregime, we calculate the per-pixel EPE loss for each expan-sion. There are expansions resulting in losses. We usethe same loss weights as in [7]. The models are trained us-ing Adam optimization with the default parameter values β = 0 . and β = 0 . . The initial learning rate is setto − and divided by half every k iterations after thefirst k. We end our training at k iterations.In the fine tuning learning regime, we calculate both theEPE and reconstruction loss for each expansion. Thus thereare a total of losses. The generalized Charbonnier pa-rameter α is set to . in the reconstruction loss. λ is . .We use the default Adam optimization with a fixed learningrate of − and training is stopped at k iterations.We apply the same intensive data augmentation as in[7] to prevent over-fitting in both learning regimes. Theproxy ground truth is computed using the FlowFields binarykindly provided by authors in [2]. We have three observations given the results in Table 1.
Observation : We can use proxy ground truth generatedby state-of-the-art classical flow estimators to train CNNsfor optical flow prediction . A model trained using the Flow-Fields proxy ground truth achieves an average EPE of . on Chairs which is comparable to the . achieved by themodel trained using the true ground truth. Note that the Method Chairs Sintel KITTIFlowFields [2] .
45 5 .
81 3 . FlowNetS (Ground Truth) [7] .
71 8 .
43 9 . UnsupFlowNet [20] .
30 11 .
19 11 . FlowNetS (FlowFields) .
34 8 .
05 9 . FlowNetS (FlowFields) + Unsup .
01 7 .
96 9 . Table 1. Results reported using average EPE, lower is better. Bot-tom section shows our guided learning results, the models aretrained using the FlowFields proxy ground truth. The last rowincludes fine tuning. proxy ground truth is still quite noisy with an average EPEof . away from the true ground truth.The model trained using the FlowFields proxy groundtruth (EPE 3.34) performs worse than the FlowFields esti-mator (EPE 2.45), which is expected. This is because Flow-Fields adopts a hierarchical approach which is non-local inthe image space. It also uses dense correspondence to cap-ture image details. Thus, FlowFields itself can output crispmotion boundaries and accurate flow. However, unlike theCNN model, it cannot run in real time. Observation : Sometime, training using proxy groundtruth can generalize better than training using the trueground truth.
The model trained using the Chairs proxyground truth (computed with FlowFields) performs better(EPE 8.05) on Sintel than the model trained using the Chairstrue ground truth (EPE 8.43). We make similar observationsfor KITTI . This improved generalization might result fromover-fitting when training with the true ground truth sincethe three datasets are quite different with respect to objectand motion types. The proxy is noisier which could serveas a form of data augmentation for unseen motion types.In addition, we experiment on directly training a Sin-tel model from scratch without using the pretrained Chairsmodel. We use the same implementation details. The per-formance is about one and half pixel worse in terms of EPEthan using the pretrained model. Therefore, pretrainingCNNs on a large dataset (with either true or proxy groundtruth data) is important for optical flow estimation. Observation : Our proposed fine tuning regime improvesperformance on all three datasets.
Fine tuning results in anaverage EPE decrease from . to . for Chairs, . to . for Sintel, and . to . for KITTI. Note that an aver-age EPE of . for Chairs is very close to performance ofthe supervised model FlowNetS (EPE . ). This demon-strates that image reconstruction loss is effective as an ad-ditional unsupervised guide for motion learning. It can actlike fine tuning without requiring ground truth flow of thetarget dataset.We also investigate training a network from scratch usinga joint training regime. That is, using both L epe and L reconst ,not only using L reconst in the fine tuning stage. The per- Note that FlowNetS’s performance on KITTI (EPE 9.1) is fine tuned. mages Ground Truth UnsupFlowNet FlowNetS Ours
Figure 2. Visual examples of predicted optical flow from different methods. Top two are from Sintel, and bottom two from KITTI. formance was worse on all three benchmarks. The reasonmight be that pretraining using just the proxy ground truthprevents the model from becoming trapped in local minima.It thus can provide a good initialization for further networklearning. A joint training regime using both losses may hurtthe network’s convergence in the beginning.However, we expect unsupervised learning to bring morecomplementarity. Image reconstruction loss may not be themost appropriate guidance for learning optical flow predic-tion. We will explore how to best incorporate additionalunsupervised objectives in future work.
We compare our proposed method to recent state-of-the-art approaches. We only consider approaches that are fastbecause optical flow is often used in time sensitive applica-tions. We evaluated all CNN-based approaches on a work-station with Intel Core I7 with 4.00GHz and a Nvidia Ti-tan X GPU. For classical approaches, we just use their re-ported runtime. As shown in Table 2, our method performsthe best for Sintel even though it does not require the trueground truth for training. For Chairs, we achieve on parperformance with [7]. For KITTI, we perform inferior to[19]. This is likely because the flow in KITTI is causedpurely by the motion of the car so the segmentation into lay-ers performed in [19] helps in capturing motion boundaries.Our approach outperforms the state-of-the-art unsupervisedapproaches of [1, 20] by a large margin, thus demonstrat-ing the effectiveness of our proposed guided learning us-ing proxy ground truth and image reconstruction. Visualcomparison of Sintel and KITTI results are shown in Fig.2. We can see that UnsupFlowNet [20] is able to producereasonable flow fields estimation, but quite noisy. And itdoesn’t perform well in highly saturated and very dark re-gions. Our results are much more detailed and smootheddue to the proxy guidance and unsupervised fine tuning.
Method Chairs Sintel KITTI RuntimeEPPM [3] − .
38 9 . . PCA-Flow [19] − . . . ∗ DIS-Fast [14] − .
13 14 . . ∗ FlowNetS [7] . .
43 9 . . UnsupFlowNet [20] .
30 11 .
19 11 . . USCNN [1] − . − − Ours . . . . Table 2. State-of-the-art comparison, runtime is reported in sec-onds per frame. Top: Classical approaches. Middle: CNN-basedapproaches. Bottom: Ours. ∗ indicates the algorithm is evaluatedusing CPU, while the rest are on GPU.
4. Conclusion
We propose a guided optical flow learning frameworkwhich is unsupervised and results in an estimator that canrun in real time. We show that proxy ground truth data pro-duced using state-of-the-art classical estimators can be usedto train CNNs. This allows the training sets to scale whichis important for deep learning. We also show that trainingusing proxy ground truth can result in better generalizationthan training using the true ground truth. And, finally, wealso show that an unsupervised image reconstruction losscan provide further learning guidance.More broadly, we introduce a paradigm which can be in-tegrated into future state-of-the-art motion estimation net-works [17] to improve performance. In future work, weplan to experiment with large-scale video corpora to learnnon-rigid real world motion patterns rather than just learn-ing limited motions found in synthetic datasets.
Acknowledgements
This work was funded in part by a Na-tional Science Foundation CAREER grant,
IIS-1150115.We gratefully acknowledge NVIDIA Corporation throughthe donation of the Titan X GPU used in this work. eferences [1] A. Ahmadi and I. Patras. Unsupervised Convolutional Neu-ral Networks for Motion Estimation. In
ICIP , 2016.[2] C. Bailer, B. Taetz, and D. Stricker. Flow Fields: Dense Cor-respondence Fields for Highly Accurate Large DisplacementOptical Flow Estimation. In
ICCV , 2015.[3] L. Bao, Q. Yang, and H. Jin. Fast Edge-Preserving Patch-Match for Large Displacement Optical Flow. In
CVPR ,2014.[4] T. Brox, A. Bruhn, N. Papenberg, and J. Weickert. HighAccuracy Optical Flow Estimation Based on a Theory forWarping. In
ECCV , 2004.[5] T. Brox and J. Malik. Large Displacement Optical Flow: De-scriptor Matching in Variational Motion Estimation.
PAMI ,33:500–513, March 2011.[6] D. J. Butler, J. Wulff, G. B. Stanley, and M. J. Black. A Nat-uralistic Open Source Movie for Optical Flow Evaluation. In
ECCV , 2012.[7] A. Dosovitskiy, P. Fischer, E. Ilg, P. Hausser, C. Hazrbas,V. Golkov, P. v.d. Smagt, D. Cremers, and T. Brox. FlowNet:Learning Optical Flow with Convolutional Networks. In
ICCV , 2015.[8] K. Fragkiadaki, P. Arbelez, P. Felsen, and J. Malik. Learningto Segment Moving Objects in Videos. In
CVPR , 2015.[9] A. Geiger, P. Lenz, and R. Urtasun. Are we ready for Au-tonomous Driving? The KITTI Vision Benchmark Suite. In
CVPR , 2012.[10] C. Godard, O. Mac Aodha, and G. J. Brostow. UnsupervisedMonocular Depth Estimation with Left-Right Consistency. arXiv preprint arXiv:1609.03677 , 2016.[11] B. K. Horn and B. G. Schunck. Determining Optical Flow.
Artificial Intelligence , 17:185–203, 1981.[12] Y. Hu, R. Song, and Y. Li. Efficient Coarse-to-Fine Patch-Match for Large Displacement Optical Flow. In
CVPR ,2016.[13] M. Jaderberg, K. Simonyan, A. Zisserman, andK. Kavukcuoglu. Spatial Transformer Network. In
NIPS , 2015.[14] T. Kroeger, R. Timofte, D. Dai, and L. V. Gool. Fast OpticalFlow using Dense Inverse Searchn. In
ECCV , 2016.[15] M. Mathieu, C. Couprie, and Y. LeCun. Deep Multi-ScaleVideo Prediction beyond Mean Square Error. In
ICLR , 2016.[16] N. Mayer, E. Ilg, P. Husser, P. Fischer, D. Cremers, A. Doso-vitskiy, and T. Brox. A Large Dataset to Train ConvolutionalNetworks for Disparity, Optical Flow, and Scene Flow Esti-mation. In
CVPR , 2016.[17] A. Ranjan and M. J. Black. Optical Flow Estimation using aSpatial Pyramid Network. arXiv preprint arXiv:1611.00850 ,2016.[18] K. Simonyan and A. Zisserman. Two-Stream ConvolutionalNetworks for Action Recognition in Videos. In
NIPS , 2014.[19] J. Wulff and M. Black. Efficient Sparse-to-Dense Opti-cal Flow Estimation using a Learned Basis and Layers. In
CVPR , 2015.[20] J. J. Yu, A. W. Harley, and K. G. Derpanis. Back to Basics:Unsupervised Learning of Optical Flow via Brightness Con-stancy and Motion Smoothness. In
ECCVW , 2016. [21] Y. Zhu, Z. Lan, S. Newsam, and A. G. Hauptmann. Hid-den Two-Stream Convolutional Networks for Action Recog-nition. arXiv preprint arXiv:1704.00389 , 2017.[22] Y. Zhu and S. Newsam. Depth2Action: Exploring Embed-ded Depth for Large-Scale Action Recognition. In
ECCVWorkshop , 2016.[23] Y. Zhu and S. Newsam. DenseNet for Dense Flow. In