Unsupervised Deep Epipolar Flow for Stationary or Dynamic Scenes
Yiran Zhong, Pan Ji, Jianyuan Wang, Yuchao Dai, Hongdong Li
Australian National University; NEC Labs America; Northwestern Polytechnical University; ACRV; Data61 CSIRO
{yiran.zhong, hongdong.li}@anu.edu.au, [email protected], [email protected]

Abstract
Unsupervised deep learning for optical flow computation has achieved promising results. Most existing deep-net-based methods rely on image brightness consistency and local smoothness constraints to train the networks. Their performance degrades in regions where repetitive textures or occlusions occur. In this paper, we propose Deep Epipolar Flow, an unsupervised optical flow method which incorporates global geometric constraints into network learning. In particular, we investigate multiple ways of enforcing the epipolar constraint in flow estimation. To alleviate a "chicken-and-egg" type of problem encountered in dynamic scenes where multiple motions may be present, we propose a low-rank constraint as well as a union-of-subspaces constraint for training. Experimental results on various benchmarking datasets show that our method achieves competitive performance compared with supervised methods and outperforms state-of-the-art unsupervised deep-learning methods.
1. Introduction
Optical flow estimation is a fundamental problem in computer vision with many applications. Since Horn and Schunck's seminal work [14], various methods have been developed using variational optimization [2, 43, 5], energy minimization [19, 24, 32, 40], or deep learning [7, 8, 25, 33]. In this paper, we particularly tackle the problem of unsupervised optical flow learning using deep convolutional neural networks (CNNs). Compared to its supervised counterpart, unsupervised flow learning does not require ground-truth flow, which is often hard to obtain, as supervision, and can thus be applied in broader domains.

Recent research has focused on transforming traditional domain knowledge of optical flow into deep learning, in terms of either training loss formulation or network architecture design. For example, in view of brightness consistency between two consecutive images, a constraint that has been commonly used in conventional optical flow methods, researchers have formulated photometric losses [42, 31], with the help of fully differentiable image warping [15], to train deep neural networks. Other common techniques, including image pyramids [4] (to handle large flow displacements), total variation regularization [30, 37], and occlusion handling [1], have also led to either new network structures (e.g., pyramid networks [25, 33]) or new losses (e.g., smoothness losses and occlusion masks [35, 16]). In the unsupervised setting, existing methods mainly rely on the photometric loss and flow smoothness loss to train deep CNNs. This, however, makes it difficult for the networks to learn optical flow accurately in regions with repetitive textures and occlusions. Although some methods [35, 16] jointly learn occlusion masks, these masks are not meant to provide additional constraints but only to remove outliers from the losses. In light of the difficulty of learning accurate flow in these regions, in this paper we propose to incorporate global epipolar constraints into flow network training.

Leveraging epipolar geometry in flow learning, however, is not a trivial task. An inaccurate or wrong estimate of the fundamental matrix [13] would mislead the flow network training in a holistic way, and thus significantly degrade the model's prediction accuracy. This is especially true when a scene contains multiple independently moving objects, as one fundamental matrix can only describe the epipolar geometry of one rigid motion. Instead of posing a hard epipolar constraint, we propose to use soft epipolar constraints, derived from low-rankness when the scene is stationary, and from a union-of-subspaces structure when the scene is motion agnostic. We then formulate corresponding losses to train our flow networks unsupervisedly.

Our work makes an attempt towards incorporating epipolar geometry into deep unsupervised optical flow computation. Through extensive evaluations on standard datasets, we show that our method achieves competitive performance compared with supervised methods, and outperforms existing unsupervised methods by a clear margin. Specifically, as of the date of paper submission, on the KITTI and MPI Sintel benchmarks, our method achieves the best performance among published deep unsupervised optical flow methods.

2. Related Work

Optical flow estimation has been extensively studied for decades, and a significant number of papers have been published in this area.
Below we only discuss a few geometry-aware methods and recent deep-learning based methods that we consider closely related to ours.
Supervised deep optical flow.
Recently, end-to-end learning-based deep optical flow approaches have shown their superiority in learning optical flow. Given a large number of training samples, optical flow estimation is formulated as learning the regression from an image pair to the corresponding optical flow. These approaches achieve performance comparable to state-of-the-art conventional methods on several benchmarks while being significantly faster. FlowNet [7] is a pioneer in this direction, which requires a large synthetic dataset to supervise network learning. FlowNet2 [8] greatly extends FlowNet by stacking multiple encoder-decoder networks one after the other, achieving results comparable to conventional methods on various benchmarks. Recently, PWC-Net [33] combined sophisticated conventional strategies such as pyramids, warping, and cost volumes into the network design, and set the state-of-the-art performance on KITTI [12, 23] and MPI Sintel [6]. These supervised deep optical flow methods are hampered by the need for large-scale training data with ground-truth optical flow, which also limits their generalization ability.
Unsupervised deep optical flow.
Instead of using ground-truth flow as supervision, Yu et al. [42] and Ren et al. [28] suggested that, similar to conventional methods, an image warping loss can be used as the supervision signal in learning optical flow. However, there is a significant performance gap between their work and conventional methods. Simon et al. [31] analyzed this problem and introduced a bidirectional census loss to robustly handle illumination variation between frames. Concurrently, Yang et al. [35] proposed an occlusion-aware warping loss to exclude occluded points from the error computation. Very recently, Janai et al. [16] extended two-view optical flow to the multi-view case with improved occlusion handling. Introducing sophisticated occlusion estimation and warping losses reduces the performance gap between conventional methods and unsupervised ones; nevertheless, the gap remains large. To address this issue, we propose a global epipolar constraint in flow estimation that largely narrows the gap.
Geometry-aware optical flow.
Regarding the incorporation of geometric constraints, Valgaerts et al. [34] introduced a variational model to simultaneously estimate the fundamental matrix and the optical flow. Wedel et al. [36] utilized a fundamental matrix prior as a weak constraint in a variational framework. Yamaguchi et al. [39] converted the optical flow estimation task into a 1D search problem by using precomputed fundamental matrices and a small-motion assumption. These methods, however, assume that the scene is mostly rigid (so that a single fundamental matrix is sufficient to constrain the two-view geometry), and treat the dynamic parts as outliers [36]. Garg et al. [11] used a subspace constraint on multi-frame optical flow estimation as a regularization term; however, this approach assumes an affine camera model and works over entire sequences. Wulff et al. [38] used semantic information to split the scene into dynamic objects and static background, and applied strong geometric constraints only on the static background. Recently, inspired by multi-task learning, researchers have started to jointly estimate depth, camera poses, and optical flow in a unified framework [26, 41, 44]. These works mainly leverage the consistency between flows estimated by a flow network and flows computed from poses and depth. This constraint only works for stationary scenes, and their performance is only comparable with unsupervised deep optical flow methods.

By contrast, our proposed method is able to handle both stationary and dynamic scenes without explicitly computing fundamental matrices. This is achieved by introducing soft epipolar constraints derived from epipolar geometry, using low-rankness and union-of-subspaces properties. Converting these constraints into proper losses, we can exert global geometric constraints in optical flow learning and obtain much better performance.
3. Epipolar Constraints in Optical Flow
Optical flow aims at finding dense correspondences between two consecutive frames. Formally, let $I_t$ denote the image at time $t$, and $I_{t+1}$ the next image. For pixels $\mathbf{x}_i$ in $I_t$, we would like to find their correspondences $\mathbf{x}'_i$ in $I_{t+1}$. The displacement vectors $\mathbf{v} = [\mathbf{v}_1, \ldots, \mathbf{v}_N] \in \mathbb{R}^{2 \times N}$ (with $N$ the total number of pixels in $I_t$) are the optical flow we would like to estimate.

Recall that in two-view epipolar geometry [13], using homogeneous coordinates, a pair of corresponding points $\mathbf{x}'_i = (x'_i, y'_i, 1)^{\top}$ and $\mathbf{x}_i = (x_i, y_i, 1)^{\top}$ in the two frames is related by a fundamental matrix $F$,

$$\mathbf{x}_i'^{\top} F \mathbf{x}_i = 0. \qquad (1)$$

In the following sections, we show how to enforce the epipolar constraint as a global regularizer in flow learning.

Hard epipolar constraint. Given estimated optical flow $\mathbf{v}$, we can convert it to a series of correspondences $\mathbf{x}_i$ and $\mathbf{x}'_i$ in $I_t$ and $I_{t+1}$, respectively. These corresponding points can then be used to compute a fundamental matrix $F$ with the normalized eight-point method [13]. Once $F$ is estimated, we can compute its fitting error. Directly optimizing Eq. (1) is not effective, as it is only an algebraic error that does not reflect the real geometric distances. We could use the Gold Standard method [13] to compute the geometric distances, but it requires reconstructing a 3D point $\widehat{\mathbf{X}}_i$ beforehand for every point. Instead, we can use its first-order approximation, the Sampson distance $L_F$, to represent the geometric error,

$$L_F = \sum_{i=1}^{N} \frac{(\mathbf{x}_i'^{\top} F \mathbf{x}_i)^2}{(F\mathbf{x}_i)_1^2 + (F\mathbf{x}_i)_2^2 + (F^{\top}\mathbf{x}'_i)_1^2 + (F^{\top}\mathbf{x}'_i)_2^2}. \qquad (2)$$

The difficulty of optimizing this equation comes from its chicken-and-egg character: it consists of two mutually interlocked sub-problems, i.e., estimating a fundamental matrix $F$ from an estimated flow, and updating the flow to comply with that $F$. This alternating method therefore heavily relies on proper initialization.
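To make Eq. (2) concrete, here is a minimal PyTorch sketch of the Sampson-distance loss; the function name, tensor shapes, and epsilon guard are our own illustrative choices rather than code from the paper.

```python
import torch

def sampson_loss(F, x, xp, eps=1e-8):
    """Sampson-distance loss of Eq. (2).
    F: (3, 3) fundamental matrix; x, xp: (N, 3) homogeneous points."""
    Fx = x @ F.t()            # rows are (F x_i)^T
    Ftxp = xp @ F             # rows are (F^T x'_i)^T
    r = (xp * Fx).sum(dim=1)  # algebraic residual x'_i^T F x_i
    denom = Fx[:, 0]**2 + Fx[:, 1]**2 + Ftxp[:, 0]**2 + Ftxp[:, 1]**2
    return (r**2 / (denom + eps)).sum()
```

In practice, F itself would be re-estimated from the current flow (e.g., by the normalized eight-point method) at each step, which is exactly the interlocking discussed above.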
Up to now, we have only considered the static-scene scenario, where only ego-motion exists. In a multi-motion scene, this method requires estimating an $F$ for each motion, which in turn requires a motion segmentation step. It is still feasible to address this problem by iteratively solving three sub-tasks: (i) update the flow estimate; (ii) estimate $F_m$ for each rigid motion given the current motion segmentation; (iii) update the motion segmentation based on the nearest $F_m$. However, this method has several inherent limitations. First, the number of motions needs to be known a priori, which is almost impossible in general optical flow estimation. Second, the method is sensitive to the quality of the initial flow estimate and motion labels: incorrectly estimated flow may produce wrong $F_m$, which will in turn drive flow estimation towards the wrong solution, making the estimate even worse. Third, the motion segmentation step is non-differentiable, so end-to-end learning becomes impossible.

To overcome these drawbacks, we formulate two soft epipolar constraints using low-rankness and union-of-subspaces properties, and we show below that these constraints can easily be included as extra losses to regularize the network learning.

Low-rank constraint. In this section, we show that it is possible to enforce a soft epipolar constraint in a static scene without explicitly computing the fundamental matrix. Note that we can rewrite the epipolar constraint in Eq. (1) as

$$\mathbf{f}^{\top} \mathrm{vec}(\mathbf{x}'_i \mathbf{x}_i^{\top}) = 0, \qquad (3)$$

where $\mathbf{f} \in \mathbb{R}^9$ is the vectorized fundamental matrix $F$ and

$$\mathrm{vec}(\mathbf{x}'_i \mathbf{x}_i^{\top}) = (x_i x'_i,\; x_i y'_i,\; x_i,\; y_i x'_i,\; y_i y'_i,\; y_i,\; x'_i,\; y'_i,\; 1)^{\top}. \qquad (4)$$

Observe that $\mathrm{vec}(\mathbf{x}'_i \mathbf{x}_i^{\top})$ lies on a subspace (of dimension up to eight), called the epipolar subspace [17]. Let us define $\mathbf{h}_i = \mathrm{vec}(\mathbf{x}'_i \mathbf{x}_i^{\top})$. Then the data matrix $H = [\mathbf{h}_1, \ldots, \mathbf{h}_N]$ should be low-rank. This provides a possible way of regularizing optical flow estimation via rank minimization instead of explicitly computing $F$. Specifically, we can formulate a loss as

$$L_{\text{lowrank}} = \mathrm{rank}(H), \qquad (5)$$

which is unfortunately non-differentiable and thus cannot serve as a loss for flow network training. Fortunately, we can use its convex surrogate, the nuclear norm, to form a loss as

$$L^{*}_{\text{lowrank}} = \|H\|_{*}, \qquad (6)$$

where the nuclear norm $\|\cdot\|_{*}$ can be computed by performing a singular value decomposition (SVD) of $H$. Note that the SVD operation is differentiable and has been implemented in modern deep learning toolboxes such as TensorFlow and PyTorch, so this nuclear norm loss can be easily incorporated into network training. We also note that, although this low-rank constraint is derived from the epipolar geometry described by a fundamental matrix, it still applies in degenerate cases where a fundamental matrix does not exist. For example, when the motion is all zero or purely rotational, or the scene is fully planar, $H$ will have rank six; under certain special motions, e.g., an object moving parallel to the image plane, $H$ will have rank seven.

Compared with the original epipolar constraint, one may be concerned that this low-rank constraint is too loose to be effective, especially when the ambient space dimension is only nine. Although a thorough theoretical analysis is beyond the scope of this paper (interested readers may refer to literature such as [27]), we show in our experiments that this loss improves model performance by a significant margin when trained on data with mostly static scenes. However, this loss becomes ineffective when a scene has more than one motion, as the matrix $H$ will then be full-rank.

Union-of-subspaces constraint. In this section, we introduce another soft epipolar constraint, namely the union-of-subspaces constraint, which can be applied in broader cases. From Eq. (4), it is not hard to observe that the $\mathbf{h}_i$ from one rigid motion lie on a common epipolar subspace, because they all share the same fundamental matrix. When there are multiple motions in a scene, the $\mathbf{h}_i$ will lie in a union of subspaces. Note that this union-of-subspaces structure has been shown to be useful in motion segmentation from two perspective images [20]. Here, we re-formulate it for optical flow learning and come up with an effective loss using closed-form solutions.

In particular, the union-of-subspaces structure can be characterized by the self-expressiveness property [10], i.e., a data point in one subspace can be represented by a linear combination of other points from the same subspace. This translates into a mathematical optimization problem [22, 18]:

$$\min_{C} \tfrac{1}{2}\|C\|_F^2 \quad \text{s.t.} \quad H = HC, \qquad (7)$$

where $C$ is the matrix of subspace self-expression coefficients and $H$ is a matrix function of the estimated flow. Note that, in the subspace clustering literature, other norms on $C$ have also been used, e.g., the nuclear norm in [21] and the $\ell_1$ norm in [10]. We are particularly interested in the Frobenius norm regularization due to its simplicity and its equivalence to nuclear norm optimization [18], which is crucial for formulating an effective loss for CNN training.

However, in real-world scenarios, the flow estimation inevitably contains noise. We therefore relax the constraint in Eq. (7) by instead optimizing

$$L_{\text{subspace}} = \tfrac{1}{2}\|C\|_F^2 + \tfrac{\lambda}{2}\|HC - H\|_F^2. \qquad (8)$$

Instead of using an iterative solver, given an $H$, we can derive a closed-form solution for $C$ (setting the gradient $C + \lambda H^{\top}(HC - H)$ to zero), i.e.,

$$C^{*} = (I + \lambda H^{\top} H)^{-1} \lambda H^{\top} H. \qquad (9)$$

Plugging this solution for $C$ back into Eq. (8), we arrive at our final union-of-subspaces loss term, which depends only on the estimated flow:

$$L_{\text{subspace}} = \tfrac{1}{2}\big\|(I + \lambda H^{\top} H)^{-1} \lambda H^{\top} H\big\|_F^2 + \tfrac{\lambda}{2}\big\|H(I + \lambda H^{\top} H)^{-1} \lambda H^{\top} H - H\big\|_F^2. \qquad (10)$$

Directly applying this loss to the whole image would lead to GPU memory overflow due to the computation of $H^{\top}H \in \mathbb{R}^{N \times N}$ (with $N$ the number of pixels in an image). To avoid this, we employ a random sampling strategy: we sample 2000 flow vectors from a flow map and compute the loss on these samples. This strategy is valid because random sampling does not change the intrinsic character of the set.

We remark that this subspace loss requires no prior knowledge of the number of motions in a scene, so it can be used to train a flow network on a motion-agnostic dataset. In a single-motion case, it works similarly to the low-rank loss, since the optimal loss is closely related to the rank of $H$ [18]. In a multi-motion case, as long as the epipolar subspaces are disjoint and the principal angles between them are below certain thresholds [9], this loss can still serve as a global regularizer. Even when the scene is highly non-rigid or dynamic, unlike the hard epipolar constraint, this loss will not be detrimental to the network training, because it takes the same values for both ground-truth flows and wrong flows. In Fig. 1, we show the results on a typical image pair from KITTI using this constraint, demonstrating the effectiveness of our method.

Figure 1. Motion segmentation and affinity matrix (constructed from C) visualization. The scene contains three motions, annotated by three different colors: the ego-motion and the movements of the two cars. On the right, we show the affinity matrix constructed from C, which contains three diagonal blocks corresponding to these three motions. On the bottom left, we illustrate our estimated optical flow, and the top-left image shows that all three motions are correctly segmented based on C. The sparse dots on the image are the 2000 sampled points used to compute C. This demonstrates that our union-of-subspaces constraint works in multi-body scenarios.
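The two soft losses of Eqs. (4)-(10) admit an equally compact sketch. The snippet below is again an illustrative PyTorch sketch with assumed shapes (N sampled correspondences in homogeneous coordinates) and an assumed default for lambda; the flattening order differs from Eq. (4) by a fixed permutation of entries, which changes neither the rank nor the subspace structure.

```python
import torch

def epipolar_embedding(x, xp):
    """Columns h_i = vec(x'_i x_i^T) of H in R^{9 x N} (cf. Eq. (4)).
    x, xp: (N, 3) homogeneous correspondences; the flattening order
    differs from Eq. (4) by a fixed permutation, which is harmless."""
    H = torch.einsum('ni,nj->nij', xp, x).reshape(-1, 9)  # (N, 9)
    return H.t()                                          # (9, N)

def lowrank_loss(H):
    """Nuclear-norm surrogate of rank(H), Eq. (6); SVD is differentiable."""
    return torch.linalg.svdvals(H).sum()

def subspace_loss(H, lam=1.0):
    """Union-of-subspaces loss with the closed-form C* of Eq. (9)."""
    G = lam * (H.t() @ H)                                 # lam * H^T H, (N, N)
    I = torch.eye(G.shape[0], device=H.device, dtype=H.dtype)
    C = torch.linalg.solve(I + G, G)                      # (I + G)^{-1} G
    # Plug C* back into Eq. (8).
    return 0.5 * C.pow(2).sum() + 0.5 * lam * (H @ C - H).pow(2).sum()
```

Because both svdvals and solve are differentiable, the two losses back-propagate to the flow through the sampled correspondences.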
4. Unsupervised Learning of Optical Flow
We formulate our unsupervised optical flow estimation approach as the optimization of image-based losses and epipolar constraint losses. In unsupervised optical flow estimation, only the photometric loss $L_{\text{photo}}$ provides a data term. Additionally, we use a smoothness term $L_{\text{smooth}}$ and one of our epipolar constraint terms $L_{\text{geo}} \in \{L_F, L^{*}_{\text{lowrank}}, L_{\text{subspace}}\}$ as regularization terms. Our overall loss $L$ is a linear combination of these three losses,

$$L = L_{\text{photo}} + \mu_1 L_{\text{smooth}} + \mu_2 L_{\text{geo}}, \qquad (11)$$

where $\mu_1, \mu_2$ are the weights for each term; we set them empirically, with a different $\mu_2$ for each of $L_F$, $L^{*}_{\text{lowrank}}$, and $L_{\text{subspace}}$.

Photometric loss. Similarly to conventional methods, we leverage the popular brightness constancy assumption, i.e., that $I_t$ and $I_{t+1}$ should have similar pixel intensities, colors, and gradients. Our photometric error is then defined by the difference between the reference frame and the target frame warped according to the flow estimate. In [31], the authors target the case where the illumination may change from frame to frame and propose a bidirectional census transform $C(\cdot)$ to handle this situation; we adopt this idea for our photometric error. Our photometric loss is therefore a weighted sum of a pixel intensity (or color) loss $L_i$, an image gradient loss $L_g$, and a bidirectional census loss $L_c$,

$$L_{\text{photo}} = \lambda_1 L_i + \lambda_2 L_c + \lambda_3 L_g, \qquad (12)$$

where the weights are set empirically, with $\lambda_2 = \lambda_3 = 1$.

Inspired by [35], we compute our photometric loss only on non-occluded areas $O$ and normalize it by the number of non-occluded pixels. We determine whether a pixel is occluded by a forward-backward consistency check: if the sum of its forward and backward flow is above a threshold $\tau$, we set the pixel as occluded. We use $\tau = 3$ in all experiments. Our photometric loss terms are thus defined as follows:

$$L_i = \Big[\sum_{i=1}^{N} O_i \cdot \varphi\big(\hat{I}_t(\mathbf{x}_i) - I_t(\mathbf{x}_i)\big)\Big] \Big/ \sum_{i=1}^{N} O_i, \qquad (13)$$

$$L_c = \Big[\sum_{i=1}^{N} O_i \cdot \varphi\big(\hat{C}_t(\mathbf{x}_i) - C_t(\mathbf{x}_i)\big)\Big] \Big/ \sum_{i=1}^{N} O_i, \qquad (14)$$

$$L_g = \Big[\sum_{i=1}^{N} O_i \cdot \varphi\big(\nabla\hat{I}_t(\mathbf{x}_i) - \nabla I_t(\mathbf{x}_i)\big)\Big] \Big/ \sum_{i=1}^{N} O_i, \qquad (15)$$

where $\hat{I}_t(\mathbf{x}_i) = I_{t+1}(\mathbf{x}_i + \mathbf{v}_i)$ is computed through image warping with the estimated flow, and, following [35], we use a robust Charbonnier penalty $\varphi(x) = \sqrt{x^2 + \epsilon^2}$ (with a small constant $\epsilon$) to evaluate differences.

Smoothness loss. Commonly, two kinds of smoothness priors are used in conventional optical flow estimation: piecewise planar and piecewise linear. The first can be implemented by penalizing the first-order derivative of the recovered optical flow, the latter by the second-order derivative. For most rigid scenes, the piecewise planar model provides a better interpolation, but for deformable cases the piecewise linear model suits better. Therefore, we use a combination of these two models as our smoothness regularization term. We further assume that edges in the optical flow are also edges in the reference color image. Formally, our image-guided smoothness term is defined as

$$L_{\text{smooth}} = \sum \big(e^{-\alpha_1 |\nabla I|} |\nabla V| + e^{-\alpha_2 |\nabla^2 I|} |\nabla^2 V|\big) / N, \qquad (16)$$

where $\alpha_1$ and $\alpha_2$ are fixed weights set empirically, and $V \in \mathbb{R}^{W \times H \times 2}$ is the matrix form of $\mathbf{v}$.
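As one possible realization of the occlusion-masked intensity term in Eq. (13) together with the forward-backward check, the following PyTorch sketch assumes flows of shape (B, 2, H, W) (channel 0 horizontal) and images of shape (B, 3, H, W); the bilinear grid_sample warp is a common implementation choice, not necessarily the authors' exact one, and the census and gradient terms of Eqs. (14)-(15) would follow the same masking pattern.

```python
import torch
import torch.nn.functional as nnf

def charbonnier(x, eps=1e-3):
    # Robust Charbonnier penalty phi(x) = sqrt(x^2 + eps^2).
    return torch.sqrt(x * x + eps * eps)

def warp(img, flow):
    """Backward-warp img (B, C, H, W) with flow (B, 2, H, W), channel 0 = x."""
    _, _, H, W = img.shape
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing='ij')
    base = torch.stack((xs, ys)).float().to(img.device)   # (2, H, W)
    coords = base.unsqueeze(0) + flow                     # target coordinates
    # Normalize to [-1, 1] as required by grid_sample.
    gx = 2.0 * coords[:, 0] / (W - 1) - 1.0
    gy = 2.0 * coords[:, 1] / (H - 1) - 1.0
    grid = torch.stack((gx, gy), dim=3)                   # (B, H, W, 2)
    return nnf.grid_sample(img, grid, align_corners=True)

def intensity_loss(I_t, I_tp1, flow_fw, flow_bw, tau=3.0):
    # Forward-backward consistency check: a pixel is occluded when the
    # composed forward + (warped) backward flow exceeds tau.
    fb = flow_fw + warp(flow_bw, flow_fw)
    O = (fb.norm(dim=1, keepdim=True) < tau).float()      # non-occluded mask
    I_warp = warp(I_tp1, flow_fw)                         # estimated \hat{I}_t
    diff = charbonnier(I_warp - I_t).mean(dim=1, keepdim=True)
    # Masked average over non-occluded pixels only (Eq. (13)).
    return (O * diff).sum() / (O.sum() + 1e-8)
```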
5. Experiments
We evaluate our method on standard optical flow benchmarks, including KITTI [12, 23], MPI Sintel [6], Flying Chairs [7], and Middlebury [3]. We compare our results with existing optical flow estimation methods using standard metrics, i.e., the average endpoint error (EPE) and the percentage of optical flow outliers (Fl). We denote our method as EPIFlow.
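For reference, the two metrics can be computed as in the sketch below; the helper names are hypothetical, and the 3 px / 5% thresholds in Fl follow the standard KITTI 2015 outlier definition.

```python
import torch

def epe(pred, gt):
    """Average endpoint error for flows of shape (2, H, W)."""
    return (pred - gt).norm(dim=0).mean()

def fl_outlier_rate(pred, gt):
    """Percentage of outliers: EPE > 3 px and > 5% of the GT magnitude."""
    err = (pred - gt).norm(dim=0)
    mag = gt.norm(dim=0)
    return ((err > 3.0) & (err > 0.05 * mag)).float().mean() * 100.0
```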
Architecture and Parameters.
We implemented our EPIFlow network in an end-to-end manner, adopting the architecture of PWC-Net [33] as our base network due to its state-of-the-art performance. The original PWC-Net adopts a pyramid structure and learns at five different scales. However, a warping error is ineffective at low resolutions. Therefore, we pick the highest-resolution output, upsample it to the input resolution by bilinear interpolation, and compute our self-supervised learning losses only at that scale. We used a fixed learning rate for initial training (from scratch) and a lower one for fine-tuning. Depending on the resolution of the input images, the batch size is 4 or 8. We use the same data augmentation scheme as proposed in FlowNet2 [8]. Our network's typical speed varies from 0.07 to 0.25 seconds per frame during training, depending on the input image size and the losses used, and is around 0.04 seconds per frame during evaluation. The experiments were run on a regular computer equipped with a Titan XP GPU. EPIFlow is significantly faster than conventional methods.

Pre-training.
We pre-trained our network on the Flying Chairs dataset using a weighted combination of the warping loss and smoothness loss. Flying Chairs is a synthetic dataset consisting of rendered chairs superimposed on real-world Flickr images. Training on such a large-scale synthetic dataset allows the network to learn the general concepts of optical flow before handling complicated real-world conditions, e.g., changing illumination or motions. To avoid trivial solutions, we disabled the occlusion-aware term at the beginning of training (i.e., for the first two epochs); otherwise, the network would generate all-zero occlusion masks, which invalidate the losses. Pre-training took roughly forty hours, and the resulting model was used as the initial model for the other datasets.
KITTI Visual Odometry (VO) Dataset.
The KITTI VO dataset contains 22 calibrated sequences with 87,060 consecutive pairs of real-world images. The ground-truth poses of the first 11 sequences are available.

| Method | KITTI 2012 EPE(all) train / test | KITTI 2012 EPE(noc) train / test | KITTI 2015 EPE(all) train | KITTI 2015 EPE(noc) train | KITTI 2015 Fl-all test | Sintel Clean EPE(all) train / test | Sintel Final EPE(all) train / test |
|---|---|---|---|---|---|---|---|
| Non-deep: | | | | | | | |
| EpicFlow [29] | 3.47 / 3.8 | – / 1.5 | 9.27 | – | 26.29% | 2.27 / 4.11 | 3.56 / 6.29 |
| MRFlow [38] | – / – | – / – | – | – | 12.19% | (1.83) / 2.53 | (3.59) / 5.38 |
| Supervised: | | | | | | | |
| SpyNet-ft [25] | (4.13) / 4.1 | – / 2.0 | – | – | 35.07% | (3.17) / 6.64 | (4.32) / 8.36 |
| FlowNet2-ft [8] | (1.28) / 1.8 | – / 1.0 | 2.30 | – | 10.41% | (1.45) / 4.16 | (2.01) / 5.74 |
| PWC-Net [33] | 4.14 / – | – / – | 10.35 | – | – | 2.55 / – | 3.93 / – |
| PWC-Net-ft [33] | (1.45) / 1.7 | – / 0.9 | (2.16) | – | 9.60% | (1.70) / 3.86 | (2.21) / 5.17 |
| Unsupervised: | | | | | | | |
| UnsupFlownet [42] | (11.30) / 9.9 | (4.30) / 4.6 | – | – | – | – / – | – / – |
| DSTFlow-ft [28] | (10.43) / 12.4 | (3.29) / 4.0 | (16.79) | (6.96) | 39.00% | (6.16) / 10.41 | (7.38) / 11.28 |
| DF-Net-ft [44] | (3.54) / 4.4 | – / – | (8.98) | – | 25.70% | – / – | – / – |
| GeoNet [41] | – / – | – / – | 10.81 | 8.05 | – | – / – | – / – |
| UnFlow [31] | (3.29) / – | (1.26) / – | (8.10) | – | – | – / 9.38 | 7.91 / 10.21 |
| OAFlow-ft [35] | (3.55) / 4.2 | – / – | (8.88) | – | 31.20% | (4.03) / 7.95 | (5.95) / 9.15 |
| CCFlow [26] | – / – | – / – | (5.66) | – | 25.27% | – / – | – / – |
| Back2Future-ft [16] | – / – | – / – | (6.59) | (3.22) | 22.94% | (3.89) / 7.23 | (5.52) / 8.81 |
| Our-baseline | 3.23 / – | 1.04 / – | 7.93 | 4.21 | – | 6.72 / – | 7.31 / – |
| Our-gtF | 2.61 / – | 1.04 / – | 6.03 | 2.89 | – | 6.15 / – | 6.71 / – |
| Our-F | 2.56 / – | – / – | 6.42 | 3.09 | – | 6.21 / – | 6.73 / – |
| Our-low-rank | 2.63 / – | 1.07 / – | 5.91 | 3.03 | – | 6.39 / – | 6.96 / – |
| Our-sub | 2.62 / – | 1.03 / – | 6.02 | 2.98 | – | 6.15 / – | 6.83 / – |
| Our-sub-test-ft | 2.61 / ( ) | 1.03 / ( ) | 5.56 | 2.56 | ( ) | 3.94 / ( ) | 5.08 / ( ) |
| Our-sub-train-ft | ( ) / 3.4 | (0.99) / 1.3 | ( ) | ( ) | 16.95% | (3.54) / 6.84 | (4.99) / 8.33 |
Table 1. Performance comparison on the KITTI and Sintel optical flow benchmarks.
The metric EPE(noc) denotes the average endpoint error over non-occluded regions, while EPE(all) is that over all pixels. The KITTI 2015 testing benchmark evaluates results by the percentage of flow outliers (Fl). The baseline, gtF, F, low-rank, and sub models were trained on the KITTI VO dataset. Parentheses indicate results of models that were trained on the same data, and missing entries (–) indicate results that were not reported. Note that the current state-of-the-art unsupervised method, Back2Future Flow [16], uses three frames as input.

We fine-tuned our initial model on the KITTI VO dataset using various loss combinations. We chose this dataset for two reasons: (1) it provides ground-truth camera poses for every frame, which simplifies the analysis of network performance, and (2) most scenes in the KITTI VO dataset are stationary and can thus be fitted by a single ego-motion. The relative poses (between a pair of images) and the camera calibration can be used to compute fundamental matrices. To compare our various methods fairly, we use the first 11 sequences as our training set.
KITTI Optical Flow Dataset.
The KITTI optical flow dataset contains two subsets, KITTI 2012 and KITTI 2015; the first mostly contains stationary scenes, while the latter includes more dynamic scenes. KITTI 2012 provides 194 annotated image pairs for training and 195 pairs for testing, while KITTI 2015 provides 200 pairs for training and 200 pairs for testing. Our training did not use the multi-view extension images of the KITTI datasets.
MPI Sintel Dataset.
The MPI Sintel dataset provides naturalistic frames captured from an open-source movie. It contains 1041 training image pairs with ground-truth optical flow and pixel-wise occlusion masks, and also provides 552 image pairs for benchmark testing. The scenes of the MPI Sintel dataset are rendered at two levels of complexity (Clean and Final). Unlike the KITTI datasets, most scenes in the Sintel dataset are highly dynamic.
We use the suffix "-baseline" to indicate our baseline model, trained using only the photometric and smoothness losses. "-F" denotes the model trained using the hard fundamental matrix constraint with an estimated F, while "-gtF" means that we used the ground-truth fundamental matrix. "-low-rank" refers to the model applying the low-rank constraint, and "-sub" to the model using our subspace constraint. "-ft" denotes a model fine-tuned on the respective dataset.

KITTI VO training results.
We report the results of models trained on the KITTI VO dataset in Table 1, where they are compared with various state-of-the-art methods. Our methods outperform all previous learning-based unsupervised optical flow methods by a notable margin. Note that most scenes in the KITTI VO dataset are stationary, and therefore the differences between Our-gtF, Our-F, Our-low-rank, and Our-sub are small across these benchmarks.
Figure 2. Qualitative results on the KITTI 2015 test set (columns: Input, Ours, Back2Future [16], Our Error, Back2Future Error [16]). We compare our method with Back2Future Flow [16]. The second column contains the flows estimated by our Our-sub-ft model, while the third column contains the results of Back2Future Flow. Flow error visualizations are also provided, where correct estimates are depicted in blue and wrong ones in red. Consistent with the quantitative analysis, our results are visually better at structural boundaries.
Figure 3. Qualitative results on the MPI Sintel dataset. This figure shares the same layout as Fig. 2, except that the top two rows are from the Final set and the bottom two rows are from the Clean set. Errors are visualized in gray on the Sintel benchmark.
Benchmark Fine-tuning Results.
We fine-tuned our models on each benchmark and report the results with the suffix "-ft" in Table 2. For example, keeping the same hyper-parameters as before, we fine-tuned our models on the KITTI 2015 testing data. After fine-tuning, the Our-sub model shows a large performance improvement, achieving EPEs of 2.61 and 5.56 on the KITTI 2012 and KITTI 2015 training datasets, respectively, which outperforms all deep unsupervised methods and many supervised methods. Similarly, on the MPI Sintel training dataset, the Our-sub-ft model performs best among the unsupervised methods, with an EPE of 3.94 on the Clean images and 5.08 on the Final images. Furthermore, on both the KITTI and Sintel testing benchmarks, our method outperforms the current state-of-the-art unsupervised method, Back2Future Flow, by a margin: we improve the best unsupervised Fl from 22.94% to 16.24% on KITTI 2015, and the Our-sub-ft model achieves EPEs of 6.84 on the Sintel Clean set and 8.33 on the Final set, results previously unattained by unsupervised methods. Additionally, it should be noted that Back2Future Flow is based on a multi-frame formulation, while our method requires only two frames. Our model is also competitive with some fine-tuned supervised networks, such as SpyNet.

Qualitatively, as shown in Fig. 2 and Fig. 3, compared with the results of Back2Future Flow, the shapes in our estimated flows are more structured and have more explicit boundaries, which represent motion discontinuities. This trend is also apparent in the flow error images; for example, on the KITTI 2015 dataset (Fig. 2), the results of Back2Future Flow usually exhibit larger error regions, in crimson, around the objects.

It should be noted that fine-tuning on the target datasets (e.g., KITTI 2015) does not bring significant improvement, because the models were already trained on the real-world KITTI VO dataset: they have learned the general concepts of realistic optical flow, and fine-tuning merely adapts them to each dataset's characteristics. On the KITTI 2012 training set, the fine-tuned model achieves results very close to the Our-sub model, 2.61 versus 2.62 EPE. Fine-tuning on the Sintel Clean dataset improves the result from 6.15 to 3.94 EPE, because Sintel Clean renders its synthetic scenes at low complexity, and the images are quite different from the real world.
Figure 4. Endpoint-error performance of our various models on the KITTI 2015 training dataset (columns: Input, Our-baseline, Our-F, Our-low-rank, Our-sub). We compare the Our-baseline, Our-F, Our-low-rank, and Our-sub models on the KITTI 2015 dataset to analyze their performance when handling dynamic objects. The results of the Our-sub model are much better.
Figure 5. Endpoint error over epochs on the Sintel Final dataset. We illustrate the endpoint errors over the training epochs when using various combinations of constraints. All three variants start from the same pre-trained model, Our-baseline. Combining the image warping and subspace constraints outperforms the other two variants, which is consistent with the final fine-tuned results reported in Table 2.
The Our-F, Our-low-rank, and Our-sub models all work well in stationary scenes and have similar quantitative performance. To further analyze their capabilities in handling general dynamic scenarios, we fine-tuned each model on the KITTI 2015 and Sintel Final datasets. Both involve multiple motions in an image, while the Sintel scenes are more dynamic. As shown in Table 2, Our-sub handles dynamic scenarios best and achieves the lowest EPE on both benchmarks. The hard fundamental matrix constraint performs similarly to our baseline model but cannot converge on the Sintel dataset, whose EPE is reported as NaN; this is because a highly dynamic scene does not admit a single global fundamental matrix F. The performance of the low-rank constraint is not affected by dynamic objects, but it cannot gain information by modeling multiple movements either. In Fig. 5, we provide the validation error curves over the early stages of training on the Sintel Final dataset. The subspace loss helps the model converge more quickly and reach a lower error than the other methods.

| Method | KITTI 2015 EPE(all) | KITTI 2015 EPE(noc) | Sintel Final EPE(all) |
|---|---|---|---|
| Our-baseline-ft | 6.16 | 2.85 | 5.87 |
| Our-F-ft | 6.19 | 2.85 | NaN |
| Our-low-rank-ft | 5.72 | 2.62 | 5.59 |
| Our-sub-ft | 5.56 | 2.56 | 5.08 |
Table 2. Fine-tuning results on the KITTI 2015 and Sintel Final training sets.
We fine-tuned our models on the training sets of the KITTI 2015 and Sintel Final datasets. The term NaN indicates that the model did not converge.
6. Conclusion
In this paper, we have proposed effective methods to enforce global epipolar geometry constraints in unsupervised optical flow learning. For stationary scenes, we applied the low-rank constraint to regularize a globally rigid structure. For general dynamic scenes (multi-body or deformable), we proposed to use the union-of-subspaces constraint. Experiments on various benchmarking datasets have demonstrated the efficacy and superiority of our methods compared with state-of-the-art (unsupervised) deep flow methods. In the future, we plan to study the multi-frame extension, i.e., enforcing geometric constraints across multiple frames.
Acknowledgement
This research was supported in part by the Australian Centre for Robotic Vision, Data61 CSIRO, the Natural Science Foundation of China (grants 61871325, 61420106007), and the Australian Research Council (ARC) grants LE190100080, CE140100016, and DP190102261. The authors are grateful for the GPUs donated by NVIDIA.

References

[1] Luis Alvarez, Rachid Deriche, Théo Papadopoulo, and Javier Sánchez. Symmetrical dense optical flow estimation with occlusions detection. Int. J. Comp. Vis., 75(3):371-385, 2007.
[2] Gilles Aubert, Rachid Deriche, and Pierre Kornprobst. Computing optical flow via variational techniques. SIAM Journal on Applied Mathematics, 60(1):156-182, 1999.
[3] Simon Baker, Daniel Scharstein, J. P. Lewis, Stefan Roth, Michael J. Black, and Richard Szeliski. A database and evaluation methodology for optical flow. Int. J. Comp. Vis., 92(1):1-31, 2011.
[4] Jean-Yves Bouguet. Pyramidal implementation of the affine Lucas Kanade feature tracker: Description of the algorithm. Intel Corporation, 5(1-10):4, 2001.
[5] Thomas Brox, Andrés Bruhn, Nils Papenberg, and Joachim Weickert. High accuracy optical flow estimation based on a theory for warping. In Proc. Eur. Conf. Comp. Vis., pages 25-36. Springer, 2004.
[6] D. J. Butler, J. Wulff, G. B. Stanley, and M. J. Black. A naturalistic open source movie for optical flow evaluation. In Proc. Eur. Conf. Comp. Vis., pages 611-625, 2012.
[7] Alexey Dosovitskiy, Philipp Fischer, Eddy Ilg, Philip Hausser, Caner Hazirbas, Vladimir Golkov, Patrick van der Smagt, Daniel Cremers, and Thomas Brox. FlowNet: Learning optical flow with convolutional networks. In Proc. IEEE Int. Conf. Comp. Vis., pages 2758-2766, 2015.
[8] Eddy Ilg, Nikolaus Mayer, Tonmoy Saikia, Margret Keuper, Alexey Dosovitskiy, and Thomas Brox. FlowNet 2.0: Evolution of optical flow estimation with deep networks. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., 2017.
[9] Ehsan Elhamifar and René Vidal. Clustering disjoint subspaces via sparse representation. In IEEE International Conference on Acoustics, Speech and Signal Processing, pages 1926-1929. IEEE, 2010.
[10] Ehsan Elhamifar and René Vidal. Sparse subspace clustering: Algorithm, theory, and applications. IEEE Trans. Pattern Anal. Mach. Intell., 35(11):2765-2781, 2013.
[11] Ravi Garg, Luis Pizarro, Daniel Rueckert, and Lourdes Agapito. Dense multi-frame optic flow for non-rigid objects using subspace constraints. In Proc. Asian Conf. Comp. Vis., pages 460-473, 2011.
[12] Andreas Geiger, Philip Lenz, and Raquel Urtasun. Are we ready for autonomous driving? The KITTI vision benchmark suite. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., 2012.
[13] Richard Hartley and Andrew Zisserman. Multiple View Geometry in Computer Vision. Cambridge University Press, 2003.
[14] Berthold K. P. Horn and Brian G. Schunck. Determining optical flow. Artificial Intelligence, 17(1-3):185-203, 1981.
[15] Max Jaderberg, Karen Simonyan, Andrew Zisserman, et al. Spatial transformer networks. In Proc. Adv. Neural Inf. Process. Syst., pages 2017-2025, 2015.
[16] Joel Janai, Fatma Güney, Anurag Ranjan, Michael J. Black, and Andreas Geiger. Unsupervised learning of multi-frame optical flow with occlusions. In Proc. Eur. Conf. Comp. Vis., LNCS vol. 11220, pages 713-731. Springer, 2018.
[17] Pan Ji, Hongdong Li, Mathieu Salzmann, and Yiran Zhong. Robust multi-body feature tracker: A segmentation-free approach. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pages 3843-3851, 2016.
[18] Pan Ji, Mathieu Salzmann, and Hongdong Li. Efficient dense subspace clustering. In IEEE Winter Conference on Applications of Computer Vision, pages 461-468. IEEE, 2014.
[19] Vladimir Kolmogorov and Ramin Zabih. Computing visual correspondence with occlusions via graph cuts. Technical report, Ithaca, NY, USA, 2001.
[20] Zhuwen Li, Jiaming Guo, Loong-Fah Cheong, and Steven Zhiying Zhou. Perspective motion segmentation via collaborative clustering. In Proc. IEEE Int. Conf. Comp. Vis., pages 1369-1376, 2013.
[21] Guangcan Liu, Zhouchen Lin, Shuicheng Yan, Ju Sun, Yong Yu, and Yi Ma. Robust recovery of subspace structures by low-rank representation. IEEE Trans. Pattern Anal. Mach. Intell., 35(1):171-184, 2013.
[22] Can-Yi Lu, Hai Min, Zhong-Qiu Zhao, Lin Zhu, De-Shuang Huang, and Shuicheng Yan. Robust and efficient subspace segmentation via least squares regression. In Proc. Eur. Conf. Comp. Vis., pages 347-360. Springer, 2012.
[23] Moritz Menze and Andreas Geiger. Object scene flow for autonomous vehicles. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., 2015.
[24] Moritz Menze, Christian Heipke, and Andreas Geiger. Discrete optimization for optical flow. In German Conference on Pattern Recognition, pages 16-28. Springer, 2015.
[25] Anurag Ranjan and Michael J. Black. Optical flow estimation using a spatial pyramid network. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., 2017.
[26] Anurag Ranjan, Varun Jampani, Kihwan Kim, Deqing Sun, Jonas Wulff, and Michael J. Black. Adversarial collaboration: Joint unsupervised learning of depth, camera motion, optical flow and motion segmentation. arXiv preprint arXiv:1805.09806, 2018.
[27] Benjamin Recht, Weiyu Xu, and Babak Hassibi. Necessary and sufficient conditions for success of the nuclear norm heuristic for rank minimization. In IEEE Conference on Decision and Control, pages 3065-3070. IEEE, 2008.
[28] Zhe Ren, Junchi Yan, Bingbing Ni, Bin Liu, Xiaokang Yang, and Hongyuan Zha. Unsupervised deep learning for optical flow estimation. In Proc. AAAI Conf. Artificial Intelligence, volume 3, page 7, 2017.
[29] Jerome Revaud, Philippe Weinzaepfel, Zaid Harchaoui, and Cordelia Schmid. EpicFlow: Edge-preserving interpolation of correspondences for optical flow. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., 2015.
[30] Leonid I. Rudin, Stanley Osher, and Emad Fatemi. Nonlinear total variation based noise removal algorithms. Physica D: Nonlinear Phenomena, 60(1-4):259-268, 1992.
[31] Simon Meister, Junhwa Hur, and Stefan Roth. UnFlow: Unsupervised learning of optical flow with a bidirectional census loss. In Proc. AAAI Conf. Artificial Intelligence, 2018.
[32] Deqing Sun, Stefan Roth, and Michael J. Black. Secrets of optical flow estimation and their principles. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pages 2432-2439. IEEE, 2010.
[33] Deqing Sun, Xiaodong Yang, Ming-Yu Liu, and Jan Kautz. PWC-Net: CNNs for optical flow using pyramid, warping, and cost volume. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., 2018.
[34] Levi Valgaerts, Andrés Bruhn, and Joachim Weickert. A variational model for the joint recovery of the fundamental matrix and the optical flow. In Proceedings of the DAGM Symposium on Pattern Recognition, pages 314-324, 2008.
[35] Yang Wang, Yi Yang, Zhenheng Yang, Liang Zhao, Peng Wang, and Wei Xu. Occlusion aware unsupervised learning of optical flow. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., 2018.
[36] A. Wedel, D. Cremers, T. Pock, and H. Bischof. Structure- and motion-adaptive regularization for high accuracy optic flow. In Proc. IEEE Int. Conf. Comp. Vis., pages 1663-1668, 2009.
[37] Andreas Wedel, Thomas Pock, Christopher Zach, Horst Bischof, and Daniel Cremers. An improved algorithm for TV-L1 optical flow. In Statistical and Geometrical Approaches to Visual Motion Analysis, pages 23-45. Springer, 2009.
[38] Jonas Wulff, Laura Sevilla-Lara, and Michael J. Black. Optical flow in mostly rigid scenes. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., 2017.
[39] K. Yamaguchi, D. McAllester, and R. Urtasun. Robust monocular epipolar flow estimation. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pages 1862-1869, 2013.
[40] Jiaolong Yang and Hongdong Li. Dense, accurate optical flow estimation with piecewise parametric model. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pages 1019-1027, 2015.
[41] Zhichao Yin and Jianping Shi. GeoNet: Unsupervised learning of dense depth, optical flow and camera pose. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., 2018.
[42] Jason J. Yu, Adam W. Harley, and Konstantinos G. Derpanis. Back to basics: Unsupervised learning of optical flow via brightness constancy and motion smoothness. In Proc. Eur. Conf. Comp. Vis. Workshops, pages 3-10, 2016.
[43] Christopher Zach, Thomas Pock, and Horst Bischof. A duality based approach for realtime TV-L1 optical flow. In Joint Pattern Recognition Symposium, pages 214-223. Springer, 2007.
[44] Yuliang Zou, Zelun Luo, and Jia-Bin Huang. DF-Net: Unsupervised joint learning of depth and flow using cross-task consistency. In Proc. Eur. Conf. Comp. Vis., 2018.