Learning how to be robust: Deep polynomial regression
Juan-Manuel Pérez-Rúa*, Tomas Crivelli**, Patrick Bouthemy, and Patrick Pérez***
Technicolor, Cesson Sévigné, France; Inria, Centre Rennes – Bretagne Atlantique, France
* Now with Orange Labs, France. ** Now with Zowl Labs, Argentina. *** Now with Valeo.ai, France.
Abstract.
Polynomial regression is a recurrent problem with a large number of applications. In computer vision it often appears in motion analysis. Whatever the application, standard methods for regression of polynomial models tend to deliver biased results when the input data is heavily contaminated by outliers. Moreover, the problem is even harder when outliers have strong structure. Departing from problem-tailored heuristics for robust estimation of parametric models, we explore deep convolutional neural networks. Our work aims to find a generic approach for training deep regression models without the explicit need of supervised annotation. We bypass the need for a tailored loss function on the regression parameters by attaching to our model a differentiable hard-wired decoder corresponding to the polynomial operation at hand. We demonstrate the value of our findings by comparing with standard robust regression methods. Furthermore, we demonstrate how to use such models for a real computer vision problem, i.e., video stabilization. The qualitative and quantitative experiments show that neural networks are able to learn robustness for general polynomial regression, with results that clearly surpass those of traditional robust estimation methods.
Keywords:
Deep learning, polynomial regression, parametric motion model
Fitting a finite degree polynomial model to a set of measurements is a problem that appears recurrently in machine learning and computer vision [1]. It is known as polynomial fitting or polynomial regression. When the input data follow one instance of the model class, exactly or up to an additive white Gaussian noise, the optimal estimator of the polynomial coefficients is the least squares estimator (LSE). However, in very few domains would one encounter such a situation. In reality, data is usually affected not only by noise, but by non-trivial interference, blind spots (unmasked missing data), and many other types of outliers. In these scenarios, LSE is biased.

(A note on terminology: in the deep learning literature, "parameters" conventionally denotes the set of values learned during training, essentially connection weights, while the word is also sometimes used for the coefficients of a regressed polynomial. To avoid confusion with the latter meaning, we use the word "coefficients" in the first part of this manuscript and the phrase "parametric motion model" in the second part.)

Attempts to account for the wide variety of input data contamination, including structured outliers, have been proposed in the past. These include specific heuristics like random sample consensus (RANSAC) [2] or one of its many problem-specific variations. Robust statistics have enjoyed popularity among researchers as well. However, these solutions sometimes require a great deal of tuning, while still leaving room for improvement in estimation accuracy and insensitivity to structured outliers. Moreover, most of the available techniques for robust estimation rely on specific priors on the input data, for instance, the expected ratio of outliers [2] or a rough localization of them, as expressed by the alternate optimization in [3]. It is precisely with the goal of eliminating as much as possible any need for prior knowledge on the input data that we explore deep neural networks in this context. We hypothesize that the multi-scale spatial reasoning of a model empowered with stacks of convolutional layers is key towards universally robust polynomial regression.

Indeed, deep models were found to be useful in a large variety of complex regression problems [4,5,6]. The ubiquity of convolutional neural networks in these types of problems speaks of their potential for the task of polynomial regression. A particular property we are especially interested in in this paper is robustness, and how to learn to be robust. However, during supervised learning, the types of robustness a model can learn are tightly related to the examples in the training dataset. Given the difficulties that arise when collecting the large datasets that neural networks need, it is very likely that for a given problem only a small portion of those cases are covered. How to help deep models generalize to other cases is an open question. In practice, this is usually handled by randomized data augmentation [7]. Indeed, being able to generalize from the training dataset and being robust to damaged input seem to be, at least in principle, related concepts in machine learning.

Another difficult question that arises when training such models for regression problems is what the best loss function is. In particular, when regressing coefficients of a polynomial function, standard loss functions might not be optimal. This is related to the fact that, very often for some problems, a few coefficients are much larger than others, causing imbalance during training.
This might be the reason why, for optical flow, a common regression problem in computer vision [8], convolutional neural networks are trained with a loss computed on the output flow field rather than directly on motion coefficients. Our contributions are the following:
– describing a family of deep models for polynomial regression,
– defining a simple methodology for unsupervised training of polynomial regression models,
– comparing the effect of a loss function applied on the output data stemming from the estimated parametric model vs. a loss applied directly on the estimated polynomial coefficients,
– exploring the effect of robust losses during training,
– analyzing polynomial regression problems of different input data dimensionality,
– presenting a simple application to the estimation of parametric motion models and video stabilization.
We start by summarizing the related work in Section 2. Motivating ideas for our work are discussed in Section 3. We then explain our models in Section 4, and give way to the core experimental work in Section 5. Final comments are given in Section 8.
In this section we give a review of the related work. First, we start with a brief introduction to robust regression methods. We include a description and motivation of iterative methods like RANSAC and consensus-based approaches, and continue with robust estimators. Later, we explain further the problem of parametric motion model estimation, which is a form of regression often found in the computer vision literature. Finally, we introduce recent works on deep models for regression and similar tasks.

RANSAC, proposed by Fischler and Bolles in 1981 [2], is an iterative method for the alternated determination of model inliers and model parameters. It encompasses randomized sampling of the input data set, estimation of a candidate parametric model explaining the chosen subset, and determination of the proportion of data points that agree with the candidate model by using a hand-tuned threshold. The method iterates for a fixed number of iterations or until enough data points find a consensus. The randomized nature of the method implies that, for a single dataset, the results of multiple runs might differ. Furthermore, the algorithm parameters usually need to be tuned for different problems, and the method is known to be sensitive to the choice of the threshold [9]. To provide some more stability to this random heuristic, some works have focused on other ways to establish goodness of fit: least median of squares, which, even though it offers outstanding robustness, still fails when the ratio of outliers is very large; MLESAC, which maximizes the likelihood rather than the number of inliers [10]; MINPRAN [11], which makes assumptions on the randomness of the data, etc.
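For reference, the vanilla RANSAC loop described above can be sketched as follows for the polynomial case; the threshold, iteration count and consensus ratio are the hand-tuned quantities mentioned in the text, set here to illustrative values rather than anything reported by the authors.

```python
import numpy as np

def ransac_poly(x, d, degree=4, n_iter=200, threshold=0.1, min_consensus=0.5, seed=0):
    """Basic RANSAC for polynomial fitting: sample minimal subsets, fit a candidate,
    count inliers against a hand-tuned threshold, keep the best consensus set."""
    rng = np.random.default_rng(seed)
    M = np.vander(x, degree + 1, increasing=True)        # design matrix [1, x, x^2, ...]
    best_theta, best_inliers = None, 0
    for _ in range(n_iter):
        idx = rng.choice(x.size, size=degree + 1, replace=False)   # minimal random sample
        theta = np.linalg.lstsq(M[idx], d[idx], rcond=None)[0]     # candidate model
        inliers = np.abs(d - M @ theta) < threshold                # agreement test
        if inliers.sum() > best_inliers:
            best_inliers = inliers.sum()
            best_theta = np.linalg.lstsq(M[inliers], d[inliers], rcond=None)[0]
        if best_inliers >= min_consensus * x.size:                 # early stop on consensus
            break
    return best_theta
```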
Robust estimators aim to fix the bias problem of LSE by replacing the L2 norm with the L1 norm, but this change only increases robustness in the mono-dimensional case, where minimizing the L1 norm yields the estimator of the median. Truncated least-squares and least-trimmed squares are other options to replace the L2 norm.

We make a short overview of a common application of polynomial regression in computer vision: estimation of parametric motion models. This use-case is a great example of polynomial regression with strong outliers. Natural scenes can often be roughly separated into background and foreground segments. Foreground segments often include moving people, vehicles or any type of independently moving objects. When the task is estimating the dominant image motion due to camera movements, foreground segments can effectively be seen as spatially-coherent outliers. Depending on the scene, these outliers can occupy a very large portion of the image support, hindering accurate estimation. In many dynamic scene analysis building blocks, accuracy of polynomial regression is important, e.g., in motion segmentation [14,15], optical flow estimation [16,17,18,19], detection of motion anomalies [20], and tracking [21]. Classical methods pose parametric motion model estimation as an inverse problem that is solved through minimization of an energy functional [22,23]. These methods leverage the motion constraint to form a data-driven term encouraging motion parameters that minimize the displaced frame difference (DFD) between the input images. In contrast to per-pixel optical flow estimation, the estimation of parametric motion models is not an underdetermined problem. Indeed, the proposed models explain the image-based motion cues for all the image pixels at once (or a subset of them). Usually the number of observations, i.e., pixel positions, is much greater than the number of parameters of the motion model, leading to stable solutions when no motion outliers are present in the scene. However, under the presence of outliers, models that simply penalize the displaced frame intensity difference with the L2 norm deliver biased estimates.

Convolutional Neural Networks (CNNs) have started to dominate computer vision problems that had traditionally been very complicated to address with learning-based methods. This is most probably due to the higher-level features that the hierarchical CNN architectures are able to learn. One example of these problems is the estimation of optical flow. In scenes with large motion ambiguity, only semantic cues are able to recover the correct apparent motion [8,26,27,28,29]. This seems to be the reason why deep optical flow methods are currently dominating benchmarks. In Dosovitskiy et al. [8], convolutional filters successfully learn how to estimate two-dimensional motion fields from pairs of successive video frames. A few elements of this approach, coined FlowNet, have to be considered when tackling similar tasks. Performing a complex transform from 2D maps (images) to same-resolution 2D maps (optical flows) requires capturing high-level features from the data. In order for features to pick up global information over the full spatial extent of the input maps, they are implemented in a contractive fashion. Indeed, this is a very common practice in applied deep learning. A second part of the network must then take those features and expand them so that they are able to restore the spatial resolution of the output. An encoder-decoder architecture comes easily to mind. However, special attention must be paid to the motion estimation problem.
Indeed, optical flow networks must have good localization properties. Forward skip connections from contractive layers are connected through convolutions to the expanding part of FlowNet, alleviating the poor localization of deep networks and simple encoder-decoder architectures. (The use of convolutions for optical flow has a longer history. For instance, Farnebäck implemented his motion estimation method by means of separable convolutions in [30]. Weinzaepfel et al. [31] rely on a large stack of patch-based convolutional responses. To the best of our knowledge, however, the method of Dosovitskiy et al. [8] is actually the first one to use learned convolutional filters to perform the mapping between images and motion fields.)

A problem closely related to polynomial regression, geometric matching, consists of finding a parametric transformation of the image grid allowing the registration of input frames. Recently, Rocco et al. [32] proposed a neural network model that is capable of registering pairs of images that do not necessarily belong to the same image sequence. The target parametric transformations were affine and thin-plate splines [33]. In their model, the problem is divided into three tasks: symmetric feature extraction with a Siamese network initialized with VGG features [34], a dense correlation layer similar to the one used by FlowNet, and a regression layer, which infers the image grid transformation. Another regression problem that has recently been tackled by CNNs is human pose estimation. Excellent results were obtained with the so-called stacked hourglass networks (SHN) [6].
The success of deep models on the complicated tasks described in Section 2 motivates the exploration of deep models for learning how to robustly estimate parametric models. One interesting element of FlowNet [8,27] is that it was trained on a synthetic dataset called FlyingChairs. The dataset contains around 25,000 images of chairs composited onto background images extracted randomly from Flickr. The backgrounds were assigned a random rigid motion, and the foreground, composed of computer-generated chairs, another one. A simple strategy for data augmentation allows the network to generalize from that dataset to real images. The final results of FlowNet are impressive considering that the pipeline is learned in an end-to-end fashion with synthetic data, and, powered by modern GPUs, they are computed in almost real-time. The evolution of FlowNet, FlowNet 2.0 [27], ranks very highly in optical flow benchmarks. Perhaps the element introduced by FlowNet 2.0 that is most relevant to this work is curriculum learning. One of the issues of the original FlowNet is its poor behaviour for small displacements. To tackle this, FlowNet 2.0 leverages a second synthetic dataset depicting more complex motions (of smaller magnitude on average), coined FlyingThings3D. The optimal schedule for training was to first use FlyingChairs, and then FlyingThings3D. Apparently, a neural network is predisposed to learn more complex data priors when already trained for simpler ones. We will test this hypothesis for our scenario later on.

Newell et al. [6] stated the human pose estimation problem as a dense map-to-map inference problem. The important elements that allow such networks to perform so well can be summarized as follows:
– Skip layers with symmetric connections from the convolutional operations in the contractive part of the network to the upsampling layers in the expansive part of the network. This particular design essentially allows the network to be aware of global and local information at every stage of the decoding part. A single module with these design properties is called an hourglass module.
– Stacks. Stacking hourglass modules seems to allow the SHN to perform repeated top-down, bottom-up operations that might be essential for capturing different aspects of the pose estimation problem at every module.
– Residual connections. The residual connections, as introduced by [35], allow very deep models to be properly trained. Each residual module is by-passed by an identity transformation that allows gradients to flow freely through the network. A deeper understanding of residual learning can be obtained by looking at [36].
– Intermediate supervision. SHN allows intermediate outputs to be used in the training loss. This procedure guarantees that each hourglass module learns something about the pose estimation problem, and further stabilizes the overall training.
The lessons obtained from the state-of-the-art are directly leveraged by our models and experiments in the following sections.
The polynomial regression problem that we tackle is defined by an input pair $(x, d)$ of a domain vector $x = [x_1; x_2; \cdots; x_N] \in \mathbb{R}^{DN}$ and a corresponding range vector $d = [d_1; d_2; \cdots; d_N] \in \mathbb{R}^{RN}$. The dimensions $D$ and $R$ of $x_i$ and $d_i$ do not have to be the same. For instance, $x_i$ can be an image point ($D = 2$) and $d_i$ an intensity value ($R = 1$). The relationship between range and domain is assumed to follow a polynomial of given degree, $d_i^{\theta} = P_{\theta}(x_i)$, where $\theta$ is the vector of its $M$ coefficients, e.g., $M = 6$ for a two-dimensional affine transform. Rewriting this relation as a linear function of $\theta$ reads:

\[ d_i^{\theta} = M_i(x_i)\,\theta , \qquad (1) \]

where $M_i(x_i) \in \mathbb{R}^{R \times M}$ is a design matrix whose structure is maintained across the input data, but whose values are a function of the corresponding domain element $x_i$. These design matrices can be stacked into a single matrix $M(x) = [M_1(x_1); \cdots; M_N(x_N)] \in \mathbb{R}^{RN \times M}$ so that:

\[ d_x^{\theta} = M(x)\,\theta . \qquad (2) \]

Under the assumption that data only undergo additive Gaussian noise, the problem of estimating $\theta$ reduces to solving:

\[ \hat{\theta} = \arg\min_{\theta} \sum_{i=1}^{N} \big\| d_i - d_i^{\theta} \big\|^2 = \arg\min_{\theta} \big\| d - d_x^{\theta} \big\|^2 , \qquad (3) \]

from which it follows that $\hat{\theta} = ( M^{T} M )^{-1} M^{T} d$ (with $x$ hidden for the sake of conciseness). This solution corresponds to the simplest possible baseline for polynomial regression, but it is clearly biased under the presence of outliers.

To some extent, the problem of outlier removal is similar to the signal denoising problem that stacked denoising autoencoders (SDA) [37] address. (Denoising in SDA is more of a proxy task to facilitate the unsupervised learning of meaningful features from data; however, similar ideas led to a very successful method for image denoising in [38].) These encoding-decoding architectures, however, are not directly amenable to the polynomial regression problem. Indeed, the function that transforms the code into output data should not be learned. It should instead take the form of a fixed, non-trainable differentiable decoding layer. This fixed decoder is simply given by Eq. 2, i.e., a linear transform of the hypothesized polynomial coefficients $\theta$ based on a problem-specific design matrix. Assuming that a learnable encoder, composed of convolution layers, that maps the input data to the polynomial coefficient space is available, an Encoder-Fixed Decoder, or model-based auto-encoder to use the terminology in [5], can be formed.
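For concreteness, a minimal NumPy sketch of the least-squares baseline of Eqs. (1)–(3) for a scalar degree-4 polynomial ($D = 1$, $R = 1$, $M = 5$) is given below; the function names and the toy data are illustrative, not part of the original method.

```python
import numpy as np

def design_matrix(x, degree=4):
    """Stack the per-sample design matrices M_i(x_i) of Eq. (1) into M(x) (Eq. (2)).
    For a scalar polynomial, row i is [1, x_i, x_i^2, ..., x_i^degree]."""
    return np.vander(x, N=degree + 1, increasing=True)

def lse_fit(x, d, degree=4):
    """Closed-form least-squares estimate of Eq. (3): theta = (M^T M)^{-1} M^T d."""
    M = design_matrix(x, degree)
    theta, *_ = np.linalg.lstsq(M, d, rcond=None)  # numerically stabler than normal equations
    return theta

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = np.linspace(-1.0, 1.0, 200)
    theta_true = np.array([0.5, -1.0, 0.3, 2.0, -0.7])      # degree-4 coefficients
    d = design_matrix(x) @ theta_true
    d += 0.01 * rng.standard_normal(x.size)                 # Gaussian noise
    outliers = rng.random(x.size) < 0.3                     # 30% of samples corrupted
    d[outliers] += 5.0 * rng.standard_normal(outliers.sum())
    print("LSE coefficient error with outliers:",
          np.linalg.norm(lse_fit(x, d) - theta_true))       # visibly biased estimate
```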
Fig. 1. Model-based autoencoder with fixed, non-trainable decoder. The input (green) is mapped to a code which effectively becomes the coefficients of a polynomial model when passed through the fixed decoder part (navy blue).
Fig. 2. Encoder. The learnable part of our family of networks for deep polynomial regression. The intermediate outputs from each hourglass module are collected for computing the loss, together with the final output after the encoding part of our architecture.

Such a network can be trained with well-known deep learning training algorithms with a loss function acting on the output (decoded) data. Moreover, the denoising learning trick explained by [37] can be readily applied to such an architecture, as seen in Fig. 1. Granted that training pairs are composed of corrupted and clean data, such a network should be able to learn to regress while ignoring outliers, even structured ones if present in the training data. Furthermore, by means of this training, the "code" naturally corresponds to the desired polynomial parameters. An interesting element of this design is that it bypasses the polynomial coefficients themselves at the loss level, eliminating the need for tweaking specific loss functions according to the type of polynomials to be regressed. Indeed, comparing data vectors of the same domain is more straightforward.
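As an illustration of the fixed, non-trainable decoder, the following PyTorch sketch wraps Eq. (2) as a module placed on top of any encoder; the class and variable names are ours, and the small stand-in encoder is not the hourglass architecture described later.

```python
import torch
import torch.nn as nn

class FixedPolynomialDecoder(nn.Module):
    """Non-trainable decoder implementing d = M(x) theta (Eq. (2)).
    The design matrix is precomputed from the domain grid and registered as a buffer,
    so no gradients are ever taken with respect to it."""
    def __init__(self, design_matrix: torch.Tensor):
        super().__init__()
        self.register_buffer("M", design_matrix)        # shape (R*N, M), fixed

    def forward(self, theta: torch.Tensor) -> torch.Tensor:
        # theta: (batch, M) -> decoded data: (batch, R*N)
        return theta @ self.M.t()

# Toy usage: an arbitrary convolutional encoder predicting the code (coefficients),
# trained with an L2 loss on the decoded output rather than on the coefficients.
x = torch.linspace(-1, 1, 128)
M = torch.stack([x**k for k in range(5)], dim=1)        # degree-4 scalar design matrix
decoder = FixedPolynomialDecoder(M)
encoder = nn.Sequential(nn.Conv1d(1, 16, 3, padding=1), nn.ReLU(),
                        nn.AdaptiveAvgPool1d(1), nn.Flatten(), nn.Linear(16, 5))
corrupted = torch.randn(8, 1, 128)                      # placeholder corrupted input
clean = torch.randn(8, 128)                             # placeholder clean target
loss = nn.functional.mse_loss(decoder(encoder(corrupted)), clean)
loss.backward()                                         # gradients reach the encoder only
```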
Let us, for now, ignore the exact architecture of the encoder part of our family of networks. A common way to train denoising autoencoders was proposed in [37], as previously mentioned. This training trick can be categorized as an unsupervised learning method, since pairs of input images and corrupted images are constructed on the fly during training, without the need of human intervention.
In the case of polynomial regression, this leaves the door open to fully unsupervised training, which is preferable since it is the most common framework for tackling the problem. In our framework, we train our networks by providing randomly generated pairs of clean and corrupted data. The parameters of the random generation process are discussed in the supplementary material. Since every sample is generated randomly, training can encompass a very large number of iterations without affecting the generalization power of the learned model.
For the encoding part of our family of networks, we propose to use stacked hourglass modules [6]. Several ingredients of SHN seem to be well adapted to the polynomial regression problem with encoder-decoder architectures. In particular, the repeated bottom-up and top-down operations obtained by stacking residual hourglass modules fit the multi-scale processing spirit of several classical methods. On top of that, these residual modules capture scale information, which, in the opinion of the authors, is one of the fundamental elements of problem-tailored regression methods. In the experimental part of this work, we validate these claims by establishing a baseline network composed of more classical feedforward convolutional networks (i.e., purely contractive and without residual connections). As in SHN, we make use of intermediate losses at the output of each hourglass module. The output features are used in a final contractive stage to obtain the polynomial coefficients or "code" (see Fig. 2).
As previously mentioned, estimation of parametric motion models is a very good example of a polynomial regression problem with naturally strong outliers. In such a setting, a polynomial motion model for a moving scene is interpreted as the dominant scene motion stemming from camera motion. In that sense, outliers correspond to moving foreground objects, which can occupy large areas of the scene.

A common way to perform video stabilization is to compute a temporally and spatially smooth optical flow map [39]. One way to achieve this is to compute at each instant a bi-dimensional optical flow ($R = 2$) over the image domain ($D = 2$) and then fit a polynomial function to it. Given an input optical flow map $V = \{ f_x \}_{x \in \Omega}$, one can fit a polynomial function $f^{\theta}$ computable at every position $x = (x_1, x_2)$ of the image grid $\Omega$, so that:

\[ f_x^{\theta} = \begin{bmatrix} u_x^{\theta} \\ v_x^{\theta} \end{bmatrix} = M(x)\,\theta , \qquad (4) \]

where $\theta$ is a column vector containing the parameters of a polynomial motion model. Let us consider, for the sake of generality, full quadratic motion models with twelve coefficients ($M = 12$). (Any other common motion model, such as the 4-parameter affine one or the 8-parameter one corresponding to the rigid motion of a planar scene, could be considered as well.) Then, the matrix $M(x)$ in Eq. 4 takes the form:

\[ M(x) = \begin{bmatrix} 1 & x_1 & x_2 & x_1^2 & x_1 x_2 & x_2^2 & 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 & 1 & x_1 & x_2 & x_1^2 & x_1 x_2 & x_2^2 \end{bmatrix} , \qquad (5) \]

and $\theta \in \mathbb{R}^{12}$.
Table 1. Regressing scalar functions with a 4th-degree polynomial of one variable. Results for 6 different testing datasets at increasingly higher outlier ratios, with a fixed noise standard deviation of 0.01. The numbers are the mean squared error between the generated clean data and the outputs of the respective methods; the lower the number, the better the accuracy. Best results per column are in bold, second best underlined. (Columns: outlier ratio 0%, 10%, 20%, 30%, 40%, 50%, and average; rows: LSE and the compared methods. Numerical entries not reproduced here.)
Equations 4 and 5 are specializations of the previously introduced Eq. 1 for a polynomial function with a two-dimensional domain. This means that the proposed network of Fig. 1 applies directly to this problem. We explain data generation and training issues in the supplementary material. In Section 5, we report experimental results and a direct application to the problem of video stabilization.

We present two sets of experiments in Section 5.1. Their goal is to validate the design decisions explained in the previous sections, and to demonstrate that they hold even when the dimensionality of the problem changes. We start with least squares (LSE) as the simplest baseline. We also provide results with conventional robust algorithms: RANSAC, and a robust Tukey [12] estimator solved with Iteratively Re-weighted Least Squares (IRWLS), sketched below. Finally, in Section 5.2, we present a simple video stabilization pipeline based on parametric motion models regressed from optical flow maps.
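For reference, a minimal sketch of the Tukey-biweight IRWLS baseline used in the comparisons might look as follows; the tuning constant, scale estimate and iteration count are illustrative choices, not values reported by the authors.

```python
import numpy as np

def irwls_tukey(M, d, c=4.685, n_iter=20):
    """Iteratively Re-weighted Least Squares with the Tukey biweight.
    M: (N, M) design matrix, d: (N,) observations. Returns robust coefficients."""
    theta = np.linalg.lstsq(M, d, rcond=None)[0]          # LSE initialization
    for _ in range(n_iter):
        r = d - M @ theta                                  # residuals
        s = 1.4826 * np.median(np.abs(r)) + 1e-12          # robust scale (MAD)
        u = r / (c * s)
        w = np.where(np.abs(u) < 1.0, (1.0 - u**2) ** 2, 0.0)   # Tukey weights
        sw = np.sqrt(w)
        theta = np.linalg.lstsq(sw[:, None] * M, sw * d, rcond=None)[0]  # weighted LS step
    return theta
```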
Regression of scalar polynomials.
We start with a toy experiment consisting of a simple 1D regression problem (R = 1 and D = 1) with scalar polynomials of degree four. The evaluation of our framework for deep polynomial regression is split into two types of networks. First, the "Half-Nets" refer to our Encoder-Decoder networks without stacked hourglass modules. Instead, "Half-Nets" are composed only of contractive convolutions for the encoder part, with our fixed decoder on top. Furthermore, we want to determine if split training [27] presents any advantage over training with a single complex dataset. Thus, Half-Net Data 1 is trained only on a dataset encompassing functions with smaller magnitude, contaminated with little noise and with only a few outliers.
Half-Net Data 1&2 is refined in a second stage with a more complicated dataset encompassing a larger variety of generation modes, more noise, and up to 33% structured outliers. During evaluation, outlier ratios up to 50% are tested. Finally, Half-Net Data 1+2 is a network trained with a dataset encompassing both the simple and complicated generation schemes combined in a single stage. A second type of networks, encompassing SHN for the encoder part, are referred to as "Full-Nets". In the same order as for the "Half-Nets", Full-Net Data 1, Full-Net Data 1&2 and Full-Net Data 1+2 study the importance of split training. In order to determine the value of our fixed decoder as part of the network, we also train Full-Net w.o. D with a mean square loss on the produced parametric coefficients. Finally, Full-Net R.L. is trained with a robust loss. For Full-Net R.L., instead of training with the proposed dataset composed of pairs of clean and contaminated vectors, we train only with contaminated data. (The idea of this particular experiment is to assess whether, only by means of a robust loss, neural networks are able to learn to reject outliers.) The robust loss in this network is expected to reject outliers automatically, as proposed in [4]. Both Full-Net w.o. D and Full-Net R.L. are trained on the complex dataset (Data 1+2). All the results can be observed in Table 1.

A first conclusion that can be drawn from Table 1 is that, overall, neural-based regression is more stable than classical robust methods (RANSAC or IRWLS) in the face of outlier corruption. Another interesting outcome of our experimentation is that, for scalar polynomial regression, split training does not seem to provide a large gain in terms of prediction error. However, it does achieve the absolute best results in comparison to other dataset configurations. Perhaps more interestingly, all of our networks with a preceding hourglass module present overall better results than any other set-up. In particular, Full-Net Data 1&2 seems to achieve the lowest errors, except for the corruption-free case, where LSE is the optimal estimator and therefore expected to win. An interesting finding is that our denoising training effectively teaches our networks how to be robust to outliers. In contrast, when simply training with a robust function (Full-Net R.L.), results tend to be poor. This reinforces the notion that neural networks learn how to be robust more easily by example than by application of robust losses. Finally, it is worth mentioning that training through our hard-wired decoder does indeed offer a jump in accuracy of the regression. This interesting fact seems to be in agreement with very recent findings in neural estimation of image geometry [40].
Regression of 2D polynomials (parametric motion models).
To show that our findings hold for several types of polynomial regression problems, we repeat the previous experiments, this time for a polynomial regression problem of higher dimensionality (R = 2, D = 2), involving the full quadratic motion model of Eq. 5 (M = 12). The experiments are collected in Table 2, where the items have the same meaning as for Table 1.
Table 2. Regressing vector fields with a 2nd-degree polynomial of two variables. Results for 6 different testing datasets at increasingly higher outlier ratios, with a fixed noise standard deviation of 0.5. The error values are the Euclidean norm between the generated clean data and the outputs of the respective methods. Best results per column are in bold, second best underlined. (Columns: outlier ratio 0%, 10%, 20%, 30%, 40%, 50%, and average; numerical entries not reproduced here.)

From Table 2, we observe that most of the findings for scalar regression hold for the vector field case. This time,
Full-Net Data 1+2 delivers the best overall results, with an MSE score that is roughly three times lower than the baseline without fixed decoder (Full-Net w.o. D) and almost four times lower than the best configuration of networks without stacked hourglass modules (Half-Net Data 1&2). In this set-up, Full-Net Data 1+2 and Full-Net Data 1&2 do not seem to present significantly different results. Finally, it is worth mentioning that our training procedure presents training samples with at most a 30% outlier ratio, while the experiments reach up to 50%. Interestingly, our best models generalized very well to these extreme outlier conditions, achieving errors one order of magnitude lower than the classic baselines for the vector field case, and two orders of magnitude lower for the scalar function case.
We now report experiments to demonstrate that the proposed approach is relevant for real data, even if training is conducted on synthetic data only and without supervision. In particular, we take the best network from the second set of experiments in Section 5.1 and use it directly to compute parametric motion models from optical flow. We start from the popular optical flow baseline method by Brox et al. [41]. (It should be noted that one could replace this off-the-shelf optical flow method by a more precise one. Furthermore, one could connect a deep-learning-based method to ours, enabling end-to-end training from images. We leave this for future research.) Robustly fitting a quadratic motion model to an estimated, complex, non-parametric flow is very challenging due to flow inaccuracies (which can be slightly structured) and, more importantly, to the highly structured outliers that foreground objects induce.

We first start by analyzing the behavior of our model on a synthetic optical flow dataset, namely FlyingChairs, in Fig. 3. This dataset is composed of synthetically rendered chairs composited onto moving backgrounds.
Fig. 3. Visual results on the FlyingChairs dataset. A different scene is shown in each row. (a-b) The image pair; (c) the input optical flow map (ground truth); (d) output of Full-Net Data 1; (f) the corresponding result for Full-Net Data 1+2. Observe the simplification of the input optical flow into a parametric motion model that gets rid of outliers (motion of the chairs); (e) and (g): normalized per-pixel difference between the output and input flows corresponding to our model trained on the first and second datasets, respectively. Observe that the larger differences (more saturated colors) are, generally, at the pixels depicting a moving chair.
Fig. 4. Visual results on the KITTI dataset. A different scene is shown in each column: (top) input image pair; (third row) input optical flow; (fourth row) quadratic motion field obtained by our deep regression network; (fifth row) pixel-wise difference between inputs and outputs.

Our model captures what can be interpreted as the dominant scene motion, mostly corresponding to the background. Interestingly, the high-magnitude values of the difference between the input and output flow maps correspond to foreground objects, i.e., the moving chairs. In this sense, our method robustly captures global motion, showing high insensitivity to outliers. In Fig. 3, we can also see the difference between the same model trained on two different datasets (columns e and g). Evidently, Full-Net Data 1+2, being trained on a more complex outlier-generation setting, better captures global motion than Full-Net Data 1.

In a more realistic set-up, we take the KITTI dataset and compute optical flow maps between pairs of frames with [41]. From these maps, we fit quadratic motion models with our best network, resulting in the visual results of Fig. 4. The reader should notice that the simplified flow maps stemming from the fitted quadratic motion models conform well with the dominant ego-motion induced on the scene by the displacement of the embarked camera.
Fig. 5. Video stabilization results. The estimated quadratic dominant motion in a real scene is used to backwarp images for video stabilization. First row: original unstable input. Second row: stabilized images. The out-of-frame holes (black) are left so that the reader can better appreciate the motion compensation.
Video stabilization.
A common application of dominant motion estimation is video stabilization: cancelling the higher frequencies of background visual motion indeed requires an accurate estimation of this flow field. In Fig. 5 we show stabilization results with our deep polynomial regression method on a sequence undergoing camera shake caused by strong wind. Observe that the image sequences are correctly warped to compensate for the dominant motion. As a further improvement, we smooth pixel profiles in a similar fashion to [39] to achieve better temporal smoothness. Our algorithm delivers high-quality results, considering that we did not introduce any higher-level considerations specific to the video stabilization problem.
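To make the backwarping step concrete, here is a small sketch of how a frame can be compensated with an estimated 12-parameter quadratic motion model; the coordinate and warping conventions (pixel coordinates, backward warping with OpenCV remap) are our assumptions, not the authors' implementation.

```python
import numpy as np
import cv2

def quadratic_flow(theta, h, w):
    """Evaluate the 12-parameter quadratic motion model of Eq. (5) on an h x w grid.
    theta[:6] parametrizes u and theta[6:] parametrizes v over [1, x1, x2, x1^2, x1*x2, x2^2]."""
    x2, x1 = np.mgrid[0:h, 0:w].astype(np.float32)      # x1: column index, x2: row index
    basis = np.stack([np.ones_like(x1), x1, x2, x1**2, x1 * x2, x2**2], axis=-1)
    u = basis @ theta[:6]
    v = basis @ theta[6:]
    return u.astype(np.float32), v.astype(np.float32)

def stabilize(frame, theta):
    """Backwarp a frame with the estimated dominant motion to cancel it."""
    h, w = frame.shape[:2]
    u, v = quadratic_flow(theta, h, w)
    x2, x1 = np.mgrid[0:h, 0:w].astype(np.float32)
    # Sample the frame at the positions displaced by the dominant motion.
    return cv2.remap(frame, x1 + u, x2 + v, interpolation=cv2.INTER_LINEAR)
```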
In the following paragraphs we give important details of the models used during the experimental part of our work.
The problem we tackle in Section 5.1 is the regression of 4th-degree polynomial coefficients. This set-up is amenable to the multi-scale, repetitive processing of stacked hourglass networks. However, we replace all the 2D convolutions by 1D convolutions on account of the input data structure.
Hourglass modules.
Each convolutional layer in the reductive part of an hourglass module is formed by 32 convolutional kernels of size three. At the output of each of these convolutional layers, a sequence of three operations is applied: ReLU, 1D max pooling with stride 2, and 1D batch normalization. Furthermore, after the ReLU operations, forward skip connections through convolutions of the same size are connected to bilinear upsampling operations in the enlarging part of the hourglass modules. For this experiment we found that stacking more than two hourglass modules did not improve performance much further.
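As a rough illustration of this description, a minimal 1D hourglass module in PyTorch could be organized as follows; the depth, exact skip-connection wiring and normalization placement are our assumptions where the text leaves them open.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Hourglass1D(nn.Module):
    """One 1D hourglass module: contractive path (conv, ReLU, stride-2 max pooling,
    batch norm), then upsampling merged with skip connections taken after each ReLU."""
    def __init__(self, channels=32, depth=3):
        super().__init__()
        self.depth = depth
        self.down = nn.ModuleList([nn.Conv1d(channels, channels, 3, padding=1) for _ in range(depth)])
        self.bn = nn.ModuleList([nn.BatchNorm1d(channels) for _ in range(depth)])
        self.skip = nn.ModuleList([nn.Conv1d(channels, channels, 3, padding=1) for _ in range(depth)])
        self.up = nn.ModuleList([nn.Conv1d(channels, channels, 3, padding=1) for _ in range(depth)])

    def forward(self, x):
        skips = []
        for i in range(self.depth):
            x = F.relu(self.down[i](x))
            skips.append(self.skip[i](x))             # skip branch taken after the ReLU
            x = self.bn[i](F.max_pool1d(x, 2))        # stride-2 pooling, then batch norm
        for i in reversed(range(self.depth)):
            x = F.interpolate(x, size=skips[i].shape[-1], mode="linear", align_corners=False)
            x = F.relu(self.up[i](x) + skips[i])      # upsample and merge with the skip
        return x

# Two stacked modules, as in the experiments reported above (a stem conv mapping the
# 1-channel input to 32 channels would precede this trunk).
encoder_trunk = nn.Sequential(Hourglass1D(), Hourglass1D())
```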
Model-based autoencoder.
We now explain the model used for the experiments of Section 5.2. In line with the paper's intention of providing a general model for regression problems, we use a model that is identical to the one used for scalar regression, with only a few minor differences. Thus, each 1D convolution of the previous model is replaced by a 2D convolution with twice as many planes. Furthermore, the fixed decoder of the model-based autoencoder is replaced to match the new polynomial operation.
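A sketch, under our own assumptions about coefficient ordering, of the corresponding fixed decoder matrix for the 12-parameter quadratic model of Eq. (5), built once over the image grid:

```python
import torch

def quadratic_design_matrix(h, w):
    """Stacked design matrix M(x) of Eq. (5) for the full quadratic motion model:
    returns a (2*h*w, 12) tensor mapping theta in R^12 to a flattened (u, v) field."""
    x2, x1 = torch.meshgrid(torch.arange(h, dtype=torch.float32),
                            torch.arange(w, dtype=torch.float32), indexing="ij")
    basis = torch.stack([torch.ones_like(x1), x1, x2, x1**2, x1 * x2, x2**2], dim=-1).reshape(-1, 6)
    zeros = torch.zeros_like(basis)
    row_u = torch.cat([basis, zeros], dim=1)   # u depends on theta[:6]
    row_v = torch.cat([zeros, basis], dim=1)   # v depends on theta[6:]
    return torch.cat([row_u, row_v], dim=0)    # shape (2*h*w, 12)

# theta (batch, 12) -> flattened flow (batch, 2*h*w); can be registered as the buffer
# of a fixed decoder of the kind sketched earlier.
M = quadratic_design_matrix(64, 64)
flow = torch.randn(4, 12) @ M.t()
```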
Two dataset generators are used for training our regression models, namely Data 1 and Data 2. Each of them is composed of pairs of "input" and "output" samples. The "output" samples are clean arrays of values stemming from a polynomial model, while the "input" samples are the corresponding contaminated maps. Each pair simulates a supervised training pair, but it is generated online, at random. See Fig. 6 for an illustration of data generated for the vector field case.

The main difference between Data 1 and Data 2 is the intensity of the contamination. The idea is that Data 1 allows neural networks to learn the basic operations for regression without many distractions, while a subsequent dataset refines the network for handling more complex contamination. Thus, we set the maximum outlier ratio for Data 1 to 0.1, while we set it to 0.3 for Data 2. Gaussian noise is set to 0.1 for Data 1 and 0.5 for Data 2. The structured outliers correspond to randomly selected polygonal supports encompassing another randomly sampled polynomial model.
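For illustration, one way such pairs might be generated on the fly for the scalar case, with the stated noise level and outlier ratio of Data 2 and a simple contiguous interval standing in for the polygonal supports (an assumption on our part), is:

```python
import numpy as np

def sample_pair(rng, n=128, degree=4, max_outlier_ratio=0.3, noise_std=0.5):
    """Generate one (corrupted input, clean output) pair, as in the Data 2 regime.
    Structured outliers replace a random contiguous support with another polynomial."""
    x = np.linspace(-1.0, 1.0, n)
    basis = np.vander(x, degree + 1, increasing=True)
    theta = rng.standard_normal(degree + 1)
    clean = basis @ theta
    corrupted = clean + noise_std * rng.standard_normal(n)       # Gaussian noise
    ratio = rng.uniform(0.0, max_outlier_ratio)                  # outlier ratio for this sample
    length = int(ratio * n)
    if length > 0:
        start = rng.integers(0, n - length)                      # contiguous (structured) support
        other = basis @ rng.standard_normal(degree + 1)          # a second, outlier polynomial
        corrupted[start:start + length] = other[start:start + length]
    return corrupted, clean

rng = np.random.default_rng(0)
inp, out = sample_pair(rng)    # one training pair, regenerated at every iteration
```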
In this paper we have proposed a neural approach to robust polynomial regres-sion. We carried out an experimental study spanning several important hints et al.
Fig. 6. Randomly generated training pairs. Top row: contaminated input. Bottom row: clean output.

We have provided a general class of architectures that learn to deal with outliers effectively. In particular, the spatially-consistent outliers that have been an important problem for classical polynomial regression methods can be handled by properly trained and designed neural nets. Moreover, we have shown that these findings hold for different settings of polynomial regression. The proposed architecture captures two important common design strategies for polynomial regression: spatial awareness (effective through convolutions) and multi-scale processing (through stacked hourglass modules). This design, in conjunction with stacked-denoising-autoencoder training with simulated outliers, results in models that robustly learn how to handle largely contaminated data. Furthermore, networks trained with purely synthetic data are able to generalize to real data, leading to very good accuracy for global motion estimation and effective use for video stabilization. We expect that our findings can be generalized to other types of regression problems.
References
1. Meer, P., Mintz, D., Rosenfeld, A., Kim, D.Y.: Robust regression methods for computer vision: A review. International Journal of Computer Vision (1) (1991) 59–70
2. Fischler, M.A., Bolles, R.C.: Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography. Communications of the ACM (6) (1981) 381–395
3. Puy, G., Vandergheynst, P.: Robust image reconstruction from multiview measurements. SIAM Journal on Imaging Sciences (1) (2014) 128–156
4. Belagiannis, V., Rupprecht, C., Carneiro, G., Navab, N.: Robust optimization for deep regression. In: Proceedings of the IEEE International Conference on Computer Vision (2015) 2830–2838
5. Tewari, A., Zollhöfer, M., Kim, H., Garrido, P., Bernard, F., Pérez, P., Theobalt, C.: MoFA: Model-based deep convolutional face autoencoder for unsupervised monocular reconstruction. In: International Conference on Computer Vision (2017)
6. Newell, A., Yang, K., Deng, J.: Stacked hourglass networks for human pose estimation. In: European Conference on Computer Vision, Springer (2016) 483–499
7. Ciregan, D., Meier, U., Schmidhuber, J.: Multi-column deep neural networks for image classification. In: Computer Vision and Pattern Recognition (CVPR), IEEE (2012) 3642–3649
8. Dosovitskiy, A., Fischer, P., Ilg, E., Hausser, P., Hazirbas, C., Golkov, V., van der Smagt, P., Cremers, D., Brox, T.: FlowNet: Learning optical flow with convolutional networks. In: International Conference on Computer Vision (2015) 2758–2766
9. Torr, P.H., Murray, D.W.: The development and comparison of robust methods for estimating the fundamental matrix. International Journal of Computer Vision (3) (1997) 271–300
10. Torr, P.H., Zisserman, A.: MLESAC: A new robust estimator with application to estimating image geometry. Computer Vision and Image Understanding (1) (2000) 138–156
11. Stewart, C.V.: MINPRAN: A new robust estimator for computer vision. Transactions on Pattern Analysis and Machine Intelligence (10) (1995) 925–938
12. Huber, P.J.: Robust statistics. In: International Encyclopedia of Statistical Science. Springer (2011) 1248–1251
13. Holland, P.W., Welsch, R.E.: Robust regression using iteratively reweighted least-squares. Communications in Statistics – Theory and Methods (9) (1977) 813–827
14. Odobez, J., Bouthemy, P.: Separation of moving regions from background in an image sequence acquired with a mobile camera. In: Video Data Compression for Multimedia Computing. Springer (1997) 283–311
15. Cremers, D., Soatto, S.: Motion competition: A variational approach to piecewise parametric motion segmentation. International Journal of Computer Vision (3) (2005) 249–265
16. Black, M.J., Jepson, A.D.: Estimating optical flow in segmented images using variable-order parametric models with local deformations. Transactions on Pattern Analysis and Machine Intelligence (10) (1996) 972–986
17. Farnebäck, G.: Two-frame motion estimation based on polynomial expansion. In: Image Analysis. Springer (2003) 363–370
18. Fortun, D., Bouthemy, P., Kervrann, C.: Aggregation of local parametric candidates with exemplar-based occlusion handling for optical flow. Computer Vision and Image Understanding (2015) 1–18
19. Yang, J., Li, H.: Dense, accurate optical flow estimation with piecewise parametric model. In: Computer Vision and Pattern Recognition (2015)
20. Pérez-Rúa, J.M., Basset, A., Bouthemy, P.: Detection and localization of anomalous motion in video sequences from local histograms of labeled affine flows. Frontiers in ICT (2017) 10
21. Black, M.J., Yacoob, Y.: Tracking and recognizing rigid and non-rigid facial motions using local parametric models of image motion. In: International Conference on Computer Vision, IEEE (1995) 374–381
22. Black, M.J., Anandan, P.: The robust estimation of multiple motions: Parametric and piecewise-smooth flow fields. Computer Vision and Image Understanding (1) (1996) 75–104
23. Bergen, J., Anandan, P., Hanna, K., Hingorani, R.: Hierarchical model-based motion estimation. In: European Conference on Computer Vision, Springer (1992) 237–252
24. Odobez, J.M., Bouthemy, P.: Robust multiresolution estimation of parametric motion models. Journal of Visual Communication and Image Representation (4) (1995) 348–365
25. Senst, T., Eiselein, V., Sikora, T.: Robust local optical flow for feature tracking. Transactions on Circuits and Systems for Video Technology (9) (2012) 1377–1387
26. Thewlis, J., Zheng, S., Torr, P.H., Vedaldi, A.: Fully-trainable deep matching. British Machine Vision Conference (2016)
27. Ilg, E., Mayer, N., Saikia, T., Keuper, M., Dosovitskiy, A., Brox, T.: FlowNet 2.0: Evolution of optical flow estimation with deep networks. Computer Vision and Pattern Recognition (2017)
28. Bailer, C., Varanasi, K., Stricker, D.: CNN-based patch matching for optical flow with thresholded hinge embedding loss. In: Computer Vision and Pattern Recognition (2017)
29. Sun, D., Yang, X., Liu, M.Y., Kautz, J.: PWC-Net: CNNs for optical flow using pyramid, warping, and cost volume. arXiv preprint arXiv:1709.02371 (2017)
30. Farnebäck, G.: Fast and accurate motion estimation using orientation tensors and parametric motion models. In: International Conference on Pattern Recognition, Volume 1, IEEE (2000) 135–139
31. Weinzaepfel, P., Revaud, J., Harchaoui, Z., Schmid, C.: DeepFlow: Large displacement optical flow with deep matching. In: International Conference on Computer Vision (2013) 1385–1392
32. Rocco, I., Arandjelović, R., Sivic, J.: Convolutional neural network architecture for geometric matching. In: Computer Vision and Pattern Recognition, IEEE (2017)
33. Bookstein, F.L.: Principal warps: Thin-plate splines and the decomposition of deformations. Transactions on Pattern Analysis and Machine Intelligence (6) (1989) 567–585
34. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014)
35. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Computer Vision and Pattern Recognition (2016) 770–778
36. He, K., Zhang, X., Ren, S., Sun, J.: Identity mappings in deep residual networks. In: European Conference on Computer Vision, Springer (2016) 630–645
37. Vincent, P., Larochelle, H., Lajoie, I., Bengio, Y., Manzagol, P.A.: Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. The Journal of Machine Learning Research 11