Learning how to be robust: Deep polynomial regression
Juan-Manuel Pérez-Rúa*, Tomas Crivelli**, Patrick Bouthemy, and Patrick Pérez***
Technicolor, Cesson Sévigné, France; Inria, Centre Rennes – Bretagne Atlantique, France
* Now with Orange Labs, France. ** Now with Zowl Labs, Argentina. *** Now with Valeo.ai, France.
Abstract.
Polynomial regression is a recurrent problem with a large number of applications. In computer vision it often appears in motion analysis. Whatever the application, standard methods for regression of polynomial models tend to deliver biased results when the input data is heavily contaminated by outliers. Moreover, the problem is even harder when outliers have strong structure. Departing from problem-tailored heuristics for robust estimation of parametric models, we explore deep convolutional neural networks. Our work aims to find a generic approach for training deep regression models without the explicit need of supervised annotation. We bypass the need for a tailored loss function on the regression parameters by attaching to our model a differentiable hard-wired decoder corresponding to the polynomial operation at hand. We demonstrate the value of our findings by comparing with standard robust regression methods. Furthermore, we demonstrate how to use such models for a real computer vision problem, i.e., video stabilization. The qualitative and quantitative experiments show that neural networks are able to learn robustness for general polynomial regression, with results that clearly surpass those of traditional robust estimation methods.
Keywords:
Deep learning, polynomial regression, parametric motion model
Fitting a finite degree polynomial model to a set of measurements is a problem that appears recurrently in machine learning and computer vision [1]. It is known as polynomial fitting or polynomial regression. When the input data follow one instance of the model class, exactly or up to an additive white Gaussian noise, the optimal estimator of the polynomial coefficients is the least squares estimator (LSE). However, in very few domains would one encounter such a situation. In reality, data is usually affected not only by noise, but by non-trivial interference, blind spots (unmasked missing data), and many other types of outliers. In these scenarios, LSE is biased.

(A note on terminology: in the deep learning literature, "parameters" conventionally denotes the set of values learned during training, essentially connection weights, while the word is also sometimes used for the coefficients of a regressed polynomial. To avoid confusion with the latter meaning, we use the word "coefficients" in the first part of this manuscript and the phrase "parametric motion model" in the second part.)

Attempts to account for the wide variety of input data contamination, including structured outliers, have been proposed in the past. These include specific heuristics like random sample consensus (RANSAC) [2] or one of its many problem-specific variations. Robust statistics have enjoyed popularity among researchers as well. However, these solutions sometimes require a great deal of tuning, while still leaving room for improvement in estimation accuracy and insensitivity to structured outliers. Moreover, most of the available techniques for robust estimation rely on specific priors on the input data, for instance, the expected ratio of outliers [2] or a rough localization of them, as expressed by the alternate optimization in [3]. It is precisely with the goal of eliminating as much as possible any need for prior knowledge on the input data that we explore deep neural networks in this context. We hypothesize that the multi-scale spatial reasoning of a model empowered with stacks of convolutional layers is key towards universally robust polynomial regression.

Indeed, deep models were found to be useful in a large variety of complex regression problems [4,5,6]. The ubiquity of convolutional neural networks in these types of problems speaks of their potential for the task of polynomial regression. A particular property we are especially interested in in this paper is robustness, and how to learn to be robust. However, during supervised learning, the types of robustness a model can learn are tightly related to the examples in the training dataset. Given the difficulties that arise when collecting the large datasets that neural networks need, it is very likely that for a given problem only a small portion of those cases are covered. How to help deep models generalize to other cases is an open question. In practice, this is usually handled by randomized data augmentation [7]. Indeed, being able to generalize from the training dataset and being robust to damaged input seem to be, at least in principle, related concepts in machine learning.

Another difficult question that arises when training such models for regression problems is what the best loss function is. In particular, when regressing coefficients of a polynomial function, standard loss functions might not be optimal. This is related to the fact that, very often for some problems, a few coefficients are much larger than others, causing imbalance during training.
This might be the reason why, for optical flow, a common regression problem in computer vision [8], convolutional neural networks are trained with a loss computed on the output flow field rather than directly on motion coefficients. Our contributions are the following:
– describing a family of deep models for polynomial regression,
– defining a simple methodology for unsupervised training of polynomial regression models,
– comparing the effect of a loss function applied on the output data stemming from the estimated parametric model vs. a loss applied directly on the estimated polynomial coefficients,
– exploring the effect of robust losses during training,
– analyzing polynomial regression problems of different input data dimensionality,
– presenting a simple application to the estimation of parametric motion models and video stabilization.
We start by summarizing the related work in Section 2. Motivating ideas for our work are discussed in Section 3. We then explain our models in Section 4, and give way to the core experimental work in Section 5. Final comments are given in Section 8.
In this section we give a review of the related work. First, we start with a brief introduction to robust regression methods. We include a description and motivation of iterative methods like RANSAC and consensus-based approaches, and continue with robust estimators. Later, we explain further the problem of parametric motion model estimation, which is a form of regression often found in the computer vision literature. Finally, we introduce recent works on deep models for regression and similar tasks.

RANSAC, proposed by Fischler and Bolles in 1981 [2], is an iterative method for the alternated determination of model inliers and model parameters. It encompasses randomized sampling of the input data set, estimation of a candidate parametric model explaining the chosen subset, and determination of the proportion of data points that agree with the candidate model by using a hand-tuned threshold. The method iterates for a fixed number of iterations or until enough data points find a consensus. The randomized nature of the method implies that, for a single dataset, the results of multiple runs might differ. Furthermore, the algorithm parameters usually need to be tuned for different problems, and the method is known to be sensitive to the choice of the threshold [9]. To provide some more stability to this random heuristic, some works have focused on other ways to establish goodness of fit: least median of squares, which, even though it offers outstanding robustness, still fails when the ratio of outliers is very large; MLESAC, which maximizes the likelihood rather than the number of inliers [10]; MINPRAN [11], which makes assumptions on the randomness of the data, etc.
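For reference, the vanilla RANSAC loop described above can be sketched as follows for the polynomial case; the threshold, iteration count and consensus ratio are the hand-tuned quantities mentioned in the text, set here to illustrative values rather than anything reported by the authors.

```python
import numpy as np

def ransac_poly(x, d, degree=4, n_iter=200, threshold=0.1, min_consensus=0.5, seed=0):
    """Basic RANSAC for polynomial fitting: sample minimal subsets, fit a candidate,
    count inliers against a hand-tuned threshold, keep the best consensus set."""
    rng = np.random.default_rng(seed)
    M = np.vander(x, degree + 1, increasing=True)        # design matrix [1, x, x^2, ...]
    best_theta, best_inliers = None, 0
    for _ in range(n_iter):
        idx = rng.choice(x.size, size=degree + 1, replace=False)   # minimal random sample
        theta = np.linalg.lstsq(M[idx], d[idx], rcond=None)[0]     # candidate model
        inliers = np.abs(d - M @ theta) < threshold                # agreement test
        if inliers.sum() > best_inliers:
            best_inliers = inliers.sum()
            best_theta = np.linalg.lstsq(M[inliers], d[inliers], rcond=None)[0]
        if best_inliers >= min_consensus * x.size:                 # early stop on consensus
            break
    return best_theta
```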
Robust estimators aim to fix the bias problem of LSE by replacing the L2 norm with the L1 norm, but this change only increases robustness in the mono-dimensional case, where minimizing the L1 norm yields the estimator of the median. Truncated least-squares and least-trimmed squares are other options to replace the L2 norm.

We make a short overview of a common application of polynomial regression in computer vision: estimation of parametric motion models. This use-case is a great example of polynomial regression with strong outliers. Natural scenes can often be roughly separated into background and foreground segments. Foreground segments often include moving people, vehicles or any type of independently moving objects. When the task is estimating the dominant image motion due to camera movements, foreground segments can effectively be seen as spatially-coherent outliers. Depending on the scene, these outliers can occupy a very large portion of the image support, hindering accurate estimation. In many dynamic scene analysis building blocks, accuracy of polynomial regression is important, e.g., in motion segmentation [14,15], optical flow estimation [16,17,18,19], detection of motion anomalies [20], and tracking [21]. Classical methods pose parametric motion model estimation as an inverse problem that is solved through minimization of an energy functional [22,23]. These methods leverage the motion constraint to form a data-driven term encouraging motion parameters that minimize the displaced frame difference (DFD) between the input images. In contrast to per-pixel optical flow estimation, the estimation of parametric motion models is not an underdetermined problem. Indeed, the proposed models explain the image-based motion cues for all the image pixels at once (or a subset of them). Usually the number of observations, i.e., pixel positions, is much greater than the number of parameters of the motion model, leading to stable solutions when no motion outliers are present in the scene. However, under the presence of outliers, models that simply penalize the displaced frame intensity difference with the L2 norm deliver biased estimates.

Convolutional Neural Networks (CNNs) have started to dominate computer vision problems that had traditionally been very complicated to address with learning-based methods. This is most probably due to the higher-level features that the hierarchical CNN architectures are able to learn. One example of these problems is the estimation of optical flow. In scenes with large motion ambiguity, only semantic cues are able to recover the correct apparent motion [8,26,27,28,29]. This seems to be the reason why deep optical flow methods are currently dominating benchmarks. In Dosovitskiy et al. [8], convolutional filters successfully learn how to estimate two-dimensional motion fields from pairs of successive video frames. A few elements of this approach, coined FlowNet, have to be considered when tackling similar tasks. Performing a complex transform from 2D maps (images) to same-resolution 2D maps (optical flows) requires capturing high-level features from the data. In order for features to pick up global information over the full spatial extent of the input maps, they are implemented in a contractive fashion. Indeed, this is a very common practice in applied deep learning. A second part of the network must then take those features and expand them so that they are able to restore the spatial resolution of the output. An encoder-decoder architecture comes easily to mind. However, special attention must be paid to the motion estimation problem.
Indeed, optical flow networks must have good localization properties. Forward skip connections from contractive layers are connected through convolutions to the expanding part of FlowNet, alleviating the poor localization of deep networks and simple encoder-decoder architectures. (The use of convolutions for optical flow has a longer history. For instance, Farnebäck implemented his motion estimation method by means of separable convolutions in [30]. Weinzaepfel et al. [31] rely on a large stack of patch-based convolutional responses. To the best of our knowledge, however, the method of Dosovitskiy et al. [8] is actually the first one to use learned convolutional filters to perform the mapping between images and motion fields.)

A problem closely related to polynomial regression, geometric matching, consists of finding a parametric transformation of the image grid allowing the registration of input frames. Recently, Rocco et al. [32] proposed a neural network model that is capable of registering pairs of images that do not necessarily belong to the same image sequence. The target parametric transformations were affine and thin-plate splines [33]. In their model, the problem is divided into three tasks: symmetric feature extraction with a Siamese network initialized with VGG features [34], a dense correlation layer similar to the one used by FlowNet, and a regression layer, which infers the image grid transformation. Another regression problem that has recently been tackled by CNNs is human pose estimation. Excellent results were obtained with the so-called stacked hourglass networks (SHN) [6].
The success of deep models on the complicated tasks described in Section 2 motivates the exploration of deep models for learning how to robustly estimate parametric models. One interesting element of FlowNet [8,27] is that it was trained on a synthetic dataset called FlyingChairs. The dataset contains around 25,000 images of chairs composited onto background images extracted randomly from Flickr. The backgrounds were assigned a random rigid motion, and the foreground, composed of computer-generated chairs, another one. A simple strategy for data augmentation allows the network to generalize from that dataset to real images. The final results of FlowNet are impressive considering that the pipeline is learned in an end-to-end fashion with synthetic data, and, powered by modern GPUs, they are computed in almost real-time. The evolution of FlowNet, FlowNet 2.0 [27], ranks very highly in optical flow benchmarks. Perhaps the element introduced by FlowNet 2.0 that is most relevant to this work is curriculum learning. One of the issues of the original FlowNet is its poor behaviour for small displacements. To tackle this, FlowNet 2.0 leverages a second synthetic dataset depicting more complex motions (of smaller magnitude on average), coined FlyingThings3D. The optimal schedule for training was to first use FlyingChairs, and then FlyingThings3D. Apparently, a neural network is predisposed to learn more complex data priors when already trained for simpler ones. We will test this hypothesis for our scenario later on.

Newell et al. [6] stated the human pose estimation problem as a dense map-to-map inference problem. The important elements that allow such networks to perform so well can be summarized as follows:
– Skip layers with symmetric connections from the convolutional operations in the contractive part of the network to the upsampling layers in the expansive part of the network. This particular design essentially allows the network to be aware of global and local information at every stage of the decoding part. A single module with these design properties is called an hourglass module.
– Stacks. Stacking hourglass modules seems to allow the SHN to perform repeated top-down, bottom-up operations that might be essential for capturing different aspects of the pose estimation problem at every module.
– Residual connections. The residual connections, as introduced by [35], allow very deep models to be properly trained. Each residual module is by-passed by an identity transformation that allows gradients to flow freely through the network. A deeper understanding of residual learning can be obtained by looking at [36].
– Intermediate supervision. SHN allows intermediate outputs to be used in the training loss. This procedure guarantees that each hourglass module learns something about the pose estimation problem, and further stabilizes the overall training.
The lessons obtained from the state-of-the-art are directly leveraged by our models and experiments in the following sections.
The polynomial regression problem that we tackle is defined by an input pair $(x, d)$ of a domain vector $x = [x_1; x_2; \cdots; x_N] \in \mathbb{R}^{DN}$ and a corresponding range vector $d = [d_1; d_2; \cdots; d_N] \in \mathbb{R}^{RN}$. The dimensions $D$ and $R$ of $x_i$ and $d_i$ do not have to be the same. For instance, $x_i$ can be an image point ($D = 2$) and $d_i$ an intensity value ($R = 1$). The relationship between range and domain is assumed to follow a polynomial of given degree, $d_i^{\theta} = P_{\theta}(x_i)$, where $\theta$ is the vector of its $M$ coefficients, e.g., $M = 6$ for a two-dimensional affine transform. Rewriting this relation as a linear function of $\theta$ reads:

\[ d_i^{\theta} = M_i(x_i)\,\theta , \qquad (1) \]

where $M_i(x_i) \in \mathbb{R}^{R \times M}$ is a design matrix whose structure is maintained across the input data, but whose values are a function of the corresponding domain element $x_i$. These design matrices can be stacked into a single matrix $M(x) = [M_1(x_1); \cdots; M_N(x_N)] \in \mathbb{R}^{RN \times M}$ so that:

\[ d_x^{\theta} = M(x)\,\theta . \qquad (2) \]

Under the assumption that data only undergo additive Gaussian noise, the problem of estimating $\theta$ reduces to solving:

\[ \hat{\theta} = \arg\min_{\theta} \sum_{i=1}^{N} \big\| d_i - d_i^{\theta} \big\|^2 = \arg\min_{\theta} \big\| d - d_x^{\theta} \big\|^2 , \qquad (3) \]

from which it follows that $\hat{\theta} = ( M^{T} M )^{-1} M^{T} d$ (with $x$ hidden for the sake of conciseness). This solution corresponds to the simplest possible baseline for polynomial regression, but it is clearly biased under the presence of outliers.

To some extent, the problem of outlier removal is similar to the signal denoising problem that stacked denoising autoencoders (SDA) [37] address. (Denoising in SDA is more of a proxy task to facilitate the unsupervised learning of meaningful features from data; however, similar ideas led to a very successful method for image denoising in [38].) These encoding-decoding architectures, however, are not directly amenable to the polynomial regression problem. Indeed, the function that transforms the code into output data should not be learned. It should instead take the form of a fixed, non-trainable differentiable decoding layer. This fixed decoder is simply given by Eq. 2, i.e., a linear transform of the hypothesized polynomial coefficients $\theta$ based on a problem-specific design matrix. Assuming that a learnable encoder, composed of convolution layers, that maps the input data to the polynomial coefficient space is available, an Encoder-Fixed Decoder, or model-based auto-encoder to use the terminology in [5], can be formed.
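For concreteness, a minimal NumPy sketch of the least-squares baseline of Eqs. (1)–(3) for a scalar degree-4 polynomial ($D = 1$, $R = 1$, $M = 5$) is given below; the function names and the toy data are illustrative, not part of the original method.

```python
import numpy as np

def design_matrix(x, degree=4):
    """Stack the per-sample design matrices M_i(x_i) of Eq. (1) into M(x) (Eq. (2)).
    For a scalar polynomial, row i is [1, x_i, x_i^2, ..., x_i^degree]."""
    return np.vander(x, N=degree + 1, increasing=True)

def lse_fit(x, d, degree=4):
    """Closed-form least-squares estimate of Eq. (3): theta = (M^T M)^{-1} M^T d."""
    M = design_matrix(x, degree)
    theta, *_ = np.linalg.lstsq(M, d, rcond=None)  # numerically stabler than normal equations
    return theta

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = np.linspace(-1.0, 1.0, 200)
    theta_true = np.array([0.5, -1.0, 0.3, 2.0, -0.7])      # degree-4 coefficients
    d = design_matrix(x) @ theta_true
    d += 0.01 * rng.standard_normal(x.size)                 # Gaussian noise
    outliers = rng.random(x.size) < 0.3                     # 30% of samples corrupted
    d[outliers] += 5.0 * rng.standard_normal(outliers.sum())
    print("LSE coefficient error with outliers:",
          np.linalg.norm(lse_fit(x, d) - theta_true))       # visibly biased estimate
```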
Fig. 1. Model-based autoencoder with fixed, non-trainable decoder. The input (green) is mapped to a code which effectively becomes the coefficients of a polynomial model when passed through the fixed decoder part (navy blue).
Fig. 2. Encoder. The learnable part of our family of networks for deep polynomial regression. The intermediate outputs from each hourglass module are collected for computing the loss, together with the final output after the encoding part of our architecture.

Such a network can be trained with well-known deep learning training algorithms with a loss function acting on the output (decoded) data. Moreover, the denoising learning trick explained by [37] can be readily applied to such an architecture, as seen in Fig. 1. Granted that training pairs are composed of corrupted and clean data, such a network should be able to learn to regress while ignoring outliers, even structured ones if present in the training data. Furthermore, by means of this training, the "code" naturally corresponds to the desired polynomial parameters. An interesting element of this design is that it bypasses the polynomial coefficients themselves at the loss level, eliminating the need for tweaking specific loss functions according to the type of polynomials to be regressed. Indeed, comparing data vectors of the same domain is more straightforward.
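As an illustration of the fixed, non-trainable decoder, the following PyTorch sketch wraps Eq. (2) as a module placed on top of any encoder; the class and variable names are ours, and the small stand-in encoder is not the hourglass architecture described later.

```python
import torch
import torch.nn as nn

class FixedPolynomialDecoder(nn.Module):
    """Non-trainable decoder implementing d = M(x) theta (Eq. (2)).
    The design matrix is precomputed from the domain grid and registered as a buffer,
    so no gradients are ever taken with respect to it."""
    def __init__(self, design_matrix: torch.Tensor):
        super().__init__()
        self.register_buffer("M", design_matrix)        # shape (R*N, M), fixed

    def forward(self, theta: torch.Tensor) -> torch.Tensor:
        # theta: (batch, M) -> decoded data: (batch, R*N)
        return theta @ self.M.t()

# Toy usage: an arbitrary convolutional encoder predicting the code (coefficients),
# trained with an L2 loss on the decoded output rather than on the coefficients.
x = torch.linspace(-1, 1, 128)
M = torch.stack([x**k for k in range(5)], dim=1)        # degree-4 scalar design matrix
decoder = FixedPolynomialDecoder(M)
encoder = nn.Sequential(nn.Conv1d(1, 16, 3, padding=1), nn.ReLU(),
                        nn.AdaptiveAvgPool1d(1), nn.Flatten(), nn.Linear(16, 5))
corrupted = torch.randn(8, 1, 128)                      # placeholder corrupted input
clean = torch.randn(8, 128)                             # placeholder clean target
loss = nn.functional.mse_loss(decoder(encoder(corrupted)), clean)
loss.backward()                                         # gradients reach the encoder only
```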
Let us, for now, ignore the exact architecture of the encoder part of our family of networks. A common way to train denoising autoencoders was proposed in [37], as previously mentioned. This training trick can be categorized as an unsupervised learning method, since pairs of input images and corrupted images are constructed on the fly during training, without the need of human intervention.
In the case of polynomial regression, this leaves the door open to fully unsupervised training, which is preferable since it is the most common framework for tackling the problem. In our framework, we train our networks by providing randomly generated pairs of clean and corrupted data. The parameters of the random generation process are discussed in the supplementary material. Since every sample is generated randomly, training can encompass a very large number of iterations without affecting the generalization power of the learned model.
For the encoding part of our family of networks, we propose to use stacked hourglass modules [6]. Several ingredients of SHN seem to be well adapted to the polynomial regression problem with encoder-decoder architectures. In particular, the repeated bottom-up and top-down operations obtained by stacking residual hourglass modules fit the multi-scale processing spirit of several classical methods. On top of that, these residual modules capture scale information, which, in the opinion of the authors, is one of the fundamental elements of problem-tailored regression methods. In the experimental part of this work, we validate these claims by establishing a baseline network composed of more classical feedforward convolutional networks (i.e., purely contractive and without residual connections). As in SHN, we make use of intermediate losses at the output of each hourglass module. The output features are used in a final contractive stage to obtain the polynomial coefficients or "code" (see Fig. 2).
As previously mentioned, estimation of parametric motion models is a very good example of a polynomial regression problem with naturally strong outliers. In such a setting, a polynomial motion model for a moving scene is interpreted as the dominant scene motion stemming from camera motion. In that sense, outliers correspond to moving foreground objects, which can occupy large areas of the scene.

A common way to perform video stabilization is to compute a temporally and spatially smooth optical flow map [39]. One way to achieve this is to compute at each instant a bi-dimensional optical flow ($R = 2$) over the image domain ($D = 2$) and then fit a polynomial function to it. Given an input optical flow map $V = \{ f_x \}_{x \in \Omega}$, one can fit a polynomial function $f^{\theta}$ computable at every position $x = (x_1, x_2)$ of the image grid $\Omega$, so that:

\[ f_x^{\theta} = \begin{bmatrix} u_x^{\theta} \\ v_x^{\theta} \end{bmatrix} = M(x)\,\theta , \qquad (4) \]

where $\theta$ is a column vector containing the parameters of a polynomial motion model. Let us consider, for the sake of generality, full quadratic motion models with twelve coefficients ($M = 12$). (Any other common motion model, such as the 4-parameter affine one or the 8-parameter one corresponding to the rigid motion of a planar scene, could be considered as well.) Then, the matrix $M(x)$ in Eq. 4 takes the form:

\[ M(x) = \begin{bmatrix} 1 & x_1 & x_2 & x_1^2 & x_1 x_2 & x_2^2 & 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 & 1 & x_1 & x_2 & x_1^2 & x_1 x_2 & x_2^2 \end{bmatrix} , \qquad (5) \]

and $\theta \in \mathbb{R}^{12}$.
Table 1. Regressing scalar functions with a 4th-degree polynomial of one variable. Results for 6 different testing datasets at increasingly higher outlier ratios, with a fixed noise standard deviation of 0.01. The numbers are the mean squared error between the generated clean data and the outputs of the respective methods; the lower the number, the better the accuracy. Best results per column are in bold, second best underlined. (Columns: outlier ratio 0%, 10%, 20%, 30%, 40%, 50%, and average; rows: LSE and the compared methods. Numerical entries not reproduced here.)
Equations 4 and 5 are specializations of the previously introduced Eq. 1 for a polynomial function with a two-dimensional domain. This means that the proposed network of Fig. 1 applies directly to this problem. We explain data generation and training issues in the supplementary material. In Section 5, we report experimental results and a direct application to the problem of video stabilization.

We present two sets of experiments in Section 5.1. Their goal is to validate the design decisions explained in the previous sections, and to demonstrate that they hold even when the dimensionality of the problem changes. We start with least squares (LSE) as the simplest baseline. We also provide results with conventional robust algorithms: RANSAC, and a robust Tukey [12] estimator solved with Iteratively Re-weighted Least Squares (IRWLS), sketched below. Finally, in Section 5.2, we present a simple video stabilization pipeline based on parametric motion models regressed from optical flow maps.
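For reference, a minimal sketch of the Tukey-biweight IRWLS baseline used in the comparisons might look as follows; the tuning constant, scale estimate and iteration count are illustrative choices, not values reported by the authors.

```python
import numpy as np

def irwls_tukey(M, d, c=4.685, n_iter=20):
    """Iteratively Re-weighted Least Squares with the Tukey biweight.
    M: (N, M) design matrix, d: (N,) observations. Returns robust coefficients."""
    theta = np.linalg.lstsq(M, d, rcond=None)[0]          # LSE initialization
    for _ in range(n_iter):
        r = d - M @ theta                                  # residuals
        s = 1.4826 * np.median(np.abs(r)) + 1e-12          # robust scale (MAD)
        u = r / (c * s)
        w = np.where(np.abs(u) < 1.0, (1.0 - u**2) ** 2, 0.0)   # Tukey weights
        sw = np.sqrt(w)
        theta = np.linalg.lstsq(sw[:, None] * M, sw * d, rcond=None)[0]  # weighted LS step
    return theta
```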
Regression of scalar polynomials.
We start with a toy experiment consisting of a simple 1D regression problem (R = 1 and D = 1) with scalar polynomials of degree four. The evaluation of our framework for deep polynomial regression is split into two types of networks. First, the "Half-Nets" refer to our Encoder-Decoder networks without stacked hourglass modules. Instead, "Half-Nets" are composed only of contractive convolutions for the encoder part, with our fixed decoder on top. Furthermore, we want to determine if split training [27] presents any advantage over training with a single complex dataset. Thus, Half-Net Data 1 is trained only on a dataset encompassing functions with smaller magnitude, contaminated with little noise and with only a few outliers.
Half-Net Data 1&2 is refined in a second stage with a more complicated dataset encompassing a larger variety of generation modes, more noise, and up to 33% structured outliers. During evaluation, outlier ratios up to 50% are tested. Finally, Half-Net Data 1+2 is a network trained with a dataset encompassing both the simple and complicated generation schemes combined in a single stage. A second type of networks, encompassing SHN for the encoder part, are referred to as "Full-Nets". In the same order as for the "Half-Nets", Full-Net Data 1, Full-Net Data 1&2 and Full-Net Data 1+2 study the importance of split training. In order to determine the value of our fixed decoder as part of the network, we also train Full-Net w.o. D with a mean square loss on the produced parametric coefficients. Finally, Full-Net R.L. is trained with a robust loss. For Full-Net R.L., instead of training with the proposed dataset composed of pairs of clean and contaminated vectors, we train only with contaminated data. (The idea of this particular experiment is to assess whether, only by means of a robust loss, neural networks are able to learn to reject outliers.) The robust loss in this network is expected to reject outliers automatically, as proposed in [4]. Both Full-Net w.o. D and Full-Net R.L. are trained on the complex dataset (Data 1+2). All the results can be observed in Table 1.

A first conclusion that can be drawn from Table 1 is that, overall, neural-based regression is more stable than classical robust methods (RANSAC or IRWLS) in the face of outlier corruption. Another interesting outcome of our experimentation is that, for scalar polynomial regression, split training does not seem to provide a large gain in terms of prediction error. However, it does achieve the absolute best results in comparison to other dataset configurations. Perhaps more interestingly, all of our networks with a preceding hourglass module present overall better results than any other set-up. In particular, Full-Net Data 1&2 seems to achieve the lowest errors, except for the corruption-free case, where LSE is the optimal estimator and therefore expected to win. An interesting finding is that our denoising training effectively teaches our networks how to be robust to outliers. In contrast, when simply training with a robust function (Full-Net R.L.), results tend to be poor. This reinforces the notion that neural networks learn how to be robust more easily by example than by application of robust losses. Finally, it is worth mentioning that training through our hard-wired decoder does indeed offer a jump in accuracy of the regression. This interesting fact seems to be in agreement with very recent findings in neural estimation of image geometry [40].
Regression of 2D polynomials (parametric motion models).
To show that our findings hold for several types of polynomial regression problems, we repeat the previous experiments, this time for a polynomial regression problem of higher dimensionality (R = 2, D = 2), involving the full quadratic motion model of Eq. 5 (M = 12). The experiments are collected in Table 2, where the items have the same meaning as for Table 1.
Table 2. Regressing vector fields with a 2nd-degree polynomial of two variables. Results for 6 different testing datasets at increasingly higher outlier ratios, with a fixed noise standard deviation of 0.5. The error values are the Euclidean norm between the generated clean data and the outputs of the respective methods. Best results per column are in bold, second best underlined. (Columns: outlier ratio 0%, 10%, 20%, 30%, 40%, 50%, and average; numerical entries not reproduced here.)

From Table 2, we observe that most of the findings for scalar regression hold for the vector field case. This time,
Full-Net Data 1+2 delivers the best overall results, with an MSE score that is roughly three times lower than the baseline without fixed decoder (Full-Net w.o. D) and almost four times lower than the best configuration of networks without stacked hourglass modules (Half-Net Data 1&2). In this set-up, Full-Net Data 1+2 and Full-Net Data 1&2 do not seem to present significantly different results. Finally, it is worth mentioning that our training procedure presents training samples with at most a 30% outlier ratio, while the experiments reach up to 50%. Interestingly, our best models generalized very well to these extreme outlier conditions, achieving errors one order of magnitude lower than the classic baselines for the vector field case, and two orders of magnitude lower for the scalar function case.
We now report experiments to demonstrate that the proposed approach is relevant for real data, even if training is conducted on synthetic data only and without supervision. In particular, we take the best network from the second set of experiments in Section 5.1 and use it directly to compute parametric motion models from optical flow. We start from the popular optical flow baseline method by Brox et al. [41]. (It should be noted that one could replace this off-the-shelf optical flow method by a more precise one. Furthermore, one could connect a deep-learning-based method to ours, enabling end-to-end training from images. We leave this for future research.) Robustly fitting a quadratic motion model to an estimated, complex, non-parametric flow is very challenging due to flow inaccuracies (which can be slightly structured) and, more importantly, to the highly structured outliers that foreground objects induce.

We first start by analyzing the behavior of our model on a synthetic optical flow dataset, namely FlyingChairs, in Fig. 3. This dataset is composed of synthetically rendered chairs composited onto moving backgrounds.
Fig. 3. Visual results on the FlyingChairs dataset. A different scene is shown in each row. (a-b) The image pair; (c) the input optical flow map (ground truth); (d) output of Full-Net Data 1; (f) the corresponding result for Full-Net Data 1+2. Observe the simplification of the input optical flow into a parametric motion model that gets rid of outliers (motion of the chairs); (e) and (g): normalized per-pixel difference between the output and input flows corresponding to our model trained on the first and second datasets, respectively. Observe that the larger differences (more saturated colors) are, generally, at the pixels depicting a moving chair.
Fig. 4. Visual results on the KITTI dataset. A different scene is shown in each column: (top) input image pair; (third row) input optical flow; (fourth row) quadratic motion field obtained by our deep regression network; (fifth row) pixel-wise difference between inputs and outputs.

Our model captures what can be interpreted as the dominant scene motion, mostly corresponding to the background. Interestingly, the high-magnitude values of the difference between the input and output flow maps correspond to foreground objects, i.e., the moving chairs. In this sense, our method robustly captures global motion, showing high insensitivity to outliers. In Fig. 3, we can also see the difference between the same model trained on two different datasets (columns e and g). Evidently, Full-Net Data 1+2, being trained on a more complex outlier-generation setting, better captures global motion than Full-Net Data 1.

In a more realistic set-up, we take the KITTI dataset and compute optical flow maps between pairs of frames with [41]. From these maps, we fit quadratic motion models with our best network, resulting in the visual results of Fig. 4. The reader should notice that the simplified flow maps stemming from the fitted quadratic motion models conform well with the dominant ego-motion induced on the scene by the displacement of the embarked camera.
Fig. 5. Video stabilization results. The estimated quadratic dominant motion in a real scene is used to backwarp images for video stabilization. First row: original unstable input. Second row: stabilized images. The out-of-frame holes (black) are left so that the reader can better appreciate the motion compensation.
Video stabilization.
A common application of dominant motion estimation is video stabilization: cancelling the higher frequencies of background visual motion indeed requires an accurate estimation of this flow field. In Fig. 5 we show stabilization results with our deep polynomial regression method on a sequence undergoing camera shake caused by strong wind. Observe that the image sequences are correctly warped to compensate for the dominant motion. As a further improvement, we smooth pixel profiles in a similar fashion to [39] to achieve better temporal smoothness. Our algorithm delivers high-quality results, considering that we did not introduce any higher-level considerations specific to the video stabilization problem.
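To make the backwarping step concrete, here is a small sketch of how a frame can be compensated with an estimated 12-parameter quadratic motion model; the coordinate and warping conventions (pixel coordinates, backward warping with OpenCV remap) are our assumptions, not the authors' implementation.

```python
import numpy as np
import cv2

def quadratic_flow(theta, h, w):
    """Evaluate the 12-parameter quadratic motion model of Eq. (5) on an h x w grid.
    theta[:6] parametrizes u and theta[6:] parametrizes v over [1, x1, x2, x1^2, x1*x2, x2^2]."""
    x2, x1 = np.mgrid[0:h, 0:w].astype(np.float32)      # x1: column index, x2: row index
    basis = np.stack([np.ones_like(x1), x1, x2, x1**2, x1 * x2, x2**2], axis=-1)
    u = basis @ theta[:6]
    v = basis @ theta[6:]
    return u.astype(np.float32), v.astype(np.float32)

def stabilize(frame, theta):
    """Backwarp a frame with the estimated dominant motion to cancel it."""
    h, w = frame.shape[:2]
    u, v = quadratic_flow(theta, h, w)
    x2, x1 = np.mgrid[0:h, 0:w].astype(np.float32)
    # Sample the frame at the positions displaced by the dominant motion.
    return cv2.remap(frame, x1 + u, x2 + v, interpolation=cv2.INTER_LINEAR)
```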
In the following paragraphs we give important details of the models used during the experimental part of our work.
The problem we tackle in Section 5.1 is the regression of 4th-degree polynomial coefficients. This set-up is amenable to the multi-scale, repetitive processing of stacked hourglass networks. However, we replace all the 2D convolutions by 1D convolutions on account of the input data structure.
Hourglass modules.
Each convolutional layer in the reductive part of an hourglass module is formed by 32 convolutional kernels of size three. At the output of each of these convolutional layers, a sequence of three operations is applied: ReLU, 1D max pooling with stride 2, and 1D batch normalization. Furthermore, after the ReLU operations, forward skip connections through convolutions of the same size are connected to bilinear upsampling operations in the enlarging part of the hourglass modules. For this experiment we found that stacking more than two hourglass modules did not improve performance much further.
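As a rough illustration of this description, a minimal 1D hourglass module in PyTorch could be organized as follows; the depth, exact skip-connection wiring and normalization placement are our assumptions where the text leaves them open.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Hourglass1D(nn.Module):
    """One 1D hourglass module: contractive path (conv, ReLU, stride-2 max pooling,
    batch norm), then upsampling merged with skip connections taken after each ReLU."""
    def __init__(self, channels=32, depth=3):
        super().__init__()
        self.depth = depth
        self.down = nn.ModuleList([nn.Conv1d(channels, channels, 3, padding=1) for _ in range(depth)])
        self.bn = nn.ModuleList([nn.BatchNorm1d(channels) for _ in range(depth)])
        self.skip = nn.ModuleList([nn.Conv1d(channels, channels, 3, padding=1) for _ in range(depth)])
        self.up = nn.ModuleList([nn.Conv1d(channels, channels, 3, padding=1) for _ in range(depth)])

    def forward(self, x):
        skips = []
        for i in range(self.depth):
            x = F.relu(self.down[i](x))
            skips.append(self.skip[i](x))             # skip branch taken after the ReLU
            x = self.bn[i](F.max_pool1d(x, 2))        # stride-2 pooling, then batch norm
        for i in reversed(range(self.depth)):
            x = F.interpolate(x, size=skips[i].shape[-1], mode="linear", align_corners=False)
            x = F.relu(self.up[i](x) + skips[i])      # upsample and merge with the skip
        return x

# Two stacked modules, as in the experiments reported above (a stem conv mapping the
# 1-channel input to 32 channels would precede this trunk).
encoder_trunk = nn.Sequential(Hourglass1D(), Hourglass1D())
```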
Model-based autoencoder.
We now explain the model used for the experiments of Section 5.2. In line with the paper's intention of providing a general model for regression problems, we use a model that is identical to the one used for scalar regression, with only a few minor differences. Thus, each 1D convolution of the previous model is replaced by a 2D convolution with twice as many planes. Furthermore, the fixed decoder of the model-based autoencoder is replaced to match the new polynomial operation.
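A sketch, under our own assumptions about coefficient ordering, of the corresponding fixed decoder matrix for the 12-parameter quadratic model of Eq. (5), built once over the image grid:

```python
import torch

def quadratic_design_matrix(h, w):
    """Stacked design matrix M(x) of Eq. (5) for the full quadratic motion model:
    returns a (2*h*w, 12) tensor mapping theta in R^12 to a flattened (u, v) field."""
    x2, x1 = torch.meshgrid(torch.arange(h, dtype=torch.float32),
                            torch.arange(w, dtype=torch.float32), indexing="ij")
    basis = torch.stack([torch.ones_like(x1), x1, x2, x1**2, x1 * x2, x2**2], dim=-1).reshape(-1, 6)
    zeros = torch.zeros_like(basis)
    row_u = torch.cat([basis, zeros], dim=1)   # u depends on theta[:6]
    row_v = torch.cat([zeros, basis], dim=1)   # v depends on theta[6:]
    return torch.cat([row_u, row_v], dim=0)    # shape (2*h*w, 12)

# theta (batch, 12) -> flattened flow (batch, 2*h*w); can be registered as the buffer
# of a fixed decoder of the kind sketched earlier.
M = quadratic_design_matrix(64, 64)
flow = torch.randn(4, 12) @ M.t()
```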
Two dataset generators are used for training our regression models, namely Data 1 and Data 2. Each of them is composed of pairs of "input" and "output" samples. The "output" samples are clean arrays of values stemming from a polynomial model, while the "input" samples are the corresponding contaminated maps. Each pair simulates a supervised training pair, but it is generated online, at random. See Fig. 6 for an illustration of data generated for the vector field case.

The main difference between Data 1 and Data 2 is the intensity of the contamination. The idea is that Data 1 allows neural networks to learn the basic operations for regression without many distractions, while a subsequent dataset refines the network for handling more complex contamination. Thus, we set the maximum outlier ratio for Data 1 to 0.1, while we set it to 0.3 for Data 2. Gaussian noise is set to 0.1 for Data 1 and 0.5 for Data 2. The structured outliers correspond to randomly selected polygonal supports encompassing another randomly sampled polynomial model.
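For illustration, one way such pairs might be generated on the fly for the scalar case, with the stated noise level and outlier ratio of Data 2 and a simple contiguous interval standing in for the polygonal supports (an assumption on our part), is:

```python
import numpy as np

def sample_pair(rng, n=128, degree=4, max_outlier_ratio=0.3, noise_std=0.5):
    """Generate one (corrupted input, clean output) pair, as in the Data 2 regime.
    Structured outliers replace a random contiguous support with another polynomial."""
    x = np.linspace(-1.0, 1.0, n)
    basis = np.vander(x, degree + 1, increasing=True)
    theta = rng.standard_normal(degree + 1)
    clean = basis @ theta
    corrupted = clean + noise_std * rng.standard_normal(n)       # Gaussian noise
    ratio = rng.uniform(0.0, max_outlier_ratio)                  # outlier ratio for this sample
    length = int(ratio * n)
    if length > 0:
        start = rng.integers(0, n - length)                      # contiguous (structured) support
        other = basis @ rng.standard_normal(degree + 1)          # a second, outlier polynomial
        corrupted[start:start + length] = other[start:start + length]
    return corrupted, clean

rng = np.random.default_rng(0)
inp, out = sample_pair(rng)    # one training pair, regenerated at every iteration
```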
In this paper we have proposed a neural approach to robust polynomial regres-sion. We carried out an experimental study spanning several important hints et al.
Fig. 6. Randomly generated training pairs. Top row: contaminated input. Bottom row: clean output.

We have provided a general class of architectures that learn to deal with outliers effectively. In particular, the spatially-consistent outliers that have been an important problem for classical polynomial regression methods can be handled by properly trained and designed neural nets. Moreover, we have shown that these findings hold for different settings of polynomial regression. The proposed architecture captures two important common design strategies for polynomial regression: spatial awareness (effective through convolutions) and multi-scale processing (through stacked hourglass modules). This design, in conjunction with stacked-denoising-autoencoder training with simulated outliers, results in models that robustly learn how to handle largely contaminated data. Furthermore, networks trained with purely synthetic data are able to generalize to real data, leading to very good accuracy for global motion estimation and effective use for video stabilization. We expect that our findings can be generalized to other types of regression problems.
References
1. Meer, P., Mintz, D., Rosenfeld, A., Kim, D.Y.: Robust regression methods for computer vision: A review. International Journal of Computer Vision (1) (1991) 59–70
2. Fischler, M.A., Bolles, R.C.: Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography. Communications of the ACM (6) (1981) 381–395
3. Puy, G., Vandergheynst, P.: Robust image reconstruction from multiview measurements. SIAM Journal on Imaging Sciences (1) (2014) 128–156
4. Belagiannis, V., Rupprecht, C., Carneiro, G., Navab, N.: Robust optimization for deep regression. In: Proceedings of the IEEE International Conference on Computer Vision (2015) 2830–2838
5. Tewari, A., Zollhöfer, M., Kim, H., Garrido, P., Bernard, F., Pérez, P., Theobalt, C.: MoFA: Model-based deep convolutional face autoencoder for unsupervised monocular reconstruction. In: International Conference on Computer Vision (2017)
6. Newell, A., Yang, K., Deng, J.: Stacked hourglass networks for human pose estimation. In: European Conference on Computer Vision, Springer (2016) 483–499
7. Ciregan, D., Meier, U., Schmidhuber, J.: Multi-column deep neural networks for image classification. In: Computer Vision and Pattern Recognition (CVPR), IEEE (2012) 3642–3649
8. Dosovitskiy, A., Fischer, P., Ilg, E., Hausser, P., Hazirbas, C., Golkov, V., van der Smagt, P., Cremers, D., Brox, T.: FlowNet: Learning optical flow with convolutional networks. In: International Conference on Computer Vision (2015) 2758–2766
9. Torr, P.H., Murray, D.W.: The development and comparison of robust methods for estimating the fundamental matrix. International Journal of Computer Vision (3) (1997) 271–300
10. Torr, P.H., Zisserman, A.: MLESAC: A new robust estimator with application to estimating image geometry. Computer Vision and Image Understanding (1) (2000) 138–156
11. Stewart, C.V.: MINPRAN: A new robust estimator for computer vision. Transactions on Pattern Analysis and Machine Intelligence (10) (1995) 925–938
12. Huber, P.J.: Robust statistics. In: International Encyclopedia of Statistical Science. Springer (2011) 1248–1251
13. Holland, P.W., Welsch, R.E.: Robust regression using iteratively reweighted least-squares. Communications in Statistics – Theory and Methods (9) (1977) 813–827
14. Odobez, J., Bouthemy, P.: Separation of moving regions from background in an image sequence acquired with a mobile camera. In: Video Data Compression for Multimedia Computing. Springer (1997) 283–311
15. Cremers, D., Soatto, S.: Motion competition: A variational approach to piecewise parametric motion segmentation. International Journal of Computer Vision (3) (2005) 249–265
16. Black, M.J., Jepson, A.D.: Estimating optical flow in segmented images using variable-order parametric models with local deformations. Transactions on Pattern Analysis and Machine Intelligence (10) (1996) 972–986
17. Farnebäck, G.: Two-frame motion estimation based on polynomial expansion. In: Image Analysis. Springer (2003) 363–370
18. Fortun, D., Bouthemy, P., Kervrann, C.: Aggregation of local parametric candidates with exemplar-based occlusion handling for optical flow. Computer Vision and Image Understanding (2015) 1–18
19. Yang, J., Li, H.: Dense, accurate optical flow estimation with piecewise parametric model. In: Computer Vision and Pattern Recognition (2015)
20. Pérez-Rúa, J.M., Basset, A., Bouthemy, P.: Detection and localization of anomalous motion in video sequences from local histograms of labeled affine flows. Frontiers in ICT (2017) 10
21. Black, M.J., Yacoob, Y.: Tracking and recognizing rigid and non-rigid facial motions using local parametric models of image motion. In: International Conference on Computer Vision, IEEE (1995) 374–381
22. Black, M.J., Anandan, P.: The robust estimation of multiple motions: Parametric and piecewise-smooth flow fields. Computer Vision and Image Understanding (1) (1996) 75–104
23. Bergen, J., Anandan, P., Hanna, K., Hingorani, R.: Hierarchical model-based motion estimation. In: European Conference on Computer Vision, Springer (1992) 237–252
24. Odobez, J.M., Bouthemy, P.: Robust multiresolution estimation of parametric motion models. Journal of Visual Communication and Image Representation (4) (1995) 348–365
25. Senst, T., Eiselein, V., Sikora, T.: Robust local optical flow for feature tracking. Transactions on Circuits and Systems for Video Technology (9) (2012) 1377–1387
26. Thewlis, J., Zheng, S., Torr, P.H., Vedaldi, A.: Fully-trainable deep matching. British Machine Vision Conference (2016)
27. Ilg, E., Mayer, N., Saikia, T., Keuper, M., Dosovitskiy, A., Brox, T.: FlowNet 2.0: Evolution of optical flow estimation with deep networks. Computer Vision and Pattern Recognition (2017)
28. Bailer, C., Varanasi, K., Stricker, D.: CNN-based patch matching for optical flow with thresholded hinge embedding loss. In: Computer Vision and Pattern Recognition (2017)
29. Sun, D., Yang, X., Liu, M.Y., Kautz, J.: PWC-Net: CNNs for optical flow using pyramid, warping, and cost volume. arXiv preprint arXiv:1709.02371 (2017)
30. Farnebäck, G.: Fast and accurate motion estimation using orientation tensors and parametric motion models. In: International Conference on Pattern Recognition, Volume 1, IEEE (2000) 135–139
31. Weinzaepfel, P., Revaud, J., Harchaoui, Z., Schmid, C.: DeepFlow: Large displacement optical flow with deep matching. In: International Conference on Computer Vision (2013) 1385–1392
32. Rocco, I., Arandjelović, R., Sivic, J.: Convolutional neural network architecture for geometric matching. In: Computer Vision and Pattern Recognition, IEEE (2017)
33. Bookstein, F.L.: Principal warps: Thin-plate splines and the decomposition of deformations. Transactions on Pattern Analysis and Machine Intelligence (6) (1989) 567–585
34. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014)
35. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Computer Vision and Pattern Recognition (2016) 770–778
36. He, K., Zhang, X., Ren, S., Sun, J.: Identity mappings in deep residual networks. In: European Conference on Computer Vision, Springer (2016) 630–645
37. Vincent, P., Larochelle, H., Lajoie, I., Bengio, Y., Manzagol, P.A.: Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. The Journal of Machine Learning Research 11