DeepGUM: Learning Deep Robust Regression with a Gaussian-Uniform Mixture Model
Stéphane Lathuilière, Pablo Mesejo, Xavier Alameda-Pineda, Radu Horaud
Inria Grenoble Rhône-Alpes, Montbonnot-Saint-Martin, France; University of Granada, Granada, Spain; University of Trento, Trento, Italy
[email protected]
Abstract.
In this paper we address the problem of how to robustly train a ConvNet for regression, or deep robust regression. Traditionally, deep regression employs the L2 loss function, known to be sensitive to outliers, i.e. samples that either lie at an abnormal distance away from the majority of the training samples, or that correspond to wrongly annotated targets. This means that, during back-propagation, outliers may bias the training process due to the high magnitude of their gradient. In this paper, we propose DeepGUM: a deep regression model that is robust to outliers thanks to the use of a Gaussian-uniform mixture model. We derive an optimization algorithm that alternates between the unsupervised detection of outliers using expectation-maximization, and supervised training with cleaned samples using stochastic gradient descent. DeepGUM is able to adapt to a continuously evolving outlier distribution, avoiding the need to manually impose any threshold on the proportion of outliers in the training set. Extensive experimental evaluations on four different tasks (facial and fashion landmark detection, age and head pose estimation) lead us to conclude that our novel robust technique provides reliability in the presence of various types of noise and protection against a high percentage of outliers.

Keywords:
Robust regression · Deep Neural Networks · Mixture model · Outlier detection
1 Introduction

For the last decade, deep learning architectures have undoubtedly established the state of the art in computer vision tasks such as image classification [18,38] or object detection [15,33]. These architectures, e.g. ConvNets, consist of several convolutional layers, followed by a few fully connected layers and by a classification softmax layer with, for instance, a cross-entropy loss. ConvNets have also been used for regression, i.e. to predict continuous, as opposed to categorical, output values. Classical regression-based computer vision methods have addressed human pose estimation [39], age estimation [30], head-pose estimation [9], or facial landmark detection [37], to cite a few.
Fig. 1: A Gaussian-uniform mixture model is combined with a ConvNet architecture to downgrade the influence of wrongly annotated targets (outliers) on the learning process.

Whenever ConvNets are used for learning a regression network, the softmax layer is replaced with a fully connected layer, with linear or sigmoid activations, and the L2 loss is often used to measure the discrepancy between prediction and target variables. It is well known that the L2 loss is strongly sensitive to outliers, potentially leading to poor generalization performance [17]. While robust regression is extremely well investigated in statistics, there has only been a handful of methods that combine robust regression with deep architectures.

This paper proposes to mitigate the influence of outliers when deep neural architectures are used to learn a regression function, ConvNets in particular. More precisely, we investigate a methodology specifically designed to cope with two types of outliers that are often encountered: (i) samples that lie at an abnormal distance away from the other training samples, and (ii) wrongly annotated training samples. On the one hand, abnormal samples are present in almost any measurement system and they are known to bias the regression parameters. On the other hand, deep learning requires very large amounts of data, and the annotation process, be it either automatic or manual, is inherently prone to errors. These unavoidable issues fully justify the development of robust deep regression.

The proposed method combines the representation power of ConvNets with a principled probabilistic mixture framework for outlier detection and rejection, see Figure 1. We propose to use a Gaussian-uniform mixture (GUM) as the last layer of a ConvNet, and we refer to this combination as DeepGUM. The mixture model hypothesizes a Gaussian distribution for inliers and a uniform distribution for outliers.
We interleave an EM procedure within stochastic gradient descent (SGD) to downgrade the influence of outliers in order to robustly estimate the network parameters. We empirically validate the effectiveness of the proposed method with four computer vision problems and associated datasets: facial and fashion landmark detection, age estimation, and head pose estimation. The standard regression measures are accompanied by statistical tests that discern between random differences and systematic improvements.

The remainder of the paper is organized as follows. Section 2 describes the related work. Section 3 describes in detail the proposed method and the associated algorithm. Section 4 describes extensive experiments with several applications and associated datasets. Section 5 draws conclusions and discusses the potential of robust deep regression in computer vision.

2 Related Work

Robust regression has long been studied in statistics [17,24,31] and in computer vision [6,25,36]. Robust regression methods have a high breakdown point, which is the smallest amount of outlier contamination that an estimator can handle before yielding poor results. Prominent examples are the least trimmed squares, the Theil-Sen estimator or heavy-tailed distributions [14]. Several robust training strategies for artificial neural networks are also available [5,27].

M-estimators, sampling methods, trimming methods and robust clustering are among the most used robust statistical methods. M-estimators [17] minimize the sum of a positive-definite function of the residuals and attempt to reduce the influence of large residual values. The minimization is carried out with weighted least squares techniques, with no proof of convergence for most M-estimators.
Sampling methods [25], such as least-median-of-squares or random sample consensus (RANSAC), estimate the model parameters by solving a system of equations defined for a randomly chosen data subset. The main drawback of sampling methods is that they require complex data-sampling procedures and it is tedious to use them for estimating a large number of parameters. Trimming methods [31] rank the residuals and down-weight the data points associated with large residuals. They are typically cast into a (non-linear) weighted least squares optimization problem, where the weights are modified at each iteration, leading to iteratively re-weighted least squares problems. Robust statistics have also been addressed in the framework of mixture models, and a number of robust mixture models were proposed, such as Gaussian mixtures with a uniform noise component [2,8], heavy-tailed distributions [11], trimmed likelihood estimators [12,28], or weighted-data mixtures [13]. Importantly, it has recently been reported that modeling outliers with a uniform component yields very good performance [8,13].

Deep robust classification was recently addressed, e.g. in [3], which assumes that observed labels are generated from true labels with unknown noise parameters: a probabilistic model that maps true labels onto observed labels is proposed and an EM algorithm is derived. In [41], a probabilistic model is proposed that exploits the relationships between classes, images and noisy labels for large-scale image classification. This framework requires a dataset with explicit clean- and noisy-label annotations as well as an additional dataset annotated with a noise type for each sample, thus making the method difficult to use in practice. A classification algorithm based on a distillation process to learn from noisy data was also recently proposed [21].

Recently, deep regression methods were proposed, e.g. [26,29,37,39,19].
Despite the vast robust statistics literature and the importance of regression in computer vision, to the best of our knowledge there has been only one attempt to combine robust regression with deep networks [4], where robustness is achieved by minimizing Tukey's biweight loss function, i.e. an M-estimator. In this paper we take a radically different approach and propose to use robust mixture modeling within a ConvNet. We conjecture that while inlier noise follows a Gaussian distribution, outlier errors are uniformly distributed over the volume occupied by the data. Mixture modeling provides a principled way to characterize data points individually, based on posterior probabilities. We
propose an algorithm that interleaves a robust mixture model with network training, i.e. alternates between EM and SGD. EM evaluates data-posterior probabilities, which are then used to weight the residuals in the network loss function and hence to downgrade the influence of samples drawn from the uniform distribution. Then, the network parameters are updated, which in turn are used by EM. A prominent feature of the algorithm is that it requires neither annotated outlier samples nor prior information about their percentage in the data. This is in contrast with [41], which requires explicit inlier/outlier annotations, and with [4], which uses a fixed hyperparameter c that allows samples with high residuals to be excluded from SGD.

3 Deep Regression with a Gaussian-Uniform Mixture

We assume that the inlier noise follows a Gaussian distribution while the outlier error follows a uniform distribution. Let x ∈ R^M and y ∈ R^D be the input image and the output vector with dimensions M and D, respectively, with D ≪ M. Let φ denote a ConvNet with parameters w such that y = φ(x; w). We aim to train a model that detects outliers and downgrades their role in the prediction of a network output, while there is no prior information about the percentage and spread of outliers. The probability of y conditioned by x follows a Gaussian-uniform mixture model (GUM):

p(y | x; θ, w) = π N(y; φ(x; w), Σ) + (1 − π) U(y; γ),    (1)

where π is the prior probability of an inlier sample, γ is the normalization parameter of the uniform distribution, and Σ ∈ R^{D×D} is the covariance matrix of the multivariate Gaussian distribution. Let θ = {π, γ, Σ} be the parameter set of GUM. At training time we estimate the parameters of the mixture model, θ, and of the network, w. An EM algorithm is used to estimate the former together with the responsibilities r_n, which are plugged into the network's loss, minimized using SGD so as to estimate the latter.
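As a concrete sketch of eq. (1), the GUM density can be evaluated as follows. This is an illustrative scalar version (the paper uses a D-dimensional Gaussian with covariance Σ), and all parameter values below are made up:

```python
import numpy as np

def gum_density(y, mu, sigma2, pi, gamma):
    """Scalar Gaussian-uniform mixture density, eq. (1):
    pi * N(y; mu, sigma2) + (1 - pi) * gamma, where mu = phi(x; w)
    is the network prediction and gamma is the constant density of
    the uniform outlier component."""
    gauss = np.exp(-0.5 * (y - mu) ** 2 / sigma2) / np.sqrt(2 * np.pi * sigma2)
    return pi * gauss + (1 - pi) * gamma

# A small residual is dominated by the Gaussian (inlier) term,
# a large one by the uniform (outlier) term:
p_in = gum_density(0.1, mu=0.0, sigma2=1.0, pi=0.9, gamma=0.01)
p_out = gum_density(8.0, mu=0.0, sigma2=1.0, pi=0.9, gamma=0.01)
```

For the large residual, the Gaussian factor underflows and the density is essentially the uniform floor (1 − π)γ, which is what prevents outliers from dominating the likelihood.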
3.1 Unsupervised Outlier Detection

Let a training dataset consist of N image-vector pairs {x_n, y_n}_{n=1}^N. At each iteration, EM alternates between evaluating the expected complete-data log-likelihood (E-step) and updating the parameter set θ conditioned by the network parameters (M-step). In practice, the E-step evaluates the posterior probability (responsibility) of an image-vector pair n to be an inlier:

r_n(θ^(i)) = π^(i) N(y_n; φ(x_n, w^(c)), Σ^(i)) / [ π^(i) N(y_n; φ(x_n, w^(c)), Σ^(i)) + (1 − π^(i)) γ^(i) ],    (2)

where (i) denotes the EM iteration index and w^(c) denotes the currently estimated network parameters. The posterior probability of the n-th data pair to be an outlier is 1 − r_n(θ^(i)). The M-step updates the mixture parameters θ with:

Σ^(i+1) = Σ_{n=1}^N r_n(θ^(i)) δ_n^(i) δ_n^(i)⊤ / Σ_{n=1}^N r_n(θ^(i)),    (3)

π^(i+1) = Σ_{n=1}^N r_n(θ^(i)) / N,    (4)

γ^(i+1) = Π_{d=1}^D 1 / ( 2 √( 3 ( C_{2d}^(i+1) − ( C_{1d}^(i+1) )² ) ) ),    (5)

where δ_n^(i) = y_n − φ(x_n; w^(c)), and C_1 and C_2 are the first- and second-order centered data moments, computed using (δ_{nd}^(i) denotes the d-th entry of δ_n^(i)):

C_{1d}^(i+1) = (1/N) Σ_{n=1}^N [ (1 − r_n(θ^(i))) / (1 − π^(i+1)) ] δ_{nd}^(i),
C_{2d}^(i+1) = (1/N) Σ_{n=1}^N [ (1 − r_n(θ^(i))) / (1 − π^(i+1)) ] ( δ_{nd}^(i) )².    (6)

The iterative estimation of γ as just proposed has an advantage over using a constant value based on the volume of the data, as done in robust mixture models [8]. Indeed, γ is updated using the actual volume occupied by the outliers, which increases the ability of the algorithm to discriminate between inliers and outliers.

Another prominent advantage of DeepGUM for robustly predicting multidimensional outputs is its flexibility in handling the granularity of outliers. Consider, for example, the problem of locating landmarks in an image.
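Stepping back to the EM updates for a moment, eqs. (2)–(6) can be sketched on scalar residuals as follows. This is a hedged 1-D illustration: the paper operates on multivariate residuals with a full covariance matrix, and all parameter values here are arbitrary:

```python
import numpy as np

def em_step(delta, pi, sigma2, gamma):
    """One EM iteration of the scalar GUM on residuals `delta`.
    Returns responsibilities and updated mixture parameters."""
    # E-step (eq. 2): responsibility of each sample being an inlier.
    gauss = np.exp(-0.5 * delta ** 2 / sigma2) / np.sqrt(2 * np.pi * sigma2)
    r = pi * gauss / (pi * gauss + (1 - pi) * gamma)
    # M-step (eqs. 3-4): responsibility-weighted variance and inlier prior.
    sigma2_new = np.sum(r * delta ** 2) / np.sum(r)
    pi_new = np.mean(r)
    # M-step (eq. 6): centered moments of the outlier component ...
    w_out = (1.0 - r) / (1.0 - pi_new)
    c1 = np.mean(w_out * delta)
    c2 = np.mean(w_out * delta ** 2)
    # ... which give the uniform density over the outlier range (eq. 5).
    gamma_new = 1.0 / (2.0 * np.sqrt(3.0 * (c2 - c1 ** 2)))
    return r, pi_new, sigma2_new, gamma_new

# Toy residuals: 90 near-zero inliers plus 10 large uniform outliers.
rng = np.random.default_rng(0)
delta = np.concatenate([0.1 * rng.standard_normal(90),
                        rng.uniform(-10.0, 10.0, 10)])
r, pi_new, sigma2_new, gamma_new = em_step(delta, pi=0.9, sigma2=1.0, gamma=0.05)
```

On this toy data the inliers receive responsibilities close to 1 while the large residuals are pushed toward the uniform component, exactly the behavior eqs. (2)–(6) are designed to produce.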
One may want to devise a method that disregards outlying landmarks and not the whole image. In this case, one may use a GUM model for each landmark category. In the case of two-dimensional landmarks, this induces D/2 covariance matrices of size 2×2 (D is the dimensionality of the target space). Similarly, one may use a coordinate-wise outlier model, namely D scalar variances. Finally, one may use an image-wise outlier model, i.e. the model detailed above. This flexibility is an attractive property of the proposed model as opposed to [4], which uses a coordinate-wise outlier model.

3.2 Deep Regression Learning

As already mentioned, we use SGD to estimate the network parameters w. Given the updated GUM parameters estimated with EM, θ^(c), the regression loss function is weighted with the responsibility of each data pair:

L_DEEPGUM = Σ_{n=1}^N r_n(θ^(c)) ||y_n − φ(x_n; w)||²₂.    (7)

With this formulation, the contribution of a training pair to the loss gradient vanishes (i) if the sample is an inlier with small error (||δ_n|| → 0, r_n → 1) or (ii) if the sample is
an outlier (r_n → 0). In both cases, the network will not back-propagate any error. Consequently, the parameters w are updated only with inliers. This is graphically shown in Figure 2, where we plot the loss gradient as a function of a one-dimensional residual δ, for DeepGUM, Biweight, Huber and L2.

Fig. 2: Loss gradients for Biweight (black), Huber (cyan), L2 (magenta), and DeepGUM (remaining colors). Huber and L2 overlap up to δ = c (the plots are truncated along the vertical coordinate). DeepGUM is shown for different values of π and γ, although in practice they are estimated via EM. The gradients of DeepGUM and Biweight vanish for large residuals. DeepGUM offers some flexibility over Biweight thanks to π and γ.

For fair comparison with Biweight and Huber, the plots correspond to a unit variance (i.e. standard normal, see the discussion following eq. (3) in [4]). We plot the DeepGUM loss gradient for different values of π and γ to discuss different situations, although in practice all the parameters are estimated with EM. We observe that the gradient of the Huber loss increases linearly with δ, until reaching a stable point (corresponding to the constant c in [4]). Conversely, the gradient of both DeepGUM and Biweight vanishes for large residuals (i.e. δ > c). Importantly, DeepGUM offers some flexibility as compared to Biweight. Indeed, we observe that when the amount of inliers increases (large π) or the spread of outliers increases (small γ), the importance given to inliers is higher, which is a desirable property. The opposite effect takes place for lower amounts of inliers and/or reduced outlier spread.

In order to train the proposed model, we assume the existence of a training set and a validation set, denoted T = {x_n^T, y_n^T}_{n=1}^{N_T} and V = {x_n^V, y_n^V}_{n=1}^{N_V}, respectively. The training alternates between the unsupervised EM algorithm of Section 3.1 and the supervised SGD algorithm of Section 3.2, i.e. Algorithm 1.
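This EM/SGD alternation can be condensed into a runnable toy: a 1-D linear model y ≈ a·x stands in for the ConvNet φ, gradient steps on the weighted loss (7) stand in for SGD epochs, and the EM pass re-estimates the responsibilities. Everything here (the model, learning rate, fixed γ, and absence of a validation split) is an illustrative simplification, not the paper's implementation:

```python
import numpy as np

def train_deepgum_toy(x, y, epochs=200, lr=0.1):
    """Toy alternation of gradient steps on the weighted loss (7)
    and EM on the residuals. Returns the fitted slope and the final
    responsibilities."""
    a = 0.0                              # "network" parameter (slope)
    r = np.ones_like(y)                  # init: every sample an inlier
    pi, sigma2, gamma = 0.9, 1.0, 0.05   # gamma kept fixed for simplicity
    for _ in range(epochs):
        # SGD step on sum_n r_n (y_n - a x_n)^2; gradient is -2 r delta x,
        # so each sample's step is effectively scaled by r_n.
        a += lr * 2.0 * np.mean(r * (y - a * x) * x)
        # EM pass on the current residuals (scalar versions of eqs. 2-4).
        d = y - a * x
        g = np.exp(-0.5 * d ** 2 / sigma2) / np.sqrt(2 * np.pi * sigma2)
        r = pi * g / (pi * g + (1 - pi) * gamma)
        sigma2 = max(float(np.sum(r * d ** 2) / np.sum(r)), 1e-6)
        pi = float(np.clip(np.mean(r), 1e-3, 1 - 1e-3))
    return a, r

# Clean trend y = 2x plus noise; every tenth annotation corrupted by +5.
rng = np.random.default_rng(1)
x = np.linspace(0.5, 1.5, 100)
y = 2.0 * x + 0.05 * rng.standard_normal(100)
y[::10] += 5.0
a_hat, r_hat = train_deepgum_toy(x, y)
```

The slope recovers the clean trend because the corrupted samples end up with responsibilities near zero, while an unweighted least-squares fit of the same data would be biased upward by the +5 offsets.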
EM takes as input the training set, alternates between responsibility evaluation (2) and mixture parameter updates (3), (4), (5), and iterates until convergence, namely until the mixture parameters do not evolve anymore. The current mixture parameters are used to evaluate the responsibilities of the validation set. The SGD algorithm takes as input the training and validation sets as well as the associated responsibilities. In order to prevent over-fitting, we perform early stopping on the validation set with a patience of K epochs.

Algorithm 1 DeepGUM training
input: T = {x_n^T, y_n^T}_{n=1}^{N_T}, V = {x_n^V, y_n^V}_{n=1}^{N_V}, and ε > 0 (convergence threshold).
initialization: Run SGD on T to minimize (7) with r_n = 1, ∀n, until the convergence criterion on V is reached.
repeat
    EM algorithm (unsupervised outlier detection):
    repeat
        Update the r_n's with (2).
        Update the mixture parameters with (3), (4), (5).
    until the parameters θ are stable.
    SGD (deep regression learning):
    repeat
        Run SGD to minimize L_DEEPGUM in (7).
    until early stop with a patience of K epochs.
until L_DEEPGUM grows on V.

Notice that the training procedure requires neither a specific annotation of outliers nor the ratio of outliers present in the data. The procedure is initialized by executing SGD, as just described, with all the samples assumed to be inliers, i.e. r_n = 1, ∀n. Algorithm 1 is stopped when L_DEEPGUM does not decrease anymore. It is important to notice that we do not need to constrain the model to avoid the trivial solution, namely that all the samples are considered as outliers. This is because after the first SGD execution, the network can discriminate between the two categories. In the extreme case where DeepGUM would consider all the samples as outliers, the algorithm would stop after the first SGD run and would output the initial model.

Since EM provides the data covariance matrix Σ, it may be tempting to use the Mahalanobis norm instead of the L2 norm in (7). The covariance matrix is narrow along output dimensions with low-amplitude noise and wide along dimensions with high-amplitude noise. The Mahalanobis distance would give equal importance to low- and high-amplitude noise dimensions, which is not desired. Another interesting feature of the proposed algorithm is that the posterior r_n weights the learning rate of sample n, as its gradient is simply multiplied by r_n. Therefore, the proposed algorithm automatically selects a learning rate for each individual training sample.

4 Experiments

The purpose of the experimental validation is two-fold. First, we empirically validate DeepGUM with three datasets that are naturally corrupted with outliers. The validations are carried out with the following applications: fashion landmark detection (Section 4.1), age estimation (Section 4.2) and head pose estimation (Section 4.3).
Second, we delve into the robustness of DeepGUM and analyze its behavior in comparison with
existing robust deep regression techniques by corrupting the annotations with an increasing percentage of outliers on the facial landmark detection task (Section 4.4).

We systematically compare DeepGUM with the standard L2 loss, the Huber loss and the Biweight loss (used in [4]). In all these cases, we use the VGG-16 architecture [35] pre-trained on ImageNet [32]. We also tried to use the architecture proposed in [4], but we were unable to reproduce the results reported in [4] on the LSP and Parse datasets using the code provided by the authors. Therefore, for the sake of reproducibility and for a fair comparison between different robust loss functions, we used VGG-16 in all our experiments. Following the recommendations from [20], we fine-tune the last convolutional block and both fully connected layers with a mini-batch of size 128 and a fixed learning rate. The fine-tuning starts with 3 epochs of L2 loss, before exploiting either the Biweight, Huber or DeepGUM loss. When using any of these three losses, the network output is normalized with the median absolute deviation (as in [4]), computed on the entire dataset after each epoch. Early stopping with a patience of K = 5 epochs is employed and the data is augmented using mirroring.

In order to evaluate the methods, we report the mean absolute error (MAE) between the regression target and the network output over the test set. Inspired by [20], we complete the evaluation with statistical tests that allow us to point out when the differences between methods are systematic and statistically significant or due to chance. Statistical tests are run on per-image regression errors and can therefore only be applied to the methods for which the code is available, and not to average errors reported in the literature; in the latter case, only MAEs are made available.
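The per-image paired testing described here can be reproduced with SciPy's implementation of the Wilcoxon signed-rank test. The error arrays below are synthetic stand-ins, not results from the paper:

```python
import numpy as np
from scipy.stats import wilcoxon

def stars(p):
    """Map a p-value to the significance marks used in the tables."""
    return '***' if p < 0.001 else '**' if p < 0.01 else '*' if p < 0.05 else ''

# Synthetic per-image errors: method B is systematically ~0.3 worse.
rng = np.random.default_rng(0)
err_a = rng.gamma(2.0, 1.0, size=200)
err_b = err_a + 0.3 + 0.1 * rng.standard_normal(200)
res = wilcoxon(err_a, err_b)
p = res.pvalue
```

Because the test is paired and non-parametric, a small but systematic per-image difference yields a tiny p-value even when the two error distributions overlap heavily, which is exactly why it complements raw MAE comparisons.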
In practice, we use the non-parametric Wilcoxon signed-rank test [40] to assess whether the null hypothesis (the median difference between pairs of observations is zero) is true or false. We denote the statistical significance with ∗, ∗∗ or ∗∗∗, corresponding to a p-value (the conditional probability of, given that the null hypothesis is true, getting a test statistic as extreme as or more extreme than the calculated test statistic) smaller than p = 0.05, p = 0.01 or p = 0.001, respectively. We only report the statistical significance of the methods with the lowest MAE. For instance, A∗∗∗ means that the probability that method A is equivalent to any other method is less than p = 0.001.

4.1 Fashion Landmark Detection

Visual fashion analysis presents a wide spectrum of applications such as cloth recognition, retrieval, and recommendation. We employ the fashion landmark dataset (FLD) [22], in which each image is labeled with up to eight landmarks. The dataset is equally divided in three subsets: upper-body clothes (6 landmarks), full-body clothes (8 landmarks) and lower-body clothes (4 landmarks). We randomly split each subset of the dataset into test, validation and training sets. Two metrics are used: the mean absolute error (MAE) of the landmark localization and the percentage of failures (landmarks detected further from the ground truth than a given threshold). We employ landmark-wise r_n.

Table 1 reports the results obtained on the upper-body subset of the fashion landmark dataset (additional results on full-body and lower-body subsets are included in the

Table 1: Mean absolute error on the upper-body subset of FLD, per landmark and in average. The landmarks are left (L) and right (R) collar (C), sleeve (S) and hem (H). The results of DFA are from [23] and therefore do not take part in the statistical comparison.
Method | LC | RC | LS | RS | LH | RH | Avg.
DFA [23] (L2) | 15.90 | 15.90 | 30.02 | 29.12 | 23.07 | 22.85 | 22.85
DFA [23] (5 VGG) | 10.75 | 10.75 | 20.38 | 19.93 | 15.90 | 16.12 | 15.23

supplementary material). We report the mean average error (in pixels) for each landmark individually, and the overall average (last column). While for the first subset we can compare with the very recent results reported in [23], for the others there are no previously reported results. Generally speaking, we outperform all other baselines on average, but also on each of the individual landmarks. The only exception is the comparison against the method utilizing five VGG pipelines to estimate the position of the landmarks. Although this method reports slightly better performance than DeepGUM for some columns of Table 1, we recall that we are using one single VGG as front-end, and therefore the representation power cannot be the same as the one associated with a pipeline employing five VGGs trained for tasks such as pose estimation and cloth classification that clearly aid the fashion landmark estimation task.

Interestingly, DeepGUM yields better results than L2 regression and a major improvement over Biweight [4] and Huber [16]. This behavior is systematic for all fashion landmarks and statistically significant (with p < 0.001). In order to better understand this behavior, we computed the percentage of outliers detected by DeepGUM and Biweight after convergence. We believe that within this difference there are mostly "difficult" inliers, from which the network could learn a lot (and does so in DeepGUM) if they were not discarded, as happens with Biweight. This illustrates the importance of rejecting the outliers while keeping the inliers in the learning loop, and exhibits the robustness of DeepGUM in doing so. Figure 3 displays a few landmarks estimated by DeepGUM.
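The failure-rate metric used above (and again in Section 4.4) can be computed as follows; the array shapes and the threshold value are illustrative assumptions:

```python
import numpy as np

def failure_rate(pred, gt, threshold):
    """Fraction of landmarks whose Euclidean error exceeds `threshold`.
    `pred` and `gt` have shape (N, L, 2): N images, L 2-D landmarks."""
    err = np.linalg.norm(pred - gt, axis=-1)
    return float(np.mean(err > threshold))

gt = np.zeros((2, 2, 2))
pred = gt.copy()
pred[0, 0] = [3.0, 4.0]   # one landmark off by 5 pixels
rate = failure_rate(pred, gt, threshold=4.0)  # 1 of 4 landmarks fails
```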
4.2 Age Estimation

Age estimation from a single face image is an important task in computer vision with applications in access control and human-computer interaction. This task is closely related to the prediction of other biometric and facial attributes, such as gender, ethnicity, and hair color. We use the cross-age celebrity dataset (CACD) [7], whose images are collected from search engines
Fig. 3: Sample fashion landmarks detected by DeepGUM.
Fig. 4: Results on the CACD dataset: (left) mean absolute error and (right) images considered as outliers by DeepGUM; the annotation is displayed below each image.

using the celebrity's name and desired year (from 2004 to 2013). The dataset is split into three parts: distinct sets of celebrities are used for training, validation and testing. The validation and test sets are manually cleaned whereas the training set is noisy. In our experiments, we report results using image-wise r_n.

Apart from DeepGUM, L2, Biweight and Huber, we also compare to the age estimation method based on deep expectation (Dex) [30], which was the winner of the Looking at People 2015 challenge. This method uses the VGG-16 architecture and poses the age estimation problem as a classification problem followed by a softmax expected-value refinement. Regression-by-classification strategies have also been proposed for memorability and virality [34,1]. We report results with two different approaches using Dex. First, our implementation of the original Dex model. Second, we add the GUM model on top of the Dex architecture; we term this architecture DexGUM.

The table in Figure 4 reports the results obtained on the CACD test set for age estimation. We report the mean absolute error (in years) for six different methods. We can easily observe that DeepGUM exhibits the best results, with the lowest MAE. Importantly, the architectures using GUM (DeepGUM followed by DexGUM) are the ones offering the best performance. This claim is supported by the results of the statistical tests, which say that DexGUM and DeepGUM are statistically better than the rest (with p < 0.001), and that there are no statistical differences between them. This is further supported by the histogram of the error included in the supplementary material. DeepGUM considered a fraction of the images to be outliers, and these images were undervalued during training.
The images in Figure 4 correspond to outliers detected by DeepGUM during training, and illustrate the ability of DeepGUM to detect outliers. Since the dataset was automatically annotated, it is prone to corrupted annotations. Indeed, the age of each celebrity is automatically annotated by subtracting the date of birth from the picture time-stamp. Intuitively, this procedure is problematic since it assumes that the automatically collected and annotated images show the right celebrity and that the time-stamp and date of birth are correct. Our experimental evaluation clearly demonstrates the benefit of a robust regression technique to operate on datasets populated with outliers.

4.3 Head Pose Estimation

The McGill real-world face video dataset [9] consists of 60 videos (a single participant per video, 31 women and 29 men) recorded with the goal of studying unconstrained face classification. The videos were recorded in both indoor and outdoor environments under different illumination conditions, and participants move freely. Consequently, some frames suffer from important occlusions. The yaw angle (ranging from −90° to 90°) is annotated using a two-step labeling procedure that, first, automatically provides the most probable angle as well as a degree of confidence, and then the final label is chosen by a human annotator among the plausible angle values. Since the resulting annotations are not perfect, this dataset is suitable to benchmark robust regression models. As the training and test sets are not separated in the original dataset, we perform a 7-fold cross-validation. We report the fold-wise MAE average and standard deviation as well as the statistical significance corresponding to the concatenation of the test results of the 7 folds.
Importantly, only a subset of the dataset is publicly available (35 videos out of 60).

In Table 2, we report the results obtained with different methods and employ a dagger to indicate when a particular method uses the entire dataset (60 videos) for training. We can easily notice that DeepGUM exhibits the best results, with a lower MAE than L2, Huber and Biweight. The last three approaches, all using deep architectures, significantly outperform the current state-of-the-art approach [10]. Among them, DeepGUM is significantly better than the rest, with p < 0.001.

4.4 Facial Landmark Detection

We perform experiments on the LFW and NET facial landmark detection datasets [37], which consist of 5590 and 7876 face images, respectively. We combined both datasets and employed the same data partition as in [37]. Each face is labeled with the positions of five key-points in Cartesian coordinates, namely left and right eye, nose, and left and right corners of the mouth. The detection error is measured by the Euclidean distance between the estimated and the ground-truth position of the landmark, divided by the width of the face image, as in [37]. The performance is measured by the failure rate of each landmark, where errors larger than a given threshold are counted as failures.

Table 2: Mean average error on the McGill dataset. The results in the first half of the table are directly taken from the respective papers and therefore no statistical comparison is possible. † Uses extra training data.
Method                      MAE    RMSE
Xiong et al. [42]†           -
Zhu and Ramanan [43]†        -
Demirkus et al. [9]†         -
Drouard et al. [10]
L2
Huber [16]
Biweight [4]
DeepGUM***
[The numeric MAE and RMSE entries of Table 2 are not legible in this copy.]

The two aforementioned datasets can be considered outlier-free, since the average failure rate reported in the literature is very low. Therefore, we artificially modify the annotations of the facial landmark detection datasets in order to find the breakdown point of DeepGUM. Our purpose is to study the robustness of the proposed deep mixture model to outliers generated in controlled conditions. We use three different types of outliers:
– Normally Generated Outliers (
NGO): a percentage of landmarks is selected, regardless of whether they belong to the same image or not, and shifted by a distance of d pixels in a uniformly chosen random direction. The distance d follows a Gaussian distribution centered at 25 pixels. NGO simulates errors produced by human annotators who click in a slightly wrong location.
– Local Uniformly Generated Outliers (l-UGO): it follows the same philosophy as
NGO, but samples the distance d from a uniform distribution over the image instead of from a Gaussian. Such errors simulate annotation mistakes that are unrelated to human precision, such as skipping a point or misunderstanding the image.
– Global Uniformly Generated Outliers (g-UGO): as in the previous case, the landmarks are corrupted with uniform noise. However, in g-UGO the landmarks to be corrupted are grouped by image. In other words, we do not corrupt a subset of all landmarks regardless of the image they belong to, but rather corrupt all landmarks of a subset of the images. This strategy simulates problems with the annotation files, or with the sensors in the case of automatic annotation.

The first and second types of outlier contamination employ landmark-wise r_n, while the third uses image-wise r_n. The plots in Figure 5 report the failure rate of DeepGUM, Biweight, Huber and L2 on the clean test set (top), and the outlier-detection precision and recall of all methods except L2 on the corrupted training set (bottom), for the three types of synthetic noise. The precision corresponds to the percentage of training samples classified as outliers that are true outliers; the recall corresponds to the percentage of outliers that are classified as such. The first conclusion that can be drawn directly from this figure is that Biweight and Huber systematically present a lower recall than DeepGUM.
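The three corruption strategies described above (NGO, l-UGO, g-UGO) can be sketched as follows. The image size, the Gaussian spread of NGO, and the array layout are illustrative assumptions on our part; only the 25-pixel mean shift is stated in the text.

```python
import numpy as np

def corrupt_landmarks(landmarks, ratio, mode, image_size=256, rng=None):
    """Corrupt a (num_images, num_landmarks, 2) array of annotations.

    mode='ngo'  : shift selected landmarks, chosen regardless of their
                  image, by a distance d ~ N(25, sigma) in a uniformly
                  random direction (sigma=2 is an assumed value).
    mode='l-ugo': replace selected landmarks, chosen regardless of their
                  image, by positions sampled uniformly over the image.
    mode='g-ugo': replace all landmarks of a selected subset of images.
    """
    rng = rng or np.random.default_rng()
    out = landmarks.astype(float).copy()
    n_img, n_lm, _ = out.shape
    if mode == "g-ugo":                       # image-wise corruption
        idx = rng.random(n_img) < ratio
        out[idx] = rng.uniform(0, image_size, size=(int(idx.sum()), n_lm, 2))
    else:                                     # landmark-wise corruption
        mask = rng.random((n_img, n_lm)) < ratio
        k = int(mask.sum())
        if mode == "l-ugo":
            out[mask] = rng.uniform(0, image_size, size=(k, 2))
        else:                                 # 'ngo'
            d = rng.normal(25.0, 2.0, size=k)             # sigma=2 assumed
            theta = rng.uniform(0.0, 2 * np.pi, size=k)   # random direction
            out[mask] = out[mask] + np.stack(
                [d * np.cos(theta), d * np.sin(theta)], axis=1)
    return out

# e.g. 8 faces with 5 landmarks each, 30% landmark-wise corruption
landmarks = np.full((8, 5, 2), 100.0)
noisy = corrupt_landmarks(landmarks, 0.3, "l-ugo")
```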
(a) l-UGO    (b) g-UGO    (c) NGO
Fig. 5: Evolution of the failure rate (top) when increasing the noise level, for the 3 types of outliers considered. We also display the corresponding precision and recall in percentage (bottom) for the outlier class. Best seen in color.

In other words, DeepGUM exhibits the highest reliability at identifying, and therefore ignoring, outliers during training. On the other hand, DeepGUM tends to present a lower failure rate than Biweight, Huber and L2 in most of the scenarios contemplated. Regarding the four left-most plots, l-UGO and g-UGO, we can clearly observe that, while all methods report comparable performance for limited amounts of outliers, DeepGUM is clearly superior to L2, Biweight and Huber for larger amounts of outliers. We can also safely identify a breakdown point of DeepGUM on l-UGO. This is in line with the reported precision and recall for the outlier-detection task: while both measures decrease for Biweight and Huber when the number of outliers increases, they remain roughly constant for DeepGUM up to the breakdown point on l-UGO. The fact that the breakdown point of DeepGUM under g-UGO is higher is due to the a priori model of the outliers (i.e. a uniform distribution) corresponding to the way the data is corrupted.

For NGO, the corrupted annotation is always around the ground truth, leading to a failure rate smaller than 7% for all methods. All four methods exhibit comparable performance up to a moderate proportion of outliers. Beyond that threshold, Biweight outperforms the other methods, in spite of presenting a progressively lower recall with high precision (i.e. Biweight identifies very few outliers, but the ones identified are true outliers). This behavior is also exhibited by Huber. Regarding DeepGUM, we observe that in this particular setting the results are aligned with L2.
This is because the SGD procedure is not able to find a better optimum after the first epoch, so the early-stopping mechanism is triggered and SGD outputs the initial network, which corresponds to L2. We can conclude that the strategy of DeepGUM, consisting in removing all points detected as outliers, is not effective in this particular experiment. In other words, having more noisy data is better than having only a few clean data in this particular case of zero-mean, highly correlated noise. Nevertheless, an attractive property of DeepGUM is that it automatically identifies these particular cases and returns an acceptable solution.

This paper introduced a deep robust regression learning method that uses a Gaussian-uniform mixture model. The novelty of the paper resides in combining a probabilistic robust mixture model with deep learning in a jointly trainable fashion. In this context, previous studies only dealt with the classical L2 loss function or Tukey's Biweight function, an M-estimator robust to outliers [4]. Our proposal yields better performance than previous deep regression approaches thanks to a novel technique, and the derived optimization procedure, that alternates between the unsupervised task of outlier detection and the supervised task of learning the network parameters. The experimental validation addresses four different tasks: facial and fashion landmark detection, age estimation, and head pose estimation. We have empirically shown that DeepGUM (i) is a robust deep regression approach that does not need to rigidly specify a priori the distribution (number and spread) of outliers, (ii) exhibits a higher breakdown point than existing methods when the outliers are sampled from a uniform distribution (being able to deal with a large proportion of outlier contamination without providing incorrect results), and (iii) is capable of providing comparable or better results than current state-of-the-art approaches on the four aforementioned tasks.
Finally, DeepGUM could easily be used to remove undesired samples arising from tedious manual annotation. It could also deal with the highly unusual training samples inherently present in automatically collected huge datasets, a problem that is currently addressed using error-prone and time-consuming human supervision.

Acknowledgments. This work was supported by the European Research Council via the ERC Advanced Grant VHIA (Vision and Hearing in Action).
References
1. Alameda-Pineda, X., Pilzer, A., Xu, D., Sebe, N., Ricci, E.: Viraliency: Pooling local virality. In: CVPR (2017)
2. Banfield, J.D., Raftery, A.E.: Model-based Gaussian and non-Gaussian clustering. Biometrics pp. 803–821 (1993)
3. Bekker, A.J., Goldberger, J.: Training deep neural-networks based on unreliable labels. In: ICASSP. pp. 2682–2686 (2016)
4. Belagiannis, V., Rupprecht, C., Carneiro, G., Navab, N.: Robust optimization for deep regression. In: ICCV (2015)
5. Beliakov, G., Kelarev, A.V., Yearwood, J.: Robust artificial neural networks and outlier detection. Technical report. CoRR abs/1110.0169 (2011)
6. Black, M.J., Rangarajan, A.: On the unification of line processes, outlier rejection, and robust statistics with applications in early vision. IJCV (1), 57–91 (1996)
7. Chen, B.C., Chen, C.S., Hsu, W.H.: Cross-age reference coding for age-invariant face recognition and retrieval. In: ECCV (2014)
8. Coretto, P., Hennig, C.: Robust improper maximum likelihood: tuning, computation, and a comparison with other methods for robust Gaussian clustering. JASA, 1648–1659 (2016)
9. Demirkus, M., Precup, D., Clark, J.J., Arbel, T.: Hierarchical temporal graphical model for head pose estimation and subsequent attribute classification in real-world videos. CVIU pp. 128–145 (2015)
10. Drouard, V., Horaud, R., Deleforge, A., Ba, S., Evangelidis, G.: Robust head-pose estimation based on partially-latent mixture of linear regressions. TIP, 1428–1440 (2017)
11. Forbes, F., Wraith, D.: A new family of multivariate heavy-tailed distributions with variable marginal amounts of tailweight: application to robust clustering. Statistics and Computing (6), 971–984 (2014)
12. Galimzianova, A., Pernus, F., Likar, B., Spiclin, Z.: Robust estimation of unbalanced mixture models on samples with outliers. TPAMI (11), 2273–2285 (2015)
13. Gebru, I.D., Alameda-Pineda, X., Forbes, F., Horaud, R.: EM algorithms for weighted-data clustering with application to audio-visual scene analysis. TPAMI (12), 2402–2415 (2016)
14. Gelman, A., Carlin, J., Stern, H., Rubin, D.: Bayesian Data Analysis. Chapman & Hall/CRC Texts in Statistical Science, Taylor & Francis (2003)
15. Girshick, R., Donahue, J., Darrell, T., Malik, J.: Rich feature hierarchies for accurate object detection and semantic segmentation. In: CVPR (2014)
16. Huber, P.J.: Robust estimation of a location parameter. The Annals of Mathematical Statistics pp. 73–101 (1964)
17. Huber, P.: Robust Statistics. Wiley (2004)
18. Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. In: NIPS (2012)
19. Lathuilière, S., Juge, R., Mesejo, P., Muñoz-Salinas, R., Horaud, R.: Deep mixture of linear inverse regressions applied to head-pose estimation. In: CVPR (2017)
20. Lathuilière, S., Mesejo, P., Alameda-Pineda, X., Horaud, R.: A comprehensive analysis of deep regression. arXiv preprint arXiv:1803.08450 (2018)
21. Li, Y., Yang, J., Song, Y., Cao, L., Luo, J., Li, J.: Learning from noisy labels with distillation. arXiv preprint arXiv:1703.02391 (2017)
22. Liu, Z., Luo, P., Qiu, S., Wang, X., Tang, X.: DeepFashion: Powering robust clothes recognition and retrieval with rich annotations. In: CVPR (2016)
23. Liu, Z., Yan, S., Luo, P., Wang, X., Tang, X.: Fashion landmark detection in the wild. In: ECCV (2016)
24. Maronna, R.A., Martin, D.R., Yohai, V.J.: Robust Statistics. John Wiley & Sons (2006)
25. Meer, P., Mintz, D., Rosenfeld, A., Kim, D.Y.: Robust regression methods for computer vision: A review. IJCV (1), 59–70 (1991)
26. Mukherjee, S., Robertson, N.: Deep head pose: Gaze-direction estimation in multimodal video. TMM (11), 2094–2107 (2015)
27. Neuneier, R., Zimmermann, H.G.: How to train neural networks. In: Neural Networks: Tricks of the Trade, pp. 373–423. Springer Berlin Heidelberg (1998)
28. Neykov, N., Filzmoser, P., Dimova, R., Neytchev, P.: Robust fitting of mixtures using the trimmed likelihood estimator. CSDA (1), 299–308 (2007)
29. Ranjan, R., Patel, V.M., Chellappa, R.: HyperFace: A deep multi-task learning framework for face detection, landmark localization, pose estimation, and gender recognition. CoRR abs/1603.01249 (2016)
30. Rothe, R., Timofte, R., Van Gool, L.: Deep expectation of real and apparent age from a single image without facial landmarks. IJCV (2016)
31. Rousseeuw, P.J., Leroy, A.M.: Robust Regression and Outlier Detection, vol. 589. John Wiley & Sons (2005)
32. Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., Berg, A.C., Fei-Fei, L.: ImageNet large scale visual recognition challenge. IJCV (3), 211–252 (2015)
33. Sermanet, P., Eigen, D., Zhang, X., Mathieu, M., Fergus, R., LeCun, Y.: OverFeat: Integrated recognition, localization and detection using convolutional networks. In: ICLR (2014)
34. Siarohin, A., Zen, G., Majtanovic, C., Alameda-Pineda, X., Ricci, E., Sebe, N.: How to make an image more memorable?: A deep style transfer approach. In: ACM International Conference on Multimedia Retrieval (2017)
35. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. CoRR abs/1409.1556 (2014)
36. Stewart, C.V.: Robust parameter estimation in computer vision. SIAM Review (3), 513–537 (1999)
37. Sun, Y., Wang, X., Tang, X.: Deep convolutional network cascade for facial point detection. In: CVPR (2013)
38. Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., Rabinovich, A.: Going deeper with convolutions. In: CVPR (2015)
39. Toshev, A., Szegedy, C.: DeepPose: Human pose estimation via deep neural networks. In: CVPR (2014)
40. Wilcoxon, F.: Individual comparisons by ranking methods. Biometrics Bulletin pp. 80–83 (1945)
41. Xiao, T., Xia, T., Yang, Y., Huang, C., Wang, X.: Learning from massive noisy labeled data for image classification. In: CVPR (2015)
42. Xiong, X., De la Torre, F.: Supervised descent method and its applications to face alignment. In: CVPR. pp. 532–539 (2013)
43. Zhu, X., Ramanan, D.: Face detection, pose estimation, and landmark localization in the wild. In: CVPR. pp. 2879–2886 (2012)

Supplementary Material
DeepGUM: Learning Deep Robust Regression with a Gaussian-Uniform Mixture Model

Stéphane Lathuilière, Pablo Mesejo, Xavier Alameda-Pineda, and Radu Horaud
Inria Grenoble Rhône-Alpes, Montbonnot-Saint-Martin, France; University of Granada, Granada, Spain; University of Trento, Trento, Italy
[email protected]
This document contains the supplementary material for the paper
DeepGUM: Learning Deep Robust Regression with a Gaussian-Uniform Mixture Model. We provide an extensive number of visual examples and extra results obtained using our proposed probabilistic robust regression approach. The different sections of this document provide additional information about the four tasks addressed in the paper.
A Fashion Landmark Detection
In Section 4.1 of the manuscript, we presented experiments on the fashion landmark detection problem. In Figures 1 to 3, we show training examples containing at least one landmark that DeepGUM considers an outlier. These landmarks correspond to three different scenarios:
– Figure 1 shows images containing (i) wrong annotations (e.g. the last two images of the last row), (ii) ill-posed cases, such as more than one clothing item per image, or (iii) challenging images (i.e. unusual clothing items, like the third and last images of the fourth row).
– Figure 2 shows images in which one or more landmarks are visually occluded.
– Figure 3 shows images containing inlier landmarks wrongly classified as outliers by DeepGUM.

Tables 1 and 2 display results on two additional subsets of the fashion landmark detection dataset (lower-body and full-body landmarks). The results for upper-body landmarks are already reported in the paper. Scores of DFA [35] are not reported, since the authors do not provide results for these two subsets. These results confirm the superiority of the proposed model. As on the first subset (see the manuscript), DeepGUM performs best compared to the other robust methods and to L2. This is again confirmed via statistical tests (except for the right hem in the full-body subset when compared to L2).

Fig. 1: Example images from the Fashion Landmark Dataset: landmarks detected as outliers by DeepGUM are shown in red, while inliers are shown in green.
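The significance claims above come from statistical tests on paired per-sample errors; this excerpt does not spell out the exact test used, so as an illustration only, the sketch below uses a paired bootstrap, a standard option for this kind of comparison (an assumption on our part, not the authors' procedure).

```python
import numpy as np

def paired_bootstrap_pvalue(err_a, err_b, n_boot=10000, rng=None):
    """One-sided paired bootstrap test: does method A have a larger
    mean error than method B on the same test samples?

    Resamples the per-sample error differences with replacement and
    returns the estimated probability that the mean difference is <= 0
    (small values indicate A is significantly worse than B).
    """
    rng = rng or np.random.default_rng(0)
    diff = np.asarray(err_a, dtype=float) - np.asarray(err_b, dtype=float)
    idx = rng.integers(0, len(diff), size=(n_boot, len(diff)))
    boot_means = diff[idx].mean(axis=1)
    return (boot_means <= 0).mean()

# Toy comparison: a method whose errors are systematically larger.
rng = np.random.default_rng(0)
err_gum = np.abs(rng.normal(0.0, 3.0, size=300))
err_l2 = err_gum + np.abs(rng.normal(2.0, 1.0, size=300))
p = paired_bootstrap_pvalue(err_l2, err_gum, n_boot=2000)
```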
Fig. 2: Example images from the Fashion Landmark Dataset: landmarks detected as outliers by DeepGUM are shown in red, while inliers are shown in green. In all these images, the detected outliers correspond to occluded landmarks.
Fig. 3: Example images from the Fashion Landmark Dataset: landmarks detected as outliers by DeepGUM are shown in red, while inliers are shown in green. The red landmarks correspond to inliers wrongly classified as outliers.
Table 1: Mean absolute error on the lower-body subset of FLD, per landmark and on average. The landmarks are left (L) and right (R) hem (H) and trouser leg (T). DFA [35] does not report results on this subset.
Method      Lower-body landmarks: LH, RH, LT, RT, Avg.
[Per-landmark values for L2 and the robust methods are not legible in this copy; *** marks statistically significant improvements.]

Table 2: Mean absolute error on the full-body subset of FLD, per landmark and on average. The landmarks are left (L) and right (R) collar (C), sleeve (S), hem (H) and trouser leg (T). DFA [35] does not report results on this subset.
Method      Full-body landmarks: LC, RC, LS, RS, LH, RH, LT, RT, Avg.
[Per-landmark values are not legible in this copy; *** marks statistically significant improvements.]

B Age Estimation
In Section 4.2, we presented experiments on the age estimation task. In Figure 4, we display the histogram of the absolute error obtained with the different methods. We can see how DeepGUM reduces the number of large errors.
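A histogram like the one in Figure 4 is obtained by binning per-sample absolute errors; the bin width and range below are arbitrary choices for illustration.

```python
import numpy as np

def error_histogram(abs_errors, bin_width=1.0, max_error=20.0):
    """Bin per-sample absolute errors (e.g. in years for age estimation).

    Returns the bin edges and per-bin counts, which can then be plotted
    for each method to compare how many large errors each one produces.
    """
    edges = np.arange(0.0, max_error + bin_width, bin_width)
    counts, edges = np.histogram(np.asarray(abs_errors, dtype=float),
                                 bins=edges)
    return edges, counts

# Toy example: four absolute errors, binned with 1-unit-wide bins.
edges, counts = error_histogram([0.5, 1.5, 1.7, 19.0])
```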
(Legend of Fig. 4: DeepGUM, DexGUM, Dex [6], Biweight [30], Huber [36], L2.)
Fig. 4: Histogram of the absolute error obtained with the different methods tested on the CACD dataset.

Figures 5 to 7 display three different groups of images, depending on the probability of being an inlier that DeepGUM assigns to each age annotation:
– Figure 5 shows randomly selected images with a high probability of being outliers according to DeepGUM (low r_n). Even if some of the results could be debatable, we argue that most of the annotations (displayed below each image) are incorrect: DeepGUM correctly performs the task for which it was designed.
– Figure 6 displays randomly selected images for which the network has trouble deciding between inlier and outlier (intermediate r_n). For most of these images it is indeed quite hard to decide whether the annotation is correct or not.
– Figure 7 shows randomly selected images that DeepGUM considers inliers (high r_n). Indeed, the annotation below each image looks correct in most cases.

Fig. 5: Sample images of the CACD dataset estimated as outliers during training (low r_n). The label below each image is the annotated age together with the value of r_n at the end of the training of DeepGUM.

Fig. 6: Sample images of the CACD dataset with high outlier uncertainty (intermediate r_n). The label below each image is the annotated age together with the value of r_n at the end of the training of DeepGUM.
Fig. 7: Sample images of the CACD dataset estimated as inliers during training (high r_n). The label below each image is the annotated age together with the value of r_n at the end of the DeepGUM training.

C Head Pose Estimation
In Section 4.3 of the manuscript, we presented experiments on the head pose estimation task, and illustrated the benefit of our proposal, which robustly detects outliers at training time. Figure 9 shows images from the McGill dataset that DeepGUM considered outliers. Many clear outliers appear among these examples; for some images, it is difficult to assess the annotation even for a human annotator. In Figure 8, we display the error histogram obtained on one fold of the training set. It visually justifies the choice of a Gaussian-uniform model for the error distribution.
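The Gaussian-uniform mixture underlying this histogram can be sketched as follows. The E-step responsibilities r_n and the closed-form M-step updates follow the standard EM recipe; the exact equations (3)-(5) are given in the main paper, so the `support` parameterization of the uniform outlier component below is an assumption for illustration.

```python
import numpy as np

def gum_responsibilities(errors, pi, sigma, support):
    """Posterior probability r_n that each residual comes from the
    Gaussian (inlier) component of the mixture
    pi * N(e; 0, sigma^2) + (1 - pi) * U, where the uniform outlier
    component has density 1/support over an interval of that length."""
    e = np.asarray(errors, dtype=float)
    gauss = np.exp(-0.5 * (e / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
    unif = 1.0 / support
    return pi * gauss / (pi * gauss + (1.0 - pi) * unif)

def em_step(errors, pi, sigma, support):
    """One EM iteration: E-step responsibilities, then closed-form
    updates of the inlier proportion and the Gaussian scale (a sketch
    of the updates, not the paper's exact equations)."""
    e = np.asarray(errors, dtype=float)
    r = gum_responsibilities(e, pi, sigma, support)
    pi_new = r.mean()
    sigma_new = np.sqrt((r * e ** 2).sum() / r.sum())
    return r, pi_new, sigma_new

r = gum_responsibilities([0.2, 40.0], pi=0.9, sigma=1.0, support=180.0)
# small residual -> r close to 1 (inlier); large residual -> r close to 0
```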
Fig. 8: Error histogram on the McGill dataset. Points considered outliers (small r_n) are displayed in red, and inliers (large r_n) in green.

Fig. 9: Sample images from the McGill dataset considered as outliers during training (low r_n). The label below each image is the annotated angle.

D Facial Landmark Detection
Section 4.4 of the manuscript reports experiments on the facial landmark detection (FLD) task. We showed the benefit of using DeepGUM, a robust regression approach able to detect outliers at training time, under the presence of different kinds of corrupting outliers. Figure 11 shows images from FLD corrupted with the l-UGO strategy (these images correspond to a point on the red curve in Figure 5.a of the main manuscript). Superposed on the images are circles and crosses, for DeepGUM and Biweight respectively, located at the (corrupted) annotations. Since on this dataset the closest competitor is Biweight, and plotting the results of more than two methods per image would be unintelligible, we do not report results for Huber. The color indicates whether each method detects the annotated landmark as an outlier (red) or an inlier (green). In the case of Biweight, since the method acts coordinate-wise, there are vertical and horizontal lines, denoting whether the vertical or horizontal coordinate, respectively, is detected as an outlier. First of all, we remark that almost all of the uncorrupted landmarks are detected as inliers by both methods. This corresponds to the 100% outlier precision (i.e. no inliers are classified as outliers) in the curves of Figure 5.a. Regarding the detection of outliers, we can see that Biweight classifies many outliers as inliers (lower outlier recall with respect to DeepGUM, as in Figure 5.a). We can also observe that, because Biweight works coordinate-wise, some of the landmarks are detected as outliers horizontally but not vertically, and vice-versa. For instance, in the first row, fifth column, we can see that the nose landmark is wrongly annotated close to the eyebrow. Horizontally the error is not large, and therefore Biweight classifies it as a horizontal inlier and a vertical outlier. However, this is wrong, because ideally we would not want to use the eyebrow as a nose sample.
Other examples confirm this behavior and explain not only why the recall of Biweight is lower than that of DeepGUM, but also one of the reasons for the difference in performance.

In order to better understand the training procedure of DeepGUM, we plot three curves in Figure 10. These curves represent the same quantities as in Figure 5, but the x-axis takes a different meaning in this plot. Two of these curves correspond to the precision (squares, dashed) and recall (triangles, dashed) on the training set. The third curve (circles, solid) corresponds to the MAE on the test set (which is clean). The abscissa corresponds to the M-step iterations, i.e. updates of the parameters of the graphical model, θ, using equations (3), (4) and (5). When the test error is flat, the EM is looping and the network weights w are not updated (so the test error is constant). Once the EM has converged, SGD takes over until convergence, followed by a new execution of the M-step (with the new network weights and, hopefully, a lower test error). We can see that the recall increases progressively, even when the network weights are constant, meaning that the EM discovers the outliers in the training set in a progressively more efficient manner.

Fig. 10: Precision and recall (on the training set) and MAE (on the test set) over the M-step iterations, with l-UGO noise.
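The alternation between EM on the residuals and SGD on responsibility-weighted samples, as described above, can be sketched as follows. Here `fit_sgd` and `predict` are hypothetical placeholders for the ConvNet's training step and forward pass, and the scalar-output, fixed-epoch structure is a simplification of the actual procedure (which uses early stopping and the exact M-step equations of the paper).

```python
import numpy as np

def train_deepgum(fit_sgd, predict, X, y, support,
                  epochs=10, pi=0.5, sigma=1.0, em_iters=20):
    """Sketch of the alternating optimization: run EM on the current
    residuals to estimate per-sample inlier responsibilities r_n, then
    perform a supervised training round with those responsibilities as
    sample weights. The uniform outlier component has density 1/support.

    `fit_sgd(X, y, weights)` and `predict(X)` are assumed callables
    standing in for the network's training step and forward pass.
    """
    r = np.ones(len(y))
    for _ in range(epochs):
        res = np.asarray(y - predict(X), dtype=float)
        for _ in range(em_iters):
            # E-step: responsibility of the Gaussian (inlier) component.
            gauss = (np.exp(-0.5 * (res / sigma) ** 2)
                     / (sigma * np.sqrt(2 * np.pi)))
            r = pi * gauss / (pi * gauss + (1.0 - pi) / support)
            # M-step: closed-form updates of the mixture parameters.
            pi = r.mean()
            sigma = np.sqrt((r * res ** 2).sum() / max(r.sum(), 1e-12))
        fit_sgd(X, y, weights=r)   # supervised step on weighted samples
    return r, pi, sigma
```

In practice the weighted SGD step would minimize a responsibility-weighted regression loss, so samples with r_n near zero (detected outliers) contribute almost nothing to the gradient.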