Deep Robust Subjective Visual Property Prediction in Crowdsourcing
Qianqian Xu, Zhiyong Yang, Yangbangyan Jiang, Xiaochun Cao, Qingming Huang, Yuan Yao
Key Lab of Intell. Info. Process., Inst. of Comput. Tech., CAS, Beijing, 100190, China
State Key Laboratory of Info. Security (SKLOIS), Inst. of Info. Engin., CAS, Beijing, 100093, China
School of Cyber Security, University of Chinese Academy of Sciences, Beijing, 100049, China
School of Computer Science and Tech., University of Chinese Academy of Sciences, Beijing, 101408, China
BDKM, University of Chinese Academy of Sciences, Beijing, 100190, China
Department of Mathematics, Hong Kong University of Science and Technology, Hong Kong

Abstract
The problem of estimating subjective visual properties (SVP) of images (e.g., Shoes A is more comfortable than B) is gaining rising attention. Due to its highly subjective nature, different annotators often exhibit different interpretations of scales when adopting absolute value tests. Therefore, recent investigations turn to collecting pairwise comparisons via crowdsourcing platforms. However, crowdsourced data usually contain outliers. It is therefore desirable to develop a robust model for learning SVP from crowdsourced noisy annotations. In this paper, we construct a deep SVP prediction model which not only leads to better detection of annotation outliers but also enables learning with extremely sparse annotations. Specifically, we construct a comparison multi-graph based on the collected annotations, where different labeling results correspond to edges with different directions between two vertexes. Then, we propose a generalized deep probabilistic framework which consists of an SVP prediction module and an outlier modeling module that work collaboratively and are optimized jointly. Extensive experiments on various benchmark datasets demonstrate that our new approach guarantees promising results.
1. Introduction
In recent years, estimating subjective visual properties (SVP) of images [9, 19, 24] has been gaining rising attention in the computer vision community. SVP measures a user's subjective perception and feeling with respect to a certain property of images/videos. For example, estimating properties of consumer goods, such as the shininess of shoes [9], improves customer experiences on online shopping websites; and estimating interestingness [8] from images/videos would be helpful for media-sharing websites (e.g., YouTube). Estimation of SVP is thus highly subjective in nature. Traditional methods usually adopt absolute value tests, which specify a rating from 1 to 5 (or 1 to 10) to grade the property of a stimulus; for example, in image/video interestingness prediction, 5 is the most interesting and 1 the least interesting. However, since by definition these properties are subjective, different raters often exhibit different interpretations of the scales, and as a result the annotations of different people on the same sample can vary hugely. Moreover, it is hard to concretely define the concept of a scale (for example, what a scale of 3 means for an image), especially without any common reference point. Therefore, recent investigations turn to an alternative approach based on pairwise comparison. In a pairwise comparison test, an individual is simply asked to compare two stimuli simultaneously and votes for the one with the stronger property based on his/her perception. The individual decision process in pairwise comparison is therefore simpler than in typical absolute value tests, as the multiple-scale rating is reduced to a dichotomous choice. It not only promises assessments that are easier and faster to obtain, with a less demanding task for raters, but also yields more reliable feedback with less personal scale bias in practice.
However, a shortcoming of pairwise comparison is that it has a more expensive sampling complexity than absolute value tests, since the number of pairs grows quadratically with the number of items to be ranked. With the growth of crowdsourcing [2] platforms such as MTurk, InnoCentive, CrowdFlower, CrowdRank, and AllOurIdeas, recent studies thus resort to crowdsourcing tools to tackle the cost problem. However, since the participants in crowdsourcing experiments often work in the absence of supervision, it is hard to guarantee the annotation quality in general [5]. If the experiment lasts too long, raters may lose their patience and end the test in a hurry with random annotations. Worse, bad users might even provide wrong answers deliberately to corrupt the system. Such contaminated decisions are useless and may deviate significantly from other raters' decisions, and thus should be identified and removed in order to achieve a robust SVP prediction result. Existing approaches to SVP prediction are therefore often split into two separate steps: the first is a standard outlier detection problem (e.g., majority voting) and the second is a regression or learning-to-rank problem. However, it has been found that when pairwise local rankings are integrated into a global ranking, it is possible to detect outliers that cause global inconsistency and yet are locally consistent, i.e., supported by majority votes [14].
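To make the local-versus-global distinction concrete, the following sketch (pure Python with illustrative helper names, not code from the paper) shows three items whose pairwise majorities are each supported 2-to-1, yet whose majority directions form a cycle, so no global ranking is consistent with all of them:

```python
from collections import Counter

def majority_edges(votes):
    """Collapse multi-edge votes (winner, loser) into one majority
    direction per pair (illustrative helper)."""
    tally = Counter()
    for winner, loser in votes:
        a, b = min(winner, loser), max(winner, loser)
        tally[(a, b)] += 1 if winner == a else -1
    edges = set()
    for (a, b), score in tally.items():
        if score > 0:
            edges.add((a, b))
        elif score < 0:
            edges.add((b, a))
    return edges

def has_cycle(edges):
    """Directed-cycle check by depth-first search: a cycle means the
    majority directions admit no consistent global ranking."""
    graph = {}
    for u, v in edges:
        graph.setdefault(u, []).append(v)
        graph.setdefault(v, [])
    state = {u: 0 for u in graph}  # 0 unseen, 1 on stack, 2 done
    def dfs(u):
        state[u] = 1
        for v in graph[u]:
            if state[v] == 1 or (state[v] == 0 and dfs(v)):
                return True
        state[u] = 2
        return False
    return any(state[u] == 0 and dfs(u) for u in graph)

# Every pair below is supported 2-to-1 by its local majority, yet the
# majority directions A>B, B>C, C>A form a cycle: each erroneous vote
# survives majority voting while the global ranking is inconsistent.
votes = [("A", "B"), ("A", "B"), ("B", "A"),
         ("B", "C"), ("B", "C"), ("C", "B"),
         ("C", "A"), ("C", "A"), ("A", "C")]
inconsistent = has_cycle(majority_edges(votes))  # True
```

A per-pair majority vote would keep all three majority directions, which is exactly the failure mode that joint, graph-level outlier detection is meant to fix.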
To overcome this limitation, [9] proposes a more principled way to identify annotation outliers by formulating the SVP prediction task as a unified robust learning-to-rank problem, tackling both the outlier detection and SVP prediction tasks jointly. Different from this work, which only enjoys the limited representation power of low-level image features, our goal in this paper is to leverage the strong representation power of deep neural networks to explore the SVP prediction issue from a deep perspective.

When it comes to deep learning, it is known that several kinds of factors can drive a deep model away from a perfect one, with data perturbation as a typical example. Besides the notorious issues coming from the crowdsourcing process, deep learning is in itself known to be more vulnerable to contaminated data, since the extremely high model complexity brings extra risk of overfitting the noisy/contaminated data [20, 10, 30, 32, 15, 25, 22]. We believe that guaranteeing robustness is one of the biggest challenges when constructing deep SVP prediction models. In this sense, we propose a deep robust model for learning SVP from crowdsourcing. As an overall summary, we list our main contributions as follows:

• A novel method for robust prediction of SVP is proposed. To the best of our knowledge, our framework offers the first attempt to carry out the prediction procedure with automatic detection of sparse outliers from a deep perspective.

• At the core of the framework lies a unified probabilistic model, which is used to formulate the generating process of the labels when outliers exist. Based on this model, we then propose a Maximum A Posteriori (MAP) based objective function.

• An alternating optimization scheme is adopted to solve the corresponding model. Specifically, the network parameters can be updated by gradient-based methods with back-propagation, whereas the outlier pattern can be solved by an ordinary gradient descent method or a proximal gradient method.
2. Related Work
Subjective visual property prediction has gained rising attention in the last several years. It covers a large variety of computer vision problems, including image/video interestingness [8], memorability [16], and quality of experience [27] prediction, etc. When used as a semantically meaningful representation, subjective visual properties are often referred to as relative attributes [29, 19]. The original SVP prediction approach treats this task as a learning-to-rank problem. The main idea is to use ordered pairs of training images to train a ranking function that will generalize to new images. Specifically, a set of pairs ordered according to their perceived property strength is obtained from human annotators, and a ranking function that preserves those orderings is learned. Given a new image pair, the ranker indicates which image has more of the property. A naive way to learn the ranker is to resort to traditional pairwise learning-to-rank methods such as RankSVM [17], RankBoost [6], and RankNet [3], etc. However, these methods are not a natural fit for scenarios with crowdsourced outliers. In [9], a unified robust learning-to-rank (URLR) framework is proposed to solve both the outlier detection and learning-to-rank problems jointly. Different from this line of research, we study robust SVP prediction in the context of deep learning. Equipped with better feature representation power, we show both theoretically and experimentally that by solving both the outlier detection and ranking prediction problems jointly in a deep framework, we achieve better outlier detection and better ranking prediction.
Learning from noisy data has been studied extensively in recent years. Traditionally, such methods can be traced back to statistical studies such as majority voting, M-estimators [13], Huber-LASSO [27], and Least Trimmed Squares (LTS) [28], etc. However, these works do not have prediction ability (especially with the power of deep learning) for unseen samples. Recently, there has been a wave of robust methods for learning from noisy labels in the context of deep learning. Generally speaking, there are four types of existing methods: (I) robust learning based on probabilistic graphical models, where the noisy patterns are often modeled as latent variables [30, 25]; (II) progressive and self-paced learning, where easy and clean examples are learned first, whereas the hard and noisy labels are progressively considered [10]; (III) loss-correction methods, where the loss function is corrected iteratively [22]; (IV) network architecture-based methods, where the noisy patterns are modeled with specifically designed modules [15]. Meanwhile, there are also some efforts on designing deep robust models for specific tasks and applications: [20] proposes a method to learn from weak and noisy labels for semantic segmentation; [32] proposes a deep robust unsupervised method for saliency detection, etc.

Figure 1: Overview of our approach. (1) We construct a comparison graph from the crowdsourcing annotations, which is contaminated with outlier labels. (2) We propose a generalized deep probabilistic framework, where an outlier indicator γ is learned along with the network parameters Θ. (3) Our framework outputs a clean graph on the training set, where contaminated annotations are eliminated. Furthermore, our model can predict a rank-preserving score for each unseen instance. Best viewed in color.

Compared with these recent achievements, our work differs significantly in the sense that: a) we provide the first trial to explore the deep robust learning problem in the context of crowdsourced SVP learning; b) we adopt a pairwise learning framework, whereas the existing works all adopt instance-wise frameworks.
3. Methodology
Our goal in this paper is two-fold:

(a) We aim to learn a deep SVP prediction model from a set of sparse and noisy pairwise comparison labels. Specifically, the ranking patterns should be preserved.

(b) To guarantee the quality of the model, we expect that all the noisy annotations can be detected and removed along with the training process.

We denote the ids of the two images in the $i$-th pair as $i_1$ and $i_2$, and denote the corresponding image pair as $(x_{i_1}, x_{i_2})$. More precisely, we are given a pool of $n$ training images and a set of SVPs. In addition, for each SVP, we are given a set of pairwise comparison labels. Such pairwise comparison data can be represented by a directed multi-graph where multiple edges can be found between two vertexes. Mathematically, we denote the graph as $G = (V, E)$. $V$ is the set of vertexes, which contains all the distinct image items occurring in the comparisons. $E$ is the set of comparison edges. For a specific user with id $j$ and a specific comparison pair $i$ defined on two item vertexes $i_1$ and $i_2$, if the user believes that $i_1$ holds a stronger/weaker presence of the SVP, we then have an edge $(i_1, i_2, j)$ / $(i_2, i_1, j)$, respectively. Equivalently, we also denote this relation as $i_1 \succ_j i_2$ / $i_2 \succ_j i_1$. Since multiple users take part in the annotation process, it is natural to observe multi-edges between two vertexes. Now we can denote the labeling results as a function $Y : E \rightarrow \{-1, 1\}$. For a given pair $i$ and a rater $j$ who annotates this pair, the corresponding label is denoted as $y_{ij}$, which is defined as:

$y_{ij} = +1, \ (i_1, i_2, j) \in E; \qquad y_{ij} = -1, \ (i_2, i_1, j) \in E. \quad (1)$

Now we present an example of the defined comparison graph; see step 1 in Figure 1. In this figure, the SVP in question is the age of the humans in the images. Suppose we have 5 images with ground-truth ages (marked in red in the lower right corner of each image); we then have $V = \{1, 2, \cdots, 5\}$. Furthermore, we have three users with ids 1, 2, 3 who take part in the annotation.
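In code, the multi-graph and the labeling function of Eq. (1) can be sketched as follows (a pure-Python illustration with names of our own choosing, not the authors' implementation; here a pair is canonicalized by sorting its two ids, a convention the paper leaves implicit):

```python
def build_multigraph(annotations):
    """Build the comparison multi-graph G = (V, E) and the labels y of
    Eq. (1). annotations: (winner, loser, rater) triples. With the pair
    canonicalized as (min id, max id): y = +1 when the rater prefers the
    lower id, y = -1 when the rater prefers the higher id."""
    V, E, y = set(), [], {}
    for winner, loser, rater in annotations:
        V.update((winner, loser))
        E.append((winner, loser, rater))  # directed multi-edge
        i1, i2 = min(winner, loser), max(winner, loser)
        y[(i1, i2, rater)] = 1 if winner == i1 else -1
    return V, E, y

# Three raters annotate the pair (1, 2); rater 2 disagrees with raters 1 and 3.
V, E, y = build_multigraph([(1, 2, 1), (1, 2, 3), (2, 1, 2)])
# y == {(1, 2, 1): 1, (1, 2, 3): 1, (1, 2, 2): -1}
```

Multiple raters on the same pair naturally yield multiple edges, possibly in opposite directions, which is exactly the multi-graph structure described above.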
According to the labeling results shown on the lower left side, we have $E = \{(1, 2, 1), (1, 2, 3), (2, 1, 2), \cdots\}$. As shown in this example, we are likely to observe both $i_1 \succ i_2$ and $i_2 \succ i_1$ for a specific pair $i$. This is mainly caused by the bad and ugly users who provide erroneous labels. For example, for vertexes 1 and 2, the edge $(2, 1, 2)$ is obviously an abnormal annotation. With the above definitions and explanations, we are ready to introduce the input and output of our proposed model.

Input.
The input of our deep model is the defined multi-graph $G$ along with the image items, where each time a specific edge is fed to the network.

Output. As will be seen in the next subsection, our model outputs the relative scores $s_{i_1}$ and $s_{i_2}$ of the image pair, along with an outlier indicator which can automatically remove the abnormal directions in $G$. Note that learning $s_{i_1}$ and $s_{i_2}$ directly achieves our goal (a), while detecting and removing outlier directions on the graph directly achieves goal (b).

In contrast to traditional methods, we propose a deep robust SVP ranking model in this paper. According to step 2 in Figure 1, we employ a deep Siamese [4, 21] convolutional neural network as the ranking model to calculate the relative scores for image pairs. In this model, the input is an edge in the graph $G$ together with the image pair $(x_{i_1}, x_{i_2})$. Each branch of the network is fed with an image and outputs the corresponding score $s(x_{i_1})$ or $s(x_{i_2})$. Then we propose a robust probabilistic model based on the difference of the scores. As a note on the network architecture, we choose an existing popular CNN architecture, ResNet-50 [11], as the backbone of the Siamese network. Such a residual network is equipped with shortcut connections, bringing promising performance in image tasks.

With the network given, we are ready to elaborate a novel probabilistic model to simultaneously prune the outliers and learn the network parameters for SVP prediction. In our model, the noisy annotations are treated as a mixture of reliable patterns and outlier patterns. More precisely, to guarantee the performance of the whole model, we expect $s(x_{i_1})$ and $s(x_{i_2})$, i.e., the scores returned by the network, to capture the reliable patterns in the labels. Meanwhile, we introduce an outlier indicator term $\gamma(y_{ij})$ to model the noisy nature of the annotations.
During the training process, our prediction is an additive mixture of the reliable score and the outlier indicator.

To see how the inclusion of γ helps us detect and remove outliers, one should realize that, since $y_{ij}$ must be either 1 or -1, there are only two distinct values for $\gamma(y_{ij})$, one for each direction. If we can learn a reasonable $\gamma(y_{ij})$ such that $\gamma(y_{ij}) \neq 0$ only if the corresponding direction is not reliable, we can then remove the contaminated directions in $G$ and obtain a clean graph. To illustrate this in an easier way, let us go back to step 1 in Figure 1. According to the lower left contents, we have three annotations for the pair $(V_1, V_2)$, and two distinct $\gamma(y_{ij})$ for these annotations: for the correct direction, we have $\gamma(1)$ for $(1, 2, 1)$ and $(1, 2, 3)$; for the contaminated direction, we have a different value $\gamma(-1)$ for $(2, 1, 2)$. Now if we can learn $\gamma(y_{ij})$ in a way that $\gamma(1) = 0$ and $\gamma(-1) \neq 0$, then we can easily detect the contaminated direction $(2, 1, 2)$.

Given the clarification above, our next step is to propose a probabilistic model of the labels based on the outlier indicator γ, the network parameters Θ, and the predicted scores $s(\cdot)$. Specifically, we model the conditional distribution of the annotations along with the prior distributions of γ and Θ in the following form:

$y_{ij} \mid x_{i_1}, x_{i_2}, \Theta, \gamma(y_{ij}) \overset{i.i.d.}{\sim} f(y_{ij}, s(x_{i_1}, x_{i_2}, \Theta) + \gamma(y_{ij})),$
$\gamma(y_{ij}) \mid \lambda_1 \overset{i.i.d.}{\sim} h(\gamma(y_{ij}), \lambda_1),$
$\Theta \mid \lambda_2 \sim g(\Theta, \lambda_2).$

• $s(x_{i_1}, x_{i_2}, \Theta) = s(x_{i_1}, \Theta) - s(x_{i_2}, \Theta)$ is the relative score of the annotation, which is directly learned from the deep learning model with the parameter set Θ. As mentioned above, $s(x_{i_1}, x_{i_2}, \Theta)$ is expected to model the reliable pattern in the annotations. The prior distribution of Θ is assumed to be associated with a p.d.f. (probability density function) $p(\Theta \mid \lambda_2) = g(\Theta, \lambda_2)$ ($\lambda_2$ is a predefined hyperparameter), which is denoted as $g$ in short.
• $\gamma(y_{ij})$ is the outlier indicator, which induces unreliability. Since only outliers have a nonzero indicator, we model the randomness of $\gamma(y_{ij})$ with an i.i.d. sparsity-inducing prior distribution (e.g., the Laplacian distribution) with p.d.f. $p(\gamma(y_{ij}) \mid \lambda_1) = h(\gamma(y_{ij}), \lambda_1)$ ($\lambda_1$ denotes the hyperparameter), which is denoted as $h_{ij}$ in short.

• As we have mentioned above, the noisy prediction $s(x_{i_1}, x_{i_2}, \Theta) + \gamma(y_{ij})$ is an additive mixture of the reliable score and the outlier indicator.

• $f(y_{ij}, s(x_{i_1}, x_{i_2}, \Theta) + \gamma(y_{ij}))$ is the conditional p.d.f. of the labels, which is denoted as $f_{ij}$ in short.

Let $\gamma = \{\gamma(y_{ij})\}$ and $y = \{y_{ij}\}$ over all edges in $E$. Our next step is to construct a loss function for this probabilistic model. According to the Maximum A Posteriori (MAP) rule in statistics, a reasonable solution of the parameters should have a large posterior probability $P(\Theta, \gamma \mid y, X, \lambda_1, \lambda_2)$. In other words, with high probability, the parameters (γ, Θ in our model) should be observed after seeing the data ($y$, $X$ in our model) and the predefined hyperparameters ($\lambda_1$, $\lambda_2$). This motivates us to maximize the posterior probability in our objective function. Furthermore, to simplify the calculation of the derivatives, we adopt an equivalent form where the negative log posterior probability is minimized:

$\min_{\Theta, \gamma} -\log(P(\Theta, \gamma \mid y, X, \lambda_1, \lambda_2)).$

Following the Bayes rule, one has:

$P(\Theta, \gamma \mid y, X, \lambda_1, \lambda_2) = \dfrac{P(y \mid X, \Theta, \gamma) \cdot P(\Theta \mid \lambda_2) \cdot P(\gamma \mid \lambda_1) \cdot P(X)}{\int_{\Theta}\int_{\gamma} P(X, y \mid \Theta, \gamma) \cdot P(\Theta \mid \lambda_2) \cdot P(\gamma \mid \lambda_1)\, d\Theta\, d\gamma}.$

It then becomes clear that $P(\Theta, \gamma \mid y, X, \lambda_1, \lambda_2)$ is not directly tractable.
Fortunately, since $X$ and $y$ are given and we only need to optimize Θ and γ, the tedious term $\frac{P(X)}{\int_{\Theta}\int_{\gamma} P(X, y \mid \Theta, \gamma) \cdot P(\Theta \mid \lambda_2) \cdot P(\gamma \mid \lambda_1)\, d\Theta\, d\gamma}$ becomes a constant, which suggests that:

$P(\Theta, \gamma \mid y, X, \lambda_1, \lambda_2) \propto \prod_{(i,j) \in D} p(y_{ij} \mid x_{i_1}, x_{i_2}, \gamma(y_{ij}), \Theta) \cdot p(\gamma(y_{ij}) \mid \lambda_1) \cdot p(\Theta \mid \lambda_2) = \prod_{(i,j) \in D} g \cdot h_{ij} \cdot f_{ij}, \quad (2)$

where $D = \{(i, j) : (i_1, i_2, j) \in E \ \text{or} \ (i_2, i_1, j) \in E\}$. This implies that our loss function can be simplified as:

$\min_{\Theta, \gamma} \sum_{(i,j) \in D} -(\log(f_{ij}) + \log(h_{ij})) - \log(g).$

With the general framework given, we provide two specified models with different assumptions on the distributions:

• Model A: If the prior distribution of $\gamma(y_{ij}) \mid \lambda_1$ is a Laplacian distribution with a zero location parameter and a scale parameter of $\lambda_1$: $Lap(0, \lambda_1) = \frac{1}{2\lambda_1} \exp(-|\gamma|/\lambda_1)$; the prior distribution of Θ is an element-wise Gaussian distribution $N(0, \lambda_2)$; and $y_{ij}$ conditionally follows a Gaussian distribution $N(s(x_{i_1}, x_{i_2}, \Theta) + \gamma(y_{ij}), 1)$, then the problem becomes:

$\min_{\Theta, \gamma} \sum_{(i,j) \in D} \frac{1}{2}\big(y_{ij} - s(x_{i_1}, x_{i_2}, \Theta) - \gamma(y_{ij})\big)^2 + \lambda_1 \|\gamma\|_1 + \lambda_2 \sum_{\theta \in \Theta} \theta^2,$

where $\|\gamma\|_1 = \sum_{(i,j) \in D} |\gamma(y_{ij})|$.

• Model B: If we adopt the same assumptions as above, except that $y_{ij}$ conditionally follows a logistic-like distribution, then the problem can be simplified as:

$\min_{\gamma, \Theta} \sum_{(i,j) \in D} \log(1 + \Delta_{ij}) + \lambda_1 \|\gamma\|_1 + \lambda_2 \sum_{\theta \in \Theta} \theta^2,$

where $\Delta_{ij} = \exp(-y_{ij}(s(x_{i_1}, x_{i_2}, \Theta) + \gamma(y_{ij})))$.

With the model and network clarified, we then introduce the optimization method adopted in this paper. Specifically, we employ an iterative scheme where γ and the network parameters Θ are alternately updated until convergence is reached.

Fix γ, Learn Θ. When fixing γ, we see that Θ can be solved from the following subproblem:

$\min_{\Theta} -\sum_{(i,j) \in D} \log(f_{ij}) - \log(g).$

Since Θ only depends on the network, one can find an approximate solution by updating the network. For Model A, this subproblem becomes:

$\min_{\Theta} \sum_{(i,j) \in D} \frac{1}{2}\big(y_{ij} - s(x_{i_1}, x_{i_2}, \Theta) - \gamma(y_{ij})\big)^2 + \lambda_2 \sum_{\theta \in \Theta} \theta^2.$

Similarly, for Model B, we come to a subproblem of the form:

$\min_{\Theta} \sum_{(i,j) \in D} \log(1 + \Delta_{ij}) + \lambda_2 \sum_{\theta \in \Theta} \theta^2.$

Fix Θ, Learn γ. Similarly, when Θ is fixed, we can solve γ from:

$\min_{\gamma} \sum_{(i,j) \in D} -(\log(f_{ij}) + \log(h_{ij})).$

This is a simple model of γ which does not involve the network. For Model A, this subproblem becomes:

$\min_{\gamma} \sum_{(i,j) \in D} \frac{1}{2}\big(y_{ij} - s(x_{i_1}, x_{i_2}, \Theta) - \gamma(y_{ij})\big)^2 + \lambda_1 \|\gamma\|_1.$

It enjoys a closed-form solution given by the proximal operator of the $\ell_1$ norm:

$\gamma(y_{ij}) = \max(|c_{ij}| - \lambda_1, 0) \cdot \mathrm{sign}(c_{ij}), \quad (3)$

where $c_{ij} = y_{ij} - s(x_{i_1}, x_{i_2}, \Theta)$.

For Model B, this subproblem becomes:

$\min_{\gamma} \sum_{(i,j) \in D} \log(1 + \Delta_{ij}) + \lambda_1 \|\gamma\|_1.$

Generally, there is no closed-form solution for this subproblem. In this paper, we adopt the proximal gradient method [1] to find a numerical solution.
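To make the alternating scheme concrete, here is a minimal pure-Python sketch of Model A (illustrative only: a scalar linear scorer s(x) = w * x stands in for the Siamese network, a plain gradient step stands in for back-propagation, and all names are ours, not the authors'). One step fixes γ and descends on the parameters; the other fixes the parameters and applies the closed-form soft-thresholding update of Eq. (3):

```python
def soft_threshold(c, lam):
    """Proximal operator of lam * |.| (Eq. (3)): shrink c toward zero."""
    if c > lam:
        return c - lam
    if c < -lam:
        return c + lam
    return 0.0

def fit_model_a(pairs, lam1=0.5, lam2=0.01, lr=0.05, iters=200):
    """Alternating minimization for Model A on scalar features.
    pairs: list of (x1, x2, y); the scorer s(x) = w * x stands in for
    the network, and gamma holds one outlier term per annotation."""
    w = 0.0
    gamma = [0.0] * len(pairs)
    for _ in range(iters):
        # Fix gamma, learn the parameters: one gradient step on
        # sum 1/2 (y - w*(x1 - x2) - gamma)^2 + lam2 * w^2.
        grad = 2.0 * lam2 * w
        for k, (x1, x2, y) in enumerate(pairs):
            resid = w * (x1 - x2) + gamma[k] - y
            grad += resid * (x1 - x2)
        w -= lr * grad / len(pairs)
        # Fix the parameters, learn gamma: closed-form soft-thresholding.
        for k, (x1, x2, y) in enumerate(pairs):
            gamma[k] = soft_threshold(y - w * (x1 - x2), lam1)
    return w, gamma

# Toy data (x1, x2, y): the features play the role of ground-truth ages
# and y = 1 means the annotator voted the first image older; the last
# pair is a planted outlier (2 is voted older than 5).
pairs = [(5, 2, 1), (4, 1, 1), (3, 1, 1), (2, 5, 1)]
w, gamma = fit_model_a(pairs)
# gamma stays exactly 0 on the three clean pairs and becomes large on
# the planted outlier, which is the detection behavior described above.
```

Because the γ-step is an exact minimizer and the Θ-step decreases its own subproblem, each iteration does not increase the Model A objective, which is what makes the alternating scheme stable in practice.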
4. Experiments
In this section, experiments are conducted on three benchmark datasets (see Table 1), which fall into two categories: (1) experiments on human age estimation from face images (Section 4.1), which can be considered synthetic experiments; with the ground truth available, this set of experiments enables us to perform an in-depth evaluation of the significance of our proposed method; (2) experiments on estimating SVPs as relative attributes (Sections 4.2 and 4.3).

Table 1: Dataset summary.

Dataset | No. Pairs | No. Images | No. Classes
FG-Net Face Age Dataset | 15,000 | 1,002 | 1
LFW-10 Dataset [23] | 29,454 | 2,000 | 10
Shoes Dataset [18] | 87,946 | 14,658 | 7
Table 2: Experimental results on the Human Age dataset.

Algorithm | ACC | F1 | Prec. | Rec. | AUC
Maj-LS | .5555 | .4673 | .4369 | .5022 | .5650
LS-with γ | .5594 | .4729 | .4414 | .5093 | .5759
Maj-Logistic | .5421 | .4687 | .4264 | .5205 | .5489
Logistic-with γ | .5585 | .4743 | .4410 | .5131 | .5735
Maj-RankNet [3] | .5611 | .4804 | .4445 | .5227 | .5792
Maj-RankBoost [6] | .5425 | .5991 | .6458 | .5587 | .4507
Maj-RankSVM [17] | .5838 | .3858 | .4517 | .3367 | .5665
Maj-GBDT [7] | .5827 | .3880 | .4504 | .3408 | .5619
Maj-DART [26] | .5940 | .3668 | .4648 | .3029 | .5690
URLR [9] | .5765 | .4633 | .5748 | .5131 | .5762
LS-Deep-w/o γ | .7313 | .6694 | .6407 | .7008 | .8060
Logit-Deep-w/o γ | .7439 | .6818 | .6584 | .7070 | .8168
LS-Deep-with γ | .7967 | .7414 | .7323 | .7508 | .8784
Logit-Deep-with γ | .7917 | .7370 | .7228 | .7518 | .8739

In this experiment, we consider age as a subjective visual property of a face. The main difference between this SVP and the other SVPs evaluated so far is that we do have the ground truth, i.e., the person's age when the picture was taken. This enables us to perform an in-depth evaluation of the significance of our proposed framework.
Dataset
The FG-NET image age dataset contains 1,002 images of 82 individuals labeled with ground-truth ages ranging from 0 to 69. The training set is composed of the images of 41 randomly selected individuals, and the rest is used as the test set. For the training set, we use the ground-truth age to generate the pairwise comparisons, with the preference direction following the ground-truth order. To create sparse outliers, a random subset (i.e., 20%) of the pairwise comparisons is reversed in preference direction. In this way, we create a paired comparison graph, possibly incomplete and imbalanced, with 1,002 nodes and 15,000 pairwise comparison samples.

Competitors
We compare our methods Model A and Model B with 10 competitors. Note that Model A is the least-squares based deep model, while Model B is a logistic-regression based deep model. In the following experiments, we give Model A the alias LS-Deep and Model B the alias Logit-Deep:
1) Maj-LS: This method uses majority voting for outlier pruning and a least squares problem for learning to rank.
2) LS-with γ: To test the improvement of merely adopting the robust model, we jointly employ the linear regression model and our proposed robust mechanism as a baseline.
3) Maj-Logistic: This method stands for another baseline in our work, where majority voting is adopted for label processing, followed by logistic regression.
4) Logistic-with γ: Again, to test the improvement of merely adopting the robust model, we jointly employ the logistic regression model and our proposed robust mechanism as a baseline.
5) Maj-RankSVM [17]: We record the performance of RankSVM to show the superiority of representation learning.
6) Maj-RankNet [3]: To show the effectiveness of using a deeper network, we compare our method with the classical RankNet model preprocessed by majority voting.
7) Maj-RankBoost [6]: Besides deep learning frameworks, it is also known that ensemble-based methods can serve as models for hierarchical learning and representation. In this sense, we compare our method with the RankBoost model, one of the most classical ensemble methods.
8) Maj-GBDT [7]: Gradient Boosting Decision Trees (GBDT) have gained surprising improvements in many traditional tasks. Accordingly, we compare our methods with GBDT to show their strength.
9) Maj-DART [26]: Recently, the well-known drop-out trick has also been applied to ensemble-based learning, namely the DART method. We also record the performance of DART to show the superiority of our method.
10) URLR [9]: URLR is a unified robust learning-to-rank framework which aims to tackle both outlier detection and learning to rank jointly. We compare our algorithm with this method to show the effectiveness of using a generalized probabilistic model and a deep architecture.

Ablation: To show the effectiveness of the proposed probabilistic model, we additionally add two competitors as the ablation. Note that the key element for detecting outliers is the factor γ. Accordingly, the ablation competitors are formed with γ eliminated:
1) LS-Deep-w/o γ: A partial implementation of LS-Deep, where the factor γ is removed.
2) Logit-Deep-w/o γ: A partial implementation of Logit-Deep, where the factor γ is removed.

Evaluation metrics
Because the ground-truth age is available, we adopt ACC, Precision, Recall, F1-score, and AUC as the evaluation metrics to demonstrate the effectiveness of our proposed method.
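For reference, these pairwise metrics can be computed as sketched below (pure Python with hypothetical helper names, not the paper's evaluation code). ACC, precision, recall, and F1 treat the pairwise label +1 as the positive class; AUC is estimated with the rank-statistic (Mann-Whitney) formulation:

```python
def pairwise_metrics(y_true, scores):
    """y_true: +1/-1 pairwise labels; scores: real-valued predictions
    (their sign gives the predicted label)."""
    y_pred = [1 if s > 0 else -1 for s in scores]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == -1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == -1)
    acc = sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    # Mann-Whitney AUC: fraction of (positive, negative) score pairs
    # ranked correctly, counting ties as half.
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == -1]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    auc = wins / (len(pos) * len(neg)) if pos and neg else 0.0
    return acc, prec, rec, f1, auc

# A toy check: two of four predictions carry the wrong sign.
acc, prec, rec, f1, auc = pairwise_metrics([1, 1, -1, -1],
                                           [2.0, -0.5, -1.0, 0.3])
# acc = prec = rec = f1 = 0.5, auc = 0.75
```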
Implementation Details
For the four deep learning methods, the learning rate is set as − and λ₂ is set as −. For LS-Deep-with γ, λ₁ is set as 1.2. For Logit-Deep-with γ, λ₁ is set as 0.6.

Comparative Results
In all the non-deep competitor experiments, we adopt LBP as the low-level features. Looking at the five-metric results in Table 2, we see that our methods (marked in red and green) consistently outperform all the benchmark algorithms by a significant margin. This validates the effectiveness of our method. In particular, it can be observed that: (1) LS-with γ (or Logistic-with γ) is superior to Maj-LS (or Maj-Logistic) because global outlier detection is better than local outlier detection (i.e., majority voting). (2) The performance of the deep methods is better than all non-deep methods; interestingly, even the ablation baselines without γ give better results than traditional methods with outlier detection, which suggests the strong representation power of deep neural networks in SVP prediction tasks. (3) It is worth mentioning that our proposed Deep-with γ methods exhibit a clear improvement on all five metrics over the Deep-without γ methods, demonstrating the superior outlier detection ability of our proposed framework. (4) Our two proposed models A (i.e., LS-Deep-with γ) and B (i.e., Logit-Deep-with γ) show comparable results on this dataset, while model A holds the lead by a slight margin.

Figure 2: Outlier examples detected on the Human Age dataset.

Moreover, we visualize some examples of outliers detected by model A in Figure 2, while the results returned by model B are very similar. It can be seen that those in the blue/green boxes are clearly outliers and are detected correctly by our method. For better illustration, the ground-truth age is printed under each image. Moreover, blue boxes show pairs with large age differences while green boxes illustrate samples with subtle age differences, which indicates that our method can not only detect easy pairs with a large age gap, but can also handle hard samples with a small age gap (e.g., within only 1-2 years difference).
Four failure cases are shown in red boxes, in which our method treats the annotation that the image on the left is older than the one on the right as an outlier, while the ground truth actually agrees with the annotation. We can easily find that this often occurs for pairs with small age differences, which indicates that our method may occasionally lose its power when meeting highly competitive or confusing pairs.

Dataset
The LFW-10 dataset [23] consists of 2,000 face images taken from the Labeled Faces in the Wild [12] dataset. It contains 10 relative attributes, like smiling, big eyes, etc. Each pair was labeled by 5 people. For example, given a specific attribute, the user chooses which image is stronger in the attribute. As the goal of our paper is to predict SVP from noisy labels, we do not conduct any pre-processing steps to enforce label agreement as in [31]. The resulting dataset has 29,454 total annotated sample pairs, on average 2,945 binary pairs per attribute.

Table 3: Experimental results (ACC) of 10 attributes on the LFW-10 dataset.

Algorithm | Bald | D.Hai | B.Eye | GLook | Masc. | Mouth | Smile | Teeth | Foreh. | Young | Aver.
Maj-LS | .4767 | .5368 | .4787 | .4788 | .5588 | .4774 | .5220 | .5073 | .4759 | .5162 | .5029
LS-with γ | .5805 | .6400 | .5506 | .5932 | .6009 | .5097 | .5178 | .5198 | .5680 | .5911 | .5672
Maj-Logistic | .6123 | .6716 | .5146 | .5890 | .6253 | .5032 | .5031 | .5322 | .5724 | .6599 | .5784
Logistic-with γ | .6059 | .6400 | .5640 | .6038 | .6275 | .5269 | .5073 | .5405 | .5724 | .6437 | .5832
Maj-RankNet [3] | .6123 | .6421 | .5551 | .6208 | .6275 | .5097 | .5304 | .5468 | .5899 | .6275 | .5862
Maj-RankBoost [6] | .5996 | .7053 | .5236 | .5975 | .6231 | .5097 | .5199 | .5094 | .6053 | .6032 | .5797
Maj-RankSVM [17] | .4852 | .6526 | .4180 | .5805 | .5588 | .4882 | .5283 | .5156 | .5482 | .6397 | .5415
Maj-GBDT [7] | .5551 | .6253 | .4899 | .5466 | .5721 | .4903 | .5094 | .5198 | .5965 | .6235 | .5528
Maj-DART [26] | .5508 | .6337 | .4899 | .5339 | .5698 | .4989 | .5597 | .5364 | .5943 | .6134 | .5581
URLR [9] | .5889 | .6538 | .6505 | .5258 | .5614 | .6319 | .5311 | .4968 | .5446 | .5570 | .5742
LS-Deep-w/o γ | .5932 | .7095 | .5551 | .6081 | .5543 | .5742 | .6436 | .6133 | .5746 | .6741 | .6100
Logit-Deep-w/o γ | .5551 | .6758 | .5124 | .6335 | .6253 | .5806 | .6038 | .6175 | .5724 | .6235 | .6000
LS-Deep-with γ | .6335 | .7684 | .5551 | .6377 | .6253 | .7312 | .7421 | .7547 | .6469 | .7308 | .6826
Logit-Deep-with γ | .6631 | .7726 | .5798 | .6419 | .5965 | .7032 | .7358 | .7069 | .6075 | .6862 | .6694

Figure 3: Outlier examples of 4 representative attributes on the LFW-10 dataset.
Implementation Details
In competitive experiments, we adopt GIST as the low-level features. For the four deep learning methods, the learning rate is set as −, and λ is set as −. For LS-Deep-with γ, λ is set as 1.2. For Logit-Deep-with γ, λ is set as 0.5.

Comparative Results
Table 3 reports the summary ACC for each attribute. The following observations can be made: (1) Our deep methods always outperform traditional non-deep methods and ablation baseline methods in all experimental settings, with higher average ACC over all attributes (0.6826 vs. 0.6100 and 0.6694 vs. 0.6000 for the two models, respectively). (2) The performance of the other methods is in general consistent with what we observed in the Human age experiments.

Moreover, Figure 3 gives some examples of the pruned pairs for 4 randomly selected attributes. In the success cases, the left images are (incorrectly) annotated to have more of the attribute than the right ones. However, they are either wrong or too ambiguous to give consistent answers, and as such are detrimental to learning to rank. A number of failure cases (false positive pairs identified by our models) are also shown. Some of them are caused by unique viewpoints (e.g., for the dark hair attribute, the man has a sparse scalp, so it is hard to tell who has darker hair); others are caused by the weak feature representation, e.g., in the young attribute example: youth is a function of multiple subtle visual cues like face shape, skin texture, and hair color, whereas something like baldness or smiling has a clearer visual focus that is captured well by part-based features.
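The ACC numbers reported in Table 3 can be read as agreement on held-out comparisons. A minimal sketch of such a pairwise accuracy, assuming the learned model exposes a per-image score array (function and variable names are illustrative, not the paper's code):

```python
import numpy as np

def pairwise_acc(scores, test_pairs):
    """Fraction of test pairs whose predicted order matches the label.

    scores: array of predicted SVP scores, one entry per image.
    test_pairs: list of (i, j) tuples whose ground truth says
    image i has more of the attribute than image j.
    """
    correct = sum(scores[i] > scores[j] for i, j in test_pairs)
    return correct / len(test_pairs)

scores = np.array([0.9, 0.1, 0.5])
pairs = [(0, 1), (2, 1), (0, 2), (1, 2)]  # last pair is mis-ordered by the scores
print(pairwise_acc(scores, pairs))  # 0.75
```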
Dataset
The Shoes dataset, collected from [18], contains 14,658 online shopping images. In this dataset, 7 attributes are annotated by users with a wide spectrum of interests and backgrounds. For each attribute, at least 190 users take part in the annotation, and each user is assigned 50 images. Note that the dataset actually uses binary annotations rather than pairwise annotations (1 for Yes, -1 for No). We therefore randomly sample positive and negative annotations from each user's records to form the pairs we need. For each attribute, we randomly select 2,000 such distinct pairs, finally yielding a volume of 87,946 personalized comparisons in total.
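The binary-to-pairwise conversion described above can be sketched as follows. This is only an illustrative sketch under assumed data layout (per-user records as dicts from image id to +1/-1; all names are hypothetical), not the paper's exact sampling code:

```python
import random

def sample_user_pairs(records, n_pairs, seed=0):
    """Turn per-user binary (Yes=+1 / No=-1) annotations into pairs.

    records: dict mapping user id -> dict of {image_id: +1 or -1}.
    Each sampled pair (pos, neg, user) states that, for this user,
    image `pos` has the attribute while image `neg` does not.
    Returns up to n_pairs distinct personalized comparisons.
    """
    rng = random.Random(seed)
    pairs = set()
    users = list(records)
    while len(pairs) < n_pairs:
        user = rng.choice(users)
        pos = [im for im, y in records[user].items() if y == +1]
        neg = [im for im, y in records[user].items() if y == -1]
        if not pos or not neg:
            continue  # this user cannot yield a pair
        pairs.add((rng.choice(pos), rng.choice(neg), user))
    return sorted(pairs)

# Toy example: two users, each with a few Yes/No annotations.
records = {0: {10: +1, 11: -1, 12: +1}, 1: {10: -1, 13: +1}}
print(sample_user_pairs(records, 3))
```

Keeping the user id in each pair preserves the personalized nature of the comparisons that the dataset description emphasizes.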
Implementation Details
In competitive experiments, we concatenate the GIST and color histograms provided by the original dataset as the low-level features. For the LS-based deep methods, the learning rate is set as −. For the Logit-based deep methods, the learning rate is set as −. λ is set as − for all four methods. For LS-Deep-with γ, λ is set as 1.2. For Logit-Deep-with γ, λ is set as 0.8.

Comparative Results
Similar to the Human age and LFW-10 datasets, Table 4 again shows that the performance of our proposed deep models is significantly better than that of the other competitors. Moreover, some outlier detection examples are shown in Figure 4. In the top four rows, with successful detection examples, the right images clearly have more of the attribute than the left ones, yet are incorrectly annotated by crowdsourced raters. Some failure cases are caused by invisibility (e.g., for the comfortable attribute, though the transparent rain boots themselves are flat, there is in fact a pair of red high-heeled shoes inside); others are caused by different visual definitions of attributes (e.g., the open attribute has multiple shades of meaning: peep-toed (open at toe) vs. slip-on (open at heel) vs. sandal-like (open at toe and heel)). The remaining ones may be caused by ambiguity: both images have the attribute to a similar degree, which corresponds to a truly ambiguous case that can go either way.
Table 4: Experimental results (ACC) of 7 attributes on the Shoes dataset.

Algorithm            Comf.  Fash.  Form.  Pointy Brown  Open   Ornate Aver.
Maj-LS               .7300  .7825  .7325  .7897  .6950  .7331  .7300  .7418
LS-with γ            .8150  .8125  .7975  .7860  .7275  .7444  .7625  .7779
Maj-Logistic         .7600  .7850  .7475  .7970  .6900  .7068  .7175  .7434
Logistic-with γ      .8375  .8175  .7825  .7934  .7250  .7444  .7525  .7790
Maj-RankNet [3]      .7425  .7850  .7200  .7860  .6925  .7444  .7300  .7429
Maj-RankBoost [6]    .7525  .7300  .7275  .7675  .6975  .6955  .6725  .7204
Maj-RankSVM [17]     .7425  .7925  .7925  .8081  .6850  .7331  .7200  .7534
Maj-GBDT [7]         .7075  .7325  .7425  .8007  .6750  .7519  .7550  .7379
Maj-DART [26]        .6900  .7275  .7375  .8376  .6975  .7857  .7125  .7412
URLR [9]             .8200  .8150  .7900  .7860  .7325  .7444  .7550  .7775
LS-Deep-w/o γ        .7100  .8075  .7400  .7749  .7725  .7669  .7050  .7538
Logit-Deep-w/o γ     .7100  .8025  .7500  .8044  .7525  .7857  .6975  .7575
LS-Deep-with γ       .8500  .8550  .8125  .8044  .8250  .7782  .8300  .8222
Logit-Deep-with γ    .8550  .8500  .8200  .8339  .8125  .7481  .8325  .8217

Figure 4: Outlier examples of 4 representative attributes on the Shoes dataset.

5. Conclusion

This work explores the challenging task of SVP prediction from noisy crowdsourced annotations from a deep perspective. We present a simple but effective general probabilistic model to simultaneously predict rank-preserving scores and detect outlier annotations, where an outlier indicator γ is learned along with the network parameters Θ. Practically, we present two specific models with different assumptions on the data distribution. Furthermore, we adopt an alternating optimization scheme to update γ and Θ iteratively. In our empirical studies, we perform a series of experiments on three real-world datasets: the Human age dataset, LFW-10, and Shoes. The corresponding results consistently show the superiority of our proposed model.
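The alternating scheme summarized above, updating the outlier indicator γ and the parameters Θ in turn, can be illustrated on a toy problem. This is only a sketch under assumed simplifications (a hard-threshold rule that flags a fixed fraction of the worst-fitting annotations as outliers; all function and variable names are hypothetical), not the paper's exact updates:

```python
import numpy as np

def alternating_outlier_fit(residual_fn, update_theta, theta, n_annot,
                            outlier_frac=0.1, n_rounds=5):
    """Alternately update model parameters theta and an outlier indicator gamma.

    residual_fn(theta): per-annotation loss residuals, shape (n_annot,).
    update_theta(theta, gamma): refit theta using only annotations
    with gamma == 0; gamma[k] = 1 marks annotation k as an outlier.
    """
    gamma = np.zeros(n_annot)
    for _ in range(n_rounds):
        # Theta step: fit the model on the currently kept annotations.
        theta = update_theta(theta, gamma)
        # Gamma step: flag the worst-fitting annotations as outliers.
        res = residual_fn(theta)
        cutoff = np.quantile(res, 1.0 - outlier_frac)
        gamma = (res > cutoff).astype(float)
    return theta, gamma

# Toy usage: robustly estimate a mean; the last observation is an outlier.
data = np.array([1.0, 1.1, 0.9, 10.0])
theta, gamma = alternating_outlier_fit(
    residual_fn=lambda th: (data - th) ** 2,
    update_theta=lambda th, g: data[g == 0].mean(),
    theta=0.0, n_annot=len(data), outlier_frac=0.25)
print(gamma)  # the indicator flags only the last observation
```

In the toy run, the refit estimate converges to the mean of the three inliers once the contaminated observation is flagged, mirroring how the outlier indicator shields the SVP scores from corrupted annotations.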
6. Acknowledgments
This work was supported in part by the National Basic Research Program of China (973 Program): 2015CB351800 and 2015CB85600; in part by the National Natural Science Foundation of China: 61620106009, U1636214, 61861166002, 61672514, and 11421110001; in part by the Key Research Program of Frontier Sciences, CAS: QYZDJ-SSW-SYS013; in part by the Beijing Natural Science Foundation (4182079); in part by the Youth Innovation Promotion Association CAS; and in part by Hong Kong Research Grants Council (HKRGC) grant 16303817.

References

[1] A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183–202, 2009.
[2] S. Branson, G. Van Horn, and P. Perona. Lean crowdsourcing: Combining humans and machines in an online system. In IEEE Conference on Computer Vision and Pattern Recognition, pages 7474–7483, 2017.
[3] C. Burges, T. Shaked, E. Renshaw, A. Lazier, M. Deeds, N. Hamilton, and G. Hullender. Learning to rank using gradient descent. In International Conference on Machine Learning, pages 89–96, 2005.
[4] S. Chopra, R. Hadsell, and Y. LeCun. Learning a similarity metric discriminatively, with application to face verification. In IEEE Conference on Computer Vision and Pattern Recognition, volume 1, pages 539–546, 2005.
[5] F. Daniel, P. Kucherbaev, C. Cappiello, B. Benatallah, and M. Allahbakhsh. Quality control in crowdsourcing: A survey of quality attributes, assessment techniques, and assurance actions. ACM Computing Surveys, 51(1):7, 2018.
[6] Y. Freund, R. Iyer, R. E. Schapire, and Y. Singer. An efficient boosting algorithm for combining preferences. Journal of Machine Learning Research, 4(Nov):933–969, 2003.
[7] J. H. Friedman. Greedy function approximation: A gradient boosting machine. The Annals of Statistics, 29(5):1189–1232, 2001.
[8] Y. Fu, T. M. Hospedales, T. Xiang, S. Gong, and Y. Yao. Interestingness prediction by robust learning to rank. In European Conference on Computer Vision, pages 488–503, 2014.
[9] Y. Fu, T. M. Hospedales, T. Xiang, J. Xiong, S. Gong, Y. Wang, and Y. Yao. Robust subjective visual property prediction from crowdsourced pairwise labels. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(3):563–577, 2016.
[10] B. Han, I. W. Tsang, L. Chen, P. Y. Celina, and S.-F. Fung. Progressive stochastic learning for noisy labels. IEEE Transactions on Neural Networks and Learning Systems, (99):1–13, 2018.
[11] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
[12] G. B. Huang, M. Mattar, T. Berg, and E. Learned-Miller. Labeled faces in the wild: A database for studying face recognition in unconstrained environments. In Workshop on Faces in 'Real-Life' Images: Detection, Alignment, and Recognition, 2008.
[13] P. Huber. Robust Statistics. New York: Wiley, 1981.
[14] X. Jiang, L.-H. Lim, Y. Yao, and Y. Ye. Statistical ranking and combinatorial Hodge theory. Mathematical Programming, 127(6):203–244, 2011.
[15] I. Jindal, M. Nokleby, and X. Chen. Learning deep networks from noisy labels with dropout regularization. In IEEE International Conference on Data Mining, pages 967–972, 2016.
[16] P. Jing, Y. Su, L. Nie, and H. Gu. Predicting image memorability through adaptive transfer learning from external sources. IEEE Transactions on Multimedia, 19(5):1050–1062, 2017.
[17] T. Joachims. Optimizing search engines using clickthrough data. In ACM International Conference on Knowledge Discovery and Data Mining, pages 133–142, 2002.
[18] A. Kovashka and K. Grauman. Discovering attribute shades of meaning with the crowd. International Journal of Computer Vision, 114(1):56–73, 2015.
[19] A. Kovashka and K. Grauman. Attributes for image retrieval. In Visual Attributes, pages 89–117. Springer, 2017.
[20] Z. Lu, Z. Fu, T. Xiang, P. Han, L. Wang, and X. Gao. Learning from weak and noisy labels for semantic segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(3):486–500, 2017.
[21] M. Norouzi, D. J. Fleet, and R. R. Salakhutdinov. Hamming distance metric learning. In Annual Conference on Neural Information Processing Systems, pages 1061–1069, 2012.
[22] G. Patrini, A. Rozza, A. Krishna Menon, R. Nock, and L. Qu. Making deep neural networks robust to label noise: A loss correction approach. In IEEE Conference on Computer Vision and Pattern Recognition, pages 1944–1952, 2017.
[23] R. N. Sandeep, Y. Verma, and C. Jawahar. Relative parts: Distinctive parts for learning relative attributes. In IEEE Conference on Computer Vision and Pattern Recognition, pages 3614–3621, 2014.
[24] H. Squalli-Houssaini, N. Q. Duong, M. Gwenaëlle, and C.-H. Demarty. Deep learning for predicting image memorability. In IEEE International Conference on Acoustics, Speech and Signal Processing, pages 2371–2375, 2018.
[25] A. Vahdat. Toward robustness against label noise in training deep discriminative neural networks. In Annual Conference on Neural Information Processing Systems, pages 5601–5610, 2017.
[26] R. K. Vinayak and R. Gilad-Bachrach. DART: Dropouts meet multiple additive regression trees. In International Conference on Artificial Intelligence and Statistics, 2015.
[27] Q. Xu, J. Xiong, Q. Huang, and Y. Yao. Robust evaluation for quality of experience in crowdsourcing. In ACM Conference on Multimedia, pages 43–52, 2013.
[28] Q. Xu, M. Yan, C. Huang, J. Xiong, Q. Huang, and Y. Yao. Exploring outliers in crowdsourced ranking for QoE. In ACM Conference on Multimedia, pages 1540–1548, 2017.
[29] X. Yang, T. Zhang, C. Xu, S. Yan, M. S. Hossain, and A. Ghoneim. Deep relative attributes. IEEE Transactions on Multimedia, 18(9):1832–1842, 2016.
[30] J. Yao, J. Wang, I. W. Tsang, Y. Zhang, J. Sun, C. Zhang, and R. Zhang. Deep learning from noisy image labels with quality embedding. IEEE Transactions on Image Processing, 2018.
[31] A. Yu and K. Grauman. Just noticeable differences in visual attributes. In IEEE International Conference on Computer Vision, pages 2416–2424, 2015.
[32] J. Zhang, T. Zhang, Y. Dai, M. Harandi, and R. Hartley. Deep unsupervised saliency detection: A multiple noisy labeling perspective. In