Learning to Segment Object Candidates via Recursive Neural Networks
Tianshui Chen, Liang Lin, Xian Wu, Nong Xiao, and Xiaonan Luo
Abstract—To avoid an exhaustive search over locations and scales, current state-of-the-art object detection systems usually involve a crucial component that generates a batch of candidate object proposals from images. In this paper, we present a simple yet effective approach for segmenting object proposals via a deep architecture of recursive neural networks (ReNNs), which hierarchically groups regions to detect object candidates over scales. Unlike traditional methods that mainly adopt fixed similarity measures for merging regions or finding object proposals, our approach adaptively learns the region merging similarity and the objectness measure during the process of hierarchical region grouping. Specifically, guided by a structured loss, the ReNN model jointly optimizes the cross-region similarity metric with the region merging process as well as the objectness prediction. During inference for object proposal generation, we introduce randomness into the greedy search to cope with the ambiguity of grouping regions. Extensive experiments on standard benchmarks, e.g., PASCAL VOC and ImageNet, suggest that our approach is capable of producing object proposals with high recall while well preserving the object boundaries, and that it outperforms existing methods in both accuracy and efficiency.
Index Terms—Object proposal generation, Object segmentation, Region grouping, Recursive neural networks, Deep learning.
I. INTRODUCTION

Object proposal generation, which aims to identify a small set of region proposals where objects are likely to occur, benefits a wide range of applications such as generic object detection [1], [2], object recognition [3]–[5] and object discovery [6], [7]. Usually, a good object proposal method is desired to be capable of not only recalling all existing objects over scales and locations but also preserving their boundaries, as illustrated in Figure 1.

The challenges of object proposal generation lie in the presence of severe occlusion, variations in object shapes, and the lack of category information. Most current methods [8]–[10] tackle these difficulties through bottom-up region grouping or segmentation. These methods mainly involve two crucial components, i.e., a cross-region similarity metric and a region merging algorithm. The similarity metric is utilized to measure whether two adjacent regions should be merged, and the merging algorithm performs the inference process that groups pairs of regions into super-regions and finally generates object proposals. Thus, object proposal generation methods based on region grouping basically follow this pipeline: they assign a higher similarity score to adjacent regions if it is confident that the regions belong to the same class, and recursively merge the adjacent regions with the highest score. Despite their acknowledged successes, these approaches usually require elaborate tuning or setting (e.g., manually designed cross-region similarity metrics), limiting their performance in complex environments.

In this work, we develop a novel hierarchical region grouping approach for generating and segmenting object proposals by learning a recursive neural network (ReNN). In our ReNN architecture, we incorporate cross-region similarity metric learning into the bottom-up region merging process for end-to-end training. In particular, we define a structured loss that penalizes incorrect merging candidates by measuring the similarity of adjacent regions and the objectness. In this way, our model explicitly optimizes the cross-region similarity learning and objectness prediction within the recursive iterations.

Corresponding author is Xiaonan Luo. T. Chen, L. Lin, X. Wu, and N. Xiao are with the School of Data and Computer Science, Sun Yat-sen University, Guangzhou 510006, China (E-mail: [email protected], [email protected], [email protected], [email protected]). X. Luo is with the School of Computer Science and Information Security, Guilin University of Electronic Technology, Guilin 541004, China (E-mail: [email protected]).

Fig. 1. Some object proposals (indicated by the blue boxes) generated by our approach. Our results match well with the ground truth (indicated by the green boxes), and also preserve the object boundaries (indicated by the blue silhouettes inside the boxes).
Interestingly, the forward process of the ReNN accords well with the traditional bottom-up region grouping pipeline, leading to a very natural embedding of the two crucial components (i.e., the cross-region similarity metric and the merging algorithm). Moreover, the objectness score is also learned during ReNN training, bringing the benefit of quickly rejecting false positive samples.

Obviously, greedy merging algorithms, which recursively merge the two regions with the highest merging score, can be applied for inference with the ReNN model [11]. However, the performance of greedy methods depends heavily on the accuracy of the merging scores, since greedy merging is generally sensitive to noise and local minima. In the task of object proposal generation, once a segment of an object is incorrectly merged with the background or other objects, this object has little chance of being recalled. In addition, we experimentally found that greedy merging easily leads to incorrect object proposals, especially when one segment of an object has a similar appearance to the background or other surrounding objects. To alleviate this issue, we propose a randomized merging algorithm that introduces randomness into the recursive inference procedure. Instead of merging the pair of neighbouring regions with the highest similarity score, we search for the k pairs with the top-k highest similarities, and then randomly pick one pair according to a distribution constructed from their scores. The process is repeated K times, so that errors occurring in one random merging pass can be corrected in another. In this way, it helps to recall more incorrectly merged objects. Figure 1 shows some examples of object proposals generated by our approach.

The key contribution of this work is a deep architecture of recursive neural networks for generating object proposals and preserving their boundaries.
This framework jointly optimizes the cross-region similarity and objectness measure together with the hierarchical region grouping process, which is original in the literature of object segmentation and detection. Moreover, we design a randomized region merging algorithm with the recursive neural network learning, which introduces randomness to handle the inherent ambiguities of composing regions into candidate objects and thus yields a notable gain in object recall rate. Extensive experimental evaluation and analysis on standard benchmarks (e.g., PASCAL VOC and ImageNet) are provided, demonstrating that our method achieves superior performance over existing approaches in both accuracy and efficiency.

The remainder of the paper is organized as follows. Section II presents a review of related work. We then introduce our approach and optimization algorithm in detail in Section III and Section IV, respectively. Experimental results, comparisons and analysis are exhibited in Section V. Section VI concludes the paper.

II. RELATED WORK
Many efforts have been dedicated to object proposal generation. Here we roughly divide existing methods into two categories, according to their computation process: top-down window-based scoring and bottom-up region grouping.
A. Window-based Scoring
This category of methods [12]–[16] attempts to distinguish object proposals directly from the surrounding background by assigning an objectness score to each candidate sub-window. The objectness measures are defined in diverse ways, and object proposals generated by sliding windows are then ranked and thresholded by their objectness scores. As a pioneering work, Alexe et al. [15] employed saliency cues to measure the objectness of a given window, which was further improved by [17] with learning methods and more complicated features. However, these methods may suffer from expensive computational cost, since they require searching over all locations and scales in images. Recently, to address this problem, BING [14] and Edge Box [13] exploited very simple features such as gradient and contour information to score the windows, and achieved very high computational efficiency. Alternatively, Ren et al. [12] proposed a deep learning method based on fully convolutional networks (FCNs) [18] to score windows over scales and locations efficiently. Nonetheless, this method may not locate objects accurately, since experimental results show that its recall rate deteriorates as the Intersection over Union (IoU) threshold increases.
B. Region Grouping
This branch of research [8], [9], [19]–[23] casts object proposal generation as a process of hierarchical region segmentation or partition. Starting from an initial over-segmentation, these methods usually adopt a cross-region similarity/distance metric [24], [25] that works together with a region merging algorithm. As a representative example, Uijlings et al. [8] leveraged four types of low-level features (e.g., color, texture, etc.) for similarity computation and generated object proposals via hierarchical greedy grouping. Using features similar to [8], Manen et al. [9] learned the merging probabilities and introduced a randomized Prim's algorithm for region grouping. Following similar hierarchical grouping methods, Wang et al. [26] proposed a multi-branch hierarchical segmentation method by learning multiple merging strategies at each step. Arbeláez et al. [21] constructed hierarchical segmentations and explored the combinatorial space of combining multi-scale regions into proposals. Xiao et al. [10] proposed a complexity-adaptive distance metric for grouping neighbouring super-pixels, which combines a low-complexity distance and a high-complexity distance to adapt to different complexity levels. Krähenbühl and Koltun [19] trained classifiers to adaptively place seeds that hit the objects in the image, and identified a small set of level sets as object proposals for each seed. This method was further improved by ensembling multiple models to generate more diverse proposals [27]. Rantalankila et al. [22] integrated local region merging and global graph-cut to generate proposals. Due to their high localization accuracy, region-grouping methods are adopted in many state-of-the-art object detection [1], [2] and object discovery [7] algorithms. However, the mentioned methods mainly adopt fixed similarity measures for merging regions or finding object proposals, leading to suboptimal performance when handling complex cases.
In contrast, our approach adaptively learns the region merging similarity and the objectness measure during the process of hierarchical region grouping. Moreover, our method introduces randomness into the bottom-up search over region compositions and yields significant improvements over existing methods.
Fig. 2. An overview of our proposed object proposal segmentation framework. The bottom shows local feature extraction, and the top illustrates the bottom-up recursive region grouping process. The four modules, F_s, F_c, F_m and F_o, work cooperatively to group regions for generating object proposals.

III. FRAMEWORK OF SEGMENTING OBJECT PROPOSALS
In this section, we introduce our approach in detail. The input image is first over-segmented into N regions with the efficient graph-based method [28]. Fast R-CNN [29] is used to extract local features for each region. We then design a recursive neural network to group regions and simultaneously predict the associated objectness scores for the corresponding proposals. Furthermore, we propose a randomized merging algorithm, which introduces randomness into the recursive inference procedure to cope with the inherent ambiguities in the process of merging regions. Figure 2 gives an illustration of our proposed framework.

A. Local Feature Extraction
Since deep features have shown significant improvement over hand-crafted features on various vision tasks [30]–[35], we utilize the Fast R-CNN [29] architecture to extract deep local features for each region. The architecture consists of 16 convolutional layers, the same as VGG16-net [30], followed by a region-of-interest (ROI) pooling layer. Specifically, given an input image, our approach first over-segments it into N regions with the efficient graph-based method [28] and obtains the box that tightly bounds each region. To achieve a better trade-off between speed and accuracy, we follow [29] and resize the input image so that its short side is 600 pixels, keeping the aspect ratio unchanged. The convolutional layers take the resized image as input and produce feature maps of the corresponding size. The ROI pooling layer subsequently extracts a fixed-length feature vector for each region.

B. Recursive Neural Networks
We first present some notation that will be used throughout this article. Let v_i denote the local features of the i-th region, and x_i denote the corresponding semantic features. σ(·) denotes the rectified linear unit (ReLU), where σ(x) = max(0, x).

The core of this framework is the ReNN, which aims to group the regions and simultaneously predict the objectness scores for the corresponding proposals in a recursive manner. The ReNN architecture is depicted in Figure 3. The ReNN comprises four modules, i.e., the semantic mapper, the feature combiner, the merging scorer and the objectness scorer. The semantic mapper transforms the local features into a semantic space from which they can be propagated to their parent nodes. The feature combiner computes the joint semantic representations of neighbouring child nodes. Given the joint semantic representations, the merging scorer calculates a score indicating the confidence that two nodes should be merged. The feature combiner merges the neighbouring nodes according to the merging scores, yielding a hierarchical tree of segmentations, each of which corresponds to a proposal candidate. The objectness scorer computes a score that estimates the likelihood of a candidate containing an object. These four modules work cooperatively for proposal segmentation, as illustrated in Figure 2. We describe them in the following.
1) Semantic Mapper:
The semantic mapper F_s is a simple feed-forward operator that maps the local features into the semantic space in which the combiner operates. It can be expressed as

x_i = F_s(v_i; θ_s) = σ(W_s v_i + b_s), (1)

where F_s captures the region's semantic representation and propagates it to its parent regions through the hierarchical tree structure. To better balance computational efficiency and accuracy, we empirically set the dimensionality of the local features v_i to 18,432 and that of the semantic features x_i to 256. Hence, the semantic mapper is a one-layer fully-connected network with 18,432 input and 256 output neurons, followed by the rectified linear unit. θ_s = {W_s, b_s} are the learnt parameters, in which W_s and b_s are the weight matrix and bias of the fully-connected layer, respectively.
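As a concrete illustration, the semantic mapper of Eq. (1) is just a fully-connected layer followed by a ReLU. The following minimal NumPy sketch uses randomly initialized parameters as stand-ins for the learnt θ_s:

```python
import numpy as np

rng = np.random.default_rng(0)
D_LOCAL, D_SEM = 18_432, 256  # dimensionalities used in the paper

# Random stand-ins for the learnt parameters theta_s = {W_s, b_s}.
W_s = rng.standard_normal((D_SEM, D_LOCAL)) * 0.01
b_s = np.zeros(D_SEM)

def semantic_mapper(v):
    """x_i = sigma(W_s v_i + b_s), with sigma the ReLU (Eq. 1)."""
    return np.maximum(0.0, W_s @ v + b_s)

# Map one 18,432-d local feature vector to a 256-d semantic feature vector.
x = semantic_mapper(rng.standard_normal(D_LOCAL))
```

In a real implementation, W_s and b_s would of course be trained jointly with the rest of the network rather than sampled at random.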
2) Feature combiner:
The feature combiner F_c recursively takes the semantic features of two child nodes as input and maps them to the semantic features of their parent node, formulated as

x_{i,j} = F_c([x_i, x_j]; θ_c) = σ(W_c [x_i, x_j] + b_c), (2)

where F_c aggregates the semantic information of the two child nodes and obtains the semantic representation of the merged node. It takes the semantic features of the original regions as leaf nodes, and recursively aggregates them towards the root node in a bottom-up manner. To ensure that the recursive procedure can be applied, the dimensionality of the parent node features is set the same as that of the child node features. Thus, the architecture of the feature combiner is identical to that of the semantic mapper, except that it has 512 (2 × 256) input neurons. Similarly, θ_c = {W_c, b_c} are its learnt parameters, where W_c and b_c are the weight matrix and bias, respectively.

Fig. 3. Illustration of the recursive neural network in our proposed framework. This network computes the scores for merging decisions and the objectness scores of all regions.
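Under the same conventions, the feature combiner of Eq. (2) concatenates the two child features and maps them back into the 256-d semantic space, which is what makes the recursion possible. A minimal sketch, again with random stand-in parameters:

```python
import numpy as np

rng = np.random.default_rng(1)
D = 256  # semantic dimensionality; the combiner input is 2 * D = 512

# Random stand-ins for the learnt parameters theta_c = {W_c, b_c}.
W_c = rng.standard_normal((D, 2 * D)) * 0.01
b_c = np.zeros(D)

def feature_combiner(x_i, x_j):
    """x_{i,j} = sigma(W_c [x_i, x_j] + b_c) (Eq. 2)."""
    return np.maximum(0.0, W_c @ np.concatenate([x_i, x_j]) + b_c)

# Parent features have the same dimensionality as child features,
# so the combiner can be applied recursively up the tree.
x_ij = feature_combiner(rng.standard_normal(D), rng.standard_normal(D))
```

Because input and output dimensionalities match at every level, the same W_c and b_c are reused at every merge in the tree.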
3) Merging scorer:
Given the joint semantic features of two neighbouring nodes, the merging scorer F_m computes a score that indicates the confidence that the two nodes should be merged, expressed as

s_{i,j} = F_m(x_{i,j}; θ_m) = W_m x_{i,j} + b_m, (3)

The scores determine which pair should be merged first in both the learning and inference stages. The scorer consists of a single fully-connected layer that takes the 256-dimensional combined features as input and produces one score. θ_m = {W_m, b_m} are the learnt parameters, where W_m and b_m are the weight matrix and bias of the fully-connected layer, respectively.
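To make the role of the merging scores concrete, the sketch below (random stand-in parameters, hypothetical region pairs) scores each neighbouring pair with the linear scorer of Eq. (3), then shows both the greedy choice and the randomized top-k choice used by the merging algorithm of Section III-C:

```python
import numpy as np

rng = np.random.default_rng(2)
D = 256

# Random stand-ins for the learnt parameters theta_m = {W_m, b_m}.
W_m = rng.standard_normal((1, D)) * 0.01
b_m = np.zeros(1)

def merging_scorer(x_ij):
    """s_{i,j} = W_m x_{i,j} + b_m (Eq. 3): one scalar per combined feature."""
    return float(W_m @ x_ij + b_m)

# Score some hypothetical neighbouring pairs (combined features x_{i,j}).
combined = {(0, 1): rng.standard_normal(D),
            (1, 2): rng.standard_normal(D),
            (2, 3): rng.standard_normal(D)}
scores = {p: merging_scorer(x) for p, x in combined.items()}

# Greedy inference merges the highest-scoring pair first ...
best_pair = max(scores, key=scores.get)

# ... while the randomized variant of Section III-C samples one of the
# top-k pairs from a softmax over their scores.
k = 2
top = sorted(scores, key=scores.get, reverse=True)[:k]
s = np.array([scores[p] for p in top])
rho = np.exp(s - s.max())
rho /= rho.sum()
sampled_pair = top[rng.choice(k, p=rho)]
```

The softmax over the top-k scores corresponds to the multinomial distribution of Eqs. (5)–(6) later in the paper.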
4) Objectness scorer:
Each node of the tree is associated with the semantic information of the corresponding region, i.e., its semantic features. The objectness scorer F_o directly predicts objectness scores in the semantic feature space:

o_i = F_o(x_i; θ_o) = φ(W_o^(2) σ(W_o^(1) x_i + b_o^(1)) + b_o^(2)), (4)

where φ(·) is the softmax operation. Our approach rejects candidate proposals that have low scores without compromising the recall rate. We experimentally found that a single fully-connected layer (consisting of merely 512 parameters) is too simple to fit thousands of proposals well. Thus, we implement the objectness scorer with two stacked fully-connected layers, in which the first maps 256 to 256 dimensions and is followed by the rectified linear unit, and the second maps 256 to 2 dimensions and is followed by a softmax layer for objectness prediction. θ_o = {W_o^(1), W_o^(2), b_o^(1), b_o^(2)} are the learnt parameters, where W_o^(1) and b_o^(1) are the weight matrix and bias of the first fully-connected layer, while W_o^(2) and b_o^(2) are those of the second.

C. Randomized Merging Algorithm
As discussed above, greedy merging groups the neighbouring regions with the highest similarity score at each iteration. Once a segment of an object mistakenly merges with a neighbouring segment that belongs to surrounding objects or the background, the object has little chance of being found. Figure 4 presents an illustrative example. Given an image with a brown cat and a black-and-white one, the brown cat is successfully detected by greedy merging, as it is distinguishable from the background (red bounding box in Figure 4). However, the white segment of the other cat incorrectly merges with a piece of background, as the two have a more similar appearance (red circle in Figure 4). In this case, the subsequent merging process inevitably misses this cat. We propose a randomized merging algorithm to alleviate this problem. Instead of merging the neighbouring regions with the highest similarity score at each iteration, our approach selects one pair to merge among the pairs with the top-k highest scores, according to a distribution constructed from their scores. The randomized merging process can be repeated several times to increase the diversity of the generated proposals. This helps to recall more incorrectly merged objects, as explained in Section V-E.

Fig. 4. An example of incorrect merging using the greedy merging algorithm. Top left: input image; top right: over-segmentation; bottom right: incorrect merging; bottom left: merging result. The black-and-white cat is lost because its white part incorrectly merges with the background.

The randomized merging algorithm works as follows. Starting from the semantic features {x_i} and over-segmented regions R = {r_i}, i = 1, ..., N_seg, where N_seg is the number of segments, our approach first computes the merging scores of all neighbouring regions using the feature combiner and merging scorer. It then ranks the merging scores to obtain the k pairs of neighbouring regions {(r_{i_t}, r_{j_t})}, t = 1, ..., k, with the top-k highest scores {s_{i_t,j_t}}, and constructs a multinomial probability distribution from these k merging scores, expressed as

(i_t, j_t) ∼ Mult(ρ), (5)

where

ρ_{i_t,j_t} = exp(s_{i_t,j_t}) / Σ_{u=1}^{k} exp(s_{i_u,j_u}), t = 1, 2, ..., k, (6)

and ρ_{i_t,j_t} indicates the probability that the t-th pair of regions is selected. Our approach randomly draws one pair of regions (r_{i_t'}, r_{j_t'}) according to the probability distribution Mult(ρ), merges these two regions, and then computes new merging scores between the resulting region and its neighbours. The process is repeated until the whole image becomes one region; the general procedure is detailed in Algorithm 1. As candidate object proposals, we consider the bounding boxes that tightly enclose the segments throughout the hierarchy. The objectness scores, learned by the objectness scorer, are then used to rank the candidate proposals, and the ones with low scores are rejected to obtain a certain number of proposals.

IV. OPTIMIZATION
Suppose that we have the training set X = {(I_i, c_i, b_i) | i = 1, 2, ..., N}, where N is the number of training samples; I_i is the i-th input sample, including the local features of all regions and the adjacency matrix (as shown in Figure 5(a) and (b)); c_i and b_i are the corresponding class labels of the regions and the ground-truth object bounding boxes, respectively. Our model is jointly trained with two objectives: 1) the merging loss L_m penalizes incorrect region grouping in the hierarchical tree structure; and 2) the objectness loss L_o helps to learn the objectness scorer. Therefore, we define the structured loss as

L = L_m + λ L_o + η ||θ||, (7)

where θ = {θ_s, θ_c, θ_m, θ_o} is the set of parameters to learn and ||θ|| is the L2-norm regularization term. λ and η are two balance parameters.

Algorithm 1 Randomized merging algorithm
Input: initial region set R = {r_i}, i = 1, ..., N_seg
Output: set of object proposals P
  Initialize the merging score set S = ∅
  for all neighbouring region pairs (r_i, r_j) do
    Calculate the merging score s_{i,j}
    S = S ∪ {s_{i,j}}
  end for
  while S ≠ ∅ do
    Get the k highest merging scores {s_{i_t,j_t}}, t = 1, ..., k
    Construct the multinomial distribution Mult(ρ)
    Randomly select the t'-th pair according to Mult(ρ)
    Merge the corresponding regions: r_{t'} = r_{i_t'} ∪ r_{j_t'}
    Remove scores involving r_{i_t'}: S = S \ {s_{i_t',*}}
    Remove scores involving r_{j_t'}: S = S \ {s_{j_t',*}}
    Compute the merging score set S_{t'} between r_{t'} and its neighbours
    Update the merging score set S = S ∪ S_{t'}
    Update the region set R = R ∪ {r_{t'}}
  end while
  Extract object proposals P from all regions in R

A. Merging Loss
Given an input image I, its bottom-up merging process can be represented as RN(θ, I, t), producing a binary tree t ∈ T(I), where T(I) is the set of all possible binary trees constructed from input I. In the learning stage, the class labels of all segmented regions are available. We further define T(I, c) as the set of all possible correct trees. Here, a tree is regarded as correct if every region merges with regions belonging to the same class before regions from different classes. Figure 5 presents some examples of generating correct and incorrect trees from an image.
Fig. 5. Examples of generating correct and incorrect trees. (a) Input image; green and blue indicate differently labelled regions. (b) Adjacency matrix of the image regions; (c) correct trees; (d) incorrect trees.
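The notion of a correct tree can be checked programmatically. In the sketch below, a tree is either a leaf index or a pair of subtrees, and a merge is flagged as incorrect when it mixes classes while either side is still an incomplete single-class group; this is a hedged simplification of the subtree test behind the margin loss, not the paper's exact criterion:

```python
from collections import Counter

def margin_loss(tree, labels):
    """Count incorrect subtrees, a simplified version of Delta(I, c, t).

    `tree` is a leaf index (int) or a pair (left, right); `labels[i]` is the
    class label of leaf region i.  A non-terminal node counts as incorrect
    when it merges different class sets while either child is still an
    incomplete group of its class(es) -- an approximation of
    subtree(d) not in T(I, c).
    """
    total = Counter(labels)

    def walk(node):
        if isinstance(node, int):
            return Counter([labels[node]]), 0
        cl, nl = walk(node[0])
        cr, nr = walk(node[1])
        incomplete = any(c[k] < total[k] for c in (cl, cr) for k in c)
        bad = set(cl) != set(cr) and incomplete
        return cl + cr, nl + nr + (1 if bad else 0)

    return walk(tree)[1]

# Three regions: two of class 'a' and one of class 'b'.
labels = ['a', 'a', 'b']
correct = margin_loss(((0, 1), 2), labels)    # merge the two 'a' regions first
incorrect = margin_loss(((0, 2), 1), labels)  # 'a' merges with 'b' too early
```

For the correct tree the count is zero; for the incorrect tree both the premature merge and its ancestor are penalized, mirroring how an incorrect subtree makes every enclosing subtree incorrect as well.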
Inspired by [11], [36], we define a margin loss function Δ: I × C × T → R⁺, where Δ(I, c, t) measures the penalty of constructing a parsing tree t for input I with label c. In the context of the recursive merging process, the loss increases when a segment merges with one from a different class before those with the same class label. We denote by N(t) the set of non-terminal nodes of tree t, and by subtree(d) the subtree underneath the non-terminal node d, for each d ∈ N(t). Naturally, we formulate the loss by penalizing the incorrect subtrees:

Δ(I, c, t) = Σ_{d ∈ N(t)} 1{subtree(d) ∉ T(I, c)}, (8)

where 1{·} is an indicator function whose value is 1 when the expression is true and 0 otherwise. Figure 5(d) illustrates two examples of incorrect trees, whose margin losses are 2 and 3, respectively.

Our goal is to learn a function f_θ(·) with small expected loss on unseen inputs. Similar to [11], [36], we consider functions of the form

f_θ(I) = argmax_{t ∈ T(I)} s(RN(θ, I, t)), (9)

where s(·) predicts the score of a tree by summing the merging scores of all merged neighbouring pairs. In the optimization procedure, we aim to learn a score function that assigns higher scores to correct trees than to incorrect ones. Given the parameters θ, we first define the margin between a correct tree t_i and another tree t for I_i as

s(RN(θ, I_i, t_i)) − s(RN(θ, I_i, t)). (10)

Intuitively, the margin should grow as the margin loss Δ(I, c, t) increases:

s(RN(θ, I_i, t_i)) − s(RN(θ, I_i, t)) ≥ κ Δ(I, c, t), (11)

where κ is a parameter. The merging loss can thus be defined as

L_m = Σ_{i=1}^{N} L_m^(i), (12)

where

L_m^(i) = max_{t ∈ T(I_i)} {s(RN(θ, I_i, t)) + κ Δ(I_i, c_i, t)} − max_{t_i ∈ T(I_i, c_i)} {s(RN(θ, I_i, t_i))}. (13)

Optimizing the merging loss maximizes the scores of correct trees while minimizing the scores of the highest-scoring incorrect trees. Following [11], we utilize greedy merging to approximately find a tree with maximum score among T(I_i), and a correct tree with maximum score among T(I_i, c_i). The gradients are computed and back-propagated based on these two selected trees.

B. Objectness Loss
One of the main advantages of our approach is that it simultaneously predicts an objectness score for each proposal candidate, which can be used to rank proposals and reject the ones with low scores. We simply employ a softmax classifier on the semantic features of each node. We generate positive and negative samples from all of the regions as follows. Given a region, we first calculate the IoU scores between the box that tightly bounds this region and each ground-truth bounding box. If the maximum IoU is larger than 0.5, the region is considered positive; if the maximum IoU is smaller than 0.2, it is used as a negative sample. All these regions are considered useful regions for defining the objectness loss; we simply ignore the other regions, since they may not provide discriminative information. For the i-th useful region, the loss is defined as

L_o^(i) = − Σ_{l=0}^{1} 1{l_i = l} log(p_{i,l}), (14)

where p_{i,l} is the score corresponding to the likelihood of the region belonging to label l. Hence,

L_o = Σ_{i=1}^{N_u} L_o^(i), (15)

where N_u is the number of useful regions. The model is jointly trained by stochastic gradient descent (SGD) with momentum [37].

V. EXPERIMENT
In this section, we present extensive experimental results comparing our approach with state-of-the-art methods, demonstrating the superiority of the proposed method, and analyze the benefit of introducing the randomized merging algorithm for object proposal generation.
A. Experimental Setting

1) Datasets:
We first conduct experiments on the PASCAL VOC 2007 dataset [38], which consists of 9,963 images from 20 object categories. The model is trained using the 422 images of the PASCAL VOC 2007 segmentation set. We compare the performance of our approach with that of state-of-the-art methods, and evaluate the contribution of the randomized merging algorithm, using the 4,952 test images that contain 14,976 objects, including the "difficult" ones. To further demonstrate the effectiveness of the proposed method, we also conduct experiments on the PASCAL VOC 2012 validation set, which contains 15,787 objects in 5,823 images. As our model is trained with the 20 object categories of PASCAL VOC, we further investigate its generalization ability to unseen object categories on the ImageNet 2015 validation dataset [39], which contains about 20,000 images of 200 categories, without re-training the model on ImageNet training samples.
2) Evaluation Metrics:
One of the primary metrics is the Intersection over Union (IoU), defined as the intersection area of a proposal and a ground-truth bounding box divided by their union area. For a fixed number of proposals, the recall rate (the fraction of ground-truth annotations covered by proposals) varies as the IoU threshold increases from 0.5 to 1, so that a recall-IoU curve can be obtained. Besides, curves of the recall rate with respect to the number of ranked proposals are also given, with the IoU threshold fixed at 0.5 and 0.8, respectively. This protocol is widely adopted by many proposal works [26], [40] for evaluation. We also compare the average recall (AR), defined as the recall averaged over IoU thresholds from 0.5 to 1 [40], [41], since AR is considered to be strongly correlated with detection performance.
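These metrics are straightforward to compute. The sketch below implements box IoU and average recall from the best IoU achieved for each ground-truth object, a simple approximation of the evaluation protocol of [40]:

```python
import numpy as np

def iou(a, b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def average_recall(best_ious, lo=0.5, hi=1.0, steps=101):
    """Average recall (AR): recall averaged over IoU thresholds in [lo, hi].

    `best_ious` holds, for each ground-truth object, the highest IoU
    achieved by any proposal.
    """
    best_ious = np.asarray(best_ious)
    thresholds = np.linspace(lo, hi, steps)
    return float(np.mean([(best_ious >= t).mean() for t in thresholds]))

overlap = iou((0, 0, 2, 2), (1, 1, 3, 3))   # unit overlap over union of 7
ar = average_recall([0.55, 0.85, 0.30])      # only two objects ever recalled
```

Because recall is non-increasing in the IoU threshold, AR summarizes the whole recall-IoU curve in a single number, which is what makes it a good proxy for downstream detection performance.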
3) Implementation Details:
Following [8], we adopt the efficient graph-based method [28] to produce initial over-segmentations with four different values of its scale parameter k. We implement the proposed model using the Caffe open-source library [42], and train it by stochastic gradient descent (SGD) with a batch size of 2, a momentum of 0.9, and a weight decay of 0.0005. The learning rate is divided by 10 after 20 epochs. The balance parameter λ in Equation 7 is simply set to 1. Note that it is possible in principle to jointly train the Fast R-CNN and the ReNN. We do not train the model in this way because only 422 images are available for training, which easily leads to over-fitting. Therefore, we simply use the Fast R-CNN model pre-trained on the PASCAL VOC 2007 detection dataset for local feature extraction, and then train our ReNN model alone. In the randomized merging algorithm, the parameter k is set to 5. To improve the quality of the generated proposals, we perform the randomized merging process K times (K = 8 in our experiments), and rank all the generated proposals by their objectness scores. The proposals with low scores are rejected to obtain a certain number of proposals for evaluation.

B. Comparison with State-of-the-Art Approaches
In this subsection, we compare our method with recent state-of-the-art methods, including BING [14], Randomized Prim (RP) [9], EdgeBox (EB) [13], Multiscale Combinatorial Grouping (MCG2015) [43], Selective Search (SS) [8], Faster R-CNN (RPN) [12], Complexity-Adaptive Distance Metric (CADM) [10], Multi-branch Hierarchical Segmentation (MHS) [26], Geodesic Object Proposals (GOP) [19], and Learning to Propose Objects (LPO) [27]. MCG2015 is the improved version of the original MCG and achieves much better performance. We only compare with MHS on the PASCAL VOC 2007 dataset, and with GOP on the PASCAL VOC 2007 and ImageNet datasets, because only the results on these datasets are available. In our experiments, we use EdgeBox70 (the optimal setting for an IoU threshold of 0.7) for EB, and the default settings for the others, in order to ensure the best overall performance for these methods. In addition, we follow [40] and control the number of candidates to a specific value for a fair comparison. Since BING, MCG2015, SS, CADM, RPN, EB, MHS and GOP provide sorted proposals, we select the n proposals with the top-n highest scores for evaluation. However, RP and LPO do not provide scores to rank the proposals, so we simply select their first n proposals in our experiments.

We first analyze the experimental results on the PASCAL VOC 2007 dataset, as depicted in Figure 6. It can be observed that window-scoring-based methods (e.g., EB and RPN) achieve competitive recall rates at a relatively low IoU threshold. This mainly benefits from the exhaustive search over locations and scales, and from the accuracy with which window-scoring-based methods reject non-objects. However, their recall rates drop significantly as the IoU threshold increases. In contrast, region-grouping-based methods yield better performance as the IoU threshold increases. MCG2015 performs best among the region-grouping-based methods, but it is very time-consuming (over 30s per image) and may not be practical, especially for real-time object detection systems. It is noteworthy that our method runs considerably faster than MCG2015, while outperforming it overall, particularly when the number of object proposals is strictly constrained (e.g., with 100 or 500 proposals).

Typically, an IoU threshold of 0.5 is used to measure whether the target object is detected successfully in object detection tasks. However, as suggested in recent works [13], [40], [41], proposals with an IoU of 0.5 cannot fit the ground-truth objects well, usually resulting in failures of subsequent object detectors. This reveals the fact that the recall rate at an IoU threshold of 0.5 is weakly correlated with real detection performance. Hence, we also present the curves of recall rate with respect to the number of proposals at a stricter IoU threshold of 0.8, shown in Figure 6(e), to demonstrate the superiority of our method.
We believe that our method may be more suitable for object detection systems owing to its better localization accuracy and efficiency.

Besides, we compare the average recall (AR), considered to have a strong correlation with detection performance, as another important metric for evaluation. As shown in Figure 6(f), our method outperforms the other state-of-the-art algorithms, which suggests that proposals generated by our method are likely to yield better detection performance.

We also compare performance on the PASCAL VOC 2012 validation set, as depicted in Figure 7. Note that RPN is trained with data from both the VOC 2007 and 2012 datasets, whereas RP and our method are learned on the VOC 2007 dataset without re-training here. Even though VOC 2012 is more challenging and larger in size, our method still achieves the best performance over the other state-of-the-art algorithms, again demonstrating the effectiveness of the proposed method. The improvement over other methods is also larger on VOC 2012 than on VOC 2007.

We present some qualitative examples in Figure 8, including random samples (top four rows) that contain two or more objects and some samples (last row) that challenge our method. We find that, in most cases, our results match the ground truth well and preserve accurate object boundaries. The missed objects are mostly tiny ones, e.g., distant and severely occluded ones.

Fig. 6. Comparison of our proposed method and other state-of-the-art approaches on the PASCAL VOC 2007 test set. Best viewed in color.

Fig. 7. Comparison of our proposed method and other state-of-the-art approaches on the PASCAL VOC 2012 validation set. Note that our model is trained on PASCAL VOC 2007, but still achieves the best performance against the other competitors. Best viewed in color.

Fig. 8. Qualitative examples of our object proposals. Ground-truth boxes are shown in green and red, with green indicating the object is found and red indicating it is not. The blue boxes are the object proposals with the highest IoU to each ground-truth box, and the blue silhouettes are the corresponding object contours. All samples are taken from the PASCAL VOC dataset.
Figure 9 analyzes the AR with regard to ground-truth objects of different sizes. Our method performs slightly worse than RPN if we consider only small objects whose areas are less than 5k pixels. Nevertheless, our method yields the best performance over the other competitors in general, especially for recalling larger objects. Other grouping-based methods such as SS and CADM show similar results. One possible reason is that grouping-based methods depend heavily on the initial over-segmentation. The boundaries of small objects are generally difficult to preserve well when the segmentation is not accurate enough, whereas this problem has little impact on larger objects. Therefore, region-grouping-based approaches usually exhibit a desirable ability to recall relatively large objects, but may fail to recall small ones.
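The average recall (AR) reported in these comparisons integrates recall over IoU thresholds [40]. A minimal discrete approximation (hypothetical code, averaging recall over thresholds 0.5, 0.55, ..., 0.95 rather than computing the exact integral) might look like:

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def average_recall(gt_boxes, proposals, thresholds=None):
    """Discrete approximation of AR: mean recall over a grid of
    IoU thresholds between 0.5 and 0.95."""
    if thresholds is None:
        thresholds = [0.5 + 0.05 * i for i in range(10)]
    # best IoU that any proposal achieves for each ground-truth box
    best = [max((iou(g, p) for p in proposals), default=0.0)
            for g in gt_boxes]
    return sum(
        sum(b >= t for b in best) / len(best) for t in thresholds
    ) / len(thresholds)
```

Binning `gt_boxes` by pixel area before calling this function yields the size-stratified AR curves of Figure 9.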
C. Object Detection Performance
Since most state-of-the-art object detectors rely on object proposals as a first preprocessing step, it is essential to evaluate the final detection performance with proposals generated by different methods. In this subsection, we conduct experiments to analyze the quality of proposals for object detection tasks. To this end, we use the Fast R-CNN detection framework [29] with both CaffeNet [44] and VGG-Net [30] as the benchmarks.

TABLE I. Comparison of object detection performance (per-category AP and mAP over the 20 PASCAL VOC categories) with proposals generated by different methods. All detectors are learned by Fast R-CNN on the PASCAL VOC 2007 trainval set and tested on the PASCAL VOC 2007 test set. The upper part presents the results using Fast R-CNN with CaffeNet, and the lower part shows those using Fast R-CNN with VGG-Net.

Fig. 9. Comparison of average recall (AR) with respect to different sizes of ground-truth objects on the PASCAL VOC 2007 test set. All AR rates are computed with the top-ranked 500 proposals per image. Best viewed in color.

The detectors are trained using the PASCAL VOC 2007 trainval set and tested using the test set for all experiments here. We compare EB, SS, MCG2015, and RPN with the proposed method. For a fair comparison, we select the top-1000 proposals for all methods in both the training and testing stages. The mean average precision (mAP) and the average precision (AP) for each of the 20 categories are shown in Table I. Our proposed method achieves the best mAPs of 58.6% and 69.0% using Fast R-CNN with CaffeNet and VGG-Net, respectively, outperforming the other state-of-the-art methods. This also verifies the effectiveness of our method for detection tasks.
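The per-category AP values in Table I follow the standard ranked-detection protocol. As a hedged illustration (a hypothetical helper, not the VOC devkit itself), AP for one class can be computed as the area under the interpolated precision-recall curve obtained by sweeping down the ranked detection list:

```python
def average_precision(scores, is_true_positive, num_gt):
    """AP for one class: area under the precision-recall curve, with
    precision made monotonically non-increasing before integration."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    tp = fp = 0
    precisions, recalls = [], []
    for i in order:                      # sweep detections by score
        if is_true_positive[i]:
            tp += 1
        else:
            fp += 1
        precisions.append(tp / (tp + fp))
        recalls.append(tp / num_gt)
    # interpolate: precision at recall r is the max precision at any r' >= r
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    ap, prev_r = 0.0, 0.0
    for p, r in zip(precisions, recalls):
        ap += p * (r - prev_r)           # integrate over recall
        prev_r = r
    return ap
```

mAP is then simply the mean of this quantity over the 20 categories.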
D. Generalization to Unseen Categories
In addition, we conduct experiments on the ImageNet 2015 validation set to further evaluate the generalization ability to a wider range of object categories. Note that all learning-based models (e.g., RP, RPN, and ours) are trained on the PASCAL VOC dataset and directly tested on the ImageNet 2015 validation set without re-training. The comparison of the experimental results is shown in Figure 10. Our method performs comparably with MCG2015 and surpasses the other methods. No obvious deterioration in performance is observed on the ImageNet 2015 validation set, suggesting that our method does not exclusively fit the 20 specific object categories of PASCAL VOC. In other words, our method is capable of capturing generic objectness information and generalizing to unseen categories. In addition, most state-of-the-art methods achieve performance similar to that on PASCAL VOC, while RPN suffers a severe performance drop. One possible reason is that RPN exploits category information to learn class-specific detectors, which makes it overfit the 20 object categories of PASCAL VOC.
E. Evaluation of Randomized Merging Algorithm
In this subsection, we evaluate the contribution of the proposed randomized merging algorithm. We compare the performance of the conventional greedy merging algorithm and the proposed randomized merging algorithm on PASCAL VOC 2007. In the first setting, we allow the randomized merging algorithm to be performed only once in each recursion step; the experimental results are shown in Table II. Greedy merging and one-time randomized merging achieve comparable results. However, as the number of randomized passes increases, our merging algorithm generates more diverse proposals. We therefore conducted experiments comparing the results obtained with different numbers of randomized passes, with the number of object proposals fixed at 500 and 1000, respectively. Figure 11 clearly shows that the AR improves as the number of random passes increases and eventually approaches saturation. This approach provides a notable gain in recall rates compared to the greedy merging strategy. The recall rate at an IoU threshold of 0.5 first stays fixed and then drops as the number of random passes increases, while that at 0.8 improves consistently. This suggests that the predicted objectness scores are not accurate enough for proposals with small IoU values with the ground-truth bounding boxes.

Another critical issue is whether our method achieves stable performance, since we introduce randomness into the merging process. To clarify this, we conduct four groups of experiments and repeat the randomized merging process five times for each group. As shown in Figure 12, our method exhibits great stability in recall rates and average recall with different numbers of object proposals.

TABLE II. Comparison of greedy merging and randomized merging on the PASCAL VOC 2007 test set. We report the results using the top-ranked proposals. R@0.5 and R@0.8 indicate the recall rates at IoU thresholds of 0.5 and 0.8, respectively, and AR is the average recall.

Method   R@0.5   R@0.8   AR
Greedy   0.872   0.423   0.489
Random   0.870   0.405   0.478

Fig. 10. Comparison of our proposed method and other state-of-the-art approaches on the ImageNet 2015 validation set. Best viewed in color.

Fig. 11. Comparison of recall rates with different numbers of randomized passes on the PASCAL VOC 2007 test set. We report the results of both the top 500 and top 1000 proposals for a fair comparison. Note that the number of proposals generated by one-time randomized merging is less than 1000, so we cannot provide this result here. Best viewed in color.

Fig. 12. Comparison of recall rates in four groups of randomized merging experiments on the PASCAL VOC 2007 test set. We report the results using the top-ranked 500, 1000, and all of the proposals. R@0.5 and R@0.8 mean the recall rates at IoU thresholds of 0.5 and 0.8, respectively, and AR is the average recall. Best viewed in color.
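The effect of repeating the randomized merging can be illustrated with a toy sketch. Below, regions form a chain, `similarity[i]` is a stand-in for the learned ReNN score of merging regions i and i+1, and after a merge the old adjacency scores are simply reused; these simplifications are assumptions for illustration, not the paper's actual algorithm. Pooling proposals over several randomized passes can only enlarge the proposal set, mirroring the diversity gain observed in Figure 11:

```python
import random

def randomized_merge(similarity, top_k=2, rng=random):
    """One randomized bottom-up merging pass over a chain of regions.
    At each step we sample one of the `top_k` highest-scoring adjacent
    pairs instead of always taking the maximum, and every merged group
    is recorded as a candidate proposal."""
    regions = [frozenset([i]) for i in range(len(similarity) + 1)]
    sims = list(similarity)
    proposals = set()
    while sims:
        ranked = sorted(range(len(sims)), key=lambda i: -sims[i])
        i = rng.choice(ranked[:top_k])  # the randomness in the greedy search
        merged = regions[i] | regions[i + 1]
        proposals.add(merged)
        regions[i:i + 2] = [merged]
        del sims[i]  # drop the consumed pair; neighbouring scores are reused
    return proposals

def pooled_proposals(similarity, repeats, seed=0):
    """Union of the proposal sets produced by several randomized passes."""
    rng = random.Random(seed)
    pool = set()
    for _ in range(repeats):
        pool |= randomized_merge(similarity, rng=rng)
    return pool
```

Setting `top_k=1` recovers the deterministic greedy baseline, while larger `top_k` and more repeats trade extra computation for a more diverse proposal pool.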
F. Efficiency Analysis
In this subsection, we compare the efficiency of our model with that of the state-of-the-art methods. The execution time of MHS [26] is taken directly from [26], as its code is not available. The deep-learning-based methods (i.e., ours and RPN) are run on a single NVIDIA TITAN X GPU, and the rest are run on a desktop with an Intel i7 3.4 GHz CPU and 16 GB RAM. The average running times of all the methods for generating 1,000 proposals on the PASCAL VOC 2007 dataset are reported in Table III. Window-scoring methods achieve relatively high computational efficiency because they use very simple features and efficient scoring. Among the region-grouping methods, RP, GOP, and LPO run slightly faster than our method, but their performance is much inferior to ours on both the PASCAL VOC and ImageNet datasets (see Figures 6, 7, and 10). MCG2015 achieves comparable results, but it is extremely time-consuming. It is noteworthy that our method achieves the best performance among all methods while maintaining quite high running efficiency. Specifically, our method takes about 0.2 s for feature extraction and about 0.5 s for one randomized merging pass. The randomized merging is repeated 8 times, so the total running time is 4.2 s per image.

TABLE III. Comparison of the average running time (seconds per image) for generating 1,000 proposals and the average recall (AR) on the PASCAL VOC 2007 test set.

Type              Method         Time   AR (%)
Window scoring    BING [14]      0.2    27.38
                  RPN [12]       0.2    48.19
                  EB [13]        0.3    50.30
Region grouping   RP [9]         1.0    46.30
                  GOP [19]       1.0    49.38
                  LPO [27]       1.1    50.98
                  MHS [26]       2.8    52.08
                  SS [8]         10.0   51.91
                  CADM [10]      22.0   52.30
                  MCG2015 [43]   30.0   57.45
                  Ours           4.2

VI. CONCLUSION
In this paper, we have presented a simple yet effective approach to hierarchically segmenting object proposals by developing a deep architecture of recursive neural networks. We incorporate similarity metric learning into the bottom-up region merging process for end-to-end training, rather than manually designing various representations. In addition, we introduce randomness into the greedy search to cope with the ambiguity in the process of merging regions, making the inference more robust against noise. Extensive experiments on standard benchmarks demonstrate the superiority of our approach over state-of-the-art approaches. The effectiveness of our method for real detection systems is also verified. In future work, the proposed framework can be tightly combined with category-specific object detection methods.

ACKNOWLEDGEMENT
This work was supported by the State Key Development Program under Grant 2016YFB1001004, the National Natural Science Foundation of China under Grant 61622214, the Science and Technology Planning Project of Guangdong Province under Grant 2017A020208041, the Special Program of the NSFC-Guangdong Joint Fund for Applied Research on Super Computation (the second phase), and the Guangdong Natural Science Foundation Project for Research Teams under Grant 2017A030312006.

REFERENCES

[1] R. Girshick, J. Donahue, T. Darrell, and J. Malik, "Rich feature hierarchies for accurate object detection and semantic segmentation," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2014, pp. 580–587.
[2] K. He, X. Zhang, S. Ren, and J. Sun, "Spatial pyramid pooling in deep convolutional networks for visual recognition," in Proc. Eur. Conf. Comput. Vis., Sep. 2014, pp. 346–361.
[3] Y. Wei, W. Xia, M. Lin, J. Huang, B. Ni, J. Dong, Y. Zhao, and S. Yan, "HCP: A flexible CNN framework for multi-label image classification," IEEE Trans. Pattern Anal. Mach. Intell., vol. 38, no. 9, pp. 1901–1907, 2016.
[4] Z. Wang, T. Chen, G. Li, R. Xu, and L. Lin, "Multi-label image recognition by recurrently discovering attentional regions," in Proc. IEEE Int. Conf. Comput. Vis., 2017, pp. 464–472.
[5] T. Chen, Z. Wang, G. Li, and L. Lin, "Recurrent attentional reinforcement learning for multi-label image recognition," in Proc. AAAI Conf. Artif. Intell., 2018, pp. 6730–6737.
[6] Y. J. Lee and K. Grauman, "Learning the easy things first: Self-paced visual category discovery," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2011, pp. 1721–1728.
[7] M. Cho, S. Kwak, C. Schmid, and J. Ponce, "Unsupervised object discovery and localization in the wild: Part-based matching with bottom-up region proposals," arXiv preprint arXiv:1501.06170, 2015.
[8] J. R. Uijlings, K. E. van de Sande, T. Gevers, and A. W. Smeulders, "Selective search for object recognition," Int. J. Comput. Vis., vol. 104, no. 2, pp. 154–171, 2013.
[9] S. Manen, M. Guillaumin, and L. Van Gool, "Prime object proposals with randomized Prim's algorithm," in Proc. IEEE Int. Conf. Comput. Vis., Dec. 2013, pp. 2536–2543.
[10] Y. Xiao, C. Lu, E. Tsougenis, Y. Lu, and C.-K. Tang, "Complexity-adaptive distance metric for object proposals generation," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2015, pp. 778–786.
[11] R. Socher, C. C. Lin, C. Manning, and A. Y. Ng, "Parsing natural scenes and natural language with recursive neural networks," in Proc. Int. Conf. Mach. Learn., Jun. 2011, pp. 129–136.
[12] S. Ren, K. He, R. Girshick, and J. Sun, "Faster R-CNN: Towards real-time object detection with region proposal networks," arXiv preprint arXiv:1506.01497, 2015.
[13] C. L. Zitnick and P. Dollár, "Edge boxes: Locating object proposals from edges," in Proc. Eur. Conf. Comput. Vis., Sep. 2014, pp. 391–405.
[14] M.-M. Cheng, Z. Zhang, W.-Y. Lin, and P. Torr, "BING: Binarized normed gradients for objectness estimation at 300fps," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2014, pp. 3286–3293.
[15] B. Alexe, T. Deselaers, and V. Ferrari, "What is an object?" in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2010, pp. 73–80.
[16] B. Alexe, T. Deselaers, and V. Ferrari, "Measuring the objectness of image windows," IEEE Trans. Pattern Anal. Mach. Intell., vol. 34, no. 11, pp. 2189–2202, 2012.
[17] Z. Zhang, J. Warrell, and P. H. Torr, "Proposal generation for object detection using cascaded ranking SVMs," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2011, pp. 1497–1504.
[18] J. Long, E. Shelhamer, and T. Darrell, "Fully convolutional networks for semantic segmentation," arXiv preprint arXiv:1411.4038, 2014.
[19] P. Krähenbühl and V. Koltun, "Geodesic object proposals," in Proc. Eur. Conf. Comput. Vis., Sep. 2014, pp. 725–739.
[20] J. Carreira and C. Sminchisescu, "CPMC: Automatic object segmentation using constrained parametric min-cuts," IEEE Trans. Pattern Anal. Mach. Intell., vol. 34, no. 7, pp. 1312–1328, 2012.
[21] P. Arbelaez, J. Pont-Tuset, J. Barron, F. Marques, and J. Malik, "Multiscale combinatorial grouping," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2014, pp. 328–335.
[22] P. Rantalankila, J. Kannala, and E. Rahtu, "Generating object segmentation proposals using global and local search," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2014, pp. 2417–2424.
[23] M. Bergh, G. Roig, X. Boix, S. Manen, and L. Gool, "Online video seeds for temporal window objectness," in Proc. IEEE Int. Conf. Comput. Vis., Dec. 2013, pp. 377–384.
[24] L. Lin, G. Wang, W. Zuo, X. Feng, and L. Zhang, "Cross-domain visual matching via generalized similarity measure and feature learning," IEEE Trans. Pattern Anal. Mach. Intell., vol. 39, no. 6, pp. 1089–1102, 2017.
[25] S.-Z. Chen, C.-C. Guo, and J.-H. Lai, "Deep ranking for person re-identification via joint representation learning," IEEE Trans. Image Process., vol. 25, no. 5, pp. 2353–2367, 2016.
[26] C. Wang, L. Zhao, S. Liang, L. Zhang, J. Jia, and Y. Wei, "Object proposal by multi-branch hierarchical segmentation," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2015, pp. 3873–3881.
[27] P. Krähenbühl and V. Koltun, "Learning to propose objects," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2015, pp. 1574–1582.
[28] P. F. Felzenszwalb and D. P. Huttenlocher, "Efficient graph-based image segmentation," Int. J. Comput. Vis., vol. 59, no. 2, pp. 167–181, 2004.
[29] R. Girshick, "Fast R-CNN," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2015, pp. 1440–1448.
[30] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," arXiv preprint arXiv:1409.1556, 2014.
[31] T. Chen, L. Lin, L. Liu, X. Luo, and X. Li, "DISC: Deep image saliency computing via progressive representation learning," IEEE Trans. Neural Netw. Learn. Syst., 2016.
[32] L. Lin, K. Wang, W. Zuo, M. Wang, J. Luo, and L. Zhang, "A deep structured model with radius–margin bound for 3D human activity recognition," Int. J. Comput. Vis., vol. 118, no. 2, pp. 256–273, 2016.
[33] L. Lin, K. Wang, D. Meng, W. Zuo, and L. Zhang, "Active self-paced learning for cost-effective and progressive face identification," IEEE Trans. Pattern Anal. Mach. Intell., vol. 40, no. 1, pp. 7–19, 2018.
[34] T. Chen, L. Lin, R. Chen, Y. Wu, and X. Luo, "Knowledge-embedded representation learning for fine-grained image recognition," in Proc. Int. Joint Conf. Artif. Intell., 2018, pp. 627–634.
[35] Z. Wang, T. Chen, J. Ren, W. Yu, H. Cheng, and L. Lin, "Deep reasoning with knowledge graph for social relationship understanding," in Proc. Int. Joint Conf. Artif. Intell., 2018, pp. 2021–2018.
[36] B. Taskar, D. Klein, M. Collins, D. Koller, and C. D. Manning, "Max-margin parsing," in Proc. EMNLP, 2004, p. 3.
[37] L. Bottou, "Stochastic gradient descent tricks," in Neural Networks: Tricks of the Trade. Springer, 2012, pp. 421–436.
[38] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman, "The PASCAL visual object classes (VOC) challenge," Int. J. Comput. Vis., vol. 88, no. 2, pp. 303–338, 2010.
[39] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein et al., "ImageNet large scale visual recognition challenge," Int. J. Comput. Vis., pp. 1–42, 2014.
[40] J. Hosang, R. Benenson, P. Dollár, and B. Schiele, "What makes for effective detection proposals?" arXiv preprint arXiv:1502.05082, 2015.
[41] X. Chen, H. Ma, X. Wang, and Z. Zhao, "Improving object proposals with multi-thresholding straddling expansion," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2015, pp. 2587–2595.
[42] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell, "Caffe: Convolutional architecture for fast feature embedding," in Proc. ACM Multimedia, Nov. 2014, pp. 675–678.
[43] J. Pont-Tuset, P. Arbelaez, J. T. Barron, F. Marques, and J. Malik, "Multiscale combinatorial grouping for image segmentation and object proposal generation," arXiv preprint arXiv:1503.00848, 2015.
[44] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Proc. Adv. Neural Inf. Process. Syst., Dec. 2012, pp. 1097–1105.
Tianshui Chen received the B.E. degree from the School of Information and Science Technology, Sun Yat-sen University, Guangzhou, China, in 2013, where he is currently pursuing the Ph.D. degree in computer science with the School of Data and Computer Science. His current research interests include computer vision and machine learning.
Liang Lin (M'09, SM'15) is the Executive R&D Director of SenseTime Group Limited and a full Professor at Sun Yat-sen University. He is an Excellent Young Scientist of the National Natural Science Foundation of China. From 2008 to 2010, he was a Post-Doctoral Fellow at the University of California, Los Angeles. From 2014 to 2015, he was a senior visiting scholar with The Hong Kong Polytechnic University and The Chinese University of Hong Kong. He currently leads the SenseTime R&D teams in developing cutting-edge and deliverable solutions in computer vision, data analysis and mining, and intelligent robotic systems. He has authored and co-authored more than 100 papers in top-tier academic journals and conferences (e.g., 12 papers in TPAMI/IJCV and 50+ papers in CVPR/ICCV/NIPS/IJCAI). He has been serving as an associate editor of IEEE Trans. Human-Machine Systems, The Visual Computer, and Neurocomputing. He has served as Area/Session Chair for numerous conferences such as ICME, ACCV, and ICMR. He was the recipient of the Best Paper Diamond Award at IEEE ICME 2017, the Best Paper Runners-Up Award at ACM NPAR 2010, a Google Faculty Award in 2012, the Best Student Paper Award at IEEE ICME 2014, and the Hong Kong Scholars Award in 2014. He is a Fellow of IET.
Xian Wu received the B.E. degree from the School of Software, Sun Yat-sen University, Guangzhou, China, in 2015, where he is currently pursuing the M.Sc. degree in Software Engineering with the School of Data and Computer Science. His current research interests include computer vision and machine learning.
Nong Xiao received the B.S. and Ph.D. degrees in computer science from the College of Computer at the National University of Defense Technology (NUDT), China, in 1990 and 1996, respectively. He is currently a professor in the State Key Laboratory of High Performance Computing at NUDT, China. His current research interests include large-scale storage systems, network computing, and computer architecture. He has more than 130 publications to his credit in journals and international conferences including IEEE TSC, IEEE TMM, JPDC, JCST, HPCA, ICCAD, MIDDLEWARE, MSST, IPDPS, CLUSTER, SYSTOR, and MASCOTS. He is a member of the IEEE and ACM.