Ego-Object Discovery
Image and Vision Computing
Marc Bolaños a, Petia Radeva a,b
a Universitat de Barcelona, Gran Via de les Corts Catalanes, 585, Barcelona 08007, Spain
b Computer Vision Center, Building O Campus UAB, Bellaterra (Barcelona) 08193, Spain
ABSTRACT: Lifelogging devices are spreading faster every day. This growth can bring great benefits if we develop methods for extracting meaningful information about the user wearing the device and his/her environment. In this paper, we propose a semi-supervised strategy for easily discovering objects relevant to the person wearing a first-person camera. Given an egocentric video/image sequence acquired by the camera, our algorithm uses both the appearance extracted by means of a convolutional neural network and an object refill methodology that allows discovering objects even when they appear rarely in the collection of images. An SVM filtering strategy is applied to deal with the large number of False Positive object candidates found by most state-of-the-art object detectors. We validate our method on a new egocentric dataset of 4912 daily images acquired by 4 persons, as well as on both the PASCAL 2012 and MSRC datasets. On all of them, we obtain results that largely outperform the state-of-the-art approach. We make public both the EDUB dataset and the algorithm code.
1. Introduction
Ubiquitous computing is more present every day in our lives, and with it, lifelogging devices (Hodges et al., 2006; Michael, 2013) are increasing their popularity and spread. By using wearable cameras, we can acquire continuous data about the life of persons and build applications that convert this huge amount of data into meaningful information about their lifestyle. Hence, wearable cameras offer an easy manner to acquire information about our daily life tasks and to extract information about our typical activities and habits (Betancourt et al.) from an egocentric (or first-person) point of view. For example, Fig. 1 shows datasets acquired over three days by 3 different users. We can observe that different persons have different environments. Probably, the most remarkable reason why the differences between the users' datasets can be detected visually is the distribution and aspect of the scenes, objects and people that appear. Following these premises, in this paper we address the problem of automatically discovering which are the usual objects that form the environment of a person wearing the camera, by means of a novel Object Discovery (OD) method.

∗∗ Corresponding author: Marc Bolaños, e-mail: [email protected]. Dataset: https:///s/py8xhalqxz15co3/EDUB%202015.zip?dl= ; code: https://github.com/MarcBS/Ego-Object Discovery/releases

We must note the difference between Object Recognition, where the goal is to discriminate objects according to their classes by a classifier previously trained with a set of training samples;
Fig. 1. Lifelogging sets from 3 users (every 2 rows correspond to a different user). Note how objects help to discriminate different environments. The annotated objects are to be discovered by the object discovery algorithm.

Object Detection, where we should detect the subregion of the image where an object appears; and
Object Discovery, where we have to both detect new object instances or concepts and assign them a label, even without having training examples from all possible classes of objects.
Several works have previously been done in the OD field, some using segmentation techniques (Schulter et al., 2013; Russell et al., 2006), others extracting objects relying on visual words (Russell et al., 2006; Sivic et al., 2005; Liu and Chen, 2007). In (Chatzilari et al., 2011), a semi-supervised method for segmentation-level labeling is presented, and in (Tuytelaars et al., 2010) a comparison of unsupervised OD methods is shown. One of the best performing OD methods is the one published by Lee et al. in (Lee and Grauman, 2011), where the authors propose a semi-supervised approach for object discovery. It starts by selecting the easiest objects with an objectness detector and keeps an iterative discovery procedure: clustering object candidates, selecting the best cluster as the one corresponding to the newly discovered object, and applying a One-Class SVM to discover harder instances of it. The authors use a set of low-level image appearance (texture, colour and shape) and context features. One of its main drawbacks is that the features used are not rich enough to capture the characteristics of any existent real-world object. More recently, in (Kading et al., 2015), a method for object discovery relying on active learning was presented. The authors base their work on the assumption that, when dealing with an active learning problem, the oracle does not always know all the classes in advance and that, furthermore, not all the classes are always interesting for the problem at hand. With this in mind, they propose an Expected Model Output Change (EMOC) criterion for selecting the most relevant and useful images to label for the problem they are addressing, while at the same time trying to avoid non-valid objects by using a local density measure. Cho et al. in (Cho et al., 2015) worked on part-based object discovery by proposing a new probabilistic matching strategy (Probabilistic Hough Matching) based on HOG descriptors for finding similar objects in different images.
Additionally, they propose an associated confidence for finding the most outstanding object in each image.
In egocentric data, object discovery has been studied to a much lesser extent. There, OD brings new challenges considering the non-intentionality of the images; that is, compared to usual intentional images, the objects and people (if any) usually do not appear in centered positions, and partial occlusions produced by other objects or the image border are quite frequent. In (Kang et al., 2011), the authors define a method for finding new objects that a person can encounter in their daily living. They start by applying a segmentation of the images at different levels, extracting colour, texture and shape information from each segment, and applying a series of grouping and refinement steps to find consistent clusters that can represent new concepts. The authors in (Fathi et al., 2011) develop an object recognition method that uses segmentation techniques for extracting objects from egocentric visual data. In this case, the data are captured using head-mounted cameras with high temporal resolution (about 30 fps), which makes it impossible to record the whole day of the person (due to memory and battery constraints). In order to solve this problem, we use cameras with low temporal resolution (2-3 frames per minute) that are worn at chest level to maximize the user's comfort. As a result, we obtain a collection of images instead of a video, where objects are captured non-intentionally and frequently appear blurred and non-centred.
The main additional challenges these cameras cause are: 1) frames so widely spaced in time rule out directly inferring information from sequential frames, and 2) extracted motion information is not reliable enough.
The main handicaps of existent OD methods are: 1) they lack a way to capture and reuse the knowledge acquired when analyzing previous data, which is very important considering the redundancy of the data acquired in lifelogging (Min et al., 2014), and 2) many OD methods rely, as a first step, on an object detection algorithm like (Alexe et al., 2010; Cheng et al., 2014; Arbeláez et al., 2014; Uijlings et al., 2013) to obtain an initial set of object candidates. As we prove in section 3.2, these methods usually produce a very high number of False Positives (FP) that must be dealt with.
In this paper, we propose a new OD method for egocentric data (based on our previous work presented in (Bolaños et al., 2015)), which we call Ego-Object Discovery (EOD). Our contributions start by using a set of powerful features extracted by means of a Convolutional Neural Network (CNN). These networks are proving their huge potential to address different problems in the field of Computer Vision ((Honglak et al., 2009a,b; Goodfellow et al., 2014), just to mention a few). Lately, a new method (Moghimi et al., 2014) using CNN data has been proposed for egocentric activity recognition. However, no methods on OD using these features exist yet. To overcome the problem of nonexistent knowledge reuse present in previous works, we use a new Refill methodology, which allows discovering new samples from the categories even when they have a low number of instances, as is quite common in egocentric sequences. As an additional contribution w.r.t. our previous work, we here present a strategy for dealing with the high number of FPs (or 'No Object' candidates) produced by the object detection methods: an SVM filtering strategy.
We also introduce the first egocentric object discovery dataset (EDUB) of lifelogging data with ground truth (GT) object segmentations, apply a comparison of state-of-the-art object detection algorithms, and analyze the results of our method also on two public datasets of intentional images (PASCAL and MSRC).
The article is organized as follows: in section 2, we define the EOD algorithm. In section 3, we present the datasets used to validate our method, the tests of EOD on all datasets, a comparison of state-of-the-art object detectors, and a discussion of the obtained results. We finish with some conclusions and future work.
2. The Ego-Object Discovery Approach
Given the problem of OD in low-temporal-resolution egocentric data, our algorithm is formulated as an iterative procedure. At the beginning, it must be provided with a seed of initial object information to expand, defined as a small bag of labeled objects represented by their regions and called a bag of refill. The EOD algorithm passes through several steps (see Fig. 2): a) it detects image regions representing object candidates and their corresponding objectness scores from each new set of images, b) extracts object candidate features by using a pre-trained CNN, c) filters false object ('No Object') instances, and d) proceeds with a clustering-based iterative procedure as follows: 1) on the easiest objects, it applies a refill strategy by using the bag of refill, 2) clusters them by using an agglomerative clustering approach and labels the best cluster, which represents the newly discovered object, and 3) applies a supervised expansion to find harder instances of it. After a fixed number of t iterations, or when no easy sample remains, it outputs the set of found object coordinates and labels.
To describe and cluster the candidates, EOD uses both appearance and local context features. Appearance features are extracted by a CNN (Jia, 2013), and context is provided both by the inherent description of the object background that the CNN also extracts, and indirectly by the refill procedure, which introduces instances of the same classes but with different backgrounds. This is very suitable for lifelogging images, considering the redundancy of the objects we routinely see. In the following subsections, we give details about each step of the EOD procedure.

Object Candidates Generation:
The first step needed to characterize the environment of the user through object discovery is extracting a set of object candidates for each image. To do so, we used the Objectness detector provided by Ferrari et al. in (Alexe et al., 2010), which, in addition to the bounding box of each candidate, outputs a score associated with the probability of it being a true object (the objectness score). This score is produced by three visual cues:
Multi-scale Saliency (finds blob-like structures at multiple scales that could indicate the presence of an object);
Color Contrast (finds high colour differences between the analyzed bounding box and its surroundings); and Superpixels Straddling (penalizes bounding boxes that do not respect the boundaries of the superpixels in the image).
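To illustrate how the three cues contribute to one score, the toy function below uses a naive weighted sum. This is only a sketch: the real detector of (Alexe et al., 2010) integrates the cues in a Bayesian framework, so the weights and the linear combination here are illustrative assumptions.

```python
import numpy as np

def toy_objectness(ms_saliency, color_contrast, superpixel_straddling,
                   weights=(0.4, 0.3, 0.3)):
    """Naive weighted combination of the three objectness cues.

    Each cue is assumed to be normalized to [0, 1]; the weights are
    made-up values, not the original detector's learned combination."""
    cues = np.array([ms_saliency, color_contrast, superpixel_straddling])
    return float(np.dot(weights, cues))

# A blob-like, high-contrast box that respects superpixel boundaries
# should outscore a box that straddles superpixels.
good = toy_objectness(0.9, 0.8, 0.9)
bad = toy_objectness(0.3, 0.4, 0.1)
```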
Object Candidates Characterization:
As features to cluster the object candidates, we used a pre-trained CNN (Krizhevsky et al., 2012), which was trained on millions of images and is composed of a succession of convolutional and pooling layers. We deleted the last layer, which offers a supervised classification into 1,000 ImageNet classes, and used the output of the penultimate layer as our features (4096 variables). Note that our approach is different from the one of (Lee and Grauman, 2011), which used LAB histograms for extracting colour information, Pyramid HOG for extracting shape information, and Spatial Pyramid Matching (Lazebnik et al., 2006) for extracting texture information.

Fig. 2. Ego-Object Discovery methodology scheme. The different algorithms applied in each part of the methodology are represented in orange.

False Objects Filtering:
The main drawback of most object detection methods is the huge number of FPs they produce. Given that relying on the objectness score alone is not enough for discarding the 'No Object' instances, we filter the object candidates with an RBF-SVM classifier trained on CNN features to distinguish 'Object' vs. 'No Object' instances.
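The characterization and filtering steps above can be sketched together as follows. This is a toy stand-in, not the authors' code: a small random two-layer network replaces the pre-trained CNN (so it only illustrates "drop the classification layer, keep the penultimate activations"), and the synthetic crops, dimensions and SVM parameters are all assumptions.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)

# --- Characterization: toy stand-in for the pre-trained CNN. ---
# Random weights; 64-d penultimate layer instead of the paper's 4096-d.
W1, W2 = rng.normal(size=(20, 32)), rng.normal(size=(32, 64))

def penultimate_features(x):
    h = np.maximum(x @ W1, 0)        # hidden layer (ReLU)
    return np.maximum(h @ W2, 0)     # penultimate layer -> the descriptor

# --- Filtering: RBF-SVM telling 'Object' from 'No Object'. ---
# Synthetic crops drawn from two shifted distributions so the filter
# has something learnable (purely illustrative data).
obj_crops = rng.normal(loc=0.5, size=(40, 20))
bg_crops = rng.normal(loc=-0.5, size=(40, 20))
X = penultimate_features(np.vstack([obj_crops, bg_crops]))
y = np.array([1] * 40 + [0] * 40)    # 1 = Object, 0 = No Object

flt = SVC(kernel="rbf", gamma="scale").fit(X, y)
train_acc = flt.score(X, y)
```

Candidates predicted as class 0 would be discarded before clustering.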
Easiest Objects Selection:
In order to achieve an iterative easy-first discovery, we use the associated objectness score to decide whether a candidate ω is considered in the current iteration:

objectnessScore(ω) > μ + ω₁σ − ω₂t,   (1)

where μ and σ are, respectively, the mean and the standard deviation of all scores, t is the current iteration, and ω₁ and ω₂ are weights. This easiness measure is a promising method for obtaining object candidates in general. However, it does not obtain the same results on egocentric datasets as on intentional images, because the images are not captured by a person looking at objects of the world, but are acquired non-intentionally while the person is loosely wearing the camera. As a result of the inherently low frequency of appearance of the different objects of the real world, the limited image quality of wearable egocentric devices, and the constant movement of the user, a great part of the photos are unclear, dark or blurry (see Fig. 1). All this causes lower precision when clustering the obtained object candidates.

Refill Strategy:
In order to solve these problems, we define a "refill" methodology as follows: at each iteration, the set of selected easiest samples is completed with a certain percentage (w.r.t. the number of easy samples retrieved) of samples from the Bag of Refill, which are randomly chosen labeled samples distributed over the already discovered object classes. In this way, we address two problems: 1) the difficulty of forming a cluster from a very small set of class instances, and 2) the difficulty of linking samples of the same class that are blurry and unclear. By refilling the space with more samples of the same object classes, we can obtain more compact clusters (see Fig. 3 and Fig. 4).

Fig. 3. Clusters formed by the easiest samples. Fig. 4. Clusters formed by the refilled and easiest samples.
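The easiness criterion of Eq. (1) and the refill step can be sketched together as below. The weight values, the 30% refill ratio and the class names are illustrative assumptions, not the paper's tuned values.

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy unlabeled pool: one objectness score and one feature vector per candidate.
scores = rng.uniform(size=200)
feats = rng.normal(size=(200, 16))

def easiest(scores, t, w1=0.5, w2=0.02):
    """Eq. (1): a candidate is 'easy' at iteration t iff its score exceeds
    mu + w1*sigma - w2*t, so the bar drops as iterations advance
    (w1 and w2 are made-up weight values)."""
    mu, sigma = scores.mean(), scores.std()
    return np.where(scores > mu + w1 * sigma - w2 * t)[0]

def refill(easy_feats, bag_of_refill, ratio=0.3, rng=rng):
    """Complete the easy set with ~ratio*len(easy) labeled samples drawn
    at random from the bag of refill, spread over discovered classes."""
    labeled = [(f, c) for c, fs in bag_of_refill.items() for f in fs]
    n = min(int(np.ceil(ratio * len(easy_feats))), len(labeled))
    picks = rng.choice(len(labeled), size=n, replace=False)
    extra = np.array([labeled[i][0] for i in picks])
    extra_labels = [labeled[i][1] for i in picks]
    return np.vstack([easy_feats, extra]), extra_labels

easy = easiest(scores, t=1)
bag = {"hand": rng.normal(size=(6, 16)), "glass": rng.normal(size=(6, 16))}
pool, extra_labels = refill(feats[easy], bag)
```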
Clustering and Hard Instances Classification:
In this step, we apply Agglomerative Ward clustering to the object candidates. Once the clusters are formed, we compute the Silhouette Coefficient (Tan and Steinbach, 2011) of each cluster and select the best one for the user to assign it a label. This coefficient is calculated only on the unlabeled samples, never using the refilled ones, when selecting the most reliable cluster. At the end of each iteration, a One-Class SVM is built from the new cluster to search for harder instances, and the rest of the easy samples are classified with it.
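A minimal sketch of this clustering-labeling-expansion step, using synthetic descriptors in place of CNN features; the cluster layout and the nu/gamma values are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import silhouette_samples
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(5)

# Three well-separated synthetic "object classes" standing in for CNN descriptors.
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(20, 8))
               for c in (-3.0, 0.0, 3.0)])

# Ward-linkage agglomerative clustering, as in the paper.
labels = AgglomerativeClustering(n_clusters=3, linkage="ward").fit_predict(X)

# Pick the cluster with the highest mean Silhouette Coefficient as the one
# to be labeled (the paper computes it on unlabeled samples only; here we
# use all samples for brevity).
sil = silhouette_samples(X, labels)
best = max(set(labels.tolist()), key=lambda k: sil[labels == k].mean())

# One-Class SVM trained on the best cluster, used to pull in harder instances.
ocsvm = OneClassSVM(gamma="scale", nu=0.1).fit(X[labels == best])
center = X[labels == best].mean(axis=0)
harder = center + rng.normal(scale=0.3, size=(5, 8))  # same-class lookalikes
accepted = int((ocsvm.predict(harder) == 1).sum())
```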
3. Results
In this section, we discuss the three datasets we used (summarizing their characteristics in Table 1) and present the different tests applied to illustrate the EOD performance.
Due to the low number of publicly available egocentric datasets, and the complete lack of egocentric object-labeled datasets, we considered it very important to construct one and make it public, in order to serve as a basis for algorithm comparison within the egocentric community. The
Egocentric Dataset of the University of Barcelona (EDUB) contains 4912 images acquired by 4 people on different days, 2 days per person. The objects appearing in the images were segmented using the online tool LabelMe (Russell et al., 2008) (although here we only use their bounding boxes), and their annotation files are similar to the ones provided by PASCAL. EDUB includes the following classes (the number of samples per class is given in parentheses): 'lamp' (2299), 'tvmonitor' (1274), 'hand' (1232), 'person' (1175), 'glass' (831), 'building' (732), 'face' (565), 'aircon' (530), 'sign' (506), 'cupboard' (392), 'paper' (377), 'car' (315), 'bottle' (260), 'door' (199), 'chair' (179), 'mobilephone' (145), 'window' (138), 'dish' (65), 'motorbike' (64), 'bicycle' (12), and 'train' (4). Note that in our tests we did not use the classes with few instances (i.e. fewer than 100), considering that it would not be possible to discover them with a clustering strategy.

Fig. 5. Object candidates obtained by Ferrari's objectness detector on the EDUB dataset. From left to right and top to bottom: aircon, bottle, building, car, chair, sign, cupboard, door, face, glass, hand, tvmonitor, lamp, mobilephone, paper, person, window.
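The pruning of rare classes described above can be sketched as follows; the counts come from the EDUB class list and the 100-instance threshold is the one stated in the text.

```python
# Per-class instance counts from the EDUB description.
counts = {"lamp": 2299, "tvmonitor": 1274, "hand": 1232, "person": 1175,
          "glass": 831, "building": 732, "face": 565, "aircon": 530,
          "sign": 506, "cupboard": 392, "paper": 377, "car": 315,
          "bottle": 260, "door": 199, "chair": 179, "mobilephone": 145,
          "window": 138, "dish": 65, "motorbike": 64, "bicycle": 12,
          "train": 4}

# Classes below this count are too rare to be found by clustering.
MIN_INSTANCES = 100
used = {c: n for c, n in counts.items() if n >= MIN_INSTANCES}
```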
The second of the datasets we considered is the
PASCAL VOC 2012 (Everingham et al., 2012), one of the most widely used datasets in object detection/recognition research, with very difficult and challenging images. We used the 'trainval' set of images (to have more samples) for our tests, but previously deleted the images it has in common with its 2007 version. We applied this pre-processing to avoid any bias in the results, since some of the object detection methods used were trained on PASCAL VOC 2007.

Table 1. Image/object characteristics (images, object candidates, GT objects, classes) for each of the used datasets: MSRC, PASCAL, EDUB.
The third dataset is the Microsoft Research Cambridge (MSRC) dataset (Lee and Grauman, 2005), which was also used in (Lee and Grauman, 2011) for object discovery and therefore eases the comparison of results. Considering that the MSRC dataset is labeled at pixel level, we had to extract the bounding boxes corresponding to each of the objects, making some assumptions: 1) the bounding box of an object is the minimal closing box around all the connected pixels that belong to the same class; 2) given that the dataset is split in folders, we only considered valid the objects with the same class as the folder's name; 3) the minimal area for an object to be valid was set to 50x50 image pixels (about 0.81% of the whole image); and 4) we excluded the labels 'grass', 'sky', 'mountain', 'water' and 'road', because they are not objects, but rather environments. Fig. 1 and 6 show some image samples from the 3 datasets. MSRC, compared to the other two, should yield better results due to the position of the objects (central in the image) and their clear appearance. Even though PASCAL has some object instances that are very difficult to find, the hardest dataset is EDUB (also considering the high rate of object occlusions, blurriness and lower image quality).

Fig. 6. MSRC image samples (top) and PASCAL 12 samples (bottom).
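The bounding-box extraction (assumption 1 plus the minimum-area filter of assumption 3) can be sketched as below. For brevity this sketch boxes all pixels of a class at once, rather than per connected component as the paper does.

```python
import numpy as np

def mask_to_bbox(mask, min_area=50 * 50):
    """Minimal closing box around the labeled pixels of one class,
    discarded if smaller than the 50x50-pixel minimum area."""
    ys, xs = np.nonzero(mask)
    if ys.size == 0:
        return None
    y0, y1, x0, x1 = ys.min(), ys.max(), xs.min(), xs.max()
    if (y1 - y0 + 1) * (x1 - x0 + 1) < min_area:
        return None                  # too small to be a valid object
    return (int(x0), int(y0), int(x1), int(y1))

mask = np.zeros((300, 300), dtype=bool)
mask[40:140, 60:180] = True          # a 100x120 labeled region
box = mask_to_bbox(mask)
```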
Given that the first step of the algorithm is to obtain object candidates from the images, we tested and compared four different state-of-the-art object detection methods on the three datasets (see Table 2). We chose the Objectness (Alexe et al., 2010), BING (Cheng et al., 2014), Multiscale Combinatorial Grouping (MCG) (Arbeláez et al., 2014) and Selective Search (Uijlings et al., 2013) methods considering their good performance. For MCG, we applied its quickest, but less exhaustive, version.
Due to the dramatic increase in the space needed to store all the samples (for the PASCAL 12 dataset alone, nearly 30 GB was needed to store all the images and features for the tests), we extracted the top W = 50 object candidates per image, sorted by their objectness score. Analyzing the percentage of NO (see the overlapping score in section 3.3) and the DR of each method, we can see that the DR differs notably across methods and datasets.
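The top-W candidate selection can be sketched as follows, with synthetic boxes and scores standing in for a real detector's output.

```python
import numpy as np

rng = np.random.default_rng(6)

def top_w(boxes, scores, w=50):
    """Keep only the W highest-objectness candidates per image
    (W = 50 in the paper) to bound storage and computation."""
    order = np.argsort(scores)[::-1][:w]
    return boxes[order], scores[order]

boxes = rng.integers(0, 200, size=(300, 4))   # 300 raw candidate boxes
scores = rng.uniform(size=300)                # their objectness scores
kept_boxes, kept_scores = top_w(boxes, scores)
```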
Table 2. Percentage of 'No Objects' (NO, i.e. of False Positives) and Detection Rate (DR) of the four object detection methods (Objectness, BING, MCG, Selective Search) on our three datasets (MSRC, PASCAL, EDUB).
As one could immediately expect from looking at the images, it is clearly easier for any objectness measure to get good results on the MSRC dataset, while it is considerably more difficult on PASCAL and EDUB, the latter presenting an extra difficulty due to the non-intentional acquisition and less clear images of the wearable cameras. Given our final goal of discovering the true distribution of object classes and as many individual GT objects as possible, we considered that the objectness measure that obtained the best results for EOD was the one proposed by (Alexe et al., 2010), because we are interested in getting most of the GT objects in the dataset, even if we have to deal with a lot of NO (i.e. noisy or FP) instances.
In order to perform the methodology validation, we first leave 50% of the object classes in the unlabeled pool as a test set. Note that we need to test whether the algorithm is able to discover unseen object classes. From the remaining classes, similarly to (Lee and Grauman, 2011), we separated 40% of the total object candidates to represent the initial knowledge located in the bag of refill, and used the remaining 60% for testing, too.
In order to decide whether a candidate matches a GT object bounding box, we followed the PASCAL VOC challenge criterion, which uses the Overlapping Score (OS). Given a window region ω produced by the object detector, it is considered a hit on a GT label iff:

OS = |GT ∩ ω| / |GT ∪ ω| > 0.5.   (2)

The parameters of the filtering classifier were validated over a grid of values for σ and C. All tests were performed for each dataset separately and on a randomly selected fraction of its samples to save computational time. With these tests, we finally found that the best parameters for filtering as many NO instances as possible while keeping as many 'Object' instances as possible (high sensitivity and high specificity), for both the PASCAL and the MSRC classifiers, were σ =
100 and C =
3. In the labeling step, for simulation purposes, we labeled the best cluster with a majority voting strategy w.r.t. the GT, although this labeling is intended to be done by the camera user him-/herself.
We designed different test settings to evaluate our proposal:
S1: Features of (Lee and Grauman, 2011).
S2: CNN object features.
S3: CNN object features with Refill strategy.
S4: CNN object features concatenated with CNN scene features, and Refill strategy.
S5: CNN object features with Refill and SVM filter.
S6: CNN object features with Refill, SVM filter and PCA.
With the first pair of settings, we intend to compare the generalization capabilities of the appearance features from (Lee and Grauman, 2011) against the extracted CNN features. In setting S4, we test adding context about the scene, and in setting S6, we apply a PCA feature dimensionality reduction and transformation in case there is redundancy in the extracted CNN features.

Silhouette Coefficient Comparison: In order to check whether the clusters formed using CNN features are more robust than the ones formed using the features from (Lee and Grauman, 2011), we can analyse the mean silhouette coefficient values obtained over several iterations. In Fig. 7, we plot the difference in the silhouette coefficient values obtained using the two kinds of features. The comparison is applied to the top 15 clusters over the first 50 iterations of the algorithm. We can immediately see that the average compactness of the clusters and their difference from the other clusters (which is what the Silhouette Coefficient measures) is always higher when using CNN features, which leads to purer clusters and a better labeling.
To evaluate our approach, we used the F-Measure, because it objectively penalizes the FP and FN objects of each class; that is, it represents a trade-off between the Precision and Recall of the method.
At the same time, we want to give the same importance to all classes and are interested in finding as many different classes as possible, but always leaving the NO instances aside, without considering them in the quality measures. Hence, we applied the average per-class precision and recall defined in (Sokolova and Lapalme, 2009) in order to obtain the average F-Measure:

F-Measure = 2 · Precision_M · Recall_M / (Precision_M + Recall_M),   (3)

where Precision_M and Recall_M are the mean precision and recall over all classes, giving the same weight to all of them. All measures were averaged over at least 5 executions per setting and for a maximum of 100 algorithm iterations. Using these tests, we compared all settings at the end of the easiest-samples discovery (Fig. 8) and on each iteration (Fig. 9).

Fig. 7. Comparison of mean silhouette coefficient (thick lines) and standard deviation (thin lines) for the top 15 clusters over 50 algorithm iterations (higher values are better).
Fig. 8. Final F-Measure for each setting. Fig. 9. F-Measure evolution for each setting.

Looking at Fig. 8, we can clearly see that using CNN features outperforms the features of (Lee and Grauman, 2011), indicating that they can form purer clusters and find a wider variety of classes thanks to their better representation. Adding the Refill technique, the EOD method outperforms the one using CNN features only. The rest of the settings cannot reach the same results as CNN + Refill. Moreover, using the additional CNN features of the whole image just adds noise to the feature set; that is, simply applying the CNN to the bounding box of the object candidate already captures the closest and most relevant object context. Considering the high dimensionality of the CNN features, it seems that applying a PCA dimensionality reduction to the data does not provide any benefit to the object discovery. Comparing the evolution of the F-Measure through the iterations (Fig.
9), we see that any of the settings using CNN features experiences a much higher increase in the F-Measure value in just the first 5-10 iterations, meaning that they can find clusters of true objects more quickly than setting S1. Also, using the CNN features combined with the refill strategy, the results clearly improve from 0.072 to 0.285. This is caused by the discovery of different classes of samples: while using the features of (Lee and Grauman, 2011) we are only able to discover 3 or 4 classes at most, achieving an average F-Measure of 0.072, with setting S3 we can discover instances of more than half of the classes, reaching nearly 0.29 F-Measure. Although on the EDUB setting S5 (CNN + Refill + SVM Filtering) does not seem to reach as good F-Measure results as setting S3, on the other datasets, as we will see, it outperforms or nearly reaches the results of setting S3. Furthermore, it finds a wider variety of object classes.
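The macro-averaged F-Measure of Eq. (3) can be computed as below, with toy per-class precision and recall values.

```python
import numpy as np

def macro_f_measure(per_class_precision, per_class_recall):
    """Eq. (3): harmonic mean of the macro-averaged precision and recall
    (Sokolova and Lapalme, 2009), giving every class the same weight."""
    p_m = float(np.mean(per_class_precision))
    r_m = float(np.mean(per_class_recall))
    return 2 * p_m * r_m / (p_m + r_m)

# Two toy classes: one found well, one barely.
f = macro_f_measure([0.9, 0.1], [0.8, 0.2])
```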
After having found the best combination of methods and parameters, we tested how good the new method is by contrasting it with the state-of-the-art method (Lee and Grauman, 2011) on each of the datasets (EDUB, PASCAL 2012 and MSRC). In Table 3, we can see a summary of the F-Measure results obtained for each of the datasets and each of the best test settings (averages over at least 5 tests per setting).
Table 3. F-Measure comparison on the three datasets (MSRC, PASCAL, EDUB, and their average) between the state of the art (Lee and Grauman, 2011) (S1) and our best test settings, S3 (CNN + Refill) and S5 (CNN + Refill + Filter).

Note that the SVM filters, trained on different datasets than the ones on test, offer better results if we want to stop the discovery method early.
In this section, we analyze the object discovery results in more general terms. In Fig. 10, we can see the absolute number of object instances found by each of the methods, compared to the GT and to the ones found by the Objectness measure ((Alexe et al., 2010), in this case without counting repeated instances of the same object).
Fig. 10. Objects found by each method compared to the GT and to the ones found by the Objectness measure (Alexe et al., 2010).
As we can see, using the parameters of setting S1 (Lee and Grauman, 2011), we are only able to find instances of 3 different classes, which causes the very low F-Measure results seen previously. On the other hand, using either CNN + Refill (setting S3) or CNN + Refill + Filter (setting S5), we can clearly discover objects from a wider variety of classes, which also explains the higher resulting F-Measure. Moreover, we get a wider variety of classes with setting S5 (10 different classes) than with setting S3 (8 different classes).
If we check the discovery order of the classes for each of the methods (see Fig. 12), we can see that some classes are more easily discovered, and repeated over the following iterations, than others. This is caused not only by the number of class instances appearing in the dataset, but also by the previously acquired knowledge (refill), the general method used, and/or the intra-class variability.

Fig. 12. First discovery of the object classes as a function of iterations.
Table 4. Number of clusters found for each class using settings S1, S3 and S5.

Test  NoObject  hand  lamp  cupboard  car  glass  chair  face  door  window  tvmonitor  building  paper  person  mobilephone  sign
S1       96       0     1      2       0     0      0     0     0      0         0          0       1      0          0         0
S3       71       1     0      3       0     1      0     6     0      0         8          0       4      0          0         3
S5       49       2     3      6       0     0      4     5     1      1        23          0       1      5          0         0
Fig. 11. Examples of discovered objects for three different subjects (one row each). Better viewed in digital format.

If we analyse the clusters in which each class is found (see Table 4), we can see that, even though having the same percentage of NO candidates (92.75%), using Grauman's features (setting S1) we get 96% of the clusters labeled as NO, but only 71% of them using CNN + Refill (setting S3). Adding the SVM filtering (setting S5), this gets reduced to 49% of the clusters, thanks to the dramatic reduction of NO instances in the pool of unlabeled samples.
In Fig. 13, we can see the evolution of the GT unique instances discovered by each of the methods over the accumulated iterations (each data point corresponds to an algorithm iteration) w.r.t. the F-Measure obtained by the method.
Fig. 13. Percentage of GT object discoveries accumulated on each iteration w.r.t. the F-Measure obtained.
We can see that using Grauman's features seems to cover a wider variety of object samples than either setting S3 or S5 (about 16% against about 6-7% of the GT samples). This result is probably directly related to the lower F-Measure obtained: due to the lower generalization and representation capabilities of the features used (compared to CNN), the labeled clusters contain a wider variety of samples and objects, causing more unique object instances to be labeled, but at the same time yielding a worse average result.
Fig. 11 shows some examples of objects discovered by our methodology. We can see that it is able to discover instances of the same class even with high intra-class variability (person or hand). Note that some samples are not yet discovered due to the limited number of iterations applied (100).
Regarding the complexity of EOD, it is easy to see that (independently of the length of our feature vectors):
• The objectness score extraction has complexity O(N), N being the number of images in the dataset;
• The SVM filtering has complexity O(N);
• The sorting of the easiest objects is O(N·W·log(N·W)), W being the number of candidates extracted for each image;
• The refill strategy is O(1);
• The CNN feature extraction is O(M), M being the number of easy objects in the current iteration;
• The clustering of easy objects is O(M);
• The best cluster labeling is O(1);
• The one-class SVM cost is O(M).
This leads to a total cost of O(N·W·log(N·W) + M) for each iteration.
4. Conclusions
In this paper, we proposed a novel semi-supervised object discovery algorithm for egocentric data that relies on features extracted from a pre-trained CNN and uses a refill strategy for easily finding the classes with fewer samples. Moreover, we added an SVM filtering strategy for discarding a great part of the high amount of 'No Object' candidates produced by any of the objectness measures. We compared 4 state-of-the-art objectness measures in terms of 'No Object' instances produced and the Detection Rate obtained when extracting a low number of object candidates (W) per image, even for the most difficult ones. Furthermore, we proved that this combined strategy also works better than the previous ones for very noisy and blurry images.
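The filtering idea can be illustrated with a one-class SVM fitted on 'No Object' samples, so that candidates it rejects as outliers (i.e. those that do not resemble background) are kept for discovery. This is a rough sketch, not the paper's exact formulation: the feature dimensionality, kernel and parameters below are invented for the example.

```python
import numpy as np
from sklearn.svm import OneClassSVM

# Illustrative stand-in features: 'No Object' training windows vs. candidates
rng = np.random.default_rng(1)
no_object_feats = rng.normal(0.0, 1.0, size=(300, 16))
candidates = np.vstack([rng.normal(0.0, 1.0, size=(40, 16)),   # background-like
                        rng.normal(4.0, 0.5, size=(40, 16))])  # object-like

# Fit the one-class SVM on 'No Object' samples; candidates predicted as
# outliers (-1) do not look like background, so they are kept
svm = OneClassSVM(kernel="rbf", gamma="scale", nu=0.1).fit(no_object_feats)
keep = svm.predict(candidates) == -1
filtered = candidates[keep]
```

With this setup most background-like candidates are discarded while the object-like ones survive the filter.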
5. Future Work
Our future work involves the following tasks:

1. Define an algorithm to discover objects, scenes and people to characterize the environment of the persons wearing the camera;
2. Propose an iterative and combined scene and object discovery to take advantage of the samples discovered from the complementary categories; and
3. Make the method discriminative, i.e. detect which objects and scenes characterize the environment of a person and distinguish them from those of other people.
Acknowledgments
This work was partially funded by the projects TIN2012-38187-C03-01 and SGR 1219.
References
Alexe, B., Deselaers, T., Ferrari, V., 2010. What is an object?, in: CVPR, Conference on, IEEE, pp. 73–80.
Arbeláez, P., Pont-Tuset, J., Barron, J.T., Marques, F., Malik, J., 2014. Multiscale combinatorial grouping, in: CVPR.
Betancourt, A., Morerio, P., Regazzoni, C.S., Rauterberg, M. The evolution of first person vision methods: A survey.
Bolaños, M., Garolera, M., Radeva, P., 2015. Object discovery using CNN features in egocentric videos, in: Iberian Conference on Pattern Recognition and Image Analysis (in press). Springer.
Chatzilari, E., Nikolopoulos, S., Papadopoulos, S., Zigkolis, C., Kompatsiaris, Y., 2011. Semi-supervised object recognition using Flickr images, in: CBMI, 2011 9th International Workshop on, IEEE, pp. 229–234.
Cheng, M.M., Zhang, Z., Lin, W.Y., Torr, P., 2014. BING: Binarized normed gradients for objectness estimation at 300fps, in: IEEE CVPR.
Cho, M., Kwak, S., Schmid, C., Ponce, J., 2015. Unsupervised object discovery and localization in the wild: Part-based matching with bottom-up region proposals. arXiv preprint arXiv:1501.06170.
Everingham, M., Van Gool, L., Williams, C., Winn, J., Zisserman, A., 2012. The PASCAL visual object classes challenge 2012 (VOC2012) results. http:///challenges/VOC/voc2012/workshop/index.html.
Fathi, A., Ren, X., Rehg, J.M., 2011. Learning to recognize objects in egocentric activities, in: CVPR, 2011 IEEE Conference On, IEEE, pp. 3281–3288.
Goodfellow, I.J., Bulatov, Y., Ibarz, J., Arnoud, S., Shet, V., 2014. Multi-digit number recognition from street view imagery using deep convolutional neural networks. Google Inc., Mountain View, CA.
Hodges, S., Williams, L., Berry, E., Izadi, S., Srinivasan, J., Butler, A., Smyth, G., Kapur, N., Wood, K., 2006. SenseCam: A retrospective memory aid, in: UbiComp 2006: Ubiquitous Computing. Springer, pp. 177–193.
Honglak, L., Roger, G., Rajesh, R., Andrew Y., N., 2009a. Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations. Computer Science Department, Stanford University, Stanford.
Honglak, L., Yan, L., Rajesh, R., Peter, P., Andrew Y., N., 2009b. Unsupervised feature learning for audio classification using convolutional deep belief networks. Computer Science Department, Stanford University, Stanford.
Jia, Y., 2013. Caffe: An open source convolutional architecture for fast feature embedding. http://caffe.berkeleyvision.org/.
Kading, C., Freytag, A., Rodner, E., Bodesheim, P., Denzler, J., 2015. Active learning and discovery of object categories in the presence of unnameable instances, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4343–4352.
Kang, H., Hebert, M., Kanade, T., 2011. Discovering object instances from scenes of daily living, in: Computer Vision (ICCV), 2011 IEEE International Conference on, IEEE, pp. 762–769.
Krizhevsky, A., Sutskever, I., Hinton, G.E., 2012. ImageNet classification with deep convolutional neural networks, in: NIPS, pp. 1097–1105.
Lazebnik, S., Schmid, C., Ponce, J., 2006. Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories, in: CVPR, Computer Society Conference on, IEEE, pp. 2169–2178.
Lee, Y.J., Grauman, K., 2005. Microsoft Research Cambridge object recognition image database. http://research.microsoft.com/en-us/downloads/b94de342-60dc-45d0-830b-9f6eff/default.aspx.
Lee, Y.J., Grauman, K., 2011. Learning the easy things first: Self-paced visual category discovery, in: CVPR, Conference on, IEEE, pp. 1721–1728.
Liu, D., Chen, T., 2007. Unsupervised image categorization and object localization using topic models and correspondences between images, in: ICCV, 11th International Conference on, IEEE, pp. 1–7.
Michael, K., 2013. Wearable computers challenge human rights. ABC Science.
Min, W., Li, X., Tan, C., Mandal, B., Li, L., Lim, J.H., 2014. Effi