Learning From Less Data: A Unified Data Subset Selection and Active Learning Framework for Computer Vision
Vishal Kaushal, Rishabh Iyer, Suraj Kothawade, Rohan Mahadev, Khoshrav Doctor, Ganesh Ramakrishnan
Vishal Kaushal, IIT Bombay, Mumbai, India. [email protected]
Rishabh Iyer, Microsoft, Redmond, Washington, USA. [email protected]
Suraj Kothawade, IIT Bombay, Mumbai, India. [email protected]
Rohan Mahadev, AITOE Labs, Mumbai, India. [email protected]
Khoshrav Doctor, University of Massachusetts, Massachusetts, USA. [email protected]
Ganesh Ramakrishnan, IIT Bombay, Mumbai, Maharashtra, India. [email protected]
Abstract
Supervised machine learning based state-of-the-art computer vision techniques are in general data hungry. Their data curation poses the challenges of expensive human labeling, inadequate computing resources and large experiment turnaround times. Training data subset selection and active learning techniques have been proposed as possible solutions to these challenges. A special class of subset selection functions naturally model notions of diversity, coverage and representation and can be used to eliminate redundancy, thus lending themselves well to training data subset selection. They can also help improve the efficiency of active learning in further reducing human labeling efforts by selecting a subset of the examples obtained using conventional uncertainty sampling based techniques. In this work, we empirically demonstrate the effectiveness of two diversity models, namely the Facility-Location and Dispersion models, for training-data subset selection and for reducing labeling effort. We demonstrate this across the board for a variety of computer vision tasks including Gender Recognition, Face Recognition, Scene Recognition, Object Detection and Object Recognition. Our results show that diversity based subset selection done in the right way can increase the accuracy by up to 5-10% over existing baselines, particularly in settings in which less training data is available. This allows the training of complex machine learning models like Convolutional Neural Networks with much less training data and labeling cost while incurring minimal performance loss.
1. Introduction
Deep Convolutional Neural Network based models are today the state-of-the-art for most computer vision tasks. Seeds of the idea of deep learning were sown around the late 90's [28], and it gained popularity in 2012 when AlexNet [25] won the challenging ILSVRC (ImageNet Large-Scale Visual Recognition Challenge) 2012 competition [46], demonstrating an astounding improvement over the then state-of-the-art image classification techniques. This was soon followed by an upsurge of deep models for computer vision tasks from all over the community. This renewed interest in CNNs is due to the accessibility of large training sets and increased computational power, thanks to GPUs. VGGNet [53] demonstrated that a simpler, but deeper, model can be used to improve accuracies. Then came the even deeper GoogLeNet [59], the first CNN to be fundamentally architecturally different from AlexNet. GoogLeNet introduced the idea that CNN layers didn't always have to be stacked up sequentially. ResNet [20] was even deeper, a phenomenal 152 layer deep network with an incredible error rate of 3.57%, beating humans in the image classification task. State-of-the-art face recognition techniques such as DeepFace [60], DeepID3 [57], Deep Face Recognition [38] and FaceNet [47] also consist of deep convolutional networks. The story is similar for state-of-the-art scene recognition techniques [69] and techniques for other computer vision tasks such as age and gender classification [29]. Similarly, for object detection tasks, the first significant advancement in deep learning was made by RCNNs [17], which was followed by Fast RCNN [16] and then Faster RCNN [44] toward performance improvement. More recently, YOLO [42] and YOLO9000 [43] have emerged as the state-of-the-art in object detection. YOLO's architecture is inspired by GoogLeNet and consists of 24 convolutional layers followed by two fully connected layers. Every coin, however, has two sides, and so is the case with deep learning.
While deeper models are increasingly improving for computer vision tasks, they pose the following challenges: a) increased training complexity and computational costs, b) larger inference time, c) larger experimental turnaround times and difficulty in hyper-parameter tuning, and d) higher costs and more time for labeling. Training complexity and huge data requirements are owing to the depth of the network and the large number of parameters to be learnt. A large deep neural net with a large number of parameters to learn (and hence a large degree of freedom) has a very complex and extremely non-convex error surface to traverse, and thus it requires a great deal of data to successfully search for a reasonably optimal point on that surface. Each of the CNN architectures (AlexNet [25], ZFNet [68], VGGNet and GoogLeNet) was trained over a span of several days (and even weeks in some cases) on a couple of GPUs. Training with several GPUs together can reduce the time taken, but this still results in slow experimental turnaround time, especially for hyper-parameter tuning, which is very important for getting these models to work in practice. Orthogonal to the challenge of training complexity is the challenge of unavailability of labeled data. Human labeling efforts are costly and grow exponentially with the size of the dataset [63]. A lot of data today comes from videos, which have a naturally associated redundancy. In this section, we review existing work addressing the problems of increased training complexity, larger turnaround times, increased complexity for inference (runtime) and increased labeling costs.
Network Architecture Modifications for reducing training/inference time:
One way researchers have addressed this challenge is through architectural modifications to the network. For example, by making significant architectural changes, GoogLeNet improved utilization of the computing resources inside the network, thus allowing for increased depth and width of the network while keeping the computational budget constant. Similarly, by explicitly reformulating the layers as learnt residual functions with reference to the layer inputs, instead of learning unreferenced functions, ResNet allows for easy training of much deeper networks. Highway networks [56] introduce a new architecture allowing a network to be trained directly through simple gradient descent. With the goal of reducing training complexity, some other studies [2, 21, 22] have also focused on model compression techniques. On similar lines, studies like [29] propose simpler network architectures in view of the limited availability of training data.
Transfer Learning for reduced training/inference time:
Other approaches advocate the use of pre-trained models, trained on large datasets and presented as 'ready-to-use', but these rarely work out of the box for domain specific computer vision tasks (such as detecting security sensitive objects). This idea is called Transfer Learning [7]. Transfer learning allows models trained on one task to be easily reused for a different task [37]. Several studies [7, 67, 32] have analyzed the transferability of features in deep convolutional networks across different computer vision tasks. It has been demonstrated that transferring features even from distant tasks can be better than using random features.
Reducing Labeling Costs:
Several approaches have been proposed to reduce labeling costs. In [36] the authors describe a weakly supervised convolutional neural network (CNN) for object classification that relies only on image-level labels, yet can learn from cluttered scenes containing multiple objects to produce their approximate locations. Zero-shot learning [54, 3] and one-shot learning [62] techniques also help address the dearth of annotated examples. In [15], Gavves et al. combine techniques from transfer learning, active learning and zero-shot learning to reuse existing knowledge and reduce labeling effort. Zero-shot learning techniques do not expect any annotated samples, unlike active learning and transfer learning, which assume either the availability of at least a few labeled target training samples or an overlap with existing labels from other datasets. [6] propose a technique combining self-supervised learning tasks (i.e., tasks for which training data can be collected without manual labeling).
Active Learning in Computer Vision:
The core idea behind active learning [48] is that a machine learning algorithm can achieve greater accuracy with fewer training labels if it is allowed to choose the data from which it learns. There have been very few studies on active learning and subset selection for computer vision tasks. Using an information theoretic objective function such as the mutual information between labels, Sourati et al. developed a framework [55] for evaluating and maximizing this objective for batch mode active learning. Another recent study has adapted batch active learning to deep models [9]. Most batch active learning techniques involve the computation of the Fisher matrix, which is intractable for deep networks. Their method relies on a computationally tractable approximation of the Fisher matrix, thus making it relevant in the context of deep learning.
Diversified Data Subset Selection and Active Learning:
A common approach to training data subset selection is to use the concept of a coreset [1], which aims to efficiently approximate various geometric extent measures over a large set of data instances via a small subset. Submodular functions naturally model the notions of diversity, representation and coverage, and hence submodular function optimization has been applied to recommendation systems to recommend relevant and diverse items that explicitly account for the coverage of user interests [52]. Submodular functions form natural models for training data subset selection [65, 66]. In particular, the data-likelihood functions for the Naive Bayes and Nearest Neighbor classifiers turn out to be Feature based and Facility Location functions respectively [65]. Therefore the training data subset selection problem for these classifiers turns out to be a constrained submodular maximization problem.
In this paper, we present a unified framework for data subset selection using two models for data summarization. The first is Facility Location (which models representation) and the second is Minimum Dispersion (which models diversity). We argue for the utility of these functions and intuitively highlight the cases in which one of the models would work better than the other. We subsequently provide four concrete use-cases of our framework. We divide our applications into supervised data selection (where we know the labels), unsupervised data selection (where we have no label information) and active learning.
Application 1: Supervised Data Subset Selection for Quick Training/Inference:
This application investigates the use of data subset selection for quick model training and inference. As an example, we look at kNN classification with features extracted from CNNs. This is particularly relevant here since the inference complexity of this non-parametric classifier is directly proportional to the number of training examples.
Application 2: Supervised Data Subset Selection for Quick Hyper-parameter Tuning:
Another application of data subset selection is the selection of a representative yet smaller subset for hyper-parameter tuning, for faster turnaround time. With this smaller subset (say, 5% of the data), we can run several quicker experiments to find the optimal hyper-parameters. Once the hyper-parameters have been tuned, we can then train on the full data.
Application 3: Unsupervised Data Subset Selection for Labeling from Video Data:
Often we need to custom train models on specific scenarios (e.g. data from self-driving cars), and often this data comes from videos. We can use unsupervised data summarization to get a representative set of frames (from, say, a large video) which can then be labeled to create a training dataset. The role of diversity is clearly evident here, since videos tend to have a lot of redundancy in them.
Application 4: Diversified Active Learning:
Lastly, we use our submodular data subset selection for diversified active learning, wherein we combine active learning (for example, uncertainty sampling) with diversified selection.
We provide insights into the choice of the two summarization models for subset selection. The models have very different characteristics, and we argue for scenarios where we might want to use one over the other. We empirically demonstrate the utility of our framework and all the applications above across the board for several complex computer vision tasks including Gender Recognition, Object Recognition, Object Detection, Face Recognition and Scene Recognition. We show how using the right submodular models can provide as much as 5-10% improvements over existing baselines in each of these four applications. We also point out that the techniques proposed here are orthogonal to related work on transfer learning, zero-shot learning and self-supervised learning. As an example, one of the flavors we consider for diversified active learning does not train an end-to-end CNN but uses transfer learning for training the models. It will be interesting to study how this work can be extended to other flavors including zero-shot learning, semi-supervised learning etc.
2. Data Subset Selection and Active Learning
Given a set V = {1, 2, ..., n} of items, which we also call the Ground Set, define a utility function (set function) f : 2^V → R, which measures how good a subset X ⊆ V is. The goal is then to find a subset X which maximizes f under the constraint that the size of the set is at most k. It is easy to see that maximizing a generic set function becomes computationally infeasible as V grows.

Problem 1: max { f(X) such that |X| ≤ k } (1)

A special class of set functions, called submodular functions [35], however, makes this optimization easy. Submodular functions exhibit a property that intuitively formalizes the idea of "diminishing returns": adding some instance x to a set A provides more gain in terms of the target function than adding x to a larger set A', where A ⊆ A'. Informally, since A' is a superset of A and already contains more information, adding x will not help as much. Using a greedy algorithm to optimize a submodular function (for selecting a subset) gives a lower-bound performance guarantee of a factor of 1 − 1/e of optimal [35] for Problem 1, and in practice these greedy solutions are often within a factor of 0.98 of the optimal [24]. This makes it advantageous to formulate (or approximate) the objective function for data selection as a submodular function. Several diversity and coverage functions are submodular, since they satisfy this diminishing returns property. The ground set V and the items {1, 2, ..., n} depend on the choice of the task at hand. Submodular functions have been used for several summarization tasks including image summarization [61], video summarization [19], document summarization [31], training data summarization and active learning [65] etc. Building on these lines, in this work we demonstrate the utility of subset selection in allowing us to train machine learning models using a subset of training data without significant loss in accuracy.
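To make the greedy optimization of Problem 1 concrete, the following is a minimal Python sketch (our own illustrative code, not from the paper) of cardinality-constrained greedy maximization; f stands for any set function over a ground set of hashable items:

```python
def greedy_maximize(f, ground_set, k):
    """Greedily build a subset X with |X| <= k that (approximately)
    maximizes the set function f, as in Problem 1."""
    X = set()
    for _ in range(k):
        # Pick the element with the largest marginal gain f(X ∪ {j}) - f(X).
        best = max((j for j in ground_set if j not in X),
                   key=lambda j: f(X | {j}) - f(X))
        X.add(best)
    return X
```

For a monotone submodular f, this simple loop already carries the 1 − 1/e guarantee; the lazy greedy and memoization refinements discussed below only speed it up.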
In particular, we focus on the Facility-Location function, which models the notion of representativeness, and the Dispersion function, which models the notion of diversity.
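Before the formal treatment below, these two objectives can be made concrete with a small Python sketch (our own illustrative code, not from the paper; S is a precomputed similarity matrix, e.g. cosine similarities of CNN features, and D a distance matrix over the ground set):

```python
def facility_location(X, S):
    """Representation: f(X) = sum_i max_{j in X} S[i][j].
    Every ground-set item i is credited with its similarity to the
    closest representative in the chosen subset X."""
    return sum(max(S[i][j] for j in X) for i in range(len(S))) if X else 0.0

def dispersion(X, D):
    """Diversity (Disparity-Min): f(X) = min over pairs in X of D[i][j].
    Requires |X| >= 2; only the chosen elements matter, the rest of the
    ground set is ignored."""
    return min(D[i][j] for i in X for j in X if i != j)
```

The contrast discussed in the text is visible directly in the code: dispersion never touches items outside X, while facility location sums over the whole ground set.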
Representation:
Representation based functions attempt to directly model representation, in that they try to find a representative subset of items, akin to centroids and medoids in clustering. The Facility-Location function [34] is closely related to k-medoid clustering. Denote s_ij as the similarity between images i and j. We can then define f(X) = Σ_{i∈V} max_{j∈X} s_ij. For each image i, we find the representative in X which is closest to i and add up these similarities over all images. Note that this function requires computing an O(n²) similarity matrix. However, as shown in [64], we can approximate this with a nearest neighbor graph, which requires much less storage and also runs much faster for large ground set sizes.
Diversity Models:
Diversity based functions attempt to obtain a diverse set of keypoints. The goal is to have minimum similarity across the elements in the chosen subset, achieved by maximizing the minimum pairwise distance between elements. There is a subtle difference between the notion of diversity and the notion of representativeness. While diversity only looks at the elements in the chosen subset, representativeness also worries about their similarity with the remaining elements in the superset. Denote d_ij as a distance measure between images i and j. Define a set function f(X) = min_{i,j∈X} d_ij. This function, called the Dispersion function, is not submodular, but can still be efficiently optimized via a greedy algorithm [5]. A common choice of diversity model used in the literature is the determinantal point process [26], defined as p(X) = Det(S_X), where S is a similarity kernel matrix and S_X denotes the rows and columns of S instantiated with the elements in X. It turns out that f(X) = log p(X) is submodular, and hence can be efficiently optimized via the greedy algorithm. However, unlike the Dispersion function, this requires computing a determinant, which is O(n³) where n is the size of the ground set. This function is not computationally feasible at large scale, and hence we do not consider it in our experiments. It is easy to see that maximizing the Dispersion function involves obtaining a subset with maximal minimum pairwise distance, thereby ensuring a diverse subset of snippets or key-frames. The Dispersion function is also called the Disparity-Min function (we shall use both interchangeably in this paper).
Optimization Algorithms:
For cardinality constrained maximization (Problem 1), a simple greedy algorithm provides a near optimal solution. Starting with X_0 = ∅, we sequentially update X_{t+1} = X_t ∪ {argmax_{j∈V∖X_t} f(j|X_t)}, where f(j|X) = f(X ∪ j) − f(X) is the gain of adding element j to set X. We run this until t = k and |X_t| = k, where k is the budget constraint. It is easy to see that the complexity of the greedy algorithm is O(n k T_f), where T_f is the complexity of evaluating the gain f(j|X). This simple greedy algorithm can be significantly sped up via the lazy greedy algorithm [33]. The idea is that instead of recomputing f(j|X_t) for every j ∉ X_t, we maintain a priority queue of gains ρ(j), ∀ j ∈ V, with ρ(j) initially set to f(j). In each step, the algorithm extracts the element j ∉ X_t with the largest ρ(j) and recomputes its gain f(j|X_t); if this recomputed gain is still at least as large as ρ of every other element, we add j to X_t (this is valid thanks to submodularity, since the stale ρ values upper-bound the true gains). Otherwise, we update ρ(j) to f(j|X_t) and re-sort the priority queue. The complexity of this algorithm is roughly O(k n_R T_f), where n_R is the average number of re-sorts in each iteration. Note that n_R ≤ n, while in practice it is a constant, thus offering almost a factor n speedup compared to the simple greedy algorithm. One of the components in the lazy greedy algorithm is T_f, which involves evaluating f(X ∪ j) − f(X). One option is a naive implementation that computes f(X ∪ j) and f(X) separately and takes the difference. However, due to the greedy nature of the algorithm, we can use memoization and maintain precompute statistics p_f(X) at a set X, using which the gain can be evaluated much more efficiently. At every iteration, we evaluate f(j|X) using p_f(X), which we call f(j|X, p_f). We then update p_f(X ∪ j) after adding element j to X. Table 1 provides the precompute statistics, as well as the computational gain, for the Facility Location and Dispersion functions.
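A hedged sketch of the lazy greedy algorithm with memoization, specialized to the Facility Location function (all names are ours; p holds the precompute statistics p_f(X), i.e. each item's best similarity to the current subset):

```python
import heapq

def lazy_greedy_facility_location(S, k):
    """Lazy greedy maximization of f(X) = sum_i max_{j in X} S[i][j].
    Stale gains in the heap act as upper bounds thanks to submodularity."""
    n = len(S)
    p = [0.0] * n                                # p_f(X): best similarity so far per item

    def gain(j):                                 # f(j | X, p_f) via memoization
        return sum(max(p[i], S[i][j]) - p[i] for i in range(n))

    heap = [(-gain(j), j) for j in range(n)]     # max-heap of (possibly stale) gains
    heapq.heapify(heap)
    X = []
    while len(X) < k and heap:
        _, j = heapq.heappop(heap)
        g = gain(j)                              # re-evaluate the stale gain
        if heap and g < -heap[0][0]:             # no longer the best: push back
            heapq.heappush(heap, (-g, j))
            continue
        X.append(j)                              # accept j and update p_f
        for i in range(n):
            p[i] = max(p[i], S[i][j])
    return X
```

Each memoized gain evaluation costs O(n) on this implementation, as opposed to recomputing f from scratch over the growing subset.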
In particular, it is easy to see that evaluating f(j|X) naively is much more expensive than evaluating f(j|X, p_f). The following theorem provides the approximation guarantees of the greedy algorithm for the Facility Location and Dispersion functions:

Theorem 1 [35, 5] The greedy algorithm is guaranteed to obtain an approximation factor of 1 − 1/e for Problem 1 when f is the Facility Location function. Similarly, the greedy algorithm achieves an approximation factor of 1/2 when f is the Dispersion function. When f is a linear combination of the Facility Location and Dispersion functions, we obtain an approximation factor of 1/2.

Active learning can be implemented in three flavors [48]. The first is batch active learning: there is a single round of data selection, and the data points to be labeled are chosen without any knowledge of the resulting labels. The second is adaptive active learning: there are many rounds of data selection, each of which selects one data point whose label may be used to select the data points in future rounds. The third flavor is mini-batch adaptive active learning: a hybrid scheme where in each round a mini-batch of data points is selected to be labeled, which may inform the determination of future mini-batches. In our work, we focus on the mini-batch adaptive active learning scheme. As far as the query framework in active learning is concerned, the simplest and most commonly used framework is uncertainty sampling [30], wherein the active learner queries the instances about whose labels it is least certain. Getting a label for such an instance from the oracle is the most informative for the model. Another, more theoretically motivated query selection framework is the query-by-committee (QBC) algorithm [51]. In this approach, a committee of models is maintained, all trained on the current labeled set but representing competing hypotheses.
Each committee member is then allowed to vote on the labeling of query candidates. The most informative query is considered to be the instance about which they most disagree. There are other frameworks like Expected Model Change [50], Expected Error Reduction [45], Variance Reduction [4] and Density-Weighted methods [49]. In this paper, we use uncertainty sampling as the active learning algorithm.
There are three common ways to compute the uncertainty u of an unlabeled data instance, where p_i is the probability of class i and C is the set of classes:

u = 1 − max_i p_i (2)

u = 1 − (max_i p_i − max_{j ∈ C, j ≠ argmax_i p_i} p_j) (3)

u = − Σ_i p_i log(p_i) (4)

The goal of uncertainty sampling is to pick a batch of unlabeled data instances with maximum uncertainty.
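The three uncertainty measures in Eqs. (2)-(4) are straightforward to implement; a minimal sketch (our own function names) over a class-probability vector:

```python
import math

def least_confidence(p):
    """Eq. (2): one minus the probability of the most likely class."""
    return 1.0 - max(p)

def margin_uncertainty(p):
    """Eq. (3): one minus the margin between the top two class probabilities."""
    top, second = sorted(p, reverse=True)[:2]
    return 1.0 - (top - second)

def entropy_uncertainty(p):
    """Eq. (4): Shannon entropy of the predicted class distribution."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)
```

All three measures peak for a uniform distribution over classes and vanish for a one-hot prediction, which is exactly the behavior uncertainty sampling relies on.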
3. Applications of our Framework
In this section, we describe the applications of our Data Subset Selection Framework from Section 2. We empirically establish the utility of this framework on four different data selection and active learning applications.

f(X) | p_f(X) | f(j | X, p_f) | C_o | C_p
Σ_{i∈V} max_{k∈X} s_ik | [max_{k∈X} s_ik, i ∈ V] | Σ_{i∈V} [max(p_f^i(X), s_ij) − p_f^i(X)] | O(n|X|) | O(n)
min_{k,l∈X, k≠l} d_kl | min_{k,l∈X, k≠l} d_kl | min{p_f(X), min_{k∈X} d_kj} − p_f(X) | O(|X|²) | O(|X|)

Table 1. List of the precompute statistics p_f(X), the gain f(j | X, p_f) evaluated using the precomputed statistics, the cost C_o of evaluating the function without memoization, and the cost C_p with memoization, for the Facility Location and Dispersion functions. It is easy to see that memoization saves an order of magnitude in computation.

Figure 1. Blue (Facility Location), Red (Disparity Min) and, wherever applicable, Orange (Random). The top two figures show the results for application 1 (kNN classification) on the Face and Gender datasets. The bottom left plot shows the results for application 2 (hyper-parameter tuning) with a 5% subset for various sets of hyper-parameters (ImageNet), and the bottom right plot shows the results for application 3 on the video dataset (Video Surveillance Object).

As a first application, we evaluate the efficiency of our subset selection methods in enabling learning from less data while incurring minimal loss in performance. Towards this, we apply Facility-Location and Dispersion model based subset selection for training a nearest neighbor classifier (kNN). For a given dataset we run several experiments to evaluate the accuracy of kNN, each time trained on a different sized subset of the training data. We take subsets of sizes from 5% to 100% of the full training data, with a step size of 5%. For each experiment we report the accuracy of kNN over the hold-out data.
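The evaluation protocol above can be sketched as follows (our own illustrative code, with a 1-NN classifier for brevity instead of the k=5 used in the experiments, and a pluggable select(X, y, m) routine standing in for Facility-Location, Dispersion or random selection):

```python
def nn_predict(train_X, train_y, x):
    """1-nearest-neighbour prediction with squared Euclidean distance."""
    i = min(range(len(train_X)),
            key=lambda j: sum((a - b) ** 2 for a, b in zip(train_X[j], x)))
    return train_y[i]

def accuracy_vs_subset_size(X, y, held_X, held_y, select,
                            steps=range(5, 101, 5)):
    """Train on subsets of 5%..100% of the data (step 5%) chosen by
    `select`, and report hold-out accuracy at each subset size."""
    results = {}
    for pct in steps:
        m = max(1, len(X) * pct // 100)
        idx = select(X, y, m)
        sub_X, sub_y = [X[i] for i in idx], [y[i] for i in idx]
        correct = sum(nn_predict(sub_X, sub_y, q) == t
                      for q, t in zip(held_X, held_y))
        results[pct] = correct / len(held_X)
    return results
```

Plugging in a random sampler versus a greedy Facility-Location selector reproduces the kind of comparison plotted in Figure 1.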
We also compare these results against kNN trained on randomly selected subsets. Concretely, denote f as a measure of how well a subset X of the training data performs as a proxy for the entire dataset. We then model this as an instance of Problem 1 with f being the Facility-Location and Dispersion functions. Since this is a supervised data subset selection problem, we can also use the label information. We partition the ground set V as {V_1, ..., V_k}, with k = |C| the number of classes. Given a submodular function f, define the label-aware version of f as

f_y(X) = Σ_{i=1}^{k} f(X ∩ V_i) (5)

Another application of our data subset selection framework is to select a representative and diverse subset for quick hyper-parameter tuning. Note that this is an application of supervised data selection, since we know the labels here. We use the Facility Location and Min Dispersion functions to get a subset of the data, and use this subset for tuning hyper-parameters. Note that the subset is typically 5-10% of the entire data, and hence it is much easier to tune the hyper-parameters on this subset. Once the hyper-parameters are tuned, we train the model on the entire data using the tuned hyper-parameters. We expect our summarization models to perform better than simple random sampling, and thereby be more representative for hyper-parameter tuning.

Figure 2. Results for application 4: Active Learning. Accuracies relative to Random: Blue (Facility Location), Red (Dispersion) and Orange (Uncertainty Sampling). Top Left: Adience, Top Right: Cats vs Dogs, Middle Left: Face, Middle Right: ImageNet, Bottom Left: Caltech-101, and Bottom Right: MIT-67. The plots on the top row are obtained using finetuning, while the rest (middle and bottom rows) use transfer learning.
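Returning to the label-aware objective in Eq. (5): it simply evaluates the base function separately on each class partition V_i and sums the values, which keeps the selected subset balanced across classes. A minimal sketch (our own code; f is any set function accepting Python sets):

```python
def label_aware(f, X, partition):
    """Eq. (5): f_y(X) = sum_i f(X ∩ V_i), where partition = [V_1, ..., V_k]
    splits the ground set by class label."""
    return sum(f(X & Vi) for Vi in partition)
```

If f is submodular, each term f(X ∩ V_i) is submodular in X, so f_y, being a sum of submodular functions, is submodular as well, and the same greedy machinery applies unchanged.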
As another application, we consider the problem of labeling massive datasets, specifically when the data comes from videos. As an example, consider the applications discussed in [8], where the authors observe that model customization can substantially improve performance for image classification and object detection tasks. Moreover, the data here often comes from videos, where there is naturally a lot of redundancy. Unlike the supervised data subset selection discussed above, the data here is unlabeled, so we need to apply unsupervised data subset selection. We introduce a surveillance dataset comprising around 20 videos from various scenarios (indoor, courtyards, and outdoor scenarios like roads, traffic etc.) with frames sampled at 1 FPS.
Finally, we study the efficiency of our subset selection methods in the context of mini-batch adaptive active learning to demonstrate savings in labeling effort. We propose a Submodular Active Learning algorithm. The idea is to use uncertainty sampling as a filtering step to remove less informative training instances. After filtering, we obtain a subset F. Denote β as the size of the filtered set F. We then select a subset X by solving Problem 2:

Problem 2: max { f(X) such that |X| ≤ B, X ⊆ F } (6)

where B is the number of instances selected by the batch active learning algorithm at every iteration. In our experiments, we use f as the Facility-Location and Disparity-Min functions. For a rigorous analysis and a fair evaluation, we implement this both in the context of transfer learning as well as fine-tuning. For transfer learning, we extract the features from a pre-trained CNN relevant to the computer vision task at hand and train a Logistic Regression classifier using those features. The basic diversified active learning algorithm is given in Algorithm 1.

Algorithm 1 Submodular Active Learning
Start with a small initial seed labeled set, say L
for each round 1 to T do
    Fine-tune a pre-trained CNN or train a Logistic Regression classifier with the labeled set L
    Report the accuracy of this model over the hold-out data
    Using this model, compute uncertainties of the remaining unlabeled data points U, and select a subset F ⊆ U of the most uncertain data points
    Solve Problem 2 with the Facility-Location and Disparity-Min functions
    Label the selected subset and add it to the labeled set L
end for

The parameters B and β of FASS: B represents the percentage of images that are to be labeled at the end of each round and added to the training set.
β specifies what percentage of the data (sorted in decreasing order of their hypothesized uncertainties by the current model) forms the ground set for subset selection in every round. While selecting the β% most uncertain samples in each round, any other data point having the exact same uncertainty as the last element of this set is also added to the uncertainty ground set.
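Algorithm 1 (uncertainty filtering followed by diversified selection) can be sketched as a generic loop; everything below is our own illustrative scaffolding, with train_fn, uncertainty_fn and select_fn as pluggable hooks, the last standing in for the greedy Facility-Location/Disparity-Min maximization of Eq. (6):

```python
def diversified_active_learning(pool, label_fn, train_fn, uncertainty_fn,
                                select_fn, rounds, B, beta):
    """Mini-batch diversified active learning: each round trains a model,
    keeps the top beta fraction of unlabeled points by uncertainty as the
    filtered ground set F, then labels a diverse batch of B points from F."""
    labeled = {}                       # index -> label (seed set omitted for brevity)
    unlabeled = set(pool)
    for _ in range(rounds):
        model = train_fn(labeled)
        ranked = sorted(unlabeled,
                        key=lambda i: uncertainty_fn(model, i), reverse=True)
        F = ranked[:max(1, int(beta * len(ranked)))]
        batch = select_fn(F, B)        # solve max f(X), |X| <= B, X ⊆ F
        for i in batch:
            labeled[i] = label_fn(i)   # query the oracle for a label
            unlabeled.discard(i)
    return labeled
```

In the experiments, B and β are the FASS parameters described above; here they are plain arguments, and B is a count rather than a percentage.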
4. Experiments and Datasets
In Table 2 we present the details of the different datasets used in the above experiments, along with the train-validate split for each. Of the 8 datasets used, 5 are publicly available and 3 are our custom datasets (FaceData, GenderData and Video Surveillance Object). For application 1, we evaluate kNN with k=5, and in application 2, we perform hyper-parameter tuning on ImageNet with a subset of 5% of the data. For application 3, we use our custom video dataset consisting of 76300 images for object detection with the following classes: Person, Car, Bus, Motorbike, Bicycle and Three-wheeler. Our dataset comprises videos from roads (outdoor), courtyard (outdoor) and office (indoor). Finally, for the application 4 experiments, the number of rounds was experimentally obtained for each experiment as the percentage of data required for the model to reach saturation. The values of B and β were empirically arrived at and are listed in Table 3, along with details about the particular computer vision task and dataset used for each set of experiments. In Table 3, the "Model" column refers to the pre-trained CNN used to extract features and/or for finetuning. For example, "GoogleNet/ImageNet" refers to the GoogleNet model pre-trained on ImageNet data. In the "Parameters" column, B and β refer to the parameters in FASS and are given in percentages. We use the Caffe deep learning framework [23] for all our image classification experiments and DarkNet [42] for object detection.
5. Results
We present the results for application 1 in the top two plots in Figure 1. As hypothesized, using only a portion of the training data (about 40% in our case) we get similar accuracy as with using 100% of the data. Moreover, Facility Location performs much better than random sampling and Disparity Min, showing that the Facility Location function is a good proxy for the nearest neighbor classifier. We see this phenomenon on both the Gender Recognition and Face Recognition tasks. Also note that this aligns with the theoretical results shown in [65].
For application 2, we consider ImageNet [46]. We choose a subset of 5% of the dataset through supervised data subset selection obtained via random sampling, Disparity Min and Facility Location. We use SGD with momentum as the learning algorithm and tune the learning rate and momentum parameters. In particular, we select five sets of values (Set 1 through Set 5) for the learning rate α and momentum µ. Figure 1 (bottom left) shows the results of the 5% subsets obtained via Disparity-Min and Facility Location relative to Random. We notice that the Facility Location function generally has a positive gain compared to random, and except for one of the hyper-parameter sets, Disparity-Min also beats random. We next select the hyper-parameters for random, Disparity Min and Facility Location which obtain the best accuracy on the validation set (Set 1, Set 5 and Set 4 respectively). Using these hyper-parameters, we train on ImageNet using the complete dataset. We observe that the hyper-parameters chosen by Facility Location obtain the best result. The hyper-parameters chosen by Disparity-Min achieve around 52.5% top-5 accuracy, while the ones chosen by Facility Location achieve close to 59.5%. In comparison, the random hyper-parameters achieve a top-5 accuracy of 42%.
All these results use the GoogleNet architecture.

In the case of application 3, we perform unsupervised data subset selection. The task here is to obtain, for labeling, a subset of images from a larger dataset whose frames are taken from videos. We consider the problem of object detection using YOLOv2 [42]. We obtain subsets of various sizes using unsupervised video summarization and compare the results to random. The results are shown in the bottom right of Figure 1. We see that Disparity Min has the best results, followed closely by Facility Location. Both these methods achieve almost a 5% improvement in mAP compared to a random subset. This is expected, since Disparity Min tends to pick a diverse set of images for training, thereby ensuring a good mix between the classes, whereas random sampling is not aware of the redundancy in the images. The fact that Disparity-Min models diversity also suggests that diversity is more important here than representation.

Finally, we compare the different models for application 4. First, we notice that the subset selection techniques and uncertainty sampling almost always outperform random sampling. Moreover, different subset selection techniques work well for different problem contexts, with the intuition behind these results described below.

Name | NumClasses | NumTrain | NumTrainPerClass | NumHoldOut | NumHoldOutPerClass
Adience [10, 11] | 2 | 1614 | 790(F), 824(M) | 807 | 395(F), 412(M)
Caltech-101 [27, 14] | 101 | 7900 | 40-56 | 1246 | 8-15
MIT-67 [40, 41] | 67 | 10438 | 68-490 | 5159 | 32-244
CatsVsDogs [13, 12] | 2 | 16668 | 8334 | 8332 | 4166
ImageNet [46] | 1000 | 1.2M | 1000 | 50000 | 50
Video Surveillance Object | 8 | 76300 | 5000-12000 | 10211 | 500-3000
GenderData | 2 | 2200 | 1087(F), 1113(M) | 548 | 249(F), 299(M)
FaceData | 255 | 2345 | 7-8 | 552 | 2-4
Table 2. Details of datasets used in our experiments

Task | Dataset | Model | Parameters
Gender Recognition | Adience | VGGFace/CelebFaces [39] | B = 0., β = 10%
Object Recognition | Cats vs Dogs | GoogleNet/ImageNet [58] | B = 1%, β = 10%
Gender Recognition | FaceData | VGGFace/CelebFaces [39] | B = 0., β = 10%
Scene Recognition | MIT-67 | GoogleNet/Places205 [18] | B = 2%, β = 10%
Object Recognition | Caltech-101 | VGGFace/CelebFaces [39] | B = 1%, β = 10%
Object Recognition | ImageNet | GoogleNet/ImageNet [58] | B = 1%, β = 10%
Table 3. Experimental setup for application 4

We present the results for application 4 in Figure 2. Recall that Disparity-Min selects the most diverse elements, while Facility Location picks representative items. Figure 2 (top row) shows the results on the Adience and Cats vs Dogs datasets. We notice that diversity works better than representation in both these cases, particularly when the size of the dataset is small. This is because both problems are two-class classification, and there is a lot of similarity within the dataset (for example, there is more than one image of the same person, or similar-looking cats or dogs). It is therefore not surprising that diversity works very well in these settings. Next, we compare the results on the object recognition, face recognition and scene recognition datasets (middle and bottom rows of Figure 2): ImageNet, Caltech-101, MIT-67 and FaceData. These datasets have many classes and not much redundancy. Disparity Min tends to pick outlier images, which often does not make sense here. In this case, we see that the representation model (Facility Location) works best. In each case, subset selection algorithms on top of uncertainty sampling outperform vanilla uncertainty sampling and random selection.
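Disparity-Min's preference for spread-out elements can be illustrated with the standard farthest-point greedy heuristic for the dispersion objective f(S) = min over pairs i ≠ j in S of d(i, j). The sketch below assumes a precomputed pairwise distance matrix and a budget of at least 2; it is illustrative, not the authors' implementation:

```python
def disparity_min_greedy(dist, budget):
    # Farthest-point greedy heuristic for the Disparity-Min (dispersion)
    # objective. `dist` is assumed to be a symmetric pairwise distance
    # matrix; budget >= 2 is assumed for the seeding step.
    n = len(dist)
    # seed with the pair of points at maximum distance
    i0, j0 = max(((i, j) for i in range(n) for j in range(i + 1, n)),
                 key=lambda p: dist[p[0]][p[1]])
    selected = [i0, j0]
    while len(selected) < budget:
        # add the point whose minimum distance to the current set is largest
        cand = max((k for k in range(n) if k not in selected),
                   key=lambda k: min(dist[k][s] for s in selected))
        selected.append(cand)
    return selected
```

Because each step maximizes the distance to the already-chosen set, the heuristic favors outliers over cluster centers, which matches the behavior discussed above for the many-class datasets.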
Moreover, the effect of diversity and representation diminishes as the number of unlabeled instances decreases, at which point uncertainty sampling already works well. In practice, that is also the point where the accuracy saturates.
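The filter-then-diversify pipeline of FASS described above can be sketched as follows. The entropy scoring is standard uncertainty sampling; the farthest-first diversification step and the function names are illustrative stand-ins for the submodular selection used in the paper:

```python
import math

def fass_select(probs, feats, beta, B, dist):
    # Sketch of a FASS-style pipeline: keep the fraction `beta` of the
    # most uncertain unlabeled points (by predictive entropy), then pick
    # B diverse points from that pool for labeling.
    # `probs[i]` are model class probabilities for point i, `feats[i]`
    # its feature representation, `dist(a, b)` a feature distance.
    def entropy(p):
        return -sum(q * math.log(q) for q in p if q > 0)

    order = sorted(range(len(probs)),
                   key=lambda i: entropy(probs[i]), reverse=True)
    pool = order[:max(B, int(beta * len(probs)))]
    # diversify within the uncertain pool (farthest-first heuristic)
    chosen = [pool[0]]
    while len(chosen) < B:
        chosen.append(max((k for k in pool if k not in chosen),
                          key=lambda k: min(dist(feats[k], feats[s])
                                            for s in chosen)))
    return chosen
```

This mirrors the roles of β (size of the uncertainty-filtered pool) and B (labeling budget per round) from Table 3.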
6. Conclusions and Lessons Learnt
This paper demonstrates the utility of subset selection in training models for a variety of standard computer vision tasks. Subset selection functions like Facility-Location and Disparity-Min naturally capture the notions of representativeness and diversity, and thus help eliminate redundancies in data. We show the practical utility of data subset selection in four applications. The goal of the first application is to use data subset selection to reduce training and inference time. We demonstrate this for kNN classification and show that Facility Location (a representation model) performs best, with considerable improvement over random sampling. Next, we look at an application of hyper-parameter tuning for image classification. We demonstrate this on ImageNet and show that the subset obtained by Facility Location represents the entire dataset better than a random subset. We also show that the best hyper-parameters from tuning performed over the representative subset achieve better accuracy than those from the random subset. We also see, as expected, that the models trained on the Facility Location subset consistently outperform the models trained on random subsets. We then study an application of video data summarization for object detection. Here the diversity measure makes more sense, since there is naturally a lot of redundancy in video datasets. We see consistently that the models trained on the diverse subset beat the random subset by almost 5% in mAP. Finally, we demonstrate the benefit of data summarization with uncertainty sampling. We see that Disparity-Min works best when there is a need for diverse selection (when the dataset tends to have a higher amount of redundancy), while the Facility Location model works best otherwise. In either case, both these models consistently outperform uncertainty sampling alone, which itself generally performs better than random.
7. Acknowledgements
The authors would like to thank Suyash Shetty, Anurag Sahoo, Narsimha Raju and Pankaj Singh for discussions and useful suggestions on the manuscript.

References

[1] P. K. Agarwal, S. Har-Peled, and K. R. Varadarajan. Geometric approximation via coresets. Combinatorial and Computational Geometry, 52:1-30, 2005.
[2] J. Ba and R. Caruana. Do deep nets really need to be deep? In Advances in Neural Information Processing Systems, pages 2654-2662, 2014.
[3] S. Changpinyo, W.-L. Chao, B. Gong, and F. Sha. Synthesized classifiers for zero-shot learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5327-5336, 2016.
[4] D. A. Cohn. Neural network exploration using optimal experiment design. Advances in Neural Information Processing Systems, pages 679-679, 1994.
[5] A. Dasgupta, R. Kumar, and S. Ravi. Summarization through submodularity and dispersion. In ACL (1), pages 1014-1022, 2013.
[6] C. Doersch and A. Zisserman. Multi-task self-supervised visual learning. In The IEEE International Conference on Computer Vision (ICCV), 2017.
[7] J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, and T. Darrell. DeCAF: A deep convolutional activation feature for generic visual recognition. In ICML, volume 32, pages 647-655, 2014.
[8] P. Dubal, R. Mahadev, S. Kothawade, K. Dargan, and R. Iyer. Deployment of customized deep learning based video analytics on surveillance cameras. arXiv preprint arXiv:1805.10604, 2018.
[9] M. Ducoffe, G. Portelli, and F. Precioso. Scalable batch mode optimal experimental design for deep networks.
[10] E. Eidinger, R. Enbar, and T. Hassner. Adience dataset, 2014.
[11] E. Eidinger, R. Enbar, and T. Hassner. Age and gender estimation of unfiltered faces. IEEE Transactions on Information Forensics and Security, 9(12):2170-2179, 2014.
[12] J. Elson, J. R. Douceur, J. Howell, and J. Saul. Asirra: a CAPTCHA that exploits interest-aligned manual image categorization. In ACM Conference on Computer and Communications Security, volume 7, pages 366-374, 2007.
[13] J. Elson, J. R. Douceur, J. Howell, and J. Saul. Asirra: CatsVsDogs dataset, 2007.
[14] L. Fei-Fei, R. Fergus, and P. Perona. Learning generative visual models from few training examples: An incremental Bayesian approach tested on 101 object categories. Computer Vision and Image Understanding, 106(1):59-70, 2007.
[15] E. Gavves, T. Mensink, T. Tommasi, C. G. Snoek, and T. Tuytelaars. Active transfer learning with zero-shot priors: Reusing past datasets for future tasks. In Proceedings of the IEEE International Conference on Computer Vision, pages 2731-2739, 2015.
[16] R. Girshick. Fast R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, pages 1440-1448, 2015.
[17] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 580-587, 2014.
[18] Google. GoogLeNet model trained on Places, 2014.
[19] M. Gygli, H. Grabner, and L. Van Gool. Video summarization by learning submodular mixtures of objectives. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3090-3098, 2015.
[20] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770-778, 2016.
[21] G. Hinton, O. Vinyals, and J. Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.
[22] F. N. Iandola, S. Han, M. W. Moskewicz, K. Ashraf, W. J. Dally, and K. Keutzer. SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5 MB model size. arXiv preprint arXiv:1602.07360, 2016.
[23] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. In Proceedings of the 22nd ACM International Conference on Multimedia, pages 675-678. ACM, 2014.
[24] A. Krause. Optimizing Sensing: Theory and Applications. ProQuest, 2008.
[25] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097-1105, 2012.
[26] A. Kulesza, B. Taskar, et al. Determinantal point processes for machine learning. Foundations and Trends in Machine Learning, 5(2-3):123-286, 2012.
[27] L. Fei-Fei, R. Fergus, and P. Perona. Caltech-101 dataset, 2007.
[28] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278-2324, 1998.
[29] G. Levi and T. Hassner. Age and gender classification using convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 34-42, 2015.
[30] D. D. Lewis and W. A. Gale. A sequential algorithm for training text classifiers. In Proceedings of the 17th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 3-12. Springer-Verlag New York, Inc., 1994.
[31] H. Lin and J. Bilmes. A class of submodular functions for document summarization. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Volume 1, pages 510-520. Association for Computational Linguistics, 2011.
[32] M. Long, Y. Cao, J. Wang, and M. Jordan. Learning transferable features with deep adaptation networks. In International Conference on Machine Learning, pages 97-105, 2015.
[33] M. Minoux. Accelerated greedy algorithms for maximizing submodular set functions. In Optimization Techniques, pages 234-243. Springer, 1978.
[34] P. B. Mirchandani and R. L. Francis. Discrete Location Theory. 1990.
[35] G. L. Nemhauser, L. A. Wolsey, and M. L. Fisher. An analysis of approximations for maximizing submodular set functions I. Mathematical Programming, 14(1):265-294, 1978.
[36] M. Oquab, L. Bottou, I. Laptev, and J. Sivic. Is object localization for free? Weakly-supervised learning with convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 685-694, 2015.
[37] S. J. Pan and Q. Yang. A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering, 22(10):1345-1359, 2010.
[38] O. M. Parkhi, A. Vedaldi, and A. Zisserman. Deep face recognition. In BMVC, volume 1, page 6, 2015.
[39] O. M. Parkhi, A. Vedaldi, and A. Zisserman. VGGFace model pretrained on CelebFaces, 2015.
[40] A. Quattoni and A. Torralba. MIT-67 dataset, 2009.
[41] A. Quattoni and A. Torralba. Recognizing indoor scenes. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pages 413-420. IEEE, 2009.
[42] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 779-788, 2016.
[43] J. Redmon and A. Farhadi. YOLO9000: Better, faster, stronger. arXiv preprint arXiv:1612.08242, 2016.
[44] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, pages 91-99, 2015.
[45] N. Roy and A. McCallum. Toward optimal active learning through Monte Carlo estimation of error reduction. ICML, Williamstown, pages 441-448, 2001.
[46] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211-252, 2015.
[47] F. Schroff, D. Kalenichenko, and J. Philbin. FaceNet: A unified embedding for face recognition and clustering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 815-823, 2015.
[48] B. Settles. Active learning literature survey. University of Wisconsin, Madison, 52(55-66):11, 2010.
[49] B. Settles and M. Craven. An analysis of active learning strategies for sequence labeling tasks. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 1070-1079. Association for Computational Linguistics, 2008.
[50] B. Settles, M. Craven, and S. Ray. Multiple-instance active learning. In Advances in Neural Information Processing Systems, pages 1289-1296, 2008.
[51] H. S. Seung, M. Opper, and H. Sompolinsky. Query by committee. In Proceedings of the Fifth Annual Workshop on Computational Learning Theory, pages 287-294. ACM, 1992.
[52] C. Sha, X. Wu, and J. Niu. A framework for recommending relevant and diverse items.
[53] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[54] R. Socher, M. Ganjoo, C. D. Manning, and A. Ng. Zero-shot learning through cross-modal transfer. In Advances in Neural Information Processing Systems, pages 935-943, 2013.
[55] J. Sourati, M. Akcakaya, J. G. Dy, T. K. Leen, and D. Erdogmus. Classification active learning based on mutual information. Entropy, 18(2):51, 2016.
[56] R. K. Srivastava, K. Greff, and J. Schmidhuber. Training very deep networks. In Advances in Neural Information Processing Systems, pages 2377-2385, 2015.
[57] Y. Sun, D. Liang, X. Wang, and X. Tang. DeepID3: Face recognition with very deep neural networks. arXiv preprint arXiv:1502.00873, 2015.
[58] C. Szegedy. GoogLeNet model trained on ImageNet, 2015.
[59] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1-9, 2015.
[60] Y. Taigman, M. Yang, M. Ranzato, and L. Wolf. DeepFace: Closing the gap to human-level performance in face verification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1701-1708, 2014.
[61] S. Tschiatschek, R. K. Iyer, H. Wei, and J. A. Bilmes. Learning mixtures of submodular functions for image collection summarization. In Advances in Neural Information Processing Systems, pages 1413-1421, 2014.
[62] O. Vinyals, C. Blundell, T. Lillicrap, D. Wierstra, et al. Matching networks for one shot learning. In Advances in Neural Information Processing Systems, pages 3630-3638, 2016.
[63] C. Vondrick, D. Patterson, and D. Ramanan. Efficiently scaling up crowdsourced video annotation. International Journal of Computer Vision, 101(1):184-204, 2013.
[64] K. Wei, R. K. Iyer, and J. A. Bilmes. Fast multi-stage submodular maximization. In ICML, pages 1494-1502, 2014.
[65] K. Wei, R. K. Iyer, and J. A. Bilmes. Submodularity in data subset selection and active learning. In ICML, pages 1954-1963, 2015.
[66] K. Wei, Y. Liu, K. Kirchhoff, C. Bartels, and J. Bilmes. Submodular subset selection for large-scale speech training data. In Acoustics, Speech and Signal Processing (ICASSP), 2014 IEEE International Conference on, pages 3311-3315. IEEE, 2014.
[67] J. Yosinski, J. Clune, Y. Bengio, and H. Lipson. How transferable are features in deep neural networks? In Advances in Neural Information Processing Systems, pages 3320-3328, 2014.
[68] M. D. Zeiler and R. Fergus. Visualizing and understanding convolutional networks. In European Conference on Computer Vision, pages 818-833. Springer, 2014.
[69] B. Zhou, A. Lapedriza, J. Xiao, A. Torralba, and A. Oliva. Learning deep features for scene recognition using places database. In Advances in Neural Information Processing Systems, 2014.