Information Condensing Active Learning
Siddhartha Jain, Ge Liu, David Gifford
Abstract
We introduce Information Condensing Active Learning (ICAL), a batch-mode, model-agnostic Active Learning (AL) method targeted at Deep Bayesian Active Learning that focuses on acquiring labels for points which have as much information as possible about the still unacquired points. ICAL uses the Hilbert-Schmidt Independence Criterion (HSIC) to measure the strength of the dependency between a candidate batch of points and the unlabeled set. We develop key optimizations that allow us to scale our method to large unlabeled sets. We show significant improvements in terms of model accuracy and negative log likelihood (NLL) on several image datasets compared to state-of-the-art batch-mode AL methods for deep learning.
1. Introduction
Machine learning models are widely used for a vast array of real-world problems. They have been applied successfully in a variety of areas including biology (Ching et al., 2018), chemistry (Sanchez-Lengeling & Aspuru-Guzik, 2018), physics (Guest et al., 2018), and materials engineering (Aspuru-Guzik & Persson, 2018). Key to the success of modern machine learning methods is access to high-quality data for training the model. However, such data can be expensive to collect for many problems. Active learning (Settles, 2009) is a popular methodology to intelligently select the fewest new data points to be labeled while not sacrificing model accuracy. The usual active learning setting is pool-based active learning, where one has access to a large unlabeled dataset $D_U$ and uses active learning to iteratively select new points from $D_U$ to label. Our goal in this paper is to develop an active learning acquisition function to select points that maximize the eventual test accuracy, which is also one of the most popular criteria used to evaluate an active learning acquisition function.
MIT CSAIL. Correspondence to: Siddhartha Jain ([email protected]). arXiv [cs.LG], Feb.
In active learning, an acquisition function is used to select which new points to label. A large number of acquisition functions have been developed over the years, mostly for classification (Settles, 2009). Acquisition functions use model predictions or point locations (in input feature or learned representation space) to decide which points would be most helpful to label to improve model accuracy. Acquisition functions then query for the labels of those points and add them to the training set. A natural choice of acquisition function is to acquire labels for points with the highest uncertainty or points closest to the decision boundary. Taking a Bayesian point of view, several acquisition functions select points which give the most knowledge regarding a model's parameters, where knowledge is defined as the statistical dependency between the parameters of the model and the predictions for the selected points. Mutual information (MI) is the usual choice for the dependency, though other choices are possible. The focus for such functions has been the acquisition of one point at a time. However, as each round of label acquisition and retraining of the ML model can be expensive, particularly in the case of deep neural networks, there have been several papers in the past few years that acquire points in batches. To ensure that a batch is diverse, these methods either measure the MI for the entire batch together with respect to the model's parameters or explicitly encourage batch diversity in the input or learned representation space.
Another intuitive strategy is to select points that we expect would provide substantial information about the labels of the rest of the unlabeled set, thus reducing model uncertainty. We show that the popular strategy of acquiring labels for points that maximize the mutual information with respect to the model parameters does not always minimize the uncertainty of the model's predictions averaged over the still unlabeled points post acquisition. This suboptimal uncertainty can negatively affect test accuracy. Motivated by this observation, we propose acquiring a batch of points $B$ such that the model's predictions on $B$ have as high a statistical dependency as possible with the model's predictions on the entire unlabeled set $D_U$. Thus we want a batch $B$ that condenses the most information about the model's predictions on $D_U$. We call our method Information Condensing Active Learning (ICAL). Naively searching over all possible batches to find the optimal one would take an exponential amount of time.
We develop a greedy algorithm to select the batch of points efficiently. Similar greedy approaches have also been explored in the context of feature selection (Da Veiga, 2015; Blanchet et al., 2008).
A key desideratum for our acquisition function is to be model agnostic. This is partly because the model distribution can be very heterogeneous. For example, ensembles, which are often used as a model distribution, can consist of just decision trees in a random forest or of different architectures for a neural network. This means we cannot assume any closed form for the model's predictive distribution, and have to resort to Monte Carlo sampling of the predictions from the model to estimate the dependency between the model's predictions on the query batch and the unlabeled set. Mutual information, however, is known to be hard to approximate using just samples (Song & Ermon, 2019). Thus, to scale the method to larger batch sizes, we use the Hilbert-Schmidt Independence Criterion (HSIC), one of the most powerful extant statistical dependency measures for high-dimensional settings. Another advantage of HSIC is that it is differentiable, which, as we will discuss later in the text, can allow applications of the acquisition function to areas where MI would be difficult to make work.
To summarize, we introduce Information Condensing Active Learning (ICAL), which maximizes the amount of information being gained with respect to the model's predictions on the unlabeled set of points. ICAL is a batch-mode acquisition function that is model agnostic and can be applied to both classification and regression tasks. We then develop an algorithm that can scale ICAL to large batch sizes when using HSIC as the dependency measure between random variables.
2. Related work
A review of work on acquisition functions for active learning prior to the recent focus on deep learning is given by Settles (2009). The BALD (Bayesian Active Learning by Disagreement) (Houlsby et al., 2011) acquisition function chooses the query point which has the highest mutual information with the model parameters. This turns out to be the point for which individual models sampled from the model distribution are confident in their prediction but the overall predictive distribution has high entropy. In other words, this is the point on which the models are individually confident but disagree the most.
Guo & Schuurmans (2008), building on Guo & Greiner (2007), formulate the problem as an integer program in which they select a batch such that the post-acquisition model is highly confident on the training set and has low uncertainty on the unlabeled set. While the latter aspect is related to what we do, they need to retrain their model for every candidate batch they search over in the course of trying to find the optimal batch. As the total number of possible batches is exponential in the size of the unlabeled set, this can get too computationally expensive for neural networks, limiting the applicability of this approach. Thus, as far as we know, Guo & Schuurmans (2008) has only been applied to logistic regression. BMDR (Wang & Ye, 2015) queries points that are as close to the classifier decision boundary as possible while still being representative of the overall sample distribution. The representativeness is measured using the maximum mean discrepancy (MMD) (Gretton et al., 2012) of the input features between the query batch and the set of all points, with a lower MMD indicating a more representative query batch. However, this approach is limited to classification problems as it needs a decision boundary to exist.
BMAL (Hoi et al., 2006) selects a batch such that the Fisher information matrices for the total unlabeled set and the selected batch are as close as possible. The Fisher information matrix is, however, quadratic in the number of parameters and thus infeasible to compute for modern deep neural networks. FASS (Filtered Active Submodular Selection) (Wei et al., 2015) picks the most uncertain points and then selects a subset of those points that is as similar as possible to the whole candidate set, which favors points that can represent the diversity of the initial set of most uncertain points.
Recently, active learning methods have been extended to the deep learning setting. Gal et al. (2017) adapt BALD (Houlsby et al., 2011) to the deep learning setting by using Monte Carlo Dropout (Gal & Ghahramani, 2016) to do inference for their Bayesian Neural Network. BatchBALD (Kirsch et al., 2019) extends BALD to the batch setting for neural networks. Pinsler et al. (2019) adapt the Bayesian Coreset (Campbell & Broderick, 2018) approach for active learning, though their approach requires a batch size that changes for every acquisition. As the neural network decision boundary is intractable, DeepFool (Ducoffe & Precioso, 2018) uses the concept of adversarial examples (Goodfellow et al., 2014) to find points close to the decision boundary. However, this approach is again limited to classification tasks. Sener & Savarese (2017) frame the problem as a core-set selection problem. They try to select points that $\epsilon$-cover the entire dataset, formulating the problem as a set covering problem and solving it using an integer program. FF-Comp (Geifman & El-Yaniv, 2017) also frames the problem as a coreset problem. It builds a batch by repeatedly selecting the point that is farthest from its closest point among those already in the batch.
DAL (Gissin & Shalev-Shwartz, 2019) trains a classifier after every acquisition to try to distinguish between the labeled and unlabeled sets of examples. It then selects the points the classifier is most confident about being unlabeled, based on the idea that those are the points that are least like the training points and thus labeling them should be informative. Finally, BADGE (Ash et al., 2019) samples points which are high magnitude and diverse in a hallucinated gradient space with respect to the last layer of a neural network. All of FF-Comp, DAL, Sener & Savarese (2017), and BADGE, however, operate on the learned representation space, as that is the only way the methods incorporate feedback from the training labels into the active learning acquisition function. They are thus not model-agnostic, as they are not extendable to any model distribution where it is difficult to have a notion of a common representation space (as in a random forest or ensembles with heterogeneous architectures, etc.).
There is also extensive prior work on exploiting Gaussian Processes (GPs) for Active Learning (Houlsby et al., 2011; Krause et al., 2008). However, GPs are hard to scale, especially for modern image datasets.
3. Background
We first define the statistical quantities we rely upon and then describe the acquisition functions we will be using for comparison as well as the baseline.
The entropy of a distribution is defined as $H(X) = -\sum_{x \in \mathcal{X}} p_x \log(p_x)$, where $p_x$ is the probability of $x$. Mutual information (MI) between two random variables is defined as $I[X; Y] = \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x, y) \log\left(\frac{p(x, y)}{p(x)\,p(y)}\right)$, where $p(x, y)$ is the joint probability of $x, y$. Note that $I[X; Y] = H(Y) - H(Y \mid X) = H(X) - H(X \mid Y)$.
A divergence $\Lambda$ between two distributions $P, Q$ is a measure of the discrepancy between them. A key property of a divergence is that it is 0 if and only if $P, Q$ are the same distribution. In this paper, we will be using the KL divergence and the MMD, which are respectively defined as
$$D_{KL}(P \,\|\, Q) = -\sum_{x \in \mathcal{X}} P(x) \log\left(\frac{Q(x)}{P(x)}\right)$$
$$\mathrm{MMD}_k^2(P, Q) = \mathbb{E}\left[k(X, X') + k(Y, Y') - 2k(X, Y)\right] = \|\mu_k(P) - \mu_k(Q)\|_{\mathcal{H}}^2$$
where $X, X' \sim P$ and $Y, Y' \sim Q$ are independent, $k$ is the kernel of the Reproducing Kernel Hilbert Space (RKHS) $\mathcal{H}$, and $\mu_k$ is the mean embedding of the distribution into $\mathcal{H}$ as per the kernel $k$. We can then use the notion of divergence to define the dependency $d$ between a set of random variables $X_{1:n}$ as follows:
$$d(X_{1:n}) = \Lambda(P_{1:n}, \otimes_i P_i)$$
where $P_{1:n}$ is the joint distribution of $X_{1:n}$, $P_i$ is the marginal of $X_i$, and $\otimes_i P_i$ is the product of marginals. For $D_{KL}$ the dependency is exactly MI as defined above. For MMD the dependency is the Hilbert-Schmidt Independence Criterion (HSIC), which we discuss next.
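The relationships between entropy, MI, and the KL divergence stated above can be checked numerically on a small discrete example. The sketch below is only illustrative: the joint distribution `joint` and all function names are hypothetical, and natural logarithms are assumed.

```python
import math

def entropy(p):
    """Shannon entropy (in nats) of a discrete distribution given as a list."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def kl_divergence(p, q):
    """KL(P || Q) for two discrete distributions on the same support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def mutual_information(joint):
    """MI of a joint pmf given as a nested list joint[x][y]."""
    px = [sum(row) for row in joint]
    py = [sum(col) for col in zip(*joint)]
    return sum(
        pxy * math.log(pxy / (px[i] * py[j]))
        for i, row in enumerate(joint)
        for j, pxy in enumerate(row)
        if pxy > 0
    )

# Hypothetical joint distribution of two correlated binary variables.
joint = [[0.4, 0.1],
         [0.1, 0.4]]
px = [sum(row) for row in joint]
py = [sum(col) for col in zip(*joint)]

mi = mutual_information(joint)

# H(Y | X) = sum_x p(x) * H(Y | X = x)
h_y_given_x = sum(
    px[i] * entropy([v / px[i] for v in row]) for i, row in enumerate(joint)
)

# MI written as the KL divergence between the joint and the product of marginals.
flat_joint = [v for row in joint for v in row]
flat_product = [a * b for a in px for b in py]
mi_as_kl = kl_divergence(flat_joint, flat_product)
```

Here `mi`, `entropy(py) - h_y_given_x`, and `mi_as_kl` agree up to floating-point error, which is the sense in which MI is the dependency measure induced by the KL divergence.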
Suppose we have two (possibly multivariate) distributions $X$, $Y$ and we want to measure the dependence between them. A well-known way to measure it is distance covariance, which, intuitively, measures the covariance between the distances of pairs of samples from the joint distribution $P_{XY}$ and the product of marginal distributions $P(X)$, $P(Y)$ (Székely et al., 2007). HSIC can simply be thought of as distance covariance in a kernel space (Sejdinovic et al., 2013b).
More formally, if $X, Y$ are drawn from the joint distribution $P_{XY}$, then their HSIC is defined as
$$\mathrm{HSIC}(P_{XY}, k, l) = \mathbb{E}_{x, x', y, y'}[k(x, x')\, l(y, y')] + \mathbb{E}_{x, x'}[k(x, x')]\, \mathbb{E}_{y, y'}[l(y, y')] - 2\, \mathbb{E}_{x, y}\left[\mathbb{E}_{x'}[k(x, x')]\, \mathbb{E}_{y'}[l(y, y')]\right]$$
where $(x, y)$ and $(x', y')$ are independent pairs drawn from $P_{XY}$. Note that $\mathrm{HSIC}(P_{XY}) = 0$ if and only if $P_{XY} = P_X P_Y$, that is, if $X, Y$ are independent, for characteristic kernels $k$ and $l$.
For the case where we are measuring the joint dependence between $d$ variables, we can use the dHSIC statistic (Sejdinovic et al., 2013a; Pfister et al., 2018). The computational complexity of dHSIC is bounded by the time taken to compute the kernel matrices, which is $O(m^2 d)$, where $m$ is the number of samples and $d$ the number of random variables. We use $\widehat{dHSIC}$ to denote the empirical estimator of the dHSIC statistic.
We give a brief overview of the acquisition functions we compare our method against in this paper. For simplicity we only consider the classification context, though the extension to a regression context is straightforward. Let the batch to acquire be denoted by $\mathcal{B}$ with $B = |\mathcal{B}|$, and let $k$ be the number of Monte Carlo dropout samples. Given a model distribution $M$, training data $D_{train}$, unlabeled data $D_U$, input space $\mathcal{X}$, set of labels $\mathcal{Y}$, and an acquisition function $\alpha(x, M)$, we decide which point to query next via:
$$x^* = \arg\max_{x \in D_U} \alpha(x, M)$$
Max entropy selects the points that maximize the predictive entropy
$$\alpha(x, M) = H(y \mid x, D_{train}) = -\sum_c p(y = c \mid x, D_{train}) \log\left(p(y = c \mid x, D_{train})\right)$$
BatchBALD
BatchBALD (Kirsch et al., 2019) tries to find a batch of points that has the highest mutual information with respect to the model parameters. BALD is the non-batched version of BatchBALD. Formally,
$$\alpha_{BatchBALD}(\{x_1, \ldots, x_B\}, p(\omega)) = H(y_1, \ldots, y_B) - \mathbb{E}_{p(\omega)}\left[H(y_1, \ldots, y_B \mid \omega)\right]$$
Filtered active submodular selection (FASS)
FASS (Wei et al., 2015) samples the $\beta \times B$ most uncertain points $\mathcal{B}'$ and then subselects $B$ points that are as representative of $\mathcal{B}'$ as possible. For the measure of uncertainty, FASS uses the entropy $H(y \mid x, D_{train})$. To measure the representativeness of $\mathcal{B}$ with respect to $\mathcal{B}'$, FASS tries to choose $\mathcal{B}$ to maximize the following function:
$$f(\mathcal{B}) = \sum_{y \in \mathcal{Y}} \sum_{i \in V^y} \max_{s \in \mathcal{B} \cap V^y} w(i, s)$$
Here $V^y \subseteq \mathcal{B}'$ is the set of points in $\mathcal{B}'$ with predicted label $y$, and $w(i, s) = d - \|x_i - x_s\|$ is the similarity function between points indexed by $i, s$, where $x_i, x_s \in \mathcal{X}$ and $d$ is the maximum distance between two points. The idea here is that if a point in $\mathcal{B}$ already exists that is close to some point $x' \in \mathcal{B}'$, then $f(\mathcal{B})$ will favor adding points to the batch that are close to points other than $x'$, thus increasing the batch diversity. Note that FASS is equivalent to Max Entropy if $\beta = 1$.
Bayesian Coresets
Pinsler et al. (2019) try to build a batch such that the log posterior after acquiring that batch best approximates the complete data log posterior (i.e. the log posterior after acquiring the entire pool set). Their approach closely follows the general Bayesian Coreset (Campbell & Broderick, 2018) approach, which constructs a weighted subset of data that approximates the full dataset. Crucially, Pinsler et al. (2019) assume that the posterior predictive distribution $Y_p$ of a point $p$ is independent of the corresponding distribution $Y_{p'}$ of another point $p'$, an assumption we do not make. We show in the next section why avoiding such an assumption lets us more effectively minimize the error with respect to the test distribution, compared to just maximizing information gain for the model posterior. As Pinsler et al. (2019) require a variable batch size whereas all other methods (including ours) use a fixed batch size, for fairness of comparison, if the batch for this approach is smaller than the batch size being used, we fill the rest of the batch with random points. In practice, we only observe this being necessary for CIFAR.
Random
The points are selected uniformly at random from the unlabeled pool. Thus $\alpha(x, M)$ is the uniform distribution.
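Two of the scores described in this section, max entropy and its mutual-information counterpart BALD, can be estimated for a single point from Monte Carlo samples of the class probabilities. The following is a minimal sketch under stated assumptions: the toy pool, the point names, and the function names are all hypothetical, and each inner list stands for one Monte Carlo dropout sample of the predictive distribution.

```python
import math

def entropy(p):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def mean_prediction(mc_probs):
    """Average the class probabilities over the Monte Carlo samples."""
    k = len(mc_probs)
    return [sum(s[c] for s in mc_probs) / k for c in range(len(mc_probs[0]))]

def max_entropy_score(mc_probs):
    """Entropy of the averaged predictive distribution."""
    return entropy(mean_prediction(mc_probs))

def bald_score(mc_probs):
    """Predictive entropy minus the expected entropy of individual samples."""
    expected = sum(entropy(s) for s in mc_probs) / len(mc_probs)
    return max_entropy_score(mc_probs) - expected

# Hypothetical pool: each point has three Monte Carlo samples over two classes.
pool = {
    "confident": [[0.98, 0.02], [0.97, 0.03], [0.99, 0.01]],
    "disagree": [[0.98, 0.02], [0.02, 0.98], [0.97, 0.03]],
    "noisy": [[0.5, 0.5], [0.5, 0.5], [0.5, 0.5]],
}

best_by_entropy = max(pool, key=lambda x: max_entropy_score(pool[x]))
best_by_bald = max(pool, key=lambda x: bald_score(pool[x]))
```

In this toy pool, max entropy prefers the point whose samples are all uniform, while BALD prefers the point on which the sampled models are individually confident but disagree.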
4. Motivation
As mentioned in the Introduction, the intuition behind our method is that we want to acquire points that will give us as much information as possible about the still unlabeled points, thereby increasing the confidence of the model's predictions. As the example below shows, modern active learning methods that pick the point with the most information with respect to the model parameters could in fact end up increasing the average uncertainty of prediction on the still unlabeled data. More formally, if $P(y_x)$ is the pmf or pdf for the prediction $y_x$ on point $x$, then, as the example below shows, the choice of $x' \in U$ that is optimal for acquisition with respect to the model parameters may not be optimal for decreasing $\sum_{x \in U, x \ne x'} H(y_x)$ post acquisition. This can pose a problem if our metric is test set accuracy. If the model is well calibrated, then we should expect worse average entropy (uncertainty) to lead to an increase in the number of errors.
Example 1
Suppose we have a model distribution with 10 possible models $\omega_1, \ldots, \omega_{10}$ with equal prior probability of being the true model ($p(\omega_i) = 0.1$ for all $i$). Let the datapoints be $x_1, \ldots, x_L$ with their labels taking 4 possible values. We define $p^k_{ij} = p(y_i = j \mid x_i, \omega_k)$ as the probability of the $j$th class for the $i$th datapoint given by the $k$th model. Let $p^k_{1j} = 1$ if $j = k$, for $1 \le k \le 3$; $p^k_{14} = 1$ for $4 \le k \le 10$; and $p^k_{i[\ldots]} = [\ldots]$, $p^{[\ldots]}_{i[\ldots]} = 1$ for $1 \le k \le [\ldots]$, $[\ldots] \le i \le L$.
Table 1: Labels that the different points $x_i$ take with probability 1 under different models. The columns are the different models $\omega_k$ ($\omega_1, \ldots, \omega_{10}$), and the rows are the different points ($x_1, x_2, \ldots, x_L$).
Given that we have no other information about the models, we update the posterior probabilities for the models as follows: if a model $\omega_k$ outputs label $l$ for a point $x$ but, after acquisition, the label for $x$ is not $l$, then we know that $\omega_k$ is not the correct model and thus its posterior probability is 0 (so it is eliminated). Otherwise we have no way of distinguishing between the remaining models, so they all have equal posterior probability. Then for $x_1$ the mutual information is $I[y_1, \omega \mid x_1, D_{train}] = H[y_1 \mid x_1] - \mathbb{E}_{p(\omega \mid D_{train})}[H[y_1 \mid x_1, \omega]] = [\ldots]$. For $x_2, \ldots, x_L$, $I[y_{2 \ldots L}, \omega \mid x_{2 \ldots L}, D_{train}] = [\ldots]$. However, selecting $x_1$ would decrease the expected posterior entropy $H[y_{2 \ldots L} \mid x_{2 \ldots L}, x_1, y_1, D_{train}]$ from $[\ldots]$ to only $[\ldots]$. Acquiring any of $x_{2 \ldots L}$ instead of $x_1$, however, would decrease that entropy to 0, which would cause a much larger decrease in the expected posterior entropy averaged over $x_{2 \ldots L}$ if $L$ is large enough. The detailed calculations are in the Appendix.
While $x_{2 \ldots L}$ may not contribute much to the entropy of the joint predictive distribution or to the MI with respect to the model parameters compared to $x_1$, collectively they will be weighted $L - 1$ times more than $x_1$ when looking at the accuracy. We should thus expect a well-calibrated model to have a higher uncertainty, and thus make a lot more errors on $x_{2 \ldots L}$, if $x_1$ is acquired versus if any of $x_{2 \ldots L}$ are acquired. For instance, in the above example, as $L$ increases, the expected error rate would approach $\approx [\ldots]$ (0.7 because 0.3 of the time the value of $x_1$ would also fix what the true model is, reducing the error rate on all $x$ to 0) if $x_1$ is acquired, as the errors for $x_{2 \ldots L}$ are correlated, whereas the rate would approach 0 were any of $x_{2 \ldots L}$ to be acquired.
This motivates our choice of acquisition function as one that selects the set of points whose acquisition would maximize the information gained about the predictive distribution on the unlabeled set. In Figure 1, we show the average posterior entropy of the model's predictions for our method compared to BatchBALD, BayesCoreset, and Random acquisition. As can be seen from the figure, we are able to reduce the average posterior entropy much more effectively than the other methods. Details of this experiment are in Section 6.2.
Figure 1: Mean posterior entropy of the predictions after each acquisition on EMNIST.
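The qualitative point of Example 1 can be reproduced with a small simulation. The label assignments below are hypothetical and chosen only to mimic the structure of the example (one point that splits the models finely, and many points that share a coarser split); they are not taken from Table 1, so the resulting numbers need not match those of the paper. Entropies are in bits, and each model is assumed to predict its label with probability 1.

```python
import math

def entropy_bits(p):
    """Shannon entropy in bits of an iterable of probabilities."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def label_distribution(labels, models):
    """Predictive distribution over labels when the given models are equally likely."""
    counts = {}
    for k in models:
        counts[labels[k]] = counts.get(labels[k], 0) + 1
    return {lab: c / len(models) for lab, c in counts.items()}

def expected_posterior_entropy(acquired, target, models):
    """Expected entropy of the prediction on `target` after observing the label of `acquired`."""
    total = 0.0
    for lab, prob in label_distribution(acquired, models).items():
        # Models that disagree with the observed label are eliminated.
        survivors = [k for k in models if acquired[k] == lab]
        total += prob * entropy_bits(label_distribution(target, survivors).values())
    return total

models = list(range(10))
# ASSUMED label tables (index = model): not the values of Table 1.
x1 = [1, 2, 3, 4, 4, 4, 4, 4, 4, 4]        # identifies models 1-3 uniquely
x_rest = [1, 1, 1, 1, 1, 2, 2, 2, 2, 2]    # pattern shared by x_2, ..., x_L

# With deterministic models, the MI with the parameters equals the predictive entropy.
mi_x1 = entropy_bits(label_distribution(x1, models).values())
mi_rest = entropy_bits(label_distribution(x_rest, models).values())

# Expected entropy left on another point x_j (j >= 2) after each acquisition.
after_x1 = expected_posterior_entropy(x1, x_rest, models)
after_x2 = expected_posterior_entropy(x_rest, x_rest, models)
```

Under these assumed labels, `x1` has the higher MI with the model parameters, yet acquiring it leaves substantial expected entropy on every remaining point, whereas acquiring any one of the other points removes it entirely.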
5. Information Condensing Active Learning (ICAL)
In this section we present our acquisition function. As before, let $D_{train}$ be the training points, $D_U$ the unlabeled points, $y_x$ the random variable denoting the prediction for $x$ by the model trained on $D_{train}$, and $d$ the dependency measure being used. Then
$$\alpha_{ICAL}(\{x_1, \ldots, x_B\}, d) = \frac{1}{|D_U|} \sum_{x' \in D_U} d\left(y_{x'}, \{y_{x_1}, \ldots, y_{x_B}\}\right)$$
that is, we try to find the batch that has the highest average dependency with respect to the unlabeled points' marginal predictive distribution.
$\alpha_{ICAL}$ estimation
As we mentioned in the Introduction, we can use MI as the dependency measure $d$, but it is tricky to estimate MI using just samples from the distribution, particularly for high-dimensional or continuous variables. Furthermore, MI estimators are usually not differentiable. Thus if we wanted to apply ICAL to domains where the pool set is continuous and infinite (for example, if we wanted to query gene expression perturbations for a cell), we would run into obstacles. This motivates our choice of dHSIC as the dependency measure. In addition to being differentiable, dHSIC has better empirical sample complexity for measuring dependency than estimators for MI. Indeed, popular MI estimators have been found to have variance with respect to ground truth MI that increases exponentially with the MI value (Song & Ermon, 2019). dHSIC has also been successfully used in the related context of feature selection via dependency maximization in the past (Da Veiga, 2015; Song et al., 2012). Furthermore, dHSIC is the Maximum Mean Discrepancy (MMD) between the joint distribution and the product of marginals. MMD is known to be upper bounded by the KL divergence (Ramdas et al., 2015), and thus dHSIC $\le$ MI. Thus we use dHSIC as the dependency measure for the rest of the paper.
Naively implementing $\alpha_{ICAL}(\mathcal{B}, dHSIC)$ would require $O(|D_U| m^2 B)$ steps per candidate batch being evaluated, where $m$ is the number of samples taken from $p(y_{1:B})$ ($O(m^2 B)$ to estimate dHSIC, which we need to do $|D_U|$ times). However, recall from Section 3.3 that dHSIC is a function of solely the kernel matrices $k_x$ corresponding to the random variables (in this case $y_x$, $x \in D_U$). Now one can define the kernel $k^* = \frac{1}{|D_U|} \sum_{i=1}^{|D_U|} k_i$. We can then prove the following theorems (proofs are in the Appendix).
Theorem 1. $k^*$ is a valid kernel.
Theorem 2.
$$\sum_{x' \in D_U} \widehat{dHSIC}(k_{x'}, k_{x \in \mathcal{B}}) = \widehat{dHSIC}\Big(\sum_{x' \in D_U} k_{x'}, k_{x \in \mathcal{B}}\Big)$$
where $k_{x \in \mathcal{B}} = k_{x_1}, \ldots, k_{x_B}$, $x_i \in \mathcal{B}$.
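The identity in Theorem 2 can be checked numerically, since the standard biased dHSIC estimator (the V-statistic form of Pfister et al., 2018) is linear in each kernel matrix taken separately. The sketch below assumes that estimator, a Gaussian kernel on scalar predictions, and randomly generated toy samples; all names are hypothetical.

```python
import math
import random

def gaussian_kernel_matrix(samples, bandwidth=1.0):
    """m x m Gaussian kernel matrix over m scalar Monte Carlo predictions."""
    return [[math.exp(-((a - b) ** 2) / (2 * bandwidth ** 2)) for b in samples]
            for a in samples]

def dhsic(kernels):
    """Biased (V-statistic) dHSIC estimate from a list of d kernel matrices."""
    d, m = len(kernels), len(kernels[0])
    # Mean over (i, j) of the elementwise product of all kernel matrices.
    term1 = sum(math.prod(K[i][j] for K in kernels)
                for i in range(m) for j in range(m)) / m ** 2
    # Product over variables of the mean entry of each kernel matrix.
    term2 = math.prod(sum(map(sum, K)) for K in kernels) / m ** (2 * d)
    # Cross term: product of row sums, summed over rows.
    term3 = 2 * sum(math.prod(sum(K[i]) for K in kernels)
                    for i in range(m)) / m ** (d + 1)
    return term1 + term2 - term3

def add_matrices(mats):
    """Elementwise sum of a list of equally sized square matrices."""
    m = len(mats[0])
    return [[sum(M[i][j] for M in mats) for j in range(m)] for i in range(m)]

rng = random.Random(0)
m = 20  # number of Monte Carlo samples per point (toy value)

def draw():
    return [rng.gauss(0.0, 1.0) for _ in range(m)]

pool_kernels = [gaussian_kernel_matrix(draw()) for _ in range(5)]   # k_x, x in D_U
batch_kernels = [gaussian_kernel_matrix(draw()) for _ in range(2)]  # k_{x_1}, k_{x_2}

lhs = sum(dhsic([K] + batch_kernels) for K in pool_kernels)
rhs = dhsic([add_matrices(pool_kernels)] + batch_kernels)
```

The two quantities `lhs` and `rhs` agree up to floating-point error, so the pool kernels only need to be summed once rather than being compared with the batch one at a time.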
Using the reformulation in Theorem 2, we only have to compute $\sum_{x \in D_U} k_x$ once per acquisition round. This lowers the computation cost to $O(|D_U| m^2 + m^2 B)$. Estimating dHSIC would still require $m$ to increase very rapidly with $B$ (proportional to the dimension of the joint distribution). To get around this but still maintain batch diversity, we try two strategies.
For normal ICAL, we average the kernel matrices of points in the candidate batch. We then subsample $r$ points from $D_U$ every time a point is added to the batch and only compare the dependency with those. Thus even if two points are highly correlated, and one of them is added to the batch, the other would not necessarily get a similarly high dHSIC statistic, and other points would get prioritized. We find in practice that this is sufficient to acquire a diverse batch, as evidenced by Figure 4. This seems to be the case even for very large batches, and has the added benefit of further lowering the computational cost for evaluating a candidate batch to $O(rm^2 + [\ldots] \cdot m^2)$. We use $r = [\ldots]$ for all our experiments.
We develop another strategy, which we call ICAL-pointwise, that computes the marginal increase in dependence as a result of adding a point to the batch. If a point is highly correlated with elements of the current batch, the marginal increase would be negligible, making the point much less likely to be selected. Figure 2 is a representative figure of the relative performance of ICAL and ICAL-pointwise. The two variants perform very similarly despite ICAL-pointwise's slight advantage in the early acquisitions. ICAL-pointwise, however, requires much less time for equivalent performance, which we discuss briefly in Section 5.2 and more fully in the Appendix.
For ease of presentation, we use ICAL in the Results section and defer the full description and evaluation of ICAL-pointwise to the Appendix.
Figure 2: Comparison of two ICAL variants (see Appendix for comparison on other datasets).
As there are an exponential number of candidate batches, an exhaustive search to find the optimal batch is infeasible. For ICAL we use a greedy forward selection strategy to build the batch and find that it performs well empirically. As the $\arg\max$ over all of $D_U$ has to be computed every time a new point is being selected for the batch, and we have to perform this operation for each point that is added to the batch, this gives a computation cost of $O(rm^2 + |D_U| m^2 B + m^2 B) = O(|D_U| m^2 B)$.
It is possible that global nonlinear optimization of the batch ICAL criterion would work even better than greedy optimization already does with respect to state-of-the-art methods. Efficient techniques for doing this optimization are not obvious and are beyond the scope of this work. We note, however, that greedy forward selection is a popular technique that has been successfully used in a large variety of contexts (Da Veiga, 2015; Blanchet et al., 2008).
To scale to large batch sizes, instead of adding points to the batch to be acquired one at a time, we can add points in minibatches of size $L$. While this comes at the cost of possible diversity in the batch, we find that the tradeoff is acceptable for the datasets we experimented with. This gives a final computation cost of $O\left(\frac{|D_U| m^2 B}{L}\right)$. By contrast, the corresponding runtime for BatchBALD is $O(c \cdot |D_U| \cdot m \cdot B)$, where $c$ is the number of classes. For all experiments with ICAL, we were able to use $L = [\ldots]$ without any scaling difficulties. For ICAL-pointwise, we used $L = [\ldots]\,|D_U|$ only for CIFAR-10 and CIFAR-100. As alluded to previously, ICAL-pointwise can accommodate much larger $L$ compared to ICAL before its performance degrades, allowing for much greater scaling.
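The trade-off behind minibatch selection can be illustrated with a toy scoring function. In the sketch below, `toy_score` is a hypothetical stand-in for an acquisition score that penalises candidates close to points already in the batch; it is not the ICAL criterion. Adding the top `minibatch_size` candidates per round reduces the number of scoring rounds from the batch size to its ceiling division by the minibatch size, at the cost of not re-scoring within a round.

```python
def select_in_minibatches(pool, batch_size, minibatch_size, score_fn):
    """Greedy selection that adds the top `minibatch_size` candidates per round."""
    batch, rounds = [], 0
    while len(batch) < batch_size:
        remaining = [x for x in pool if x not in batch]
        # Score every remaining candidate once against the current batch.
        ranked = sorted(remaining, key=lambda x: score_fn(batch, x), reverse=True)
        take = min(minibatch_size, batch_size - len(batch))
        batch.extend(ranked[:take])
        rounds += 1
    return batch, rounds

def toy_score(batch, x):
    """Hypothetical score: prefer large values, penalise near-duplicates of batch members."""
    penalty = sum(1.0 for b in batch if abs(b - x) < 0.5)
    return x - 10.0 * penalty

pool = [1.0, 1.1, 2.0, 2.2, 3.0, 3.3, 4.0, 4.4]

batch_one, rounds_one = select_in_minibatches(pool, 4, 1, toy_score)
batch_two, rounds_two = select_in_minibatches(pool, 4, 2, toy_score)
```

With one point per round the selected batch avoids near-duplicates, while with two points per round half as many scoring rounds are needed but near-duplicates enter the batch together, which is the diversity cost mentioned above.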
We evaluate this aspect of ICAL-pointwise in the Appendix.
The final algorithm is given in Algorithm 1.
Algorithm 1 Information Condensing Active Learning (ICAL) ($M$, $T$, $D_{train}$, $D_U$, $B$, $K$, $r$, $L$)
  Train $M$ on $D_{train}$
  repeat
    $\mathcal{B} = \{\}$
    while $|\mathcal{B}| < B$ do
      $Y_U$ = the predictive distribution for $x \in D_U$ according to $M$
      $R$ = set of $r$ randomly selected points from $D_U$
      $x' = \arg\max_x \alpha_{ICAL}(\mathcal{B} \cup \{x\}, dHSIC)$ with the optimizations as specified in Sections 5.1 and 5.2
      $\mathcal{B} = \mathcal{B} \cup \{x'\}$
    end while
    $D_{train} = D_{train} \cup \mathcal{B}$
    Retrain $M$ on $D_{train}$
  until $T$ iterations reached
  Return $M$
Figure 3: Performance on MNIST and repeated-MNIST. Accuracy and NLL after each acquisition.
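The inner loop of Algorithm 1 can be sketched in a few lines. This is a simplified illustration rather than the authors' implementation: it assumes scalar predictions, a Gaussian kernel, the two-variable biased HSIC estimator between the averaged batch kernel and the mean kernel of `r` subsampled pool points, and one point added per step. The function names and toy data are hypothetical.

```python
import math
import random

def rbf_kernel_matrix(samples, bandwidth=1.0):
    """Kernel matrix over the m Monte Carlo predictions of a single point."""
    return [[math.exp(-((a - b) ** 2) / (2 * bandwidth ** 2)) for b in samples]
            for a in samples]

def mean_matrix(mats):
    """Elementwise mean of a list of equally sized square matrices."""
    n, m = len(mats), len(mats[0])
    return [[sum(M[i][j] for M in mats) / n for j in range(m)] for i in range(m)]

def hsic(K, L):
    """Biased empirical HSIC, (1/m^2) * trace(K H L H), via explicit centering."""
    m = len(K)
    def center(A):
        row = [sum(r) / m for r in A]
        col = [sum(A[i][j] for i in range(m)) / m for j in range(m)]
        tot = sum(row) / m
        return [[A[i][j] - row[i] - col[j] + tot for j in range(m)] for i in range(m)]
    Kc, Lc = center(K), center(L)
    return sum(Kc[i][j] * Lc[i][j] for i in range(m) for j in range(m)) / m ** 2

def greedy_ical_batch(pool_kernels, batch_size, r, rng):
    """Greedy forward selection; pool_kernels maps point id -> kernel matrix."""
    batch = []
    while len(batch) < batch_size:
        candidates = [x for x in pool_kernels if x not in batch]
        # Subsample r reference points from the pool at every step.
        reference = rng.sample(candidates, min(r, len(candidates)))
        k_star = mean_matrix([pool_kernels[x] for x in reference])
        def score(x):
            # Average the kernel matrices of the candidate batch.
            batch_kernel = mean_matrix([pool_kernels[b] for b in batch + [x]])
            return hsic(k_star, batch_kernel)
        batch.append(max(candidates, key=score))
    return batch

# Toy pool: points 0-5 share a latent factor across model samples, 6-7 are pure noise.
rng = random.Random(0)
m = 30
latent = [rng.gauss(0.0, 1.0) for _ in range(m)]
cluster_ids = [0, 1, 2, 3, 4, 5]
pool_kernels = {}
for x in cluster_ids:
    pool_kernels[x] = rbf_kernel_matrix([z + rng.gauss(0.0, 0.05) for z in latent])
for x in (6, 7):
    pool_kernels[x] = rbf_kernel_matrix([rng.gauss(0.0, 1.0) for _ in range(m)])

batch = greedy_ical_batch(pool_kernels, batch_size=3, r=8, rng=rng)
```

In this toy pool the first selected point comes from the correlated group, since its predictions carry information about most of the remaining pool, whereas the noise points are informative only about themselves.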
6. Results
We demonstrate the effectiveness of ICAL using standard image datasets including MNIST (LeCun et al., 1998), Repeated MNIST (Kirsch et al., 2019), Extended MNIST (EMNIST) (Cohen et al., 2017), fashion-MNIST, and CIFAR-10 (Krizhevsky et al., 2009). We compare ICAL with three state-of-the-art methods for batched active learning acquisition: BatchBALD, FASS, and BayesCoreset. We also compare against BALD and Max Entropy (MaxEnt), which are not explicitly designed for batched selection, as well as against a Random acquisition baseline. ICAL consistently outperforms BatchBALD, FASS, and BayesCoreset on accuracy and negative log likelihood (NLL).
Throughout our experiments, for each dataset we hold out a fixed test set for evaluating model performance after training and a fixed validation set for training purposes. We retrain the model from the beginning after each acquisition to avoid correlation of subsequently trained models, and we use early stopping after 3 (6 for ResNet18) consecutive epochs of validation accuracy drop. Following Gal et al. (2017), we use Neural Networks with MC dropout (Gal & Ghahramani, 2016) as a variational approximation for Bayesian Neural Networks. We simply use a mixture of rational quadratic kernels for dHSIC, which has been used successfully with kernel-based statistical dependency measures in the past, with mixture length scales of $\{0.2, 0.5, 1, 2, 5\}$ as in Bińkowski et al. (2018). All models are optimized with the Adam optimizer (Kingma & Ba, 2014) using a learning rate of 0.001 and betas (0.9, 0.999). The small batch size experiments are repeated 6 times with different seeds and a different initial training set for each run, with balanced label distribution across all classes. The same set of seeds is used for different methods on the same task. 8 different seeds are used for large batch size experiments using CIFAR datasets.
We first examine ICAL's performance on MNIST, which is a standard image dataset for handwritten digits.
We further test the scenario where duplicated data points exist (repeated-MNIST), as proposed in Kirsch et al. (2019). Each data point in MNIST is replicated three times in repeated-MNIST, and isotropic Gaussian noise with std=0.1 is added after normalizing the image. We use a CNN consisting of two convolutional layers with 32 and 64 5x5 convolution filters, each followed by MC dropout, max-pooling and ReLU. One fully connected layer with 128 hidden units and MC dropout is used after the convolutional layers, and the output soft-max layer has dimension 10. All dropout uses probability 0.5, and the architecture achieved over 99% accuracy on full MNIST. We use a validation set of size 1024 for MNIST and 3072 for repeated-MNIST, and a balanced test set of size 10,000 for both datasets. All models are trained for up to 30 epochs for MNIST and up to 40 epochs for repeated-MNIST. We sample an initial training set of size 20 (2 per class) and conduct 30 acquisitions of batch size 10 on both datasets, and we use 50 MC dropout samples to estimate the posterior.
The test accuracy and negative log-likelihood (NLL) are shown in Figure 3. ICAL significantly improves the NLL and outperforms all other baselines on accuracy, with higher margins on the earlier acquisition rounds. The performance is consistent across all runs (the variance is smaller than for the other baselines), and is robust even in the repeated-MNIST setup, where all the other greedy methods show worsened performance.
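The experimental setup above uses a mixture of rational quadratic kernels for dHSIC. A minimal sketch of such a kernel follows, assuming the rational quadratic form $k_\alpha(x, y) = (1 + \|x - y\|^2 / (2\alpha))^{-\alpha}$ and an unweighted sum over scales as in Bińkowski et al. (2018). The default `scales` tuple, the function names, and the toy predictions are assumptions made for illustration.

```python
def rational_quadratic(x, y, alpha):
    """Rational quadratic kernel between two vectors for a single scale alpha."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return (1.0 + sq_dist / (2.0 * alpha)) ** (-alpha)

def rq_mixture(x, y, scales=(0.2, 0.5, 1.0, 2.0, 5.0)):
    """Unweighted sum of rational quadratic kernels over the given scales (assumed values)."""
    return sum(rational_quadratic(x, y, a) for a in scales)

def kernel_matrix(samples, kernel=rq_mixture):
    """Kernel matrix over the Monte Carlo predictions of one point."""
    return [[kernel(a, b) for b in samples] for a in samples]

# Hypothetical class-probability vectors from three Monte Carlo dropout samples.
mc_predictions = [[0.7, 0.2, 0.1],
                  [0.6, 0.3, 0.1],
                  [0.1, 0.1, 0.8]]
K = kernel_matrix(mc_predictions)
```

Each diagonal entry of `K` equals the number of scales, and samples with similar predictions receive larger kernel values than samples that disagree.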
We then extend the task to a more sophisticated dataset, Extended-MNIST (EMNIST), which consists of 47 classes of 28x28 images of both digits and letters. We use the balanced EMNIST split, where each class has 2400 training examples. We use a validation set of size 16384 and a test set of size 18800 (400 per class), and train for up to 40 epochs. We use a CNN consisting of three convolutional layers with 32, 64, and 128 3x3 convolution filters, each followed by MC dropout, 2x2 max-pooling and ReLU. A fully connected layer with 512 hidden units and MC dropout is used after the convolutional layers. We use an initial training set of size 47 (1 per class) and make 60 acquisitions of batch size 5. 50 MC dropout samples are used to estimate the posterior.

The results are in Figure 5. We do substantially better in terms of both accuracy and NLL compared to all other methods. A clue as to why our method outperforms on EMNIST can be found in Figure 4. ICAL is able to acquire more diverse and balanced batches, while all other methods have over- or under-represented classes (note that BatchBALD, Random and MaxEnt each totally miss examples from one of the classes). This indicates that our method remains robust as the number of classes increases, whereas the alternatives degrade.

Figure 4: Histogram of the labels of all acquired points using different active learning methods on EMNIST (47 classes).

Figure 5: Performance on EMNIST and fashion-MNIST. ICAL significantly improves the accuracy and NLL.
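The kind of class-balance diagnostic shown in Figure 4 can be approximated with a short sketch. The function names `acquired_label_histogram` and `normalized_entropy` are hypothetical, and the normalized entropy is an extra summary added here for illustration; it is not a quantity reported in the text.

```python
import numpy as np

def acquired_label_histogram(acquired_labels, num_classes=47):
    """Count acquired points per class and list classes that were never acquired."""
    counts = np.bincount(np.asarray(acquired_labels), minlength=num_classes)
    missing = np.flatnonzero(counts == 0)
    return counts, missing

def normalized_entropy(counts):
    """Entropy of the label histogram divided by log(num_classes); 1.0 means balanced."""
    p = counts[counts > 0] / counts.sum()
    return float(-(p * np.log(p)).sum() / np.log(len(counts)))

# Toy acquisitions: one perfectly balanced, one that never sees class 46.
balanced_counts, balanced_missing = acquired_label_histogram(np.arange(47).repeat(2))
skewed_counts, skewed_missing = acquired_label_histogram(np.arange(46).repeat(2))
```

A method that misses a class entirely, as described for some baselines on EMNIST, would show up as a non-empty `missing` array and a normalized entropy below 1.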
We also examine ICAL's performance on fashion-MNIST, which consists of 10 classes of 28x28 images of Zalando articles. We use a validation set of size 3072 and a test set of size 10000 (1000 per class), and train for up to 40 epochs. The network architecture is the same as the one used in the MNIST task. We use an initial training set of size 20 (2 per class) and make 30 acquisitions of batch size 10. 100 MC dropout samples are used to estimate the posterior. As shown in Figure 5, we again do significantly better in terms of both accuracy and NLL compared to all other methods. Note that, apart from ICAL, almost all methods perform worse than the random baseline, which shows the robustness of our method.
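The two reported metrics can be computed from MC dropout samples roughly as follows. This is a minimal sketch, assuming the predictive distribution is the average of the per-sample softmax outputs; the array layout (samples, points, classes) and the helper names are assumptions made for illustration.

```python
import numpy as np

def mc_predictive(probs_samples):
    """Average class probabilities over MC dropout samples.

    probs_samples has shape (num_samples, num_points, num_classes).
    """
    return probs_samples.mean(axis=0)

def accuracy_and_nll(probs_samples, labels, eps=1e-12):
    """Test accuracy and mean negative log-likelihood of the averaged prediction."""
    mean_probs = mc_predictive(probs_samples)
    accuracy = float((mean_probs.argmax(axis=1) == labels).mean())
    true_class_probs = mean_probs[np.arange(len(labels)), labels]
    nll = float(-np.log(true_class_probs + eps).mean())
    return accuracy, nll

# Toy example: 2 dropout samples, 2 test points, 2 classes.
toy_samples = np.array([[[0.9, 0.1], [0.4, 0.6]],
                        [[0.7, 0.3], [0.2, 0.8]]])
toy_labels = np.array([0, 1])
toy_acc, toy_nll = accuracy_and_nll(toy_samples, toy_labels)
```

NLL penalizes confident mistakes much more heavily than accuracy does, which is why the two metrics can rank acquisition functions differently.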
Finally, we test our method on the CIFAR-10 and CIFAR-100 datasets (Krizhevsky et al., 2009) in a large batch size setting. CIFAR-10 consists of 10 classes with 6000 images per class, whereas CIFAR-100 has 100 classes with 600 images per class. We use a validation set of size 1024 and a balanced test set of size 10,000 for both datasets. For CIFAR-10, we start with an initial training set of 10000 examples (1000 per class), while for CIFAR-100, we start with 20000 examples (200 per class). We do 10 acquisitions on CIFAR-10 and 7 acquisitions on CIFAR-100 with a batch size of 3000. We use a ResNet18 with 2 additional fully connected layers with MC dropout, and train for up to 60 epochs with learning rate 0.1 (with early stopping allowed). We run with 8 different seeds. The results are in Figure 6. Note that we are unable to compare against BatchBALD for either CIFAR dataset as it runs out of memory.

For CIFAR-10, ICAL dominates all other methods for all acquisitions except two: when the acquired dataset size is 19000 and when it is 28000. ICAL also achieves the highest accuracy at the end of all 10 acquisitions. With CIFAR-100, ICAL outperforms a majority of the methods on all acquisitions. Furthermore, ICAL again finishes with the highest accuracy by a significant margin at the end of the acquisition rounds.

Figure 6: Performance on CIFAR-10 and CIFAR-100 with batch size=3000 using 8 seeds.
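The labeled-set sizes quoted above follow directly from the initial set size and the batch size. The short sketch below (the helper name `labeled_set_sizes` is hypothetical) works out the schedule and locates the two CIFAR-10 sizes at which ICAL is not the top method.

```python
def labeled_set_sizes(initial_size, batch_size, num_acquisitions):
    """Labeled-set size before any acquisition and after each acquisition round."""
    return [initial_size + batch_size * k for k in range(num_acquisitions + 1)]

cifar10_sizes = labeled_set_sizes(10_000, 3_000, 10)   # 10000, 13000, ..., 40000
cifar100_sizes = labeled_set_sizes(20_000, 3_000, 7)   # 20000, 23000, ..., 41000

# Acquisition rounds at which the CIFAR-10 labeled set has 19000 and 28000 points.
exception_rounds = [cifar10_sizes.index(19_000), cifar10_sizes.index(28_000)]
```

Under this schedule the two exceptions correspond to the third and sixth acquisition rounds, and the final labeled sets contain 40000 (CIFAR-10) and 41000 (CIFAR-100) points.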
7. Conclusion
We develop a novel batch mode active learning acquisition function, ICAL, that is model agnostic and applicable to both classification and regression tasks. We develop key optimizations that enable us to scale our method to large acquisition batch and unlabeled set sizes. We show that we are robustly able to outperform state of the art methods for batch mode active learning on a variety of image classification tasks in a deep neural network setting. Future work will involve scaling the method to even larger batch sizes, possibly using techniques developed in the feature selection context (Da Veiga, 2015). Another interesting avenue for research could be to combine acquiring the most information about both the model parameters and the labels of the unlabeled set into a single acquisition function, getting the best of both worlds.
Acknowledgements
The authors would like to thank Dougal Sutherland and Tatsunori Hashimoto for many useful discussions about this project.
References
Ash, J. T., Zhang, C., Krishnamurthy, A., Langford, J., and Agarwal, A. Deep batch active learning by diverse, uncertain gradient lower bounds. arXiv preprint arXiv:1906.03671, 2019.
Aspuru-Guzik, A. and Persson, K. Materials acceleration platform: Accelerating advanced energy materials discovery by integrating high-throughput methods and artificial intelligence. 2018.
Bińkowski, M., Sutherland, D. J., Arbel, M., and Gretton, A. Demystifying mmd gans. arXiv preprint arXiv:1801.01401, 2018.
Blanchet, F. G., Legendre, P., and Borcard, D. Forward selection of explanatory variables. Ecology, 89(9):2623–2632, 2008.
Campbell, T. and Broderick, T. Bayesian coreset construction via greedy iterative geodesic ascent. arXiv preprint arXiv:1802.01737, 2018.
Ching, T., Himmelstein, D. S., Beaulieu-Jones, B. K., Kalinin, A. A., Do, B. T., Way, G. P., Ferrero, E., Agapow, P.-M., Zietz, M., Hoffman, M. M., et al. Opportunities and obstacles for deep learning in biology and medicine. Journal of The Royal Society Interface, 15(141):20170387, 2018.
Cohen, G., Afshar, S., Tapson, J., and van Schaik, A. Emnist: an extension of mnist to handwritten letters. arXiv preprint arXiv:1702.05373, 2017.
Da Veiga, S. Global sensitivity analysis with dependence measures. Journal of Statistical Computation and Simulation, 85(7):1283–1305, 2015.
Ducoffe, M. and Precioso, F. Adversarial active learning for deep networks: a margin based approach. arXiv preprint arXiv:1802.09841, 2018.
Gal, Y. and Ghahramani, Z. Dropout as a bayesian approximation: Representing model uncertainty in deep learning. In International Conference on Machine Learning, pp. 1050–1059, 2016.
Gal, Y., Islam, R., and Ghahramani, Z. Deep bayesian active learning with image data. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 1183–1192. JMLR.org, 2017.
Geifman, Y. and El-Yaniv, R. Deep active learning over the long tail. arXiv preprint arXiv:1711.00941, 2017.
Gissin, D. and Shalev-Shwartz, S. Discriminative active learning. arXiv preprint arXiv:1907.06347, 2019.
Goodfellow, I. J., Shlens, J., and Szegedy, C. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572, 2014.
Gretton, A., Borgwardt, K. M., Rasch, M. J., Schölkopf, B., and Smola, A. A kernel two-sample test. Journal of Machine Learning Research, 13(Mar):723–773, 2012.
Guest, D., Cranmer, K., and Whiteson, D. Deep learning and its application to lhc physics. Annual Review of Nuclear and Particle Science, 68:161–181, 2018.
Guo, Y. and Greiner, R. Optimistic active-learning using mutual information. In IJCAI, volume 7, pp. 823–829, 2007.
Guo, Y. and Schuurmans, D. Discriminative batch mode active learning. In Advances in Neural Information Processing Systems, pp. 593–600, 2008.
Hoi, S. C., Jin, R., Zhu, J., and Lyu, M. R. Batch mode active learning and its application to medical image classification. In Proceedings of the 23rd International Conference on Machine Learning, pp. 417–424. ACM, 2006.
Houlsby, N., Huszár, F., Ghahramani, Z., and Lengyel, M. Bayesian active learning for classification and preference learning. arXiv preprint arXiv:1112.5745, 2011.
Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
Kirsch, A., van Amersfoort, J., and Gal, Y. Batchbald: Efficient and diverse batch acquisition for deep bayesian active learning. In Advances in Neural Information Processing Systems, 2019.
Krause, A., Singh, A., and Guestrin, C. Near-optimal sensor placements in gaussian processes: Theory, efficient algorithms and empirical studies. Journal of Machine Learning Research, 9(Feb):235–284, 2008.
Krizhevsky, A., Hinton, G., et al. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009.
LeCun, Y., Bottou, L., Bengio, Y., Haffner, P., et al. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
Peng, H., Long, F., and Ding, C. Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(8):1226–1238, 2005.
Pfister, N., Bühlmann, P., Schölkopf, B., and Peters, J. Kernel-based tests for joint independence. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 80(1):5–31, 2018.
Pinsler, R., Gordon, J., Nalisnick, E., and Hernández-Lobato, J. M. Bayesian batch active learning as sparse subset approximation. In Advances in Neural Information Processing Systems, pp. 6356–6367, 2019.
Ramdas, A., Reddi, S. J., Póczos, B., Singh, A., and Wasserman, L. On the decreasing power of kernel and distance based nonparametric hypothesis tests in high dimensions. In Twenty-Ninth AAAI Conference on Artificial Intelligence, 2015.
Sanchez-Lengeling, B. and Aspuru-Guzik, A. Inverse molecular design using machine learning: Generative models for matter engineering. Science, 361(6400):360–365, 2018.
Sejdinovic, D., Gretton, A., and Bergsma, W. A kernel test for three-variable interactions. In Advances in Neural Information Processing Systems, pp. 1124–1132, 2013a.
Sejdinovic, D., Sriperumbudur, B., Gretton, A., and Fukumizu, K. Equivalence of distance-based and rkhs-based statistics in hypothesis testing. The Annals of Statistics, pp. 2263–2291, 2013b.
Sener, O. and Savarese, S. Active learning for convolutional neural networks: A core-set approach. arXiv preprint arXiv:1708.00489, 2017.
Settles, B. Active learning literature survey. Technical report, University of Wisconsin-Madison Department of Computer Sciences, 2009.
Song, J. and Ermon, S. Understanding the limitations of variational mutual information estimators. arXiv preprint arXiv:1910.06222, 2019.
Song, L., Smola, A., Gretton, A., Bedo, J., and Borgwardt, K. Feature selection via dependence maximization. Journal of Machine Learning Research, 13(May):1393–1434, 2012.
Székely, G. J., Rizzo, M. L., Bakirov, N. K., et al. Measuring and testing dependence by correlation of distances. The Annals of Statistics, 35(6):2769–2794, 2007.
Wang, Z. and Ye, J. Querying discriminative and representative samples for batch mode active learning. ACM Transactions on Knowledge Discovery from Data (TKDD), 9(3):17, 2015.
Wei, K., Iyer, R., and Bilmes, J. Submodularity in data subset selection and active learning. In International Conference on Machine Learning, pp. 1954–1963, 2015.
Appendix
Derivation for Example 1
For x , the mutual information between the predicted label y and model parameters is: I r y , ω | x , D train s“ H r y | x s ´ E p p ω | D train q r H r y | x , ω ss“ H r ÿ k “ p p y | x , ω k q p p ω k qs ´ ÿ k “ p p ω k q H r p p y | x , ω k qs“ ´p ˆ p ˆ log p qq ` ˆ log p qq´ ˆ ˆ p´p ˆ log p q ` ˆ log p qqq“ . For x ...L , I r y ´ L , ω | x ...L , D train s“ ´p ˆ log p q ` ˆ log p qq´ ˆ p´p ˆ log p q ` ˆ log p qqq“ . After acquiring x , assuming the true label for x is 1, thenwe update the posterior over the model parameter such that p p w q| y “ “ and p p w k q| y “ “ for ă k ď .Then the expected averaged posterior entropy for x ...L is: L ´ L ÿ i “ H r y i | x i s| y “ “ L ´ L ÿ i “ H r ÿ k “ p p y i | x i , ω k q p p ω k q| y “ s“ L ´ ˆ p L ´ q ˆ p´p ˆ log p q ` ˆ log p qqq“ Similarly, we could compute the case where the true labelfor x is 2-4: L ´ L ÿ i “ H r y i | x i s| y “ “ L ´ L ÿ i “ H r y i | x i s| y “ “ L ´ L ÿ i “ H r y i | x i s| y “ “ L ´ ˆ p L ´ q ˆ p´p
67 log p q `
17 log p qqq “ . The expectation of the averaged posterior entropy with re-spect to predicted label for y (since we don’t know the truelabel) is: H r y ´ L , ω | x ...L , x , y D train s“ E y „ p p y | D train q r L ´ L ÿ i “ H r y i | x i s| y s“ ˆ ` ˆ ` ˆ ` ˆ . “ . Proof of Theorem 1 k ˚ is positive semidefinite (psd) and symmetric as the sumof psd symmetric matrices is also psd symmetric. Proof of Theorem 2
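Before the algebra, both statements can be sanity-checked numerically. The sketch below is not part of the proof: it assumes the standard biased dHSIC estimator written in terms of Gram matrices, uses RBF kernels purely as a convenient example, and the names `rbf_gram` and `dhsic` are hypothetical helpers. It checks that the sum of two kernels on the first variable gives a symmetric psd Gram matrix (Theorem 1) and that dHSIC is additive in that kernel (Theorem 2).

```python
import numpy as np

def rbf_gram(x, bandwidth):
    """Gram matrix of an RBF kernel (the kernel choice is an assumption for this demo)."""
    sq_dists = ((x[:, None, :] - x[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dists / (2.0 * bandwidth ** 2))

def dhsic(grams):
    """Biased dHSIC estimate from a list of d Gram matrices, each of shape (n, n)."""
    g = np.stack(grams)                                   # shape (d, n, n)
    term1 = g.prod(axis=0).mean()                         # (1/n^2) sum_ab prod_j k^j
    term2 = g.mean(axis=(1, 2)).prod()                    # (1/n^2d) prod_j sum_ab k^j
    term3 = 2.0 * g.mean(axis=2).prod(axis=0).mean()      # (2/n^(d+1)) sum_a prod_j sum_b k^j
    return float(term1 + term2 - term3)

rng = np.random.default_rng(0)
x1, x2, x3 = (rng.normal(size=(30, 2)) for _ in range(3))
K1, K1_alt = rbf_gram(x1, 1.0), rbf_gram(x1, 2.0)          # two kernels on the first variable
K2, K3 = rbf_gram(x2, 1.0), rbf_gram(x3, 1.0)

lhs = dhsic([K1, K2, K3]) + dhsic([K1_alt, K2, K3])
rhs = dhsic([K1 + K1_alt, K2, K3])
min_eig = np.linalg.eigvalsh(K1 + K1_alt).min()
```

Each of the three terms is linear in the first Gram matrix, so `lhs` and `rhs` agree up to floating-point error. The formal argument follows.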
We show here that { dHSIC p k , k , . . . , k d q ` { dHSIC p k , k , . . . , k d q“ { dHSIC p k ` k , k , . . . , k d q but the extension to the arbitrary sums is straightforward.Using the definition of { dHSIC in (Sejdinovic et al., 2013a), { dHSIC p k , k , . . . , k d q ` { dHSIC p k , k , . . . , k d q “ n n ÿ a “ n ÿ b “ k p x i a , x i b q d ź j “ k j p x ji a , x ji b q` n d n ÿ a “ p n ÿ b “ k p x i a , x i b qq d ź j “ n ÿ b “ k j p x ji a , x ji b q´ n d ` p n ÿ a “ n ÿ b “ k p x i a , x i b qq d ź j “ n ÿ a “ n ÿ b “ k j p x ji a , x ji b q` n n ÿ a “ n ÿ b “ k p x i a , x i b q d ź j “ k j p x ji a , x ji b q` n d n ÿ a “ p n ÿ b “ k p x i a , x i b qq d ź j “ n ÿ b “ k j p x ji a , x ji b q´ n d ` p n ÿ a “ n ÿ b “ k p x i a , x i b qq d ź j “ n ÿ a “ n ÿ b “ k j p x ji a , x ji b q nformation Condensing Active Learning “ ” n n ÿ a “ n ÿ b “ k p x i a , x i b q d ź j “ k j p x ji a , x ji b q` n n ÿ a “ n ÿ b “ k p x i a , x i b q d ź j “ k j p x ji a , x ji b q ı ` ” n d n ÿ a “ p n ÿ b “ k p x i a , x i b qq d ź j “ n ÿ b “ k j p x ji a , x ji b q` n d n ÿ a “ p n ÿ b “ k p x i a , x i b qq d ź j “ n ÿ b “ k j p x ji a , x ji b q ı ´ ” n d ` p n ÿ a “ n ÿ b “ k p x i a , x i b qq d ź j “ n ÿ a “ n ÿ b “ k j p x ji a , x ji b q` n d ` p n ÿ a “ n ÿ b “ k p x i a , x i b qq d ź j “ n ÿ a “ n ÿ b “ k j p x ji a , x ji b q ı “ n n ÿ a “ n ÿ b “ p k p x i a , x i b q ` k p x i a , x i b qq d ź j “ k j p x ji a , x ji b q` n d n ÿ a “ ” n ÿ b “ p k p x i a , x i b q` k p x i a , x i b qq ı d ź j “ n ÿ b “ k j p x ji a , x ji b q´ n d ` ” n ÿ a “ n ÿ b “ p k p x i a , x i b q` k p x i a , x i b qq ı d ź j “ n ÿ a “ n ÿ b “ k j p x ji a , x ji b q“ { dHSIC p k ` k , k , . . . , k d q ICAL-pointwise
To evaluate the marginal dependency increase if a candidate point $x$ is added to batch $B$, we sample a set $R$ from the pool set $D_U$ and compute the pairwise dHSIC of both $B$ and $B' = B \cup \{x\}$ with respect to each point in $R$. Let the resulting vectors of dHSIC scores (each of length $|R|$) be $d_B$ and $d_{B'}$. Then the marginal dependency increase statistic $M_x$ for point $x$ is

$$M_x = \frac{1}{|R|} \sum_i \max\left( \frac{d^i_{B'}}{d^i_B},\, 1 \right),$$

where $i$ indexes the $i$th element of each vector. We then modify $\alpha_{ICAL}$ as follows:

$$\alpha'_{ICAL}(B \cup \{x\}) = \alpha_{ICAL}(B \cup \{x\}) \cdot (M_x - 1),$$

and use the point with the highest value of $\alpha'_{ICAL}$ as the point to acquire. Note that as we want to get as accurate an estimate of $M_x$ as possible, we ideally want to choose as large a set $R$ as possible. In general, we also want to choose $|R|$ to be greater than the number of classes. This makes ICAL-pointwise more memory intensive compared to ICAL. We also tried another criterion for batch selection based on minimal-redundancy-maximal-relevance (Peng et al., 2005), but that had significantly worse performance compared to ICAL and ICAL-pointwise.

Figure 7: Relative performance of ICAL and ICAL-pointwise on smaller datasets (EMNIST, FashionMNIST, MNIST and CIFAR10) with parameters set to equivalent computation cost.
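The marginal dependency increase statistic can be illustrated with a small sketch. It assumes, as a reading of the description above, that $M_x$ is the mean over $R$ of the elementwise ratio of dHSIC scores clipped below at 1 and that the ICAL score is then scaled by $M_x - 1$; the function names are hypothetical, the dHSIC scores are taken as given inputs, and the scores for $B$ are assumed to be strictly positive.

```python
import numpy as np

def marginal_dependency_increase(d_B, d_B_prime):
    """Mean over R of max(d_B'[i] / d_B[i], 1); assumes every d_B[i] > 0."""
    d_B = np.asarray(d_B, dtype=float)
    d_B_prime = np.asarray(d_B_prime, dtype=float)
    # Ratios below 1 (dependency went down for that reference point) are clipped to 1.
    ratios = np.maximum(d_B_prime / d_B, 1.0)
    return float(ratios.mean())

def pointwise_score(alpha_ical, d_B, d_B_prime):
    """Scale the batch-level ICAL score by the average relative increase (M_x - 1)."""
    return alpha_ical * (marginal_dependency_increase(d_B, d_B_prime) - 1.0)

# Toy scores for |R| = 3 reference points, before and after adding candidate x.
d_before = [0.2, 0.4, 0.5]
d_after = [0.3, 0.4, 0.25]
m_x = marginal_dependency_increase(d_before, d_after)   # mean of (1.5, 1.0, 1.0)
score = pointwise_score(0.6, d_before, d_after)
```

With the clipping, a candidate is rewarded only for reference points whose dependency with the batch increases, so a point that adds nothing for any element of $R$ receives a score of zero.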
In Figure 7, we analyze the performance of ICAL versus ICAL-pointwise when their parameters are set such that the computational cost is about the same. As can be seen, they are broadly similar, with ICAL-pointwise having a slight advantage in earlier acquisitions and ICAL being slightly better in later ones.

We also analyze the relative performance as the mini-batch size L changes in Figure 8. In the figure, iter = D_U / L is the number of iterations taken to build the entire acquisition batch (note that the actual acquisition happens after the entire batch has been built). ICAL-pointwise requires more computation time than ICAL in the small L setup; however, if time is the major constraint, ICAL-pointwise is to be preferred, as its performance degrades more slowly as L, the size of the minibatch, increases. As the performance usually peaks at L = , if one is trying to get the best performance or if memory is a constraint, then ICAL is to be preferred.

Figure 8: Relative performance of ICAL and ICAL-pointwise on CIFAR100 with different mini-batch size L. iter = D_U / L is the number of iterations taken to build the entire acquisition batch (note that the actual acquisition happens after the entire batch has been built).

Runtime and memory considerations
BatchBALD runs out of memory on CIFAR-10 and CIFAR-100, and thus we are unable to compare against it for those two datasets. For the MNIST-variant datasets, ICAL takes about a minute to build the batch to acquire (batch sizes of 5 and 10). For CIFAR-10 (batch size 3000), with L = , the runtime is about 20 minutes, but it scales linearly with 1/L (Figure 10). Thus it is only 5 minutes for L = (iter = ), which is already sufficient to give comparable performance to L = (Figure 9). For CIFAR-100 (batch size 3000), the performance does degrade with high L, but as we mentioned previously, ICAL-pointwise holds up a lot better in terms of performance with high L (Figure 8), and thus if time is a strong consideration, that variant should be used instead.

Figure 9: CIFAR10 performance with different L. iter = B / L is the number of iterations taken to build the entire acquisition batch of size B (note that the actual acquisition happens after the entire batch has been built).

Figure 10: