Bayesian Batch Active Learning as Sparse Subset Approximation
Robert Pinsler, Jonathan Gordon, Eric Nalisnick, José Miguel Hernández-Lobato
Robert Pinsler
Department of Engineering, University of Cambridge
[email protected]

Jonathan Gordon
Department of Engineering, University of Cambridge
[email protected]

Eric Nalisnick
Department of Engineering, University of Cambridge
[email protected]

José Miguel Hernández-Lobato
Department of Engineering, University of Cambridge
[email protected]
Abstract
Leveraging the wealth of unlabeled data produced in recent years provides great potential for improving supervised models. When the cost of acquiring labels is high, probabilistic active learning methods can be used to greedily select the most informative data points to be labeled. However, for many large-scale problems standard greedy procedures become computationally infeasible and suffer from negligible model change. In this paper, we introduce a novel Bayesian batch active learning approach that mitigates these issues. Our approach is motivated by approximating the complete data posterior of the model parameters. While naive batch construction methods result in correlated queries, our algorithm produces diverse batches that enable efficient active learning at scale. We derive interpretable closed-form solutions akin to existing active learning procedures for linear models, and generalize to arbitrary models using random projections. We demonstrate the benefits of our approach on several large-scale regression and classification tasks.
Much of machine learning's success stems from leveraging the wealth of data produced in recent years. However, in many cases expert knowledge is needed to provide labels, and access to these experts is limited by time and cost constraints. For example, cameras could easily provide images of the many fish that inhabit a coral reef, but an ichthyologist would be needed to properly label each fish with the relevant biological information. In such settings, active learning (AL) [1] enables data-efficient model training by intelligently selecting points for which labels should be requested. Taking a Bayesian perspective, a natural approach to AL is to choose the set of points that maximally reduces the uncertainty in the posterior over model parameters [2]. Unfortunately, solving this combinatorial optimization problem is NP-hard. Most AL methods iteratively solve a greedy approximation, e.g. using maximum entropy [3] or maximum information gain [2, 4]. These approaches alternate between querying a single data point and updating the model, until the query budget is exhausted. However, as we discuss below, sequential greedy methods have severe limitations in modern machine learning applications, where datasets are massive and models often have millions of parameters.

A possible remedy is to select an entire batch of points at every AL iteration. Batch AL approaches dramatically reduce the computational burden caused by repeated model updates, while resulting in much more significant learning updates. Batch acquisition is also more practical in applications where the cost of acquiring labels is high but can be parallelized. Examples include crowd-sourcing a complex labeling task, leveraging parallel simulations on a compute cluster, or performing experiments that require resources with time-limited availability (e.g. a wet-lab in the natural sciences). Unfortunately, naively constructing a batch using traditional acquisition functions still leads to highly correlated queries [5], i.e. a large part of the budget is spent on repeatedly choosing nearby points. Despite recent interest in batch methods [5–8], there currently exists no principled, scalable Bayesian batch AL algorithm.

Figure 1: Batch construction of different AL methods on cifar10, shown as a t-SNE projection [12]. Given 5000 labeled points (colored by class), a batch of 200 points (black crosses) is queried. Panels: (a) MaxEnt, (b) BALD, (c) Ours.

In this paper, we propose a novel Bayesian batch AL approach that mitigates these issues. The key idea is to re-cast batch construction as optimizing a sparse subset approximation to the log posterior induced by the full dataset. This formulation of AL is inspired by recent work on
Bayesian coresets [9, 10]. We leverage these similarities and use the Frank-Wolfe algorithm [11] to enable efficient Bayesian AL at scale. We derive interpretable closed-form solutions for linear and probit regression models, revealing close connections to existing AL methods in these cases. By using random projections, we further generalize our algorithm to work with any model with a tractable likelihood. We demonstrate the benefits of our approach on several large-scale regression and classification tasks.
We consider discriminative models p(y | x, θ) parameterized by θ ∈ Θ, mapping from inputs x ∈ X to a distribution over outputs y ∈ Y. Given a labeled dataset D = {x_n, y_n}_{n=1}^N, the learning task consists of performing inference over the parameters θ to obtain the posterior distribution p(θ | D). In the AL setting [1], the learner is allowed to choose the data points from which it learns. In addition to the initial dataset D, we assume access to (i) an unlabeled pool set X_p = {x_m}_{m=1}^M, and (ii) an oracle labeling mechanism which can provide labels Y_p = {y_m}_{m=1}^M for the corresponding inputs.

Probabilistic AL approaches choose points by considering the posterior distribution of the model parameters. Without any budget constraints, we could query the oracle M times, yielding the complete data posterior through Bayes' rule,

p(θ | D ∪ (X_p, Y_p)) = p(θ | D) p(Y_p | X_p, θ) / p(Y_p | X_p, D),    (1)

where p(θ | D) plays the role of the prior. While the complete data posterior is optimal from a Bayesian perspective, in practice we can only select a subset, or batch, of points D' = (X', Y') ⊆ D_p due to budget constraints. From an information-theoretic perspective [2], we want to query points X' ⊆ X_p that are maximally informative, i.e. minimize the expected posterior entropy,

X* = argmin_{X' ⊆ X_p, |X'| ≤ b}  E_{Y' ∼ p(Y' | X', D)} [ H[θ | D ∪ (X', Y')] ],    (2)

where b is a query budget. Solving Eq. (2) directly is intractable, as it requires considering all possible subsets of the pool set. As such, most AL strategies follow a myopic approach that iteratively chooses a single point until the budget is exhausted. Simple heuristics, e.g. maximizing the predictive entropy (MaxEnt), are often employed [13, 5]. Houlsby et al. [4] propose BALD, a greedy approximation to Eq. (2) which seeks the point x that maximizes the decrease in expected entropy:

x* = argmax_{x ∈ X_p}  H[θ | D] − E_{y ∼ p(y | x, D)} [ H[θ | x, y, D] ].    (3)

While sequential greedy strategies can be near-optimal in certain cases [14, 15], they become severely limited for large-scale settings. In particular, it is computationally infeasible to re-train the model after every acquired data point, e.g. re-training a ResNet [16] thousands of times is clearly impractical. Even if such an approach were feasible, the addition of a single point to the training set is likely to have a negligible effect on the parameter posterior distribution [5]. Since the model changes only marginally after each update, subsequent queries thus result in acquiring similar points in data space. As a consequence, there has been renewed interest in finding tractable batch AL formulations. Perhaps the simplest approach is to naively select the b highest-scoring points according to a standard acquisition function. However, such naive batch construction methods still result in highly correlated queries [5]. This issue is highlighted in Fig. 1, where both MaxEnt (Fig. 1a) and BALD (Fig. 1b) expend a large part of the budget on repeatedly choosing nearby points.
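To make the baselines above concrete, the following minimal sketch computes the closed-form BALD score for a Bayesian linear regression model with Gaussian parameter posterior N(μ_θ, Σ_θ) and noise variance σ², and constructs a batch naively by taking the top-b scores; the function and variable names are illustrative assumptions, not taken from the paper's released code.

```python
import numpy as np

def bald_scores(X_pool, Sigma_theta, noise_var):
    """BALD scores for Bayesian linear regression (cf. Eq. (3) and Eq. (12)).

    For a Gaussian posterior over the weights, the mutual information between
    y and theta at input x is 0.5 * log(1 + x^T Sigma_theta x / sigma^2).
    """
    pred_var = np.einsum('nd,de,ne->n', X_pool, Sigma_theta, X_pool)
    return 0.5 * np.log1p(pred_var / noise_var)

def naive_batch(scores, b):
    """Naive batch construction: take the top-b points by acquisition score."""
    return np.argsort(-scores)[:b]

# Usage: query a batch of 10 points from a random pool under a unit-Gaussian posterior.
rng = np.random.default_rng(0)
X_pool = rng.normal(size=(1000, 5))
batch = naive_batch(bald_scores(X_pool, np.eye(5), noise_var=0.1), b=10)
```

It is exactly this kind of naive top-b selection that produces the correlated queries visible in Fig. 1a and 1b.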
We propose a novel probabilistic batch AL algorithm that mitigates the issues mentioned above. Our method generates batches that cover the entire data manifold (Fig. 1c) and, as we will show later, are highly effective for performing posterior inference over the model parameters. Note that while our approach alternates between acquiring data points and updating the model for several iterations in practice, we restrict the derivations hereafter to a single iteration for simplicity.

The key idea behind our batch AL approach is to choose a batch D' such that the updated log posterior log p(θ | D ∪ D') best approximates the complete data log posterior log p(θ | D ∪ D_p). In AL, we do not have access to the labels before querying the pool set. We therefore take the expectation w.r.t. the current predictive posterior distribution p(Y_p | X_p, D) = ∫ p(Y_p | X_p, θ) p(θ | D) dθ. The expected complete data log posterior is thus

E_{Y_p}[log p(θ | D ∪ (X_p, Y_p))] = E_{Y_p}[log p(θ | D) + log p(Y_p | X_p, θ) − log p(Y_p | X_p, D)]
                                   = log p(θ | D) + E_{Y_p}[log p(Y_p | X_p, θ)] + H[Y_p | X_p, D]
                                   = log p(θ | D) + Σ_{m=1}^M ( E_{y_m}[log p(y_m | x_m, θ)] + H[y_m | x_m, D] ),    (4)

where we denote the m-th summand by L_m(θ), the first equality uses Bayes' rule (cf. Eq. (1)), and the third equality assumes conditional independence of the outputs given the inputs. This assumption holds for the type of factorized predictive posteriors we consider, e.g. as induced by Gaussian or Multinomial likelihood models.
Batch construction as sparse approximation

Taking inspiration from Bayesian coresets [9, 10], we re-cast Bayesian batch construction as a sparse approximation to the expected complete data log posterior. Since the first term in Eq. (4) only depends on D, it suffices to choose the batch that best approximates Σ_m L_m(θ). Similar to Campbell and Broderick [10], we view L_m : Θ → R and L = Σ_m L_m as vectors in function space. Letting w ∈ {0, 1}^M be a weight vector indicating which points to include in the AL batch, and denoting L(w) = Σ_m w_m L_m (with slight abuse of notation), we convert the problem of constructing a batch to a sparse subset approximation problem, i.e.

w* = argmin_w ||L − L(w)||²    subject to    w_m ∈ {0, 1} ∀m,   Σ_m w_m ≤ b.    (5)

Intuitively, Eq. (5) captures the key objective of our framework: a "good" approximation to L implies that the resulting posterior will be close to the (expected) posterior had we observed the complete pool set. Since solving Eq. (5) is generally intractable, in what follows we propose a generic algorithm to efficiently find an approximate solution.
Inner products and Hilbert spaces

We propose to construct our batches by solving Eq. (5) in a Hilbert space induced by an inner product ⟨L_n, L_m⟩ between function vectors, with associated norm ||·||. Below, we discuss the choice of specific inner products. Importantly, this choice introduces a notion of directionality into the optimization procedure, enabling our approach to adaptively construct query batches while implicitly accounting for similarity between selected points.

Frank-Wolfe optimization

To approximately solve the optimization problem in Eq. (5) we follow the work of Campbell and Broderick [10], i.e. we relax the binary weight constraint to be non-negative and replace the cardinality constraint with a polytope constraint. Let σ_m = ||L_m||, σ = Σ_m σ_m, and K ∈ R^{M×M} be a kernel matrix with K_{mn} = ⟨L_m, L_n⟩. The relaxed optimization problem is

minimize_w  (1 − w)^T K (1 − w)    subject to    w_m ≥ 0 ∀m,   Σ_m w_m σ_m = σ,    (6)

where we used ||L − L(w)||² = (1 − w)^T K (1 − w). The polytope has vertices {(σ/σ_m) e_m}_{m=1}^M, where e_m denotes the m-th coordinate unit vector, and contains the point w = [1, 1, ..., 1]^T. Eq. (6) can be solved efficiently using the Frank-Wolfe algorithm [11], yielding the optimal weights w* after b iterations. The complete AL procedure, Active Bayesian CoreSets with Frank-Wolfe optimization (ACS-FW), is outlined in Appendix A (see Algorithm A.1). The key computation in Algorithm A.1 (Line 6) is
⟨L − L(w), L_n / σ_n⟩ = (1 / σ_n) Σ_{m=1}^M (1 − w_m) ⟨L_m, L_n⟩,    (7)

which only depends on the inner products ⟨L_m, L_n⟩ and the norms σ_n = ||L_n||. At each iteration, the algorithm greedily selects the vector L_f most aligned with the residual error L − L(w). The weights w are then updated according to a line search along the f-th vertex of the polytope (recall that the linearized objective optimized in each Frank-Wolfe step attains its optimum at a vertex of the polytope), which by construction is the f-th coordinate unit vector scaled by σ/σ_f. This corresponds to adding at most one data point to the batch in every iteration. Since the algorithm allows indices from previous iterations to be re-selected, the resulting weight vector has at most b non-zero entries. Empirically, we find that this property leads to smaller batches as more data points are acquired.

Since it is non-trivial to leverage the continuous weights returned by the Frank-Wolfe algorithm in a principled way, the final step of our algorithm is to project the weights back to the feasible space, i.e. set w̃*_m = 1 if w*_m > 0, and 0 otherwise. While this projection step increases the approximation error, we show in Section 7 that our method is still effective in practice. We leave the exploration of alternative optimization procedures that do not require this projection step to future work.
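The following NumPy sketch illustrates the Frank-Wolfe loop of Eq. (6)–(7) and the final binarization step, assuming the kernel matrix K with K[m, n] = ⟨L_m, L_n⟩ has already been computed; names such as acs_fw and the clipping of the step size are our own choices, and the released implementation may differ in detail.

```python
import numpy as np

def acs_fw(K, b):
    """Frank-Wolfe batch construction (a sketch of Algorithm A.1 via Eq. (6)-(7)).

    K is the M x M kernel matrix with K[m, n] = <L_m, L_n>; b is the query budget.
    Returns the binarized selection (indices with w_m > 0) and the raw weights.
    """
    M = K.shape[0]
    sigmas = np.sqrt(np.diag(K))           # sigma_n = ||L_n||
    sigma = sigmas.sum()                   # sigma = sum_n sigma_n
    w = np.zeros(M)
    for _ in range(b):
        # <L - L(w), L_n / sigma_n> = (1 / sigma_n) * sum_m (1 - w_m) K[m, n]
        residual = (1.0 - w) @ K / sigmas
        f = int(np.argmax(residual))       # greedily pick the best-aligned vector
        # Line search along the f-th vertex (sigma / sigma_f) * e_f of the polytope.
        d = -w.copy()
        d[f] += sigma / sigmas[f]          # coefficients of (sigma/sigma_f) L_f - L(w)
        r = 1.0 - w                        # coefficients of L - L(w)
        gamma = (d @ K @ r) / (d @ K @ d)
        gamma = float(np.clip(gamma, 0.0, 1.0))   # numerical safeguard (our addition)
        w = (1.0 - gamma) * w
        w[f] += gamma * sigma / sigmas[f]  # w <- (1 - gamma) w + gamma (sigma/sigma_f) e_f
    return np.flatnonzero(w > 0), w
```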
Choice of inner products

We employ weighted inner products of the form ⟨L_n, L_m⟩_π̂ = E_π̂[⟨L_n, L_m⟩], where we choose π̂ to be the current posterior p(θ | D). We consider two specific inner products with desirable analytical and computational properties; however, other choices are possible. First, we define the weighted Fisher inner product [17, 10]

⟨L_n, L_m⟩_{π̂,F} = E_π̂[ ∇_θ L_n(θ)^T ∇_θ L_m(θ) ],    (8)

which is reminiscent of information-theoretic quantities but requires taking gradients of the expected log-likelihood terms w.r.t. the parameters. Note that the entropy term in L_m (see Eq. (4)) vanishes under this norm, as its gradient w.r.t. θ is zero. In Section 4, we show that for specific models this choice leads to simple, interpretable expressions that are closely related to existing AL procedures.

An alternative choice that lifts the restriction of having to compute gradients is the weighted Euclidean inner product, which considers the marginal likelihood of data points [10],

⟨L_n, L_m⟩_{π̂,2} = E_π̂[ L_n(θ) L_m(θ) ].    (9)

The key advantage of this inner product is that it only requires tractable likelihood computations. In Section 5 this will prove highly useful in providing a black-box method for these computations in any model (that has a tractable likelihood) using random feature projections.
Method overview

In summary, we (i) consider the L_m in Eq. (4) as vectors in function space and re-cast batch construction as a sparse approximation to the full data log posterior, Eq. (5); (ii) replace the cardinality constraint with a polytope constraint in a Hilbert space, and relax the binary weight constraint to non-negativity; (iii) solve the resulting optimization problem in Eq. (6) using Algorithm A.1; and (iv) construct the AL batch by including all points x_m ∈ X_p with w*_m > 0.
Analytic expressions for linear models

In this section, we use the weighted Fisher inner product from Eq. (8) to derive closed-form expressions of the key quantities of our algorithm for two types of models: Bayesian linear regression and probit regression. Although the considered models are relatively simple, they can be used flexibly to construct more powerful models that still admit closed-form solutions. For example, in Section 7 we demonstrate how using neural linear models [18, 19] allows us to perform efficient AL on several regression tasks. We consider arbitrary models and inference procedures in Section 5.
Linear regression
Consider the following model for scalar Bayesian linear regression,

y_n = θ^T x_n + ε_n,   ε_n ∼ N(0, σ²),   θ ∼ p(θ),    (10)

where p(θ) is a factorized Gaussian prior with unit variance; extensions to richer Gaussian priors are straightforward. Given a labeled dataset D, the posterior is given in closed form as p(θ | D, σ²) = N(θ; (X^T X + σ² I)^{-1} X^T y, Σ_θ) with Σ_θ = σ² (X^T X + σ² I)^{-1}. For this model, a closed-form expression for the inner product in Eq. (8) is

⟨L_n, L_m⟩_{π̂,F} = (x_n^T x_m / σ⁴) x_n^T Σ_θ x_m,    (11)

where π̂ is chosen to be the posterior p(θ | D, σ²). See Appendix B.1 for details on this derivation. We can make a direct comparison with BALD [2, 4] by treating the squared norm of a data point, α_ACS(x_n; D) = ⟨L_n, L_n⟩_{π̂,F}, as a greedy acquisition function, yielding

α_ACS(x_n; D) = (x_n^T x_n / σ⁴) x_n^T Σ_θ x_n,    α_BALD(x_n; D) = (1/2) log( 1 + x_n^T Σ_θ x_n / σ² ).    (12)

The two functions share the term x_n^T Σ_θ x_n, but BALD wraps the term in a logarithm whereas α_ACS scales it by x_n^T x_n. Ignoring the x_n^T x_n term in α_ACS makes the two quantities equivalent under a greedy maximizer, since exp(2 α_BALD(x_n; D)) is a monotonically increasing function of α_ACS(x_n; D) / (x_n^T x_n). Another observation is that x_n^T Σ_θ x_n is very similar to a leverage score [20–22], which is computed as x_n^T (X^T X)^{-1} x_n and quantifies the degree to which x_n influences the least-squares solution. We can then interpret the x_n^T x_n term in α_ACS as allowing for more contribution from the current instance x_n than BALD or leverage scores would.
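As a sketch of how Eq. (11)–(12) translate into code, the snippet below computes the Fisher kernel and the diagonal acquisition scores from a posterior covariance Σ_θ and noise variance σ²; the function names are illustrative. The resulting kernel matrix can be fed directly into the Frank-Wolfe sketch given earlier.

```python
import numpy as np

def fisher_kernel_linear(X_pool, Sigma_theta, noise_var):
    """Weighted Fisher inner products for Bayesian linear regression (Eq. (11)).

    Returns K with K[n, m] = (x_n^T x_m) * (x_n^T Sigma_theta x_m) / sigma^4.
    """
    gram = X_pool @ X_pool.T                           # x_n^T x_m
    posterior_gram = X_pool @ Sigma_theta @ X_pool.T   # x_n^T Sigma_theta x_m
    return gram * posterior_gram / noise_var ** 2

def alpha_acs_linear(X_pool, Sigma_theta, noise_var):
    """Diagonal of the Fisher kernel, used as a greedy acquisition score (Eq. (12))."""
    return np.diag(fisher_kernel_linear(X_pool, Sigma_theta, noise_var))
```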
Probit regression

Consider the following model for Bayesian probit regression,

p(y_n | x_n, θ) = Ber(y_n; Φ(θ^T x_n)),   θ ∼ p(θ),    (13)

where Φ(·) denotes the standard Normal cumulative distribution function (cdf), and p(θ) is assumed to be a factorized Gaussian with unit variance. We obtain a closed-form solution for Eq. (8), i.e.

⟨L_n, L_m⟩_{π̂,F} = x_n^T x_m ( BvN(ζ_n, ζ_m, ρ_{n,m}) − Φ(ζ_n) Φ(ζ_m) ),    (14)
ζ_i = μ_θ^T x_i / √(1 + x_i^T Σ_θ x_i),    ρ_{n,m} = x_n^T Σ_θ x_m / ( √(1 + x_n^T Σ_θ x_n) √(1 + x_m^T Σ_θ x_m) ),

where BvN(·) is the bi-variate Normal cdf and N(θ; μ_θ, Σ_θ) is a Gaussian approximation to the posterior p(θ | D). We again view α_ACS(x_n; D) = ⟨L_n, L_n⟩_{π̂,F} as an acquisition function and re-write Eq. (14) as

α_ACS(x_n; D) = x_n^T x_n ( Φ(ζ_n)(1 − Φ(ζ_n)) − 2 T(ζ_n, 1 / √(1 + 2 x_n^T Σ_θ x_n)) ),    (15)

where T(·,·) is Owen's T function [23]; efficient open-source implementations of numerical approximations of these quantities exist, e.g. in scipy. See Appendix B.2 for the full derivation of Eqs. (14) and (15). Eq. (15) has a simple and intuitive form that accounts for the magnitude of the input vector and a regularized term for the predictive variance. (We only introduce α_ACS to compare to other acquisition functions; in practice we use Algorithm A.1.)
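Eq. (15) can be evaluated with standard numerical routines; the sketch below assumes a Gaussian (approximate) posterior N(μ_θ, Σ_θ) and the availability of scipy.special.owens_t (SciPy ≥ 1.2). The off-diagonal terms of Eq. (14) additionally require a bivariate Normal cdf, e.g. via scipy.stats.multivariate_normal. Function and variable names are illustrative.

```python
import numpy as np
from scipy.special import owens_t
from scipy.stats import norm

def alpha_acs_probit(X_pool, mu_theta, Sigma_theta):
    """Squared norms <L_n, L_n> for Bayesian probit regression (Eq. (15)).

    Assumes a Gaussian (approximate) posterior N(mu_theta, Sigma_theta) over the weights.
    """
    s2 = np.einsum('nd,de,ne->n', X_pool, Sigma_theta, X_pool)   # x_n^T Sigma x_n
    zeta = (X_pool @ mu_theta) / np.sqrt(1.0 + s2)
    phi = norm.cdf(zeta)
    return np.sum(X_pool ** 2, axis=1) * (
        phi * (1.0 - phi) - 2.0 * owens_t(zeta, 1.0 / np.sqrt(1.0 + 2.0 * s2))
    )
```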
Random projections for non-linear models

In Section 4, we have derived closed-form expressions of the weighted Fisher inner product for two specific types of models. However, this approach suffers from two shortcomings. First, it is limited to models for which the inner product can be evaluated in closed form, e.g. linear regression or probit regression. Second, the resulting algorithm requires O(|P|²) computations to construct a batch, restricting our approach to moderately-sized pool sets.

We address both of these issues using random feature projections, which allow us to approximate the key quantities required for batch construction. In Algorithm A.2, we introduce a procedure that works for any model with a tractable likelihood, scaling only linearly in the pool set size |P|. To keep the exposition simple, we consider models in which the expectation of L_n(θ) w.r.t. p(y_n | x_n, D) is tractable, but we stress that our algorithm could work with sampling for that expectation as well.

While it is easy to construct a projection for the weighted Fisher inner product [10], its dependence on the number of model parameters through the gradient makes it difficult to scale to more complex models. We therefore only consider projections for the weighted Euclidean inner product from Eq. (9), which we found to perform comparably in practice. The appropriate projection is [10]

L̂_n = (1/√J) [ L_n(θ_1), ..., L_n(θ_J) ]^T,   θ_j ∼ π̂,    (16)

i.e. L̂_n represents the J-dimensional projection of L_n in Euclidean space. Given this projection, we are able to approximate inner products as dot products between vectors,

⟨L_n, L_m⟩_{π̂,2} ≈ L̂_n^T L̂_m,    (17)

where L̂_n^T L̂_m can be viewed as an unbiased sample estimator of ⟨L_n, L_m⟩_{π̂,2} using J Monte Carlo samples from the posterior π̂. Importantly, Eq. (16) can be calculated for any model with a tractable likelihood. Since in practice we only require inner products of the form ⟨L − L(w), L_n / σ_n⟩_{π̂,2}, batches can be efficiently constructed in O(|P| J) time. As we show in Section 7, this enables us to scale our algorithm up to pool sets comprising hundreds of thousands of examples.
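The sketch below illustrates the projection of Eq. (16) and how the approximate inner products of Eq. (17) are then obtained; expected_loglik is a placeholder for a model-specific routine returning E_y[log p(y | x, θ)] + H[y | x, D], and all names are ours rather than the paper's API.

```python
import numpy as np

def project_l(expected_loglik, theta_samples, X_pool):
    """Random feature projections hat{L}_n of Eq. (16) for the Euclidean inner product.

    expected_loglik(x, theta) must return E_y[log p(y | x, theta)] + H[y | x, D] under the
    current predictive posterior; it is a user-supplied, model-specific callable here.
    theta_samples is a sequence of J posterior samples theta_j ~ pi_hat.
    Returns an (M, J) matrix whose rows are the projected vectors hat{L}_n.
    """
    J = len(theta_samples)
    L_hat = np.stack(
        [np.array([expected_loglik(x, theta) for theta in theta_samples]) for x in X_pool]
    )
    return L_hat / np.sqrt(J)

# Inner products are then plain dot products (Eq. (17)): K = L_hat @ L_hat.T gives
# K[n, m] ~ <L_n, L_m>_{pi_hat, 2}, which can be passed to the Frank-Wolfe sketch above.
# In practice the residual terms <L - L(w), L_n / sigma_n> only need v = L_hat.T @ (1 - w)
# followed by dot products with each row of L_hat, i.e. O(|P| J) work per iteration.
```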
Bayesian AL approaches attempt to query points that maximally reduce model uncertainty. Common heuristics for this intractable problem greedily choose points where the predictive posterior is most uncertain, e.g. maximum variance and maximum entropy [3], or that maximally improve the expected information gain [2, 4]. Scaling these methods to the batch setting in a principled way is difficult for complex, non-linear models. Recent work on improving inference for AL with deep probabilistic models [24, 13] used datasets with at most 10,000 data points and few model updates. Consequently, there has been great interest in batch AL recently. The literature is dominated by non-probabilistic methods, which commonly trade off diversity and uncertainty. Many approaches are model-specific, e.g. for linear regression [25], logistic regression [26, 27], and k-nearest neighbors [28]; our method works for any model with a tractable likelihood. Others [6–8] follow optimization-based approaches that require optimization over a large number of variables. As these methods scale quadratically with the number of data points, they are limited to smaller pool sets.

Probabilistic batch methods mostly focus on Bayesian optimization problems. Several approaches select the batch that jointly optimizes the acquisition function [29, 30]. As they scale poorly with the batch size, greedy batch construction algorithms are often used instead [31–34]. A common strategy is to impute the labels of the selected data points and update the model accordingly [33]. Our approach also uses the model to predict the labels, but importantly it does not require updating the model after every data point. Moreover, most of the methods in Bayesian optimization employ Gaussian process models. While AL with non-parametric models [35] could benefit from that work, scaling such models to large datasets remains challenging. Our work therefore provides the first principled, scalable and model-agnostic Bayesian batch AL approach.

Similar to us, Sener and Savarese [5] formulate AL as a core-set selection problem. They construct batches by solving a k-center problem, attempting to minimize the maximum distance to one of the k queried data points. Since this approach heavily relies on the geometry in data space, it requires an expressive feature representation. For example, Sener and Savarese [5] only consider ConvNet representations learned on highly structured image data. In contrast, our work is inspired by Bayesian coresets [9, 10], which enable scalable Bayesian inference by approximating the log-likelihood of a labeled dataset with a sparse weighted subset thereof. Consequently, our method is less reliant on a structured feature space and only requires the evaluation of log-likelihood terms.

Figure 2: Batches constructed by BALD (top, panels (a)–(d)) and ACS-FW (bottom, panels (e)–(h)) on a probit regression task at steps t = 1, 2, 3, 10. Ten training data points (red, blue) were sampled from a standard bi-variate Normal and labeled according to p(y | x) = Ber(Φ(5x_1 + 0x_2)). At each step t, one unlabeled point (black cross) is queried from the pool set (colored according to the acquisition function; bright is higher). The current mean decision boundary of the model is shown as a black line. Best viewed in color.

We perform experiments to answer the following questions: (1) does our approach avoid correlated queries, (2) is our method competitive with greedy methods in the small-data regime, and (3) does our method scale to large datasets and models? We address questions (1) and (2) on several linear and probit regression tasks using the closed-form solutions derived in Section 4, and question (3) on large-scale regression and classification datasets by leveraging the projections from Section 5. Finally, we provide a runtime evaluation for all regression experiments. Full experimental details are deferred to Appendix C; source code is available at https://github.com/rpinsler/active-bayesian-coresets.
Does our approach avoid correlated queries?

In Fig. 1, we have seen that traditional AL methods are prone to correlated queries. To investigate this further, in Fig. 2 we compare batches selected by ACS-FW and BALD on a simple probit regression task. Since BALD has no explicit batch construction mechanism, we naively choose the b = 10 most informative points according to BALD. While the BALD acquisition function does not change during batch construction, α_ACS(x_n; D) rotates after each selected data point. This provides further intuition about why ACS-FW is able to spread the batch in data space, avoiding the strongly correlated queries that BALD produces. (We use α_ACS (see Eq. (15)) as an acquisition function for ACS-FW only for the sake of visualization.)
Is our method competitive with greedy methods in the small-data regime?

We evaluate the performance of ACS-FW on several UCI regression datasets. We compare against (i) Random: select points randomly; (ii) MaxEnt: naively construct the batch from the top b points according to the maximum entropy criterion (equivalent to BALD in this case); (iii) MaxEnt-SG: use MaxEnt with a sequential greedy strategy (i.e. b = 1); and (iv) MaxEnt-I: sequentially acquire a single data point, impute the missing label and update the model accordingly. Starting with a small set of labeled points sampled randomly from the pool set, we use each AL method to iteratively grow the training dataset by requesting batches of size b = 10 until the query budget is exhausted. To guarantee fair comparisons, all methods use the same neural linear model, i.e. a Bayesian linear regression model with a deterministic neural network feature extractor [19]. In this setting, posterior inference can be done in closed form [19]. The model is re-trained for a fixed number of epochs after every AL iteration using Adam [36]. After each iteration, we evaluate RMSE on a held-out set. Experiments are repeated for 40 (year: 5) seeds, using randomized train-test splits. We also include a medium-scale experiment on power that follows the same protocol; however, for ACS-FW we use projections instead of the closed-form solutions, as they yield improved performance and are faster. Further details, including architectures and learning rates, are in Appendix C.

The results are summarized in Table 1. ACS-FW consistently outperforms Random by a large margin (unlike MaxEnt), and is mostly on par with MaxEnt on smaller datasets. While the results are encouraging, greedy methods such as MaxEnt-SG and MaxEnt-I still often yield better results in these small-data regimes. We conjecture that this is because single data points do have a significant impact on the posterior. The benefits of using ACS-FW become clearer with increasing dataset size: as shown in Fig. 3, ACS-FW achieves much more data-efficient learning on larger datasets.

Table 1: Test RMSE on UCI regression datasets, averaged over 40 (year: 5) seeds. MaxEnt-I and MaxEnt-SG require order(s) of magnitude more model updates and are thus not directly comparable.

           N        d    Random    MaxEnt    ACS-FW    MaxEnt-I    MaxEnt-SG
yacht      308      6
boston     506      13
energy     768      8
power
year       515345   90                                 N/A         N/A

Table 2: Runtime in seconds on UCI regression datasets, averaged over 40 (year: 5) seeds. We report mean batch construction time (BT/it.) and total time (TT/it.) per AL iteration, as well as total cumulative time (total). MaxEnt-I requires order(s) of magnitude more model updates and is thus not directly comparable.

           Random                 MaxEnt                 ACS-FW                 MaxEnt-I
           BT/it.  TT/it.  total  BT/it.  TT/it.  total  BT/it.  TT/it.  total  BT/it.  TT/it.  total
yacht
boston
energy
power
year                                                                            N/A     N/A     N/A
Figure 3: Test RMSE on UCI regression datasets during AL, averaged over 40 (a–b) and 5 (c) seeds. Error bars denote two standard errors. Panels: (a) yacht, (b) energy, (c) year; methods: ACS-FW (ours), MaxEnt, Random.
Does our method scale to large datasets and models?

Leveraging the projections from Section 5, we apply ACS-FW to large-scale datasets and complex models. We demonstrate the benefits of our approach on year, a UCI regression dataset with ca. 515,000 data points, and on the classification datasets cifar10, SVHN and Fashion MNIST. Methods requiring model updates after every data point (e.g. MaxEnt-SG, MaxEnt-I) are impractical in these settings due to their excessive runtime. For year, we again use a neural linear model, start with a small set of labeled points and allow for batches of size b = 1000 until the budget of 10,000 queries is exhausted. We average the results over 5 seeds, using randomized train-test splits. As can be seen in Fig. 3c, our approach significantly outperforms both Random and MaxEnt during the entire AL process.

For the classification experiments, we start with an initial set of labeled points and request batches of size b = 3000, up to a budget of 12,000 (cifar10: 20,000) points. We compare to Random, MaxEnt and BALD, as well as two batch AL algorithms, namely K-Medoids and K-Center [5]. Performance is measured in terms of accuracy on a holdout test set comprising 10,000 (SVHN: 26,032, as is standard) points, with the remainder used for training. We use a neural linear model with a ResNet18 [16] feature extractor, trained from scratch at every AL iteration for a fixed number of epochs using Adam [36]. Since posterior inference is intractable in the multi-class setting, we resort to variational inference with mean-field Gaussian approximations [37, 38].

Figure 4: Test accuracy on classification tasks over 5 seeds. Error bars denote two standard errors. Panels: (a) cifar10, (b) SVHN, (c) Fashion MNIST; methods: ACS-FW (ours), BALD, K-Center, K-Medoids, MaxEnt, Random.

Fig. 4 demonstrates that in all cases ACS-FW significantly outperforms Random, which is a strong baseline in AL [5, 13, 24]. Somewhat surprisingly, we find that the probabilistic methods (BALD and MaxEnt) provide strong baselines as well, and consistently outperform Random. We discuss this point and provide further experimental results in Appendix D. Finally, Fig. 4 demonstrates that in all cases ACS-FW performs at least as well as its competitors, including state-of-the-art non-probabilistic batch AL approaches such as K-Center. These results demonstrate that ACS-FW can usefully apply probabilistic reasoning to AL at scale, without any sacrifice in performance.
Runtime Evaluation
Runtime comparisons between the different AL methods on the UCI regression datasets are shown in Table 2. For methods with a fixed AL batch size b (Random, MaxEnt and MaxEnt-I), the number of AL iterations is given by the total budget divided by b (e.g. 100/10 = 10 for yacht). Thus, the total cumulative time (total) is given by the total time per AL iteration (TT/it.) times the number of iterations. MaxEnt-I iteratively constructs the batch by selecting a single data point, imputing its label, and updating the model; therefore the batch construction time (BT/it.) and the total time per AL iteration take roughly b times as long as for MaxEnt. This approach becomes infeasible for very large batch sizes (e.g. b = 1000 for year). The same holds true for MaxEnt-SG, which we have omitted here as its run times are similar to those of MaxEnt-I. ACS-FW constructs batches of variable size, and hence its number of iterations varies.

As shown in Table 2, the batch construction times of ACS-FW are negligible compared to the total training times per AL iteration. Although ACS-FW requires more AL iterations than the other methods, the total cumulative run times are on par with MaxEnt. Note that both MaxEnt and MaxEnt-I require computing the entropy of a Student's T distribution, for which no batched implementation was available in PyTorch at the time we performed the experiments. Parallelizing this computation would likely further speed up the batch construction process.

We have introduced a novel Bayesian batch AL approach based on sparse subset approximations. Our methodology yields intuitive closed-form solutions, revealing its connections to BALD as well as to leverage scores. Yet more importantly, our approach admits relaxations (i.e. random projections) that allow it to tackle challenging large-scale AL problems with general non-linear probabilistic models. Leveraging the Frank-Wolfe weights in a principled way and investigating how this method interacts with alternative approximate inference procedures are interesting avenues for future work.

Acknowledgments
Robert Pinsler receives funding from an iCASE grant.
References

[1] Burr Settles. Active learning. Synthesis Lectures on Artificial Intelligence and Machine Learning, 6(1):1–114, 2012.
[2] David J. C. MacKay. Information-based objective functions for active data selection. Neural Computation, 4(4):590–604, 1992.
[3] Claude Elwood Shannon. A mathematical theory of communication. Bell System Technical Journal, 27(3):379–423, 1948.
[4] Neil Houlsby, Ferenc Huszár, Zoubin Ghahramani, and Máté Lengyel. Bayesian active learning for classification and preference learning. arXiv preprint arXiv:1112.5745, 2011.
[5] Ozan Sener and Silvio Savarese. Active learning for convolutional neural networks: A core-set approach. In International Conference on Learning Representations, 2018.
[6] Ehsan Elhamifar, Guillermo Sapiro, Allen Yang, and S. Shankar Sastry. A convex optimization framework for active learning. In IEEE International Conference on Computer Vision, pages 209–216, 2013.
[7] Yuhong Guo. Active instance sampling via matrix partition. In Advances in Neural Information Processing Systems, pages 802–810, 2010.
[8] Yi Yang, Zhigang Ma, Feiping Nie, Xiaojun Chang, and Alexander G. Hauptmann. Multi-class active learning by uncertainty sampling with diversity maximization. International Journal of Computer Vision, 113(2):113–127, 2015.
[9] Jonathan Huggins, Trevor Campbell, and Tamara Broderick. Coresets for scalable Bayesian logistic regression. In Advances in Neural Information Processing Systems, pages 4080–4088, 2016.
[10] Trevor Campbell and Tamara Broderick. Automated scalable Bayesian inference via Hilbert coresets. The Journal of Machine Learning Research, 20(1):551–588, 2019.
[11] Marguerite Frank and Philip Wolfe. An algorithm for quadratic programming. Naval Research Logistics Quarterly, 3(1-2):95–110, 1956.
[12] Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9(Nov):2579–2605, 2008.
[13] Yarin Gal, Riashat Islam, and Zoubin Ghahramani. Deep Bayesian active learning with image data. arXiv preprint arXiv:1703.02910, 2017.
[14] Daniel Golovin and Andreas Krause. Adaptive submodularity: Theory and applications in active learning and stochastic optimization. Journal of Artificial Intelligence Research, 42:427–486, 2011.
[15] Sanjoy Dasgupta. Analysis of a greedy active learning strategy. In Advances in Neural Information Processing Systems, pages 337–344, 2005.
[16] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
[17] Oliver Johnson and Andrew Barron. Fisher information inequalities and the central limit theorem. Probability Theory and Related Fields, 129(3):391–409, 2004.
[18] Andrew Gordon Wilson, Zhiting Hu, Ruslan Salakhutdinov, and Eric P. Xing. Deep kernel learning. In Artificial Intelligence and Statistics, pages 370–378, 2016.
[19] Carlos Riquelme, George Tucker, and Jasper Snoek. Deep Bayesian bandits showdown. In International Conference on Learning Representations, 2018.
[20] Petros Drineas, Malik Magdon-Ismail, Michael W. Mahoney, and David P. Woodruff. Fast approximation of matrix coherence and statistical leverage. Journal of Machine Learning Research, 13(Dec):3475–3506, 2012.
[21] Ping Ma, Michael W. Mahoney, and Bin Yu. A statistical perspective on algorithmic leveraging. The Journal of Machine Learning Research, 16(1):861–911, 2015.
[22] Michal Derezinski, Manfred K. Warmuth, and Daniel J. Hsu. Leveraged volume sampling for linear regression. In Advances in Neural Information Processing Systems, pages 2510–2519, 2018.
[23] Donald B. Owen. Tables for computing bivariate normal probabilities. The Annals of Mathematical Statistics, 27(4):1075–1090, 1956.
[24] José Miguel Hernández-Lobato and Ryan Adams. Probabilistic backpropagation for scalable learning of Bayesian neural networks. In International Conference on Machine Learning, pages 1861–1869, 2015.
[25] Kai Yu, Jinbo Bi, and Volker Tresp. Active learning via transductive experimental design. In International Conference on Machine Learning, pages 1081–1088, 2006.
[26] Steven C. H. Hoi, Rong Jin, Jianke Zhu, and Michael R. Lyu. Batch mode active learning and its application to medical image classification. In International Conference on Machine Learning, pages 417–424, 2006.
[27] Yuhong Guo and Dale Schuurmans. Discriminative batch mode active learning. In Advances in Neural Information Processing Systems, pages 593–600, 2008.
[28] Kai Wei, Rishabh Iyer, and Jeff Bilmes. Submodularity in data subset selection and active learning. In International Conference on Machine Learning, pages 1954–1963, 2015.
[29] Clément Chevalier and David Ginsbourger. Fast computation of the multi-points expected improvement with applications in batch selection. In International Conference on Learning and Intelligent Optimization, pages 59–69, 2013.
[30] Amar Shah and Zoubin Ghahramani. Parallel predictive entropy search for batch global optimization of expensive objective functions. In Advances in Neural Information Processing Systems, pages 3330–3338, 2015.
[31] Javad Azimi, Alan Fern, and Xiaoli Z. Fern. Batch Bayesian optimization via simulation matching. In Advances in Neural Information Processing Systems, pages 109–117, 2010.
[32] Emile Contal, David Buffoni, Alexandre Robicquet, and Nicolas Vayatis. Parallel Gaussian process optimization with upper confidence bound and pure exploration. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 225–240, 2013.
[33] Thomas Desautels, Andreas Krause, and Joel W. Burdick. Parallelizing exploration-exploitation tradeoffs in Gaussian process bandit optimization. The Journal of Machine Learning Research, 15(1):3873–3923, 2014.
[34] Javier González, Zhenwen Dai, Philipp Hennig, and Neil Lawrence. Batch Bayesian optimization via local penalization. In Artificial Intelligence and Statistics, pages 648–657, 2016.
[35] Ashish Kapoor, Kristen Grauman, Raquel Urtasun, and Trevor Darrell. Active learning with Gaussian processes for object categorization. In IEEE International Conference on Computer Vision, pages 1–8, 2007.
[36] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[37] Martin J. Wainwright and Michael I. Jordan. Graphical models, exponential families, and variational inference. Foundations and Trends in Machine Learning, 1(1–2):1–305, 2008.
[38] Charles Blundell, Julien Cornebise, Koray Kavukcuoglu, and Daan Wierstra. Weight uncertainty in neural networks. arXiv preprint arXiv:1505.05424, 2015.
[39] Kevin P. Murphy. Machine Learning: A Probabilistic Perspective. MIT Press, 2012.
[40] Christopher M. Bishop. Pattern Recognition and Machine Learning. Springer, 2006.
[41] Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In International Conference on Artificial Intelligence and Statistics, pages 249–256, 2010.
[42] Durk P. Kingma, Tim Salimans, and Max Welling. Variational dropout and the local reparameterization trick. In Advances in Neural Information Processing Systems, pages 2575–2583, 2015.
[43] Radford M. Neal. Bayesian Learning for Neural Networks, volume 118. Springer Science & Business Media, 2012.
[44] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958, 2014.
[45] Yarin Gal and Zoubin Ghahramani. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In International Conference on Machine Learning, pages 1050–1059, 2016.
A Algorithms
A.1 Active Bayesian coresets with Frank-Wolfe optimization (ACS-FW)
Algorithm A.1 outlines the ACS-FW procedure for a budget b, vectors {L_n}_{n=1}^N and the choice of an inner product ⟨·,·⟩ (see Section 3). After computing the norms σ_n and σ (Lines 2 and 3) and initializing the weight vector w to zero (Line 4), the algorithm performs b iterations of Frank-Wolfe optimization. At each iteration, the Frank-Wolfe algorithm chooses exactly one data point (which can be viewed as a node on the polytope) to be added to the batch (Line 6). The weight update for this data point can then be computed by performing a line search in closed form [10] (Line 7), and the resulting step-size is used to update w (Line 8). Finally, the optimal weight vector with cardinality ≤ b is returned. In practice, we project the weights back to the feasible space by binarizing them (not shown; see Section 3 for more details), as working with the continuous weights directly is non-trivial.
Algorithm A.1: Active Bayesian Coresets with Frank-Wolfe Optimization

procedure ACS-FW(b, {L_n}_{n=1}^N, ⟨·,·⟩)
    σ_n ← √⟨L_n, L_n⟩  ∀n                                        ▷ Compute norms
    σ ← Σ_n σ_n
    w ← 0                                                        ▷ Initialize weights to 0
    for t ∈ 1, ..., b do
        f ← argmax_{n ∈ [N]} ⟨L − L(w), L_n / σ_n⟩                ▷ Greedily select point f
        γ ← ⟨(σ/σ_f) L_f − L(w), L − L(w)⟩ / ⟨(σ/σ_f) L_f − L(w), (σ/σ_f) L_f − L(w)⟩   ▷ Line search for step-size γ
        w ← (1 − γ) w + γ (σ/σ_f) e_f                              ▷ Update weight for newly selected point
    end for
    return w
end procedure

A.2 ACS-FW with random projections

Algorithm A.2 details the process of constructing an AL batch with budget b and J random feature projections for the weighted Euclidean inner product from Eq. (16).
Algorithm A.2: ACS-FW with Random Projections (for the Weighted Euclidean Inner Product)

procedure ACS-FW(b, J)
    θ_j ∼ π̂,  j = 1, ..., J                                      ▷ Sample parameters
    L̂_n ← (1/√J) [L_n(θ_1), ..., L_n(θ_J)]^T  ∀n                  ▷ Compute random feature projections
    return ACS-FW(b, {L̂_n}_{n=1}^N, (·)^T(·))                    ▷ Call Algorithm A.1 using projections
end procedure
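As a concrete instance of the per-point summand L_n(θ) that Algorithm A.2 projects, the sketch below evaluates it for a Gaussian likelihood whose predictive posterior is N(m, v); the argument names are illustrative, and the closed-form expectation follows directly from Eq. (4).

```python
import numpy as np

def expected_loglik_gaussian(x, theta, pred_mean, pred_var, noise_var):
    """L_n(theta) for a Gaussian likelihood: E_y[log N(y; theta^T x, sigma^2)] + H[y | x, D].

    pred_mean and pred_var parameterize the current predictive posterior p(y | x, D) = N(m, v).
    This is the per-point summand fed to Algorithm A.2; names are illustrative assumptions.
    """
    e_loglik = -0.5 * np.log(2.0 * np.pi * noise_var) \
               - (pred_var + (pred_mean - theta @ x) ** 2) / (2.0 * noise_var)
    entropy = 0.5 * np.log(2.0 * np.pi * np.e * pred_var)
    return e_loglik + entropy
```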
B Closed-form derivations

B.1 Linear regression
Consider the following model for scalar Bayesian linear regression,

y_n = θ^T x_n + ε_n,   ε_n ∼ N(0, σ²),   θ ∼ p(θ),

where p(θ) denotes the prior. To avoid notational clutter we assume a factorized Gaussian prior with unit variance, but what follows is easily extended to richer Gaussian priors. Given an initial labeled dataset D, the parameter posterior can be computed in closed form as

p(θ | D, σ²) = N(θ; μ_θ, Σ_θ),   μ_θ = (X^T X + σ² I)^{-1} X^T y,   Σ_θ = σ² (X^T X + σ² I)^{-1},    (B.18)

and the predictive posterior is given by

p(y_n | x_n, D, σ²) = ∫ p(y_n | x_n, θ) p(θ | D, σ²) dθ = N(y_n; μ_θ^T x_n, σ² + x_n^T Σ_θ x_n).    (B.19)

Using this model, we can derive a closed-form term for the inner product in Eq. (8),

⟨L_n, L_m⟩_{π̂,F} = E_π̂[ (∇_θ L_n)^T (∇_θ L_m) ]
                 = E_π̂[ ( (1/σ²)(E[y_n] − x_n^T θ) x_n )^T ( (1/σ²)(E[y_m] − x_m^T θ) x_m ) ]
                 = (x_n^T x_m / σ⁴) E_π̂[ (μ_θ^T x_n − θ^T x_n)(μ_θ^T x_m − θ^T x_m) ]
                 = (x_n^T x_m / σ⁴) (x_n^T Σ_θ x_m),

where in the second equality we have taken the expectation w.r.t. p(y_n | x_n, D, σ²) from Eq. (B.19), and in the third equality w.r.t. π̂ = p(θ | D, σ²) from Eq. (B.18). Similarly, we obtain

⟨L_n, L_n⟩_{π̂,F} = (x_n^T x_n / σ⁴) (x_n^T Σ_θ x_n).

For this model, BALD [2, 4] can also be evaluated in closed form:

α_BALD(x_n; D) = H[θ | D, σ²] − E_{p(y_n | x_n, D)}[ H[θ | x_n, y_n, D, σ²] ]
              = (1/2) E_π̂[ log((σ² + x_n^T Σ_θ x_n)/σ²) + (σ² + (μ_θ^T x_n − θ^T x_n)²)/(σ² + x_n^T Σ_θ x_n) − 1 ]
              = (1/2) log( (σ² + x_n^T Σ_θ x_n) / σ² ).

We can make a direct comparison with BALD by treating the squared norm of L_n as an acquisition function, α_ACS(x_n; D) = ⟨L_n, L_n⟩_{π̂,F}, yielding

α_ACS(x_n; D) = (x_n^T x_n / σ⁴) x_n^T Σ_θ x_n.

Viewing α_ACS as a greedy acquisition function is reasonable as (i) the norm of L_n is related to the magnitude of the reduction in Eq. (5), and thus can be viewed as a proxy for greedy optimization; and (ii) this establishes a link to notions of sensitivity from the original work on Bayesian coresets [10, 9], where σ_n = ||L_n|| is the key quantity for constructing the coreset (i.e. by using it for importance sampling or Frank-Wolfe optimization). As demonstrated in Fig. B.5, dropping x_n^T x_n from α_ACS makes the two quantities equivalent under a greedy maximizer, since exp(2 α_BALD(x_n; D)) is a monotonic function of α_ACS(x_n; D) / (x_n^T x_n).
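The closed form above is easy to sanity-check numerically; the sketch below compares it against a Monte Carlo estimate of the expected gradient inner product, using illustrative helper names.

```python
import numpy as np

def mc_fisher_inner(x_n, x_m, mu_theta, Sigma_theta, noise_var, n_samples=100000, seed=0):
    """Monte Carlo check of Eq. (11): E_pi[grad L_n(theta)^T grad L_m(theta)].

    After taking the expectation over y_n under the predictive posterior,
    grad_theta L_n(theta) = (mu_theta^T x_n - theta^T x_n) x_n / sigma^2.
    """
    rng = np.random.default_rng(seed)
    thetas = rng.multivariate_normal(mu_theta, Sigma_theta, size=n_samples)
    res_n = (mu_theta - thetas) @ x_n   # mu^T x_n - theta^T x_n per sample
    res_m = (mu_theta - thetas) @ x_m
    return (x_n @ x_m) * np.mean(res_n * res_m) / noise_var ** 2

def closed_form_fisher_inner(x_n, x_m, Sigma_theta, noise_var):
    """Closed form from the derivation above: (x_n^T x_m)(x_n^T Sigma_theta x_m) / sigma^4."""
    return (x_n @ x_m) * (x_n @ Sigma_theta @ x_m) / noise_var ** 2

# The two agree up to Monte Carlo error, e.g.:
# d = 3; Sigma = 0.5 * np.eye(d); mu = np.ones(d); xn, xm = np.arange(1., d + 1), np.ones(d)
# mc_fisher_inner(xn, xm, mu, Sigma, 0.1), closed_form_fisher_inner(xn, xm, Sigma, 0.1)
```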
B.2 Logistic regression and probit regression

The probit regression model used in the main section of the paper is closely related to logistic regression. Since the latter is more common in practice, we will start from a Bayesian logistic regression model and apply the standard probit approximation to render inference tractable.

Figure B.5: α_BALD and α_ACS (without the magnitude term) evaluated on synthetic data drawn from a linear regression model y_n = x_n + ε_n with Gaussian noise. α_BALD and α_ACS / x_n^T x_n are equivalent (up to a constant factor) in this model.

Consider the following Bayesian logistic regression model,

p(y_n | x_n, θ) = Ber(y_n; σ(θ^T x_n)),   σ(z) := 1 / (1 + exp(−z)),   θ ∼ p(θ),

where we again assume p(θ) is a factorized Gaussian with unit variance. The exact parameter posterior distribution is intractable for this model due to the non-linear likelihood. We assume an approximation of the form p(θ | D) ≈ N(θ; μ_θ, Σ_θ). More importantly, the posterior predictive is also intractable in this setting. For the purpose of this derivation, we use the additional approximation

p(y_n | x_n, D) = ∫ p(y_n | x_n, θ) p(θ | D) dθ ≈ ∫ Φ(θ^T x_n) N(θ; μ_θ, Σ_θ) dθ = Φ( μ_θ^T x_n / √(1 + x_n^T Σ_θ x_n) ),

where in the second line we have plugged in our approximation to the parameter posterior, and used the well-known approximation σ(z) ≈ Φ(z), where Φ(·) represents the standard Normal cdf [39].

Next, we derive a closed-form approximation for the weighted Fisher inner product in Eq. (8). We begin by noting that

⟨L_n, L_m⟩_{π̂,F} ≈ x_n^T x_m ( E_π̂[ Φ(θ^T x_n) Φ(θ^T x_m) ] − Φ(ζ_n) Φ(ζ_m) ),    (B.20)

where we define ζ_i = μ_θ^T x_i / √(1 + x_i^T Σ_θ x_i), and use σ(z) ≈ Φ(z) as before. Next, we employ the identity [23]

∫ Φ(a + bz) Φ(c + dz) N(z; 0, 1) dz = BvN( a / √(1 + b²), c / √(1 + d²), ρ = bd / (√(1 + b²) √(1 + d²)) ),

where BvN(a, b, ρ) is the bi-variate Normal (with correlation ρ) cdf evaluated at (a, b). Plugging this into Eq. (B.20) yields

E_π̂[ (∇_θ L_n)^T (∇_θ L_m) ] ≈ x_n^T x_m ( BvN(ζ_n, ζ_m, ρ_{n,m}) − Φ(ζ_n) Φ(ζ_m) ),

where ρ_{n,m} = x_n^T Σ_θ x_m / ( √(1 + x_n^T Σ_θ x_n) √(1 + x_m^T Σ_θ x_m) ).

Next, we derive an expression for the squared norm, i.e.

⟨L_n, L_n⟩_{π̂,F} = E_π̂[ (∇_θ L_n)^T (∇_θ L_n) ]
                 = E_π̂[ ( (E[y_n] − σ(x_n^T θ)) x_n )^T ( (E[y_n] − σ(x_n^T θ)) x_n ) ]
                 = x_n^T x_n ( Φ(ζ_n)² − 2 Φ(ζ_n) E_π̂[ σ(θ^T x_n) ] + E_π̂[ σ(θ^T x_n)² ] ).    (B.21)

Here, we again use the approximation σ(z) ≈ Φ(z), and the following identity [23]:

∫ ( Φ(θ^T x) )² N(θ; μ_θ, Σ_θ) dθ = Φ(ζ) − 2 T( ζ, 1 / √(1 + 2 x^T Σ_θ x) ),    (B.22)

where T(·,·) is Owen's T function [23]. Plugging Eq. (B.22) back into Eq. (B.21) and taking the expectation w.r.t. the approximate posterior, we have that

E_π̂[ (∇_θ L_n)^T (∇_θ L_n) ] = x_n^T x_n ( Φ(ζ_n)(1 − Φ(ζ_n)) − 2 T( ζ_n, 1 / √(1 + 2 x_n^T Σ_θ x_n) ) ).

C Experimental details
Computing infrastructure
All experiments were run on a desktop Ubuntu 16.04 machine. We used an Intel Core i7-3820 @ 3.60GHz x 8 CPU for experiments on yacht, boston, energy and power, and a GeForce GTX TITAN X GPU for all others.

Hyperparameter selection
We manually tuned the hyper-parameters with the goal of trading off performance and stability of the model training throughout the AL process, while keeping the protocol similar across datasets. Although a more systematic hyper-parameter search might yield improved results, we anticipate that the gains would be comparable across AL methods since they all share the same model and optimization procedure.
C.1 Regression experiments

Model
We use a deterministic feature extractor consisting of two fully connected hidden layers (with a different number of units for year), interspersed with batch norm and ReLU activation functions. Weights and biases are initialized from U(−√k, √k), where k = 1/N_in, and N_in is the number of incoming features. We additionally apply L2 weight decay with regularization parameter λ = 1 (power, year: λ = 3). The final layer performs exact Bayesian inference. We place a factorized zero-mean Gaussian prior with unit variance on the weights of the last layer L, θ_L ∼ N(θ_L; 0, I), and an inverse Gamma prior on the noise variance, σ² ∼ Γ⁻¹(σ²; α₀, β₀), with α₀ = 1, β₀ = 1 (power, year: β₀ = 3). Inference with this prior can be performed in closed form, where the predictive posterior follows a Student's T distribution [40]. For power and year, we use J = 10 projections during the batch construction of ACS-FW.
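A minimal PyTorch sketch of such a neural linear model is given below, with a fixed noise variance in place of the inverse Gamma prior and illustrative layer widths; it is meant to convey the structure (deterministic features plus an exact Gaussian last-layer posterior), not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class NeuralLinear(nn.Module):
    """Sketch of a neural linear model: deterministic features + Bayesian last layer.

    Layer widths, priors and hyper-parameters are illustrative, not the paper's exact values.
    """
    def __init__(self, d_in, d_hidden=50):
        super().__init__()
        self.features = nn.Sequential(
            nn.Linear(d_in, d_hidden), nn.BatchNorm1d(d_hidden), nn.ReLU(),
            nn.Linear(d_hidden, d_hidden), nn.BatchNorm1d(d_hidden), nn.ReLU(),
        )

    def posterior_last_layer(self, X, y, noise_var=0.1):
        """Exact Gaussian posterior over the last-layer weights given extracted features.

        Assumes the feature extractor has already been trained and is in eval mode.
        """
        with torch.no_grad():
            phi = self.features(X)                          # N x d_hidden feature matrix
        A = phi.T @ phi + noise_var * torch.eye(phi.shape[1])
        Sigma = noise_var * torch.linalg.inv(A)             # sigma^2 (Phi^T Phi + sigma^2 I)^-1
        mu = torch.linalg.solve(A, phi.T @ y)                # (Phi^T Phi + sigma^2 I)^-1 Phi^T y
        return mu, Sigma
```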
Optimization

Inputs and outputs are normalized during training to have zero mean and unit variance, and un-normalized for prediction. The network is trained for a fixed number of epochs with the Adam optimizer, using a learning rate of the form 10⁻ᵏ (with a different value for power and year) and cosine annealing. The training batch size is adapted during the AL process as more data points are acquired: we set the batch size to the closest power of 2 below a fixed fraction of |D|, up to a maximum value. For power and yacht, we deviate from this protocol to stabilize the training process, and set the batch size to the minimum of |D| and a fixed constant.
C.2 Classification experiments

Model

We employ a deterministic feature extractor consisting of a ResNet-18 [16], followed by one fully-connected hidden layer with a ReLU activation function. All weights are initialized following Glorot and Bengio [41]. We additionally apply L2 weight decay to all weights of this feature extractor. The final layer is a dense layer that returns samples using local reparametrization [42], followed by a softmax activation function. The mean weights of the last layer are initialized from a zero-mean Gaussian, and the log standard deviations from a Gaussian with negative mean. We place a factorized zero-mean Gaussian prior with unit variance on the weights of the last layer L, θ_L ∼ N(θ_L; 0, I). Since exact inference is intractable, we perform mean-field variational inference [37, 38] on the last layer. The predictive posterior is approximated using Monte Carlo samples. We use J = 10 projections during the batch construction of ACS-FW.
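The sketch below shows a mean-field Gaussian last layer with the local reparameterization trick [42], as used for the classification head; the initialization scales and dimensions are illustrative assumptions rather than the paper's exact values.

```python
import torch
import torch.nn as nn

class VariationalLinear(nn.Module):
    """Mean-field Gaussian last layer with the local reparameterization trick [42].

    A sketch of the classification head described above; all initialization scales
    are illustrative, not the paper's exact values.
    """
    def __init__(self, d_in, n_classes):
        super().__init__()
        self.w_mu = nn.Parameter(torch.randn(d_in, n_classes) * 0.1)
        self.w_logstd = nn.Parameter(torch.randn(d_in, n_classes) * 0.1 - 4.0)

    def forward(self, phi, n_samples=1):
        # Sample pre-activations directly: a ~ N(phi W_mu, phi^2 Var[W]) per output unit.
        mean = phi @ self.w_mu
        std = torch.sqrt((phi ** 2) @ torch.exp(2.0 * self.w_logstd) + 1e-8)
        eps = torch.randn(n_samples, *mean.shape, device=phi.device)
        return mean + std * eps            # n_samples x batch x n_classes logits

    def kl_to_prior(self):
        # KL(q(W) || N(0, I)) for a factorized Gaussian posterior.
        var = torch.exp(2.0 * self.w_logstd)
        return 0.5 * torch.sum(self.w_mu ** 2 + var - 2.0 * self.w_logstd - 1.0)
```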
Optimization

We use data augmentation techniques during training, consisting of random cropping to 32px with padding of 4px, random horizontal flipping and input normalization. The entire network is trained jointly for a fixed number of epochs with the Adam optimizer, using a learning rate of the form 10⁻ᵏ, cosine annealing, and a fixed training batch size.

D Probabilistic methods for active learning
One surprising result we found in our experiments was the strong performance of the probabilistic baselines MaxEnt and BALD, especially considering that a number of previous works have reported weaker results for these methods (e.g. [5]).

Probabilistic methods rely on the parameter posterior distribution p(θ | D). For neural network based models, posterior inference is usually intractable and we are forced to resort to approximate inference techniques [43]. We hypothesize that probabilistic AL methods are highly sensitive to the inference method used to train the approximate posterior distribution q(θ) ≈ p(θ | D). Many works use Monte Carlo Dropout (MCDropout) [44] as the standard method for these approximations [45, 13], but commonly only use MCDropout on the final layer.

In our work, we find that a Bayesian multi-class classification model on the final layer of a powerful deterministic feature extractor, trained with variational inference [37, 38], tends to lead to significant performance gains compared to using MCDropout on the final layer. A comparison of these two methods is shown in Fig. D.6, demonstrating that for cifar10, SVHN and
Fashion MNIST a neural linear model is preferable to one trained with MCDropout in the AL setting. In future work, we intend to further explore the trade-offs implied by using different inference procedures for AL.

Figure D.6: Test accuracy on (a) cifar10, (b) SVHN and (c) Fashion MNIST for BALD, BALD (MCDropout), MaxEnt, MaxEnt (MCDropout) and Random.