Active Learning and Best-Response Dynamics
Maria-Florina Balcan, Chris Berlind, Avrim Blum, Emma Cohen, Kaushik Patnaik, Le Song
June 26, 2014
Abstract
We examine an important setting for engineered systems in which low-power distributed sensors are each making highly noisy measurements of some unknown target function. A center wants to accurately learn this function by querying a small number of sensors, which ordinarily would be impossible due to the high noise rate. The question we address is whether local communication among sensors, together with natural best-response dynamics in an appropriately-defined game, can denoise the system without destroying the true signal and allow the center to succeed from only a small number of active queries. By using techniques from game theory and empirical processes, we prove positive (and negative) results on the denoising power of several natural dynamics. We then show experimentally that when combined with recent agnostic active learning algorithms, this process can achieve low error from very few queries, performing substantially better than active or passive learning without these denoising dynamics as well as passive learning with denoising.
1 Introduction

Active learning has been the subject of significant theoretical and experimental study in machine learning, due to its potential to greatly reduce the amount of labeling effort needed to learn a given target function. However, to date, such work has focused only on the single-agent low-noise setting, with a learning algorithm obtaining labels from a single, nearly-perfect labeling entity. In large part this is because the effectiveness of active learning is known to quickly degrade as noise rates become high [5]. In this work, we introduce and analyze a novel setting where label information is held by highly-noisy low-power agents (such as sensors or micro-robots). We show how by first using simple game-theoretic dynamics among the agents we can quickly approximately denoise the system. This allows us to exploit the power of active learning (especially, recent advances in agnostic active learning), leading to efficient learning from only a small number of expensive queries.

∗ School of Computer Science, Carnegie Mellon University. † College of Computing, Georgia Institute of Technology. ‡ School of Computer Science, Carnegie Mellon University. § School of Mathematics, Georgia Institute of Technology. ¶ College of Computing, Georgia Institute of Technology.
‖ College of Computing, Georgia Institute of Technology.

We specifically examine a setting relevant to many engineered systems where we have a large number of low-power agents (e.g., sensors). These agents are each measuring some quantity, such as whether there is a high or low concentration of a dangerous chemical at their location, but they are assumed to be highly noisy. We also have a center, far away from the region being monitored, which has the ability to query these agents to determine their state. Viewing the agents as examples, and their states as noisy labels, the goal of the center is to learn a good approximation to the true target function (e.g., the true boundary of the high-concentration region for the chemical being monitored) from a small number of label queries. However, because of the high noise rate, learning this function directly would require a very large number of queries to be made (for noise rate η, one would necessarily require Ω(1/(1 − 2η)²) queries [4]). The question we address in this paper is to what extent this difficulty can be alleviated by providing the agents the ability to engage in a small amount of local communication among themselves.

What we show is that by using local communication and applying simple robust state-changing rules such as following natural game-theoretic dynamics, randomly distributed agents can modify their state in a way that greatly de-noises the system without destroying the true target boundary. This then nicely meshes with recent advances in agnostic active learning [1], allowing for the center to learn a good approximation to the target function from a small number of queries to the agents.
In particular, in addition to proving theoretical guarantees on the denoising power of game-theoretic agent dynamics, we also show experimentally that a version of the agnostic active learning algorithm of [1], when combined with these dynamics, is indeed able to achieve low error from a small number of queries, outperforming active and passive learning algorithms without the best-response denoising step, as well as outperforming passive learning algorithms with denoising. More broadly, engineered systems such as sensor networks are especially well-suited to active learning because components may be able to communicate among themselves to reduce noise, and because the designer has some control over how the components are distributed, so assumptions such as a uniform or other "nice" distribution on the data are reasonable. We focus in this work primarily on the natural case of linear separator decision boundaries, but many of our results extend directly to more general decision boundaries as well.

There has been significant work in active learning (e.g., see [10, 14] and references therein), yet it is known that active learning can provide significant benefits only in low-noise scenarios [5]. There has also been extensive work analyzing the performance of simple dynamics in consensus games [6, 8, 13, 12, 3, 2]. However, this work has focused on reaching some equilibrium or a state of low social cost, while we are primarily interested in getting near a specific desired configuration, which as we show below is an approximate equilibrium.
2 The setting

We assume we have a large number N of agents (e.g., sensors) distributed uniformly at random in a geometric region, which for concreteness we consider to be the unit ball in R^d. There is an unknown linear separator such that in the initial state, each sensor on the positive side of this separator is positive independently with probability ≥ 1 − η, and each on the negative side is negative independently with probability ≥ 1 − η. The quantity η < 1/2 is the noise rate.

2.1 The basic sensor consensus game

The sensors will denoise themselves by viewing themselves as players in a certain consensus game, and performing a simple dynamics in this game leading towards a specific ε-equilibrium. Specifically, the game is defined as follows, and is parameterized by a communication radius r, which should be thought of as small. Consider a graph where the sensors are vertices, and any two sensors within distance r are connected by an edge. Each sensor is in one of two states, positive or negative. The payoff a sensor receives is its correlation with its neighbors: the fraction of neighbors in the same state as it minus the fraction in the opposite state. So, if a sensor is in the same state as all its neighbors then its payoff is 1, if it is in the opposite state of all its neighbors then its payoff is −1, and if sensors are in uniformly random states then the expected payoff is 0. Note that the states of highest social welfare (highest sum of utilities) are the all-positive and all-negative states, which are not what we are looking for. Instead, we want sensors to approach a different near-equilibrium state in which (most of) those on the positive side of the target separator are positive and (most of) those on the negative side of the target separator are negative.
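The game just defined is easy to state in code. The following sketch, in Python with NumPy, follows the definitions in this section (neighbors are sensors within distance r, and the payoff is the fraction of agreeing neighbors minus the fraction of disagreeing ones); the function names are ours, not part of the paper's formalism.

```python
import numpy as np

def neighbors(X, i, r):
    """Indices of all sensors within distance r of sensor i, excluding i itself."""
    dist = np.linalg.norm(X - X[i], axis=1)
    return np.flatnonzero((dist <= r) & (np.arange(len(X)) != i))

def payoff(X, s, i, r):
    """Sensor i's payoff in the consensus game: fraction of neighbors in the
    same state (states are +1/-1) minus the fraction in the opposite state.
    Returns 0 if sensor i has no neighbors."""
    nb = neighbors(X, i, r)
    if len(nb) == 0:
        return 0.0
    return 2.0 * np.mean(s[nb] == s[i]) - 1.0
```

For instance, a sensor agreeing with all of its neighbors receives payoff 1, one disagreeing with all of them receives payoff −1, and one agreeing with exactly half receives payoff 0, matching the examples in the text.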
For this reason, we need to be particularly careful with the specific dynamics followed by the sensors.

We begin with a simple lemma showing that for sufficiently large N, the target function (i.e., all sensors on the positive side of the target separator in the positive state and the rest in the negative state) is an ε-equilibrium, in that no sensor has more than ε incentive to deviate.

Lemma 1
For any ε, δ > 0, for sufficiently large N, with probability 1 − δ the target function is an ε-equilibrium.

Proof Sketch: The target function fails to be an ε-equilibrium iff there exists a sensor for which more than a 1/2 + ε/4 fraction of its neighbors lie on the opposite side of the separator. Fix one sensor x and consider the probability this occurs to x, over the random placement of the N − 1 other sensors. Since the probability mass of the r-ball around x is at least (r/2)^d (see the discussion in the proof of Theorem 2), so long as N − 1 ≥ (2/r)^d · max[8, 16/ε²] ln(2N/δ), with probability 1 − δ/(2N) the point x will have m_x ≥ (8/ε²) ln(2N/δ) neighbors (by Chernoff bounds), each of which is at least as likely to be on x's side of the target as on the other side. Thus, by Hoeffding bounds, the probability that more than a 1/2 + ε/4 fraction lie on the wrong side is at most δ/(2N) + δ/(2N) = δ/N. The result then follows by a union bound over all N sensors. For a somewhat tighter argument and a concrete bound on N, see the proof of Theorem 2, which effectively has this as a special case.

Lemma 1 motivates the use of best-response dynamics for denoising. Specifically, we consider a dynamics in which each sensor switches to the majority vote of all the other sensors in its neighborhood. We analyze below the denoising power of this dynamics under both synchronous and asynchronous update models. In Appendix A, we also consider more robust (though less practical) dynamics in which sensors perform more involved computations over their neighborhoods.

3 Analysis of best-response dynamics

3.1 Synchronous updates

We start by providing a positive theoretical guarantee for one-round simultaneous-move dynamics. We will use the following standard concentration bound:

Theorem 1 (Bernstein, 1924)
Let X = Σ_{i=1}^N X_i be a sum of independent random variables such that |X_i − E[X_i]| ≤ M for all i. Then for any t > 0,

P[X − E[X] > t] ≤ exp( −t² / (2 Var[X] + 2Mt/3) ).

Theorem 2 If N ≥ 2(2/r)^d (1/2 − η)^{−2} ln( 2(2/r)^d (1/2 − η)^{−2} / δ ) + 1, then with probability ≥ 1 − δ, after one synchronous consensus update, every sensor at distance ≥ r from the separator has the correct label.

Note that since a band of width r about a linear separator has probability mass O(r√d), Theorem 2 implies that with high probability one synchronous update denoises all but an O(r√d) fraction of the sensors. In fact, Theorem 2 does not require the separator to be linear, and so this conclusion applies to any decision boundary with similar surface area, such as an intersection of a constant number of halfspaces or a decision surface of bounded curvature.

Proof (Theorem 2):
Fix a point x in the sample at distance ≥ r from the separator and consider the ball of radius r centered at x. Let n⁺ be the number of correctly labeled points within the ball and n⁻ be the number of incorrectly labeled points within the ball. Now consider the random variable Δ = n⁻ − n⁺. Denoising x can give it the incorrect label only if Δ ≥ 0, so we would like to bound the probability that this happens. We can express Δ as the sum of N − 1 independent random variables Δ_i taking on value 0 for points outside the ball around x, 1 for incorrectly labeled points inside the ball, or −1 for correctly labeled points inside the ball. Let V be the measure of the ball centered at x (which may be less than r^d if x is near the boundary of the unit ball). Then since the ball lies entirely on one side of the separator we have

E[Δ_i] = (1 − V)·0 + V·η − V·(1 − η) = −V(1 − 2η).

Since |Δ_i| ≤ 1 we can take M = 2 in Bernstein's theorem. We can also calculate that Var[Δ_i] ≤ E[Δ_i²] = V. Thus the probability that the point x is updated incorrectly is

P[ Σ_{i=1}^{N−1} Δ_i ≥ 0 ] = P[ Σ_{i=1}^{N−1} Δ_i − E[Σ_{i=1}^{N−1} Δ_i] ≥ (N − 1)V(1 − 2η) ]
  ≤ exp( −((N − 1)V(1 − 2η))² / ( 2(N − 1)V + 4(N − 1)V(1 − 2η)/3 ) )
  ≤ exp( −(N − 1)V(1 − 2η)²/4 )
  ≤ exp( −(N − 1)(r/2)^d (1/2 − η)² ),

where in the last step we lower bound the measure V of the ball around x by the measure of the ball of radius r/2 inscribed in its intersection with the unit ball. Taking a union bound over all N points, it suffices to have exp(−(N − 1)(r/2)^d (1/2 − η)²) ≤ δ/N, or equivalently

N − 1 ≥ (2/r)^d (1/2 − η)^{−2} ( ln N + ln(1/δ) ).

Using the fact that ln x ≤ αx − ln α − 1 for all x, α > 0 yields the claimed bound on N.

We can now combine this result with the efficient agnostic active learning algorithm of [1].
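As an aside, the qualitative claim of Theorem 2 is easy to check empirically (this is a sanity check, not a verification of the theorem's constants; all parameter values below are illustrative choices of ours). The sketch performs one synchronous majority update on sensors placed uniformly in the unit disk and measures the error among sensors at distance at least r from a linear separator.

```python
import numpy as np

rng = np.random.default_rng(0)
N, r, eta = 2000, 0.25, 0.25

# Sensors uniform in the unit disk (rejection sampling from the square).
X = rng.uniform(-1, 1, size=(3 * N, 2))
X = X[np.linalg.norm(X, axis=1) <= 1][:N]

w = np.array([1.0, 0.0])                      # unit-norm target separator
truth = np.where(X @ w >= 0, 1, -1)
labels = truth * np.where(rng.random(N) < eta, -1, 1)   # flip each w.p. eta

# One synchronous majority update over r-neighborhoods.
D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
adj = (D <= r) & ~np.eye(N, dtype=bool)
votes = adj.astype(int) @ labels
new = np.where(votes != 0, np.sign(votes).astype(int), labels)

far = np.abs(X @ w) >= r                      # distance >= r from the separator
before = np.mean(labels != truth)
after = np.mean(new[far] != truth[far])
print(f"noise before: {before:.3f}, error among far sensors after: {after:.4f}")
```

With these parameters the post-update error among far sensors is essentially zero, while sensors within distance r of the boundary may remain noisy, matching the O(r√d) residual error discussed above.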
In particular, applying the most recent analysis of [9, 15] of the algorithm of [1], we get the following bound on the number of queries needed to efficiently learn to accuracy 1 − ε with probability 1 − δ.

Corollary 1
There exists a constant c > 0 such that for r ≤ ε/(c√d), and N satisfying the bound of Theorem 2, if sensors are each initially in agreement with the target linear separator independently with probability at least 1 − η, then one round of best-response dynamics is sufficient such that the agnostic active learning algorithm of [1] will efficiently learn to error ε using only O(d log(1/ε)) queries to sensors.

In Section 5 we implement this algorithm and show that experimentally it learns a low-error decision rule even in cases where the initial value of η is quite high.

3.2 Adversarial-order dynamics

We contrast the above positive result with a negative result for arbitrary-order asynchronous moves. In particular, we show that for any d ≥ 1, for sufficiently large N, with high probability there exists an update order that will cause all sensors to become negative.

Theorem 3
For some absolute constant c > 0, if r ≤ 1/2 and sensors begin with noise rate η, and

N ≥ (8/((cr)^d φ²)) ( ln(8/((cr)^d φ²)) + ln(1/δ) ),

where φ = φ(η) = min(η, 1/2 − η), then with probability at least 1 − δ there exists an ordering of the agents so that asynchronous updates in this order cause all points to have the same label.

Proof Sketch: Consider the case d = 1 and the target function x > 0. Each subinterval of [−1, 1] of width r has probability mass r/2; let m = rN/2 be the expected number of points within such an interval. The given value of N is sufficiently large that with high probability, all such intervals in the initial state have both a positive count and a negative count that are within ±φm of their expectations. This implies that if sensors update left-to-right, initially all sensors will (correctly) flip to negative, because their neighborhoods have more negative points than positive points. But then when the "wave" of sensors reaches the positive region, they will continue (incorrectly) flipping to negative, because the at least m(1 − φ) negative points in the left half of their neighborhood will outweigh the at most (1 − η + φ)m positive points in the right half of their neighborhood. For a detailed proof and the case of general d > 1, see Appendix B.

3.3 Random-order dynamics

While Theorem 3 shows that there exist bad orderings for asynchronous dynamics, we now show that we can get positive theoretical guarantees for random-order best-response dynamics.

The high-level idea of the analysis is to partition the sensors into three sets: those that are within distance r of the target separator, those at distance between r and 2r from the target separator, and then all the rest. For those at distance < r from the separator we will make no guarantees: they might update incorrectly when it is their turn to move, due to their neighbors on the other side of the target.
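Returning briefly to Theorem 3, the left-to-right "wave" in its proof sketch is easy to observe in simulation. The sketch below (d = 1, target sign(x); all parameter values are illustrative choices of ours) updates sensors in increasing order of position and, as the argument predicts, ends with essentially every sensor negative.

```python
import numpy as np

rng = np.random.default_rng(1)
N, r, eta = 2000, 0.05, 0.25

x = np.sort(rng.uniform(-1, 1, N))           # sensor positions, d = 1
truth = np.where(x > 0, 1, -1)               # target function: sign(x)
s = truth * np.where(rng.random(N) < eta, -1, 1)
start_pos = np.mean(s == 1)                  # fraction positive before updates

# Adversarial (left-to-right) asynchronous majority updates.
for i in range(N):
    nb = np.abs(x - x[i]) <= r
    nb[i] = False                            # exclude the sensor itself
    v = int(s[nb].sum())
    if v != 0:
        s[i] = 1 if v > 0 else -1

print("fraction negative after one left-to-right sweep:", np.mean(s == -1))
```

The sweep first (correctly) cleans up the negative region; once it crosses x = 0, the freshly flipped left half of each neighborhood outvotes the noisy positive right half, and the wrong label propagates to the right edge.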
Those at distance between r and 2r from the separator might also update incorrectly (due to "corruption" from neighbors at distance < r from the separator that had earlier updated incorrectly), but we will show that with high probability this only happens in the last quarter of the ordering. That is, within the first 3N/4 updates, with high probability there are no incorrect updates by sensors at distance between r and 2r from the target. Finally, we show that with high probability, those at distance greater than 2r never update incorrectly. This last part of the argument follows from two facts: (1) with high probability all such points begin with more correctly-labeled neighbors than incorrectly-labeled neighbors (so they will update correctly so long as no neighbors have previously updated incorrectly), and (2) after 3N/4 total updates have been made, with high probability more than half of the neighbors of each such point have already (correctly) updated, and so those points will now update correctly no matter what their remaining neighbors do. Our argument for the sensors at distance in [r, 2r] requires r to be small compared to (1/2 − η)/√d, and the final error is O(r√d), so the conclusion is that we have total error less than ε for r < c min[1/2 − η, ε]/√d for some absolute constant c.

We begin with a key lemma. For any given sensor, define its inside-neighbors to be its neighbors in the direction of the target separator and its outside-neighbors to be its neighbors away from the target separator. Also, let γ = 1/2 − η.

Lemma 2
For any c₁, c₂ > 0 there exist c₃, c₄ > 0 such that for r ≤ γ/(c₃√d) and N ≥ c₄ (2/r)^d γ^{−2} ln( (2/r)^d/(γδ) ), with probability 1 − δ, each sensor x at distance between r and 2r from the target separator has m_x ≥ c₁ γ^{−2} ln(4N/δ) neighbors, and furthermore the number of inside-neighbors of x that move before x is within ± γ m_x / c₂ of the number of outside-neighbors of x that move before x.

Proof:
First, the guarantee on m_x follows immediately from the fact that the probability mass of the ball around each sensor x is at least (r/2)^d, so for appropriate c₄ the expected value of m_x is at least 2 max[8, c₁ γ^{−2}] ln(4N/δ), and then applying Hoeffding bounds [11, 7] and the union bound. Now, fix some sensor x and let us first assume the ball of radius r about x does not cross the unit sphere. Because this is random-order dynamics, if x is the k-th sensor to move within its neighborhood, the k − 1 sensors that move earlier are each equally likely to be an inside-neighbor or an outside-neighbor. So the question reduces to: if we flip k − 1 ≤ m_x fair coins, what is the probability that the number of heads differs from the number of tails by more than γ m_x / c₂? For m_x ≥ c₁ γ^{−2} ln(4N/δ) with c₁ sufficiently large, this is at most δ/(2N) by Hoeffding bounds. Now, if the ball of radius r about x does cross the unit sphere, then a random neighbor is slightly more likely to be an inside-neighbor than an outside-neighbor. However, because x has distance at most 2r from the target separator, this difference in probabilities is only O(r√d), which is at most γ/(2c₂) for an appropriate choice of constant c₃. So, the result follows by applying Hoeffding bounds to the γ/(2c₂) gap that remains.

Theorem 4
For some absolute constants c₅, c₆ > 0, for r ≤ γ/(c₅√d) and N ≥ c₆ (2/r)^d γ^{−2} ln( (2/r)^d/(γδ) ), in random-order dynamics, with probability 1 − δ all sensors at distance greater than r from the target separator update correctly.

Footnote: We can analyze the difference in probabilities as follows. First, in the worst case, x is at distance exactly 2r from the separator and is right on the edge of the unit ball. So we can define our coordinate system to view x as being at location (2r, √(1 − 4r²), 0, …, 0). Now, consider adding to x a random offset y in the r-ball. We want to compare the probability that x + y has Euclidean length less than 1 conditioned on the first coordinate of y being negative with this probability conditioned on the first coordinate of y being positive. Notice that because the second coordinate of x is nearly 1, if y₂ ≤ −cr² for appropriate c then x + y has length less than 1 no matter what the other coordinates of y are (the worst case is y₁ = r, but even that adds at most O(r²) to the squared length). On the other hand, if y₂ ≥ cr² then x + y has length greater than 1, also no matter what the other coordinates of y are. So, it is only in between that the value of y₁ matters. But notice that the distribution over y₂ has maximum density O(√d/r). So, with probability nearly 1/2 the point is inside the unit ball for sure, with probability nearly 1/2 the point is outside the unit ball for sure, and only with probability O(r² · √d/r) = O(r√d) does the y₁ coordinate make any difference at all.

Proof Sketch (Theorem 4): We begin by using Lemma 2 to argue that with high probability, no points at distance between r and 2r from the separator update incorrectly within the first 3N/4 updates (which immediately implies that all points at distance greater than 2r update correctly as well, since by Theorem 2, with high probability they begin with more correctly-labeled neighbors than incorrectly-labeled neighbors, and their neighborhood only becomes more favorable).
In particular, for any given such point, the concern is that some of its inside-neighbors may have previously updated incorrectly. However, we use two facts: (1) by Lemma 2, we can set c₂ so that with high probability the total contribution of neighbors that have already updated is at most γ m_x / 4 in the incorrect direction (since the outside-neighbors will have updated correctly, by induction), and (2) by standard concentration inequalities [11, 7], with high probability at least m_x / 4 neighbors of x have not yet updated. These m_x / 4 un-updated neighbors together have in expectation a γ m_x / 2 bias in the correct direction, and so with high probability have greater than a γ m_x / 4 correct bias for sufficiently large m_x (sufficiently large c₁ in Lemma 2). So, with high probability this overcomes the at most γ m_x / 4 incorrect bias of neighbors that have already updated, and so the points will indeed update correctly as desired. Finally, we consider the points of distance ≥ 2r. Within the first 3N/4 updates, with high probability they will all update correctly as argued above. Now consider time 3N/4. For each such point, in expectation 3/4 of its neighbors have already updated, and with high probability, for all such points the fraction of neighbors that have updated is more than half. Since all neighbors have updated correctly so far, this means these points will have more correct neighbors than incorrect neighbors no matter what the remaining neighbors do, and so they will update correctly themselves.

4 The active learning algorithm

Recently, Awasthi et al. [1] gave the first polynomial-time active learning algorithm able to learn linear separators to error ε over the uniform distribution in the presence of agnostic noise of rate O(ε). Moreover, the algorithm does so with optimal query complexity of O(d log(1/ε)).
This algorithm is ideally suited to our setting because (a) the sensors are uniformly distributed, and (b) the result of best-response dynamics is noise that is low but potentially highly coupled (hence fitting the low-noise agnostic model). In our experiments (Section 5) we show that indeed this algorithm, when combined with best-response dynamics, achieves low error from a small number of queries, outperforming active and passive learning algorithms without the best-response denoising step, as well as outperforming passive learning algorithms with denoising.

Here, we briefly describe the algorithm of [1] and the intuition behind it. At a high level, the algorithm proceeds through several rounds, in each performing the following operations (see also Figure 1):

Instance space localization:
Request labels for a random sample of points within a band of width b_k = O(2^{−k}) around the boundary of the previous hypothesis w_k.

Concept space localization:
Solve for hypothesis vector w_{k+1} by minimizing hinge loss subject to the constraint that w_{k+1} lie within a radius r_k from w_k; that is, ||w_{k+1} − w_k|| ≤ r_k.

[1, 9, 15] show that by setting the parameters appropriately (in particular, b_k = Θ(1/2^k) and r_k = Θ(1/2^k)), the algorithm will achieve error ε using only k = O(log(1/ε)) rounds, with O(d) label requests per round.

Figure 1: The margin-based active learning algorithm after iteration k. The algorithm samples points within margin b_k of the current weight vector w_k and then minimizes the hinge loss over this sample subject to the constraint that the new weight vector w_{k+1} is within distance r_k from w_k.

In particular, a key idea of their analysis is to decompose, in round k, the error of a candidate classifier w as its error outside margin b_k of the current separator plus its error inside margin b_k, and to prove that for these parameters, a small constant error inside the margin suffices to reduce the overall error by a constant factor. A second key part is that by constraining the search for w_{k+1} to vectors within a ball of radius r_k about w_k, they show that hinge loss acts as a sufficiently faithful proxy for 0-1 loss.

5 Experiments

In our experiments we seek to determine whether our overall algorithm of best-response dynamics combined with active learning is effective at denoising the sensors and learning the target boundary. The experiments were run on synthetic data, and compared active and passive learning (with Support Vector Machines) both pre- and post-denoising.
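Before turning to the experimental setup, the two localization steps described above can be sketched in code. This is a schematic only: projected subgradient descent stands in for the constrained hinge-loss minimization of [1], and the schedules, step sizes, and function names are illustrative choices of ours rather than the constants from the analysis.

```python
import numpy as np

def margin_based_learner(X, query, w0, rounds=6, per_round=60, rng=None):
    """Schematic margin-based active learner in the style of [1].

    X: pool of unlabeled points; query(i) returns a {-1,+1} label for X[i];
    w0: initial unit-length hypothesis.  b_k and r_k shrink as Theta(2^-k).
    """
    rng = rng or np.random.default_rng(0)
    w = w0 / np.linalg.norm(w0)
    for k in range(1, rounds + 1):
        b_k = r_k = 2.0 ** -k
        band = np.flatnonzero(np.abs(X @ w) <= b_k)   # instance space localization
        if len(band) == 0:
            break
        S = rng.choice(band, size=min(per_round, len(band)), replace=False)
        y = np.array([query(i) for i in S])
        # Concept space localization: minimize hinge loss over ||v - w|| <= r_k
        # via projected subgradient descent (a stand-in for the solver in [1]).
        v, center = w.copy(), w.copy()
        for _ in range(200):
            margins = y * (X[S] @ v)
            viol = margins < 1
            g = -(y[viol][:, None] * X[S][viol]).sum(axis=0) / len(S)
            v = v - 0.1 * g
            offset = v - center
            norm = np.linalg.norm(offset)
            if norm > r_k:                            # project back onto the ball
                v = center + r_k * offset / norm
        w = v / np.linalg.norm(v)
    return w
```

On noiseless uniform data this schematic version steadily rotates the hypothesis toward the target while querying only points near the current boundary, which is the behavior the analysis of [1] formalizes.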
Synthetic data.
The N sensor locations were generated from a uniform distribution over the unit ball in R², and the target boundary was fixed as a randomly chosen linear separator through the origin. To simulate noisy scenarios, we corrupted the true sensor labels using two different methods: (1) flipping the sensor labels with probability η, and (2) flipping randomly chosen sensor labels and all their neighbors, to create pockets of noise, with an η fraction of total sensors corrupted.

Denoising via best-response dynamics.
In the denoising phase of the experiments, the sensors applied the basic majority consensus dynamic. That is, each sensor was made to update its label to the majority label of its neighbors within distance r from its location. We used four different values of the communication radius r. We also tested distance-weighted majority and randomized majority dynamics and experimentally observed results similar to those of the basic majority dynamic.
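A compact version of this denoising phase, including the "pockets of noise" generator described under "Synthetic data." above, can be sketched as follows (the parameter values, the pocket radius, and the variable names are illustrative choices of ours, not the exact configuration used in the experiments).

```python
import numpy as np

rng = np.random.default_rng(4)
N, r, eta, r_pocket = 2000, 0.2, 0.15, 0.1

# Sensors uniform in the unit disk; target is a separator through the origin.
X = rng.uniform(-1, 1, size=(3 * N, 2))
X = X[np.linalg.norm(X, axis=1) <= 1][:N]
truth = np.where(X[:, 0] >= 0, 1, -1)

# Pockets of noise: flip random centers and all their r_pocket-neighbors
# until an eta fraction of the sensors has been corrupted.
labels, corrupted = truth.copy(), np.zeros(N, dtype=bool)
while corrupted.sum() < eta * N:
    c = rng.integers(N)
    pocket = np.linalg.norm(X - X[c], axis=1) <= r_pocket
    labels[pocket & ~corrupted] *= -1
    corrupted |= pocket
init_noise = np.mean(labels != truth)

# Basic majority consensus dynamic, run for a few synchronous rounds.
adj = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2) <= r
np.fill_diagonal(adj, False)
for t in range(5):
    votes = adj.astype(int) @ labels
    labels = np.where(votes != 0, np.sign(votes).astype(int), labels)
    print(f"round {t + 1}: noise = {np.mean(labels != truth):.3f}")
```

As in the experiments, most pockets are wiped out within the first round or two, while corruption straddling the target boundary (or unusually dense clusters of pockets) can take longer to remove.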
Figure 2: Initial vs. final noise rates for synchronous updates (left) and comparison of synchronous and asynchronous dynamics (right). A synchronous round updates every sensor once simultaneously, while one asynchronous round consists of N random updates.

Updates of sensor labels were carried out both through simultaneous updates to all the sensors in each iteration (synchronous updates) and by updating one randomly chosen sensor in each iteration (asynchronous updates).

Learning the target boundary.
After denoising the dataset, we employ the agnostic active learning algorithm of Awasthi et al. [1] described in Section 4 to decide which sensors to query and obtain a linear separator. We also extend the algorithm to the case of non-linear boundaries by implementing a kernelized version (see Appendix C for more details). Here we compare the resulting error (as measured against the "true" labels given by the target separator) against that obtained by training an SVM on a randomly selected labeled sample of the sensors of the same size as the number of queries used by the active algorithm. We also compare these post-denoising errors with those of the active algorithm and SVM trained on the sensors before denoising. For the active algorithm, we used parameters asymptotically matching those given in Awasthi et al. [1] for a uniform distribution. For the SVM, we chose for each experiment the regularization parameter that resulted in the best performance.
Here we report the results for N = 10000 and a single representative value of r. Results for experiments with other values of the parameters are included in Appendix C. Every value reported is an average over 50 independent trials.

Denoising effectiveness.
Figure 2 (left side) shows, for various initial noise rates, the fraction of sensors with incorrect labels after applying 100 rounds of synchronous denoising updates. In the random noise case, the final noise rate remains very small even for relatively high levels of initial noise. Pockets of noise appear to be more difficult to denoise. In this case, the final noise rate increases with the initial noise rate, but is still nearly always smaller than the initial level of noise.

Synchronous vs. asynchronous updates.
To compare synchronous and asynchronous updates we plot the noise rate as a function of the number of rounds of updates in Figure 2 (right side). As our theory suggests, both simultaneous updates and asynchronous updates can quickly converge to a low level of noise in the random noise setting. Neither update strategy achieves the same level of performance in the case of pockets of noise.
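The two update schedules compared here can be written as follows (function names are ours); both take the Boolean adjacency matrix of the r-neighborhood graph and a state vector in {−1, +1}, with ties and isolated sensors keeping their current state.

```python
import numpy as np

def sync_round(adj, s):
    """One synchronous round: all sensors simultaneously adopt the
    majority state of their neighborhoods (keeping their state on a tie)."""
    votes = adj.astype(int) @ s
    return np.where(votes != 0, np.sign(votes).astype(int), s)

def async_round(adj, s, rng):
    """One asynchronous round: N single-sensor majority updates,
    applied one at a time in a uniformly random order."""
    s = s.copy()
    for i in rng.permutation(len(s)):
        v = adj[i].astype(int) @ s
        if v != 0:
            s[i] = 1 if v > 0 else -1
    return s
```

Counting one synchronous round as N simultaneous updates and one asynchronous round as N sequential updates makes the two curves in Figure 2 (right) comparable on the same axis.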
Generalization error: pre- vs. post-denoising and active vs. passive.
We trained both active and passive learning algorithms on both pre- and post-denoised sensors at various label budgets, and measured the resulting generalization error (determined by the angle between the target and the learned separator). The results of these experiments are shown in Figure 3. Notice that, as expected, denoising helps significantly, and on the denoised dataset the active algorithm achieves better generalization error than support vector machines at low label budgets. For example, at a label budget of 30, active learning achieves generalization error approximately 33% lower than the generalization error of SVMs. Similar observations were also obtained upon comparing the kernelized versions of the two algorithms (see Appendix C).
Figure 3: Generalization error of the two learning methods with random noise (left) and pockets of noise (right).

6 Discussion

We demonstrate through theoretical analysis as well as experiments on synthetic data that local best-response dynamics can significantly denoise a highly noisy sensor network without destroying the underlying signal, allowing for fast learning from a small number of label queries. Another way to view this result is that the cost function we really care about is that a sensor should get a cost of 1 for having a label that disagrees with the target function and a cost of 0 for a label that agrees with the target function; unfortunately, a sensor cannot measure this directly, so instead we give each sensor a "proxy" objective that it can measure: agreeing with its neighbors. Our positive theoretical guarantees show that updating according to this proxy will perform well according to the true cost function, and they apply both to synchronous and random-order asynchronous updates. This is borne out in the experiments as well. Our negative result in Section 3.2 for adversarial-order dynamics, in which a left-to-right update order can cause the entire system to switch to a single label, raises the question of whether an alternative dynamics could be robust to adversarial update orders. In Appendix A we present an alternative dynamics that we prove is indeed robust to arbitrary update orders, but this dynamics is less practical because it requires substantially more computational power on the part of the sensors. It is an interesting question whether such general robustness can be achieved by a simple practical update rule. More generally, is there a different measurable, local, practical proxy objective that would do an even better job than those considered here at optimizing the true error objective? Another natural direction is to explore the use of denoising protocols when tracking a target that is changing over time.
Acknowledgments
This work was supported in part by ONR grant N00014-09-1-0751, NSF grants CCF-0953192, CCF-1101283, and CCF-1101215, and AFOSR grant FA9550-09-1-0538.
References

[1] P. Awasthi, M. F. Balcan, and P. Long. The power of localization for efficiently learning linear separators with noise. In STOC, 2014.
[2] M.-F. Balcan, A. Blum, and Y. Mansour. The price of uncertainty. In EC, 2009.
[3] M.-F. Balcan, A. Blum, and Y. Mansour. Circumventing the price of anarchy: Leading dynamics to good behavior. SICOMP, 2014.
[4] M. F. Balcan and V. Feldman. Statistical active learning algorithms. In NIPS, 2013.
[5] A. Beygelzimer, S. Dasgupta, and J. Langford. Importance weighted active learning. In ICML, 2009.
[6] L. Blume. The statistical mechanics of strategic interaction. Games and Economic Behavior, 5:387–424, 1993.
[7] S. Boucheron, G. Lugosi, and P. Massart. Concentration Inequalities: A Nonasymptotic Theory of Independence. OUP Oxford, 2013.
[8] G. Ellison. Learning, local interaction, and coordination. Econometrica, 61:1047–1071, 1993.
[9] S. Hanneke. Personal communication, 2013.
[10] S. Hanneke. A statistical theory of active learning. Foundations and Trends in Machine Learning, pages 1–212, 2013.
[11] W. Hoeffding. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58(301):13–30, March 1963.
[12] D. Kempe, J. Kleinberg, and E. Tardos. Maximizing the spread of influence through a social network. In Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '03, pages 137–146. ACM, 2003.
[13] S. Morris. Contagion. The Review of Economic Studies, 67(1):57–78, 2000.
[14] B. Settles. Active Learning. Synthesis Lectures on Artificial Intelligence and Machine Learning. Morgan & Claypool Publishers, 2012.
[15] L. Yang. Mathematical Theories of Interaction with Oracles. PhD thesis, CMU Dept. of Machine Learning, 2013.
A Arbitrary Order and Conservative Best Response Dynamics
Given the negative result of Section 3.2, the basic best-response dynamics would not be appropriate to use if no assumptions can be made about the order in which sensors perform their updates. To address this problem, we describe here a modified dynamics that we call conservative best-response. The idea of this dynamics is that sensors only change their state when they are confident that they are not on the wrong side of the target separator. This dynamics is not as practical as regular best-response dynamics because it requires substantially more computation on the part of the sensors. Nonetheless, it demonstrates that positive results for arbitrary update orders are indeed achievable.
Conservative best-response dynamics:
In this dynamics, sensors behave as follows:

1. If, for all linear separators through the sensor's location, a majority of neighbors on both sides of the separator are positive, then flip to positive.
2. If, for all linear separators through the sensor's location, a majority of neighbors on both sides of the separator are negative, then flip to negative.
3. Else (for some linear separator through the sensor's location, the majority on one side is positive and the majority on the other side is negative), don't change.
4. To address sensors near the boundary of the unit sphere, in (1)-(3) we only consider linear separators with at least / of the points in their neighborhood on each side.

Theorem 5
For absolute constants c₁ and c₂, for r ≤ c₁γ/√d and N ≥ (c₂/(r^d γ²)) ln(1/(r^d γδ)), in arbitrary-order conservative best-response dynamics, each sensor whose r-ball does not intersect the target separator will flip state correctly, and no sensor will ever flip state incorrectly.

Thus, Theorem 5 contrasts nicely with the negative result in Theorem 3 for standard best-response dynamics and shows that the potential difficulties of arbitrary-order dynamics no longer apply.
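In two dimensions, the quantifier "for all linear separators through the sensor's location" can be checked exactly, since rotating a line through the sensor only changes side membership when the line passes a neighbor. The sketch below implements one conservative update under that observation; the function name, the tie handling, and the exact boundary fraction `min_frac` are illustrative assumptions rather than the paper's specification:

```python
import math

def conservative_update(sensor, neighbors, labels, min_frac=1.0 / 3):
    """One conservative best-response update for a sensor in the plane.

    `neighbors` are the (x, y) locations within communication radius and
    `labels` their current +1/-1 states.  Returns +1 or -1 if the sensor
    should flip, or None to leave its state unchanged.  `min_frac` plays
    the role of the boundary condition (4) above; the value 1/3 is an
    assumption, since the exact fraction is not fixed here.
    """
    k = len(neighbors)
    if k == 0:
        return None
    rel = [(px - sensor[0], py - sensor[1]) for px, py in neighbors]
    angles = sorted(math.atan2(dy, dx) for dx, dy in rel)
    # Midpoints between consecutive neighbor angles enumerate all
    # distinct ways a line through the sensor can split the neighbors.
    cands = [(angles[i] + angles[(i + 1) % k]) / 2.0 for i in range(k)]
    verdicts = []
    for theta in cands:
        c, s = math.cos(theta), math.sin(theta)
        side1 = [labels[i] for i, (dx, dy) in enumerate(rel) if c * dy - s * dx >= 0]
        side2 = [labels[i] for i, (dx, dy) in enumerate(rel) if c * dy - s * dx < 0]
        if min(len(side1), len(side2)) < min_frac * k:
            continue  # too lopsided: ignored, as for sensors near the boundary
        verdicts.append((sum(side1), sum(side2)))
    if verdicts and all(a > 0 and b > 0 for a, b in verdicts):
        return +1  # positive majority on both sides of every separator
    if verdicts and all(a < 0 and b < 0 for a, b in verdicts):
        return -1  # negative majority on both sides of every separator
    return None  # some separator is ambiguous: change nothing
```

For a sensor surrounded by unanimous neighbors the rule flips; with neighbors split by a halfplane it conservatively stays put, as the dynamics requires.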
Proof:
We will show that for the given value of N, with high probability we have the following initial conditions: for each of the N sensors, for all hemispheres of radius r centered at that sensor, the empirical fraction of points in that hemisphere that are labeled positive is within a γ/2 fraction of its expectation, where γ = 1/2 − η. This implies in particular that at the start of the dynamics, all such hemispheres that are fully on the positive side of the target separator have more positive-labeled sensors than negative-labeled sensors, and all such hemispheres that are fully on the negative side of the target separator have more negative-labeled sensors than positive-labeled sensors. This in turn implies, by induction, that in the course of the dynamics, no sensor ever switches to the incorrect label. In particular, if we consider the hyperplane through a sensor that is parallel to the target and consider the hemisphere of neighbors on the "good side" of this hyperplane, by induction none of those neighbors will have ever switched to the incorrect label, so their majority will remain the correct label, and so the current sensor will not switch incorrectly by definition of the dynamics.
In addition, the initial conditions imply that all sensors whose r-balls do not intersect the target separator will flip to the correct label when it is their turn to update. To argue this, fix a sensor and its neighborhood r-ball. Since the VC-dimension of linear separators in R^d is d + 1, so long as we see m ≥ (c/γ²)[d ln(1/γ) + ln(N/δ)] points inside that r-ball (for a sufficiently large constant c), with probability at least 1 − δ/N, for every hyperplane through the center of the ball, the number of positives in each halfspace will be within γm/2 of its expectation, and the number of negatives in each halfspace will be within γm/2 of its expectation. This means that if the halfspace is fully on the positive side of the target, then we see more positives than negatives, and if it is fully on the negative side of the target then we see more negatives than positives. (In the case of sensors near the boundary of the unit ball, this holds true for all halfspaces with sufficiently many points, which includes the halfspace defined by the hyperplane parallel to the target separator if the sensor is within distance r of the target separator, for r < c₁γ/√d.) We use δ/N as the failure probability so that we can take a union bound over all N balls. Finally, solving for N to ensure the above guarantee on m, we find that N ≥ (c₂/(r^d γ²)) ln(1/(r^d γδ)) points suffice for some constant c₂. 

B Additional Proofs
Proof of Theorem 3:
Suppose the labeling is given by sign(w · x). We show that if sensors are updated in increasing order of w · x (from most negative to most positive), then with high probability all sensors will update to negative labels.

Consider what we see when we come to update the sensor at x. Assuming we have not yet failed (given a positive label), all of the points x′ with w · x′ < w · x are labeled negative, while those with w · x′ > w · x are unchanged from their original states, and so are still labeled with independent uniform noise. As in the proof of Theorem 2, we apply Bernstein's theorem to the difference ∆ between the number of positive and negative points in the neighborhood of x, which we write as a sum of (N − 1) independent variables ∆_i. The expected labels of the nearby points depend on the location of x, so we consider three regions: w · x ≤ −r, −r < w · x ≤ 0, and 0 < w · x < r.

Let V denote the probability mass of the ball of radius r around x. In all cases the variance is bounded by Var[∆_i] ≤ E[∆_i²] = V ≤ r^d.

In the first region (w · x ≤ −r) we can use the same analysis from Theorem 2 to find that E[∆_i] ≤ −V(1 − 2η) ≤ −(r/2)^d (1 − 2η), since the ball around x never crosses the separator and any sensors previously updated to negative labels cannot hurt.

In the second region (−r < w · x ≤ 0) we can use a similar analysis, bounding E[∆_i] ≤ −V/2 + (1 − 2η)V/2 = −ηV ≤ −η(r/2)^d, since the measure of the (positively biased) half of the ball further from the separator than x is never larger than the measure of the remaining (all-negative) half of the ball.

Figure 4: A ball around x intersecting the decision boundary and the boundary of the unit ball.

In the final region (0 < w · x < r), we must take a little more care, as the measure of the all-negative half of the ball may be less than the measure of the unexamined side, which may be positively biased due to crossing the separator.
To analyze this case, we project onto the 2-dimensional plane spanned by x and w. The worst case is clearly when x is on the surface of the unit ball, as shown in Figure 4. Any point in the red region is known to have a negative label, while points in the dark blue region are biased towards positive labels. We first show that the red region is bigger, by showing that the angle α subtended by the dark blue region is smaller than the angle β of the red region. Construct the segment xA by reflecting the segment xB about the line xO and extending it to the separator. Note that the angle ∠OxA is the same as the angle θ between x and the separator. We find that α ≤ β precisely when xA ≥ xC = r. Indeed, by considering the isosceles triangle △AxO we see that xA = 1/(2 cos θ) ≥ 1/2. So as long as r ≤ 1/2 we have β ≥ α. Thus, since the projection of the uniform distribution over the unit ball onto this plane is radially symmetric, the red region has more probability mass than the blue region.

We can now calculate for this case

E[∆_i] ≤ (−1)[measure of red] + (1 − 2η)[measure of blue] + (2η − 1)[measure of white] ≤ −2η [measure of red].

Note that although the projection does not make sense for d = 1, the result obviously still holds (as there are no points near both the separator and the boundary of the unit ball). We can lower bound the measure of the red region by the measure of the sphere inscribed in the sector, which has radius at least cr for some 0 < c < 1/2, as long as r ≤ 1/2 (since β is bounded away from 0 in this range of r).

Now we see that for any x the expected difference satisfies E[∆_i] ≤ −(cr)^d min(η, 1 − 2η).
Writing φ = min(η, 1 − 2η), so that |E[∆]| ≥ (N − 1)(cr)^d φ and Var[∆] ≤ (N − 1)r^d, Bernstein's theorem gives that the probability of giving a positive label on any given update is

P[∆ ≥ 0] ≤ exp( − ((N − 1)(cr)^d φ)² / ( 2[(N − 1)r^d + (N − 1)(cr)^d φ/3] ) ) ≤ exp( −(N − 1)(cr)^d φ²/8 ),

where the last step absorbs constants into c. By the union bound, we find that

N ≥ (8/((cr)^d φ²)) ( ln(8/((cr)^d φ²)) + ln(1/δ) )

suffices to ensure that with probability at least 1 − δ all sensors are updated to negative labels. 

Note 1. If r = O(1/√d) then we can lower bound all of the relevant measures in the preceding proof by Θ(r^d) rather than (Θ(r))^d, to see that N ≥ Ω( (1/(r^d φ²)) ( ln(1/(rφ)) + ln(1/δ) ) ) suffices.

C Additional Experimental Results
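Before the experiments, the left-to-right collapse analyzed in Theorem 3 is easy to reproduce numerically. The sketch below is a one-dimensional illustration (the theorem itself concerns the d-dimensional unit ball; the particular n, r, η, and seed are arbitrary choices):

```python
import random

def adversarial_sweep(n=2000, r=0.3, eta=0.35, seed=0):
    """One-dimensional sketch of the adversarial-order phenomenon.

    Points on [-1, 1] have target label sign(x), corrupted with noise
    rate eta; sensors then best-respond to the majority label of their
    r-neighborhood in increasing order of position (the adversarial
    order).  Returns the fraction of sensors labeled positive after
    one full sweep.
    """
    rng = random.Random(seed)
    xs = sorted(rng.uniform(-1.0, 1.0) for _ in range(n))
    truth = [1 if x > 0 else -1 for x in xs]
    labels = [t if rng.random() >= eta else -t for t in truth]
    for i in range(n):
        # Sensors to the left have already updated; those to the right
        # still carry their original noisy labels.
        s = sum(labels[j] for j in range(n) if j != i and abs(xs[j] - xs[i]) <= r)
        if s != 0:
            labels[i] = 1 if s > 0 else -1
    return sum(1 for lab in labels if lab == 1) / n
```

Although roughly half the points are truly positive, the sweep drives essentially every label negative, which is exactly the failure mode the conservative dynamics of Appendix A is designed to avoid.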
All of the following experiments were run with initial noise rate η = 0. for random noise and η = 0. for pockets of noise, and the results have been averaged over 20 trials of 50 iterations each.

Effect of number of sensors on denoising and learning.
We analyze the performance of learning post-denoising as a function of the number of sensors for a fixed radius. Given the results of Theorem 2 in Section 3.1 for synchronous updates, we expect the denoising to improve as sensors are added, which in turn should improve the generalization error post-denoising. Figures 5, 6, and 7 show the generalization error pre- and post-denoising for N ∈ {1000, 5000, 25000}. For a budget of 30 labels on random noise, the noise rate after denoising drops from 12.0% with 1000 sensors to 1.7% with 25000 sensors, and with this improvement we see a corresponding drop in generalization error from 7.4% to 1.6%. Notice that denoising helps for both active and passive learning in all scenarios except for the case of pockets of noise with 1000 sensors, where the sensor density is insufficient for the denoising process to have a significant effect.

Effect of communication radius on denoising and learning.
We analyze here the performance of learning post-denoising as a function of the communication radius for a fixed number of sensors. In light of Theorem 2 we expect a larger communication radius to improve the effectiveness of denoising. Figures 8, 9, and 10 show the generalization error pre- and post-denoising for r ∈ {0.2, 0.05, 0.025} with 10000 sensors. Here denoising helps for both active and passive learning in all scenarios.
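A toy version of this experiment is easy to run. The sketch below applies one synchronous round of majority dynamics on the unit square rather than the paper's unit-ball setup, with illustrative parameter choices throughout, and reports the empirical noise rate before and after denoising:

```python
import random

def denoise_once(n=1500, r=0.1, eta=0.35, seed=0):
    """One synchronous round of majority dynamics on the unit square (an
    illustrative stand-in for the paper's unit-ball setup; all parameter
    values here are assumptions).  The target is the sign of the first
    coordinate.  Returns (noise rate before, noise rate after).
    """
    rng = random.Random(seed)
    pts = [(rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)) for _ in range(n)]
    truth = [1 if x > 0 else -1 for x, _ in pts]
    labels = [t if rng.random() >= eta else -t for t in truth]
    before = sum(lab != t for lab, t in zip(labels, truth)) / n
    new = []
    for i, (x, y) in enumerate(pts):
        # Majority over all sensors within distance r (including itself).
        s = sum(labels[j] for j, (u, v) in enumerate(pts)
                if (u - x) ** 2 + (v - y) ** 2 <= r * r)
        new.append(labels[i] if s == 0 else (1 if s > 0 else -1))
    after = sum(lab != t for lab, t in zip(new, truth)) / n
    return before, after
```

Increasing r (while it stays small relative to the target's geometry) should reduce the post-round noise rate further, mirroring the trend across Figures 8–10.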
Figure 5: Generalization error with 1000 sensors in different noise scenarios. Generalization error on the y-axis and labels used on the x-axis. From left to right: random noise, and pockets of noise.
Figure 6: Generalization error with 5000 sensors in different noise scenarios. Generalization error on the y-axis and labels used on the x-axis. From left to right: random noise, and pockets of noise.
Figure 7: Generalization error with 25000 sensors in different noise scenarios. Generalization error on the y-axis and labels used on the x-axis. From left to right: random noise, and pockets of noise.
Figure 8: Generalization error with connectivity radius of 0.2 and 10,000 sensors in different noise scenarios. Generalization error on the y-axis and labels used on the x-axis. From left to right: random noise, and pockets of noise.
Figure 9: Generalization error with connectivity radius of 0.05 and 10,000 sensors in different noise scenarios. Generalization error on the y-axis and labels used on the x-axis. From left to right: random noise, and pockets of noise.
Figure 10: Generalization error with connectivity radius of 0.025 and 10,000 sensors in different noise scenarios. Generalization error on the y-axis and labels used on the x-axis. From left to right: random noise, and pockets of noise.

C.1 Kernelized Algorithm Derivation and Results
Derivation of dual with a linear ball constraint
In order to be able to replace inner products with kernel evaluations in the dual program of the hinge loss minimization step, we replace the ball constraint ‖w_k − w_{k−1}‖ ≤ r_k with the equivalent linear constraint w_k · w_{k−1} ≥ 1 − r_k²/2 (the two are equivalent for unit-length w_k and w_{k−1}). The Lagrangian is

L = Σ_{i=1}^n ξ_i − Σ_{i=1}^n α_i ( y_i (w_k · x_i)/τ_k − 1 + ξ_i ) + β(1 − r_k²/2 − w_k · w_{k−1}) + γ(‖w_k‖² − 1) − Σ_{i=1}^n δ_i ξ_i.

To obtain the dual formulation, we take derivatives of the above equation with respect to w_k and ξ and substitute these values into the original formulation, obtaining

max_{α,β,γ} Σ_{i=1}^n α_i − (1/(4γτ_k²)) Σ_{i=1}^n Σ_{j=1}^n α_i α_j y_i y_j x_i · x_j − (β/(2τ_k γ)) w_{k−1} · Σ_{i=1}^n α_i y_i x_i − β² ‖w_{k−1}‖²/(4γ) + β(1 − r_k²/2) − γ
s.t. 0 ≤ α_i ≤ 1, β ≥ 0, γ ≥ 0.   (1)

In (1) the Lagrangian variable γ appears in the denominator of three negative terms and is itself subtracted in the objective function. Thus the extreme values 0 and ∞ for γ would each decrease the objective value and cannot yield the maximum, so the maximum of the objective function is attained at the γ satisfying ∂L/∂γ = 0.
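Two algebraic facts used in this derivation can be sanity-checked numerically: the ball-constraint/linear-constraint equivalence (for unit vectors, ‖w_k − w_{k−1}‖² = 2 − 2 w_k · w_{k−1}), and the completion-of-squares identity ΣᵢΣⱼ αᵢαⱼyᵢyⱼ xᵢ·xⱼ + 2τ_k β w_{k−1}·Σᵢαᵢyᵢxᵢ + (τ_kβ‖w_{k−1}‖)² = ‖Σᵢαᵢyᵢxᵢ + βτ_k w_{k−1}‖². A small self-contained check on random data (the function and variable names, dimensions, and values are arbitrary):

```python
import math
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def check_dual_identities(seed=0, d=5, n=8, tau=0.5):
    """Check both identities on random data; returns True if they hold."""
    rng = random.Random(seed)
    xs = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(n)]
    ys = [rng.choice([-1, 1]) for _ in range(n)]
    alpha = [rng.random() for _ in range(n)]
    beta = rng.random()
    w_prev = [rng.gauss(0, 1) for _ in range(d)]
    # (a) For unit vectors w, w': ||w - w'||^2 = 2 - 2 w.w', so the ball
    #     constraint ||w - w'|| <= r_k is exactly w.w' >= 1 - r_k^2 / 2.
    u = [rng.gauss(0, 1) for _ in range(d)]
    w = [c / math.sqrt(dot(u, u)) for c in u]
    wp = [c / math.sqrt(dot(w_prev, w_prev)) for c in w_prev]
    diff2 = sum((a - b) ** 2 for a, b in zip(w, wp))
    assert abs(diff2 - (2 - 2 * dot(w, wp))) < 1e-9
    # (b) The quadratic form in the dual collapses to a single squared norm:
    #     sum_ij a_i a_j y_i y_j x_i.x_j + 2 tau beta w'.sum_i a_i y_i x_i
    #       + (tau beta ||w'||)^2  =  || sum_i a_i y_i x_i + beta tau w' ||^2.
    u_vec = [sum(alpha[i] * ys[i] * xs[i][k] for i in range(n)) for k in range(d)]
    v = [u_vec[k] + beta * tau * w_prev[k] for k in range(d)]
    quad = sum(alpha[i] * alpha[j] * ys[i] * ys[j] * dot(xs[i], xs[j])
               for i in range(n) for j in range(n))
    cross = 2 * tau * beta * dot(w_prev, u_vec)
    square = (tau * beta) ** 2 * dot(w_prev, w_prev)
    assert abs(quad + cross + square - dot(v, v)) < 1e-6
    return True
```

The second identity is what lets the square root in the dual be written as a norm, so that expanding it leaves only pairwise inner products, each replaceable by a kernel evaluation.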
Taking the derivative of (1) with respect to γ, we get

γ = √( (1/(4τ_k²)) Σ_{i=1}^n Σ_{j=1}^n α_i α_j y_i y_j x_i · x_j + (β/(2τ_k)) w_{k−1} · Σ_{i=1}^n α_i y_i x_i + β² ‖w_{k−1}‖²/4 ).   (2)

Substituting this value into (1) and simplifying gives

min_{α,β} √( Σ_{i=1}^n Σ_{j=1}^n α_i α_j y_i y_j x_i · x_j + 2τ_k β w_{k−1} · (Σ_{i=1}^n α_i y_i x_i) + (β τ_k ‖w_{k−1}‖)² ) − τ_k Σ_{i=1}^n α_i − τ_k β(1 − r_k²/2)
s.t. 0 ≤ α_i ≤ 1, β ≥ 0.   (3)

The term under the square root in (3) can be simplified as ‖Σ_{i=1}^n α_i y_i x_i + β τ_k w_{k−1}‖², which further simplifies (3) to

max_{α,β} τ_k Σ_{i=1}^n α_i + τ_k β(1 − r_k²/2) − ‖ Σ_{i=1}^n α_i y_i x_i + β τ_k w_{k−1} ‖
s.t. 0 ≤ α_i ≤ 1, β ≥ 0.   (4)

The resulting objective can be implemented using only inner products by expanding out the norm and writing the previous weight vector as w_{k−1} = Σ_{l=1}^p α_l y_l x_l.

Results for kernelized algorithm
We also test the improvement of the active learning method for non-linear decision boundaries. The target decision boundary is a sine curve over the horizontal axis in R², with points above the curve labeled positive and points below it labeled negative. Noise was introduced in the true labels through the methods described in Section 5.1. For comparison with passive methods we calculate the classification error over 20 trials, where in each trial we average results over 20 iterations. Both the active and passive algorithms use a Gaussian kernel with bandwidth 0.1 for a smooth estimate of the boundary. All other parameters remain the same. Results are shown in Figure 6. Notice that the results are similar to the experiments with linear decision boundaries.
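For reference, the synthetic setup just described can be sketched as follows. The sine curve's amplitude and frequency and the uniform label-flip noise model are illustrative assumptions (the paper's exact parameters are not reproduced here), and the kernel is written in the convention K(p, q) = exp(−‖p − q‖²/(2h²)) with h the bandwidth, which is one common parameterization:

```python
import math
import random

def make_sine_data(n=500, eta=0.35, amp=0.5, freq=2 * math.pi, seed=0):
    """Sine-boundary dataset as described above (parameters are
    assumptions).  Points are uniform in [-1, 1]^2; the true label is +1
    above the curve y = amp * sin(freq * x), and each observed label is
    flipped with probability eta.
    """
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x, y = rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)
        label = 1 if y > amp * math.sin(freq * x) else -1
        if rng.random() < eta:
            label = -label  # noisy observation
        data.append((x, y, label))
    return data

def gaussian_kernel(p, q, bandwidth=0.1):
    """Gaussian (RBF) kernel with the bandwidth-0.1 setting used above."""
    d2 = (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2
    return math.exp(-d2 / (2.0 * bandwidth ** 2))
```

With data in this form, the kernelized objective (4) is evaluated by replacing every inner product x_i · x_j with gaussian_kernel(x_i, x_j).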