On the Sample Complexity of Adversarial Multi-Source PAC Learning
Nikola Konstantinov, Elias Frantar, Dan Alistarh, Christoph H. Lampert
OOn the Sample Complexity of Adversarial Multi-Source PAC Learning
Nikola Konstantinov Elias Frantar
Dan Alistarh Christoph H. Lampert Abstract
We study the problem of learning from multi-ple untrusted data sources, a scenario of increas-ing practical relevance given the recent emer-gence of crowdsourcing and collaborative learn-ing paradigms. Specifically, we analyze the situa-tion in which a learning system obtains datasetsfrom multiple sources, some of which might be bi-ased or even adversarially perturbed. It is knownthat in the single-source case, an adversary withthe power to corrupt a fixed fraction of the trainingdata can prevent PAC-learnability, that is, evenin the limit of infinitely much training data, nolearning system can approach the optimal test er-ror. In this work we show that, surprisingly, thesame is not true in the multi-source setting, wherethe adversary can arbitrarily corrupt a fixed frac-tion of the data sources. Our main results are ageneralization bound that provides finite-sampleguarantees for this learning setting, as well ascorresponding lower bounds. Besides establish-ing PAC-learnability our results also show that ina cooperative learning setting sharing data withother parties has provable benefits, even if someparticipants are malicious.
1. Introduction
An important problem of current machine learning researchis to make learned systems more trustworthy . One par-ticular aspect of this is robustness against data of unex-pected or even adversarial nature. Robustness at predictiontime has recently received a lot of attention, in particu-lar with work on the detection of out-of-distribution con-ditions (Hendrycks & Gimpel, 2017; Liang et al., 2018;Lee et al., 2018) and protection against adversarial exam-ples (Raghunathan et al., 2018; Singh et al., 2018; Cohen Institute of Science and Technology Austria, Klosterneuburg,Austria Vienna University of Technology, Vienna, Austria. Corre-spondence to: Nikola Konstantinov < [email protected] > . Proceedings of the th International Conference on MachineLearning , Vienna, Austria, PMLR 119, 2020. Copyright 2020 bythe author(s). et al., 2019). Robustness at training time, however, is repre-sented less prominently, despite also being of great impor-tance. One reason might be that learning from a potentiallyadversarial data source is very hard: a classic result statesthat when a fixed fraction of the training dataset is adver-sarially corrupted, successful learning in the PAC sense isnot possible anymore (Kearns & Li, 1993). In other words,there exists no robust learning algorithm that could over-come the effects of adversarial corruptions in a constantfraction of the training dataset and approach the optimalmodel, even in the limit of infinite data.In this work, we study the question of robust learning inthe multi-source case, i.e. when more than one dataset isavailable for training. This is a situation of increasing rele-vance in the era of big data , where machine learning modelstend to be trained on very large datasets. To create these,one commonly relies on distributing the task of collectingand annotating data, e.g. to crowdsourcing (Sheng & Zhang,2019) services, or by adopting a collective or federatedlearning scenario (McMahan & Ramage, 2017).Unfortunately, relying on data from other parties comeswith the danger that some of the sources might producedata of lower quality than desired, be it due to negligence,bias or malicious behaviour. Consequently, the analogousquestion to the classic problem described above is the fol-lowing, which we refer to as adversarial multi-source learn-ing . Given a number of i.i.d. datasets, a constant fractionof which might have been adversarially manipulated, isthere a learning algorithm that overcomes the effect of thecorruptions and approaches an optimal model?
In this work, we study this problem formally and providea positive answer. Specifically, our main result is an upperbound on the sample complexity of adversarial multi-sourcelearning, that holds as long as less than half of sources aremanipulated (Theorem 1).
A number of interesting results follow as immediate corol-laries. First, we show that any hypothesis class that is uni-formly convergent and hence PAC-learnable in the classicali.i.d. sense is also PAC-learnable in the adversarial multi-source scenario. This is in stark contrast to the single-sourcesituation where, as mentioned above, no non-trivial hypoth-esis class is robustly PAC-learnable. As a second conse-quence, we obtain the insight that in a cooperative learning a r X i v : . [ c s . L G ] J un dversarial Multi-Source PAC Learning scenario, every honest party can benefit from sharing theirdata with others, as compared to using their own data only,even if some of the participants are malicious.Besides our main result we prove two additional theoremsthat shed light on the difficulty of adversarial multi-sourcelearning. First, we prove that the na¨ıve but common strat-egy of simply merging all data sources and training withsome robust procedure on the joint dataset cannot resultin a robust learning algorithm (Theorem 2). Second, weprove a lower bound on the sample complexity under veryweak conditions (Theorem 3). This result shows that un-der adversarial conditions a slowdown of convergence isunavoidable, and that in order to approach optimal perfor-mance, the number of samples per source must necessarilygrow, while increasing the number of sources need not help.
2. Related work
To our knowledge, our results are the first that formally char-acterize the statistical hardness of learning from multiplei.i.d. sources, when a constant fraction of them might beadversarially corrupted. There are a number of conceptuallyrelated works, though, which we will discuss for the rest ofthis section.Qiao & Valiant (2018), as well as the follow-up works ofChen et al. (2019); Jain & Orlitsky (2019), aim at estimatingdiscrete distributions from multiple batches of data, someof which have been adversarially corrupted. The main dif-ference to our results is the focus on finite data domainsand estimating the underlying probability distribution ratherthan learning a hypothesis.Qiao (2018) studies collaborative binary classification: alearning system has access to multiple training datasets anda subset of them can be adversarially corrupted. In this setup,the uncorrupted sources are allowed to have different inputdistributions, but share a common labelling function. Theauthor proves that it is possible to robustly learn individualhypotheses for each source, but a single shared hypothesiscannot be learned robustly. For the specific case that all datadistributions are identical, the setup matches ours, thoughonly for binary classification in the realizable case, and witha different adversarial model.In a similar setting, Mahloujifar et al. (2019) show, in par-ticular, that an adversary can increase the probability ofany ”bad property” of the learned hypothesis by a term atleast proportional to the fraction of manipulated sources.These results differ from ours, by their assumption that dif-ferent sources have different distributions, which rendersthe learning problem much harder.In Konstantinov & Lampert (2019), a learning system hasaccess to multiple datasets, some of which are manipulated, and the authors prove a generalization bound and propose analgorithm based on learning with a weighted combinationof all datasets. The main difference to our work is that theirproposed method crucially relies on a trusted subset of thedata being known to the learner. Their adversary is alsoweaker, as it cannot influence the data points directly, butonly change the distribution from which they are sampled,and the work does not provide finite sample guarantees.There are a number of classic results on the fundamentallimits of PAC learning from a single labelled set of samples ,a fraction of which can be arbitrarily corrupted, e.g. (Kearns& Li, 1993; Bshouty et al., 2002). We compare our resultsagainst this classic scenario in Section 4.1.Another related general direction is the research onByzantine-resilient distributed learning, which has seen sig-nificant interest recently, e.g. (Blanchard et al., 2017; Chenet al., 2017; Yin et al., 2018; 2019; Alistarh et al., 2018).There the focus is on learning by exchanging gradient up-dates between nodes in a distributed system, an unknownfraction of which might be corrupted by an omniscient ad-versary and may behave arbitrarily. These works tend todesign defences for specific gradient-based optimization al-gorithms, such as SGD, and their theoretical analysis usuallyassumes strict conditions on the objective function, such asconvexity or smoothness. Nevertheless, the (nearly) tightsample complexity upper and lower bounds developed forByzantine-resilient gradient descent (Yin et al., 2018) and itsstochastic variant (Alistarh et al., 2018) are relevant to ourresults and are therefore discussed in detail in Sections 4.2and 5.2.The work of Awasthi et al. (2017) considers learning fromcrowdsourced data, where some of the workers might be-have arbitrarily. However, they only focus on label cor-ruptions. Feng (2017) consider the fundamental limits oflearning from adversarial distributed data, but in the casewhen each of the nodes can iteratively send corrupted up-dates with certain probability. Feng et al. (2014) provide amethod for distributing the computation of any robust learn-ing algorithm that operates on a single large dataset. Thereis also a large body of literature on attacks and defences forfederated learning, e.g. (Bhagoji et al., 2019; Fung et al.,2018). Apart from focusing on iterative gradient-basedoptimization procedures, these works also allow for natu-ral variability in the distributions of the uncorrupted datasources.
3. Preliminaries
In this section we introduce the technical definitions thatare necessary to formulate and prove our main results. Westart by reminding the reader of the classical notion of PAC-learnability and uniform convergence, as they can be found dversarial Multi-Source PAC Learning in most machine learning textbooks. We then introduce thesetting of learning from multiple sources and notions ofadversaries of different strengths.
Let X and Y be given input and output sets, respectively,and D ∈ P ( X × Y ) be a fixed but unknown probabilitydistribution. By (cid:96) : Y ×Y → R we denote a loss function,and by H ⊂ { h : X → Y} a set of hypotheses. All of thesequantities are assumed arbitrary but fixed for the purpose ofthis work.A (statistical) learner is a function L : ∪ ∞ m =1 ( X ×Y ) m →H . In the classic supervised learning scenario, the learnerhas access to a training set of m labelled examples, { ( x , y ) , . . . , ( x m , y m ) } , sampled i.i.d. from D , and aimsat learning a hypothesis h ∈ H with small risk , i.e. expectedloss, under the unknown data distribution, R ( h ) = E ( x,y ) ∼D ( (cid:96) ( h ( x ) , y )) . (1) PAC-learnability is a key property of the hypothesis set,which ensures the existence of an algorithm that performssuccessful learning:
Definition 1 ( PAC-Learnability ) . We call H (agnostic) probably approximately correct (PAC) learnable with re-spect to (cid:96) , if there exists a learner L and a function m H ,(cid:96) :(0 , × (0 , → N , such that for any (cid:15), δ ∈ (0 , , whenever S is a set of m ≥ m H ,(cid:96) ( (cid:15), δ ) i.i.d. labelled samples from D ,then with probability at least − δ over the sampling of S : R ( L ( S )) ≤ min h ∈H R ( h ) + (cid:15). (2)Another important concept related to PAC-learnability isthat of uniform convergence . Definition 2 ( Uniform convergence ) . We say that H hasthe uniform convergence property with respect to (cid:96) withrate s H ,(cid:96) , if there exists a function s H ,(cid:96) : N × (0 , × (cid:83) ∞ m =1 ( X ×Y ) m → R , such that for any distribution D ∈ P ( X × Y ) and any δ ∈ (0 , : • given m samples S = { ( x , y ) , . . . , ( x m , y m ) } i.i.d. ∼D , with probability at least − δ over the data : sup h ∈H |R ( h ) − (cid:98) R ( h ) | ≤ s H ,(cid:96) ( m, δ, S ) , (3)where ˆ R ( h ) is the empirical risk of the hypothesis h . • s H ,(cid:96) ( m, δ, S m ) → as m → ∞ , for any sequence ( S m ) m ∈ N with S m ∈ ( X ×Y ) m .Throughout the paper we drop the dependence on H and (cid:96) and simply write s for s H ,(cid:96) . Note that above definition is equivalent to the classic definition of uniform conver-gence (e.g. Chapter 4 in (Shalev-Shwartz & Ben-David,2014)). We only introduce an explicit notation, s , for thesample complexity rate of uniform convergence, as this sim-plifies the layout of our analysis later. It is well-knownthat uniform convergence implies PAC-learnability and thatthe opposite is also true for agnostic binary classification(Shalev-Shwartz & Ben-David, 2014). Our focus in this paper is on learning from multiple datasources. For simplicity of exposition, we assume that theyall provide the same number of data points, i.e. the train-ing data consists of N groups of m samples each, where m, N ∈ N are fixed integers.Formally, we denote by ( X × Y ) N × m the set of all possiblecollections (i.e. unordered sequences) of N groups of m datapoints each. A (statistical) multi-source learner is afunction L : ∪ ∞ N =1 ∪ ∞ m =1 ( X ×Y ) N × m → H that takessuch a collection of datasets and returns a predictor from H . Informally, one considers a learning system robust if it isable to learn a good hypothesis, even when the training datais not perfectly i.i.d., but contains some artifacts, e.g. annota-tion errors, a selection bias or even malicious manipulations.Formally, one models this by assuming the presence of an adversary , that observes the original datasets and outputspotentially manipulated versions. The learner then has tooperate on the manipulated data without knowledge of whatthe original one had been or what manipulations have beenmade.
Definition 3 ( Adversary).
An adversary is any function A : ( X ×Y ) N × m → ( X ×Y ) N × m .Throughout the paper, we denote by S (cid:48) = { S (cid:48) , S (cid:48) , . . . , S (cid:48) N } the original, uncorrupted datasets, drawni.i.d. from D , and by S = { S , S , . . . , S N } = A ( S (cid:48) ) thedatasets returned by the adversary.Different scenarios are obtained by giving the adversarydifferent amounts of power. For example, a weak adversarymight only be able to randomly flip labels, i.e. simulate thepresence of label noise. A much stronger adversary wouldbe one that can potentially manipulate all data and do sowith knowledge not only of all of the datasets but also ofthe underlying data distribution and the learning algorithmto be used later.In this work, we adopt the latter view, as it leads to muchstronger robustness guarantees. We define two adversarytypes that can make arbitrary manipulations to data sources,but only influence a certain subset of them. dversarial Multi-Source PAC Learning Definition 4 ( Fixed-Set Adversary).
Let G ⊂ [ N ] . Anadversary is called fixed-set (with preserved set G ) , if it onlyinfluences the datasets outside of G . That is, S i = S (cid:48) i for all i ∈ G . Definition 5 ( Flexible-Set Adversary).
Let k ∈{ , , . . . , N } . An adversary is called flexible-set (with pre-served size k ), if it can influence any N − k of the N givendatasets. That is, there exists a set G ⊂ [ N ] , such that | G | = k and S i = S (cid:48) i for all i ∈ G .In both cases, we call the fraction α of corrupted datasetsthe power of the adversary, i.e. α = N −| G | N for the fixed-setand α = N − kN for the flexible-set adversaries.While similarly defined, the fixed-set adversary is strictlyweaker than the flexible-set one, as the latter one can first in-spect all data and then choose which subset to modify, whilethe former one is restricted to a fixed, data-independent sub-set of sources. In particular, the flexible-set adversary canalready bias the distribution of the data by throwing out acarefully chosen set of sources, before replacing them withnew data.Both adversary models are inspired by real-world consid-erations and analogs have appeared in a number of otherresearch areas. The fixed-set adversaries can model a situa-tion in which N parties collaborate on a single learning task,but an unknown and fixed set of them are compromised, e.g.by hackers, that can act maliciously and collude with eachother. This is a similar reasoning as in Byzantine-robustoptimization , where an unknown subset of computing nodesare assumed to behave arbitrarily, thereby disrupting theoptimization progress.The second adversary corresponds to a situation where amalicious party can observe all of the available datasets andchoose which ones to corrupt, up to a certain budget. This issimilar to classic models in the fields of robust PAC learning,e.g. (Bshouty et al., 2002), and robust mean estimation, e.g.(Diakonikolas et al., 2019), where the adversary itself caninfluence which subset of the data to modify once the wholedataset is observed.Whether robust learning in the presence of an adversary ispossible for a certain hypothesis set or not is captured bythe following definition:
Definition 6.
A hypothesis set, H , is called multi-sourcePAC-learnable against the class of fixed-set adversaries (orflexible-set adversaries) and with respect to (cid:96) , if there existsa multi-source learner L and a function m : (0 , → N ,such that for any (cid:15), δ ∈ (0 , and any set G ⊂ [ N ] of size | G | > N (or any α < ), whenever S (cid:48) ∈ ( X ×Y ) N × m is a collection of N datasets of m ≥ m ( (cid:15), δ ) i.i.d. labelledsamples from D each, then with probability at least − δ over the sampling of S (cid:48) : R ( L ( A ( S (cid:48) )) ≤ min h ∈H R ( h ) + (cid:15), (4)uniformly against all fixed-set adversaries with preservedset G (or all flexible-set adversaries of power α ). A learner, L , with this property is called robust multi-source learner for H .In particular, the same learner L should work against anyadversary and for any α or set G . In the same time, theadversary is arbitrary once L is fixed, so in particular it candepend on the learning algorithm.Note that the robust learner should achieve optimal erroras m → ∞ , while N can stay constant. This reflects thatwe want to study adversarial multi-source learning in thecontext of a constant and potentially not very large numberof sources. In fact, our lower bound results in Section 5show that the adversary can always prevent the learner fromapproaching optimal risk in the opposite regime of constant m and N → ∞ .
4. Sample Complexity of RobustMulti-Source Learning
In this section, we present our main result, a theorem thatstates that whenever H has the uniform convergence prop-erty, there exists an algorithm that guarantees a boundedexcess risk against both the fixed-set and the flexible-set ad-versary. We then derive and discuss some instantiations ofthe general result that shed light on the sample complexityof PAC learning in the adversarial multi-source learning set-ting. Finally, we provide a high-level sketch of the theorem’sproof. Let
N, m, k ∈ N be integers, such that k ∈ ( N/ , N ] . Let α = N − kN < be the proportionof corrupted sources. Assume that H has the uniform con-vergence property with rate function s . Then there existsa learner L : ( X ×Y ) N × m → H with the following twoproperties.(a) Let G be a fixed subset of [ N ] of size | G | = k . For S (cid:48) = { S (cid:48) , . . . , S (cid:48) N } i.i.d. ∼ D , with probability at least − δ over the sampling of S (cid:48) : R ( L ( A ( S (cid:48) ))) − min h ∈H R ( h ) (5) ≤ s (cid:0) km, δ , S G (cid:1) + 6 α max i ∈ [ N ] s (cid:0) m, δ N , S i (cid:1) , uniformly against all fixed-set adversaries with pre-served set G , where S = { S , . . . , S N } = A ( S (cid:48) ) is dversarial Multi-Source PAC Learning the dataset modified the adversary and S G = ∪ i ∈ G S i is the set of all uncorrupted data.(b) For S (cid:48) = { S (cid:48) , . . . , S (cid:48) N } i.i.d. ∼ D , with probability atleast − δ over the sampling of S (cid:48) : R ( L ( A ( S (cid:48) ))) − min h ∈H R ( h ) (6) ≤ s (cid:0) km, δ (cid:0) Nk (cid:1) , S G (cid:1) + 6 α max i ∈ [ N ] s (cid:0) m, δ N , S i (cid:1) , uniformly against all flexible-set adversaries with pre-served size k , where S = { S , . . . , S N } = A ( S (cid:48) ) is the dataset returned by the adversary, G is theset of sources not modified by the adversary and S G = ∪ i ∈ G S i is the set of all uncorrupted data. The learner L is in fact explicit, we define and discuss itin the proof sketch that we provide in Section 4.3. Thecomplete proof is provided in the supplementary material.As an immediate consequence we obtain: Corollary 1.
Assume that H has the uniform convergenceproperty. Then H is multi-source PAC-learnable against theclass of fixed-set and the class of flexible-set adversaries.Proof. It suffices to show that for any δ ∈ (0 , , the righthand sides of (5) and (6) converge to for m → ∞ . Thisit true, since s ( ¯ m, ¯ δ, ¯ S ) → as ¯ m → ∞ for any ¯ δ and ¯ S ,by the definition of uniform convergence. Since the samelearner works regardless of the choice of G and/or α , theresult follows. Discussion.
Corollary 1 is in sharp contrast with the situa-tion of single dataset PAC robustness. In particular, Bshoutyet al. (2002) study a setup where an adversary can manip-ulate a fraction α datapoints out of a dataset with m i.i.d.-sampled elements . The authors show that in the binaryrealizable case, for any hypothesis space with at least twofunctions, no learning algorithm can learn a hypothesis withrisk less than α with probability greater than / . Simi-larly, Kearns & Li (1993) showed that for an adversary thatmodifies each data point with constant probability α , noalgorithm can learn a hypothesis with accuracy better than α/ (1 − α ) . Both results hold regardless of the value of m ,thus showing that PAC-learnability is not fulfilled. While Theorem 1 is most general, it does not yet providemuch insight into the actual sample complexity of the adver-sarial multi-source PAC learning problem, because the rate To be precise, the number of influenced points has to be bino-mially distributed with mean αm , but the difference between thisand the deterministic setting becomes irrelevant for m → ∞ . function s might behave in different ways. In this sectionwe give more explicit upper bounds in terms a standardcomplexity measure of hypothesis spaces – the Rademachercomplexity. Let R S ( (cid:96) ◦ H ) = E σ (cid:16) sup h ∈H n n (cid:88) i =1 σ i (cid:96) ( h ( x i ) , y i ) (cid:17) , (7)be the (empirical) Rademacher complexity of H with respect to the loss function (cid:96) on a sample S = { ( x , y ) , . . . , ( x n , y n ) } . Here { σ i } ni =1 are i.i.d.Rademacher random variables. Let S G = (cid:83) i ∈ G S i , R i = R S i ( (cid:96) ◦ H ) and R G = R S G ( (cid:96) ◦ H ) .4.2.1. R ATES FOR THE FIXED - SET ADVERSARY .An application of Theorem 1 with a standard uniform con-centration result gives:
Corollary 2.
In the setup of Theorem 1, against any fixed-set adversary, it holds that R ( L ( A ( S (cid:48) ))) − min h ∈H R ( h ) ≤ R G + 6 (cid:115) log( δ )2 km (8) + α (cid:16) (cid:115) log (cid:0) Nδ (cid:1) m + 12 max i ∈ [ N ] R i (cid:17) . The full proof is included in the supplementary material.In many common learning settings, the Rademacher com-plexity scales as O (1 / √ n ) with the sample size n (see e.g.(Bousquet et al., 2004)). Thereby, we obtain the followingrates against the fixed-set adversary: (cid:101) O (cid:16) √ km + α √ m (cid:17) , (9)where the (cid:101) O -notation hides constant and logarithmic factors.The results in Corollary 2 and Equation (9) allow us to rea-son about the type of guarantees that can be achieved givena certain amount of data. However, they also imply an ex-plicit upper bound on the sample complexity of adversarialmulti-source learning (i.e. an upper bound on the smallestpossible m ( (cid:15), δ ) in Definition 6) of the form: m ( (cid:15), δ ) ≤ O log( Nδ ) (cid:15) (cid:32) (cid:112) (1 − α ) N + α (cid:33) . (10) Discussion.
We can make a number of observations fromEquation (9). The (cid:112) /km -term is the rate one expectswhen learning from k (uncorrupted) sources of m sampleseach, that is from all the available uncorrupted data. The (cid:112) /m -term reflects the rate when learning from any single dversarial Multi-Source PAC Learning source of m samples, i.e. without the benefit of sharinginformation between sources. The latter enters weightedby α , i.e. it is directly proportional to the power of theadversary. In the limit of α → (i.e. all N sources areuncorrupted, k → N ), the bound becomes (cid:101) O ( (cid:112) /N m ) .Thus, we recover the classic convergence rate for learningfrom N m samples in the non-realizable case. This factis interesting, as the robust learner of Theorem 1 actuallydoes not need to know the value of α for its operation.Consequently, the same algorithm will work robustly if thedata contains manipulations but without an unnecessaryoverhead (i.e. with optimal rate), if all data sources are infact uncorrupted.Another insight follows from the fact that for reasonablysmall α , we have: (cid:101) O (cid:16) √ km + α √ m (cid:17) (cid:28) (cid:101) O (cid:16) √ m (cid:17) , (11)so learning from multiple, even potentially manipulated,datasets converges to a good hypothesis faster than learningfrom a single uncorrupted dataset. This fact can be inter-preted as encouraging cooperation: any of the honest partiesin the multi-source setting with fixed-set adversary willbenefit from making their data available for multi-sourcelearning, even if some of the other parties are malicious. Comparison to Byzantine-robust optimization.
Our ob-tained rates for the fixed-set adversary can also be comparedto the state-of-art convergence results for Byzantine-robustdistributed optimization, where the compromised nodes arealso fixed, but unknown. Yin et al. (2018) and Alistarh et al.(2018) develop robust algorithms for gradient descent andstochastic gradient descent respectively, achieving conver-gence rates of order (cid:101) O (cid:16) √ km + α √ m + 1 m (cid:17) (12)for α < / unknown. Clearly, these rates resemble ours,except for the additional /m -term, which matters when α is or very small. As shown in Yin et al. (2018), this termcan also be made to disappear if an upper bound β ≥ α isassumed to be known a priori.Overall, these similarities should not be over-interpreted,as the results for Byzantine-robust optimization describepractical gradient-based algorithms for distributed optimiza-tion under various technical assumptions, such as convexity,smoothness of the loss function and bounded variance ofthe gradients. In contrast, our work is purely statistical,not taking computational cost into account, but holds in amuch broader context, for any hypothesis space that has theuniform convergence property of suitable rate and withoutconstraints on the optimization method to be used. Addi-tionally, our rates improve automatically in situations whereuniform convergence is faster. 4.2.2. R ATES FOR THE FLEXIBLE - SET ADVERSARY
An analogous result to Corollary 2 holds also for flexible-setadversaries:
Corollary 3.
In the setup of Theorem 1, against any flexible-set adversary, it holds that R ( L ( A ( S (cid:48) ))) − min h ∈H R ( h ) (13) ≤ R G + 12 α max i ∈ [ N ] R i + (cid:101) O (cid:18) √ α √ m (cid:19) . The proof is provided in the supplemental material.Making the same assumptions as above, we obtain a samplecomplexity rate (cid:101) O (cid:18) √ km + √ α √ m (cid:19) . (14)which differs from (9) only in the rate of dependence on α ,which, if at all, matters only for very small (but non-zero) α . Despite the difference, most of our discussion above stillapplies. In particular, even for the flexible-set adversary thesame learning algorithm exhibits robustness for α > andachieves optimal rates for α = 0 .Moreover, an explicit upper bound on the sample complexityagainst a flexible-set adversary is given by: m ( (cid:15), δ ) ≤ (cid:101) O (cid:15) (cid:32) (cid:112) (1 − α ) N + √ α (cid:33) . (15) The proof of Theorem 1 consists of two parts. First, we intro-duce a filtering algorithm, that attempts to determine whichof the data sources can be trusted, meaning that it shouldbe safe to use them for training a hypothesis. Note that thiscan be because they were not manipulated, or because themanipulations are too small to have negative consequences.The output of the algorithm is a new filtered training set,consisting of all data from the trusted sources only. Second,we show that training a standard single-source learner onthe filtered training set yields the desired results.
Step 1.
Pseudo-code for the filtering algorithm is providedin Algorithm 1. The crucial component is a carefully chosennotion of distance between the datasets, called discrepancy ,that we define and discuss below. It guarantees that if twosources are close to each other then the difference of trainingon one of them compared to the other is small. In fact, we believe the √ α -term to be an artifact of our prooftechnique, but currently do not have a bound with improved depen-dence on α . dversarial Multi-Source PAC Learning Algorithm 1input
Datasets S , . . . , S N Initialize
T = {} // trusted sources for i = 1 , . . . , N doif d H (cid:0) S i , S j (cid:1) ≤ s (cid:0) m, δ N , S i (cid:1) + s (cid:0) m, δ N , S j (cid:1) , for at least (cid:98) N (cid:99) values of j (cid:54) = i , then T = T ∪ { i } end ifend foroutput (cid:83) i ∈ T S i // all data of trusted sourcesTo identify the trusted sources, the algorithm checks for eachsource how close it is to all other sources with respect to thediscrepancy distance. If it finds the source to be closer thana threshold to at least half of the other sources, it is markedas trusted, otherwise it is not. To show that this proceduredoes what it is intended to do it suffices to show that twoproperties hold with high probability: 1) all trusted sourcesare safe to be used for training, 2) at least all uncorruptedsources will be trusted.Property 1) follows from the fact that if a source has smalldistance to at least half of the other datasets, it must be closeto at least one of the uncorrupted sources. By the propertyof the discrepancy distance, including it in the training setwill therefore not affect the learning of the hypothesis verynegatively. Property 2) follows from a concentration of massargument, which guarantees that for any uncorrupted sourceits distance to all other uncorrupted sources will approachzero at a well-understood rate. Therefore, with a suitablyselected threshold, at least all uncorrupted sources will beclose to each other and end up in the trusted subset withhigh probability. Discrepancy Distance.
For any dataset S i ∈ ( X ×Y ) m , let (cid:98) R i ( h ) = 1 m (cid:88) ( x,y ) ∈ S i (cid:96) ( h ( x ) , y ) (16)be the empirical risk of a hypothesis h with respect to theloss (cid:96) . The (empirical) discrepancy distance between twodatasets, S i and S j , is defined as d H ( S i , S j ) = sup h ∈H (cid:0) | (cid:98) R i ( h ) − (cid:98) R j ( h ) | (cid:1) . (17)This is the empirical counterpart of the so-called discrep-ancy distance, which, together with its unsupervised form, iswidely adopted within the field of domain adaptation (Kiferet al., 2004; Ben-David et al., 2010; Mohri & Medina, 2012).Typically, the discrepancy is used to bound the maximumpossible effect of distribution drift on a learning system. Themetric was also used in (Konstantinov & Lampert, 2019) tomeasure the effect of training on sources that have been sam-pled randomly, but from adversarially chosen distributions. As shown in Kifer et al. (2004); Ben-David et al. (2010),for randomly sampled datasets, the empirical discrepancyconcentrates with known rates to its distributional value, i.e.to zero, if two sources have the same underlying data distri-butions. The empirical discrepancy is well-defined even fordata not sampled from a distribution, though, and togetherwith the uniform convergence property it allows us to boundthe effect of training on one dataset rather than another. Step 2.
Let S T = (cid:83) i ∈ T S i be the output of the filteringalgorithm, i.e. the union of all trusted datasets. Then, forany h ∈ H , the empirical risk over S T can be written as (cid:98) R T ( h ) = 1 | T | (cid:88) i ∈ T (cid:98) R i ( h ) (18)We need to show that training on S T , e.g. by minimizing (cid:98) R T ( h ) , with high probability leads to a hypothesis withsmall risk under the true data distribution D .By construction, we know that for any trusted source S i ,there exists an uncorrupted source S j , such that the differ-ence between (cid:98) R i ( h ) and (cid:98) R j ( h ) is bounded by a suitablychosen constant (that depends on the growth function s ).By the uniform convergence property of H , we know thatfor any uncorrupted source, the difference between (cid:98) R i ( h ) and the true risk R ( h ) can also be bounded in terms of thegrowth function s . In combination, we obtain that (cid:98) R T ( h ) is a suitably good estimator of the true risk, uniformly overall h ∈ H . Consequently, S T can be used for successfullearning.For the formal derivations and, in particular, the choice ofthresholds, please see the supplemental material.
5. Hardness of Robust Multi-Source Learning
We now take an orthogonal view compared to Section 4, andstudy where the hardness of the multi-source PAC learningstems from and what allows us to nevertheless overcomeit. For this, we prove two additional results that describefundamental limits of how well a learner can perform in themulti-source adversarial setting.For simplicity of exposition we focus on binary classifi-cation. Let Y = {− , } and (cid:96) be the zero-one loss, i.e. (cid:96) ( y, ¯ y ) = (cid:74) y (cid:54) = ¯ y (cid:75) . Following Bshouty et al. (2002), wedefine: Definition 7.
A hypothesis space H over an input set X is said to be non-trivial , if there exist two points x , x ∈X and two hypotheses h , h ∈ H , such that h ( x ) = h ( x ) , but h ( x ) (cid:54) = h ( x ) . We show that if the learner does not make use of the multi-source structure of the data, i.e. it behaves as a single- dversarial Multi-Source PAC Learning source learner on the union of all data samples, then a(multi-source) fixed-set adversary can always prevent
PAC-learnability.
Theorem 2.
Let H be a non-trivial hypothesis space. Let m and N be any positive integers and let G be a fixed subset of [ N ] of size k ∈ { , . . . , N − } . Let L : ( X ×Y ) N × m → H be a multi-source learner that acts by merging the data fromall sources and then calling a single-source learner. Let S (cid:48) ∈ ( X × Y ) N × m be drawn i.i.d. from D . Then thereexists a distribution D with min h ∈H R ( h ) = 0 and a fixed-set adversary A with index set G , such that: P S (cid:48) ∼D (cid:16) R (cid:0) L ( A ( S (cid:48) ) (cid:1) > α − α ) (cid:17) > , (19) where α = N − kN is the power of the adversary. The proof is provided in the supplemental material. Notethat, since the theorem holds for the fixed-set adversary, itautomatically also holds for the stronger flexible-set adver-sary.The theorem sheds light on why PAC-learnability is possiblein the multi-source setting, while in the single source settingit is not. The reason is not simply that the adversary isweaker, because it is restricted to manipulating samples ina subset of datasets instead of being able to choose freely.Inequality (19) implies that even against such a weakeradversary, a single-source learner cannot be adversariallyrobust. Consequently, it is the additional information thatthe data comes in multiple datasets, some of which remainuncorrupted even after the adversary was active, that givesthe multi-source learner the power to learn robustly.An immediate consequence of Theorem 2 is also that thecommon practice of merging the data from all sources andperforming a form of empirical risk minimization on theresulting dataset is not a robust learner and therefore subop-timal in the studied context.
As a tool for understanding the limiting factors of learningin the adversarial multi-source setting, we now establisha lower bound on the achievable excess risk in terms ofthe number of samples per source and the power of theadversary.
Theorem 3.
Let
H ⊂ { h : X → Y} be a hypothesis space,let m and N be any integers and let G be a fixed subset of [ N ] of size k ∈ { , . . . , N − } . Let S (cid:48) ∈ ( X × Y ) N × m bedrawn i.i.d. from D . Then the following statements hold forany multi-source learner L :(a) Suppose that H is non-trivial. Then there exists adistribution D on X with min h ∈H R ( h ) = 0 , and a fixed-set adversary A with index set G , such that: P S (cid:48) (cid:16) R (cid:0) L ( A ( S (cid:48) ) (cid:1) > α m (cid:17) > . (20) (b) Suppose that H has VC dimension d ≥ . Then thereexists a distribution D on X × Y and a fixed-set adver-sary A with index set G , such that: P S (cid:48) (cid:32) R (cid:0) L ( A ( S (cid:48) ) (cid:1) − min h ∈H R ( h ) (21) > (cid:114) d N m + α m (cid:33) > . In both cases, α = N − kN is the power of the adversary. The proof is provided in the supplemental material. As forTheorem 2, it is clear that the same result holds also forflexible-set adversaries with preserved size k . Analysis.
Inequality (20) shows that even in the realizablescenario, the risk might not shrink faster than with rate Ω( α/m ) , regardless of how many data sources, and there-fore data samples, are available. This is contrast to the i.i.d.situation, where the corresponding rate is Ω(1 /N m ) . Thedifference shows that robust learning with a constant frac-tion of corrupted sources is only possible if the number ofsamples per dataset grows. Conversely, if the number ofcorrupted datasets is constant, regardless of the total num-ber of sources, i.e., α = O (1 /N ) , we recover the rates forlearning without an adversary up to constants.In inequality (21), the term Ω( (cid:112) d/N m ) is due to the classicno-free-lunch theorem for binary classification and corre-sponds to the fundamental limits of learning, now in thenon-realizable case. The Ω( α/m ) -term appears as the priceof robustness, and as before, it implies that for constant α , m → ∞ is necessary in order to achieve arbitrarily smallexcess risk, while just N → ∞ does not suffice. Relation to prior work.
Lower bounds of similar structureas in Theorem 3 have also been derived for Byzantine opti-mization and collaborative learning. In particular, Yin et al.(2018) prove that in the case of distributed mean estimationof a d -dimensional Gaussian on N machines, an α fractionof which can be Byzantine, any algorithm would incur lossof Ω( α √ m + (cid:113) dNm ) . Alistarh et al. (2018) construct specificexamples of a Lipschitz continuous and a strongly convexfunction, such that no distributed stochastic optimization al-gorithm, working with an α -fraction of Byzantine machines,can optimize the function to error less than Ω( α √ m + (cid:113) dNm ) ,where d is the number of parameters. For realizable binaryclassification in the context of collaborative learning, Qiao(2018) prove that there exists a hypothesis space of VC di-mension d , such that no learner can achieve excess risk lessthan Ω( αd/m ) . dversarial Multi-Source PAC Learning Besides the different application scenario, the main dif-ference between these results and Theorem 3 is that ourbounds hold for any hypothesis space H that is non-trivial(Ineq. (20)), or has VC-dimension d ≥ (Ineq. (21)), whilethe mentioned references construct explicit examples of hy-pothesis spaces or stochastic optimization problems wherethe bounds hold. In particular, our results show that the limi-tations on the learner due the finite total number of samples,the finite number of samples per source and the fraction ofunreliable sources α are inherent and not specific to a subsetof hard-to-learn hypotheses.
6. Conclusion
We studied the problem of robust learning from multipleunreliable datasets. Rephrasing this task as learning fromdatasets that might be adversarially corrupted, we intro-duced the formal problem of adversarial learning from mul-tiple sources, which we studied in the classic PAC setting.Our main results provide a characterization of the hardnessof this learning task from above and below. First, we showedthat adversarial multi-source PAC learning is possible forany hypothesis class with the uniform convergence property,and we provided explicit rates for the excess risk (Theorem 1and Corollaries). The proof is constructive and shows alsothat integrating robustness comes at a minor statistical cost,as our robust learner achieves optimal rates when run ondata without manipulations. Second, we proved that adver-sarial PAC learning from multiple sources is far from trivial.In particular, it is impossible to achieve for learners thatignore the multi-source structure of the data (Theorem 2).Third, we proved lower bounds on the excess risk under verygeneral conditions (Theorem 3), which highlight an unavoid-able slowdown of the convergence rate proportional to theadversary’s strength compared to the i.i.d. (adversarial-free)case. Furthermore, in order to facilitate successful learningwith a constant fraction of corrupted sources, the number ofsamples per source has to grow.A second emphasis of our work was to highlight connec-tions of the adversarial multi-source learning task to relatedmethods in robust optimization, cryptography and statistics.We believe that a better understanding of these connectionswill allow us to come up with tighter bounds and to designalgorithms that are not only statistically efficient (as was thefocus of this work), but also obtain insight into the trade-offswith computational complexity.
Acknowledgements
Dan Alistarh is supported in part by the European ResearchCouncil (ERC) under the European Union’s Horizon 2020research and innovation programme (grant agreement No805223 ScaleML). This research was supported by the Sci- entific Service Units (SSU) of IST Austria through resourcesprovided by Scientific Computing (SciComp).
References
Alistarh, D., Allen-Zhu, Z., and Li, J. Byzantine stochasticgradient descent. In
Conference on Neural InformationProcessing Systems (NeurIPS) , 2018.Awasthi, P., Blum, A., Mansour, Y., et al. Efficient pac learn-ing from the crowd. In
Conference on ComputationalLearning Theory (COLT) , 2017.Ben-David, S., Blitzer, J., Crammer, K., Kulesza, A.,Pereira, F., and Vaughan, J. W. A theory of learning fromdifferent domains.
Machine Learning , 79(1-2):151–175,2010.Bhagoji, A. N., Chakraborty, S., Mittal, P., and Calo, S.Analyzing federated learning through an adversarial lens.In
International Conference on Machine Learing (ICML) ,2019.Blanchard, P., Guerraoui, R., Stainer, J., et al. Machinelearning with adversaries: Byzantine tolerant gradientdescent. In
Advances in Neural Information ProcessingSystems , 2017.Bousquet, O., Boucheron, S., and Lugosi, G. Introductionto statistical learning theory. In
Advanced Lectures onMachine Learning , pp. 169–207. Springer, 2004.Bshouty, N. H., Eiron, N., and Kushilevitz, E. Pac learningwith nasty noise.
Theoretical Computer Science , 288(2):255–275, 2002.Chen, S., Li, J., and Moitra, A. Efficiently learning struc-tured distributions from untrusted batches. In
ACM Sym-posium on Theory of Computing (STOC) , 2019.Chen, Y., Su, L., and Xu, J. Distributed statistical machinelearning in adversarial settings: Byzantine gradient de-scent.
Proceedings of the ACM on Measurement andAnalysis of Computing Systems (POMACS) , 1(2):1–25,2017.Cohen, J., Rosenfeld, E., and Kolter, Z. Certified adversarialrobustness via randomized smoothing. In
InternationalConference on Machine Learing (ICML) , 2019.Diakonikolas, I., Kamath, G., Kane, D., Li, J., Moitra, A.,and Stewart, A. Robust estimators in high-dimensionswithout the computational intractability.
SIAM Journalon Computing , 48(2):742–864, 2019.Feng, J. On fundamental limits of robust learning. arXivpreprint arXiv:1703.10444 , 2017. dversarial Multi-Source PAC Learning
Feng, J., Xu, H., and Mannor, S. Distributed robust learning. arXiv preprint arXiv:1409.5937 , 2014.Fung, C., Yoon, C. J., and Beschastnikh, I. Mitigatingsybils in federated learning poisoning. arXiv preprintarXiv:1808.04866 , 2018.Hendrycks, D. and Gimpel, K. A baseline for detectingmisclassifier and out-of-distribution examples in neuralnetworks. In
International Conference on Learning Rep-resentations (ICLR) , 2017.Jain, A. and Orlitsky, A. Robust learning of discrete distri-butions from batches. arXiv preprint arXiv:1911.08532 ,2019.Kearns, M. and Li, M. Learning in the presence of maliciouserrors.
SIAM Journal on Computing , 1993.Kifer, D., Ben-David, S., and Gehrke, J. Detecting changein data streams. In
VLDB , 2004.Konstantinov, N. and Lampert, C. H. Robust learning fromuntrusted sources. In
International Conference on Ma-chine Learing (ICML) , 2019.Lee, K., Lee, K., Lee, H., and Shin, J. A simple unifiedframework for detecting out-of-distribution samples andadversarial attacks. In
Conference on Neural InformationProcessing Systems (NeurIPS) , 2018.Liang, S., Li, Y., and Srikant, R. Enhancing the reliabilityof out-of-distribution image detection in neural networks.In
International Conference on Learning Representations(ICLR) , 2018.Mahloujifar, S., Mahmoody, M., and Mohammed, A. Uni-versal multi-party poisoning attacks. In
InternationalConference on Machine Learing (ICML) , 2019.McMahan, H. B. and Ramage, D. Federated learning: Col-laborative machine learning without centralized trainingdata. https://research.googleblog.com/2017/04/federated-learning-collaborative.html, 2017.Mohri, M. and Medina, A. M. New analysis and algorithmfor learning with drifting distributions. In
InternationalConference on Algorithmic Learning Theory (ALT) , 2012.Mohri, M., Rostamizadeh, A., and Talwalkar, A.
Founda-tions of machine learning . MIT press, 2018.Qiao, M. Do outliers ruin collaboration? In
InternationalConference on Machine Learing (ICML) , 2018.Qiao, M. and Valiant, G. Learning discrete distributionsfrom untrusted batches. In
LIPIcs-Leibniz InternationalProceedings in Informatics , volume 94. Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik, 2018. Raghunathan, A., Steinhardt, J., and Liang, P. Certifieddefenses against adversarial examples. In
InternationalConference on Learning Representations (ICLR) , 2018.Shalev-Shwartz, S. and Ben-David, S.
Understanding Ma-chine Learning: From Theory to Algorithms . Cambridgeuniversity press, 2014.Sheng, V. S. and Zhang, J. Machine learning with crowd-sourcing: A brief summary of the past research and futuredirections. In
AAAI Conference on Artificial Intelligence ,2019.Singh, G., Gehr, T., Mirman, M., P¨uschel, M., and Vechev,M. Fast and effective robustness certification. In
Conference on Neural Information Processing Systems(NeurIPS) , 2018.Yin, D., Chen, Y., Ramchandran, K., and Bartlett, P.Byzantine-robust distributed learning: Towards optimalstatistical rates.
International Conference on MachineLearing (ICML) , 2018.Yin, D., Chen, Y., Kannan, R., and Bartlett, P. Defendingagainst saddle point attack in Byzantine-robust distributedlearning. In
International Conference on Machine Lear-ing (ICML) , 2019. dversarial Multi-Source PAC Learning
A. Proof of Theorem 1 and its corollaries
Theorem 1.
Let
N, m, k ∈ N be integers, such that k ∈ ( N/ , N ] . Let α = N − kN < be the proportion of corruptedsources. Assume that H has the uniform convergence property with rate function s . Then there exists a learner L :( X ×Y ) N × m → H with the following two properties.(a) Let G be a fixed subset of [ N ] of size | G | = k . For S (cid:48) = { S (cid:48) , . . . , S (cid:48) N } i.i.d. ∼ D , with probability at least − δ over thesampling of S (cid:48) : R ( L ( A ( S (cid:48) ))) − min h ∈H R ( h ) ≤ s (cid:0) km, δ , S G (cid:1) + 6 α max i ∈ [ N ] s (cid:0) m, δ N , S i (cid:1) , (22) uniformly against all fixed-set adversaries with preserved set G , where S = { S , . . . , S N } = A ( S (cid:48) ) is the datasetmodified the adversary and S G = ∪ i ∈ G S i is the set of all uncorrupted data.(b) For S (cid:48) = { S (cid:48) , . . . , S (cid:48) N } i.i.d. ∼ D , with probability at least − δ over the sampling of S (cid:48) : R ( L ( A ( S (cid:48) ))) − min h ∈H R ( h ) ≤ s (cid:0) km, δ (cid:0) Nk (cid:1) , S G (cid:1) + 6 α max i ∈ [ N ] s (cid:0) m, δ N , S i (cid:1) , (23) uniformly against all flexible-set adversaries with preserved size k , where S = { S , . . . , S N } = A ( S (cid:48) ) is the datasetreturned by the adversary, G is the set of sources not modified by the adversary and S G = ∪ i ∈ G S i is the set of alluncorrupted data.Proof. Denote by S (cid:48) i = { ( x (cid:48) i, , y (cid:48) i, ) , . . . , ( x (cid:48) i,m , y (cid:48) i,m ) } for i = 1 , . . . , N the initial datasets and by S i = { ( x i, , y i, ) , . . . , ( x i,m , y i,m ) } for i = 1 , . . . , N the datasets after the modifications of the adversary. As explained inthe main body of the paper, we denote by: (cid:98) R i ( h ) = 1 m m (cid:88) j =1 (cid:96) ( h ( x i,j ) , y i,j ) (24)the empirical risk of any hypothesis h ∈ H on the dataset S i and by: d H ( S i , S j ) = sup h ∈H | (cid:98) R i ( h ) − (cid:98) R j ( h ) | (25)the empirical discrepancy between the datasets S i and S j .We show that a learner that first runs a certain filtering algorithm (Algorithm 1) based on the discrepancy metric and thenperforms empirical risk minimization on the remaining data to compute a hypothesis satisfies the properties stated in thetheorem. The full algorithm for the learner is therefore given in Algorithm 2.(a) The key idea of the proof is that the clean sources are close to each other with high probability, so they get selected whenrunning Algorithm 1. On the other hand, if a bad source has been selected, it must be close to at least one of the goodsources, so it can not have too bad an effect on the empirical risk. Algorithm 1input
Datasets S , . . . , S N Initialize
T = {} // trusted sources for i = 1 , . . . , N doif d H (cid:0) S i , S j (cid:1) ≤ s (cid:0) m, δ N , S i (cid:1) + s (cid:0) m, δ N , S j (cid:1) , for at least (cid:98) N (cid:99) values of j (cid:54) = i , then T = T ∪ { i } end ifend foroutput (cid:83) i ∈ T S i // indices of the trusted sources dversarial Multi-Source PAC Learning Algorithm 2input
Datasets S , . . . , S N Run Algorithm 1 to compute S T = (cid:83) i ∈ T S i Compute h A = argmin h ∈H | S T | (cid:80) ( x,y ) ∈ S T (cid:96) ( h ( x ) , y ) //empirical risk minimizer over all trusted sources output h A For all i ∈ G , let E i be the event that: sup h ∈H (cid:12)(cid:12)(cid:12) R ( h ) − (cid:98) R i ( h ) (cid:12)(cid:12)(cid:12) ≤ s (cid:18) m, δ N , S i (cid:19) . (26)Further, let E G be the event that: sup h ∈H (cid:12)(cid:12)(cid:12) R ( h ) − (cid:98) R G ( h ) (cid:12)(cid:12)(cid:12) ≤ s (cid:18) km, δ , S G (cid:19) , (27)where (cid:98) R G ( h ) = 1 km (cid:88) i ∈ G m (cid:88) j =1 (cid:96) ( h ( x i,j ) , y i,j ) . Denote by E ci and E cG the complements of these events. Then we know that P ( E cG ) ≤ δ , and P ( E ci ) ≤ δ N for all i ∈ G .Therefore, if E = E G ∧ ( ∧ i ∈ G E i ) , we have: P ( E c ) = P ( E cG ∨ ( ∨ i ∈ G E ci )) ≤ P ( E cG ) + (cid:88) i ∈ G P ( E ci ) ≤ δ k δ N ≤ δ. (28)Hence, the probability of the event E that all of (26) and (27) hold, is at least − δ . We now show that under the event E ,Algorithm 2 returns a hypothesis that satisfies the condition in (a).Indeed, fix an arbitrary fixed-set adversary A with preserved set G . Whenever E holds, for all i, j ∈ G we have: d H ( S i , S j ) = sup h ∈H ( | (cid:98) R i ( h ) − (cid:98) R j ( h ) | ) ≤ sup h ∈H (cid:16) | (cid:98) R i ( h ) − R ( h ) | (cid:17) + sup h ∈H (cid:16) |R ( h ) − (cid:98) R j ( h ) | (cid:17) ≤ s (cid:18) m, δ N , S i (cid:19) + s (cid:18) m, δ N , S j (cid:19) . (29)Now since k ≥ (cid:98) N (cid:99) + 1 , we get that G ⊆ T . Moreover, for any i ∈ T \ G , there exists at least one j ∈ G , such that d H ( S i , S j ) ≤ s (cid:0) m, δ N , S i (cid:1) + s (cid:0) m, δ N , S j (cid:1) . For any i ∈ T \ G , denote by f ( i ) the smallest such j . Therefore, for any i ∈ (T \ G ) : | (cid:98) R i ( h ) − R ( h ) | ≤ | (cid:98) R i − (cid:98) R f ( i ) ( h ) | + | (cid:98) R f ( i ) ( h ) − R ( h ) | ≤ d H (cid:0) S i , S f ( i ) (cid:1) + s (cid:18) m, δ N , S f ( i ) (cid:19) (30) ≤ s (cid:18) m, δ N , S i (cid:19) + 2 s (cid:18) m, δ N , S f ( i ) (cid:19) (31)Denote by (cid:98) R T ( h ) = 1 | T | (cid:88) i ∈ T (cid:98) R i ( h ) = 1 | S T | (cid:88) ( x,y ) ∈ S T (cid:96) ( h ( x ) , y ) (32)the loss over all the trusted data. Then for any h ∈ H we have: (cid:12)(cid:12)(cid:12) (cid:98) R T ( h ) − R ( h ) (cid:12)(cid:12)(cid:12) ≤ | T | m (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:88) i ∈ G m (cid:88) l =1 ( (cid:96) ( h ( x i,l ) , y i,l ) − R ( h )) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) + (cid:88) i ∈ (T \ G ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) m (cid:88) l =1 ( (cid:96) ( h ( x i,l ) , y i,l ) − R ( h )) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) (33) = k | T | (cid:12)(cid:12)(cid:12) (cid:98) R G ( h ) − R ( h ) (cid:12)(cid:12)(cid:12) + 1 | T | (cid:88) i ∈ (T \ G ) (cid:12)(cid:12)(cid:12) (cid:98) R i ( h ) − R ( h ) (cid:12)(cid:12)(cid:12) (34) dversarial Multi-Source PAC Learning ≤ k | T | s (cid:18) km, δ , S G (cid:19) + 1 | T | (cid:88) i ∈ (T \ G ) (cid:12)(cid:12)(cid:12) (cid:98) R i ( h ) − R ( h ) (cid:12)(cid:12)(cid:12) (35) ≤ k | T | s (cid:18) km, δ , S G (cid:19) + 1 | T | (cid:88) i ∈ (T \ G ) (cid:18) s (cid:18) m, δ N , S i (cid:19) + 2 s (cid:18) m, δ N , S f ( i ) (cid:19)(cid:19) (36) ≤ k | T | s (cid:18) km, δ , S G (cid:19) + 3 | T | − k | T | max i ∈ [ N ] s (cid:18) m, δ N , S i (cid:19) (37) ≤ s (cid:18) km, δ , S G (cid:19) + 3 N − kN max i ∈ [ N ] s (cid:18) m, δ N , S i (cid:19) (38)Finally, let h ∗ = argmin h ∈H R ( h ) and h A = L ( A ( S (cid:48) )) = argmin h ∈H (cid:98) R T ( h ) . Then: R ( h A ) − R ( h ∗ ) = (cid:16) R ( h A ) − (cid:98) R T ( h A ) (cid:17) + (cid:16) (cid:98) R T ( h A ) − R ( h ∗ ) (cid:17) ≤ (cid:16) R ( h A ) − (cid:98) R T ( h A ) (cid:17) + (cid:16) (cid:98) R T ( h ∗ ) − R ( h ∗ ) (cid:17) (39) ≤ h ∈H (cid:12)(cid:12)(cid:12) (cid:98) R T ( h ) − R ( h ) (cid:12)(cid:12)(cid:12) . (40)Since we showed this result for an arbitrary fixed-set adversary with preserved set G , the result follows.(b) The crucial difference in the case of the flexible-set adversary is that the set G is chosen after the clean data is observed.We thus need concentration results for all of the subsets of [ N ] of size k , as well as all individual sources.For all i ∈ [ N ] , let E i be the event that: sup h ∈H (cid:12)(cid:12)(cid:12) R ( h ) − (cid:98) R (cid:48) i ( h ) (cid:12)(cid:12)(cid:12) ≤ s (cid:18) m, δ N , S (cid:48) i (cid:19) , (41)where (cid:98) R (cid:48) i = 1 m m (cid:88) j =1 (cid:96) ( h ( x (cid:48) i,j ) , y (cid:48) i,j ) (42)Further, for any A ⊆ [ N ] of size | A | = k , let E A be the event that: sup h ∈H (cid:12)(cid:12)(cid:12) R ( h ) − (cid:98) R (cid:48) A ( h ) (cid:12)(cid:12)(cid:12) ≤ s (cid:32) km, δ (cid:0) Nk (cid:1) , S (cid:48) A (cid:33) , (43)where S (cid:48) A = ∪ i ∈ A S (cid:48) i and (cid:98) R (cid:48) A ( h ) = 1 km (cid:88) i ∈ A m (cid:88) l =1 (cid:96) ( h ( x (cid:48) i,l ) , y (cid:48) i,l ) . (44)Then we know that P ( E ci ) ≤ δ N for all i ∈ [ N ] and P ( E cG ) ≤ δ ( Nk ) for all A ⊆ [ N ] with | A | = k . Therefore, if E = ( ∧ A E A ) ∧ (cid:0) ∧ i ∈ [ N ] E i (cid:1) , we have: P ( E c ) = P (cid:0) ( ∨ A E cA ) ∨ (cid:0) ∨ i ∈ [ N ] E ci (cid:1)(cid:1) ≤ (cid:88) A P ( E cA ) + (cid:88) i ∈ [ N ] P ( E ci ) ≤ (cid:18) Nk (cid:19) δ (cid:0) Nk (cid:1) + N δ N = δ. (45)Hence, the probability of the event E that all of (41) and (43) hold, is at least − δ . In particular, under E : sup h ∈H (cid:12)(cid:12)(cid:12) R ( h ) − (cid:98) R G ( h ) (cid:12)(cid:12)(cid:12) = sup h ∈H (cid:12)(cid:12)(cid:12) R ( h ) − (cid:98) R (cid:48) G ( h ) (cid:12)(cid:12)(cid:12) ≤ s (cid:32) km, δ (cid:0) Nk (cid:1) , S (cid:48) G (cid:33) = s (cid:32) km, δ (cid:0) Nk (cid:1) , S G (cid:33) (46)and sup h ∈H (cid:12)(cid:12)(cid:12) R ( h ) − (cid:98) R i ( h ) (cid:12)(cid:12)(cid:12) = sup h ∈H (cid:12)(cid:12)(cid:12) R ( h ) − (cid:98) R (cid:48) i ( h ) (cid:12)(cid:12)(cid:12) ≤ s (cid:18) m, δ N , S (cid:48) i (cid:19) = s (cid:18) m, δ N , S i (cid:19) , (47) dversarial Multi-Source PAC Learning for all i ∈ G .Now, for any flexible-set adversary with preserved size k , the same argument as in (a) shows that: R ( h A ) − R ( h ∗ ) ≤ s (cid:32) km, δ (cid:0) Nk (cid:1) , S G (cid:33) + 6 N − kN max i ∈ [ N ] s (cid:18) m, δ N , S i (cid:19) (48)holds under the event E .We now show how to obtain data-dependent guarantees, via the notion of Rademacher complexity. Let R S ( (cid:96) ◦ H ) = E σ (cid:32) sup h ∈H n n (cid:88) i =1 σ i (cid:96) ( h ( x i ) , y i ) (cid:33) (49)be the Rademacher complexity of H with respect to the loss function (cid:96) on a sample S = { ( x , y ) , . . . , ( x n , y n ) } . Let S G = ∪ i ∈ G S i , R i = R S i ( (cid:96) ◦ H ) and R G = R S G ( (cid:96) ◦ H ) . Then we have: Corollary 2.
In the setup of Theorem 1, against a fixed-set adversary, it holds that R ( L ( A ( S (cid:48) ))) − min h ∈H R ( h ) ≤ R G + 6 (cid:115) log( δ )2 km + α (cid:16) (cid:115) log (cid:0) Nδ (cid:1) m + 12 max i ∈ [ N ] R i (cid:17) . (50) Proof.
We use the standard generalization bound based on Rademacher complexity. Assume that S = { ( x , y ) , . . . , ( x n , y n ) } ∼ D , then with probability at least − δ over the data (Mohri et al., 2018): sup h ∈H | E ( (cid:96) ( h ( x ) , y )) − n n (cid:88) i =1 (cid:96) ( h ( x i ) , y i ) | ≤ R S ( (cid:96) ◦ H ) + 3 (cid:115) log (cid:0) δ (cid:1) n . (51)Substituting into the result of Theorem 1 gives the result. Corollary 3.
In the setup of Theorem 1, against a flexible-set adversary, it holds that R ( L ( A ( S (cid:48) ))) − min h ∈H R ( h ) ≤ R G + 12 α max i ∈ [ N ] R i + (cid:101) O (cid:18) √ α √ m (cid:19) . (52) Proof.
Using the concentration result from Corollary 2 and (cid:0) Nk (cid:1) = (cid:0) N (1 − α ) N (cid:1) = (cid:0) NαN (cid:1) ≤ H ( α ) N , where H ( p ) = − p log ( p ) − (1 − p ) log (1 − p ) is the binary entropy function, we obtain: R ( L ( A ( S (cid:48) ))) − min h ∈H R ( h ) ≤ R G + 6 (cid:115) log( ( Nk ) δ )2 km + α (cid:115) log (cid:0) Nδ (cid:1) m + 12 max i ∈ [ N ] R i (53) = 4 R G + 6 (cid:115) log( (cid:0) Nk (cid:1) )2 km + log( δ )2 km + α (cid:115) log (cid:0) Nδ (cid:1) m + 12 max i ∈ [ N ] R i (54) ≤ R G + 6 (cid:115) H ( α ) N log(2)2(1 − α ) N m + log( δ )2(1 − α ) N m + α (cid:115) log (cid:0) Nδ (cid:1) m + 12 max i ∈ [ N ] R i (55) ≤ R G + 12 α max i ∈ [ N ] R i + (cid:101) O (cid:18) √ α √ m (cid:19) (56)where for the last inequality we used H ( α ) ≤ (cid:112) α (1 − α ) , − α ∈ ( , and √ α > α .For the case of binary classifiers, we also provide a simpler bound in terms of the VC dimension of H . dversarial Multi-Source PAC Learning Corollary 4.
Assume that Y = {− , } and that H has finite VC-dimension d . Then:(a) In the case of the fixed-set adversary there exists a universal constant C , such that: R ( L ( A ( S (cid:48) ))) − min h ∈H R ( h ) ≤ C (cid:114) dkm + 2 (cid:115) δ ) km + α C (cid:114) dm + 6 (cid:115) (cid:0) Nδ (cid:1) m . (57) (b) In the case of the flexible-set adversary: R ( L ( A ( S (cid:48) ))) − min h ∈H R ( h ) ≤ O (cid:32)(cid:114) dkm + √ α √ m + α (cid:114) dm + α (cid:114) log( N ) m (cid:33) . (58) Proof. (a) Whenever H is of finite VC-dimension d , there exists a constant C , such that the following generalization boundholds (Bousquet et al., 2004): sup h ∈H | E ( (cid:96) ( h ( x ) , y )) − n n (cid:88) i =1 (cid:96) ( h ( x i ) , y i ) | ≤ C (cid:114) dn + (cid:115) (cid:0) δ (cid:1) n (59)and hence H has the uniform convergence property with rate function s = C (cid:113) dn + (cid:113) ( δ ) n . Substituting into the resultof Theorem 1 gives the result.(b) Using the concentration result from (a) and (cid:0) Nk (cid:1) = (cid:0) N (1 − α ) N (cid:1) = (cid:0) NαN (cid:1) ≤ H ( α ) N , where H ( p ) = − p log ( p ) − (1 − p ) log (1 − p ) is the binary entropy function, we obtain: R ( L ( A ( S (cid:48) ))) − min h ∈H R ( h ) ≤ C (cid:114) dkm + 2 (cid:115) ( Nk ) δ ) km + α C (cid:114) dm + 6 (cid:115) (cid:0) Nδ (cid:1) m (60) = 2 C (cid:114) dkm + 2 (cid:115) (cid:0) Nk (cid:1) ) km + 2 log( δ ) km + α C (cid:114) dm + 6 (cid:115) (cid:0) Nδ (cid:1) m (61) ≤ C (cid:114) dkm + 2 (cid:115) H ( α ) N log(2)(1 − α ) N m + 2 log( δ )(1 − α ) N m + α C (cid:114) dm + 6 (cid:115) (cid:0) Nδ (cid:1) m (62) ≤ O (cid:32)(cid:114) dkm + √ α √ m + α (cid:114) dm + α (cid:114) log( N ) m (cid:33) , (63)where for the last inequality we used H ( α ) ≤ (cid:112) α (1 − α ) and − α ∈ ( , . dversarial Multi-Source PAC Learning B. Proof of Theorem 2
Theorem 2.
Let H be a non-trivial hypothesis space. Let m and N be any positive integers and let G be a fixed subset of [ N ] of size k ∈ { , . . . , N − } . Let L : ( X × Y ) N × m → H be a multi-source learner that acts by merging the data fromall sources and then calling a single-source learner. Let S (cid:48) ∈ ( X × Y ) N × m be drawn i.i.d. from D . Then there exists adistribution D with min h ∈H R ( h ) = 0 and a fixed-set adversary A with index set G , such that: P S (cid:48) ∼D (cid:16) R (cid:0) L ( A ( S (cid:48) ) (cid:1) > α − α ) (cid:17) > , (64) where α = N − kN is the power of the adversary. We use a similar proof technique as in the no-free-lunch results in (Bshouty et al., 2002) and in the classic no-free-lunchtheorem, e.g. Theorem 3.20 in (Mohri et al., 2018). An overview is as follows. Consider a distribution on X that has supportonly at two points - the common point x and the rare point x . Take P ( x ) = O ( α − α ) . Then the expected number ofoccurrences of the point x in G is O (cid:16) α − α (1 − α ) N m (cid:17) = O ( αN m ) . Thus, one can show that with constant probabilitythe number of x ’s in G is at most αN m and hence the adversary (that has access to exactly αN m points in total) can insertthe same number of x ’s, but wrongly labelled, into the final dataset. Therefore, based on the union of the corrupted datasets,no algorithm can guess with probability greater than / what the true label of x was. Proof.
We prove that there exists a distribution D on X and a labelling function f ∈ H , such that the resulting jointdistribution on X × Y , defined by x ∼ D and y = f ( x ) , satisfies the desired property.Without loss of generality, let G = [1 , , . . . , k ] . Since H is non-trivial, there exist h , h ∈ H and x , x ∈ X , such that h ( x ) = h ( x ) , while h ( x ) = 1 , but h ( x ) = − . Consider the following distribution on X : P D ( x ) = 1 − (cid:15) and P D ( x ) = 4 (cid:15), (65)where (cid:15) = α − α . Assume that the points are labelled by a function f ∈ H (to be chosen later as either h or h ). Denotethe initial uncorrupted collection of datasets by S (cid:48) = ( S (cid:48) , . . . , S (cid:48) N ) , with S (cid:48) i = { ( x (cid:48) i, , f ( x (cid:48) i, )) , . . . , ( x (cid:48) i,m , f ( x (cid:48) i,m )) } and x (cid:48) i,j being i.i.d. samples from D .First we show that with constant probability the point x appears at most αN m times in G . Indeed, let C be this numberof appearances. Then C is a binomial random variable with probability of success (cid:15) and number of trials (1 − α ) N m .Therefore, by the Chernoff bound: P S (cid:48) ( C ≥ αN m ) = P S (cid:48) ( C ≥ (1 + 1)4 (cid:15) (1 − α ) N m ) ≤ e − αNm ≤ e − / < (66)and so: P S (cid:48) ( C ≤ αN m ) > . (67)Now consider the following policy for the fixed-set adversary A s : S (cid:48) → S . For any index i ∈ [ N ] the adversary replaces S (cid:48) i = { ( x (cid:48) i, , f ( x (cid:48) i, )) , . . . , ( x (cid:48) i,m , f ( x (cid:48) i,m )) } with a dataset S i = { ( x i, , y i, ) . . . , ( x i,m , y i,m ) } , such that: ( x i,j , y i,j ) = ( x (cid:48) i,j , f ( x (cid:48) i,j )) , if i ∈ G = [1 , , . . . , k ]( x , − f ( x )) , if i ∈ [ k + 1 , . . . , N ] and ( i − k − m + j ≤ C ( x , f ( x )) , otherwise (68)Then the adversary returns S = ( S , . . . , S N ) . That is, the adversary keeps the datasets in G untouched, and fills the datasetsin [ N ] \ G with as many x ’s as there are in G , but wrongly labelled.Crucially, whenever C ≤ αN m , the union of the data in all N sets will look the same no matter if the original labellingfunction was h or h . In particular, L ( A s ( S (cid:48) )) will be identical in both cases.Finally, we argue that under the event C ≤ αN m and the chosen adversary, the learner would incur high loss and show thatthis implies the result in (19). Let S be the set of all datasets in ( X × Y ) N × m , such that C ≤ αN m holds. We just showed dversarial Multi-Source PAC Learning that P S (cid:48) ( S (cid:48) ∈ S ) > and that whenever S (cid:48) ∈ S , L ( A s ( S (cid:48) )) is independent of whether the original labelling function was h or h .Consider a fixed set S (cid:48) ∈ S and let S = A s ( S (cid:48) ) and h S = L ( S ) . Denote by R ( h S , f ) = P D ( h S ( x ) (cid:54) = f ( x ) ∩ x (cid:54) = x ) and note that R ( h S , f ) ≤ P D ( h S ( x ) (cid:54) = f ( x )) = R ( L ( A s ( S (cid:48) ))) . Notice that: R ( h S , h ) + R ( h S , h ) = (cid:88) i =1 , h S ( x i ) (cid:54) = h ( x i ) x i (cid:54) = x P ( x i ) + (cid:88) i =1 , h S ( x i ) (cid:54) = h ( x i ) x i (cid:54) = x P ( x i ) (69) = h S ( x ) (cid:54) = h ( x ) (cid:15) + h S ( x ) (cid:54) = h ( x ) (cid:15) (70) = 4 (cid:15), (71)where we used that h ( x ) = 1 = − h ( x ) and that h S is independent of the underlying labelling function.Since the above holds for any S (cid:48) ∈ S , it also holds in expectation, conditioned on S (cid:48) ∈ S : E S (cid:48) ∈S ( R ( h S , h ) + R ( h S , h )) ≥ (cid:15). (72)Therefore, E S (cid:48) ∈S ( R ( h S , h i )) ≥ (cid:15) for at least one of i = 1 , . Take f to be h , if h satisfies the inequality, and h otherwise. Conditioning on {R ( h S , f ) ≥ (cid:15) } and using R ( h S , f ) ≤ P D ( x (cid:54) = x ) = 4 (cid:15) : (cid:15) ≤ E S (cid:48) ∈S ( R ( h S , f )) = E S (cid:48) ∈S ( R ( h S , f ) |R ( h S , f ) ≥ (cid:15) ) P S (cid:48) ∈S ( R ( h S , f ) ≥ (cid:15) ) (73) + E S (cid:48) ∈S ( R ( h S , f ) |R ( h S , f ) < (cid:15) ) P S (cid:48) ∈S ( R ( h S , f ) < (cid:15) ) (74) ≤ (cid:15) P S (cid:48) ∈S ( R ( h S , f ) ≥ (cid:15) ) + (cid:15) P S (cid:48) ∈S ( R ( h S , f ) < (cid:15) ) (75) = (cid:15) + 3 (cid:15) P S (cid:48) ∈S ( R ( h S , f ) ≥ (cid:15) ) . (76)Hence, P S (cid:48) ∈S ( R ( h S , f ) ≥ (cid:15) ) ≥ (cid:15) (2 (cid:15) − (cid:15) ) = 13 (77)Finally, P S (cid:48) ( R ( L ( A s ( S (cid:48) ))) ≥ (cid:15) ) ≥ P S (cid:48) ( R ( h S , f ) ≥ (cid:15) ) ≥ P S (cid:48) ∈S ( R ( h S , f ) ≥ (cid:15) ) P S (cid:48) ( S (cid:48) ∈ S ) >
13 320 = 120 . (78) dversarial Multi-Source PAC Learning C. Proof of Theorem 3
Theorem 3.
Let
H ⊂ { h : X → Y} be a hypothesis space, let m and N be any integers and let G be a fixed subset of [ N ] of size k ∈ { , . . . , N − } . Let S (cid:48) ∈ ( X × Y ) N × m be drawn i.i.d. from D . Then the following statements hold for anymulti-source learner L :(a) Suppose that H is non-trivial. Then there exists a distribution D on X with min h ∈H R ( h ) = 0 , and a fixed-setadversary A with index set G , such that: P S (cid:48) (cid:16) R (cid:0) L ( A ( S (cid:48) ) (cid:1) > α m (cid:17) > . (79) (b) Suppose that H has VC dimension d ≥ . Then there exists a distribution D on X × Y and a fixed-set adversary A with index set G , such that: P S (cid:48) (cid:32) R (cid:0) L ( A ( S (cid:48) ) (cid:1) − min h ∈H R ( h ) > (cid:114) d N m + α m (cid:33) > . (80) In both cases, α = N − kN is the power of the adversary. To prove part (a), we use a similar technique as in the no-free-lunch results in (Bshouty et al., 2002) and in the classicno-free-lunch theorem, e.g. Theorem 3.20 in (Mohri et al., 2018). An overview is as follows. Consider a distribution on X that has support only at two points - the common point x and the rare point x . Take P ( x ) = O ( αm ) . Then one can showthat with constant probability the number of datasets that contain x is at most αN . We show that in this case there exists analgorithm for the strong adversary that will return the same unordered collection of datasets, regardless of the true label of x . Thus no learner can guess with probability greater than / what the true label of x was.Part (b) follows from part (a) and the standard no-free-lunch theorem for agnostic binary classification. Proof. a) As in Theorem 2, we prove that there exists a distribution D on X and a labeling function f ∈ H , such that theresulting joint distribution on X × Y , defined by x ∼ D and y = f ( x ) , satisfies the desired property.Without loss of generality, let G = [1 , , . . . , k ] . Since H is non-trivial ( d ≥ ), there exist h , h ∈ H and x , x ∈ X ,such that h ( x ) = h ( x ) , while h ( x ) = 1 , but h ( x ) = − . Consider the following distribution on X : P D ( x ) = 1 − (cid:15) and P D ( x ) = 4 (cid:15), (81)where (cid:15) = α m . Assume that the points are labelled by a function f ∈ H (to be chosen later as either h or h ). Denote theinitial uncorrupted collection of datasets by S (cid:48) = ( S (cid:48) , . . . , S (cid:48) N ) , with S (cid:48) i = { ( x (cid:48) i, , f ( x (cid:48) i, )) , . . . , ( x (cid:48) i,m , f ( x (cid:48) i,m )) } and x (cid:48) i,j being i.i.d. samples from D .First we show that with constant probability the point x is contained in no more than αN sources. Indeed, let C b be thenumber of sources that contain x and let C p be the number of points (out of the N m in total) that are equal to x . Clearly C b ≤ C p . Note that C p is a binomial random variable with probability of success (cid:15) and number of trials N m . Therefore,by the Chernoff bound: P S (cid:48) ( C p ≥ αN ) = P S (cid:48) ( C p ≥ (1 + 1)4 (cid:15)N m ) ≤ e − αN ≤ e − / < (82)and so: P S (cid:48) ( C b ≤ αN ) ≥ P S (cid:48) ( C p ≤ αN ) > . (83)Now consider the following policy for the adversary A s : S (cid:48) → S . Whenever C b ≤ αN , let M ⊂ G be the list of indexes i ∈ G , such that S (cid:48) i contains x . Let l = | M | and note that l ≤ C b ≤ αN . For any index i ∈ [ N ] the adversary replaces S (cid:48) i = { x (cid:48) i, , f ( x (cid:48) i, ) , . . . , ( x (cid:48) i,m , f ( x (cid:48) i,m )) } with a dataset S i = { ( x i, , y i, ) . . . , ( x i,m , y i,m ) } , such that: dversarial Multi-Source PAC Learning ( x i,j , y i,j ) = ( x (cid:48) i,j , f ( x (cid:48) i,j )) , if i ∈ G = [1 , , . . . , k ]( x , f ( x )) , if i ∈ [ k + 1 , . . . , k + l ] and x (cid:48) M [ i − k ] ,j = x ( x , − f ( x )) , if i ∈ [ k + 1 , . . . , k + l ] and x (cid:48) M [ i − k ] ,j = x ( x , f ( x )) , if i ∈ [ k + l + 1 , . . . , N ] (84)Then the adversary returns S = ( S , . . . , S N ) . That is, the adversary keeps the datasets in G untouched, copies all of thedatasets in M into its own data, flipping the labels of the x ’s, and, in case there are additional sources at its disposal, it fillsthem with (correctly labelled) x ’s only.Crucially, the resulting (unordered) collection is the same no matter if the original labelling function was h or h . Inparticular, L ( S ) will be the same in both cases.In the case when C b > αN , the adversary leaves the data unchanged, i.e. S = S (cid:48) .Finally, we argue that under the event C b ≤ αN and the chosen adversary, the learner would incur high loss and show thatthis implies the result in (20). Let S be the set of all datasets in ( X × Y ) N × m , such that C b ≤ αN holds. We just showedthat P S (cid:48) ( S (cid:48) ∈ S ) > and that whenever S (cid:48) ∈ S , L ( A s ( S (cid:48) )) is independent of whether the original labelling function was h or h .Now the proof proceeds just as in Theorem 2. Consider a fixed set S (cid:48) ∈ S and let S = A s ( S (cid:48) ) and h S = L ( S ) . Denote by R ( h S , f ) = P D ( h S ( x ) (cid:54) = f ( x ) ∩ x (cid:54) = x ) and note that R ( h S , f ) ≤ P D ( h S ( x ) (cid:54) = f ( x )) = R ( L ( A s ( S (cid:48) ))) . Notice that: R ( h S , h ) + R ( h S , h ) = (cid:88) i =1 , h S ( x i ) (cid:54) = h ( x i ) x i (cid:54) = x P ( x i ) + (cid:88) i =1 , h S ( x i ) (cid:54) = h ( x i ) x i (cid:54) = x P ( x i ) (85) = h S ( x ) (cid:54) = h ( x ) (cid:15) + h S ( x ) (cid:54) = h ( x ) (cid:15) (86) = 4 (cid:15), (87)where we used that h ( x ) = 1 = − h ( x ) and that h S is independent of the underlying labelling function.Since the above holds for any S (cid:48) ∈ S , it also holds in expectation, conditioned on S (cid:48) ∈ S : E S (cid:48) ∈S ( R ( h S , h ) + R ( h S , h )) ≥ (cid:15). (88)Therefore, E S (cid:48) ∈S ( R ( h S , h i )) ≥ (cid:15) for at least one of i = 1 , . Take f to be h , if h satisfies the inequality, and h otherwise. Conditioning on {R ( h S , f ) ≥ (cid:15) } and using R ( h S , f ) ≤ P D ( x (cid:54) = x ) = 4 (cid:15) : (cid:15) ≤ E S (cid:48) ∈S ( R ( h S , f )) = E S (cid:48) ∈S ( R ( h S , f ) |R ( h S , f ) ≥ (cid:15) ) P S (cid:48) ∈S ( R ( h S , f ) ≥ (cid:15) ) (89) + E S (cid:48) ∈S ( R ( h S , f ) |R ( h S , f ) < (cid:15) ) P S (cid:48) ∈S ( R ( h S , f ) < (cid:15) ) (90) ≤ (cid:15) P S (cid:48) ∈S ( R ( h S , f ) ≥ (cid:15) ) + (cid:15) P S (cid:48) ∈S ( R ( h S , f ) < (cid:15) ) (91) = (cid:15) + 3 (cid:15) P S (cid:48) ∈S ( R ( h S , f ) ≥ (cid:15) ) . (92)Hence, P S (cid:48) ∈S ( R ( h S , f ) ≥ (cid:15) ) ≥ (cid:15) (2 (cid:15) − (cid:15) ) = 13 (93)Finally, P S (cid:48) ( R ( L ( A s ( S (cid:48) ))) ≥ (cid:15) ) ≥ P S (cid:48) ( R ( h S , f ) ≥ (cid:15) ) (94) ≥ P S (cid:48) ∈S ( R ( h S , f ) ≥ (cid:15) ) P S (cid:48) ( S (cid:48) ∈ S ) (95) >
13 320 = 120 . (96) dversarial Multi-Source PAC Learning b) First we argue that there exists a distribution D on X × Y and a fixed-set adversary A s , such that: P S (cid:48) ∼D (cid:32) R ( L ( A s ( S (cid:48) ))) − min h ∈H R ( h ) > (cid:114) d N m (cid:33) > . (97)This follows directly from the classic no-free-lunch theorem for binary classifiers in the unrealizable case. Indeed, applyingTheorem 3.23 in (Mohri et al., 2018) and setting the adversary to be the identity mapping gives the result.Now, since any hypothesis space with VC dimension d ≥ is non-trivial, we also know from a) that there exists an adversary A s and a distribution D on X × Y , such that: P S (cid:48) ∼D (cid:18) R ( L ( A s ( S (cid:48) ))) − min h ∈H R ( h ) > α m (cid:19) > . (98)Fix any set of values for N, m, d, k . Then at least one of the pairs ( A s , D ) and ( A s , D ) satisfies: P S (cid:48) (cid:32) R ( L ( A s ( S (cid:48) ))) − min h ∈H R ( h ) > (cid:114) d N m + α m (cid:33) ≥ P S (cid:48) (cid:32) R ( L ( A s ( S (cid:48) ))) > { (cid:114) d N m , α m } (cid:33) (99) = P S (cid:48) (cid:32) R ( L ( A s ( S (cid:48) ))) > max { (cid:114) d N m , α m } (cid:33) (100) > ..