Learning Adversary Behavior in Security Games: A PAC Model Perspective
Arunesh Sinha, Debarun Kar, Milind Tambe
University of Southern California {aruneshs, dkar, tambe}@usc.edu
ABSTRACT
Recent applications of Stackelberg Security Games (SSG), from wildlife crime to urban crime, have employed machine learning tools to learn and predict adversary behavior using available data about defender-adversary interactions. Given these recent developments, this paper commits to an approach of directly learning the response function of the adversary. Using the PAC model, this paper lays a firm theoretical foundation for learning in SSGs (e.g., theoretically answering questions about the number of samples required to learn adversary behavior) and provides utility guarantees when the learned adversary model is used to plan the defender's strategy. The paper also aims to answer practical questions such as how much more data is needed to improve an adversary model's accuracy. Additionally, we explain a recently observed phenomenon: prediction accuracy of learned adversary behavior is not enough to discover the utility-maximizing defender strategy. We provide four main contributions: (1) a PAC model of learning adversary response functions in SSGs; (2) PAC-model analysis of the learning of key existing bounded rationality models in SSGs; (3) an entirely new approach to adversary modeling based on a non-parametric class of response functions, with PAC-model analysis; and (4) identification of conditions under which computing the best defender strategy against the learned adversary behavior is indeed the optimal strategy. Finally, we conduct experiments with real-world data from a national park in Uganda, showing the benefit of our new adversary modeling approach and verifying our PAC-model predictions.
1. INTRODUCTION
Stackelberg Security Games (SSGs) are arguably the best example of the application of the Stackelberg game model in the real world. Indeed, numerous successful deployed applications [29] (LAX airport, US air marshals) and extensive research on related topics [5, 17, 11] provide evidence of the generality of the SSG framework. More recently, new application domains of SSGs, from wildlife crime to urban crime, are accompanied by significant amounts of past data of recorded defender strategies and adversary reactions. This has enabled the learning of adversary behavior from such data [35, 34]. Also, analysis of these datasets and human subject experiment studies [25] has revealed that modeling the bounded rationality of the adversary enables the defender to further optimize her allocation of limited security resources. Thus, learning the adversary's bounded rational behavior and computing the defender strategy based on the learned model has become an important area of research in SSGs.

However, without a theoretical foundation for this learning and strategic planning problem, many issues that arise in practice cannot be explained or addressed. For example, it has recently been observed that in spite of the good prediction accuracy of learned models of adversary behavior, the performance of the defender strategy computed against the learned adversary model is poor [10]. A formal study could also answer several other important questions that arise in practice, for example: (1) How many samples are required to learn a "reasonable" model of adversary behavior in a given SSG? (2) What utility bound can be provided when deploying the best defender strategy computed against the learned adversary model?

Motivated by the learning of adversary behavior from data in recent applications [12, 35], we adopt the framework in which the defender first learns the response function of the adversary (adversary behavior) and then optimizes against the learned response.
This paper is the first theoretical study of the adversary bounded rational behavior learning problem and of the optimality guarantees (utility bounds) obtained when computing the best defender strategies against such learned behaviors. Indeed, unlike past theoretical work on learning in SSGs (see related work), where reasoning about the adversary response happens through payoffs and rationality, we treat the response of the bounded rational adversary as the object to be learned.

Our first contribution is using the Probably Approximately Correct (PAC) model [15, 2] to analyze the learning problem at hand. A PAC analysis yields sample complexity, i.e., the number of samples required to achieve a given level of learning guarantee. Hence, the PAC analysis allows us to address the question of the required quantity of samples raised earlier. While PAC analysis is fairly standard for classifiers and real-valued functions (i.e., regression), it is not an out-of-the-box approach. In particular, PAC-model analysis of SSGs brings to the table significant new challenges. To begin with, given that we are learning adversary response functions, we must deal with the output being a probability distribution over the adversary's actions, i.e., these response functions are vector-valued. We appeal to the framework of Haussler [13] to study the PAC learnability of vector-valued response functions. For SSGs, we first pose the learning problem in terms of maximizing the likelihood of seeing the attack data, but without restricting the formulation to any particular class of response functions. This general PAC framework for learning adversary behavior in SSGs enables the rest of the analysis in this paper.

Our second contribution is an analysis of the SUQR model of bounded rationality adversary behavior used in SSGs, which posits a class of parametrized response functions with a given number of parameters (and corresponding features).
SUQR is the best known model of bounded rationality in SSGs, resulting in multiple deployed applications [12, 9]. In analyzing SUQR, we advance the state of the art in the mathematical techniques involved in the PAC analysis of vector-valued function spaces. In particular, we provide a technique to obtain sharper sample complexity for SUQR than simply applying Haussler's original techniques directly. We decompose the given SUQR function space into two (or more) parts, perform a PAC analysis of each part, and finally combine the results to obtain the sample complexity result for SUQR (which scales as $T \log T$ with $T$ targets; see details in Section 5).

Our third contribution is an entirely new behavioral model specified by the non-parametric Lipschitz (NPL) class of response functions for SSGs, where the only restriction on NPL functions is Lipschitzness. The NPL approach makes very few assumptions about the response function, enabling the learning of a multitude of behaviors, albeit at the cost of higher sample complexity. As NPL has never been explored in learning bounded rationality models in SSGs, we provide a novel learning technique for NPL. We also compute the sample complexity for NPL. Further, we observe in our experiments that the power to capture a large variety of behaviors enables NPL to perform better than SUQR on real-world data from Queen Elizabeth National Park (QENP) in Uganda.

Our fourth contribution is to convert the PAC learning guarantee into a bound on the utility derived by the defender when planning her strategy based on the learned adversary behavior model. In the process, we make explicit the assumptions required of the dataset of adversary attacks in response to deployed defender mixed strategies in order to discover the optimal (w.r.t. utility) defender strategy.
These assumptions help explain a puzzling phenomenon observed in the recent literature on learning in SSGs [10]: learned adversary behaviors provide good prediction accuracy, but the best defender strategy computed against such learned behavior may not perform well in practice. The key is that the dataset for learning must not simply record a large number of attacks against a few defender strategies, but rather must contain the attacker's responses to a variety of defender mixed strategies. We discuss the details of our assumptions and their implications for the strategic choice of the defender's actions in Section 7.

We also conduct experiments with real-world poaching data from QENP in Uganda (obtained from [24]) and data collected from human subject experiments. The experimental results support our theoretical conclusions about the number of samples required by the different learning techniques. Showing the value of our new NPL approach, NPL outperforms all existing approaches in predicting poaching activity in QENP. Finally, our work opens up a number of exciting research directions, such as studying the learning of behavioral models in an active learning setting and real-world applications of non-parametric models.
2. RELATED WORK
Learning and planning in SSGs with rational adversaries has been studied in two recent papers [6, 3], and in Stackelberg games by Letchford et al. [19] and Marecki et al. [21]. All these papers study the learning problem under an active learning framework, where the defender can choose the strategy to deploy within the learning process. Also, all these papers study the setting with perfectly rational adversaries. Our work differs as we study bounded rational adversaries in a passive learning scenario (i.e., with given data), and once the model is learned we analyze the guarantees of planning against the learned model. Also, our focus on SSGs differentiates us from recent work on PAC learnability in cooperative games [4], in which the authors study PAC learnability of the value function of coalitions with perfectly rational players. Further, our work is orthogonal to adversarial learning [32], which studies game-theoretic models of an adversary attacking a learning algorithm.

The PAC learning model has a very rich and extensive body of work [2]. The PAC model provides a theoretical underpinning for most standard machine learning techniques. We use the PAC framework of Haussler [13]. For the parametric case, we derive sharp sample complexity bounds based on covering numbers using our own techniques, rather than bounding them via the standard technique of pseudo-dimension [27] or fat-shattering dimension [2]. For the NPL case we use results from [31] along with our technique of bounding the mixed strategy space of the defender; the results of [31] have also been used in the study of Lipschitz classifiers [20], but we differ in that our hypothesis functions are real vector-valued.

(Due to lack of space, some proofs in this paper are in the online Appendix: http://bit.ly/1l4n3s1.)
3. SSG PRELIMINARIES
This section introduces the background and preliminary notation for SSGs. A summary of the notation used in this paper is presented in Table 1. An SSG is a two-player Stackelberg game between a defender (leader) and an adversary (follower) [26]. The defender wishes to protect $T$ targets with a limited number of security resources $K$ ($K \ll T$). For ease of presentation, we restrict ourselves to the scenario with no scheduling constraints (see Korzhyk et al. [17]). The defender's pure strategy is to allocate each resource to a target. A defender's mixed strategy $\tilde{x}$ (with $\tilde{x}_j \in [0,1]$ for all $j \in P$ and $\sum_{j=1}^{|P|} \tilde{x}_j = 1$) is then defined as a probability distribution over the set $P$ of all possible pure strategies. An equivalent description (see Korzhyk et al. [17]) of these mixed strategies is in terms of coverage probabilities over the set of targets: $x$ with $x_i \in [0,1]$ for all $i \in T$ and $\sum_{i=1}^{T} x_i \le K$. We refer to this latter description as the mixed strategy of the defender.

A pure strategy of the adversary is defined as attacking a single target. The adversary's mixed strategy is then a categorical distribution over the set of targets. Thus, it can be expressed as parameters $q_i$ ($i \in T$) of a categorical distribution such that $0 \le q_i \le 1$ and $\sum_i q_i = 1$. The adversary's response to the defender's mixed strategy is given by a function $q : X \to Q$, where $Q$ is the space of all mixed strategies of the adversary. The matrix $U$ specifies the payoffs of the defender, and her expected utility is $x^T U q(x)$ when she plays a mixed strategy $x \in X$.

Bounded Rationality Models:
We discuss the SUQR model and its representation for the analysis in this paper below. Building on prior work on quantal response [23], SUQR [25] states that given $n$ actions, a human player plays action $i$ with probability $q_i \propto e^{w \cdot v_i}$, where $v_i$ denotes a vector of feature values for choice $i$ and $w$ denotes the weight parameters for these features. The model is equivalent to conditional logistic regression [22]. The features are specific to the domain; e.g., in SSG applications the set of features includes the coverage probability $x_i$, the reward $R_i$ and the penalty $P_i$ of target $i$. Since, other than the coverage $x$, the remaining features are fixed for each target in real-world data, we assume a target-specific feature $c_i$ (which may be a linear combination of rewards and penalties) and analyze the following generalized form of SUQR with parameters $w$ and the $c_i$'s: $q_i(x) \propto e^{w x_i + c_i}$. As $\sum_{i=1}^{T} q_i(x) = 1$, we have:
$$q_i(x) = \frac{e^{w x_i + c_i}}{\sum_{j=1}^{T} e^{w x_j + c_j}}.$$
(This general form is harder to analyze than the standard SUQR form, in which the exponent function of $x_i, R_i, P_i$ is the same for all $q_i$: $w_1 x_i + w_2 R_i + w_3 P_i$. For completeness, we derive the results for the standard SUQR form in the Appendix.)

Equivalent Alternate Representation: For ease of mathematical proofs, using standard techniques in logistic regression, we take $q_T \propto e^0$, and hence $q_i \propto e^{w(x_i - x_T) + (c_i - c_T)}$. To shorten notation, let $c_{iT} = c_i - c_T$ and $x_{iT} = x_i - x_T$. By multiplying the numerator and denominator by $e^{w x_T + c_T}$, it can be verified that
$$q_i(x) = \frac{e^{w x_{iT} + c_{iT}}}{1 + \sum_{j=1}^{T-1} e^{w x_{jT} + c_{jT}}} = \frac{e^{w x_i + c_i}}{\sum_{j=1}^{T} e^{w x_j + c_j}}.$$

Table 1: Notation
| Notation | Meaning |
| $T, K$ | number of targets, defender resources |
| $d_{l^p}(o, o')$ | $l^p$ distance between points $o, o'$ |
| $\bar{d}_{l^p}(o, o')$ | average $l^p$ distance: $d_{l^p}(o, o')/n$ |
| $X$ | instance space (defender mixed strategies) |
| $Y$ | outcome space (attacked target) |
| $A$ | decision space |
| $h \in \mathcal{H}$ | hypothesis function $h : X \to A$ |
| $\mathcal{N}(\epsilon, \mathcal{H}, d)$ | $\epsilon$-cover of set $\mathcal{H}$ using distance $d$ |
| $C(\epsilon, \mathcal{H}, d)$ | capacity of $\mathcal{H}$ using distance $d$ |
| $r_h(p)$, $\hat{r}_h(\vec{z})$ | true risk, empirical risk of hypothesis $h$ |
| $d_{L^1(P,d)}(f, g)$ | $L^1$ distance between functions $f, g$ |
| $q^p(x)$ | parameters of the true attack distribution |
| $q^h(x)$ | parameters of the attack distribution predicted by $h$ |
4. LEARNING FRAMEWORK FOR SSG
First, we introduce some notation: given two $n$-dimensional points $o$ and $o'$, the $l^p$ distance $d_{l^p}$ between the two points is $d_{l^p}(o, o') = \|o - o'\|_p = (\sum_{i=1}^{n} |o_i - o'_i|^p)^{1/p}$. In particular, $d_{l^\infty}(o, o') = \|o - o'\|_\infty = \max_i |o_i - o'_i|$. Also, $\bar{d}_{l^p} = d_{l^p}/n$. KL denotes the Kullback-Leibler divergence [18].

We use the learning framework of Haussler [13], which includes an instance space $X$ and an outcome space $Y$. In our context, $X$ is the space of defender mixed strategies $x \in X$. The outcome space $Y$ is defined as the space of all possible categorical choices over a set of $T$ targets (i.e., the choice of target to attack) for the adversary. Let $\mathbf{t}_i$ denote the attacker's choice to attack the $i$th target. More formally, $\mathbf{t}_i = \langle t_i^1, \ldots, t_i^T \rangle$, where $t_i^j = 1$ for $j = i$ and $0$ otherwise. Thus, $Y = \{\mathbf{t}_1, \ldots, \mathbf{t}_T\}$. We will use $y$ to denote a general element of $Y$. To give an example, given three targets $T_1$, $T_2$ and $T_3$, $Y = \{\mathbf{t}_1, \mathbf{t}_2, \mathbf{t}_3\} = \{\langle 1,0,0 \rangle, \langle 0,1,0 \rangle, \langle 0,0,1 \rangle\}$, where $\mathbf{t}_1$ denotes $\langle 1,0,0 \rangle$, i.e., that $T_1$ was attacked while $T_2$ and $T_3$ were not, and so on. The training data are samples drawn from $Z = X \times Y$ using an unknown probability distribution, say with density $p(x, y)$. Each training data point $(x, y)$ denotes the adversary's response $y \in Y$ (e.g., $\mathbf{t}_1$, an attack on target 1) to a particular defender mixed strategy $x \in X$. The density $p$ also determines the true attacker behavior $q^p(x)$, which stands for the conditional probabilities of the attacker attacking each target given $x$, so that $q^p(x) = \langle q_1^p(x), \ldots, q_T^p(x) \rangle$, where $q_i^p(x) = p(\mathbf{t}_i \mid x)$.

Haussler [13] also defines a decision space $A$, a space of hypotheses (functions) $\mathcal{H}$ with elements $h : X \to A$, and a loss function $l : Y \times A \to \mathbb{R}$.
The hypothesis $h$ outputs values in $A$ that enable computing (probabilistic) predictions of the actual outcome. The loss function $l$ captures the loss when the real outcome is $y \in Y$ and the prediction of possible outcomes is made using $a \in A$.

Example 1: Generalized SUQR
For the parametric representation of generalized SUQR in the previous section, and considering our 3-target example above, $\mathcal{H}$ contains vector-valued functions with $(T-1) = 2$ components that form the exponents of the numerators of the prediction probabilities $q_1$ and $q_2$. $\mathcal{H}$ contains two components since the third component $q_3$ is proportional to $e^0$, as discussed above. That is, $\mathcal{H}$ contains functions of the form $\langle w(x_1 - x_3) + c_{13},\ w(x_2 - x_3) + c_{23} \rangle$ for all $x \in X$. Also, $A$ is the range of the functions in $\mathcal{H}$, i.e., $A \subset \mathbb{R}^2$. Then, given $h(x) = \langle a_1, a_2 \rangle$, the prediction probabilities $q_1^h(x), q_2^h(x), q_3^h(x)$ are given by $q_i^h(x) = \frac{e^{a_i}}{e^{a_1} + e^{a_2} + e^{a_3}}$ (with $a_3 = 0$).

PAC learnability:
The learning algorithm aims to learn an $h \in \mathcal{H}$ that minimizes the true risk of using the hypothesis $h$. The true risk $r_h(p)$ of a particular hypothesis (predictor) $h$, given density function $p(x,y)$ over $Z = X \times Y$, is the expected loss of predicting $h(x)$ when the true outcome is $y$:
$$r_h(p) = \int p(x, y)\, l(y, h(x))\, dx\, dy$$
Of course, as $p$ is unknown, the true risk cannot be computed. However, given (enough) samples from $p$, the true risk can be estimated by the empirical risk. The empirical risk $\hat{r}_h(\vec{z})$, where $\vec{z}$ is a sequence of $m$ training samples from $Z$, is defined as $\hat{r}_h(\vec{z}) = \frac{1}{m} \sum_{i=1}^{m} l(y_i, h(x_i))$. Let $h^*$ be the hypothesis that minimizes the true risk, i.e., $r_{h^*}(p) = \inf\{r_h(p) \mid h \in \mathcal{H}\}$, and let $\hat{h}^*$ be the hypothesis that minimizes the empirical risk, i.e., $\hat{r}_{\hat{h}^*}(\vec{z}) = \inf\{\hat{r}_h(\vec{z}) \mid h \in \mathcal{H}\}$. The following is the well-known PAC learning result [2] for any empirical risk minimizing (ERM) algorithm $\mathcal{A}$ yielding hypothesis $\mathcal{A}(\vec{z})$:

If $\Pr(\forall h \in \mathcal{H}.\ |\hat{r}_h(\vec{z}) - r_h(p)| < \alpha/2) > 1 - \delta/2$ and $\Pr(|\hat{r}_{\mathcal{A}(\vec{z})}(\vec{z}) - \hat{r}_{\hat{h}^*}(\vec{z})| < \alpha/2) > 1 - \delta/2$, then $\Pr(|r_{\mathcal{A}(\vec{z})}(p) - r_{h^*}(p)| < \alpha) > 1 - \delta$.

The final result states that the output hypothesis $\mathcal{A}(\vec{z})$ has true risk $\alpha$-close to the lowest true risk in $\mathcal{H}$, attained by $h^*$, with high probability $1 - \delta$ over the choice of training samples. The first pre-condition states that for all $h \in \mathcal{H}$ the difference between empirical risk and true risk must be $\alpha/2$-close with high probability $1 - \delta/2$. The second pre-condition states that the output $\mathcal{A}(\vec{z})$ of the ERM algorithm $\mathcal{A}$ should have empirical risk $\alpha/2$-close to the lowest empirical risk, that of $\hat{h}^*$, with high probability $1 - \delta/2$. A hypothesis class $\mathcal{H}$ is called $(\alpha, \delta)$-PAC learnable if there exists an ERM algorithm $\mathcal{A}$ such that $\mathcal{H}$ and $\mathcal{A}$ satisfy the two pre-conditions.
In this work, our empirical risk minimizing algorithms find $\hat{h}^*$ exactly (up to the precision of convex solvers; see Section 6), thus satisfying the second pre-condition; hence, we will focus on the first pre-condition. As the empirical risk estimate gets better with increasing samples, a minimum number of samples is required to ensure that the first pre-condition holds (see Theorem 1). Hence we can relate $(\alpha, \delta)$-PAC learnability to the number of samples.

Modeling security games:
Having given an example for generalized SUQR, we now systematically model the learning of adversary behavior in SSGs using the PAC framework for any hypothesis class $\mathcal{H}$. We assume certain properties of the functions $h \in \mathcal{H}$, presented below. First, the vector-valued function $h \in \mathcal{H}$ takes the form $h(x) = \langle h_1(x), \ldots, h_{T-1}(x) \rangle$. Thus, $A$ is the product space $A_1 \times \ldots \times A_{T-1}$. Each $h_i(x)$ is assumed to take values in $[-M, M]$, where $M \gg 0$, which implies $A_i = [-M, M]$. The prediction probabilities induced by any $h$ are $q^h(x) = \langle q_1^h(x), \ldots, q_T^h(x) \rangle$, where $q_i^h(x) = \frac{e^{h_i(x)}}{\sum_i e^{h_i(x)}}$ (with $h_T(x) = 0$). Next, we specify the two classes of functions that we analyze in later sections. We choose these two function classes because (1) the first represents the widely used SUQR model in the literature [25, 34] and (2) the second is very flexible, as it captures a wide range of functions and imposes only minimal Lipschitzness constraints to ensure that the functions are well behaved (e.g., continuous).

Parametric $\mathcal{H}$: In this approach we model generalized SUQR. Generalizing from Example 1, the functions $h \in \mathcal{H}$ take a parametric form where each component function is $h_i(x) = w x_{iT} + c_{iT}$.

Non-parametric Lipschitz (NPL) $\mathcal{H}$: Here, the only restriction we impose on functions $h \in \mathcal{H}$ is that each component function $h_i$ is $L$-Lipschitz with $L \le \hat{K}$, for a given and fixed constant $\hat{K}$. We show later (Lemma 7) that this implies that $q^h$ is Lipschitz as well.

Next, given the stochastic nature of the adversary's attacks, we use a loss function (the same for parametric and NPL) such that minimizing the empirical risk is equivalent to maximizing the likelihood of seeing the attack data. The loss function $l : Y \times A \to \mathbb{R}$ for actual outcome $\mathbf{t}_i$ is defined as:
$$l(\mathbf{t}_i, a) = -\log\Big( e^{a_i} \Big/ \Big(1 + \sum_{j=1}^{T-1} e^{a_j}\Big) \Big). \quad (1)$$
It can be readily inferred that minimizing the empirical risk (recall $\hat{r}_h(\vec{z}) = \frac{1}{m} \sum_{i=1}^{m} l(y_i, h(x_i))$) is equivalent to maximizing the log-likelihood of the training data.
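The pieces of this section can be tied together in a short sketch: generate synthetic training pairs $(x, y)$, evaluate the loss of Eq. 1, and compute the empirical risk, which is exactly the negative mean log-likelihood of the attack data. All strategies and the "true" attacker response here are invented for illustration; nothing is fit to real data:

```python
import numpy as np

rng = np.random.default_rng(0)
T, K, m = 3, 1, 50                      # targets, resources, samples

def random_mixed_strategy():
    """Coverage vector in [0,1]^T with sum <= K."""
    x = rng.uniform(0, 1, T)
    return x * min(1.0, K / x.sum())

def q_true(x):
    """A made-up 'true' attacker response (softmax of -3x)."""
    e = np.exp(-3 * x)
    return e / e.sum()

def loss(t_onehot, a):
    """Eq. 1: l(t_i, a) = -log( e^{a_i} / (1 + sum_j e^{a_j}) ), with a_T = 0."""
    full = np.append(a, 0.0)            # append the fixed T-th exponent
    log_denom = np.log(np.exp(full).sum())
    return log_denom - full[t_onehot.argmax()]

samples = []
for _ in range(m):
    x = random_mixed_strategy()
    i = rng.choice(T, p=q_true(x))      # sample the attacked target
    samples.append((x, np.eye(T)[i]))   # outcome stored one-hot, as t_i

def empirical_risk(h):
    """(1/m) sum_i l(y_i, h(x_i)): the negative mean log-likelihood."""
    return np.mean([loss(y, h(x)) for x, y in samples])

# e.g. a generalized-SUQR hypothesis with guessed parameters:
# exponents w * x_iT + c_iT with w = -3 and all c_iT = 0
h = lambda x: -3 * (x[:-1] - x[-1])
risk = empirical_risk(h)
assert risk > 0                         # -log of probabilities is positive
```

An ERM algorithm would minimize `empirical_risk` over the parameters of `h`; with the loss of Eq. 1 this is the familiar convex maximum-likelihood problem of conditional logistic regression.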
5. SAMPLE COMPLEXITY
In this section we derive the sample complexity for the parametric and NPL cases, which provides an indication of the amount of data required to learn the adversary behavior. First, we present a general result about sample complexity bounds for any $\mathcal{H}$, given our loss $l$. This result relies on sample complexity results in [13]. The bound depends on the capacity $C$ of $\mathcal{H}$, which we define after the theorem. The bound also assumes an ERM algorithm, which we present for our models in Section 6.

THEOREM 1. Assume that the hypothesis space $\mathcal{H}$ is permissible. Let the data be generated by $m$ independent draws from $X \times Y$ according to $p$. Then, assuming the existence of an ERM algorithm and given our loss $l$ defined in Eq. 1, the least $m$ required to ensure $(\alpha, \delta)$-PAC learnability is (recall $\bar{d}_{l^1}$ is the average $l^1$ distance) of the order
$$O\left( \frac{M^2}{\alpha^2} \left( \log\frac{1}{\delta} + \log C\Big(\frac{\alpha}{T}, \mathcal{H}, \bar{d}_{l^1}\Big) \right) \right)$$

PROOF SKETCH. Haussler [13] presents a result of the above form using a general distance metric defined on the space $A$ for any loss function $l$: $\rho(a, b) = \max_{y \in Y} |l(y, a) - l(y, b)|$. The main effort in this proof is relating $\rho$ to $\bar{d}_{l^1}$ for our choice of the loss function $l$ given by Equation 1. We are able to show that $\rho(a, b) \le T\, \bar{d}_{l^1}(a, b)$ for our loss function. Our result then follows from this relation (details in the Appendix).

The above sample complexity result is stated in terms of the capacity $C(\alpha/T, \mathcal{H}, \bar{d}_{l^1})$. Thus, in order to obtain the sample complexity of the generalized SUQR and NPL function spaces, we need to compute the capacity of these function spaces. Therefore, in the rest of this section we will focus on computing the capacity $C(\alpha/T, \mathcal{H}, \bar{d}_{l^1})$ for both the generalized SUQR and the NPL hypothesis spaces. First, we need to define the capacity $C$ of function spaces, for which we start by defining the covering number $\mathcal{N}$ of function spaces. Let $d$ be a pseudo-metric for the set $\mathcal{H}$.
(As noted in Haussler, permissibility is "a measurability condition defined in Pollard (1984) which need not concern us in practice.")

For any $\epsilon > 0$, an $\epsilon$-cover for $\mathcal{H}$ is a finite set $F \subseteq \mathcal{H}$ such that for any $h \in \mathcal{H}$ there is an $f \in F$ with $d(f, h) \le \epsilon$; i.e., any element of $\mathcal{H}$ is at least $\epsilon$-close to some element of the cover $F$. The covering number $\mathcal{N}(\epsilon, \mathcal{H}, d)$ denotes the size of the smallest $\epsilon$-cover for the set $\mathcal{H}$ (for the pseudo-metric $d$). We now proceed to define a pseudo-metric $d_{L^1(P,d)}$ on $\mathcal{H}$ with respect to any probability measure $P$ on $X$ and any given pseudo-metric $d$ on $A$. This pseudo-metric is the expected (over $P$) distance (with $d$) between the outputs of $f$ and $g$:
$$d_{L^1(P,d)}(f, g) = \int_X d(f(x), g(x))\, dP(x) \quad \forall f, g \in \mathcal{H}$$
Then, $\mathcal{N}(\epsilon, \mathcal{H}, d_{L^1(P,d)})$ is the covering number for $\mathcal{H}$ under the pseudo-metric $d_{L^1(P,d)}$. However, to be more general, the capacity of a function space provides a "distribution-free" notion of covering number. The capacity $C(\epsilon, \mathcal{H}, d)$ is:
$$C(\epsilon, \mathcal{H}, d) = \sup_P \{ \mathcal{N}(\epsilon, \mathcal{H}, d_{L^1(P,d)}) \}$$

Capacity of vector-valued functions:
The function spaces (both parametric and NPL) we consider are vector-valued. Haussler [13] provides a useful technique to bound the capacity of a vector-valued function space $\mathcal{H}$ in terms of the capacities of the component real-valued function spaces. Given $k$ function spaces $\mathcal{H}_1, \ldots, \mathcal{H}_k$ with functions from $X$ to $A_i$, he defines the free product function space $\times_i \mathcal{H}_i$, with functions from $X$ to $A = A_1 \times \ldots \times A_k$, as $\times_i \mathcal{H}_i = \{\langle h_1, \ldots, h_k \rangle \mid h_i \in \mathcal{H}_i\}$, where $\langle h_1, \ldots, h_k \rangle(x) = \langle h_1(x), \ldots, h_k(x) \rangle$. He shows that:
$$C(\epsilon, \times_i \mathcal{H}_i, \bar{d}_{l^1}) < \prod_{i=1}^{k} C(\epsilon, \mathcal{H}_i, d_{l^1}) \quad (2)$$
Unfortunately, a straightforward application of the above result does not give as tight a capacity bound in the parametric case as the novel direct-sum decomposition of function spaces we use in the next sub-section. Even for the NPL case, where the above result is used, we still need to compute $C(\epsilon, \mathcal{H}_i, d_{l^1})$ for each component function space $\mathcal{H}_i$.

Recall that the hypothesis function $h$ has $T-1$ component functions $w x_{iT} + c_{iT}$. However, the same weight $w$ in all component functions implies that $\mathcal{H}$ is not a free product of component function spaces; hence we cannot use Equation 2 directly. However, if we consider the space of functions, say $\mathcal{H}'$, in which the $i$th component function space $\mathcal{H}'_i$ is given by $w_i x_{iT} + c_{iT}$ (note $w_i$ can be different for each $i$), then we can use Equation 2 to bound $C(\epsilon, \mathcal{H}', \bar{d}_{l^1})$. Also, the fact that $\mathcal{H} \subset \mathcal{H}'$ allows upper bounding $C(\epsilon, \mathcal{H}, \bar{d}_{l^1})$ by $C(\epsilon, \mathcal{H}', \bar{d}_{l^1})$. But this approach results in a weaker $T \log(\frac{T}{\alpha} \log \frac{T}{\alpha})$ bound (a detailed derivation using this approach is in the Appendix) than the technique we use below.
We obtain a $T \log(\frac{T}{\alpha})$ result below in Theorem 2. We propose a novel approach that decomposes $\mathcal{H}$ into a direct sum of two function spaces (defined below), each of which captures the simpler functions $w x_{iT}$ and $c_{iT}$ respectively. We provide a general result about such decompositions, which allows us to bound $C(\epsilon, \mathcal{H}, \bar{d}_{l^1})$. We start with the following definition.

DEFINITION 1. The direct-sum semi-free product of function spaces
$\mathcal{G} \subset \times_i \mathcal{G}_i$ and $\times_i \mathcal{F}_i$ is defined as $\mathcal{G} \oplus \times_i \mathcal{F}_i = \{\langle g_1 + f_1, \ldots, g_{T-1} + f_{T-1} \rangle \mid \langle g_1, \ldots, g_{T-1} \rangle \in \mathcal{G},\ \langle f_1, \ldots, f_{T-1} \rangle \in \times_i \mathcal{F}_i\}$.

Applying the above definition to our case, $\mathcal{G}_i$ contains functions of the form $w x_{iT}$ ($w$ taking different values for different $g_i \in \mathcal{G}_i$). A function $\langle g_1, \ldots, g_{T-1} \rangle \in \times_i \mathcal{G}_i$ can have different weights for each component $g_i$, and thus we consider the subset $\mathcal{G} = \{\langle g_1, \ldots, g_{T-1} \rangle \in \times_i \mathcal{G}_i \mid$ the same coefficient $w$ for all $g_i\}$. $\mathcal{F}_i$ contains constant-valued functions of the form $c_{iT}$ ($c_{iT}$ different for different functions $f_i \in \mathcal{F}_i$). Then $\mathcal{H} = \mathcal{G} \oplus \times_i \mathcal{F}_i$. Next, we prove a general result about direct-sum semi-free products.

LEMMA 1. If $\mathcal{H}$ is the direct-sum semi-free product $\mathcal{G} \oplus \times_i \mathcal{F}_i$, then
$$C(\epsilon, \mathcal{H}, \bar{d}_{l^1}) < C(\epsilon/2, \mathcal{G}, \bar{d}_{l^1}) \prod_{i=1}^{T-1} C(\epsilon/2, \mathcal{F}_i, d_{l^1})$$

PROOF. Fix any probability distribution over $X$, say $P$. For brevity, we write $k$ instead of $T-1$. Consider an $\epsilon/2$-cover $U_i$ for each $\mathcal{F}_i$; also let $V$ be an $\epsilon/2$-cover for $\mathcal{G}$. We claim that $V \oplus \times_i U_i$ is an $\epsilon$-cover for $\mathcal{G} \oplus \times_i \mathcal{F}_i$. Take any function $h = \langle g_1 + f_1, \ldots, g_k + f_k \rangle$. Find functions $f'_i \in U_i$ such that $d_{L^1(P, d_{l^1})}(f_i, f'_i) < \epsilon/2$. Similarly, find a function $g' = \langle g'_1, \ldots, g'_k \rangle \in V$ such that $d_{L^1(P, \bar{d}_{l^1})}(g, g') < \epsilon/2$, where $g = \langle g_1, \ldots, g_k \rangle$. Let $h' = \langle g'_1 + f'_1, \ldots, g'_k + f'_k \rangle$.
Then,
$$d_{L^1(P, \bar{d}_{l^1})}(h, h') = \int_X \frac{1}{k} \sum_{i=1}^{k} d_{l^1}\big(g_i(x) + f_i(x),\ g'_i(x) + f'_i(x)\big)\, dP(x)$$
$$\le \int_X \frac{1}{k} \sum_{i=1}^{k} \Big( d_{l^1}(g_i(x), g'_i(x)) + d_{l^1}(f_i(x), f'_i(x)) \Big)\, dP(x)$$
$$= d_{L^1(P, \bar{d}_{l^1})}(g, g') + \frac{1}{k} \sum_{i=1}^{k} d_{L^1(P, d_{l^1})}(f_i, f'_i) < \epsilon/2 + \epsilon/2 = \epsilon$$
Thus, the size of an $\epsilon$-cover for $\mathcal{G} \oplus \times_i \mathcal{F}_i$ is bounded by $|V| \prod_i |U_i|$:
$$\mathcal{N}(\epsilon, \mathcal{G} \oplus \times_i \mathcal{F}_i, d_{L^1(P, \bar{d}_{l^1})}) < |V| \prod_i |U_i| = \mathcal{N}(\epsilon/2, \mathcal{G}, d_{L^1(P, \bar{d}_{l^1})}) \prod_{i=1}^{k} \mathcal{N}(\epsilon/2, \mathcal{F}_i, d_{L^1(P, d_{l^1})})$$
Taking the sup over probability distributions $P$ on both sides of the above inequality, we get our desired result about capacity.

Next, we need to bound the capacities of $\mathcal{G}$ and $\mathcal{F}_i$ for our case. We assume the range of all these functions ($g_i$, $f_i$) to be $[-M/2, M/2]$ (so that their sum $h_i$ lies in $[-M, M]$). We can obtain sharp bounds on the capacities of the $\mathcal{G}$ and $\mathcal{F}_i$ decomposed from $\mathcal{H}$, and hence sharp bounds on the overall capacity.

LEMMA 2. $C(\epsilon, \mathcal{G}, \bar{d}_{l^1}) \le M/\epsilon$ and $C(\epsilon, \mathcal{F}_i, d_{l^1}) \le M/\epsilon$.

PROOF SKETCH. First, note that $x_{iT} = x_i - x_T$ lies in $[-1, 1]$ due to the constraints on $x_i, x_T$. Then, for any two functions $g, g' \in \mathcal{G}$ we can prove the following result: $d_{L^1(P, \bar{d}_{l^1})}(g, g') \le |w - w'|$ (details in the Appendix). Also, note that since the range of any $g_i = w(x_i - x_T)$ is $[-M/2, M/2]$ and $x_i - x_T$ lies in $[-1, 1]$, we can claim that $w$ lies in $[-M/2, M/2]$. Thus, given that the distance between functions is bounded by the difference in weights, it is enough to divide the length-$M$ range of the weights into intervals of size $\epsilon$ and consider the functions at the boundaries. Hence the $\epsilon$-cover has at most $M/\epsilon$ functions.

The proof for the constant-valued functions $\mathcal{F}_i$ is similar, since it is straightforward to see that the distance between two functions in this space is the difference in the constant outputs. Also, the constants lie in $[-M/2, M/2]$; the argument is then the same as in the $\mathcal{G}$ case.

Then, plugging the result of Lemma 2 (substituting $\epsilon/2$ for $\epsilon$) into Lemma 1, we obtain $C(\epsilon, \mathcal{H}, \bar{d}_{l^1}) < (2M/\epsilon)^T$.
Having bounded $C(\epsilon, \mathcal{H}, \bar{d}_{l^1})$, we use Theorem 1 to obtain:

THEOREM 2. The generalized SUQR parametric hypothesis class $\mathcal{H}$ is $(\alpha, \delta)$-PAC learnable with sample complexity
$$O\left( \frac{M^2}{\alpha^2} \left( \log\frac{1}{\delta} + T \log\frac{T}{\alpha} \right) \right)$$

The above result shows a modest $T \log T$ growth of sample complexity with an increasing number of targets, suggesting that the parametric approach can avoid overfitting limited data as the number of targets grows; however, the simplicity of the functions captured by this approach (compared to NPL) results in lower accuracy with increasing data, as shown later in our experiments on real-world data.

Recall that $\mathcal{H}$ for the NPL case is defined such that each component function $h_i$ is $L$-Lipschitz with $L \le \hat{K}$. Consider the function spaces $\mathcal{H}_i$ consisting of real-valued $L$-Lipschitz functions with $L \le \hat{K}$. Then $\mathcal{H} = \times_i \mathcal{H}_i$, and using Equation 2: $C(\epsilon, \mathcal{H}, \bar{d}_{l^1}) \le \prod_{i=1}^{T-1} C(\epsilon, \mathcal{H}_i, d_{l^1})$.

Next, our task is to bound $C(\epsilon, \mathcal{H}_i, d_{l^1})$. Consider the sup-distance metric between real-valued functions: $d_{l^\infty}(h_i, h'_i) = \sup_X |h_i(x) - h'_i(x)|$ for $h_i, h'_i \in \mathcal{H}_i$.
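The key fact used next, that the expected pointwise distance under any distribution $P$ never exceeds the sup-distance, is easy to sanity-check numerically. This is a toy check on a discretized 1-D instance space with two made-up smooth (hence Lipschitz) functions and an arbitrary random distribution:

```python
import numpy as np

xs = np.linspace(0, 1, 1001)           # discretized instance space X
h1 = np.sin(2 * xs)                    # two made-up Lipschitz functions
h2 = 0.5 * xs

gaps = np.abs(h1 - h2)                 # pointwise distances d(h1(x), h2(x))
d_sup = gaps.max()                     # the sup-distance d_{l^inf}(h1, h2)

rng = np.random.default_rng(7)
weights = rng.dirichlet(np.ones(len(xs)))   # an arbitrary distribution P on X
d_exp = np.dot(weights, gaps)               # E_P[ d(h1(x), h2(x)) ]

assert d_exp <= d_sup + 1e-12          # expected distance <= sup distance
```

Since the inequality holds for every $P$, an $\epsilon$-cover under $d_{l^\infty}$ is automatically an $\epsilon$-cover under every $d_{L^1(P, d_{l^1})}$, which is exactly how the distribution-free capacity gets bounded by the $d_{l^\infty}$ covering number below.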
Note that d_{l∞} is independent of any probability distribution P, and for all functions h_i, h′_i and any P, d_{L(P,d_l)}(h_i, h′_i) ≤ d_{l∞}(h_i, h′_i). Thus, we can infer [13] that for all P, N(ε, H_i, d_{L(P,d_l)}) ≤ N(ε, H_i, d_{l∞}), and then, taking the sup over P (recall C(ε, H_i, d_l) = sup_P {N(ε, H_i, d_{L(P,d_l)})}), we get

C(ε, H_i, d_l) ≤ N(ε, H_i, d_{l∞})    (3)

We bound N(ε, H_i, d_{l∞}) in terms of the covering number for X (recall X = {x | x ∈ [0,1]^T, ∑_i x_i ≤ K}) using results from [31].

LEMMA 3. N(ε, H_i, d_{l∞}) ≤ (⌈M/ε⌉ + 1) · 2^{N(ε/(2K̂), X, d_{l∞})}.

To use the above result, we still need to bound N(ε, X, d_{l∞}). We do so by combining two remarkable results about the Eulerian numbers ⟨T, k⟩ [16] (k has to be integral).

• Laplace [8, 28] discovered that the volume of X_k = {x | x ∈ [0,1]^T, k−1 ≤ ∑_i x_i ≤ k} is ⟨T, k⟩/T!. Thus, if X_K = ∪_{k=1}^{K} X_k, then vol(X_K) = ∑_{k=1}^{K} vol(X_k) = ∑_{k=1}^{K} ⟨T, k⟩/T!.

• Also, it is known [30] that ⟨T, k⟩/T! = F_T(k) − F_T(k−1), where F_T(x) is the CDF of the probability distribution of S_T = U_1 + ... + U_T and each U_i is a uniform random variable on [0, 1].

Combining these results, vol(X_{K+1}) = F_T(K+1). The volume of an l∞ ball of radius ε (an l∞ ball is a hypercube) is (2ε)^T [33]. Then, the number of balls that fit tightly (aligned with the axes) and completely inside X_{K+1} is bounded by F_T(K+1)/(2ε)^T. Since ε << 1, these balls cover X_K = X completely, and the tight packing ensures that the centers of the balls form an ε-cover for X. Then, bounding F_T(K+1) using Bernstein's inequality about the concentration of random variables, we get:

LEMMA 4. For K + 1 ≤ 0.5T (recall K << T),

N(ε, X, d_{l∞}) ≤ e^{−T(0.5 − (K+1)/T)² / (1 − (K+1)/T)} / (2ε)^T

Plugging the above result into Lemma 3 and then using that in Equation 3, we bound C(ε, H_i, d_l). Finally, Equation 2 gives a bound on C(ε, H, d̄_l) that we use in Theorem 1 to obtain:

THEOREM 3. The non-parametric hypothesis class H is (α, δ)-PAC learnable with sample complexity O((1/α²)(log(1/δ) + T^{T+1}/α^T)).

In the Appendix, we show that for standard SUQR (simpler than our generalized SUQR) the sample size is O((1/α²)(log(1/δ) + log(T/α))).

The above result shows that the sample complexity for NPL grows quickly with T, suggesting that NPL may not be the right approach when the number of targets is large.
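The two volume results combined above can be cross-checked numerically: the Irwin-Hall CDF F_T (the distribution of a sum of T uniforms) satisfies F_T(k) − F_T(k−1) = ⟨T, k⟩/T!. A sketch, assuming the zero-indexed ascent-counting convention for Eulerian numbers in the code (so the slab between k−1 and k corresponds to `eulerian(T, k-1)`); function names are ours:

```python
from math import comb, factorial

def irwin_hall_cdf(x, T):
    """CDF of the sum of T independent Uniform[0,1] variables."""
    return sum((-1) ** j * comb(T, j) * (x - j) ** T
               for j in range(int(x) + 1)) / factorial(T)

def eulerian(n, k):
    """Eulerian number: permutations of n elements with k ascents."""
    if k < 0 or k >= n:
        return 0
    if n == 1:
        return 1  # only k = 0 reaches here
    return (k + 1) * eulerian(n - 1, k) + (n - k) * eulerian(n - 1, k - 1)
```

For instance, the volume of the slab 1 ≤ ∑ x_i ≤ 2 of the unit cube in T = 4 dimensions is 11/24, matching both formulas.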
6. LEARNING ALGORITHM
As stated earlier, our loss function was designed so that the learning algorithm (the empirical risk minimizer in the PAC framework) is the same as maximizing the log likelihood of the data. Indeed, for SUQR, the standard MLE approach can be used to learn the parameters (weights) and has been used in the literature [25]. However, for NPL, which has no parameters, maximizing likelihood only provides h(x) for those mixed strategies x that are in the training data. Hence we present a novel two-step learning algorithm for the NPL case. In the first step, we estimate the most likely value of h_i(x) (for each i) for each x in the training data, ensuring that for any pair x, x′ in the training data, |h_i(x) − h_i(x′)| ≤ K̂ ||x − x′||. In the second step, we construct the function h_i with the least Lipschitz constant subject to the constraint that h_i takes the values for the training data output by the first step.

More formally, assume the training data has s unique values of x and let these values be x_1, ..., x_s. Further, let there be n_j distinct data points against x_j, i.e., n_j attacks against mixed strategy x_j. Denote by n_{j,i} the number of attacks on each target i when x_j was used. Let h_{ij} be the variable that stands for the estimate of the value h_i(x_j); i ∈ {1, ..., T}, j ∈ {1, ..., s}. Fix h_{Tj} = 0 for all j. Then, the probability of attack on target i against mixed strategy x_j is given by q_{ij} = e^{h_{ij}} / ∑_i e^{h_{ij}}. Thus, the log likelihood of the training data is ∑_{j=1}^{s} ∑_{i=1}^{T} n_{j,i} log q_{ij}. Let Lip(K̂) denote the set of L-Lipschitz functions with L ≤ K̂.
Using our assumption that h_i ∈ Lip(K̂), the following optimization problem provides the most likely h_{ij}:

max_{h_{ij}}  ∑_{j=1}^{s} ∑_{i=1}^{T} n_{j,i} log( e^{h_{ij}} / ∑_i e^{h_{ij}} )
subject to  ∀ i, j, j′:  |h_{ij} − h_{ij′}| ≤ K̂ ||x_j − x_{j′}||
            ∀ i, j:  −M/2 ≤ h_{ij} ≤ M/2

Given the solution h*_{ij} to the above problem, we wish to construct the solution h_i such that its Lipschitz constant (denoted K_{h_i}) is the lowest possible subject to h_i taking the value h*_{ij} at x_j. Such a construction provides the most smoothly varying solution given the training data, i.e., we do not assume any sharper changes in the adversary response than what the training data provides:

min_{h_i ∈ Lip(K̂)}  K_{h_i}  subject to  ∀ i, j:  h_i(x_j) = h*_{ij}    (MinLip)

The above optimization is impractical to solve computationally, as uncountably many constraints are required to relate K_{h_i} to h_i. Fortunately, we obtain an analytical solution:

LEMMA 5. The following is a solution to problem MinLip:

h_i(x) = min_j { h*_{ij} + K*_i ||x − x_j|| },  where  K*_i = max_{j ≠ j′} |h*_{ij} − h*_{ij′}| / ||x_j − x_{j′}||

PROOF SKETCH. Observe that, due to the definition of K*_i, any solution to MinLip will have Lipschitz constant ≥ K*_i. Thus, to prove that h_i is a solution of MinLip it suffices to show that the Lipschitz constant of h_i is K*_i, which we show in the Appendix.

Note that for any point x_j in the training data we have h_i(x_j) = h*_{ij}. The value of h_i(x) for an x not in the training set and close to x_j is then h_i(x_j) plus at most the scaled distance K*_i ||x − x_j||, showing that the value at x is influenced by nearby training points.
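The closed form in Lemma 5 is a McShane-style inf-convolution extension and is straightforward to implement once the first step has produced the anchor values; a sketch with hypothetical names (`xs` holds the training strategies x_j as tuples, `values` holds h*_{ij} for one fixed target i):

```python
import math

def min_lipschitz_extension(xs, values):
    """Given anchors xs[j] with values[j] = h*_ij, return (h, K_star):
    h(x) = min_j {h*_ij + K* ||x - x_j||}, the extension with the
    smallest Lipschitz constant consistent with the anchors."""
    dist = math.dist  # Euclidean distance between points
    K_star = max(abs(values[j] - values[k]) / dist(xs[j], xs[k])
                 for j in range(len(xs)) for k in range(j + 1, len(xs)))

    def h(x):
        return min(v + K_star * dist(x, xj) for xj, v in zip(xs, values))

    return h, K_star

xs = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
vals = [0.0, 2.0, 1.0]
h, K = min_lipschitz_extension(xs, vals)
```

The returned h matches the anchors exactly (the j-th term attains the min at x = x_j) and is K*-Lipschitz by construction.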
7. UTILITY BOUNDS
Next, we bound the difference between the optimal utility and the utility derived from planning using the learned h. The utility bound is the same for the parametric and NPL cases. Recall that the defender receives the utility xᵀUq_p(x) when playing strategy x. We need to bound the difference between the true distribution q_p(x) and the predicted distribution q_h(x) of attacks in order to analyze bounds on utility. Thus, we transform the PAC learning guarantee about the risk of the output h into a bound on ||q_p(x) − q_h(x)||_1. As the PAC guarantee only bounds the risk between the predicted h and the best hypothesis h* in H, in order to relate the true distribution q_p and the predicted distribution q_h, the lemma below assumes a bounded KL divergence between the distribution q_{h*} of the best hypothesis and the true distribution q_p.

LEMMA 6. Assume E[KL(q_p(x) || q_{h*}(x))] ≤ ε*. Given an ERM A with output h = A(z⃗) and guarantee Pr(|r_h(p) − r_{h*}(p)| < α) > 1 − δ, with probability ≥ 1 − δ over the training samples z⃗ we have

Pr( ||q_p(x) − q_h(x)||_1 ≤ √(2∆) ) ≥ 1 − ∆,  where  ∆ = (α + ε*)^{1/2}

and x is sampled using density p.

Utility bound:
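The conversion in Lemma 6 from a KL-divergence bound to an l1 bound on the attack distributions rests on Pinsker's inequality, ||p − q||_1 ≤ √(2 KL(p||q)). A quick numerical illustration (distributions chosen arbitrarily):

```python
import math

def kl(p, q):
    """KL divergence between two discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def l1(p, q):
    """l1 (total variation x 2) distance between distributions."""
    return sum(abs(pi - qi) for pi, qi in zip(p, q))

p = [0.7, 0.2, 0.1]
q = [0.5, 0.3, 0.2]
assert l1(p, q) <= math.sqrt(2 * kl(p, q))  # Pinsker's inequality
```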
Next, we provide a utility bound, given the above guarantee about the learned h. Let the optimal strategy computed using the learned adversary model h be x̃, i.e., x̃ᵀUq_h(x̃) ≥ x′ᵀUq_h(x′) for all x′. Let the true optimal defender mixed strategy be x* (optimal w.r.t. the true attack distribution q_p(x)), so that the maximum defender utility is x*ᵀUq_p(x*). Let B(x, ε) denote the l1 ball of radius ε around x. We make the following assumptions:

1. h_i is K̂-Lipschitz for all i, and q_p is K-Lipschitz in the l1 norm.
2. There exists a small ε such that Pr(x ∈ B(x*, ε)) > ∆ over the choice of x using p.
3. There exists a small ε such that Pr(x ∈ B(x̃, ε)) > ∆ over the choice of x using p.

While the first assumption is mild (Lipschitzness is a mild restriction on function classes), the last two assumptions for small ε mean that the points x* and x̃ must not lie in low-density regions of the distribution p used to sample the data points. In other words, there should be many defender mixed strategies in the data of defender-adversary interactions that lie near x* and x̃. We discuss the assumptions in detail after the technical results below. Given these assumptions, we need Lemma 7, which relates assumption (1) to the Lipschitzness of q_h, in order to obtain the utility bound.

LEMMA 7. If h_i is K̂-Lipschitz, then for all x, x′ ∈ X, ||q_h(x) − q_h(x′)||_1 ≤ 2K̂ ||x − x′||, i.e., q_h(x) is 2K̂-Lipschitz.

Then, we can prove the following:

THEOREM 4. Given the above assumptions and the results of Lemmas 6 and 7, with probability ≥ 1 − δ over the training samples, the expected utility x̃ᵀUq_h(x̃) for the learned h is at least

x*ᵀUq_p(x*) − (K + 1)ε − 2√(2∆) − Kε

Discussion of assumptions: A puzzling phenomenon observed in recent work on learning in SSGs is that good prediction accuracy of the learned adversary behavior is not a reliable indicator of the defender's performance in practice [10]. The additional assumptions, over and above the PAC learning guarantee, are made to bound the utility deviation from the optimal; they point towards the possibility of such occurrences. Recall that the second assumption requires the existence of many defender mixed strategies in the dataset near the utility-optimal strategy x*. Of course, x* is not known a priori; hence, in order to guarantee utility close to the highest possible utility, the dataset must contain defender mixed strategies from all regions of the mixed-strategy space; or, at least, if it is known that some regions of the mixed-strategy space dominate others in terms of utility, then it is enough to have mixed strategies from those regions. Thus, following our assumption, better utility can be achieved by collecting attack data against a variety of mixed strategies rather than many attacks against few mixed strategies.

Going further, we illustrate with a somewhat extreme example how violating our assumptions can lead to this undesirable phenomenon. For the purpose of illustration, consider the extreme case where the probability distribution p (recall that data points are sampled using p) puts all probability mass on a single strategy x, where the utility for x is much lower than for x*.
Hence, the dataset will contain only one defender mixed strategy x (with many attacks against it). Due to Lipschitzness (assumption 1), the large utility difference between x and x* implies that x is not close to x*, which in turn violates assumption 2. This example can still provide a very good PAC guarantee, since the learning algorithm need not predict accurately for any other mixed strategy (all others occur with zero probability) in order to have good prediction accuracy. The learning technique needs to predict well only for x to achieve low α, δ. As a result, the defender strategy computed against the learned adversary model may not be utility maximizing, because of the poor predictions for all defender mixed strategies other than the low-utility-yielding x. More generally, good prediction accuracy can be achieved by making good predictions only for the mixed strategies that occur with high probability.

Indeed, in general, prediction accuracy in the PAC model (and in any applied machine learning approach) is not a reliable indicator of good prediction over the entire space of defender mixed strategies unless, following our assumption 2, the dataset has attacks against strategies from all parts of the mixed-strategy space. However, in past work [1, 7] researchers have focused on gathering a lot of attack data but on a limited number of defender strategies. We believe that our analysis, in addition to providing a principled explanation of prior observations, provides guidance towards methods of discovering the defender's utility-maximizing strategy.
8. EXPERIMENTAL RESULTS
We show experimental results on two datasets: (i) real-world poaching data from QENP (obtained from [24]); and (ii) data from human subjects experiments on AMT (obtained from [14]), to estimate prediction errors and the amount of data required to reduce the error for both the parametric and NPL learning settings. Also, we compare the NPL approach with both the standard and the generalized SUQR approaches and show that: (i) the NPL approach, while computationally slow, outperforms the standard SUQR model on the Uganda data; and (ii) the performance of generalized SUQR lies between NPL and standard SUQR.

For each dataset, we conduct four experiments with 25%, 50%, 75% and 100% of the original data. We create 100 train-test splits in each of the four experiments per dataset. For each train-test split we compute the average prediction error α (the average difference between the log-likelihoods of the attacks in the test data under the predicted and actual attack probabilities). We report the α value at the 1 − δ percentile of the 100 α values; e.g., a reported α = 2.81 for δ = 0.1 means that 90 of the 100 test splits have α < 2.81.

[Figure 1: Results on the Uganda and AMT datasets for both the parametric and NPL learning settings. Panels: (a) SUQR, Uganda data, coarse-grained prediction; (b) NPL, Uganda data, coarse-grained prediction; (c) SUQR, Uganda data, fine-grained prediction; (d) NPL, Uganda data, fine-grained prediction; (e) generalized SUQR, Uganda data, fine-grained prediction; (f) generalized SUQR, Uganda data, coarse-grained prediction; (g) parametric, AMT data; (h) NPL, AMT data.]

We first present the results of our experiments with real-world poaching data. The dataset obtained contained information about features such as ranger patrols and animal densities, which are used as features in our SUQR model, and the poachers' attacks, with 40,611 total observations recorded by rangers at various locations in the park.
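The (α, δ) value reported above can be computed directly from the per-split errors; a minimal sketch (the function name is ours):

```python
def reported_alpha(alphas, delta):
    """alphas: average prediction errors, one per train-test split.
    Returns the error value such that a (1 - delta) fraction of the
    splits fall below it (the paper's reported (alpha, delta) metric)."""
    ordered = sorted(alphas)
    idx = min(len(ordered) - 1, int((1 - delta) * len(ordered)))
    return ordered[idx]

# e.g. with 100 splits and delta = 0.1, at least 90 of the splits
# have error below the reported value
alphas = [float(i) for i in range(1, 101)]
r = reported_alpha(alphas, 0.1)
```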
The park area was discretized into 2423 grid cells, each corresponding to a 1 sq. km area within the park. After discretization, each observation fell within one of the 2423 target cells, and the animal densities and the number of poaching attacks within each target cell were then aggregated. The dataset contained 655 poachers' attacks in response to the defender's strategy for 2012 at QENP. Although the data is reliable because the rangers recorded the latitude and longitude of each observation using a GPS device, it is important to note that this dataset is extremely noisy because of: (i) missing observations: not all poaching events are recorded, because the limited number of rangers cannot patrol all areas of the park all the time; (ii) uncertain feature values: the animal density feature is also based on incomplete observations of animals; and (iii) uncertain defender strategy: the actual defender mixed strategy is unknown, and hence we estimate the mixed strategies based on the provided patrol data.

In this paper, we provide two types of predictions in our experiments: (i) fine-grained and (ii) coarse-grained. First, to provide a baseline for our error measures, we use the same coarse-grained prediction approach as reported by Nguyen et al. [24], in which the authors only predict whether a target will be attacked or not. The results for coarse-grained predictions with our performance metric (α values for different δ) are shown in Figs. 1(a), 1(b) and 1(f). Next, in the fine-grained prediction approach, we predict the actual number of attacks on each target in our test set; these results are shown in Figs. 1(c), 1(d) and 1(e). In [24], the authors used a particular metric for prediction performance called the area under the ROC curve (AUC), which we discuss later in this section.

From our fine-grained and coarse-grained prediction approaches, we make several important observations.
First, we observe that α decreases with increasing sample size at a rate proportional to 1/√m, where m is the number of samples. For example, in Fig. 1(a), the α values corresponding to δ = 0.1 (black bars) for the 25%, 50%, 75% and 100% data are 2.81, 2.38, 2.18 and 1.8 respectively, which fit a 1/√m curve with goodness-of-fit r = 0.97. This observation supports the relationship between α and m shown in Theorem 3 and can be used to approximately infer the number of samples required to reduce the prediction error to a certain value. For example, assuming we collect the same number of samples (655) per year, to reduce α from 1.8 to 1.64 we would require two more years of data. Note that α is on a log scale, and hence the decrease is significant. It is also worth noting that, for a random classifier, we observe an α value above 6 for δ = 0.1 when performing coarse-grained predictions with 25% data. This is more than the α = 2.81 for δ = 0.1 obtained for our standardized SUQR model when performing coarse-grained predictions with 25% data. As α is on the log scale, the increase in error is actually more than two-fold.

Our second observation is that α values for fine-grained predictions (e.g., 2.9 for δ = 0.1 and 100% data for the standardized SUQR model in Fig. 1(c)) are understandably higher than the corresponding values for coarse-grained predictions (1.8 for δ = 0.1 and 100% data for SUQR in Fig. 1(a)), because in the fine-grained case we predict the exact number of attacks.

Third, we observe that the performance of the generalized SUQR model (e.g., 2.47 for δ = 0.1 and 100% data in Fig. 1(e)) is better in most cases than that of the standardized SUQR approach (2.9 for δ = 0.1 and 100% data in Fig. 1(c)), but worse than the NPL approach (2.15 for δ = 0.1 and 100% data in Fig. 1(d)).

Finally, we observe that our NPL model performs better than its parametric counterparts in predicting future poaching attacks in the fine-grained case (see the example in the previous paragraph), indicating that the true adversary behavior model may indeed be more complicated than what can be captured by SUQR.

Relation to previous work:
In earlier work, Nguyen et al. [24] use the area under the ROC curve (AUC) metric to demonstrate the performance of their approaches. The AUC value of 0.73 reported in their paper is an alternate view of our (α, δ) metric for the coarse-grained prediction approach. While earlier papers have measured prediction performance with the AUC metric, in this paper we have shown new trends and insights with the (α, δ) metric through analysis from the PAC model perspective, which is missing in earlier work. We show: (i) sample complexity results and the relationship between an increasing number of samples and the reduction in prediction error for each of our models; (ii) the differences in errors when learning a vector-valued response function (fine-grained prediction) as opposed to classifying targets as attacked or not (coarse-grained prediction); and (iii) a comparison of the performance of our new NPL model with other parametric approaches in terms of both fine-grained and coarse-grained predictions, and its effectiveness on real-world poaching data, which was not shown in previous work.

Table 2: Runtime results (in secs.) for one train-test split

Uganda parametric | Uganda NPL | AMT parametric | AMT NPL
0.7188            | 121.24     | 0.91           | 123.4
Here we show fine-grained prediction results on real-world AMT data obtained from [14] to demonstrate the performance of both our approaches on somewhat cleaner data. This dataset is cleaner than the Uganda data because: (i) all attacks are observed, and (ii) animal densities and deployed defender strategies are known. The dataset consisted of 16 unique mixed strategies, with an average of 40 attack data points per mixed strategy. Each attack was conducted by a unique individual recruited on AMT. We used attack data corresponding to 11 randomly chosen mixed strategies for training and the data for the remaining mixed strategies for testing. Results are shown in Figs. 1(g) and 1(h). We observe that: (i) α values in this case are lower compared to the Uganda data, as the AMT data is cleaner; and (ii) the NPL model's performance on this dataset is poor compared to SUQR due to (a) the low number of samples in the AMT data, and (b) real-world poacher behavior possibly being more complicated than that of AMT participants, so that SUQR was in this case able to better capture the AMT participants' behavior with a limited number of samples.

Runtime:
Running Matlab R2015a on an Intel Core i7-5500U CPU @ 2.40GHz with 8GB RAM and 64-bit Windows 10, on average the NPL computation takes considerably longer than the parametric setting, as shown in Table 2.
9. CONCLUSION
Over the last couple of years, a lot of work has used learning methods to learn bounded-rationality adversary behavior models, but there has been no formal study of the learning process and its implications for the defender's performance. The lack of formal analysis also means that many practical questions go unanswered. We have advanced the state of the art in the learning of adversary behaviors in SSGs, in terms of both the analysis of such learning and the implications of the learned behaviors for the defender's performance.

While we used the PAC framework, ours is not an out-of-the-box application of it: we needed innovative techniques to obtain sharp bounds for our case. Furthermore, we provided a new non-parametric learning approach that showed promising results with real-world data. We also provided a principled explanation of the observed phenomenon that prediction accuracy is not enough to guarantee good defender performance, explaining why datasets with attacks against a variety of defender mixed strategies help in achieving good defender performance. In the Appendix we show that, on simulated data, α does indeed approach zero and NPL outperforms SUQR given enough samples. Finally, we hope this work leads to more theoretical work on the learning of adversary models and the use of non-parametric models in the real world.

REFERENCES

[1] Y. D. Abbasi, M. Short, A. Sinha, N. Sintov, C. Zhang, and M. Tambe. Human adversaries in opportunistic crime security games: Evaluating competing bounded rationality models. In Conference on Advances in Cognitive Systems, 2015.
[2] M. Anthony and P. L. Bartlett. Neural Network Learning: Theoretical Foundations. Cambridge University Press, New York, NY, USA, 1st edition, 2009.
[3] M.-F. Balcan, A. Blum, N. Haghtalab, and A. D. Procaccia. Commitment without regrets: Online learning in Stackelberg security games. In Proceedings of the Sixteenth ACM Conference on Economics and Computation, EC '15, 2015.
[4] M.-F. Balcan, A. D. Procaccia, and Y. Zick. Learning cooperative games. In Proceedings of the International Joint Conference on Artificial Intelligence, IJCAI '15, 2015.
[5] N. Basilico, N. Gatti, and F. Amigoni. Leader-follower strategies for robotic patrolling in environments with arbitrary topologies. In Proceedings of The 8th International Conference on Autonomous Agents and Multiagent Systems - Volume 1, pages 57-64, 2009.
[6] A. Blum, N. Haghtalab, and A. D. Procaccia. Learning optimal commitment to overcome insecurity. In Advances in Neural Information Processing Systems, pages 1826-1834, 2014.
[7] J. Cui and R. S. John. Empirical comparisons of descriptive multi-objective adversary models in Stackelberg security games. In Decision and Game Theory for Security, pages 309-318. Springer, 2014.
[8] M. de Laplace. Oeuvres complètes, 7:257, 1886.
[9] F. Fang, T. H. Nguyen, B. An, M. Tambe, R. Pickles, W. Y. Lam, and G. R. Clements. Towards addressing challenges in green security games in the wild (extended abstract). In Workshop of Behavioral, Economic and Computational Intelligence for Security (BECIS) held at IJCAI, 2015.
[10] B. Ford, T. Nguyen, M. Tambe, N. Sintov, and F. D. Fave. Beware the soothsayer: From attack prediction accuracy to predictive reliability in security games. In Conference on Decision and Game Theory for Security, 2015.
[11] J. Gan, B. An, and Y. Vorobeychik. Security games with protection externalities. In Twenty-Ninth AAAI Conference on Artificial Intelligence, 2015.
[12] W. Haskell, D. Kar, F. Fang, M. Tambe, S. Cheung, and L. E. Denicola. Robust protection of fisheries with COmPASS. In Innovative Applications of Artificial Intelligence (IAAI), 2014.
[13] D. Haussler. Decision theoretic generalizations of the PAC model for neural net and other learning applications. Inf. Comput., 100(1):78-150, Sept. 1992.
[14] D. Kar, F. Fang, F. D. Fave, N. Sintov, and M. Tambe. "A Game of Thrones": When human behavior models compete in repeated Stackelberg security games. In International Conference on Autonomous Agents and Multiagent Systems (AAMAS), 2015.
[15] M. J. Kearns and U. V. Vazirani. An Introduction to Computational Learning Theory. MIT Press, 1994.
[16] D. E. Knuth. The Art of Computer Programming: Sorting and Searching, volume 3. Pearson Education, 1998.
[17] D. Korzhyk, V. Conitzer, and R. Parr. Complexity of computing optimal Stackelberg strategies in security resource allocation games. In Proceedings of the National Conference on Artificial Intelligence (AAAI), pages 805-810, 2010.
[18] S. Kullback. Information Theory and Statistics. John Wiley and Sons, 1959.
[19] J. Letchford, V. Conitzer, and K. Munagala. Learning and approximating the optimal strategy to commit to. In Algorithmic Game Theory, pages 250-262. Springer, 2009.
[20] U. v. Luxburg and O. Bousquet. Distance-based classification with Lipschitz functions. The Journal of Machine Learning Research, 5:669-695, 2004.
[21] J. Marecki, G. Tesauro, and R. Segal. Playing repeated Stackelberg games with unknown opponents. In Proceedings of the 11th International Conference on Autonomous Agents and Multiagent Systems - Volume 2, pages 821-828, 2012.
[22] D. McFadden. Conditional logit analysis of qualitative choice behavior. Frontiers in Econometrics, pages 105-142, 1973.
[23] D. McFadden. Quantal choice analysis: A survey. Annals of Economic and Social Measurement, 5(4):363-390, 1976.
[24] T. H. Nguyen, F. M. D. Fave, D. Kar, A. S. Lakshminarayanan, A. Yadav, M. Tambe, N. Agmon, A. J. Plumptre, M. Driciru, F. Wanyama, and A. Rwetsiba. Making the most of our regrets: Regret-based solutions to handle payoff uncertainty and elicitation in green security games. In Conference on Decision and Game Theory for Security, 2015.
[25] T. H. Nguyen, R. Yang, A. Azaria, S. Kraus, and M. Tambe. Analyzing the effectiveness of adversary modeling in security games. In Conf. on Artificial Intelligence (AAAI), 2013.
[26] P. Paruchuri, J. P. Pearce, J. Marecki, M. Tambe, F. Ordonez, and S. Kraus. Playing games for security: An efficient exact algorithm for solving Bayesian Stackelberg games. In Proceedings of the 7th International Joint Conference on Autonomous Agents and Multiagent Systems - Volume 2, AAMAS, pages 895-902, 2008.
[27] D. Pollard. Convergence of Stochastic Processes. Springer, 1984.
[28] R. Stanley. Eulerian partitions of a unit hypercube. Higher Combinatorics (M. Aigner, ed.), Reidel, Dordrecht/Boston, 49, 1977.
[29] M. Tambe. Security and Game Theory: Algorithms, Deployed Systems, Lessons Learned. Cambridge University Press, New York, NY, USA, 1st edition, 2011.
[30] S. Tanny. A probabilistic interpretation of Eulerian numbers. Duke Mathematical Journal, 40(4):717-722, 1973.
[31] V. Tikhomirov and A. Kolmogorov. ε-entropy and ε-capacity of sets in functional spaces. In Selected Works of A. N. Kolmogorov, pages 86-170. Springer, 1993.
[32] Y. Vorobeychik and B. Li. Optimal randomized classification in adversarial settings. In Proceedings of the 2014 International Conference on Autonomous Agents and Multi-agent Systems, pages 485-492, 2014.
[33] X. Wang. Volumes of generalized unit balls. Mathematics Magazine, pages 390-395, 2005.
[34] R. Yang, B. Ford, M. Tambe, and A. Lemieux. Adaptive resource allocation for wildlife protection against illegal poachers. In Proceedings of the 2014 International Conference on Autonomous Agents and Multi-agent Systems, AAMAS '14, 2014.
[35] C. Zhang, A. Sinha, and M. Tambe. Keeping pace with criminals: Designing patrol allocation against adaptive opportunistic criminals. In International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2015), 2015.
APPENDIX
The Appendix is structured as follows: Section A contains the missing proofs; Section B contains results on the applicability of our techniques to Stackelberg games; Section C contains results about the sample complexity of standard SUQR; Section D contains the weaker sample complexity bound for the generalized SUQR model derived using the approach of Haussler; and Section E contains additional experiments.
A. PROOFS
Proof of Theorem 1

PROOF. First, Haussler uses the following pseudo-metric ρ on A, defined using the loss function l: ρ(a, b) = max_{y∈Y} |l(y, a) − l(y, b)|. To start with, relying on Haussler's result, we show

Pr( ∀h ∈ H. |r̂_h(z⃗) − r_h(p)| < α ) ≥ 1 − 8 C(α/8, H, ρ) e^{−α²m/(64M²)}

Choose α = α′/(2M) and ν = 2M in Theorem 9 of [13]. Using property (3) (Section 2.2, [13]) of d_ν, we obtain |r − s| ≤ ε whenever d_ν(r, s) ≤ α′. Using this directly in Theorem 9 of Haussler [13], we obtain the desired result above.

Note the dependence of the above probability on m (the number of samples), and compare it to the first pre-condition in the PAC learning result. By equating δ/8 to C(α/8, H, ρ) e^{−α²m/(64M²)}, we derive the sample complexity as

m ≥ (64M²/α²) log( 8 C(α/8, H, ρ) / δ )

We wish to compute a bound on C(ε, H, ρ) in order to use the above result to obtain the sample complexity. First, we prove that ρ ≤ 2T d̄_l for the loss function we use. This result is used to bound C(ε, H, ρ), since it is readily verified from the definitions that C(ε, H, ρ) ≤ C(ε/(2T), H, d̄_l). Such a bound directly gives

m ≥ (64M²/α²) log( 8 C(α/(16T), H, d̄_l) / δ )

Below we prove that ρ ≤ 2T d̄_l.

LEMMA 8. Given the loss function defined above, we have ρ(a, b) ≤ 2 max_i |a_i − b_i| ≤ 2 ∑_i |a_i − b_i| ≤ 2T d̄_l(a, b).

PROOF. By definition,

ρ(a, b) = max_i | −a_i + b_i + log( (1 + ∑_{i=1}^{T−1} e^{a_i}) / (1 + ∑_{i=1}^{T−1} e^{b_i}) ) | ≤ max_i |a_i − b_i| + | log( (1 + ∑_{i=1}^{T−1} e^{a_i}) / (1 + ∑_{i=1}^{T−1} e^{b_i}) ) |

There are j and k such that max_r = e^{a_j}/e^{b_j} ≥ e^{a_i}/e^{b_i} for all i, and min_r = e^{a_k}/e^{b_k} ≤ e^{a_i}/e^{b_i} for all i. Thus,

log( (1 + min_r · t) / (1 + t) ) ≤ log( (1 + ∑_{i=1}^{T−1} e^{a_i}) / (1 + ∑_{i=1}^{T−1} e^{b_i}) ) ≤ log( (1 + max_r · t) / (1 + t) )

where t = ∑_{i=1}^{T−1} e^{b_i}. The greatest positive value of the RHS is log max_r ≤ |a_j − b_j|, and the least negative value possible for the LHS is log min_r ≥ −|a_k − b_k|.
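The key inequality in this proof, that the log-ratio term is bounded by max_i |a_i − b_i|, can be spot-checked numerically (a sketch; the "+1" terms correspond to the fixed h_T = 0 component):

```python
import math
import random

random.seed(0)

def log_ratio(a, b):
    """log((1 + sum e^{a_i}) / (1 + sum e^{b_i}))."""
    return math.log((1 + sum(map(math.exp, a))) /
                    (1 + sum(map(math.exp, b))))

# check |log_ratio(a, b)| <= max_i |a_i - b_i| on random vectors
for _ in range(1000):
    a = [random.uniform(-3, 3) for _ in range(5)]
    b = [random.uniform(-3, 3) for _ in range(5)]
    gap = max(abs(x - y) for x, y in zip(a, b))
    assert abs(log_ratio(a, b)) <= gap + 1e-12
```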
Thus,

| log( (1 + ∑_{i=1}^{T−1} e^{a_i}) / (1 + ∑_{i=1}^{T−1} e^{b_i}) ) | ≤ max_i |a_i − b_i|

Hence, we obtain ρ(a, b) = max_i |l(y_i, a) − l(y_i, b)| ≤ 2 max_i |a_i − b_i|, and the last inequality in the lemma statement is trivial.

Thus, using the above result, we get

m ≥ (64M²/α²) log( 8 C(α/(16T), H, d̄_l) / δ )

Proof of Lemma 2

PROOF. First, note that x_{iT} = x_i − x_T lies in [−1, 1] due to the constraints on x_i, x_T. Then, for any two functions g, g′ ∈ G we have:

d_{L(P, d̄_l)}(g, g′) = ∫_X (1/(T−1)) ∑_{i=1}^{T−1} d_l( w(x_i − x_T), w′(x_i − x_T) ) dP(x)
= ∫_X (1/(T−1)) ∑_{i=1}^{T−1} |(w − w′)(x_i − x_T)| dP(x)
≤ ∫_X (1/(T−1)) ∑_{i=1}^{T−1} |w − w′| dP(x) = |w − w′|

Also, note that since the range of any g = w(x_i − x_T) is [−M/2, M/2] and x_i − x_T lies in [−1, 1], we can claim that w lies in [−M/2, M/2]. Thus, given that the distance between functions is bounded by the difference in weights, it is enough to divide the length-M range of the weights into intervals of size ε and consider the functions at the interval boundaries. Hence the ε-cover has at most M/ε functions.

The proof for the constant-valued functions F_i is similar, since it is straightforward to see that the distance between two functions in this space is the difference in the constant outputs. Also, the constants lie in [−M/2, M/2]; then the argument is the same as in the G case.

Proof of Lemma 3

PROOF. First, the space of functions Ĥ = {h/K̂ | h ∈ H_i} is Lipschitz with Lipschitz constant ≤ 1, and |h(x)| ≤ M/(2K̂) for h ∈ Ĥ. Clearly N(ε, H_i, d_{l∞}) ≤ N(ε/K̂, Ĥ, d_{l∞}).
Using the following result from [31]: for any $1$-Lipschitz real-valued function space $\mathcal{H}$ with range bounded in absolute value by $M/\hat{K}$, any positive integer $s$ and any distance $d$,

$$N(\epsilon, \mathcal{H}, d_{l^\infty}) \leq \Big(\Big\lceil \frac{M(s+1)}{2\hat{K}\epsilon} \Big\rceil + 1\Big) \cdot (s+1)^{N(\frac{s\epsilon}{s+1}, X, d)}$$

Then, we get the bound on $N(\epsilon/\hat{K}, \hat{\mathcal{H}}, d_{l^\infty})$ by choosing $s = 1$ and $d = d_{l^\infty}$, and hence obtain the desired bound on $N(\epsilon, \mathcal{H}_i, d_{l^\infty})$.

Proof of Lemma 4

PROOF. For ease of notation, we do the proof with $k$ standing for $K + 1$. Let $Y_i = U_i - 0.5$; then $|Y_i| \leq 1/2$, $\mathbb{E}[Y_i] = 0$ and $S_T - 0.5T = \sum_i Y_i$. Using Bernstein's inequality with the fact that $\mathbb{E}[Y_i^2] = 1/12$,

$$P\Big(\sum_i Y_i = S_T - 0.5T \leq -t\Big) \leq e^{-\frac{0.5 t^2}{T/12 + t/6}}$$

Thus, $P(S_T \leq 0.5T - t) \leq e^{-\frac{0.5 t^2}{T/12 + t/6}}$. Take $k = 0.5T - t$, and hence $t = 0.5T - k = T(0.5 - k/T)$. Hence,

$$P(S_T \leq k) \leq e^{-\frac{3T(0.5 - k/T)^2}{1 - k/T}}$$

Proof of Theorem 3

PROOF. Given the results of Lemma 3, we get that the sample complexity is of order

$$\frac{1}{\alpha^2}\Big(\log\frac{1}{\delta} + T\, N\big(\alpha/T, X, d_l\big)\Big)$$

Now, using the result of Lemma 4, we get the required order in the Theorem. We wish to note that if $K/T$ is a constant then the $O(e^{-T})$ term in Lemma 4 gets swamped by the $T^T$ term. However, in practice, for fixed $T$, this term does provide a lower actual complexity bound than what is indicated by the order.

Proof of Lemma 5

PROOF. Observe that due to the definition of $K^*$ any solution to MinLip will have Lipschitz constant $\geq K^*$. Thus, it suffices to show that the Lipschitz constant of $h_i$ is $K^*$ in order to prove that $h_i$ is a solution of MinLip. Take any two $x, x'$. If the min in the expression for $h_i$ occurs at the same $j$ for both $x, x'$, then $|h_i(x) - h_i(x')|$ is given by $K^*\big|\, \|x - x_j\| - \|x' - x_j\| \,\big|$.
By application of the triangle inequality,

$$-\|x - x'\| \leq \|x - x_j\| - \|x' - x_j\| \leq \|x - x'\|$$

Thus, $|h_i(x) - h_i(x')| \leq K^*\|x - x'\|$.

For the other case, when the min for $x$ occurs at some $j$ and the min for $x'$ at some $j'$, we have the following: $h_i(x') = h_{ij'} + K^*\|x' - x_{j'}\|$ and $h_i(x) = h_{ij} + K^*\|x - x_j\|$. Also, due to the min, $h_i(x') \leq h_{ij} + K^*\|x' - x_j\| = h_i(x) + K^*\|x' - x_j\| - K^*\|x - x_j\|$. Thus, we get

$$h_i(x') - h_i(x) \leq K^*(\|x' - x_j\| - \|x - x_j\|) \leq K^*\|x' - x\|$$

Using the symmetric inequality for $x$ we get

$$h_i(x) - h_i(x') \leq K^*(\|x - x_{j'}\| - \|x' - x_{j'}\|) \leq K^*\|x - x'\|$$

Combining both, we can claim that $|h_i(x) - h_i(x')| \leq K^*\|x' - x\|$. Thus, we have proved that $h_i$ is $K^*$-Lipschitz, and hence a solution of
MinLip.

Proof of Lemma 6

PROOF. Let $p_X$ be the marginal of $p(x, y)$ on the space $X$. Define the expected entropy $\mathbb{E}[H(x)] = \int p_X(x) \sum_{i=1}^{T} q_i^p(x) \log q_i^p(x)\, dx$. Given the loss function, we know that $r_h(p) = -\int p(x, y) \sum_{i=1}^{T} I_{y = t_i} \log q_i^h(x)\, dx\, dy$. This is the same as $-\int p_X(x) \sum_{i=1}^{T} q_i^p(x) \log q_i^h(x)\, dx$. Thus, we have

$$\mathbb{E}[H(x)] + r_h(p) = \int p_X(x) \sum_{i=1}^{T} q_i^p(x) \log\frac{q_i^p(x)}{q_i^h(x)}\, dx$$

Hence, we obtain $\mathbb{E}[H(x)] + r_h(p) = \mathbb{E}[KL(q^p(x) \,\|\, q^h(x))]$. Hence, $|r_h(p) - r_{h^*}(p)|$ is equal to

$$\big|\mathbb{E}[KL(q^p(x) \,\|\, q^h(x))] - \mathbb{E}[KL(q^p(x) \,\|\, q^{h^*}(x))]\big|$$

Thus, from the assumptions, we get $\mathbb{E}[KL(q^p(x) \,\|\, q^h(x))] \leq \alpha + \epsilon^*$ with probability $\geq 1 - \delta$. Next, using Markov's inequality, with probability $\geq 1 - \delta$,

$$\Pr\big(KL(q^p(x) \,\|\, q^h(x)) \geq (\alpha + \epsilon^*)^{1/2}\big) \leq (\alpha + \epsilon^*)^{1/2}$$

That is, using the notation $\Delta = (\alpha + \epsilon^*)^{1/2}$: with probability $\geq 1 - \delta$,

$$\Pr\big(KL(q^p(x) \,\|\, q^h(x)) \leq \Delta\big) \geq 1 - \Delta$$

Using Pinsker's inequality we get $\frac{1}{2}\|q^p(x) - q^h(x)\|_1^2 \leq KL(q^p(x) \,\|\, q^h(x))$. That is, the event $KL(q^p(x) \,\|\, q^h(x)) \leq \Delta$ implies the event $\|q^p(x) - q^h(x)\|_1 \leq \sqrt{2\Delta}$. Thus, $\Pr(\|q^p(x) - q^h(x)\|_1 \leq \sqrt{2\Delta}) \geq \Pr(KL(q^p(x) \,\|\, q^h(x)) \leq \Delta)$. Thus, we obtain: with probability $\geq 1 - \delta$, $\Pr(\|q^p(x) - q^h(x)\|_1 \leq \sqrt{2\Delta}) \geq 1 - \Delta$.

Proof of Lemma 7

PROOF. We know that $q_i^h(x) = \frac{e^{h_i(x)}}{\sum_j e^{h_j(x)}}$ (assume $h_T(x) = 0$). Thus,

$$|q_i^h(x) - q_i^h(x')| = q_i^h(x')\, \Big| e^{h_i(x) - h_i(x')} \frac{\sum_j e^{h_j(x')}}{\sum_j e^{h_j(x)}} - 1 \Big|$$

Let $r$ denote $\frac{\sum_j e^{h_j(x')}}{\sum_j e^{h_j(x)}}$.
There are $l$ and $k$ such that $\max_r = \frac{e^{h_l(x')}}{e^{h_l(x)}} \geq \frac{e^{h_j(x')}}{e^{h_j(x)}}$ for all $j$, and $\min_r = \frac{e^{h_k(x')}}{e^{h_k(x)}} \leq \frac{e^{h_j(x')}}{e^{h_j(x)}}$ for all $j$. Then, $\min_r \leq r \leq \max_r$. First, note that due to our assumption that for each $i$, $|h_i(x') - h_i(x)| \leq \hat{K}\|x' - x\|$, we have

$$e^{-\hat{K}\|x' - x\|} \leq \min\nolimits_r \leq r \leq \max\nolimits_r \leq e^{\hat{K}\|x' - x\|}$$

Using the Lipschitzness we can also claim that $e^{-\hat{K}\|x' - x\|} \leq e^{h_i(x) - h_i(x')} \leq e^{\hat{K}\|x' - x\|}$. Thus,

$$e^{-2\hat{K}\|x' - x\|} \leq e^{h_i(x) - h_i(x')} \cdot r \leq e^{2\hat{K}\|x' - x\|}$$

Since $e^{-2\hat{K}\|x' - x\|} < 1$ and $e^{2\hat{K}\|x' - x\|} > 1$, we have

$$\big| e^{h_i(x) - h_i(x')} r - 1 \big| \leq \max\big( |e^{-2\hat{K}\|x' - x\|} - 1|,\ |e^{2\hat{K}\|x' - x\|} - 1| \big)$$

Also, it is a fact that $|e^y - 1| \leq 1.5|y|$ for $|y| \leq 1/2$. Thus, we obtain $|e^{h_i(x) - h_i(x')} r - 1| \leq 3\hat{K}\|x' - x\|$ for $2\hat{K}\|x' - x\| \leq 1/2$. Thus,

$$\|q^h(x') - q^h(x)\|_1 = \sum_i |q_i^h(x) - q_i^h(x')| = \sum_i q_i^h(x')\, \Big| e^{h_i(x) - h_i(x')} \frac{\sum_j e^{h_j(x')}}{\sum_j e^{h_j(x)}} - 1 \Big| \leq \Big(\sum_i q_i^h(x')\Big)\, 3\hat{K}\|x' - x\|$$

for $\hat{K}\|x' - x\| \leq 1/4$. Since $\sum_i q_i^h(x') = 1$, we have $\|q^h(x') - q^h(x)\|_1 \leq 3\hat{K}\|x' - x\|$ for $\|x' - x\| \leq 1/4\hat{K}$. In other words, $q^h$ is locally $3\hat{K}$-Lipschitz on every $l_1$ norm ball of size $1/4\hat{K}$. The following allows us to prove global Lipschitzness.

LEMMA. Any locally $L$-Lipschitz function $f$ (locally for every $l_p$ ball of size $\delta$) on a compact convex set $X \subset \mathbb{R}^n$ is Lipschitz on the set $X$. The Lipschitz constant is also $L$.

PROOF. Take any two points $x, y \in X$; the straight line joining $x, y$ lies in $X$ (as $X$ is convex). Also, a finite number of balls of size $\delta$ cover $X$ (due to compactness).
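As a numerical aside, the local $3\hat{K}$-Lipschitz bound on $q^h$ can be checked empirically. The following is a minimal sketch with an illustrative linear choice of $h_i$ (all names and values here are made up for the check, not taken from any deployed model):

```python
import math, random

def softmax(scores):
    # q_i = e^{h_i} / sum_j e^{h_j}, numerically stabilized
    mx = max(scores)
    exps = [math.exp(s - mx) for s in scores]
    tot = sum(exps)
    return [e / tot for e in exps]

random.seed(0)
T, dim = 5, 3
# h_i(x) = <w_i, x> with ||w_i||_inf <= Khat, hence each h_i is
# Khat-Lipschitz with respect to the l1 norm on x.
Khat = 2.0
W = [[random.uniform(-Khat, Khat) for _ in range(dim)] for _ in range(T)]

def h(x):
    return [sum(wi * xi for wi, xi in zip(w, x)) for w in W]

worst_ratio = 0.0
step = 1.0 / (8 * Khat * dim)  # keep perturbations inside the 1/(4*Khat) l1-ball
for _ in range(1000):
    x = [random.random() for _ in range(dim)]
    d = [random.uniform(-step, step) for _ in range(dim)]
    xp = [xi + di for xi, di in zip(x, d)]
    l1_x = sum(abs(di) for di in d)
    q, qp = softmax(h(x)), softmax(h(xp))
    l1_q = sum(abs(a - b) for a, b in zip(q, qp))
    if l1_x > 0:
        worst_ratio = max(worst_ratio, l1_q / l1_x)
# worst_ratio should stay below 3 * Khat, as the lemma predicts
```

In this experiment the observed ratio sits well below $3\hat{K}$, which is consistent with the bound being conservative.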
Thus, there are finitely many points $x = z_1, \ldots, z_\mu = y$ on the line from $x$ to $y$ such that $d_{l_p}(z_i, z_{i+1}) \leq \delta$. Further, since these points lie on a straight line, we have $d_{l_p}(x, y) = \sum_{i=1}^{\mu-1} d_{l_p}(z_i, z_{i+1})$. Then, for any metric $d$ used to measure distance in the range space of $f$, we get

$$d(f(x), f(y)) \leq \sum_{i=1}^{\mu-1} d(f(z_i), f(z_{i+1})) \leq \sum_{i=1}^{\mu-1} L\, d_{l_p}(z_i, z_{i+1}) = L\, d_{l_p}(x, y)$$

Since in our case the defender mixed strategy space is compact and convex, and $q^h(x)$ satisfies the above lemma with $L = 3\hat{K}$ and $\delta = 1/4\hat{K}$, $q^h(x)$ is $3\hat{K}$-Lipschitz.

Proof of Theorem 4

PROOF. Coupled with the guarantee that with prob. $\geq 1 - \delta$, $\Pr(\|q^p(x) - q^h(x)\|_1 \leq \sqrt{2\Delta}) \geq 1 - \Delta$, the assumptions guarantee that with prob. $\geq 1 - \delta$, for the learned hypothesis $h$ there must exist an $x' \in B(x^*, \epsilon)$ such that $\|q^p(x') - q^h(x')\|_1 \leq \sqrt{2\Delta}$, and there must exist an $x'' \in B(\tilde{x}, \epsilon)$ such that $\|q^p(x'') - q^h(x'')\|_1 \leq \sqrt{2\Delta}$.

First, for notational ease let $\gamma$ denote $\sqrt{2\Delta}$. The following are immediate using the triangle inequality, with the results $\|q^p(x') - q^h(x')\|_1 \leq \gamma$ and $\|q^p(x'') - q^h(x'')\|_1 \leq \gamma$ and the Lipschitzness assumptions:

$$\|q^p(x^*) - q^h(x')\|_1 \leq K\epsilon + \gamma \quad (opt_{x^*})$$

$$\|q^p(\tilde{x}) - q^h(x'')\|_1 \leq K\epsilon + \gamma \quad (opt_{\tilde{x}})$$

We call $\tilde{x}^T U q^h(\tilde{x}) \geq x'^T U q^h(x')$ equation $(opt_h)$.
Thus, we bound the utility loss as follows:

$$x^{*T} U q^p(x^*) - \tilde{x}^T U q^p(\tilde{x})$$
$$= x^{*T} U q^p(x^*) - \tilde{x}^T U q^h(\tilde{x}) + \tilde{x}^T U q^h(\tilde{x}) - \tilde{x}^T U q^p(\tilde{x})$$
$$\leq x^{*T} U q^p(x^*) - x'^T U q^h(x') + \tilde{x}^T U q^h(\tilde{x}) - \tilde{x}^T U q^p(\tilde{x}) \quad \text{using } (opt_h)$$
$$= (x^* - x')^T U q^p(x^*) + x'^T U (q^p(x^*) - q^h(x')) + \tilde{x}^T U q^h(\tilde{x}) - \tilde{x}^T U q^p(\tilde{x})$$
$$\leq \epsilon + (K\epsilon + \gamma) + \tilde{x}^T U q^h(\tilde{x}) - \tilde{x}^T U q^p(\tilde{x}) \quad \text{using } x' \in B(x^*, \epsilon),\ (opt_{x^*})$$
$$= ((K+1)\epsilon + \gamma) + \tilde{x}^T U (q^h(\tilde{x}) - q^h(x'')) + \tilde{x}^T U (q^h(x'') - q^p(\tilde{x}))$$
$$\leq ((K+1)\epsilon + \gamma) + 3\hat{K}\epsilon + (K\epsilon + \gamma) \quad \text{using } x'' \in B(\tilde{x}, \epsilon) \text{ with } 3\hat{K}\text{-Lipschitz } q^h,\ (opt_{\tilde{x}})$$

B. EXTENSION TO STACKELBERG GAMES
Our technique extends to Stackelberg games by noting that the single-resource case $K = 1$ with $T - 1$ targets gives $\sum_{i=1}^{T-1} x_i \leq 1$. This directly maps to a probability distribution over $T$ actions: each $x_i$, with $x_T = 1 - \sum_{i=1}^{T-1} x_i$, is the probability of playing an action. With this set-up the security game is a standard Stackelberg game, but one in which the leader has $T$ actions and the follower has $T - 1$ actions.

Thus, in order to capture general Stackelberg games, we assume $N$ actions for the adversary (instead of $T - 1$ above). Then, similarly to security games, $q_1, \ldots, q_N$ denotes the adversary's probability of playing each action. Thus, the function $h$ now outputs vectors of size $N - 1$ (instead of $O(T)$), i.e., $A$ is a subset of $(N-1)$-dimensional Euclidean space. The model of security games in the PAC framework extends as is to this Stackelberg setup, just with $h(x)$ and $A$ being $(N-1)$-dimensional. The rest of the analysis proceeds exactly as for security games, for both the parametric and the non-parametric case, by replacing the $T$ corresponding to the adversary's action space by $N$. Since the proof technique is exactly the same, we just state the final results. Thus, for a Stackelberg game with $T$ leader actions and $N$ follower actions, the bound of Theorem 1 becomes

$$m \geq \frac{64M^2}{\alpha^2} \log \frac{8\, C(\alpha/32N, \mathcal{H}, d_{\bar{l}})}{\delta}$$

It can be seen from the proof of the parametric part that the sample complexity does not depend on the dimensionality of $X$, but only on the dimensionality of $A$. Hence, the sample complexity for the generalized SUQR parametric case is

$$O\Big(\frac{1}{\alpha^2}\Big(\log\frac{1}{\delta} + N \log\frac{N}{\alpha}\Big)\Big)$$

and for the non-parametric case, which depends on both the dimensionality of $X$ and $T$, the sample complexity is

$$O\Big(\frac{1}{\alpha^2}\Big(\log\frac{1}{\delta} + \frac{N^{T+1}}{\alpha^T}\Big)\Big)$$

C. ANALYSIS OF STANDARD SUQR FORM
For SUQR the rewards and penalties are given and fixed. Let the rewards be $r = \langle r_1, \ldots, r_T \rangle$ (each $r_i \in [0, r_{max}]$, $r_{max} > 0$), and the penalties $p = \langle p_1, \ldots, p_T \rangle$ (each $p_i \in [p_{min}, 0]$, $p_{min} < 0$). Thus, the output of $h$ is

$$h(x) = \langle w_1 x_{1T} + w_2 r_{1T} + w_3 p_{1T},\ \ldots,\ w_1 x_{(T-1)T} + w_2 r_{(T-1)T} + w_3 p_{(T-1)T} \rangle$$

where $r_{iT} = r_i - r_T$ and similarly for $p_{iT}$. Note that in the above formulation all the component functions $h_i(x)$ have the same weights. We can consider the function space $\mathcal{H}$ as the following direct-sum semi-free product $\mathcal{G} \oplus \mathcal{F} \oplus \mathcal{E} = \{\langle g_1 + f_1 + e_1, \ldots, g_{T-1} + f_{T-1} + e_{T-1} \rangle \mid \langle g_1, \ldots, g_{T-1} \rangle \in \mathcal{G}, \langle f_1, \ldots, f_{T-1} \rangle \in \mathcal{F}, \langle e_1, \ldots, e_{T-1} \rangle \in \mathcal{E}\}$, where each of $\mathcal{G}, \mathcal{F}, \mathcal{E}$ is defined below. $\mathcal{G} = \{\langle g_1, \ldots, g_{T-1} \rangle \in \times_i \mathcal{G}_i \mid \text{all } g_i \text{ have the same weight}\}$, where $\mathcal{G}_i$ has functions of the form $w_1 x_{iT}$. $\mathcal{F} = \{\langle f_1, \ldots, f_{T-1} \rangle \in \times_i \mathcal{F}_i \mid \text{all } f_i \text{ have the same weight}\}$, where $\mathcal{F}_i$ has constant-valued functions of the form $w_2 r_{iT}$. $\mathcal{E} = \{\langle e_1, \ldots, e_{T-1} \rangle \in \times_i \mathcal{E}_i \mid \text{all } e_i \text{ have the same weight}\}$, where $\mathcal{E}_i$ has constant-valued functions of the form $w_3 p_{iT}$.

Consider an $\epsilon/3$-cover $U_e$ for $\mathcal{E}$, an $\epsilon/3$-cover $U_f$ for $\mathcal{F}$ and an $\epsilon/3$-cover $U_g$ for $\mathcal{G}$. We claim that $U_e \times U_f \times U_g$ is an $\epsilon$-cover for $\mathcal{E} \oplus \mathcal{F} \oplus \mathcal{G}$. Thus, the size of the $\epsilon$-cover for
$\mathcal{E} \oplus \mathcal{F} \oplus \mathcal{G}$ is bounded by $|U_e||U_f||U_g|$. Thus,

$$N(\epsilon, \mathcal{H}, d_{\bar{l}}) \leq N(\epsilon/3, \mathcal{G}, d_{\bar{l}})\, N(\epsilon/3, \mathcal{F}, d_{\bar{l}})\, N(\epsilon/3, \mathcal{E}, d_{\bar{l}})$$

Taking the sup over $P$ we get

$$C(\epsilon, \mathcal{H}, d_{\bar{l}}) \leq C(\epsilon/3, \mathcal{G}, d_{\bar{l}})\, C(\epsilon/3, \mathcal{F}, d_{\bar{l}})\, C(\epsilon/3, \mathcal{E}, d_{\bar{l}})$$

Now, we show that $U_e \times U_f \times U_g$ is indeed an $\epsilon$-cover for $\mathcal{H} = \mathcal{E} \oplus \mathcal{F} \oplus \mathcal{G}$.
Fix any $h \in \mathcal{H} = \mathcal{E} \oplus \mathcal{F} \oplus \mathcal{G}$. Then $h = e + f + g$ for some $e \in \mathcal{E}$, $f \in \mathcal{F}$, $g \in \mathcal{G}$. Let $e' \in U_e$ be $\epsilon/3$-close to $e$, $f' \in U_f$ be $\epsilon/3$-close to $f$ and $g' \in U_g$ be $\epsilon/3$-close to $g$, and let $h' = e' + f' + g'$. Then,

$$d_{L^1(P, d_{\bar{l}})}(h, h') = \int_X \frac{1}{k}\sum_{i=1}^{k} d_l(h_i(x), h'_i(x))\, dP(x) \leq \int_X \frac{1}{k}\sum_{i=1}^{k} \big[ d_l(g_i(x), g'_i(x)) + d_l(f_i(x), f'_i(x)) + d_l(e_i(x), e'_i(x)) \big]\, dP(x) = d_{L^1(P, d_{\bar{l}})}(g, g') + d_{L^1(P, d_{\bar{l}})}(f, f') + d_{L^1(P, d_{\bar{l}})}(e, e') \leq \epsilon$$

Similar to Lemma 2, it is possible to show that, for any probability distribution $P$: for any functions $g, g'$, $d_{\bar{l}}(g, g') \leq |w_1 - w'_1|$; for any $f, f'$, $d_{\bar{l}}(f, f') \leq |w_2 - w'_2|\, r_{max}$; and for any $e, e'$, $d_{\bar{l}}(e, e') \leq |w_3 - w'_3|\, |p_{min}|$. Assume each of the component functions has range $[-M/3, M/3]$ (this does not affect the order in terms of $M$). Given these ranges, $w_1$ for $\mathcal{G}$ can take values in $[-M/3, M/3]$, $w_2$ for $\mathcal{F}$ can take values in $[-M/3r_{max}, M/3r_{max}]$ and $w_3$ for $\mathcal{E}$ can take values in $[-M/3|p_{min}|, M/3|p_{min}|]$. To get a capacity of $\epsilon/3$ it is enough to divide the respective weight range into intervals of $\epsilon/3$ and consider the boundaries. This yields an $\epsilon/3$-capacity of $2M/\epsilon$, $2M/\epsilon r_{max}$ and $2M/\epsilon|p_{min}|$ for $\mathcal{G}$, $\mathcal{F}$ and $\mathcal{E}$ respectively. Thus,

$$C(\epsilon, \mathcal{H}, d_{\bar{l}}) \leq \frac{(2M/\epsilon)^3}{r_{max}\, |p_{min}|}$$

Plugging this into the sample complexity from Theorem 1, we get that the sample complexity is

$$O\Big(\frac{1}{\alpha^2}\Big(\log\frac{1}{\delta} + \log\frac{T}{\alpha}\Big)\Big)$$

D. ALTERNATE PROOF FOR GENERALIZED SUQR SAMPLE COMPLEXITY
As discussed in the main paper, we use the function space $\mathcal{H}'$ with each component function space $\mathcal{H}'_i$ given by $w_i x_{iT} + c_{iT}$. Then, we can directly use Equation 2. We still need to bound $C(\epsilon, \mathcal{H}'_i, d_l)$. For this, we note that the set of functions $w_i x_{iT} + c_{iT}$ has two free parameters, $w_i$ and $c_i$; thus, this function space is a subset of a vector space of functions of dimension two (two values are needed to represent each function). Using the pseudo-dimension technique [13], we know that for pseudo-dimension $d$ of the function space $\mathcal{H}'_i$ we get

$$C(\epsilon, \mathcal{H}'_i, d_l) \leq 2\Big(\frac{2eM}{\epsilon}\log\frac{2eM}{\epsilon}\Big)^d$$

Also, we know [13] that the pseudo-dimension is equal to the vector space dimension if the function class is a subset of a vector space. Therefore, for our case $d = 2$. Therefore, using Equation 2 we get

$$C(\epsilon, \mathcal{H}', d_{\bar{l}}) \leq 2^T\Big(\frac{2eM}{\epsilon}\log\frac{2eM}{\epsilon}\Big)^{2T}$$

Plugging this result into Theorem 1 we get a sample complexity of

$$O\Big(\frac{1}{\alpha^2}\Big(\log\frac{1}{\delta} + T\log\Big(\frac{T}{\alpha}\log\frac{T}{\alpha}\Big)\Big)\Big)$$

E. EXPERIMENTAL RESULTS
Here we provide additional experimental results on the Uganda, AMT and simulated datasets. The AMT dataset consisted of 32 unique mixed strategies, 16 of which were deployed for one payoff structure and the remaining 16 for another. In the main paper, we provided results on AMT data for payoff structure 1. Here, in Figs. 2(a) and 2(b), we show results on the AMT data for both the parametric (SUQR) and NPL learning settings on payoff structure 2. For the experiments on simulated data, we used the same mixed strategies and features as for the AMT data, but simulated the attacks, first using the actual SUQR model and then using a modified form of the SUQR model. Figs. 2(c) and 2(d) show results on simulated data for payoff structures 1 and 2 for the parametric case, when the data is generated by an adversary with an SUQR model with the true weight vector reported in Nguyen et al. [25] ($(w_1, w_2, w_3) = (-9.85, 0.37, 0.15)$; $c_i = w_2 R_i + w_3 P_i$). Similar results for the NPL model are shown in Figs. 2(e) and 2(f), respectively. We can see that the NPL approach performs poorly with only one or five samples, as expected, but improves significantly as more samples are added. To further show its potential, we modified the true adversary model generating the attacks from SUQR to the following: $q_i \propto e^{w_1 x_i^2 + c_i}$, i.e., instead of $x_i$, the adversary reasons based on $x_i^2$. We considered the same true weight vector to simulate attacks. Then, we observe in Fig. 2(g) (for payoff structure 1) and Fig. 2(h) (for payoff structure 2) that $\alpha$ approaches a value closer to zero for 500 or more samples. Also, the NPL model performs better than the parametric model with 500 or more samples.
This shows that the NPL approach is more accurate when the true adversary does not satisfy the simple parametric logistic form, indicating that when we do not know the true form of the adversary's decision-making function, adopting a non-parametric method to learn the adversary's behavior is more effective.

Figure 2: (a) AMT parametric results, payoff structure 2. (b) AMT non-parametric results, payoff structure 2. (c) Simulated data, payoff structure 1, parametric results. (d) Simulated data, payoff structure 2, parametric results. (e) Simulated data, payoff structure 1, non-parametric results. (f) Simulated data, payoff structure 2, non-parametric results. (g) Parametric vs. non-parametric results on simulated data (for various sample sizes) from payoff structure 1 when the true adversary model is different from the parametric learned function. (h) Parametric vs. non-parametric results on simulated data (for various sample sizes) from payoff structure 2 when the true adversary model is different from the parametric learned function.
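The simulated-attack generation used in these experiments can be sketched as follows. This is a minimal illustration, not the authors' experimental code: the target coverage, reward and penalty values below are made up, and the weight vector follows the Nguyen et al. [25] SUQR estimates quoted in the text.

```python
import math, random

def suqr_attack_probs(x, R, P, w=(-9.85, 0.37, 0.15), squared=False):
    """SUQR attack distribution q_i ∝ exp(w1*x_i + w2*R_i + w3*P_i);
    with squared=True, use the modified model q_i ∝ exp(w1*x_i**2 + c_i)
    that generates non-logistic data. Illustrative sketch only."""
    w1, w2, w3 = w
    cov = [xi ** 2 if squared else xi for xi in x]
    s = [w1 * c + w2 * r + w3 * p for c, r, p in zip(cov, R, P)]
    mx = max(s)  # stabilize the exponentials
    e = [math.exp(si - mx) for si in s]
    tot = sum(e)
    return [ei / tot for ei in e]

def simulate_attacks(n, x, R, P, seed=0, squared=False):
    """Draw n attacked-target indices i.i.d. from the SUQR distribution."""
    rng = random.Random(seed)
    q = suqr_attack_probs(x, R, P, squared=squared)
    return [rng.choices(range(len(x)), weights=q)[0] for _ in range(n)]

x = [0.5, 0.3, 0.1, 0.1]      # hypothetical defender coverage
R = [4.0, 3.0, 2.0, 1.0]      # hypothetical adversary rewards
P = [-1.0, -2.0, -1.0, -3.0]  # hypothetical adversary penalties
attacks = simulate_attacks(500, x, R, P)
```

With the strongly negative coverage weight $w_1$, lightly covered targets with decent rewards attract most of the simulated attacks, which is the qualitative behavior the SUQR model is meant to capture.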