Adaptive Learning in Two-Player Stackelberg Games with Application to Network Security
Guosong Yang, Radha Poovendran, and João P. Hespanha
Abstract—We study a two-player Stackelberg game with incomplete information, in which the follower's strategy belongs to a known family of parameterized functions with an unknown parameter vector. We design an adaptive learning approach to simultaneously estimate the unknown parameter and minimize the leader's cost, based on adaptive control techniques and hysteresis switching. Our approach guarantees that the leader's cost predicted using the parameter estimate becomes indistinguishable from its actual cost in finite time, up to a preselected, arbitrarily small error threshold. Also, the first-order necessary condition for optimality holds asymptotically for the predicted cost. Additionally, if a persistent excitation condition holds, then the parameter estimation error becomes bounded by a preselected, arbitrarily small threshold in finite time as well. For the case where there is a mismatch between the follower's strategy and the parameterized function that is known to the leader, our approach is able to guarantee the same convergence results for error thresholds larger than the size of the mismatch. The algorithms and the convergence results are illustrated via a simulation example in the domain of network security.
I. INTRODUCTION
A modern engineering system often involves multiple self-interested decision makers whose actions have mutual consequences. Examples of such systems include communications over a shared network with limited capacity, and computer programs sharing limited computational resources. Game theory provides a systematic framework for modeling the cooperation and conflict between these so-called strategic players [1], and has been widely applied to areas such as robust design, resource allocation, and network security [2]–[4].

In game theory, a fundamental question is whether the players can converge to a Nash equilibrium (a tuple of strategies from which no player has a unilateral incentive to deviate) if they play the game iteratively and adjust their strategies based on historical outcomes. A primary example of such a learning process is fictitious play [5], [6], in which every player believes that the opponents are playing constant mixed strategies in agreement with the empirical distributions of their past actions, and plays the corresponding best response. Another well-known example is gradient response [7], [8], in which each player adjusts its strategy according to the corresponding gradient of its cost function.
G. Yang and J. P. Hespanha are with the Center for Control, Dynamical Systems, and Computation, University of California, Santa Barbara, CA 93106, USA. Email: {guosongyang, hespanha}@ucsb.edu. R. Poovendran is with the Department of Electrical and Computer Engineering, University of Washington, Seattle, WA 98195, USA. Email: [email protected].

Fictitious play and gradient response have attracted significant research interest [9], [10] and have been used in applications such as multiagent reinforcement learning [11] and distributed control [12].

In this paper, we propose an adaptive learning approach for a hierarchical game model introduced by Stackelberg [13]. In a two-player Stackelberg game, one player (called the leader) selects its action first, and then the other player (called the follower), informed of the leader's choice, selects its own action. Therefore, a follower's strategy in a Stackelberg game can be viewed as a function that specifies an action in response to each possible leader's action.

Stackelberg games provide a natural framework for understanding systems with asymmetric information, a common feature of many network problems such as routing [14], scheduling [15], and channel allocation [16]. They are especially useful for modeling security problems, where the defender (leader) is usually unaware of the attacker's objective and capabilities a priori, whereas the attacker (follower) is able to observe the defender's strategy and attack after careful planning. A class of Stackelberg games called Stackelberg Security Games have been applied to various real-world security domains and have led to practical implementations such as the ARMOR program at the Los Angeles International Airport [17], the IRIS program used by the US Federal Air Marshals [18], and counterterrorism programs for crucial infrastructures such as power grids and oil reserves [19], [20].

Asymmetric information often leads to scenarios with no Nash equilibrium but a Stackelberg equilibrium, as sufficient conditions for the existence of the former are much stronger than those for the latter (see, e.g., [2, p. 181]). For these scenarios, learning schemes like fictitious play and gradient response cannot be applied, and novel approaches are needed to achieve convergence to a Stackelberg equilibrium. Existing results on learning in Stackelberg games are limited to linear or quadratic costs and finite action sets [21]–[23], which are too restrictive for many applications including network security.

In this paper, we study a Stackelberg game between two players with continuous action sets. We consider the scenario where the leader only has partial knowledge of the follower's action set and cost function. As a result, the follower's strategy belongs to a family of parameterized functions that is known to the leader, but the actual value of the parameter vector is unknown. Our main contribution is an adaptive learning approach described in Section III, which simultaneously estimates the unknown parameter based on the follower's past actions and minimizes the leader's cost predicted using the parameter estimate. The approach is designed based on adaptive control techniques, and utilizes projections and hysteresis switching to ensure feasibility of solutions and finite-time convergence of the parameter estimate.

In Section IV, we prove that the leader's predicted cost is guaranteed to become indistinguishable from its actual cost in finite time, up to a preselected, arbitrarily small error threshold. Also, the first-order necessary condition for optimality holds asymptotically for the predicted cost.
Additionally, if a persistent excitation condition holds, then the parameter estimation error is guaranteed to become bounded by a preselected, arbitrarily small threshold in finite time as well. Our proof provides a rigorous treatment of the existence and convergence of solutions to the discontinuous dynamics that result from projections and switching, based on tools from differential inclusions theory. In particular, we establish an invariance principle for projected gradient descent in continuous time, which is of independent interest and is novel to the best of our knowledge.

In Section V, we extend the adaptive learning approach to the case where the parameterized function that is known to the leader does not match the follower's strategy perfectly. We prove that our approach can be adjusted to guarantee the same convergence results for preselected error thresholds that are larger than the size of the mismatch.

In Section VI, the algorithms and the convergence results are illustrated via a simulation example motivated by link-flooding distributed denial-of-service (DDoS) attacks, such as the Crossfire attack [24]. Section VII concludes the paper with a brief summary and an outlook on future research topics.

A preliminary version of some of these results appeared in the conference paper [25]. The current paper improves on [25] by adding a notion of "practical" Stackelberg equilibrium, removing unnecessary assumptions, substantiating the results with complete proofs and clarifying remarks, and providing a more realistic and elaborate simulation example.

Notations:
Let $\mathbb{R}_+ := [0, \infty)$ and $\mathbb{N} := \{1, 2, \ldots\}$. Denote by $I_n$ the identity matrix in $\mathbb{R}^{n \times n}$, or simply $I$ when the dimension is implicit. Denote by $\|\cdot\|$ the Euclidean norm for vectors and the (induced) Euclidean norm for matrices. For two vectors $v_1$ and $v_2$, denote by $(v_1, v_2) := (v_1^\top, v_2^\top)^\top$ their concatenation. For a set $\mathcal{S} \subset \mathbb{R}^n$, denote by $\partial\mathcal{S}$ and $\bar{\mathcal{S}}$ its boundary and closure, respectively. A signal $r : \mathbb{R}_+ \to \mathbb{R}^n$ is of class $\mathcal{L}_\infty$ if $\sup_{t \ge 0} \|r(t)\|$ is finite. Denote by $s\mathbb{B}(x)$ the closed ball of radius $s \ge 0$ centered at $x \in \mathbb{R}^n$, that is, $s\mathbb{B}(x) := \{z \in \mathbb{R}^n : \|z - x\| \le s\}$.

II. PROBLEM FORMULATION
Consider a two-player game $(\mathcal{R}, \mathcal{A}, J, H)$, where $\mathcal{R} \subset \mathbb{R}^{n_r}$ and $\mathcal{A} \subset \mathbb{R}^{n_a}$ are the action sets of the first and the second player, respectively, and $J : \mathcal{R} \times \mathcal{A} \to \mathbb{R}$ and $H : \mathcal{A} \times \mathcal{R} \to \mathbb{R}$ are the corresponding cost functions to be minimized. We are interested in a hierarchical game model proposed by Stackelberg [13], where the first player (called the leader) selects its action $r \in \mathcal{R}$ first, and then the second player (called the follower), informed of the leader's choice, selects its own action $a \in \mathcal{A}$. Formally, a Stackelberg equilibrium is defined as follows; see, e.g., [1, Sec. 3.1] or [2, Def. 4.6 and 4.7, pp. 179–180].

Definition 1 (Stackelberg equilibrium). Given a game defined by $(\mathcal{R}, \mathcal{A}, J, H)$, an action $r^* \in \mathcal{R}$ is called a Stackelberg equilibrium action for the leader if
\[
J^* := \inf_{r \in \mathcal{R}} \max_{a \in \beta_a(r)} J(r, a) = \max_{a \in \beta_a(r^*)} J(r^*, a), \tag{1}
\]
where
\[
\beta_a(r) := \arg\min_{a \in \mathcal{A}} H(a, r) \tag{2}
\]
denotes the set of the follower's best responses against a leader's action $r \in \mathcal{R}$, and $J^*$ is known as the Stackelberg cost for the leader; for an $\varepsilon > 0$, an action $r^*_\varepsilon \in \mathcal{R}$ is called an $\varepsilon$ Stackelberg action for the leader if (1) is replaced by
\[
\max_{a \in \beta_a(r^*_\varepsilon)} J(r^*_\varepsilon, a) \le J^* + \varepsilon.
\]

(In this paper, we only consider compact action sets and continuous cost functions; thus for each $r \in \mathcal{R}$, the set $\beta_a(r)$ is nonempty and compact. However, the map $r \mapsto \max_{a \in \beta_a(r)} J(r, a)$ may still be discontinuous.)

The Stackelberg cost is a cost that the leader is able to guarantee against a rational follower. Depending on the follower's actual strategy, the leader may be able to achieve a better (smaller) cost while playing a Stackelberg equilibrium action. In practice, it is possible that no Stackelberg equilibrium action exists but the leader is able to achieve essentially the Stackelberg cost by playing an $\varepsilon$ Stackelberg action with a sufficiently small $\varepsilon > 0$; see also the discussion after Assumption 1 and the example in Section VI.

We consider games with perfect but incomplete information, where the leader only has partial knowledge of the follower's action set and cost function and cannot predict the follower's action accurately. Specifically, the follower's strategy is an unknown function $f : \mathcal{R} \to \mathcal{A}$ such that $a = f(r) \in \beta_a(r)$ for all $r \in \mathcal{R}$. However, $f$ belongs to a family of parameterized functions $\{r \mapsto \hat f(\hat\theta, r) : \hat\theta \in \Theta\}$ with a parameter set $\Theta \subset \mathbb{R}^{n_\theta}$, that is, there is an actual value $\theta \in \Theta$ such that
\[
a = f(r) = \hat f(\theta, r) \quad \forall r \in \mathcal{R}. \tag{3}
\]
The parameterized function $\hat f : \Theta \times \mathcal{R} \to \mathbb{R}^{n_a}$ (including the parameter set $\Theta$) is known to the leader, but the actual value $\theta$ is unknown. The follower's action set $\mathcal{A}$ is also unknown, except that $\mathcal{A} \subset \{\hat f(\hat\theta, r) : \hat\theta \in \Theta,\ r \in \mathcal{R}\}$ as implied by (3).

In practice, assuming that the follower's strategy belongs to a known family of parameterized functions introduces little loss of generality, as the follower's strategy can always be approximated on a compact set, up to an arbitrary precision, by a finite weighted sum of a preselected class of basis functions. An example of such an approximation is the radial basis function (RBF) model [26], in which the leader assumes
\[
\hat f(\theta, r) = \sum_{j=1}^{n_\theta} \theta_j F_j(r) = \sum_{j=1}^{n_\theta} \theta_j\,\phi\big(\|r - r^c_j\|\big),
\]
where $\phi : \mathbb{R}_+ \to \mathbb{R}^{n_a}$ is an RBF and each $F_j : \mathcal{R} \to \mathbb{R}^{n_a}$ is a kernel centered at $r^c_j$. Note that in the RBF model, the approximation is affine with respect to the unknown parameter, which is also the case in many other widely used approximation models such as orthogonal polynomials and multivariate splines [26].
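To make the parameterized family concrete, the short Python sketch below builds an RBF model of a scalar-valued follower strategy over a one-dimensional leader action set. The Gaussian kernel, the grid of centers, and the dimensions are illustrative assumptions on our part, not choices made in the paper; the only property that matters here is that the map from the parameter vector to the predicted response is affine.

```python
import numpy as np

def make_rbf_model(centers, width):
    """Return f_hat(theta, r) = sum_j theta_j * phi(|r - c_j|), which is affine in theta."""
    def features(r):
        # RBF phi applied to the distance from r to each kernel center
        return np.exp(-((r - centers) ** 2) / (2 * width ** 2))
    def f_hat(theta, r):
        return features(r) @ theta
    return f_hat, features

# illustrative setup: 8 Gaussian kernels on a grid over the leader action set [0, 1]
centers = np.linspace(0.0, 1.0, 8)
f_hat, features = make_rbf_model(centers, width=0.1)

theta_true = np.random.rand(8)   # plays the role of the unknown actual value theta
r = 0.3
a = f_hat(theta_true, r)         # follower's response a = f_hat(theta, r)

# affinity in theta: the Jacobian with respect to theta is the feature vector,
# independent of theta, which is exactly what the estimation algorithm exploits
grad_theta_f = features(r)
```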
This motivates restricting our attention to affine maps $\hat\theta \mapsto \hat f(\hat\theta, r)$. The following assumption captures this and other generic regularity conditions that we use to guarantee the existence of an $\varepsilon$ Stackelberg action and the convergence of our parameter estimation algorithms.
Assumption 1 (Regularity). The follower's action set $\mathcal{A}$ is compact, and the leader's action set $\mathcal{R}$ and the parameter set $\Theta$ are convex and compact; the follower's cost function $H$ is continuous, the leader's cost function $J$ and the parameterized function $\hat f$ are continuously differentiable, and the map $\hat\theta \mapsto \hat f(\hat\theta, r)$ is affine for each fixed $r \in \mathcal{R}$.

Under Assumption 1, there exists an $\varepsilon$ Stackelberg action for each $\varepsilon > 0$ [2, Prop. 4.2, p. 180]. These conditions are much weaker than the standard sufficient conditions for the existence of a Stackelberg equilibrium action [2, Th. 4.8, p. 180], which are in turn much weaker than those for a Nash equilibrium [2, p. 181]. Therefore, they are consistent with our interest in games with no Nash equilibrium but a "practical" Stackelberg equilibrium; see also the example in Section VI.

We denote by $\nabla_r J(r, a)$ and $\nabla_a J(r, a)$ the gradients of the maps $r \mapsto J(r, a)$ and $a \mapsto J(r, a)$, respectively, and by $\nabla_\theta \hat f(r)$ and $\nabla_r \hat f(\hat\theta, r)$ the Jacobian matrices of the maps $\hat\theta \mapsto \hat f(\hat\theta, r)$ and $r \mapsto \hat f(\hat\theta, r)$, respectively. (To be consistent with the definition of the Jacobian matrix, we take gradients as row vectors.) In particular, the Jacobian matrix $\nabla_\theta \hat f(r)$ is independent of $\hat\theta$ due to the affine condition in Assumption 1.

Our goal is to adjust the leader's action $r$ to minimize its cost $J(r, a)$ for the follower's action $a = f(r) = \hat f(\theta, r)$, that is, to solve the optimization problem
\[
\min_{r \in \mathcal{R}} J\big(r, \hat f(\theta, r)\big), \tag{4}
\]
based on past observations of the follower's action $a = \hat f(\theta, r)$ and the leader's cost $J(r, a)$, but without knowing the actual value $\theta$. (Clearly, as the follower's response satisfies $\hat f(\theta, r) \in \beta_a(r)$ for all $r \in \mathcal{R}$, the leader's optimal cost is upper bounded by its Stackelberg cost, i.e., $\min_{r \in \mathcal{R}} J(r, \hat f(\theta, r)) \le J^*$.) Our approach to solving this problem combines the following two components:
1) Construct a parameter estimate $\hat\theta$ that approaches the actual value $\theta$.
2) Adjust the leader's action $r$ based on a gradient descent method to minimize its predicted cost $\hat J(r, \hat\theta) := J\big(r, \hat f(\hat\theta, r)\big)$, that is, to solve the optimization problem
\[
\min_{r \in \mathcal{R}} \hat J(r, \hat\theta) = \min_{r \in \mathcal{R}} J\big(r, \hat f(\hat\theta, r)\big). \tag{5}
\]
In this paper, our design and analysis are formulated using continuous-time dynamics, which is common in the literature on learning in game theory [9], [10].

III. ESTIMATION AND OPTIMIZATION
To specify the adaptive algorithms for estimating the actual value $\theta$ and optimizing the leader's action $r$, we recall the following notions and basic properties from convex analysis; for more details, see, e.g., [27, Ch. 6] or [28, Sec. 5.1].

For a closed convex set
$\mathcal{C} \subset \mathbb{R}^n$ and a point $v \in \mathbb{R}^n$, we denote by $[v]_{\mathcal{C}}$ the projection of $v$ onto $\mathcal{C}$, that is,
\[
[v]_{\mathcal{C}} := \arg\min_{w \in \mathcal{C}} \|w - v\|.
\]
The projection $[v]_{\mathcal{C}}$ exists and is unique as the set $\mathcal{C}$ is closed and convex, and satisfies $[v]_{\mathcal{C}} = v$ if $v \in \mathcal{C}$.

For a convex set $\mathcal{S} \subset \mathbb{R}^n$ and a point $x \in \mathcal{S}$, we denote by $T_{\mathcal{S}}(x)$ the tangent cone to $\mathcal{S}$ at $x$, that is,
\[
T_{\mathcal{S}}(x) := \overline{\{h(z - x) : z \in \mathcal{S},\ h > 0\}}, \tag{6}
\]
and by $N_{\mathcal{S}}(x)$ the normal cone to $\mathcal{S}$ at $x$, that is,
\[
N_{\mathcal{S}}(x) := \{v \in \mathbb{R}^n : v^\top w \le 0 \text{ for all } w \in T_{\mathcal{S}}(x)\}. \tag{7}
\]
The sets $T_{\mathcal{S}}(x)$ and $N_{\mathcal{S}}(x)$ are closed and convex, and satisfy $T_{\mathcal{S}}(x) = \mathbb{R}^n$ and $N_{\mathcal{S}}(x) = \{0\}$ if $x \in \mathcal{S} \setminus \partial\mathcal{S}$. Moreover, we have
\[
[v]_{T_{\mathcal{S}}(x)} \in T_{\mathcal{S}}(x), \qquad v - [v]_{T_{\mathcal{S}}(x)} \in N_{\mathcal{S}}(x) \tag{8}
\]
and
\[
\big(v - [v]_{T_{\mathcal{S}}(x)}\big)^\top [v]_{T_{\mathcal{S}}(x)} = 0 \tag{9}
\]
for all $v \in \mathbb{R}^n$ and $x \in \mathcal{S}$.
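The projection onto a tangent cone has a particularly simple form when the convex set is a box, which is the case for the parameter set used in the simulation example of Section VI. The sketch below implements only that special case; the function name, the box representation, and the tolerance handling are our own illustrative choices, and a general convex set would require a different (for example QP-based) projection.

```python
import numpy as np

def project_tangent_cone_box(v, x, lo, hi, tol=1e-12):
    """Projection of v onto the tangent cone T_S(x) of the box S = [lo, hi]^n at x in S.

    For a box, the tangent cone is a product of half-lines and full lines: a coordinate
    direction is restricted to be >= 0 (resp. <= 0) when x sits on the lower (resp. upper)
    bound and is unrestricted in the interior, so the projection clips coordinatewise.
    """
    v = np.asarray(v, dtype=float).copy()
    at_lower = np.abs(x - lo) <= tol
    at_upper = np.abs(hi - x) <= tol
    v[at_lower & (v < 0)] = 0.0   # cannot move below the lower bound
    v[at_upper & (v > 0)] = 0.0   # cannot move above the upper bound
    return v

# example: at a corner of [0, 1]^2, only inward-pointing components survive
x = np.array([0.0, 1.0])
print(project_tangent_cone_box(np.array([-1.0, 2.0]), x, 0.0, 1.0))  # -> [0. 0.]
print(project_tangent_cone_box(np.array([0.5, -3.0]), x, 0.0, 1.0))  # -> [ 0.5 -3. ]
```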
A. Parameter estimation

We construct the parameter estimate $\hat\theta$ by comparing past observations of the follower's action $a = \hat f(\theta, r)$ and the leader's cost $J(r, a)$ with the corresponding predicted values $\hat f(\hat\theta, r)$ and $\hat J(r, \hat\theta) = J\big(r, \hat f(\hat\theta, r)\big)$ computed using $\hat\theta$. Their difference is defined as the observation error
\[
e_{\rm obs} := \begin{bmatrix} \hat f(\hat\theta, r) - a \\ \hat J(r, \hat\theta) - J(r, a) \end{bmatrix}. \tag{10}
\]
We develop an estimation algorithm based on the observation error $e_{\rm obs}$ so that the norm of the estimation error $\hat\theta - \theta$ is monotonically decreasing, regardless of how the leader's action $r$ is being adjusted.

First, we establish a relation between the observation error $e_{\rm obs}$ and the estimation error $\hat\theta - \theta$.

Lemma 1.
The observation error $e_{\rm obs}$ satisfies
\[
e_{\rm obs} = K(r, a, \hat\theta)(\hat\theta - \theta) \tag{11}
\]
with the gain matrix
\[
K(r, a, \hat\theta) := \begin{bmatrix} I \\ \int_0^1 \nabla_a J\big(r, \rho \hat f(\hat\theta, r) + (1 - \rho)a\big)\,\mathrm{d}\rho \end{bmatrix} \nabla_\theta \hat f(r). \tag{12}
\]

Proof.
See Appendix B-A.

Following Lemma 1, the observation error $e_{\rm obs}$ would be zero if the current estimate $\hat\theta$ of the actual value $\theta$ were correct. However, in most interesting scenarios, the dimension $n_\theta$ of the parameter vector $\theta$ is much larger than the dimension $n_a + 1$ of the observation error $e_{\rm obs}$; thus the gain matrix $K(r, a, \hat\theta)$ cannot be invertible, and a zero value for the observation error $e_{\rm obs}$ does not imply that the estimate $\hat\theta$ is correct.

We propose the following estimation algorithm to drive the parameter estimate $\hat\theta$ towards the actual value $\theta$:
\[
\dot{\hat\theta} = \big[-\lambda_e K(r, a, \hat\theta)^\top e_{\rm obs}\big]_{T_\Theta(\hat\theta)} \tag{13}
\]
with the gain matrix $K(r, a, \hat\theta)$ defined by (12) and the switching signal $\lambda_e : \mathbb{R}_+ \to \{0, \lambda_\theta\}$ defined by
\[
\lambda_e(t) := \begin{cases} \lambda_\theta & \text{if } \|e_{\rm obs}(t)\| \ge \varepsilon_{\rm obs}; \\ \lim_{s \nearrow t} \lambda_e(s) & \text{if } \|e_{\rm obs}(t)\| \in (\varepsilon'_{\rm obs}, \varepsilon_{\rm obs}); \\ 0 & \text{if } \|e_{\rm obs}(t)\| \le \varepsilon'_{\rm obs}, \end{cases} \tag{14}
\]
and $\lambda_e(0) := \lambda_\theta$ if $\|e_{\rm obs}(0)\| \in (\varepsilon'_{\rm obs}, \varepsilon_{\rm obs})$, where $\varepsilon_{\rm obs} > \varepsilon'_{\rm obs} > 0$ and $\lambda_\theta > 0$ are preselected constants. Several comments are in order. First, the gain matrix $K(r, a, \hat\theta)$ depends on the parameter estimate $\hat\theta$ but not on the actual value $\theta$, so (13) can be implemented without knowing $\theta$. Second, the projection $[\cdot]_{T_\Theta(\hat\theta)}$ onto the tangent cone $T_\Theta(\hat\theta)$ ensures that the parameter estimate $\hat\theta$ remains inside the parameter set $\Theta$ [28]; see Appendix A for more details. Finally, the right-continuous, piecewise constant switching signal $\lambda_e$ is designed so that the adaptation is on when $\|e_{\rm obs}\| \ge \varepsilon_{\rm obs}$ and off when $\|e_{\rm obs}\| \le \varepsilon'_{\rm obs}$, with a hysteresis switching rule that avoids chattering.

The key feature of (13) is that the estimation error $\hat\theta - \theta$ satisfies
\[
\frac{\mathrm{d}\|\hat\theta - \theta\|^2}{\mathrm{d}t} = 2(\hat\theta - \theta)^\top \big[-\lambda_e K(r, a, \hat\theta)^\top e_{\rm obs}\big]_{T_\Theta(\hat\theta)} \le 2(\hat\theta - \theta)^\top \big(-\lambda_e K(r, a, \hat\theta)^\top e_{\rm obs}\big) = -2\lambda_e \|e_{\rm obs}\|^2,
\]
where the inequality follows from (6)–(8). We thus conclude that the estimation algorithm (13) with the switching signal (14) guarantees
\[
\frac{\mathrm{d}\|\hat\theta - \theta\|^2}{\mathrm{d}t} \le -2\lambda_e \|e_{\rm obs}\|^2 \le 0, \tag{15}
\]
which implies that $\|\hat\theta - \theta\|$ is monotonically decreasing and will not stop approaching zero unless $\|e_{\rm obs}\| < \varepsilon_{\rm obs}$. In the convergence analysis in Section IV, we will show that the adaptation of the parameter estimate $\hat\theta$ stops in finite time, and the observation error $e_{\rm obs}$ satisfies $\|e_{\rm obs}\| < \varepsilon_{\rm obs}$ afterward.
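As a concrete illustration of how (13)–(14) could be implemented, the fragment below performs one forward-Euler step of the estimator for a toy problem with a scalar follower response and a box-shaped parameter set. The cost function, RBF features, step size, gain, and thresholds are all illustrative assumptions rather than values from the paper, and the final clip merely guards against the discretization overshooting the box (in continuous time the tangent-cone projection alone keeps the estimate inside $\Theta$).

```python
import numpy as np

# --- illustrative problem data (not from the paper) ---
centers = np.linspace(0.0, 1.0, 8)
phi = lambda r: np.exp(-((r - centers) ** 2) / (2 * 0.1 ** 2))   # features, so grad_theta_f(r) = phi(r)
f_hat = lambda th, r: phi(r) @ th                                 # predicted response, affine in theta
J = lambda r, a: (r - 0.5) ** 2 + r * a                           # toy leader cost
dJ_da = lambda r, a: r                                            # gradient of J with respect to a

lo, hi = 0.0, 1.0                                                 # Theta = [0, 1]^8 (box)
eps_obs, eps_obs_p, lam_theta = 0.05, 0.025, 0.5                  # thresholds and gain (arbitrary)

def proj_tangent_box(v, x):
    v = v.copy()
    v[(x <= lo) & (v < 0)] = 0.0
    v[(x >= hi) & (v > 0)] = 0.0
    return v

def estimation_step(theta_hat, r, a, lam_e, dt=1e-2):
    """One Euler step of the estimator (13) with the hysteresis switching rule (14)."""
    a_hat = f_hat(theta_hat, r)
    e_obs = np.array([a_hat - a, J(r, a_hat) - J(r, a)])          # observation error (10)
    # gain matrix (12): here n_a = 1 and the integral reduces to an average of dJ/da
    avg = np.mean([dJ_da(r, ro * a_hat + (1 - ro) * a) for ro in np.linspace(0.0, 1.0, 21)])
    K = np.vstack([phi(r), avg * phi(r)])                         # shape (n_a + 1, n_theta)
    # hysteresis (14): on above eps_obs, off below eps_obs_p, otherwise keep previous value
    n = np.linalg.norm(e_obs)
    lam_e = lam_theta if n >= eps_obs else (0.0 if n <= eps_obs_p else lam_e)
    # projected update (13)
    theta_hat = theta_hat + dt * proj_tangent_box(-lam_e * K.T @ e_obs, theta_hat)
    return np.clip(theta_hat, lo, hi), lam_e
```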
B. Cost minimization

Several options are available for adjusting the leader's action $r$, but in this paper our analysis will focus on a gradient descent method, which is fairly robust for a wide range of problems. Our ultimate goal is to minimize the leader's cost $J(r, a) = J\big(r, \hat f(\theta, r)\big)$. However, computing the gradient descent direction of the actual cost requires knowledge of the actual value $\theta$. Therefore, we minimize instead the predicted cost $\hat J(r, \hat\theta) = J\big(r, \hat f(\hat\theta, r)\big)$, which depends on the parameter estimate $\hat\theta$. This change in objective is justified by the property that $\|\hat J(r, \hat\theta) - J(r, a)\| \le \|e_{\rm obs}\| < \varepsilon_{\rm obs}$ holds after a finite time, which will be established in Section IV.

The time derivative of the predicted cost $\hat J(r, \hat\theta)$ is given by
\[
\dot{\hat J}(r, \hat\theta) = \nabla_r \hat J(r, \hat\theta)\,\dot r + \nabla_\theta \hat J(r, \hat\theta)\,\dot{\hat\theta}, \tag{16}
\]
with
\[
\nabla_r \hat J(r, \hat\theta) := \nabla_r J\big(r, \hat f(\hat\theta, r)\big) + \nabla_a J\big(r, \hat f(\hat\theta, r)\big)\nabla_r \hat f(\hat\theta, r), \qquad \nabla_\theta \hat J(r, \hat\theta) := \nabla_a J\big(r, \hat f(\hat\theta, r)\big)\nabla_\theta \hat f(r).
\]
Here $\nabla_r J\big(r, \hat f(\hat\theta, r)\big)$ denotes the gradient of the map $r \mapsto J(r, \hat a)$ at $\hat a = \hat f(\hat\theta, r)$; thus $\nabla_r \hat J(r, \hat\theta)$ and $\nabla_\theta \hat J(r, \hat\theta)$ are the gradients of the maps $r \mapsto \hat J(r, \hat\theta) = J\big(r, \hat f(\hat\theta, r)\big)$ and $\hat\theta \mapsto \hat J(r, \hat\theta)$, respectively. As we will establish that the adaptation of the parameter estimate $\hat\theta$ stops in finite time, we neglect the term with $\dot{\hat\theta}$ in (16) and focus exclusively on adjusting $r$ along the gradient descent direction of $r \mapsto \hat J(r, \hat\theta)$. This motivates the following optimization algorithm to adjust the leader's action:
\[
\dot r = \big[-\lambda_r \nabla_r \hat J(r, \hat\theta)^\top\big]_{T_{\mathcal{R}}(r)}, \tag{17}
\]
where $\lambda_r > 0$ is a preselected constant. The projection $[\cdot]_{T_{\mathcal{R}}(r)}$ onto the tangent cone $T_{\mathcal{R}}(r)$ ensures that the leader's action $r$ remains inside the action set $\mathcal{R}$ [28]; see Appendix A for more details. From (16) and (17) we conclude that
\[
\dot{\hat J}(r, \hat\theta) = \nabla_r \hat J(r, \hat\theta)\big[-\lambda_r \nabla_r \hat J(r, \hat\theta)^\top\big]_{T_{\mathcal{R}}(r)} + \nabla_\theta \hat J(r, \hat\theta)\,\dot{\hat\theta} = -\big\|\big[-\lambda_r \nabla_r \hat J(r, \hat\theta)^\top\big]_{T_{\mathcal{R}}(r)}\big\|^2\big/\lambda_r + \nabla_\theta \hat J(r, \hat\theta)\,\dot{\hat\theta} = -\|\dot r\|^2/\lambda_r + \nabla_\theta \hat J(r, \hat\theta)\,\dot{\hat\theta},
\]
where the second equality follows from (9). We thus conclude that the optimization algorithm (17) guarantees
\[
\dot{\hat\theta} = 0 \;\Longrightarrow\; \dot{\hat J}(r, \hat\theta) \le -\|\dot r\|^2/\lambda_r \le 0.
\]
In the convergence analysis in Section IV, we will show that the leader's action $r$ converges asymptotically to the set of points for which the first-order necessary condition for optimality holds for the optimization problem (5).
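A matching sketch of the leader update (17), again as a single discretized step under illustrative assumptions: the action set is taken to be the traffic simplex used in the simulation example of Section VI, the action is assumed to stay in the relative interior of that simplex (where the tangent cone is the subspace of zero-sum directions, so the projection simply removes the mean), and the gradient of the predicted cost is approximated by finite differences instead of the chain-rule expression above.

```python
import numpy as np

def grad_r_J_hat(J, f_hat, theta_hat, r, h=1e-6):
    """Finite-difference approximation of the gradient of the predicted cost
    J_hat(r, theta_hat) = J(r, f_hat(theta_hat, r)) with respect to r."""
    g = np.zeros_like(r)
    base = J(r, f_hat(theta_hat, r))
    for i in range(r.size):
        rp = r.copy()
        rp[i] += h
        g[i] = (J(rp, f_hat(theta_hat, rp)) - base) / h
    return g

def leader_step(J, f_hat, theta_hat, r, lam_r=0.2, dt=1e-2):
    """One Euler step of the projected gradient flow (17) on the simplex
    {r >= 0, sum(r) = const}, assuming r stays in the relative interior."""
    v = -lam_r * grad_r_J_hat(J, f_hat, theta_hat, r)
    v_proj = v - v.mean()   # tangent cone of the simplex at interior points: zero-sum directions
    return r + dt * v_proj  # near the boundary, the full tangent-cone projection would be needed
```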
IV. CONVERGENCE ANALYSIS

We now present the main result of this paper.
Theorem 1.
Suppose that Assumption 1 holds. For each pair of given error thresholds $\varepsilon_{\rm obs} > \varepsilon'_{\rm obs} > 0$ in (14), the estimation and optimization algorithms (13) and (17) guarantee the following:
1) There exists a time $T \ge 0$ such that
\[
\|e_{\rm obs}(t)\| < \varepsilon_{\rm obs}, \quad \hat\theta(t) = \hat\theta(T) \quad \forall t \ge T. \tag{18}
\]
2) The first-order necessary condition for optimality holds asymptotically for the optimization problem (5), that is,
\[
\lim_{t \to \infty} \big[-\nabla_r \hat J\big(r(t), \hat\theta(T)\big)^\top\big]_{T_{\mathcal{R}}(r(t))} = 0. \tag{19}
\]

Essentially, item 1) ensures that the parameter estimate $\hat\theta$ converges in finite time to a point that is indistinguishable from the actual value $\theta$ using observations of the follower's action $a = \hat f(\theta, r)$ and the leader's cost $J(r, a)$, up to an error bounded by the threshold $\varepsilon_{\rm obs}$. Regarding item 2), the necessity of (19) for optimality is justified as follows.

Lemma 2. If $\hat r^*$ is a local optimum of the optimization problem (5) with some fixed $\hat\theta$, then
\[
\big[-\nabla_r \hat J(\hat r^*, \hat\theta)^\top\big]_{T_{\mathcal{R}}(\hat r^*)} = 0. \tag{20}
\]

Proof.
The results in [27, Th. 6.12, p. 207] allow us to conclude that $-\nabla_r \hat J(\hat r^*, \hat\theta)^\top \in N_{\mathcal{R}}(\hat r^*)$. Then (20) follows from (7)–(9).

Proof of Theorem 1.
As the right-hand sides of (13) and (17) are potentially discontinuous due to projections and switching, the proof of Theorem 1 utilizes results from differential inclusions theory; see Appendix A for the necessary preliminaries. First, we establish the existence of solutions for the system defined by (13) and (17).
Lemma 3.
For each $(\hat\theta_0, r_0) \in \Theta \times \mathcal{R}$, there exists a solution to the system defined by (13) and (17) on $\mathbb{R}_+$; that is, there exist absolutely continuous functions $\hat\theta : \mathbb{R}_+ \to \Theta$ and $r : \mathbb{R}_+ \to \mathcal{R}$ with $(\hat\theta(0), r(0)) = (\hat\theta_0, r_0)$ such that (13) and (17) hold almost everywhere on $\mathbb{R}_+$. Moreover, we have
\[
\hat\theta,\ \dot{\hat\theta},\ r,\ \dot r,\ e_{\rm obs},\ \dot e_{\rm obs} \in \mathcal{L}_\infty. \tag{21}
\]

Proof.
Lemma 3 follows from results on hysteresis switching in [29] and results on projected differential inclusions in [28]; see Appendix B-B for the complete proof.

Second, we establish item 1) of Theorem 1 via arguments along the lines of the proof of Barbalat's lemma [30, Lemma 3.2.6, p. 76]. We cannot use Barbalat's lemma directly since the switching signal $\lambda_e$ in (15) is not continuous but only piecewise continuous. Following (15), we see that $\|\hat\theta - \theta\|$ is monotonically decreasing. Then $\lim_{t\to\infty} \|\hat\theta(t) - \theta\|$, and thus
\[
\lim_{t\to\infty} \int_0^t \lambda_e(s)\,\|e_{\rm obs}(s)\|^2\,\mathrm{d}s, \tag{22}
\]
exists and is finite. On the other hand, (13) and (14) imply that (18) holds if there exists a time $T \ge 0$ such that
\[
\lambda_e(t) = 0 \quad \forall t \ge T. \tag{23}
\]
Assume (23) does not hold for any $T \ge 0$. Then (14) implies that there exists an unbounded increasing sequence $(t_k)_{k\in\mathbb{N}}$ with $t_1 > 0$ such that
\[
\lambda_e(t_k) = \lambda_\theta, \quad \|e_{\rm obs}(t_k)\| > \varepsilon'_{\rm obs} \quad \forall k \in \mathbb{N}. \tag{24}
\]
Next, we show that there exists an unbounded sequence $(s_k)_{k\in\mathbb{N}}$ with $s_k \in [t_k - \delta, t_k]$ such that
\[
\|e_{\rm obs}(t)\| > \varepsilon'_{\rm obs}, \quad \lambda_e(t) = \lambda_\theta \quad \forall k \in \mathbb{N},\ \forall t \in [s_k, s_k + \delta) \tag{25}
\]
with the constant
\[
\delta := \min\bigg\{t_1,\ \frac{\varepsilon_{\rm obs} - \varepsilon'_{\rm obs}}{\sup_{s \ge 0} \|\dot e_{\rm obs}(s)\|}\bigg\} > 0,
\]
where the inequality follows from $t_1 > 0$, $\varepsilon_{\rm obs} > \varepsilon'_{\rm obs}$, and $\dot e_{\rm obs} \in \mathcal{L}_\infty$ in (21). Indeed, for each $k \in \mathbb{N}$, consider the following two possibilities:
1) If $\|e_{\rm obs}(t)\| < \varepsilon_{\rm obs}$ for all $t \in [t_k - \delta, t_k]$, then (14) and $\lambda_e(t_k) = \lambda_\theta$ imply that (25) holds with $s_k = t_k - \delta$.
2) Otherwise, there exists an $s_k \in [t_k - \delta, t_k]$ such that $\|e_{\rm obs}(s_k)\| = \varepsilon_{\rm obs}$, and (25) follows from the definition of $\delta$ and (14).
Moreover, $(s_k)_{k\in\mathbb{N}}$ is unbounded as $(t_k)_{k\in\mathbb{N}}$ is unbounded. Following (25), we have
\[
\int_{s_k}^{s_k+\delta} \lambda_e(s)\,\|e_{\rm obs}(s)\|^2\,\mathrm{d}s > \lambda_\theta (\varepsilon'_{\rm obs})^2 \delta > 0
\]
for the unbounded sequence $(s_k)_{k\in\mathbb{N}}$, which contradicts the property that (22) exists and is finite. Therefore, there exists a time $T \ge 0$ such that (23), and thus (18), holds.

Finally, we prove item 2) of Theorem 1 based on the invariance principle for projected gradient descent, Proposition 1 in Appendix A. After the time $T$ from item 1), the system (17) becomes
\[
\dot r = \big[-\lambda_r \nabla_r \hat J(r, \hat\theta(T))^\top\big]_{T_{\mathcal{R}}(r)},
\]
which can be modeled using the projected dynamical system (39) in Appendix A with the state $x := r$ and the set $\mathcal{S} := \mathcal{R}$. The corresponding function $g$ in (39) is given by $g(x) := -\lambda_r \nabla_r \hat J(x, \hat\theta(T))^\top$, which satisfies (40) with $V(x) := \lambda_r \hat J(x, \hat\theta(T))$. Then (19) follows from (41) in Proposition 1.

In Theorem 1, there is no claim that the parameter estimate $\hat\theta$ necessarily converges to the actual value $\theta$. However, this can be guaranteed if the following persistent excitation (PE) condition holds.

Assumption 2 (Persistent excitation). There exist constants $\tau_0, \alpha_0 > 0$ such that the gain matrix $K(r, a, \hat\theta)$ defined by (12) satisfies
\[
\int_t^{t+\tau_0} K(s)^\top K(s)\,\mathrm{d}s \ge \alpha_0 I \quad \forall t \ge 0, \tag{26}
\]
where we let $K(s) := K(r(s), a(s), \hat\theta(s))$ for brevity, and the inequality means that the difference of the left- and right-hand sides is a positive semidefinite matrix.

Theorem 2.
Suppose that Assumptions 1 and 2 hold. For each given threshold $\varepsilon_\theta > 0$, by setting
\[
\varepsilon_\theta \sqrt{\alpha_0/\tau_0} \ge \varepsilon_{\rm obs} > \varepsilon'_{\rm obs} > 0 \tag{27}
\]
in (14), the estimation and optimization algorithms (13) and (17) guarantee the following:
1) There exists a time $T \ge 0$ such that (18) holds and
\[
\|\hat\theta(T) - \theta\| < \varepsilon_\theta. \tag{28}
\]
2) The first-order necessary condition for optimality holds asymptotically for the optimization problem (5), that is, (19) holds.

Proof.
As (18) and (19) are established in Theorem 1, it remains to prove (28). To this effect, we note that the inequality in (18) implies
\[
\int_T^{T+\tau_0} \|e_{\rm obs}(s)\|^2\,\mathrm{d}s < \varepsilon_{\rm obs}^2 \tau_0 \le \alpha_0 \varepsilon_\theta^2,
\]
where the second inequality follows from (27). On the other hand, (11) and the equality in (18) imply
\[
\int_T^{T+\tau_0} \|e_{\rm obs}(s)\|^2\,\mathrm{d}s = \int_T^{T+\tau_0} \|K(s)(\hat\theta(T) - \theta)\|^2\,\mathrm{d}s = (\hat\theta(T) - \theta)^\top \bigg(\int_T^{T+\tau_0} K(s)^\top K(s)\,\mathrm{d}s\bigg)(\hat\theta(T) - \theta) \ge \alpha_0 \|\hat\theta(T) - \theta\|^2,
\]
where the inequality follows from the PE condition (26). Combining the upper and lower bounds above yields (28).

Remark 1. In view of (12), a sufficient condition for (26) is
\[
\int_t^{t+\tau_0} \nabla_\theta \hat f(r(s))^\top \nabla_\theta \hat f(r(s))\,\mathrm{d}s \ge \alpha_0 I \quad \forall t \ge 0. \tag{29}
\]
The PE condition (29) is more restrictive than (26); however, it can be checked without knowing the parameter estimate $\hat\theta$.

Remark 2. From the proof of Theorem 2, we see that (28) only requires (26) or (29) to hold at $t = T$ for the time $T$ from Theorem 1. Therefore, to ensure (28) in practice, it suffices to enforce (26) or (29) when $\lambda_e$ in (14) has been set to zero.
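Remark 2 suggests enforcing excitation only after the adaptation has switched off. One numerical way to do this, sketched below under illustrative assumptions (the window length, dither size, and the zero-mean noise used to respect the traffic-sum constraint of Section VI are our own choices), is to approximate the Gramian in (29) over a sliding window of past actions and to superimpose a small dither on the leader's action whenever the Gramian's smallest eigenvalue falls below $\alpha_0$ while $\lambda_e = 0$.

```python
import numpy as np

def pe_gramian(jac_theta_f, r_traj, dt):
    """Approximate the integral in (29) over a window of past leader actions r_traj."""
    n_theta = np.atleast_2d(jac_theta_f(r_traj[0])).shape[-1]
    W = np.zeros((n_theta, n_theta))
    for r in r_traj:
        Phi = np.atleast_2d(jac_theta_f(r))   # rows: response components, columns: parameters
        W += Phi.T @ Phi * dt
    return W

def maybe_dither(r, lam_e, W, alpha0, dither=0.01, rng=np.random.default_rng(0)):
    """Add a small zero-mean perturbation to r when adaptation is off but (29) fails."""
    if lam_e == 0.0 and np.linalg.eigvalsh(W).min() < alpha0:
        noise = rng.normal(scale=dither, size=r.shape)
        noise -= noise.mean()                 # keep the total (e.g., routed traffic) unchanged
        return np.clip(r + noise, 0.0, None)  # guard against small negative entries
    return r
```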
V. MODEL MISMATCH

Up till now we assumed that there was some unknown value $\theta$ within the parameter set $\Theta$ such that (3) holds for the follower's strategy $f$ and the parameterized function $\hat f$ that is known to the leader. In this section, we consider the case where such a perfect match may not exist, and study the effect of a bounded mismatch between $f(r)$ and $\hat f(\theta, r)$.

Assumption 3 (Mismatch). The follower's strategy $f$ is continuous, and there is an unknown value $\theta \in \Theta$ such that
\[
\|\hat f(\theta, r) - f(r)\| \le \varepsilon_f/\kappa \quad \forall r \in \mathcal{R} \tag{30}
\]
with the constant
\[
\kappa := \max_{r \in \mathcal{R},\,\hat\theta \in \Theta} \left\|\begin{bmatrix} I \\ \int_0^1 \nabla_a J\big(r, \rho \hat f(\hat\theta, r) + (1-\rho) f(r)\big)\,\mathrm{d}\rho \end{bmatrix}\right\|
\]
and some known constant $\varepsilon_f \ge 0$.

The follower's action set $\mathcal{A}$ is still unknown to the leader, except that $\mathcal{A} \subset \bigcup_{\hat\theta \in \Theta,\, r \in \mathcal{R}} (\varepsilon_f/\kappa)\mathbb{B}\big(\hat f(\hat\theta, r)\big)$ as implied by (30). Assumption 3 generalizes the condition (3), as (3) is equivalent to (30) with $\varepsilon_f = 0$.

Similar arguments to those in the proof of Lemma 1 show that the observation error $e_{\rm obs}$ now satisfies
\[
e_{\rm obs} = K(r, a, \hat\theta)(\hat\theta - \theta) + e_f \tag{31}
\]
with the gain matrix $K(r, a, \hat\theta)$ defined by (12) and the mismatch error
\[
e_f := \begin{bmatrix} I \\ \int_0^1 \nabla_a J\big(r, \rho \hat f(\hat\theta, r) + (1-\rho)a\big)\,\mathrm{d}\rho \end{bmatrix} \big(\hat f(\theta, r) - a\big).
\]
It turns out that Assumption 3 guarantees
\[
\|e_f(t)\| \le \varepsilon_f \quad \forall t \ge 0. \tag{32}
\]
The following two results extend Theorems 1 and 2 to the current case without a perfect match between $f(r)$ and $\hat f(\theta, r)$ for some $\theta \in \Theta$.

Theorem 3.
Suppose that Assumptions 1 and 3 hold. For each pair of given thresholds $\varepsilon_{\rm obs} > \varepsilon'_{\rm obs} > \varepsilon_f$ in (14), the estimation and optimization algorithms (13) and (17) guarantee the same convergence results as those in Theorem 1.

Proof. First, Lemma 3 still holds as $f$ is continuous.

Second, we establish item 1) of Theorem 3 via similar arguments to those in Section III-A and the second step of the proof of Theorem 1. Using the estimation algorithm (13) with $\varepsilon_{\rm obs} > \varepsilon'_{\rm obs} > \varepsilon_f$ in (14), the estimation error $\hat\theta - \theta$ now satisfies
\[
\frac{\mathrm{d}\|\hat\theta - \theta\|^2}{\mathrm{d}t} = 2(\hat\theta - \theta)^\top \big[-\lambda_e K(r, a, \hat\theta)^\top e_{\rm obs}\big]_{T_\Theta(\hat\theta)} \le 2(\hat\theta - \theta)^\top \big(-\lambda_e K(r, a, \hat\theta)^\top e_{\rm obs}\big) = -2\lambda_e (e_{\rm obs} - e_f)^\top e_{\rm obs},
\]
where the inequality follows from (6)–(8). Next, we prove that
\[
\frac{\mathrm{d}\|\hat\theta - \theta\|^2}{\mathrm{d}t} \le -2\lambda_e (e_{\rm obs} - e_f)^\top e_{\rm obs} \le 0, \tag{33}
\]
in which $\mathrm{d}\|\hat\theta - \theta\|^2/\mathrm{d}t = 0$ if and only if $\lambda_e = 0$. Indeed, consider the following two possibilities:
1) If $\lambda_e = 0$, then $\mathrm{d}\|\hat\theta - \theta\|^2/\mathrm{d}t = 0$.
2) Otherwise $\lambda_e = \lambda_\theta$, and thus $\|e_{\rm obs}\| > \varepsilon'_{\rm obs} > \varepsilon_f \ge \|e_f\|$ following (14) and (32). Hence
\[
\frac{\mathrm{d}\|\hat\theta - \theta\|^2}{\mathrm{d}t} \le -2\lambda_\theta (e_{\rm obs} - e_f)^\top e_{\rm obs} \le -2\lambda_\theta \big(\|e_{\rm obs}\| - \|e_f\|\big)\|e_{\rm obs}\| < -2\lambda_\theta (\varepsilon'_{\rm obs} - \varepsilon_f)\varepsilon'_{\rm obs} < 0.
\]
Following (33), we see that $\|\hat\theta - \theta\|$ is monotonically decreasing. Thus $\lim_{t\to\infty}\|\hat\theta(t) - \theta\|$, and therefore
\[
\lim_{t\to\infty} \int_0^t \lambda_e(s)\big(e_{\rm obs}(s) - e_f(s)\big)^\top e_{\rm obs}(s)\,\mathrm{d}s, \tag{34}
\]
exists and is finite. On the other hand, (13) and (14) imply that (18) holds if there exists a time $T \ge 0$ such that (23) holds. Assume (23) does not hold for any $T \ge 0$. Then the analysis in the second step of the proof of Theorem 1 shows that there exists an unbounded sequence $(s_k)_{k\in\mathbb{N}}$ with $s_k \in [t_k - \delta, t_k]$ such that (25) holds. Consequently, we have
\[
\int_{s_k}^{s_k+\delta} \lambda_e(s)\big(e_{\rm obs}(s) - e_f(s)\big)^\top e_{\rm obs}(s)\,\mathrm{d}s \ge \lambda_\theta \int_{s_k}^{s_k+\delta} \big(\|e_{\rm obs}(s)\| - \|e_f(s)\|\big)\|e_{\rm obs}(s)\|\,\mathrm{d}s > \lambda_\theta (\varepsilon'_{\rm obs} - \varepsilon_f)\varepsilon'_{\rm obs}\delta > 0
\]
for the unbounded sequence $(s_k)_{k\in\mathbb{N}}$, which, combined with (33), contradicts the property that (34) exists and is finite. Therefore, there exists a time $T \ge 0$ such that (23), and thus (18), holds.

Finally, item 2) of Theorem 3 is the same as item 2) of Theorem 1, as the optimization process is the same after the adaptation of $\hat\theta$ stops.

Theorem 4.
Suppose that Assumptions 1–3 hold. For each given threshold $\varepsilon_\theta > \varepsilon_f\sqrt{\tau_0/\alpha_0}$, by setting
\[
\varepsilon_\theta\sqrt{\alpha_0/\tau_0} - \varepsilon_f \ge \varepsilon_{\rm obs} > \varepsilon'_{\rm obs} > \varepsilon_f \tag{35}
\]
in (14), the estimation and optimization algorithms (13) and (17) guarantee the same convergence results as those in Theorem 2.

Proof. As (18) and (19) are established in Theorem 3, it remains to prove (28). To this effect, we note that the inequality in (18) implies
\[
\int_T^{T+\tau_0} \|e_{\rm obs}(s)\|^2\,\mathrm{d}s < \varepsilon_{\rm obs}^2 \tau_0 \le \big(\varepsilon_\theta\sqrt{\alpha_0} - \varepsilon_f\sqrt{\tau_0}\big)^2,
\]
where the second inequality follows from (35). On the other hand, (31) and the equality in (18) imply
\[
\|e_{\rm obs}(s)\|^2 = \|K(s)(\hat\theta(T) - \theta) + e_f(s)\|^2 \ge \Big(1 - \tfrac{\varepsilon_f}{\varepsilon_\theta}\sqrt{\tfrac{\tau_0}{\alpha_0}}\Big)\|K(s)(\hat\theta(T) - \theta)\|^2 + \Big(1 - \tfrac{\varepsilon_\theta}{\varepsilon_f}\sqrt{\tfrac{\alpha_0}{\tau_0}}\Big)\|e_f(s)\|^2 = \big(\varepsilon_\theta\sqrt{\alpha_0} - \varepsilon_f\sqrt{\tau_0}\big)\bigg(\frac{\|K(s)(\hat\theta(T) - \theta)\|^2}{\varepsilon_\theta\sqrt{\alpha_0}} - \frac{\|e_f(s)\|^2}{\varepsilon_f\sqrt{\tau_0}}\bigg)
\]
for all $s \ge T$, where the inequality follows from Young's inequality $2ab \le \epsilon a^2 + b^2/\epsilon$ for all $a, b \in \mathbb{R}$ and $\epsilon > 0$. Note that $\varepsilon_\theta > \varepsilon_f\sqrt{\tau_0/\alpha_0}$ implies $\varepsilon_\theta\sqrt{\alpha_0} - \varepsilon_f\sqrt{\tau_0} > 0$. Hence we have
\[
\int_T^{T+\tau_0} \|e_{\rm obs}(s)\|^2\,\mathrm{d}s \ge \big(\varepsilon_\theta\sqrt{\alpha_0} - \varepsilon_f\sqrt{\tau_0}\big)\bigg(\int_T^{T+\tau_0}\frac{\|K(s)(\hat\theta(T) - \theta)\|^2}{\varepsilon_\theta\sqrt{\alpha_0}}\,\mathrm{d}s - \int_T^{T+\tau_0}\frac{\|e_f(s)\|^2}{\varepsilon_f\sqrt{\tau_0}}\,\mathrm{d}s\bigg) \ge \big(\varepsilon_\theta\sqrt{\alpha_0} - \varepsilon_f\sqrt{\tau_0}\big)\bigg(\frac{\sqrt{\alpha_0}}{\varepsilon_\theta}\|\hat\theta(T) - \theta\|^2 - \varepsilon_f\sqrt{\tau_0}\bigg),
\]
where the second inequality follows from the PE condition (26) and the inequality (32). Combining the upper and lower bounds above yields (28).

Remark 3. The results in this section also hold with the slightly less conservative condition
\[
\min_{\theta \in \Theta}\ \max_{r \in \mathcal{R},\,\hat\theta \in \Theta} \left\|\begin{bmatrix} I \\ \int_0^1 \nabla_a J\big(r, \rho \hat f(\hat\theta, r) + (1-\rho) f(r)\big)\,\mathrm{d}\rho \end{bmatrix}\big(\hat f(\theta, r) - f(r)\big)\right\| \le \varepsilon_f \tag{36}
\]
in place of (30) in Assumption 3.
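For completeness, a small helper that picks thresholds consistent with condition (35) of Theorem 4; the margin parameter and the function itself are illustrative conveniences, not part of the paper.

```python
import math

def pick_thresholds(eps_theta, alpha0, tau0, eps_f, margin=0.5):
    """Choose (eps_obs, eps_obs_prime) satisfying condition (35):
    eps_theta*sqrt(alpha0/tau0) - eps_f >= eps_obs > eps_obs_prime > eps_f."""
    upper = eps_theta * math.sqrt(alpha0 / tau0) - eps_f
    if upper <= eps_f:
        raise ValueError("need eps_theta > eps_f * sqrt(tau0 / alpha0)")
    eps_obs = upper
    eps_obs_prime = eps_f + margin * (eps_obs - eps_f)  # any value strictly between eps_f and eps_obs
    return eps_obs, eps_obs_prime
```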
VI. SIMULATION EXAMPLE

We illustrate the estimation and optimization algorithms and the convergence results via a simulation example motivated by link-flooding distributed denial-of-service (DDoS) attacks, such as the Crossfire attack [24].

Consider a communication network consisting of $L$ parallel links connecting a source to a destination. The set of links is denoted by $\mathcal{L} := \{1, \ldots, L\}$. Suppose that a router (leader) distributes $R$ units of legitimate traffic among the parallel links, and an attacker (follower) disrupts communication by injecting $A$ units of superfluous traffic on them. The router's action is represented by an $L$-vector of the desired legitimate traffic on each link, $r \in \mathcal{R} := \{r \in \mathbb{R}_+^L : \sum_{l \in \mathcal{L}} r_l = R\}$, and the attacker's action is represented by an $L$-vector of the attack traffic, $a \in \mathcal{A} := \{a \in \mathbb{R}_+^L : \sum_{l \in \mathcal{L}} a_l = A\}$. Every link $l \in \mathcal{L}$ is subject to a constant capacity $c > 0$ that upper-bounds the total traffic on $l$. When $r_l + a_l > c$, the actual legitimate traffic on $l$ is decreased to $u_l := \min\{r_l, \max\{c - a_l, 0\}\}$. The router aims to maximize the total actual legitimate traffic, whereas the attacker aims to minimize it. Hence the router's cost is
\[
J(r, a) := -\sum_{l \in \mathcal{L}} u_l,
\]
and the attacker's cost is
\[
H(a, r) := \sum_{l \in \mathcal{L}} u_l = -J(r, a). \tag{37}
\]
Clearly, neither the router nor the attacker has an incentive to assign more traffic to a link than its capacity. Hence we assume $r_l, a_l \in [0, c]$ for all $l \in \mathcal{L}$. For most nontrivial cases, the game defined by $(\mathcal{R}, \mathcal{A}, J, H)$ has no Nash equilibrium. On the other hand, in [31, Cor. 5] it was established that there exists a Stackelberg equilibrium action given by $r^*_l = R/L$ for all $l \in \mathcal{L}$.

If the router knew that the attacker's cost function was indeed defined by (37), it could play the Stackelberg equilibrium action $r^*$. However, we consider the general scenario where it does not and, instead, adopts the adaptive learning approach proposed in this paper to construct its optimal action. To this effect, the router approximates the attacker's strategy $f = (f_1, \ldots, f_L)$ using a quasi-RBF model defined by
\[
\hat f_l(\theta, r) := \sum_{j_1=1}^{n_{\rm rbf}}\cdots\sum_{j_{L-1}=1}^{n_{\rm rbf}} \theta_{l,j_1,\ldots,j_{L-1}}\,F_{j_1,\ldots,j_{L-1}}(r) = \sum_{j_1=1}^{n_{\rm rbf}}\cdots\sum_{j_{L-1}=1}^{n_{\rm rbf}} \theta_{l,j_1,\ldots,j_{L-1}}\,\phi\big(r - r^c_{j_1,\ldots,j_{L-1}}\big) \tag{38}
\]
for $l \in \mathcal{L}$, where the RBF is defined by
\[
\phi(x) := \mathbb{1}_{(-c/(2n_{\rm rbf}),\, c/(2n_{\rm rbf})]^{L-1}}(x)
\]
with the indicator function
\[
\mathbb{1}_{\mathcal{S}}(x) := \begin{cases} 1 & \text{if } x \in \mathcal{S}; \\ 0 & \text{if } x \notin \mathcal{S}, \end{cases}
\]
the centers are defined by
\[
r^c_{j_1,\ldots,j_{L-1}} := \Big(\tfrac{(2j_1-1)c}{2n_{\rm rbf}}, \ldots, \tfrac{(2j_{L-1}-1)c}{2n_{\rm rbf}}\Big),
\]
and $n_{\rm rbf} \in \mathbb{N}$ is the number of kernels in each of the first $L-1$ scalar components of $\mathcal{R}$; that is, the router approximates each scalar component of $f$ using a grid of $n_{\rm rbf}^{L-1}$ hypercubes. Hence $n_\theta := L\,n_{\rm rbf}^{L-1}$ and $\Theta := [0, c]^{n_\theta}$ in (3). (The maps $r \mapsto J(r, a)$ and $r \mapsto \hat f(\hat\theta, r)$ in this example actually violate the smoothness conditions in Assumption 1, as they are only piecewise continuously differentiable. However, these conditions are only needed so that the optimization algorithm (17) is well defined and does not lead to chattering. In this example, the set of non-differentiable points in $\mathcal{R}$ has measure zero and does not affect the simulation.)

In the following, we simulate our estimation and learning algorithms (13) and (17) for networks with two or three parallel links. In these simulations, the constants $\varepsilon_{\rm obs} = 2\varepsilon'_{\rm obs}$, $\lambda_\theta$, and $\lambda_r$ are set to fixed values between 0 and 1
, and the initial values of the parameter estimate $\hat\theta$ and the router's action $r$ are randomly generated.
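The sketch below encodes the game just described: the effective traffic $u_l$, the router's cost $J$, and a greedy attacker best response that floods the links carrying the most (optionally re-weighted) legitimate traffic, mimicking the best responses quoted from [31, Cor. 2] in the next two subsections. The weighting hook and all function names are our own illustrative choices, and the greedy rule should be read as a reconstruction rather than the exact statement of [31].

```python
import numpy as np

def effective_traffic(r, a, c=1.0):
    """Actual legitimate traffic per link: u_l = min{r_l, max{c - a_l, 0}}."""
    return np.minimum(r, np.maximum(c - a, 0.0))

def router_cost(r, a, c=1.0):
    return -effective_traffic(r, a, c).sum()            # J(r, a)

def attacker_best_response(r, A=1, c=1.0, weights=None):
    """Greedy best response for the zero-sum cost (37): put attack traffic on the links
    with the largest (optionally weighted) legitimate traffic, using A units in total."""
    w = np.ones_like(r) if weights is None else np.asarray(weights, dtype=float)
    a = np.zeros_like(r)
    order = np.argsort(-(w * r))                         # most attractive links first
    budget = float(A)
    for l in order:
        a[l] = min(c, budget)
        budget -= a[l]
        if budget <= 0:
            break
    return a

# two-link example with R = 1, A = 1, c = 1
r = np.array([0.4, 0.6])
a = attacker_best_response(r)                            # floods link 2
print(a, router_cost(r, a))                              # -> [0. 1.] -0.4
```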
A. Network with two parallel links

[Fig. 1: A network with one source S, one destination D, and two parallel links (assuming that $a_l \le c$ for all $l \in \mathcal{L}$).]

Consider the network with $L = 2$ parallel links in Fig. 1, capacity $c = 1$, total desired legitimate traffic $R = Lc/2$, and attack budget $A = \lceil Lc/2 \rceil = 1$. We set the constant $n_{\rm rbf} = 4$. Then $n_\theta = 8$ and the parameter set $\Theta = [0, 1]^8$. Following [31, Cor. 2], the attacker's best response to a router action $r$ is to set $a_l = 1$ on the link $l \in \{1, 2\}$ with the larger $r_l$. Specifically, the actual value $\theta$ in (3), written in the tensor form of (38), is given by
\[
\theta = \begin{bmatrix} 0 & 0 & 1 & 1 \\ 1 & 1 & 0 & 0 \end{bmatrix}.
\]
As established in [31, Cor. 5], the Stackelberg equilibrium action is given by $r^* = (0.5, 0.5)$.

For the case without enforcing PE, shown in Fig. 2, the observation error $e_{\rm obs}$ converges to 0 in the first part of the simulation, and the router's action $r$ converges to the Stackelberg equilibrium action $r^* = (0.5, 0.5)$, even though the parameter estimate $\hat\theta$ does not converge to the actual value $\theta$, as shown in Fig. 3(a). In Fig. 4, we enforce PE by adding some random noise to $r$ for a short period of time while the observation error remains small. In this case, the observation error $e_{\rm obs}$ again converges to 0, the router's action $r$ converges to the Stackelberg equilibrium action $r^*$, and the parameter estimate $\hat\theta$ converges to the actual value $\theta$, as shown in Fig. 5(a).

[Fig. 2: Simulation results for $L = 2$ without PE. Panels: (a) observation error, (b) router's action, (c) router's actual and predicted costs, (d) attacker's cost. In the first part of the simulation, the observation error converges to $e_{\rm obs} = 0$, the router's action converges to the Stackelberg equilibrium action $r^* = (0.5, 0.5)$, and the router's and attacker's costs converge to $J = \hat J = -H = -0.5$; after the attacker switches to the new cost function $\bar H$, the router's action converges to an $\varepsilon$ Stackelberg action near $\bar r^*$ and the router's actual and predicted costs converge to a common value.]

[Fig. 3: Router's actual cost function $r \mapsto J(r, f(r))$ and predicted cost function $r \mapsto \hat J(\hat\theta(T), r)$ for $L = 2$ without PE, (a) before and (b) after the switch in the attacker's cost function. In both scenarios, the predicted cost function is accurate near the optimum $r^*$ but not everywhere.]

In both cases, with and without PE, we also simulate the scenario where, partway through the simulation, the attacker starts focusing more on disrupting link 1 by switching to a new cost function defined by
\[
\bar H(a, r) := u_1 + u_2/3.
\]
The results in [31, Cor. 2] allow us to conclude that, for the non-zero-sum game defined by $(\mathcal{R}, \mathcal{A}, J, \bar H)$, the attacker's best response to a router's action $r$ is to set $a_l = 1$ on the link $l \in \{1, 2\}$ that corresponds to the larger value in $\{r_1, r_2/3\}$. The corresponding actual value $\theta$ in (3), written in the tensor form of (38), becomes
\[
\theta = \begin{bmatrix} 0 & 1 & 1 & 1 \\ 1 & 0 & 0 & 0 \end{bmatrix}.
\]
As established in [31, Th. 4 and Cor. 5], there is no Stackelberg equilibrium action as defined by Definition 1 for $(\mathcal{R}, \mathcal{A}, J, \bar H)$; however, there are $\varepsilon$ Stackelberg actions near $\bar r^* = (0.25, 0.75)$ for sufficiently small $\varepsilon > 0$. The corresponding simulation results in Figs. 2, 3(b), 4, and 5(b) show that our adaptive learning approach is able to identify this switch in the attack, as the router's action $r$ converges to $\varepsilon$ Stackelberg actions near $\bar r^*$ in both Figs.
2 and 4, and, when PE is enforced, the parameter estimate $\hat\theta$ converges to the new actual value $\theta$, as shown in Fig. 5(b).

[Fig. 4: Simulation results for $L = 2$ with PE. Panels: (a) observation error, (b) router's action, (c) router's actual and predicted costs, (d) attacker's cost. The behavior mirrors Fig. 2: the observation error converges to $e_{\rm obs} = 0$ and the router's action converges to $r^* = (0.5, 0.5)$ in the first part of the simulation, and to an $\varepsilon$ Stackelberg action near $\bar r^*$ after the attacker switches cost functions.]

[Fig. 5: Router's actual cost function $r \mapsto J(r, f(r))$ and predicted cost function $r \mapsto \hat J(\hat\theta(T), r)$ for $L = 2$ with PE, (a) before and (b) after the switch in the attacker's cost function. In both scenarios, the predicted cost function is accurate everywhere.]

B. Network with three parallel links
Consider a network with $L = 3$ parallel links, capacity $c = 1$, total desired legitimate traffic $R = Lc/2 = 1.5$, and attack budget $A = \lceil Lc/2 \rceil = 2$. We set the constant $n_{\rm rbf} = 20$. Then $n_\theta = 1200$ and the parameter set $\Theta = [0, 1]^{1200}$. Following [31, Cor. 2], the attacker's best response to a router action is to set $a_{l_1} = a_{l_2} = 1$ on the two links $l_1, l_2 \in \{1, 2, 3\}$ with the largest $r_l$. However, there is no $\theta \in \Theta$ such that (3) holds for the attacker's strategy $f$ and the function $\hat f$ defined by (38). To see the mismatch between $f(r)$ and $\hat f(\theta, r)$, one could consider the router's actions $r \in \{(0.45 + \varepsilon,\ 0.5 - \varepsilon,\ 0.55) : 0 < \varepsilon < 0.05\}$. Comparing to (38), we see that $(r_1, r_2)$ belongs to the hypercube with the center $(0.475, 0.475)$. However, the attacker's best response satisfies $a_1 = 1$ if $\varepsilon > 0.025$ and $a_1 = 0$ if $\varepsilon < 0.025$. Hence the corresponding scalar component of $\theta$ will always yield an error of at least $0.5$, and thus the mismatch threshold $\varepsilon_f$ in (30) or (36) is at least $0.5A/c = 1$. However, in the simulations below we are able to obtain much smaller observation errors near the Stackelberg equilibrium action $r^* = (0.5, 0.5, 0.5)$ established in [31, Cor. 5]. Due to the mismatch, and motivated by Remark 3, we focus our attention on the performance of the estimation and learning algorithms (13) and (17).

[Fig. 6: Simulation results for $L = 3$ and the cost function $H$, with PE. Panels: (a) observation error, (b) router's action, (c) router's actual and predicted costs, (d) attacker's cost. The observation error converges to $e_{\rm obs} = 0$, the router's action converges to the Stackelberg equilibrium action $r^* = (0.5, 0.5, 0.5)$, and the router's and attacker's costs converge to $J = \hat J = -H = -0.5$.]

[Fig. 7: Simulation results for $L = 3$ and the cost function $\bar H$, with PE. Panels: (a) observation error, (b) router's action, (c) router's actual and predicted costs, (d) attacker's cost. The observation error converges to $e_{\rm obs} = 0$ and the router's action converges to an $\varepsilon$ Stackelberg action near $\bar r^*$.]

In Fig. 6, the observation error $e_{\rm obs}$ converges to 0, and the router's action $r$ converges to the Stackelberg equilibrium action $r^* = (0.5, 0.5, 0.5)$. In Fig. 7, the attacker starts focusing less on disrupting links 1 and 3 by switching to the new cost function defined by
\[
\bar H(a, r) := u_1 + 3u_2/2 + u_3.
\]
Following [31, Cor. 2], for the non-zero-sum game defined by $(\mathcal{R}, \mathcal{A}, J, \bar H)$, the attacker's best response to a router's action $r$ is to set $a_{l_1} = a_{l_2} = 1$ on the two links $l_1, l_2 \in \{1, 2, 3\}$ that correspond to the largest two values in $\{r_1, 3r_2/2, r_3\}$. As established in [31, Th. 4 and Cor. 5], there is no Stackelberg equilibrium action as defined by Definition 1 for $(\mathcal{R}, \mathcal{A}, J, \bar H)$, but there are $\varepsilon$ Stackelberg actions near a corresponding action $\bar r^*$ for sufficiently small $\varepsilon > 0$. The corresponding simulation results are plotted in Fig. 7, where the observation error $e_{\rm obs}$ converges to 0, and the router's action $r$ converges to an $\varepsilon$ Stackelberg action near $\bar r^*$.

VII. CONCLUSION
This paper considered a two-player Stackelberg game where the leader only had partial knowledge of the follower's action set and cost function, and had to estimate the follower's strategy using a family of parameterized functions. We designed an adaptive learning approach that simultaneously estimated the follower's strategy based on past observations and minimized the leader's cost predicted using the latest estimate. Our approach was proved to guarantee that the leader's actual and predicted costs become essentially indistinguishable in finite time, and that the first-order necessary condition for optimality holds asymptotically for the predicted cost. Moreover, we provided a PE condition for ensuring estimation accuracy and studied the case with model mismatch. The results were illustrated via a simulation example motivated by DDoS attacks.

A feature of our estimation algorithm (13) is that the norm of the estimation error $\hat\theta - \theta$ is monotonically decreasing and the observation error $e_{\rm obs}$ becomes bounded in norm by the preselected, arbitrarily small threshold $\varepsilon_{\rm obs}$ in finite time, regardless of how the leader's action $r$ is adjusted. A future research direction is to extend our adaptive learning approach by adopting more efficient estimation and optimization algorithms for more complex applications. Some preliminary results on employing a neural network to predict the follower's response can be found in [31]. Other future research topics include relaxing the affine condition in Assumption 1, and extending the current results to Stackelberg games on distributed networks.

APPENDIX A
PROJECTED DYNAMICAL SYSTEMS
Let $\mathcal{S} \subset \mathbb{R}^n$ be a compact convex set and $g : \mathcal{S} \to \mathbb{R}^n$ be a continuous function. In this section, we provide some preliminary results on the existence and convergence of solutions for the projected dynamical system
\[
\dot x = [g(x)]_{T_{\mathcal{S}}(x)}. \tag{39}
\]
The difficulty in analyzing (39) lies in the fact that its right-hand side is only defined on the domain $\mathcal{S}$ and is potentially discontinuous on the boundary $\partial\mathcal{S}$ due to the projection $[\cdot]_{T_{\mathcal{S}}(x)}$. Therefore, we consider the concept of a viable Carathéodory solution, that is, a solution to (39) on an interval $\mathcal{L} \subset \mathbb{R}_+$ is an absolutely continuous function $x : \mathcal{L} \to \mathcal{S}$ such that (39) holds almost everywhere on $\mathcal{L}$. In particular, it requires $x(t) \in \mathcal{S}$ for all $t \in \mathcal{L}$. The following result establishes the existence of solutions for the projected dynamical system (39).

Lemma 4.
For each $x_0 \in \mathcal{S}$, there exists a solution $x$ to (39) on $\mathbb{R}_+$ with $x(0) = x_0$.

Proof. As the function $g$ is continuous on the compact set $\mathcal{S}$, it is upper bounded in norm and thus a Marchaud map [28, Def. 2.2.4, p. 62]. Then Lemma 4 follows from [28, Th. 10.1.1, p. 354].

In the following, we establish an invariance principle for the case where $g$ is defined by a gradient descent process.

Proposition 1.
Suppose that the function $g$ in (39) satisfies
\[
g(z) = -\nabla V(z)^\top \quad \forall z \in \mathcal{S} \tag{40}
\]
for some function $V : \mathcal{S} \to \mathbb{R}$. Then every solution $x$ to (39) satisfies
\[
\lim_{t\to\infty} [g(x(t))]_{T_{\mathcal{S}}(x(t))} = 0. \tag{41}
\]

To prove Proposition 1, we extend the projected differential equation (39) to the differential inclusion
\[
\dot x \in G(x) \tag{42}
\]
with the set-valued function $G : \mathcal{S} \rightrightarrows \mathbb{R}^n$ defined by
\[
G(z) := \{g(z) - w : w \in N_{\mathcal{S}}(z)\} \cap \big\|g(z) - [g(z)]_{T_{\mathcal{S}}(z)}\big\|\,\mathbb{B}(g(z)). \tag{43}
\]
As $[g(z)]_{T_{\mathcal{S}}(z)} \in G(z)$ for all $z \in \mathcal{S}$, a solution to (39) is also a solution to (42). Hence we prove Proposition 1 by applying an invariance theorem for differential inclusions to (42), which requires the following continuity property.

Lemma 5.
The set-valued function $G$ defined by (43) is upper semicontinuous on $\mathcal{S}$.

Proof. As $\mathcal{S}$ is compact and convex, the set-valued map $z \mapsto T_{\mathcal{S}}(z)$ is lower semicontinuous on $\mathcal{S}$ [28, Th. 5.1.7, p. 162]. Also, as $g$ is continuous on $\mathcal{S}$, the map $(z, w) \mapsto -\|g(z) - w\|$ is continuous on $\mathcal{S} \times \mathbb{R}^n$. Hence the map $z \mapsto \sup_{w \in T_{\mathcal{S}}(z)} -\|g(z) - w\| = -\inf_{w \in T_{\mathcal{S}}(z)} \|g(z) - w\| = -\big\|g(z) - [g(z)]_{T_{\mathcal{S}}(z)}\big\|$ is lower semicontinuous on $\mathcal{S}$ [28, Th. 2.1.6, p. 59], that is, the map $z \mapsto \big\|g(z) - [g(z)]_{T_{\mathcal{S}}(z)}\big\|$ is upper semicontinuous on $\mathcal{S}$. Moreover, the map $z \mapsto \{g(z) - w : w \in N_{\mathcal{S}}(z)\}$ is closed. Hence $G$ is upper semicontinuous on $\mathcal{S}$ [28, Cor. 2.2.3, p. 61].

Proof of Proposition 1.
Suppose that
\[
g(z)^\top w \ge \big\|[g(z)]_{T_{\mathcal{S}}(z)}\big\|^2 \quad \forall z \in \mathcal{S},\ \forall w \in G(z). \tag{44}
\]
Then the function $V$ satisfies
\[
\nabla V(z)\,w \le -\big\|[g(z)]_{T_{\mathcal{S}}(z)}\big\|^2 \quad \forall z \in \mathcal{S},\ \forall w \in G(z).
\]
Note that the set-valued function $G$ is upper semicontinuous, and the set $G(z)$ is nonempty, compact, and convex for every $z \in \mathcal{S}$. Hence the invariance theorem [33, Th. 2.11] implies that every solution to (42), and therefore every solution $x$ to (39), approaches the largest invariant set in $\{z \in \mathcal{S} : \|[g(z)]_{T_{\mathcal{S}}(z)}\| = 0\}$. Hence (41) holds as $g$ is continuous on the compact set $\mathcal{S}$. (The extension from the projected differential equation (39) to the differential inclusion (42) is inspired by similar extensions in [32] and [28, p. 354], and is specifically designed to simplify the proof of Proposition 1.)

It remains to show that (44) holds. Consider arbitrary $z \in \mathcal{S}$ and $w \in G(z)$. First, as $g(z) - w \in N_{\mathcal{S}}(z)$, from (7) and (8) we have $(g(z) - w)^\top [g(z)]_{T_{\mathcal{S}}(z)} \le 0$, and thus
\[
\|w\|^2 - \big\|[g(z)]_{T_{\mathcal{S}}(z)}\big\|^2 \ge \|w\|^2 - \big\|[g(z)]_{T_{\mathcal{S}}(z)}\big\|^2 + 2(g(z) - w)^\top [g(z)]_{T_{\mathcal{S}}(z)} = \big\|w - [g(z)]_{T_{\mathcal{S}}(z)}\big\|^2 \ge 0,
\]
where the equality follows partially from (9). Second, as $w \in G(z) \subset \big\|g(z) - [g(z)]_{T_{\mathcal{S}}(z)}\big\|\,\mathbb{B}(g(z))$, we have $\|g(z) - w\| \le \big\|g(z) - [g(z)]_{T_{\mathcal{S}}(z)}\big\|$, and thus
\[
2 g(z)^\top w = \|w\|^2 + \|g(z)\|^2 - \|g(z) - w\|^2 \ge \|w\|^2 + \|g(z)\|^2 - \big\|g(z) - [g(z)]_{T_{\mathcal{S}}(z)}\big\|^2 = \|w\|^2 + \big\|[g(z)]_{T_{\mathcal{S}}(z)}\big\|^2 \ge 2\big\|[g(z)]_{T_{\mathcal{S}}(z)}\big\|^2,
\]
where the second equality follows partially from (9). Hence (44) holds.

APPENDIX B
PROOF OF TECHNICAL LEMMAS
A. Proof of Lemma 1
As the map $\hat\theta \mapsto \hat f(\hat\theta, r)$ is affine, its Jacobian matrix $\nabla_\theta \hat f(r)$ is independent of $\hat\theta$. Thus for a fixed $r \in \mathcal{R}$, we have
\[
\hat f(\hat\theta, r) - a = \hat f(\hat\theta, r) - \hat f(\theta, r) = \nabla_\theta \hat f(r)(\hat\theta - \theta).
\]
Next, consider the function $g : [0, 1] \to \mathbb{R}$ defined by
\[
g(\rho) := J\big(r, \rho \hat f(\hat\theta, r) + (1-\rho)a\big),
\]
which is continuously differentiable as $J$ is continuously differentiable. Then
\[
\hat J(r, \hat\theta) - J(r, a) = g(1) - g(0) = \int_0^1 \frac{\mathrm{d}g(\rho)}{\mathrm{d}\rho}\,\mathrm{d}\rho,
\]
in which
\[
\frac{\mathrm{d}g(\rho)}{\mathrm{d}\rho} = \nabla_a J\big(r, \rho \hat f(\hat\theta, r) + (1-\rho)a\big)\big(\hat f(\hat\theta, r) - a\big) = \nabla_a J\big(r, \rho \hat f(\hat\theta, r) + (1-\rho)a\big)\nabla_\theta \hat f(r)(\hat\theta - \theta).
\]
Hence
\[
\hat J(r, \hat\theta) - J(r, a) = \int_0^1 \nabla_a J\big(r, \rho \hat f(\hat\theta, r) + (1-\rho)a\big)\nabla_\theta \hat f(r)(\hat\theta - \theta)\,\mathrm{d}\rho = \bigg(\int_0^1 \nabla_a J\big(r, \rho \hat f(\hat\theta, r) + (1-\rho)a\big)\,\mathrm{d}\rho\bigg)\nabla_\theta \hat f(r)(\hat\theta - \theta).
\]

B. Proof of Lemma 3
Consider an arbitrary $(\hat\theta_0, r_0) \in \Theta \times \mathcal{R}$, and let $\lambda_e(0)$ be the corresponding value given by (10) and (14). Suppose $\lambda_e(0) = \lambda_\theta$, that is, $\|e_{\rm obs}(0)\| > \varepsilon'_{\rm obs}$. In the following, we construct a solution to (13) and (17) on $\mathbb{R}_+$ with $(\hat\theta(0), r(0)) = (\hat\theta_0, r_0)$ recursively. The case where $\lambda_e(0) = 0$ can be proved based on the same construction starting with the second step.

First, consider the system defined by (13) and (17) with $\lambda_e \equiv \lambda_\theta$, that is,
\[
\dot{\hat\theta} = \big[-\lambda_\theta K(r, f(r), \hat\theta)^\top K(r, f(r), \hat\theta)(\hat\theta - \theta)\big]_{T_\Theta(\hat\theta)}, \qquad \dot r = \big[-\lambda_r \nabla_r \hat J(r, \hat\theta)^\top\big]_{T_{\mathcal{R}}(r)}, \tag{45}
\]
which can be modeled using the projected dynamical system (39) in Appendix A with the state $x := (\hat\theta, r)$ and the set $\mathcal{S} := \Theta \times \mathcal{R}$. In particular, $f$ is continuous due to (3) and Assumption 1. Then Lemma 4 in Appendix A implies that there exists a solution $(\hat\theta_1, r_1)$ to (45) on $\mathbb{R}_+$ with $(\hat\theta_1(0), r_1(0)) = (\hat\theta_0, r_0)$. Consider the corresponding observation error $e_{{\rm obs},1}$ and switching signal $\lambda_{e,1}$ defined by (10) and (14), and let $t_1 := \inf\{t > 0 : \|e_{{\rm obs},1}(t)\| \le \varepsilon'_{\rm obs}\}$. Then $(\hat\theta_1, r_1)$ is a solution to (13) and (17) on $[0, t_1)$ with $(\hat\theta_1(0), r_1(0)) = (\hat\theta_0, r_0)$. If $t_1 = \infty$ then the proof is complete. Otherwise, $\|e_{{\rm obs},1}(t_1)\| = \varepsilon'_{\rm obs}$ and thus $\lambda_{e,1}(t_1) = 0$, and we continue with the second step below.

Second, consider the system defined by (13) and (17) with $\lambda_e \equiv 0$, that is,
\[
\dot{\hat\theta} = 0, \qquad \dot r = \big[-\lambda_r \nabla_r \hat J(r, \hat\theta)^\top\big]_{T_{\mathcal{R}}(r)}, \tag{46}
\]
which can also be modeled using the projected dynamical system (39) in Appendix A with the state $x := (\hat\theta, r)$ and the set $\mathcal{S} := \Theta \times \mathcal{R}$. Then Lemma 4 in Appendix A implies that there exists a solution $(\hat\theta_2, r_2)$ to (46) on $[t_1, \infty)$ with $(\hat\theta_2(t_1), r_2(t_1)) = (\hat\theta_1(t_1), r_1(t_1))$. Consider the corresponding observation error $e_{{\rm obs},2}$ and switching signal $\lambda_{e,2}$ defined by (10) and (14), and let $t_2 := \inf\{t \ge t_1 : \|e_{{\rm obs},2}(t)\| \ge \varepsilon_{\rm obs}\}$. Then $(\hat\theta_2, r_2)$ is a solution to (13) and (17) on $[t_1, t_2)$ with $(\hat\theta_2(t_1), r_2(t_1)) = (\hat\theta_1(t_1), r_1(t_1))$. If $t_2 = \infty$ then the proof is complete. Otherwise, $\|e_{{\rm obs},2}(t_2)\| = \varepsilon_{\rm obs}$ and thus $\lambda_{e,2}(t_2) = \lambda_\theta$, and we continue with the first step above.

In this way, we obtain an increasing sequence $(t_k)_{k\in\mathbb{N}}$ with $t_0 = 0$ and a corresponding sequence $(\hat\theta_k, r_k)_{k \ge 1}$ of absolutely continuous functions $\hat\theta_k : [t_{k-1}, \infty) \to \Theta$ and $r_k : [t_{k-1}, \infty) \to \mathcal{R}$. Moreover, following (10), (45), (46), Assumption 1, and the property that $(\hat\theta_k(t), r_k(t)) \in \Theta \times \mathcal{R}$ for all $k \ge 1$ and $t \ge t_{k-1}$, there exists a constant $M \ge 0$ such that $\|\dot e_{{\rm obs},k}(t)\| \le M$ for all $k \ge 1$ and $t \ge t_{k-1}$. Hence
\[
t_k - t_{k-1} \ge (\varepsilon_{\rm obs} - \varepsilon'_{\rm obs})/M \quad \forall k \ge 2,
\]
and thus $\lim_{k\to\infty} t_k = \infty$. Therefore, the absolutely continuous functions $\hat\theta : \mathbb{R}_+ \to \Theta$ and $r : \mathbb{R}_+ \to \mathcal{R}$ defined by
\[
\hat\theta(t) := \hat\theta_k(t), \quad r(t) := r_k(t), \quad k \ge 1,\ t \in [t_{k-1}, t_k)
\]
form a solution to (13) and (17) on $\mathbb{R}_+$ with $(\hat\theta(0), r(0)) = (\hat\theta_0, r_0)$. The proof is completed by noticing that (21) follows from (10), (45), (46), Assumption 1, and the property that $(\hat\theta(t), r(t)) \in \Theta \times \mathcal{R}$ for all $t \ge 0$.

REFERENCES
[1] D. Fudenberg and J. Tirole, Game Theory. MIT Press, 1991.
[2] T. Başar and G. J. Olsder, Dynamic Noncooperative Game Theory, 2nd ed. SIAM, 1999.
[3] T. Alpcan and T. Başar, Network Security: A Decision and Game-Theoretic Approach. Cambridge University Press, 2010.
[4] J. P. Hespanha, Noncooperative Game Theory: An Introduction for Engineers and Computer Scientists. Princeton University Press, 2017.
[5] G. W. Brown, "Iterative solution of games by fictitious play," in Act. Anal. Prod. Alloc. John Wiley & Sons, 1951, pp. 374–376.
[6] J. Robinson, "An iterative method of solving a game," Ann. Math., vol. 54, no. 2, pp. 296–301, 1951.
[7] G. W. Brown and J. von Neumann, "Solutions of games by differential equations," in Contrib. to Theory Games. Princeton University Press, 1952, vol. I, ch. 6, pp. 73–80.
[8] J. B. Rosen, "Existence and uniqueness of equilibrium points for concave n-person games," Econometrica, vol. 33, no. 3, pp. 520–534, 1965.
[9] D. Fudenberg and D. K. Levine, The Theory of Learning in Games. MIT Press, 1998.
[10] S. Hart, "Adaptive heuristics," Econometrica, vol. 73, no. 5, pp. 1401–1430, 2005.
[11] L. Buşoniu, R. Babuška, and B. De Schutter, "A comprehensive survey of multiagent reinforcement learning," IEEE Trans. Syst. Man, Cybern. Part C Appl. Rev., vol. 38, no. 2, pp. 156–172, 2008.
[12] J. R. Marden and J. S. Shamma, "Game theory and distributed control," in Handb. Game Theory with Econ. Appl. Elsevier, 2015, vol. 4, pp. 861–899.
[13] H. von Stackelberg, Market Structure and Equilibrium. Springer, 2011.
[14] Y. A. Korilis, A. A. Lazar, and A. Orda, "Achieving network optima using Stackelberg routing strategies," IEEE/ACM Trans. Netw., vol. 5, no. 1, pp. 161–173, 1997.
[15] T. Roughgarden, "Stackelberg scheduling strategies," SIAM J. Comput., vol. 33, no. 2, pp. 332–350, 2004.
[16] M. Bloem, T. Alpcan, and T. Başar, "A Stackelberg game for power control and channel allocation in cognitive radio networks," in , 2007, pp. 1–9.
[17] J. Pita, M. Jain, J. Marecki, F. Ordóñez, C. Portway, M. Tambe, C. Western, P. Paruchuri, and S. Kraus, "Deployed ARMOR protection: The application of a game-theoretic model for security at the Los Angeles International Airport," in , 2008, pp. 125–132.
[18] J. Tsai, S. Rathi, C. Kiekintveld, F. Ordóñez, and M. Tambe, "IRIS - A tool for strategic security allocation in transportation networks," in , 2009, pp. 37–44.
[19] G. G. Brown, W. M. Carlyle, J. Salmerón, and K. Wood, "Analyzing the vulnerability of critical infrastructure to attack and planning defenses," in Emerg. Theory, Methods, Appl. INFORMS, 2005, pp. 102–123.
[20] G. G. Brown, M. Carlyle, J. Salmerón, and K. Wood, "Defending critical infrastructure," Interfaces, vol. 36, no. 6, pp. 530–544, 2006.
[21] M. Brückner and T. Scheffer, "Stackelberg games for adversarial prediction problems," in , 2011, pp. 547–555.
[22] J. Marecki, G. Tesauro, and R. Segal, "Playing repeated Stackelberg games with unknown opponents," in , vol. 2, 2012, pp. 821–828.
[23] A. Blum, N. Haghtalab, and A. D. Procaccia, "Learning optimal commitment to overcome insecurity," in Neural Inf. Process. Syst. 2014, 2014, pp. 1826–1834.
[24] M. S. Kang, S. B. Lee, and V. D. Gligor, "The Crossfire attack," in , 2013, pp. 127–141.
[25] G. Yang, R. Poovendran, and J. P. Hespanha, "Adaptive learning in two-player Stackelberg games with continuous action sets," in , 2019, pp. 6905–6911.
[26] K.-T. Fang, R. Li, and A. Sudjianto, Design and Modeling for Computer Experiments. Chapman & Hall/CRC, 2005.
[27] R. T. Rockafellar and R. J. B. Wets, Variational Analysis. Springer, 1998.
[28] J.-P. Aubin, Viability Theory. Birkhäuser, 1991.
[29] A. S. Morse, D. Q. Mayne, and G. C. Goodwin, "Applications of hysteresis switching in parameter adaptive control," IEEE Trans. Automat. Contr., vol. 37, no. 9, pp. 1343–1354, 1992.
[30] P. A. Ioannou and J. Sun, Robust Adaptive Control. Prentice Hall, 1996.
[31] G. Yang and J. P. Hespanha, "Modeling and mitigating link-flooding distributed denial-of-service attacks via learning in Stackelberg games," in Handb. Reinf. Learn. Control. Springer, 2021, to be published.
[32] C. Henry, "An existence theorem for a class of differential equations with multivalued right-hand side," J. Math. Anal. Appl., vol. 41, no. 1, pp. 179–186, 1973.
[33] E. P. Ryan, "An integral invariance principle for differential inclusions with applications in adaptive control,"