Adaptive Learning in Two-Player Stackelberg Games with Application to Network Security
Guosong Yang, Radha Poovendran, and João P. Hespanha
Abstract—We study a two-player Stackelberg game with incomplete information, in which the follower's strategy belongs to a known family of parameterized functions with an unknown parameter vector. We design an adaptive learning approach to simultaneously estimate the unknown parameter and minimize the leader's cost, based on adaptive control techniques and hysteresis switching. Our approach guarantees that the leader's cost predicted using the parameter estimate becomes indistinguishable from its actual cost in finite time, up to a preselected, arbitrarily small error threshold. Also, the first-order necessary condition for optimality holds asymptotically for the predicted cost. Additionally, if a persistent excitation condition holds, then the parameter estimation error becomes bounded by a preselected, arbitrarily small threshold in finite time as well. For the case where there is a mismatch between the follower's strategy and the parameterized function that is known to the leader, our approach is able to guarantee the same convergence results for error thresholds larger than the size of the mismatch. The algorithms and the convergence results are illustrated via a simulation example in the domain of network security.
I. INTRODUCTION
A modern engineering system often involves multiple self-interested decision makers whose actions have mutual consequences. Examples of such systems include communications over a shared network with limited capacity, and computer programs sharing limited computational resources. Game theory provides a systematic framework for modeling the cooperation and conflict between these so-called strategic players [1], and has been widely applied to areas such as robust design, resource allocation, and network security [2]–[4].

In game theory, a fundamental question is whether the players can converge to a Nash equilibrium (a tuple of strategies from which no player has a unilateral incentive to deviate) if they play the game iteratively and adjust their strategies based on historical outcomes. A primary example of such a learning process is fictitious play [5], [6], in which every player believes that the opponents are playing constant mixed strategies in agreement with the empirical distributions of their past actions, and plays the corresponding best response. Another well-known example is gradient response [7], [8], in which each player adjusts its strategy according to the corresponding gradient of its cost function.
G. Yang and J. P. Hespanha are with the Center for Control, Dynamical Systems, and Computation, University of California, Santa Barbara, CA 93106, USA. Email: {guosongyang, hespanha}@ucsb.edu. R. Poovendran is with the Department of Electrical and Computer Engineering, University of Washington, Seattle, WA 98195, USA. Email: [email protected].

Fictitious play and gradient response have attracted significant research interest [9], [10] and have been used in applications such as multiagent reinforcement learning [11] and distributed control [12].

In this paper, we propose an adaptive learning approach for a hierarchical game model introduced by Stackelberg [13]. In a two-player Stackelberg game, one player (called the leader) selects its action first, and then the other player (called the follower), informed of the leader's choice, selects its own action. Therefore, a follower's strategy in a Stackelberg game can be viewed as a function that specifies an action in response to each possible leader's action.

Stackelberg games provide a natural framework for understanding systems with asymmetric information, a common feature of many network problems such as routing [14], scheduling [15], and channel allocation [16]. They are especially useful for modeling security problems, where the defender (leader) is usually unaware of the attacker's objective and capabilities a priori, whereas the attacker (follower) is able to observe the defender's strategy and attack after careful planning. A class of Stackelberg games called Stackelberg Security Games have been applied to various real-world security domains and have led to practical implementations such as the ARMOR program at the Los Angeles International Airport [17], the IRIS program used by the US Federal Air Marshals [18], and counterterrorism programs for crucial infrastructures such as power grids and oil reserves [19], [20].

Asymmetric information often leads to scenarios with no Nash equilibrium but a Stackelberg equilibrium, as sufficient conditions for the existence of the former are much stronger than those for the latter (see, e.g., [2, p. 181]). For these scenarios, learning schemes like fictitious play and gradient response cannot be applied, and novel approaches are needed to achieve convergence to a Stackelberg equilibrium. Existing results on learning in Stackelberg games are limited to linear or quadratic costs and finite action sets [21]–[23], which are too restrictive for many applications including network security.

In this paper, we study a Stackelberg game between two players with continuous action sets. We consider the scenario where the leader only has partial knowledge of the follower's action set and cost function. As a result, the follower's strategy belongs to a family of parameterized functions that is known to the leader, but the actual value of the parameter vector is unknown. Our main contribution is an adaptive learning approach described in Section III, which simultaneously estimates the unknown parameter based on the follower's past actions and minimizes the leader's cost predicted using the parameter estimate. The approach is designed based on adaptive control techniques, and utilizes projections and hysteresis switching to ensure feasibility of solutions and finite-time convergence of the parameter estimate.

In Section IV, we prove that the leader's predicted cost is guaranteed to become indistinguishable from its actual cost in finite time, up to a preselected, arbitrarily small error threshold. Also, the first-order necessary condition for optimality holds asymptotically for the predicted cost.
Additionally, if a persistent excitation condition holds, then the parameter estimation error is guaranteed to become bounded by a preselected, arbitrarily small threshold in finite time as well. Our proof provides a rigorous treatment of the existence and convergence of solutions to the discontinuous dynamics that result from projections and switching, based on tools from differential inclusions theory. In particular, we establish an invariance principle for projected gradient descent in continuous time, which is of independent interest and is novel to the best of our knowledge.

In Section V, we extend the adaptive learning approach to the case where the parameterized function that is known to the leader does not match the follower's strategy perfectly. We prove that our approach can be adjusted to guarantee the same convergence results for preselected error thresholds that are larger than the size of the mismatch.

In Section VI, the algorithms and the convergence results are illustrated via a simulation example motivated by link-flooding distributed denial-of-service (DDoS) attacks, such as the Crossfire attack [24]. Section VII concludes the paper with a brief summary and an outlook on future research topics.

A preliminary version of some of these results appeared in the conference paper [25]. The current paper improves on [25] by adding a notion of "practical" Stackelberg equilibrium, removing unnecessary assumptions, substantiating the results with complete proofs and clarifying remarks, and providing a more realistic and elaborate simulation example.

Notations:
Let $\mathbb{R}_+ := [0, \infty)$ and $\mathbb{N} := \{1, 2, \ldots\}$. Denote by $I_n$ the identity matrix in $\mathbb{R}^{n \times n}$, or simply $I$ when the dimension is implicit. Denote by $\|\cdot\|$ the Euclidean norm for vectors and the (induced) Euclidean norm for matrices. For two vectors $v_1$ and $v_2$, denote by $(v_1, v_2) := (v_1^\top, v_2^\top)^\top$ their concatenation. For a set $\mathcal{S} \subset \mathbb{R}^n$, denote by $\partial\mathcal{S}$ and $\bar{\mathcal{S}}$ its boundary and closure, respectively. A signal $r : \mathbb{R}_+ \to \mathbb{R}^n$ is of class $\mathcal{L}_\infty$ if $\sup_{t \ge 0} \|r(t)\|$ is finite. Denote by $s\mathbb{B}(x)$ the closed ball of radius $s \ge 0$ centered at $x \in \mathbb{R}^n$, that is, $s\mathbb{B}(x) := \{z \in \mathbb{R}^n : \|z - x\| \le s\}$.

II. PROBLEM FORMULATION
Consider a two-player game $(\mathcal{R}, \mathcal{A}, J, H)$, where $\mathcal{R} \subset \mathbb{R}^{n_r}$ and $\mathcal{A} \subset \mathbb{R}^{n_a}$ are the action sets of the first and the second player, respectively, and $J : \mathcal{R} \times \mathcal{A} \to \mathbb{R}$ and $H : \mathcal{A} \times \mathcal{R} \to \mathbb{R}$ are the corresponding cost functions to be minimized. We are interested in a hierarchical game model proposed by Stackelberg [13], where the first player (called the leader) selects its action $r \in \mathcal{R}$ first, and then the second player (called the follower), informed of the leader's choice, selects its own action $a \in \mathcal{A}$. Formally, a Stackelberg equilibrium is defined as follows; see, e.g., [1, Sec. 3.1] or [2, Def. 4.6 and 4.7, pp. 179–180].

Definition 1 (Stackelberg equilibrium). Given a game defined by $(\mathcal{R}, \mathcal{A}, J, H)$, an action $r^* \in \mathcal{R}$ is called a Stackelberg equilibrium action for the leader if
\[
J^* := \inf_{r \in \mathcal{R}} \max_{a \in \beta_a(r)} J(r, a) = \max_{a \in \beta_a(r^*)} J(r^*, a), \tag{1}
\]
where
\[
\beta_a(r) := \arg\min_{a \in \mathcal{A}} H(a, r) \tag{2}
\]
denotes the set of the follower's best responses against a leader's action $r \in \mathcal{R}$, and $J^*$ is known as the Stackelberg cost for the leader; for an $\varepsilon > 0$, an action $r^*_\varepsilon \in \mathcal{R}$ is called an $\varepsilon$ Stackelberg action for the leader if (1) is replaced by
\[
\max_{a \in \beta_a(r^*_\varepsilon)} J(r^*_\varepsilon, a) \le J^* + \varepsilon.
\]

(In this paper, we only consider compact action sets and continuous cost functions; thus for each $r \in \mathcal{R}$, the set $\beta_a(r)$ is nonempty and compact. However, the map $r \mapsto \max_{a \in \beta_a(r)} J(r, a)$ may still be discontinuous.)

The Stackelberg cost is a cost that the leader is able to guarantee against a rational follower. Depending on the follower's actual strategy, the leader may be able to achieve a better (smaller) cost while playing a Stackelberg equilibrium action. In practice, it is possible that no Stackelberg equilibrium action exists but the leader is able to achieve essentially the Stackelberg cost by playing an $\varepsilon$ Stackelberg action with a sufficiently small $\varepsilon > 0$; see also the discussion after Assumption 1 and the example in Section VI.

We consider games with perfect but incomplete information, where the leader only has partial knowledge of the follower's action set and cost function and cannot predict the follower's action accurately. Specifically, the follower's strategy is an unknown function $f : \mathcal{R} \to \mathcal{A}$ such that $a = f(r) \in \beta_a(r)$ for all $r \in \mathcal{R}$. However, $f$ belongs to a family of parameterized functions $\{r \mapsto \hat f(\hat\theta, r) : \hat\theta \in \Theta\}$ with a parameter set $\Theta \subset \mathbb{R}^{n_\theta}$, that is, there is an actual value $\theta \in \Theta$ such that
\[
a = f(r) = \hat f(\theta, r) \quad \forall r \in \mathcal{R}. \tag{3}
\]
The parameterized function $\hat f : \Theta \times \mathcal{R} \to \mathbb{R}^{n_a}$ (including the parameter set $\Theta$) is known to the leader, but the actual value $\theta$ is unknown. The follower's action set $\mathcal{A}$ is also unknown, except that $\mathcal{A} \subset \{\hat f(\hat\theta, r) : \hat\theta \in \Theta,\ r \in \mathcal{R}\}$ as implied by (3).

In practice, assuming that the follower's strategy belongs to a known family of parameterized functions introduces little loss of generality, as the follower's strategy can always be approximated on a compact set, up to an arbitrary precision, by a finite weighted sum of a preselected class of basis functions. An example of such an approximation is the radial basis function (RBF) model [26], in which the leader assumes
\[
\hat f(\theta, r) = \sum_{j=1}^{n_\theta} \theta_j F_j(r) = \sum_{j=1}^{n_\theta} \theta_j\,\phi\big(\|r - r^c_j\|\big),
\]
where $\phi : \mathbb{R}_+ \to \mathbb{R}^{n_a}$ is an RBF and each $F_j : \mathcal{R} \to \mathbb{R}^{n_a}$ is a kernel centered at $r^c_j$. Note that in the RBF model, the approximation is affine with respect to the unknown parameter, which is also the case in many other widely used approximation models such as orthogonal polynomials and multivariate splines [26].
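To make the parameterized family concrete, the short Python sketch below builds an RBF model of a scalar-valued follower strategy over a one-dimensional leader action set. The Gaussian kernel, the grid of centers, and the dimensions are illustrative assumptions on our part, not choices made in the paper; the only property that matters here is that the map from the parameter vector to the predicted response is affine.

```python
import numpy as np

def make_rbf_model(centers, width):
    """Return f_hat(theta, r) = sum_j theta_j * phi(|r - c_j|), which is affine in theta."""
    def features(r):
        # RBF phi applied to the distance from r to each kernel center
        return np.exp(-((r - centers) ** 2) / (2 * width ** 2))
    def f_hat(theta, r):
        return features(r) @ theta
    return f_hat, features

# illustrative setup: 8 Gaussian kernels on a grid over the leader action set [0, 1]
centers = np.linspace(0.0, 1.0, 8)
f_hat, features = make_rbf_model(centers, width=0.1)

theta_true = np.random.rand(8)   # plays the role of the unknown actual value theta
r = 0.3
a = f_hat(theta_true, r)         # follower's response a = f_hat(theta, r)

# affinity in theta: the Jacobian with respect to theta is the feature vector,
# independent of theta, which is exactly what the estimation algorithm exploits
grad_theta_f = features(r)
```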
This motivates restricting our attention to affine maps $\hat\theta \mapsto \hat f(\hat\theta, r)$. The following assumption captures this and other generic regularity conditions that we use to guarantee the existence of an $\varepsilon$ Stackelberg action and the convergence of our parameter estimation algorithms.
Assumption 1 (Regularity). The follower's action set $\mathcal{A}$ is compact, and the leader's action set $\mathcal{R}$ and the parameter set $\Theta$ are convex and compact; the follower's cost function $H$ is continuous, the leader's cost function $J$ and the parameterized function $\hat f$ are continuously differentiable, and the map $\hat\theta \mapsto \hat f(\hat\theta, r)$ is affine for each fixed $r \in \mathcal{R}$.

Under Assumption 1, there exists an $\varepsilon$ Stackelberg action for each $\varepsilon > 0$ [2, Prop. 4.2, p. 180]. These conditions are much weaker than the standard sufficient conditions for the existence of a Stackelberg equilibrium action [2, Th. 4.8, p. 180], which are in turn much weaker than those for a Nash equilibrium [2, p. 181]. Therefore, they are consistent with our interest in games with no Nash equilibrium but a "practical" Stackelberg equilibrium; see also the example in Section VI.

We denote by $\nabla_r J(r, a)$ and $\nabla_a J(r, a)$ the gradients of the maps $r \mapsto J(r, a)$ and $a \mapsto J(r, a)$, respectively, and by $\nabla_\theta \hat f(r)$ and $\nabla_r \hat f(\hat\theta, r)$ the Jacobian matrices of the maps $\hat\theta \mapsto \hat f(\hat\theta, r)$ and $r \mapsto \hat f(\hat\theta, r)$, respectively. (To be consistent with the definition of the Jacobian matrix, we take gradients as row vectors.) In particular, the Jacobian matrix $\nabla_\theta \hat f(r)$ is independent of $\hat\theta$ due to the affine condition in Assumption 1.

Our goal is to adjust the leader's action $r$ to minimize its cost $J(r, a)$ for the follower's action $a = f(r) = \hat f(\theta, r)$, that is, to solve the optimization problem
\[
\min_{r \in \mathcal{R}} J\big(r, \hat f(\theta, r)\big), \tag{4}
\]
based on past observations of the follower's action $a = \hat f(\theta, r)$ and the leader's cost $J(r, a)$, but without knowing the actual value $\theta$. (Clearly, as the follower's response satisfies $\hat f(\theta, r) \in \beta_a(r)$ for all $r \in \mathcal{R}$, the leader's optimal cost is upper bounded by its Stackelberg cost, i.e., $\min_{r \in \mathcal{R}} J(r, \hat f(\theta, r)) \le J^*$.) Our approach to solving this problem combines the following two components:
1) Construct a parameter estimate $\hat\theta$ that approaches the actual value $\theta$.
2) Adjust the leader's action $r$ based on a gradient descent method to minimize its predicted cost $\hat J(r, \hat\theta) := J\big(r, \hat f(\hat\theta, r)\big)$, that is, to solve the optimization problem
\[
\min_{r \in \mathcal{R}} \hat J(r, \hat\theta) = \min_{r \in \mathcal{R}} J\big(r, \hat f(\hat\theta, r)\big). \tag{5}
\]
In this paper, our design and analysis are formulated using continuous-time dynamics, which is common in the literature on learning in game theory [9], [10].

III. ESTIMATION AND OPTIMIZATION
To specify the adaptive algorithms for estimating the actual value $\theta$ and optimizing the leader's action $r$, we recall the following notions and basic properties from convex analysis; for more details, see, e.g., [27, Ch. 6] or [28, Sec. 5.1].

For a closed convex set
$\mathcal{C} \subset \mathbb{R}^n$ and a point $v \in \mathbb{R}^n$, we denote by $[v]_{\mathcal{C}}$ the projection of $v$ onto $\mathcal{C}$, that is,
\[
[v]_{\mathcal{C}} := \arg\min_{w \in \mathcal{C}} \|w - v\|.
\]
The projection $[v]_{\mathcal{C}}$ exists and is unique as the set $\mathcal{C}$ is closed and convex, and satisfies $[v]_{\mathcal{C}} = v$ if $v \in \mathcal{C}$.

For a convex set $\mathcal{S} \subset \mathbb{R}^n$ and a point $x \in \mathcal{S}$, we denote by $T_{\mathcal{S}}(x)$ the tangent cone to $\mathcal{S}$ at $x$, that is,
\[
T_{\mathcal{S}}(x) := \overline{\{h(z - x) : z \in \mathcal{S},\ h > 0\}}, \tag{6}
\]
and by $N_{\mathcal{S}}(x)$ the normal cone to $\mathcal{S}$ at $x$, that is,
\[
N_{\mathcal{S}}(x) := \{v \in \mathbb{R}^n : v^\top w \le 0 \text{ for all } w \in T_{\mathcal{S}}(x)\}. \tag{7}
\]
The sets $T_{\mathcal{S}}(x)$ and $N_{\mathcal{S}}(x)$ are closed and convex, and satisfy $T_{\mathcal{S}}(x) = \mathbb{R}^n$ and $N_{\mathcal{S}}(x) = \{0\}$ if $x \in \mathcal{S} \setminus \partial\mathcal{S}$. Moreover, we have
\[
[v]_{T_{\mathcal{S}}(x)} \in T_{\mathcal{S}}(x), \qquad v - [v]_{T_{\mathcal{S}}(x)} \in N_{\mathcal{S}}(x) \tag{8}
\]
and
\[
\big(v - [v]_{T_{\mathcal{S}}(x)}\big)^\top [v]_{T_{\mathcal{S}}(x)} = 0 \tag{9}
\]
for all $v \in \mathbb{R}^n$ and $x \in \mathcal{S}$.
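The projection onto a tangent cone has a particularly simple form when the convex set is a box, which is the case for the parameter set used in the simulation example of Section VI. The sketch below implements only that special case; the function name, the box representation, and the tolerance handling are our own illustrative choices, and a general convex set would require a different (for example QP-based) projection.

```python
import numpy as np

def project_tangent_cone_box(v, x, lo, hi, tol=1e-12):
    """Projection of v onto the tangent cone T_S(x) of the box S = [lo, hi]^n at x in S.

    For a box, the tangent cone is a product of half-lines and full lines: a coordinate
    direction is restricted to be >= 0 (resp. <= 0) when x sits on the lower (resp. upper)
    bound and is unrestricted in the interior, so the projection clips coordinatewise.
    """
    v = np.asarray(v, dtype=float).copy()
    at_lower = np.abs(x - lo) <= tol
    at_upper = np.abs(hi - x) <= tol
    v[at_lower & (v < 0)] = 0.0   # cannot move below the lower bound
    v[at_upper & (v > 0)] = 0.0   # cannot move above the upper bound
    return v

# example: at a corner of [0, 1]^2, only inward-pointing components survive
x = np.array([0.0, 1.0])
print(project_tangent_cone_box(np.array([-1.0, 2.0]), x, 0.0, 1.0))  # -> [0. 0.]
print(project_tangent_cone_box(np.array([0.5, -3.0]), x, 0.0, 1.0))  # -> [ 0.5 -3. ]
```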
A. Parameter estimation

We construct the parameter estimate $\hat\theta$ by comparing past observations of the follower's action $a = \hat f(\theta, r)$ and the leader's cost $J(r, a)$ with the corresponding predicted values $\hat f(\hat\theta, r)$ and $\hat J(r, \hat\theta) = J\big(r, \hat f(\hat\theta, r)\big)$ computed using $\hat\theta$. Their difference is defined as the observation error
\[
e_{\rm obs} := \begin{bmatrix} \hat f(\hat\theta, r) - a \\ \hat J(r, \hat\theta) - J(r, a) \end{bmatrix}. \tag{10}
\]
We develop an estimation algorithm based on the observation error $e_{\rm obs}$ so that the norm of the estimation error $\hat\theta - \theta$ is monotonically decreasing, regardless of how the leader's action $r$ is being adjusted.

First, we establish a relation between the observation error $e_{\rm obs}$ and the estimation error $\hat\theta - \theta$.

Lemma 1.
The observation error $e_{\rm obs}$ satisfies
\[
e_{\rm obs} = K(r, a, \hat\theta)(\hat\theta - \theta) \tag{11}
\]
with the gain matrix
\[
K(r, a, \hat\theta) := \begin{bmatrix} I \\ \int_0^1 \nabla_a J\big(r, \rho \hat f(\hat\theta, r) + (1 - \rho)a\big)\,\mathrm{d}\rho \end{bmatrix} \nabla_\theta \hat f(r). \tag{12}
\]

Proof.
See Appendix B-A.

Following Lemma 1, the observation error $e_{\rm obs}$ would be zero if the current estimate $\hat\theta$ of the actual value $\theta$ were correct. However, in most interesting scenarios, the dimension $n_\theta$ of the parameter vector $\theta$ is much larger than the dimension $n_a + 1$ of the observation error $e_{\rm obs}$; thus the gain matrix $K(r, a, \hat\theta)$ cannot be invertible, and a zero value for the observation error $e_{\rm obs}$ does not imply that the estimate $\hat\theta$ is correct.

We propose the following estimation algorithm to drive the parameter estimate $\hat\theta$ towards the actual value $\theta$:
\[
\dot{\hat\theta} = \big[-\lambda_e K(r, a, \hat\theta)^\top e_{\rm obs}\big]_{T_\Theta(\hat\theta)} \tag{13}
\]
with the gain matrix $K(r, a, \hat\theta)$ defined by (12) and the switching signal $\lambda_e : \mathbb{R}_+ \to \{0, \lambda_\theta\}$ defined by
\[
\lambda_e(t) := \begin{cases} \lambda_\theta & \text{if } \|e_{\rm obs}(t)\| \ge \varepsilon_{\rm obs}; \\ \lim_{s \nearrow t} \lambda_e(s) & \text{if } \|e_{\rm obs}(t)\| \in (\varepsilon'_{\rm obs}, \varepsilon_{\rm obs}); \\ 0 & \text{if } \|e_{\rm obs}(t)\| \le \varepsilon'_{\rm obs}, \end{cases} \tag{14}
\]
and $\lambda_e(0) := \lambda_\theta$ if $\|e_{\rm obs}(0)\| \in (\varepsilon'_{\rm obs}, \varepsilon_{\rm obs})$, where $\varepsilon_{\rm obs} > \varepsilon'_{\rm obs} > 0$ and $\lambda_\theta > 0$ are preselected constants. Several comments are in order. First, the gain matrix $K(r, a, \hat\theta)$ depends on the parameter estimate $\hat\theta$ but not on the actual value $\theta$, so (13) can be implemented without knowing $\theta$. Second, the projection $[\cdot]_{T_\Theta(\hat\theta)}$ onto the tangent cone $T_\Theta(\hat\theta)$ ensures that the parameter estimate $\hat\theta$ remains inside the parameter set $\Theta$ [28]; see Appendix A for more details. Finally, the right-continuous, piecewise constant switching signal $\lambda_e$ is designed so that the adaptation is on when $\|e_{\rm obs}\| \ge \varepsilon_{\rm obs}$ and off when $\|e_{\rm obs}\| \le \varepsilon'_{\rm obs}$, with a hysteresis switching rule that avoids chattering.

The key feature of (13) is that the estimation error $\hat\theta - \theta$ satisfies
\[
\frac{\mathrm{d}\|\hat\theta - \theta\|^2}{\mathrm{d}t} = 2(\hat\theta - \theta)^\top \big[-\lambda_e K(r, a, \hat\theta)^\top e_{\rm obs}\big]_{T_\Theta(\hat\theta)} \le 2(\hat\theta - \theta)^\top \big(-\lambda_e K(r, a, \hat\theta)^\top e_{\rm obs}\big) = -2\lambda_e \|e_{\rm obs}\|^2,
\]
where the inequality follows from (6)–(8). We thus conclude that the estimation algorithm (13) with the switching signal (14) guarantees
\[
\frac{\mathrm{d}\|\hat\theta - \theta\|^2}{\mathrm{d}t} \le -2\lambda_e \|e_{\rm obs}\|^2 \le 0, \tag{15}
\]
which implies that $\|\hat\theta - \theta\|$ is monotonically decreasing and will not stop approaching zero unless $\|e_{\rm obs}\| < \varepsilon_{\rm obs}$. In the convergence analysis in Section IV, we will show that the adaptation of the parameter estimate $\hat\theta$ stops in finite time, and the observation error $e_{\rm obs}$ satisfies $\|e_{\rm obs}\| < \varepsilon_{\rm obs}$ afterward.
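As a concrete illustration of how (13)–(14) could be implemented, the fragment below performs one forward-Euler step of the estimator for a toy problem with a scalar follower response and a box-shaped parameter set. The cost function, RBF features, step size, gain, and thresholds are all illustrative assumptions rather than values from the paper, and the final clip merely guards against the discretization overshooting the box (in continuous time the tangent-cone projection alone keeps the estimate inside $\Theta$).

```python
import numpy as np

# --- illustrative problem data (not from the paper) ---
centers = np.linspace(0.0, 1.0, 8)
phi = lambda r: np.exp(-((r - centers) ** 2) / (2 * 0.1 ** 2))   # features, so grad_theta_f(r) = phi(r)
f_hat = lambda th, r: phi(r) @ th                                 # predicted response, affine in theta
J = lambda r, a: (r - 0.5) ** 2 + r * a                           # toy leader cost
dJ_da = lambda r, a: r                                            # gradient of J with respect to a

lo, hi = 0.0, 1.0                                                 # Theta = [0, 1]^8 (box)
eps_obs, eps_obs_p, lam_theta = 0.05, 0.025, 0.5                  # thresholds and gain (arbitrary)

def proj_tangent_box(v, x):
    v = v.copy()
    v[(x <= lo) & (v < 0)] = 0.0
    v[(x >= hi) & (v > 0)] = 0.0
    return v

def estimation_step(theta_hat, r, a, lam_e, dt=1e-2):
    """One Euler step of the estimator (13) with the hysteresis switching rule (14)."""
    a_hat = f_hat(theta_hat, r)
    e_obs = np.array([a_hat - a, J(r, a_hat) - J(r, a)])          # observation error (10)
    # gain matrix (12): here n_a = 1 and the integral reduces to an average of dJ/da
    avg = np.mean([dJ_da(r, ro * a_hat + (1 - ro) * a) for ro in np.linspace(0.0, 1.0, 21)])
    K = np.vstack([phi(r), avg * phi(r)])                         # shape (n_a + 1, n_theta)
    # hysteresis (14): on above eps_obs, off below eps_obs_p, otherwise keep previous value
    n = np.linalg.norm(e_obs)
    lam_e = lam_theta if n >= eps_obs else (0.0 if n <= eps_obs_p else lam_e)
    # projected update (13)
    theta_hat = theta_hat + dt * proj_tangent_box(-lam_e * K.T @ e_obs, theta_hat)
    return np.clip(theta_hat, lo, hi), lam_e
```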
B. Cost minimization

Several options are available for adjusting the leader's action $r$, but in this paper our analysis will focus on a gradient descent method, which is fairly robust for a wide range of problems. Our ultimate goal is to minimize the leader's cost $J(r, a) = J\big(r, \hat f(\theta, r)\big)$. However, computing the gradient descent direction of the actual cost requires knowledge of the actual value $\theta$. Therefore, we minimize instead the predicted cost $\hat J(r, \hat\theta) = J\big(r, \hat f(\hat\theta, r)\big)$, which depends on the parameter estimate $\hat\theta$. This change in objective is justified by the property that $\|\hat J(r, \hat\theta) - J(r, a)\| \le \|e_{\rm obs}\| < \varepsilon_{\rm obs}$ holds after a finite time, which will be established in Section IV.

The time derivative of the predicted cost $\hat J(r, \hat\theta)$ is given by
\[
\dot{\hat J}(r, \hat\theta) = \nabla_r \hat J(r, \hat\theta)\,\dot r + \nabla_\theta \hat J(r, \hat\theta)\,\dot{\hat\theta}, \tag{16}
\]
with
\[
\nabla_r \hat J(r, \hat\theta) := \nabla_r J\big(r, \hat f(\hat\theta, r)\big) + \nabla_a J\big(r, \hat f(\hat\theta, r)\big)\nabla_r \hat f(\hat\theta, r), \qquad \nabla_\theta \hat J(r, \hat\theta) := \nabla_a J\big(r, \hat f(\hat\theta, r)\big)\nabla_\theta \hat f(r).
\]
Here $\nabla_r J\big(r, \hat f(\hat\theta, r)\big)$ denotes the gradient of the map $r \mapsto J(r, \hat a)$ at $\hat a = \hat f(\hat\theta, r)$; thus $\nabla_r \hat J(r, \hat\theta)$ and $\nabla_\theta \hat J(r, \hat\theta)$ are the gradients of the maps $r \mapsto \hat J(r, \hat\theta) = J\big(r, \hat f(\hat\theta, r)\big)$ and $\hat\theta \mapsto \hat J(r, \hat\theta)$, respectively. As we will establish that the adaptation of the parameter estimate $\hat\theta$ stops in finite time, we neglect the term with $\dot{\hat\theta}$ in (16) and focus exclusively on adjusting $r$ along the gradient descent direction of $r \mapsto \hat J(r, \hat\theta)$. This motivates the following optimization algorithm to adjust the leader's action:
\[
\dot r = \big[-\lambda_r \nabla_r \hat J(r, \hat\theta)^\top\big]_{T_{\mathcal{R}}(r)}, \tag{17}
\]
where $\lambda_r > 0$ is a preselected constant. The projection $[\cdot]_{T_{\mathcal{R}}(r)}$ onto the tangent cone $T_{\mathcal{R}}(r)$ ensures that the leader's action $r$ remains inside the action set $\mathcal{R}$ [28]; see Appendix A for more details. From (16) and (17) we conclude that
\[
\dot{\hat J}(r, \hat\theta) = \nabla_r \hat J(r, \hat\theta)\big[-\lambda_r \nabla_r \hat J(r, \hat\theta)^\top\big]_{T_{\mathcal{R}}(r)} + \nabla_\theta \hat J(r, \hat\theta)\,\dot{\hat\theta} = -\big\|\big[-\lambda_r \nabla_r \hat J(r, \hat\theta)^\top\big]_{T_{\mathcal{R}}(r)}\big\|^2\big/\lambda_r + \nabla_\theta \hat J(r, \hat\theta)\,\dot{\hat\theta} = -\|\dot r\|^2/\lambda_r + \nabla_\theta \hat J(r, \hat\theta)\,\dot{\hat\theta},
\]
where the second equality follows from (9). We thus conclude that the optimization algorithm (17) guarantees
\[
\dot{\hat\theta} = 0 \;\Longrightarrow\; \dot{\hat J}(r, \hat\theta) \le -\|\dot r\|^2/\lambda_r \le 0.
\]
In the convergence analysis in Section IV, we will show that the leader's action $r$ converges asymptotically to the set of points for which the first-order necessary condition for optimality holds for the optimization problem (5).
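A matching sketch of the leader update (17), again as a single discretized step under illustrative assumptions: the action set is taken to be the traffic simplex used in the simulation example of Section VI, the action is assumed to stay in the relative interior of that simplex (where the tangent cone is the subspace of zero-sum directions, so the projection simply removes the mean), and the gradient of the predicted cost is approximated by finite differences instead of the chain-rule expression above.

```python
import numpy as np

def grad_r_J_hat(J, f_hat, theta_hat, r, h=1e-6):
    """Finite-difference approximation of the gradient of the predicted cost
    J_hat(r, theta_hat) = J(r, f_hat(theta_hat, r)) with respect to r."""
    g = np.zeros_like(r)
    base = J(r, f_hat(theta_hat, r))
    for i in range(r.size):
        rp = r.copy()
        rp[i] += h
        g[i] = (J(rp, f_hat(theta_hat, rp)) - base) / h
    return g

def leader_step(J, f_hat, theta_hat, r, lam_r=0.2, dt=1e-2):
    """One Euler step of the projected gradient flow (17) on the simplex
    {r >= 0, sum(r) = const}, assuming r stays in the relative interior."""
    v = -lam_r * grad_r_J_hat(J, f_hat, theta_hat, r)
    v_proj = v - v.mean()   # tangent cone of the simplex at interior points: zero-sum directions
    return r + dt * v_proj  # near the boundary, the full tangent-cone projection would be needed
```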
IV. CONVERGENCE ANALYSIS

We now present the main result of this paper.
Theorem 1.
Suppose that Assumption 1 holds. For each pair of given error thresholds $\varepsilon_{\rm obs} > \varepsilon'_{\rm obs} > 0$ in (14), the estimation and optimization algorithms (13) and (17) guarantee the following:
1) There exists a time $T \ge 0$ such that
\[
\|e_{\rm obs}(t)\| < \varepsilon_{\rm obs}, \quad \hat\theta(t) = \hat\theta(T) \quad \forall t \ge T. \tag{18}
\]
2) The first-order necessary condition for optimality holds asymptotically for the optimization problem (5), that is,
\[
\lim_{t \to \infty} \big[-\nabla_r \hat J\big(r(t), \hat\theta(T)\big)^\top\big]_{T_{\mathcal{R}}(r(t))} = 0. \tag{19}
\]

Essentially, item 1) ensures that the parameter estimate $\hat\theta$ converges in finite time to a point that is indistinguishable from the actual value $\theta$ using observations of the follower's action $a = \hat f(\theta, r)$ and the leader's cost $J(r, a)$, up to an error bounded by the threshold $\varepsilon_{\rm obs}$. Regarding item 2), the necessity of (19) for optimality is justified as follows.

Lemma 2. If $\hat r^*$ is a local optimum of the optimization problem (5) with some fixed $\hat\theta$, then
\[
\big[-\nabla_r \hat J(\hat r^*, \hat\theta)^\top\big]_{T_{\mathcal{R}}(\hat r^*)} = 0. \tag{20}
\]

Proof.
The results in [27, Th. 6.12, p. 207] allow us to conclude that $-\nabla_r \hat J(\hat r^*, \hat\theta)^\top \in N_{\mathcal{R}}(\hat r^*)$. Then (20) follows from (7)–(9).

Proof of Theorem 1.
As the right-hand sides of (13) and (17) are potentially discontinuous due to projections and switching, the proof of Theorem 1 utilizes results from differential inclusions theory; see Appendix A for the necessary preliminaries. First, we establish the existence of solutions for the system defined by (13) and (17).
Lemma 3.
For each $(\hat\theta_0, r_0) \in \Theta \times \mathcal{R}$, there exists a solution to the system defined by (13) and (17) on $\mathbb{R}_+$; that is, there exist absolutely continuous functions $\hat\theta : \mathbb{R}_+ \to \Theta$ and $r : \mathbb{R}_+ \to \mathcal{R}$ with $(\hat\theta(0), r(0)) = (\hat\theta_0, r_0)$ such that (13) and (17) hold almost everywhere on $\mathbb{R}_+$. Moreover, we have
\[
\hat\theta,\ \dot{\hat\theta},\ r,\ \dot r,\ e_{\rm obs},\ \dot e_{\rm obs} \in \mathcal{L}_\infty. \tag{21}
\]

Proof.
Lemma 3 follows from results on hysteresis switching in [29] and results on projected differential inclusions in [28]; see Appendix B-B for the complete proof.

Second, we establish item 1) of Theorem 1 via arguments along the lines of the proof of Barbalat's lemma [30, Lemma 3.2.6, p. 76]. We cannot use Barbalat's lemma directly since the switching signal $\lambda_e$ in (15) is not continuous but only piecewise continuous. Following (15), we see that $\|\hat\theta - \theta\|$ is monotonically decreasing. Then $\lim_{t\to\infty} \|\hat\theta(t) - \theta\|$, and thus
\[
\lim_{t\to\infty} \int_0^t \lambda_e(s)\,\|e_{\rm obs}(s)\|^2\,\mathrm{d}s, \tag{22}
\]
exists and is finite. On the other hand, (13) and (14) imply that (18) holds if there exists a time $T \ge 0$ such that
\[
\lambda_e(t) = 0 \quad \forall t \ge T. \tag{23}
\]
Assume (23) does not hold for any $T \ge 0$. Then (14) implies that there exists an unbounded increasing sequence $(t_k)_{k\in\mathbb{N}}$ with $t_1 > 0$ such that
\[
\lambda_e(t_k) = \lambda_\theta, \quad \|e_{\rm obs}(t_k)\| > \varepsilon'_{\rm obs} \quad \forall k \in \mathbb{N}. \tag{24}
\]
Next, we show that there exists an unbounded sequence $(s_k)_{k\in\mathbb{N}}$ with $s_k \in [t_k - \delta, t_k]$ such that
\[
\|e_{\rm obs}(t)\| > \varepsilon'_{\rm obs}, \quad \lambda_e(t) = \lambda_\theta \quad \forall k \in \mathbb{N},\ \forall t \in [s_k, s_k + \delta) \tag{25}
\]
with the constant
\[
\delta := \min\bigg\{t_1,\ \frac{\varepsilon_{\rm obs} - \varepsilon'_{\rm obs}}{\sup_{s \ge 0} \|\dot e_{\rm obs}(s)\|}\bigg\} > 0,
\]
where the inequality follows from $t_1 > 0$, $\varepsilon_{\rm obs} > \varepsilon'_{\rm obs}$, and $\dot e_{\rm obs} \in \mathcal{L}_\infty$ in (21). Indeed, for each $k \in \mathbb{N}$, consider the following two possibilities:
1) If $\|e_{\rm obs}(t)\| < \varepsilon_{\rm obs}$ for all $t \in [t_k - \delta, t_k]$, then (14) and $\lambda_e(t_k) = \lambda_\theta$ imply that (25) holds with $s_k = t_k - \delta$.
2) Otherwise, there exists an $s_k \in [t_k - \delta, t_k]$ such that $\|e_{\rm obs}(s_k)\| = \varepsilon_{\rm obs}$, and (25) follows from the definition of $\delta$ and (14).
Moreover, $(s_k)_{k\in\mathbb{N}}$ is unbounded as $(t_k)_{k\in\mathbb{N}}$ is unbounded. Following (25), we have
\[
\int_{s_k}^{s_k+\delta} \lambda_e(s)\,\|e_{\rm obs}(s)\|^2\,\mathrm{d}s > \lambda_\theta (\varepsilon'_{\rm obs})^2 \delta > 0
\]
for the unbounded sequence $(s_k)_{k\in\mathbb{N}}$, which contradicts the property that (22) exists and is finite. Therefore, there exists a time $T \ge 0$ such that (23), and thus (18), holds.

Finally, we prove item 2) of Theorem 1 based on the invariance principle for projected gradient descent, Proposition 1 in Appendix A. After the time $T$ from item 1), the system (17) becomes
\[
\dot r = \big[-\lambda_r \nabla_r \hat J(r, \hat\theta(T))^\top\big]_{T_{\mathcal{R}}(r)},
\]
which can be modeled using the projected dynamical system (39) in Appendix A with the state $x := r$ and the set $\mathcal{S} := \mathcal{R}$. The corresponding function $g$ in (39) is given by $g(x) := -\lambda_r \nabla_r \hat J(x, \hat\theta(T))^\top$, which satisfies (40) with $V(x) := \lambda_r \hat J(x, \hat\theta(T))$. Then (19) follows from (41) in Proposition 1.

In Theorem 1, there is no claim that the parameter estimate $\hat\theta$ necessarily converges to the actual value $\theta$. However, this can be guaranteed if the following persistent excitation (PE) condition holds.

Assumption 2 (Persistent excitation). There exist constants $\tau_0, \alpha_0 > 0$ such that the gain matrix $K(r, a, \hat\theta)$ defined by (12) satisfies
\[
\int_t^{t+\tau_0} K(s)^\top K(s)\,\mathrm{d}s \ge \alpha_0 I \quad \forall t \ge 0, \tag{26}
\]
where we let $K(s) := K(r(s), a(s), \hat\theta(s))$ for brevity, and the inequality means that the difference of the left- and right-hand sides is a positive semidefinite matrix.

Theorem 2.
Suppose that Assumptions 1 and 2 hold. For each given threshold $\varepsilon_\theta > 0$, by setting
\[
\varepsilon_\theta \sqrt{\alpha_0/\tau_0} \ge \varepsilon_{\rm obs} > \varepsilon'_{\rm obs} > 0 \tag{27}
\]
in (14), the estimation and optimization algorithms (13) and (17) guarantee the following:
1) There exists a time $T \ge 0$ such that (18) holds and
\[
\|\hat\theta(T) - \theta\| < \varepsilon_\theta. \tag{28}
\]
2) The first-order necessary condition for optimality holds asymptotically for the optimization problem (5), that is, (19) holds.

Proof.
As (18) and (19) are established in Theorem 1, it remains to prove (28). To this effect, we note that the inequality in (18) implies
\[
\int_T^{T+\tau_0} \|e_{\rm obs}(s)\|^2\,\mathrm{d}s < \varepsilon_{\rm obs}^2 \tau_0 \le \alpha_0 \varepsilon_\theta^2,
\]
where the second inequality follows from (27). On the other hand, (11) and the equality in (18) imply
\[
\int_T^{T+\tau_0} \|e_{\rm obs}(s)\|^2\,\mathrm{d}s = \int_T^{T+\tau_0} \|K(s)(\hat\theta(T) - \theta)\|^2\,\mathrm{d}s = (\hat\theta(T) - \theta)^\top \bigg(\int_T^{T+\tau_0} K(s)^\top K(s)\,\mathrm{d}s\bigg)(\hat\theta(T) - \theta) \ge \alpha_0 \|\hat\theta(T) - \theta\|^2,
\]
where the inequality follows from the PE condition (26). Combining the upper and lower bounds above yields (28).

Remark 1. In view of (12), a sufficient condition for (26) is
\[
\int_t^{t+\tau_0} \nabla_\theta \hat f(r(s))^\top \nabla_\theta \hat f(r(s))\,\mathrm{d}s \ge \alpha_0 I \quad \forall t \ge 0. \tag{29}
\]
The PE condition (29) is more restrictive than (26); however, it can be checked without knowing the parameter estimate $\hat\theta$.

Remark 2. From the proof of Theorem 2, we see that (28) only requires (26) or (29) to hold at $t = T$ for the time $T$ from Theorem 1. Therefore, to ensure (28) in practice, it suffices to enforce (26) or (29) when $\lambda_e$ in (14) has been set to zero.
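Remark 2 suggests enforcing excitation only after the adaptation has switched off. One numerical way to do this, sketched below under illustrative assumptions (the window length, dither size, and the zero-mean noise used to respect the traffic-sum constraint of Section VI are our own choices), is to approximate the Gramian in (29) over a sliding window of past actions and to superimpose a small dither on the leader's action whenever the Gramian's smallest eigenvalue falls below $\alpha_0$ while $\lambda_e = 0$.

```python
import numpy as np

def pe_gramian(jac_theta_f, r_traj, dt):
    """Approximate the integral in (29) over a window of past leader actions r_traj."""
    n_theta = np.atleast_2d(jac_theta_f(r_traj[0])).shape[-1]
    W = np.zeros((n_theta, n_theta))
    for r in r_traj:
        Phi = np.atleast_2d(jac_theta_f(r))   # rows: response components, columns: parameters
        W += Phi.T @ Phi * dt
    return W

def maybe_dither(r, lam_e, W, alpha0, dither=0.01, rng=np.random.default_rng(0)):
    """Add a small zero-mean perturbation to r when adaptation is off but (29) fails."""
    if lam_e == 0.0 and np.linalg.eigvalsh(W).min() < alpha0:
        noise = rng.normal(scale=dither, size=r.shape)
        noise -= noise.mean()                 # keep the total (e.g., routed traffic) unchanged
        return np.clip(r + noise, 0.0, None)  # guard against small negative entries
    return r
```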
V. MODEL MISMATCH

Up till now we assumed that there was some unknown value $\theta$ within the parameter set $\Theta$ such that (3) holds for the follower's strategy $f$ and the parameterized function $\hat f$ that is known to the leader. In this section, we consider the case where such a perfect match may not exist, and study the effect of a bounded mismatch between $f(r)$ and $\hat f(\theta, r)$.

Assumption 3 (Mismatch). The follower's strategy $f$ is continuous, and there is an unknown value $\theta \in \Theta$ such that
\[
\|\hat f(\theta, r) - f(r)\| \le \varepsilon_f/\kappa \quad \forall r \in \mathcal{R} \tag{30}
\]
with the constant
\[
\kappa := \max_{r \in \mathcal{R},\,\hat\theta \in \Theta} \left\|\begin{bmatrix} I \\ \int_0^1 \nabla_a J\big(r, \rho \hat f(\hat\theta, r) + (1-\rho) f(r)\big)\,\mathrm{d}\rho \end{bmatrix}\right\|
\]
and some known constant $\varepsilon_f \ge 0$.

The follower's action set $\mathcal{A}$ is still unknown to the leader, except that $\mathcal{A} \subset \bigcup_{\hat\theta \in \Theta,\, r \in \mathcal{R}} (\varepsilon_f/\kappa)\mathbb{B}\big(\hat f(\hat\theta, r)\big)$ as implied by (30). Assumption 3 generalizes the condition (3), as (3) is equivalent to (30) with $\varepsilon_f = 0$.

Similar arguments to those in the proof of Lemma 1 show that the observation error $e_{\rm obs}$ now satisfies
\[
e_{\rm obs} = K(r, a, \hat\theta)(\hat\theta - \theta) + e_f \tag{31}
\]
with the gain matrix $K(r, a, \hat\theta)$ defined by (12) and the mismatch error
\[
e_f := \begin{bmatrix} I \\ \int_0^1 \nabla_a J\big(r, \rho \hat f(\hat\theta, r) + (1-\rho)a\big)\,\mathrm{d}\rho \end{bmatrix} \big(\hat f(\theta, r) - a\big).
\]
It turns out that Assumption 3 guarantees
\[
\|e_f(t)\| \le \varepsilon_f \quad \forall t \ge 0. \tag{32}
\]
The following two results extend Theorems 1 and 2 to the current case without a perfect match between $f(r)$ and $\hat f(\theta, r)$ for some $\theta \in \Theta$.

Theorem 3.
Suppose that Assumptions 1 and 3 hold. For each pair of given thresholds $\varepsilon_{\rm obs} > \varepsilon'_{\rm obs} > \varepsilon_f$ in (14), the estimation and optimization algorithms (13) and (17) guarantee the same convergence results as those in Theorem 1.

Proof. First, Lemma 3 still holds as $f$ is continuous.

Second, we establish item 1) of Theorem 3 via similar arguments to those in Section III-A and the second step of the proof of Theorem 1. Using the estimation algorithm (13) with $\varepsilon_{\rm obs} > \varepsilon'_{\rm obs} > \varepsilon_f$ in (14), the estimation error $\hat\theta - \theta$ now satisfies
\[
\frac{\mathrm{d}\|\hat\theta - \theta\|^2}{\mathrm{d}t} = 2(\hat\theta - \theta)^\top \big[-\lambda_e K(r, a, \hat\theta)^\top e_{\rm obs}\big]_{T_\Theta(\hat\theta)} \le 2(\hat\theta - \theta)^\top \big(-\lambda_e K(r, a, \hat\theta)^\top e_{\rm obs}\big) = -2\lambda_e (e_{\rm obs} - e_f)^\top e_{\rm obs},
\]
where the inequality follows from (6)–(8). Next, we prove that
\[
\frac{\mathrm{d}\|\hat\theta - \theta\|^2}{\mathrm{d}t} \le -2\lambda_e (e_{\rm obs} - e_f)^\top e_{\rm obs} \le 0, \tag{33}
\]
in which $\mathrm{d}\|\hat\theta - \theta\|^2/\mathrm{d}t = 0$ if and only if $\lambda_e = 0$. Indeed, consider the following two possibilities:
1) If $\lambda_e = 0$, then $\mathrm{d}\|\hat\theta - \theta\|^2/\mathrm{d}t = 0$.
2) Otherwise $\lambda_e = \lambda_\theta$, and thus $\|e_{\rm obs}\| > \varepsilon'_{\rm obs} > \varepsilon_f \ge \|e_f\|$ following (14) and (32). Hence
\[
\frac{\mathrm{d}\|\hat\theta - \theta\|^2}{\mathrm{d}t} \le -2\lambda_\theta (e_{\rm obs} - e_f)^\top e_{\rm obs} \le -2\lambda_\theta \big(\|e_{\rm obs}\| - \|e_f\|\big)\|e_{\rm obs}\| < -2\lambda_\theta (\varepsilon'_{\rm obs} - \varepsilon_f)\varepsilon'_{\rm obs} < 0.
\]
Following (33), we see that $\|\hat\theta - \theta\|$ is monotonically decreasing. Thus $\lim_{t\to\infty}\|\hat\theta(t) - \theta\|$, and therefore
\[
\lim_{t\to\infty} \int_0^t \lambda_e(s)\big(e_{\rm obs}(s) - e_f(s)\big)^\top e_{\rm obs}(s)\,\mathrm{d}s, \tag{34}
\]
exists and is finite. On the other hand, (13) and (14) imply that (18) holds if there exists a time $T \ge 0$ such that (23) holds. Assume (23) does not hold for any $T \ge 0$. Then the analysis in the second step of the proof of Theorem 1 shows that there exists an unbounded sequence $(s_k)_{k\in\mathbb{N}}$ with $s_k \in [t_k - \delta, t_k]$ such that (25) holds. Consequently, we have
\[
\int_{s_k}^{s_k+\delta} \lambda_e(s)\big(e_{\rm obs}(s) - e_f(s)\big)^\top e_{\rm obs}(s)\,\mathrm{d}s \ge \lambda_\theta \int_{s_k}^{s_k+\delta} \big(\|e_{\rm obs}(s)\| - \|e_f(s)\|\big)\|e_{\rm obs}(s)\|\,\mathrm{d}s > \lambda_\theta (\varepsilon'_{\rm obs} - \varepsilon_f)\varepsilon'_{\rm obs}\delta > 0
\]
for the unbounded sequence $(s_k)_{k\in\mathbb{N}}$, which, combined with (33), contradicts the property that (34) exists and is finite. Therefore, there exists a time $T \ge 0$ such that (23), and thus (18), holds.

Finally, item 2) of Theorem 3 is the same as item 2) of Theorem 1, as the optimization process is the same after the adaptation of $\hat\theta$ stops.

Theorem 4.
Suppose that Assumptions 1–3 hold. For each given threshold $\varepsilon_\theta > \varepsilon_f\sqrt{\tau_0/\alpha_0}$, by setting
\[
\varepsilon_\theta\sqrt{\alpha_0/\tau_0} - \varepsilon_f \ge \varepsilon_{\rm obs} > \varepsilon'_{\rm obs} > \varepsilon_f \tag{35}
\]
in (14), the estimation and optimization algorithms (13) and (17) guarantee the same convergence results as those in Theorem 2.

Proof. As (18) and (19) are established in Theorem 3, it remains to prove (28). To this effect, we note that the inequality in (18) implies
\[
\int_T^{T+\tau_0} \|e_{\rm obs}(s)\|^2\,\mathrm{d}s < \varepsilon_{\rm obs}^2 \tau_0 \le \big(\varepsilon_\theta\sqrt{\alpha_0} - \varepsilon_f\sqrt{\tau_0}\big)^2,
\]
where the second inequality follows from (35). On the other hand, (31) and the equality in (18) imply
\[
\|e_{\rm obs}(s)\|^2 = \|K(s)(\hat\theta(T) - \theta) + e_f(s)\|^2 \ge \Big(1 - \tfrac{\varepsilon_f}{\varepsilon_\theta}\sqrt{\tfrac{\tau_0}{\alpha_0}}\Big)\|K(s)(\hat\theta(T) - \theta)\|^2 + \Big(1 - \tfrac{\varepsilon_\theta}{\varepsilon_f}\sqrt{\tfrac{\alpha_0}{\tau_0}}\Big)\|e_f(s)\|^2 = \big(\varepsilon_\theta\sqrt{\alpha_0} - \varepsilon_f\sqrt{\tau_0}\big)\bigg(\frac{\|K(s)(\hat\theta(T) - \theta)\|^2}{\varepsilon_\theta\sqrt{\alpha_0}} - \frac{\|e_f(s)\|^2}{\varepsilon_f\sqrt{\tau_0}}\bigg)
\]
for all $s \ge T$, where the inequality follows from Young's inequality $2ab \le \epsilon a^2 + b^2/\epsilon$ for all $a, b \in \mathbb{R}$ and $\epsilon > 0$. Note that $\varepsilon_\theta > \varepsilon_f\sqrt{\tau_0/\alpha_0}$ implies $\varepsilon_\theta\sqrt{\alpha_0} - \varepsilon_f\sqrt{\tau_0} > 0$. Hence we have
\[
\int_T^{T+\tau_0} \|e_{\rm obs}(s)\|^2\,\mathrm{d}s \ge \big(\varepsilon_\theta\sqrt{\alpha_0} - \varepsilon_f\sqrt{\tau_0}\big)\bigg(\int_T^{T+\tau_0}\frac{\|K(s)(\hat\theta(T) - \theta)\|^2}{\varepsilon_\theta\sqrt{\alpha_0}}\,\mathrm{d}s - \int_T^{T+\tau_0}\frac{\|e_f(s)\|^2}{\varepsilon_f\sqrt{\tau_0}}\,\mathrm{d}s\bigg) \ge \big(\varepsilon_\theta\sqrt{\alpha_0} - \varepsilon_f\sqrt{\tau_0}\big)\bigg(\frac{\sqrt{\alpha_0}}{\varepsilon_\theta}\|\hat\theta(T) - \theta\|^2 - \varepsilon_f\sqrt{\tau_0}\bigg),
\]
where the second inequality follows from the PE condition (26) and the inequality (32). Combining the upper and lower bounds above yields (28).

Remark 3. The results in this section also hold with the slightly less conservative condition
\[
\min_{\theta \in \Theta}\ \max_{r \in \mathcal{R},\,\hat\theta \in \Theta} \left\|\begin{bmatrix} I \\ \int_0^1 \nabla_a J\big(r, \rho \hat f(\hat\theta, r) + (1-\rho) f(r)\big)\,\mathrm{d}\rho \end{bmatrix}\big(\hat f(\theta, r) - f(r)\big)\right\| \le \varepsilon_f \tag{36}
\]
in place of (30) in Assumption 3.
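For completeness, a small helper that picks thresholds consistent with condition (35) of Theorem 4; the margin parameter and the function itself are illustrative conveniences, not part of the paper.

```python
import math

def pick_thresholds(eps_theta, alpha0, tau0, eps_f, margin=0.5):
    """Choose (eps_obs, eps_obs_prime) satisfying condition (35):
    eps_theta*sqrt(alpha0/tau0) - eps_f >= eps_obs > eps_obs_prime > eps_f."""
    upper = eps_theta * math.sqrt(alpha0 / tau0) - eps_f
    if upper <= eps_f:
        raise ValueError("need eps_theta > eps_f * sqrt(tau0 / alpha0)")
    eps_obs = upper
    eps_obs_prime = eps_f + margin * (eps_obs - eps_f)  # any value strictly between eps_f and eps_obs
    return eps_obs, eps_obs_prime
```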
VI. SIMULATION EXAMPLE

We illustrate the estimation and optimization algorithms and the convergence results via a simulation example motivated by link-flooding distributed denial-of-service (DDoS) attacks, such as the Crossfire attack [24].

Consider a communication network consisting of $L$ parallel links connecting a source to a destination. The set of links is denoted by $\mathcal{L} := \{1, \ldots, L\}$. Suppose that a router (leader) distributes $R$ units of legitimate traffic among the parallel links, and an attacker (follower) disrupts communication by injecting $A$ units of superfluous traffic on them. The router's action is represented by an $L$-vector of the desired legitimate traffic on each link, $r \in \mathcal{R} := \{r \in \mathbb{R}_+^L : \sum_{l \in \mathcal{L}} r_l = R\}$, and the attacker's action is represented by an $L$-vector of the attack traffic, $a \in \mathcal{A} := \{a \in \mathbb{R}_+^L : \sum_{l \in \mathcal{L}} a_l = A\}$. Every link $l \in \mathcal{L}$ is subject to a constant capacity $c > 0$ that upper-bounds the total traffic on $l$. When $r_l + a_l > c$, the actual legitimate traffic on $l$ is decreased to $u_l := \min\{r_l, \max\{c - a_l, 0\}\}$. The router aims to maximize the total actual legitimate traffic, whereas the attacker aims to minimize it. Hence the router's cost is
\[
J(r, a) := -\sum_{l \in \mathcal{L}} u_l,
\]
and the attacker's cost is
\[
H(a, r) := \sum_{l \in \mathcal{L}} u_l = -J(r, a). \tag{37}
\]
Clearly, neither the router nor the attacker has an incentive to assign more traffic to a link than its capacity. Hence we assume $r_l, a_l \in [0, c]$ for all $l \in \mathcal{L}$. For most nontrivial cases, the game defined by $(\mathcal{R}, \mathcal{A}, J, H)$ has no Nash equilibrium. On the other hand, in [31, Cor. 5] it was established that there exists a Stackelberg equilibrium action given by $r^*_l = R/L$ for all $l \in \mathcal{L}$.

If the router knew that the attacker's cost function was indeed defined by (37), it could play the Stackelberg equilibrium action $r^*$. However, we consider the general scenario where it does not and, instead, adopts the adaptive learning approach proposed in this paper to construct its optimal action. To this effect, the router approximates the attacker's strategy $f = (f_1, \ldots, f_L)$ using a quasi-RBF model defined by
\[
\hat f_l(\theta, r) := \sum_{j_1=1}^{n_{\rm rbf}}\cdots\sum_{j_{L-1}=1}^{n_{\rm rbf}} \theta_{l,j_1,\ldots,j_{L-1}}\,F_{j_1,\ldots,j_{L-1}}(r) = \sum_{j_1=1}^{n_{\rm rbf}}\cdots\sum_{j_{L-1}=1}^{n_{\rm rbf}} \theta_{l,j_1,\ldots,j_{L-1}}\,\phi\big(r - r^c_{j_1,\ldots,j_{L-1}}\big) \tag{38}
\]
for $l \in \mathcal{L}$, where the RBF is defined by
\[
\phi(x) := \mathbb{1}_{(-c/(2n_{\rm rbf}),\, c/(2n_{\rm rbf})]^{L-1}}(x)
\]
with the indicator function
\[
\mathbb{1}_{\mathcal{S}}(x) := \begin{cases} 1 & \text{if } x \in \mathcal{S}; \\ 0 & \text{if } x \notin \mathcal{S}, \end{cases}
\]
the centers are defined by
\[
r^c_{j_1,\ldots,j_{L-1}} := \Big(\tfrac{(2j_1-1)c}{2n_{\rm rbf}}, \ldots, \tfrac{(2j_{L-1}-1)c}{2n_{\rm rbf}}\Big),
\]
and $n_{\rm rbf} \in \mathbb{N}$ is the number of kernels in each of the first $L-1$ scalar components of $\mathcal{R}$; that is, the router approximates each scalar component of $f$ using a grid of $n_{\rm rbf}^{L-1}$ hypercubes. Hence $n_\theta := L\,n_{\rm rbf}^{L-1}$ and $\Theta := [0, c]^{n_\theta}$ in (3). (The maps $r \mapsto J(r, a)$ and $r \mapsto \hat f(\hat\theta, r)$ in this example actually violate the smoothness conditions in Assumption 1, as they are only piecewise continuously differentiable. However, these conditions are only needed so that the optimization algorithm (17) is well defined and does not lead to chattering. In this example, the set of non-differentiable points in $\mathcal{R}$ has measure zero and does not affect the simulation.)

In the following, we simulate our estimation and learning algorithms (13) and (17) for networks with two or three parallel links. In these simulations, the constants $\varepsilon_{\rm obs} = 2\varepsilon'_{\rm obs}$, $\lambda_\theta$, and $\lambda_r$ are set to fixed values between 0 and 1
, and the initial values of the parameter estimate $\hat\theta$ and the router's action $r$ are randomly generated.
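The sketch below encodes the game just described: the effective traffic $u_l$, the router's cost $J$, and a greedy attacker best response that floods the links carrying the most (optionally re-weighted) legitimate traffic, mimicking the best responses quoted from [31, Cor. 2] in the next two subsections. The weighting hook and all function names are our own illustrative choices, and the greedy rule should be read as a reconstruction rather than the exact statement of [31].

```python
import numpy as np

def effective_traffic(r, a, c=1.0):
    """Actual legitimate traffic per link: u_l = min{r_l, max{c - a_l, 0}}."""
    return np.minimum(r, np.maximum(c - a, 0.0))

def router_cost(r, a, c=1.0):
    return -effective_traffic(r, a, c).sum()            # J(r, a)

def attacker_best_response(r, A=1, c=1.0, weights=None):
    """Greedy best response for the zero-sum cost (37): put attack traffic on the links
    with the largest (optionally weighted) legitimate traffic, using A units in total."""
    w = np.ones_like(r) if weights is None else np.asarray(weights, dtype=float)
    a = np.zeros_like(r)
    order = np.argsort(-(w * r))                         # most attractive links first
    budget = float(A)
    for l in order:
        a[l] = min(c, budget)
        budget -= a[l]
        if budget <= 0:
            break
    return a

# two-link example with R = 1, A = 1, c = 1
r = np.array([0.4, 0.6])
a = attacker_best_response(r)                            # floods link 2
print(a, router_cost(r, a))                              # -> [0. 1.] -0.4
```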
A. Network with two parallel links

[Fig. 1: A network with one source S, one destination D, and two parallel links (assuming that $a_l \le c$ for all $l \in \mathcal{L}$).]

Consider the network with $L = 2$ parallel links in Fig. 1, capacity $c = 1$, total desired legitimate traffic $R = Lc/2$, and attack budget $A = \lceil Lc/2 \rceil = 1$. We set the constant $n_{\rm rbf} = 4$. Then $n_\theta = 8$ and the parameter set $\Theta = [0, 1]^8$. Following [31, Cor. 2], the attacker's best response to a router action $r$ is to set $a_l = 1$ on the link $l \in \{1, 2\}$ with the larger $r_l$. Specifically, the actual value $\theta$ in (3), written in the tensor form of (38), is given by
\[
\theta = \begin{bmatrix} 0 & 0 & 1 & 1 \\ 1 & 1 & 0 & 0 \end{bmatrix}.
\]
As established in [31, Cor. 5], the Stackelberg equilibrium action is given by $r^* = (0.5, 0.5)$.

For the case without enforcing PE, shown in Fig. 2, the observation error $e_{\rm obs}$ converges to 0 in the first part of the simulation, and the router's action $r$ converges to the Stackelberg equilibrium action $r^* = (0.5, 0.5)$, even though the parameter estimate $\hat\theta$ does not converge to the actual value $\theta$, as shown in Fig. 3(a). In Fig. 4, we enforce PE by adding some random noise to $r$ for a short period of time while the observation error remains small. In this case, the observation error $e_{\rm obs}$ again converges to 0, the router's action $r$ converges to the Stackelberg equilibrium action $r^*$, and the parameter estimate $\hat\theta$ converges to the actual value $\theta$, as shown in Fig. 5(a).

[Fig. 2: Simulation results for $L = 2$ without PE. Panels: (a) observation error, (b) router's action, (c) router's actual and predicted costs, (d) attacker's cost. In the first part of the simulation, the observation error converges to $e_{\rm obs} = 0$, the router's action converges to the Stackelberg equilibrium action $r^* = (0.5, 0.5)$, and the router's and attacker's costs converge to $J = \hat J = -H = -0.5$; after the attacker switches to the new cost function $\bar H$, the router's action converges to an $\varepsilon$ Stackelberg action near $\bar r^*$ and the router's actual and predicted costs converge to a common value.]

[Fig. 3: Router's actual cost function $r \mapsto J(r, f(r))$ and predicted cost function $r \mapsto \hat J(\hat\theta(T), r)$ for $L = 2$ without PE, (a) before and (b) after the switch in the attacker's cost function. In both scenarios, the predicted cost function is accurate near the optimum $r^*$ but not everywhere.]

In both cases, with and without PE, we also simulate the scenario where, partway through the simulation, the attacker starts focusing more on disrupting link 1 by switching to a new cost function defined by
\[
\bar H(a, r) := u_1 + u_2/3.
\]
The results in [31, Cor. 2] allow us to conclude that, for the non-zero-sum game defined by $(\mathcal{R}, \mathcal{A}, J, \bar H)$, the attacker's best response to a router's action $r$ is to set $a_l = 1$ on the link $l \in \{1, 2\}$ that corresponds to the larger value in $\{r_1, r_2/3\}$. The corresponding actual value $\theta$ in (3), written in the tensor form of (38), becomes
\[
\theta = \begin{bmatrix} 0 & 1 & 1 & 1 \\ 1 & 0 & 0 & 0 \end{bmatrix}.
\]
As established in [31, Th. 4 and Cor. 5], there is no Stackelberg equilibrium action as defined by Definition 1 for $(\mathcal{R}, \mathcal{A}, J, \bar H)$; however, there are $\varepsilon$ Stackelberg actions near $\bar r^* = (0.25, 0.75)$ for sufficiently small $\varepsilon > 0$. The corresponding simulation results in Figs. 2, 3(b), 4, and 5(b) show that our adaptive learning approach is able to identify this switch in the attack, as the router's action $r$ converges to $\varepsilon$ Stackelberg actions near $\bar r^*$ in both Figs.
2 and 4, and, when PE is enforced, the parameter estimate $\hat\theta$ converges to the new actual value $\theta$, as shown in Fig. 5(b).

[Fig. 4: Simulation results for $L = 2$ with PE. Panels: (a) observation error, (b) router's action, (c) router's actual and predicted costs, (d) attacker's cost. The behavior mirrors Fig. 2: the observation error converges to $e_{\rm obs} = 0$ and the router's action converges to $r^* = (0.5, 0.5)$ in the first part of the simulation, and to an $\varepsilon$ Stackelberg action near $\bar r^*$ after the attacker switches cost functions.]

[Fig. 5: Router's actual cost function $r \mapsto J(r, f(r))$ and predicted cost function $r \mapsto \hat J(\hat\theta(T), r)$ for $L = 2$ with PE, (a) before and (b) after the switch in the attacker's cost function. In both scenarios, the predicted cost function is accurate everywhere.]

B. Network with three parallel links
Consider a network with $L = 3$ parallel links, capacity $c = 1$, total desired legitimate traffic $R = Lc/2 = 1.5$, and attack budget $A = \lceil Lc/2 \rceil = 2$. We set the constant $n_{\rm rbf} = 20$. Then $n_\theta = 1200$ and the parameter set $\Theta = [0, 1]^{1200}$. Following [31, Cor. 2], the attacker's best response to a router action is to set $a_{l_1} = a_{l_2} = 1$ on the two links $l_1, l_2 \in \{1, 2, 3\}$ with the largest $r_l$. However, there is no $\theta \in \Theta$ such that (3) holds for the attacker's strategy $f$ and the function $\hat f$ defined by (38). To see the mismatch between $f(r)$ and $\hat f(\theta, r)$, one could consider the router's actions $r \in \{(0.45 + \varepsilon,\ 0.5 - \varepsilon,\ 0.55) : 0 < \varepsilon < 0.05\}$. Comparing to (38), we see that $(r_1, r_2)$ belongs to the hypercube with the center $(0.475, 0.475)$. However, the attacker's best response satisfies $a_1 = 1$ if $\varepsilon > 0.025$ and $a_1 = 0$ if $\varepsilon < 0.025$. Hence the corresponding scalar component of $\theta$ will always yield an error of at least $0.5$, and thus the mismatch threshold $\varepsilon_f$ in (30) or (36) is at least $0.5A/c = 1$. However, in the simulations below we are able to obtain much smaller observation errors near the Stackelberg equilibrium action $r^* = (0.5, 0.5, 0.5)$ established in [31, Cor. 5]. Due to the mismatch, and motivated by Remark 3, we focus our attention on the performance of the estimation and learning algorithms (13) and (17).

[Fig. 6: Simulation results for $L = 3$ and the cost function $H$, with PE. Panels: (a) observation error, (b) router's action, (c) router's actual and predicted costs, (d) attacker's cost. The observation error converges to $e_{\rm obs} = 0$, the router's action converges to the Stackelberg equilibrium action $r^* = (0.5, 0.5, 0.5)$, and the router's and attacker's costs converge to $J = \hat J = -H = -0.5$.]

[Fig. 7: Simulation results for $L = 3$ and the cost function $\bar H$, with PE. Panels: (a) observation error, (b) router's action, (c) router's actual and predicted costs, (d) attacker's cost. The observation error converges to $e_{\rm obs} = 0$ and the router's action converges to an $\varepsilon$ Stackelberg action near $\bar r^*$.]

In Fig. 6, the observation error $e_{\rm obs}$ converges to 0, and the router's action $r$ converges to the Stackelberg equilibrium action $r^* = (0.5, 0.5, 0.5)$. In Fig. 7, the attacker starts focusing less on disrupting links 1 and 3 by switching to the new cost function defined by
\[
\bar H(a, r) := u_1 + 3u_2/2 + u_3.
\]
Following [31, Cor. 2], for the non-zero-sum game defined by $(\mathcal{R}, \mathcal{A}, J, \bar H)$, the attacker's best response to a router's action $r$ is to set $a_{l_1} = a_{l_2} = 1$ on the two links $l_1, l_2 \in \{1, 2, 3\}$ that correspond to the largest two values in $\{r_1, 3r_2/2, r_3\}$. As established in [31, Th. 4 and Cor. 5], there is no Stackelberg equilibrium action as defined by Definition 1 for $(\mathcal{R}, \mathcal{A}, J, \bar H)$, but there are $\varepsilon$ Stackelberg actions near a corresponding action $\bar r^*$ for sufficiently small $\varepsilon > 0$. The corresponding simulation results are plotted in Fig. 7, where the observation error $e_{\rm obs}$ converges to 0, and the router's action $r$ converges to an $\varepsilon$ Stackelberg action near $\bar r^*$.

VII. CONCLUSION
This paper considered a two-player Stackelberg game where the leader only had partial knowledge of the follower's action set and cost function, and had to estimate the follower's strategy using a family of parameterized functions. We designed an adaptive learning approach that simultaneously estimated the follower's strategy based on past observations and minimized the leader's cost predicted using the latest estimate. Our approach was proved to guarantee that the leader's actual and predicted costs become essentially indistinguishable in finite time, and that the first-order necessary condition for optimality holds asymptotically for the predicted cost. Moreover, we provided a PE condition for ensuring estimation accuracy and studied the case with model mismatch. The results were illustrated via a simulation example motivated by DDoS attacks.

A feature of our estimation algorithm (13) is that the norm of the estimation error $\hat\theta - \theta$ is monotonically decreasing and the observation error $e_{\rm obs}$ becomes bounded in norm by the preselected, arbitrarily small threshold $\varepsilon_{\rm obs}$ in finite time, regardless of how the leader's action $r$ is adjusted. A future research direction is to extend our adaptive learning approach by adopting more efficient estimation and optimization algorithms for more complex applications. Some preliminary results on employing a neural network to predict the follower's response can be found in [31]. Other future research topics include relaxing the affine condition in Assumption 1, and extending the current results to Stackelberg games on distributed networks.

APPENDIX A
PROJECTED DYNAMICAL SYSTEMS
Let $\mathcal{S} \subset \mathbb{R}^n$ be a compact convex set and $g : \mathcal{S} \to \mathbb{R}^n$ be a continuous function. In this section, we provide some preliminary results on the existence and convergence of solutions for the projected dynamical system
\[
\dot x = [g(x)]_{T_{\mathcal{S}}(x)}. \tag{39}
\]
The difficulty in analyzing (39) lies in the fact that its right-hand side is only defined on the domain $\mathcal{S}$ and is potentially discontinuous on the boundary $\partial\mathcal{S}$ due to the projection $[\cdot]_{T_{\mathcal{S}}(x)}$. Therefore, we consider the concept of a viable Carathéodory solution, that is, a solution to (39) on an interval $\mathcal{L} \subset \mathbb{R}_+$ is an absolutely continuous function $x : \mathcal{L} \to \mathcal{S}$ such that (39) holds almost everywhere on $\mathcal{L}$. In particular, it requires $x(t) \in \mathcal{S}$ for all $t \in \mathcal{L}$. The following result establishes the existence of solutions for the projected dynamical system (39).

Lemma 4.
For each $x_0 \in \mathcal{S}$, there exists a solution $x$ to (39) on $\mathbb{R}_+$ with $x(0) = x_0$.

Proof. As the function $g$ is continuous on the compact set $\mathcal{S}$, it is upper bounded in norm and thus a Marchaud map [28, Def. 2.2.4, p. 62]. Then Lemma 4 follows from [28, Th. 10.1.1, p. 354].

In the following, we establish an invariance principle for the case where $g$ is defined by a gradient descent process.

Proposition 1.
Suppose that the function $g$ in (39) satisfies
\[
g(z) = -\nabla V(z)^\top \quad \forall z \in \mathcal{S} \tag{40}
\]
for some function $V : \mathcal{S} \to \mathbb{R}$. Then every solution $x$ to (39) satisfies
\[
\lim_{t\to\infty} [g(x(t))]_{T_{\mathcal{S}}(x(t))} = 0. \tag{41}
\]

To prove Proposition 1, we extend the projected differential equation (39) to the differential inclusion
\[
\dot x \in G(x) \tag{42}
\]
with the set-valued function $G : \mathcal{S} \rightrightarrows \mathbb{R}^n$ defined by
\[
G(z) := \{g(z) - w : w \in N_{\mathcal{S}}(z)\} \cap \big\|g(z) - [g(z)]_{T_{\mathcal{S}}(z)}\big\|\,\mathbb{B}(g(z)). \tag{43}
\]
As $[g(z)]_{T_{\mathcal{S}}(z)} \in G(z)$ for all $z \in \mathcal{S}$, a solution to (39) is also a solution to (42). Hence we prove Proposition 1 by applying an invariance theorem for differential inclusions to (42), which requires the following continuity property.

Lemma 5.
The set-valued function $G$ defined by (43) is upper semicontinuous on $\mathcal{S}$.

Proof. As $\mathcal{S}$ is compact and convex, the set-valued map $z \mapsto T_{\mathcal{S}}(z)$ is lower semicontinuous on $\mathcal{S}$ [28, Th. 5.1.7, p. 162]. Also, as $g$ is continuous on $\mathcal{S}$, the map $(z, w) \mapsto -\|g(z) - w\|$ is continuous on $\mathcal{S} \times \mathbb{R}^n$. Hence the map $z \mapsto \sup_{w \in T_{\mathcal{S}}(z)} -\|g(z) - w\| = -\inf_{w \in T_{\mathcal{S}}(z)} \|g(z) - w\| = -\big\|g(z) - [g(z)]_{T_{\mathcal{S}}(z)}\big\|$ is lower semicontinuous on $\mathcal{S}$ [28, Th. 2.1.6, p. 59], that is, the map $z \mapsto \big\|g(z) - [g(z)]_{T_{\mathcal{S}}(z)}\big\|$ is upper semicontinuous on $\mathcal{S}$. Moreover, the map $z \mapsto \{g(z) - w : w \in N_{\mathcal{S}}(z)\}$ is closed. Hence $G$ is upper semicontinuous on $\mathcal{S}$ [28, Cor. 2.2.3, p. 61].

Proof of Proposition 1.
Suppose that
\[
g(z)^\top w \ge \big\|[g(z)]_{T_{\mathcal{S}}(z)}\big\|^2 \quad \forall z \in \mathcal{S},\ \forall w \in G(z). \tag{44}
\]
Then the function $V$ satisfies
\[
\nabla V(z)\,w \le -\big\|[g(z)]_{T_{\mathcal{S}}(z)}\big\|^2 \quad \forall z \in \mathcal{S},\ \forall w \in G(z).
\]
Note that the set-valued function $G$ is upper semicontinuous, and the set $G(z)$ is nonempty, compact, and convex for every $z \in \mathcal{S}$. Hence the invariance theorem [33, Th. 2.11] implies that every solution to (42), and therefore every solution $x$ to (39), approaches the largest invariant set in $\{z \in \mathcal{S} : \|[g(z)]_{T_{\mathcal{S}}(z)}\| = 0\}$. Hence (41) holds as $g$ is continuous on the compact set $\mathcal{S}$. (The extension from the projected differential equation (39) to the differential inclusion (42) is inspired by similar extensions in [32] and [28, p. 354], and is specifically designed to simplify the proof of Proposition 1.)

It remains to show that (44) holds. Consider arbitrary $z \in \mathcal{S}$ and $w \in G(z)$. First, as $g(z) - w \in N_{\mathcal{S}}(z)$, from (7) and (8) we have $(g(z) - w)^\top [g(z)]_{T_{\mathcal{S}}(z)} \le 0$, and thus
\[
\|w\|^2 - \big\|[g(z)]_{T_{\mathcal{S}}(z)}\big\|^2 \ge \|w\|^2 - \big\|[g(z)]_{T_{\mathcal{S}}(z)}\big\|^2 + 2(g(z) - w)^\top [g(z)]_{T_{\mathcal{S}}(z)} = \big\|w - [g(z)]_{T_{\mathcal{S}}(z)}\big\|^2 \ge 0,
\]
where the equality follows partially from (9). Second, as $w \in G(z) \subset \big\|g(z) - [g(z)]_{T_{\mathcal{S}}(z)}\big\|\,\mathbb{B}(g(z))$, we have $\|g(z) - w\| \le \big\|g(z) - [g(z)]_{T_{\mathcal{S}}(z)}\big\|$, and thus
\[
2 g(z)^\top w = \|w\|^2 + \|g(z)\|^2 - \|g(z) - w\|^2 \ge \|w\|^2 + \|g(z)\|^2 - \big\|g(z) - [g(z)]_{T_{\mathcal{S}}(z)}\big\|^2 = \|w\|^2 + \big\|[g(z)]_{T_{\mathcal{S}}(z)}\big\|^2 \ge 2\big\|[g(z)]_{T_{\mathcal{S}}(z)}\big\|^2,
\]
where the second equality follows partially from (9). Hence (44) holds.

APPENDIX B
PROOF OF TECHNICAL LEMMAS
A. Proof of Lemma 1
As the map $\hat\theta \mapsto \hat f(\hat\theta, r)$ is affine, its Jacobian matrix $\nabla_\theta \hat f(r)$ is independent of $\hat\theta$. Thus for a fixed $r \in \mathcal{R}$, we have
\[
\hat f(\hat\theta, r) - a = \hat f(\hat\theta, r) - \hat f(\theta, r) = \nabla_\theta \hat f(r)(\hat\theta - \theta).
\]
Next, consider the function $g : [0, 1] \to \mathbb{R}$ defined by
\[
g(\rho) := J\big(r, \rho \hat f(\hat\theta, r) + (1-\rho)a\big),
\]
which is continuously differentiable as $J$ is continuously differentiable. Then
\[
\hat J(r, \hat\theta) - J(r, a) = g(1) - g(0) = \int_0^1 \frac{\mathrm{d}g(\rho)}{\mathrm{d}\rho}\,\mathrm{d}\rho,
\]
in which
\[
\frac{\mathrm{d}g(\rho)}{\mathrm{d}\rho} = \nabla_a J\big(r, \rho \hat f(\hat\theta, r) + (1-\rho)a\big)\big(\hat f(\hat\theta, r) - a\big) = \nabla_a J\big(r, \rho \hat f(\hat\theta, r) + (1-\rho)a\big)\nabla_\theta \hat f(r)(\hat\theta - \theta).
\]
Hence
\[
\hat J(r, \hat\theta) - J(r, a) = \int_0^1 \nabla_a J\big(r, \rho \hat f(\hat\theta, r) + (1-\rho)a\big)\nabla_\theta \hat f(r)(\hat\theta - \theta)\,\mathrm{d}\rho = \bigg(\int_0^1 \nabla_a J\big(r, \rho \hat f(\hat\theta, r) + (1-\rho)a\big)\,\mathrm{d}\rho\bigg)\nabla_\theta \hat f(r)(\hat\theta - \theta).
\]

B. Proof of Lemma 3
Consider an arbitrary $(\hat\theta_0, r_0) \in \Theta \times \mathcal{R}$, and let $\lambda_e(0)$ be the corresponding value given by (10) and (14). Suppose $\lambda_e(0) = \lambda_\theta$, that is, $\|e_{\rm obs}(0)\| > \varepsilon'_{\rm obs}$. In the following, we construct a solution to (13) and (17) on $\mathbb{R}_+$ with $(\hat\theta(0), r(0)) = (\hat\theta_0, r_0)$ recursively. The case where $\lambda_e(0) = 0$ can be proved based on the same construction starting with the second step.

First, consider the system defined by (13) and (17) with $\lambda_e \equiv \lambda_\theta$, that is,
\[
\dot{\hat\theta} = \big[-\lambda_\theta K(r, f(r), \hat\theta)^\top K(r, f(r), \hat\theta)(\hat\theta - \theta)\big]_{T_\Theta(\hat\theta)}, \qquad \dot r = \big[-\lambda_r \nabla_r \hat J(r, \hat\theta)^\top\big]_{T_{\mathcal{R}}(r)}, \tag{45}
\]
which can be modeled using the projected dynamical system (39) in Appendix A with the state $x := (\hat\theta, r)$ and the set $\mathcal{S} := \Theta \times \mathcal{R}$. In particular, $f$ is continuous due to (3) and Assumption 1. Then Lemma 4 in Appendix A implies that there exists a solution $(\hat\theta_1, r_1)$ to (45) on $\mathbb{R}_+$ with $(\hat\theta_1(0), r_1(0)) = (\hat\theta_0, r_0)$. Consider the corresponding observation error $e_{{\rm obs},1}$ and switching signal $\lambda_{e,1}$ defined by (10) and (14), and let $t_1 := \inf\{t > 0 : \|e_{{\rm obs},1}(t)\| \le \varepsilon'_{\rm obs}\}$. Then $(\hat\theta_1, r_1)$ is a solution to (13) and (17) on $[0, t_1)$ with $(\hat\theta_1(0), r_1(0)) = (\hat\theta_0, r_0)$. If $t_1 = \infty$ then the proof is complete. Otherwise, $\|e_{{\rm obs},1}(t_1)\| = \varepsilon'_{\rm obs}$ and thus $\lambda_{e,1}(t_1) = 0$, and we continue with the second step below.

Second, consider the system defined by (13) and (17) with $\lambda_e \equiv 0$, that is,
\[
\dot{\hat\theta} = 0, \qquad \dot r = \big[-\lambda_r \nabla_r \hat J(r, \hat\theta)^\top\big]_{T_{\mathcal{R}}(r)}, \tag{46}
\]
which can also be modeled using the projected dynamical system (39) in Appendix A with the state $x := (\hat\theta, r)$ and the set $\mathcal{S} := \Theta \times \mathcal{R}$. Then Lemma 4 in Appendix A implies that there exists a solution $(\hat\theta_2, r_2)$ to (46) on $[t_1, \infty)$ with $(\hat\theta_2(t_1), r_2(t_1)) = (\hat\theta_1(t_1), r_1(t_1))$. Consider the corresponding observation error $e_{{\rm obs},2}$ and switching signal $\lambda_{e,2}$ defined by (10) and (14), and let $t_2 := \inf\{t \ge t_1 : \|e_{{\rm obs},2}(t)\| \ge \varepsilon_{\rm obs}\}$. Then $(\hat\theta_2, r_2)$ is a solution to (13) and (17) on $[t_1, t_2)$ with $(\hat\theta_2(t_1), r_2(t_1)) = (\hat\theta_1(t_1), r_1(t_1))$. If $t_2 = \infty$ then the proof is complete. Otherwise, $\|e_{{\rm obs},2}(t_2)\| = \varepsilon_{\rm obs}$ and thus $\lambda_{e,2}(t_2) = \lambda_\theta$, and we continue with the first step above.

In this way, we obtain an increasing sequence $(t_k)_{k\in\mathbb{N}}$ with $t_0 = 0$ and a corresponding sequence $(\hat\theta_k, r_k)_{k \ge 1}$ of absolutely continuous functions $\hat\theta_k : [t_{k-1}, \infty) \to \Theta$ and $r_k : [t_{k-1}, \infty) \to \mathcal{R}$. Moreover, following (10), (45), (46), Assumption 1, and the property that $(\hat\theta_k(t), r_k(t)) \in \Theta \times \mathcal{R}$ for all $k \ge 1$ and $t \ge t_{k-1}$, there exists a constant $M \ge 0$ such that $\|\dot e_{{\rm obs},k}(t)\| \le M$ for all $k \ge 1$ and $t \ge t_{k-1}$. Hence
\[
t_k - t_{k-1} \ge (\varepsilon_{\rm obs} - \varepsilon'_{\rm obs})/M \quad \forall k \ge 2,
\]
and thus $\lim_{k\to\infty} t_k = \infty$. Therefore, the absolutely continuous functions $\hat\theta : \mathbb{R}_+ \to \Theta$ and $r : \mathbb{R}_+ \to \mathcal{R}$ defined by
\[
\hat\theta(t) := \hat\theta_k(t), \quad r(t) := r_k(t), \quad k \ge 1,\ t \in [t_{k-1}, t_k)
\]
form a solution to (13) and (17) on $\mathbb{R}_+$ with $(\hat\theta(0), r(0)) = (\hat\theta_0, r_0)$. The proof is completed by noticing that (21) follows from (10), (45), (46), Assumption 1, and the property that $(\hat\theta(t), r(t)) \in \Theta \times \mathcal{R}$ for all $t \ge 0$.

REFERENCES
[1] D. Fudenberg and J. Tirole, Game Theory. MIT Press, 1991.
[2] T. Başar and G. J. Olsder, Dynamic Noncooperative Game Theory, 2nd ed. SIAM, 1999.
[3] T. Alpcan and T. Başar, Network Security: A Decision and Game-Theoretic Approach. Cambridge University Press, 2010.
[4] J. P. Hespanha, Noncooperative Game Theory: An Introduction for Engineers and Computer Scientists. Princeton University Press, 2017.
[5] G. W. Brown, "Iterative solution of games by fictitious play," in Act. Anal. Prod. Alloc. John Wiley & Sons, 1951, pp. 374–376.
[6] J. Robinson, "An iterative method of solving a game," Ann. Math., vol. 54, no. 2, pp. 296–301, 1951.
[7] G. W. Brown and J. von Neumann, "Solutions of games by differential equations," in Contrib. to Theory Games. Princeton University Press, 1952, vol. I, ch. 6, pp. 73–80.
[8] J. B. Rosen, "Existence and uniqueness of equilibrium points for concave n-person games," Econometrica, vol. 33, no. 3, pp. 520–534, 1965.
[9] D. Fudenberg and D. K. Levine, The Theory of Learning in Games. MIT Press, 1998.
[10] S. Hart, "Adaptive heuristics," Econometrica, vol. 73, no. 5, pp. 1401–1430, 2005.
[11] L. Buşoniu, R. Babuška, and B. De Schutter, "A comprehensive survey of multiagent reinforcement learning," IEEE Trans. Syst. Man, Cybern. Part C Appl. Rev., vol. 38, no. 2, pp. 156–172, 2008.
[12] J. R. Marden and J. S. Shamma, "Game theory and distributed control," in Handb. Game Theory with Econ. Appl. Elsevier, 2015, vol. 4, pp. 861–899.
[13] H. von Stackelberg, Market Structure and Equilibrium. Springer, 2011.
[14] Y. A. Korilis, A. A. Lazar, and A. Orda, "Achieving network optima using Stackelberg routing strategies," IEEE/ACM Trans. Netw., vol. 5, no. 1, pp. 161–173, 1997.
[15] T. Roughgarden, "Stackelberg scheduling strategies," SIAM J. Comput., vol. 33, no. 2, pp. 332–350, 2004.
[16] M. Bloem, T. Alpcan, and T. Başar, "A Stackelberg game for power control and channel allocation in cognitive radio networks," in , 2007, pp. 1–9.
[17] J. Pita, M. Jain, J. Marecki, F. Ordóñez, C. Portway, M. Tambe, C. Western, P. Paruchuri, and S. Kraus, "Deployed ARMOR protection: The application of a game-theoretic model for security at the Los Angeles International Airport," in , 2008, pp. 125–132.
[18] J. Tsai, S. Rathi, C. Kiekintveld, F. Ordóñez, and M. Tambe, "IRIS - A tool for strategic security allocation in transportation networks," in , 2009, pp. 37–44.
[19] G. G. Brown, W. M. Carlyle, J. Salmerón, and K. Wood, "Analyzing the vulnerability of critical infrastructure to attack and planning defenses," in Emerg. Theory, Methods, Appl. INFORMS, 2005, pp. 102–123.
[20] G. G. Brown, M. Carlyle, J. Salmerón, and K. Wood, "Defending critical infrastructure," Interfaces, vol. 36, no. 6, pp. 530–544, 2006.
[21] M. Brückner and T. Scheffer, "Stackelberg games for adversarial prediction problems," in , 2011, pp. 547–555.
[22] J. Marecki, G. Tesauro, and R. Segal, "Playing repeated Stackelberg games with unknown opponents," in , vol. 2, 2012, pp. 821–828.
[23] A. Blum, N. Haghtalab, and A. D. Procaccia, "Learning optimal commitment to overcome insecurity," in Neural Inf. Process. Syst. 2014, 2014, pp. 1826–1834.
[24] M. S. Kang, S. B. Lee, and V. D. Gligor, "The Crossfire attack," in , 2013, pp. 127–141.
[25] G. Yang, R. Poovendran, and J. P. Hespanha, "Adaptive learning in two-player Stackelberg games with continuous action sets," in , 2019, pp. 6905–6911.
[26] K.-T. Fang, R. Li, and A. Sudjianto, Design and Modeling for Computer Experiments. Chapman & Hall/CRC, 2005.
[27] R. T. Rockafellar and R. J. B. Wets, Variational Analysis. Springer, 1998.
[28] J.-P. Aubin, Viability Theory. Birkhäuser, 1991.
[29] A. S. Morse, D. Q. Mayne, and G. C. Goodwin, "Applications of hysteresis switching in parameter adaptive control," IEEE Trans. Automat. Contr., vol. 37, no. 9, pp. 1343–1354, 1992.
[30] P. A. Ioannou and J. Sun, Robust Adaptive Control. Prentice Hall, 1996.
[31] G. Yang and J. P. Hespanha, "Modeling and mitigating link-flooding distributed denial-of-service attacks via learning in Stackelberg games," in Handb. Reinf. Learn. Control. Springer, 2021, to be published.
[32] C. Henry, "An existence theorem for a class of differential equations with multivalued right-hand side," J. Math. Anal. Appl., vol. 41, no. 1, pp. 179–186, 1973.
[33] E. P. Ryan, "An integral invariance principle for differential inclusions with applications in adaptive control,"