An Adaptive Multi-Agent Physical Layer Security Framework for Cognitive Cyber-Physical Systems
Mehmet Özgün Demir, Ozan Alp Topal, Ali Emre Pusane, Guido Dartmann, Gerd Ascheid, Güneş Karabulut Kurt
Abstract—Being capable of sensing and of adapting their behavior to changing environments, cognitive cyber-physical systems (CCPSs) are a new form of application in future wireless networks. With advances in machine learning algorithms, the best-performing transmission scheme can be selected to sustain a reliable network of CCPS agents equipped with self-decision mechanisms, where the interactions between agents are modeled in terms of service quality, security, and cost dimensions. In this work, we first define the network utility as a reliability metric, computed as a weighted sum of the individual utility values of the CCPS agents. The individual utilities are calculated by mixing the quality of service (QoS), security, and cost dimensions in proportions determined by individualized user requirements. By changing these proportions, the CCPS network can be tuned for different applications of next-generation wireless networks. We then propose a secure transmission policy selection (STPS) mechanism that maximizes the network utility by using a Markov decision process (MDP). In STPS, the CCPS network jointly selects the best-performing physical layer security policy and the parameters of the selected secure transmission policy to adapt to changing environmental effects. The proposed STPS is realized by reinforcement learning (RL), owing to its real-time decision mechanism, in which agents can automatically determine the policy providing the best utility in a changing environment.
Index Terms—Cyber-physical systems, physical layer security, quality of service, risk-aware control, utility.
I. INTRODUCTION
Cyber-physical systems (CPSs) can be seen as a special form of distributed system with integrated closed loops for control tasks among multiple agents, as shown in Fig. 1. In this paper, we consider a special type of CPS in which all agents communicate their states and control information over wireless channels. In this figure, an agent is defined as an individual unit that has a control center, multiple sensors, and actuators. In real-life deployments, a system consists of multiple interacting agents, such as traffic signs on the road and units inside vehicles in vehicle-to-everything (V2X) or industrial networks. Typical CCPSs have the ability to analyze the environment locally (at the edge) and make decisions to a limited extent. In this article, we
M. Ö. Demir and A. E. Pusane are with Boğaziçi University, Istanbul, Turkey, {ozgun.demir1, ali.pusane}@boun.edu.tr. O. A. Topal and G. Karabulut Kurt are with Istanbul Technical University, Istanbul, Turkey, {ozan.topal, gkurt}@itu.edu.tr. G. Dartmann is with the University of Applied Sciences Trier, Trier, Germany, [email protected]. G. Ascheid is with RWTH Aachen University, Aachen, Germany, [email protected]. This work is supported by TUBITAK under Grant 115E827.
Fig. 1: System model of a CCPS that consists of multiple agents in a layered structure, with possible attacks.

consider agents that utilize reinforcement learning (RL), so that they can analyze the environment and take action in real time. An essential aspect of the environment here is the physical wireless communication channel, which is also essential for monitoring and controlling the security of the communication. On the one hand, the communication channel is influenced by the interaction of the agents; on the other hand, the physical properties of the communication channel also influence the actions and behavior of the agents. Hence, the operation of each agent and the coordination between agents depend on the wireless communication infrastructure. However, a wireless network cannot guarantee data security, since radio signals can be intercepted or manipulated by malicious devices [1], which may be active or passive attackers, e.g., eavesdroppers, as shown in Fig. 1. In many distributed systems, the environment is also changing continuously (e.g., additional agents, updated attacker models, increasing noise, and interference). Therefore, CPSs also have to adapt to changing environmental conditions to provide continuous security as well as the requirements of the target application. In the near future, it is expected that CPSs will have cognitive abilities to sense dynamically changing environments and make decisions based on the updated conditions [2]. This critical evolution of CPSs will be plausible if the agents of a CCPS can adapt their operational preferences, e.g., transmission technology, transmit power, and security policy, to simultaneously provide the desired system performance and security [3]. Numerous security mechanisms have been proposed in the physical layer security (PLS) literature, where the body of work has unraveled different security opportunities at the physical layer considering various setups and attack scenarios [4].
The modus operandi in this body of work is maximizing the security performance of a single PLS policy by optimizing the power-frequency-antenna resources under a time-invariant channel and attack model. Considering a CCPS framework composed of many agents equipped with various resources, utilizing a single security mechanism with a constant purpose would create a bottleneck for the security and reliability of CCPS networks. Instead of focusing on a single specific security scenario, as in our previous work [2], we extend our perspective and try to answer two big-picture questions for the secure CCPS framework: "What is the best available security policy and configuration for the CCPS agents under the sensed channel conditions?" and "Besides security, how can we measure the physical layer performance of a CCPS network?" As an answer to the first question, we propose a decision mechanism in which the best-performing PLS policy, with the best-performing transmission configuration, is selected from the available set of security policies, as shown in Fig. 2. More specifically, we define a control center that mainly collects channel data and decides the best strategy for the agents. Units designed specifically for sensing, adaptation, and configuration are located in the control center, where they collect information about ambient risks and environmental changes from the sensors and estimate the risks and communication resources. They also configure the physical layer parameters of the entire system based on agent preferences. Sensors and actuators are then informed by the control center to update their transmission parameters. Since the various PLS policies are already available in the literature, the selective nature of the algorithm provides adaptability for the CCPS network. As an answer to the second question, we define a user-centric utility as the main performance metric.
The utility is the weighted average of the security, quality of service (QoS), and cost dimensions of the CCPS performance. These dimensions comprise various system performance metrics, classified into three main dimensions, as shown in Fig. 2. The weights of each dimension are determined by the requirements imposed by users. Since the main applications of CPSs have distinct operational requirements, e.g., drone swarms, ultra-reliable low-latency communication (URLLC), massive machine-type communication (mMTC), smart grid, vehicle-to-everything (V2X) communication, and health networks [2], [5], we can tune the weights to perform best under the
Fig. 2: Proposed secure transmission policy selection and designed utility metric with its sub-dimensions for each agent.

sensed channel conditions and available resources. However, separately selecting the best-performing policy for each agent in the network may not yield the best-performing network results, due to the interference caused by other agents and the required operational costs. Instead of using individual utilities for each agent in the network, in this work, we propose mixing the utilities of the agents in appropriate proportions to obtain a single network utility value that represents the total performance of the CCPS network. To obtain the network utility, we consider two different operational structures of CCPS networks. In the first structure, the agents calculate their utilities and select the best-performing PLS policy individually and locally. In the second structure, agents are connected to a central control unit, where the network utility is calculated by averaging their individual utilities. Our contributions in this work can be listed as follows.
• We propose an adaptive PLS policy selection and transmission configuration mechanism, called secure transmission policy selection (STPS), for a multi-agent physical layer security framework using a Markov decision process (MDP). The proposed mechanism is inherently adaptable, in that it may maximize the individual rewards or a joint reward, respectively considering selfish and cooperating agents.
• The reward of an individual agent is calculated by a utility function, which is a weighted sum of the QoS, security, and cost dimensions. By multiplying the utility function with time-dependent characteristics, such as a degrading battery, we propose a more applicable reward function for CCPS frameworks.
• We simulate the proposed mechanism in an RL-based setup considering different beyond-5G applications and environment characteristics.
Simulation results of the proposed mechanism are compared with security-only and QoS-only policy selection mechanisms to demonstrate the trade-offs between the QoS, security, and cost dimensions.
Overall, we provide an adaptive CCPS framework that is able to make the best STPS for each agent to maximize its utility, defined as the weighted sum of the QoS, security, and cost dimensions, based on a time-dependent changing environment and application-based requirements.
A. Related Works
The PLS policies and their performance criteria are surveyed in [4]. One way to secure physical layer transmission is resource allocation at the transmitter node [6]. Other PLS policies are obtained by different signal processing methods, such as beamforming and precoding [7], [8]. Since the message is precoded in line with the receiver's channel, the performance of any other node at a different position would decrease [9]. Antenna-selection or node-selection based PLS policies are another type of physical layer defense mechanism [10]. The joint utilization of different PLS policies has previously been considered in [10] and [11]. In [10], the authors propose antenna selection with beamforming in MIMO networks. However, these works consider utilizing the same PLS policies in changing conditions. In contrast, we consider that a set of PLS policies is available to the CCPS framework. Similar to our previous work [2], this methodology can be categorized under a novel PLS policy selection umbrella in the PLS literature. Another difference from the aforementioned works is that here, we consider multi-agent security, where the aim is to maximize the total utility. Similar to our study, in [12], the authors propose a game-theoretic physical layer security mechanism, where a model-based predictive control system is utilized to connect the cyber and physical layers of autonomous systems. In [13], the authors consider non-orthogonal multiple access (NOMA) to establish a secure wiretap model and optimize the user power allocation coefficients to maximize the secrecy. In [14], the security of CCPS agents is established by optimal power allocation and power splitting at the secondary transmitter under secrecy constraints. In [15], the medium access probability and transmission power of secondary transmitters are jointly optimized to maximize the security of all network nodes.
In [16], the authors propose a beamforming design for a two-way cognitive radio (CR) Internet of Things (IoT) network aided by simultaneous wireless information and power transfer (SWIPT). While these works consider the performance of a single method under stable channel characteristics, our work provides adaptable joint power and PLS policy selection in changing environmental conditions, addressing the corresponding gap in the literature.
B. Organization
In the following section, we detail the calculation of the utility dimensions from the physical layer parameters. In Section III, we provide the MDP-based STPS mechanism and the proposed utility function. In Section IV, we present the simulation parameters and results of the proposed mechanism. In Section V, the paper is concluded.

II. CALCULATING THE DIMENSIONS OF UTILITY
In our system model, we assume that the CCPS network consists of $I$ agents, where $i$ denotes the index of an agent. Due to different requirements and environmental conditions, each agent may have different sets of available PLS policies and transmission configurations, denoted as $\mathcal{P}_i$ and $\mathcal{R}_i$ for the $i$th agent, respectively. These sets can be stated as $\mathcal{P}_i = \{P_i^1, P_i^2, \ldots, P_i^{N_i}\}$ and $\mathcal{R}_i = \{R_i^1, R_i^2, \ldots, R_i^{M_i}\}$, where $N_i$ is the number of available PLS policies and $M_i$ is the number of available transmission configurations for the $i$th agent. When we consider the transmission time slots of a frame, $t$ is the time slot index, where $t = 1, 2, \ldots, T$. Due to the impact of increasing time slots, we have to distinguish which STPS is decided at time slot $t$ from the sets $\mathcal{P}_i$ and $\mathcal{R}_i$ for each agent, where $i = 1, 2, \ldots, I$. Under these circumstances, $\mathcal{K}$ and $\mathcal{L}$ are the sets of PLS policies and transmission configurations chosen by the agents at $t$, where $\mathcal{K} = \{P_1^{k_1}[t], P_2^{k_2}[t], \ldots, P_I^{k_I}[t]\}$ and $\mathcal{L} = \{R_1^{l_1}[t], R_2^{l_2}[t], \ldots, R_I^{l_I}[t]\}$. In these statements, $k_i$ and $l_i$ are the indices of the PLS policies and transmission configurations chosen by each agent, and these terms should satisfy the following conditions:
$$k_i \leq N_i, \quad l_i \leq M_i, \quad \forall i \in \{1, 2, \ldots, I\}. \tag{1}$$
In Fig. 3, the block diagram of the proposed system model is given. In this paper, we assume that the agents are independent and that each agent consists of a transmitter and a receiver unit. We denote the transmitter and the receiver of the $i$th agent as $a_i$ and $b_i$, respectively. The number of antennas of the transmitter node of the $i$th agent is denoted by $W_i$. We assume that each receiver node is equipped with a single antenna. The eavesdropper is also assumed to be equipped with a single antenna.
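As a concrete illustration of this bookkeeping, the sketch below maintains the sets $\mathcal{P}_i$ and $\mathcal{R}_i$ and the chosen indices $k_i$ and $l_i$ under condition (1). It is written in Python rather than the MATLAB used for the paper's simulations, and the class name and example values are illustrative, not part of the paper.

```python
# Illustrative bookkeeping for per-agent PLS policy and transmission
# configuration sets; the Agent class and example values are assumptions.
from dataclasses import dataclass

@dataclass
class Agent:
    policies: list   # P_i: available PLS policies, |P_i| = N_i
    configs: list    # R_i: available transmission configurations, |R_i| = M_i
    k: int = 0       # index of the PLS policy chosen for the current slot
    l: int = 0       # index of the chosen transmission configuration

    def select(self, k, l):
        # Enforce condition (1): k_i <= N_i and l_i <= M_i (0-based here).
        assert 0 <= k < len(self.policies) and 0 <= l < len(self.configs)
        self.k, self.l = k, l

agents = [Agent(policies=["SC-AN", "FD-AI", "AN", "B"],
                configs=[(5, 5), (5, 10), (10, 5), (10, 10)])
          for _ in range(2)]
agents[0].select(k=2, l=1)
# K and L for this slot collect the per-agent chosen policies/configurations:
K = [a.policies[a.k] for a in agents]
L = [a.configs[a.l] for a in agents]
```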
The channel fading coefficient vector between nodes $c_i$ and $d_j$ is expressed by $\mathbf{h}_{c_i d_j} = [h_{c_i d_j}(1), h_{c_i d_j}(2), \cdots, h_{c_i d_j}(W_i)]^T$, where $h_{c_i d_j}(\omega) \sim \mathcal{CN}(0, D_{c_i d_j})$, $c \in \{a, b\}$, $d \in \{a, b\}$, $i, j \in \{1, 2, \ldots, I\}$, and $i \neq j$. $\mathbf{p}^T$ denotes the transpose of the vector $\mathbf{p}$. We consider three different PLS methodologies: subcarrier-based artificial noise (SC-AN), full-duplex artificial interference (FD-AI), and artificial noise (AN) with supported beamforming. We also consider the case of beamforming only (B); hence, the agents may utilize four different PLS policies. The signal models for each of the PLS policies are detailed in the link below. For fairness, we consider that the message transmission power, $R_s$, is equal for all agents.

The operations to obtain the dimensions and the utility are detailed in https://github.com/ozanalpt/PLS-toolbox-for-CCPS/The_Guidebook_of_the_PLS_toolbox_for_CCPS.pdf
Fig. 3: The block diagram of the utility calculation with the proposed MDP-based STPS.

The signal-to-interference-plus-noise ratio (SINR) at node $b_i$ in the $t$th time slot is expressed by
$$\mathrm{SINR}_{b_i}[t] = \frac{|\mathbf{v}_i^H \mathbf{h}_{a_i b_i}|^2 R_s}{\gamma_i^{k_i} (\Upsilon_i[t] + N_i)}, \tag{2}$$
where $\Upsilon_i$ denotes the interference from other agents, $N_i$ denotes the noise variance at the receiver node of the $i$th agent, and $\gamma_i^{k_i}$ denotes the interference cancellation error coefficient of the selected PLS policy. The transmit beamforming vector for the $i$th agent is denoted by $\mathbf{v}_i = \mathbf{h}_{a_i b_i}$. The interference from other agents can be divided into parts as
$$\Upsilon_i[t] = \sum_{j \neq i} \left(\Upsilon_{ji}^s[t] + \Upsilon_{ji}^c[t]\right), \tag{3}$$
where $\Upsilon_{ji}^s$ denotes the interference resulting from the message signals transmitted by other agents, and $\Upsilon_{ji}^c$ denotes the interference resulting from the security signals of other agents. The interference from the message signal transmission can be expressed by
$$\Upsilon_{ji}^s[t] = |\mathbf{v}_j^H \mathbf{h}_{a_j b_i}|^2 R_s. \tag{4}$$
The security interference results from the security signal transmitted by other agents, where $\Upsilon_{ji}^c[t]$ denotes the interference from the security signal of the $j$th agent. This term differs based on the selected PLS policy. Considering the SC-AN policy, $\Upsilon_{ji}^c[t]$ becomes
$$\Upsilon_{ji}^{c,\mathrm{SC\text{-}AN}}[t] = |\mathbf{h}_{a_j b_i}|^2 R_j^{l_j}[t], \tag{5}$$
where $R_j^{l_j}[t]$ denotes the selected transmission configuration of the $j$th agent. Note that $\gamma_i^{k_i} = 2$ for the SC-AN policy; for the other policies, $\gamma_i^{k_i} = 1$. Considering the FD-AI policy, $\Upsilon_{ji}^c[t]$ becomes
$$\Upsilon_{ji}^{c,\mathrm{FD\text{-}AI}}[t] = |\mathbf{h}_{b_j b_i}|^2 R_j^{l_j}[t]. \tag{6}$$
Considering the AN policy, $\Upsilon_{ji}^c[t]$ becomes
$$\Upsilon_{ji}^{c,\mathrm{AN}}[t] = |\boldsymbol{\alpha}_{a_j b_j} \cdot \mathbf{h}_{a_j b_i}|^2 R_j^{l_j}[t], \tag{7}$$
where $\boldsymbol{\alpha}_{a_j b_j} = \mathcal{NS}(\mathbf{h}_{a_j b_j}) / \|\mathbf{h}_{a_j b_j}\|$ and $\mathcal{NS}(\boldsymbol{\zeta})$ denotes the null space of the vector $\boldsymbol{\zeta}$. Considering the beamforming-only policy,
$$\Upsilon_{ji}^{c,\mathrm{B}}[t] = 0. \tag{8}$$
Similarly, the SINR for the link from the transmitter of the $i$th agent to the eavesdropper can be expressed by
$$\mathrm{SINR}_{e_i}[t] = \frac{|\mathbf{v}_i^H \mathbf{h}_{a_i e}|^2 R_s}{\Upsilon_{ie}[t] + N_e}. \tag{9}$$
In this case, $\Upsilon_{ie}[t]$ becomes
$$\Upsilon_{ie}[t] = \sum_{j \neq i} \Upsilon_{je}^s[t] + \sum_{i=1}^{I} \Upsilon_{ie}^c[t], \tag{10}$$
where
$$\Upsilon_{je}^s[t] = |\mathbf{v}_j^H \mathbf{h}_{a_j e}|^2 R_s. \tag{11}$$
In addition to the interference resulting from the security signals of other agents, the security signal from the considered agent also decreases the performance of the eavesdropper. In the following, we detail how the physical layer parameters are converted into the dimensions of the individual utility function.

A. Security
As a security metric, we utilize the secrecy pressure, as proposed in [17]. The main difference from other secrecy metrics is that the location of the eavesdropper is assumed to be unknown. In this way, the secrecy pressure reflects the security level of the considered medium, instead of focusing on exact security levels. As a first step, the ergodic secrecy capacity of each point $(x, y)$ of the surface $S$ is obtained. This step provides a secrecy map of the surface $S$ for the corresponding security policy. After that, an expectation operation is applied over the surface $S$. The location of the eavesdropper is modeled as a 2D Gaussian random variable, as the surface might contain different physical measures in different areas. For example, considering a factory environment, the mean of the Gaussian distribution may be assumed to be the blind spot of the security cameras. The position of the eavesdropper is unknown; therefore, it is denoted by $(x, y)$. The secrecy capacity for the $i$th agent can be represented by
$$C_{\mathrm{sec}}^i(x, y) = \max\left\{0, C_B^i - C_E^i(x, y)\right\}, \tag{12}$$
where $C_B^i$ and $C_E^i(x, y)$ are respectively the channel capacities of the link between the $i$th agent's transmitter and receiver, and of the link between the $i$th agent's transmitter and the eavesdropper. The capacity of the $i$th agent's channel can be given as
$$C_B^i = \frac{1}{2} \log_2\left(1 + \mathrm{SINR}_{b_i}\right), \tag{13}$$
and the eavesdropper's channel capacity is
$$C_E^i(x, y) = \frac{1}{2} \log_2\left(1 + \mathrm{SINR}_{e_i}(x, y)\right). \tag{14}$$
Note that the capacity at a generic point $(x, y)$ is given, since the location of the eavesdropper is unknown. Since the secrecy capacity depends on the mutually independent and identically distributed (i.i.d.) channel fading coefficients, the ergodic secrecy capacity can be given as
$$\tilde{C}_{\mathrm{sec}}^i(x, y) = \mathbb{E}\left[C_{\mathrm{sec}}^i(x, y)\right], \tag{15}$$
where $\mathbb{E}\{\cdot\}$ denotes the expectation operator. Note that the ergodic secrecy capacity is also defined for a generic $(x, y)$ point.
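The steps in (12)-(15) can be sketched with a small Monte Carlo estimate. This is only a sketch: it assumes Rayleigh fading, so the instantaneous SINRs are drawn as exponential random variables with illustrative mean values rather than computed from the paper's interference model, and the function name is an assumption.

```python
# Monte Carlo sketch of (12)-(15): ergodic secrecy capacity at a generic
# point (x, y). Rayleigh fading and the mean SINR values are illustrative
# assumptions, not the paper's exact channel model.
import numpy as np

rng = np.random.default_rng(0)

def ergodic_secrecy_capacity(snr_b_mean, snr_e_mean, n_trials=20000):
    # i.i.d. exponential instantaneous SINRs (Rayleigh-fading power gains).
    g_b = rng.exponential(snr_b_mean, n_trials)   # legitimate-link samples
    g_e = rng.exponential(snr_e_mean, n_trials)   # eavesdropper-link samples
    c_b = 0.5 * np.log2(1.0 + g_b)                # (13)
    c_e = 0.5 * np.log2(1.0 + g_e)                # (14)
    c_sec = np.maximum(0.0, c_b - c_e)            # (12)
    return c_sec.mean()                           # (15): E[C_sec]

# The weaker the eavesdropper's link at (x, y), the larger the ergodic
# secrecy capacity at that point of the secrecy map:
assert ergodic_secrecy_capacity(10.0, 1.0) > ergodic_secrecy_capacity(10.0, 5.0) > 0.0
```

Repeating this estimate over a grid of $(x, y)$ points yields the secrecy map that (16) then integrates against the eavesdropper-location pdf.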
Calculating the ergodic secrecy capacity for each point on the surface $S$ yields the secrecy map of the surface. Considering this perspective, we can weigh the ergodic secrecy capacity at each generic point $(x, y)$ by the probability of the eavesdropper's presence at that point. Then, the ergodic secrecy pressure of the surface $S$ at time slot $t$ can be given by
$$\Omega_i[t] = \iint_S \gamma(x, y)\, \tilde{C}_{\mathrm{sec}}^i(x, y)\, dx\, dy, \tag{16}$$
where $\gamma(x, y)$ is the probability density function (pdf) of the presence of the eavesdropper at point $(x, y)$ on the surface. In this case, the security dimension of the $i$th agent can be obtained as
$$s_i[t] = \frac{\Omega_i[t]}{\max_{j \in \{1, 2, \ldots, I\}} \Omega_j[t]}. \tag{17}$$

B. Quality of Service
QoS indicates the general performance of the communication link. In the considered setup, we take the SINR levels at the receiver nodes of the agents as their QoS metric. As described in the security calculation, the SINR levels with respect to each policy and transmission configuration are calculated. Then, they are normalized by the maximum SINR level under the determined STPS. Hence, the QoS of the $i$th agent can be described by
$$q_i[t] = \frac{\mathrm{SINR}_{b_i}[t]}{\max_{j \in \{1, 2, \ldots, I\}} \mathrm{SINR}_{b_j}[t]}. \tag{18}$$

C. Cost
Cost is a measure of the limited resources in the CCPS framework. The number of active antennas, the selected transmission power level, and the selected coding scheme are the main elements determining the battery life and memory usage of the CCPS agents. In this work, we consider the cost of the $i$th agent as a weighted sum of the utilized number of antennas and the selected transmit power configuration:
$$c_i[t] = \frac{W_i[t] + R_i^{l_i}[t]}{\max_{j \in \{1, 2, \ldots, I\}} \left[W_j[t] + R_j^{l_j}[t]\right]}. \tag{19}$$
As described in the previous section, the weighted sum of the security, QoS, and cost dimensions determines the utility of the agent. The processes explained in this part are conceptualized in [2]. Even though this methodology effectively illustrates the general performance of a single agent, it becomes inefficient for multi-agent systems, where limited resources, such as frequency bandwidth and battery life, are shared by all agents. Therefore, in the following section, we first obtain a novel performance metric, named the network utility, which models the performance of the CCPS framework considering the joint utilization of the shared resources. Then, we detail the proposed STPS approach for the considered multi-agent framework.

III. ADAPTIVE MULTI-LAYER SECURITY FRAMEWORK FOR MULTI-AGENT SYSTEMS
A. Network Utility Calculation
During the deployment of an adaptive CCPS framework for multi-agent systems, a proper utility calculation should be performed before each transmission, as shown in Fig. 3. Defining the utility is not a straightforward problem; however, it can be calculated as a weighted sum of the security, QoS, and cost terms. The calculation of these terms is highly related to the interactions of the agents inside the network, the PLS policies, and a proper performance metric. These calculations are explained in detail in Section II. In short, the normalized values of security, QoS, and cost are denoted as $s_i[t]$, $q_i[t]$, and $c_i[t]$. The importance of these dimensions fundamentally depends on the chosen CCPS application. However, the roles of the agents may also be distinct from one another; therefore, the weights of the dimensions for each agent may vary significantly. Another factor is the impact of the active communication time on the weights of each dimension. Due to the rise in power consumption and the reduction of the QoS levels of the agents, we try to avoid retransmissions as the number of successfully transmitted packets in a frame increases. To deploy this scheme, we slightly adjust the weights by increasing the weights of the QoS and cost dimensions while decreasing the weight of the security dimension after each successful communication time slot. Under these circumstances, we can state the vector of weights, including the impact of the increasing number of time slots, as
$$\mathbf{w}_i[t] = \begin{bmatrix} \gamma_s[t] & \gamma_q[t] & \gamma_c[t] \end{bmatrix}^T \circ \begin{bmatrix} w_{i,s}[t] & w_{i,q}[t] & w_{i,c}[t] \end{bmatrix}^T, \tag{20}$$
Fig. 4: MDP-based PLS policy selection and transmission configuration schemes: (a) individual decision, (b) joint decision.

Here, $w_{i,s}[t]$, $w_{i,q}[t]$, and $w_{i,c}[t]$ are the security-, QoS-, and cost-related weights at $t$ for the $i$th agent, and $\circ$ is the Hadamard product operator. These weights should satisfy $w_{i,s}[t] + w_{i,q}[t] + w_{i,c}[t] = 1$ for $\forall i \leq I$ and $\forall t \leq T$. In (20), $\gamma_s[t]$, $\gamma_q[t]$, and $\gamma_c[t]$ are the impacts of the increasing number of time slots during the transmission of a frame. We also assume that there are negative effects on system performance when a different PLS policy is chosen in consecutive time slots. If different policies are selected sequentially, the agents may have to turn some of their hardware on or off. This procedure may result in additional delays and cost; therefore, we also consider these impacts by defining transition values of the PLS policies from policy $k_i$ to $k_i'$, which denotes the next possible value of $k_i$. The transition value is chosen as $\delta_i[t]_{k_i \to k_i'} = 1$ where $k_i = k_i'$; on the other hand, the condition $\delta_i[t]_{k_i \to k_i'} < 1$ is satisfied where $k_i \neq k_i'$. A similar formulation could be generated for the transmission configuration adjustments of the agents, but we omit these impacts to simplify the model. When we combine the dimensions of the utility, the weights of each dimension, the impacts of increasing time slots, and the transitions to different PLS policies, we can express the utility as
$$U_i^{k_i, l_i}[t] = \delta_i[t]_{k_i \to k_i'} \cdot \left( \mathbf{w}_i[t]^T \times \begin{bmatrix} s_i[t] \\ q_i[t] \\ (1 - c_i[t]) \end{bmatrix} \right). \tag{21}$$
In this utility expression, the cost-related term enters as $(1 - c_i[t])$, since an increased cost must lead to a smaller utility. The expression (21) stands at the center of the CCPS framework and is used throughout the rest of the paper.
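Equations (20)-(21) amount to a few lines of arithmetic. The sketch below uses illustrative weights, dimension values, and a transition penalty, none of which are the paper's simulation values.

```python
# Sketch of (20)-(21): per-agent utility as a weighted mix of security,
# QoS, and cost, with a transition value delta penalizing a policy switch.
# All numeric values are illustrative assumptions.
import numpy as np

def utility(s, q, c, base_w, slot_impact, delta):
    # (20): Hadamard product of the time-slot impacts and the base weights.
    w = np.asarray(slot_impact) * np.asarray(base_w)
    # (21): the cost dimension enters as (1 - c), so higher cost lowers utility.
    return delta * float(w @ np.array([s, q, 1.0 - c]))

base_w = [0.5, 0.3, 0.2]        # w_s + w_q + w_c = 1
slot_impact = [0.9, 1.1, 1.1]   # security de-emphasized as the frame progresses
u_keep   = utility(s=0.8, q=0.6, c=0.4, base_w=base_w, slot_impact=slot_impact, delta=1.0)
u_switch = utility(s=0.8, q=0.6, c=0.4, base_w=base_w, slot_impact=slot_impact, delta=0.8)
assert u_switch < u_keep        # switching policies (k_i != k_i') is penalized
```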
It should also be noted that the utility values lie between 0 and 1, due to the normalization of the weights and the utility dimensions.

B. MDP-based Secure Transmission Policy Selection
In order to deploy a flexible decision mechanism, an MDP-based Q-learning algorithm is developed, as shown in Fig. 3. This algorithm runs inside the control center of each agent at each time slot $t$. In the designed Q-learning algorithm, the states are the available PLS policies and transmission configurations, as shown inside the STPS block in Fig. 3. In the same figure, the current state indicates the last active PLS policy from time slot $t$. The actions between states correspond to changing or keeping the current configuration for each transmission time slot, based on the reward values calculated with the aid of the utility values. It should be noted that this model is valid for the individual selections of the agents, and it should be updated by combining agent-based procedures for the joint policy and transmission configuration selection of the agents. As discussed earlier, the number of available PLS policies is $N_i$ and the number of available transmission configurations is $M_i$ for the $i$th agent. Therefore, the number of states is $N_i + M_i + 1$ after adding the current operational state during individual operation. This results in $N_i$ available actions for the first stage of the algorithm and $M_i$ available actions for the second stage for individual decisions. In the case of joint decisions by the agents, the number of available actions in the first stage is $N_1 \times N_2 \times \cdots \times N_I$, while the number of available actions in the second stage is $M_1 \times M_2 \times \cdots \times M_I$. With this procedure, an agent decides an STPS based on the utility values calculated by (21) for each possible policy and transmission configuration. It also indicates the initial state of the operation at the beginning. In this procedure, there is a reward for each transition, which is calculated with the help of the utility values.
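The state and action counts above can be tallied in a few lines; the sketch uses the four-policy, four-configuration sets of the later simulations as example values.

```python
# Counting states and stage-wise actions for the two-stage STPS,
# per the formulas above; N and M values are the example 4/4 sets.
import math

N = [4, 4]   # available PLS policies per agent (N_1, N_2)
M = [4, 4]   # available transmission configurations per agent (M_1, M_2)

# Individual decisions (agent i): N_i + M_i + 1 states
# (policies + configurations + the current operational state).
states_individual = [n + m + 1 for n, m in zip(N, M)]

# Joint decisions: stage-wise action counts are products over the agents.
joint_policy_actions = math.prod(N)   # N_1 x N_2
joint_config_actions = math.prod(M)   # M_1 x M_2
```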
At the policy selection stage, the rewards are the specific values of the calculated utilities with a determined transmission configuration $l_i$ at time slot $t$, i.e., $U_i^{1, l_i}[t], U_i^{2, l_i}[t], \ldots, U_i^{N_i, l_i}[t]$. The rewards in the second selection stage are calculated using the utility differences between the possible transmission configurations and the transmission configuration chosen at the policy stage, $l_i$. The rewards from the state of the first PLS policy to the possible transmission configurations of this policy can be written as $U_i^{1,1}[t] - U_i^{1,l_i}[t],\ U_i^{1,2}[t] - U_i^{1,l_i}[t],\ \ldots,\ U_i^{1,M_i}[t] - U_i^{1,l_i}[t]$. These values can be positive or negative, depending on the utility values and the calculation of their sub-dimensions, as given in Section II. If we update the MDP for a joint selection of PLS policies,

TABLE I: RL-based simulation parameters and the decision methodologies for actualizing a STPS
Decision methodology — time slot of the agents: for individual decisions, each agent decides based on time slot $t - 1$; for joint decisions, the agents decide simultaneously.
Simulation parameters (reported separately for individual and joint decisions): the number of agents $I$; the learning rate; the exploration parameter $\epsilon$; the number of training episodes; the number of time slots $T$; and the transition value $\delta_i^{k_i \to k_i'}$, which is smaller than 1 for $k_i \neq k_i'$ and equal to 1 for $k_i = k_i'$.

the available number of policies and transmission configurations drastically increases. Cooperating agents do not have to operate with the same STPS. In some cases, they should not run with the same STPS, since some of the policies are very disadvantageous and costly if multiple agents operate with them. As a result, we have to extend the MDP procedure for joint operation. In this case, the number of available policies at the policy stage is $N_1 \times N_2 \times \cdots \times N_I$. In this setup, each state represents the PLS policies of the agents; e.g., a state indicates two possible PLS policies if we have a joint decision model for two agents. Similarly, the number of available configurations at the transmission configuration selection stage is $M_1 \times M_2 \times \cdots \times M_I$ in the case where $I$ agents jointly make their STPS. In the following section, we utilize RL-based simulations to implement the MDP architecture with the network utility.

IV. REINFORCEMENT LEARNING SIMULATIONS
In order to analyze the performance of the CCPS security framework for multi-agent systems, an RL-based operation is studied and simulated in the MATLAB environment. RL can easily be applied to cyclic system models, including CCPS, thanks to their similar designs.

In the given simulations, we study I = 2 agents, which are able to operate individually or jointly. In the scope of this paper, the considered PLS policies are P_1 = P_2 = {SC-AN, FD-AI, AN, B} for N_1 = N_2 = 4. The determined tuples of transmission configurations for each agent are R_1 = R_2 = {(5,5), (5,10), (10,5), (10,10)} dB for M_1 = M_2 = 4, where (·,·) indicates the transmission power and artificial noise level of an agent, respectively. A rectangular surface is assumed as the CCPS environment. The transmitter and receiver nodes of the first agent are placed in the upper half of the surface, while those of the second agent are placed symmetrically in the lower half. The eavesdropper is located at the center, (x_e, y_e) = (0, 0). The pdf of the eavesdropper's location is a 2D Gaussian distribution with a fixed variance. The noise variances at the eavesdropper and the agents are assumed to be equal.
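For the joint decision methodology, the sizes of the joint policy and configuration spaces follow directly from the per-agent sets listed above; a small sketch:

```python
from itertools import product

policies = ["SC-AN", "FD-AI", "AN", "B"]        # P_1 = P_2, so N_1 = N_2 = 4
configs = [(5, 5), (5, 10), (10, 5), (10, 10)]  # (P_S, P_N) in dB, M_1 = M_2 = 4

# Joint states hold one PLS policy / configuration per agent (I = 2).
joint_policies = list(product(policies, policies))  # N_1 x N_2 joint policies
joint_configs = list(product(configs, configs))     # M_1 x M_2 joint configurations

print(len(joint_policies), len(joint_configs))      # prints: 16 16
```

This is why the joint Q-tables grow multiplicatively with the number of agents, as noted in the MDP extension above.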
[Fig. 5 appears here. (a) STPS of the agents: the selected PLS policies (SC-AN, FD-AI, B, AN) and transmission configurations (P_S, P_N) in {(5,5), (5,10), (10,5), (10,10)} dB of Agent 1 and Agent 2 over 50 time slots, under the individual and joint decision methodologies. (b) Agent-based utility values and average weighted sum values over the same time slots.]
Fig. 5: Simulation results observed with individual and joint decision methodologies for chosen CCPS applications.

In Table I, we present the RL-based training and simulation parameters. Future values are taken into account by choosing a discount rate of one while maximizing the utility during the training of the RL environment. T, the number of time slots over which the changing weights of the utility dimensions affect the behavior of the agents, is chosen as 50. There are two possible epsilon values for the designed STPS methodologies; these values are chosen to maximize the utilities during the operations.

In this paper, we consider four main applications of CCPS (drone swarms, URLLC, mMTC, and health networks), whose weights are given in Table II and are also presented in [2]. As detailed in the previous section, the utility calculation consists of several parameters, including the vector of weights for each dimension w_i[t], the transition vector d_i^{k_i}, and the impacts of an increasing number of time slots on each dimension, denoted as gamma_s[t], gamma_q[t], and gamma_c[t], respectively. These weights are arbitrarily chosen and can be altered based on usage differences. As the second parameter, each transition from one PLS policy to another creates a utility loss with a fixed ratio; therefore, delta_1^{k_i -> k'_i} = delta_2^{k_i -> k'_i} is nonzero when k_i != k'_i, as given in Table I. This value can also be updated for various operational environments, which may impose additional penalties, e.g., delays and distortion, for changing PLS policies.

In the simulations, the considered decision methodologies are based on the individual and joint operations of the agents, with different Q-learning tables and MDP-based schemes, as shown in Fig. 4. When the agents individually determine their operational PLS policy and transmission configuration, they make this decision based on the other agent's STPS from the previous time slot, as shown in Table I. On the other hand, the agents decide their configurations simultaneously if they choose the joint decision methodology.

TABLE II: Utility weights for chosen CCPS applications for making an STPS. Each application is defined by the weights w_{i,s}[1], w_{i,q}[1], and w_{i,c}[1] of the security, QoS, and cost dimensions for each agent.

According to the first decision methodology, each agent makes the decisions that maximize its own utility without considering the performance of the other agents. The individual decision methodology is given in Fig. 4a for both of the agents, where i = 1, 2. In this figure, the rewards of each path are indicated in terms of utility values based on the available PLS policies and transmission configurations. At the policy selection stage, (5,5) dB is chosen as the reference transmission configuration, where l_i = 1. Therefore, the rewards of the transmission configuration selection stage are calculated as the difference between the utility of the (5,5) dB transmission configuration and the utility of each available transmission configuration for the selected PLS policy. In this methodology, the rewards, which are calculated with the utilities expressed in (21), are logged in Q-tables that store the action-based rewards, and these tables are utilized in the RL-based simulations. An exemplary Q-table is given in Table III for the weights of scenario C at the end of the assigned frame time, t = 50. These values closely match the rewards calculated from the MDP-based schemes given in Fig. 4.

TABLE III: Q-table logged for the weights of the scenario C for the second agent when t = 50. The actions are indicated based on the transitions in Fig. 4(a), where the condition N_i = M_i = 4 results in four available actions (Left, Mid-Left, Mid-Right, Right).

  Policy selection stage (states to: PLS policies)
               SC-AN     FD-AI     B         AN
  Last State   0.3783    0.4145    0.5373    0.4336

  Transmission configuration selection stage (states to: (P_S, P_N) in dB)
               (5,5)     (5,10)    (10,5)    (10,10)
  SC-AN        0         -0.1648    0.0011   -0.2011
  FD-AI        0         -0.2121    0.0133   -0.1780
  B            0          0.0410    0.0608    0.0523
  AN           0         -0.1626    0.0586   -0.0997

In the second scenario, we assume that the agents are able to make joint decisions that maximize the overall system utility. Since the agents do not have to operate with the same STPS, the available numbers of PLS policies and transmission configurations are N_1 x N_2 and M_1 x M_2, respectively. Under these circumstances, the MDP-based joint decision methodology is presented in Fig. 4b with the available PLS policies and transmission configurations.

V. RESULTS AND DISCUSSION
A. Simulation Results
In this section, we present the results of simulations considering the different CCPS applications that determine the agent behaviors and capabilities detailed in the previous section. The results are illustrated in Fig. 5, Table IV, and Table V, where each presentation is given for the chosen weights in Table II, since multiple utility weight settings exist for the different applications.

In Fig. 5a, the behavior of the two agents is shown for an increasing number of time slots under the two decision methodologies, with the operating STPS given separately. In these subfigures, readers can follow the changes of each agent as they make dynamic decisions for the changing environments. These results show that, with the individual decision methodology, the agents tend to choose beamforming most of the time to satisfy the operational requirements. On the other hand, there is cooperation between the agents with the joint decision methodology, since at least one of the agents chooses (5,5) dB as the transmission configuration in most cases. The impacts of these selections on the performance, in terms of joint utility values, are shown in Fig. 5b for each agent, together with the average weighted sum of the agent-based utility values. The weighted-sum results show that the joint decision methodology provides higher utility values than the individual decision methodology most of the time. This observation is expected, since the agents operate more adequate PLS policies and transmission configurations with the joint decision methodology than with individual decisions. Moreover, the amount of fluctuation in the individual decision-based results is considerably higher than in the joint decision-based outcomes. This situation is observed due to the lack of cooperation between agents: without cooperation, each agent individually tries to obtain its maximum utility value, but the selection of one agent may not satisfy the utility maximization conditions of the other agent.
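The per-agent utility mixing and the weighted network sum discussed above can be sketched as follows. The weight values and dimension inputs are illustrative placeholders, not the paper's actual parameters, and the cost term is treated here as a normalized indicator in [0, 1]:

```python
def agent_utility(security, qos, cost, w_s, w_q, w_c):
    """Mix the security, QoS, and cost dimensions of one agent with
    application-specific weights (assumed to sum to one)."""
    return w_s * security + w_q * qos + w_c * cost

def network_utility(utilities, agent_weights):
    """Weighted sum of the individual agent utilities."""
    return sum(w * u for w, u in zip(agent_weights, utilities))

# Illustrative two-agent example with placeholder values.
u1 = agent_utility(0.6, 0.5, 0.4, w_s=0.3, w_q=0.5, w_c=0.2)
u2 = agent_utility(0.4, 0.7, 0.5, w_s=0.2, w_q=0.3, w_c=0.5)
u_net = network_utility([u1, u2], agent_weights=[0.5, 0.5])  # ~0.525
```

Tuning w_s, w_q, and w_c per application is exactly what distinguishes the scenarios of Table II from one another.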
This situation is especially valid for the health network and mMTC applications, as shown in the same figure. On the other hand, if there is cooperation, the average weighted sum values are more stable.

To observe the secure transmission performance in terms of the security, QoS, and cost dimensions individually, we focus on Table IV and Table V. In these tables, the average relative indicator results are provided for the individual and joint decision methodologies, for each application and PLS policy. The average relative indicator is defined as the average percentage-based relative difference, over T time slots, between the STPS and each PLS policy operating with the maximum-power transmission configuration (10,10) dB. In other words, these tables show the relative differences in the utility and its dimensions when the proposed methodologies are applied instead of operating with a single PLS policy at maximum transmit power levels. Since these tables present dimension-based relative indicators, the weights of each dimension given in Table II should also be considered. In Table IV, readers may observe that the average relative indicator values are consistent with the assigned weights for both of the agents, due to the independent and selfish decisions of the agents to satisfy their own requirements.

TABLE IV: Average relative indicator for the individual decision-based policy and transmission configuration selection mechanism, with epsilon as given in Table I. Each entry lists Utility / Sec. / QoS / Cost.

  AGENT 1
  Policy   Drone Swarms (a)       URLLC (b)              mMTC (c)                Health Networks (d)
  SC-AN
  FD-AI
  B        0% / 1% / 0% / 99%     3% / 9% / 1% / 122%    7% / 22% / 12% / 69%    13% / 35% / 42% / 46%
  AN

  AGENT 2
  Policy   Drone Swarms (e)       URLLC (f)              mMTC (g)                Health Networks (h)
  SC-AN
  FD-AI    78% / 16% / 80% / 49%  49% / 14% / 49% / 62%  34% / 0% / 49% / 37%    167% / 11% / 47% / 30%
  B        4% / 1% / 5% / 99%     19% / 3% / 21% / 124%  8% / 15% / 21% / 74%    9% / 25% / 22% / 59%
  AN       45% / 4% / 46% / 49%   22% / 5% / 20% / 62%   134% / 18% / 20% / 37%  119% / 27% / 18% / 30%
TABLE V: Average relative indicator for the joint decision-based policy and transmission configuration selection mechanism, with epsilon as given in Table I. Each entry lists Utility / Sec. / QoS / Cost.

  AGENT 1
  Policy   Drone Swarms (a)       URLLC (b)              mMTC (c)                Health Networks (d)
  SC-AN
  FD-AI
  B        6% / 12% / 6% / 82%    14% / 2% / 21% / 98%   7% / 41% / 55% / 32%    14% / 41% / 55% / 32%
  AN

  AGENT 2
  Policy   Drone Swarms (e)       URLLC (f)              mMTC (g)                Health Networks (h)
  SC-AN
  FD-AI
  B        9% / 27% / 30% / 60%   25% / 28% / 42% / 59%  8% / 42% / 57% / 32%    15% / 41% / 57% / 32%
  AN       92% / 29% / 7% / 30%   27% / 30% / 10% / 29%  213% / 43% / 35% / 16%  317% / 44% / 35% / 16%
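The average relative indicator reported in Tables IV and V can be sketched as follows; the exact normalization is an assumption based on its description as an average percentage-based relative difference against a fixed (10,10) dB baseline policy over T time slots:

```python
def average_relative_indicator(stps_trace, baseline_trace):
    """Average percentage-based relative difference between the values
    obtained by the STPS over T time slots and those of a single PLS
    policy operated at the fixed (10, 10) dB configuration.
    Assumed form; the paper gives the definition only in prose."""
    assert len(stps_trace) == len(baseline_trace)
    diffs = [(s - b) / b for s, b in zip(stps_trace, baseline_trace)]
    return 100.0 * sum(diffs) / len(diffs)

# Illustrative example: the STPS value is 10% above the baseline in
# every slot, so the indicator is about 10%.
indicator = average_relative_indicator([0.55] * 50, [0.50] * 50)
```

The same function can be applied per dimension (utility, security, QoS, or cost) to reproduce the structure of the table entries.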
On the other hand, with the joint decision methodology, the decisions are delivered to maximize the overall system utility. Therefore, the average relative indicator values of the utility in Table V are significantly higher than the corresponding outcomes in Table IV. These observations are expected due to the lack of dimension-based constraints while maximizing the utility value. This fact may significantly degrade system performance, and its importance is discussed in the next section.
B. Discussion and Open Issues
The lack of dimension-based constraints is the reason behind the inconsistency between the observed high utility values and the insufficient dimension-based gains. As a result, the maximized utility may not be sufficient to satisfy the specific requirements of the agents. As an example, consider the scenario in Table II in which the cost dimension carries the dominant weight for the first agent. The results show that the joint decision methodology provides a much higher utility than the individual decision methodology for this agent; however, the individual decision methodology satisfies the cost requirement far better than the joint decision methodology. Naturally, this agent should operate individually, while the second agent, which has no dominant weight for any dimension, tries to cooperate with the first agent due to its high utility. Here, a discussion arises on when the average weighted sum of the utility values and when the average relative indicator should be used for deciding on real system deployments. In general, we can state that the agents should decide jointly if their weights are close and there is no dominant weight in Table II, e.g., drone swarms, or if the QoS weight is dominant for both of the agents, e.g., URLLC and V2X applications. On the other hand, the agents should operate individually to maximize their dimension-based requirements, e.g., the agents in the mMTC and health network applications. Similar observations can also be made for other applications of CCPS.

One predictable open issue is the implementation of the proposed system in a real-time testbed. Since the provided algorithm can be implemented over software-defined radios, real-time problems such as hardware impairments and channel estimation errors need to be considered for a real-time application.

Another important open issue is that a CCPS may consist of several nodes with various priorities.
For example, in a health network, the security and availability of an implant sensor would be more important than those of the data access point. Therefore, there should be a hierarchical order among the agents. We believe that our formulation is sufficiently flexible to support a priority-based definition, which can be obtained by adding a multiplication vector to (21). After this update, we can also model the relationship between the agents based on their importance. For instance, some nodes may sacrifice their security to enable a higher QoS for a more important agent.

VI. CONCLUSION
In this work, we have proposed a secure transmission policy selection scheme at the physical layer, based on the Markov decision process, for multiple agents in a CCPS environment. The reward of the proposed scheme is termed the utility, which consists of security, QoS, and cost dimensions. The individual utilities of the agents are combined to form the joint utility, illustrating the overall performance of the CCPS framework. Reinforcement learning-based simulations are conducted considering two different agent behaviors: the individual policy and transmission configuration selection scheme, and the joint policy and transmission configuration selection scheme. By tuning the weights of the utility dimensions, we analyze the performance of the proposed schemes in different beyond-5G applications.

The superior utility performance of the proposed policy selection schemes indicates that a single physical layer security method cannot cover all requirements imposed by different applications. As future work, prioritization-based hierarchical orders of the agents can be studied. To increase the adaptivity of the framework, dimension-based constraints should be added to the physical layer security policy and transmission configuration selections.

REFERENCES

[1] R. Khan, P. Kumar, D. N. K. Jayakody, and M. Liyanage, “A survey on security and privacy of 5G technologies: Potential solutions, recent advancements, and future directions,” IEEE Commun. Surveys Tuts., vol. 22, no. 1, pp. 196–248, 2020.
[2] O. A. Topal, M. O. Demir, Z. Liang, A. E. Pusane, G. Dartmann, G. Ascheid, and G. K. Kurt, “A physical layer security framework for cognitive cyber physical systems,” IEEE Wireless Commun., vol. 27, no. 4, pp. 32–39, 2020.
[3] N. Wang, P. Wang, A. Alipour-Fanid, L. Jiao, and K. Zeng, “Physical-layer security of 5G wireless networks for IoT: Challenges and opportunities,” IEEE Internet Things J., vol. 6, no. 5, pp. 8169–8181, 2019.
[4] D. Wang, B. Bai, W. Zhao, and Z. Han, “A survey of optimization approaches for wireless physical layer security,” IEEE Commun. Surveys Tuts., vol. 21, no. 2, pp. 1878–1911, 2019.
[5] M. O. Demir, A. E. Pusane, G. Dartmann, G. Ascheid, and G. K. Kurt, “A garden of cyber physical systems: Requirements, challenges and implementation aspects,” IEEE Internet Things Mag., early access.
[6] D. W. K. Ng, E. S. Lo, and R. Schober, “Secure resource allocation and scheduling for OFDMA decode-and-forward relay networks,” IEEE Trans. Wireless Commun., vol. 10, no. 10, pp. 3528–3540, 2011.
[7] Y. P. Hong, P. Lan, and C. J. Kuo, “Enhancing physical-layer secrecy in multiantenna wireless systems: An overview of signal processing approaches,” IEEE Signal Process. Mag., vol. 30, no. 5, pp. 29–40, 2013.
[8] H. Wang and X. Xia, “Enhancing wireless secrecy via cooperation: Signal design and optimization,” IEEE Commun. Mag., vol. 53, no. 12, pp. 47–53, 2015.
[9] P. Huang, Y. Hao, T. Lv, J. Xing, J. Yang, and P. T. Mathiopoulos, “Secure beamforming design in relay-assisted Internet of Things,” IEEE Internet Things J., vol. 6, no. 4, pp. 6453–6464, 2019.
[10] N. Yang, P. Yeoh, M. Elkashlan, R. Schober, and I. B. Collings, “Transmit antenna selection for security enhancement in MIMO wiretap channels,” IEEE Trans. Commun., vol. 61, no. 1, pp. 144–154, 2013.
[11] R. Zi, J. Liu, L. Gu, and X. Ge, “Enabling security and high energy efficiency in the Internet of Things with massive MIMO hybrid precoding,” IEEE Internet Things J., vol. 6, no. 5, pp. 8615–8625, 2019.
[12] Z. Xu and Q. Zhu, “A cyber-physical game framework for secure and resilient multi-agent autonomous systems,” in IEEE Conf. on Decision and Control, 2015, pp. 5156–5161.
[13] T. Liu, S. Han, W. Meng, C. Li, and M. Peng, “Dynamic power allocation scheme with clustering based on physical layer security,” IET Commun., vol. 12, no. 20, pp. 2546–2551, 2018.
[14] F. Gabry, A. Zappone, R. Thobaben, E. A. Jorswieck, and M. Skoglund, “Energy efficiency analysis of cooperative jamming in cognitive radio networks with secrecy constraints,” IEEE Wireless Commun. Lett., vol. 4, no. 4, pp. 437–440, 2015.
[15] X. Xu, Y. Cai, W. Yang, W. Yang, J. Hu, and T. Yin, “Energy-efficient optimization for physical layer security in large-scale random CRNs,” in Int. Conf. on Wireless Commun. and Signal Process. (WCSP), 2015, pp. 1–6.
[16] Z. Deng, Q. Li, Q. Zhang, L. Yang, and J. Qin, “Beamforming design for physical layer security in a two-way cognitive radio IoT network with SWIPT,” IEEE Internet Things J., vol. 6, no. 6, pp. 10786–10798, 2019.
[17] L. Mucchi, L. Ronga, X. Zhou, K. Huang, Y. Chen, and R. Wang, “A new metric for measuring the security of an environment: The secrecy pressure,”