Reinforcement Learning based QoS/QoE-aware Service Function Chaining in Software-Driven 5G Slices
Xi Chen, Zonghang Li, Yupeng Zhang, Ruiming Long, Hongfang Yu, Xiaojiang Du, Mohsen Guizani
Xi Chen | Zonghang Li | Yupeng Zhang | Ruiming Long | Hongfang Yu* | Xiaojiang Du (Senior Member, IEEE) | Mohsen Guizani (Fellow, IEEE)

School of Computer Science and Technology, Southwest Minzu University, Chengdu, Sichuan, China
Center for Cyber Security, UESTC, Chengdu, Sichuan, China
Department of Computer and Information Sciences, Temple University, Philadelphia, PA, USA
Department of Electrical and Computer Engineering, University of Idaho, Moscow, Idaho, USA
Summary
With the ever-growing diversity of devices and applications that will be connected to 5G networks, flexible and agile service orchestration with acknowledged QoE that satisfies end-users' functional and QoS requirements is necessary. SDN (Software-Defined Networking) and NFV (Network Function Virtualization) are considered key enabling technologies for 5G core networks. In this regard, this paper proposes a reinforcement learning based QoS/QoE-aware Service Function Chaining (SFC) approach for SDN/NFV-enabled 5G slices. First, it implements a lightweight QoS information collector based on LLDP, which works in a piggyback fashion on the southbound interface of the SDN controller, to enable QoS awareness. Then, a DQN (Deep Q Network) based agent framework is designed to support SFC in the context of NFV. The agent takes into account both QoE and QoS as key aspects to formulate the reward, so that it is expected to maximize QoE while respecting QoS constraints. The experiment results show that this framework exhibits good performance in QoE provisioning and QoS requirements maintenance for SFC in dynamic network environments.
KEYWORDS:
Software-Defined Networking (SDN), Network Function Virtualization (NFV), Service Function Chaining (SFC), Reinforcement Learning, Quality of Service (QoS), Quality of Experience (QoE)
1 INTRODUCTION

Communication networks have evolved through four major generations from 1G to 4G, which respectively feature analog voice service, digitalized voice service, data service, and mobile broadband service. With the increasing diversity of devices and applications connected to 4G networks, network operators face CAPEX (capital expenditure) and OPEX (operational expenditure) pressures, because revenues do not grow proportionally to the massive investment given the current technological architecture. Recent years have witnessed a large amount of effort by different nations, companies, standardization bodies, etc., poured into the research and development of the 5th generation of communication networks, i.e., 5G networks. It is well acknowledged that 5G covers a wide spectrum of research topics in wired core networks as well as wireless networks, where network heterogeneity, security, mobility, etc., should be collectively considered to push its development forward. Different entities hold different technological visions of 5G networks. According to reference, the architecture of 5G networks can be roughly divided into three layers: the physical infrastructure layer, the virtualized layer, and the service layer. The physical infrastructure layer holds various compute, storage, and network resources, which are abstracted as virtual resource pools to enable easy utilization and resource sharing. By invoking APIs exposed by the virtualized layer, the service layer delivers services oriented to end-users or devices.
The physical infrastructure layer is comparatively static, while the virtualized layer on top of it is dynamic. More concretely, it can be assumed that the number and configurations of physical compute, storage, and network resources are comparatively immutable in the short term, while the number and configurations of VMs (Virtual Machines) and VNF (Virtualized Network Function) instances are mutable on demand to support their application-specific service layers. This implies two entailments. On one hand, the underlying network resources should be "sliced" for different service renderers (i.e., tenants) with minimal conflicts; this is the concept of slicing, where the network is vertically tailored into multi-layer slices independently controlled and managed by corresponding tenants.

On the other hand, network services nowadays are orchestrated by different network functions (NFs, e.g., firewall, deep packet inspection, WAN optimizer, proxy, etc.), often in a virtualized fashion (i.e., VNFs) to provide required functionalities as well as to possibly improve QoE and maintain QoS. This is where service function chaining (SFC) comes into the picture. Therefore, an SFC framework that is QoS/QoE-aware inside a slice is key to successful service orchestration and delivery in 5G core networks. While MEC (Mobile Edge Computing) aims to handle issues related to 5G edge networks, SDN (Software-Defined Networking) and NFV (Network Function Virtualization) are considered key enabling technologies for 5G core networks, which fabricate the backbone of 5G ecosystems and offer critical network services. We envision that at least three aspects should be addressed for SFC orchestration with regard to highly dynamic SDN/NFV-enabled 5G slices.

1. A "smart" orchestration agent that is adaptive to the changing environment, so that it can learn to approximate the optimal SFC orchestration policy with minimal human interference for automation purposes, i.e., the learning aspect.
2. A lightweight mechanism to evaluate the QoE of a service function chain in changing environments, so that the agent can learn to maximize the user experience, which is the key factor for 5G user subscriptions, i.e., the awareness aspect.
3. The ability to explore VNF alternatives that can potentially orchestrate the chain (e.g., for the purpose of load-balancing), while exploiting the best known VNF instances to optimize QoE, i.e., the exploration-exploitation aspect.

Reinforcement learning, with its trial-and-error mechanism (for aspect 1), reward mechanism (for aspect 2), and exploration-exploitation ability (for aspect 3), makes a competitive candidate for the SFC orchestration framework in 5G slices. Meanwhile, recent years have also seen its application in modern network paradigms for user experience optimization, cost minimization in resource allocation, etc. Based on the discussion above, this paper proposes a reinforcement learning based QoS/QoE-aware service function chaining framework in the context of software-driven (i.e., SDN/NFV-enabled) 5G slices. The authors believe the work presented in this paper addresses some missing parts of the current research on SFC. On one hand, many previous works abstract the SFC problem as a mathematical programming problem and present heuristic algorithms to balance efficiency and optimality. The quality of the service function chain is usually judged by the derived cost (the smaller the better) or utility (the bigger the better).
The calculation of cost/utility usually depends on QoS metrics (such as bandwidth, delay, throughput, etc.). Existing literature often assumes that these metrics have been readily collected through some mechanism under the hood. However, the collection of QoS information from various entities is often challenging and needs an elaborate design to implement a lightweight framework. On the other hand, although QoS is well studied with regard to SFC, QoE is not extensively considered in previous works. QoE and QoS exhibit a non-linear mutual relationship; thus the guarantee of QoS does not necessarily ensure highly acknowledged QoE. Therefore, QoE should be explicitly considered for SFC. To summarize, the contribution of this paper is twofold:

• A lightweight QoS information collecting scheme with regard to SDN deployment in 5G slices. This scheme uses LLDP as the "ferry" to carry QoS information collected from underlying switches in a piggyback fashion, so that no fundamental modification of the current OpenFlow-based southbound interface needs to be made. QoS information collecting is helpful in evaluating QoE and maintaining QoS constraints.

• An MDP (Markov Decision Process) modeled, reinforcement learning based service function chaining algorithm proposed in the context of NFV. This algorithm takes into account both QoE and QoS as key aspects to formulate the reward, so that it is expected to maximize QoE while respecting QoS constraints.

The remainder of this paper is organized as follows. Section 2 summarizes related works and makes a brief comparison to our work. Section 3 exhibits the general system architecture. Section 4 explains the novel lightweight QoS information collecting scheme and how the collected information is used to simplify the network topology. Section 5 gives the details of the MDP-modeled reinforcement learning based QoS/QoE-aware SFC framework. Section 6 conducts experiments on both the novel lightweight QoS collecting scheme and the QoS/QoE-aware
service function chaining. Finally, this paper is concluded in Section 7.

2 RELATED WORK

The trend of softwarized control led by SDN and virtualized network functions led by NFV has made flexible service function chaining, also known as service/middlebox chaining, feasible. StEERING extended OpenFlow and the NOX controller and implemented middlebox-based service chaining. The key to the extension is the split of a monolithic flow table into several micro tables to constrain the "rule explosion" (i.e., too many rules) during the mapping between service function chains and rule table entries. StEERING abstracts service chaining as a graph theory problem, for which a greedy algorithm combined with heuristics was proposed.

SIMPLE proposed to conduct service function composition in the context of SDN. Service composition is split into two stages: an online stage and an offline stage. During the offline stage, the TCAM capacity is treated as the primary constraint, based on which Integer Linear Programming is adopted to solve the problem; during the subsequent online stage, a simplified Linear Programming algorithm is used to tackle the load-balancing problem.

FlowTags holds the opinion that it is difficult to track the states of user traffic while traversing operator networks, which might result in the incorrect enforcement of network-wide policies. It is, therefore, also difficult to construct a service function chain that satisfies a user's business logic requirements and the policy requirements of operator networks. FlowTags tags the traffic traversing middleboxes in such a way that context information of middleboxes is organized as packet header tags shared along the service function chain. It can be invoked through southbound interface APIs, so that the correctness of the service function chains and the consistency of network-wide policies are both possibly guaranteed.

Other references advocated the use of named NF instances to facilitate the decoupling of the service plane and the data plane, so that the execution of service instances need not locate concrete positions of these instances, such as IP addresses. Meanwhile, traffic steering is conducted according to the instance names stored in the packet headers, without being translated to flow table entries stored in switches, so that the size of the flow table is contained. Instead, switches store only the mapping between instance names and their IP addresses.

One reference studied the function composition that maximizes throughput. Two algorithms (namely TMA and PDA) are designed, corresponding to offline requests (i.e., those requests whose traffic characteristics can be determined from historical data or an SLA (Service Level Agreement)) and online requests (i.e., those requests whose traffic characteristics can only be determined upon arrival). Both algorithms try to solve the utility-based (i.e., throughput) optimization problem.

Another reference studied the NF consolidation problem, attempting to deploy more NFs onto fewer physical nodes so that the utilization of NFs in the entire network can be made as high as possible. The problem is modeled as an integer programming problem, which is solved using the IBM CPLEX optimization software in small-scale networks. For large-scale networks, a greedy-based heuristic algorithm is designed to find a configuration that tries to minimize the number of VNFs. Similar work was conducted by another reference, whose focus is service function chaining in data centers.
By taking into account the resource (computing, bandwidth, etc.) consumption of computing and switching devices, an energy utilization model is established for data centers. Besides, traffic intensity is used to express the affinity between service functions, based on which service functions with high affinities can be placed nearby or on the same physical server. In this way, extra energy consumption due to long-distance interaction is minimized.

Some of the works presented above are functional service function chaining with little QoS consideration, whereas others consider QoS metrics or other related properties while QoE is not considered. Our work, in contrast, takes into account both QoS and QoE in service function chaining.

MDP has long been used in resource composition, such as Web services composition in early works. It has also been extensively used in network path and service chain composition recently. One reference proposes QoS-Aware Routing based on reinforcement learning. It gives a reward model that takes into account QoS metrics such as bandwidth, delay, etc., and a Softmax-based policy to choose the next-hop forwarding device. The methodology is similar to our work; however, it is applied to forwarding-plane routing with QoS awareness, and it does not combine the QoS constraints with the reinforcement learning framework as our work does. Another reference proposes an MDP model for NFV resource allocation to form a service chain with cost optimization. It adopts Bayesian probability to predict the transition probabilities among VNF instances; thus it boils down to model-based reinforcement learning, where transition probabilities are fully observable and the value iteration approach is used for the solution. However, in their framework, QoS awareness is not deeply considered.
Our work differs from the previous works in that both QoS and QoE are considered in the reward modeling.
3 SYSTEM ARCHITECTURE

The ETSI NFV specification is believed to be one of the most suitable environments in which to deploy our QoS/QoE-aware SFC framework. In addition, we also advocate the deployment of an SDN controller inside a 5G slice on the control plane to implement QoS information collecting along with topology discovery. Based on ETSI specifications, one reference proposes a network slice management and orchestration (MANO) architecture for 5G networks, as shown in Figure 1, where the VIM (Virtual Infrastructure Manager, e.g., OpenStack, KVM, etc.) corresponds to the Infrastructure Manager, the VNFM (Virtual Network Function Manager) corresponds to the Network Slice Manager, and the NFVO (Network Function Virtualization Orchestrator, e.g., OpenStack Tacker) corresponds to the Service Instance Layer. We extend this architecture by implementing a QoS information collector as an SDN controller module (the green rounded rectangle in the middle on the right; see details in Section 4.1). We also extend the NFV MANO (Management and Orchestration) with a reinforcement learning based QoE/QoS-aware SFC agent as a service manager module (the green rounded rectangle at the top right; see details in Section 5).
FIGURE 1 Network Slice Management and Orchestration (MANO) Overview.
4 QOS INFORMATION COLLECTING AND TOPOLOGY SIMPLIFICATION

In standard SDN networks, controllers are aware of switches directly connected to them through bidirectional Hello messages in the standard OpenFlow protocol. However, the underlying link states between switches (i.e., how switches are mutually connected) are not visible to controllers in the first place. Therefore, in the initial stage of an SDN network, controllers do not have topology knowledge of the whole SDN network. In order to perform centralized control over an SDN network, controllers must carry out topology discovery. Controllers usually use LLDP to fulfill such a task. LLDP is an IEEE proposed protocol widely used in the networking arena for topology discovery.

The controller instructs a switch to multicast the LLDP packet to all of its ports through a packet-out (an instructive packet from the controller to switches). This packet-out contains topological information of the switch, such as chassis information, port information, etc. All other switches connected to this sender switch receive the LLDP packet and then match it against their own flow table entries, only to find no matches for LLDP packets. Thus, a switch will send a packet-in (a packet from switches to the controller) containing this LLDP packet to the controller, asking how to process it. Since the packet-in contains topology information about both the sender switch and the receiver switch, the controller can now assert that there exists a link between the two switches. By means of this iterative packet-in/out interaction, the topology of the whole SDN network can be discovered by the central controller. This centralized topology discovery in SDN is quite different from how LLDP works in traditional networks, where topology discovery is done by individual switches independently, although LLDP is used in both cases.

Standard LLDP packets usually contain basic information such as the MAC address, chassis information, port information, etc. As we can see from the above topology discovery phase, no QoS information is contained in LLDP packets. Should QoS information be incorporated, QoS-aware topology discovery could be performed to enable further QoS-aware decisions and policies, making QoS provisioning possible.

LLDP is a TLV (Type/Length/Value, i.e., key-value pair with length information) based protocol where TLVs are used for property descriptions. We can include QoS information as custom TLVs in LLDP packets. In this way, LLDP can be seen as the "ferry" carrying QoS information (i.e., QoS over LLDP) and other useful properties as its payload. We define the QoS TLV as follows in Figure 2.
FIGURE 2 QoS over LLDP Packet Format.

In the TLV Type field, the value must be 127 to indicate that this is a custom TLV. The Length field specifies the length of the variable-length value contained in the TLV. The Organization Code field indicates the designer of this customized TLV; we use the organization code 0xabcdef for the time being. The Subtype field specifies the detailed type of the contained value. The Value String field (i.e., the QoS field in Figure 2) carries the actual value. We encode various QoS metrics in the Value String. In order for the receiver to conveniently parse the different QoS metrics, we use a predefined property order and length. As shown in Figure 2, several metrics are included with fixed length in our current settings, namely delay, bandwidth, packet loss, and jitter, 8 bytes per property. Therefore, a QoS over LLDP packet is 38 bytes longer than a pure LLDP packet. Note that more metrics, such as availability, can be included in future work. Upon receiving the QoS over LLDP packet, the switch fills the QoS metrics into the corresponding TLV fields. We have implemented this mechanism in the Floodlight controller and OVS (Open vSwitch).
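To make the TLV layout concrete, the following is a minimal sketch (not our Floodlight/OVS implementation) of how such an organizationally specific LLDP TLV could be assembled; the subtype value and the use of 8-byte doubles for each metric are illustrative assumptions.

```python
import struct

OUI = 0xabcdef        # organization code used in this paper
QOS_SUBTYPE = 0x01    # assumed subtype for the QoS TLV

def build_qos_tlv(delay, bandwidth, packet_loss, jitter):
    """Assemble a type-127 (organizationally specific) LLDP TLV carrying
    four fixed-length QoS metrics, 8 bytes each, in the predefined order."""
    value = OUI.to_bytes(3, "big") + bytes([QOS_SUBTYPE])
    value += struct.pack("!4d", delay, bandwidth, packet_loss, jitter)
    # LLDP TLV header: 7-bit type and 9-bit length packed into two bytes
    header = (127 << 9) | len(value)
    return struct.pack("!H", header) + value

tlv = build_qos_tlv(delay=25.0, bandwidth=100.0, packet_loss=0.001, jitter=2.0)
assert len(tlv) == 38   # matches the 38-byte overhead stated above
```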
To orchestrate a service function chain is to chain a set of VNF instances distributed on virtual machines or containers initiated on physical commodity servers, usually one instance per VM/container. Servers are inter-connected by physical/virtual forwarding devices (e.g., switches) and links. According to one reference, the total number of middleboxes (i.e., VNF instances in the context of NFV) is comparable to the number of forwarding devices in modern ISP networks or datacenters. Therefore, if entities in the forwarding plane are explicitly involved during SFC orchestration, which is essentially a service plane problem, the problem complexity is much greater than when only VNF instances are taken into account. However, if forwarding entities are not explicitly considered during orchestration, the forwarding-plane datapath still needs to be planned separately after orchestration, as a second step, for traffic steering that sequentially traverses the VNF instances, resulting in a two-tier solution.

If forwarding devices and links are collectively viewed as an "aggregated link" between two servers hosting VNF instances, the topology can be simplified into one with VNF instances as nodes and "aggregated links" as edges, without the involvement of forwarding-plane entities; problem complexity is therefore reduced during orchestration. Note that QoS over LLDP collects QoS information alongside topology discovery, which means that the QoS status of aggregated links can be mathematically inferred to support quality evaluation of orchestration. Let $v_{ij}$ denote the real value of the $j$-th QoS metric of device $i$ in an aggregated link, where $j \in \{dl, bw, pl, av, jt\}$, namely delay, bandwidth, packet loss, availability, and jitter. The algorithms to evaluate the QoS metrics of an aggregated link are shown in Table 1.

TABLE 1 QoS of the Aggregated Link.

| QoS Metric | Aggregation Algorithm |
| dl (delay) | $\sum_{i=1}^{n} v_{ij}$ |
| bw (bandwidth) | $\min_{i \in n} v_{ij}$ |
| pl (packet loss) | $\prod_{i=1}^{n} (1 - v_{ij})$ |
| av (availability) | $\prod_{i=1}^{n} v_{ij}$ |
| jt (jitter) | $\sum_{i=1}^{n} v_{ij}$ |

Now that forwarding-plane entities (e.g., switches, links, etc.) are consolidated into aggregated links between VNF instances, the topology can be largely simplified. Figure 3 gives a simple example of topology simplification. Server-1, which hosts a VNF instance v-ins-1, is connected to Server-2, which hosts v-ins-2, v-ins-3, and v-ins-4, through switch-1 and switch-2 via 3 links. QoS over LLDP discovers the forwarding topology with the corresponding QoS metrics, shown at the bottom with solid squares. Switch-1, switch-2, and the links in between form an aggregated link whose QoS metrics are inferred according to Table 1. Therefore, v-ins-1 is connected to v-ins-2, 3, and 4 by an aggregated link with 25 us delay and 100 Mbps available bandwidth (i.e., the dashed square).
FIGURE 3 An Aggregated Link Example.
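A minimal sketch of the aggregation rules in Table 1, applied to the two-switch example above (the metric names and dict layout are illustrative):

```python
from math import prod

def aggregate_link_qos(devices):
    """Infer aggregated-link QoS from per-device metrics, per Table 1:
    delays and jitters add up, bandwidth is the bottleneck minimum,
    and the loss/availability terms multiply."""
    return {
        "dl": sum(d["dl"] for d in devices),
        "bw": min(d["bw"] for d in devices),
        "pl": prod(1 - d["pl"] for d in devices),
        "av": prod(d["av"] for d in devices),
        "jt": sum(d["jt"] for d in devices),
    }

# switch-1 and switch-2 from Figure 3 (illustrative numbers)
print(aggregate_link_qos([
    {"dl": 10.0, "bw": 100.0, "pl": 0.001, "av": 0.999, "jt": 1.0},
    {"dl": 15.0, "bw": 120.0, "pl": 0.002, "av": 0.999, "jt": 1.5},
]))  # dl = 25.0 (us), bw = 100.0 (Mbps), ...
```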
5 REINFORCEMENT LEARNING BASED QOS/QOE-AWARE SFC

Suppose an SFC request imposes $N$ network functions (e.g., traffic sequentially passes through a firewall, DPI, etc.). Each function can be accomplished by a VNF type $t_i$, $i \in \{1, 2, \cdots, N\}$, and each VNF type $t_i$ has $M_i$ candidate VNF instances. Let $ins_{ij}$ denote the $j$-th VNF instance of VNF type $t_i$, $i \in \{1, 2, \cdots, N\}$, $j \in \{1, 2, \cdots, M_i\}$.

Definition 1 (Functional SFC Orchestration, F-SFC). Let $x_{ij} \in \{0, 1\}$ denote whether $ins_{ij}$ is selected ($x_{ij} = 1$) or not ($x_{ij} = 0$) to accomplish the $i$-th function required by the SFC request. Functional SFC orchestration (F-SFC) is to select one and only one VNF instance from $t_i$ for the $i$-th function:

$$\sum_{j=1}^{M_i} x_{ij} = 1, \qquad \sum_{i=1}^{N} \sum_{j=1}^{M_i} x_{ij} = N \qquad (1)$$

The previous definition implies that there should exist $M_i$ deployed VNF instances so that one of them can be "selected" to fulfill the $i$-th function. However, in the context of SDN/NFV, a VNF instance can be instantiated on demand without prior existence. In other words, the remaining resources of a commodity server that hosts VNF instances can be seen as potential VNF instances (e.g., v-ins-4 in Figure 3) as long as there are enough resources for instantiation. To capture this dynamic nature of VNF instantiation, we regard the remaining resources of the direct successive commodity server as a potential VNF instance for the algorithm to select from. This instantiate-then-select operation differs from pure selection among existing VNF instances in that it incurs extra booting time, extra power consumption, extra operational activities, etc., which can be considered operational expenditures (OPEX). In this regard, equation (1) covers both the deployed VNF instance selection and the on-demand VNF instantiation commonly seen in SDN/NFV scenarios.

Definition 2 (QoE-aware SFC Orchestration, QoE-SFC). Let $C$ denote the set of all service function chains that can functionally satisfy the SFC request. Let $qoe_c$ denote the end-to-end QoE of service function chain $c \in C$. QoE-aware SFC orchestration (QoE-SFC) is the F-SFC that maximizes the end-to-end QoE over all candidate chains. The evaluation of $qoe_c$ will be discussed in Section 5.3.1.

$$\max_{c \in C} qoe_c \quad s.t. \quad \sum_{j=1}^{M_i} x_{ij} = 1, \quad \sum_{i=1}^{N} \sum_{j=1}^{M_i} x_{ij} = N \qquad (2)$$

Definition 3 (QoE/QoS-aware SFC Orchestration, Q2-SFC). Let $qos_c$ denote an $L$-dimensional vector of the QoS metrics of service function chain $c \in C$. Let $qcon$ denote an $L$-dimensional vector of the QoS constraints of the SFC request. Without loss of generality, assume that the first $K$ dimensions of the QoS vector are positive metrics (i.e., the greater the better, e.g., bandwidth) and the remaining $(L - K)$ dimensions are negative metrics (i.e., the smaller the better, e.g., delay). QoE/QoS-aware SFC orchestration (Q2-SFC) is the QoE-SFC that satisfies the QoS constraints of the SFC request.

$$\max_{c \in C} qoe_c \quad s.t. \quad qos_c^t \ge qcon^t, \; t \in \{1, \cdots, K\}; \quad qos_c^t \le qcon^t, \; t \in \{K+1, \cdots, L\}; \quad \sum_{j=1}^{M_i} x_{ij} = 1; \quad \sum_{i=1}^{N} \sum_{j=1}^{M_i} x_{ij} = N \qquad (3)$$

Note that the topology to be dealt with is a simplified topology using aggregated links to reduce complexity. Also, the QoS vector $qos_c$ can be formulated from the QoS information collected by the QoS over LLDP scheme. During the SFC orchestration process, two strategies can be adopted:

• Incremental orchestration: The selection of VNF instances for functions is conducted in a hop-by-hop fashion, so the length of the service function chain gradually increases.
• Monolithic orchestration: Every step produces a complete service function chain (i.e., one VNF instance is selected for each function) and checks whether it maximizes QoE and meets the QoS constraints. If not, another complete service function chain is tried in the next step.

In this paper, we adopt the incremental orchestration strategy because 1) it can be easily mapped to a multi-step reinforcement learning problem modeled as an MDP; and 2) it is finer-grained, so a sophisticated policy can be derived for QoE maximization under QoS constraints. Meanwhile, we envision that the monolithic strategy fits better for QoS maintenance during the runtime of an existing chain, which is out of the scope of this paper.

Our goal in this paper is to implement Q2-SFC using reinforcement learning; for contrast, a brute-force baseline of Definition 3 is sketched after the list below. The reasons why we choose reinforcement learning for the Q2-SFC solution are as follows:

• Reinforcement learning has fewer requirements during training compared with supervised learning, in that no prior extensive training dataset is required. Training datasets from real-world networks are hard to acquire. On the one hand, great efforts are required both in computing and storage to store operational statistics. On the other hand, a dataset might (unintentionally)
contain or infer sensitive data, which is why network operators are not quite willing to share them.

• A model trained by supervised learning can hardly reflect the dynamics of a continuously changing network environment. On the contrary, through its reward mechanism, reinforcement learning can better adapt to environmental changes.
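As a point of reference, the following is a minimal brute-force sketch of Definition 3 (what Section 6 calls violent search); the scoring callables `qoe_of` and `qos_of` are assumptions standing in for the QoE/QoS evaluation developed later in this section.

```python
from itertools import product

def violent_search(instances, qoe_of, qos_of, qcon, K):
    """Enumerate one instance per VNF type (eq. (1)) and return the chain that
    maximizes QoE subject to the QoS constraints of eq. (3).
    instances: N lists of candidate instances; qcon: length-L constraint vector,
    whose first K entries are positive metrics and the rest negative."""
    best_chain, best_qoe = None, float("-inf")
    for chain in product(*instances):
        qos = qos_of(chain)
        feasible = all(qos[t] >= qcon[t] for t in range(K)) and \
                   all(qos[t] <= qcon[t] for t in range(K, len(qcon)))
        if feasible:
            qoe = qoe_of(chain)
            if qoe > best_qoe:
                best_chain, best_qoe = chain, qoe
    return best_chain, best_qoe
```

The search space grows as the product of the $M_i$, which is why this baseline is only practical at small scale.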
An MDP usually consists of five ingredients, i.e., $\{S, A, P, R, \gamma\}$, where $S$ denotes the finite set of states, $A$ denotes the finite set of actions, $P$ denotes the finite set of state transition probabilities, and $R$ denotes the finite set of immediate rewards. $\gamma \in [0, 1]$ is the discount factor, indicating the importance of future rewards relative to the current reward. The solution of the MDP is called a policy, given that in the current state $s \in S$, an action $a \in A$ is selected to maximize the long-term rewards. If the state transition probability $p^a_{s \to s'}$ from state $s$ to $s'$ given that action $a$ is selected is unknown, the model of the MDP is unknown. In that case, solving the MDP is called model-free reinforcement learning.

Note that the quality of a policy $\pi(s, a)$ is not determined by the immediate reward $r$; instead, it is evaluated by long-term rewards. Two value functions are defined to capture this: the state value function $V^\pi(s)$, which indicates the expected accumulated discounted rewards from initial state $s$; and the action value function $Q^\pi(s, a)$ (also called the state-action value function), which indicates the expected accumulated discounted rewards by taking action $a$ from initial state $s$. Mathematically:

$$V^\pi(s) = E\left(\sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \,\middle|\, s_t = s\right) = E\left(r_{t+1} + \gamma V^\pi(s_{t+1}) \,\middle|\, s_t = s\right) \qquad (4)$$

$$Q^\pi(s, a) = E\left(\sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \,\middle|\, s_t = s, a_t = a\right) = E\left(r_{t+1} + \gamma Q^\pi(s_{t+1}, a_{t+1}) \,\middle|\, s_t = s, a_t = a\right) \qquad (5)$$

where $r_t$ indicates the immediate reward of step $t \in \{1, 2, \cdots, T\}$ and $E(\cdot)$ is the mathematical expectation operator. To find the optimal solution of the MDP is to find the policy that maximizes the state value function:

$$\pi^* = \arg\max_\pi V^\pi(s) \qquad (6)$$

Note that, as we will see in later sections, infinite state spaces can also be dealt with in reinforcement learning.
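As a small illustration of the accumulated discounted reward in equation (4), computed backwards over an observed reward sequence (a sketch for intuition only, not part of the framework itself):

```python
def discounted_return(rewards, gamma=0.9):
    """Accumulated discounted reward: sum_k gamma^k * r_{t+k+1} (eq. (4))."""
    g = 0.0
    for r in reversed(rewards):   # fold from the last reward backwards
        g = r + gamma * g
    return g

print(discounted_return([1.0, 0.0, 2.0]))  # 1 + 0.9^2 * 2 = 2.62
```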
According to the Bellman Optimality Equation, the optimal policy $\pi^*$ is the one that satisfies:

$$V^{\pi^*}(s) = \max_{a \in A} Q^{\pi^*}(s, a) \qquad (7)$$

We model Q2-SFC orchestration as an MDP as follows:

• $S$: Every state $s \in S$ represents the system environment, including the network topology, the VNF instances' QoS/QoE status, the functional and QoS requirements of the SFC request being processed, etc.

• $A$: Every action $a \in A$ represents the selection of a certain direct successive VNF instance from the current VNF instance. Obviously, for the $i$-th function there exist $M_i$ actions (selections).

• $P$: Every transition probability $p^a_{s \to s'} \in P$ represents the probability that the QoS/QoE status changes from $s$ to $s'$ under VNF instance selection action $a$. It is unknown here, hence model-free reinforcement learning.

• $R$: Every immediate reward $r \in R$ represents the contribution of the selected VNF instance $ins_{ij}$ to the current QoE of the chain.

The solution of the MDP-modeled Q2-SFC is to find the optimal service function chain $c^* \in C$ (where $C$ is a finite set of service function chains) under policy $\pi$ such that:

$$c^* = \arg\max_{c \in C} E\left(\sum_{t=0}^{T} \gamma^t r_{t+1}\right) \qquad (8)$$

According to equation (8), the key to solving the MDP-modeled Q2-SFC is the reward model that reflects QoE, possibly under QoS and other constraints, together with the policy design that maximizes long-term rewards. Therefore, the reward design should cover aspects beyond QoE alone. We consider the QoE gain, the QoS constraints penalty, and the OPEX penalty to be the ingredients that constitute the overall reward of an action during incremental SFC orchestration.
The evaluation of QoE can be roughly divided into two categories, namely subjective and objective evaluation. Subjective evaluation involves end-users' participation in rating the service from the perspective of direct user perception. Usually, the MOS (Mean Opinion Score) scale is used during subjective evaluation. Although it has advantages, such as intuitiveness, accuracy, etc., in evaluating service experience, subjective evaluation requires great effort in mobilizing user participation and is subject to the varying understanding
and preferences of the various experience metrics. Therefore, subjective evaluation can hardly be applied in large networks; its main application lies in validating other QoE evaluation methods, such as objective evaluation.

Objective evaluation, on the contrary, derives QoE from measurable metrics without end-user involvement, thus making the automation of QoE evaluation possible. QoS metrics are prominently used in automated QoE evaluation, among others, where QoE is calculated from measured QoS metrics together with a consideration of the psychological perception of end-users. Two well-known principles, i.e., the WFL (Weber-Fechner Law) and the IQX (Exponential Interdependency of QoE and QoS) hypothesis, are used in deriving QoE from QoS, both of which give a non-linear relationship between QoS and QoE, as shown in the following equations:

$$dQoS \propto QoS \cdot dQoE, \quad \text{WFL} \qquad (9)$$

$$dQoE' \propto QoE' \cdot dQoS, \quad \text{IQX} \qquad (10)$$

Although WFL and IQX seemingly give contradictory relationships between QoE and QoS (i.e., logarithmic vs. exponential), we argue that WFL and IQX apply to QoS metrics with different tendencies. For positive QoS metrics (the bigger the better), the corresponding QoE can be derived by equation (11) according to WFL, while for negative QoS metrics (the smaller the better), the corresponding QoE can be derived by equation (12) according to IQX, where $qos^t_c$ is the $t$-th QoS metric of service function chain $c$, whose value can be calculated using the algorithms in Table 1. $\alpha_p, \beta_p, \gamma_p, \theta_p, \alpha_n, \beta_n, \gamma_n$, and $\theta_n$ are constant parameters to fine-tune the QoS/QoE relationships. The study and fine-tuning of these parameters are out of the scope of this paper; interested readers can refer to the references for detailed mathematical relationships.

$$qoe^t_c = \gamma_p \times \log(\alpha_p \times qos^t_c + \beta_p) + \theta_p, \quad t \in \{1, 2, \cdots, K\} \qquad (11)$$

$$qoe^t_c = \gamma_n \times e^{\alpha_n \times qos^t_c + \beta_n} + \theta_n, \quad t \in \{K+1, K+2, \cdots, L\} \qquad (12)$$

To give better intuition for these equations, consider some daily experiences. Suppose a user has a 10 Mbps access bandwidth (a positive QoS metric). According to daily experience, increasing the bandwidth to 20 Mbps does not give the user the perception that the speed is twice as fast; on the contrary, it gives a very limited perception of a speed upgrade. However, if the bandwidth is upgraded to 100 Mbps, the perception of the speed upgrade is rather obvious. This is well captured by equation (11) (i.e., WFL). As another example, glitches or paused buffering in video streaming caused by minor packet loss (a negative QoS metric) would greatly compromise the user experience, possibly causing users to impatiently refresh their Web pages; this is captured by equation (12) (i.e., IQX). The overall QoE of a service function chain can be derived by the following equation:

$$qoe_c = \sum_{t=1}^{K} w_t \times qoe^t_c - \sum_{t=K+1}^{L} w_t \times qoe^t_c \qquad (13)$$

The QoE gain of constructing a chain $c$ by taking action $a$ (i.e., selecting all those VNF instances $ins_{ij}$ that form chain $c$) under state $s$ is given in equation (14). How to select the VNF instances $ins_{ij}$ that construct a chain $c$ will be discussed in Section 5.5.

$$gain^{qoe}_c = qoe_c \qquad (14)$$

Intuitively, QoE can be used as the estimate of the immediate reward. In this way, the service function chain with the highest accumulated rewards is considered the best one, which also maps well to standard reinforcement learning, whose solution is the answer to QoE-SFC that maximizes QoE.
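The mapping from a chain's QoS vector to its overall QoE (equations (11)-(13)) can be sketched as follows; the parameter values are placeholders, since their tuning is outside the scope of the paper.

```python
import math

def qoe_of_chain(qos, w, K,
                 pos=(1.0, 1.0, 1.0, 0.0),   # (alpha_p, beta_p, gamma_p, theta_p), assumed
                 neg=(1.0, 0.0, 1.0, 0.0)):  # (alpha_n, beta_n, gamma_n, theta_n), assumed
    """Overall QoE of eq. (13): WFL (eq. (11)) for the first K positive metrics,
    IQX (eq. (12)) for the remaining negative metrics, combined with weights w."""
    a_p, b_p, g_p, t_p = pos
    a_n, b_n, g_n, t_n = neg
    qoe = 0.0
    for t, q in enumerate(qos):
        if t < K:   # positive metric: logarithmic perception (WFL)
            qoe += w[t] * (g_p * math.log(a_p * q + b_p) + t_p)
        else:       # negative metric: exponential degradation (IQX), subtracted
            qoe -= w[t] * (g_n * math.exp(a_n * q + b_n) + t_n)
    return qoe
```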
Nevertheless, in the standard reinforcement learning model, no constraints are explicitly specified. If we adopt standard reinforcement learning for Q2-SFC and simply regard QoE as the reward, no QoS constraints are enforced. Therefore, we believe that standard reinforcement learning does not serve Q2-SFC well. The key to adapting reinforcement learning to Q2-SFC is to embrace QoS constraints. However, if we explicitly specify QoS constraints in reinforcement learning as is done in mathematical programming approaches (like that in Definition 3), we are very likely to face the NP-hardness that prevents a practical solution within polynomial time.

Obviously, there exists a tension between QoE and QoS, in that maximizing QoE requires high resource consumption, while respecting QoS constraints requires low resource consumption. High resource consumption narrows the "distance" between the QoS metrics and the QoS constraints. If the "distance" between the QoS metrics of the chain and the QoS constraints is very small, the probability of violating the QoS constraints is high. This should generate a penalty against the reward, i.e., a negative reward: the closer the "distance", the bigger the penalty. If any QoS metric violates the corresponding constraint, the penalty is considered very severe. To capture this, we define the penalty due to the distance between the QoS metrics $qos_c$ of chain $c$ and the QoS constraints $qcon$ in equation (15), where $P$ is a large-enough constant to penalize QoS constraint violations.

$$pen^{qcon}_c = \begin{cases} P, & \text{if any QoS constraint is violated} \\ P \cdot e^{-\sqrt{\sum_{t=1}^{L} \left| qos^t_c - qcon^t \right|}}, & \text{otherwise} \end{cases} \qquad (15)$$
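A direct transcription of equation (15), assuming the same positive/negative metric split as in Definition 3 and an illustrative value for $P$:

```python
import math

def qos_penalty(qos, qcon, K, P=1000.0):
    """QoS-constraint penalty of eq. (15): a severe constant P on any violation,
    otherwise P * exp(-sqrt(distance)), so near-violations are penalized more."""
    violated = any(qos[t] < qcon[t] for t in range(K)) or \
               any(qos[t] > qcon[t] for t in range(K, len(qcon)))
    if violated:
        return P
    distance = math.sqrt(sum(abs(q - c) for q, c in zip(qos, qcon)))
    return P * math.exp(-distance)
```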
In the discussion of Definition 1, we distinguished deployed VNF instances from potential VNF instances, which captures the fact that VNF instances can be instantiated on demand from sufficient remaining resources on the physical commodity servers hosting VNF instances. The instantiation of a new VNF instance incurs considerable OPEX, since it might require loading remote virtual machine images stored in image repositories and launching the VNF instance; we also consider this a penalty against rewards. The operational expenditure for VNF instance $ins_{ij}$ is defined as follows:

$$pen^{opex}_{ij} = \begin{cases} opex_{normal} + opex^{vm}_i + opex^{vnf}_i, & \text{if } ins_{ij} \text{ is potential} \\ opex_{normal}, & \text{otherwise} \end{cases} \qquad (16)$$

$opex^{vm}_i$ indicates the operational expenditure for booting the corresponding virtual machine for the $i$-th function, whereas $opex^{vnf}_i$ indicates the operational expenditure for launching the corresponding VNF instance. Note that once a potential VNF instance is instantiated and selected, it becomes a deployed VNF instance whose OPEX penalty is $opex_{normal}$, the normal operational expenditure (e.g., normal energy consumption, etc.), if it is selected in the future.

Therefore, the OPEX penalty of chain $c$ is the sum over all VNF instances along $c$:

$$pen^{opex}_c = \sum_{i=1}^{N} pen^{opex}_{ij} \qquad (17)$$

With the previous definitions of the QoE gain, the QoS constraints penalty, and the OPEX penalty, we define the immediate reward associated with chain $c$ as follows:

$$r_c = gain^{qoe}_c - pen^{qcon}_c - pen^{opex}_c \qquad (18)$$

For those VNF instances that participate in the construction of chain $c$ under state $s$ and action $a$, the reward $r_c$ of the chain is evenly distributed among them:

$$r_{ij} = \frac{r_c}{N}, \quad ins_{ij} \in c \qquad (19)$$
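Putting the pieces together, the per-instance reward of equations (14) and (16)-(19) can be sketched as follows, reusing the hypothetical helpers defined above; `opex_of` is an assumed callable implementing equation (16).

```python
def instance_reward(chain, qos, qcon, K, w, opex_of):
    """Immediate reward r_ij of eq. (19): the chain reward of eq. (18),
    split evenly over the N instances that form the chain."""
    gain = qoe_of_chain(qos, w, K)                   # eq. (14): QoE gain
    pen_qcon = qos_penalty(qos, qcon, K)             # eq. (15): constraint penalty
    pen_opex = sum(opex_of(ins) for ins in chain)    # eq. (17): OPEX penalty
    r_c = gain - pen_qcon - pen_opex                 # eq. (18)
    return r_c / len(chain)                          # eq. (19)
```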
The action value function $Q(s, a)$ is the long-term discounted accumulation of the immediate reward $r$. Every time a VNF instance $ins_{ij}$ is selected, an immediate reward $r_{ij}$ is generated and observed as shown in equation (19), which is then used to update the current value of the action value function $Q(s, ins_{ij})$ for $ins_{ij}$ as evidence for future selections. Recall that in Section 5.2, when we defined the state $s \in S$, its main ingredient is the QoS/QoE status, which is a non-discrete space (i.e., QoS and QoE take continuous values that change due to resource consumption). Therefore, tabular reinforcement learning algorithms such as Q-learning are not applicable for state storage and updates. To this end, a DQN (Deep Q Network) is employed to fit the long-term rewards $Q(s, ins_{ij})$ from a specific state $s$ and the immediate rewards $r_{ij}$.

DQN is a modification of Q-learning that uses a Convolutional Neural Network (CNN) to approximate the action value function $Q(s, a)$ with a fitted value $Q(s, a; \theta)$, where $\theta$ is the CNN parameter. Structurally, in our framework, the CNN is in charge of processing the state $s$ and generating $Q(s, ins_{ij}; \theta)$, while the policy (e.g., $\epsilon$-greedy; see Section 5.5) directs the $ins_{ij}$ selection. DQN has two independent networks, namely the evaluation network (eval-net) and the target network (target-net), which are identical in structure. However, the network parameter $\theta$ of the eval-net updates after every iteration, while the network parameter $\theta^-$ of the target-net is temporarily frozen and updated from $\theta$ every $C$ iterations. In our framework, first of all, the loss function is defined as the mean square error:

$$L(\theta) = E\left[\left(r_{ij} + \gamma \max_{ins'_{ij}} Q(s', ins'_{ij}; \theta^-) - Q(s, ins_{ij}; \theta)\right)^2\right] \qquad (20)$$

Then, the gradient is derived accordingly:

$$\frac{\partial L(\theta)}{\partial \theta} = E\left[\left(r_{ij} + \gamma \max_{ins'_{ij}} Q(s', ins'_{ij}; \theta^-) - Q(s, ins_{ij}; \theta)\right) \frac{\partial Q(s, ins_{ij}; \theta)}{\partial \theta}\right] \qquad (21)$$

With gradient descent and back propagation, we can acquire the optimal $Q(s, ins_{ij})$ for a specific $i$-th function.

For a given function, the reinforcement learning agent has two strategies to select VNF instances: 1) it may prefer to choose new VNF instances that have not been tried before, enhancing its perception of the network situation (i.e., exploration) and improving its probability of making optimal decisions; or 2) it may prefer to repeatedly select the currently known best VNF instances according to the current network situation and obtain the maximum known return (i.e., exploitation), a relatively conservative approach. In fact, controllers face the Exploration-Exploitation Dilemma: the risk of excessive exploration is that it is difficult to maximize rewards, while excessive exploitation may lose the chance of discovering better alternatives; at the same time, the available resources may also be exhausted by excessive exploitation (e.g., some nodes might become overloaded). To this end, it is not sufficient to conduct VNF instance selection purely based on the observed QoE/QoS, which constitutes the deterministic (although transient) aspect of the network status; certain probability models should also be
adopted to capture the stochastic nature of a dynamic environment. In other words, the policy, which governs the VNF instance selection in the SFC context, should take into account QoE/QoS as well as probability distributions.

Exploration-exploitation is often modeled as the Multi-Armed Bandit (MAB) problem in the field of reinforcement learning. MAB describes the problem of a gambler repeatedly pulling one of the arms of a gambling machine (i.e., the bandit) for a certain amount of reward (such as dispensing a certain number of coins). However, the reward distribution of each arm is unknown. The goal of the MAB is to maximize the average reward after pulling arms several times. Obviously, in MAB, there is a discovery process to learn the reward distribution of each arm (i.e., exploration), and there is also a process to maximize the average reward over multiple arm pulls (i.e., exploitation).

Commonly seen algorithms for MAB include greedy, $\epsilon$-greedy, Softmax, and UCB (Upper Confidence Bound). The greedy policy repeatedly selects the action with the highest value function $Q^\pi(s, a)$. However, the reward $r_{ij}$ (which is accumulated into $Q^\pi(s, a)$) as defined in our SFC scenario is a non-static value, so it is not practical to achieve long-term reward maximization using the greedy policy. $\epsilon$-greedy, however, balances exploration and exploitation by conducting the greedy policy with probability $(1 - \epsilon)$ and selecting a random action with probability $\epsilon$:

$$\pi \leftarrow \begin{cases} 1 - \epsilon + \frac{\epsilon}{M_i}, & \text{if } ins_{ij} = \arg\max_{j=1}^{M_i} Q(s, ins_{ij}) \\ \frac{\epsilon}{M_i}, & \text{otherwise} \end{cases} \qquad (22)$$

Softmax adopts the Boltzmann distribution in selecting actions, as formulated in equation (23). $\tau > 0$ is the "temperature" parameter, approximating pure exploitation as it approaches 0 and pure exploration as it grows large. $\tau$ thus offers the possibility to balance between exploration and exploitation. Meanwhile, $\tau$ can be decremented to reduce exploration in a later phase when convergence is reached.

$$\pi \leftarrow \frac{e^{Q(s, ins_{ik})/\tau}}{\sum_{j=1}^{M_i} e^{Q(s, ins_{ij})/\tau}} \qquad (23)$$

UCB is another common algorithm for policy enforcement. UCB takes the form of (mean + upper confidence bound), as shown in equation (24), which is also the reason for its name. According to the law of large numbers, the mean can be replaced by the arithmetic average, as shown in the first term; the upper confidence bound is given by the Chernoff-Hoeffding inequality, as shown in the second term. $count_{ij}$ represents the number of times $ins_{ij}$ has been selected, and $count$ represents the total number of SFC requests that have been solved. One advantage offered by UCB is that the workload is adaptively balanced between VNF instances $ins_{ij}$ of the same VNF type $t_i$: the more times a VNF instance $ins_{ij}$ is selected, the greater $count_{ij}$ becomes, decreasing the upper confidence bound. For VNF instances with similar $Q(s, ins_{ij})$, the smaller $count_{ij}$ is, the greater the chance of being selected; thus the workload is adaptively balanced.

$$\pi \leftarrow \max_{j=1}^{M_i} \left( Q(s, ins_{ij}) + \sqrt{\frac{2 \ln count}{count_{ij}}} \right) \qquad (24)$$
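The three selection rules of equations (22)-(24) are compact enough to sketch directly; the UCB bonus below follows the standard UCB1 form.

```python
import math, random

def epsilon_greedy(q_values, eps=0.1):
    """eq. (22): explore with probability eps, otherwise pick the best-known instance."""
    if random.random() < eps:
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda j: q_values[j])

def softmax_select(q_values, tau=0.5):
    """eq. (23): Boltzmann selection; small tau -> exploitation, large tau -> exploration."""
    weights = [math.exp(q / tau) for q in q_values]
    return random.choices(range(len(q_values)), weights=weights)[0]

def ucb_select(q_values, counts, total):
    """eq. (24): mean plus upper confidence bound; rarely chosen instances
    get a larger bonus, which adaptively balances the workload."""
    return max(range(len(q_values)),
               key=lambda j: q_values[j] + math.sqrt(2 * math.log(total) / max(counts[j], 1)))
```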
Based on the previous modeling, we give the DQN_Q2_SFC training algorithm in Algorithm 1. $Q$ is the $M_i$-dimensional vector of the $Q(s, ins_{ij})$ values for the $i$-th function. $EPI\_COUNT$ is the number of training episodes and $REQ\_COUNT$ is the number of SFC requests used for training. Note that in line 9, the selection of VNF instances is restricted in that $ins_{ij}$ must be connected to the current instance, so as to adapt to topological dynamics (e.g., instance shutdown, etc.). Note also that, in our actual implementation, we use $\epsilon$-greedy (i.e., equation (22)); therefore, we can balance between exploration and exploitation by tuning the hyper-parameter $\epsilon$.
Algorithm 1 DQN_Q2_SFC
1:  initialize replay memory D to capacity N
2:  initialize action value function Q with random weights θ
3:  initialize target action value function Q̂ with weights θ⁻
4:  for episode = 1..EPI_COUNT do
5:    reset environment
6:    for sfc_req = 1..REQ_COUNT do
7:      initialize chain c and observe initial observation s
8:      for i = 1..N do
9:        select a connected instance ins_ij by eq. (22), (23), or (24)
10:       observe s' by QoS over LLDP, etc., and observe r_ij by eq. (19)
11:       store transition (s, ins_ij, r_ij, s') in D
12:       if s' is a terminal state, break
13:       s = s'
14:     end for
15:     sample a minibatch of transitions (s, ins_ij, r_ij, s') from D
16:     every C iterations, reset Q̂ = Q
17:     update Q by gradient descent (eq. (20), (21))
18:   end for
19: end for
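As an illustration of lines 15-17, the following is a minimal sketch of computing the targets the eval-net is regressed toward (equation (20)); `q_target` stands in for the target-net and is an assumed callable returning an action-value vector.

```python
import numpy as np

def td_targets(minibatch, q_target, gamma=0.9):
    """TD targets of eq. (20): y = r_ij + gamma * max Q(s', .; theta^-);
    minibatch holds (s, ins_ij, r_ij, s_next, terminal) transitions."""
    ys = []
    for s, ins, r, s_next, terminal in minibatch:
        y = r if terminal else r + gamma * float(np.max(q_target(s_next)))
        ys.append(y)
    return np.array(ys)
```

The eval-net parameters θ are then moved by gradient descent on the squared difference between these targets and Q(s, ins_ij; θ), per equation (21), with θ⁻ refreshed from θ every C iterations.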
6 EXPERIMENTS

The experiment environment is as follows: an Ubuntu 14.04 server with 64 GB of memory, 40 logical CPUs at 1200 MHz, 2 Tesla GPUs (only one used during the experiments), and TensorFlow 1.0.0 (compiled with GPU support).
QoS over LLDP is deployed in a Mininet and Floodlight 1.3 environment. It is tested in a topology (see Figure 4) where a video streaming application is deployed. Host h1 is the video server, hosting a video (about 500 MB in size and 15 min in length), and h3 is the client, streaming the video from h1 using the Firefox browser. During the streaming, the QoS status constantly changes in bandwidth, delay, etc., due to resource consumption. In order to evaluate the impact of QoS over LLDP on network traffic, we capture traffic using Wireshark in two scenarios (video streaming vs. no video streaming) with the above topology. The evaluation duration is 15 min. The evaluation results are shown in Table 2.
FIGURE 4 The Topology for the QoS over LLDP Experiment.

We first analyze the "no video streaming" scenario. As stated above, QoS over LLDP causes extra network overhead since it contains several additional QoS TLV bytes. However, the share of QoS over LLDP traffic (6.56% by bytes) is only slightly greater than that of pure LLDP (5.27%), meaning that QoS over LLDP does not noticeably degrade network traffic performance. For the "video streaming" scenario, QoS over LLDP and pure LLDP take almost the same share of the total traffic: 0.021% and 0.015% by bytes, and 0.59% and 0.58% by packets, respectively. This indicates that QoS over LLDP works in a piggyback fashion with very minor traffic overhead to achieve QoS information collecting. The experiment results indicate that QoS over LLDP is an applicable approach for QoS information delivery in an SDN environment.
Our DQN-based QoS/QoE-aware SFC algorithm is tested by TensorFlow simulation in this section. We compare our algorithm with violent (brute-force) search, which guarantees the best service function chain with the highest QoE, and random search, which gives a functionally feasible chain with minimal response time. The experiment topology contains 10 VNF types, where each type contains 10 VNF instances whose QoS status is generated randomly when the network topology is initialized. The purpose is to compare their performance in QoE provisioning, QoS constraining, response time, etc. 250 episodes are conducted in the comparison, where one episode includes 100 SFC requests. The experiment results are shown in Figures 5 and 6 and Table 3.
FIGURE 5 QoE Comparison between Random, Violent, and DQN-based.

We can see from Figure 5 that our DQN-based algorithm orchestrates service function chains with QoE between that of random search and violent search. Violent search ensures the best QoE, which DQN gradually approaches. This indicates that DQN exhibits a strong learning ability to approximate the best QoE after training. Random search gives only functionally feasible chains without QoE provisioning, and is thus often penalized in terms of QoE, as shown in Figure 5. Meanwhile, with regard to QoS constraining, the DQN-based algorithm respects QoS constraints with overwhelming probability. The reason there are still cases where the DQN-based algorithm violates QoS constraints is that the QoS constraints vector is modeled as a penalty (i.e., a scalar) against the reward.
TABLE 2 QoS over LLDP vs. pure LLDP in different scenarios; duration about 15 min.

| Scenario | Scheme | Total Packets | LLDP Packets | Total Bytes | LLDP Bytes |
| No Video Streaming | Pure LLDP | 20173 | 2567 (12.72%) | 6526386 | 344222 (5.27%) |
| No Video Streaming | QoS over LLDP | 21889 | 2563 (11.71%) | 6738255 | 441760 (6.56%) |
| Video Streaming | Pure LLDP | 440348 | 2550 (0.58%) | 2208747286 | 339866 (0.015%) |
| Video Streaming | QoS over LLDP | 437795 | 2580 (0.59%) | 2086069045 | 443204 (0.021%) |
FIGURE 6 QoS Comparison between Random, Violent, and DQN-based.
TABLE 3 Response Time Comparison between Random, Violent, and DQN-based.

| Algorithm | Response Time |
| Random | … |
| Violent | > … min |
| DQN | … |
This leads to precision and dimension losses that eventually cause violations. Therefore, our algorithm can be seen as a heuristic alternative to violent search, trading exactness for agile orchestration that still achieves high QoE. Note, however, that the probability of the DQN-based algorithm violating QoS constraints gradually decreases, which again exhibits its strong learning ability. In this regard, the DQN-based algorithm balances QoE provisioning and QoS constraining. With regard to response time (Table 3), violent search delivers response times orders of magnitude slower than DQN, which is not acceptable in practical applications, especially time-critical ones. DQN is quick to respond in that it delivers almost constant time complexity after agent training.

7 CONCLUSION

In this paper, we propose a reinforcement learning (DQN, to be exact) based QoS/QoE-aware service function chaining framework for SDN/NFV-enabled 5G slices. It features two aspects: 1) the lightweight QoS over LLDP scheme that brings QoS awareness, and 2) the DQN-based SFC algorithm that synthetically takes into account QoS and QoE as key ingredients to formulate rewards. The experiments show that it is applicable to service function chaining in 5G core network slices in dynamic QoS environments.

In our current work, we focus on QoS/QoE-aware SFC in single 5G slices. In a service outsourcing scenario, a service function chain might involve trans-slice network functions, which are likely to introduce a hierarchical orchestration design. We consider this a future work direction. Meanwhile, our current work focuses on service function chaining in 5G core network slices; our next step is to coordinate MEC in 5G edge networks with SFC in 5G core networks to bring end-to-end QoE-satisfactory services.
References
1. Huang L., Zhu G., Du X.. Cognitive Femtocell Networks: An Opportunistic Spectrum Access for Future Indoor Wireless Coverage. IEEE Wireless Communications Magazine.
EURASIP Journal on Wireless Communications and Networking.
IEEE Wireless Communications.
Ad Hoc Networks.
5. Du X., Zhang M., Nygard K. E., Guizani S., Chen H.-H.. Self-healing sensor networks with distributed decision making. International Journal of Sensor Networks.
Ad Hoc Networks.
IEEE Transactions on Vehicular Technology.
IEEE Communications Magazine.
NEC White Papers.
IETF RFC.
IEEE Communications Surveys & Tutorials.
ACM SIGCOMM Computer Communication Review.
IEEE Communications Magazine.
Cisco White Papers.
ACM SIGCOMM Computer Communication Review.
28. Wen T., Yu H., Du X.. Performance guarantee aware orchestration for service function chains with elastic demands. In: 2017 IEEE Conference on Network Function Virtualization and Software Defined Networks (NFV-SDN): 1–4; 2017.
29. Ding W., Yu H., Luo S.. Enhancing the reliability of services in NFV with the cost-efficient redundancy scheme. In: 2017 IEEE International Conference on Communications (ICC): 1–6; 2017.
30. Doshi P., Goodwin R., Akkiraju R., Verma K.. Dynamic workflow composition using Markov decision processes. In: IEEE International Conference on Web Services (ICWS): 576–582; 2004.
31. Lin S. C., Akyildiz I. F., Wang P., Luo M.. QoS-Aware Adaptive Routing in Multi-layer Hierarchical Software Defined Networks: A Reinforcement Learning Approach. In: 2016 IEEE International Conference on Services Computing (SCC): 25–33; 2016.
32. Yousaf F. Z., Bredel M., Schaller S., Schneider F.. NFV and SDN - Key Technology Enablers for 5G Networks. IEEE Journal on Selected Areas in Communications.
SERIES P: TELEPHONE TRANSMISSION QUALITY. Methods for objective and subjective assessment of quality.
IEEE Network.
Mnih V., Kavukcuoglu K., Silver D., et al.. Playing Atari with Deep Reinforcement Learning. arXiv:1312.5602 [cs].
Machine Learning.