Reinforcement Learning Assisted Load Test Generation for E-Commerce Applications
Mälardalen University, School of Innovation, Design and Engineering, Västerås, Sweden. Thesis for the Degree of Master of Science in Computer Science with Specialization in Software Engineering, 30.0 credits.
REINFORCEMENT LEARNING ASSISTED LOAD TEST GENERATION FOR E-COMMERCE APPLICATIONS
Golrokh Hamidi [email protected]
Examiner: Wasif Afzal
Mälardalen University, Västerås, Sweden
Supervisors: Mahshid Helali Moghadam
Mälardalen University, Västerås, Sweden
Company supervisor: Mehrdad Saadatmand,
RISE - Research Institutes of Sweden, Västerås, Sweden

July 24, 2020

Abstract
Background: End-user satisfaction is not only dependent on the correct functioning of the software systems but is also heavily dependent on how well those functions are performed. Therefore, performance testing plays a critical role in making sure that the system responsively performs the intended functionality. Load test generation is a crucial activity in performance testing. Existing approaches for load test generation require expertise in performance modeling, or they are dependent on the system model or the source code.
Aim: This thesis aims to propose and evaluate a model-free learning-based approach for load test generation, which does not require access to the system models or source code.
Method: In this thesis, we treated the problem of optimal load test generation as a reinforcement learning (RL) problem. We proposed two RL-based approaches using q-learning and deep q-network for load test generation. In addition, we demonstrated the applicability of our tester agents on a real-world software system. Finally, we conducted an experiment to compare the efficiency of our proposed approaches to a random load test generation approach and a baseline approach.
Results: Results from the experiment show that the RL-based approaches learned to generate effective workloads with smaller sizes and in fewer steps. The proposed approaches led to higher efficiency than the random and baseline approaches.
Conclusion: Based on our findings, we conclude that RL-based agents can be used for load test generation, and they act more efficiently than the random and baseline approaches.
Table of Contents
1. Introduction
2. Background
3. Related Work
4. Problem Formulation
5. Methodology
6. Approach
7. Evaluation
8. Results
9. Discussion
1. Introduction
The industry is continuously finding ways to make software services accessible to more and more customers. One way to reach such customers (distributed over the globe) is the use of Enterprise Applications (EAs) delivering services over the internet. Inefficient and time-wasting software applications lead to customer dissatisfaction and financial losses [1, 2]. Performance problems are costly and waste resources. Furthermore, the use of internet services and web-based applications has become extremely widespread among people and industry. The significant role of internet services in people's daily lives and in industry is undeniable. Users around the world depend on internet services more than ever. Consequently, software success depends not only on the correct functioning of the software system but also on how well those functions are performed (non-functional properties). Responsiveness and efficiency are primary requirements for any web application due to the high expectations of users. For example, Google reported that a 0.5 second increase in the delay of generating the search page resulted in a 20% decrease in user traffic [1]. Amazon also reported that a 100 millisecond delay in a web page cost 1% loss in sales [2]. Accordingly, performance is a key success factor of software products; it is of paramount importance to the industry and a critical subject for user satisfaction. Tools allow companies to test software performance in the development and design phases, or even after the deployment phase.

Performance describes how well the system accomplishes its functionality. Typically, the performance metrics of a system are response time, error rate, throughput, utilization, bandwidth, and data transmission time. Finding and resolving performance bottlenecks of a system is an important challenge during the development and maintenance of software [3]. The issues reported after project release are often performance degradation rather than system failures or incorrect responses [4]. Two common approaches to performance analysis are performance modeling and performance testing. Performance models can be analyzed mathematically, or they can be simulated in the case of complex models [5]. Measuring and evaluating performance metrics of the software system by executing the software under various conditions, simulating concurrent multi-users with tools, is the core of performance testing. One type of performance testing is load testing. Load testing evaluates the system's performance (e.g., response time, error rate, resource utilization) by applying extreme loads on the system [6]. Load testing approaches usually generate workloads in multiple steps, increasing the workload in each step until a performance fault occurs in the system under test. The performance faults are triggered by a higher error rate or response time than expected by the performance requirements [6]. Different approaches have been proposed for generating the test workload. Over the years, many approaches have focused on testing for performance using system models or source code [6, 7]. These approaches require expertise in performance modeling, and the source code of the system is not always available. Various machine learning methods are also used in performance testing [8, 9]. However, these approaches require a significant amount of data for training. On the other hand, model-free Reinforcement Learning (RL) [10] is one of the machine learning techniques that does not require any training data set.
Unlike other machine learning approaches, RL can be used in load testing to generate effective workloads without any training data set.

As mentioned before, in software systems, performance bottlenecks can cause violations of performance requirements [11, 12]. Performance bottlenecks in a system change over time due to changes in its source code. Load testing is a kind of performance testing in which the aim is to find the breaking points (performance bottlenecks) of the system by generating and applying workloads on the system. Manual approaches for test workload generation consume human resources; they are dependent on many uncontrolled manual factors and are highly prone to error. A possible solution to this problem is automated approaches for load testing. Existing automated approaches are dependent on the system model and may not be applicable when there is no access to the model or source code. There is a need for a model-free approach for load testing, which is independent of source code and system model and requires no training data.

Effective, in terms of causing the violation of performance requirements (error rate and response time thresholds).
Contributions.
In this thesis, our purpose is to generate efficient workload-based test conditions for a system under test, without access to source code or system models, based on using an intelligent RL load tester agent. Intelligent here means that the load tester tries to learn how to generate an efficient workload. The contributions of this thesis are as follows:
1. A proposed model-free RL approach for load testing.
2. An evaluation of the applicability of the proposed approach on a real case.
3. An experiment for evaluating the two RL-based methods used in the approach, i.e., q-learning and Deep Q-Network (DQN), against a baseline and a random approach for load test generation.

Method.
In our proposed model-free RL approach, the intelligent agent can learn the optimal policy for generating test workloads that meet the intended objective of the performance analysis. The learned policy can also be reused in further stages of the testing. In this approach, the workload is selected in an intelligent way in each step instead of just increasing the workload size. We explain our mapping of the real-world problem of load test generation into an RL problem. We also present the RL methods that we use in our approach, i.e., q-learning and deep q-network (DQN). Then we present our approach with two variations of RL methods in detail. To evaluate the applicability of our proposed approach, we implement our RL-based approaches using open-source libraries in Java. We use JMeter to generate our desired workload and apply the workload on an e-commerce website deployed on a local server.

In addition, we conduct an experiment to evaluate the efficiency of the RL-based approaches. We execute the RL-based approaches, a baseline approach, and a random approach separately for comparison. We then compare the results of all approaches based on efficiency (i.e., the final workload size that violates the performance requirements and the number of workload increment steps for generating the workload).
Results
The experiment results show that, in comparison to the other approaches, the baseline approach generates workloads with bigger sizes. Thus the baseline approach is not as efficient as the other approaches. The random approach performs better than the baseline approach, since the average workload size generated by the random approach is lower than that of the baseline approach. However, the proposed RL-based approaches perform better than the random and baseline approaches. The results show that in both the q-learning and DQN approaches, the effective workload size and the number of steps taken for generating the workload in each episode converge to lower values over time. The q-learning approach converges faster than DQN. However, the DQN approach converges to lower values for the workload sizes. Our conclusion from the results is that both of the proposed RL approaches learn an optimal policy to generate optimal workloads efficiently.
Structure:
The remainder of this thesis is structured as follows. In Section 2., we describe the basic knowledge and terms in performance testing and reinforcement learning. In Section 3., we introduce different approaches for load testing. In Section 4., we describe the motivation and problem, the research goal, and the research questions. In Section 5., we present the scientific method we use in this thesis and the tools we used. In Section 6., we provide our approach for generating load tests and explain our RL-based load testers in detail. In Section 7., we provide an overview of the SUT setup, the process of applying workload using JMeter, and the implementation of our load tester. In Section 8., we describe the outcome of executing the implemented load testers on the SUT. We also explain the experiment procedure in this section. In Section 9., we present an interpretation of the results. Finally, in Section 10., we summarize the thesis report and present conclusions and future directions.

Efficient, in terms of optimal workload (workload size and number of steps for generating the workload).
2. Background
In this section, we provide basic knowledge, terms, and notations in performance testing and reinforcement learning. The terms explained here will be used for describing the problem, approach, and solution in the following sections.
In this subsection, we discuss the terms related to performance and performance testing.
Non-functional Quality Attributes of Software
Non-functional properties of a software system define the physiognomy of the system. These non-functional properties are often achieved by realizing some constraints over the functional requirements. Performance, security, availability, usability, interoperability, etc., are often classified as run-time non-functional requirements, while modifiability, portability, reusability, integrability, testability, etc., are considered non-runtime non-functional requirements. The run-time non-functional requirements can be verified by performance modeling in the development phase or by performance testing in execution.
Performance
Performance is of paramount importance in connected systems and is a key success factor of software products. For example, for EAs [1] such as e-commerce systems providing services to customers over the globe, success is subject to performance. Performance describes how well the system accomplishes its functionality. Efficiency is another term that is used in place of performance in some classifications of quality attributes [13, 14, 15]. Some performance metrics or performance indicators are:
• Response Time: The time between sending a request and beginning to receive the response.
• Error Rate: The proportion of erroneous units of transmitted data.
• Throughput: The number of processes that a system can handle per second.
• Utilization of computer resources: e.g., processor usage and memory usage.
• Bandwidth: The maximum rate of data transferred in a given amount of time.
• Data Transmission Time: The amount of time that it takes for the transmitting node to put all the data on the wire.
Performance is one of the important factors that should also be taken into consideration in the design, development, and configuration phases of a system [5].
The performance of a system can be evaluated through measurements, either manually in a user environment or under controlled benchmark conditions [5]. Two conventional approaches to performance analysis are performance modeling and performance testing.
Performance Modeling
It is not always feasible to measure the performance of a system or component, for example, in the design and development phase. In this case, the performance can be predicted based on models. Performance modeling is used during design and development, and for configuration tuning and capacity planning. Besides quantitative predictions, performance modeling gives us insight into the structure and behavior of the system during system design. To acquire performance measures, performance models can be analyzed mathematically, or they can be simulated in the case of complex models [5]. Some of the well-known modeling notations are queuing networks, Markov processes, and Petri nets, which are used together with analysis techniques to address performance modeling [16, 17, 18].
Performance Testing
The IEEE standard definition of performance testing is: "Testing conducted to evaluate the compliance of a system or component with specified performance requirements" [19]. Measuring and evaluating the response time, error rate, throughput, and other performance metrics of the software system by executing the software under various conditions, simulating concurrent multi-users with tools, is the core of performance testing. Performance testing can be performed on the whole system or on some parts of the system. Performance testing can also validate the efficiency of the system architecture, the system configurations, and the algorithms used by the software [20]. Some types of performance testing are load testing, stress testing, endurance testing, spike testing, volume testing, and scalability testing.
Performance Bottlenecks
Performance bottlenecks result in violating performance requirements [11, 12]. A performance bottleneck is any system, component, or resource that restricts performance and prevents the whole system from operating properly as required [21]. The sources of performance anomalies and bottlenecks are [11]:
• Application Issues: Issues at the application level, like incorrect tuning, buggy code, software updates, and incorrect application configuration.
• Workload: Application loads can result in congested queues and resource and performance issues.
• Architectures and Platforms: For example, the behavior and effects of the garbage collector, the location of the memory and the processor, etc., can affect the system's performance.
• System Faults: Faults in system resources and components, such as software bugs, operator errors, hardware faults, environmental issues, and security violations.
Load Testing
The load is the rate of different requests that are submitted to a system [22]. Load testing is the process of applying load on software to observe the software behavior and detect issues caused by the load [20]. Load testing is applied by simulating multiple users accessing the software at the same time.
Regression testing
Testing the software after new changes in the software is called regression testing. The aim of regression testing is to ensure that the previous functionality of the software has not been violated and that it still meets the functional and non-functional requirements.
Performance Testing Tools
There are a variety of performance testing tools for measuring web application performance and load stress capacity. Some of these tools are open-source, and some have free trials. Some of the most popular performance testing tools are Apache JMeter, LoadNinja, WebLOAD, LoadUI, LoadView, NeoLoad, LoadRunner, etc.
Machine Learning

Nowadays, machine learning plays an important role in software engineering and is widely used in computer technology. Some well-known applications of machine learning algorithms are:
• Speech recognition: Transforming speech to text.
• Autonomous vehicles: For example, Google self-driving cars.
• Image recognition: Detecting an object in a digital image.
• Sentiment analysis: Determining the attitude or opinion of the speaker or the writer.
• Prediction: For example, traffic prediction and weather prediction.
• Information extraction: Extracting information from unstructured data.
• Medical diagnoses: Medical diagnoses based on clinical parameters.
Machine learning algorithms are a set of methods in which the computer program learns to improve a task with respect to a performance measure based on experience. Machine learning uses techniques and ideas from artificial intelligence, probability and statistics, computational complexity theory, control theory, information theory, philosophy, psychology, neurobiology, and other fields [23]. Three major categories of learning problems are:
• Supervised Learning
• Unsupervised Learning
• Reinforcement Learning
Supervised Learning
In supervised learning, the training data set provides an output variable corresponding to each input variable. Supervised learning predicts the classification of unlabeled data in the test data set based on the labeled data in the training data set. Regression and classification are two types of supervised learning. The target is to minimize the difference between the expected output and the actual output of the learning system (illustrated in Figure 1).

Figure 1: Supervised Learning
Unsupervised Learning
In unsupervised learning, unlike supervised learning, the training data set does not contain the output value for each input set, i.e., the training data set is not labeled. Unsupervised learning algorithms take unlabelled data as input and cluster the data into groups based on their attributes.
Reinforcement Learning

In reinforcement learning, the agent tries to learn the best policy through experimenting and trial-and-error interaction with the environment. Reinforcement learning is goal-directed learning in which the goal of the agent is to maximize the reward [23]. In reinforcement learning problems, there is no training data set. Instead, the agent itself explores the environment to collect data and updates its policy to maximize its expected cumulative reward over time (illustrated in Figure 2 [10]). "Trial-and-error search and delayed reward are the two most important distinguishing features of reinforcement learning" [10]. The agent is the learner and decision-maker, and everything outside the agent is the environment. The state is the current situation that is returned by the environment. Each action results in a new state and gives a reward corresponding to the state (or state-action pair). In a reinforcement learning problem, the reward function specifies the goal of the problem [10]. It is not specified for the agent which action to take in each state; instead, the agent should discover which action leads to the most reward by trying them.
In the following, we discuss the main concepts in reinforcement learning:
Agent and Environment
Everything except the agent is the environment: everything that the agent can interact with, directly or indirectly. When the agent performs actions, the environment changes. This change is called the state-transition. As shown in Figure 2 [10], at each step $t$, the agent executes action $A_t$, receives a representation of the environment's state $S_t$ based on the observations from the environment, and receives a reward $R_t$.

Figure 2: Reinforcement Learning

State
The state contains the information used to determine what happens next. The history is a sequence of states, actions, and rewards:
$$H_t = S_0, A_0, R_1, \ldots, S_t, A_t, R_{t+1} \quad (1)$$
The agent state is a function of the history:
$$S_t = f(H_t) \quad (2)$$

Action
Actions are the agent's decisions, which lead to a next state and provide a reward from the environment. Actions affect the immediate reward and can also affect the next state of the agent and, consequently, the future rewards (delayed reward). So actions may have long-term consequences. The policy determines which action should be taken in each step.
Reward and Return
A reward $R_t$ is a scalar feedback signal that shows how well the agent is operating at step $t$. The learning agent tries to reach the goal of maximizing the cumulative reward in the future. The reward may be delayed, and it may be better to sacrifice immediate reward to gain more long-term reward. Reinforcement learning is based on the reward hypothesis: "All goals can be described by the maximization of expected cumulative reward". $R_s^a$, as shown in equation 3 [10], is the expected value of the reward of taking action $a$ from state $s$:
$$R_s^a = \mathbb{E}[R_{t+1} \mid S_t = s, A_t = a] \quad (3)$$
The return $G_t$ in equation 4 [10] is the total discounted reward from time-step $t$:
$$G_t = R_{t+1} + \gamma R_{t+2} + \ldots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \quad (4)$$
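To make the notion of discounted return concrete, the following worked example (the reward values are illustrative and not from the thesis) evaluates equation 4 for a short reward sequence with a discount factor of $\gamma = 0.9$, assuming $R_{t+1} = 1$, $R_{t+2} = 0$, $R_{t+3} = 2$, and all later rewards equal to zero:
$$
\begin{aligned}
G_t &= R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} \\
    &= 1 + 0.9 \cdot 0 + 0.9^2 \cdot 2 \\
    &= 1 + 0 + 1.62 = 2.62
\end{aligned}
$$
A reward arriving two steps later thus counts for only $0.81$ of its nominal value, which is exactly what the discount factor controls.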
Discount Factor

The discount factor $\gamma$ is a value in the interval (0, 1]. A reward that occurs $k+1$ steps in the future is multiplied by $\gamma^k$, which means the value of receiving reward $R$ after $k+1$ time-steps is decreased to $\gamma^k R$. The discount factor indicates how much we value future rewards. The more we trust our model, the closer the discount factor is to 1; if we are not certain about our model, the discount factor is closer to 0.

Markov Decision Process (MDP)
An MDP is an environment represented by a tuple $\langle S, A, P, R, \gamma \rangle$, where $S$ is a countable set of states, $A$ is a countable set of actions, $P$ is the state-transition probability function in equation 5 [10], $R$ is the reward function in equation 3, and $\gamma$ is the discount factor [10]. The state-transition probability $P_{ss'}^a$ is the probability of going to state $s'$ by taking the action $a$ from state $s$. Almost all reinforcement learning problems can be formalised as MDPs.
$$P_{ss'}^a = \mathbb{P}[S_{t+1} = s' \mid S_t = s, A_t = a] \quad (5)$$
In an MDP, the state is fully observable, i.e., the current state completely characterises the process. A state $S_t$ (the current state at time $t$) is Markov if and only if it follows the rule in equation 6 [10],
$$\mathbb{P}[S_{t+1} \mid S_t] = \mathbb{P}[S_{t+1} \mid S_0, \ldots, S_t] \quad (6)$$
meaning that the future state is only dependent on the present and is independent of the past. A Markov state contains every relevant piece of information from the history. So when the state is specified, the history may be thrown away.

Partially Observable Markov Decision Process (POMDP)
In a POMDP, the agent is not able to directly observe the environment, meaning the environment is partially observable to the agent. So unlike in an MDP, the agent state is not equal to the environment state. In this case, the agent must construct its own state representation.
Policy
The policy $\pi$ is the agent's behavior function. It is a function from a state to an action. A deterministic policy specifies which action should be taken in each state; it takes a state as input and its output is an action:
$$a = \pi(s) \quad (7)$$
A stochastic policy (equation 8 [10]) determines the probability of the agent taking a specific action in a specific state:
$$\pi(a \mid s) = \mathbb{P}[A_t = a \mid S_t = s] \quad (8)$$

Value function
The value function is a prediction of future reward that is used to evaluate how good a state is. The value function $v_\pi(s)$ of a state $s$ under policy $\pi$ is the expected return when following policy $\pi$ starting from state $s$. The value function for MDPs is shown in equation 9 [10]:
$$v_\pi(s) = \mathbb{E}[G_t \mid S_t = s] = \mathbb{E}\left[\sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \,\Big|\, S_t = s\right] \quad (9)$$
$v_\pi(s)$ is called the state-value function for policy $\pi$. If terminal states exist in the environment, their value is zero.
The value of taking action $a$ in state $s$ under policy $\pi$ is $q_\pi(s, a)$, which is called the action-value function for policy $\pi$, or the q-function, shown in equation 10 [10]:
$$q_\pi(s, a) = \mathbb{E}[G_t \mid S_t = s, A_t = a] = \mathbb{E}\left[\sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \,\Big|\, S_t = s, A_t = a\right] \quad (10)$$
$q_\pi(s, a)$ is the expected return starting from state $s$, taking the action $a$, and selecting future actions based on policy $\pi$.

Bellman Equation
The Bellman equation explains the relation between the value of a state or state-action pair and its successors. The Bellman equation for $v_\pi$ is shown in equation 11 [10]:
$$
\begin{aligned}
v_\pi(s) &= \mathbb{E}_\pi[G_t \mid S_t = s] \\
&= \mathbb{E}_\pi[R_{t+1} + \gamma G_{t+1} \mid S_t = s] \\
&= \sum_a \pi(a \mid s) \sum_{s'} \sum_r p(s', r \mid s, a)\big[r + \gamma\, \mathbb{E}_\pi[G_{t+1} \mid S_{t+1} = s']\big] \\
&= \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\big[r + \gamma v_\pi(s')\big]
\end{aligned} \quad (11)
$$
where $p(s', r \mid s, a)$ is the probability of going to state $s'$ and receiving the reward $r$ by taking the action $a$ from state $s$. Figure 3 [10] helps explain the equation. Based on this equation, the value of a state is the average of its successor states' values plus the reward of reaching them, weighting each state value by the probability of its occurrence. This recursive relation between states is a fundamental property of the value function in reinforcement learning.

Figure 3: Backup diagram for $v_\pi$

The Bellman equation for the q-value (action value) $q_\pi(s, a)$ is shown in equation 12 [10]:
$$q_\pi(s, a) = \sum_{s', r} p(s', r \mid s, a)\Big[r + \gamma \sum_{a'} \pi(a' \mid s') q_\pi(s', a')\Big] \quad (12)$$
This equation is clarified in Figure 4 [10].

Figure 4: Backup diagram for $q_\pi$

Episode
A sequence of states starting from an initial state and finishing in a terminal state is called an episode. Different episodes are independent of each other. Figure 5 gives an overview of an episode.

Figure 5: Episode
Episodic and Continuous tasks
There are two kinds of tasks in reinforcement learning: episodic and continuous. Unlike continuous tasks, in episodic tasks the interaction of the agent with the environment is broken down into separate episodes.
Policy Iteration
Policy iteration is the process of achieving the goal of the agent, which is finding the optimal policy $\pi_*$. Policy iteration consists of two parts, policy evaluation and policy improvement, which are executed iteratively. Policy evaluation is the iterative computation of the value functions for a given policy while the agent interacts with the environment, and policy improvement is enhancing the policy by choosing actions greedily with respect to the recently updated value function:
$$\pi_0 \xrightarrow{E} v_{\pi_0} \xrightarrow{I} \pi_1 \xrightarrow{E} v_{\pi_1} \xrightarrow{I} \ldots \xrightarrow{I} \pi_* \xrightarrow{E} v_{\pi_*} \quad (13)$$
Value iteration finds the optimal value function iteratively. When the value function is optimal, then the policy derived from it is also optimal. Unlike policy iteration, there is no explicit policy in value iteration, and the actions are chosen directly based on the optimal (converged) value function. Finding the optimal value function is a combination of policy improvement and truncated policy evaluation.
Exploration and Exploitation
The reinforcement learning agent should choose the actions that it has tried before and that have the highest return; this is exploitation. On the other hand, the agent should try new actions that it has not selected before in order to find these best actions; this is exploration. There is a trade-off between exploration and exploitation in the learning process, and making a balance between them is one of the challenges in reinforcement learning problems.

ε-Greedy Policy

An ε-greedy policy allows performing both exploration and exploitation during the learning. A number ε in the range [0, 1] is chosen. In each step, the probability of selecting the best action (the best action based on the main policy, which is extracted from the q-table) is 1 - ε, and a random action is selected with probability ε.
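As an illustration, the following minimal Java sketch (not taken from the thesis implementation; the q-table layout and action indexing are assumptions) shows how ε-greedy action selection over a tabular q-function could look.

```java
import java.util.Random;

public class EpsilonGreedy {
    private final Random random = new Random();

    /**
     * Selects an action index for the given state row of a q-table.
     * With probability epsilon a random action is explored,
     * otherwise the action with the highest q-value is exploited.
     */
    public int selectAction(double[][] qTable, int state, double epsilon) {
        int numActions = qTable[state].length;
        if (random.nextDouble() < epsilon) {
            return random.nextInt(numActions);      // explore
        }
        int best = 0;                               // exploit: argmax over q-values
        for (int a = 1; a < numActions; a++) {
            if (qTable[state][a] > qTable[state][best]) {
                best = a;
            }
        }
        return best;
    }
}
```

In practice, ε is often decayed over episodes so that the agent explores heavily at the beginning of learning and exploits more as its value estimates become reliable.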
Monte Carlo

Monte Carlo methods are a class of algorithms that repeat random sampling to achieve a result. The Monte Carlo method is used in reinforcement learning to estimate value functions and find the optimal policy by averaging the returns from sample episodes. In this method, each episodic task is considered an experience, which is a sample sequence of states, actions, and rewards. By using this method, we only need a model that generates sample transitions; there is no need for the model to have complete probability distributions of all possible transitions and rewards. A simple Monte Carlo update rule is shown in equation 14 [10]:
$$V(S_t) \leftarrow V(S_t) + \alpha\big[G_t - V(S_t)\big] \quad (14)$$
where $G_t$ is the return starting from time $t$ and $\alpha$ is the step size (learning rate).

Temporal-Difference (TD) Learning
Temporal-difference learning is another learning method in reinforcement learning. The TD method is an alternative to the Monte Carlo method for updating the estimate of the value function. The update rule for the value function is shown in equation 15 [10]:
$$V(S_t) \leftarrow V(S_t) + \alpha\big[R_{t+1} + \gamma V(S_{t+1}) - V(S_t)\big] \quad (15)$$
Unlike Monte Carlo, TD learns from incomplete episodes. TD can learn after each step and does not need to wait for the end of the episode. Algorithm 1 describes TD(0) [10]:
Algorithm 1:
TD(0) for estimating $v_\pi$

Input: the policy $\pi$ to be evaluated
Algorithm parameter: step size $\alpha \in (0, 1]$
Initialize $V(s)$, for all $s$ in the state space, arbitrarily, except that V(terminal) = 0
for each episode do
    Initialize $S$
    for each step of the episode, until $S$ is terminal, do
        $A \leftarrow$ action given by $\pi$ for $S$
        Take action $A$, observe $R$, $S'$
        $V(S) \leftarrow V(S) + \alpha[R + \gamma V(S') - V(S)]$
        $S \leftarrow S'$
    end
end

Experience Replay

In a reinforcement learning algorithm, the RL agent interacts with the environment and updates the policy, value functions, or model parameters iteratively based on the observed experience in each step. The data collected from the environment would be used once for updating the parameters, but it would be discarded in the following steps. This approach is wasteful because some experiences may be rare but useful in the future. Lin et al. [24] introduced experience replay as a solution to this problem. An experience (state-transition) in their definition [24] is a tuple $(x, a, y, r)$, which means taking action $a$ from state $x$, going to state $y$, and getting the reward $r$. In the experience replay method, a buffered window of N experiences is saved in memory, and the parameters are updated with a batch of transitions from the experience replay buffer, which are chosen based on different approaches, e.g., randomly [25] or by prioritizing experiences [26]. Experience replay allows the agent to reuse past experiences in an effective way and to use them in more than one update, as if the agent experiences what it has experienced before again and again. Experience replay speeds up the learning of the agent, which leads to quicker convergence of the network. In addition, faster learning leads to less damage to the agent (the damage is when the agent takes actions based on bad experiences and therefore experiences a bad experience again, and so on). Experience replay consumes more computing power and more memory, but it reduces the number of experiments needed for learning and the interaction of the agent with the environment, which is more expensive. Schaul et al. [26] explain that many stochastic gradient-based algorithms have an i.i.d. assumption, which is violated by the strongly correlated updates in an RL algorithm; experience replay breaks this temporal correlation by applying recent and former experiences in each update. Using experience replay has been effective in practice; for example, Mnih et al. [25] applied experience replay in the DQN algorithm to stabilize the training of the value function. Google DeepMind also significantly improved performance on the "Atari" games by using experience replay with DQN.

Q-learning

Q-learning is one of the basic reinforcement learning algorithms. Q-learning is an off-policy TD control algorithm. Methods in this family learn an approximation of the optimal action-value function $Q_*$. In this algorithm, the q-values of every possible state-action pair are stored in a table named the q-table. The q-table is updated based on the Bellman equation 16 [10]:
$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha\big[R_{t+1} + \gamma \max_a Q(S_{t+1}, a) - Q(S_t, A_t)\big] \quad (16)$$
The action is usually selected by an ε-greedy policy, but the q-value is updated independently of the policy being followed (off-policy), based on the next action that has the maximum q-value. The q-learning algorithm is shown in Algorithm 4.
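To make the tabular update in equation 16 concrete, here is a small Java sketch (illustrative only; the state/action encoding and hyperparameter values are assumptions, not the thesis implementation) of a single q-learning step.

```java
public class QLearningUpdate {
    private final double alpha;  // learning rate
    private final double gamma;  // discount factor

    public QLearningUpdate(double alpha, double gamma) {
        this.alpha = alpha;
        this.gamma = gamma;
    }

    /**
     * Applies the q-learning update of equation 16 to the q-table entry
     * for (state, action), given the observed reward and next state.
     */
    public void update(double[][] qTable, int state, int action,
                       double reward, int nextState) {
        double maxNext = Double.NEGATIVE_INFINITY;  // max_a Q(S_{t+1}, a)
        for (double q : qTable[nextState]) {
            maxNext = Math.max(maxNext, q);
        }
        double tdTarget = reward + gamma * maxNext;
        double tdError = tdTarget - qTable[state][action];
        qTable[state][action] += alpha * tdError;
    }
}
```

The update is off-policy because the target uses the maximizing action in the next state, regardless of which action the behavior policy (e.g., ε-greedy) actually selects there.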
Deep Reinforcement Learning

Deep reinforcement learning refers to the combination of RL with deep learning. Deep RL uses nonlinear function approximation methods, like artificial neural networks (ANNs) trained using SGD [10].
Value Function Approximation
Function approximation is used in RL because, in large environments, there are too many states and actions to be stored in memory, and it is also too slow to learn the value of each state or state-action pair individually. So the idea is to generalize from the visited states to the states which have not been visited yet. Hence the value function is estimated with function approximation:
$$\hat{v}(s, \mathbf{w}) \approx v_\pi(s)$$
$$\hat{q}(s, a, \mathbf{w}) \approx q_\pi(s, a)$$
where $\mathbf{w}$ is the weight vector; for example, $\mathbf{w}$ contains the feature weights in a linear q-function approximator, which returns the estimated value of each state by multiplying the weights with the state's feature vector. The dimensionality of $\mathbf{w}$ is much less than the number of states, and changing the weight vector changes the estimated value of many states. Therefore, when $\mathbf{w}$ is updated after each action from a single state, not only the value of that specific state is updated; many states' values are updated too. This generalization makes learning faster and more powerful. Moreover, using function approximation makes reinforcement learning applicable to problems with partially observable environments.
There are many function approximators, e.g., linear functions of features, artificial neural networks, decision trees, nearest neighbor, Fourier/wavelet bases, etc. For value function approximation, differentiable function approximators are used, e.g., linear functions and neural networks.

Figure 6: Types of value function approximation
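The following short Java sketch illustrates the linear case described above (the feature and weight representations are assumptions for illustration, not code from the thesis): the approximate q-value is the dot product of a weight vector with a state-action feature vector, and a gradient step moves the weights toward a given target (for example, a TD target).

```java
public class LinearQApproximator {
    private final double[] weights;      // one weight per feature
    private final double learningRate;

    public LinearQApproximator(int numFeatures, double learningRate) {
        this.weights = new double[numFeatures];
        this.learningRate = learningRate;
    }

    /** q_hat(s, a, w) = w . x(s, a) for a given state-action feature vector. */
    public double value(double[] features) {
        double sum = 0.0;
        for (int i = 0; i < weights.length; i++) {
            sum += weights[i] * features[i];
        }
        return sum;
    }

    /** One gradient step toward the target; updates all weights at once. */
    public void update(double[] features, double target) {
        double error = target - value(features);
        for (int i = 0; i < weights.length; i++) {
            weights[i] += learningRate * error * features[i];
        }
    }
}
```

Because a single update changes the shared weights, the estimated values of many states shift at once, which is exactly the generalization effect the text describes.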
Stochastic Gradient Descent

Stochastic Gradient Descent, or SGD, is an optimization algorithm. This algorithm is used in machine learning, e.g., for training the artificial neural networks used in deep learning. In this method, the goal is to find model parameters that optimize an objective function by updating the model iteratively over multiple discrete steps. Optimizing an objective function means minimizing a loss function or maximizing a reward function (fitness function). In each step, the model makes some predictions based on samples in the training data set and the set of current internal parameters; then the predictions are compared to the real expected outcomes in the data set by calculating a performance measure like the mean squared error. Then the gradient of the error is calculated and used to update the internal model parameters to decrease the error. Sample size, batch size, and epoch count are some hyperparameters in SGD [27]:
• Sample: A training data set contains many samples. A sample can be referred to as an instance, observation, input vector, or feature vector. A sample is a set of inputs and an output. The inputs are fed into the algorithm, and the output is compared to the prediction by calculating the error.
• Batch: The model's internal parameters are updated after applying a batch of samples to the model. At the end of applying each batch of samples to the model, the error is computed. The batch size can be equal to the training data set size (Batch Gradient Descent), it can be equal to 1, meaning each batch is a single sample from the data set (Stochastic Gradient Descent), or it can be between 1 and the training set size (Mini-Batch Gradient Descent). 32, 64, and 128 are popular batch sizes in mini-batch gradient descent.
• Epoch: The whole training data set is fed to the model once in each epoch. In every epoch, each sample updates the internal model parameters one time. So in an SGD algorithm, there are two for-loops; the outer loop is over the number of epochs, and the inner loop iterates over the batches in each epoch (see the sketch after this list).
There is no specific rule for configuring these parameters. The best configuration differs for each problem and is obtained by testing different values.
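A minimal sketch of the epoch/batch loop structure described above, assuming a hypothetical Model interface whose implementation performs one gradient step per batch; it only illustrates how epochs, batches, and per-batch updates relate.

```java
import java.util.List;

public class MiniBatchSgdLoop {

    /** Hypothetical model interface: applies one gradient step computed from a batch. */
    public interface Model {
        void updateFromBatch(List<double[]> inputs, List<Double> targets);
    }

    /**
     * Outer loop over epochs, inner loop over mini-batches.
     * Each epoch passes over the whole training set once.
     */
    public static void train(Model model, List<double[]> inputs, List<Double> targets,
                             int epochs, int batchSize) {
        int n = inputs.size();
        for (int epoch = 0; epoch < epochs; epoch++) {
            for (int start = 0; start < n; start += batchSize) {
                int end = Math.min(start + batchSize, n);
                model.updateFromBatch(inputs.subList(start, end),
                                      targets.subList(start, end));
            }
        }
    }
}
```

Choosing batchSize = 1 gives stochastic gradient descent, batchSize = n gives batch gradient descent, and anything in between gives mini-batch gradient descent, matching the three variants listed above.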
Deep Q-Network (DQN)
Deep Q-Network is a more complex version of q-learning. In this version, instead of using a q-table to store and access q-values, the q-values are approximated using an ANN (Figure 7).

Figure 7: Deep Q-Network
Double Q-learning
Simple q-learning has a positive bias in estimating the q-values; it can overestimate q-values. Double q-learning is an extension of q-learning that overcomes this problem. It uses two q-functions, and in each update, one of the q-functions is updated based on the next state's q-value from the other q-function [28]. The double q-learning algorithm is shown in Algorithm 2 [28]:
Algorithm 2:
Double Q-learning
Initialize $Q^A$, $Q^B$, $s$
repeat
    Choose $a$, based on $Q^A(s, \cdot)$ and $Q^B(s, \cdot)$, observe $r$, $s'$
    Choose (e.g. randomly) either UPDATE(A) or UPDATE(B)
    if UPDATE(A) then
        Define $a^* = \arg\max_a Q^A(s', a)$
        $Q^A(s, a) \leftarrow Q^A(s, a) + \alpha[r + \gamma Q^B(s', a^*) - Q^A(s, a)]$
    else if UPDATE(B) then
        Define $b^* = \arg\max_a Q^B(s', a)$
        $Q^B(s, a) \leftarrow Q^B(s, a) + \alpha[r + \gamma Q^A(s', b^*) - Q^B(s, a)]$
    end
    $s \leftarrow s'$
until end

Double Deep Q-Networks (DDQN)
The idea of double q-learning can also be used in DQN [29]. There is an online network and a target network, and the online network is updated based on the q-value from the target network. The target network is frozen and is updated from the online network after every N steps. Another way is to smoothly average over the last N updates. N is the "target DQN update frequency".
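As an illustration of the online/target network interplay described above, the following Java sketch (assuming a hypothetical QNetwork interface; this is not the RL4J implementation used later in the thesis) computes a DDQN-style training target and periodically copies the online network into the frozen target network.

```java
public class DoubleDqnTarget {

    /** Hypothetical q-network interface: returns q-values for all actions of a state. */
    public interface QNetwork {
        double[] qValues(double[] state);
        void copyWeightsFrom(QNetwork other);
    }

    private final QNetwork online;
    private final QNetwork target;
    private final double gamma;
    private final int targetUpdateFrequency;  // N in the text above
    private int stepCounter = 0;

    public DoubleDqnTarget(QNetwork online, QNetwork target,
                           double gamma, int targetUpdateFrequency) {
        this.online = online;
        this.target = target;
        this.gamma = gamma;
        this.targetUpdateFrequency = targetUpdateFrequency;
    }

    /** DDQN target: action chosen by the online network, evaluated by the target network. */
    public double trainingTarget(double reward, double[] nextState, boolean terminal) {
        if (terminal) {
            return reward;
        }
        double[] onlineQ = online.qValues(nextState);
        int bestAction = 0;
        for (int a = 1; a < onlineQ.length; a++) {
            if (onlineQ[a] > onlineQ[bestAction]) {
                bestAction = a;
            }
        }
        return reward + gamma * target.qValues(nextState)[bestAction];
    }

    /** Every N steps the frozen target network is refreshed from the online network. */
    public void onStepCompleted() {
        stepCounter++;
        if (stepCounter % targetUpdateFrequency == 0) {
            target.copyWeightsFrom(online);
        }
    }
}
```

Separating action selection (online network) from action evaluation (target network) is what reduces the overestimation bias that plain DQN inherits from q-learning.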
3. Related Work
As mentioned before, in this study, we aim to detect certain workloads that cause performance issues in the software. To accomplish this objective, we use a reinforcement learning approach that applies workloads on the system and learns how to generate efficient workloads by measuring the performance metrics. Measuring performance metrics (e.g., response time, error rate, resource utilization) by applying various loads on the system under different execution conditions and different platform configurations is a common approach in performance testing [30, 31, 32]. Discovering performance problems, like performance degradation and violation of performance requirements, that appear under specific workloads or resource configurations is also a usual task in different types of performance testing [33, 6, 34].
Different methods have been introduced for load test generation, e.g., analyzing the system model, analyzing source code, modeling real usage, declarative methods, and machine learning-assisted methods. We provide a brief overview of these approaches in the following:
Analyzing system model
Zhang and Cheung [7] introduce an automatable method for stress test generation in terms of Petri nets. Gu and Ge [35] use genetic algorithms to generate performance test cases based on a usage pattern model of the system's workflow. Di Penta et al. [36] also generate test data with genetic algorithms using workflow models. Garousi [37] provides a genetic algorithm-based, UML-driven tool for stress test requirements generation. Garousi et al. [38] also introduce a UML model-driven stress test method for detecting network traffic anomalies in distributed real-time systems using genetic algorithms.
Analyzing source code
Zhang et al. [6] present a symbolic execution-based approach that uses the source code for generating load tests. Yang and Pollock [39] introduce a method for stress testing that limits the stress test to the parts of the modules that are more vulnerable to workloads. They use static analysis of the module's code to find these parts.
Modeling real usage
Draheim et al. [40] present an approach for load testing based on stochastic models of user behavior. Lutteroth and Weber [41] provide a stochastic, form-oriented load testing approach. Shams et al. [42] use an application model-based approach that is an extension of finite state machines and models the user's behaviour. Vögele et al. [43] use Markov chains for modeling user behaviour in workload generation. All the papers named here propose approaches for generating realistic workloads.
Declarative methods
Ferme and Pautasso [44] conduct performance tests using their model-driven framework, which is programmed via a declarative domain-specific language (DSL) they provide. Ferme and Pautasso [45] also use BenchFlow, a declarative performance testing framework, to provide a tool for performance testing. This tool uses a DSL for the test configuration. Schulz et al. [46] generate load tests using a declarative, behavior-driven approach where the load test specification is written in natural language.
Machine learning-assisted methods
Some approaches in the load testing context use machine learning techniques for analyzing the data collected from load testing. For example, Malik et al. [47] use and compare supervised and unsupervised approaches for analyzing load test data (resource utilization data) in order to detect performance deviations. Syer et al. [8] use a clustering method for detecting anomalies (threads with performance deviations) in the system based on the resource usage of the system. Koo et al. [9] provide an RL-based symbolic execution to detect worst-case execution paths in a program; note that symbolic execution is mostly used in more computational programs manipulating integers and booleans. Grechanik et al. [48] present a feedback-directed method for finding performance issues of a system by applying workloads on a SUT and analyzing the execution traces of the SUT to learn how to generate more efficient workloads.

Table 1: Overview of Related Work
Reference | Required Input | General Goal
[7, 35, 36, 37, 38] | System model | Generate performance test cases using Petri nets, usage pattern models, and UML models
[39, 6] | Source code | Finding performance requirement violations via static analysis and symbolic execution
[40, 41, 42, 43] | User behaviour model | User-behaviour simulation-based load testing
[44, 45, 46] | Instance model of a domain-specific language | Declarative methods for performance modeling and testing
[8, 47] | Training set | Machine learning-assisted methods for load test generation
[9, 48, 49] | System/program inputs | Finding worst-case performance issues using RL
This thesis | List of available transactions | Generate optimal workloads that violate the performance requirements, using RL
Ahmad et al. [49] try to find the performance bottlenecks of a system using an RL approach named PerfXRL, which uses a DDQN algorithm. This is one of the recently published approaches most similar to ours. In their approach, each test scenario is a sequence of three constant requests to a web application. These requests have four variables in total, and the research aim is to find combinations of these four variables that cause a performance violation. So the performance testing is done by executing test cases in which each test case is a sequence of three constant requests and, unlike our approach, no load testing is performed in that paper. They evaluate their approach by comparing the number of performance bottleneck request scenarios found by the PerfXRL approach with the number found by a random approach. This comparison is made for different sizes of input value spaces. They show that for input value spaces bigger than a certain size (150000), the PerfXRL approach identifies more performance bottlenecks than the random approach.
Unlike most of the mentioned approaches, our approach is model-free and does not require access to the source code or a system model for generating load tests. On the other hand, unlike many of the machine learning approaches, our proposed approach does not need previously collected data, and it learns to generate workloads while interacting with the system.
4. Problem Formulation
The objective of this thesis is to propose and evaluate a load testing solution that is able to generate an efficient test workload, which results in meeting the intended objective of the testing, e.g., finding a target performance breaking point, without access to the system model or source code.
With the increase of dependence on software in our daily lives, the correct functioning and efficiency of Enterprise Applications (EAs) delivering services over the internet are crucial to the industry. Software success not only depends on the correct functioning of the software system but also on how well these functions are performed, i.e., on non-functional properties like performance requirements. Performance bottlenecks can affect and harm performance requirements [11, 12]. Therefore, recognizing and repairing these bottlenecks is crucial.

The sources of performance anomalies and bottlenecks can be application issues (i.e., source code, software updates, incorrect application configuration), the workload, the system's architecture and platforms, and system faults in system resources and components (e.g., software bugs, environmental issues, and security violations) [11]. The source code changes during the continuous integration/delivery (CI/CD) process and software updates. The workload on the system is constantly changing; likewise, the environmental issues and security conditions do not remain the same during the software's life cycle. Therefore, the performance bottlenecks in the system will change over time, and it is not easy to follow model-driven approaches for performance analysis. To perform performance analysis that can consider all the mentioned causes of performance bottlenecks, we can use model-free performance testing approaches.

In addition, an important activity in performance testing is the generation of suitable load scenarios to find the breaking point of the software under test. Manual workload generation approaches are heavily dependent on the tester's experience and are highly prone to error. Such approaches for performance testing also consume substantial human resources and are dependent on many uncontrolled manual factors. The solution to this is using automated approaches. However, existing automated approaches for finding breaking points of the system rely heavily on the system's underlying performance model to generate load scenarios. In cases where the testers have no access to the underlying system models (describing the system), such approaches might not be applicable.

Another problem with existing automated approaches is that they do not reuse the data collected from previous load test generation for future similar cases, i.e., when the system should be tested again because of changes made in the system over time for maintenance, scalability, etc. There is a need for an automated, model-free approach for load scenario generation that can reuse learned policies and heuristics in similar cases.

Many model-free approaches for load generation just keep increasing the load until performance issues appear in the system. The workload size is one factor that affects the performance, but the structure of the workload is another important factor. Selecting a certain combination of loads in the workload can lead to a violation of performance requirements and the detection of performance anomalies with a smaller workload. A well-structured smaller workload can more accurately detect the performance breaking points of the system with fewer resources for simulating workloads. In addition, a well-structured smaller workload can result in increased coverage at the system level. Finding these specific workloads is difficult because it requires an understanding of the system's model [6].
Using model-free machine learning techniques, such as model-free reinforcement learning [10], could be a solution to the problems mentioned above. In this approach, an intelligent agent can learn the optimal policy for performance analysis and generate load test scenarios that violate the system's performance requirements. This method can be used independently of the system's and environment's state under different conditions, and it does not need access to the source code or system model. The learned policy can also be reused in further stages of the testing (e.g., regression testing).
We intend to formulate a new method for load test generation using reinforcement learning and evaluate it by comparing it with random and baseline methods. Our technical contribution in this thesis is the formulation and development of an RL-based agent that learns the optimal policy for load generation. We aim to evaluate the applicability and efficiency of our approach using an experiment research method.
The object of the study is an RL-based load test scenario generation approach. The purpose is proposing and evaluating an automated, RL-based load test scenario generation tool. The quality focus is the well-structured efficient test scenario, the final size of its workload, and the number of steps for generating the workload. The perspective is from the researcher's and tester's point of view. The experiment is run using an e-commerce website as the system under test. Based on the GQM template for goal definition, presented by Basili and Rombach [50], our goal in this study is:
Formulate and analyze an RL-based load test approach for the purpose of efficient load test generation with respect to the structure and size of the effective workload, and the number of steps to generate it, from the point of view of a tester/researcher, in the context of an e-commerce website as a system under test.
Based on our research goal, we define the following research questions:
RQ1: How can the load test generation problem be formulated as an RL problem?
To solve the problem of load generation with reinforcement learning, a mapping should be done from the real-world problem to an RL problem environment and its elements. The elements are the states, actions, and reward function (Figure 8). The aim of this research question is to find suitable definitions of the states, actions, and reward function for this problem.

Figure 8: Intelligent Load Runner

Efficient, in terms of optimal workload (workload size and number of steps for generating the workload). Effective, in terms of causing the violation of performance requirements (error rate and response time thresholds).
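Purely as an illustration of what such a mapping could look like (a hypothetical sketch, not the formulation developed in Section 6 of this thesis), the state could capture the current workload and the performance metrics it produced, actions could add load for particular transaction types, and the reward could grow as the measured error rate and response time approach the given thresholds.

```java
import java.util.Map;

// Hypothetical sketch of RL elements for load test generation; the names and
// the reward shaping are illustrative assumptions, not the thesis's mapping.
public class LoadTestRlElements {

    /** State: the current workload plus the performance metrics it produced. */
    public record State(Map<String, Integer> workloadPerTransaction,
                        double avgResponseTimeMs,
                        double errorRate) {}

    /** Action: increase the load of one transaction type by a step size. */
    public record Action(String transactionType, int additionalUsers) {}

    /**
     * Reward: higher when the observed metrics get closer to (or exceed)
     * the thresholds that define a performance requirement violation.
     */
    public static double reward(State s, double responseTimeThresholdMs,
                                double errorRateThreshold) {
        double rtRatio = s.avgResponseTimeMs() / responseTimeThresholdMs;
        double errRatio = s.errorRate() / errorRateThreshold;
        return Math.max(rtRatio, errRatio);  // >= 1 once a threshold is violated
    }
}
```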
RQ2: Is the proposed RL-based approach applicable for load generation? After formulating the problem into an RL context, it is essential to evaluate the applicability of the approach on a real-world SUT. Answering this research question requires implementing the approach and setting up a SUT on which the generated load scenarios can be executed (see Section 7.1.).
RQ3: What RL-based method is more efficient in the context of load generation?
Reinforcement learning can be applied using various algorithms like q-learning, SARSA (State-Action-Reward-State-Action), DQN, and Deep Deterministic Policy Gradient (DDPG). The aim of this research question is to choose at least two RL methods and find the most efficient (in terms of optimality) among them. In our case, we chose q-learning (a very basic RL algorithm) and DQN (an extended q-learning method). In addition, we also compare the results of the RL-based methods with a baseline and a random load generation method.

Derived from RQ1.
5. Methodology
A research method guides the research process in a step-by-step, iterative manner. We use well-established research methods to realize our research goals. The core of our research method is the research process illustrated in Figure 9. The research process we used (to guide our research method) is a modification of the four-step research framework proposed by Holz et al. [51]. In the rest of this section, we present our research process (in Section 5.1.), followed by a discussion of the research method used in Section 5.2.. Finally, we present the tools used for implementation in this thesis in Section 5.3..
In this subsection, we outline the research process that we follow throughout this thesis.

Figure 9: Research Method

Our research process started with forming a suitable research goal and research questions (as formulated in Section 4.2.). As discussed, the objective of our research is to propose and evaluate an automated, model-free solution for load scenario generation. The main objective and research goal were identified in collaboration with our industrial partner (RISE Research Institutes of Sweden AB) by reviewing their needs. We then identified specific challenges in the adoption of performance testing approaches with our industrial partner. We realized that existing approaches require knowledge of performance modeling and access to source code, which limits the adoption of such approaches. We conducted a state-of-the-art review (parts of which are presented in Section 2.) to identify the gaps in the literature. In the next step, we formulated an initial version of the problem, which produced our thesis proposal. In the next step of our research process, we formulated an initial RL-based solution that does not require any underlying model of the system and can reuse the learned policy in the future. This formulated solution helped in realizing our primary research goal. We then conducted an experiment to evaluate our solution on an e-commerce software system. Note that our research process was iterative and incremental.
We conducted an experiment to answer RQ3, following the guidelines presented by Wohlin et al. [52]. An experiment is a systematic, formal research method in which the effects of all involved variables can be investigated in a controlled way. Thus, we can investigate the effect of our treatments (the different load test generation methods) on the outcome (the size of the generated workload that hits the thresholds, i.e., violates the performance requirements, and the number of steps taken to generate this workload). Since our experiment's goal is to answer RQ3 (which requires quantitative data to answer), the experiment research method is helpful in obtaining quantitative data about the objectively measurable phenomenon. In our case, the nature of the experiment is quantitative, i.e., comparing our RL-based load test generation approaches with a baseline and a random approach. The comparison is made based on the size of the generated workload that hits the defined error rate and response time thresholds. In addition, the comparison is also made based on the number of workload increment steps required for each approach to generate a workload that hits the thresholds.
Experiment Design
The procedure of our experiment is explained in Section 7.3. Here we provide the standard definitions of the experiment terminology from the guidelines [52] and define them for our experiment:

• Independent variables: "all variables in an experiment that are controlled and manipulated." [52] In this experiment, the independent variables are the client machine generating the workload, the client machine configuration, the network, the SUT server machine, the SUT server configuration, and the parameters in Table 5.

• Dependent variables: "Those variables that we want to study to see the effect of the changes in the independent variables are called dependent variables." [52] The dependent variables in this experiment are:
  – the size of the final generated workload that hits the defined error rate and response time thresholds;
  – the number of workload increment steps required to generate a workload that hits the thresholds.

• Factors: one or more independent variables whose effect of change the experiment studies. The factor, in our case, is the load test generation method.

• Treatment: "one particular value of a factor." [52] The treatments in our experiment are a baseline method, a random method, a q-learning method, and a DQN method for our factor, the load test generation method.

• Subjects: the subject, in our case, is the client machine generating the workload. The properties of this machine are shown in Table 4.

• Objects: instances that are used during the study. The object in our case is the SUT. The SUT is an e-commerce website explained in Section 7.1.
Here we introduce the tools we used in our implementation and the reason for selecting them.
Apache JMeter
Apache JMeter is an open-source performance testing Java application. It can test performance on static and dynamic resources. Apache JMeter can simulate heavy loads on a server, a group of servers, a network, or an object to test and measure performance metrics of the system under different load types. Since it is written in Java, it allows us to use its libraries for executing our desired workloads in the implementation of our approach, which is also written in Java. Additionally, JMeter has a simple and user-friendly GUI, which helps us easily generate JMX files containing the basic configurations needed for the workloads generated and executed in our load tester.
WordPress and WooCommerce
WordPress is a free and popular open-source content management system. It is written in PHP and paired with a MySQL database. We set up a website on WordPress as the SUT in the evaluation phase of our load testing approach. WordPress is very flexible and can be extended by using different plugins. WooCommerce is an open-source e-commerce plugin for WordPress for creating and managing online stores. We use WooCommerce to turn the website into an e-commerce store.
XAMPP
XAMPP is one of the most common desktop servers. It is a lightweight Apache distribution for deploying local web servers for testing purposes. We create the WordPress website (SUT) using XAMPP.
RL4J
In order to avoid possible implementation errors when implementing the DQN in one of our proposed load testing approaches, we use the open-source library RL4J [53]. RL4J is a deep reinforcement learning library that is part of the Deeplearning4j project [54] and is released under an Apache 2.0 open-source license. Eclipse Deeplearning4j is a deep learning project written in Java and Scala. It is open source, integrated with Hadoop and Apache Spark, and can be used on distributed GPUs and CPUs. Deeplearning4j is compatible with all Java Virtual Machine languages, e.g., Scala, Clojure, or Kotlin. It includes deep neural network implementations with many parameters to be set by the user when training a network [54]. RL4J contains libraries for implementing DQN (deep q-learning with double DQN) and asynchronous RL (A3C, Async NStep Q-learning).
6. Approach
In this section, we propose our approach for intelligent load test generation using reinforcement learning methods. We answer RQ1 here and present the mapping of the real-world problem to an RL problem. We provide the details of our approach and the learning procedure for generating load tests. In Section 6.1, we provide the mapping of the optimal load generation problem to an RL problem and define the environment and the RL elements of the problem. Then, in Section 6.2, we present the RL methods that we use in our approach, which are q-learning and DQN, and the operating workflow for each method.
In this section, we map the load test scenario elements to reinforcement learning elements and define the environment.
Agent and Environment
As mentioned before, the goal of the agent is to attain the optimal policy, i.e., to find the most efficient workloads for testing the system's performance. For applying an RL-based approach to a problem, it is generally assumed that the environment is non-deterministic and stationary upon transitions between the states of the system. The environment here is a server (the system under test) that is unknown to the agent. The agent interacts with the SUT continuously, and the only information the agent has about the SUT is gained from its observations of this interaction. The interactions are actions taken by the agent and the SUT's responses to these actions, in the form of observations for the agent. In other words, the actions that our agent takes affect the SUT as the environment, and the SUT returns metrics to the agent, which affect the agent's next action.
States
We define the states according to performance metrics. Error rate and response time are two performance metrics in load testing. These two are considered the agent's observations of the environment. The two metrics define the agent's state: the average error rate and the average response time returned from the environment (SUT) after the agent took the last action. The terminal states are the states with an average response time or average error rate higher than a threshold. The average error rate range, [0, error rate threshold], and the average response time range, [0, response time threshold], are each divided into sections, and each section determines one state.
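To make the state definition concrete, the following sketch maps a pair of observations to a discrete state index by bucketing each metric into sections. It is a minimal illustration rather than the thesis implementation; the thresholds and the number of sections per metric are assumptions (the q-learning approach described later uses two error rate sections and three response time sections).

// Minimal sketch: map (avg. error rate, avg. response time) observations to a discrete state.
// Thresholds and section counts are illustrative assumptions.
public class StateDetector {
    private final double errorRateThreshold;      // e.g., 0.2
    private final double responseTimeThreshold;   // e.g., 1500 ms
    private final int errorRateSections;          // e.g., 2
    private final int responseTimeSections;       // e.g., 3

    public StateDetector(double erThreshold, double rtThreshold, int erSections, int rtSections) {
        this.errorRateThreshold = erThreshold;
        this.responseTimeThreshold = rtThreshold;
        this.errorRateSections = erSections;
        this.responseTimeSections = rtSections;
    }

    /** Returns -1 for a terminal state, otherwise a state index in [0, erSections * rtSections). */
    public int detectState(double avgErrorRate, double avgResponseTime) {
        if (avgErrorRate >= errorRateThreshold || avgResponseTime >= responseTimeThreshold) {
            return -1; // terminal state: a threshold was hit
        }
        int erSection = (int) (avgErrorRate / errorRateThreshold * errorRateSections);
        int rtSection = (int) (avgResponseTime / responseTimeThreshold * responseTimeSections);
        return erSection * responseTimeSections + rtSection;
    }
}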
Actions
The action that the agent takes in each step is increasing the workload and applying it to the SUT (environment). The workload is generated based on the policy and the workload used in the previous action. The workload contains several transactions, each of which has a specific workload, i.e., a specific number of threads executes each transaction. A transaction consists of multiple requests. A single thread represents a user (client) running the transaction and sending requests to the server (SUT). The action space is discrete, and the set of actions is the same for all states. Each action increases the last workload applied to the SUT by increasing the workload of exactly one of the transactions. The workload of a transaction is increased by multiplying the previous workload by a constant ratio. The definition of the actions is given in Equations 17 and 18:
\[
\mathit{Actions} = \{\, \mathit{action}_k \mid 1 \le k \le |\mathit{Transactions}| \,\} \tag{17}
\]

\[
\mathit{action}_k = \{\, W_{T_j,t} \mid W_{T_j,t} = W_{T_j,t-1} \text{ if } j \ne k,\;\; W_{T_j,t} = \alpha\, W_{T_j,t-1} \text{ if } j = k,\;\; T_j \in \mathit{Transactions},\; 1 \le j \le |\mathit{Transactions}| \,\} \tag{18}
\]

where T_j denotes transaction number j in the set of transactions, t is the current learning time step (iteration), W_{T_j,t} is the workload of transaction T_j at time step t, and α is the constant increase ratio.

Reward Function
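As an illustration of Equation 18, the sketch below increases the workload of one chosen transaction by the constant ratio α while leaving the other transactions unchanged. It is a simplified sketch; the class and field names are hypothetical and not taken from the thesis implementation.

// Sketch of applying action_k (Equation 18): multiply transaction k's workload by alpha,
// keep all other transactions' workloads unchanged. Names are illustrative only.
public class WorkloadAction {
    /** previousWorkloads[j] is the number of threads for transaction j at the previous step. */
    public static int[] apply(int[] previousWorkloads, int k, double alpha) {
        int[] next = previousWorkloads.clone();
        // Round up so the workload actually grows even for small values.
        next[k] = (int) Math.ceil(previousWorkloads[k] * alpha);
        return next;
    }
}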
The reward function takes the average error rate and average response time as input. The reward increases as the average error rate and average response time increase. Consequently, the probability of the agent choosing actions that lead to a higher error rate and response time will increase. We define the reward function in Equation 19:

\[
R_t = \frac{RT_t}{RT_{\mathit{threshold}}} + \frac{ER_t}{ER_{\mathit{threshold}}} \tag{19}
\]

where R_t is the reward at time step t, RT_t is the average response time and ER_t is the average error rate at time step t, and RT_threshold and ER_threshold denote the response time and error rate thresholds.

In this section, we propose our RL solution to adaptive load test generation. We present our approach and explain the reinforcement learning algorithms that we chose for it, which are simple q-learning and DQN. We formulate the load test scenario in a reinforcement learning context and provide the architecture of our approach for each of the q-learning and DQN methods. Algorithm 3 shows a general overview of the RL method. We use two methods, q-learning and DQN, for the learning phase in Algorithm 3, explained in Sections 6.2.1 and 6.2.2.
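The following sketch computes the reward of Equation 19 from the two observed metrics. It is a minimal illustration under the stated definition; the constant values shown in comments are only examples.

// Sketch of the reward function in Equation 19:
// R_t = RT_t / RT_threshold + ER_t / ER_threshold
public class RewardComputer {
    private final double responseTimeThreshold; // e.g., 1500 ms
    private final double errorRateThreshold;    // e.g., 0.2

    public RewardComputer(double rtThreshold, double erThreshold) {
        this.responseTimeThreshold = rtThreshold;
        this.errorRateThreshold = erThreshold;
    }

    public double reward(double avgResponseTime, double avgErrorRate) {
        return avgResponseTime / responseTimeThreshold + avgErrorRate / errorRateThreshold;
    }
}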
Algorithm 3:
Adaptive Reinforcement Learning-Driven Load Testing
Required:
S, A, α, γ;
Initialize q-values Q(s, a) = 0 for all s ∈ S, a ∈ A, and ε = υ, 0 < υ < 1;
while not (initial convergence reached) do
    Learning (with initial action selection strategy, e.g., ε-greedy with the initialized ε);
end
Store the learned policy;
Adapt the action selection strategy to transfer learning, i.e., tune the parameter ε in ε-greedy;
while true do
    Learning with the adapted strategy (e.g., the new value of ε);
end

6.2.1 Q-Learning

As mentioned in Section 2.2, q-learning is one of the basic reinforcement learning methods. Like other RL algorithms, q-learning seeks the policy which maximizes the total reward. The optimal policy here is extracted from the optimal q-function, which is learned by updating the q-table in each step of the learning process. As mentioned before, q-tables store q-values, which get updated continuously. The q-value q_π(s, a) of a state-action pair indicates how good it is to take action a from state s. In each step, the agent is in a state and can perform one of the actions available from that state. In q-learning, the agent takes the action with the maximum q-value among the available actions. As mentioned in Section 2.2.1, choosing the action with the maximum q-value satisfies the exploitation criterion. However, we also have to take random actions to satisfy the exploration criterion and to be able to experience the actions with lower q-values, which have not been chosen before (and whose q-values are therefore not yet updated and remain low). Consequently, we use the decaying ε-greedy policy, in which ε is large at the beginning of the learning and decays during the process. As mentioned in Section 2.2.1, ε is a number in the range of 0 to 1. In each step, the best action is selected with probability 1-ε, and a random action is selected with probability ε. After the action selection, we observe the environment (in our case the SUT), detect the next state, compute the reward, and then update the q-table with a new q-value for the previous state and the taken action. The q-learning algorithm is shown in Algorithm 4 [10]:

Algorithm 4:
Q-learning (off-policy TD control) for estimating π ≈ π*
Algorithm parameters: step size α ∈ (0, 1], small ε > 0
Initialize Q(s, a) for all s ∈ state space, a ∈ A(s), arbitrarily, except that Q(terminal, ·) = 0
for each episode do
    Initialize S
    for each step of the episode, until S is terminal do
        Choose A from S using the policy derived from Q (e.g., ε-greedy)
        Take action A, observe R, S'
        Q(S, A) ← Q(S, A) + α [R + γ max_a Q(S', a) − Q(S, A)]
        S ← S'
    end
end

Figure 10: The Q-learning approach architecture

Figure 10 illustrates the learning procedure in our approach:

Agent.
The purpose of the agent is to learn the optimal policy for generating load test scenarios that accomplish the objectives of load testing. The agent has four components: Policy, State Detection, Reward Computation, and Q-Table.

Policy.
The policy, which determines the next action, is extracted from the q-table based on the decaying ε-greedy approach; in each step, one action is selected among the actions available in the current state. As mentioned before, each action increases the workload of one of the transactions by a constant ratio and then applies the total workload of all transactions to the SUT concurrently.

State Detection.
The state detection unit detects the states based on the observations from the environment (i.e., the SUT). The observations here are the error rate and response time. Each state is indicated by a range of average error rates and average response times. As Figure 11 shows, we define six states, each one covering a specific range of error rate and response time. We divided the [0, error rate threshold] range into two sections and the [0, response time threshold] range into three sections.

Figure 11: States in the q-learning approach

Reward Computation.
The reward computation unit takes the error rate and response time as input and calculates the reward based on them.

Q-Table.
The q-table is where the q-values are stored. Each state-action pair has a q-value, which gets updated by the reward gained after taking the action from the state.

SUT.
The environment in our case is the SUT, to which the actions are applied and which reacts to the actions (i.e., the applied workload). The agent then receives the observations from the SUT, which are the error rate and response time, and determines the state and reward based on them.
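The interplay of these components follows Algorithm 4: the policy unit performs decaying ε-greedy selection over the q-table, and after each applied workload the q-table entry of the previous state-action pair is updated. The sketch below illustrates this in plain Java. It is a simplified, self-contained illustration; the decay schedule and the constants are assumptions rather than the thesis's exact values.

import java.util.Random;

// Minimal tabular q-learning with decaying epsilon-greedy selection (illustrative only).
public class QLearningAgent {
    private final double[][] qTable;      // qTable[state][action]
    private final double alpha;           // learning rate (step size)
    private final double gamma;           // discount factor
    private double epsilon;               // exploration probability, decays over time
    private final double epsilonDecay;    // e.g., 0.99 per step (assumption)
    private final double minEpsilon;      // lower bound for epsilon
    private final Random random = new Random();

    public QLearningAgent(int numStates, int numActions, double alpha, double gamma,
                          double initialEpsilon, double epsilonDecay, double minEpsilon) {
        this.qTable = new double[numStates][numActions];
        this.alpha = alpha;
        this.gamma = gamma;
        this.epsilon = initialEpsilon;
        this.epsilonDecay = epsilonDecay;
        this.minEpsilon = minEpsilon;
    }

    /** Decaying epsilon-greedy: random action with probability epsilon, otherwise the greedy one. */
    public int selectAction(int state) {
        int numActions = qTable[state].length;
        int action = (random.nextDouble() < epsilon) ? random.nextInt(numActions) : greedyAction(state);
        epsilon = Math.max(minEpsilon, epsilon * epsilonDecay);
        return action;
    }

    /** Q(S,A) <- Q(S,A) + alpha * (R + gamma * max_a Q(S',a) - Q(S,A)). */
    public void update(int state, int action, double reward, int nextState, boolean terminal) {
        double target = terminal ? reward : reward + gamma * qTable[nextState][greedyAction(nextState)];
        qTable[state][action] += alpha * (target - qTable[state][action]);
    }

    private int greedyAction(int state) {
        int best = 0;
        for (int a = 1; a < qTable[state].length; a++) {
            if (qTable[state][a] > qTable[state][best]) best = a;
        }
        return best;
    }
}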
As mentioned in Section 2.2, Deep Q-Network (DQN) is an extension of q-learning. This method uses a function approximator instead of a q-table. The function approximator, in this case, is a neural network. It approximates the q-values and refines this approximation (based on the rewards received each time the agent takes an action) instead of saving and retrieving the q-values from a q-table. Approximating the q-values is beneficial when the state-action space is large. In that case, filling the q-table is not feasible and takes a long time. The benefit of using DQN is that it speeds up the learning process because 1) there is no need to store a large amount of data in memory when the problem contains a large number of states and actions, and 2) there is no need to learn the q-value of every single state-action pair, since the learned q-values are generalized from the visited state-action pairs to the unvisited ones.

There are many function approximators (e.g., linear combinations of features, neural networks, decision trees, nearest neighbor, Fourier/wavelet bases). Among them, neural networks are a function approximator that uses gradient descent. Gradient descent is suitable for our data, which is not iid (independent and identically distributed). The data is not iid because, unlike in supervised learning, in reinforcement learning the values of states near each other, and the q-values of state-action pairs near each other, are probably similar, and the current state is highly correlated with the previous state.

The DQN that we chose in our approach uses an ANN which takes a state as input and estimates the q-values of all the actions available from that state (Figure 12).

Figure 12: DQN function approximation

The architecture of our intelligent load runner approach with the DQN method is shown in Figure 13. The approach is the same as the q-learning approach except that it contains a DQN unit instead of the Q-Table unit. Also, in this approach, each state corresponds to a single response time and error rate, and thus the number of states is equal to error rate threshold × response time threshold. In each iteration, after receiving the reward, the DQN is updated; then the policy unit chooses an action based on the actions' q-values approximated by the DQN unit.

Figure 13: The DQN approach architecture
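The structural difference from tabular q-learning can be sketched as replacing the q-table lookup with a learned function that maps a state to one q-value per action. The interface below only illustrates that idea; it is not the RL4J API used in the implementation (that configuration is shown in Section 7).

// Illustrative contrast between a q-table lookup and a learned q-value function (as in DQN).
// This is a conceptual sketch, not the RL4J implementation used in the thesis.
interface QValueEstimator {
    /** Returns one estimated q-value per available action for the given state features. */
    double[] qValues(double[] stateFeatures);

    /** Adjusts the estimator toward the target q-value for (state, action), e.g., by gradient descent. */
    void update(double[] stateFeatures, int action, double targetQ);
}

// A q-table can be seen as one trivial estimator over discrete states;
// a DQN replaces it with a neural network so that nearby, unvisited states share estimates.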
7. Evaluation
In this section, we explain the implementation setup of our proposed RL approaches for load test scenario generation, which answers RQ2. We explain the preparations for executing the implementation, i.e., the setup of the SUT, and the procedure of how our experiment was set up. In addition, we evaluate our RL approaches in an experiment by comparing them against a baseline and a random approach for generating the load tests.
Page loading time is an important factor in a website's user experience, and page delays can result in big sales losses in e-commerce stores (online shops). Page loading time depends on how well the performance requirements are met; therefore, performance requirements play a key role in e-commerce stores. We intend to use an open-source e-commerce store as the SUT and apply our proposed load testing method to it. Note that this e-commerce application is already used in production by many real-world users. Using an e-commerce store as the SUT makes it possible to send a variety of requests to the website as the workload, such as registering, logging in, visiting product pages, and buying products online. We cannot apply our approach to a running e-commerce store that provides real services to customers, because load testing the website would affect its performance and result in real sales losses. Therefore, we build our own e-commerce store. In this section, we explain the implementation of the system under test in detail. We use XAMPP to deploy the SUT server and build a local e-commerce website using WordPress and WooCommerce.
We deployed the SUT on a local server on a computer dedicated to this purpose. We used a local server to avoid load testing through proxies. Otherwise, we would end up load testing the proxy server too, and the proxy might fail before the SUT server. Using a local server, we can avoid possible effects of any in-between network equipment or server which may influence the test results. We deployed the SUT server using the XAMPP application on the Ubuntu 16.04 operating system. As mentioned before, XAMPP is a lightweight Apache distribution for deploying local web servers for testing purposes.

Figure 14: XAMPP application

To allocate our desired amount of resources to the SUT server, we use cgroups. Cgroups, or control groups, are a feature of the Linux kernel that makes it possible for the user to allocate resources, e.g., CPU time, system memory, network bandwidth, or combinations of these resources, among the collection of processes running on a system, and to manage and put restrictions on these resources.

1. We create a cgroup named "rlsutgroup" in the /etc/cgconfig.conf file:

group rlsutgroup {
    cpuset {
        cpuset.cpus = 0;
        cpuset.mems = 0;
    }
    memory {
        memory.limit_in_bytes = 2G;
    }
}

We specify CPU number 0 and memory node 0 to be accessed by the cgroup. We also set the maximum amount of user memory (including file cache) to 2 gigabytes (GB) for the cgroup.

2. To move the SUT server process to the cgroup we created, we add the line below to the /etc/cgrules.conf file:

*:/opt/lampp/manager-linux-x64.run    cpuset,memory    rlsutgroup

where /opt/lampp/manager-linux-x64.run is the command for starting the XAMPP server.

3. To apply the changes in cgconfig.conf and cgrules.conf, we enter the commands below in the Ubuntu terminal:

sudo cgconfigparser -l /etc/cgconfig.conf
sudo cgrulesengd

We set up an e-commerce website (Figure 15) on WordPress using WooCommerce. As mentioned before, WordPress is an open-source content management system, and WooCommerce is an e-commerce plugin for WordPress for creating and managing online stores; it is used by millions of users across the globe. A client can view products on the website, register or log in to the website, add products to her cart, and check out and order the products in her cart using PayPal or other options.
In this section, we first explain the structure of a workload and how it is generated using JMeter. Then we provide the implementation details of our RL load tester.
We use Apache JMeter as the load generation/execution tool in our implementation. As mentioned before, JMeter is a testing tool for generating and applying workloads on servers. It generates the workload recommended by the tester agent, applies it to the SUT, and captures and measures the performance metrics of the SUT. In each load test scenario, we execute a workload consisting of eleven different transactions, in which each transaction has a specific workload size executed by JMeter threads. Each thread in JMeter represents one user and is responsible for sending the HTTP requests of one transaction to the SUT. While applying a workload to the SUT, for each transaction we generate a number of JMeter threads equal to the size of that transaction's specific workload (a variable we change during the workload generation process), and we execute all the threads of every transaction in parallel within a specific ramp-up time.

Figure 15: SUT: E-commerce website
Transactions
We considered eleven different operations in the SUT, shown in Table 2, for generating workloads. A transaction is an operation and may have some functional prerequisite transactions. When a transaction is executed in the test, all of its prerequisite transactions are executed sequentially in a specific order before the execution of the main transaction. Table 3 shows the prerequisite transactions for each transaction. Each thread is responsible for executing a transaction and its prerequisite transactions sequentially. Nevertheless, all threads are executed in parallel.

Table 2: Common operations in an online shop
Operation         Description
Home              Access to home page
Sign up page      Access to sign up page
Sign up           Register and add a new user
Login page        Access to login page
Login             Sign in at the system
Search page       Access to search page
Select product    See the details of the selected product
Add to cart       Add the selected product to the cart
Payment           Access to payment page
Confirm           Confirm the order (payment)
Log out           Log out
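Each of these operations maps to an ordered chain of prerequisite requests (listed in Table 3 below), and each JMeter thread replays one such chain before the main operation. The sketch below shows one hypothetical way to model this structure in code; the enum and the prerequisite lists are illustrative, not the thesis implementation.

import java.util.List;
import java.util.Map;

// Illustrative model of transactions and their prerequisite request chains (cf. Tables 2 and 3).
public class TransactionModel {
    enum Step { HOME, MY_ACCOUNT_PAGE, REGISTER, LOGIN, SELECT_PRODUCT, ADD_TO_CART, CHECKOUT, PAYPAL_PAGE, LOGOUT }

    // A thread executes the prerequisite steps in order, then the main operation.
    static final Map<String, List<Step>> PREREQUISITES = Map.of(
            "Sign up", List.of(Step.HOME, Step.MY_ACCOUNT_PAGE, Step.REGISTER),
            "Login",   List.of(Step.HOME, Step.MY_ACCOUNT_PAGE, Step.LOGIN),
            "Confirm", List.of(Step.SELECT_PRODUCT, Step.ADD_TO_CART, Step.CHECKOUT, Step.PAYPAL_PAGE)
    );
}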
JMeter Configuration
We used apache-jmeter-5.2.1 for applying the workload to the SUT. JMeter can be run in a GUI mode or a CLI (command line) mode. JMeter Test Plans can also be created and executed from a Java program. A JMeter project can be saved as a JMX file in XML format. JMX, or Java Management Extensions, is a standard framework for managing applications in Java; a JMX file can define how to start, monitor, manage, and stop software components.

Table 3: Functional prerequisite transactions of each transaction
Transaction       Prerequisite transactions
Home              home page
Sign up page      home page → my account page
Sign up           home page → my account page → register
Login page        home page → my account page
Login             home page → my account page → login
Search page       home page
Select product    home page → select product
Add to cart       select product → add to cart
Payment           select product → add to cart → checkout
Confirm           select product → add to cart → checkout → PayPal page
Log out           my account page → logout

Generating the Test Plan
We use the GUI mode to generate a test; we set up JMeter to record user activities while browsing the SUT. We perform each of the transactions in Table 3 step by step and record the requests sent to the SUT using JMeter. Each test consists of several elements. All tests in JMeter should contain a Test Plan and Thread Groups. The steps for setting up the elements are (Figure 16):

Figure 16: JMeter Test Plan

1. Set the Test Plan element: In the Test Plan element, we check the option "Run tearDown Thread Groups after the shutdown of main threads".

2. Add Thread Group elements: In the Test Plan, we create a Thread Group element for one of the transactions: right-click on the Test Plan → Add → Threads (Users) → Thread Group. Each thread in a Thread Group simulates a user that sends requests to a server.

3. Add an HTTP Request Defaults element: HTTP Request Defaults is another JMeter element. The user can set default values for HTTP Request Samplers using HTTP Request Defaults. We add an HTTP Request Defaults element to the Thread Group: right-click on the Thread Group → Add → Config Element → HTTP Request Defaults. We enter the name of the website under test in the "Server Name or IP" field in the HTTP Request Defaults control panel. In the "Timeout" box we set the "Connect" timeout to 30000 ms and the "Response" timeout to 120000 ms.

4. Add a Recording Controller: JMeter can record user activity and store it in a Recording Controller. We add the Recording Controller to the Thread Group: right-click on the Thread Group → Add → Logic Controller → Recording Controller.

5. Add an HTTP(S) Test Script Recorder: The HTTP(S) Test Script Recorder can record all the requests sent to a server. We add this element to the Test Plan: right-click on the Test Plan → Add → Non-Test Elements → HTTP(S) Test Script Recorder. We set the "Target Controller" field to "Test Plan > Thread Group", where the recorded scripts will be added.

After building the Test Plan:

1. We change the proxy configuration of the browser and set the "HTTP Proxy" to "localhost" and the "Port" to the same port number as in the HTTP(S) Test Script Recorder.

2. Click the "Start" button in the HTTP(S) Test Script Recorder panel.

3. Perform the transaction step by step using the browser.

4. When the transaction is finished, click the "Stop" button in the HTTP(S) Test Script Recorder panel.

5. Save the JMeter project as a JMX file.

We do all the steps above for each transaction in Table 3, and in the end, we integrate all of the Thread Groups into a single JMX file (Figure 17) to be used for applying the workload by the agent. When executing the final JMX file, all the Thread Groups start at the same time and execute concurrently.

Figure 17: JMeter Thread Groups
Executing the Test Plan
We do not use the JMeter GUI or CLI for executing the generated Test Plan. Instead, the Test Plan is executed from Java code. The Java program is the implementation of the RL load test generation, wherein the Test Plan is executed in each step of the agent's learning process. To run the JMeter Test Plan, we first increase the JMeter heap size to be able to generate larger workloads, as described below:

• Go to the apache-jmeter-5.2.1/bin directory.
• Open the JMeter startup script.
• Find the line HEAP="-Xms1g -Xmx1g".
• Change the maximum value to -Xmx4g.

We load the JMX file in the Java program by importing the Apache JMeter packages:

testPlanTree = SaveService.loadTree(new File("myTest.jmx"));
We then reset the Thread Group parameters for each step of applying the workload after an action is taken by the agent (i.e., one of the transactions' workload is increased). There are three parameters to set for each Thread Group:

• number of threads: the number of threads (workload) in the Thread Group.
• number of loops: the number of times that the Thread Group is executed.
• ramp-up time: the time it takes for all threads of the Thread Group to get up and running.

For each Thread Group, we set the number of threads equal to the workload of that transaction in the RL agent, and the ramp-up time equal to the workload divided by a ratio "threadPerSecond" (which we set to 10). The number of loops is set to 1 for all Thread Groups.

threadGroup.setNumThreads(transaction[i].workLoad);
threadGroup.setRampUp(transaction[i].workLoad / threadPerSecond);
((LoopController) threadGroup.getSamplerController()).setLoops(1);

Then we run the Test Plan:

jmeter.run();
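For completeness, the snippets above can be stitched into one flow that loads the JMX file, updates each Thread Group, and runs the plan. This is a hedged sketch of how such an embedding typically looks, not the exact thesis code; the property and home paths and the tree traversal via SearchByClass are assumptions.

import java.io.File;
import org.apache.jmeter.control.LoopController;
import org.apache.jmeter.engine.StandardJMeterEngine;
import org.apache.jmeter.save.SaveService;
import org.apache.jmeter.threads.ThreadGroup;
import org.apache.jmeter.util.JMeterUtils;
import org.apache.jorphan.collections.HashTree;
import org.apache.jorphan.collections.SearchByClass;

// Sketch: load the recorded JMX file, set each Thread Group's workload, and execute the plan.
public class WorkloadExecutorSketch {
    public static void run(int[] workloads, int threadPerSecond) throws Exception {
        // Paths below are assumptions for illustration.
        JMeterUtils.setJMeterHome("/opt/apache-jmeter-5.2.1");
        JMeterUtils.loadJMeterProperties("/opt/apache-jmeter-5.2.1/bin/jmeter.properties");
        JMeterUtils.initLocale();
        SaveService.loadProperties();

        HashTree testPlanTree = SaveService.loadTree(new File("myTest.jmx"));

        // Find all Thread Groups (one per transaction) and apply the agent's workloads.
        SearchByClass<ThreadGroup> search = new SearchByClass<>(ThreadGroup.class);
        testPlanTree.traverse(search);
        int i = 0;
        for (ThreadGroup group : search.getSearchResults()) {
            group.setNumThreads(workloads[i]);
            group.setRampUp(Math.max(1, workloads[i] / threadPerSecond));
            ((LoopController) group.getSamplerController()).setLoops(1);
            i++;
        }

        StandardJMeterEngine jmeter = new StandardJMeterEngine();
        jmeter.configure(testPlanTree);
        jmeter.run(); // blocks until all threads finish
    }
}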
To be able to apply a large number of concurrent requests, we should execute the program on a device with high memory/CPU resources. Table 4 shows the properties of the device we used.

Table 4: Hardware overview of the machine used for executing the load generation client

Model Name                      MacBook Pro
Model Identifier                MacBookPro12,1
Processor Name                  Dual-Core Intel Core i7
Processor Speed                 3.1 GHz
Number of Processors            1
Total Number of Cores           2
L2 Cache (per Core)             256 KB
L3 Cache                        4 MB
Hyper-Threading Technology      Enabled
Memory                          16 GB
We implemented the agent in a Java program based on the approach explained in Section 6.2.1. As shown in Figure 18, we have a module for each of Q-Table, Policy (action selection), Reward Computation, and State Detection. How the modules communicate with each other is explained in Section 7.3.

Figure 18: Q-learning implementation architecture of Intelligent Load Runner
We use the RL4J library [53] in our DQN approach implementation. As mentioned before, RL4J is a deep reinforcement learning library that contains libraries for implementing DQN (deep q-learning with double DQN).
Prerequisites for Deeplearning4j
Prerequisites:
• Java: JDK 1.7 or later should be installed (only 64-bit versions are supported).
• Apache Maven: Maven is a dependency manager for Java applications.
• IntelliJ or Eclipse: IntelliJ and Eclipse are Integrated Development Environments (IDEs) that make it easier to work with Deeplearning4j and configure neural networks. IntelliJ is recommended for using the Deeplearning4j library.
• Git: to clone the Deeplearning4j examples.
Configuring and training a DQN agent using RL4J
1. Create an action space for the mission:
DiscreteSpace actionSpace = new DiscreteSpace(numberOfTransactions);
2. Create an observation space for the mission:
SUTObservationSpace observationSpace =
        new SUTObservationSpace(maxResponseTimeThreshold, maxErrorRateThreshold);
3. Create an MDP wrapper:
ILRMDP mdp = new ILRMDP(maxResponseTimeThreshold, maxErrorRateThreshold, csvWriter);
4. Create a DQN:

public static DQNFactoryStdDense.Configuration LOAD_TEST_NET =
        DQNFactoryStdDense.Configuration.builder()
                .l2(0.01)
                .updater(new Adam(learningRate))
                .numLayer(3)
                .numHiddenNodes(16)
                .build();
5. Create a Q-learning configuration by specifying hyperparameters:

public static QLearning.QLConfiguration LOAD_TEST_QL =
        new QLearning.QLConfiguration(
                seed,
                maxEpochStep,
                maxStep,
                expRepMaxSize,
                batchSize,
                targetDqnUpdateFreq,
                updateStart,
                rewardFactor,
                gamma,
                errorClamp,
                minEpsilon,
                epsilonNbStep,
                doubleDQN);
6. Create the DQN:
Learning<QualityMeasures, Integer, DiscreteSpace, IDQN> dql =
        new QLearningDiscreteDense<QualityMeasures>(mdp, LOAD_TEST_NET, LOAD_TEST_QL, manager);
7. Train the DQN:

dql.train();
Q-learning hyperparameters of the DQN
The Q-learning configuration hyperparameters are [55, 56]:

• maxEpochStep: Each epoch is equivalent to an episode in the learning algorithm. maxEpochStep is the maximum number of steps allowed in each episode (epoch).

• maxStep: The maximum number of total iterations (the sum of the steps over all episodes) in the learning. Training finishes when the number of steps exceeds maxStep.

• expRepMaxSize: The maximum size of the experience replay, i.e., the number of past transitions the agent bases its next action on. Experience replay is explained in detail in Section 2.

• batchSize: The number of steps after which the neural network updates its weights. We choose a batch size of 1 because each sample in RL depends on the previous sample, so the network should be updated per sample (in our case, each learning step).

• targetDqnUpdateFreq: In double DQN, the target network is frozen for targetDqnUpdateFreq steps and is then updated from the online network after targetDqnUpdateFreq steps. The state-action values are evaluated based on the target network to stabilize the learning.

• updateStart: The number of no-operation (do nothing) moves before starting the learning, to make the learning start from a random configuration. If the agent started from the same configuration each time, it would conduct the same sequence of actions at the beginning of each episode instead of learning to take the next action based on the current state.

• rewardFactor: The reward factor is an important hyperparameter that should be considered carefully, since it significantly affects the efficiency of the learning. This factor scales the rewards, so the q-values will be lower (if the resulting range is [-1, 1] it is similar to normalization).

• gamma: The discount factor.

• errorClamp: This parameter clips (bounds between two limit values) the loss function (TD error) in the backpropagation. For example, if errorClamp = 1, then the gradient is bounded to the range (-1, 1).

• minEpsilon: The minimum value to which ε in the ε-greedy action selection is annealed.

• epsilonNbStep: The number of steps over which ε is decreased to minEpsilon.

• doubleDQN: This value should be set to true to enable double DQN.

We set the values of the hyperparameters as shown in Table 6.

Figure 19: DQN implementation architecture of Intelligent Load Runner
We executed each of our RL approaches for load generation (shown in Figure 18 and Figure 19) separately on the SUT. We also executed a baseline load generation approach and a random load generation approach on the SUT to evaluate the efficiency of the proposed RL approaches against them. All approaches were executed in several episodes, each episode consisting of several steps. In the baseline approach, in each step, the workload size of all transactions was incremented. We chose a baseline approach in order to compare the final size of its generated workload (that hit the error rate or response time thresholds) with that of our proposed RL approaches. In the random approach, in each step, a random transaction was chosen and the size of its workload was increased (unlike the RL approaches, where the transaction was selected based on the policy). The reason behind choosing a random approach is that random testing has been found to be robust [57], [58] among many other systematic testing approaches and is a good criterion. The SUT was deployed on a local server, an ASUS K46 computer with the Ubuntu 16.04 operating system, with 1 CPU and 2 GB of memory dedicated to the SUT (as mentioned in Section 7.1). During the execution of the methods, the system logged the data necessary for the evaluation metrics. An overview of the procedure is shown in Figure 20. We further explain the evaluation metrics used and the procedure for executing each approach.

Figure 20: Procedure of executing the methods
Evaluation metrics
The evaluation metrics are the average error rate, average response time, size of the final effective workload, and number of steps for generating the effective workload. We cannot conclude from a quick response time alone that the SUT is operating fine and fast; we should consider the error rate too. Since servers are quick at delivering error pages, we may get low response times with high error rates in some situations. Consequently, we put a threshold not only on the response time but also on the error rate.
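Because an episode terminates when either metric crosses its threshold, the termination check can be expressed as a simple predicate over the aggregated metrics, as sketched below. The aggregation of per-request samples into averages is an assumption about how such metrics are typically computed, not a quotation of the thesis code.

// Sketch: aggregate per-request results into the two evaluation metrics and check termination.
public class MetricsCheck {
    /** Average error rate = failed requests / total requests. */
    public static double averageErrorRate(int failedRequests, int totalRequests) {
        return totalRequests == 0 ? 0.0 : (double) failedRequests / totalRequests;
    }

    /** Average response time over all sampled requests, in milliseconds. */
    public static double averageResponseTime(long totalResponseTimeMs, int totalRequests) {
        return totalRequests == 0 ? 0.0 : (double) totalResponseTimeMs / totalRequests;
    }

    /** The episode ends when either metric hits its threshold. */
    public static boolean thresholdHit(double avgErrorRate, double avgResponseTimeMs,
                                       double errorRateThreshold, double responseTimeThresholdMs) {
        return avgErrorRate >= errorRateThreshold || avgResponseTimeMs >= responseTimeThresholdMs;
    }
}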
General configuration
We executed each approach for about 40 episodes. Each episode consisted of several steps. In each step, a workload was generated and applied to the SUT. The workload, response time, and error rate were logged for each step. The episode continued until the observed error rate or response time hit the threshold. Two consecutive episodes were executed with a 5-minute delay between them, which allowed the server to go back to its normal state. Table 5 shows the configuration values used for all the approaches.

Table 5: Load Tester Configuration Values
Parameter                                             Value
average response time threshold                       1500 ms
average error rate threshold                          0.2
delay between executing two consecutive episodes      5 min
number of started threads per second                  10
initial workload per transaction                      3
transaction workload increasing step ratio            1/3
Baseline approach
We executed the baseline approach for 40 episodes. In each step of an episode in this approach:
1. The workload of all transactions was increased by 1/3.

Random approach
We executed the random approach for 40 episodes. In each step of an episode in this approach:
1. A transaction was chosen randomly, and its workload was increased by 1/3.

Q-learning and DQN approaches
We executed the q-learning approach for 40 episodes. However, the DQN approach was executed for 47 episodes. The reason is that the number of episodes was not configurable in the DQN implemented using the RL4J library; instead, the number of steps was a configurable parameter. Therefore, we configured the number of steps to a value (shown in Table 6) that we estimated would be executed in around 40 episodes. In each episode, the agent started from an initial state, which was detected by applying an initial workload to the SUT and observing the average error rate and response time. The initial q-values in the q-table/q-network were set to 0. Each episode consisted of several learning steps. In each learning step:
1. An action was chosen according to the policy; the workload of one of the transactions was increased by 1/3.

Table 6: Hyperparameter values used in the DQN approach

Hyperparameter          Value
maxEpochStep            30
maxStep                 450
expRepMaxSize           450
batchSize               1
targetDqnUpdateFreq     10
updateStart             1
rewardFactor            0.1
gamma                   0.5
errorClamp              10.0
minEpsilon              0.1
epsilonNbStep           400
doubleDQN               true
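Putting the pieces together, one episode of the q-learning treatment can be sketched as: select a transaction via the policy, grow its workload by the configured ratio, execute the workload, observe the metrics, and update the q-table. The sketch below combines the illustrative classes introduced earlier (StateDetector, RewardComputer, QLearningAgent) and a hypothetical executeWorkload call standing in for the JMeter execution; it is not the thesis code.

// Illustrative single-episode loop for the q-learning treatment, reusing the earlier sketches.
// executeWorkload(...) stands in for applying the JMeter Test Plan and collecting metrics.
public class EpisodeRunnerSketch {
    public static void runEpisode(QLearningAgent agent, StateDetector detector,
                                  RewardComputer rewarder, int[] workloads, double increaseRatio) {
        double[] metrics = executeWorkload(workloads);          // {avgErrorRate, avgResponseTimeMs}
        int state = detector.detectState(metrics[0], metrics[1]);

        while (state != -1) {                                   // -1 marks a terminal (threshold-hitting) state
            int action = agent.selectAction(state);             // which transaction's workload to grow
            workloads[action] = (int) Math.ceil(workloads[action] * (1.0 + increaseRatio));

            metrics = executeWorkload(workloads);
            int nextState = detector.detectState(metrics[0], metrics[1]);
            double reward = rewarder.reward(metrics[1], metrics[0]);

            agent.update(state, action, reward, Math.max(nextState, 0), nextState == -1);
            state = nextState;
        }
    }

    // Placeholder: in the real setup this applies the workload via JMeter and returns the averages.
    private static double[] executeWorkload(int[] workloads) {
        return new double[] {0.0, 0.0};
    }
}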
8. Results
This section presents the results of the experiment conducted to evaluate the efficiency of the baseline, random, q-learning, and DQN approaches. This section focuses on answering RQ3.
Results of the Baseline Approach
The baseline approach was executed for 40 episodes. In Figure 21a, the episodes are plotted on the X-axis, and the Y-axis shows the number of steps needed in each episode to generate the workload that hit the response time or error rate threshold. As can be seen from Figure 21a, the trend line for the baseline approach stays between zero and five. This means that the baseline approach took few steps to generate the final workload that hits the thresholds. However, the increment of the workload size in each step was very high, because it was applied to all transactions (as shown in Figure 21b). In Figure 21b, the episodes of the baseline approach are shown on the X-axis, and the size of the final workload that hit the response time or error rate threshold is shown on the Y-axis. As can be seen in Figure 21b, the trend line for the baseline approach consistently stays between 50 and 60. This means that in the majority of the episodes, the baseline approach hit the thresholds only with larger workload sizes.
Figure 21: Baseline approach. (a) Number of steps per episode; (b) final workload size per episode.
Results of the Random Approach
The random approach was executed for 40 episodes. Figure 22a shows that in the random approach, the number of steps for increasing the workload is high. The trend line is between 10 and 15 steps, and it is constant (as expected from a random method). This indicates that, on average, no extreme change happened over time in the state of the system and that the SUT remained stable. As shown in Figure 22b, in the random approach, the size of the final workload which hit the thresholds in each episode is generally between 40 and 50. This size is smaller than the typical size of the final workload in the baseline approach because here the increments in the workload size were applied to just one transaction per step and not to all of them. Thus, this approach could find smaller workloads that hit the threshold.
Figure 22: Random approach. (a) Number of steps per episode; (b) final workload size per episode.
Results of the Q-learning Approach
The q-learning approach was executed for 40 episodes. Figure 23a shows that the number of steps in each episode is, on average, between 5 and 15. We can see that the diversity of the number of steps is high in the first episodes, which is similar to the results of the random approach shown in Figure 22a. However, the diversity decreases over time, and after episode 25, i.e., in the last 16 episodes, the number of steps converges and stays in the range of 8 to 10. This indicates that the agent has developed a policy and is following it. The policy that has been learned converges over time and does not change drastically after episode 25. Another practical factor behind the convergence in this approach is the use of the decaying ε-greedy method. Although the q-learning method is expected to converge to the optimal policy with probability one [59], using decaying ε-greedy can accelerate the convergence. As mentioned in Section 6.2.1, at the beginning of the learning, the actions are chosen more randomly to allow the agent to explore different actions and learn the consequences of taking them (through the reward received for each action). Over time, the probability of choosing random actions decreases and the probability of choosing the best action according to the main policy (the policy derived from the q-table) increases. Therefore, the actions (i.e., the transactions whose workloads are incremented) are chosen more randomly at the beginning of the learning (first episodes), but less randomly and more intelligently, based on the learned policy, in the last episodes. In the final episodes, the process of generating the workload follows the same policy, which leads to the convergence of the number of steps in each episode. In addition, the trend line in Figure 23a decreases over time, which means that the number of steps for generating a workload that hits the thresholds is decreasing. This indicates that the agent has learned to take more efficient actions and to choose the best candidate transactions whose workload to increment in each step. Choosing the best actions intelligently in the last episodes leads to generating an effective workload in fewer steps. The convergence and the decrease in the number of steps show that the agent has found the optimal policy for generating an effective workload which hits the thresholds in fewer steps. Figure 23b shows that in the q-learning approach, the size of the final workload in each episode is small, on average between 40 and 50. The diversity of the final workload size is high in the first 20 episodes, but it decreases over time and converges to the range of 42 to 46 after episode 22 (i.e., in the last 19 episodes). The trend line also shows that the final workload size decreases over time.
Figure 23: Q-learning approach. (a) Number of steps per episode; (b) final workload size per episode.
Results of the DQN Approach
The DQN approach was executed for 47 episodes. As shown in Figure 24a, the number of steps is between 5 and 15. After episode 38, i.e., in the last 10 episodes, the number of steps stays in the range of 6 to 8. As in the q-learning approach, and for the same reason (i.e., the use of the decaying ε-greedy method in the policy), the number of steps converges in the last episodes. Also, the slope of the trend line shows that the number of steps for generating the effective workload decreases over time. Based on the convergence and the decrease in the number of steps, we can see that the agent has learned the optimal policy and is following it. We can also see that, in comparison to the q-learning approach, the convergence occurs after more episodes, but the convergence range is lower than in the q-learning approach. In Figure 24b, the trend line shows that the final workload is generally between 40 and 50. We can see in the figure that after episode 40, i.e., in the last 8 episodes, the size of the effective workload varies in the range of 40 to 43. This shows that it converges after episode 40, and the size of the final workload is reduced after this episode. Compared to the q-learning method, the convergence and the decrease in the size of the effective workload happen in later episodes, but the convergence range is lower than the corresponding range in the q-learning method (42 to 46).
Figure 24: DQN approach. (a) Number of steps per episode; (b) final workload size per episode.
9. Discussion
In this section, we highlight the answers to our research questions. We review the solutions and discuss the results (important points in italics). We also discuss the threats to the validity of our results.
RQ1.
Answering RQ1 required formulating an RL solution for the test load generation problem. In order to apply an RL-based solution, a mapping of the real-world problem into an RL problem is a prerequisite. We formulated and presented our mapping in Section 6 and provided our RL approach for load test generation in detail. We defined the agent, the environment, and the learning principles, including the states, actions, observations, and reward function, of our RL problem.
As discussed, we considered the SUT as the environment. We defined each state based on the last average error rate and average response time observed from the SUT. The actions were defined as choosing a transaction, increasing its workload, and applying it to the SUT. The average error rate and average response time were the agent's observations of the SUT, and the reward was calculated from these observations. We also chose two different RL methods for our approach, namely q-learning and DQN.
RQ2.
To demonstrate the applicability of our proposed RL-based load testing approaches, which is RQ2, we implemented and applied the q-learning and DQN approaches to an SUT in Section 7. We set up an e-commerce store on a local server using an open-source and heavily maintained CMS and plugin, i.e., WordPress and the WooCommerce plugin. We used JMeter for generating the load scenarios (produced by the RL agent), applying them to the SUT, and recording the error rate and response time.
The results show that RL-based test load generation approaches are applicable to performance testing of real-world applications.
RQ3.
To answer the third research question, we conducted an experiment. In the experiment, we executed four approaches (treatments) for load test generation: a baseline, a random, and the proposed q-learning and DQN approaches. The results of the experiment are provided in Section 8. The results show that the baseline approach (in which the workload was increased for all transactions in each step) generally produced a larger effective workload.
Thus we can conclude that, in comparison to the other approaches, the baseline approach for workload generation is not efficient in terms of the size of the generated workload.
In contrast, the random approach performed better than the baseline approach.
In comparison with the RL-based approaches, however, the random approach generally did not attain an optimal workload size, and the diversity of the applied workload sizes remained high.
The results show that the number of steps per episode and the size of the effective workload in each episode of both RL approaches converge to lower values in the last episodes. The results indicate that the q-learning approach converges faster (in terms of the number of steps and the optimal workload size) than the DQN approach.
This is expected, since the q-learning approach only has six states while the DQN approach has many more states (error rate threshold × response time threshold states). The findings also revealed that the DQN approach converges to lower values in both metrics. This means that, in comparison to the q-learning-based approach, the DQN approach took more time to converge; however, the DQN was more efficient in terms of finding the optimal workload sizes. Based on our results, we believe that the DQN approach can perform even better after extensive tuning of the hyperparameters.
Finally, we can conclude that both of the proposed RL approaches for load test generation converge to the optimal policy and perform better than the baseline and random approaches.
Load scenario generation is heavily dependent on the hardware of the SUT and its execution environment. External factors (such as running the SUT on a shared hosting server) might alter the results. In this section, we address the potential threats to validity based on the classification presented by Runeson and Höst [60].

Construct validity: This aspect of validity reflects to what extent the studied operational measures really represent what the researcher has in mind. A misunderstood question is an example of a potential construct validity threat. To tackle potential threats to construct validity, we benefited from the guidance of multiple researchers in the problem formulation and used well-established guidelines for conducting our study.

Internal validity: This aspect of validity concerns the validity and credibility of the obtained results. We tackled the potential threats to internal validity by executing the experiment on a dedicated local server with no other processes running on it. However, there were several uncontrolled factors (such as the operating system's processes) that might have affected our results.

External validity: This aspect of validity concerns to what extent it is possible to generalize the findings, and to what extent the findings are of interest to other people outside the investigated case. In our case, the results were obtained by executing the experiment on one SUT, and we therefore do not claim that the results can be generalized to other cases. However, we chose an open-source and heavily maintained SUT, so the results may generalize to similar cases (WooCommerce-based e-commerce applications). In addition, our results can also be of interest to performance testing researchers and practitioners working with load testing via JMeter.

Reliability: This aspect concerns to what extent the data and the analysis are dependent on the specific researchers. Hypothetically, if another researcher later conducted the same study, the results should be the same. Threats to this aspect of validity were tackled by receiving feedback from multiple researchers during the experiment planning and execution. In addition, we provided enough details on our experiment setup for replication.
10. Conclusions
One critical activity in performance testing is the generation of load tests. Existing load testing approaches rely heavily on system models or source code. This thesis aims to propose and evaluate a model-free, intelligent approach for load test generation. In this thesis, we formulated the problem of efficient test generation for load testing as an RL problem. We presented an RL-driven, model-free approach for generating effective workloads in performance testing. We mapped the real-world problem of test load generation into the RL context and discussed our approach in detail. We evaluated the applicability of our proposed RL-based test load generation on a real-world software system. In addition, we conducted an experiment to compare the efficiency of the two proposed RL-based methods, q-learning and DQN, for effective workload generation. For the experiment, we implemented our proposed approach and prepared the requirements for testing it. We set up an e-commerce store on a local server using WordPress and its WooCommerce plugin. Then we applied the workloads generated by our approach to it. The workloads consisted of different numbers of various transactions (operations) generated using JMeter. We performed a one-factor, four-treatment experiment on the same SUT, where the factor was the "test load generation method" with the treatments baseline method, random method, q-learning method, and DQN method. We executed each treatment for around 40 episodes. Each episode contained several load generation steps and finished when the generated workload produced an error rate or response time larger than the defined threshold.

The results indicated that, in general, the baseline approach was not efficient in terms of the size of the generated effective workload per episode. In the random approach, the average size of the generated effective workload was smaller than in the baseline approach, which means that the random approach performed better than the baseline approach in test load generation. In addition, the results showed that both of the RL-based methods performed better than the random and baseline approaches. The results show that the effective workload size and the number of steps taken to generate the workload in each episode converge to lower values in both the q-learning and DQN approaches. The q-learning approach converged faster than the DQN, but the DQN approach converged to lower values for the workload sizes. We can conclude from the results that both of the proposed RL approaches learned an optimal policy to generate optimal workloads efficiently. The RL-based approaches performed better in our experiment and do not require access to the system models or source code. In addition, the learned policy can be reused in further similar situations (stages) of testing, e.g., regression testing and the incremental release testing of Development and Operations (DevOps) processes.

In the future, we plan to extend our approach to support performance testing for software product lines. In software product lines, the derived products are variants of the standard product, and the learned policy for load generation can be reused on multiple derived products. This can significantly reduce the performance testing time for the software product line.
References

[1] G. Linden. (2006) Marissa Mayer at Web 2.0. [Online]. Available: http://glinden.blogspot.com/2006/11/marissa-mayer-at-web-20.html
[2] G. Linden. (2006) Make your data useful. [Online]. Available: http://sites.google.com/site/glinden/Home/StanfordDataMining.2006-11-29.ppt
[3] G. Jin, L. Song, X. Shi, J. Scherpelz, and S. Lu, "Understanding and detecting real-world performance bugs," ACM SIGPLAN Notices, vol. 47, no. 6, pp. 77–88, 2012.
[4] E. J. Weyuker and F. I. Vokolos, "Experience with performance testing of software systems: issues, an approach, and case study," IEEE Transactions on Software Engineering, vol. 26, no. 12, pp. 1147–1156, 2000.
[5] S. Lavenberg, Computer Performance Modeling Handbook. Elsevier, 1983.
[6] P. Zhang, S. Elbaum, and M. B. Dwyer, "Automatic generation of load tests," in Proceedings of the 2011 26th IEEE/ACM International Conference on Automated Software Engineering. IEEE Computer Society, 2011, pp. 43–52.
[7] J. Zhang and S. C. Cheung, "Automated test case generation for the stress testing of multimedia systems," Software: Practice and Experience, vol. 32, no. 15, pp. 1411–1435, 2002.
[8] M. D. Syer, B. Adams, and A. E. Hassan, "Identifying performance deviations in thread pools," IEEE, 2011, pp. 83–92.
[9] J. Koo, C. Saumya, M. Kulkarni, and S. Bagchi, "PySE: Automatic worst-case test generation by reinforcement learning," IEEE, 2019, pp. 136–147.
[10] R. S. Sutton, A. G. Barto et al., Introduction to Reinforcement Learning. MIT Press, Cambridge, 1998, vol. 135.
[11] O. Ibidunmoye, F. Hernández-Rodriguez, and E. Elmroth, "Performance anomaly detection and bottleneck identification," ACM Computing Surveys (CSUR), vol. 48, no. 1, p. 4, 2015.
[12] V. Chandola, A. Banerjee, and V. Kumar, "Anomaly detection: A survey," ACM Computing Surveys (CSUR), vol. 41, no. 3, p. 15, 2009.
[13] ISO 25000, "ISO/IEC 25010 – System and software quality models," 2019, available at https://iso25000.com/index.php/en/iso-25000-standards/iso-25010, retrieved July 2019.
[14] M. Glinz, "On non-functional requirements," IEEE, 2007, pp. 21–26.
[15] L. Chung, B. A. Nixon, E. Yu, and J. Mylopoulos, Non-Functional Requirements in Software Engineering. Springer Science & Business Media, 2012, vol. 5.
[16] V. Cortellessa, A. Di Marco, and P. Inverardi, Model-Based Software Performance Analysis. Springer Science & Business Media, 2011.
[17] M. Harchol-Balter, Performance Modeling and Design of Computer Systems: Queueing Theory in Action. Cambridge University Press, 2013.
[18] K. Kant and M. Srinivasan, Introduction to Computer System Performance Evaluation. McGraw-Hill College, 1992.
[19] A. Geraci, F. Katki, L. McMonegal, B. Meyer, J. Lane, P. Wilson, J. Radatz, M. Yee, H. Porteous, and F. Springsteel, IEEE Standard Computer Dictionary: Compilation of IEEE Standard Computer Glossaries. IEEE Press, 1991.
[20] Z. M. Jiang and A. E. Hassan, "A survey on load testing of large-scale software systems," IEEE Transactions on Software Engineering, vol. 41, no. 11, pp. 1091–1118, 2015.
[21] B. Gregg, Systems Performance: Enterprise and the Cloud. Pearson Education, 2013.
[22] B. Beizer, Software System Testing and Quality Assurance. Van Nostrand Reinhold Co., 1984.
[23] T. M. Mitchell, Machine Learning, 1st ed. USA: McGraw-Hill, Inc., 1997.
[24] L.-J. Lin, "Self-improving reactive agents based on reinforcement learning, planning and teaching," Machine Learning, vol. 8, no. 3-4, pp. 293–321, 1992.
[25] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski et al., "Human-level control through deep reinforcement learning," Nature, vol. 518, no. 7540, pp. 529–533, 2015.
[26] T. Schaul, J. Quan, I. Antonoglou, and D. Silver, "Prioritized experience replay," arXiv preprint arXiv:1511.05952, 2015.
[27] J. Brownlee. (2018) Difference between a batch and an epoch in a neural network. [Online]. Available: https://machinelearningmastery.com/difference-between-a-batch-and-an-epoch/
[28] H. V. Hasselt, "Double q-learning," in Advances in Neural Information Processing Systems, 2010, pp. 2613–2621.
[29] H. Van Hasselt, A. Guez, and D. Silver, "Deep reinforcement learning with double q-learning," in Thirtieth AAAI Conference on Artificial Intelligence, 2016.
[30] D. A. Menascé, "Load testing, benchmarking, and application performance management for the web," in Int. CMG Conference, 2002, pp. 271–282.
[31] V. Apte, T. Viswanath, D. Gawali, A. Kommireddy, and A. Gupta, "AutoPerf: Automated load testing and resource usage profiling of multi-tier internet applications," in Proceedings of the 8th ACM/SPEC International Conference on Performance Engineering. ACM, 2017, pp. 115–126.
[32] A. Jindal, V. Podolskiy, and M. Gerndt, "Performance modeling for cloud microservice applications," in Proceedings of the 2019 ACM/SPEC International Conference on Performance Engineering. ACM, 2019, pp. 25–32.
[33] L. C. Briand, Y. Labiche, and M. Shousha, "Stress testing real-time systems with genetic algorithms," in Proceedings of the 7th Annual Conference on Genetic and Evolutionary Computation. ACM, 2005, pp. 1021–1028.
[34] V. Ayala-Rivera, M. Kaczmarski, J. Murphy, A. Darisa, and A. O. Portillo-Dominguez, "One size does not fit all: In-test workload adaptation for performance testing of enterprise applications," in
Proceedings of the 2018 ACM/SPEC International Conference on PerformanceEngineering . ACM, 2018, pp. 211–222.[35] Y. Gu and Y. Ge, “Search-based performance testing of applications with composite services,”in . IEEE, 2009, pp.320–324.[36] M. Di Penta, G. Canfora, G. Esposito, V. Mazza, and M. Bruno, “Search-based testing ofservice level agreements,” in
Proceedings of the 9th annual conference on Genetic and evolu-tionary computation . ACM, 2007, pp. 1090–1097.[37] V. Garousi, “A genetic algorithm-based stress test requirements generator tool and its em-pirical evaluation,”
IEEE Transactions on Software Engineering , vol. 36, no. 6, pp. 778–797,2010. 44olrokh Hamidi Reinforcement Learning Assisted Load Test Generation[38] V. Garousi, L. C. Briand, and Y. Labiche, “Traffic-aware stress testing of distributed real-time systems based on uml models using genetic algorithms,”
Journal of Systems and Software ,vol. 81, no. 2, pp. 161–185, 2008.[39] C.-S. D. Yang and L. L. Pollock, “Towards a structural load testing tool,” in
ACM SIGSOFTSoftware Engineering Notes , vol. 21, no. 3. ACM, 1996, pp. 201–208.[40] D. Draheim, J. Grundy, J. Hosking, C. Lutteroth, and G. Weber, “Realistic load testing ofweb applications,” in
Conference on Software Maintenance and Reengineering (CSMR’06) .IEEE, 2006, pp. 11–pp.[41] C. Lutteroth and G. Weber, “Modeling a realistic workload for performance testing,” in . IEEE, 2008,pp. 149–158.[42] M. Shams, D. Krishnamurthy, and B. Far, “A model-based approach for testing the perfor-mance of web applications,” in
Proceedings of the 3rd international workshop on Softwarequality assurance . ACM, 2006, pp. 54–61.[43] C. V¨ogele, A. van Hoorn, E. Schulz, W. Hasselbring, and H. Krcmar, “Wessbas: extractionof probabilistic workload specifications for load testing and performance predictiona model-driven approach for session-based application systems,”
Software & Systems Modeling , vol. 17,no. 2, pp. 443–477, 2018.[44] V. Ferme and C. Pautasso, “A declarative approach for performance tests execution in con-tinuous software development environments,” in
Proceedings of the 2018 ACM/SPEC Inter-national Conference on Performance Engineering . ACM, 2018, pp. 261–272.[45] ——, “Towards holistic continuous software performance assessment,” in
Proceedings of the 8thACM/SPEC on International Conference on Performance Engineering Companion . ACM,2017, pp. 159–164.[46] H. Schulz, D. Okanovi´c, A. van Hoorn, V. Ferme, and C. Pautasso, “Behavior-driven loadtesting using contextual knowledge-approach and experiences,” in
Proceedings of the 2019ACM/SPEC International Conference on Performance Engineering . ACM, 2019, pp. 265–272.[47] H. Malik, H. Hemmati, and A. E. Hassan, “Automatic detection of performance deviations inthe load testing of large scale systems,” in
Proceedings of the 2013 International Conferenceon Software Engineering . IEEE Press, 2013, pp. 1012–1021.[48] M. Grechanik, C. Fu, and Q. Xie, “Automatically finding performance problems with feedback-directed learning software testing,” in . IEEE, 2012, pp. 156–166.[49] T. Ahmad, A. Ashraf, D. Truscan, and I. Porres, “Exploratory performance testing usingreinforcement learning,” in . IEEE, 2019, pp. 156–163.[50] V. R. Basili and H. D. Rombach, “The tame project: Towards improvement-oriented softwareenvironments,”
IEEE Transactions on software engineering , vol. 14, no. 6, pp. 758–773, 1988.[51] H. J. Holz, A. Applin, B. Haberman, D. Joyce, H. Purchase, and C. Reed, “Research methodsin computing: What are they, and how should we teach them?” in
Working group reports onITiCSE on Innovation and technology in computer science education , 2006, pp. 96–114.[52] C. Wohlin, P. Runeson, M. Hst, M. C. Ohlsson, B. Regnell, and A. Wessln,
Experimentationin Software Engineering . Springer Publishing Company, Incorporated, 2012.[53] R. Fiszel, “Rl4j: Reinforcement learning for java,” https://github.com/eclipse/deeplearning4j/tree/master/rl4j, accessed: 2020-02-01.45olrokh Hamidi Reinforcement Learning Assisted Load Test Generation[54] “Deeplearning4j: Open-source distributed deep learning for the jvm, apache software founda-tion license 2.0,” https://deeplearning4j.org, accessed: 2020-04-21.[55] R. Raj,
Java Deep Learning Cookbook . Packt Publishing Ltd, 2019.[56] R. Fiszel. (2016) Reinforcement learning and dqn, learning toplay from pixels. [Online]. Available: https://rubenfiszel.github.io/posts/rl4j/2016-08-24-Reinforcement-Learning-and-DQN.html[57] I. Ciupa, A. Leitner, M. Oriol, and B. Meyer, “Experimental assessment of random testing forobject-oriented software,” in
Proceedings of the 2007 International Symposium on SoftwareTesting and Analysis , ser. ISSTA 07. New York, NY, USA: Association for ComputingMachinery, 2007, p. 8494. [Online]. Available: https://doi.org/10.1145/1273463.1273476[58] J. W. Duran and S. C. Ntafos, “An evaluation of random testing,”
IEEE Trans.Softw. Eng. , vol. 10, no. 4, p. 438444, Jul. 1984. [Online]. Available: https://doi.org/10.1109/TSE.1984.5010257[59] C. J. C. H. Watkins and P. Dayan, “Technical note: q -learning,”
Mach. Learn. , vol. 8, no.34, p. 279292, May 1992. [Online]. Available: https://doi.org/10.1007/BF00992698[60] P. Runeson and M. Hst, “Guidelines for conducting and reporting case study research insoftware engineering,”