Learning Queuing Networks by Recurrent Neural Networks
Giulio Garbi
IMT School for Advanced Studies Lucca, Piazza San Francesco 19, 55100 Lucca, Italy
[email protected]
Emilio Incerto
IMT School for Advanced Studies Lucca, Piazza San Francesco 19, 55100 Lucca, Italy
[email protected]
Mirco Tribastone
IMT School for Advanced Studies Lucca, Piazza San Francesco 19, 55100 Lucca, Italy
[email protected]
February 26, 2020

Abstract
It is well known that building analytical performance models in practice is difficult because it requires a considerable degree of proficiency in the underlying mathematics. In this paper, we propose a machine-learning approach to derive performance models from data. We focus on queuing networks, and crucially exploit a deterministic approximation of their average dynamics in terms of a compact system of ordinary differential equations. We encode these equations into a recurrent neural network whose weights can be directly related to model parameters. This allows for an interpretable structure of the neural network, which can be trained from system measurements to yield a white-box parameterized model that can be used for prediction purposes such as what-if analyses and capacity planning. Using synthetic models as well as a real case study of a load-balancing system, we show the effectiveness of our technique in yielding models with high predictive power.
Keywords: software performance · queuing networks · recurrent neural networks

Motivation
Performance metrics such as throughput and response time are important factors that impact the quality of a software system as perceived by users. They indicate how well the software behaves, thus complementing functional properties that concern what the software does. A traditional way of reasoning about the performance of a software system is by means of profiling. A tool such as
Gprof executes the program and allows the identification of the program locations that are most performance sensitive [23]. The main limitation is that this information is valid for the specific run with which the program is exercised; different inputs lead to different performance profiles in general. Thus, while profiling can detect the presence of performance anomalies, it lacks generalizing and predictive power (see also [64]). As with all scientific and engineering disciplines, predictions can be made with models. Software performance models are mathematical abstractions whose analysis provides quantitative insights into the real systems under consideration [15]. Typically, these are stochastic models based on Markov chains and other higher-level formalisms such as queueing networks, stochastic process algebra, and stochastic Petri nets (see, e.g., [15] for a detailed account). Although they have proved effective in describing and predicting the performance behavior of complex software systems (e.g., [8, 50]), a pressing limitation is that the current state of the art hinges on considerable craftsmanship to distill the appropriate abstraction level from a concrete software system, and relevant mathematical skills to develop, analyze, and validate the model. Indeed, the amount of knowledge required in both the problem domain and in the modeling techniques necessarily hinders their use in practice [62]. Despite the promise that analytical performance modeling holds, we are confronted with a high adoption barrier. A possible solution might be to derive the model automatically. There has been much research into extending higher-level descriptions such as UML diagrams with performance annotations (using for example appropriate profiles such as MARTE [42]) from which both software artifacts and associated performance models are generated (see the surveys [6, 36]).
However, since systems are typically subjected to further modifications, the hard problem of keeping the model synchronized with the code arises [20]. This makes such model-driven approaches particularly difficult to use in general, especially in the context of fast-paced software processes characterized by continuous integration and development.

Main contribution
In this paper we propose a novel methodology where analytical performance models are automatically learned from a running system using execution traces. We focus on queueing networks (QNs), a formalism that has enjoyed considerable attention in the software performance engineering community, since it has been shown to capture the main performance-related phenomena in software architectures [2], annotated UML diagrams [7], component-based systems [36], web services [18], and adaptive systems [3, 29]. A QN is characterized by a number of parameters that define the following quantities: i) the behavior of each shared resource, such as its service demand and its concurrency level, which describe the amount of time that a client spends at the resource and the number of independent entities that can provide the service (e.g., number of threads in the pool or number of CPU cores), respectively; ii) the behavior of clients in terms of their operational profile, i.e., how they traverse the resources. Some of these parameters can be assumed to be known. For instance, the number of CPU cores is available from the hardware specification (or from the virtual-machine settings in a virtualized environment); the number of worker threads is a configuration parameter in most servers. Other parameters are more difficult to identify: the service demands, which depend on the execution behavior of the program that requests access to a shared resource; and the routing matrix, which defines how clients (probabilistically) move between queuing stations. In our approach, the input is the set of shared resources and their concurrency levels. The objective is to discover the QN model, i.e., the topology of the network and the service demands. Obviously, the problem of learning a mathematical model from data is not new.
In the specific case of identifying the parameters of a QN, a substantial amount of research has gone into the problem of estimating service demands only ([49], see Section 5.3 for a more detailed account of related work). Instead, we are not aware of approaches that deal with the estimation of both the service demands and the topology. This setting is a rather difficult one from a mathematical viewpoint because, as will be formalized later, routing probabilities and service demands appear as multiplicative factors in the dynamical equations that describe the evolution of a QN [8]. Since learning a QN can be understood as fitting the parameters to match these equations by some form of optimization, using both routing probabilities and service demands as decision variables induces a nonlinear problem, which is very difficult to handle in general. An additional problem besides nonlinearity is that of scalability. This is due to the fact that the exact dynamical equations of a QN incur the well-known state explosion problem, because the number of discrete states to keep track of grows combinatorially with the number of clients and queuing stations.
Learning method: recurrent neural networks.
To cope with both issues, we propose a learning method based on recurrent neural networks (RNNs) because of their ability to fit nonlinear systems [41]. In particular, we develop a new RNN architecture which encodes the QN dynamics in an interpretable fashion, i.e., by associating the weights of the RNN with QN parameters such as concurrency levels, routing probabilities, and service rates. A key instrument is the use of a compact system of approximate (but still nonlinear) equations of the QN dynamics instead of the combinatorially large, but exact, original system of equations. Such an approximation, called fluid or mean-field, consists of only one ordinary differential equation (ODE) for each station. It describes the time evolution of the queue length, i.e., the number of clients contending for that resource. In practice, the fluid approximation provides an estimate of the average queue length of the underlying stochastic process. The QN approximation procedure is based on a fundamental result by Kurtz [37] and is well known in the literature, e.g., [9]. In the field of software performance, it has been used for the analysis of variability-intensive software systems [34, 35] and for model-based runtime software adaptation using online optimization [29] or a satisfiability-modulo-theories approach [28]. This formulation has also been recently adopted for learning, but for service demands only [27], thus casting the problem into a (considerably) simpler quadratic programming one. The connection between RNNs and ODEs is not new in the literature. In [44], the authors have shown that recurrent neural networks can be thought of as a discretization of continuous dynamical systems while, in [13], a specialized training algorithm for ODEs has recently been proposed.
However, despite the proliferation of works along this research direction, there is still no clear understanding of how to employ such artificial intelligence/machine learning techniques for supporting performance engineering tasks such as modeling, estimation, and optimization [38]. The main technical contribution of this paper is to show that there is a direct association between the structure of the QN fluid approximation and standard activation functions and layers of an RNN. To the best of our knowledge, this is the first approach that formally unifies the expressiveness of analytical performance models with the learning capability of machine learning, contributing to positively answering the question whether “AI will be at the core of performance engineering” [38]. The RNN is trained using time series of measured queue lengths at each service station. Its learned weights can be interpreted back as a QN with learned parameters, which can be used for predictive purposes. It is worth remarking that, in principle, one could learn a QN model by relying on a standard, black-box
RNN architecture by treating all the QN parameters (i.e., initial population, service demands, number of servers, and routing probabilities) as input features of the learning algorithm. Unfortunately, this straightforward approach would require a considerable amount of input traces, since the learning algorithm could not exploit the structural information about the problem. For instance, it would not be possible to perform accurate what-if analyses by varying the value of a parameter if the network had not been trained with input configurations covering at least some variations of that parameter. Moreover, in such a setting, it would be unclear even which weights must be altered, and how, to reflect the changes in the model. Instead, here we report on the effectiveness and the generalizing power of our method by considering both synthetic benchmarks on randomly generated QNs, as well as a real web application deployed according to the load-balancing architectural style. In both cases, we evaluate the predictive power of the learned model in matching the transient as well as steady-state dynamics of unseen configurations (i.e., obtained by varying the system workload, number of servers, and routing probabilities), reporting prediction errors of less than 10% across a validation set of 2000 instances.
Paper organization.
We provide some background about QNs in Section 2. The learning methodology is presented in Section 3, which discusses how to encode a time-discretized version of the fluid approximation into an RNN whose weights represent the model parameters to identify. Section 4 presents the numerical evaluation on both the synthetic benchmarks and the real case study, providing implementation details on the RNN and on the benchmark application used. Section 5 discusses further related work. Section 6 concludes.
To make the paper self-contained, we present some background on QNs with the objective of motivating the fluid approximation as a deterministic estimator of average queue lengths, which will be used for the RNN encoding.
We assume closed QNs, where clients keep circulating between queuing stations. A closed QN is formally defined by the following:
• N: the number of clients in the network;
• M: the number of queuing stations;
• s = (s_1, . . . , s_M): the vector of concurrency levels, where s_i gives the number of independent servers at station i, with 1 ≤ i ≤ M;
• μ = (μ_1, . . . , μ_M): the vector of service rates, i.e., 1/μ_i > 0 is the mean service demand at station i, with 1 ≤ i ≤ M;
• P = (P_{i,j})_{1 ≤ i,j ≤ M}: the routing probability matrix, where each element P_{i,j} ≥ 0 gives the probability that a client goes to station j upon completion at station i;
• x(0) = (x_1(0), . . . , x_M(0)): the initial condition, i.e., x_i(0) is the number of clients at station i at time 0.
In a closed QN, the routing probability matrix is a stochastic matrix, meaning that each of its rows sums up to one.

Example 1.
In the remainder of this section we use the QN in Fig. 1 as a running example. Depicted using the customary graphical representation, it represents a simple load-balancing system with M = 3 stations.

Figure 1: Load balancing example

Requests from the reference station 1 are routed to the two compute server stations 2 and 3 with probabilities P_{1,2} and P_{1,3}, respectively. Upon service, a client returns back to station 1. An instantiation of this abstract model is discussed in Section 4.

Markov chain semantics
The stochastic behavior of a QN is represented by a continuous-time Markov chain (CTMC) that tracks the probability of the QN having a given configuration of the queue lengths at each station. Informally, the CTMC is constructed as follows. A discrete CTMC state is a vector of queue lengths X = (X_1, . . . , X_M). At each station i, if the number of clients X_i is less than or equal to the number of servers s_i, then these proceed in parallel, each at rate μ_i. Instead, if X_i > s_i, the number of clients that are queueing for service is X_i − s_i. When a client is serviced at station i, with probability P_{i,j} it goes to station j to receive further service. This can be formalized by considering the well-known model of Markov population processes, whereby the CTMC transitions are described by jump vectors and associated transition functions from a generic state X [9]. We define the jump vectors h^(ij) to be the state updates due to clients moving to station j upon service at i, and q(X, X + h^(ij)) the transition rate from state X to state X + h^(ij), where X + h^(ij) = (X_1, . . . , X_i − 1, . . . , X_j + 1, . . . , X_M). In other words, with the jump vector h^(ij) the number of clients at station i is decreased by one and, correspondingly, the number of clients at station j is increased by one. Then, the CTMC is defined by:

q(X, X + h^(ij)) = P_{i,j} μ_i min(X_i, s_i),   i, j = 1, . . . , M.   (1)

Example 2.
In our running example, we have the jump vectors

h^(12) = (−1, +1, 0)    h^(13) = (−1, 0, +1)
h^(21) = (+1, −1, 0)    h^(31) = (+1, 0, −1)

where the first row describes the updates due to a client being assigned to each compute server, and the second row defines the client returning to the load balancer after service. For completeness we give the corresponding transitions:

q(X, X + h^(12)) = P_{1,2} μ_1 min(X_1, s_1)
q(X, X + h^(13)) = P_{1,3} μ_1 min(X_1, s_1)
q(X, X + h^(21)) = P_{2,1} μ_2 min(X_2, s_2)
q(X, X + h^(31)) = P_{3,1} μ_3 min(X_3, s_3)

It is well known that a CTMC is completely characterized by the transitions (1) together with the initial condition x(0). This formulation in terms of jump vectors allows for the efficient stochastic simulation of CTMCs [22]; indeed, we will use this technique to generate sample paths for the evaluation of our learning method on synthetic benchmarks in Section 4. For our purposes, the main limitation of this CTMC representation is that the exact equations to analyze the probability distribution grow combinatorially with the number of clients and stations, as one needs to keep track of each possible discrete configuration of the queue lengths.

The fluid approximation of a QN consists of an ODE system whose size is equal to the number of stations M, independently of the number of clients in the system. Informally, the ODE system can be built by considering the average impact that each transition has on the queue length at each station k. This is obtained by multiplying the k-th coordinate of each jump vector, h^(ij)_k, by the function associated with the corresponding transition rate q(X, X + h^(ij)). Denoting by x = (x_1, . . . , x_M) the variables of the fluid approximation, the ODE system is given by:

dx_k(t)/dt = Σ_{h^(ij)} h^(ij)_k q(x(t), x(t) + h^(ij)),   k = 1, . . . , M.   (2)

The solution for each coordinate, x_k(t), can be interpreted as an approximation of the average queue length at time t as given by the CTMC semantics [9]. The theorems in [37] provide a result of asymptotic exactness of the fluid approximation, in the sense that the ODE solution and the expectation of the stochastic process become indistinguishable when the number of clients and servers is large enough. Using (1), the equations can be written as follows:

dx_k(t)/dt = Σ_{i≠k} P_{i,k} μ_i min(x_i(t), s_i) + (P_{k,k} − 1) μ_k min(x_k(t), s_k)   (3)

where we have singled out the rates due to self loops P_{k,k}.

Example 3.
The fluid approximation for the load balancer is:

dx_1(t)/dt = −μ_1 min(x_1(t), s_1) + μ_2 min(x_2(t), s_2) + μ_3 min(x_3(t), s_3)
dx_2(t)/dt = −μ_2 min(x_2(t), s_2) + P_{1,2} μ_1 min(x_1(t), s_1)
dx_3(t)/dt = −μ_3 min(x_3(t), s_3) + P_{1,3} μ_1 min(x_1(t), s_1)

Based on the solution to Eq. (3), which directly provides queue-length estimates, one can derive other important performance metrics such as throughput, utilization, and response time. See, for instance, [57, 56] for a study of these results in a process algebra [25], and [58, 54, 55] for applications to layered queueing networks [19]. In the remainder of this paper, we shall focus on QNs that do not have self loops (i.e., a client served at a queue cannot re-enter the same queue immediately), i.e., P_{i,i} = 0 for 1 ≤ i ≤ M. This is because we can show that, in the fluid approximation, for each k, P_{k,k} can be chosen freely as long as we adjust each P_{k,i} with i ≠ k and μ_k. More formally, we can prove the following theorem.

Theorem 2.1.
For each π ∈ [0, 1]^M, stochastic matrix P, and μ ≥ 0 for which (3) holds, there exist P̂ and μ̂ such that, for each k:
(a) dx_k(t)/dt = Σ_{i≠k} P̂_{i,k} μ̂_i min(x_i(t), s_i) + (P̂_{k,k} − 1) μ̂_k min(x_k(t), s_k);
(b) P̂_{k,k} = π_k;
(c) Σ_i P̂_{k,i} = 1;
(d) ∀i, P̂_{k,i} ≥ 0;
(e) μ̂_k ≥ 0.

Proof. Available in Appendix A.

Thus, using the fluid approximation, for each network with self loops there is another one without them which cannot be distinguished from it. To identify a specific network among them, we need to know the self-loop values.

Figure 2: RNN encoding
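To make the fluid dynamics concrete, the load-balancer ODEs of Example 3 can be integrated numerically with a simple forward-Euler scheme. The sketch below follows Eq. (3); the parameter values (service rates, concurrency levels, routing probabilities, initial population) are illustrative placeholders, not the ones used in the paper's experiments.

```python
# Forward-Euler integration of the fluid approximation (Eq. 3)
# for the 3-station load-balancing example of Figure 1.
# All parameter values below are hypothetical.

def fluid_rhs(x, mu, s, P):
    """Right-hand side of Eq. (3): dx_k/dt for every station k."""
    M = len(x)
    out = []
    for k in range(M):
        rate = (P[k][k] - 1.0) * mu[k] * min(x[k], s[k])
        for i in range(M):
            if i != k:
                rate += P[i][k] * mu[i] * min(x[i], s[i])
        out.append(rate)
    return out

def euler_simulate(x0, mu, s, P, dt, steps):
    """Iterate x(t + dt) = x(t) + dt * f(x(t)) and return the trajectory."""
    traj = [list(x0)]
    x = list(x0)
    for _ in range(steps):
        dx = fluid_rhs(x, mu, s, P)
        x = [xi + dt * di for xi, di in zip(x, dx)]
        traj.append(list(x))
    return traj

if __name__ == "__main__":
    mu = [1.0, 2.0, 2.0]          # service rates (hypothetical)
    s = [100, 5, 5]               # concurrency levels (hypothetical)
    P = [[0.0, 0.5, 0.5],         # station 1 splits traffic evenly;
         [1.0, 0.0, 0.0],         # stations 2 and 3 return to station 1
         [1.0, 0.0, 0.0]]
    x0 = [30.0, 0.0, 0.0]         # all clients start at the load balancer
    traj = euler_simulate(x0, mu, s, P, dt=0.01, steps=1000)
    # In a closed QN the total population is conserved along the trajectory.
    print(round(sum(traj[-1]), 6))
```

Note that, since P is row-stochastic, the coordinates of `fluid_rhs` sum to zero, so the total population is conserved exactly by the Euler iteration, mirroring the closed-network invariant.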
As apparent in both Equations (1) and (3), a QN features routing probabilities and service demands as multiplicative factors in the defining dynamical equations. If we wish to learn a QN by assuming that both quantities are unknown, we are faced with a nonlinear (i.e., polynomial) optimization problem. Here we propose an RNN in order to estimate these parameters. We develop an RNN architecture which encodes the QN dynamics in an interpretable fashion, i.e., by associating the weights of the RNN with QN parameters such as concurrency levels, routing probabilities, and service rates.
We first obtain a time-discrete representation of the fluid approximation such that each time step is associated with a layer of the RNN. In matrix notation, for an arbitrary QN the fluid approximation is given by:

dx(t)/dt = −μ ⊙ min(x(t), s) + P^T (μ ⊙ min(x(t), s))

where x(t) is the M-dimensional vector of queue lengths at time t. We consider a finite-step approximation of the above ODE for a small Δt, obtaining:

x(t + Δt) = x(t) + Δt · (−μ ⊙ min(x(t), s) + P^T (μ ⊙ min(x(t), s)))

Finally, this can be rewritten as

x(t + Δt) = x(t) + Δt · u(t) · (μ ⊙ (P − I))   (4)

where u(t) = min(x(t), s), I is the identity matrix of appropriate dimension, and ⊙ is the operator such that if C = a ⊙ B, then C_{i,j} = a_i · B_{i,j}. The discretization (4) of the fluid approximation of the QN admits a direct encoding as an RNN. It consists of an M-dimensional input layer x̂_0 that corresponds to the initial condition of the QN. The RNN has H − 1 cells, with the h-th cell computing the estimate of the queue length at time hΔt, denoted by x̂_h (see Fig. 2). That is, the h-th cell computes the quantity x̂_h = x̂_{h−1} + Δt · û_{h−1} · (μ ⊙ (P − I)), where, according to (4), û_{h−1} estimates u((h − 1)Δt) as û_{h−1} = min(s, x̂_{h−1}). With this setup, we will have to learn the matrix P (made of M(M − 1) weights, since the diagonal is empty) and the vector μ (made of M weights). The main goal of this methodology is to learn the actual parameters of the network. Therefore, we enforce some feasibility constraints, namely we require that the rows of P sum up to 1 (such that P is a stochastic matrix), the absence of self loops, and μ ≥ 0 (such that the speed of the stations is non-negative).
The non-negativity of the weights is enforced in the framework by clamping the candidate values within the range [0, ∞); stochasticity of P is guaranteed by dividing each weight by the sum of the weights in the corresponding row; the absence of self loops is achieved by setting P_{i,i} = 0 as a constant for all i. This approach puts our work in the explainable machine learning research area [45], and it allows us to link each learned parameter with its role in the system. This link allows us to predict the behavior of the system under new conditions (what-if analysis). In contrast, a traditional approach to neural networks would not impose a model and constraints on the parameters, hence giving a read-only model which cannot be clearly interpreted. Indeed, without a direct association between parameters and physical quantities, we cannot study the system under new conditions without learning a new model.

Figure 3: Numerical evaluation of the running example (see Figure 1): comparison between simulations of the queue lengths using the RNN-learned QN (marked lines) and the ground-truth QN (straight lines) in two cases: a) a trace used for training, with initial population x(0) = (26, ·, ·); b) what-if analysis under an unseen initial population vector x(0) = (49, ·, ·) and unseen concurrency levels, causing a significant change in the dynamics.

Example 4.
The RNN encoding for the h-th cell (i.e., the queue-length transient evolution at time hΔt) of our running example is:

û_{h−1,1} = min(s_1, x̂_{h−1,1})
û_{h−1,2} = min(s_2, x̂_{h−1,2})
û_{h−1,3} = min(s_3, x̂_{h−1,3})
x̂_{h,1} = x̂_{h−1,1} + Δt(−μ_1 û_{h−1,1} + μ_2 P_{2,1} û_{h−1,2} + μ_3 P_{3,1} û_{h−1,3})
x̂_{h,2} = x̂_{h−1,2} + Δt(μ_1 P_{1,2} û_{h−1,1} − μ_2 û_{h−1,2} + μ_3 P_{3,2} û_{h−1,3})
x̂_{h,3} = x̂_{h−1,3} + Δt(μ_1 P_{1,3} û_{h−1,1} + μ_2 P_{2,3} û_{h−1,2} − μ_3 û_{h−1,3})

The RNN is trained over a set of traces. Each trace is made of H vectors, indicated as x̃_0, x̃_1, ..., x̃_{H−1} ∈ R^M_{≥0}. The i-th component x̃_{h,i} of each vector x̃_h represents a sample of the queue length of station i at time h · Δt. Since, as discussed, the fluid approximation can be interpreted as an estimator of the average queue lengths, each trace used in the learning process consists of measurements averaged over a number of independent executions started with the same initial condition; different traces rely on different initial conditions to exercise distinct behaviors of the system. The learning error function, denoted by err, aims to minimize the difference between the queue lengths estimated by the RNN, x̂_h, and the measurements x̃_h. It is defined as follows:

err = max_{h=1,...,H−1} ‖x̃_h − x̂_h‖_1 / (2N)   (5)

where ‖·‖_1 indicates the L1 norm. Essentially, it is a maximum relative error. Indeed, since we are studying closed QNs with a fixed number N of circulating clients, the quantity ‖x̃_h − x̂_h‖_1 / (2N) intuitively measures the proportion of clients (relative to their total number N) that are “misplaced” (i.e., allocated to a different station) at each time step. Since a misplaced client is counted twice (once as missing from a queue and once as extra in another queue), we divide the norm by 2. The overall error err then computes the maximum such misplacement across all time steps.
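The error of Eq. (5) is straightforward to compute from a measured trace and a predicted one; a minimal sketch, using hypothetical toy vectors:

```python
def prediction_error(measured, predicted, N):
    """Maximum relative "misplacement" error of Eq. (5).

    measured, predicted: lists of M-dimensional queue-length vectors,
    indexed by time step h = 0, ..., H-1; N is the (fixed) total number
    of clients in the closed network.
    """
    err = 0.0
    # Eq. (5) takes the maximum over h = 1, ..., H-1, so skip index 0.
    for x_meas, x_pred in zip(measured[1:], predicted[1:]):
        l1 = sum(abs(a - b) for a, b in zip(x_meas, x_pred))
        err = max(err, l1 / (2.0 * N))
    return err

# Toy example: N = 10 clients, one client misplaced at the second step,
# so the L1 norm is 2 and the error is 2 / (2 * 10) = 0.1.
measured = [[10.0, 0.0], [6.0, 4.0]]
predicted = [[10.0, 0.0], [5.0, 5.0]]
print(prediction_error(measured, predicted, N=10))  # 0.1
```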
Figure 4: a) Prediction error of the what-if instances where each randomly generated QN is tested with 100 unseen initial population vectors, distinguished by color with respect to the network size M. The x-axis reports the total number of clients N in the network, scatter-plotted against the prediction error defined in Eq. (5). b) Statistics on the prediction error. In each box-plot, the line inside the box represents the median error, the upper and lower sides of the box represent the 25th and 75th percentiles, the upper and lower limits of the dashed line represent the extreme points not considered outliers, and the outliers are depicted in red (12 with M = 5, 4 with M = 10).
Figure 5: Comparison between the ground-truth queue lengths and those predicted by the RNN-learned QN on the test case that induced the maximum prediction error among the what-ifs over population (error: 9.41%). The error was attained on a randomly generated QN with M = 5 stations, using the unseen initial population vector (86, 111, 13, 15, 28). The straight line represents the ground-truth dynamics of the QN model; the dashed line represents the evolution of the RNN-learned QN.

Example 5.
Let us consider our running example, fixing the ground-truth parameters as follows. During the learning phase, we studied the system with s = (1000, ·, ·) and predicted the behavior with s = (1000, ·, ·), while we kept P and μ unchanged at

P = [ 0     0.49  0.51
      1     0     0
      1     0     0 ]        μ = (1, ·, ·)

Using the experimental setup that will be discussed in the next section, we generated the training dataset by collecting 50 traces, one for each randomly generated initial population vector. Each trace was the average of 500 independent simulations recording the transient evolution of the queue lengths. Figure 3 reports the comparison between the queue lengths of the RNN-learned QN and the ground-truth one, showing very good accuracy on an instance of the training set (Figure 3a) as well as high predictive power of the model under unseen initial populations and concurrency levels, which cause a bottleneck shift and considerably longer transient dynamics (Figure 3b).
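The ground-truth traces above are averages of stochastic simulations. A minimal sketch of Gillespie's algorithm [22] applied to the CTMC of Section 2, with transition rates as in Eq. (1), could look as follows; the parameter values are again illustrative placeholders rather than the experimental ones.

```python
import random

def gillespie(x0, mu, s, P, t_end, rng):
    """One CTMC sample path via Gillespie's algorithm, using the
    transition rates of Eq. (1): q = P[i][j] * mu[i] * min(X_i, s_i).
    Returns the state at time t_end."""
    M = len(x0)
    x = list(x0)
    t = 0.0
    while True:
        # Enumerate the enabled transitions (i -> j) with their rates.
        moves, rates = [], []
        for i in range(M):
            for j in range(M):
                r = P[i][j] * mu[i] * min(x[i], s[i])
                if r > 0.0:
                    moves.append((i, j))
                    rates.append(r)
        total = sum(rates)
        if total == 0.0:          # no transition enabled: absorbed
            return x
        t += rng.expovariate(total)          # exponential holding time
        if t > t_end:
            return x
        i, j = rng.choices(moves, weights=rates)[0]
        x[i] -= 1                 # apply the jump vector h^(ij)
        x[j] += 1

rng = random.Random(0)
mu = [1.0, 2.0, 2.0]              # hypothetical parameters
s = [1000, 5, 5]
P = [[0.0, 0.49, 0.51], [1.0, 0.0, 0.0], [1.0, 0.0, 0.0]]
x_end = gillespie([30, 0, 0], mu, s, P, t_end=10.0, rng=rng)
print(sum(x_end))  # the closed network conserves the 30 clients
```

Averaging many such independent runs, all started from the same x(0), yields one training trace; sampling the state on the grid hΔt (instead of only at t_end) gives the per-step measurements x̃_h.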
In this section we evaluate the effectiveness of the proposed approach by considering both synthetic benchmarks and a real case study. For all our tests, the RNNs were implemented using the Keras framework [14] with the TensorFlow backend [1]. Learning was performed on a machine running the 4.15.0-55-generic Linux kernel on an Intel(R) Xeon(R) CPU E7-4830 v4 at 2.00GHz with 500 GB of RAM.
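The feasibility constraints of Section 3 (non-negative rates, empty diagonal, row-stochastic P) can be enforced by a simple projection of the candidate weights. The sketch below is framework-independent and does not reflect the exact Keras implementation; the zero-row fallback is our own assumption for robustness.

```python
def project_parameters(P, mu):
    """Project candidate RNN weights onto the feasible QN parameter set:
    clamp mu to [0, inf), zero the diagonal of P (no self loops), clamp
    P's entries to [0, inf), and renormalize each row to sum to 1."""
    mu_proj = [max(0.0, m) for m in mu]
    P_proj = []
    for i, row in enumerate(P):
        row = [max(0.0, p) if j != i else 0.0 for j, p in enumerate(row)]
        total = sum(row)
        if total == 0.0:
            # Degenerate all-zero row: spread mass uniformly over the
            # off-diagonal entries (an arbitrary but valid fallback).
            n_off = len(row) - 1
            row = [0.0 if j == i else 1.0 / n_off for j in range(len(row))]
        else:
            row = [p / total for p in row]
        P_proj.append(row)
    return P_proj, mu_proj

# Example: raw gradient-descent iterates may violate the constraints.
P_raw = [[0.2, 0.5, 0.9], [1.3, -0.1, 0.4], [0.6, 0.6, 0.0]]
mu_raw = [1.2, -0.3, 2.5]
P_ok, mu_ok = project_parameters(P_raw, mu_raw)
print([round(sum(r), 6) for r in P_ok])  # each row sums to 1
print(mu_ok)                             # [1.2, 0.0, 2.5]
```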
Figure 6: a) Prediction error of the what-if instances obtained by changing the concurrency level of the most utilized station in each of the randomly generated QNs. b) Statistics on the prediction error.
For our synthetic tests we considered randomly generated networks of size M = 5 and M = 10. For each case, we generated 5 QNs by uniformly sampling at random the entries of the routing probability matrices, the service rates, and the concurrency levels. For the training of each QN, we generated 100 traces, each being the average over 500 independent stochastic simulations (generated using Gillespie's algorithm [22]). Each trace exercised the model with a distinct initial population vector such that the number of clients at each station was drawn uniformly at random; as a result, the total number of clients in the network varies across traces. For each network, learning was performed by equally splitting the 100 traces between training and validation, iterating Adam [33] until the error computed on the validation set did not improve in the last 50 iterations. On average, learning took 74 minutes and 86 minutes for the cases M = 5 and M = 10, respectively.

Discretization methodology
Two important parameters are the length of the trace, i.e., the time horizon T of the stochastic simulations, and the choice of the discretization interval Δt; these are related to the number of cells H in the RNN by T = (H − 1)Δt. Longer time horizons lead to larger simulation (hence, training) runtimes. Too short traces might not expose the full dynamics of the system. Further, following basic facts about ODE discretization [4], the interval Δt should be chosen small enough that no important dynamics is lost across two successive time steps; thus, longer time horizons might need more time steps, hence more cells in the RNN. It is worth remarking that these considerations are model-specific. That is, the choice of such hyper-parameters must be done carefully depending on the specific QN under study. For the synthetic case studies, we set T = 10 and Δt = 0.01, hence H = 1000.

Predictive power
We evaluate the predictive power of the learned QNs by performing two distinct “what-if” analyses under unseen configurations, changing the client populations and the concurrency levels of the stations, respectively.
What-if analysis over client population
We tested each of the randomly generated QNs with 100 new initial population vectors that were not used in the learning phase. We compared the averages (over 500 stochastic simulations) of the ground-truth queue-length dynamics with those produced by the RNN-learned QN under those unseen initial conditions. Figure 4a shows a scatter plot of the prediction error with respect to the total number of clients circulating in the system, reporting errors of less than 10% in all cases. The box-plots in Figure 4b show that there is no statistically significant difference between the errors for the different-sized models. Figure 5 compares the predicted and ground-truth queue lengths for the instance with the maximum prediction error, showing a very good generalizing power for the queue-length dynamics at all stations.
Figure 7: Comparison between the ground-truth queue lengths and those predicted by the RNN-learned QN on the test case that induced the maximum prediction error (4.19%), before and after the what-if change of server concurrency. The error was attained on a randomly generated QN with M = 5 stations. The cyan line denotes the averages under the original conditions (before the what-if change) with the ground-truth QN; the green line gives the predictions of the RNN-learned QN with the original values; the red line shows ground-truth simulations with the unseen number of servers for the bottleneck station (increased from 17 to 37); the blue line shows the averages after the what-if change for the RNN-learned QN.

[Figure 8: side-by-side diagram of the real system architecture (workload generator W with N processes, load balancer LB with 1 process, and web servers C1, C2, C3 with 10, 5, and 6 processes, respectively) and the corresponding QN model (stations with unknown parameters ⟨µ1, s = ∞⟩, ⟨µ2, s = 10⟩, ⟨µ3, s = 5⟩, and ⟨µ4, s = 6⟩).]

Figure 8: Case study architecture.
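The prediction errors quoted in the figure captions are computed via Equation (5) of the paper; since that equation is not reproduced in this excerpt, the sketch below uses a hypothetical but representative stand-in (mean absolute queue-length discrepancy over stations and time steps, normalized by the total population and expressed as a percentage):

```python
import numpy as np

def prediction_error(predicted, ground_truth, population):
    """Hypothetical stand-in for the paper's Equation (5): average absolute
    queue-length discrepancy across all stations and time steps,
    normalized by the total client population, as a percentage."""
    predicted = np.asarray(predicted, dtype=float)
    ground_truth = np.asarray(ground_truth, dtype=float)
    return 100.0 * np.abs(predicted - ground_truth).mean() / population

# Two stations observed at two time steps, 15 clients in total.
pred = [[10.0, 5.0], [9.0, 6.0]]
true = [[10.5, 4.5], [9.5, 5.5]]
err = prediction_error(pred, true, population=15)   # 100 * 0.5 / 15 ≈ 3.33%
```

Normalizing by the population rather than by the per-station queue lengths avoids blowing up the metric at stations whose queues are close to empty.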
What-if over concurrency levels.
To validate the predictive power under varying concurrency levels, for each generated QN we found the station with the highest ratio between the steady-state queue length and its number of servers (the bottleneck), and added servers to this station in steps of 20 until it was no longer the bottleneck. Then we compared the dynamics of the ground-truth model (i.e., simulated with the original P and µ but with the new server concurrency levels) against those obtained by simulating the learned model with the new server concurrency levels, using the notion of prediction error in Equation (5). Figure 6a shows the results of this what-if study, reporting small prediction errors across all instances. Also in this case, there is no statistically significant difference in the error statistics depending on the network size M (see Figure 6b). Figure 7 plots the comparison of the queue-length dynamics of the what-if instance (i.e., with an unseen server concurrency level) that reported the maximum prediction error (4.19%) against the original ones. We can appreciate that the unseen concurrency levels do change the QN behavior dramatically, effectively switching the bottleneck from station 3 to station 2. This result supports the combination of machine learning and white-box performance models by showing that, once learned, the QN can be used for evaluating the behavior of the model under execution scenarios for which it has not been trained.

The benchmark used in this evaluation is based on an in-house developed web application that serves user requests with an input-dependent load. We deployed the target application as a NodeJs [53] load-balancing system with three replicas. Figure 8 (left) depicts the system architecture. Component W represents the reference station, where clients enter the system by issuing requests to the load balancer LB, which redistributes them across the web servers uniformly.
In the real system, such uniform assignment is achieved by fixing equal weights for the target nodes. Components C1, C2, and C3 represent the three web-server instances devoted to the actual processing of user requests (e.g., producing an HTML page).

[Figure 9, panels (a)–(d): queue-length dynamics of the RNN-learned QN vs. the real system for k = 2 (err = 6.46%), k = 3 (err = 5.03%), k = 4 (err = 6.45%), and k = 5 (err = 9.05%).]

Figure 9: Comparison between the real system dynamics (i.e., marked lines) and the RNN-learned QN (i.e., straight lines) in what-if cases over increasing circulating populations N, given by N = 26k.

Each node in Figure 8 is annotated with its concurrency level (i.e., the number of available processes), which we considered a fixed parameter. Specifically, we implemented W as a multi-threaded Python program. Each thread runs an independent concurrent user (i.e., one of the N processes) that iteratively accesses the system, sleeping for an exponentially distributed delay between subsequent requests; LB is a single-threaded NodeJs web server which acts as a randomized load balancer. Finally, C1, C2, and C3 are multi-threaded NodeJs Clusters (see https://nodejs.org/api/cluster.html) whose load is generated by sleeping for an exponentially distributed delay (i.e., the average value is given as an input parameter of each cluster).
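The client behavior described above (N independent users that issue a request and then sleep for an exponentially distributed think time) can be sketched as follows; the session length, think rate, and the stub standing in for the real HTTP call are illustrative, not the actual experiment's values:

```python
import random
import threading
import time

def closed_loop_client(session_time, think_rate, issue_request, stop):
    """One of the N concurrent users of the reference station W:
    repeatedly issue a (blocking) request, then sleep for an
    exponentially distributed think time with mean 1/think_rate."""
    deadline = time.time() + session_time
    while time.time() < deadline and not stop.is_set():
        issue_request()                              # blocking call to LB
        time.sleep(random.expovariate(think_rate))   # exponential think time

# Illustrative run: 5 users, a thread-safe counter stub in place of HTTP.
counter_lock, completed = threading.Lock(), [0]
def fake_request():
    with counter_lock:
        completed[0] += 1

stop = threading.Event()
threads = [threading.Thread(target=closed_loop_client,
                            args=(0.5, 50.0, fake_request, stop))
           for _ in range(5)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Because each user blocks on its own request and think time, the population circulating in the system is fixed at N, which matches the closed-QN abstraction used for the model.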
We remark that although we were able to roughly fix the distribution of the service demands, their exact shape is still unknown, since it is influenced by subtler factors that are hidden from developers (e.g., the internal behavior of the web server, communication aspects). Moreover, in order to evaluate our learning methodology in an interesting scenario, we deployed the three replicas of the system with different parallelism levels and different service rates. Similarly to [61], we collected the queue-length traces used as input to the learning process (see Section 3) by parsing the access logs generated by each component of the system. However, other monitoring solutions could be used, based for instance on recording the TCP backlog [29]. With this set-up, we were able to sample data with a sub-second measurement step ∆t, which turned out to be sufficient for observing the transient dynamics of each component without altering the application behavior. The replication package for this evaluation is publicly available at https://zenodo.org/record/3679251.

Model Learning:
We built the training dataset as a collection of queue-length traces produced by the target application under 50 different initial population vectors, where each station had a number of clients drawn uniformly at random between 0 and 30. For each such initial population vector, the trace consisted of the average queue-length dynamics over 500 independent executions. The target model of the learning process is reported in the right side of Figure 8. In particular, components C1, C2, and C3 are modeled by queuing stations M2, M3, and M4, while both the workload generator W and the load balancer LB are abstracted by the same station M1, since the delay introduced by LB is negligible with respect to the other components of the network. All the parameters of the resulting QN were considered parameters to be learned by the RNN. Similarly to the synthetic case, the collected traces were split into two halves for training and validation, respectively. We used Adam [33] as the learning algorithm with a fixed learning rate, iterating until the error computed on the validation set had not improved appreciably in the last 50 iterations. With this, the system parameters were learned in 27 minutes on average, with a validation error of 3.89%.

What-if analysis:
In the following we evaluate the predictive power of the RNN-learned QN under an unseen number of clients, concurrency levels, and routing probabilities. Differently from the synthetic case, here we emulate a concrete usage scenario in which an initially hidden performance bottleneck is discovered and removed relying only on the insights given by the learned model. To do so, we exercised both the QN model and the real system under an increasing number of clients, scaling an initial population of 26 circulating clients by a factor k = 2, ..., 5 (here each prediction was averaged over 300 simulation runs instead of 500, since fewer runs are needed for evaluating the what-if analysis). Figure 9 reports the numerical results of this evaluation, showing a trend that induces a saturation condition in station M3. Overall, the prediction error of the RNN is less than 10% across all instances. In Figure 10 we report two different strategies that can be used to remove the bottleneck: we re-evaluated both the learned model and the real system starting from the case k = 4 (see Figure 9c), varying either the number of servers or the load-balancing weights/routing probabilities. Figure 10a shows the dynamics of the system when the number of servers of M3 is increased from 5 to 8; Figure 10b reports the what-if scenario in which we change the load distribution strategy from a uniform probability distribution to a non-uniform one over stations M2, M3, and M4.

[Figure 10, panels (a) and (b): queue-length dynamics of the RNN-learned QN vs. the real system; (a) err = 5.98%, (b) err = 6.10%.]
Figure 10: a) What-if scenario changing the concurrency level of M3 from 5 to 8. b) What-if scenario changing the load-balancing strategy from a uniform probability distribution to a non-uniform one over stations M2, M3, and M4. Both scenarios have been evaluated on the real case study and the RNN-learned QN with the initial population vector with k = 4 from Figure 9c.

Consistently with intuition, both what-if instances show a lighter pressure (i.e., a smaller queue length) at M3. Furthermore, both situations are well predicted by the RNN, yielding an accuracy error of ca. 6% with respect to the real system dynamics.

In this section we relate our approach to the following lines of research: performance prediction from programs, generation of performance models from programs, and estimation of parameters in QNs.
A line of work focuses on the derivation of performance predictions from code analysis.
PerfPlotter uses program analysis (specifically, probabilistic symbolic execution [21]) to generate a performance distribution, i.e., the probability distribution function of a performance metric such as response time [12]. Thus, the result of the overall analysis is a quantitative model, but it is not predictive. Furthermore, the approach applies to single-threaded applications, hence important performance-influencing sources such as thread contention cannot be captured.

Other related approaches predict performance using black-box methods. They are particularly relevant for variability-intensive systems, where they relate configuration settings in a software system with their performance impact [48, 47]. Machine-learning techniques have also been used in this case to build the predictive model [24, 47, 59, 30]. For instance, in [48] the system model is assumed to be a linear combination of binary variables (e.g., tree-structured models), each of them denoting the presence or absence of a feature. Then the performance model is computed by means of linear regression over pairs of configurations and measured performance indices. The influence of possible feature interactions is embedded in the model by introducing fresh variables so as to preserve the linear structure of the model. As discussed in [59], these black-box approaches can be seen as complementary to ours, which can provide a reliable mathematical abstraction by which performance can be explicitly associated with software components, thus increasing the explanatory power of the prediction.
While model-driven approaches to software performance have been researched quite intensively [15], program-driven generation of performance models has been less explored, and has been concerned with specific kinds of applications. Indeed, the early approach by Hrischuk et al. is concerned with the generation of software performance models (specifically, layered queuing networks [19]) from a class of distributed applications whose components communicate solely by remote procedure calls [26]. Brosig et al. derive a component-based performance model from applications running on the Java EE platform [11, 10]. Tarvo and Reiss develop a technique for the extraction of discrete-event simulation models from a class of multi-threaded programs covering task-oriented applications, whereby the business logic consists in assigning a given workload (i.e., a task) to a number of worker threads from a pool [52]. Their use of a simulation model as opposed to an analytical model is justified by the difficulty in building the latter, especially to model such diverse performance-related phenomena as queuing effects, inter-thread synchronization, and hardware contention. This is indeed the limitation that we aim to overcome with our approach, by building the analytical model automatically from measurements.
Most of the literature concerning the estimation of QN parameters focuses on service demands. In particular, it considers the situation when the system is in the steady state, i.e., when a sufficiently large amount of time has passed such that its behavior does not depend on the initial conditions [49]. Mathematically, the assumption of a steady-state regime enables the leveraging of a wealth of analytical results for QNs [8]. Several estimation methods build on these results, using techniques such as linear regression [43], quadratic programming [27], non-linear optimization [40, 5], clustering regression [16], independent component analysis [46], pattern matching [17], Gibbs sampling [60, 51], and maximum likelihood [61]. The main advancement of our approach with respect to the state of the art is the ability to learn the whole model, i.e., both the service demands and the QN topology (via the routing probabilities). In addition, since it uses an ODE representation, it does not make assumptions about the stationarity of the system; indeed, we train our RNN using traces that include the transient dynamics. Notably, our approach uses the same QN model as the service-demand estimation method recently proposed in [27], which is also based on fluid approximation.

Another difference with practical implications regards the type of data used for the estimation. Approaches such as [39, 16, 46, 31, 17] require measurements of quantities that may be difficult to obtain. For example, utilization metrics may not be available to the user when there is no complete information about the underlying hardware stack, for instance in a virtualized system running on a Platform-as-a-Service environment. Instead, measuring queue-length samples only has been regarded as more advantageous [61, 27], since this information can often be obtained from application logs or by means of operating system calls.
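As an illustration of the last point, a queue-length trace can be reconstructed from per-request arrival and departure timestamps in an access log by counting the requests in flight at each sampling instant; a minimal sketch, with a hypothetical log representation as a list of (arrival, departure) pairs:

```python
def queue_length_trace(events, dt, horizon):
    """Reconstruct queue-length samples from (arrival, departure)
    timestamp pairs: the queue length at time t is the number of
    requests that have arrived but not yet departed at t."""
    samples = []
    t = 0.0
    while t <= horizon:
        samples.append(sum(1 for a, d in events if a <= t < d))
        t += dt
    return samples

# Three requests with overlapping lifetimes, sampled every 1 s.
events = [(0.0, 2.5), (1.0, 3.0), (1.5, 2.0)]
trace = queue_length_trace(events, dt=1.0, horizon=3.0)   # [1, 2, 2, 0]
```

Averaging such traces over repeated runs yields the mean queue-length dynamics that the training procedure consumes.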
We presented a novel methodology for learning queuing network (QN) models of software systems. The main novelty lies in the encoding of the QN as an explainable recurrent neural network whose inputs and weights are associated with standard queuing network inputs and parameters. We reported promising results on synthetic examples and on a real case study, where the maximum discrepancy between the dynamics predicted by the learned models and those computed through the ground truth is less than 10% when the system is evaluated under unseen configurations that are not included in the training set. We plan to extend our technique to capture more complex models and systems, such as mixed multi-class and layered QNs, and to explore other learning methodologies such as neural ODEs [13] and residual networks [63]. Moreover, in order to improve the accuracy of the learned models and to reduce the simulation time, we plan to investigate active learning techniques that enable an informed sampling of the initial conditions [32].

Acknowledgements:
This work has been partially supported by the PRIN project “SEDUCE” no. 2017TWRCNB.
References

[1] Abadi, M., Agarwal, A., Barham, P., et al. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. Software available from tensorflow.org.
[2] Aleti, A., Buhnova, B., Grunske, L., Koziolek, A., and Meedeniya, I. Software architecture optimization methods: A systematic literature review. IEEE Trans. Software Eng. 39, 5 (2013), 658–683.
[3] Arcelli, D., Cortellessa, V., Filieri, A., and Leva, A. Control theory for model-based performance-driven software adaptation. In QoSA (2015), pp. 11–20.
[4] Ascher, U. M., and Petzold, L. R. Computer Methods for Ordinary Differential Equations and Differential-Algebraic Equations. SIAM, 1998.
[5] Awad, M., and Menascé, D. A. Deriving parameters for open and closed QN models of operational systems through black box optimization. In ICPE (2017).
[6] Balsamo, S., Di Marco, A., Inverardi, P., and Simeoni, M. Model-based performance prediction in software development: A survey. IEEE Trans. Software Eng. 30, 5 (2004), 295–310.
[7] Balsamo, S., and Marzolla, M. Performance evaluation of UML software architectures with multiclass queueing network models. In WOSP (2005).
[8] Bolch, G., Greiner, S., de Meer, H., and Trivedi, K. Queueing Networks and Markov Chains: Modeling and Performance Evaluation with Computer Science Applications. Wiley, 2005.
[9] Bortolussi, L., Hillston, J., Latella, D., and Massink, M. Continuous approximation of collective system behaviour: A tutorial. Performance Evaluation 70, 5 (2013), 317–349.
[10] Brosig, F., Huber, N., and Kounev, S. Automated extraction of architecture-level performance models of distributed component-based systems. In ASE (2011).
[11] Brosig, F., Kounev, S., and Krogmann, K. Automated extraction of Palladio component models from running enterprise Java applications. In VALUETOOLS (2009).
[12] Chen, B., Liu, Y., and Le, W. Generating performance distributions via probabilistic symbolic execution. In ICSE (2016).
[13] Chen, T. Q., Rubanova, Y., Bettencourt, J., and Duvenaud, D. K. Neural ordinary differential equations. In Advances in Neural Information Processing Systems (2018), pp. 6571–6583.
[14] Chollet, F., et al. Keras. https://keras.io, 2015.
[15] Cortellessa, V., Marco, A. D., and Inverardi, P. Model-Based Software Performance Analysis. Springer, 2011.
[16] Cremonesi, P., Dhyani, K., and Sansottera, A. Service time estimation with a refinement enhanced hybrid clustering algorithm. In International Conference on Analytical and Stochastic Modeling Techniques and Applications (2010), Springer, pp. 291–305.
[17] Cremonesi, P., and Sansottera, A. Indirect estimation of service demands in the presence of structural changes. Performance Evaluation 73 (2014), 18–40.
[18] Di Marco, A., and Inverardi, P. Compositional generation of software architecture performance QN models. In WICSA (June 2004), pp. 37–46.
[19] Franks, G., Al-Omari, T., Woodside, M., Das, O., and Derisavi, S. Enhanced modeling and solution of layered queueing networks. IEEE Trans. Software Eng. 35, 2 (2009), 148–161.
[20] Garcia, J., Krka, I., Mattmann, C., and Medvidovic, N. Obtaining ground-truth software architectures. In ICSE (2013), pp. 901–910.
[21] Geldenhuys, J., Dwyer, M. B., and Visser, W. Probabilistic symbolic execution. In ISSTA (2012), pp. 166–176.
[22] Gillespie, D. T. Stochastic simulation of chemical kinetics. Annual Review of Physical Chemistry 58, 1 (2007), 35–55.
[23] Graham, S. L., Kessler, P. B., and McKusick, M. K. Gprof: A call graph execution profiler. In Proceedings of the 1982 SIGPLAN Symposium on Compiler Construction (SIGPLAN '82) (1982), pp. 120–126.
[24] Guo, J., Czarnecki, K., Apel, S., Siegmund, N., and Wasowski, A. Variability-aware performance prediction: A statistical learning approach. In ASE (2013).
[25] Hillston, J. A Compositional Approach to Performance Modelling. Cambridge University Press, 1996.
[26] Hrischuk, C., Rolia, J., and Woodside, C. M. Automatic generation of a software performance model using an object-oriented prototype. In MASCOTS (1995).
[27] Incerto, E., Napolitano, A., and Tribastone, M. Moving horizon estimation of service demands in queuing networks. In MASCOTS (2018).
[28] Incerto, E., Tribastone, M., and Trubiani, C. Symbolic performance adaptation. In Proceedings of the 11th International Symposium on Software Engineering for Adaptive and Self-Managing Systems (SEAMS) (2016).
[29] Incerto, E., Tribastone, M., and Trubiani, C. Software performance self-adaptation through efficient model predictive control. In ASE (2017).
[30] Jamshidi, P., Velez, M., Kästner, C., and Siegmund, N. Learning to sample: exploiting similarities across environments to learn performance models for configurable systems. In ESEC/FSE (2018).
[31] Kalbasi, A., Krishnamurthy, D., Rolia, J., and Richter, M. MODE: Mix driven on-line resource demand estimation. In Proceedings of the 7th International Conference on Network and Services Management (2011), International Federation for Information Processing, pp. 1–9.
[32] Kaltenecker, C., Grebhahn, A., Siegmund, N., Guo, J., and Apel, S. Distance-based sampling of software configuration spaces. In Proceedings of the 41st International Conference on Software Engineering (2019), IEEE Press, pp. 1084–1094.
[33] Kingma, D. P., and Ba, J. Adam: A method for stochastic optimization. In ICLR (2015), Y. Bengio and Y. LeCun, Eds.
[34] Kowal, M., Schaefer, I., and Tribastone, M. Family-based performance analysis of variant-rich software systems. In Fundamental Approaches to Software Engineering (FASE) (2014), pp. 94–108.
[35] Kowal, M., Tschaikowski, M., Tribastone, M., and Schaefer, I. Scaling size and parameter spaces in variability-aware software performance models. In ASE (2015), pp. 407–417.
[36] Koziolek, H. Performance evaluation of component-based software systems: A survey. Performance Evaluation 67, 8 (2010), 634–658.
[37] Kurtz, T. G. Solutions of ordinary differential equations as limits of pure jump Markov processes. J. Appl. Prob. 7 (1970), 49–58.
[38] Litoiu, M. Panel: AI and performance. In International Conference on Performance Engineering (ICPE) (2019).
[39] Liu, Z., Wynter, L., Xia, C. H., and Zhang, F. Parameter inference of queueing models for IT systems using end-to-end measurements. Performance Evaluation 63, 1 (2006), 36–60.
[40] Menascé, D. A. Computing missing service demand parameters for performance models. In Int. CMG Conference (2008), pp. 241–248.
[41] Mitchell, T. M. Machine Learning. McGraw-Hill Series in Computer Science. McGraw-Hill, 1997.
[42] Object Management Group. UML Profile for Modeling and Analysis of Real-Time and Embedded Systems (MARTE), Beta 1. OMG, 2007. OMG document number ptc/07-08-04.
[43] Pacifici, G., Segmuller, W., Spreitzer, M., and Tantawi, A. CPU demand for web serving: Measurement analysis and dynamic estimation. Performance Evaluation 65, 6–7 (2008), 531–553.
[44] Pearlmutter, B. A. Learning state space trajectories in recurrent neural networks. Neural Computation 1, 2 (1989), 263–269.
[45] Samek, W., Wiegand, T., and Müller, K. Explainable artificial intelligence: Understanding, visualizing and interpreting deep learning models. CoRR abs/1708.08296 (2017).
[46] Sharma, A. B., Bhagwan, R., Choudhury, M., Golubchik, L., Govindan, R., and Voelker, G. M. Automatic request categorization in internet services. ACM SIGMETRICS Performance Evaluation Review 36, 2 (2008), 16–25.
[47] Siegmund, N., Grebhahn, A., Apel, S., and Kästner, C. Performance-influence models for highly configurable systems. In ESEC/FSE (2015).
[48] Siegmund, N., Kolesnikov, S. S., Kästner, C., Apel, S., Batory, D., Rosenmüller, M., and Saake, G. Predicting performance via automated feature-interaction detection. In ICSE (2012), pp. 167–177.
[49] Spinner, S., Casale, G., Brosig, F., and Kounev, S. Evaluating approaches to resource demand estimation. Performance Evaluation 92 (2015), 51–71.
[50] Stewart, W. J. Performance modelling and Markov chains. In SFM (2007), pp. 1–33.
[51] Sutton, C., and Jordan, M. I. Bayesian inference for queueing networks and modeling of Internet services. The Annals of Applied Statistics (2011), 254–282.
[52] Tarvo, A., and Reiss, S. P. Automated analysis of multithreaded programs for performance modeling. In ASE (2014).
[53] Tilkov, S., and Vinoski, S. Node.js: Using JavaScript to build high-performance network programs. IEEE Internet Computing 14, 6 (2010), 80–83.
[54] Tribastone, M. Relating layered queueing networks and process algebra models. In WOSP/SIPEW (2010).
[55] Tribastone, M. A fluid model for layered queueing networks. IEEE Transactions on Software Engineering 39, 6 (2013), 744–756.
[56] Tribastone, M., Ding, J., Gilmore, S., and Hillston, J. Fluid rewards for a stochastic process algebra. IEEE Trans. Software Eng. 38 (2012), 861–874.
[57] Tribastone, M., Gilmore, S., and Hillston, J. Scalable differential analysis of process algebra models. IEEE Transactions on Software Engineering 38, 1 (2012), 205–219.
[58] Tribastone, M., Mayer, P., and Wirsing, M. Performance prediction of service-oriented systems with layered queueing networks. In Leveraging Applications of Formal Methods, Verification, and Validation (2010), T. Margaria and B. Steffen, Eds., vol. 6416 of Lecture Notes in Computer Science, Springer, pp. 51–65.
[59] Valov, P., Petkovich, J., Guo, J., Fischmeister, S., and Czarnecki, K. Transferring performance prediction models across different hardware platforms. In ICPE (2017).
[60] Wang, W., and Casale, G. Bayesian service demand estimation using Gibbs sampling. In MASCOTS (2013).
[61] Wang, W., Casale, G., Kattepur, A., and Nambiar, M. Maximum likelihood estimation of closed queueing network demands from queue length data. In ICPE (2016).
[62] Woodside, M., Franks, G., and Petriu, D. C. The future of software performance engineering. In Proceedings of the Future of Software Engineering (FOSE) (2007), pp. 171–187.
[63] Zagoruyko, S., and Komodakis, N. Wide residual networks. arXiv preprint arXiv:1605.07146 (2016).
[64] Zaparanuks, D., and Hauswirth, M. Algorithmic profiling. In PLDI (2012), pp. 67–76.
A Appendix
Proof of Theorem 2.1.
We construct $\hat{P}$ and $\hat{\mu}$ as follows:
$$\hat{P}_{k,i} = \begin{cases} \pi_k & \text{if } i = k \\[4pt] P_{k,i}\,\dfrac{1-\pi_k}{1-P_{k,k}} & \text{if } P_{k,k} < 1 \text{ and } i \neq k \\[4pt] \dfrac{1-\pi_k}{M-1} & \text{otherwise} \end{cases} \qquad\qquad \hat{\mu}_k = \begin{cases} \dfrac{1-P_{k,k}}{1-\pi_k}\,\mu_k & \text{if } P_{k,k} < 1 \\[4pt] 0 & \text{otherwise} \end{cases}$$

We prove that, for each $i \neq k$, we have $\hat{P}_{k,i}\hat{\mu}_k = P_{k,i}\mu_k$ and $(\hat{P}_{k,k}-1)\hat{\mu}_k = (P_{k,k}-1)\mu_k$. Then (a) follows by substitution.

We first consider the case $P_{k,k} < 1$:
$$\hat{P}_{k,i}\hat{\mu}_k = P_{k,i}\,\frac{1-\pi_k}{1-P_{k,k}}\cdot\frac{1-P_{k,k}}{1-\pi_k}\,\mu_k = P_{k,i}\mu_k, \qquad (\hat{P}_{k,k}-1)\hat{\mu}_k = (\pi_k-1)\,\frac{1-P_{k,k}}{1-\pi_k}\,\mu_k = (P_{k,k}-1)\mu_k.$$

We now consider the case $P_{k,k} = 1$. We remark that, in this case, $P_{k,i} = 0$ for $i \neq k$, and $\hat{\mu}_k = 0$:
$$\hat{P}_{k,i}\hat{\mu}_k = \frac{1-\pi_k}{M-1}\cdot 0 = 0 = P_{k,i}\mu_k, \qquad (\hat{P}_{k,k}-1)\hat{\mu}_k = (\pi_k-1)\cdot 0 = 0 = (P_{k,k}-1)\mu_k.$$

Point (b) is true by definition of $\hat{P}$. Statement (c) can be shown as follows. When $P_{k,k} < 1$, using $\sum_i P_{k,i} = 1$, i.e., $\sum_{i\neq k} P_{k,i} = 1 - P_{k,k}$:
$$\sum_i \hat{P}_{k,i} = \hat{P}_{k,k} + \sum_{i\neq k}\hat{P}_{k,i} = \pi_k + \sum_{i\neq k} P_{k,i}\,\frac{1-\pi_k}{1-P_{k,k}} = \pi_k + 1 - \pi_k = 1.$$
When $P_{k,k} = 1$:
$$\sum_i \hat{P}_{k,i} = \hat{P}_{k,k} + \sum_{i\neq k}\hat{P}_{k,i} = \pi_k + (M-1)\,\frac{1-\pi_k}{M-1} = \pi_k + 1 - \pi_k = 1.$$

Statement (d) can be shown by observing that $0 \leq \pi_k < 1$, $1 - P_{k,k} \geq 0$ (since $P_{k,k} \leq 1$), and $1 - \pi_k > 0$. Statement (e) can be shown by observing that $\mu_k \geq 0$, $P_{k,k} - 1 \leq 0$, and $\pi_k - 1 < 0$.
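The construction can also be checked numerically. The sketch below instantiates $\hat{P}$ and $\hat{\mu}$ for a random row-stochastic $P$ (our encoding of the formulas above) and verifies properties (a)–(e):

```python
import numpy as np

rng = np.random.default_rng(0)
M = 4
P = rng.random((M, M))
P /= P.sum(axis=1, keepdims=True)        # random row-stochastic routing matrix
mu = rng.random(M) + 0.5                 # positive service rates
pi = rng.random(M) * 0.9                 # 0 <= pi_k < 1

P_hat = np.empty_like(P)
mu_hat = np.empty_like(mu)
for k in range(M):
    if P[k, k] < 1:
        mu_hat[k] = (1 - P[k, k]) / (1 - pi[k]) * mu[k]
        for i in range(M):
            P_hat[k, i] = pi[k] if i == k else P[k, i] * (1 - pi[k]) / (1 - P[k, k])
    else:                                # absorbing diagonal: P[k, k] = 1
        mu_hat[k] = 0.0
        P_hat[k] = (1 - pi[k]) / (M - 1)
        P_hat[k, k] = pi[k]

# (a) effective rates are preserved; (b) diagonal equals pi;
# (c) P_hat is row-stochastic; (d), (e) all entries are nonnegative.
for k in range(M):
    for i in range(M):
        if i != k:
            assert np.isclose(P_hat[k, i] * mu_hat[k], P[k, i] * mu[k])
    assert np.isclose((P_hat[k, k] - 1) * mu_hat[k], (P[k, k] - 1) * mu[k])
assert np.allclose(np.diag(P_hat), pi)
assert np.allclose(P_hat.sum(axis=1), 1.0)
assert (P_hat >= 0).all() and (mu_hat >= 0).all()
```

With a continuous random `P` the branch for $P_{k,k} = 1$ is never exercised; setting a diagonal entry to 1 by hand checks that case as well.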